    François Massot
    @fmassot
    My 2 cents @shikhar: prepare_commit() is a bit risky from a user's point of view because you need to be really careful about what happens between prepare_commit() and prepared_commit.commit(). So why not call commit immediately instead of encouraging the user to use this API?
    But the problem now is that the payload is no longer accessible. I think we could fix that. Let's wait a couple of hours and see what @fulmicoton has in mind.
    we could add a function commit_with_payload(payload) that would not break the current API.
    Shikhar Bhushan
    @shikhar
    something like that sgtm! :)
    happy to work on a PR for that too
    Paul Masurel
    @fulmicoton
    Setting the payload is the one valid use case for prepare_commit().
    I will make it public or introduce François's method.
    Joep Meindertsma
    @joepio

    Hi! I've just used Tantivy in my project atomic-server, a rust graph database with a dynamic schema.

    I'm toying a bit with FuzzyTermQuery, but all the results have an equal score of 1 - seems like these should differ, right? Am I doing something wrong?

    Paul Masurel
    @fulmicoton
    FuzzyTermQuery does not expose a score.
    François Massot
    @fmassot
    Actually, I wonder how I would compute a good score for this type of query. @joepio, do you have something in mind?
    restioson
    @restioson:breadpirates.chat
    The intuitive expectation would be some Levenshtein distance.
    Not sure how it'd be handled in the prefix-match case.
    François Massot
    @fmassot
    But looking at the Levenshtein distance is not sufficient. A fuzzy query with distance 1 may match, let's say, 10 terms, and with those we could build a score the way we would for a multi-term query.
    I had a brief look at what Lucene is doing: the scoring method is called TopTermsScoringBooleanQueryRewrite,
    so I think they are just using the term with the best score. Otherwise you can end up with a huge number of clauses in your bool query.
    Joep Meindertsma
    @joepio
    @fmassot search is not my area of expertise, but I can try to drop some thoughts! Maybe we can use term TF/IDF to score fuzzy items. If a term hits once in a long text, it should get a lower score than if it hits once in a very short text. And maybe check the hits of the terms, and try a less fuzzy match to see if that hits (as in: see how many characters are wrong; the fewer errors, the higher the score). Is it better if I open an issue about this?
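The "fewer errors, higher score" idea above can be sketched without any search library. This is only an illustration, not how tantivy or Lucene score fuzzy matches: `levenshtein` is a plain dynamic-programming edit distance, and `fuzzy_boost` is a hypothetical helper that turns that distance into a factor between 0 and 1. Such a factor could then be multiplied with a per-term TF/IDF score, which is an assumption about how the two ideas would combine.

```rust
/// Levenshtein edit distance between two strings, counted in Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first (i - 1) chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // curr[0] = i: turning i chars into the empty string takes i deletions.
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical boost (not a tantivy API): 1.0 for an exact match,
/// shrinking as the matched term gets further from the query term.
fn fuzzy_boost(query_term: &str, matched_term: &str) -> f32 {
    1.0 / (1.0 + levenshtein(query_term, matched_term) as f32)
}
```

With this sketch, `fuzzy_boost("tantivy", "tantify")` is lower than the boost for an exact match, so documents matching the exact term would rank first when all else is equal.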
    Joep Meindertsma
    @joepio
    I'm assuming this PR is already fixing this, right? quickwit-inc/tantivy#998
    carmel
    @carmel
    Paul Masurel
    @fulmicoton
    Your problem is merge threads
    Tantivy does not have any back pressure mechanism, so if you misuse it, it leads to that catastrophe :)
    Misusing it means:
    a) never letting it finish its merge operations. For a server, that means creating a new index writer instance on each request and dropping it right away.
    In a CLI, that would be creating the index writer and quitting without calling wait_for_merge.
    b) ingesting documents fast, but one at a time.
    If you have a lot of documents, it is crucial for them to get batch inserted.
    In a server, that means you need extra logic to queue docs and index that queue in small batches, either when it reaches a given number of docs or when it times out.
    Paul Masurel
    @fulmicoton
    It can be a little hard to code if you want to return Ok to your user after the actual commit...
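The "queue docs and flush on size or timeout" logic described above could look roughly like this. It is a sketch using only the standard library, not code from tantivy: `next_batch` is a hypothetical helper, and the channel stands in for whatever queue the server uses. A single long-lived worker thread owning the index writer would call it in a loop, add every document of the returned batch, then commit once per batch.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical helper: collects up to `max_docs` items from the queue,
/// waiting at most `max_wait` after the first item arrives.
/// Returns `None` once every sender is dropped and the queue is empty.
fn next_batch<T>(rx: &Receiver<T>, max_docs: usize, max_wait: Duration) -> Option<Vec<T>> {
    // Block until the first document arrives; an error means all senders are gone.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_docs {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(doc) => batch.push(doc),
            // Flush what we have on timeout or when the senders disappear.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

To answer the user only after the actual commit, each queued item could carry a reply channel that the worker signals once the batch is committed; that part is left out here.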
    carmel
    @carmel
    Oh, thanks. This search service is written for my own use.
    Before, I was looping, creating a new writer instance each time and committing immediately after writing, but that is less efficient and slower. Should I use multiple threads to do that?
    Paul Masurel
    @fulmicoton
    I am not sure I understand.
    But you want to keep a long-lived index writer instance.
    If you do not commit too often, you can commit after each added document.
    But if you want to handle a large number of ingested docs, you should batch them one way or another.
    carmel
    @carmel
    Ok
    Joep Meindertsma
    @joepio
    I was wondering whether it's possible to change the log level for tantivy. Is there an option for this? It seems like env_logger is used, which I'm also using, but for tantivy I only want warn or higher.
    madmaxio
    @madmaxio
    The standard log/env_logger mechanics don't work?
    François Massot
    @fmassot
    @joepio did you try something like RUST_LOG=info,tantivy=warn ? That's what I use for quickwit for example.
    Joep Meindertsma
    @joepio

    And another question (sorry, I hope I'm not spamming this channel too much...)

    I'm using tantivy in a Notion-like app, and I'd like to index new user input as quickly as possible. However, I don't want to make POST operations slow by depending on the commit action, which can take a couple of seconds. So I run this async in an actix Actor, which makes it a bit better. However, if many users are posting a lot of changes, the Actor commits multiple times per second, which is probably far too much. So I've added a timer, which prevents a commit from happening too often, but I think I should use a throttle instead of this mechanism...

    Anyway, I was wondering what others used for this!
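One possible shape for such a throttle, sketched with the standard library only. `CommitThrottle` is a hypothetical type, not part of tantivy or actix: writes mark the index as dirty, and a periodic tick asks whether enough time has passed since the last commit. The current time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: allows at most one commit per `min_interval`,
/// and only when something has changed since the last commit.
struct CommitThrottle {
    min_interval: Duration,
    last_commit: Option<Instant>,
    dirty: bool,
}

impl CommitThrottle {
    fn new(min_interval: Duration) -> Self {
        CommitThrottle { min_interval, last_commit: None, dirty: false }
    }

    /// Call after every add or delete that has not been committed yet.
    fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns true when the caller should commit now, and records that commit.
    fn try_commit(&mut self, now: Instant) -> bool {
        let interval_elapsed = match self.last_commit {
            Some(last) => now.saturating_duration_since(last) >= self.min_interval,
            None => true,
        };
        if self.dirty && interval_elapsed {
            self.dirty = false;
            self.last_commit = Some(now);
            true
        } else {
            false
        }
    }
}
```

The actor would call `mark_dirty()` on each write and `try_commit(Instant::now())` on a regular tick, calling the index writer's `commit()` only when it returns true. Pending changes are never lost: they stay marked dirty until the next tick that is allowed to commit.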

    Paul Masurel
    @fulmicoton
    Tantivy will do the merge automatically by default.
    Just make sure you keep an index writer instance open and alive.
    restioson
    @restioson:breadpirates.chat
    I think this is relevant quickwit-inc/tantivy#494
    Paul Masurel
    @fulmicoton
    @all We are moving to discord! https://discord.gg/jhR4PC38 . Sorry for the trouble. But it makes it somewhat easier to group the tantivy/tantivy-cli/quickwit project discussions... And it also has a cozy random channel :)
    madmaxio
    @madmaxio
    The invite link doesn't work?
    François Massot
    @fmassot
    @madmaxio working link: https://discord.gg/MT27AG5EVE
    madmaxio
    @madmaxio
    Thx
    Stephen Eckels
    @stevemk14ebr
    Hello, is there anywhere I can read about incremental indexing?
    I have a series of documents that will undergo edits, and I'd like to have the tantivy search index stay in sync as these are modified. Am I correct that solving this is the purpose of incremental indexing?
    Stephen Eckels
    @stevemk14ebr
    To be clear, my schema is constant, but the document values and how many of them are present can vary. Between updates, more or fewer values can exist.
    François Massot
    @fmassot
    Hi @stevemk14ebr we are on discord now!