    matrixbot
    @matrixbot
    bbigras I have another question. For one of my "problems" the solution was to use RegexQuery. It seems I have to build the query myself, but can I still let my users search for something like "client:Bob.* and color:green"? I mean, if I build the query myself, I don't know whether I can handle the logic like "and".
    Paul Masurel
    @fulmicoton
    Yes. Check my reply to @jonahcwest above
    You need to build a booleanquery.
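For what that can look like, here is a hedged sketch (assuming a recent tantivy and a schema with text fields named "client" and "color", which are purely illustrative): the regex part becomes a RegexQuery, the exact term a TermQuery, and the "and" maps to Occur::Must inside a BooleanQuery.

```rust
use tantivy::query::{BooleanQuery, Occur, Query, RegexQuery, TermQuery};
use tantivy::schema::{IndexRecordOption, Schema};
use tantivy::Term;

// Sketch only: build a query equivalent to "client:Bob.* and color:green".
fn build_query(schema: &Schema) -> tantivy::Result<BooleanQuery> {
    let client = schema.get_field("client")?;
    let color = schema.get_field("color")?;

    // "client:Bob.*" becomes a RegexQuery on the `client` field...
    let client_query = RegexQuery::from_pattern("Bob.*", client)?;
    // ...and "color:green" a plain TermQuery on the `color` field.
    let color_query = TermQuery::new(
        Term::from_field_text(color, "green"),
        IndexRecordOption::Basic,
    );

    // The "and" maps to Occur::Must on both subqueries.
    Ok(BooleanQuery::new(vec![
        (Occur::Must, Box::new(client_query) as Box<dyn Query>),
        (Occur::Must, Box::new(color_query)),
    ]))
}
```

The resulting BooleanQuery can then be passed to a searcher like any other query.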
    matrixbot
    @matrixbot
    bbigras I think I understand that you're saying to use a BooleanQuery if I build the query myself, but can I get a BooleanQuery from a string that my users produce in my UI?
    Paul Masurel
    @fulmicoton
    You need to implement your own query parser. Tantivy's query parser does not have a syntax for regex queries
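As a sketch of what such a hand-rolled parser could start from, here is a tiny splitter for strings like "client:Bob.* and color:green"; the "and"-only grammar and the field:pattern clause shape are assumptions for illustration. Each (field, pattern) pair could then be turned into a RegexQuery or TermQuery and combined with Occur::Must in a BooleanQuery, as above.

```rust
// Split a UI query string into (field, pattern) clauses.
// Grammar assumed here: clauses of the form "field:pattern",
// joined by a literal " and ". Malformed clauses are skipped.
fn parse_clauses(input: &str) -> Vec<(String, String)> {
    input
        .split(" and ")
        .filter_map(|clause| {
            // Each clause is "field:pattern"; split on the first ':'.
            let (field, pattern) = clause.trim().split_once(':')?;
            Some((field.to_string(), pattern.to_string()))
        })
        .collect()
}
```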
    matrixbot
    @matrixbot
    bbigras gotcha. thanks
    Stephen Becker IV
    @sbeckeriv
    Hello again. Is there a way to debug why a document matched a query?
    Paul Masurel
    @fulmicoton
    Have you looked at the explain output?
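For reference, a sketch of what looking at the explain output can mean in tantivy (assuming the Query::explain method and an Explanation::to_pretty_json helper exist in the version used, and that the query, searcher, and doc address come from elsewhere):

```rust
use tantivy::query::Query;
use tantivy::{DocAddress, Searcher};

// Sketch: print the scoring explanation for why `doc` matched `query`.
fn debug_match(query: &dyn Query, searcher: &Searcher, doc: DocAddress) -> tantivy::Result<()> {
    let explanation = query.explain(searcher, doc)?;
    println!("{}", explanation.to_pretty_json());
    Ok(())
}
```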
    henghanan
    @henghanan
    hello, can somebody tell me why tantivy is faster than lucene?
    lyj
    @lengyijun
    Written in Rust
    Paul Masurel
    @fulmicoton
    It is not due to one single reason. It is more the sum of several not very well identified performance gains.
    One of them might be pure Rust byte code generation.
    The use of explicit SIMD instructions is another obvious one.
    The care given to what should be static dispatch and what can be dynamic dispatch is another.
    On the indexer side, the data structure is noticeably different.
    For counts on unions, the algorithm is better on tantivy's side.
    Finally, there are a couple of differences in phrase query handling; I don't know if that makes a difference, to be honest.
    Paul Masurel
    @fulmicoton
    I don't like the simple "it is because of Rust" answer, because direct ports of Lucene are typically slower than Lucene.
    Rucene is slower, CLucene was slower...
    Stephen Becker IV
    @sbeckeriv
    Hello, I recently read the regex matching discussion above. I see the work in the PR https://github.com/tantivy-search/tantivy/pull/918/files#diff-44834880126ba22476c7e7ef833aab1ec4767a200661577e4dc84d1a579a4fb8R236 and I see the RegexQuery https://github.com/tantivy-search/tantivy/blob/5f574348d184559caa024912306bc54fac3b1086/src/query/regex_query.rs#L144 object... I think I understand how the two should work together. With this work, will some of the query parser functions become public? If I understand the query parser right, I think I want to change this line in convert literal to query https://github.com/tantivy-search/tantivy/blob/main/src/query/query_parser/query_parser.rs#L533 to check whether the literal looks like a regex and use RegexQuery::from_pattern? Or am I very wrong about how this should work?
    Paul Masurel
    @fulmicoton

    With this work will some of the query parser functions become public? If i understand the query parser right I think i want to change this line in convert literal to query https://github.com/tantivy-search/tantivy/blob/main/src/query/query_parser/query_parser.rs#L533 to check to see if it looks like it could be a regex and use RegexQuery::from_pattern? Or am i very wrong about how this should work?

    There is no plan to put regex into the query parser, I am afraid. You need to implement your own query parser.

    Lucene had a wildcard operator by default in its query parser for quite a few versions. It was really terrible, because any website using Lucene would have this hidden feature with a horrible computational cost.
    It would make sense to make it an option when building the query parser though, and disable the regex by default.
    That's quite a bit of work however.
    Stephen Becker IV
    @sbeckeriv
    I understand. Making it configurable would be nice and would avoid duplicating a lot of code you have already tested so well. I attempted to add a bool to QueryParser and now see that convert_to_query does not use the QueryParser at all..
    Stephen Becker IV
    @sbeckeriv
    Hello again. Can I explain a regex query? It does not appear to work like a boolean query. I can confirm it works, but the results are not so clear sometimes. How does scoring work for regex?
    Paul Masurel
    @fulmicoton
    If I recall correctly, the score is constant
    lyj
    @lengyijun

    I ran into an error:

    thread '<unnamed>' panicked at 'Field norm not found for field "id". Was it market as indexed during indexing.', /home/mpc/.cargo/git/checkouts/tantivy-9e77a871f83bfdf7/3aff18c/src/core/segment_reader.rs:138:13

    I spent a few hours and found the cause: the id field is not STORED.
    I feel this error message should be improved. It should tell me that the cause may be that the field was not STORED.

    Paul Masurel
    @fulmicoton
    Can you give more context on which call it happened?
    And the version of Tantivy you use?
    lyj
    @lengyijun
    Sorry, I made a mistake.
    Stephen Eckels
    @stevemk14ebr
    @fulmicoton are you online? This is in relation to the boolean query stuff
    There are 2049 segments. I fetch the whole document, but I really only need a single field 'sha256' from the document. My documents are pretty simple: each is a STRING sha256 which represents the hash of the whole document, then for each feature of the document (~4000) there is a STRING feature_value and a u64 FAST | INDEXED farmhash of the feature_value
    Paul Masurel
    @fulmicoton
    yes
    so your problem is the 2049 segments
    I fetch the whole document but i really only need a single field 'sha256' from the document.
    You should fetch it from the fast field reader
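A sketch of what fetching from the fast field reader can look like, assuming a recent tantivy with columnar fast fields and using the u64 FAST farmhash field mentioned above as the example (exact APIs vary across tantivy versions):

```rust
use tantivy::{DocAddress, Searcher};

// Sketch: read one value from a u64 fast field instead of fetching
// the whole stored document. "farmhash" is the FAST field from the
// schema described above.
fn read_fast_u64(searcher: &Searcher, doc: DocAddress) -> tantivy::Result<Option<u64>> {
    // Fast fields are per-segment, so go through the segment reader.
    let segment_reader = searcher.segment_reader(doc.segment_ord);
    let column = segment_reader.fast_fields().u64("farmhash")?;
    // First value for this doc (a doc can hold several values).
    Ok(column.first(doc.doc_id))
}
```

This avoids the cost of decompressing the document store for every hit.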
    Stephen Eckels
    @stevemk14ebr
    how may i control the number of segments?
    Paul Masurel
    @fulmicoton
    you index from a CLI?
    a) commit less often
    b) call index_writer.wait_for_merge_threads() at the end of your program
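Both suggestions together can be sketched like this (assuming a recent tantivy, where the method is spelled wait_merging_threads and add_document returns a Result; the memory budget is just an illustrative value):

```rust
use tantivy::{Index, IndexWriter, TantivyDocument};

// Sketch: index everything with a single commit at the end, then let
// the background merge threads finish so the index ends up with few
// segments before the process exits.
fn index_all(index: &Index, docs: Vec<TantivyDocument>) -> tantivy::Result<()> {
    let mut writer: IndexWriter = index.writer(1_000_000_000)?; // ~1 GB memory budget
    for d in docs {
        writer.add_document(d)?;
    }
    writer.commit()?; // one commit at the end instead of every 500 docs
    writer.wait_merging_threads()?; // block until background merges are done
    Ok(())
}
```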
    Stephen Eckels
    @stevemk14ebr
    I call .commit() every 500 documents I insert
    so if I just did this less often, I'd be all good?
    Paul Masurel
    @fulmicoton
    yes
    for 60000 docs, you can probably commit only once at the end
    Stephen Eckels
    @stevemk14ebr
    what's a rough interval I should commit at? I will have a few million documents
    Paul Masurel
    @fulmicoton
    if you don't care much about indexing speed
    only once at the end is always the best thing to do if you can
    Stephen Eckels
    @stevemk14ebr
    yea that's easy, thanks!
    Paul Masurel
    @fulmicoton
    if you do that, tantivy will be smart enough to stay within your memory budget
    regardless of the number of docs you push
    if you don't care about indexing speed you can also force merge all of the segments at the end...
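A sketch of that force merge, assuming recent tantivy APIs (Index::searchable_segment_ids and a blocking wait() on the merge future):

```rust
use tantivy::{Index, IndexWriter};

// Sketch: merge all currently searchable segments into one at the end
// of indexing. Slower to index overall, but queries then touch a
// single segment.
fn force_merge(index: &Index, writer: &mut IndexWriter) -> tantivy::Result<()> {
    let segment_ids = index.searchable_segment_ids()?;
    if segment_ids.len() > 1 {
        // `merge` runs asynchronously; `wait` blocks until it finishes.
        writer.merge(&segment_ids).wait()?;
    }
    Ok(())
}
```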