    Paul Masurel
    @fulmicoton
    You can get all of these programmatically but not in a centralized manner.
    Sean Stangl
    @sean_stangl_twitter
    Would you have an opinion on how one might best implement a serializable, read-only index? For example, suppose there are tens of thousands of small, unrelated indexes. After creation, they are read-only.
    Ideally they would just be represented as files we could mmap, and we'd disable watch functionality and remove locking. (With hand-waving if some metadata needs to change at runtime: maybe that is stored in a real, separate file.)
    I could construct a RAMDirectory and then serialize it to some flatbuffer-like format, which is effectively the same as the HashMap<PathBuf, ReadOnlySource>, but works with the data in serialized format.
    Are there any obvious pitfalls I'm missing if I try that?
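The idea in the question above (many small files packed into one read-only, serialized blob) can be sketched without any tantivy types. This is a minimal illustration only: `PackedDirectory`, `pack`, and `open_read` are hypothetical names, plain `Vec<u8>` stands in for `ReadOnlySource`, and nothing here implements tantivy's `Directory` trait. In a real setup the `data` buffer would presumably come from an mmap'd file rather than a `Vec`.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

/// Hypothetical packed, read-only directory: one contiguous buffer plus a
/// table mapping each file path to its byte range inside that buffer.
struct PackedDirectory {
    data: Vec<u8>,
    table: BTreeMap<PathBuf, (usize, usize)>, // path -> (offset, length)
}

impl PackedDirectory {
    /// Concatenate all files into one buffer and record where each one lives.
    fn pack(files: &BTreeMap<PathBuf, Vec<u8>>) -> Self {
        let mut data = Vec::new();
        let mut table = BTreeMap::new();
        for (path, bytes) in files {
            table.insert(path.clone(), (data.len(), bytes.len()));
            data.extend_from_slice(bytes);
        }
        PackedDirectory { data, table }
    }

    /// Borrow a file's bytes without copying; `None` if the path is unknown.
    fn open_read(&self, path: &Path) -> Option<&[u8]> {
        let &(offset, len) = self.table.get(path)?;
        Some(&self.data[offset..offset + len])
    }
}
```

Because reads only hand out borrowed slices, no locking or file watching is needed in this sketch; any metadata that must change at runtime would have to live elsewhere, as suggested in the message.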
    matrixbot
    @matrixbot
    bbigras Say I want to index some orders and their items. Is there a way to search for an item while specifying a filter on the parent order, like "order.client: something"? Would I need something like nested documents support?
    Sean Stangl
    @sean_stangl_twitter
    I think I need downcast_ref() support for ManagedDirectory.
    Njagi Mwaniki
    @urbanslug
    @fulmicoton WOAH! Congrats!!!
    Paul Masurel
    @fulmicoton
    @urbanslug thank you!
    jonahcwest
    @jonahcwest

    Hi all! I've come across Tantivy and it looks very promising. I have a small question, however: is it possible to combine multiple queries together? In our application, tokenization and weighting are done on the client, so we'd like to be able to pass a list of tokens (and possibly an edit distance) and search for those without any processing done by the search library. Forgive me if I have overlooked this, but there doesn't seem to be an obvious way to use multiple queries (such as searching for multiple terms in a FuzzyTermQuery) without using a QueryParser.

    Thank you!

    Pasha Podolsky
    @ppodolsky
    Not sure I understood you correctly, but you can programmatically construct an object implementing Query, something like https://github.com/tantivy-search/tantivy/blob/main/src/query/boolean_query/mod.rs#L284
    Paul Masurel
    @fulmicoton
    @jonahcwest yes you can. Check out the BooleanQuery...
    You would have to build a bunch of TermQuery and FuzzyTermQuery objects, and then combine them in a union or an intersection depending on your need.
    The doc is not really great; impl From<Vec<(Occur, Box<dyn Query>)>> for BooleanQuery is the one you want to use.
    If you want a union, for instance, you build a BooleanQuery::from(vec![(Occur::Should, term_a), (Occur::Should, term_b), ...])
    If you want an intersection, same thing but with Occur::Must.
    I really need to add helpers...
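To make the union/intersection distinction above concrete, here is a toy model of the matching semantics using plain sets of document ids. It is not tantivy code: `Occur` is a local stand-in with only the two variants mentioned, and `DocSet` and `combine` are hypothetical names. It assumes that when any `Must` clause is present, `Should` clauses do not restrict the match set, and it ignores scoring entirely.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the two `Occur` variants discussed above.
#[derive(Clone, Copy, PartialEq)]
enum Occur {
    Should,
    Must,
}

/// Hypothetical representation of "the documents a sub-query matches".
type DocSet = BTreeSet<u32>;

/// With at least one Must clause: intersect all Must clauses.
/// With only Should clauses: take the union of all clauses.
fn combine(clauses: &[(Occur, DocSet)]) -> DocSet {
    let musts: Vec<&DocSet> = clauses
        .iter()
        .filter(|(occur, _)| *occur == Occur::Must)
        .map(|(_, docs)| docs)
        .collect();
    if let Some((first, rest)) = musts.split_first() {
        let mut acc: DocSet = (*first).clone();
        for &docs in rest {
            acc = acc.intersection(docs).copied().collect();
        }
        acc
    } else {
        clauses
            .iter()
            .flat_map(|(_, docs)| docs.iter().copied())
            .collect()
    }
}
```

For example, combining {1, 2, 3} and {2, 3, 4} with `Occur::Should` on both yields {1, 2, 3, 4}, while `Occur::Must` on both yields {2, 3}.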
    jonahcwest
    @jonahcwest
    @fulmicoton Thank you! That's exactly what I was looking for
    Paul Masurel
    @fulmicoton
    @jonahcwest I'm glad it helped
    jonahcwest
    @jonahcwest
    Do you provide a way to search bytes that may not be valid UTF-8? i.e. a ‘Vec<u8>’ instead of a ‘str’
    Paul Masurel
    @fulmicoton
    It is available in master. You will need your own query parser, however.
    jonahcwest
    @jonahcwest
    I see. How would you create a Term since Term::from_bytes is private?
    Paul Masurel
    @fulmicoton
    Nice catch. So the one you want to use is Term::from_term_bytes(...) and it was indeed pub(crate). I just pushed a commit to make it public
    jonahcwest
    @jonahcwest
    That's great, thanks! If you don't mind me bugging you once again, it doesn't seem possible to index a bytes field. How would you get around that?
    Paul Masurel
    @fulmicoton
    Again, this is only available in master. Are you working on master?
    This feature is not released yet
    jonahcwest
    @jonahcwest
    Yes, but I’m referring to creating a schema with add_bytes_field
    There is no option to index a bytes field
    jonahcwest
    @jonahcwest
    My bad. I only saw the commit you made yesterday that fixed Term and not the earlier ones for indexing byte fields. Thank you anyways!
    Paul Masurel
    @fulmicoton
    no problem
    matrixbot
    @matrixbot
    bbigras Paul Masurel (Gitter): thanks for the reply last week.
    bbigras I have another question. For one of my "problems", a solution was to use RegexQuery. It seems that I have to build the query myself, but can I still allow my users to search for something like "client:Bob.* and color:green"? I mean that if I build the query myself, I have no idea whether I can handle logic like "and".
    Paul Masurel
    @fulmicoton
    Yes. Check my reply to @jonahcwest above
    You need to build a BooleanQuery.
    matrixbot
    @matrixbot
    bbigras I think I understand: you are saying to use BooleanQuery if I want to build a query myself. But can I get a BooleanQuery from a string that my users will produce in my UI?
    Paul Masurel
    @fulmicoton
    You need to implement your own query parser. Tantivy's query parser does not have a syntax for regex queries
    matrixbot
    @matrixbot
    bbigras gotcha. thanks
    Stephen Becker IV
    @sbeckeriv
    Hello again. Is there a way to debug why a document matched a query?
    Paul Masurel
    @fulmicoton
    Have you looked at the explain output?
    henghanan
    @henghanan
    hello, can somebody tell me why tantivy is faster than lucene?
    lyj
    @lengyijun
    Written in Rust
    Paul Masurel
    @fulmicoton
    It is not due to one single reason. It is more the sum of several not very well identified performance gains.
    One of them might simply be the machine code generated by the Rust compiler.
    The usage of explicit SIMD instructions is another obvious one.
    The care given to what should be a static dispatch and what can be a dynamic dispatch is another.
    On the indexer side, the data structure is significantly different.
    For counting matches on unions, the algorithm is better on tantivy's side.
    Finally, there are a couple of differences in phrase query handling; I don't know if that makes a difference, to be honest.
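For readers unfamiliar with the static versus dynamic dispatch point above, here is a small generic illustration in Rust. It is not taken from tantivy: `DocScorer`, `ConstScorer`, `sum_static`, and `sum_dynamic` are hypothetical names. The generic version is compiled once per concrete type, so the call in the hot loop can be inlined; the `dyn` version is compiled once and calls through a vtable.

```rust
/// Hypothetical trait standing in for "something called once per document".
trait DocScorer {
    fn score(&self, doc: u32) -> f32;
}

/// Trivial implementation that returns the same score for every document.
struct ConstScorer(f32);

impl DocScorer for ConstScorer {
    fn score(&self, _doc: u32) -> f32 {
        self.0
    }
}

/// Static dispatch: monomorphized for each concrete `S`,
/// so `score` can be inlined into the loop.
fn sum_static<S: DocScorer>(scorer: &S, docs: &[u32]) -> f32 {
    docs.iter().map(|&doc| scorer.score(doc)).sum()
}

/// Dynamic dispatch: a single compiled copy; every `score` call
/// is an indirect call through the trait object's vtable.
fn sum_dynamic(scorer: &dyn DocScorer, docs: &[u32]) -> f32 {
    docs.iter().map(|&doc| scorer.score(doc)).sum()
}
```

Both functions return the same result; the difference is only in how the call is compiled, which is why the choice matters mostly in per-document inner loops and much less in code that runs once per query.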
    Paul Masurel
    @fulmicoton
    I don't like the simple "it is because of Rust" answer, because direct ports of Lucene are typically slower than Lucene.
    Rucene is slower, CLucene was slower...