    Marty Schoch
    @mschoch
    bleve really has one kind of field right now, all of which should just have 5 or 6 different constructors for the different start points (data types)
    scorch is production ready, and with the 2.0 release coming soon, it will be the default (upsidedown and all k/v store indexes will be deprecated)
    going back to fields, i recommend you use numeric field as your starting point for an IP field, it likely won't be the same, but will likely do similar things
    starting with the constructors at the end of the file, there are several functions NewNumericField...
    at their core, they take a float64 and return a *NumericField
    you will probably need ones that take IPv4 and IPv6 addrs
    Marty Schoch
    @mschoch
    at this stage all these functions do one important conversion
    they convert that incoming data, to a binary representation, which you see stored in the "value" field of the struct
    this should be some sort of lossless encoding of the data, and it is what will be "stored" if the field is stored, allowing recovery of the original value after searching
    now, the Analyze method takes that value, and possibly uses it to create a set of multiple values to be stored in the index
    this is highly data-type dependent
    for example, the numeric range data type does bit shifts, shifting off less significant bits, and indexes those values as well (this allows for a particular technique to more efficiently perform numeric range searches)
    at this stage, you have to be able to answer the following
    what types of queries do i need to be able to perform on IP addresses?
    exact match? arbitrary range? CIDR mask? something else?
    and then, what are the right values to put into the index to facilitate answering those queries?
    anyway, that's enough for now, let me know if you have more questions
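The bit-shift idea mentioned above can be sketched roughly as follows. This is only an illustration of the technique, not bleve's actual encoding: the `term` type, the `shiftedTerms` helper, and the step size are all hypothetical names chosen for the example. Each value is emitted at full precision and then again at progressively coarser precision, so a range query can match a few coarse terms instead of many exact ones.

```go
package main

import "fmt"

// term is a hypothetical stand-in for one indexed value: the original
// number with `Shift` low-order bits removed.
type term struct {
	Shift uint
	Value uint64
}

// shiftedTerms returns v at full precision plus progressively coarser
// versions, dropping `step` more low-order bits each time.
// Hypothetical helper, not part of bleve's API.
func shiftedTerms(v uint64, step uint) []term {
	if step == 0 {
		return nil // a zero step would never terminate
	}
	var out []term
	for shift := uint(0); shift < 64; shift += step {
		out = append(out, term{Shift: shift, Value: v >> shift})
	}
	return out
}

func main() {
	for _, t := range shiftedTerms(0x12345678, 8) {
		fmt.Printf("shift=%2d value=%#x\n", t.Shift, t.Value)
	}
}
```

Two numbers that are close together share their coarser terms, which is what lets a range search cover a wide span with few lookups.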
    Amnon
    @amnonbc
    The only two queries we need are exact match and CIDR mask.
    Amnon
    @amnonbc

    Thanks again for all the pointers. They will help me make sense of what the code is doing.
    I think I will start with storing the IP addresses as the 16 byte ipv6 address and do an inefficient
    full table scan for cidr matches.
    On the other hand, cidr matches all have the same prefix, so if I can search for the lowest member and scan forward from there, I should be able to get efficient CIDR matches.

    What happens if we want to be able to store multiple IPs in a field?
    Or for that matter, multiple keywords in a text field?
    Is this something that bleve handles naturally? Or should I generate multiple documents - if I have 5 IPs should I generate 5 copies of my document, one for each IP?

    Marty Schoch
    @mschoch
    I think starting with IPv4 and getting CIDR matches to work is a good starting point. And you'll better understand all the issues involved to try and extend it to ipv6
    so bleve already supports multi-valued fields, for example if a document has an array of IP addresses in one field
    but if you have multiple IP addrs with specific meanings like src and dest, then those should be separate fields with meaningful names
    Panagiotis Koursaris
    @panakour
    Hi. I would like to create a filter that works by creating multiple forms of a single greek token. i.e. from the single token καλημερα you can create greeklish forms like kalimera, kalhmera, kalimeres (while, of course, also keeping the original) eg. Replace each greek character with the corresponding latin character.
    Amnon
    @amnonbc
    If I have a conjunction query of two terms, and the first term returns 10 elements, and the second returns 1,000,000
    will I be better off running only the first search, and filtering the results bleve returns outside of bleve for the second term?
    Or does bleve do this optimisation itself?
    Does the order of the terms in a conjunction query matter?
    And if I have a conjunction query, some of whose terms are themselves conjunctions, would I get better performance if I flattened them?
    Amnon
    @amnonbc

    https://github.com/blevesearch/bleve/pull/1536/

    for new IP field and CIDR search.

    Marty Schoch
    @mschoch
    @amnonbc conjunction is pretty efficient, the underlying bitsets for each term are intersected and only 10 elements would be actually visited to load freq/norm/location info
    with recent versions, as long as they're all conjunctions you shouldn't have to flatten them, the optimizations will propagate
    there were some bugs in that area fixed in the last 6 months though
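A minimal sketch of why the intersection keeps the work small, using a plain `[]uint64` bitset as a stand-in for whatever bitmap structure the index really uses (the `bitset` type and its helpers are hypothetical): the two posting sets are ANDed word by word, and only the surviving document numbers are ever visited.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitset marks document numbers as set bits. Illustrative only.
type bitset []uint64

// fromDocs builds a bitset sized for n documents.
// It assumes every doc number satisfies 0 <= d < n.
func fromDocs(n int, docs ...int) bitset {
	b := make(bitset, (n+63)/64)
	for _, d := range docs {
		b[d/64] |= 1 << uint(d%64)
	}
	return b
}

// intersect ANDs two bitsets word by word.
func intersect(a, b bitset) bitset {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make(bitset, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

// docs lists the set document numbers in ascending order.
func (b bitset) docs() []int {
	var out []int
	for i, w := range b {
		for w != 0 {
			out = append(out, i*64+bits.TrailingZeros64(w))
			w &= w - 1 // clear the lowest set bit
		}
	}
	return out
}

func main() {
	const n = 1 << 20
	var evens []int
	for d := 0; d < n; d += 2 {
		evens = append(evens, d)
	}
	small := fromDocs(n, 5, 70, 4096)
	large := fromDocs(n, evens...)
	// Only documents present in both sets survive and need loading.
	fmt.Println(intersect(small, large).docs())
}
```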
    @panakour ok that sounds doable, do you have any specific questions we can help with?
    Panagiotis Koursaris
    @panakour

    @mschoch As I understood from the documentation this is doable using token filters https://blevesearch.com/docs/Token-Filters/. Is there any ready-made token filter type that will let me do it?
    Also, what is the best way to do it? Using AddCustomTokenFilter? Or is there a way to do it as a plugin, so anyone else wanting the same can reuse it in other projects?

    There is a plugin for elasticsearch that does the same thing I want https://github.com/skroutz/elasticsearch-analysis-greeklish https://github.com/skroutz/elasticsearch-analysis-greeklish/blob/7.7.0/src/main/java/org/elasticsearch/index/analysis/GreeklishGenerator.java

    Panagiotis Koursaris
    @panakour
    @mschoch Also I just found that you have been refactoring the whole library under the new name "Bluge". So I think your suggestions about the above question should consider using "Bluge" instead.
    Hendrik Grobler
    @hsjgrobler_gitlab
    Hey guys, quick question about best practice: At an API level, should I keep one index open and re-use that across all requests or open and close the index per request?
    Amnon
    @amnonbc

    I have closed the IP Range search PR as it had a long and circuitous history comprising dozens of commits, which looked like an intimidating prospect for any reviewer. I have squashed the history into a new PR blevesearch/bleve#1546 which comprises fewer than 600 lines of code, half of which are tests.
    In the end the implementation of IP range searching is quite trivial and maps very nicely to bleve's indexReader.FieldDictRange feature.

    I have added some end-to-end tests which demonstrate indexing and searching IP addresses. I put the tests in test/ip_field_test.go as I was not sure what the best location for this kind of test was. @mschoch let me know what you think.

    Marty Schoch
    @mschoch
    @panakour yes what you want to do is build a token filter, which takes greek tokens in and outputs the greeklish ones, seems pretty straightforward to port the one you linked to from java to go
    as for sharing it with others, it's as simple as hosting it on github under an appropriate license, others can use it, so long as it conforms to the token filter interface
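A rough sketch of what such a filter could look like, under several assumptions: the `Token` type and `greeklishFilter` function are local stand-ins rather than bleve's real token filter interface, the character table is a small illustrative subset, and accented letters and digraphs are not handled. It keeps the original token and emits each Latin variant at the same position.

```go
package main

import "fmt"

// Token is a simplified local stand-in for an analysis token.
type Token struct {
	Term     string
	Position int
}

// latin maps Greek letters to one or more Latin spellings.
// Illustrative table only; a real port would need to be more complete.
var latin = map[rune][]string{
	'α': {"a"}, 'β': {"v"}, 'γ': {"g"}, 'δ': {"d"}, 'ε': {"e"}, 'ζ': {"z"},
	'η': {"i", "h"}, 'θ': {"th"}, 'ι': {"i"}, 'κ': {"k"}, 'λ': {"l"}, 'μ': {"m"},
	'ν': {"n"}, 'ξ': {"x"}, 'ο': {"o"}, 'π': {"p"}, 'ρ': {"r"}, 'σ': {"s"},
	'ς': {"s"}, 'τ': {"t"}, 'υ': {"y", "u"}, 'φ': {"f"}, 'χ': {"x", "ch"},
	'ψ': {"ps"}, 'ω': {"o", "w"},
}

// greeklish expands a term into every combination of Latin spellings.
// Characters missing from the table are passed through unchanged.
func greeklish(term string) []string {
	forms := []string{""}
	for _, r := range term {
		opts, ok := latin[r]
		if !ok {
			opts = []string{string(r)}
		}
		var next []string
		for _, f := range forms {
			for _, o := range opts {
				next = append(next, f+o)
			}
		}
		forms = next
	}
	return forms
}

// greeklishFilter keeps each input token and appends its variants
// at the same position, skipping any variant equal to the original.
func greeklishFilter(in []Token) []Token {
	var out []Token
	for _, t := range in {
		out = append(out, t)
		for _, g := range greeklish(t.Term) {
			if g != t.Term {
				out = append(out, Token{Term: g, Position: t.Position})
			}
		}
	}
	return out
}

func main() {
	for _, t := range greeklishFilter([]Token{{Term: "καλημερα", Position: 1}}) {
		fmt.Println(t.Position, t.Term)
	}
}
```

Because every letter with two spellings doubles the number of variants, a real filter would probably want a cap on how many forms it emits per token.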
    @hsjgrobler_gitlab bleve today doesn't support multiple processes very well, opening an index from a single process to serve all requests is the model it was designed around
    if you need more serverless-like capability, see github.com/blugelabs/bluge as it supports multi-process access
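The single-process model described above for bleve can be sketched like this: open the index once at startup and hand the same handle to every request. Everything here is a hypothetical stand-in (`searcher`, `fakeIndex`, `openIndex`, `service`); a real program would open an actual index in place of `openIndex`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// searcher is a hypothetical stand-in for an opened index handle.
type searcher interface {
	Search(query string) ([]string, error)
}

// fakeIndex is an in-memory substitute so the sketch runs on its own.
type fakeIndex struct{ docs []string }

func (f *fakeIndex) Search(q string) ([]string, error) {
	var hits []string
	for _, d := range f.docs {
		if strings.Contains(d, q) {
			hits = append(hits, d)
		}
	}
	return hits, nil
}

// openCount records how many times the index was opened.
var openCount int

// openIndex pretends to open the index at path.
func openIndex(path string) (searcher, error) {
	openCount++
	return &fakeIndex{docs: []string{"alpha", "beta", "alphabet"}}, nil
}

// service holds the one shared handle for the life of the process.
type service struct{ idx searcher }

func newService(path string) (*service, error) {
	idx, err := openIndex(path)
	if err != nil {
		return nil, err
	}
	return &service{idx: idx}, nil
}

// handle serves one request using the shared handle.
func (s *service) handle(query string) ([]string, error) {
	return s.idx.Search(query)
}

func main() {
	svc, err := newService("example.index")
	if err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			hits, _ := svc.handle("alpha")
			fmt.Println(len(hits), "hits")
		}()
	}
	wg.Wait()
}
```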
    Amnon
    @amnonbc
    ping
    Panagiotis Koursaris
    @panakour
    @mschoch thank you very much. I am really excited about the new refactored bleve called "Bluge". It's awesome and much much better than bleve in all aspects. Thanks a lot for your hard work.
    Marty Schoch
    @mschoch
    @panakour thanks for the feedback. Unfortunately, outside of a few people it has not received much interest from the community, so I've been forced to spend more time on Bleve, and less time on Bluge than I had hoped. But I'm doing my best to continue working on both (my own personal projects use Bluge, so I have interest in keeping it going, or getting everything merged back to Bleve at some point)
    Amnon
    @amnonbc
    Would you advise people to upgrade? Is it ready for production?
    Panagiotis Koursaris
    @panakour
    @amnonbc for me it is working perfectly. I already use it in one project.
    Amnon
    @amnonbc
    So from my point of view, bluge's support for lambda type environments would be a big plus. Porting our code to use bluge does not look like too much work, and it would allow us to remove some hacks we currently use to run in a serverless environment.
    Marty Schoch
    @mschoch
    So from my perspective, it is still at developer preview level. In particular some performance optimizations were removed, not because they don't work but because they needed a better API. Secondly, I would like to create a test framework to validate search results match Bleve, and I haven't been able to do that yet.
    That being said, "production ready" is really in the eye of the beholder.
    People coming forward and saying it works for us, helps us get to that level.
    Sergio Vera
    @sergio__vera_twitter
    Hi @mschoch , is there a way to specify an analyzer when doing a query string query?
    quantonganh
    @quantonganh
    How can I loop through all documents in an index, check if a document ID is not in a []string, and delete it?
    Marty Schoch
    @mschoch
    @sergio__vera_twitter no, the query string query doesn't support that because it works across multiple fields
    James Mills
    @prologic
    Hey all. Having some troubles getting bleve highlight to work. I can see <mark>...</mark> if I log the output of searchResults but once I shove this into a template/html and render that it's gone