    Marty Schoch
    @mschoch
    there were some bugs in that area fixed in the last 6 months though
    @panakour ok that sounds doable, do you have any specific questions we can help with?
    Panagiotis Koursaris
    @panakour

    @mschoch As I understood from the documentation, this is doable using token filters https://blevesearch.com/docs/Token-Filters/. Is there a ready-made token filter type that will let me do it?
    Also, what is the best way to do it? Using AddCustomTokenFilter? Or is there a way to package it like a plugin, for anyone else wanting the same thing and for easier reuse in other projects?

    There is a plugin for Elasticsearch that does the same thing I want: https://github.com/skroutz/elasticsearch-analysis-greeklish https://github.com/skroutz/elasticsearch-analysis-greeklish/blob/7.7.0/src/main/java/org/elasticsearch/index/analysis/GreeklishGenerator.java

    Panagiotis Koursaris
    @panakour
    @mschoch Also, I just found that you have refactored the whole library under the new name "Bluge". So I think your suggestions about the above question should consider using "Bluge" instead.
    Hendrik Grobler
    @hsjgrobler_gitlab
    Hey guys, quick question about best practice: at an API level, should I keep one index open and re-use it across all requests, or open and close the index per request?
    Amnon
    @amnonbc

    I have closed the IP range search PR, as it had a long and circuitous history comprising dozens of commits, which looked like an intimidating prospect for any reviewer. I have squashed the history into a new PR blevesearch/bleve#1546, which comprises less than 600 lines of code, half of which are tests.
    In the end, the implementation of IP range searching is quite trivial and maps very nicely onto bleve's indexReader.FieldDictRange feature.

    I have added some end-to-end tests which demonstrate indexing and searching IP addresses. I put the tests in test/ip_field_test.go as I was not sure what the best location for this kind of test was. @mschoch let me know what you think.

    Marty Schoch
    @mschoch
    @panakour yes, what you want to do is build a token filter, which takes Greek tokens in and outputs the greeklish ones; it seems pretty straightforward to port the one you linked to from Java to Go
    as for sharing it with others, it's as simple as hosting it on GitHub under an appropriate license; others can use it, so long as it conforms to the token filter interface
    @hsjgrobler_gitlab bleve today doesn't support multiple processes very well; opening an index from a single process to serve all requests is the model it was designed around
    if you need more serverless-like capability, see github.com/blugelabs/bluge, as it supports multi-process access
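    [Editor's sketch] The transliteration core of such a token filter can be sketched in plain Go. The mapping table below is a simplified illustration, not the full rule set of the linked plugin (which also handles digraphs and multiple Latin variants); in a real bleve filter this function would be applied to each token's Term bytes inside an analysis.TokenFilter implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// greeklishMap is a simplified, illustrative Greek-to-Latin mapping.
var greeklishMap = map[rune]string{
	'α': "a", 'β': "v", 'γ': "g", 'δ': "d", 'ε': "e",
	'ζ': "z", 'η': "i", 'θ': "th", 'ι': "i", 'κ': "k",
	'λ': "l", 'μ': "m", 'ν': "n", 'ξ': "x", 'ο': "o",
	'π': "p", 'ρ': "r", 'σ': "s", 'ς': "s", 'τ': "t",
	'υ': "y", 'φ': "f", 'χ': "ch", 'ψ': "ps", 'ω': "o",
}

// toGreeklish transliterates a lowercase Greek token to Latin characters,
// passing through any rune it has no mapping for.
func toGreeklish(token string) string {
	var b strings.Builder
	for _, r := range token {
		if latin, ok := greeklishMap[r]; ok {
			b.WriteString(latin)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toGreeklish("καλημερα")) // kalimera
}
```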
    Amnon
    @amnonbc
    ping
    Panagiotis Koursaris
    @panakour
    @mschoch thank you very much. I am really excited about the new refactored bleve called "Bluge". It's awesome and much, much better than bleve in all aspects. Thanks a lot for your hard work.
    Marty Schoch
    @mschoch
    @panakour thanks for the feedback. Unfortunately, outside of a few people it has not received much interest from the community, so I've been forced to spend more time on Bleve, and less time on Bluge, than I had hoped. But I'm doing my best to continue working on both (my own personal projects use Bluge, so I have an interest in keeping it going, or getting everything merged back into Bleve at some point)
    Amnon
    @amnonbc
    Would you advise people to upgrade? Is it ready for production?
    Panagiotis Koursaris
    @panakour
    @amnonbc it's working perfectly for me. I already use it in one project.
    Amnon
    @amnonbc
    So from my point of view, bluge's support for lambda-type environments would be a big plus. Porting our code to use bluge does not look like too much work, and it would allow us to remove some hacks we currently use to run in a serverless environment.
    Marty Schoch
    @mschoch
    So from my perspective, it is still at developer preview level. In particular some performance optimizations were removed, not because they don't work but because they needed a better API. Secondly, I would like to create a test framework to validate search results match Bleve, and I haven't been able to do that yet.
    That being said, "production ready" is really in the eye of the beholder.
    People coming forward and saying "it works for us" helps us get to that level.
    Sergio Vera
    @sergio__vera_twitter
    Hi @mschoch , is there a way to specify an analyzer when doing a query string query?
    quantonganh
    @quantonganh
    How can I loop through all documents in an index, check whether a document ID is not in a []string, and delete it if so?
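    [Editor's sketch] The pure-Go part of this question (set membership over IDs) can be sketched as below; the bleve-specific parts (enumerating document IDs from the index, then deleting each one) are only described in comments, as an assumption about how you would wire it up.

```go
package main

import "fmt"

// idsToDelete returns the IDs from allIDs that are not present in keep.
// With bleve, allIDs would come from iterating the index (e.g. a
// match-all search requesting only document IDs), and each returned ID
// would then be deleted from the index.
func idsToDelete(allIDs, keep []string) []string {
	// Build a set for O(1) membership checks instead of nested loops.
	keepSet := make(map[string]struct{}, len(keep))
	for _, id := range keep {
		keepSet[id] = struct{}{}
	}
	var out []string
	for _, id := range allIDs {
		if _, ok := keepSet[id]; !ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(idsToDelete([]string{"a", "b", "c"}, []string{"b"})) // [a c]
}
```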
    Marty Schoch
    @mschoch
    @sergio__vera_twitter no, the query string query doesn't support that because it works across multiple fields
    James Mills
    @prologic
    Hey all. Having some trouble getting bleve highlighting to work. I can see <mark>...</mark> if I log the output of searchResults, but once I shove this into an html/template and render it, it's gone
    I thought there was some funky escaping/sanitizing going on, but I'm a bit stumped
    Amnon
    @amnonbc
    Yes, html/template will sanitize out the <mark> tags.
    https://play.golang.org/p/y_Kj804D54F
    orenben12
    @orenben12
    Hi - I'm new to bleve and am trying to find some detailed guidance on concurrency and thread safety - can someone please point me in the right direction? I'd like to index a large dataset from multiple threads, but want to understand whether there is any risk to the store.
    Marty Schoch
    @mschoch
    welcome @orenben12, i'll try to answer your questions
    first, the index structure itself is thread-safe, so you can have multiple goroutines sharing a reference to it and using the index/batch methods to index concurrently
    the batch objects are NOT thread-safe, but are reusable (more on this later), so if each goroutine is indexing in batches (recommended), they should use separate batch structures
    batch structures can be reused, unless you're using the "unsafe_batch" option on a scorch index, in which case they are much more difficult to reuse safely; to start, i wouldn't bother reusing batches
    in general though, while some concurrency may help, you'll eventually hit some limits with a single index, most likely before saturating your disk I/O
    if your goal is to index all the data as quickly as possible, you may find it beneficial to create multiple indexes instead of just one
    then at query time, you can query across all of them using an index alias
    finally, in general, i recommend you start with bleve v2 if you're not already, as it produces the best index we support out of the box (v1 defaults to some older technology)
    i'll stop there, but if you have follow-up questions, just let me know
    Jxic
    @Jxic

    @hsjgrobler_gitlab bleve today doesn't support multiple processes very well, opening an index from a single process to serve all requests is the model it was designed around

    Hi - I just started using bleve a few days ago and am now looking for some general advice on optimizing index search speed. In my case, I only have to index all the data once at program initialization, and there will be many concurrent search requests after that. I just went through this chat and came across this explanation, which makes me wonder whether bleve is the right choice for this scenario, as I am not sure what "single process" means here

    will the requests be queued in this case?
    Jxic
    @Jxic
    by the way, I'm using the memory-only index
    Marty Schoch
    @mschoch
    ok, so for an in-memory index this issue isn't relevant, because only one process can have access to the memory anyway
    for indexes persisted to disk in files, it can sometimes be convenient to allow multiple processes to work with those files at one time
    bleve does not allow that, locks held by the writer process block other processes from reading as well
    bluge does allow that, by using operating system locking primitives, a single writer and multiple reader processes can share access to the files
    John Forstmeier
    @forstmeier
    Hi! I can't figure out how to get the fields back from a sub-document. Basically, I'd like to run queries against one of the sub-document fields ("body") and then return all three fields ("id", "timestamp", and "body") in the results, so that I can rebuild the original BleveSpecimen type (a wrapper around Specimen so that the bleve.Classifier interface can be implemented). I've looked at this Gist which uses the SetInternal/GetInternal methods, but is there a way to do it directly from the SearchResult type?
    package main
    
    import (
        "log"
        "os"
        "time"
    
        "github.com/blevesearch/bleve/v2"
    )
    
    // Specimen is the root specimen.
    type Specimen struct {
        ID        string    `json:"id"`
        Timestamp time.Time `json:"timestamp"`
        Body      string    `json:"body"`
    }
    
    // BleveSpecimen wraps the root specimen.
    type BleveSpecimen struct {
        // Specimen `json:"specimen"`
        Specimen Specimen `json:"specimen"`
    }
    
    // Type implements the bleve.Classifier interface.
    func (bs *BleveSpecimen) Type() string {
        return "bleve_specimen"
    }
    
    var now = time.Now()
    
    var specimens = []Specimen{
        {ID: "one", Timestamp: now, Body: `{"text":"the quick brown fox jumped over the lazy dog"}`},
        {ID: "two", Timestamp: now, Body: `{"text":"jump over the brick wall"}`},
        {ID: "three", Timestamp: now, Body: `{"text":"carnivours are delicious"}`},
    }
    
    func main() {
        specimenMapping := bleve.NewDocumentMapping()
    
        idFieldMapping := bleve.NewTextFieldMapping()
        timestampFieldMapping := bleve.NewDateTimeFieldMapping()
        bodyFieldMapping := bleve.NewTextFieldMapping()
    
        specimenMapping.AddFieldMappingsAt("id", idFieldMapping)
        specimenMapping.AddFieldMappingsAt("timestamp", timestampFieldMapping)
        specimenMapping.AddFieldMappingsAt("body", bodyFieldMapping)
    
        bleveSpecimenMapping := bleve.NewDocumentMapping()
        bleveSpecimenMapping.AddSubDocumentMapping("specimen", specimenMapping)
    
        indexMapping := bleve.NewIndexMapping()
        indexMapping.AddDocumentMapping("bleve_specimen", bleveSpecimenMapping)
    
        name := "testing.bleve"
        index, err := bleve.New(name, indexMapping)
        if err != nil {
            log.Fatalf("error creating index: %s", err.Error())
        }
        defer os.RemoveAll(name)
    
        batch := index.NewBatch()
        for _, specimen := range specimens {
            batch.Index(specimen.ID, BleveSpecimen{
                Specimen: specimen,
            })
        }
    
        if err := index.Batch(batch); err != nil {
            log.Fatalf("error calling batch: %s", err.Error())
        }
    
        query := bleve.NewMatchQuery("carnivours")
        search := bleve.NewSearchRequest(query)
        searchResults, err := index.Search(search)
        if err != nil {
            log.Fatalf("error running query: %s", err.Error())
        }
    
        log.Printf("results: %+v", searchResults)
    }
    Marty Schoch
    @mschoch
    @forstmeier we no longer recommend using internal storage for storing significant amounts of data. It works OK with the older upsidedown index format, but the scorch index is not designed to store large amounts of data with those internal values.
    Generally, it should be as simple as ensuring that you set "store" to true on the field mapping: https://github.com/blevesearch/bleve/blob/ae28975038cb25655da968e3f043210749ba382b/mapping/field.go#L50
    And then when you build the search request, set Fields to []string{"*"}: https://github.com/blevesearch/bleve/blob/master/search.go#L276
    NOTE that "*" is just a magic value interpreted to mean you want us to load all stored fields; there is no pattern matching.
    Jxic
    @Jxic
    hi @mschoch, recently I've been using memory-only bleve to store around 100,000 documents, and I did a few benchmarks on search performance. The document struct is quite simple: only 3 string fields and 1 int field. The major performance bottleneck seems to be frequent garbage collection, since unmarshalling values from []byte data creates a lot of short-lived objects. Do you think it's possible to skip the marshalling and unmarshalling when using a memory-only index?
    Marty Schoch
    @mschoch
    @Jxic yeah, the in-memory index is pretty bad today. We'd like to replace it with one backed by scorch as well (on the road-map for this year). Can you be more specific about which marshal/unmarshal you think would be helpful to remove?
    Jxic
    @Jxic
    so, according to the pprof diagram, the function NewBackIndexRowKV causes plenty of new memory allocation, and it boils down to (*BackIndexRowValue).Unmarshal