    Amnon
    @amnonbc

    Hi @mschoch!
    I am having some performance problems with a large index, which consists of 4 million items.
    We are using the RocksDB backend, and the index weighs about 6 GB on disk.
    A simple MatchAll search takes about a minute - most of which is spent iterating through RocksDB records in CGO - by which time our GUI times out.

    We can limit the time the search takes by calling idx.SearchInContext with a context carrying an appropriate timeout, but in that case the search returns a nil result.
    What would be great for us is if there was a way for Bleve to give us "best effort" results within a deadline.
    Or alternatively for Bleve to return results as it collects them, rather than computing everything first, and then giving us the results.
    We basically want to show our users something in the GUI quickly, even if the result may need revising once more data is collected.
    Is there any way I can do this in bleve?
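    For reference, the timeout approach we are using today looks roughly like this (a sketch, assuming bleve v2 import paths):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/blevesearch/bleve/v2"
)

// searchWithDeadline bounds the search with a context timeout; on expiry
// the partial work is discarded and SearchInContext returns a nil result
// together with the context's error (context.DeadlineExceeded).
func searchWithDeadline(idx bleve.Index) *bleve.SearchResult {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
	res, err := idx.SearchInContext(ctx, req)
	if err != nil {
		log.Println("search timed out or failed:", err)
		return nil
	}
	return res
}
```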

    Marty Schoch
    @mschoch
    Sorry to hear you are having performance issues. Are you doing MatchAll as a proxy for any query that takes a long time, or are you relying on MatchAll for your application? In general the upsidedown index format (which is what allows you to use RocksDB) is not well designed, and many queries have to scan large ranges of the rocksdb database.
    There is a hidden capability to stream results back as they are found, instead of returning the final result set. Unfortunately, it was added by Couchbase, and has no real obvious public API. Instead it is accessed using a magic key which can be set inside the Context passed into the search request. The key is used here: https://github.com/blevesearch/bleve/blob/e7235bec9cf6d984a1683372673d4f0571fa7d94/search/collector/topn.go#L188-L197
    By default we use the built-in function MakeTopNDocumentMatchHandler, but you can (and need to) write your own.
    At the moment I cannot find any documentation about how it is used.
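    roughly, a custom handler might look like this (an untested sketch, assuming bleve v2 import paths; the channel plumbing is illustrative, not an official API):

```go
package main

import (
	"context"
	"fmt"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/search"
)

// streamIDs builds a MakeDocumentMatchHandler that forwards each hit's ID
// to a channel as soon as the collector sees it, instead of waiting for
// the final sorted result set.
func streamIDs(out chan<- string) search.MakeDocumentMatchHandler {
	return func(ctx *search.SearchContext) (search.DocumentMatchHandler, bool, error) {
		handler := func(hit *search.DocumentMatch) error {
			if hit == nil {
				return nil // the collector passes nil to signal the end
			}
			// DocumentMatch objects may be pooled and reused, so copy
			// anything you need (here just the ID) before returning.
			out <- hit.ID
			return nil
		}
		// the returned bool (loadID) asks the collector to resolve
		// external document IDs for each match
		return handler, true, nil
	}
}

func main() {
	idx, err := bleve.Open("example.bleve") // assumed existing index
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	out := make(chan string)
	ctx := context.WithValue(context.Background(),
		search.MakeDocumentMatchHandlerKey, streamIDs(out))

	go func() {
		defer close(out)
		req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
		if _, err := idx.SearchInContext(ctx, req); err != nil {
			fmt.Println("search:", err)
		}
	}()

	for id := range out {
		fmt.Println("hit:", id) // render incrementally in the GUI
	}
}
```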
    Amnon
    @amnonbc
    Thanks @mschoch,
    We initially do a DateRangeQuery to return the last week's data, and the user then adds more specific terms to get the items they are interested in.
    The problem is that the DateRangeQuery takes a long time - and, counter-intuitively, the time does not depend on how many documents match that range.
    I'll have a look at the snippet you sent and try to make sense of it. But it looks like we will have to partition the data by date.
    Marty Schoch
    @mschoch
    @amnonbc are the date ranges completely arbitrary? or do they fall into simple buckets like month/year (presumably what you would be partitioning on)? Because if the queries align with those buckets, you could prepare a special field with those values and use a basic term query (fast). That might perform well enough that you could skip the partitioning. But again, nothing wrong with the partitioning solution either.
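    a sketch of the bucket idea (assuming bleve v2 import paths; the "week" field name and ISO-week bucketing are illustrative choices, not anything bleve prescribes):

```go
package main

import (
	"fmt"
	"time"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/keyword"
)

// weekBucket reduces a timestamp to a coarse bucket value such as "2021-W14".
func weekBucket(t time.Time) string {
	y, w := t.ISOWeek()
	return fmt.Sprintf("%04d-W%02d", y, w)
}

func main() {
	// map the bucket field with the keyword analyzer so values like
	// "2021-W14" are indexed verbatim rather than tokenized
	weekField := bleve.NewTextFieldMapping()
	weekField.Analyzer = keyword.Name

	docMapping := bleve.NewDocumentMapping()
	docMapping.AddFieldMappingsAt("week", weekField)

	indexMapping := bleve.NewIndexMapping()
	indexMapping.DefaultMapping = docMapping

	idx, err := bleve.New("buckets.bleve", indexMapping)
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	// at index time, store the precomputed bucket alongside the document
	now := time.Now()
	_ = idx.Index("id-1", map[string]interface{}{
		"text": "some searchable content",
		"week": weekBucket(now),
	})

	// at query time, "last two weeks" becomes a disjunction of two cheap
	// term queries instead of a slow range scan
	q1 := bleve.NewTermQuery(weekBucket(now))
	q1.SetField("week")
	q2 := bleve.NewTermQuery(weekBucket(now.AddDate(0, 0, -7)))
	q2.SetField("week")

	res, err := idx.Search(bleve.NewSearchRequest(bleve.NewDisjunctionQuery(q1, q2)))
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Total)
}
```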
    Amnon
    @amnonbc
    Many of the searches are for the last week or the last month. Approximating these as buckets (or disjunctions of buckets) should work, and it is a lot easier to do than partitioning. I'll give this a try.
    Amnon
    @amnonbc

    I tried the idea of buckets, and it gives a 100x speedup, even when I need to combine tens of buckets to express my query.

    This leads to another question. When a user at a GUI does a search, they (eventually) get a page of results.
    When they scroll to the next page, bleve appears to perform the entire search from scratch.
    Is there any way to get bleve to cache the results?

    Another question: when I create an index, and populate it, is it possible to add a new FieldMapping at a later stage?
    Or must the index be re-created?
    Marty Schoch
    @mschoch
    > bleve appears to perform the entire search from scratch
    yes it does; the size/skip literally does just that: it runs the entire search and skips over results
    there is no easy way to cache this in some useful way to save work getting the second page
    alternatively we have a different method; its main purpose is to allow "deep pagination", but it may suit your use-case as well
    you can read more about that feature here: blevesearch/bleve#1182
    essentially, you run the search for page 1 as usual; then, to access page 2, instead of skip=10 you pass search_after with the sort key from the last result of page 1
    this has a different set of trade-offs: you cannot jump to an arbitrary page, just first/last/next/prev
    there are unit tests showing how it can be used
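    roughly, the two-request flow looks like this (a sketch against bleve v2's SearchRequest.SearchAfter field; q stands for whatever query you are running):

```go
package main

import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/search/query"
)

// firstTwoPages runs the same query twice: page 1 with an explicit,
// stable sort order, then page 2 via SearchAfter using the sort key
// of page 1's last hit.
func firstTwoPages(idx bleve.Index, q query.Query) (*bleve.SearchResult, *bleve.SearchResult, error) {
	req := bleve.NewSearchRequest(q)
	req.Size = 10
	req.SortBy([]string{"-_score", "_id"}) // a stable order is required
	page1, err := idx.Search(req)
	if err != nil || len(page1.Hits) == 0 {
		return page1, nil, err
	}

	// instead of From=10, pass the sort key of page 1's last hit
	req2 := bleve.NewSearchRequest(q)
	req2.Size = 10
	req2.SortBy([]string{"-_score", "_id"})
	req2.SearchAfter = page1.Hits[len(page1.Hits)-1].Sort
	page2, err := idx.Search(req2)
	return page1, page2, err
}
```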
    Marty Schoch
    @mschoch
    today we do not support any changes to the index mapping once the index is created; however, I know some users have found a way to add new fields, so it is possible, we just don't support it
    Amnon
    @amnonbc

    Thanks for the answers.

    I'll look at the deep pagination feature.

    Korede Oluwafemi
    @Koredeoluwafemi
    Hi everyone, is there a way bleve search results can be converted to structs?
    Korede Oluwafemi
    @Koredeoluwafemi
    @mschoch
    Korede Oluwafemi
    @Koredeoluwafemi
    okay, thanks @mschoch, I was actually looking for a way to convert the indexed object back into a struct
    Korede Oluwafemi
    @Koredeoluwafemi
    @amnonbc
    ABHAY MANIYAR
    @abhaymaniyar

    @mschoch Hello Marty,
    Bleve is a fantastic search library. Kudos to you and all the contributors.

    I have a use-case that I want to use bleve for. I have a list of objects which I want to index, but I don't want the index data store files to be saved on the server instance. Do we have any cloud support available for bleve? Or can we convert the data store files into a serialized form, save them on the cloud or S3, and fetch them before use?

    Marty Schoch
    @mschoch
    @Koredeoluwafemi the index is a flat list of fields; if you want to convert this back to a struct, it is up to your application to do that (a sketch of one approach follows below)
    @abhaymaniyar unfortunately in Bleve it's pretty hard-coded that the segment files are on disk locally.
    I have a newer library bluge (https://github.com/blugelabs/bluge) which is an experimental fork of bleve. It has support for a pluggable "directory" interface which removes this limitation. There has been interest expressed in the slack channel to add s3 support to bluge, but no work has started yet.
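    Picking up the @Koredeoluwafemi question above: one way an application might rebuild a struct from the returned fields (a sketch: the Beer struct, its field names, and the index path are hypothetical, and round-tripping through JSON is just one convenient option):

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

// Beer is a hypothetical application struct; its JSON tags must line up
// with the field names that were indexed (and stored) for each document.
type Beer struct {
	Name  string  `json:"name"`
	Style string  `json:"style"`
	ABV   float64 `json:"abv"`
}

func main() {
	idx, err := bleve.Open("beers.bleve") // assumed existing index
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	req := bleve.NewSearchRequest(bleve.NewMatchQuery("stout"))
	req.Fields = []string{"*"} // return all stored fields with each hit

	res, err := idx.Search(req)
	if err != nil {
		panic(err)
	}

	for _, hit := range res.Hits {
		// hit.Fields is a flat map[string]interface{}; round-trip it
		// through JSON to rebuild the application struct
		raw, _ := json.Marshal(hit.Fields)
		var b Beer
		if err := json.Unmarshal(raw, &b); err != nil {
			continue
		}
		fmt.Printf("%s: %+v\n", hit.ID, b)
	}
}
```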
    ABHAY MANIYAR
    @abhaymaniyar
    @mschoch Can we serialize the datastore in bluge?
    Marty Schoch
    @mschoch
    @abhaymaniyar I don't know what "serialize the datastore" means
    Johann Tanzer
    @tulpenhaendler
    Hi all, I just found bleve and bluge a few days ago and started to experiment with it, so far looks very good, great work @mschoch!
    What I am a bit confused about right now is how "production ready" bluge is, and whether I should use bleve or bluge for a new project. Generally speaking I am leaning towards bluge, just because bleve seems a bit bloated with different indexes and query parsers and all; I would prefer the much slimmer bluge right now, but I am not sure if that's a good idea
    Johann Tanzer
    @tulpenhaendler
    on a side note, I actually have a similar use case to @abhaymaniyar in terms of S3 storage and was super happy to see the Directory interface in bluge, but I would just use bluge -> Directory interface -> afero (mix of local disk and S3)
    Marty Schoch
    @mschoch
    welcome @tulpenhaendler in my opinion, bluge is still only at developer preview release quality. it works for my use cases, but like bleve, it has unit tests and some basic full-stack tests, and really lacks something more rigorous. in bleve, we have gotten by because Couchbase has invested in a considerable test suite for their product, which, while not perfect, has functioned as a sort of proxy for that.
    i will probably be making some announcements about bluge in the near future, but one thing is clear, it will need help from the community to become production ready, it is not something i will be able to do myself.
    i appreciate that you like bluge being much leaner, that was one of my core goals when i started it
    regarding the directory interface, it should be possible to do something with s3, but i suspect as currently implemented, you'll need some sort of local caching layer, and i'm not sure that bluge's use of segments will make that easy (not impossible either though)
    Marty Schoch
    @mschoch
    to me a longer-term interesting idea would be to use s3 lambda to push down search of a segment into s3, and only return the relevant matches/meta-data, not even having to download all or part of the raw segment from s3
    Johann Tanzer
    @tulpenhaendler
    thanks for your answer, i am going to look into bluge more and i plan to write some benchmarks, will share that of course
    Johann Tanzer
    @tulpenhaendler
    for s3, i want to have lots of indexes sharded among multiple instances; they would just fetch the entire index once if they don't already have it locally, and upload snapshots periodically (so i don't actually need bluge to do anything s3 related). technically i guess s3 supports Range queries, so you might even be able to implement something like ReadAt(offset,len), but imo it would be a strange use case where you need to use s3 like a filesystem....playing around on the aws cost calculator - just the PUT requests for 5 writes/second are about the same price as a 500gb ebs drive per month...
    Marty Schoch
    @mschoch
    Ah ok, I guess we just have different use cases. The indexes I work with are hundreds of GB or larger, so even individual segments are quite large. The latency implied by downloading a segment you don't have yet would be unusable.
    ged
    @gedw99
    I am adding a GIOUI (golang) GUI to beer search, and was wondering whether you want it in the origin repo or whether i should keep it in my own repo.
    I am talking about this repo: https://github.com/blugelabs/beer-search
    here is a kitchen sink demo of GIOUI: https://gioui.org/files/wasm/kitchen/index.html
    Because beer search expects a FS, i will be using the golang FS wrapper that compiles to both WASM and normal Go.
    Marty Schoch
    @mschoch
    @gedw99 in general I try to keep the examples focused on the bleve/bluge aspects of the application. No matter which JS library we choose, or even choosing to go without one, it still gets in the way when we build web-based examples. Using a Go UI library may help some because the code is in Go, but it still hurts because it isn't how most apps would actually be built (today). So, for now I encourage you to build this if it is of interest to you, but I cannot say whether or not it would be accepted upstream. When you have something working that we can look at/use, please share it here again.
    Scott Cotton
    @wsc0
    Hi, I am wondering if there is an easy way to index selectively: for example, to provide a database of words not to index at index creation time, in order to speed up the index and focus it on specific application needs. I can code it if need be, but would be interested in some pointers...
    Marty Schoch
    @mschoch
    @wsc0 custom words not to index is already supported; it is called a stop word token filter
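    wiring one up looks roughly like this (a sketch, assuming bleve v2 import paths; the names "my_stop_words", "my_stop_filter" and "my_analyzer" are illustrative):

```go
package main

import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/token/stop"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/analysis/tokenmap"
)

func main() {
	indexMapping := bleve.NewIndexMapping()

	// 1. register the word list we never want in the index
	err := indexMapping.AddCustomTokenMap("my_stop_words", map[string]interface{}{
		"type":   tokenmap.Name,
		"tokens": []interface{}{"the", "and", "beer"},
	})
	if err != nil {
		panic(err)
	}

	// 2. a stop token filter that drops those words
	err = indexMapping.AddCustomTokenFilter("my_stop_filter", map[string]interface{}{
		"type":           stop.Name,
		"stop_token_map": "my_stop_words",
	})
	if err != nil {
		panic(err)
	}

	// 3. an analyzer that tokenizes, lowercases, then applies the filter
	err = indexMapping.AddCustomAnalyzer("my_analyzer", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     unicode.Name,
		"token_filters": []string{lowercase.Name, "my_stop_filter"},
	})
	if err != nil {
		panic(err)
	}
	indexMapping.DefaultAnalyzer = "my_analyzer"

	idx, err := bleve.New("stopwords.bleve", indexMapping)
	if err != nil {
		panic(err)
	}
	defer idx.Close()
}
```

    any term in the token map is dropped by the filter before it ever reaches the index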