    James Bishop
    @bishopaj_gitlab
    Hi Alex, I've just started playing with Elastiknn and have a question about the score returned by Elasticsearch. I'm using Jaccard LSH and seeing scores around 0.05 or less, even for the first result, which should be an exact match. Am I misunderstanding how the score should work?
    James Bishop
    @bishopaj_gitlab
    Never mind, I figured it out. The true_indices for the sparse bool vector need to be in sorted order.
    Alex Klibisz
    @alexklibisz
    Hi @bishopaj_gitlab . Just seeing your msg. The indices shouldn't have to be provided in sorted order. The plugin should take care of sorting them. Did you find that it's not doing that?
    Alex Klibisz
    @alexklibisz
    Actually I believe I accidentally removed the sorting at some point. I'll add it back today. Planning to release another RC at some point today.
    Alex Klibisz
    @alexklibisz
    Fixed that bug here: alexklibisz/elastiknn#134
    I'm gonna cut a release with some changes I made over the weekend and then include this in the next release
    Alex Klibisz
    @alexklibisz
    @bishopaj_gitlab Fix for unsorted vectors released here: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE31
    Let me know if you have any other questions
    James Bishop
    @bishopaj_gitlab
    Thanks Alex!
    I'm only just starting to look at the code, but does this also apply to the document vector? I mean when I store documents with a sparse bool vector do the entries need to be sorted?
    James Bishop
    @bishopaj_gitlab
    Also a somewhat unrelated question: what are the implications of increasing the dimensionality of the document/query vectors? I'm currently using 25,000 which I took from one of your examples, but might want to increase this to reduce collisions (I'm hashing terms into these 25k buckets).
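For readers following along, the "hashing terms into buckets" setup described above can be sketched roughly as follows. This is only an illustration, not code from the thread or from Elastiknn: the function names, the choice of MD5, and the default of 25,000 dimensions are assumptions made for the example.

```python
import hashlib


def terms_to_true_indices(terms, dims=25_000):
    """Hash each term into one of `dims` buckets (hypothetical helper).

    Returns the sorted, de-duplicated bucket indices, i.e. the positions
    that would be "true" in a sparse bool vector of length `dims`.
    """
    buckets = set()
    for term in terms:
        # MD5 is used only as a stable, well-mixed hash; any stable hash works.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        buckets.add(int.from_bytes(digest[:8], "big") % dims)
    return sorted(buckets)


def count_collisions(terms, dims=25_000):
    """Number of distinct terms lost because they share a bucket with another term."""
    distinct = set(terms)
    return len(distinct) - len(terms_to_true_indices(distinct, dims))
```

Raising `dims` lowers the value returned by `count_collisions` for a given vocabulary, while the number of true indices per document stays at most the number of distinct terms.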
    Alex Klibisz
    @alexklibisz
    @bishopaj_gitlab The plugin will sort trueIndices both when indexing new vectors and when running queries. So you don't need to worry about it at all. If you are curious, the reason for this is that it's much faster to compute the intersection of two sorted arrays of indices than it is to compute the intersection of unsorted arrays.
    The implications of increased dimensionality are tricky with sparse vectors. What really has an effect is the number of true indices. Generally, having 2n true indices is going to mean 2x slower operations than having n true indices.
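To illustrate the point about sorted arrays, here is a minimal sketch of a two-pointer intersection and the Jaccard similarity built on it. It is not the plugin's actual implementation; the function names are made up for the example, and returning 1.0 for two empty vectors is just a convention chosen here.

```python
def sorted_intersection_size(a, b):
    """Count common elements of two sorted, duplicate-free lists in O(len(a) + len(b))."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def jaccard(a, b):
    """Jaccard similarity of two sets of true indices; sorts first, so input order does not matter."""
    a, b = sorted(set(a)), sorted(set(b))
    inter = sorted_intersection_size(a, b)
    union = len(a) + len(b) - inter
    # Convention for this sketch only: two empty vectors count as identical.
    return inter / union if union else 1.0
```

Each step advances at least one pointer, so the cost grows with the number of true indices rather than with the nominal dimensionality. Skipping the sort breaks the walk: `sorted_intersection_size([3, 1, 2], [1, 2, 3])` finds only one match even though the sets are identical, which is consistent with the low scores reported earlier in the thread.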
    Div Dasani
    @divdasani
    Hi Alex! I am interested in Elastiknn for a production use case. Curious as to why the plugin isn't forward compatible with newer versions of ES?
    Alex Klibisz
    @alexklibisz
    @divdasani Hi Div. I haven't had the time yet to support multiple versions of ES. There's a ticket open for it on GitHub. It would involve modifying the Gradle and CI setup. Nothing conceptually too challenging, just a bit tedious, and you are the first to ask :). There was also a regression in the version of Lucene used beyond 7.6.2 that affected some queries. I'm not sure if they've resolved that or not.
    My main focus right now is speeding up approximate queries to make a decent submission to the ann-benchmarks benchmark project. After that, I would like to go ahead and support two or three versions of ES concurrently. If you are comfortable with Gradle and GitHub workflows (or know someone who is), I'd be open to collaborating on it, reviewing a PR, etc.
    I check this board about once a week, so feel free to email me aklibisz@gmail.com if you want to chat about it.
    ejackson-eb
    @ejackson-eb
    Hi, I'm also interested in Elastiknn for a production use case, but we are currently using v7.9.2. I am trying to decide between a) downgrading to 7.6.2 to be consistent with elastiknn; b) trying to build a version of the plug-in from source that supports 7.9.2, or c) begging you to release a plug-in for 7.9.2. Any recommendations?
    Alex Klibisz
    @alexklibisz
    Exciting stuff! Tonight I'm looking into 1) building/releasing a branch for 7.9.2 and possibly 2) having 2 or 3 versions maintained at a time.
    (already emailed w/ Eric about this, just posting for anyone else who might be interested)
    Alex Klibisz
    @alexklibisz
    So they changed quite a bit internally between 7.8.x and 7.9.x. Trying to grok the diff.
    Alex Klibisz
    @alexklibisz
    @ejackson-eb @divdasani I got it to compile with 7.9.2 yesterday. But there are some runtime errors. The short story is that between 7.8.x and 7.9.x they did some internal refactoring to the code that lets you define custom data types. There were a handful of constructor params I had to fill in to extend the necessary classes and I bet I messed some of them up on the first pass. If anyone wants to tinker, there is a WIP PR on the project. I'll hopefully pick back up on it over the weekend.
    Alex Klibisz
    @alexklibisz
    Hey folks, here is a build of Elastiknn on ES 7.9.2 which passes all of the tests: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE42-PR173-SNAPSHOT
    Dillion
    @Dillion
    Hi Alex! trying out Elastiknn, great work and the tutorial was very useful!
    regarding the approximate query with LSH, does it work if the vectors are stored in nested fields and what kind of performance impact would there be?
    Dillion
    @Dillion
    My use case is reranking long documents: I would like to run standard ES queries on long documents, then compare a target query vector against the nearby sentence vectors in each document and return the top k documents.
    Dillion
    @Dillion
    E.g. searching by the phrase 'I am blue', returning 100 documents containing 'blue', each with ~100 sentences that have been encoded to vectors, finding the closest 100 sentences to the phrase 'I am blue' and then rescoring the documents
    I'm assuming that storing the sentence vectors as nested fields of each document would still allow them to be queried for vector similarity using Elastiknn. I also want to leverage the LSH function so that not every sentence in each document returned by the keyword search needs to be scored, only the nearest sentences in each document.
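The rescoring flow described above can be sketched in plain Python to make the intent concrete. This is an exact, brute-force illustration that runs outside Elasticsearch; it does not use Elastiknn, nested fields, or LSH. The function names, the use of cosine similarity, and the rule "a document's score is its best sentence score" are all assumptions, since the thread does not say how documents should be rescored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(query_vec, docs, top_sentences=100, k=10):
    """Rescore keyword-matched documents by their closest sentences (hypothetical helper).

    `docs` maps a document id to the list of sentence vectors for that document.
    """
    # Score every sentence of every candidate document against the query.
    scored = [
        (cosine(query_vec, sentence), doc_id)
        for doc_id, sentences in docs.items()
        for sentence in sentences
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Keep only the closest sentences overall, then score each document
    # by its best surviving sentence (an assumed rescoring rule).
    best = {}
    for score, doc_id in scored[:top_sentences]:
        best[doc_id] = max(score, best.get(doc_id, score))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]
```

For example, `rerank([1, 0], {"a": [[1, 0], [0, 1]], "b": [[0.5, 0.5]]}, top_sentences=2, k=2)` ranks document `"a"` ahead of `"b"`. An approximate (LSH) version would replace the exhaustive scoring step with a candidate lookup.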
    bhushanbrb
    @bhushanbrb
    Hi @alexklibisz
    I'd like to know your thoughts on this difference: exact KNN returned a relevant product, but ANN went way off.