    John Dagdelen
    @jdagdelen
    seems like too much disk space is being used by the container?
    1 reply
    Pontus Lundin
    @lundin

    guys, is it somehow possible to view how much memory the current dev setup is using as-is? I would like to estimate, based on the average memory it uses for some docs, where it will end up... I do fear it a bit ;)

    As for attributes, am I right in favouring integer values over string values where possible, and keeping cardinality low (at least as much as possible)? Is the cardinality per document or global? I.e. if I have an ID array field attribute with 1,2,3,4 in every document, is it 4 x documents of space?

    8 replies
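    A hedged starting point for checking current memory use, assuming a default container on port 8080; each Vespa service exposes the same state API on its own status port, and the content-node (proton) metrics are the ones that matter most for attribute memory:

        # dump current metrics for the container; repeat against the content
        # node's status port to see proton memory usage
        curl -s http://localhost:8080/state/v1/metrics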
    Kyle Rowan
    @karowan

    Hi, is cast a reserved keyword in YQL? Trying to use the YQL query
    select * from sources * where cast contains sameElement(id contains '16');
    returns the error

    com.yahoo.processing.IllegalInputException: com.yahoo.search.yql.ProgramCompileException: query:L1:30 no viable alternative at input 'select * from sources * where cast'\n\tat com.yahoo.search.yql.YqlParser.parseYqlProgram(YqlParser.java:792)

    When I use the same query but replace cast with crew, it works as intended.
    Here are the fields in question:

            struct CastType {
                field cast_id type string {}
                field character type string {}
                field credit_id type string {}
                field gender type string {}
                field id type string {}
                field name type string {}
                field order type int {}
                field profile_path type string {}
            }
    
            field cast type array<CastType> {
                indexing: summary
                struct-field id { indexing: attribute }
                struct-field cast_id { indexing: attribute }
                struct-field name  { indexing: attribute }
                struct-field character { indexing: attribute }
                struct-field gender  { indexing: attribute }
            }
            struct CrewType {
                field credit_id type string {}
                field department type string {}
                field gender type int {}
                field id type string {}
                field job type string {}
                field name type string {}
                field profile_path type string {}
            }
    
            field crew type array<CrewType> {
                indexing: summary
                struct-field id { indexing: attribute }
                struct-field name  { indexing: attribute }
                struct-field department { indexing: attribute }
                struct-field gender  { indexing: attribute }
                struct-field job  { indexing: attribute }
            }

    Thanks for any help.

    2 replies
    Ijka Ep
    @IjkaE_twitter
    Hi, I've read on stackoverflow that there is no direct HTTP API for batch feeding. Is there any plan to support batch feeding without using the Java client?
    4 replies
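    For reference, a single-document feed over HTTP looks roughly like the sketch below; the namespace, document type, field and host are made up, and each document is its own request since there is no batch endpoint:

        curl -X POST -H "Content-Type: application/json" \
          --data '{"fields": {"title": "hello"}}' \
          'http://localhost:8080/document/v1/mynamespace/mydoctype/docid/doc-1'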
    Seongmin Kim
    @shieldnet

    Hello, does Vespa support linguistic analysis of the query string? I tried searching http://myurl/search/?query=今日天気 (today weather), but it seems to search for "today weather" as a single term, not "today" and "weather" separately.

    How can I configure Vespa to analyze (segment) the query string?

    16 replies
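    One thing worth checking, offered as an assumption rather than a confirmed fix: CJK segmentation depends on the query language Vespa detects, and the language can be hinted explicitly with the language query parameter:

        curl 'http://myurl/search/?query=今日天気&language=ja'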
    Seongmin Kim
    @shieldnet

    Hello, how can I set up OR n-gram search in the schema settings?
    I think the default setting is AND.

    When I search for hello with the n-gram (gram-size 2) option, the Vespa engine analyzes the query string to [AND he el ll lo].

    Can I change the setting to [OR he el ll lo]?

    25 replies
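    A hedged suggestion, assuming the standard query-type parameter also applies to gram terms: setting the query model type to "any" makes the terms optional (OR) instead of required (AND):

        curl 'http://localhost:8080/search/?query=hello&type=any'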
    Seongmin Kim
    @shieldnet

    Sorry for the frequent questions; for n-gram searching, am I using the wrong schema index/match settings?

    • field

          field title type string {
              indexing: index | summary
              index: enable-bm25
              match {
                  gram
                  gram-size: 2
              }
          }
      
          field body type string {
              indexing: index | summary
              index: enable-bm25
              match {
                  gram
                  gram-size: 2
              }
          }
    • field-set
      fieldset default {
          fields: title, body
          match {
              gram
              gram-size: 2
          }
      }
      I can't get any search results for query=hello even though documents containing hello explicitly exist, and I checked the index with vespa-index-inspect.
    6 replies
    John Dagdelen
    @jdagdelen
    I have some "group" operations that take a while to run if too many results match the query. Is there any way I can limit the number of results going to the group and get approximate numbers for those?
    I'm just using them to prepare counts for "top n" filters
    7 replies
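    A rough sketch of what capping the grouping output could look like; the attribute name and filter are made up, max() limits the number of groups returned, and the resulting counts may be approximate in a distributed setup:

        select * from sources * where userQuery() |
            all(group(category) max(10) order(-count()) each(output(count())));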
    Thomas Rose
    @thomasrose
    Hi all, firstly want to thank you for all the hard work that's been done on Vespa - it looks like an outstanding piece of software. We've got a company (ecommerce) hackathon tomorrow, and I'm keen to see what we'll be able to do with it and hopefully get some buy-in to pursue it further. I'm trying to get started with the Vespa Cloud trial but having trouble with the dev deployment - it keeps on timing out with Controller could not validate certificate within PT20M: Certificate not found in secret store. Is there something I'm missing here? Thanks in advance
    9 replies
    Ravindra H.
    @rharige
    hello! I am trying to build java artifacts after checking out vespa project master branch. I am using this command: ./bootstrap.sh && MAVEN_OPTS=" -XX:+TieredCompilation -XX:TieredStopAtLevel=1 " mvn -nsu -U -T 1C clean install -Dmaven.javadoc.skip=true -Dmaven.source.skip=true -DskipTests -Dmaven.test.skip=true
    My build is failing with the error: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project annotations: Compilation failure [ERROR] warnings found and -Werror specified
    I am using openjdk 11.0.10, Maven 3.6.3, MacOS 11.2.3
    can you help me fix the build error?
    27 replies
    Ravindra H.
    @rharige
    re: vespa-engine/vespa#7294 - to support the "onear" annotation, I think we need to extend the definition of UserInput such that it also accepts a "distance" param, in addition to the "grammar" param. My first guess is this change has to happen in the ANTLR grammar for the UserInput definition. However, I am not finding the definition of UserInput in any of the grammar files. Am I missing something here? If the DSL grammar definition of UserInput is not in ANTLR, is it parsed in code?
    5 replies
    Vincent Cheong
    @vinchg
    Hello! I noticed that when a synonym is matched, if a weight isn't specified, Vespa uses the default of 100 rather than the original matched item's weight. This is a bit hard to deal with as I don't wish to apply a fixed weight to synonyms when they are replaced. I don't believe an option to handle this exists, and if not, could it be added?
    6 replies
    Vincent Cheong
    @vinchg
    I had a second issue regarding a NearItem search. This issue occurs only when I'm searching "Kristen Stewart" with a NearItem(5) in a field that has many documents containing that exact string. This works for every other name I have tried as well as "Krist Stewart", but "Kristen Stewart" is unable to find a proper match. A tracelevel of 6 doesn't seem to reveal any stemming or tokenization issues, so I'm a bit lost here. Any help in tracking down what might be causing this issue would be much appreciated! Thanks in advance!
    6 replies
    Pontus Lundin
    @lundin

    Hi, I don't know if I am doing something wrong, but document-summary for an ordinary array of a given custom struct that does not use matched-elements-only (but full) does not get returned. Ordinary string and int fields, and matched elements, do. This is on Vespa Cloud 7.381.20.

    i.e. (all fields other than metadatatree are returned by the query when leaflevel is added as &presentation.summary=leaflevel)

    document-summary leaflevel {
        summary sku type string { source: sku }
        summary name type string { source: name }
        summary stock type string { source: stock }
        summary metadatatree type array<oriometadatainfo> {
            source: metadatatree
        }
    }
    (tried full and some other options)

    where type is:

    struct oriometadatainfo {
        field weight type string {}
        field length type string {}
        field height type string {}
        field ean type string {}
        field supppliercode type string {}
        field grossvolume type string {}
        field hazardousgoodsind type string {}
        field countryoforigin type string {}
        field smallpackqty type string {}
    }

    field metadatatree type array<oriometadatainfo> {
        indexing: summary
    }

    doing an ordinary yql without presentation.summary gives the field back ( the metadatatree):

    "fields":{"metadatatree":[{"grossvolume":"","hazardousgoodsind":"","ean":"3322938024017","countryoforigin":"","height":"11","supppliercode":"1774","weight":"1990","smallpackqty":"1","length":"20"}]}}]}}

    6 replies
    Ícaro Jerry
    @IcaroJerry

    Hi everyone,

    I would like to know if it is possible to create different sizes for the search snippets. For example, I want to display a snippet of size X on my website, but for clients of my API the snippet would be size Y.
    Currently I already have the size of my snippet configured in services.xml (configuration vespa.config.search.summary.juniperrc), but I am trying to make it "dynamic".

    Thanks!

    2 replies
    Ícaro Jerry
    @IcaroJerry

    Ah, another question...

    My application is sending a large number of documents to Vespa in parallel, and Vespa is returning HTTP status 507 (Insufficient Storage), but my cluster has 500 GB available (about 20% of the total disk space).

    Any idea what that could be?

    Thanks again! :)

    5 replies
    Kyle Rowan
    @karowan
    Hi, is there a native way to check if a query term lies in a list of ranges? For example, using the attached image as a baseline, I want attributeMatch(field) = 1 when A < term < B or C < term < D, and attributeMatch(field) = 0 otherwise. The use case I can see this being relevant for is checking the openness of an establishment throughout the week.
    4 replies
    [attachment: image.png]
    Lara Perinetti
    @PeriLara
    Hi all,
    I have encountered some issues with the Linguistics searcher, more precisely with the stemming strategy. If I use the default stemming (from SimpleLinguistics), I'll have to handle non-matching languages between documents and queries (every document in my index has one lang and one lang only, but the set of documents is multilingual, as are the queries).
    Let’s take an example:
    User’s language: French
    User’s query: redis
    User’s query after stemming: red
    Document A language: French, redis’s stem = red
    Document B language: English, redis’s stem = redis
    Document B will be filtered.
    To overcome this problem, I thought of using a Byte Pair Encoding method (the multilingual WordPieces method, used by some Language models such as BERT) which has the advantage of being language agnostic of the query / document.
    In practice there are some questions:
    What would be the strategy to feed the index?
    • Either we decide to feed all the wordpieces and let WAND do its magic
    • Or we decide to feed only the prefixes generated by the BPE (which would correspond to the stems) and do the same on the query side.
      Have you ever thought about these problems and this possible solution?
      The other solution that I had in mind was to use lemmatisation instead of stemming which does not have the advantage of being language agnostic but will have no effect on proper nouns (such as redis).
    Pontus Lundin
    @lundin

    Hi, I have one question. Would it not be great to have a boolean query in YQL with "must" and "should" operators, or can it be solved differently? I mean, in a searcher we can do more or less whatever we want to end up with, which is great, but simple YQL queries often end up filtering out too many documents that in a real-life scenario would still be considered a match (or, in terms of "ads", maybe not the ideal candidate given the query constraints, but still workable).

    For example, if there is no match for "criteria.id matches ("xxxx")" it simply means that there is no criteria text to be shown with this document when it is rendered on the webpage. However, due to the missing xxxx value the document is simply discarded as a match (so this would be better expressed as a "should" clause). To work around this, xxxx currently needs to be added with an empty text just to get a hit. Am I missing something?

    5 replies
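    One possible direction, sketched as an assumption rather than a confirmed answer: the YQL rank() operator retrieves on its first argument only, while the remaining arguments behave like "should" clauses that influence ranking without restricting recall. The title field below is made up; criteria.id is taken from the question:

        select * from sources * where rank(
            title contains "shoes",
            criteria.id matches "xxxx"
        );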
    Jacob Eisinger
    @jeisinge
    Howdy! I am super excited about Vespa! It is clear that I have much to learn. https://docs.vespa.ai/ appears to be a great resource. I am interested in exporting this as one HTML file so that I can more easily view it on my e-reader or print out portions for offline reading. Is there any existing export to single-page HTML functionality for the docs available?
    1 reply
    Ijka Ep
    @IjkaE_twitter
    Hi, is there a way to efficiently obtain 'part' of the relevance score? For example, I have a first-phase ranking expression of closeness(field, embedding) + nativeRank(text); it would be really useful if we could add the score of closeness(field, embedding) to the results without re-computing it.
    Jo Kristian Bergum
    @jobergum
    See summary-features in ranking doc @IjkaE_twitter
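    A minimal sketch of what that could look like in the rank profile, reusing the expression from the question; treat the profile name and exact feature list as assumptions:

        rank-profile my_profile {
            first-phase {
                expression: closeness(field, embedding) + nativeRank(text)
            }
            summary-features {
                closeness(field, embedding)
                nativeRank(text)
            }
        }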
    @PeriLara SimpleLinguistics is afaik only doing English and no stemming, and it is not the default linguistics; OpenNLP is. If the query language is guessed correctly it works, but if you want cross-language retrieval you can also consider parsing the query using the relevant languages
    Jo Kristian Bergum
    @jobergum
    By cross-language I mean that searching in French also retrieves English and French. You can expand the original term with stems from the set of valid languages (on the query side). But there is no straightforward way. BPE and the like can work, but then you need to add a new linguistics implementation (not that hard).
    Jo Kristian Bergum
    @jobergum
    Another alternative is to look at multilingual embedding models, map to a dense embedding space, and use ANN for retrieval and regular sparse terms for ranking signals (rank(nearestNeighbor(), terms)).
    Jo Kristian Bergum
    @jobergum
    @jeisinge sorry not aware of any such export tools.
    Jacob Eisinger
    @jeisinge
    @jobergum , thanks for checking! I am pretty sure someone who is more knowledgeable about Jekyll could hack something pretty fast. If I end up creating a script, I'll post it here.
    Div Dasani
    @divdasani
    Hi Vespa team! I'd like some insight on how to implement a particular use case: let's say I have two schemas. The first is a "user" schema containing fields user_id and user_embed. The second is an "item" schema similarly containing item_id and item_embed fields. At query time, I'd like to pass the user_id field, use it to retrieve the corresponding user_embed, and then get the top n similar documents by doing a dot product between user and item embeddings. In other words, I expect my rank-profile to look something like expression: sum(fn_retrieve_attr_by_id(user_embed, query(user_id)) * attribute(item_user_embed)). How can I do this? Thank you!
    7 replies
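    A rough sketch of the ranking side, assuming the user embedding is first resolved (for example in a custom Searcher or by the client via a Get on the user document) and then passed to the item query as a query tensor; the profile name, tensor names and dimensions are made up:

        rank-profile user_dotproduct {
            first-phase {
                # query(user_embed) must be declared as a tensor of the same
                # type as attribute(item_embed), e.g. via a query profile type
                expression: sum(query(user_embed) * attribute(item_embed))
            }
        }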
    renevall
    @renevall
    Hi all, quick question. I understand that the dev instance in Vespa cloud only lives for 1 week and it gets recycled automatically. Is there a chance we keep it permanently or even better recycle it on demand? Or alternatively, is it possible to create a staging env that is rather small? Thanks.
    7 replies
    Prateek Patel
    @prateekp:matrix.org [m]
    Hi all, I have a question. I want to understand, when trying ANN search, how does filtering work? Do we filter post-ANN, pre-ANN, or during the ANN search?
    Prateek Patel
    @prateekp:matrix.org [m]
    From some blog I read, the eligibility list is created by applying the filtering rules first, and then candidates are searched in the graph-based index while checking for eligibility during the graph traversal. But the parts I am not clear about are: 1) if the eligibility list is small, why do we even need ANN? 2) evaluating the filters to create the eligibility list is linear, while the whole point of HNSW was to be able to scale sublinearly?
    Jo Kristian Bergum
    @jobergum
    If the candidate list surviving the filter is small, brute force is faster. The query ANN operator will, because of this, fall back to brute force if the filter removes 95% of the corpus. Evaluation of filters is done over inverted indexes and B-trees and is sub-linear as well.
    2 replies
    adriabonetmrf
    @adriabonetmrf

    Hello, I've experienced an issue with some documents suddenly disappearing from query results whenever filtering by a predicate field; it happened at the same time for each of them. The problem was solved once I updated each document with a partial update on that predicate field, writing back the very same value it already had, which leads me to believe it may be related to some kind of indexing issue. Could any of you provide me with some insight or leads to ascertain what caused this, or at least which metrics/logs could help me track down the origin? I'm concerned it may happen again in the future.

    The field in question in the document schema:

     field show_condition type predicate {
        indexing: summary | attribute
        index {
            arity: 2
        }
    }

    Thanks.

    4 replies
    renevall
    @renevall
    Hello All, I wanted to ask if it's possible to use DD notation for position fields as in 34.056687222, -117.195731667 instead of N37.41638, W122.024683 notation.
    3 replies
    Audun Torp
    @auduntorp_gitlab

    Hi! We are currently using version 7.169.4 of Vespa, and I am trying to use the Simple Query Language for a filter on documents. However, I am having problems specifying multiple space-separated filters. Writing my filters in a different order yields different results:

    $ curl -H "Content-Type: application/json" --data '{"query": "published:true category:4605"}' 'http://172.18.0.2:8080/search/?queryProfile=content&tracelevel=3'
    ...
                "message": "Query parsed to: select * from content where (published contains \"true\" AND category contains \"4605\");"
    ...
    
    curl -H "Content-Tycurl -H "Content-Type: application/json" --data '{"query": "category:4605 published:true"}' 'http://172.18.0.2:8080/search/?queryProfile=content&tracelevel=3'
    ...
                "message": "Query parsed to: select * from content where category contains \"4605 published:true\";"
     ...

    What do I need to change in order for the space separator to be picked up? This is how the fields are defined:

            field category type int {
                indexing: summary | attribute
                match: exact
                rank: filter
            }
    
            field published type bool {
                indexing: summary | attribute
            }
    5 replies
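    A possible workaround, sketched under the assumption that switching from the simple query language to YQL is acceptable for this filter; the endpoint and query profile are taken from the question:

        $ curl -H "Content-Type: application/json" \
            --data '{"yql": "select * from content where published = true and category = 4605;"}' \
            'http://172.18.0.2:8080/search/?queryProfile=content&tracelevel=3'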
    Dipan Roy
    @107dipan
    I am trying to understand the Vespa architecture, and one thing I am struggling with is how buckets are stored on a content node. Since all the content nodes have a documentDB for storing the indexes and so on, how is the bucket stored in the document DB? Or is the bucket a logical way of picturing how documents are split up and stored, rather than an actual physical entity?
    8 replies
    João
    @joaopereiramrf
    Hi all,

    I have been using vespa and at the moment my setup is:
    3 searchers
    3 storage
    3 configservers

    Today I had some timeouts and I can't find out why.
    While I was trying to dig into this, I found some logs that I wasn't able to understand:

    Node05:
    [2021-04-08 12:01:48.908] WARNING : container Container.com.yahoo.search.cluster.BaseNodeMonitor Taking search node key = 0 hostname = node07.com path = 0 in group 0 statusIsKnown = true working = true activeDocs = 14904 out of service: Connection failure: 10: Backend communication error: Error response from rpc node connection to node07.com:19105: Request timed out after 0.98 seconds.
    [2021-04-08 12:01:49.788] WARNING : container Container.com.yahoo.search.dispatch.searchcluster.SearchCluster Coverage of group 0 is only 2/3 (requires 3) (30368/30368 active docs) Failed nodes are:\nsearch node key = 0 hostname = node07.com path = 0 in group 0 statusIsKnown = true working = false activeDocs = 0
    [2021-04-08 12:01:49.788] INFO : container Container.com.yahoo.search.cluster.BaseNodeMonitor Putting search node key = 0 hostname = node07.com path = 0 in group 0 statusIsKnown = true working = false activeDocs = 14904 in service: Responds correctly
    [2021-04-08 12:01:49.892] INFO : container Container.com.yahoo.messagebus.network.rpc.RPCTarget Method mbus.getVersion() failed for target 'tcp/node09.com:19111'; Connection error
    [2021-04-08 12:01:49.892] INFO : container Container.com.yahoo.search.dispatch.rpc.RpcPing Pong 80589 from node 0 in group 0 with hostname node07.com received too late, latest is 80590
    [2021-04-08 12:01:49.995] INFO : container Container.com.yahoo.messagebus.network.rpc.RPCTarget Method mbus.getVersion() failed for target 'tcp/node09.com:19111'; Connection error

    Node07:
    [2021-04-08 12:01:04.117] INFO : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.EventLog Added node only event: Event: storage.1: Node is no longer in slobrok, but we still have a pending state request.
    [2021-04-08 12:01:15.334] INFO : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.EventLog Added node only event: Event: distributor.0: Node got back into slobrok with same address as before: tcp/node09.com:19112
    [2021-04-08 12:01:16.737] INFO : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.EventLog Added node only event: Event: distributor.2: Node got back into slobrok with same address as before: tcp/node09.com:19112
    [2021-04-08 12:01:16.969] INFO : container-clustercontroller Container.com.yahoo.jrt.slobrok.api.Mirror Error when handling update from slobrok. Error: Request timed out after 40.0 seconds. (error code 103), target: Connection { Socket[addr=node05.com/178.63.95.118,port=19099,localport=38528] }
    [2021-04-08 12:01:25.895] WARNING : searchnode proton.slobrok.register slobrok.registerRpcServer(articles/search/cluster.articles/0/realtimecontroller -> tcp/node07.com:19105) failed: failed check using listNames callback

    What does "Node is no longer in slobrok, but we still have a pending state request" mean? Is this a connection error?

    Also in node05 I saw:
    [2021-04-08 12:01:55.790] WARNING : container Container.com.yahoo.search.dispatch.searchcluster.SearchCluster Coverage of group 0 is only 2/3 (requires 3) (30368/30368 active docs) Failed nodes are:\nsearch node key = 0 hostname = node07.com path = 0 in group 0 statusIsKnown = true working = false activeDocs = 0

    Does this mean that I always need to have all 3 nodes up to be able to query/insert data?

    Also have some issues with the vespa_exporter:
    https://github.com/vespa-engine/vespa_exporter

    This exporter monitors the distributors, searchers and containers with the metrics: vespa_distributor_status_code_up, vespa_searchnode_status_code_up and vespa_container_status_code_up.
    Is there any way to check the config servers?

    Sorry to flood you with so many questions, but I'm a beginner and I'm having some trouble understanding this.

    *I tried to post my services.xml but it made the post too big.
    9 replies
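    On the config-server question, a hedged pointer: the config servers expose the same state API as the other services on their own status port (19071 by default), so a health check could look like the line below; whether vespa_exporter can scrape it is an assumption:

        curl -s http://configserver-host:19071/state/v1/health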
    Dipan Roy
    @107dipan
    I have a few queries regarding grouped distribution
    https://docs.vespa.ai/documentation/performance/sizing-search.html
    • In the diagram only the ready DB is shown for grouped distribution. Does grouped distribution have only the ready sub-database?
    • If there are 4 groups and a redundancy of 3, will the data be stored in 3 of the groups?
    • If there are 4 groups, a redundancy of 3 and searchable-copies set to 2, how will the data be distributed? Will it be stored in 3 out of 4 groups, with 2 groups storing it in the ready DB (on one of the content nodes) and one group storing the doc in the not-ready DB?
    16 replies
    senecaso
    @senecaso
    I'm trying to write a small app that runs from an AWS lambda and periodically scans a selection of documents to perform an action. I'm pretty sure "visiting" is what I want to do here, but since I will be running from outside of the Vespa cluster, I have been looking at using VisitorSession to provide the access I need (rather than vespa-visit). The problem I'm having is that I can't figure out where I'm supposed to configure how to connect to the cluster (host, ip, port, etc). By default, DocumentAccess.createForNonContainer() appears to be attempting to connect to tcp/localhost:19090, which is expectedly timing out. It appears to be looking to connect to a service running on my admin node, but I have no idea how to tell it to stop using localhost and start using the proper host name. I suspect I may have to directly create an instance of MessageBusDocumentAccess, passing in some form of parameters to configure the host names (somehow), but I'm not really sure where to go. Is anyone aware of any examples of doing this?
    Jo Kristian Bergum
    @jobergum
    See https://docs.vespa.ai/documentation/reference/document-v1-api-reference.html for HTTP API visiting. The DocumentAccess API is only workable/usable from nodes which are enrolled in a Vespa cluster. @senecaso
    1 reply
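    A rough sketch of what visiting over /document/v1 looks like; the host, namespace, document type and cluster name below are placeholders, and the response carries a continuation token to pass back for the next chunk:

        curl 'http://container-host:8080/document/v1/mynamespace/mydoctype/docid?wantedDocumentCount=100&cluster=mycontentcluster'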
    Dipan Roy
    @107dipan
    Hi,
    I wanted to clarify a few things
    1. I wanted to understand how we can configure the number of distributor nodes in our content cluster. Or is there one distributor present on each content node?
    2. I came across the term partition in the documentation a few times, e.g. in bucket management - https://docs.vespa.ai/en/proton.html
       Is the partition referring to the subset of a document type that is contained in a bucket?
    3. When a write is done to the transaction log, does it simultaneously write to the attribute and document store, or is that flushed after some interval? If not, what happens in a
       situation where a write is done, the request is written to the transaction logs of each of the replica nodes, and the client gets a success response, but immediately there is a Get request or a search query?
    4 replies