    João
    @joaopereiramrf

    Hello,

    We have updated our Vespa to version 7.419.22 and since then it seems we have lost some logs.
    I can see that there is a problem with logd, but I'm a bit lost as to what I can do to fix it. The error is the following:

    [2021-06-21 16:14:36.123] INFO : metricsproxy-container Container.ai.vespa.metricsproxy.service Unable to parse json 'java.io.ByteArrayInputStream@3249949a' for service 'logd:353:RUNNING:hosts/node09.mrf.io/logd': NullPointerException

    This happens on both distributor and storage nodes. Any ideas?

    1 reply
    Øyvind Krosby
    @zoyvind

    Hello,
    I am wondering if you can help me debug an error message that showed up after a restart.

    configproxy.com.yahoo.vespa.filedistribution.FileReferenceDownloader info Request failed. Req: request filedistribution.serveFile(cae785d50766a91d,0)\nSpec: tcp/vespa-content-config-1.c.zedge-prod.internal:19070, error code: 103, set error for connection and use another for next request

    Is it possible to tell which file it's trying to download from the configserver? This error is also logged on the configserver side, so I rule out network issues.
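
    One way to narrow this down (a sketch; assuming the default Vespa home /opt/vespa, and using the file reference from the log above): downloaded file references are stored in directories named after the reference, so listing that directory shows which file it maps to.

    ls /opt/vespa/var/db/vespa/filedistribution/cae785d50766a91d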

    11 replies
    Audun Torp
    @auduntorp_gitlab

    I think I'm misunderstanding how weightedset<string> can be used to give different weights to different terms.
    I have these document fields and ranking profile for content documents:

    search content {
        document content {
        ...
            field exact_search_queries type weightedset<string> {
                indexing: attribute
                weightedset {
                    create-if-nonexistent
                }
            }
    
            field search_queries type weightedset<string> {
                indexing: index
                weightedset {
                    create-if-nonexistent
                }
            }
        ...
        }

        rank-profile search_boost inherits default {
            weight exact_search_queries: 1000
            weight search_queries: 500
            rank-type exact_search_queries: tags
            rank-type search_queries: tags
    
            first-phase {
                expression {
                    nativeRank
                }
            }
        }
        ...
    }

    with this test data:

      {
        "put": "id:recommendations:content::1st-by-emilio",
        "fields": {
          "profile_name": "emilio",
          "title": "1st-by-emilio",
          "exact_search_queries": {
            "by:emilio": 2
          },
          "search_queries": {
            "by:emilio": 2
          }
        }
      },
      {
        "put": "id:recommendations:content::4th-by-emilio",
        "fields": {
          "profile_name": "emilio",
          "title": "4th-by-emilio",
          "exact_search_queries": {
            "by:emilio": 4
          },
          "search_queries": {
            "by:emilio": 4
          }
        }
      },
      {
        "put": "id:recommendations:content::7th-by-emilio",
        "fields": {
          "profile_name": "emilio",
          "title": "7th-by-emilio"
        }
      },
      {
        "put": "id:recommendations:content::8th-with-same-ts-by-emilio",
        "fields": {
          "profile_uuid": "68cc0959-bf89-497a-9f6b-ab2ab110f413",
          "profile_name": "emilio",
          "exact_search_queries": {
            "by:emilio": 1
          },
          "search_queries": {
            "by:emilio": 1
          },
          "title": "8th-with-same-ts-by-emilio"
        }
      }

    As you can see, the documents have different weights. Then I apply a filter like this:

    {
                "queryProfile": "content",
                "filter": "+exact_search_queries:\"by:emilio\"",
                "query": "exact_search_queries:\"by:emilio\"",
                "ranking": "search_boost",
    }

    And my result becomes these items all with the same relevance:

    titles: ['8th-with-same-ts-by-emilio', '4th-by-emilio', '1st-by-emilio']
    relevance: ['0.1301362833940882', '0.1301362833940882', '0.1301362833940882']

    I would expect the documents with higher weight in the weighted set to get higher relevance. Why is it not so?

    Jo Kristian Bergum
    @jobergum
    nativeRank does not use the weights in the weightedset.
    So you have both attribute (which supports exact/word matching, not text matching) and index, which gives text-based matching. I'm not sure if you have a plan for this, but searching the attribute without fast-search is going to be a linear scan over the documents, like a database column without an index. The ranking features you want are elementCompleteness for indexed fields and attributeMatch for attributes, see https://docs.vespa.ai/documentation/reference/rank-features.html
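
    A minimal sketch of how that advice might map onto the schema above (field and profile names are from the question; the expression and the 1000/500 factors are illustrative assumptions, not a recommended tuning):

    field exact_search_queries type weightedset<string> {
        indexing: attribute
        attribute: fast-search
        weightedset {
            create-if-nonexistent
        }
    }

    rank-profile search_boost inherits default {
        first-phase {
            # attributeMatch uses the weightedset weights of the attribute field;
            # elementCompleteness covers the indexed field
            expression: 1000 * attributeMatch(exact_search_queries) + 500 * elementCompleteness(search_queries).completeness
        }
    }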
    5 replies
    AleksanderDrzewiecki
    @AleksanderDrzewiecki

    Hi! I just started to look into Vespa and pyvespa.
    I have been struggling with feed blocks when running locally with one node on Docker.

    ReturnCode(NO_SPACE, External feed is blocked due to resource exhaustion: disk on node 0 [vespa-container] (0.959 > 0.800)) '}

    Is there any way to see the disk utilization of the node? I cannot seem to find it...

    Sorry to bother you with simple questions :)

    Jon Bratseth
    @bratseth
    There's a metrics API that can give you that: https://docs.vespa.ai/en/reference/metrics.html#metrics-api
    But apparently you're at 95.9% utilization, so you should just increase the disk size of your Docker container.
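
    For reference, a quick spot-check might look like this (a sketch; assuming the default metrics-proxy port 19092):

    # Node and service metrics; disk utilization shows up as
    # content.proton.resource_usage.disk (a fraction, 0.959 in the error above)
    curl -s http://localhost:19092/metrics/v2/values | grep -i disk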
    4 replies
    Audun Torp
    @auduntorp_gitlab
    How do I order by relevance in YQL? I want to combine it with a tie-breaker. I have tried order by [relevance] desc and order by relevance() desc without luck.
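
    For context: hits are ordered by relevance by default, and a YQL order by replaces rank ordering rather than combining with it. One way to get a tie-breaker is therefore to fold it into the rank expression instead (a sketch; created_at is a hypothetical numeric attribute used as the tie-breaker):

    rank-profile with_tiebreak inherits default {
        first-phase {
            # a tiny attribute-based term breaks ties between otherwise equal scores
            expression: nativeRank + 0.000000001 * attribute(created_at)
        }
    }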
    2 replies
    Pontus Lundin
    @lundin
    Hello guys, a quick question. If I have multiple languages to support, I can give the field an array<item> where item contains a locale field, query on that, and match only the given element, such as item.locale contains("en"). But my question is: if I go with separate fields like name_en, name_se, name_no etc., I would like to alias all of these into the "same" field name in the response, using YQL like "select name_se as name where xxxxxx". This way the struct always looks the same and the render template only references name once. Or do I need to fix this in a Searcher: execute the YQL request and rename the name_xx field found in the hits? Basically I am asking (and I don't think YQL supports it) about aliasing with an "as" keyword after the field name in the select.
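
    For the first alternative, matching only the given element would look roughly like this (a sketch; the field name names and the struct fields locale and text are assumptions based on the question, and "shoes" is a placeholder term):

    select * from sources * where names contains sameElement(locale contains "en", text contains "shoes");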
    5 replies
    Vincent Cheong
    @vinchg

    Hey guys, I've been attempting to use diversity within match-phase and have had trouble getting it to function. I would like my top 10 hits to be "diversified" by their item_ids.

                match-phase {
                    attribute: datetime
                    order: descending
                    max-hits: 200
                    diversity {
                        attribute: item_id
                        min-groups: 10
                    }
                }

    Where the fields are:

            field datetime type long {
                indexing: summary | attribute
                attribute: fast-search
            }
    
            field item_id type string {
                indexing: summary | attribute
            }

    When listing 10 hits, I'm still getting duplicate item_ids within those hits. This is the total count for that search:

                "fields": {
                    "totalCount": 1300
                },
                "coverage": {
                    "coverage": 0,
                    "documents": 3722,
                    "degraded": {
                        "match-phase": true,
                        "timeout": false,
                        "adaptive-timeout": false,
                        "non-ideal-state": false
                    },
                    "full": false,
                    "nodes": 5,
                    "results": 1,
                    "resultsFull": 0
                }

    From the documentation, it's a little unclear to me how to use diversity (what should I set min-groups to?) and what kind of results to expect.
    Could I be misunderstanding the purpose of diversity in the match-phase? Any clarification would be much appreciated!
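
    One note, hedged since it goes beyond the thread: match-phase diversity only ensures the limited match set (max-hits) contains at least min-groups distinct values of the diversity attribute; it does not dedupe the final result list. A grouping-based dedup on top might look like this (a sketch; assuming one hit per item_id is wanted):

    select * from sources * where userQuery() | all(group(item_id) max(10) each(max(1) each(output(summary()))));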

    5 replies
    yashkasat96
    @yashkasat96
    Hi, is it possible to install Vespa on RHEL 7? And if so, can anyone suggest which AMI in the AWS Marketplace can be used?
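
    On the first part, a sketch of how installation on CentOS 7 (which should behave like RHEL 7) has typically been done via the Vespa COPR repo; verify the current repo URL against the docs before relying on it:

    sudo yum-config-manager --add-repo \
        https://copr.fedorainfracloud.org/coprs/g/vespa/vespa/repo/epel-7/group_vespa-vespa-epel-7.repo
    sudo yum install -y vespa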
    Ícaro Jerry
    @IcaroJerry

    Hi everyone,

    I have a two-node cluster in production. The nodes hold a lot of data and I need to add one more node to improve the situation.
    But when I add the third node, it is not recognized until I deploy the application on that third node using itself as the value of the variable VESPA_CONFIGSERVERS.
    Then I set VESPA_CONFIGSERVERS back to the main node and redeploy. Apparently the cluster then works with the three nodes.
    However, the data is not redistributed among them (is this the default behavior, or should it be distributed automatically?).

    I use the vespa-get-node-state and vespa-get-cluster-state commands to check that everything is fine with the cluster. Is there a better way?

    Below are some settings I use:

    <admin version="2.0">
      <adminserver hostalias="node0"/>
      <configservers>
        <configserver hostalias="node0"/>
      </configservers>
      <logserver hostalias="node0"/>
    </admin>

    <container>
    ...
      <nodes>
        <node hostalias="node0"/>
        <node hostalias="node1"/>
        <node hostalias="node2"/>
      </nodes>
    </container>

    <content>
    ...
      <nodes>
        <node distribution-key="0" hostalias="node0"/>
        <node distribution-key="1" hostalias="node1"/>
        <node distribution-key="2" hostalias="node2"/>
      </nodes>
      ...
      <redundancy>1</redundancy>
      ...
      <searchable-copies>1</searchable-copies>
      ...
    </content>

    Thanks
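
    On the monitoring question: besides vespa-get-node-state and vespa-get-cluster-state, the cluster controller exposes a status page with per-node state (a sketch; assuming the default cluster controller status port 19050):

    curl -s http://localhost:19050/clustercontroller-status/v1/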

    7 replies
    Greg Kavanagh
    @sirganya
    Hi, I've been dealing with a Vespa instance that fell over after running for a month. I was wondering under what circumstances the existing data gets overwritten. For instance, if I reinstalled using yum, would that delete the existing docs? Is there a way to copy the docs before I start hacking around?
    7 replies
    ranbole
    @ranbole
    Hi there, can anyone give some recommendations/guidance on storing and searching JSON data in Vespa? I'm looking for something similar to how Postgres and MySQL handle it. Thanks :)
    2 replies
    Simen
    @enemis:matrix.org
    [m]

    Hi,
    I have a document that contains a lot of key-value pairs and I can't figure out how to search/filter on it. The schema and example data look like this:

    schema item {
        document item {
            field id type string {
                indexing: attribute | summary
            }
            field attributes type map<string, string> {
                indexing: summary
                struct-field key { indexing: attribute }
            }
        }
    }

    {'pathId': '/document/v1/item/item/docid/13589',
     'id': 'id:item:item::13589',
     'fields': {'attributes': {'key1': 'val1',
                               'key2': 'val2'}}}

    How can I make a query based on this so that I can show, e.g.:
    1) all items where key1 = "val1"
    2) any item that has key1 as a key
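
    A sketch of what those queries might look like with sameElement; note that matching on map values would additionally require struct-field value { indexing: attribute } in the schema, which the snippet above only declares for key:

    select * from sources * where attributes contains sameElement(key contains "key1", value contains "val1");

    select * from sources * where attributes contains sameElement(key contains "key1");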

    3 replies
    Eloghosa Ikponmwoba
    @elotech47
    Hello everyone, I kindly need some help. I'm trying out Vespa for a search application (this is my first time). I get this error when I try to feed data to my application:
    Detail resultType=FATAL_ERROR exception='ReturnCode(NO_SPACE, External feed is blocked due to resource exhaustion: disk on node 0 [vespa-search] (0.842 > 0.800))' endpoint=localhost:8080 ssl=false
    I have tried freeing more space on my PC, and I also tried reducing the feed to just 10 data points, but I still get the same error. I would appreciate some assistance.
    Eloghosa Ikponmwoba
    @elotech47
    It seems my system disk space was the problem. It's working now after freeing up about 20G of space.
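
    For local experiments it is also possible to raise the feed-block limit in services.xml instead of freeing space (a sketch; 0.90 is an arbitrary example, the default disk limit being the 0.800 in the error above):

    <content id="vespa-search" version="1.0">
      <tuning>
        <resource-limits>
          <!-- fraction of disk that may be used before external feed is blocked -->
          <disk>0.90</disk>
        </resource-limits>
      </tuning>
      ...
    </content>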
    Bruno
    @bratao

    Hey everyone. I recently inherited a Vespa cluster to manage. So far everything has been perfect (although it was some pain to set up in k8s). Congratulations on the documentation and the support for adding nodes in real time. But there is a message that appears very frequently in the log, and I don't know what to do about it:

    searchnode    proton.groupingmanager    error    Could not locate attribute for grouping number 0 : Failed locating attribute vector 'tipos_juridico.label'. Ignoring grouping 'search::aggregation::Grouping {\n    id: 0\n    valid: true\n    all: false\n    topN: -1\n    firstLevel: 0\n    lastLevel: 1\n    levels: std::vector {\n        [0]: search::aggregation::GroupingLevel {\n            maxGroups: 11\n            precision: 100\n            classify: search::expression::ExpressionTree {\n                root: search::expression::AttributeNode {\n                    attributeName: 'tipos_juridico.label'\n                }\n            }\n            collect: search::aggregation::Group {\n                id: <NULL>\n                rank: 0\n                orderBy: [] {\n                    size: 1\n                    [0]: -1\n                }\n                aggregationresults: [] {\n                    size: 1\n                    [0]: search::aggregation::CountAggregationResult {\n                        expression: search::expression::ExpressionTree {\n                            root: search::expression::ConstantNode {\n                                Value: search::expression::Int64ResultNode {\n                                    value: 0\n                                }\n                            }\n                        }\n                        count: search::expression::Int64ResultNode {\n                            value: 0\n                        }\n                    }\n                }\n                expressionResults: [] {\n                    size: 1\n                    [0]: search::expression::AggregationRefNode {\n                        index: 0\n                    }\n                }\n                children: [] {\n                    size: 0\n                }\n                tag: 2\n            }\n        }\n    }\n    root: search::aggregation::Group {\n        id: <NULL>\n        rank: 0\n        orderBy: [] {\n            size: 0\n        }\n        aggregationresults: [] {\n            size: 0\n        }\n        expressionResults: [] {\n            size: 0\n        }\n        children: [] {\n            size: 0\n        }\n        tag: 1\n    }\n}\n'

    What can I do? Should I reindex it?

    Jo Kristian Bergum
    @jobergum
    The above is from a query request trying to group on a field which is not an attribute (tipos..label), maybe because you have many document types and are not restricting to the right document type?
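
    A sketch of what that restriction might look like (your_doctype is a placeholder; the grouping expression is taken from the log above). The request parameter model.restrict limits the query to the document type that actually has the attribute:

    /search/?yql=select * from sources * where userQuery() | all(group(tipos_juridico.label) max(11) each(output(count())));&model.restrict=your_doctype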
    Bruno
    @bratao
    So it's a query problem. What I find weird is that the query executes without any warning.
    Jon Bratseth
    @bratseth
    Yes, you are right - it should be validated earlier (also, the content node should have emitted this in the response instead of in the log). I’m creating GitHub issues on this.
    Bogdan Snisar
    @bsnisar

    Hello guys,
    one question about the thing called file distribution: in production we hit a situation where one node fails to download an ML model and crashes (even a restart can't fix this state).

    vespa-status-filedistribution reports this:

    $ ./vespa-status-filedistribution --tenant tgs
    File distribution in progress:
    db-vespa-tgs1-1.42.....net: FINISHED
    db-vespa-tgs1-2.42.....net: FINISHED
    db-vespa-tgs1-3.42.....net: FINISHED
    db-vespa-tgs1-rank-1.42.....net: IN_PROGRESS (0 of 1 finished)
    db-vespa-tgs1-rank-2.42.....net: FINISHED
    db-vespa-tgs1-rank-3.42.....net: FINISHED

    And the cluster can't converge and finish distribution. It succeeded only after the next package had been activated and all services restarted (2 times).

    Could someone explain what actually happens during file distribution, and whether it can get stuck because of a big package (we use 2 ONNX models, 600 MB each)?

    5 replies
    Chris Nell
    @raincoastchris

    A pretty minor question, but... the default value of rank-score-drop-limit is defined as -Double.MAX_VALUE. Is there a constant that can be used from inside a ranking expression to reference this value?

    Context: I want to specify an expression like if(condition, <some ML model output>, -Double.MAX_VALUE) in a first-phase with a rank-score-drop-limit, where I don't have guarantees on the range of possible outputs of the ML model (but can assume they will be above -Double.MAX_VALUE).

    In practice I'm defining a constant of my own and using that, but it's a bit ugly having a 312-character float inlined there. (On a related note, the docs don't seem to define the syntax for numeric literals in search definitions; evidently scientific notation is not supported.)

    1 reply
    senecaso
    @senecaso

    I have a simple string field, defined as:

            field description type string {
                indexing: summary
            }

    but if I try to index the value abc % def, it chokes on the % and fails with:

    URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: &quot; d&quot;</pre></p>

    I can't seem to find any information on escape characters in Vespa. I see the Text utility class, but I don't see anything in there about escape characters. Is there a document somewhere that outlines how to handle the % character in strings?
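
    For what it's worth: feeding the value in a JSON body via POST to /document/v1 should not require any escaping of %; it only needs percent-encoding (%25) when it appears in a URL path or query string. A sketch, with namespace, doctype and docid as placeholders:

    curl -X POST -H "Content-Type: application/json" \
        --data '{"fields": {"description": "abc % def"}}' \
        "http://localhost:8080/document/v1/mynamespace/mydoctype/docid/1"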

    1 reply
    Ijka Ep
    @IjkaE_twitter
    Hi! Is there any preference for the file system (XFS vs ext4) when deploying Vespa content nodes?
    2 replies
    Simen
    @enemis:matrix.org
    [m]
    I am trying to feed data points to a document type using pyvespa, but after 5-6 tries I get the following error:
    MaxRetryError: HTTPConnectionPool(host='10.35.246.86', port=80): Max retries exceeded with url: /document/v1/user/user/docid/3436768?create=true (Caused by ResponseError('too many 502 error responses'))
    • It does not seem to be consistent with which data point is fed, and it happens both for data_point and feed_batch.
    • I can't see anything occurring in the vespa logs on either the content or the main node.
    • For a different schema, I can feed a full 8k entries without issues, but for this schema it fails after 5-6 docs (the schema file is exactly the same except for the name).
    1 reply
    Simen
    @enemis:matrix.org
    [m]

    it responds, fills other schemas, makes queries etc. on port 80; also, the "vespa" service is set to:

        port: 80
        targetPort: 8080

    I've also done this when connecting, so it returns a 200:

        if self.vespa.get_application_status().status_code != 200:
            logging.warning("Vespa application is NOT RUNNING.")
    3 replies
    I've also tried to delete all data in the schema and then re-fill it; it doesn't seem to help.
    Simen
    @enemis:matrix.org
    [m]
    Uhm, I renamed the schema + document to "users" instead of "user" and then it worked 😕 Maybe I have something weird in that original config? Is it possible to delete full schemas in the database so I can try with the name I want?
    2 replies
    Simen
    @enemis:matrix.org
    [m]
    I see. It was more about whether I could delete the full schema; the delete link you sent deletes individual data points. Is it just a matter of deleting the schema file on the master node and redeploying?
    2 replies
    Simen
    @enemis:matrix.org
    [m]
    I ended up deleting the whole cluster + PVCs. Not ideal, but it works well as long as I'm in experimental mode. Thanks for the help :)
    Jo Kristian Bergum
    @jobergum
    I don't think there is anything special with a user document schema name. We have several apps using user as schema name, for example news recommendation https://github.com/vespa-engine/sample-apps/tree/master/news/app-6-recommendation-with-searchers/src/main/application/schemas
    Simen
    @enemis:matrix.org
    [m]
    I think what happened was that I changed the schema of "user" too much over time, in some way making something incompatible. But it was very hard to debug when all I could get was a 502 Bad Gateway on my request. Anyway, it's working now :)
    Simen
    @enemis:matrix.org
    [m]

    I have one more thing I don't really understand. I want to extract all items in the schema "user". If I use the following body, I get the correct number of items (150k), but they are stripped of any useful information:

    {'yql': '\n        select * from sources user \n        where sddocname contains "user";\n        ',
     'hits': 10000000,
     'maxHits': 10000000}
    
    ----
    
    {'id': 'index:user/2/0d95af70743e004667435355',
     'relevance': 0.0,
     'source': 'user'}

    On the other hand, if I do a query where I know the ID, I get the correct item (this ID is not part of the 150k in the first query):

    {'yql': '\n        select * from sources user \n        where id contains "666";\n        ',
     'hits': 10000000,
     'maxHits': 10000000}
    
    --- 
    
    {'id': 'id:user:user::666',
     'relevance': 0.0,
     'source': 'user',
     'fields': {'sddocname': 'user',
      'documentid': 'id:user:user::666',
      'id': '666',
      'attributes': ['list','of','strings']}}

    Do you have any idea what is going on here?
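
    As an aside, for dumping a whole schema the /document/v1 visit API is usually a better fit than one big query, since it pages through all documents with a continuation token (a sketch; assuming default ports and the user namespace/doctype from above):

    curl "http://localhost:8080/document/v1/user/user/docid?wantedDocumentCount=100"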

    3 replies
    Ijka Ep
    @IjkaE_twitter
    Hi, I've read in the blog post Approximate Nearest Neighbor Search in Vespa — Part 1 that throughput is halved when the corpus size is increased by 10x. Does that mean we should start more processes to split the corpus across more content nodes? For example, should I run 4 content node instances on a machine with 2048 GB of memory (512 GB for each container), or run a single instance with all the available hardware resources? Thanks!
    17 replies
    Eloghosa Ikponmwoba
    @elotech47
    Hi, I have a question. I created a Vespa application and fed some data. I fed 200,000 documents into the application and it worked successfully. After some time, I loaded another 200,000 documents; they fed successfully, but when I run a query it outputs this error:
    {'root': {'id': 'toplevel',
      'relevance': 1.0,
      'fields': {'totalCount': 0},
      'coverage': {'coverage': 100,
       'documents': 0,
       'full': True,
       'nodes': 0,
       'results': 1,
       'resultsFull': 1},
      'errors': [{'code': 10,
       'summary': 'Backend communication error',
       'source': 'app_reviews-search',
       'message': 'Connection failure on nodes with distribution-keys: 0'}]}}
    1 reply
    Ijka Ep
    @IjkaE_twitter
    Excited to see Vespa now supporting lower-resolution types, thanks for the great work! The "performance considerations" section in the tensor guide notes that using bfloat16 for tensor types might come with computation overhead and bypass ranking optimizations; does that apply to HNSW indexing as well? It seems that memory bandwidth is not a concern for HNSW, so I guess the only benefit of using bfloat16 with HNSW is memory saving, at the cost of (potential) performance degradation?
    Jo Kristian Bergum
    @jobergum
    It's also true for HNSW distance calculations.
    And yes, your summary is correct 👍
    Jonathan Thomson
    @jpthomson

    Hey guys.

    Quick question on scoping properties to sources in federated searches. We have a federation that looks like this:
    -> shop
    -> merchandising -> shop

    We would like to set a property on only the top-level shop source, and have it not apply to the nested shop source. It doesn't seem that setting properties for nested sources is supported via something like &source.merchandising.shop.myProp=. Is there a way to do this?

    Thanks

    2 replies
    Pontus Lundin
    @lundin

    Hi, a stupid question: in Vespa Cloud prod, what is the application-test package? I tried uploading the ordinary application.zip that includes my XML files (deployment.xml with zone location, services.xml with node config) as well as my Java component built with clean install package -X.

    I have an application-test.zip in the target dir after compiling, but uploading that together with the application above says build-meta.json is missing.

    I would like to submit the prod application via the console before trying git/cmd.

    13 replies
    Vincent Cheong
    @vinchg

    Hello, I currently have consistency issues when jointly inserting documents. The document is inserted containing a field with several child documents that have been stringified. A document processor processes that field and adds the items in the form of DocumentPuts to the async session. The goal is that the base document should only be inserted when valid child documents are inserted.

    The issue is that in a small percentage of cases, the base document gets inserted without any of the child documents. I have validated the data and there are no issues with it. If we re-feed the same document, it correctly populates the base and child documents.

    Do you guys know what could be happening in this scenario?

    1 reply
    yashkasat96
    @yashkasat96
    I have set up Vespa with 3 nodes (8 CPU and 32 GB RAM each), all acting as config server, container and content node. Redundancy and searchable copies are kept at 1.
    I have added around 16 million (1.6 crore) documents to the cluster using:
    java -jar /opt/vespa/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file <jsonfile> --host localhost --port 8080
    I ran the above command on one of the nodes.
    The data ingestion completed successfully with 0% errors.
    But when I search using a YQL query, it shows:
    "coverage":{"coverage":80,"documents":13069981, "degraded":{"match-phase":false,"timeout":true,"adaptive-timeout":false,"non-ideal-state":false},"full":false,"nodes":3,"results":1,"resultsFull":0}
    It's been more than 8 hours since the data ingestion and the documents are still not all active.
    Can anyone help me figure out where the problem is?
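
    One observation, hedged since it goes beyond the thread: the degraded block above says "timeout": true, i.e. the coverage loss comes from nodes not answering within the query timeout, not necessarily from missing documents. A quick experiment is raising the timeout request parameter (a sketch; mydoc is a placeholder document type):

    /search/?yql=select * from sources * where sddocname contains "mydoc";&timeout=20s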
    10 replies
    Eloghosa Ikponmwoba
    @elotech47
    Hi everyone. I am trying to deploy my Vespa app created with pyvespa to Vespa Cloud. I am not clear about the "key" required. I used the application key given in the Vespa console, but I get this error: ValueError: Could not deserialize key data. The data may be in an incorrect format or it may be encrypted with an unsupported algorithm. I would appreciate it if someone could point me in the right direction.
    1 reply
    Bruno
    @bratao

    Hello everyone, I'm debugging high IO usage in a Vespa cluster. It has a very large index (2 TB). Apparently the disk usage peaks during compaction due to disk bloat.
    proton.searchlib.docstore.logdatastore info Disk bloat is now at 449514580375 of 1497699577489 at 30.01 percent. Will compact
    proton.searchlib.docstore.logdatastore info Done compacting. Disk bloat is now at 448518522454 of 1496703954822 at 29.97 percent
    But the compaction only reduces the bloat to a value very close to the maxbloat limit (30%). The compaction takes under a minute, but it happens very frequently.

    Is it possible to configure it to take longer, so that it does not need to run that often?

    7 replies
    Dave McQueen
    @dmcqueen_twitter
    Hi everyone, new Vespa user here. I'm running the Docker Hub image in a Kubernetes environment, and I'm noticing that when I query the db with a new query term, the first request always returns empty; results only start coming back from the second query on. Did I miss something in the Docker image configuration? Why might this be happening?
    1 reply
    Petter Ekrann
    @petterek
    Hello, I'm trying to find out if Vespa is something for me and my team.
    I'm looking for a search engine that handles a relatively small number of documents, ~100k per customer.
    I need it to be "multi-tenant" (can that be solved with namespaces? see the sketch after this list),
    and I want to run it in k8s together with everything else.
    • Is Vespa for me?
    • Is Vespa overkill? I have been looking at ES and that seems even more overkill.
    • Are there any meetups/courses/consultants that we can hire?
    • The learning curve looks fairly steep to me; can anyone point me to a starter app just for searching documents?
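
    On the namespace question, hedged since it goes beyond the thread: the namespace part of a document ID (as in id:customer-a:doc::123) is free-form but not indexed for searching, so it gives no query-time isolation by itself. A common pattern is an explicit tenant field filtered in every query (a sketch; tenant_id is a hypothetical field):

    field tenant_id type string {
        indexing: attribute | summary
        attribute: fast-search
    }

    select * from sources * where tenant_id contains "customer-a" and userQuery();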
    11 replies