    Bruno Cabral
    @bratao
    Wow, awesome @jobergum
    Jo Kristian Bergum
    @jobergum
    New sample app released: Semantic Retrieval for Question-Answer Applications https://github.com/vespa-engine/sample-apps/tree/master/semantic-qa-retrieval
    jblankfeld
    @jblankfeld
    Hi friends,
    Regarding Visitor, is it possible to run a visitor outside of the cluster? I would like to dump a whole index to JSON format, but outside Vespa. Is it possible to do so using the code snippet here: https://docs.vespa.ai/documentation/document-api-guide.html#visitorsession? I tried to implement it with the VESPA_CONFIG_SOURCES environment variable pointing to port 19090 of one node, but I don't get any docs. Looks like I'm close though, because the progress is shown.
    Jon Bratseth
    @bratseth
    Yes, it's possible. If you get progress but no documents, perhaps you are using a document selection which doesn't match any documents?
    I would use the HTTP API to do this instead though, see under https://docs.vespa.ai/documentation/document-api.html#get
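    A minimal sketch of such a visit with curl (hostname, namespace and document type "mydoc" are hypothetical; pagination works by passing back the continuation token from each response):

        # Visit documents in batches of ~100; repeat, passing back the
        # "continuation" token from each response, until none is returned.
        curl -G "http://localhost:8080/document/v1/mydoc/mydoc/docid/" \
             --data-urlencode "wantedDocumentCount=100"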
    Vlad
    @vfil
    @jblankfeld I also tried this and had a similar issue. What is interesting is that if you pack your jar and run it inside the Docker container, it works.
    Have you tried overriding VisitorControlHandler.onVisitorError(String message)? Maybe you get some info.
    jblankfeld
    @jblankfeld
    Hi @vfil, thanks for the suggestion, but I get nothing in this method.
    Hi @bratseth, thank you, I missed this part of the documentation; this is exactly what I was looking for. And with the concurrency parameter, this would integrate well with Apache Spark, for example.
    Luke Miller
    @lukealexmiller
    Hi, I've noticed that on a "cold" cluster (no queries for some time - haven't determined the timescale here) queries can take 3-4x longer than on a "warm" (repeated identical queries) cluster. In fact, it seems like queries have the shortest latency after 2-3 identical queries are put to Vespa. This sounds like caching, and I've found documentation of a summary cache (https://docs.vespa.ai/documentation/performance/caches-in-vespa.html); could this (or the lack of using it) explain such behaviour? In our case, all fields returned in hits are indexed as attributes in the search definition, so disk I/O wouldn't seem to be an issue. Came across this while investigating (spiky) query latency. Thanks in advance!
    Jo Kristian Bergum
    @jobergum
    Hey @lukealexmiller. If your system is cold as in just restarted, the container (JVM) might need a few queries to warm up: compile code, load classes and so on (see https://docs.vespa.ai/documentation/performance/container-tuning.html). On the content nodes, if you search in attributes (note that if the field is both index|attribute, the index and its match mode are used), there should be no cold-start problem, but the summary fetch might touch disk (if not already in the mentioned summary cache). See https://docs.vespa.ai/documentation/document-summaries.html
    If you have a small cluster, fetch many hits (summary hits), and run on an IO subsystem with limited supported IOPS, you might want to consider using a dedicated document summary where the fields returned in that summary are attribute fields only.
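    For example (a hypothetical sketch; the field and summary-class names are made up), a dedicated summary class containing only attribute fields can be declared in the search definition and requested at query time with &summary=attributeonly:

        search example {
            document example {
                field title type string {
                    indexing: summary | index
                }
                field price type int {
                    indexing: summary | attribute
                }
            }
            # Filled from memory only, since price is an attribute
            document-summary attributeonly {
                summary price type int {
                    source: price
                }
            }
        }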
    Eddie Ng
    @wardsng

    Hello! I’m trying to do an incremental UPDATE (via HTTP PUT /document/v1/cluster/sd/docid/1) to a position field latLong like so:

        "fields": {
            "latLong": {
                "assign": "N37.401;W121.996"
            }
        },
        "create": true
    }

    and it’s returning HTTP 200. However, if I then try to do a GET /document/v1/cluster/sd/docid/1, latLong doesn’t appear to be there; it’s not in the JSON response. Incremental updates of other field types seem to behave as expected (i.e. a subsequent GET shows the updated field value).

    However, when doing a PUT (via HTTP POST) like so:

        "fields": {
            "latLong": "N37.401;W121.996"
        }
    }

    Doing a GET in this case does show latLong in the document:
    "latLong": { "x": -121996000, "y": 37401000 },
    It looks like an incremental update of latLong will now work, but not when latLong never existed previously on the document. Is this a bug or is there a way to ensure a position field is created-if-not-exist during an incremental update?

    jblankfeld
    @jblankfeld
    Hi, I'm trying to achieve high throughput while dumping a content cluster to disk. I'm using Spark Streaming over the Document Operation API with a selection for cluster partitioning (id.user.hash().abs() % n = partitionid), plus the continuation and concurrency parameters.
    I have 70 nodes and I am using 20 partitions with 60 visitors for Vespa concurrency. With this setting, I manage to get 6000 docs/s over all partitions. Any idea how to get better throughput? Also, the cluster is mostly idle during the process, so I guess the visit is not very intensive CPU-wise, even with 60 visitors.
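    For reference, one partition's visit request in the scheme described above might look like this with curl (host, namespace and document type are hypothetical; selection, concurrency and wantedDocumentCount are the /document/v1 parameters mentioned):

        # Partition 0 of 20; each Spark task visits one slice of the id space
        curl -G "http://node0:8080/document/v1/ns/doctype/docid/" \
             --data-urlencode "selection=id.user.hash().abs() % 20 = 0" \
             --data-urlencode "concurrency=3" \
             --data-urlencode "wantedDocumentCount=1024"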
    Jo Kristian Bergum
    @jobergum
    @wardsng This looks like a bug IF the document operation was routed with the default route. If it was routed directly to the content nodes, bypassing the 'indexing' chain, the create condition might trigger this behaviour, as the latLong parsing happens in the indexing chain. If you don't mind, please open an issue at https://github.com/vespa-engine/vespa/issues
    Tor Brede Vekterli
    @vekterli
    @jblankfeld how much disk IO load do you observe on the nodes? That particular selection is likely to have to read document IDs from disk to see if the full document should be returned, even for the visitors that should not return the documents at all. Also, are you using a custom client or vespa-visit?
    Jo Kristian Bergum
    @jobergum
    @vekterli based on the description above, I think this is the HTTP (/document/v1) API
    jblankfeld
    @jblankfeld
    @vekterli Hi, yes, this is the HTTP API. But yeah, I think the bottleneck is IO too. When I try to increase the number of partitions (over 30), the cluster struggles to give me results at all, because each visitor has to skip many documents. Is it possible to partition documents by node? For example, I guess it would be more efficient to have one visitor per node, fetching all documents on that particular node, and achieve parallelism from the nodes themselves instead of from document partitions. Would that be wrong?
    Tor Brede Vekterli
    @vekterli
    Nodes are (intentionally) abstracted away from the APIs, so there is presently no way to explicitly state which nodes a visitor should go towards. It sounds like the feature you need is something like vespa-engine/vespa#5055? I think what's likely to be happening here is that since all the started visitors have no constraints on the data they should scan in the backend, they all end up scanning through the same parts of the data space at the same time, causing massive load on a subset of the space and minimal load on the rest.
    Jo Kristian Bergum
    @jobergum
    But aren't you using a user/group oriented document schema? id.user.hash() seems to indicate that. And what is really the use case?
    jblankfeld
    @jblankfeld
    Hi @jobergum Sorry, I made a mistake: I meant id.hash().abs(), and no, I'm not using the user/group schema.
    Thanks for pointing me to the issue; this is what I am talking about, yes.
    Jo Kristian Bergum
    @jobergum
    @jblankfeld is this a serving use case? Meaning, you want to traverse the document collection with very high throughput / low latency for some query/selection logic? Or is it a backup or something else?
    jblankfeld
    @jblankfeld
    This is mostly for backup reasons, but also for offline analysis of the documents for re-updating. For example, I would like to perform updates on fields whose value is shared by many documents. The Vespa API does not permit doing this directly, so it would be easier to dump the index to HDFS, perform the updates on the dumps using Spark joins, and re-feed. We are heavily using Spark and Elasticsearch in our organisation, and I'd like to mimic some of the functionality the elasticsearch-hadoop library provides (like parallel reading).
    Eddie Ng
    @wardsng
    @jobergum thanks, I've created vespa-engine/vespa#11208
    ddorian
    @ddorian
    @vespa-team
    Does it make sense to be able to "route" by "groupname" even in non-streaming mode?
    Assuming you know there will be few buckets per groupname (either small data, or big buckets), it would lower the number of network requests a lot.
    Jo Kristian Bergum
    @jobergum
    There are a few use cases where that would work, but the dispatch layer doing scatter-gather does not currently support such a hybrid scheme between streaming and index modes.
    Jonathan Thomson
    @jpthomson
    Hi Vespa team. Is there a simple way to return null from a ranking function? The use case is that we want a summary feature to be null (not zero) based on some conditional logic.
    Jon Bratseth
    @bratseth
    You can use NaN. There is also an isNaN()
    Luke Miller
    @lukealexmiller
    I seem to get "ERROR: rank profile : FAIL\nWARNING: invalid rank feature: 'NaN' (unknown basename: 'NaN')" when I try this
    Jon Bratseth
    @bratseth
    You want to literally write it? Not sure we have a way to do that; try 0/0?
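    A sketch of how that could look in a rank profile (the field names are hypothetical, and depending on Vespa version the keyword is function or the older macro); 0/0 evaluates to NaN when the fallback branch is taken:

        rank-profile example {
            # Emit NaN instead of a score when no score is available
            function maybe_score() {
                expression: if (attribute(has_score) == 1, attribute(score), 0/0)
            }
            summary-features {
                maybe_score
            }
        }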
    Marcel Neuhausler
    @marcelneu_gitlab
    Dear Vespa team .. I assume the DELETE operation with a "condition" is the correct and most efficient way to clean out large amounts of outdated entries/documents in Vespa?
    Jo Kristian Bergum
    @jobergum
    @marcelneu_gitlab There is a section on batch deletes here: https://docs.vespa.ai/documentation/writing-to-vespa.html, doing it through garbage collection. See https://docs.vespa.ai/documentation/reference/services-content.html#documents. Quote: "If true, regularly verify the documents stored in the cluster to see if they belong in the cluster, and delete them if not. If false, garbage collection is not run."
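    A sketch of what that could look like in services.xml (the document type and the selection are made up); documents that no longer match the selection are removed in the background:

        <documents garbage-collection="true">
            <!-- Keep only documents with a timestamp newer than 30 days -->
            <document type="mydoc" mode="index"
                      selection="mydoc.timestamp &gt; now() - 2592000"/>
        </documents>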
    zhaakhi
    @zhaakhi
    Is there a practical difference between calling Searcher.fill() and Searcher.ensureFilled()? The fill() doc says "Calling this on already filled results has no cost."
    Tor Brede Vekterli
    @vekterli
    @marcelneu_gitlab the condition given for a delete operation is not an SQL-style predicate over the entire document set, but rather a test-and-set evaluation for the single document instance specified as part of the request. When used with e.g. a per-document sequence number it's useful for avoiding races between multiple concurrent clients.
    Jon Bratseth
    @bratseth
    @zhaakhi no practical difference. (fill is the hook that Searcher subclasses implement.)
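    A minimal sketch of the relationship (the searcher itself is hypothetical): ensureFilled() checks whether the result already has summary data and only then invokes the fill() hook, so it is safe to call unconditionally:

        import com.yahoo.search.Query;
        import com.yahoo.search.Result;
        import com.yahoo.search.Searcher;
        import com.yahoo.search.searchchain.Execution;

        // Hypothetical searcher that needs summary fields downstream
        public class SummaryReadingSearcher extends Searcher {

            @Override
            public Result search(Query query, Execution execution) {
                Result result = execution.search(query);
                // Make sure the "default" summary class is filled before
                // reading hit fields; no cost if already filled.
                ensureFilled(result, "default", execution);
                return result;
            }
        }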
    ddorian
    @ddorian
    Is there a way, with bolding, to return only the positions where words match and which words/phrases matched?
    That way you could get the text field from another DB and implement the bolding there.
    Jon Bratseth
    @bratseth
    No, there's no way to return all the match positions of each term.
    ddorian
    @ddorian
    Should I open a GitHub issue? Does it make sense as a feature request, in your opinion?
    Jon Bratseth
    @bratseth
    Yes, feel free to open an issue. It makes sense, but I wouldn't assign it a high priority, since it facilitates making something complex that might have data synchronization issues.
    Marcel Neuhausler
    @marcelneu_gitlab
    @jobergum @vekterli thanks for the clarification .. like the "garbage collection" approach/idea
    jblankfeld
    @jblankfeld
    Hi guys,
    A question regarding DocumentDB. I'm looking at the filesystem here: /vespa/var/db/vespa/search/cluster.mycluster/n2/documents/mycluster, and 45% of the disk space lies in the 2.notready folder. I have a redundancy of 2 and searchable copies of 1. My assumption is that the not-ready DB is the set of replicated documents that are not searchable. Does that make sense?
    Tor Brede Vekterli
    @vekterli
    @jblankfeld your assumption is correct. Documents under "not ready" will be automatically promoted to ready (i.e. indexed) if changes in the system call for it, such as other nodes going down etc.
    oexpc
    @oexpc_twitter

    Hi. I'm trying to set up Special tokens as described on the Query Rewriting page. I copied that XML snippet to my services.xml as is, deployed the application, and re-fed the documents, but it doesn't seem to be working.

    vespa-get-config -n vespa.configdefinition.specialtokens shows that tokenlist has been loaded, but queries like select * from sources * where name contains \"c++\" limit 1; return documents with just the letter c in the name field.

    The field name has indexing: summary | index in its definition.

    vespa-index-inspect dumpwords --indexdir /opt/vespa/var/db/vespa/search/cluster.topics/n0/documents/topic/0.ready/index/index.flush.1/ --field name | grep 'c++' returns nothing, and vespa-index-inspect showpostings --indexdir /opt/vespa/var/db/vespa/search/cluster.topics/n0/documents/topic/0.ready/index/index.flush.1/ --field name 'c++' says Unknown word c++.

    I suppose I'm missing something obvious. What else should I check?

    Jon Bratseth
    @bratseth
    Oops. If I remember correctly, special tokens are implemented inside our in-house tokenizer, but not in the default one plugged into the open source distro. I haven't gotten around to that, sorry ...
    I think we need to implement it at that layer, i.e. integrated with the tokenization process. It needs to consume the special-token config, build an efficient lookup structure (for example, our FSA library) and try it for every character while tokenizing.
    If you need it and can't do it yourself and submit it back, I'll try to get to it, but it won't be before some time next month.
    oexpc
    @oexpc_twitter
    @bratseth, oh, that's alright. I'll try to get into it and see what I can do.
    Jo Kristian Bergum
    @jobergum
    Wow, impressed by your debug skills demonstrated here @oexpc_twitter
    marianaaaaa
    @marianaaaaa
    Hello,
    I have a couple of questions regarding the internals of Vespa ranking; I would appreciate it very much if you could give me some more insight.
    The first question is regarding nativeRank (https://docs.vespa.ai/documentation/reference/nativerank.html).
    From a couple of use cases tested manually, I got that the nativeRank of several indexed fields is the same as the average of those fields' nativeRanks separately, i.e. nativerank(a,b,c) = (nativerank(a)+nativerank(b)+nativerank(c))/3. Is this always true, or are there exceptions or conditions that need to be validated so that it holds?
    The second question is regarding second-phase ranking. Let's say I have a Vespa cluster with 10 nodes and a rank-profile with second-phase ranking and rerank-count: 2. In theory, from what I understood, each node would re-rank its best 2 documents (considering first-phase relevances) and return the relevance calculated by the second phase for those documents. However, when I query with a limit bigger than 20 (number of nodes * rerank-count), I do get more than 20 documents retrieved. What is the logic behind this? Which documents are re-ranked in this case?
    Thank you for your attention.
    Jo Kristian Bergum
    @jobergum
    1) Not true in general; you probably have the same field weights and a single-term query. 2) Vespa re-ranks the top 2 documents from the first phase using the second-phase function, so in this case 20 documents (10x2) are re-ranked. If you ask for 30 hits with the &hits parameter, you start seeing hits in positions 20 to 29 which have not been re-ranked; their relevancy scores are squashed so that the 20 which got re-ranked are always ordered before the ones which only got a first-phase score. Usually the total re-rank count (nodes x rerank-count) is higher than the number of summary hits.
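    A sketch of the profile described above (the second-phase expression and the quality attribute are made up):

        rank-profile rerank-example {
            first-phase {
                expression: nativeRank(a, b, c)
            }
            second-phase {
                # Only the top 2 hits per content node are evaluated here
                expression: nativeRank(a, b, c) * attribute(quality)
                rerank-count: 2
            }
        }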