    Jo Kristian Bergum
    @jobergum
    Define better? Summary data is stored in a blob, across fields and even across documents (chunks), compressed using zstd, with configurable level.
    senecaso
    @senecaso
    I guess better in terms of storage space, or throughput. If there is no overhead with storing individual fields as summary, then perhaps storing each field of my JSON as summary is the way to go. I'm trying to keep disk usage low so that I can keep as much of the mmap'd summaries in memory as possible. Is it possible to access the full JSON that Vespa already stores in the summary blob? If so, then I have no need for additional summary fields
    Jo Kristian Bergum
    @jobergum
    The input data is not stored as JSON directly; JSON is just the serialization format used when feeding data into Vespa
    senecaso
    @senecaso
    Ah. Is it available in any other format? I'm not really particular about the format so much as keeping disk usage down. For context, I am looking at building an index of ~400m documents, and the current data usage trends scare me a little, since the cost to serve is going to rise dramatically if I need to keep more of the mmap'd files in memory to serve a production load. I was anticipating that I should be able to store the 400m docs in ~200GB (across a couple of nodes), which would provide ample room for mapping, but the current trend is showing I would need ~380GB (without any summaries), and closer to 1TB when I enable the summaries. I'll need considerably more hardware to keep a meaningful portion of that mapped
    Jo Kristian Bergum
    @jobergum
    On mmap of the summary store: the default behaviour is to use directio when reading data from the summary store, with a certain percentage of the total memory set aside as the summary cache.
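    For reference, that behaviour can be tuned per content cluster in services.xml; the element names below come from the proton tuning reference and the cache size is only illustrative, so treat this as a sketch to verify against the docs:
            <tuning>
                <searchnode>
                    <summary>
                        <io>
                            <read>directio</read>
                        </io>
                        <store>
                            <cache>
                                <!-- illustrative size; by default a small percentage of memory is used -->
                                <maxsize>1073741824</maxsize>
                            </cache>
                        </store>
                    </summary>
                </searchnode>
            </tuning>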
    1 reply
    ~380GB (without any summaries), and closer to 1TB
    18 replies
    It's not going to be 2x; it's not redundant storage
    So to exemplify: if you have a field body type string { indexing: index }, adding summary to it will not cause the footprint to go 2x
    10 replies
    It's just that if you have indexing: index, we still need to store the data in case you add a new node or two so that you don't have to re-feed the data
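    To make that concrete, here is a minimal sketch of the two schema variants being compared (the field name body is just an example):
            # index only: the data is still stored once so nodes can be added without re-feeding
            field body type string {
                indexing: index
            }

            # index + summary: the field can also be returned in results,
            # but the stored data is not duplicated, so the footprint does not double
            field body type string {
                indexing: index | summary
            }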
    Jo Kristian Bergum
    @jobergum
    But generally there is always a tradeoff: 400M documents on a single node with 1TB storage might not be the optimal configuration, unless serving latency is within your SLA at that number of documents. It depends on the query complexity and ranking profile, and also on the number of threads used per search
    senecaso
    @senecaso
    I would need to keep query response times in the p95, 300ms range
    so, no, I don't anticipate keeping this all on a single node
    but I was hoping the index was small enough that I would need 2-3 nodes per group, rather than 40 (or whatever :))
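    Rough back-of-the-envelope from the numbers quoted above, assuming documents distribute evenly across a 3-node group:
            380 GB / 400M docs ≈ 0.95 KB per document (no summaries)
            1 TB   / 400M docs ≈ 2.5 KB per document (with summaries)
            3 nodes per group  ≈ 133M docs and roughly 127 GB to 333 GB per node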
    Jo Kristian Bergum
    @jobergum
    So sizing to the right hardware footprint needs to consider multiple variables, but you have already covered most of it. We try to summarize sizing here https://docs.vespa.ai/documentation/performance/sizing-search.html
    Without going too deep into the details, I do see we have some confusing documentation on the summary store and what is actually stored (e.g. index versus summary)
    Yeah, I see your point @senecaso. We have a cloud offering where you can also get professional services help with sizing your system at https://cloud.vespa.ai/ :)
    1 reply
    senecaso
    @senecaso
    Overall, I would say your docs are pretty good. They are just overwhelming, so it's going to take me (and other new users) a bit of time to digest all of the nuance along the way. I do have search experience, but there isn't a direct mapping from the tiny bit of knowledge I have of FAST, and my deeper knowledge of Solr/ES, to this :)
    Kyle Rowan
    @karowan
    Hi, is there a way to hook into the deletion of a document to emit an event for other services to consume? My use case is that I have a separate database which stores a user's likes, which correspond to specific documents, and if a document is deleted from Vespa there is currently no way for my other services to know that event has occurred. The only workaround I currently have is to periodically check each of the liked documents and delete the ones that no longer exist from my db, but this doesn't seem particularly scalable.
    2 replies
    Иван Сердюк
    @oceanfish81_twitter
    I am here, in case there is any interest in os72/protoc-jar-maven-plugin#100 / vespa-engine/vespa#15997
    1 reply
    jblankfeld
    @jblankfeld
    Hi Vespa team, I would like to ask if you have any experience with manual intervention to correct ranking defects caused by a non-optimal ranking function.
    I understand that it should be the other way around: a bad ranking should be annotated and then a new ranking model should be learnt from this ground truth.
    But in the context of production, it would take too long to run feature extraction, statistical training and validation of a new model to solve ranking defects, so I am thinking of doing a manual override in the index to temporarily fix scores: using maps in documents, with keys being a (query, ranking function) pair, and storing a raw delta score that would be added to the ranking function's output. Is this something that has come up in your experience? Thanks a lot :)
    7 replies
    Kleba Vadim
    @klebadev
    Hello! What is the name for the document if there is a schema name?
    1 reply
    xaviergimenezmrf
    @xaviergimenezmrf
    Hello Vespa Team! Is it possible to know the predicate field value during the ranking phase? I'm not able to get it and can't find anything in the documentation or sample applications. Thanks a lot!
    9 replies
    Vlad
    @vfil

    Hi Vespa! I am trying to spin up Vespa Cloud in production and I am having some difficulties. I am getting 'Installation of tester failed to complete within 30 minutes!' in the Vespa Cloud console. @kkraune pointed out the root cause is:

    Caused by: org.osgi.framework.BundleException: Unable to resolve vespa-app-test [67](R 67.0): missing requirement [vespa-app-test [67](R 67.0)] osgi.wiring.package; (&(osgi.wiring.package=clojure.lang)(version>=1.10.0)(!(version>=2.0.0))) Unresolved requirements: [[vespa-app-test [67](R 67.0)] osgi.wiring.package; (&(osgi.wiring.package=clojure.lang)(version>=1.10.0)(!(version>=2.0.0)))]
    In pom.xml I have:
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-dependency-plugin</artifactId>
                    <executions>
                        <execution>
                            <id>copy-dependencies</id>
                            <phase>prepare-package</phase>
                            <goals>
                                <goal>copy-dependencies</goal>
                            </goals>
                            <configuration>
                                <outputDirectory>${project.build.directory}/application/components</outputDirectory>
                                <includeArtifactIds>clojure.osgi</includeArtifactIds>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>

    But I am not entirely sure why this is happening in the test environment and not in dev. And in general, is there any guidance for using Vespa Cloud with other JVM languages?

    9 replies
    Иван Сердюк
    @oceanfish81_twitter
    Anyone using Vespa on Azure?
    1 reply
    Gil Cottle
    @redcape
    I have a use-case where I'd like to search across individual company spaces (many users per company, but any user can only search within one company) and I am thinking of using Vespa. In the case that a company wants to delete their data, we need to delete it fairly quickly (less than an hour ideally; 12 hours or 1 day would be too long). What would be the correct way to model this? Company spaces vary widely in size, but I'm trying to figure out whether I should be using groups based on an attribute, or creating separate application packages, or whatever buckets are... If there are 100K application packages, is that something Vespa can handle, or are groups the right way to allocate? If I go with groups, will the data still be partitioned across multiple content nodes when the data gets quite large for an individual group? Is there a way to delete all documents in a group relatively quickly, or at least make them eligible for cleanup? Any tips?
    14 replies
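    A sketch of one way to model this, under the assumption that a per-company attribute is used both for filtering queries and for selecting documents to delete; the field name company_id is hypothetical:
            # hypothetical per-company key, set on every document
            field company_id type string {
                indexing: attribute | summary
                attribute: fast-search
            }
    With that in place, every query can be restricted with a filter on company_id, and a company's documents can be removed either with a /document/v1 DELETE using a document selection on company_id (worth verifying that delete-by-selection is available in the Vespa version in use) or by making them eligible for background garbage collection via a selection in services.xml.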
    Julien Nioche
    @jnioche

    Hi, I am trying to have a simple app running on Vespa. I piggybacked the album recommendation one and modified the schema and changed the services file to point to the document type I need. When calling 'vespa deploy prepare' I am getting

    Invalid application package: default.default: Error loading model: Search definition parsing error or file does not exist: 'url'

    where url is the name of the document type I defined in the schema.
    Any suggestions?

    5 replies
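    For context, that error usually means the document type referenced in services.xml has no matching, parseable schema file; a minimal sketch of the piece that has to line up (the field is illustrative, and the directory is searchdefinitions/ or schemas/ depending on the Vespa version):
            # searchdefinitions/url.sd - the file name must match the schema name
            search url {
                document url {
                    field content type string {
                        indexing: summary | index
                    }
                }
            }
    The document element in services.xml then refers to the same name, i.e. a document of type url in index mode.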
    Иван Сердюк
    @oceanfish81_twitter

    vespa-engine/vespa#16124

    Another issue, on RISC-V

    Иван Сердюк
    @oceanfish81_twitter
    1 reply
    Gil Cottle
    @redcape
    What's the typical pattern when deploying a new set of base data? Let's say I have some data (like joins over a lot of data to produce a feature) that is difficult to reproduce online, and I want to deploy a fresh index every day or week. It's easy to build a new set of data, but it can be difficult to identify what needs to be deleted. What's the normal deployment pattern you've seen to deploy and clean up old data?
    3 replies
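    One pattern that matches this (an assumption, not something stated in the thread): stamp every document with the generation of the feed that produced it, then let garbage collection remove older generations via a document selection in services.xml. The field name generation, the document type and the constant are all illustrative:
            <!-- bump the generation constant on each full re-feed; documents from older feeds become eligible for removal -->
            <documents garbage-collection="true">
                <document type="people" mode="index" selection="people.generation >= 42"/>
            </documents>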
    Иван Сердюк
    @oceanfish81_twitter
    Where could I find information about how OpenBLAS is used by Vespa?
    6 replies
    Gil Cottle
    @redcape
    Are there any docs or references on how indexes/data are stored in the filesystem? I want to understand where/how documents, posting lists, B-trees, etc. are stored
    6 replies
    Gil Cottle
    @redcape

    I tried Vespa Cloud with a list of ~9M names, cities, counties, and other info, and Vespa has poor performance when grouping compared to ES. I suspect the answer lies in how many records it's going through. Here's the grouping part of the query I am using:
    {"hits": 0, "yql": "select * from people where sddocname contains \"people\" | all( all(group(firstName) max(10) order(-count()) each(output(count()))) ) ;", "cluster":"people","timeout":10,"tracelevel":1,"trace.timestamps":true}

    I tried messing around with adding various precision(1), precision(5), etc. values but it doesn't seem to change anything. Is there a way to influence the number of records read, at the cost of a worse approximation? I feel like I should be able to tune the aggregation to be less and less accurate, but also faster.

    Adding fast-search took it from 7s to 2s, while ES gets ~700ms. I know it's not apples to apples, but from the machine specs, Vespa Cloud is giving an 8G memory machine while ES got a 4G machine and still does better.

            field firstName type string {
                indexing: summary | index | attribute
                attribute: fast-search
            }
    2 replies
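    For reference on placement, max() limits the number of groups returned while precision() controls roughly how many intermediate results are kept when computing the ordering; the values below are illustrative, and whether tuning them trades accuracy for speed here is exactly what is being asked:
            all( group(firstName) max(10) precision(100) order(-count()) each(output(count())) )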
    Gil Cottle
    @redcape
    Not to say it's all bad: other query aspects had similar latency, ES significantly lags while loading data where Vespa doesn't, and the idea of not choosing shard sizes is very appealing. But I'm looking for a way to tune grouping for lower latency
    One comment I have is that the CLI tools are not very accessible for the cloud instance (or maybe I don't know where to get them). How can I use vespa-proton-cmd - do I need to compile vespa from source first? Am I supposed to use the docker instance and run commands from there?
    1 reply
    Kristian Aune
    @kkraune
    Please note that we are making a change to how to deploy to the dev environment: You must also provide the instance name, e.g. mvn clean package vespa:deploy -DapiKeyFile=USER.TENANTNAME.pem -Dinstance=my-instance. Earlier the username was used, but this has caused confusion, hence the new requirement. The instance name is used in the data-plane endpoint - check the dashboard at https://console.vespa.oath.cloud/ for current instances and reuse the instance name
    Julien Nioche
    @jnioche
    Hi, I am trying to index documents using the Vespa HTTP client. One thing I need to do is to create some of these documents only if they don't already exist. They must not overwrite an existing version. Other documents (the minority) will be allowed to overwrite.
    I see that there is a CREATE operator but it sounds like it applies to UPDATEs only, whereas I'd need it for PUTs. Using it on UPDATEs would create them if they are not present, but would also overwrite an existing document with the same ID, which is definitely not what I want.
    Am I understanding this correctly? Would I need to use a condition, and if so, what impact would that have on write performance?
    4 replies
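    For reference, this is roughly what a conditional (test-and-set) put looks like in the document JSON feed format; the id, condition and fields are illustrative, and whether a condition can express "only if the document does not exist yet" is exactly the open question here:
            {
                "put": "id:mynamespace:mydoctype::doc-1",
                "condition": "mydoctype.version < 2",
                "fields": {
                    "version": 2,
                    "content": "..."
                }
            }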
    John Dagdelen
    @jdagdelen
    I have a quick question. I set up a Vespa application and fed in my data, but it looks like I missed including an important field in my document-summary. Is it possible to update the application to include the field in the summary without blowing away the data already in my application?
    Docs seem to be a little broken right now, which is why I'm asking here without consulting them first.
    Jo Kristian Bergum
    @jobergum
    Yes, we are looking into the docs search issue
    thanks for reporting
    and no, you won't blow anything away: adding a new field to a custom document-summary is a live change :)
    John Dagdelen
    @jdagdelen
    So just to confirm, I can upload the new application.zip and then re-deploy?
    or do I need to do it some other way?
    Jo Kristian Bergum
    @jobergum
    Yes, prepare will tell you if you need to do anything
    Adding a field to a document-summary, or adding a new document-summary etc., is a live change
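    A sketch of what such a change looks like in the schema; the names are illustrative and the exact summary syntax should be checked against the docs for the Vespa version in use:
            document-summary my-summary {
                summary title type string { }
                # newly added field - picked up on redeploy without re-feeding
                summary important_field type string { }
            }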
    Frode Lundgren
    @frodelu
    Thanks for the heads up on the documentation issues, @jdagdelen . It is working now. Seems to have been a temporary glitch in GitHub’s Nginx config, but not long after we’d done a work-around, they also fixed the issue on their end. So we’re double good now! :-)
    John Dagdelen
    @jdagdelen
    No problem!
    Jo Kristian Bergum
    @jobergum
    I'm joining this meetup on Thursday to talk about Vespa: https://www.meetup.com/en-AU/Haystack-Search-Relevance-Online/events/275820872 (it's a live meetup, but there will also be a recording).