    Yongjoo Park
    @pyongjoo
    This means you can first create a JdbcConnection instance using SQLite's JDBC connection string, then supply that instance to create a VerdictContext.
    This is how it works internally for other databases; we simply haven't wired up this mechanism for SQLite.
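    A minimal sketch of that wiring (assuming the sqlite-jdbc driver is on the classpath; the database path is a placeholder, and exact constructor signatures may differ across VerdictDB versions):

import java.sql.Connection;
import java.sql.DriverManager;
import org.verdictdb.VerdictContext;
import org.verdictdb.connection.JdbcConnection;

public class SqliteVerdictSketch {
  public static void main(String[] args) throws Exception {
    // Open a plain JDBC connection using SQLite's connection string.
    Connection sqlite = DriverManager.getConnection("jdbc:sqlite:/path/to/tpch.db");
    // Wrap it in VerdictDB's JdbcConnection, then build a VerdictContext on top.
    JdbcConnection wrapped = new JdbcConnection(sqlite);
    VerdictContext verdict = new VerdictContext(wrapped);
    // Queries (including CREATE SCRAMBLE) can now be issued through verdict.
  }
}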
    solangepaz
    @solangepaz
    OK, thank you. The SQLite database will need to contain the TPC-H data, for example, right? And then the samples are stored in a table?
    Yongjoo Park
    @pyongjoo
    Yes
    Everything else will be the same.
    But VerdictDB does not offer any speedups for SQLite, because SQLite doesn't support physical partitioning.
    This means that if your goal is simply creating scramble tables, then what I described is the simplest way I can think of.
    If you want to experiment with speedup numbers for in-memory DBs, it's a different story.
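    For example, with the context from the sketch above, creating a scramble table could look like this (schema and table names are placeholders):

// Continuing the sketch: issue the scramble-creation statement through the
// VerdictContext. "tpch.lineitem" stands in for wherever the TPC-H data lives.
verdict.sql("CREATE SCRAMBLE tpch.lineitem_scramble FROM tpch.lineitem");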
    solangepaz
    @solangepaz
    OK, thank you. I will try to save the samples in SQLite.
    solangepaz
    @solangepaz
    I made this SQLite example, but I cannot access the samples.
    solangepaz
    @solangepaz
    I have this code from your tutorial. If I change the line that creates the connection from jdbc:verdict:mysql to SQLite, it gives an error. How can I do this?
    solangepaz
    @solangepaz
    I get this error:
    Exception in thread "main" java.lang.NullPointerException
    at org.verdictdb.commons.StringSplitter.splitOnSemicolon(StringSplitter.java:34)
    at org.verdictdb.connection.JdbcConnection.execute(JdbcConnection.java:167)
    at org.verdictdb.connection.JdbcConnection.executeQuery(JdbcConnection.java:219)
    at org.verdictdb.connection.JdbcConnection.getSchemas(JdbcConnection.java:235)
    at org.verdictdb.metastore.ScrambleMetaStore.retrieve(ScrambleMetaStore.java:386)
    at org.verdictdb.metastore.ScrambleMetaStore.retrieve(ScrambleMetaStore.java:372)
    at org.verdictdb.metastore.CachedScrambleMetaStore.refreshCache(CachedScrambleMetaStore.java:25)
    at org.verdictdb.VerdictContext.getCachedMetaStore(VerdictContext.java:86)
    at org.verdictdb.VerdictContext.<init>(VerdictContext.java:71)
    at Main.main(Main.java:23)
    solangepaz
    @solangepaz
    @pyongjoo, can you help me with this?
    Yongjoo Park
    @pyongjoo
    I'm traveling now, so I cannot look into this closely.
    Could you kindly post this on our GitHub issues page? I will ask someone else to take a look.
    solangepaz
    @solangepaz
    Yes, I already posted it on GitHub. Thanks.
    Michael
    @commercial-hippie
    Hi
    If I reduce the block size, will the results be less accurate?
    Michael
    @commercial-hippie

    Secondly, I was wondering: I see that when creating a scramble, verdict first grabs the total record count from the table, i.e.

    select count(*) as "verdictdbtotalcount" from "default"."test_table_orc"

    If there is a huge number of records in the table and you need to create scrambles bit by bit, it would be amazing if you could pass the count in as an option via the query so that verdict can skip calculating it.

    I might work on that and submit a pull request if I can figure out how to do it quickly enough.
    Yongjoo Park
    @pyongjoo
    Block size doesn't affect the accuracy.
    Too small a block size could increase overhead slightly, though.
    I like your suggestion. Is it because you want to build scrambles only for the data that has been newly appended?
    Michael
    @commercial-hippie
    That, but also for older tables that have too much data.
    Sometimes it's easier to create a scramble on a segment of a table and then append the rest to complete the scramble,
    especially to avoid Presto's max writers for partitions.
    And it would be nice to avoid having to do the initial count every time :)
    Michael
    @commercial-hippie

    When you create a scramble with a condition, verdict still grabs the count of the entire table, i.e.

    select count(*) as "verdictdbtotalcount" from "default"."test_table_orc"

    Shouldn't the count cover only the rows matching the conditions specified in the CREATE SCRAMBLE statement?
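    That is, one would expect the count to be scoped to the scramble's predicate, e.g. (the predicate below is a placeholder):

select count(*) as "verdictdbtotalcount" from "default"."test_table_orc" where <condition>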

    Sanjay Kumar
    @sanjay-94
    Adding on to @commercial-hippie: do the counts need to be calculated for the entire table even if I request a scramble for only one partition?
    Yongjoo Park
    @pyongjoo
    @commercial-hippie What you mentioned (that the count should cover only the subset) sounds very reasonable. I should check the code.
    Dan Voyce
    @voycey
    Hey guys, sorry, I only just made my way into this! As the guys above said, this is pretty crucial for us at the moment. Our data will reach 300B rows soon, and running a count across all of it is painful. We have been investigating ways around some of the bottlenecks, but removing unnecessary full table scans is definitely a must-have for us now :)
    Yongjoo Park
    @pyongjoo
    @voycey Thanks for the info. Let us look into it as well.
    Dan Voyce
    @voycey
    @pyongjoo Do you think that, with the amount of data we have, 1% would be enough? METHOD HASH HASHCOLUMN id SIZE 0.01?
    Each day currently takes 3 hours to build the scramble for, and we want to maintain 90 days.
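    Spelled out, the statement in question would look something like this (schema and table names are placeholders):

CREATE SCRAMBLE s.events_scramble FROM s.events METHOD HASH HASHCOLUMN id SIZE 0.01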
    Michael
    @commercial-hippie
    I was wondering this too: how low could we reasonably go while still maintaining a good level of accuracy?
    Dan Voyce
    @voycey
    @pyongjoo Also, I would like to know the reason for verdictdbblock = 100.
    Currently we are creating partitions based on this, which means a 100x increase in the number of partitions.
    We are currently at 1,500 per month (perfectly acceptable for Spark and most DBMSs);
    with the 100x, this takes us to 150,000, which is not so acceptable and creates huge overhead. Can this be solved in any way?
    Yongjoo Park
    @pyongjoo
    I guess your 3 hours is for creating a scramble with 50 (per day) * 100 = 5000 partitions?
    If so, we can do two things: (1) let you change the partitioning columns for a scramble, and (2) let you change the verdictdbblock value.
    I believe (1) will be more useful for now. Changing the verdictdbblock value may have a query latency impact, which we haven't measured in detail.
    Yongjoo Park
    @pyongjoo
    Assuming (1) is done, you can set the partitioning columns for a scramble to (date) only (without state); then, Verdict will effectively set the partitioning columns to (date, verdictdbblock), which will lead to far fewer partitions.
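    If (1) were exposed in the DDL, it might look something like the following (hypothetical syntax; no such option exists yet):

-- Hypothetical: restrict the scramble's user-specified partitioning columns to (date),
-- so the effective partitioning becomes (date, verdictdbblock).
CREATE SCRAMBLE s.events_scramble FROM s.events METHOD HASH HASHCOLUMN id SIZE 0.01 PARTITION BY (date)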
    Dan Voyce
    @voycey
    Sorry, we have been up to our eyes in it. I think @commercial-hippie made a few changes, but he will have to detail what he did. I think that if we can partition based on date, that is the easiest way. Fewer partitions sound good, although I am not sure what the consequences of this are down the track.
    Most of our queries, we think, are based on date and state; I'm not sure if we are setting ourselves up to fail further down the track...
    Dan Voyce
    @voycey
    @pyongjoo OK, so we definitely need a solution to this verdictdbblock thing. Currently, because it creates those as an extra partition, we have 90 x (52 x 100) = 468,000 partitions, which is obviously impossible to support on a single table.
    I think for the most part we should be able to just keep the date partitioning, which would give us 90 days x 100 verdictdb blocks = 9,000 partitions, which is manageable.
    Dan Voyce
    @voycey
    @pyongjoo If verdict creates the scrambles solely on the date, will this affect the return speed of the counts? We are in a "have your cake and eat it" scenario: we want to maintain the fast count return speed but also want to make these scrambles more efficient.
    (Currently we can't even create the scrambles: a 6 x 32-CPU / 208 GB memory cluster keeps giving out-of-memory errors in Presto while trying to build them.)