    Yongjoo Park
    @pyongjoo
    Too small a block size could increase overhead slightly, though.
    I like your suggestion. Is it because you want to build scrambles only for the data that have been newly appended?
    Michael
    @commercial-hippie
    That, but also for older tables that have too much data.
    Sometimes it's easier to create a scramble on a segment table and then append the rest to complete the scramble,
    especially to avoid Presto's max writers for partitions,
    and it would be nice to avoid having to do the initial count every time :)
    Michael
    @commercial-hippie

    When you create a scramble with a condition, Verdict still grabs the count on the entire table, i.e.:

    select count(*) as "verdictdbtotalcount" from "default"."test_table_orc"

    Shouldn't the count be only for the conditions specified in the create scramble statement?
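    For illustration, if the scramble were created with a condition such as datestr = '2019-01-01' (a made-up predicate and column name), one would expect the count to be scoped to that predicate instead, along these lines:

        select count(*) as "verdictdbtotalcount"
        from "default"."test_table_orc"
        where datestr = '2019-01-01'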

    Sanjay Kumar
    @sanjay-94
    Adding on to @commercial-hippie: do the counts need to be calculated for the entire table even if I request a scramble for only one partition?
    Yongjoo Park
    @pyongjoo
    @commercial-hippie What you mentioned--the count should only be for the subset--sounds very reasonable. I should check the code.
    Dan Voyce
    @voycey
    Hey guys - sorry, only just made my way into this! As the guys above said, this is pretty crucial to us at the moment - our data is reaching 300B rows soon and running a count across all of it is painful. We have been investigating ways around some of the bottlenecks, but removing unnecessary full table scans is definitely a must-have for us now :)
    Yongjoo Park
    @pyongjoo
    @voycey Thanks for the info. Let us look into it as well.
    Dan Voyce
    @voycey
    @pyongjoo Do you think with the amount of data we have 1% would be enough? METHOD HASH HASHCOLUMN id SIZE 0.01?
    Each day's scramble currently takes 3 hours to build - and we want to maintain 90 days.
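    For reference, the full statement being discussed would look roughly like this (the schema and scramble-table names are placeholders; the METHOD/HASHCOLUMN/SIZE options are the ones mentioned above):

        CREATE SCRAMBLE mydb.test_table_orc_scramble
        FROM mydb.test_table_orc
        METHOD hash HASHCOLUMN id SIZE 0.01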
    Michael
    @commercial-hippie
    I was wondering this too - how low could we reasonably go while still maintaining a good level of accuracy?
    Dan Voyce
    @voycey
    @pyongjoo Also I would like to know the reason for the verdictdbblock = 100
    Currently we are creating partitions based on this - this means 100x the number of partitions.
    We are currently at 1,500 per month (perfectly acceptable for Spark and most DBMSs),
    but with the 100 this takes us to 150,000, which is not so acceptable and creates a huge overhead. Can this be solved in any way?
    Yongjoo Park
    @pyongjoo
    I guess your 3 hours is for creating a scramble with 50 (per day) * 100 = 5000 partitions?
    If so, we can do two things: (1) let you change the partitioning columns for a scramble, and (2) let you change the verdictdbblock value.
    I believe (1) will be more useful for now. Changing the verdictdbblock value may have a query latency impact, which we haven't measured in detail.
    Yongjoo Park
    @pyongjoo
    Assuming (1) is done, you can set the partitioning columns for a scramble as (date) only (without state); then, Verdict will effectively set the partitioning columns as (date, verdictdbblock), which will lead to fewer partitions.
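    As a rough sketch of what that would look like physically (Presto/Hive-style DDL; the catalog, schema, and non-partition column names here are only placeholders), the scramble table would then carry just the date column plus the block column as partition keys:

        -- hypothetical columns; the point is the two partition keys
        CREATE TABLE hive.mydb.test_table_orc_scramble (
          id bigint,
          state varchar,
          datestr varchar,
          verdictdbblock integer
        )
        WITH (partitioned_by = ARRAY['datestr', 'verdictdbblock'])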
    Dan Voyce
    @voycey
    Sorry, we have been up to our eyes in it - I think @commercial-hippie made a few changes, but he will have to detail what he did. I think if we can partition based on date, that is the easiest way - fewer partitions sound good, although I am not sure what the consequences of this are down the track.
    We think most of our queries are based on date and state - I'm not sure if we are setting ourselves up to fail further down the track...
    Dan Voyce
    @voycey
    @pyongjoo - OK, so we definitely need a solution to this verdictdbblock thing - because it is created as an extra partition column, we currently have 90 x (52 x 100) = 468,000 partitions, which is obviously impossible to support on a single table.
    I think for the most part we should be able to just keep the date partitioning, which would give us 90 days x 100 verdictdbblock values = 9,000 partitions, which is manageable.
    Dan Voyce
    @voycey
    @pyongjoo If Verdict partitions the scrambles solely on the date, will this affect the return speed of the counts? We are in a "have your cake and eat it" scenario where we want to maintain the fast count return speed but also want to make these scrambles more efficient.
    (Currently we can't even create the scrambles - a 6 x 32-CPU / 208 GB memory cluster keeps giving out-of-memory errors in Presto whilst trying to build them.)
    Michael
    @commercial-hippie
    @pyongjoo I would just like to confirm whether we can clone a scramble table and its data as another table with the same schema but different partitions, and then just duplicate the scramble record in the verdictdbmeta table with the new table name. So the only things we really change are the partitions and the table name.
    The metadata will still be valid for the new table, right?
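    A very rough sketch of that idea, assuming Presto's Hive connector (catalog, schema, and column names are placeholders):

        -- Clone the scramble data into a new table with a different partition layout.
        -- The Hive connector expects the partition columns last in the select list.
        CREATE TABLE hive.mydb.test_table_orc_scramble_v2
        WITH (partitioned_by = ARRAY['datestr', 'verdictdbblock'])
        AS
        SELECT id, state, datestr, verdictdbblock
        FROM hive.mydb.test_table_orc_scramble;

        -- The scramble's row in verdictdbmeta would then be duplicated with the new
        -- table name, assuming nothing in the metadata references the physical
        -- partitions (which is exactly the question above).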
    Dan Voyce
    @voycey
    ^ We basically need to know if there are any references to the partitions within the actual data
    Michael
    @commercial-hippie
    This is because we want to take our current scramble and clone it to create some tests, i.e.:
    Dan Voyce
    @voycey
    Re-running these scrambles isn't feasible (they take days to complete), so we want to ETL the data using Spark directly on the ORC files.
    Michael
    @commercial-hippie
    1. Partition date/state/verdictdbmeta
    2. Partition date/verdictdbmeta
    3. Possibly mixing partitions and buckets
    etc.
    just to play around with the structure without having to recreate everything (see the sketch below).
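    For example (reading the verdictdbmeta above as the verdictdbblock partition column), the three layouts would just be different table properties plugged into a CREATE TABLE ... AS clone like the one sketched earlier; the column names are placeholders:

        -- 1. partition by date, state, and the block column
        WITH (partitioned_by = ARRAY['datestr', 'state', 'verdictdbblock'])

        -- 2. partition by date and the block column only
        WITH (partitioned_by = ARRAY['datestr', 'verdictdbblock'])

        -- 3. partition by date, bucket on the block column
        WITH (partitioned_by = ARRAY['datestr'],
              bucketed_by = ARRAY['verdictdbblock'],
              bucket_count = 100)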
    Michael
    @commercial-hippie

    Hi!

    I've compiled the latest commit on the master branch (which includes this: https://github.com/mozafari/verdictdb/pull/380/files ).

    When I add AND regexp_like('1a 2b 14m', '\\d+b') to a query of mine I get:

    Error running instance method
    java.lang.RuntimeException: syntax error occurred:no viable alternative at input 'regexp_like('1a 2b 14m', '\\d+b'))'
        at org.verdictdb.sqlreader.VerdictDBErrorListener.syntaxError(VerdictDBErrorListener.java:35)
        at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:65)
        at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:564)
        at org.antlr.v4.runtime.DefaultErrorStrategy.reportNoViableAlternative(DefaultErrorStrategy.java:308)
        at org.antlr.v4.runtime.DefaultErrorStrategy.reportError(DefaultErrorStrategy.java:145)
        at org.verdictdb.parser.VerdictSQLParser.predicate(VerdictSQLParser.java:5708)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_not(VerdictSQLParser.java:5263)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_or(VerdictSQLParser.java:5200)
        at org.verdictdb.parser.VerdictSQLParser.search_condition(VerdictSQLParser.java:5150)
        at org.verdictdb.parser.VerdictSQLParser.predicate(VerdictSQLParser.java:5500)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_not(VerdictSQLParser.java:5263)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_or(VerdictSQLParser.java:5200)
        at org.verdictdb.parser.VerdictSQLParser.search_condition(VerdictSQLParser.java:5140)
        at org.verdictdb.parser.VerdictSQLParser.query_specification(VerdictSQLParser.java:6041)
        at org.verdictdb.parser.VerdictSQLParser.query_expression(VerdictSQLParser.java:5753)
        at org.verdictdb.parser.VerdictSQLParser.select_statement(VerdictSQLParser.java:1836)
        at org.verdictdb.parser.VerdictSQLParser.verdict_statement(VerdictSQLParser.java:578)
        at org.verdictdb.coordinator.ExecutionContext.identifyQueryType(ExecutionContext.java:796)
        at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:152)
        at org.verdictdb.jdbc41.VerdictStatement.execute(VerdictStatement.java:107)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
    I'm not sure if I did something wrong. Would it be possible to get a new release on GitHub so that I can download the compiled jar from there?
    Yongjoo Park
    @pyongjoo
    @voycey @commercial-hippie Somehow I haven't got any notifications from this channel. Sorry for that. I'm checking messages starting from older ones.
    The syntax error itself could be resolved by running "mvn clean" and then "mvn package". But I think uploading a compiled jar is a cleaner solution.
    Yongjoo Park
    @pyongjoo
    @dongyoungy Can you upload a compiled jar to our github page? As you know, the commands are here: https://github.com/mozafari/verdictdb/wiki/Build-Deploy-commands
    Michael
    @commercial-hippie
    @pyongjoo Still getting that error unfortunately :(
    Praveen Krishna
    @Praveen2112
    Hi! Is it possible to write a scramble for a given table to a different catalog in the case of Presto? My data will be available in catalog1, and we might need to store the scramble in catalog2.
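    Something along these lines, for illustration (hypothetical; the schema and table names are placeholders, and the catalog-qualified target is the part being asked about):

        CREATE SCRAMBLE catalog2.mydb.sales_scramble
        FROM catalog1.mydb.sales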
    Yongjoo Park
    @pyongjoo
    Right at the moment, I don't think so.
    @Praveen2112 Changing the behavior to the way you mentioned is not hard, but we stopped maintaining this code temporarily until we hire actual developers.
    Praveen Krishna
    @Praveen2112
    If it is not a problem, can I contribute for the same?
    Yongjoo Park
    @pyongjoo
    @Praveen2112 Can you clarify what you mean by "the same"? If you send a pull request to the VerdictDB repo, we can merge it.
    Praveen Krishna
    @Praveen2112
    I meant raising the PR for adding support for writing the scramble to a separate catalog
    Yongjoo Park
    @pyongjoo
    Sure. But, do you think even Presto itself supports it? I was just playing with it, but copying tables between different catalogs doesn't seem to be supported.
    Oh, never mind. I was confused with something else.
    Yes, feel free to send a PR!