    Dan Voyce
    @voycey
    Currently we are creating partitions based on this - this means a 100x number of partitions.
    We are currently at 1,500 per month (perfectly acceptable for Spark and most DBMSs);
    with the 100x this takes us to 150,000, which is not so acceptable and creates a huge overhead. Can this be solved in any way?
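    As a quick back-of-the-envelope sketch of the blow-up described above (both figures are taken from the message: ~1,500 partitions per month today, multiplied by ~100 under the proposed scheme):

    ```python
    # Rough check of the partition growth described in the chat.
    partitions_per_month = 1_500  # current rate: fine for Spark and most DBMSs
    multiplier = 100              # extra partitioning dimension proposed

    total = partitions_per_month * multiplier
    print(total)  # 150000 -- far too many partitions for a single table
    ```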
    Yongjoo Park
    @pyongjoo
    I guess your 3 hours is for creating a scramble with 50 (per day) * 100 = 5000 partitions?
    If so, we can do two things: (1) let you change partitioning columns for a scramble, and (2) let you change verdictdbblock value.
    I believe (1) will be more useful for now. Changing verdictdbblock value may have query latency impact, which we haven't measured in detail.
    Yongjoo Park
    @pyongjoo
    Assuming (1) is done, you can set the partitioning columns for a scramble as (date) only (without state); then, Verdict will effectively set the partitioning columns as (date, verdictdbblock), which will lead to fewer partitions.
    Dan Voyce
    @voycey
    Sorry, we have been up to our eyes in it - I think @commercial-hippie made a few changes, but he will have to detail what he did. I think if we can partition based on date that is the easiest way - fewer partitions sound good, although I am not sure what the consequences of this are down the track.
    Most of our queries, we think, are based on date and state - I'm not sure if we are setting ourselves up to fail further down the track....
    Dan Voyce
    @voycey
    @pyongjoo - OK, so we definitely need a solution to this verdictdbblock thing - currently, because it is creating those as an extra partition, we have 90 x (52 x 100) = 468,000 partitions, which is obviously impossible to support on a single table.
    I think for the most part we should be able to just keep the date partitioning, which would give us 90 days x 100 verdictdbblock values = 9,000, which is manageable.
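    The two partition counts quoted above can be reproduced with a small sketch (90 dates, 52 state values, and 100 verdictdbblock values, as stated in the message):

    ```python
    # Partition counts for the two scramble layouts discussed above.
    days, states, blocks = 90, 52, 100

    with_state = days * states * blocks  # partitioned on (date, state, verdictdbblock)
    date_only = days * blocks            # partitioned on (date, verdictdbblock)

    print(with_state)  # 468000 -- impossible to support on a single table
    print(date_only)   # 9000   -- manageable
    ```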
    Dan Voyce
    @voycey
    @pyongjoo If Verdict creates the scrambles solely on the date, will this affect the return speed of the counts? We are in a "have your cake and eat it" scenario where we want to maintain the fast count return speed but also want to make these scrambles more efficient.
    (Currently we can't even create the scrambles - a 6 x 32 CPU / 208 GB memory cluster keeps giving out-of-memory errors in Presto whilst trying to build them.)
    Michael
    @commercial-hippie
    @pyongjoo I would just like to confirm whether we can clone a scramble table and its data as another table with the same schema but different partitions, and then just duplicate the scramble record in the verdictdbmeta table with the new table name. So the only thing we really change is the partitions and the table name.
    The metadata will still be valid for the new table, right?
    Dan Voyce
    @voycey
    ^ We basically need to know if there are any references to the partitions within the actual data
    Michael
    @commercial-hippie
    This is because we want to take our current scramble and clone it to create some tests, i.e.:
    Dan Voyce
    @voycey
    Because re-running these scrambles isn't feasible (they take days to complete), we want to ETL the data using Spark directly on the ORC files.
    Michael
    @commercial-hippie
    1. Partition date/state/verdictdbmeta
    2. Partition date/verdictdbmeta
    3. possibly mixing partitions and buckets
    etc.
    just to play around with the structure without having to recreate everything
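    As a rough sketch, the three layouts listed above imply very different partition counts. (All figures are assumptions taken from earlier in the chat - 90 dates, 52 states, 100 verdictdbblock values - and the bucket option is approximated by noting that buckets are files within a partition, not extra partitions.)

    ```python
    # Hypothetical partition-count comparison for the test layouts above.
    dates, states, blocks = 90, 52, 100

    layouts = {
        "date/state/block": dates * states * blocks,
        "date/block": dates * blocks,
        # buckets live inside partition directories, so bucketing the block
        # column would leave only the date partitions
        "date partitions, bucketed block": dates,
    }
    for name, count in sorted(layouts.items(), key=lambda kv: kv[1]):
        print(f"{name}: {count} partitions")
    ```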
    Michael
    @commercial-hippie

    Hi!

    I've compiled the latest commit on the master branch (which includes this: https://github.com/mozafari/verdictdb/pull/380/files )

    When I add AND regexp_like('1a 2b 14m', '\\d+b') to a query of mine I get:

    Error running instance method
    java.lang.RuntimeException: syntax error occurred:no viable alternative at input 'regexp_like('1a 2b 14m', '\\d+b'))'
        at org.verdictdb.sqlreader.VerdictDBErrorListener.syntaxError(VerdictDBErrorListener.java:35)
        at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:65)
        at org.antlr.v4.runtime.Parser.notifyErrorListeners(Parser.java:564)
        at org.antlr.v4.runtime.DefaultErrorStrategy.reportNoViableAlternative(DefaultErrorStrategy.java:308)
        at org.antlr.v4.runtime.DefaultErrorStrategy.reportError(DefaultErrorStrategy.java:145)
        at org.verdictdb.parser.VerdictSQLParser.predicate(VerdictSQLParser.java:5708)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_not(VerdictSQLParser.java:5263)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_or(VerdictSQLParser.java:5200)
        at org.verdictdb.parser.VerdictSQLParser.search_condition(VerdictSQLParser.java:5150)
        at org.verdictdb.parser.VerdictSQLParser.predicate(VerdictSQLParser.java:5500)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_not(VerdictSQLParser.java:5263)
        at org.verdictdb.parser.VerdictSQLParser.search_condition_or(VerdictSQLParser.java:5200)
        at org.verdictdb.parser.VerdictSQLParser.search_condition(VerdictSQLParser.java:5140)
        at org.verdictdb.parser.VerdictSQLParser.query_specification(VerdictSQLParser.java:6041)
        at org.verdictdb.parser.VerdictSQLParser.query_expression(VerdictSQLParser.java:5753)
        at org.verdictdb.parser.VerdictSQLParser.select_statement(VerdictSQLParser.java:1836)
        at org.verdictdb.parser.VerdictSQLParser.verdict_statement(VerdictSQLParser.java:578)
        at org.verdictdb.coordinator.ExecutionContext.identifyQueryType(ExecutionContext.java:796)
        at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:152)
        at org.verdictdb.jdbc41.VerdictStatement.execute(VerdictStatement.java:107)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
    I'm not sure if I did something wrong. Would it be possible to get a new release on GitHub so that I can download the compiled jar from there?
    Yongjoo Park
    @pyongjoo
    @voycey @commercial-hippie Somehow I haven't got any notifications from this channel. Sorry for that. I'm checking messages starting from older ones.
    The syntax error itself could be resolved by "mvn clean", then "mvn package". But I think uploading a compiled jar is a cleaner solution.
    Yongjoo Park
    @pyongjoo
    @dongyoungy Can you upload a compiled jar to our github page? As you know, the commands are here: https://github.com/mozafari/verdictdb/wiki/Build-Deploy-commands
    Michael
    @commercial-hippie
    @pyongjoo Still getting that error unfortunately :(
    Praveen Krishna
    @Praveen2112
    Hi!! Is it possible to write a scramble for a given table into a different catalog in the case of Presto? Like, my data will be available in catalog1 and we might need to store the scramble in catalog2.
    Yongjoo Park
    @pyongjoo
    Right at the moment, I don't think so.
    @Praveen2112 Changing the behavior to the way you mentioned is not hard, but we stopped maintaining this code temporarily until we hire actual developers.
    Praveen Krishna
    @Praveen2112
    If it is not a problem, can I contribute for the same?
    Yongjoo Park
    @pyongjoo
    @Praveen2112 Can you clarify what you mean by "the same". If you send a pull request to VerdictDB repo, we can merge it.
    Praveen Krishna
    @Praveen2112
    I meant raising the PR for adding support for writing the scramble to a separate catalog
    Yongjoo Park
    @pyongjoo
    Sure. But, do you think even Presto itself supports it? I was just playing with it, but copying tables between different catalogs doesn't seem to be supported.
    Oh, never mind. I was confused with something else.
    Yes, feel free to send a PR!
    anujlal01
    @anujlal01

    I am at a very early stage of reading/evaluating/trying things before deciding what to build and which existing solutions (like blink/verdict/snappy/druid, etc.) to use for approximate queries in the big data field.

    Started by playing with verdictdb.

    Is my understanding correct that I need to maintain a scramble, which is a modified form of the existing table, AND that its size is going to be the same as the original table? That would be a bummer and a showstopper, as it would come at a HUGE cost.