    snakescott
    @snakescott
    @tuulos does grouping by user just resolve to sorting?
    Ville Tuulos
    @tuulos
    what do you mean?
    snakescott
    @snakescott
    if you write events to parquet sorted by an appropriate key -- perhaps (user id, timestamp) -- doesn't that give you an optimized way to scan for events related to a user
    or maybe another way to put this is it seems like row vs column major is orthogonal to scan optimization?
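A minimal sketch of the sorted-scan idea in the question above, using a plain in-memory Python list rather than Parquet. The event tuples and the `events_for_user` helper are hypothetical and only illustrate that a (user id, timestamp) sort order lets you find one user's events with two binary searches.

```python
import bisect
import math

# Hypothetical events: (user_id, timestamp, action).
events = [
    ("bob", 3, "click"),
    ("alice", 5, "purchase"),
    ("alice", 1, "view"),
    ("carol", 2, "view"),
    ("bob", 1, "view"),
]

# Sort once by the composite key (user_id, timestamp).
events.sort(key=lambda e: (e[0], e[1]))


def events_for_user(sorted_events, user_id):
    """Return one user's events in time order via two binary searches."""
    # (user_id,) sorts before every event tuple of that user.
    lo = bisect.bisect_left(sorted_events, (user_id,))
    # (user_id, inf) sorts after every event tuple of that user.
    hi = bisect.bisect_left(sorted_events, (user_id, math.inf))
    return sorted_events[lo:hi]


print(events_for_user(events, "bob"))
```

This speeds up locating the rows for a user. It says nothing about how the fields of those rows are laid out, which is the separate row-major vs. column-major question.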
    Ville Tuulos
    @tuulos
    in row-major you have fields of a row adjacent to each other vs. values of a field adjacent to each other in column-major
    if you want to scan over all fields related to a set of adjacent rows, row-major is more efficient
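To make the two layouts concrete, here is a small sketch that flattens the same three made-up events both ways. The field names and the helpers `read_rows` and `read_column` are assumptions for illustration; this is not how TrailDB or Parquet actually encode data.

```python
# Three hypothetical events with three fields each.
rows = [
    {"user": "alice", "time": 1, "page": "/home"},
    {"user": "alice", "time": 2, "page": "/cart"},
    {"user": "bob", "time": 3, "page": "/home"},
]
fields = ["user", "time", "page"]

# Row-major: all fields of one event are adjacent.
row_major = [row[f] for row in rows for f in fields]

# Column-major: all values of one field are adjacent.
col_major = [row[f] for f in fields for row in rows]


def read_rows(flat, start, stop, n_fields):
    """Row-major: all fields of rows [start, stop) form one contiguous slice."""
    return flat[start * n_fields : stop * n_fields]


def read_column(flat, field_index, n_rows):
    """Column-major: all values of one field form one contiguous slice."""
    return flat[field_index * n_rows : (field_index + 1) * n_rows]


print(read_rows(row_major, 0, 2, len(fields)))   # every field of alice's two events
print(read_column(col_major, 1, len(rows)))      # the "time" field of every event
```

Reading every field of a few adjacent rows is a single slice in the row-major list, while reading one field across all rows is a single slice in the column-major list. The opposite access in either layout has to skip through the data.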
    snakescott
    @snakescott
    ah, I guess that's the crux!
    I assumed that TrailDB queries would look similar to analytic queries on systems like RedShift/etc
    obviously they aren't SQL
    but in terms of how many fields are used per query
    especially since events -- to be concrete, say web analytic events -- can have lots of fields?
    Ville Tuulos
    @tuulos
    yeah, TrailDB optimizes for select * from users where user_id=X vs. select aggregate(field) from users, which would be more efficient with a columnar layout
    snakescott
    @snakescott
    is this for simplicity, or did you expect to be working with funnel queries which cover lots of fields?
    I feel like I got a lot of insight into TrailDB design from the code, docs, and presentations (thanks!), but not on this point -- I might have missed a resource though.
    Ville Tuulos
    @tuulos
    lots of fields (say, hundreds or thousands of fields) works fine especially if not all fields are populated. TrailDB handles sparse data like that well
    if you have lots of non-empty fields but you care only about a tiny subset of them, it is less efficient
    one approach to optimize that use case is to partition data by field and use multi-cursors to join subsets of fields on the fly
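A rough sketch of that partition-and-join idea, assuming each partition holds a single field of one user's trail sorted by timestamp. The `partitions` dict and `join_fields` helper are hypothetical, and `heapq.merge` stands in for a multi-cursor here; this is not the TrailDB API.

```python
import heapq

# Hypothetical per-field partitions for a single user's trail.
# Each partition stores only one field, as (timestamp, field, value),
# and is already sorted by timestamp.
partitions = {
    "page": [(1, "page", "/home"), (4, "page", "/cart")],
    "referrer": [(1, "referrer", "google")],
    "price": [(4, "price", "9.99"), (6, "price", "19.99")],
}


def join_fields(partitions, wanted):
    """Lazily merge only the wanted partitions into one time-ordered stream."""
    streams = [partitions[name] for name in wanted]
    return heapq.merge(*streams, key=lambda event: event[0])


# Only the "page" and "price" partitions are read; "referrer" is never touched.
merged = list(join_fields(partitions, ["page", "price"]))
for event in merged:
    print(event)
```

The point of the design is that a query touching two fields reads two small partitions instead of scanning every populated field of every event.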
    great questions!
    snakescott
    @snakescott
    interesting, thanks
    I may try to find some way to remix arrow, parquet, and traildb and see what happens!
    Ville Tuulos
    @tuulos
    awesome. Please share your experiences here :)
    Yassine Marzougui
    @ymarzougui
    Hi! Thanks for the great tool!
    Would it be possible to add support for multi-cursors in the python bindings?
    Knut Nesheim
    @knutin
    I just pushed 0.4.0 of traildb-rs which now has support for event filters! :D
    Ville Tuulos
    @tuulos
    @ymarzougui yes, we should. Could you open a ticket about it at https://github.com/traildb/traildb-python/issues thanks!
    Yassine Marzougui
    @ymarzougui
    Great, thanks! I opened the ticket.
    Ville Tuulos
    @tuulos
    thanks!
    Vladimir Makhaev
    @vmakhaev
    hi, guys. I want to play with TrailDB and golang, but ran into a problem building traildb-go. Here is the ticket: traildb/traildb-go#9. If you can give any clue, that would be awesome
    vladkluev
    @vladkluev
    Just a shot in the dark but I had some trouble with the Go extensions at one point and it turned out that I didn't have the most recent traildb installed
    Vladimir Makhaev
    @vmakhaev
    thanks, but traildb is most recent
    Ville Tuulos
    @tuulos
    let me check
    Ville Tuulos
    @tuulos
    @vmakhaev go build seems to work ok with Go 1.7 on Linux. I am trying to reproduce with your setup using 1.8 on OS X
    Ville Tuulos
    @tuulos
    yeah, I can reproduce the issue with 1.8
    we will fix it for 1.8 but meanwhile, if possible, using 1.7 should be a workaround
    Vladimir Makhaev
    @vmakhaev
    @tuulos works with Go 1.7. thanks
    Vladimir Makhaev
    @vmakhaev
    it almost works, except for the db creation part: traildb/traildb-go#10
    Ville Tuulos
    @tuulos
    hmm, strange. I'll try to reproduce the issue
    we are using the Go bindings in production to produce tdbs so maybe there's something special about this case
    Vladimir Makhaev
    @vmakhaev
    created another couple of issues: traildb/traildb-go#11 and traildb/traildb-go#12. I think #11 could be fixed by upgrading to golang 1.8.1.
    Ville Tuulos
    @tuulos
    thanks @vmakhaev ! I wrote new code using the Go binding just yesterday without trouble on Linux. It seems like most/all of those issues might be related to OS X, which has had less testing
    thanks for reporting
    it should be easy to fix them
    rhymes
    @rhymes
    I have just started playing with TrailDB and Python (so far with a tdb containing 62 million events). Just out of curiosity, is there any real performance difference in constructing a TDB file between the Python and Go bindings?
    rhymes
    @rhymes
    BTW I actually wrote a "create trail db" script both in Python and in Go. They both go through a huge CSV file (Python's CSV has 69 261 656 rows, Go's CSV has 69 755 429 rows). Python took 3h5m20s to create the traildb file. Go took 1h31m25s.
    Then I wrote a "query trail db" script with an easy filter: field_1 = value AND (field_2 = value OR field_2 = value), which is basically one of the conditions of the SQL queries we use. Wrote the script in Python and it took 14.36s, rewrote it in Go and it took 5.75s
    All sequential, no optimizations, also my first time writing code in Go so I'm sure it can be better :D
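A condition of the shape field_1 = value AND (field_2 = value OR field_2 = value) is already in conjunctive normal form: a list of clauses that must all hold, each clause being a list of alternatives. The sketch below evaluates such a filter over plain Python dicts. The field values and the `matches` helper are placeholders, and nothing here calls the TrailDB bindings.

```python
# Filter in conjunctive normal form (CNF):
# every clause must hold, and a clause holds if any of its terms holds.
# Field names follow the message above; the values are made up.
query = [
    [("field_1", "a")],                       # field_1 = a
    [("field_2", "x"), ("field_2", "y")],     # AND (field_2 = x OR field_2 = y)
]


def matches(event, cnf):
    """Return True if the event satisfies every clause of the CNF filter."""
    return all(
        any(event.get(field) == value for field, value in clause)
        for clause in cnf
    )


# Hypothetical events to filter.
events = [
    {"field_1": "a", "field_2": "x"},
    {"field_1": "a", "field_2": "z"},
    {"field_1": "b", "field_2": "y"},
    {"field_1": "a", "field_2": "y"},
]

selected = [event for event in events if matches(event, query)]
print(selected)
```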
    Ville Tuulos
    @tuulos
    I have seen the Go binding being up to 6x faster than Python
    and Go programs tend to be really straightforward to parallelize over multiple cores for added benefit
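For reference, the timings reported above work out to roughly a 2x difference for creating the tdb and 2.5x for the query. The short calculation below uses only the numbers from those messages; the per-row rate accounts for the two CSV files having slightly different row counts.

```python
def seconds(hours, minutes, secs):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


# Numbers as reported in the messages above.
python_build = seconds(3, 5, 20)    # Python: 3h5m20s
go_build = seconds(1, 31, 25)       # Go: 1h31m25s
python_rows = 69_261_656
go_rows = 69_755_429
python_query = 14.36                # seconds
go_query = 5.75                     # seconds

# Wall-clock ratio, ignoring the small difference in row counts.
build_speedup = python_build / go_build

# Ratio of rows processed per second, which accounts for the row counts.
rate_speedup = (go_rows / go_build) / (python_rows / python_build)

query_speedup = python_query / go_query

print(f"build, wall-clock: {build_speedup:.2f}x")
print(f"build, rows/sec:   {rate_speedup:.2f}x")
print(f"query:             {query_speedup:.2f}x")
```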
    rhymes
    @rhymes
    Yeah, it's not hard for me to believe that.
    Ville Tuulos
    @tuulos
    in the same benchmark, coincidentally C was also about 6x faster than Go
    but a simple multicore version of the Go program beat the single-core C version