    Ville Tuulos
    @tuulos
    thanks @vmakhaev ! I wrote new code using the Go binding just yesterday without trouble on Linux. It seems like most or all of those issues are related to OS X, which has gotten less testing
    thanks for reporting
    it should be easy to fix them
    rhymes
    @rhymes
    I have just started playing with TrailDB so far (I'm playing with a tdb containing 62 million events) and Python. Just out of curiosity, is there any real performance difference in constructing a TDB file between Python and Go bindings?
    rhymes
    @rhymes
    BTW I actually wrote a "create trail db" script both in Python and in Go. They both go through a huge CSV file (Python's CSV has 69 261 656 rows, Go's has 69 755 429 rows). Python took 3h5m20s to create the traildb file; Go took 1h31m25s.
    Then I wrote a "query trail db" script with an easy filter: field_1 = value AND (field_2 = value OR field_2 = value), which is basically one of the conditions of the SQL queries we use. Wrote the script in Python and it took 14.36s, rewrote it in Go and it took 5.75s
    All sequential, no optimizations, also my first time writing code in Go so I'm sure it can be better :D
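    A query like the one above can be sketched with the official traildb-python binding; the file, field, and value names here are made up for illustration, and events come back as namedtuples keyed by field name:

    ```python
    from traildb import TrailDB  # official traildb-python binding

    tdb = TrailDB('events.tdb')  # hypothetical tdb file
    matches = 0
    for uuid, trail in tdb.trails():
        for event in trail:
            # field_1 = v1 AND (field_2 = a OR field_2 = b)
            if event.field_1 == 'v1' and event.field_2 in ('a', 'b'):
                matches += 1
    print(matches)
    ```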
    Ville Tuulos
    @tuulos
    I have seen the Go binding being up to 6x faster than Python
    and Go programs tend to be really straightforward to parallelize over multiple cores for added benefit
    rhymes
    @rhymes
    Yeah, it's not hard for me to believe that.
    Ville Tuulos
    @tuulos
    in the same benchmark, coincidentally C was also about 6x faster than Go
    but a simple multicore version of the Go program beat a single-core C
    rhymes
    @rhymes
    Yeah, I'd definitely use Go if we decide to use traildb.
    I was trying to use trck today but I failed (traildb/trck#11) - it probably has something to do with me using OS X for trck.
    Ville Tuulos
    @tuulos
    @oavdeev might be able to help with that
    Marius
    @marius-plv

    Dear all, first a BIG thank you for opening up this project to the world, pushing it forward and providing good project documentation and groups like this one for new users - would like to see more projects like traildb out there :)

    My question is related to optimizing retrieval of consecutive events. I would like to efficiently retrieve all the events stored in one trail, between time interval [t1, t2].
    Let's assume first the simple use-case, where no traildb filter is configured on the cursor.
    The "data" (that gets stored with each "event") has always the same length.
    The closest API to achieve this I've seen is tdb_multi_cursor_next_batch().
    Is this the fastest traildb API to retrieve events stored in a given time range?

    Actually tangent to the above: just now I saw here some notes about a great feature: tdb indexing (created using 'tdb index -i my-tdb').
    This could speed up such queries quite a bit, if I understood it correctly. Does this indexing operation apply to events as well, and is it safe to run in parallel with "read" operations (cursors performing read operations from different processes)?

    Kind regards,
    Marius

    Ville Tuulos
    @tuulos
    hi @Marius-Plv
    re: retrieve all the events in one trail within a time interval: The fastest way should be to set a time filter and then use a cursor as usual. In C, there shouldn't be a huge difference between retrieving events one by one using tdb_cursor_next() vs. many events in a batch with tdb_multi_cursor_next_batch(). The batch mode tends to be much faster if using a language binding
    The index is designed to speed up queries. It is an independent feature from the standard API and the default TrailDB API doesn't use the index yet. The tdb command line tool does and you can use it in your own apps too
    since it is an independent feature, you can index in parallel with other reads
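    A minimal sketch of the time-window retrieval described above, using the Python binding and checking timestamps in application code (file name and bounds are illustrative; in C, a tdb_event_filter set on the cursor would do the filtering below this level):

    ```python
    from traildb import TrailDB

    tdb = TrailDB('a.tdb')            # hypothetical tdb file
    t1, t2 = 1500000000, 1500086400   # inclusive time window, illustrative

    for uuid, trail in tdb.trails():
        in_range = [e for e in trail if t1 <= e.time <= t2]
        # ... process in_range for this trail ...
    ```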
    Marius
    @marius-plv
    Thank you, Ville! I think I'll stick to the tdb_cursor_next() approach over a time-filtered cursor. The index sounds very appealing, although I'll resist using it for now, since it's not available in the standard API (and calling tdb in another process is a pain). But I would consider the tdb command-line approach for the 2nd iteration - or a hybrid one in which events are streamed from tdb through a "message bus" to my application, which would also be feasible. Best regards, Marius
    Ville Tuulos
    @tuulos
    ok. Don't hesitate to ask if you have any other questions!
    Raunak Ramakrishnan
    @rrampage
    Are there any more examples in D? I built the traildb-d repo and ran a small example with dub.
    also, is there a way of using ldc for compiling the code? When I try it, I get an error: `/usr/bin/ld: .dub/obj/TrailDB.o: relocation R_X86_64_32 against symbol '_D9Exception7__ClassZ' can not be used when making a shared object; recompile with -fPIC`
    Ville Tuulos
    @tuulos
    @rrampage let me try to summon some people who know about the D bindings
    Lawrence Christopher Evans
    @lcevans
    Hey @rrampage, I'm not the person who wrote the TrailDB D bindings though I have used them in our other private company repos. I can't share those repos... but if you have questions on how to use the interface I can help. I've only used dmd and don't know much about ldc. It looks like a linking issue -- this tutorial explains linking and the need for the -fPIC flag: http://www.cprogramming.com/tutorial/shared-libraries-linux-gcc.html. dmd calls gcc and perhaps supplies the -fPIC flag automatically, while perhaps ldc does not?
    vladkluev
    @vladkluev
    Hey y'all, not sure if you know but the Travis build is failing right now, looks like a waf issue
    https://travis-ci.org/traildb/traildb
    Ville Tuulos
    @tuulos
    thanks, @vladkluev - I will take a look. Seems like a simple config issue. I wonder why it broke
    Raunak Ramakrishnan
    @rrampage
    @lcevans an example on merging tdbs will be really helpful. Also, in your company do you query the tdbs using trck or C API?
    Lawrence Christopher Evans
    @lcevans
    @rrampage My company (AdRoll) uses tdbs in a variety of places, both trck and the C API. But the project I am most familiar with uses the C API directly via the D bindings. In this project we iterate over 30 days of tdbs (each holding one day of data) keeping track of per-cookie information in memory with D associative arrays... so in particular we don't merge tdbs (side note: This task was set up in the early days of tdb so if you want to do something like this you should try trck first). I don't have familiarity with merging tdbs, and I don't believe there is a D binding for the merging functions. But it should be possible to add a D binding for the relevant C functions -- if you end up doing so you're welcome to add them via PR to the traildb-d repo :)
    Raunak Ramakrishnan
    @rrampage
    Was able to merge TDBs from the D binding. The API method was exposed.
    Raunak Ramakrishnan
    @rrampage
    At my company we have around 40 million events daily coming into Kafka. As of now, we have set up hourly aggregation on each Kafka consumer, after which we do an aggregation of all generated tdbs. The reason we are doing this is that tdb_cons is not thread-safe. Is there a better way to do this, i.e. directly write to a single tdb from multiple Kafka consumers (all on the same machine)?
    Ville Tuulos
    @tuulos
    writing a separate tdb for each consumer and then merging is not a bad approach
    you can write to a single tdb too but you have to take care of locking by yourself so multiple threads don't access the same tdb_cons handle concurrently
    you could have a tdb endpoint that receives data from all kafka consumers with a small in-memory buffer of events for each consumer. After the buffer is full, you lock tdb_cons and flush the buffer to tdb.
    you should get less lock contention with this buffering approach than what you would get if you wrote events one by one to tdb_cons concurrently
    depending on your load pattern, it may or may not make a difference
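    The buffering scheme sketched above can be illustrated in plain Python. The BufferedWriter below is a stand-in for the shared, locked tdb_cons handle, and flush() plays the role of the locked tdb_cons_add loop; all names are illustrative:

    ```python
    import threading

    BUFFER_SIZE = 100

    class BufferedWriter:
        """Stand-in for a shared tdb_cons handle guarded by a lock."""
        def __init__(self):
            self.lock = threading.Lock()  # guards the shared "tdb_cons"
            self.written = []             # stand-in for the tdb being built

        def flush(self, buffer):
            # In a real app this would lock tdb_cons and call
            # tdb_cons_add for each buffered event.
            with self.lock:
                self.written.extend(buffer)
            buffer.clear()

    def producer(writer, n_events, producer_id):
        buffer = []  # per-producer in-memory buffer of events
        for i in range(n_events):
            buffer.append((producer_id, i))   # one "event"
            if len(buffer) >= BUFFER_SIZE:
                writer.flush(buffer)          # take the lock once per batch
        writer.flush(buffer)                  # flush the final partial buffer

    writer = BufferedWriter()
    threads = [threading.Thread(target=producer, args=(writer, 1000, p))
               for p in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(writer.written))  # 4000
    ```

    Each producer only takes the lock once per BUFFER_SIZE events rather than once per event, which is where the reduced contention comes from.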
    András Kovács
    @alpintrekker
    Hello, reading this article https://dev.to/rhymes/adventures-in-traildb-with-millions-of-rows-python-and-go left me with a question: why are index creation and usage not available via the API?
    András Kovács
    @alpintrekker
    Especially via Python API.
    Ville Tuulos
    @tuulos
    hi @alpintrekker ! The index is/was considered an experimental feature, which is why it wasn't included in the official API
    it is not too hard to use via ctypes / cffi in Python, if you want to give it a try
    Ville Tuulos
    @tuulos
    Redis Streams might be interesting to people here https://news.ycombinator.com/item?id=15388481
    Marius
    @marius-plv
    Hi all, I would kindly ask for help, as I am out of ideas of what I might be doing wrong. Here is a summary of the APIs I'm calling, inside the same process, but in different scopes:
    { // add events with timestamps 1 and 2 to a.tdb
    tdb_cons_init
    tdb_cons_open
    tdb_cons_add
    tdb_cons_finalize
    tdb_cons_close
    } // running "tdb dump -i a.tdb" at this point in time will list events with timestamp 1 and 2.
    { // add event with timestamp 3 to a.tdb
    tdb_cons_init
    tdb_cons_open
    tdb_cons_add
    tdb_cons_finalize
    tdb_cons_close
    } // running "tdb dump -i a.tdb" at this point in time will list only the event with timestamp 3, the first two above (with timestamp 1 and 2) are gone.
    Is the order of the APIs I am calling OK? If yes, do you have any idea why, after writing the event with timestamp 3, the previously written events are no longer displayed?
    Marius
    @marius-plv
    It looks like my DB gets "emptied" after the tdb_cons_finalize call for timestamp 3 and I don't understand why. I could provide some code as well, if desired
    Ville Tuulos
    @tuulos
    Hi @Marius-Plv - sorry for the delayed reply
    tdb_cons_open always creates a new, empty tdb. If you want to add previous events to a new tdb, open the second tdb with a different name and use tdb_cons_append to add the events of a.tdb there
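    A sketch of that fix with the Python binding, assuming TrailDBConstructor.append (which wraps tdb_cons_append); file, field, and value names are illustrative:

    ```python
    from traildb import TrailDB, TrailDBConstructor

    old = TrailDB('a.tdb')                      # finalized tdb holding events 1 and 2
    cons = TrailDBConstructor('b', ['field1'])  # new tdb, different name, same fields
    cons.append(old)                            # carry events 1 and 2 over
    uuid = '12345678123456781234567812345678'   # illustrative 32-char hex uuid
    cons.add(uuid, 3, ('value',))               # the new event with timestamp 3
    cons.finalize()                             # b.tdb now holds all three events
    ```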