    Marius
    @marius-plv

    Dear all, first a BIG thank you for opening up this project to the world, pushing it forward, and providing good project documentation and groups like this one for new users - I would like to see more projects like traildb out there :)

    My question is related to optimizing the retrieval of consecutive events. I would like to efficiently retrieve all the events stored in one trail within a time interval [t1, t2].
    Let's first assume the simple use-case, where no traildb filter is configured on the cursor.
    The "data" (that gets stored with each "event") always has the same length.
    The closest API I've seen for achieving this is tdb_multi_cursor_next_batch().
    Is this the fastest traildb API for retrieving events stored in a given time range?

    Actually, tangent to the above: I just now saw some notes here about a great feature: tdb indexing (created using 'tdb index -i my-tdb').
    This could speed up such queries quite a bit, if I understood it correctly. Does this indexing operation apply to events as well, and is it safe to run in parallel with "read" operations (cursors performing read operations from different processes)?

    Kind regards,
    Marius

    Ville Tuulos
    @tuulos
    hi @Marius-Plv
    re: retrieve all the events in one trail within a time interval: The fastest way should be to set a time filter and then use a cursor as usual. In C, there shouldn't be a huge difference between retrieving events one by one using tdb_cursor_next() vs. many events in a batch with tdb_multi_cursor_next_batch(). The batch mode tends to be much faster if using a language binding
    The index is designed to speed up queries. It is an independent feature from the standard API and the default TrailDB API doesn't use the index yet. The tdb command line tool does and you can use it in your own apps too
    since it is an independent feature, you can index in parallel with other reads
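    A minimal C sketch of the time-filtered approach described above, assuming TrailDB 0.6+ (where tdb_event_filter_add_time_range() should be available) and omitting error handling:

        #include <stdio.h>
        #include <traildb.h>

        /* Print all events of one trail with t1 <= timestamp < t2. */
        int dump_trail_range(const char *path, uint64_t trail_id,
                             uint64_t t1, uint64_t t2)
        {
            tdb *db = tdb_init();
            if (tdb_open(db, path))
                return -1;

            /* a filter with a single time-range clause */
            struct tdb_event_filter *f = tdb_event_filter_new();
            tdb_event_filter_add_time_range(f, t1, t2);

            tdb_cursor *cursor = tdb_cursor_new(db);
            tdb_get_trail(cursor, trail_id);
            /* attach the filter after positioning the cursor */
            tdb_cursor_set_event_filter(cursor, f);

            const tdb_event *e;
            while ((e = tdb_cursor_next(cursor)))
                printf("event at %llu\n", (unsigned long long)e->timestamp);

            tdb_cursor_free(cursor);
            tdb_event_filter_free(f);
            tdb_close(db);
            return 0;
        }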
    Marius
    @marius-plv
    Thank you, Ville! I think I'll stick to the tdb_cursor_next() approach on a time-filtered cursor. The index sounds very appealing, although I will hold off on using it for now, since it is not available through the standard API (and calling tdb in another process is a pain). But I would consider the tdb command-line approach for the 2nd iteration - or a hybrid one in which events are streamed from tdb through a "message bus" to my application - this would also be feasible. Best regards, Marius
    Ville Tuulos
    @tuulos
    ok. Don't hesitate to ask if you have any other questions!
    Raunak Ramakrishnan
    @rrampage
    Are there any more examples in D? I built the traildb-d repo and ran a small example with dub.
    also, is there a way of using ldc for compiling the code? When I try it, I get an error: `/usr/bin/ld: .dub/obj/TrailDB.o: relocation R_X86_64_32 against symbol '_D9Exception7__ClassZ' can not be used when making a shared object; recompile with -fPIC`
    Ville Tuulos
    @tuulos
    @rrampage let me try to summon some people who know about the D bindings
    Lawrence Christopher Evans
    @lcevans
    Hey @rrampage, I'm not the person who wrote the TrailDB D bindings, though I have used them in our other private company repos. I can't share those repos... but if you have questions on how to use the interface I can help. I've only used dmd and don't know much about ldc. It looks like a linking issue -- this tutorial explains linking and the need for the -fPIC flag: http://www.cprogramming.com/tutorial/shared-libraries-linux-gcc.html. Perhaps dmd supplies the -fPIC flag automatically when it invokes gcc, while ldc does not?
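    If it helps, one possible workaround (an assumption, not verified against traildb-d) is to ask ldc explicitly for position-independent code, which is what gcc's -fPIC flag does. In a dub project that could go in dub.json:

        {
            "dflags-ldc": ["-relocation-model=pic"]
        }

    -relocation-model=pic is ldc2's counterpart to -fPIC, and the dflags-ldc key applies the flag only when building with ldc, so dmd builds are unaffected.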
    vladkluev
    @vladkluev
    Hey y'all, not sure if you know, but the travis build is failing right now - looks like a waf issue
    https://travis-ci.org/traildb/traildb
    Ville Tuulos
    @tuulos
    thanks, @vladkluev - I will take a look. Seems like a simple config issue. I wonder why it broke
    Raunak Ramakrishnan
    @rrampage
    @lcevans an example of merging tdbs would be really helpful. Also, at your company do you query the tdbs using trck or the C API?
    Lawrence Christopher Evans
    @lcevans
    @rrampage My company (AdRoll) uses tdbs in a variety of places, both trck and the C API. But the project I am most familiar with uses the C API directly via the D bindings. In this project we iterate over 30 days of tdbs (each holding one day of data) keeping track of per-cookie information in memory with D associative arrays... so in particular we don't merge tdbs (side note: This task was set up in the early days of tdb so if you want to do something like this you should try trck first). I don't have familiarity with merging tdbs, and I don't believe there is a D binding for the merging functions. But it should be possible to add a D binding for the relevant C functions -- if you end up doing so you're welcome to add them via PR to the traildb-d repo :)
    Raunak Ramakrishnan
    @rrampage
    Was able to merge TDBs from the D binding. The API method was exposed.
    Raunak Ramakrishnan
    @rrampage
    At my company we have around 40 million events daily coming into Kafka. As of now, we have set up hourly aggregation on each Kafka consumer, after which we aggregate all the generated tdbs. The reason we do this is that tdb_cons is not thread-safe. Is there a better way, i.e. directly writing to a single tdb from multiple Kafka consumers (all on the same machine)?
    Ville Tuulos
    @tuulos
    writing a separate tdb for each consumer and then merging is not a bad approach
    you can write to a single tdb too but you have to take care of locking by yourself so multiple threads don't access the same tdb_cons handle concurrently
    you could have a tdb endpoint that receives data from all kafka consumers with a small in-memory buffer of events for each consumer. After the buffer is full, you lock tdb_cons and flush the buffer to tdb.
    you should get less lock contention with this buffering approach than what you would get if you wrote events one by one to tdb_cons concurrently
    depending on your load pattern, it may or may not make a difference
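    A rough C sketch of the buffering approach, with a shared tdb_cons guarded by a mutex. The EventBuf type, the buffer size, and the two-field event layout are made up for illustration; error handling is omitted:

        #include <stddef.h>
        #include <pthread.h>
        #include <traildb.h>

        #define BUF_EVENTS 1024

        typedef struct {
            uint8_t uuid[16];
            uint64_t timestamp;
            const char *values[2];   /* assuming two fields per event */
            uint64_t lengths[2];
        } BufferedEvent;

        typedef struct {
            BufferedEvent events[BUF_EVENTS];
            size_t count;
        } EventBuf;

        static tdb_cons *shared_cons;
        static pthread_mutex_t cons_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Each consumer thread fills its own EventBuf and calls this when
           the buffer is full: one lock acquisition per BUF_EVENTS events
           instead of one per event. */
        static void flush_buffer(EventBuf *buf)
        {
            pthread_mutex_lock(&cons_lock);
            for (size_t i = 0; i < buf->count; i++) {
                BufferedEvent *e = &buf->events[i];
                tdb_cons_add(shared_cons, e->uuid, e->timestamp,
                             e->values, e->lengths);
            }
            pthread_mutex_unlock(&cons_lock);
            buf->count = 0;
        }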
    András Kovács
    @alpintrekker
    Hello, reading this article https://dev.to/rhymes/adventures-in-traildb-with-millions-of-rows-python-and-go left me with a question: why is index creation and usage not available via the API?
    András Kovács
    @alpintrekker
    Especially via the Python API.
    Ville Tuulos
    @tuulos
    hi @alpintrekker ! The index is/was considered an experimental feature, which is why it wasn't included in the official API
    it is not too hard to use via ctypes / cffi in Python, if you want to give it a try
    Ville Tuulos
    @tuulos
    Redis Streams might be interesting to people here https://news.ycombinator.com/item?id=15388481
    Marius
    @marius-plv
    Hi all, I would kindly ask for help, as I am out of ideas of what I might be doing wrong. Here is a summary of the APIs I'm calling, inside the same process, but in different scopes:
    { // add events with timestamps 1 and 2 to a.tdb
        tdb_cons_init
        tdb_cons_open
        tdb_cons_add
        tdb_cons_finalize
        tdb_cons_close
    } // running "tdb dump -i a.tdb" at this point in time will list events with timestamps 1 and 2.
    { // add event with timestamp 3 to a.tdb
        tdb_cons_init
        tdb_cons_open
        tdb_cons_add
        tdb_cons_finalize
        tdb_cons_close
    } // running "tdb dump -i a.tdb" at this point in time will list only the event with timestamp 3; the first two above (with timestamps 1 and 2) are gone.
    Is the order of the APIs I am calling OK? If yes, do you have any idea why, after writing the event with timestamp 3, the previously written events are no longer displayed?
    Marius
    @marius-plv
    It looks like my DB gets "emptied" after the tdb_cons_finalize for timestamp 3, and I don't understand why. I could provide some code as well, if desired
    Ville Tuulos
    @tuulos
    Hi @Marius-Plv - sorry for the delayed reply
    tdb_cons_open always creates a new, empty tdb. If you want to add previous events to a new tdb, open the second tdb with a different name and use tdb_cons_append to add the events of a.tdb there
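    A minimal C sketch of that append pattern, with hypothetical field names and values, and error handling omitted:

        #include <traildb.h>

        int extend_tdb(void)
        {
            /* open the finalized a.tdb read-only */
            tdb *old = tdb_init();
            if (tdb_open(old, "a"))
                return -1;

            /* create a new tdb under a different name, with compatible fields */
            const char *fields[] = {"action"};
            tdb_cons *cons = tdb_cons_init();
            if (tdb_cons_open(cons, "b", fields, 1))
                return -1;

            /* copy all events of a.tdb into the new constructor ... */
            tdb_cons_append(cons, old);

            /* ... then add the new event (timestamp 3, made-up uuid/value) */
            uint8_t uuid[16] = {0};
            const char *values[] = {"click"};
            uint64_t lengths[] = {5};
            tdb_cons_add(cons, uuid, 3, values, lengths);

            tdb_cons_finalize(cons);
            tdb_cons_close(cons);
            tdb_close(old);
            return 0;
        }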
    Marius
    @marius-plv
    Hi Ville and a happy new year to everyone! No worries and thanks for the reply!
    So my goal is to add events to the same tdb file. OK, so I would then need to call tdb_cons_open() only the first time. What about the following times (after a tdb_cons_finalize + tdb_cons_close), in case I want to add events again: what API sequence should be used to open an already existing tdb for writing? Is it a mix of reading + writing APIs (like tdb_init, tdb_open, tdb_cons_add, tdb_finalize, tdb_close)?
    Ville Tuulos
    @tuulos
    traildbs are immutable after finalization, so you can't add more events to an existing tdb. A typical pattern is to create many traildbs instead, e.g. one tdb file every hour / day. You can then optionally merge them together or query all of them using multi-cursors.
    yqylovy
    @yqylovy
    Hi, I'm trying to use traildb to track users' daily behavior.
    I want to create one traildb file per day, but if I want to focus on the visitor level, it is hard to use. I saw the multi-cursor, but it seems to work on the event level, especially when the same visitor has actions in different traildb files. Can anyone provide some suggestions?
    yqylovy
    @yqylovy
    Are there any open-source projects using traildb, so I can see some production examples?
    Ville Tuulos
    @tuulos
    hi @yqylovy - multi-cursor is implemented exactly for the use case of iterating over a single primary key (e.g. visitor id) in multiple files
    another option is to merge the daily tdbs to a single big tdb before analysis
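    A C sketch of the multi-cursor pattern for one visitor: position a cursor at the visitor's trail in every daily tdb that contains it, then let the multi-cursor merge the per-file streams in timestamp order. The helper and its setup are illustrative; error handling is omitted:

        #include <stdio.h>
        #include <traildb.h>

        /* dbs and cursors are parallel arrays, one entry per daily tdb file. */
        static void iterate_visitor(const tdb **dbs, tdb_cursor **cursors,
                                    uint64_t num_dbs, const uint8_t uuid[16])
        {
            tdb_cursor *active[num_dbs];
            uint64_t num_active = 0;

            /* not every daily file needs to contain this visitor */
            for (uint64_t i = 0; i < num_dbs; i++) {
                uint64_t trail_id;
                if (!tdb_get_trail_id(dbs[i], uuid, &trail_id)) {
                    tdb_get_trail(cursors[i], trail_id);
                    active[num_active++] = cursors[i];
                }
            }

            /* merge the per-file streams into one time-ordered stream */
            tdb_multi_cursor *mc = tdb_multi_cursor_new(active, num_active);
            const tdb_multi_event *e;
            while ((e = tdb_multi_cursor_next(mc)))
                printf("ts %llu\n", (unsigned long long)e->event->timestamp);
            tdb_multi_cursor_free(mc);
        }

    This way only one visitor's merged stream is materialized at a time, instead of buffering all events in memory.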
    yqylovy
    @yqylovy

    Hi @tuulos, thanks for the help!
    I have looked over the API in https://github.com/traildb/traildb/blob/master/doc/docs/api.md. There are tdb_multi_cursor_(new|free|reset), tdb_multi_cursor_next, and tdb_multi_cursor_next_batch. The *next* methods return events. So if I am iterating over a single primary key, I have to merge on visitor_id first, like:

        visitormap = {}
        for (event in cursor) {
            visitorid = event.key
            visitormap[visitorid] = append(visitormap[visitorid], event)
        }

        // THEN do work
        for ((visitorid, events) in visitormap) {
            // DO SOME WORK
        }

    So I need to iterate over the data once and hold all of it, which takes time and space.

    Is there a better way to do this?

    Marius
    @marius-plv
    Hi Ville, and thanks for the info. Based on your help, I am now evaluating the best approach for my use-case, in which I have frequent (every 1-5 seconds) write and read operations, spanning long periods of time. So for the use-case where the merging of two traildbs is done more frequently (say a tiny traildb is merged every 1-5 seconds into a larger traildb, which is MB to GB in size), are there read/write performance penalties (IO and/or CPU) with this approach that would make it not worthwhile in comparison with the other approach you suggested, where several traildbs are maintained instead?
    plainas
    @plainas
    Hey guys. Neat library! So I am looking for solutions to log user-generated events, but IIUIC, traildb doesn't address the problem of storing the data when it's generated, focusing instead on building a queryable archive of already existing data. Any suggestions on where to look instead? How do most people do it? Use relational databases and generate archives periodically?
    Marius
    @marius-plv
    Hi @plainas, I just started using traildb recently and wished for the same storage capabilities you are describing above. There are probably several ways; it depends on your use-case and requirements. For my use-case, the solution was to open a small tmp DB, write the most recent events to it, then merge this tmp DB into the "main DB". Note: from my understanding, the tmp DB - like any traildb DB - can be queried only after it is "finalized"; and once finalized, one can only merge it into another DB. Alternatively, if immediate persistence is not required, one could place the tmp DB on a RAM disk to gain some speed (or cache the "most recent" events in RAM - and query them outside tdb - before persisting them to a DB on disk). I wish tdb offered a way to store and query events which are not yet in a finalized DB. Still, I am curious to learn how others have approached this challenge.
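    Under the assumption that each merge writes a brand new tdb (finalized tdbs being immutable), one merge tick of this tmp-DB pattern could look roughly like this in C, with hypothetical paths and fields and no error handling:

        #include <traildb.h>

        void merge_tick(const char *main_path, const char *tmp_path,
                        const char *out_path, const char **fields, uint64_t n)
        {
            tdb *main_db = tdb_init();
            tdb_open(main_db, main_path);
            tdb *tmp_db = tdb_init();
            tdb_open(tmp_db, tmp_path);

            /* the merged result is a new tdb containing both inputs */
            tdb_cons *cons = tdb_cons_init();
            tdb_cons_open(cons, out_path, fields, n);
            tdb_cons_append(cons, main_db);  /* previous main events */
            tdb_cons_append(cons, tmp_db);   /* freshly finalized tmp events */
            tdb_cons_finalize(cons);
            tdb_cons_close(cons);

            tdb_close(main_db);
            tdb_close(tmp_db);
            /* the caller would then swap out_path in as the new main DB */
        }

    Note that every tick rewrites the whole main tdb, so with a MB-to-GB main DB and a 1-5 second cadence the IO cost grows with the size of the main DB, which is the trade-off asked about above.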
    plainas
    @plainas
    @Marius-Plv what exactly is the tmp db? A small traildb file that you write to as events come in and close after a while?
    Maybe an extra abstraction transparently wrapping both traildb and something else (maybe redis) for the data at the tip. It's a pity... traildb looks really cool, but the design choice of not including writing functionality is a big deal.
    Marius
    @marius-plv
    @plainas Yes, with "tmp DB" I meant a small, temporary DB where the most recent events are stored.