    Ville Tuulos
    @tuulos
    oh, interesting
    I haven't tried it with Py3.7 yet
    Jakob Sievers
    @cannedprimates
    does tdb handle small field values (i.e. values that would fit into an item directly without going through a lexicon) specially? had a quick look at jsm_insert_large() and didn't see anything...
    semi-related: are there best practices around numeric field values? should I hand the byte representation to tdb?
    Ville Tuulos
    @tuulos
    Hi @cannedprimates - there's no special handling of small values. Would you need it for performance reasons?
    all values are byte blobs currently. No special handling for numeric field values. If you have floating point values and you don't need the full 64/32-bit accuracy, you can save space / increase performance by truncating values to the desired accuracy before inserting them
    Jakob Sievers
    @cannedprimates
    @tuulos thanks for the reply! no concrete need for it (yet :)), just curious
    Ville Tuulos
    @tuulos
    cool. Let me know if you have any other questions / feedback!
    donaherc
    @donaherc

    Hello! I've run into some intermittent issues reading from a handful of ~18MB files I've combined repeatedly with tdb_cons_add(). Has anyone seen behavior that resembles this:

    ==15444== Invalid read of size 8
    ==15444==    at 0x4E3FD52: read_bits (tdb_bits.h:14)
    ==15444==    by 0x4E3FD52: read_bits64 (tdb_bits.h:38)
    ==15444==    by 0x4E3FD52: huff_decode_value (tdb_huffman.h:72)
    ==15444==    by 0x4E3FD52: _tdb_cursor_next_batch (tdb_decode.c:282)
    ==15444==    by 0x935C57: tdb_cursor_next (traildb.h:304)
    ==15444==    by 0x935C57: _cgo_4805fbb2d53a_Cfunc_tdb_cursor_next (cgo-gcc-prolog:222)
    ==15444==    by 0x46565F: runtime.asmcgocall (/usr/local/bin/go/src/runtime/asm_amd64.s:688)
    ==15444==    by 0xC4200928FF: ???
    ==15444==    by 0xB07CE87: ???
    ==15444==    by 0x460D81: runtime.(*mcache).nextFree.func1 (/usr/local/bin/go/src/runtime/malloc.go:556)
    ==15444==    by 0xC4201AABFF: ???
    ==15444==    by 0x43BB8F: ??? (/usr/local/bin/go/src/runtime/proc.go:1092)
    ==15444==  Address 0xe323ff9 is in a r-- mapped file /home/vagrant/app_files2/0157e8982def92b71fcc767d568e57883b86dba4298b66c2468127de0ef9c8cc segment
    ==15444== 
    fatal error: unexpected signal during runtime execution
    [signal SIGSEGV: segmentation violation code=0x1 addr=0xe324000 pc=0x4e3fd52]
    
    runtime stack:
    runtime.throw(0xb18c4c, 0x2a)
            /usr/local/bin/go/src/runtime/panic.go:616 +0x81
    runtime.sigpanic()
            /usr/local/bin/go/src/runtime/signal_unix.go:372 +0x28e
    
    goroutine 12 [syscall]:
    runtime.cgocall(0x935c00, 0xc42006ca10, ==15444== Use of uninitialised value of size 8
    ==15444==    at 0x438673: runtime.printhex (/usr/local/bin/go/src/runtime/print.go:219)
    ==15444==    by 0x45AA68: runtime.gentraceback (/usr/local/bin/go/src/runtime/traceback.go:406)
    ==15444==    by 0x45C4F8: runtime.traceback1 (/usr/local/bin/go/src/runtime/traceback.go:684)
    ==15444==    by 0x45C371: runtime.traceback (/usr/local/bin/go/src/runtime/traceback.go:645)
    ==15444==    by 0x45CF56: runtime.tracebackothers (/usr/local/bin/go/src/runtime/traceback.go:816)
    ==15444==    by 0x437B54: runtime.dopanic_m (/usr/local/bin/go/src/runtime/panic.go:736)
    ==15444==    by 0x46271B: runtime.dopanic.func1 (/usr/local/bin/go/src/runtime/panic.go:598)
    ==15444==    by 0x437479: runtime.dopanic (/usr/local/bin/go/src/runtime/panic.go:597)
    ==15444==    by 0x437550: runtime.throw (/usr/local/bin/go/src/runtime/panic.go:616)
    ==15444==    by 0x44CD7D: runtime.sigpanic (/usr/local/bin/go/src/runtime/signal_unix.go:372)
    ==15444==    by 0x4E3FD51: read_bits (tdb_bits.h:13)
    ==15444==    by 0x4E3FD51: read_bits64 (tdb_bits.h:38)
    ==15444==    by 0x4E3FD51: huff_decode_value (tdb_huffman.h:72)
    ==15444==    by 0x4E3FD51: _tdb_cursor_next_batch (tdb_decode.c:282)
    ==15444==    by 0x935C57: tdb_cursor_next (traildb.h:304)
    ==15444==    by 0x935C57: _cgo_4805fbb2d53a_Cfunc_tdb_cursor_next (cgo-gcc-prolog:222)
    ==15444== 
    ==15444== Conditional jump or move depends on uninitialised value(s)
    ==15444==    at 0x438685: runtime.printhex (/usr/local/bin/go/src/runtime/print.go:220)
    ==15444==    by 0x45AA68: runtime.gentraceback (/usr/local/bin/go/src/runtime/traceback.go:406)
    ==15444==    by 0x45C4F8: runtime.traceback1 (/usr/local/bin/go/src/runtime/traceback.go:684)
    ==15444==    by 0x45C371: runtime.traceback (/usr/local/bin/go/src/runtime/traceback.go:645)
    ==15444==    by 0x45CF56: runtime.tracebackothers (/usr/local/bin/go/src/runtime/traceback.go:816)
    ==15444==    by 0x437B54: runtime.dopanic_m (/usr/local/bin/go/src/runtime/panic.go:736)
    ==15444==    by 0x46271B: runtime.dopanic.func1 (/usr/local/bin/go/src/runtime/panic.go:598)
    ==15444==    by 0x437479: runtime.dopanic (/usr/local/bin/go/src/runtime/panic.go:597)
    ==15444==    by 0x437550: runtime.throw (/usr/local/bin/go/src/runtime/panic.go:616)
    ==15444==    by 0x44CD7D: runtime.sigpanic (/usr/local/bin/go/src/runtime/signal_unix.go:372)
    ==15444==    by 0x4E3FD51: read_bits (tdb_bits.h:13)
    ==15444==    by 0x4E3FD51: read_bits64 (tdb_bits.h:38)
    ==15444==    by 0x4E3FD51: huff_decode_value (tdb_huffman.h:72)

    I'm using the traildb-go bindings.

    Willing to provide more info if it'd help!
    donaherc
    @donaherc
    Having dug in more, I now suspect that the issue is that the vm.max_map_count setting on the hosts we use for tdb_cons_add was too low (it was at the default 65530). We've seen no issues since raising the setting
    donaherc
    @donaherc
    I believe we're still running into intermittent issues iterating through traildb files and also merging them with tdb_cons_append, causing segfaults inside cgo, which forces a panic. Has anyone here used the traildb-go library and seen such behavior? Is it possible that undefined behavior in traildb file access would cause a panic inside cgo but behave normally when handled with the C library directly?
    Ville Tuulos
    @tuulos
    could you try tdb merge on the command line with the same files to see if it still segfaults?
    it might be an issue with the Go bindings or (more unlikely), the C library itself
    donaherc
    @donaherc
    hello! yeah, we have been unable to reproduce with the tdbcli tools, although for a handful of the files we have seen intermittent segfaulting using 'tdb index'. Some of the files that appear to be impacted have values north of 10k characters, which is pretty anomalous for the data we're storing. When pushing the traildb reads down into pure C we have seen no issues.
    Ross Wolf
    @rw-access

    hello! i saw --threads on the CLI help and am wondering what is made parallel?

    I know that tdb handles aren't thread safe but am thinking of ways to build something parallel and ordered on top of multiple tdb files and cursors within a single process. possibly a batching multi-multicursor? could that work, or is there a good chance that i'd run into other issues that i'm not thinking of? thanks!

    Ross Wolf
    @rw-access
    the more I think about it, the less sense that seems to make. for my use case, I expect that many of the underlying cursors will not return results. so I think carefully creating with something similar to tdb_multi_cursor_new but calling a version of tdb_multi_cursor_reset that threads the initial calls to tdb_cursor_peek might actually do the trick (since many cursors will be exhausted right away). i'll have to see how much time is spent in tdb_multi_cursor_new vs tdb_multi_cursor_next
    Oleg Avdeev
    @oavdeev
    Looks like --threads is only used for indexing in tdbcli
    I'm not sure I 100% understand what you mean by "parallel .. within a single process"?
    Since tdb is read only after you create it, a typical pattern is that you just have a db open in every thread, and a cursor, and split work between the threads based on uuid
    Ville Tuulos
    @tuulos
    right, like @oavdeev said - everything on the read side can be parallelized by using independent handles and cursors on each thread
    Ross Wolf
    @rw-access
    yeah, I've just been brainstorming ways to do something similar to a multicursor but threaded. ideally I would want to hit tdb_multi_cursor_next or the batched version, but have the peeking for the underlying cursors be more parallelized.
    but that seems really tricky and I'd obviously need to divvy up the tdb handles between threads.
    I think a quick win without too much reworking is to parallelize the initial peek calls in tdb_multi_cursor_reset when it's created
    what performance differences have you seen between the multicursor batched and non-batched?
    Ville Tuulos
    @tuulos
    re: "something similar to a multicursor but threaded" - you mean multiple consumers in different threads pulling events from a single cursor?
    or a single consumer but multiple threads doing decoding in parallel?
    Ross Wolf
    @rw-access
    I believe the second one. The filters I'm using are generally sparse and cover multiple trails and tdbs. I want to use multiple threads for iterating the cursors (especially since there's a chance that some won't have any matches) and then one thread to consume the results and process them in order, like a multicursor
    Ville Tuulos
    @tuulos
    makes sense. In your case I would just have K parallel threads using normal (not multi) cursors. Each thread needs to push events to some output queue/buffer. The consumer can take care of ordering e.g. using the pqueue priority queue that tdb_multicursor uses internally
    Ross Wolf
    @rw-access
    awesome. yeah that makes sense. I'll see how that looks. and there's still a good chance that I'm wrong and the single threaded consumer is the real bottleneck. thanks for the help!
    Ville Tuulos
    @tuulos
    cool! let us know how it goes
    Ross Wolf
    @rw-access
    hello again!
    quick question this time - what's the lifetime of const tdb_event * as returned by tdb_cursor_peek/tdb_cursor_next?
    i'm guessing that it's valid until _tdb_cursor_next_batch is called again
    Oleg Avdeev
    @oavdeev
    yes, basically the idea is that it lives until the next tdb_cursor_next() call (which may call _tdb_cursor_next_batch internally)
    luca santini
    @santoxyz
    hello everybody. i'm evaluating using traildb (python client) on an embedded system with limited ram (1GB) and storage (4GB) to save "big" data (1 year of samples - hundreds of variables - 1 second interval).
    it seems promising, but i'm not sure i understood how it's working.
    Tutorial says: create a db, add points, finalize.
    What i see adding points is: the file on disk is not growing... does it persist data only on finalize()? How could i make sure data is persisted "frequently" (i.e. every minute) to minimize the loss in case of crash/reboot/problems?
    Ville Tuulos
    @tuulos
    you can choose how often to call tdb_finalize based on your needs. You can call it every minute. You can have a separate compaction process that then merges the minute-files to a larger chunk e.g. every hour / day.
    luca santini
    @santoxyz
    sounds good! yesterday i produced a dataset containing 1 year of fake data in a couple of hours, resulting in a 97MB data.tdb (very good), but i noticed temporary files totalling 33GB (very bad!).
    Hope that by finalizing and merging every minute i'll keep the temp data small. need some testing.
    luca santini
    @santoxyz
    now trying
    tdb merge -o merged data-1year.tdb data-chunk-3minutes.tdb
    the process is still in progress... it has generated 33GB of temp data and has been running for minutes, on a fast SSD.
    This is not acceptable in my embedded scenario :(
    I'm starting to think that what i want to do is not doable at all.
    Ville Tuulos
    @tuulos
    during creation, tdb uses local disk for tmp files quite extensively. Reading tdbs should be very efficient even in resource-constrained environments but writing hasn't been optimized for such cases
    Marius
    @marius-plv
    Hi Luca, from personal user experience with tdb, the time of updating a single tdb (even with small amounts of data) seems to increase with the tdb file size. I understand that this was not a design requirement, as the goal was fast read operation. (Personally, I would also enjoy having faster tdb write times.) But what helped in my case (which is rather a work-around) was creating smaller tdbs in a RAM-based filesystem (on Linux this could be ramfs, tmpfs, ..). The time of updating is still going to increase with the tdb size, but the operation itself will be factors faster. So my understanding is that, to optimize the write time, the ideal case would be to have independent tdb files (second/minute/hour/day/you choose; the smaller the time period, the faster the writes should run - measuring would confirm what is best) and not merge these.
    ruchirj
    @ruchirj
    Hello, I am interested in trying out TrailDB as a telemetry datastore for alerting on events. We may have millions of events coming in every second
    The events themselves are JSON blobs, and event group identity can be inferred from a subset of the JSON properties. The use case is to build histograms of counts by event group. We will be write-intensive and the reads are going to be visualization-driven. Does it make sense to use TrailDB for this use case?
    Ville Tuulos
    @tuulos
    TrailDB is optimized for read-heavy use cases, complex analytics etc. If your workload is write-heavy with simple read queries, TDB might not be the best fit
    Chen Xinlu
    @boisde
    hello, i am new to traildb, is there any recent benchmark on it?
    for reference.
    Ville Tuulos
    @tuulos
    no, unfortunately we don't have a recent benchmark. I can assure you it is plenty fast especially on the read side :)
    Chen Xinlu
    @boisde
    does tdb dump still support s3 on macOS? like tdb dump -i s3://xxxx/yy.tdb?
    Tried on macOS Mojave, which reports TDB_ERR_IO_OPEN
    Chen Xinlu
    @boisde
    Hello, @tuulos is there a way to recover an unfinalized tdb?
    Chen Xinlu
    @boisde
    hi, is it possible for tdb to handle sorting within a single .tdb file?