These are chat archives for connectordb/connectordb

26th Nov 2016
xloem
@xloem
Nov 26 2016 03:03
hi, connectordb looks almost exactly like what I am looking for, but it looks like it is designed with "sparse" data in mind. I am logging "dense" data: audio streams, video streams, and sensor readings that are read as fast as the sensor can spew out, rather than occasionally. I use compression codecs to shrink this data.
Is this appropriate for connectordb, or is it designed to be limited to sparse data?
Daniel Kumor
@dkumor
Nov 26 2016 03:19
I have thought about using it for dense data, but have never actually tried it! The redis backend can take several thousand inserts per second, and I have benchmarked ConnectorDB at ~200 inserts per second on a tiny virtual machine. That being said, ConnectorDB does not really offer primitives for inserting binary data - you'd have to convert your streams to base64. Also, unless the streams are set to ephemeral (not saving data), you will need quite a bit of hard drive space.
If you use compression codecs, I would recommend attempting to shrink the data into chunks of <1MB, and inserting those in base64.
Finally: if you're doing lots of inserts, make sure to use the newest (development) version of CDB, as there have been a lot of changes since alpha 1
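A minimal sketch of the chunk-and-base64 approach described above, in Python; insert_datapoint() is a hypothetical stand-in for whatever insert call the ConnectorDB client actually exposes, and CHUNK_SIZE is an arbitrary value chosen to stay under the ~1MB suggestion:

```python
import base64

CHUNK_SIZE = 900 * 1024  # arbitrary, chosen to stay a bit under ~1MB per datapoint

def insert_datapoint(stream_path, value):
    """Hypothetical placeholder: replace with the real ConnectorDB client's insert call."""
    raise NotImplementedError

def upload_compressed_file(path, stream_path):
    """Split an already-compressed recording into chunks and insert them as base64 strings."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # base64 turns raw bytes into text that a string-typed stream can store
            insert_datapoint(stream_path, base64.b64encode(chunk).decode("ascii"))
```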
Daniel Kumor
@dkumor
Nov 26 2016 03:25
If it turns out that ConnectorDB can't fit your use case (and we can't make some changes to fix it), you could look at kafka https://kafka.apache.org/ - which has been used for very fast data streaming.
xloem
@xloem
Nov 26 2016 03:25
what do you think of storing the data outside of the database, such as encrypted on ipfs, and then inserting a stream of urls, keys, or pathnames?
oh nice I'll check kafka
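As a rough sketch of what a "stream of pointers" could look like, here is a hypothetical JSON Schema (written as a Python dict) for datapoints that reference data stored outside ConnectorDB; the field names are assumptions, not an existing ConnectorDB convention:

```python
# Hypothetical schema for a stream whose datapoints point at externally stored data
# (e.g. an ipfs hash or a pathname on an external drive) rather than holding it inline.
pointer_schema = {
    "type": "object",
    "properties": {
        "url":  {"type": "string"},   # ipfs hash, URL, or local pathname
        "key":  {"type": "string"},   # decryption key or key identifier, if encrypted
        "size": {"type": "integer"},  # size in bytes, handy when deciding what to fetch
    },
    "required": ["url"],
}
```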
Daniel Kumor
@dkumor
Nov 26 2016 03:29
Storing the data on ipfs would probably be slower than inserting it directly into ConnectorDB. While I have not yet used ConnectorDB for such things, it should just work - I simply have not tested it thoroughly, and have not defined standard audio/video formats to use in the database.
xloem
@xloem
Nov 26 2016 03:32
hrm my gut balks at the concept of base64'ing a video stream, especially on a mobile device, but I suppose it's just a temporary measure
kafka looks great but there doesn't seem to be an existing ecosystem for logging and marking life data, so I'd have to re-create a lot of work already done here
Daniel Kumor
@dkumor
Nov 26 2016 03:33
That's correct - if you open an issue for it, I think I can add a custom binary data json schema type, which would allow direct binary insertion.
Technically, the backend database (postgres/sqlite) converts binary data to base64 anyways, so it would still be base64 in the backend - but presumably, the goal is to not convert to base64 on a mobile device before inserting, is that correct?
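Purely as a hypothetical illustration of the feature being discussed: neither a "binary" schema type nor a raw-byte insert call exists in ConnectorDB at this point, so everything below is speculative.

```python
# Speculative sketch only: a not-yet-existing "binary" stream type, plus a client
# call that would accept raw bytes and leave any base64 conversion to the server
# instead of doing it on a mobile device.
binary_stream_schema = {"type": "binary", "content": "audio/opus"}  # hypothetical

def insert_binary(stream_path, raw_bytes):
    """Hypothetical future client call; does not exist today."""
    raise NotImplementedError
```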
xloem
@xloem
Nov 26 2016 03:36
well that's one concern, but if I have a gigabyte of recordings I don't really want that to be converted to 4 gigabytes by the storage system
but there must be some way to configure postgres/sqlite to not do that; I know they are commonly used for binary storage
Daniel Kumor
@dkumor
Nov 26 2016 03:38
base64 converts each 3 bytes into 4 bytes, so it is just a 33% increase in size. Furthermore, the backend can be set up to gzip data as it is stored, effectively negating the 33% increase.
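The arithmetic behind the 33% figure, spelled out:

```python
# base64 maps every 3 input bytes to 4 output characters, so the overhead is 4/3.
raw_bytes = 1 * 1024**3                 # 1 GiB of recordings
encoded_bytes = raw_bytes * 4 / 3       # ~1.33 GiB after base64
print(encoded_bytes / 1024**3)          # ~1.33, i.e. a 33% increase rather than 4x
```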
xloem
@xloem
Nov 26 2016 03:38
sorry, did the math in my head wrong
hrmmmm if you say so - it doesn't feel quite right, but I guess it would be a temporary measure somebody could find a solution to if it ends up mattering
Daniel Kumor
@dkumor
It can be converted to use large object storage - but I'd have to benchmark it, as presumably, each datapoint would be only a couple seconds of audio/video, which would be <10MB each
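Back-of-the-envelope sizes for "a couple seconds per datapoint", using illustrative bitrates (the codecs and rates here are assumptions, not measurements):

```python
# Illustrative bitrates only; real sizes depend entirely on the codec and settings.
video_bps = 8_000_000    # ~8 Mbit/s, a typical 1080p stream
audio_bps = 1_411_200    # 16-bit / 44.1 kHz stereo PCM, before lossless packing
seconds = 2

video_mb = video_bps * seconds / 8 / 1e6   # ~2 MB per datapoint
audio_mb = audio_bps * seconds / 8 / 1e6   # ~0.35 MB per datapoint
print(video_mb, audio_mb)                  # both comfortably under 10 MB
```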
xloem
@xloem
Nov 26 2016 03:41
sounds reasonable
(that might eliminate some codecs, but that's fine)
Daniel Kumor
@dkumor
Nov 26 2016 03:42
yes, it is a bit annoying - I have thought about this issue for quite some time, but I don't see a way to fix this without fundamentally altering the database architecture.
do you have preferred audio/video codecs you'd like to use? I would like to add a binary schema type, for which there would be pre-defined datatypes that the frontend can process
xloem
@xloem
Nov 26 2016 03:46
well for precise data something uncompressed such as wavpack
err lossless, not uncompressed
lemme google a bit
Daniel Kumor
@dkumor
Nov 26 2016 03:47
Can you open an issue for this? I can add it to the TODO for beta 1
This was on my bucketlist for a long time now - might as well get it done.
I see the issue! great!
xloem
@xloem
Nov 26 2016 03:49
oh, I made another
xloem @xloem merges them
xloem
@xloem
Nov 26 2016 16:32
so, up until now I have been storing my data with git-annex. Git-annex is really great at letting you back something up to an external hard drive, amazon glacier, ipfs, etc, and then delete it locally; if you try to access it, it knows where it has been stored and lets you retrieve it. This is really great when the amount of data stored begins exceeding the size of a drive. Any thoughts on 'archiving' some data in connectordb this way? I can imagine perhaps creating a device which downloads sets of data, stores them, and then deletes them from connectordb -- for example, perhaps connectordb would store only the 'steps' stream and a little bit of the raw 'accelerometer' stream, while most of the 'accelerometer' stream is archived. But if a 'pushups' stream bursts into existence, perhaps manually labeled from a video stream, it may want to retrieve some of the archived 'accelerometer' data to see if it can automatically detect times when I do pushups in the future ...
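A very rough sketch of this "archiver device" idea, assuming hypothetical read_range()/delete_range() helpers for the ConnectorDB side (no such API is confirmed here); only the git-annex call is a real command:

```python
import json
import subprocess

def read_range(stream_path, t_start, t_end):
    """Hypothetical: fetch datapoints from a ConnectorDB stream for [t_start, t_end)."""
    raise NotImplementedError

def delete_range(stream_path, t_start, t_end):
    """Hypothetical: remove the archived datapoints from ConnectorDB."""
    raise NotImplementedError

def archive(stream_path, t_start, t_end, out_path):
    """Dump a time range to a file, hand it to git-annex, then drop it from the database."""
    datapoints = read_range(stream_path, t_start, t_end)
    with open(out_path, "w") as f:
        json.dump(datapoints, f)
    # git-annex tracks the file so it can later be moved to an external drive, glacier, ipfs, etc.
    subprocess.run(["git", "annex", "add", out_path], check=True)
    delete_range(stream_path, t_start, t_end)
```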
Daniel Kumor
@dkumor
Nov 26 2016 16:57
The idea of automated machine learning and labeling is exactly the goal of connectordb - but I have personally not given thought to offloading data yet. At some point an offloading api will probably exist - or at least a way to distribute data over multiple machines.