These are chat archives for HdrHistogram/HdrHistogram

10th
Sep 2015
Alec
@ahothan
Sep 10 2015 01:08
@mikeb01 I wish I could have a day like you do ;-) Kind of swamped right now, so Python testing will have to wait a bit
Alec
@ahothan
Sep 10 2015 01:34
@giltene lz4 will clearly be better than zlib. Have you heard about blosc? They have impressive benchmarks (http://www.blosc.org). The shuffle could be a pretty good fit for counters (since the vast majority of 64-bit counters have leading zero bits)
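To illustrate the shuffle idea mentioned here: a minimal sketch of a byte transpose over fixed-width counters, assuming 64-bit little-endian values. The function names `shuffle_bytes` and `unshuffle_bytes` are hypothetical and this is not blosc's actual implementation or API; it only shows why counters with leading zero bits end up as long runs of zero bytes that a general-purpose compressor can handle well.

```python
import struct

def shuffle_bytes(counters, width=8):
    """Byte-transpose unsigned 64-bit counters (a blosc-style shuffle, sketched)."""
    # Pack every counter as a little-endian unsigned 64-bit integer.
    raw = struct.pack("<%dQ" % len(counters), *counters)
    # Gather byte 0 of every counter, then byte 1 of every counter, and so on.
    return b"".join(raw[i::width] for i in range(width))

def unshuffle_bytes(data, width=8):
    """Invert shuffle_bytes and return the list of counters."""
    n = len(data) // width
    raw = bytearray(len(data))
    for i in range(width):
        # Scatter the i-th byte plane back into position i of each counter.
        raw[i::width] = data[i * n:(i + 1) * n]
    return list(struct.unpack("<%dQ" % n, bytes(raw)))
```

With small counters, only the first byte plane carries data and the remaining planes are all zero, for example `shuffle_bytes([1, 2, 3])` yields the bytes 1, 2, 3 followed by 21 zero bytes.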
TBH this idea of iterating over the counters, compressing the non-zero ones and collapsing runs of zero counters into a count, really looks like coding in assembly. Even if the code looks simple, it is a pain to optimize for speed in pure Python (yes, there may be fewer bytes to go over, but those are really slow bytes compared to decompression at native speed). I'll try to assess the damage of this new format as soon as I have a bit more time.
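The kind of per-counter loop being described might look like the sketch below. It is only an illustration: the convention of writing a run of n zero counters as the single value -n is an assumption made for this example, not a statement of the format under discussion, and `encode_runs` and `decode_runs` are hypothetical names.

```python
def encode_runs(counts):
    """Keep non-zero counters as-is; collapse each run of n zeros into -n."""
    out = []
    zeros = 0
    for c in counts:
        if c == 0:
            zeros += 1          # extend the current run of zero counters
            continue
        if zeros:
            out.append(-zeros)  # flush the pending zero run
            zeros = 0
        out.append(c)
    if zeros:
        out.append(-zeros)      # flush a trailing zero run
    return out

def decode_runs(encoded):
    """Expand the output of encode_runs back into the full counter list."""
    counts = []
    for v in encoded:
        if v < 0:
            counts.extend([0] * -v)
        else:
            counts.append(v)
    return counts
```

Every counter passes through the Python interpreter loop here, which is the cost being weighed against a single call into a native decompressor.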
Since we're discussing the format, how about using native-endian and saving some CPU cycles when operating on these counters? In most cases the decoder will have the same byte order, so let the others pay the cost.
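A rough sketch of what the native-endian suggestion could look like, assuming fixed-width 64-bit counters and a one-byte order flag in front of the payload. The flag and the names `encode_native` and `decode_any` are invented for this illustration and are not part of any HdrHistogram format.

```python
import sys
from array import array

def encode_native(counters):
    """Write counters in the writer's own byte order, prefixed by an order flag."""
    flag = b"<" if sys.byteorder == "little" else b">"
    return flag + array("Q", counters).tobytes()

def decode_any(payload):
    """Read counters, swapping bytes only when the writer's order differs from ours."""
    order = "little" if payload[:1] == b"<" else "big"
    values = array("Q")
    values.frombytes(payload[1:])
    if order != sys.byteorder:
        values.byteswap()  # only readers with the other byte order pay this cost
    return list(values)
```

When writer and reader share a byte order, the decode is a plain memory copy with no per-counter work.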
Gil Tene
@giltene
Sep 10 2015 02:32
@ahothan: Re: blosc, it looks cool, but it has very limited language support (pretty much C and things that wrap C, from the looks of it). In choosing a compression format, my main concern is near-universal availability of libraries that can encode and decode it, and for those to be stable, readily available to link to or package (e.g. Maven Central for Java stuff), and acceptably licensed. Zlib is the obvious common denominator, since it is built into many platforms. lz4 and snappy are both contenders, but I'd feel better about each if the various implementations seemed more mainstream and available via a common org.
Gil Tene
@giltene
Sep 10 2015 02:38
@ahothan Re: native byte order. LEB128's variable capacity scheme requires little endian (not just by name ;-) ). The specific scheme wouldn't work if the stream started from the big endian end. A big endian variant is certainly possible, but it would be encoded differently (e.g. shift the entire value left with each byte that isn't the terminating byte). Neither form fits any processor endianness though, so the little endian (on the wire) form using 7-bit words probably wins because it requires fewer operations.
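For reference, the little-endian 7-bit scheme described here can be sketched in pure Python as follows. This is a minimal illustration of unsigned LEB128 only, with hypothetical function names; it does not show how any particular histogram format applies it.

```python
def leb128_encode(value):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits go out first (little endian)
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: terminating byte
            return bytes(out)

def leb128_decode(data, pos=0):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # each byte supplies the next 7 bits
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Because the low-order bits arrive first, the decoder can place each 7-bit group with one shift and one OR, without knowing in advance how many bytes the value occupies.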
@ahothan Re: cost in Python: LEB128 is what Google Protocol Buffers use, so it may be worthwhile to look at what they specifically do in Python implementations of GPB. E.g. do they call out to C, or do they have "pure" Python code that you can borrow from?
Gil Tene
@giltene
Sep 10 2015 20:16
Pushed 2.1.7 to maven central.