These are chat archives for HdrHistogram/HdrHistogram

22nd
Sep 2017
Alec
@ahothan
Sep 22 2017 18:15
I have a first histogram ready with a few data points we could check: the usual values (total count, min, max, average, stddev) and a selection of count at percentile and value at percentile. I can put all this in a json file as a list of dictionaries, and we can add more samples to the list
regarding value check, it could be either an exact match or a range match (for the count at percentile)
I can also add a small python program that performs the check and formats the reference values from an encoded histogram
example of properties that can be checked on that sample:
get_total_count: 3038422
get_min_value: 20
get_max_value: 46367
get_mean_value: 460.238235505
get_stddev: 156.525722997
value at percentile
10.000000: 374.000000
20.000000: 394.000000
30.000000: 414.000000
40.000000: 430.000000
50.000000: 446.000000
75.000000: 498.000000
90.000000: 556.000000
99.000000: 708.000000
99.900000: 1304.000000
99.990000: 4707.000000
99.999000: 19343.000000
99.999900: 39711.000000
99.999990: 46367.000000
count at percentile
10.000000: 303842.000000
20.000000: 607684.000000
30.000000: 911526.000000
40.000000: 1215368.000000
50.000000: 1519211.000000
55.000000: 1671132.000000
55.000100: 1671135.000000
55.001000: 1671162.000000
55.010000: 1671436.000000
55.100000: 1674171.000000
75.000000: 2278816.000000
90.000000: 2734579.000000
99.000000: 3008037.000000
99.900000: 3035384.000000
99.990000: 3038118.000000
99.999000: 3038392.000000
99.999900: 3038419.000000
99.999990: 3038422.000000
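The exact-match versus range-match check described above could be sketched roughly as follows. The dictionary layout, the function name `check_values`, and the `rel_tol` parameter are all hypothetical; only the numbers are copied from the sample. The measured values would come from whichever implementation is under test.

```python
# Reference values copied from the sample above. The dictionary layout and
# the function name are hypothetical, not an agreed samples.json format.
REFERENCE = {
    "exact": {
        "get_total_count": 3038422,
        "get_min_value": 20,
        "get_max_value": 46367,
    },
    "count_at_percentile": {
        50.0: 1519211,
        99.0: 3008037,
        99.9: 3035384,
    },
}


def check_values(expected, actual, rel_tol=0.0):
    """Compare two name->number mappings.

    rel_tol=0.0 is an exact match; a positive rel_tol accepts any value
    within expected * rel_tol of the reference (a range match).
    Returns a list of human-readable failures, empty when everything passes.
    """
    failures = []
    for name, want in expected.items():
        if name not in actual:
            failures.append(f"{name}: missing")
            continue
        got = actual[name]
        if abs(got - want) > abs(want) * rel_tol:
            failures.append(f"{name}: expected {want}, got {got}")
    return failures
```

A caller might run `check_values(REFERENCE["exact"], measured)` for the exact properties and pass a small `rel_tol` for the count-at-percentile entries; the right tolerance is not settled in this discussion.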
Alec
@ahothan
Sep 22 2017 18:20
this sample represents the latency distribution in usec for storage read I/O from 6 VMs in an openstack cloud with a ceph storage cluster (an aggregate of 50K IOPS, or I/O operations per second, for 60 seconds)
since we're talking about integrated metadata, it would have been nice to at least be able to specify the unit in the encoded histogram (an arbitrary string such as "usec", "sec") or, perhaps more flexibly, to define a standard json schema for describing a histogram (along with the encoded string)
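As an illustration of what such a descriptor might look like, here is a minimal sketch that bundles an encoded string with a unit. The field names (`encoded`, `unit`, `description`) and both helper functions are assumptions made for this example; no schema was agreed on in this conversation.

```python
import json

REQUIRED_FIELDS = ("encoded", "unit")  # hypothetical minimum set


def make_descriptor(encoded, unit, description=""):
    """Bundle an encoded histogram string with its metadata."""
    return {"encoded": encoded, "unit": unit, "description": description}


def validate_descriptor(descriptor):
    """Raise ValueError if a required field is missing or empty."""
    for field in REQUIRED_FIELDS:
        if not descriptor.get(field):
            raise ValueError(f"missing field: {field}")
    return descriptor


# "<base64 string>" is a placeholder, not real histogram data.
doc = json.dumps(make_descriptor("<base64 string>", "usec", "storage read latency"))
validate_descriptor(json.loads(doc))
```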
Alec
@ahothan
Sep 22 2017 18:26
instead of a json file + a sidecar file, I'd rather see everything go into 1 samples.json file
if you're ok with the format, I'll commit a first draft to Common
Alec
@ahothan
Sep 22 2017 18:32
we also need to decide whether the code to verify compliance with the samples resides in each implementation repo or in Common. From a logistics point of view (environment setup), having such code reside in each implementation repo looks simpler
Marshall Pierce
@marshallpierce
Sep 22 2017 19:29
@ahothan A few points -- if the histograms are separate files, then they will be smaller (won't require base64 to reside in json) and we can regenerate the json with less confusing diffs. I don't think it would be too difficult to just iterate across every *.histo file and look for the accompanying json... Anyway I agree that the code to use this stuff should probably reside in each repo. Otherwise it would be constant churn keeping things up to date
I think the metadata should probably be generated with the Java impl since it is the reference implementation
I've started work on a (speculative) tool to read serialized histograms and output json with the Java implementation (written in Kotlin, but interop is easy)
I think it will also be easier to add new histograms, etc, if that doesn't require modifying one big json file by hand. Easier to throw a new .histo file in a directory, re-run the metadata tool, and call it a day.
btw @giltene it looks like the new Common repo maybe needs its settings tweaked? I couldn't push to it
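The "iterate across every .histo file and look for the accompanying json" idea could look something like this in a consuming repo. This is only a sketch: the function name is hypothetical, and it assumes the sidecar shares the histogram's base name with a `.json` suffix, which has not been decided here.

```python
from pathlib import Path


def pair_with_sidecars(directory):
    """Return (histogram_path, sidecar_path_or_None) for each *.histo file.

    Assumes the sidecar shares the histogram's base name with a .json suffix.
    """
    pairs = []
    for histo in sorted(Path(directory).glob("*.histo")):
        sidecar = histo.with_suffix(".json")
        # Report a missing sidecar as None so the caller can decide
        # whether to skip the sample or fail the run.
        pairs.append((histo, sidecar if sidecar.is_file() else None))
    return pairs
```

A test harness could loop over the returned pairs, decode each histogram with its own implementation, and compare against the values in the sidecar.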
Marshall Pierce
@marshallpierce
Sep 22 2017 19:35
Anyway, obviously we can reshape as needed but I was just going off what I would want to test on my own implementation: basic properties, plus several ways of iterating through each of the possible iterations (where possible)
Marshall Pierce
@marshallpierce
Sep 22 2017 19:50
And, I've literally never run the code since I didn't have any histogram files lying around, so who knows what it does ;)
Alec
@ahothan
Sep 22 2017 20:33
@marshallpierce I did not know there was an official API to output the binary version of the encoded histogram. I know I have one in python but it was for internal use mostly
base64 has the advantage of being easier to deal with than a binary file and does not take that much space
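To put a number on the space cost: base64 emits 4 output characters for every 3 input bytes, so the text form is about one third larger than the binary. A quick sketch using an arbitrary byte payload (not a real histogram) shows the ratio; the helper name is made up for this example.

```python
import base64


def base64_size(raw):
    """Return (raw_length, encoded_length) for a byte string."""
    return len(raw), len(base64.b64encode(raw))


payload = bytes(range(256)) * 12  # 3072 arbitrary bytes, not a real histogram
raw_len, b64_len = base64_size(payload)
print(raw_len, b64_len)  # 3072 4096

# The encoding is lossless: decoding gives back the original bytes.
assert base64.b64decode(base64.b64encode(payload)) == payload
```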
Alec
@ahothan
Sep 22 2017 20:56
that is also how the histogram log stores them
Alec
@ahothan
Sep 22 2017 21:06
anyway here's the histogram base64 in case you want to try it out:
Alec
@ahothan
Sep 22 2017 21:13
@marshallpierce regarding the code, I have nothing against kotlin but I'm not planning to add one more language to my list so won't be able to review it ;-)
I think the reference code should be in (real) Java as well
we can split samples if you don't like one big samples.json
I still think 1 json per sample would be a better "packaging" (with the histogram in base64 inside along with all the metadata)
Marshall Pierce
@marshallpierce
Sep 22 2017 22:02
We can move it to plain old Java if needed, but I think for this particular snippet of code the ability to use keyword parameters is a pretty nice plus.
Anyway, the problem with bundling the histogram into the json is that it makes adding new histogram examples, or rewriting the metadata, more complex.
If the metadata is a sidecar, all you need to do to add a histogram, or rewrite the metadata to add more fields, is re-run the tool across all files.
If the histogram is embedded, to add a file you would need to, I dunno, hand-edit a file containing just the base64, then run the tool in a different mode perhaps? It's just messy.
There are definitely official serialization formats, several in fact :)
I would hope that we end up with at least one example that's duplicated: same data, serialized in different ways.