These are chat archives for HdrHistogram/HdrHistogram

25th
Apr 2016
Gil Tene
@giltene
Apr 25 2016 16:19
A question for the maintainers of the non-Java versions: I'm thinking of addressing HdrHistogram/HdrHistogram#94 by adding a new [optional] line format to the histogram log. Lines that start with "Tag:" will have an additional tag field (a string that starts immediately after the ":" and ends at the next ",", and must not include a ",") parsed as the first field, followed by the existing line format (3 comma-delimited doubles, followed by an encoded histogram).
This is a relatively small change, keeps backward compatibility (lines without a tag have "the default tag", and existing logs can still be read), and should be fairly easy to implement. (I considered allowing tags to include "," by supporting escapes with "\", but decided against it for now to keep parsing code simple, including with Linux command-line tools.)
I'd like to implement this in the next couple of days in the Java version, since we have a real use case (Cassandra stress has evolved to log per-operation latencies, and tagging will really help since we can keep everything in one line). So if you have objections or better ideas, now is the time to chime in
Oh, and tags will be treated as completely orthogonal, interleaved data. I.e. it is reasonable to expect timestamps to grow monotonically and not overlap within a single tag. But the timestamps between different tags can come "out of order". And we make no attempt to create rules for what an "interval" is across multiple tags.
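As an illustration of the proposed line format, here is a minimal Python sketch of a parser. It assumes the "Tag:" prefix exactly as described above and treats the encoded histogram as an opaque string. The names `LogLine` and `parse_log_line` are hypothetical and are not part of any HdrHistogram API.

```python
from typing import NamedTuple, Optional, Tuple

TAG_PREFIX = "Tag:"  # prefix as proposed in the message above


class LogLine(NamedTuple):
    tag: Optional[str]                    # None stands for "the default tag"
    numbers: Tuple[float, float, float]   # the 3 comma-delimited doubles
    encoded: str                          # encoded histogram, left undecoded


def parse_log_line(line: str) -> LogLine:
    """Split one log line into its optional tag and the existing fields."""
    line = line.rstrip("\n")
    tag = None
    if line.startswith(TAG_PREFIX):
        # The tag runs from just after ":" up to the next ",".
        tag, sep, line = line[len(TAG_PREFIX):].partition(",")
        if not sep:
            raise ValueError("tagged line has no ',' after the tag")
    # Existing format: three doubles, then the encoded histogram.
    fields = line.split(",", 3)
    if len(fields) != 4:
        raise ValueError("expected 3 numeric fields and an encoded histogram")
    a, b, c, encoded = fields
    return LogLine(tag, (float(a), float(b), float(c)), encoded)
```

Because tags may not contain ",", a single `partition` call is enough to separate the tag from the rest of the line, and untagged lines go through the same code path as before.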
Marshall Pierce
@marshallpierce
Apr 25 2016 16:37
Would you envision that users might encode multiple dimensions of data into the tag (op = read, keyspace = foo, cpusocket = 3) if they were gathering very granular data, or that users would embed just one thing (e.g. read) and it would be part of the context of the log to say that read was an operation? Just wondering about goals vs. non-goals.
Alec
@ahothan
Apr 25 2016 18:13
@giltene, the proposed changes look good to me. The current Python version will silently ignore tagged histograms in the log (because of a regex mismatch); not sure what other versions do... I also assume you will add a new property to each histogram object to get/set the tag, and that the default (no tag) is the same as setting the tag to ""? And no change in the signature of the method to log a histogram?
Also assume you will increase the log version?
Gil Tene
@giltene
Apr 25 2016 18:42
@marshallpierce : The goal is to allow multiplexing multiple histogram logs in one file, with tags identifying the individual log streams. How this muxing is used will depend on the user/app. So e.g. I can see some folks using the tags to identify things with fine granularities, while others might just use coarse (e.g. read vs. write) levels. Or both. E.g. aggregate logs (all ops, all reads, all writes) can be muxed in the same file as finer grain logs (e.g. read keyspace = foo). A specific non-goal is to require or suggest any relationship between the tags (i.e. are they inclusive? exclusive? does aggregating all of them together into some overall stats histogram have any meaning?). Those will be app specific choices.
@ahothan : Good to hear the existing Python version will know to ignore the new lines (the existing Java version will not, unfortunately). Still debating on the API side whether histogram objects carry [optional] tags, or if the tag is an API parameter. I'm gravitating towards making this an [optional] API parameter on the logging side, and not carrying it in the histogram objects. On the log reading side, I'm thinking that the API can take an [optional] tag which acts as a simple filter [only read lines that exactly match that tag, and if no tag is given then it only reads untagged lines].
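The reader-side filter described above could look roughly like the following Python sketch. It assumes the input holds only histogram lines (no header or comment lines) and uses the "Tag:" prefix from the proposal. `line_tag` and `filter_by_tag` are hypothetical helper names, not an existing reader API.

```python
from typing import Iterable, Iterator, Optional

TAG_PREFIX = "Tag:"  # prefix as proposed earlier in the discussion


def line_tag(line: str) -> Optional[str]:
    """Return the tag of a line, or None for an untagged line."""
    if not line.startswith(TAG_PREFIX):
        return None
    return line[len(TAG_PREFIX):].split(",", 1)[0]


def filter_by_tag(lines: Iterable[str],
                  tag: Optional[str] = None) -> Iterator[str]:
    """Yield only lines whose tag exactly matches `tag`.

    With tag=None (the default), only untagged lines are yielded.
    """
    for line in lines:
        if line_tag(line) == tag:
            yield line
```

With this shape, existing callers that pass no tag keep seeing only the untagged stream, which matches the backward-compatible behavior being discussed.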
Gil Tene
@giltene
Apr 25 2016 18:47
The question this raises is how does a reader find the tags in the file...
I mean "discover" which tags exist in the file. If the tags are known, this is easy. But there should probably be a way to discover tags.
jodzga
@jodzga
Apr 25 2016 20:27
Has anyone tried to use HdrHistogram with Hive or Pig?
Alec
@ahothan
Apr 25 2016 21:08
@giltene, adding an optional API argument is simple enough for read and write; my only doubt is how to read the next histogram regardless of the tag? You'd need a way to return the actual tag (and in this case having a tag property for the histogram would work well). It should also be easy to have a new function to return the list of all available tags in a log file?
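A rough Python sketch of both ideas, reading a line regardless of its tag while returning the tag, and listing the tags present in a file. It assumes the "Tag:" prefix from the proposal and an input of histogram lines only. `split_tag` and `discover_tags` are hypothetical names used for illustration.

```python
from typing import Iterable, List, Optional, Tuple

TAG_PREFIX = "Tag:"  # prefix as proposed earlier in the discussion


def split_tag(line: str) -> Tuple[Optional[str], str]:
    """Return (tag, rest_of_line); tag is None for an untagged line."""
    if line.startswith(TAG_PREFIX):
        tag, _, rest = line[len(TAG_PREFIX):].partition(",")
        return tag, rest
    return None, line


def discover_tags(lines: Iterable[str]) -> List[Optional[str]]:
    """List distinct tags in first-seen order (None = untagged lines)."""
    # dict.fromkeys keeps insertion order and drops duplicates.
    return list(dict.fromkeys(split_tag(line)[0] for line in lines))
```

Discovery in this sketch needs a full pass over the file, since nothing in the proposed format declares the tags up front.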
Michael Barker
@mikeb01
Apr 25 2016 21:14
@giltene Is there any value in looking at a totally different format that would be more extensible, e.g. a generic tag/value structure so that adding things like this in the future won't break the existing format.
Martin Thompson
@mjpt777
Apr 25 2016 21:20
A format like SBE could be used for the specification. Then people can generate stubs from it or write their own. Extension is part of the design. Full disclosure: I was involved with SBE.
Michael Barker
@mikeb01
Apr 25 2016 21:21
Do you have C, python, Rust, .NET and Erlang bindings yet?
Martin Thompson
@mjpt777
Apr 25 2016 21:21
I needed to understand the encoded format for HdrHistogram lately and needed to resort to reading the code. A schema would have been more helpful.
Specifying with the schema is independent of bindings.
Michael Barker
@mikeb01
Apr 25 2016 21:22
Good point
An EBNF grammar would be another option.
Martin Thompson
@mjpt777
Apr 25 2016 21:23
Anything that is a clear schema
Michael Barker
@mikeb01
Apr 25 2016 21:23
Yes
Martin Thompson
@mjpt777
Apr 25 2016 21:23
Needing to read the code is not a good option.
Michael Barker
@mikeb01
Apr 25 2016 21:38
@giltene This will break the C code (which just uses a simple sscanf); however, if you ensure that there is a version bump to the log file format, I can work from that without changing too much.