These are chat archives for HdrHistogram/HdrHistogram

27th Feb 2017
Gil Tene
@giltene
Feb 27 2017 15:21
@jonhoo Yeh, the question of "the right range" will pop up for sure. My goal is to end up with a "standard" set of levels that all applications everywhere report (without pre-choosing the levels that are interesting to them). Something we could convince users of e.g. Metrics and Hystrix to include by default in what they log. This LowRes suggestion will end up with 3 values per order of magnitude. I could see adding 3 values to the common reporting case to get down to e.g. 100usec, but not 9 values to get down to 1usec.
For a generic API, we could allow a configurable range. The hard part is choosing the "Default" ends of the ranges. Giving up on >1 minute is just as limiting as giving up on <100usec. The tension in both cases is with "what is a reasonable number of separately tracked metrics people will accept logging into their TS-DBs for every response-time stat they ever measure?" My gut feel is that even 17 is a bit much for that.
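For concreteness, a minimal Java sketch of what such a default level set could look like. The 1/2/5 progression and the 1ms-to-1min default range are illustrative assumptions; the chat above doesn't settle either.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a "standard" low-resolution level set: 3 values per order of magnitude.
// ASSUMPTIONS: a 1/2/5 progression and a 1ms..1min default range (not settled above).
public class LowResLevels {
    static List<Long> defaultLevelNanos() {
        long[] mantissas = {1, 2, 5};              // 3 values per order of magnitude
        long lowest = 1_000_000L;                  // assumed default low end: 1 ms (in ns)
        long highest = 60_000_000_000L;            // assumed default high end: 1 min (in ns)
        List<Long> levels = new ArrayList<>();
        for (long decade = lowest; decade <= highest; decade *= 10) {
            for (long m : mantissas) {
                if (decade * m <= highest) {
                    levels.add(decade * m);
                }
            }
        }
        levels.add(highest);                       // include the top of the range itself
        return levels;
    }

    public static void main(String[] args) {
        // With these assumptions this prints 16 levels: 1ms, 2ms, 5ms, ... 20s, 50s, 1min.
        defaultLevelNanos().forEach(System.out::println);
    }
}
```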
Jon Gjengset
@jonhoo
Feb 27 2017 15:25
I guess technically I could just log all values as three orders of magnitude more than they are, and then divide again when I get them back out. Microsecond range timing is what I need most often when I reach for HdrHistogram, in part because that's where I accumulate enough data, and fast enough, that it makes sense.
Gil Tene
@giltene
Feb 27 2017 15:40
@jonhoo This LowRes idea sits at the other end of the spectrum from where HdrHistogram is, I guess. I tend to use nanosecond units in my HdrHistograms to keep things simple and even. I'll use the configurable unit size to limit storage waste when I care enough, but often I'll just burn the in-memory-resources (when I don't have to worry about lots of histograms) since the log-file or on-the-wire compressed V2 implementations don't really get bloated in size (the zeros at the bottom will just compress away).
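As a concrete illustration of the "just use nanosecond units" approach (the range and precision below are arbitrary choices, not recommendations):

```java
import org.HdrHistogram.Histogram;

// Nanosecond-unit histograms, as described above. Range/precision are illustrative.
public class NanosecondHistogramExample {
    public static void main(String[] args) {
        // Track 1 ns .. 1 hour at 3 significant digits: simple units, some wasted in-memory space.
        Histogram simple = new Histogram(3_600_000_000_000L, 3);

        // Or use the configurable lowest discernible value (here 1 us) to limit storage waste.
        Histogram trimmed = new Histogram(1_000L, 3_600_000_000_000L, 3);

        long start = System.nanoTime();
        // ... the work being measured ...
        long elapsedNanos = System.nanoTime() - start;

        simple.recordValue(elapsedNanos);
        trimmed.recordValue(elapsedNanos);

        System.out.println("p99 (ns): " + simple.getValueAtPercentile(99.0));
    }
}
```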
For the LowRes thing, I'm thinking of using some cute name, like "ResponsiGram". The main reason for it will be to create a prescriptive set of related counters (or rates) that report on the contents of histograms INSTEAD of the current practice of reporting them as a vector of percentiles. If time-series DBs actually stored histograms, and thresholding/plotting tools actually dealt with them, they should obviously be storing some HdrHistogram (or HdrHistogram-like) form. But rather than hold my breath for that to happen, I'm looking to push a practical way for people to work within what their systems seem to be able to do.
Jon Gjengset
@jonhoo
Feb 27 2017 15:48
Mmm, that seems reasonable
Marshall Pierce
@marshallpierce
Feb 27 2017 15:49
Also, if the question of what range to represent seems intractable as far as a one-size-fits-all approach, we could use a single byte to represent scale. Like, if scale k means the smallest number is 1 ms, scale k-1 means the smallest is 500 us. That way, tools could introspect the data they have and display it appropriately without having to pick one scale that works for everyone (at the cost of perhaps covering a range that isn't used) or configuring all tools to agree on what the scale is.
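A hypothetical sketch of that scale-byte idea, taking the "1 ms at scale k, 500 us at scale k-1" example literally (the names and the choice of k are made up):

```java
// Hypothetical scale-byte mapping: each step down in scale halves the smallest value.
// The anchor (1 ms at scale k) comes from the example above; k itself and the names
// here are arbitrary illustration choices. Only scales at or below k are handled.
public class ScaleByteExample {
    static final int K = 64;                              // arbitrary reference scale
    static final long SMALLEST_AT_K_NANOS = 1_000_000L;   // 1 ms, in nanoseconds

    static long smallestForScale(int scale) {
        int stepsBelowK = K - scale;
        return SMALLEST_AT_K_NANOS >> stepsBelowK;        // halve once per step below k
    }

    public static void main(String[] args) {
        System.out.println(smallestForScale(K));      // 1000000 ns  (1 ms)
        System.out.println(smallestForScale(K - 1));  // 500000 ns   (500 us)
        System.out.println(smallestForScale(K - 2));  // 250000 ns   (250 us)
    }
}
```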
Gil Tene
@giltene
Feb 27 2017 15:49
To me, current systems carry almost nothing useful at the microsecond level that reaches a Grafana dashboard anyway. Unless you want to know what the median latency is (or god forbid the average) with microsecond-level resolution, there is no data in there that is useful for you.
The data I'm looking to add can be thought of as "badness thresholds" as opposed to "I'd like to understand the distribution of latency". [Good] histograms are what you should be using if you want to understand the distribution of latency. However, ResponsiGrams are good enough if you want to ALERT when latency goes bad beyond some point.
@marshallpierce I'm thinking low tech and convention-based. I.e. use the labels that Grafana ends up plotting.
Marshall Pierce
@marshallpierce
Feb 27 2017 15:52
Entirely possible that that's good enough! It would be interesting to set up a prototype and see how well that plays out.
Gil Tene
@giltene
Feb 27 2017 15:52
E.g. statsd uses terms like "upper95" and "lower95", which have percolated into what people plot in graphing tools (ironically making all the plots read "Wrong")
So ResponsiGrams should use labels like "1ms", "5ms" ... "1min" in the series names
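Purely for illustration, series names following that convention might look something like the list below; the exact thresholds and naming style are still open here.

```
# Hypothetical series names with the threshold baked into the label:
client.response.total_count
client.response.count_above_1ms
client.response.count_above_5ms
client.response.count_above_50ms
client.response.count_above_500ms
client.response.count_above_5s
client.response.count_above_1min
```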
Marshall Pierce
@marshallpierce
Feb 27 2017 15:53
Ah, I see. So really going for clarity over compactness. Quite possibly a good tradeoff given the target market
Gil Tene
@giltene
Feb 27 2017 15:54
Yeh, "hard coded labels" and "it's just a bunch of counters" is what I'm thinking of. Don't even describe it as a histogram-with-pre-selcted-bucket-boundaries
@marshallpierce In the context of your Metrics writeup (linked to above), this would be a way to address the counter-pattern of reporting and storing histogram data as percentiles once that data leaves the original measuring system,.
Marshall Pierce
@marshallpierce
Feb 27 2017 15:56
Speaking of counters, the period over which such things are calculated is one thing that feels wrong to me about the current Metrics approach. Right now, such things are all implicitly "since the service came up". That leads to weirdness when comparing measurements from newly started services against old ones, so I'm leaning towards switching to Recorder-style buffer-swapping for everything. In your experience, are there any gotchas with that approach? It lends a really nice consistency, but maybe there are things I'm missing...
Yep, it is at least concrete progress towards not reporting incorrect numbers, even if it's not ideal.
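A minimal sketch of the Recorder-style swapping mentioned above, using HdrHistogram's Recorder; the reporting cadence and what gets printed are illustrative choices.

```java
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

// Interval reporting via Recorder: each report covers only its own interval,
// instead of being implicitly "since the service came up".
public class IntervalReportingExample {
    private static final Recorder recorder = new Recorder(3); // 3 significant digits

    // Called from request-handling threads.
    static void recordLatency(long latencyNanos) {
        recorder.recordValue(latencyNanos);
    }

    // Called from a reporting thread, e.g. every 10 seconds: atomically swaps the
    // active histogram and returns only the values recorded since the last swap.
    static void report() {
        Histogram interval = recorder.getIntervalHistogram();
        System.out.println("count: " + interval.getTotalCount()
                + ", p99 (ns): " + interval.getValueAtPercentile(99.0));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            recordLatency(1_000_000L + i * 1_000L);  // fake latencies around 1-2 ms
        }
        report();
    }
}
```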
Gil Tene
@giltene
Feb 27 2017 15:57
@marshallpierce I'm torn between recording counters vs. rates (which is what counter-over-period-of-time would effectively be)
The benefit of rates is, as you note, that there is no base value needed. And things don't look "strange" when the counters get reset
The benefit of counters is that there is no reliance on recording or tracking time boundaries in an accurate way.
The hard thing about counter-over-period-of-time is aggregation.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:00
Yeah, that is a wrinkle, but the TSDB community seems to have at least some support for that
and people struggle a lot on the metrics mailing list with "how do I turn my counter into something useful" -- it requires something to keep track, per JVM, of "what was the counter the last time I saw it"
which turns out to be pretty awkward (or people just give up)
Gil Tene
@giltene
Feb 27 2017 16:00
E.g. when you want to look at a whole cluster's behavior for a certain metric, the time windows over which you can reliably do that are determined by the boundaries of reporting from different servers.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:01
Definitely getting a coordinated boundary with high precision across many servers is going to be... challenging. But is it good enough to just say "it's once a minute" and be off by potentially (60 - epsilon) seconds from server to server?
That's what I don't know in practice. If it is good enough, then it would solve a lot of practical issues getting counter data from A to B if it could all be batched by minute or second or what have you.
Gil Tene
@giltene
Feb 27 2017 16:02
I think of it more in terms of "if reports come every 10 seconds, then we can aggregate 'ok' over periods of one minute"
Marshall Pierce
@marshallpierce
Feb 27 2017 16:02
Sure. Nyquist frequency and all that.
Gil Tene
@giltene
Feb 27 2017 16:02
The partial-overlap thing is always a pain, and the only way (I think) to deal with it is to aggregate over a window an order of magnitude longer than the reporting window resolution.
Since that sort of thing ends up needing to happen "anyway", dealing with it as "add up the counts in the relevant time windows" vs. "compute the difference between counter levels across the time span" looks roughly the same to me.
Which is why I'm "torn"
@marshallpierce I think you have a lot more experience with actually feeding this sort of stuff into monitoring systems than I do. Which do you think is easier to cram into existing tool chains?
[If I were building the whole tool chain, I'd probably prefer the Recorder style as well, but...]
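A toy illustration of why the two forms of bookkeeping come out the same once you aggregate over a longer window (all numbers made up):

```java
// Summing per-interval deltas over a minute equals differencing the cumulative
// counter across that minute; made-up numbers, sampled every 10 seconds.
public class CountersVsDeltas {
    public static void main(String[] args) {
        long[] cumulative = {100, 130, 155, 155, 190, 240, 260}; // cumulative "above 5ms" counter

        long sumOfDeltas = 0;
        for (int i = 1; i < cumulative.length; i++) {
            sumOfDeltas += cumulative[i] - cumulative[i - 1];
        }
        long counterDifference = cumulative[cumulative.length - 1] - cumulative[0];

        System.out.println(sumOfDeltas);        // 160
        System.out.println(counterDifference);  // 160
    }
}
```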
Marshall Pierce
@marshallpierce
Feb 27 2017 16:06
I don't have too much experience but I do interact with users a fair bit who are struggling with it, so we'll count that as partial credit ;) It looks like people struggle with getting systems to see count k at time t and count k+n at time t+m as two counts of the same data. They end up with a "counter" that is 2k+n.
So even though in theory infini-counter and rate-over-time are conceptually the same, in practice systems don't seem to understand the concept of "counter for a particular node that I should be calculating deltas of over time". They only know "feed me the delta".
(naturally there are exceptions to this; some systems are geared towards counter style. And of course I only see people complaining, not people saying "hey, it was easy". But I think it's reasonable for a TSDB or monitoring system to only care about deltas and not track which nodes are coming and going, and that coupled with the weirdness around new vs old services makes me favor rate-style.)
With counter style, it seems like the TSDB would need to keep track of a "jvm id" (uuid on startup?) forever to be able to calculate future increments in the count, whereas with rate style, it need only keep track of that if you choose to (at display time) aggregate deltas to see the single-jvm view, and it is otherwise safe to ignore/discard "jvm id"
Gil Tene
@giltene
Feb 27 2017 16:10
The above seems to be "the simplest thing for TS-based tools to digest (store, plot, trigger on) is delta-count logs", which seems plausible.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:10
Yeah, I think that is the case.
By April or so I should be able to take a chainsaw to Metrics guts and prototype this out.
Gil Tene
@giltene
Feb 27 2017 16:12
Which would mean that ResponsiGrams should be a series of labeled count-delta metrics
Marshall Pierce
@marshallpierce
Feb 27 2017 16:13
So endpoint "GET /foo/bar" would have its own, and so forth, presumably
I guess another way of putting it is that rate (or delta) style feels like it is more general, because you can always aggregate locally or via an intermediary if necessary into "since startup" counts, but going the other way is often more awkward.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:18
It's possible to go either way pretty cheaply with just a single counter, but for things like a complete histogram or other heftier data structure, it's expensive to get the atomic copies you would need.
So in the Metrics case it would be quite straightforward to write a hypothetical reporter for some system that wanted to have since-startup counts by simply updating and transmitting an in-memory sum, but we have good evidence that the converse has not been true historically.
Gil Tene
@giltene
Feb 27 2017 16:20
Well, whatever makes people think that client.response.avg, client.response.upper95, and client.response.upper90 are "coherent" can be maintained for client.response.countAbove5msec
If people sample them separately and they are a bit off in sample time, that's a problem they already deal with today.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:23
Indeed. Perfect is the enemy of the good and all that.
So, what would next steps be for ResponsiGram? Sample wire format? Conceptual diagram? Sketching out how it would work in (graphana, fluentd, etc)?
Gil Tene
@giltene
Feb 27 2017 16:24
If we come up with the right labels and the right thresholds, maybe we can get statsd and friends to report this stuff too.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:26
Naming wise, though, that might tend to lead people to labeling everything 'response time' rather than 'service time', which will make Martin Thompson grumpy. ;)
Gil Tene
@giltene
Feb 27 2017 16:26
I think a concrete example reporting ResponsiGram rates from some existing system, logging into Graphite, and plotting in Grafana would be good to play with.
"ServiciGram" and "LatencyGram" don't have the same ring to it ;-)
TimerGrams?
CandyGrams?
Marshall Pierce
@marshallpierce
Feb 27 2017 16:28
Hm, interesting. I don't want to over-promise but I might have a good test case for that. A friend of mine wants to resuscitate a long-dormant distributed benchmarking tool I wrote way back and use it as a tool for teaching basic systems stuff to some junior folks. Having them measure and aggregate in HdrHistograms and then ship ???Grams might be a useful (for this) and practical (for them) goal.
TimerGram's not bad. It is timing things, after all.
Gil Tene
@giltene
Feb 27 2017 16:29
I think the right place to start is on the plotting side, BTW. Imagine you had these streams of counters in your TSDB, what would you plot, and how?
E.g. I really like the approaches in plots that show "completed operations" or "timeout counts" for monitoring.
And for practical monitoring I often see people look at % completed, or more often, % timed out. (plotting the 98%-100% range for example).
Marshall Pierce
@marshallpierce
Feb 27 2017 16:32
Thinking of what I would like in my own services, throughput is fairly useless to me (even though that's what New Relic wants to give me). Some ops are slow, some are fast, and I don't care if I have 10x as many fast ones as slow ones. I'd rather know about any particular minute when the number of operations that exceed 100ms is >0.
Ah yeah, % would be good. So, >= 1% are >=100ms.
Gil Tene
@giltene
Feb 27 2017 16:32
Yup.
Above-the-threshold-count on its own is not a great metric to monitor (and threshold on)
(because its meaning changes with throughput)
Marshall Pierce
@marshallpierce
Feb 27 2017 16:33
Yeah I can see how that would quickly be an alarm that gets muffled...
Gil Tene
@giltene
Feb 27 2017 16:34
But % success (plotting 98%-100% on the Y axis) or % failure (plotting 0-2% on the Y axis) is a good dashboard thing, and a good thing to set thresholds on.
Right now this is common to see in systems that actually time out (e.g. in monitoring Hystrix)
But with TimerGrams, you can do it on "took longer than X" counts instead of just "timed out".
So you can have multiple % lines plotted together perhaps.
%-above-5msec, %-above-100msec, %-above-5sec
How easy/hard is it to put that together in a dashboard by combining logged counters tho?
E.g. you'd need something to compute count-above-5msec/total-count in each period to plot this. Is it common for people to configure plots that do that sort of thing?
Marshall Pierce
@marshallpierce
Feb 27 2017 16:38
That I don't know. But I think that the common-ness of it can be steered by providing a "Configure your fluentd/datadog/etc like this, and then you get this helpful graph".
Right now it is very much the blind leading the blind as people click around until they get ... something, whether or not it has any mathematical validity.
So even if the output of this experiment is some low level tooling/configuration for how to emit those counts, and then some docs on how to set up the right dashboard in (a selection of tools), that I think will achieve the goal. I don't see users as going into the process of dashboard creation with a fixed goal of "I want exactly these graphs". They just make what they had at their last job, or what somebody says on StackOverflow, or ...
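As a purely illustrative sketch of what such a "configure it like this" recipe could look like with Graphite-style functions (the series names are hypothetical):

```
# Hypothetical Graphite targets computing the per-period ratio from two logged counters:
asPercent(client.response.count_above_5ms, client.response.total_count)
# or as a plain fraction:
divideSeries(client.response.count_above_5ms, client.response.total_count)
```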
Gil Tene
@giltene
Feb 27 2017 16:41
Another option for what TimerGrams would log is to log % or fraction instead of counts. I.e. %above5msec
That might make it simpler for people to plot and threshold on (watch a single metric, not need to combine stuff)
Marshall Pierce
@marshallpierce
Feb 27 2017 16:42
Hm. True. More complex on the measurement side but it's usually much easier to have custom logic there than on the display side.
Gil Tene
@giltene
Feb 27 2017 16:42
This feels "wrong" to me somehow on the data integrity level
Marshall Pierce
@marshallpierce
Feb 27 2017 16:42
Or just leave that part flexible -- if you need to do percents because your display tool is simplistic, here's how it can be done, otherwise record counts.
In other words, fixed scale but user-selectable unit?
Gil Tene
@giltene
Feb 27 2017 16:43
I think prescriptive is a better way to go here. Choose one or the other (counts over a period of time, or % of total counts over a period of time), choose labels, put up examples, and then 1M people will copy it mindlessly
Marshall Pierce
@marshallpierce
Feb 27 2017 16:44
The issue with always using % is that it makes it hard to compare over time when throughput is inconsistent.
20% over 500ms is pretty bad, unless you had 5 requests that minute.
Gil Tene
@giltene
Feb 27 2017 16:44
[It's still bad ;-)]
Marshall Pierce
@marshallpierce
Feb 27 2017 16:45
Still bad, but I'll take one sad customer over hundreds :)
Gil Tene
@giltene
Feb 27 2017 16:45
One side or the other would need to compute things based on math done on two values. Either the plotted %s need this, or the plotted counts do.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:45
(And if I'm alerting via % over a threshold, I'll still get woken up. But I might go back to bed in that case.)
Yeah, perhaps the answer is we'll just have to try it out in a few tools and see what's possible.
Gil Tene
@giltene
Feb 27 2017 16:46
Yeh, that's what I meant by "play with the plotting first". I'd like to put together some examples using actual tools, and get people's reactions.
Marshall Pierce
@marshallpierce
Feb 27 2017 16:47
Oh, here's a problem with percentages, though -- suppose you have n servers (load balanced). If the distribution is not uniform, you can't really do much when it comes to aggregating the percents.
Gil Tene
@giltene
Feb 27 2017 16:47
Ok. You win with that one. Aggregating %s is evil.
[It can be done, and the data is there, but 99% of people would do it wrong]
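A toy example of the failure mode: averaging per-server percentages ignores throughput, whereas weighting by each server's total count gives the right cluster-wide figure (numbers made up):

```java
// Why naive aggregation of percentages is wrong: server A is 100x busier than B.
public class AggregatingPercentages {
    public static void main(String[] args) {
        long[] totalCounts    = {10_000, 100};  // requests per server in the period
        long[] aboveThreshold = {   100,  50};  // 1% of A's requests slow, 50% of B's

        double naiveAverage = (1.0 + 50.0) / 2.0;    // 25.5 "percent" -- wrong
        double weighted = 100.0 * (aboveThreshold[0] + aboveThreshold[1])
                / (totalCounts[0] + totalCounts[1]);  // ~1.49 percent -- the real cluster-wide number

        System.out.println(naiveAverage);
        System.out.println(weighted);
    }
}
```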
Marshall Pierce
@marshallpierce
Feb 27 2017 16:49
yeah, you would need to be able to run some non-trivial logic in the aggregation layer
Gil Tene
@giltene
Feb 27 2017 16:49
This basically shows that you need the total counts logged for the same periods no matter what (to compute %, or to aggregate %, or whatever)
Marshall Pierce
@marshallpierce
Feb 27 2017 16:49
yeah
Gil Tene
@giltene
Feb 27 2017 16:50
So since you need to rely on two metrics being logged at the same time, may as well make it clear that the plotting is doing math across them (sum(above_5msec_count) in cluster / sum(total_count) in cluster).
Marshall Pierce
@marshallpierce
Feb 27 2017 16:52
So, it should be possible. influxdb for instance will let you query the sum() of something, and calculate basic math, so the foundation is there.
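A hypothetical InfluxQL query along those lines; the measurement and field names are made up, but the sum()-plus-arithmetic shape is what it would rely on:

```sql
-- Cluster-wide fraction above threshold, per minute (hypothetical names):
SELECT sum("count_above_5ms") / sum("total_count")
FROM "responsigram"
WHERE time > now() - 1h
GROUP BY time(1m)
```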
Gil Tene
@giltene
Feb 27 2017 16:58
Let's kick this around some offline. I'd love to get some examples to evangelize on how to "monitor latency right".
I'm thinking of a talk/workshop with examples. "How YES to measure latency"?
Marshall Pierce
@marshallpierce
Feb 27 2017 17:04
OK. Let me know what I can do. In the meantime I'll try to steer that benchmark mentoring thing towards experimenting with this.