These are chat archives for HdrHistogram/HdrHistogram

22nd May 2018
Michael Barker
@mikeb01
May 22 2018 00:16
mkdir release; cd release; cmake -DCMAKE_BUILD_TYPE=Release ..
make
There is a binary in test called perftest, which is where I get my numbers from.
Alec
@ahothan
May 22 2018 00:41
Thanks, I think I was using the debug build
a lot better now
Iteration - 1, ops/sec: 193,916,227.41
Iteration - 2, ops/sec: 198,032,899.60
Gil Tene
@giltene
May 22 2018 02:39
Btw, it’s worth noting that the C implementation of hdr_interval_recorder is the equivalent of the java SingleWriterRecorder. The java Recorder safely supports multiple writers and multiple readers, and is wait-free for writers as long as they don’t perform resizing (when autosize is on). But the single writer variant is about 2x faster.
Marshall Pierce
@marshallpierce
May 22 2018 02:41
OK, that makes more sense. On an i7-6850K, the release build is at about 266M/s, which is in the ballpark.
Alec
@ahothan
May 22 2018 06:48
@mikeb01 what would happen if a thread keeps writing on hdr histogram and another thread does an hdr encode?
I don't think it will crash and it looks like the only side effect is the encoded histogram might have a few bucket counts off?
In my use case, it turns out that the histogram might need to be sent at intervals during a benchmark from a REST server thread, and it is OK to have a few counters off if that can save us from having to use 3 instances of histograms. Our use case has tight memory requirements and potentially lots of streams to record latency values, up to 32K streams; if we had to use 3 histograms per stream, at 100KB/histogram that would be over 3GB of RAM.
Gil Tene
@giltene
May 22 2018 06:56
@ahothan are you sure you want the histograms to keep accumulating? In most use cases I've found, once you move to recording histogram intervals, it is more useful for each interval to be counted separately from 0.
Michael Barker
@mikeb01
May 22 2018 06:57
I think I'd have to consider that undefined behaviour.
Gil Tene
@giltene
May 22 2018 06:57
The reason I ask is that you can probably reduce footprint by using int or short counts.
And you also won’t need a 3rd histogram
Michael Barker
@mikeb01
May 22 2018 06:58
unfortunately the C histogram doesn't support 16/32 bit counts
Gil Tene
@giltene
May 22 2018 07:01
As for the concurrent behavior: I’m with @mikeb01 on that. It would be undefined behavior, and the “ok to lose a little” can be problematic. I’d probably prefer crashing/faulting behavior in the encode.
@mikeb01 it’s easy enough to extend and add an int variant ;-). These many-histograms-needed use cases are what I did it for in java...
Michael Barker
@mikeb01
May 22 2018 07:04
It's C, so I guess I could just cast a void* right? ;-)
@ahothan What level of accuracy are you using?
With 2 significant figures, the size of the histogram will be <1K. Which would be 96MB for 32K streams and 3 copies.
Gil Tene
@giltene
May 22 2018 07:10
A histogram with 32-bit counters that covers from 0...2^32 with 2 decimal points will have a footprint of ~13KB. The same with 64-bit counters is about 26KB.
Michael Barker
@mikeb01
May 22 2018 07:13
Okay, I must have been thinking about the encoded size.
Gil Tene
@giltene
May 22 2018 07:18
If you limit yourself to e.g. 1usec smallest discernible value and a max recorded value of e.g. 60 seconds, you’d save a few buckets, but not that much (you’d only save 5 buckets and about 2.5KB compared to tracking to max_int in 1 usec units)
Gil Tene
@giltene
May 22 2018 07:28
The choice between 1 nsec and 1 usec for lowest discernible value is worth 8 buckets, which (at two decimal points) is 8KB with 64 bit counters, and 4KB with 32 bit counters. I usually prefer to burn that to keep things simple on the recording path (where we usually get time in nsec), but when space is at a premium, using the right lowest discernible value is helpful.
Gil Tene
@giltene
May 22 2018 07:38
If you only use 2 histograms per stream, you’d be down to ~1.25GB.
Alec
@ahothan
May 22 2018 07:43
Our range (for the traffic generator) is 1 usec to 1 sec; we were trying to get 3-digit precision but the size is around 120KB. With 2 digits that goes down to 14KB, so we may have to settle for that.
the typical use for traffic generators is to run for a potentially long time (hours) and make sure we're not dropping any packets, and at the same time have a complete latency histogram for the full run plus periodic histogram reporting (e.g. every 10 sec), which could be either cumulative (since start) or periodic (since the start of the last period)
the number of streams just forbids having too much footprint per histogram.
Alec
@ahothan
May 22 2018 07:48
Today all traffic generators (including expensive commercial tools) use simple fixed-size buckets with coarse grain per stream
and they have a hard time reporting detailed latency histograms due to the lack of an efficient compressed serialized format to report with, which is where HdrHistogram helps with its multi-language support
Alec
@ahothan
May 22 2018 07:54
Anyway it looks like I would need 3 histograms per stream, which effectively represents ~1.3GB with 2-digit precision: 3×32K instances at 14KB per instance.
Gil Tene
@giltene
May 22 2018 07:55
2 digit precision is far better than any fixed buckets stuff has, and more than enough for any visualization. It took me a while to get used to the idea, but that’s the only precision I use for recording now.
In jHiccup, I chose a 20usec lowest discernible value because I found that it statistically improved the compressed line size somewhat.
Alec
@ahothan
May 22 2018 07:56
Given that the periodic reporting does not have to be 100% precise (a few counters off is fine), it is tempting to forgo the use of 3 histograms and just use 1.
I really don't see what can go wrong with the current code; the encode simply reads a fixed list of buckets whose counts may get incremented while iterating
don't see any crash happening
as for 2-digit, I agree it looks sufficient
3-digit requires significantly more memory
Gil Tene
@giltene
May 22 2018 08:43
@ahothan I used to do accumulated histograms, and dropped that practice after using interval logs for a while. I dropped it not because of space, but because of complexity and obvious-wrongness that becomes apparent when you process interval logs. The main problem with accumulated histograms (beyond needing to keep them somewhere) is that there is no obvious right point to start accumulating from. Most designs that accumulate histograms deal with that by either hardcoding or accepting a “warmup period” time, and only starting the accumulation after that time has passed. When you start using interval logs, the silliness of recording with a single warmup time becomes apparent: you can establish any “warmup period” you want to ignore during log processing, and you can easily correct for wrong warmup lengths without rerunning your experiments. As a result, code I write myself now only uses interval logs, and may “emulate” an accumulated-histogram-after-warmup behavior at exit by reprocessing the log.
Another important point here is that while the log format will certainly allow you to output an ever-accumulating histogram at each interval, doing so has two key downsides: the first and more important one is that most tooling that reads interval histograms will probably not do what you want, and you will need to add a post-processing step to convert from an accumulation log to an interval log. The second is that the output will not be as compressible as interval histograms are.
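For concreteness, an interval log (e.g. as written by jHiccup's HistogramLogWriter) looks roughly like the following; the header lines and all values here are illustrative, and the exact headers vary by version:

```text
#[Logged with jHiccup version 2.0.x]
#[Histogram log format version 1.2]
#[StartTime: 1527000000.000 (seconds since epoch)]
"StartTimestamp","Interval_Length","Interval_Max","Interval_Compressed_Histogram"
0.127,1.007,2.769,HISTFAAAA...
1.134,1.001,1.442,HISTFAAAA...
```

Each line carries one interval's histogram, compressed and base64-encoded; since each interval starts from zero, the per-line payloads stay small and compress well, which is what an ever-accumulating histogram per line would lose.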
Gil Tene
@giltene
May 22 2018 08:54
@ahothan Here is an important trick that makes thread-safe interval recording as cheap as the alternative: when sampling and logging for large numbers of streams using recorders, you don’t actually need 2 histograms per Recorder. You only need M + N histograms, where M is the number of streams and N is the number of threads doing the sampling/logging work. Each thread can use the interval histogram it gets when sampling the recorder for stream m (after it is done using it for logging or accumulation) as the input to the sample call for stream m+1.
With this trick, if you replaced accumulation with interval recording, you won’t be spending any more space per stream.
And if you chose to still maintain accumulated histogram, you’d still only need 2 histograms per stream
Marshall Pierce
@marshallpierce
May 22 2018 11:09
If it helps we could add C-friendly no_mangle functions to the Rust version since it does support all the different counter types (but not recorders... yet). But either way interval logs are great.
Marshall Pierce
@marshallpierce
May 22 2018 11:17
If you defer deserializing the base64'd histograms, you can rip through an interval log pretty quickly too, so that shouldn't be a concern (about 1-2 us per interval). This makes it practical to have a big log file from which you select just the intervals you care about.
Alec
@ahothan
May 22 2018 15:01
@giltene that trick is neat! In our case we have only 1 reaper thread, so the total cost would be 1+M, or 1+2*M if we keep the aggregation. Much better than 3*M. It will be like recycling an instance and passing it to the next stream until all streams are reaped.
Alec
@ahothan
May 22 2018 15:18
regarding the use of accumulated histograms and whether they are useful, I think it is good to consider the case of network data plane benchmarking as it has been done over the last 2 decades. Traditionally, the industry has used very expensive hardware traffic generators (built on specialized network cards) to generate traffic at very high rates and count things like how many packets were dropped and how long it took them to come back. Because of the very high rate, up to 100M packets/sec, the only way to count packets sent and received precisely is to start all streams at a given time, stop at the same time, wait for all packets to return, then count the received packets, which yields the precise drop count. In-flight counting of packets will never be accurate. A warm-up period is of course always done, but as part of a separate run right before the real run (for the reason cited above, you can't warm up as part of the real run because of the exact counting issues). So the notion of getting stats for the entire run (say 10Mpps for an hour) made sense: after the hour you'd get exact packet stats and an exact latency histogram. The need for interval reporting came later, because folks did not want to wait an hour to see that they were dropping too many packets (or that latency was through the roof), hence the acceptance of approximate results, e.g. every 30 seconds.
Alec
@ahothan
May 22 2018 15:26
Current industry standards for network performance are pretty hazy, but most deployers (say big telcos) would require a certain drop allowance for a given throughput and would be content with keeping average/min/max latencies in check. Of course with so many packets (we're talking billions), the percentile distribution of latencies is even more important, but the industry just did not have the tools to deal with that. Being able to keep track of a detailed per-stream latency distribution is pretty disruptive, especially for an open source traffic generator (which is where I am integrating HdrHistogram_C).
Marshall Pierce
@marshallpierce
May 22 2018 17:05
Sounds like a cool project!
Gil Tene
@giltene
May 22 2018 17:28
I think this use case (across many recorders) is useful enough to extend our APIs. I’ll add an optional relaxed recycling behavior to the Java variants (will still require the right allocating class, but won’t insist on the allocating instance id).
Gil Tene
@giltene
May 22 2018 17:40
@mikeb01, can you look into something for the C version that will look more like the current java Recorder without the containing-instance enforcement? I'm thinking of something like adding hdr_interval_recorder_sample_and_recycle(struct hdr_interval_recorder *, struct hdr_histogram *), which would allocate a new histogram if the recycle argument is null, changing hdr_interval_recorder_init_all() to only allocate the active histogram, and changing hdr_interval_recorder_sample() to call hdr_interval_recorder_sample_and_recycle() with the current inactive histogram as a parameter.
Marshall Pierce
@marshallpierce
May 22 2018 17:40
That would also be useful in a hypothetical Metrics implementation that had a less clumsy reporter design
Gil Tene
@giltene
May 22 2018 17:42
I think that will keep the behavior of the current C API consistent with past behavior, but add the recycling capability that is useful both for this case (many recorders, one sampling thread), and for other cases that use the recycling capability in interesting ways (e.g. Netflix’s Hystrix tracking of a moving window of N samples).
Michael Barker
@mikeb01
May 22 2018 20:59
I think that would be doable. The hdr_interval_recorder_sample_and_recycle would also not store the active histogram as the inactive one, but instead return it and NULL the inactive one?
If you redid the Recorder from scratch, would you even need the inactive histogram? Could you just have an active histogram and swap it for one that is passed in?
Michael Barker
@mikeb01
May 22 2018 21:20
I'm also going to change the interval recorder to use struct hdr_histogram pointers and not void*. This might break a couple of people, but it will be at compile time; runtime behaviour shouldn't change.
It will only impact those who've used the interval recorder for something other than hdr_histograms.
For everyone else it will just mean removing unnecessary casts.
Michael Barker
@mikeb01
May 22 2018 21:34
I've pushed the changes to github. I've updated the example hiccup.c to show how to use the newer recycle approach.
One difference between the Java and C versions is that in the C version it is the caller's responsibility to reset the histogram before passing it in to be sampled.
Gil Tene
@giltene
May 22 2018 21:38
In the java version, the inactive is only temporary. It gets nulled out each time we return an interval histogram, and is fed either from an incoming recycled histogram or an allocation (if no incoming is supplied).
The logic/reason for this is that it would be surprising (to java people at least) if the interval histogram they got and did not recycle started getting stomped and mutating after another interval get happens. Things like e.g. window stuff that wants to keep those histograms across a few interval gets won’t work as expected.
Michael Barker
@mikeb01
May 22 2018 21:41
With the C version now, the inactive histogram is only used when calling hdr_interval_recorder_sample; if the caller only uses hdr_interval_recorder_sample_and_recycle then it is never used.
Gil Tene
@giltene
May 22 2018 21:43
I made the recycling optional for people who are too lazy to do that and are OK with just wastefully allocating on each get. There are many low-tech and very-low-performance use cases of HdrHistogram where the “complexity” of the recycling would “get in the way”.
(I’m being “polite” to the people who code like that)
Michael Barker
@mikeb01
May 22 2018 21:45
I'd almost be tempted to deprecate hdr_interval_recorder_sample. The nice thing about the hdr_interval_recorder_sample_and_recycle is that if you only track the return value in the calling code, you'll never run into an issue with it being changed unexpectedly.
I'm a little bit meaner to those who are a bit lazier...
Gil Tene
@giltene
May 22 2018 21:56
Well, it’s C, you’re allowed and expected to be meaner.
I agree that for C, an auto-implicitly-allocating get would not be a good idea. No GC to make that “just work”.
Gil Tene
@giltene
May 22 2018 22:01
So deprecating the old api would probably be the right thing.