These are chat archives for HdrHistogram/HdrHistogram

29th
Apr 2018
Alec
@ahothan
Apr 29 2018 16:27
@giltene @mikeb01 I have been trying the C version, running testperf on an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. The average cost per add is around 40 nsec (I used 2-digit precision and a 1 usec to 10 msec range).
So I was wondering under what conditions the numbers reported at
http://hdrhistogram.org were measured. "Measurements show value recording times as low as 3-6 nanoseconds on modern (circa 2014) Intel CPUs" is an order of magnitude faster.
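For concreteness, here is a minimal sketch of this kind of timing loop against the public HdrHistogram_c API (hdr_init, hdr_record_value, hdr_close), using the range and precision Alec describes (1 usec to 10 msec, 2 significant digits). The iteration count, value pattern, and clock choice are illustrative assumptions, not the actual testperf code:

```c
#include <hdr/hdr_histogram.h> /* may be plain "hdr_histogram.h" depending on install */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct hdr_histogram* h;
    /* track 1..10,000 usec with 2 significant digits */
    hdr_init(1, 10000, 2, &h);

    enum { N = 10 * 1000 * 1000 };
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
    {
        hdr_record_value(h, 1 + (i % 9999)); /* values stay inside the trackable range */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per hdr_record_value\n", ns / N);

    hdr_close(h);
    return 0;
}
```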
Gil Tene
@giltene
Apr 29 2018 18:07
They are a bit crude, but they show 190M-230M recordings per second into a Histogram on my 2.6GHz Core i7 (Skylake) laptop.
This is one of those cases where Java's JIT optimizations, including deep inlining, probably beat the basic C implementation. The C code could probably be made faster by creating inlined variants of hdr_record_value, but I'm not sure it's worth the effort. 40 nsec per recording is still pretty sweet, and still faster than any non-inlined time measurement would be.
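For the curious, a rough sketch of what an inlined variant could look like: a static inline helper in a header, mirroring the library's bucketing scheme (power-of-two buckets selected by the highest set bit, linear sub-buckets within each). The helper name is made up, the field names are taken from the public struct hdr_histogram in recent versions, the clz builtin is GCC/Clang-specific, and this skips the bounds checks and min/max tracking the real hdr_record_value performs, so treat it as an assumption-laden sketch rather than library code:

```c
#include <hdr/hdr_histogram.h>
#include <stdint.h>

/* Hypothetical header-inlined fast path (sketch only; assumes value is
 * non-negative and within the histogram's trackable range). */
static inline void hdr_record_value_inline(struct hdr_histogram* h, int64_t value)
{
    /* bucket index: position of the highest set bit above the unit size */
    int32_t pow2ceiling  = 64 - __builtin_clzll((uint64_t)(value | h->sub_bucket_mask));
    int32_t bucket_index = pow2ceiling - h->unit_magnitude
                         - (h->sub_bucket_half_count_magnitude + 1);
    /* linear sub-bucket index within that bucket */
    int32_t sub_bucket_index = (int32_t)(value >> (bucket_index + h->unit_magnitude));
    int32_t counts_index = ((bucket_index + 1) << h->sub_bucket_half_count_magnitude)
                         + (sub_bucket_index - h->sub_bucket_half_count);

    h->counts[counts_index]++;
    h->total_count++;
}
```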
Jon Gjengset
@jonhoo
Apr 29 2018 18:28
I know @marshallpierce has done a bunch of benchmarking of the Rust implementation too -- don't know what the numbers were though
Alec
@ahothan
Apr 29 2018 19:36
If I have time I'll run a profiler to see where the 40 nsec is spent. Not that I don't believe you, but I'm really surprised you can get 200M/sec with JIT. It is also artificial to have tight loops like those used in the test code, as they tend to abuse the LLC. In normal conditions, the code that does an add will be tiny compared to the code that actually does the work, so the hdr code will almost never see those ideal LLC conditions.
I'm good with 25M adds/sec anyway: I am recording the latency of a packet stream running on the order of a few K/sec (a latency stream that runs in parallel to a main packet stream running at 10-100 Mpps).
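One way to sketch the less-cache-friendly scenario Alec describes: evict the histogram's counts between adds by sweeping a buffer larger than the LLC, then compare against a pollute-only baseline run. The buffer size, stride, and helper names here are arbitrary illustrative choices, not anything from the repo's test code:

```c
#include <hdr/hdr_histogram.h>
#include <stdlib.h>

enum { POLLUTE_BYTES = 64 * 1024 * 1024 }; /* bigger than a typical LLC */

/* Touch one byte per cache line to push the counts array out of cache. */
static void pollute_cache(volatile unsigned char* buf)
{
    for (size_t i = 0; i < POLLUTE_BYTES; i += 64)
    {
        buf[i]++;
    }
}

void cold_cache_bench(struct hdr_histogram* h, const int64_t* values, int n)
{
    volatile unsigned char* buf = malloc(POLLUTE_BYTES);
    for (int i = 0; i < n; i++)
    {
        hdr_record_value(h, values[i]);
        pollute_cache(buf); /* time the whole loop, then subtract a
                               pollute-only baseline to isolate the adds */
    }
    free((void*)buf);
}
```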
Marshall Pierce
@marshallpierce
Apr 29 2018 22:07
on my E5-1650v3 (3.5GHz), recording 1M random values (1 to max u64 range, precalculated to avoid measuring the RNG) takes ~3M ns, so about 300M per second. But, as you say, it's highly artificial to have the values array sitting nicely in cache.
it is surprising that you're seeing ~10x worse performance, but nonetheless it's not a problem at a few thousand per second. :)
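Marshall's numbers are from the Rust implementation; transposed to C for consistency with the rest of this thread, the precalculate-then-record pattern he describes looks roughly like this (rand() stands in for whatever RNG the real benchmark uses, and the function name is made up):

```c
#include <hdr/hdr_histogram.h>
#include <stdint.h>
#include <stdlib.h>

/* Generate values up front so the timed region measures only recording. */
void bench_precalculated(struct hdr_histogram* h, int n)
{
    int64_t* values = malloc(n * sizeof *values);
    for (int i = 0; i < n; i++)
    {
        values[i] = 1 + rand() % 10000; /* RNG cost paid outside the timed loop */
    }

    /* ... start timer ... */
    for (int i = 0; i < n; i++)
    {
        hdr_record_value(h, values[i]);
    }
    /* ... stop timer: elapsed / n is roughly the cost per recording ... */

    free(values);
}
```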