These are chat archives for PerfDotNet/BenchmarkDotNet

8th
May 2016
damageboy
@damageboy
May 08 2016 08:28

@AndreyAkinshin

On modern hardware/software, Stopwatch uses TSC. Your TSC Freq is 2MHz.

I think you meant to say that Stopwatch uses HPET, which actually might be at 2 MHz.
TSC is the Time Stamp Counter, which really is a cycle counter, so it would be much faster than 2 MHz. I wish .NET would rewrite Stopwatch to use RDTSC.

Andrey Akinshin
@AndreyAkinshin
May 08 2016 09:45
@damageboy, on Windows 8+ HPET is disabled by default. You can enable it manually (a bad idea), then Stopwatch will use HPET. Otherwise (by default), Stopwatch uses TSC.
In fact, Stopwatch uses QPC which uses TSC.
damageboy
@damageboy
May 08 2016 09:46
@AndreyAkinshin bah, so it jumps to the kernel just to do an RDTSC? What are they smoking? Or did they change QPC to not be a syscall?
Andrey Akinshin
@AndreyAkinshin
May 08 2016 09:48
There are a lot of problems with direct access to RDTSC.
damageboy
@damageboy
May 08 2016 09:48
I agree, but none of them have to do with whether you wrap it in a syscall or keep it in userspace
It's a completely orthogonal discussion
But RDTSC isn't that bad on modern hardware (e.g. INVARIANT_TSC + RDTSCP)
esp. in BDN context where you control the thread affinity
Andrey Akinshin
@AndreyAkinshin
May 08 2016 09:50
But you don't know your hardware in advance.
damageboy
@damageboy
May 08 2016 09:51
It's true, but my assumption is that with every year that goes by, INVARIANT_TSC (which all Intel procs from 2011/2012 have) and RDTSCP are more likely to be there; that already has to be a large chunk of the HW where people run their stuff
Andrey Akinshin
@AndreyAkinshin
May 08 2016 09:51
What are you going to do if you don't have InvariantTSC?
Some old CPUs (a lot of my friends still have such hardware) have a separate TSC on each core.
damageboy
@damageboy
May 08 2016 09:53
The separate TSC is a different issue from INVARIANT_TSC
INVARIANT_TSC just deals with frequency changes
Andrey Akinshin
@AndreyAkinshin
May 08 2016 09:54
QPC analyzes your hardware and tries to get the best stable frequency.
On modern hardware it will be InvariantTSC.
I think it should be using rdtsc at the low level.
So I don't understand what the problem with QPC is.
damageboy
@damageboy
May 08 2016 09:57
Right, I understand that, but for serious benchmarking you want:
  • RDTSC as close as possible to the code you are benchmarking
  • to minimize the overhead of RDTSC
  • to subtract the time of the actual RDTSC calls from the result (I sometimes use Agner Fog's instruction tables)
Those are all good reasons to push the RDTSC as far as you can into your code...
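The "subtract the cost of the timer calls" idea from the list above can be sketched in Python. This is only an illustration under assumptions: `time.perf_counter_ns` stands in for RDTSC, and the helper names `timer_overhead` and `measure` are hypothetical, not part of BenchmarkDotNet or any other library.

```python
import time


def timer_overhead(clock=time.perf_counter_ns, samples=1000):
    """Return the smallest observed gap between two back-to-back clock reads."""
    best = None
    for _ in range(samples):
        start = clock()
        end = clock()
        gap = end - start
        if best is None or gap < best:
            best = gap
    return best


def measure(func, clock=time.perf_counter_ns, overhead=0):
    """Time one call of func and subtract the estimated timer overhead."""
    start = clock()
    func()
    end = clock()
    # Clamp at zero: the estimate can exceed a very short measurement.
    return max(end - start - overhead, 0)


if __name__ == "__main__":
    overhead = timer_overhead()
    elapsed = measure(lambda: sum(range(10_000)), overhead=overhead)
    print(f"timer overhead ~{overhead} ns, workload ~{elapsed} ns")
```

Taking the minimum gap rather than the mean is a design choice: it approximates the cost of the clock reads themselves and ignores samples inflated by interrupts or context switches.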
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:00
On my laptop, QPC's overhead is about 17–20 ns, which is a good value.
And I don't have to think about the capabilities of my hardware. It's a very good trade-off.
damageboy
@damageboy
May 08 2016 10:03
That is actually pretty good to be honest. I'll try and look into why it is so fast: back in the day QPC used to be a syscall, which in no way should be that fast. Perhaps they changed QPC to stay in user space on the fast path (RDTSC)
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:03
And don't forget about the troubles with rdtsc. For example, you should add a memory barrier before the RDTSC invocation because modern CPUs can perform out-of-order execution. (There are a few paragraphs in Agner Fog's manuals about it.)
damageboy
@damageboy
May 08 2016 10:03
which is why you use RDTSCP, it takes care of that, no need for anything else
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:07
I still don't understand why we should care about it.

You want to subtract the time of the actual RDTSC calls from the result

No, I don't want it. The total time of a single benchmark iteration is about 1 second. The standard deviation in a good case is about 0.01 second. The overhead of two QPC invocations is about 40 ns. It doesn't affect my results at all.

@damageboy, I recommend reading the following MSDN article: https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx It explains why we should use QPC instead of a direct rdtsc invocation.
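The overhead argument above is simple arithmetic, and a short Python sketch makes the scale visible. The three numbers come straight from the message (about 1 s per iteration, about 0.01 s standard deviation, about 40 ns for two QPC calls); the function name `relative_overhead` is just an illustrative helper.

```python
def relative_overhead(overhead_s, reference_s):
    """Return the timer overhead as a fraction of a reference duration."""
    return overhead_s / reference_s


ITERATION_S = 1.0        # total time of one benchmark iteration
STDDEV_S = 0.01          # standard deviation in a good case
QPC_PAIR_S = 40e-9       # overhead of two QPC invocations

# Overhead relative to the whole iteration and relative to the noise.
vs_iteration = relative_overhead(QPC_PAIR_S, ITERATION_S)
vs_noise = relative_overhead(QPC_PAIR_S, STDDEV_S)

if __name__ == "__main__":
    print(f"fraction of iteration: {vs_iteration:.1e}")
    print(f"fraction of stddev:    {vs_noise:.1e}")
```

With these inputs the timer cost is several orders of magnitude below the measurement noise, which is the point being made. The picture changes only when a single measured interval is itself in the nanosecond range.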
damageboy
@damageboy
May 08 2016 10:26
I just started reading it before you mentioned it, thanks for the tip!
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:27
Another good article about timers: http://www.windowstimestamp.com/description
damageboy
@damageboy
May 08 2016 10:31
@AndreyAkinshin thanks for the links, the MS page is very good, and so is the other one, though I still can't find an answer re. is QPC() on Windows a system-call or a user-space API, I will try to single step in windbg or something...
I have a different project where it is critical to avoid system calls for a specific reason, so I'm trying to answer that to myself
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:36
damageboy
@damageboy
May 08 2016 10:38
It isn't, I just checked
It used to be (I know this, since I used to benchmark general syscall perf on Windows with QPC/QPF)
I just single stepped into it and it has a fast-path that calls RDTSC
Andrey Akinshin
@AndreyAkinshin
May 08 2016 10:39
Great.
Can you paste some asm?
damageboy
@damageboy
May 08 2016 10:39
Had it been a syscall it would / should take ~140 cycles just to jump to the kernel
Sure, gimme a sec...
QPC i.e. [__imp_QueryPerformanceCounter ]:
00007FFD43C8A3E0  push        rbx  
00007FFD43C8A3E2  sub         rsp,20h  
00007FFD43C8A3E6  mov         al,byte ptr [7FFE03C6h]  
00007FFD43C8A3ED  mov         rbx,rcx  
00007FFD43C8A3F0  cmp         al,1  
00007FFD43C8A3F2  jne         00007FFD43C8A424  
00007FFD43C8A3F4  mov         rcx,qword ptr [7FFE03B8h]  
00007FFD43C8A3FC  rdtsc  
00007FFD43C8A3FE  shl         rdx,20h  
00007FFD43C8A402  or          rax,rdx  
00007FFD43C8A405  mov         qword ptr [rbx],rax  
00007FFD43C8A408  lea         rdx,[rax+rcx]  
00007FFD43C8A40C  mov         cl,byte ptr [7FFE03C7h]  
00007FFD43C8A413  shr         rdx,cl  
00007FFD43C8A416  mov         qword ptr [rbx],rdx  
00007FFD43C8A419  mov         eax,1  
00007FFD43C8A41E  add         rsp,20h  
00007FFD43C8A422  pop         rbx  
00007FFD43C8A423  ret
the jne that is based on 7FFE03C6h is never taken
I assume that's some global that signals that the fast-path is "safe" on my machine
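As a reading aid, the fast path in the listing above can be modeled in Python. The parameter names are made-up labels for the values the code loads: `bypass_flag` for the byte at 7FFE03C6h, `bias` for the qword at 7FFE03B8h and `shift` for the byte at 7FFE03C7h. The chat does not say what Windows calls them, and `read_tsc` / `slow_path` are hypothetical callables standing in for the `rdtsc` instruction and the `jne` target.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit register wrap-around


def qpc_fast_path(bypass_flag, bias, shift, read_tsc, slow_path):
    """Model of the disassembled routine: returns the counter value."""
    # cmp al,1 / jne ...: anything other than 1 takes the other branch.
    if bypass_flag != 1:
        return slow_path()
    # rdtsc; shl rdx,20h; or rax,rdx: combine EDX:EAX into one 64-bit value.
    tsc = read_tsc() & MASK64
    # lea rdx,[rax+rcx]: add the bias, wrapping at 64 bits.
    biased = (tsc + bias) & MASK64
    # shr rdx,cl: for 64-bit operands only the low 6 bits of CL are used.
    return biased >> (shift & 0x3F)


if __name__ == "__main__":
    value = qpc_fast_path(1, 100, 10, lambda: 1_000_000, lambda: -1)
    print(value)
```

In other words, when the flag is 1 the result is just `(TSC + bias) >> shift`, computed entirely in user space.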
damageboy
@damageboy
May 08 2016 11:07
I just followed the 00007FFD43C8A424 and it indeed jumps to the kernel (i.e. syscall):
00007FFD43CE5360  mov         r10,rcx  
00007FFD43CE5363  mov         eax,31h  
00007FFD43CE5368  test        byte ptr [7FFE0308h],1  
00007FFD43CE5370  jne         00007FFD43CE5375  
00007FFD43CE5372  syscall  
00007FFD43CE5374  ret
basically 00007FFD43C8A424 is NtQueryPerformanceCounter that just jumps to the kernel
Andrey Akinshin
@AndreyAkinshin
May 08 2016 11:12
Sounds interesting. Do you understand when it happens?
damageboy
@damageboy
May 08 2016 11:13
Not yet, but it's a common Kernel32/NtDll pattern, to avoid unsupported low-level ops by jumping around based on global vars
I assume that if InvariantTsc is not supported by the CPU, that memory address will be != 1
and it will jump to the kernel for older hw
damageboy
@damageboy
May 08 2016 12:37
@AndreyAkinshin it's the same pattern they use for using syscall to jump to kernel vs. using int 2e which is slower but more supported
Andrey Akinshin
@AndreyAkinshin
May 08 2016 21:20
@adamsitnik, there is a cool article about project.json files: http://kalapos.azurewebsites.net/project-json-settings-in-asp-net-core
Adam Sitnik
@adamsitnik
May 08 2016 23:13
@AndreyAkinshin Ok, my part for 0.9.6 is finished, I have fixed all "my bugs"
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:15
Great, thanks.
So, we are almost ready to publish 0.9.6.
Adam Sitnik
@adamsitnik
May 08 2016 23:15
yes!
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:15
We should just make a decision about the right name for GCDiagnoser.
Adam Sitnik
@adamsitnik
May 08 2016 23:15
It was a fast one!
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:15
@mattwarren, we need you.
Adam Sitnik
@adamsitnik
May 08 2016 23:16
it seems like @mattwarren is always offline for weekends
maybe it is healthy? ;)
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:16
=)
Adam Sitnik
@adamsitnik
May 08 2016 23:16
I have an idea
we could rename GCDiagnoser to MemoryDiagnoser
and then create a GCDiagnoser class that derives from MemoryDiagnoser
mark it as Obsolete
with comment to use MemoryDiagnoser now
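The rename-and-keep-a-shim idea described above can be illustrated with a Python analogue. This is not BenchmarkDotNet code: in C# the marker would be the `[Obsolete]` attribute, while here a `DeprecationWarning` plays that role, and the class bodies are placeholders.

```python
import warnings


class MemoryDiagnoser:
    """New name; the real implementation would live here."""

    def describe(self):
        return "memory diagnoser"


class GCDiagnoser(MemoryDiagnoser):
    """Old name kept only so existing user code keeps working."""

    def __init__(self, *args, **kwargs):
        # Point users at the new name, like an Obsolete message would.
        warnings.warn(
            "GCDiagnoser is obsolete, use MemoryDiagnoser instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Because the old class derives from the new one, code that still constructs `GCDiagnoser` gets the same behaviour plus a warning, and can be migrated later.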
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:18
I think we don't need an Obsolete here because it will be the first published version of the Diagnostics package.
Adam Sitnik
@adamsitnik
May 08 2016 23:18
yes, but on the other hand Matt has published some blog posts about it
and some users were compiling the solution
and referencing it directly via add reference
so some people might be familiar with the GCDiagnoser name
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:19
Ok, it sounds reasonable.
Adam Sitnik
@adamsitnik
May 08 2016 23:21
ok, I'm going to sleep
Andrey Akinshin
@AndreyAkinshin
May 08 2016 23:21
Good night.
Adam Sitnik
@adamsitnik
May 08 2016 23:21
thanks! see you