These are chat archives for akkadotnet/

Mar 2015
Natan Vivo
Mar 19 2015 01:24
@Aaronontheweb, @rogeralsing, I found some ways to optimize the queue. A simple thing is that Count is quite expensive in ConcurrentQueue, and it's being called twice in the PoolWorker. Just by calling it once you can improve perf by ~15%.
For more gains, the test needs to be changed: as it stands, it exercises the worst possible scenario, where messages are being queued and dequeued at the same time on all threads all the time. ConcurrentQueue is already very optimized for that, so it will be hard to beat. But if you change the pattern to add some messages, wait for a while, then add more, you can measure other improvements with some tricks.
I tested a variation of your swap pattern, where I basically return the entire ConcurrentQueue at once and create another; it locks only on Add and DequeueAll. I saw about a 40% improvement doing that versus dequeuing one item at a time
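The swap pattern being described might look something like this (a hypothetical sketch, not actual Akka.NET code — the class and method names are illustrative):

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of the swap pattern described above: producers enqueue under a
// lock, and the consumer takes the entire queue in one operation,
// leaving a fresh empty queue behind, instead of dequeuing item by item.
public class SwappingQueue<T>
{
    private readonly object _lock = new object();
    private ConcurrentQueue<T> _queue = new ConcurrentQueue<T>();

    public void Add(T item)
    {
        lock (_lock) { _queue.Enqueue(item); }
    }

    // Returns the whole current queue at once and swaps in a new one.
    public ConcurrentQueue<T> DequeueAll()
    {
        lock (_lock)
        {
            var taken = _queue;
            _queue = new ConcurrentQueue<T>();
            return taken;
        }
    }
}
```

The consumer can then iterate the returned queue without any further synchronization, since no producer holds a reference to it anymore.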
Natan Vivo
Mar 19 2015 01:31
well.. just some ideas...
Roger Johansson
Mar 19 2015 01:56
Neat, that was what I tried to solve lock-free (but failed due to stale reads)
Is the perf gain still true in a multi-producer scenario?
Or does the gain vanish then due to contention?
Bartosz Sypytkowski
Mar 19 2015 07:39
@rogeralsing @Aaronontheweb Do we have some release planned in the near future?
Roger Johansson
Mar 19 2015 07:45
Not that I know of, unless @Aaronontheweb has something planned
Natan Vivo
Mar 19 2015 09:18
@rogeralsing I didn't change the tests, so it remained one producer. But it seems most of the time spent in ConcurrentQueue is not actually waiting, but dealing with the internal structures of it.
What helped was grabbing the ConcurrentQueue code from corefx, adding it locally to the project, and running the profiler.
We can see that in the code that is on GitHub, ~30% of the time is spent on TryDequeue and ~30% on Count
most of this time is just looping through segments inside the concurrent queue
so, it's not actually locks, just regular code running. it's hard to optimize it more for general use cases.
(when I say locks in concurrentqueue, I mean spinwait, not the lock keyword)
Natan Vivo
Mar 19 2015 09:41
If you were to push all the overhead onto adding and none onto retrieving, the most optimal solution would be a simple List&lt;T&gt; with a predefined size, using locks/spinwait on Add; for dequeue you have a single method that locks/spinwaits and exchanges the list with a new one. You can then loop over the list with tasks in the pool directly, as an array. The issue is that it's hard to know what size this array should be
so it becomes a trade-off versus a LinkedList doing the same.
after that, the path would be a linked list of arrays, and that's exactly what ConcurrentQueue is...
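The List&lt;T&gt; variant described above could be sketched like this (again a hypothetical illustration; the initial capacity of 1024 is an arbitrary guess, which is exactly the sizing problem mentioned):

```csharp
using System;
using System.Collections.Generic;

// Sketch: all contention is paid on Add; the consumer exchanges the
// whole list for a new one in a single locked operation and then
// iterates the taken list with no further synchronization.
public class SwappingList<T>
{
    private const int InitialCapacity = 1024; // hard to know the right size
    private readonly object _lock = new object();
    private List<T> _items = new List<T>(InitialCapacity);

    public void Add(T item)
    {
        lock (_lock) { _items.Add(item); }
    }

    // Swaps the backing list for a fresh one and returns the old list.
    public List<T> ExchangeAll()
    {
        lock (_lock)
        {
            var taken = _items;
            _items = new List<T>(InitialCapacity);
            return taken;
        }
    }
}
```

If the predefined capacity is too small the list reallocates on Add (overhead moves back to the producer); too large and every swap wastes memory, which is the trade-off against a linked structure.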
Natan Vivo
Mar 19 2015 09:48
The point I made about the test is that unless you allow some time to fill a list with items without dequeuing, the gains from swapping the list are lost. In the test, it keeps producing and swapping all the time, so the cost of swapping becomes higher. If you give it some time to fill a list and only swap once there's a minimum number of items, swapping becomes cheaper and these other solutions apply
another thing.. if using ConcurrentQueue, replacing Count with IsEmpty is much cheaper, as IsEmpty only checks the local segment, while Count needs to loop through all of them
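Concretely, when the code only needs to know whether there is pending work (as in the earlier point about Count being called twice in the PoolWorker), the check can be written like this:

```csharp
using System;
using System.Collections.Concurrent;

var queue = new ConcurrentQueue<int>();

// More expensive: Count walks every internal segment to get an exact number.
bool hasWorkSlow = queue.Count > 0;

// Cheaper: IsEmpty only inspects the head segment.
bool hasWorkFast = !queue.IsEmpty;

queue.Enqueue(42);
Console.WriteLine(!queue.IsEmpty); // queue now has pending work
```

The exact count is only worth paying for when the number itself is needed, not just emptiness.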
Roger Johansson
Mar 19 2015 11:05
another solution would be to add a true DequeueAll in a modified version of ConcurrentQueue. That would be feasible too, right? We could just make the ConcurrentQueue head point to where the last segment is, and then enumerate over all of the segments we just removed... I haven't checked the code for it, but that should work, right?
Natan Vivo
Mar 19 2015 11:20
It could work. One of the things I thought is just using that code and tweaking some parameter to match the usage. ConcurrentQueue has a lot of parameters inside that were probably calculated for the most general cases
as I understand, what it does inside is create a 32-item array for each segment, and it grows by linking them. Each segment has its own lock, so you only contend when you cross that 32-item boundary
From the profiler, even in the 4th iteration (10 million actions, I guess?), the time spent on spinwait was only about 4%. 30% was just moving through segments
so, in theory reducing the # of segments (that is - increasing the segment size) could help...
but at some point I wonder if you're not optimizing only for the benchmark
in the case of DequeueAll, if it needs to loop through and lock all segments, there might not be a lot of gain. But it could take each entire segment as an array and process it before moving to the next
Aaron Stannard
Mar 19 2015 18:28
@Horusiath @rogeralsing 1.0 should be ready to go shortly
1-2 weeks IIRC