These are chat archives for libmir/public

20th
May 2017
dextorious
@dextorious
May 20 2017 02:06
Not sure if this is the right place to ask this, but as part of looking at mir.ndslice, I was going to port a simple lattice Boltzmann fluid dynamics simulation for learning purposes, starting with a collision kernel:
https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d
which is currently a literal, non-idiomatic port of a C++ example:
https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18
Ignoring the non-idiomatic loop syntax and similar details, the D version is over 40x slower (LDC v.1.2.0, release build with -O3 and no bounds checks, compared vs. clang v4.0.0 -O3 on a Haswell CPU), which means I'm doing something horribly wrong. Having gone through the docs (and part of the vision library) and checked that the results are correct, I'm somewhat at a loss.
Does anyone see a glaring error that would lead to this level of performance degradation?
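For reference, a build setup matching the description above would be roughly the following (the file names are placeholders; the exact command lines were not posted):
ldc2 -O3 -release -boundscheck=off lbm.d
clang++ -O3 lbm.cpp -o lbm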
Ilya Yaroshenko
@9il
May 20 2017 02:25
Hello @dextorious
Yes, the C++ code uses single indexing for vectors, while the D code uses double indexing for matrices.
You may want to declare the row vectors at the beginning of the outer loop,
like auto uxv = ux[i];
and operate on these vectors in the inner loop.
D (and probably C/C++) cannot vectorize double indexing the way Fortran can.
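A minimal sketch of the hoisting being suggested, assuming ndslice matrices named like the ux/uy fields in the gist (the arithmetic in the inner loop is only a placeholder, not the real kernel):
import mir.ndslice;

void hoistRows()
{
    auto m = 64, n = 64;
    auto ux = slice!double(m, n);
    auto uy = slice!double(m, n);

    foreach (i; 0 .. m)
    {
        // hoist the row views once per outer iteration
        auto uxv = ux[i];
        auto uyv = uy[i];
        foreach (j; 0 .. n)
        {
            // single indexing in the hot loop, as in the C++ version
            uxv[j] = 0.5 * (uxv[j] + uyv[j]);
        }
    }
}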
Ilya Yaroshenko
@9il
May 20 2017 02:31
In the end, the performance should be the same.
Keep us posted; I think this is a good example of porting, and you could write a short blog post afterwards (it would be very helpful for others).
Also, you may want to use https://github.com/libmir/mir-random . It implements the C++ RNG standard and more.
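A minimal mir-random sketch, assuming its engine and distribution-variable modules (mir.random and mir.random.variable); the function name is just an illustration:
import mir.random;           // Random engine, unpredictableSeed
import mir.random.variable;  // distribution variables such as UniformVariable

void randomSketch()
{
    auto gen = Random(unpredictableSeed);
    auto uniform = UniformVariable!double(0.0, 1.0);
    double x = uniform(gen); // one draw from the uniform distribution
}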
Johan Engelen
@JohanEngelen
May 20 2017 07:11
@9il It would help a lot if you could extract a minimal example that shows that things are not vectorized/optimized well. There is so much going on in the current example that it's hard to analyze why it doesn't optimize well. Part of the problem could be that slices are used, which don't optimize so well yet (that's a work in progress).
Johan Engelen
@JohanEngelen
May 20 2017 07:14
Yeah, I saw the post, but I missed a full compilable example.
Ilya Yaroshenko
@9il
May 20 2017 07:26
Woow, nothing is inlined Oo
ldmd2 -inline -O -enable-cross-module-inlining -release -boundscheck=off -I mir-algorithm/source/ -output-s matrix_copy.d
@JohanEngelen Both variants are very slow
This is surprising because it is probably a regression
I cannot confirm it because ndslice is no longer compatible with older versions
ldc2 --version
LDC - the LLVM D compiler (1.3.0git-a969bcf):
  based on DMD v2.073.2 and LLVM 4.0.0
  built with LDC - the LLVM D compiler (0.17.5git-64a274a)
  Default target: x86_64-apple-darwin16.5.0
  Host CPU: haswell
  http://dlang.org - http://wiki.dlang.org/LDC
Johan Engelen
@JohanEngelen
May 20 2017 08:03
Can you add it to the LDC issue tracker? Thanks.
Ilya Yaroshenko
@9il
May 20 2017 08:03
yep, but without code reduction
Ilya Yaroshenko
@9il
May 20 2017 09:18
ldc-developers/ldc#2121
Ilya Yaroshenko
@9il
May 20 2017 09:24
@dextorious see also new LDC issue ldc-developers/ldc#2121
Ilya Yaroshenko
@9il
May 20 2017 10:15
Hehe, I found a reduced test case
dextorious
@dextorious
May 20 2017 10:37
Morning. I posted that code just before going to sleep; now I've finally had a chance to look at the assembly. As you correctly said, the opIndex calls don't get inlined, which is expensive by itself and also prohibits all further optimization via alias analysis and, ultimately, vectorization.
Moving over to a[i][j] style indexing improved the performance by about 2x, but it's still ~720ms vs 30ms.
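For clarity, the two indexing styles being compared look like this on an ndslice matrix (a tiny standalone sketch):
import mir.ndslice;

void indexingStyles()
{
    auto a = slice!double(4, 4);

    a[1, 2] = 3.0;  // multi-dimensional indexing: both indices in one opIndex call
    a[1][2] = 3.0;  // row-then-element: a[1] yields a 1D row slice first
}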
Ilya Yaroshenko
@9il
May 20 2017 10:38
Yep, working on workaround
dextorious
@dextorious
May 20 2017 10:38
So I guess at least at present, using ndslice is not recommended for loop-heavy computational kernels.
Ilya Yaroshenko
@9il
May 20 2017 10:38
No, this is a regression
I think it can be fixed within 48 hours
dextorious
@dextorious
May 20 2017 10:39
Ah, OK. Weird that my random attempt at porting some code was the first thing to run into it.
Anyhow, I'll be around and very happy to test anything or give any feedback I can.
Ilya Yaroshenko
@9il
May 20 2017 10:40
Thanks, I will let you know when the new mir-algorithm release is ready
dextorious
@dextorious
May 20 2017 10:40
Great. Very happy about the quick response here. :)
Ilya Yaroshenko
@9il
May 20 2017 11:17
@dextorious Fixed in PR https://github.com/libmir/mir-algorithm/pull/41/files, tag v0.6.2
It will be in the DUB registry in less than an hour.
Do not forget to remove dub.selections.json and update the mir-algorithm version to ~>0.6.2 if you use dub.
Also, you may want to add "dflags-ldc": ["-mcpu=native"]
Please let me know your benchmark results for the C++ code and both D variants. Thanks!
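For reference, the dub configuration changes described above would look roughly like this (the project name and any other fields are placeholders):
{
    "name": "lbm-port",
    "dependencies": {
        "mir-algorithm": "~>0.6.2"
    },
    "dflags-ldc": ["-mcpu=native"]
}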
dextorious
@dextorious
May 20 2017 12:36
Wow, that was quick. I'll try it right now, thanks!
@9il Okay, the version with a[i][j] style indexing now runs in 89 ms, compared to 31 ms from C++, so it's within 3x now.
Will go through the assembly now, see what's going on.
Ilya Yaroshenko
@9il
May 20 2017 12:42
Thanks! Could you please file an issue here: https://github.com/libmir/mir-algorithm/issues
Also, please try to transform code like
foreach (i; 0..m)
{
   foreach (j; 0..n)
   {
       // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
   }
}
to
foreach (i; 0..m)
{
   auto v1 = matrix1[i];
   auto v2 = matrix2[i];
   auto v3 = matrix3[i];
   foreach (j; 0..n)
   {
       // use v1[j], v2[j], v3[j]
   }
}
The same applies to tensors.
dextorious
@dextorious
May 20 2017 12:50
Okay, there was an aliasing issue similar to what I recently encountered in Julia, which I fixed by explicitly operating on temporary variables ux0, etc., and only storing the results in the matrices at the end. This enabled some vectorization and brought the timing down to 55 ms. It still doesn't unroll the loop as extensively as clang does and the vectorization isn't quite complete, but we're now within 2x.
If I understand this correctly, what I did was a more extreme version of what you just suggested with hoisting out the rows?
Anyway, I'll fix a few of the uglier details (C++-style for -> foreach, etc.) and post my current version as an issue on the repository.
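A sketch of the scalar-temporary pattern described above; the field names ux/uy and the temporary ux0 follow the conversation, while omega and the arithmetic are placeholders standing in for the actual collision step:
import mir.ndslice;

void collideSketch(double omega)
{
    auto m = 64, n = 64;
    auto ux = slice!double(m, n);
    auto uy = slice!double(m, n);

    foreach (i; 0 .. m)
    {
        auto uxv = ux[i];
        auto uyv = uy[i];
        foreach (j; 0 .. n)
        {
            // read once into scalar temporaries so the optimizer need not
            // assume the slices alias each other across the computation
            double ux0 = uxv[j];
            double uy0 = uyv[j];

            // ... collision arithmetic on the scalars (placeholder) ...
            double uxNew = (1.0 - omega) * ux0 + omega * uy0;
            double uyNew = (1.0 - omega) * uy0 + omega * ux0;

            // write the results back only at the end of the iteration
            uxv[j] = uxNew;
            uyv[j] = uyNew;
        }
    }
}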
Ilya Yaroshenko
@9il
May 20 2017 12:59
Yes
Thanks!
dextorious
@dextorious
May 20 2017 13:12
Posted: libmir/mir-algorithm#42
Curiously, I ran into the same aliasing issue when I wrote a Julia version of the same code, but there manually introducing the scalar temporaries was enough to persuade the compiler to fully vectorize the loops and get to within ~20% of the C++ benchmark.
So I suspect there may be room for improvement in terms of what information LDC exposes to the LLVM optimizer.