Ilya Yaroshenko
@9il
like auto uxv = ux[i];
and operate with these vectors in the inner loop
D (and probably C/C++) cannot vectorise double indexing the way Fortran can
Finally, the performance should be the same
Keep us posted; I think it is a good example of porting, and you could write a short blog post afterwards (this would be very helpful for others)
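A minimal sketch of the row-hoisting idea just described (the arrays and the kernel body here are illustrative, not the actual code being ported): take the row slice once per outer iteration, so the inner loop indexes a plain 1-D vector instead of doing double indexing.

```d
// Sketch only: ux/uy are hypothetical 2-D arrays standing in for the
// matrices in the ported code.
void scale(double[][] ux, double[][] uy, double alpha)
{
    foreach (i; 0 .. ux.length)
    {
        auto uxv = ux[i];   // hoist the row slice out of the inner loop
        auto uyv = uy[i];
        foreach (j; 0 .. uxv.length)
            uyv[j] = alpha * uxv[j]; // inner loop sees simple 1-D indexing
    }
}
```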
Ilya Yaroshenko
@9il
Also, you may want to use https://github.com/libmir/mir-random . It implements the C++ RNG standard and more
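A hedged sketch of typical mir-random usage; the names (`Random`, `unpredictableSeed`, `NormalVariable` from `mir.random` / `mir.random.variable`) are taken from the library's documentation and may differ between releases.

```d
// Sketch assuming the mir-random API: engine + distribution variable.
import mir.random;
import mir.random.variable;

void example()
{
    auto gen = Random(unpredictableSeed);      // seeded PRNG engine
    auto normal = NormalVariable!double(0, 1); // mean 0, stddev 1
    double x = normal(gen);                    // draw one sample
}
```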
Johan Engelen
@JohanEngelen
@9il It would help a lot if you can extract a minimal example that shows that things are not vectorized/optimized well. There is so much going on in the current example that it's hard to analyze why things don't optimize well. Part of the problem could be that slices are used which don't optimize so well yet (it's a work-in-progress).
Johan Engelen
@JohanEngelen
yeah I saw the post. But I missed a compilable full example.
Ilya Yaroshenko
@9il
Woow, nothing is inlined Oo
ldmd2 -inline -O -enable-cross-module-inlining -release -boundscheck=off -I mir-algorithm/source/ -output-s matrix_copy.d
@JohanEngelen Both variants are very slow
This is surprising because it is probably a regression
I cannot confirm it because ndslice is no longer compatible with older versions
ldc2 --version
LDC - the LLVM D compiler (1.3.0git-a969bcf):
  based on DMD v2.073.2 and LLVM 4.0.0
  built with LDC - the LLVM D compiler (0.17.5git-64a274a)
  Default target: x86_64-apple-darwin16.5.0
  Host CPU: haswell
  http://dlang.org - http://wiki.dlang.org/LDC
Johan Engelen
@JohanEngelen
Can you add it to LDC issue tracker? Thanks.
Ilya Yaroshenko
@9il
yep, but without code reduction
Ilya Yaroshenko
@9il
ldc-developers/ldc#2121
Ilya Yaroshenko
@9il
@dextorious see also new LDC issue ldc-developers/ldc#2121
Ilya Yaroshenko
@9il
Hehe, I found a reduced test case
dextorious
@dextorious
Morning. I posted that code as I went to sleep, now I finally had a chance to look at the assembly. As you correctly said, the opIndex calls don't get inlined, which is expensive by itself and also prohibits all further optimization via alias analysis and ultimately vectorization.
Moving over to a[i][j]-style indexing improved the performance by about 2x, but it's still ~720 ms vs 30 ms.
Ilya Yaroshenko
@9il
Yep, working on a workaround
dextorious
@dextorious
So I guess at least at present, using ndslice is not recommended for loop-heavy computational kernels.
Ilya Yaroshenko
@9il
no, this is a regression
I think it can be fixed within 48 hours
dextorious
@dextorious
Ah, ok. Weird that my random attempt at porting some code was the first to notice it.
Anyhow, I'll be around and very happy to test anything or give any feedback I can.
Ilya Yaroshenko
@9il
Thanks, I will let you know when the new mir-algorithm release is ready
dextorious
@dextorious
Great. Very happy about the quick response here. :)
Ilya Yaroshenko
@9il
@dextorious Fixed in PR https://github.com/libmir/mir-algorithm/pull/41/files, tag v0.6.2
It will be in the DUB registry in less than an hour.
Do not forget to remove dub.selections.json and update the mir-algorithm version to ~>0.6.2 if you use dub.
Also, you may want to add "dflags-ldc":["-mcpu=native"]
Please let me know your benchmark results for the C++ code and both D variants. Thanks!
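Putting the two dub hints above together, the package file would look roughly like this (a sketch of a dub.json fragment; the surrounding fields of the real project are omitted):

```json
{
    "dependencies": {
        "mir-algorithm": "~>0.6.2"
    },
    "dflags-ldc": ["-mcpu=native"]
}
```

The `-mcpu=native` flag tells LDC to target the host CPU, which lets the vectorizer use the full instruction set (e.g. AVX2 on the Haswell machine mentioned earlier).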
dextorious
@dextorious
Wow, that was quick. I'll try it right now, thanks!
@9il Okay, the version with a[i][j] style indexing now runs in 89 ms, compared to 31 ms from C++, so it's within 3x now.
Will go through the assembly now, see what's going on.
Ilya Yaroshenko
@9il
Thanks! Could you please file an issue here: https://github.com/libmir/mir-algorithm/issues
Also please try to transform the code like
foreach (i; 0..m)
{
   foreach (j; 0..n)
   {
       // use matrix1[i, j], matrix2[i, j], matrix3[i, j]
   }
}
to
foreach (i; 0..m)
{
   auto v1 = matrix1[i];
   auto v2 = matrix2[i];
   auto v3 = matrix3[i];
   foreach (j; 0..n)
   {
       // use v1[j], v2[j], v3[j]
   }
}
The same applies to tensors
dextorious
@dextorious
Okay, there was an aliasing issue similar to what I recently encountered in Julia, which I fixed by explicitly operating on temporary variables ux0, etc., and only storing the results in the matrices at the end. This enabled some vectorization and brought the timing down to 55 ms. It still doesn't unroll the loop as extensively as clang does and the vectorization isn't quite complete, but we're now within 2x.
If I understand this correctly, what I did was a more extreme version of what you just suggested with hoisting out the rows?
Anyway, I'll fix a few of the uglier details (C++-style for -> foreach, etc.) and post up my current version as an issue on the repository.
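The temporary-variable pattern described above can be sketched like this (a minimal illustration, not the real lattice kernel; the names ux, uy, f and the arithmetic are placeholders): accumulate into scalars and store back once at the end, so the optimizer does not have to assume the output array aliases the inputs mid-loop.

```d
// Sketch of the aliasing workaround: compute into scalar temporaries,
// write to the output slice only once per element.
void step(double[][] ux, double[][] uy, double[][] f)
{
    foreach (i; 0 .. ux.length)
    {
        auto uxv = ux[i], uyv = uy[i], fv = f[i]; // hoisted row slices
        foreach (j; 0 .. uxv.length)
        {
            double ux0 = uxv[j];            // scalar temporaries: no loads
            double uy0 = uyv[j];            // from f can alias these
            double r = ux0 * ux0 + uy0 * uy0;
            fv[j] = r;                      // single store at the end
        }
    }
}
```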
Ilya Yaroshenko
@9il
Yes
Thanks!