These are chat archives for elemental/chat

31st Oct 2016
Ryan H. Lewis
@rhl-
Oct 31 2016 00:45
@poulson can you resend me Jenkins credentials?
Jack Poulson
@poulson
Oct 31 2016 05:44
I don't have access to the Linode information while I'm traveling; I can reset your credentials when I get back tomorrow night
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 11:57
@rhl- Finally, I managed to build it from source on Fedora 24. All the tests passed and I wanted to run the SVD example given here. I created the makefile, the compilation was successful, but when I ran the executable, it couldn't find the Elemental shared library:
./SVD: error while loading shared libraries: libEl.so: cannot open shared object file: No such file or directory
The problem is of course with the environment variables, but even running PATH=/usr/local/lib64/:$PATH ; export PATH did not solve it. Then I tried putting the SVD executable in the same directory as libEl.so (i.e. /usr/local/lib64), but got the same error.
Ryan H. Lewis
@rhl-
Oct 31 2016 12:03
You need to set LD_LIBRARY_PATH
Not PATH
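(A minimal sketch of the fix, assuming libEl.so was installed to /usr/local/lib64 as described above:)
export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
./SVD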
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 12:07
Thanks, works perfectly.
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 12:16
Why is it that ./SVD --height 300 --width 300 or mpiexec ./SVD --height 300 --width 300 takes just 0.13 s to run, while mpiexec -n 2 ./SVD --height 300 --width 300 takes 33 s, mpiexec -n 4 ./SVD --height 300 --width 300 takes 66 s, and mpiexec -n 8 ./SVD --height 300 --width 300 takes 138 s? The execution time doubles each time. It is also interesting that even if I set the number of processes to 2 or 4, all 8 threads of my Core i7 processor are at full load.
Ryan H. Lewis
@rhl-
Oct 31 2016 12:26
You need to control the OpenBLAS threading, I'm guessing
When you use more threads than physical CPU cores, it gets slower
Jack Poulson
@poulson
Oct 31 2016 13:10
I agree; it is likely thread oversubscription
running export OMP_NUM_THREADS=1 should fix it
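(A minimal sketch of the suggested fix, limiting each MPI rank to one thread before launching; the --height/--width flags are the ones used above:)
export OMP_NUM_THREADS=1
mpiexec -n 4 ./SVD --height 300 --width 300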
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 13:49
@poulson It fixed it, but it does not scale too well. On one core: 82.4429 s, on two cores: 54.8893 s, on 4 cores: 45.928 s.
Ryan H. Lewis
@rhl-
Oct 31 2016 13:53
How about with OpenBLAS thread scaling?
Ryan H. Lewis
@rhl-
Oct 31 2016 13:58
MPI is more for distributed memory
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 14:01
After export OPENBLAS_NUM_THREADS=1, it's the same. Our university cluster has about 5000 cores. Does Elemental scale up to that?
Ryan H. Lewis
@rhl-
Oct 31 2016 14:02
I don't have time right now to inspect your build. @poulson is the expert here, anyway
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 14:03
Ok, thanks for helping me with Elemental.
Ryan H. Lewis
@rhl-
Oct 31 2016 14:04
I can look later. When in doubt though, use a profiler
Is everything using release mode?
I was suggesting that you vary the OpenMP threads and leave MPI at one process
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 14:06
I built it in Release mode.
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 14:20
Using 1 MPI process ( mpiexec -n 1 ./SVD ) and
1 thread: 81.9656 s
2 threads: 53.925 s
4 threads: 42.0331 s
The threads were set with export OMP_NUM_THREADS=... and export OPENBLAS_NUM_THREADS=....
The matrices were 3000x3000.
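(A minimal sketch of one of these runs, assuming the SVD driver accepts the same --height/--width flags as above for the 3000x3000 case:)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
mpiexec -n 1 ./SVD --height 3000 --width 3000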
Ryan H. Lewis
@rhl-
Oct 31 2016 14:22
Not sure there. Elemental should just wrap OpenBLAS in this situation. Are you timing matrix gen as well?
Gotta go
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 14:23
No, the matrix generation time is excluded.
What I realised is that it is OMP_NUM_THREADS that matters: even if I set OPENBLAS_NUM_THREADS to 4, if OMP_NUM_THREADS is 1, it will calculate on one thread.
Jack Poulson
@poulson
Oct 31 2016 15:24
Those timings are not reasonable; 3000 x 3000 should only take a couple seconds, even on one core
I always recommend starting with running tests/blas_like/Gemm as a sanity check
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 15:26
@poulson It took 7.61 s as part of make test.
Jack Poulson
@poulson
Oct 31 2016 15:28
I take back my "couple seconds" claim: assuming about 5 n^3 flops and running at 1e10 flop/s, it would take 13.5 seconds
I meant that you should inspect the output of tests/blas_like/Gemm, as it gives the local Gemm speed and the AllGather bandwidth
it is very common to have a misconfigured stack
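(A quick check of the 13.5-second estimate above, for n = 3000:)
5 n^3 = 5 * 3000^3 = 1.35e11 flops; at 1e10 flop/s that is 1.35e11 / 1e10 = 13.5 seconds.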
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 16:29

Results with OMP_NUM_THREADS=4, OPENBLAS_NUM_THREADS=4, mpiexec -n 1 ./Gemm --m 3000 --n 3000 for real matrices:
float: Stationary A algorithm - Finished in 0.0533109 seconds (33.7642 GFlop/s)
double: Stationary A algorithm - Finished in 0.112217 seconds (16.0404 GFlop/s)
quad: Stationary A algorithm - Finished in 66.826 seconds (0.0269356 GFlop/s)

Results with OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, mpiexec -n 4 ./Gemm --m 3000 --n 3000 for real matrices:
float: Stationary A algorithm - Finished in 0.0983532 seconds (18.3014 GFlop/s)
double: Stationary A algorithm - Finished in 0.22003 seconds (8.1807 GFlop/s)
quad: Stationary A algorithm - Finished in 20.5049 seconds (0.0877838 GFlop/s)

Jack Poulson
@poulson
Oct 31 2016 16:33
Stationary C's local Gemm speed and AllGather bandwidth are the best things to pay attention to for square matrices
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 16:59
How can I query the AllGather bandwidth? I found this Stack Exchange question but don't know how the user printed it.
Jack Poulson
@poulson
Oct 31 2016 17:00
it should be printed by the driver
I apologize, I meant examples/blas_like/Gemm
not tests/blas_like/Gemm
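(A minimal sketch of the comparison, assuming the examples driver is run from the build tree and accepts the same --m/--n flags shown above:)
mpiexec -n 1 ./examples/blas_like/Gemm --m 3000 --n 3000
mpiexec -n 4 ./examples/blas_like/Gemm --m 3000 --n 3000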
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:11
I ran it using 1 thread with 1, 2, and 4 processes, and then using 1 process with 1, 2, and 4 threads.
Jack Poulson
@poulson
Oct 31 2016 17:13
you are getting very low bandwidth
e.g., in the 4 process case
what MPI are you using?
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:15
I am using the latest mpich, i.e. mpich-3.2.
Jack Poulson
@poulson
Oct 31 2016 17:16
hmm, I am seeing similar bandwidths on my Mac when I run locally
it seems your laptop just doesn't have very scalable memory bandwidth
(and neither does mine)
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:17
It's not my laptop, it's a fairly old desktop with an Intel Core i7-860.
Jack Poulson
@poulson
Oct 31 2016 17:18
you may want to check the affinity
MPI doesn't always guarantee that processes are local to a core
you may want to read the section around hwloc here: https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
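(A minimal sketch of pinning with the Hydra launcher, using the -bind-to option it documents; hwloc-ls is the hwloc topology viewer, assuming hwloc is installed:)
hwloc-ls
mpiexec -bind-to core -n 4 ./SVD --height 3000 --width 3000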
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:20
Although I use MPICH, not Open MPI, from the Open MPI FAQ: "Also note that processor and memory affinity is meaningless (but harmless) on uniprocessor machines."
Jack Poulson
@poulson
Oct 31 2016 17:22
do you have a uniprocessor machine?
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:23
Yes, it's a non-server motherboard with a single processor that has 4 cores.
Jack Poulson
@poulson
Oct 31 2016 17:23
ah, I see you have a core i7
"uniprocessor" is likely referring to single-core
nothing should guarantee your MPI processes stay local to a core without some sort of affinity
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:37
So should I bind n processes to n cores?
Jack Poulson
@poulson
Oct 31 2016 17:38
usually, yes
though this isn't the case on BG/Q
(where two processes per core seems to win)
Zoltán Csáti
@CsatiZoltan
Oct 31 2016 17:47
I passed the -bind-to hwthread option and then the -bind-to core option. Both of them resulted in essentially no speed-up in the bandwidth.
Jack Poulson
@poulson
Oct 31 2016 17:48
what sort of scaling do you see for tests/lapack_like/Bidiag?
Jack Poulson
@poulson
Oct 31 2016 18:44
I ask because SVD was recently extended to implement a distributed Divide and Conquer approach and it isn't fully battle-tested yet
if Bidiag is scalable but SVD is not, that is a very strong sign that there could be a bug in the new distributed D&C
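(A minimal sketch of the scaling check; the driver path is the one mentioned above, and its options are left at their defaults here since the exact flags are not shown in this conversation:)
mpiexec -n 1 ./tests/lapack_like/Bidiag
mpiexec -n 4 ./tests/lapack_like/Bidiag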