These are chat archives for symengine/symengine

16th Nov 2015
Ralf Stephan
@rwst
Nov 16 2015 07:37 UTC

@certik @bluescarni This all may be due to a bug in piranha, because changing this line:

    for (short deg = 0; deg <= the_poly.degree({varname}); deg++) {

to

    const short degree = the_poly.degree({varname});
    for (short deg = 0; deg <= degree; deg++) {

reduces it all to 1020ms.
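A self-contained sketch of that fix, assuming the pp_t alias quoted further below and hypothetical the_poly/varname variables standing in for the surrounding code:

    #include <string>
    #include "piranha.hpp"

    using pp_t = piranha::polynomial<piranha::rational, piranha::monomial<short>>;

    void process(const pp_t &the_poly, const std::string &varname)
    {
        // degree() walks every term of the polynomial, so compute it once
        // instead of re-evaluating it in the loop condition on each pass:
        const short degree = the_poly.degree({varname});
        for (short deg = 0; deg <= degree; deg++) {
            // ... process the coefficient of varname^deg ...
        }
    }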

Francesco Biscani
@bluescarni
Nov 16 2015 07:46 UTC
@rwst I guess it's because it is recalculating the degree of the polynomial at each step of the iteration in the first version
Ralf Stephan
@rwst
Nov 16 2015 07:55 UTC
@bluescarni Yes, but that seems not to be cached, and it's much too costly (hundreds of ms for a univariate poly of degree 1000).
Changing to k_monomial does not make a difference either.
Francesco Biscani
@bluescarni
Nov 16 2015 07:56 UTC
the calculation of the degree always happens on demand currently
I imagine the compiler is not managing to cache it even if it could in principle
what is the definition of the poly type?
Ralf Stephan
@rwst
Nov 16 2015 07:57 UTC
True.
using pp_t = piranha::polynomial<piranha::rational,piranha::monomial<short>>;
Francesco Biscani
@bluescarni
Nov 16 2015 07:58 UTC
thanks... do you have any numbers on how long a single degree() computation takes?
I'd expect k_monomial to be slower in this case, as it has to unpack each monomial for the degree computation
Ralf Stephan
@rwst
Nov 16 2015 08:01 UTC
Oddly, it varies between 70us and 1000us when measured in a benchpress loop (gcc without -O). But as I said above, moving it out of the loop reduces the time from 1400ms to 1000ms in the benchmark.
See the code links above.
Ah I see, the compiler does optimize it.
No. Confused.
Francesco Biscani
@bluescarni
Nov 16 2015 08:05 UTC
mhmm... it might be that it gets optimised out; benchpress has some special support for making sure a value is actually computed and not thrown away
benchpress::escape() I think it's called
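The generic version of that trick (from Chandler Carruth's CppCon benchmarking talk), which benchpress reportedly wraps, looks roughly like this:

    // Tell the optimizer that *p may be read or written by unknown code,
    // so the computation producing it cannot be elided from the timed loop.
    inline void escape(void *p)
    {
        asm volatile("" : : "g"(p) : "memory");
    }

    // In the benchmark loop:
    //     auto d = the_poly.degree({varname});
    //     escape(&d);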
Ralf Stephan
@rwst
Nov 16 2015 08:07 UTC
Still, hundreds of microseconds is a long time to iterate over 1000 monomials.
Francesco Biscani
@bluescarni
Nov 16 2015 08:08 UTC
When you say "gcc without -O" you mean with no optimisation enabled?
Ralf Stephan
@rwst
Nov 16 2015 08:10 UTC
Literally, no -O given. I just tried -O0 with the same result.
Francesco Biscani
@bluescarni
Nov 16 2015 08:10 UTC
For performance you should give -O2 or -O3
could you post the complete set of compiler flags and definitions?
Francesco Biscani
@bluescarni
Nov 16 2015 08:21 UTC
getting a 404 error
Ralf Stephan
@rwst
Nov 16 2015 08:24 UTC
https://gist.github.com/rwst/ebacc780c7c879f9b739
Sorry, the other link was for cloning apparently.
Francesco Biscani
@bluescarni
Nov 16 2015 08:24 UTC
cheers
I am getting 32 microseconds here with -O2
Francesco Biscani
@bluescarni
Nov 16 2015 08:33 UTC
there's the cost of iteration over what is essentially a linked list, plus the overflow check that piranha does when working with machine integers as exponents
I mean, I would expect it to be noticeably slower than a for loop over a plain array
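Conceptually, each partial degree() call has to do something like the following (a rough sketch, not Piranha's actual code; exponent_of() is a hypothetical helper):

    // Visit every term of the series and keep the maximum exponent of the
    // requested variable; exponents stored as machine integers are
    // additionally checked for overflow along the way.
    short degree = 0;
    for (const auto &term : the_poly) {
        const short d = exponent_of(term, varname); // hypothetical helper
        if (d > degree) {
            degree = d;
        }
    }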
Ralf Stephan
@rwst
Nov 16 2015 08:36 UTC
70us here with -O2/3, but it's misleading, because in the actual benchmark moving it out of the loop proves it costs 400us. So I do not trust benchpress here. Something may have been optimized out that isn't optimized out in the benchmark code, because there the value is used later.
Francesco Biscani
@bluescarni
Nov 16 2015 08:37 UTC
you mean caching it before the for loop starts in the earlier code snippet?
AFK
Francesco Biscani
@bluescarni
Nov 16 2015 08:41 UTC
Sorry, I am not completely sure I understand; let me try to see where I am misunderstanding
So you are saying that moving the computation of the degree outside of the loop reduces the runtime of the expansion.cpp test by 400 ms (milliseconds). Is that right?
Ralf Stephan
@rwst
Nov 16 2015 08:43 UTC
Exactly.
Francesco Biscani
@bluescarni
Nov 16 2015 08:44 UTC
so you are replacing 1000 computations of degree() with a single one
Francesco Biscani
@bluescarni
Nov 16 2015 08:51 UTC
which would put the computation of a single degree() to 400 microseconds I guess
whereas the degree() benchmark says that it takes about 10 times less
is that the discrepancy we are talking about?
Ralf Stephan
@rwst
Nov 16 2015 08:53 UTC
No, 5 times. The 30us was on your machine.
Francesco Biscani
@bluescarni
Nov 16 2015 08:54 UTC
ok... also, was the initial expansion.cpp benchmark compiled with the same flags?
Ralf Stephan
@rwst
Nov 16 2015 08:56 UTC
No, my point was that 400us is too much in an absolute sense. The benchpress result could be influenced by anything going wrong in benchpress. I was just asking if it really takes 400us to iterate over the poly to get the degree.
Francesco Biscani
@bluescarni
Nov 16 2015 08:57 UTC
I am trying to understand if we are comparing apples to apples
and to do that we need to know exactly under which conditions the two tests were compiled... as you can see above, compilation flags can easily change runtimes by an order of magnitude
Ralf Stephan
@rwst
Nov 16 2015 09:08 UTC
The symengine benchmark is compiled with the usual flags for symengine, which I cannot find atm. However, I just compiled without debug and it's still 370ms. I mean, the iterating itself should only be a small part of the 400ns per iteration. As soon as you say "yeah, but there is this and that to do", I will concede that it's not a bug. I just had the feeling this couldn't be right.
Francesco Biscani
@bluescarni
Nov 16 2015 09:16 UTC
oh well this is getting weirder.... I checked out your expansion3 branch and apparently it does not make any difference whether I cache the degree or not; it's always around 900ms for the series_expansion_sincos benchmark
the Symengine flags when compiling in release mode look ok
/usr/bin/c++ -std=c++0x -Wall -Wextra -fno-common -fPIC -O3 -march=native -ffast-math -funroll-loops -Wno-unused-parameter -DNDEBUG
the only important bits here are -O3 -DNDEBUG, at least as far as Piranha is concerned
Ralf Stephan
@rwst
Nov 16 2015 09:17 UTC
What gcc? I have 5.1.2
Francesco Biscani
@bluescarni
Nov 16 2015 09:18 UTC
yardbird@berserk build % gcc --version
gcc (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I can try with GCC 5, give me a minute
Basically identical result with GCC 5.2, hovering around 900ms with or without degree caching
yardbird@berserk build_52 % g++-5.2.0 --version                              
g++-5.2.0 (Gentoo 5.2.0 p1.1, pie-0.6.3) 5.2.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Francesco Biscani
@bluescarni
Nov 16 2015 09:25 UTC
a few runs:
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
916ms
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
898ms
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
903ms
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
895ms
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
915ms
yardbird@berserk build_52 % sudo nice -n -19 ./benchmarks/series_expansion_sincos
898ms
Ralf Stephan
@rwst
Nov 16 2015 09:34 UTC
I just see I can't go back to gcc-4.8, so I'll build clang. That can take a while...
Francesco Biscani
@bluescarni
Nov 16 2015 09:38 UTC
Gonna try with clang too
what is your setup, by the way? is that a linux machine?
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
910ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
900ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
915ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
897ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
898ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
893ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
914ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
882ms
yardbird@berserk build_clang % sudo nice -n -19 ./benchmarks/series_expansion_sincos
900ms
this with clang
Ralf Stephan
@rwst
Nov 16 2015 09:40 UTC
I installed a fresh OpenSuSE Leap 42.1 yesterday on an AMD Phenom 6x3GHz.
Francesco Biscani
@bluescarni
Nov 16 2015 09:41 UTC
mhm I see... I have an OpenSuSE installation at work, and it often gives strange performance results
I narrowed it down to some apparent issues with the stock implementation of malloc(), because once I started using tcmalloc instead everything worked as expected
there's a WITH_TCMALLOC option in the symengine build system, it might be worth a shot
I never kept on investigating unfortunately as I don't use that machine a lot
but so far it has been the only setup where I have experienced performance degradation in Piranha... I have compiled it on a variety of different systems (Gentoo, Debian, Arch Linux, Fedora, Windows 64 bit, OSX) and always got rather consistent results
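Assuming WITH_TCMALLOC is a regular cmake boolean option, enabling it would look something like:

    cmake -DWITH_TCMALLOC=yes -DCMAKE_BUILD_TYPE=Release .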
Ralf Stephan
@rwst
Nov 16 2015 09:46 UTC
Will try it as soon as I'm able to compile the file again; I don't know why I suddenly get errors though nothing changed... sigh
Francesco Biscani
@bluescarni
Nov 16 2015 09:47 UTC
the joys of development :)
do you know by any chance if OpenSuSE enables some type of security features by default, like SELinux or similar?
Ralf Stephan
@rwst
Nov 16 2015 09:51 UTC
You can do that perhaps by choosing a profile different from "desktop user", which has the least security.
Francesco Biscani
@bluescarni
Nov 16 2015 09:52 UTC
I see... I thought that maybe the perf degradation was due to some security-related overhead, as the machine is maintained by the IT department. But who knows.
Ralf Stephan
@rwst
Nov 16 2015 14:29 UTC
@bluescarni Nope, it's not the malloc. With tcmalloc I have 1370ms, which is just 50ms less than before.
Ralf Stephan
@rwst
Nov 16 2015 14:51 UTC
@bluescarni And with clang+tcmalloc it's down to 1320ms vs 990ms. I'll stop here, because it's not so relevant anymore if I just watch the loops I write, and I don't have infinite time.
Ondřej Čertík
@certik
Nov 16 2015 15:57 UTC
@rwst if you do make VERBOSE=1 and recompile a file, it tells you the exact options that a file is compiled with:
[ 85%] Building CXX object benchmarks/CMakeFiles/expand2.dir/expand2.cpp.o
cd /home/certik/repos/symengine/benchmarks && /home/certik/repos/hashstack/hd_base/bin/g++    -std=c++0x -Wall -Wextra -fno-common -fPIC -O3 -march=native -ffast-math -funroll-loops -Wno-unused-parameter -I/home/certik/repos/symengine/symengine/teuchos -I/home/certik/repos/symengine -I/home/certik/repos/hashstack/hd_base/include    -o CMakeFiles/expand2.dir/expand2.cpp.o -c /home/certik/repos/symengine/benchmarks/expand2.cpp
As to gcc options, in my experience, in order to get the performance, besides -O3 the next essential option is -march=native, which allows gcc to generate the latest vector instructions (AVX) etc. Then -funroll-loops typically helps, since at least for the work that I do, unrolling a loop speeds it up (it can also slow it down in some pathological cases, but for me it usually speeds things up). Finally, -ffast-math is essential if you use any kind of floating point numbers (float, double); it provides huge speedups.
Ralf Stephan
@rwst
Nov 16 2015 16:16 UTC
@certik Any reason why we don't have -O3, -march=native and -funroll-loops in the default release compile options?
Isuru Fernando
@isuruf
Nov 16 2015 16:18 UTC
@rwst, those flags are on by default for the Release build
After a cmake run, the configured flags are printed. You can check that to see what the compile flags are
Ralf Stephan
@rwst
Nov 16 2015 16:20 UTC
ah but not with cmake ., I see...
Ondřej Čertík
@certik
Nov 16 2015 16:20 UTC
@rwst also with cmake .
If you start from scratch, the symengine cmake will use the Release mode by default, which currently uses the following options: -O3 -march=native -ffast-math -funroll-loops.
It is however possible to modify these, either on the command line or in CMakeCache.txt, and then cmake will preserve the modifications I think.
If you switch to Debug, then we don't use those options either.
Ralf Stephan
@rwst
Nov 16 2015 16:22 UTC
@certik it seems cmake . uses just what was set last time
Ondřej Čertík
@certik
Nov 16 2015 16:26 UTC
Yes, if you already ran cmake before, then cmake just reuses what you set the last time. So if you want to set it to Release mode, either change CMAKE_BUILD_TYPE:STRING=Release in CMakeCache.txt by hand, or put it on the command line: cmake -DCMAKE_BUILD_TYPE=Release ..
Then it will work.
Ralf Stephan
@rwst
Nov 16 2015 16:43 UTC
@bluescarni @certik That explains it. I never compiled any symengine code with -O3. You gave a hint but I didn't believe it... The numbers are now as expected and equal to those seen in series_benchmark.
Ondřej Čertík
@certik
Nov 16 2015 16:47 UTC
@rwst always run benchmarks in Release mode.
Also don't forget that in Debug mode, we run all kinds of safety checks in SymEngine. Those checks are turned off in Release mode.
A typical C++ program has lots of layers of indirection (classes, function calls) that all get optimized out, so if you use -O3 with g++, you can create a tiny wrapper class, for example around a machine integer, that will have exactly zero overhead over using the machine integer directly. But if you compile without optimizations, this class will have an overhead.
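For instance, a toy wrapper along these lines (an illustrative sketch, not actual SymEngine code) is free at -O3 but not at -O0:

    // A tiny wrapper around a machine integer.
    class Int {
        long v_;
    public:
        explicit Int(long v) : v_(v) {}
        Int operator+(Int other) const { return Int(v_ + other.v_); }
        long value() const { return v_; }
    };

    // With g++ -O3 this compiles to the same code as a plain `return a + b;`.
    // Without optimizations the constructor and operator calls are all
    // emitted, and the wrapper has measurable overhead.
    long add(long a, long b) { return (Int(a) + Int(b)).value(); }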
Francesco Biscani
@bluescarni
Nov 16 2015 16:50 UTC
@rwst glad that it got cleared up :) Piranha also has quite extensive self-checking in debug mode; I remember we ran into a similar problem during the summer when Sumith was working on Piranha code
Ralf Stephan
@rwst
Nov 16 2015 16:51 UTC
@certik I know, it's just the first project using cmake I'm helping out with (and I still don't like cmake)
in gmake the default is release, you see
Ondřej Čertík
@certik
Nov 16 2015 16:55 UTC
The default in symengine cmake is release.
In your local copy, you probably switched to Debug at some point in the past.
Ralf Stephan
@rwst
Nov 16 2015 16:56 UTC
I already debated that.
It suffices to compile with debug just once; that changes the default.
Ondřej Čertík
@certik
Nov 16 2015 17:02 UTC
I am challenging your claim. :)
It doesn't for me. If I switch to Debug and then to Release, it will switch the gcc flags correctly as expected.
Ralf Stephan
@rwst
Nov 16 2015 17:04 UTC
But if you switch to Debug and then do cmake . you still get the Debug options.
Ondřej Čertík
@certik
Nov 16 2015 17:04 UTC
I think your confusion might be explained by the fact that if you run cmake once (let's say you switch to Debug) and then run cmake ., you are not running from scratch and not using the default values. Rather, if you already ran cmake in the past, then subsequent runs of cmake reuse your current configuration and allow you to incrementally modify it --- e.g. if you add one more extra flag on the cmake command line, that flag will be added to your configuration.
If you want to run from scratch, then you have to create a new build directory (if you are building out of tree), or erase the CMakeCache.txt if you are building in-tree.
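For example, a from-scratch out-of-tree Release build would be:

    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make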
Ralf Stephan
@rwst
Nov 16 2015 17:05 UTC
But a plain gmake will always do Release mode.
Ondřej Čertík
@certik
Nov 16 2015 17:06 UTC
That's because gmake doesn't store the configuration.
cmake is sort of like the ./configure step.
it prepares the makefiles and stores your configuration in CMakeCache.txt.
Ralf Stephan
@rwst
Nov 16 2015 17:11 UTC
One good thing to come out of it all today was that I installed tcmalloc. Tomorrow I'll hopefully copy some code from Pynac to get expansions of pow(add(...), negative int).
Ondřej Čertík
@certik
Nov 16 2015 17:55 UTC
@rwst tcmalloc is pretty cool, provides some nice speedup.
Ondřej Čertík
@certik
Nov 16 2015 18:51 UTC
@rwst if Piranha can't do the 1/f(z) expansions, then I was planning to implement what @shivamvats did in SymPy this summer in the series_inversion and _series_inversion1() functions: https://github.com/sympy/sympy/blob/e90b303e9d44029cae473408afd2a7a503000364/sympy/polys/ring_series.py#L480
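For reference, _series_inversion1 is based on Newton iteration: starting from g = 1/f(0), each step g <- g*(2 - f*g) roughly doubles the number of correct coefficients. A self-contained sketch of the idea, using plain double coefficient vectors rather than SymPy or Piranha types:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Invert a power series f (with f[0] != 0) to `prec` coefficients via
    // Newton iteration g <- g*(2 - f*g), truncating at each step.
    std::vector<double> series_inverse(const std::vector<double> &f,
                                       std::size_t prec)
    {
        std::vector<double> g{1.0 / f[0]}; // correct to O(x)
        std::size_t n = 1;
        while (n < prec) {
            n = std::min(2 * n, prec);
            // h = f*g, truncated to n coefficients
            std::vector<double> h(n, 0.0);
            for (std::size_t i = 0; i < f.size() && i < n; ++i)
                for (std::size_t j = 0; j < g.size() && i + j < n; ++j)
                    h[i + j] += f[i] * g[j];
            // h = 2 - f*g
            for (std::size_t i = 0; i < n; ++i)
                h[i] = (i == 0 ? 2.0 : 0.0) - h[i];
            // g = g*(2 - f*g), truncated to n coefficients
            std::vector<double> ng(n, 0.0);
            for (std::size_t i = 0; i < g.size(); ++i)
                for (std::size_t j = 0; j < h.size() && i + j < n; ++j)
                    ng[i + j] += g[i] * h[j];
            g = std::move(ng);
        }
        return g;
    }

    // e.g. series_inverse({1.0, 1.0}, 4) -> {1, -1, 1, -1},
    // i.e. 1/(1+x) = 1 - x + x^2 - x^3 + O(x^4).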