These are chat archives for symengine/symengine

20th
Jun 2015
Francesco Biscani
@bluescarni
Jun 20 2015 12:28
@shivamvats hi! Piranha cannot do that type of computation currently. What you could do is use a sympy expression as the coefficient type (or as a symbolic ring, as you say). I did some testing and it works, but first you need to wrap the sympy expression with something that gives it value-like semantics (e.g., as done in this experimental code: bluescarni/csympy@74a0fac). So like this it does work, but it remains to be determined whether performance will increase much with respect to just using plain sympy (instead of a polynomial with symbolic coefficients).
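(For context, a toy illustration of what "value-like semantics" means here: the coefficient type must be copyable, assignable, and composable like a plain value. Everything below is illustrative — a real wrapper would hold a sympy object, not a string:)

```cpp
#include <memory>
#include <string>

// "sym_coeff" and its string payload are purely illustrative stand-ins;
// a real wrapper would hold a reference to a sympy expression instead.
class sym_coeff {
    std::shared_ptr<const std::string> m_expr; // immutable shared payload
public:
    explicit sym_coeff(std::string s)
        : m_expr(std::make_shared<const std::string>(std::move(s))) {}
    // Copy and assignment come for free and behave like values, because
    // the shared payload is immutable.
    sym_coeff operator+(const sym_coeff &o) const {
        return sym_coeff("(" + *m_expr + ")+(" + *o.m_expr + ")");
    }
    sym_coeff operator*(const sym_coeff &o) const {
        return sym_coeff("(" + *m_expr + ")*(" + *o.m_expr + ")");
    }
    const std::string &str() const { return *m_expr; }
};
```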
Ondřej Čertík
@certik
Jun 20 2015 15:03
@isuruf yes, all the headers should be in symengine/*, so we just need to look at how other libraries handle this problem. I think we can use it without symengine in our tests, that's one solution.
Sumith Kulal
@Sumith1896
Jun 20 2015 15:05
@certik Did you check out the report?
Ondřej Čertík
@certik
Jun 20 2015 15:06
I didn't. Can you send a link?
@Sumith1896 can you send a link?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:13
Ohh sorry, went off for a bit.
Ondřej Čertík
@certik
Jun 20 2015 15:13
So your expand2d is 23ms, while Piranha is 13ms.
Why is Piranha faster?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:13
Yes
Isuru Fernando
@isuruf
Jun 20 2015 15:14
@certik, ginac, for example, does not have a src folder but a folder named ginac, and the headers are in the ginac folder
Ondřej Čertík
@certik
Jun 20 2015 15:15
That's one way to do it. But I know that other projects, like Trilinos, have .h and .cpp files mixed together. We need to look at many C++ projects to get an idea of what people do.
We might also ask on Stack Overflow what the recommended practice is.
Sumith Kulal
@Sumith1896
Jun 20 2015 15:16
I have no idea why Piranha is faster. @bluescarni says not all of Piranha's optimizations have been implemented here, and at this stage they would complicate things
Ondřej Čertík
@certik
Jun 20 2015 15:16
@Sumith1896 we need to get to the bottom of this and figure out where the remaining 2x speedup is coming from. Good job so far though.
@Sumith1896 what other speedups would complicate things?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:16
@certik Have you gone through the previous conversation?
If not
There are a couple of things we did later
You can find it here: sympy/symengine@e6cbe8d
Isuru Fernando
@isuruf
Jun 20 2015 15:18
@certik, sure, I'll look at other libraries. btw, ginac's .cpp and .h files are all in the ginac folder
Ondřej Čertík
@certik
Jun 20 2015 15:19
@Sumith1896 I will have time to read through it later.
Francesco Biscani
@bluescarni
Jun 20 2015 15:19
@certik there are still a few things that piranha does extra, but they will complicate things
Ondřej Čertík
@certik
Jun 20 2015 15:19
@bluescarni ok, I don't understand why we can't "complicate" things.
Let me read the discussion now then.
Francesco Biscani
@bluescarni
Jun 20 2015 15:20
you can of course :) but that's gonna make the code much uglier
Sumith Kulal
@Sumith1896
Jun 20 2015 15:20
Let me summarize
  • std::pair being immutable, we used our own pair
  • rehash hashes the table at the start, but the 10000 here is done just for the benchmark
  • fused multiply-add is used (see the sketch below)
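(For reference, a minimal sketch of the fused multiply-add step on GMP integer coefficients, assuming the plain GMP C API; the variable names are illustrative:)

```cpp
#include <gmp.h>

int main()
{
    mpz_t coeff, a, b;
    mpz_init_set_ui(coeff, 5);
    mpz_init_set_ui(a, 3);
    mpz_init_set_ui(b, 7);

    // Fused multiply-add: coeff += a * b in a single GMP call, avoiding
    // a temporary mpz_t for the intermediate product.
    mpz_addmul(coeff, a, b);

    gmp_printf("%Zd\n", coeff); // prints 26
    mpz_clears(coeff, a, b, NULL);
    return 0;
}
```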
Ondřej Čertík
@certik
Jun 20 2015 15:20
@Sumith1896 you summarized what was done, or what has to be done?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:20
What was done
Ondřej Čertík
@certik
Jun 20 2015 15:20
Finally, with your new branch, were you able to get Piranha's performance?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:21
No, expand2d is the latest version, with everything I have implemented
I too would like to dig to Piranha's depth and reach that speed
Francesco Biscani
@bluescarni
Jun 20 2015 15:22
so here are at least 2 things piranha does that the current code does not:
Sumith Kulal
@Sumith1896
Jun 20 2015 15:22
The whole point of SymEngine is to be as fast as possible, we cannot lag
Francesco Biscani
@bluescarni
Jun 20 2015 15:22
  • you can shave off some time during the lookup operations. When you do a find() and then an insert() you are computing the hash twice
and essentially locating the same thing twice
piranha's hash set has some low-level methods that allow you to decompose the find() and insert() into smaller steps
Ondřej Čertík
@certik
Jun 20 2015 15:23
@Sumith1896, @bluescarni yes, we need to get to the bottom of this and match the performance. Only once we get there will we see how to best integrate the module with the rest of SymEngine etc. But I want to see the code first, no matter how ugly it is.
Francesco Biscani
@bluescarni
Jun 20 2015 15:24
so for instance you can do the following sequence of operations:
  • compute the destination bucket
  • lookup the item
  • if the item is not there, do a special insert() in which the bucket index is passed so you avoid recalculating it
Ondřej Čertík
@certik
Jun 20 2015 15:25
@bluescarni thanks. Yes, we need to do this for sure. @Sumith1896 are you using the same exponent packing as Piranha?
Sumith Kulal
@Sumith1896
Jun 20 2015 15:25
@certik Yes
@bluescarni go on
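(For context, a minimal sketch of Kronecker-style exponent packing, the technique just mentioned; the layout and widths are illustrative, not piranha's actual implementation:)

```cpp
#include <cstdint>
#include <iostream>

// Several small exponents share one machine word, so multiplying two
// monomials is a single integer addition of the packed keys.
constexpr unsigned BITS = 16; // bits per exponent, 4 exponents per key

std::uint64_t pack(std::uint16_t a, std::uint16_t b,
                   std::uint16_t c, std::uint16_t d)
{
    return (std::uint64_t(a) << 3 * BITS) | (std::uint64_t(b) << 2 * BITS)
         | (std::uint64_t(c) << BITS) | std::uint64_t(d);
}

int main()
{
    // (x^1 y^2 z^3 t^4) * (x^4 y^3 z^2 t^1) = x^5 y^5 z^5 t^5:
    // as long as no field overflows, adding packed keys adds exponents.
    std::uint64_t k = pack(1, 2, 3, 4) + pack(4, 3, 2, 1);
    std::cout << std::hex << k << "\n"; // prints 5000500050005
    return 0;
}
```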
Francesco Biscani
@bluescarni
Jun 20 2015 15:26
See these methods for what to use:
so the basic plan is (sketched in code after this list):
  • use _bucket to compute the destination bucket
  • use _find to locate the element in C
  • if the term is not already there, use _unique_insert
  • otherwise do the multiply_accumulate, as you are already doing
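(A self-contained toy analogue of that plan — the real piranha methods are _bucket, _find, and _unique_insert as listed above; the table and names below are purely illustrative:)

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <list>
#include <utility>
#include <vector>

// Toy chained hash table: the bucket index is computed once and reused
// by both the lookup and the insert, instead of hashing the key twice
// as a plain find() followed by insert() would.
struct toy_table {
    std::vector<std::list<std::pair<long, long>>> buckets{1024};

    std::size_t bucket_of(long key) const {
        return std::hash<long>{}(key) % buckets.size();
    }
    long *find_in(std::size_t b, long key) {
        for (auto &kv : buckets[b])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
    void insert_in(std::size_t b, long key, long cf) {
        buckets[b].emplace_back(key, cf);
    }
};

int main()
{
    toy_table C;
    long key = 42, cf_a = 3, cf_b = 7;
    std::size_t b = C.bucket_of(key);      // hash computed exactly once
    if (long *cf = C.find_in(b, key))      // lookup with the known bucket
        *cf += cf_a * cf_b;                // term present: accumulate
    else
        C.insert_in(b, key, cf_a * cf_b);  // term absent: insert, reuse b
    std::cout << *C.find_in(C.bucket_of(key), key) << "\n"; // prints 21
    return 0;
}
```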
Sumith Kulal
@Sumith1896
Jun 20 2015 15:31
What's the second optimization?
Francesco Biscani
@bluescarni
Jun 20 2015 15:32
The second optimisation is to perform the insertion operations into C in a specific order
right now, when you are doing those lookup and insertion operations in the main multiplication loop you are jumping all around over the hash table
you might insert a term at the beginning of the table, and maybe the next term is inserted at the end, then the next in the middle of the table, etc.
so basically what you need to do is to create a vector of pointers to the existing elements of A and B
Isuru Fernando
@isuruf
Jun 20 2015 15:35
@certik, Both gcc and clang builds on OSX in travis are passing now
Francesco Biscani
@bluescarni
Jun 20 2015 15:35
and then order the operations so that you will be writing into consecutive memory positions in the destination table
ignore the multithreaded part, just assume only 1 thread is being used (this simplifies things quite a bit)
if you look at https://github.com/bluescarni/piranha/blob/master/src/polynomial.hpp#L1202 and the following lines, that is where the vector of pointers to the existing terms in A and B is ordered according to the bucket they would occupy were they to be inserted into C
the main idea is to divide the multiplication in blocks
Francesco Biscani
@bluescarni
Jun 20 2015 15:41
so you pick the first term in A (after the sorting above has been done) and the first N terms of B, and you have your first block
due to the way kronecker packing and hashing work, when you perform the multiplication within this block you will be writing only at the beginning of the destination hash table
so instead of jumping all over the place you are writing into a limited subset of the table, specifically at the beginning
it's not important that you get all the details right now
just the basic philosophy: divide by blocks, and make it so in each block we are writing only in a limited subset of the table
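(A rough sketch of the sorting step just described, under the single-thread assumption above; the types and the bucket function are illustrative, not piranha's actual code:)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct term { std::uint64_t key; long cf; }; // packed exponent + coeff

// Order pointers to a polynomial's terms by the bucket their key would
// occupy in a destination table of n_buckets buckets. After sorting both
// A and B this way, processing them block by block writes into a narrow,
// mostly consecutive region of the destination table instead of jumping
// all over it.
std::vector<const term *> sort_by_destination(const std::vector<term> &poly,
                                              std::size_t n_buckets)
{
    std::vector<const term *> ptrs;
    for (const auto &t : poly)
        ptrs.push_back(&t);
    std::sort(ptrs.begin(), ptrs.end(),
              [n_buckets](const term *a, const term *b) {
                  // With Kronecker packing the hash is essentially the
                  // packed key itself, so key % n_buckets approximates
                  // the destination bucket.
                  return a->key % n_buckets < b->key % n_buckets;
              });
    return ptrs;
}
```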
Sumith Kulal
@Sumith1896
Jun 20 2015 15:44
This seems like a very worthy optimization. I'd like to implement both of the above
Francesco Biscani
@bluescarni
Jun 20 2015 15:46
There's one important catch: you need to have an idea of how big the table will need to be before the multiplication
Ondřej Čertík
@certik
Jun 20 2015 15:46
Yes, we should definitely implement this
@isuruf great!
Francesco Biscani
@bluescarni
Jun 20 2015 15:46
you need to anticipate the rough size of the result
now you are doing rehash(10000) because you already know the result is about 6000 terms
but in the real cases you need to be able to guess the number
either that, or you will have to re-sort the vector of pointers each time you do a rehash to increase the size of the table
because rehashing completely changes the destination buckets of the resulting terms
for the time being you can keep the rehash(10000)
so we can see if and how much this improves performance
then later on I can tell you how the guesstimation works in piranha
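(Piranha's actual estimator is deferred above; for a flavour of the sizing problem, here is one simple upper bound that works for dense inputs, purely as an illustration:)

```cpp
#include <cstdint>
#include <iostream>

// A dense polynomial in n variables of total degree <= d has C(n + d, n)
// monomials, so the product of two dense polynomials of degrees d1 and d2
// has at most C(n + d1 + d2, n) terms; that bound can seed rehash().
// This is NOT piranha's estimator, just a simple dense-case bound.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k)
{
    std::uint64_t r = 1;
    for (std::uint64_t i = 1; i <= k; ++i)
        r = r * (n - k + i) / i; // division is exact at each step
    return r;
}

int main()
{
    // e.g. two dense degree-15 polynomials in 4 variables:
    std::cout << binomial(4 + 15 + 15, 4) << "\n"; // at most 46376 terms
    return 0;
}
```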
Ondřej Čertík
@certik
Jun 20 2015 15:50
@bluescarni thanks. Yes, exactly. For now, we just need to figure out how to match the performance on this particular example, so that we can see what things influence the performance and which don't.
Francesco Biscani
@bluescarni
Jun 20 2015 15:51
@Sumith1896 you basically have to digest: https://github.com/bluescarni/piranha/blob/master/src/polynomial.hpp from line 1184 to 1250
Ondřej Čertík
@certik
Jun 20 2015 15:51
I would stick to just 1 thread, definitely for now, and maybe even permanently as a design decision. That way the code is much simpler, and parallelization must be achieved at a higher level, for example with matrix manipulations.
Francesco Biscani
@bluescarni
Jun 20 2015 15:52
@certik it is all entangled in this specific algorithm... basically in the multithreaded case you take the extra step of making sure that threads are assigned non-overlapping blocks
Ondřej Čertík
@certik
Jun 20 2015 15:53
Right.
Francesco Biscani
@bluescarni
Jun 20 2015 15:53
also in the multithreaded case you cannot rehash at all during multiplication, hence the need to estimate
in the single thread case you could stop and rehash without too much difficulty
Ondřej Čertík
@certik
Jun 20 2015 15:54
@bluescarni one question I have about the parallelization is that it seems to me you can get much better scaling (i.e. utilization of the CPU cores) if you stick to 1 thread, and then parallelize let's say a matrix Jacobian calculation (which is trivially parallelizable, as the parts do not depend on each other at all).
Francesco Biscani
@bluescarni
Jun 20 2015 15:54
at what level would you parallelise? at the poly multiplication or at a higher level?
Ondřej Čertík
@certik
Jun 20 2015 15:55
At a much higher level --- essentially doing independent manipulations (let's say poly manipulations) --- i.e. differentiating the individual matrix elements in parallel.
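(A minimal sketch of that higher-level parallelism using std::async; "Expr" and "diff" are placeholders, not SymEngine's API:)

```cpp
#include <future>
#include <vector>

// Placeholders: "Expr" stands in for a symbolic expression and "diff"
// for a serial, thread-safe differentiation routine.
using Expr = double;
static Expr diff(Expr f, int /*var*/) { return 2 * f; }

// Differentiate each matrix entry in its own task. The entries are fully
// independent, so the parallelism lives above the polynomial level and
// each individual diff() call stays single-threaded and simple.
std::vector<Expr> diff_entries(const std::vector<Expr> &entries, int var)
{
    std::vector<std::future<Expr>> tasks;
    for (const Expr &e : entries)
        tasks.emplace_back(std::async(std::launch::async,
                                      [&e, var] { return diff(e, var); }));
    std::vector<Expr> out;
    for (auto &t : tasks)
        out.push_back(t.get());
    return out;
}
```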
Francesco Biscani
@bluescarni
Jun 20 2015 15:55
ah yes that is good
Ondřej Čertík
@certik
Jun 20 2015 15:56
I am just using common sense (and maybe I am wrong): if you parallelize 1 fateman benchmark, it doesn't scale linearly with the number of cores.
Francesco Biscani
@bluescarni
Jun 20 2015 15:56
but this specific algorithm is concerned with parallel multiplication of very large polynomials
it can be almost-linear, but of course if you can parallelise at a higher level it is better
Ondřej Čertík
@certik
Jun 20 2015 15:57
so if you want to get linear scaling, you must parallelize at a higher level (and I understand that sometimes your application doesn't have a higher level; perhaps your application really is how to do the fateman benchmark as fast as possible, in which case there is no other choice than what Piranha is doing).
Francesco Biscani
@bluescarni
Jun 20 2015 15:58
yes... if you are doing something like Poisson brackets, differentiation, integration, etc... everything gets much simple
*simpler
Ondřej Čertík
@certik
Jun 20 2015 15:59
I am keeping PyDy as one of our applications; it has sizeable symbolic matrices and requires fast manipulation of those.
Francesco Biscani
@bluescarni
Jun 20 2015 15:59
in Piranha I have a thread pool thingie which is supposed to be used for these things as well (it is already used in poly multiplication) but I haven't gotten around to parallelizing at a higher level
yeah it would make sense in that case to parallelise at higher level
there are still some challenges but it's gonna be easier
to be honest all this stuff about parallel multiplication of polynomials is a bit of a dick measuring contest in the community... there's so much more, but it's a well defined problem which people love to optimise and write papers about
Francesco Biscani
@bluescarni
Jun 20 2015 16:05
it is a bit like with GMP: it is super-optimised for very very large integers, yet a sizeable part of the users of GMP are using it with one or two limbs at most... and they would benefit much more from a small integer optimisation rather than being able to multiply millions of digits
Ondřej Čertík
@certik
Jun 20 2015 16:08
@bluescarni exactly.
However, these poly benchmarks are the first thing that people will try with SymEngine, for example in Sage, and so we need to show that the performance is there.
And there can be multiple data structures: some work great for this benchmark, others might work better for other things. That's fine.
Francesco Biscani
@bluescarni
Jun 20 2015 16:13
yeah I agree. It is interesting stuff from the point of view of computer science and I personally enjoy doing it, but from the point of view of a software library that people use to actually do computations, the ratio development effort / user benefit might not be so great
Ondřej Čertík
@certik
Jun 20 2015 16:14
Essentially I am willing to maybe give up a few percent of performance if it simplifies the code a lot, but I am not willing to give up 2x performance, as it is now.
@bluescarni yeah. With SymPy, it was super fast to develop and it is tremendously useful to a lot of people. SymEngine is much harder to develop and nail the speed, but it will also be useful to people. So is Piranha. You spent years optimizing it, but if you count all the time it saves times the number of users, it pays off.
(as long as you get a lot of users of course.)
Ondřej Čertík
@certik
Jun 20 2015 16:19
It's like why companies like Facebook and Google spend big effort trying to shave off 1% performance, because it saves them millions of dollars in electricity and other costs.
Francesco Biscani
@bluescarni
Jun 20 2015 16:20
Been using sympy quite a bit myself for research, made sure to always cite it :)
Ondřej Čertík
@certik
Jun 20 2015 16:21
Having tried various other programs for polynomials, I find Piranha is by far the fastest. Perhaps it is possible to get even faster, but it seems you are pretty close to the ideal speed already. So that's why I am using it to gauge speed.
Francesco Biscani
@bluescarni
Jun 20 2015 16:23
It certainly can be faster, but I remember estimating that at least for some benchmarks the single-core speed is "optimal" in some sense
in the fateman benchmarks, if you use double-precision coefficients, you get around 5 clock cycles per term multiplication
and 5 cycles is the latency of a single double-precision multiplication, so, unless you use vectorization, it's basically optimal
Ondřej Čertík
@certik
Jun 20 2015 16:24
Exactly.
So if we are 2x slower, it means there is still meat on the bone that we need to eat.
While you already ate all the good meat.
Francesco Biscani
@bluescarni
Jun 20 2015 16:27
there's still work to do on highly sparse cases, but see above... not sure how much it matters in the end
Ondřej Čertík
@certik
Jun 20 2015 16:27
Right.
Another way to look at it is that nobody will really come up with a competing library which will be much faster than Piranha, since it is close to optimal.
Francesco Biscani
@bluescarni
Jun 20 2015 16:29
TRIP can supposedly be faster, but it's hard to say without having access to the thing
Ondřej Čertík
@certik
Jun 20 2015 16:30
How much faster do they claim to be?
Isuru Fernando
@isuruf
Jun 20 2015 16:30
@certik, in travis with the new PR, sudo is required only to install libraries via apt-get. Once this issue is closed (travis-ci/travis-ci#4113), we can move to the docker-based infrastructure
Ondřej Čertík
@certik
Jun 20 2015 16:31
@isuruf great job, I really appreciate your work on this.
Francesco Biscani
@bluescarni
Jun 20 2015 16:31
they claim better parallelisation performance, and that may well be true, as I never really had access to anything with more than 16 physical cores
they tested up to 64 I think? or maybe 128
can't remember
Ondřej Čertík
@certik
Jun 20 2015 16:32
@bluescarni what about single thread?
Francesco Biscani
@bluescarni
Jun 20 2015 16:32
it's harder to say, because the papers they put out typically report only the parallel speedup and give timings from 4 processors onwards
at least as far as I remember
Abinash Meher
@abinashmeher999
Jun 20 2015 16:33
@isuruf Why whitelist? Can't they be installed in docker infrastructure with sudo?
Francesco Biscani
@bluescarni
Jun 20 2015 16:33
at least at some point, piranha had better serial performance, but it's been years now and things have probably changed in one direction or the other
piranha should parallelise much better these days
but again, I have no access to really big machines
Ondřej Čertík
@certik
Jun 20 2015 16:35
I only care about single-thread performance for now.
Isuru Fernando
@isuruf
Jun 20 2015 16:36
@abinashmeher999, you can't use sudo in the docker (container-based) infrastructure. That's why there were failures in your repo. Only the standard infrastructure can use sudo. We are trying to shift to docker because it is faster and has shorter queues.
Abinash Meher
@abinashmeher999
Jun 20 2015 16:39
I see, so the packages that you mentioned in the issue are restricted by travis for installation on the docker-based infrastructure?
Isuru Fernando
@isuruf
Jun 20 2015 16:39
Yes, all packages are restricted, except the ones in the whitelist
Francesco Biscani
@bluescarni
Jun 20 2015 16:39
I will be gone for a while, will be back later
Abinash Meher
@abinashmeher999
Jun 20 2015 16:39
So I guess sudo: true was not an ideal solution.
Francesco Biscani
@bluescarni
Jun 20 2015 16:39
@Sumith1896 let me know if you have questions about the tricky code
Isuru Fernando
@isuruf
Jun 20 2015 16:41
@abinashmeher999, it was; that is how we are currently testing. Since symengine is an old project, sudo: true was implied. For your fork, you had to add sudo: true because sudo: false was the default
We are just trying to shift to the docker-based infrastructure, since it is faster (at least that's what travis says)
Abinash Meher
@abinashmeher999
Jun 20 2015 16:44
Oh ok. Thanks! :smile: Have we asked them to shift us to the docker-based infrastructure once this is done? Or can we do it ourselves?
Isuru Fernando
@isuruf
Jun 20 2015 16:44
We can do it ourselves. Just adding sudo: false would be enough, but to do that we need those packages whitelisted.
Abinash Meher
@abinashmeher999
Jun 20 2015 16:45
I forgot to tell you. The tests are running fine now. Thanks for the help. I am doing a few cleanups before it's ready for review.
Isuru Fernando
@isuruf
Jun 20 2015 16:46
Great. Comment in the PR when it's ready for review
Ondřej Čertík
@certik
Jun 20 2015 17:29
@isuruf should we add sudo: true for now, so that tests work on forks as well?
Isuru Fernando
@isuruf
Jun 20 2015 17:30
Sure
It's added in @abinashmeher999's PR.
Ondřej Čertík
@certik
Jun 20 2015 17:38
OK.
Sumith Kulal
@Sumith1896
Jun 20 2015 18:57
@certik I have a query regarding the benchmarks
Why did the expand2b benchmark slow down on my branch?
And why did expand2c slow down when expand2d was implemented?
Ondřej Čertík
@certik
Jun 20 2015 20:28
@Sumith1896 can you post a link to the commit which caused the slowdown? Let's have a look.