@shivamvats hi! Piranha cannot do that type of computation currently. What you could do is use a sympy expression as the coefficient type (or as a symbolic ring, as you say). I did some testing and it works, but first you need to wrap the sympy expression in something that gives it value-like semantics (e.g., as done in this experimental code: bluescarni/csympy@74a0fac). So it does work this way, but it remains to be determined whether performance will improve much with respect to just using plain sympy (instead of a polynomial with symbolic coefficients).

@isuruf yes, all the headers should be in `symengine/*`, so we just need to look at how other libraries handle this problem. I think we can use it without `symengine` in our tests, that's one solution.
@certik Did you check out the report?

I didn't. Can you send a link?

@Sumith1896 can you send a link?

Ok, I found it myself, here it is: https://github.com/sympy/sympy/wiki/Benchmark-results-expand2b,-SymEngine

Ohh sorry went off for a bit.

So your expand2d is 23ms, while Piranha is 13ms.

Why is Piranha faster?

Yes

@certik, ginac for example does not have a `src` folder, but a folder named `ginac`. And then the headers would be in the `ginac` folder.
That's one way to do it. But I know that other projects, like Trilinos, have .h and .cpp files mixed together. We need to look at many C++ projects to get an idea of what people do.

We might also ask on Stack Overflow what the recommended practice is.

I have no idea why Piranha is faster. @bluescarni says not all of Piranha's optimizations have been implemented here, but at this stage implementing them would complicate things.

@Sumith1896 we need to get to the bottom of this and figure out where the remaining 2x speedup is coming from. Good job so far though.

@Sumith1896 what other speedups would complicate things?

@certik Have you gone through the previous conversation?

If not

There are a couple of things we did later

You can find it here: sympy/symengine@e6cbe8d

@certik, sure, I'll look at other libraries. btw, ginac's .cpp and .h are all in the ginac folder

@Sumith1896 I will have time to read through it later.

@certik there are a few things still that piranha does extra, but they would complicate things

@bluescarni ok, I don't understand why we can't "complicate" things.

Let me read the discussion now then.

you can of course :) but that's gonna make the code much uglier

Let me summarize:

- `std::pair` being immutable, we used our own pair
- `rehash` hashes the table at the start; the 10000 here is used just for the benchmark
- fused multiply-add is used

@Sumith1896 you summarized what was done, or what has to be done?

What was done

Finally, with your new branch, were you able to get Piranha's performance?

No, expand2d is the latest version, with everything I have implemented

I would also like to go into Piranha's depths and reach that speed

so here are at least 2 things piranha does that the current code does not:

The whole point of SymEngine is to be as fast as possible, we cannot lag

- you can shave off some time during the lookup operations. When you do a `find()` and then an `insert()` you are computing the hash twice, and locating the same thing twice essentially. piranha's hash set has some low-level methods that allow you to decompose the `find()` and `insert()` into smaller steps
@Sumith1896, @bluescarni yes, we need to get to the bottom of this and match the performance. Only once we get there will we see how to best design the module with the rest of SymEngine etc. But I want to see the code first, no matter how ugly it is.

so for instance you can do the following sequence of operations:

- compute the destination bucket
- look up the item
- if the item is not there, do a special `insert()` in which the bucket index is passed, so you avoid recalculating it

@bluescarni thanks. Yes, we need to do this for sure. @Sumith1896 are you using the same exponent packing as Piranha?

@certik Yes

@bluescarni go on

See these methods for what to use:

so the basic plan is:

- use `_bucket` to compute the destination bucket
- use `_find` to locate the element in `C`
- if the term is not already there, use `_unique_insert`
- otherwise do the `multiply_accumulate`, as you are already doing

What's the second optimization?

The second optimisation is to order the insertion operations into `C` in a specific order
right now, when you are doing those lookup and insertion operations in the main multiplication loop, you are jumping all around the hash table

you might insert a term at the beginning of the table, and maybe the next term is inserted at the end, then the next in the middle of the table, etc.

so basically what you need to do is to create a vector of pointers to the existing elements of `A` and `B`

@certik, Both gcc and clang builds on OSX in travis are passing now

and then order the operations so that you will be writing into consecutive memory positions in the destination table

so all this is done here: https://github.com/bluescarni/piranha/blob/master/src/polynomial.hpp#L1184

ignore the multithreaded part, just assume only 1 thread is being used (this simplifies things quite a bit)

if you look at https://github.com/bluescarni/piranha/blob/master/src/polynomial.hpp#L1202 and the following line, that is where the vectors of pointers to the existing terms in `A` and `B` are ordered according to the bucket they *would* occupy were they to be inserted in `C`

the main idea is to divide the multiplication in **blocks**

so you pick the first term in `A` (after the sorting above has been done) and the first `N` terms of `B`, and you have your first block
due to the way kronecker packing and hashing work, when you perform the multiplication within this block you will be writing only at the beginning of the destination hash table

so instead of jumping all over the place you are writing into a limited subset of the table, specifically at the beginning

it's not important that you get all the details right now

just the basic philosophy: divide by blocks, and make it so in each block we are writing only in a limited subset of the table

This seems like a very worthy optimization. I'd like to implement both the above

There's one important catch: you need to have an idea of how big the table will need to be *before* the multiplication

Yes, we should definitely implement this

@isuruf great!

you need to anticipate the rough size of the result

now you are doing `rehash(10000)` because you already know the result is about 6000 terms
but in the real cases you need to be able to guess the number

either that, or you will have to re-sort the vector of pointers each time you do a rehash to increase the size of the table

because rehashing changes completely the destination buckets of the resulting terms

for the time being you can keep the `rehash(10000)`, so we can see if and how much this improves performance

then later I can tell you how the guesstimation works in piranha

@bluescarni thanks. Yes, exactly. For now, we just need to figure out how to match the performance on this particular example, so that we can see what things influence the performance and which don't.

@Sumith1896 you basically have to digest: https://github.com/bluescarni/piranha/blob/master/src/polynomial.hpp from line 1184 to 1250

I would stick to just 1 thread, definitely for now, but actually maybe as a design decision. That way the code is much simpler, and parallelization must be achieved at a higher level, for example with matrix manipulations.

@certik it is all entangled in this specific algorithm... basically in the multithreaded case you take the extra step of making sure that threads are assigned non overlapping blocks

Right.

also in the multithreaded case you cannot rehash at all during multiplication, hence the need to estimate

in the single thread case you could stop and rehash without too much difficulty

@bluescarni one question I have about the parallelization is that it seems to me you can get much better scaling (i.e. utilization of the CPU cores) if you stick to 1 thread, and then parallelize let's say a matrix Jacobian calculation (which is trivially parallelizable, as the parts do not depend on each other at all).

at what level would you parallelise? at the poly multiplication or at a higher level?

At much higher level --- essentially doing independent manipulations (let's say poly manipulations) --- i.e. differentiating the individual matrix elements in parallel.

ah yes that is good

I am just using common sense (and maybe I am wrong): if you parallelize 1 fateman benchmark, it doesn't scale linearly with the number of cores.

but this specific algorithm is concerned with parallel multiplication of very large polynomials

it can be almost-linear, but of course if you can parallelise at a higher level it is better

so if you want to get linear scaling, one must parallelize at a higher level (and I understand that sometimes your application doesn't have a higher level, perhaps your application really is how to do the fateman benchmark as fast as possible, in that case there is no other choice than what Piranha is doing).

yes... if you are doing something like Poisson brackets, differentiation, integration, etc... everything gets much simpler

I am keeping PyDy as one of our applications, which has decent symbolic matrices and requires fast manipulation of those.

in Piranha I have a thread pool thingie which is supposed to be used for these things as well (it is already used in poly multiplication) but I haven't gotten around to parallelizing higher level

yeah it would make sense in that case to parallelise at higher level

there are still some challenges but it's gonna be easier

to be honest all this stuff about parallel multiplication of polynomials is a bit of a dick measuring contest in the community... there's so much more, but it's a well defined problem which people love to optimise and write papers about

it is a bit like with GMP: it is super-optimised for very very large integers, yet a sizeable part of the users of GMP are using it with one or two limbs at most... and they would benefit much more from a small integer optimisation rather than being able to multiply millions of digits

@bluescarni exactly.

However, these poly benchmarks are the first thing that people will try with SymEngine, for example in Sage, so we need to show that the performance is there.

And there can be multiple data structures: some that work great for this benchmark, others that work better for other things. That's fine.

yeah I agree. It is interesting stuff from the point of view of computer science and I personally enjoy doing it, but from the point of view of a software library that people use to actually do computations, the ratio development effort / user benefit might not be so great

Essentially I am willing to give up maybe a few percent of performance if it simplifies the code a **lot**, but I am not willing to give up 2x performance, as it is now.

@bluescarni yeah. With SymPy, it was super fast to develop and it is tremendously useful to a lot of people. SymEngine is much harder to develop and nail the speed, but it will also be useful to people. So is Piranha. You spent years optimizing it, but if you count all the time it saves times the number of users, it pays off.

(as long as you get a lot of users of course.)

It's like why companies like Facebook and Google spend big effort trying to shave off 1% performance, because it saves them millions of dollars in electricity and other costs.

Been using sympy quite a bit myself for research, made sure to always cite it :)

Having tried various other programs for polynomials, Piranha is by far the fastest. Perhaps it is possible to get even faster, but it seems you are pretty close to the ideal speed already. So that's why I am using it to gauge speed.

It certainly can be faster, but I remember estimating that at least for some benchmarks the single-core speed is "optimal" in some sense

in the fateman benchmarks, if you use double-precision coefficients, you get around 5 clock cycles per term multiplication

and 5 cycles is the latency of a single double-precision multiplication, so, unless you use vectorization, it's basically optimal

Exactly.

So if we are 2x slower, it means there is still meat on the bone that we need to eat.

While you already ate all the good meat.

there's still work to do on highly sparse cases, but see above.. not sure how much it matters in the end

Right.

Another way to look at it is that nobody will really come up with a competing library which will be much faster than Piranha, since it is close to optimal.

TRIP can supposedly be faster, but it's hard to say without having access to the thing

How much faster do they claim to be?

@certik, in travis with the new PR, `sudo` is required only to install libraries via `apt-get`. Once this issue is closed, travis-ci/travis-ci#4113, we can move to the docker based infrastructure
@isuruf great job, I really appreciate your work on this.

they claim better parallelisation performance, and that may well be true, as I never really had access to anything with more than 16 physical cores

they tested up to 64 I think? or maybe 128

can't remember

@bluescarni what about single thread?

it's harder to say, because the papers they put out typically report only the parallel speed up and give timings from 4 processors onwards

at least as far as I remember

@isuruf Why whitelist? Can't they be installed in the docker infrastructure with `sudo`?
at least at some point, piranha had better serial performance, but it's been years now and things have probably changed in one direction or the other

piranha should parallelise much better these days

but again, I have no access to really big machines

I only care about single-thread performance for now.

@abinashmeher999, you can't use `sudo` in docker. That's why there were failures in your repo. Only the standard (non-container) infrastructure can use `sudo`. We are trying to shift to docker because it is faster and has shorter queues.
I see, so are the packages that you mentioned in the issue restricted by travis for installation in the docker based infrastructure?

Yes, all packages are restricted, except the ones in the whitelist

I will be gone for a while, will be back later

So I guess `sudo: true` was not an ideal solution.
@Sumith1896 let me know if you have questions about the tricky code

@abinashmeher999, it was, that is how we are currently testing. Since symengine is an old project, `sudo: true` was implied. For your fork, you had to add `sudo: true` because `sudo: false` was the default
We are just trying to shift to the docker based infrastructure, since it is faster (at least that's what travis is saying)

Oh ok. Thanks! :smile: Have we asked them to shift us to the docker based infrastructure once this is done? Or can we do it ourselves?

We can do it ourselves. Just adding `sudo: false` would be enough, but to do that we need those packages whitelisted.
I forgot to tell you. The tests are running fine now. Thanks for the help. I am doing a few cleanups before it's ready for review.

Great. Comment in the PR when it's ready for review

@isuruf should we add `sudo: true` for now, so that tests work on forks as well?
Sure

It's added in @abinashmeher999's PR.

OK.

@certik I have a query regarding the benchmarks

Why did the `expand2b` benchmark slow down on my branch, and so did `expand2c`, when `expand2d` was implemented?
@Sumith1896 can you post a link to the commit which caused the slowdown? Let's have a look.