would you have time to walk me through that slow benchmark mentioned in the pull request?

thanks... is there a test that can be run immediately exhibiting the slowness?

I am trying with `test_series_expansion_URatP` and `series_expansion_sincos_piranha` at the moment
yes.

benchmarks/series_expansion_sincos_piranha.cpp

vs

benchmarks/series_expansion_sincos_flint.cpp

from the PR

ok... but those differ only by a factor of roughly 2 here; I was wondering if there is a more extreme example available

ok, then I can manually change the `N` value in that benchmark to reproduce the behaviour?
ok thanks

gonna try a few things

so there are a few interesting things coming out of the profiler

for instance, there are about 18M calls to the C++ random number generator

this is part of the routine that estimates the final series size

this does not even account for that much, though: about 13% of the total time

what are the typical sizes of the series involved in this computation?

`N-1` is the degree, so `N` terms
yes it looks like the whole series size estimation is a huge overhead for small/univariate series

`sin(x+x^2)` is calculated by computing `sum((-1)^((i-1)/2) * (x+x^2)^i / i! for odd i < N)`

I can try to disable the machinery and see what happens

do we calculate the `(poly)^i` bit successively, or anew for each `i`?
!
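For scale: recomputing each power from scratch is quadratic in the number of terms, while keeping a running power is linear. A back-of-envelope count (hypothetical helpers, not symengine code):

```cpp
#include <cassert>

// Computing each odd power s^j (j = 1, 3, ..., 2m-1) from scratch costs
// j - 1 polynomial multiplications per power.
long fresh_pow_cost(long m)
{
    long c = 0;
    for (long i = 0; i < m; ++i)
        c += 2 * i; // j = 2*i + 1, cost j - 1 = 2*i
    return c;
}

// Incrementally: one multiply for s*s, then one multiply per further power.
long incremental_cost(long m)
{
    return m >= 2 ? m : 0;
}
```

For `m = 100` terms that's 9900 polynomial multiplies versus 100.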

and 20000 series multiplications

are there a lot of scalar * series multiplications in this example?

@rwst later!

I mean something like `(x+1)*3`
and the formula is implemented internally using piranha's overloaded operators?

yes

piranha is horribly un-optimised for these things: if you do a scalar X series multiplication, it will always be transformed into a series X series multiplication

I meant to do something about it, it's not difficult to improve

but always lacked a clear motivation to do so

I will have a go at it
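The cheap path being described would look roughly like this — a minimal sketch with dense `long` coefficients (an assumption; piranha actually stores terms in a hash table): multiply each coefficient in place instead of promoting the scalar to a one-term series and running the general series X series machinery.

```cpp
#include <cassert>
#include <vector>

// Scalar * series, coefficient-wise: one multiplication per term,
// no term merging, hashing, or series-multiplication setup.
std::vector<long> scalar_mul(std::vector<long> s, long c)
{
    for (auto &coeff : s)
        coeff *= c;
    return s;
}
```

E.g. `(x+1)*3` becomes `scalar_mul({1, 1}, 3)`, i.e. `3 + 3x`.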

Oh. There are some things I thought were not compute-heavy. For e.g. `sin(x+x^2)*cos(x+x^2)` is represented as `1*sin(x+x^2)*cos(x+x^2)`. I don't check the numeric value at the beginning and just multiply the whole result
Code for sin is here, https://github.com/symengine/symengine/blob/master/symengine/series.h#L279

Poly is a piranha rational polynomial. Coeff is piranha::rational and Series is the type for CRTP

it's heavily optimised for very large series

as a workaround I removed series estimation when in single-thread mode; now the benchmark runs in 900 ms. But I am not sure it is correct

ah ok but wait... the two tests for flint and piranha run different Ns

so according to this: http://www.wolframalpha.com/input/?i=Taylor+Series+of+Sin%5B2*x*%28x%2B1%29%5D it basically is a univariate dense polynomial right?

there's also about 32M calls to integer GCD, not sure how flint handles this

anyway I think that the bottom line is that I can squeeze out some extra performance still (maybe another factor of 2?)

but ultimately it all boils down to the dense univariate character of the computation

the cost of accessing a coefficient in a vector (flint) is always going to be at the very least 1 order of magnitude smaller than looking it up in a hash table (piranha)

according to some measurements I did in the past the hash table cost vs a vector during multiplication is around a factor of 20 actually

I **suppose** it would make some sense to switch to a vector dense representation internally for univariate cases
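A toy model of that trade-off (not flint or piranha code — just the two storage strategies side by side): dense puts the coefficient of `x^k` at index `k`, so every update in the multiplication inner loop is a plain vector write, while the sparse map pays a hash lookup per update.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Dense univariate multiplication: index = degree, update = vector write.
std::vector<long> mul_dense(const std::vector<long> &a, const std::vector<long> &b)
{
    std::vector<long> r(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            r[i + j] += a[i] * b[j];
    return r;
}

// Sparse multiplication: exponent -> coefficient map, update = hash lookup.
using Sparse = std::unordered_map<unsigned, long>;
Sparse mul_sparse(const Sparse &a, const Sparse &b)
{
    Sparse r;
    for (const auto &ta : a)
        for (const auto &tb : b)
            r[ta.first + tb.first] += ta.second * tb.second;
    return r;
}
```

Both compute the same product; for a dense univariate computation like this benchmark, the inner-loop cost difference is exactly the vector-write vs hash-lookup gap discussed above.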

@bluescarni it was an algorithmic problem in `series.h`. I'm now down from 3500ms to 450ms with N=200, and N=1000 now takes 133 seconds with this patch:

```diff
--- a/symengine/series.h
+++ b/symengine/series.h
@@ -284,14 +284,15 @@ public:
return Series::sin(c)*Series::series_cos(t, var, prec) + Series::cos(c)*Series::series_sin(t, var, prec);
}
//! fast sin(x)
- Poly res_p(0);
+ Poly res_p(0), monom(s), ssquare(s*s);
Coeff prod(1);
for (unsigned int i = 0; i < prec / 2; i++) {
const short j = 2 * i + 1;
if (i != 0)
prod /= 1 - j;
prod /= j;
- res_p += Series::pow(s, j, prec) * prod;
+ res_p += monom * prod;
+ monom *= ssquare;
}
return res_p;
@@ -338,14 +339,15 @@ public:
const Poly t = s - c;
return Series::cos(c)*Series::series_cos(t, var, prec) - Series::sin(c)*Series::series_sin(t, var, prec);
}
- Poly res_p(1);
+ Poly res_p(1), ssquare(s*s), monom(s*s);
Coeff prod(1);
for (unsigned int i = 1; i <= prec / 2; i++) {
const short j = 2 * i;
if (i != 0)
prod /= 1 - j;
prod /= j;
- res_p += Series::pow(s, j, prec) * prod;
+ res_p += monom * prod;
+ monom *= ssquare;
}
return res_p;
//
```

ok... so the modification I did to series estimation should cut down another factor of 3 or so in the runtime

still pretty far from flint

but in the factor of 20 ballpark

can't really understand how n = 500 results in 32M GCD though

Is it possible that a line like `rational == 1` invokes GCD?
cannot really remember if that was optimised in some way

according to the timings here it's spending almost half of the time normalising the denominators of two polynomials before multiplying them

this was intended as an optimisation: you normalise all coefficients to the same denominator, and then do the multiplication on integers instead of rationals

but maybe this is a pessimization in this case, who knows
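A miniature of that normalisation step, with rationals as `(numerator, denominator)` pairs of `long`s (type names illustrative, not piranha's): put all coefficients `a_i/b_i` over the common denominator `L = lcm(b_i)`, so the polynomial becomes integer numerators plus a single shared denominator, and the multiplication itself can run on integers.

```cpp
#include <cassert>
#include <numeric> // std::lcm (C++17)
#include <utility>
#include <vector>

// A polynomial with all coefficients over one common denominator.
struct NormPoly {
    std::vector<long> num; // numerators, all over den
    long den;              // common denominator
};

// Normalise coefficients a_i/b_i to num_i/L with L = lcm of all b_i.
NormPoly normalise(const std::vector<std::pair<long, long>> &coeffs)
{
    long L = 1;
    for (const auto &c : coeffs)
        L = std::lcm(L, c.second);
    NormPoly p{{}, L};
    for (const auto &c : coeffs)
        p.num.push_back(c.first * (L / c.second)); // a_i/b_i == num_i/L
    return p;
}
```

The cost of computing the lcm and scaling every numerator is what shows up in the profile; whether it pays off depends on how expensive the rational arithmetic it replaces would have been.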

ok this is really weird... after I removed that normalisation step the test with N=200 seems to run in 40ms

maybe I botched something up

this does not seem very right... now N=1000 takes 543ms with Piranha

this is with your patch above @rwst, plus my local changes to Piranha

is there any way in which I can check the correctness?

With a patch to piranha I was able to get `234ms` for piranha vs `1596ms` for flint
```diff
diff --git a/src/mp_rational.hpp b/src/mp_rational.hpp
index 25555eb..470eb72 100644
--- a/src/mp_rational.hpp
+++ b/src/mp_rational.hpp
@@ -238,10 +238,17 @@ class mp_rational
{
// NOTE: all this should never throw because we only operate on mp_integer objects,
// no conversions involved, etc.
- if (m_den.is_unitary() && other.m_den.is_unitary()) {
+ if (m_den.is_unitary() || other.m_den.is_unitary()) {
// Both are integers, we can just add without canonicalising.
// NOTE: this is safe is this and other are the same thing.
- m_num += other.m_num;
+ if (m_den.is_unitary() && other.m_den.is_unitary()) {
+ m_num += other.m_num;
+ } else if (m_den.is_unitary()) {
+ m_den = other.m_den;
+ m_num = m_num*m_den+other.m_num;
+ } else {
+ m_num = m_num+other.m_num*m_den;
+ }
} else if (m_den == other.m_den) {
// Denominators are the same, add numerators and canonicalise.
// NOTE: safe if this and other coincide.
```

Basically `integer + rational` does not need to be canonicalized. This is what flint does

I feel embarrassed to ask, but is this always the case?

I mean that you don't need to canonicalise when adding int to rat?

Consider `a/b + s` where `a`, `b` and `s` are integers and `gcd(a, b) = 1`. Then `a/b + s = (a+b*s)/b`. But `gcd(a+b*s, b) = gcd(a, b) = 1`

using our generic `sin` modified to compile cleanly (Flint is fickle), N=1000 now needs 720ms
`gcd(a+b*s, b) = gcd(a, b)` comes from the fact that `gcd(a, b) = gcd(a-b, b)`
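A numeric spot-check of that identity, using `std::gcd` from C++17's `<numeric>`: if `gcd(a, b) = 1` then `gcd(a + b*s, b) = 1` as well, so `(a + b*s)/b` is already canonical and `integer + rational` needs no GCD/canonicalisation step.

```cpp
#include <cassert>
#include <numeric> // std::gcd (C++17)

// True iff a/b is canonical and stays canonical after adding the integer s.
bool stays_canonical(long a, long b, long s)
{
    return std::gcd(a, b) == 1 && std::gcd(a + b * s, b) == 1;
}
```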

Sorry, edited the formula

Isn't canonicalization only needed before output to humans, and to keep the numbers from growing too big? So it's a matter of when to canonicalize, not if at all.

I feel so dumb :D

ok I guess we need to see what happens once this patch is applied... could you send a PR please?

once it is ready I can re-run the tests with my local modifications on top of that and see where we get to

@rwst my understanding is that heuristically speaking it's better to canonicalise as early as possible, as the complexity grows with the size of the operands

but I just read it once somewhere so not sure in practice

Sorry about the noise. I had reset symengine and the benchmark had reverted to `N=200`

Strange things happen. ATM I'm at a loss why, with my Flint series version, `series.as_basic()` shows `1+...` but `series.get_coeff(0)` is `0`...
I think here we need to set the denominator as well or not? https://github.com/isuruf/piranha/blob/canonicalise/src/mp_rational.hpp#L246

I was also wondering if `m_num = m_num*other.m_den + other.m_num;` can somehow use `multiply_accumulate()` for performance reasons, but maybe it's premature optimisation
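For what it's worth, the fused update has the shape of GMP's `mpz_addmul` (`x += y*z`; piranha's `math::multiply_accumulate` is, as I understand it, the same shape). The expression above fits after a reordering: seed the accumulator from `other.m_num`, then accumulate `m_num * other.m_den` into it. A toy sketch on plain `long`s, where the fusion is only notational — the win for multiprecision operands would be avoiding the `y * z` temporary:

```cpp
#include <cassert>

// The x += y*z shape of mpz_addmul / multiply_accumulate, on plain longs.
inline void multiply_accumulate(long &x, long y, long z)
{
    x += y * z;
}

// m_num*other_den + other_num, rewritten to use the fused update
// (accumulator starts at other_num; variable names mirror the chat).
inline long fused_num_update(long m_num, long other_den, long other_num)
{
    long acc = other_num;
    multiply_accumulate(acc, m_num, other_den);
    return acc;
}
```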