These are chat archives for dropbox/pyston

6th
Jul 2015
Marius Wachtler
@undingen
Jul 06 2015 21:15
looking forward to speed.pyston.org results :-)
Kevin Modzelewski
@kmod
Jul 06 2015 21:21
                                     pr/651~~:              pr/651:
       django_template.py             4.9s (2)             4.7s (2)  -4.1%
            pyxl_bench.py             3.9s (2)             4.0s (2)  +0.6%
sqlalchemy_imperative2.py             5.6s (2)             5.3s (2)  -4.5%
        django_migrate.py             1.9s (2)             2.0s (2)  +3.9%
      virtualenv_bench.py             7.9s (2)             5.3s (2)  -33.0%
                  geomean                 4.4s                 4.0s  -8.5%
I think we have some investigation to do to figure out the best places to get gains from this
that "patch block transitions" pr seems like a good change but didn't affect perf much :/
Marius Wachtler
@undingen
Jul 06 2015 21:24
I like the virtualenv result but the other ones don't look very exciting. But I think the jemalloc switch should give us faster execution (because of the added allocations inside the rewriter + baseline JIT)
yeah I tried to improve the generated code more but it turned out to not change perf / make it worse. Time is probably better spent speeding up the rewriter etc...
Kevin Modzelewski
@kmod
Jul 06 2015 21:41
I think the StatTimers could be useful here
I think the llvm time went down a bunch (300ms I think on django_template), but yeah the rewriter time went up about 100ms
I also tried doing #define INVESTIGATE_STAT_TIMER "us_timer_in_baseline_jitted_code" to see what it looks like
and most of it seems reasonable; it very rarely stops in anything that I would say is "baseline jit overhead"
I did notice that it does spend about 10% of its time doing set/getLocalHelper
which maybe we could optimize away
but anyway, that could be another line of investigation
I'm going to work today on a new tool that should hopefully make the "investigate_stat_timer" stuff more useful + easier to use
Marius Wachtler
@undingen
Jul 06 2015 21:44
yes, the getLocalHelper stuff is probably the best part to optimize inside the JIT (at least in microbenchmarks it can get quite hot)
ok cool :-)
Kevin Modzelewski
@kmod
Jul 06 2015 21:45
I think there could be some low-hanging fruit with the rewriter
since it has to malloc all of those individually
I think adding the std::functions for addAction adds a bit of overhead
Marius Wachtler
@undingen
Jul 06 2015 21:46
yeah, and for example the call lambda is AFAIK ~200 bytes large
Kevin Modzelewski
@kmod
Jul 06 2015 21:46
oh heh
Marius Wachtler
@undingen
Jul 06 2015 21:48
I don't know exactly how the rewriter I want should look. But while the current vector of void() functions is nice, I think it probably causes quite a bit of overhead.
Kevin Modzelewski
@kmod
Jul 06 2015 21:57
the "investigate_stat_timer" thing gives a rudimentary way to profile the rewriter
ie do "conditional sampling" where it only takes a sample if it's in the rewriter
Travis Hance
@tjhance
Jul 06 2015 22:24
we could just replace the lambdas with an enum saying what function to call and a struct of arguments
use SmallVectors more (it uses a std::vector for each action, I think)
and of course it news each created RewriterVar
Marius Wachtler
@undingen
Jul 06 2015 22:49
I think in this case the SmallVectors may actually hurt perf if we don't reuse the memory, because they will increase the malloc size and they will always get copied into the lambda. I think it's better to manually dynamically allocate the stuff we will need in the emitting phase (can use SmallVectors there) already inside the collecting phase, and just add a pointer to the data to the action-list lambda. This should reduce the number of allocations and copies. (Currently the size for a call is at least ~200 bytes.)
I'm not sure if we can use std::unique_ptrs inside the lambdas to transfer the ownership with C++11. But a shared_ptr should work, and I would expect that copying a shared_ptr is cheaper than the additional allocs + copy of 200 bytes.
Marius Wachtler
@undingen
Jul 06 2015 23:02
Oh, and I don't want to criticize the rewriter too much; it's nice to work with and easy to extend. It's just that now that I'm (ab)using the rewriter very often for the JIT tier, the perf overhead of the easily extendable design becomes more noticeable
@tjhance do you think it would be (easily) possible to split the rewriter into two classes: the two-phase one for ICs, and a one-phase direct-emit one for the baseline JIT? I haven't looked into why the rewriter has two phases, but I suspect it's to allocate registers to the RewriterVar*s?