These are chat archives for dropbox/pyston

18th
Jun 2015
Chris Toshok
@toshok
Jun 18 2015 02:47
+ 23.20% pyston_release pyston [.] 0x0000000000507df1 runtime ic?
ahh, no. that’s what happens if you strip pyston_release
Travis Hance
@tjhance
Jun 18 2015 04:41
what’s the current state of the art in comparing benchmarks
is it still measure_perf.py?
Chris Toshok
@toshok
Jun 18 2015 05:13
there’s analysis/investigate.py now
you give it two git revisions
Travis Hance
@tjhance
Jun 18 2015 05:14
oh yeah i was just trying it
it seems to work well
so when I do ‘pd’ it spits out a list of functions
what do the numbers mean? are they # of calls, or amount of time spent in them?
what are they measured in?
Chris Toshok
@toshok
Jun 18 2015 05:15
oh that i don’t know
Travis Hance
@tjhance
Jun 18 2015 05:16
darn I don’t know why my changes could result in these #s --
like bisect_right_e3_7 4174 4608? why should my change affect that...
chris why is performance analysis such madness :<
Chris Toshok
@toshok
Jun 18 2015 05:19
:/ i had GCVisitor::visit taking 10% in perf report, and then it dropped back down to 2%
no clue why
Chris Toshok
@toshok
Jun 18 2015 18:49
another thing to chalk up to performance weirdness? I’m currently seeing “took 16.1ms for last iteration” after pyston_release has built up the object cache for django_template.py
the 16.1 was ~6.1 before. strange thing is that the run time is pretty much the same. 5.9s
Marius Wachtler
@undingen
Jun 18 2015 19:07
oh!
I had exactly the same issue yesterday. even nearly the same numbers
I was switching between the interpreter and the jit (it's a define), so I had to recompile in between, and the perf of the jit was always around 16ms! then I moved an if a little bit up (which should have had nearly no performance impact) and I was down to 5-6ms again and never saw the large numbers again...
But do you have nearly the same overall execution time? (was nearly the same for me)
Chris Toshok
@toshok
Jun 18 2015 19:13
yup, exactly. within 0.01s of each other overall execution, but 16ms vs 6ms
Marius Wachtler
@undingen
Jun 18 2015 19:16
could it be that the gc gets executed sometimes on the last run and we measure it too?
Chris Toshok
@toshok
Jun 18 2015 19:17
yeah, that was my thought as well. but i didn’t change the size of objects (and shouldn’t have changed the number/types allocated either :/)
time for printf debugging :)
Chris Toshok
@toshok
Jun 18 2015 19:23
yup, that’s exactly what it is (which makes so little sense :/)
slow form:
iteration 494
iteration 495
iteration 496
iteration 497
iteration 498
iteration 499
runCollection
took 16.1ms for last iteration
quick form:
Marius Wachtler
@undingen
Jun 18 2015 19:24
I'm anticipating the day when I build the executable and it's suddenly 3 times faster than it ever was because by random chance it fits exactly in the cache etc... :-D. If that ever happens I'll keep the executable :-P
Chris Toshok
@toshok
Jun 18 2015 19:24
iteration 494
iteration 495
iteration 496
iteration 497
runCollection
iteration 498
iteration 499
took  6.4ms for last iteration
heh
yeah I’ve played around with section_ordering.txt a few different times, without much success :/
Chris Toshok
@toshok
Jun 18 2015 19:29
i talked with jason evans about hhvm’s section ordering and he mentioned that they just used perf output from a “representative run”
doing that for django_template.py then recompiling pyston_release and re-running results in ~3x the overall execution time :/
Chris Toshok
@toshok
Jun 18 2015 19:37
so clearly that’s not the optimal solution :) now that cachegrind/callgrind works (with runtime ics) maybe we can get some more useful stats than simply execution time
Chris Toshok
@toshok
Jun 18 2015 19:45
even more interesting. somehow this change results in fewer gc bytes being allocated
before: gc_registered_bytes: 882586479, after: gc_registered_bytes: 880118455
now i just need to try and understand why. it’s not like it’s a huge change
Kevin Modzelewski
@kmod
Jun 18 2015 20:13
@tjhance, was that on fasta.py?
I think that benchmark is pretty much only about the performance of the bisect() function
so yeah we get random changes in it :/
here's what the perf history looks like:
Marius Wachtler
@undingen
Jun 18 2015 20:36
I got the new jit tier faster today. current perf for pyxl_bench.py:
17secs force old interp
11secs force old interp + new jit and ICs
5secs old interp + llvm tiers (~unmodified pyston) (empty object cache: 6.4secs)
It's still slower than the lowest llvm tier but it's getting closer
Marius Wachtler
@undingen
Jun 18 2015 20:44
mmh what's quite interesting: gc_collections: 75 (interp+llvm) vs. 126 (interp+new jit)
Kevin Modzelewski
@kmod
Jun 18 2015 20:53
I think the llvm tier will pre-box any int/float/string constants
and then embed the pointers directly into the IR
but I think the ast_interpreter doesn't do that?
not sure if that's enough to potentially explain the difference
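(A minimal sketch of the difference @kmod describes, assuming Pyston's Box/boxInt names; everything else here is illustrative, not the actual codegen:)
```
struct Box;                // Pyston's boxed-object base (assumed)
Box* boxInt(long n);       // Pyston runtime helper (declaration only, for the sketch)

// LLVM tier: box the constant once at compile time and embed the pointer
// directly in the generated IR, so executing the code allocates nothing.
static Box* const_42 = boxInt(42);   // one allocation, at codegen time
// ...emitted code simply references const_42...

// Interpreter: the AST node is re-evaluated on every pass, so the constant
// gets re-boxed: one fresh GC allocation per evaluation.
Box* evalIntConstant(long n) {
    return boxInt(n);                // allocation every time
}
```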
Chris Toshok
@toshok
Jun 18 2015 20:59
Interesting. The increased allocations come from making the unwind session a box. Wonder if undoing the astinterpreter box subclass will result in a similar drop.
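(To connect this to the numbers above: a hypothetical sketch of what “making the unwind session a box” implies; the class names are illustrative, not Pyston’s actual declarations:)
```
// A Box subclass lives on the GC heap, so every unwind session becomes a
// tracked allocation and shows up in gc_registered_bytes:
class UnwindSession : public Box {
    // per-unwind state...
};

// By contrast, a plain (non-Box) session could live on the stack or in a
// pool and would never register with the GC at all.
struct PlainUnwindSession {
    // same state, no GC bookkeeping
};
```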
Marius Wachtler
@undingen
Jun 18 2015 21:00
yeah just compared the alloc stats (kudos @toshok): allocsize.astinterpreter: 3 973 120 (interp+llvm) vs. 402 263 296 (interp+new jit)
that's quite significant, given that the total bytes allocated for the whole run with the llvm tier are only 684 758 829
Rudi Chen
@rudi-c
Jun 18 2015 21:07
Do you guys have preferences when it comes to modifying from_cpython stuff, like whether to modify the file in from_cpython or add code in src/ that overrides something? Here for example, I can either expose the PyTypeObject Match_Type (which is not in any header atm) and override the destructor in src/, or I could modify the tp_dealloc slot in from_cpython/_sre.c directly.
And I already have some similar code that overrides destructors in src/, if that matters.
But I'm also just interested in the general question.
Chris Toshok
@toshok
Jun 18 2015 21:30
I’m of two minds about that sort of thing. we definitely have some /* Pyston change */ comments in from_cpython/, but we have similar fixups for tp_dealloc’s already in src
i’d almost rather the tp_dealloc related changes be in one place
Rudi Chen
@rudi-c
Jun 18 2015 22:08
finalizer:            baseline:
20 (19.0s)           23 (18.8s)
21 (17.7s)           24 (19.1s)
After removing non-important finalizers from the most common object types in django-template, I get about the same performance in debug mode.
In release mode, I lose about 1 second.
With finalizers:
```
gc_us_collections: 1658939
real    0m12.368s
```
Without finalizer changes:
```
gc_collections_us: 902621
real    0m11.809s
```
Kevin Modzelewski
@kmod
Jun 18 2015 22:16
+1 on putting the tp_dealloc change in src/
Rudi Chen
@rudi-c
Jun 18 2015 22:18
Okay. I put extern PyTypeObject Pattern_Type in Python.h though so Pyston could see it, not sure where else it could go.
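(Roughly what the src/-side fixup under discussion might look like; Pattern_Type comes from the conversation above, while the function names here are hypothetical:)
```
#include <Python.h>

extern "C" PyTypeObject Pattern_Type; // defined in from_cpython's _sre.c

static void pattern_dealloc(PyObject* self) {
    // Pyston-specific destructor behavior would go here.
}

void setupSreDeallocs() {
    // Patch the slot from src/, keeping all tp_dealloc fixups in one place
    // as suggested above.
    Pattern_Type.tp_dealloc = pattern_dealloc;
}
```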
Chris Toshok
@toshok
Jun 18 2015 22:22
that’s awesome re: the timings. that’s on your local vm not jitdev, I’m guessing?
Rudi Chen
@rudi-c
Jun 18 2015 22:22
No, jitdev
I do remember jitdev being faster though...?
Chris Toshok
@toshok
Jun 18 2015 22:22
oh, 18.8s is pyston_dbg
Rudi Chen
@rudi-c
Jun 18 2015 22:23
Yes
But was jitdev always ~12s? Or 8s? I can't remember.
Chris Toshok
@toshok
Jun 18 2015 22:23
it drops to ~8 seconds on successive runs
first run can be ~10-11
Rudi Chen
@rudi-c
Jun 18 2015 22:23
Ah ok
Rudi Chen
@rudi-c
Jun 18 2015 22:50
Hm, so if I change TraceStack (virtual) to template <class T> TraceStack, then GCVisitor, which stores a TraceStack, also needs to be a template, and every function that uses a GCVisitor must be a template too, including some extern "C" functions. Do we want that?
(Where T would be a functor)
Kevin Modzelewski
@kmod
Jun 18 2015 23:07
would it make sense to have a single TraceStack implementation that could handle the different cases?
Rudi Chen
@rudi-c
Jun 18 2015 23:08
With an if statement?
Kevin Modzelewski
@kmod
Jun 18 2015 23:08
that might be easier, and the branching would probably be similar perf to the indirect function calls
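(A minimal sketch of the single-implementation idea: one TraceStack whose behavior is selected by an ordinary branch on a mode flag rather than a virtual call; the enum and its members are hypothetical:)
```
class GCAllocation; // Pyston's GC header type (assumed)

enum class TraceStackType { MarkPhase, FinalizationOrdering };

class TraceStack {
    TraceStackType type;

public:
    explicit TraceStack(TraceStackType t) : type(t) {}

    void visit_action(GCAllocation* al) {
        // One predictable branch instead of an indirect call:
        if (type == TraceStackType::MarkPhase) {
            // ...mark-phase handling of al...
        } else {
            // ...finalization-ordering handling of al...
        }
    }
};
```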
Marius Wachtler
@undingen
Jun 18 2015 23:08
I'm already quite tired so I'm not sure if I'm thinking straight, but shouldn't something like this do the trick:
template <typename Derived>
class TraceStack
{
    // CRTP: the base casts itself down to the derived class instead of
    // making a virtual call.
    Derived* upCast() { return static_cast<Derived*>(this); }
 public:
    bool visit_action(GCAllocation* al) { return upCast()->visit_action_impl(al); }
};

class FirstPhaseStack : public TraceStack<FirstPhaseStack>
{
 public:
    bool visit_action_impl(GCAllocation* al) {
        // real code?
        return true;
    }
};
Rudi Chen
@rudi-c
Jun 18 2015 23:10
If branching has the same perf as function calls, I'd rather have separate implementations; it's a bit cleaner IMO.
@undingen That will still require specifying the template type whenever TraceStack is used, no?
Marius Wachtler
@undingen
Jun 18 2015 23:14
I don't think so
you should be able to call visit_action on a TraceStack*
Rudi Chen
@rudi-c
Jun 18 2015 23:15
That's...an interesting approach. I've never seen anything like that.
Marius Wachtler
@undingen
Jun 18 2015 23:15
llvm uses something like this sometimes
for example the DenseMap base class doesn't know how many entries it has. It has to call the derived class (and would normally have to use a virtual method):
unsigned getNumEntries() const {
    return static_cast<const DerivedT *>(this)->getNumEntries();
}
Marius Wachtler
@undingen
Jun 18 2015 23:28
But now that I think about it... yeah this doesn't really help... the TraceStack* access must specify the template...
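(The catch in a nutshell: CRTP leaves no common non-template base, so every consumer of a trace stack must itself be a template, and extern "C" entry points can't be; hypothetical signatures:)
```
// Fine, but pushRoot itself has to be a template:
template <typename Derived>
void pushRoot(TraceStack<Derived>* stack, GCAllocation* al) {
    stack->visit_action(al);
}

// extern "C" functions can't be templates, so they'd have to commit to a
// single concrete stack type:
extern "C" void visitRootsFirstPhase(FirstPhaseStack* stack);
```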