These are chat archives for dropbox/pyston

26th Jun 2015
Travis Hance
@tjhance
Jun 26 2015 01:52
anybody know what kind of error this output is indicative of? https://travis-ci.org/dropbox/pyston/jobs/68413213
"Could not find executable /home/travis/pyston-build/gc_unittest"
Chris Toshok
@toshok
Jun 26 2015 01:52
build failure
Travis Hance
@tjhance
Jun 26 2015 01:52
… well, why not? Did it fail to compile?
Chris Toshok
@toshok
Jun 26 2015 01:52
rudi ran into this today
yeah
Travis Hance
@tjhance
Jun 26 2015 01:52
hm
well it compiles for me
so..
how can I investigate this?
Chris Toshok
@toshok
Jun 26 2015 01:53
if you expand the “compiling” fold it’ll show the error
/home/travis/build/dropbox/pyston/src/asm_writing/assembler.h:109:42: error: expected ';' after return statement
inline std::string dump() { return "" }
Travis Hance
@tjhance
Jun 26 2015 01:53
oh woah there are folds
oh derp
Chris Toshok
@toshok
Jun 26 2015 01:53
yeah crazy. even with the folds our build log is huge :)
Travis Hance
@tjhance
Jun 26 2015 01:54
thanks!
i was just like… where are the compile error messages??
totally did not see the folds
Chris Toshok
@toshok
Jun 26 2015 02:00
sigh, heap corruption :/
(gdb) print type_cls->ob_size
$2 = 0x1270014400
sure looks like a small arena pointer to me…
Chris Toshok
@toshok
Jun 26 2015 02:55
nice, removing the memset from the run allocator speeds up things by ~10%
i hate that, though. i want the gc to return zeroed memory :(
so I’ve rebased off master and have no build changes, yet every time I switch between master and my branch it does a full rebuild
Chris Toshok
@toshok
Jun 26 2015 03:26
does perf have a uniform sampling rate?
Kevin Modzelewski
@kmod
Jun 26 2015 05:01
ok just merged #627, which should address the hide-the-compile-errors issue
Chris Toshok
@toshok
Jun 26 2015 05:01
nice
Chris Toshok
@toshok
Jun 26 2015 05:30
wow:
0. Program arguments: /mnt/toshok/pyston/build/Debug/llvm/./bin/llvm-link -o /mnt/toshok/pyston/build/Debug/src/runtime/inline/stdlib.unopt.bc boxing.cpp.pub.bc dict.cpp.pub.bc link_forcer.cpp.pub.bc list.cpp.pub.bc tuple.cpp.pub.bc xrange.cpp.pub.bc Bus error (core dumped)
Travis Hance
@tjhance
Jun 26 2015 05:37
linker error?
my favorite :D
Chris Toshok
@toshok
Jun 26 2015 05:37
crashed it
crashed some other llvm stuff too on the next build (during publicize)
Travis Hance
@tjhance
Jun 26 2015 05:38
might be of some interest: https://github.com/alehander42/hermetic <- python with some sort of hindley-milner type system
Chris Toshok
@toshok
Jun 26 2015 05:50
threading the linked list header inside the runs has uncovered some cases where we were assuming that gc_alloc returned zero filled memory
but it’s looking like it makes us 20% faster than cpython on the loop microbenchmarks
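The "linked list header threaded inside the runs" idea can be sketched as a toy model (illustrative only — this is not Pyston's actual SmallArena code, and the names here are made up): each free cell's first bytes store the offset of the next free cell, so a freshly allocated cell still contains stale free-list bytes rather than zeroes — exactly the zero-fill assumption this change uncovered.

```python
# Toy model of a "run" of fixed-size cells with the free list threaded
# through the cells themselves (not Pyston's real allocator).
CELL_SIZE = 16
SENTINEL = 2**64 - 1  # marks the end of the free list

class Run:
    def __init__(self, ncells):
        self.mem = bytearray(ncells * CELL_SIZE)
        # Thread the free list through the cells: each free cell's first
        # 8 bytes store the offset of the next free cell.
        for i in range(ncells - 1):
            off = i * CELL_SIZE
            self.mem[off:off + 8] = ((i + 1) * CELL_SIZE).to_bytes(8, "little")
        last = (ncells - 1) * CELL_SIZE
        self.mem[last:last + 8] = SENTINEL.to_bytes(8, "little")
        self.free_head = 0

    def alloc(self):
        off = self.free_head
        if off == SENTINEL:
            return None  # run is full
        self.free_head = int.from_bytes(self.mem[off:off + 8], "little")
        # The returned cell still contains the old "next" pointer bytes --
        # it is NOT zero-filled.
        return off

r = Run(4)
first = r.alloc()
cell = r.mem[first:first + CELL_SIZE]
print(any(b != 0 for b in cell))  # the fresh cell is not all zeroes
```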
Kevin Modzelewski
@kmod
Jun 26 2015 05:55
whoa, from 2x slower to 20% faster?
Chris Toshok
@toshok
Jun 26 2015 05:58
were we 2x slower on master?
Kevin Modzelewski
@kmod
Jun 26 2015 05:58
oh I thought that's where we were
what's the differential from the new changes?
Chris Toshok
@toshok
Jun 26 2015 06:01
i need to redo master, but i think it was roughly comparable to cpython. i made the benchmark a lot longer, and it’s 0m27.383s for cpython, 0m21.297s for pyston_release
building master now. many of the tests are crashing (including django_template) at the moment, so I don’t know if it translates into something we’ll see outside of microbenchmarks
Chris Toshok
@toshok
Jun 26 2015 06:07
@tjhance i wonder if the algorithm in facebook’s flow would work for python
Travis Hance
@tjhance
Jun 26 2015 06:09
i don’t actually know what they do
Chris Toshok
@toshok
Jun 26 2015 06:09
it’s some phd thesis.. can’t remember the title and it was on my (stolen) phone
Rudi Chen
@rudi-c
Jun 26 2015 06:10
So I tried commenting out as much of the finalizer code as I could and managed to get down from 5% slower to 1.5% slower. I don't know how to account for that last 1.5% - I commented out almost everything (except for a few places where I commented out the inside of the function instead of all the call sites). It's hard to debug the rest because the % was pretty inconsistent across runs, except that it eventually went down. There's a lot of 0.5%s that seem to come from small bits of work, like checking whether there are finalizers to call on every allowGLReadPreemption, and inlining some functions.
The switch statement in TraceStack.push was also like another 0.5% or so.
Travis Hance
@tjhance
Jun 26 2015 06:12
i actually used flow once, with no type annotations, and successfully caught a type error (although I had to cat all my javascript files together to do it, into a 20,000-or-so-line file)
@kmod master is 0m23.956s, so the branch is ~10% faster than master, ~20% faster than cpython
Travis Hance
@tjhance
Jun 26 2015 06:19
wow my pdf viewer can’t even handle pdfs this large -_-
Chris Toshok
@toshok
Jun 26 2015 06:19
pypy 2.2.1 is 0m4.757s :/
i expect marius’s loop stuff will help that microbenchmark quite a bit :)
Kevin Modzelewski
@kmod
Jun 26 2015 08:19
hmm which benchmark are you looking at?
I tried 10x'ing the size of loop.py and got 10.7s for cpython and 22.6s for pyston :(
Marius Wachtler
@undingen
Jun 26 2015 14:42
mmh I encountered a very strange issue: when I run perf stat ./pyston_release -I microbenchmarks/prime_summing.py with STAT_TIMERS 0 it takes 2 secs, and with STAT_TIMERS 1 it takes 1 sec to run...
this calls into question all the measurements I did in the last few days...
Marius Wachtler
@undingen
Jun 26 2015 14:48
from the perf report output I can't really see any reason why it takes 2x the time... but perf stat says about the same number of instructions got executed; with the stat timers there are 19.66% frontend cycles idle and 9.79% backend, and without them 52.83% and 44.02%...
I mean, ok... I will just always enable the stat timers if it improves performance 2x, no problem... :-D I just thought it would be the opposite and that I'd better keep them disabled..
Marius Wachtler
@undingen
Jun 26 2015 14:53
and I can reproduce this behavior
Marius Wachtler
@undingen
Jun 26 2015 15:01
Chris Toshok
@toshok
Jun 26 2015 15:23
@kmod hrm, did you mean test/tests/loops.py? i greatly increased loop counters and removed the breaks and am checking that now
i had been just using a single big loop
Chris Toshok
@toshok
Jun 26 2015 15:32
interesting. if the loop is inside a function things change pretty dramatically
Marius Wachtler
@undingen
Jun 26 2015 15:33
ok it looks like it does not really have anything to do with the stats... I also encountered the perf difference without changing the settings. It looks more like it depends on the address of the runtime ICs... at least perf tells me that the number of frontend and backend stalls is sometimes very high in the runtime IC address range, and sometimes the ICs do not really show up...
Chris Toshok
@toshok
Jun 26 2015 15:33

with this:

for x in xrange(4000):
    for y in xrange(20000):
        pass

cpython: 0m2.043s
pyston: 0m1.885s

with this:

def f():
    for x in xrange(4000):
        for y in xrange(20000):
            pass
f()

cpython: 0m0.878s
pyston: 0m1.644s

Marius Wachtler
@undingen
Jun 26 2015 15:35
the interpreter is very slow with local symbols
Chris Toshok
@toshok
Jun 26 2015 15:35
globals vs locals, i’m guessing?
ah, got it
Marius Wachtler
@undingen
Jun 26 2015 15:35
because it always has to look up the symbol in the hashmap, kind of like it's a global one..
-n should be much faster?
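The locals-vs-globals difference Marius is describing can be seen in CPython's own bytecode: at module scope the loop variable goes through a dynamic name lookup, while inside a function it becomes an indexed fast local. (This is a CPython illustration of why the function version is faster there, not Pyston's interpreter.)

```python
# The loop variable x is a dynamic name at module scope (STORE_NAME)
# but an indexed fast local inside a function (STORE_FAST).
import dis

mod_code = compile("for x in range(10):\n    pass\n", "<mod>", "exec")

def f():
    for x in range(10):
        pass

mod_ops = {ins.opname for ins in dis.get_instructions(mod_code)}
func_ops = {ins.opname for ins in dis.get_instructions(f)}
print("STORE_NAME" in mod_ops, "STORE_FAST" in func_ops)  # True True
```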
Chris Toshok
@toshok
Jun 26 2015 15:36
much slower if I don’t add -S, but when I do it returns to ~1.6s
Marius Wachtler
@undingen
Jun 26 2015 15:38
so the performance stays the same?
Chris Toshok
@toshok
Jun 26 2015 15:38
yeah
might help to make the loop run longer
Marius Wachtler
@undingen
Jun 26 2015 15:40
can reproduce with the new jit tier
inside a func it's even slightly slower...
Chris Toshok
@toshok
Jun 26 2015 15:41
perf shows both are spending the majority of their time in osr frame + xrangenext:
loop_module:                                                                                                                                                        
+  41.66%          3439  pyston_release  perf-69676.map       [.] <module>_e3_osr8_from_<module>_e1_osr8_0_1
+  29.05%          2398  pyston_release  pyston               [.] pyston::BoxedXrangeIterator::xrangeIteratorNext(pyston::Box*)
+  14.55%          1201  pyston_release  pyston               [.] pyston::gc::SmallArena::freeUnmarked(std::vector<pyston::Box*, std::allocator<pyston::Box*> >&)

loop_function:                                                                                                                                                      
+  33.88%          2468  pyston_release  pyston               [.] pyston::BoxedXrangeIterator::xrangeIteratorNext(pyston::Box*)
+  32.90%          2397  pyston_release  perf-69702.map       [.] f_e3_osr8_from_f_e1_osr8_0_1
+  16.55%          1206  pyston_release  pyston               [.] pyston::gc::SmallArena::freeUnmarked(std::vector<pyston::Box*, std::allocator<pyston::Box*> >&)
(and in the sweep phase)
actually the increase in %time in the sweep phase is interesting
updated with event counts
those percentages don’t make much sense
loop_function takes more time, so should have more samples overall, but freeUnmarked's % went up by 2% for 5 additional events?
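The puzzle here can be checked with a little arithmetic: dividing each line's raw event count by its percentage gives the implied total sample count for each profile. (Numbers copied from the perf output above; this is just an editorial sanity check, not part of the original discussion.) The implied totals suggest the loop_function profile actually recorded fewer samples overall, which is why freeUnmarked's share rose even though its event count barely changed.

```python
# Back out the implied total sample count from each perf line's
# (event count, percentage) pair.
module_lines = [(3439, 41.66), (2398, 29.05), (1201, 14.55)]
function_lines = [(2468, 33.88), (2397, 32.90), (1206, 16.55)]

def implied_totals(lines):
    return [count / (pct / 100.0) for count, pct in lines]

print([round(t) for t in implied_totals(module_lines)])    # each ~8255
print([round(t) for t in implied_totals(function_lines)])  # each ~7285
```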
Marius Wachtler
@undingen
Jun 26 2015 16:02
ok I'm now quite sure that the perf variation I was seeing was caused by the malloc we use for allocating the code for the runtime ICs
Chris Toshok
@toshok
Jun 26 2015 16:04
just due to malloc overhead? you could try jemalloc LD_PRELOAD, might help
and there’s code in there that just blindly mmaps things instead of using malloc (for valgrind) that might help too
Marius Wachtler
@undingen
Jun 26 2015 16:05
I replaced it with a different one, with the result that the runtime ICs do not show up in the perf stall numbers, while with the normal malloc they do show up sometimes, and when they show up they produce about half the stalls
I don't think it's the malloc itself. What shows up is that the code inside the malloced array produces a lot of stalls...
is there something like code alignment overhead? maybe it's just that with malloc the code is more sparse and sometimes does not fit into the instruction cache?
Chris Toshok
@toshok
Jun 26 2015 16:08
there’s the icache prefetcher that fetches 16 bytes at a time, aligned to 16 bytes, iirc
so if you jump to a non-aligned address you don’t prefetch as much
but malloc should be returning 16-byte aligned blocks
hrm, but there might be cache line implications?
"Far jumps are not predicted i.e. pipeline always stalls on a conditional far jump” from http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction
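The 16-byte alignment claim can be spot-checked from Python via ctypes. (A Linux/glibc-specific sketch assumed here: on 64-bit glibc the guaranteed malloc alignment is 2*sizeof(size_t) = 16.)

```python
# Check the alignment of a handful of malloc'd blocks through ctypes
# (Linux/glibc assumed; CDLL(None) loads the process's own symbols).
import ctypes

libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]

addrs = [libc.malloc(64) for _ in range(8)]
print(all(a % 16 == 0 for a in addrs))  # expected True on glibc/x86-64
for a in addrs:
    libc.free(a)
```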
Marius Wachtler
@undingen
Jun 26 2015 16:13
but AFAIK it's not branch prediction because the perf stat branch-misses output is about the same
ok yeah maybe they aren't counted
Chris Toshok
@toshok
Jun 26 2015 16:20
Callgrind/cachegrind work now with runtime ics, if you can stand the execution time hit
Might give more insight
Marius Wachtler
@undingen
Jun 26 2015 16:38
worth a try. But I will leave it for now because I got side-tracked - I wanted to finish the jit and then got pulled into the perf stuff because of the high perf variation. but it's more important to get the jit into a state for submitting and then look at the perf variation
Marius Wachtler
@undingen
Jun 26 2015 20:30
I'm trying to merge together the CallattrFlags and ArgPassSpec args when calling callattr in order to save one register (the values fit into one 64-bit value, but are currently passed in two separate ones)
I changed the compvars code to emit the right value and changed the signature. But it keeps crashing when using ICs. Do you guys know where the rewriter assumes that the CallattrFlags or ArgPassSpec are in a specific position?
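The register-saving idea can be sketched abstractly: pack two 32-bit values into one 64-bit word so they travel in a single register. (The field layout below is made up for illustration; Pyston's actual CallattrFlags/ArgPassSpec bit layout may differ.)

```python
# Hypothetical packing of two 32-bit fields into one 64-bit word:
# flags in the high half, argspec in the low half.
def pack(flags: int, argspec: int) -> int:
    assert 0 <= flags < (1 << 32) and 0 <= argspec < (1 << 32)
    return (flags << 32) | argspec

def unpack(word: int) -> tuple[int, int]:
    return word >> 32, word & 0xFFFFFFFF

# Round-trips losslessly:
assert unpack(pack(0x3, 0x00020001)) == (0x3, 0x00020001)
```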
Marius Wachtler
@undingen
Jun 26 2015 20:41
ok I think I found it
Marius Wachtler
@undingen
Jun 26 2015 20:59
~/pyston/build/Release$ ninja
Segmentation fault (core dumped)
???
Chris Toshok
@toshok
Jun 26 2015 21:00
disk full?
Marius Wachtler
@undingen
Jun 26 2015 21:01
good idea. but enough space left. ram too
Debug build works
Chris Toshok
@toshok
Jun 26 2015 21:02
and blowing away build.ninja means a long wait :/
Marius Wachtler
@undingen
Jun 26 2015 21:15
it fixed the issue :-)
Chris Toshok
@toshok
Jun 26 2015 21:20
new small arena allocator passes all tests
Marius Wachtler
@undingen
Jun 26 2015 21:20
:+1:
Chris Toshok
@toshok
Jun 26 2015 21:21
now to see how much improvement there is in the non-microbenchmarks :)
Chris Toshok
@toshok
Jun 26 2015 22:41
i think the assembly view in perf might be misattributing cost to lines
18.18 │188: add $0x48,%rsp
i don’t expect that 18.18% of events happened on that add
Marius Wachtler
@undingen
Jun 26 2015 23:35
I think it's always one line off
Chris Toshok
@toshok
Jun 26 2015 23:35
yeah, seems to be
the line before was mov (%rcx),%rax, which makes sense
Marius Wachtler
@undingen
Jun 26 2015 23:36
at least in my experience there's usually a call or a memory load one line above.
damn, memory is way too slow for the cpu :-D