These are chat archives for dropbox/pyston

2nd Jul 2015
Marius Wachtler
@undingen
Jul 02 2015 00:01
Why did we choose x86_64? The instruction encoding makes my head hurt :-D
Chris Toshok
@toshok
Jul 02 2015 00:01
it’s not too late to switch to mips
Marius Wachtler
@undingen
Jul 02 2015 00:01
maybe we should create the One True Jit Assembler for mips?
Rudi Chen
@rudi-c
Jul 02 2015 00:02
mips is awesome, we used it in baby compiler class :D
Chris Toshok
@toshok
Jul 02 2015 00:04
weird, I seem to remember django_template taking 88 gcs:
gc_collections: 58
gc_collections_us: 1233876540
Rudi Chen
@rudi-c
Jul 02 2015 00:04
Yeah I remember that too
Chris Toshok
@toshok
Jul 02 2015 00:05
oh, maybe due to more constants being attached to modules?
Marius Wachtler
@undingen
Jul 02 2015 00:09
Could be - the ASTInterpreter now reuses the consts too
Chris Toshok
@toshok
Jul 02 2015 00:12
is there a way to make perf diff sort by the delta?
ah, -o 1 does it, but it makes the output substantially different looking
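for reference, a minimal invocation sketch (perf diff compares perf.data.old against perf.data by default):
   # -o/--order selects the sort column; column 1 is the delta
   perf diff -o 1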
Kevin Modzelewski
@kmod
Jul 02 2015 01:54
man I wish "perf record" had some way of giving the return code
it returns 0 even if the process you tested failed
this is why investigate.py will happily let you run broken builds
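one workaround sketch (hypothetical, not what investigate.py actually does): wrap the profiled command in a shell that records the child's exit status on the side, since perf record exits 0 either way:
   perf record -- sh -c './pyston_release -I microbenchmarks/prime_summing.py; echo $? > /tmp/rc'
   [ "$(cat /tmp/rc)" -eq 0 ] || echo "profiled command failed"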
Kevin Modzelewski
@kmod
Jul 02 2015 02:30
timings on this random microbenchmark I just wrote:
  • system python: 1.13s
  • pyston_release: 1.22s
  • pyston_release_gcc: 1.31s
  • custom-built cpython (no special patches or flags): 1.47s
time to get ubuntu's source package for cpython and see how they build it :/
CC="aarch64-linux-gnu-gcc" CFLAGS="-D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security " LDFLAGS="-Wl,-Bsymbolic-functions -Wl,-z,relro"
../configure \
--prefix=/usr --enable-ipv6 --enable-unicode=ucs4 --with-dbmliborder=bdb:gdbm --with-system-expat --with-system-ffi --with-fpectl
Kevin Modzelewski
@kmod
Jul 02 2015 19:22
oh nice, thanks for looking into it :)
I think they also have a bunch of patches
I wonder how it all adds up to 30%
Marius Wachtler
@undingen
Jul 02 2015 20:09

yeah, the alignment(?) issue is back:

 Performance counter stats for './pyston_release -I microbenchmarks/prime_summing.py':

      43747,631232      task-clock (msec)         #    1,000 CPUs utilized          
                15      context-switches          #    0,000 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             2.592      page-faults               #    0,059 K/sec                  
   139.825.290.639      cycles                    #    3,196 GHz                     [83,33%]
    64.023.229.870      stalled-cycles-frontend   #   45,79% frontend cycles idle    [83,33%]
    47.645.622.416      stalled-cycles-backend    #   34,08% backend  cycles idle    [66,67%]
   175.439.120.671      instructions              #    1,25  insns per cycle        
                                                  #    0,36  stalled cycles per insn [83,34%]
    35.744.524.806      branches                  #  817,062 M/sec                   [83,34%]
        69.981.720      branch-misses             #    0,20% of all branches         [83,33%]

      43,748006249 seconds time elapsed

other run:

Performance counter stats for './pyston_release -I microbenchmarks/prime_summing.py':

      25567,650191      task-clock (msec)         #    1,000 CPUs utilized          
                52      context-switches          #    0,002 K/sec                  
                13      cpu-migrations            #    0,001 K/sec                  
             2.596      page-faults               #    0,102 K/sec                  
    81.712.218.364      cycles                    #    3,196 GHz                     [83,32%]
    16.343.874.935      stalled-cycles-frontend   #   20,00% frontend cycles idle    [83,34%]
     6.606.322.066      stalled-cycles-backend    #    8,08% backend  cycles idle    [66,68%]
   175.363.451.411      instructions              #    2,15  insns per cycle        
                                                  #    0,09  stalled cycles per insn [83,34%]
    35.740.024.868      branches                  # 1397,861 M/sec                   [83,34%]
        72.282.285      branch-misses             #    0,20% of all branches         [83,32%]

      25,568464633 seconds time elapsed
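
one way to quantify that run-to-run swing (a sketch using perf stat's -r/--repeat flag, which reports the mean and stddev over N runs):

   perf stat -r 5 ./pyston_release -I microbenchmarks/prime_summing.py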
Marius Wachtler
@undingen
Jul 02 2015 20:49
@tjhance your 'emit direct calls' change is also really nice for debugging, because gdb now shows the symbol name of the destination address :-)
   0x00000000022d149d:    movabs $0x40,%rcx
   0x00000000022d14a7:    callq  0x6d7740 <pyston::JitFragmentWriter::compareICHelper(pyston::CompareIC*, pyston::Box*, pyston::Box*, int)>
   0x00000000022d14ac:    mov    (%rsp),%rcx
   0x00000000022d14b0:    movq   $0x22cf470,0x18(%rcx)
   0x00000000022d14b8:    mov    %rax,%rdi
   0x00000000022d14bb:    callq  0x6d8c10 <pyston::JitFragmentWriter::nonzeroHelper(pyston::Box*)>
   0x00000000022d14c0:    movabs $0x12700140c8,%rcx
Travis Hance
@tjhance
Jul 02 2015 21:17
ooh, I hadn’t even thought about that
Marius Wachtler
@undingen
Jul 02 2015 22:21
I hate it when an optimization turns out to slow down more stuff than it speeds up...
Travis Hance
@tjhance
Jul 02 2015 22:22
ugh why can’t computers work the way WE want them to, eh?
Chris Toshok
@toshok
Jul 02 2015 22:36
@undingen i don’t suppose there’s any way to tell where those stalled-cycles originate?
Marius Wachtler
@undingen
Jul 02 2015 22:37
I haven't investigated it in more detail with perf record yet... but I will after I finish the optimizations I'm planning for the new JIT tier
Chris Toshok
@toshok
Jul 02 2015 22:39
i’m guessing those are on your personal machine?
because on jitdev: :(
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend
Marius Wachtler
@undingen
Jul 02 2015 22:43
yeah exactly
but normally perf record -e stalled-cycles-frontend is quite interesting
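e.g. a minimal sketch:
   # sample on the frontend-stall event, then browse which symbols the stalls land in
   perf record -e stalled-cycles-frontend ./pyston_release -I microbenchmarks/prime_summing.py
   perf report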
Chris Toshok
@toshok
Jul 02 2015 22:51
I bet :)
Marius Wachtler
@undingen
Jul 02 2015 23:18
In case someone has the same idea:
I noticed that the new JIT tier emits jumps which immediately jump to the next address (I think most of them are critical-edge-breaking nodes). So I tried to skip these trivial blocks (but only if they are not a backedge for the OSR) --> result is that it actually very slightly slowed down performance
looks like the CPU's jmp instructions are so cheap that you can't make up for the small overhead of checking whether a block is a trivial basic block...