These are chat archives for dropbox/pyston

4th
Apr 2015
Chris Toshok
@toshok
Apr 04 2015 02:49
ugh, well that might be why _Unwind_Find_FDE is slow
/* Linear search through the classified objects, to find the one
   containing the pc.  Note that pc_begin is sorted descending, and
   we expect objects to be non-overlapping.  */
they use a linked list for storing objects
when I was working on the dwarf stuff for mono we realized we had to package up multiple methods into the same elf object in memory
i’m guessing llvm emits an ELF object per jitted method?
of course we had to worry about flushing the current ELF object in case we needed it for something (just for the debugger, all unwinding used a different format)
Chris Toshok
@toshok
Apr 04 2015 02:55
so, not to say that _Unwind_Find_FDE is slow, just that it’s the hottest thing in the django perf output
@kmod so I’m not sure what was going on earlier on jitdev (with django /admin taking 1400ms), but it’s taking 14ms now
if we could figure out how to make _Unwind_Find_FDE disappear from the trace, we’d be down in the 8ms range
Travis Hance
@tjhance
Apr 04 2015 04:03
what does _Unwind_Find_FDE do?
I mean I realize it probably finds the FDE, but what is that
but it’s from over a year ago
Chris Toshok
@toshok
Apr 04 2015 05:00
it finds the FDE corresponding to a given address (pc)
essentially the same thing we do with the CompiledFunction registry (and that patch is essentially identical. array + binary search)
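A minimal sketch of the "array + binary search" idea, as opposed to walking a linked list. This is not the libgcc or Pyston code; `RangeTable`, the addresses, and the function names are hypothetical and only illustrate looking up the entry whose address range contains a pc.

```python
import bisect


class RangeTable:
    """Sorted array of non-overlapping [start, end) address ranges."""

    def __init__(self, entries):
        # entries: iterable of (start, end, payload) tuples (hypothetical layout)
        self._entries = sorted(entries, key=lambda e: e[0])
        self._starts = [e[0] for e in self._entries]

    def find(self, pc):
        # Binary search for the last range whose start is <= pc.
        i = bisect.bisect_right(self._starts, pc) - 1
        if i < 0:
            return None
        start, end, payload = self._entries[i]
        # The candidate only matches if pc falls before its end.
        return payload if pc < end else None


table = RangeTable([
    (0x3000, 0x3040, "func_c"),
    (0x1000, 0x1080, "func_a"),
    (0x2000, 0x2010, "func_b"),
])
print(table.find(0x2008))
```

Each lookup is O(log n) in the number of registered ranges, where the linked-list walk is O(n).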
Travis Hance
@tjhance
Apr 04 2015 05:00
what’s an FDE
Chris Toshok
@toshok
Apr 04 2015 05:01
Frame Description Entry
Travis Hance
@tjhance
Apr 04 2015 05:01
ohh
Chris Toshok
@toshok
Apr 04 2015 05:01
basically a dwarf … piece of code for unwinding the stack
Travis Hance
@tjhance
Apr 04 2015 05:01
ok
Chris Toshok
@toshok
Apr 04 2015 05:02
actually more than unwinding. it’s useful for finding where things are on the stack in general
Travis Hance
@tjhance
Apr 04 2015 05:02
is this where the stackmaps are?
Chris Toshok
@toshok
Apr 04 2015 05:04
it’s definitely possible to scan stacks using the information encoded in the fdes, but I don’t know if llvm does it. i expect not, since it’s a pretty verbose format
there’s an FDE per function, generally.. and you replay the opcodes from the start until you reach the current pc
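A toy sketch of "replay the opcodes from the start until you reach the current pc". The two opcode names are modeled on the DWARF call-frame instructions `DW_CFA_advance_loc` and `DW_CFA_def_cfa_offset`, but the tuple encoding, the `replay` helper, and the example program are assumptions made for illustration, not the real binary format.

```python
def replay(instructions, func_start, pc):
    """Replay toy unwind instructions up to pc; return the CFA offset there."""
    loc = func_start
    cfa_offset = 8  # assumed initial state: only the return address is on the stack
    for op, arg in instructions:
        if op == "advance_loc":
            # Instructions after this one describe code at loc + arg and later.
            if loc + arg > pc:
                break
            loc += arg
        elif op == "def_cfa_offset":
            cfa_offset = arg
        else:
            raise ValueError("unknown toy opcode: %r" % (op,))
    return cfa_offset


# Hypothetical function: a 1-byte push, then a 4-byte stack adjustment.
program = [
    ("advance_loc", 1), ("def_cfa_offset", 16),
    ("advance_loc", 4), ("def_cfa_offset", 48),
]
print(replay(program, 0x1000, 0x1003))
```

The cost of one lookup grows with how far into the function the pc is, which is part of why the format is compact on disk but not cheap to query.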
Travis Hance
@tjhance
Apr 04 2015 05:04
is this what the ordinary c++ exception unwinding does?
Chris Toshok
@toshok
Apr 04 2015 05:04
yeah
Travis Hance
@tjhance
Apr 04 2015 05:05
wait, so that means c++ unwinding is really slow if there are a lot of entries?
Chris Toshok
@toshok
Apr 04 2015 05:06
looks like it. at least with the version of libgcc in ubuntu 14.04
in their defense, it’s not usual for there to be a ton of entries
Travis Hance
@tjhance
Apr 04 2015 05:06
it’s not?
shouldn’t there be one per function?
Chris Toshok
@toshok
Apr 04 2015 05:06
oh there’s an FDE per function, but the search is over objects
i think.. maybe it’s per function
Travis Hance
@tjhance
Apr 04 2015 05:08
“objects”?
Chris Toshok
@toshok
Apr 04 2015 05:08
yeah, ELF objects, I’m assuming
line 267 there, if (dl_iterate_phdr (_Unwind_IteratePhdrCallback, &data) < 0)
dl_iterate_phdr iterates over every loaded ELF object
.so’s, executable, etc
Travis Hance
@tjhance
Apr 04 2015 05:10
ohh
right, and there usually aren’t many of those
Chris Toshok
@toshok
Apr 04 2015 05:11
yeah, on the order of hundreds at most
Chris Toshok
@toshok
Apr 04 2015 06:16
that patch makes it sound like it’s going to be fixed in > 4.8, but https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c still shows the linear walk
that comment preceding that line matches what jitdev has on it (my paste earlier)
Kevin Modzelewski
@kmod
Apr 04 2015 06:48
oh hmm, I haven't looked at the gcc implementation, but yeah in libunwind I think it can do a binary search per eh_frame section
but we emit one eh_frame per function
my bet is that we could periodically collect all the FDEs into a single eh_frame and then reregister that
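One way to picture the "periodically collect everything into one table and reregister it" idea is the sketch below. It only models the bookkeeping; it does not build a real eh_frame section or call any unwinder registration API, and `BatchedRegistry` and `batch_size` are hypothetical names.

```python
import bisect


class BatchedRegistry:
    """New entries land in a small pending list; once it fills up,
    they are merged into one sorted table that supports binary search."""

    def __init__(self, batch_size=64):
        self._entries = []   # consolidated, sorted (start, end, name) tuples
        self._starts = []    # start addresses of the consolidated entries
        self._pending = []   # recently registered, not yet merged
        self._batch_size = batch_size

    def register(self, start, end, name):
        self._pending.append((start, end, name))
        if len(self._pending) >= self._batch_size:
            self.consolidate()

    def consolidate(self):
        # Stand-in for "rebuild a single table and reregister it".
        self._entries = sorted(self._entries + self._pending)
        self._starts = [e[0] for e in self._entries]
        self._pending = []

    def find(self, pc):
        # Fast path: binary search over the consolidated table.
        i = bisect.bisect_right(self._starts, pc) - 1
        if i >= 0:
            start, end, name = self._entries[i]
            if pc < end:
                return name
        # Slow path: linear scan over the few entries not merged yet.
        for start, end, name in self._pending:
            if start <= pc < end:
                return name
        return None


registry = BatchedRegistry(batch_size=2)
registry.register(0x2000, 0x2040, "jit_b")
print(registry.find(0x2010))
```

The batch size trades registration cost against how long the linear slow path can get.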
Marius Wachtler
@undingen
Apr 04 2015 20:30
what do you guys think about saving the object cache files compressed to disk
they compress really well, and we would also fix the problem of file corruption because the compression library can emit a checksum
maybe lz4? looks to be extremely fast but still compresses larger cache files to about 10-20% of the orig size.
We would lose the ability to run objdump on them directly, but there are command line tools for lz4 available, so one could manually uncompress them
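A rough sketch of a compressed cache entry with a checksum. It uses `zlib` from the Python standard library as a stand-in for lz4, purely so the example is self-contained; the header layout and the `pack_cache_entry` / `unpack_cache_entry` names are assumptions, not the Pyston object cache format.

```python
import struct
import zlib


def pack_cache_entry(obj_bytes):
    """Prefix the compressed payload with the original size and a CRC32."""
    crc = zlib.crc32(obj_bytes) & 0xFFFFFFFF
    header = struct.pack("<II", len(obj_bytes), crc)
    return header + zlib.compress(obj_bytes)


def unpack_cache_entry(blob):
    """Return the original bytes, or None if the entry looks corrupted."""
    if len(blob) < 8:
        return None
    size, crc = struct.unpack("<II", blob[:8])
    try:
        data = zlib.decompress(blob[8:])
    except zlib.error:
        return None
    if len(data) != size or (zlib.crc32(data) & 0xFFFFFFFF) != crc:
        return None
    return data


sample = b"\x00" * 4096 + b"pyston object file bytes"
blob = pack_cache_entry(sample)
print(len(sample), len(blob))
```

A caller that gets `None` back can treat the entry as a cache miss and recompile instead of loading a damaged object file.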
Marius Wachtler
@undingen
Apr 04 2015 21:00
$ ./pyston_release -S -n test/tests/distutils_test.py
LLVM ERROR: ran out of registers during register allocation
Someone called exit with code=1!
[PATCH] Patchpoint - support symbolic targets.
when this gets merged I can remove the part of the object cache code which generates the -1 func dst patchpoints, and we don't have to change them to the real dst after they get emitted.
Kevin Modzelewski
@kmod
Apr 04 2015 22:41
oh cool :)
Marius Wachtler
@undingen
Apr 04 2015 22:45
as for the running out of regs thing: this is easy to reproduce but sadly I currently don't have a solution for it
Kevin Modzelewski
@kmod
Apr 04 2015 22:46
is that with a different register allocator?
Marius Wachtler
@undingen
Apr 04 2015 22:46
%7 = call i64 (i64, i32, i8*, i32, ...)* @llvm.experimental.patchpoint.i64(i64 0, i32 2143, i8* inttoptr (i64 -1 to i8*), i32 3, %"class.pyston::Box"* @c17, %"class.pyston::Box"* %6, i32 77, i8* %scratch, %"struct.pyston::FrameInfo"* %frame_info, i64 ptrtoint (i8* @c25 to i64), %"class.pyston::Box"* @c2, %"class.pyston::Box"* @c3, %"class.pyston::Box"* @c12, %"class.pyston::Box"* @c13, %"class.pyston::Box"* @c14, %"class.pyston::Box"* @c15, %"class.pyston::Box"* @c16, %"class.pyston::Box"* @c17, %"class.pyston::Box"* @c18, %"class.pyston::Box"* @c19, %"class.pyston::Box"* @c20, %"class.pyston::Box"* @c4, %"class.pyston::Box"* @c21, %"class.pyston::Box"* @c22, %"class.pyston::Box"* @c23, %"class.pyston::Box"* @c24, %"class.pyston::Box"* @c5, %"class.pyston::Box"* @c6, %"class.pyston::Box"* @c7, %"class.pyston::Box"* @c8, %"class.pyston::Box"* @c9, %"class.pyston::Box"* @c10, %"class.pyston::Box"* @c11)
No, the problem is that when I create a patchpoint with a lot of GlobalVariables as operands it will run out of regs
It's easy to reproduce: I created a function with a lot of variables, assigned each of them an empty string, and put a call to a function at the end. The generated patchpoint with frame introspection adds the current value of every variable, and this breaks the regalloc
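The reproduction described above could be generated with something like the following sketch. The helper name, the variable names, and the default of 100 variables are assumptions; the number of variables actually needed to exhaust the register allocator is not stated here.

```python
def make_repro_source(num_vars=100):
    """Build a test file: many locals set to '', then a call at the end."""
    lines = ["def helper():", "    pass", "", "def f():"]
    # One local per line, each assigned an empty string.
    lines += ["    v%d = ''" % i for i in range(num_vars)]
    # The trailing call is where the patchpoint with frame
    # introspection would be generated.
    lines += ["    helper()", "", "f()"]
    return "\n".join(lines) + "\n"


source = make_repro_source(5)
print(source)
```

Writing the result to a file and running it the same way as the distutils test above (`./pyston_release -S -n <file>`) would be the obvious way to try it.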
Marius Wachtler
@undingen
Apr 04 2015 22:53
We currently do not run into that because it looks like there is no limit when using constant ints as arguments
Travis Hance
@tjhance
Apr 04 2015 23:07
Why are you making global variables as arguments to the patch points?
Marius Wachtler
@undingen
Apr 04 2015 23:17
I don't know why we generate the IR like that, but it looks like it's for the frame introspection: for every sym in the symtable it adds the current value. And with the object cache I generate a GlobalVariable every time we embed a pointer.
So my understanding of the problem is that if there is a large number of variables where the last value is a pointer to some Box* we created while generating the IR (e.g. a python string) we will now generate a large number of GlobalVariables for...
Marius Wachtler
@undingen
Apr 04 2015 23:39
The more I think about it, the more I realize that this is a bad problem I overlooked while implementing the cache. Before, the constant pointers were just an entry inside a constant table in the stackmap. But now they really have to be materialized... (aka I can see in the disassembly a lot of movabs in front of the patchpoint which load the constant values from the symbols I now generate)