    Itamar Turner-Trauring
    @itamarst
    insofar as I can do passthrough back to the original APIs, it's OK
    it's a profiler, so I can just pass through to the original API
    but I was hoping jemalloc might let me have fewer platform-specific ifdefs and fewer "aligned_alloc is missing from all but recent macOS"-style issues (a portable fallback is sketched below)
    but not much you can do if macOS does this sort of thing
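A minimal sketch of the kind of fallback involved, assuming posix_memalign is available (it is, even on older macOS); aligned_alloc_compat is an illustrative name, not something from the chat:

```c
/*
 * Hedged sketch: aligned_alloc only reached the macOS SDK fairly
 * recently (around 10.15), so a portable wrapper often falls back to
 * posix_memalign, which older macOS does provide.
 */
#include <stdlib.h>

void *aligned_alloc_compat(size_t alignment, size_t size) {
    void *ptr = NULL;
    /* posix_memalign returns 0 on success, an errno value on failure;
     * alignment must be a power of two multiple of sizeof(void *). */
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return NULL;
    }
    return ptr;
}
```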
    David Goldblatt
    @davidtgoldblatt
    yeah
    sorry, lots of commiseration, but little help
    Itamar Turner-Trauring
    @itamarst
    no, that was very helpful
    re fragmentation, would you expect reduced fragmentation compared to glibc?
    David Goldblatt
    @davidtgoldblatt
    on linux you mean?
    Itamar Turner-Trauring
    @itamarst
    in scenarios where there are lots of small allocations, and insofar as there is multi-threaded allocation, all allocations are guarded by a global allocation lock
    yeah
    David Goldblatt
    @davidtgoldblatt
    In general I would
    Itamar Turner-Trauring
    @itamarst
    (typically Python has its own memory allocation subsystem so it can avoid calling malloc() on every small object, but I'm forcing everything through malloc() so I can track it)
    OK
    David Goldblatt
    @davidtgoldblatt
    Our experiences have been positive where we’ve had them, but it’s not something we optimize/benchmark against compared to C/C++ code
    Itamar Turner-Trauring
    @itamarst
    I guess also there's the Rust code
    which is also doing lots of allocations (one per malloc(), in fact)
    David Goldblatt
    @davidtgoldblatt
    I think probably with a little bit of coordination we could do substantially better on GIL languages
    Itamar Turner-Trauring
    @itamarst
    I guess for the Rust code I'm using, the other benefit is not having to go through the public malloc() and its thread-local "am I being reentrant or not" check
    so might save some CPU
    "LD_PRELOADing malloc() -> rust tracking code -> then calls malloc() again" is fun, I have different thread-local-storage implementations for macOS and Linux cause _Thread_local hits malloc() on macOS
    so I guess I will document why this won't work on macOS, measure if it improves things when just used as Rust allocator, and then separately see if it helps with fragmentation for allocating Python objects on Linux/glibc
    thanks, this has been very helpful
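A minimal sketch of the reentrancy guard discussed above, assuming Linux/glibc, where the exported __libc_malloc entry point can be called directly; track_allocation is a hypothetical tracking hook, not the actual profiler code. Per the chat, _Thread_local itself can end up calling malloc() on macOS, so this exact approach only works on Linux:

```c
#include <stddef.h>

extern void *__libc_malloc(size_t size);       /* glibc's underlying malloc */
void track_allocation(void *ptr, size_t size); /* hypothetical profiler hook */

static _Thread_local int in_tracker = 0;

void *malloc(size_t size) {
    if (in_tracker) {
        /* The tracking code itself allocated; don't recurse into ourselves. */
        return __libc_malloc(size);
    }
    in_tracker = 1;
    void *ptr = __libc_malloc(size);
    track_allocation(ptr, size); /* may call malloc() again */
    in_tracker = 0;
    return ptr;
}
```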
    David Goldblatt
    @davidtgoldblatt
    np, good luck
    Ben Olson
    @molson5_gitlab
    @itamarst In order to solve the recursive malloc problem, you need to use dlsym with RTLD_NEXT to get another malloc implementation to allocate things in your allocator (sketched below)
    But yeah, in the case of that one malloc not calling into your library, I think the solution is to simply modify the library that's doing that. I don't know if you have access to the source code, but that's really the only option you've got, and it's what I've had to do in quite a few cases.
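A sketch of the dlsym/RTLD_NEXT approach Ben describes: look up the next malloc in the link-map chain so the interposed wrapper has a real implementation to delegate to. Compile with -ldl; _GNU_SOURCE is needed for RTLD_NEXT on glibc:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*next_malloc)(size_t) = NULL;

void *malloc(size_t size) {
    if (next_malloc == NULL) {
        /* Note: dlsym can itself allocate, so real interposers often
         * serve this first call from a small static buffer instead. */
        next_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    }
    return next_malloc(size);
}
```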
    Ben Olson
    @molson5_gitlab
    @davidtgoldblatt Thanks for your help earlier. It took a few days, but I finally figured out what the cause of my performance issues was. As expected, it wasn't jemalloc; it was in my library, which wraps jemalloc with some heavyweight profiling stuff.
    David Goldblatt
    @davidtgoldblatt
    sweet :)
    Ben Olson
    @molson5_gitlab
    Basically, the issue was that I was having to call pthread_getspecific (and subsequently pthread_setspecific in some cases) every time there was an allocation. This didn't affect the vast majority of benchmarks, as evidenced by extensive benchmarking, but it did affect one (1) obnoxious Fortran application which allocates at a totally-scientific rate of 2 bajillion allocations per second.
    So PSA: never, ever, ever use pthread_getspecific or pthread_setspecific if you care about performance and need thread-local storage.
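An illustration of the contrast behind the PSA: every pthread_getspecific hit is a function call plus a key lookup, while a __thread variable compiles down to a handful of instructions. Names here are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

static pthread_key_t arena_key;       /* created once with pthread_key_create */
static __thread unsigned arena_index; /* direct TLS slot */

unsigned get_arena_slow(void) {
    /* Function call + key lookup on every allocation: the hot-path
     * cost that showed up in the allocation-heavy Fortran run. */
    return (unsigned)(uintptr_t)pthread_getspecific(arena_key);
}

unsigned get_arena_fast(void) {
    return arena_index; /* typically just a load from the TLS block */
}
```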
    David Goldblatt
    @davidtgoldblatt
    ah yeah
    One time we got a PR on Windows that basically worked out to be the equivalent change for them
    and it was something like an immediate 2x speedup
    Ben Olson
    @molson5_gitlab
    Yeahhh, I had no idea!
    I was seeing a 30x slowdown in this application, CAM4.
    Actually, over 30x. Default jemalloc finished in 7 minutes, and I killed my run after 2.5 hours.
    David Goldblatt
    @davidtgoldblatt
    huh, that’s wild
    I’m surprised it’s that much, unless the application is crazy heavily malloc-bound
    Ben Olson
    @molson5_gitlab
    But yeah, issue's fixed: our "custom" arena layout (which uses __thread as opposed to pthread_getspecific and family to assign an arena to each thread) gets exactly the same performance as default jemalloc now.
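A hedged sketch of what a per-thread arena layout can look like with jemalloc's public API, not Ben's actual code: create an arena on first use via the arenas.create mallctl, cache the index in __thread storage, and allocate with MALLOCX_ARENA. Error handling omitted:

```c
#include <jemalloc/jemalloc.h>
#include <limits.h>
#include <stddef.h>

static __thread unsigned my_arena = UINT_MAX;

void *arena_alloc(size_t size) {
    if (my_arena == UINT_MAX) {
        size_t sz = sizeof(my_arena);
        /* arenas.create hands back the index of a freshly created arena. */
        mallctl("arenas.create", &my_arena, &sz, NULL, 0);
    }
    /* Allocate from this thread's arena, bypassing the thread cache. */
    return mallocx(size, MALLOCX_ARENA(my_arena) | MALLOCX_TCACHE_NONE);
}
```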
    David Goldblatt
    @davidtgoldblatt
    awesome
    Ben Olson
    @molson5_gitlab
    Yeah, so, with some of these old HPC Fortran codes, they call malloc for the slightest little things: allocating just a few bytes at a time, then freeing them immediately. And the compiler doesn't optimize it away for some reason.
    David Goldblatt
    @davidtgoldblatt
    oh my
    Ben Olson
    @molson5_gitlab
    Yeah
    Like
    One function is called TRIM
    It trims some bytes off of strings
    For some reason, this allocates
    Now, imagine you put that in a loop, running constantly, for all 48 hyperthreads on my system.
    Anyway, crisis averted. I'm gonna have to go get myself some ice cream to cool down after that one.
    David Goldblatt
    @davidtgoldblatt
    Hahaha, enjoy