    Jim Pivarski
    @jpivarski

    Having the OS manage memory is a blessing or a curse, depending on the situation. Since it's not appropriate in your case, use file_handler=uproot.MultithreadedFileSource when opening the file, since that doesn't use mmap at all. Instead of mmap, it opens num_workers file handles, each associated with a thread (num_workers=1 is allowed and, I believe, the default). If you really want to control the behavior of the source, you can pass a file-like object to uproot.open, which Uproot will have to treat as a single-threaded interface because of its changing state.

    (One of the things I liked best about mmap is that it's a stateless file interface, open to parallel processing. That's why the alternative is multithreaded, to bind the stateful file handles (file.tell() is mutable state) to threads.)
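
    A minimal sketch of the first option, with placeholder file/TTree names (and assuming num_workers is passed through uproot.open as an option, as I recall):

    import uproot

    # one stateful file handle per worker thread, no mmap involved
    with uproot.open(
        "example.root",
        file_handler=uproot.MultithreadedFileSource,
        num_workers=1,
    ) as file:
        tree = file["Events"]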

    Angus Hollands
    @agoose77:matrix.org
    [m]
    @jpivarski thanks for the follow-up. I've created a reproducer to better explain my point; I'm comfortable with the general memory allocation/freeing behaviour in Python. I'm analysing a series of TTrees in "chunks", in order to parallelise over a cluster with a shared file store. I am increasingly seeing repeated task failures due to the Dask process nanny restarting workers for over-using memory. The issue is that the OS has many GB free, but there is a per-worker cap imposed by the process nanny to prevent over-consumption. On HPC clusters, this would be better managed by something like SLURM. Here's a reproducer:
    https://gist.github.com/agoose77/9ec7c54e0658891875896c7db4957439
    Jim Pivarski
    @jpivarski
    Just passing a file-like object to uproot.open, without any special options, invokes uproot.ObjectSource, rather than uproot.MemmapSource.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    Hmm, is it worth my implementing some kind of MemmapSource subclass and passing that as the file_handler? I'm not sure what kind of care needs to be taken about not unmapping shared regions; I imagine that might become complicated.
    Jim Pivarski
    @jpivarski
    If you don't mind being single-threaded, you could wrap whatever you want inside a file-like object (any Python object with read, seek, etc. methods) and pass that without having to write any new Uproot code.
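    For example (the file, TTree, and branch names here are just placeholders):

    import uproot

    # any Python object with read/seek methods is accepted; Uproot treats it
    # as a single-threaded source because the handle carries mutable state
    f = open("example.root", "rb")
    with uproot.open(f) as file:
        pt = file["Events"]["pt"].array()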
    Angus Hollands
    @agoose77:matrix.org
    [m]
    Yes, I suppose my query was more about whether multithreading yields much perf in standard reading
    Jim Pivarski
    @jpivarski

    The multithreading at this level is for hiding slow I/O. If the I/O is a disk, having many threads ask for different parts of the file (TBaskets to build an array) would only help if the hardware supports concurrent reading. If not, it might thrash trying to supply all the requests at once. If the I/O is remote, there's a better chance of it being able to satisfy concurrent requests since that's typical of an internet service, but it's still possible that there's only one copy of the file on the remote server and all requests for parts of that file have to wait for each other. So in the end, whether multithreading helps depends on what the hardware supports.

    Multithreading at other levels, like decompress_executor and interpretation_executor, is about CPU use after the data have been read; generally only decompress_executor helps, and only if the compression algorithm is slow, such as LZMA.
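
    For example (file/tree/branch names are placeholders; the keyword argument, if I'm spelling it right, is decompression_executor in the arrays options):

    import concurrent.futures
    import uproot

    # a thread pool for decompression; mainly worthwhile for slow codecs like LZMA
    decompress = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    with uproot.open("example.root") as file:
        arrays = file["Events"].arrays(
            filter_name="pt*",
            decompression_executor=decompress,
        )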

    Angus Hollands
    @agoose77:matrix.org
    [m]
    right, gotcha.
    I've been pretty impressed with the IO speeds anecdotally; I'm reading the same file from multiple workers, and the performance still scales mostly linearly in the number of workers. That's another way of saying that event-by-event processing is very slow, though, ha. I suspect that there must be some kind of local caching, because it's a reasonably large compute facility.
    Jim Pivarski
    @jpivarski

    Even a single-laptop OS has caching, which significantly alters read performance. The same mechanism that supplies pages in mmap makes disk-reads effectively RAM-reads on a recently read file. To do performance tests properly, you'd have to check to see if you have "warmed cache" (the file is actually in RAM) or "cold cache" (it's not). I use "vmtouch" to do this explicitly.

    If it's some distributed system, then it gets even more complicated. What looks like disk I/O might actually be network traffic, and some networks are faster than disks.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    That's why you defer to someone on Gitter who has already investigated this stuff 😉 I think I'm not overly concerned about IO performance right now; although I am aware of it in the early stages of my analysis (I'm reading ~GB files and ~5+ branches from each), by the end the files are tiny and all my time is spent on LSQ fitting.
    Eric
    @ericballabene_gitlab
    Hello, it's Eric! I have a naive question :) Is it possible to write trees in uproot4 similarly to uproot3? I was converting a pandas dataframe very nicely with uproot3 (using newtree and recreate methods) but I couldn't find a way in uproot4. Thanks!
    Henry Schreiner
    @henryiii
    That’s in progress, @jpivarski is working on it this month. :)
    You can have both uproot and uproot3 installed at the same time for now, though, to keep doing that.
    Eric
    @ericballabene_gitlab
    Thanks for replying, sure, I'm importing both uproot and uproot3 (where uproot is 4.0.6)
    Jim Pivarski
    @jpivarski
    @ericballabene_gitlab You can watch that at https://github.com/scikit-hep/uproot4/discussions/321
    Eric
    @ericballabene_gitlab
    Thanks!
    Matt LeBlanc
    @mattleblanc

    Hi all,

    I'm curious about what solutions people have found for workflows with lots of systematic variations. Usually, I store the systematic variations on my jets in a TTree as TBranches with the format std::vector<std::vector<float>>. This is much more manageable than making a new TTree for each variation (smaller total size on disk), but I've noticed that uproot seems to take a long time to read branches of this format, and I've struggled to develop a vectorised workflow from this starting point. Instead, I simply loop over each event and each systematic variation, but this approach is turning out to be far too slow to scale to a realistic number of systematics: for my jets, I have ~100 systematic variations that require the event selection and observable calculation to be redone each time. Even though I have things running in parallel a bit, the individual jobs still take many hours to run, and adding the object systematics makes the run time completely unreasonable.

    So I guess to summarise:

    • I find that looping over each event with uproot.iterate() is painfully slow once I add in a loop over systematic variations, which I need to do right now because I use vector<vector<float>> branches to store the varied fourvectors for my objects.
    • Is there some better input TTree format I could use to speed my workflow up? Would it be better to have 100*4 std::vector<float> branches, instead of the one nested vector one? I guess I could filter which branches are getting read at any given time that way, instead of always reading in the whole thing. I'm not actually convinced this would be an improvement, because I think the size of the input files on disk will explode if I actually try this.
    • Maybe I'm just going totally off-track, so if there is a better approach to a workflow for analysing lots of events with many systematic variations, I'm all ears. :)

    Thanks!

    alexander-held
    @alexander-held
    Some thoughts about workflows are in https://github.com/CoffeaTeam/coffea/discussions/469. The discussion is about coffea, but many of the ideas are more general.
    Jim Pivarski
    @jpivarski

    On std::vector<std::vector<float>>, see https://github.com/scikit-hep/uproot4/discussions/327

    It's a (perhaps) surprising feature of TTrees that doubly nested std::vector is stored in a completely different way from single std::vector, and the NumPy tricks that can be used on a single std::vector can't be used on the doubly nested one. (The new RNTuple will use the same format for all depths of nestedness.)

    The discussion linked above references ongoing development to address non-NumPyable data structures. AwkwardForth is a mini-language with a fast virtual machine intended specifically for this purpose. It shows the necessary speedup, but it needs to be integrated into Uproot, which would probably happen by or during this summer. (Uproot needs to generate AwkwardForth code for each non-NumPyable data type, rather than Python code.)

    In the meantime, if your number of variations is fixed, you can use a fixed-size array as the inner dimension, rather than a std::vector, which is NumPyable: std::vector<three_floats>, where three_floats is an array of length 3 (central value, one sigma up, one sigma down?). This should show up as a jagged array of a fixed-size array, which should have interpretation AsJagged(AsDtype(...)) in which the dtype has a shape of (3,).
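
    On the reading side, that would look roughly like this (file and branch names are hypothetical):

    import uproot

    with uproot.open("example.root") as file:
        branch = file["Events"]["jet_pt_syst"]   # written as std::vector<float[3]>
        print(branch.interpretation)             # expect AsJagged(AsDtype(...)) with an inner shape of (3,)
        array = branch.array()                   # n_events * var * 3 * float32, no per-entry Python loop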

    But also Coffea deals with systematics all the time and they would have good suggestions, too.

    Matt LeBlanc
    @mattleblanc
    Hi Jim, thanks for the reply! I don't understand the format you suggest in the last message -- I want to keep something which is more like pt = (nominal, var1, var2, ..., var100) for each of N objects in an event, so I'm not sure how to map that efficiently to three_floats objects. The nested vectors I use are for the systematics (outer vector), then the objects with that variation (inner vector), but the number of objects for a given variation can technically change if something enters the selection due to the systematic shift.
    I have no experience with coffea, but I'm taking a look at what Alex linked
    Matthew Feickert
    @matthewfeickert

    @mattleblanc if you could mock out the structure of your TTree a bit (like how it might look when viewed with the tree command if it were a directory structure) that might be helpful (or give an example like they do in https://github.com/scikit-hep/uproot4/discussions/327).

    Though as mentioned, maybe the coffea team might have some general thoughts on this as well (could try emailing Lindsey Gray and Nick Smith).

    Jim Pivarski
    @jpivarski

    @mattleblanc If there's always 101 (the central value + 100 variations), then it could be a

    float hundred_and_one_floats[101]

    object inside the std::vector (which would be AsJagged(AsDtype(...)) with a dtype shape of (101,)).

    But if the number of systematic variations really is variable, then std::vector<std::vector<float>> really is the right structure.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    Is there an elegant way to re-zip the result of TTree.arrays() to give an array-of-records rather than a record-of-arrays? In my case, I'm getting something like n_events * {"field_1": var * float64, "field_2": var * float64} instead of n_events * n_charges * {'field_1': float64, ...}. Right now, I'm going to unzip and zip together as we discussed on #awkward-array some time ago 😃.
    Jim Pivarski
    @jpivarski
    Unzipping and zipping is the recommended way to do it. There isn't a performance cost associated with that (i.e., it's O(1), independent of the length of the array). If there could be a more convenient syntax, let's hear it; it could cover only a subset of the cases, but there might be some cases that are more important than others.

    Would it just be a shorthand for

    ak.zip(dict(zip(ak.fields(array), ak.unzip(array))), depth_limit=configurable)

    ?
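
    For concreteness, that one-liner on a toy array (the field names are made up):

    import awkward as ak

    # record of jagged arrays: 2 * {"field_1": var * float64, "field_2": var * float64}
    array = ak.Array({"field_1": [[1.0, 2.0], [3.0]], "field_2": [[4.0, 5.0], [6.0]]})

    # zip the fields back together; with no depth_limit, the records end up at the innermost level:
    # 2 * var * {"field_1": float64, "field_2": float64}
    rezipped = ak.zip(dict(zip(ak.fields(array), ak.unzip(array))))
    print(rezipped.type)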

    Angus Hollands
    @agoose77:matrix.org
    [m]
    Entirely that, yes 🙂 Just wondered if there were a sneaky shortcut in uproot that handled it (I assume it might be fairly common to have data with this kind of shared structure).
    I don't think that there "could" be a more convenient syntax to be honest, except perhaps that unzip would return a mapping for RecordArrays. However, this would be a breaking change
    Jim Pivarski
    @jpivarski
    The above is pretty verbose; it could be an ak.rezip function (or a better name). That can be a Discussion; that (and feature requests) is how I'd like to get feedback on proposals for new functions. If an ak.rezip with the above definition is desirable, it would be an easy one to write.
    You know what wouldn't be a breaking change? ak.unzip(array, as_dict=True) (or similar).
    That wouldn't be a new function, but a new argument to an existing function.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    I agree; I was more thinking about sane defaults vs adding new flags to things (particularly w.r.t Python's mantra of new fns instead of flags).
    I'd be happy to open a discussion on this; clearly it would be nice to have something like it.
    off-topic, I really like the .mask API that awkward has - it makes a lot of analysis easier. In numpy, MaskedArray feels a little clunkier somehow.
    When I get some more time, I plan to write up some of my findings / stumbling points into the Awkward docs.
    Huh, those Discussions are rather useful. I keep forgetting that GitHub has that feature.
    Jim Pivarski
    @jpivarski
    This chat will flow by (useful for temporary stuff), but the Issues and Discussions are discoverable by topic with their individualized flows.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    <something something all life is transient> ;)
    Matt LeBlanc
    @mattleblanc
    Hi @jpivarski, coming back to my question above -- I think I mis-wrote what I am actually doing. It's my outer vector which is always of length 114, then the inner one has a length of nJets ... so a vector<float[114]> approach probably won't work without quite a bit of restructuring upstream, which wouldn't be easy to do as I'm not the only client of these inputs
    Jim Pivarski
    @jpivarski
    @mattleblanc If the structure can't change, then the only improvement will come when Uproot starts using AwkwardForth, roughly this summer. As a last-ditch suggestion, if it's always the outer structure that has equal length, it could be 114 distinct branches. Then each branch would be singly jagged and fast to read (like the triggers in NanoAOD). But I'm sure that would require a lot of restructuring, too.
    Cédric Hernalsteens
    @chernals
    Hi @jpivarski. IPAC is fast approaching. What's your preferred way to cite uproot properly?
    Jim Pivarski
    @jpivarski

    We use this DOI for the software (as opposed to any papers or talks about it): https://doi.org/10.5281/zenodo.4340632

    There's a badge/link on the Uproot GitHub page (same for Awkward Array and quite a few other Scikit-HEP projects).

    Matt LeBlanc
    @mattleblanc
    Hi! I just wanted to follow up on my questions above about using nested vector branches -- I have since reformatted the tree to just have a set of single-vector branches for each systematic variation, and this is significantly better on the uproot side: the I/O is no longer even close to a limiting factor, and the memory use is way down (which means I can run more threads in parallel).
    The trees have something like a thousand branches and are a bit silly, but the grid jobs seem to run, so I guess I shouldn't complain too much!
    Thanks again for the tips
    agoose77
    @agoose77:matrix.org
    [m]
    @mattleblanc: that's great news!
    Jim Pivarski
    @jpivarski
    @mattleblanc That is good to hear! NanoAOD has about a thousand branches, too.