    Jim Pivarski
    @jpivarski
    iterate_baskets and filling an empty array is the nearest Uproot 3 equivalent to that. (You had to do it manually because Uproot 3 didn't have a concatenate function.)
    Andrzej Novak
    @andrzejnovak
    Cool, so the syntax is definitely clearer, but it's still slower than root_numpy https://gist.github.com/3909bc0c9e05da11bd65e8bbf342d164 On a side note: concatenate doesn't actually load the file contents 10 times if I pass a list with 10 times the same file name, maybe that's a bug?
    Jim Pivarski
    @jpivarski
    I don't know if that should be considered a bug or not. It probably comes from the fact that wildcard expansion of filenames can produce the same name multiple times (e.g. if there are multiple wildcards). An argument could be made that it should combine duplicates when expanding a wildcarded string but not when combining items in a user-supplied list.
    Andrzej Novak
    @andrzejnovak
    Yeah, that's what I was guessing. I imagine that in a real use case it's not likely that a file would be fed in twice, but it's a strange behaviour to do on a list.
    But either way, it is still slower than root_numpy.
    Jim Pivarski
    @jpivarski
    You can simulate it with a bunch of symlinks. Uproot won't know that the symlinks point to the same file. But that's artificial for performance tests, anyway, since it ensures that the file is in a warm cache, and physical reading is (in good cases) the biggest bottleneck.
    Andrzej Novak
    @andrzejnovak
    ah interesting, I didn't think of that
    Angus Hollands
    @agoose77
    @jpivarski is there a fast way to seek to a particular offset (in terms of entries) of a TTree using uproot?
    Jim Pivarski
    @jpivarski
    @agoose77 What is a TTree's offset?
    Jim Pivarski
    @jpivarski

    @agoose77 Oh! Do you mean the equivalent of TTree::GetEntry(entry_number)? Uproot doesn't "seek." It doesn't have a pointer to an entry number that gets updated as you iterate through it.

    A TTree consists of named TBranches, and each of those are split into small chunks called TBaskets. When ROOT does TTree::GetEntry, it checks to see if its pointer crosses a TBasket boundary, and if it does, it loads the next TBasket. Otherwise, it fetches the entry from its already-loaded TBaskets.

    When you ask for ttree.arrays or tbranch.array from Uproot, it reads the desired range of entries from the file. It doesn't have a pointer to update: each access is a new thing, not a continuation. Except that there is a default cache (you set it when opening a file with uproot4.open), so re-reading the same array doesn't always go back to the file. Also, there's an uproot4.iterate function to iterate through chunks of events, not individual events, which handles "recently read TBaskets" the right way.
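
    A minimal sketch of these two access patterns (the file name, TTree name, and branch names below are invented for illustration):

    import uproot4

    # One-shot read of a range of entries; each call is independent, but the
    # file's default cache means re-reading the same array may not hit disk.
    with uproot4.open("events.root") as file:
        tree = file["Events"]
        arrays = tree.arrays(["px", "py"], entry_start=0, entry_stop=100_000)

    # Chunked iteration over many events; uproot4.iterate deals with
    # recently read TBaskets so you don't have to.
    for chunk in uproot4.iterate("events.root:Events", ["px", "py"], step_size="100 MB"):
        ...  # process one chunk (many events) at a time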

    Angus Hollands
    @agoose77
    @jpivarski thanks - I am basically looking into using Dask in my analysis pipeline, for the parallelism + task-orchestration aspects. If I have multiple files, it's easy to introduce stupid parallelism, but for a few very large files, I'd want to be able to seek to different locations within those files for the different workers.
    Jim Pivarski
    @jpivarski
    @agoose77 What you want to do is use the entry_start and entry_stop parameters to the array/arrays methods to slice non-overlapping parts of the TBranch. If these entries don't line up with TBasket boundaries, then the tasks will redundantly load the same TBaskets. This would also be true of ROOT: the smallest unit of reading anything from a ROOT file is a TBasket, because this is a separately compressed buffer (you can't read an item from a compressed buffer without decompressing the whole buffer). If you want to try to optimize that, common_entry_offsets is a method that gives TBasket boundaries as entry numbers for a given set of TBranches.
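
    A rough sketch of that slicing pattern (the file, tree, and branch names are hypothetical, and the exact common_entry_offsets call should be checked against the Uproot docs):

    import uproot4

    with uproot4.open("big_file.root") as file:
        tree = file["Events"]

        # Fixed-size, non-overlapping slices; if these don't line up with
        # TBasket boundaries, neighbouring tasks re-read the same TBaskets.
        step = 500_000
        slices = [(start, min(start + step, tree.num_entries))
                  for start in range(0, tree.num_entries, step)]

        # Optionally align the slices to TBasket boundaries instead:
        # offsets = tree.common_entry_offsets()
        # slices = list(zip(offsets[:-1], offsets[1:]))

        # Each (entry_start, entry_stop) pair would go to a different worker.
        for entry_start, entry_stop in slices:
            arrays = tree.arrays(["px", "py"],
                                 entry_start=entry_start, entry_stop=entry_stop)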
    Angus Hollands
    @agoose77
    Thanks Jim, I appreciate the input on this!
    Eduardo Rodrigues
    @eduardo-rodrigues
    Dear all, at the July PyHEP 2020 workshop we got a lot of feedback saying that there would be interest in having topical WG meetings every now and then. We plan to take that feedback on board ... Please get in touch if you would like to make a suggestion for a topic/presentation ... You can get in touch with us conveners at hsf-pyhep-organisation@googlegroups.com ... Advance thanks.
    You can also write down here in Gitter some ideas, of course.
    Matthew Feickert
    @matthewfeickert

    Very nice talk from @eduardo-rodrigues at the HSF WLCG Virtual Workshop https://youtu.be/hf0fT0_VKfw :)

    Also I was very happy to hear the questions about citations at the end!

    Eduardo Rodrigues
    @eduardo-rodrigues
    Thanks very much @matthewfeickert ! (Interesting to hear myself for once - need to improve :D.)
    Yes, I got a lot of great questions. Some points raised are very difficult to answer quantitatively, such as those on citations. But it's worth the discussion, and it's good that people are thinking more about citation of software work, recognition and so on.
    Jim Pivarski
    @jpivarski

    (Sorry that I'm reposting this everywhere; I want everyone to be warned.)

    The Awkward/Uproot name transition is done, at least at the level of release candidates. If you do

    pip install "awkward>=1.0.0rc1" "uproot>=4.0.0rc1"

    you'll get Awkward 1.x and Uproot 4.x. (They don't strictly depend on each other, so you could do one, the other, or both.)

    If you do

    pip install "awkward1>=1.0.0rc1" "uproot4>=4.0.0rc1"

    you'll get thin awkward1 and uproot4 packages that just bring in the appropriate awkward and uproot and pass names through. This is so that uproot4.whatever still works.

    If you do

    pip install awkward0 uproot3    # or just uproot3

    you'll get the old Awkward 0.x and Uproot 3.x that you can import ... as .... This also brings in uproot3-methods, which is a new name just to avoid compatibility issues with old packages that we saw last week.

    All of the above are permanent; they will continue to work after Awkward 1.x and Uproot 4.x are full releases (not release candidates). However, the following will bring in old packages before the full release and new packages after the full release.

    pip install awkward uproot

    So it is only the full release that will break scripts, and only when users pip install --upgrade. I plan to take that step this weekend, when there might be fewer people actively working. It also gives everyone a chance to provide feedback or take action with import ... as ....

    Jim Pivarski
    @jpivarski

    (Sorry for the reposting, if you saw this message elsewhere.)

    Probably the last message about the Awkward Array/Uproot name transition: it's done. The new versions have moved from release candidates to full releases. Now when you

    pip install awkward uproot

    without qualification, you get the new ones. I think I've "dotted all the 'i's of packaging" to get the right dependencies and tested all the cases I could think of on a blank AWS instance.

    • pip install awkward0 uproot3 returns the old versions (Awkward 0.x and Uproot 3.x). The prescription for anyone who needs the old packages is import awkward0 as awkward and import uproot3 as uproot.
    • pip install awkward1 uproot4 returns thin wrappers of the new ones, which point to whatever the latest awkward and uproot are. They pass through to the new libraries, so scripts written with import awkward1, uproot4 don't need to be changed (though you'll probably want to, for simplicity).
    • uproot-methods no longer causes trouble because there's an uproot3-methods in the dependency chain: awkward0 ← uproot3-methods ← uproot3. The latest uproot-methods (no qualification) now excludes Awkward 1.x so that they can't be used together by mistake.
    Henry Schreiner
    @henryiii
    For anyone who is interested, there’s an HSF Training Hackathon next week: https://indico.cern.ch/event/975487/
    Bhanu Gupta
    @Bhanu-mbvg
    Hello folks, I would like to contribute to any of the open-source projects under this group.
    I am currently doing my undergrad and have intermediate experience with Python, but I plan to learn more from this community. It would be really great if someone could point me in the right direction to get started.
    Jonas Eschle
    @mayou36
    Hi Bhanu, welcome! I would assume that you may want to contribute to a Scikit-HEP project? I would suggest having a look at the possible packages around and seeing if there is anything that you are interested in or already have some special knowledge of. The range goes from data loading and dumping, plotting, statistical libraries such as likelihood fits, histograms, vector calculus, units, decay descriptions and many more... just let us (or the main author directly) know what you're interested in.
    Bhanu Gupta
    @Bhanu-mbvg
    Hey Jonas thank you for the help
    Eduardo Rodrigues
    @eduardo-rodrigues
    Hi @Bhanu-mbvg, good to see more people engaging. @mayou36 already mentioned Scikit-HEP. On that note, I've actually had a task open for a few months in DecayLanguage which would make a good first issue - scikit-hep/decaylanguage#105. Do you want to give that a try? Get in touch privately or directly on this issue if interested ... Thanks.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Me again - just to point out that you can directly interact with Scikit-HEP authors, maintainers and users at https://gitter.im/Scikit-HEP/community.
    Bhanu Gupta
    @Bhanu-mbvg
    Thanks @eduardo-rodrigues and yes sure I would love to start with the beginner-friendly issue you mentioned above.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Cool. Let's continue there ... Thanks.
    Andrzej Novak
    @andrzejnovak
    I dunno if it's possible, but it would be pretty nifty if there was a place to aggregate all the "good first issue"-labelled issues from Scikit-HEP
    Eduardo Rodrigues
    @eduardo-rodrigues
    Good question/idea. I doubt that's possible unless one can somehow aggregate labels from multiple packages.
    Bhanu Gupta
    @Bhanu-mbvg
    GitHub has a REST API which can be used to get all issues of an organization with one particular label, which I think could be used to collect all the required data: GitHub Issues API
    Henry Schreiner
    @henryiii
    I’ve been using the API quite a bit (now through pyGithub), it works quite well. You could set up a daily action to collect and list these (probably easiest in Jekyll/Ruby)
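
    A possible sketch of that aggregation with PyGithub (the token and the search qualifiers here are assumptions, not a tested recipe):

    from github import Github

    gh = Github("YOUR_GITHUB_TOKEN")  # hypothetical personal access token
    query = 'org:scikit-hep is:issue is:open label:"good first issue"'

    for issue in gh.search_issues(query):
        print(f"{issue.repository.full_name}#{issue.number}: {issue.title}")
        print(f"    {issue.html_url}")

    A scheduled GitHub Action could run a script like this daily and publish the output as a page.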
    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP WG 2021 topical meetings
    Dear all,

    It is a pleasure to announce that the HSF PyHEP WG will be running topical meetings in 2021, following popular interest from the community. These meetings will take place by default on the first Wednesday of the month. We will start on February 3rd with a tutorial on Numba by Jim Pivarski.

    See https://indico.cern.ch/category/11412/ for the list of meetings pre-scheduled on Indico.

    Best wishes,
    Ben, Eduardo, Jim

    Divyansh Rastogi
    @watch24hrs-iiitd
    Hi! I am Divyansh, currently a college sophomore. I have been digging around the DEEPLENSE project. It would be really helpful if someone could guide me to their conversation channel / GitHub link! Thanks in advance!
    Jonas Eschle
    @mayou36
    Hi Divyansh, unfortunately you may have picked the wrong chat. This is about software in High Energy Physics and doesn't have anything closely to do with DeepLense, so I am afraid that we can't help you find it. Or did I misunderstand your question?
    Divyansh Rastogi
    @watch24hrs-iiitd
    Actually, it was a project in GSoC 2020 that intrigued me, and that's why I wanted to get in touch with the community and their codebase. Link: https://hepsoftwarefoundation.org/gsoc/2020/proposal_DEEPLENSE.html
    Jonas Eschle
    @mayou36
    I see, I didn't remember that. Maybe someone here knows something about it (I didn't find anything); otherwise, if no one answers here, maybe consider contacting the author of the paper if you are interested in their research and would like to participate.
    Henry Schreiner
    @henryiii
    I’m checking with Sergei to see what has happened with it so far.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Happy New Year everyone!

    Quick piece of news - SciPy 2021, the 20th annual Scientific Computing with Python conference, will be held July 5-11, 2021 in Austin, Texas
    Important Dates

    • February 9, 2021: Submission deadline
    • March 23, 2021: Tutorial presenters notified of acceptance
    • April 2, 2021: Conference speakers and poster presenters notified of acceptance
    • May 22, 2021: First draft of Proceedings due
    • July 5-6, 2021: SciPy 2021 Tutorials
    • July 7-9, 2021: SciPy 2021 Conference
    • July 10-11, 2021: SciPy 2021 Sprints

    alexander-held
    @alexander-held
    Hi, is anyone aware of Python implementations of smoothing algorithms specifically designed for smoothing histograms with a small number (~3-30) of bins? Such algorithms are used for smoothing templates in statistical models. ROOT's TH1::Smooth is an example. scipy.signal has some algorithms that seem more suited to >>100 bins, and I have not found much else so far.
    Kevin Wang
    @LeavesWang
    Hi, what about Lowess smoothing in statsmodels.nonparametric.smoothers_lowess.lowess?
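
    For reference, a minimal sketch of what that might look like on a small histogram (the bin contents below are invented):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    counts = np.array([12., 15., 9., 20., 18., 25., 22., 30., 28., 26.])
    centers = np.arange(len(counts)) + 0.5

    # frac is the fraction of bins entering each local fit; with ~10 bins a
    # large frac is needed and will take some tuning.
    result = lowess(counts, centers, frac=0.6)
    smoothed_counts = result[:, 1]  # column 0 is (sorted) x, column 1 is smoothed y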
    alexander-held
    @alexander-held
    Thanks! I'm giving this a try. Like TH1::Smooth, it does not inherently take into account the statistical uncertainty per bin, but I could sample from the histogram to work around that issue. It might have poor performance but could be a useful reference point.
    Jim Pivarski
    @jpivarski

    LOWESS smoothers are one of my favorite techniques! You can make it take advantage of statistical uncertainties by making the linear fits incorporate them. Each sampled point is a linear fit with points weighted by some kernel, usually a Gaussian of the distance from the sampled point, but the weight can be a Gaussian of the distance times 1/σ². (I don't know if there's a library that does that, but I've always done LOWESS manually.)

    It can also be relatively fast—depending on what timescale you consider "fast"—because linear fits with weights can be implemented as a numerical formula (see, for example, https://github.com/scikit-hep/awkward-1.0/blob/1531cc98e08a2be938b53ac6c1276c9745be8f20/src/awkward/operations/reducers.py#L1289-L1293). You don't need to use all data points in every linear fit, since the pull of those with |σ| > 2 or 3 will be very weak. I usually include the union of points with |σ| < 3 and the 5 closest points (to avoid degenerate cases in which no points are within 3σ of the sampled point: you want to extrapolate with whatever the closest points are that you have). A function like this, especially if you're interested in performance, would be an excellent application of Numba.
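
    A hand-rolled version along those lines might look something like this (a sketch, not a polished library; the function name and parameters are made up):

    import numpy as np

    def weighted_lowess(x, y, yerr, bandwidth):
        """Local weighted linear fit evaluated at each x[i]."""
        smoothed = np.empty(len(y), dtype=float)
        for i, x0 in enumerate(x):
            dist = (x - x0) / bandwidth
            # Gaussian kernel in distance, further down-weighted by 1/sigma^2.
            w = np.exp(-0.5 * dist**2) / yerr**2
            # Keep points within 3 bandwidths, plus the 5 nearest points, so
            # the fit never runs out of points near the edges.
            keep = np.abs(dist) < 3.0
            keep[np.argsort(np.abs(dist))[:5]] = True
            w = np.where(keep, w, 0.0)
            # Closed-form weighted least-squares line, evaluated at x0.
            xm = (w * x).sum() / w.sum()
            ym = (w * y).sum() / w.sum()
            slope = (w * (x - xm) * (y - ym)).sum() / (w * (x - xm)**2).sum()
            smoothed[i] = ym + slope * (x0 - xm)
        return smoothed

    A loop like this is the kind of function Numba could accelerate, as mentioned above.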

    agoose77
    @agoose77:matrix.org
    @jpivarski do you ever take a day off?
    agoose77
    @agoose77:matrix.org
    Jim, can I confirm that seeking in a large ROOT file with uproot4 (~100GB) is not hugely costly - i.e. I don't need to split it into several files, if I'm processing it event by event?
    Jim Pivarski
    @jpivarski

    Depending on what you mean by that, it could be. It's always possible to seek to a given point in a file without reading everything up to that point, and ROOT files, unlike JSON or CSV, contain integers saying exactly where to seek to in order to find objects. (In JSON or CSV, you'd have to read everything up to that point to know where a given record starts: e.g. in CSV, it's counting "\n" characters.)

    However, the seek points ROOT files maintain are pointers to TBaskets, which must be loaded in their entirety. In a 100 GB file, you can immediately seek to the last TBasket, but then you must read and decompress the whole TBasket before proceeding. That's not very large: maybe hundreds of kB, thousands of events (the exact numbers depend on the AutoFlush parameter when the file was written). So it's certainly not expensive to seek to a specific event. You don't need to split the 100 GB file to make that more efficient.

    However however, if what you're planning to do is to jump around from one event to another in random order, that might involve reading/decompressing a TBasket, throwing it away, then reading/decompressing another TBasket, then going back to the first one, etc. That would be inefficient, and it would be more so if the files were separate (because there's the TFile and TTree metadata to load each time). Caching TBaskets helps (ROOT and Uproot do this automatically), but then the performance depends on the caching parameters, like how big the cache is and how randomly you're jumping.

    Randomly jumping around in a file is not just a ROOT problem, with its quantization in TBaskets, but also a filesystem problem. Filesystems quantize disk reads into pages (usually 4 kB) and the operating system maintains a cache of them. This happens underneath any process—ROOT or Uproot—and performance differences can be orders of magnitude because RAM is much faster than disk. (And I'm guessing that the disk your 100 GB file is sitting on is not an SSD.)

    But if you're talking about running through the file sequentially, then none of that's an issue. In fact, sequential access is optimal for JSON and CSV, too. But if it's a Python for loop over a NumPy or Awkward Array, then there are faster ways to do it (vectorized operations or Numba). If you're talking about using Uproot to extract one event with branch.array(entry_start=N, entry_stop=N+1)[0], then that's definitely going to be slow because of the infrastructure needed to find the TBasket (even if already cached), interpret it as an array, and pull one element out. Use array/arrays/etc. in chunks as large as will fit in your memory.
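
    To make that last point concrete, a small sketch of the two access patterns (names are hypothetical):

    import uproot4

    with uproot4.open("big_file.root") as file:
        branch = file["Events"]["px"]

        # Slow: per-event access has to locate, interpret, and slice a
        # TBasket for every single entry.
        # value = branch.array(entry_start=n, entry_stop=n + 1)[0]

        # Better: read as large a chunk as fits in memory, then use
        # vectorized operations (or Numba) on the in-memory array.
        px = branch.array(entry_start=0, entry_stop=1_000_000, library="np")
        total = px.sum()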

    agoose77
    @agoose77:matrix.org
    @jpivarski: I'm running on an HPC cluster using Dask Distributed, and I currently operate on sections of entries by generating entry-range slices and passing those to tasks which read the appropriate range. It sounds like you're saying that in such a case, I would not need to read much besides the TBaskets I'm interested in?
    Jim Pivarski
    @jpivarski
    If you pass entry_start=N, entry_stop=M to array/arrays/etc. for reasonably large M - N ranges, then Uproot will read all the TBaskets that those ranges touch and cut off the excess. The loss of efficiency due to reading and cutting off an excess is unavoidable unless you tune N and M to TBasket boundaries (TTree.common_entry_offsets computes that for a set of TBranches, if you want to try), but if the M - N ranges are considerably larger than the TBaskets, the loss is not significant.
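
    A sketch of how those entry-range tasks might look with dask.delayed (the file, tree, and branch names, and the chunking choice, are illustrative):

    import dask
    import uproot4
    from dask import delayed

    def read_range(path, treename, branches, entry_start, entry_stop):
        with uproot4.open(path) as file:
            return file[treename].arrays(branches,
                                         entry_start=entry_start,
                                         entry_stop=entry_stop)

    path, treename, branches = "big_file.root", "Events", ["px", "py"]

    with uproot4.open(path) as file:
        offsets = list(file[treename].common_entry_offsets())  # TBasket-aligned entry numbers

    # Group several TBasket boundaries per task so each M - N range is much
    # larger than a single TBasket.
    edges = offsets[::10]
    if edges[-1] != offsets[-1]:
        edges.append(offsets[-1])

    tasks = [delayed(read_range)(path, treename, branches, start, stop)
             for start, stop in zip(edges[:-1], edges[1:])]
    results = dask.compute(*tasks)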
    agoose77
    @agoose77:matrix.org
    Thanks Jim. Seems like this will need some thought on my end!