Discussion of Python in High Energy Physics https://hepsoftwarefoundation.org/activities/pyhep.html
concatenate doesn't actually load the file contents 10 times if I pass a list with 10 times the same file name, maybe that's a bug?
@agoose77 Oh! Do you mean the equivalent of TTree::GetEntry(entry_number)? Uproot doesn't "seek." It doesn't have a pointer to an entry number that gets updated as you iterate through it.
A TTree consists of named TBranches, and each of those is split into small chunks called TBaskets. When ROOT does TTree::GetEntry, it checks to see if its pointer crosses a TBasket boundary, and if it does, it loads the next TBasket. Otherwise, it fetches the entry from its already-loaded TBaskets.
When you ask for ttree.arrays or tbranch.array from Uproot, it reads the desired range of entries from the file. It doesn't have a pointer to update: each access is a new thing, not a continuation. The one exception is the default cache (you set it when opening a file with uproot4.open), so re-reading the same array doesn't always go back to the file. Also, there's an uproot4.iterate function to iterate through chunks of events, not individual events, which handles "recently read TBaskets" the right way.
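To illustrate (a minimal sketch; the file name "events.root", tree name "Events", and branch name "pt" are hypothetical, not from this discussion):

```python
import uproot4

# Each call reads the requested entry range anew (modulo the array cache);
# there is no internal "current entry" pointer being advanced.
with uproot4.open("events.root") as file:
    tree = file["Events"]
    pt_all = tree["pt"].array()                               # whole branch
    pt_1k = tree["pt"].array(entry_start=0, entry_stop=1000)  # a slice

# Iterating in chunks of events (not one event at a time) reuses recently
# read TBaskets the right way.
for batch in uproot4.iterate("events.root:Events", ["pt"], step_size="100 MB"):
    ...  # process each chunk as an array
```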
You can pass entry_start and entry_stop parameters to the array/arrays methods so that each task reads a non-overlapping part of the TBranch. If these entries don't line up with TBasket boundaries, then the tasks will redundantly load the same TBaskets. This would also be true of ROOT: the smallest unit of reading anything from a ROOT file is a TBasket, because this is a separately compressed buffer (you can't read an item from a compressed buffer without decompressing the whole buffer). If you want to try to optimize that, common_entry_offsets is a method that gives TBasket boundaries as entry numbers for a given set of TBranches, as in the sketch below.
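For example, a sketch of aligning task boundaries to TBasket boundaries (branch names "px" and "py" are hypothetical, and I'm assuming common_entry_offsets takes a filter_name argument; check the signature in the docs):

```python
import uproot4

tree = uproot4.open("events.root")["Events"]

# Entry numbers at which all of the requested TBranches begin a new TBasket.
offsets = tree.common_entry_offsets(filter_name=["px", "py"])

# Consecutive offsets give non-overlapping ranges that never split a TBasket,
# so parallel tasks won't redundantly decompress the same buffers.
for start, stop in zip(offsets[:-1], offsets[1:]):
    arrays = tree.arrays(["px", "py"], entry_start=start, entry_stop=stop)
    ...  # hand each aligned chunk to its own task
```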
Very nice talk from @eduardo-rodrigues at the HSF WLCG Virtual Workshop https://youtu.be/hf0fT0_VKfw :)
Also I was very happy to hear the questions about citations at the end!
(Sorry that I'm reposting this everywhere; I want everyone to be warned.)
The Awkward/Uproot name transition is done, at least at the level of release candidates. If you do
pip install "awkward>=1.0.0rc1" "uproot>=4.0.0rc1"
you'll get Awkward 1.x and Uproot 4.x. (They don't strictly depend on each other, so you could do one, the other, or both.)
If you do
pip install "awkward1>=1.0.0rc1" "uproot4>4.0.0rc1"
you'll get thin awkward1 and uproot4 packages that just bring in the appropriate awkward and uproot and pass names through. This is so that uproot4.whatever still works.
If you do
pip install awkward0 uproot3 # or just uproot3
you'll get the old Awkward 0.x and Uproot 3.x, which you can rename on import (import ... as ...). This also brings in uproot3-methods, which is a new name just to avoid the compatibility issues with old packages that we saw last week.
All of the above are permanent; they will continue to work after Awkward 1.x and Uproot 4.x are full releases (not release candidates). However, the following will bring in old packages before the full release and new packages after the full release.
pip install awkward uproot
So it is only the full release that will break scripts, and only when users pip install --upgrade. I plan to take that step this weekend, when there might be fewer people actively working. It also gives everyone a chance to provide feedback or take action with an import ... as ... rename.
(Sorry for the reposting, if you saw this message elsewhere.)
Probably the last message about the Awkward Array/Uproot name transition: it's done. The new versions have moved from release candidates to full releases. Now when you
pip install awkward uproot
without qualification, you get the new ones. I think I've "dotted all the 'i's of packaging" to get the right dependencies and tested all the cases I could think of on a blank AWS instance.
pip install awkward0 uproot3
returns the old versions (Awkward 0.x and Uproot 3.x). The prescription for anyone who needs the old packages is import awkward0 as awkward and import uproot3 as uproot.
pip install awkward1 uproot4
returns thin wrappers of the new ones, which point to whatever the latest awkward and uproot are. They pass through to the new libraries, so scripts written with import awkward1, uproot4 don't need to be changed (though you'll probably want to, for simplicity).
uproot-methods no longer causes trouble because there's an uproot3-methods in the dependency chain: awkward0 → uproot3-methods → uproot3. The latest uproot-methods (no qualification) now excludes Awkward 1.x so that they can't be used together by mistake.

PyHEP WG 2021 topical meetings
Dear all,
It is a pleasure to announce that the HSF PyHEP WG will be running topical meetings in 2021, following popular interest from the community. These meetings will take place by default on the first Wednesday of the month. We will start on February 3rd with a tutorial on Numba by Jim Pivarski.
See https://indico.cern.ch/category/11412/ for the list of meetings pre-scheduled on Indico.
Best wishes,
Ben, Eduardo, Jim
Quick piece of news: SciPy 2021, the 20th annual Scientific Computing with Python conference, will be held July 5-11, 2021 in Austin, Texas.
Important Dates
February 9, 2021 Submission deadline
March 23, 2021 Tutorial presenters notified of acceptance
April 2, 2021 Conference speakers and poster presenters notified of acceptance
May 22, 2021 First draft of Proceedings Due
July 5-6, 2021 SciPy 2021 Tutorials
July 7-9, 2021 SciPy 2021 Conference
July 10-11, 2021 SciPy 2021 Sprints
TH1::Smooth is an example. scipy.signal has some algorithms that seem more suited to >>100 bins, and I have not found much else so far.
LOWESS smoothers are one of my favorite techniques! You can make them take advantage of statistical uncertainties by making the linear fits incorporate them. Each sampled point is a linear fit with points weighted by some kernel, usually Gaussian of distance from the sampled point, but the weight can be Gaussian of distance times 1/σ². (I don't know if there's a library that does that, but I've always done LOWESS manually.)
It can also be relatively fast—depending on what timescale you consider "fast"—because linear fits with weights can be implemented as a numerical formula (see, for example, https://github.com/scikit-hep/awkward-1.0/blob/1531cc98e08a2be938b53ac6c1276c9745be8f20/src/awkward/operations/reducers.py#L1289-L1293). You don't need to use all data points in every linear fit, since the pull of those with |σ| > 2 or 3 will be very weak. I usually include the union of points with |σ| < 3 and the 5 closest points (to avoid degenerate cases in which no points are within 3σ of the sampled point: you want to extrapolate with whatever the closest points are that you have). A function like this, especially if you're interested in performance, would be an excellent application of Numba.
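A minimal NumPy sketch of that weighting scheme (my own function and parameter names; the "5 closest points" safeguard is left out for brevity):

```python
import numpy as np

def lowess_with_errors(x, y, sigma, x_eval, bandwidth):
    # For each sampled point x0, do a weighted linear fit in which the weight
    # is a Gaussian kernel of distance times 1/sigma**2, then evaluate the
    # fitted line at x0. The fit is the closed-form weighted least-squares
    # formula, so this loop also compiles well with Numba's @njit.
    smoothed = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2) / sigma**2
        xbar = np.sum(w * x) / np.sum(w)
        ybar = np.sum(w * y) / np.sum(w)
        slope = np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)
        smoothed[i] = ybar + slope * (x0 - xbar)
    return smoothed
```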
Depending on what you mean by that, it could be. It's always possible to seek to a given point in a file without reading everything up to that point, and ROOT files, unlike JSON or CSV, contain integers saying exactly where to seek to in order to find objects. (In JSON or CSV, you'd have to read everything up to that point to know where a given record starts: e.g. in CSV, it's counting "\n" characters.)
However, the seek points ROOT files maintain are pointers to TBaskets, which must be loaded in their entirety. In a 100 GB file, you can immediately seek to the last TBasket, but then you must read and decompress the whole TBasket before proceeding. That's not very large: maybe hundreds of kB, thousands of events (the exact numbers depend on the AutoFlush parameter when the file was written). So it's certainly not expensive to seek to a specific event. You don't need to split the 100 GB file to make that more efficient.
However however, if what you're planning to do is to jump around from one event to another in random order, that might involve reading/decompressing a TBasket, throwing it away, then reading/decompressing another TBasket, then going back to the first one, etc. That would be inefficient, and it would be more so if the files were separate (because there's the TFile and TTree metadata to load each time). Caching TBaskets helps (ROOT and Uproot do this automatically), but then the performance depends on the caching parameters, like how big the cache is, and on how randomly you're jumping.
Randomly jumping around in a file is not just a ROOT problem, with its quantization in TBaskets, but also a filesystem problem. Filesystems quantize disk reads into pages (usually 4 kB) and the operating system maintains a cache of them. This happens underneath any process—ROOT or Uproot—and performance differences can be orders of magnitude because RAM is much faster than disk. (And I'm guessing that the disk your 100 GB file is sitting on is not an SSD.)
But if you're talking about running through the file sequentially, then none of that's an issue. In fact, sequential access is optimal for JSON and CSV, too. But if it's a Python for loop over a NumPy or Awkward Array, then there are faster ways to do it (vectorized operations or Numba). If you're talking about using Uproot to extract one event with branch.array(entry_start=N, entry_stop=N+1)[0], then that's definitely going to be slow because of the infrastructure needed to find the TBasket (even if already cached), interpret it as an array, and pull one element out. Use array/arrays/etc. in chunks as large as will fit in your memory.
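For instance, continuing with the hypothetical tree from the earlier sketch:

```python
# Slow: one event per call; each call must find the TBasket (even if cached),
# interpret it as an array, and pull out a single element.
for n in range(tree.num_entries):
    value = tree["pt"].array(entry_start=n, entry_stop=n + 1)[0]

# Fast: one big read, then ordinary in-memory indexing.
pt = tree["pt"].array()
value = pt[n]
```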
If you pass entry_start=N, entry_stop=M to array/arrays/etc. for reasonably large M - N ranges, then Uproot will read all the TBaskets that those ranges touch and cut off the excess. The loss of efficiency due to reading and cutting off an excess is unavoidable unless you tune N and M to TBasket boundaries (TTree.common_entry_offsets computes that for a set of TBranches, if you want to try, as in the sketch earlier), but if the M - N ranges are considerably larger than the TBaskets, the loss is not significant.