    Eduardo Rodrigues
    @eduardo-rodrigues
    Dear all, at the July PyHEP 2020 workshop we got a lot of feedback saying that there would be interest in having topical WG meetings every now and then. We plan to take that feedback on board ... Please get in touch if you would like to make a suggestion for a topic/presentation ... You can get in touch with us conveners at hsf-pyhep-organisation@googlegroups.com ... Advance thanks.
    You can also write down here in Gitter some ideas, of course.
    Matthew Feickert
    @matthewfeickert

    Very nice talk from @eduardo-rodrigues at the HSF WLCG Virtual Workshop https://youtu.be/hf0fT0_VKfw :)

    Also I was very happy to hear the questions about citations at the end!

    Eduardo Rodrigues
    @eduardo-rodrigues
    Thanks very much @matthewfeickert ! (Interesting to hear me for once - need to improve :D.)
    Yes, I got a lot of great questions. Some points raised are very difficult to answer quantitatively, such as those on citations. But worth the discussion, and good that people think better about citations of software work, recognition and so on.
    Jim Pivarski
    @jpivarski

    (Sorry that I'm reposting this everywhere; I want everyone to be warned.)

    The Awkward/Uproot name transition is done, at least at the level of release candidates. If you do

    pip install "awkward>=1.0.0rc1" "uproot>=4.0.0rc1"

    you'll get Awkward 1.x and Uproot 4.x. (They don't strictly depend on each other, so you could do one, the other, or both.)

    If you do

    pip install "awkward1>=1.0.0rc1" "uproot4>=4.0.0rc1"

    you'll get thin awkward1 and uproot4 packages that just bring in the appropriate awkward and uproot and pass names through. This is so that uproot4.whatever still works.

    If you do

    pip install awkward0 uproot3    # or just uproot3

    you'll get the old Awkward 0.x and Uproot 3.x that you can import ... as .... This also brings in uproot3-methods, which is a new name just to avoid compatibility issues with old packages that we saw last week.

    All of the above are permanent; they will continue to work after Awkward 1.x and Uproot 4.x are full releases (not release candidates). However, the following will bring in old packages before the full release and new packages after the full release.

    pip install awkward uproot

    So it is only the full release that will break scripts, and only when users pip install --upgrade. I plan to take that step this weekend, when there might be fewer people actively working. It also gives everyone a chance to provide feedback or take action with import ... as ....
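    A minimal sketch of how a script could check which generation of the packages the unqualified names resolve to, using only the standard library. The helper names (parse_major, installed_major, meets_minimum) are hypothetical and not part of Awkward Array or Uproot; the minimum major versions are taken from the message above.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_major(version_string):
    """Return the leading integer of a version string such as '1.0.0rc1'."""
    return int(version_string.split(".")[0])


def installed_major(dist_name):
    """Return the major version of an installed distribution, or None if absent."""
    try:
        return parse_major(version(dist_name))
    except PackageNotFoundError:
        return None


def meets_minimum(dist_name, minimum_major):
    """True only if the distribution is installed with at least this major version."""
    found = installed_major(dist_name)
    return found is not None and found >= minimum_major
```

    For example, meets_minimum("awkward", 1) and meets_minimum("uproot", 4) would tell a script whether the environment already has the new packages under the unqualified names.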

    Jim Pivarski
    @jpivarski

    (Sorry for the reposting, if you saw this message elsewhere.)

    Probably the last message about the Awkward Array/Uproot name transition: it's done. The new versions have moved from release candidates to full releases. Now when you

    pip install awkward uproot

    without qualification, you get the new ones. I think I've "dotted all the 'i's of packaging" to get the right dependencies and tested all the cases I could think of on a blank AWS instance.

    • pip install awkward0 uproot3 returns the old versions (Awkward 0.x and Uproot 3.x). The prescription for anyone who needs the old packages is import awkward0 as awkward and import uproot3 as uproot.
    • pip install awkward1 uproot4 returns thin wrappers of the new ones, which point to whatever the latest awkward and uproot are. They pass through to the new libraries, so scripts written with import awkward1, uproot4 don't need to be changed (though you'll probably want to, for simplicity).
    • uproot-methods no longer causes trouble because there's an uproot3-methods in the dependency chain: awkward0 ← uproot3-methods ← uproot3. The latest uproot-methods (no qualification) now excludes Awkward 1.x so that they can't be used together by mistake.
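    The import ... as ... prescription above can also be written as a fallback, so that a legacy script prefers the explicitly named old package and only then tries the unqualified name. This is a sketch under assumptions: import_first is a hypothetical helper, and falling back to the unqualified name only makes sense in an environment where that name still resolves to the old version.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")
```

    A legacy script might then use awkward = import_first("awkward0", "awkward") and uproot = import_first("uproot3", "uproot"), ideally followed by a check of the version that was actually imported.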
    Henry Schreiner
    @henryiii
    For anyone who is interested, there’s an HSF Training Hackathon next week: https://indico.cern.ch/event/975487/
    Bhanu Gupta
    @Bhanu-mbvg
    Hello folks, I would like to contribute to any open source projects under this group.
    I am currently doing my undergrad and have intermediate experience with Python but plan to learn more from this community. It would be really great if someone could point me in the right direction to get started.
    Jonas Eschle
    @mayou36
    Hi Bhanu, welcome! I assume that you may want to contribute to a Scikit-HEP project? I would suggest having a look at the available packages to see if there is anything that you are interested in or already have some special knowledge of. The range goes from data loading and dumping, plotting, statistical libraries such as likelihood fits, histograms, vector calculus, units, decay description and many more... just let us (or the main author directly) know what you're interested in.
    Bhanu Gupta
    @Bhanu-mbvg
    Hey Jonas thank you for the help
    Eduardo Rodrigues
    @eduardo-rodrigues
    Hi @Bhanu-mbvg, good to see more people engaging. @mayou36 already mentioned Scikit-HEP. On that note, I have had a task open in DecayLanguage for a few months that would make a good first issue: scikit-hep/decaylanguage#105. Do you want to give that a try? Get in touch privately or directly on this issue if interested ... Thanks.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Me again - just to point out that you can directly interact with Scikit-HEP authors, maintainers and users at https://gitter.im/Scikit-HEP/community.
    Bhanu Gupta
    @Bhanu-mbvg
    Thanks @eduardo-rodrigues and yes sure I would love to start with the beginner-friendly issue you mentioned above.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Cool. Let's continue there ... Thanks.
    Andrzej Novak
    @andrzejnovak
    I dunno if it's possible but it would be pretty nifty if there was a place to aggregate all "good first issue" labeled issues from skhep
    Eduardo Rodrigues
    @eduardo-rodrigues
    Good question/idea. I doubt that's possible unless one can somehow aggregate labels from multiple packages.
    Bhanu Gupta
    @Bhanu-mbvg
    GitHub has a REST API that can be used to get all issues of an organization with one particular label, which I think can be used to collect all the required data: GitHub Issues API
    Henry Schreiner
    @henryiii
    I’ve been using the API quite a bit (now through PyGithub), it works quite well. You could set up a daily action to collect and list these (probably easiest in Jekyll/Ruby)
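    A rough sketch of the aggregation idea discussed above, using GitHub's issue search endpoint with the org: and label: qualifiers and only the standard library (rather than PyGithub). The function names are hypothetical, only the first page of results is fetched, and unauthenticated requests are rate-limited, so this is a starting point rather than a finished tool.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://api.github.com/search/issues"


def build_search_url(org, label, per_page=100):
    """Build a search URL for open issues with the given label across one organization."""
    query = f'org:{org} label:"{label}" is:issue state:open'
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{SEARCH_ENDPOINT}?{params}"


def fetch_issues(org, label):
    """Network call: return the 'items' list from the first page of search results."""
    request = urllib.request.Request(
        build_search_url(org, label),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


def group_by_repository(items):
    """Group (title, html_url) pairs by repository name taken from repository_url."""
    grouped = {}
    for item in items:
        repo = item["repository_url"].rsplit("/", 1)[-1]
        grouped.setdefault(repo, []).append((item["title"], item["html_url"]))
    return grouped
```

    Calling group_by_repository(fetch_issues("scikit-hep", "good first issue")) from a scheduled job would give a per-package list that could be rendered on a web page.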
    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP WG 2021 topical meetings
    Dear all,

    It is a pleasure to announce that the HSF PyHEP WG will be running topical meetings in 2021, following popular interest from the community. These meetings will take place by default on the first Wednesday of the month. We will start on February 3rd with a tutorial on Numba by Jim Pivarski.

    See https://indico.cern.ch/category/11412/ for the list of meetings pre-scheduled on Indico.

    Best wishes,
    Ben, Eduardo, Jim

    Divyansh Rastogi
    @watch24hrs-iiitd
    Hi! I am Divyansh, currently a college sophomore. I have been digging around the project DEEPLENSE. It would be really helpful if someone could guide me to their conversation channel/GitHub link! Thanks in advance!
    Jonas Eschle
    @mayou36
    Hi Divyansh, unfortunately you may have picked the wrong chat. This is about software in High Energy Physics and doesn't have anything closely to do with DeepLense, so I am afraid that we can't help you find it. Or did I misunderstand your question?
    Divyansh Rastogi
    @watch24hrs-iiitd
    Actually, it was a project in GSOC 2020 that intrigues me now, and that's why I wanted to get in touch with the community and their codebase. Link https://hepsoftwarefoundation.org/gsoc/2020/proposal_DEEPLENSE.html
    Jonas Eschle
    @mayou36
    I see, I didn't remember that. Maybe someone here knows something about it (I didn't find anything). Otherwise, if no one answers here, consider contacting the author of the paper if you are interested in their research and would like to participate.
    Henry Schreiner
    @henryiii
    I’m checking with Sergei to see what has happened with it so far.
    Eduardo Rodrigues
    @eduardo-rodrigues
    Happy New Year everyone!

    Quick piece of news - SciPy 2021, the 20th annual Scientific Computing with Python conference, will be held July 5-11, 2021 in Austin, Texas
    Important Dates

    February 9, 2021 Submission deadline
    March 23, 2021 Tutorial presenters notified of acceptance
    April 2, 2021 Conference speakers and poster presenters notified of acceptance
    May 22, 2021 First draft of Proceedings Due
    July 5-6, 2021 SciPy 2021 Tutorials
    July 7-9, 2021 SciPy 2021 Conference
    July 10-11, 2021 SciPy 2021 Sprints

    alexander-held
    @alexander-held
    Hi, is anyone aware of Python implementations of smoothing algorithms specifically designed for smoothing histograms with a small number (~3-30) of bins? Such algorithms are used for smoothing templates in statistical models. ROOT's TH1::Smooth is an example. scipy.signal has some algorithms that seem more suited to >>100 bins, and I have not found much else so far.
    Kevin Wang
    @LeavesWang
    Hi, what about Lowess smoothing in statsmodels.nonparametric.smoothers_lowess.lowess?
    alexander-held
    @alexander-held
    Thanks! I'm giving this a try. Like TH1::Smooth, it does not inherently take into account statistical uncertainty per bin, but I could sample from the histogram to work around that issue. It might have poor performance but could be a useful reference point.
    Jim Pivarski
    @jpivarski

    LOWESS smoothers are one of my favorite techniques! You can make it take advantage of statistical uncertainties by making the linear fits incorporate them. Each sampled point is a linear fit with points weighted by some kernel, usually Gaussian of distance from the sampled point, but the weight can be Gaussian of distance times 1/σ². (I don't know if there's a library that does that, but I've always done LOWESS manually.)

    It can also be relatively fast—depending on what timescale you consider "fast"—because linear fits with weights can be implemented as a numerical formula (see, for example, https://github.com/scikit-hep/awkward-1.0/blob/1531cc98e08a2be938b53ac6c1276c9745be8f20/src/awkward/operations/reducers.py#L1289-L1293). You don't need to use all data points in every linear fit, since the pull of those with |σ| > 2 or 3 will be very weak. I usually include the union of points with |σ| < 3 and the 5 closest points (to avoid degenerate cases in which no points are within 3σ of the sampled point: you want to extrapolate with whatever the closest points are that you have). A function like this, especially if you're interested in performance, would be an excellent application of Numba.
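    A minimal pure-Python sketch of the uncertainty-weighted LOWESS idea described above: at each bin center, a straight line is fitted with weights equal to a Gaussian kernel in distance times 1/σ², using the closed-form weighted least-squares solution. The function name and the bandwidth parameter are assumptions for illustration. As a simplification, every bin enters every local fit (reasonable for 3-30 bins) instead of the truncated neighborhood described in the message.

```python
import math


def weighted_lowess(x, y, sigma, bandwidth):
    """Smooth y(x) with local linear fits weighted by Gaussian(distance) / sigma**2.

    Assumes at least two distinct x values and a bandwidth comparable to the
    bin spacing, so that each local fit is well determined.
    """
    smoothed = []
    for x0 in x:
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi, si in zip(x, y, sigma):
            dx = xi - x0  # center on x0 so the fitted intercept is the value at x0
            w = math.exp(-0.5 * (dx / bandwidth) ** 2) / si**2
            sw += w
            swx += w * dx
            swy += w * yi
            swxx += w * dx * dx
            swxy += w * dx * yi
        delta = sw * swxx - swx * swx
        # Intercept of the weighted least-squares line, evaluated at dx = 0.
        smoothed.append((swxx * swy - swx * swxy) / delta)
    return smoothed
```

    Because each local fit is a straight line, data that already lie on a line are returned unchanged, and a bin with a very large σ has almost no pull on its neighbors.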

    agoose77
    @agoose77:matrix.org
    @jpivarski do you ever take a day off?
    agoose77
    @agoose77:matrix.org
    Jim, can I confirm that seeking in a large ROOT file with uproot4 (~100GB) is not hugely costly - i.e. I don't need to split it into several files, if I'm processing it event by event?
    Jim Pivarski
    @jpivarski

    Depending on what you mean by that, it could be. It's always possible to seek to a given point in a file without reading everything up to that point, and ROOT files, unlike JSON or CSV, contain integers saying exactly where to seek to in order to find objects. (In JSON or CSV, you'd have to read everything up to that point to know where a given record starts: e.g. in CSV, it's counting "\n" characters.)

    However, the seek points ROOT files maintain are pointers to TBaskets, which must be loaded in their entirety. In a 100 GB file, you can immediately seek to the last TBasket, then you must read and decompress the whole TBasket before proceeding. That's not very large: maybe 100's of kB, 1000's of events (the exact numbers depend on the AutoFlush parameter when the file was written). So it's certainly not expensive to seek to a specific event. You don't need to split the 100 GB file to make that more efficient.

    However however, if what you're planning to do is to jump around from one event to another in random order, that might involve reading/decompressing a TBasket, throwing it away, then reading/decompressing another TBasket, then back to the first one, etc. That would be inefficient, and it would be more so if the files were separate (because there's the TFile and TTree metadata to load each time). Caching TBaskets helps (ROOT and Uproot do this automatically), but then the performance depends on the caching parameters, like how big the cache is, and on how randomly you're jumping.

    Randomly jumping around in a file is not just a ROOT problem, with its quantization in TBaskets, but also a filesystem problem. Filesystems quantize disk reads into pages (usually 4 kB) and the operating system maintains a cache of them. This happens underneath any process—ROOT or Uproot—and performance differences can be orders of magnitude because RAM is much faster than disk. (And I'm guessing that the disk your 100 GB file is sitting on is not an SSD.)

    But if you're talking about running through the file sequentially, then none of that's an issue. In fact, sequential access is optimal for JSON and CSV, too. But if it's a Python for loop over a NumPy or Awkward Array, then there are faster ways to do it (vectorized operations or Numba). If you're talking about using Uproot to extract one event with branch.array(entry_start=N, entry_stop=N+1)[0], then that's definitely going to be slow because of the infrastructure needed to find the TBasket (even if already cached), interpret it as an array, and pull one element out. Use array/arrays/etc. in as large of chunks as will fit in your memory.
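    To make the sequential-versus-random argument above concrete, here is a toy model that counts how many times a basket would have to be read and decompressed for a given access pattern. It does not use ROOT or Uproot: the entry offsets, the least-recently-used cache, and the function names are all assumptions for illustration only.

```python
from bisect import bisect_right
from collections import OrderedDict


def basket_of(entry, offsets):
    """Index of the basket holding `entry`, given sorted basket start offsets."""
    return bisect_right(offsets, entry) - 1


def count_basket_loads(entries, offsets, cache_size):
    """Count basket loads for an access pattern with an LRU cache of baskets."""
    cache = OrderedDict()
    loads = 0
    for entry in entries:
        basket = basket_of(entry, offsets)
        if basket in cache:
            cache.move_to_end(basket)  # mark as most recently used
        else:
            loads += 1  # stands in for "read and decompress the whole basket"
            cache[basket] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used basket
    return loads
```

    With three baskets of 1000 entries each, a sequential pass loads each basket once, while alternating between two distant entries with a one-basket cache reloads a basket on every access.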

    agoose77
    @agoose77:matrix.org
    @jpivarski: I'm running on an HPC cluster using Dask Distributed, and I currently operate on sections of entries by generating entry range slices, and passing those to tasks which read the appropriate range. It sounds like you're saying that in such a case, I would not need to read much besides the baskets I'm interested in?
    Jim Pivarski
    @jpivarski
    If you pass entry_start=N, entry_stop=M to array/arrays/etc. for reasonably large M - N ranges, then Uproot will read all the TBaskets that those ranges touch and cut off the excess. The loss of efficiency due to reading and cutting off an excess is unavoidable unless you tune N and M to TBasket boundaries (TTree.common_entry_offsets computes that for a set of TBranches, if you want to try), but if the M - N ranges are considerably larger than the TBaskets, the loss is not significant.
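    A small sketch of tuning N and M to basket boundaries, as suggested above. It assumes a sorted list of entry offsets, such as the boundaries the message says TTree.common_entry_offsets computes; the aligned_ranges helper and the example numbers are hypothetical. Consecutive baskets are grouped until a range reaches the target size, so no task reads part of a basket.

```python
def aligned_ranges(offsets, target_size):
    """Group consecutive baskets into (start, stop) ranges of at least target_size entries.

    `offsets` is a sorted list of basket boundaries, starting at the first
    entry and ending at the total number of entries.
    """
    ranges = []
    start = offsets[0]
    for boundary in offsets[1:]:
        if boundary - start >= target_size:
            ranges.append((start, boundary))
            start = boundary
    if start != offsets[-1]:
        # The remaining entries form a final, possibly smaller, range.
        ranges.append((start, offsets[-1]))
    return ranges
```

    Each (start, stop) pair could then be passed as entry_start and entry_stop to one task.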
    agoose77
    @agoose77:matrix.org
    Thanks Jim. Seems like this will need some thought on my end!
    Henry Schreiner
    @henryiii
    If anyone finds this useful, I recently wrote and taught this: https://henryiii.github.io/level-up-your-python
    Eduardo Rodrigues
    @eduardo-rodrigues
    HSF PyHEP Topical Meetings
    As discussed at the PyHEP 2020 workshop, we're starting a series of topical meetings, loosely organized around a different Python module each month. So far, we have the following lined up:
    • February 3, 2021: Numba presented by Jim Pivarski
    • March 3, 2021: JAX presented by Hans Dembinski
    • April 7, 2021: pyhf presented by Giordon Stark, Lukas Heinrich, and/or Matthew Feickert
    • continuing on the first Wednesday of each month.
    Each of these will be one hour, starting at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India.
    Next Wednesday's Numba tutorial will be presented on Zoom (Indico agenda) with an interactive Jupyter notebook in Binder (GitHub repo). No registration is required; just show up if you're interested!
    (See the intro slides and notebook to get a sense of what is planned!)
    Please kindly advertise to your own communities and communication channels! Advance thanks.
    Henry Schreiner
    @henryiii
    NumPy 1.20 is out! Static typing support in, Python 3.6 out!
    Eduardo Rodrigues
    @eduardo-rodrigues
    Kind reminder that the first HSF PyHEP Topical Meeting is today at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India: https://indico.cern.ch/event/985350/
    Eduardo Rodrigues
    @eduardo-rodrigues
    The YouTube recording is now available, see the link on Indico or the playlist at https://www.youtube.com/playlist?list=PLKZ9c4ONm-VnFUD0XX2DmfP1JA8VIRhXP. Enjoy!
    P.S.: thank you again Jim for the tutorial!
    Hans Dembinski
    @HDembinski
    Thank you Jim for the nice talk, and Eduardo and the other organizers for setting this up. I forwarded the announcement to the LHCb statistics and machine learning WG.
    Hans Dembinski
    @HDembinski
    Just a note, the second half of the recorded talk has some persistent noise in the background
    @jpivarski Also thanks for pointing to my crude little numba-stats module. I need to work on the PyPI frontpage!
    Eduardo Rodrigues
    @eduardo-rodrigues
    Thanks for the advert :-).
    Henry Schreiner
    @henryiii
    NASA’s OSS flight framework is styled in Black and uses pre-commit. :) https://github.com/nasa/fprime/blob/035808df02706d405611b30efa396f8fb799e9a1/.pre-commit-config.yaml
    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP topical WG meeting "module-of-the-month" - JAX
    Dear colleague,

    The second PyHEP topical meeting (Indico) will take place next Wednesday March 3rd at 16h Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 20h30 in India.

    The 1-hour tutorial will cover JAX and will be given by Hans Dembinski.

    For reference: these topical meetings are loosely organised around a different Python module each month.
    So far we have/had the following lined up:
    • February 3, 2021: Numba presented by Jim Pivarski
    • March 3, 2021: JAX presented by Hans Dembinski
    • April 7, 2021: pyhf presented by Giordon Stark, Lukas Heinrich, Matthew Feickert
    • Continuing on the first Wednesday of each month.

    No registration is required; just show up if you're interested!
    Eduardo,
    for the PyHEP WG organisers