    agoose77
    @agoose77:matrix.org
    [m]
    Jim, can I confirm that seeking in a large ROOT file with uproot4 (~100GB) is not hugely costly - i.e. I don't need to split it into several files, if I'm processing it event by event?
    Jim Pivarski
    @jpivarski

    Depending on what you mean by that, it could be. It's always possible to seek to a given point in a file without reading everything up to that point, and ROOT files, unlike JSON or CSV, contain integers saying exactly where to seek to in order to find objects. (In JSON or CSV, you'd have to read everything up to that point to know where a given record starts: e.g. in CSV, it's counting "\n" characters.)
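
A minimal sketch of the difference described above, using only the Python standard library. The line-oriented record format and the helper names (`build_offset_index`, `read_record_by_scanning`, `read_record_by_seeking`) are made up for illustration and are not ROOT's actual layout. The point is only that a stored table of byte offsets lets a reader jump straight to record N, while a CSV-like format has to count newlines from the start.

```python
import io
from typing import List


def build_offset_index(buf: io.BytesIO) -> List[int]:
    """Scan the buffer once and record the byte offset where each record starts."""
    offsets = []
    buf.seek(0)
    while True:
        pos = buf.tell()
        line = buf.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def read_record_by_scanning(buf: io.BytesIO, n: int) -> bytes:
    """CSV-style access: read and discard n records to find where record n starts."""
    buf.seek(0)
    for _ in range(n):
        buf.readline()
    return buf.readline().rstrip(b"\n")


def read_record_by_seeking(buf: io.BytesIO, offsets: List[int], n: int) -> bytes:
    """Index-style access: jump directly to the stored offset of record n."""
    buf.seek(offsets[n])
    return buf.readline().rstrip(b"\n")


# Hypothetical data: 1000 small text records.
data = io.BytesIO(b"".join(f"event,{i},{i * i}\n".encode() for i in range(1000)))
index = build_offset_index(data)

# Both strategies return the same record; only the amount of reading differs.
assert read_record_by_scanning(data, 750) == read_record_by_seeking(data, index, 750)
```

In this toy the index is built by scanning once; a format that writes the offsets into the file when it is created, as described above for ROOT, avoids even that first scan.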

    However, the seek points ROOT files maintain are pointers to TBaskets, which must be loaded in their entirety. In a 100 GB file, you can immediately seek to the last TBasket, then you must read and decompress the whole TBasket before proceeding. That's not very large: maybe 100's of kB, 1000's of events (the exact numbers depend on the AutoFlush parameter when the file was written). So it's certainly not expensive to seek to a specific event. You don't need to split the 100 GB file to make that more efficient.

    However however, if what you're planning to do is to jump around from one event to another in random order, that might involve reading/decompressing a TBasket, throwing it away, then reading/decompressing another TBasket, then back to the first one, etc. That would be inefficient, and it would be more so if the files were separate (because there's the TFile and TTree metadata to load each time). Caching TBaskets helps (ROOT and Uproot do this automatically), but then the performance depends on the caching parameters, like how big the cache is, and on how randomly you're jumping.
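
A toy model of the access patterns described above, assuming a small LRU cache of "baskets". Nothing here touches ROOT or Uproot: `ENTRIES_PER_BASKET`, `NUM_BASKETS`, and the helpers are hypothetical, and `functools.lru_cache` only stands in for whatever caching the real libraries do. It counts how many times a basket has to be loaded when the same entries are visited sequentially versus in shuffled order.

```python
import random
from functools import lru_cache

ENTRIES_PER_BASKET = 1000  # hypothetical; the real number depends on AutoFlush
NUM_BASKETS = 100


def make_reader(cache_size):
    """Return (get_event, counter); counter['loads'] counts simulated basket loads."""
    counter = {"loads": 0}

    @lru_cache(maxsize=cache_size)
    def load_basket(basket_index):
        # Stand-in for "read and decompress a whole TBasket".
        counter["loads"] += 1
        start = basket_index * ENTRIES_PER_BASKET
        return tuple(range(start, start + ENTRIES_PER_BASKET))

    def get_event(entry):
        basket = load_basket(entry // ENTRIES_PER_BASKET)
        return basket[entry % ENTRIES_PER_BASKET]

    return get_event, counter


def count_loads(order, cache_size):
    """Visit entries in the given order and report how many basket loads occurred."""
    get_event, counter = make_reader(cache_size)
    for entry in order:
        get_event(entry)
    return counter["loads"]


sequential = list(range(ENTRIES_PER_BASKET * NUM_BASKETS))
shuffled = sequential[:]
random.Random(12345).shuffle(shuffled)

seq_loads = count_loads(sequential, cache_size=4)             # each basket loaded once
rand_loads = count_loads(shuffled, cache_size=4)              # most accesses miss the cache
big_cache_loads = count_loads(shuffled, cache_size=NUM_BASKETS)  # everything fits

print(seq_loads, rand_loads, big_cache_loads)
```

With a cache that holds only a few baskets, the shuffled order reloads baskets far more often than the sequential order does. A cache large enough to hold every basket brings the count back down to one load per basket, which is the dependence on cache size and access randomness mentioned above.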

    Randomly jumping around in a file is not just a ROOT problem, with its quantization in TBaskets, but also a filesystem problem. Filesystems quantize disk reads into pages (usually 4 kB) and the operating system maintains a cache of them. This happens underneath any process—ROOT or Uproot—and performance differences can be orders of magnitude because RAM is much faster than disk. (And I'm guessing that the disk your 100 GB file is sitting on is not an SSD.)

    But if you're talking about running through the file sequentially, then none of that's an issue. In fact, sequential access is optimal for JSON and CSV, too. But if it's a Python for loop over a NumPy or Awkward Array, then there are faster ways to do it (vectorized operations or Numba). If you're talking about using Uproot to extract one event with branch.array(entry_start=N, entry_stop=N+1)[0], then that's definitely going to be slow because of the infrastructure needed to find the TBasket (even if already cached), interpret it as an array, and pull one element out. Use array/arrays/etc. in as large of chunks as will fit in your memory.

    agoose77
    @agoose77:matrix.org
    [m]
    @jpivarski: I'm running on an HPC cluster using Dask Distributed, and I currently operate on sections of entries by generating entry range slices, and passing those to tasks which read the appropriate range. It sounds like you're saying that in such a case, I would not need to read much besides the baskets I'm interested in?
    Jim Pivarski
    @jpivarski
    If you pass entry_start=N, entry_stop=M to array/arrays/etc. for reasonably large M - N ranges, then Uproot will read all the TBaskets that those ranges touch and cut off the excess. The loss of efficiency due to reading and cutting off an excess is unavoidable unless you tune N and M to TBasket boundaries (TTree.common_entry_offsets computes that for a set of TBranches, if you want to try), but if the M - N ranges are considerably larger than the TBaskets, the loss is not significant.
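
One way the tuning of N and M to TBasket boundaries could be sketched, assuming the boundaries are already available as a sorted list of entry numbers (the message above names TTree.common_entry_offsets as the source of such a list; the list below is invented). The helper `aligned_entry_ranges` is hypothetical and pure Python: it groups consecutive baskets into (start, stop) ranges of at least a requested size, so that no range cuts through a basket.

```python
from typing import List, Tuple


def aligned_entry_ranges(basket_offsets: List[int], min_entries: int) -> List[Tuple[int, int]]:
    """Group consecutive baskets into (start, stop) ranges of at least min_entries.

    basket_offsets is a sorted list of entry numbers at basket boundaries,
    beginning with the first entry and ending with the total number of entries.
    Every returned start and stop is one of those boundaries.
    """
    ranges = []
    start = basket_offsets[0]
    for boundary in basket_offsets[1:]:
        if boundary - start >= min_entries:
            ranges.append((start, boundary))
            start = boundary
    # Whatever is left after the last full-size range becomes a final, smaller range.
    if start != basket_offsets[-1]:
        ranges.append((start, basket_offsets[-1]))
    return ranges


# Invented boundaries for illustration: 6 baskets covering 5000 entries.
offsets = [0, 900, 1800, 2700, 3600, 4500, 5000]
ranges = aligned_entry_ranges(offsets, min_entries=2000)
print(ranges)
```

Each (start, stop) pair could then be handed to a task as entry_start and entry_stop. Whether the alignment is worth the effort depends on how large the ranges are compared with the baskets, as noted above.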
    agoose77
    @agoose77:matrix.org
    [m]
    Thanks Jim. Seems like this will need some thought on my end!
    Henry Schreiner
    @henryiii
    If anyone finds this useful, I recently wrote and taught this: https://henryiii.github.io/level-up-your-python
    Eduardo Rodrigues
    @eduardo-rodrigues
    HSF PyHEP Topical Meetings
    As discussed at the PyHEP 2020 workshop, we're starting a series of topical meetings, loosely organized around a different Python module each month. So far, we have the following lined up:
    • February 3, 2021: Numba presented by Jim Pivarski
    • March 3, 2021: JAX presented by Hans Dembinski
    • April 7, 2021: pyhf presented by Giordon Stark, Lukas Heinrich, and/or Matthew Feickert
    • continuing on the first Wednesday of each month.
    Each of these will be one hour, starting at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India.
    Next Wednesday's Numba tutorial will be presented on Zoom (Indico agenda) with an interactive Jupyter notebook in Binder (GitHub repo). No registration is required; just show up if you're interested!
    (See the intro slides and notebook to get a sense of what is planned!)
    Please kindly advertise to your own communities and communication channels! Advance thanks.
    Henry Schreiner
    @henryiii
    NumPy 1.20 is out! Static typing support in, Python 3.6 out!
    Eduardo Rodrigues
    @eduardo-rodrigues
    Kind reminder that the first HSF PyHEP Topical Meeting is today at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India: https://indico.cern.ch/event/985350/
    Eduardo Rodrigues
    @eduardo-rodrigues
    The YouTube recording is now available, see the link on Indico or the playlist at https://www.youtube.com/playlist?list=PLKZ9c4ONm-VnFUD0XX2DmfP1JA8VIRhXP. Enjoy!
    P.S.: thank you again Jim for the tutorial!
    Hans Dembinski
    @HDembinski
    Thank you Jim for the nice talk, and Eduardo and the other organizers for setting this up. I forwarded the announcement to the LHCb statistics and machine learning WG.
    Hans Dembinski
    @HDembinski
    Just a note, the second half of the recorded talk has some persistent noise in the background
    @jpivarski Also thanks for pointing to my crude little numba-stats module. I need to work on the PyPI frontpage!
    Eduardo Rodrigues
    @eduardo-rodrigues
    Thanks for the advert :-).
    Henry Schreiner
    @henryiii
    NASA’s OSS flight framework is styled in Black and uses pre-commit. :) https://github.com/nasa/fprime/blob/035808df02706d405611b30efa396f8fb799e9a1/.pre-commit-config.yaml
    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP topical WG meeting "module-of-the-month" - JAX
    Dear colleague,

    The second PyHEP topical meeting (Indico) will take place next Wednesday March 3rd at 16h Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 20h30 in India.

    The 1-hour tutorial will cover JAX and will be given by Hans Dembinski.

    For reference: these topical meetings are loosely organised around a different Python module each month.
    So far we have/had the following lined up:
    • February 3, 2021: Numba presented by Jim Pivarski
    • March 3, 2021: JAX presented by Hans Dembinski
    • April 7, 2021: pyhf presented by Giordon Stark, Lukas Heinrich, Matthew Feickert
    • Continuing on the first Wednesday of each month.

    No registration is required; just show up if you're interested!
    Eduardo,
    for the PyHEP WG organisers

    Eduardo Rodrigues
    @eduardo-rodrigues
    REMINDER - PyHEP tutorial in 10-ish minutes :-). See you there ...
    rthirusenthil
    @rthirusenthil
    @eduardo-rodrigues Dear Professor, What is the passcode for zoom? I am unable to join
    Eduardo Rodrigues
    @eduardo-rodrigues
    Code is 11318709.
    Expand the arrow next to the Zoom link. Unfortunately that's not done by default.
    rthirusenthil
    @rthirusenthil
    @eduardo-rodrigues Thank you
    ramc77
    @ramc77
    Good Morning all scientists
    I am new here, want to learn PyHEP for SM simulation
    Eduardo Rodrigues
    @eduardo-rodrigues
    Hi @ramc77, your statement is super vague. By SM simulation do you mean something related to https://github.com/SModelS/smodels, though that's more for BSM? If you want to "learn PyHEP", have you been looking at some packages specifically, or do you have some topic in mind? For example, if you want to learn about histogramming then https://github.com/scikit-hep/boost-histogram is your friend ... Scikit-HEP itself provides a lot of packages in the PyHEP ecosystem. zfit provides others, etc.
    ramc77
    @ramc77
    Hi @eduardo-rodrigues thank you for replying. I want to calculate and test my own BSM model (which is in Lagrangian form).
    Eduardo Rodrigues
    @eduardo-rodrigues
    Hi. Not sure whether there is a specific package for some of the basics of what you need. If you need to minimise functions then the "PyHEP" fitters are relevant, in which case you can also ask at https://gitter.im/HSF/PyHEP-fitting ... Otherwise, again, you need to say or point to what you are doing. I suspect you're not interested in symbolic stuff hence assume you're rather in need of fitters and histogramming.
    Check iminuit and zfit for a start?
    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP topical WG meeting "module-of-the-month" - pyhf
    Dear colleague,

    The 3rd PyHEP topical meeting (Indico https://indico.cern.ch/event/985425/) will take place next Wednesday April 7th at 16h Central European time (Geneva, CERN).

    The 1-hour tutorial will cover pyhf, a pure-Python High Energy Physics package for fitting, limit-setting, and interval estimation in the style of ROOT's HistFactory (https://github.com/scikit-hep/pyhf), and will be given by Matthew Feickert.

    Zoom link: https://cern.zoom.us/j/95400915645?pwd=TjBBWC84cFViTkgxdEwwNXp0WEdHZz09
    Passcode: 11318709

    For reference: these topical meetings are loosely organised around a different Python module each month.
    So far we have/had the following lined up:
    • February 3, 2021: Numba presented by Jim Pivarski
    • March 3, 2021: JAX presented by Hans Dembinski
    • April 7, 2021: pyhf presented by Giordon Stark, Lukas Heinrich, Matthew Feickert
    • Continuing on the first Wednesday of each month.

    No registration is required; just show up if you're interested!
    Eduardo,
    for the PyHEP WG organisers

    ramc77
    @ramc77
    thanks
    ramc77
    @ramc77
    Is this the same as CalcHEP and HERWIG? @eduardo-rodrigues
    Eduardo Rodrigues
    @eduardo-rodrigues
    You mean pyhf? Nope, not at all. I do not know CalcHEP but HERWIG is a Monte Carlo event generator. pyhf is about statistical analysis of binned data.
    Looked now - "CalcHEP - a package for calculation of Feynman diagrams and integration over multi-particle phase space.". Totally different business.
    ramc77
    @ramc77
    thank you
    Eduardo Rodrigues
    @eduardo-rodrigues
    Reminder - PyHEP topical meeting on pyhf starting in 10 minutes (see above for details).
    Henry Schreiner
    @henryiii

    First release of Vector out! Version 0.8, some constructor changes planned, but should be ready to play with! Initial features:

    • 2D, 3D, and Lorentz vectors
    • Single, Array, and Awkward forms
    • Supports Numba / Awkward + Numba
    • Multiple coordinate systems
    • Geometric / momentum versions
    • Statically typed

    https://github.com/scikit-hep/vector

    Eduardo Rodrigues
    @eduardo-rodrigues

    PyHEP 2021 Workshop, July 5-9 2021 - registration and abstract submission are open!

    Dear colleague,

    The PyHEP 2021 workshop will be a virtual workshop taking place on July 5‒9. Registration is open, as is abstract submission, see https://indico.cern.ch/e/PyHEP2021. There is no registration fee and no limit on the number of participants.

    The agenda will be composed of tutorials (targeting different levels of experience) and standard talks, which will be based on the accepted abstracts and the topics of interest to the “Python in HEP” community. Upon registration you will have the opportunity to shape the workshop contents and format with your input. Provisionally, the various sessions will take place in the afternoons on the Central Europe time zone, following the information from last year on the times that suit most participants across the globe.

    We welcome submissions of abstracts for live tutorials and shorter “notebook talks”, both of which are intended to target the strengths of live, online communication. Details on these two different types of talks are provided as instructions on the submission form. We encourage the use of Jupyter notebooks and help on how to set things up will be provided in due time. Jupyter notebook submissions will be made available through Binder for participants to run on their own in real time.

    Details can be found on the Indico page https://indico.cern.ch/e/PyHEP2021 or from the PyHEP WG homepage http://hepsoftwarefoundation.org/activities/pyhep.html.
    You are encouraged to also join the PyHEP WG Gitter channel (https://gitter.im/HSF/PyHEP) and/or the HSF forum (https://groups.google.com/forum/#!forum/hsf-forum) to get more information about the workshop and community.

    We are directly reachable via pyhep2021-organisation@cern.ch.

    Looking forward to your participation!

    Organising Committee
    Eduardo Rodrigues - University of Liverpool (Chair)
    Ben Krikler - University of Bristol (Co-chair)
    Jim Pivarski - Princeton University (Co-chair)
    Matthew Feickert - University of Illinois at Urbana-Champaign
    Oksana Shadura - University of Nebraska-Lincoln
    Philip Grace - The University of Adelaide

    Jim Pivarski
    @jpivarski

    Continuing our series of topical meetings, Nick Smith will be presenting an introduction on using Dask to parallelize your workflows on May 5 at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India.

    https://indico.cern.ch/event/1027094/

    So far, we have the following lined up (with recordings of past videos):

    • February 3, 2021: Numba presented by Jim Pivarski (video)
    • March 3, 2021: JAX presented by Hans Dembinski (video)
    • April 7, 2021: pyhf presented by Matthew Feickert (video)
    • May 5, 2021: Dask presented by Nick Smith
    • June 2, 2021: Jupyter presented by Jim Pivarski
    • continuing on the first Wednesday of each month.
    No registration is required; just show up if you're interested!

    Jim Pivarski
    for the PyHEP Organizing Committee

    Jim Pivarski
    @jpivarski

    Continuing our series of PyHEP topical meetings, Jim Pivarski will be presenting a talk on How to Give a Good Jupyter Talk on Wednesday, June 2 at 16:00 Central European time (CERN), which is 10am U.S. Eastern, 7am U.S. Pacific, midnight in Tokyo, and 8:30pm in India. This talk will focus on presentation tips and techniques, so it would be appropriate for anyone who plans to give a talk using Jupyter, even if you're already familiar with the software.

    https://indico.cern.ch/event/1044648/

    So far, we have the following PyHEP topics lined up (with recordings of past videos):

    • February 3, 2021: Numba presented by Jim Pivarski (video)
    • March 3, 2021: JAX presented by Hans Dembinski (video)
    • April 7, 2021: pyhf presented by Matthew Feickert (video)
    • May 5, 2021: Dask presented by Nick Smith (video)
    • June 2, 2021: Jupyter presented by Jim Pivarski

    continuing on the first Wednesday of each month. Let us know if you'd like to present or request one in the future.

    No registration is required; just show up if you're interested!

    Since we'll be walking through some Jupyter configurations, you would get more out of this talk if you come with JupyterLab, RISE, and voila-reveal installed on your computer.

    Jim Pivarski
    for the PyHEP Organizing Committee

    alexander-held
    @alexander-held
    Hi, how does the lightning talk selection for PyHEP 2021 work? I only see the selection options for notebook talk / tutorial. Does an abstract submitted to these longer contribution types also get considered for a lightning talk slot if it is rejected from these other categories?
    Eduardo Rodrigues
    @eduardo-rodrigues
    You were too fast ;-). The doc and submission instructions have been updated at https://indico.cern.ch/event/1019958/abstracts/. Happy to receive feedback if the text isn't clear enough. (As ever we want to be open-minded and experimental where possible.)
    So, in short: lightning talks will most likely be 10 minutes in total and now have their own contribution type that one can specify. This can be used for work that is not alpha yet, for example.
    Hope that helps?
    alexander-held
    @alexander-held
    All clear now, thank you!
    Ianna Osborne
    @ianna

    The doc and submission instructions have been updated at https://indico.cern.ch/event/1019958/abstracts/. Happy to receive feedback if the text isn't clear enough. (As ever we want to be open-minded and experimental where possible.)

    @eduardo-rodrigues - I don’t see the instructions… am I missing something?

    Eduardo Rodrigues
    @eduardo-rodrigues
    Hello @ianna. It is indeed something not trivially visible and that annoys me in Indico. The instructions are only seen at submission level, which we would all agree is too late!
    So go to https://indico.cern.ch/event/1019958/abstracts/ and act as if you were going to submit an abstract by clicking on the blue button "Submit new abstract". That pops up a form where you will eventually put a title, authors list, etc. At the very top you find the message "Please don't forget to read the submission instructions before submitting an abstract." and the bit "submission instructions" is the link that gives you the details.
    I may have found a way to have those instructions upfront, which is definitely the normal thing to have ... give me a minute ...
    Eduardo Rodrigues
    @eduardo-rodrigues
    OK, I finally managed to understand a bit more on how Indico sets abstract calls up! Please @ianna have a look at the text at https://indico.cern.ch/event/1019958/abstracts/ and let me know. Thanks.
    Ianna Osborne
    @ianna
    thanks! I can see it now!