    Hans Dembinski
    Boost.Histogram now has 100% line coverage https://github.com/boostorg/histogram :tada:
    It had 99% for a long time; getting the last 1% was hard, but it revealed a rare bug. o_O
    Doug Davis
    has anyone put any thought into building larger-than-can-fit-in-memory dataframes in dask via uproot? (I'm thinking something like building a lazy dask dataframe with something like uproot.iterate, possibly with dask.delayed). This is completely my imagination right now, curious if anyone else has thought of it
    Nicholas Smith
    oh jim went there, you can see some code for it in uproot presently. dunno if it is mature
    Doug Davis
    docs for the dask functions don't appear to be in the sphinx docs ;)
    Jim Pivarski
    Right, those functions exist, but I'm not sure of their future. You can try them and let me know if you find them useful. So far, I'm not sure that this makes sense as a way to work.
    Doug Davis
    gotcha, thanks! I'll definitely play around with it
    Eduardo Rodrigues
    I just got an email from the Python Software Foundation on fundraising, and thought it cannot harm to post the link here: https://www.python.org/psf/donations/2019-q2-drive/. (For info the PSF funded part of the PyHEP 2018 workshop in Sofia. That's the link ;-).)
    Martin Ritter
    Hi all, I need a short pandas tutorial for a group of undergraduate students: just one afternoon to show them the basics, optimal would be tasks for them to achieve and some relationship to HEP. Does anyone have something already?
    Henry Schreiner
    @daritter Here’s some material I taught for a class: henryiii/compclass classes/week7, you can see a rendered example here because GitHub’s rendering hasn’t been working lately
    Martin Ritter
    Thank you @henryiii , I'll have a look.
    Jim Pivarski
    Here is one I've given (intended for RISE): https://github.com/jpivarski/python-numpy-mini-course/blob/master/4-pandas.ipynb The examples continue from 2-just-numpy.ipynb, with the point of illustrating that it's easier to do things in Pandas.
    Matt Bellis
    @jpivarski and @daritter I'm totally using these for an intro to data exploration class this Fall! Thanks!
    Martin Ritter
    @jpivarski Thank you, that is also very helpful.
    Henry Schreiner
    SWIG 4.0 was released a few days ago, with some very nice sounding improvements. The minimum version of Python was changed from 1.5 to 2.7, it can turn Doxygen into PyDoc, optimizations were turned on, and it supports C++11 containers. I’m still a much bigger fan of PyBind11, but it’s nice to see the other options improving!
    Jim Pivarski
    I was working on some materials for talks and tried to determine programming languages used by physicists via the GitHub API. In short, I wanted a realistic measurement of this prediction from 2015:
    So I asked GitHub for all the users who have forked the CMSSW repository (about 3000). I'm assuming that these users are all physicists (in fact, they're probably all CMS members). Then for each user, I asked for all the repositories that user owns (non-forked repositories): they each have a creation date and a primary language. So then I plotted the languages of these repos versus creation date.
    I combined C and C++ into a single category (I don't think GitHub can clearly distinguish them), and we see that C/C++ and Python are reaching a crossing point right now. Also, we see that a lot of physicists have TeX repositories (makes sense; I have a lot of those, too), and Jupyter Notebooks are on the rise as well.
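The tallying step described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the plot: it assumes the repositories have already been fetched as a list of dicts carrying the `fork`, `language`, and `created_at` fields that the GitHub REST API returns for a repository, and the function name `languages_by_year` is hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: C and C++ are merged into one category, as described above.
MERGED = {"C": "C/C++", "C++": "C/C++"}


def languages_by_year(repos):
    """Count primary languages of non-forked repos per creation year."""
    counts = defaultdict(Counter)
    for repo in repos:
        if repo.get("fork"):
            # Skip forks; keep only repositories the user created.
            continue
        language = repo.get("language")
        if language is None:
            # Some repositories have no primary language reported.
            continue
        # created_at is an ISO 8601 string such as "2015-03-14T09:26:53Z".
        year = int(repo["created_at"][:4])
        counts[year][MERGED.get(language, language)] += 1
    return {year: dict(c) for year, c in sorted(counts.items())}
```

Plotting each language's count against year would then give the kind of trend lines discussed here.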
    Jim Pivarski
    To generalize this beyond CMSSW, does anyone have a suggestion of other repositories to use as a seed, other than cms-sw/cmssw? Do the other collaborations encourage all their members to fork the collaboration software?
    Jonas Eschle
    actually, if you would extract how many Jupyter notebooks are run with Python, we may have even crossed this already
    But very nice, quite interesting
    Jim Pivarski
    GitHub doesn't give that information. It just says "Jupyter Notebook", which they must do some weighting for because in the "languages by number of lines" breakdown, Jupyter wins because its JSON is verbose.
    https://github.com/jpivarski/jupyter-performance-studies/blob/master/github-physicists.ipynb has the analysis but not the raw data or the GitHub REST calls; I need to package those up. (It was a work in progress.)
    But I'd like to know about other seed repos to round out the definition of "physicist GitHub user". This sample is pretty close to 100% CMS, and maybe there's a peculiarity in that sample.
    Ben Krikler
    From your last plot there, I'd say the fraction of "python" has been stable, but c++ has been losing to "Jupyter notebooks". I wonder then how much this has been Jupyter with C++ /ROOT kernel, and Jupyter with the python kernel. But yeah this is really nice!
    The only two other collaborations I've been involved with use private gitlab instances for their main repos, so I don't think that would help! You could seed with all repositories that have keywords like Dark Matter / LHC / particle physics / high-energy physics in the title or their descriptions and at least N (=10?) forks.
    Nicholas Smith
    does gitlab @ cern publish stats like this?
    Jim Pivarski
    @benkrikler That's a good point; Python has been growing slowly and Jupyter has been growing quickly. Determining if those Jupyter Notebooks contain C++ code or Python code would require cloning the repos or otherwise getting the content of the files. Maybe. Let me think about that. (Maybe I can do a GitHub search by file language and contains "import" vs "include"?)
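One way such a search might be phrased is sketched below. It only builds the request URL; it does not send it. The assumptions here are that the GitHub code search endpoint (`/search/code`) is used with the `repo:` and `language:` qualifiers, and that the number of hits would be read from the `total_count` field of the JSON response. The helper name `code_search_url` is hypothetical, and real requests need authentication and are rate-limited.

```python
from urllib.parse import urlencode

SEARCH_CODE_URL = "https://api.github.com/search/code"


def code_search_url(repo_full_name, word):
    """Build a code-search URL for `word` in the Jupyter Notebook files of one repo."""
    # Restrict the search to a single repository and to files that GitHub
    # classifies as "Jupyter Notebook".
    query = f'{word} repo:{repo_full_name} language:"Jupyter Notebook"'
    return SEARCH_CODE_URL + "?" + urlencode({"q": query})
```

Calling it once with "import" and once with "include" for the same repository gives the two requests whose hit counts can be compared.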
    Raymond Ehlers
    @jpivarski We also fork our collaboration software at ALICE: https://github.com/alisw/AliPhysics/ (You could also look at alisw/AliRoot, but I think you'll get a better snapshot of analyzers with AliPhysics)
    Jim Pivarski
    @raymondEhlers Thanks! I'll include that.

    Meanwhile, I think I've answered the question about Jupyter Notebooks: they seem to be exclusively Python. I can do searches through the API, though they're more rate-limited and I have to wait longer, so I have the gathering script print out results as it goes.

    For each repo that GitHub labels as "Jupyter Notebook", I do two searches: one for the word "include" and the other for the word "import". If imports outnumber includes, I label it as Python. I've manually followed up a few cases with non-zero "includes"; they've all been in markdown cells.

    FredStober/sandbox                                 0 vs 10        Python
    michelif/bayesian_opt_skopt                        0 vs 1        Python
    michelif/HHbbgg_ETH                                1 vs 18        Python
    michelif/quickMLTests                              0 vs 0        ???
    terrenceedmonds/titanic                            0 vs 2        Python
    hbakhshi/Analysis13TeV                             0 vs 0        ???
    clint-richardson/BU-TheBus                         0 vs 1        Python
    clint-richardson/NBA-Data                          0 vs 0        ???
    clint-richardson/X53AnalysisDemo                   0 vs 2        Python
    zhangzc11/Pi0Net                                   0 vs 1        Python
    joseph-taylor/LjLabBook                            0 vs 5        Python
    mukundvarma/kaggle-instacart                       1 vs 2        Python
    A-lxe/study-csc-daq-rate                           0 vs 1        Python
    vlimant/summer15-ArashJofrehei                     0 vs 61        Python
    vlimant/summer15-Irene                             0 vs 12        Python
    vlimant/summer15-MarinaKolosova                    1 vs 45        Python
    vlimant/summer15-SahandSeif                        0 vs 25        Python
    vlimant/summer16-NikolausHowe                      0 vs 24        Python
    vlimant/surf17-tutorial                            0 vs 4        Python
    vlimant/surf18-tutorial                            0 vs 7        Python
    davidlange6/gsocStudentSolutions                   0 vs 3        Python
    davidlange6/toy_notebooks                          2 vs 14        Python
    jbueghly/hzg_analysis                              0 vs 5        Python
    kaylanb/skinapp                                    0 vs 0        ???
    kaylanb/thesis_code                                0 vs 2        Python
    lihux25/Projects                                   1 vs 3        Python
    ArnabPurohit/Machine-Learning-applications-in-HEP  0 vs 1        Python
    nhaubrich/biophysics                               0 vs 8        Python
    emc5ud/rosalind-solutions                          1 vs 6        Python
    nmehrle/echelle                                    0 vs 1        Python
    zaixingmao/retina                                  1 vs 2        Python
    fmanteca/ImageClassification                       0 vs 1        Python
    alkaplan/jupyter-notebooks                         2 vs 6        Python
    patrykel/multi-tracking-notebooks                  0 vs 6        Python
    patrykel/MultitrackingMasterProject                0 vs 2        Python
    cfangmeier/HHC                                     0 vs 2        Python
    cfangmeier/Small                                   1 vs 4        Python
    cfangmeier/TTTT                                    0 vs 1        Python
    cfangmeier/UNL-Gantry-Encapsulation-Monitoring     0 vs 2        Python
    jiafulow/L1TMuonDocsNov2018                        1 vs 1        ???
    jiafulow/L1TMuonSimulationsMar2017                 3 vs 18        Python
    jiafulow/UF-slurm                                  0 vs 0        ???
    monttj/computational-physics                       1 vs 12        Python
    NJManganelli/TaggerTest                            0 vs 1        Python
    sciencecw/cmsjupyter                               1 vs 14        Python
    lecriste/first-binder                              0 vs 1        Python
    mzanetti79/LaboratoryOfComputationalPhysics        0 vs 11        Python
    mzanetti79/ML-INFN                                 2 vs 34        Python
    mzanetti79/MLCC18                                  0 vs 23        Python
    bpenning/jupyter_repo                              0 vs 22        Python
    bencammett/ML_project_Comp2                        6 vs 12        Python
    hbprosper/ENHEP                                    1 vs 8        Python
    hbprosper/eshep_tutorials                          0 vs 13        Python
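The labeling rule behind the table above can be written as a small function. This is a sketch, not the gathering script itself: the "Python" rule and the "???" label for ties (including 0 vs 0) follow the description and the table, while the "C/C++" label for repos where includes outnumber imports is an assumption, since no such case appears in the table. The name `label_notebook_repo` is hypothetical.

```python
def label_notebook_repo(includes, imports):
    """Label a "Jupyter Notebook" repo from its "include" and "import" hit counts."""
    if imports > includes:
        return "Python"
    if includes > imports:
        # Assumption: the discussion does not state a label for this case.
        return "C/C++"
    # Ties, including 0 vs 0, stay undetermined.
    return "???"


def format_row(repo, includes, imports):
    """Format one result line in the "includes vs imports" layout used above."""
    label = label_notebook_repo(includes, imports)
    return f"{repo:<50} {includes} vs {imports}        {label}"
```

Hits inside markdown cells still count toward the totals, so a non-zero "include" count is a prompt for a manual check rather than proof of C++ code.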
    Jonas Eschle
    So we crossed the point already and physics analysis is dominated by python? :)
    Jim Pivarski
    Since it looks like we can add the Jupyter count to the Python column, yes. That just happened.

    Actually, in my slow-moving scan of Jupyter notebooks, I've finally come across two legitimate C++ Jupyter repos: https://github.com/gudrutis/jupyter-book-tutorials/search?utf8=%E2%9C%93&q=include&type= and https://github.com/javadebadi/learning_cpp_again/search?q=include&unscoped_q=include

    These are the first 2 out of 91.

    Henry Schreiner
    This won’t work in C++20. :)
    Jim Pivarski
    Nope. That other analysis that identifies physicists by having "Scientific" in their Linux distribution name will fail in the near future, too.
    @raymondEhlers I have results from ALICE, and it's quite different. ALICE is considerably more C/C++ than Python.

    It would be very interesting to find out what the other collaborations are doing. I've looked into the GitLab API—it functions on gitlab.cern.ch, but I wasn't able to repeat any of these queries without figuring out its (different) authentication mechanism. And even then, I might have to be a member of a collaboration to see its users. If there's a culture of "in-development analysis is private, even from other members of the collaboration," then there might not be anything any one user can do to get a global picture.

    Does anyone have any other suggestions? (GitHub preferred; I already have the scripts.)

    Luke Kreczko

    @jpivarski Collaborations like LZ use mostly C++ (private GitLab), Xenon1T mostly Python (on GitHub).

    You can always try to get representatives from the collaboration to run a script to give you the breakdown if you want the exact numbers

    Doug Davis
    I would guess ATLAS is considerably more C++ than Python, but the balance is shifting.
    Hard to measure with ATLAS heavily using gitlab.cern.ch and most members defaulting to private repos
    Raymond Ehlers
    @jpivarski Thanks for sharing! That's about what I would have guessed. I've tried to encourage python, but with only some success :-)