    Martin Ritter
    Hi all, I need a short pandas tutorial for a group of undergraduate students: just one afternoon to show them the basics, optimal would be tasks for them to achieve and some relationship to HEP. Does anyone have something already?
    Henry Schreiner
    @daritter Here’s some material I taught for a class: henryiii/compclass classes/week7, you can see a rendered example here because GitHub’s rendering hasn’t been working lately
    Martin Ritter
    Thank you @henryiii , I'll have a look.
    Jim Pivarski
    Here is one I've given (intended for RISE): https://github.com/jpivarski/python-numpy-mini-course/blob/master/4-pandas.ipynb The examples continue from 2-just-numpy.ipynb, with the point of illustrating that it's easier to do things in Pandas.
    Matt Bellis
    @jpivarski and @daritter I'm totally using these for an intro to data exploration class this Fall! Thanks!
    Martin Ritter
    @jpivarski Thank you, that is also very helpful.
    Henry Schreiner
    SWIG 4.0 was released a few days ago, with some very nice-sounding improvements. The minimum version of Python was changed from 1.5 to 2.7, it can turn Doxygen into PyDoc, optimizations were turned on, and it supports C++11 containers. I’m still a much bigger fan of PyBind11, but it’s nice to see the other options improving!
    Jim Pivarski
    I was working on some materials for talks and tried to determine programming languages used by physicists via the GitHub API. In short, I wanted a realistic measurement of this prediction from 2015:
    So I asked GitHub for all the users who have forked the CMSSW repository (about 3000). I'm assuming that these users are all physicists (in fact, they're probably all CMS members). Then for each user, I asked for all the repositories that user owns (non-forked repositories): they each have a creation date and a primary language. So then I plotted the languages of these repos versus creation date.
    I combined C and C++ into a single category (I don't think GitHub can clearly distinguish them), and we see that C/C++ and Python are reaching a crossing point right now. Also, we see that a lot of physicists have TeX repositories (makes sense; I have a lot of those, too), and Jupyter Notebooks are on the rise as well.
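A minimal sketch of the gathering and tallying steps described above. It assumes the public GitHub REST endpoints `/repos/{owner}/{repo}/forks` and `/users/{user}/repos`. The function names (`paged`, `fork_owners`, `own_repos`, `languages_by_year`) are hypothetical, and this is not the script that produced the plot. Unauthenticated requests are heavily rate-limited, so a real run would likely need a token.

```python
import json
import urllib.request
from collections import Counter, defaultdict

API = "https://api.github.com"


def get_json(url):
    """Fetch and decode one JSON response from the GitHub REST API."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def paged(url, fetch=get_json, per_page=100):
    """Yield every item of a paginated list endpoint until an empty page comes back."""
    page = 1
    while True:
        items = fetch(f"{url}?per_page={per_page}&page={page}")
        if not items:
            return
        yield from items
        page += 1


def fork_owners(seed="cms-sw/cmssw", fetch=get_json):
    """Logins of all users who forked the seed repository (deduplicated, sorted)."""
    return sorted({fork["owner"]["login"] for fork in paged(f"{API}/repos/{seed}/forks", fetch)})


def own_repos(user, fetch=get_json):
    """Repositories the user owns that are not themselves forks."""
    return [repo for repo in paged(f"{API}/users/{user}/repos", fetch) if not repo["fork"]]


def merge_language(language):
    """Combine C and C++ into one category, as in the message above."""
    return "C/C++" if language in ("C", "C++") else language


def languages_by_year(repos):
    """Count repositories per creation year and primary language."""
    counts = defaultdict(Counter)
    for repo in repos:
        if repo["language"] is None:  # GitHub reports no primary language
            continue
        year = int(repo["created_at"][:4])  # "2015-03-01T00:00:00Z" -> 2015
        counts[year][merge_language(repo["language"])] += 1
    return counts
```

Passing `fetch` as a parameter keeps the network access swappable, so the tallying can be checked against canned responses before spending any of the API rate limit.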
    Jim Pivarski
    To generalize this beyond CMSSW, does anyone have a suggestion of other repositories to use as a seed, other than cms-sw/cmssw? Do the other collaborations encourage all their members to fork the collaboration software?
    Jonas Eschle
    Actually, if you extracted how many Jupyter notebooks are run with Python, we may have even crossed this already.
    But very nice, quite interesting
    Jim Pivarski
    GitHub doesn't give that information. It just says "Jupyter Notebook", which they must do some weighting for because in the "languages by number of lines" breakdown, Jupyter wins because its JSON is verbose.
    https://github.com/jpivarski/jupyter-performance-studies/blob/master/github-physicists.ipynb has the analysis but not the raw data or the GitHub REST calls; I need to package those up. (It was a work in progress.)
    But I'd like to know about other seed repos to round out the definition of "physicist GitHub user". This sample is pretty close to 100% CMS, and maybe there's a peculiarity in that sample.
    From your last plot there, I'd say the fraction of "Python" has been stable, but C++ has been losing to "Jupyter Notebook". I wonder then how much of this has been Jupyter with a C++/ROOT kernel versus Jupyter with the Python kernel. But yeah, this is really nice!
    The only two other collaborations I've been involved with use private gitlab instances for their main repos, so I don't think that would help! You could seed with all repositories that have keywords like Dark Matter / LHC / particle physics / high-energy physics in the title or their descriptions and at least N (=10?) forks.
    Nicholas Smith
    does gitlab @ cern publish stats like this?
    Jim Pivarski
    @benkrikler That's a good point; Python has been growing slowly and Jupyter has been growing quickly. Determining if those Jupyter Notebooks contain C++ code or Python code would require cloning the repos or otherwise getting the content of the files. Maybe. Let me think about that. (Maybe I can do a GitHub search by file language and contains "import" vs "include"?)
    Raymond Ehlers
    @jpivarski We also fork our collaboration software at ALICE: https://github.com/alisw/AliPhysics/ (You could also look at alisw/AliRoot, but I think you'll get a better snapshot of analyzers with AliPhysics)
    Jim Pivarski
    @raymondEhlers Thanks! I'll include that.

    Meanwhile, I think I've answered the question about Jupyter Notebooks: they seem to be exclusively Python. I can do searches through the API, though they're more rate-limited and I have to wait longer, so I have the gathering script print out results as it goes.

    For each repo that GitHub labels as "Jupyter Notebook", I do two searches: one for the word "include" and the other for the word "import". If imports outnumber includes, I label it as Python. I've manually followed up a few cases with non-zero "includes"; they've all been in markdown cells.

    FredStober/sandbox                                 0 vs 10        Python
    michelif/bayesian_opt_skopt                        0 vs 1        Python
    michelif/HHbbgg_ETH                                1 vs 18        Python
    michelif/quickMLTests                              0 vs 0        ???
    terrenceedmonds/titanic                            0 vs 2        Python
    hbakhshi/Analysis13TeV                             0 vs 0        ???
    clint-richardson/BU-TheBus                         0 vs 1        Python
    clint-richardson/NBA-Data                          0 vs 0        ???
    clint-richardson/X53AnalysisDemo                   0 vs 2        Python
    zhangzc11/Pi0Net                                   0 vs 1        Python
    joseph-taylor/LjLabBook                            0 vs 5        Python
    mukundvarma/kaggle-instacart                       1 vs 2        Python
    A-lxe/study-csc-daq-rate                           0 vs 1        Python
    vlimant/summer15-ArashJofrehei                     0 vs 61        Python
    vlimant/summer15-Irene                             0 vs 12        Python
    vlimant/summer15-MarinaKolosova                    1 vs 45        Python
    vlimant/summer15-SahandSeif                        0 vs 25        Python
    vlimant/summer16-NikolausHowe                      0 vs 24        Python
    vlimant/surf17-tutorial                            0 vs 4        Python
    vlimant/surf18-tutorial                            0 vs 7        Python
    davidlange6/gsocStudentSolutions                   0 vs 3        Python
    davidlange6/toy_notebooks                          2 vs 14        Python
    jbueghly/hzg_analysis                              0 vs 5        Python
    kaylanb/skinapp                                    0 vs 0        ???
    kaylanb/thesis_code                                0 vs 2        Python
    lihux25/Projects                                   1 vs 3        Python
    ArnabPurohit/Machine-Learning-applications-in-HEP  0 vs 1        Python
    nhaubrich/biophysics                               0 vs 8        Python
    emc5ud/rosalind-solutions                          1 vs 6        Python
    nmehrle/echelle                                    0 vs 1        Python
    zaixingmao/retina                                  1 vs 2        Python
    fmanteca/ImageClassification                       0 vs 1        Python
    alkaplan/jupyter-notebooks                         2 vs 6        Python
    patrykel/multi-tracking-notebooks                  0 vs 6        Python
    patrykel/MultitrackingMasterProject                0 vs 2        Python
    cfangmeier/HHC                                     0 vs 2        Python
    cfangmeier/Small                                   1 vs 4        Python
    cfangmeier/TTTT                                    0 vs 1        Python
    cfangmeier/UNL-Gantry-Encapsulation-Monitoring     0 vs 2        Python
    jiafulow/L1TMuonDocsNov2018                        1 vs 1        ???
    jiafulow/L1TMuonSimulationsMar2017                 3 vs 18        Python
    jiafulow/UF-slurm                                  0 vs 0        ???
    monttj/computational-physics                       1 vs 12        Python
    NJManganelli/TaggerTest                            0 vs 1        Python
    sciencecw/cmsjupyter                               1 vs 14        Python
    lecriste/first-binder                              0 vs 1        Python
    mzanetti79/LaboratoryOfComputationalPhysics        0 vs 11        Python
    mzanetti79/ML-INFN                                 2 vs 34        Python
    mzanetti79/MLCC18                                  0 vs 23        Python
    bpenning/jupyter_repo                              0 vs 22        Python
    bencammett/ML_project_Comp2                        6 vs 12        Python
    hbprosper/ENHEP                                    1 vs 8        Python
    hbprosper/eshep_tutorials                          0 vs 13        Python
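The labelling rule behind the listing above can be sketched as follows. The source states only that a repo is labelled Python when imports outnumber includes. Every other case is returned as "???" here, which matches the listing but is an assumption for the case where includes outnumber imports (no such row appears). The helper names are hypothetical, and the query format assumes the `repo:` qualifier of GitHub's code search endpoint (`/search/code`), whose `total_count` field would supply the two numbers.

```python
def search_query(word, repo):
    """Build a code-search query for one word within one repository (assumed format)."""
    return f"{word} repo:{repo}"


def label_notebook_repo(includes, imports):
    """Label a "Jupyter Notebook" repo from its two search-hit counts.

    includes: number of hits for the word "include"
    imports:  number of hits for the word "import"
    """
    if imports > includes:
        return "Python"
    # Ties (including 0 vs 0) are undecided. A repo with more includes than
    # imports is also left undecided here, since the source gives no rule for it.
    return "???"


def summarize(rows):
    """Count labels over (repo, includes, imports) rows."""
    summary = {}
    for _repo, includes, imports in rows:
        label = label_notebook_repo(includes, imports)
        summary[label] = summary.get(label, 0) + 1
    return summary
```

As noted in the message above, a non-zero "include" count does not necessarily mean C++: the cases followed up by hand were all in markdown cells.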
    Jonas Eschle
    So we crossed the point already and physics analysis is dominated by python? :)
    Jim Pivarski
    Since it looks like we can add the Jupyter count to the Python column, yes. That just happened.

    Actually, in my slow-moving scan of Jupyter notebooks, I've finally come across two legitimate C++ Jupyter repos: https://github.com/gudrutis/jupyter-book-tutorials/search?utf8=%E2%9C%93&q=include&type= and https://github.com/javadebadi/learning_cpp_again/search?q=include&unscoped_q=include

    These are the first 2 out of 91.

    Henry Schreiner
    This won’t work in C++20. :)
    Jim Pivarski
    Nope. That other analysis that identifies physicists by having "Scientific" in their Linux distribution name will fail in the near future, too.
    @raymondEhlers I have results from Alice, and it's quite different. Alice is considerably more C/C++ than Python.

    It would be very interesting to find out what the other collaborations are doing. I've looked into the GitLab API—it functions on gitlab.cern.ch, but I wasn't able to repeat any of these queries without figuring out its (different) authentication mechanism. And even then, I might have to be a member of a collaboration to see its users. If there's a culture of "in-development analysis is private, even from other members of the collaboration," then there might not be anything any one user can do to get a global picture.

    Does anyone have any other suggestions? (GitHub preferred; I already have the scripts.)

    Luke Kreczko

    @jpivarski Collaborations like LZ use mostly C++ (private GitLab), Xenon1T mostly Python (on GitHub).

    You can always try to get representatives from the collaboration to run a script to give you the breakdown if you want the exact numbers

    Doug Davis
    I would guess ATLAS is considerably more C++ than Python, but the balance is shifting.
    Hard to measure with ATLAS heavily using gitlab.cern.ch and most members defaulting to private repos
    Raymond Ehlers
    @jpivarski Thanks for sharing! That's about what I would have guessed. I've tried to encourage python, but with only some success :-)
    Luke Kreczko
    @raymondEhlers "Python is slow", huh? ;)
    Martin Ritter
    @jpivarski I would not dare to judge what the distribution is in Belle2. I can say we only teach Python (pandas, mpl) to beginners, but there's a large fraction of people coming over from Belle, and they have a very high inertia and prefer to use "ROOT macros". However, I can tell you that any Belle2 members who put their analysis on GitHub/GitLab would definitely be the ones using Python for analysis, so I'd expect a heavy bias there.
    Hans Dembinski
    @jpivarski Thank you for this awesome analysis. According to the voluntary 2018 survey that I conducted within the LHCb collaboration, half of the LHCb members use mainly Python. It is similar to your CMS results.
    Hard data such as yours (even with caveats: perhaps Python users prefer GitHub??) is even more convincing than personal statements.
    Jim Pivarski
    @HDembinski Could you point me to that survey?
    I am trying to dig up the URL of the actual poll now...
    Some of the free-form text answers are quite interesting :)