    Jim Pivarski
    I was working on some materials for talks and tried to determine programming languages used by physicists via the GitHub API. In short, I wanted a realistic measurement of this prediction from 2015:
    So I asked GitHub for all the users who have forked the CMSSW repository (about 3000). I'm assuming that these users are all physicists (in fact, they're probably all CMS members). Then for each user, I asked for all the repositories that user owns (non-forked repositories): they each have a creation date and a primary language. So then I plotted the languages of these repos versus creation date.
    I combined C and C++ into a single category (I don't think GitHub can clearly distinguish them), and we see that C/C++ and Python are reaching a crossing point right now. Also, we see that a lot of physicists have TeX repositories (makes sense; I have a lot of those, too), and Jupyter Notebooks are on the rise as well.
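    A minimal sketch of the tallying step described above, not the actual notebook code. It assumes each repository record is a dict with the `fork`, `language`, and `created_at` fields that the GitHub REST API returns for a repository; `tally_languages` and the sample records are hypothetical names and made-up data for illustration.

```python
from collections import Counter

# Merge C and C++ into one category, as described in the message above.
MERGED = {"C": "C/C++", "C++": "C/C++"}


def tally_languages(repos):
    """Count non-forked repos per (creation year, primary language)."""
    counts = Counter()
    for repo in repos:
        if repo.get("fork"):
            # Skip forks: only repos the user created themselves are counted.
            continue
        language = repo.get("language")
        if language is None:
            # Some repos have no detected primary language.
            continue
        # Timestamps are ISO 8601 strings such as "2015-03-01T12:00:00Z".
        year = int(repo["created_at"][:4])
        counts[year, MERGED.get(language, language)] += 1
    return counts


# Made-up records for illustration only (not real data).
sample = [
    {"fork": False, "language": "C++", "created_at": "2014-05-02T10:00:00Z"},
    {"fork": False, "language": "C", "created_at": "2014-07-19T08:30:00Z"},
    {"fork": True, "language": "Python", "created_at": "2016-01-11T09:00:00Z"},
    {"fork": False, "language": "Python", "created_at": "2018-03-04T16:45:00Z"},
    {"fork": False, "language": None, "created_at": "2018-06-01T00:00:00Z"},
]

for (year, language), n in sorted(tally_languages(sample).items()):
    print(year, language, n)
```

    The resulting counts keyed by (year, language) are what would be plotted against creation date.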
    Jim Pivarski
    To generalize this beyond CMSSW, does anyone have a suggestion of other repositories to use as a seed, other than cms-sw/cmssw? Do the other collaborations encourage all their members to fork the collaboration software?
    Jonas Eschle
    Actually, if you could extract how many Jupyter notebooks are run with Python, we may have even crossed this already
    But very nice, quite interesting
    Jim Pivarski
    GitHub doesn't give that information. It just says "Jupyter Notebook", which they must do some weighting for because in the "languages by number of lines" breakdown, Jupyter wins because its JSON is verbose.
    https://github.com/jpivarski/jupyter-performance-studies/blob/master/github-physicists.ipynb has the analysis but not the raw data or the GitHub REST calls; I need to package those up. (It was a work in progress.)
    But I'd like to know about other seed repos to round out the definition of "physicist GitHub user". This sample is pretty close to 100% CMS, and maybe there's a peculiarity in that sample.
    From your last plot there, I'd say the fraction of "Python" has been stable, but C++ has been losing to "Jupyter notebooks". I wonder how much of this has been Jupyter with a C++/ROOT kernel versus Jupyter with the Python kernel. But yeah, this is really nice!
    The only two other collaborations I've been involved with use private gitlab instances for their main repos, so I don't think that would help! You could seed with all repositories that have keywords like Dark Matter / LHC / particle physics / high-energy physics in the title or their descriptions and at least N (=10?) forks.
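    A rough sketch of how such a keyword seed could be expressed as a GitHub repository search, assuming the standard search qualifiers `in:name,description` and `forks:>=N`. The helper name `seed_search_url` is hypothetical, and the code only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode


def seed_search_url(phrase, min_forks=10):
    """Build a repository-search URL for `phrase` with at least `min_forks` forks."""
    # Quote the phrase so multi-word keywords are matched together.
    query = f'"{phrase}" in:name,description forks:>={min_forks}'
    return "https://api.github.com/search/repositories?" + urlencode({"q": query})


# Keywords taken from the suggestion above; N=10 is the tentative threshold.
for phrase in ["dark matter", "LHC", "particle physics", "high-energy physics"]:
    print(seed_search_url(phrase))
```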
    Nicholas Smith
    does gitlab @ cern publish stats like this?
    Jim Pivarski
    @benkrikler That's a good point; Python has been growing slowly and Jupyter has been growing quickly. Determining if those Jupyter Notebooks contain C++ code or Python code would require cloning the repos or otherwise getting the content of the files. Maybe. Let me think about that. (Maybe I can do a GitHub search by file language and contains "import" vs "include"?)
    Raymond Ehlers
    @jpivarski We also fork our collaboration software at ALICE: https://github.com/alisw/AliPhysics/ (You could also look at alisw/AliRoot, but I think you'll get a better snapshot of analyzers with AliPhysics)
    Jim Pivarski
    @raymondEhlers Thanks! I'll include that.

    Meanwhile, I think I've answered the question about Jupyter Notebooks: they seem to be exclusively Python. I can do searches through the API, though they're more rate-limited and I have to wait longer, so I have the gathering script print out results as it goes.

    For each repo that GitHub labels as "Jupyter Notebook", I do two searches: one for the word "include" and the other for the word "import". If imports outnumber includes, I label it as Python. I've manually followed up a few cases with non-zero "includes"; they've all been in markdown cells.

    FredStober/sandbox                                 0 vs 10        Python
    michelif/bayesian_opt_skopt                        0 vs 1        Python
    michelif/HHbbgg_ETH                                1 vs 18        Python
    michelif/quickMLTests                              0 vs 0        ???
    terrenceedmonds/titanic                            0 vs 2        Python
    hbakhshi/Analysis13TeV                             0 vs 0        ???
    clint-richardson/BU-TheBus                         0 vs 1        Python
    clint-richardson/NBA-Data                          0 vs 0        ???
    clint-richardson/X53AnalysisDemo                   0 vs 2        Python
    zhangzc11/Pi0Net                                   0 vs 1        Python
    joseph-taylor/LjLabBook                            0 vs 5        Python
    mukundvarma/kaggle-instacart                       1 vs 2        Python
    A-lxe/study-csc-daq-rate                           0 vs 1        Python
    vlimant/summer15-ArashJofrehei                     0 vs 61        Python
    vlimant/summer15-Irene                             0 vs 12        Python
    vlimant/summer15-MarinaKolosova                    1 vs 45        Python
    vlimant/summer15-SahandSeif                        0 vs 25        Python
    vlimant/summer16-NikolausHowe                      0 vs 24        Python
    vlimant/surf17-tutorial                            0 vs 4        Python
    vlimant/surf18-tutorial                            0 vs 7        Python
    davidlange6/gsocStudentSolutions                   0 vs 3        Python
    davidlange6/toy_notebooks                          2 vs 14        Python
    jbueghly/hzg_analysis                              0 vs 5        Python
    kaylanb/skinapp                                    0 vs 0        ???
    kaylanb/thesis_code                                0 vs 2        Python
    lihux25/Projects                                   1 vs 3        Python
    ArnabPurohit/Machine-Learning-applications-in-HEP  0 vs 1        Python
    nhaubrich/biophysics                               0 vs 8        Python
    emc5ud/rosalind-solutions                          1 vs 6        Python
    nmehrle/echelle                                    0 vs 1        Python
    zaixingmao/retina                                  1 vs 2        Python
    fmanteca/ImageClassification                       0 vs 1        Python
    alkaplan/jupyter-notebooks                         2 vs 6        Python
    patrykel/multi-tracking-notebooks                  0 vs 6        Python
    patrykel/MultitrackingMasterProject                0 vs 2        Python
    cfangmeier/HHC                                     0 vs 2        Python
    cfangmeier/Small                                   1 vs 4        Python
    cfangmeier/TTTT                                    0 vs 1        Python
    cfangmeier/UNL-Gantry-Encapsulation-Monitoring     0 vs 2        Python
    jiafulow/L1TMuonDocsNov2018                        1 vs 1        ???
    jiafulow/L1TMuonSimulationsMar2017                 3 vs 18        Python
    jiafulow/UF-slurm                                  0 vs 0        ???
    monttj/computational-physics                       1 vs 12        Python
    NJManganelli/TaggerTest                            0 vs 1        Python
    sciencecw/cmsjupyter                               1 vs 14        Python
    lecriste/first-binder                              0 vs 1        Python
    mzanetti79/LaboratoryOfComputationalPhysics        0 vs 11        Python
    mzanetti79/ML-INFN                                 2 vs 34        Python
    mzanetti79/MLCC18                                  0 vs 23        Python
    bpenning/jupyter_repo                              0 vs 22        Python
    bencammett/ML_project_Comp2                        6 vs 12        Python
    hbprosper/ENHEP                                    1 vs 8        Python
    hbprosper/eshep_tutorials                          0 vs 13        Python
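    The labeling rule behind the table above could be sketched roughly as follows. This is not the actual gathering script: `code_search_url` and `label_repo` are hypothetical names, the URL assumes GitHub's code-search endpoint with a `repo:` qualifier (whose JSON response reports the number of hits as `total_count`, and which needs an authenticated request), and the "C++" label for includes outnumbering imports is an assumption, since no such row appears in the table.

```python
from urllib.parse import urlencode


def code_search_url(repo, word):
    """Build a code-search URL for `word` restricted to one repository."""
    return "https://api.github.com/search/code?" + urlencode({"q": f"{word} repo:{repo}"})


def label_repo(includes, imports):
    """Label a notebook repo from its "include" and "import" hit counts."""
    if imports > includes:
        return "Python"
    if includes > imports:
        return "C++"  # assumption: this case does not occur in the table
    return "???"      # ties, including 0 vs 0, stay undetermined


# Counts copied from a few rows of the table above.
rows = [
    ("FredStober/sandbox", 0, 10),
    ("michelif/quickMLTests", 0, 0),
    ("jiafulow/L1TMuonDocsNov2018", 1, 1),
    ("bencammett/ML_project_Comp2", 6, 12),
]
for repo, includes, imports in rows:
    print(f"{repo:<50} {includes} vs {imports:<8} {label_repo(includes, imports)}")
```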
    Jonas Eschle
    So we crossed the point already and physics analysis is dominated by python? :)
    Jim Pivarski
    Since it looks like we can add the Jupyter count to the Python column, yes. That just happened.

    Actually, in my slow-moving scan of Jupyter notebooks, I've finally come across two legitimate C++ Jupyter repos: https://github.com/gudrutis/jupyter-book-tutorials/search?utf8=%E2%9C%93&q=include&type= and https://github.com/javadebadi/learning_cpp_again/search?q=include&unscoped_q=include

    These are the first 2 out of 91.

    Henry Schreiner
    This won’t work in C++20. :)
    Jim Pivarski
    Nope. That other analysis that identifies physicists by having "Scientific" in their Linux distribution name will fail in the near future, too.
    @raymondEhlers I have results from ALICE, and it's quite different. ALICE is considerably more C/C++ than Python.

    It would be very interesting to find out what the other collaborations are doing. I've looked into the GitLab API—it functions on gitlab.cern.ch, but I wasn't able to repeat any of these queries without figuring out its (different) authentication mechanism. And even then, I might have to be a member of a collaboration to see its users. If there's a culture of "in-development analysis is private, even from other members of the collaboration," then there might not be anything any one user can do to get a global picture.

    Does anyone have any other suggestions? (GitHub preferred; I already have the scripts.)

    Luke Kreczko

    @jpivarski Collaborations like LZ use mostly C++ (private GitLab), Xenon1T mostly Python (on GitHub).

    You can always try to get representatives from the collaboration to run a script to give you the breakdown if you want the exact numbers

    Doug Davis
    I would guess ATLAS is considerably more C++ than Python, but the balance is shifting.
    Hard to measure with ATLAS heavily using gitlab.cern.ch and most members defaulting to private repos
    Raymond Ehlers
    @jpivarski Thanks for sharing! That's about what I would have guessed. I've tried to encourage python, but with only some success :-)
    Luke Kreczko
    @raymondEhlers "Python is slow", huh? ;)
    Martin Ritter
    @jpivarski I would not dare to judge what the distribution is in Belle2. I can say we only teach Python (pandas, mpl) to beginners, but there's a large fraction of people coming over from Belle who have very high inertia and prefer to use "ROOT macros". However, I can tell you that any Belle2 members who put their analysis on GitHub/GitLab would definitely be the ones using Python for analysis, so I'd expect a heavy bias there.
    Hans Dembinski
    @jpivarski Thank you for this awesome analysis. According to the voluntary 2018 survey that I conducted within the LHCb collaboration, half of the LHCb members use mainly Python. That is similar to your CMS results.
    Hard data (even with caveats, perhaps Python users prefer Github??) such as yours is even more convincing than personal statements
    Jim Pivarski
    @HDembinski Could you point me to that survey?
    I am trying to dig up the URL of the actual poll now...
    Some of the free-form text answers are quite interesting :)
    Matthew Feickert

    @HDembinski While trying to figure out the issues that pyhf is having with iminuit this weekend, I've run into a problem: installing iminuit fails in an Ubuntu 18.04 Docker image that has Python 3.6.8 installed from source. I have a short Gist that describes what's going on, and if you have any thoughts on what might be going wrong, that would be great:


    Martin Ritter
    @matthewfeickert Sounds like the .so was compiled with gcc instead of g++. Your install_python.sh passes gcc as with-cxx-main; maybe that should be g++?
    Matthew Feickert
    @daritter Thanks for the quick reply. That's a very interesting point. Let me flip that and see how things go (I'll report back morning CST time). Thanks!
    Matthew Feickert
    @daritter Actually, just tried swapping out my CXX_VERSION="$(which gcc)" for CXX_VERSION="$(which g++)" and tried to rebuild the Docker image, but this just results in CPython complaining loudly and then failing during the build. So unless I'm doing something stupid I guess that needs to be gcc.
    Martin Ritter
    Then I have no clear idea. I usually don't specify the with-cxx-main flag. What do your sysconfig.get_config_var('CXX') and LDCXXSHARED say?
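    For reference, a small sketch of how those values could be read from the interpreter in question. `compiler_config` is a hypothetical helper name; `sysconfig.get_config_var` is the standard-library call, and it returns None for variables the build does not define (for example on Windows).

```python
import sysconfig


def compiler_config(names=("CC", "CXX", "LDSHARED", "LDCXXSHARED")):
    """Return the compiler-related build variables recorded for this Python."""
    return {name: sysconfig.get_config_var(name) for name in names}


# Print what this interpreter was configured with.
for name, value in compiler_config().items():
    print(f"{name} = {value}")
```

    Running this inside the Docker image would show which compilers the source-built Python was configured to use for extension modules.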
    Henry Schreiner
    From what I understand here, this will also be picked up from CXX if not given, so it really seems like it should be g++. Never tried passing it explicitly either.
    Henry Schreiner
    Looks like the problem is a bug in Python: https://bugs.python.org/issue23644 - it seems to be trying to build with stdatomic, which is C only.