    Jim Pivarski
    @jpivarski
    @henryiii I'm looking for suggestions about how to translate ROOT histogram metadata into boost-histogram metadata, or maybe this should just wait for hist. Here's scikit-hep/uproot4#46:
    >>> import uproot4, skhep_testdata
    >>> f = uproot4.open(skhep_testdata.data_path("uproot-hepdata-example.root"))
    >>> f["hpx"]
    <TH1F (version 1) at 0x7f2f142f9950>
    >>> f["hpx"].np     # NumPy
    (array([2.000e+00, 2.000e+00, 3.000e+00, 1.000e+00, 1.000e+00, 2.000e+00,
           4.000e+00, 6.000e+00, 1.200e+01, 8.000e+00, 9.000e+00, 1.500e+01,
           1.500e+01, 3.100e+01, 3.500e+01, 4.000e+01, 6.400e+01, 6.400e+01,
           8.100e+01, 1.080e+02, 1.240e+02, 1.560e+02, 1.650e+02, 2.090e+02,
           2.620e+02, 2.970e+02, 3.920e+02, 4.320e+02, 4.660e+02, 5.210e+02,
           6.040e+02, 6.570e+02, 7.880e+02, 9.030e+02, 1.079e+03, 1.135e+03,
           1.160e+03, 1.383e+03, 1.458e+03, 1.612e+03, 1.770e+03, 1.868e+03,
           1.861e+03, 1.946e+03, 2.114e+03, 2.175e+03, 2.207e+03, 2.273e+03,
           2.276e+03, 2.329e+03, 2.325e+03, 2.381e+03, 2.417e+03, 2.364e+03,
           2.284e+03, 2.188e+03, 2.164e+03, 2.130e+03, 1.940e+03, 1.859e+03,
           1.763e+03, 1.700e+03, 1.611e+03, 1.459e+03, 1.390e+03, 1.237e+03,
           1.083e+03, 1.046e+03, 8.880e+02, 7.520e+02, 7.420e+02, 6.730e+02,
           5.550e+02, 5.330e+02, 3.660e+02, 3.780e+02, 2.720e+02, 2.560e+02,
           2.000e+02, 1.740e+02, 1.320e+02, 1.180e+02, 1.000e+02, 8.900e+01,
           8.600e+01, 3.900e+01, 3.700e+01, 2.500e+01, 2.300e+01, 2.000e+01,
           1.600e+01, 1.400e+01, 9.000e+00, 1.300e+01, 8.000e+00, 2.000e+00,
           2.000e+00, 6.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 4.000e+00],
          dtype=float32), array([ -inf, -4.  , -3.92, -3.84, -3.76, -3.68, -3.6 , -3.52, -3.44,
           -3.36, -3.28, -3.2 , -3.12, -3.04, -2.96, -2.88, -2.8 , -2.72,
           -2.64, -2.56, -2.48, -2.4 , -2.32, -2.24, -2.16, -2.08, -2.  ,
           -1.92, -1.84, -1.76, -1.68, -1.6 , -1.52, -1.44, -1.36, -1.28,
           -1.2 , -1.12, -1.04, -0.96, -0.88, -0.8 , -0.72, -0.64, -0.56,
           -0.48, -0.4 , -0.32, -0.24, -0.16, -0.08,  0.  ,  0.08,  0.16,
            0.24,  0.32,  0.4 ,  0.48,  0.56,  0.64,  0.72,  0.8 ,  0.88,
            0.96,  1.04,  1.12,  1.2 ,  1.28,  1.36,  1.44,  1.52,  1.6 ,
            1.68,  1.76,  1.84,  1.92,  2.  ,  2.08,  2.16,  2.24,  2.32,
            2.4 ,  2.48,  2.56,  2.64,  2.72,  2.8 ,  2.88,  2.96,  3.04,
            3.12,  3.2 ,  3.28,  3.36,  3.44,  3.52,  3.6 ,  3.68,  3.76,
            3.84,  3.92,  4.  ,   inf]))
    >>> f["hpx"].bh     # boost-histogram
    Histogram(Regular(100, -4, 4, metadata={'name': 'hpx', 'title': 'This is the px distribution',
        'entries': 75000.0, 'total-sumw': 74994.0, 'total-sumw2': 74994.0, 'total-sumwx': -97.16475860591163,
        'total-sumwx2': 75251.86518025988, 'norm-factor': 0.0, 'option': <TString '' at 0x7f2f1406d550>,
        'line-color': 602, 'line-style': 1, 'line-width': 1, 'fill-color': 0, 'fill-style': 1001, 'marker-color': 1,
        'marker-style': 1, 'marker-size': 1.0, 'bar-offset': 0, 'bar-width': 1000, 'labels': None, 'axis-color': 1,
        'label-color': 1, 'label-font': 42, 'label-offset': 0.004999999888241291,
        'label-size': 0.03500000014901161, 'tick-length': 0.029999999329447746, 'title-offset': 1.0,
        'title-size': 0.03500000014901161, 'title-color': 1, 'title-font': 42, 'time-display': False, 'time-format': '',
        'contour': array([], dtype=float64), 'functions': [<TPaveStats (version 4) at 0x7f2f0c777f10>]}),
    storage=Double()) # Sum: 74994.0 (75000.0 with flow)
    >>> print(f["hpx"].bh)
    ... prints an ASCII histogram

    But the boost-histogram metadata is attached to an axis, whereas some ROOT metadata is on the axis, some on the histogram itself (and I've only ported a subset):

        def metadata(self, axis):
            if axis == "x":
                axis = self.member("fXaxis")
            return {
                "name": self.member("fName"),
                "title": self.member("fTitle"),
                "entries": self.member("fEntries"),
                "total-sumw": self.member("fTsumw"),
                "total-sumw2": self.member("fTsumw2"),
                "total-sumwx": self.member("fTsumwx"),
                "total-sumwx2": self.member("fTsumwx2"),
                "norm-factor": self.member("fNormFactor"),
                "option": self.member("fOption"),
                "line-color": self.member("fLineColor"),
                "line-style": self.member("fLineStyle"),
                "line-width": self.member("fLineWidth"),
                "fill-color": self.member("fFillColor"),
                "fill-style": self.member("fFillStyle"),
                "marker-color": self.member("fMarkerColor"),
                "marker-style": self.member("fMarkerStyle"),
                "marker-size": self.member("fMarkerSize"),
                "bar-offset": self.member("fBarOffset"),
                "bar-width": self.member("fBarWidth"),
                "labels": axis.member("fLabels"),
                "axis-color": axis.member("fAxisColor"),
                "label-color": axis.member("fLabelColor"),
                "label-font": axis.member("fLabelFont"),
                "label-offset": axis.member("fLabelOffset"),
                "label-size": axis.member("fLabelSize"),
                "tick-length": axis.member("fTickLength"),
                "title-offset": axis.member("fTitleOffset"),
                "title-size": axis.member("fTitleSize"),
                "title-color": axis.member("fTitleColor"),
                "title-font": axis.member("fTitleFont"),
                "time-display": axis.member("fTimeDisplay"),
                "time-format": str(axis.member("fTimeFormat")),
                "contour": numpy.asarray(self.member("fContour")),
                "functions": list(self.member("fFunctions")),
            }
    
        @property
        def bh(self):
            boost_histogram = uproot4.extras.boost_histogram()
    
            xaxis = self.member("fXaxis")
    
            xaxis_fNbins = xaxis.member("fNbins")
            xaxis_fXbins = xaxis.member("fXbins", none_if_missing=True)
            if xaxis_fXbins is None or len(xaxis_fXbins) == 0:
                boost_xaxis = boost_histogram.axis.Regular(
                    xaxis_fNbins,
                    xaxis.member("fXmin"),
                    xaxis.member("fXmax"),
                    underflow=True,
                    overflow=True,
                    metadata=self.metadata(xaxis),
                )
            else:
                boost_xaxis = boost_histogram.axis.Variable(
                    xaxis_fXbins,
                    underflow=True,
                    overflow=True,
                    metadata=self.metadata(xaxis),
                )
    
            for base in self.bases:
                if isinstance(base, uproot4.models.TArray.Model_TArray):
                    values = numpy.asarray(base)
                    break
    
            sumw2 = self.member("fSumw2", none_if_missing=True)
    
            if sumw2 is not None and len(sumw2) == len(values):
                storage = boost_histogram.storage.Weight()
            else:
                if issubclass(values.dtype.type, numpy.integer):
                    storage = boost_histogram.storage.Int64()
                else:
                    storage = boost_histogram.storage.Double()
    
            out = boost_histogram.Histogram(boost_xaxis, storage=storage)
            view = out.view(flow=True)
    
            if sumw2 is not None and len(sumw2) == len(values):
                view.value[:] = values
                view.variance[:] = sumw2
            else:
                view[:] = values
    
            return out

    So, it's up for interpretation...

    Jim Pivarski
    @jpivarski

    Perhaps this is more manageable:

        def metadata(self, axis):
            if axis == "x":
                axis = self.member("fXaxis")
            out = {
                "name": self.member("fName"),
                "title": self.member("fTitle"),
                "entries": self.member("fEntries"),
            }
            if axis.member("fLabels") is not None:
                out["labels"] = list(axis.member("fLabels"))
            if axis.member("fTimeDisplay"):
                out["time-format"] = str(axis.member("fTimeFormat"))
            return out

    results in

    >>> f["hpx"].bh
    Histogram(Regular(100, -4, 4, metadata={'name': 'hpx', 'title': 'This is the px distribution',
        'entries': 75000.0}), storage=Double()) # Sum: 74994.0 (75000.0 with flow)
    Henry Schreiner
    @henryiii
    Any histogram metadata can be stored in .metadata on the histogram (it doesn’t have slots, and if it did, I would include metadata).
    Jim Pivarski
    @jpivarski
    By changing .metadata in place?
    Henry Schreiner
    @henryiii
    Yes, h.metadata = {…}. We could probably add it as a kw argument eventually.
    Jim Pivarski
    @jpivarski
    Is there a place to put "hidden" metadata, lots of metadata that we don't want to look at in the __repr__? If I dumped all the ROOT values there, the conversion could be round-trip.
    Is the metadata string → anything or anything → anything?
    Henry Schreiner
    @henryiii
    The metadata is not shown in the histogram's repr, because the histogram doesn’t know anything about it, but the axis metadata is anything -> anything, and does show up in the repr. I wouldn’t worry about it too much, as metadata isn’t shown in Hist’s repr by default, just the label / title. You could subclass dict and add a custom repr.
    In hist, the only change would be that title and name should be pulled out and passed as kw arguments.
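    A minimal sketch of the "subclass dict and add a custom repr" idea, so that a large metadata dump stays out of the way when an axis or histogram is printed. The class name QuietMetadata is hypothetical and is not part of boost-histogram or Uproot.

```python
class QuietMetadata(dict):
    """A dict whose repr hides its contents (hypothetical helper)."""

    def __repr__(self):
        # Summarize instead of printing every key/value pair.
        return "<{0} with {1} keys>".format(type(self).__name__, len(self))


meta = QuietMetadata(
    {"fName": "hpx", "fTitle": "This is the px distribution", "fLineColor": 602}
)

summary = repr(meta)           # short summary string
line_color = meta["fLineColor"]  # normal dict access still works
print(summary, line_color)
```

    Because it is still a dict, everything that reads the metadata keeps working; only the printed form changes.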
    Jim Pivarski
    @jpivarski
    >>> h = uproot4.open(skhep_testdata.data_path("uproot-hepdata-example.root"))["hpx"]
    >>> h
    <TH1F (version 1) at 0x7f9bd13e6810>
    >>> h.bh
    Histogram(Regular(100, -4, 4), storage=Double()) # Sum: 74994.0 (75000.0 with flow)
    >>> h.bh.metadata
    {'name': 'hpx', 'title': 'This is the px distribution', 'entries': 75000.0}
    Henry Schreiner
    @henryiii
    :+1:
    Jim Pivarski
    @jpivarski
    Or just dumping everything as-is:
    >>> import uproot4, skhep_testdata
    >>> h = uproot4.open(skhep_testdata.data_path("uproot-hepdata-example.root"))["hpx"]
    >>> h.bh
    Histogram(Regular(100, -4, 4), storage=Double()) # Sum: 74994.0 (75000.0 with flow)
    >>> h.bh.metadata
    {'@fUniqueID': 0, '@fBits': 50331656, 'fName': 'hpx', 'fTitle': 'This is the px distribution',
     'fLineColor': 602, 'fLineStyle': 1, 'fLineWidth': 1, 'fFillColor': 0, 'fFillStyle': 1001,
     'fMarkerColor': 1, 'fMarkerStyle': 1, 'fMarkerSize': 1.0, 'fNcells': 102,
     'fXaxis': <TAxis (version 9) at 0x7f05e0e0b090>, 'fYaxis': <TAxis (version 9) at 0x7f05e0e0c950>,
     'fZaxis': <TAxis (version 9) at 0x7f05e0e0cb50>, 'fBarOffset': 0, 'fBarWidth': 1000,
     'fEntries': 75000.0, 'fTsumw': 74994.0, 'fTsumw2': 74994.0, 'fTsumwx': -97.16475860591163,
     'fTsumwx2': 75251.86518025988, 'fMaximum': -1111.0, 'fMinimum': -1111.0, 'fNormFactor': 0.0,
     'fContour': <TArrayD [] at 0x7f05e0e0cc90>, 'fSumw2': <TArrayD [] at 0x7f05e0e0cdd0>,
     'fOption': <TString '' at 0x7f05e0eba5d0>, 'fFunctions': <TList of 1 items at 0x7f05e0e0cd90>,
     'fBufferSize': 0, 'fBuffer': array([], dtype=float64), 'fBinStatErrOpt': 0, 'fN': 102}
    The issue with trying to preserve enough information for a round trip is that some things might be modified by slicing operations. For instance, the fSumw2 doesn't belong in there.
    And axis fLabels should be converted into a Categorical axis.
    Henry Schreiner
    @henryiii
    Some things should be recreated, yes. Not sure the best way to pick out information that is duplicated. Note that entries is the same as h.sum(flow=True), etc.
    Jim Pivarski
    @jpivarski
    I think that's not true if the histogram-filling was weighted. Bins are updated by the weighted values, but entries are a strict count of values. Also, I think the entries counts NaNs.
    Henry Schreiner
    @henryiii
    Yes, you are correct
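    To make the weighted-fill point concrete, here is a toy fill loop in plain Python. It is only an illustration of the bookkeeping (it is not ROOT's or boost-histogram's code, and the function name fill is made up): bins accumulate weights, while entries counts calls.

```python
def fill(values, weights, nbins=4, low=0.0, high=4.0):
    """Toy 1D weighted fill; returns (bin sums including flow bins, entries)."""
    sums = [0.0] * (nbins + 2)  # layout: [underflow, bin 1..nbins, overflow]
    entries = 0
    width = (high - low) / nbins
    for x, w in zip(values, weights):
        entries += 1  # a strict count of fill calls, independent of the weight
        if x < low:
            index = 0
        elif x >= high:
            index = nbins + 1
        else:
            index = int((x - low) / width) + 1
        sums[index] += w  # the bin content grows by the weight
    return sums, entries


bin_sums, entries = fill([0.5, 1.5, 1.7, 3.2], [2.0, 0.5, 0.5, 3.0])
total_with_flow = sum(bin_sums)
print(bin_sums, entries, total_with_flow)
```

    With these weights the sum over all bins is 6.0 while entries is 4; the two only agree when every weight is 1.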
    Jim Pivarski
    @jpivarski

    I'm thinking of finishing a version of this that gets as much data into the boost-histogram objects as possible, then iterate with you about how you actually want it, and how to make good demo examples out of it.

    I just found an example with categorical bins, and I have a TProfile somewhere...

    Oops; failed to deserialize the example with categorical bins. But it gives me a chance to show off the new debugging output. When people run into deserialization errors in the future, they'll get an error message that looks like this:

        TH1F version 2 as <dynamic>.Model_TH1F_v2 (939 bytes)
            TH1 version 7 as <dynamic>.Model_TH1_v7 (893 bytes)
                (base): <TNamed 'cutflow' title='dijethad' at 0x7fafb4505f90>
                (base): <TAttLine (version 2) at 0x7fafb4506350>
                (base): <TAttFill (version 2) at 0x7fafb4506390>
                (base): <TAttMarker (version 2) at 0x7fafb4506310>
                fNcells: 9
                TAxis version 9 as <dynamic>.Model_TAxis_v9 (417 bytes)
                    (base): <TNamed 'xaxis' at 0x7fafb4506890>
                    (base): <TAttAxis (version 4) at 0x7fafb4506950>
                    fNbins: 7
                    fXmin: 0.0
                    fXmax: 7.0
                    fXbins: <TArrayD [] at 0x7fafb4506910>
                    fFirst: 0
                    fLast: 0
                    fBits2: 4
                    fTimeDisplay: False
                    fTimeFormat: <TString '' at 0x7fafb44dc1d0>
                    THashList version 5 as <dynamic>.Model_THashList_v0 (294 bytes)
                        TList version 1 as uproot4.models.TList.Model_TList (? bytes)
                            (base): <TObject None None at 0x7fafb4506fd0>
                            fName: ''
                            fSize: 475136
    
    attempting to get bytes 1851028560:1851028561
    outside expected range 0:939 for this Chunk
    in file /home/pivarski/miniconda3/lib/python3.7/site-packages/skhep_testdata/data/uproot-issue33.root
    in object /cutflow

    Maybe it won't be clear what it means, but if I get something like this copy-pasted into a GitHub issue, it will be a lot easier to narrow it down. (Which I'm about to do to this one, since we really should have an example with categorical bins...)

    Jim Pivarski
    @jpivarski
    In fact, it only took a few minutes to solve, because from the above it's clear that it goes haywire in THashList, so I didn't have very far to search.
    >>> import uproot4, skhep_testdata
    >>> h = uproot4.open(skhep_testdata.data_path("uproot-issue33.root"))["cutflow"]
    >>> h
    <TH1F (version 2) at 0x7fc1cc062bd0>
    >>> h.member("fXaxis")
    <TAxis (version 9) at 0x7fc1cf7b7990>
    >>> h.member("fXaxis").member("fLabels")
    <THashList of 7 items at 0x7fc1b4764610>
    >>> list(h.member("fXaxis").member("fLabels"))
    [<TObjString 'Dijet' at 0x7fc1b473d0d0>, <TObjString 'MET' at 0x7fc1b473d150>,
     <TObjString 'MuonVeto' at 0x7fc1b473d1d0>, <TObjString 'IsoMuonTrackVeto' at 0x7fc1b473d250>,
     <TObjString 'ElectronVeto' at 0x7fc1b473d2d0>, <TObjString 'IsoElectronTrackVeto' at 0x7fc1b473d350>,
     <TObjString 'IsoPionTrackVeto' at 0x7fc1b473d3d0>]
    Jim Pivarski
    @jpivarski
    For a categorical axis, it's the last bin that's overflow, right?
    Henry Schreiner
    @henryiii
    Very nice!
    Yes, underflow bin is not allowed on categorical.
    Jim Pivarski
    @jpivarski
    Then I think this is it:
    >>> import uproot4, skhep_testdata
    >>> h = uproot4.open(skhep_testdata.data_path("uproot-issue33.root"))["cutflow"]
    >>> h.bh
    Histogram(StrCategory(['Dijet', 'MET', 'MuonVeto', 'IsoMuonTrackVeto', 'ElectronVeto', 'IsoElectronTrackVeto', 'IsoPionTrackVeto']), storage=Weight()) # Sum: WeightedSum(value=205222, variance=205222)
    >>> h.bh.view()
    WeightedSumView(
          [(39551., 39551.), (27951., 27951.), (27911., 27911.),
           (27861., 27861.), (27737., 27737.), (27460., 27460.),
           (26751., 26751.)], dtype=[('value', '<f8'), ('variance', '<f8')])
    Henry Schreiner
    @henryiii
    That looks right! For TProfile, look at the Mean / WeightedMean storage/accumulators. Mean has 3 values to set, WM has 4.
    If you slice, metadata on the histogram will not get propagated (but could be added)
    I’ll open an issue with a proposal later in boost-histogram
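    As a rough illustration of the "3 values to set" for a Mean-style accumulator, the sketch below turns per-bin running sums into a count, a mean, and a sum of squared deviations from the mean. It assumes the input is available as the number of entries, the sum of y, and the sum of y squared for one bin; whether that matches what a given TProfile stores, and the exact field names in boost-histogram, should be checked against the real objects. The function name mean_fields is hypothetical.

```python
def mean_fields(n, sum_y, sum_y2):
    """Convert running sums for one bin into (count, mean, sum_of_deltas_squared)."""
    if n == 0:
        return 0.0, 0.0, 0.0
    mean = sum_y / n
    # sum((y - mean)**2) == sum(y**2) - n * mean**2
    sum_of_deltas_squared = sum_y2 - n * mean * mean
    return n, mean, sum_of_deltas_squared


ys = [1.0, 2.0, 3.0, 6.0]
count, mean, m2 = mean_fields(len(ys), sum(ys), sum(y * y for y in ys))

# The sample variance follows from the three stored values.
variance = m2 / (count - 1)
print(count, mean, m2, variance)
```

    Storing the sum of squared deviations rather than the variance itself keeps the three values additive-friendly; the variance and the error on the mean are derived on demand.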
    Jim Pivarski
    @jpivarski

    Just before we started talking, I found out how to get the mean out of a TProfile. For the error, I have to figure this out:

    https://github.com/root-project/root/blob/e87a6311278f859ca749b491af4e9a2caed39161/hist/hist/src/TProfileHelper.h#L660-L721

    I'm not sure if we need to explicitly handle the "spread" option. (I used to use that in analysis, and it's not immediate from the mean and error on the mean because you have to go through the "effective number of entries.")

    Propagating metadata when slicing histograms is about as important as maintaining "Lorentz vectorness" on Awkward arrays. In the conversion, it should be my job to make sure that all metadata is meta, that it doesn't include anything that scales with the number of bins. I think what remains would be valid to pass through slices/rebinnings/projections without modification. (I'll also have to put the axis metadata on the axis and the non-axis metadata on the Histogram.)
    Hans Dembinski
    @HDembinski
    @jpivarski This is super-impressive work and very important to make boost-histogram more available to the mainstream.
    Jim Pivarski
    @jpivarski
    Thanks! I hope to get it into the Uproot tutorial for PyHEP, but I'll need an example. If you or Henry don't have suggestions, I could plot one of SciKit-HEP-testdata's uproot-hepdata-example.root's histograms using mplhep.
    Henry Schreiner
    @henryiii
    In preparation for tomorrow’s PyHEP talk, boost-histogram has been updated to 0.10.0. Several nice usability improvements, a bugfix, and ARM/PowerPC wheels are now available.
    Paul Gessinger
    @paulgessinger
    hey all! i'm running into an issue where boost-histograms raise when they are copied. copy.deepcopy tries to pickle a threading lock, and fails. I tried on various python versions to check if that makes a difference, but it seems it does not. this is on v0.10.0, but i tried as far back as v0.8.0 with the same error. v0.10.0 seems to switch __add__ over to copy if i understand correctly, which makes this error even more prominent for me.
    Paul Gessinger
    @paulgessinger
    as context maybe: this is on histograms read from a root file via uproot4 via to_boost. if i manually construct a histogram in the same environment and try to copy it, that seems to work
    Paul Gessinger
    @paulgessinger
    narrowing it down, it seems like this lock is in the metadata that is attached to the axis by uproot. removing the metadata from the axis makes copy work again
    Paul Gessinger
    @paulgessinger
    so maybe this is an uproot4 thing?
    Jim Pivarski
    @jpivarski
    It is—all objects in Uproot 4 point back to the file (and all the infrastructure needed for reading) for many reasons, including making __exit__ for context managers propagate back to close files and for better error messages. However, when handing off the data as "TAxis" and "TPaveText" for histograms, this connection to the original file should be cut. I'll do that.
    Henry Schreiner
    @henryiii
    This came up in scikit-hep/boost-histogram#431 too
    Jim Pivarski
    @jpivarski

    Fixed in scikit-hep/uproot4#58; with a short list of exceptions (things like TDirectory, TTree, TBranch...), objects from a ROOT file are now detached from the original file. Thus, it wouldn't be possible to use these objects to read more data (which is why TDirectory, TTree, etc. are exceptions). But this means that the detached objects can be pickled and don't contain any transients, like locks or threads.

    You can even save an object from a ROOT file into a pickle file and read it back in a new Python process, even though that object's class was derived from data in the ROOT file. (We pickle enough derived quantities from the TStreamerInfo to reconstitute the class object.)
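    The failure mode and the fix can be reproduced with the standard library alone. In this sketch, FileSource and AxisMetadata are hypothetical stand-ins (not Uproot classes): an object that points back to something holding a threading lock cannot be deep-copied, while a detached copy can.

```python
import copy
import threading


class FileSource:
    """Stand-in for a file handle that owns a lock (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()


class AxisMetadata:
    """Stand-in for an object read from a file that may point back to its source."""

    def __init__(self, name, source=None):
        self.name = name
        self.source = source

    def detached(self):
        # Keep the data, drop the reference to the file and its lock.
        return AxisMetadata(self.name, source=None)


attached = AxisMetadata("xaxis", FileSource())

try:
    copy.deepcopy(attached)  # reaches the lock, which cannot be copied
    copy_failed = False
except TypeError:
    copy_failed = True

clone = copy.deepcopy(attached.detached())  # works once the link is cut
print(copy_failed, clone.name, clone.source)
```

    The trade-off is the one described above: a detached object can be copied and pickled, but it can no longer be used to read more data from the file.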

    Nicholas Smith
    @nsmith-
    is there a plan to support (ab)using multidimensional boost-histograms as lookup tables?
    Hans Dembinski
    @HDembinski
    I abuse them to that end. I think we have everything in place for this use case, no?
    Jim Pivarski
    @jpivarski

    Does anyone have an answer to this: https://stackoverflow.com/questions/63813448/writing-boost-histograms-with-uproot

    It would have to be Uproot3, since Uproot4 doesn't write anything yet.

    Henry Schreiner
    @henryiii
    How would I write a histogram with variances in uproot3? It should be easy to mimic that with boost-histogram.
    Jim Pivarski
    @jpivarski

    I just looked into it and found a physt example, which handles variances:

    https://github.com/scikit-hep/uproot-methods/blob/80dbc8123c577253585b33b7d8b3d72acc42818b/uproot_methods/classes/TH1.py#L447-L508

    It creates classes with the right names and the right fields, which Uproot 3 recognizes when assigning to a key of an output file (in __setitem__). The recognition happens in

    https://github.com/scikit-hep/uproot-methods/blob/80dbc8123c577253585b33b7d8b3d72acc42818b/uproot_methods/convert.py#L44-L45

    Considering how complicated this looks, it's a toss-up whether it's valuable to do it now, so that Uproot 3 will recognize and write boost-histogram and hist objects, or if it would be better to wait a month or two for me to add the file-writing to Uproot 4. The new interface would be more formal than this.

    Maybe more than two months—I've claimed file-writing in Uproot 4 as a milestone for December 1, though.

    Henry Schreiner
    @henryiii

    Hist 2.0.0 is out! This is the result of the work @LovelyBuggies and I have been doing for Google Summer of Code 2020. Changes since Beta 1:

    • Based on boost-histogram 0.11; now supports two way boost-histogram <-> hist conversion without metadata issues.
    • mplhep is now used for all plotting. Return types changed; fig dropped, new figures only created if needed.
    • QuickConstruct was rewritten, uses new.Reg(...).Double(); not as magical but clearer types and usage.
    • Plotting requirements are no longer required, use pip install "hist[plot]" to request.

    The following new features were added:

    • Jupyter HTML repr's were added.
    • flow=False shortcut added.
    • Static type checker support for dependent projects.

    See more details at https://github.com/scikit-hep/hist

    Eduardo Rodrigues
    @eduardo-rodrigues
    Congrats, folks :+1: !
    Jan Pipek
    @janpipek
    Cool!
    Henry Schreiner
    @henryiii
    Building on top of the recently released pybind11 2.6.0, boost-histogram 0.11.1 is out! Python 3.9 support and wheels, PyPy support and wheels, 40% faster accumulators, better CMake support, and quite a bit more just from the upgrade!
    N!no
    @LovelyBuggies
    👍