    Matthew Feickert
    @matthewfeickert
    ah okay, thanks for the info on ax._ax.metadata!
    Henry Schreiner
    @henryiii
    “name” is something that can be used to refer to the axis and should be unique. “label” is anything.
    Matthew Feickert
    @matthewfeickert
    :+1:
    Thanks Henry!
    Henry Schreiner
    @henryiii
    And you can use __dict__ on a histogram, just not on an axis. I wonder if @property; def __dict__(self): return self._ax.metadata would break anything...
    It’s a slots class, so the normal __dict__ is not used.
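A tiny self-contained sketch of the idea Henry is floating here; the Axis class below is a hypothetical stand-in, not boost-histogram's actual implementation. A __slots__ class normally has no instance __dict__, but a property named __dict__ can redirect attribute-dictionary lookups to the metadata:

```python
# Hypothetical sketch: redirect __dict__ of a slots class to its metadata.
# "Axis" is a stand-in class, not the real boost-histogram axis.
class Axis:
    __slots__ = ("_metadata",)  # no per-instance __dict__ is allocated

    def __init__(self, metadata):
        self._metadata = metadata

    @property
    def __dict__(self):  # attribute lookup finds this property instead
        return self._metadata


ax = Axis({"name": "x", "label": "my axis"})
print(vars(ax))  # vars() reads ax.__dict__, so it returns the metadata
```

Because vars() simply looks up the __dict__ attribute, the property is enough to make vars(ax) return the metadata dictionary.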
    Hans Dembinski
    @HDembinski
    @matthewfeickert The idea of metadata is to distinguish between otherwise identical axes, e.g. to distinguish axis.Regular(10, 0, 10, metadata="time in seconds") from axis.Regular(10, 0, 10, metadata="length in meters"). That's why adding histograms in boost-histogram (and hist) only works when the metadata matches on all axes.
    The assumption here is that you can only add histograms with an identical layout: not only the physical layout in terms of the axis grid, but also the logical layout in terms of what the axes mean.
    Matthew Feickert
    @matthewfeickert
    This makes sense now. Before, I wasn't assuming that anything called "metadata" could actually have an impact on operations.
    Peter Fackeldey
    @pfackeldey

    Dear Hist/boost-histogram developers,

    thank you for this great histogramming library; it is a pleasure to work with :)

    I have a question for you regarding fancy indexing on a StrCategory axis.
    My (Hist) histogram looks as follows:

    In [38]: h
    Out[38]:
    Hist(
      StrCategory(['GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45'], growth=True),
      StrCategory(['ee', 'mumu', 'emu'], growth=True),
      StrCategory(['nominal'], growth=True),
      Regular(40, 0, 200, name='MET', label='$p_{T}^{miss}$'),
      storage=Weight()) # Sum: WeightedSum(value=2608.44, variance=47.3505) (WeightedSum(value=2775.4, variance=50.5706) with flow)

    Now I would like to "group" e.g. the "ee" and "emu" categories together, which means that I'd like to do something like:

    h[{"dataset": "GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45", "category": ["ee", "emu"], "systematic": "nominal"}]

    Unfortunately this does not work, as ["ee", "emu"] is not a valid indexing operation.

    Is there a way to do such fancy indexing on StrCategory axes? In case this is not supported, is there a nice workaround?
    (I am basically looking for something that works similarly to the coffea.Hist.group method: https://github.com/CoffeaTeam/coffea/blob/master/coffea/hist/hist_tools.py#L1115)

    Thank you very much in advance!
    Best, Peter

    Peter Fackeldey
    @pfackeldey
    Small update: I found that the following works, but I think this would be a very nice feature for Hist if it could be handled a bit more conveniently. What do you think?
    np.sum(h[{"dataset": "GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45", "systematic": "nominal"}].view()[h.axes["category"].index(["ee", "emu"]), ...], axis=0)
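The grouping Peter wants can be sketched with plain numpy on a histogram's counts array: pick the indices of the wanted categories and sum their slices along the category axis. The numbers below are toy values, not Peter's actual histogram:

```python
import numpy as np

# Toy stand-in for a 2D (category, MET) counts array:
# axis 0 is the StrCategory axis with three bins, axis 1 has two MET bins.
categories = ["ee", "mumu", "emu"]
counts = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Group "ee" and "emu" by summing their slices along the category axis.
keep = [categories.index(c) for c in ("ee", "emu")]
grouped = counts[keep].sum(axis=0)
print(grouped)  # [6. 8.]
```

This is essentially what Peter's one-liner does; the downside, as discussed below, is that operating on the view loses the histogram object itself.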
    Henry Schreiner
    @henryiii
    @pfackeldey The issue is here: scikit-hep/boost-histogram#296. There could be a shortcut in Hist if this is not solved upstream, but ideally indexing by ["ee", "emu"] would not cause the histogram to convert into a view.
    Peter Fackeldey
    @pfackeldey
    @henryiii Thank you for your reply! Alright, I'll go with the reduce option for now, so that I don't have to convert into a view and can keep the hist :) I'd love to see a nice indexing solution in the future, such as the syntax you proposed in the issue.
    Peter Fackeldey
    @pfackeldey
    I hope you don't mind another question:
    I would like to fill a histogram with a very fine (regular) binning, e.g. 1000 bins from 0 to 1. After filling, I would like to rebin into a variable binning, e.g. [0, 5, 7, 8, 9.5, 1] (only 5 bins). As far as I can see, there is the rebin action, but unfortunately it only accepts a single value/factor. Is it possible to rebin a Regular/Variable axis to an arbitrary binning, as long as the binning is "valid"?
    Henry Schreiner
    @henryiii

    I think this is still a feature that hasn’t been added to Boost.Histogram yet (@HDembinski can comment), but one catch is the bin edges: the new edges have to coincide exactly with existing bin edges, or the operation doesn’t make sense. You can do this manually by copying from one histogram to another, though; maybe something Hist could simplify as well. Think something like this (rough draft, might not be correct):

    (ax,) = new_hist.axes
    for i in range(ax.size):
        new_hist[i] = old_hist[bh.loc(ax.edges[i]) : bh.loc(ax.edges[i + 1]) : sum]

    (Didn’t include flow bins, but you can handle those with bh.tag.at and going from -1 to N.)
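The same copy-and-sum rebinning can be sketched with plain numpy on the counts array, under the assumption Henry states: the coarse edges must coincide exactly with existing fine edges. Toy numbers, flow bins ignored:

```python
import numpy as np

# 10 fine bins with integer edges 0..10 (stand-in for the 1000-bin case).
fine_edges = np.arange(11)
fine_counts = np.arange(10, dtype=float)

# Coarse edges chosen as an exact subset of the fine edges.
coarse_edges = np.array([0, 3, 5, 10])

# Find where each coarse bin starts in the fine array, then sum each run.
idx = np.searchsorted(fine_edges, coarse_edges[:-1])
coarse_counts = np.add.reduceat(fine_counts, idx)
print(coarse_counts)  # [ 3.  7. 35.]
```

Note this only merges plain counts; for a Weight storage you would merge values and variances the same way, which is what copying histogram-to-histogram as above handles for you.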

    Matthew Feickert
    @matthewfeickert
    @henryiii @HDembinski Is there an API in Boost.Histogram for uncertainties? That is, if you wanted to calculate the weighted uncertainty on each bin of a histogram formed from the sum of histograms (a stacked hist), is there a way to ask a Boost.Histogram object for the uncertainties (and weights?) associated with each bin (as a NumPy array or list)?
    Matthew Feickert
    @matthewfeickert
    Or would you use the Accumulators here?
    Matthew Feickert
    @matthewfeickert

    Also, a somewhat unrelated question: What am I not understanding about how Boost.Histogram does the variance? Or am I making a stupid mistake?

    import boost_histogram as bh
    import numpy as np
    
    # From the Boost.Histogram examples
    values = [1]*9 + [91]
    weight=2
    weights=weight*np.ones(len(values))
    wm = bh.accumulators.WeightedMean().fill(values, weight=weight)
    
    print("\n# By hand")
    print(f"values: {values}")
    print(f"weights: {weights}")
    _sum_of_weights = sum(weights)
    print(f"sum of weights: {_sum_of_weights}")
    _sum_of_weights_squared = sum(np.square(weights))
    print(f"sum of weights squared: {_sum_of_weights_squared}")
    print(f"weighted mean: {np.average(values, weights=weights)}")
    print(f"np variance: {np.var(values)}")
    _weighted_variance = np.average((values-np.average(values, weights=weights))**2, weights=weights) # Better ways to do this
    print(f"weighted variance: {_weighted_variance}")
    print("\n# Boost.Histogram accumulator")
    print(wm) # Why is the variance different here?

    gives

    # By hand
    values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
    weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
    sum of weights: 20.0
    sum of weights squared: 40.0
    weighted mean: 10.0
    np variance: 729.0
    weighted variance: 729.0
    
    # Boost.Histogram accumulator
    WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
    Matthew Feickert
    @matthewfeickert
    Hm. Is Boost.Histogram using a particular approximation scheme for the variance that is highly efficient for larger arrays? I see that if I increase the length of values, this deviation drops to under a percent.
    Alexander Held
    @alexander-held
    The difference between 810 and 729 is sample variance vs. population variance: https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance.
    (The n-1 vs. n difference goes away as n -> inf.)
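NumPy makes the n vs. n-1 distinction Alexander describes explicit via the ddof argument; with the values from the example above:

```python
import numpy as np

values = [1] * 9 + [91]  # the values from the example above

# Population variance divides by n, sample variance by n - 1 (ddof=1).
print(np.var(values))          # 729.0
print(np.var(values, ddof=1))  # 810.0
```

Since ddof=1 divides the same sum of squared deviations (7290) by 9 instead of 10, the ratio is exactly n/(n-1), which shrinks toward 1 as n grows.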
    Matthew Feickert
    @matthewfeickert
    Ah yes, that should have been my first thought
    thank you @alexander-held! I appreciate you pointing that out for me. :)
    Matthew Feickert
    @matthewfeickert
    For those playing along with my goof,
    import boost_histogram as bh
    import numpy as np
    
    # From the Boost.Histogram examples
    values = [1]*9 + [91]
    weight=2
    weights=weight*np.ones(len(values))
    wm = bh.accumulators.WeightedMean().fill(values, weight=weight)
    
    print("\n# By hand")
    print(f"values: {values}")
    print(f"weights: {weights}")
    _sum_of_weights = sum(weights)
    print(f"sum of weights: {_sum_of_weights}")
    _sum_of_weights_squared = sum(np.square(weights))
    print(f"sum of weights squared: {_sum_of_weights_squared}")
    print(f"weighted mean: {np.average(values, weights=weights)}")
    print(f"np variance: {np.var(values)}")
    _weighted_variance = sum(np.square(values - np.average(values, weights=weights))) / (len(values) - 1)  # sample (unbiased) variance; there are better ways to do this
    print(f"weighted variance: {_weighted_variance}")
    print("\n# Boost.Histogram accumulator")
    print(wm) # Difference _was_ that Boost.Histogram nicely gives an unbiased estimator! :)
    # By hand
    values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
    weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
    sum of weights: 20.0
    sum of weights squared: 40.0
    weighted mean: 10.0
    np variance: 729.0
    weighted variance: 810.0
    
    # Boost.Histogram accumulator
    WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
    Matthew Feickert
    @matthewfeickert
    Though if there is an API in Boost.Histogram or hist for properly handling the per-bin uncertainties when summing histograms, that would be great. Hopefully Hans and Henry can set me straight on that next week.
    Hans Dembinski
    @HDembinski
    @matthewfeickert Looks like the variance computation issue has been cleared up. Now to your other question: you can make a histogram with a storage that uses the WeightedMean accumulator.
    import boost_histogram as bh
    h = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.WeightedMean())
    h.fill([0.5], sample=[3], weight=[2])
    The first argument is used to look up the cell, then the second argument is fed to the accumulator in that cell with the weight given by the third argument.
    After filling, you can query view() as usual to get the values, variances, etc.
    Matthew Feickert
    @matthewfeickert

    Hm. Okay, I'll poke around at this later tonight, but does that mean that I need to make a second histogram to keep track of the uncertainties on my first histogram? For instance,

    import hist
    from hist import Hist
    import uproot4 as uproot
    
    root_file = uproot.open("example.root")
    values, edges = root_file["data"].to_numpy()
    
    _hist = Hist(
        hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
        storage=hist.storage.Double(),
    )
    _hist[:] = values
    _hist.view()

    giving

    array([  0.,   0.,   0.,   0.,   1.,  16.,  95., 272., 459., 591., 674.,
           658., 623., 531., 456., 389., 321., 254., 194., 161., 120.,  95.,
            74.,  60.,  53.,  38.,  27.,  19.,  20.,  15.,  11.,  10.,  14.,
             5.,   6.,   3.,   7.,   2.,   6.,   1.,   3.,   2.,   3.,   0.,
             0.,   1.,   0.,   0.,   0.,   0.])

    if I wanted to now get the uncertainties on this histogram (so that I could then add them in quadrature while summing this "data" histogram with others to get the uncertainty on the resulting summed histogram), I'm not quite clear how I would do so.
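For unweighted event counts the per-bin variance is just the count (Poisson), and variances of independent histograms add when the histograms are summed, which is the same as adding the per-bin uncertainties in quadrature. A numpy-only sketch with toy bin contents:

```python
import numpy as np

# Toy bin contents for two independent histograms (Poisson: variance = count).
counts_a = np.array([4.0, 9.0, 16.0])
counts_b = np.array([1.0, 4.0, 9.0])
var_a, var_b = counts_a, counts_b

# Summing the histograms: contents add, and so do the variances.
summed = counts_a + counts_b
summed_err = np.sqrt(var_a + var_b)  # quadrature sum of per-bin errors
print(summed)  # [ 5. 13. 25.]
print(summed_err)
```

This bookkeeping is exactly what a Weight storage automates, as Alexander's reply below shows: the stored variance per bin is carried along through histogram addition.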

    Alexander Held
    @alexander-held
    I think you want storage=bh.storage.Weight() to keep track of variance
    example:
    import boost_histogram as bh
    import numpy as np
    
    bins = [1, 2, 3]
    yields = np.asarray([3, 4])
    stdev = np.asarray([1, 2])
    h = bh.Histogram(
        bh.axis.Variable(bins, underflow=False, overflow=False),
        storage=bh.storage.Weight(),
    )
    h[...] = np.stack([yields, stdev ** 2], axis=-1)
    
    print(f"yields: {h.view().value}, stdev: {np.sqrt(h.view().variance)}")
    Henry Schreiner
    @henryiii
    Sorry for the delay; I had family over this weekend, and my cat had/has serious health problems. Yes, you want a weighted histogram. Currently, weighted histograms don’t plot or interact nicely with the rest of the Scikit-HEP ecosystem, but a Protocol to allow this is in the works and should be roughly ready by the end of the month. There’s not a quick Double -> Weighted conversion function, but I could add one for Hist. (Generally, some conversions don’t make sense, and this one has to be used carefully.)
    Matthew Feickert
    @matthewfeickert
    thanks @alexander-held, I'll poke at this tonight too, but this indeed seems to be what I'm after (as Henry mentioned). @henryiii no need to apologize — I'm never expecting a reply over a weekend (also sorry to hear about the cat)
    Matthew Feickert
    @matthewfeickert

    Currently, weighted histograms don’t plot or interact nicely with the rest of the Scikit-HEP ecosystem, but a Protocol to allow this is in the works and should be roughly ready by the end of the month.

    @henryiii That's great and would be quite helpful to be able to use as defaults for https://github.com/matthewfeickert/heputils/blob/master/src/heputils/convert.py

    Is there an Issue open at the moment that I could follow for this? I realize I'm also making a lot of requests without volunteering to help with anything, so apologies there, but maybe by 2021 I'll understand enough that I can start making some more helpful Issues or easy PRs.

    Hans Dembinski
    @HDembinski
    @matthewfeickert Don't worry about it, I am sure you have your hands full with pyhf, which seems to be getting very popular. I highly value feedback from other developers within Scikit-HEP, and I am sure the others do as well. Since we are all skilled devs, we can find and discuss issues that normal users will never think about (or do not bother to ask about, depending on the subject).
    I got confused by your original example, where you computed the sum of weighted values. That's a profile in ROOT language, so I explained how to make a profile. But you actually want just a sum of weights, and then you need to use the WeightedSum storage, as others explained. In the latter case, there are no values to be weighted; just a sum of weights is collected, and the variance of that sum is given by the sum of weights squared, which this storage computes for you.
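What the WeightedSum (Weight) storage tracks per bin can be written out in a couple of lines of numpy; the weights below are toy values, not tied to any particular histogram:

```python
import numpy as np

# Toy weights of the entries that landed in a single bin.
weights = np.array([2.0, 2.0, 0.5])

value = weights.sum()            # bin content: sum of weights
variance = (weights ** 2).sum()  # bin variance: sum of weights squared
print(value, variance)  # 4.5 8.25
```

For unit weights this reduces to the familiar Poisson case, where the variance equals the count.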
    Henry Schreiner
    @henryiii
    @matthewfeickert The work on uproot is here, and the issue is scikit-hep/boost-histogram#423. I realized on inspection of Jim’s work that I’ve really muddled some of the values for different storages, so I’ll be tweaking and clarifying after the AS meeting today.
    Henry Schreiner
    @henryiii
    I’m probably about to push to make Boost-Histogram 1.0, probably sometime this month. I’ve discussed numbering and Python versions a bit, but I’m beginning to think the following plan might be best: I’ll make all the changes for 1.0 and make a release (say, 0.12). Then I’ll strip Python 2 and Python 3.5 support and deprecated features, and make an otherwise API-compatible 1.0 release. That way, going forward, we don’t have to maintain Python 2 support for most new features, and we reduce the large number of wheels we have to push per release, but the “final” API of 1.0 is available in 0.x. So if we do pick up major Python 2 users, like experiment stacks, they are not using an outdated API (well, unless they ignore FutureWarnings, that is ;) ). If we really had to, we could even release 0.13 alongside 1.1 (it would take serious and desperate users, but the option would be open). @HDembinski I think this is closer to your original idea as well; thoughts? @jpivarski & @eduardo-rodrigues ?
    Hans Dembinski
    @HDembinski
    Sounds good to me. We settled the Plottable protocol and now also how slices work; was there anything else that was controversial about the interface?
    Henry Schreiner
    @henryiii
    I don’t think so. The list here is pretty complete: https://github.com/scikit-hep/boost-histogram/milestone/7 In fact, indexing by a list of categories will likely wait until Boost.Histogram has support, so that will move to post-1.0. The addable weight view is already implemented, AFAIK.
    Actually, there is one more unification I have that you might like, but I have to play with it a bit to make sure it works.
    So expect one more issue or PR. It doesn’t really change the interface, but it unifies some things behind the scenes and enables a little bit more than before.
    Hans Dembinski
    @HDembinski
    ok
    If the implementation can be simplified or made more uniform, great
    Eduardo Rodrigues
    @eduardo-rodrigues
    Hi @henryiii, I think your proposal sounds very reasonable.
    Henry Schreiner
    @henryiii
    I’m giving GitHub Discussions a try! Feel free to open discussions at https://github.com/scikit-hep/boost-histogram/discussions
    Hans Dembinski
    @HDembinski
    @henryiii Could you perhaps try it out with another project? In terms of community management, it is best to point people to one place for feedback.
    By adding https://github.com/scikit-hep/boost-histogram/discussions you are splitting the community, and I don't want that.