Matthew Feickert
@matthewfeickert

mass_hists[0] + mass_hists[1]

results in

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-7bab2e7a86b3> in <module>
     12 # mass_hists[0]
     13 # help(stack_hist)
---> 14 mass_hists[0] + mass_hists[1]

    223         result = self.copy(deep=False)
...
    227         if isinstance(other, (int, float)) and other == 0:
    228             return self
...
    231         # Addition may change the axes if they can grow

/srv/conda/envs/notebook/lib/python3.7/site-packages/boost_histogram/_internal/hist.py in _compute_inplace_op(self, name, other)
    261     def _compute_inplace_op(self, name, other):
    262         if isinstance(other, Histogram):
--> 263             getattr(self._hist, name)(other._hist)
    264         elif isinstance(other, _histograms):
    265             getattr(self._hist, name)(other)

ValueError: axes not mergable
Jim Pivarski
@jpivarski
I'd read that error message as saying that the binning is not the same.
Henry Schreiner
@henryiii
If the axes are the same, yes, that works. If the axes don’t match, it won’t.
Matthew Feickert
@matthewfeickert
Hm.
>>> bins = [hist.to_numpy()[1] for hist in mass_hists]
>>> all(np.array_equal(x, y) for x, y in zip(bins, bins[::-1]))
True
Henry Schreiner
@henryiii
I’d use reduce to sum up a list of histograms, functools.reduce(operator.add, mass_hists), but you should be able to do sum(mass_hists), if you’d rather.
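Both spellings rely on the same thing: histogram addition special-cases `other == 0` (the `if isinstance(other, (int, float)) and other == 0: return self` lines visible in the traceback above), which is exactly what lets the builtin `sum`, whose accumulation starts at the integer `0`, work. A minimal pure-Python stand-in sketching both (the `Counts` class here is a toy, not boost-histogram):

```python
import functools
import operator

class Counts:
    """Toy stand-in for a histogram; only bin counts, no axes."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Mirror boost-histogram's special case that makes builtin sum() work:
        # sum() starts its accumulation from the integer 0.
        if isinstance(other, (int, float)) and other == 0:
            return self
        return Counts(a + b for a, b in zip(self.data, other.data))

    __radd__ = __add__

hists = [Counts([1, 2]), Counts([3, 4]), Counts([5, 6])]

total_reduce = functools.reduce(operator.add, hists)
total_sum = sum(hists)  # 0 + Counts(...) hits the special case above

print(total_reduce.data)  # [9, 12]
print(total_sum.data)     # [9, 12]
```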
Matthew Feickert
@matthewfeickert
oh
so things like the axis name matter?
Jim Pivarski
@jpivarski
(Maybe there ought to be an easy way to strip metadata? And it should be recommended in the "axes not mergable" error message?)
Matthew Feickert
@matthewfeickert

Hm. Though

for hist in mass_hists:
    assert hist.metadata is None

passes

okay, I might be doing something stupid. I'll poke at this more after my next meeting
Henry Schreiner
@henryiii
What would you do with the axes metadata? c = a + b, what’s the metadata for c? .metadata is only one item in the metadata, any valid attribute is also metadata, like .name, etc.
Jim Pivarski
@jpivarski
One option is to use the first, though that clearly has downsides. Forcing the user to make a choice by wrapping the function call with a metadata stripper/overwriter that gets advertised in the error message might be the best way to go...
Matthew Feickert
@matthewfeickert

This is all helpful and makes me appreciate the concept of metadata more. I guess the good news is that, once some issues have been resolved, uproot will handle things in a more intelligent manner than I have been. What I have been (hackily) doing is the following: https://github.com/matthewfeickert/heputils/blob/3dd0e858f002041a7ae15fc310d92ddb0ea4fe26/src/heputils/convert.py#L7-L32

If I just don't even attempt to handle the name then things work with the following:

import numpy as np
import uproot4 as uproot
import mplhep
import hist
from hist import Hist
import functools
import operator

mplhep.set_style("ATLAS")

def uproot_to_hist(uproot_hist):
    values, edges = uproot_hist.to_numpy()
    _hist = hist.Hist(
        hist.axis.Regular(len(edges) - 1, edges[0], edges[-1]),
        storage=hist.storage.Double(),
    )
    _hist[:] = values
    return _hist

root_file = uproot.open("example.root")
mass_hists = [uproot_to_hist(root_file[key]) for key in root_file.keys()]

# Sum the histograms into a stack before plotting
stack_hist = functools.reduce(operator.add, mass_hists)
stack_hist.plot1d()
for h in mass_hists:
    h.plot1d()
Matthew Feickert
@matthewfeickert
Though I guess I still want to know, is there a more intelligent way to deal with the name metadata?
Henry Schreiner
@henryiii
As long as it matches, it should be fine. There used to be a way to get the metadata all at once (.metadata), but now there’s not a public location like that. It would be nice to be able to just do ax1.__dict__ = ax2.__dict__ and then merge.
Matthew Feickert
@matthewfeickert
yeah, I also just looked at the docs for hist and was reminded that name is the name of the axis and not the histogram
Henry Schreiner
@henryiii
(There is a private way to do exactly that - I think it’s ax._ax.metadata)
You can name your histogram if you want to.
Matthew Feickert
@matthewfeickert
so what I was doing should have been setting the label and not the name
ah okay, thanks for the info on ax._ax.metadata!
Henry Schreiner
@henryiii
“name” is something that can be used to refer to the axis and should be unique. “label” is anything.
Matthew Feickert
@matthewfeickert
:+1:
Thanks Henry!
Henry Schreiner
@henryiii
And you can use __dict__ on a histogram, just not on an axis. I wonder if @property; def __dict__(self): return self._ax.metadata would break anything...
It’s a slots class so normal __dict__ is not used.
Hans Dembinski
@HDembinski
@matthewfeickert The idea of metadata is to distinguish between otherwise identical axes, e.g. to distinguish axis.Regular(10, 0, 10, metadata="time in seconds") from axis.Regular(10, 0, 10, metadata="length in meters"). That's why adding histograms in boost-histogram (and hist) only works when the metadata matches on all axes.
The assumption here is that you can only add histograms with identical layout: not only the physical layout in terms of the axis grid, but also the logical layout in terms of what those axes mean.
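Hans's point can be sketched with a plain dataclass (a hypothetical illustration, not the boost-histogram implementation): axes compare equal only when both the bin layout and the metadata agree, and that equality is the precondition for `h1 + h2`.

```python
from dataclasses import dataclass

# Hypothetical sketch, not boost-histogram itself: dataclass equality
# compares every field, including metadata, just as axis comparison does.
@dataclass
class RegularAxis:
    bins: int
    start: float
    stop: float
    metadata: object = None

a = RegularAxis(10, 0, 10, metadata="time in seconds")
b = RegularAxis(10, 0, 10, metadata="length in meters")
c = RegularAxis(10, 0, 10, metadata="time in seconds")

print(a == b)  # False: same grid, different meaning, so not mergeable
print(a == c)  # True: identical layout and metadata
```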
Matthew Feickert
@matthewfeickert
This makes sense now. Before, I was assuming that anything called "metadata" couldn't actually have any impact on operations.
Peter Fackeldey
@pfackeldey

Dear Hist/boost-histogram developers,

thank you for this great histogramming library, it is a pleasure to work with it :)

I have a question for you regarding fancy indexing on a StrCategory axis.
My (Hist) histogram looks as follows:

In [38]: h
Out[38]:
Hist(
  StrCategory(['GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45'], growth=True),
  StrCategory(['ee', 'mumu', 'emu'], growth=True),
  StrCategory(['nominal'], growth=True),
  Regular(40, 0, 200, name='MET', label='$p_{T}^{miss}$'),
  storage=Weight())  # Sum: WeightedSum(value=2608.44, variance=47.3505) (WeightedSum(value=2775.4, variance=50.5706) with flow)

Now I would like to "group" e.g. the "ee" and "emu" category together, which means that I'd like to do something like:

h[{"dataset": "GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45", "category": ["ee", "emu"], "systematic": "nominal"}]

Unfortunately this does not work, as ["ee", "emu"] is not a valid indexing operation.

Is there a way to do such a fancy indexing on StrCategory axes? In case this is not supported, is there a nice workaround?
(I am basically looking for something, which works similar to coffea.Hist.group method: https://github.com/CoffeaTeam/coffea/blob/master/coffea/hist/hist_tools.py#L1115)

Best, Peter

Peter Fackeldey
@pfackeldey
Small update: I found that the following works, but I think it would be a very nice feature for Hist if this could be handled a bit more conveniently. What do you think?
np.sum(h[{"dataset": "GluGluToHHTo2B2VTo2L2Nu_node_cHHH2p45", "systematic": "nominal"}].view()[h.axes["category"].index(["ee", "emu"]), ...], axis=0)
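Peter's workaround amounts to selecting the wanted category rows of the counts array and summing them along the category axis; the same idea on a plain NumPy array (toy values, not the histogram above):

```python
import numpy as np

# Toy counts: a 3-entry string-category axis times a 4-bin regular axis.
categories = ["ee", "mumu", "emu"]
counts = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# "Group" 'ee' and 'emu': pick those rows and sum along the category axis.
rows = [categories.index(c) for c in ("ee", "emu")]
grouped = counts[rows].sum(axis=0)
print(grouped.tolist())  # [10, 12, 14, 16]
```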
Henry Schreiner
@henryiii
@pfackeldey The issue is here: scikit-hep/boost-histogram#296. There could be a shortcut in Hist if this is not solved upstream, but ideally indexing by ["ee", "emu"] would not cause the histogram to convert into a view.
Peter Fackeldey
@pfackeldey
@henryiii Thank you for your reply! Alright, I'll go for now with the reduce option, so that I don't have to convert into a view and can keep the hist :) I'd love to see a nice indexing solution in the future, such as the syntax you proposed in the issue.
Peter Fackeldey
@pfackeldey
I hope you don't mind another question:
I would like to fill a histogram with a very fine (regular) binning, e.g. 1000 bins from 0 to 1. After filling I would like to rebin into a variable binning of e.g. [0, 0.5, 0.7, 0.8, 0.95, 1] (only 5 bins). As far as I can see, there is the rebin action, but unfortunately it only accepts a single value/factor. Is it possible to rebin a Regular/Variable axis to any arbitrary binning, as long as the binning is "valid"?
Henry Schreiner
@henryiii

I think this is still a feature that hasn’t been added to Boost.Histogram yet (@HDembinski can comment), but one catch is the bin edges: the new edges have to line up exactly with existing ones or the rebinning doesn’t make sense. You can write this manually by copying from one histogram into another, though; maybe something Hist could simplify as well. Something like this (rough draft, might not be correct):

(ax,) = new_hist.axes
for bin in range(5):
    new_hist[bin] = old_hist[bh.loc(ax[bin]) : bh.loc(ax[bin + 1]) : sum]

(Didn’t include flow bins, but you can do that with bh.tag.at and going from -1 to N)
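When the coarse edges lie exactly on the fine ones, the copy loop Henry sketches reduces to summing runs of fine-bin counts. A NumPy-only sketch of that idea (toy counts; ignores flow bins and weighted storages):

```python
import numpy as np

# Toy fine binning: 10 bins on [0, 1] with counts 0..9.
fine_edges = np.arange(11) / 10.0
fine_counts = np.arange(10)

# Coarse edges must coincide exactly with fine edges for rebinning to make sense.
coarse_edges = np.array([0.0, 0.2, 0.5, 1.0])
idx = np.searchsorted(fine_edges, coarse_edges)
assert np.array_equal(fine_edges[idx], coarse_edges), "edges must line up exactly"

# Sum each run of fine bins between consecutive coarse edges.
coarse_counts = np.add.reduceat(fine_counts, idx[:-1])
print(coarse_counts.tolist())  # [1, 9, 35]
```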

Matthew Feickert
@matthewfeickert
@henryiii @HDembinski Is there an API in Boost.Histogram for uncertainties? That is, if you wanted to calculate the weighted uncertainty on each bin in a histogram formed from the sum of histograms (a stack hist) is there a way to ask a Boost.Histogram object for the uncertainties (and weights?) associated with each bin (as a NumPy array or list)?
Matthew Feickert
@matthewfeickert
Or would you use the Accumulators here?
Matthew Feickert
@matthewfeickert

Also, a somewhat unrelated question: What am I not understanding about how Boost.Histogram does the variance? Or am I making a stupid mistake?

import boost_histogram as bh
import numpy as np

# From the Boost.Histogram examples
values = [1] * 9 + [91]
weight = 2
weights = weight * np.ones(len(values))
wm = bh.accumulators.WeightedMean().fill(values, weight=weight)

print("\n# By hand")
print(f"values: {values}")
print(f"weights: {weights}")
_sum_of_weights = sum(weights)
print(f"sum of weights: {_sum_of_weights}")
_sum_of_weights_squared = sum(np.square(weights))
print(f"sum of weights squared: {_sum_of_weights_squared}")
print(f"weighted mean: {np.average(values, weights=weights)}")
print(f"np variance: {np.var(values)}")
_weighted_variance = np.average((values-np.average(values, weights=weights))**2, weights=weights) # Better ways to do this
print(f"weighted variance: {_weighted_variance}")
print("\n# Boost.Histogram accumulator")
print(wm) # Why is the variance different here?

gives

# By hand
values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
sum of weights: 20.0
sum of weights squared: 40.0
weighted mean: 10.0
np variance: 729.0
weighted variance: 729.0

# Boost.Histogram accumulator
WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
Matthew Feickert
@matthewfeickert
Hm. Is Boost.Histogram using a particular approximation scheme for the variance that is highly efficient for larger arrays? I see that if I increase the length of values, this deviation drops to under a percent.
Alexander Held
@alexander-held
The difference between 810 and 729 is sample variance vs population variance https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance.
(The n-1 vs n difference goes away as n -> inf)
Matthew Feickert
@matthewfeickert
Ah yes, that should have been my first thought
thank you @alexander-held! I appreciate you pointing that out for me. :)
Matthew Feickert
@matthewfeickert
For those playing along with my goof,
import boost_histogram as bh
import numpy as np

# From the Boost.Histogram examples
values = [1] * 9 + [91]
weight = 2
weights = weight * np.ones(len(values))
wm = bh.accumulators.WeightedMean().fill(values, weight=weight)

print("\n# By hand")
print(f"values: {values}")
print(f"weights: {weights}")
_sum_of_weights = sum(weights)
print(f"sum of weights: {_sum_of_weights}")
_sum_of_weights_squared = sum(np.square(weights))
print(f"sum of weights squared: {_sum_of_weights_squared}")
print(f"weighted mean: {np.average(values, weights=weights)}")
print(f"np variance: {np.var(values)}")
_weighted_variance = sum(np.square(values-np.average(values, weights=weights)))/(len(values)-1) # Better ways to do this
print(f"weighted variance: {_weighted_variance}")
print("\n# Boost.Histogram accumulator")
print(wm) # Difference _was_ that Boost.Histogram nicely gives an unbiased estimator! :)
# By hand
values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
sum of weights: 20.0
sum of weights squared: 40.0
weighted mean: 10.0
np variance: 729.0
weighted variance: 810.0

# Boost.Histogram accumulator
WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
Matthew Feickert
@matthewfeickert
Though if there is an API in Boost.Histogram or hist for properly handling the uncertainties per bin for summing histograms that would be great. Hopefully Hans and Henry can set me straight on that next week.
Hans Dembinski
@HDembinski
@matthewfeickert Looks like the variance computation issue has been cleared up. Now to your other question: you can make a histogram with a storage that uses the WeightedMean accumulator.
import boost_histogram as bh
h = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.WeightedMean())
h.fill([1], sample=[3], weight=[2])
The first argument is used to look up the cell, then the second argument is fed to the accumulator in that cell with the weight given by the third argument.
After filling, you can query the view() as usual to get values and variances etc.
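For reference, the variance the accumulator reported earlier (810, with sum_of_weights=20 and sum_of_weights_squared=40) can be reproduced with a pure-Python sketch of the same bookkeeping. This is a reconstruction of the formulas, not the library's actual code, and the "effective n" denominator is an assumption that happens to reduce to n - 1 for unit weights:

```python
class ToyWeightedMean:
    """Pure-Python sketch of a weighted-mean accumulator (not boost-histogram)."""

    def __init__(self):
        self.sum_of_weights = 0.0
        self.sum_of_weights_squared = 0.0
        self.value = 0.0   # running weighted mean
        self._swd2 = 0.0   # weighted sum of squared deviations (West's algorithm)

    def fill(self, x, weight=1.0):
        self.sum_of_weights += weight
        self.sum_of_weights_squared += weight ** 2
        delta = x - self.value
        self.value += (weight / self.sum_of_weights) * delta
        self._swd2 += weight * delta * (x - self.value)
        return self

    @property
    def variance(self):
        # Assumed "effective n" correction; equals n - 1 for unit weights.
        n_eff = self.sum_of_weights - self.sum_of_weights_squared / self.sum_of_weights
        return self._swd2 / n_eff

wm = ToyWeightedMean()
for v in [1] * 9 + [91]:
    wm.fill(v, weight=2)

print(wm.sum_of_weights, wm.value, wm.variance)  # 20.0 10.0 810.0
```

With unit weights this reproduces the usual n - 1 sample variance, which is why the accumulator reported 810 rather than 729 above.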