## Where communities thrive

• Join over 1.5M+ people
• Join over 100K+ communities
• Free without limits
##### Activity
Matthew Feickert
@matthewfeickert

Also, a somewhat unrelated question: What am I not understanding about how Boost.Histogram does the variance? Or am I making a stupid mistake?

import boost_histogram as bh
import numpy as np

# From the Boost.Histogram examples
values = [1]*9 + [91]
weight=2
weights=weight*np.ones(len(values))
wm = bh.accumulators.WeightedMean().fill(values, weight=weight)

print("\n# By hand")
print(f"values: {values}")
print(f"weights: {weights}")
_sum_of_weights = sum(weights)
print(f"sum of weights: {_sum_of_weights}")
_sum_of_weights_squared = sum(np.square(weights))
print(f"sum of weights squared: {_sum_of_weights_squared}")
print(f"weighted mean: {np.average(values, weights=weights)}")
print(f"np variance: {np.var(values)}")
_weighted_variance = np.average((values-np.average(values, weights=weights))**2, weights=weights) # Better ways to do this
print(f"weighted variance: {_weighted_variance}")
print("\n# Boost.Histogram accumulator")
print(wm) # Why is the variance different here?

gives

# By hand
values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
sum of weights: 20.0
sum of weights squared: 40.0
weighted mean: 10.0
np variance: 729.0
weighted variance: 729.0

# Boost.Histogram accumulator
WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
Matthew Feickert
@matthewfeickert
Hm. Is Boost.Histogram using a particular approximation scheme for the variance that is highly efficient for larger arrays? I see if I increase the length of values this deviation drops to under a percent.
Alexander Held
@alexander-held
The difference between 810 and 729 is sample variance vs population variance https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance.
(The n-1 vs n difference goes away as n -> inf)
Matthew Feickert
@matthewfeickert
Ah yes, that should have been my first thought
thank you @alexander-held! I appreciate you pointing that out for me. :)
Matthew Feickert
@matthewfeickert
For those playing along with my goof,
import boost_histogram as bh
import numpy as np

# From the Boost.Histogram examples
values = [1]*9 + [91]
weight=2
weights=weight*np.ones(len(values))
wm = bh.accumulators.WeightedMean().fill(values, weight=weight)

print("\n# By hand")
print(f"values: {values}")
print(f"weights: {weights}")
_sum_of_weights = sum(weights)
print(f"sum of weights: {_sum_of_weights}")
_sum_of_weights_squared = sum(np.square(weights))
print(f"sum of weights squared: {_sum_of_weights_squared}")
print(f"weighted mean: {np.average(values, weights=weights)}")
print(f"np variance: {np.var(values)}")
_weighted_variance = sum(np.square(values-np.average(values, weights=weights)))/(len(values)-1) # Better ways to do this
print(f"weighted variance: {_weighted_variance}")
print("\n# Boost.Histogram accumulator")
print(wm) # Difference _was_ that Boost.Histogram nicely gives an unbiased estiamtor! :)
# By hand
values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
weights: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
sum of weights: 20.0
sum of weights squared: 40.0
weighted mean: 10.0
np variance: 729.0
weighted variance: 810.0

# Boost.Histogram accumulator
WeightedMean(sum_of_weights=20, sum_of_weights_squared=40, value=10, variance=810)
Matthew Feickert
@matthewfeickert
Though if there is an API in Boost.Histogram or hist for properly handling the uncertainties per bin for summing histograms that would be great. Hopefully Hans and Henry can set me straight on that next week.
Hans Dembinski
@HDembinski
@matthewfeickert Looks like the variance computation issue has been cleared up, now to your other question, you can make a histogram with a storage that uses the WeightedMean accumulators.
import boost_histogram as bh
h = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.WeightedMean())
h.fill([1], sample=[3], weight=[2])
The first argument is used to look up the cell, then the second argument is feed to the accumulator in that cell with the weight given by the third argument.
After filling, you can query the view() as usual to get values and variances etc.
Matthew Feickert
@matthewfeickert

Hm. Okay I'll poke around at this later tonight, but does that mean that I need to make a second histogram to keep track of the uncertainties on my first histogram? For instance,

import hist
from hist import Hist
import uproot4 as uproot

root_file = uproot.open("example.root")
values, edges = root_file["data"].to_numpy()

_hist = Hist(
hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
storage=hist.storage.Double(),
)
_hist[:] = values
_hist.view()

giving

array([  0.,   0.,   0.,   0.,   1.,  16.,  95., 272., 459., 591., 674.,
658., 623., 531., 456., 389., 321., 254., 194., 161., 120.,  95.,
74.,  60.,  53.,  38.,  27.,  19.,  20.,  15.,  11.,  10.,  14.,
5.,   6.,   3.,   7.,   2.,   6.,   1.,   3.,   2.,   3.,   0.,
0.,   1.,   0.,   0.,   0.,   0.])

if I wanted to now get the uncertainties on this histogram (so that I could then add them in quadrature while summing this "data" histogram with others to get the uncertainty on the resultant summed histogram) I'm not quite clear how I would do so.

Alexander Held
@alexander-held
I think you want storage=bh.storage.Weight() to keep track of variance
example:
import boost_histogram as bh
import numpy as np

bins = [1, 2, 3]
yields = np.asarray([3, 4])
stdev = np.asarray([1, 2])
h = bh.Histogram(
bh.axis.Variable(bins, underflow=False, overflow=False),
storage=bh.storage.Weight(),
)
h[...] = np.stack([yields, stdev ** 2], axis=-1)

print(f"yields: {h.view().value}, stdev: {np.sqrt(h.view().variance)}")
Henry Schreiner
@henryiii
Sorry for the delay, had family over this weekend, and my cat had/has serious health problems. Yes, you want a weighed histogram. Currently, weighted histograms don’t plot or interact nicely with the rest of the Scikit-HEP ecosystem, but a Protocol to allow this is in the works and should be rougly ready by the end of the month. There’t not a quick Double -> Weighted conversion function, but I could add one for Hist. (Generally, some conversions don’t make sense, and this one has to be used carefully).
Matthew Feickert
@matthewfeickert
thanks @alexander-held, I'll poke at this tonight too, but this indeed seems to be what I'm after (as Henry mentioned). @henryiii no need to apologize — I'm never expecting a reply over a weekend (also sorry to hear about the cat)
Matthew Feickert
@matthewfeickert

Currently, weighted histograms don’t plot or interact nicely with the rest of the Scikit-HEP ecosystem, but a Protocol to allow this is in the works and should be rougly ready by the end of the month.

@henryiii That's great and would be quite helpful to be able to use as defaults for https://github.com/matthewfeickert/heputils/blob/master/src/heputils/convert.py

Is there an Issue open at the moment that I could follow for this? I realize I'm also making a lot of requests without volunteering to help on anything, so apologies there, but maybe by 2021 I'll understand enough that I can start making some more helpful Issues or easy PRs.

Hans Dembinski
@HDembinski
@matthewfeickert Don't worry about it, I am sure you have your hands full with pyhf, which seems to get very popular. I highly value feedback from other developers within Scikit-HEP and I am sure the others do as well. Since we are all skilled devs, we can find/discuss some issues that normal users will never think about (or do not bother to ask, depending on the subject).
I got confused by your original example, where you computed the sum of weighted values. That's a profile in ROOT language, so I explained how to make a profile. But you actually want just a sum of weights and then you need to use the WeightedSum storage as others explained. In the latter case, there are no values to be weighted, just a sum of weights is collected and the variance of that sum is given by sum of weights squared, which this storage computes for you.
Henry Schreiner
@henryiii
@matthewfeickert The work on uproot is here, and the issue is scikit-hep/boost-histogram#423. I realized on inspection of Jim’s work that I’ve really muddled some of the values for different storages, so I’ll be tweaking and clarifing after the AS meeting today.
Henry Schreiner
@henryiii
I’m probably about to push to make Boost-Histogram 1.0, probably sometime this month. I’ve discussed numbering and Python versions a bit, but I’m beginning to think the following plan might be best: I’ll make all the changes for 1.0, and make a release (say, 0.12). Then I’ll strip Python 2 and Python 3.5 support, and deprecated features, and make an otherwise API compatible 1.0 release. That way, going forward we don’t have to maintain support for Python 2 for most new features, and reduce the large number of wheels we have to push for a release, but the “final” API of 1.0 is available in 0.x, so if we do pick up major Python 2 users, like experiment stacks, they are not using an outdated API (well, unless they ignore FutureWarnings, that is ;) ). If we really had to, we could even release 0.13 along with 1.1 (would need serious and desperate users, but the option would be open) @HDembinski I think this is closer to your original idea, as well; thoughts? @jpivarski & @eduardo-rodrigues ?
Hans Dembinski
@HDembinski
Sounds good to me. We settled the Plottable protocol and now also how slices work, was there anything else that was controversial for the interface?
Henry Schreiner
@henryiii
I don’t think so. The list here is pretty complete: https://github.com/scikit-hep/boost-histogram/milestone/7 In fact, indexing a list of categories will likely wait till Boost.Histogram has support, so that will move to post 1.0. Addable weight view already is implemented, AFAIK.
Actually, there is one more unification that I have that you might like, but I have to play with it a bit to make sure it works.
So expect one more issue or PR. But it doesn’t really change the interface, but unify some things behind the scenes and enable a little bit more than before.
Hans Dembinski
@HDembinski
ok
If the implementation can be simplified or made more uniform, great
Eduardo Rodrigues
@eduardo-rodrigues
Hi @henryiii, I think your proposal sounds very reasonable.
Henry Schreiner
@henryiii
I’m giving GitHub Discussions a try! Feel free to open discussions at https://github.com/scikit-hep/boost-histogram/discussions
Hans Dembinski
@HDembinski
@henryiii Could you perhaps try it out with another project? In terms of community management, it is best to point people to one place for feedback.
By adding https://github.com/scikit-hep/boost-histogram/discussions you are splitting the community and I don't want that.
Alexander Held
@alexander-held
What is that one place you describe, gitter? To me gitter does not seem suited for focused discussions - the history is hard to read, even more so when one comes back after a month and tries to find a topic discussed previously.
Henry Schreiner
@henryiii

I don’t think it’s really a replacement for gitter; it’s more of a replacement for a) stack overflow, and b) opening issues for help instead of a bug. It’s also all in one place - GitHub, so less fragmented. It’s really just a bit more orginisation on issues, with some minor added features. An issue closes - you aren’t supposed to care about the issue anymore when it’s fixed. But if it was opened as a discussion, or moved to a discussion, in the Q&A category, then it just gets an answer, but remains easy to find, so it becomes less likely to become a recurring quesiton. The “showcase” topic looks great - people can post thier most intersting and exciting examples of boost-histogram.

And while it’s new to me, other members of our comunity have been trying them out already (the PyHF repo has had them on for over a month, for example).

This Gitter is mostly a few dedicated poeple (mostly devs), and while gitter has threaded convos, nobody really uses them. And there’s no reasonable search. It’s rather useful for announcements, but it’s taking off for the type of help discussions I would have hoped to see. Pybind11 is a great example of a vibrant Gitter community - but there, the same set of questions tend to get asked over and over.
Alexander Held
@alexander-held
Oh, is it possible to move an issue over to make it a discussion instead? That is very nice.
Henry Schreiner
@henryiii
Yep, a ton of Awkward issues got moved over. And a few Uproot ones. You can even move them by label. :) Also, you can move them to a different repo if they were opened in yours but really were for a library higher up the chain.
Alexander Held
@alexander-held
This is really useful, I like that a lot.
Henry Schreiner
@henryiii
Imagine someone opens a feature request (called Ideas) discussion in boost-histogram. It’s for something we don’t think is basic/universal enough for boost-histogram, so we move it to Hist. :)
Hans Dembinski
@HDembinski

What is that one place you describe, gitter? To me gitter does not seem suited for focused discussions - the history is hard to read, even more so when one comes back after a month and tries to find a topic discussed previously.

Yes, I was referring to Gitter as the place to communicate with users.

Alexander Held
@alexander-held
I see usefulness in both Gitter and GH Discussions:
• Gitter for issues that require some back-and-fourth for a short amount of time and are likely very user-specific (e.g.: my compilation fails and I don't know how to fix it), if that gets lost in the history it is probably ok since other users may not have the same question
• Discussions for questions that lots of users may have (and then in the future find there answered, like StackOverflow), or questions that take a long time to answer (to have one consistent thread there instead of being spread out in Gitter between other topics). I also think the "Show and tell" category could be quite nice, and a message like that would disappear quickly in Gitter. And I personally like the idea of everything in one place (repository) a lot.
Henry Schreiner
@henryiii
I’d see gitter as more for announcements and for developer discussions (scroll back in the history, it’s pretty much the same 4-5 people with maybe a couple of user questions), while most users usually open an issue (there are more question issues than questions in gitter, I believe). Discussions gives them a safer, nicer place to open these things that hopefully won’t be as large of a barier, and will be more discoverable after they are answered. But they should be seen mostly as better orginization and a few extra features like threads and answers on top of issues.
Andrzej Novak
@andrzejnovak
I haven't had the chance to use discussions much in mplhep yet, pretty quiet these days, but I am a big supporter. It's a great place for "not quite issue"'s. Gitter still works when one has a question that's more random, but it might as well not keep history past few days
Henry Schreiner
@henryiii
With gitter you also tend to “subscribe” to the whole thing, rather than just quickly ask a single question and follow just that.
Hans Dembinski
@HDembinski
Fine, what I am not interested in is proliferation of places that I have to check for user feedback.
Jim Pivarski
@jpivarski
I set them all up with email notifications. That way, I don't have to explicitly remember to check.
Hans Dembinski
@HDembinski
Hi. Thanks for the cool package, I am trying to use it for my new plots. Is there any canoncial/convenient way to do stacked histogram plots, similar to plt.histogram(stacked=True)? So far I just used that to plot my boost histograms:
My histogram is having a Regular and a strCategory axis and I want to show the cumulative sum of the categories, but with the different components stacked on top of each other.
I found out I can just use np.cumsum(axis=<category axis>) to calculate the cumulative sum and create a new histogram from that. In had also managed to draw a stacked histogram by plotting multiple plt.bar with different bottom=... arguments taken from the cumulative sums. But I dislike that visualy, because I want to have a black outline for my components, but no outlines for the individual bins, as seen in the screenshot.