Stupid question: I have two numpy arrays, a and b, and I want to group the elements in b by the unique elements in a:
a = [1, 12, 1, 50]
b = [10, 20, 30, 40]
result == [[10, 30], [20], [40]]
Is there a numpy function that does that? It seems I am missing the right keyword in my searches.
Searching for "numpy groupby" will get you there.
I think awkward array might be able to help instead of pandas (not tested):
import numpy as np, awkward  # JaggedArray is the awkward 0.x API
reorder = np.argsort(a)
_, counts = np.unique(a[reorder], return_counts=True)
result = awkward.JaggedArray.fromcounts(counts, b[reorder])
The only thing I'm unsure of there is the order of the unique counts: I'm assuming the
unique method returns things in the order they're first seen, but I suspect that's not true.
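For what it's worth, np.unique returns the unique values sorted in ascending order, not in first-seen order. Because a is sorted before the call, the counts still line up with b[reorder]. Here is a minimal pure-numpy sketch of the same grouping, without awkward; the names split_points and groups are made up for this example:

```python
import numpy as np

a = np.array([1, 12, 1, 50])
b = np.array([10, 20, 30, 40])

# A stable sort keeps elements of b with equal keys in their original order.
reorder = np.argsort(a, kind="stable")

# np.unique returns the unique values sorted ascending, so the counts are in
# the same order as the runs of equal keys in a[reorder].
keys, counts = np.unique(a[reorder], return_counts=True)

# Cut the reordered b at the end of every run except the last.
split_points = np.cumsum(counts)[:-1]
groups = [g.tolist() for g in np.split(b[reorder], split_points)]

print(keys.tolist())  # [1, 12, 50]
print(groups)         # [[10, 30], [20], [40]]
```

The result is a plain Python list of lists, which is fine for small inputs; for large arrays a jagged-array type avoids the per-group Python objects.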
We are pleased to announce the second "Python in HEP" workshop organised by the HEP Software Foundation (HSF). The PyHEP, "Python in HEP", workshops aim to provide an environment to discuss and promote the usage of Python in the HEP community at large.
PyHEP 2019 will be held in Abingdon, near Oxford, United Kingdom, from 16-18 October 2019.
The workshop will be a forum for the participants and the community at large to discuss developments of Python packages and tools, exchange experiences, and steer where the community needs and wants to go. There will be ample time for discussion.
The agenda will be composed of plenary sessions, with the following highlights:
1) A keynote presentation from the Data Science domain.
2) A topical session on histogramming including a talk and a hands-on tutorial.
3) Lightning talks from participants.
4) Presentations following up from topics discussed at PyHEP 2018.
We encourage community members to propose presentations on any topic (email: firstname.lastname@example.org). We are particularly interested in new(-ish) packages of broad relevance.
The agenda will be made available on the workshop indico page (https://indico.cern.ch/event/833895/) in due time. It is also linked from the PyHEP WG homepage http://hepsoftwarefoundation.org/activities/pyhep.html.
Registration will open very soon, and we will provide detailed travel and accommodation information at that time.
Travel funds may be available at a modest level. To be confirmed once registration opens.
You are encouraged to join the PyHEP WG Gitter channel (https://gitter.im/HSF/PyHEP) and/or the HSF forum (https://groups.google.com/forum/#!forum/hsf-forum) to receive further information concerning the organisation of the workshop.
Looking forward to your participation!
Eduardo Rodrigues & Ben Krikler, for the organising committee
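import numpy as np, awkward
a = np.array([1, 12, 1, 10, 50, 10])
b = np.array([10, 20, 30, 40, 50, 60])
# stable sort so equal keys keep their original relative order
arg = a.argsort(kind='stable')
# start index of every run of equal keys; the trailing True adds the final offset len(a)
offsets, = np.where(np.r_[True, np.diff(a[arg]) > 0, True])
output = awkward.JaggedArray.fromoffsets(offsets.flatten(), awkward.IndexedArray(arg, b))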
np.where([0, 1, 0, 0, 1])[0].base is surprisingly 2d (hence the flatten)
Since the knowledge in this channel proved invaluable before, another question :) I have two arrays
group_1 = np.array([(1, 2), (3, 3), (5, 7), (4, 4)])
test_elements = np.array([(1, 2), (3, 3), (3, 5)])
and would like to test if the elements in test_elements are in group_1. I expect the result
[True, True, False]
as I take the tuples as unique objects.
Numpy has the function np.isin, but
np.isin(group_1, test_elements) # returns [[True, True], [True, True], [True, False], [False, False]]
OK, so this is inverse to what I want, fine.
np.isin(test_elements, group_1) # returns [[True, True], [True, True], [True, True]]
Clearly it compares element by element, and since both 3 and 5 are contained, (3, 5) should be as well, right?
Well, not in my case. Is there a way to do this comparison for each 2-vector instead of element-wise? A for loop (even with numba) is quite slow.
Yes, you can do that. The idea is to compare every element of one array with every element of the other. This gives you a rank-three boolean array with one axis each for: number of elements in the group, number of elements to test, dimension of an element. Then apply two reduce operations: 1. an all over the tuple axis, which is true only where every entry of a tuple matches, and 2. an any over the group axis, since at least one group tuple has to match completely.
For example (may change the axis for convenience):
test_elements_expanded = np.expand_dims(test_elements, axis=1)  # shape (3, 1, 2)
entries_equal = group_1 == test_elements_expanded  # broadcasts to shape (3, 4, 2)
tuple_equal = np.all(entries_equal, axis=2)  # shape (3, 4)
tuple_contained = np.any(tuple_equal, axis=1)  # shape (3,): [True, True, False]
Watch the axes, they're easy to get off :( Reducing axis 2 with all gives true where all elements in a tuple are true, otherwise false; reducing axis 1 with any means that an entry with at least one matching tuple is true. So you're left with axis 0, one entry per test element.
You can also map the tuples to scalars (easy if you have some idea what the values are going to be), e.g.:
def squash(x):
    # unique only if 0 <= x[:, 1] < 10000
    return 10000 * x[:, 0] + x[:, 1]
np.isin(squash(test_elements), squash(group_1))  # np.in1d in older numpy
This should be more memory-efficient and faster if both arrays are large.
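If the value range is not known in advance, a fixed multiplier can collide: with 10000, both (0, 10000) and (1, 0) map to 10000. One possible alternative, not taken from the discussion above, is to let np.unique with axis=0 assign an integer id to each distinct row. This is a sketch only, and row_ids is a hypothetical helper name:

```python
import numpy as np

group_1 = np.array([(1, 2), (3, 3), (5, 7), (4, 4)])
test_elements = np.array([(1, 2), (3, 3), (3, 5)])

def row_ids(*arrays):
    """Give every distinct row across all inputs one integer id (hypothetical helper)."""
    stacked = np.concatenate(arrays, axis=0)
    # return_inverse maps each row to the index of its unique row.
    _, inverse = np.unique(stacked, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep it 1-D regardless of NumPy version
    # Split the ids back into one chunk per input array.
    lengths = [len(arr) for arr in arrays]
    return np.split(inverse, np.cumsum(lengths)[:-1])

group_ids, test_ids = row_ids(group_1, test_elements)
contained = np.isin(test_ids, group_ids)
print(contained.tolist())  # [True, True, False]
```

This costs a sort of all rows, so it is slower than the arithmetic squash when a safe multiplier is known.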
If you are going to do membership testing repeatedly on the same array, it might be even better to convert it to a set, dictionary or some other object backed by a hash table, so membership tests are a constant time operation.
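A minimal sketch of the hash-table variant, assuming the arrays are small enough that converting rows to Python tuples is acceptable; group_set is just a name chosen for illustration:

```python
import numpy as np

group_1 = np.array([(1, 2), (3, 3), (5, 7), (4, 4)])
test_elements = np.array([(1, 2), (3, 3), (3, 5)])

# Build the set once; tuples of Python ints are hashable, numpy rows are not.
group_set = set(map(tuple, group_1.tolist()))

# Each lookup is then O(1) on average, independent of len(group_1).
contained = np.array([tuple(row) in group_set for row in test_elements.tolist()])
print(contained.tolist())  # [True, True, False]
```

The loop over the test elements still runs in Python, so this mostly pays off when the same group is queried many times.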
Comparing the two solutions:
yours: 28366.0 microseconds
stackoverflow: 10999.0 microseconds
So essentially the same speed for this exact problem. At this point it matters whether you call it once or a million times, and how big your array really is. That's when things like presorting can make the difference. My advice: use whichever method you understand/like better (in terms of the concept, not the speed) and only try to improve on it if it proves to be a bottleneck.