_getitem is only ever called at the top level, right? If so, we can move the slow-path handling into _getitem_generic or something to make it available.
_cumsum_next, and I'm now in the getitem code 😂
What about fields like charge that do sum in a simple way? What I mean concretely is that p1[0][0].charge and p1[0][1].charge both exist, but (p1[0][0] + p1[0][1]) has no .charge attribute anymore. How does this relate to the idea of passing things through where possible (last paragraph of that section)?
Vector knows about Lorentz vectors, which don't have a charge. Maybe it was a misleading example, since it's not too crazy to consider behaviors that are Lorentz vectors plus additive quantum numbers like charge (maybe just charge, since it's the only one that's a direct experimental observable in NHEP). In fact, that's what a "Candidate" is in Coffea. I'm pretty sure that the "Candidate" type adds charges.
Maybe a better example would have been "lepton_isolation", a quantity that would make no sense to add when you add two particles. Here, we want the Vector library (and Coffea) to be okay with a record having these fields, but it shouldn't blindly add them when particles are added. Similarly for other (vec) → vec and (vec, vec) → vec operations. It should drop them.
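A hedged sketch of how such a drop-on-add rule can be expressed with a ufunc override in the behavior; the "myparticle" name and fields are hypothetical, and this is not Vector's or Coffea's actual code:

import awkward as ak
import numpy as np

def add_particles(left, right):
    # keep only the fields that make sense to sum; drop e.g. lepton_isolation
    return ak.zip({"px": left.px + right.px, "py": left.py + right.py},
                  with_name="myparticle")

ak.behavior[np.add, "myparticle", "myparticle"] = add_particles

p = ak.Array(
    [{"px": 1.0, "py": 2.0, "lepton_isolation": 0.1},
     {"px": 3.0, "py": 4.0, "lepton_isolation": 0.2}],
    with_name="myparticle",
)
total = p + p        # goes through add_particles
print(total.fields)  # ['px', 'py']: lepton_isolation is dropped, not blindly added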
I realize you weren't there, but @agoose77:matrix.org asked the same question, and my answer about "modern languages" was leaning toward the distinction between structural types and nominative types.
Nominative types are classic OOP: such-and-such a method needs a type named XYZ or one of XYZ's subtypes (subclasses). That's a little looser than "only one type allowed," but not much. Downstream library developers must make subclasses of your library's classes, and that dependency can lead to issues, especially when you want non-strict dependencies in Python. That's why protocols, like NumPy's "write a method named __array_ufunc__," rather than inheritance, like Pandas's "inherit from pd.ExtensionArray," are looser and more convenient.
Structural types are loose like the Python protocols I mentioned in the last paragraph. If a function needs "x", "y", and "z", then any type that provides "x", "y", and "z" can be used by the function. In a dynamically typed world like Python, that is duck typing. It's still possible in a type-checked world: the type-checking pass verifies that you have the required methods, such as a method with the correct name and signature, and if so, it passes. If it's also a compiled language, then it may need to compile different, specialized code for each usage of the function with different input types.
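For a concrete, non-Awkward illustration of structural typing in a type-checked setting, Python's typing.Protocol expresses exactly this (a minimal sketch):

from typing import Protocol

class HasXYZ(Protocol):
    x: float
    y: float
    z: float

def norm(v: HasXYZ) -> float:
    # any object providing x, y, and z satisfies the protocol; no inheritance needed
    return (v.x**2 + v.y**2 + v.z**2) ** 0.5

class FreeVector:  # never imports or subclasses HasXYZ
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

print(norm(FreeVector(1.0, 2.0, 2.0)))  # 3.0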
Awkward Array was designed with these ideas in mind. It was intended to be primarily structurally typed, with parameters as an opt-in for nominatively typing specific cases. Since nominative typing can be restrictive, the behavior is a mutable dict and different arrays can have different behaviors. In retrospect, it might be a bit too free, because a common error is for a behavior to silently fail to match. It's probably a good idea to usually keep a constant, global ak.behavior to avoid confusion.
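A minimal sketch of that global-registration pattern, using a hypothetical "point" record name:

import awkward as ak

class PointArray(ak.Array):
    @property
    def r(self):
        return (self.x**2 + self.y**2) ** 0.5

# register once in the global behavior, so any array of records named "point" picks it up
ak.behavior["*", "point"] = PointArray

points = ak.Array([{"x": 3.0, "y": 4.0}, {"x": 1.0, "y": 0.0}], with_name="point")
print(points.r)  # [5, 1]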
What are the is_UnionType et al. flags for? I've taken them to be the high-level representation of the layout types that Awkward provides: smoothing over the details of e.g. how the list types are implemented. But if that were the case, then I'd consider removing is_IndexedType, as it is actually not a high-level type but rather an implementation detail.
v1 had a lot of isinstance checks against groups of Content subclasses (especially in _util, in broadcast_and_apply). Those checks were verbose and not exactly the right Venn diagrams, forcing multiple isinstance checks combined with and and or.
These is_UnionType, etc. flags are v2's improvement: the checks are now things like is_IndexedType or is_OptionType, which is succinct and makes it easier to form the right Venn diagrams. Any subclasses of these layout nodes (if that should ever happen; not planning on it) would work just as well.
That's what they're there for. If we remove them, a lot of if-statements are going to get more complicated.
Sure, I remember the isinstance (ab)use 😀 So, AFAICT we've replaced ak._util.listtypes with layout.is_ListType etc.? If so, I'm still not sure we need IndexedType; it was only needed because we had templated layouts: (awkward._ext.IndexedArray32, awkward._ext.IndexedArrayU32, awkward._ext.IndexedArray64). I'm not proposing we remove is_Indexed though; I'm just trying to get my head around the design goals 🙂
from_iter, which dropped the type information for a zero-length array.
Maybe we could support a depth_limit=0 argument to ak.zip to build a Record, but I don't know how terrible that would be UX-wise.
The case for these flags (replacing ak._util.listtypes) was stronger when there were several integer-specialized types for each sort of node, since that would be a lot of isinstance checks. Now that there's only one ak._v2.contents.IndexedArray, it's less necessary, but still helpful.
ak.zip({'x': x[np.newaxis, ...], 'y': y[np.newaxis, ...]}, depth_limit=1)[0]
but where's the fun in that 😉
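Expanded into a runnable form, with small placeholder arrays standing in for x and y:

import awkward as ak
import numpy as np

x = ak.Array([1.1, 2.2, 3.3])
y = ak.Array([4, 5, 6])

# add a length-1 outer dimension, zip without broadcasting any deeper, then take element 0
record = ak.zip({"x": x[np.newaxis, ...], "y": y[np.newaxis, ...]}, depth_limit=1)[0]
print(record.x, record.y)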
ak._v2.Record works, but by passing the {"x": x, "y": y} dict to from_iter, it goes through ArrayBuilder and gets treated like a Python object rather than an Awkward Array.
Between ak._v2.Record and ak.layout.Record, I'm not sure which you're addressing.
Hello, I have a question related to array manipulation and behavior "propagation". I have an Array structured in this way:
<RecHitEBArray [[{sumet: 0, ... status: 13}]] type='252 * [var * RecHitEB["sumet...'>
and I would like to sort it differently, achieving something similar to a pandas groupby operation. To do this I unflatten the original array (runs.EcalPhiSymEB) using splits defined from numpy.unique: ak.unflatten(runs.EcalPhiSymEB, splits, axis=0). This works fine as far as the grouping is concerned, but the returned Array does not retain the original behavior (except for the innermost dimension):
<Array [[[{sumet: 0, ... status: 13}]]] type='47 * var * [var * RecHitEB["sumet"...'>
I tried to specify the original behavior in the ak.unflatten call, but that also didn't work.
Is there a way to retain the RecHitEBArray behavior following the transformation?
Thanks a lot
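As a runnable stand-in for that grouping step (hypothetical run numbers and hits in place of runs.EcalPhiSymEB; the run numbers are assumed to be sorted):

import awkward as ak
import numpy as np

run_numbers = np.array([1, 1, 2, 3, 3])   # one entry per event, already sorted
hits = ak.Array([[{"sumet": 0.0}], [{"sumet": 1.0}],
                 [{"sumet": 2.0}], [], [{"sumet": 3.0}]])

# the count of each unique run number becomes a split for unflatten
_, splits = np.unique(run_numbers, return_counts=True)
grouped = ak.unflatten(hits, splits, axis=0)   # 3 * var * var * {sumet: float64}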
How did you register RecHitEBArray in the behavior?
import awkward as ak
import numpy as np

class ThisArray(ak.Array):
    def that(self):
        print("this.that()!")

behavior = {("*", "this"): ThisArray}

array = ak.Array({"x": [[1, 2, 3], [4], [5, 6]]}, with_name="this", behavior=behavior)
next_array = ak.unflatten(array, [2, 1, 1, 1, 1], axis=-1)
assert isinstance(next_array, ThisArray)
Yes! I had this in mind, but I'm trying to get everything out of my mind and onto a formal roadmap, so here it is: https://github.com/scikit-hep/awkward-1.0/wiki#layoutbuilder-in-numba
This would be a great way to learn about how Numba works, by the way. It wouldn't be as hard as ak.Array → Numba and there's a lot of preexisting code to use as guides.
Actually, I was thinking it would take a few weeks, so that's a bit of an underestimate!
It would have to implement equivalents of all of the functions you see here: https://github.com/scikit-hep/awkward-1.0/blob/main/src/awkward/_connect/_numba/builder.py
But it can't do it just by calling a function pointer, as ArrayBuilder does: https://github.com/scikit-hep/awkward-1.0/blob/fb6aa208098ae42df82db080328b3bbb5d79671a/src/awkward/_connect/_numba/builder.py#L94-L106
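For reference, this is roughly what existing ArrayBuilder-in-Numba usage looks like, which is the kind of interface a LayoutBuilder equivalent would need to reproduce (a sketch; the snake_case method names are assumed here, and older releases spell them beginlist/endlist):

import awkward as ak
import numba as nb

@nb.njit
def fill(builder, n):
    for i in range(n):
        builder.begin_list()
        for j in range(i):
            builder.integer(j)
        builder.end_list()
    return builder

builder = fill(ak.ArrayBuilder(), 4)
array = builder.snapshot()   # [[], [0], [0, 1], [0, 1, 2]]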
However, generating Python code as a string and using Numba's high-level @numba.extending.overload would be an option, similar to this: https://github.com/scikit-hep/awkward-1.0/blob/fb6aa208098ae42df82db080328b3bbb5d79671a/src/awkward/_connect/_numba/arrayview.py#L1111-L1177
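As a generic illustration of that mechanism (not Awkward's actual code), numba.extending.overload registers a compilable implementation for a plain Python function:

import numba
from numba.extending import overload

def clip01(x):
    # plain-Python reference implementation
    return min(max(x, 0.0), 1.0)

@overload(clip01)
def _clip01_overload(x):
    # called at compile time; return the specialized implementation for Numba to compile
    if isinstance(x, numba.types.Number):
        def impl(x):
            return min(max(x, 0.0), 1.0)
        return impl

@numba.njit
def use(value):
    return clip01(value)

print(use(1.5))  # 1.0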
I don't want to scare anybody away—this would be a good project for a newcomer to Numba, though it would take a few weeks and you'd go away being a Numba expert.
+- 1_000% 🙃
Why can't we just call the pointers as we do for ArrayBuilder? This is where not being so well versed in what we're actually doing in the Numba layer hampers me, but I assumed that at a high level it's a similar job?
The main reason one would use LayoutBuilder instead of ArrayBuilder is for speed (beyond that, there's also the ability to make a numeric column have a non-64-bit type without converting it with ak.values_astype after the fact).
Going through external function pointers should in principle slow it down because LLVM can't inline and optimize the function body with the surrounding code, but in practice the slowdown is much larger than I would expect. There's something odd about running code in a shared library that I don't understand that makes it much slower than I think it ought to be. Doing this in cppyy or ROOT's C++ JIT is even slower than with Numba's JIT. Mysteries abound, but I don't worry about it too much because ArrayBuilder has other bottlenecks.
In addition, LayoutBuilder is currently implemented as a shim that generates AwkwardForth. AwkwardForth is not as fast as compiled code (by a factor of 2× to 5×), but faster than any other runtime-interpreted code I know of. That makes it the best choice when you don't have a JIT compiler available. It's not the best choice when you do have a JIT compiler available. It's looking like the only applications that can take advantage of a LayoutBuilder are ones that involve JIT somewhere in the chain (after all, you have to iterate over the rowwise data that you're feeding to the LayoutBuilder), so LayoutBuilder itself is getting reimplemented in JIT-compiled C++ this summer. (That's the project just above LayoutBuilder-Numba in the road map.) With the new LayoutBuilder generated on the fly, there wouldn't be a stable set of pointers to point to.
(Also worth noting: LayoutBuilder through AwkwardForth is not quite a factor of 2× better than ArrayBuilder, as is.)
So the short of it is that wrapping the current LayoutBuilder's external pointers in Numba calls is far enough from optimal that it wouldn't be worth doing.
Another (strong) benefit of using LayoutBuilder is that your output is guaranteed to have the correct form. I have some special cases in my own analysis tools where I walk the output and replace the EmptyArray with the corresponding zero-sized layouts. If LayoutBuilder worked in Numba, I'd be able to remove those!
AwkwardForth is not as fast as compiled code (by a factor of 2× to 5×), but faster than any other runtime-interpreted code I know of.
Still, that's impressive perf. For most people I imagine that's more than enough to make it usable.
LayoutBuilder through AwkwardForth is not quite a factor of 2× better than ArrayBuilder
Huh, that is interesting.
Nevertheless, is the JIT compiler planned to be cppyy?
With the new LayoutBuilder generated on the fly, there wouldn't be a stable set of pointers to point to.
Ah, that explains it.
We ought to have a function for filling in EmptyArray. Or no, this: scikit-hep/awkward-1.0#516 and let people write it themselves.
We're targeting two JIT compilers: cppyy in ROOT and cppyy outside of ROOT. From our end, we just generate the C++ strings and have Python code that knows how to invoke both interfaces.
recursively_apply is public, and it addresses Martin & Doug's issue on #awkward-dask, but it won't help cases involving more than one array. The public function should decide whether to do recursively_apply or broadcast_and_apply based on whether there is one input array or several, since the interface is now (in v2) the same between them. And this public function will require significant documentation; how it works is not easy to guess.