    Angus Hollands
    @agoose77:matrix.org
    [m]
    I should be more precise. I think, if you're in favor, it might be worth promoting to tuple for all fast-path cases, so that it still works if a one-length tuple is passed by the user
    I.e. what I wrote above - ensuring that we only deal with tuples, and use the generalised handling where a fast path doesn't exist
    Jim Pivarski
    @jpivarski
    I agree that a user would expect a length-1 tuple to act like a non-tuple. And they do (there's that test proving it), but they just don't run at the same speed (disregarding the semantic difference in view vs copy, since ak.Arrays are immutable).
    I've been thinking we would eventually do that, though it would make the slow path harder to test.
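A minimal sketch of the tuple-promotion idea discussed above, using a toy list-backed class rather than Awkward's actual getitem code. `ToyArray` and `_getitem_fast` are hypothetical names, and `_getitem_generic` is only the name floated in this chat, not a confirmed function.

```python
class ToyArray:
    """Toy stand-in for an array type; not Awkward's implementation."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, where):
        # Promote every non-tuple slice to a length-1 tuple so that
        # array[i] and array[(i,)] take exactly the same code path.
        if not isinstance(where, tuple):
            where = (where,)
        return self._getitem(where)

    def _getitem(self, where):
        # Fast path: a single integer or slice.
        if len(where) == 1 and isinstance(where[0], (int, slice)):
            return self._getitem_fast(where[0])
        # Everything else falls through to the generalised handling.
        return self._getitem_generic(where)

    def _getitem_fast(self, item):
        return self._data[item]

    def _getitem_generic(self, where):
        # Hypothetical slow path: apply each slice item in turn.
        out = self._data
        for item in where:
            out = out[item]
        return out


array = ToyArray([[1, 2], [3, 4]])
same_result = array[1] == array[(1,)]  # both go through the fast path
```

Because the promotion happens once at the top level, the slow path stays reachable on its own (here via `_getitem_generic`), which is one way to keep it testable.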
    Angus Hollands
    @agoose77:matrix.org
    [m]
    _getitem is only ever called at the top-level right? If so, we can move the slow-path handling into _getitem_generic or something to make it available
    I need to really just focus on one issue at a time though- I started looking at _cumsum_next and I'm now in the getitem code 😂
    Jim Pivarski
    @jpivarski
    __getitem__ is only ever called by users (now that I did some clean-up, creating _getitem), but _getitem and maybe its helper functions can call _getitem. The recursion, which starts with _getitem_next and gets deeper from there, does not call _getitem.
    Sorry, "getitem" is a very deep rabbithole.
    It was the first part of Awkward 1 to be implemented, because if it wasn't possible, I would have given up.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    @jpivarski: I'm impressed that v2 has been so fast to be honest. I know that it's easier once you already have the code in place, but it's pretty daunting rewriting something this big, especially when there are alternative APIs to consider
    Jim Pivarski
    @jpivarski
    Most of the v2 translation, including (and starting with) getitem, was done by @ioanaif. I agree that it's impressive!
    Angus Hollands
    @agoose77:matrix.org
    [m]
    Yeah, I've had to try and keep on top of the PRs! Super cool.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    @jpivarski: I have a memory of being linked to a GH repo that was supposed to reproduce a bug in Awkward / uproot. I can't seem to find it. Does this spark anything in your memory?
    Jim Pivarski
    @jpivarski
    @agoose77:matrix.org Is it the reply to the most recent message in https://gitter.im/Scikit-HEP/uproot ?
    Angus Hollands
    @agoose77:matrix.org
    [m]
    I had already downloaded my browser history into JSON to try and find it haha
    Jim Pivarski
    @jpivarski
    The reply-threads in Gitter are not distinct URLs from the main discussions. That hides some things.
    Alexander Held
    @alexander-held
    I could not follow the PyHEP awkward talk today, but was wondering about one aspect discussed in the "loose coupling" section of https://github.com/jpivarski-talks/2022-04-06-pyhep-awkward-update/blob/main/EVALUATED-part-1-ecosystem.ipynb. Is there a way to carry through additional properties like charge that do sum in a simple way? What I mean concretely is that p1[0][0].charge and p1[0][1].charge both exist, but (p1[0][0] + p1[0][1]) has no .charge attribute anymore. How does this relate to the idea of passing things through where possible (last paragraph of that section)?
    Jim Pivarski
    @jpivarski

    Vector knows about Lorentz vectors, which don't have a charge. Maybe it was a leading example, since it's not too crazy to consider behaviors that are Lorentz vectors + additive quantum numbers like charge (maybe just charge, since it's the only one that's a direct experimental observable in NHEP). In fact, that's what a "Candidate" is in Coffea. I'm pretty sure that the "Candidate" type adds charges.

    Maybe a better example would have been "lepton_isolation", a quantity that would make no sense to add when you add two particles. Here, we want the Vector library (and Coffea) to be okay with a record having these fields, but it shouldn't blindly add them when particles are added. Similarly for other (vec) → vec and (vec, vec) → vec operations. It should drop them.
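A rough illustration of the "drop what you can't add" rule described above, using plain dicts instead of Vector or Coffea records. The field names in `ADDITIVE_FIELDS` and the function `add_particles` are assumptions made for this sketch, not part of either library.

```python
# Assumed set of components that add linearly (a Lorentz vector in Cartesian form).
ADDITIVE_FIELDS = ("px", "py", "pz", "E")


def add_particles(a, b):
    """Add two particle records, keeping only the additive components.

    Extra fields such as "lepton_isolation" are accepted on input but
    deliberately dropped from the result, because summing them is meaningless.
    """
    return {name: a[name] + b[name] for name in ADDITIVE_FIELDS}


muon = {"px": 1.0, "py": 2.0, "pz": 3.0, "E": 4.0, "lepton_isolation": 0.25}
electron = {"px": 0.5, "py": 0.5, "pz": 0.5, "E": 1.0, "lepton_isolation": 0.5}

pair = add_particles(muon, electron)
# pair has px, py, pz, E only; "lepton_isolation" is not carried through.
```

A library that also wanted to sum charge would add it to the additive set explicitly, rather than summing every field it happens to find.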

    Jim Pivarski
    @jpivarski

    I realize you weren't there, but @agoose77:matrix.org asked the same question, and my answer about "modern languages" was leaning toward the distinction between structural types and nominative types.

    Nominative types are classic OOP: such-and-such a method needs a type named XYZ or one of XYZ's subtypes (subclasses). That's a little looser than "only one type allowed," but not much. Downstream library developers must make subclasses of your library's classes, and that dependency can lead to issues, especially when you want non-strict dependencies in Python. That's why protocols, like NumPy's "write a method named __array_ufunc__," are looser and more convenient than inheritance, like Pandas's "inherit from pd.api.extensions.ExtensionArray."

    Structural types are loose like the Python protocols I mentioned in the last paragraph. If a function needs "x", "y", and "z", then any type that provides "x", "y", and "z" can be used by the function. In a dynamically typed world like Python, that is duck typing. It's still possible in a type-checked world: the type-checking pass verifies that you have the required methods, such as a method with the correct name and signature, and if so, it passes. If it's also a compiled language, then it may need to compile different, specialized code for each usage of the function with different input types.
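To make the distinction concrete, here is a small sketch in plain Python. All class and function names (`Vector3D`, `HasXYZ`, `Hit`, `magnitude_nominative`, `magnitude_structural`) are invented for illustration; `typing.Protocol` is the standard-library way to describe a structural type to a static type checker.

```python
import math
from typing import Protocol


class Vector3D:
    """A named class: the nominative check below accepts only this or subclasses."""

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


def magnitude_nominative(v):
    # Nominative typing: the argument must *be* a Vector3D (by name/inheritance).
    if not isinstance(v, Vector3D):
        raise TypeError("expected a Vector3D or a subclass of it")
    return math.sqrt(v.x**2 + v.y**2 + v.z**2)


class HasXYZ(Protocol):
    """Structural type: anything with x, y, and z attributes qualifies."""

    x: float
    y: float
    z: float


def magnitude_structural(v: HasXYZ) -> float:
    # Structural typing (duck typing at runtime): only the attributes matter.
    return math.sqrt(v.x**2 + v.y**2 + v.z**2)


class Hit:
    """An unrelated class that happens to provide x, y, z (plus more)."""

    def __init__(self, x, y, z, energy):
        self.x, self.y, self.z, self.energy = x, y, z, energy


hit = Hit(3.0, 4.0, 0.0, energy=1.0)
length = magnitude_structural(hit)  # works: Hit has x, y, z
```

Calling `magnitude_nominative(hit)` raises `TypeError`, even though `Hit` carries everything the computation needs; that is the coupling that forces downstream developers to subclass.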

    Awkward Array was designed with these ideas in mind. It was intended to be primarily structurally typed with parameters as an opt-in for nominatively typing specific cases. Since nominative typing can be restrictive, the behavior is a mutable dict and different arrays can have different behaviors. In retrospect, it might be a bit too free because a common error is for a behavior to silently fail to match. It's probably a good idea to usually keep a constant, global ak.behavior to avoid confusion.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    Ooh, that was the language that I was missing (structural vs nominal)
    Angus Hollands
    @agoose77:matrix.org
    [m]
    @jpivarski: I was thinking about our Record-Content split in #1401, and I wanted to move some of the discourse here
    What do you think the purpose of the is_UnionType et al. flags is? I've taken them to be the high-level representation of the layout types that Awkward provides: smoothing over the details of e.g. how the list types are implemented. But, if that were the case, then I'd consider removing is_IndexedType as it is actually not a high-level type, but rather an implementation detail
    Jim Pivarski
    @jpivarski

    v1 had a lot of isinstance checks against groups of Content subclasses (especially in _util, in broadcast_and_apply). Those checks were verbose and not exactly the right Venn diagrams, forcing multiple isinstance with and and or.

    These is_UnionType, etc. flags are v2's improvement: the checks are now things like is_IndexedType or is_OptionType, which are succinct and make it easier to form the right Venn diagrams. Any subclasses of these layout nodes (if that should ever happen; we're not planning on it) would work just as well.

    That's what they're there for. If we remove them, a lot of if-statements are going to get more complicated.
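A toy comparison of the two styles, not Awkward's real classes: the `Toy*` classes and the flag values assigned to them are assumptions chosen to illustrate the point about Venn diagrams.

```python
class ToyContent:
    # Every flag defaults to False; subclasses switch on the ones that apply.
    is_ListType = False
    is_OptionType = False
    is_IndexedType = False


class ToyListOffset(ToyContent):
    is_ListType = True


class ToyIndexedOption(ToyContent):
    is_OptionType = True
    is_IndexedType = True


class ToyByteMasked(ToyContent):
    is_OptionType = True


def is_plain_option_v1(layout):
    # v1 style: hand-maintained groups of classes, combined with and/not.
    option_types = (ToyIndexedOption, ToyByteMasked)
    indexed_types = (ToyIndexedOption,)
    return isinstance(layout, option_types) and not isinstance(layout, indexed_types)


def is_plain_option_v2(layout):
    # v2 style: the same Venn-diagram region expressed with class-level flags.
    return layout.is_OptionType and not layout.is_IndexedType


layouts = [ToyListOffset(), ToyIndexedOption(), ToyByteMasked()]
answers = [is_plain_option_v2(layout) for layout in layouts]
```

With flags, a new node type only has to set its own attributes; with class groups, every tuple that should include it has to be found and updated.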

    These flags are in the mid-level API, the part that is public for downstream developers, but not encouraged for high-level data analysts. The Content subclasses, Record, Index, and Form have all sorts of spiders in the basement (though I still intend to do a renaming campaign to get the mid-level interface names in line with the high-level ones). We don't need to hide these flags any more than they already are. I wouldn't be comfortable with them in the high-level interface, but that's not where they are.
    Angus Hollands
    @agoose77:matrix.org
    [m]

    Sure, I remember the isinstance (ab)use 😀 So, AFAICT we've replaced ak._util.listtypes with layout.is_ListType etc? If so, I'm still not sure we need IndexedType - it was only needed because we had templated layouts:

    (awkward._ext.IndexedArray32,
     awkward._ext.IndexedArrayU32,
     awkward._ext.IndexedArray64)

    I'm not proposing we remove is_Indexed though - I'm just trying to get my head around the design goals 🙂

    Also, another topic: have you ever had the need to zip together two arrays as a record? I needed to today, and I found building a record manually to be problematic because it invoked from_iter, which dropped the type information for a zero-length array.
    I naively thought of adding a depth_limit=0 argument to ak.zip to build a Record, but I don't know how terrible that would be UX wise.
    Jim Pivarski
    @jpivarski
    Right. (I'm writing to you in two places!) The need for either the flags or the groups (such as ak._util.listtypes) was stronger when there were several integer-specialized types for each sort of node, since that would be a lot of isinstance checks. Now that there's only one ak._v2.contents.IndexedArray, it's less necessary, but still helpful.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    I suppose I could just do it with ak.zip({'x': x[np.newaxis, ...], 'y': y[np.newaxis, ...]}, depth_limit=1)[0] but where's the fun in that 😉
    Ah sorry. I'm moving between tasks
    Jim Pivarski
    @jpivarski

    from_json had to do some kind of dance (internally) to deal with single records. That's what this reminds me of.

    from_iter (like from_json) should be capable of returning either an ak.Array or an ak.Record.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    I might be explaining poorly - the ak._v2.Record works, but by passing the {"x": x, "y": y} dict to from_iter, it goes through ArrayBuilder and gets treated like a Python object rather than an Awkward Array.
    Jim Pivarski
    @jpivarski
    That sounds like something that ought to have a special case (isinstance(obj, ak._v2.record.Record)) but doesn't.
    Or ak.layout.Record, I'm not sure which you're addressing.
    Angus Hollands
    @agoose77:matrix.org
    [m]
    I think it was v1 in the end, as I hit #1400 whilst testing v2
    simonepigazzini
    @simonepigazzini

    Hello, I have a question related to array manipulation and behavior "propagation". I have an Array structured in this way:

    <RecHitEBArray [[{sumet: 0, ... status: 13}]] type='252 * [var * RecHitEB["sumet...'>

    that I would like to sort differently, achieving something similar to a pandas groupby operation. To do this, I unflatten the original array (runs.EcalPhiSymEB) using splits defined from numpy.unique: ak.unflatten(runs.EcalPhiSymEB, splits, axis=0). This works fine as far as the grouping is concerned, but the returned Array does not retain the original behavior (except for the innermost dimension):

    <Array [[[{sumet: 0, ... status: 13}]]] type='47 * var * [var * RecHitEB["sumet"...'>.

    I tried to specify the original behavior in the ak.unflatten call but that also hasn't worked.
    Is there a way to retain the RecHitEBArray behavior following the transformation?
    Thanks a lot
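For readers unfamiliar with the grouping trick in the question above, here is a pure-Python analogue of "count the runs of each key, then split the flat data by those counts". It assumes the keys are already sorted; `run_lengths` and `toy_unflatten` are hypothetical helpers, not the NumPy or Awkward functions themselves.

```python
from itertools import groupby


def run_lengths(sorted_keys):
    """Return the length of each run of equal keys (keys must be sorted)."""
    return [len(list(group)) for _, group in groupby(sorted_keys)]


def toy_unflatten(flat, counts):
    """Split a flat list into sublists whose lengths are given by counts."""
    if sum(counts) != len(flat):
        raise ValueError("counts must add up to the length of the flat list")
    out, start = [], 0
    for count in counts:
        out.append(flat[start:start + count])
        start += count
    return out


run_numbers = [1, 1, 2, 3, 3, 3]          # one key per entry, sorted
entries = ["a", "b", "c", "d", "e", "f"]  # the data to be grouped

splits = run_lengths(run_numbers)
grouped = toy_unflatten(entries, splits)
```

In the real workflow, the counts would come from NumPy and the split from ak.unflatten; this sketch only mirrors the shape of the computation.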

    Angus Hollands
    @agoose77:matrix.org
    [m]
    @simonepigazzini: could you provide a reproducer? I think what you're saying is that whilst the name of the record is being preserved, the behavior dictionary is not, and you're therefore not receiving a RecHitEBArray?
    For me, this works as expected:
    import awkward as ak
    import numpy as np
    
    
    class ThisArray(ak.Array):
        def that(self):
            print("this.that()!")
    
    
    behavior = {("*", "this"): ThisArray}
    
    array = ak.Array({"x": [[1, 2, 3], [4], [5, 6]]}, with_name="this", behavior=behavior)
    next_array = ak.unflatten(array, [2,1,1,1,1], axis=-1)
    
    assert isinstance(next_array, ThisArray)
    simonepigazzini
    @simonepigazzini
    Hi, while trying to build a simple reproducer, somehow I fixed the issue. Embarrassingly enough, I have no idea what was wrong. I think I had previously meddled with the behavior from the original array to the point that this was no longer working. Thanks a lot for taking a look and sorry for the noise
    Angus Hollands
    @agoose77:matrix.org
    [m]
    Do we have a plan anywhere about supporting layoutbuilder in numba?
    Jim Pivarski
    @jpivarski

    Yes! I had this in mind, but I'm trying to get everything out of my mind and onto a formal roadmap, so here it is: https://github.com/scikit-hep/awkward-1.0/wiki#layoutbuilder-in-numba

    This would be a great way to learn about how Numba works, by the way. It wouldn't be as hard as ak.Array → Numba and there's a lot of preexisting code to use as guides.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    Fab, I was thinking it would be written down somewhere! Yeah, I'm thinking it would take a day or two, which means two or three if you factor in the 2X developer underestimate 😉
    Jim Pivarski
    @jpivarski

    Actually, I was thinking it would take a few weeks, so that's a bit of an underestimate!

    It would have to implement equivalents of all of the functions you see here: https://github.com/scikit-hep/awkward-1.0/blob/main/src/awkward/_connect/_numba/builder.py

    But it can't do it just by calling a function pointer, as ArrayBuilder does: https://github.com/scikit-hep/awkward-1.0/blob/fb6aa208098ae42df82db080328b3bbb5d79671a/src/awkward/_connect/_numba/builder.py#L94-L106

    However, generating Python code as a string and using Numba's high-level @numba.extending.overload would be an option, similar to this: https://github.com/scikit-hep/awkward-1.0/blob/fb6aa208098ae42df82db080328b3bbb5d79671a/src/awkward/_connect/_numba/arrayview.py#L1111-L1177

    I don't want to scare anybody away—this would be a good project for a newcomer to Numba, though it would take a few weeks and you'd go away being a Numba expert.
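The "generate Python code as a string" approach mentioned above can be shown without Numba at all. The sketch below is a plain-Python analogue: it builds the source of a function specialised to a set of field names and compiles it with `exec`. The names `make_appender` and `append_record` are invented here, and nothing in this block uses or imitates Numba's actual extension API.

```python
def make_appender(field_names):
    """Generate a function specialised to field_names from source text.

    Illustrative only: a real Numba extension would hand generated code
    like this to Numba's extension machinery instead of plain exec.
    """
    lines = ["def append_record(columns, record):"]
    for name in field_names:
        # One unrolled statement per field, fixed at generation time.
        lines.append(f"    columns[{name!r}].append(record[{name!r}])")
    lines.append("    return columns")
    source = "\n".join(lines)

    namespace = {}
    exec(source, namespace)  # compile the generated source
    return namespace["append_record"], source


append_record, source = make_appender(["x", "y"])

columns = {"x": [], "y": []}
append_record(columns, {"x": 1, "y": 2.5})
append_record(columns, {"x": 3, "y": 4.5})
```

The design point is that the loop over fields runs once, when the code is generated, so the compiled function contains only straight-line statements for the specific layout it was built for.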

    Angus Hollands
    @agoose77:matrix.org
    [m]

    +- 1_000% 🙃

    Why can't we just call the pointers as we do for ArrayBuilder? This is where not being so versed on what we're actually doing in the numba layer is hampering me, but I assumed that at a high level it's a similar job?

    Jim Pivarski
    @jpivarski

    The main reason one would use LayoutBuilder instead of ArrayBuilder is for speed (beyond that, there's also the ability to make a numeric column have non-64bit type without converting it with ak.values_astype after the fact).

    Going through external function pointers should in principle slow it down because LLVM can't inline and optimize the function body with the surrounding code, but in practice the slowdown is much larger than I would expect it to be. There's something odd about running code in a shared library that I don't understand that makes it much slower than I think it ought to be. Doing this in cppyy or ROOT's C++ JIT is even slower than Numba's JIT. Mysteries abound, but I don't worry about it too much because ArrayBuilder has other bottlenecks.

    In addition, LayoutBuilder is currently implemented as a shim that generates AwkwardForth. AwkwardForth is not as fast as compiled code (by a factor of 2× to 5×), but faster than any other runtime-interpreted code I know of. That makes it the best choice when you don't have a JIT compiler available. It's not the best choice when you do have a JIT compiler available. It's looking like the only applications that can take advantage of a LayoutBuilder are ones that involve JIT somewhere in the chain (after all, you have to iterate over the rowwise data that you're feeding to the LayoutBuilder), so LayoutBuilder itself is getting reimplemented in JIT-compiled C++ this summer. (That's the project just above LayoutBuilder-Numba in the road map.) With the new LayoutBuilder generated on the fly, there wouldn't be a stable set of pointers to point to.

    (Also worth noting: LayoutBuilder through AwkwardForth is not quite a factor of 2× better than ArrayBuilder, as is.)

    So the short of it is that wrapping the current LayoutBuilder's external pointers in Numba calls is far enough from optimal that it wouldn't be worth doing.

    Angus Hollands
    @agoose77:matrix.org
    [m]

    Another (strong) benefit for using LayoutBuilder is that your output is guaranteed to have the correct form. I have some special cases in my own analysis tools where I walk the output, and replace the EmptyArray with the corresponding zero-sized layouts. If LayoutBuilder worked in Numba, I'd be able to remove those!

    > AwkwardForth is not as fast as compiled code (by a factor of 2× to 5×), but faster than any other runtime-interpreted code I know of.

    Still, that's impressive perf. For most people I imagine that's more than enough to make it usable.

    > LayoutBuilder through AwkwardForth is not quite a factor of 2× better than ArrayBuilder

    Huh, that is interesting.
    Nevertheless, is the JIT compiler planned to be cppyy?

    > With the new LayoutBuilder generated on the fly, there wouldn't be a stable set of pointers to point to.

    Ah, that explains it.

    Jim Pivarski
    @jpivarski

    We ought to have a function for filling in EmptyArray. Or no, this: scikit-hep/awkward-1.0#516 and let people write it themselves.

    We're targeting two JIT compilers: cppyy in ROOT and cppyy outside of ROOT. From our end, we just generate the C++ strings and have Python code that knows how to invoke both interfaces.

    Angus Hollands
    @agoose77:matrix.org
    [m]
    Now that recursively_apply is "non-util", it's part of our public interface, which is nice. In my case, it's less ideal than using a form-backed layout because the EmptyArray can happen at any point in the layout (depending upon the data being iterated over). Obviously a workaround is just to write some false data and then slice it off at the end, but that's not ideal!
    Angus Hollands
    @agoose77:matrix.org
    [m]
    Right, but we'd want the broadcast_and_apply variant too.
    Jim Pivarski
    @jpivarski
    That's what the issue is about, then. I hadn't noticed that recursively_apply is public, and it addresses Martin & Doug's issue on #awkward-dask, but it won't help cases involving more than one array. The public function should decide whether to do recursively_apply or broadcast_and_apply based on whether there is one input array or more than one, since the interface is now (in v2) the same between them. And this public function will require significant documentation: how it works is not easy to guess.
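A sketch of the dispatch described above, on nested Python lists rather than Awkward layouts. The three functions are toy stand-ins: `apply_to_arrays` is a hypothetical name for the proposed public entry point, and the two helpers only imitate the spirit of recursively_apply and broadcast_and_apply, not their real signatures.

```python
def toy_recursively_apply(array, action):
    """One input: walk the nesting and apply action at the leaves."""
    if isinstance(array, list):
        return [toy_recursively_apply(item, action) for item in array]
    return action(array)


def toy_broadcast_and_apply(arrays, action):
    """Several inputs: descend in lockstep, repeating scalars as needed."""
    lists = [a for a in arrays if isinstance(a, list)]
    if not lists:
        return action(*arrays)  # every input is a leaf
    length = len(lists[0])
    if any(len(a) != length for a in lists):
        raise ValueError("cannot broadcast lists of different lengths")
    # Repeat each scalar so that it lines up with the lists at this level.
    expanded = [a if isinstance(a, list) else [a] * length for a in arrays]
    return [toy_broadcast_and_apply(items, action) for items in zip(*expanded)]


def apply_to_arrays(action, *arrays):
    """Hypothetical public entry point: pick the strategy from the input count."""
    if not arrays:
        raise TypeError("at least one array is required")
    if len(arrays) == 1:
        return toy_recursively_apply(arrays[0], action)
    return toy_broadcast_and_apply(arrays, action)


doubled = apply_to_arrays(lambda x: x * 2, [[1, 2], [3]])
shifted = apply_to_arrays(lambda x, y: x + y, [[1, 2], [3]], 10)
```

Keeping the choice inside one function means callers never need to know which of the two traversals ran, which is the part that would need careful documentation.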