    MPaladino
    @palamatt95
    self.branches is a dictionary, and I access its elements as self.branches[key][self.chunk_idx]. self.chunk_idx is an integer that lets me access the single elements of the dictionary for each key. This project will use large ROOT files, which is why I am using iterators. So, summing up, what would be a proper path to follow?
    Jim Pivarski
    @jpivarski

    So I think that's the source of the misunderstanding: whenever I've been saying "chunk," I've been meaning "many events, not the whole dataset, but whatever can fit in memory at one time." If self.chunk_idx is an integer, you're pulling a single entry/event out of it.
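
    For example, a minimal sketch of chunked iteration (the file and tree names here are hypothetical):

    >>> import uproot
    >>> for chunk in uproot.iterate("data*.root:events", step_size="100 MB"):
    ...     print(len(chunk))   # each chunk is an array of many events, not a single entry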

    If the thing you want to do is make a bunch of variable-length lists have a fixed length (e.g. the largest, or some cut-off), then follow the ak.pad_none and possibly ak.fill_none suggestions above. The links go to documentation with examples.
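
    As a minimal sketch of that approach (the target length of 3 is arbitrary):

    >>> import awkward as ak
    >>> jagged = ak.Array([[1, 2, 3], [], [4, 5]])
    >>> padded = ak.pad_none(jagged, 3, clip=True)   # every list now has length 3, padded with None
    >>> ak.fill_none(padded, 0)                      # replace the Nones with a chosen value
    <Array [[1, 2, 3], [0, 0, 0], [4, 5, 0]] type='3 * 3 * int64'>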

    More generally, you might want to look at a tutorial about columnar/array-at-a-time logic: this one (STAR, most recent), this one (PyHEP, second most recent), this one (CMS, third most recent), or this one (Software Carpentries, actually more recent than the STAR one, now that I think about it).

    islazykv
    @islazykv
    Hello. I would like to thank you for this amazing library. We would like to use uproot for our analysis, but we are puzzled about one thing. When I load a .root file using uproot into a pandas df, I receive a list of dfs. I understand that the dataframes are separated because some values are missing. Unfortunately, it is not that easy to merge them, as they might differ in number of features and they are also jagged/multi-indexed. Would it be possible (when loading a ROOT file) to force a single df with None values for the missing values? Thank you.
    Jim Pivarski
    @jpivarski

    @islazykv Hi! The reason you get a list of DataFrames is that a decision needs to be made about how to merge them. Take this example:

    >>> import uproot, skhep_testdata
    >>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
    >>> dfs = tree.arrays(filter_name=["Muon_*", "Jet_*"], library="pd")
    >>> type(dfs)
    <class 'tuple'>
    >>> len(dfs)
    2
    >>> dfs[0]
                       Jet_Px     Jet_Py      Jet_Pz       Jet_E  Jet_btag  Jet_ID
    entry subentry                                                                
    1     0        -38.874714  19.863453   -0.894942   44.137363      -1.0    True
    3     0        -71.695213  93.571579  196.296432  230.346008      -1.0    True
          1         36.606369  21.838793   91.666283  101.358841      -1.0    True
          2        -28.866419   9.320708   51.243221   60.084141      -1.0    True
    4     0          3.880162 -75.234055 -359.601624  367.585480      -1.0    True
    ...                   ...        ...         ...         ...       ...     ...
    2417  0        -33.196457 -59.664749  -29.040150   74.944725      -1.0    True
          1        -26.086025 -19.068407   26.774284   42.481457      -1.0    True
    2418  0         -3.714818 -37.202377   41.012222   55.950581      -1.0    True
    2419  0        -36.361286  10.173571  226.429214  229.577988      -1.0    True
          1        -15.256871 -27.175364   12.119683   33.920349      -1.0    True
    
    [2773 rows x 6 columns]
    >>> dfs[1]
                      Muon_Px    Muon_Py  ...  Muon_Charge  Muon_Iso
    entry subentry                        ...                       
    0     0        -52.899456 -11.654672  ...            1  4.200153
          1         37.737782   0.693474  ...           -1  2.151061
    1     0         -0.816459 -24.404259  ...            1  2.188047
    2     0         48.987831 -21.723139  ...            1  1.412822
          1          0.827567  29.800508  ...           -1  3.383504
    ...                   ...        ...  ...          ...       ...
    2416  0        -39.285824 -14.607491  ...           -1  1.080880
    2417  0         35.067146 -14.150043  ...           -1  3.427752
    2418  0        -29.756786 -15.303859  ...           -1  3.762945
    2419  0          1.141870  63.609570  ...           -1  0.550811
    2420  0         23.913206 -35.665077  ...           -1  0.000000
    
    [3825 rows x 6 columns]

    You get two DataFrames because muon multiplicity differs from jet multiplicity. Each event will have a different number of muons than jets. All of the muon TBranches can be combined into a single muon DataFrame and all of the jet TBranches can be combined into a single jet DataFrame, but they can't be combined any further without making some sort of choice.

    Like, should the muon with subentry 0 be in the same row as the jet with subentry 0? Why? Those particles might not be related in any way. In HEP, a typical criterion for relating muons to jets is to find pairs with minimum ΔR, but even then, some jets might have two muons that are close enough to consider part of the jet and other muons might not be close to any jet at all. That criterion for identifying connections between muons and jets is not one-to-one, so they really can't be put in the same row of a table. (Maybe a DataFrame with a three-level MultiIndex, where the subsubentry puts muons "inside" of jets, but then what about the muons unconnected to any jets?)

    If you do want to put the muon with subentry 0 in the same row as the jet with subentry 0, filling all the blanks with missing values, that is an "outer join." Pandas has a function for that:

    >>> import pandas as pd
    >>> pd.concat(dfs, axis=1, join="outer")
                       Jet_Px     Jet_Py      Jet_Pz       Jet_E  Jet_btag Jet_ID    Muon_Px    Muon_Py     Muon_Pz      Muon_E  Muon_Charge  Muon_Iso
    entry subentry                                                                                                                                    
    1     0        -38.874714  19.863453   -0.894942   44.137363      -1.0   True  -0.816459 -24.404259   20.199968   31.690445          1.0  2.188047
    3     0        -71.695213  93.571579  196.296432  230.346008      -1.0   True  22.088331 -85.835464  403.848450  413.460022         -1.0  2.728488
          1         36.606369  21.838793   91.666283  101.358841      -1.0   True  76.691917 -13.956494  335.094208  344.041534          1.0  0.552297
          2        -28.866419   9.320708   51.243221   60.084141      -1.0   True        NaN        NaN         NaN         NaN          NaN       NaN
    4     0          3.880162 -75.234055 -359.601624  367.585480      -1.0   True  45.171322  67.248787  -89.695732  120.864319         -1.0  0.000000
    ...                   ...        ...         ...         ...       ...    ...        ...        ...         ...         ...          ...       ...
    2411  0               NaN        NaN         NaN         NaN       NaN    NaN  55.720299  26.369698  -24.587757   66.367775          1.0  2.614916
          1               NaN        NaN         NaN         NaN       NaN    NaN -26.914448  -9.812821   -0.389948   28.650345         -1.0  1.190786
    2415  0               NaN        NaN         NaN         NaN       NaN    NaN  34.506527  28.839973 -150.656708  157.225632          1.0  0.000000
          1               NaN        NaN         NaN         NaN       NaN    NaN -31.567780 -10.424366 -111.264702  116.125092         -1.0  3.865161
    2420  0               NaN        NaN         NaN         NaN       NaN    NaN  23.913206 -35.665077   54.719437   69.556213         -1.0  0.000000
    
    [4560 rows x 12 columns]
    But Uproot should not do this automatically because it would usually be the wrong thing to do.
    Jim Pivarski
    @jpivarski

    The "relational database" way to deal with this problem is to keep the tables separate and only link them with indexes, even to the extent of not having a MultiIndex with "entry" and "subentry," but one table for event variables, another table for muon variables, and another table for jet variables. The muon and jet tables would each have a column named something like "event ID", which the user would have to match to the corresponding column in the event table in every expression that they write. Since we, in HEP, (almost?) always want to compare particles in the same event and (almost?) never in different events, having to include that JOIN in every calculation we write would be super-annoying. (Maybe the one exception is alignment and calibration, which joins track or shower information per detector component, rather than per event.)
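
    To illustrate, a hypothetical sketch of what that explicit JOIN would look like in Pandas (the table and column names are made up):

    >>> import pandas as pd
    >>> events_df = pd.DataFrame({"event_id": [0, 1], "met": [25.1, 13.7]})
    >>> muons_df = pd.DataFrame({"event_id": [0, 0, 1], "pt": [54.2, 37.7, 24.4]})
    >>> muons_df.merge(events_df, on="event_id")   # this JOIN would appear in every calculation
       event_id    pt   met
    0         0  54.2  25.1
    1         0  37.7  25.1
    2         1  24.4  13.7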

    Pandas is a little more forgiving in that it provides this MultiIndex, which puts a (single) tree-like hierarchy on rows in a table. With the muons DataFrame (dfs[1]), each muon is in a separate row, but the index knows about the difference between one row and the next. It can use that distinction in its operations, such as the outer join above, which puts jets and muons with the same entry together, rather than a naive mix that would ignore event boundaries.

    However, a Pandas DataFrame fundamentally cannot represent data with more than one hierarchy, because the hierarchy is in the index, not the columns themselves. A DataFrame can have only one index. So if you wanted to be less relational and get everything into a single table, you have to make some possibly unwarranted associations, like "muon subentry i ↔ jet subentry i."

    Jim Pivarski
    @jpivarski

    Which is why we have Awkward Array. Instead of putting this hierarchical information into an index that is separate from the data, the data are structured like nested lists and records, as you would have in a non-relational, object-oriented context (C++ or Python).

    The structure you initially get out of Uproot may be disconnected (because they're all separate TBranches in ROOT):

    >>> muons = tree.arrays(filter_name=["Muon_*"])
    >>> jets = tree.arrays(filter_name=["Jet_*"])

    but you can zip them together:

    >>> import awkward as ak
    >>> muons = ak.zip({
    ...     "px": muons.Muon_Px,
    ...     "py": muons.Muon_Py,
    ...     "pz": muons.Muon_Pz,
    ...     "E": muons.Muon_E,
    ...     "charge": muons.Muon_Charge,
    ...     "iso": muons.Muon_Iso,
    ... }, with_name="Momentum4D")
    >>> jets = ak.zip({
    ...     "px": jets.Jet_Px,
    ...     "py": jets.Jet_Py,
    ...     "pz": jets.Jet_Pz,
    ...     "E": jets.Jet_E,
    ...     "btag": jets.Jet_btag,
    ...     "id": jets.Jet_ID,
    ... }, with_name="Momentum4D")

    and put them in the same array without any problems, because it's just a tree with multiple branches (unlike a MultiIndex).

    >>> events = ak.Array({"muons": muons, "jets": jets})
    >>> events.type
    2421 * {"muons": var * Momentum4D["px": float32, "py": float32, "pz": float32, "E": float32, "charge": int32, "iso": float32], "jets": var * Momentum4D["px": float32, "py": float32, "pz": float32, "E": float32, "btag": float32, "id": bool]}

    You can get everything back out again with slices:

    >>> events.muons
    <Array [[{px: -52.9, py: -11.7, ... iso: 0}]] type='2421 * var * Momentum4D["px"...'>
    >>> events.muons.px
    <Array [[-52.9, 37.7], ... 1.14], [23.9]] type='2421 * var * float32'>
    >>> events.jets.px
    <Array [[], [-38.9], ... [-36.4, -15.3], []] type='2421 * var * float32'>

    and the reason I included with_name="Momentum4D" is that there's a package called Vector that adds Lorentz vector methods to any Awkward Array with this name and four-vector component field names.

    >>> import vector
    >>> vector.register_awkward()
    >>> events.muons
    <MomentumArray4D [[{px: -52.9, py: -11.7, ... iso: 0}]] type='2421 * var * Momen...'>

    And now

    >>> events.muons.pt
    <Array [[54.2, 37.7], [24.4, ... 63.6], [42.9]] type='2421 * var * float32'>
    >>> events.jets.pt
    <Array [[], [43.7], [], ... [37.8, 31.2], []] type='2421 * var * float32'>

    and

    >>> mu1, mu2 = ak.unzip(ak.combinations(events.muons, 2))
    >>> (mu1 + mu2).mass
    <Array [[90.2], [], [74.7], ... [], [], [], []] type='2421 * var * float32'>

    and

    >>> mu, jet = ak.unzip(ak.cartesian([events.muons, events.jets]))
    >>> mu.deltaR(jet)
    <Array [[], [2.15], [], ... [1.55, 2.94], []] type='2421 * var * float32'>

    and such. If you're getting multiple DataFrames from Uproot with library="pd" and/or you're planning on doing particle physics-type calculations, you may want to consider Awkward Array instead.

    (If you need Pandas at the end, there's ak.to_pandas. All of the functions are documented here.)
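
    For instance, a minimal sketch (this gives back a MultiIndex DataFrame like the ones above):

    >>> ak.to_pandas(events.muons)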

    islazykv
    @islazykv
    Hello. Thank you very much for the in-depth explanation. As you suggest, I will switch to Awkward Array. Once again, thank you.
    benw22022
    @benw22022

    Hi Experts - thanks for this helpful library! I'm trying to use uproot to load batches of data from ROOT files to train a neural network, but I am having trouble training for long periods due to a memory leak which seems to be coming from my usage of uproot. My code works by using a class to manage 9 uproot.iterate instances, which fetch batches of data from 9 separate lists of ROOT files and combine them. This is done to avoid the need for one large shuffled dataset. My code looks something like this:
    '''
    batch = []
    for file_type, itr in self.iters.items():
        batch.append(self.process_batch(next(itr)))

    x_batch = []
    for i in range(0, len(batch[0])):
        x_batch.append(np.concatenate([result[0][i] for result in batch]))
    '''
    Where self.iters is a dictionary of uproot.iterate instances and self.process_batch is a method which generates the arrays for each input of the network and the labels (I don't think the problem is there, since the leak occurs even if I just return dummy numpy arrays).
    At the end of each epoch the uproot.iterate instances are recreated. Have I made a mistake in the way I have used uproot here, or is this a bug? Thanks in advance for any help!

    Angus Hollands
    @agoose77:matrix.org
    However, you're storing the results of each iteration in a list with batch.append. I assume that these arrays are not small. If this is the case, then I would expect this will eventually consume all of your memory.
    Angus Hollands
    @agoose77:matrix.org
    It would be a lot easier if you could create a reproducer. uproot shouldn't leak memory by default (assuming you don't keep Python references alive), so it's important to see the bigger picture in order to work out what's going on.
    benw22022
    @benw22022
    Hi, sorry, I just saw the thread over in the awkward-array gitter - I realise that the example I sent wasn't as simple as it should have been. I've had another go - link here: https://drive.google.com/drive/folders/16FQtUUidYlthguG4UYUJq3Dt3l_pJO92?usp=sharing - just one python file + 9 root files. I've tried using pympler to diagnose the issue but it can't see anything wrong; all I know is that the memory usage reported by htop slowly keeps growing. Thanks in advance for any help
    Angus Hollands
    @agoose77:matrix.org
    Thanks for sharing a simpler demo - I meant to follow up here with a request for such a thing
    Angus Hollands
    @agoose77:matrix.org
    Hey Ben. Short answer - yes, but I didn't find the cause during the time that I was digging into it
    Angus Hollands
    @agoose77:matrix.org
    @benw22022: what is the long-term behaviour that you witness? I made a simpler version of the test and memory stabilises (as expected) after some time.
    Alexander Held
    @alexander-held
    Hi, I have a ROOT file containing many trees, and I want to create a new file with a single tree that contains all the branches from the many trees in my input. Perhaps this is something where it makes more sense to use ROOT, but I wanted to give uproot a try. Here is a simple script that does what I want: https://gist.github.com/alexander-held/c4d8a82f45e6ec834ff49d5fa0c98c67 (note the entry_stop in arrays, that is on purpose to speed things up). I am testing this with a ~200 MB file https://cernbox.cern.ch/index.php/s/lpS1T3oLI6Dhbnc (CMS Open Data). I'm reading everything into memory, which is quite fast. Writing takes a very long time, and I wanted to understand whether there is a better approach to do this.
    • I started with reading into numpy arrays, and got a decent speedup during writing by sticking to awkward instead. Is there a better way to convert to a dict for writing? Is there a more efficient format for writing?
    • Reading everything into memory might not be the best idea in general, so I wondered whether this could be batched somehow. My understanding is that I can only extend existing branches on a tree, so when writing my first batch, I would need data from all the different trees in my input already. How can I iterate over multiple trees that are present in my input in parallel; is there a way to do this? That would then allow me to write in batches via extend. Would this actually matter for speed purposes?
    Jim Pivarski
    @jpivarski

    First bullet point: that's understandable if the data are jagged. Reading would be faster because you only need to make one Awkward Array, as opposed to 1000 (entry_stop) NumPy arrays, and writing would be much faster because you can drop the Awkward Array in the file, as opposed to iterating over the NumPy arrays, constructing an Awkward Array out of them, and writing that.

    Making the dict with

    tree_content = f_in[tree].arrays(library="ak", entry_stop=1000)
    tree_content = dict([(f, tree_content[f]) for f in tree_content.fields])

    is fine. (No performance red flags.) The second line could be

    tree_content = dict(zip(ak.fields(tree_content), ak.unzip(tree_content)))

    if you want.

    Jim Pivarski
    @jpivarski

    Second bullet point: to iterate, you can put the

    with uproot.recreate(f_out_path) as f_out:

    outside of the loop over tree in trees, and initialize the output tree with branch names and dtypes or Awkward types:

    f_out.mktree("events", {"branch1": type1, "branch2": type2, ...})

    to just initialize it without filling it. Then read 1000 (or more) entries from each tree in trees and extend:

    f_out["events"].extend(tree_content)

    Or, if you can't name the branches and types before looking at the first entries, you can have a loop assign the first time and extend in all subsequent times.
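
    For example, a minimal sketch of that pattern (batches here stands for any iterator of dicts of arrays, like tree_content above):

    first = True
    for tree_content in batches:
        if first:
            f_out["events"] = tree_content            # first batch creates the TTree and fixes its types
            first = False
        else:
            f_out["events"].extend(tree_content)      # every later batch appends entries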

    Jim Pivarski
    @jpivarski

    Second problem: how do you iterate over all the trees together? You're combining equal numbers of entries with different branches into a single tree with the same number of entries and the union of all branches (like this Pandas merge). So you'll want to pull the same exact set of entries from each of the trees in each step.

    TTree.iterate has a step_size parameter. When you set that to an integer, it specifies an exact number of entries in each step. Use the same step_size for all of the input trees.

    Although a lot of my examples had iterate in the first line of a for loop, it returns a generator that you can use any way that you like. You can call iterate with equal step_size on all of your input trees to get a set of iterators, and then call next(iterator) on each one of them to get the next batch. That would fit nicely into a loop.

    Or do zip (Python's zip) on all of the iterators in the first line of a for loop, which is equivalent.
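
    For example, a sketch of that loop (treeA and treeB are placeholder names; f_in and f_out are the open input and output files, with the output TTree already initialized via mktree as above):

    import awkward as ak

    iterators = [f_in[name].iterate(step_size=1000) for name in ("treeA", "treeB")]
    for batches in zip(*iterators):                 # same 1000 entries from every tree
        merged = {}
        for batch in batches:                       # union of all the branches
            merged.update(dict(zip(ak.fields(batch), ak.unzip(batch))))
        f_out["events"].extend(merged)              # append this step's entries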

    Alexander Held
    @alexander-held
    Thanks a lot! I have to admit, I have done both the tree initialization and the next(iterator) thing before in other contexts with uproot and forgotten about them. It's good to know there is no major red flag with regards to performance. This now runs in 6 seconds over the 200 MB file. I realized in the meantime that I could pick a tree and make all other trees friend trees of that single tree in ROOT, which is probably ~instant in comparison, but I suspect for uproot there is not too much more that can be done here for optimization other than tuning batch size?
    Jim Pivarski
    @jpivarski
    For the particular thing you want to do—copy data from one set of trees into another—there are quite a few things that could be faster if we invested some development time. For instance, this process is uncompressing arrays, cutting them at boundaries that are likely different from the original basket boundaries, recompressing them and writing them. You'd have to do all that work if you were going to change the numbers, but since you're not, a writing process could skip the decompression and recompression. In ROOT, that's called a "fast copy" (for obvious reasons). It could be done, but it's not implemented in Uproot. If the output file is the same as the input file, I believe it would also be possible to not copy—just "link" the old baskets into a new tree. I'm pretty sure it would work, but I don't know if there are any existing routines (in ROOT) that would complain about two trees sharing the same baskets. If that works, it would be even faster than friend trees, which have to construct an index. But also, friend trees are another way of doing it... that isn't implemented in Uproot.
    Alexander Held
    @alexander-held
    Thanks Jim! The setup in which I ran into this is rather artificial, and the performance is more than fine for that purpose. It's interesting to know the various possibilities here though.
    heatherrussell
    @heatherrussell

    Hi, I'm trying to use concatenate but running into an error:

        with uproot.concatenate([filepath+sample+"/*.root:recoTree"]) as sample:
    AttributeError: __enter__

    what's the proper syntax here? There are two trees in my file so I can't remove the :recoTree part.

    Jim Pivarski
    @jpivarski
    uproot.concatenate shouldn't be used as a context manager (in a with statement). It returns arrays without leaving a file open, so there's no context to manage.
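
    For example, a minimal sketch reusing the variables from your snippet:

    # no "with" needed: concatenate opens the files, reads them, and closes them itself
    arrays = uproot.concatenate([filepath + sample + "/*.root:recoTree"])
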
    heatherrussell
    @heatherrussell
    Ah, thanks! I was worried about leaving so many files open :D
    Jim Pivarski
    @jpivarski
    A context manager has attributes __enter__ and __exit__, and that's how far Python gets before running into an issue.
    Sure, no problem!
    Allen Xiang
    @allenxiangxin

    Hi, I have a uproot write question. I'm dealing with large files, and I tried to initialize multiple fixed length arrays in TTree via mktree:

    file = uproot.recreate("example.root")
    file.mktree("tree", {"n": "int32", "x": "n*int32", "y": "n*int32", "z": "n*int32"})

    where n specifies the length for all arrays. This raises an error.

    Meanwhile, if I do

    file.mktree("tree", {"x": "var*int32", "y": "var*int32", "z": "var*int32"})

    the output tree will have duplicated variables like nx, ny, nz. Any solution for this?

    Jim Pivarski
    @jpivarski

    It has to create nx, ny, and nz TBranches because ROOT expects to find these "counter" branches for each TBranch that has variable-length array type. Perhaps in your case, nx, ny, and nz happen to be identical, but Uproot doesn't know that, not if you give it x, y, and z as separate jagged arrays.

    You can do it by giving Uproot a single jagged array of records. For instance, suppose that you have

    >>> import awkward as ak
    >>> x = ak.Array([[1, 2, 3], [], [4, 5]])
    >>> y = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

    It happens to be the case that the lengths of all the lists in x are equal to the lengths of all the lists in y, perhaps because they represent attributes of the same set of particles. If this is true, you'll be able to zip them all together:

    >>> together = ak.zip({"x": x, "y": y})
    >>> print(together.type)
    3 * var * {"x": int64, "y": float64}

    This array's item type is "variable length lists of records" with fields x and y. If the original x and y didn't have the same lengths, they would not have been broadcastable and ak.zip would have failed. (Sometimes we say, "use depth_limit when ak.zip fails," but you wouldn't want to do that because the purpose of what you're doing is to zip them at all depths, so that there's only one "var" in the type description, rather than one for each field.)
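
    To make that concrete, a small sketch continuing the example above:

    >>> bad = ak.Array([[1.1, 2.2], [3.3], [4.4, 5.5]])   # list lengths differ from x
    >>> ak.zip({"x": x, "y": bad})                        # the lists can't be broadcast
    ValueError: ... cannot broadcast nested list ...
    >>> print(ak.zip({"x": x, "y": bad}, depth_limit=1).type)
    3 * {"x": var * int64, "y": var * float64}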

    Arrays containing records are split into separate TBranches when writing a TTree ("split" in the ROOT sense):

    >>> import uproot
    >>> file = uproot.recreate("tmp.root")
    >>> file["tree"] = together
    >>> file["tree"].show()
    name                 | typename                 | interpretation                
    ---------------------+--------------------------+-------------------------------
    n                    | int32_t                  | AsDtype('>i4')
    x                    | int64_t[]                | AsJagged(AsDtype('>i8'))
    y                    | double[]                 | AsJagged(AsDtype('>f8'))

    They have only one counter because the type of together had only one "var". Uproot knows that the fields have the same list lengths for all entries in the array because that had to be true for "together" to exist.

    Jim Pivarski
    @jpivarski

    If you want to do this with the mktree function (so that you don't have to start filling the TTree right away), I think this works:

    >>> file.mktree("tree2", {"outer": together.type})
    >>> file["tree2"].extend({"outer": together})
    >>> file["tree2"].show()
    name                 | typename                 | interpretation                
    ---------------------+--------------------------+-------------------------------
    nouter               | int32_t                  | AsDtype('>i4')
    outer_x              | int64_t[]                | AsJagged(AsDtype('>i8'))
    outer_y              | double[]                 | AsJagged(AsDtype('>f8'))

    Yes, that works, though you have to give it a name and the fields are nested within that.

    Yeah, I just verified: you have to give the types as a dict, and therefore you have to give the outer structure a name. The intention is to do something like

    file.mktree("events", {"muons": muons, "jets": jets, ...})

    where muons, jets, etc. are all arrays of lists of records, in which each record has fields for the particle type. They have to be separated like this (i.e. you have to get different counter variables) because the number of muons in an event is not always equal to the number of jets, etc. Something has to count muons separately from jets, but there's no reason to count muon pt separately from muon eta.

    Allen Xiang
    @allenxiangxin
    Thanks!
    Alexander Held
    @alexander-held

    ROOT expects to find these "counter" branches for each TBranch that has variable-length array type

    Given that I'm not used to seeing these kinds of branches in ROOT-produced files with jagged structure, is the difference that ROOT would usually use a different type of variable to write them?

    Jim Pivarski
    @jpivarski

    If the C++ type of the jagged array is a dynamic-length array (e.g. typename double[] in Uproot; the TBranch declaration would be "myname[mycounter]/D"), then the counter has to exist. There's a good reason, too: if you're reading these back into a user-allocated array, you need to know how big to make that array, and the counter branch has a method (either GetLen or GetMaximum) to get the allocation size. If you're using one of the more high-level accessors, RDataFrame or TTreeReader, then that high-level accessor has to internally allocate the array. (On the other hand, that one integer of information could have been on the jagged TBranch itself, rather than in a counter pointed to by the jagged TBranch, so the problem could have been dealt with in other ways.)

    If the C++ type of the jagged array is std::vector, then there is no need for a counter branch. The difference here is that the user does not need to preallocate one array to be used on all entries: you construct a std::vector and STL handles its resizing whenever it encounters a larger entry than its capacity. I remember a very old ROOT User Manual saying that they designed TTree-reading to avoid allocation in a loop because allocation was a performance bottleneck, but the way that std::vector does it, it only has to reallocate logarithmically many times (you see a new high-water mark exponentially less often as you iterate through random data).

    So TL;DR: jagged arrays can be encoded as different C++ types; some require the counter branch because of a (very old) ROOT choice, others don't. You may be familiar with one and not the other. Since the different C++ types are different encodings, adding std::vector support for writing would be a big project (scikit-hep/uproot4#257). And then we'd also need an interface for you to be able to specify which type you want to write.

    Alexander Held
    @alexander-held
    Ah yes, the files I'm used to working with use std::vector; that makes sense then. Thanks!
    Tristan Miralles
    @Tristan63170
    Hello everyone,
    Tristan Miralles
    @Tristan63170
    I use uproot on SWAN (software stack 97a, uproot 4.0.0, awkward1 1.0.0) for one of my FCC analyses, and I have had an issue since yesterday. Since the new version of the FCCAnalyses stuff, the conversion of my data from a ROOT tree to an Awkward Array has become impossible. I checked the C++ functions behind my quantities, but nothing seems different compared to the previous version (when everything was good).
    With the .show() function I saw that the problem comes from the fact that my old ROOT::VecOps::RVec<float or int> branches were seen as std::vector<float or int> by uproot, whereas my new ROOT::VecOps::RVec<float or int> branches are seen as ROOT::VecOps::RVec<float or int>. Because of that, uproot is not able to interpret these quantities (it gives AsObjects(UnknownROOT…)).
    Do you have any idea of a solution that would allow uproot to interpret these quantities?
    Thanks
    old.png: here you can see the output from .show() with my old data.
    new.png: and now with the new ones.
    One more thing: when I explore the old and the new ROOT files (with a TBrowser, for example), they look identical.
    Alexander Held
    @alexander-held
    Hi, I don't have anything to suggest that's very specific to this problem, but it may be worth trying out more recent versions of uproot/awkward. The versions you refer to are 1.5 years old by now, and there have been regular updates in the meantime. Perhaps worth a try to see if that solves the problem.
    Angus Hollands
    @agoose77:matrix.org
    @Tristan63170: the short answer is that your TTree branches have changed type, from std::vector to ROOT::VecOps::RVec. I am not up to speed with uproot's ROOT support, but I don't think we currently handle RVec. I don't know how RVec is serialised, but I suspect that it might be similar to std::vector (I don't know if it has a different header, etc.). If so, supporting it wouldn't be that much work, but I'll defer to @jpivarski, who will know the answer!
    Tristan Miralles
    @Tristan63170
    Thanks for your answers. The main issue is that my old data was already built of ROOT::VecOps::RVec, but with that data uproot read them as float or int; this is not the case anymore with my new data ...
    Tristan Miralles
    @Tristan63170
    My old data was produced during October 2021
    Angus Hollands
    @agoose77:matrix.org
    could you share a small version of your data? e.g. the first 200 entries of the tree
    Jim Pivarski
    @jpivarski

    "Whether or not Uproot can read RVec" is something we don't know until we look at a file. For most objects, Uproot learns how to deserialize it from the TStreamerInfo in the header of the ROOT file—the process is usually automated: there are more readable data types than the Uproot authors know about. The question is whether RVec uses only TStreamerInfo features we've seen before (by looking at files containing other types with the same TStreamerInfo features).

    If Uproot is saying its type is "Unknown," then that could mean that the RVec in this particular file doesn't have a TStreamerInfo record at all (sometimes hadd drops TStreamerInfo records, and some types are considered basic enough that ROOT doesn't include them) or it could mean that it's using features of TStreamerInfo we don't recognize. Since RVec is becoming an important type (RDataFrame produces it regularly), we should probably figure out how to interpret it directly, skipping TStreamerInfo. I can say for sure that we haven't hand-written any deserialization code for RVec yet.

    To do that, though, we'd need to look at a file. The best way to do it is to open an issue and include a small file containing the data type, preferably under 1 MB. The sample also doesn't need to have more than one RVec branch to be useful: it could be just "MC_px" and just enough entries that we see a few different sizes (an entry with 10 items, another with 15, etc.). GitHub doesn't like the ".root" file extension, but you can just rename it to ".txt" or put it in a ZIP archive.

    Tristan Miralles
    @Tristan63170
    Thanks for your answer, I have just reported the issue on your GitHub (and I have attached a smaller version of the files with fewer events). Could it be possible to just find a way to put the missing header in the ROOT file?
    Jim Pivarski
    @jpivarski

    Could it be possible to just find a way to put the missing header in the ROOT file?

    @Tristan63170 That would only work if the error is related to missing TStreamerInfo (as opposed to an unimplemented feature), and even in that case, someone else will run into the same problem, so it's better to address it in Uproot.

    Thanks for posting the issue! Having a file will make a big difference.

    Tristan Miralles
    @Tristan63170
    Thanks @jpivarski, it works now!