When you call `da.from_array(arr)`, no matter what chunking you use, each chunk of the resulting dask array will have all of `arr` at the root of its task graph (as opposed to only having a chunk-sized piece of `arr`). This behavior of `da.from_array` explains what you are seeing.
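A quick way to see this is to inspect the low-level graph directly; here is a minimal sketch (the array contents and chunk size are arbitrary):

```python
import numpy as np
import dask.array as da

arr = np.arange(10)

# Two chunks, but the graph contains a single key holding the *entire*
# `arr`; each chunk task just slices out of that key at compute time.
x = da.from_array(arr, chunks=5)

for key, task in dict(x.__dask_graph__()).items():
    print(key, task)
```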
@d-v-b, thanks for your reply. After looking at the graphs of my 'identical' arrays and doing some additional reading of the Dask documentation, I realize that they are not equal at all.
The array loaded via `from_array` has a single node (since it is only one chunk and it fits in memory). The graph of the array originating from the `read_csv` call, on the other hand, is relatively complex: it starts with two parallel `read_blocks` nodes, goes through `read_panda`, `from_delayed` and `values`, and the two paths finally merge into one at `rechunk-merge` (see the picture above).
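For reference, this is roughly how the pictures above were produced (file and column names are placeholders for the real data):

```python
import dask.array as da
import dask.dataframe as dd

# Placeholder file/column names standing in for the actual CSV.
df = dd.read_csv("data.csv")
from_csv = df["value"].to_dask_array(lengths=True).rechunk(-1)

from_mem = da.from_array(from_csv.compute())

# Writes the task graphs to PNG files (requires graphviz).
from_csv.visualize("from_csv.png")
from_mem.visualize("from_array.png")
```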
Now, my new question is: are both these graphs computed anew for every call to compute()? Or does Dask store intermediate values somehow? E.g., at the values nodes?
Each call to `compute` will re-run the exact same computation from scratch. If you want Dask to keep intermediate results in memory, use `persist` -- see the "persist" section in the best practices guide here: https://docs.dask.org/en/latest/dataframe-best-practices.html
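As a rough sketch of the difference (the CSV path and column name are placeholders):

```python
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")

# Without persist, every computation below would re-read and re-parse the
# CSVs. After persist, the parsed partitions stay in memory and are reused.
df = df.persist()

print(len(df))                       # uses the persisted partitions
print(df["value"].mean().compute())  # so does this
```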
I'm looking for something like
if enforce and columns and (list(df.columns) != list(columns)):
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

| Column      | Found  | Expected |
|-------------|--------|----------|
| agency_code | object | int64    |
Since I don't have column definitions, and the dataset consists of 3 separate extracts, a column could be int in 1/3 of the file while float in the other 2/3...
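(For reference, the closest workaround I'm aware of is forcing the ambiguous columns to a single dtype at read time; the file pattern and target dtype below are just guesses on my part:)

```python
import dask.dataframe as dd

# Coerce the ambiguous column up front so every partition, regardless of
# which extract it came from, is parsed with the same dtype.
df = dd.read_csv(
    "extract-*.csv",                   # placeholder for the three extracts
    dtype={"agency_code": "float64"},  # or "object" if values are truly mixed
)
```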