Discussion channel to work on creating metadata files that can provide zarr-type access speeds to data in many different older formats
--Rudimentary draft--
Category
Talk
Official Keywords
Big Data, Data Engineering, Data Mining / Scraping
Additional Keywords
Prior Knowledge?
No previous knowledge expected
Brief Summary
We introduce ReferenceFileSystem, a virtual fsspec implementation which presents arbitrary byte chunks at specific keys, exposing chunks of HDF5, TIFF, GRIB2 and other files at the appropriate paths conforming to zarr's model. Thus, you can use zarr to load data from potentially thousands of remote data files, selecting only what you need, with parallelism and concurrency.
Outline
Description
fsspec's ReferenceFileSystem allows a filesystem-like virtual view onto chunks of bytes stored in arbitrary locations elsewhere, e.g., cloud bucket storage. We can present each byte chunk as a particular path in the filesystem conforming to the zarr hierarchy model, such that the original set of chunks, potentially across many files, appears as a single zarr dataset. This brings the following advantages: the original files need not be copied or converted, only the chunks actually required are fetched, and loading can proceed with parallelism and concurrency across potentially thousands of remote files.
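To make this concrete, a single-file reference set is just a mapping from zarr keys to either inline metadata or (url, offset, length) pointers into the original file. A minimal sketch, written as a Python dict (the bucket name, variable name and byte ranges below are made up):

single = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',  # inline zarr metadata, stored as a string
        "precip/.zarray": "{...}",        # chunk shape, dtype and codec info
        # data keys point at (url, byte offset, byte length) in the source file
        "precip/0.0": ["s3://bucket/file1.nc", 30000, 12000],
    },
}

MultiZarrToZarr, as used in this channel, merges many such per-file reference sets into one logical zarr dataset along a concatenation dimension: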
import fsspec_reference_maker.combine

def preprocess(ds):
    # promote latitude/longitude from data variables to coordinates
    return ds.set_coords(["latitude", "longitude"])

mzz = fsspec_reference_maker.combine.MultiZarrToZarr(
    files,  # list of per-file reference sets (paths to JSONs, or dicts)
    remote_protocol="s3",
    remote_options={"anon": True},
    xarray_open_kwargs={
        "decode_cf": False,
        "mask_and_scale": False,
        "drop_variables": ["crs", "reference_time"],
    },
    xarray_concat_args={"dim": "time"},
    preprocess=preprocess,
)
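The combined reference set is then written to a single JSON with the translate step; if I'm reading the API right it takes the output path (the filename here is arbitrary):

# write the merged reference set as a single JSON file
mzz.translate("combined.json")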
If you could look at how _FillValue went from -999900 to 0, that would be great.
It looks to me like it's happening during the xarray CF decoding with a Zarr store: _FillValue goes from -999900 to 0, and based on my debugging that's happening during the CF conversion process.
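For anyone following along, a minimal sketch of how to see this (the variable name "streamflow" and the mapper m are assumptions, standing in for whatever reference store is being debugged):

import xarray as xr

# with CF decoding off, the raw attribute is intact
raw = xr.open_dataset(m, engine="zarr",
                      backend_kwargs={"consolidated": False},
                      decode_cf=False)
print(raw["streamflow"].attrs.get("_FillValue"))     # -999900

# with default CF decoding, the fill value moves to .encoding, showing up as 0
dec = xr.open_dataset(m, engine="zarr",
                      backend_kwargs={"consolidated": False})
print(dec["streamflow"].encoding.get("_FillValue"))  # 0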
The notebook is at /shared/users/lsterzinger/hrrr.ipynb; I also uploaded it to nbviewer here: https://nbviewer.jupyter.org/gist/lsterzinger/c6f8c68c35f94794b5c76cf8b1fea30a
I just posted on @lsterzinger’s GEOS tutorial repo that we ought to make that a recipe for pangeo-forge, and also mentioned NWM, but totally forgot about the HRRR work! These should all be there, even if pangeo-forge doesn’t yet have a mechanism for dealing with regularly updated datasets.
@martindurant 100%, let me take a closer look at your HDF5 recipe and see what's needed.
out is a dict, so it can be serialized with ujson and written wherever fsspec can reach:

import ujson

# fs2 is the target filesystem and outfname the destination path
# for the reference JSON
with fs2.open(outfname, "w") as f:
    f.write(ujson.dumps(out))
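For completeness, the round trip: the written JSON can then be opened as a virtual filesystem and handed to xarray/zarr (same anonymous-S3 options as above):

import fsspec
import xarray as xr

# view the reference JSON as a filesystem over the original remote bytes
fs = fsspec.filesystem(
    "reference",
    fo=outfname,  # the reference JSON written above
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr",
    backend_kwargs={"consolidated": False},
)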
@martindurant , I just realized that indeed as you predicted yesterday, we have some more work to do on time variables, at least for Grib files! Check out cells [17] and [18] in this notebook:
https://nbviewer.jupyter.org/gist/rsignell-usgs/fedf4b0e2d80bd9d202792ed99100d6f
The "time" variable is the time at which the model was run, and since I'm appending the latest forecast to the "best time series", all the values at the end are the same.
Meanwhile the "valid_time" variable, which is what one would expect to be the "time" variable (holding the time values for each hour of the forecast), has only the first two values, with all the rest NaN.
So can we just flip them? We don't really care about providing the hour at which the model was run, since that could go in the description of the dataset. An evenly-spaced variable called "time" (which apparently lives in the "valid_time" variable in GRIB) is what we want. Make sense?
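A sketch of that flip in xarray, assuming ds is the combined dataset and that the NaN tail of valid_time gets filled in first:

# drop the model-run "time" coordinate and promote valid_time
# to be the dimension coordinate along "time"
ds = ds.drop_vars("time").rename({"valid_time": "time"})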