Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Activity
    Rich Signell
    @rsignell-usgs
    It's Chelle's fault
    :)
    Martin Durant
    @martindurant
    trying too hard

    --Rudimentary draft--

    Category
    Talk

    Official Keywords
    Big Data, Data Engineering, Data Mining / Scraping

    Additional Keywords

    Prior Knowledge?
    No previous knowledge expected

    Brief Summary
    We introduce ReferenceFileSystem, a virtual implementation for fsspec which views arbitrary byte chunks at specific keys, presenting chunks of HDF5, TIFF, grib2 and others at the appropriate paths conforming to zarr's model. Thus, you can use zarr to load data from potentially thousands of remote data files, selecting only what you need, and with parallelism and concurrency. 

    Outline

    • the standard formats used for massive data archiving and their shortcomings
    • brief introduction to zarr, designed for parallel access to remote data
    • the process by which we can expose archival chunks to zarr and aggregate datasets
    • examples of >>10TB datasets where we have done this

    Description
    fsspec's ReferenceFileSystem allows a file system like virtual view onto chunks of bytes stored in arbitrary locations elsewhere, e.g., cloud bucket storage. We can present each byte chunk as a particular path in the filesystem conforming to the zarr hierarchy model, such that the original set of chunks, potentially across many files, appears as a single zarr dataset. This brings the following advantages:

    • only zarr (plus the relevant codecs) is required to read the data, but the original data could be locked in HDF5, TIFF or grib2 files (and more to come)
    • the metadata is extracted once, so future opening of the dataset does not need to scan through the target files to find metadata, and so the process is much faster
    • you get a single logical view over potentially thousands of files, but due to zarr's access model, you only load the data you need
    • loading can happen chunk-wise and in parallel
      The details of how to make such reference files is described at https://github.com/intake/fsspec-reference-maker , and the latest result for some 80TB/370k files of water flow modelling can be seen at https://nbviewer.jupyter.org/gist/rsignell-usgs/02da7d9257b4b26d84d053be1af2ceeb . Note that this is using xarray to process the data, but only zarr is needed to load it.
    Chelle Gentemann
    @cgentemann
    I like Rich's suggestion of just fliping it to 'all you need is zarr: par....'
    Rich Signell
    @rsignell-usgs
    Looks good Martin -- I might also add that you can modify/fix the metadata also!
    And perhaps that you can construct multiple virtual datasets with the same collection of original files, like datacubes for each forecast hour that would allow one to, say, easily quantify how forecasts degrade over time for a specific region/metric, etc
    Lucas Sterzinger
    @lsterzinger
    @martindurant at some point in the past few weeks you helped Rich with an issue w/ the NWM data where the lat/lon vars also had a time dimension. He says you wrote some code to fix it, but I don't think it's merged into main since I'm getting the same error w/ the latest fsspec-reference-maker
    Martin Durant
    @martindurant
    This was not the case with the consolidated json file we were sharing - did something change?
    Rich Signell
    @rsignell-usgs
    Lucas, that jogged my memory! Here's what Martin did:
    def preprocess(ds):
        return ds.set_coords(['latitude', 'longitude'])
    
    mzz = fsspec_reference_maker.combine.MultiZarrToZarr(
        files,
        remote_protocol="s3",
        remote_options={'anon':True},
        xarray_open_kwargs={
            "decode_cf" : False,
            "mask_and_scale" : False,
            "drop_variables": ["crs", "reference_time"]
        },
        xarray_concat_args={
            'dim' : 'time'
        },
        preprocess=preprocess
    )
    Martin Durant
    @martindurant
    oh right, I valuely remember that
    Lucas Sterzinger
    @lsterzinger
    Ah! I forgot to preprocess when I ran mzz on the 400 consolidated dicts
    Lucas Sterzinger
    @lsterzinger
    Here is a notebook that I'm including in my blog post showing how to use dask bag to create a consolidated reference file for the 367,000 NWM files without needing to store the intermediate files
    https://nbviewer.jupyter.org/gist/lsterzinger/e8e954d04d8c316567879283639380e9
    Martin Durant
    @martindurant
    Excellent, I’ll look shortly.
    Lucas Sterzinger
    @lsterzinger

    look at how _FillValue went from -999900 to 0, that would be great

    It looks to me like it's happening during the xarray CF decoding with a Zarr store here

    https://github.com/pydata/xarray/blob/2705c63e0c03a21d2bbce3a337fac60dd6f6da59/xarray/backends/store.py#L27-L37

    Though I don't think we need to decode CF a second time? That's all done once already during SingleHdf5ToZarr, surely it doesn't need to happen again? Since the decoded metadata is saved in the JSON
    Martin Durant
    @martindurant
    I think we try to store the original metadata (attributes) except for completely new arrays, which in this case is only the time coordinate.
    Rich Signell
    @rsignell-usgs
    Lucas, I know you guys modified the code to return dicts instead of writing JSON, but is writing JSON still possible?
    Martin Durant
    @martindurant
    ^ yes; you get json if you give a “.json” as the output. You only get dicts if you give the output as None.
    1 reply
    Lucas Sterzinger
    @lsterzinger

    I think we try to store the original metadata (attributes) except for completely new arrays, which in this case is only the time coordinate.

    Right so _FillValue shouldn't be re-determined since it's in the JSON? It's encoded correctly there

    At some point between reading the JSON and returning the dataset, xarray converts _FillValue from -999900 to 0, and based on my debugging that's happening during the cf conversion process
    Martin Durant
    @martindurant

    The .zattrs files are written by xarray in _build_output, lines

            ds.to_zarr(out, chunk_store={}, compute=False,
                       consolidated=False)  # fills in metadata&coords

    so we don’t have any control here. Later on we explicitly skip copying the .zattrs files

    Rich Signell
    @rsignell-usgs
    @lsterzinger , I'm having trouble finding your notebook that creates referenceFileSystem JSON for the HRRR forecast
    Martin Durant
    @martindurant
    I just posted on @lsterzinger ’s GEOS tutorial repo that we ought to make that a recipe for pangeo-forge, and also mentioned NWM, but totally forgot about the HRRR work! These should all be there, even if pangeo-forge doesn’t yet have a mechanism for dealing with regularly updated datasets.
    Lucas Sterzinger
    @lsterzinger
    @rsignell-usgs It's on esip-qhub at /shared/users/lsterzinger/hrrr.ipynb, I also uploaded it to nbviewer here https://nbviewer.jupyter.org/gist/lsterzinger/c6f8c68c35f94794b5c76cf8b1fea30a

    I just posted on @lsterzinger ’s GEOS tutorial repo that we ought to make that a recipe for pangeo-forge, and also mentioned NWM, but totally forgot about the HRRR work! These should all be there, even if pangeo-forge doesn’t yet have a mechanism for dealing with regularly updated datasets.

    @martindurant 100%, let me take a closer look at your Hdf5 recipe and see what's needed

    Lucas Sterzinger
    @lsterzinger
    @rsignell-usgs I just realized that the hrrr filenames do not include dates, only times, so the list of jsons created by that notebook are out of order (compare to list of URLs in the 2nd cell)
    Rich Signell
    @rsignell-usgs
    I'm trying to write the HRRR json to S3 (so that I can use a distributed cluster) and something is wrong here:
        out = scan_grib(u, common, so, inline_threashold=100, filter=afilter)        
        with fs2.open(outfname, "wb") as f:
            f.write(ujson.dumps(out))
    @martindurant, I'm guessing you see what's wrong . out is a dict.
    Martin Durant
    @martindurant
    outfname is “s3://..” ? Do you need credentials in there? Does the first line complete?
    Rich Signell
    @rsignell-usgs
    yes
    oh, it's not "wb", is it?
    yep, this works:
    with fs2.open(outfname, "w") as f:
        f.write(ujson.dumps(out))
    Martin Durant
    @martindurant
    I think builtin json allows either binary, but maybe ujson doesn't
    Martin Durant
    @martindurant
    load of uncompressed FITS files in “gcs://pangeo-data/SDO_AIA_Images” (3TB) see pangeo-data/pangeo#269
    Appears to be exactly one binary chunk per file (for the only one I downloaded)
    Chelle Gentemann
    @cgentemann
    just to follow up from meeting today trying to get a higher-profile article out about how lucas/rich/martin are changing how we think about accessing data. i'm happy to help out with a medium post and like I said in our chat, right now I'm in write-only mode. If someone can figure out how to change my access back to science-only I'd appreciate it, but well, for now here is a document we can start working in. i'll try to get an outline going. https://docs.google.com/document/d/1O2dPeB1smArHg62XcNOxwwEpWDwdUiWIn09XS-fr4tc/edit?usp=sharing
    Rich Signell
    @rsignell-usgs
    Thanks Chelle! I'll check this out in more detail tomorrow.
    @lsterzinger , here's the HRRR forecast notebook with Dask and using 1 hour tau from the past forecasts + the latest forecast: https://nbviewer.jupyter.org/gist/rsignell-usgs/a047178ee12d44c2a7900ee86ba2fbc7
    I'm not sure why we are getting winds at 10m with a filter of 2m. Also we seem to be getting multiple wind variables. Will explore more tomorrow. But the hard part works!
    Rich Signell
    @rsignell-usgs

    @martindurant , I just realized that indeed as you predicted yesterday, we have some more work to do on time variables, at least for Grib files! Check out cells [17] and [18] in this notebook:
    https://nbviewer.jupyter.org/gist/rsignell-usgs/fedf4b0e2d80bd9d202792ed99100d6f

    The "time" variable is the time at which the model was run, and since I'm appending the latest forecast to the "best time series", all the values at the end are the same.

    Meanwhile the "valid_time" variable, what one would expect to be the "time" variable (having the time values for each hour of the forecast), has only the first two values, with all the rest NaN.

    So can we just flip them? We don't really care about providing the hour at which the model was run, since that could be in the description of the dataset. An evenly-spaced variable called "time" (that apparently is in the "valid_time" variable in Grib) is what we want. Make sense?

    Martin Durant
    @martindurant
    All booked up until the afternoon...
    Martin Durant
    @martindurant
    40TB of FITS imaging in ‘s3://stpubdata/tess/public/mast’ in 128GB uncompressed files?
    Rich Signell
    @rsignell-usgs
    So we can read FITS into xarray via rasterio? https://gdal.org/drivers/raster/fits.html
    Chelle Gentemann
    @cgentemann
    this is so great -
    Martin Durant
    @martindurant
    you don’t need rasterio, it’s pure C buffers (i.e., zarr reader with compression=None)
    Rich Signell
    @rsignell-usgs
    oh nice
    Chelle Gentemann
    @cgentemann
    (me opening up aws hub quickly)
    Martin Durant
    @martindurant
    For the 128GB files, I can subset the massive array on the biggest dimension, but this will be by hand for now.