    Rich Signell
    @rsignell-usgs
    It actually would be nice if we fixed the metadata so that 0 was not converted to NaN. I doubt they meant for that to happen -- the providers just didn't understand the CF conventions well enough, which is not that uncommon.
    Lucas Sterzinger
    @lsterzinger
    Gotcha, makes sense. One time I loaded the dataset it filled the feature_id dimension with NaN but now that I check again I see it has its normal values, not sure what happened there
    Rich Signell
    @rsignell-usgs
    Lucas, if you are still working today, can you take a look at this notebook and try to figure out why the streamflow encoding in the original NetCDF files is different from that in the consolidated dataset? In cells [18] and [19] here you can see the difference: the scale_factor has round-off error, and the _FillValue is 0 instead of -999900.
    Martin Durant
    @martindurant
    I don’t know, but it’s plausible that some values are inferred by cfgrib, as opposed to real attributes in the data, and that this inference path is different with the zarr interface versus netcdf. Note that the zarr version is stored as JSON text and loaded by Python as a float64; in the original float32, this is the closest possible representation of 0.01.
    In [10]: np.array(0.009999999776482582, dtype="f4")
    Out[10]: array(0.01, dtype=float32)
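    For illustration, a minimal sketch (assuming Python's json module, not necessarily the exact code path used when building the references) of how a float32 scale_factor picks up the long decimal once it passes through JSON:
    import json
    import numpy as np

    scale = np.float32(0.01)                  # the attribute as stored in the NetCDF file
    text = json.dumps({"scale_factor": float(scale)})
    print(text)                               # {"scale_factor": 0.009999999776482582}
    # casting the JSON float64 back to float32 recovers 0.01 exactly
    print(np.float32(json.loads(text)["scale_factor"]))   # 0.01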
    Rich Signell
    @rsignell-usgs
    this is the NWM/NetCDF4 dataset, not the HRRR/GRIB one, though.
    Do you have an idea of how the _FillValue could go from -999900 to 0?
    Martin Durant
    @martindurant
    ok, maybe I meant “CF stuff” in general. h5py would be able to tell us which attributes are actually in the data, not inferred on load
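    For reference, a minimal h5py sketch of that check (run against a local copy of one of the files; it only lists the attributes stored on disk, with no CF decoding applied):
    import h5py

    # raw HDF5 attributes, nothing inferred on load
    with h5py.File("202001011100.CHRTOUT_DOMAIN1.comp", "r") as f:
        for name, value in f["streamflow"].attrs.items():
            print(name, value)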
    Rich Signell
    @rsignell-usgs
    Cell [18] tells us what is in the data, right? -- it reads directly from NetCDF:
    'missing_value': array([-999900], dtype=int32),
     '_FillValue': array([-999900], dtype=int32),
     'scale_factor': array([0.01], dtype=float32),
     'add_offset': array([0.], dtype=float32),
    Martin Durant
    @martindurant
    I think _FillValue is probably a consequence of the exact error we get when accessing the data - xarray isn’t happy with it
    no, that’s xarray’s view
    and 0.01(32) == 0.009999999776482582(64)
    Rich Signell
    @rsignell-usgs
    $ ncdump -h 202001011100.CHRTOUT_DOMAIN1.comp | grep streamflow
            int streamflow(feature_id) ;
                    streamflow:long_name = "River Flow" ;
                    streamflow:units = "m3 s-1" ;
                    streamflow:coordinates = "latitude longitude" ;
                    streamflow:grid_mapping = "crs" ;
                    streamflow:_FillValue = -999900 ;
                    streamflow:missing_value = -999900 ;
                    streamflow:scale_factor = 0.01f ;
                    streamflow:add_offset = 0.f ;
                    streamflow:valid_range = 0, 5000000 ;
    The "f" following the values indicates floating point (32 bit)
    Martin Durant
    @martindurant
    I would comment that _FillValue == missing_value is a bizarre choice.
    Yes, float32. We just have a more precise version of the same number, because JSON isn’t binary.
    Rich Signell
    @rsignell-usgs
    I agree, they should not have set "missing_value", which is redundant. So do you have an idea of how _FillValue got set to 0 somewhere in the workflow?
    Martin Durant
    @martindurant
    Only vague guesses. I’ll say “no”
    Rich Signell
    @rsignell-usgs
    And on the subject of missing values, the provider should have just stopped at valid_range: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data
    Everything outside that range is turned into NaN, which includes -999900
    Martin Durant
    @martindurant
    It might be reasonable to allow metadata processing as part of our pipeline to correct such things
    Rich Signell
    @rsignell-usgs
    Lucas: So Martin answered one of my questions (why scale_factor looks like it has round off error), but if you could look at how _FillValue went from -999900 to 0, that would be great
    Martin Durant
    @martindurant
    I wonder if it conflicts with zarr’s internal missing value field, which is called “fill_value”. Note that http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html explicitly mentions _FillValue and missing_value, whereas for HDF they would be handled by h5py. Maybe we are seeing an xarray bug?
    ^ those were two unrelated guesses, if it wasn’t clear :)
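    One possible way to test both guesses (a sketch only; "combined.json" is a placeholder for the consolidated reference file) is to open the reference store with CF handling disabled and compare the raw attributes against what ncdump shows:
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo="combined.json",                   # placeholder: the consolidated reference file
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    # disable mask/scale so xarray reports the attributes as stored, not its decoded view
    ds = xr.open_zarr(fs.get_mapper(""), consolidated=False, mask_and_scale=False)
    print(ds["streamflow"].attrs)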
    Rich Signell
    @rsignell-usgs
    I checked to see if the attributes survive round tripping with xarray and netcdf, and they do:
    https://nbviewer.jupyter.org/gist/rsignell-usgs/4ea6caac48319e0f39cd2fd1ecaee027
    So at least that's good!
    Martin Durant
    @martindurant
    but what about to_zarr?
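    i.e. something along these lines (a sketch with a throwaway local store; the paths are illustrative):
    import xarray as xr

    # read the original file without CF decoding, write it through to_zarr,
    # and see whether _FillValue / scale_factor come back unchanged
    ds = xr.open_dataset("202001011100.CHRTOUT_DOMAIN1.comp", mask_and_scale=False)
    ds.to_zarr("roundtrip.zarr", mode="w")
    ds2 = xr.open_zarr("roundtrip.zarr", mask_and_scale=False)
    print(ds2["streamflow"].attrs)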
    Martin Durant
    @martindurant
    I am submitting a talk proposal to pydata-global: “Parallel access to remote HDF5, TIFF, grib2 and others. All you need is zarr.” (current title). Would anyone here like to be a co-author?
    Chelle Gentemann
    @cgentemann
    I think Rich is the king of good titles. I tend to try and take out jargon & try to make them catchy. "Faster data access for HDF5, Tiff, grib2, and others" or just "Faster data access"
    Lucas Sterzinger
    @lsterzinger

    Lucas: So Martin answered one of my questions (why scale_factor looks like it has round off error), but if you could look at how _FillValue went from -999900 to 0, that would be great

    Can do!

    Rich Signell
    @rsignell-usgs
    I think for a technical crowd (like pydata folks), listing all those formats in your title is actually a good thing
    But to Chelle's point, maybe switch the order, so that it's "All you need is Zarr: parallel access to..."
    Or "Long live the Zarr: parallel access to..
    Martin Durant
    @martindurant
    like tsar?
    Rich Signell
    @rsignell-usgs
    yeah
    maybe too corny
    It's Chelle's fault
    :)
    Martin Durant
    @martindurant
    trying too hard

    --Rudimentary draft--

    Category
    Talk

    Official Keywords
    Big Data, Data Engineering, Data Mining / Scraping

    Additional Keywords

    Prior Knowledge?
    No previous knowledge expected

    Brief Summary
    We introduce ReferenceFileSystem, a virtual fsspec implementation that maps specific keys to arbitrary byte chunks, presenting chunks of HDF5, TIFF, grib2 and other files at the paths expected by zarr's model. Thus, you can use zarr to load data from potentially thousands of remote data files, selecting only what you need, with parallelism and concurrency.

    Outline

    • the standard formats used for massive data archiving and their shortcomings
    • brief introduction to zarr, designed for parallel access to remote data
    • the process by which we can expose archival chunks to zarr and aggregate datasets
    • examples of >>10TB datasets where we have done this

    Description
    fsspec's ReferenceFileSystem provides a filesystem-like virtual view onto chunks of bytes stored in arbitrary locations elsewhere, e.g., cloud bucket storage. We can present each byte chunk as a particular path in the filesystem conforming to the zarr hierarchy model, so that the original set of chunks, potentially spread across many files, appears as a single zarr dataset. This brings the following advantages:

    • only zarr (plus the relevant codecs) is required to read the data, but the original data could be locked in HDF5, TIFF or grib2 files (and more to come)
    • the metadata is extracted once, so opening the dataset later does not require scanning the target files for metadata, which makes the process much faster
    • you get a single logical view over potentially thousands of files, but due to zarr's access model, you only load the data you need
    • loading can happen chunk-wise and in parallel
    The details of how to make such reference files are described at https://github.com/intake/fsspec-reference-maker , and the latest result for some 80TB/370k files of water flow modelling can be seen at https://nbviewer.jupyter.org/gist/rsignell-usgs/02da7d9257b4b26d84d053be1af2ceeb . Note that this is using xarray to process the data, but only zarr is needed to load it.
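    A hedged sketch of the access pattern described above (the reference file name and storage options are placeholders, not from a real deployment):
    import fsspec
    import zarr

    fs = fsspec.filesystem(
        "reference",
        fo="references.json",                 # placeholder: a generated reference file
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    # only zarr (plus the relevant codecs) is needed from here on
    group = zarr.open(fs.get_mapper(""), mode="r")
    print(list(group.array_keys()))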
    Chelle Gentemann
    @cgentemann
    I like Rich's suggestion of just flipping it to 'all you need is zarr: par....'
    Rich Signell
    @rsignell-usgs
    Looks good Martin -- I might add that you can modify/fix the metadata too!
    And perhaps that you can construct multiple virtual datasets from the same collection of original files, like a datacube for each forecast hour, which would allow one to, say, easily quantify how forecasts degrade over time for a specific region/metric, etc.
    Lucas Sterzinger
    @lsterzinger
    @martindurant at some point in the past few weeks you helped Rich with an issue w/ the NWM data where the lat/lon vars also had a time dimension. He says you wrote some code to fix it, but I don't think it's merged into main since I'm getting the same error w/ the latest fsspec-reference-maker
    Martin Durant
    @martindurant
    This was not the case with the consolidated json file we were sharing - did something change?
    Rich Signell
    @rsignell-usgs
    Lucas, that jogged my memory! Here's what Martin did:
    import fsspec_reference_maker.combine

    # mark lat/lon as coordinates so they aren't treated as data variables
    # and concatenated along the time dimension
    def preprocess(ds):
        return ds.set_coords(['latitude', 'longitude'])

    mzz = fsspec_reference_maker.combine.MultiZarrToZarr(
        files,
        remote_protocol="s3",
        remote_options={'anon': True},
        xarray_open_kwargs={
            "decode_cf": False,
            "mask_and_scale": False,
            "drop_variables": ["crs", "reference_time"]
        },
        xarray_concat_args={
            'dim': 'time'
        },
        preprocess=preprocess
    )
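    (The combined references were then presumably written out with something like the following; the output filename is illustrative and the exact signature of translate may have changed across versions.)
    # write the combined references to a single JSON file
    mzz.translate("nwm_combined.json")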
    Martin Durant
    @martindurant
    oh right, I vaguely remember that
    Lucas Sterzinger
    @lsterzinger
    Ah! I forgot to preprocess when I ran mzz on the 400 consolidated dicts
    Lucas Sterzinger
    @lsterzinger
    Here is a notebook that I'm including in my blog post showing how to use dask bag to create a consolidated reference file for the 367,000 NWM files without needing to store the intermediate files
    https://nbviewer.jupyter.org/gist/lsterzinger/e8e954d04d8c316567879283639380e9
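    Not the notebook itself, just a rough skeleton of the idea (gen_ref and the URL list are placeholders): build each per-file reference dict in a dask bag and keep everything in memory rather than writing intermediate JSON.
    import dask.bag as db
    import fsspec
    from fsspec_reference_maker.hdf import SingleHdf5ToZarr

    def gen_ref(url):
        # build the reference dict for one remote NetCDF file, kept in memory
        with fsspec.open(url, mode="rb", anon=True) as f:
            return SingleHdf5ToZarr(f, url).translate()

    urls = [...]                              # the 367,000 NWM file URLs
    bag = db.from_sequence(urls, npartitions=100).map(gen_ref)
    dicts = bag.compute()                     # per-file reference dicts, ready for MultiZarrToZarr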
    Martin Durant
    @martindurant
    Excellent, I’ll look shortly.
    Lucas Sterzinger
    @lsterzinger

    look at how _FillValue went from -999900 to 0, that would be great

    It looks to me like it's happening during the xarray CF decoding with a Zarr store here

    https://github.com/pydata/xarray/blob/2705c63e0c03a21d2bbce3a337fac60dd6f6da59/xarray/backends/store.py#L27-L37

    Though I don't think we need to decode CF a second time? That's all done once already during SingleHdf5ToZarr, so surely it doesn't need to happen again, since the decoded metadata is saved in the JSON
    Martin Durant
    @martindurant
    I think we try to store the original metadata (attributes) except for completely new arrays, which in this case is only the time coordinate.
    Rich Signell
    @rsignell-usgs
    Lucas, I know you guys modified the code to return dicts instead of writing JSON, but is writing JSON still possible?