    Rich Signell
    @rsignell-usgs
    says "no module named zstandard"
    ??
    but if I open a terminal and do "conda list" it's there
    Martin Durant
    @martindurant

    I have both

    zstandard                 0.15.2           py38h96a0964_0    conda-forge
    zstd                      1.4.9                h582d3a0_0    conda-forge

    so I don’t actually know which is material.

    Rich Signell
    @rsignell-usgs
    Ah, I had to shut down my server and restart
    Does this look right?
    sources:
      nwm-reanalysis:
        driver: intake_xarray.xzarr.ZarrSource
        description: 'National Water Model Reanalysis, version 2.1'
        args:
          urlpath: 'reference://'
          simple_templates: True
          storage_options:
            target_options:
              anon: true
              compression: 'zstd'
            target_protocol: s3
            fo: 's3://esip-qhub-public/noaa/nwm/nwm_reanalysis.json.zst'
            remote_options:
              anon: true
            remote_protocol: s3
    Martin Durant
    @martindurant
    simple_templates goes under storage_options
    Otherwise good
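    Applying that correction, the catalog entry becomes:

    sources:
      nwm-reanalysis:
        driver: intake_xarray.xzarr.ZarrSource
        description: 'National Water Model Reanalysis, version 2.1'
        args:
          urlpath: 'reference://'
          storage_options:
            simple_templates: True
            target_options:
              anon: true
              compression: 'zstd'
            target_protocol: s3
            fo: 's3://esip-qhub-public/noaa/nwm/nwm_reanalysis.json.zst'
            remote_options:
              anon: true
            remote_protocol: s3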
    Martin Durant
    @martindurant
    Renders quite badly! But good to see it anyway!
    Rich Signell
    @rsignell-usgs
    Renders quite badly?
    If you have ideas about how to render it better, please let me know!
    (or I guess I could ask Jim Bednar)
    Lucas Sterzinger
    @lsterzinger
    Put in nbviewer instead ;)
    Lucas Sterzinger
    @lsterzinger
    Rich can I share that S3 json in my blog post? Is it requester pays?
    Martin Durant
    @martindurant
    Yeah, I just meant gist’s view of it
    Rich Signell
    @rsignell-usgs
    And yes Lucas, everything (NetCDF files, JSON, and Intake catalog) is in public buckets that don't require auth (i.e. not requester pays), so anyone should be able to run that Notebook, regardless of whether they have an Amazon account!
    The one thing we might consider is whether to remove the "forecast" dataset from that catalog, as we don't have that updating yet to match the rolling forecast archive, so it's not being kept up to date.
    Rich Signell
    @rsignell-usgs
    Here's a better nbviewer link for the NWM demo notebook: https://nbviewer.jupyter.org/gist/rsignell-usgs/02da7d9257b4b26d84d053be1af2ceeb
    Martin Durant
    @martindurant
    perfect
    I think it’s fine to leave forecast. When talking about it, we can note that how to keep rolling datasets up to date is a specific line of discussion in pangeo-forge.
    Rich Signell
    @rsignell-usgs
    Oops, I had a few typos in my strftime; here's a hopefully final notebook link: https://nbviewer.jupyter.org/gist/rsignell-usgs/89767200a0722462d37ea971b9588004
    Lucas Sterzinger
    @lsterzinger
    I tried recreating the combined json all in one go (using dask.bag to create individual reference dicts, passing them to mzz, combining everything into a combined json) and when I open the result with xarray I get the following errors for some (but not all) of my variables:
      ds = xr.open_dataset(fs.get_mapper(""), engine='zarr')
    /home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'qBtmVertRunoff' has multiple fill values {-9999000, 0}, decoding all values to NaN.
    So some of the variables have been 100% converted to NaN, apparently due to conflicting fill values
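    A sketch of the pipeline Lucas describes, assuming the fsspec-reference-maker API of the time (SingleHdf5ToZarr and MultiZarrToZarr; exact signatures varied between releases, and urls stands in for the list of NetCDF file URLs):

    import json
    import dask.bag as db
    import fsspec
    from fsspec_reference_maker.hdf import SingleHdf5ToZarr
    from fsspec_reference_maker.combine import MultiZarrToZarr

    def gen_ref(url):
        # build the reference dict for a single NetCDF file on S3
        with fsspec.open(url, anon=True) as f:
            return SingleHdf5ToZarr(f, url).translate()

    # create the per-file reference dicts in parallel with dask.bag
    refs = db.from_sequence(urls).map(gen_ref).compute()

    # combine along time and write one consolidated reference JSON
    mzz = MultiZarrToZarr(refs, remote_protocol="s3",
                          remote_options={"anon": True},
                          concat_dims=["time"])
    with open("combined.json", "w") as f:
        json.dump(mzz.translate(), f)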
    Rich Signell
    @rsignell-usgs
    Lucas, I think that means that both 0 and -9999000 got converted to NaN, not all values
    I get that same message.
    It actually would be nice if we fixed the metadata so that 0 was not converted to NaN. I doubt they meant for that to happen -- the providers just didn't understand the CF conventions well enough, which is not that uncommon.
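    One quick way to see the stored values without either fill value becoming NaN is to switch off CF decoding (mask_and_scale is a standard xarray option; fs is the reference filesystem from Lucas's snippet above):

    import xarray as xr

    # open without mask/scale decoding, so the packed integers and
    # both declared fill values are visible as-is
    ds_raw = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                             mask_and_scale=False)
    print(ds_raw["qBtmVertRunoff"].attrs)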
    Lucas Sterzinger
    @lsterzinger
    Gotcha, makes sense. One time I loaded the dataset it filled the feature_id dimension with NaN but now that I check again I see it has its normal values, not sure what happened there
    Rich Signell
    @rsignell-usgs
    Lucas, if you are still working today, can you take a look at this notebook and try to figure out why the streamflow encoding in the original NetCDF files is different from that in the consolidated dataset? In cells [18] and [19] here you can see the difference: the scale_factor has round-off error, and the _FillValue is 0 instead of -999900.
    Martin Durant
    @martindurant
    I don’t know, but it’s plausible that some values are inferred by cfgrib, as opposed to being real attributes in the data, and that this inference path differs between the zarr interface and netcdf. Note that the zarr version is stored in JSON text and loaded by Python as a float64; in the original float32, this is the closest possible representation of 0.01.
    In [10]: np.array(0.009999999776482582, dtype="f4")
    Out[10]: array(0.01, dtype=float32)
    Rich Signell
    @rsignell-usgs
    this is the NWM/NetCDF4 dataset, not the HRRR/GRIB one, though.
    Do you have an idea of how the _FillValue could go from -999900 to 0?
    Martin Durant
    @martindurant
    ok, maybe I meant “CF stuff” in general. h5py would be able to tell us which attributes are actually in the data, not inferred on load
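    For instance, reading the attributes straight from one of the files with h5py bypasses xarray/CF handling entirely (shown here against the same file Rich inspects with ncdump below):

    import h5py

    # attributes physically stored in the NetCDF4/HDF5 file,
    # with nothing inferred on load
    with h5py.File("202001011100.CHRTOUT_DOMAIN1.comp", "r") as h5:
        print(dict(h5["streamflow"].attrs))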
    Rich Signell
    @rsignell-usgs
    Cell [18] tells us what is in the data, right? -- it reads directly from NetCDF:
    'missing_value': array([-999900], dtype=int32),
     '_FillValue': array([-999900], dtype=int32),
     'scale_factor': array([0.01], dtype=float32),
     'add_offset': array([0.], dtype=float32),
    Martin Durant
    @martindurant
    I think _FillValue is probably a consequence of the exact error we get when accessing the data - xarray isn’t happy with it
    no, that’s xarray’s view
    and 0.01(32) == 0.009999999776482582(64)
    Rich Signell
    @rsignell-usgs
    $ ncdump -h 202001011100.CHRTOUT_DOMAIN1.comp | grep streamflow
            int streamflow(feature_id) ;
                    streamflow:long_name = "River Flow" ;
                    streamflow:units = "m3 s-1" ;
                    streamflow:coordinates = "latitude longitude" ;
                    streamflow:grid_mapping = "crs" ;
                    streamflow:_FillValue = -999900 ;
                    streamflow:missing_value = -999900 ;
                    streamflow:scale_factor = 0.01f ;
                    streamflow:add_offset = 0.f ;
                    streamflow:valid_range = 0, 5000000 ;
    The "f" following the values indicates floating point (32 bit)
    Martin Durant
    @martindurant
    I would comment that _FillValue == missing_value is a bizarre choice.
    Yes, float32. We just have a more precise version of the same number, because JSON isn’t binary.
    Rich Signell
    @rsignell-usgs
    I agree, they should not have set "missing_value", which is redundant with _FillValue. So do you have an idea of how _FillValue got set to 0 somewhere in the workflow?
    Martin Durant
    @martindurant
    Only vague guesses. I’ll say “no”.
    Rich Signell
    @rsignell-usgs
    And on the subject of missing values, the provider should have just stopped at valid_range: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data
    Everything outside that range is turned into NaN, which includes -999900
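    In terms of the packed integers, that rule amounts to something like the following (a manual sketch of CF valid_range handling, using the attribute values from the ncdump output above and the undecoded ds_raw from the earlier sketch; xarray does not apply valid_range on its own):

    import numpy as np

    raw = ds_raw["streamflow"].values       # packed int32 values
    lo, hi = 0, 5000000                     # valid_range from the file
    scale, offset = 0.01, 0.0               # scale_factor, add_offset
    decoded = np.where((raw < lo) | (raw > hi),   # -999900 falls outside
                       np.nan, raw * scale + offset)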
    Martin Durant
    @martindurant
    It might be reasonable to allow metadata processing as part of our pipeline to correct such things
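    A sketch of what such a correction step could look like, applied to the reference JSON itself: in the version-1 reference format, per-variable attributes are stored as JSON-encoded strings under keys like "streamflow/.zattrs", so they can be patched before publishing (the right fix for _FillValue is still the open question above):

    import json

    with open("combined.json") as f:
        refs = json.load(f)

    # .zattrs entries are themselves JSON strings inside the reference set
    attrs = json.loads(refs["refs"]["streamflow/.zattrs"])
    attrs["_FillValue"] = -999900      # restore the value from the NetCDF files
    attrs.pop("missing_value", None)   # drop the redundant attribute
    refs["refs"]["streamflow/.zattrs"] = json.dumps(attrs)

    with open("combined_fixed.json", "w") as f:
        json.dump(refs, f)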
    Rich Signell
    @rsignell-usgs
    Lucas: So Martin answered one of my questions (why scale_factor looks like it has round off error), but if you could look at how _FillValue went from -999900 to 0, that would be great
    Martin Durant
    @martindurant
    I wonder if it conflicts with zarr’s internal missing value field, which is called “fill_value”. Note that http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html explicitly mentions _FillValue and missing_value, whereas for HDF they would be handled by h5py. Maybe we are seeing an xarray bug?
    ^ those were two unrelated guesses, if it wasn’t clear :)
    Rich Signell
    @rsignell-usgs
    I checked to see if the attributes survive round tripping with xarray and netcdf, and they do:
    https://nbviewer.jupyter.org/gist/rsignell-usgs/4ea6caac48319e0f39cd2fd1ecaee027
    So at least that's good!
    Martin Durant
    @martindurant
    but what about to_zarr?
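    The analogous check through zarr would be along these lines (a minimal sketch; xarray records _FillValue, scale_factor and add_offset in .encoding after decoding):

    import xarray as xr

    ds = xr.open_dataset("202001011100.CHRTOUT_DOMAIN1.comp")
    ds.to_zarr("roundtrip.zarr", mode="w")
    ds2 = xr.open_zarr("roundtrip.zarr")

    # did the fill value and packing attributes survive the round trip?
    print(ds["streamflow"].encoding)
    print(ds2["streamflow"].encoding)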
    Martin Durant
    @martindurant
    I am submitting a talk proposal to pydata-global: “Parallel access to remote HDF5, TIFF, grib2 and others. All you need is zarr.” (current title). Would anyone here like to be a co-author?