    Chelle Gentemann
    @cgentemann
    this is really helpful martin, thanks. if we can demo it working, that would be a win, and we can work more with nasa --- if we can show any advancement for data access - just maybe by uncompressing the gzip files after pushing to AWS, maybe that would be a path forward? the astro community is so FITS FITS FITS.... just like I was binary binary binary 30 years ago....
    Martin Durant
    @martindurant
    So if I did that same dataset, but with
    • the wavelength as a coordinate (so you can select it with a slider)
    • mapping to helio lat/lon (or just show how you can recreate the world coordinates)
    • a version with sub-selection in one of the dimensions, so you can do a time series on a section of the image more easily
      … is that enough of an argument?
      Should I do the same for the much more massive set of downsampled JPEG images on the public SDO server?
      I was hoping to get around to the dataset in intake/fsspec-reference-maker#78 , as a nice example of a dataset made by merging on multiple dimensions.
    Rich Signell
    @rsignell-usgs
    So cool that ReferenceFileSystem is now featuring geotiff!
    https://github.com/intake/fsspec-reference-maker/issues/78#issuecomment-924456900
    Martin Durant
    @martindurant
    Perhaps more importantly: that dataset has FIVE dimensions, three aggregated (and two-dimensional images in each chunk)
    Rich Signell
    @rsignell-usgs
    Yeah, that is also very cool
    @martindurant , did you get contacted by Ivelina Momcheva from the Space Telescope Science Institute? I spoke with her a few days ago and she is very interested in your FITS work (and in Pangeo in general).
    Martin Durant
    @martindurant
    I did not
    Martin Durant
    @martindurant
    OK, I will speak with her this afternoon.
    Rich Signell
    @rsignell-usgs
    Awesome!
    Chelle Gentemann
    @cgentemann
    this is great!!!
    Chelle Gentemann
    @cgentemann
    is anyone else going to Ocean Sciences? This looks like a good session that we should submit this project to: https://www.aslo.org/osm2022/scientific-sessions/#od
    Lucas Sterzinger
    @lsterzinger
    I wasn't planning on it since I'm already presenting this at AGU, but if you are able to send me to Honolulu I certainly won't complain ;)
    Chelle Gentemann
    @cgentemann
    kk working on invited slot
    start working on an abstract!
    deadline is tomorrow midnight
    Chelle Gentemann
    @cgentemann
    okay - Ken knows your abstract is coming in - he said you have to organize it after submission & he will look for yours. Submit to: OD12 Big Data for a Big Ocean 2022
    Lucas Sterzinger
    @lsterzinger
    @rsignell-usgs do you have an up-to-date NWM + intake catalog example notebook? I'm presenting fsspec-reference-maker at a geospatial workshop on campus on Monday and thought it might be a cool thing to show off
    You could spiff this up with a dask cluster of course
    Lucas Sterzinger
    @lsterzinger
    Thank you!
    Lucas Sterzinger
    @lsterzinger
    Should we make this repo open for Hacktoberfest PRs? Just need to add the "hacktoberfest" topic to the repo https://hacktoberfest.digitalocean.com/resources/maintainers
    Chelle Gentemann
    @cgentemann
    YES!
    Lucas Sterzinger
    @lsterzinger
    (also it's one of the only repos that I actually contribute to often, and I'd like a t-shirt :wink: )
    Rich Signell
    @rsignell-usgs
    There is a repo for this?
    or do you mean the pangeo gallery?
    Lucas Sterzinger
    @lsterzinger
    fsspec-reference-maker
    Rich Signell
    @rsignell-usgs
    ah!
    Martin Durant
    @martindurant
    hacktoberfest can create a lot of noise, but I’ve never been involved before; I'm prepared to give it a go. What do I need to do?
    if we can organize some more slides based on Lucas's that introduce the project, then a few more on where we want to go & what we need to work on, we should be able to get a couple of volunteers (paid) from NASA's IMPACT project.
    Lucas Sterzinger
    @lsterzinger
    I translated some of my slides into markdown for a workshop I'm giving tomorrow, feel free to use https://github.com/lsterzinger/maptimedavis-fsspec/blob/main/01-Create_References.ipynb
    Martin Durant
    @martindurant
    For strictly passer-by hackers, I would suggest that they investigate how to get byte offsets into any other file formats that they have lying around.
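    For anyone picking that up, here is a minimal sketch of the reference format such a contribution would produce (the version-1 spec consumed by fsspec's ReferenceFileSystem; the bucket name, offsets, and array shape below are made up for illustration). Keys follow the Zarr store layout, and each chunk key maps to [url, offset, length] in the original binary file.

```python
import json

# Minimal sketch of a "version 1" reference set (hypothetical bucket, offsets, shape):
# keys follow the Zarr layout; values are inline JSON strings or [url, offset, length].
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "data/.zarray": json.dumps({
            "shape": [2, 1024, 1024],
            "chunks": [1, 1024, 1024],
            "dtype": "<f4",
            "compressor": None,
            "fill_value": None,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # each chunk key points at a byte range inside the original file
        "data/0.0.0": ["s3://my-bucket/original_file.bin", 4096, 4194304],
        "data/1.0.0": ["s3://my-bucket/original_file.bin", 4198400, 4194304],
    },
}

with open("references.json", "w") as f:
    json.dump(refs, f)
```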
    Rich Signell
    @rsignell-usgs

    Lucas, I have a slightly different take on NetCDF4.

    You said NetCDF is not cloud-optimized because it requires loading the entire dataset in order to access the header/metadata and retrieve a chunk of data, but it doesn't -- it just requires a lot of small binary requests to access the metadata, which is inefficient. The data in NetCDF4 can be written in arbitrary N-dimensional chunks, just like in Zarr.

    The only difference with NetCDF4 is that many chunks can be in a single file (or object), while in Zarr each chunk is in its own object. That was important once upon a time, but now the cloud providers allow thousands of concurrent reads to an object. So the main reason NetCDF4 doesn't perform as well as Zarr is that the metadata is not consolidated. And that's what we are addressing with the reference file system approach -- we read the metadata in advance and store it so we can read it all in one shot. Then we use the Zarr library, which can take advantage of that!
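    As a concrete illustration of Rich's point about reading the consolidated metadata in one shot and then handing off to the Zarr library, the read side looks roughly like this once a references JSON exists (the file name and S3 options here are placeholders):

```python
import fsspec
import xarray as xr

# Open the consolidated references as a virtual Zarr store; the actual chunk
# bytes are still fetched from the original NetCDF4 file(s) on S3.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",            # placeholder: pre-generated references
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)
print(ds)
```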

    Martin Durant
    @martindurant

    I think that the virtual dataset over many files is an even bigger deal. You can find the specific parts of the specific files you need without having to read the metadata of all of the files separately (which is very slow), and load them concurrently (you could already load the hdf5s in parallel using threading, but without concurrency).

    For the “extra storage” of the references, you might want to note that there are various encoding tricks that work well, the simplest would be to zstd compress the whole json: maybe gets you a factor of 10 in size, but is super fast to unpack.

    (I mention it because I’m surprised that the references file is so big - the originals must have very small chunks)
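    For scale, the zstd trick Martin mentions is only a few lines with the zstandard package (file names here are illustrative); reference JSON repeats the same URLs over and over, so it compresses well:

```python
import json
import zstandard as zstd

with open("references.json", "rb") as f:
    raw = f.read()

# Compress the whole JSON blob in one go.
compressed = zstd.ZstdCompressor(level=9).compress(raw)
with open("references.json.zst", "wb") as f:
    f.write(compressed)

# Unpacking is cheap and fast.
refs = json.loads(zstd.ZstdDecompressor().decompress(compressed))
print(f"{len(raw):,} -> {len(compressed):,} bytes")
```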
    Rich Signell
    @rsignell-usgs
    Martin, I agree that the ability to create cloud-optimized virtual datasets from multiple files is huge -- this allows us to replace much of the virtual dataset capability in THREDDS, but in a way that doesn't require additional (and often non-scalable) data services!
    Martin, do you agree with my take on NetCDF4 though? I'm curious...
    I think it's important for us to get that right, since it's so widely used.
    Martin Durant
    @martindurant
    I also like to point out that reference-maker works for multiple file types already and could easily do so for many more.
    Oh yes, you are completely correct about netcdf. For other file types, we can unlock even more, if they don’t already have remote- and range-capable libraries.
    Rich Signell
    @rsignell-usgs
    Another thing I thought of: although the NetCDF library is adding support for Zarr (e.g. reading Zarr with the NetCDF library), it won't likely be as efficient as reading NetCDF4 with the Zarr library (using our approach)! Because it will still read the metadata piecemeal. How ironic, right?
    Martin Durant
    @martindurant
    It is also true that the blocks of a netcdf would not be read concurrently in any library except via our method, I think. This is important in the limit of small chunks.
    Rich Signell
    @rsignell-usgs
    Oh, I didn't think of that. Yes, there is a Parallel NetCDF project, but that requires extra work on the part of the user and is not directly supported by Unidata. So we make it easy to read NetCDF in parallel, and with standard python libraries supported by the community!
    Martin Durant
    @martindurant
    “Concurrently” :). Parallelism with dask would already work with netCDF, thread-wise. But for small chunks, you would still pay the latency cost a lot.
    I think the point of having zarr as an official backend for netcdf is that it encourages new data to be written into zarr. We’re happy with that - an actual cloud-friendly format/library trumps our tricks.
    Rich Signell
    @rsignell-usgs
    Ah, okay, you mean the async, right?
    Martin Durant
    @martindurant
    Right. I call this “concurrency” usually. Threads also run async (i.e., command control timing is not guaranteed) without the async python keyword. The technical difference is between cooperative multitasking and preemptive multitasking.
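    To make the concurrency point concrete: an async fsspec filesystem can keep many byte-range requests in flight at once from a single thread, which is what saves you when chunks are small. A rough sketch (the bucket and keys are hypothetical):

```python
import fsspec

fs = fsspec.filesystem("s3", anon=True)

# One call puts all the byte-range requests in flight on the event loop,
# instead of paying one round-trip latency per chunk in a serial loop.
paths = [f"my-bucket/data/chunk-{i:04d}.bin" for i in range(100)]  # hypothetical keys
blobs = fs.cat(paths)  # returns {path: bytes}, fetched concurrently
print(sum(len(b) for b in blobs.values()), "bytes fetched")
```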
    Lucas Sterzinger
    @lsterzinger
    @rsignell-usgs @martindurant Thanks for your help and clarification! I've struggled to explain exactly /why/ NetCDF performs poorly in the cloud, as I wasn't super familiar with the details of the library/format. I agree with Martin that the biggest benefit (as a user) that I've seen using reference filesystem is the multi-file virtual dataset -- a process that takes forever with traditional xarray/netcdf workflows.
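    For readers who want the multi-file workflow Lucas describes in code, here is a sketch using the current kerchunk names (fsspec-reference-maker was later renamed kerchunk, the module paths differed slightly in older releases, and the bucket/file pattern below is hypothetical):

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3", anon=True)
urls = ["s3://" + p for p in fs.glob("my-bucket/model/*.nc")]  # hypothetical bucket

# Scan each NetCDF4/HDF5 file once, recording chunk byte ranges and metadata.
single_refs = []
for u in urls:
    with fs.open(u) as f:
        single_refs.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

# Merge the per-file references into one virtual dataset along the time dimension.
combined = MultiZarrToZarr(
    single_refs,
    concat_dims=["time"],
    remote_protocol="s3",
    remote_options={"anon": True},
).translate()
```

    The combined dictionary can then be passed as `fo=` to the reference filesystem, exactly as in the single-file read example above.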
    Lucas Sterzinger
    @lsterzinger

    For the “extra storage” of the references, you might want to note that there are various encoding tricks that work well, the simplest would be to zstd compress the whole json: maybe gets you a factor of 10 in size, but is super fast to unpack.

    This is being given to a group with only basic Python knowledge, so I was really focused on just showing the concept, not optimization/storage tricks