Discussion channel to work on creating metadata files that can provide Zarr-like access speeds to older-format data of many different types
Lucas, I have a slightly different take on NetCDF4.
You said NetCDF is not cloud-optimized because it requires loading the entire dataset in order to access the header/metadata and retrieve a chunk of data, but it doesn't -- it just requires many small binary requests to access the metadata, which is inefficient. The data in NetCDF4 can be written in arbitrary N-d chunks, just like in Zarr.
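A minimal sketch of what those small requests look like in practice (the bucket path and variable name here are hypothetical): opening a remote NetCDF4/HDF5 file through fsspec means the header is read via small byte-range GETs rather than by downloading the whole file.

```python
import fsspec
import h5py

# Lazily open a remote NetCDF4/HDF5 file; no full download happens.
# Each h5py metadata access below is served by small ranged requests.
with fsspec.open("s3://example-bucket/data.nc", "rb", anon=True) as f:  # hypothetical path
    h5 = h5py.File(f, "r")
    var = h5["temperature"]        # hypothetical variable name
    print(var.shape, var.chunks)   # chunk layout, read from the header
```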
The only difference with NetCDF4 is that many chunks can live in a single file (or object), while in Zarr each chunk is its own object. That mattered once upon a time, but cloud providers now allow thousands of concurrent reads to a single object. So the main reason NetCDF4 doesn't perform as well as Zarr is that the metadata is not consolidated. And that's what we are addressing with the reference filesystem approach -- we read the metadata in advance and store it so we can read it all in one shot. Then we use the Zarr library, which can take advantage of that!
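A rough sketch of that workflow using kerchunk and fsspec's reference filesystem (the file paths are hypothetical, and exact keyword arguments may vary between versions):

```python
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/data.nc"  # hypothetical path

# Scan the NetCDF4/HDF5 file once, recording the byte range of every chunk.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

with open("refs.json", "w") as out:
    json.dump(refs, out)

# Later: read all the metadata in one shot via the reference filesystem,
# then let the Zarr library fetch only the chunks it needs.
mapper = fsspec.get_mapper(
    "reference://",
    fo="refs.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```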
I think that the virtual dataset over many files is an even bigger deal. You can find the specific parts of the specific files you need without having to read the metadata of all of the files separately (which is very slow), and load them concurrently (you could already load the HDF5 files in parallel using threading, but without true concurrency).
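A sketch of building one such virtual dataset from per-file reference sets, using kerchunk's MultiZarrToZarr (the concat dimension and file names are hypothetical):

```python
from kerchunk.combine import MultiZarrToZarr

# Per-file reference JSONs produced as above; names are hypothetical.
single_file_refs = ["refs_2021-01.json", "refs_2021-02.json", "refs_2021-03.json"]

combined = MultiZarrToZarr(
    single_file_refs,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],  # hypothetical: files tiled along time
).translate()

# `combined` is a single reference set describing a virtual dataset over
# all the files; opening it reads one metadata blob instead of one header
# per file, and chunk reads can proceed concurrently.
```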
For the “extra storage” of the references, you might want to note that there are various encoding tricks that work well; the simplest would be to zstd-compress the whole JSON, which maybe gets you a factor of 10 in size but is super fast to unpack.
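A minimal sketch of that trick, assuming the `zstandard` package:

```python
import json

import zstandard as zstd

with open("refs.json", "rb") as f:
    raw = f.read()

# Compress the whole reference JSON; kerchunk references are highly
# repetitive (paths, keys), so large ratios are plausible.
compressed = zstd.ZstdCompressor(level=9).compress(raw)
with open("refs.json.zst", "wb") as out:
    out.write(compressed)

# Unpacking is fast: decompress, then parse as usual.
refs = json.loads(zstd.ZstdDecompressor().decompress(compressed))
```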
This is being given to a group with only basic Python knowledge, so I was really focused on just showing the concept, not optimization/storage tricks
Lucas, to me it's important that you don't say you need to read the whole NetCDF4 file to get the metadata
Yes, I agree. This is a piece of information I got from a conversation with Kevin Paul at NCAR -- I think either I misunderstood what he was saying (the most likely explanation) or he misunderstood exactly how cloud access happens... maybe thinking of a time when you could only request entire files from object storage.
I will make updates to the presentation before the workshop, thank you for your help!
{"MALLOC_TRIM_THRESHOLD_": "0"}
in the environment variables on your dask workers. " it feels fragile.
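For reference, a sketch of one way to apply that setting via configuration; this assumes a recent dask.distributed where the nanny supports a `distributed.nanny.pre-spawn-environ` key (older versions used `distributed.nanny.environ` instead):

```python
import dask
from dask.distributed import Client, LocalCluster

# Set MALLOC_TRIM_THRESHOLD_=0 in each worker's environment before the
# worker process spawns, so glibc returns freed memory to the OS more
# aggressively. The config key name is version-dependent (see above).
dask.config.set({"distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_": "0"})

cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```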