epifanio
@epifanio
Hi! I'm struggling with a conversion from pandas to xarray. I can't get the coords in the right datatype -- example here: https://gist.github.com/epifanio/6701df2545394a59b7105b7444d1d2d7
The pandas DataFrame correctly assigns the dtype 'datetime64[ns, UTC]' to df.index, but when converting it to an xr.Dataset it gets cast to the object dtype.
What am I doing wrong?
epifanio
@epifanio
I ended up not using xarray's from_dataframe(); it was giving me a time coordinate with object dtype instead of datetime. https://gist.github.com/epifanio/785b9573f7dee717e77d5b5c003b1041
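(A common workaround, as a minimal sketch: xarray stores times as tz-naive datetime64[ns], so a tz-aware index is cast to object dtype unless the timezone is dropped first. The DataFrame below is made up.)

import pandas as pd
import xarray as xr

# made-up frame with a UTC-aware DatetimeIndex
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="D", tz="UTC"),
)
df.index.name = "time"

# drop the timezone (the values stay in UTC) before handing it to xarray
ds = xr.Dataset.from_dataframe(df.tz_convert(None))
print(ds.time.dtype)  # datetime64[ns] instead of object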
LunarLanding
@LunarLanding
Hi. I want to use map_blocks with a function that changes the dataset dimensions. How can I create a template kwarg without allocating data?
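(One possible approach, as a minimal sketch: build the template from lazy dask arrays, which carry shape/chunks/dtype without allocating memory. The function and shapes below are made up.)

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x", "y"), da.random.random((100, 50), chunks=(10, 50)))})

def func(block):
    # made-up transformation: collapses "y" and adds a new "z" dimension
    return block.mean("y").expand_dims(z=np.arange(4))

# lazy template with the dims/shape/chunks of the expected result; no data is allocated
template = xr.Dataset(
    {"a": (("z", "x"), da.zeros((4, 100), chunks=(4, 10)))},
    coords={"z": np.arange(4)},
)

result = ds.map_blocks(func, template=template)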
Steve Nesbitt
@swnesbitt
Hi! I am working with large 4-D model outputs (from NCAR's CM1 model; the total dimensions are ~600x512x512x140) that are split among a large number of small tiled netCDF output files (governed by 512 MPI processes). The files are CF compliant and I am trying to use combine_by_coords to stitch them back together based on their coordinates, but things are painfully slow with open_mfdataset compared with a solution that just manually fills in the tiles by index. I have tried playing with chunks, but the individual files are only 1x32x16x140, so they are quite small. Has anyone had to do something like this?
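(Not a full answer, but a sketch of the kwargs people usually try first for many small tiles; the file pattern here is made up.)

import xarray as xr

ds = xr.open_mfdataset(
    "cm1_tile_*.nc",        # hypothetical tile file pattern
    combine="by_coords",
    parallel=True,          # open and decode the files in parallel via dask
    data_vars="minimal",    # only concatenate variables that actually vary
    coords="minimal",
    compat="override",      # skip equality checks on non-concatenated variables
)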
Dan Dawson
@Meteodan
Hi all, I am getting incorrect time axis labels when plotting using the xarray plotting interface. The times shown are offset by +4 hours and a few of the ticks at the start of the plot aren't shown. The data itself is plotted correctly. It's weird, because the time index isn't time zone aware, but it is almost as if the plotting is assuming that the times I provide are in EDT and then adding 4 hours to them to get to UTC. But I don't see why that should be happening. I haven't seen this behavior before. I'm using xarray 0.20.0
Benjamin Root
@WeatherGod
what timezone is your computer set to?
Dan Dawson
@Meteodan
In this case, I'm working directly on an HPC cluster, so I'm not sure, but I'm assuming EDT. I normally work on my python post-processing on my Mac Pro, and I don't recall ever seeing this behavior on that.
(I.e., it just plotted the axis labels with the naive times and didn't do any apparent conversion shenanigans.)
When I get a chance I'm going to compare my conda environments, versions, etc. with my Mac setup with what I have on the cluster. In any case, since my earlier message, I got the same behavior when using the matplotlib plotting function directly, so it's probably not an xarray-specific issue
Jody Klymak
@jklymak
You should probably check that plt.rcParams['timezone'] is "UTC".
If not, you will need to manually specify UTC for every axes.
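(Roughly what that check and the per-axes override could look like, as a sketch:)

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from zoneinfo import ZoneInfo

print(plt.rcParams["timezone"])   # should print "UTC"
plt.rcParams["timezone"] = "UTC"  # global default for date axes

# or force UTC on a single axes via its locator/formatter
fig, ax = plt.subplots()
locator = mdates.AutoDateLocator(tz=ZoneInfo("UTC"))
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator, tz=ZoneInfo("UTC")))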
Jody Klymak
@jklymak
If that isn't the problem, please open an issue on Matplotlib with a self-contained minimal example.
Dan Dawson
@Meteodan
Yes, I checked, and it is indeed "UTC". I will try to set up a minimal example and open an issue
PythonSchlumpf
@PythonSchlumpf
I am a newbie with xarray. I have HDF files, probably generated with h5py. I wrote a script in Jupyter. In a first function I open the HDF files with h5py to check whether all files contain valid data; the return value is a valid filename. In the second run I open all valid files with xr.open_mfdataset. This works exactly once. If I execute the Jupyter cell a second time, I get a low-level error from HDF that the file close degree does not match. According to somebody from HDF, a normal user should not see such an error at all. It happens when h5py wants to re-open the HDF files. I believe that the xarray library does not close the HDF files correctly, and when I want to open them again with the h5py command, I run into this problem. I use both commands with a context manager, so I am surprised I run into this problem at all. I am trying to find a workaround now, but I wanted to let you know, in case some xarray core developers are here. This is too much trouble for me to figure out alone and I need to crunch my data, hence I am looking for a workaround. But it would be nice if h5py and xarray worked smoothly together.
Kai Mühlbauer
@kmuehlbauer
@PythonSchlumpf Without seeing any code, there is no chance to figure out what might go wrong in your case.
PythonSchlumpf
@PythonSchlumpf
@kmuehlbauer Yes, I know that. But I would also have to share the bigger HDF files. I can try again to minimise the code, but it is not super short. Should I be brave and open an issue on the xarray GitHub?
The person from the HDF Group wrote to me: "I think your problem is described in the #218 (h5py/h5py#218) issue. Very likely that xarray's backend storage engine based on the netCDF library is using a different file close degree setting than h5py. You should make sure only one of h5py or xarray have the same HDF5 file at same time." And this is where I thought the context manager would take care of such a thing.
Kai Mühlbauer
@kmuehlbauer
@PythonSchlumpf Then you might just use kwarg engine="h5netcdf" in the call to xr.open_mfdataset to circumvent that issue.
PythonSchlumpf
@PythonSchlumpf
@kmuehlbauer If I do this, I get "ValueError: variable '/entry/data/data' has no dimension scale associated with axis 0.
Use phony_dims='sort' for sorted naming or phony_dims='access' for per access naming"
The path inside the HDF is /entry/data and the numpy array is called data. I did not understand the second part of the error message. My HDF files have no named dimensions.
Kai Mühlbauer
@kmuehlbauer
Xarray relies on the dimensions being declared. Pure HDF5 files do not have this. netcdf-c, and via that route also netCDF4, invents these dimensions as phony_dim_0 etc. The h5netcdf backend is able to do the same. You would need to add the kwarg phony_dims="sort" to the xr.open_mfdataset call if you want the same behaviour (naming) as netCDF4, or phony_dims="access" for faster access but possibly different naming of the dimensions.
PythonSchlumpf
@PythonSchlumpf
@kmuehlbauer Thank you. I call it now via with xr.open_mfdataset(file_list_full_path, engine="h5netcdf", phony_dims="sort", combine='nested', concat_dim='phony_dim_0', group=group_path) as xrds: ds = xrds, where group_path='/entry/data', and it works, even if I execute the Jupyter cell multiple times. Wow!
Should I add phony_dims="access" as well?
Kai Mühlbauer
@kmuehlbauer
No, phony_dims="access" and phony_dims="sort" are mutually exclusive. If your datasets are structurally identical you might just switch to phony_dims="access". But there is a good chance that phony_dim_0 might change to some other name, so you would need to check.
PythonSchlumpf
@PythonSchlumpf
I am sorry, I don't understand what 'mutually exclusive' means.
Kai Mühlbauer
@kmuehlbauer
You can't have the same kwarg twice in a function ;-)
PythonSchlumpf
@PythonSchlumpf
If I don't pass phony_dims="sort" I run into the ValueError: variable '/entry/data/data' has no dimension scale associated with axis 0. Use phony_dims='sort' for sorted naming or phony_dims='access' for per access naming again. Thank you for the explanation of the meaning of 'mutually exclusive'.
But I do have concat_dim='phony_dim_0' in there, though true, neither phony_dims="access" nor phony_dims="sort". Ah, sorry, now I see that the keyword argument can take either of two values.
But it sounds like I am safer with phony_dims="sort".
Kai Mühlbauer
@kmuehlbauer

But it sounds like I am safer with phony_dims="sort".

Yes, but if you have many datasets (and/or groups) in your HDF5 file this might slow things down a bit, since h5netcdf has to iterate over all groups/datasets. Then you could try phony_dims="access". In the best case it just works; if not, we can try to fix it.

PythonSchlumpf
@PythonSchlumpf
In a real scan I have 1000 HDF files that I want to read out. They would form one data variable (I hope this is the right term in the xarray universe), basically a stack of 2D pictures. I have other HDF5 files that should become another data variable in the same dataset; these contain the motor positions where each picture was taken. And there will be a third set of information that should form a third data variable inside the dataset.
All 3 information sets are in different HDF files.
I have not yet developed a good strategy.
Could I also use the opportunity to ask how preprocess=None works in open_mfdataset? When would you use this option, please?
Kai Mühlbauer
@kmuehlbauer

I'd read the different file types separately and combine them afterwards. You would need to think about the dimensions. A good starting point is to just open one of each file type and see what dimensions are there, what sizes they have and what you would actually name them.

preprocess=myfunc is another thing, which comes into play if you want to e.g. rename dimensions. myfunc is in this case a function which consumes and returns an xarray.Dataset.

def myfunc(ds):
    # rename the phony dimension so it can be used as the concat dimension
    ds = ds.rename_dims({"phony_dim_0": "time"})
    return ds
This would rename the dimension, and then you can use concat_dim="time". "time" is just an example.
PythonSchlumpf
@PythonSchlumpf
"I'd read the different file types separately and combine them afterwards. " - You mean per file set, a different xr.open_mfdatasetfor example?
Thank you for the example for preprocess. Can this be also a more complex function, for example, one that opens the hdf files to check if they contain valid data before using xr.open_mfdataset? I do this with the h5py library.
I renamed the dimensions in my dataset after creating the xarray.
Kai Mühlbauer
@kmuehlbauer

"I'd read the different file types separately and combine them afterwards. " - You mean per file set, a different xr.open_mfdatasetfor example?

Exactly.

PythonSchlumpf
@PythonSchlumpf
I clearly lack experience with xarray, but I don't want to use mpi4py as it seems too complicated to handle. I was hoping to parallelise the processing with xarray later.
krisaoe
@krisaoe
Hi everyone.

If I split a dataset into individual timestep netCDFs like this:

for timestep in mds[time_dim_name].values:
    timestring = str(timestep).split(".")[0]
    filename = timestring.replace("-", "").replace(":", "")
    path = final_dir / f"{filename}.nc"
    mds.sel(time=timestep).to_netcdf(
        path=path, mode="w", engine="netcdf4", encoding=encoding
    )

Each resulting dataset has been "squeezed", so there is no time dimension anymore:

<xarray.Dataset>
Dimensions:  (depth: 1, lat: 773, lon: 763)
Coordinates:
  * depth    (depth) float32 1.0
  * lon      (lon) float32 9.041586 9.069364 9.097141 ... 30.180882 30.20866
    time     datetime64[ns] ...
  * lat      (lat) float32 53.024963 53.04163 53.058296 ... 65.87433 65.89099
Data variables:
    vo       (depth, lat, lon) float32 ...
    uo       (depth, lat, lon) float32 ...

Is there a way to retain the time dimension when creating datasets with only one time value?

krisaoe
@krisaoe
Or is there a better way to split datasets by time?
Joe Hamman
@jhamman
@krisaoe - you might check out the documentation/examples for the xr.save_mfdataset function.
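(For context, the documented pattern is roughly the following; the toy dataset and the grouping by year are made up.)

import numpy as np
import pandas as pd
import xarray as xr

# toy dataset spanning two years
ds = xr.Dataset(
    {"t2m": ("time", np.random.rand(730))},
    coords={"time": pd.date_range("2020-01-01", periods=730, freq="D")},
)

# split by group and write all resulting files in a single call
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)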
krisaoe
@krisaoe
@jhamman thanks! The groupby function in the xr.save_mfdataset example is exactly what I needed.
krisaoe
@krisaoe
For anyone interested, I could create one netcdf per timestep, while retaining the time dimension and setting the encoding, using this loop:
for timestep, dataset in mds.groupby("time", squeeze=False, restore_coord_dims=True):
    timestring = str(timestep).split(".")[0]
    filename = timestring.replace("-", "").replace(":", "")
    path = final_dir / f"{filename}.nc"
    dataset.to_netcdf(
        path=path, mode="w", engine="netcdf4", encoding=encoding
    )
Kai Mühlbauer
@kmuehlbauer
@krisaoe You might use a slice with .isel. That way you retain the time-dimension.
for i, timestep in enumerate(mds[time_dim_name].values):
    timestring = str(timestep).split(".")[0]
    filename = timestring.replace("-", "").replace(":", "")
    path = final_dir / f"{filename}.nc"
    mds.isel(time=slice(i, i+1).to_netcdf(
                path=path, mode="w", engine="netcdf4", encoding=encoding
            )
krisaoe
@krisaoe
@kmuehlbauer interesting! Thanks. There was one parenthesis missing after the slice, mds.isel(time=slice(i, i+1)), but this stops the depth dimension from being squeezed out, which wasn't the case with my previous solution.
When I first looked at your slice I thought it would throw a list index out of range exception. Does anyone know why it doesn't?
krisaoe
@krisaoe
Ah, of course, because that's how slices work in Python.
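(For example, with a plain list: out-of-range slices are clamped instead of raising.)

a = [1, 2, 3]
a[2:3]  # [3]
a[5:6]  # [] -- slicing past the end returns an empty list, no IndexError
a[5]    # IndexError: list index out of range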
Kai Mühlbauer
@kmuehlbauer
@krisaoe Yeah, sorry, it is missing the closing parenthesis of the isel.