Yes, this binder works fine if launched using the "launch binder" button on that page. It's exciting to watch the Dask cluster! :-) (I'm not sure it matters, but it only works if you launch it from that button; if you put the URL of the GitHub repo into binder.pangeo.io, it doesn't work.) So, getting back to what I was doing: I am using this repository: https://github.com/pangeo-data/pangeo-example-notebooks. I tried launching from the "launch binder" button in the README, and also by taking the repo URL (https://github.com/pangeo-data/pangeo-example-notebooks.git), putting it into the GIT URL field on binder.pangeo.io, and clicking "launch". Regardless of which notebook I open in the binder, when I get to the gateway.new_cluster() call, it always fails with that same basic error.
Hi all! I've been mulling over an idea. @jbusecke's cookiecutter, https://github.com/jbusecke/cookiecutter-science-project, got me thinking about how we often try to tell authors early in their projects about practices that could help them later, e.g. at publication, when they have to share their research outcomes (data, software, notebooks, etc.). A cookiecutter seemed like a nice forkable way to communicate our guidance, and a flexible solution compared to some of the admin tools we have to use, such as the DMP Tool. What if you could generate statements, etc. more easily from a YAML file? There is also a nice project coming out of Australia called RAiD, an identifier for projects and a wrapper for all the other identifiers (it will ultimately be available via DataCite). @cgentemann also mentioned an example, https://ncsu-libraries.github.io/jekyll-academic-docs/, where a cookiecutter is as easy to use as the Jekyll Academic instructions. This is my first post here and an idea inspired by @jbusecke, but I'm wondering if anyone is interested and would like to brainstorm further?
I am teaching master's students, PhD students, postdocs and researchers about Pangeo. What is the recommended way (or package) to transform vertical coordinates? We are using Pangeo CMIP6 atmosphere data with "atmosphere_hybrid_sigma_pressure_coordinate" (i.e. "alevel") and would like to compare models, for instance by transforming them onto the same pressure levels. Can xgcm do that?
It can definitely do it. But we could use more examples for the documentation.
Thanks. I will try to make new examples.
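To make the transformation concrete, here is a minimal numpy sketch of the step that xgcm's `Grid.transform` automates for this use case: compute the pressure on hybrid sigma-pressure model levels and interpolate a variable onto common target pressure levels. All coefficient and column values below are made-up illustrative numbers, not taken from any real CMIP6 model.

```python
import numpy as np

# Hybrid sigma-pressure coefficients for a 3-level column
# (illustrative values only).
a = np.array([0.0, 5000.0, 20000.0])   # "a" term (already multiplied by p0), Pa
b = np.array([1.0, 0.7, 0.2])          # "b" coefficient, dimensionless
ps = 100000.0                          # surface pressure, Pa

# Hybrid sigma-pressure formula: p(k) = a(k) + b(k) * ps
p = a + b * ps                         # pressure on model levels, Pa

temperature = np.array([288.0, 270.0, 230.0])  # variable on model levels, K

# Interpolate onto shared target pressure levels so models can be compared.
# np.interp requires increasing x values, so reverse the surface-to-top profile.
targets = np.array([50000.0, 85000.0])
on_pressure = np.interp(targets, p[::-1], temperature[::-1])
print(on_pressure)
```

With xgcm you would instead build a `Grid` over the vertical axis and call `grid.transform(da, 'Z', targets, target_data=pressure)`, which does this interpolation lazily over the whole dataset.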
I just started using pangeo.io yesterday and found it interesting to use. I launched the binder, duplicated one of the examples, and renamed it. Later, after restarting my system, I could not find my code again, so I had to start afresh. What am I not doing right? I wanted to sign up, but did not see where or how to, if that would save my work. Please kindly direct me accordingly.
Hi everyone - I'm starting to use some of the pangeo tools and technologies in a workflow involving STAC, COGs and Dask. I'm trying to use xbatcher to extract batches of patches from my xarrays for machine learning, but I'm struggling with a few things. The most important issue for me is that I'm struggling to put the patches back together again at the end - the co-ordinates from the original DataArray don't seem to have been preserved. I've created an issue on the xbatcher Github (pangeo-data/xbatcher#37) and it'd be great if anyone has any ideas.
Hi Robin, thanks for your feedback! Xbatcher is very new and untested in many use cases. Your input is really valuable. Someone will get back to you on the issue you opened.
Yeah, thanks @robintw for the ping. I’ll try to respond today.
That's wonderful, thank you so much @rabernat and @jhamman
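For anyone following the xbatcher thread above, the core of the reassembly problem is that each patch needs to carry enough positional information to be put back in place. Here is a hedged plain-numpy sketch of that idea (this is not xbatcher's API, just an illustration): keep each patch's origin offset alongside the patch, then use the offsets to restore the original array.

```python
import numpy as np

def extract_patches(arr, size):
    """Split a 2-D array into non-overlapping size x size patches,
    recording each patch's (row, col) origin so it can be put back."""
    patches = []
    for i in range(0, arr.shape[0], size):
        for j in range(0, arr.shape[1], size):
            patches.append(((i, j), arr[i:i + size, j:j + size]))
    return patches

def reassemble(patches, shape):
    """Rebuild the original array by writing each patch at its offset."""
    out = np.empty(shape)
    for (i, j), patch in patches:
        out[i:i + patch.shape[0], j:j + patch.shape[1]] = patch
    return out

original = np.arange(16.0).reshape(4, 4)
patches = extract_patches(original, 2)
restored = reassemble(patches, original.shape)
assert np.array_equal(restored, original)
```

In the xarray/xbatcher setting the analogous bookkeeping would be the coordinate labels of each patch rather than integer offsets, which is exactly what the issue reports as getting lost.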
Do we have a location/issue/PR where bringing analytic coordinates to xarray is discussed?
Hi all, I'm trying to deploy pangeo on a k8s cluster and was wondering where is the appropriate place to post a technical question. Is there a pangeo chat room for topics such as deployment-related questions?
Hi all! Not sure if this is the right channel to raise issues related to the Pangeo binder, but I've experienced some recent issues launching it. Following pangeo-data/pangeo-binder#192, I'm testing a short-term solution which suggests setting a Pangeo binder template, but it still returns Failed to create temporary user for gcr.io/pangeo-181919/prod-acocac-2dpangeo-2dbinder-2dtemplate-0d9a05:c3201d8e480aba5d9ebad622a251fb4660ba1021.
Hi folks, I'm trying to get soil moisture values from Sentinel-1 backscatter values with Python's Keras and TensorFlow libraries. So far I've tried linear regression, but the CNN I train doesn't take into account that the values I feed in for training are related to their previous and next values, as real-world soil moisture time series are. Now I'm trying to find resources to learn time series regression(?), but I'm confused by some of the terminology, maybe because my mother language isn't English. When I search, for example, for "time series regression" in combination with MLP, CNN, or other ML keywords, I always end up at the term "forecasting", which I understand as predicting future values. But I don't want to predict the future: I want to train a model linking Sentinel-1 backscatter values to real-world soil moisture, and after training, get soil moisture time series from Sentinel-1 backscatter time series. Can you give me some keywords I can google to research the right topic? Thanks a lot and cheers!
Hi all, I have been looking a lot into an infrastructure where AWS Lambda could connect to a Dask cluster. For example, a Python Lambda function could read Zarr stored in S3 into an Xarray dataset object and return the JSON to API Gateway. What would be the best practice for this using Dask? Could you create a cluster using Fargate, Kubernetes, or Parallel Cluster, and somehow have the distributed client within the Lambda Python script connect to it? Could Lambda host a LocalCluster? I am not fully sure what is possible and would love some insights. I came across this article from several years ago which seems to describe something similar: https://medium.com/informatics-lab/exploring-dask-and-distributed-on-aws-lambda-55d81d9641d. Thanks again, any feedback is greatly appreciated!!
But from what you described, I'm not sure why you need Dask here. Is there any computation involved?
The main benefit of Dask here would be asynchronous computation of the data chunks coming from Zarr, just to allow for faster computation when invoking xarray.Dataset.to_dict()
The overall thought is an API that uses API Gateway to call a lambda function. Based on user provided query parameters, Xarray loads the data from Zarr, does the trimming and slicing, and returns the data response to API Gateway.
I've always seen Dask and serverless as two different distributed computing paradigms
As always, thanks for the help. I'm still new to Lambda and Fargate, so I'm still trying to work out the details
With serverless, you invoke a lambda function many times to achieve parallelism. This is good for embarrassingly parallel operations.
With dask, you have a cluster that schedules jobs in parallel. It's a more sophisticated form of parallelism.
I'd be interested to learn about how they can be combined
I think lambda would be a great candidate for a subsetting service. Like if you have a giant (many TB) Zarr on S3, and you want to produce a netCDF subset, lambda would be great for that
Exactly! that is exactly what we are looking at prototyping here
In that scenario, you wouldn't want to use a dask cluster inside the lambda function but rather just the simple (default) multithreaded scheduler
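To sketch what that last suggestion might look like, here is a minimal, hedged example of a Lambda-style handler that subsets a dataset and returns JSON via to_dict(). In a real deployment the dataset would come from Zarr on S3 (e.g. xr.open_zarr on a bucket path; the bucket name would be your own); here a tiny in-memory dataset stands in so the sketch is self-contained, and the query parameter names are hypothetical.

```python
import json
import numpy as np
import xarray as xr

def handler(event, context=None):
    # Stand-in for xr.open_zarr("s3://your-bucket/data.zarr"):
    # a small in-memory dataset with the same shape of workflow.
    ds = xr.Dataset(
        {"t2m": (("time", "lat"), np.arange(12.0).reshape(3, 4))},
        coords={"time": [0, 1, 2], "lat": [10.0, 20.0, 30.0, 40.0]},
    )
    # Trim/slice based on user-provided query parameters.
    params = event.get("queryStringParameters", {})
    lat = float(params.get("lat", 10.0))
    subset = ds.sel(lat=lat)
    # For a simple subset like this, xarray's default (threaded) scheduler
    # is enough; no Dask cluster is needed inside the function.
    return {"statusCode": 200, "body": json.dumps(subset.to_dict())}

response = handler({"queryStringParameters": {"lat": 20.0}})
print(response["statusCode"])
```

The same pattern scales to the "giant Zarr on S3" subsetting case: each Lambda invocation handles one independent subset request, which is the embarrassingly parallel shape serverless is good at.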