    Lucas Sterzinger
    @lsterzinger
    I'm sure time will beat this out of me, but I don't mind writing documentation/tutorials/examples (obviously, judging by my contributions to the reference maker thus far)
    The workshop went well; I managed to crash Pangeo Binder for about 10-15 minutes, so I guess it's time to scratch another notch into the side of my computer
    Chelle Gentemann
    @cgentemann
    Glad to hear it went well! Rich, it is totally working. I feel like a real hacker now! ;) I'm free the rest of the day. When would be good? I'd like to go over it with you a little bit before scaling up.
    Rich Signell
    @rsignell-usgs
    Martin, I told Chelle I knew how to get the NASA credentials to Dask workers, but now that I look at it, I'm not sure I do. Can you look at cell [3] here and recommend a path?
    https://github.com/cgentemann/cloud_science/blob/master/make_zarr/cloud_mur_v41.ipynb
    I thought we could just copy the .netrc to workers using a dask.distributed WorkerPlugin, but I'm not sure that would do it (or is even what we would want to do)
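A minimal sketch of the WorkerPlugin idea Rich floats here (Martin notes just below that it may not be needed); the plugin class name is made up:

```python
from pathlib import Path
from dask.distributed import Client, WorkerPlugin

class NetrcPlugin(WorkerPlugin):
    """Copy the local ~/.netrc to every worker so the HTTP stack there
    can find the NASA Earthdata credentials."""

    def __init__(self):
        # read on the client; the instance (with this attribute) is
        # serialized and shipped to each worker
        self.contents = (Path.home() / ".netrc").read_text()

    def setup(self, worker):
        # runs once on each worker when it joins the cluster
        target = Path.home() / ".netrc"
        target.write_text(self.contents)
        target.chmod(0o600)  # netrc must be private or it is ignored

client = Client()
client.register_worker_plugin(NetrcPlugin())
```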
    Lucas Sterzinger
    @lsterzinger
    Can you pass environment variables to the workers?
    Rich Signell
    @rsignell-usgs
    I'm guessing we might need to run the begin_s3_direct_access script on all the workers via a WorkerPlugin.
    Martin Durant
    @martindurant
    Bahh, I just had a conversation exactly on this sort of thing, for GCP. Looking...
    Since you are using an s3fs instance with explicit token values, it should go to the workers just fine, without having to reestablish credentials
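In other words, something like the sketch below, assuming `creds` holds the temporary keys returned by the begin_s3_direct_access helper from the notebook (the field names are illustrative):

```python
import s3fs

fs = s3fs.S3FileSystem(
    key=creds["accessKeyId"],        # field names illustrative
    secret=creds["secretAccessKey"],
    token=creds["sessionToken"],
)
# The key/secret/token are stored on the instance, so `fs` pickles
# cleanly and every Dask worker reconstructs an identically
# credentialed filesystem -- no WorkerPlugin required.
```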
    Rich Signell
    @rsignell-usgs
    And of course I can use fsspec instead of s3fs
    right?
    Martin Durant
    @martindurant
    fsspec.filesystem("s3") is identical to s3fs.S3FileSystem(), but perhaps better aesthetically
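That equivalence is easy to check; the credential values here are placeholders:

```python
import fsspec
import s3fs

fs_a = fsspec.filesystem("s3", key="KEY", secret="SECRET", token="TOKEN")
fs_b = s3fs.S3FileSystem(key="KEY", secret="SECRET", token="TOKEN")

# fsspec just dispatches on the protocol name, so both are the same class
assert isinstance(fs_a, s3fs.S3FileSystem)
```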
    Rich Signell
    @rsignell-usgs
    It worked! The dask workers could access the data and the jsons were produced!
    How do we then pass the credentials to the reference file system so they can extract the byte ranges?
    Martin Durant
    @martindurant
    it’s called remote_options
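Concretely, that looks something like the following when opening the references with xarray; the JSON path and credential values are placeholders:

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="s3://esip-bucket/mur_v41.json",  # reference JSON (illustrative path)
    target_options={"anon": False},      # options for reading the JSON itself
    remote_protocol="s3",
    remote_options={                     # credentials used for the byte-range reads
        "key": "KEY", "secret": "SECRET", "token": "TOKEN",
    },
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr",
    backend_kwargs={"consolidated": False},
)
```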
    Rich Signell
    @rsignell-usgs
    MARTIN!!!! you rock!
    Martin Durant
    @martindurant
    So what is this data? :)
    Chelle Gentemann
    @cgentemann
    Only my favorite data! It is a 19-year global 1 km dataset of SST (sea surface temperature). The previous version was about 16 TB; I think this version is a little bigger. It is the most popular dataset that the NASA ocean archive has.
    I'm working on a tutorial notebook that compares access speeds for us.
    Also, MARTIN you rock! Thank you! Rich and I were just saying how much we love working with you and Lucas.
    I need a demo from a NASA dataset to talk to the DAACs. This is a great example.
    Martin Durant
    @martindurant

    Well thank you! It’s been fun!
    So you could have increased the for-loop ranges in the notebook and come up with something massive?

    By the way, I would always recommend adding simple_templates=True to the filesystem call (faster init; it should be the default) and maybe also setting chunks= in open_dataset. For the latter, the best values to choose depend on your analysis, but it can make a big difference to load times if the original chunks are small. That's why it's nice to eventually write Intake specs, to hide this kind of detail.
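Putting both suggestions together, roughly (path, credentials, and chunk sizes are placeholders):

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="s3://esip-bucket/mur_v41.json",  # illustrative
    remote_protocol="s3",
    remote_options={"key": "KEY", "secret": "SECRET", "token": "TOKEN"},
    simple_templates=True,               # faster init, per Martin's advice
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr",
    chunks={"time": 24},                 # best value depends on the analysis
    backend_kwargs={"consolidated": False},
)
```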

    Chelle Gentemann
    @cgentemann
    Yes, the earlier notebook was just our mini-test. Now I've scaled it up to 7064 files, but am running into a problem at the very end...
    Are you on qhub?
    Chelle Gentemann
    @cgentemann

    Two tests: one with 3 yr of data (~900 files) runs fine, but when I scale to 7000 files... no errors, but it isn't running okay...
    900 files: https://jupyter.qhub.esipfed.org/user/cgentemann/doc/tree/shared/users/cgentemann/notebooks/cloud_mur_v41_3yr.ipynb
    7000 files: https://jupyter.qhub.esipfed.org/user/cgentemann/doc/tree/shared/users/cgentemann/notebooks/cloud_mur_v41-all.ipynb

    Ideas on what might be wrong?

    Martin Durant
    @martindurant
    I don't have access to that. You might well need to do a "tree reduction", where you amalgamate batches of input files and then amalgamate those batch outputs in a separate step. We had to do this for the NWM case, so both @rsignell-usgs and @lsterzinger know how.
    Lucas Sterzinger
    @lsterzinger

    @cgentemann here's an example of how I did what Martin mentioned with the NWM stuff. Hopefully there are enough comments/text to explain how it works, but let me know if you need help with anything

    https://nbviewer.jupyter.org/gist/lsterzinger/8a93fc1780495aa84694f6d4b1a3708e
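In outline, the two-level combine looks something like this with the current kerchunk API (the package formerly known as fsspec-reference-maker); `jsons`, the batch size, and the combine options are illustrative:

```python
from kerchunk.combine import MultiZarrToZarr

def combine(refs):
    # combine a batch of reference sets along time, as in the NWM example
    mzz = MultiZarrToZarr(
        refs,
        remote_protocol="s3",
        remote_options={"anon": False},
        concat_dims=["time"],
    )
    return mzz.translate()

# first level: combine the per-file reference JSONs in batches
batch_size = 100
batches = [jsons[i:i + batch_size] for i in range(0, len(jsons), batch_size)]
partials = [combine(b) for b in batches]   # embarrassingly parallel

# second level: combine the combined batches into one reference set
final = combine(partials)
```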

    Martin Durant
    @martindurant
    For which of our favourite datasets so far do we have reference files in public locations and/or intake stubs?
    Lucas Sterzinger
    @lsterzinger
    Rich has intake/reference files for the NWM. I don't have anything uploaded since I don't have a storage account that I'm not personally paying for
    I had totally forgotten we used Dask bag for the make_consolidated step!
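For reference, the dask.bag pattern for that step is roughly the following; `flist` and `gen_json` are stand-ins for the actual file list and per-file reference builder from the gist above:

```python
import dask.bag as db

# map the (hypothetical) per-file reference builder over all inputs;
# each partition runs on a Dask worker
bag = db.from_sequence(flist, npartitions=64).map(gen_json)
per_file_refs = bag.compute()
```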
    Rich Signell
    @rsignell-usgs
    Lucas, you should be working on the ESIP qhub also!
    Free compute and S3 storage, courtesy of ESIP (via AWS credits)
    Martin Durant
    @martindurant
    (I can’t get into your qhub)
    Maybe I should open some issues on the repo:
    • intro and docs
    • list of builder notebooks
    • list of the produced artefacts (json and yaml)
    Lucas Sterzinger
    @lsterzinger

    > Lucas, you should be working on the ESIP qhub also!
    > Free compute and S3 storage, courtesy of ESIP (via AWS credits)

    Wasn't sure if I was allowed to keep using this after I was done formally working for you. If I were to make a publicly accessible reference/intake catalog of GOES, would it be okay to throw it up on the ESIP S3?

    Rich Signell
    @rsignell-usgs
    Martin, I added you to the ESIP qhub https://jupyter.qhub.esipfed.org
    Lucas, you already have access
    There is a "welcome.ipynb" in the /shared folder to give you some tips
    Lucas, UC Davis is part of ESIP, right?
    If not, you should have them join!
    I know Anaconda is a member of ESIP
    Martin Durant
    @martindurant
    I can get into qhub in general now, but can't see your notebook directly. We should make all the notebooks public, though (even if they reference profiles and buckets that aren't open).
    Lucas Sterzinger
    @lsterzinger

    > Lucas, UC Davis is part of ESIP, right?

    I have no idea! How can I find out?

    Martin Durant
    @martindurant
    qhub has internal chat??
    Rich Signell
    @rsignell-usgs
    Yup!
    Kinda wonky, though
    Looks like Davis is not on there. You could join on their behalf. You don't have to pay, just apply!
    Rich Signell
    @rsignell-usgs
    Hey Chelle, Farallon is not a member either!
    Dag nurb it!
    Lucas Sterzinger
    @lsterzinger
    Shame!
    Martin Durant
    @martindurant
    We do have the NWM JSON in a public place though, right?