    Rich Signell
    @rsignell-usgs
    go Chelle go!
    Chelle Gentemann
    @cgentemann
    i'm so at the very edge of my understanding of file systems. someday i need martin to do a vulcan mind meld with me so it all is clear.
    Martin Durant
    @martindurant
    Actually, disseminating this information is supposed to be part of my job. Along with most of my pure-code brethren, I’m not particularly good at writing and structuring documentation in a way that people can find it. Actually, that tends to be true of academic researchers too.
    Rich Signell
    @rsignell-usgs
    Perhaps you, Lucas, and I should all meet with Martin, and we could try giving our best explanation of what we think we know, and then Martin can upgrade our understanding
    And then we would be in a better position to help with the docs
    and get our t-shirts
    Chelle Gentemann
    @cgentemann
    i think it would be like one of those funny youtube videos where adults ask children to explain concepts and their idea of what it is is so far from reality it ends up being funny. ;)
    Rich Signell
    @rsignell-usgs
    Yeah, exactly
    Martin Durant
    @martindurant
    Or, that the child can explain it far simpler than the adult.
    Rich Signell
    @rsignell-usgs
    When my daughter was 11 she suggested we should make her backpack lighter by filling it with "that air they have on the moon"
    Martin Durant
    @martindurant
    The README at /tds-mur-test is pretty unenlightening
    Chelle Gentemann
    @cgentemann
    last night austin asked if we could fill a balloon with something lighter than helium and make our prius into a flying car.
    Martin Durant
    @martindurant
    Yes you could! Might be a big balloon.
    Rich Signell
    @rsignell-usgs
    So Martin, yes, we need your help in understanding, in other words
    Martin Durant
    @martindurant
    got it
    Chelle Gentemann
    @cgentemann
    i'm just gonna create a channel where my kids can ask martin questions instead of me.
    Rich Signell
    @rsignell-usgs
    Also reading that data seems to require an AWS profile I don't have
    Chelle Gentemann
    @cgentemann
    i've got that all worked out
    i've made jsons for 30 files, working on putting them together before i run on a bigger set
    Rich Signell
    @rsignell-usgs
    cool. Chelle, when you get it, let's do a screenshare
    Chelle Gentemann
    @cgentemann
    kk
    Rich Signell
    @rsignell-usgs
    Probably goes without saying, but if you have a lot of files to create individual jsons for, a bigger dask cluster helps.
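    Roughly this pattern, for the record (a sketch only; the url list, storage options, and output location are placeholders, and SingleHdf5ToZarr is the reference-maker/kerchunk single-file scanner):

    ```python
    # sketch: map single-file reference generation over a dask cluster
    # (urls, storage options, and output paths are placeholders)
    import json
    import fsspec
    from dask.distributed import Client
    from kerchunk.hdf import SingleHdf5ToZarr   # "reference maker"

    def gen_json(url):
        so = dict(mode="rb", anon=False, default_fill_cache=False,
                  default_cache_type="first")
        with fsspec.open(url, **so) as f:
            refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()
        out = url.split("/")[-1] + ".json"
        with open(out, "w") as outf:
            json.dump(refs, outf)
        return out

    client = Client()                      # or the existing cluster's client
    urls = ["s3://bucket/granule-0.nc"]    # placeholder list of netCDF urls
    json_files = client.gather(client.map(gen_json, urls))   # one task per file
    ```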
    Lucas Sterzinger
    @lsterzinger
    I'm sure time will beat this out of me, but I don't mind writing documentation/tutorials/examples (obviously, by my contributions to reference maker thus far)
    The workshop went well, managed to crash pangeo binder for about 10-15 minutes so I guess it's time to scratch another notch into the side of my computer
    Chelle Gentemann
    @cgentemann
    glad to hear it went well! Rich - it is totally working. I feel like a real hacker now! ;) i'm free the rest of the day - when would be good - i'd like to go over it with you a little bit before scaling up.
    Rich Signell
    @rsignell-usgs
    Martin, I told Chelle I knew how to get the NASA credentials to Dask workers, but now that I look at it, I'm not sure I do. Can you look at cell [3] here and recommend a path?
    https://github.com/cgentemann/cloud_science/blob/master/make_zarr/cloud_mur_v41.ipynb
    I thought we could just copy the .netrc to workers using a dask.distributed WorkerPlugin, but I'm not sure that would do it (or is even what we would want to do)
    Lucas Sterzinger
    @lsterzinger
    Can you pass environment variables to the workers?
    Rich Signell
    @rsignell-usgs
    I'm guessing we might need to run the begin_s3_direct_access script on all the workers via a WorkerPlugin.
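    Something like this, maybe (untested sketch; begin_s3_direct_access is the helper from cell [3] of the notebook, and the credential field names are guesses):

    ```python
    # sketch: a WorkerPlugin that exports temporary NASA S3 credentials as
    # environment variables on every worker (field names are assumptions)
    import os
    from dask.distributed import WorkerPlugin

    class S3CredsPlugin(WorkerPlugin):
        def __init__(self, creds):
            self.creds = creds                  # fetched once on the client

        def setup(self, worker):
            os.environ["AWS_ACCESS_KEY_ID"] = self.creds["accessKeyId"]
            os.environ["AWS_SECRET_ACCESS_KEY"] = self.creds["secretAccessKey"]
            os.environ["AWS_SESSION_TOKEN"] = self.creds["sessionToken"]

    creds = begin_s3_direct_access()            # assumed helper from the notebook
    client.register_worker_plugin(S3CredsPlugin(creds))
    ```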
    Martin Durant
    @martindurant
    Bahh, I just had a conversation exactly on this sort of thing, for GCP. Looking...
    Since you are using an s3fs instance with explicit token values, it should go to the workers just fine, without having to reestablish credentials
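    i.e. roughly this (sketch; the credential fields are whatever your helper returns):

    ```python
    # sketch: a filesystem built with explicit credentials serializes to the
    # workers carrying those values, so no re-authentication is needed there
    import s3fs

    fs = s3fs.S3FileSystem(
        key=creds["accessKeyId"],          # placeholder field names
        secret=creds["secretAccessKey"],
        token=creds["sessionToken"],
    )
    # tasks that close over `fs` (or files opened from it) just work
    ```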
    Rich Signell
    @rsignell-usgs
    And of course I can use fsspec instead of s3fs
    right?
    Martin Durant
    @martindurant
    fsspec.filesystem("s3") is identical to s3fs.S3FileSystem(), but perhaps better aesthetically
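    e.g. (same kwargs pass straight through; credential fields are placeholders as above):

    ```python
    import fsspec

    # identical to s3fs.S3FileSystem(key=..., secret=..., token=...)
    fs = fsspec.filesystem("s3",
                           key=creds["accessKeyId"],
                           secret=creds["secretAccessKey"],
                           token=creds["sessionToken"])
    ```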
    Rich Signell
    @rsignell-usgs
    It worked! The dask workers could access the data and the jsons were produced!
    How do we then pass the credentials to the reference file system so they can extract the byte ranges?
    Martin Durant
    @martindurant
    it’s called remote_options
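    i.e. something like this (sketch; the combined json name and credential fields are placeholders):

    ```python
    # sketch: open the reference filesystem, passing the same credentials via
    # remote_options so the byte-range reads against S3 are authorized
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo="combined.json",                     # placeholder reference file
        remote_protocol="s3",
        remote_options=dict(key=creds["accessKeyId"],      # placeholder fields
                            secret=creds["secretAccessKey"],
                            token=creds["sessionToken"]),
    )
    ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                         backend_kwargs=dict(consolidated=False))
    ```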
    Rich Signell
    @rsignell-usgs
    MARTIN!!!! you rock!
    Martin Durant
    @martindurant
    So what is this data? :)
    Chelle Gentemann
    @cgentemann
    only my favorite data. it is a 19 year global 1 km dataset of SST - the previous version was about 16 TB, I think this version is a little bit bigger. it is the most popular dataset that the nasa ocean archive has.
    i'm working on a tutorial notebook that compares access speeds for us.
    also, MARTIN you rock! thank you! Rich and I were just saying how much we love working with you and lucas.
    i need a demo from a nasa dataset to talk to the daacs. this is a great example.
    Martin Durant
    @martindurant

    Well thank you! It’s been fun!
    So you could have increased the for-loop ranges in the notebook and come up with something massive?

    By the way, I would always recommend adding simple_templates=True to the filesystem call (faster init, should be the default) and maybe also setting chunks= in open_dataset. For the latter, the best values depend on your analysis, but it can make a big difference to load times if the original chunks are small. That's why it’s nice to eventually write Intake specs, to hide this kind of detail.
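    For example (sketch; the chunk sizes are only illustrative, pick them to suit the analysis):

    ```python
    # sketch: the same open call with both suggestions applied
    fs = fsspec.filesystem(
        "reference",
        fo="combined.json",          # placeholder reference file
        remote_protocol="s3",
        remote_options=remote_options,
        simple_templates=True,       # faster init
    )
    ds = xr.open_dataset(
        fs.get_mapper(""), engine="zarr",
        backend_kwargs=dict(consolidated=False),
        chunks={"time": 24},         # illustrative only
    )
    ```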

    Chelle Gentemann
    @cgentemann
    yes, the earlier notebook was just our mini-test. now i've scaled it up to 7064 files, but am running into a problem at the very end.....
    are you on qhub?
    two tests: one with 3 yr of data (~900 files) runs fine. but then when I scale to 7000 files... no errors, but it isn't running okay...
    900 files: https://jupyter.qhub.esipfed.org/user/cgentemann/doc/tree/shared/users/cgentemann/notebooks/cloud_mur_v41_3yr.ipynb
    7000 files: https://jupyter.qhub.esipfed.org/user/cgentemann/doc/tree/shared/users/cgentemann/notebooks/cloud_mur_v41-all.ipynb

    ideas on what might be wrong?

    Martin Durant
    @martindurant
    I don’t have access to that. You might well need to do a “tree reduction” where you amalgamate batches of input files, and then amalgamate those batches in a separate step. We had to do this for the NWM case, so both @rsignell-usgs and @lsterzinger know how.
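    The shape of it is roughly this (sketch; batch size, concat dim, and remote_options are placeholders, and MultiZarrToZarr is the reference-maker/kerchunk combiner):

    ```python
    # sketch of a two-level "tree reduction" over the single-file jsons
    from kerchunk.combine import MultiZarrToZarr

    def combine(refs_list):
        mzz = MultiZarrToZarr(refs_list, concat_dims=["time"],
                              remote_protocol="s3",
                              remote_options=remote_options)
        return mzz.translate()

    batch = 100
    batches = [json_files[i:i + batch]
               for i in range(0, len(json_files), batch)]

    # level 1: combine each batch of files in parallel on the cluster
    partials = client.gather(client.map(combine, batches))
    # level 2: combine the partial reference sets into the final one
    final_refs = combine(partials)
    ```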
    Lucas Sterzinger
    @lsterzinger

    @cgentemann here's an example of how I did what Martin mentioned with the NWM stuff. Hopefully there's enough comments/text to explain how it works but let me know if you need help with anything

    https://nbviewer.jupyter.org/gist/lsterzinger/8a93fc1780495aa84694f6d4b1a3708e

    Martin Durant
    @martindurant
    For which of our favourite datasets so far do we have reference files in public locations and/or intake stubs?