Tim Dettrick
@tjdett
BTW, I'm not particularly keen on Anaconda because it requires you to accept a EULA, but if you need it then I can work with that.
Damien Irving
@DamienIrving
No, they don't all need to be in the container at the same time. I basically have data files for a bunch of different weather variables (one file for each of temperature, rainfall, sea ice concentration, wind, atmospheric pressure, etc.), each 11.6 GB in size, which makes me think I might actually need 200-300GB of storage. I never do an analysis that requires all of them at once (e.g. one workflow might just need rainfall, pressure and wind).
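For context, the 200-300GB estimate is consistent with roughly 17-26 files of that size. A quick back-of-envelope check, with the file counts assumed since only a handful of variables are named above:

    # Rough storage estimate; the file counts are assumptions.
    FILE_SIZE_GB = 11.6
    for n_files in (17, 20, 26):
        print(f"{n_files} files x {FILE_SIZE_GB} GB = {n_files * FILE_SIZE_GB:.0f} GB")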
Tim Dettrick
@tjdett
Are your data files public, limited distribution or private?
(Or to be more precise, where on that spectrum do they fit?)
Damien Irving
@DamienIrving
I also never really read the entire 11.6 GB array of data into Python at once. I usually break the process into 5 year chunks (there's 36 years of data all up in each file) and process those chunks in serial because (a) the machine I've been using doesn't have enough RAM, and (b) I've been too lazy to figure out how to parallelise it.
You have to register with the providers of the data and then they give you access to download the data you'd like.
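As a concrete sketch of the chunked, serial processing described above, using xarray as one way to slice a netCDF file by time; the file name, variable name, year range and the time-mean step are placeholders, not the actual workflow:

    # Process a large netCDF file in 5-year chunks, in serial.
    # File name, variable name and the reduction are illustrative only.
    import xarray as xr

    INFILE = "pr_1979-2014.nc"                                      # hypothetical 11.6GB rainfall file
    windows = [(str(y), str(y + 4)) for y in range(1979, 2015, 5)]  # 5-year windows

    results = []
    for start, end in windows:
        with xr.open_dataset(INFILE) as ds:                  # lazy open; nothing loaded yet
            chunk = ds["pr"].sel(time=slice(start, end))     # only this window is read
            results.append(chunk.mean(dim="time"))           # stand-in for the real analysis

    xr.concat(results, dim="chunk").to_netcdf("pr_chunk_means.nc")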
Tim Dettrick
@tjdett
OK. I presume that data is under a licence which prevents redistribution without their approval?
(Some government data is freely distributable under Creative Commons, but they ask for details anyway.)
Damien Irving
@DamienIrving
Tim Dettrick
@tjdett
OK, that's pretty clear. So we must treat the data as private. It doesn't need to be writeable though, I'm guessing.
The reason why I'm asking is that I'd prefer not to allocate 100GB to the compute node itself.
Damien Irving
@DamienIrving
Oh, and I really do need Anaconda. Without the conda package manager there's simply no way I'd be able to manage all the Python libraries I use
Tim Dettrick
@tjdett
Or rather, I'd prefer not to have 100GB of files inside a container. Long term, that would be a wonky approach that would make redistributing containers tricky.
Damien Irving
@DamienIrving
Hmm. The data storage would need to be writable, because my workflows produce intermediary data files along the way
Tim Dettrick
@tjdett

Like I said, I can work with some researchers needing Anaconda. I dislike the idea of it as a default, in part because this paragraph from the EULA really gets up my nose:

The United States currently has embargoes against Cuba, Iran, North Korea, Sudan and Syria. The exportation, re-exportation, sale or supply, directly or indirectly, from the United States, or by a U.S. person wherever located, of any Continuum software to any of these countries is strictly prohibited without prior authorization by the United States Government. By accepting this Agreement, you represent to Continuum that you will comply with all applicable export regulations for Anaconda.

Damien Irving
@DamienIrving
Yuck - I can see why you don't like Anaconda. Eventually the Python community should make pip as smart as conda, but I can't see that happening for a while...
The intermediary files are kind of unavoidable - there are a bunch of command line utilities in the weather and climate sciences that simply take a netCDF data file, do stuff to the data and then produce a new netCDF file
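To illustrate that pattern, here's what one such chain might look like driven from Python. CDO is just one example of that family of tools, and the operators and file names here are placeholders:

    # Chain of netCDF-in, netCDF-out utilities producing intermediate files.
    # CDO operators and file names are illustrative only.
    import subprocess

    subprocess.run(["cdo", "yearmean", "pr_1979-2014.nc", "pr_yearmean.nc"], check=True)       # intermediate
    subprocess.run(["cdo", "fldmean", "pr_yearmean.nc", "pr_yearmean_global.nc"], check=True)  # intermediate
    # ...later steps read pr_yearmean_global.nc and eventually write the small
    # destination files (plots plus a small summary netCDF file).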
Tim Dettrick
@tjdett
Yeah. I really don't care about North Korea, Sudan or Syria all that much, but requiring researchers to exclude Cuba & Iran from distribution because they happen to use a binary Python distribution is just wrong.
I can provide you as much temporary space as you need to write your intermediate files. That's not a problem.
The problem is that eventually I want containers to be movable between compute nodes. To do that, they have to be small enough to move from place to place.
That means being a little more particular about what are source files, what are intermediate (temporary) files, and what are destination files.
Damien Irving
@DamienIrving
Ah, I see. Well the source files are large (11.6GB), the intermediate files can be large (some are also 11.6GB), but the destination files (which I'd need a non-temporary place to keep) are always pretty small (usually an .eps or .png image and a small netCDF file of the data shown in those images)
Tim Dettrick
@tjdett
The rough workflow I have in my head:
  1. Have all your source files available from a password-protected web/WebDAV location, like AARNet CloudStor.
  2. As you need them, download the files to the temporary area inside the container.
  3. Read from the temporary area and write your intermediate files to the temporary area.
  4. Write your destination files either to the container, or to the temporary area and then upload to a final location.
How does that sound?
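A minimal sketch of what steps 1-4 could look like from inside a container, assuming a password-protected WebDAV share and a temporary area mounted at /scratch; the URL, credentials, paths and the processing step are all placeholders:

    # Steps 1-4: download source data into the temporary area, process it
    # there, and upload only the small destination file at the end.
    import os
    import subprocess

    import requests

    WEBDAV = "https://cloudstor.example.org/remote.php/webdav"   # hypothetical share
    AUTH = ("username", "app-password")                          # placeholder credentials
    SCRATCH = "/scratch"                                         # temporary area in the container

    def fetch(name):
        """Step 2: download a source file into the temporary area."""
        local = os.path.join(SCRATCH, name)
        with requests.get(f"{WEBDAV}/{name}", auth=AUTH, stream=True) as r:
            r.raise_for_status()
            with open(local, "wb") as f:
                for block in r.iter_content(chunk_size=1 << 20):
                    f.write(block)
        return local

    def upload(local, name):
        """Step 4: push a small destination file back to the share (WebDAV PUT)."""
        with open(local, "rb") as f:
            requests.put(f"{WEBDAV}/results/{name}", data=f, auth=AUTH).raise_for_status()

    # Step 3: read and write everything large inside the temporary area only.
    src = fetch("pr_1979-2014.nc")
    result = os.path.join(SCRATCH, "pr_global_yearmean.nc")
    subprocess.run(["cdo", "fldmean", "-yearmean", src, result], check=True)
    upload(result, "pr_global_yearmean.nc")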
Damien Irving
@DamienIrving
That sounds ideal.
Tim Dettrick
@tjdett
Based on work earlier this week with Paul & Louise, AARNet downloads are around 80MB/s to containers.
Damien Irving
@DamienIrving
So each file would take a couple of minutes - that seems fine.
Tim Dettrick
@tjdett
I realise this is not ideal for an 11.6GB file. It's around 20 minutes to download. However, the limitation is not the network.
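For reference, the ~20-minute figure implies an effective throughput of roughly 10MB/s (about 80 megabits per second), whereas a full 80 megabytes per second would give the couple of minutes mentioned above. A quick check:

    # Approximate download time for an 11.6GB file at a few effective rates.
    FILE_GB = 11.6
    for rate_mb_s in (10, 80, 200):                 # MB/s (megabytes per second)
        minutes = FILE_GB * 1000 / rate_mb_s / 60   # treating 1GB as 1000MB
        print(f"{rate_mb_s:>3} MB/s -> {minutes:4.1f} minutes")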
Damien Irving
@DamienIrving
Ah, yes, 20, not 2 minutes
How long would the files exist in the temporary area?
Tim Dettrick
@tjdett
Well, keep in mind I'm using rules that we could scale up to, not rules based on your special circumstances.
I don't see a problem keeping the temporary file area around for as long as your container is running. Does that sound reasonable?
The idea is that later on, we might want people to be able to move their container to a bigger compute node to do their processing. I don't want them to have to move all their data around with the container.
Damien Irving
@DamienIrving
That sounds ok. I typically have intensive 1-2 week periods where I'm working on data analysis, so I could leave the container running during those periods and then turn it off during other periods when I'm writing or away at a conference or something.
Tim Dettrick
@tjdett
So in this hypothetical process, you would click a button, and your container would shut down, snapshot, and then spin up on a bigger node. You would then do your processing, which might take several days.
After you'd done your processing, you'd check you had all your data saved, shut down your container, snapshot, and move back to a less grunty machine.
Damien Irving
@DamienIrving
That sounds like a very smart system.
Tim Dettrick
@tjdett
The way to encourage people to shut down would be to issue a quota of hours for the grunty nodes, but many or unlimited hours on the smaller ones.
This sort of system is what cloud computing was made for, but it requires the researcher to work in slightly different ways, which is one reason why I'm thinking about it for your trial.
The good news is that this sort of workflow makes for a natural transition to parallel processing later.
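As a sketch of that later transition, the 5-year chunks from the earlier example could be farmed out to a process pool rather than run in serial; the file name, variable name and worker count are again placeholders:

    # Same chunked analysis as before, but with chunks processed in parallel.
    from multiprocessing import Pool

    import xarray as xr

    INFILE = "pr_1979-2014.nc"      # hypothetical source file

    def process_chunk(window):
        start, end = window
        with xr.open_dataset(INFILE) as ds:
            return ds["pr"].sel(time=slice(start, end)).mean(dim="time")

    if __name__ == "__main__":
        windows = [(str(y), str(y + 4)) for y in range(1979, 2015, 5)]
        with Pool(processes=4) as pool:              # worker count is arbitrary here
            means = pool.map(process_chunk, windows)
        xr.concat(means, dim="chunk").to_netcdf("pr_chunk_means.nc")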
Anyway, back to the problem of data transfer speed. A consumer grade laptop hard drive averages around 80-100MB/s write speed without overheads.
Damien Irving
@DamienIrving
I think it's good. At the moment I (and everyone else I work with) produce intermediary files that I never remember to go back and clean up later. Basically all that clutter builds up until the data store in our department fills up and then we all panic and clean up. This system forces a clean-up.
Tim Dettrick
@tjdett
I hadn't thought of that benefit!
So at best, with dedicated spinning metal drives I doubt we can get you download speeds above 200MB/s, or around 10 minutes to download.
I have an idea though.
Damien Irving
@DamienIrving
That write speed is a little slow - sounds like I basically need to rework my workflow so I'm not producing quite so many large intermediary files, because they'll take 20 minutes to write each time
Tim Dettrick
@tjdett
Well, sort of. I'm going to test something. One moment.
Damien Irving
@DamienIrving
(They take that long to write on the server I work on at the moment; the difference is that the files aren't temporary, so I don't have to do it again)
Tim Dettrick
@tjdett
Well, keep in mind "temporary" can mean days in this context, not minutes.