Damien Irving
@DamienIrving
How long would the files exist in the temporary area?
Tim Dettrick
@tjdett
Well, keep in mind I'm using rules that we could scale up to, not rules based on your special circumstances.
I don't see a problem keeping the temporary file area around for as long as your container is running. Does that sound reasonable?
The idea is that later on, we might want people to be able to move their container to a bigger compute node to do their processing. I don't want them to have to move their data around with the container.
Damien Irving
@DamienIrving
That sounds ok. I typically have intensive 1-2 week periods where I'm working on data analysis, so I could leave the container running during those periods and then turn it off during other periods when I'm writing or away at a conference or something.
Tim Dettrick
@tjdett
So in this hypothetical process, you would click a button, your container would shut down, snapshot, and then spin up on a bigger node. You would then do your processing, which might take several days.
After you'd done your processing, you'd check you had all your data saved, shut down your container, snapshot, and move back to a less grunty machine.
Damien Irving
@DamienIrving
That sounds like a very smart system.
Tim Dettrick
@tjdett
The way to encourage people to shut down would be to issue a quota of hours for the grunty nodes, and many or unlimited hours on the smaller ones.
This sort of system is what cloud computing was made for, but it requires the researcher to work in slightly different ways, which is one reason why I'm thinking about it for your trial.
The good news is that this sort of workflow makes for a natural transition to parallel processing later.
Anyway, back to the problem of data transfer speed. A consumer-grade laptop hard drive averages around 80-100MB/s write speed, before overheads.
Damien Irving
@DamienIrving
I think it's good. At the moment I (and everyone else I work with) produce intermediary files that I never remember to go back and clean up later. Basically all that clutter builds up to the point where the data store in our department fills up, and then we all panic and clean up. This system would force that clean-up.
Tim Dettrick
@tjdett
I hadn't thought of that benefit!
So at best, with dedicated spinning-metal drives, I doubt we can get you download speeds above 200MB/s, which works out to around 10 minutes for a download.
I have an idea though.
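
(For reference: transfer time is just data size divided by sustained throughput. The sketch below runs that arithmetic for the ballpark speeds quoted above; the dataset size in it is a hypothetical placeholder, not a figure from this conversation.)

    # Back-of-envelope transfer-time arithmetic for the speeds quoted above.
    # The dataset size is a hypothetical placeholder, not a figure from this chat.

    def transfer_seconds(size_gb, speed_mb_per_s):
        """Seconds to move size_gb of data at a sustained speed_mb_per_s."""
        return size_gb * 1000 / speed_mb_per_s  # using 1GB = 1000MB

    dataset_gb = 100  # hypothetical total dataset size
    for speed in (80, 100, 200):  # laptop HDD range vs. dedicated drives
        print(f"{dataset_gb}GB at {speed}MB/s ~ {transfer_seconds(dataset_gb, speed) / 60:.0f} min")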
Damien Irving
@DamienIrving
That write speed is a little slow - sounds like I basically need to rework my workflow a little so I'm not producing quite so many large intermediary files, because they'll take 20 minutes to write each time.
Tim Dettrick
@tjdett
Well, sort of. I'm going to test something. One moment.
Damien Irving
@DamienIrving
(They take that long to write on the server I work on at the moment; the difference is that the files aren't temporary, so I don't have to do it again.)
Tim Dettrick
@tjdett
Well, keep in mind "temporary" can mean days in this context, not minutes.
The idea in the processing workflow I described is that your intermediary files don't disappear while you're on the grunty node, but when you finish you don't take them with you.
However I agree that adding 20 minutes to your current workflow is something we really want to avoid if we can.
Tim Dettrick
@tjdett
@DamienIrving Are you able to upload one of your files to AARNet CloudStor? I would suggest doing so over a wired connection at UniMelb, not wifi.
Damien Irving
@DamienIrving
Sure. I'm away from my desk now until about 3pm, but I can do it then.
Tim Dettrick
@tjdett
No problem. For the moment I'll use a randomly generated file of approximately the same size.
BTW, it turns out data write speed in the Melbourne-only NeCTAR cell is a lot faster than I expected. It wrote a random 12GB file in 35s.
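
(For anyone wanting to reproduce that kind of check, the sketch below is one way to time a large sequential write from Python. It is not the actual test described above; the scratch path and chunk size are assumptions.)

    # Rough sequential-write benchmark: write ~12GB of random data and time it.
    # Not the actual test described above; the path and chunk size are assumptions.
    import os, time

    path = "/tmp/write_test.bin"          # assumed scratch location
    target_bytes = 12 * 1000**3           # ~12GB
    chunk = os.urandom(64 * 1024 * 1024)  # 64MB of random data, reused each write

    start = time.time()
    written = 0
    with open(path, "wb") as f:
        while written < target_bytes:
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data has actually hit the disk
    elapsed = time.time() - start

    print(f"wrote {written / 1e9:.1f}GB in {elapsed:.0f}s ({written / 1e6 / elapsed:.0f}MB/s)")
    os.remove(path)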
Damien Irving
@DamienIrving
nice
Tim Dettrick
@tjdett
It turns out the download speeds I was seeing may have been somewhat inaccurate. The problem was that the entire 600MB archive Paul & Louise are using only took 5s to download, which didn't make for a good average-speed calculation.
The download took ~90s for a 12GB file. That is impressively fast.
Tim Dettrick
@tjdett
I see where I went wrong: somehow the tool I was using treated "MB/s" as megabits per second, not megabytes per second.
Also, the "small" 600MB file I was using originally low-balled the download speed by, well, a lot.
Downloading a 12GB file in 90s requires an average of over 1Gb/s. (i.e. the original, final-phase, fibre-to-your-doorstep NBN could not have downloaded the file faster.)
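
(A quick arithmetic check of that figure, using the numbers just quoted; the only extra assumption is the standard 8 bits per byte.)

    # Sanity check for the MB/s vs Mb/s mix-up: 12GB downloaded in 90s.
    size_gb = 12
    seconds = 90

    mb_per_s = size_gb * 1000 / seconds   # megabytes per second
    mbit_per_s = mb_per_s * 8             # 8 bits per byte
    print(f"{mb_per_s:.0f}MB/s = {mbit_per_s:.0f}Mb/s = {mbit_per_s / 1000:.2f}Gb/s")
    # -> roughly 133MB/s, i.e. just over 1Gb/s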

BTW:

Ah, yes, 20, not 2 minutes

You had it right, not me. Sorry about that. I should have done it in my head rather than relying on an online tool.

Damien Irving
@DamienIrving
@tjdett To get my data onto CloudStor, it looks like I'd have to download it to my own machine first and then upload to CloudStor using their web interface. Is there any way I can cut out the middle step and download directly to CloudStor? (The place that I'm getting the data from provides a series of csh scripts for downloading in a Unix environment - it would be nice to use those and download direct to CloudStor.)
Tim Dettrick
@tjdett
@DamienIrving Not easily. The closest I can offer is downloading to a machine that has a good connection to CloudStor. What's the upload speed like from your uni desktop?
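
(One simple way to put a number on that is to time a single large upload. The sketch below does this from Python with the requests library; the URL is a placeholder, not a real CloudStor endpoint, and the local file name is an assumption.)

    # Time one large HTTP upload to estimate sustained upload speed.
    # The URL is a placeholder, not a real CloudStor endpoint.
    import os, time
    import requests

    url = "https://example.org/upload"  # placeholder upload target
    path = "sample_file.nc"             # assumed local test file

    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        requests.put(url, data=f, timeout=3600)  # streams the file body
    elapsed = time.time() - start

    print(f"uploaded {size / 1e6:.0f}MB in {elapsed:.0f}s ({size / 1e6 / elapsed:.1f}MB/s)")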
Damien Irving
@DamienIrving
Not sure. Will test it out (probably tomorrow) and let you know.
Tim Dettrick
@tjdett
I can set you up with NeCTAR VM access if that would be easier. You're out and about today, I take it?
Damien Irving
@DamienIrving
I'm out and about this afternoon, but I think my desktop at uni should be fine. I'll give it a try tomorrow morning.
Tim Dettrick
@tjdett
OK. Let me know how you go. I can provision a NeCTAR VM with sufficient space if you need it.
Tim Dettrick
@tjdett
@DamienIrving Excluding temporary files, what do you think would be a good upper limit for a DIT4C container size?
I'm provisioning a compute node for you to do this long-term research test on, and I'm trying to put in place the sorts of resource constraints I'm expecting we'll need later.
Damien Irving
@DamienIrving
Hard to say. What size becomes a problem for you? I'm envisaging that a researcher would only keep the finished products of their workflows (e.g. small summary data files, images) and that all intermediary files would be temporary, so you might be able to get away with as little as 500 MB?
(Apologies for the delay in getting data up on CloudStor. This week is looking a little hectic for me, so it might be next week before I get a chance.)
Tim Dettrick
@tjdett
The only Docker backend that lets me limit disk usage (!) is "devicemapper", and it has a default total container size of 10GB. I'm wondering if I could drop that to 5GB without causing problems.
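
(For context, that limit is the devicemapper storage option dm.basesize, which defaults to 10GB. A hedged sketch of what changing it could look like on a freshly provisioned node; the exact daemon invocation depends on the Docker version in use, and shrinking the base size on an existing devicemapper setup generally means recreating the storage.)

    # Start the Docker daemon with devicemapper and a 5GB per-container base size
    docker daemon --storage-driver=devicemapper --storage-opt dm.basesize=5G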
Damien Irving
@DamienIrving
I think 5GB would be plenty
Tim Dettrick
@tjdett
Our biggest images, like Slicer, have yet to reach 2.5GB in size, so that still leaves over 2GB of space for extra packages and files.
OK, we'll go with that for your test environment. If you run up against the limit, I'll manually save your container, change the size, and then reprovision it.
Tim Dettrick
@tjdett
OK, DIT4C running Autodesk Maya is our heaviest install yet, but it only reaches 2.7GB. That was my biggest worry for a 5GB limit.
@pmignone How big do Maya files get? Is it conceivable that your project files could be bigger than 2GB?