But is the data actually transferred? If I have gigs of data on my university HPC or a cloud volume, is the data transferred somewhere?
Matt should correct me if I'm wrong, but my understanding is that the data is transferred to wherever the Jupyter notebook kernel is running (might be the local machine, might be a server), and then some of it is transferred to the web browser for display. Somewhere along that path (either data server to kernel, or kernel to web browser) it gets downsampled and compressed.
@agladstein , @parente is correct -- the data usually lives on the local filesystem where the Jupyter kernel is running. But it is also possible to download the data locally via standard HTTPS over the network, or, if using Dask Arrays, to load the data from the network on demand without ever saving it to disk.
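To make the Dask point concrete, here's a minimal sketch (the remote Zarr URL and the slicing are made up for illustration) of opening an array lazily so only the chunks a computation actually touches cross the network:

```python
import dask.array as da

# Opening a (hypothetical) remote Zarr store reads only metadata;
# no array data is downloaded at this point.
arr = da.from_zarr("https://example.org/data/big_array.zarr")

# The reduction pulls just the chunks it needs over the network,
# computes on them, and discards them -- the full dataset never
# lands on the local filesystem.
result = arr[::1000, :].mean().compute()
print(result)
```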
@agladstein yeah, I can see that. I'm still a huge fan of RStudio. Best of luck!
Can Code Ocean be connected to GitHub so that it updates with commits on GitHub? How can it be integrated with large data? And where does it run? Suppose, hypothetically, I publish a paper that uses 1 TB of data and trains large-memory deep learning GPU models on it -- can I use Code Ocean to make that analysis reproducible?
@agladstein To put answers to these questions in writing:

1) Code Ocean currently imports from GitHub, but we are aiming for bi-directional links (pull & push) in early 2019.
2) Large data can be uploaded as public datasets (https://help.codeocean.com/getting-started/uploading-code-and-data/using-public-datasets), and details on our machines are here: https://help.codeocean.com/faq/what-machine-will-my-code-run-on . Future plans: A) link up with local university clusters; B) get the most powerful machines available to run code.
3) Currently we run on AWS.
4) Creating 1 TB of data may be very slow on Code Ocean or may exceed system limits -- it's hard to say in advance, since it depends on how much memory the capsule uses during runtime. We recommend clearing intermediate objects out of memory early and often (see the sketch below). One of our 2019 priorities is more powerful machines aimed at use cases like this.
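For the "clear intermediates early" advice, here is a minimal, generic Python sketch (not a Code Ocean-specific API; the load step is a hypothetical stand-in) of dropping a large intermediate as soon as the derived result is in hand:

```python
import gc
import numpy as np

def load_raw_data():
    # Stand-in for an expensive load step (hypothetical).
    return np.random.rand(10_000, 1_000)

raw = load_raw_data()
features = raw.mean(axis=1)  # keep only the small derived result

del raw       # release the large intermediate as soon as it's unused
gc.collect()  # prompt the interpreter to reclaim the memory now

print(features.shape)
```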
the VS Code Python extension I mentioned uses a flat text-file view to represent notebooks, with cells marked by `# %%` prefixes. I haven't played with it yet to see whether it can do a bi-directional transform between the ipynb JSON and the flat-file format.
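for reference, that flat-file cell format looks like this minimal sketch (the percent markers are the part VS Code recognizes; the cell contents are made up):

```python
# %% [markdown]
# # My analysis
# This cell is treated as Markdown.

# %%
# This cell is executable code.
import numpy as np

data = np.random.rand(100)
print(data.mean())
```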