These are chat archives for thunder-project/thunder

30th
Nov 2016
chenminyeh
@chenminyeh
Nov 30 2016 15:56
Hi, I have a question about loading a big dataset: I have 100 binary files with the extension 'stack'. Each stack is around 500 MB, about 50 GB in total. I used the following command to load it but got the error message below. Is there any solution for this? Also, what would be the optimal way to load 200 GB of images? Thanks!
data = td.series.frombinary('/iblsn/data/cyeh/Chen_Min/_20161011_161658', ext='stack', dtype='uint16', shape=[100, 2048, 2048, 48])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/thunder/series/readers.py", line 329, in frombinary
v = frombuffer(buffer(buf), offset=offset, count=nelements, dtype=dtype)
ValueError: buffer is smaller than requested size
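For context on that error: with dtype uint16, the requested shape [100, 2048, 2048, 48] implies roughly 40 GB of data per read, far more than a single ~500 MB file holds, which is what "buffer is smaller than requested size" is complaining about. A quick size check along these lines makes the mismatch visible (a sketch; the file name is hypothetical):

import os
import numpy as np

shape = (100, 2048, 2048, 48)
itemsize = np.dtype('uint16').itemsize      # 2 bytes per value
expected = int(np.prod(shape)) * itemsize   # ~40 GB requested
actual = os.path.getsize('/iblsn/data/cyeh/Chen_Min/_20161011_161658/example.stack')  # hypothetical file name
print(expected)   # ~40,000,000,000 bytes
print(actual)     # ~500,000,000 bytes for one stack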
Davis Bennett
@d-v-b
Nov 30 2016 16:12
@chenminyeh can you load the files with numpy?
like, can you load a single file
chenminyeh
@chenminyeh
Nov 30 2016 16:35
Yes, I loaded a single stack (50331648 integer values) with np.fromfile and it works. It takes about 1 min.
Jeremy Freeman
@freeman-lab
Nov 30 2016 17:29
@chenminyeh each file is one image? i think the issue is that you're loading as series rather than images, try the td.images.frombinary method
chenminyeh
@chenminyeh
Nov 30 2016 17:33
Each file is a 3D volume, and I have a time series (a few hundred files at different time points) in the folder
Jeremy Freeman
@freeman-lab
Nov 30 2016 17:37
yeah so then you want to use td.images.frombinary
which assumes each file is a single image (or volume) representing one time point
chenminyeh
@chenminyeh
Nov 30 2016 17:37
Cool! Will try it now!! Thanks a lot!
chenminyeh
@chenminyeh
Nov 30 2016 18:24
Hi, I wonder if there is a way to overcome this:
data = td.images.frombinary('/netapp/iblsn/data/cyeh/Chen_Min/_20161011_165907', ext='stack', dtype='uint16', shape=[100, 2048, 2048, 48])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/thunder/images/readers.py", line 321, in frombinary
engine=engine, credentials=credentials)
File "/usr/local/lib/python2.7/dist-packages/thunder/images/readers.py", line 219, in frompath
flattened = list(itertools.chain(*data))
File "/usr/local/lib/python2.7/dist-packages/thunder/images/readers.py", line 293, in getarray
ary = frombuffer(buf, dtype=dtype, count=int(prod(shape))).reshape(shape, order=order)
ValueError: buffer is smaller than requested size
Davis Bennett
@d-v-b
Nov 30 2016 18:27
@chenminyeh i think the shape kwarg should be the shape of each image, not the shape of the entire dataset
so you would put shape = [2048, 2048, 48]
also, before getting thunder involved, you should make sure you can load a single image using numpy and that the dimensions make sense
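Putting that together, the corrected call would look roughly like this (a sketch, assuming the same directory and the per-volume dimensions discussed above; not verified against this dataset):

import thunder as td

# shape describes a single file (one 3D volume / time point), not the whole dataset
data = td.images.frombinary(
    '/netapp/iblsn/data/cyeh/Chen_Min/_20161011_165907',
    ext='stack',
    dtype='uint16',
    shape=[2048, 2048, 48],
)
print(data.shape)  # expected: (number_of_files, 2048, 2048, 48)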
chenminyeh
@chenminyeh
Nov 30 2016 18:30
Okay, thanks a lot!!
Davis Bennett
@d-v-b
Nov 30 2016 18:30
see if you can use np.fromfile(path_to_a_single_image, dtype='uint16').reshape(2048, 2048, 48)
this is to make sure you have the dimensions / datatype right
chenminyeh
@chenminyeh
Nov 30 2016 18:40
Cool, it works nicely; thanks a lot!!!!!!!
chenminyeh
@chenminyeh
Nov 30 2016 18:54
Hi, after loading the files (80 GB), I tried to run ICA on the data, but I quickly get a memory error. Is there a good way to estimate how much RAM is needed? Also, are there any parameters I can change to overcome the issue? Thanks!
Davis Bennett
@d-v-b
Nov 30 2016 20:29
you can increase the number of workers in your cluster, and you can reduce the size of the data (by downsampling, thresholding to remove background, and making sure you use the smallest datatype, like float32 instead of float64)
the general rule of thumb for RAM demands is to assume you need a 2:1 RAM-to-data ratio
before running your analysis on the entire dataset, downsample the data by like a factor of 8 or 16 and make sure everything works with smaller data
then scale up
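As a rough worked example of that advice (a sketch; the numbers are back-of-the-envelope estimates based on the 2:1 rule of thumb mentioned above, not measurements):

# ~80 GB of uint16 data on disk
data_gb = 80.0

# 2:1 RAM-to-data rule of thumb: roughly 160 GB of aggregate RAM across workers
ram_rule_of_thumb_gb = 2 * data_gb

# datatype matters: relative to uint16, casting to float64 quadruples the
# in-memory footprint, while float32 only doubles it
as_float64_gb = 4 * data_gb
as_float32_gb = 2 * data_gb

print(ram_rule_of_thumb_gb)   # ~160
print(as_float64_gb)          # ~320
print(as_float32_gb)          # ~160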
chenminyeh
@chenminyeh
Nov 30 2016 20:32
Got it! Thanks!!