These are chat archives for thunder-project/thunder

13th
Dec 2016
chenminyeh
@chenminyeh
Dec 13 2016 17:01

Hi, I have TIFF data loaded as follows:
Images
mode: spark
dtype: uint16
shape: (29, 47, 1229, 2048)
I ran ICA with: algorithm = ICA(k=50, k_pca=50, svd_method='em', max_iter=10, tol=0.00001, seed=200).fit(data)
I got the following error:
MemoryError Traceback (most recent call last)

<ipython-input-16-46351a70dec1> in <module>()
----> 1 algorithm = ICA(k=50, k_pca=50, svd_method='em', max_iter=10, tol=0.00001, seed=100).fit(data)

/home/cyeh/anaconda3/lib/python3.5/site-packages/factorization/base.py in fit(self, X, return_parallel)
27
28 if isinstance(data, BoltArraySpark):
---> 29 results = list(self._fit_spark(data))
30
31 # handle output types

/home/cyeh/anaconda3/lib/python3.5/site-packages/factorization/algorithms/ICA.py in _fit_spark(self, data)
52
53 # reduce dimensionality
---> 54 u, s, v = SVD(k=self.k_pca, method=self.svd_method).fit(data)
55 u = Series(u)
56

/home/cyeh/anaconda3/lib/python3.5/site-packages/factorization/base.py in fit(self, X, return_parallel)
27
28 if isinstance(data, BoltArraySpark):
---> 29 results = list(self._fit_spark(data))
30
31 # handle output types

/home/cyeh/anaconda3/lib/python3.5/site-packages/factorization/algorithms/SVD.py in _fit_spark(self, mat)
60 # initialize random matrix
61 random.seed(self.seed)
---> 62 c = random.rand(self.k, ncols)
63 niter = 0
64 error = 100

mtrand.pyx in mtrand.RandomState.rand (numpy/random/mtrand/mtrand.c:17636)()

mtrand.pyx in mtrand.RandomState.random_sample (numpy/random/mtrand/mtrand.c:13908)()

mtrand.pyx in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:2055)()

MemoryError:
This is the cluster I am using:
Alive Workers: 32
Cores in use: 64 Total, 64 Used
Memory in use: 3.1 TB Total, 3.1 TB Used

Davis Bennett
@d-v-b
Dec 13 2016 17:58
@chenminyeh you might be running out of memory. try reducing the size of the data dramatically (like, a factor of 100 or so, by downsampling by 10 in x and y) and see if that runs
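A rough estimate shows why the allocation in the traceback could fail even on a cluster with terabytes of RAM. The failing call is numpy's random.rand(self.k, ncols), which builds one dense float64 array inside a single Python process rather than spreading it across workers. The sketch below assumes that ncols equals the number of voxels per time point (z * x * y); the thread does not confirm this, so treat the numbers as an illustration only.

```python
# Back-of-the-envelope size of the random matrix allocated at SVD.py line 62.
# Assumption (not confirmed in the thread): ncols = voxels per time point.
shape = (29, 47, 1229, 2048)   # (time, z, x, y), from the original post
k = 50                         # k_pca passed to ICA in the original post
bytes_per_float64 = 8          # numpy.random.rand returns float64 values
bytes_per_uint16 = 2           # dtype of the loaded images

ncols = shape[1] * shape[2] * shape[3]
matrix_bytes = k * ncols * bytes_per_float64
data_bytes = shape[0] * ncols * bytes_per_uint16

print("ncols:", ncols)
print("random matrix: %.1f GB" % (matrix_bytes / 1e9))
print("input data:    %.1f GB" % (data_bytes / 1e9))
```

Under that assumption the random matrix alone is roughly 47 GB, several times larger than the input data itself, and it has to fit in the memory of the one process that calls fit.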
chenminyeh
@chenminyeh
Dec 13 2016 18:51
@d-v-b Thanks, I am running 1/3 of the previous data now. I was already using only 5% (less than 10 GB) of my data. I wonder if there is any other setting I can change to run 100 GB of data. As you can see, I am using a cluster with a lot of RAM.
Davis Bennett
@d-v-b
Dec 13 2016 18:53
it looks like you are downsampling along the time axis instead of space
I don't think ICA will make much sense with 29 timepoints
use the full timeseries and reduce the spatial dimensions
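The suggestion to keep every time point and shrink x and y can be sketched with plain NumPy. The helper name downsample_xy is hypothetical and is not part of Thunder; Thunder may offer its own subsampling method on Images, which this sketch does not rely on. It averages non-overlapping blocks, trimming any pixels that do not fill a complete block.

```python
import numpy as np

def downsample_xy(volume, factor):
    """Block-average the last two axes of a (t, z, x, y) array by `factor`."""
    t, z, x, y = volume.shape
    # Trim x and y so that both divide evenly by the factor.
    x_trim, y_trim = x - x % factor, y - y % factor
    trimmed = volume[:, :, :x_trim, :y_trim]
    # Split each spatial axis into (blocks, factor) and average over the factor axes.
    blocks = trimmed.reshape(t, z, x_trim // factor, factor, y_trim // factor, factor)
    return blocks.mean(axis=(3, 5))

# Small synthetic stand-in for the real data (time, z, x, y).
rng = np.random.RandomState(0)
small = rng.randint(0, 2**16, size=(5, 3, 45, 64)).astype(np.uint16)
reduced = downsample_xy(small, 10)
print(small.shape, "->", reduced.shape)
```

Applied with a factor of 10 to planes of 1229 by 2048 pixels, this would give planes of 122 by 204 pixels, about 100 times fewer voxels, while leaving the time axis untouched.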
chenminyeh
@chenminyeh
Dec 13 2016 18:56
Yes, my plan is to first downsample the time dimension using PCA and then use ICA to extract spatial info.
Davis Bennett
@d-v-b
Dec 13 2016 18:58
is this right? the shape of your data is : (29, 47, 1229, 2048)
chenminyeh
@chenminyeh
Dec 13 2016 18:58
yes, 29 is time points, 47 is z, 1229 is x, 2048 is y
Davis Bennett
@d-v-b
Dec 13 2016 18:58
29 timepoints is too low for anything meaningful to come out of PCA or ICA, in my opinion
I'm assuming you have downsampled in time
(my belief that 29 timepoints is too low is based on experience with noisy timeseries, which might not apply in your case)
chenminyeh
@chenminyeh
Dec 13 2016 19:01
Yes, do you have more information about how I may shape my data in order to do ICA on space? Many thanks!
Davis Bennett
@d-v-b
Dec 13 2016 19:05
that's a good question, I don't remember how to do that. @jwittenbach do you have any advice?
Jason Wittenbach
@jwittenbach
Dec 13 2016 23:09
@chenminyeh @d-v-b Am I understanding correctly: by "ICA on space" you mean you want a set of statistically independent images as well as (not necessarily independent) weights over time?
If that's the case, and you have your data in a Thunder Series object, then you can simply give that data to the fit function on an ICA algorithm
Somewhat ironically, if your data is a Series, then it will do "spatial" ICA
and if your data is an Images, then it will do "temporal" ICA
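The difference between the two layouts can be illustrated with NumPy alone. This is only a sketch of the general idea (one record per time point versus one record per voxel), not Thunder's internal representation. In Thunder itself the conversion from Images to Series is generally done with a toseries() call, which is not shown in this thread and is mentioned here as an assumption.

```python
import numpy as np

# Tiny synthetic movie with axes (time, z, x, y).
t, z, x, y = 6, 2, 4, 5
movie = np.arange(t * z * x * y, dtype=np.float64).reshape(t, z, x, y)

# Images-like layout: one record per time point, each holding a flattened volume.
as_images = movie.reshape(t, z * x * y)   # shape (time, voxels)

# Series-like layout: one record per voxel, each holding that voxel's time series.
as_series = as_images.T                   # shape (voxels, time)

print(as_images.shape, as_series.shape)
```

The same numbers are present in both arrays; only the choice of which axis indexes the records changes, and that choice is what decides whether the fit is described as "spatial" or "temporal" in the messages above.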