These are chat archives for thunder-project/thunder

18th
Jun 2015
wolfbill
@wolfbill
Jun 18 2015 07:23
@freeman-lab yes, I've already tried it. It's OK.
I'd like to do clustering and regression analysis using data1_3d, but I don't know how to do it.
Jeremy Freeman
@freeman-lab
Jun 18 2015 17:01
@bald6354 cool (and great to meet you at Spark Summit!) if the initial data is large and the new data are much smaller, you could load the original data, cache it, and then load the new data each time and combine the two through a union... that would avoid reloading the original data every time
currently this requires going into the rdd itself, but we could easily add a concatenate method to the primary data object that just exposes union underneath
for now it would look like:
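from thunder import Images

# tsc is the ThunderContext provided by the thunder shell
original = tsc.loadExample('mouse-images')
new = tsc.loadExample('mouse-images')  # stand-in for the newly arrived data

# cache the original data, then count to force the cache to populate
original.cache()
original.count()

# combine the underlying RDDs and wrap the result back into an Images object
updated = Images(original.rdd.union(new.rdd))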
that cache/count forces the initial data to be cached
with this, any operations after the update should be faster than starting from scratch, including statistics on the images themselves or conversions to Series
but i haven't gone through this particular workflow myself at scale, so let us know how it goes!
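as a rough sketch of the concatenate idea mentioned above: this is not an existing thunder method, and `Data`, `ListRDD`, and `concatenate` are all hypothetical names made up for illustration. `ListRDD` is a tiny stand-in for a Spark RDD so the example runs without a cluster, and a real version would wrap an actual rdd instead

```python
# Sketch only: `concatenate` is a hypothetical method, not part of thunder's API.
# ListRDD is a minimal stand-in for a Spark RDD so this runs without a cluster.

class ListRDD(object):
    """Stand-in exposing only the union/count part of the RDD interface."""

    def __init__(self, records):
        self.records = list(records)

    def union(self, other):
        # like RDD.union: keeps every record from both sides, duplicates included
        return ListRDD(self.records + other.records)

    def count(self):
        return len(self.records)


class Data(object):
    """Stand-in for a thunder data object that wraps an rdd."""

    def __init__(self, rdd):
        self.rdd = rdd

    def concatenate(self, other):
        # expose union underneath and wrap the result in the same class
        return type(self)(self.rdd.union(other.rdd))


original = Data(ListRDD([(0, 'img0'), (1, 'img1')]))
new = Data(ListRDD([(2, 'img2')]))

updated = original.concatenate(new)
print(updated.rdd.count())
```

the point of wrapping with `type(self)` is that the result stays the same kind of object as the original, so the manual `Images(...)` step would no longer be needed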