These are chat archives for thunder-project/thunder

18th
May 2016
Steve Varner
@stevevarner
May 18 2016 20:39
I've noticed something that might be considered a bug. Line 43 of images.py reads "if nrecords is None: nrecords = rdd.count()". I have been getting heap errors when the number of images in my RDD is fairly large. I'm guessing that count causes some sort of collect on the RDD, which would be pretty big since it contains image data. My workaround is this: "nrecords = rdd.map(lambda ((a,b),c): a).count()". That way it sheds all the image data before calling count on the new, much smaller RDD. If you agree with this proposed change, I can create a pull request.
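A minimal sketch of the workaround described in the message above, assuming each record has the shape ((index, label), image). The helper name `count_records` and the `ListRDD` stand-in are hypothetical and exist only so the snippet runs without Spark; a real RDD exposes the same `map` and `count` calls. The tuple-unpacking lambda quoted in the message is Python 2 only, so this version indexes into the record instead.

```python
def count_records(rdd):
    """Count records after mapping each one down to its integer index.

    Assumes records look like ((index, label), image). Indexing with
    record[0][0] works on both Python 2 and Python 3.
    """
    return rdd.map(lambda record: record[0][0]).count()


class ListRDD(object):
    """Hypothetical in-memory stand-in exposing only map() and count()."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, func):
        # Apply func to every record and wrap the results in a new stand-in.
        return ListRDD(func(item) for item in self.items)

    def count(self):
        return len(self.items)


# Two toy records: a (index, label) key paired with a small nested-list "image".
records = [
    ((0, "ctrl"), [[0, 1], [2, 3]]),
    ((1, "stim"), [[4, 5], [6, 7]]),
]
nrecords = count_records(ListRDD(records))
```

Whether this avoids the heap errors depends on where the memory pressure actually arises, which the message does not establish.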
Steve Varner
@stevevarner
May 18 2016 20:44
One more thing to mention: in the example above, I embedded the label value into the RDD as 'b', the second half of the key tuple. So in my code, readers.py line 373 is now "keys = [(idx*nvals + timepoint,labels[idx]) for timepoint in range(nvals)]". This allows me to persist the RDD to HDFS, then bring it back later and filter it based on metadata stored in the label before it gets loaded into Thunder.
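The key construction quoted above can be sketched in isolation as follows. The function names `build_keys` and `keep_label`, and the sample labels, are hypothetical; only the list comprehension comes from the message. The records here are a plain Python list so the snippet runs without Spark.

```python
def build_keys(idx, nvals, labels):
    """Build one (record_index, label) key per timepoint of file idx."""
    return [(idx * nvals + timepoint, labels[idx]) for timepoint in range(nvals)]


def keep_label(records, wanted):
    """Keep records whose key carries the wanted label.

    Assumes records look like ((index, label), image).
    """
    return [record for record in records if record[0][1] == wanted]


labels = ["ctrl", "stim"]  # hypothetical per-file labels
keys = build_keys(1, 3, labels)
# keys == [(3, "stim"), (4, "stim"), (5, "stim")]

# Pair each key with a placeholder image and filter on the label.
records = [(key, None) for key in build_keys(0, 3, labels) + keys]
stim_only = keep_label(records, "stim")
```

On a PySpark RDD, the same predicate could presumably be passed to `RDD.filter`, for example `rdd.filter(lambda record: record[0][1] == "stim")`.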