I've noticed something that might be a bug. Line 43 of images.py has "if nrecords is None: nrecords = rdd.count()". When the RDD holds a fairly large number of images, I get heap errors. My guess is that count() causes some sort of collect on the RDD, which would be pretty big since it contains the image data. My workaround is "nrecords = rdd.map(lambda ((a,b),c): a).count()". That sheds all the image data before count() runs on the new, much smaller RDD. If you agree with this proposed change, I can create a pull request.
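
For reference, here is a rough sketch of the workaround as a standalone function. It indexes into the record instead of using a tuple-unpacking lambda, since Python 3 no longer accepts that syntax. The names count_records and LocalRDD are made up for illustration and are not part of images.py; LocalRDD is only an in-memory stand-in so the snippet runs without Spark. With a real RDD the same function should apply, assuming records are shaped like ((a, b), image) as in my workaround.

```python
class LocalRDD:
    """Hypothetical in-memory stand-in exposing only map() and count()."""

    def __init__(self, records):
        self._records = list(records)

    def map(self, fn):
        # Apply fn to every record and wrap the results in a new LocalRDD.
        return LocalRDD(fn(r) for r in self._records)

    def count(self):
        return len(self._records)


def count_records(rdd):
    # Assumption: each record looks like ((a, b), image).
    # Keep only `a` so the image payload is dropped before counting.
    return rdd.map(lambda record: record[0][0]).count()


# Five fake records, each carrying a 1 KB placeholder "image".
records = [((i, 0), bytearray(1024)) for i in range(5)]
nrecords = count_records(LocalRDD(records))
print(nrecords)
```

In images.py this would only replace the right-hand side of the "nrecords = rdd.count()" assignment; the "if nrecords is None" check would stay as it is.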