These are chat archives for thunder-project/thunder

9th
Aug 2016
Davis Bennett
@d-v-b
Aug 09 2016 18:38
@jwittenbach any ideas for a way around this?
test = td.images.fromarray(np.zeros([5000, 180000]).astype('float32'), engine=sc)
# a bunch of errors, leading to:
/usr/local/spark-current/python/pyspark/serializers.py in _write_with_length(self, obj, stream)
    146             raise ValueError("serialized value should not be None")
    147         if len(serialized) > (1 << 31):
--> 148             raise ValueError("can not serialize object larger than 2G")
    149         write_int(len(serialized), stream)
    150         if self._only_write_strings:

ValueError: can not serialize object larger than 2G
a few old Spark mailing list threads suggest increasing the number of partitions, but that doesn't help
Jason Wittenbach
@jwittenbach
Aug 09 2016 18:42
@d-v-b yikes, I’ve never run into that before
you just want zeros?
and 1D images?
Davis Bennett
@d-v-b
Aug 09 2016 18:44
i'm about to do some factorization so i've flattened + filtered the data
zeros is just an example
Jason Wittenbach
@jwittenbach
Aug 09 2016 18:55
ah, so the general problem spec is that you have a local array > 2GB and you need to parallelize it?
Davis Bennett
@d-v-b
Aug 09 2016 18:59
yeah
Boaz Mohar
@boazmohar
Aug 09 2016 22:20
@d-v-b If you use Spark directly, you could define a function that yields zeros (or whatever you need), pass it through sc.parallelize, and then do td.images.fromrdd
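A minimal sketch of the approach Boaz describes, assuming the shape from the failing example above (5000 flattened images of length 180000); the (key, value) record layout passed to fromrdd and the helper name make_record are assumptions for illustration, not a documented recipe:
import numpy as np
import thunder as td

nimages, npixels = 5000, 180000

def make_record(i):
    # assumed layout for fromrdd: keys as singleton tuples, values as per-image arrays;
    # replace np.zeros with whatever loads or computes the real row
    return ((i,), np.zeros(npixels, dtype='float32'))

# sc is the existing SparkContext from the session above; only the integer
# indices leave the driver, so no single serialized object gets anywhere
# near PySpark's 2 GB limit
rdd = sc.parallelize(range(nimages), numSlices=200).map(make_record)
images = td.images.fromrdd(rdd)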
Davis Bennett
@d-v-b
Aug 09 2016 22:38
@boazmohar yeah, I also thought about saving the data to disk, but I think there's a bigger issue if it's impossible to use td.images.fromarray on arrays bigger than 2GB
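A rough sketch of that save-to-disk route as well, along the same lines as Boaz's suggestion; the per-image .npy chunking and the paths are purely illustrative:
import glob
import numpy as np
import thunder as td

# write each row out once from the driver (or wherever the data originates)
data = np.zeros((5000, 180000), dtype='float32')
for i, row in enumerate(data):
    np.save('/path/to/chunks/img-%05d.npy' % i, row)

# workers load their own files, so nothing large is ever pickled on the driver
paths = sorted(glob.glob('/path/to/chunks/img-*.npy'))
rdd = sc.parallelize(list(enumerate(paths)), numSlices=200) \
        .map(lambda kv: ((kv[0],), np.load(kv[1])))
images = td.images.fromrdd(rdd)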
Boaz Mohar
@boazmohar
Aug 09 2016 23:48
@d-v-b I think it's a Python 2 pickle serialization problem: https://bugs.python.org/issue11564
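For reference, that bug concerns the 32-bit length fields used by pickle protocols before 4 (protocol 4 arrived in Python 3.4); a rough illustration, which needs well over 10 GB of RAM to actually run:
import pickle

big = bytes(5 * 1024 ** 3)             # 5 GiB of zero bytes
ok = pickle.dumps(big, protocol=4)     # protocol 4 handles objects larger than 4 GiB
# pickle.dumps(big, protocol=2)        # earlier protocols fail, per the bug above

# note that the check in the traceback above, len(serialized) > (1 << 31),
# is PySpark's own 2 GB framing limit, separate from Python's pickle limits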
Davis Bennett
@d-v-b
Aug 09 2016 23:56
possibly, but I'm using Python 3, so maybe PySpark just hasn't been updated to take advantage of that?