These are chat archives for thunder-project/thunder

9th Apr 2015
Ben Poole
@poolio
Apr 09 2015 02:01
re: #160, anyone know if PySpark pipelines operations? E.g. if you're doing a map-reduce and the map stage produces large outputs, will the reduce stage start chomping through those before the map finishes? I think it doesn't, but wanted to check
Jeremy Freeman
@freeman-lab
Apr 09 2015 02:03
@poolio i don't believe it currently does
and in fact, reduces on giant volumes can be demanding because lots of partial aggregates end up on the driver (i believe... at least in practice, it's a problem)
that's my other concern about the moments approach (even if we do each moment one at a time), though i think we can improve large aggregations
maybe using accumulators
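[A minimal sketch of the accumulator idea in plain Python — not Spark's actual Accumulator API. The point is that each partition folds its data into a small local value, and only those small values (not the partial data itself) get merged at the driver:]

```python
# Toy accumulator: workers build small per-partition values; the driver
# merges only those values, never the underlying data.
class SumAccumulator:
    def __init__(self, zero=0):
        self.value = zero

    def add(self, x):
        # worker-side: fold one element into the local value
        self.value += x

    def merge(self, other):
        # driver-side: combine another partition's small partial value
        self.value += other.value

partitions = [[1, 2, 3], [4, 5, 6]]  # stand-ins for distributed partitions

partials = []
for part in partitions:  # imagine each iteration running on a worker
    acc = SumAccumulator()
    for x in part:
        acc.add(x)
    partials.append(acc)

total = SumAccumulator()
for acc in partials:
    total.merge(acc)
print(total.value)  # 21
```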
Ben Poole
@poolio
Apr 09 2015 02:05
argh. one workaround I've used is mapPartitions where you incorporate the reduction in the map operation
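[The mapPartitions workaround can be sketched in plain Python, with lists standing in for partitions — `map_stage` and `map_partitions` are illustrative names, not Spark's API, but the pattern mirrors `rdd.mapPartitions(...)` followed by a small final reduce:]

```python
from functools import reduce
from operator import add

# Simulated RDD: three partitions of numbers
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def map_stage(x):
    # the map whose outputs would be large if emitted individually
    return x * x

def map_partitions(part):
    # fold the reduction into the map: each partition yields a single
    # partial aggregate instead of every mapped element
    yield reduce(add, (map_stage(x) for x in part), 0)

# per-partition partial sums, then one small reduce on the "driver"
partials = [next(map_partitions(p)) for p in partitions]
total = reduce(add, partials)
print(partials)  # [14, 41, 230]
print(total)     # sum of squares of 1..9 = 285
```

[Only one small value per partition crosses the shuffle boundary, which is why this helps when map outputs are large.]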
Jeremy Freeman
@freeman-lab
Apr 09 2015 02:05
ah yup, that helps
we've also used coalesce (effectively the same)
Ben Poole
@poolio
Apr 09 2015 02:08
is that more efficient than repartition?
Jeremy Freeman
@freeman-lab
Apr 09 2015 03:45
hm, yes, insofar as it doesn't involve a shuffle
but it can only be used to reduce the number of partitions
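[The distinction can be illustrated with a toy model — plain Python lists standing in for partitions, and `coalesce`/`repartition` as illustrative re-implementations, not Spark's own. `coalesce` merges whole existing partitions, so no element moves between "nodes" (no shuffle), but the count can only shrink; `repartition` rehashes every element, so it shuffles everything but can go to any partition count:]

```python
# Toy model of coalesce vs repartition; partitions are plain lists.
partitions = [[1, 2], [3], [4, 5], [6]]

def coalesce(parts, n):
    # merge whole partitions into n groups: data stays within its
    # original partition boundaries, so no shuffle is needed
    assert n <= len(parts), "coalesce can only reduce the partition count"
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i % n].extend(p)
    return out

def repartition(parts, n):
    # full shuffle: every element is rehashed and may land anywhere,
    # so n may be larger or smaller than the current count
    out = [[] for _ in range(n)]
    for p in parts:
        for x in p:
            out[hash(x) % n].append(x)
    return out

print(coalesce(partitions, 2))   # [[1, 2, 4, 5], [3, 6]]
print(repartition(partitions, 6))  # elements spread across 6 partitions
```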