These are chat archives for thunder-project/thunder

Jun 2016
Joe Baines-Holmes
Jun 16 2016 09:23
No worries!
Do Series objects have a method to reshape the series after a call to the flatten method?
Jason Wittenbach
Jun 16 2016 13:30
@bainzo not at the moment
could be worth adding though, so it might be worth opening an issue about it
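A minimal sketch of the idea in plain NumPy rather than Thunder's own API: it assumes that flattening collapses every dimension except the last, so keeping the original shape around is enough to undo it. The array and variable names are illustrative only.

```python
import numpy as np

# Illustrative data: a 2 x 3 grid of records, each a series of length 4.
data = np.arange(24).reshape(2, 3, 4)
original_shape = data.shape  # remember the shape before flattening

# Flatten: collapse all leading dimensions into one, keep the series axis.
flat = data.reshape(-1, data.shape[-1])

# Reshape back using the remembered shape.
restored = flat.reshape(original_shape)
```

A built-in method would presumably need to store the pre-flatten shape somewhere on the object, which is the piece that is missing here.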
Steve Varner
Jun 16 2016 14:14
Has anything been done with issue #202? Databricks is heavily pushing people to use DataFrames over RDDs, so I was curious where we are with it. @naory @freeman-lab
Chris Tech
Jun 16 2016 14:18
@stevevarner I'm curious about that as well, since my understanding of Spark 2.0 is that it leverages DataFrames
Jason Wittenbach
Jun 16 2016 14:26
@stevevarner @techchrj I was curious about this a while back as well and started to look into it
I think that switching to DataFrames (at least as they exist in Spark 1.6) would be extremely tricky
The DataFrame has a different API than the RDD; in particular, I don’t think you can use lambda functions with DataFrames
So a large amount of the code (particularly in Bolt) would need to be changed if we wanted to use DataFrames as the underlying object model to get the potential optimizations
On the other hand, the new Datasets framework looks quite promising
Unfortunately, it is not implemented in PySpark as of Spark 1.6
From what I gather, DataFrames and Datasets will be merged into a common API in Spark 2.0. Assuming that this common API is exposed in PySpark, that might be worth taking a look at when 2.0 comes out.
Jun 16 2016 15:42
Another somewhat related question. Is there any plan to be able to use MPI as a backend for Thunder (mpi4py)? I know part of the Thunder rewrite was to allow for different backends. What would be the major hurdles in implementing an MPI backend?
Jeremy Freeman
Jun 16 2016 15:51
@kkcthans interesting, i have little experience with MPI, in general any "backend" that lets you work with a numpy-style ndarray object should be fairly easy to integrate
Jason Wittenbach
Jun 16 2016 15:56
It would effectively involve making something akin to the BoltArray — it would need to be a Python object that exposes the key features of the NumPy ndarray API…only it would use MPI behind the scenes instead of Spark
Jun 16 2016 16:00

So if MPI was added behind the scenes to a BoltArray, would it functionally be pretty easy to be included in Thunder?

I guess I'm wondering how much of the code in Thunder relating to the spark context functions as just a pointer to code in Bolt, vs is actually integrated in Thunder with the new rewrite?

Joe Baines-Holmes
Jun 16 2016 16:16
@jwittenbach I've opened an issue #335 :)
Jason Wittenbach
Jun 16 2016 16:17
@kkcthans so the integration with Thunder would be fairly easy
the part that would take a lot of work is building an MPI-backed NumPy-array-like object
In Thunder, the spark context usually only comes into play when creating an Images or Series object. Once the object is created, we just store the array — a NumPy array in local mode, a BoltArray in spark mode — and then, because the APIs are so similar, Thunder can often operate on that array in a way that is agnostic to whether it is a NumPy array or a BoltArray
Rather than adding MPI functionality to a BoltArray, I think the thing to do would be to make a separate array that uses MPI under the hood
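A rough sketch of what "agnostic to the array type" could look like. `ChunkedArray` and `describe` are hypothetical names, not part of Thunder or Bolt, and the list of chunks only stands in for data spread across MPI ranks; a real MPI-backed array would replace the local loops with collective operations.

```python
import numpy as np


class ChunkedArray:
    """Hypothetical array-like object holding row-wise chunks.

    Each chunk stands in for the piece of data one worker would own.
    Only a tiny part of the ndarray API is exposed: shape, sum, mean.
    """

    def __init__(self, chunks):
        self.chunks = [np.asarray(c) for c in chunks]

    @property
    def shape(self):
        # Rows are split across chunks; trailing dimensions are shared.
        rows = sum(c.shape[0] for c in self.chunks)
        return (rows,) + self.chunks[0].shape[1:]

    def sum(self):
        # Reduce each chunk locally, then combine the partial results.
        return sum(c.sum() for c in self.chunks)

    def mean(self):
        return self.sum() / np.prod(self.shape)


def describe(arr):
    """Works on anything exposing `shape` and `mean()`, local or chunked."""
    return arr.shape, arr.mean()


local = np.arange(12.0).reshape(4, 3)
chunked = ChunkedArray([local[:2], local[2:]])

local_result = describe(local)
chunked_result = describe(chunked)
```

The calling code in `describe` never checks which kind of array it received, which is the property that would make a new backend easy to plug in once the array object itself exists.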
Jason Wittenbach
Jun 16 2016 16:22
@bainzo excellent!