These are chat archives for thunder-project/thunder

17th Feb 2015
Jeremy Freeman
@freeman-lab
Feb 17 2015 16:28
@d-v-b The fundamental issue here is that RegressionModel returns an "incorrect" data object
i believe it's the only example of that
what about, as a short term solution, having RegressionModel.fit return a tuple of Series objects, called betas, stats, and resid?
@jwittenbach would that create any significant problems in how you've been using these outputs?
it's a big change to the RegressionModel.fit API
on the other hand, the current version is causing problems
Jason Wittenbach
@jwittenbach
Feb 17 2015 16:35
@freeman-lab the only issue is that, if you want to pull them back to the driver, you would now need to go "under the hood of Thunder" and do a join to make sure that values are matched across keys (or do the join driver-side, which would be slow city). For users comfortable with Spark RDDs, that's not a big problem though.
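[The key matching described here can be sketched in plain Python, with dicts standing in for Spark's RDD.join; the keys and values below are made up for illustration:]

```python
# Two collected (key, value) lists, e.g. as betas and stats might come back
# to the driver; records can arrive in different orders, so match them by key.
betas = [((0, 1), [0.5, 1.2]), ((0, 0), [0.1, 0.3])]
stats = [((0, 0), 0.8), ((0, 1), 0.9)]

# Driver-side "join": index one side by key, then look up the other.
stats_by_key = dict(stats)
joined = [(k, (b, stats_by_key[k])) for k, b in betas]
# Each key now pairs its betas with its stat, regardless of collection order.
```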
but for me, no, it's just one extra line of code
Jason Wittenbach
@jwittenbach
Feb 17 2015 16:42
Seeing this discussion, maybe it would be worth finishing a lightweight version of the DataTable class so we can use it as a return value here. The nuts and bolts are really straightforward, as it's just a bunch of joins and maps. Then when the Spark community gets around to releasing the newer version of the SchemaRDD, we would only need to change the back-end, as the API should stay the same. Thoughts?
Jeremy Freeman
@freeman-lab
Feb 17 2015 16:47
hm, don't totally follow, you can definitely still pull them back to the driver separately

as in

betas, stats, resid = RegressionModel.fit(data)
b = betas.pack()
s = stats.pack()

did you mean something else? agreed that to e.g. select one conditional on the other you need a join

Jason Wittenbach
@jwittenbach
Feb 17 2015 16:48
Right, but they're not guaranteed to be in the same order. So what if you want to know which r2 goes with which set of βs?
Jeremy Freeman
@freeman-lab
Feb 17 2015 16:49
b = betas.pack(sorting=True)
s = stats.pack(sorting=True)
Jason Wittenbach
@jwittenbach
Feb 17 2015 16:49
Ah, didn't know that was a thing! Good enough :smile:
Jason Wittenbach
@jwittenbach
Feb 17 2015 16:55
Two questions:
1: If sorting=True, would it make sense to return the as well? (I'm thinking about plotting cell-based results)
2: I don't know enough about how RDD.join works, but how much speed do we lose by doing the sorting on the driver after the collect?
Jeremy Freeman
@freeman-lab
Feb 17 2015 16:56
sorry, return the what as well?
Jeremy Freeman
@freeman-lab
Feb 17 2015 17:15
btw, this is a nice blog post on the new DataFrame functionality coming in Spark 1.3 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
definitely need to have some discussions about how and where to incorporate this functionality
Jason Wittenbach
@jwittenbach
Feb 17 2015 17:51
Oops, "return the keys as well"
Jeremy Freeman
@freeman-lab
Feb 17 2015 19:03
ah, so for non-contiguous keys the right call would be k, v = betas.collectAsArray()
but a concern is that there's no sorting=True for the collect methods, just for packing
so maybe that's what we should add?
Jason Wittenbach
@jwittenbach
Feb 17 2015 19:05
That sounds eminently reasonable to me!
Jeremy Freeman
@freeman-lab
Feb 17 2015 19:07
ok great
right so then for cells you could do k, v = betas.collectAsArray(sorting=True) and k, v = stats.collectAsArray(sorting=True) which will give local arrays with both properly sorted
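[The idea behind sorting both collections by key can be sketched in plain Python; collectAsArray is Thunder's API, and the keys and values below are hypothetical stand-ins for what it would return:]

```python
import numpy as np

# Hypothetical (key, value) pairs as two collections might arrive,
# each in a different order.
beta_keys = np.array([2, 0, 1])
beta_vals = np.array([[0.5], [0.1], [0.3]])
stat_keys = np.array([1, 2, 0])
stat_vals = np.array([0.9, 0.8, 0.7])

# Sorting each collection by its own keys aligns the two value arrays
# row-for-row, because both end up in the same key order.
b_order = np.argsort(beta_keys)
s_order = np.argsort(stat_keys)
k = beta_keys[b_order]   # sorted keys, now shared by both
b = beta_vals[b_order]
s = stat_vals[s_order]
# b[i] and s[i] now describe the same record, keyed by k[i].
```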
Jason Wittenbach
@jwittenbach
Feb 17 2015 19:10
Yeah, exactly. The only downside is that you potentially have to carry around multiple copies of the keys, since it's possible (though I guess not likely in most situations) that they would be sorted differently. But that's really not a big issue. And if we add a DataTable in the future, then even that minor annoyance will go away.