These are chat archives for thunder-project/thunder

2nd Dec 2016
Jason Wittenbach
@jwittenbach
Dec 02 2016 01:39
@d-v-b yeah, you’ll need to do a map to coerce each record into a format that MLlib is happy with
for unsupervised learning tasks, you can just give it a NumPy array. So simply stripping off the keys should work.
but then you have to call transform([original dataset]) to actually get the cluster assignments for the dataset after the fit
and making sure everything comes back in the right order might not be straightforward
I think the “official” way to do it would be to convert the data into a DataFrame object
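A minimal sketch of that workflow, assuming pyspark 2.x is available and that the records come from something like a thunder series’ rdd (the toy records and column name here are stand-ins, not anything from the original conversation):

```python
import numpy as np


def strip_keys(records):
    """Drop the keys from (key, value) records, keeping only the
    value arrays, since unsupervised learning doesn't need them."""
    return [np.asarray(value) for key, value in records]


if __name__ == "__main__":
    # hypothetical Spark usage via the DataFrame ("official") route
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.getOrCreate()

    # stand-in for collecting (key, array) records from a thunder series
    records = [((i,), np.random.randn(3)) for i in range(10)]

    rows = [(Vectors.dense(v),) for v in strip_keys(records)]
    df = spark.createDataFrame(rows, ["features"])

    model = KMeans(k=2, seed=1).fit(df)
    # transform the original dataset to get cluster assignments back,
    # in the same row order as the input DataFrame
    assignments = model.transform(df).select("prediction").collect()
```

Because transform returns rows in the same order as the input DataFrame, this sidesteps the ordering worry, as long as the keys are reattached in that same order.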
PieterKJ
@PieterKJ
Dec 02 2016 14:34
Does thunder-regression allow fitting ARIMA models?
Davis Bennett
@d-v-b
Dec 02 2016 14:37
@PieterKJ to my knowledge, thunder-regression doesn't have support for that, but if fitting your model is an independent calculation for each timeseries, you could write the thunder implementation yourself
basically, if you can write a python function that will fit an ARIMA model on a single timeseries (e.g., using https://pypi.python.org/pypi/statsmodels), then you can use the map method on a thunder.series object to apply this fitting function to every timeseries in your distributed collection
regarding your issues above: I don't have any experience loading data from a .csv file, but I would make sure that the data are formatted the way you want after you map the split. Before converting to thunder.series, call file1.first() to look at the values in this rdd and make sure they look the way you want
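For instance, a sketch of that check (the column layout, with an integer index first and float values after, is an assumption about your csv, as is the file name):

```python
import numpy as np


def parse_line(line):
    """Split one csv line into a (key, values) record: assumes the
    first column is an integer index and the rest are floats."""
    fields = line.strip().split(",")
    return (int(fields[0]),), np.array(fields[1:], dtype="float64")


if __name__ == "__main__":
    # hypothetical pyspark usage; `sc` is an existing SparkContext
    file1 = sc.textFile("data.csv").map(parse_line)
    # force one record early to eyeball the format before converting
    print(file1.first())
```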
Davis Bennett
@d-v-b
Dec 02 2016 14:45
I think td.series.fromrdd(my_rdd) makes some assumptions about my_rdd, and your data might be violating those assumptions. Since computation in spark is lazy, you can have a problem very early in your code (like when you define your rdd or series object) that doesn't show up until you force a computation later on (like when you try to get values out with toarray()).
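One way to surface those problems early is to validate records and force a computation right after building the rdd. This sketch assumes (my guess, not something fromrdd documents here) that records should be (tuple key, 1-d numeric array) pairs:

```python
import numpy as np


def check_record(record):
    """Eagerly validate one (key, value) record against the shape
    td.series.fromrdd is assumed to expect: a tuple key paired
    with a 1-d numeric array value."""
    key, value = record
    if not isinstance(key, tuple):
        raise TypeError("key should be a tuple, got %r" % (key,))
    value = np.asarray(value, dtype="float64")
    if value.ndim != 1:
        raise ValueError("value should be 1-d, got shape %s" % (value.shape,))
    return key, value


if __name__ == "__main__":
    import thunder as td

    # `my_rdd` is your existing rdd of records; validate every record
    checked = my_rdd.map(check_record)
    # force a computation now, so a bad record fails here instead of
    # much later when you call toarray()
    checked.first()

    series = td.series.fromrdd(checked)
```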