These are chat archives for thunder-project/thunder

21st
Jan 2015
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:10
Here is the general inspiration for the kind of functionality that I think would be good to add to a TimeSeries that has a hierarchically organized index: http://pandas.pydata.org/pandas-docs/stable/groupby.html
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:27
ok great, so i like the idea of adding groupBy functionality on a Series, but maybe not a groupBy method by itself, because the output of that is not a Series
so for example, this kind of functionality:
s = Series([1, 2, 3, 10, 20, 30], [1, 2, 3, 1, 2, 3])
s.groupBy(level=0).sum()
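For reference, a minimal sketch of the same idea in pandas itself (where the method is spelled `groupby`, and the values and index are the ones from the message above). It shows why a bare group-by step is awkward as a Series method: the intermediate result is a group-by object, and only the aggregation turns it back into a Series.

```python
import pandas as pd

# Values with a non-unique index, as in the message above.
s = pd.Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])

# Splitting alone yields an intermediate SeriesGroupBy object, not a Series.
grouped = s.groupby(level=0)

# Applying an aggregation combines the groups back into a Series.
summed = grouped.sum()
print(summed.tolist())  # [11, 22, 33]
```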
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:31

Yeah. I like the idea on the pandas page about breaking it down into steps: Split, Apply, Combine. So I thought maybe even the function to be applied could be passed as an argument.

Jeremy Freeman
@freeman-lab
Jan 21 2015 00:32
right, so i would not want an actual groupBy method on a Series
but if there was a way to provide a few methods like seriesMeanByGroup or seriesCountByGroup
actually, whoa, i really like that
we already have seriesMean, seriesCount, etc
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:33
something along the lines of
s = Series(...)
s.grouping(levels=('trials'), apply=add)
Yeah, I'm suggesting something similar I think: just have one function, and let the user select what function (if any) to apply to the selected groups
And then we could write convenience functions like seriesMeanByGroup
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:34
sure, so we could do seriesAggregateByGroup as the primary function
yup, bingo
i'm really liking that, it wouldn't be the full functionality of pandas' groupBy, and it wouldn't allow all the dataframe specific stuff (like grouping on columns)
but it would handle either single index or multi-index
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:35
Yeah, the dataframe stuff is definitely out of scope
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:35
and always return the aggregate, so it would always return a valid Series
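A rough sketch of the "one primary function plus convenience wrappers" idea discussed above, written on plain Python lists rather than on Thunder's Series. The name `aggregate_by_group` and its signature are hypothetical stand-ins for the proposed seriesAggregateByGroup, not an existing API.

```python
from collections import defaultdict

def aggregate_by_group(values, index, func):
    """Hypothetical helper: split values by index label, apply func, combine."""
    groups = defaultdict(list)
    for label, value in zip(index, values):
        groups[label].append(value)                     # split
    labels = sorted(groups)
    return labels, [func(groups[l]) for l in labels]    # apply, then combine

def mean(xs):
    return sum(xs) / len(xs)

values = [1, 2, 3, 10, 20, 30]
index = [1, 2, 3, 1, 2, 3]

# The convenience variants are just the primary function with a fixed func.
labels, sums = aggregate_by_group(values, index, sum)
_, means = aggregate_by_group(values, index, mean)
_, counts = aggregate_by_group(values, index, len)
```

Because each group collapses to a single value, the result always has one entry per distinct label, which is what lets it be wrapped back into a valid Series.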
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:35
yeah
What would you think about some kind of flat-map-esque function for something like Series.selectByGroup()?
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:37
hm, so in this case, s = Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])
it would just return another Series with 1, 10, 2, 20, 3, 30
so it's basically just a sort
right?
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:38
I was thinking more along the lines of the following
s = Series([1,2,3,4,11,12,13,14], index = [(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(2,3),(2,4)] )
and then wanting to pull out all things with a "second" index of 2 or 3
which would return a series with two records: Series([2,3]) and Series([12,13])
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:41
so it can't return a series of a series
if it just returned 2,3,12,13
that would be fine
and useful?
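A small sketch of the flat selection described above, again on plain lists and tuples. `select_by_group`, `level`, and `wanted` are hypothetical names used only for illustration; the data is the two-level example from the earlier message.

```python
def select_by_group(values, index, level, wanted):
    """Hypothetical helper: keep entries whose index at `level` is in `wanted`."""
    wanted = set(wanted)
    kept = [(idx, v) for idx, v in zip(index, values) if idx[level] in wanted]
    new_values = [v for _, v in kept]
    new_index = [idx for idx, _ in kept]
    return new_values, new_index

values = [1, 2, 3, 4, 11, 12, 13, 14]
index = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (2, 4)]

# Select everything whose second index (level 1) is 2 or 3.
selected, selected_index = select_by_group(values, index, level=1, wanted=[2, 3])
print(selected)  # [2, 3, 12, 13]
```

The result is a single flat sequence with its surviving index entries, rather than a series of series.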
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:41
ah, I see
Yeah, I think so
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:41
yeah that i'm down with as well
so why don't we first try to get a version of seriesAggregateByGroup and selectByGroup with simple linear indices
and see how it looks
then consider the multi-level case
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:43
Don't we need the multilevel case from the get-go?
Otherwise the indices will not uniquely specify the elements in the series
Or do we even ask for / enforce that?
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:44
hm, I was thinking of this example from the pandas page (reformulated with the proposed API):
s = Series([1, 2, 3, 10, 20, 30], [1, 2, 3, 1, 2, 3])
s.seriesSumByGroup(level=0) -> Series([11, 22, 33])
yeah so that's handling non-unique indices
that alone can actually do a lot of basic things (trial-averaging, etc.)
but sure, i guess might as well do multilevel from the get-go
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:46
Yeah, at multiple points in their documentation, they mention that indices should be unique. But then they seem to constantly break that rule to do little hacks
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:46
looks like it was added recently
to allow non-unique ones
which seems reasonable
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:46
Ah, maybe the docs just haven't caught up yet :smile:
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:46
to do, say, trial-averaging you shouldn't necessarily need to specify both the trial ids and time within trial
Jason Wittenbach
@jwittenbach
Jan 21 2015 00:47
yeah, definitely not
and if it's a timeseries, you can always infer the time
Jeremy Freeman
@freeman-lab
Jan 21 2015 00:47
yup, agreed, for now we'll do it on Series, TimeSeries may end up a special case
good point
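One way to picture the "infer the time" point, as a sketch on plain lists: if the samples are in temporal order and only trial ids are given, the time within each trial can be recovered by counting, and trial-averaging is then a group-by-mean on that inferred time. Both function names are hypothetical, and the temporal ordering is an assumption.

```python
from collections import defaultdict

def infer_time_in_trial(trial_ids):
    """Assumes samples are in temporal order; returns position within each trial."""
    seen = defaultdict(int)
    times = []
    for trial in trial_ids:
        times.append(seen[trial])
        seen[trial] += 1
    return times

def trial_average(values, trial_ids):
    """Average across trials at each inferred within-trial time point."""
    times = infer_time_in_trial(trial_ids)
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[t].append(v)
    return [sum(groups[t]) / len(groups[t]) for t in sorted(groups)]

values = [1, 2, 3, 10, 20, 30]
trial_ids = [1, 1, 1, 2, 2, 2]

times = infer_time_in_trial(trial_ids)
averaged = trial_average(values, trial_ids)
print(times)     # [0, 1, 2, 0, 1, 2]
print(averaged)  # [5.5, 11.0, 16.5]
```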
Jason Wittenbach
@jwittenbach
Jan 21 2015 23:59
test