These are chat archives for thunder-project/thunder

16th
Jun 2015
Daniel Goodwin
@dgoodwin208
Jun 16 2015 16:18
Could I please get some help with efficient methods for choosing subsets of rows of data? That is, if I have a Series object with count()=20000, is there an obvious way to treat it like two submatrices and do separate operations on each?
Jason Wittenbach
@jwittenbach
Jun 16 2015 16:35

@dgoodwin208 By 'rows' I'm assuming you mean 'records'? Each record is indexed by its key, so I can think of two options:
(1): if you don't need to put the Series object back together again after breaking it into the two pieces, you could do two calls to Series.filterOnKeys

subseries1 = series.filterOnKeys(lambda k: criterion(k))
subseries2 = series.filterOnKeys(lambda k: not criterion(k))

Then you could do separate operations on each subset
(2): if you want to keep the Series object together, then you could do a single call to Series.apply using a function that applies different operations depending on the keys:

def f(k, v):
    if criterion(k):
        vNew = ...
    else:
        vNew = ...
    return (k, vNew)
series.apply(f)

In both cases criterion is a function that takes a key and returns a bool indicating which subset the record associated with that key belongs to.
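The two options above can be sketched in plain Python on a list of (key, value) records, with no Spark or Thunder dependency. The `records` list, the even/odd `criterion`, and the doubling and negating operations are all made up for illustration. List comprehensions stand in for `filterOnKeys` and `apply`, so this shows the pattern rather than Thunder's actual implementation.

```python
# Plain-Python stand-in for a keyed collection; this is not Thunder's API.
records = [(k, [k, k + 1]) for k in range(6)]  # hypothetical (key, value) records

def criterion(k):
    # Hypothetical rule: even keys belong to the first subset.
    return k % 2 == 0

# Option 1: two filters, producing two separate collections.
subset1 = [(k, v) for k, v in records if criterion(k)]
subset2 = [(k, v) for k, v in records if not criterion(k)]

# Option 2: one pass that branches on the key and keeps all records together.
def f(k, v):
    if criterion(k):
        v_new = [2 * x for x in v]  # placeholder operation for the first subset
    else:
        v_new = [-x for x in v]     # placeholder operation for the second subset
    return (k, v_new)

combined = [f(k, v) for k, v in records]
```

With option 1 the two subsets can be processed independently; with option 2 the result stays a single collection with the original keys.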

Daniel Goodwin
@dgoodwin208
Jun 16 2015 16:43
@jwittenbach this is an awesome response, huge thanks. I was just beginning to explore filterOnKeys(), but had not considered the use of the apply() function. I wonder which will be more efficient for block matrix operations on a single large Series object: apply() would be run across each record but filterOnKeys might duplicate objects in memory ... I'll report back what I find. Again, really appreciate the pointers.
Jason Wittenbach
@jwittenbach
Jun 16 2015 16:49
My hunch is that the apply option might be a little quicker, since the filter option would involve filtering twice to get what is effectively the complement of records that the first filter already found, which would be a bit redundant. I'm interested to hear what you find!
Jeremy Freeman
@freeman-lab
Jun 16 2015 17:26
@dgoodwin208 @jwittenbach great thoughts, I agree that the apply route might be the most straightforward, though there shouldn't be any duplication with the filtering version unless I'm missing something. might be helpful to know what comes next, you want to treat groups of rows as submatrices so that you can apply different operations to the two groups? what is that operation?
Daniel Goodwin
@dgoodwin208
Jun 16 2015 17:35
I'm implementing Bi-Cross-Validation for NMF, which requires operations on 4 submatrices of the original target matrix, A. So the downstream operations after taking (randomized) submatrices are calculating NMF on one of them, then calculating a cost function using the Frobenius norm and pseudoinverses on two others ... I probably won't be able to clearly articulate this in a short paragraph, it is section 5.1 in the link
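As a rough sketch of the partitioning step only: bi-cross-validation holds out a random set of rows and a random set of columns, which splits the matrix into four blocks. The helper name `bcv_split`, the block labels, and the toy matrix are assumptions made for illustration. The NMF fit and the pseudoinverse-based cost are deliberately left out.

```python
import random

def bcv_split(matrix, held_rows, held_cols):
    """Split a list-of-lists matrix into four blocks by held-out rows/columns."""
    held_rows, held_cols = set(held_rows), set(held_cols)
    n_cols = len(matrix[0])
    in_cols = [j for j in range(n_cols) if j in held_cols]
    out_cols = [j for j in range(n_cols) if j not in held_cols]

    def block(row_is_held, cols):
        # Keep rows whose held-out status matches, then pick the given columns.
        return [[row[j] for j in cols]
                for i, row in enumerate(matrix)
                if (i in held_rows) == row_is_held]

    return {
        "held_held": block(True, in_cols),    # held-out rows, held-out columns
        "held_kept": block(True, out_cols),   # held-out rows, retained columns
        "kept_held": block(False, in_cols),   # retained rows, held-out columns
        "kept_kept": block(False, out_cols),  # retained rows, retained columns
    }

A = [[10 * i + j for j in range(4)] for i in range(4)]  # toy 4x4 matrix
rng = random.Random(0)
rows = rng.sample(range(4), 2)  # randomly chosen held-out rows
cols = rng.sample(range(4), 2)  # randomly chosen held-out columns
blocks = bcv_split(A, rows, cols)
```

If each record of a Series holds one row, the row split would presumably map onto filtering by key and the column split onto slicing each record's value, though that mapping is an assumption here.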
andrew giessel
@andrewgiessel
Jun 16 2015 17:45
@freeman-lab @dgoodwin208 @jwittenbach I think double filtering will only be 2x slower than apply. pretty sure filtering and apply are both O(n) because you have to look at each record. So, slight advantage on apply.
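The factor of two can be checked by counting how often the key test runs. This is a minimal plain-Python sketch with a hypothetical `criterion` and 20 made-up keys. It counts predicate evaluations only and says nothing about actual Spark runtime, which also depends on caching and scheduling.

```python
calls = {"n": 0}

def criterion(k):
    # Hypothetical key test that also counts how often it is evaluated.
    calls["n"] += 1
    return k < 10

keys = list(range(20))  # stand-in for 20 record keys

# Two filters: every key is tested once per filter.
first = [k for k in keys if criterion(k)]
second = [k for k in keys if not criterion(k)]
double_filter_calls = calls["n"]

# Single pass: every key is tested exactly once.
calls["n"] = 0
labels = [(k, criterion(k)) for k in keys]
single_pass_calls = calls["n"]
```

Both approaches grow linearly with the number of records; the double filter just does that linear scan twice.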