These are chat archives for thunder-project/thunder

16th
Apr 2015
Richard A Hofer
@rhofour
Apr 16 2015 01:37
@freeman-lab Sure thing, I'll probably do that tomorrow. Just getting familiar with the code now. Planning to add an inverse method to RowMatrix.
Alex Williams
@ahwillia
Apr 16 2015 06:32
@freeman-lab Sorry, what is the difference between start and end padding? Using your code I get various combinations of [10,10], [10,0], [0,10] for the start/end padding in my example with padding=10. My understanding is that we want this property to always be [10,10] (for this 2D example)
Jeremy Freeman
@freeman-lab
Apr 16 2015 15:40
@ahwillia great checking, so this has to do with the boundary conditions
for any given block, it will be padded only if there is space
so the upper-leftmost block is only padded below and to the right
thus yielding values of [0,0] and [10,10]
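A minimal sketch of the boundary behavior described above, not thunder's actual implementation. The function name `block_padding` and its arguments are hypothetical, and it assumes the same padding is requested on every axis and is clipped wherever a block touches the edge of the array:

```python
def block_padding(start, stop, dims, padding):
    """Return (start_pad, end_pad) for one block, clipped at the array edges.

    start, stop: per-axis bounds of the unpadded block (stop is exclusive)
    dims:        shape of the full array
    padding:     requested padding on each side, same for every axis
    All names here are hypothetical and chosen for illustration only.
    """
    # Before the block there are `s` elements available on each axis.
    start_pad = [min(padding, s) for s in start]
    # After the block there are `d - e` elements available on each axis.
    end_pad = [min(padding, d - e) for e, d in zip(stop, dims)]
    return start_pad, end_pad


# A 100 x 100 array split into four 50 x 50 blocks, padding=10.
upper_left = block_padding((0, 0), (50, 50), (100, 100), 10)
upper_right = block_padding((0, 50), (50, 100), (100, 100), 10)
```

Under these assumptions the upper-left block gets start padding [0, 0] and end padding [10, 10], while the upper-right block gets [0, 10] and [10, 0], which matches the mix of values reported in the question.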
tomsains
@tomsains
Apr 16 2015 15:49
Hey, I just wanted to know if anyone had managed to install thunder successfully on windows?
if so have you got any advice on how to do it?
Jeremy Freeman
@freeman-lab
Apr 16 2015 15:50
what is the error that you are getting, and at which step?
i'm pretty sure that @GrantRVD was having issues but it was due to the Python version (3 instead of 2), and after that was able to get it to work
tomsains
@tomsains
Apr 16 2015 15:51
pip install works correctly
but running commands such as "which thunder" returns nothing
Richard A Hofer
@rhofour
Apr 16 2015 15:52
"which" is a Unix command, I think; not sure if there's a Windows analog
Jeremy Freeman
@freeman-lab
Apr 16 2015 15:53
yup, that's an executable that gets automatically included during installation and added to your path
if there is an analog on Windows, it may require a tweak to our setup.py
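For reference, the Windows command prompt has a `where` command that plays a similar role to `which`. A platform-independent check can also be sketched in Python, assuming Python 3.3 or later, where `shutil.which` is available (it is not in Python 2, which this discussion targets). The helper name `find_on_path` is hypothetical:

```python
import shutil


def find_on_path(name):
    """Return the full path of an executable on PATH, or None if not found.

    Thin wrapper around shutil.which (Python 3.3+); the wrapper name is
    hypothetical and only here for illustration.
    """
    return shutil.which(name)


# None here would mean the "thunder" launcher script is not on PATH.
location = find_on_path("thunder")
```

A result of `None` would suggest that the package installed but its launcher script was not placed on the path.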
tomsains
@tomsains
Apr 16 2015 15:56
in terms of a command line interface for windows what would you recommend using - cygwin?
tomsains
@tomsains
Apr 16 2015 16:23
ah don't worry we will install ubuntu, it will just be easier
Jeremy Freeman
@freeman-lab
Apr 16 2015 16:37
cygwin should work fine
it should be easy to skip the executable. Can you launch pyspark and then call:
from thunder import ThunderContext
tsc = ThunderContext(sc)
or are you unable to launch pyspark?
Richard A Hofer
@rhofour
Apr 16 2015 19:19
@freeman-lab Is there a reason mapPartitions is used when map can do exactly the same thing? I can't tell if mapPartitions is actually more efficient there.
Richard A Hofer
@rhofour
Apr 16 2015 20:59
@freeman-lab in datasets.py appendKeys has a note that it should be eliminated. What's the plan for this? I changed the keys from 3D to 1D, but if eliminating appendKeys isn't too bad then I could also do that
Jeremy Freeman
@freeman-lab
Apr 16 2015 21:49
@rhofour :point_up: April 16 2015 4:59 PM ah, pretty certain that note -- admittedly unclear! -- was just that I'd like that to be a method on DataSets rather than a standalone function
if that makes sense to you, awesome to include that in your PR
Richard A Hofer
@rhofour
Apr 16 2015 21:50
Think I get what you mean. I'll take a look at that right now.
Jeremy Freeman
@freeman-lab
Apr 16 2015 21:56
And @rhofour :point_up: April 16 2015 3:19 PM which usage are you referring to? mapPartitions can be more efficient, for example, if there is some operation that needs to happen once per partition as opposed to once per record, but entirely possible it's used somewhere unnecessarily =)
Richard A Hofer
@rhofour
Apr 16 2015 21:57
Guess I was pretty ambiguous there. I was referring to its use in RowMatrix.times
matrixSumIterator_other could be completely removed and replaced with a small lambda function if map was used instead
(I did this while trying to figure out exactly how times worked)
Jeremy Freeman
@freeman-lab
Apr 16 2015 22:00
Ah, so those give the same output, but the version with mapPartitions does an initial aggregation on each partition before doing the final aggregation in the sum (i.e. the reduce), which can potentially improve performance if the intermediate matrices are large
I wouldn't expect there to be a noticeable difference in local testing, but probably on larger matrices when running on a cluster
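The difference can be illustrated without Spark. The sketch below is a plain-Python simulation, not the code in RowMatrix.times: it assumes each record is a pair of rows that contributes one outer product to the final sum, and it models partitions as nested lists. All names are hypothetical.

```python
from functools import reduce


def outer(a, b):
    """Outer product of two vectors as a nested list."""
    return [[x * y for y in b] for x in a]


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Two simulated partitions of (row_of_A, row_of_B) records.
partitions = [
    [([1, 2], [1, 0]), ([3, 4], [0, 1])],
    [([5, 6], [1, 1])],
]

# map-style: one intermediate matrix per record goes into the final sum.
per_record = [outer(a, b) for part in partitions for a, b in part]
result_map = reduce(add, per_record)


# mapPartitions-style: aggregate inside each partition first, so only
# one intermediate matrix per partition goes into the final sum.
def sum_partition(records):
    yield reduce(add, [outer(a, b) for a, b in records])


per_partition = [m for part in partitions for m in sum_partition(part)]
result_map_partitions = reduce(add, per_partition)
```

Both results are the same matrix, but the final aggregation combines three matrices in the first version and two in the second. With many records per partition and large intermediate matrices, that gap is where any saving would come from.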
Richard A Hofer
@rhofour
Apr 16 2015 22:02
Ah, that makes sense
^ I'll get right on those!
Jeremy Freeman
@freeman-lab
Apr 16 2015 22:04
btw, very cool you're interested in adding matrix inversion / factor analysis! you may also want to talk to @j-friedrich who is interested in factor analysis
Richard A Hofer
@rhofour
Apr 16 2015 22:07
I'm really excited to write some code that could actually be used by researchers
Richard A Hofer
@rhofour
Apr 16 2015 23:36
@freeman-lab Made the changes you suggested.