These are chat archives for thunder-project/thunder

13th
Mar 2015
Uri Laserson
@laserson
Mar 13 2015 00:35
anyone around?
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:35
yup!
Uri Laserson
@laserson
Mar 13 2015 00:36
hey!
so, i'm trying to use thunder on a quick-and-dirty project but am having a touch of trouble
(actually, mostly spark trouble up to now)
but now thunder trouble methinks
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:36
cool! happy to try and help
Uri Laserson
@laserson
Mar 13 2015 00:37
i'm manually creating a Series object by wrangling some text data into the appropriate shape
when I run, for example, data.seriesStddev().pack()
I get stuff like
15/03/12 17:33:12 INFO scheduler.TaskSetManager: Lost task 3.2 in stage 51.0 (TID 2932) on executor bottou04-10g.pa.cloudera.com: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/data/4/yarn/nm/filecache/1326/spark-assembly-1.2.1-hadoop2.4.0.jar/pyspark/worker.py", line 90, in main
    command = pickleSer._read_with_length(infile)
  File "/data/4/yarn/nm/filecache/1326/spark-assembly-1.2.1-hadoop2.4.0.jar/pyspark/serializers.py", line 151, in _read_with_length
    return self.loads(obj)
  File "/data/4/yarn/nm/filecache/1326/spark-assembly-1.2.1-hadoop2.4.0.jar/pyspark/serializers.py", line 400, in loads
    return cPickle.loads(obj)
ImportError: No module named thunder.rdds.keys
)
I'm running on a CDH cluster through ipython notebook using yarn
thunder 0.4.1
spark 1.2.1
Perhaps this is because I am manually instantiating a Series object? And there is some assumption on the input data that I have failed to meet?
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:41
hmmm that's what i was thinking, except that the error is a failure to import
Uri Laserson
@laserson
Mar 13 2015 00:41
my keys are tuples of integers
if i just run data.seriesStddev(), it works fine
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:41
as a sanity check, if you just try to import thunder.rdds.keys from the shell, does it work?
Uri Laserson
@laserson
Mar 13 2015 00:42
yes
i am also manually instantiating a ThunderContext
do I have to also manually add thunder to pyfiles somehow?
cuz thunder is definitely not installed on the worker nodes
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:43
ah, yup, so i think it's that the workers are failing to import the module
yup, cause it's not installed
the easiest way is to build an egg file using setup.py bdist_egg or use the build executable
Uri Laserson
@laserson
Mar 13 2015 00:44
so the ThunderContext does not add thunder to the sc.pyfiles or whatever it's called?
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:44
and then do sc.addPyFile with the egg
yeah... we do it in the thunder and thunder-submit executables
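A minimal sketch of that workaround, assuming a checkout of the thunder repo and that ThunderContext wraps an existing SparkContext; the egg filename below is illustrative and depends on your version and python.

    # Build the egg from a checkout of the thunder repo (shell):
    #   python setup.py bdist_egg
    # Then ship it to the workers before doing any distributed work.
    from pyspark import SparkContext
    from thunder import ThunderContext

    sc = SparkContext(appName="thunder")
    sc.addPyFile("dist/thunder_python-0.4.1-py2.7.egg")  # illustrative filename
    tsc = ThunderContext(sc)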
Uri Laserson
@laserson
Mar 13 2015 00:45
so, if i'm just on my cluster, i pip installed thunder-python
i then also need to check out the thunder repo to build an egg and distribute it?
i wonder if there is a way to eliminate that step
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:46
yup, currently the only solution would be to check out the repo and build
we should definitely eliminate this, great idea
one option would be to distribute the pre-built egg with the library so it's included when you pip install
Uri Laserson
@laserson
Mar 13 2015 00:47
Yes, in my mind, this is something the ThunderContext constructor should do
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:47
yup, and do the sc.addPyFile in the context creation step
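A rough sketch of what that could look like, assuming a pre-built egg is shipped inside the installed package; the paths and helper name here are hypothetical, not current thunder API.

    import os
    from pyspark import SparkContext

    # Hypothetical sketch: push a bundled egg to the workers as part of
    # context creation, so users never have to build or ship it themselves.
    def create_context(master="local", appName="thunder"):
        sc = SparkContext(master=master, appName=appName)
        egg = os.path.join(os.path.dirname(__file__), "lib", "thunder.egg")
        if os.path.exists(egg):
            sc.addPyFile(egg)  # workers can now import thunder.* modules
        return sc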
Uri Laserson
@laserson
Mar 13 2015 00:47
exactly
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:48
i haven't seen many python libraries that include a pre-built egg, but it does seem warranted here!
Uri Laserson
@laserson
Mar 13 2015 00:48
i wonder if there are other ways to handle this
if you pip install thunder, then all the info you need should be available to package up
i'm not sure which solution i dislike more
something feels weird about packaging an egg inside its own package
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:49
yes, mildly recursive =)
Uri Laserson
@laserson
Mar 13 2015 00:49
but building the egg from whatever gets placed in the site-packages directory can also be nasty
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:50
we are bundling a jar file inside python/lib
Uri Laserson
@laserson
Mar 13 2015 00:50
that's true
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:50
it can just go there, actually doesn't seem so bad
ok, awesome, let's do this starting with 0.5 (cutting a candidate v soon!)
Uri Laserson
@laserson
Mar 13 2015 00:51
i hadn't looked at the code in a bit, but i love how you've dropped a ton of the scala code
or at least handle it in a way that seemed to work much more smoothly for me this time
Jeremy Freeman
@freeman-lab
Mar 13 2015 00:51
cool! yeah some of that was actually mainly for streaming, which is now in its own repo
so it's almost pure python, except for one bit of IO functionality
Felix Cheung
@felixcheung
Mar 13 2015 00:57
I think generally if you are running python on Spark in a cluster, any package will need to be installed on each worker/data node separately..
Uri Laserson
@laserson
Mar 13 2015 00:58
Unless you can distribute it through sc.addPyFile
Jeremy Freeman
@freeman-lab
Mar 13 2015 01:02
yup, in many cases you can just distribute it, especially for light-weight packages... more complex libraries with e.g. C dependencies like numpy do need to be installed on the workers
Uri Laserson
@laserson
Mar 13 2015 01:06
is it even possible to distribute binaries derived from C libraries as long as you're sure the environment is uniform across the cluster?
Jeremy Freeman
@freeman-lab
Mar 13 2015 01:10
not sure! we haven't tried it for anything yet, but @shababo wants to
Uri Laserson
@laserson
Mar 13 2015 01:22
so the key of my RDD that I use to build the Series object is a pair of integers. however, not all pairs are represented, because some had absolutely no data (so they should have an array of zeros)
any thoughts on how to load that into Series?
i need to add the extra zeros
now when I try to run pack, it raises an error bc the number of elements in the resulting aggregate doesn't match the expected number for the dimensions of the "image"
Jeremy Freeman
@freeman-lab
Mar 13 2015 01:24
so currently you can construct a series object with sparse, non-contiguous keys fine, but a couple methods will fail (pack is the main one)
rather than add rows, a better solution would be a sparse variant of pack that would collect the keys and values and poke entries into the appropriate locations of an array
something like packSparse
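A rough sketch of what such a packSparse could look like (hypothetical, not in 0.4.1): start from a zero-filled array and fill in only the keys that exist; dims is the target image shape and each record is assumed to be ((i, j), values).

    import numpy as np

    def pack_sparse(series_rdd, dims, nrecords):
        # zero-filled target: missing keys simply stay as arrays of zeros
        result = np.zeros(tuple(dims) + (nrecords,))
        for key, values in series_rdd.collect():
            result[key] = values  # poke each record in at its key's location
        return result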
Uri Laserson
@laserson
Mar 13 2015 01:34
yes, +1 on that
Jeremy Freeman
@freeman-lab
Mar 13 2015 05:49
@kunallillaney thanks for the PR! I'm looking into the error and will report back, it may actually be a problem on the travis side (that error you reference is actually us checking that the correct error is thrown)
@tomsains i've actually wanted to support a direct path from images to a movie with some temporal filtering, ideally something just as efficient. there is support for it now using imgs = data.toBlocks().toImages() where data is a Series object, but performance isn't great, and i think there may be a more efficient approach
still, maybe worth trying
Kunal Lillaney
@kunallillaney
Mar 13 2015 16:11
@freeman-lab Cool. Thanks, I mentioned it because that was the only grepable error in the entire Travis log. Let me know if there is any action needed on my part.
Noah Young
@npyoung
Mar 13 2015 20:13
Anyone having luck with NMF? Curious if I'm the only one running into #129 .
Jeremy Freeman
@freeman-lab
Mar 13 2015 21:09
@npyoung I'm having trouble reproducing that issue, at least on a local machine I followed exactly your steps and didn't get it (and at least some of us have been running NMF pretty routinely on real data)
will follow up over on the issue, might be something about versions of python components
Noah Young
@npyoung
Mar 13 2015 21:21
@tomsains to make a dF/F movie you'll need to present each frame to your encoder one by one on a single machine at some point, so calling Series.applyValues(lambda v: v[time]).pack() in a for loop over time is actually a reasonable solution and saves you a costly shuffle.
@freeman-lab thanks for taking a look. Are you using python2.6 and older numpy/scipy for your tests?
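A sketch of the frame-by-frame approach @npyoung describes above; series, ntimepoints, and write_frame are placeholders for your own data and encoder, and the default-argument trick avoids the usual late-binding issue with lambdas in a loop.

    for t in range(ntimepoints):
        # pull out a single time point across all records, pack it into
        # a local array, and hand it to the encoder one frame at a time
        frame = series.applyValues(lambda v, t=t: v[t]).pack()
        write_frame(frame)  # placeholder for your movie encoder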
Jeremy Freeman
@freeman-lab
Mar 13 2015 21:44
just posted on the issue, looks like it's about Spark versions, not python (at least locally I'm testing with py2.7)
Jeremy Freeman
@freeman-lab
Mar 13 2015 21:50
update: it runs fine with Spark 1.2 and Spark 1.3, but is broken in Spark 1.2.1
(Spark 1.3 was literally just released, but we're planning to support it officially very soon)
Noah Young
@npyoung
Mar 13 2015 21:56
Awesome. If I get Spark 1.3 and point my SPARK_HOME to it locally, will the remotes get updated when I call thunder-ec2 start or do I need to manually update it on the cluster?
Uri Laserson
@laserson
Mar 13 2015 21:59
Is there an inverse to pack()?
Jeremy Freeman
@freeman-lab
Mar 13 2015 22:19
@npyoung generally yes, this one requires a couple tweaks to the ec2 launch script that I'm doing / testing now, should be ready this evening
@laserson do you mean an operation that would take a local array (like that generated by pack) as an input and generate a series object, filling in the keys automatically?
the closest thing is the ThunderContext.loadSeriesLocal method, which creates a Series from an array stored in either an npy or mat file
Jeremy Freeman
@freeman-lab
Mar 13 2015 22:24
but maybe we can just modify that so the input can also be an array (rather than a file name)
how would you want the keys to be populated? contiguous array indices if not specified otherwise?
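A rough sketch of that array-input variant, assuming the last axis holds the series values and keys become contiguous indices over the remaining axes; the helper name is hypothetical and the result would still need wrapping in a Series.

    import itertools
    import numpy as np

    def series_records_from_array(sc, arr):
        # keys are contiguous indices over all but the last axis
        keys = itertools.product(*[range(d) for d in arr.shape[:-1]])
        records = [(k, np.asarray(arr[k], dtype="float64")) for k in keys]
        return sc.parallelize(records)  # wrap in Series(...) in real code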