These are chat archives for thunder-project/thunder

28th Sep 2015
Jeremy Freeman
@freeman-lab
Sep 28 2015 00:00
but we should be able to figure it out
alexandrelaborde
@AlexandreLaborde
Sep 28 2015 00:11
@freeman-lab thanks! That seems to work. I am indeed trying to deploy thunder at our local cluster, but this is just thunder running ONLY on my computer. I have 64-bit Ubuntu, 5 GB RAM, Spark 1.4.1 with Hadoop 2.6, and thunder 0.5.1. I just open the terminal, run thunder, do data = tsc.loadExample('fish-series'), and this is the output.
data = tsc.loadExample('fish-series')
15/09/28 01:05:14 INFO MemoryStore: ensureFreeSpace(256888) called with curMem=480755, maxMem=278302556
15/09/28 01:05:14 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 250.9 KB, free 264.7 MB)
15/09/28 01:05:14 INFO MemoryStore: ensureFreeSpace(18231) called with curMem=737643, maxMem=278302556
15/09/28 01:05:14 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 17.8 KB, free 264.7 MB)
15/09/28 01:05:14 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46404 (size: 17.8 KB, free: 265.4 MB)
15/09/28 01:05:14 INFO SparkContext: Created broadcast 3 from newAPIHadoopFile at PythonRDD.scala:522
15/09/28 01:05:14 INFO MemoryStore: ensureFreeSpace(256496) called with curMem=755874, maxMem=278302556
15/09/28 01:05:14 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 250.5 KB, free 264.4 MB)
15/09/28 01:05:14 INFO MemoryStore: ensureFreeSpace(18269) called with curMem=1012370, maxMem=278302556
15/09/28 01:05:14 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 17.8 KB, free 264.4 MB)
15/09/28 01:05:14 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:46404 (size: 17.8 KB, free: 265.3 MB)
15/09/28 01:05:14 INFO SparkContext: Created broadcast 4 from broadcast at PythonRDD.scala:479
15/09/28 01:05:14 INFO FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/anaconda/lib/python2.7/site-packages/thunder/utils/context.py", line 583, in loadExample
return self.loadSeries(tmpdir).astype('float')
File "/home/user/anaconda/lib/python2.7/site-packages/thunder/utils/context.py", line 96, in loadSeries
keyType=keyType, valueType=valueType)
File "/home/user/anaconda/lib/python2.7/site-packages/thunder/rdds/fileio/seriesloader.py", line 212, in fromBinary
conf={'recordLength': str(recordSize)})
File "/home/user/Downloads/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 574, in newAPIHadoopFile
jconf, batchSize)
File "/home/user/Downloads/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/javagateway.py", line 538, in _call
File "/home/user/Downloads/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py";, line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at thunder.util.io.hadoop.FixedLengthBinaryInputFormat$.getRecordLength(FixedLengthBinaryInputFormat.scala:32)
at thunder.util.io.hadoop.FixedLengthBinaryInputFormat.isSplitable(FixedLengthBinaryInputFormat.scala:73)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:397)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1255)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.take(RDD.scala:1250)
at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopF
I can load the fish-images example just fine...but that’s pretty much the only one.
Jeremy Freeman
@freeman-lab
Sep 28 2015 00:19
ah! the issue might be the hadoop version! when you download spark, try getting the one for hadoop 1.x
see last entry in basic usage here http://thunder-project.org/thunder/docs/faq.html
this was fixed recently on master but i need to push a new version to pypi (working on it)
more recently we've dropped scala dependencies entirely, so this should never be a problem moving forward
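A quick way to check which Hadoop version a Spark build ships with, assuming a pyspark or thunder shell where sc is already defined (VersionInfo is a standard Hadoop API, reached here through py4j):
# ask the JVM for the version of the Hadoop client bundled with Spark
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
# e.g. "1.0.4" for a Hadoop 1.x build, "2.6.0" for a Hadoop 2.x build
The IncompatibleClassChangeError above ("Found interface ... but class was expected") is the classic symptom of code compiled against Hadoop 1 running on Hadoop 2 jars, where JobContext changed from a class to an interface.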
alexandrelaborde
@AlexandreLaborde
Sep 28 2015 00:26
@freeman-lab it worked!!!! finally! many many thanks!!! now I can start testing stuff
can I call these examples in thunder-submit?
Gilles Vanwalleghem
@Yassum
Sep 28 2015 00:26
@freeman-lab Well I had the issue pop up suddenly on several VMs on which I run thunder locally with IPython. It was a mix of Thunder 0.5.0 and 0.5.1. My Spark version is 1.3.0, but I also tried upgrading to 1.5.0 and I had the same issue.
As I mentioned, it was just an issue of defining the localhost in the hosts file, which is weird since it was working before...
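The fix described here usually amounts to an /etc/hosts entry that maps the loopback address, along these lines (the hostname below is a placeholder for the VM's actual hostname):
127.0.0.1   localhost
127.0.1.1   my-vm-hostname
Spark resolves the local hostname when the driver starts, so a hostname missing from /etc/hosts can break even local-mode jobs.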
Jeremy Freeman
@freeman-lab
Sep 28 2015 00:30
@AlexandreLaborde that's great! those examples should work, you should be able to use thunder-submit with a python script that runs a job, or just test from the interactive shell
@Yassum very weird! might need to know more about your VM setup, could be a funny port issue or something
but i guess if it's working, everything is ok =)
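For the thunder-submit route, a minimal job script might look like the sketch below; ThunderContext.start and seriesMean are taken from the thunder 0.5-era API, so treat the details as assumptions rather than a recipe:
from thunder import ThunderContext

# a submitted script has to create its own context; the interactive
# thunder shell makes tsc for you
tsc = ThunderContext.start(appName="example-job")

# load the same example dataset discussed above
data = tsc.loadExample('fish-series')

# a trivial job: count records, then take the temporal mean of each series
print(data.count())
print(data.seriesMean().first())
Submitting would then be something like: thunder-submit example_job.py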
alexandrelaborde
@AlexandreLaborde
Sep 28 2015 00:35
@freeman-lab I will check that today. Thanks for your help once again :)
Gilles Vanwalleghem
@Yassum
Sep 28 2015 00:35
@freeman-lab They are Ubuntu 12.04, using anaconda, thunder 0.5.0 and spark 1.3.0. The weird thing is that even on snapshots from June (that worked at the time), the issue appeared when rebuilding from them. It may be that they changed something at the network level in the cluster that forces me to now explicitly define the localhost. I don't think I'll be able to give you much more info unfortunately. Thanks for the help though
timberonce
@timberonce
Sep 28 2015 15:49
@freeman-lab How can I build thunder from the source code?
Michael Churchill
@rmchurch
Sep 28 2015 18:21
@freeman-lab To expand on @timberonce's question about building thunder from source code: how do you build for a Spark installation that is built against Hadoop 2.x? The FAQ says: "If you prefer to use a version of Spark compiled for Hadoop 2.x, then at present this requires Thunder to be rebuilt from source. (See instructions elsewhere).", but I don't see the instructions elsewhere. Is it as simple as running "python setup.py clean bdist_egg" from the root directory?
Michael Churchill
@rmchurch
Sep 28 2015 20:11
nvm, pip install git+git://github.com/thunder-project/thunder.git@master with SPARK_HOME defined seems to work for me
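One quick check that the git install is the copy actually being imported, assuming thunder exposes __version__ as its releases do:
import thunder
print(thunder.__version__)  # should report the development version from master
print(thunder.__file__)     # shows which installed copy Python picked up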