These are chat archives for thunder-project/thunder

6th Oct 2015
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 12:01
@PhC-PhD I did and it works. Also, I can do loadImages, loadSeries, and convertImagesToSeries if I use Thunder locally, but not on my cluster
@freeman-lab Do I need a local copy of each file per node if I want to convert a big image set to series for instance?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 17:39
@PhC-PhD @freeman-lab This is what I get when trying to load my images into Thunder. Apparently it's a Py4J error. Any ideas?
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-088675fc14a8> in <module>()
      1 path = '/home/user/series'
----> 2 rawdata = tsc.loadSeries(path)
      3 #rawdata = tsc.loadExample('fish-series')
      4 data = rawdata.toTimeSeries().normalize()
      5 data.cache()

/tmp/spark-6cc8b19c-fb6d-435f-aa04-5ab6160f4d9e/userFiles-1f5827a1-ac43-49ab-bede-c9de2403ac88/thunder_python-0.5.1-py2.7.egg/thunder/utils/context.py in loadSeries(self, dataPath, nkeys, nvalues, inputFormat, minPartitions, confFilename, keyType, valueType, keyPath, varName)
     94         if inputFormat.lower() == 'binary':
     95             data = loader.fromBinary(dataPath, confFilename=confFilename, nkeys=nkeys, nvalues=nvalues,
---> 96                                      keyType=keyType, valueType=valueType)
     97         elif inputFormat.lower() == 'text':
     98             if nkeys is None:

/tmp/spark-6cc8b19c-fb6d-435f-aa04-5ab6160f4d9e/userFiles-1f5827a1-ac43-49ab-bede-c9de2403ac88/thunder_python-0.5.1-py2.7.egg/thunder/rdds/fileio/seriesloader.py in fromBinary(self, dataPath, ext, confFilename, nkeys, nvalues, keyType, valueType, newDtype, casting)
    210                                          'org.apache.hadoop.io.LongWritable',
    211                                          'org.apache.hadoop.io.BytesWritable',
--> 212                                          conf={'recordLength': str(recordSize)})
    213 
    214         data = lines.map(lambda (_, v):

/home/user/Downloads/spark/python/pyspark/context.pyc in newAPIHadoopFile(self, path, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)
    572         jrdd = self._jvm.PythonRDD.newAPIHadoopFile(self._jsc, path, inputFormatClass, keyClass,
    573                                                     valueClass, keyConverter, valueConverter,
--> 574                                                     jconf, batchSize)
    575         return RDD(jrdd, self)
    576 

/home/user/Downloads/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/home/user/Downloads/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.40.11.152): java.io.FileNotFoundException: File file:/home/user/series/key01_00000-key00_00000.bin does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
    at thunder.util.io.hadoop.FixedLengthBinaryRecordReader.initialize(FixedLengthBinaryRecordReader.scala:78)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD)
Jason Wittenbach
@jwittenbach
Oct 06 2015 17:56
@AlexandreLaborde Loading the example files on a cluster can be a little tricky. These files are stored with your Thunder installation. If you know the path to your Thunder installation (let’s say you have it in a string with the name path_to_thunder), then you can try doing something like:
tsc.loadSeries(path_to_thunder + '/thunder/utils/data/fish/series')
Each of your nodes does not need its own copy of the files; however, it does need to be able to “see” the files over the network
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 17:57
@jwittenbach thanks :) but even if I run the code with some local series that I have, the result is the same
I have tried using open and it worked
how can I check whether some computer in my cluster can see some file?
I have tried SCPing one image from my PC to a node in the cluster and it worked, so I am assuming the files are accessible
Jason Wittenbach
@jwittenbach
Oct 06 2015 17:59
if you can log in directly to one of the nodes, then you can just check whether cd /home/user/series works
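A minimal way to script that visibility check from a node (the path is the one from this thread; substitute your own):

```python
import os

def describe_path(path):
    """Report whether this machine can see a directory; return a few entries if so."""
    if not os.path.isdir(path):
        return None
    return sorted(os.listdir(path))[:5]

# Path from the discussion above; adjust for your setup:
print(describe_path('/home/user/series'))
```

If this prints None on a node, that node cannot see the data at that path, regardless of what Thunder does.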
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 18:00
but in that case my node needs to have a local copy of my series
Jason Wittenbach
@jwittenbach
Oct 06 2015 18:00
well, not necessarily
if /home is actually a mounted network share, then it would work
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 18:01
ooooooohhhh
Jason Wittenbach
@jwittenbach
Oct 06 2015 18:01
that’s how we get it to work here at Janelia….the data sits on the network and a network drive is mounted on each node
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 18:02
I have to do that
Jason Wittenbach
@jwittenbach
Oct 06 2015 18:02
yeah, that’s probably the easiest route
it allows you to have a single path to the data that is valid on each node, yet you don’t need a separate copy of the data on each node
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 18:03
but what would I do if I wanted to load files that I have stored here on my computer to the cluster?
file:// ? just curious
Jason Wittenbach
@jwittenbach
Oct 06 2015 18:06
to get that to work, you would have to figure out some way for each node on the cluster to have access to the files stored on your computer…not sure what the easiest route to do that would be; it would depend on what you can/can’t do on your network
e.g. if your computer is on the same network as the cluster, then you could share the local folder containing the files, and then have each node mount that share
of course, you’d need the proper permissions to do that, as well as a way of running some script on each of the nodes to do the mounting
the easiest is if there’s some network share that each of the nodes can already see
then just copy the files to that location
and point the path to them
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 18:27
thanks! I am doing that right now to see if it works.
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 19:13
@jwittenbach I have a doubt regarding the paths. The stuff on my computer is in /home/user/img, and I have created a shared folder using sshfs from each node and mounted the drive on /home/manager/orange. Now I don’t get the Py4J errors; I just get an error saying that /home/manager/orange/img doesn’t have any tif images. But it does... I have double-checked: on each node, ls /home/manager/orange/img returns all the images that are there. Do I need to place my images in /home/manager/orange/img on my PC?
Jason Wittenbach
@jwittenbach
Oct 06 2015 20:30
@AlexandreLaborde Hmm, that’s a tough one. It sounds like the nodes have access to the files, so I’m not sure why the loading function wouldn’t be able to find them. I think that our cluster does its shares through NFS, so maybe there’s some difference with SSHFS?
from one of the nodes, can you actually manipulate the files (open, copy to the local FS, etc)?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 22:25
@jwittenbach by open do you mean just a UI open, or using the Python open function?
Jason Wittenbach
@jwittenbach
Oct 06 2015 22:27
Loading it into just Python would work. Just curious as to whether or not another program can read from that location to verify that the problem is in Thunder.
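A sketch of that Python check, to run on a node (the file name is just an example based on this thread):

```python
def readable_tiff(path):
    """Return the 2-byte TIFF byte-order mark if the file can be read,
    or None if it cannot be opened. TIFFs start with b'II' or b'MM'."""
    try:
        with open(path, 'rb') as f:
            return f.read(2)
    except IOError:
        return None

# Path from the discussion above (an assumption; substitute your own):
print(readable_tiff('/home/manager/orange/img/img1.tif'))
```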
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 22:27
@jwittenbach from my nodes I can open, move, rename...
Jason Wittenbach
@jwittenbach
Oct 06 2015 22:30
@AlexandreLaborde That’s puzzling — I’m not really sure why Thunder would have a problem with it then.
Can you post the code/error from trying to load the images now?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 22:32
OK, if I open Python on a node and open /home/manager/orange/img/img1.tif, where orange is my shared folder, it opens fine
sure
Jason Wittenbach
@jwittenbach
Oct 06 2015 22:33
ah, ok, sounds like it’s working
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 22:38
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-2a1138cc49e8> in <module>()
----> 1 data= tsc.loadImages(path,inputFormat='tif')

/tmp/spark-3b0266bb-1093-4dca-8743-d6349a967530/userFiles-39e74eb5-9c43-450c-baf0-b24aa30684a4/thunder_python-0.5.1-py2.7.egg/thunder/utils/context.py in loadImages(self, dataPath, dims, dtype, inputFormat, ext, startIdx, stopIdx, recursive, nplanes, npartitions, renumber, confFilename)
    196         elif inputFormat.lower().startswith('tif'):
    197             data = loader.fromTif(dataPath, ext=ext, startIdx=startIdx, stopIdx=stopIdx, recursive=recursive,
--> 198                                   nplanes=nplanes, npartitions=npartitions)
    199         else:
    200             if nplanes:

/tmp/spark-3b0266bb-1093-4dca-8743-d6349a967530/userFiles-39e74eb5-9c43-450c-baf0-b24aa30684a4/thunder_python-0.5.1-py2.7.egg/thunder/rdds/fileio/imagesloader.py in fromTif(self, dataPath, ext, startIdx, stopIdx, recursive, nplanes, npartitions)
    372         reader = getParallelReaderForPath(dataPath)(self.sc, awsCredentialsOverride=self.awsCredentialsOverride)
    373         readerRdd = reader.read(dataPath, ext=ext, startIdx=startIdx, stopIdx=stopIdx, recursive=recursive,
--> 374                                 npartitions=npartitions)
    375         nrecords = reader.lastNRecs if nplanes is None else None
    376         return Images(readerRdd.flatMap(multitifReader), nrecords=nrecords)

/tmp/spark-3b0266bb-1093-4dca-8743-d6349a967530/userFiles-39e74eb5-9c43-450c-baf0-b24aa30684a4/thunder_python-0.5.1-py2.7.egg/thunder/rdds/fileio/readers.py in read(self, dataPath, ext, startIdx, stopIdx, recursive, npartitions)
    186         """
    187         absPath = self.uriToPath(dataPath)
--> 188         filePaths = self.listFiles(absPath, ext=ext, startIdx=startIdx, stopIdx=stopIdx, recursive=recursive)
    189 
    190         lfilepaths = len(filePaths)

/tmp/spark-3b0266bb-1093-4dca-8743-d6349a967530/userFiles-39e74eb5-9c43-450c-baf0-b24aa30684a4/thunder_python-0.5.1-py2.7.egg/thunder/rdds/fileio/readers.py in listFiles(self, absPath, ext, startIdx, stopIdx, recursive)
    175             LocalFSParallelReader._listFilesRecursive(absPath, ext)
    176         if len(files) < 1:
--> 177             raise FileNotFoundError('cannot find files of type "%s" in %s' % (ext if ext else '*', absPath))
    178         files = selectByStartAndStopIndices(files, startIdx, stopIdx)
    179 

FileNotFoundError: cannot find files of type "tif" in /home/manager/orange/
oops, sorry, I made a mistake there
the path was one folder short in that one, but the error is the same
Notice that it's different from the Py4J error I got before I made the shared folder
Jason Wittenbach
@jwittenbach
Oct 06 2015 22:49
yeah
so we know these files are visible from the nodes, but I guess the code that is failing right now is running on the Master rather than the Workers
Is the Spark Master also a node on your cluster?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:02
yes
my "cluster" is 3 PCs lol. One of them is the master with one worker, and the other 2 PCs have a worker each
orange is the pc from where I am running thunder
how do you know that it is the master?
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:16
the line that’s failing isn’t inside of a Spark map, reduce, etc call
so the Workers haven’t actually even been called yet
everything up to that point is taking place on the Driver (which, in reality, might be different than the Master, but I assumed they were probably on the same machine)
the way parallel reads work in Thunder is that, on the Driver, you get a list of all the files to be loaded, and then you parallelize that list…each Worker ends up with a few of the file names, and is responsible for loading those files.
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:22
so the code that’s failing there is just the part where, locally on the Driver, Thunder is trying to list the contents of the path and find the files that match the extension
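The pattern described above can be sketched roughly like this, with plain Python standing in for the Spark calls (names are illustrative, not Thunder's actual API):

```python
import glob
import os

def parallel_read_sketch(data_path, ext='tif', n_workers=3):
    # Step 1 (runs on the Driver): list all matching files locally.
    # This is the step that fails if the Driver cannot see the path.
    file_paths = sorted(glob.glob(os.path.join(data_path, '*.' + ext)))
    if not file_paths:
        raise IOError('cannot find files of type "%s" in %s' % (ext, data_path))
    # Step 2: partition the file list across workers
    # (in Spark this would be sc.parallelize(file_paths)).
    return [file_paths[i::n_workers] for i in range(n_workers)]
    # Step 3 (runs on each Worker): each worker opens and loads its own
    # slice of files -- so the Workers must also be able to see the path.
```

So both the Driver and every Worker need the same path to resolve to the data.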
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:23
I see your point...
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:23
so as long as whatever machine you’re running Python/Thunder from can also see that path, then I don’t know why it would fail
do you have that path mounted in the same place on the machine that you’re running from as well?
(i.e., the Driver)
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:24
Is there a way I can see what Thunder can list in that folder?
all 3 computers have the folder mounted on ~/orange
and using sshfs I can navigate through every folder of the computer with the data
so... ~/orange goes to ~ on orange
and on orange the files are in ~/imgs
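One thing worth checking with a path like ~/orange/imgs: the listing goes through glob, which does not expand a leading tilde, so the path has to be expanded first. A minimal illustration (the directory name is just the layout from this thread):

```python
import glob
import os

# glob treats '~' literally, so this almost always returns []:
print(glob.glob('~/orange/imgs/*.tif'))

# Expanding the tilde first gives a path glob can actually search:
expanded = os.path.expanduser('~/orange/imgs')
print(glob.glob(os.path.join(expanded, '*.tif')))
```

Passing an absolute path like /home/manager/orange/imgs sidesteps this entirely.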
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:31
gotcha
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:32
this is one weird problem
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:33
and so the path that you’re giving loadImages then is ~/orange/imgs, right?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:33
yes
I mean... I tried a lot of combinations, but that was certainly one of them
I have uploaded a small series to a Google Drive account. Do you want to see if we can load that, to help pinpoint the issue?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:40
is it ok to have the master and a worker on the same machine ?
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:42
here’s a little snippet of code that you can try running on the same machine that you’re trying to do tsc.loadImages from. It replicates how Thunder tries to find the list of images to load. If you can run this, maybe we can see what it’s finding…
import os
import glob
import urllib
import urlparse  # Python 2; on Python 3 this lives in urllib.parse

path = '/home/manager/orange/imgs'
ext = 'tif'

absPath = urllib.url2pathname(urlparse.urlparse(path).path)
files = glob.glob(os.path.join(absPath, '*.' + ext))
hmmm, I think it should be ok to have the Master as one of your Workers
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:46
urlparse isn't defined
this is python3
nope, it is 2. done
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:52
ah, good good…if it had been Python3, I would have found our problem ;)
what does that give for absPath and files?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:53
so it works on a worker and on the master
but on orange it does not, since the path is not that one
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:56
is orange where you’re trying to run the tsc.loadImages from?
alexandrelaborde
@AlexandreLaborde
Oct 06 2015 23:57
wait a second, I think I figured it out
let me check
Jason Wittenbach
@jwittenbach
Oct 06 2015 23:59
cool
if you’re running Python/Thunder from orange, then you’ll have to mount that share on orange as well