These are chat archives for thunder-project/thunder

28th
Jun 2016
Boaz Mohar
@boazmohar
Jun 28 2016 13:49
@mobcdi you could use clustering to group together similar frames http://spark.apache.org/docs/latest/mllib-clustering.html. That is not a Thunder operation, but with version 1.0 you can always use tordd() to do operations on the RDD and recreate a Thunder object using the fromrdd() methods.
mobcdi
@mobcdi
Jun 28 2016 18:13
@boazmohar would you have a recommendation for the attributes to use to define the clusters, i.e. cluster by some feature descriptor or by some other property of the images?
W J Liddy
@WJLiddy
Jun 28 2016 19:53
This message was deleted
W J Liddy
@WJLiddy
Jun 28 2016 20:07

Hello all,
In this code snippet, I load in some images, convert to series, and then try to parallelize the series:

import thunder as td
from pyspark import SparkContext
sc = SparkContext()
images_local = td.images.frompng("/sample_pics/*")
series = images_local.toseries()
series.tospark(engine = sc)
ssum = series.map(lambda series: series.sum())
print ssum.collect.toarray()

However, the program will freeze at the line
series.tospark(engine = sc)

I have no trouble converting images to spark mode.
Is my usage incorrect, or is this a bug?

import thunder as td 
from pyspark import SparkContext 
sc = SparkContext()
images_local = td.images.frompng("/sample_pics/*") 
series = images_local.toseries() 
series.tospark(engine = sc) 
ssum = series.map(lambda series: series.sum())
print ssum.collect.toarray()
(Fixed the formatting)
Jason Wittenbach
@jwittenbach
Jun 28 2016 20:44
@WJLiddy how big is the dataset that you’re working with?
another thing would be to make sure that your SparkContext is initialized correctly; you could check that by trying something like sc.parallelize(range(10)).collect()
W J Liddy
@WJLiddy
Jun 28 2016 21:07
@jwittenbach My dataset is ~10MB and I have made sure my SparkContext is initialized correctly.
Boaz Mohar
@boazmohar
Jun 28 2016 21:09
@m
Jason Wittenbach
@jwittenbach
Jun 28 2016 21:24

@WJLiddy I don’t have a collection of PNGs handy, but if I substitute

td.images.frompng("/sample_pics/*")

with

td.images.fromrandom((100, 100, 100))

(which creates a dataset of roughly the same size), then everything seems to work. That makes me think it could be an issue with loading the PNGs.

Can you watch the Spark web UI (on port 4040 of your driver) and see which Spark job it’s getting stuck on?
W J Liddy
@WJLiddy
Jun 28 2016 22:12
No jobs show up on the webUI. As a sanity check, I tried again with td.images.fromrandom((100, 100, 100)) rather than frompng. Then, the job does complete, but still, no jobs show up in the webUI.
(Also, I changed 'print ssum.collect.toarray()' to 'print ssum.toarray()')
Wow I am terrible at this Markdown thing