Aurélien Mazurie, Ph.D.
@ajmazurie
Hi guys! I just heard about Bolt, and I'm wondering about using it to bring scikit-learn to Spark. For example, I'd love to perform hyperparameter optimization (GridSearchCV or others) using Spark to distribute the jobs across a cluster. Has anybody used Bolt for that?
Stephan Hoyer
@shoyer
@ajmazurie take a look at sparkit-learn: https://github.com/lensacom/sparkit-learn
Jeremy Freeman
@freeman-lab
yup, great reference @shoyer, they're doing something related and the ArrayRDD could almost definitely be swapped for the Bolt array
that said @ajmazurie, hyperparameter optimization (e.g. running a fit on a small dataset across many parameter sweeps) and distributed array computation (e.g. running complex algorithms on large feature matrices) are somewhat different
bolt (or dask.array) would probably be well-suited to the latter
Jeremy Freeman
@freeman-lab
but the former would probably need a separate set of abstractions for use in Spark; I'd definitely check out this effort to do something related in dask: blaze/dask#743
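To make the distinction concrete, here is a rough sketch of the parameter-sweep style of parallelism described above; it is not from the discussion, and it assumes an existing SparkContext sc plus small in-memory arrays X and y (both hypothetical), broadcasting the data and distributing scikit-learn fits over a parameter grid:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# small dataset, many parameter combinations: broadcast the data once,
# then parallelize the grid and fit/score one model per task
grid = [{'C': c, 'gamma': g} for c in (0.1, 1, 10) for g in (0.01, 0.1, 1)]
data_bc = sc.broadcast((X, y))

def evaluate(params):
    X_local, y_local = data_bc.value
    score = cross_val_score(SVC(**params), X_local, y_local, cv=3).mean()
    return params, score

best_params, best_score = sc.parallelize(grid).map(evaluate).max(key=lambda t: t[1])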
Kyle
@kr-hansen
Is there any way to add or subtract a bolt array element-wise with another bolt array? Does that get too messy in a distributed context?
Kyle
@kr-hansen
@freeman-lab @jwittenbach I submitted a pull request to implement the element-wise operations (+, -, *, /) as discussed on the Thunder Forum. I'm still kind of learning GitHub, so let me know if I did something wrong or you aren't able to see it and I can try again.
I also believe I submitted two patches (patch-1 & patch-2) by accident. They are both exactly the same, so just disregard/delete one of them.
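To sketch the underlying idea (illustrative plain-Spark code, not Bolt's actual implementation): element-wise operations between two identically partitioned distributed arrays can be done by joining their (key, block) records and applying the NumPy operation to each pair of blocks.

# rdd_a and rdd_b are RDDs of (key, ndarray) records with matching keys
# and matching block shapes
def elementwise_add(rdd_a, rdd_b):
    # join pairs up blocks sharing a key: (key, (block_a, block_b))
    return rdd_a.join(rdd_b).mapValues(lambda pair: pair[0] + pair[1])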
Kyle
@kr-hansen
Also, I just realized that Thunder already has this built into it (Thunder/base.py) via the element_wise function, though Bolt doesn't. Is there a reason for that? Haha, I didn't see this was already a part of Thunder when I embarked on that fork/pull request.
Steve Varner
@stevevarner
In the function toarray() in array.py, I'm getting an error on the reshape(self.shape) function. I created the Bolt array with nrecords=2 and dims=(900,900). "ValueError: total size of new array must be unchanged". What is the purpose of the reshape?
Jason Wittenbach
@jwittenbach
@stevevarner the reshape is to handle the distributed parts of the array. When we collect, we simply concatenate all of the pieces together. But, if there is more than one distributed dimension, then we need to reshape to separate them back out.
So if you're creating the RDD by hand, then you'll have to make sure you set the shape attribute on the BoltArray correctly, or the reshape won't have the correct size to work with
So if each of your records had a 1D tuple for the key, and a 900x900 ndarray for the value, then the correct shape for the BoltArray would be (2, 900, 900)
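A local NumPy illustration of the concatenate-then-reshape step described above (not Bolt's actual code):

import numpy as np

# two records, each holding a (900, 900) block
records = [np.random.randn(900, 900) for _ in range(2)]

# collecting concatenates the pieces into a (1800, 900) array...
flat = np.concatenate(records)

# ...and reshaping with the BoltArray's full shape separates the
# distributed dimension back out
full = flat.reshape((2, 900, 900))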
Steve Varner
@stevevarner
That's the shape I'm using and it's causing that error.
Jason Wittenbach
@jwittenbach
Interesting. Can you send a code snippet of how you create the array?
Steve Varner
@stevevarner

data = thunder.images.fromtifwithmetadata(row['filepath'], engine=sc, labels=(json.dumps(row),))

# compress each record's value and save to a pickle file
rdd = data.tordd().map(lambda ((a, b), c): ((a, b), bz2.compress(pickle.dumps(c))))
rdd.saveAsPickleFile("/data/savedFromThunder")

# reload and rebuild an Images object
rdd = sc.pickleFile("/data/savedFromThunder")
data = thunder.images.fromrdd(rdd, nrecords=None, dims=(900, 900), dtype='string')
data.toarray()

I also tried making nrecords=2 (for a known RDD size of 2 images) and that had the same result.
Jason Wittenbach
@jwittenbach
Ah, ok, I think I see the issue. When you create the Images object with fromrdd, try giving it dims=(2,900,900)
Steve Varner
@stevevarner

Is this the expected output?

>>> data = thunder.images.fromrdd(rdd, dims=(2, 900, 900), dtype='string')
>>> data
Images
mode: spark
dtype: string
shape: (None, 2, 900, 900)

Jason Wittenbach
@jwittenbach
Ah, ok, I think you were right all along with dims
Now I'm thinking it might be a bug where fromrdd actually isn't set up correctly to handle labels
Hmm, actually that seems ok.
Are there extra things hidden in your keys?
Though even that shouldn't matter... I'm a bit stumped
Jason Wittenbach
@jwittenbach
so I just tested it, and the following works for me
import thunder as td
import numpy as np
rdd = sc.parallelize([( (0,), np.random.randn(900, 900) ), ( (1,), np.random.randn(900, 900) )])
imgs = td.images.fromrdd(rdd, dims=(900, 900))
a = imgs.toarray()
the only other thing I can think of is whether things are getting properly serialized/deserialized
after loading from the pickle file
if you do
rdd.first()[1]
do you see a 900x900 NumPy array?
mobcdi
@mobcdi
Would the experts here be able to recommend a way to ingest over 10k ndarrays of shape (251328,)? Would it be possible to include an ID value with each ingested ndarray? (Forgive the novice question, but I'm really stumped.)
Jason Wittenbach
@jwittenbach
@mobcdi you might want to check out Thunder (www.thunder-project.org)
It uses Bolt as a backend, but includes lots of functions for loading and saving large arrays
Jeremy Freeman
@freeman-lab
@mobcdi also what format are the raw arrays in on disk? and where are they stored?
mobcdi
@mobcdi
@freeman-lab @jwittenbach They are currently stored as (lots of) individual cPickle files on disk and are the output of running a feature descriptor called HoG on 960x540 video frames from variously sized video sequences. I'm currently using vstack with lists to aggregate frames back into video sequences, but I'm hitting memory issues in Python. I also need to reduce/dedup the array size to eventually run kmeans and a BoW method/process. I would greatly appreciate any help, including whether to reshape the individual ndarrays before doing anything, and possible options for reducing/filtering the dataset to make it more manageable.
mobcdi
@mobcdi
I'm able to unpickle the ndarrays from an earlier execution of a feature descriptor task, just need some help working with such a large number and size
Jason Wittenbach
@jwittenbach
@mobcdi are you running on a cluster with Spark installed and configured?
mobcdi
@mobcdi
@jwittenbach I have access to Google Dataproc with a Jupyter Notebook on the master node
Jason Wittenbach
@jwittenbach
@mobcdi cool! If you have Bolt installed, I think you could fairly easily load it into a Bolt array object
let’s say you have a function called load_pickle that takes a filename and returns a NumPy ndarray containing the data from the loaded file
If all of your files are in the same directory, then you could use glob to generate a list containing all of the file names
then you could do something like the following:
Jason Wittenbach
@jwittenbach
from glob import glob
import bolt

fnames = glob('/path/to/files/*')
ashape = (10, 10)                      # shape of each individual array
fullshape = (len(fnames),) + ashape    # full shape of the combined array
rdd = sc.parallelize(enumerate(fnames)).mapValues(load_pickle)
barray = bolt.BoltArraySpark(rdd, shape=fullshape, split=1, dtype=dtype)
you’ll have to supply the dtype argument based on the dtype that the arrays from your load_pickle function have
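For completeness, a minimal load_pickle along the lines assumed above (hypothetical; adjust if the arrays were saved differently, e.g. bz2-compressed):

import pickle

def load_pickle(fname):
    # load one pickled NumPy ndarray from disk
    with open(fname, 'rb') as f:
        return pickle.load(f)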
mobcdi
@mobcdi
Thanks @jwittenbach, could I ask the thinking behind this code?
ashape = (10, 10)
fullshape = (len(fnames), ) + ashape
Jason Wittenbach
@jwittenbach
@mobcdi I’m assuming that all of your arrays are of size ashape — here I just picked 10-by-10 as an example
then, len(fnames) will return the total # of such arrays
so when you load all of them at once, they will end up in a 3D array with dimensions number x height x width, which is what fullshape will be
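For the case described earlier (roughly 10k arrays of shape (251328,)), that works out to:

ashape = (251328,)                   # shape of each individual array
fullshape = (len(fnames),) + ashape  # e.g. (10000, 251328) for 10,000 files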