These are chat archives for thunder-project/thunder

22nd
Apr 2016
Jason Wittenbach
@jwittenbach
Apr 22 2016 16:44
I know there’s been a big discussion of adding HDFS IO to Thunder, but we haven’t thought much about pushing the IO up to Bolt, though that’s an interesting idea. Do you have any use cases in mind that, if Thunder supported reading/writing to HDFS, would still require IO on BoltArrays?
As to the second point, we’ve definitely thought about how to handle metadata
We want to keep Bolt simple, only implementing ndarray-type functionality
but Thunder is a different story
Thunder 1.0 adds a labels object that lets you keep track of some per-record metadata
We store that metadata locally though, as opposed to keeping it with the distributed records
We don’t currently have “filtering by label”, but that would be fairly straightforward to implement
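A rough sketch of what "filtering by label" could look like when the labels live locally and the records are keyed by index. This is not Thunder's API: `filter_by_label`, its arguments, and the list of `(key, value)` pairs standing in for the distributed records are all hypothetical, chosen only to illustrate the idea.

```python
def filter_by_label(records, labels, predicate):
    """Keep only the records whose locally stored label satisfies `predicate`.

    records   -- list of (key, value) pairs; a stand-in for distributed records
    labels    -- local list where labels[key] is the label of record `key`
    predicate -- function taking a label and returning True to keep the record
    """
    # Decide locally which keys survive, using only the local metadata.
    keep = {key for key, label in enumerate(labels) if predicate(label)}

    # Apply that decision to the records. On a Spark RDD this step would be
    # something like records.filter(lambda kv: kv[0] in keep).
    kept_records = [(key, value) for key, value in records if key in keep]

    # Subset the local labels so they stay aligned with the kept records.
    kept_labels = [labels[key] for key, _ in kept_records]
    return kept_records, kept_labels
```

The point of the sketch is that the label test runs on the local metadata, and only the resulting set of keys needs to reach the distributed records.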
Steve Varner
@stevevarner
Apr 22 2016 18:29
Currently, when you load images, the loader has to iterate through each one individually, transfer each over the network, then add them to the Bolt array one at a time to be parallelized in Spark. If it did something like sc.pickleFile(hdfs_path) instead, the loading would be a lot more efficient. If we added an HDFS option to the current image reader, it would actually be slower, because it would have to check each file block's location in HDFS, possibly (but not likely) reassemble the file, then transfer it to be loaded individually. With sc.pickleFile, the HDFS image stack would never have to be fully reassembled from HDFS file blocks.
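A minimal sketch of the sc.pickleFile idea, assuming the image stack is first written once as a pickled RDD and then read back directly by the executors. `save_stack` and `load_stack` are hypothetical helper names, not part of Thunder or Bolt; `parallelize`, `saveAsPickleFile`, and `pickleFile` are the standard PySpark calls.

```python
def save_stack(sc, images, path, num_partitions=None):
    """One-time conversion: write (index, image) records to `path`.

    `sc` is an existing SparkContext and `images` is any local sequence of
    picklable objects (for example NumPy arrays). This step still ships the
    data from the driver, so it is only worth doing once per dataset.
    """
    records = list(enumerate(images))
    sc.parallelize(records, num_partitions).saveAsPickleFile(path)


def load_stack(sc, path, min_partitions=None):
    """Read the records back as an RDD of (index, image) pairs.

    Each partition is read from the stored file by the executors, so the
    stack does not pass through the driver one image at a time.
    """
    return sc.pickleFile(path, min_partitions)
```

With a live SparkContext, something like `save_stack(sc, images, hdfs_path)` followed by `load_stack(sc, hdfs_path)` would give back an RDD of `(index, image)` pairs; turning that RDD into a Bolt array or a Thunder Images object is a separate step not shown here.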