These are chat archives for thunder-project/thunder

26th
May 2016
Chris Tech
@techchrj
May 26 2016 14:47
@bainzo In regards to your tiff reading problem, we had issues reading 32-bit LZW-compressed floating-point data with the previous version of thunder when it used PIL to load them. It would blank out the last 35 rows and columns of data. We loaded the TIFs using matlab, GDAL, and even the underlying tifffile.py file to verify that nothing was wrong with the data integrity. Since PIL wouldn't load the data properly, we modified the code base to use GDAL to load the tiff data instead of PIL. When the new version of Thunder came out, we modified the code base and issued a pull request (PR #299) to use GDAL in its place, since we know it handles all data types and is regularly maintained. I don't know if that helps you, but I thought I would throw that out there since you were having issues as well. The code mods to incorporate GDAL are very small and might help you out.
Jeremy Freeman
@freeman-lab
May 26 2016 15:20
@jwittenbach what's going on with the [0] from this line in map_generic
return self.values.map_generic(func)[0]
Jason Wittenbach
@jwittenbach
May 26 2016 15:22
@freeman-lab IIRC, it has to do with the fact that, in Bolt, time can be chunked as well, but in Thunder a Block has all time points, so there always ends up being a singleton dimension at the beginning of the array; the [0] is simply removing it
or maybe it was the fact that the result ends up wrapped inside a 0-dimension NumPy array, and I’m pulling it out...
either way, I’m pretty sure it’s just book-keeping
let me check quick
ok, yep
it’s the former
there’s only ever one chunk along time
so the chunking scheme always looks like (1, chunks_x, chunks_y, chunks_z)
and that’s also the shape of the array (of dtype=object) returned from map_generic
since that first dimension is always 1, I’m just dropping it
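A minimal NumPy sketch of the bookkeeping Jason describes; the chunk layout here is simulated for illustration, not Thunder's actual internals:

```python
import numpy as np

# Simulate the object array that map_generic returns: one chunk along
# time, chunks along each spatial axis, e.g. (1, chunks_x, chunks_y, chunks_z)
chunks = np.empty((1, 2, 2, 2), dtype=object)
for idx in np.ndindex(chunks.shape):
    chunks[idx] = np.zeros((10, 10))  # placeholder per-chunk result

# the leading time dimension is always singleton, so it carries no
# information; indexing it away is pure bookkeeping
spatial_chunks = chunks[0]
print(spatial_chunks.shape)  # (2, 2, 2)
```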
Jeremy Freeman
@freeman-lab
May 26 2016 15:30
hm ok i'm seeing other issues with map_generic
we should probably pair on this cause i'm a little confused
you in or near bobs?
@jwittenbach
Jason Wittenbach
@jwittenbach
May 26 2016 15:35
up in my office
shall I head down?
Jeremy Freeman
@freeman-lab
May 26 2016 15:35
k cool yeah in a few min i'm walking over now
Jason Wittenbach
@jwittenbach
May 26 2016 15:35
ok, cool
I need to run back to my apartment quick, but I’ll head over as soon as I’m back
Jeremy Freeman
@freeman-lab
May 26 2016 15:37
great
mobcdi
@mobcdi
May 26 2016 15:55
Silly question, but does thunder need hadoop, or can it, like spark, use shared storage? Hoping to process a video but new to distributed working
Jeremy Freeman
@freeman-lab
May 26 2016 16:06
@mobcdi definitely doesn't need hadoop
if running locally it just uses the local file system
if running distributed it just needs to be able to access the specified files from all cluster nodes
mobcdi
@mobcdi
May 26 2016 16:11
How is access controlled if one node opens the file ahead of another? Or how can multiple nodes process one file without all processing the same part of it?
Nicholas Sofroniew
@sofroniewn
May 26 2016 17:35
@jwittenbach i'm getting an error in map_as_series in local mode - block size must be a tuple for local mode
can be fixed by passing a tuple (i think) but default should work with no error
Jason Wittenbach
@jwittenbach
May 26 2016 17:39
@sofroniewn I remember making that decision and thinking it was not optimal. In spark mode, the default is 150KB chunks. In local mode, we don’t allow chunk size to be specified in terms of memory footprint, but demand chunk dimensions instead. Any thoughts on what a good default would be in that case?
Nicholas Sofroniew
@sofroniewn
May 26 2016 17:41
hmm really not sure. Is there any reason not to have a default chunk size of 1 (i.e. no chunks)?
Jason Wittenbach
@jwittenbach
May 26 2016 17:42
That could work
though then the default for chunking is to not actually chunk, which seems like it might be misleading to a user
Nicholas Sofroniew
@sofroniewn
May 26 2016 17:43
yeah, but when i call map_as_series I don't really mind how you chunk it
or if it chunks at all
Jeremy Freeman
@freeman-lab
May 26 2016 18:18
@mobcdi that kind of behavior depends a lot on the kind of file structure
for example, when reading images, each spark task is loading a single distinct image file
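A toy sketch of the per-task file assignment Jeremy describes (the file names and task count are made up): since each task owns a distinct set of files, no coordination between nodes is needed.

```python
# hypothetical list of image files on shared storage
files = ['img-%03d.tif' % i for i in range(8)]
n_tasks = 4

# round-robin assignment: each task loads only its own distinct files
assignments = {t: files[t::n_tasks] for t in range(n_tasks)}

# every file appears in exactly one task's list
all_assigned = sorted(f for fs in assignments.values() for f in fs)
print(all_assigned == sorted(files))  # True
```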
Davis Bennett
@d-v-b
May 26 2016 18:47
@freeman-lab where did the source api go, e.g. for source extraction?
Jeremy Freeman
@freeman-lab
May 26 2016 18:49
most of it is now in thunder-extraction
and some of it is in a little package called regional
Davis Bennett
@d-v-b
May 26 2016 18:50
ok, looks like some of this is hot off the keyboard :)
mobcdi
@mobcdi
May 26 2016 20:19
@freeman-lab so if I want to process a video (mpeg4) file, would I be able to a) work it across multiple nodes without hadoop and b) save it back to a single file afterwards
Jeremy Freeman
@freeman-lab
May 26 2016 20:22
@mobcdi hm i'm not actually sure, we typically work with image data by storing each image separately, and we've never tried to work with mp4s
it'd probably be much easier to split the full video into several smaller videos, and then just have each task work on a single chunk of the video
that's not unlike how people store large collections of images as several files, each of which contains a stack of images
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:25
@jwittenbach any thoughts on how block size will affect speed of map_as_series in local mode?
detrended = data.map_as_series(detrend, block_size = (10, 10))
Jason Wittenbach
@jwittenbach
May 26 2016 23:26
hmm
my initial thought is that it shouldn’t have much of an effect at all
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:26
for a small 100x1682x1792 (T x X x Y) dataset it is going very slow
yeah that was my initial thought too - am I right to even give it a tuple like that?
Jason Wittenbach
@jwittenbach
May 26 2016 23:27
yep
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:27
the dataset is only 300MB
Jason Wittenbach
@jwittenbach
May 26 2016 23:27
hmm
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:28
i wonder if even a to_series, a map, and a to_images might be faster
or unlikely?
Jason Wittenbach
@jwittenbach
May 26 2016 23:28
I’d be interested to hear how it compares to just calling toarray and then doing it however you would with just NumPy
yeah, it’s very possible
in local mode, toseries and toimages skip the blocks and just do the transpose
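The toarray-then-NumPy route Jason suggests, sketched with plain NumPy; the `detrend` stand-in and array sizes are illustrative, not the actual workload:

```python
import numpy as np

data = np.random.rand(100, 16, 17)  # hypothetical (T, X, Y) stack

def detrend(series):
    # stand-in per-series operation: subtract the mean over time
    return series - series.mean()

# equivalent of toarray + "doing it however you would with just NumPy":
# apply the function along the time axis (axis 0)
result = np.apply_along_axis(detrend, 0, data)
print(result.shape)  # (100, 16, 17)
```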
Jeremy Freeman
@freeman-lab
May 26 2016 23:29
yeah i'd try just doing it as a toseries
can't imagine the indirection of map_as_series helps in local mode
Jason Wittenbach
@jwittenbach
May 26 2016 23:29
there’s definitely some overhead breaking it into blocks and then putting it back together
for map_as_series
and there is literally 0 benefit
so, at best, it will be the same
Jeremy Freeman
@freeman-lab
May 26 2016 23:30
yup yup
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:38
still seems very slow - haven't been able to get it to even complete yet
Jeremy Freeman
@freeman-lab
May 26 2016 23:40
to some degree this isn't that surprising, you're doing a fairly complex operation on each of 3 million time series with no parallelism
my experience doing this stuff locally has always been that simple array math or collapsing along axes (e.g. taking the mean) is really fast, but complex per-pixel stuff becomes extremely slow
sounds consistent with what you're seeing
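Jeremy's point about vectorized array math versus per-pixel work can be seen in a small NumPy comparison (sizes are illustrative); both paths compute the same thing, but the Python-level loop is the kind of per-pixel code that becomes slow at 3 million series:

```python
import numpy as np

data = np.random.rand(100, 50, 60)  # hypothetical (T, X, Y) stack

# vectorized reduction along time: a single C-level loop, very fast
fast = data.mean(axis=0)

# per-pixel Python loop: one call per time series, which is where
# the slowdown comes from at realistic spatial sizes
slow = np.empty(data.shape[1:])
for i in range(data.shape[1]):
    for j in range(data.shape[2]):
        slow[i, j] = data[:, i, j].mean()

print(np.allclose(fast, slow))  # True
```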
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:46
yup - just finished running
Jason Wittenbach
@jwittenbach
May 26 2016 23:49
yeah, that’s a lot of trips around the for-loop
ok, so the PR is ready for blocks
the one last thing I might want to change:
what should auto give for a chunk size in local mode?
right now it does the same as spark mode, i.e. tries to give 100MB block sizes
the other option might be to make the default be a single chunk in local mode, and leave it as auto in spark mode
Jeremy Freeman
@freeman-lab
May 26 2016 23:51
yup, that's what i would do
for local mode auto should be a single chunk
Jason Wittenbach
@jwittenbach
May 26 2016 23:53
well, what if auto always means “100MB blocks”?
but then the default can be auto for spark
and 1 chunk for local
?
Jeremy Freeman
@freeman-lab
May 26 2016 23:53
except you can't specify 1 chunk as the default
Jason Wittenbach
@jwittenbach
May 26 2016 23:53
though I guess I can see either being reasonable
Jeremy Freeman
@freeman-lab
May 26 2016 23:53
because we don't know what 1 is until we know the shape
Jason Wittenbach
@jwittenbach
May 26 2016 23:53
well, I can have None as the default
check for None
Jeremy Freeman
@freeman-lab
May 26 2016 23:54
oh and then None -> auto for spark and None -> 1 chunk for local
Jason Wittenbach
@jwittenbach
May 26 2016 23:54
yeah
Jeremy Freeman
@freeman-lab
May 26 2016 23:54
i guess it just seems like an extra level of indirection
Jason Wittenbach
@jwittenbach
May 26 2016 23:54
I’m just trying to figure out which of those options is less confusing
Jeremy Freeman
@freeman-lab
May 26 2016 23:55
compared to auto -> 100MB for spark and auto -> 1 chunk for local
Jason Wittenbach
@jwittenbach
May 26 2016 23:55
yep
@sofroniewn which of those seems more intuitive to you?
Jeremy Freeman
@freeman-lab
May 26 2016 23:55
@sofroniewn said above he just didn't care about chunks in local mode
Jason Wittenbach
@jwittenbach
May 26 2016 23:55
haha, fair enough
Jeremy Freeman
@freeman-lab
May 26 2016 23:56
which i think argues for the latter =)
Jason Wittenbach
@jwittenbach
May 26 2016 23:56
well
in either case
you just don’t override the default, and you get a single block in local mode
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:56
yup - sort of thinking about the latter
auto => you do what you think is best
Jeremy Freeman
@freeman-lab
May 26 2016 23:57
true, they end up the same
Jason Wittenbach
@jwittenbach
May 26 2016 23:58
ok, I’m down with that
I’ll have auto be the default
and it will do separate things in local vs spark
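A toy sketch of the default-resolution scheme they settle on; the function name, the '100MB' placeholder, and the return conventions are hypothetical, not Thunder's actual code:

```python
def resolve_chunk_size(chunk_size, mode, shape):
    """Resolve the user-facing chunk_size argument per execution mode."""
    if chunk_size == 'auto':
        if mode == 'spark':
            # size-based heuristic, as discussed above
            return '100MB'
        # local mode: one chunk spanning the whole array; we can only
        # express "1 chunk" once the shape is known
        return tuple(shape)
    return chunk_size

print(resolve_chunk_size('auto', 'local', (1, 512, 512)))   # (1, 512, 512)
print(resolve_chunk_size('auto', 'spark', (1, 512, 512)))   # '100MB'
print(resolve_chunk_size((10, 10), 'local', (1, 512, 512))) # (10, 10)
```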
Jeremy Freeman
@freeman-lab
May 26 2016 23:58
great
i like that
Nicholas Sofroniew
@sofroniewn
May 26 2016 23:59
makes sense to me