These are chat archives for thunder-project/thunder

20th
Apr 2016
Chris Tech
@techchrj
Apr 20 2016 16:15
@freeman-lab I had a chance to look at this this morning. Your hunch about the partitions argument being the cause of the slowdown was correct. If it's not passed in, it defaults to the number of files, even with Spark. In previous versions it did not do that (it seemed more dynamic in nature). I passed in an arbitrary number based on the spark.default.parallelism calculation (3-4x the number of CPUs in the cluster), and the result came back in line with the timing we were seeing in version 0.6. I briefly looked at the v0.6 and v1.0 code and couldn't see anything really different in how the reader handles the partitions argument. In our application, which currently uses the v0.6 Thunder codebase, we do not pass anything to the npartitions argument.
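A minimal sketch of the sizing heuristic described above (3-4x the number of CPUs); `suggested_npartitions` is a hypothetical helper for illustration, not part of Thunder or Spark:

```python
def suggested_npartitions(ncores, factor=3):
    """Heuristic from this thread: 3-4x the number of CPUs in the cluster.

    `factor` is assumed to be 3 or 4; returns at least one partition.
    """
    return max(ncores * factor, 1)

# e.g. an 8-core cluster with the lower factor gives 24 partitions
n = suggested_npartitions(8)
```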
Jason Wittenbach
@jwittenbach
Apr 20 2016 16:33

Here’s the key line from an older version:
https://github.com/thunder-project/thunder/blob/v0.5.1/python/thunder/rdds/fileio/readers.py#L192

And here is the corresponding place in the code from the current version:
https://github.com/thunder-project/thunder/blob/master/thunder/readers.py#L150

Looks like both versions should have been defaulting to the number of files...
Jeremy Freeman
@freeman-lab
Apr 20 2016 16:44
ah so you're looking at 0.5.1 there
in 0.6+ it was using the default parallelism, see here
@techchrj great investigation on this, seems like you totally tracked down the issue
i'm now quite confident that in 0.6 it was the default parallelism and in 1.0.0 it's the number of files
but we can definitely change it back!
unfortunately the choice that will be faster in general depends on the size of the images, and the size of your cluster, and often the "optimal" number of partitions needs to be determined empirically
Chris Tech
@techchrj
Apr 20 2016 16:54
@freeman-lab yeah, that line in 0.6 would make the difference. In the grand scheme of things, which makes the most sense: what was in v0.6, or passing the number of partitions to use to loadimages? We can use either one.
Jason Wittenbach
@jwittenbach
Apr 20 2016 17:07
Ah, cool — couldn’t find a tag right away for 0.6, so I just grabbed 0.5 assuming it was the same — but not so!
Jeremy Freeman
@freeman-lab
Apr 20 2016 17:21
@jwittenbach yup figured, just kinda slipped that change in, minor but significant!
@techchrj immediate term i'd just set npartitions manually to be the value of sc.defaultParallelism
longer term, i guess i'd like to know "on average" which one is faster, across typical dataset sizes and cluster sizes
but i'm inclined to think the answer is defaultParallelism, exactly as you found
in which case we should switch the default to that
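A hedged sketch of the default being discussed, prefer the Spark default parallelism (the 0.6 behavior) and fall back to the number of files (the 1.0.0 behavior); `choose_npartitions` is a hypothetical helper, and the loading call in the comment is only an illustration of passing `npartitions` explicitly, as suggested above:

```python
def choose_npartitions(default_parallelism, nfiles):
    # prefer Spark's default parallelism (the 0.6 behavior) over
    # the number of files (the 1.0.0 behavior) when it is available
    return default_parallelism if default_parallelism else nfiles

# illustrative usage with a SparkContext `sc` (not run here):
#   import thunder as td
#   data = td.images.fromtif(path, engine=sc,
#                            npartitions=sc.defaultParallelism)
```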
Chris Tech
@techchrj
Apr 20 2016 21:08
@freeman-lab sounds good. Thanks for helping me out with this. Since we have such a large variation in datasets, I'll keep track of our results and provide you with what we find as the "average".
Jeremy Freeman
@freeman-lab
Apr 20 2016 22:55
@techchrj oh super cool, that'd be really informative
i've found that an empirical approach is really the best with this stuff