These are chat archives for halide/Halide

20 Oct 2017
Steven Johnson
@steven-johnson
Oct 20 2017 20:50
TIL about Flame Graphs as a profiling visualization. Very cool.
Andrew Adams
@abadams
Oct 20 2017 21:14
ok, now matching cublas on gemm
So much for llvm not being capable of it.
Steven Johnson
@steven-johnson
Oct 20 2017 21:15
woot
Andrew Adams
@abadams
Oct 20 2017 21:15
The funny thing is I'm getting 25% occupancy, so the GPU compute units are mostly idle
but I've maxed the bandwidth from L2
as cublas does
So they take the same amount of time on this card
Zalman Stern
@zvookin
Oct 20 2017 21:16
Excellent!
What is cublas' occupancy?
Andrew Adams
@abadams
Oct 20 2017 21:17
cublas gets 50% occupancy, so they're making a different trade-off
but are both L2-limited
Zalman Stern
@zvookin
Oct 20 2017 21:17
On this card we're using fewer resources, but we might be slower on a card with more L2?
Andrew Adams
@abadams
Oct 20 2017 21:18
On a card with more L2 bandwidth relative to compute available, yes
I'll see if I can hunt one up
Andrew Adams
@abadams
Oct 20 2017 21:26
If I double the matrix size, it should be 8x slower (the flops grow with the cube of the matrix dimension). cublas gets 8x slower, but we get 16x slower :(
Looks like we fall out of L2
Definitely need different schedules for different sizes. gemm is a pain
Zalman Stern
@zvookin
Oct 20 2017 21:28
Pain that is opportunity for Halide :-)
Andrew Adams
@abadams
Oct 20 2017 21:33
I think at this size I need an rfactor
(to block the reduction dimension)
anyway, it's not a good use of time to hammer on this endlessly
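For context, rfactor splits a Halide reduction into a partial-reduction stage plus a stage that combines the partials; the analogous trick for gemm in raw CUDA is split-k. A minimal sketch of that idea, not Halide's output, with an illustrative K_SPLIT and K assumed divisible by K_SPLIT:

    constexpr int K_SPLIT = 4;  // number of k-slices (illustrative)

    // Pass 1: the block at blockIdx.z reduces its k-slice of the dot product
    // into a partials buffer of K_SPLIT * M * N floats.
    __global__ void gemm_partial(const float *A, const float *B, float *partials,
                                 int M, int N, int K) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
        if (i >= M || j >= N) return;
        int len = K / K_SPLIT;          // K assumed divisible by K_SPLIT
        int k0 = blockIdx.z * len;
        float acc = 0.0f;
        for (int k = k0; k < k0 + len; k++)
            acc += A[i * K + k] * B[k * N + j];
        partials[(blockIdx.z * M + i) * N + j] = acc;
    }

    // Pass 2: sum the K_SPLIT partials for each element of C.
    __global__ void gemm_combine(const float *partials, float *C, int M, int N) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= M || j >= N) return;
        float acc = 0.0f;
        for (int s = 0; s < K_SPLIT; s++)
            acc += partials[(s * M + i) * N + j];
        C[i * N + j] = acc;
    }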
Zalman Stern
@zvookin
Oct 20 2017 21:35
Probably not, but it sounds like both basic warp shuffle and matching cublas are pretty promising.
I.e. matching on one data point
Andrew Adams
@abadams
Oct 20 2017 21:35
Yeah. The warp shuffle implementation is a bit gross
The idea is that you have an allocation at warp level, where different threads store different parts of it, and if you need a value from another thread, you use a shuffle
This means you have a family of choices for how to stripe the allocation across the threads
But you really want to only ever store to your own value, not another thread's value
so the striping should correspond to what the store instructions do
sometimes it's the innermost storage dimension, sometimes it's not
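A minimal CUDA sketch of the striping idea above, not Halide's generated code; N, the stripe layout, and the neighbor read are illustrative assumptions:

    constexpr int N = 4;  // elements owned by each lane (illustrative)

    // Launch with whole warps, e.g. warp_striped_demo<<<1, 32>>>(out).
    __global__ void warp_striped_demo(float *out) {
        int lane = threadIdx.x & 31;   // lane within the warp

        float vals[N];                 // this lane's stripe of a logical [32 x N] allocation
        for (int i = 0; i < N; i++)
            vals[i] = lane * N + i;    // each lane stores only its own values

        // Logical element (j, 0) lives in lane j's vals[0]. Fetching it goes
        // through a shuffle (CUDA 9's __shfl_sync) rather than having lane j
        // store into another thread's storage.
        int j = (lane + 1) & 31;       // read the neighboring lane's value
        out[threadIdx.x] = __shfl_sync(0xffffffffu, vals[0], j);
    }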
Zalman Stern
@zvookin
Oct 20 2017 21:37
Was it possible to allocate at warp level before?
Andrew Adams
@abadams
Oct 20 2017 21:37
No
The basic model is to have a loop level that lives between threads and blocks that is treated as warps
Or I guess it's really inside threads
Zalman Stern
@zvookin
Oct 20 2017 21:38
I need to look at it more closely. Is "warp level" a split of the thread dimension?
Andrew Adams
@abadams
Oct 20 2017 21:38
The current implementation just uses the innermost thread dimension to mean warp lanes
but only if the size is 32
otherwise it doesn't consider warps to even exist
Zalman Stern
@zvookin
Oct 20 2017 21:38
Ok. I would expect that to generalize to multiples of 32
Andrew Adams
@abadams
Oct 20 2017 21:38
Possibly divisors
Zalman Stern
@zvookin
Oct 20 2017 21:38
I.e. split the threads into blocks of 32
Andrew Adams
@abadams
Oct 20 2017 21:38
but not multiples
Zalman Stern
@zvookin
Oct 20 2017 21:39
sorry, warps of 32
Andrew Adams
@abadams
Oct 20 2017 21:39
You can't communicate with a thread in a separate real warp
Zalman Stern
@zvookin
Oct 20 2017 21:39
sorry, the extent of the threads dim is a multiple of 32
Andrew Adams
@abadams
Oct 20 2017 21:39
I think it needs to be a divisor of 32
You need to know that other threads that differ only by thread_id_x live in the same warp for this to work
and that's only true for thread blocks of size 1, 2, 4, 8, 16, 32
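A small CUDA sketch of the constraint: warps are carved out of consecutive linearized thread ids (x + y * blockDim.x + z * blockDim.x * blockDim.y), so a full run of threadIdx.x values lands inside one warp for every (y, z) only when blockDim.x divides 32:

    __device__ bool x_dim_stays_in_one_warp() {
        // Each run of threadIdx.x is a contiguous chunk of blockDim.x linear
        // ids; it can never straddle a 32-thread warp boundary exactly when
        // blockDim.x divides 32.
        return 32 % blockDim.x == 0;
    }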
Zalman Stern
@zvookin
Oct 20 2017 21:40
I'm thinking that we're effectively introducing a more inner dimension called "warp_id_x"
Andrew Adams
@abadams
Oct 20 2017 21:40
Otherwise you'd need a lot of syncthreads
Yeah, it's the lane within the warp
Zalman Stern
@zvookin
Oct 20 2017 21:41
I guess that dimension goes between threads and blocks
Andrew Adams
@abadams
Oct 20 2017 21:41
If you treat cuda right, it's the same as thread_id_x
Zalman Stern
@zvookin
Oct 20 2017 21:41
Block index, warp index within block, thread index within warp
Andrew Adams
@abadams
Oct 20 2017 21:41
Yeah
Zalman Stern
@zvookin
Oct 20 2017 21:41
Limiting blocks to 32 threads is a pretty big constraint
Also, Cuda 9 loosens this up
Andrew Adams
@abadams
Oct 20 2017 21:42
I've been thinking of "warp index within block" as thread index, and "thread index within warp" as warp lane
I.e. threads are groups of warp lanes
Zalman Stern
@zvookin
Oct 20 2017 21:42
hmmm
I think cuda terminology has a warp being a group of threads
Andrew Adams
@abadams
Oct 20 2017 21:42
similar to SIMD lanes being innermost
Zalman Stern
@zvookin
Oct 20 2017 21:42
I.e. the thread is a lane
Andrew Adams
@abadams
Oct 20 2017 21:43
Sure, the space of thread indices can be factored into warp lanes (innermost), and true thread indices (outermost)
This is all just nomenclature
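That factoring, written out in CUDA terms (a sketch; the helper name is illustrative):

    __device__ void factor_thread_index(int t, int &lane, int &warp) {
        lane = t & 31;  // thread index within the warp (innermost)
        warp = t >> 5;  // warp index within the block (outermost)
    }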
Zalman Stern
@zvookin
Oct 20 2017 21:43
But they generally wouldn't subpartition for the warp
I.e. warps are supposed to be a microarch detail, not a fundamental part of the programming model
Andrew Adams
@abadams
Oct 20 2017 21:44
Yeah, well, they kinda messed that up
Zalman Stern
@zvookin
Oct 20 2017 21:44
I'd say that's a bit blurry in practice
:-)
Andrew Adams
@abadams
Oct 20 2017 21:44
The thing that got complicated is that SIMD also exists and is important for minimizing memory transactions
Zalman Stern
@zvookin
Oct 20 2017 21:44
Frankly, if Halide had existed in the first place, I'm not sure one would have built any of it that way
Andrew Adams
@abadams
Oct 20 2017 21:44
and you want SIMD lanes to be inside of warp lanes in the storage layout
Zalman Stern
@zvookin
Oct 20 2017 21:45
There's an open issue as to how much cross-generation compatibility one tries to provide
But if one is shooting for peak performance, I don't think there's a whole lot to be had even with cuda
One has to reschedule
The ptx is pretty
shfl.idx.b32 is a warp broadcast
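For reference, a sketch of the source-level form that typically lowers to that instruction; with CUDA 9 the sync variant shfl.sync.idx.b32 is emitted instead, and the helper name here is illustrative:

    __device__ float warp_broadcast(float v) {
        // Every lane reads lane 0's copy of v: a warp-wide broadcast.
        return __shfl_sync(0xffffffffu, v, 0);
    }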
Zalman Stern
@zvookin
Oct 20 2017 22:52
Are all those movs broadcasting a register to a bunch of other registers?