These are chat archives for halide/Halide

27th
Nov 2017
Andrew Adams
@abadams
Nov 27 2017 18:29
@dsharletg for the async hexagon dma, the desire is to overlap copies from device to host with computation on host, right?
Dillon Sharlet
@dsharletg
Nov 27 2017 18:31
right
Andrew Adams
@abadams
Nov 27 2017 18:31
I have that working in the async branch. Overlapping copies to device with computation on device is harder, because device APIs are already sort of asynchronous, but internally synchronize copies with compute. That's true for cuda at least.
Dillon Sharlet
@dsharletg
Nov 27 2017 18:31
how about overlapping copies to device with computation on host?
overlapping copies between device <-> host and computation on host I think covers the DMA use case
Andrew Adams
@abadams
Nov 27 2017 18:32
Let me try that.
Andrew Adams
@abadams
Nov 27 2017 18:43
deadlock, cool
Steven Johnson
@steven-johnson
Nov 27 2017 18:44
clearly you have a different definition of “cool” in mind
Dillon Sharlet
@dsharletg
Nov 27 2017 18:45
Is there a way to trigger the build bots for halide/Halide#2554 ? looks like it didn't happen automatically
other than just pushing a commit to it of course
Andrew Adams
@abadams
Nov 27 2017 18:55
Turns out never releasing semaphores is a leading cause of deadlock
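For illustration, here is a minimal sketch of that failure mode. It is not the code from the async branch; `run_pipeline`, the two-slot buffer, and the timeouts are all assumptions made up for the example. A producer thread stands in for the copy and a consumer thread for the compute, coordinated by two semaphores. If the consumer never releases the "free slot" semaphore, the producer stalls once both slots are used.

```python
import threading

def run_pipeline(n_items, release_slots=True, timeout=0.5):
    """Hypothetical 2-slot pipeline: producer 'copies', consumer 'computes'."""
    slots = 2
    buf = [None] * slots
    free = threading.Semaphore(slots)  # slots the producer may overwrite
    filled = threading.Semaphore(0)    # slots ready for the consumer
    out = []
    stalled = threading.Event()

    def producer():
        for i in range(n_items):
            # A timeout is used only so the demo can report the stall
            # instead of hanging forever.
            if not free.acquire(timeout=timeout):
                stalled.set()
                return
            buf[i % slots] = i
            filled.release()

    def consumer():
        for i in range(n_items):
            if not filled.acquire(timeout=timeout):
                return
            out.append(buf[i % slots] * 2)
            if release_slots:
                free.release()  # forgetting this starves the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, stalled.is_set()
```

With `release_slots=False` the producer fills both slots and then waits on a semaphore that nobody will ever signal, which without the timeout would be a genuine deadlock.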
Steven Johnson
@steven-johnson
Nov 27 2017 18:55
"neither train shall move until the other one has passed"
Zalman Stern
@zvookin
Nov 27 2017 19:25
The DMA device doesn't ever compute in a strict sense
The current design has the copies always be synchronous
In fact copies are always that way right? When the copy returns it is done.
When we extend lowering of async to something other than semaphores and threads, we will likely need to expose asynchronous copies, and perhaps an entire event model in general.
Andrew Adams
@abadams
Nov 27 2017 19:26
Yeah, but even if I make them async and ignore the problems that introduces, we're on a single stream
Zalman Stern
@zvookin
Nov 27 2017 19:27
Well that's sort of true for Hexagon DMA if there's one engine.
I.e. yes for a device, but since the device is modeling a piece of hardware that has that restriction anyway, probably....
Andrew Adams
@abadams
Nov 27 2017 19:27
I was initially trying to overlap device compute with device<->host copies. In the cuda backend this needs a lot more work for anything like that to happen.
Zalman Stern
@zvookin
Nov 27 2017 19:28
In order to have DMA going both directions one needs two devices and those will not be a single stream.
Andrew Adams
@abadams
Nov 27 2017 19:28
That's all I meant
Zalman Stern
@zvookin
Nov 27 2017 19:29
So I think we need to keep the current model, but we should consider introducing a new one that allows much more asynchrony.
Whether that is just async versions of the copy routines or a full event model I'm not sure
I'm not sure it makes much sense to consider until we are looking at a different way of lowering async
I.e. in the current model, synchronous copy is probably required anyway. The thread doing it would just wait immediately if it wasn't synchronous.
"Event model" likely amounts to exposing the semaphore abstraction and arranging for it to be signaled by the device support code somehow
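A rough sketch of that idea, under the assumption that an asynchronous copy hands back a semaphore that the device support code signals on completion. `copy_async` and `copy_sync` are hypothetical names, not functions in the Halide runtime, and a plain Python thread stands in for the device.

```python
import threading

def copy_async(src, dst):
    """Hypothetical async copy: start the copy, return a completion semaphore."""
    done = threading.Semaphore(0)

    def work():
        dst[:] = src      # stand-in for the actual device copy
        done.release()    # "device support code" signals completion

    threading.Thread(target=work).start()
    return done

def copy_sync(src, dst):
    """The current model: the calling thread waits immediately."""
    copy_async(src, dst).acquire()
```

In this sketch the synchronous copy is just the asynchronous one followed by an immediate wait, which is the point made above about the thread-based lowering.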
Andrew Adams
@abadams
Nov 27 2017 19:31
Overlapping CPU computation with device stuff works fine. All use of the device_api occurs on a single thread. All the CPU compute occurs on another thread.
That's what I'm targeting for now
Zalman Stern
@zvookin
Nov 27 2017 19:32
"CPU" is not necessarily correct there. It can be a different device too.
That is the common case for Hexagon right?
Andrew Adams
@abadams
Nov 27 2017 19:32
For the hexagon DMA work so far, "CPU" is hexagon, and "device" is the dma engine
Zalman Stern
@zvookin
Nov 27 2017 19:33
Ok, but the Hexagon may be invoked via offload.
Andrew Adams
@abadams
Nov 27 2017 19:33
But yeah, I think it would work to have cross-device stuff going on in parallel
All use of each device interface would be on a single distinct thread
Zalman Stern
@zvookin
Nov 27 2017 19:33
yes
Andrew Adams
@abadams
Nov 27 2017 19:33
and there's no cross-device-interface serialization I think
so it would just work
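A small sketch of the "one thread per device interface" arrangement, assuming nothing about the real runtime. `DeviceThread` and `run_two_devices` are hypothetical helpers: each device gets its own queue and worker thread, so calls to one device are serialized with each other but not with calls to the other device.

```python
import queue
import threading

class DeviceThread:
    """Hypothetical helper: runs every call for one device on one thread."""

    def __init__(self, name):
        self.name = name
        self.log = []  # results, in the order the calls were executed
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, name=name)
        self._thread.start()

    def _run(self):
        while True:
            call = self._calls.get()
            if call is None:  # sentinel: shut down
                return
            self.log.append(call())

    def submit(self, call):
        self._calls.put(call)

    def close(self):
        self._calls.put(None)
        self._thread.join()

def run_two_devices():
    dma = DeviceThread("dma")
    gpu = DeviceThread("gpu")
    for i in range(3):
        dma.submit(lambda i=i: ("copy", i))
        gpu.submit(lambda i=i: ("kernel", i))
    dma.close()
    gpu.close()
    return dma.log, gpu.log
```

Each log comes back in submission order for its own device, while the two worker threads are free to interleave with each other.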
Zalman Stern
@zvookin
Nov 27 2017 19:33
That is what I was highlighting.
The only issue I see with this design is that the overhead of the thread may be too high to use for very lightweight hardware synchronization mechanisms. Other than that, I don't see a lot of reason to do the customized lowering.
I need to make a couple more changes to the hexagon DMA
Will try to do so today.
The test only calls buffer_copy, which is mostly as it should be.
Dillon Sharlet
@dsharletg
Nov 27 2017 19:37
So BTW regarding hexagon offloading, I've been thinking we simply punt on that for now
and only target standalone
anything that we get working on standalone can be made to work with offloading without solving any "hard" problems like async + storage folding, it just might involve a lot of plumbing and infrastructure
Steven Johnson
@steven-johnson
Nov 27 2017 19:39
re: the windows buildbots, proposed fix is out there.
Zalman Stern
@zvookin
Nov 27 2017 19:40
I'll have to consider the implications, but I think the current stuff just works if the DMA things are scheduled inside an offloaded thing.
Dillon Sharlet
@dsharletg
Nov 27 2017 19:41
I think there might be some hiccups with the device interface
that will need to get plumbed over via offloading
and I don't think that will happen transparently right now
it might be easy to make it work though
Zalman Stern
@zvookin
Nov 27 2017 19:41
yeah, that's small boogs territory.
I guess I'm expecting it will have to work with offload very early on to have a useful test.
Andrew Adams
@abadams
Nov 27 2017 20:24
@dsharletg the host->device case also works, but there's no benefit for cuda because the version without async already manages to overlap the cpu compute and copies in a subtle way.
Confused me for a while.
CPU compute -> synchronous copy -> async kernel launch -> next batch of CPU compute (overlapped with GPU kernel launch) -> synchronous copy (stalls until kernel launch is done) ->
Wait, so I guess the CPU compute is hidden under the GPU compute
not the copy
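The sequence above can be checked with a toy timeline model. This is only a sketch: `timeline` and its per-stage durations are made up for illustration, and it assumes a single device stream where a synchronous copy cannot start until the previous kernel has finished.

```python
def timeline(n_batches, cpu, copy, kernel):
    """Return the total time for the non-async schedule described above."""
    host = 0      # time at which the host thread is next free
    gpu_done = 0  # time at which the previously launched kernel finishes
    for _ in range(n_batches):
        host += cpu                  # CPU compute for this batch
        start = max(host, gpu_done)  # sync copy stalls until the kernel is done
        host = start + copy          # host blocks for the whole copy
        gpu_done = host + kernel     # async launch: host returns immediately
    return max(host, gpu_done)
```

With 3 batches and durations `cpu=2`, `copy=1`, `kernel=5`, fully serial execution would take 24 time units and this schedule takes 20. The 4 units saved are the CPU compute of the second and third batches, which runs while the previous kernel is still executing; the copies themselves are never overlapped with anything.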
Dillon Sharlet
@dsharletg
Nov 27 2017 20:47
That's great news!
Steven Johnson
@steven-johnson
Nov 27 2017 21:18
I’m restarting the buildbot master now
Steven Johnson
@steven-johnson
Nov 27 2017 22:10
On the recent issue of exported symbols varying between opt levels: it looks like CMake added a feature in 3.4 that attempts to auto-build a .def file for you on Windows, with the net effect of (mostly) acting like the gcc-ish default of “export all symbols”: https://blog.kitware.com/create-dlls-on-windows-without-declspec-using-new-cmake-export-all-feature/
I haven’t tried it (and we are talking about CMake here so who knows)...
Steven Johnson
@steven-johnson
Nov 27 2017 23:08
We explicitly forbid using ‘.’ in a Func name since we use that as a separator internally, but we don’t seem to have a similar constraint on Var name. Deliberate or accidental?
Andrew Adams
@abadams
Nov 27 2017 23:09
Var names are not uniqued either
Accidental I think
Zalman Stern
@zvookin
Nov 27 2017 23:09
Var names are not uniqued by design
They're value types
Steven Johnson
@steven-johnson
Nov 27 2017 23:09
Right
Andrew Adams
@abadams
Nov 27 2017 23:10
Lack of '.' enforcement is the accidental thing
Steven Johnson
@steven-johnson
Nov 27 2017 23:11
Just idly wondering if more constraints on the names allowed would give us more flexibility in the future. (e.g. GeneratorParam names are limited to C-style identifier rules, with additional constraints on underscore usage). Probably overthinking it.
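As a sketch of what such a constraint could look like: the check below accepts only plain C-style identifiers, which also rules out `.` as a side effect. `is_c_identifier` is a hypothetical helper, not Halide's actual validation, and the additional underscore rules mentioned for GeneratorParam names are not modeled here.

```python
import re

# Assumption: "C-style identifier" means a letter or underscore followed by
# letters, digits, or underscores. This is not Halide's real check.
_C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_c_identifier(name):
    """Return True if name is a plain C-style identifier."""
    return _C_IDENTIFIER.fullmatch(name) is not None
```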
Re: the windows buildbots: I updated the scripts and did a buildbot stop and start, but builds completing since then still seem to be using the old, broken windows testing approach. I wonder, do the workers queue up the commands on the worker (and thus this could be just stale builds completing)? Investigating...
Steven Johnson
@steven-johnson
Nov 27 2017 23:31
Hmm, this is odd: I stopped buildbot again; when restarting, it is now failing with “could not find buildbot-www; is it installed?” which is something I haven’t seen before. @abadams, is it wise/unwise to restart the entire buildbot VM when updating?
Steven Johnson
@steven-johnson
Nov 27 2017 23:43
logout, log back in, now starting it is telling me I need a txrequests package installed. Oy.
Just gonna reboot the VM.
Nope. Still busticated.
Steven Johnson
@steven-johnson
Nov 27 2017 23:50
bah: chmod is not my friend
chmod’ing stuff to my user seems to have healed it, per comments in @abadams document — sadly, the failure modes were obscure and unrelated enough that I didn’t think to try that