Steven Johnson
@steven-johnson
"neither train shall move until the other one has passed"
Zalman Stern
@zvookin
The DMA device doesn't ever compute in a strict sense
The current design has the copies always be synchronous
In fact, copies are always that way, right? When the copy returns, it is done.
When we extend lowering of async to something other than semaphores and threads, we will likely need to expose asynchronous copies, and perhaps an entire event model in general.
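The distinction Zalman is drawing, a copy whose return means completion versus a hypothetical "async version of the copy routines", might be sketched like this (all names here are illustrative, not Halide's actual runtime API):

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// The current contract: by the time the copy returns, the data has landed.
void copy_to_device(const std::vector<uint8_t> &host, std::vector<uint8_t> &dev) {
    dev = host;  // stand-in for the real driver/DMA call
}

// One shape an asynchronous variant could take: return a handle the caller
// must wait on before touching the destination. Hypothetical API.
std::future<void> copy_to_device_async(const std::vector<uint8_t> &host,
                                       std::vector<uint8_t> &dev) {
    return std::async(std::launch::async, [&host, &dev] { dev = host; });
}
```

The point is that the async form forces the caller (or the lowering) to track completion explicitly, which is what motivates the event-model discussion below.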
Andrew Adams
@abadams
Yeah, but even if I make them async and ignore the problems that introduces, we're on a single stream
Zalman Stern
@zvookin
Well that's sort of true for Hexagon DMA if there's one engine.
I.e. yes for a device, but since the device is modeling a piece of hardware that has that restriction anyway, probably....
Andrew Adams
@abadams
I was initially trying to overlap device compute with device<->host copies. In the cuda backend this needs a lot more work for anything like that to happen.
Zalman Stern
@zvookin
In order to have DMA going both directions one needs two devices and those will not be a single stream.
Andrew Adams
@abadams
That's all I meant
Zalman Stern
@zvookin
So I think we need to keep the current model, but we should consider introducing a new one that allows much more asynchrony.
Whether that is just async versions of the copy routines or a full event model I'm not sure
I'm not sure it makes much sense to consider until we are looking at a different way of lowering async
I.e. in the current model, synchronous copy is probably required anyway. The thread doing it would just wait immediately if it wasn't synchronous.
"Event model" likely amounts to exposing the semaphore abstraction and arranging for it to be signaled by the device support code somehow
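One way the semaphore-based "event model" could look, as a rough sketch: the device support code signals a semaphore when the copy completes, and the consumer acquires it before reading the destination. The semaphore is hand-rolled here and every name is made up for illustration:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal counting semaphore (std::counting_semaphore is C++20-only).
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
public:
    void release() {
        { std::lock_guard<std::mutex> lock(m); ++count; }
        cv.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }
};

// Hypothetical async copy: the "device side" (a detached thread here)
// signals `done` once the destination is fully written.
void async_copy(const int *src, int *dst, int n, Semaphore &done) {
    std::thread([=, &done] {
        for (int i = 0; i < n; i++) dst[i] = src[i];
        done.release();
    }).detach();
}
```

The consumer calls `done.acquire()` before reading `dst`, which is the happens-before edge the lowering would have to arrange.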
Andrew Adams
@abadams
Overlapping CPU computation with device stuff works fine. All use of the device_api occurs on a single thread. All the CPU compute occurs on another thread.
That's what I'm targeting for now
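The structure Andrew describes, funneling every call into one device interface through a single dedicated thread while CPU compute proceeds elsewhere, is essentially a per-device work queue. A minimal sketch, with all names invented for illustration:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// All device-API calls submitted here run serialized on one worker thread,
// so the device interface never sees concurrent callers.
class DeviceThread {
    std::queue<std::function<void()>> work;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread worker;  // declared last so the members above exist first
public:
    DeviceThread() : worker([this] {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !work.empty(); });
                if (work.empty()) return;  // drained and shut down
                job = std::move(work.front());
                work.pop();
            }
            job();  // the actual device-API call
        }
    }) {}
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m); work.push(std::move(job)); }
        cv.notify_one();
    }
    ~DeviceThread() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();
    }
};
```

With one such queue per device interface and no cross-queue serialization, independent devices overlap for free, which is the property discussed below.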
Zalman Stern
@zvookin
"CPU" is not necessarily correct there. It can be a different device too.
That is the common case for Hexagon right?
Andrew Adams
@abadams
For the hexagon DMA work so far, "CPU" is hexagon, and "device" is the dma engine
Zalman Stern
@zvookin
Ok, but the Hexagon may be invoked via offload.
Andrew Adams
@abadams
But yeah, I think it would work to have cross-device stuff going on in parallel
All use of each device interface would be on a single distinct thread
Zalman Stern
@zvookin
yes
Andrew Adams
@abadams
and there's no cross-device-interface serialization I think
so it would just work
Zalman Stern
@zvookin
That is what I was highlighting.
The only issue I see with this design is that the overhead of the thread may be too high to use for very lightweight hardware synchronization mechanisms. Other than that, I don't see a lot of reason to do the customized lowering.
I need to make a couple more changes to the hexagon DMA
Will try to do so today.
The test only calls buffer_copy, which is mostly as it should be.
Dillon Sharlet
@dsharletg
So BTW regarding hexagon offloading, I've been thinking we simply punt on that for now
and only target standalone
anything we get working on standalone can be made to work with offloading without solving any "hard" problems like async + storage folding; it just might involve a lot of plumbing and infrastructure
Steven Johnson
@steven-johnson
re: the Windows buildbots, a proposed fix is out there.
Zalman Stern
@zvookin
I'll have to consider the implications, but I think the current stuff just works if the DMA things are scheduled inside an offloaded thing.
Dillon Sharlet
@dsharletg
I think there might be some hiccups with the device interface
that will need to get plumbed over via offloading
and I don't think that will happen transparently right now
it might be easy to make it work though
Zalman Stern
@zvookin
yeah, that's small boogs territory.
I guess I'm expecting it will have to work with offload very early on to have a useful test.
Andrew Adams
@abadams
@dsharletg the host->device case also works, but there's no benefit for cuda because the version without async already manages to overlap the cpu compute and copies in a subtle way.
Confused me for a while.
CPU compute -> synchronous copy -> async kernel launch -> next batch of CPU compute (overlapped with GPU kernel launch) -> synchronous copy (stalls until kernel launch is done) ->
Wait, so I guess the CPU compute is hidden under the GPU compute
not the copy
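The subtlety Andrew describes is that the kernel "launch" returns immediately, so the next batch of CPU compute overlaps the kernel, and the subsequent synchronous copy is where the stall actually lands. A toy version of that schedule (the atomic flag and all names are illustrative, not Halide's runtime or the CUDA API):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

std::atomic<bool> kernel_finished{false};

// Asynchronous launch: returns immediately while the "kernel" runs.
std::future<void> launch_kernel_async() {
    return std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        kernel_finished = true;
    });
}

// The synchronous copy cannot start until the kernel's result exists,
// so this is the point where the pipeline actually waits.
void synchronous_copy(std::future<void> &kernel) {
    kernel.wait();
}
```

Between the launch and the copy, the caller is free to run the next batch of CPU compute, which is exactly the overlap that made the non-async version look faster than expected.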
Dillon Sharlet
@dsharletg
That's great news!
Steven Johnson
@steven-johnson
I’m restarting the buildbot master now