Andrew Adams
@abadams
wait...
Why did that test take 30 mins to compile llvm?
It's for a fixed llvm version
should have been cached
That's worth investigating
Maybe those steps should explicitly be skipped on non-trunk builds if llvm-config/clang exists in the expected place.
Right now it relies on svn up not touching any files, and on the cmake command recognizing that
Andrew Adams
@abadams
@dsharletg for the async hexagon dma, the desire is to overlap copies from device to host with computation on host, right?
Dillon Sharlet
@dsharletg
right
Andrew Adams
@abadams
I have that working in the async branch. Overlapping copies to device with computation on device is harder, because device APIs are already sort of asynchronous, but internally synchronize copies with compute. That's true for cuda at least.
Dillon Sharlet
@dsharletg
how about overlapping copies to device with computation on host?
overlapping copies between device <-> host and computation on host I think covers the DMA use case
Andrew Adams
@abadams
Let me try that.
Andrew Adams
@abadams
deadlock, cool
Steven Johnson
@steven-johnson
clearly you have a different definition of “cool” in mind
Dillon Sharlet
@dsharletg
Is there a way to trigger the build bots for halide/Halide#2554 ? looks like it didn't happen automatically
other than just pushing a commit to it of course
Andrew Adams
@abadams
Turns out never releasing semaphores is a leading cause of deadlock
Steven Johnson
@steven-johnson
"neither train shall move until the other one has passed"
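For anyone following along, the semaphore deadlock joked about above can be sketched roughly like this. It is a minimal illustration, not the code in the async branch: `Semaphore` and `run_pipeline` are hypothetical names, and the semaphore is a plain mutex/condition-variable stand-in rather than Halide's runtime type.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counting semaphore built from standard primitives.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void release() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++count_;
        cv_.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};

// A producer fills a two-slot ring buffer while a consumer sums it.
int run_pipeline(int n) {
    const int slots = 2;
    std::vector<int> ring(slots);
    Semaphore free_slots(slots);  // slots the producer may write
    Semaphore full_slots(0);      // slots the consumer may read
    int sum = 0;

    std::thread producer([&] {
        for (int i = 0; i < n; i++) {
            free_slots.acquire();
            ring[i % slots] = i;
            full_slots.release();  // drop this and the consumer waits forever
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < n; i++) {
            full_slots.acquire();
            sum += ring[i % slots];
            free_slots.release();  // drop this and the producer stalls once the ring is full
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

Every `acquire` on one side has a matching `release` on the other. Remove either `release` and both threads end up waiting on each other, which is the two-trains situation.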
Zalman Stern
@zvookin
The DMA device doesn't ever compute in a strict sense
The current design has the copies always be synchronous
In fact copies are always that way right? When the copy returns it is done.
When we extend lowering of async to something other than semaphores and threads, we will likely need to expose asynchronous copies, and perhaps an entire event model in general.
Andrew Adams
@abadams
Yeah, but even if I make them async and ignore the problems that introduces, we're on a single stream
Zalman Stern
@zvookin
Well that's sort of true for Hexagon DMA if there's one engine.
I.e. yes for a device, but since the device is modeling a piece of hardware that has that restriction anyway, probably....
Andrew Adams
@abadams
I was initially trying to overlap device compute with device<->host copies. In the cuda backend this needs a lot more work for anything like that to happen.
Zalman Stern
@zvookin
In order to have DMA going both directions one needs two devices and those will not be a single stream.
Andrew Adams
@abadams
That's all I meant
Zalman Stern
@zvookin
So I think we need to keep the current model, but we should consider introducing a new one that allows much more asynchrony.
Whether that is just async versions of the copy routines or a full event model I'm not sure
I'm not sure it makes much sense to consider until we are looking at a different way of lowering async
I.e. in the current model, synchronous copy is probably required anyway. The thread doing it would just wait immediately if it wasn't synchronous.
"Event model" likely amounts to exposing the semaphore abstraction and arranging for it to be signaled by the device support code somehow
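A rough sketch of what "a semaphore signaled by the device support code" might look like, assuming a worker thread stands in for the DMA engine. `Semaphore`, `async_copy` and `copy_then_sum` are made-up names for illustration only; nothing here is an existing Halide runtime API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical counting semaphore, used here as the completion event.
struct Semaphore {
    explicit Semaphore(int n) : count(n) {}
    void release() {
        std::lock_guard<std::mutex> lock(m);
        ++count;
        cv.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }
    std::mutex m;
    std::condition_variable cv;
    int count;
};

// Stand-in for an asynchronous copy: returns immediately and signals
// `done` from the "device support code" once the data has landed.
std::thread async_copy(const std::vector<int> &src, std::vector<int> &dst,
                       Semaphore &done) {
    return std::thread([&src, &dst, &done] {
        dst = src;       // the transfer itself
        done.release();  // completion event
    });
}

int copy_then_sum(int n) {
    std::vector<int> src(n), dst;
    std::iota(src.begin(), src.end(), 0);
    Semaphore done(0);

    std::thread engine = async_copy(src, dst, done);

    // Host work that overlaps with the copy; it must not touch dst.
    int host_work = 0;
    for (int i = 0; i < n; i++) host_work += 1;

    done.acquire();  // wait only at the point where the data is needed
    int sum = std::accumulate(dst.begin(), dst.end(), 0);
    engine.join();
    return sum + host_work;
}
```

A synchronous copy is the special case where the caller acquires `done` immediately after issuing the copy, which matches the point above that a thread with nothing else to do would just wait right away.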
Andrew Adams
@abadams
Overlapping CPU computation with device stuff works fine. All use of the device_api occurs on a single thread. All the CPU compute occurs on another thread.
That's what I'm targeting for now
Zalman Stern
@zvookin
"CPU" is not necessarily correct there. It can be a different device too.
That is the common case for Hexagon right?
Andrew Adams
@abadams
For the hexagon DMA work so far, "CPU" is hexagon, and "device" is the dma engine
Zalman Stern
@zvookin
Ok, but the Hexagon may be invoked via offload.
Andrew Adams
@abadams
But yeah, I think it would work to have cross-device stuff going on in parallel
All use of each device interface would be on a single distinct thread
Zalman Stern
@zvookin
yes
Andrew Adams
@abadams
and there's no cross-device-interface serialization I think
so it would just work
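Roughly, the "one distinct thread per device interface" arrangement could be sketched as below. This is an assumption-laden illustration, not the async lowering itself: `DeviceThread` and `run_two_devices` are hypothetical, and the "device work" is just arithmetic.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical wrapper: every call into one device interface is
// funneled through a single dedicated thread.
class DeviceThread {
public:
    DeviceThread() : worker_([this] { run(); }) {}
    ~DeviceThread() {
        submit(nullptr);  // an empty job is the shutdown sentinel
        worker_.join();   // earlier jobs are drained before this returns
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !jobs_.empty(); });
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            if (!job) return;
            job();
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::thread worker_;  // declared last so the queue exists before it starts
};

int run_two_devices() {
    int a = 0, b = 0, host = 0;
    {
        DeviceThread dma_in, dma_out;  // two interfaces, two threads
        for (int i = 1; i <= 100; i++) {
            dma_in.submit([&a, i] { a += i; });
            dma_out.submit([&b, i] { b += 2 * i; });
            host += 3 * i;  // host compute overlaps with both
        }
    }  // destructors drain both queues and join
    return a + b + host;
}
```

Each interface serializes only its own jobs, and the two queues share no lock, so there is no serialization across interfaces. The cost flagged in the discussion still applies: each interface pays for a thread and a queue handoff.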
Zalman Stern
@zvookin
That is what I was highlighting.
The only issue I see with this design is that the overhead of the thread may be too high to use for very lightweight hardware synchronization mechanisms. Other than that, I don't see a lot of reason to do the customized lowering.
I need to make a couple more changes to the hexagon DMA
Will try to do so today.