Zalman Stern
@zvookin
I was thinking we could switch the default and kill the flag later, but it’s tractable to evaluate beforehand.
Dillon and I are going to make sure the intrinsic and/or ForceStrictFloat flag meets the use cases it was added for.
(“Beforehand” above meaning before merging the PR.)
Zalman Stern
@zvookin
I'm still a bit unsure about llvm's level of effect. I will try to craft some more test cases that show a difference at the llvm level. So far the tests show much larger effects from our simplifier and from a different order of computation due to scheduling changes.
Andrew Adams
@abadams
I'd check the fft
We know that's very sensitive to llvm making different fp decisions
Zalman Stern
@zvookin
That’s a great idea I hadn’t thought of.
Both for performance and accuracy.
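The fft is a good probe because floating-point addition is not associative: any pass that reorders a long summation, as fast-math modes may, changes the rounding. A minimal standalone C++ illustration of the effect (not from the discussion):

```cpp
#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    // The same three values, summed in two orders:
    printf("%g\n", (a + b) + c);  // prints 1
    printf("%g\n", a + (b + c));  // prints 0: c is absorbed by b's rounding
    return 0;
}
```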
Dillon Sharlet
@dsharletg
another good example for fast-math tests might be a naive DFT, which would be a one-liner reduction, so it's easy to write into a test
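For concreteness, here is one way that one-liner could look in Halide's C++ front end; the signal, size, and names are invented for the sketch, and a real test would presumably compare this reference against the fft output with and without strict float:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    const int N = 16;                      // made-up size for the sketch
    const float PI = 3.14159265358979f;
    Var x, k;

    // Hypothetical real-valued test signal.
    Func in;
    in(x) = cos(2 * PI * x / N);

    // The naive DFT as a single reduction over n:
    //   out(k) = sum_n in(n) * exp(-2*pi*i*n*k / N),
    // with the real and imaginary parts carried as a Tuple.
    RDom n(0, N);
    Expr w = -2 * PI * n * k / N;
    Func dft;
    dft(k) = Tuple(sum(in(n) * cos(w)), sum(in(n) * sin(w)));

    Realization out = dft.realize(N);
    return 0;
}
```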
Dillon Sharlet
@dsharletg
I'm looking at the build bots more to see if we can speed anything up
I noticed that the mac build: https://buildbot.halide-lang.org/master/#/builders/57/builds/148/steps/14/logs/stdio took 20+ minutes, but the log itself says it only took 1 min 40 s
Andrew Adams
@abadams
The performance step waits to acquire an exclusive lock on the machine
It means that it had to hang around for 20 mins before the other builder running on the machine reached a point where it could pause
Dillon Sharlet
@dsharletg
I see, so not something to fix...
Andrew Adams
@abadams
yeah
the metal tests legitimately look really slow. 1442 seconds
Wonder if one of them is the long pole there
wait...
Why did that test take 30 mins to compile llvm?
It's for a fixed llvm version
should have been cached
That's worth investigating
Maybe those steps should explicitly be skipped on non-trunk builds if llvm-config/clang exists in the expected place.
Right now it relies on svn up not touching any files, and on the cmake command recognizing that
Andrew Adams
@abadams
@dsharletg for the async Hexagon DMA, the desire is to overlap copies from device to host with computation on the host, right?
Dillon Sharlet
@dsharletg
right
Andrew Adams
@abadams
I have that working in the async branch. Overlapping copies to device with computation on device is harder, because device APIs are already sort of asynchronous, but internally synchronize copies with compute. That's true for cuda at least.
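For reference, the behavior Andrew describes maps to CUDA streams: work issued on a single stream serializes, so overlapping a copy with device work requires separate streams (and pinned host memory). A host-only C++ sketch under those assumptions, not Halide's runtime code, with cudaMemsetAsync standing in for a real compute kernel:

```cpp
#include <cuda_runtime.h>

void overlap_copy_and_compute(float *pinned_host, float *dev_in,
                              float *dev_out, size_t n) {
    cudaStream_t copy_s, compute_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&compute_s);

    // On one stream these two calls would execute back-to-back; on two
    // streams the copy can overlap the device-side work. The host buffer
    // must be pinned (cudaMallocHost), or the "async" copy silently
    // degrades to a synchronous one.
    cudaMemcpyAsync(dev_in, pinned_host, n * sizeof(float),
                    cudaMemcpyHostToDevice, copy_s);
    cudaMemsetAsync(dev_out, 0, n * sizeof(float), compute_s);

    cudaStreamSynchronize(copy_s);
    cudaStreamSynchronize(compute_s);
    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(compute_s);
}
```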
Dillon Sharlet
@dsharletg
how about overlapping copies to device with computation on host?
overlapping copies between device <-> host and computation on host I think covers the DMA use case
Andrew Adams
@abadams
Let me try that.
Andrew Adams
@abadams
deadlock, cool
Steven Johnson
@steven-johnson
clearly you have a different definition of “cool” in mind
Dillon Sharlet
@dsharletg
Is there a way to trigger the build bots for halide/Halide#2554? Looks like it didn't happen automatically
other than just pushing a commit to it of course
Andrew Adams
@abadams
Turns out never releasing semaphores is a leading cause of deadlock
Steven Johnson
@steven-johnson
"neither train shall move until the other one has passed"
Zalman Stern
@zvookin
The DMA device doesn't ever compute in a strict sense
The current design has the copies always be synchronous
In fact, copies are always that way, right? When the copy returns, it is done.
When we extend lowering of async to something other than semaphores and threads, we will likely need to expose asynchronous copies, and perhaps an entire event model in general.
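To make the idea concrete, here is a purely hypothetical shape for such an interface; every name below except halide_buffer_t is invented for this sketch and does not exist in the Halide runtime:

```cpp
struct halide_event_t;  // hypothetical opaque handle for a pending operation

// Hypothetical: begin a device->host copy and return immediately,
// handing back an event that identifies the in-flight copy.
extern "C" halide_event_t *halide_copy_to_host_async(
    void *user_context, struct halide_buffer_t *buf);

// Hypothetical: block until the operation behind the event has completed.
extern "C" int halide_event_wait(void *user_context, halide_event_t *event);
```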
Andrew Adams
@abadams
Yeah, but even if I make them async and ignore the problems that introduces, we're on a single stream
Zalman Stern
@zvookin
Well that's sort of true for Hexagon DMA if there's one engine.
I.e. yes for a device, but since the device is modeling a piece of hardware that has that restriction anyway, probably...
Andrew Adams
@abadams
I was initially trying to overlap device compute with device<->host copies. In the cuda backend this needs a lot more work for anything like that to happen.
Zalman Stern
@zvookin
In order to have DMA going in both directions, one needs two devices, and those will not be a single stream.
Andrew Adams
@abadams
That's all I meant
Zalman Stern
@zvookin
So I think we need to keep the current model, but we should consider introducing a new one that allows much more asynchrony.
Whether that is just async versions of the copy routines or a full event model, I'm not sure
I'm not sure it makes much sense to consider until we are looking at a different way of lowering async