These are chat archives for halide/Halide

15 Nov 2017
Andrew Adams
@abadams
Nov 15 2017 00:10
Just catching up on this conversation...
I think what's going on here is that the pattern matching done in the backends can be made incremental and some logic should be shared across backends.
If intrinsics introduced as part of this ever get transformed back into IR, that seems like a bad code smell to me.
These are intrinsics used while transforming something away from Halide IR into LLVM IR
Lerp is the opposite - it's introduced in the front-end, and optionally lowered into Halide IR
Pattern matching a saturating add, and then much of the time turning it back into IR to codegen seems pretty wonky
Another example is pmaddwd
Zalman Stern
@zvookin
Nov 15 2017 00:13
Yes, that code smell is why I only want to use intrinsics if we are planning to do the same thing with them we do with Lerp
Andrew Adams
@abadams
Nov 15 2017 00:13
I think there's a valid role for Call nodes here that is entirely different to lerp
Zalman Stern
@zvookin
Nov 15 2017 00:13
If they're meant to only exist in the backend, I'd like to encode that constraint somehow
Andrew Adams
@abadams
Nov 15 2017 00:13
an existing example is pmaddwd
we pattern match some IR, turn it into a call to that, and the implementation is in .ll
Zalman Stern
@zvookin
Nov 15 2017 00:15
The current state of intrinsics is largely undocumented I believe. But if one were reading IR.h, one would assume you can generate the intrinsics from any client of the IR right?
Andrew Adams
@abadams
Nov 15 2017 00:15
Yeah... we increasingly are going to have lowering passes that are really part of the backend
and those lowering passes may introduce backend-specific pieces of IR
Zalman Stern
@zvookin
Nov 15 2017 00:15
So if that isn't going to be the case, I'd like to make sure it is somehow made explicit
One way would be to introduce a new Call node type BackendIntrinsic
Andrew Adams
@abadams
Nov 15 2017 00:16
Currently we use PureExtern for this sort of thing
I think
Zalman Stern
@zvookin
Nov 15 2017 00:16
Which would only be produced by a custom lowering pass that the backend would call
Andrew Adams
@abadams
Nov 15 2017 00:16
because they get resolved at the .ll level
Zalman Stern
@zvookin
Nov 15 2017 00:17
Basically you're stating my "not an intrinsic" argument.
Andrew Adams
@abadams
Nov 15 2017 00:17
Yeah, I said it's a valid role for Call nodes, not intrinsics :)
Zalman Stern
@zvookin
Nov 15 2017 00:18
Yeah. Though I do like moving away from string typing on the function name :-)
Another high level question is how the matching works
We'd probably like it to be efficient
So not do a pass per thing we're trying to match
Andrew Adams
@abadams
Nov 15 2017 00:19
I would just do a peephole saturating math lowering pass called only by the backends that care, which introduces PureExtern calls to functions defined in .ll
Those functions are mostly going to be one-liners that call the equivalent llvm intrinsic
Doing it that way gives you an opportunity to do things like normalize vector widths
Zalman Stern
@zvookin
Nov 15 2017 00:19
Works so long as the support is only meant for llvm based backends
Andrew Adams
@abadams
Nov 15 2017 00:19
I think all backends can handle PureExtern calls to runtime functions
llvm or not
The C backend copy pastes source, or generates an extern decl
Zalman Stern
@zvookin
Nov 15 2017 00:20
I guess C and metal have inlines...
etc.
Andrew Adams
@abadams
Nov 15 2017 00:20
Yeah
Anyway, that's my 2c.
Zalman Stern
@zvookin
Nov 15 2017 00:21
I'd be a bit surprised if this just works for Metal, OpenCL, and GLSL, but it is probably just a matter of fixing a few things.
Dillon Sharlet
@dsharletg
Nov 15 2017 00:22
So BTW I'd actually prefer this to be a pass called from each backend that wants it
Andrew Adams
@abadams
Nov 15 2017 00:22
Yeah, that's what I meant
Then you don't handle it in the call visitor for the backend at all
it just resolves to an alwaysinline runtime function
Dillon Sharlet
@dsharletg
Nov 15 2017 00:22
yeah that's what I'd prefer, the only reason I don't lean towards that is it's kind of a pain to do that
Zalman Stern
@zvookin
Nov 15 2017 00:22
I have no problem with that, but if that is the only place the saturating_add thing is supposed to be made, I'd rather it not be an intrinsic.
Dillon Sharlet
@dsharletg
Nov 15 2017 00:22
need to override begin_func, and that requires some bookkeeping
Andrew Adams
@abadams
Nov 15 2017 00:23
?
To run inject the lowering pass?
s/run inject/run
Dillon Sharlet
@dsharletg
Nov 15 2017 00:23
yeah
it's not a big deal just mildly annoying
Zalman Stern
@zvookin
Nov 15 2017 00:24
The general thing to do would be to add a PassManager like thing for lowering. The backend would get to muck with the pass data structure. I'd really rather not do that.
Andrew Adams
@abadams
Nov 15 2017 00:27
Maybe a run_backend_passes virtual call on the CodeGen base class
that mutates the stmt
called by the base begin_func
Then at least it would be linear code within each backend
Really a tiny change to the code Dillon already linked
Dillon Sharlet
@dsharletg
Nov 15 2017 00:28
yeah I was thinking similar
Zalman Stern
@zvookin
Nov 15 2017 00:30
I'm down with this.
It would be nice to have a complete list of the runtime functions using this mechanism somewhere
Mostly for design guidance when doing new stuff
Or perhaps we should put together a doc describing all of this stuff -- IR nodes, intrinsics, runtime inlines calls, etc.
Zalman Stern
@zvookin
Nov 15 2017 00:53
Andrew, do you have time for an rfactor question?
Trying to do the simplest possible vectorization of a summation
Want to init an accumulator vector to 0, add successive vectors of the input into the accumulator, then sum the lanes of the accumulator once.
Can't seem to find an example that does this.
And figured it would be easy, but haven't gotten it yet.
Andrew Adams
@abadams
Nov 15 2017 02:41
I think that's: f.compute_root().update().split(r.x, rxo, rxi, 8).rfactor(rxi, u).compute_root().vectorize(u);
(Sorry for the delay - there was a tantrum to deal with)
Actually: f.compute_root().update().split(r.x, rxo, rxi, 8).rfactor(rxi, u).compute_root().vectorize(u).update().vectorize(u);
The intermediate returned by rfactor reduces using whole vectors, and f now just sums the lanes of the result
Zalman Stern
@zvookin
Nov 15 2017 02:44
I got it figured out with a bit of help from Dillon
Not sure why my first n attempts went awry. I was having some difficulty with Var::outermost() (what I was using for the compute_root).
I ended up with a lot of almost-right cases that just use one lane of the vector or do the horizontal reduction on every loop iteration
As with a lot of stuff, one mistake leads to another.
Steven Johnson
@steven-johnson
Nov 15 2017 02:54
Merging recent Halide into Google code, I'm seeing odd failures that appear to be LICM related..
E.g. golden tests off by 3 or more (where off by 1 would be concerning)
Disabling LICM (and in particular, just GroupLoopInvariants) heals most of these peculiarities, but I haven't yet found a smoking gun for why. Ran out of time today.
Zalman Stern
@zvookin
Nov 15 2017 03:26
https://blog.acolyer.org/2017/11/03/log20-fully-automated-optimal-placement-of-log-printing-statements-under-specified-overhead-threshold/amp/ . Haven't had time to read it yet. You say how much overhead you're willing to tolerate and it automatically adds the best debug log statements to your code.
Probably more interesting for implementing Halide than as a Halide feature.
Looks JVM oriented, so prolly "nevermind"
Still the idea of getting all the goodness of debug logging without writing debug statements in the code seems cool.
Shoaib Kamil
@shoaibkamil
Nov 15 2017 16:33
+1 to preserving source location @abadams. One of the biggest feature requests I get
Of course, if you've got some hairy templated generators, I'm not sure how useful it is to have this information
Andrew Adams
@abadams
Nov 15 2017 16:40
It just adds the ability to query Funcs for their definition location. Adding it to error messages or profiling results is more annoying, because you need to account for environments where it will just be an empty string because I never figured out how to parse the debug metadata (Windows, Google)
It may be that llvm has better support for parsing DWARF and PDB than it did when I wrote the dwarf parser
Andrew Adams
@abadams
Nov 15 2017 16:47
Oh, of course the other requirement for not just getting an empty string is -g -fno-omit-frame-pointer
Andrew Adams
@abadams
Nov 15 2017 18:47
@steven-johnson I'm worried about this GroupLoopInvariants thing. LMK as soon as you have something you can share. It should just be reassociating linear expressions.
Steven Johnson
@steven-johnson
Nov 15 2017 18:48
Yeah, I honestly don’t know if I’m chasing a red herring or not….
Andrew Adams
@abadams
Nov 15 2017 18:48
One intermediate thing to try would be only entering reassociate_summation for ints and uints
That would tell us if this is just a floating point reassociation issue
Steven Johnson
@steven-johnson
Nov 15 2017 18:51
I’ve narrowed it down to a case where the only difference in assembly output between good and bad appears to be the choice of xmm registers selected by LLVM in one section… literally identical instruction selection though. Which seems to be ~impossible to be the culprit, so I think I’ve gone astray in my debugging. Will let you know. (If I can narrow it down to a publicly-shareable chunk of code, will put it out there for inspection.)
Andrew Adams
@abadams
Nov 15 2017 18:52
Is it a floating-point pipeline?
Steven Johnson
@steven-johnson
Nov 15 2017 18:52
the part that fails is
oh wait! I see an actual instruction difference. Stay tuned.
Zalman Stern
@zvookin
Nov 15 2017
Puffs: a C-like language that uses bounds proving to guarantee no out-of-bounds buffer accesses.
Nigel Tao is the main person behind it.
Halide's bounds support has been mentioned in the context of this.
(Given the language is intended for secure file format support, I'm a bit disappointed that it doesn't have more domain specificity. The thinking on that appears to be that the bugs in file formats are always in the things that are weirdly irregular and pretty much require arbitrary code and can't be expressed with a more conventional grammar. I'm not sure I agree, but I haven't spent a lot of time on this.)
Andrew Adams
@abadams
Nov 15 2017 20:24
Neat
How was Halide's bounds support mentioned?
btw, I'm working on the async merge (finally)
Zalman Stern
@zvookin
Nov 15 2017 21:02
Puffs is similar to Halide in that it proves accesses are in bounds and requires the programmer to add annotations if the prover cannot prove safety.
Nigel mentioned wanting to look at Halide and I gave him some advice on paying careful attention to the error messages when the prover fails.
Puffs has a bit of a harder job because I think it is basically a side-effectful language.
But I haven't looked at it closely
My cynical take is that a Halide like "DSL embedded in C++ but backed up by llvm" approach to file format parsing would probably be more productive. But of course I'm biased.
Andrew Adams
@abadams
Nov 15 2017 21:04
I imagine he'd need to track bounds of things through stores/loads, including indirect ones
We largely give up at those boundaries
Zalman Stern
@zvookin
Nov 15 2017 21:04
I think he finesses that by encoding bounds restrictions in declarations.
I don't think it tries to track them through however
Shoaib Kamil
@shoaibkamil
Nov 15 2017 21:53
@abadams I'm trying to get someone to commit to working on better metadata parsing. It seems like it would be worth the effort
Also: is anyone currently working on Vulkan support? I don't want to duplicate effort
Andrew Adams
@abadams
Nov 15 2017 21:54
Not that I'm aware of
It's sorely needed
Someone asked me about our Android GPU story the other day and I had to deliver the verbal equivalent of a long shrug
Shoaib Kamil
@shoaibkamil
Nov 15 2017 22:01
Good to know. I'll try to carve out some weeks for it-- tired of giving the same long shrug
Zalman Stern
@zvookin
Nov 15 2017 23:13
I have about 75% of a Vulkan runtime
Haven't started on SPIR at all
Will try to get a branch pushed this week
@shoaibkamil (adding tag to make sure message gets seen)
My main take on Vulkan is that we need someone who has use cases to actually make it good
And it is not 100% clear it will be broadly applicable to lots of use cases
Zalman Stern
@zvookin
Nov 15 2017 23:18
There's basically zero use of non-cuda GPU backends at Google.
Shoaib Kamil
@shoaibkamil
Nov 15 2017 23:21
@zvookin We have some use cases internally, and a decent set of pipelines that I'd like to try to cover. Can you say a bit more about why you think it won't be broadly applicable?
In any case, it'll be good to branch off whatever you have. I'd started a SPIRV codegen at some point but abandoned it because of other priorities; I'll take a look to see if any of it is salvageable
Zalman Stern
@zvookin
Nov 15 2017 23:23
Generally it will probably make more sense for cases where one wants to run Halide code in a display pipeline than for GPGPU offload.
So far as I can tell the industry is more interested in putting out new GPU APIs every two years than making any of them really good.
So Vulkan is up against cuda and ROCm now.
Apple has Metal.
Maybe this time will be different, but I'm cynical.
I'm sure on some subset of platforms and devices it will have use for some applications.
Shoaib Kamil
@shoaibkamil
Nov 15 2017 23:26
I'm cynical as well -- Vulkan is the flavor-of-the-month and Khronos has little leverage over OS vendors.
But agreed about compute code in graphics pipelines being the major usage; it doesn't really make sense otherwise. Not at all a replacement for OpenCL