These are chat archives for halide/Halide

13th
Nov 2018
Zalman Stern
@zvookin
Nov 13 01:33
I'm seeing an issue where FuseGPUThreadLoops produces OpenGLCompute code that has multiple allocate nodes in the same block with the same name.
This seems like a bug.
Andrew Adams
@abadams
Nov 13 01:42
with overlapping lifetime?
Zalman Stern
@zvookin
Nov 13 01:43
I think originally no, but after fusion yes
Andrew Adams
@abadams
Nov 13 01:43
Was it a loop with an allocation in it that got unrolled?
Zalman Stern
@zvookin
Nov 13 01:43
I think so
Andrew Adams
@abadams
Nov 13 01:43
yeah, that sounds like it could break
Zalman Stern
@zvookin
Nov 13 01:44
Was there some discussion of something like this with Shoaib recently?
Andrew Adams
@abadams
Nov 13 01:44
Not that I recall
Zalman Stern
@zvookin
Nov 13 01:44
I can likely at least get a repro case out of this, maybe a fix.
There is a bunch of unrolling here
Or maybe not
I think it is something that is inlined in a bunch of update stages
Zalman Stern
@zvookin
Nov 13 01:50
Basically a sum inside a tuple argmin and there are multiple stages of the argmin
the temp var of the sum is getting reallocated a bunch of times with the same name in the same scope
At some point we really ought to write an IR validation pass
So it can be interposed between each lowering stage
Andrew Adams
@abadams
Nov 13 01:55
There are various ways you can get multiple realizations with the same name. That's one of them.
They're legal, but they shouldn't nest.
Zalman Stern
@zvookin
Nov 13 01:55
I don't think these nest. I think it's basically:
for (...) { allocate hsum$3 ... }
for (...) { allocate hsum$3 ... }
And the result is
for (...) { allocate hsum$3 ... allocate hsum$3... }
Andrew Adams
@abadams
Nov 13 01:57
So it's something like Block(allocate(...), allocate(...))
Zalman Stern
@zvookin
Nov 13 01:57
There's code to try and make the allocation shared.
Andrew Adams
@abadams
Nov 13 01:57
That's fine... but I guess codegen doesn't place appropriate scopes around it
Zalman Stern
@zvookin
Nov 13 01:57
The output is something like that
But those overlap right?
Andrew Adams
@abadams
Nov 13 01:57
allocations are only live for their bodies
Zalman Stern
@zvookin
Nov 13 01:57
Wouldn't { allocate foo ... { allocate foo ... } }
still be wrong?
Andrew Adams
@abadams
Nov 13 01:57
Yeah that would be bad
Zalman Stern
@zvookin
Nov 13 01:58
Isn't that what the Block looks like?
Andrew Adams
@abadams
Nov 13 01:58
but it sounds like this case isn't that?
Zalman Stern
@zvookin
Nov 13 01:58
The input isn't
the input is like
Andrew Adams
@abadams
Nov 13 01:58
Block {allocate foo {...}, allocate foo {...} }
is legal
Zalman Stern
@zvookin
Nov 13 01:58
{ { allocate foo... }, { allocate foo...} }
Ah, the allocation itself is supposed to open a scope
Andrew Adams
@abadams
Nov 13 01:58
yeah
the name is only legal inside the body of the allocate. But I'm guessing that's not how we emit them
Zalman Stern
@zvookin
Nov 13 01:59
The GLSL backend is broken
Andrew Adams
@abadams
Nov 13 01:59
Some extra braces might fix it
Zalman Stern
@zvookin
Nov 13 01:59
Yeah.
I'll try that.
Zalman Stern
@zvookin
Nov 13 02:08
This is definitely putting out nested allocations with the same name
    float _hsum[1];
    {
      float _global_argmin_4[1];
      {
        float _global_argmin_3[1];
        {
          float _global_argmin_2[1];
          {
            float _global_argmin_1[1];
            {
              float _global_argmin_0[1];
              {
                float _sum[1];
                {
                  float _sumXX1[1];
                  {
                    float _sumXX2[1];
                    {
                      float _hsum[1];
Andrew Adams
@abadams
Nov 13 02:09
nuts
Zalman Stern
@zvookin
Nov 13 02:09
I guess that might compile, but it doesn't seem like what we want
Specifically, I'm not convinced the code inside the braces is getting the right allocation.
Shoaib Kamil
@shoaibkamil
Nov 13 02:37
@zvookin @abadams this does look similar to the multiple-allocations bug I encountered as well
That was with Metal. But same symptom: multiple allocations with the same name within the same block
It’s been difficult to get a small test case out of it, just because the schedule that caused it was invalid for other reasons
Zalman Stern
@zvookin
Nov 13 02:39
I'll try to track it down later this week. The schedule for this is trivial -- gpu loop around a thing with reductions.
Manimaran-P
@Manimaran-P
Nov 13 04:58
I’m trying to use mul_hi and mul_lo to do a full 16-bit multiplication, and then be able to reinterpret the two discrete outputs as a single 32-bit integer. Is there a way to express this in Halide? as_uint32 only takes a single argument, and we're hoping there is something better than upconverting and combining in separate steps.
Zalman Stern
@zvookin
Nov 13 07:19
What is the target platform?
Generally the way to express it is to cast to 32 bit and do a multiply
The compiler is then responsible for lowering it to something efficient for the backend
Which is done for at least a few targets
The opencl backend can be taught to peephole this as well
Though frankly, the opencl compiler ought to do this itself.
If you want the end result as a 32-bit value, I don't see why computing the pieces separately is the right way to go.
Zalman Stern
@zvookin
Nov 13 07:30
There are a couple of ways we've looked at being able to embed literal code for languages like OpenCL as well, though I'm not sure any of them are fully plumbed through. (Basically externs for GPU languages.)
Manimaran-P
@Manimaran-P
Nov 13 08:22
@zvookin OK Thanks. Yes, my target platform is OpenCL
Steven Johnson
@steven-johnson
Nov 13 18:27
…are all of the buildbot workers down? investigating
master seems fine
but literally none of the workers are responding to pings. (I’m in SFO today and can’t inspect in person right now.)
Zalman Stern
@zvookin
Nov 13 18:42
There was some network futzing this weekend. I ignored the emails as it wasn't clear they were for anything we cared about.
Steven Johnson
@steven-johnson
Nov 13 18:43
goody
(I was OOO yesterday so didn’t notice till today)
Steven Johnson
@steven-johnson
Nov 13 18:54
OK, so there is a possibility that a network configuration change might be (partly) responsible for the buildbot outage. Stay tuned.
Andrew Adams
@abadams
Nov 13 19:19
llvm compile times scale really badly with the number of asserts
which means they scale really badly with the number of inputs
Anything above 30 inputs or so is agonizing
Zalman Stern
@zvookin
Nov 13 19:19
I thought we out-of-lined all of those.
Andrew Adams
@abadams
Nov 13 19:19
Each assert gets a separate basic block for the false case in which it calls the appropriate error function
Could be that
Zalman Stern
@zvookin
Nov 13 19:20
Can we move to a more programmatic approach? Like putting all the buffer pointers into an array and calling a runtime routine to do checks?
Andrew Adams
@abadams
Nov 13 19:20
Yeah I think we might need to
I don't think it's the code size per se, it seems to be the number of basic blocks
So anything we can do to branch less would be good
... not sure how to avoid branching on the calls to the error functions
or how to put them in a loop
I guess a validate_buffer runtime function that takes a buffer and enough other data to do all the assertions necessary
then call that in a loop
and we may or may not inline that function
and llvm may or may not unroll that loop
if both happens, we'll get the same code as today
Zalman Stern
@zvookin
Nov 13 19:24
I guess I was thinking of encoding the constraints and error message inserts into a data representation and then writing a single function that interprets the check language (so to speak)
The limiting factor is how many instructions it takes to setup the array for input
Ideally it would be a store per buffer pointer and then a call
The interpreter needs to be fast as well I suppose.
Andrew Adams
@abadams
Nov 13 19:26
it would also need to populate all the required sizes for the buffer, which currently exist as let stmts
Zalman Stern
@zvookin
Nov 13 19:26
I was hoping that could be made a constant array with holes for the buffer pointers
Andrew Adams
@abadams
Nov 13 19:26
but the required sizes are runtime-determined
so there need to be holes for those too
Zalman Stern
@zvookin
Nov 13 19:26
But if it doesn't all become constants, it doesn't work
Andrew Adams
@abadams
Nov 13 19:28
One thing we could do is && all the conditions, then branch into the error checking code if it's false. At that point we don't care about performance so we just recheck everything, and put it in an isolated function compiled at -O1
Zalman Stern
@zvookin
Nov 13 19:28
yeah
The only issue there is code size, but that was a thought I had. I was thinking to do the logical conjunction and record a bit of state to say which one failed, but to execute all the checks all the time.
I'd be a bit surprised if logical ands and branches are a lot different
Andrew Adams
@abadams
Nov 13 19:29
Yeah we could also group the conditions into batches of 64 and pack the bits of a uint64 with them
then use ctlz to see which one failed
but it's probably ok to just recheck them all too
The ctlz solution would let us use a big switch stmt for the calls to the error functions, which llvm might prefer.
Steven Johnson
@steven-johnson
Nov 13 21:41
update: the linux bots have reappeared for no obvious reason, so I’m starting them again
Shoaib Kamil
@shoaibkamil
Nov 13 21:43
It's starting to look more and more like the bots are developing their own intelligence
Evaluation metric: “compare your-output.txt to trusted-output.txt”. Solution: “delete trusted-output.txt, output nothing”