These are chat archives for halide/Halide

24th
Apr 2018
Zalman Stern
@zvookin
Apr 24 00:12
I think I agree that the second two atomic operations could be elided
I'm pretty sure the first one has to be there
Zalman Stern
@zvookin
Apr 24 00:17
Without the first atomic op, that code is textbook double-checked locking
All that said, this code should not be hitting issues with atomics at all. I expect it will generate libcalls on 32-bit, which should fail to link if libatomic is not included. This bit us on fast_sync. I'm going to guess we simply aren't testing CUDA in that situation.
The solution would be to use the __sync primitives
But that is problematic as there is no atomic load equivalent because those were designed for X86
which guarantees sequential consistency on normal loads
Switching this code to use ATOMIC_RELAXED on the calls inside the spin lock might be clearer
though in the device_release code, I think it uses the state variable in question (context) in its raw form.
The entire point of this was to elide the spin lock on the common path
I think the right answer is to build a fast shared lock on top of the fast_sync support and use that in a bunch of places
We should replace all uses of spin locks
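A minimal sketch of the pattern under discussion, using `std::atomic` rather than the runtime's actual primitives (the names `Context` and `acquire_context` are hypothetical, and this is not the Halide runtime code itself). The first acquire load is the one that has to stay: a thread that observes a non-null pointer must also see the fully-constructed object published by the release store.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the runtime's CUDA context.
struct Context {
    int value = 0;
};

std::atomic<Context *> g_context{nullptr};
std::mutex g_lock;  // stand-in for the runtime's spin lock

Context *acquire_context() {
    // First check: this acquire load is the atomic op that can't be elided.
    Context *c = g_context.load(std::memory_order_acquire);
    if (c == nullptr) {
        std::lock_guard<std::mutex> guard(g_lock);
        // Second check, under the lock.
        c = g_context.load(std::memory_order_relaxed);
        if (c == nullptr) {
            // Construct into a local first, so the global never points
            // at a partially-constructed object.
            Context *fresh = new Context{42};  // leaked here; a sketch only
            // Release store publishes the fully-constructed object.
            g_context.store(fresh, std::memory_order_release);
            c = fresh;
        }
    }
    return c;
}
```

The common path takes only the acquire load, which is the whole point of eliding the spin lock.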
Zalman Stern
@zvookin
Apr 24 00:22
(The spin locks drive up CPU utilization a bunch in cuda multithreaded use.)
Andrew Adams
@abadams
Apr 24 00:33
Can we just throw a full memory barrier after the first load then (__sync_synchronize)?
Zalman Stern
@zvookin
Apr 24 00:33
Yeah
I was trying to be smarter than that
I.e. the atomic builtins actually give one a usable programming tool
Unfortunately, it isn't portable enough for the le32-unknown-unknown target
Andrew Adams
@abadams
Apr 24 00:35
I don't remember what the failure mode is in the usual double-checked locking pattern
Zalman Stern
@zvookin
Apr 24 00:35
I'm not clear why llvm is bitching about alignment
Andrew Adams
@abadams
Apr 24 00:35
I think that because the symbol is weak, it's not confident that the final definition of it will have any particular alignment. That's my best guess, anyway.
I tried to annotate the global as aligned, and also tried to annotate the pointer itself as pointing to aligned memory, but llvm still emitted an unaligned atomic load.
Zalman Stern
@zvookin
Apr 24 00:35
We can give it an alignment attribute right?
yuck
The other possibility is the error is effectively wrong
Andrew Adams
@abadams
Apr 24 00:36
Maybe le32 doesn't guarantee alignment of globals.
Zalman Stern
@zvookin
Apr 24 00:36
There's another constraint on the atomics turning into instructions instead of libcalls
The processor has to support all the possible atomic operations on the data type in question
PNaCl 32-bit wanted to cover some that don't
So le32-unknown-unknown will turn any 32-bit atomic into a libcall
(The support for instructions has to be complete across the set of operations otherwise some things will be instructions and some will be libcalls and there is nothing that guarantees atomicity between the two.)
I would not expect this to fail on 64-bit however
Zalman Stern
@zvookin
Apr 24 00:47
Generally the issue with double-checked locking is that the compiler is allowed to reorder stuff inside the locked block
I'm having a hard time coming up with a failure here however
Andrew Adams
@abadams
Apr 24 00:47
I see that there needs to be a fence before assigning to the global
The thing might get assigned to a partially-constructed value inside the cuda driver
Zalman Stern
@zvookin
Apr 24 00:48
yeah
though that can't happen in our code
because we use a local variable
Andrew Adams
@abadams
Apr 24 00:48
But I'm still a bit mystified why there needs to be a fence in between the load of the global and the first check of the value
Zalman Stern
@zvookin
Apr 24 00:50
So with the caveat that I need to look at it a bit more, generally the logic is that you can only reason about things between two fences
If there is no fence on one thread, you can't reason about it
However....
In this case, I think you can prove that one of the other fences had to happen on the thread, or the value was non zero
But that doesn't take into account the device release case which sets it back to zero
Andrew Adams
@abadams
Apr 24 00:51
Oh yeah, if we know we don't need a release fence, we can't possibly need an acquire fence. What would it synchronize with?
Zalman Stern
@zvookin
Apr 24 00:52
Yeah, I think the acquire has to be there to make sure the thread sees a device_release call setting it to NULL
Does that make sense?
Andrew Adams
@abadams
Apr 24 01:05
Hrm, so local_val could be not-null but about-to-be null because the lock has been acquired in the destructor but the thing hasn't been nulled out yet.
But I don't see how an acquire fence avoids that
If someone's trying to release the device while another thread is acquiring the context, things are going to go wrong. All sorts of unlocked code could be using that context.
Kern Handa
@kernhanda
Apr 24 09:39
How does Halide figure out the host triple?
i think i'm seeing a case where armv6 is being mistaken as aarch64
Zalman Stern
@zvookin
Apr 24 17:16
For JIT?
The mapping for "host" is in Target.cpp I believe
Kern Handa
@kernhanda
Apr 24 17:17
Not just for jit, but when host is passed in as a target to a generator
Zalman Stern
@zvookin
Apr 24 17:18
The difference between armv6 and aarch64 requires bits being 64 vs. 32 and that's pretty hard unless something is being set explicitly. (It shouldn't be allowed to change the bit width in JIT however.)
Kern Handa
@kernhanda
Apr 24 17:26
Is there a way to quickly verify what halide thinks the host target is?
Zalman Stern
@zvookin
Apr 24 17:27
Target foo("host"); debug(0) << foo; ?
I don't think we try to match back to "host" when printing the output. (I certainly hope we don't.)
Syntax might be a bit off there
Kern Handa
@kernhanda
Apr 24 17:30
Kk, I'll try that.
Shoaib Kamil
@shoaibkamil
Apr 24 17:39
No, it doesn't match back to host.
Steven Johnson
@steven-johnson
Apr 24 17:39
that’s WAI
‘host’ should expand into whatever actual host is
eg x86-64-linux-sse41
Zalman Stern
@zvookin
Apr 24 17:40
cool
Shoaib Kamil
@shoaibkamil
Apr 24 17:41
Yeah, and as Zalman says the logic for host -> (Halide) target is in Target.cpp. That gets further boiled down to the LLVM target in Codegen_LLVM, but for all intents and purposes it's pretty much the same as what's set up in Target.cpp
Steven Johnson
@steven-johnson
Apr 24 17:41
mixing up 32 and 64 bit targets would definitely be a major weirdness though
Zalman Stern
@zvookin
Apr 24 17:42
I was going to say I'd hope llvm asserts quickly if we got that wrong, but I could see almost everything just generating code into memory and then it crashing when we call it :-)
Kern Handa
@kernhanda
Apr 24 17:56
Now that I'm back at my desk, I can provide more details. It's specifically LLVM_Runtime_Linker.cpp that calls get_initmod_aarch64_cpu_features, which returns an error. The error is expected in that function, but the function itself isn't expected to be called on armv6
alright, i see where this code is being called. i'll try to do a little debugging to see where in get_initial_module_for_target the wrong thing is being called
Kern Handa
@kernhanda
Apr 24 18:02
This stands out to me:
if (t.arch == Target::ARM) {
  if (t.bits == 64) {
    modules.push_back(get_initmod_arm_cpu_features(c, bits_64, debug));
  } else {
    modules.push_back(get_initmod_aarch64_cpu_features(c, bits_64, debug));
  }
}
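The two calls above appear to be swapped: the 64-bit branch pulls in the 32-bit arm module and the 32-bit branch pulls in the aarch64 one. A minimal mock of the presumable fix (the string returns stand in for the real `get_initmod_*` calls; this is an illustrative sketch, not the actual LLVM_Runtime_Linker.cpp code):

```cpp
#include <cassert>
#include <string>

// Mock of the dispatch in LLVM_Runtime_Linker.cpp with the branches
// exchanged: 64-bit ARM gets the aarch64 cpu-features module and
// 32-bit ARM gets the arm one.
std::string arm_cpu_features_module(int bits) {
    if (bits == 64) {
        return "initmod_aarch64_cpu_features";  // was arm in the pasted code
    } else {
        return "initmod_arm_cpu_features";      // was aarch64 in the pasted code
    }
}
```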
Steven Johnson
@steven-johnson
Apr 24 18:08
whoa
Kern Handa
@kernhanda
Apr 24 18:08
did i get it? hole in one?
Steven Johnson
@steven-johnson
Apr 24 18:08
how could that ever have worked
Kern Handa
@kernhanda
Apr 24 18:09
you might be testing on arm64 targets only?
Steven Johnson
@steven-johnson
Apr 24 18:09
that section hasn’t been touched in ~2 years
Kern Handa
@kernhanda
Apr 24 18:14
yup, that seems to fix my issue
i'll submit a PR if you don't mind
Shoaib Kamil
@shoaibkamil
Apr 24 18:15
Lol, we've never generated anything but arm64 either
Kern Handa
@kernhanda
Apr 24 18:15
yikes. guess i'll be testing new frontiers for halide
Shoaib Kamil
@shoaibkamil
Apr 24 18:19
I should say "we" in my statement is Adobe, not Google :)
Kern Handa
@kernhanda
Apr 24 18:20
it is interesting though that arm64 binaries can run on a 32-bit OS. does linux commonly allow that?
Andrew Adams
@abadams
Apr 24 18:25
We do have an arm32 buildbot...
The code in that initial module is a no-op, so it'll work fine on arm-32
Just misses the logic that does neon detection which exists in the correct version
I think we do lots of arm-32 testing (and historically there has been a lot of arm-32 usage - e.g. HDR+ on the Nexus 5), but I think it all was assumed to have neon.
Kern Handa
@kernhanda
Apr 24 18:30
right, but in this case, it would call get_initmod_aarch64_cpu_features on a 32-bit arm device. maybe the reason that does work for your buildbots is that LLVM is built with aarch64 support as well?
Andrew Adams
@abadams
Apr 24 18:38
initmod_aarch64_cpu_features is just some vanilla c++ that does nothing
should be safe to include on any target
Kern Handa
@kernhanda
Apr 24 18:38
assuming the version that gets called isn't the one that's defined by DECLARE_NO_INITMOD
Andrew Adams
@abadams
Apr 24 18:39
Ah!
Yeah, so the difference is that our buildbot LLVMs were compiled with all targets enabled. That makes sense now :)
Kern Handa
@kernhanda
Apr 24 18:40
how do your buildbots build LLVM for ARM anyway? i'm using a docker container + qemu
Andrew Adams
@abadams
Apr 24 18:40
tegra tx1/2 dev boards
They have enough ram to compile llvm
Kern Handa
@kernhanda
Apr 24 18:41
ah, interesting.
Andrew Adams
@abadams
Apr 24 18:43
I don't know if they've been moved since I left Google, but here was our ARM buildbot farm at the time I left: https://photos.app.goo.gl/FkWrIkBbCfzAtzcx1
Pranav Bhandarkar
@pranavb-ca
Apr 24 18:44
I am trying to compute a Func that has an update definition and I want to compute it such that the pure stage and the update stage share a loop over y. I am following the tutorial on wrappers to achieve this, but without success. Anyone mind taking a look at what I am doing wrong? https://gist.github.com/pranavb-ca/ba7e4ccb2961ac860fba5c50d85174af
Andrew Adams
@abadams
Apr 24 18:45
Looks like you already have a wrapper in the algorithm - output
So calc.compute_at(output, y) should be enough
Shoaib Kamil
@shoaibkamil
Apr 24 18:46
@abadams Pretty.
Andrew Adams
@abadams
Apr 24 18:46
Some people hang Christmas lights. I prefer dev boards.
Steven Johnson
@steven-johnson
Apr 24 18:47
They aren’t hung anymore. They are now stacked.
Kern Handa
@kernhanda
Apr 24 18:52
anyone else see a giant white block in place of the gist preview?
Pranav Bhandarkar
@pranavb-ca
Apr 24 18:53
@abadams Thanks, I made that change, but no difference. See https://gist.github.com/pranavb-ca/ba7e4ccb2961ac860fba5c50d85174af (look for "Loop Nest A" and "Loop Nest B" in the print_loop_nest output at the bottom of the gist).
Steven Johnson
@steven-johnson
Apr 24 18:55
Gitter renders Gist previews weirdly for me
Andrew Adams
@abadams
Apr 24 19:20
What is the extent of those loop nests - 4?
print_loop_nest can be misleading because every stage of every func gets a complete loop nest, but a lot of the loops have extent 1
Pranav Bhandarkar
@pranavb-ca
Apr 24 19:23
I see. I'll look in the IR. Related question - if the output of my generator is the result of a Func with an update then, if the Func with the update has the same extents as the output, is there a way to get rid of the last copy from the result of the update to the output buffer?
If I try to write the output itself using update stages then I need to make two RPC calls to offload to Hexagon (one for the pure stage and one for the update stage). I want to avoid that.
Kern Handa
@kernhanda
Apr 24 20:05
question about Halide's simplification capabilities. if I have matrices A, B, C and the values of B and C are known at compile time, would ABC be folded into AD, where D = BC?
assuming, of course, one does a simple matrix multiply and not some of the less straightforward methods
Zalman Stern
@zvookin
Apr 24 20:07
It probably falls to LLVM to do the optimization, though some constant folding might happen in Halide
It is possible it will happen.
Andrew Adams
@abadams
Apr 24 20:07
Halide only really simplifies and folds within a single expression, not across multiple Funcs
Kern Handa
@kernhanda
Apr 24 20:07
does it largely depend on the size of B and C?
Zalman Stern
@zvookin
Apr 24 20:07
Generally if I have interesting amounts of work like that, I use realize to hoist the evaluation to compile time in the generator
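A plain-C++ analogue of hoisting the evaluation to compile time (Zalman is describing Halide's `realize` in a generator; this sketch uses `constexpr` and hypothetical 2x2 matrices instead, just to show the shape of the idea): when B and C are known up front, fold D = B*C once so the hot path only ever computes A*D.

```cpp
#include <array>
#include <cassert>

using Mat2 = std::array<std::array<int, 2>, 2>;

// Naive matrix multiply; constexpr so the compiler can fold it when
// both operands are compile-time constants.
constexpr Mat2 matmul(const Mat2 &x, const Mat2 &y) {
    Mat2 r{};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                r[i][j] += x[i][k] * y[k][j];
    return r;
}

// Hypothetical compile-time-known matrices.
constexpr Mat2 B{{{1, 2}, {3, 4}}};
constexpr Mat2 C{{{5, 6}, {7, 8}}};

// D = B*C is folded at compile time; runtime code multiplies A by D only.
constexpr Mat2 D = matmul(B, C);
```

In a Halide generator the same effect comes from calling `realize` on the constant sub-expression at generator build time and baking the result into the pipeline.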
Kern Handa
@kernhanda
Apr 24 20:08
that'd only work if you know ahead of time that AB is being multiplied with C
Zalman Stern
@zvookin
Apr 24 20:08
I would expect it to be really fragile in any abstracted version of matrix multiply
Kern Handa
@kernhanda
Apr 24 20:08
right