Andrew Adams
@abadams
So I guess add both the random input and the output to the weights file.
Steven Johnson
@steven-johnson
basically upgrading weights into more of an ‘environment’ — not crazy, but definitely expanding the scope of what goes into it
Andrew Adams
@abadams
I was thinking of it more as adding error checking to the weights file, so that it also validates that the weights are being interpreted correctly relative to how the network is run
A beefed up checksum
Steven Johnson
@steven-johnson
yes, though (as you know) hash-of-golden-result is a nightmare to debug when a failure is found
but in this case it’s probably better than letting errors go unnoticed
Andrew Adams
@abadams
Yeah, it would just be error-detecting. Given that the thing works with random weights, it would be nice to have confidence that the weights are being used in the intended way
Steven Johnson
@steven-johnson
SGTM
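A minimal sketch of the "beefed up checksum" idea discussed above, assuming the weights file carries one recorded probe run next to the weights. Every name here (WeightsBundle, Network, weights_interpreted_correctly, toy_dense_layer) is hypothetical and not part of Halide or any existing weights format; the toy layer only stands in for a real network.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical container: the weights plus one recorded probe run.
struct WeightsBundle {
    std::vector<float> weights;
    std::vector<float> probe_input;      // random input captured when the weights were exported
    std::vector<float> expected_output;  // output the reference network produced for that input
};

// Hypothetical signature for "run the network with these weights on this input".
using Network = std::function<std::vector<float>(const std::vector<float> &weights,
                                                 const std::vector<float> &input)>;

// Returns true if running `net` on the stored probe input reproduces the
// stored output to within `abs_tol` for every element.
bool weights_interpreted_correctly(const WeightsBundle &b, const Network &net,
                                   float abs_tol = 1e-4f) {
    const std::vector<float> actual = net(b.weights, b.probe_input);
    if (actual.size() != b.expected_output.size()) {
        return false;
    }
    for (std::size_t i = 0; i < actual.size(); i++) {
        // Written as !(... <= tol) so that a NaN result also counts as a mismatch.
        if (!(std::fabs(actual[i] - b.expected_output[i]) <= abs_tol)) {
            return false;
        }
    }
    return true;
}

// Toy stand-in for a real network: one dense layer with row-major weights.
std::vector<float> toy_dense_layer(const std::vector<float> &w,
                                   const std::vector<float> &in) {
    const std::size_t n = in.size();
    const std::size_t rows = (n == 0) ? 0 : w.size() / n;
    std::vector<float> out(rows, 0.0f);
    for (std::size_t r = 0; r < rows; r++) {
        for (std::size_t c = 0; c < n; c++) {
            out[r] += w[r * n + c] * in[c];
        }
    }
    return out;
}
```

Storing the raw expected output rather than only a hash of it is a choice made for this sketch: a mismatch can then be located element by element, which may ease the debugging concern raised above, at the cost of a larger file.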
D.pz
@decentsheep
@abadams Thanks a lot!
Steven Johnson
@steven-johnson
FYI: there’s a pending fix for llvm-trunk compile failure (see recent PR), but with that in place, I’m now getting test failures in test_internal, with Wrong number of operands !47 = !{!"branch_weights", i32 1073741824, i32 0} from LLVM. Investigating.
Steven Johnson
@steven-johnson
Hmm… looks like LLVM trunk may have finally flipped to requiring C++14, as the linux buildbots are now failing with error Expected C++14 or later. (Not seeing this locally but my local machine uses a newer gcc by default.)
Steven Johnson
@steven-johnson
hm, looks like a purge of everything and rebuild from scratch is dealing with that so far.
Steven Johnson
@steven-johnson
re: the Wrong number of operands failures, I’ve bisected it to https://reviews.llvm.org/rL368647, but am not clear on the fix yet. I’ve pinged @alinas for advice since she reviewed the change.
Rasim Akhunzyanov
@brotherofken
Hi! I have a question regarding the 'debug' target feature.
I've added 'debug' to the OpenCL target and get a lot of useful output on x86. However, there is no such output on Android. Currently logcat contains only messages like Entering Pipeline, Target, the list of pipeline inputs, and Exiting Pipeline.
As far as I understand, these messages are produced by print statements added in DebugArguments.cpp. However, logging for the OpenCL backend is done by the debug(user_context) object in runtime/opencl.cpp. Is there an analog to halide_set_custom_print for debug? Or maybe I need to rebuild Halide with some custom options?
In native code I set the error handler and the print function using halide_set_error_handler and halide_set_custom_print respectively.
Seems like I have to recompile Halide with DEBUG_RUNTIME or hack printer.cpp and replace SinkPrinter by Printer<BasicPrinter>. Am I right?
Rasim Akhunzyanov
@brotherofken
Solved. It seems I was linking the wrong Halide runtime library due to my messy CMakeLists.
Steven Johnson
@steven-johnson
Building for Android via CMake is territory we haven’t explored too deeply — not surprised it might be sketchy. (We’d welcome PRs to improve that situation.)
Steven Johnson
@steven-johnson
@abadams — does a bounds query ever adjust the output buffer sizes/shapes, or does it always purely affect the input buffer sizes/shapes?
Andrew Adams
@abadams
It can also adjust the output buffer size
e.g. increase them to satisfy a minimum output size requirement implied by the schedule
Steven Johnson
@steven-johnson
gotcha
(Found another case in RunGen where we have a complex set of constraints that can fail during bounds-query… I think the short-term fix is to skip the bounds-query entirely if 100% of the input and output buffers are provided via estimate, on the assumption those will be valid)
Rasim Akhunzyanov
@brotherofken
@steven-johnson Thank you for the answer. I'll open a PR if I come up with something meaningful.
Dillon Sharlet
@dsharletg
@abadams @pranavb-ca I've found an interesting thing, perhaps it affects Hexagon more than other targets. I have found that if I express e.g. a 3x3 stencil as a reduction over [0, 2] instead of [-1, 1], I get significantly better performance. I think this is mainly due to align_loads: when using [-1, 1], we have to keep 3 vectors alive, and slice between both the first two and the second two. But when using [0, 2], we only need to keep 2 vectors alive, and slice between both of them differently.
The reason this is interesting for Halide is because I think this is a mechanical program transformation that Halide probably should do either automatically or at least have a scheduling directive
My thinking on this is rough at this point. I need to do more digging. But I am excited about the finding :)
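To make the transformation described above concrete, here is a small sketch in plain C++ (not Halide) of a 1-D 3-tap sum written both ways. It only illustrates that the [-1, 1] and [0, 2] forms compute the same values once the output index is shifted by one; it says nothing about performance or about how align_loads behaves. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Centered form: out[x - 1] = sum over dx in [-1, 1] of in[x + dx],
// evaluated for x in [1, n - 2].
std::vector<int> sum3_centered(const std::vector<int> &in) {
    const int n = static_cast<int>(in.size());
    assert(n >= 3);
    std::vector<int> out(n - 2);
    for (int x = 1; x <= n - 2; x++) {
        int sum = 0;
        for (int dx = -1; dx <= 1; dx++) {
            sum += in[x + dx];
        }
        out[x - 1] = sum;
    }
    return out;
}

// Shifted form: out[x] = sum over dx in [0, 2] of in[x + dx],
// evaluated for x in [0, n - 3]. Same taps, with the origin moved by one.
std::vector<int> sum3_shifted(const std::vector<int> &in) {
    const int n = static_cast<int>(in.size());
    assert(n >= 3);
    std::vector<int> out(n - 2);
    for (int x = 0; x <= n - 3; x++) {
        int sum = 0;
        for (int dx = 0; dx <= 2; dx++) {
            sum += in[x + dx];
        }
        out[x] = sum;
    }
    return out;
}
```

For any input with at least three elements, sum3_centered(in) and sum3_shifted(in) return the same vector, which is why the rewrite can be treated as a mechanical change of coordinates.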
Pranav Bhandarkar
@pranavb-ca
nice, I am assuming you are setting up your buffers so that things are aligned and that align loads is actually able to slice up vectors
what is the perf difference that you are seeing?
Andrew Adams
@abadams
Sounds like align_loads needs fixing.
(only hexagon uses it)
Dillon Sharlet
@dsharletg
I'm seeing a 25% increase on a pipeline that has a lot of stack spills in a loop that is affected by this
I think align_loads is working OK, I think what we need is to shift all of the indexing/buffer realizations such that the vectors loaded have alignment (64, 0), (64, 1) and (64, 2) instead of (64, -1), (64, 0), (64, 1)
Pranav Bhandarkar
@pranavb-ca
right now align_loads simply breaks up a load into two loads. how would you change it? I think it is doing its job
shifting indices shouldn't be done by align_loads, IMHO
Dillon Sharlet
@dsharletg
I think maybe the thing we need is to have align_storage have a remainder in addition to a modulus
Dillon Sharlet
@dsharletg
hmm, that would make writes messy instead. This might be a tricky thing to actually do.
Dillon Sharlet
@dsharletg
the more I play around with this, the more I find other issues with align loads and loop carrying. I have two nested loops for x, for dx, where dx is a small stencil reduction. I keep getting the loop carry buffers inside the loop over x instead of outside, which is not ideal
even weirder, when I unroll things, so I have multiple inner loops, I get some loop carrying around the inner loop, and some around the outer loop. When there are multiple in the inner loop, they don't get shared, which is surely terrible for register pressure, even though I think they could be shared
Dillon Sharlet
@dsharletg
I think the problem I described before is more nuanced than I first thought. After I fixed the program not to have any inner loops except the one I wanted pipelining/carrying from, everything is working like I would hope I think. And as I think about it more, the presence of those inner loops is a really tough problem. I'm not sure we can expect any compiler to handle this exactly the way one would want.
Steven Johnson
@steven-johnson
I’m getting tired of endlessly rewriting ad-hoc code for defining/parsing command-line flags in C++ code; surely there is some lightweight/portable helper library that we could adopt (assuming no licensing issues). Anyone have any suggestions?
Marc Peter
@macpete
I'm using getopt/getopt_long. A Windows implementation with a BSD license is here: https://github.com/kimgr/getopt_port
Steven Johnson
@steven-johnson
looking at getopt_long() examples, I’m not sure if it’s a lot simpler than just hand-rolling code :-/
Marc Peter
@macpete
A bit of boilerplate is needed, true. I looked for something simpler - with the right license - in the past, but gave up.
It's not so bad, really. It sure beats calling lots of strncmps and offers short options.
Steven Johnson
@steven-johnson
yeah, fair
Marc Peter
@macpete
I came up with a nifty template/lambda function that hugely simplifies parsing numbers - you just call getarg(longvar) or getarg(floatvar) (all scalars except int because there's no strtoi)
happy to share it if you're interested
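The helper mentioned above was not posted, so the following is only a guess at the general shape of such a thing, not the actual code. It is a hypothetical template, parse_number, that converts a flag's argument text (for example optarg after getopt_long) into any arithmetic type. It uses std::istringstream instead of the strtol family, a swapped-in technique chosen so that int works as well.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: parse `text` as a number of type T.
// On success, stores the result in `value` and returns true.
// On failure (null pointer, no number, or trailing characters),
// leaves `value` untouched and returns false.
// Note: char types are extracted as single characters, not as numbers.
template <typename T>
bool parse_number(const char *text, T &value) {
    if (text == nullptr) {
        return false;
    }
    std::istringstream in(text);
    T parsed{};
    if (!(in >> parsed)) {
        return false;  // nothing numeric could be extracted
    }
    if (in.peek() != std::istringstream::traits_type::eof()) {
        return false;  // leftover characters such as the "x" in "42x"
    }
    value = parsed;
    return true;
}
```

A call site might look like `long n = 0; if (!parse_number(optarg, n)) { /* report a bad flag value */ }`, with one such line per numeric flag inside the getopt_long switch.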
Steven Johnson
@steven-johnson
I think I can make this work as-is, but thanks
Benoit Steiner
@benoitsteiner
The autoscheduler throws this error: Can only unroll for loops over a constant extent.
Loop over fire7/concat_1_direct_conv.s0.v55077 has extent (let t57618221 = (min(likely_if_innermost(((fire8/squeeze1x1_2.s0._1._1._1/3)*4)), 9) + (fire8/squeeze1x1_2.s0._2._2i._2i*2)) in ((t57618221 - likely_if_innermost(t57618221)) + 2)).
I can avoid the problem by setting the HL_PERMIT_FAILED_UNROLL environment variable to 1, but it seems that t57618221 - likely_if_innermost(t57618221) should be simplified to 0. Any ideas?