Ashish Uthama
@ashishUthama
class HammingMetric : public Halide::Generator<HammingMetric> {
  public:
    Input<Buffer<>> F1{"F1", 2};  // F x N1 uint8_t
    Input<Buffer<>> F2{"F2", 2};  // F x N2 uint8_t
    Output<Buffer<>> D{"D", 2};   // N1 x N2 float

    void generate() {
        Var x("x"), y("y"), f("f");
        // Feature length
        RDom r(0, F1.dim(0).extent(), "r");

        // Func BitCountLUT;
        // BitCountLUT(x) = cast<uint8_t>(popcount(x));
        // D(x, y) = sum(cast<float>(BitCountLUT(clamp(F1(r, x) ^ F2(r, y), 0, 255))), "sum");
        // BitCountLUT.bound(x,0,256).compute_root();

        D(x, y) = sum(cast<float>(popcount(F1(r, x) ^ F2(r, y))), "sum");

        // This is required for rungen
        D.dim(0).set_bounds(0, F1.dim(1).extent());
        D.dim(1).set_bounds(0, F2.dim(1).extent());


        F1.dim(0).set_estimate(0, 128);
        F1.dim(1).set_estimate(0, 1000);
        F2.dim(0).set_estimate(0, 128);
        F2.dim(1).set_estimate(0, 1000);
        D.dim(0).set_estimate(0, 1000);
        D.dim(1).set_estimate(0, 1000);        

        // 162
        //D.parallel(y);

        // 195
        //apply_schedule_HammingMetric(get_pipeline(), get_target());        
    }
};
HALIDE_REGISTER_GENERATOR(HammingMetric, HammingMetric)
Hi folks, I am struggling to get this to match the performance of hand-written SSE2 code. I only seem to reach ~80% of it on an AVX2 machine :(
A simple parallel appears to be the best schedule I have so far (the 195 in the code comment is from the adams2019 autoscheduler). Any suggestions, in case I am missing something obvious?
Dillon Sharlet
@dsharletg
For code like this, you'll probably need to write the reduction without the inline helpers, and schedule it. Also, I would think you want to do the reduction in integers rather than floats.
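As a plain C++ reference for what the integer version of this reduction computes, a minimal sketch might look like the following. It is not Halide code and not from the thread; the function name `hamming_reference` and the contiguous "F innermost" layout are assumptions chosen to mirror `D(x, y)` in the generator above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference Hamming distance matrix using an integer accumulator.
// f1 holds n1 descriptors of f bytes each, stored back to back (F innermost);
// f2 holds n2 descriptors with the same layout.
// Returns d with d[y * n1 + x] = number of differing bits between
// descriptor x of f1 and descriptor y of f2.
inline std::vector<std::uint32_t> hamming_reference(
    const std::vector<std::uint8_t> &f1, std::size_t n1,
    const std::vector<std::uint8_t> &f2, std::size_t n2,
    std::size_t f) {
    assert(f1.size() == n1 * f && f2.size() == n2 * f);
    std::vector<std::uint32_t> d(n1 * n2, 0);
    for (std::size_t y = 0; y < n2; ++y) {
        for (std::size_t x = 0; x < n1; ++x) {
            std::uint32_t acc = 0;  // integer accumulator, no float conversion
            for (std::size_t r = 0; r < f; ++r) {
                const auto diff =
                    static_cast<std::uint8_t>(f1[x * f + r] ^ f2[y * f + r]);
                acc += static_cast<std::uint32_t>(std::bitset<8>(diff).count());
            }
            d[y * n1 + x] = acc;
        }
    }
    return d;
}
```

A result like this can serve as a correctness baseline; converting to float, if the output really needs it, can then happen once per output element instead of once per byte.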
Nikola Smiljanić
@popizdeh
One of the comments in lesson 16 says "In terms of the strides described in lesson 10" but there's no mention of strides in any of the lessons other than 16.
Ashish Uthama
@ashishUthama
@dsharletg - thanks! hacking away...
@devs : Just a note of thanks for updating the generator autoschedule option to output fully usable headers!
Ashish Uthama
@ashishUthama

@dsharletg - that helped. Ended with:

class HammingMetric : public Halide::Generator<HammingMetric> {
  public:
    Input<Buffer<>> F1{"F1", 2}; // F x N1 uint8_t
    Input<Buffer<>> F2{"F2", 2}; // F x N2 uint8_t
    Output<Buffer<>> D{"D", 2};  // N1 x N2 float

    void generate() {
        Var x("x"), y("y"), f("f");
        // Feature length
        RDom r(0, F1.dim(0).extent(), "r");

        // D is expected to be initialized to 0.
        D(x, y) = undef<float>();
        Func bcount("bcount");
        bcount(x, y, f) = popcount(F1(f, x) ^ F2(f, y));
        D(x, y) += bcount(x, y, r);

        // This is required for rungen
        D.dim(0).set_bounds(0, F1.dim(1).extent());
        D.dim(1).set_bounds(0, F2.dim(1).extent());

        D.update(0).parallel(y);
        bcount.store_in(MemoryType::Stack).compute_at(D, x).store_at(D, x);
    }
};
HALIDE_REGISTER_GENERATOR(HammingMetric, HammingMetric)

The above code still gets vectorized even though it's not explicit (thanks to @abadams' video, I learnt about LLVM loop optimization). I was unable to vectorize it explicitly in any way that got the same perf as what LLVM did.
Now the odd gotcha: for feature lengths (F) of 64 or less, enabling AVX2 significantly slows things down.
So I am wondering if there is a way in Halide to specialize like so:

* For F<64, vectorize, but use only AVX
* For F>=64, vectorize and allow AVX2

For now, I'll just do this logic outside Halide and generate multiple outputs from the generator by varying the target.
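The dispatch outside Halide could be sketched roughly as below. Everything here is hypothetical: the `HammingFn` signature stands in for whatever the two ahead-of-time compiled variants actually expose, `select_hamming` is an illustrative name, and the cutoff of 64 simply follows the bullets above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical signature shared by both compiled variants (an assumption;
// real Halide-generated functions take buffer arguments instead).
using HammingFn = int (*)(const std::uint8_t *f1, const std::uint8_t *f2,
                          float *d, int n1, int n2, int f);

// Pick a variant from the feature length and the CPU's capabilities.
// The AVX2 build is used only when the CPU supports it and F >= 64.
inline HammingFn select_hamming(int f, bool cpu_has_avx2,
                                HammingFn avx_variant, HammingFn avx2_variant) {
    assert(avx_variant != nullptr && avx2_variant != nullptr);
    if (cpu_has_avx2 && f >= 64) {
        return avx2_variant;
    }
    return avx_variant;
}
```

Keeping the choice in one small function means the threshold can be re-measured and changed without touching the generator.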

Ashish Uthama
@ashishUthama
Also, depending on the value of F, the second row onwards can be misaligned. This also appears to have an impact on performance. Any tricks to handle this in Halide?
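To make the misalignment concrete: with tightly packed rows of F bytes, row y starts at byte offset y * F, which is aligned only when F is a multiple of the vector width. One common workaround outside Halide is to pad the row stride. The sketch below shows only that arithmetic; the helper names are hypothetical, and how the padded stride is then communicated to the pipeline is left open.

```cpp
#include <cassert>
#include <cstddef>

// Round a row length in bytes up to the next multiple of `alignment`
// (which must be a power of two), giving a padded row stride.
inline std::size_t padded_stride(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (row_bytes + alignment - 1) & ~(alignment - 1);
}

// True if row y starts on an `alignment`-byte boundary, assuming the
// buffer's base address is itself aligned.
inline bool row_is_aligned(std::size_t y, std::size_t stride,
                           std::size_t alignment) {
    return (y * stride) % alignment == 0;
}
```

For example, F = 40 with 32-byte vectors leaves row 1 at offset 40, which is misaligned, whereas a padded stride of 64 keeps every row aligned at the cost of 24 unused bytes per row.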
Jiawen (Kevin) Chen
@jiawen
I just realized that the macOS release's autoschedulers have the extension .so and not .dylib
That's a bug, yes?
Alex Reinking
@alexreinking:matrix.org
No that is how loadable modules work
That is intentional and correct
Jiawen (Kevin) Chen
@jiawen
ok
dylib only
i can't seem to link it in
Alex Reinking
@alexreinking:matrix.org
You're not supposed to link to the autoschedulers
Jiawen (Kevin) Chen
@jiawen
ld: can't link with bundle (MH_BUNDLE) only dylibs (MH_DYLIB) file 'bazel-out/host/bin/_solib_darwin_x86_64/_U@halide-x86-64-osx_S_S_C_Uautoschedulers___Ulib/libautoschedule_mullapudi2016.so'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Alex Reinking
@alexreinking:matrix.org
100% expected
Jiawen (Kevin) Chen
@jiawen
ok - andrew thought it might work ok to link it in :)
Alex Reinking
@alexreinking:matrix.org
You load them at runtime with the load_plugin function
Jiawen (Kevin) Chen
@jiawen
yeah i'm just fighting bazel now
Alex Reinking
@alexreinking:matrix.org
No, Linux doesn't make any distinction between executables, shared objects, and loadable modules. Not so on macOS or Windows
Alex Reinking
@alexreinking:matrix.org
More precisely, ELF doesn't care about these differences, but Mach-O does
Those are the binary formats used by Linux (and most other Unixes) and macOS, respectively
Jiawen (Kevin) Chen
@jiawen
heh thanks - i'm sure the design decision made sense to someone at some point
Ashish Uthama
@ashishUthama
I am getting a crash with non-square inputs (when transpose_A=true, transpose_B=false). I tried various transpose combinations, but still get a crash. Trying to work through the implementation now.
Ashish Uthama
@ashishUthama
Taking care to fix the sizes based on transpose_A gets the right answers most of the time. Other times, I get a crash with:
Stack Trace (from fault):
[ 0] 0x00007fccefdc3c14 par_for.result.s0.v1.v13.v8 at posix_allocator.cpp:?
[ 1] 0x00007fccefd9e393 Halide::Runtime::Internal::worker_thread_already_locked(Halide::Runtime::Internal::work*)
alan__
@alan__:matrix.org
Hello everyone. Are there any recommendations for performing image warping?
sternj
@sternj
Hi all-- is it possible to specify the types of tuple outputs for generators at generation time? I know it can be done with Output<Buffer<>> _outputs {"outputs", {Int(16), Int(16)}, 2} or the like, but is it possible to pass in the types as a generator parameter?
Rohan Yadav
@rohany
Hi! Is it possible to use Halide to generate kernels with a BLAS like signature (i.e. the LDA/LDB/LDC constants in dgemm)? I'm considering using Halide to generate some kernels to use in another project.
alan__
@alan__:matrix.org
Hello @rohany. I believe it would be useful to look at the apps/linear_algebra folder, which contains implementations of the BLAS routines.
Rohan Yadav
@rohany
Yeah, I've seen those -- I know that Halide can implement several BLAS kernels. This is more of a practical / implementation question, as I want to be able to parametrize the generated code by these load offsets, since I may want to pass slices of larger buffers to the kernel.
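For readers unfamiliar with the LDA/LDB/LDC convention being asked about, here is a small plain C++ sketch of column-major access with a leading dimension. It is illustrative only: `at` and `slice_sum` are hypothetical helpers, not part of BLAS or Halide. The point is that a slice of a larger matrix is addressed with the parent's leading dimension, not the slice's own height.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major element access with a leading dimension, as in BLAS:
// element (i, j) lives at a[i + j * lda].
inline double &at(double *a, std::size_t lda, std::size_t i, std::size_t j) {
    return a[i + j * lda];
}

// Sum an m x n slice. `a` points at the slice's first element and
// lda is the row count of the larger parent matrix.
inline double slice_sum(double *a, std::size_t m, std::size_t n,
                        std::size_t lda) {
    assert(lda >= m);
    double total = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            total += at(a, lda, i, j);
        }
    }
    return total;
}
```

In buffer terms, the leading dimension plays the role of the stride of the second dimension, so passing a slice amounts to passing an offset base pointer together with the parent's stride.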
Marco Scardovi
@scardracs
hi, is clang required to build Halide?
I'm porting it to Gentoo for a package that requires Halide
Alex Reinking
@alexreinking:matrix.org
Yes, but only for the runtime modules (translates them to LLVM bitcode)
Marco Scardovi
@scardracs
Ok, thanks :)
Alex Reinking
@alexreinking:matrix.org
The main compiler can be either GCC or Clang on Linux
Sure thing :)
Marco Scardovi
@scardracs
Ok, thanks again :)
Chris Taylor
@catid
Is there some way to "promote" a 2D Halide Buffer to a 3D one that has an extent = 1 for the third dimension?
Without copying any data etc
Zalman Stern
@zvookin
In Halide or in C++ using the runtime?
Chris Taylor
@catid
C++ using the runtime I think
Zalman Stern
@zvookin
I think you want “embed” in HalideBuffer.h
Maybe “add_dimension”
Can’t remember
Chris Taylor
@catid
Ok
Chris Taylor
@catid
That worked :)
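Why this needs no copy can be seen from a stripped-down model of a strided buffer view. The sketch below is not Halide's implementation; `Dim`, `View`, and `with_unit_dimension` are hypothetical names, and the thread does not say which of the two suggested methods was used. Adding a trailing dimension of extent 1 only appends metadata, so both views address the same memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-dimension metadata; stride is measured in elements.
struct Dim {
    std::int32_t min, extent, stride;
};

// Non-owning view: a pointer plus per-dimension min/extent/stride.
struct View {
    float *host;
    std::vector<Dim> dims;

    // Element access by multi-dimensional index.
    float &at(const std::vector<int> &idx) const {
        assert(idx.size() == dims.size());
        std::ptrdiff_t off = 0;
        for (std::size_t d = 0; d < dims.size(); ++d) {
            off += static_cast<std::ptrdiff_t>(idx[d] - dims[d].min) *
                   dims[d].stride;
        }
        return host[off];
    }
};

// Return a view with one extra trailing dimension of min 0 and extent 1.
// No pixel data is touched; only the dimension list grows.
inline View with_unit_dimension(View v) {
    const std::int32_t stride =
        v.dims.empty() ? 1 : v.dims.back().extent * v.dims.back().stride;
    v.dims.push_back({0, 1, stride});
    return v;
}
```

Because the new dimension has extent 1, its stride is never multiplied by a non-zero index, so any stride value would address the same elements; the choice here just keeps the layout dense.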