I have an `Input<bool> lutEnabled` and an `Input<Buffer<>> lut`, and I use `f(x, y) = select(lutEnabled, lut(x, y), input(x, y));`. My hope was that specializing for `lutEnabled == false` would produce code where I'm allowed to pass a `nullptr` for the `lut` input buffer, but that's not the case. Bounds checking is done as a top-level check, which requires me to pass a valid buffer even when `lutEnabled` is false. Any ideas?
`specialization()` call. My goal is to do the following: I have some parameters to my algorithm, and for each parameter I run the auto-scheduler to generate a `schedule.h`. I want to include each schedule and use each one according to the parameter value.
I'm trying to schedule my program on the GPU, and the Halide profiler is telling me:

```
average threads used: 0.900000
heap allocations: 0  peak heap usage: 0 bytes
  halide_malloc:         0.000ms (0%)   threads: 0.000
  halide_free:           0.000ms (0%)   threads: 0.000
  endiannessSwapWordOut: 1.015ms (100%) threads: 0.900
```

This thread usage of 0.9 is worrying me. I checked, and the debug info (after enabling it) shows me:

```
halide_opencl_run (user_context: 0x0, entry: _kernel_endiannessSwapWordOut_s0_wordIdx___block_id_x, blocks: 4313x1x1, threads: 4x1x1, ...
```

So it is running on multiple blocks and threads. What does this 0.9 thread utilization tell me?
Looking at `CodeGen_PTX_Dev::mcpu()` in `CodeGen_PTX_Dev.cpp`, it supports up to sm_80. Has anyone tried running GPU codegen on RTX cards?
Not as far as I know, but I recently acquired a 3090, so hopefully that will change soon!
halide/Halide#6334 changing now!
If, for a Halide Generator class, the "auto_schedule" argument is set to "false" when compiled, is it possible for the Halide engine to use any default parallelization/scheduling technique (e.g., vectorization, parallelization, tiling, loop reversal)? Or is it guaranteed that no scheduling primitives are going to be used?
No scheduling primitives will be used unless you specify them.
Even with `target=x86-64-linux-disable_llvm_loop_opt`, I notice `xmm*` registers being used in the output assembly file (fileName.s). Does that mean there is auto-vectorization going on somewhere in the generation pipeline?
No: Halide assumes that SSE2 is present on all x86-64 architectures, and uses the XMM registers for scalar floating-point operations.
I have this issue:

```
Condition failed: in.is_bounded()
Unbounded producer->consumer relationship: Vertices -> FaceNormal
```

when I try to read an array with a buffer of indices.
ref: https://github.com/halide/Halide/issues/4108#issuecomment-956546487
I am working on "serializing" a Python object with Expr members to a Halide Func. In the process, I end up with a function that has a large number of explicit definitions in one dimension. Unfortunately, I am not able to make those be calculated efficiently: once for each value, while sharing potentially pre-calculated values. In particular, this code:
```python
import numpy as np
from halide import *

row, col = Var("row"), Var("col")

f = Func("f")
f[row, col] = 0.0
f[row, 0] = 1.0 + sqrt(row * row)
f[row, 1] = 2.0 + sqrt(row * row)
f[row, 2] = 3.0 + sqrt(row * row)
f[row, 3] = 4.0 + sqrt(row * row)

g = Func("g")
g[row, col] = f[row, col] + 42.0

g.compile_to_lowered_stmt("out.txt", [], StmtOutputFormat.Text)
print(np.asanyarray(g.realize(2, 4)))
```
Leads to the following generated code:

```
for (g.s0.col, g.min.1, g.extent.1) {
  ...
  for (g.s0.row, g.min.0, g.extent.0) {
    allocate f[float32 * 1 * (max(t6, 3) + 1)]
    produce f {
      f[t7] = 0.000000f
      f[t8] = (float32)sqrt_f32(float32((g.s0.row*g.s0.row))) + 1.000000f
      f[t9] = (float32)sqrt_f32(float32((g.s0.row*g.s0.row))) + 2.000000f
      f[t10] = (float32)sqrt_f32(float32((g.s0.row*g.s0.row))) + 3.000000f
      f[t11] = (float32)sqrt_f32(float32((g.s0.row*g.s0.row))) + 4.000000f
    }
    consume f {
      g[g.s0.row + t12] = f[t7] + 42.000000f
    }
```
Unfortunately, Halide does not notice that only one value of `f` is needed, and calculates all of `f` for each `g`. I guess this is expected.
Calling `f.compute_root()` helps reduce the number of calculations, but results in code with 4 loops over `row` instead. This is problematic in my actual use case, because it no longer automatically shares values that can be pre-calculated (such as the `sqrt` above).
Is there a way to get Halide to calculate `f` for each explicitly set `col` in one loop over `row`?
Upgrading from Halide 12 to Halide 14 (tip), I'm running into a lot of:

```
Unhandled exception: Error: Cannot split a loop variable resulting from a split using PredicateLoads or PredicateStores.
```

Right now, it looks like something related to `tile()` with the tail strategy omitted (i.e. the default, `Auto`). Does this ring a bell? (will dig more in a bit)