@grfrost thanks again, I was not asking for a magic number, but for a way to ask Aparapi for suggested local sizes.
This is what I get for the Mandelbrot set on 3 different GPUs.
=======================================================================================================
AparapiFractals - Mandelbrot Benchmark - localSizes
image size : 900 x 900
maxIterations : 10,000
complex region : -2.0000000000000000d,-2.0000000000000000d 2.0000000000000000d,2.0000000000000000d
+-----+------------------+----------------+--------------------------+----------+--------+------------+
|Type | shortDescription | deviceId       | Name                     | LSizes   | ExMode | Elapsed(ms)|
+-----+------------------+----------------+--------------------------+----------+--------+------------+
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 2 x 2    | GPU    |        721 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 4 x 4    | GPU    |        170 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 8 x 8    | GPU    |         95 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 10 x 10  | GPU    |        118 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 16 x 16  | GPU    |         97 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 2 x 2    | GPU    |       5081 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 4 x 4    | GPU    |       1330 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 8 x 8    | GPU    |        366 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 10 x 10  | GPU    |        468 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 16 x 16  | GPU    |        376 |
+-----+------------------+----------------+--------------------------+----------+--------+------------+
+-----+------------------+----------------+--------------------------+----------+--------+------------+
|Type | shortDescription | deviceId       | Name                     | LSizes   | ExMode | Elapsed(ms)|
+-----+------------------+----------------+--------------------------+----------+--------+------------+
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 2 x 2    | GPU    |        395 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 4 x 4    | GPU    |        184 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 8 x 8    | GPU    |        194 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 10 x 10  | GPU    |        204 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 16 x 16  | GPU    |        201 |
+-----+------------------+----------------+--------------------------+----------+--------+------------+
=======================================================================================================
For now I will stay with a fixed 8x8.
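Something like the sketch below is what I was hoping for. Device.best() and Range.create2D(device, w, h) are real Aparapi API; the getMaxWorkGroupSize()/getMaxWorkItemSize() getters on OpenCLDevice are my assumption about where the OpenCL work-group limits are exposed, so treat the names as unverified.
import java.util.Arrays;
import com.aparapi.Range;
import com.aparapi.device.Device;
import com.aparapi.device.OpenCLDevice;

public class LocalSizeProbe {
    public static void main(String[] args) {
        Device device = Device.best();

        // Assumption: OpenCLDevice exposes the OpenCL limits
        // (CL_DEVICE_MAX_WORK_GROUP_SIZE / CL_DEVICE_MAX_WORK_ITEM_SIZES).
        if (device instanceof OpenCLDevice) {
            OpenCLDevice ocl = (OpenCLDevice) device;
            System.out.println("max work group size : " + ocl.getMaxWorkGroupSize());
            System.out.println("max work item sizes : " + Arrays.toString(ocl.getMaxWorkItemSize()));
        }

        // If the local dimensions are omitted, Aparapi chooses them itself.
        Range range = Range.create2D(device, 900, 900);
        System.out.println(range);   // shows the global/local sizes Aparapi settled on
    }
}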
I can put an array and use it in multiple invocations of the same kernel, so why not put it once and use it in invocations of multiple different kernels (kernel.put())? If I put an array, how long will it remain allocated and taking up space in the GPU's memory? I'm really unclear about how the "clean up" of allocated global device memory happens in Aparapi.
@bmaso Great questions. You are closing in on Aparapi's weak area: multi-kernel dispatch over the same data. Sadly, I do not think Aparapi currently allows you to share buffers between kernels. I may be going from the original code I wrote in 2011 ;) so almost 10 years back, but certainly at that time I think we had one OpenCL queue per Kernel (maybe @freemo can clarify). Maybe even one OpenCL context per Kernel.
You should experiment with 'explicit management' and use a device to create your range. This way it is possible that you can 'force' Aparapi to use the same queue/context. That is the key. You certainly won't be able to do this across devices.
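Something like the sketch below is the shape of what I mean. It is a rough sketch only: setExplicit()/put()/get() and a device-created Range are standard Aparapi API, but whether two kernels built this way really end up sharing one queue/context/buffer is exactly the open question above.
import com.aparapi.Kernel;
import com.aparapi.Range;
import com.aparapi.device.Device;

public class ExplicitSketch {
    public static void main(String[] args) {
        final int size = 1024;
        final float[] data = new float[size];

        Kernel squares = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                data[i] = i * i;
            }
        };

        // Explicit mode: Aparapi only moves this buffer when asked to.
        squares.setExplicit(true);

        // Build the range from a concrete device, as suggested above.
        Device device = Device.best();
        Range range = Range.create(device, size);

        squares.put(data);        // host -> device once
        squares.execute(range);   // data stays resident on the device
        squares.execute(range);   // re-run without another transfer
        squares.get(data);        // device -> host only when results are needed

        squares.dispose();        // release the kernel's OpenCL resources
    }
}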
BTW there is a trick I documented somewhere: you put the implementation of two or more kernels in one run() method and have the execution switch depending on external state.
public void run(){
    if (state == 0){
        // code for algorithm 1
    } else {
        // code for algorithm 2
    }
}
yeah it is hacky ;) but not terrible from a performance POV. All work-items take the same branch (state is uniform across the dispatch), so not too bad WRT wavefront divergence.
I used this for a few Aparapi demos back in the day.
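A rough Aparapi-style sketch of that trick (the kernel and field names here are made up for illustration; I am also assuming the scalar state field is refreshed on every execute, which I believe it is even in explicit mode):
import com.aparapi.Kernel;
import com.aparapi.Range;

// Hypothetical example of the "one run(), several algorithms" trick.
public class MultiAlgorithmKernel extends Kernel {
    final float[] data;
    int state = 0;                      // set from the host before each execute()

    MultiAlgorithmKernel(float[] data) {
        this.data = data;
    }

    @Override
    public void run() {
        int i = getGlobalId();
        if (state == 0) {
            data[i] = data[i] * 2f;     // "algorithm 1"
        } else {
            data[i] = data[i] + 1f;     // "algorithm 2"
        }
    }

    public static void main(String[] args) {
        float[] data = new float[1024];
        MultiAlgorithmKernel k = new MultiAlgorithmKernel(data);
        k.setExplicit(true);
        k.put(data);                            // upload once
        k.state = 0;
        k.execute(Range.create(data.length));   // runs "algorithm 1"
        k.state = 1;
        k.execute(Range.create(data.length));   // runs "algorithm 2" on the same buffer
        k.get(data);                            // download once at the end
        k.dispose();
    }
}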
// TS x TS tiles of A and B staged in local (work-group shared) memory
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
@hostettler They go from this:
ld.shared.f32 %f50, [%r18+56];
ld.shared.f32 %f51, [%r17+1792];
fma.rn.f32 %f52, %f51, %f50, %f49;
ld.shared.f32 %f53, [%r18+60];
ld.shared.f32 %f54, [%r17+1920];
fma.rn.f32 %f55, %f54, %f53, %f52;
to this:
ld.shared.f32 %f82, [%r101+4];
ld.shared.f32 %f83, [%r102];
fma.rn.f32 %f91, %f83, %f82, %f67;
ld.shared.f32 %f84, [%r101+516];
fma.rn.f32 %f92, %f83, %f84, %f69;
ld.shared.f32 %f85, [%r101+1028];
fma.rn.f32 %f93, %f83, %f85, %f71;
ld.shared.f32 %f86, [%r101+1540];
fma.rn.f32 %f94, %f83, %f86, %f73;
ld.shared.f32 %f87, [%r101+2052];
fma.rn.f32 %f95, %f83, %f87, %f75;
ld.shared.f32 %f88, [%r101+2564];
fma.rn.f32 %f96, %f83, %f88, %f77;
ld.shared.f32 %f89, [%r101+3076];
fma.rn.f32 %f97, %f83, %f89, %f79;
ld.shared.f32 %f90, [%r101+3588];
fma.rn.f32 %f98, %f83, %f90, %f81;
but it is all still in local memory; they just make better use of the Compute Unit hardware.
If I put a buffer using a kernel object and then dispatch the kernel (kernel.execute(), that is), when would the device memory allocated for the buffer ever be disposed? Do I need to worry about using up all of the device's memory?
The hope was that if I put the buffer with one kernel, the other kernel would have access to the same data buffer during execution.
Also, you only need to put an explicitly managed buffer if the kernel actually reads its contents during execution. For example, if you have a buffer that holds kernel output only, there is no need to put the buffer prior to kernel execution. The buffer is still apparently allocated by Aparapi on the device, and you only need to explicitly get the buffer after the kernel is done executing to gather the kernel results. I honestly don't know the performance cost of buffer transfers, but I have a vague sense that it is expensive and worth avoiding when you can.
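Concretely, the output-only pattern I mean looks roughly like this (a sketch only; the point is just that nothing is put() before execute(), only get() afterwards):
import com.aparapi.Kernel;
import com.aparapi.Range;

public class OutputOnlySketch {
    public static void main(String[] args) {
        final int n = 4096;
        final float[] out = new float[n];   // output only; initial contents irrelevant

        Kernel k = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                out[i] = i * 0.5f;
            }
        };

        k.setExplicit(true);
        // No k.put(out): the kernel never reads 'out', it only writes it.
        k.execute(Range.create(n));
        k.get(out);                         // fetch the results back to the host
        k.dispose();

        System.out.println(out[10]);        // 5.0
    }
}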
@hostettler I just saw your code above. Regarding this:
// WPT accumulators per work-item, initialised to zero
float acc[WPT];
for (int w = 0; w < WPT; w++) {
    acc[w] = 0.0f;
}
So (I am speaking from OpenCL, not CUDA, but I bet they are similar in this regard): because WPT is known at the OpenCL compiler's 'compile' time, it is trivial to unroll the loop, so the OpenCL/CUDA compiler basically uses available registers for acc[0] ... acc[WPT-1].
At some point ;) the pressure on the register allocator might force the compiler to fall back to private memory, or leave it in local memory.
I have been playing with TornadoVM. They use Graal as their front end but offer similar capabilities to Aparapi (not as fast yet, and even more buggy than Aparapi). Because they have Graal as a front end, they can actually use this code in their kernels:
float[] buf = new float[COMPILE_CONSTANT_EXPRESSION];
This is because Graal sees this as a constant-sized buffer for all kernels (an OpenCL constraint also), so it can basically turn each buf[n] into a register. This allows them to relax some restrictions that we have in Aparapi.
Truthfully, with a bit more thought, I think we could persuade the Aparapi bytecode layer to replicate this. But..... alas.... it is not trivial.
// Manually unrolled inner loop: WPT (here 8) accumulators kept in plain
// locals so the compiler can hold them in registers.
double value_0 = 0;
double value_1 = 0;
double value_2 = 0;
double value_3 = 0;
double value_4 = 0;
double value_5 = 0;
double value_6 = 0;
double value_7 = 0;
// One element of the A tile is reused against 8 consecutive elements of the B tile.
double a = subA[row * tileSize + tileNr];
int offset = tileNr * tileSize + col;
value_0 += a * subB[offset++];
value_1 += a * subB[offset++];
value_2 += a * subB[offset++];
value_3 += a * subB[offset++];
value_4 += a * subB[offset++];
value_5 += a * subB[offset++];
value_6 += a * subB[offset++];
value_7 += a * subB[offset];