grfrost
@grfrost
@mstefanetti sadly there is no 'best' for every GPU. That's one of the 'joys' of OpenCL ;) But generally local sizes are powers of two.
If you know the device you can look up the device info.
My guidance would be to assume 32. Possibly 64.
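If it helps, here is a minimal sketch of the 'look up the device info' route. This is from memory, so treat the exact accessor names (OpenCLDevice.listDevices(), getMaxWorkGroupSize(), getMaxWorkItemSize()) as assumptions rather than gospel. The product of your 2D local size (e.g. 8 x 8 = 64) has to stay at or below maxWorkGroupSize, and each dimension within the matching maxWorkItemSize entry.

    import java.util.Arrays;
    import java.util.List;

    import com.aparapi.device.Device;
    import com.aparapi.device.OpenCLDevice;

    public class LocalSizeInfo {
        public static void main(String[] args) {
            // Ask Aparapi for every OpenCL GPU it can see and print the limits
            // that bound the usable local (work-group) sizes.
            List<OpenCLDevice> gpus = OpenCLDevice.listDevices(Device.TYPE.GPU);
            for (OpenCLDevice gpu : gpus) {
                System.out.println(gpu.getShortDescription()
                        + "  maxWorkGroupSize=" + gpu.getMaxWorkGroupSize()
                        + "  maxWorkItemSize=" + Arrays.toString(gpu.getMaxWorkItemSize()));
            }
        }
    }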
mstefanetti
@mstefanetti

@grfrost thanks again, I was not asking for a magic number, but for a way to ask Aparapi for suggested local sizes.
This is what I get for the Mandelbrot set on 3 different GPUs.

=======================================================================================================
AparapiFractals - Mandelbrot Benchmark - localSizes 
  image size     : 900 x 900 
  maxIterations  : 10,000 
  complex region : -2.0000000000000000d,-2.0000000000000000d 2.0000000000000000d,2.0000000000000000d 

+-----+------------------+----------------+--------------------------+----------+--------+------------+
|Type | shortDescription | deviceId       | Name                     | LSizes   | ExMode | Elapsed(ms)|
+-----+------------------+----------------+--------------------------+----------+--------+------------+
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 2 x 2    |    GPU |        721 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 4 x 4    |    GPU |        170 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 8 x 8    |    GPU |         95 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 10 x 10  |    GPU |        118 |
| GPU | NVIDIA<GPU>      | 2381463033712  | GeForce GTX 1650 SUPER   | 16 x 16  |    GPU |         97 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 2 x 2    |    GPU |       5081 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 4 x 4    |    GPU |       1330 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 8 x 8    |    GPU |        366 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 10 x 10  |    GPU |        468 |
| GPU | AMD<GPU>         | 2381459464048  | Oland                    | 16 x 16  |    GPU |        376 |
+-----+------------------+----------------+--------------------------+----------+--------+------------+
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 2 x 2    |    GPU |        395 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 4 x 4    |    GPU |        184 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 8 x 8    |    GPU |        194 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 10 x 10  |    GPU |        204 |
| GPU | Intel<GPU>       | 2572020347104  | Intel(R) HD Graphics 620 | 16 x 16  |    GPU |        201 |
+-----+------------------+----------------+--------------------------+----------+--------+------------+

=======================================================================================================

For now I will stay with a fixed 8x8.

Brian Maso
@bmaso
I have questions about JVM arrays copied as GPU global memory through aparapi:
  • Is it possible to transfer an array to device global memory once, and share that global array between multiple kernels?
    • Seems like this is possible; after all, I can explicitly put an array and use it in multiple invocations of the same kernel, so why not put it once and use it in invocations of multiple different kernels?
  • What is the lifecycle of a global array that has been explicitly copied to the GPU global memory (using kernel.put()) ? If I put an array, how long will it remain allocated and taking up space in the GPU's memory? I'm really unclear about how the "clean up" of allocated global device memory happens in aparapi.
^^^ These questions seem to me to be key to understanding how I would implement these 3 different scenarios in aparapi when invoking several device-accelerated computations that all operate on the same source array. For example, imagine I need to compute the min, max, and average values of some large source array. I can imagine 3 progressively more efficient ways to do this:
  1. Least efficient: use 3 different kernels (a "min" kernel, a "max" kernel, and an "average" kernel); each kernel invocation is completely separate from the others: each invocation copies the source array to global memory, works out its statistic, and relies on aparapi to "clean up" the globally allocated device memory at the end of the kernel invocation
  2. More efficient: use 3 different kernels; copy the source array to the device only once (using explicit buffer handling), and invoke each kernel using the shared global array; have faith that aparapi will "know" when to de-allocate the global device memory (see the sketch after this list)
  3. Probably most efficient: merge the 3 kernels into a single kernel that computes all 3 statistics on the source data in a single invocation
    • Smart use of local memory during computation could speed things up even further, though really what I'm interested in is avoiding unnecessary data transfers between device global memory and JVM-accessible host memory.
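To make scenario 2 concrete, here's roughly what I imagine the explicit-buffer version looking like. It's only a sketch: the kernel name, the chunked partial-reduction scheme, and the Device.best()/createRange() usage are my assumptions from memory, and whether the source array really stays resident across different Kernel instances is exactly what I'm asking about.

    import com.aparapi.Kernel;
    import com.aparapi.Range;
    import com.aparapi.device.Device;

    public class SharedSourceSketch {

        // Illustrative "min" kernel: each work item reduces one chunk of the
        // shared source array into a per-work-item partial result.
        static class MinKernel extends Kernel {
            final float[] source;       // shared input, put() once
            final float[] partialMin;   // one slot per work item

            MinKernel(float[] source, int globalSize) {
                this.source = source;
                this.partialMin = new float[globalSize];
            }

            @Override public void run() {
                int gid = getGlobalId();
                int chunk = source.length / getGlobalSize();
                float m = source[gid * chunk];
                for (int i = 1; i < chunk; i++) {
                    float v = source[gid * chunk + i];
                    if (v < m) {
                        m = v;
                    }
                }
                partialMin[gid] = m;
            }
        }

        public static void main(String[] args) {
            float[] source = new float[1 << 20];
            // ... fill source ...

            Device device = Device.best();
            Range range = device.createRange(256);

            MinKernel minKernel = new MinKernel(source, 256);
            minKernel.setExplicit(true);  // we decide when transfers happen
            minKernel.put(source);        // copy the source to device memory once
            minKernel.execute(range);
            minKernel.get(minKernel.partialMin);

            // A MaxKernel / AverageKernel built the same way and executed with a
            // range from the same device would, in theory, reuse the resident
            // source buffer instead of re-transferring it.

            minKernel.dispose();
        }
    }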
Brian Maso
@bmaso
While 3 is most efficient at runtime, I think 2 is a better way to design a library of re-usable code. That is, I imagine a dev having a library of classes implementing various accelerated computations: "min", "max", "average", etc. The dev works by composing these basic accelerated computations into more complex ones. Being able to "share" buffers between kernels would avoid a lot of needless data transfers while still keeping most of the performance benefits.
hostettler
@hostettler
Hello, thanks to your help I managed to implement a matrix multiplication that outperforms the CPU. So first, thank you :) Second, I am trying to implement the optimisation described at https://cnugteren.github.io/tutorial/pages/page4.html but it does not perform better. So my question: with aparapi, will a float[] declared inside the kernel definition be put in registers or in shared memory? I've tried to put it in local memory but it does not bring any benefit vs the page3 optimisation. Thanks again for your help.
grfrost
@grfrost

@bmaso Great questions. You are closing in on Aparapi's weak area: multi-kernel dispatch over the same data. Sadly I do not think that Aparapi currently allows you to share buffers between kernels. I may be recalling the original code I wrote in 2011 ;) so almost 10 years back, but certainly at that time I think we had one OpenCL queue per Kernel (maybe @freemo can clarify). Maybe even one OpenCL context.
You should experiment with 'explicit management' and use a device to create your range. This way it is possible that you can 'force' Aparapi to use the same queue/context. That is the key. You certainly won't be able to do this across devices.
BTW there is a trick I documented somewhere: you put the implementation of two or more kernels in one 'run' method and have the execution switch depending on external state
public void run() {
    if (state == 0) {
        // code for algorithm 1
    } else {
        // code for algorithm 2
    }
}

yeah it is hacky ;) but not terrible from a performance POV. All work items take the same branch, so it's not too bad WRT wavefront divergence.

I used this for a few Aparapi demos back in the day.
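Roughly, the trick looks like this. Names such as MultiPassKernel, mode and the placeholder bodies are made up for the sketch; the point is just that the host flips one scalar field between execute() calls, so every work item in a given dispatch takes the same branch.

    import com.aparapi.Kernel;

    public class MultiPassKernel extends Kernel {
        static final int PASS_MIN = 0;
        static final int PASS_MAX = 1;

        int mode = PASS_MIN;   // switched from the host between execute() calls
        final float[] source;
        final float[] result;

        public MultiPassKernel(float[] source) {
            this.source = source;
            this.result = new float[source.length];
        }

        public void setMode(int mode) {
            this.mode = mode;
        }

        @Override public void run() {
            int gid = getGlobalId();
            if (mode == PASS_MIN) {
                // code for algorithm 1 (placeholder)
                result[gid] = source[gid];
            } else {
                // code for algorithm 2 (placeholder)
                result[gid] = -source[gid];
            }
        }
    }

On the host you would call kernel.setMode(MultiPassKernel.PASS_MIN); kernel.execute(range); then kernel.setMode(MultiPassKernel.PASS_MAX); kernel.execute(range); so both dispatches go through the same Kernel instance, hence the same queue and buffers.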

CoreRasurae
@CoreRasurae
@hostettler Hi, great to know you managed to solve the matrix multiplication performance issue. Arrays are never stored in registers; from what I know, only simple variables (like a plain float) can be stored in registers, and only if they are private memory, i.e. used within a single work item and therefore not shared across a workgroup. In the kernel's field definition area all variables are global by default; they can never be private, and at most can be local when properly annotated. So float[] arrays can be stored either in global or local memory, but never in private memory (registers), from what I know. Variables declared within the kernel method scope will be private. Hope this helps a bit.
@hostettler Local memory only brings an advantage when values are accessed multiple times. For instance, the GPU needs to load data from global memory; if it fetches a value, and that value then has to be read, updated, and stored back to global memory repeatedly, then local memory brings an advantage. However, if the global memory is only read once and updated once, local memory will bring no additional benefit.
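@hostettler As a toy illustration of that "only helps when values are re-read" point, here is a sketch assuming the @Local annotation and localBarrier() from the com.aparapi Kernel API, executed with a 1D Range whose local size matches TILE. The kernel and names are illustrative only, not from your code.

    import com.aparapi.Kernel;
    import com.aparapi.Kernel.Local;

    // Each work item does ONE global read, then its tile entry is re-read by
    // several neighbours out of local memory. That reuse is what makes the
    // local copy worthwhile; read-once/write-once data would gain nothing.
    public class LocalReuseKernel extends Kernel {
        static final int TILE = 64;   // must match the Range's local size

        final float[] in;
        final float[] out;

        @Local float[] tile = new float[TILE];

        public LocalReuseKernel(float[] in, float[] out) {
            this.in = in;
            this.out = out;
        }

        @Override public void run() {
            int gid = getGlobalId();
            int lid = getLocalId();

            tile[lid] = in[gid];   // single global read per work item
            localBarrier();        // make the whole tile visible to the group

            float sum = 0f;
            int count = 0;
            for (int k = -2; k <= 2; k++) {
                int idx = lid + k;
                if (idx >= 0 && idx < TILE) {
                    sum += tile[idx];   // each tile entry is read up to 5 times
                    count++;
                }
            }
            out[gid] = sum / count;
        }
    }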
hostettler
@hostettler
@CoreRasurae thanks for the answer. I will try a dirty trick: 8 private variables and an ugly series of if statements, to see if it improves performance.
CoreRasurae
@CoreRasurae
@hostettler If statements can kill performance if different branches are taken by different work items within a workgroup.
@hostettler Just make sure all work items follow the same if branch
hostettler
@hostettler
@CoreRasurae indeed, that makes sense, so my idea will not work. Thanks for the reminder. Then I do not get how that optimisation is supposed to work: https://cnugteren.github.io/tutorial/pages/page4.html
CoreRasurae
@CoreRasurae
@hostettler From what I see it seems they are using local memory for Asub and Bsub, which they use to perform multiple operations on those TS x TS sub-matrices, the repeated multiply-and-accumulate steps
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
hostettler
@hostettler
@CoreRasurae yes, but what you describe is the page4.html optimisation. That one has been implemented and works. I meant the page5.html optimisation:
float acc[WPT];
for (int w = 0; w < WPT; w++) {
    acc[w] = 0.0f;
}
What is weird is that the assembler code generated seems to address registers and not local memory for the acc[] array
CoreRasurae
@CoreRasurae
@hostettler The title of the page says "Kernel 3: More work per thread"; maybe that is what they intend to do, make each thread busier with more tasks, and maybe that is what improves performance
@hostettler I would have to spend more time studying those kernels, but I don't have that time right now, sorry
hostettler
@hostettler
@CoreRasurae of course I understand. Thanks a lot
CoreRasurae
@CoreRasurae
@hostettler Anyway, the code I am seeing is PTX, which is NVIDIA assembly
@hostettler and from that, I just noticed they improve the usage of local memory by giving more work to each thread
hostettler
@hostettler
// Initialise the accumulation registers
float acc[WPT];
for (int w=0; w<WPT; w++) {
    acc[w] = 0.0f;
}
CoreRasurae
@CoreRasurae

@hostettler They go from this:

    ld.shared.f32   %f50, [%r18+56];
    ld.shared.f32   %f51, [%r17+1792];
    fma.rn.f32  %f52, %f51, %f50, %f49;
    ld.shared.f32   %f53, [%r18+60];
    ld.shared.f32   %f54, [%r17+1920];
    fma.rn.f32  %f55, %f54, %f53, %f52;

to this:

    ld.shared.f32   %f82, [%r101+4];
    ld.shared.f32   %f83, [%r102];
    fma.rn.f32  %f91, %f83, %f82, %f67;
    ld.shared.f32   %f84, [%r101+516];
    fma.rn.f32  %f92, %f83, %f84, %f69;
    ld.shared.f32   %f85, [%r101+1028];
    fma.rn.f32  %f93, %f83, %f85, %f71;
    ld.shared.f32   %f86, [%r101+1540];
    fma.rn.f32  %f94, %f83, %f86, %f73;
    ld.shared.f32   %f87, [%r101+2052];
    fma.rn.f32  %f95, %f83, %f87, %f75;
    ld.shared.f32   %f88, [%r101+2564];
    fma.rn.f32  %f96, %f83, %f88, %f77;
    ld.shared.f32   %f89, [%r101+3076];
    fma.rn.f32  %f97, %f83, %f89, %f79;
    ld.shared.f32   %f90, [%r101+3588];
    fma.rn.f32  %f98, %f83, %f90, %f81;

but it is all in local memory. They just make better use of the Compute Unit hardware

hostettler
@hostettler
the point is that, by default, the acc[WPT] would be put in shared memory, not in local memory (__local not present), and therefore it could not perform better than the previous version. That is why I drew the wrong conclusion.
Thanks for the help
CoreRasurae
@CoreRasurae
@hostettler CUDA Shared memory is OpenCL Local memory
@hostettler They are using CUDA nomenclature, not OpenCL
hostettler
@hostettler
ahhh thanks, that explains a lot
CoreRasurae
@CoreRasurae
No problem
hostettler
@hostettler
I think I will try the ultra-ugly version :) 8 local variables in the kernel (basically unrolling the w<WPT loop). Ugly, but it might be interesting. Thanks a lot for the guidance
Brian Maso
@bmaso
@grfrost Thanks for the reply! I definitely understand how the same data could only be shared within a single device. I suppose some simple experiments are in order using explicit buffer management. I am still wondering about buffer de-allocation/disposal under explicit buffer management. If I explicitly put a buffer using a kernel object and then dispatch the kernel (kernel.execute, that is), when would the device memory allocated for the buffer ever be disposed? Do I need to worry about using up all of the device's memory?
Brian Maso
@bmaso
Reporting results of sharing buffers b/w 2 kernels:
  • Two kernels using the identical device-specific range for dispatch will share explicitly managed buffers
    • put the buffer with one kernel, the other kernel will have access to the same data buffer during execution
  • Two kernels using different device-specific ranges associated with the same device will also share explicitly managed buffers
Brian Maso
@bmaso
Also learned something interesting: you don't need to put an explicitly managed buffer in order to use the buffer during kernel execution. For example, if you have a buffer that holds kernel output only, there's no need to put the buffer prior to kernel execution. The buffer is still apparently allocated by aparapi on the device, and you only need to explicitly get the buffer after the kernel is done executing to gather the kernel results. I honestly don't know the performance costs of buffer transfer, but I have a vague sense that it is expensive and worthwhile avoiding when you can.
I would like to get an explanation of buffer disposal under explicit buffer management. I'd be happy to update the aparapi docs with what I learn.
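For reference, my output-only experiment was shaped roughly like this (sizes and the anonymous kernel are just illustrative). I'm assuming kernel.dispose() is what eventually releases the device-side buffers, but that is exactly the part I'd like confirmed.

    import com.aparapi.Kernel;
    import com.aparapi.Range;

    public class OutputOnlySketch {
        public static void main(String[] args) {
            final float[] input = new float[4096];
            final float[] result = new float[4096];
            // ... fill input ...

            Kernel kernel = new Kernel() {
                @Override public void run() {
                    int gid = getGlobalId();
                    result[gid] = input[gid] * input[gid];
                }
            };

            kernel.setExplicit(true);
            kernel.put(input);                           // the input does need to go up
            kernel.execute(Range.create(input.length));
            kernel.get(result);                          // no put() for the output-only buffer
            kernel.dispose();                            // presumably frees the device-side allocations
        }
    }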
grfrost
@grfrost

@hostettler I just saw your code above. Regarding this:

float acc[WPT];
for (int w = 0; w < WPT; w++) {
    acc[w] = 0.0f;
}

So (I am speaking from OpenCL not CUDA, but I bet they are similar in this regard).

Because WPT is known at 'compile' time, it is trivial for the compiler to unroll the loop. So the OpenCL/CUDA compiler basically uses available registers for acc[0] ... acc[WPT-1].

At some point ;) the pressure on the register allocator might force the compiler to basically use private memory. Or leave it in local memory.

I have been playing with TornadoVM. They use Graal as their front end, but offer similar capabilities to Aparapi (not as fast yet, and even more buggy than Aparapi). But because they have Graal as a front end, they can actually use this code in their kernels:

float[] buf = new float[COMPILE_CONSTANT_EXPRESSION];

This is because Graal sees this is a constant-sized buffer for all kernels (an OpenCL constraint also), so it can basically turn each buf[n] into a register. This allows them to relax some restrictions that we have in Aparapi.

Truthfully, with a bit more thought in Aparapi, I think we could persuade the Aparapi bytecode layer to replicate this. But..... alas.... it is not trivial.

hostettler
@hostettler
@grfrost thanks for the explanation. That being said, I used the ugly approach to no avail. Meaning I did:
    double value_0 = 0;
    double value_1 = 0;
    double value_2 = 0;
    double value_3 = 0;
    double value_4 = 0;
    double value_5 = 0;
    double value_6 = 0;
    double value_7 = 0;
and then
            double a = subA[row * tileSize + tileNr];

            int offset = tileNr * tileSize + col;
            value_0 += a * subB[offset++];
            value_1 += a * subB[offset++];
            value_2 += a * subB[offset++];
            value_3 += a * subB[offset++];
            value_4 += a * subB[offset++];
            value_5 += a * subB[offset++];
            value_6 += a * subB[offset++];
            value_7 += a * subB[offset];
to save reloading a, but unfortunately it is slower than the more naive approach. I must be missing something. I will commit the code on github in case someone has an idea. What is strange is that I objectively execute fewer instructions with this version.
I just would like to thank all of you for taking the time to help newbies :)
Jeffrey Phillips Freeman
@freemo
sorry guys, I've been away for the holidays, but I'm back
after New Year's I'll start working on a new release for everyone. Let me know if anyone needs anything.
@hostettler I do think what grfrost said is correct. I do not recall approving or testing any fixes that change the one-queue-per-kernel setup... at least I don't remember anyone touching that part of the code. With that said, the beauty of open source is we can always make that fix if someone is willing to try to do it. I might have some time next year as my personal time is freeing up significantly, since I'm living off passive income more lately.
hostettler
@hostettler
@freemo thanks. That being said, as mentioned previously, I simulated the optimization by having 8 variables, but it does not improve the performance.