grfrost
@grfrost
@CoreRasurae my new work has no pretence of being Aparapi compatible. It requires Java 11 and C++17. I see that Oracle just dropped JVMCI in Java 17 ... which sucks.
Soloem
@Soloem
@CoreRasurae I was under the impression that the _globalWidth parameter in execute was for the total number of threads. Am I misunderstanding something about Range and execute()?
CoreRasurae
@CoreRasurae
@Soloem Yes, the globalWidth is the total number of threads. However, when the localWidth is not specified, Aparapi tries to assign one automatically, and it may have made a bad decision that exceeds the local thread limit allowed by the NVIDIA OpenCL ICD.
Soloem
@Soloem
Then if I understand correctly, I would have to keep my program within the 256-thread limit until it is finished. Do you know whether this would affect performance? Would it be possible to just execute the kernel multiple times to fill up as many threads as possible as a workaround?
CoreRasurae
@CoreRasurae
@Soloem OpenCL/CUDA GPGPU performance is tricky in general. What works well for one device may not be adequate for another, due to internal architectural differences.
In any case, there are some common procedures.
Regarding your question: you can have multiple work groups in the same kernel execution,
which will maximize compute unit usage.
The memory access patterns of the kernel are also important for the final performance,
since the GPU cache size per thread is normally very small.
CoreRasurae
@CoreRasurae
To improve on that, it is up to the programmer to make clever use of local memory, in order to reduce global memory accesses.
Just ensure that the local work group size is a sub-multiple of the global size.
Make sure that you specify both the global and local sizes for the Range.
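For instance, a minimal sketch with both sizes given explicitly (the sizes are illustrative; this assumes the com.aparapi Kernel and Range classes):

import com.aparapi.Kernel;
import com.aparapi.Range;

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId(); // 0..1023, one thread per element
        // ... per-element work here ...
    }
};
// 1024 total threads with an explicit work group size of 256,
// i.e. 4 work groups that can spread across the compute units
kernel.execute(Range.create(1024, 256));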
Soloem
@Soloem
Alright, thank you. I'll have to look further into the documentation for those methods.
grfrost
@grfrost

@Soloem To clarify @CoreRasurae's comment about Aparapi making a bad decision ;) Aparapi queries the device to determine the max possible group size. Some devices 'lie', and so sometimes Aparapi will trust this value and request a dispatch using a bad value. I have an NVidia Mac OpenCL driver which lies: it claims it can handle more than it can.
Another issue is that 'max group size' may only apply for small kernels. If a kernel needs lots of local state, it may not be possible to use the max group size for a given kernel.
OpenCL does allow one to query the runtime for the 'best group size' for a given kernel. We do not use this yet.

So whilst in many cases Aparapi's guess will be OK, it is not foolproof. Follow @CoreRasurae's advice and specify it. Understanding the interplay between global size and group size is worth learning; lots of tricks will start to make sense. ;)

CoreRasurae
@CoreRasurae
@grfrost I believe the best work group size for a given kernel can only be retrieved on OpenCL 2.0 and later.
@Soloem GPGPU brought speed-ups for some applications; however, one may argue that it came at the cost of breaking high-level programming concepts and taking away simplicity and maintainability. Aparapi does hide some of that, but not all can be hidden.
CoreRasurae
@CoreRasurae
@grfrost Thanks. If you don't mind that I take a peek at the code, I don't mind if the sausages are still growing in the fields. ;) Maybe I can contribute something, or some ideas. I am sure it will take some time before I assimilate all that, given my reduced spare time...
Soloem
@Soloem
So if I understand correctly @grfrost @CoreRasurae, when creating a Range I should always have a maximum group size of 256? In this case Range.create(1024, 256), or a multiple of up to 256 with 2D and 3D Ranges? This doesn't necessarily mean total threads, does it? I feel as if I'm misunderstanding the Multiple Dim Ranges documentation on this. https://aparapi.com/documentation/multiple-dim-ranges.html
Soloem
@Soloem
At the same time, looking into the local memory documentation, the localBarrier() method synchronizes across the entire group. I'm not well versed in GPGPU, but I would think that this has some downside? The doc also explains that it is common to first load the whole global array into the local array, which I would assume takes some time as well. Is there no way to directly copy the global array to the local array?
grfrost
@grfrost
@Soloem Rules for group size:
1) Smaller than global size (might seem obvious, but hey)
2) The largest factor of global size that is less than or equal to the maximum group size allowed by the device
So if global size is 1000, a good group size would be 250 (256 will not work because 1000 % 256 != 0)
So if global size is 333, then group size might be 111
This means that if global size is prime... you can only use a group size of 1
Powers of two are very common because they map well to hardware (hardware generally has 16, 32, 64, 128 or 256 lanes)
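In code, that rule might look like this minimal sketch (pickGroupSize and maxGroupSize are illustrative names; the device limit would come from a device query):

static int pickGroupSize(int globalSize, int maxGroupSize) {
    // largest factor of globalSize that does not exceed the device limit
    for (int g = Math.min(globalSize, maxGroupSize); g >= 1; g--) {
        if (globalSize % g == 0) {
            return g; // e.g. 1000 -> 250, 333 -> 111, prime -> 1
        }
    }
    return 1; // unreachable, since 1 always divides globalSize
}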
@CoreRasurae I will respond in private chat
CoreRasurae
@CoreRasurae
@Soloem Complementing what @grfrost said:
The local memory is typically 48 KB per compute unit on NVIDIA GPUs, so it is unlikely that the whole data will fit at once.
That is why I was saying the other day that GPGPU brought speed-ups for some applications; however, one may argue that it came at the cost of breaking high-level programming concepts and taking away simplicity and maintainability. Aparapi does hide some of that, but not all can be hidden.
The OpenCL memory model has four kinds of memory.
CoreRasurae
@CoreRasurae
Device memory is memory that is off the GPU chip, while still being part of the GPU device.
The device memory is normally where the OpenCL global memory resides, as well as the constant memory.
Then there is memory inside the GPU chip, and it comes in two forms: small RAMs per compute unit, and registers.
The small RAMs are used for the OpenCL local memory,
and the registers are the private memory in the OpenCL memory model.
It is up to the kernel programmer to efficiently manage all those resources in the kernel code.
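As an illustration, a minimal Aparapi sketch of staging global data into local memory, assuming the @Local annotation and localBarrier() described in the Aparapi local-memory documentation (names and sizes here are illustrative):

import com.aparapi.Kernel;

public class TileKernel extends Kernel {
    final int[] global = new int[1024];     // lives in OpenCL global memory
    @Local final int[] tile = new int[256]; // one tile per work group, in local memory

    @Override
    public void run() {
        int lid = getLocalId();
        tile[lid] = global[getGlobalId()]; // each thread copies one element
        localBarrier();                    // wait until the whole group has loaded its tile
        // ... operate on tile[] instead of re-reading global memory ...
    }
}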
CoreRasurae
@CoreRasurae
It is a bumpy ride for the programmer: one could pick a regular high-level language and do the job comfortably, but instead one wants the code to go faster, and in order to do so one has to sacrifice that comfort.
most device architecture details are exposed almost directly to the programmer
GPUs are not flexible enough to support all high-level language concepts
CoreRasurae
@CoreRasurae
there is almost no hardware abstraction
Soloem
@Soloem
Okay, thank you! I did a test of multidimensional arrays, and it seems to be working. The site states that Aparapi doesn't work with multidimensional arrays; perhaps that is because Java's multidimensional arrays are actually arrays of arrays. Is there some form of under-the-hood work being done, or am I just being fooled by my outputs?
Also (I could probably just test this too), would enums be prevented by Aparapi? I'm assuming they wouldn't be if they're just converted to an int and back.
grfrost
@grfrost

@Soloem Hmm, I am suspicious of multidim arrays. It is possible someone has added support... but generally I would not rely on it.

Instead of

int[][] a = {{1, 2, 3, 4}, {5, 6, 7, 8}}; // 2 rows x 4 columns
for (int x = 0; x < 2; x++) {
    for (int y = 0; y < 4; y++) {
        doSomethingWith(a[x][y]);
    }
}

Use

int[] a = {1, 2, 3, 4, 5, 6, 7, 8}; // same data, flattened row-major
for (int x = 0; x < 2; x++) {
    for (int y = 0; y < 4; y++) {
        doSomethingWith(a[x * 4 + y]); // row x, column y
    }
}
Enums are heap objects in Java, so the GPU cannot access them; sadly, no, you should not use them.
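One hypothetical workaround sketch (State and toOrdinals are illustrative names, not Aparapi API): convert the enums to plain ints on the host, hand the int[] to the kernel, and map values back afterwards.

enum State { IDLE, RUNNING, DONE }

static int[] toOrdinals(State[] states) {
    int[] out = new int[states.length];
    for (int i = 0; i < states.length; i++) {
        out[i] = states[i].ordinal(); // the kernel only ever sees ints
    }
    return out;
}
// After execution, map a value v back with State.values()[v].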
CoreRasurae
@CoreRasurae
@Soloem Yes, the support for multidimensional arrays was sorted out a few Aparapi versions ago. It supports multidimensional arrays up to three dimensions.
@Soloem Yes, there is work being done under the hood. The multidimensional arrays are just for ease of testing/porting Java code to kernels. If you want to reach peak performance you should only use 1D arrays.
@Soloem I should update the documentation.
I never tested arrays of enums with Aparapi. I am not sure if it is supported.