
Hi, I'm looking to work on ILU (cupy/cupy#2749) - I've written iterative solver kernels in CUDA before, including ILU.

Before I start, I have some questions:

- How does cupy's architecture support functions running on single and multiple GPUs? How does cupy determine the number of threads and the number of blocks for each function?
- Sometimes NVIDIA library functions (like cuBLAS, cuSPARSE) are available, but sometimes kernels need to be written by hand. In what kinds of situations do the cupy library developers write custom kernels to support new functions?

That said, as long as the necessary cuBLAS and cuSPARSE functions are in place, the ILU kernel can be written using those high-level APIs, so my questions are mostly about the general case. Thank you.

- Most of CuPy's functions have a highly NumPy-compatible interface, so they do not expose a way to specify the block size or the grid size.
- The custom kernel feature is documented here: https://docs-cupy.chainer.org/en/stable/tutorial/kernel.html. It lets you specify the block size.

Right now almost all cupy functions launch kernels with the same number of threads per block; the number of blocks is determined by the size of the arrays.

We are working on some kind of auto-tuning for these parameters.

The even lower-level class `cupy.cuda.Function` has an interface to specify threads/blocks, but it's mostly used internally.
You can switch the current device by using `cupy.cuda.Device`:

```
with cupy.cuda.Device(1):
    a = cupy.exp(cupy.ones((2, 3), 'f'))
    print(a.device)  # Prints <CUDA Device 1>
```

Hi, I found a function with a type of syntax I've never seen before https://github.com/cupy/cupy/blob/9f9ef7d15d632619f650471e9407722ce6a73438/cupy/_sorting/count.py#L27

Hi, I've written a draft for CuPy sparse mean #2674: https://gist.github.com/wongalvis14/e4c1af431c20cb18d5457a372a32c3ec

It's not yet optimized relative to the numpy and cupy implementations, which I have yet to study.

Please give me some feedback on how to bring it up to production quality, thanks.


That's one thing I noticed in scipy's implementation. I'm wondering if there's code within cupy I can refer to for cupy-oriented optimizations.

Also, I don't know if it's correct to handle the axis=0 case by returning a flattened array; I just did what was simplest and most intuitive to match the output formats of scipy and cupy's dense mean.
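For reference, here is what scipy itself does for the output format, shown on a small example (using `scipy.sparse` directly, since `cupyx.scipy.sparse` is intended to mirror its interface):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1., 0., 2.],
                                [0., 3., 0.]]))

# scipy returns dense 2-D results for axis-wise means,
# not flattened 1-D arrays:
col_means = m.mean(axis=0)  # shape (1, 3)
row_means = m.mean(axis=1)  # shape (2, 1)

print(col_means)  # [[0.5 1.5 1. ]]
print(row_means)  # [[1.] [1.]]
```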

It seems that dividing first and then summing would require elementwise division, which isn't implemented. I also see that even simpler elementwise operations such as addition aren't implemented? https://github.com/cupy/cupy/blob/ebe2cae7b2a6efb2b601d7da1a6471eeb0434a6e/cupyx/scipy/sparse/compressed.py#L426

Adding or dividing by a scalar shouldn't have many tricky cases; is there already a sparse elementwise kernel in cupy?
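One caveat worth noting: scalar division only touches the explicitly stored values, so it can be sketched as an operation on the `.data` array alone, while scalar addition would also change the implicit zeros. A small illustration with `scipy.sparse` (whose interface `cupyx.scipy.sparse` mirrors):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[2., 0., 4.],
                                [0., 6., 0.]]))

# Dividing by a scalar only affects the stored nonzeros, so it can be
# done directly on the .data array without densifying the matrix:
scaled = m.copy()
scaled.data /= 2.0
print(scaled.toarray())  # [[1. 0. 2.] [0. 3. 0.]]

# Scalar *addition* is the tricky case: it would also change the
# implicit zeros, so it cannot be expressed on .data alone.
```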

I have improved it to reduce the risk of precision error: https://gist.github.com/wongalvis14/91400ced33778fd24e315ed85f5c6655

Oh actually, I should put it under cupyx/scipy/sparse/data.py alongside other elementwise functions like power, right?

I saw that one of the projects is to support RNG: https://github.com/cupy/cupy/wiki/GSoC-2020-Project-Ideas

cuRAND is already imported into cupy and cupy.random already exists, right? What is left to be done for that project?


Hey everybody! I am Rushabh Vasani, a second-year I.T. undergrad from India. I am highly interested in working with cupy this summer under GSoC, specifically on "CuPy coverage of NumPy functions". I have contributed to cupy previously, and I am really looking forward to contributing more. Right now I am listing the numpy functions that are not yet in cupy for my reference, and trying to raise some more PRs in the project. Can you guide me further on getting started with this project and the community? Thanks in advance!

`cupy.util.memoize`

, but now both ufuncs/element-wise kernels (I've never used it before): https://github.com/ericmjl/autograd-cupy

I am curious about its performance and extensibility compared to Chainer's (if there is one). Also, it might be worth adding it to the `cupyx` namespace given that the Autograd project is inactive now?
**[Emilio, chainer]** autograd and cupy are orthogonal to each other.

Mostly you can think of Chainer as an autograd library that already includes components for neural networks.

As we decided to relegate Chainer to maintenance mode, we won't be adding or maintaining any kind of autograd support for cupy.