valorcurse
@valorcurse
Looks interesting, maybe this should do it
The size of the array doesn't change very often, so it might work
@stuartarchibald Thank you
stuartarchibald
@stuartarchibald
@valorcurse no problem :) thanks for trying Numba!
JSKenyon
@JSKenyon
Hi all! I have been running into a strange error when using typed lists and dask, and I was wondering if anyone else had run into something similar. Essentially I use a typed list to hold several arrays with equal ndim but differing shapes. I then pass these typed lists into some Numba code, mutate the contents of the arrays in the list, and then return the results. However, I sporadically get core dumps and wildly varying error messages; otherwise it runs through. There doesn't really seem to be a predictable pattern, and my efforts to write a minimum reproducer have thus far been unsuccessful.
However, the problem goes away if I replace typed_list[array_ind] = some_function(typed_list[array_ind]) with typed_list[array_ind][:] = some_function(typed_list[array_ind]). So it would seem that somehow assigning into the list element, as opposed to the array itself, is causing strange behaviour. Bear in mind that some_function returns an array of the same size as the array in the list. I know that this is a somewhat vague question, but I felt I should ask here first before spending more time on constructing an MWE for such an erratic bug.
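For context, a minimal sketch of the two assignment patterns being compared (some_function here is just a stand-in for the real function, which isn't shown):

import numpy as np
from numba import njit
from numba.typed import List

@njit
def some_function(arr):
    # hypothetical stand-in: returns a new array with the same shape
    return arr * 2.0

@njit
def rebind_element(typed_list, i):
    # rebinds the list slot to a freshly allocated array (the pattern that crashes sporadically)
    typed_list[i] = some_function(typed_list[i])

@njit
def write_in_place(typed_list, i):
    # writes the result into the existing array's buffer (the pattern that works)
    typed_list[i][:] = some_function(typed_list[i])

tl = List()
for shape in [(3,), (5,)]:
    tl.append(np.ones(shape))

rebind_element(tl, 0)
write_in_place(tl, 1)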
stuartarchibald
@stuartarchibald
@JSKenyon any chance you could please open a ticket with a rough explanation and a backtrace from the core dump?
I'd guess it's some sort of internal state corruption. Also, are you using dask with the threading backend? I'm wondering if it's some sort of concurrent access issue too.
JSKenyon
@JSKenyon
@stuartarchibald I am working on it - unfortunately the server on which I am experimenting is missing some tools for inspecting the core dump. They should be installed by Monday, at which point I will do as you suggest. I am using the threading backend. That being said, there should only be one (dask) thread operating on each list. I am not using any parallelism in the numba code at the moment.
stuartarchibald
@stuartarchibald
@JSKenyon perhaps try https://docs.python.org/3/library/faulthandler.html and/or catchsegv, to save you having to wait for tooling. Is the GIL enabled in your compiled functions?
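For example, the standard-library faulthandler can be enabled with a couple of lines (no extra tooling required):

import faulthandler
# dump a Python traceback for every thread on SIGSEGV, SIGFPE, SIGABRT, etc.
faulthandler.enable()

# alternatively, without touching the code, run the script as:
#   python -X faulthandler your_script.py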
JSKenyon
@JSKenyon
@stuartarchibald All the Numba code is nogil and nopython.
stuartarchibald
@stuartarchibald
@JSKenyon right, so it could be N threads accessing the List internal structure at the same time and doing some mutating?
JSKenyon
@JSKenyon
@stuartarchibald I must admit to still being approximately an amateur, but I wouldn't have thought so. I am using blockwise with a typed list per dask chunk, and each chunk should be handled by a single thread.
stuartarchibald
@stuartarchibald
@JSKenyon ah ok, so you are not sharing a List, right, in which case this may well be an impl bug. If you can get a reproducer, even if it's just sporadic, and put it on the issue tracker that'd be really helpful, thanks :)
JSKenyon
@JSKenyon
@stuartarchibald Will try my best. :-)
stuartarchibald
@stuartarchibald
@JSKenyon great, thanks, even just observations and a backtrace, the environment, an explanation of the code, a code snippet etc., any sort of information is helpful :) and feel free to update it if new info appears!
Diptorup Deb
@diptorupd
Oops sorry for the flood
Do you use any pastebin for logs of this nature?
stuartarchibald
@stuartarchibald
usually a github gist
Diptorup Deb
@diptorupd
Would this be an issue related to libedit? I did not see it installed on my server
I was building llvmlite against LLVM 8.0.1
stuartarchibald
@stuartarchibald
I'd guess that LLVM build isn't quite right
probably lacks -fPIC somewhere?
Diptorup Deb
@diptorupd
Yes, my guess too. I used the build.sh under llvmlite/conda-recipes/llvmdev
stuartarchibald
@stuartarchibald
do you actually need an LLVM build?
there's prebuilt ones in our conda channel
stuartarchibald
@stuartarchibald
The recipe under that dir works fine against 8.0.0 as it's the one the currently shipped llvmlite is built against
perhaps 8.0.1 has something that broke it
Diptorup Deb
@diptorupd
No, right now I do not need to build LLVM by hand. I can use the prebuilt binary for now. But down the road I would like to be able to build LLVM and use that with llvmlite.
Ok, that might be the reason. Let me see if that helps. But for now I may just take the detour you suggest and use the pre-built binary.
Thanks @stuartarchibald
stuartarchibald
@stuartarchibald
np
Leo Fang
@leofang

@stuartarchibald

@leofang thanks, RE numba/numba#4175 are you interested in an opinion on the spec, fixes, or both :)?

I suppose either or both. On the strides part, I think @pentschev's comment there about making it mandatory, plus adding one more flag to indicate any contiguity, would be very helpful.

On the other hand, I see that instead of having a contiguity flag, NumPy's __array_interface__ uses strides=None to indicate C-contiguity. (It should be emphasized that even though the spec says strides is optional, NumPy always provides it, with the value None for C-contiguous arrays.) If following the NumPy convention is intended, then Numba's spec is good and only a code fix is needed, and this fix must be propagated to, e.g., PyTorch, which currently also always gives strides explicitly like Numba does now, regardless of whether the array is C-contiguous or not.

In any case, a __cuda_array_interface__ consumer should only be in charge of parsing the information, not attempting to validate it; it's the producer's responsibility to provide accurate info and make parsing easy (thus it's good to have the spec modified and/or clarified).

On the pointer to zero-size array, Numba is not compliant, so a code fix is sufficient. Spec is good here.
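To make the strides convention concrete, here is a minimal producer-side sketch (the helper name is hypothetical; the keys are the ones discussed in the interface spec):

def make_cuda_array_interface(ptr, shape, typestr, strides=None):
    # Hypothetical helper illustrating the NumPy-style convention:
    # pass strides=None for C-contiguous data instead of always
    # spelling the byte strides out explicitly.
    return {
        'shape': tuple(shape),
        'typestr': typestr,         # e.g. '<f8' for little-endian float64
        'data': (int(ptr), False),  # (device pointer, read-only flag)
        'strides': strides,         # None means C-contiguous
        'version': 2,               # check the current spec for the right version number
    }

# C-contiguous (4, 3) float64 buffer at some device address addr:
#   make_cuda_array_interface(addr, (4, 3), '<f8')             # strides omitted -> None
# non-contiguous (e.g. F-ordered) buffer: give the byte strides explicitly:
#   make_cuda_array_interface(addr, (4, 3), '<f8', (8, 32))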

Leo Fang
@leofang

On the pointer to zero-size array, Numba is not compliant, so a code fix is sufficient. Spec is good here.

Sorry, I was wrong. Spec should be modified for this corner case, and as I mentioned in the issue, following CuPy's approach and setting the pointer to 0 would be nice.

francoislauger
@francoislauger

I'm trying to iterate through a tuple of functions and I'm getting an error.
This works:

def run_a(all_a):
    for a in range(len(all_a)):
        print(all_a[a])
tuple_a = (1,2)
run_a(tuple_a)

But this doesn't:

from numba import njit

@njit
def a1(x):
    return x + 1

@njit
def a2(x):
    return x + 2

@njit
def run_a(all_a):
    for a in range(len(all_a)):
        print(all_a[a](1))

tuple_a = (a1, a2)
run_a(tuple_a)

#Error :
#Invalid use of Function(<built-in function getitem>) with argument(s) of type(s): ((type(CPUDispatcher(<function a1 at 0xa28c6c6a8>)), type(CPUDispatcher(<function a2 at 0xa289daae8>))), int64)
#* parameterized
#In definition 0:
#    All templates rejected with literals.
# ...
#This error is usually caused by passing an argument of a type that is unsupported by the named function.
#[1] During: typing of intrinsic-call at <ipython-input-404-341a0e6b14c6> (10)
#
#File "<ipython-input-404-341a0e6b14c6>", line 10:
#def run_a(all_a):
#    <source elided>
#    for a in range(len(all_a)):
#       print(all_a[a](1))
#       ^

But this works too:

@njit
def run_a(all_a):
    for a in range(len(all_a)):
        print(all_a[1](1))  # fixed (compile-time constant) index

tuple_a = (a1, a2)
run_a(tuple_a)
francoislauger
@francoislauger
The problem seems to be with heterogeneous tuples, because this doesn't work either:

@njit
def run_a(all_a):
    for a in range(len(all_a)):
        print(all_a[a])

tuple_a = (1, 2.0)
run_a(tuple_a)
stuartarchibald
@stuartarchibald
@francoislauger the issue you are experiencing is due to Numba not being able to determine what the type of the result should be. In the case of a fixed index, that index is a compile-time constant, so the tuple can be indexed at compile time, the specific type of the item being accessed can be found, and it can take part in type inference. In the case of indexes that are typed as e.g. int64, it's not possible to infer at compile time the types of the items at those indexes, hence it's not possible to compile.
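If iterating over a heterogeneous tuple is really needed, one possible workaround (just a sketch, and it assumes a Numba version that provides numba.literal_unroll) is to have the loop unrolled so every item is accessed at a compile-time-constant position:

from numba import njit, literal_unroll

@njit
def run_all(all_a):
    # literal_unroll causes the loop body to be versioned per tuple element,
    # so each element is accessed with a compile-time-constant index
    for f in literal_unroll(all_a):
        print(f(1))

tuple_a = (a1, a2)  # the jitted functions from the example above
run_all(tuple_a)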
@leofang ah I see, thanks. Would you be interested in creating a patch to address the points noted?
resident12
@resident12
@sklam GPUDirect RDMA would allow streaming data directly into the memory of a GPU. How can I access that data with Numba/CUDA?
Leo Fang
@leofang
@stuartarchibald if a final decision can be agreed upon and we get a green light from you, I could find some time to make a patch (with your help :D), but probably not in the next two weeks. If @pentschev is in a rush, he could also take over the task I suppose.
Alex Marshall
@AlexMarshall12
Is numba.cuda.const.array_like([config_var1, config_var2, ..]) something that I can use to allocate an array in constant memory on the GPU?
like is that how I set global config variables?
why does it have to be an array?
Alex Marshall
@AlexMarshall12
I see now I have to call it inside a @cuda.jit-decorated function
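For later readers, a minimal sketch of how cuda.const.array_like is typically used (the CONFIG array and kernel below are illustrative, not from this conversation):

import numpy as np
from numba import cuda

# host-side array known at kernel compile time (e.g. a module-level global)
CONFIG = np.array([0.5, 2.0, 10.0])

@cuda.jit
def apply_config(out):
    # copies CONFIG into constant memory for this kernel
    cfg = cuda.const.array_like(CONFIG)
    i = cuda.grid(1)
    if i < out.size:
        out[i] = out[i] * cfg[0] + cfg[1]

# d_out = cuda.to_device(np.ones(1024))
# apply_config[4, 256](d_out)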
luk-f-a
@luk-f-a
@francoislauger , the workaround here might help: #2542
Alex Marshall
@AlexMarshall12
what can I do when I have a big list of config variables that I need on the GPU? Right now I am making a big numpy array with them and sending it over with cuda.to_device(). But then there are issues because all of my config variables have to have the exact same type. https://bpaste.net/show/t-zu
should I just create a new GPU variable with cuda.to_device() for each one individually and then pass those to the cuda.jit'ed functions?
Alex Marshall
@AlexMarshall12
basically, where do people usually store their config variables for cuda.jit functions? Do they pass them in from a config file, pass them in as function parameters (each one an array holding a scalar), or something else?
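One possible approach, sketched below rather than taken from this chat: pack the mixed-type config values into a single NumPy structured (record) array, which the CUDA target can consume, so each field keeps its own dtype. The field names and values are made up for illustration:

import numpy as np
from numba import cuda

# hypothetical config packed as a one-element structured array
config_dtype = np.dtype([('n_iter', np.int32), ('tol', np.float64), ('scale', np.float32)])
config = np.zeros(1, dtype=config_dtype)
config['n_iter'] = 100
config['tol'] = 1e-6
config['scale'] = 2.5

d_config = cuda.to_device(config)

@cuda.jit
def kernel(out, cfg):
    i = cuda.grid(1)
    if i < out.size:
        # record fields are accessed by name, each with its own type
        out[i] = out[i] * cfg[0]['scale'] + cfg[0]['tol']

# d_out = cuda.to_device(np.ones(1024))
# kernel[4, 256](d_out, d_config)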
Sean M. Law
@seanlaw

I have a situation right now that I am trying to optimize which looks like this:

import math
from numba import cuda

@cuda.jit
def first_func(x, y):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for j in range(start, x.shape[0], stride):
        # Do something with x and y
        pass

@cuda.jit
def second_func(y, z):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for j in range(start, z.shape[0], stride):
        # Do something with y and z
        pass

@cuda.jit
def third_func(z):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for j in range(start, z.shape[0], stride):
        # Do something with z
        pass

def main(x, y, z):
    device_x = cuda.to_device(x)
    device_y = cuda.to_device(y)
    device_z = cuda.to_device(z)

    threads_per_block = 512
    blocks_per_grid = math.ceil(x.shape[0] / threads_per_block)

    for i in range(100000):
        first_func[blocks_per_grid, threads_per_block](device_x, device_y)
        second_func[blocks_per_grid, threads_per_block](device_y, device_z)
        third_func[blocks_per_grid, threads_per_block](device_z)

    return device_y.copy_to_host(), device_z.copy_to_host()

Here's my problem: When the outer i for-loop range in main() is large, the majority of my time is spent calling the first_func, second_func, and third_func kernels while very little time is spent on the GPUs (since the kernel functions are short). While each iteration of the for-loop is independent of the others, device_y and device_z are being updated depending on device_x. The first thing that I tried was to combine the kernels into a single kernel in order to decrease the total number of kernel calls:

@cuda.jit
def combined_func(x, y, z):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for j in range(start, x.shape[0], stride):
        # Do something with x and y
        pass

    for j in range(start, z.shape[0], stride):
        # Do something with y and z
        pass

    for j in range(start, z.shape[0], stride):
        # Do something with z
        pass

def main(x, y, z):
    device_x = cuda.to_device(x)
    device_y = cuda.to_device(y)
    device_z = cuda.to_device(z)

    threads_per_block = 512
    blocks_per_grid = math.ceil(x.shape[0] / threads_per_block)

    for i in range(100000):
        combined_func[blocks_per_grid, threads_per_block](device_x, device_y, device_z)

    return device_y.copy_to_host(), device_z.copy_to_host()

This helped improve the speed by about a factor of three, but there are still 100,000 kernel calls (instead of 300,000). Right now, I can't seem to wrap my head around how I could move the i for-loop into the combined_func kernel altogether while (1) ensuring that I don't have a race condition where the computations from i+1 overwrite the computations from i (since everything will be writing to device_y and device_z for any value of i), and (2) it isn't clear how to set up the threads/blocks separately for the i for-loop (since different threads might see a different value of i depending on the block). I have looked into thread synchronization, but that is only available at the block level; it would be nice if I could force the GPU to synchronize at the end of each i. Any help would be greatly appreciated!