    Andrea Giardini
    @AndreaGiardini
    Can one of the maintainers please give me a second opinion on dask/dask-gateway#353? During the past weeks, two more issues have been created with the same problem.
    Ixiodor
    @Ixiodor
    Hi, can I ask a question about a kind of NumPy-to-Dask conversion?

    I have this:
    img_new = cs[band_values]

    where img_new is my future image result, cs holds the reference values (it's like an index that band_values refers into), and band_values is my 2D matrix that needs to be "mapped" based on the cs values.

    Currently band_values is flattened first and then reshaped. Is there a way to make this work on the 2D array directly, without flattening and reshaping again?
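    (For reference, a minimal sketch of one way this can be kept 2D, assuming cs is a small in-memory NumPy lookup array and band_values is a 2D dask array; the names match the snippet above:)

    import dask.array as da
    import numpy as np

    cs = np.array([0.0, 0.5, 1.0, 2.0])   # reference/lookup values (placeholder data)
    band_values = da.random.randint(0, 4, size=(1000, 1000), chunks=(500, 500))

    # NumPy fancy indexing with a 2D index array returns an array of the same
    # shape as the index, so applying it block-wise keeps the result 2D
    # without any flatten/reshape
    img_new = band_values.map_blocks(lambda b: cs[b], dtype=cs.dtype)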
    ldacey
    @ldacey_gitlab
    Anyone know if there is a performance difference between using "in" filters versus "or" filters? For example 'date in ["2020-01-01", "2020-01-05"]' versus ('date == "2020-01-01"') or ('date == "2020-01-05"'),
    assuming there is a much bigger list (10+ at least) of values being filtered?
    I am saving these filters externally between Airflow tasks, so using "in" keeps the list neater when I need to filter a single column, but I wonder if there are any performance differences.
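    (A concrete sketch of the two filter styles, assuming a dataset with a date column; dd.read_parquet accepts pyarrow-style filter tuples, where the outer list is OR and inner lists are AND:)

    import dask.dataframe as dd

    # single "in" filter on one column
    df_in = dd.read_parquet("data/", filters=[("date", "in", ["2020-01-01", "2020-01-05"])])

    # equivalent disjunction of "==" filters
    df_or = dd.read_parquet("data/", filters=[[("date", "==", "2020-01-01")],
                                              [("date", "==", "2020-01-05")]])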
    Martin Durant
    @martindurant
    ^ best to measure, but I would expect in to perform better in pandas, since there are fewer operations and intermediates, and it also results in a smaller graph (although that might not matter after optimisation)
    ldacey
    @ldacey_gitlab
    cool, is Dask using the pyarrow filtering for parquet files, or is there a difference in how those filter expressions are handled? any idea?
    Martin Durant
    @martindurant
    Ah, filtering on parquet is something else again. Yes, dask attempts to use the loader’s internal filtering, but I would still run the benchmark experiment
    ldacey
    @ldacey_gitlab
    sure, I will test it. yeah, I was mainly considering partition-level and then row-level filtering for parquet files specifically
    Eldad Cohen
    @eldadcohen1
    Hi all,
    Is there any reference for installing Dask on AWS EMR that actually works?
    Craig Russell
    @ctr26
    How does one set the default serializer and deserializer to dill or cloudpickle for a local dask client?
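    (A minimal sketch of the knob involved, as an assumption about what's being asked: the distributed Client accepts serializers/deserializers lists of family names, and the built-in "pickle" family falls back to cloudpickle for objects plain pickle can't handle; dill would need a custom family registered via register_serialization_family:)

    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=2)
    # serialization families are tried in order
    client = Client(cluster,
                    serializers=["dask", "pickle"],
                    deserializers=["dask", "pickle"])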
    Eldad Cohen
    @eldadcohen1
    Hi all and good morning,
    I installed a Dask cluster on AWS EMR.
    I have access to the Jupyter notebook, but nothing is running.
    What am I doing wrong?
    [image attached: image.png]
    David Grant
    @davegravy

    I want to call dask_yarn.scale() so that all the cores of my EMR cluster are used. My clusters may be launched with varying numbers and types of instances, so I can't hard-code scaling constants.

    Do I need to calculate for myself how many cores I have (e.g. using boto3 and querying instance type info) and manage dask_yarn.scale(), or does dask_yarn have this knowledge? Or should I use .adapt() for this use case?
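    (A rough sketch of the adaptive route, assuming dask-yarn's YarnCluster; the environment name and worker sizes below are placeholders, and YARN decides how many containers actually fit on the cluster:)

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    cluster = YarnCluster(environment="environment.tar.gz",   # packed conda env (placeholder)
                          worker_vcores=2,
                          worker_memory="4GiB")
    client = Client(cluster)

    # let the scheduler grow/shrink the worker count with the workload,
    # instead of computing a fixed number of cores up front
    cluster.adapt(minimum=0, maximum=100)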

    David Grant
    @davegravy
    Also, is there any guidance anywhere on how to use EMR "steps" with Dask? I want to launch my cluster, do my work programmatically, and then have it auto-terminate afterwards.
    David Grant
    @davegravy
    @eldadcohen1 : I think it has to do with this: jupyter/notebook#5920
    I'm having the same issue
    Andrea Giardini
    @AndreaGiardini

    Hello everyone
    I am having the same issue that has been reported here: dask/dask-yarn#118, but on a Kubernetes cluster.
    TLDR: The cluster dashboard is available, but when I try to open the dask workers' dashboard I get a 403. My dask and distributed versions are the latest.

    Is anyone experiencing the same problem?

    ldacey
    @ldacey_gitlab
    Trying to troubleshoot why the ds.dataset() method is taking a very long time to run when scheduled with Airflow (11 minutes) compared to running it manually in a Jupyter notebook (6 seconds)... same exact server, same code.
    Perhaps a Docker network is slowing things down? Potentially different packages (I looked at the likely culprits of pyarrow, fsspec, adlfs, and dask, and they all seem to match).
    Katharine Hyatt
    @kshyatt
    Hello again, back with another question. I'm running a dask-cuda workload (where my functions are using jax) and creating the workers on the command line with dask-cuda-worker --device-memory-limit=0. However, when I look at the dashboard I see that there is still a lot of disk-write-my_gpu_function happening. The dashboard also indicates that the workers are nowhere near the CPU memory limit. From reading the dask-cuda docs it seems that setting the device memory limit to 0 should disable spilling, and instead the code should just OOM if the GPU runs out of memory. Is that not correct?
    Martin Durant
    @martindurant
    Are you sure your data is stored on the GPU? You may be spilling to disc from main memory.
    Katharine Hyatt
    @kshyatt
    It's a little unclear to me. The function sends a bunch of arrays to the GPU, multiplies them, and then brings the resulting array back to CPU memory. I suspect the issue is GPU memory because the dask dashboard is showing there's no memory pressure on the CPU side (no worker ever approaches even 10% of its CPU memory limit) whereas the GPU memory is under high pressure -- when I look with nvidia-smi the GPU has 15300 MB / 16160 MB allocated.
    I can try setting --memory-limit=0 for the CUDA workers as well and report back?
    Martin Durant
    @martindurant
    Worth a go
    Katharine Hyatt
    @kshyatt
    Will do, once my current slow job finishes :) Thanks!
    Katharine Hyatt
    @kshyatt
    Alright, tried again with --memory-limit=0. I ran the job several times and the first run went swimmingly but the second (which I started immediately after) is again hitting disk-write. I wonder if the problem is that jax/dask-cuda are not releasing GPU memory?
    For the first run I completely shut down the scheduler and workers and restarted them.
    Martin Durant
    @martindurant
    Maybe possible, I don’t know about that. You can always restart worker processes though, just with a call on the client.
    Katharine Hyatt
    @kshyatt
    Yes, that's what's odd about it. To test what's going on, I do client = Client(my_addr) and then client.restart() (I know this is inefficient, but I wanted to try to understand the problem). However, even then it seems the memory is not released...
    Diederik Greveling
    @DPGrev
    I have a general question about dask/gcsfs. I was wondering what the purpose is of https://github.com/dask/gcsfs/blob/master/gcsfs/auth.py. Why was it added and where is it used?
    David Grant
    @davegravy
    Does anyone know how to modify the bootstrap.sh provided by Dask to work with the Jupyter Enterprise Gateway that's integrated into AWS EMR in more recent versions of EMR?
    Martin Durant
    @martindurant
    @DPGrev : can you put this in an issue on gcsfs?
    Katharine Hyatt
    @kshyatt
    OK, after some more investigating it looks like Jax is the culprit. I'll try restricting how much memory Jax can use and see if that helps. I'm still confused about why dask-cuda is sending data to disk if I've set --device-memory-limit=0 and --memory-limit=0 though.
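    (For context, a minimal sketch of what restricting JAX's GPU memory could look like; by default JAX preallocates most of the GPU, and these environment variables must be set before jax is imported. The 0.5 fraction is an arbitrary placeholder:)

    import os

    # disable JAX's default preallocation of ~90% of GPU memory
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
    # or cap the fraction of GPU memory JAX may claim
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"

    import jax  # must come after the environment variables are set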
    Diederik Greveling
    @DPGrev
    @martindurant Sure!
    agoose77
    @agoose77:matrix.org [m]
    Can I confirm: Dask doesn't rebalance jobs that have already been submitted if the cluster subsequently grows, right?
    Martin Durant
    @martindurant
    @agoose77:matrix.org : although you can explicitly call rebalance, you should expect new workers to sometimes steal work from other workers. Whether they do steal depends on estimated task duration, data size, network speed, and some config thresholds.
    Tasks that are already running on a worker shouldn't move.
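    (A minimal sketch of the explicit call mentioned above; rebalance() moves already-stored results between workers, while stealing of queued tasks happens automatically. The scheduler address is a placeholder:)

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")
    futures = client.map(lambda x: x ** 2, range(1000))

    # after scaling the cluster up, spread the already-computed results
    # evenly across the old and new workers
    client.rebalance(futures)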
    Schrute Data Farm
    @MarkovianPirate_twitter
    Hi, I have several large dataframes (5 GB each) in memory.
    For each dataframe, I would like to group by a key -> this results in ~200 smaller dataframes -> then train a regression model on each smaller dataframe.
    Given the overhead of getting the data into a process, how does dask deal with this?
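    (A rough sketch of the usual pattern, as an assumption about the setup: with dask.dataframe, groupby().apply() shuffles each key's rows onto a single partition, so each model is trained on one in-memory pandas group. The source path, column names, and sklearn model are placeholders:)

    import dask.dataframe as dd
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    ddf = dd.read_parquet("large_frame/")   # placeholder source, ~5 GB

    def train_group(pdf: pd.DataFrame):
        # pdf is one key's worth of rows, already materialized as pandas
        return LinearRegression().fit(pdf[["x"]], pdf["y"])

    # one trained model per key, computed in parallel across workers
    models = ddf.groupby("key").apply(train_group, meta=("model", "object")).compute()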
    Scott Sievert
    @stsievert
    @MarkovianPirate_twitter is dask/dask-ml#765 relevant?
    @MarkovianPirate_twitter Dask spills to disk if it runs out of memory, if I understand correctly
    Inkee1
    @Inkee1

    Hi, I'm researching AI and planning to build an ensemble by running multiple models at once. Is there any example for this case?
    Can dask.delayed be applied to class objects?
    I tried:

    class BasicModel(nn.Module):
        def __init__(self):
            super(BasicModel, self).__init__()
            self.linears1 = dask.delayed(nn.Linear)(input_dim, num_hidden)
            self.linears2 = dask.delayed(nn.Linear)(num_hidden, num_hidden)
            self.linears3 = dask.delayed(nn.Linear)(num_hidden, num_hidden)
            self.linears4 = dask.delayed(nn.Linear)(num_hidden, output_dim)

        def forward(self, x):
            x = dask.delayed(F.elu)(dask.delayed(self.linears1)(x))
            x = dask.delayed(F.elu)(dask.delayed(self.linears2)(x))
            x = dask.delayed(F.elu)(dask.delayed(self.linears3)(x))
            x = dask.delayed(self.linears4)(x)
            return x

    model = dask.delayed(BasicModel)().double().to(device)
    model1 = dask.delayed(BasicModel)().double().to(device)
    model2 = dask.delayed(BasicModel)().double().to(device)
    optimizer = optim.Adam(list(model.parameters()) + list(model1.parameters()) + list(model2.parameters()), lr=1e-3)

    then I get

    raise TypeError("Delayed objects of unspecified length are not iterable")
    TypeError: Delayed objects of unspecified length are not iterable

    at the optimizer line.
    Inkee1
    @Inkee1

    Or I'm planning:

    class BasicModel(nn.Module):
        def __init__(self):
            super(BasicModel, self).__init__()
            self.set = []
            for i in range(4):
                self.linears1[i] = dask.delayed(nn.Linear)(input_dim, num_hidden)
                self.linears2[i] = dask.delayed(nn.Linear)(num_hidden, num_hidden)
                self.linears3[i] = dask.delayed(nn.Linear)(num_hidden, num_hidden)
                self.linears4[i] = dask.delayed(nn.Linear)(num_hidden, output_dim)

        def forward(self, x):
            y = {}
            for i in range(4):
                temp = dask.delayed(F.elu)(dask.delayed(self.linears1)(x))
                temp = dask.delayed(F.elu)(dask.delayed(self.linears2)(temp))
                temp = dask.delayed(F.elu)(dask.delayed(self.linears3)(temp))
                y[i] = dask.delayed(self.linears4)(temp)
            x = torch.cat((y[0], y[1], y[2], y[3]), dim=1)
            return x

    model = dask.delayed(BasicModel)()
    loss = dask.delayed(loss_fn(pred, Y_batch))
    loss.backward()
    optimizer.step()

    Can this work?
    Scott Sievert
    @stsievert
    @Inkee1 why use Dask delayed inside the model? Typically forward takes ~100ms, much less for a few linear layers (especially if on a GPU). Calling Dask delayed multiple times seems unnecessary and slow.
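    (A minimal sketch of the pattern hinted at here: parallelize at the level of whole models rather than individual layers. This assumes BasicModel is written as an ordinary nn.Module with no dask.delayed inside; the input batch and sizes are placeholders:)

    import dask
    import torch

    input_dim = 16
    x = torch.randn(32, input_dim)
    ensemble = [BasicModel() for _ in range(3)]   # ordinary, un-delayed modules

    @dask.delayed
    def forward_pass(model, batch):
        # one task = one complete forward pass through one model
        return model(batch)

    # the three forward passes run as three parallel tasks
    outputs = dask.compute(*[forward_pass(m, x) for m in ensemble])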
    Inkee1
    @Inkee1
    There were some mistakes in the code above (it should be temp = dask.delayed(F.elu)(dask.delayed(self.linears1[i])(x))). I want to get multiple outputs by passing one input through several networks with the same structure (one input fans out into three outputs through three parallel layers).
    Kyle Smith
    @smith-kyle
    hi
    Adil
    @adilrizvi
    Hi