@quasiben Asked me to re-post here from RAPIDS-GoAI -
We have a use case where we need to run dask in a multi-tenant environment. We're aware of dask-gateway, but unfortunately we need to serve 1000+ concurrent users and (i) it's not feasible to run a cluster for each user, (ii) zero startup time is important, and (iii) we'd lose the ability for dask to "play tetris" with our jobs to make efficient use of compute. We'd be happy to run a single dask instance; my only concern is that a single user could potentially submit a large job that swamps everyone else, unless there is some mechanism to limit the total number of workers per user at a given time.
Is this something that dask supports - or does anyone have any experience doing something like this?
@quasiben It would need to be <5 seconds, and ideally <1 second. (Unfortunately, otherwise, our users will get bored.)
We will typically have 1000s of concurrent users, each with anywhere from 10MB-10GB of data. (And potentially a tail of users with >10GB.) Sessions will typically be quite short, starting at seconds, typically 1-10 minutes but with a long tail again up to an hour. It's important to provide a fast user experience with the initial load, and also responding to the changes in data volume.
My guess is that the overheads involved with launching pods and autoscaling the dask cluster (we are on kubernetes) would make the one-cluster-per-user approach infeasible. For example, a user might only need 5 seconds of CPU and 100MB of memory, but with all the overheads involved in spinning up a cluster we end up with a pod with 1 CPU and 1GB of memory running for 1 minute. (Whereas on the single cluster, the scheduler is making efficient use of cluster resources.) Interested to hear your thoughts.
But on the other hand, with the single cluster there's always the possibility that some user has a query that needs a huge number of tasks and swamps the cluster resources. I'm not sure how to catch this without inspecting the generated task graph @mrocklin?
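One possible (hedged) way to stop a single user swamping a shared cluster is `distributed.Semaphore`, which caps how many of a user's tasks can hold a lease cluster-wide at once. The sketch below is a minimal illustration, not an official pattern; the semaphore name `"user:alice"` and the task body are made up for the example.

```python
# Sketch: per-user concurrency cap on a shared dask cluster using a
# distributed.Semaphore. Assumes a semaphore is created per user at
# submission time; the name "user:alice" is hypothetical.
import time
from dask.distributed import Client, LocalCluster, Semaphore

def limited_task(x, sem):
    # Each task acquires a lease before doing work, so at most
    # `max_leases` of this user's tasks run cluster-wide at a time,
    # regardless of how many tasks they submit.
    with sem:
        time.sleep(0.1)  # stand-in for real work
        return x * 2

cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=False)
client = Client(cluster)

# At most 2 of alice's tasks may hold a lease at any moment.
sem = Semaphore(max_leases=2, name="user:alice")
futures = client.map(limited_task, range(8), sem=sem)
results = client.gather(futures)

client.close()
cluster.close()
```

Note this throttles concurrency per user but doesn't limit the *size* of a submitted graph; a true per-user worker quota would need something at the scheduler level.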
dask.persist() in particular): https://eliot.readthedocs.io/en/stable/scientific-computing.html#dask
Hey guys, I'm trying to understand some dask logic and thought this would be the best place to ask.
Could you guys explain to me why a blockwise operator has to tokenize the function first before creating the compute graph? I'm asking because I'm trying to apply a blockwise operation with a method that is part of a large object, which is difficult to pickle (and tokenize tries to pickle functions).
I can work around this by overriding __getstate__ on the object to raise an exception, but I'm really not sure whether that will cause other issues down the line.
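Rather than making __getstate__ raise, one hedged alternative is dask's documented tokenization hook: give the object a `__dask_tokenize__` method that returns something cheap and deterministic, so `tokenize` hashes that instead of pickling the whole instance. A minimal sketch, where `BigModel` and its `version` field are made-up stand-ins:

```python
# Sketch: avoid tokenize() pickling a hard-to-serialize object by
# implementing dask's __dask_tokenize__ hook. BigModel is hypothetical.
import dask.base

class BigModel:
    """Stand-in for a large object that is expensive to pickle."""

    def __init__(self, version):
        self.version = version
        # imagine lots of unpicklable state here

    def __dask_tokenize__(self):
        # Return a cheap, deterministic value identifying this object;
        # dask hashes this instead of pickling the instance.
        return ("BigModel", self.version)

t1 = dask.base.tokenize(BigModel(version=3))
t2 = dask.base.tokenize(BigModel(version=3))
# Deterministic: the same logical identity yields the same token.
assert t1 == t2
```

This covers tokenizing the object itself; for a bound method you may still want to pass the object as an argument to a module-level function rather than shipping the method directly.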
OSError: Timed out trying to connect to 'tcp://
or change the data
No, dask objects are immutable, you’ll want to operate on it to produce new dataframes.
Hi all, I have a stackoverflow question about long-running tasks in a dask-kubernetes cluster. Wondering if someone has experience with this and can offer some advice. https://stackoverflow.com/questions/59970973/what-are-recommended-dask-kubernetes-configuration-overrides-for-long-running-ta