The last time I looked, pandas loaded everything. It would be reasonable to implement that iteratively, and fastparquet does have a specific method to do that.
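A minimal sketch of that iterative path using fastparquet's iter_row_groups, assuming a hypothetical local file data.parquet with hypothetical column names:

```python
from fastparquet import ParquetFile

pf = ParquetFile("data.parquet")  # hypothetical file name

# Read one row-group at a time instead of loading the whole file;
# each iteration yields an ordinary pandas DataFrame.
for df in pf.iter_row_groups(columns=["x", "y"]):  # hypothetical columns
    print(len(df))
```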
Is there a way to do that query without knowing that row-group 1 is where you want to look?
Parquet optionally stores per-column max and min values for each row-group, so maybe those statistics could be used to find the right row-group without scanning them all.
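If those statistics are present, one sketch of using them is Dask's read_parquet with its filters argument (the file name and predicate here are hypothetical):

```python
import dask.dataframe as dd

# Row-groups whose min/max statistics rule out the predicate are
# skipped entirely, so only matching row-groups are read from disk.
df = dd.read_parquet("data.parquet",             # hypothetical file
                     filters=[("x", ">", 100)])  # hypothetical predicate
print(df.compute())
```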
Setting split_every equal to or lower than the length of chunksize ignores the uneven chunking of the array.
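For context, split_every caps how many chunk results are combined at each level of a dask.array tree reduction; a minimal sketch with an unevenly chunked array (the sizes are illustrative):

```python
import dask.array as da

# A 1-D array with deliberately uneven chunks.
x = da.ones(1000, chunks=(400, 300, 200, 100))

# split_every=2 combines at most two intermediate results per step
# of the reduction tree, regardless of the uneven chunk sizes.
print(x.sum(split_every=2).compute())
```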
Hello. I have an environment where I'm upgrading:
dask==1.1.0 to dask==1.2.0
distributed==1.25.3 to distributed==1.27.0
dask-kubernetes==0.8.0 to dask-kubernetes==0.9.2
And I notice that my worker pods start, but then fail to connect to the scheduler. Further inspection shows that the worker pods have DASK_SCHEDULER_ADDRESS set to a localhost address (i.e. tcp://127.0.0.1:44131), while the working version points to the correct remote IP (the scheduler and worker pods are on separate clusters).
I've spent some time digging through the codebase to find out why this is not being set correctly.
Where should I begin looking?
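One place to start is inside a failing worker pod itself, comparing what the pod spec injected with what dask's configuration resolved. A small diagnostic sketch; the "scheduler-address" config key is my assumption for how the environment variable is mapped:

```python
import os
import dask

# What the pod actually received from its spec:
print(os.environ.get("DASK_SCHEDULER_ADDRESS"))

# What dask's config layer resolved it to (assumed config key):
print(dask.config.get("scheduler-address", None))
```

If the env variable itself is already tcp://127.0.0.1:..., the problem is in how the worker pod template is built rather than in config resolution.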
When I call client.submit on a function that depends on a module I wrote, the worker nodes fail with ModuleNotFoundError. I'm wondering if dask.distributed requires all worker nodes to have the dependencies installed beforehand, or am I doing something wrong?
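If the missing dependency is a single-file module you wrote, Client.upload_file can ship it to the workers; a minimal sketch with hypothetical names (scheduler address, module, function):

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Ship the local single-file module to every connected worker so the
# import succeeds there when the task runs.
client.upload_file("mymodule.py")  # hypothetical module

import mymodule  # hypothetical module providing train()

future = client.submit(mymodule.train, 42)
print(future.result())
```

Third-party packages, by contrast, do need to be installed in each worker's environment beforehand.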
I'm looking for a simple resource manager. I have a small set of computers, each with different memory sizes and GPU counts. I'm running ML training algorithms on these machines as standalone Python scripts. Each script runs on a single node, and no distributed computation is needed. I know beforehand how much memory and how many GPUs a particular script requires.
All I want is a resource manager that automatically executes these Python scripts on a node that has enough free memory/GPU capacity at the time of execution.
I've seen the Dask distributed resource manager, but I can't figure out whether I can use it purely as a resource manager for executing Python scripts, and nothing more. Can you give me some guidance here?
I have quite a bit of experience with Apache Spark, but as of Spark 2, GPU-aware scheduling doesn't seem to be possible. I've checked Mesos + Chronos, but things get pretty complicated there compared to the simplicity of this use case, and K8s is a monster. So I'm checking whether there is a straightforward way to get this done in Dask.
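Dask's worker resources feature seems to fit this: workers advertise abstract quantities such as GPU and MEMORY, and tasks declare what they need. A minimal sketch, with the scheduler address, script path, and resource numbers as hypothetical placeholders:

```python
import subprocess
from dask.distributed import Client

def run_script(path):
    # Run a standalone training script as a subprocess on whichever
    # worker the scheduler picks, and return its exit code.
    return subprocess.run(["python", path], check=True).returncode

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Each worker is started with declared capacities, e.g.:
#   dask-worker tcp://scheduler:8786 --resources "GPU=2,MEMORY=64e9"
# A task asking for resources only runs where enough is currently free.
future = client.submit(run_script, "train_model.py",  # hypothetical script
                       resources={"GPU": 1, "MEMORY": 32e9})
print(future.result())
```

Note that these resources are opaque labels to Dask; the scheduler only does the bookkeeping, so the numbers you start each worker with have to reflect the actual machine.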