The last time I looked, pandas loaded everything. It would be reasonable to implement that iteratively, and fastparquet does have a specific method for it.
Is there a way to do that query without knowing in advance that row-group 1 is where you want to look?
Parquet optionally stores each column's min and max values for every row-group, so maybe.
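Those per-row-group statistics are what make skipping possible. A minimal stdlib-only sketch of the idea (this is not fastparquet's actual implementation; the data and the `find` helper are made up for illustration):

```python
# Sketch: predicate pushdown using per-row-group min/max statistics.
# Each "row group" carries the min and max of a column; a lookup can
# skip any group whose [min, max] range cannot contain the value.

row_groups = [  # made-up data standing in for Parquet row groups
    {"min": 0,  "max": 9,  "rows": list(range(0, 10))},
    {"min": 10, "max": 19, "rows": list(range(10, 20))},
    {"min": 20, "max": 29, "rows": list(range(20, 30))},
]

def find(value, groups):
    """Scan only the row groups whose statistics admit `value`."""
    hits = []
    for g in groups:
        if g["min"] <= value <= g["max"]:  # statistics check: skip otherwise
            hits.extend(r for r in g["rows"] if r == value)
    return hits

print(find(14, row_groups))  # → [14]; only the second group is scanned
```

A query engine does the same check against the footer metadata before touching any data pages, which is why the statistics help even when you don't know which row-group holds the value.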
`split_every` equal to or lower than the length of `chunksize` ignores the uneven chunking of the array
Hello. I have an environment where I'm upgrading:
- dask==1.1.0 to dask==1.2.0
- distributed==1.25.3 to distributed==1.27.0
- dask-kubernetes==0.8.0 to dask-kubernetes==0.9.2
And I notice that my worker pods start, but then fail to connect to the scheduler. Further inspection shows that the worker pods have `DASK_SCHEDULER_ADDRESS` set to a localhost address (i.e. `tcp://127.0.0.1:44131`), while the working version points to the correct remote IP (scheduler and worker pods are on separate clusters).
I've spent some time digging through the codebase to find out why this is not being set correctly.
Where should I begin looking?
When I call `client.submit` on a function that depends on a module I wrote, the worker nodes fail with `ModuleNotFoundError`. I'm wondering whether dask.distributed requires all worker nodes to have dependencies installed beforehand, or am I doing something wrong?