I have a `functools.partial` callable that I am trying to map across 100k arguments using dask-jobqueue. It turns out that my script is using many tens of GB of memory and going into swap, because when I call `distributed.Client.map(func, ...)` it pickles my function 100k times, and the pickled function itself is about 100 kB.

I am guessing that the reason memory consumption was never an issue with `multiprocessing.Pool.map()` is that multiprocessing does not pickle all of the iterations up front; instead, it consumes the iterables only about as quickly as the process pool is able to work through them.
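A common workaround is to keep the large state out of the pickled function entirely: scatter it to the cluster once and pass the resulting future into `client.map`. A minimal sketch of that pattern is below; `process`, `big_config`, and `args` are hypothetical stand-ins for the real callable and data:

```python
from dask.distributed import Client

client = Client()  # or the client of a dask-jobqueue cluster

big_config = {"weights": [0.0] * 1_000_000}  # hypothetical large bound state
args = list(range(100_000))                  # hypothetical map arguments

def process(x, config):
    # hypothetical worker function; previously built as
    # functools.partial(process, config=big_config)
    return x + len(config["weights"])

# Send the large object to the workers once, instead of embedding it
# in every pickled task via functools.partial.
[config_future] = client.scatter([big_config], broadcast=True)

# Futures passed as arguments are resolved on the workers, so each of
# the 100k tasks pickles only `process` plus a tiny future reference.
futures = client.map(process, args, config=config_future)
results = client.gather(futures)
```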
`client.map` does some of this for you: dask/distributed@169a8d9
Does Dask support snappy compression for Parquet files? I wrote a bunch of Parquet files using `pandas.DataFrame.to_parquet` (which uses `snappy` by default) and read them back into Dask using `dask.dataframe.read_parquet` successfully (I had to install the `snappy` and `python-snappy` conda-forge packages on the cluster first), but trying to run a `.do().a().thing().compute()` on the data throws an error.

Here's an SO question with more details: https://stackoverflow.com/questions/63157674/operations-on-a-dask-dataframe-fail-when-using-snappy-compression
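For context, a minimal sketch of the round trip described above (the file name and columns are illustrative, and this assumes the pyarrow engine):

```python
import pandas as pd
import dask.dataframe as dd

# Write with pandas; compression="snappy" is the pandas default,
# spelled out here for clarity.
pdf = pd.DataFrame({"x": range(1000), "y": ["a", "b"] * 500})
pdf.to_parquet("example.parquet", compression="snappy")

# Reading the metadata back works fine in the setup described above...
ddf = dd.read_parquet("example.parquet")

# ...but the failure only surfaces once a computation actually has to
# decompress the data.
print(ddf.x.sum().compute())
```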
It turns out I needed `snappy` and `python-snappy` installed in the client environment as well. I'm accessing the cluster from a local Jupyter notebook on my machine via SSH port forwarding, and did not have these packages installed locally. Installing them locally (`conda install -c conda-forge snappy python-snappy`) resolved the issue. I guess `snappy` is getting used over the wire as well. :) I'll update the SO question with the resolution.
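For anyone hitting something similar: one way to surface this kind of client/worker mismatch up front (a sketch; the scheduler address is illustrative for the SSH-forwarded setup) is distributed's built-in version check:

```python
from dask.distributed import Client

# Connect through the forwarded port (address is illustrative).
client = Client("tcp://localhost:8786")

# Compares package versions across client, scheduler, and workers;
# check=True raises on mismatches in the core packages.
versions = client.get_versions(check=True)

# Newer distributed releases also take an extra list of packages to
# report on, e.g. the snappy bindings relevant here (assumption: your
# distributed version supports the `packages` argument).
versions = client.get_versions(packages=["snappy"])
```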
Traceback (excerpt):

```
~/.local/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector)
    159         infos = []
    160         selected_files = self.fs.find(
--> 161             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
    162         )
    163         for path, info in selected_files.items():

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in walk(self, path, maxdepth, **kwargs)
    324
    325         try:
--> 326             listing = self.ls(path, detail=True, **kwargs)
    327         except (FileNotFoundError, IOError):
    328             return [], [], []
```