This channel is rarely used. For other ways to contact the Dask community, please see https://docs.dask.org/en/stable/support.html
I am confused that you are talking about arrays and then dataframes; pangeo-forge is mostly about arrays.
I thought Dask was able to store big dataframes and make parallel operations on them faster
That’s certainly true (with caveats), but you must present your data in a form Dask can understand. The docs give various ways to produce arrays and dataframes from, for example, delayed functions.
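For example, something like this is one of those routes (just a sketch - load_chunk and the file names are made up here):
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_chunk(path):
    # any function that returns a pandas DataFrame will do
    return pd.read_csv(path)

parts = [load_chunk(p) for p in ["part-0.csv", "part-1.csv"]]
df = dd.from_delayed(parts)  # lazy Dask DataFrame assembled from the delayed pieces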
I use map_partitions with a custom function that merges some timeseries data, and I call client.persist() on the result of the map_partitions step. The last step is then to run some calculations which should easily fit in memory, sums etc. The problem I keep running into is that I hit OOM on the first node. I have read the managing memory page in the distributed docs, but was hoping someone could point me to some other ideas?
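For context, this is roughly the shape of the workflow (a sketch only - merge_timeseries and the parquet path are stand-ins, not my actual code):
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # or connect to the existing cluster

def merge_timeseries(part):
    # custom per-partition merge logic; must return a pandas DataFrame
    return part.sort_values("timestamp")

ddf = dd.read_parquet("timeseries/*.parquet")
merged = ddf.map_partitions(merge_timeseries)
merged = client.persist(merged)   # keep the merged partitions in cluster memory
totals = merged.sum().compute()   # small reductions such as sums run at the end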
Hello everyone; I'm trying to spawn a new dask cluster for Prefect to use, but can't make it work:
from dask_kubernetes import KubeCluster

cluster = KubeCluster(
    pod_template="worker-spec.yaml",
    name="dask-cluster",
    namespace="flows",
    deploy_mode="remote",
)
It's just timing out with
Creating scheduler pod on cluster. This may take some time.
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x119a78e20>
port= in the constructor or --worker-port on the command line
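Roughly like this, if that helps (a sketch - the scheduler address is a placeholder):
from dask.distributed import Worker

async def main():
    # port= in the constructor pins the worker's listening port
    worker = await Worker("tcp://scheduler-host:8786", port=3000)
    await worker.finished()

# run with: asyncio.run(main())
# command-line equivalent:
#   dask-worker tcp://scheduler-host:8786 --worker-port 3000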
Hello, I was trying to set up a Dask + SLURM cluster and run some benchmarks. However, there are some unknowns that are keeping me from understanding my findings. I am attaching the report of my run - https://rawcdn.githack.com/ikabadzhov/DaskSlurmBenchmarks/24fb57d60c0cbf992d90278a0843af27d8933e3d/SET04/D03_1149_N8_f512_p1024_r3/dask-report_0.html. Here args.Nodes=8, args.cores=32, i.e. running on 32*8 cores in total over the cluster.
It is unclear to me where the number of workers in the report comes from. It is now 160, and I am not setting anything to 160; indeed, everything I set manually is a power of 2. If it is a default value, how can I change it?
How can I guarantee that all workers start at the same time? I tried client.wait_for_workers(args.Nodes), but I have only 8 nodes and 160 workers, which is certainly far from what I want. I tried with async -> await, and so far it works as expected on my toy example, but is that guaranteed?
When going through the System tab of the report, I see very low CPU utilization. I logged into several nodes where computations were running, and htop showed me that all CPUs were fully used. Is the CPU report reliable?
Thanks in advance for your time.
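Not a full answer, but on the worker-count question: in dask-jobqueue the total number of workers is jobs x processes-per-job (processes defaults to a heuristic split of cores), which may be where the 160 comes from. A sketch with placeholder numbers, not taken from the report above:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=32,          # cores per SLURM job
    processes=4,       # worker processes per job
    memory="64GB",
    walltime="01:00:00",
)
cluster.scale(jobs=8)           # one job per node in this sketch

client = Client(cluster)
client.wait_for_workers(8 * 4)  # counts worker processes, not nodes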
for col1 in columns_1:
    for col2 in columns_2:
        df.loc[df['any_column_in_df'] == col2, col1] = 0
What I want: an alternative way to do this in Dask (the code above works in pandas).
Problem: I can't use assignment (=) with df.loc in Dask, because inplace operations are not supported?
Explanation: I want to assign 0/a value where the condition is met and get back a dataframe (not a series!).
I tried using mask, and map_partitions with df.replace (which works fine for this simple one-column value manipulation and returns a dataframe as required)...
import numpy as np
import pandas as pd

def replace(x: pd.DataFrame) -> pd.DataFrame:
    # replace NaN with 0 in the given column, one partition at a time
    return x.replace(
        {'any_column_to_replace_value': [np.nan]},
        {'any_column_to_replace_value': [0]}
    )

df = df.map_partitions(replace)
How can I do the same for the first code snippet and still get a dataframe back?
Thanks in advance. Please help me out, Dask experts - I'm new to Dask and still exploring it.
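One thing that might work, following the same map_partitions pattern as the replace example above (a sketch - columns_1, columns_2 and 'any_column_in_df' are the same placeholders as in the pandas snippet):
import pandas as pd

def zero_out(part: pd.DataFrame) -> pd.DataFrame:
    # apply the nested-loop assignment to each pandas partition
    part = part.copy()
    for col1 in columns_1:
        for col2 in columns_2:
            part.loc[part['any_column_in_df'] == col2, col1] = 0
    return part

df = df.map_partitions(zero_out)  # still a lazy Dask DataFrame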
WARNING:distributed.utils_perf:full garbage collections took 49% CPU time recently (threshold: 10%)
WARNING:distributed.worker_memory:Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.34 GiB -- Worker memory limit: 7.50 GiB
WARNING:distributed.utils_perf:full garbage collections took 48% CPU time recently (threshold: 10%)
WARNING:distributed.utils_perf:full garbage collections took 48% CPU time recently (threshold: 10%)
WARNING:distributed.worker_memory:Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.34 GiB -- Worker memory limit: 7.50 GiB