Hey there; we recently discovered an instance where df[['col_a', 'col_b']].groupby('col_a').col_b.nunique().compute() on a Dask dataframe did not produce a result equal to the equivalent pandas statement. It returned more results (possibly because duplicates across partitions are not being handled?).
I believe that statement should produce results equal to the pandas version, since nunique is clearly implemented; see https://github.com/dask/dask/blob/master/dask/dataframe/groupby.py (line 490), and the unit test demonstrates correct behaviour.
My intention is to post this as a reproducible issue on the GitHub repo, but I first wanted to double-check whether I had missed anything?
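For context, here is a minimal sketch of the reproducer I have in mind (the frame and values are made up; just a two-partition dataframe with duplicate (col_a, col_b) pairs spread across partitions):

import pandas as pd
import dask.dataframe as dd

# made-up frame with duplicate (col_a, col_b) pairs across partitions
pdf = pd.DataFrame({'col_a': [1, 1, 2, 2, 1, 2],
                    'col_b': ['x', 'y', 'x', 'x', 'x', 'y']})
ddf = dd.from_pandas(pdf, npartitions=2)

expected = pdf[['col_a', 'col_b']].groupby('col_a').col_b.nunique()
actual = ddf[['col_a', 'col_b']].groupby('col_a').col_b.nunique().compute()

# I would expect these to match once sorted by index
pd.testing.assert_series_equal(expected.sort_index(), actual.sort_index())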
.compute() and then dask (not sure if the scheduler would like this) would just hang/wait for the chunk to be ready
I am writing this message on behalf of the Apache Airflow community, and it's about Apache Airflow's integration with Dask.
We are working on Apache Airflow 2.0 (still at least 3-4 months until release), and we are doing various cleanup tasks. One of the things we noticed is that we have a DaskExecutor https://github.com/apache/airflow/blob/master/airflow/executors/dask_executor.py . It uses Dask to run the tasks. The executor sees very little use - so little that all the DaskExecutor tests have been failing for months without anyone noticing. We understand that Dask is important for a number of people, but in order to keep the executor in Airflow we need someone who actually uses Dask (and the DaskExecutor in Airflow) to keep it in a healthy state. We are asking on our devlist whether someone is willing to maintain it, but we also think it might be a good idea to ask here. We are happy, as committers, to review and accept all the related code, but we would love someone more familiar with Dask, and actually using it, to maintain it :)
Note that the executor is rather simple and the tests https://github.com/apache/airflow/blob/master/tests/executors/test_dask_executor.py are just a few - it's literally 200 lines of code in total. We could fix it ourselves, but maybe having someone who actually uses Dask be just a bit more active in our community is a good idea :)
from dask.distributed import Client, as_completed
import pandas as pd

def paralell_prophet(group):
    # group is a (key, dataframe) tuple from iterating the groupby;
    # placeholder for the real Prophet fit - just return the group's dataframe for now
    key, df_g = group
    return df_g

# df is the source dataframe loaded earlier in the session
g = df.groupby(['pdv_id', 'producto_id'])
client = Client('scheduler:8786')

# submit one task per (key, group) pair
futures = []
for i in g:
    x = client.submit(paralell_prophet, i)
    futures.append(x)

# collect results as they finish
result_data = pd.DataFrame()
for future, result in as_completed(futures, with_results=True):
    result_data = result_data.append(result)

del g
del df
del futures
Hi, starting out with Dask on Kubernetes in Azure.
In my scenario I want to run an Azure Function App which gets triggered by a storage queue. Do I need to deploy Dask inside the function on the same cluster, or should I set up a separate Dask Kubernetes cluster? Does anyone have experience with this? The standard Azure docs don't go into much depth.
It's a longer-running process which, as of now, runs fine in Azure Functions (Python) on a single machine.
For larger workloads I am looking to distribute numpy/numba calculations.
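Roughly what I have in mind for the distributed part is something like this (just a sketch; the scheduler address and the numba kernel are placeholders):

import numpy as np
from numba import njit
from dask.distributed import Client

@njit
def heavy_kernel(a):
    # placeholder for the real numba calculation
    total = 0.0
    for x in a:
        total += x * x
    return total

# placeholder address of the Dask scheduler running on the Kubernetes cluster
client = Client('tcp://dask-scheduler:8786')

# split the work into chunks and fan it out to the workers
chunks = [np.random.rand(1_000_000) for _ in range(16)]
futures = client.map(heavy_kernel, chunks)
results = client.gather(futures)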
Azure Functions appears to work well with Kubernetes and KEDA (scaling up based on the number of events).
My main concern is how to handle the dynamic scaling of the Dask cluster. I couldn't find any information so far on how to link it to KEDA.
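For reference, what I mean by dynamic scaling is roughly what dask-kubernetes does in its adaptive mode; a sketch below, assuming the KubeCluster API and a worker-spec.yml pod template (both placeholder names). What I can't figure out is how to tie that, or something like it, to KEDA:

from dask_kubernetes import KubeCluster
from dask.distributed import Client

# worker-spec.yml is a placeholder pod template for the Dask workers
cluster = KubeCluster.from_yaml('worker-spec.yml')

# let the scheduler add/remove worker pods based on the current workload
cluster.adapt(minimum=1, maximum=20)

client = Client(cluster)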