Please use Stack Overflow with the #dask tag for usage questions and GitHub issues for bug reports
@mrocklin Do you think a maximum limit on the number of workers a submitted job can use would be useful? I see that we already have a param for providing a set of workers.
On a broader note, something like worker pools could be rather useful.
Can these two behaviours be achieved currently?
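A minimal sketch of what already exists for the second part: the workers= keyword to Client.submit pins a task to a subset of workers (the cluster and function here are just placeholders):

```python
from dask.distributed import Client

client = Client()  # local cluster, just for illustration

def inc(x):
    return x + 1

# pick a subset of the workers this client knows about
some_workers = list(client.scheduler_info()["workers"])[:1]

# restrict this task to that subset via the existing workers= parameter
future = client.submit(inc, 1, workers=some_workers)
print(future.result())
```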
map_blocks(func_that_returns_a_debug_string)
map_blocks assumes that I'm mapping an array onto another array, which doesn't match these use cases well
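One possible workaround sketch, purely an assumption on my part: iterate over the chunks with Array.to_delayed() and run the debug function per chunk, so the return value doesn't have to be an array (debug_block is a hypothetical helper):

```python
import dask.array as da
from dask import delayed
from dask.distributed import Client

client = Client()  # local cluster, just for illustration

def debug_block(block):
    # hypothetical helper that returns a string instead of an array
    return f"shape={block.shape}, mean={block.mean():.3f}"

arr = da.random.random((1000, 1000), chunks=(250, 250))

# one Delayed per chunk; apply the debug function to each and gather the strings
tasks = [delayed(debug_block)(blk) for blk in arr.to_delayed().ravel()]
reports = client.gather(client.compute(tasks))
print(reports[:2])
```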
I use pytest to test my application. Since it relies on dask, I need a local cluster as a fixture.
I tried to set one up myself and close the cluster during fixture cleanup. Even though it works, I get a warning more than 100 lines long because dask tries to communicate with the cluster after it has been closed.
I've also tried to use the client fixture from distributed.utils_test, but that gives me the following error
file /Users/ruiloureiro/.virtualenvs/tarsier/lib/python3.7/site-packages/distributed/utils_test.py, line 561
  @pytest.fixture
  def client(loop, cluster_fixture):
E       fixture 'loop' not found
>       available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, client, cov, doctest_namespace, monkeypatch, no_cover, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.
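A minimal hand-rolled fixture sketch, assuming pytest and distributed's LocalCluster; closing the Client before the cluster is my guess at what avoids the shutdown warnings, not a confirmed fix:

```python
import pytest
from distributed import Client, LocalCluster

@pytest.fixture(scope="session")
def dask_client():
    # small, quiet cluster for tests; sizes are arbitrary placeholders
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    yield client
    # close the client first so it stops talking to the soon-to-be-closed cluster
    client.close()
    cluster.close()

def test_simple(dask_client):
    assert dask_client.submit(lambda x: x + 1, 1).result() == 2
```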
Hey all, my application requires me to launch tasks from within other tasks, like the following:
from dask.distributed import get_client

def a():
    # ... some computation ...
    return ...

def b():
    # ... some computation ...
    return ...

def c():
    client = get_client()
    # bind the futures to new names so the functions a and b aren't shadowed
    fut_a = client.submit(a)
    fut_b = client.submit(b)
    [res_a, res_b] = client.gather([fut_a, fut_b])
    return res_a + res_b

client = get_client()
res = client.submit(c)
However, I would like to have access to the intermediate results a and b, but only c shows up in client.futures. Is there a way to tell dask to keep the results for a and b?
Thank you
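One possible pattern, just a sketch and not the only way: submit a and b from the outer client and pass their futures into c, so the outer client keeps references to the intermediate results as well as the final one (the tiny a, b, c below are stand-ins for the real functions):

```python
from dask.distributed import Client

def a():
    return 1   # stand-in for the real computation

def b():
    return 2   # stand-in for the real computation

def c(x, y):
    # futures passed to submit arrive here already resolved to their results
    return x + y

client = Client()  # or however the client was created

fut_a = client.submit(a)
fut_b = client.submit(b)
res = client.submit(c, fut_a, fut_b)   # fut_a and fut_b stay referenced on this client

print(fut_a.result(), fut_b.result(), res.result())
```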
task_states[task] = executor.submit(
    self.run_task,
    task=task,
    state=task_state,
    ... snipped ...
)
Hi guys,
I have an issue pulling a large table (it can't fit in memory) from an Azure database or another server; that table needs to be split into multiple CSVs.
So basically I have no transformation to do, other than dividing it into equal parts.
I think Dask is the right tool for this?
I tried many ways to make a simple connection to the SQL server, but I just can't get it to work:
import dask.dataframe as dd
import sqlalchemy as sa

uri = 'mssql+pyodbc://VM/Data?driver=SQL+Server+Native+Client+11.0'
engine = sa.create_engine(uri)
metadata = sa.MetaData()
posts = sa.Table('posts', metadata, schema='dbo', autoload=True, autoload_with=engine)
query = sa.select([posts])

# read_sql_table expects the connection string itself, not an Engine
sql_reader = dd.read_sql_table('posts', uri=uri, npartitions=16, index_col='userId')
Any help with this?
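For the "split into multiple CSVs" part, a sketch assuming read_sql_table works once it gets the URI string: to_csv with a * in the name writes one file per partition (paths and column names are placeholders):

```python
import dask.dataframe as dd

uri = 'mssql+pyodbc://VM/Data?driver=SQL+Server+Native+Client+11.0'

# one partition per chunk of the index range; index_col should be an indexed numeric/date column
ddf = dd.read_sql_table('posts', uri=uri, npartitions=16, index_col='userId')

# writes posts-000.csv ... posts-015.csv, one file per partition
ddf.to_csv('posts-*.csv', index=False)
```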
User \"system:serviceaccount:nublado-athornton:dask\" cannot get resource \"pods\" in API group \"\" in the namespace \"nublado-athornton\""
but I have what look like the right rules in my role:rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - create
  - delete
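For comparison, a sketch of what the error text itself seems to ask for: the same rule with get (and watch) added to the verbs. This is only inferred from the error message, not a verified fix:

```yaml
# hypothetical Role rule; the error says the service account "cannot get" pods,
# and the verbs list above has no "get"
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - create
  - delete
```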
I set n_jobs=1 in the RandomForestRegressor() constructor, but still end up with some of my dask workers using 2000% CPU, which looks really weird.
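A guess, not something confirmed here: the extra threads may come from BLAS/OpenMP inside numpy/scikit-learn rather than from dask or n_jobs, so one thing to try is capping those native thread pools before the libraries are imported on each worker:

```python
# hypothetical mitigation: cap native thread pools; must run before numpy/sklearn are imported
import os
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"
```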