    dayu
    @dayuoba
    I've containerized a demo Python job and deployed a pod on my k8s cluster, but it keeps retiring the other workers.
    I've also created RBAC for the pod.
    simaster123
    @simaster123
    Hello - I'm struggling to resolve the issue I posted here: dask/dask#5634. Any chance that there's anyone here open to a short consulting gig to help me debug it?
    Ray Bell
    @raybellwaves
    AwesomeCap
    @AwesomeCap
    Hi, how does dask work with Amdahl's Law?
    [attached image: iu.png]
    AwesomeCap
    @AwesomeCap
    What is the parallel portion?
    JoranDox
    @JoranDox
    @AwesomeCap I think that depends on your code
    if you write code that doesn't need to shuffle/sync across dask nodes and is "embarrassingly parallelisable", you'll go into the realm of 95% maybe
    if you write code that sequentially goes over your data row per row you'll be at 0%
    that said, the scheduler can be a bottleneck for really big task graphs
    but I'm not sure if that's always the case, we haven't scaled to the size where it made sense to look into that
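    To make the Amdahl's-law arithmetic behind those percentages concrete, here is a tiny sketch (the worker count is just an example):

    # Amdahl's law: speedup with n workers when a fraction p of the work can run in parallel.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.95, 16))  # ~9.1x for a mostly "embarrassingly parallel" workload
    print(amdahl_speedup(0.0, 16))   # 1.0x for purely sequential, row-by-row code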
    AwesomeCap
    @AwesomeCap
    do you think dask will scale better with a more efficient scheduler, maybe sometime in the future? or is that more of a "nice-to-have"? :)
    Martin Durant
    @martindurant
    The performance of the scheduler is always being optimised… There have been specific attempts to reimplement it in Cython or other languages, but be assured that the often quoted “1ms overhead per task” is pessimistic.
    codecnotsupported
    @codecnotsupported
    I tried to make an SSHCluster with a tunnel, but it seems Dask doesn't play nicely with asyncssh. https://bpaste.net/show/K7XOG : "got Future <Future pending> attached to a different loop".
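    (For context, a minimal SSHCluster sketch; the hostnames, port, and username below are placeholders, and connect_options is passed through to asyncssh:)

    from dask.distributed import Client, SSHCluster

    # The first host runs the scheduler, the remaining hosts run workers.
    cluster = SSHCluster(
        ["localhost", "worker1.example.com", "worker2.example.com"],
        connect_options={"port": 2222, "username": "me"},  # forwarded to asyncssh.connect
    )
    client = Client(cluster)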
    Arnab Biswas
    @arnabbiswas1
    As I understand from the documentation (https://docs.dask.org/en/latest/remote-data-services.html), Dask does not support Azure Blob or Azure Data Lake Gen 2 as a data source right now. Is there any timeline in mind? We are planning to store our data in Azure Data Lake Gen 2 and use Dask for feature engineering as well as training with XGBoost.
    Martin Durant
    @martindurant
    “adlfs” is now available on PyPI, but only on a personal channel for conda (https://anaconda.org/defusco/adlfs); conda-forge should be coming soon. The master version of fsspec knows about adlfs and will use it, if installed. So the short answer is: yes, dask can read and write to both Azure Data Lake and Blob.
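    (A minimal sketch of what that looks like once adlfs is installed; the account name, key, container, and paths below are placeholders:)

    import dask.dataframe as dd

    # fsspec dispatches "abfs://" URLs (Blob / Data Lake Gen2) to adlfs when it is installed.
    storage_options = {"account_name": "myaccount", "account_key": "..."}

    df = dd.read_csv("abfs://mycontainer/raw/*.csv", storage_options=storage_options)
    df.to_parquet("abfs://mycontainer/features/", storage_options=storage_options)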
    @TomAugspurger , what happened to the release, is it time to update the text in the docs yet?
    Tom Augspurger
    @TomAugspurger
    No idea. I haven’t done anything on adlfs in a few weeks.
    Martin Durant
    @martindurant
    Oh, it’s @AlbertDeFusco ’s PR
    Davis Bennett
    @d-v-b
    have any dask-jobqueue users gotten adaptive deployment working?
    [in this channel]
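    (For reference, the usual adaptive pattern with dask-jobqueue looks roughly like the sketch below; the SLURM resources are placeholders, not a tested configuration:)

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")  # placeholder resources
    cluster.adapt(minimum=0, maximum=20)  # scale the number of jobs with the workload
    client = Client(cluster)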
    Yuvi Panda
    @yuvipanda
    with dask gateway, does the gateway initiate connections to the client? Or is it one way?
    with some tunneling, can I have my client (notebook) be on my local machine and the gateway on a remote k8s cluster?
    (with some kubectl port-forwarding style stuff)
    jkmacc-LANL
    @jkmacc-LANL
    @martindurant Thanks for the SO answer! I’ll try it out shortly.
    Arnab Biswas
    @arnabbiswas1
    @martindurant Thank you for your reply. However, I have not been able to install it from the personal channel. I have posted an issue here: dask/adlfs#22
    Matt Nicolls
    @nicolls1
    I would like to delete erred futures and cannot see an easy way, more info here: https://stackoverflow.com/questions/59284765/how-to-remove-an-erred-future-from-dask-scheduler Thanks in advance if you have any thoughts!
    Jim Crist-Harif
    @jcrist
    The client initiates all connections. dask-gateway is designed for precisely the situation you describe - the pangeo client-in-the-same-cluster model works, but doesn't make use of the proxying we do. If both the web proxy and the scheduler proxy are visible outside the cluster, you can connect and work with your client externally.
    @yuvipanda ^^
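    (Roughly what that external-client setup looks like; the gateway and proxy addresses below are placeholders for whatever is exposed, e.g. via kubectl port-forward:)

    from dask_gateway import Gateway

    # Both the web proxy (HTTP) and the scheduler proxy (TLS) must be reachable
    # from the local machine, e.g. through port-forwarding.
    gateway = Gateway(
        address="http://localhost:8000",       # web proxy
        proxy_address="tls://localhost:8786",  # scheduler proxy
    )
    cluster = gateway.new_cluster()
    client = cluster.get_client()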
    Yuvi Panda
    @yuvipanda
    awesome ok
    quaeritis
    @quaeritis
    Hi, I use dask.distributed.progress(futures), but if I then do client.gather(futures) the progress bar stops. How can I view the progress bar and wait for the result?
    quaeritis
    @quaeritis
    I think that refers to: dask/distributed#21
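    (One common script-style pattern, sketched below: in a terminal, progress() blocks while drawing the bar, so the later gather() just collects the already-finished results:)

    from dask.distributed import Client, progress

    client = Client()
    futures = client.map(lambda x: x ** 2, range(100))

    progress(futures)                 # in a console this blocks, drawing a text progress bar
    results = client.gather(futures)  # futures are already finished, so this returns quickly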
    kollmats
    @kollmats
    Hi, could anyone help me clarify whether dask arrays are meant to behave differently depending on the way they're created? Please see this question https://stackoverflow.com/questions/59291051/iterating-over-seemingly-identical-dask-arrays-takes-different-time
    Davis Bennett
    @d-v-b
    @kollmats I'm not sure if this explains what you see, but da.from_array specifically has very surprising behavior w.r.t. chunking -- see dask/dask#5367 and dask/distributed#3032
    basically, if you make a dask array via da.from_array(arr), no matter what chunking you use, each chunk of the resulting dask array will have all of arr at the root of its task graph (as opposed to only having a chunk-sized piece of arr)
    i'm not familiar with the dataframe side of things, so I can't be sure that this behavior of da.from_array explains what you are seeing.
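    (A small sketch of how to see that behaviour by inspecting the low-level graph; the sizes and chunking are arbitrary:)

    import numpy as np
    import dask.array as da

    arr = np.arange(1_000_000)
    x = da.from_array(arr, chunks=100_000)

    # Print the low-level graph: each chunk's task refers back to the single key
    # holding the whole original `arr`, not to a pre-sliced chunk of it.
    for key, task in dict(x.__dask_graph__()).items():
        print(key, task)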
    kollmats
    @kollmats
    [attached image: image.png (task graphs of the two arrays)]

    @d-v-b , thanks for your reply. After looking at the graphs of my 'identical' arrays and doing some additional reading of the Dask documentation I realize that they are not equal at all.

    The array loaded via from_array has a single node (since it is only one chunk and it fits in memory). On the other hand, the graph of the array originating from the read_csv call is relatively complex. It starts off with two parallel read_blocks, goes via read_panda, from_delayed, values, and then finally the two paths merge into one at the rechunk-merge (see the picture above).

    Now, my new question is: are both these graphs computed anew for every call to compute()? Or does Dask store intermediate values somehow? E.g., at the values nodes?

    Davis Bennett
    @d-v-b
    without setting up caching (https://docs.dask.org/en/latest/caching.html) I believe repeated calls to compute will re-run the exact same computation
    if you are using dask-distributed you can look into storing intermediate values with persist -- see the "persist" section in the best practices guide here: https://docs.dask.org/en/latest/dataframe-best-practices.html
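    (Roughly what those two options look like; the cache size and file path are just illustrative:)

    import dask.dataframe as dd
    from dask.cache import Cache

    # Option 1: opportunistic caching of intermediate results (uses the cachey package).
    cache = Cache(2e9)   # keep up to ~2 GB of intermediates
    cache.register()

    # Option 2: with dask.distributed, hold results in cluster memory explicitly.
    df = dd.read_csv("data-*.csv")   # placeholder path
    df = df.persist()                # computes once and keeps the partitions in memory
    print(df.sum().compute())        # later operations reuse the persisted partitions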
    kollmats
    @kollmats
    @d-v-b so in my case, that would mean repeated calls to pandas.read_csv? Then it all begins to make sense. I'll look into caching and persist. Thank you!
    Davis Bennett
    @d-v-b
    :thumbsup:
    Interview
    @interviewer_gitlab
    Why not install graphviz with dask as a dependency?
    Martin Durant
    @martindurant
    ^ because dask works well without graphviz, and we don’t want to force people to install unnecessary packages. Indeed, getting graphviz installed correctly has been problematic through the years.
    lszyba1
    @lszyba1
    hello,
    would anybody know how I can parse a csv file and check if the dask dataframe column types match...
    I'm having a problem where there might be a row of bad data, but I can't find any tools for this...
    would be nice to get...
    1,000,000 times int
    1 time string on row 565
    Scott Sievert
    @stsievert
    @lszyba1 why not write a custom function to check each row and then use “df.apply(row_check_func, axis=1)”?
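    (A hypothetical sketch of such a row_check_func, just to make the suggestion concrete; the column name and expected type are made up:)

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv", dtype=str)   # read as strings so a bad row doesn't fail the load

    def row_check_func(row):
        # Flag rows whose agency_code does not parse as an integer.
        try:
            int(row["agency_code"])
            return False
        except (TypeError, ValueError):
            return True

    bad_rows = df[df.apply(row_check_func, axis=1, meta=("bad", "bool"))]
    print(bad_rows.compute())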
    lszyba1
    @lszyba1

    I'm looking for something like
    if enforce and columns and (list(df.columns) != list(columns)):
    ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.

    +--------------+--------+----------+
    | Column       | Found  | Expected |
    +--------------+--------+----------+
    | agency_code  | object | int64    |
    +--------------+--------+----------+

    Since I don't have the definitions, and the dataset consists of 3 separate extracts... aka it could be int in 1/3rd of the file while float in 2/3rds...

    I guess I'm looking for a way to create a dtypes list by parsing the whole file...
    would be nice to get stats like: float 1 time and int 1000 times...
    @stsievert do you have an example of a row_check_func that would detect if it's object or int64?
    lszyba1
    @lszyba1
    I guess if there was a way to extract that part of the code about the column type mismatch and print the line number, and print 'agency_code':'object' but was expecting 'agency_code':'int64', then I could copy these results to OpenOffice and summarize... unless I'm missing something
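    (One hypothetical way to get that kind of per-column tally, with a placeholder path and column name:)

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv", dtype=str)   # load everything as strings first

    def classify(value):
        try:
            int(value)
            return "int"
        except (TypeError, ValueError):
            return "string"

    # e.g. "int 1000000, string 1" -- the lone string is the row to go hunting for
    counts = df["agency_code"].map(classify, meta=("agency_code", "object")).value_counts()
    print(counts.compute())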