    Ray Bell
    @raybellwaves
    Ok. Found using the scheduler external IP
    Ray Bell
    @raybellwaves
    Just hoping to pin it next to my notebook using the Dask extension
    slavarazbash
    @slavarazbash
    Hi! The Enterprise Data Science Architecture Conference focuses on how to properly productionise data science solutions at scale. Dask is a tool that I have personally used to get the job done. Most Dask users would be interested in seeing how large companies productionise machine learning solutions. 27th March 2020 is a great time to visit Melbourne, Australia for a unique and high-quality conference. I invite you to view our speakers list at https://edsaconf.io and reserve your place, because we have a unique mix of speakers.
    dayu
    @dayuoba
    hi guys, I'm new to dask. I want to know if there are any tutorials about deploying dask jobs on k8s in a native way? I tried with the official docs, but I cannot even run the demo successfully.
    I've containerized a demo python job and deployed a pod on my k8s cluster, but it keeps retiring the other workers.
    I've also created RBAC for the pod.
    simaster123
    @simaster123
    Hello - I'm struggling to resolve the issue I posted here: dask/dask#5634. Any chance that there's anyone here open to a short consulting gig to help me debug it?
    AwesomeCap
    @AwesomeCap
    Hi, how does dask work with Amdahl's Law?
    AwesomeCap
    @AwesomeCap
    What is the parallel portion?
    JoranDox
    @JoranDox
    @AwesomeCap I think that depends on your code
    if you write code that doesn't need to shuffle/sync across dask nodes and is "embarrassingly parallelisable", you'll be in the realm of 95% maybe
    if you write code that sequentially goes over your data row by row, you'll be at 0%
    that said, the scheduler can be a bottleneck for really big task graphs
    but I'm not sure if that's always the case, we haven't scaled to the size where it made sense to look into that
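The fractions above plug straight into Amdahl's Law; a minimal sketch (the 95% and 0% figures are the illustrative values from the discussion, not measurements):

```python
# Amdahl's Law: speedup is capped by the serial fraction of the work.
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speedup when a fraction p of the work is parallelisable."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

print(round(amdahl_speedup(0.95, 10), 2))   # ~6.9x on 10 workers
print(round(amdahl_speedup(0.95, 100), 2))  # ~16.8x even on 100 workers
print(round(amdahl_speedup(0.0, 100), 2))   # 1.0x: fully sequential code
```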
    AwesomeCap
    @AwesomeCap
    do you think dask will scale to a more effective scheduler sometime in the future? or is that more of a "nice-to-have"? :)
    Martin Durant
    @martindurant
    The performance of the scheduler is always being optimised… There have been specific attempts to reimplement it in Cython or other languages, but be assured that the often quoted “1ms overhead per task” is pessimistic.
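As a back-of-envelope illustration of why per-task overhead matters for large graphs (using the pessimistic 1 ms figure quoted above; these are not measured numbers):

```python
# Lower bound on the wall time contributed by scheduler overhead alone,
# ignoring the actual work done by each task.
def min_overhead_s(n_tasks, overhead_per_task_s=1e-3):
    """Total scheduler overhead for a graph of n_tasks tasks."""
    return n_tasks * overhead_per_task_s

print(min_overhead_s(10_000))      # ~10 s of overhead for a 10k-task graph
print(min_overhead_s(1_000_000))   # ~1000 s for a million-task graph
```

This is why the overhead is negligible for small graphs but becomes the bottleneck only at very large task counts.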
    codecnotsupported
    @codecnotsupported
    I tried to make an SSHCluster with a tunnel, but it would seem Dask doesn't play nicely with asyncssh. https://bpaste.net/show/K7XOG : "got Future <Future pending> attached to a different loop".
    Arnab Biswas
    @arnabbiswas1
    As I understand from the documentation (https://docs.dask.org/en/latest/remote-data-services.html), Dask does not support Azure Blob or Azure Data Lake Gen 2 as a data source right now. Is there any timeline in mind? We are planning to store our data in Azure Data Lake Gen 2 and use Dask for feature engineering as well as training with XGBoost.
    Martin Durant
    @martindurant
    “adlfs” is now available on PyPI, but only on a personal channel for conda ( https://anaconda.org/defusco/adlfs ); conda-forge should be coming soon. The master version of fsspec knows about adlfs and will use it, if installed. So the short answer is: yes, dask can read and write to both Azure Data Lake and Blob.
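As a rough sketch of how fsspec's protocol dispatch works: the URL scheme selects the backend, so an `abfs://` URL would route to adlfs once it is installed, just as `memory://` routes to the built-in in-memory filesystem. The example below uses `memory://` so it runs without Azure credentials; the path is made up.

```python
import fsspec

# fsspec picks a backend from the URL protocol: "abfs://" -> adlfs
# (if installed), "memory://" -> the built-in in-memory filesystem, etc.
with fsspec.open("memory://demo/hello.txt", "w") as f:
    f.write("hello")

with fsspec.open("memory://demo/hello.txt", "r") as f:
    print(f.read())  # hello
```

With adlfs installed, the same pattern with an `abfs://container/path` URL and account credentials passed via `storage_options` is how dask reaches Azure storage.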
    @TomAugspurger , what happened to the release, is it time to update the text in the docs yet?
    Tom Augspurger
    @TomAugspurger
    No idea. I haven’t done anything on adlfs in a few weeks.
    Martin Durant
    @martindurant
    Oh, it’s @AlbertDeFusco ’s PR
    Davis Bennett
    @d-v-b
    have any dask-jobqueue users gotten adaptive deployment working?
    Yuvi Panda
    @yuvipanda
    with dask gateway, does the gateway initiate connections to the client? Or is it one way?
    with some tunneling, can I have my client (notebook) be on my local machine and the gateway on a remote k8s cluster?
    (with some kubectl port-forwarding style stuff)
    jkmacc-LANL
    @jkmacc-LANL
    @martindurant Thanks for the SO answer! I’ll try it out shortly.
    Arnab Biswas
    @arnabbiswas1
    @martindurant Thank you for your reply. However, I have not been able to install it from the personal channel. I have posted an issue here: dask/adlfs#22
    Matt Nicolls
    @nicolls1
    I would like to delete erred futures and cannot see an easy way, more info here: https://stackoverflow.com/questions/59284765/how-to-remove-an-erred-future-from-dask-scheduler Thanks in advance if you have any thoughts!
    Jim Crist-Harif
    @jcrist
    The client initiates all connections. dask-gateway is designed for precisely the situation you describe - the pangeo client-in-the-same-cluster model works, but doesn't make use of the proxying we do. If both the web proxy and the scheduler proxy are visible outside the cluster, you can connect and work with your client externally.
    @yuvipanda ^^
    Yuvi Panda
    @yuvipanda
    awesome ok
    quaeritis
    @quaeritis
    Hi, I use dask.distributed.progress(futures); if I then do client.gather(futures), the progress bar is stopped. How can I view the progress bar and wait for the result?
    quaeritis
    @quaeritis
    I think that refers to: dask/distributed#21
    kollmats
    @kollmats
    Hi, could anyone help me clarify whether dask arrays are meant to behave differently depending on the way they're created? Please see this question https://stackoverflow.com/questions/59291051/iterating-over-seemingly-identical-dask-arrays-takes-different-time
    Davis Bennett
    @d-v-b
    @kollmats I'm not sure if this explains what you see, but da.from_array specifically has very surprising behavior w.r.t. chunking -- see dask/dask#5367 and dask/distributed#3032
    basically, if you make a dask array via da.from_array(arr), no matter what chunking you use, each chunk of the resulting dask array will have all of arr at the root of its task graph (as opposed to only having a chunk-sized piece of arr)
    i'm not familiar with the dataframe side of things, so I can't be sure that this behavior of da.from_array explains what you are seeing.
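A small sketch of the behaviour described above (the array and chunk sizes here are arbitrary):

```python
import numpy as np
import dask.array as da

arr = np.arange(8)
x = da.from_array(arr, chunks=4)  # two 4-element chunks

# Regardless of the chunking requested, each chunk's task graph references
# the whole of `arr` at its root rather than a 4-element slice
# (see dask/dask#5367); the chunks are only sliced out at compute time.
print(x.chunks)                 # ((4, 4),)
print(int(x.sum().compute()))   # 28
```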
    kollmats
    @kollmats
    [image: task graphs of the two arrays]

    @d-v-b , thanks for your reply. After looking at the graphs of my 'identical' arrays and doing some additional reading of the Dask documentation I realize that they are not equal at all.

    The array loaded via from_array has a single node (since it is only one chunk and it fits in memory). On the other hand, the graph of the array originating from the read_csv call is relatively complex. It starts off with two parallel read_blocks, goes via read_panda, from_delayed, and values, and then finally the two paths merge into one at the rechunk-merge (see the picture above).

    Now, my new question is: are both these graphs computed anew for every call to compute()? Or does Dask store intermediate values somehow? E.g., at the values nodes?

    Davis Bennett
    @d-v-b
    without setting up caching (https://docs.dask.org/en/latest/caching.html) I believe repeated calls to compute will re-run the exact same computation
    if you are using dask-distributed you can look into storing intermediate values with persist -- see the "persist" section in the best practices guide here: https://docs.dask.org/en/latest/dataframe-best-practices.html
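A minimal sketch of persist, assuming the default local scheduler (with dask.distributed the same call keeps the persisted chunks in worker memory instead):

```python
import dask.array as da

x = da.ones((1000,), chunks=100)
y = (x + 1).persist()  # run the graph once; `y` now holds concrete chunks

# Later computations start from the persisted chunks instead of re-running
# the full graph from the original source (e.g. a read_csv).
total = y.sum().compute()
print(total)  # 2000.0
```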
    kollmats
    @kollmats
    @d-v-b so in my case, that would mean repeated calls to pandas.read_csv? Then all of it begins to make sense. I'll look into caching and persist. Thank you!
    Davis Bennett
    @d-v-b
    :thumbsup:
    Interview
    @interviewer_gitlab
    Why not install graphviz with dask as a dependency?
    Martin Durant
    @martindurant
    ^ because dask works well without graphviz, and we don’t want to force people to install unnecessary packages. Indeed, how to install graphviz correctly has been problematic through the years.