    Martin Durant
    @martindurant
    Ah… Yes, you need to be able to read them to get anything. If the workers don’t die, you can get them through the dashboard, or there’s a client method (get_logs?) that does it.
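    (For reference, the client method in question is presumably Client.get_worker_logs, which comes up again below; a minimal sketch, with a hypothetical scheduler address:)

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

    # Fetch recent log lines from each live worker; the same logs are also
    # visible per-worker in the dashboard while the workers are alive.
    logs = client.get_worker_logs()
    for worker_addr, lines in logs.items():
        print(worker_addr, lines)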
    Sarah Bird
    @birdsarah
    okay sweet - thanks!
    Martin Durant
    @martindurant
    (if you find anything, please report in the issue I linked)
    Sarah Bird
    @birdsarah
    Definitely!
    It reliably takes ~1.45 hours to hit the error.
    Sarah Bird
    @birdsarah
    @martindurant fut = df.to_parquet(.., compute=False); client.compute(fut) is not doing the to_parquet operation.
    Any thoughts?
    I have a slight difference in my code:
    futures = df.to_parquet(.., compute=False)
    try:
        client.compute(futures)
    except Exception as e:
        ..log stuff..
        client.retry(futures)
    Sarah Bird
    @birdsarah
    OK, figured it out: client.compute(futures, sync=True)
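    (A minimal sketch of that pattern, with hypothetical paths and input; sync=True makes client.compute block until the work finishes instead of immediately returning futures:)

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # assumes a cluster is already running
    df = dd.read_csv("s3://bucket/input/*.csv")  # hypothetical input

    # compute=False builds the write as lazy (delayed) work instead of running it
    delayed_write = df.to_parquet("s3://bucket/output/", compute=False)

    # Blocks until the parquet write has actually finished (or raised)
    client.compute(delayed_write, sync=True)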
    Sarah Bird
    @birdsarah
    For what it's worth, the pick-up with retry did not work - I lost ~2000 partitions.
    Matthew Rocklin
    @mrocklin
    @birdsarah have you considered adding retries to your compute call? That way, if one task fails once, they don't all fail
    compute(retries=10)

    Also, a slight nit in terminology:

    futures = df.to_parquet(.., compute=False)

    Those are delayed objects, not futures. Futures point to work that has been launched already.
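    (A sketch combining both points, assuming delayed_write = df.to_parquet(..., compute=False) from the snippet above; retries lets an individual failed task be retried rather than failing the whole write:)

    # to_parquet(..., compute=False) gives a delayed object; client.compute
    # launches it and hands back a future.
    future = client.compute(delayed_write, retries=10)

    # Block until the write finishes; a failing task is retried up to 10
    # times before the computation as a whole is marked failed.
    future.result()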

    Sarah Bird
    @birdsarah
    I have not tried retries. I will.
    @mrocklin, any ideas on how to increase the s3fs logging level on workers?
    The above suggestions do not appear to be having the desired effect, i.e. I'm not seeing a change in verbosity in what I get back from client.get_worker_logs().
    Also @mrocklin, I'm not able to get anything off the yarn application logs, as discussed offline.
    Matthew Rocklin
    @mrocklin
    (I think that you were discussing that with @jcrist , not me. I don't know much about logs)
    I don't know anything about s3fs logging either. I recommend raising an issue so that Martin can help out (and so that we can peel off parts of the conversation for which it would be good to build some record for future users)
    RuiLoureiro
    @RuiLoureiro
    In my application, I need to create a Dask DataFrame that is composed of Delayed Dask Objects (like Scalars). Right now, when I call compute on the DataFrame, I end up with a Pandas DataFrame with Delayed objects inside. Is it possible that when I call compute on the DataFrame, all the delayed objects inside it also get computed?
    Philipp Kats
    @Casyfill

    Hi everyone. Probably a simple question, but I couldn't find the specific docs: how should I deploy an adaptive cluster on k8s using dask? Shall I write a .py file with

    from dask_kubernetes import KubeCluster
    
    cluster = KubeCluster()
    cluster.adapt(minimum=0, maximum=100)  # scale between 0 and 100 workers

    and run it on the scheduler machine?

    Philipp Kats
    @Casyfill
    To clarify - I don't want to use helm, and I do want to keep it working indefinitely.
    jkmacc-LANL
    @jkmacc-LANL
    I’m interested in the answer to this question, too. I suspect that it may be something like, “you can always do it manually,” since helm’s job seems to be precisely this.
    Philipp Kats
    @Casyfill
    I guess I am all for doing it "manually" via drone; I just wasn't sure if I need to keep that process live, or, say, if there is an easy way to add a dashboard/jupyter image to the same pod, similarly to how helm does it.
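    (One hedged sketch of a long-running script for this, assuming the process that creates KubeCluster also runs the adaptive loop and therefore has to stay alive; not a statement of the recommended deployment:)

    # adaptive_cluster.py -- hypothetical file name
    import time

    from dask_kubernetes import KubeCluster

    cluster = KubeCluster()
    cluster.adapt(minimum=0, maximum=100)  # scale between 0 and 100 workers

    # Keep the process alive so the adaptive loop keeps running
    while True:
        time.sleep(60)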
    Jim Crist-Harif
    @jcrist
    @birdsarah, it's yarn logs -applicationId <your application id>, and it's only available for stopped applications. If the application is still running you can get the worker logs using get_worker_logs, as you said above, or through the yarn resourcemanager web UI (Skein's webui server shows live logs for all services).
    yasir-din-cko
    @yasir-din-cko
    Hi, I'm new to Dask and am writing an 8GB .csv to parquet using a local cluster (30 GB RAM, 10 workers each with 1 thread), but this takes close to 1 hour to complete. Is this expected? And where should I look to try to speed this up?
    yasir-din-cko
    @yasir-din-cko
    Re: my question, I realised that writing to S3 with .to_parquet() was causing a bottleneck.
    yasir-din-cko
    @yasir-din-cko
    What is the behaviour when I .read_parquet() from S3? Since it's lazy, is the data streamed from S3 when running some computation?
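    (A small sketch of the lazy behaviour being asked about, with a hypothetical bucket path: read_parquet only reads metadata up front, and partitions are pulled from S3 when a computation actually runs:)

    import dask.dataframe as dd

    # Builds a lazy task graph; only parquet metadata is read at this point
    df = dd.read_parquet("s3://my-bucket/data/")  # hypothetical path

    # Partitions are transferred from S3 only when a computation is triggered
    row_count = df.shape[0].compute()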
    Sarah Bird
    @birdsarah
    thanks @jcrist
    Sarah Bird
    @birdsarah
    Hi all, can someone set expectations on how long I should expect dask to take to get started with a ~100-200k task graph? Specifically, I have made a df by concatenating other dataframes. My df's "concat" lists 129k tasks to be done. I am writing out that concatenated df with df.to_parquet, but I have yet to see the tasks appear in my dashboard.
    Dave Hirschfeld
    @dhirschfeld
    I've run into some performance issues with dask.dataframe, so I'm curious whether you get better performance if you avoid it - e.g.
    import pandas as pd

    def read_table(filename):
        # Read one parquet file into a pandas DataFrame on a worker
        import pyarrow.parquet as pq
        tbl = pq.read_table(filename)
        return tbl.to_pandas()

    # One future per file, then concatenate the pieces on a single worker
    futs = client.map(read_table, filenames)
    df = client.submit(pd.concat, futs).result()
    I can clarify if you have any questions
    but I doubt it's applying the function in a parallel way - pandas takes the exact same time for the whole thing
    Adrian
    @adrianchang__twitter
    Hey guys, does anyone have a good example of using a dask-ml pipeline in a real-time service?
    Mostly having trouble dealing with real-time one-hot encoding ....
    evalparse
    @xiaodaigh
    I am the author of disk.frame. I am trying to do fair benchmarks with Dask. Can anyone help me get this right?
    I need to know how to properly tune dask for best performance on a single machine.
    So far, it feels like disk.frame is slightly faster than Dask, but I think I might be doing something wrong.
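    (Not an authoritative tuning guide, but a sketch of the single-machine knobs usually adjusted for pandas-heavy benchmarks; the worker counts and memory limit below are illustrative, not recommendations:)

    from dask.distributed import Client, LocalCluster

    # For GIL-holding pandas workloads, several processes with one thread
    # each is a common starting point; adjust to the machine being benchmarked.
    cluster = LocalCluster(
        n_workers=8,            # illustrative: roughly one per physical core
        threads_per_worker=1,
        memory_limit="4GB",     # per-worker limit, illustrative
    )
    client = Client(cluster)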
    RuiLoureiro
    @RuiLoureiro
    Hey everyone, I'm trying to define custom operations and am a bit uncertain as to how to implement them. I posted a more detailed question on SO:
    https://stackoverflow.com/questions/57597151/how-do-i-use-dask-to-efficiently-compute-custom-statistics
    Vishesh Mangla
    @XtremeGood
    hey has anyone used joblib?
    Scott Sievert
    @stsievert
    @RuiLoureiro maybe https://docs.dask.org/en/latest/caching.html or Python's builtin lru_cache? (also submitted a comment on SO)
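    (A minimal sketch of both suggestions; the cache size is arbitrary, and dask's opportunistic cache requires the cachey package:)

    from functools import lru_cache

    from dask.cache import Cache

    # Opportunistic caching: keep up to ~2 GB of intermediate dask results
    # so repeated computations can reuse them
    cache = Cache(2e9)
    cache.register()

    # Python's builtin memoization for plain functions with hashable arguments
    @lru_cache(maxsize=None)
    def expensive_statistic(key):  # hypothetical function
        ...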
    Loïc Estève
    @lesteve
    About caching within a dask context, I have heard of https://github.com/radix-ai/graphchain from this comment. I have not used it, though.
    evalparse
    @xiaodaigh
    Gotta say the support from dask vs Julia Community is pretty ordinary
    Jim Crist-Harif
    @jcrist
    @xiaodaigh, dask developers are generally not active on gitter; please reach out via github or stackoverflow with questions like this.
    evalparse
    @xiaodaigh
    @jcrist Thanks for the tip. I will post to github next, cos my SO post hasn't received much attention either....