    Philipp Kats
    @Casyfill
    to clarify - I don't want to use helm and I do want to keep it working indefinitely
    jkmacc-LANL
    @jkmacc-LANL
    I’m interested in the answer to this question, too. I suspect that it may be something like, “you can always do it manually,” since helm’s job seems to be precisely this.
    Philipp Kats
    @Casyfill
    I guess I am all for doing it "manually" via drone, just wasn't sure if I need to keep that process live, or say if there is an easy way to add dashboard/jupyter image to the same pod similarly to how helm does it
    Jim Crist-Harif
    @jcrist
@birdsarah, it's yarn logs -applicationId <your application id>, and it's only available for stopped applications. If the application is still running you can get the worker logs using get_worker_logs, as you said above, or through the yarn resourcemanager web UI (Skein's webui server shows live logs for all services).
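For example, a minimal sketch, assuming client is an already-connected distributed Client (the scheduler address is hypothetical):
from dask.distributed import Client
client = Client('tcp://scheduler-address:8786')  # hypothetical scheduler address
logs = client.get_worker_logs()                  # dict keyed by worker address
for worker, records in logs.items():
    print(worker, records[:5])                   # peek at the first few log records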
    yasir-din-cko
    @yasir-din-cko
Hi, I'm new to Dask and am writing an 8GB .csv to parquet using a local cluster (30GB RAM, 10 workers, each with 1 thread), but this takes close to 1 hour to complete. Is this expected? And where should I look to try and speed this up?
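Roughly what I'm running (paths are placeholders):
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

client = Client(LocalCluster(n_workers=10, threads_per_worker=1))
df = dd.read_csv('data.csv')      # the ~8GB input; placeholder path
df.to_parquet('output.parquet')   # the step that takes close to an hour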
    yasir-din-cko
    @yasir-din-cko
    Re: my question, I realised that writing to S3 with .to_parquet() was causing a bottleneck.
    yasir-din-cko
    @yasir-din-cko
What is the behaviour when I .read_parquet() from S3? Since it's lazy, is the data streamed from S3 when running some computation?
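Concretely, what I mean (hypothetical bucket and column names):
import dask.dataframe as dd
df = dd.read_parquet('s3://my-bucket/data.parquet')  # lazy: only metadata is read here
result = df['amount'].sum().compute()                # partitions are pulled from S3 now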
    Sarah Bird
    @birdsarah
    thanks @jcrist
    Sarah Bird
    @birdsarah
Hi all, can someone set expectations on how long dask should take to get started on a ~100-200k-task graph? Specifically, I have made a df by concatting other dataframes. My df "concat" lists 129k tasks to be done. I am writing out that concatted df with df.to_parquet, but I have yet to see the tasks appear in my dashboard.
    Dave Hirschfeld
    @dhirschfeld
I've run into some performance issues with dask.dataframe, so I'm curious whether you get better performance if you avoid it, e.g.:
def read_table(filename):
    # read one parquet file into a pandas DataFrame on a worker
    import pyarrow.parquet as pq
    tbl = pq.read_table(filename)
    return tbl.to_pandas()

# read all files in parallel, then concatenate on a single worker
futs = client.map(read_table, filenames)
df = client.submit(pd.concat, futs).result()
I can clarify if you have any questions
but I suspect it's not applying the function in parallel: plain pandas takes the exact same time for the whole thing
    Adrian
    @adrianchang__twitter
hey guys, does anyone have a good example of using something like a dask-ml pipeline in a real-time service?
mostly having trouble dealing with real-time one-hot encoding...
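Roughly the shape of what I'm after (a sketch; train_df and the feature names are made up):
import pandas as pd
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder

# fit offline: learn the category values from the training frame
pipe = make_pipeline(Categorizer(), DummyEncoder())
pipe.fit(train_df)  # train_df stands in for the training DataFrame

# at serving time, transform a one-row frame with the already-fitted pipeline
row = pd.DataFrame({'color': ['red'], 'size': ['L']})  # hypothetical features
features = pipe.transform(row)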
evalparse
@xiaodaigh
I am the author of disk.frame. I am trying to do fair benchmarks with Dask. Can anyone help me get this right?
I need to know how to properly tune dask for best performance on a single machine.
So far, it feels like disk.frame is slightly faster than Dask, but I think I might be doing something wrong.
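What I've been using so far for the single-machine setup (worker counts and memory limits are guesses to be tuned per machine):
from dask.distributed import Client, LocalCluster

# one single-threaded worker process per core is a common starting point
# for CPU-bound, pandas-style workloads
cluster = LocalCluster(n_workers=8, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)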
    RuiLoureiro
    @RuiLoureiro
    Hey everyone, I'm trying to define custom operations and am a bit uncertain as to how to implement it. Posted a more detailed question on SO:
    https://stackoverflow.com/questions/57597151/how-do-i-use-dask-to-efficiently-compute-custom-statistics
    Vishesh Mangla
    @XtremeGood
    hey has anyone used joblib?
    Scott Sievert
    @stsievert
@RuiLoureiro maybe https://docs.dask.org/en/latest/caching.html or Python's builtin lru_cache? (also submitted a comment on SO)
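The opportunistic-caching snippet from that docs page looks like this (it needs the cachey package installed):
from dask.cache import Cache
cache = Cache(2e9)  # keep up to ~2 GB of intermediate results
cache.register()    # subsequent .compute() calls now cache automatically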
    Loïc Estève
    @lesteve
    About caching within a dask context I have heard of this https://github.com/radix-ai/graphchain from this comment. I have not used it though.
    evalparse
    @xiaodaigh
    Gotta say the support from dask vs Julia Community is pretty ordinary
    Jim Crist-Harif
    @jcrist
    @xiaodaigh, dask developers are generally not active on gitter, please reach out via github or stackoverflow with questions like this.
    evalparse
    @xiaodaigh
@jcrist Thanks for the tip. I will post to github next. Cos my SO post hasn't received much attention either....
    Jim Crist-Harif
    @jcrist
    Dask is an open source project, and like many other projects has limited developer resources. Asking a question and then complaining that you got no response within 24 hours isn't productive. Please be patient and respectful of others' time.
    evalparse
    @xiaodaigh
Not complaining. Just comparing experiences: usually I get a fairly quick response on Julia questions, so it is a relative experience of the communities. I am also an open source author, so I understand. That's why I never bother anyone in dask/dev. I tried SO, Twitter, etc. but got hardly any response to my noob questions. So to me, the community is not very active, that's all.
    Niloy-Chakraborty
    @Niloy-Chakraborty
Hi all, can I use Streamz to consume data from RabbitMQ and then process it using dask?
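As far as I know streamz has no built-in RabbitMQ source, so I was thinking of feeding a Stream manually from a pika consumer, something like this sketch (process and the queue name are placeholders):
import pika
from streamz import Stream
from dask.distributed import Client

def process(body):      # stand-in for the real processing step
    return len(body)

client = Client()       # local dask cluster; .scatter() below routes work to it
source = Stream()
source.scatter().map(process).gather().sink(print)

def on_message(ch, method, properties, body):
    source.emit(body)   # push each RabbitMQ message into the stream

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.basic_consume(queue='events', on_message_callback=on_message, auto_ack=True)
channel.start_consuming()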
    Pedro Lopes
    @pedroallenrevez

    Hey all, I have the following problem, reproducible with this example:

    import pandas as pd
    import dask.array as da
    import dask.dataframe as dd

    s = pd.Series([1, 2, 3, 4, 5])
    ds = dd.from_pandas(s, npartitions=2)
    print(ds.sum())
    print(da.sqrt(ds.sum()))
    print(da.sin(ds.sum()))
    print(da.power(ds.sum(), 2))

    Computing any dask.array ufunc of a dask Scalar triggers a computation.
    If it is done on the Series, the behavior is as expected (a dask graph is returned). Any ideas on why this happens?

    Sarah Bird
    @birdsarah
    Can anyone give me a quick pulse on whether I'm going crazy with issue: dask/dask#5319
    (it will help me know how to proceed with my etl)
    Michael Adkins
    @madkinsz
    Any advice on running docker-in-docker on Dask? e.g. running a containerized task in a Dask worker node on Kubernetes
    Kolmar Kafran
    @kafran
    @xiaodaigh Have you tried StackOverflow?
suraj bhatt
@surisurajbhatt_twitter

Hi, I'm unable to execute the following query:

import dask.dataframe as dd
df = dd.read_parquet('gcs://anaconda-public-data/nyc-taxi/nyc.parquet/part.0.parquet')

Error: ArrowIOError: Unexpected end of stream: Page was smaller (5242780) than expected (6699768)

    Martin Durant
    @martindurant
    ^ please try with fsspec 0.4.2
    suraj bhatt
    @surisurajbhatt_twitter
could you please give me the demo syntax, @martindurant?
    Martin Durant
    @martindurant
    same syntax, but update your version of fsspec, available via pip or conda
    suraj bhatt
    @surisurajbhatt_twitter
did that, but error: KeyError: 'gcs' @martindurant
    Martin Durant
    @martindurant
    Then you should probably also update gcsfs
    suraj bhatt
    @surisurajbhatt_twitter
    which version
    Martin Durant
    @martindurant
    latest
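for example, with pip (conda works the same way):
pip install --upgrade fsspec gcsfs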
    suraj bhatt
    @surisurajbhatt_twitter
nothing is working for this:

import dask.dataframe as dd
df = dd.read_parquet('gcs://anaconda-public-data/nyc-taxi/nyc.parquet/part.0.parquet')

@martindurant

    Tom Augspurger
    @TomAugspurger
    @surisurajbhatt_twitter can you write a minimal example and post a github issue? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
    James Stidard
    @jamesstidard
    Hi, I was wondering if it's OK for a dask delayed function to use a process pool within it? Or will that cause havoc with the scheduler/resource monitoring of dask?
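Roughly what I mean, as a sketch (expensive stands in for the real per-item work):
from concurrent.futures import ProcessPoolExecutor
import dask

def expensive(x):   # placeholder for the actual CPU-heavy function
    return x * x

@dask.delayed
def heavy_task(items):
    # spawn a process pool inside the delayed function
    with ProcessPoolExecutor(max_workers=4) as pool:
        return list(pool.map(expensive, items))

result = heavy_task(list(range(100))).compute()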