    joeschmid
    @joeschmid
    @RuiLoureiro got it, that makes sense. It might be overkill, but I suppose you could build up the entire tree of intermediate results. One open source project that I'm familiar with runs on Dask and does that: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/engine/flow_runner.py The key line:
    task_states[task] = executor.submit(
                            self.run_task,
                            task=task,
                            state=task_state,
                           ... snipped ...
                        )
    We run large Flows with Prefect on Dask. When our Flow Run is finished, we get back a results object that we can then examine for the state and return value of the various tasks. Depending on what you're doing, it might be easier to just use Prefect to build your flow and run it on Dask. It makes it very easy to return data from tasks and build up multi-step flows that run on Dask. We've had a good experience with it.
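    For reference, a minimal sketch of the pattern described above, using Prefect's DaskExecutor (the scheduler address and the tasks here are hypothetical, not from the conversation):

    from prefect import Flow, task
    from prefect.engine.executors import DaskExecutor

    @task
    def inc(x):
        return x + 1

    @task
    def double(x):
        return x * 2

    with Flow("demo") as flow:
        a = inc(1)
        b = double(a)

    # Run the flow on an existing Dask cluster; the address is a placeholder.
    state = flow.run(executor=DaskExecutor(address="tcp://scheduler:8786"))

    # The returned state maps each task to its state and return value.
    print(state.result[b].result)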
    RuiLoureiro
    @RuiLoureiro
    @joeschmid Very interesting, will look into it. Even if we don't end up using the library, it's always interesting to see how other people build on top of dask. Thank you very much!
    garanews
    @garanews
    Is it possible to have 2 different Dask clusters running on the same machine?
    joeschmid
    @joeschmid
    @RuiLoureiro You bet. Also, Prefect has a really helpful public Slack community available at prefect-community.slack.com. You can post questions there and get helpful, fast responses from their team.
    bgoodman44
    @bgoodman44
    Anyone here?
    konstantinmds
    @konstantinmds

    Hi guys,
    I have an issue pulling a large table (it can't fit in memory) from an Azure database or another server; I need to divide that table into multiple CSVs.
    So I basically have no transformation except for dividing it into equal parts.
    I think Dask is the right tool for this?
    I tried many ways to make a simple connection to the SQL server, but I just can't get it working:

    import dask.dataframe as dd
    import sqlalchemy as sa

    engine = sa.create_engine('mssql+pyodbc://VM/Data?driver=SQL+Server+Native+Client+11.0')
    metadata = sa.MetaData()
    posts = sa.Table('posts', metadata, schema='dbo', autoload=True, autoload_with=engine)
    query = sa.select([posts])
    sql_reader = dd.read_sql_table('posts', uri=engine, npartitions=16, index_col='userId')

    Any help with this ?

    Martin Durant
    @martindurant
    uri should, as the name implies, be the URI and not an engine instance:
    uri : string
        Full sqlalchemy URI for the database connection
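    A minimal sketch of the corrected call, reusing the connection string from the question (the CSV filename pattern is just an example):

    import dask.dataframe as dd

    # Pass the connection string itself, not the engine object.
    uri = 'mssql+pyodbc://VM/Data?driver=SQL+Server+Native+Client+11.0'
    ddf = dd.read_sql_table('posts', uri, schema='dbo',
                            npartitions=16, index_col='userId')

    # Each partition is written to its own CSV file.
    ddf.to_csv('posts-*.csv')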
    Bradley McElroy
    @limx0
    Hi dask team, is there an equivalent SystemMonitor for the dask-kubernetes workers as there is for the LocalCluster workers?
    Bradley McElroy
    @limx0
    I'm interested in gathering some worker resource stats for some dask graph runs
    Matthew Rocklin
    @mrocklin
    All dask workers run the system monitor internally
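    For example, a sketch of pulling those stats from every worker with client.run (the scheduler address is a placeholder, and the monitor's query API may differ across versions):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")

    def monitor_range(dask_worker):
        # Each worker keeps a rolling SystemMonitor of CPU/memory samples.
        return dask_worker.monitor.range_query(start=0)

    stats = client.run(monitor_range)  # dict keyed by worker address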
    Bradley McElroy
    @limx0
    Okay, good to know. Do the kubes workers record their usage like the LocalCluster workers? I'm actually trying to do a range query after some computation; is this also available on the kubes cluster?
    Matthew Rocklin
    @mrocklin
    (FYI, I don't answer questions here as a rule. If you're looking for help from me personally you'll have to use Stack Overflow or GitHub.)
    Bradley McElroy
    @limx0
    understood, thanks @mrocklin - dask/dask-kubernetes#180
    Matthew Rocklin
    @mrocklin
    I appreciate it!
    Adam Thornton
    @athornton
    I just opened this issue: dask/dask-kubernetes#181 but figured someone here might know. What changed in dask-kubernetes with respect to RBAC after 0.9.1? I get a 403 Forbidden: User "system:serviceaccount:nublado-athornton:dask" cannot get resource "pods" in API group "" in the namespace "nublado-athornton", but I have what look like the right rules in my role:
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - list
      - create
      - delete
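    The error complains specifically about the get verb, which the role above doesn't grant. A sketch of amended rules (adding get, plus watch, which the documented dask-kubernetes role appears to include; verify against the version in use):

    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - get
      - list
      - watch
      - create
      - delete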
    Eric Ma
    @ericmjl
    @TomAugspurger I hope you're doing well. I've got a question for you regarding dask-ml - is there a way to prevent parallel random forest fitting from using >100% CPU? I have set n_jobs=1 in the RandomForestRegressor() constructor, but still end up with some of my dask workers using 2000% CPU, which looks really weird.
    (It also makes me a bad citizen on our compute cluster, haha.)
    FWIW, I've been using Dask as part of an automated ML system - one RF per slice/subset of data in our database - and there are like thousands of slices.
    It's super cool that Dask enables me to build this out.
    That said, I get a lot of garbage collection errors whenever I'm using parallel RF on this many slices of data - is that because I'm overwhelming the Dask scheduler?
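    The setup being described is roughly the following sketch (fit_slice, the slices, and the cluster address are hypothetical; the real system reportedly goes through dask-ml):

    from dask.distributed import Client
    from sklearn.ensemble import RandomForestRegressor

    client = Client("tcp://scheduler:8786")

    def fit_slice(X, y):
        # n_jobs=1 so each fit should stay on a single core.
        return RandomForestRegressor(n_jobs=1).fit(X, y)

    # One task per slice; `slices` stands in for thousands of (X, y) pairs.
    futures = [client.submit(fit_slice, X, y) for X, y in slices]
    models = client.gather(futures)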
    Matthew Rocklin
    @mrocklin
    @ericmjl it might be better to raise this as an issue on github
    That way other people who run into the same issue will be able to do a web search and benefit from the answer
    Eric Ma
    @ericmjl
    OK! Thanks @mrocklin! :D
    Scott Sievert
    @stsievert
    I’m referencing dask-jobqueue in an academic paper. What’s the preferred method of referencing it?
    I’m currently planning on linking to the readthedocs homepage.
    joeschmid
    @joeschmid
    anakaine
    @anakaine
    Random question: when creating a new column in a dataframe by assigning the results of an np.select operation, dask complains that the data coming back is not supported, being of type numpy.ndarray. Doing the same in pandas (same code, just one dataframe is set up in dask, the other in pandas) works fine. Is this as expected?
    Scott Sievert
    @stsievert
    Thanks for that @joeschmid
    Simon
    @anakaine_twitter
    Is there an equivalent of np.select in dask? Checking the docs, there are quite a few references to "see also: select", but no section for it.
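    One possible workaround for the np.select questions above is to apply it per partition, which sidesteps assigning a bare numpy array to a dask dataframe. A sketch with made-up data and conditions:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": [1, 2, 3, 4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    def add_y(d):
        # np.select runs against each pandas partition, so its ndarray
        # result can be assigned within that partition as usual.
        y = np.select([d["x"] < 2, d["x"] < 4], [10, 20], default=0)
        return d.assign(y=y)

    ddf = ddf.map_partitions(add_y)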
    David Hoese
    @djhoese
    Not sure of a better place for this, but I'm looking for a small example of using fastparquet that shows the advantages of defining row-groups over the default "single" row group. I'm presenting some information on file formats but don't use parquet in my day to day (I really don't even use pandas that often). Anyone have a simple code example?
    Uwe L. Korn
    @xhochy
    @djhoese Here's a very simple example: https://gist.github.com/xhochy/0880fd0bed3a7eaca3c31bf07b6868d9 (although it uses pyarrow)
    Martin Durant
    @martindurant
    Correct, whether or not to use row-groups is not specific to the engine; Dask will load in parallel with either and can make (limited) decisions to exclude some row groups based on metadata.
    David Hoese
    @djhoese

    @xhochy Ok thanks. Is there a way to do that query without knowing that row-group 1 is where you want to look?

    Does generating a pandas DataFrame (not dask) load the entire parquet file into memory? Or does it do it lazily?

    Martin Durant
    @martindurant

    The last time I looked, pandas loaded everything. It would be reasonable to implement that iteratively, and fastparquet does have a specific method to do that.

    Is there a way to do that query without knowing that row-group 1 is where you want to look

    Parquet optionally stores column max and min values for each row-group, so maybe.
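    To illustrate both points, a sketch with fastparquet (the file, column names, and sizes are made up): writing multiple row groups, filtering on their statistics, and iterating lazily:

    import numpy as np
    import pandas as pd
    from fastparquet import ParquetFile, write

    df = pd.DataFrame({"id": np.arange(1_000_000),
                       "value": np.random.random(1_000_000)})

    # Ten row groups of 100k rows instead of one big group.
    write("data.parquet", df, row_group_offsets=100_000)

    pf = ParquetFile("data.parquet")

    # Row-group min/max statistics let the reader skip groups whose
    # range cannot match the filter.
    subset = pf.to_pandas(filters=[("id", ">", 900_000)])

    # Iterate row groups one at a time instead of loading everything.
    for chunk in pf.iter_row_groups():
        print(len(chunk))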

    David Hoese
    @djhoese
    Thanks @martindurant
    Davis Bennett
    @d-v-b
    An interesting blog post about designing a threaded scheduler in Rust: https://tokio.rs/blog/2019-10-scheduler/
    Dean Langsam
    @DeanLa
    How do I expand a series containing fixed-size lists into a dataframe?
    Scott Sievert
    @stsievert
    Dean Langsam
    @DeanLa
    Not what I need.
    I need to expand a list to columns (on axis 1), not explode and create new rows.
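    A sketch of one way to do that by mapping over partitions (the data and column names are made up, and this assumes every list has the same length):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"vec": [[1, 2, 3], [4, 5, 6]]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # Turn each fixed-size list into its own column, keeping the index.
    expanded = ddf["vec"].map_partitions(
        lambda s: pd.DataFrame(s.tolist(), index=s.index,
                               columns=["a", "b", "c"]),
        meta={"a": "i8", "b": "i8", "c": "i8"},
    )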
    Sergio Meana
    @smeana
    Hello
    Is there any built-in function to do forward rolling on a timeseries index?
    I have seen the same question asked by @AlbaFS on Aug 04 without a reply. Any solution at all? Thanks
    mcguip
    @mcguipat
    Hi all, I have a problem which requires data to be loaded in several blocks which are then subdivided and operated on. Scatter seems like the most sensible way of accomplishing this; however, if I want loading to be executed on the cluster as well, this requires scatter from within another task. Is there a better way to accomplish this?
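    One pattern that appears to fit is distributed's worker_client, which lets a task talk to the scheduler and submit further work from inside the cluster (the loader and helper functions here are hypothetical):

    from dask.distributed import worker_client

    def load_and_fan_out(path):
        data = load_block(path)  # hypothetical loader, runs on a worker
        with worker_client() as client:
            # Scatter the subdivided pieces and submit follow-up work
            # from within the task itself.
            pieces = client.scatter(subdivide(data))      # hypothetical subdivide
            futures = [client.submit(process, p) for p in pieces]  # hypothetical process
            return client.gather(futures)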
    Dario Vianello
    @dvianello
    Hey! I'm struggling a bit with s3 & IAM roles while using dask. We have a set of workers in EC2 with instance profiles authorised to assume a role in a different account and read from a bucket. I've tried to do the assume role beforehand in boto3 and pass the session into dask, but the session object can't be pickled apparently (fails to pickle the lock). Is there a way to pull this off in Dask? Sending creds along to workers isn't the best idea ever and it would be cool if the system was able to do the assume role on the workers before trying to access s3...
    Martin Durant
    @martindurant
    Whatever you do to assume the role, you could execute the same thing on the cluster workers using client.run
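    A sketch of that suggestion (the role ARN, session name, and scheduler address are placeholders):

    import os
    import boto3
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")

    def assume_role():
        creds = boto3.client("sts").assume_role(
            RoleArn="arn:aws:iam::123456789012:role/bucket-reader",
            RoleSessionName="dask-worker",
        )["Credentials"]
        # Expose the temporary credentials to s3 access on this worker.
        os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
        os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
        os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

    # Run the assume-role on every worker instead of shipping a session object.
    client.run(assume_role)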
    Dario Vianello
    @dvianello
    right!
    Dean Langsam
    @DeanLa
    Running dask-yarn, my notebook crashed; I think the Cluster instance (scheduler?) is dangling, because I can't create a new one. How can I make sure?
    xavArtley
    @xavArtley

    Hello,
    I'm using

    dask.compute(*delayeds, scheduler='processes', num_workers=4)

    to run computations.
    I was wondering which function is used to serialize objects between processes, and whether it's possible to change it.
    Thanks
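    For what it's worth: the 'processes' scheduler serializes tasks with cloudpickle by default, and dask.multiprocessing.get appears to accept func_dumps/func_loads overrides that dask.compute forwards. A sketch under that assumption (verify against your dask version):

    import pickle
    import dask

    # Hypothetical: swap the default cloudpickle for the standard pickle module.
    results = dask.compute(
        *delayeds,
        scheduler='processes',
        num_workers=4,
        func_dumps=pickle.dumps,
        func_loads=pickle.loads,
    )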