    Matthew Rocklin
    @mrocklin
    (FYI, I don't answer questions here as a rule. If you're looking for help from me personally you'll have to use Stack Overflow or GitHub)
    Bradley McElroy
    @limx0
    understood, thanks @mrocklin - dask/dask-kubernetes#180
    Matthew Rocklin
    @mrocklin
    I appreciate it!
    Adam Thornton
    @athornton
    I just opened this issue: dask/dask-kubernetes#181 but figured someone here might know. What changed in dask-kubernetes with respect to RBAC after 0.9.1? I get a 403 Forbidden: User "system:serviceaccount:nublado-athornton:dask" cannot get resource "pods" in API group "" in the namespace "nublado-athornton", but I have what look like the right rules in my role:
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - list
      - create
      - delete
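    One thing worth checking against that 403: it complains that the service account cannot "get" pods, and get is absent from the verbs list above. A hedged guess at the fix (the watch verb is included as an extra guess, since it often accompanies list):

```yaml
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get      # the verb the 403 says is missing
  - list
  - watch    # often needed alongside list; included as a guess
  - create
  - delete
```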
    Eric Ma
    @ericmjl
    @TomAugspurger I hope you're doing well. I've got a question for you regarding dask-ml - is there a way to prevent parallel random forest fitting from using >100% CPU? I have set n_jobs=1 in the RandomForestRegressor() constructor, but still end up with some of my dask workers using 2000% CPU, which looks really weird.
    (It also makes me a bad citizen on our compute cluster, haha.)
    FWIW, I've been using dask as part of an automated ML system - one RF per slice/subset of data in our database - and there are thousands of slices.
    It's super cool that Dask enables me to build this out.
    That said, I get a lot of garbage collection errors whenever I'm using parallel RF on this many slices of data - is it because of overwhelming the dask scheduler?
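    Not a definitive answer, but >100% CPU despite n_jobs=1 is often BLAS/OpenMP oversubscription rather than joblib parallelism; those thread pools ignore n_jobs entirely. A minimal sketch of capping them per process (the env-var names are the standard BLAS/OpenMP ones, and must be set before numpy is imported):

```python
import os

# BLAS/OpenMP thread pools are separate from n_jobs and will happily use
# every core; cap them before numpy (and anything built on it) is imported
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402 - must come after the env vars are set

a = np.random.random((100, 100))
b = a @ a  # matrix multiply now runs single-threaded
```

    On a dask cluster the same variables could be set in each worker's environment so every process is capped.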
    Matthew Rocklin
    @mrocklin
    @ericmjl it might be better to raise this as an issue on github
    That way other people who run into the same issue will be able to do a web search and benefit from the answer
    Eric Ma
    @ericmjl
    OK! Thanks @mrocklin! :D
    Scott Sievert
    @stsievert
    I’m referencing dask-jobqueue in an academic paper. What’s the preferred method of referencing it?
    I’m currently planning on linking to the readthedocs homepage.
    joeschmid
    @joeschmid
    anakaine
    @anakaine
    Random question: When creating a new column in a dataframe by returning the results of an np.select operation, dask complains that the data coming back is not supported, being of type numpy.ndarray. Doing the same in pandas (same code, just one dataframe set up in dask, the other in pandas) works fine. Is this as expected?
    Scott Sievert
    @stsievert
    Thanks for that @joeschmid
    Simon
    @anakaine_twitter
    Is there an equivalent of np.select in dask? Checking the docs, there are quite a few references to "see also: select", but no section for it.
    David Hoese
    @djhoese
    Not sure of a better place for this, but I'm looking for a small example of using fastparquet that shows the advantages of defining row-groups over the default "single" row group. I'm presenting some information on file formats but don't use parquet in my day to day (I really don't even use pandas that often). Anyone have a simple code example?
    Uwe L. Korn
    @xhochy
    @djhoese Here's a very simple example: https://gist.github.com/xhochy/0880fd0bed3a7eaca3c31bf07b6868d9 (although it uses pyarrow)
    Martin Durant
    @martindurant
    correct, whether or not to use row-groups is not specific to the engine; Dask will load in parallel with either and can make (limited) decisions to exclude some row groups based on metadata.
    David Hoese
    @djhoese

    @xhochy Ok thanks. Is there a way to do that query without knowing that row-group 1 is where you want to look?

    Does generating a pandas DataFrame (not dask) load the entire parquet file in to memory? Or does it do it lazily?

    Martin Durant
    @martindurant

    The last time I looked, pandas loaded everything. It would be reasonable to implement that iteratively, and fastparquet does have a specific method to do that.

    Is there a way to do that query without knowing that row-group 1 is where you want to look

    Parquet optionally stores each column's max and min values per row-group, so maybe those statistics can tell you which row-group to read.

    David Hoese
    @djhoese
    Thanks @martindurant
    Davis Bennett
    @d-v-b
    an interesting blog post about designing a threaded scheduler in rust: https://tokio.rs/blog/2019-10-scheduler/
    Dean Langsam
    @DeanLa
    how do I expand a series containing fixed-size list into a data frame?
    Scott Sievert
    @stsievert
    Dean Langsam
    @DeanLa
    Not what I need.
    I need to expand a list to columns (on axis 1), not explode and create new rows
    Sergio Meana
    @smeana
    Hello
    Is there any built-in function to do forward rolling on a timeseries index?
    I have seen the same question asked by @AlbaFS on Aug 04 without a reply. Any solution at all? Thanks
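    I don't know of a dask built-in, but at the pandas level (usable per-partition via map_partitions) a forward-looking window can be built with FixedForwardWindowIndexer (pandas >= 1.0); a sketch:

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

# ordinary rolling windows look backward; this indexer makes each window
# start at the current row and extend window_size rows forward
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
fwd = s.rolling(indexer, min_periods=1).sum()
```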
    mcguip
    @mcguipat
    Hi all, I have a problem which requires data to be loaded in several blocks which are then subdivided and operated on. Scatter seems like the most sensible way of accomplishing this; however, if I want loading to be executed on the cluster as well, this requires scatter from within another task. Is there a better way to accomplish this?
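    One alternative sketch (all names here are made up): submit the load itself as a task and pass the resulting futures to the downstream tasks. The blocks then stay on the cluster without scattering from inside a task:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

def load_block(i):
    # stand-in for loading one block of data on a worker
    return list(range(i * 10, (i + 1) * 10))

def process(block, j):
    # operate on one subdivision of the block
    return sum(block[j * 5:(j + 1) * 5])

# loading runs on the cluster; passing futures keeps the data there
blocks = [client.submit(load_block, i) for i in range(2)]
results = client.gather(
    [client.submit(process, b, j) for b in blocks for j in range(2)]
)

client.close()
cluster.close()
```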
    Dario Vianello
    @dvianello
    Hey! I'm struggling a bit with s3 & IAM roles while using dask. We have a set of workers in EC2 with instance profiles authorised to assume a role in a different account and read from a bucket. I've tried to do the assume role beforehand in boto3 and pass the session into dask, but the session object can't be pickled apparently (fails to pickle the lock). Is there a way to pull this off in Dask? Sending creds along to workers isn't the best idea ever and it would be cool if the system was able to do the assume role on the workers before trying to access s3...
    Martin Durant
    @martindurant
    Whatever you do to assume the role, you could execute the same thing on the cluster workers using client.run
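    A sketch of that (the role ARN and helper names are hypothetical; assume_role_on_worker would be the function passed to client.run). The env vars are the standard botocore ones, which s3fs picks up:

```python
import os

def set_aws_env(creds):
    # map STS credentials onto the env vars botocore/s3fs read
    os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
    os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

def assume_role_on_worker():
    # runs locally on each worker, so no session objects are ever pickled
    import boto3
    resp = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::123456789012:role/bucket-reader",  # hypothetical
        RoleSessionName="dask-worker",
    )
    set_aws_env(resp["Credentials"])

# client.run(assume_role_on_worker)  # execute on every worker
```

    One caveat: STS credentials expire, so the run would need repeating before the session times out.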
    Dario Vianello
    @dvianello
    right!
    Dean Langsam
    @DeanLa
    Running dask-yarn, my notebook crashed; I think the Cluster instance (scheduler?) is dangling, because I can't create a new one. How can I make sure?
    xavArtley
    @xavArtley

    Hello,
    I'm using

    dask.compute(*delayeds, scheduler='processes', num_workers=4)

    to run computations.
    I was wondering which function is used to serialize objects between processes, and whether it is possible to change it
    Thanks
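    As far as I know, the 'processes' scheduler serializes with cloudpickle, which is why lambdas and closures work where plain pickle fails; a tiny sketch of the round-trip it performs:

```python
import cloudpickle

offset = 10
f = lambda x: x + offset  # plain pickle cannot serialize a lambda

payload = cloudpickle.dumps(f)   # roughly what gets sent to a worker process
restored = cloudpickle.loads(payload)
```

    Whether that serializer is swappable for the multiprocessing scheduler I'm not sure; the distributed scheduler does have a pluggable serialization layer.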

    Fred Massin
    @FMassin
    Hello,
    I would like to use template matching with time-series. The objective is to look for many relatively short 1d patterns in a relatively long time-series. Any suggestion on how to do this in Dask? I mean to have something like https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.corr with many other and split_every equal or lower than the length of other...
    Thanks!
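    Not a built-in, but one way this might be sketched (everything here is illustrative): treat each short pattern as one task and let dask parallelize across patterns, with np.correlate doing the matching:

```python
import numpy as np
import dask

rng = np.random.default_rng(0)
series = rng.normal(size=10_000)
patterns = [rng.normal(size=100) for _ in range(8)]
series[200:300] = patterns[0]  # embed one pattern so there is a known match

@dask.delayed
def best_offset(series, pattern):
    # cross-correlate and return the offset of the strongest match
    scores = np.correlate(series, pattern, mode="valid")
    return int(np.argmax(scores))

# one task per pattern; dask runs them in parallel
offsets = dask.compute(*(best_offset(series, p) for p in patterns))
```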
    Berk Gercek
    @berkgercek
    Heya, hate to add to the pile of questions but I'm currently going through dask-tutorial and I am on the weather example in Ch. 2: Arrays. The computation of mean in 500x500 chunks takes 14m31s of time (not wall time), with only 1m50s of that being User time and the rest is sys. Is this a normal result? The wall time for reference is ~37s, and I am using a 12-core processor
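    For context (a generic sketch, not the tutorial's actual data): heavy sys time relative to user time with many small chunks usually reflects per-task scheduling and thread overhead, and fewer, larger chunks shrink the graph considerably:

```python
import dask.array as da

x = da.random.random((4_000, 4_000), chunks=(200, 200))  # 400 chunks of data
y = x.rechunk((1_000, 1_000))                            # 16 chunks instead
m = y.mean().compute()
```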
    Martin Durant
    @martindurant
    ^ many of these sound like stack-overflow questions or issues (against the tutorial, in the latter case). We discourage anything more than general conversation here, because we would like solutions to be searchable, so others can benefit too
    Fred Massin
    @FMassin
    ok
    Matthew Rocklin
    @mrocklin
    +1 to what @martindurant said. @xavArtley @FMassin @berkgercek those are all great and valid questions. Most of the Dask maintainers don't watch this channel often. See https://docs.dask.org/en/latest/support.html for suggestions on where to ask for help
    Martin Durant
    @martindurant
    (it may still take some time for replies, but the audience is much broader)
    lszyba1
    @lszyba1
    Hello,
    I'm trying to load a csv file, but there are some inconsistent column types...
    So while initially a column looks like a bunch of 0s, later it turns into an object...
    is there a way to:
    a) parse through the whole 10GB+ csv file and work out the correct dtype for each column? (I have over 100 columns)
    or
    b) go through the file and summarize the type counts... for example, if there are 99,999 rows of int and 1 row of object, I can just clean that row?
    Dean Langsam
    @DeanLa
    Does dask-yarn (specifically on EMR) have any known issues?
    The cluster just dies on every take or persist action.
    Martin Durant
    @martindurant
    ^ you should probably file an issue with more details. I believe dask-yarn has been successfully used on EMR.
    xavArtley
    @xavArtley
    Hello, is it an anti-pattern to use dask.delayed to construct tasks which will be submitted to the cluster later?
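    Not an anti-pattern as far as I know: delayed objects are just task descriptions, so a sketch like this builds the graph first and runs it later on whatever scheduler is available:

```python
import dask

@dask.delayed
def load(i):
    return i * 10

@dask.delayed
def combine(parts):
    return sum(parts)

# nothing executes here; this only builds the task graph
graph = combine([load(i) for i in range(4)])

# ...later, hand the graph to a scheduler
# (or client.compute(graph) on a distributed cluster)
result = graph.compute()
```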