    Max Kanter
    @kmax12
    @bgoel2003 you need to pass the python definition rather than the string
    custom_primitive = make_trans_primitive(...)
    ft.dfs(trans_primitives=[custom_primitive],...)
    Jan Koch
    @datajanko
    @kmax12 no I didn’t realize that. Thanks for the hint. I think I can leverage cut_off_times and training window, to also construct my validation window. Then, I’d only have to do this multiple times for all the “folds” I am interested in. I hope, I’ll find some time to work on this.
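The fold idea above can be sketched in plain Python. This is only an illustration of the date arithmetic, not Featuretools code: `make_folds`, its parameters, and the dictionary keys are hypothetical names, and the spans chosen (90 days of training data, 30 days of validation) are assumptions. Each fold's cutoff and training window could then be passed on to the feature calculation.

```python
from datetime import datetime, timedelta


def make_folds(first_cutoff, n_folds, step, training_window, validation_window):
    """Return one dict per fold describing its training and validation spans.

    In this sketch, training data for a fold ends at its cutoff and starts
    training_window earlier; validation data starts at the cutoff and ends
    validation_window later.
    """
    folds = []
    for i in range(n_folds):
        cutoff = first_cutoff + i * step
        folds.append({
            "train_start": cutoff - training_window,
            "cutoff": cutoff,
            "validation_end": cutoff + validation_window,
        })
    return folds


# Hypothetical settings: three folds, one every 30 days.
folds = make_folds(
    first_cutoff=datetime(2018, 1, 1),
    n_folds=3,
    step=timedelta(days=30),
    training_window=timedelta(days=90),
    validation_window=timedelta(days=30),
)

for fold in folds:
    print(fold["train_start"].date(), fold["cutoff"].date(), fold["validation_end"].date())
```

Keeping the fold boundaries in one place makes it easier to check that no validation span overlaps the training span of the same fold.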
    Max Kanter
    @kmax12
    @datajanko exactly. let us know how it goes. if you have any other questions, feel free to message here, or if you think it'd be helpful for others in the future, post on StackOverflow with the featuretools tag: https://stackoverflow.com/questions/tagged/featuretools
    tomasgreif
    @tomasgreif
    Is there a way to use multiple training windows? I am trying to generate features for last 3/6/9/12... months. https://stackoverflow.com/questions/51865267/get-features-by-different-time-windows
    Jan Koch
    @datajanko
    maybe, this can be achieved by using interesting values, not very sure though
    Max Kanter
    @kmax12
    @tomasgreif the recommended way to do that now is to make multiple calls to calculate_feature_matrix with the same list of feature definitions but different training_windows and then combine the result
    we'll follow up and answer on stack overflow as well. thanks for posting
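The pattern described above (same feature definition, one calculation per window, results combined) can be illustrated without Featuretools. The sketch below is plain Python on made-up data: `sum_in_window`, the transaction list, and the output column names are all hypothetical, and the window boundary convention is an assumption of this example rather than a statement about how `training_window` behaves.

```python
from datetime import datetime, timedelta

# Hypothetical transactions for a single customer: (timestamp, amount).
transactions = [
    (datetime(2018, 1, 15), 100.0),
    (datetime(2018, 4, 10), 50.0),
    (datetime(2018, 7, 1), 25.0),
    (datetime(2018, 8, 1), 10.0),
]


def sum_in_window(rows, cutoff, window):
    """Sum the amounts whose timestamp satisfies cutoff - window < ts <= cutoff."""
    start = cutoff - window
    return sum(amount for ts, amount in rows if start < ts <= cutoff)


cutoff = datetime(2018, 8, 15)
windows_in_days = [90, 180, 270]

# One calculation per window with the same feature definition,
# combined under distinct names so the columns do not collide.
combined = {
    f"SUM(amount)_last_{days}d": sum_in_window(transactions, cutoff, timedelta(days=days))
    for days in windows_in_days
}

print(combined)
```

With feature matrices instead of single values, the combining step would amount to renaming the columns of each result per window and joining them on the instance index.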
    tomasgreif
    @tomasgreif
    I see, thank you. Would that be something you would consider adding? In the area I am working in (financial services, credit scoring), having different time windows is one of the most typical feature engineering tasks.
    Max Kanter
    @kmax12
    @tomasgreif it is something we'd consider adding!
    Maximilian Christ
    @MaxBenChrist
    hi featuretools team. your featuretools library looks great!
    unfortunately, I have been working solely on time series for the last few years, so I did not have a dataset to try it out
    now, I am the maintainer of tsfresh (https://github.com/blue-yonder/tsfresh), we also perform feature extraction, but on time series instead of relational data
    but maybe we can embed tsfresh into featuretools in some way? I worked on Data Science problems where I had to process time series and relational data at the same time. A fully automated feature extraction framework would have helped a lot
    Max Kanter
    @kmax12
    Hey @MaxBenChrist ! Thanks for the kind words. tsfresh is an interesting library as well. We actually have experimented with embedding tsfresh into featuretools by using custom primitives. You can see an example of that in this notebook: https://github.com/Featuretools/predict-remaining-useful-life/blob/master/Advanced%20Featuretools%20RUL.ipynb
    what are your thoughts on the best way to integrate the two libraries?
    Maximilian Christ
    @MaxBenChrist
    @kmax12 yes, I saw that notebook. I still have to use the featuretools library on a few datasets over the weekend to get more experience with it.
    in any case, we could have a brainstorming session over skype sometime next week and discuss possible starting points for a collaboration?
    Max Kanter
    @kmax12
    yep, let's do that!
    dugland123
    @dugland123
    Hello - under what circumstances would an entity variable from EntitySet es and defined as index show up as id when listing es.entities? I'm trying to resolve the following warning: both an index level and a column label.
    Defaulting to column, but this will raise an ambiguity error in a future version
    end_entity_id=child_eid)
    Max Kanter
    @kmax12
    you can read my full response there, but the warning you are seeing is unrelated to the featuretools variable type and will go away in the next release of Featuretools.
    Max Kanter
    @kmax12
    let us know if you have any questions
    dugland123
    @dugland123
    Thank you.
    Silvio Normey Gómez
    @silviogn
    Hi everyone.
    I'm Silvio.. from Uruguay
    I have a question.
    Is Featuretools capable of building useful attributes from semi-structured data such as XML, JSON or RDF?
    Or is it necessary to convert the dataset to a tabular form?
    Max Kanter
    @kmax12
    @silviogn correct, you'd have to convert to tabular form to use featuretools.
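As a rough illustration of what "convert to tabular form" could look like for nested JSON, the sketch below splits each record into a parent row and child rows linked by an id. The JSON document, the field names, and the `to_tables` helper are all hypothetical; real data would need its own mapping, and the resulting row lists would still have to be loaded into dataframes before use.

```python
import json

# Hypothetical semi-structured input: customers with nested orders.
raw = """
[
  {"customer_id": 1, "name": "Ana",
   "orders": [{"order_id": 10, "amount": 25.0}, {"order_id": 11, "amount": 40.0}]},
  {"customer_id": 2, "name": "Luis", "orders": []}
]
"""


def to_tables(records):
    """Split nested records into two flat row lists: customers and orders.

    Each order row carries the customer_id of its parent so the two
    tables can be related to each other afterwards.
    """
    customers, orders = [], []
    for record in records:
        customers.append({
            "customer_id": record["customer_id"],
            "name": record["name"],
        })
        for order in record["orders"]:
            orders.append({
                "order_id": order["order_id"],
                "customer_id": record["customer_id"],
                "amount": order["amount"],
            })
    return customers, orders


customers, orders = to_tables(json.loads(raw))
print(customers)
print(orders)
```

Each nested list becomes its own table, with the parent key copied into every child row so the relationship survives the flattening.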
    Waco Holve
    @WacoHolve_gitlab

    Hi all,

    I've been using this tool the past few days and it has been great so far. I work heavily with financial data and noticed that when I'm creating my EntitySet if I have column names as dtype int I get a failure message.

    I was wondering if this is desired behavior that the column names need to be dtype str for the entity set to work.

    Thank you for the awesome product.
    Waco Holve

    Max Kanter
    @kmax12
    @WacoHolve_gitlab can you share the stack trace and some code to reproduce it?
    we'd like to support that since they are valid pandas column names
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 I'd be happy to. Please give me a bit to create a notebook with some example data.
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 I've attached a notebook with the csv. Thinking more about it now I should have just created some random data.
    Keaton Armentrout
    @ksarmentrout
    Hey all! I've been playing with the predict-appointment-noshow notebook tutorial for a bit, and I'm a bit confused by the output of the PERCENT_TRUE primitive. My understanding is that a column like locations.PERCENT_TRUE(appointments.sms_received) gives the percent of rows for which sms_received is True, given a single location. I'd expect that column to be the same for all rows of a single location, because that's what it was conditioned on, but I'm not finding that to be the case. Any ideas why?
    (sorry to interrupt current topic!)
    In that notebook, if I run fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe() I get:
    count 144.00
    mean 0.20
    std 0.09
    min 0.00
    25% 0.20
    50% 0.23
    75% 0.26
    max 0.31
    Name: locations.PERCENT_TRUE(appointments.sms_received), dtype: float64
    Max Kanter
    @kmax12
    @ksarmentrout that's a great question for stackoverflow. can you post it there with the featuretools tag? https://stackoverflow.com/questions/tagged/featuretools
    that way we can answer it and make sure it is searchable by other users
    @WacoHolve_gitlab thanks for the quick reply. i am running out of time today, but if i don't get to it, i'll reply tomorrow
    Keaton Armentrout
    @ksarmentrout
    @kmax12 Will do! I'll post the link here
    Max Kanter
    @kmax12
    :thumbsup:
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 no worries, I just wanted to put it out there. Appreciate the work you all are doing on this project.
    Max Kanter
    @kmax12
    @ksarmentrout thanks! We will answer tomorrow
    Seth Rothschild
    @Seth-Rothschild
    @WacoHolve_gitlab thanks for reporting this. I've posted it as an issue on the Featuretools repo #254