    Max Kanter
    @kmax12
    you can read my full response there, but the warning you are seeing is unrelated to the featuretools variable type and will go away in the next release of Featuretools.
    Max Kanter
    @kmax12
    let us know if you have any questions
    dugland123
    @dugland123
    Thank you.
    Silvio Normey Gómez
    @silviogn
    Hi everyone.
    I'm Silvio, from Uruguay.
    I have a question.
    Is Featuretools capable of building useful attributes from semi-structured data such as XML, JSON, or RDF?
    Or is it necessary to convert the dataset to a tabular form first?
    Max Kanter
    @kmax12
    @silviogn correct, you'd have to convert to tabular form to use featuretools.
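    A minimal sketch of that conversion for the JSON case, using pandas' json_normalize to flatten nested records into a table (made-up data; XML or RDF would need their own parsers first):

    ```python
    import pandas as pd

    # Hypothetical semi-structured records
    records = [
        {"id": 1, "user": {"name": "ana", "country": "UY"}, "tags": ["a", "b"]},
        {"id": 2, "user": {"name": "bob", "country": "BR"}, "tags": ["c"]},
    ]

    # json_normalize flattens nested dicts into columns (lists stay as-is)
    flat = pd.json_normalize(records, sep="_")
    print(sorted(flat.columns))  # ['id', 'tags', 'user_country', 'user_name']
    ```

    The resulting flat DataFrame can then be added to an EntitySet like any other table.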
    Waco Holve
    @WacoHolve_gitlab

    Hi all,

    I've been using this tool for the past few days and it has been great so far. I work heavily with financial data and noticed that, when creating my EntitySet, if I have column names of dtype int I get a failure message.

    I was wondering if it is intended behavior that the column names need to be dtype str for the entity set to work.

    Thank you for the awesome product.
    Waco Holve

    Max Kanter
    @kmax12
    @WacoHolve_gitlab can you share the stack trace and some code to reproduce it?
    we'd like to support that since they are valid pandas column names
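    For anyone hitting the same failure in the meantime, a simple workaround is to cast the column labels to strings before building the EntitySet (a sketch with made-up data):

    ```python
    import pandas as pd

    # DataFrame with integer column names, which are valid in pandas
    df = pd.DataFrame({0: [1, 2], 1: [3, 4]})

    # Cast the column labels to strings before passing the frame to featuretools
    df.columns = df.columns.astype(str)
    print(df.columns.tolist())  # ['0', '1']
    ```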
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 I'd be happy to. Please give me a bit to create a notebook with some example data.
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 I've attached a notebook with the csv. Thinking more about it now, I should have just created some random data.
    Keaton Armentrout
    @ksarmentrout
    Hey all! I've been playing with the predict-appointment-noshow notebook tutorial for a bit, and I'm a bit confused by the output of the PERCENT_TRUE primitive. My understanding is that a column like locations.PERCENT_TRUE(appointments.sms_received) gives the percent of rows for which sms_received is True, given a single location. I'd expect that column to be the same for all rows of a single location, because that's what it was conditioned on, but I'm not finding that to be the case. Any ideas why?
    (sorry to interrupt current topic!)
    In that notebook, if I run fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe() I get:
    count 144.00
    mean 0.20
    std 0.09
    min 0.00
    25% 0.20
    50% 0.23
    75% 0.26
    max 0.31
    Name: locations.PERCENT_TRUE(appointments.sms_received), dtype: float64
    Max Kanter
    @kmax12
    @ksarmentrout that's a great question for stackoverflow. can you post it there with the featuretools tag? https://stackoverflow.com/questions/tagged/featuretools
    that way we can answer it and make sure it is searchable by other users
    @WacoHolve_gitlab thanks for the quick reply. I'm running out of time today, but if I don't get to it I'll reply tomorrow.
    Keaton Armentrout
    @ksarmentrout
    @kmax12 Will do! I'll post the link here
    Max Kanter
    @kmax12
    :thumbsup:
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 no worries, I just wanted to put it out there. Appreciate the work you all are doing on this project.
    Max Kanter
    @kmax12
    @ksarmentrout thanks! We will answer tomorrow
    Seth Rothschild
    @Seth-Rothschild
    @WacoHolve_gitlab thanks for reporting this. I've posted it as an issue on the Featuretools repo #254
    @WacoHolve_gitlab We're leaning towards option 1 or 2 listed there. Do you have a strong preference for one or the other?
    Waco Holve
    @WacoHolve_gitlab
    @Seth-Rothschild thank you for the quick work on this. I'd prefer option 1; however, it isn't a strong preference.
    Tullsokk
    @Tullsokk
    Just discovered the featuretools package. Looks very interesting! I have been taking a similar, but a bit less automated, approach when creating features. One slight difference, which I so far haven't found in featuretools, is that I tend to aggregate the child data in time periods, e.g. the last 30 days before the event as well as the 30-90 days previous, depending on the outcome, and create variables for the lag of a value, the difference between the lag value and the most recent value, the percentage difference, etc. This will e.g. give me a variable for the percentage change in minimum balance between the last two or more time periods before the event.
    As far as I can tell, featuretools will aggregate "all" the child elements in the dataset, without regard to splitting the data into time buckets, and thus cannot create features that pick up on, say, a temporal change in the minimum balance of a customer. Have I overlooked something? I can see how I could force something like this onto the featuretools framework by feeding it several datasets from different time periods and then creating variables for the differences between the time periods, but is there some option to specify time periods, or even better, to let the algorithm find not just the most relevant aggregations (mean, max, etc.) but also the optimal length of the temporal aggregations (last 30 days? last 60? change between the last 3 weeks and the previous 6?)
    Max Kanter
    @kmax12
    @Tullsokk to aggregate over different time periods, you can pick a cutoff_time at which you'd like to create features and a training_window which specifies how much historical data to use. So you can create the different time period features you want by making multiple calls to ft.calculate_feature_matrix, one for each window. You can read more about handling time here: https://docs.featuretools.com/automated_feature_engineering/handling_time.html
    to create the lag features you want, you'd need to create a custom transform primitive. there is info on doing that here: https://docs.featuretools.com/automated_feature_engineering/primitives.html#defining-custom-primitives
    if you're interested in trying to make custom primitives, i'd love to find a time to talk!
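    As an illustration of what a cutoff time plus training window computes, here is a rough pandas equivalent on made-up transaction data (featuretools applies this kind of filter internally when training_window is passed):

    ```python
    import pandas as pd

    # Hypothetical transactions table (the child entity)
    tx = pd.DataFrame({
        "customer_id": [1, 1, 1, 2],
        "time": pd.to_datetime(["2018-01-05", "2018-02-10", "2018-03-01", "2018-02-20"]),
        "balance": [100.0, 80.0, 120.0, 50.0],
    })

    cutoff = pd.Timestamp("2018-03-15")
    window = pd.Timedelta("30 days")

    # Keep only the rows inside the training window before the cutoff time
    in_window = tx[(tx["time"] > cutoff - window) & (tx["time"] <= cutoff)]

    # Aggregate per customer, e.g. the minimum balance over the last 30 days
    min_balance = in_window.groupby("customer_id")["balance"].min()
    print(min_balance.to_dict())  # {1: 120.0, 2: 50.0}
    ```

    Repeating this with a different window (say, 30-90 days before the cutoff) and differencing the two aggregates gives the kind of temporal-change feature described above.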
    geoHeil
    @geoHeil
    How are one to many relationships with multiple join keys represented in featuretools? Or should the join key manually be concatenated into a single column?
    Max Kanter
    @kmax12
    @geoHeil do you mind posting on stackoverflow so others can more easily find the answer in the future?
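    One common workaround, on the assumption that relationships must be defined on a single index column per entity, is to build a surrogate key by concatenating the composite key in both tables (a sketch; data and column names are made up):

    ```python
    import pandas as pd

    parent = pd.DataFrame({"region": ["eu", "eu"], "store": [1, 2], "name": ["A", "B"]})
    child = pd.DataFrame({"region": ["eu", "eu", "eu"], "store": [1, 1, 2], "amount": [5, 7, 9]})

    # Build a single surrogate key from the composite (region, store) key
    for df in (parent, child):
        df["region_store"] = df["region"].astype(str) + "_" + df["store"].astype(str)

    print(child["region_store"].tolist())  # ['eu_1', 'eu_1', 'eu_2']
    ```

    The relationship can then be defined on the single region_store column.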
    geoHeil
    @geoHeil
    @kmax12 thanks for the clarification. One more question is open https://stackoverflow.com/questions/52463356/featuretools-categorical-handling as it is unclear to me which type of categorical is better supported by featuretools
    Max Kanter
    @kmax12
    @geoHeil answered all your questions. feel free to keep posting questions as they come up. hope you're enjoying using featuretools!
    geoHeil
    @geoHeil
    Thanks, just getting started. I will experiment more over the next few days.
    Tullsokk
    @Tullsokk
    @kmax12 Sorry for the late reply, I didn't see it before the weekend. I will take a look at the links you sent me and get back to you.
    geoHeil
    @geoHeil

    @kmax12 how should

    Unable to add relationship because dwhvid_anonym in metadata is Pandas dtype category and dwhvid_anonym in transactions is Pandas dtype category

    be handled? Should the join keys be kept as strings?

    Max Kanter
    @kmax12
    @geoHeil that's a tough error to understand, given the dtypes are equal in the error message. Are you able to share the data you are using?
    Actually, I believe I understand the error. Can you put it on Stack Overflow and I'll answer there?
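    One workaround to try while waiting for the answer, on the assumption that the two category columns carry different category sets, is to cast the join key in both frames back to plain strings before adding the relationship (a sketch with made-up data, using the dwhvid_anonym column from the error message):

    ```python
    import pandas as pd

    metadata = pd.DataFrame({"dwhvid_anonym": pd.Categorical(["a", "b"])})
    transactions = pd.DataFrame({"dwhvid_anonym": pd.Categorical(["a", "a", "b"])})

    # Two 'category' columns can still carry different category sets;
    # casting both join keys to plain strings sidesteps the mismatch
    for df in (metadata, transactions):
        df["dwhvid_anonym"] = df["dwhvid_anonym"].astype(str)

    print(transactions["dwhvid_anonym"].dtype)  # object
    ```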