    Max Kanter
    @kmax12
    @ksarmentrout that's a great question for stackoverflow. can you post it there with the featuretools tag? https://stackoverflow.com/questions/tagged/featuretools
    that way we can answer it and make sure it is searchable by other users
    @WacoHolve_gitlab thanks for the quick reply. I am running out of time today, but if I don't get to it I'll reply tomorrow
    Keaton Armentrout
    @ksarmentrout
    @kmax12 Will do! I'll post the link here
    Max Kanter
    @kmax12
    :thumbsup:
    Waco Holve
    @WacoHolve_gitlab
    @kmax12 no worries, I just wanted to put it out there. Appreciate the work you all are doing on this project.
    Max Kanter
    @kmax12
    @ksarmentrout thanks! We will answer tomorrow
    Seth Rothschild
    @Seth-Rothschild
    @WacoHolve_gitlab thanks for reporting this. I've posted it as an issue on the Featuretools repo #254
    @WacoHolve_gitlab We're leaning towards option 1 or 2 listed there. Do you have a strong preference for one or the other?
    Waco Holve
    @WacoHolve_gitlab
    @Seth-Rothschild thank you for the quick work on this. I'd prefer option 1, though it isn't a strong preference.
    Tullsokk
    @Tullsokk
    Just discovered the featuretools package. Looks very interesting! I have been taking a similar, but a bit less automated, approach when creating features. One slight difference, which I so far haven't found in featuretools, is that I tend to aggregate the child data in time periods, e.g. the last 30 days before the event, as well as the 30-90 days before that, depending on the outcome, and create variables for the lag of a value, the difference between the lag value and the most recent value, the percentage difference, etc. This will e.g. give me a variable for the percentage change in minimum balance between the last two or more time periods before the event. As far as I can tell, featuretools will aggregate "all" the child elements in the dataset, without regard to splitting the data into time buckets, and thus not be able to create features that pick up on, say, a temporal change in the minimum balance of a customer. Have I overlooked something? I can see how I could force something like this onto the featuretools framework by feeding it several datasets from different time periods and then creating variables for the differences between the time periods, but is there some option to specify time periods, or even better, to let the algorithm find not just the most relevant aggregations (mean, max, etc.) but also the optimal length of the temporal aggregations (last 30 days? last 60? change between the last 3 weeks and the previous 6?)
    Max Kanter
    @kmax12
    @Tullsokk to aggregate over different time periods, you can pick a cutoff_time you'd like to create features at and a training_window which specifies how much historical data to use. So, you can create the different time period features you want by making multiple calls to ft.calculate_feature_matrix, one per window. you can read more about handling time here: https://docs.featuretools.com/automated_feature_engineering/handling_time.html
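    something like this (rough, untested sketch; assumes an entityset named es with a "customers" entity and a cutoff_times dataframe):

    import featuretools as ft

    # define the features once, then calculate them per training window
    features = ft.dfs(entityset=es, target_entity="customers", features_only=True)

    matrices = {}
    for window in ["30 days", "60 days"]:  # hypothetical window lengths
        matrices[window] = ft.calculate_feature_matrix(features,
                                                       entityset=es,
                                                       cutoff_time=cutoff_times,
                                                       training_window=window)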
    to create the lag features you want, you'd need to create a custom transform primitive. there is info on doing that here: https://docs.featuretools.com/automated_feature_engineering/primitives.html#defining-custom-primitives
    if you're interested in trying to make custom primitives, i'd love to find a time to talk!
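    for reference, a custom transform primitive looks roughly like this (untested sketch; the pct_change example is just an illustration):

    from featuretools.primitives import make_trans_primitive
    from featuretools.variable_types import Numeric

    # wrap pandas' pct_change as a custom transform primitive
    def pct_change(values):
        return values.pct_change()

    PctChange = make_trans_primitive(function=pct_change,
                                     input_types=[Numeric],
                                     return_type=Numeric)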
    geoHeil
    @geoHeil
    How are one-to-many relationships with multiple join keys represented in featuretools? Or should the join keys manually be concatenated into a single column?
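    e.g. something like this (hypothetical column names):

    # build one surrogate key out of two join key columns
    df["combined_key"] = df["key_a"].astype(str) + "_" + df["key_b"].astype(str)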
    Max Kanter
    @kmax12
    @geoHeil do you mind posting on stackoverflow so others can more easily find the answer in the future?
    geoHeil
    @geoHeil
    @kmax12 thanks for the clarification. One more question is still open: https://stackoverflow.com/questions/52463356/featuretools-categorical-handling, as it is unclear to me which type of categorical is better supported by featuretools
    Max Kanter
    @kmax12
    @geoHeil answered all your questions. feel free to keep posting questions as they come up. hope you're enjoying using featuretools!
    geoHeil
    @geoHeil
    Thanks, just getting started. I will experiment more over the next few days
    Tullsokk
    @Tullsokk
    @kmax12 Sorry for the late reply, didn't see it before the weekend. I will take a look at the links you sent me and get back to you
    geoHeil
    @geoHeil

    @kmax12 how should

    Unable to add relationship because dwhvid_anonym in metadata is Pandas dtype category and dwhvid_anonym in transactions is Pandas dtype category

    be handled? Should the join keys be kept as strings?

    Max Kanter
    @kmax12
    @geoHeil that's a tough error to understand, given the dtypes look equal in the error message. are you able to share the data you are using?
    actually, I believe I understand the error. can you put it on Stack Overflow and I'll answer there?
    Max Kanter
    @kmax12

    the error essentially comes down to the categories being different between the categorical variables you are trying to relate. See this code example:

    import pandas as pd
    from pandas.api.types import is_dtype_equal
    
    s = pd.Series(["a","b","a"], dtype="category")
    s2 = pd.Series(["b","b","a"], dtype="category")
    s3 = pd.Series(["a","b","c"], dtype="category")
    
    is_dtype_equal(s.dtype, s2.dtype) # this is True
    is_dtype_equal(s.dtype, s3.dtype) # this is False

    You need to update your dataframe before loading it into Featuretools to make sure the pandas Categoricals have the same category values. Here's how you do that:

    # if s is missing categories that s3 has, cast s to s3's dtype
    new_s = s.astype(s3.dtype)
    is_dtype_equal(new_s.dtype, s3.dtype) # this is True

    # if each is missing categories from the other, build a shared dtype
    # from the union of both category sets
    from pandas.api.types import CategoricalDtype
    s4 = pd.Series(["b","c"], dtype="category")
    categories = s.dtype.categories.union(s4.dtype.categories)
    shared_dtype = CategoricalDtype(categories=categories)
    new_s = s.astype(shared_dtype)
    new_s4 = s4.astype(shared_dtype)
    is_dtype_equal(new_s.dtype, new_s4.dtype) # this is True

    please also post on SO where I can give a more detailed answer for everyone else

    geoHeil
    @geoHeil
    @kmax12 no, it is not really that tough. It means that if both columns are of pandas Categorical type, featuretools can't ensure that a join works: if it joined via the category codes, the join would fail, since the codes are not necessarily the same in both datasets (pandas infers the categories when reading each file)
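    a quick illustration (made-up values):

    import pandas as pd

    a = pd.Series(["a", "b", "a"], dtype="category")
    b = pd.Series(["b", "c"], dtype="category")
    # "b" has code 1 in a but code 0 in b, so joining on the raw
    # codes instead of the labels would mismatch rows
    print(dict(zip(a, a.cat.codes)))  # {'a': 0, 'b': 1}
    print(dict(zip(b, b.cat.codes)))  # {'b': 0, 'c': 1}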
    Max Kanter
    @kmax12
    @geoHeil ya, i understand the error now.
    Max Kanter
    @kmax12
    @geoHeil answered on SO. let me know if that helps
    geoHeil
    @geoHeil
    @kmax12 Initially I just used plain strings for the join, but forcing the same categories is probably a more efficient idea
    Max Kanter
    @kmax12
    ya, I think that is the more memory-efficient approach. how big is the dataset?
    might not matter that much
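    you can check the difference yourself with something like this (illustrative sketch; the numbers depend on your data):

    import numpy as np
    import pandas as pd

    # made-up data: a million rows drawn from 1,000 distinct keys
    keys = np.random.choice(["cust_%d" % i for i in range(1000)], 1_000_000)
    as_str = pd.Series(keys)
    as_cat = as_str.astype("category")
    print(as_str.memory_usage(deep=True))  # object strings: tens of MB
    print(as_cat.memory_usage(deep=True))  # integer codes + 1,000 labels: much less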
    geoHeil
    @geoHeil
    no, not really. Decompressed, it's about 3 GB of CSV files
    Max Kanter
    @kmax12
    @Tullsokk @geoHeil I just answered this question about using multiple training windows on Stack Overflow. Hopefully, it's helpful for you two: https://stackoverflow.com/questions/52472930/featuretools-multiple-cutoff-times-aggregation
    geoHeil
    @geoHeil
    How can I invoke trans_primitives in addition to agg_primitives?
    Max Kanter
    @kmax12
    @geoHeil sorry, can you try rephrasing the question? I don't understand
    geoHeil
    @geoHeil
    it looks like only SUM and MEAN columns (agg columns) are generated by feature synthesis, but none of the trans primitives.
    Max Kanter
    @kmax12
    what are you passing for the trans_primitives argument?
    geoHeil
    @geoHeil
    trs_primitives = ['percentile', 'year', 'days', 'diff', 'negate', 'month', 'cum_max',
                      'divide', 'days_since', 'week', 'time_since_previous',
                      'cum_mean', 'minute', 'weekday', 'or', 'isin', 'weeks', 'weekend']
    Max Kanter
    @kmax12
    can you share the repr of your entityset?
    you may need to increase max depth
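    e.g. something along these lines (untested sketch; assumes an entityset named es with a "customers" target entity):

    import featuretools as ft

    fm, features = ft.dfs(entityset=es,
                          target_entity="customers",
                          agg_primitives=["sum", "mean"],
                          trans_primitives=trs_primitives,  # the list above
                          max_depth=2)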
    geoHeil
    @geoHeil
    I will try this and report tomorrow
    geoHeil
    @geoHeil
    a max_depth of 2 does not seem to finish the calculation, even if only 20/50 records are passed to the relationships
    Max Kanter
    @kmax12
    @geoHeil can you provide some details about your entityset? You can do print(your_entityset_object) and copy the results of that here or in a direct message to me
    geoHeil
    @geoHeil
    @kmax12 unfortunately I fear this will not be possible due to an NDA ...