    geoHeil
    @geoHeil
    How are one-to-many relationships with multiple join keys represented in Featuretools? Or should the join keys be manually concatenated into a single column?
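
    A minimal sketch of the concatenation workaround the question mentions, using a hypothetical orders/items pair of tables; none of these names come from the chat. In the 0.x API a Relationship links a single parent column to a single child column, so a composite key is typically collapsed into one surrogate column first.

    import featuretools as ft
    import pandas as pd
    
    # hypothetical parent/child frames sharing a composite key (region, order_id)
    orders = pd.DataFrame({"region": ["EU", "US"],
                           "order_id": [1, 1],
                           "total": [10.0, 20.0]})
    items = pd.DataFrame({"region": ["EU", "EU", "US"],
                          "order_id": [1, 1, 1],
                          "amount": [4.0, 6.0, 20.0]})
    
    # collapse the composite key into a single surrogate join column
    orders["order_key"] = orders["region"] + "_" + orders["order_id"].astype(str)
    items["order_key"] = items["region"] + "_" + items["order_id"].astype(str)
    
    es = ft.EntitySet(id="shop")
    es = es.entity_from_dataframe(entity_id="orders", dataframe=orders, index="order_key")
    es = es.entity_from_dataframe(entity_id="items", dataframe=items,
                                  index="item_id", make_index=True)
    es = es.add_relationship(ft.Relationship(es["orders"]["order_key"],
                                             es["items"]["order_key"]))
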
    Max Kanter
    @kmax12
    @geoHeil do you mind posting on stackoverflow so others can more easily find the answer in the future?
    geoHeil
    @geoHeil
    @kmax12 thanks for the clarification. One more question is still open: https://stackoverflow.com/questions/52463356/featuretools-categorical-handling, as it is unclear to me which type of categorical is better supported by Featuretools
    Max Kanter
    @kmax12
    @geoHeil answered all your questions. feel free to keep posting questions as they come up. hope you're enjoying using featuretools!
    geoHeil
    @geoHeil
    Thanks, just getting started. I will experiment more over the next few days
    Tullsokk
    @Tullsokk
    @kmax12 Sorry for the late reply, didn't see it before the weekend. I will take a look at the links you sent me and get back to you
    geoHeil
    @geoHeil

    @kmax12 How should the following error be handled? Should the join keys be kept as strings?

    Unable to add relationship because dwhvid_anonym in metadata is Pandas dtype category and dwhvid_anonym in transactions is Pandas dtype category

    Max Kanter
    @kmax12
    @geoHeil that's a tough error to understand, given the dtypes are equal in the error message. are you able to share the data you are using?
    actually, I believe I understand the error. can you put it on Stack Overflow and I'll answer there?
    Max Kanter
    @kmax12

    the error essentially comes down to the categories being different between the categorical variables you are trying to relate. See this code example

    import pandas as pd
    from pandas.api.types import is_dtype_equal
    
    s = pd.Series(["a","b","a"], dtype="category")
    s2 = pd.Series(["b","b","a"], dtype="category")
    s3 = pd.Series(["a","b","c"], dtype="category")
    
    is_dtype_equal(s.dtype, s2.dtype) # this is True
    is_dtype_equal(s.dtype, s3.dtype) # this is False

    You need to update your dataframe before loading it into Featuretools to make sure the Pandas Categoricals have the same category values. Here's how you do that

    # if s is missing categories that s3 has
    new_s = s.astype(s3.dtype)
    is_dtype_equal(new_s.dtype, s3.dtype) # this is True
    
    # if both are missing categories from each other
    import pandas.api.types as pdtypes
    s4 = pd.Series(["b","c"], dtype="category")
    categories = s.dtype.categories.union(s4.dtype.categories)
    shared_dtype = pdtypes.CategoricalDtype(categories=categories)
    new_s = s.astype(shared_dtype)
    new_s4 = s4.astype(shared_dtype)
    is_dtype_equal(new_s.dtype, new_s4.dtype) # this is True

    please also post on SO where I can give a more detailed answer for everyone else

    geoHeil
    @geoHeil
    @kmax12 no, it is not really that tough. It means that if both join keys are of pandas Categorical type it can't ensure that a join works, i.e. if it joined via the category codes it would fail, as the codes are not necessarily the same in both datasets (pandas infers the categories independently when reading each file)
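
    A quick illustration of that point with plain pandas: the same label can get a different integer code in two independently inferred Categoricals, so relating them by codes alone would mismatch rows.

    import pandas as pd
    
    # each Categorical infers its own categories, so codes are not comparable across columns
    left = pd.Series(["b", "a", "b"], dtype="category")   # categories ['a', 'b']
    right = pd.Series(["b", "c", "b"], dtype="category")  # categories ['b', 'c']
    
    print(left.cat.codes.tolist())   # [1, 0, 1] -> 'b' is code 1 here
    print(right.cat.codes.tolist())  # [0, 1, 0] -> 'b' is code 0 here
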
    Max Kanter
    @kmax12
    @geoHeil ya, i understand the error now.
    Max Kanter
    @kmax12
    @geoHeil answered on SO. let me know if that helps
    geoHeil
    @geoHeil
    @kmax12 Initially I just used plain strings for the join - but forcing the same categories is probably a more efficient idea
    Max Kanter
    @kmax12
    ya, i think that is the more memory efficient approach. how big is the dataset?
    might not matter that much
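
    A rough sketch of the memory comparison being discussed; the column size and key format below are made up for illustration, not taken from the chat.

    import numpy as np
    import pandas as pd
    
    # a hypothetical join-key column with many repeated string values
    keys = pd.Series(np.random.choice(["cust_%d" % i for i in range(1000)], size=1_000_000))
    
    print(keys.memory_usage(deep=True))                     # stored as Python strings (object dtype)
    print(keys.astype("category").memory_usage(deep=True))  # typically much smaller as a Categorical
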
    geoHeil
    @geoHeil
    No, not really. Decompressed, it's about 3 GB of CSV files
    Max Kanter
    @kmax12
    @Tullsokk @geoHeil I just answered this question about using multiple training windows on Stack Overflow. Hopefully, it's helpful for you two: https://stackoverflow.com/questions/52472930/featuretools-multiple-cutoff-times-aggregation
    geoHeil
    @geoHeil
    How can I, in addition to agg_primitives, also invoke trans_primitives?
    Max Kanter
    @kmax12
    @geoHeil sorry, can you try rephrasing the question? I don't understand
    geoHeil
    @geoHeil
    it looks like only SUM and MEAN columns (agg columns) are generated by feature synthesis, but none of the trans primitives.
    Max Kanter
    @kmax12
    what are you passing for the trans_primitives argument?
    geoHeil
    @geoHeil
    trs_primitives = ['percentile', 'year', 'days', 'diff', 'negate', 'month', 'cum_max',
                      'divide', 'days_since', 'week', 'time_since_previous',
                      'cum_mean', 'minute', 'weekday', 'or', 'isin', 'weeks', 'weekend']
    Max Kanter
    @kmax12
    can you share the repr of your entityset?
    you may need to increase max depth
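
    A minimal sketch of the call being discussed, against a tiny made-up entityset (the "customers"/"transactions" names are placeholders, not from the chat); both primitive lists are passed to ft.dfs, and a max_depth of at least 2 is what lets transform features on the child entity feed into aggregations on the target.

    import pandas as pd
    import featuretools as ft
    
    # tiny made-up entityset
    customers = pd.DataFrame({"customer_id": [1, 2]})
    transactions = pd.DataFrame({
        "transaction_id": [1, 2, 3, 4],
        "customer_id": [1, 1, 2, 2],
        "amount": [10.0, 20.0, 5.0, 7.5],
        "time": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-01-15", "2018-03-01"]),
    })
    
    es = ft.EntitySet(id="example")
    es = es.entity_from_dataframe(entity_id="customers", dataframe=customers,
                                  index="customer_id")
    es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions,
                                  index="transaction_id", time_index="time")
    es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                             es["transactions"]["customer_id"]))
    
    # depth 2 allows features like MEAN(transactions.PERCENTILE(amount))
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",
        agg_primitives=["sum", "mean"],
        trans_primitives=["month", "percentile"],
        max_depth=2,
    )
    print(feature_defs)
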
    geoHeil
    @geoHeil
    I will try this and report tomorrow
    geoHeil
    @geoHeil
    a max_depth of 2 does not seem to finish the calculation, even if only 20 / 50 records are passed to the relationships
    Max Kanter
    @kmax12
    @geoHeil can you provide some details about your entityset? You can do print(your_entityset_object) and copy the results of that here or in a direct message to me
    geoHeil
    @geoHeil
    @kmax12 unfortunately I fear this will not be possible due to an NDA ...
    Fabio Votta
    @favstats

    Hi everyone! I love featuretools and the idea of automatically engineering features. Unfortunately I can't seem to add interesting variables and I would be happy if someone could help out :)

    I suspect that it has something to do with my data because I can reproduce the example in the docs just fine..

    https://stackoverflow.com/questions/52673694/specifying-interesting-variables-with-featuretools-does-not-work

    Maybe this is an easy question for Pythonistas.. I am an ardent R user so maybe there is something I am just not seeing.
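
    For reference, a sketch of the interesting-values pattern from the docs that the linked question is following, against a tiny made-up entityset (the "sessions"/"device" names are placeholders, not taken from the question).

    import pandas as pd
    import featuretools as ft
    
    # tiny made-up entityset with a categorical "device" column on sessions
    customers = pd.DataFrame({"customer_id": [1, 2]})
    sessions = pd.DataFrame({
        "session_id": [1, 2, 3, 4],
        "customer_id": [1, 1, 2, 2],
        "device": ["desktop", "mobile", "mobile", "desktop"],
    })
    
    es = ft.EntitySet(id="example")
    es = es.entity_from_dataframe(entity_id="customers", dataframe=customers,
                                  index="customer_id")
    es = es.entity_from_dataframe(entity_id="sessions", dataframe=sessions,
                                  index="session_id")
    es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                             es["sessions"]["customer_id"]))
    
    # in the 0.x API, interesting values are set directly on the variable ...
    es["sessions"]["device"].interesting_values = ["desktop", "mobile"]
    
    # ... and where_primitives selects which primitives also get conditional
    # "WHERE device = ..." variants
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count"],
    )
    print(feature_defs)
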
    Max Kanter
    @kmax12
    @favstats thanks for posting. we're taking a look and will put up an answer shortly.
    Fabio Votta
    @favstats
    Thanks a lot, really! :) Saw your comment on the initial post and I accidentally deleted the whole post when I wanted to edit it, sorry for that.
    It's not high priority though, just something that I couldn't figure out. Enjoy your Sunday everyone :)
    Max Kanter
    @kmax12
    Happy to help! Will ping you here once I have an answer.
    Fabio Votta
    @favstats
    @kmax12 works well for me :)
    Max Kanter
    @kmax12

    @favstats This looks like incorrect behavior, thanks for sharing it with us. I just made a fix for it on a branch. Can you try installing that branch of featuretools and running your code again? You can install that branch using pip with this command

    pip install -e git+https://github.com/featuretools/featuretools.git@interesting-values-direct-features#egg=featuretools

    Let us know if it helps!

    there's also a github pull request here if you'd like to comment there: Featuretools/featuretools#279
    Fabio Votta
    @favstats
    Oh wow, I was certain that the issue would be on my part. I'll try this out immediately. Thanks!
    Fabio Votta
    @favstats
    This worked perfectly! Thank you so much!
    Fabio Votta
    @favstats

    @kmax12 one thing I just noticed.. I mistyped the value name at first and it gave me back only NaN values, which makes sense since it can't match the arguments. However, I wonder if this is intended behaviour or if it should say something like "value not found" and throw an error. Just thinking out loud :)

    Anyway, thank you again for answering this so fast! :)

    Max Kanter
    @kmax12
    ya, that is correct behavior. it's still a valid feature even if the value is NaN for your particular data
    Fabio Votta
    @favstats
    alright :) great
    Max Kanter
    @kmax12
    happy to help! let us know if you have any other questions