    Charlene Chambliss
    @blissfulchar_twitter
    @pesobreiro This is a nice idea! And definitely in line with how we're currently asking retention questions at my co. Did you do that by predicting the survival function for each individual and then multiplying by customer LTV accordingly?
    Pedro Sobreiro
    @pesobreiro
    @blissfulchar_twitter we used the survival probabilities under each curve (cohort) and the monthly payment to calculate CLV. We didn't use individual customers, but customers grouped into the survival curves. This option has some limitations, but it gives us an idea of an estimated CLV. What you suggest sounds very interesting. I think there are other approaches to calculate predictions of individual CLV.
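    A minimal sketch of that cohort-level calculation (the flat monthly payment, the 36-month horizon, and the function and column names here are assumptions for illustration, not the actual setup):

    from lifelines import KaplanMeierFitter

    def cohort_clv(durations, observed, monthly_payment, horizon_months=36):
        """Rough CLV for one cohort: the monthly payment weighted by each month's survival probability."""
        kmf = KaplanMeierFitter()
        kmf.fit(durations, event_observed=observed)
        survival_probs = [kmf.predict(month) for month in range(1, horizon_months + 1)]
        return monthly_payment * sum(survival_probs)

    # e.g. cohort_clv(cohort['tenure'], cohort['churned'], monthly_payment=30.0)
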
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :wave: minor lifelines release. Important thing is that scipy 1.3 can be used with it now: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.2
    Charlene Chambliss
    @blissfulchar_twitter
    Thanks @pesobreiro, this is helpful :)
    Alessandro Nesti
    @aleva85
    Hi, what is the best way to retrieve the log likelihood of a fit? It is shown via 'model.print_summary()' but not via 'model.summary', which only shows a summary of the parameters.
    I managed to get it via model._log_likelihood, but had to look into the source code for that.
    Thanks and kudos for the library!
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hi @aleva85, that currently is the best way, but you bring up a good observation that it’s not easy to find. Maybe in a future release I’ll promote it and document it well
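    For anyone searching later, a small sketch of the workaround mentioned above (the private attribute name is taken from the message itself and may be renamed or promoted in later releases):

    from lifelines import WeibullFitter
    from lifelines.datasets import load_waltons

    df = load_waltons()
    model = WeibullFitter().fit(df['T'], event_observed=df['E'])

    model.print_summary()          # shows the log-likelihood among the fit statistics
    ll = model._log_likelihood     # private attribute; not part of the documented API
    print(ll)
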
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    FYI I've been playing around with pure-python & autograd neural nets for better prediction. I may make this its own experimental package. I need a good name for it, though
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    lifelike, lifenets?
    Rob deCarvalho
    @robdmc
    @CamDavidsonPilon maybe a plugin-like architecture? Different project that can attach itself to lifelines hooks
    Diego S
    @diego-s
    Hi, sorry, I was wondering if any maintainers/users of lifelines based in Europe would like to do a tutorial/workshop or talk about this package at our conference Python in Pharma (PyPharma) in Basel? Apologies for the unrequested advertisement; I will delete it if this is a problem. The conference is free to attend (by invitation) and 100% volunteer run. It will take place on November 21-22, and our target is 100-150 attendees. We would really be happy if lifelines were represented at this event.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    ^ No need to apologize, this message is welcome here. I would love to join; hopefully someone can take my place. There have been a few European speakers on lifelines already: Linda Uruchurtu, Lorna Brightmore and Elena Sharova have all given talks on lifelines in the past few years. You can search for their videos online.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    They may be interested in coming? Unfortunately I can't make an introduction, as I don't know them personally
    Unrelated: :wave: new minor (but important) version of lifelines released: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.3
    Diego S
    @diego-s
    @CamDavidsonPilon Thanks for the suggestions! I will look them up. :)
    harish ramani
    @linkinnation_twitter
    @julianspaeth I don't know if it's too late, but check out pysurvival for conditional forest models. It might be similar to what you're asking for.
    @julianspaeth The only issue with pysurvival is that its support isn't as good as lifelines'.
    Julian Späth
    @julianspaeth
    @linkinnation_twitter Thank you, I also found the implementation of Random Survival Forest in pysurvival. However, I've implemented my own forest now, which had the advantage of helping me understand the theory behind the whole process better.
    Sandu Ursu
    @sursu
    Is it possible to estimate the baseline hazard function with lifelines?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Sure is.
    After you fit a CoxPHFitter, try .baseline_cumulativehazard
    ack that formatted wrong sorry
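    A short sketch of the attribute being referred to, assuming a recent lifelines release where it is spelled baseline_cumulative_hazard_ (the Rossi dataset is just a placeholder):

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()
    cph = CoxPHFitter().fit(df, duration_col='week', event_col='arrest')

    # a DataFrame indexed by time, holding the baseline cumulative hazard
    print(cph.baseline_cumulative_hazard_.head())
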
    Sandu Ursu
    @sursu
    I mean in the case of Time Varying Survival Regression. I'm seeing CoxTimeVaryingFitter. Will look at the reference now.
    hgfabc
    @hgfabc
    Hi, I'm using a loop to create 65,000 KM-curve images for a certain dataset, and I'm using the kmf.plot() function for that. However, there's always an error, "fail to allocate bitmap", saying there are too many figures open. How can I fix that? Thanks
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hi @hgfabc, it sounds like you are pushing the limits of matplotlib. What inference are you doing where you need to visualize 65k curves?
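    One common fix, assuming the loop looks roughly like the sketch below (the synthetic `groups` data is a stand-in for the real dataset), is to close each figure after saving it so matplotlib doesn't keep all 65,000 figures alive:

    import numpy as np
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend: figures are only written to disk
    import matplotlib.pyplot as plt
    from lifelines import KaplanMeierFitter

    # synthetic stand-in for the real per-group durations and event flags
    rng = np.random.default_rng(0)
    groups = [(rng.exponential(10, size=50), rng.integers(0, 2, size=50)) for _ in range(100)]

    kmf = KaplanMeierFitter()
    for i, (durations, observed) in enumerate(groups):
        fig, ax = plt.subplots()
        kmf.fit(durations, event_observed=observed)
        kmf.plot(ax=ax)
        fig.savefig(f'km_curve_{i}.png')
        plt.close(fig)  # release the figure; otherwise matplotlib accumulates open figures
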
    datablade
    @data-blade

    i just raised your NotImplementedError for conditional_after for CoxPH. it felt like running into a wall after reading about the new argument in the docs. i even bit the bullet and switched from the conda to the pip package ;)
    your commit message does not sound too hopeful for that one, are you still working on it?

    ps: still an awesome library

    Gareth Brown
    @gabrown

    Hi @CamDavidsonPilon, I am new to survival analysis and am using it to try to predict customer churn. I created a model using CoxPHFitter and wanted to evaluate how well the model performed by comparing the survival after 12 months (using the corresponding row from the predict_survival_function output) to the observed churn rate (1 - survival rate). I noticed that it consistently gives a higher survival rate than the actual one (~10%).
    I pared back the model so that it was only based on the baseline hazard (passed no extra variables) and I still get a difference in survival rates.
    I have tried this on open data, and can reproduce the result:

    import lifelines
    import numpy as np
    import pandas as pd
    churn_data = pd.read_csv('https://raw.githubusercontent.com/'
                             'treselle-systems/customer_churn_analysis/'
                             'master/WA_Fn-UseC_-Telco-Customer-Churn.csv')
    event_col = 'Churn'
    duration_col = 'tenure'
    churn_data[event_col] = churn_data[event_col].map({'No':0, 'Yes':1})
    churn_data_example = churn_data[[event_col, duration_col]]
    cph = lifelines.CoxPHFitter()
    cph.fit(churn_data[[event_col, duration_col]], duration_col=duration_col, event_col=event_col)
    # cph.print_summary()
    # get predicted churn:
    unconditioned_sf = cph.predict_survival_function(churn_data_example)
    predicted_survival = unconditioned_sf[[0]].T[12.0][0]
    predicted_churn = 1 - predicted_survival
    # Create churn at tenure = 12; the logic is:
    #      if tenure > 12, they didn't churn => churn_12 = 0;
    #      if tenure < 12 and churn = 1, then churn_12 = 1;
    #      if tenure < 12 and churn = 0, we don't know if they churned => churn_12 = np.nan
    churn_data_example['churn_12'] = churn_data_example['Churn']
    churn_data_example.loc[(churn_data_example.tenure < 12) & (churn_data_example.churn_12 == 0), 'churn_12'] = np.nan
    churn_data_example.loc[(churn_data_example.tenure > 12) , 'churn_12'] = 0
    actual_churn = churn_data_example['churn_12'].mean()
    
    print(f'actual churn: {round(actual_churn,2)}')
    print(f'predicted churn: {round(predicted_churn,2)}')
    print(f'ratio: {round(predicted_churn/actual_churn,2)}')

    The results are:
    actual churn: 0.17
    predicted churn: 0.15
    ratio: 0.89
    And it deviates further as tenure increases.

    Have you got any idea why I am seeing this behaviour? I feel it is either to do with me not understanding what predict_survival_function returns, or with me miscalculating the 'actual churn'.

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hi @data-blade, yeah, I'm still working on it. It's surprisingly more difficult than expected. If your goal is prediction, I suggest taking a look at the AFT models!
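    For reference, a minimal AFT sketch along those lines (the Rossi dataset is only a stand-in, and conditional_after is shown on the assumption that your installed version supports it for the AFT fitters):

    from lifelines import WeibullAFTFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()
    aft = WeibullAFTFitter()
    aft.fit(df, duration_col='week', event_col='arrest')

    # survival curves for the censored subjects, conditioned on each having
    # already survived its observed duration
    censored = df.loc[df['arrest'] == 0]
    surv = aft.predict_survival_function(censored, conditional_after=censored['week'])
    print(surv.head())
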
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    @gabrown, I am able to replicate what you are seeing locally. If I understand correctly, your definition of churn is "the fraction of uncensored users who died before 12 months". I think this is going to bias your churn rate up, as you are not taking censoring into account. In an extreme case where all but one subject is censored, your definition of churn will give 0% or 100%. But that feels a bit strange, no? If that one subject died early on and the other subjects were censored later, we should feel that the churn isn't 100%.

    Please correct me if I am mistaken, or I am not making sense. Happy to discuss more!
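
    To make the censoring point concrete, a small sketch that reuses the column names from the snippet above and compares the naive fraction with a Kaplan-Meier estimate, which does account for censoring:

    from lifelines import KaplanMeierFitter

    # naive estimate: mean of churn_12, which silently drops the subjects
    # censored before 12 months (they are NaN above)
    naive_churn_12 = churn_data_example['churn_12'].mean()

    # Kaplan-Meier estimate: uses the censored subjects' partial information as well
    kmf = KaplanMeierFitter()
    kmf.fit(churn_data_example['tenure'], event_observed=churn_data_example['Churn'])
    km_churn_12 = 1 - kmf.predict(12)

    print(naive_churn_12, km_churn_12)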

    Sandu Ursu
    @sursu
    @CamDavidsonPilon can you please help me understand why using value_and_grad(negative_log_likelihood) in the minimization function, in fitters, helps? Why not simply minimize the negative_log_likelihood directly?
    In the same file: it seems that class ParametericAFTRegressionFitter(ParametricRegressionFitter) contains an extra 'e' :D
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Re typo: yea I know 🙃
    Value and grad function is specifically used in the minimization routine.
    The routine does minimize the negative log likelihood, but it also requires information from both f and f-prime. That's what value_and_grad provides
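    A toy illustration of that pattern, independent of lifelines' actual likelihoods (the Gaussian objective below is purely a stand-in):

    import autograd.numpy as np
    from autograd import value_and_grad
    from scipy.optimize import minimize

    def negative_log_likelihood(params):
        # stand-in objective: negative Gaussian log-likelihood of a tiny dataset
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        data = np.array([1.0, 2.0, 3.0, 4.0])
        return np.sum(np.log(sigma) + 0.5 * ((data - mu) / sigma) ** 2)

    # value_and_grad wraps f so each call returns (f(x), f'(x));
    # jac=True tells scipy the objective supplies its own gradient
    result = minimize(value_and_grad(negative_log_likelihood), x0=np.array([0.0, 0.0]), jac=True)
    print(result.x)
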
    Sandu Ursu
    @sursu
    I have some issues with autograd and while looking at their documentation I've noticed the note saying that they won't develop it further. Have you thought about migrating to JAX?
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    @sursu I likely won't port lifelines to JAX, as JAX is i) overkill for what is needed, ii) harder to install, iii) harder to use too. (I am using JAX for another project though).

    I'd love to help debug your autograd issues. What are you seeing?

    Gareth Brown
    @gabrown

    Thanks for the quick response, @CamDavidsonPilon. I get your point: by ignoring the censored users who haven't been there 12 months, and since the churn rate is low, we are predominately ignoring non-churners, so this would add a bias. However, in my analysis dataset, if I only consider users who could have completed 12 months (so there are no censored users with tenure < 12), I still see a systematic difference.

    If we consider this in the context of survival, how would you measure the survival after 12 months just from the data? I think that problem would have exactly the same issues.

    Thanks for your input, and also for your awesome package!

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    There's also the more subtle point that S(12) does not necessarily equal (deaths prior to 12) / (population) - they are two estimators of churn rate
    (and hence have their own variances)
    Sandu Ursu
    @sursu
    Thank you! So, I have this file. I want to maximize the (log) likelihood there. I get the error: ValueError: setting an array element with a sequence. I've read in their documentation that "Assignment is hard to support...", but at this point I can't imagine how it should be implemented correctly.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @sursu I was able to replicate the problem locally. I made some changes to get it to converge: https://gist.github.com/CamDavidsonPilon/161fc665f6fccc91e21a543d1132a192
    1) Instead of one large matrix for the x_ variables (which may cause problems with autograd), I chose a list of small matrices.
    2) The lik variable is now incremented as we go.
    3) I like BFGS as a first routine to try, but feel free to try others.
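    Roughly, the shape of those changes looks like this generic sketch (an exponential log-likelihood is used purely for illustration; this is not the gist itself):

    import autograd.numpy as np
    from autograd import value_and_grad
    from scipy.optimize import minimize

    def negative_log_likelihood(params, x_list, durations, observed):
        lik = 0.0
        # accumulate the likelihood term by term rather than assigning into a
        # pre-allocated array, which autograd cannot differentiate through
        for x, t, e in zip(x_list, durations, observed):
            rate = np.exp(np.dot(x, params))
            lik += e * np.log(rate) - rate * t
        return -lik

    # a list of small per-subject arrays instead of one large matrix
    x_list = [np.array([1.0, 0.5]), np.array([1.0, 1.5]), np.array([1.0, 2.5])]
    durations = np.array([2.0, 3.0, 5.0])
    observed = np.array([1, 0, 1])

    result = minimize(
        value_and_grad(lambda p: negative_log_likelihood(p, x_list, durations, observed)),
        x0=np.zeros(2),
        jac=True,
        method='BFGS',
    )
    print(result.x)
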
    Sandu Ursu
    @sursu
    Thanks a lot @CamDavidsonPilon !! Really helpful!
    Gareth Brown
    @gabrown
    @CamDavidsonPilon Hmm, I get your point. I have looked at the Kaplan-Meier distribution and it seems to match for the example I provided; however, when I look at the dataset I am interested in, there is a bias after the first 12 months. Is there any assumption about the hazards? We have very spiky hazards, with a high hazard once every 12 months and a low hazard through the rest of the year. Do you think that would affect the performance?
    [image: comparison of the two survival estimates]
    That is an example of the comparison, where the blue line is the Kaplan-Meier estimate and the red line is the Cox regression.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Sorry @gabrown, I'm not exactly sure. KMF estimate != CoxPH baseline, generally, so differences are expected.
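    A quick way to see the two estimates side by side (again, the Rossi dataset is just a placeholder for your own data):

    import matplotlib.pyplot as plt
    from lifelines import CoxPHFitter, KaplanMeierFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()

    kmf = KaplanMeierFitter().fit(df['week'], event_observed=df['arrest'], label='Kaplan-Meier')
    cph = CoxPHFitter().fit(df, duration_col='week', event_col='arrest')

    ax = kmf.plot()
    # baseline_survival_ is the Cox model's survival curve at the baseline covariates,
    # which in general is not the same thing as the Kaplan-Meier estimate
    cph.baseline_survival_.plot(ax=ax)
    plt.show()
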
    datablade
    @data-blade

    @CamDavidsonPilon for what it's worth, here's a short snippet of a slightly misleading error involving pandas.DataFrame.apply that took me a day to debug

    task: use Cox to predict event probability for censored items at the time of their current duration

    import lifelines as ll
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=['regressor', 'duration'])
    df['event'] = np.random.choice([True, False], 10)
    print(df)  # (display(df) in a notebook)
    
    # uncomment to lose the bool and fix the TypeError
    #df['event'] = df['event'].astype(int)
    
    cf = ll.CoxPHFitter()
    cf.fit(df, duration_col='duration', event_col='event')
    
    # select only censored items
    df = df[df['event'] == 0]
    
    func = lambda row: cf.predict_survival_function(row[['regressor']], times=row['duration'])
    df.apply(func, axis=1)

    'misleading' cause it will say the regressor column is non-numerical...

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Thanks for reporting @data-blade, sorry about the wasted time!