## Where communities thrive

• Join over 1.5M+ people
• Join over 100K+ communities
• Free without limits
##### Activity
Gareth Brown
@gabrown

Hi @CamDavidsonPilon, I am new to survival analysis and am using it for trying to predict customer churn. I created a model using CoxPHFitter and I wanted to evaluate how well the model performed by comparing the survival after 12 months (using the correct row from predict_survival_function output), to the observed churn rate (1-survival rate). I noticed that it is consistently getting a higher survival rate compared to actual (~10%).
I paired back the model so that it was only based off the baseline hazard (passed no extra variables) and I still get a difference in survival rates.
I have tried this on open data, and can reproduce the result:

import lifelines
import numpy as np
import pandas as pd
'treselle-systems/customer_churn_analysis/'
'master/WA_Fn-UseC_-Telco-Customer-Churn.csv')
event_col = 'Churn'
duration_col = 'tenure'
churn_data[event_col] = churn_data[event_col].map({'No':0, 'Yes':1})
churn_data_example = churn_data[[event_col, duration_col]]
cph = lifelines.CoxPHFitter()
cph.fit(churn_data[[event_col, duration_col]], duration_col=duration_col, event_col=event_col)
# cph.print_summary()
# get predicted churn:
unconditioned_sf = cph.predict_survival_function(churn_data_example)
predicted_survival = unconditioned_sf[[0]].T[12.0][0]
predicted_churn = 1 - predicted_survival
#Create churn at tenure = 12: logic is
#      if tenure > 12 then they didnt churn => churn_12 =0;
#      if they have tenure < 12 and churn=1, then the churn_12 =1;
#      if tenure < 12 and churn=0, dont know if they churn => churn_12 = np.nan
churn_data_example['churn_12'] = churn_data_example['Churn']
churn_data_example.loc[(churn_data_example.tenure < 12) & (churn_data_example.churn_12 == 0), 'churn_12'] = np.nan
churn_data_example.loc[(churn_data_example.tenure > 12) , 'churn_12'] = 0
actual_churn = churn_data_example['churn_12'].mean()

print(f'actual churn: {round(actual_churn,2)}')
print(f'predicted churn: {round(predicted_churn,2)}')
print(f'ratio: {round(predicted_churn/actual_churn,2)}')

The results are:
actual churn: 0.17
predicted churn: 0.15
ratio: 0.89
And it deviates further as tenure increases.

Have you got any idea why I am seeing the behaviour? I feel it is either to do with me not understanding what predict_survival_function returns, or I am mis calculating the ‘actual churn’?

Cameron Davidson-Pilon
@CamDavidsonPilon
Hi @data-blade, yea I'm still working on it. It's surprisingly more difficult. If you goal is prediction, I suggest taking a look at the AFT models!
Cameron Davidson-Pilon
@CamDavidsonPilon

@gabrown, I am able to replicate what you are seeing locally. If I understand correctly, your definition of churn is "fraction of uncensored users who died before 12 months". I think this is going to bias your churn rate up, as you are not taking into account censoring. In an extreme case, where all but one subject is censored, then your def of churn will give 0% or 100%. But, that feels a bit strange, no? If they died early on, and the other subjects were censored later, we should feel that the churn isn't 100%.

Please correct me if I am mistaken, or I am not making sense. Happy to discuss more!

Sandu Ursu
@sursu
@CamDavidsonPilon can you please help me understand why using value_and_grad(negative_log_likelihood) in the minimization function, in fitters, helps? Why not simply minimize the negative_log_likelihood directly?
In the same file: it seems that class ParametericAFTRegressionFitter(ParametricRegressionFitter) contains an extra 'e' :D
Cameron Davidson-Pilon
@CamDavidsonPilon
Re typo: yea I know 🙃
Value and grad function is specifically used in the minimization routine.
The routine does minimize the log likelihood, but it also requires information from both f and f-prime. That’s what value and grad provides
Sandu Ursu
@sursu
I have some issues with autograd and while looking at their documentation I've noticed the note saying that they won't develop it further. Have you thought about migrating to JAX?
Cameron Davidson-Pilon
@CamDavidsonPilon

@sursu I likely won't port lifelines to JAX, as JAX is i) overkill for what is needed, ii) harder to install, iii) harder to use too. (I am using JAX for another project though).

I'd love to help debug your autograd issues. What are you seeing?

Gareth Brown
@gabrown

Thanks for the quick response, @CamDavidsonPilon . I get your point about by ignoring the censored users who haven’t been there 12 months, we are ignoring, and as churn rate is low, they would be predominately non-churners, so this would add a bias. However, in my analysis dataset, if I only consider users who could have completed 12 months (so there are no censored users with tenure<12) I still see a systematic difference.

If we consider this in the context of survival, how would you measure the survival after 12 months just from the data? As I think this problem would have exactly the same issues.

Thanks for you input, and also for your awesome package!

Cameron Davidson-Pilon
@CamDavidsonPilon
There's also the more subtle point that S(12) does not necessarily equal (deaths prior to 12) / (population) - they are two estimators of churn rate
(and hence have their own variances)
Sandu Ursu
@sursu
Thank you! So, I have this file. I want to maximize the (log) likelihood there. I get the error: ValueError: setting an array element with a sequence. I've read from their documentation that "Assignment is hard to support...", but I at this point I can't imagine how it should be rightly implemented.
Cameron Davidson-Pilon
@CamDavidsonPilon
@sursu I was able to replicate the problem locally. I made some changes to get it to converge: https://gist.github.com/CamDavidsonPilon/161fc665f6fccc91e21a543d1132a192
1) Instead of one large matrix for the x_ variables (which may cause problems with autograd), I instead chose a list of small matrices.
2) the lik variable is now incrementing as we go.
3) I like BFGS as a first routine to use, feel free to try others though.
Sandu Ursu
@sursu
Thanks a lot @CamDavidsonPilon !! Really helpful!
Gareth Brown
@gabrown
@CamDavidsonPilon hmm, I get your point. I have looked at the Kaplan-Meier distribution and it seems to match for the example I provided, however, when I look at the dataset I am interested in, there is a bias after the first 12 months. Is there any assumption about the hazards? We have very spikey hazards, with high hazard once every 12 months, and through the rest of the year the hazard is low. Do you think that would effect the performance?
That is an example of the comparison, where the blue line is the Kaplan-Meier estimate and the red line is the cox regresstion
Cameron Davidson-Pilon
@CamDavidsonPilon
Sorry @gabrown, I'm not exactly sure. KMF estimate != CoxPH baseline, generally, so differences are expected.
Cameron Davidson-Pilon
@CamDavidsonPilon

@CamDavidsonPilon for what it's worth, a short snippet of a slightly misleading error involving pandas.DataFrame.apply that took me a day to debug

task: use Cox to predict event probability for censored items at the time of their current duration

import lifelines as ll
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=['regressor', 'duration'])
df['event'] = np.random.choice([True, False], 10)
display(df)

# uncomment to lose the bool and fix the TypeError
#df['event'] = df['event'].astype(int)

cf = ll.CoxPHFitter()
cf.fit(df, duration_col='duration', event_col='event')

# select only censored items
df = df[df['event'] == 0]

func = lambda row: cf.predict_survival_function(row[['regressor']], times=row['duration'])
df.apply(func, axis=1)

'misleading' cause it will say the regressor column is non-numerical...

Cameron Davidson-Pilon
@CamDavidsonPilon
absolutely no need to apologize, especially since it was i who ignored docstring regarding the event data type in the first place ;)
Cameron Davidson-Pilon
@CamDavidsonPilon
Minor lifelines release: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.5
@CamDavidsonPilon i feel so appreciated ;) glad i could help
Cameron Davidson-Pilon
@CamDavidsonPilon
Alexander Dmitryuk
@dmitryuk
Hello, Is it possible to use lifelines to predict ~ queue times? In my example, employees checks client's documents, they don't work on weekends, at night, have a dinner and etc. I want to know if client uploaded the document, predict how long time to wait left.
For example, training data - table contains documents verification process
document_id|start_at|ends_at|duration(ends_at - start_at)
Cameron Davidson-Pilon
@CamDavidsonPilon
Hi @dmitryuk, sure that can be done. Queue times fit perfectly into survival analysis. Since you suggest that when the client uploaded the doc is important, I would suggest that you use that feature (mapped to a cyclic variable¹) in a regression model. Ex:
from lifelines import WeibullAFTFitter

df['start_time'] = df['start_time'].map(map_to_seconds)
df['sin_start_time'] = np.sin(2*np.pi*df['start_time']/seconds_in_day)
df['cos_start_time'] = np.cos(2*np.pi*df['start_time']/seconds_in_day)
df = df.drop('start_time', axis=1)

wf = WeibullAFTFitter().fit(df, "duration")

wf.predict_survival_function(df)
wf.predict_median(df)
Since you want how long left to wait, you probably want to use the conditional_after kwarg in the predict_* methods as well
Vilane.
@vgs549
@CamDavidsonPilon have you considered counterfactual analysis in lifelines?
Cameron Davidson-Pilon
@CamDavidsonPilon
@vgs549 mmm not much - are you thinking about casual inference techniques?
Alexander Dmitryuk
@dmitryuk
This way I prepared the data as
id(doc id)|start_from_week_seconds(seconds past from start of week after client uploaded doc)|duration(seconds spent to check the doc)
After code line executed wf = WeibullAFTFitter().fit(df, "duration") exception throw
"StatisticalWarning: The diagonal of the variancematrix has negative values. This could be a problem with WeibullFitter's fit to the data."
Could you help to understand what is wrong in the code?
Vilane.
@vgs549
@CamDavidsonPilon yes, causal inference.
Cameron Davidson-Pilon
@CamDavidsonPilon
@vgs549 I've thought some about it, however I've left most of the burden on the user to choose models and inference appropriately. I would suggest checking out the Zepid package for more causal inference assistance
@dmitryuk ah, ignore it, I need to suppress that. Also, make sure to drop the id col in your model
Cameron Davidson-Pilon
@CamDavidsonPilon
:wave: minor lifelines release. Better support for pickling! https://github.com/CamDavidsonPilon/lifelines/releases
Vilane.
@vgs549
@CamDavidsonPilon Thanks, I will have a look.
Niranjan Ravichandra
@nravic
Hello! I'm trying to predict failure of a few robots with a pretty substantial time-series dataset, and I've been looking at lifelines as a potential method for doing so. The time series data has a few instances of failure, and I'm trying to correlate a number of other variables we have data on (such as forward velocity, number of stationary hours, etc) with failure. In short, I'm trying to get a window in which to predict possible failure based on historical data. Should I be using survival regression for this?
Niranjan Ravichandra
@nravic
Also going off the survival regression chapter in the wiki, each of my observations are obtained daily. Does the fact that the duration in my data is just 1 matter?
Cameron Davidson-Pilon
@CamDavidsonPilon
@nravic I think you can use lifelines, but you're in the realm of recurrent events, which lifelines has only a little support for (there may be another package out there?). Since you have daily snapshots, you probably want to use time-varying regression: https://lifelines.readthedocs.io/en/latest/Time%20varying%20survival%20regression.html
Niranjan Ravichandra
@nravic
Great, thanks @CamDavidsonPilon ! I'll look into this and also see if there's anything around for recurrent events.
Niranjan Ravichandra
@nravic
@CamDavidsonPilon I have data granular down to the second too however. would that see a better use from lifelines or am I better off looking elsewhere?
d-seki
@d-seki
Thank you very much for creating this wonderful tool for statistics. Let me ask you a subtle question. What null hypothesis are you assuming for CoxTimeVaryingFitter? I guess it is for beta to be zero. Best,
Julian Späth
@julianspaeth
Hey, I have a question concerning the concordance_index. I want to use my predicted cumulative hazard functions to compute the concordance_index and use them as predicted_scores. Is it the right way to sum up the chf of each sample and take the negative of it to compute the concordance_index on the basis of the cumulative hazard functions?
Youyang
@zxclcsq
Hello. I'm trying to replicate the Weibull AFT model prediction section in the lifelines docs, but the return is all NANs from the predict_survival_function. Any thoughts on this? The code I used is :
from lifelines import WeibullAFTFitter
aft.predict_survival_function(X)