Hi @CamDavidsonPilon, I am new to survival analysis and am using it to try to predict customer churn. I created a model using CoxPHFitter and wanted to evaluate how well it performs by comparing the predicted survival after 12 months (using the corresponding row of the predict_survival_function output) with the observed churn rate (1 - survival rate). I noticed that the model consistently predicts a higher survival rate than the actual one (by ~10%).
I pared the model back so that it is based only on the baseline hazard (no extra variables passed) and I still get a difference in survival rates.
I have tried this on open data, and can reproduce the result:
import lifelines
import numpy as np
import pandas as pd
churn_data = pd.read_csv('https://raw.githubusercontent.com/'
'treselle-systems/customer_churn_analysis/'
'master/WA_Fn-UseC_-Telco-Customer-Churn.csv')
event_col = 'Churn'
duration_col = 'tenure'
churn_data[event_col] = churn_data[event_col].map({'No':0, 'Yes':1})
# .copy() avoids pandas' SettingWithCopyWarning when churn_12 is added below
churn_data_example = churn_data[[event_col, duration_col]].copy()
cph = lifelines.CoxPHFitter()
cph.fit(churn_data[[event_col, duration_col]], duration_col=duration_col, event_col=event_col)
# cph.print_summary()
# get predicted churn:
unconditioned_sf = cph.predict_survival_function(churn_data_example)
predicted_survival = unconditioned_sf[[0]].T[12.0][0]
predicted_churn = 1 - predicted_survival
# Create churn at tenure = 12. The logic is:
# if tenure > 12 then they didn't churn => churn_12 = 0;
# if tenure < 12 and churn = 1, then churn_12 = 1;
# if tenure < 12 and churn = 0, we don't know whether they churn => churn_12 = np.nan
churn_data_example['churn_12'] = churn_data_example['Churn']
churn_data_example.loc[(churn_data_example.tenure < 12) & (churn_data_example.churn_12 == 0), 'churn_12'] = np.nan
churn_data_example.loc[(churn_data_example.tenure > 12) , 'churn_12'] = 0
actual_churn = churn_data_example['churn_12'].mean()
print(f'actual churn: {round(actual_churn,2)}')
print(f'predicted churn: {round(predicted_churn,2)}')
print(f'ratio: {round(predicted_churn/actual_churn,2)}')
The results are:
actual churn: 0.17
predicted churn: 0.15
ratio: 0.89
And it deviates further as tenure increases.
Have you got any idea why I am seeing this behaviour? I feel it is either to do with me not understanding what predict_survival_function returns, or I am miscalculating the ‘actual churn’?
@gabrown, I am able to replicate what you are seeing locally. If I understand correctly, your definition of churn is "fraction of uncensored users who died before 12 months". I think this is going to bias your churn rate up, as you are not taking into account censoring. In an extreme case, where all but one subject is censored, then your def of churn will give 0% or 100%. But, that feels a bit strange, no? If they died early on, and the other subjects were censored later, we should feel that the churn isn't 100%.
Please correct me if I am mistaken, or I am not making sense. Happy to discuss more!
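The extreme case described above can be sketched with toy numbers. This is a minimal illustration, not lifelines code: the data are made up, and `kaplan_meier_survival` is a hypothetical hand-rolled helper implementing the Kaplan-Meier product-limit estimator (the estimator that lifelines' KaplanMeierFitter provides), which keeps censored subjects in the at-risk set until they drop out.

```python
import numpy as np

def kaplan_meier_survival(durations, observed, t):
    """Hypothetical helper: Kaplan-Meier estimate of S(t) from raw data."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    survival = 1.0
    # Walk through the distinct event times up to t in increasing order.
    for event_time in np.unique(durations[observed]):
        if event_time > t:
            break
        at_risk = np.sum(durations >= event_time)  # still under observation
        deaths = np.sum((durations == event_time) & observed)
        survival *= 1.0 - deaths / at_risk
    return survival

# Toy data (assumed): one early churner, nine subjects censored later,
# all of them before 12 months.
durations = np.array([2, 5, 6, 7, 8, 8, 9, 10, 11, 11], dtype=float)
observed = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)

# The definition from the question: drop censored subjects with tenure < 12.
churn_12 = observed.astype(float)
churn_12[(durations < 12) & ~observed] = np.nan
churn_12[durations > 12] = 0.0
naive_churn = np.nanmean(churn_12)

km_churn = 1.0 - kaplan_meier_survival(durations, observed, t=12)

print(f"naive churn: {naive_churn:.2f}")
print(f"Kaplan-Meier churn: {km_churn:.2f}")
```

The naive definition sees only the single churner, while the Kaplan-Meier estimate counts the nine censored subjects as at risk when that churn happened, so it stays far below 100%.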
value_and_grad(negative_log_likelihood) in the minimization function, in fitters, helps? Why not simply minimize the negative_log_likelihood directly?
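On the value_and_grad question, a hedged sketch of one likely reason: with jac=True, scipy.optimize.minimize accepts a function that returns the objective value and its gradient together, which is the kind of function autograd's value_and_grad produces. Without a gradient, L-BFGS-B falls back to finite-difference estimates. The toy exponential likelihood, the simulated data, and the hand-written gradient (standing in for autograd) are assumptions for illustration, not lifelines' actual fitting code.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
durations = rng.exponential(scale=5.0, size=200)  # toy, fully observed durations

def negative_log_likelihood(params):
    # Exponential model with rate = exp(params[0]); no censoring in this toy.
    log_rate = params[0]
    return -(durations.size * log_rate - np.exp(log_rate) * durations.sum())

def value_and_gradient(params):
    # Hand-written stand-in for the (value, gradient) pair that
    # autograd's value_and_grad would compute automatically.
    log_rate = params[0]
    value = negative_log_likelihood(params)
    grad = np.array([-(durations.size - np.exp(log_rate) * durations.sum())])
    return value, grad

x0 = np.zeros(1)
# jac=True tells scipy that the objective returns (value, gradient).
with_grad = minimize(value_and_gradient, x0, jac=True, method="L-BFGS-B")
# No jac: scipy approximates the gradient by finite differences.
without_grad = minimize(negative_log_likelihood, x0, method="L-BFGS-B")

print("function evaluations with gradient:", with_grad.nfev)
print("function evaluations without gradient:", without_grad.nfev)
print("estimated rate:", np.exp(with_grad.x[0]))
```

Both calls minimize the same negative log-likelihood; the difference is only in how the optimizer obtains the gradient, which matters more as the number of parameters grows.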
class ParametericAFTRegressionFitter(ParametricRegressionFitter) contains an extra 'e' :D
autograd and while looking at their documentation I've noticed the note saying that they won't develop it further. Have you thought about migrating to JAX?
Thanks for the quick response, @CamDavidsonPilon. I get your point: by ignoring the censored users who haven’t been there 12 months, we drop users who, as the churn rate is low, would be predominantly non-churners, so this adds a bias. However, in my analysis dataset, if I only consider users who could have completed 12 months (so there are no censored users with tenure<12) I still see a systematic difference.
If we consider this in the context of survival, how would you measure the survival after 12 months just from the data? As I think this problem would have exactly the same issues.
Thanks for your input, and also for your awesome package!
ValueError: setting an array element with a sequence. I've read in their documentation that "Assignment is hard to support...", but at this point I can't imagine how it should be implemented correctly.
x_ variables (which may cause problems with autograd), I instead chose a list of small matrices.
lik variable is now incrementing as we go.
@CamDavidsonPilon for what it's worth, a short snippet of a slightly misleading error involving pandas.DataFrame.apply that took me a day to debug
task: use Cox to predict event probability for censored items at the time of their current duration
import lifelines as ll
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=['regressor', 'duration'])
df['event'] = np.random.choice([True, False], 10)
print(df)  # display(df) is only available inside a notebook
# uncomment to lose the bool and fix the TypeError
#df['event'] = df['event'].astype(int)
cf = ll.CoxPHFitter()
cf.fit(df, duration_col='duration', event_col='event')
# select only censored items
df = df[df['event'] == 0]
func = lambda row: cf.predict_survival_function(row[['regressor']], times=row['duration'])
df.apply(func, axis=1)
'misleading' cause it will say the regressor column is non-numerical...
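One plausible cause of the message in the snippet above: with axis=1, pandas passes each row to the function as a Series, and a row that mixes bool and integer columns gets object dtype, so the regressor value no longer looks numeric by the time it reaches the fitter. A minimal sketch with made-up data, using only pandas (no lifelines call is involved):

```python
import pandas as pd

# Toy frame (assumed data) with the same column layout as the snippet above.
df = pd.DataFrame(
    {"regressor": [3, 7], "duration": [10, 20], "event": [True, False]}
)

# Each row handed to the function is a Series; report its dtype as a string.
mixed_row_dtypes = df.apply(lambda row: str(row.dtype), axis=1)

# Casting the bool column to int makes every column an integer type.
df_int = df.assign(event=df["event"].astype(int))
int_row_dtypes = df_int.apply(lambda row: str(row.dtype), axis=1)

print("bool event column:", mixed_row_dtypes.tolist())
print("int event column: ", int_row_dtypes.tolist())
```

This is consistent with the commented-out astype(int) line fixing the error: the regressor column itself never changes, only the dtype of the row it is read from.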
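import numpy as np
from lifelines import WeibullAFTFitter
# df, map_to_seconds and seconds_in_day are defined elsewhere (not shown in this snippet)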
df['start_time'] = df['start_time'].map(map_to_seconds)
df['sin_start_time'] = np.sin(2*np.pi*df['start_time']/seconds_in_day)
df['cos_start_time'] = np.cos(2*np.pi*df['start_time']/seconds_in_day)
df = df.drop('start_time', axis=1)
wf = WeibullAFTFitter().fit(df, "duration")
wf.predict_survival_function(df)
wf.predict_median(df)
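The sin/cos lines in the snippet above encode time of day as a point on a circle, so that times just before and just after midnight end up close together. A minimal sketch of that step alone, assuming `start_time` holds "HH:MM:SS" strings; the `map_to_seconds` body and the `SECONDS_IN_DAY` constant here are hypothetical stand-ins, since the originals are not shown:

```python
import numpy as np
import pandas as pd

SECONDS_IN_DAY = 24 * 60 * 60

def map_to_seconds(hhmmss):
    # Hypothetical helper: "HH:MM:SS" string -> seconds since midnight.
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Toy data (assumed).
df = pd.DataFrame({"start_time": ["00:00:00", "06:00:00", "12:00:00", "23:59:59"]})

seconds = df["start_time"].map(map_to_seconds)
angle = 2 * np.pi * seconds / SECONDS_IN_DAY
df["sin_start_time"] = np.sin(angle)
df["cos_start_time"] = np.cos(angle)
df = df.drop("start_time", axis=1)

print(df)
```

Using both the sine and the cosine matters: either one alone maps two different times of day to the same value.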
conditional_after kwarg in the predict_* methods as well
wf = WeibullAFTFitter().fit(df, "duration") exception throw
id col in your model
from lifelines import WeibullAFTFitter
from lifelines.datasets import load_rossi
rossi_dataset = load_rossi()
aft = WeibullAFTFitter()
aft.fit(rossi_dataset, duration_col='week', event_col='arrest')
X = rossi_dataset.loc[:10]
aft.predict_survival_function(X)