Cameron Davidson-Pilon
@CamDavidsonPilon

Hi @quanthubscl, you're correct that the method you describe is one way to get the survival function. There are other ways, however, and it depends on which model you are using. For example, parametric forms often have a closed-form formula for the integral of the hazard, and lifelines uses that. Kaplan-Meier estimates the survival function directly and doesn't estimate any hazard.

Can I ask what model you are using?

quanthubscl
@quanthubscl
@CamDavidsonPilon, I am using the Cox proportional hazards model. I am actually following the examples given for the rossi dataset. Mostly, I am trying to make sure that I am doing and understanding things correctly. I take the baseline hazard function, multiply it by the partial hazard for a sample, then integrate this function with scipy and take the negative exponent.
Cameron Davidson-Pilon
@CamDavidsonPilon
@quanthubscl in the case of the Cox model, we can just cumulatively sum the baseline hazard to get the cumulative baseline hazard. Why? In the Cox model, we actually estimate the cumulative hazard first (using https://stats.stackexchange.com/questions/46532/cox-baseline-hazard), and then take the .diff to recover the baseline hazard, so .cumsum recovers the original cumulative hazard.
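A toy check of that diff/cumsum round trip (the H0 values below are made up, not from any fit):

```python
# The cumulative baseline hazard H0 is estimated first; a "diff" gives the
# per-interval baseline hazard h0, and cumulatively summing h0 recovers H0.
H0 = [0.10, 0.25, 0.45, 0.80]                         # made-up cumulative baseline hazard
h0 = [H0[0]] + [b - a for a, b in zip(H0, H0[1:])]    # "diff", keeping the first value
H0_back, total = [], 0.0
for h in h0:                                          # "cumsum"
    total += h
    H0_back.append(total)
```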
davidrindt
@davidrindt
Hi! How do we access the p-value of the likelihood ratio test? I can see it printed after cph.print_summary() but I don't know where it is stored.
@CamDavidsonPilon
Cameron Davidson-Pilon
@CamDavidsonPilon
@davidrindt ah, yea this is something I'm thinking of exposing differently. ATM you can access it using _, _, log2p = cph._compute_likelihood_ratio_test()
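For context, a toy sketch of what a likelihood-ratio test computes (the log-likelihoods below are made up, not from a real fit; with one degree of freedom the chi-squared survival function reduces to erfc):

```python
import math

# Likelihood-ratio test: stat = 2 * (ll_full - ll_null), compared against
# a chi-squared distribution. For 1 degree of freedom,
# P(chi2 > stat) = erfc(sqrt(stat / 2)).
ll_null, ll_full = -102.4, -97.1        # hypothetical log-likelihoods
stat = 2 * (ll_full - ll_null)          # test statistic
p = math.erfc(math.sqrt(stat / 2))      # p-value, 1 degree of freedom
```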
sam
@veggiet_gitlab
So, I'm finally at the place where I'm using check_assumptions on my cph model. I've got a few variables reported as having "failed the non-proportional test" that I can see clearly do, but there are a couple that, on visual inspection, don't seem to... As I understand it, to pass the test the variable needs to produce a straight line? And in this image the "giving" parameter is clearly showing a straight line, but I'm getting: 1. Variable 'giving' failed the non-proportional test: p-value is <5e-05.
Cameron Davidson-Pilon
@CamDavidsonPilon
@veggiet_gitlab the left-hand graph does dip in the right tail, which is probably the violation. However, it's very minor, and because you have so much data, the test has enough power to detect even this. It's safe to ignore.
fredrichards72
@fredrichards72

@CamDavidsonPilon Thanks so much for your work on lifelines. Very cool. I was particularly excited to see your last post on SaaS churn. https://dataorigami.net/blogs/napkin-folding/churn.

I've been trying to follow along but have run into a couple of issues. First, I don't see that PiecewiseExponentialRegressionFitter exists in lifelines. I do see PiecewiseExponentialFitter, however. If I use PiecewiseExponentialFitter I get an error:
object has no attribute 'predict_cumulative_hazard'

Cameron Davidson-Pilon
@CamDavidsonPilon
Hey @fredrichards72, the model isn't in lifelines yet (I should have added that to the blog article). It's in a PR right now, and I should merge it soon. CamDavidsonPilon/lifelines#715
fredrichards72
@fredrichards72
Got it. Thanks!
One other question: I'm dealing with subscriber data which is right censored (I still have lots of active subscribers whose death events have not been observed). In addition, we have acquired companies with active subscribers over the past few years, and their start dates are unknown. If we acquired a company on Jan 1, 2018, we know that the subscriber start date (birth) was at least that early, but it could have been years before. I believe that would be left censored. Any suggestion for how to handle that?
Cameron Davidson-Pilon
@CamDavidsonPilon
That's similar to a previous situation talked about in this room: https://gitter.im/python-lifelines/Lobby?at=5ccb47e6375bac74704463e3
the gist is: it's actually a right-censoring problem, since you are measuring their life durations and you have censored data. As in the linked comment, the concept of "baseline covariates" here is muddy.
fredrichards72
@fredrichards72
So it is BOTH right and left censored?
Cameron Davidson-Pilon
@CamDavidsonPilon
just right-censored
fredrichards72
@fredrichards72
even though I have a mix of unobserved deaths and births?
Cameron Davidson-Pilon
@CamDavidsonPilon

it took me a while to see how it was right-censored. But once I considered that the goal is to measure durations, it became clearer.

Yeah, consider someone in that acquired group. You know that their duration is at least 495 days (now minus Jan 2018), and certainly more. Thus we have a lower bound on their duration -> right censoring

fredrichards72
@fredrichards72
ahh...that makes sense...
thanks so much
Sachin Abeywardana
@sachinruk
Hi everyone, just wondering if anyone can answer this SO question here: https://stackoverflow.com/questions/56126057/predicting-survival-probability-at-current-time
Bojan Kostic
@bkos
@CamDavidsonPilon I was wondering, in general, what is the effect of censoring on estimation? durations and observed are passed to every fit method, but it's not clear from the provided equations how they are used. Even for the simplest KM estimator, n is the number of subjects at risk and d is the number of death events (observed deaths only, I assume), so what about censored observations? Thanks!
Cameron Davidson-Pilon
@CamDavidsonPilon

Hi @bkos, good question. Generally, censoring makes estimation harder, as we lose information. It might not be obvious from the KM equations, but it affects the denominator $n_i$, the number of subjects at risk: an individual censored before time $t_i$ is no longer present in this count.
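A hand-rolled Kaplan-Meier on toy data makes this concrete: a censored subject never adds to the death count $d_i$, but it is counted in the risk set $n_i$ up to its censoring time.

```python
# Toy Kaplan-Meier product-limit estimate (all data made up).
durations = [1, 2, 2, 3, 4]
observed  = [1, 1, 0, 1, 0]   # 0 = censored; the subject at t=2 leaves the
                              # risk set after t=2 without ever being a death
S = 1.0
for t in sorted(set(durations)):
    n = sum(1 for d in durations if d >= t)                            # n_i: at risk
    d = sum(1 for du, o in zip(durations, observed) if du == t and o)  # d_i: deaths
    S *= 1 - d / n
```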

Sachin Abeywardana
@sachinruk
Thanks for answering my previous question.
I am trying to recreate fitting the WeibullAFTFitter in PyTorch by writing my own loss function. I am trying to follow the tutorial here in PyMC3, but I think that is rather different from the Weibull AFT model in lifelines. Is there some tutorial I can look at to try to implement the lifelines model myself?
If I do manage to write such a loss function, it opens up the possibility of using deep learning models instead of linear models, which would be an advantage.
Sachin Abeywardana
@sachinruk
If it helps: when I assumed that time = beta * x + e, where e was Gumbel-distributed, I ended up with this loss function:
```python
def gumbel_sa_loss(y_pred, targ):
    failed = targ[..., :1]
    e = y_pred - targ[..., 1:]
    exp_e = torch.exp(e)
    failed_loss = failed * (exp_e - e)
    censored_loss = -(1 - failed) * torch.log(1 - torch.exp(-exp_e))
    log_lik = failed_loss + censored_loss
    return log_lik.mean()
```
I'm fairly sure, though, that WeibullAFT models are attempting to do something else to model the time to begin with.
Cameron Davidson-Pilon
@CamDavidsonPilon

You'll need to implement the likelihood in pytorch, and I think a good summary of parametric survival likelihoods is in section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf

For AFT models, note that it's log(time) = beta * x + e (the log is important)
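As a toy illustration of such a parametric likelihood (a sketch only, not lifelines' implementation; the single covariate and parameter names are made up), here is the right-censored Weibull log-likelihood in plain Python:

```python
import math

def weibull_aft_loglik(times, events, xs, beta0, beta1, rho):
    """Right-censored Weibull log-likelihood (toy sketch).

    The covariate enters through the scale, lambda_i = exp(beta0 + beta1 * x_i),
    which is the AFT parameterisation: log(time) is shifted by beta * x.
    ll = sum_i [ E_i * log h(t_i) - H(t_i) ], with H(t) = (t / lambda)**rho.
    """
    ll = 0.0
    for t, e, x in zip(times, events, xs):
        lam = math.exp(beta0 + beta1 * x)
        H = (t / lam) ** rho                                         # cumulative hazard
        log_h = math.log(rho / lam) + (rho - 1) * math.log(t / lam)  # log hazard
        ll += e * log_h - H                                          # censored: only -H
    return ll
```

Censored rows contribute only the survival term `-H(t)`, which is exactly how the information loss from censoring shows up in the likelihood.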

Cameron Davidson-Pilon
@CamDavidsonPilon
:wave: minor lifelines release, v0.21.2 is available. Changelog: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.21.2
Sachin Abeywardana
@sachinruk
Can someone take a look at my question here: https://stackoverflow.com/questions/56214952/upper-limit-on-duration-for-survival-analysis. Sorry for asking so many questions.
Cameron Davidson-Pilon
@CamDavidsonPilon
@sachinruk, are you able to share the final dataset with me, privately? I'm at cam.davidson.pilon@gmail.com. Datasets that cause problems are a good motivation for internal improvements
Dan Guim
@dcguim

Hello All,

I would like to ask for some help getting started with lifelines; there are some simple tasks I could not find a direct way of doing. I am particularly interested in Cox models.

1) I was not able to find a way to retrieve the hazard at time t and the survival at time t for a Cox PH model; I only get information about the baselines and the coefficients (which are, for some reason, called "hazards_"). Of course I could generate the hazard and the survival from this information, but it would be nice to do it directly. If there really is no direct way, I would gladly contribute one to lifelines.

2) How does one generate the adjusted (accounting for the covariates) survival curves for a Cox PH model, for the data used to fit the model, before doing any prediction? Apparently there is currently only support for plotting the baseline survival function.

Dan Guim
@dcguim
```python
from math import exp

def hazard(phdata, coef, baseline, i, t):
    cov = phdata.iloc[i].drop(labels=['censored?', 'eventdate'])
    base = baseline.at[float(t), 'baseline hazard']
    haz = base * exp(cov.dot(coef))
    return haz
```
this is what I meant by the hazard at t, for instance. I could not find where it was implemented; is this equivalent to predict_partial_hazard(X)?
Thanks
Cameron Davidson-Pilon
@CamDavidsonPilon
You can see how the terms "partial", "log-partial", etc. relate using this formula: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#cox-s-proportional-hazard-model
Cameron Davidson-Pilon
@CamDavidsonPilon

It looks like you want the hazard per subject over time (not the cumulative hazard). The most appropriate way is something like cph.predict_cumulative_hazard(phdata).diff(), as the cumulative hazard and the hazard are related in that manner. Using predict_cumulative_hazard is helpful since it takes care of any strata arguments and any de-meaning necessary.

You asked me privately why I subtract the mean of the covariates in the prediction methods. The reason is that the algorithm is fit on de-meaned data, and hence the baseline hazard "accounts" for the de-meaned data and grows or shrinks appropriately (I can't think of a better way to say this without a whiteboard/LaTeX). From the POV of the hazard, then, the values are the same. However, the log-partial hazard and the partial hazard will be different. This is okay, as there is no interpretation of the (log-)partial hazard without the baseline hazard (another way to think about this: it's unitless). The only use of the (log-)partial hazard is determining rankings. That is, multiplying by the baseline hazard is necessary to recover the hazards.

All this to say: the Cox model can be confusing, and ((log-)partial) hazards are not intuitive. I am more and more of a fan of AFT models now: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#accelerated-failure-time-models
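The de-meaning invariance described above can be checked with made-up numbers: the shifted baseline hazard absorbs exp(beta * x_mean), so the full hazard is unchanged while the partial hazards differ.

```python
import math

beta, x, x_mean, h0 = 0.7, 2.0, 1.2, 0.05    # all values hypothetical
hazard_raw      = h0 * math.exp(beta * x)
h0_shifted      = h0 * math.exp(beta * x_mean)             # baseline "accounts" for the mean
hazard_demeaned = h0_shifted * math.exp(beta * (x - x_mean))
partial_raw      = math.exp(beta * x)             # differs from...
partial_demeaned = math.exp(beta * (x - x_mean))  # ...this, but ranks subjects identically
```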
Dan Guim
@dcguim
@CamDavidsonPilon, I believe there is an interpretation of the (log-)partial hazard without the baseline: namely, when computing a hazard ratio, the baselines cancel out (essentially, you divide the hazard of one individual by the hazard of another).
How do you compute the log-likelihood to estimate the beta coefficients, if not by the hazard ratio?
Cameron Davidson-Pilon
@CamDavidsonPilon
ah yes, good catch, there is that interpretation; I was too quick in my assertion. It's still a unitless quantity, though, so we can't talk about (absolute) hazard without the baseline hazard.
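The cancellation in question, with made-up numbers: the baseline hazard drops out of a hazard ratio, leaving only exp(beta * (x1 - x2)).

```python
import math

beta, x1, x2, h0 = 0.7, 2.0, 0.5, 0.05   # hypothetical coefficient, covariates, baseline
hr_full  = (h0 * math.exp(beta * x1)) / (h0 * math.exp(beta * x2))  # h0 cancels
hr_short = math.exp(beta * (x1 - x2))
```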
William Tarte
@WillTarte
@CamDavidsonPilon Love your work! I have encountered a problem when fitting a CoxPHFitter and trying to test the proportional hazards assumption. I make a call to check_assumptions() with show_plots=True, but my program hangs and never shows the plots (or the full advice either, I'm pretty sure). Do you have an idea of what's going on?
Jenny Yang
@jennyatClover

Hi all, I'm having issues with convergence for the CoxPHFitter in lifelines that I'm not seeing in R. It gives me this message

"ConvergenceError: Convergence halted due to matrix inversion problems. Suspicion is high collinearity. Please see the following tips in the lifelines documentation:
Matrix is singular."

Even when I reduce the dataset to just the time at observation, whether or not the event happened, and a single covariate. Graphically, I have verified that there is not a perfect decision boundary; I have also modeled this in R and it worked perfectly. Looking at the algorithm's step output, it does look like the Newton algorithm is diverging:

Iteration 1: norm_delta = 2.29694, step_size = 0.9500, ll = -27814.17290, newton_decrement = 3481.08865, seconds_since_start = 0.0
Iteration 2: norm_delta = 5.84762, step_size = 0.9500, ll = -36925.79270, newton_decrement = 37483.21855, seconds_since_start = 0.0
Iteration 3: norm_delta = 108.73423, step_size = 0.7125, ll = -40227.22948, newton_decrement = 210617.17243, seconds_since_start = 0.1
Iteration 4: norm_delta = 14575.06691, step_size = 0.2095, ll = -1076963.03641, newton_decrement = 106456100.74164, seconds_since_start = 0.1

Any thoughts on what is happening here?
MattB27
@MattB27
Hi @CamDavidsonPilon I have a question about the Efron calculations in CoxPHFitter. Mainly I'm interested in whether the quantity numer is the numerator and denom the denominator of the likelihood equation used for Efron ties (taken from slide 7 of http://myweb.uiowa.edu/pbreheny/7210/f18/notes/11-01.pdf). Thanks in advance, and thank you so much for lifelines!
Cameron Davidson-Pilon
@CamDavidsonPilon
@WillTarte, yes, that's because the bootstrapping + lowess plotting is very slow per variable, so if you have many variables, it can hang for a while. I suggest trying without show_plots and seeing if you can fix the presented problems. (Actually, this gives me the idea of being able to select which variables to check: CamDavidsonPilon/lifelines#730)
@jennyatClover, try decreasing the step size (default 0.95) in fit. For example, cph.fit(..., step_size=0.30) (or decrease further if necessary). I would appreciate it if you could share the dataset with me as well (privately, at cam.davidson.pilon@gmail.com), as datasets that fail convergence are useful to try new methods against.
Cameron Davidson-Pilon
@CamDavidsonPilon

@MattB27 the equation on page 7 of the PDF is not what is implemented. Recall: starting from the likelihood, I take the log, then differentiate. numer and denom in the code refer to the numerator and denominator of the fraction in the resulting score equation, i.e. the equation on page 7 after logging and differentiating.

MattB27
@MattB27
Ok, I can see that now. And if I'm following the logic right, then when there are no shared event times, or when an event time is shared only with censored times, the normal MLE is used, which gives the different numer and denom in the else statement. I'm trying to implement the Breslow tie method (I understand Efron should be preferred), but Breslow might be nice for matching other software that may still have it as the default.
Cameron Davidson-Pilon
@CamDavidsonPilon
Yup, makes sense. Breslow should be much easier to implement (in fact, I wouldn't necessarily use my Efron code as a template, as it's (a) a more complicated method and (b) highly optimized, so less transparent).
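For reference, the Breslow approximation being discussed can be sketched in a few lines (a toy sketch against the textbook formula, not lifelines code; data and function name are made up): the tied deaths at each event time share one risk-set denominator.

```python
import math

def breslow_partial_loglik(durations, events, etas):
    """Breslow-tie Cox partial log-likelihood (toy sketch).

    ll = sum over event times t of
         [ sum of eta_i over deaths at t
           - d_t * log( sum of exp(eta_j) over the risk set at t ) ]
    where eta_i = beta . x_i is the linear predictor.
    """
    ll = 0.0
    for t in sorted({d for d, e in zip(durations, events) if e}):
        deaths = [eta for d, e, eta in zip(durations, events, etas) if d == t and e]
        risk   = [eta for d, eta in zip(durations, etas) if d >= t]   # still at risk at t
        ll += sum(deaths) - len(deaths) * math.log(sum(math.exp(x) for x in risk))
    return ll
```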