
I'm confused by this page: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#goodness-of-fit

The paragraph states that the first image has a better fit than the second image, but the two lines seem to correlate in the second image near perfectly... I also don't understand how to generate the "baseline survival" metric.

Hi @veggiet_gitlab, you know, today I was thinking of removing that blurb from my docs, so I suggest ignoring it completely. If you're interested in model selection, I think a more appropriate approach is the log-likelihood test in the `print_summary()` output.
I'd like some advice. I really like the Cox model and its ability to quantify how different factors might influence the lifetime. But I have a lot of unknowns: my data started being collected at a certain point years ago, and we have people in the system without a known "start point." I read that this is what left censoring is for, but every model I've looked at either has no left-censoring parameter OR has a "starting point" parameter, which suggests I know how long a person was involved before data collection began. I don't, and even if I crawl through old record books I won't have complete knowledge of everyone.

Is there a guideline for creating a probable starting point?

OK, I read a different description of left censoring, and I guess I was wrong: left censoring is for when you don't know exactly when the event happened, but you know it happened before a certain point? Is this true?

If so, my original question is still valid: what do we do with people whose start point we don't know?

Yea, left censoring is best described with an example: a sensor can't detect values less than 0.05, so we know that some observations are less than 0.05, but we don't know their exact value.

In your case, you actually have right-censoring! Let me explain (I hope I understand your problem well enough). You are trying to model lifetimes, let's call this durations. For the first type of subject, where we *do* know their start date, then their observed duration is `end_date (or now) - start_date`

and a 0/1 indicator for whether we observed their end date or not.

For the second type of subject, where we *don't* know their start date, then their observed duration is `end_date (or now) - first observed date`

and their indicator is *always* 0 (since we know they lived longer than what we observed).
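To make that concrete, here's a small sketch (with made-up dates and column names) of how you might build the duration and observed columns for the two types of subject:

```python
from datetime import date

today = date(2019, 5, 1)  # "now"

# Hypothetical subjects; start_date is None when they predate data collection.
subjects = [
    {"start_date": date(2016, 3, 1), "first_observed": date(2016, 3, 1), "end_date": date(2018, 6, 1)},
    {"start_date": None,             "first_observed": date(2015, 1, 1), "end_date": None},  # still active
]

rows = []
for s in subjects:
    start = s["start_date"] or s["first_observed"]   # fall back to first observation
    end = s["end_date"] or today                     # still active -> censor at "now"
    duration = (end - start).days
    # The event counts as observed only when we saw the end *and* know the true
    # start; subjects with an unknown start are always censored, because their
    # true duration is longer than what we measured.
    observed = int(s["end_date"] is not None and s["start_date"] is not None)
    rows.append((duration, observed))
```

The second subject gets a long duration and `observed = 0`, i.e. a lower bound on their lifetime.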

However, there are going to be some restrictions on your model. You can't use "year_joined" as a baseline covariate, since you don't know that for some subjects. Similarly, I don't know how to extend this to time-varying covariates (if you were interested in that).

Also, what do baseline covariates even mean in this context? I don't know, since the second type of subject may have evolving covariates that don't reflect the subjects state when they initially started.

So, I think you can model it, but you'll need to be careful with what variables you include.

Hi, I have a question about the Python lifelines software. I am new to survival analysis, so please bear with me. If you want the survival function, you integrate the hazard function and take the negative exponent? If this is true, how does lifelines handle the integration? I am using scipy's trapz, and the survival curves I get are slightly different from what lifelines predicts. I am wondering if I just have a misunderstanding of how to get survival curves from the hazard function.

Hi @quanthubscl, you're correct that the method you describe is one way to get the survival function. There are other ways, however, and it depends on what model you are using. For example, parametric forms often have a closed-form formula for the integral of the hazard, and lifelines uses that. Kaplan-Meier estimates the survival function directly and doesn't estimate any hazard.
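To see why numerical integration gives slightly different curves than a closed form, here's a toy check (made-up Weibull parameters `lam` and `rho`) comparing trapezoidal integration of the hazard against the closed-form survival function:

```python
import math

lam, rho = 10.0, 1.5  # example Weibull scale and shape

def hazard(t):
    return (rho / lam) * (t / lam) ** (rho - 1)

def survival_closed_form(t):
    # Closed-form cumulative hazard for the Weibull: H(t) = (t/lam)**rho
    return math.exp(-((t / lam) ** rho))

def survival_numeric(t, n=10000):
    # Trapezoidal integration of the hazard from 0 to t
    ts = [t * i / n for i in range(n + 1)]
    hs = [hazard(u) for u in ts]
    cum_hazard = sum((hs[i] + hs[i + 1]) / 2 * (ts[i + 1] - ts[i]) for i in range(n))
    return math.exp(-cum_hazard)
```

With a fine grid the two agree to several decimal places; with a coarse grid (like the event times of a fitted model) you'd expect small discrepancies of the kind you're seeing.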

Can I ask what model you are using?

@CamDavidsonPilon, I am using the Cox Proportional Hazard Model. I am actually following the examples given for the rossi dataset. Mostly, I am trying to make sure that I am doing things and understanding things correctly. I take the baseline hazard function then multiply it by the partial hazard function for a sample. I then integrate this function with scipy and take the negative exponent.

@quanthubscl in the case of the Cox model, we can just cumulatively sum the baseline hazard to get the cumulative baseline hazard. Why? In the Cox model, we actually estimate the cumulative hazard first (using https://stats.stackexchange.com/questions/46532/cox-baseline-hazard), and then take the `.diff` to recover the baseline hazard, so `.cumsum` recovers the original cumulative hazard.
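In toy numbers (made up here), the `.diff`/`.cumsum` round trip looks like:

```python
# Toy cumulative baseline hazard at ordered event times (made-up numbers)
cum_hazard = [0.10, 0.25, 0.45, 0.80]

# .diff(): recover the baseline hazard increments (first entry kept as-is)
baseline_hazard = [cum_hazard[0]] + [
    b - a for a, b in zip(cum_hazard, cum_hazard[1:])
]

# .cumsum(): summing the increments reproduces the cumulative hazard exactly
recovered = []
total = 0.0
for h in baseline_hazard:
    total += h
    recovered.append(total)
```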
@CamDavidsonPilon

So, I'm finally at the place where I'm using "check assumptions" on my cph model. I've got a few variables reported as failing the non-proportional test, and for some I can clearly see that they do, but there are a couple where visual inspection suggests they don't... As I understand it, to pass the test the variable needs to produce a straight line? And in this image the "giving" parameter is clearly showing a straight line, but I'm getting: 1. Variable 'giving' failed the non-proportional test: p-value is <5e-05.

@veggiet_gitlab the left-hand graph does dip in the right tail, which is probably the violation. *However*, it's very very minor, and because you have so much data, the test has enough power to detect even this minor violation. It's safe to ignore this minor violation.

@CamDavidsonPilon Thanks so much for your work on lifelines. Very cool. I was particularly excited to see your last post on SaaS churn. https://dataorigami.net/blogs/napkin-folding/churn.

I've been trying to follow along but have run into a couple of issues. First, I don't see that `PiecewiseExponentialRegressionFitter` exists in lifelines. I do see `PiecewiseExponentialFitter`, however. If I use `PiecewiseExponentialFitter` I get an error:

`object has no attribute 'predict_cumulative_hazard'`

Hey @fredrichards72, the model isn't in lifelines yet (I should have added that to the blog article). It's in a PR right now, and I should merge it soon. CamDavidsonPilon/lifelines#715

One other question: I'm dealing with subscriber data which is right censored (we still have lots of active subscribers whose death events have not been observed). In addition, we have acquired companies with active subscribers over the past few years, and their start dates are unknown. If we acquired a company on Jan 1, 2018, we know that the subscriber start date (birth) was at least that early, but it could have been years before. I believe that would be left censored. Any suggestion for how to handle that?

That's similar to a previous situation talked about in this room: https://gitter.im/python-lifelines/Lobby?at=5ccb47e6375bac74704463e3

the gist is: it's actually a right-censored problem, since you are measuring their life durations and you have censored data. As in the linked comment, the concept of "baseline covariates" is muddy here.

it took me a while to see how it was right-censored. But once I considered that the goal is to measure *durations*, it became more clear.

Yea, like consider someone in that acquired group. You know that their *duration* is at least 495 days (now − Jan 2018), and possibly more. Thus we have a lower bound on their duration -> right censoring

thanks so much

Hi everyone, just wondering if anyone can answer this SO question here: https://stackoverflow.com/questions/56126057/predicting-survival-probability-at-current-time

@CamDavidsonPilon I was wondering, in general, what is the effect of censorship on estimation? *durations* and *observed* are basically passed to any *fit* method, but it's not clear from the provided equations how they are used? Even for the simplest KM estimator, where *n* is the number of subjects at risk and *d* is the number of death events (assuming observed death?), so what about censored observations? Thanks!

Hi @bkos, good question. Generally, censoring makes estimation harder, as we lose information. It might not be obvious from the KM equations, but it affects the denominator $n_i$ - the number of subjects at risk. A censored individual may not be present in this count.
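A tiny hand computation (made-up durations) shows how a censored subject stays in the risk set $n_i$ while alive but never adds to the death count $d_i$:

```python
# Toy data: durations and whether the death event was observed (1) or censored (0)
durations = [2, 3, 3, 5, 7]
observed  = [1, 0, 1, 1, 0]

# Kaplan-Meier by hand: S(t) = prod over event times of (1 - d_i / n_i),
# where n_i counts everyone still at risk just before t.
event_times = sorted({t for t, o in zip(durations, observed) if o == 1})
surv, S = {}, 1.0
for t in event_times:
    n_i = sum(1 for d in durations if d >= t)  # at risk: not yet dead or censored
    d_i = sum(1 for d, o in zip(durations, observed) if d == t and o == 1)
    S *= 1 - d_i / n_i
    surv[t] = S
```

The subject censored at time 3 is in the risk set at times 2 and 3 but contributes no death, and after time 3 they drop out of $n_i$ entirely.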

For parametric models, check out section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf

I am trying to recreate fitting the WeibullAFTFitter in pytorch by writing my own loss function. I am trying to follow the tutorial here in pymc3, but I think that is rather different from the Weibull AFT model in lifelines. Is there some tutorial I can look at to try to implement the lifelines model myself?

If I do manage to use a loss function, it opens up the possibility to use deep learning models instead of linear models, which would be an advantage.

If it helps when I assumed that

`time = beta * x + e`

where e was Gumbel distributed, I ended up with this loss function:

```
import torch

def gumbel_sa_loss(y_pred, targ):
    # targ[..., :1]: 0/1 event indicator; targ[..., 1:]: the target time
    failed = targ[..., :1]
    e = y_pred - targ[..., 1:]
    exp_e = torch.exp(e)
    failed_loss = failed * (exp_e - e)
    censored_loss = -(1 - failed) * torch.log(1 - torch.exp(-exp_e))
    log_lik = failed_loss + censored_loss
    return log_lik.mean()
```

I'm fairly sure though that WeibullAFT models are attempting to do something else to model the time to begin with.

You'll need to implement the likelihood in pytorch, and I think a good summary of parametric survival likelihoods is in section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf

For AFT models, note that it's `log(time) = beta * x + e`

(the log is important)
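As a sketch of what that likelihood looks like (plain Python so it's easy to check; the same math ports directly to pytorch tensors), here is the Weibull AFT negative log-likelihood with observed and censored contributions. `mu` here stands in for the linear predictor `beta . x`, and `sigma` is the scale:

```python
import math

def neg_log_lik(times, observed, mu, sigma):
    """Negative log-likelihood of a Weibull AFT model under the
    parameterization log(T) = mu + sigma * eps, eps ~ Gumbel(min)."""
    nll = 0.0
    for t, o in zip(times, observed):
        z = (math.log(t) - mu) / sigma
        log_s = -math.exp(z)                     # log survival: log S(t) = -exp(z)
        log_f = z + log_s - math.log(sigma * t)  # log density
        # observed events contribute log f; censored ones contribute log S
        nll -= o * log_f + (1 - o) * log_s
    return nll
```

This is the likelihood you would minimize in pytorch; replacing `mu` with the output of a network is what opens up the deep-learning variants you mention.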

:wave: minor lifelines release, v0.21.2 is available. Changelog: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.21.2

Can someone take a look at my question here: https://stackoverflow.com/questions/56214952/upper-limit-on-duration-for-survival-analysis. Sorry for asking so many questions.

@sachinruk, are you able to share the final dataset with me, privately? I'm at cam.davidson.pilon@gmail.com. Datasets that cause problems are a good motivation for internal improvements

Hello All,

I would like to ask for some help getting started with lifelines; there were some simple tasks I could not find a direct way of doing. I am particularly interested in Cox models.

1) I was not able to find a way to retrieve the hazard at time t and the survival at time t for a Cox PH model; I only get information about the baselines and the coefficients (which are for some reason called "hazards_"). Of course I could generate the hazard and the survival with this information, but it would be nice to do it directly. If it's not just my failure to find a direct way, I would gladly contribute it to lifelines.

2) How does one generate the adjusted (considering the covariates) survival curves for a Cox PH model, for the data used to fit the model, before doing any prediction? Apparently there is only support for plotting the baseline survival function.

```
from math import exp

def hazard(phdata, coef, baseline, i, t):
    # covariates for subject i, excluding the outcome columns
    cov = phdata.iloc[i].drop(labels=['censored?', 'eventdate'])
    base = baseline.at[float(t), 'baseline hazard']
    haz = base * exp(cov.dot(coef))
    return haz
```

this is what I meant by the hazard at t, for instance. I could not find where it was implemented; is this equivalent to `predict_partial_hazard(X)`? Thanks