I'd like some advice. I really like the Cox model and its ability to quantify how different factors might influence the lifetime. But I have a lot of unknowns: my data started being collected at a certain point years ago, and we have people in the system without a known "start point." I read that this is what left censoring is for, but in all the models I've looked at, it either has no left-censoring parameter OR it has a "starting point" parameter, which suggests that I know how long a person has been involved before data collection began, which I don't. Even if I crawl through old record books, I won't have complete knowledge of everyone.
Is there a guideline for creating a probable starting point?
OK, I read a different description of left censoring, and I guess I was wrong: left censoring is for when you don't know exactly when the event happened, but you know it happened before a certain point? Is this true?
If so, my original question still stands: what do I do with people whose start point we don't know?
Yea, left censoring is best described with an example: a sensor can't detect values less than 0.05, so we know that some observations are less than 0.05, but we don't know their exact value.
In your case, you actually have right-censoring! Let me explain (I hope I understand your problem well enough). You are trying to model lifetimes; let's call these durations. For the first type of subject, where we do know the start date, the observed duration is end_date (or now) - start_date,
along with a 0/1 flag for whether we observed their end date or not.
For the second type of subject, where we don't know the start date, the observed duration is end_date (or now) - first_observed_date,
and the flag is always 0, since we know they lived longer than what we observed.
Also, what do baseline covariates even mean in this context? I don't know, since the second type of subject may have evolving covariates that don't reflect the subject's state when they initially started.
So, I think you can model it, but you'll need to be careful with what variables you include.
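Not lifelines' official recipe, just a minimal sketch of that setup in pandas, assuming a DataFrame df with hypothetical columns start_date, first_observed_date, end_date (NaT while still active), observed_end (0/1), and a covariate like age:

import pandas as pd
from lifelines import CoxPHFitter

now = pd.Timestamp.today()
end = df['end_date'].fillna(now)                  # treat still-active subjects as ending "now"
known_start = df['start_date'].notna()

# Duration: from the true start if we know it, otherwise from the first observed date.
duration = (end - df['start_date'].fillna(df['first_observed_date'])).dt.days

# Unknown-start subjects are right-censored: their true duration is at least the
# observed one, so their event flag is forced to 0.
event = df['observed_end'].where(known_start, 0)

model_df = df[['age']].assign(duration=duration, event=event)
CoxPHFitter().fit(model_df, duration_col='duration', event_col='event')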
Hi @quanthubscl, you're correct that the method you describe is one way to get the survival function. There are other ways, however, and it depends on what model you are using. For example, parametric forms often have a closed-form formula for the integral of the hazard, and lifelines uses that. Kaplan-Meier estimates the survival function directly and doesn't estimate any hazard.
Can I ask what model you are using?
You can apply .diff() to recover the baseline hazard, so .cumsum() recovers the original cumulative hazard.
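A minimal sketch of that relationship, assuming a fitted CoxPHFitter called cph (the variable names are mine):

import numpy as np

H0 = cph.baseline_cumulative_hazard_    # cumulative baseline hazard H0(t), indexed by time
h0 = H0.diff()                          # .diff() gives the discrete baseline hazard increments
h0.iloc[0] = H0.iloc[0]                 # the first increment is just the first cumulative value
recovered = h0.cumsum()                 # .cumsum() recovers the original cumulative hazard
S0 = np.exp(-H0)                        # and the baseline survival is S0(t) = exp(-H0(t))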
@CamDavidsonPilon Thanks so much for your work on lifelines. Very cool. I was particularly excited to see your last post on SaaS churn. https://dataorigami.net/blogs/napkin-folding/churn.
I've been trying to follow along but have run into a couple of issues. First, I don't see that 'PiecewiseExponentialRegressionFitter' exists in lifelines. I do see 'PiecewiseExponentialFitter', however. If I use 'PiecewiseExponentialFitter' I get an error:
object has no attribute 'predict_cumulative_hazard'
It took me a while to see how it was right-censored. But once I considered that the goal is to measure durations, it became clearer.
Yea, like consider someone in that acquired group. You know that their duration is at least 495 days (now - Jan 2018), and certainly more. Thus we have a lower bound on their duration -> right censoring.
Hi @bkos, good question. Generally, censoring makes estimation harder, as we lose information. It might not be obvious from the KM equations, but it affects the denominator $n_i$ - the number of subjects at risk. A censored individual may not be present in this count.
For parametric models, check out section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf
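To make the effect on $n_i$ concrete, here is a small sketch with made-up durations (the toy numbers are mine, not from the thread):

from lifelines import KaplanMeierFitter

T = [2, 3, 4, 5, 6]      # durations
E = [1, 0, 1, 1, 1]      # 1 = event observed, 0 = censored (the subject at t=3)

kmf = KaplanMeierFitter().fit(T, E)
# The 'at_risk' column is n_i: the censored subject leaves the risk set at t=3
# without ever contributing an event.
print(kmf.event_table)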
Assuming the model
time = beta * x + e
where e was Gumbel distributed, I ended up with this loss function:

import torch

def gumbel_sa_loss(y_pred, targ):
    # targ packs [event flag, observed (log-)time] per row; y_pred is the model's prediction.
    failed = targ[..., :1]             # 1 if the event was observed, 0 if censored
    e = y_pred - targ[..., 1:]         # residual between prediction and observation
    exp_e = torch.exp(e)
    failed_loss = failed * (exp_e - e)                                 # contribution of observed events
    censored_loss = -(1 - failed) * torch.log(1 - torch.exp(-exp_e))   # contribution of censored rows
    log_lik = failed_loss + censored_loss                              # despite the name, a loss to minimize
    return log_lik.mean()
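For what it's worth, a hypothetical usage sketch; the tensor layout is inferred from the indexing above (column 0 = event flag, column 1 = log-duration), so adjust if your layout differs:

log_t = torch.log(torch.tensor([[5.0], [12.0], [3.0]]))   # observed (log) durations
event = torch.tensor([[1.0], [0.0], [1.0]])               # 1 = event observed, 0 = censored
targ = torch.cat([event, log_t], dim=-1)                  # shape (n, 2)

y_pred = torch.zeros(3, 1, requires_grad=True)            # stand-in for a network's output
loss = gumbel_sa_loss(y_pred, targ)
loss.backward()                                           # gradients flow back to y_pred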
You'll need to implement the likelihood in pytorch, and I think a good summary of parametric survival likelihoods is in section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf
For AFT models, note that it's log(time) = beta * x + e
(the log is important)
Hello All,
I would like to ask for some help getting started with lifelines; there were some simple tasks I could not find a direct way of doing. I am particularly interested in Cox models.
1) I was not able to find a way to retrieve the hazard at time t and the survival at time t for a Cox PH model; I only get the information about the baselines and the coefficients (which are for some reason called "hazards_"). Of course I could generate the hazards and the survival with this information, but it would be nice to do it directly. If it is not just my failure to find this direct way, I would gladly contribute it to lifelines.
2) How does one generate the adjusted (covariate-aware) survival curves for a Cox PH model, for the data used to fit the model, before I do any prediction? Apparently there is currently only support for plotting the baseline survival function.
from math import exp

def hazard(phdata, coef, baseline, i, t):
    # covariates of subject i, with the non-covariate columns dropped
    cov = phdata.iloc[i].drop(labels=['censored?', 'eventdate'])
    # baseline hazard evaluated at time t
    base = baseline.at[float(t), 'baseline hazard']
    # Cox PH: h_i(t) = h_0(t) * exp(beta . x_i)
    haz = base * exp(cov.dot(coef))
    return haz
This is what I meant by the hazard at t, for instance. I could not find where it was implemented; is this equivalent to predict_partial_hazard(X)?
It looks like you want the hazard per subject over time (not the cumulative hazard). The most appropriate way is to do something like cph.predict_cumulative_hazard(phdata).diff()
as the cumulative hazard and the hazard are related in that manner. Using predict_cumulative_hazard
is helpful since it takes care of any strata arguments and any de-meaning necessary.
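A small sketch of that suggestion, assuming a fitted CoxPHFitter named cph and the same phdata covariate DataFrame:

cum_haz = cph.predict_cumulative_hazard(phdata)   # index: event times, columns: subjects
haz = cum_haz.diff()                              # per-subject hazard increment at each time
haz.iloc[0] = cum_haz.iloc[0]                     # first increment is the first cumulative value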
You asked me privately why I subtract the mean of the covariates in the prediction methods. The reason is that the algorithm is computed using demeaned data, and hence the baseline hazard "accounts" for the demeaned data and grows or shrinks appropriately (I can't think of a better way to say this without a whiteboard/latex). From the POV of the hazard then, the values are the same. However, the log-partial hazard and the partial hazard will be different. This is okay, as there is no interpretation of the (log-)partial hazard without the baseline hazard (another way to think about this: it's unit-less). The only use of the (log-)partial hazard is determining rankings. That is, multiplying by the baseline hazard is necessary to recover the hazards.
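As a hypothetical illustration of that last point (again assuming a fitted cph and phdata):

ranks = cph.predict_partial_hazard(phdata)         # unit-less: only the ordering of subjects is meaningful
hazards = cph.predict_cumulative_hazard(phdata)    # baseline folded back in, so these are actual cumulative hazards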