I'd like some advice. I really like the Cox model, in particular its ability to quantify how different factors influence the lifetime. But I have a lot of unknowns: my data started getting collected at a certain point years ago, and we have people in the system without a known "start point." I read that this is what left censoring is for, but every model I've looked at either has no left-censoring parameter OR it has a "starting point" parameter, which suggests I know how long a person was involved before data collection began. I don't know that, and even if I crawl through old record books I won't have complete knowledge of everyone.
Is there a guideline for creating a probable starting point?
OK, I read a different description of left censoring, and I guess I was wrong: left censoring is for when you don't know exactly when the event happened, but you do know it happened before a certain point? Is this true?
If so, my original question still stands: what do I do with people whose start point we don't know?
Yea, left censoring is best described with an example: a sensor can't detect values less than 0.05, so we know that some observations are less than 0.05, but we don't know their exact value.
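A minimal sketch of how such detection-limit data is usually encoded (the 0.05 threshold is from the example above; the function and field names are my own, just for illustration):

```python
# Encode sensor readings with a lower detection limit of 0.05.
# Readings below the limit are left-censored: we record the limit itself
# and flag the observation as not exactly observed.
DETECTION_LIMIT = 0.05

def encode_reading(raw_value):
    """Return (recorded_value, observed_exactly) for one sensor reading."""
    if raw_value < DETECTION_LIMIT:
        return DETECTION_LIMIT, False  # true value is somewhere below 0.05
    return raw_value, True

readings = [0.01, 0.20, 0.03, 0.75]
encoded = [encode_reading(v) for v in readings]
# the 0.01 and 0.03 readings come back as (0.05, False)
```

A left-censoring-aware model then treats the `False` rows as "the true value is at most 0.05" rather than as exact measurements.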
In your case, you actually have right-censoring! Let me explain (I hope I understand your problem well enough). You are trying to model lifetimes; let's call these durations. For the first type of subject, where we do know their start date, the observed duration is

`end_date (or now) - start_date`, plus a 0/1 flag for whether we observed their end date or not.

For the second type of subject, where we don't know their start date, the observed duration is

`end_date (or now) - first_observed_date`, with the flag always 0 (since we know they lived longer than what we observed).
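That bookkeeping can be sketched in plain Python (the function and argument names are mine, not from any library):

```python
from datetime import date

def duration_and_event(start_date, first_observed, end_date, today):
    """
    Return (duration_in_days, event_observed) for one subject.

    If start_date is known, the duration runs from start_date and the flag
    is 1 only when we actually saw the end. If start_date is unknown, the
    duration runs from first_observed and the flag is always 0, because
    the subject certainly lived longer than what we observed.
    """
    origin = start_date if start_date is not None else first_observed
    end = end_date if end_date is not None else today
    duration = (end - origin).days
    if start_date is None:
        return duration, 0  # right-censored: the true duration is longer
    return duration, 1 if end_date is not None else 0

today = date(2019, 5, 1)
# Known start, observed end:
print(duration_and_event(date(2018, 1, 1), None, date(2019, 1, 1), today))  # (365, 1)
# Unknown start, still active: always censored.
print(duration_and_event(None, date(2018, 1, 1), None, today))  # (485, 0)
```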
Also, what do baseline covariates even mean in this context? I'm not sure, since the second type of subject may have evolving covariates that don't reflect the subject's state when they initially started.
So, I think you can model it, but you'll need to be careful with what variables you include.
Hi @quanthubscl, you're correct that the method you describe is one way to get the survival function. There are other ways, however, and it depends on what model you are using. For example, parametric forms often have a closed-form formula for the integral of the hazard, and lifelines uses that. Kaplan-Meier estimates the survival function directly and doesn't estimate any hazard.
Can I ask what model you are using?
`.diff` to recover the baseline hazard, so `.cumsum` recovers the original cumulative hazard
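The round trip is easy to check on a toy cumulative hazard (assuming a pandas Series, which is the shape of thing a `predict_cumulative_hazard`-style method would return; the values here are made up):

```python
import pandas as pd

# Toy cumulative hazard evaluated on a time grid.
cum_haz = pd.Series([0.0, 0.10, 0.25, 0.45, 0.70], index=[0, 1, 2, 3, 4])

# .diff() recovers the per-step hazard increments (the first entry is NaN,
# since there is nothing to difference against).
baseline_hazard = cum_haz.diff()

# .cumsum() of those increments rebuilds the original cumulative hazard;
# filling the leading NaN with the initial value closes the round trip.
rebuilt = baseline_hazard.fillna(cum_haz.iloc[0]).cumsum()

assert (rebuilt - cum_haz).abs().max() < 1e-12
```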
@CamDavidsonPilon Thanks so much for your work on lifelines. Very cool. I was particularly excited to see your last post on SaaS churn. https://dataorigami.net/blogs/napkin-folding/churn.
I've been trying to follow along but have run into a couple of issues. First, I don't see that `PiecewiseExponentialRegressionFitter` exists in lifelines. I do see `PiecewiseExponentialFitter`, however. If I use `PiecewiseExponentialFitter` I get an error:

`'object has no attribute 'predict_cumulative_hazard'`
It took me a while to see how it was right-censored. But once I considered that the goal is to measure durations, it became clearer.
Yea, like consider someone in that acquired group. You know that their duration is at least 495 days (now - Jan 2018), and certainly more. Thus we have a lower bound on their duration -> right censoring.
Hi @bkos, good question. Generally, censoring makes estimation harder, as we lose information. It might not be obvious from the KM equations, but it affects the denominator $n_i$, the number of subjects at risk: a censored individual drops out of this count after their censoring time.
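A hand-rolled Kaplan-Meier on five subjects makes the effect on $n_i$ concrete (pure Python, just for illustration; the subject censored at t=3 is still in the risk set at t=3 but gone by t=4):

```python
def kaplan_meier(durations, events):
    """Return {time: survival estimate} at each observed death time."""
    survival = 1.0
    out = {}
    for t in sorted(set(durations)):
        # n_i: everyone whose observed duration reaches t is still at risk,
        # so a subject censored before t is NOT in this count.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            out[t] = survival
    return out

# durations and event flags for five subjects (1 = death observed, 0 = censored)
print(kaplan_meier([2, 3, 3, 4, 5], [1, 0, 1, 1, 0]))
# roughly {2: 0.8, 3: 0.6, 4: 0.3}
```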
For parametric models, check out section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf
`time = beta * x + e` where `e` was Gumbel distributed, I ended up with this loss function:
```python
import torch

def gumbel_sa_loss(y_pred, targ):
    # targ packs the 0/1 event flag and the observed (log) time
    failed = targ[..., :1]
    e = y_pred - targ[..., 1:]
    exp_e = torch.exp(e)
    # contribution from observed events
    failed_loss = failed * (exp_e - e)
    # contribution from right-censored observations
    censored_loss = -(1 - failed) * torch.log(1 - torch.exp(-exp_e))
    log_lik = failed_loss + censored_loss
    return log_lik.mean()
```
You'll need to implement the likelihood in pytorch, and I think a good summary of parametric survival likelihoods is in section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf
For AFT models, note that it's `log(time) = beta * x + e` (the log is important).
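A quick simulation of that relationship, with Gumbel noise on the log scale (which is what makes it a Weibull AFT). The coefficient `beta` and sample size are made up; this only sketches the data-generating process, not a fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5000, 0.7

x = rng.normal(size=n)
# log of an Exp(1) draw is standard min-Gumbel noise
e = np.log(rng.exponential(size=n))
log_time = beta * x + e       # the AFT model lives on the LOG scale
time = np.exp(log_time)       # actual durations are exp of the linear model

# Crude sanity check: regressing log(time) on x should recover beta
beta_hat = np.cov(log_time, x)[0, 1] / np.var(x)
print(beta_hat)  # should land near 0.7
```

Fitting on `time` directly without the log would be a different (and wrong) model; the multiplicative "accelerated" effect on time is exactly the additive effect on `log(time)`.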