    sam
    @veggiet_gitlab
    I'm confused by this page: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#goodness-of-fit
    The paragraph states that the first image has a better fit than the second image, but the two lines in the second image seem to correlate nearly perfectly... I also don't understand how to generate the "baseline survival" metric.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hi @veggiet_gitlab, you know, today I was thinking of removing that blurb from my docs, so I suggest ignoring it completely. If you are interested in model selection, I think a more appropriate way is using the log-likelihood test in the print_summary() output.
    sam
    @veggiet_gitlab
    Thank you
    Ossama Alshabrawy
    @OssamaAlshabrawy
    Hi I was just wondering whether I could use CoxPHFitter() with more than one duration column. Is that possible?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @OssamaAlshabrawy no, but I'm confused what you are trying to model? What might more than one duration column represent?
    sam
    @veggiet_gitlab

    I'd like some advice. I really like the Cox model, in particular its ability to quantify how different factors might influence the lifetime. But I have a lot of unknowns: my data started getting collected at a certain point years ago, and we have people in the system without a known "start point." I read that this is what left censoring is for, but every model I've looked at either has no left-censoring parameter OR has a "starting point" parameter, which suggests that I know how long a person was involved before data collection began, which I don't. Even if I crawl through old record books, I won't have complete knowledge of everyone.

    Is there a guideline for creating a probable starting point?

    sam
    @veggiet_gitlab

    OK, I read a different description of left censoring, and I guess I was wrong: left censoring is for when you don't know exactly when the event happened, but you know it happened before a certain point? Is that right?

    If so, my original question still stands: what do we do with people whose start point we don't know?

    Cameron Davidson-Pilon
    @CamDavidsonPilon

    Yea, left censoring is best described with an example: a sensor can't detect values less than 0.05, so we know that some observations are less than 0.05, but we don't know their exact value.

    In your case, you actually have right-censoring! Let me explain (I hope I understand your problem well enough). You are trying to model lifetimes; let's call these durations. For the first type of subject, where we do know their start date, the observed duration is end_date (or now) - start_date, together with a 0/1 indicator for whether we observed their end date or not.
    For the second type of subject, where we don't know their start date, the observed duration is end_date (or now) - first observed date, and the indicator is always 0 (since we know they lived longer than what we observed).

    However, there are going to be some restrictions on your model. You can't use "year_joined" as a baseline covariate, since you don't know that for some subjects. Similarly, I don't know how to extend this to time-varying covariates (if you were interested in that).

    Also, what do baseline covariates even mean in this context? I don't know, since the second type of subject may have evolving covariates that don't reflect the subject's state when they initially started.

    So, I think you can model it, but you'll need to be careful with what variables you include.
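
    (For illustration, here is a minimal sketch of the bookkeeping described above; the column names start_date, first_observed, and end_date are made up, and "now" stands in for the analysis date:)

    import pandas as pd

    now = pd.Timestamp("today").normalize()

    def duration_and_event(row):
        # subjects with a known start measure from start_date; otherwise from the first date we saw them
        origin = row["start_date"] if pd.notna(row["start_date"]) else row["first_observed"]
        end = row["end_date"] if pd.notna(row["end_date"]) else now
        # event = 1 only when both the start is known and the end was observed;
        # unknown-start subjects are right-censored, since we only have a lower bound on their duration
        event = int(pd.notna(row["start_date"]) and pd.notna(row["end_date"]))
        return pd.Series({"duration": (end - origin).days, "event": event})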

    quanthubscl
    @quanthubscl
    Hi, I have a question on the python lifelines software. I am new to survival analysis so please bear with me. If you want the survival function, should you integrate the hazard function and take the negative exponent? If this is true, how does lifelines handle the integration? I am using scipy's trapz, and the survival curves I get are slightly different from what lifelines predicts. I am wondering if maybe I just have a misunderstanding of how to get survival curves from the hazard function.
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    Hi @quanthubscl, you're correct that the method you describe is a way to get the survival function. There are other ways, however, and it depends on what model you are using. For example, parametric forms often have a closed-form formula for the integral of the hazard, and lifelines uses that. Kaplan-Meier estimates the survival function directly and doesn't estimate any hazard.

    Can I ask what model you are using?
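
    (For illustration, a minimal sketch showing that for a parametric fitter the survival function is exp(-H(t)), with H the closed-form cumulative hazard, so no numerical integration is involved; the "Weibull_estimate" column name and the exact agreement are my assumptions about the output:)

    import numpy as np
    from lifelines import WeibullFitter
    from lifelines.datasets import load_rossi

    rossi = load_rossi()
    wf = WeibullFitter().fit(rossi["week"], event_observed=rossi["arrest"])

    # parametric models expose a closed-form cumulative hazard H(t), so S(t) = exp(-H(t))
    H = wf.cumulative_hazard_["Weibull_estimate"]
    S = np.exp(-H)
    print(np.allclose(S.values, wf.survival_function_["Weibull_estimate"].values))  # expect True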

    quanthubscl
    @quanthubscl
    @CamDavidsonPilon, I am using the Cox Proportional Hazard Model. I am actually following the examples given for the rossi dataset. Mostly, I am trying to make sure that I am doing things and understanding things correctly. I take the baseline hazard function then multiply it by the partial hazard function for a sample. I then integrate this function with scipy and take the negative exponent.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @quanthubscl in the case of the Cox model, we can just cumulatively sum the baseline hazard to get the cumulative baseline hazard. Why? In the Cox model, we actually estimate the cumulative baseline hazard first (using https://stats.stackexchange.com/questions/46532/cox-baseline-hazard), and then take the .diff() to recover the baseline hazard, so .cumsum() recovers the original cumulative hazard.
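
    (To make that concrete, a rough sketch on the rossi example from the docs; treat the exact round trip as an assumption about how the attributes line up rather than a guarantee:)

    import numpy as np
    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    rossi = load_rossi()
    cph = CoxPHFitter().fit(rossi, duration_col="week", event_col="arrest")

    # the baseline hazard is the .diff() of the cumulative baseline hazard,
    # so .cumsum() of the baseline hazard recovers the cumulative version
    cum_base = cph.baseline_hazard_.cumsum()

    # survival for one subject: S(t|x) = exp(-partial_hazard(x) * cumulative baseline hazard(t))
    x = rossi.iloc[[0]]
    partial = float(cph.predict_partial_hazard(x).squeeze())
    surv_by_hand = np.exp(-partial * cum_base.values.ravel())
    surv_lifelines = cph.predict_survival_function(x).values.ravel()
    print(np.allclose(surv_by_hand, surv_lifelines))  # expect True, up to floating point
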
    davidrindt
    @davidrindt
    Hi! How do we access the p-value of the likelihood ratio test? I can see it printed after cph.print_summary(), but I don't know where it is stored.
    @CamDavidsonPilon
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @davidrindt ah, yea this is something I'm thinking of exposing differently. ATM you can access it using _, _, log2p = cph._compute_likelihood_ratio_test()
    sam
    @veggiet_gitlab
    [attached image: Schoenfeld residuals plot for the 'giving' variable]
    So, I'm finally at the point where I'm using check_assumptions on my cph model. I've got a few variables reported as "failed the non-proportional test" that I can see clearly do fail, but there are a couple where visual inspection doesn't suggest they do... As I understand it, to pass the test the variable needs to produce a straight line? And in this image the "giving" parameter is clearly showing a straight line, but I'm getting: 1. Variable 'giving' failed the non-proportional test: p-value is <5e-05.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @veggiet_gitlab the left-hand graph does dip in the right tail, which is probably the violation. However, it's very very minor, and because you have so much data, the test has enough power to detect even this minor violation. It's safe to ignore this minor violation.
    fredrichards72
    @fredrichards72

    @CamDavidsonPilon Thanks so much for your work on lifelines. Very cool. I was particularly excited to see your last post on SaaS churn. https://dataorigami.net/blogs/napkin-folding/churn.

    I've been trying to follow along but have run into a couple of issues. First, I don't see that 'PiecewiseExponentialRegressionFitter' exists in lifelines. I do see 'PiecewiseExponentialFitter', however. If I use 'PiecewiseExponentialFitter' I get an error:
    object has no attribute 'predict_cumulative_hazard'

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hey @fredrichards72, the model isn't in lifelines yet (I should have added that to the blog article). It's in a PR right now, and I should merge it soon. CamDavidsonPilon/lifelines#715
    fredrichards72
    @fredrichards72
    Got it. Thanks!
    One other question: I'm dealing with subscriber data which is right censored (we still have lots of active subscribers whose death events have not been observed). In addition, we have acquired companies with active subscribers over the past few years, and their start dates are unknown. If we acquired a company on Jan 1, 2018, we know that the subscriber's start date (birth) was at least that early, but it could have been years before. I believe that would be left censored. Any suggestion for how to handle that?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    That's similar to a previous situation talked about in this room: https://gitter.im/python-lifelines/Lobby?at=5ccb47e6375bac74704463e3
    the gist is: it's actually a right-censored problem, since you are measuring their life durations, and you have censored data. Like in the comment linked, the concept of "baseline covariates" is muddy and not clear
    fredrichards72
    @fredrichards72
    So it is BOTH right and left censored?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    just right-censored
    fredrichards72
    @fredrichards72
    even though I have a mix of unobserved deaths and births?
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    it took me a while to see how it was right-censored. But once I considered that the goal is to measure durations, it became clearer.

    Yea, consider someone in that acquired group. You know that their duration is at least 495 days (now - Jan 2018), and certainly more. Thus we have a lower bound on their duration -> right censoring

    fredrichards72
    @fredrichards72
    ahh...that makes sense...
    thanks so much
    Sachin Abeywardana
    @sachinruk
    Hi everyone, just wondering if anyone can answer this SO question here: https://stackoverflow.com/questions/56126057/predicting-survival-probability-at-current-time
    Bojan Kostic
    @bkos
    @CamDavidsonPilon I was wondering, in general, what is the effect of censorship on estimation? durations and observed are basically passed to any fit method, but it's not clear from the provided equations how they are used. Even for the simplest KM estimator, n is the number of subjects at risk and d is the number of death events (observed deaths, I assume?), so what about censored observations? Thanks!
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    Hi @bkos, good question. Generally, censorship makes estimation harder, as we lose information. It might not be obvious from the KM equations, but it affects the denominator $n_i$, the number of subjects at risk: a censored individual may no longer be present in this count.
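
    (For reference, the standard Kaplan-Meier product-limit form makes this explicit, with $d_i$ observed deaths and $n_i$ subjects at risk at event time $t_i$:)

    $$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

    A subject censored before $t_i$ is dropped from $n_i$ but never counted in $d_i$, which is how censoring enters the estimate.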

    For parametric models, check out section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf

    Sachin Abeywardana
    @sachinruk
    Thanks for answering my previous question.
    I am trying to recreate fitting the WeibullAFTFitter in pytorch by writing my own loss function. I am trying to follow the tutorial here in pymc3, but I think that is rather different from the Weibull AFT model in lifelines. Is there some tutorial I can look at to try and implement the lifelines model myself?
    If I do manage to write a loss function, it opens up the possibility of using deep learning models instead of linear models, which would be an advantage.
    Sachin Abeywardana
    @sachinruk
    If it helps: when I assumed that time = beta * x + e, where e was Gumbel distributed, I ended up with this loss function:
    import torch

    def gumbel_sa_loss(y_pred, targ):
        # targ[..., :1] is the event indicator (1 = death observed, 0 = censored),
        # targ[..., 1:] is the observed time
        failed = targ[..., :1]

        e = y_pred - targ[..., 1:]   # residual between prediction and observed time
        exp_e = torch.exp(e)
        failed_loss = failed * (exp_e - e)                                 # term applied to observed events
        censored_loss = -(1 - failed) * torch.log(1 - torch.exp(-exp_e))   # term applied to censored rows
        log_lik = failed_loss + censored_loss
        return log_lik.mean()
    I'm fairly sure though that WeibullAFT models are attempting to do something else to model the time to begin with.
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    You'll need to implement the likelihood in pytorch, and I think a good summary of parametric survival likelihoods is in section 2 of this article: https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf

    For AFT models, note that it's log(time) = beta * x + e (the log is important)
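
    (To make that concrete, a minimal sketch of a right-censored Weibull AFT negative log-likelihood in pytorch; the function and argument names are made up, and it uses the standard parametrization log T = beta*x + sigma*W with W following the minimum extreme-value (Gumbel) distribution:)

    import torch

    def weibull_aft_neg_log_lik(log_t, event, x, beta, log_sigma):
        # log_t: (n, 1) log durations; event: (n, 1) 1 = observed, 0 = right-censored
        # x: (n, p) covariates; beta: (p, 1) coefficients; log_sigma: scalar log-scale
        sigma = torch.exp(log_sigma)
        mu = x @ beta                       # location on the log-time scale
        z = (log_t - mu) / sigma            # standardized residual
        log_density = z - torch.exp(z) - log_sigma   # log f(log t), used for observed events
        log_survival = -torch.exp(z)                  # log S(log t), used for censored rows
        log_lik = event * log_density + (1 - event) * log_survival
        return -log_lik.mean()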

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :wave: minor lifelines release, v0.21.2 is available. Changelog: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.21.2
    Sachin Abeywardana
    @sachinruk
    Can someone take a look at my question here: https://stackoverflow.com/questions/56214952/upper-limit-on-duration-for-survival-analysis. Sorry for asking so many questions.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @sachinruk, are you able to share the final dataset with me, privately? I'm at cam.davidson.pilon@gmail.com. Datasets that cause problems are a good motivation for internal improvements
    Dan Guim
    @dcguim

    Hello All,

    I would like to ask for some help getting started with lifelines; there were some simple tasks I could not find a direct way of doing. I am particularly interested in Cox models.

    1) I was not able to find a way to retrieve the hazard at time t and the survival at time t for a Cox PH model; I only get the information about the baselines and the coefficients (which are, for some reason, called "hazards_"). Of course I could generate the hazards and the survival with this information, but it would be nice to do it directly. If it is not just my failure to find such a direct way, I would gladly contribute it to lifelines.

    2) How does one generate the adjusted (covariate-dependent) survival curves for a Cox PH model, for the data used to fit the model, before doing any prediction? Apparently there is currently only support for plotting the baseline survival function.

    Dan Guim
    @dcguim
    from math import exp

    def hazard(phdata, coef, baseline, i, t):
        # covariates for subject i, dropping the non-covariate columns
        cov = phdata.iloc[i].drop(labels=['censored?', 'eventdate'])
        # baseline hazard evaluated at time t
        base = baseline.at[float(t), 'baseline hazard']
        # hazard(t | x_i) = baseline_hazard(t) * exp(coef . x_i)
        haz = base * exp(cov.dot(coef))
        return haz
    this is what I meant by the hazard at t, for instance. I could not find where it is implemented; is this equivalent to predict_partial_hazard(X)?
    Thanks
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    You can see how the terms "partial", "log-partial", etc. relate using this formula: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#cox-s-proportional-hazard-model
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    It looks like you want the hazard per subject over time (not the cumulative hazard). The most appropriate way is to do something like cph.predict_cumulative_hazard(phdata).diff(), as the cumulative hazard and the hazard are related in that manner. Using predict_cumulative_hazard is helpful since it takes care of any strata arguments and any de-meaning necessary.
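
    (A minimal sketch, using the rossi dataset in place of phdata:)

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    rossi = load_rossi()
    cph = CoxPHFitter().fit(rossi, duration_col="week", event_col="arrest")

    # cumulative hazard per subject (one column per row of the input),
    # then the per-interval hazard via .diff(); the first row is NaN after .diff()
    cum_haz = cph.predict_cumulative_hazard(rossi)
    haz = cum_haz.diff()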

    You mentioned to me privately that you wondered why I subtract the mean of the covariates in the prediction methods. The reason is that the algorithm is computed using demeaned data, and hence the baseline hazard "accounts" for the demeaned data and grows or shrinks appropriately (I can't think of a better way to say this without a whiteboard/latex). From the POV of the hazard, then, the values are the same. However, the log-partial hazard and the partial hazard will be different. This is okay, as there is no interpretation of the (log-)partial hazard without the baseline hazard (another way to think about this: it's unit-less). The only use of the (log-)partial hazard is determining rankings. That is, multiplying by the baseline hazard is necessary to recover the hazards.

    All this to say: the Cox model can be confusing, and ((log-)partial) hazards are not intuitive. I am more and more of a fan of AFT models now: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html#accelerated-failure-time-models
    Dan Guim
    @dcguim
    @CamDavidsonPilon, I believe there is an interpretation of the "(log-)partial hazard without the baseline": namely, when you compute the hazard ratio, the baselines cancel out, because essentially you divide the hazard of one individual by the hazard of another.
    How do you compute the log-likelihood to estimate the beta coefficients, if not via the hazard ratio?
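
    (For reference, the Cox partial likelihood that the coefficient estimation maximizes; the baseline hazard cancels out of each factor, and $R(t_i)$ denotes the risk set at event time $t_i$:)

    $$L(\beta) = \prod_{i:\ \text{event observed}} \frac{\exp(\beta^\top x_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)}$$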