    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @WillTarte, yes, that's because the bootstrapping + lowess plotting is very slow per variable, so if you have many variables, it can hang for a while. I suggest trying without show_plots and seeing if you can fix the presented problems (actually, this gives me the idea of being able to select what variables to check). CamDavidsonPilon/lifelines#730
    @jennyatClover, try decreasing the step size (default 0.95) in fit. For example, cph.fit( ..., step_size=0.30) (or decrease more if necessary). I would appreciate it if you could share the dataset with me as well (privately, at cam.davidson.pilon@gmail.com), as datasets that fail convergence are useful to try new methods against.
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    @MattB27 the equation on page 7 of the PDF is not what is implemented. Recall: starting from the MLE, I take the log, then differentiate. numer and denom in the code refer to the numerator and denominator in the fraction here: [image]

    which is the resulting equation after logging + differentiating the eq on page 7

    MattB27
    @MattB27
    Ok, I can see that now. And if I'm following the logic right, then when there are no shared event times, or when the event is shared with censored times, the normal MLE is used, which gives the different numer and denom in the else statement. I'm trying to implement the Breslow tie method (I understand Efron should be preferred), but Breslow might be nice for matching other software that may still have it as the default.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Yup, makes sense. Breslow should be much easier to implement (in fact, I wouldn't necessarily use my Efron code as a template, as it's a) a more complicated method and b) highly optimized, so less transparent).
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :wave: a minor version of lifelines was released, with some quality-of-life improvements. https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.21.3
    Charlene Chambliss
    @blissfulchar_twitter

    What's the best way to save a lifelines model (or is this not possible)? I'd like to automate the model to make predictions daily, but retrain only weekly. The model in question is a lognormal AFT.

    I tried to use joblib, but it threw a PicklingError: Can't pickle <function unary_to_nary.<locals>.nary_operator.<locals>.nary_f at 0x1a378e9f28>: it's not found as autograd.wrap_util.unary_to_nary.<locals>.nary_operator.<locals>.nary_f

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @blissfulchar_twitter joblib doesn't work well with autograd, a lib we use. dill is my recommended package to use
    Charlene Chambliss
    @blissfulchar_twitter
    @CamDavidsonPilon Thanks! I'll check out dill :)
    Jenny Yang
    @jennyatClover
    @CamDavidsonPilon :
    1. Thanks for the response - I did try decreasing the step size, and it didn't help. Unfortunately I can't send you the data, as it's covered under HIPAA.
    2. Out of curiosity - did you end up exploring using autograd for the Cox model?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @jennyatClover for 2., I did, and it wasn't faster (actually much slower).
    For 1., hm, so strange. I am surprised that even reducing it to a single covariate still makes it fail. Is there a constant column in the dataframe?
    Jenny Yang
    @jennyatClover
    1. No, I removed the intercept too.
    2. I do expect it to be slower but wondering about algorithm stability.
    sam
    @veggiet_gitlab

    If you have an individual who has the 'death' event, but then becomes alive again, and then has a 'death' event again, how do you treat this? Should you use a time-based model and record each death event, something like this: [t0 - t1, death] [t2 - t3, death]? Or do you not record the first death event, but still use a time-based model, recording a gap between the 'observations': [t0 - t1, t2 - t3, death]?

    OR could you use a standard (non-temporal) model and treat them as separate individuals? What would the mathematical ramifications of using a standard model like that be?

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @veggiet_gitlab this is called recurrent event analysis, and is a harder problem than survival analysis (obviously). You can still use some survival analysis tools, though with some caution. One approach is to use the CoxPH model with the "cluster" argument: https://lifelines.readthedocs.io/en/latest/Examples.html#correlations-between-subjects-in-a-cox-model
    Charlene Chambliss
    @blissfulchar_twitter
    I'd like to introduce some interaction terms between ordinal variables into the lognormal AFT model, but after adding the interaction column, the algorithm now fails to converge. Is there a way to introduce interactions for categorical/ordinal variables without creating convergence issues?
    Paul Zivich
    @pzivich
    @blissfulchar_twitter it sounds like the convergence issues might be due to sparse data. You could check the counts for each category to verify. If interaction terms are important, you could consider collapsing some of the ordinal categories together. For example, if you have 5 categories, you could make it 3 instead
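    One way to sketch the collapsing idea in pandas (the 5-level ordinal and the cut points are made up):

```python
import pandas as pd

severity = pd.Series([1, 1, 2, 3, 3, 3, 4, 5])  # sparse at the top

# check counts per level before deciding what to merge
print(severity.value_counts().sort_index())

# collapse 5 levels into 3: {1, 2} -> low, {3} -> mid, {4, 5} -> high
collapsed = pd.cut(severity, bins=[0, 2, 3, 5], labels=["low", "mid", "high"])
```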
    Charlene Chambliss
    @blissfulchar_twitter
    @pzivich Thanks Paul! Looking into your sparsity suggestion I realized the DF with the interaction terms was not merging correctly with the main DF (it was dropping about 80% of the data). I fixed this issue and the model fits correctly now. Oops. Thanks for pointing me in the right direction :)
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :raised_hands: glad you got it working!
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    I added some more helpful context for users to check when convergence fails.
    Rob deCarvalho
    @robdmc
    Hi @CamDavidsonPilon, question about custom fitters. I was looking at this documentation: https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Piecewise%20Exponential%20Models%20and%20Creating%20Custom%20Models.html . Before I put any effort into experimenting, I was wondering if it would be possible to make one of the parameters an arbitrary list. Say, for example, I wanted the date associated with each element of the times parameter. If this were possible, I think it might allow me to add seasonality to a competing-risk model that captures the cumulative hazard of the outcome-of-interest. So I guess my question is two-fold: a) Is that possible with lifelines? b) Does that make sense for modeling competing risks?
    Rob deCarvalho
    @robdmc
    Here is a crude sketch of what I'd like to do.
    class SeasonalHazardFitter(ParametricUnivariateFitter):
        """
        The idea of this class would be to fit custom seasonality to an
        exponential-like hazard model.
        """

        _fitted_parameter_names = ['a_q1_', 'a_q2_', 'a_q3_', 'a_q4_', 'dates']

        def _cumulative_hazard(self, params, times):
            # Pull out fiscal quarters and dates corresponding to times.
            # Each element of the dates array corresponds to an element of
            # the times array.
            a_q1_, a_q2_, a_q3_, a_q4_, dates = params

            # Call a function that associates fiscal quarter with date
            quarters = get_fiscal_quarters(dates)

            # Get the hazard for each time
            q_lookup = {1: a_q1_, 2: a_q2_, 3: a_q3_, 4: a_q4_}
            hazards = np.array([q_lookup[quarter] for quarter in quarters])

            # Return the cumulative hazard
            # You'd have to be more careful to actually do the
            # integration properly, but you get the idea.
            return np.cumsum(hazards)
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    @robdmc, but dates isn't an unknown, is it? If not, it could be a global variable. If it is unknown, then I think you'll need to "flatten" it, i.e. one parameter for each element of the list.

    Can you tell me more about this seasonal model?

    Rob deCarvalho
    @robdmc
    @CamDavidsonPilon You are correct. dates are not an unknown. They are known constants. It makes sense that everything that goes into params should be unknown. Not sure what I was thinking there. Putting it in a global/class/instance variable makes sense. I just want to be sure I understand how _cumulative_hazard() is called.
    params: get tweaked by the optimization
    times: the times passed into the fitter as "durations"
    return: The cumulative hazard encountered over the duration represented by each time
    Is that right?
    Rob deCarvalho
    @robdmc
    The process I am trying to model consists of two competing kinds of events. The hazard for each event is a function of date. So the cumulative hazard for each time would be the integral of the hazard from the "start_date" to the "end_date". (where these can be derived from an element of time and its corresponding date.) What I really care about is the cumulative incidence function (CIF) for each kind of event. If the idea of getting dates into the _cumulative_hazard function works, then I was hoping to use this technique to model the CIF for one of the competing event types.
    Is this making sense?
    Cameron Davidson-Pilon
    @CamDavidsonPilon

    Your explanation of _cumulative_hazard is correct. But you can also see it as simply the cumulative hazard you wish to implement (i.e., not necessary to think about "durations" or "unknowns")

    I was thinking about your seasonal model, and actually tried to code something up, but there is a problem I think. The _cumulative_hazard is invoked for both the censored and uncensored data, so your code needs to handle that (and you won't know which until you see the shapes of the input data)

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    yea I don't know if this can be done... I'm playing with it locally, and having some trouble
    I'll think more about it. Try to write down the hazard mathematically - I think the problem is that it is clock-time dependent.
    Rob deCarvalho
    @robdmc
    Thank you for thinking about it. Clock-dependent hazards, I think, are actually pretty common.
    I love this interface you have for arbitrary models. If there were a way to hack that, it could be pretty useful.
    Rob deCarvalho
    @robdmc
    maybe with (..., *args, **kwargs) to the _cumulative_hazard? I actually don't understand very well how _cumulative_hazard is used under the hood, so perhaps I'm spouting nonsense.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    (..., *args, **kwargs) I was thinking about this, too

    Clock-dependent hazards I think are actually pretty common

    Agree, but I feel like the common strategy is to use a regression model or fit N univariate models (i.e. partition the data)

    I think a seasonal model is a great idea, so I want this to work.

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :wave: new lifelines release: 0.22.0. Some important API changes to take a look at, but some really powerful new regression models: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.0
    Julian Späth
    @julianspaeth
    Hi all, does lifelines somehow offer a Random Survival Forest? Or is there a specific reason why not? As there is no real Python implementation of RSF, and I want to implement it for my Master's thesis, I was wondering if you'd be interested in including it in lifelines?
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Hi Julian, lifelines does not have a RF model. Maybe scikit-learn survival does though.
    lifelines has focused less on purely predictive models, and more on inference
    Julian Späth
    @julianspaeth
    Hi Cameron, thank you for your answer. As far as I can see, scikit-survival does not have an RSF model. So I guess I need to implement it from scratch to use it in Python. Thank you anyway 🙂
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    hm, I thought it was, okay - have fun!
    Pedro Sola
    @pedrosola
    Hi everyone, I'm trying to fit a model to a recurrent process, e.g. a patient returning to a doctor. Is there a way to do so using lifelines? So far the closest that I've got is this repo: https://github.com/dunan/MultiVariatePointProcess
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    Lifelines has limited support for recurring events.
    Unfortunately
    mohit
    @mohit-shrma
    Hi, I am using CoxPHFitter with IPS weights and the robust=True flag. However, the fit is taking a really long time to finish. I have about a million instances and 6 features in my dataset. Let me know if a slower runtime is expected in the weighted version, and what can be done to speed it up.
    mohit
    @mohit-shrma

    @CamDavidsonPilon Let me know if you have any suggestions for question below:

    Hi, I am using CoxPHFitter with IPS weights and the robust=True flag. However, the fit is taking a really long time to finish. I have about a million instances and 6 features in my dataset. Let me know if a slower runtime is expected in the weighted version, and what can be done to speed it up.

    Cameron Davidson-Pilon
    @CamDavidsonPilon
    hello! A million rows is a lot, much more than needed for only 6 features. I would suggest subsampling to 50k or even fewer, and checking the standard errors.
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    @mohit-shrma, another suggestion is to "collapse" similar rows and use weights. Ex: with only 6 variables, you likely have the same row appearing more than once. You can group these rows, assign each an integer count, and use the weight_col argument in fit
    Cameron Davidson-Pilon
    @CamDavidsonPilon
    :wave: minor version of lifelines released: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.1
    Robert Green
    @rgreen13_gitlab

    Hi all. I'm somewhat new to using lifelines, and when using the CoxPHFitter and running check_assumptions, I end up with a warning that reads as follows: RuntimeWarning: overflow encountered in exp scores = weights * np.exp(np.dot(X, self.params_))

    Any suggestions on dealing with this issue? I'm starting down the road of normalization, but I'm not sure if that's 100% correct.
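    Here's the kind of normalization I mean: the overflow comes from large covariate values blowing up np.exp(np.dot(X, self.params_)), so z-scoring the covariates keeps the linear predictor in a safe range (the columns below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "T": [5., 8., 12., 20.],
    "E": [1, 0, 1, 1],
    "age": [25., 40., 60., 33.],
    "income": [30_000., 90_000., 150_000., 55_000.],
})

covs = ["age", "income"]
# z-score so the linear predictor stays small; hazard ratios then refer
# to a one-standard-deviation change instead of one raw unit
df[covs] = (df[covs] - df[covs].mean()) / df[covs].std()
```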