d-seki
@d-seki
@CamDavidsonPilon Thanks very much!
Cameron Davidson-Pilon
@CamDavidsonPilon
I got conda forge working again, so we should start to see simultaneous conda & pypi releases again
Cameron Davidson-Pilon
@CamDavidsonPilon
:wave: Also, new minor release with some useful bug fixes: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.9
Bojan Kostic
@bkos
@CamDavidsonPilon I see there are estimators for the cumulative hazard function, and it also appears in your "mathematical links between entities" diagram (nice one, BTW). What's the point (/advantage?) of introducing/estimating the CHF in survival analysis? It seems that all we need is the hazard and survival functions, which have a direct transform. I can't explain the meaning of the CHF; it doesn't seem to bring anything, seems redundant... I'm reading about deep survival models (there are lots of papers and code lately) and they hardly mention it...
Cameron Davidson-Pilon
@CamDavidsonPilon
@bkos good question. A few points / advantages:
i) the CHF is easier to estimate (less variance) than the hazard;
ii) the CHF and the HF both appear in the likelihood equation for survival models, see equation (2.5) in https://cran.r-project.org/web/packages/flexsurv/vignettes/flexsurv.pdf;
iii) because differentiation is easy and integration is hard, specifying the CHF and working out the HF is easier than the other way around;
iv) it's 1-1 with the SF, that is, SF = exp(-CHF).
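The relationships in points iii) and iv) are easy to check numerically. A minimal NumPy sketch (the Weibull-style parameterization below is an illustrative choice, not lifelines' API):

```python
import numpy as np

# Illustrative cumulative hazard H(t) = (t / lam) ** rho, a Weibull-style
# form chosen purely for demonstration.
lam, rho = 2.0, 1.5
t = np.linspace(0.1, 5.0, 50)

H = (t / lam) ** rho                          # cumulative hazard (CHF)
h = (rho / lam) * (t / lam) ** (rho - 1.0)    # hazard = dH/dt
S = np.exp(-H)                                # survival function: SF = exp(-CHF)

# The CHF really is the integral of the hazard: the numeric integral of h
# over [t[0], t[-1]] should match H(t[-1]) - H(t[0]).
print(np.trapz(h, t), H[-1] - H[0])
```

Differentiating the specified CHF gives the hazard in one line, whereas starting from the hazard would require an integral at every evaluation point.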
Bojan Kostic
@bkos
Thanks a lot, @CamDavidsonPilon! Is the equation you mentioned used in lifelines for some models? With it we don't lose any information, but it's different from the Cox partial likelihood, which includes only uncensored observations and softmax terms...
i completely missed that one, thx a lot!
mitchgallerstein-toast
@mitchgallerstein-toast
Has anyone had the issue where you get a "ZeroDivisionError: float division by zero" when using the CoxTimeVaryingFitter?
We originally thought it had to do with having multiple events with the same duration, but that doesn't seem to be the case.
mitchgallerstein-toast
@mitchgallerstein-toast
This seems to be the problem! Does anyone know how we would get around this until it is fixed?
Cameron Davidson-Pilon
@CamDavidsonPilon
@bkos yup, that equation is the basis of parametric models (you're right that it's not used in the Cox model)
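(For reference, my paraphrase of that likelihood: under right censoring, with $\delta_i$ the event indicator, the density factors as $f = h \cdot S$ and $S = \exp(-H)$, so the log-likelihood for a parametric model is)

```latex
\ell(\theta) = \sum_{i:\,\delta_i = 1} \log h(t_i \mid \theta) \; - \; \sum_{i=1}^{n} H(t_i \mid \theta)
```

Censored observations contribute only through the cumulative hazard term, which is why the CHF shows up so naturally here.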
@mitchgallerstein-toast hm, this sounds similar to the issue here: CamDavidsonPilon/lifelines#768
:wave: also minor release with some bug fixes: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.22.10
kdkaiser
@kdkaiser
hello! I'm somewhat new to survival analysis, and I haven't found any resources explaining why convergence would be poor when I have a variable that correlates strongly with being censored or not (it isn't correlated with the time to event for the uncensored data). I have a very small dataset, and when I bootstrap sample it many times, I end up with combinations of the data where certain of my boolean variables correlate with the censoring variable. The link that lifelines provides is related to logistic regression, where a variable correlates strongly with the class label you are trying to predict/model, which seems different from what is happening in survival analysis... thanks for any pointers!!
I'm also curious what type of model CPHFitter uses for the baseline, but didn't see that in the documentation
Cameron Davidson-Pilon
@CamDavidsonPilon
@kdkaiser for your second question, it's the Breslow method, see https://stats.stackexchange.com/questions/46532/cox-baseline-hazard
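For the curious, the Breslow baseline cumulative hazard is simple enough to sketch by hand. A toy NumPy version (not lifelines' actual implementation): at each event time, add the number of events there divided by the summed partial hazards of everyone still at risk.

```python
import numpy as np

def breslow_baseline_cumhaz(T, E, X, beta):
    """Breslow estimator of the baseline cumulative hazard H0(t):
    at each event time t, add d(t) / sum_{j in risk set at t} exp(X_j @ beta)."""
    T = np.asarray(T, dtype=float)
    E = np.asarray(E, dtype=int)
    X = np.asarray(X, dtype=float)
    risk_scores = np.exp(X @ beta)          # per-subject partial hazards
    event_times = np.unique(T[E == 1])
    increments = []
    for t in event_times:
        d = np.sum((T == t) & (E == 1))     # events observed at t
        denom = risk_scores[T >= t].sum()   # everyone still at risk at t
        increments.append(d / denom)
    return event_times, np.cumsum(increments)

# With beta = 0 this reduces to the Nelson-Aalen estimator:
times, H0 = breslow_baseline_cumhaz(
    [1.0, 2.0, 3.0], [1, 1, 0], np.zeros((3, 1)), np.array([0.0]))
print(times, H0)  # increments 1/3 then 1/2
```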
Cameron Davidson-Pilon
@CamDavidsonPilon
Your first question is really good, and I thought about it for a while
Take a look at the Cox log-likelihood:
$ll(\beta) = \sum_{i:C_i = 1} \left( X_i \beta - \log{\sum_{j: Y_j \ge Y_i} \theta_j} \right)$, where $\theta_j = \exp(X_j \beta)$
Suppose, in an extreme case, that $X_i = C_i$, that is, we have a single column that is equal to the E vector. Then the first sum is equal to:
$\sum_{i:C_i=1} X_i \beta = \sum_{i:C_i=1} C_i \beta = \sum_{i:C_i=1} \beta$
so to maximize the $ll$, we can just make $\beta$ as large as possible!
Cameron Davidson-Pilon
@CamDavidsonPilon
And this is what an optimization algorithm will do if you have a column that has too high of a correlation with E
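To see that numerically, here's a toy partial-likelihood computation (pure NumPy, not lifelines' code) with a single covariate equal to the event indicator. The log-likelihood keeps climbing as beta grows and never attains its supremum, so a gradient-based optimizer chases beta toward infinity:

```python
import numpy as np

# Toy data: the single covariate x equals the event indicator E exactly.
T = np.array([1.0, 2.0, 3.0, 4.0])
E = np.array([1, 0, 1, 0])
x = E.astype(float)

def cox_partial_ll(beta):
    # Cox partial log-likelihood: sum over events i of
    #   x_i * beta - log( sum over risk set {j: T_j >= T_i} of exp(x_j * beta) )
    ll = 0.0
    for i in range(len(T)):
        if E[i] == 1:
            risk = T >= T[i]
            ll += x[i] * beta - np.log(np.sum(np.exp(x[risk] * beta)))
    return ll

for b in [0.0, 1.0, 5.0, 10.0]:
    print(b, cox_partial_ll(b))  # strictly increasing in beta
```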
kdkaiser
@kdkaiser
@CamDavidsonPilon Thank you!! Im familiar with a slightly different notation so I'll work through it on my end too, but what you wrote makes sense. I appreciate your help!
Cameron Davidson-Pilon
@CamDavidsonPilon
:wave: Good morning, a new lifelines release has just been released. Some small API changes, but lots of QOL improvements: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.23.0
Bojan Kostic
@bkos
I see the Brier score is used by some people to measure the goodness-of-fit of survival models. As lifelines contains many useful functions, is there any specific reason why the Brier score is not included? It's present in scikit-learn but not used in any examples in lifelines...
Cameron Davidson-Pilon
@CamDavidsonPilon
@bkos mostly because I haven't gotten around to implementing it. I think it's a good measure, and should be included.
A N
@aleva85
Hi Cam, some time ago we discussed the undocumented use of _log_likelihood. I see you now added a log_likelihood attribute (thanks!). It works for some models but not for the ExponentialFitter, which throws an AttributeError for log_likelihood and a deprecation warning for _log_likelihood. Just a heads up, hope I didn't do something wrong on my side
Cameron Davidson-Pilon
@CamDavidsonPilon
whoops! Thanks for the heads up! I'll fix that shortly
@aleva85 actually, can you try log_likelihood_?
A N
@aleva85
that one works smoothly for all the parametric models, tried on version 0.22.9
Talgutman
@Talgutman
hey, I'm new to survival analysis and have some questions. My dataset comprises about 3000 features and about 9000 patients. Each feature is binary and indicates the existence/absence of a certain mutation in a patient's DNA. I also have the death status and time variable of each patient. Questions: 1) is Cox proportional hazards regression suitable for this task? I've read about it but I still lack knowledge. 2) I've tried implementing it and I can't seem to make it converge. I'm using a 0.1 penalizer and a step size of 0.1, but at the 5th or 6th iteration the norm_delta starts to rise again... 3) assuming it will work eventually, what is the expected training duration for a dataset this large? thank you very much :)
Cameron Davidson-Pilon
@CamDavidsonPilon
Hi @Talgutman, let me try to help:
1) It will be compatible, but very likely many of your variables will fail the proportional hazards test. Now, you may not care that you fail the proportional hazards test, which is common if your task is prediction, for example.
2) With that many variables, I can see co-linearity between a subset of variables being a problem, even with a positive penalizer value. Try a large penalizer value, like 100, to see if that converges, and then bring it down from there.
3) With that many variables, convergence might take a while. The hardest part is computing the Hessian matrix, which is a 3000x3000 matrix. I would hope that it takes less than 10 steps, but it's possible it may take more.
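A toy illustration of why a larger penalizer helps (pure NumPy, not lifelines' optimizer): when a covariate is too aligned with the event indicator, the unpenalized partial likelihood improves without bound as beta grows, but adding an L2 (ridge) penalty restores a finite, interior maximum:

```python
import numpy as np

# Tiny dataset where the single covariate equals the event indicator:
# one event at T=1 (x=1), one censored at T=2 (x=0). The risk set at T=1
# is both rows, so the partial log-likelihood is beta - log(exp(beta) + 1).
def penalized_ll(beta, penalizer):
    ll = beta - np.log(np.exp(beta) + 1.0)
    return ll - 0.5 * penalizer * beta ** 2   # L2 (ridge) penalty

grid = np.linspace(0.0, 20.0, 2001)
best_unpen = grid[np.argmax([penalized_ll(b, 0.0) for b in grid])]
best_pen = grid[np.argmax([penalized_ll(b, 0.1) for b in grid])]
print(best_unpen, best_pen)  # unpenalized argmax sits at the grid edge
```

Once the penalized fit converges, you can decrease the penalizer gradually and watch whether the coefficients stay stable.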
Cameron Davidson-Pilon
@CamDavidsonPilon
Generally, I think lifelines needs better support for high-dimensional problems
I am tinkering with alternative optimization algorithms for these cases
Talgutman
@Talgutman
I removed features with correlation 0.8 or higher with others... Will try increasing the penalty as you suggested. In the end I want to get the individual survival function for each patient, so does it matter if some of the variables fail the assumption? thank you! (also for the very quick response)
Cameron Davidson-Pilon
@CamDavidsonPilon
it won't matter then, no
you could try compressing the covariates using MCA (like PCA but for categorical data: https://en.wikipedia.org/wiki/Multiple_correspondence_analysis). That way you retain as much of the original information as possible, convergence will improve, and duration will be shorter
Bojan Kostic
@bkos
Building on your discussion, in what cases is it then important not to fail the proportional hazards test? The lifelines documentation says that "non-proportional hazards is a case of model misspecification."
Cameron Davidson-Pilon
@CamDavidsonPilon
@bkos if your goal is prediction, model assumptions don't matter
if your goal is inference / correlation study, then yes it matters greatly
Cameron Davidson-Pilon
@CamDavidsonPilon
:wave: new minor lifelines release with lots of bug fixes and performance improvements: https://github.com/CamDavidsonPilon/lifelines/releases/tag/v0.23.1
^ a lot of these bugs were found after starting to use type-hints and mypy. It's a pretty useful tool!
kpeters
@kpeters
Hi Cameron, I'm trying to apply survival analysis to time series data. I've fitted the CoxTimeVaryingFitter on a dataframe with 100 unique IDs, 200 observations (pandas rows) per ID, and 20-ish continuous variables. The partial hazard does seem to indicate which unique IDs are nearing their 'death' event, but I was hoping for something a bit more concrete, like the time-to-event predictions of the regular CoxPHFitter. Fitting the CoxPHFitter with the data as-is does not pass assumption checks, and using only one observation with lagged variables also didn't yield good results. Could you give me any tips?
kpeters
@kpeters
^ That should be max 200 observations per ID; the first 'death' events start occurring after 140 observations, and about 40% are still 'alive' at 200 observations
Cameron Davidson-Pilon
@CamDavidsonPilon
Hi @kpeters, sorry I haven't gotten back to you; I'll reply shortly
:wave: new lifelines minor release, v0.23.2. Some bug fixes, performance improvements, and new rmst_plot: https://lifelines.readthedocs.io/en/latest/Examples.html#restricted-mean-survival-times-rmst