I’ve been working with Comcast and Samsung on fault tolerance issues, and I’ve been compiling a set of small fixes that together significantly improve crash recovery. For the last few days I’ve been running four 7-node clusters, continuously killing nodes and restarting entire clusters. I found several intricate bugs that are difficult to reproduce. I’ve also been working on a TLA+ spec of the entire Raft implementation and of some individual components.