I'm having lots of trouble with Atomix on what I think should be standard operations. I have 3 hosts: member-a, member-b, and member-c. Things work great until I do deployments. I deploy to one host at a time, and the first host may or may not be able to rejoin the cluster. Since I started using Atomix a week or so ago this has been my pain point: testing around nodes going offline and rejoining doesn't seem to work very well at all (maybe I'm doing something wrong) ...
I stop the running instance and start it again with a newer version
(this is a Docker image... so stop the old Docker image, copy over the new one, start it up)
So for a minute or so, one member is totally offline, then attempts to start up again... this fails most of the time, so I end up having to do a fleet-wide deployment of all 3 at once, and that works
The regression is probably because of 0577ed6a0dbcb61577ddae8c89027b91e0140360 which fixed a bug that retained persistent members in the cluster after they were removed. But really, for this type of use case they probably shouldn’t be removed. If you’re shutting down a node to restart it, it should remain part of the cluster and just stay inactive. You also need to ensure the persistent state of the cluster is preserved across the containers.
There’s probably a bug in re-adding a persistent member after it was removed then (something with the tombstone), but it probably shouldn’t be removed in the first place. If you’re shutting down a node that’s still a member of a Raft partition group, then it should not be removed from the cluster membership. Maybe we just need to add another method to explicitly remove() a node from the cluster.
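To make the distinction concrete, here is a rough toy model of the behavior described above: a node that shuts down for a restart stays in the membership as inactive, and only an explicit remove() drops it. This is not Atomix code. The class `ClusterMembershipSketch`, its methods, and the `State` enum are all hypothetical names made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names are Atomix APIs (all hypothetical).
class ClusterMembershipSketch {
    enum State { ACTIVE, INACTIVE }

    private final Map<String, State> members = new LinkedHashMap<>();

    // A node joins, or rejoins after a restart, and becomes active.
    void join(String memberId) {
        members.put(memberId, State.ACTIVE);
    }

    // A node stopped for a restart stays in the membership, marked inactive.
    void shutdown(String memberId) {
        if (members.containsKey(memberId)) {
            members.put(memberId, State.INACTIVE);
        }
    }

    // Explicit, permanent removal is a separate operation.
    void remove(String memberId) {
        members.remove(memberId);
    }

    boolean isMember(String memberId) {
        return members.containsKey(memberId);
    }

    State stateOf(String memberId) {
        return members.get(memberId);
    }

    int size() {
        return members.size();
    }

    public static void main(String[] args) {
        ClusterMembershipSketch cluster = new ClusterMembershipSketch();
        cluster.join("member-a");
        cluster.join("member-b");
        cluster.join("member-c");

        // Rolling deployment: member-a goes down but is still a member.
        cluster.shutdown("member-a");
        System.out.println("member-a while down: " + cluster.stateOf("member-a"));
        System.out.println("cluster size while down: " + cluster.size());

        // member-a comes back with the new version and rejoins.
        cluster.join("member-a");
        System.out.println("member-a after restart: " + cluster.stateOf("member-a"));
    }
}
```

In this model the cluster size never changes during a rolling deployment, which is the property a Raft partition group would presumably need so that the restarted node can come back as the same member.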