These are chat archives for atomix/atomix
Suppose you have a three node Atomix cluster, where the nodes (
n1, n2, n3) are part of a distributed group. Is the following scenario possible (and, if so, how can it be prevented)?
- `n1` is elected *group* leader
- `n1` becomes partitioned from `n2` and `n3`
- `n2` becomes leader
Until `n1` realizes it has been disconnected from the cluster, it believes itself to be the leader. But
`n2` also believes itself to be the leader.
This has some adverse consequences if, for example, you have a protocol such that only the leader of the group accepts writes to a shared file system.
In this case, you may have two writers writing to the shared file system. NB: Just because a partition occurred on the Atomix side does not mean a
partition occurred on the shared file system.
…n seconds, but having the server time out the client in
`sessionTimeout`s; I haven't been able to convince myself that it works as I intend it to.
Yes that's possible, and TBH it's not even entirely safe to use timeouts, but it's close. The fact another node is elected leader doesn't imply the old leader knows it's no longer the leader.
So, Atomix works the exact same way as ZooKeeper in this respect: Atomix (and Copycat) clients communicate with the Raft cluster through a session. When the client believes it's disconnected from the cluster (hasn't completed a keep-alive request successfully for
`sessionTimeout`), its state changes to
`SUSPENDED` to indicate it doesn't know about state changes that are occurring in the cluster. A session expiring is exactly when a leadership change can occur. The safe thing to do at that point is to assume leadership was lost and in fact attempt to resign as the current leader.
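A minimal sketch of that advice: a leader that resigns as soon as its session leaves the connected state, rather than waiting for expiry. The `State` enum and listener wiring here are illustrative assumptions modeled on Copycat-style client states, not the exact Atomix API.

```java
// Illustrative sketch only: the state enum and callback shape are assumptions,
// not the real Atomix/Copycat client interface.
public class SuspendableLeader {
    public enum State { CONNECTED, SUSPENDED, EXPIRED }

    private boolean leader;

    public SuspendableLeader(boolean initiallyLeader) {
        this.leader = initiallyLeader;
    }

    // Invoked by the (hypothetical) client whenever the session state changes.
    public void onStateChange(State state) {
        if (leader && state != State.CONNECTED) {
            // Pessimistically assume leadership was lost: stop acting as
            // leader before the rest of the cluster elects a replacement.
            resign();
        }
    }

    private void resign() {
        leader = false;
    }

    public boolean isLeader() {
        return leader;
    }
}
```

The key design choice is resigning on `SUSPENDED` rather than on expiry: it trades some availability (the node may resign unnecessarily during a transient blip) for a smaller window in which two nodes both act as leader.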
But like I said, this isn't even entirely safe. There are many reasons beyond network partitions that a client may not learn it might have lost its leadership. For example, clocks used to guess whether a session has expired may be unreliable, or a long GC pause might have resulted in a leader change before the client could even react. But this is why we use leadership terms. The term is based on the Raft log and is guaranteed to be globally unique and monotonically increasing. So, the term number can be used to ensure two leaders' terms don't overlap when communicating with external systems. Even if two leaders do exist at the same time, if a leader attempts to change some state that another leader with a higher term has already changed, it knows an election has occurred and can step down.
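The fencing idea above can be sketched in a few lines: the external system remembers the highest term it has seen and rejects writes carrying a lower (stale) term. Only the monotonic-term property comes from Raft; the class and method names here are hypothetical.

```java
// Sketch of term-based fencing. A write is tagged with the writer's
// leadership term; the store rejects any write whose term is lower than
// one it has already accepted, so a stale leader cannot clobber state.
public class FencedStore {
    private long highestTerm = 0;
    private String value;

    // Returns true if the write was accepted. A leader that gets false
    // back learns an election has occurred and should step down.
    public synchronized boolean write(long term, String newValue) {
        if (term < highestTerm) {
            return false; // stale leader fenced out
        }
        highestTerm = term;
        value = newValue;
        return true;
    }

    public synchronized String read() {
        return value;
    }
}
```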
`unstabilityTimeout` in Copycat actually closes the client if it can't communicate with the cluster. But even without it set, the client will transition into the
`SUSPENDED` state when it can't communicate with the cluster, and that state should indicate one can't assume events haven't been missed, which IIRC is essentially the same thing that happens in ZooKeeper.
`term`, but it's not always possible to do so, either because the external system's API (e.g. writes to HDFS) does not support this or because it would be prohibitively expensive (e.g. writes to a remote database, where each write is actually a transaction that validates the term).