These are chat archives for atomix/atomix

26th
Oct 2017
Zachary Heilbron
@zheilbron
Oct 26 2017 00:00

Suppose you have a three node Atomix cluster, where the nodes (n1, n2, n3) are part of a distributed group. Is the following scenario possible (and, if so, how can it be prevented)?

- `n1` is elected *group* leader
- `n1` becomes partitioned from `n2` and `n3`
- `n2` becomes leader

Before n1 realizes it has been disconnected from the cluster, it believes itself to be the leader. But n2 also believes itself to be the leader. This has some adverse consequences if, for example, you have a protocol in which only the leader of the group accepts writes to a shared file system. In that case, you may have two writers writing to the shared file system. NB: just because a partition occurred on the Atomix side does not mean a partition occurred on the shared file system.

For example, in ZooKeeper this can be prevented by letting the client time out after n seconds while having the server time out the client's session after n*2 seconds.
It looks as if CopycatClient/CopycatServer may allow for this timeout solution via the unstability and session timeouts, but I haven't been able to convince myself that it works the way I intend.
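
To illustrate the timeout asymmetry, here is a minimal sketch; the names (LeaseWatchdog, CLIENT_TIMEOUT, onKeepAliveSucceeded, mayWrite) are hypothetical and this is not the ZooKeeper or Copycat API, only the idea that the client stops acting as leader well before the server-side session can expire:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Minimal sketch of the ZooKeeper-style timeout asymmetry described above.
 * All names here are hypothetical. The client stops acting as leader after
 * CLIENT_TIMEOUT without a successful keep-alive, while the server only
 * expires the session (allowing a new election) after 2 * CLIENT_TIMEOUT,
 * so the two leadership windows should not overlap if clocks behave.
 */
public class LeaseWatchdog {
  static final Duration CLIENT_TIMEOUT = Duration.ofSeconds(5);
  static final Duration SERVER_TIMEOUT = CLIENT_TIMEOUT.multipliedBy(2);

  private volatile Instant lastKeepAlive = Instant.now();
  private volatile boolean actingAsLeader = true;

  /** Called whenever a keep-alive round trip completes successfully. */
  void onKeepAliveSucceeded() {
    lastKeepAlive = Instant.now();
  }

  /** Called before every write to the shared file system. */
  boolean mayWrite() {
    if (Duration.between(lastKeepAlive, Instant.now()).compareTo(CLIENT_TIMEOUT) > 0) {
      // We may already be partitioned; stop writing well before the
      // server-side expiry (SERVER_TIMEOUT) lets another node take over.
      actingAsLeader = false;
    }
    return actingAsLeader;
  }
}
```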
Jordan Halterman
@kuujo
Oct 26 2017 01:55

Yes, that's possible, and TBH it's not even entirely safe to use timeouts, but it's close. The fact that another node is elected leader doesn't imply the old leader knows it's no longer the leader.

So, Atomix works the exact same way as ZooKeeper in this respect: Atomix (and Copycat) clients communicate with the Raft cluster through a session. When the client believes it's disconnected from the cluster (it hasn't completed a keep-alive request successfully within the unstabilityTimeout or sessionTimeout), its state changes from CONNECTED to SUSPENDED to indicate it doesn't know about state changes that are occurring in the cluster. A leadership change occurs when a session expires. The safe thing to do at that point is to assume leadership was lost and, in fact, attempt to resign as the current leader.
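
A rough sketch of the "treat SUSPENDED as lost leadership" rule; the types and methods here (SessionState, resign) are hypothetical placeholders rather than the Copycat/Atomix API, they just model the state handling described above:

```java
/**
 * Sketch only: when the session leaves CONNECTED, stop acting as leader
 * and try to resign, because leadership changes may have been missed.
 */
public class LeaderGuard {
  enum SessionState { CONNECTED, SUSPENDED, CLOSED }

  private volatile boolean leader;

  /** Invoked whenever the client's session state changes. */
  void onStateChange(SessionState state) {
    switch (state) {
      case CONNECTED:
        // Safe to act on events again, but only after re-learning the
        // current term and confirming we are still leader.
        break;
      case SUSPENDED:
      case CLOSED:
        // We may have missed a leadership change; stop acting as leader
        // and attempt to resign explicitly.
        leader = false;
        resign();
        break;
    }
  }

  boolean isLeader() {
    return leader;
  }

  void resign() {
    // Hypothetical: ask the group to elect someone else.
  }
}
```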

But like I said, this isn't even entirely safe. There are many reasons beyond network partitions why a client may not learn that it has lost its leadership. For example, the clocks used to guess whether a session has expired may be unreliable, or a long GC pause might have resulted in a leader change before the client could even react. But this is why we use leadership terms. The term is based on the Raft log and is guaranteed to be globally unique and monotonically increasing. So, the term number can be used to ensure two leaders' terms don't overlap when communicating with external systems. Even if two leaders do exist at the same time, if a leader attempts to change some state that another leader with a higher term has already changed, it knows an election has occurred and can step down.
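
As a sketch of term-based fencing on the external system's side (assuming the external store can remember the highest term it has accepted; FencedStore and doWrite are made-up names):

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch: the store remembers the highest leadership term it has seen and
 * rejects writes carrying a lower term. Because terms are monotonically
 * increasing, a stale leader's writes are refused, which also tells that
 * leader an election has occurred and it should step down.
 */
public class FencedStore {
  private final AtomicLong highestTerm = new AtomicLong();

  /** Returns true if the write was accepted, false if the term is stale. */
  public boolean write(long term, byte[] data) {
    long current = highestTerm.get();
    while (term >= current) {
      if (highestTerm.compareAndSet(current, term)) {
        doWrite(data);
        return true;
      }
      current = highestTerm.get();
    }
    return false; // a leader with a higher term has already written
  }

  private void doWrite(byte[] data) {
    // Apply the write to the shared file system / database.
  }
}
```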

@zheilbron
Jordan Halterman
@kuujo
Oct 26 2017 04:26
BTW the unstabilityTimeout in Copycat actually closes the client if it can't communicate with the cluster. But even without it set, the client will transition into the SUSPENDED state when it can't communicate with the cluster, and that state indicates you can't assume no events have been missed, which IIRC is essentially the same thing that happens in ZooKeeper.
Jordan Halterman
@kuujo
Oct 26 2017 06:44
Also, I should probably point out that Copycat is no longer maintained. The Atomix 2 Raft implementation is simpler, more stable, more efficient, and more complete.
Andrius Dagys
@adagys
Oct 26 2017 15:59
Thanks @kuujo, I'll have a look at RaftServiceManager. I guess the main question is still whether the log and the replication mechanism can scale well to large data volumes.
Zachary Heilbron
@zheilbron
Oct 26 2017 19:18
@kuujo Agreed, I would have liked to use the term, but it's not always possible to do so, either because the external system's API does not support it (e.g. writes to HDFS) or because it would be prohibitively expensive (e.g. writes to a remote database, where each write becomes a transaction that validates the term).
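
For illustration, a sketch of what "each write is a transaction that validates the term" might look like against a remote SQL database over JDBC; the table and column names (fence, highest_term, data, payload) are invented, and the point is only the extra transactional term check on every write:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Sketch: every write becomes a transaction that first bumps/validates the
 * fencing term and only then applies the payload, which is where the extra
 * cost per write comes from.
 */
public class TermValidatedWriter {
  public boolean write(Connection conn, long term, byte[] payload) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement fence = conn.prepareStatement(
             "UPDATE fence SET highest_term = ? WHERE highest_term <= ?");
         PreparedStatement insert = conn.prepareStatement(
             "INSERT INTO data (payload) VALUES (?)")) {
      fence.setLong(1, term);
      fence.setLong(2, term);
      if (fence.executeUpdate() == 0) {
        conn.rollback();       // a higher term has already been recorded
        return false;
      }
      insert.setBytes(1, payload);
      insert.executeUpdate();
      conn.commit();
      return true;
    } catch (SQLException e) {
      conn.rollback();
      throw e;
    }
  }
}
```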
Zachary Heilbron
@zheilbron
Oct 26 2017 19:52
Now that I think about it though, even if you had full control in something like HDFS, validating the term (e.g. accepting only writes from the highest term seen so far) pushes the problem down to the fact that all writes must go through a single entity--but that's the original problem we intended to solve! We wanted to ensure that writes only happen through a single entity.