These are chat archives for atomix/atomix
ServerStateMachine doesn’t even concern itself with terms. That’s all handled only in the replication and election protocols. The only thing that happens in a term/leader change is that an
InitializeEntry is committed by the new leader, but that just relates to sessions being expired and not really anything else
@kuujo Found something funny... The log I uploaded is for one of the instances of the cluster. Let's call it instance1. It gets the sequence=235 command and appends it locally. Then it sends append requests to the other 2 instances. Instance2 fails to receive it due to a timeout.
Everything gets a bit messy and the client times out. The servers start voting to elect a new leader and instance2 is elected. Things get back working again.
The client sends the sequence=235 command again. It hits instance1 and it forwards it to the leader (instance2). The problem is that instance2 has never applied the original request as the former leader had. So instance2 actually thinks this command is a new one and records it at index=737.
apply for the command at index 737 is called on the leader but not on the follower because of the sequence number difference.
So it doesn't seem like this should stop progress, but let's think about this...
There are really two uses of sequence numbers in the servers. First is the condition in
LeaderState that ensures commands are written to the log in sequential order. It uses the session's
requestSequence to track that. If a request's sequence number is greater than the next expected sequence number, it waits for the missing requests; otherwise it logs the request. What's important to note here is that the leader relies on the client resending sequence numbers that were never committed in order to ensure the client can progress. If a sequence number is never committed, then later commands will basically be ignored until the correct command makes it to the leader.
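That leader-side sequencing condition can be sketched roughly like this; the class and method names here are hypothetical, not Copycat's actual LeaderState:

```java
import java.util.TreeMap;

// Illustrative sketch of the leader's sequencing check described above.
class SessionSequencer {
  private long requestSequence; // highest sequence number logged so far
  private final TreeMap<Long, Runnable> pending = new TreeMap<>();

  /** Logs the request if it is next in sequence; otherwise holds it
   *  until the missing sequence numbers arrive from the client. */
  void submit(long sequence, Runnable logRequest) {
    if (sequence == requestSequence + 1) {
      logRequest.run();
      requestSequence = sequence;
      // Drain any held requests that are now in order.
      Runnable next;
      while ((next = pending.remove(requestSequence + 1)) != null) {
        next.run();
        requestSequence++;
      }
    } else if (sequence > requestSequence + 1) {
      // A gap: progress stalls until the client resends the missing requests.
      pending.put(sequence, logRequest);
    }
    // sequence <= requestSequence: already logged; a real leader would
    // respond with the cached result rather than logging it again.
  }

  long requestSequence() { return requestSequence; }
}
```

The point of the sketch is the stall: a request with a gap before it sits in `pending` forever unless the client resends the missing sequence numbers.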
The other use of sequence numbers is in the state machine which you've mentioned. What's important to note about
ServerStateMachine is that nothing is applied to the state machine until it's committed, and therefore it won't be lost. Any command that is applied to the state machine is guaranteed to eventually be applied to all state machines. When the command is applied to the state machine, the session's
commandSequence number is checked to determine whether the command has already been applied. Because commands are written to the log in sequential order and only committed commands are applied to the state machine, if
n is applied then we know
n-1 should have already been applied.
The exception is if a client submits n-1 and then switches servers and submits n-1 to another leader. In that case, what should happen is that once
n-1 is applied on the first server, the session's
requestSequence will be updated, thus ensuring the old leader has up-to-date sequence numbers for the session.
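A minimal sketch of that commandSequence check, assuming a simplified session model (this is not the real ServerSessionContext):

```java
// Committed commands are applied at most once per session, and
// commandSequence only ever moves forward.
class SessionState {
  private long commandSequence; // highest sequence applied to the state machine

  /** Applies a committed command unless its sequence number shows it has
   *  already been applied. Because commands are logged in sequential order
   *  and only committed entries reach this point, sequence <= commandSequence
   *  means this is a duplicate. */
  boolean apply(long sequence) {
    if (sequence <= commandSequence) {
      return false; // duplicate: a real server returns the cached result
    }
    commandSequence = sequence;
    return true;
  }

  long commandSequence() { return commandSequence; }
}
```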
If the client is resending the 235 command and it's simply being rejected, then it seems to be an issue in the server state. If the client skipped over a command because of a failure and didn't try to resubmit it, that could prevent the server from allowing the client's requests to progress
In that case the client should submit a NoOpCommand in its place. This is because the client can't know whether a command was or was not committed. The client assigns unique sequence numbers to every request, and if a sequence number is never committed then that will halt the progress of the system. So, if a client's command fails then it should submit a
NoOpCommand for the failed sequence number: https://github.com/atomix/copycat/blob/master/client/src/main/java/io/atomix/copycat/client/session/ClientSessionSubmitter.java#L296-L302
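A minimal sketch of that client-side rule; the class here is hypothetical (see the linked ClientSessionSubmitter for the real code):

```java
// If a command fails before the client knows whether it committed, the
// client must still fill that sequence number with a no-op, or later
// sequence numbers can never be committed.
class FailedCommandRecovery {
  /** Builds the replacement command for a sequence number whose original
   *  command failed. */
  static String resubmission(long failedSequence) {
    return "NoOpCommand[sequence=" + failedSequence + "]";
  }
}
```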
ServerSessionContext is the one that is at 235 when, after the leader change, it should go back to 234
commandSequence only seems to get updated at
ServerSessionContext::apply(CommandEntry entry). Which only goes up?
If commandSequence is at 235, that should indicate that commands
0-234 have been committed as well.
commandSequence is only ever updated inside
ServerStateMachine, and only entries that have been committed are applied to
ServerStateMachine and will therefore not be lost. If a leader change occurs, then the new leader should have any entries that have been applied on the old leader.
Applying ConnectEntry[index=634, term=2,
Applying CommandEntry[index=634, term=1, session=8, sequence=235,
The AppendRequest from the leader for term 2 to the leader for term 1 was not handled correctly - the leader for term 1 didn't truncate its log and replace the entries. That's a huge find. I'll poke around some more and try to track down where that happened. I should be able to deduce the state of the server from these logs
The old leader then receives an AppendRequest from the leader for term 2, which causes
634 to be committed. I think the bug should ultimately be fairly easy to track down now...
Received AppendRequest[term=2, leader=-1408161302, logIndex=632, logTerm=1, entries=, commitIndex=634, globalIndex=632]
So, the problem is indeed in how followers handle
AppendRequest. What the follower (and former leader) had in its log - entries from term 1 up to index 634 - allowed the
AppendRequest consistency checks to succeed, but because the request didn't contain any entries, the follower kept the entries that were already in its log. The follower then commits those entries and applies them: https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/PassiveState.java#L210
There are a couple of potential solutions here. First, the follower could truncate its log regardless of whether the leader sent any entries. That's probably the safest solution. Otherwise, the follower could avoid updating its commitIndex beyond the last index the leader actually sent in that request. In other words,
setCommitIndex(Math.min(request.commitIndex(), lastEntryIndex)), but I have to explore whether that could negatively affect any other usage of
commitIndex. It should be fine.
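The second fix can be sketched on a simplified follower model; the names here are illustrative, not Copycat's PassiveState:

```java
// Never advance commitIndex past the last index the AppendRequest
// actually confirmed against the leader's log.
class FollowerCommit {
  private long commitIndex;

  /** Handles an AppendRequest whose logIndex/logTerm consistency check has
   *  already passed. entryCount is the number of entries in the request, so
   *  logIndex + entryCount is the last index the leader confirmed here. */
  long handleAppend(long logIndex, int entryCount, long requestCommitIndex) {
    long lastEntryIndex = logIndex + entryCount;
    // Buggy behavior: commitIndex = requestCommitIndex, which can commit
    // stale entries from an old term when the request carries no entries.
    // Fixed behavior: clamp to what the leader confirmed in this request.
    commitIndex = Math.max(commitIndex, Math.min(requestCommitIndex, lastEntryIndex));
    return commitIndex;
  }
}
```

With the logged request above (logIndex=632, no entries, commitIndex=634), the clamp stops the follower at 632 instead of committing its stale term-1 entries through 634.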
Going to think a bit about the impact of either of these solutions and submit a PR
It seems to me the only good way this can really be tested reliably is sort of in a vacuum… by setting up a single server and mocking communication with it to/from other servers.
It’s possible we could set up a
TestTransport for tests which allows you to cut off connections to simulate partitions. I used to have something like that. Then you would cut off the leader’s connections to other servers, submit a command to the leader, wait for another leader to get elected, then resubmit the command. But then the question is what the assertion becomes. Are you asserting that the logs up to the commit index on each server are consistent? I think that’s starting to get into building a sort of fuzz-testing framework for Copycat