These are chat archives for atomix/atomix

4th Jan 2017
Thiago Santos
@thiagoss
Jan 04 2017 00:17
Something just occurred to me: Could it be related to the term change? (Didn't look at it yet, just got back to the computer)
Jordan Halterman
@kuujo
Jan 04 2017 00:18
hmm
doesn’t seem like it should be… the ServerStateMachine doesn’t even concern itself with terms. That’s all only handled in the replication and election protocols. The only thing that happens in a term/leader change is an InitializeEntry is committed by the new leader, but that just relates to sessions being expired and not really anything else
Thiago Santos
@thiagoss
Jan 04 2017 00:42
Got it
Will go to the logs again
Jordan Halterman
@kuujo
Jan 04 2017 00:43
we will track it down
and it will be a good find!
Thiago Santos
@thiagoss
Jan 04 2017 02:03
Couldn't find anything wrong on this log. I'll maybe need to reproduce it again with more logs.
Thiago Santos
@thiagoss
Jan 04 2017 02:14
@kuujo Uploaded the logs on the github issue. It is polluted with custom logs I added. Let me know if you'd like me to remove them or something that would help in case you look at them. I'll try to reproduce it again with more logs around that part of the code tomorrow.
Jordan Halterman
@kuujo
Jan 04 2017 02:21
awesome… I’ll take a look at them
thanks!
Jordan Halterman
@kuujo
Jan 04 2017 09:45
these logs should help… I found where the requests start and just need to dig through the requests/responses but gotta get some rest for now
Thiago Santos
@thiagoss
Jan 04 2017 14:24

@kuujo Found something funny... The log I uploaded is for one of the instances of the cluster. Let's call it instance1. It gets the sequence=235 command and appends it locally. Then it sends append requests to the other 2 instances. Instance2 fails to receive it due to a timeout.

Everything gets a bit messy: the client times out, the servers start an election, and instance2 is elected as the new leader. Things get back to working again.

The client sends the sequence=235 command again. It hits instance1 and it forwards it to the leader (instance2). The problem is that instance2 has never applied the original request as the former leader had. So instance2 actually thinks this command is a new one and records it at index=737.

Thiago Santos
@thiagoss
Jan 04 2017 18:34
Ok, I confirm that the ConfigurationRequest that comes from the new leader has index=633. The 634 was only applied locally on the former leader.
Jordan Halterman
@kuujo
Jan 04 2017 18:40
So that makes sense. The original command had never been committed so the second attempt was committed on the new leader and was treated as the first commitment, so the response containing index 737 is fine. Hmmm
Going to look through the logs more in a bit
Thiago Santos
@thiagoss
Jan 04 2017 18:41
So the current sequence number being 235 on instance1 is not true on the leader
Should it be reverted when the new configuration is received?
Jordan Halterman
@kuujo
Jan 04 2017 18:43
Hmm...
You might be on to something
Lemme poke around some more
Thiago Santos
@thiagoss
Jan 04 2017 18:45
What I see is that the apply for the command at index 737 happens on the leader but not on the follower, because of the sequence number difference.
Jordan Halterman
@kuujo
Jan 04 2017 19:03

So it doesn't seem like this should stop progress, but let's think about this...

There are really two uses of sequence numbers in the servers. First is the condition in LeaderState that ensures commands are written to the log in sequential order. It uses the session's requestSequence to track that. If a request's sequence number is > the next sequence number it waits for the missing requests, otherwise it logs the request. What's important to note here is that the leader relies on the client resending sequence numbers that were never committed in order to ensure the client can progress. If a sequence number is never committed then later commands will basically be ignored until the correct command makes it to the leader.
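As a rough sketch (class, field, and method names here are made up for illustration, not the actual LeaderState code), that leader-side check is essentially:

import java.util.Map;
import java.util.TreeMap;

class LeaderSequencer {
  // Next request sequence number expected from this session.
  private long requestSequence;
  // Out-of-order requests parked until the missing ones arrive.
  private final Map<Long, Runnable> pending = new TreeMap<>();

  synchronized void onCommand(long sequence, Runnable appendToLog) {
    if (sequence > requestSequence + 1) {
      // Gap: earlier sequence numbers are missing, so wait for the client
      // to (re)send them before writing this request to the log.
      pending.put(sequence, appendToLog);
      return;
    }
    // In order (or a resend): write the command to the log.
    appendToLog.run();
    requestSequence = Math.max(requestSequence, sequence);
    // Drain any parked requests that have now become sequential.
    Runnable next;
    while ((next = pending.remove(requestSequence + 1)) != null) {
      next.run();
      requestSequence++;
    }
  }
}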

The other use of sequence numbers is in the state machine which you've mentioned. What's important to note about ServerStateMachine is nothing is applied to the state machine until it's committed and therefore won't be lost. Any command that is applied to the state machine is guaranteed to eventually be applied to all state machines. When the command is applied to the state machine, the session's commandSequence number is checked to determine whether the command has already been applied. Because commands are written to the log in sequential order and only committed commands are applied to the state machine, if commandSequence n is applied then we know n-1 should have already been applied.
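And the state-machine-side dedup is roughly this (again an illustrative sketch, not the real ServerStateMachine/ServerSessionContext code):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class SessionDedup {
  // Highest command sequence number applied to the state machine for this session.
  private long commandSequence;
  // Cached results so a duplicate command returns its original response.
  private final Map<Long, Object> results = new HashMap<>();

  Object applyCommand(long sequence, Supplier<Object> operation) {
    if (sequence <= commandSequence) {
      // Already applied; return the cached result instead of executing it again.
      return results.get(sequence);
    }
    Object result = operation.get();
    results.put(sequence, result);
    commandSequence = sequence; // only ever moves forward
    return result;
  }
}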

In the first case, it's possible a client could submit sequence number n before n-1 and then switch servers and submit n-1 to another leader. In that case, what should happen is once n-1 is applied on the first server, the session's requestSequence will be updated, thus ensuring the old leader has up-to-date sequence numbers for the session.
Hmm but I actually don't see where that's done
I can't find where requestSequence is reset
Ugh... typing on my phone
Really need to just figure out whether this is an issue in the client’s state or the server’s state. If the client is continuing to resubmit the 235 command and they’re simply being rejected then it seems to be an issue in the server state. If the client skipped over a command because of a failure and didn’t try to resubmit it, that could prevent the server from allowing the client’s requests to progress
hmm… actually I have an idea
there’s a case where if a client’s request fails repeatedly then the client will submit a NoOpCommand in its place. This is because the client can’t know whether a command was or was not committed. The client assigns unique sequence numbers to every request, and if a sequence number is never committed then that will halt the progress of the system. So, if a client’s command fails then it should submit a NoOpCommand for the failed sequence number: https://github.com/atomix/copycat/blob/master/client/src/main/java/io/atomix/copycat/client/session/ClientSessionSubmitter.java#L296-L302
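The idea is roughly this (an illustrative sketch, not the actual ClientSessionSubmitter code; the Submitter interface and class names are made up):

class ClientRetrySketch {
  interface Submitter {
    void submit(Object command, long sequence);
  }

  static final class NoOpCommand {}

  private final Submitter submitter;
  private final long sequence;

  ClientRetrySketch(Submitter submitter, long sequence) {
    this.submitter = submitter;
    this.sequence = sequence;
  }

  // Called once the real command has failed too many times. The client can't
  // know whether the original command was committed, but it must not leave a
  // hole in the sequence, so it fills the slot with a no-op under the same
  // sequence number.
  void failPermanently() {
    submitter.submit(new NoOpCommand(), sequence);
  }
}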
Thiago Santos
@thiagoss
Jan 04 2017 19:20
It seems like the client actually thinks it succeeded because it gets a response after the leader change
commandSequence in ServerSessionContext is the one that is at 235 when, after the leader change, it should go back to 234
This commandSequence only seems to get updated at ServerSessionContext::apply(CommandEntry entry). Which only goes up?
Jordan Halterman
@kuujo
Jan 04 2017 19:24
In order for commandSequence to hit 235, that should indicate that commands 0-234 have been committed as well. commandSequence is only ever updated inside ServerStateMachine, and only entries that have been committed are applied to ServerStateMachine and will therefore not be lost. If a leader change occurs, then the new leader should have any entries that have been applied on the old leader.
Thiago Santos
@thiagoss
Jan 04 2017 19:27
Ok, I'm seeing it applied in the logs
This is after it became a follower
At time 2016-12-22 19:10:28,566
The Leader applies:
Applying ConnectEntry[index=634, term=2,
And the follower:
Applying CommandEntry[index=634, term=1, session=8, sequence=235,
I'm guessing the same index should get the same entry?
Jordan Halterman
@kuujo
Jan 04 2017 19:31
yeah...
Thiago Santos
@thiagoss
Jan 04 2017 19:32
Ok, so this is where the follower does something wrong and bumps its sequence number by itself.
Need to figure out why it is applying that entry
Jordan Halterman
@kuujo
Jan 04 2017 19:34
yeah that would certainly seem to be the root of the problem and is a big one
I’ve never seen this happen
Thiago Santos
@thiagoss
Jan 04 2017 19:34
\o/ We're getting close
Jordan Halterman
@kuujo
Jan 04 2017 19:34
now how to track it down...
hopefully the logs are verbose enough to figure out what allowed that
So I guess here's what we have to do...
Jordan Halterman
@kuujo
Jan 04 2017 19:40
This might be hard to deduce. It seems what happened is the leader for term 1 appended the second entry and the leader for term 2 appended the first. The leader for term 1 likely appended the entry and then the leader for term 2 was elected and appended the other entry. My question is, when the leader for term 1 applies that entry, is it still the leader or is it a follower? If it's still a leader then that indicates split brain. If it's a follower then that indicates something in the handling of AppendRequest failed. The node that was the leader for term 1 should have removed that entry from its log and replaced it with the entry from term 2.
Thiago Santos
@thiagoss
Jan 04 2017 19:41
It appends when it is a leader but never applies it until later when it becomes a follower
I need to leave for a while. Will get back to those logs in ~2h
Jordan Halterman
@kuujo
Jan 04 2017 19:43
K, so some AppendRequest from the leader for term 2 to the leader for term 1 was not handled correctly - the leader for term 1 didn't truncate its log and replace the entries. That's a huge find. I'll poke around some more and try to track down where that happened. I should be able to deduce the state of the server from these logs
Jordan Halterman
@kuujo
Jan 04 2017 19:49
I guess there are actually two possibilities here:
Either the leader for term 1 didn't properly truncate its log and replace the entry at index 634
Or the leader for term 2 should never have won the election in the event the leader for term 1 successfully committed that entry
Jordan Halterman
@kuujo
Jan 04 2017 19:57
hmmm
I think I’m getting close
Jordan Halterman
@kuujo
Jan 04 2017 20:06
So, this is the first AppendRequest from the leader for term 2 which causes 634 to be committed. I think the bug should ultimately be fairly easy to track down now...
Received AppendRequest[term=2, leader=-1408161302, logIndex=632, logTerm=1, entries=[0], commitIndex=634, globalIndex=632]
K I found the problem and will submit a PR after a couple meetings I have coming up
Jordan Halterman
@kuujo
Jan 04 2017 20:26

So, the problem is indeed in how followers handle AppendRequest. What the follower (and former leader) had in its log - entries from term 1 up to index 634 - allowed the AppendRequest consistency checks to succeed, but because the request didn't contain any entries the follower kept the entries that were already in its log. The follower then commits those entries and applies them: https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/PassiveState.java#L210

There are a couple potential solutions here. First, the follower could truncate its log regardless of whether the leader sent any entries. That's probably the safest solution. Alternatively, the follower could avoid updating its commitIndex beyond the entries the leader actually sent in that request. In other words, setCommitIndex(Math.min(request.commitIndex(), lastEntryIndex)), but I have to explore whether that could negatively affect any other usage of commitIndex. It should be fine.
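To make the second option concrete, a rough sketch (names made up, not the actual PassiveState code):

class CommitIndexCap {
  private long commitIndex;

  // requestCommitIndex is the commitIndex the leader sent; lastEntryIndex is
  // the last index this AppendRequest actually covered (its logIndex plus the
  // number of entries it carried). Capping with Math.min keeps an empty
  // heartbeat from committing stale entries left over in the local log.
  void updateCommitIndex(long requestCommitIndex, long lastEntryIndex) {
    long safeCommitIndex = Math.min(requestCommitIndex, lastEntryIndex);
    if (safeCommitIndex > commitIndex) {
      commitIndex = safeCommitIndex;
      // ...apply newly committed entries up to commitIndex here...
    }
  }
}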

Going to think a bit about the impact of either of these solutions and submit a PR

Jordan Halterman
@kuujo
Jan 04 2017 22:11
This was an amazing find. I have the solution. Submitting a PR now. Can you try it out once it's merged?
Thiago Santos
@thiagoss
Jan 04 2017 22:21
Hi, just got back
Yes, but as I said, it is a rare race condition. Still, I'm planning on running my tests the whole day tomorrow. It would be nice to try to come up with an automatic test to reproduce it and verify the fix as well. Not sure programming that would be easy, though
Great news :) Will start building here when you get it up
Jordan Halterman
@kuujo
Jan 04 2017 22:24
well at least it was a rare occurrence
alright there it is
really huge find
Jordan Halterman
@kuujo
Jan 04 2017 22:35
damn… I can’t believe that never happened in Jepsen testing. It doesn’t seem like it should be that difficult to reproduce in that type of environment where Jepsen is forcing leader changes like crazy
Thiago Santos
@thiagoss
Jan 04 2017 22:39
Do you see an easy way to write a test forcing a leader change right after some entry is appended on the leader?
I'm guessing the next entry to the new leader needs to be a client connection that might result in empty information propagating?
Jordan Halterman
@kuujo
Jan 04 2017 22:43

It seems to me the only good way this can really be tested reliably is sort of in a vacuum… by setting up a single server and mocking communication with it to/from other servers.

It’s possible we could set up a TestTransport for tests which allows you to cut off connections to simulate partitions. I used to have something like that. Then you would cut off the leader’s connections to other servers, submit a command to the leader, wait for another leader to get elected, then resubmit the command. But then the question is what the assertion becomes. Are you asserting that the logs up to the commit index on each server are consistent? I think that’s starting to get into building a sort of fuzzy testing framework for Copycat
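Roughly, such a test transport could look like this sketch (purely illustrative, not tied to Copycat's actual Transport API):

import java.util.HashSet;
import java.util.Set;
import java.util.function.BiConsumer;

class PartitionableTestTransport {
  // Directed links that are currently "cut", keyed as "from->to".
  private final Set<String> cutLinks = new HashSet<>();
  // Delivers a message to the node with the given id.
  private final BiConsumer<String, Object> deliver;

  PartitionableTestTransport(BiConsumer<String, Object> deliver) {
    this.deliver = deliver;
  }

  void cut(String from, String to)  { cutLinks.add(from + "->" + to); }
  void heal(String from, String to) { cutLinks.remove(from + "->" + to); }

  // Drops the message if the link is cut, simulating a partition.
  void send(String from, String to, Object message) {
    if (!cutLinks.contains(from + "->" + to)) {
      deliver.accept(to, message);
    }
  }
}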

haha
really, rather than making general tests too complex, this is exactly the type of thing I would have hoped Jepsen would expose, but we probably really just need to be running the Jepsen tests more often
Thiago Santos
@thiagoss
Jan 04 2017 22:46
yeah... not exactly something easy to automate
I don't know what jepsen is. Will read on that
Thiago Santos
@thiagoss
Jan 04 2017 23:03
I'm starting a round of builds here and will try the fix. It is likely that I'll only have enough testing by tomorrow. I'll let you know how it went
Jordan Halterman
@kuujo
Jan 04 2017 23:40
And the popular series of blog posts on Jepsen testing various systems: https://aphyr.com/tags/jepsen