These are chat archives for atomix/atomix

10 Jan 2017
Thiago Santos
@thiagoss
Jan 10 2017 00:16
@kuujo waltznetworks/copycat@0eb0dd4 I tried this when I reproduced the same issue again today. It seems to break more than it fixes :worried: It makes entries in the log go missing and causes clients to repeatedly get errors when resubmitting commands.
Anyway, the patch from last week wasn't enough to fix it. I still see the same behavior. Let me know if/when you have some time to discuss this issue again.
Jordan Halterman
@kuujo
Jan 10 2017 01:44
That change definitely seems to have been correct and actually comes from the Raft paper. It might be another change that's causing clients to get errors when resubmitting commands, though - atomix/copycat@61a9b6f - which limits the number of commands that can be enqueued for a session. That means if a client is submitting a lot of commands and one gets lost, the server will start rejecting them until the client resends the lost request. The problem we were seeing before that change was that stack sizes grew too large when too many requests were enqueued: once the client sent the missing request, all of the enqueued requests were handled at once, and handling that many at once could blow the stack.
Of course, there's probably another way to solve the stack size issue by not using recursion, but the queue limit also bounds memory consumption for a session anyway.
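For illustration, here's a minimal sketch of what a per-session queue limit like that can look like; the class and constant names are hypothetical, not Copycat's actual code:

```java
import java.util.TreeMap;

// Hypothetical sketch of a bounded per-session command queue.
class SessionQueue {
  private static final int MAX_PENDING = 1000; // assumed limit, not Copycat's real value
  private final TreeMap<Long, Runnable> pending = new TreeMap<>();
  private long nextSequence = 1;

  /** Returns false when an out-of-order request is rejected because the queue is full. */
  synchronized boolean submit(long sequence, Runnable command) {
    if (sequence == nextSequence) {
      command.run();
      nextSequence++;
      // Drain queued requests that are now in order, iteratively rather than recursively.
      Runnable next;
      while ((next = pending.remove(nextSequence)) != null) {
        next.run();
        nextSequence++;
      }
      return true;
    }
    if (pending.size() >= MAX_PENDING) {
      return false; // the client must resend the missing request before submitting more
    }
    pending.put(sequence, command);
    return true;
  }
}
```

Draining the queue in a loop rather than recursively is one way to avoid the stack growth described above.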

Anyways, I'd be curious to see some evidence that the consistency issue itself is not fixed - in other words, that two different commands are applied at the same index. That's the most pressing issue. There could also be separate but related issues, like the stack size limitation, that are causing other problems we should figure out.

Maybe we should try to create a reproducer I can use in my own environment.

Or there could have been multiple issues and we just fixed one of them. But in general I'd expect clients to receive more errors after adding the request queue limitation.
Jordan Halterman
@kuujo
Jan 10 2017 01:52
There may be some work to be done on the client to limit the errors, actually. The problem is that once the queue is full and the client's request is rejected, the client may just retry the same request. The server should perhaps send back the missing index to get the client to start resending requests at the right point.
Previously, the server would hold on to a request and wait for any missing requests from the client. If the server is now more likely to reject an out-of-sequence request, that causes a bit of back-and-forth between the client and server until the sequencing is correct again. That should be fairly easy to fix by sending the last sequence number back to the client.
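A hedged sketch of that idea, using purely illustrative names rather than Copycat's real request/response types:

```java
// Hypothetical response type: on rejection the server reports the sequence
// number it expects next so the client can resend from the right point.
class CommandResponse {
  final boolean accepted;
  final long expectedSequence; // next sequence number the server is waiting for

  CommandResponse(boolean accepted, long expectedSequence) {
    this.accepted = accepted;
    this.expectedSequence = expectedSequence;
  }
}

class ClientSide {
  void handleResponse(CommandResponse response) {
    if (!response.accepted) {
      // Rewind to the sequence the server reported instead of blindly
      // retrying the rejected request.
      resendFrom(response.expectedSequence);
    }
  }

  private void resendFrom(long sequence) {
    // Resend all buffered requests with a sequence number >= the given sequence.
  }
}
```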
Thiago Santos
@thiagoss
Jan 10 2017 02:13
Hi. I'm sanitizing my logs again and also doing some preliminary analysis

In onos1:

/172.17.0.2:9876 - Appended CommandEntry[index=780, term=1, session=18, sequence=198, timestamp=1483752077971, command=InstanceCommand
/172.17.0.2:9876 - Applying CommandEntry[index=780, term=1, session=18, sequence=198, timestamp=1483752077971, command=InstanceCommand

In onos2:

/172.17.0.3:9876 - Applying ConfigurationEntry[index=780, term=2, timestamp=1483752077663, members=[ServerMember
Thiago Santos
@thiagoss
Jan 10 2017 02:18
The first line in onos1 was done when it was the leader, then it changes to follower and onos2 takes over leadership.

/172.17.0.2:9876 - Received ConfigureRequest[term=2, leader=-1408161302, index=780, timestamp=1483752077663, members=[

The configure request onos1 receives has index=780

That seems to be why it doesn't discard the entry it appended.
Thiago Santos
@thiagoss
Jan 10 2017 02:29
I might be missing something here, distributed systems and raft are both new to me
Thiago Santos
@thiagoss
Jan 10 2017 02:56
@kuujo atomix/copycat#268 Uploaded logs here.
Jordan Halterman
@kuujo
Jan 10 2017 03:20
this is great
should be able to deduce what happened in this case again
checking them out now
but these are hugely helpful discussions… committing two different entries at the same index is a critical bug
Jordan Halterman
@kuujo
Jan 10 2017 03:35
especially considering Jepsen didn’t show these issues
this seems like it could be a similar bug related to how configuration changes are done
configuration changes in Raft are a really delicate process because they can result in a change in the quorum
This bit is actually really interesting too:
2017-01-07 01:21:17,663 Appended InitializeEntry[index=780, term=2, timestamp=1483752077659]
2017-01-07 01:21:17,663 Appended ConfigurationEntry[index=780, term=2, timestamp=1483752077663, members=[ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.2:9876, clientAddress=/172.17.0.2:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.4:9876, clientAddress=/172.17.0.4:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.3:9876, clientAddress=/172.17.0.3:9876]]] to log at index 780
it doesn’t seem like two entries should have been appended at that index, but it seems like those are two different log calls
the first doesn’t have to log at index …
Jordan Halterman
@kuujo
Jan 10 2017 03:43
I’m not seeing on server 1 where 780 is appended to the log
Thiago Santos
@thiagoss
Jan 10 2017 08:52
Seems to be: /172.17.0.2:9876 - Appended CommandEntry[index=780, term=1, session=18, sequence=198, timestamp=1483752077971, command=InstanceCommand, happens around 2017-01-07 01:21:17,971
Jordan Halterman
@kuujo
Jan 10 2017 09:16
Ugh, I must have cut it out of my cleaned-up version of the log by accident
Will try again in the morning
Thiago Santos
@thiagoss
Jan 10 2017 09:16
Cool, I shall be around
Will try to understand the code around configuration changes a bit more until then
Jordan Halterman
@kuujo
Jan 10 2017 09:27
But in general, it seems unrelated to the first issue I fixed yesterday, IMO. That was definitely a bug. This seems to be a separate issue that's equally problematic. It may or may not be related to the configuration change.

The reason configuration changes are delicate is that they're counted/applied differently than normal entries. Configuration change entries are applied when they're written to the log, not when they're committed (stored on a majority of the servers). This is so that once a configuration change has been stored on all servers, the leader also knows it has been applied on all servers, which is necessary for counting commits. But the configuration change that's happening here isn't actually adding or removing any nodes. This is basically just an entry Copycat uses to share a node's status. Inside the Copycat configuration is a flag for each node that says whether the node is alive/dead, which is used to prevent certain communication with nodes that won't answer anyway (in particular reading and sending batches of entries). It's also used in Atomix to automatically replace nodes. If there's a bug in that process, it could be the cause of other issues in the Atomix BalancingClusterManager, which is really experimental.
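A rough sketch of that distinction, assuming simplified placeholder types rather than Copycat's actual classes:

```java
import java.util.List;

// Illustrative only: configuration entries take effect when appended,
// while normal entries wait until they're committed.
interface Entry { long index(); }
interface ConfigEntry extends Entry { List<String> members(); }

class ServerSketch {
  private List<String> activeConfiguration;

  void append(Entry entry) {
    // ... write the entry to the local log here ...
    if (entry instanceof ConfigEntry) {
      // Applied as soon as it's appended: once the leader sees the entry stored
      // on every server, it knows every server has also applied it.
      activeConfiguration = ((ConfigEntry) entry).members();
    }
    // Normal entries are applied only later, once the commit index reaches entry.index().
  }
}
```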

As for finding the cause of the inconsistency - two different entries being committed at the same index - what I'll do in the morning is try to deduce whether server 1 should have applied that command (that is, whether it was able to replicate it to a majority of the cluster). If that part of the algorithm is correct, I would have to ask how server 2 got elected. The election protocol should have prevented server 2 from being elected and overwriting an entry committed by server 1 if server 2 didn't also have the committed entry in its log.
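For context, this is the standard Raft vote restriction that should prevent that, shown as a small sketch rather than Copycat's actual implementation:

```java
// A voter grants its vote only if the candidate's log is at least as
// up-to-date as its own (compare last terms, then last indexes).
class ElectionCheck {
  static boolean isCandidateUpToDate(long candidateLastTerm, long candidateLastIndex,
                                     long voterLastTerm, long voterLastIndex) {
    if (candidateLastTerm != voterLastTerm) {
      return candidateLastTerm > voterLastTerm;
    }
    return candidateLastIndex >= voterLastIndex;
  }
}
```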

Anyways, I'll be back at it in 7 hours or so.

Thiago Santos
@thiagoss
Jan 10 2017 09:29
Thanks for the explanations, really helpful. I'll drop here any info I find meanwhile.
Jordan Halterman
@kuujo
Jan 10 2017 09:31
actually gonna look one more time before I go to bed :-)
now I think I have the section of log I want
Jordan Halterman
@kuujo
Jan 10 2017 09:36
there are three nodes in the cluster?
Thiago Santos
@thiagoss
Jan 10 2017 09:36
yes
Do you want logs for the 3rd as well?
Jordan Halterman
@kuujo
Jan 10 2017 09:37
nope
hmmm
oops
actually, that bug was only fixed in passive members… I forgot that the appendEntries method is overridden in ActiveState, so it wasn’t even fixed there
just need to verify that was the problem
but it looks like it was
Jordan Halterman
@kuujo
Jan 10 2017 09:42
server 2 becomes the leader and sends an AppendRequest to server 1 with a commit index of 780 and no entries, causing server 1 to commit that command
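For reference, the rule from the Raft paper that a fix like this enforces, as an illustrative sketch rather than the exact Copycat change:

```java
// A follower must not advance its commit index past the last entry it has
// actually accepted from (and verified against) the leader's log.
class CommitRule {
  static long advanceCommitIndex(long leaderCommit, long commitIndex, long lastAcceptedIndex) {
    if (leaderCommit > commitIndex) {
      return Math.min(leaderCommit, lastAcceptedIndex);
    }
    return commitIndex;
  }
}
```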
I can commit the fix there before I go to bed and do a search to make sure I’m not missing anything else
actually relieved that this seems to just be the same issue
the bug was fixed here but that’s overridden here oops :-P
Jordan Halterman
@kuujo
Jan 10 2017 09:47
it was basically just fixed for non-Raft nodes
Thiago Santos
@thiagoss
Jan 10 2017 09:50
That was quick! If you put a patch somewhere I'll test it today
Jordan Halterman
@kuujo
Jan 10 2017 09:50
here’s the PR… I’m going to wait for the tests to run and then merge it: atomix/copycat#269
it’s just the same fix as the other PR
couple lines of code
minus an unnecessary if statement
going to poke through the logs some more to make sure that’s what happened while the tests run
ahh actually I found another bug
Thiago Santos
@thiagoss
Jan 10 2017 09:54
I'm still away from my desk. Will be there in 1-2h and start testing
Jordan Halterman
@kuujo
Jan 10 2017 09:55
so, the other issue is the two log statements I mentioned earlier:
2017-01-07 01:21:17,663 Appended InitializeEntry[index=780, term=2, timestamp=1483752077659]
2017-01-07 01:21:17,663 Appended ConfigurationEntry[index=780, term=2, timestamp=1483752077663, members=[ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.2:9876, clientAddress=/172.17.0.2:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.4:9876, clientAddress=/172.17.0.4:9876], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/172.17.0.3:9876, clientAddress=/172.17.0.3:9876]]] to log at index 780
two entries appended at the same index
the append happens inside an assert statement, so when assertions are disabled the runtime doesn't execute the line at all and the entry's never actually appended to the log
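A minimal illustration of that pattern and the usual fix (the names are made up, not the actual Copycat line):

```java
// Code with side effects inside an assert only runs when assertions are enabled (-ea).
class AssertSideEffect {
  interface Log { long append(Object entry); }

  void broken(Log log, Object entry, long expectedIndex) {
    // With assertions disabled (the JVM default), this whole statement is skipped,
    // so the entry is never appended at all.
    assert log.append(entry) == expectedIndex;
  }

  void fixed(Log log, Object entry, long expectedIndex) {
    // Perform the side effect unconditionally; only assert on the result.
    long index = log.append(entry);
    assert index == expectedIndex;
  }
}
```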
will submit another PR for that one and get them merged and you can have at it when you get to your desk
Thiago Santos
@thiagoss
Jan 10 2017 09:57
Oops. Code with side effects in an assert.
thanks!
Jordan Halterman
@kuujo
Jan 10 2017 09:58
fortunately InitializeEntry isn't really even necessary, it's just a performance thing, but great find from the logs
Jordan Halterman
@kuujo
Jan 10 2017 10:04
this is great… if this fixes the consistency issues (again) I think we’ll have to reconsider how the client/server resolve missing requests. Servers should be sending the missing request sequence number back to the client to allow it to restart at the correct point rather than clients arbitrarily retrying requests until they succeed.
and I don’t see AppendRequest being handled anywhere else, so that should fix this particular issue
gonna head off to bed… be back in the morning
Jordan Halterman
@kuujo
Jan 10 2017 10:23
k both PRs passed and are merged
Jordan Halterman
@kuujo
Jan 10 2017 20:23
@hazemkmammu let me double check on what the current behavior should be, but I think you're right...
Jordan Halterman
@kuujo
Jan 10 2017 20:34
So, IIRC, ASYNC sends a message and gets a response when it's persisted, SYNC sends a message and gets a response when it's acked, and REQUEST-REPLY waits for a reply. DIRECT sends directly to a member; the same goes for creating a producer on the member. If the member is persistent (was created with an explicit member ID), then messages should indeed be enqueued until the member recovers. So, I actually just found a bug in sending asynchronous messages. I'll have to look into what the side effects of it could be and whether it's the cause.
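A hypothetical sketch of those semantics; the enum and method names here are illustrative, not Atomix's actual messaging API:

```java
import java.util.concurrent.CompletableFuture;

class MessagingSketch {
  enum Execution { ASYNC, SYNC, REQUEST_REPLY }

  CompletableFuture<Object> send(Object message, Execution execution) {
    switch (execution) {
      case ASYNC:
        // Respond once the message has been persisted for later delivery.
        return persist(message).<Object>thenApply(v -> null);
      case SYNC:
        // Respond once the receiving member has acknowledged the message.
        return deliverAndAwaitAck(message).<Object>thenApply(v -> null);
      default:
        // REQUEST_REPLY: respond only when the receiver sends back a reply.
        return deliverAndAwaitReply(message);
    }
  }

  // Stubs standing in for the real persistence / delivery machinery.
  private CompletableFuture<Void> persist(Object message) {
    return CompletableFuture.completedFuture(null);
  }

  private CompletableFuture<Void> deliverAndAwaitAck(Object message) {
    return CompletableFuture.completedFuture(null);
  }

  private CompletableFuture<Object> deliverAndAwaitReply(Object message) {
    return CompletableFuture.completedFuture(null);
  }
}
```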
Jordan Halterman
@kuujo
Jan 10 2017 21:12
Part of the problem is the state machines are not well unit tested, and the group state machine in particular is pretty complex. They should be fairly easy to test, but I’m still developing the test framework that’s necessary to write those tests - simulating commits, asserting logs are properly managed by the state machine, etc.
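A hedged sketch of the kind of harness being described, with hypothetical names rather than anything Copycat actually ships:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Feed simulated commits to a state machine function and record the outputs
// so a test can assert on them afterwards.
class StateMachineHarness<T> {
  private final Function<Object, T> stateMachine; // applies one committed command
  private final List<T> outputs = new ArrayList<>();
  private long index;

  StateMachineHarness(Function<Object, T> stateMachine) {
    this.stateMachine = stateMachine;
  }

  /** Simulate committing a command at the next index and return that index. */
  long commit(Object command) {
    index++;
    outputs.add(stateMachine.apply(command));
    return index;
  }

  List<T> outputs() {
    return outputs;
  }
}
```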
Jon Hall
@jhall11
Jan 10 2017 21:45
@kuujo, If these 2 PRs fix the issues @thiagoss is looking at, what do you think the timeline for the next RC will be?
Thiago Santos
@thiagoss
Jan 10 2017 21:47
I'm finishing my test session, but it's looking good so far
Jon Hall
@jhall11
Jan 10 2017 21:49
Yeah, some others in ONOS have tested it in their setups with promising results