These are chat archives for atomix/atomix
So, on #77 I am only able to reproduce this when the cluster is initially misconfigured. What a CopycatServer does at startup is this:
If a configuration is stored in the MetaStore, load and use the last known cluster configuration, otherwise...
If the local Address is listed in the list of server addresses provided by the user, start the server as a full member of the cluster
If the local Address is not listed in the list of server addresses provided by the user, attempt to join via the provided list of servers
Send a JoinRequest or immediately transition to the FOLLOWER state to participate in the Raft algorithm. The transition is based on the configuration determined by the steps above.
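To make the startup decision above concrete, here is a minimal plain-Java sketch of it. This is not Copycat's actual code: `StartupDecision`, `Mode`, and `decide` are hypothetical names, and addresses are plain strings rather than `Address` objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names are part of the Copycat API.
class StartupDecision {

  enum Mode { USE_STORED_CONFIGURATION, START_AS_FOLLOWER, JOIN_EXISTING_CLUSTER }

  // storedConfiguration: the last known configuration found on disk, if any.
  // local: this server's address, e.g. "localhost:5000".
  // members: the server addresses provided by the user.
  static Mode decide(Optional<List<String>> storedConfiguration, String local, List<String> members) {
    if (storedConfiguration.isPresent()) {
      // A stored configuration wins over whatever the user passed in.
      return Mode.USE_STORED_CONFIGURATION;
    }
    if (members.contains(local)) {
      // The local address is in the member list: start as a full member.
      return Mode.START_AS_FOLLOWER;
    }
    // The local address is not in the member list: try to join via the listed servers.
    return Mode.JOIN_EXISTING_CLUSTER;
  }

  public static void main(String[] args) {
    List<String> others = Arrays.asList("localhost:5001", "localhost:5002");
    List<String> all = Arrays.asList("localhost:5000", "localhost:5001", "localhost:5002");
    System.out.println(decide(Optional.empty(), "localhost:5000", others));
    System.out.println(decide(Optional.empty(), "localhost:5000", all));
    System.out.println(decide(Optional.of(all), "localhost:5000", others));
  }
}
```

The stored configuration is checked first so that a restarted server does not go through the join path again.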
    Address address = new Address("localhost", 5000);
    List<Address> members = Arrays.asList(
      new Address("localhost", 5001),
      new Address("localhost", 5002)
    );
    CopycatServer server = CopycatServer.builder(address, members).build();
Because the local server's address Address("localhost", 5000) is not listed in the members list, this code will cause the server to initially attempt to join the cluster.
    Address address = new Address("localhost", 5000);
    List<Address> members = Arrays.asList(
      new Address("localhost", 5000),
      new Address("localhost", 5001),
      new Address("localhost", 5002)
    );
    CopycatServer server = CopycatServer.builder(address, members).build();
Because the local server's address Address("localhost", 5000) is listed in the members list, this code will cause the server to initially transition to the FOLLOWER state and begin talking to other servers.
The problem is, if all servers are started using the first method, then every server will be trying to join every other server and none of them is actually a member. The initial cluster needs to have some set of members that are started using the second method so that additional servers can be added. Raft requires that cluster membership changes be done as commits through the Raft log, and so at least one server needs to be started in the Raft FOLLOWER state. I've only been able to reproduce the join timeouts by starting all servers using the first method.
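A quick way to sanity-check a fresh cluster setup against this rule is to verify that at least one server lists its own address in the member list it is started with. The sketch below is plain Java with hypothetical names (`BootstrapCheck`, `hasBootstrapMember`) and string addresses. It only models a first start with no configuration stored on disk, and it is not something Copycat provides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for checking a test setup; not part of the Copycat API.
class BootstrapCheck {

  // Key: a server's own address. Value: the member list that server is started with.
  // Returns true if at least one server would start as a full member (FOLLOWER)
  // instead of trying to join.
  static boolean hasBootstrapMember(Map<String, List<String>> membersByServer) {
    for (Map.Entry<String, List<String>> entry : membersByServer.entrySet()) {
      if (entry.getValue().contains(entry.getKey())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Every server started with the first method: nobody is a member yet.
    Map<String, List<String>> allJoining = new LinkedHashMap<>();
    allJoining.put("localhost:5000", Arrays.asList("localhost:5001", "localhost:5002"));
    allJoining.put("localhost:5001", Arrays.asList("localhost:5000", "localhost:5002"));
    allJoining.put("localhost:5002", Arrays.asList("localhost:5000", "localhost:5001"));
    System.out.println("all joining, bootstrap member present: " + hasBootstrapMember(allJoining));

    // Same cluster, but 5000 is started with the second method.
    Map<String, List<String>> oneSeed = new LinkedHashMap<>(allJoining);
    oneSeed.put("localhost:5000", Arrays.asList("localhost:5000", "localhost:5001", "localhost:5002"));
    System.out.println("one seed, bootstrap member present: " + hasBootstrapMember(oneSeed));
  }
}
```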
Once a server has been started and has successfully joined the cluster, thereafter it will use the configuration stored on disk (if the StorageLevel is persistent, such as MAPPED). So, the second time you start the server in the first example, it will immediately transition to FOLLOWER and continue its participation in the Raft algorithm (assuming it was able to successfully join the first time it started).
If this is not what's causing the startup failure then it could be a bug that I can't reproduce. Also, when testing, keep in mind that with StorageLevel.DISK (the default) state will still be loaded from disk on separate runs. I typically use StorageLevel.MEMORY for a lot of testing and then switch to StorageLevel.DISK for final tests.
StorageLevel.MAPPED which are waaaaay faster than disk
    CopycatServer server = CopycatServer.builder()
      .withElectionTimeout(Duration.ofSeconds(1))
      .build();
In this case, the leader sends a heartbeat to a follower, but the heartbeat doesn’t arrive for a couple seconds. Keep in mind, this isn’t even over a network. This is just threading.
14:16:52.190  DEBUG i.a.c.server.state.LeaderAppender - 5001 - Sent AppendRequest[term=1, leader=2130712285, logIndex=5, logTerm=1, entries=, commitIndex=5, globalIndex=5] to 5004
14:16:53.238  DEBUG i.a.c.server.state.FollowerState - 5002 - Heartbeat timed out in PT1.28S
14:16:53.341  DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT1.387S
14:16:53.422  DEBUG i.a.c.server.state.FollowerState - 5004 - Heartbeat timed out in PT1.469S
14:16:53.432  DEBUG i.a.c.server.state.FollowerState - 5005 - Heartbeat timed out in PT1.479S
This one is over 8 seconds, but I’ve only seen these in Copycat tests and not in Atomix
14:32:09.136  DEBUG i.a.copycat.server.state.ServerState - 5001 - Sending server identification to 5002
14:32:09.138  DEBUG i.a.copycat.server.state.ServerState - 5003 - Sending server identification to 5002
14:32:17.633  DEBUG i.a.c.server.state.FollowerState - 5001 - Heartbeat timed out in PT0.79S
14:32:17.633  DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT0.85S
The timeout was 0.79S, but it still took 8 seconds to trigger.
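The gap can be worked out directly from the log lines using only `java.time`. This sketch assumes the timer for 5001 was armed at roughly the time of its previous log line (14:32:09.136), which the logs do not actually confirm. `TimerLag` and `lag` are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for reading the logs above; not part of Copycat.
class TimerLag {

  // lastActivity: timestamp of the last log line before the timeout (assumed timer start).
  // timeoutLogged: timestamp of the "Heartbeat timed out" line.
  // configuredTimeout: the ISO-8601 duration printed in that line, e.g. "PT0.79S".
  // Returns how much later than configured the timeout actually fired.
  static Duration lag(String lastActivity, String timeoutLogged, String configuredTimeout) {
    Duration observed = Duration.between(LocalTime.parse(lastActivity), LocalTime.parse(timeoutLogged));
    return observed.minus(Duration.parse(configuredTimeout));
  }

  public static void main(String[] args) {
    Duration late = lag("14:32:09.136", "14:32:17.633", "PT0.79S");
    // Observed gap is 8.497s against a 0.79s timeout, so the timer fired about 7.7s late.
    System.out.println("Timer fired " + late.toMillis() + " ms late");
  }
}
```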