And my cluster got some warnings.
2018-09-30 14:13:33,627 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:34,780 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:34,781 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:37,486 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:37,486 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:39,804 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.util.concurrent.TimeoutException: Request timed out in 5024 milliseconds
Does this matter?
Do I need to recreate the ClusterCommunicationService client periodically or for every message?
But to answer your question more directly: no, you don’t have to recreate clients periodically or on every message. That would be incredibly inefficient.
If Raft partitions are available and a node is able to read leadership information from them, that suggests the nodes are not actually having trouble communicating. So if nodes are able to communicate but the message to the leader is not getting a response, then I’d become concerned about a blocked thread or some exception causing messages to be lost. It’s hard to debug the problem just from your question. Is the sender just getting back TimeoutExceptions?
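One quick way to check is to attach a handler to whatever future the send returns and log the failure type. A rough sketch - sendFuture and log are just placeholders here, not anything from your code:
sendFuture.whenComplete((response, error) -> {
    if (error != null) {
        // Was it a TimeoutException, a ConnectException, or something else entirely?
        log.warn("Send to leader failed", error);
    }
});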
LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(Resource.RUNNER_GROUP_NAME)
.withSerializer(Serializer.using(Namespace.builder()
.register(Namespaces.BASIC)
.nextId(Namespaces.BEGIN_USER_CUSTOM_ID)
.register(MemberId.class)
.build()))
.build();
election.addListener(new DEElectionListener(resource));
Leadership<MemberId> leadership = election.run(
atomix.getMembershipService().getLocalMember().id());
return atomix.getCommunicationService().send(
producerName, req,
ProtocolTools.getSerializer()::encode,
ProtocolTools.getSerializer()::decode,
election.getLeadership().leader().id(),
Duration.ofMinutes(3));
I create the election once, like this:
LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(groupName)
.withSerializer(Serializer.using(Namespace.builder()
.register(Namespaces.BASIC)
.nextId(Namespaces.BEGIN_USER_CUSTOM_ID)
.register(MemberId.class)
.build())).build();
I get a NullPointerException here: election.getLeadership().leader().id()
That seems to imply that the leader’s session expired at some point. Basically, the clients that call run were disconnected from the Raft partitions at some point, so the leader election state machine removed the leader, and no leader was left to take its place.
I think this type of scenario needs to be better described in the documentation. The LeaderElection primitive provides events to notify client code when it becomes disconnected, thus risking a leader being evicted. The correct way to ensure a persistent leader is to listen for those events and re-run the leader election, or otherwise ensure a leader still exists, once the client is reconnected.
I have some example code from ONOS where we do this.
I’m guessing the leader change happens around the same time as the connection issues. It’s hard to tell why that’s happening. There are some ways to reduce the chance of false positives (which is what this is), but they’re impossible to avoid completely and so just need to be handled correctly. For some reason, at some point either the leader’s keep-alive is not making it to a Raft partition quickly enough or the ClusterService detects a failed node, also resulting in a leader change.
election.addStateChangeListener(state -> {
if (state == PrimitiveState.SUSPENDED) {
// The client cannot communicate with the cluster, so the leader may or may not be lost
}
if (state == PrimitiveState.CONNECTED) {
// The client reconnected to the cluster after a brief disruption. Recreate leadership elections if necessary
}
});
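To make the CONNECTED branch concrete, here’s a minimal sketch of re-entering the election after a reconnect. It assumes the election and local member from the snippets above, and uses async() (if I remember right) so the run call doesn’t block the listener thread:
MemberId localId = atomix.getMembershipService().getLocalMember().id();
election.addStateChangeListener(state -> {
    if (state == PrimitiveState.SUSPENDED) {
        // The client cannot reach the cluster; its session (and with it the
        // leadership/candidacy) may be expired on the server side.
    } else if (state == PrimitiveState.CONNECTED) {
        // The session was re-established: re-enter the election so this node
        // goes back into the candidate queue. run is safe to call repeatedly.
        election.async().run(localId);
    }
});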
LeaderElection sends periodic keep-alives to the cluster, and if a keep-alive is for some reason lost, the client will be assumed crashed and the leader will be removed, as it should be. The client just has to recognize that perceived crash and re-inform the cluster that it’s ready to be the leader again. For example, this is exactly what would happen in a network partition, and it’s how you handle a partition gracefully.
When that happens, just call run again!
You can also use Recovery.RECOVER in the multi-Raft protocol configuration, and the client will recreate itself when it becomes disconnected.
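If I’m remembering the builder API right, that would look something like this (reusing the group name and serializer from your snippet; treat it as a sketch rather than gospel):
MultiRaftProtocol protocol = MultiRaftProtocol.builder()
    .withRecoveryStrategy(Recovery.RECOVER) // recover the session automatically when it expires
    .build();

LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(groupName)
    .withProtocol(protocol)
    .withSerializer(Serializer.using(Namespace.builder()
        .register(Namespaces.BASIC)
        .nextId(Namespaces.BEGIN_USER_CUSTOM_ID)
        .register(MemberId.class)
        .build()))
    .build();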
election.run is what requests leadership for the client. But when the client’s session is expired (PrimitiveState.SUSPENDED), its leadership is potentially lost, so it needs to request it again: call run again to run for another election term.
In ONOS, when the state goes back to ACTIVE (or CONNECTED in Atomix 3), it calls run again for all the pending elections to ensure the client is in the leader queue. The LeaderElector run calls are idempotent, and I think LeaderElection should be the same: you can call run as much as you want to keep the client in the election. (ONOS uses the LeaderElector primitive, but I guess that’s another discussion.)
Code that reads the Leadership terms should also account for no leader being present (the cause of the NPE), which can still happen if all the candidates become disconnected from the Raft partition at the same time for some reason.
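For example, a defensive version of the send from earlier might look like this. It’s just a sketch reusing producerName, req, and ProtocolTools from your snippet:
Leadership<MemberId> leadership = election.getLeadership();
Leader<MemberId> leader = leadership == null ? null : leadership.leader();
if (leader == null) {
    // No leader at the moment (e.g. all candidates were disconnected at once):
    // re-enter the election and retry later instead of hitting the NPE.
    election.run(atomix.getMembershipService().getLocalMember().id());
    throw new IllegalStateException("no leader currently elected");
}
return atomix.getCommunicationService().send(
    producerName, req,
    ProtocolTools.getSerializer()::encode,
    ProtocolTools.getSerializer()::decode,
    leader.id(),
    Duration.ofMinutes(3));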
So should I call run on the LeaderElection every time the node becomes CONNECTED after it was SUSPENDED?
Yeah, the latter. The nodes that are trying to get elected leader need to also listen for LeaderElection disconnect/connect events and call run after a reconnect event (CONNECTED). That should be all that’s needed with the default configuration.
I really think we should modify the leader election primitives to handle this as an option.
By restart I mean: I restarted my proxy program, which will get the LeaderElection again, and it still cannot get the leader.
Gotcha. Yeah, that’s really unreliable for coordination primitives like locks and elections. It probably shouldn’t be that unreliable, but it will gladly cause split brain because it doesn’t use consensus. You have to at least have a Raft cluster somewhere, either managing the primaries or managing the leader election.
IIRC the tests do not run on a profile that doesn’t use a Raft partition for primary election. That suggests the algorithm used for primary election when Raft is not present is unreliable. We probably need to investigate that.
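For what it’s worth, here’s roughly what backing the cluster with Raft looks like - a sketch from memory of the Atomix 3 builder API, with member IDs, addresses, and directories as placeholders:
Atomix atomix = Atomix.builder()
    .withMemberId("member-1")
    .withAddress("10.0.0.1:5679")
    // Raft-backed system (management) group: primary elections and primitive
    // metadata go through consensus instead of the primary-backup protocol.
    .withManagementGroup(RaftPartitionGroup.builder("system")
        .withNumPartitions(1)
        .withMembers("member-1", "member-2", "member-3")
        .withDataDirectory(new File("/var/lib/atomix/system"))
        .build())
    // Raft data partitions for the primitives themselves (e.g. the LeaderElection).
    .withPartitionGroups(RaftPartitionGroup.builder("data")
        .withNumPartitions(3)
        .withMembers("member-1", "member-2", "member-3")
        .withDataDirectory(new File("/var/lib/atomix/data"))
        .build())
    .build();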
Good question. ONOS actually should be closing primitives when components are deactivated, but we deactivate components so infrequently that it’s never been addressed.
There’s a lot of overhead to closing and recreating primitives. Basically, when a new primitive instance is created (via a builder), a new logical Raft session is opened. That’s one additional write to each Raft partition used by the primitive. Then when it’s closed, there’s another write to each partition used by the primitive to tell the partitions the primitive is no longer in use. So it’s a lot more costly to keep creating primitives than to reuse an existing one. If you plan to continue to use a primitive, it’s much more efficient just to keep it.
Where closing the primitive becomes useful is when you have, say, a leader election primitive and stop using it: you can call close to immediately notify the cluster that it’s no longer in use, so e.g. a new leader can be elected. Closing the primitive will free up some resources as well - e.g. periodic keep-alives - so if you know you’re not going to use it again you should close it, and you certainly don’t want to keep creating primitives without closing them, because that will lead to a memory leak and will likely clog the network.
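So the pattern is roughly: build the primitive once, hold on to it, and close it only when you’re really done with it. A sketch (the election name here is just a placeholder):
// Build once and keep the reference for the lifetime of the component,
// with whatever serializer/protocol configuration you already use.
LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder("my-election")
    .build();

// ... use election.run(...), election.getLeadership(), listeners, etc. ...

// When the component shuts down or stops caring about leadership, release the
// primitive so the cluster can drop the session and elect someone else.
election.close();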