These are chat archives for atomix/atomix

8th Oct 2018
william.z
@zwillim
Oct 08 2018 01:52
I keep a client instance and have it send messages to the leader from time to time. But after a day or two, I found that I can no longer get a response from the leader. I'm confused: do I need to restart the client from time to time, or do I need to build a new client instance every time I need to send a message?

And my cluster got some warnings.

2018-09-30 14:13:33,627 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:34,780 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:34,781 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:37,486 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:37,486 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-data-partition-1] - RaftServer{data-partition-1}{role=FOLLOWER} - java.net.ConnectException
2018-09-30 14:13:39,804 [WARN] [io.atomix.utils.logging.DelegatingLogger.warn(DelegatingLogger.java:230)] [raft-server-system-partition-1] - RaftServer{system-partition-1}{role=FOLLOWER} - java.util.concurrent.TimeoutException: Request timed out in 5024 milliseconds

Does this matter?

Jordan Halterman
@kuujo
Oct 08 2018 02:06
What version are you running?
What do you mean by leader? You’re using a leader election primitive and sending messages to the leader? Or a Raft primitive can’t reach a Raft leader?
@zwillim
Jordan Halterman
@kuujo
Oct 08 2018 02:13
The answer depends on the context
william.z
@zwillim
Oct 08 2018 03:34
I'm using a leader election and sending messages to the leader
Jordan Halterman
@kuujo
Oct 08 2018 05:56
Okay... a few things then
I’m assuming you’re sending a message using ClusterCommunicationService?
On the warnings, by themselves they’re nothing to be concerned about. The Raft code is designed to tolerate connection issues between nodes. Connection failures and congestion will be handled with retries and backoff algorithms. They’re only something to be concerned about if they persist for a long time and there’s no obvious reason for them (e.g. a long network partition).
The most important aspect of the Raft partitions is node roles. If there’s no leader in a partition, the cluster can’t make progress. As long as there’s a leader that means the cluster is making progress and there’s nothing to worry about.
The exceptions may just be the result of some brief network issue which is common.
Jordan Halterman
@kuujo
Oct 08 2018 06:02
If there are many of them and they are being logged at the same time you’re seeing failed messages, that may indicate they’re part of the same problem.
Jordan Halterman
@kuujo
Oct 08 2018 06:08

But to answer your question more directly: no you don’t have to recreate clients periodically or on every message. That would be incredibly inefficient.

If Raft partitions are available and a node is able to read leadership information from Raft partitions, that suggests the nodes are not actually having trouble communicating. So if nodes are able to communicate but the message to the leader is not getting a response then I’d become concerned about a blocked thread or some exception causing messages to be lost. It’s hard to debug the problem just from your question. Is the sender just getting back TimeoutExceptions?

william.z
@zwillim
Oct 08 2018 06:44
I looked through my log and found the reason: I cannot get the leader.
My election is like this:
LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(Resource.RUNNER_GROUP_NAME)
        .withSerializer(Serializer.using(Namespace.builder()
                .register(Namespaces.BASIC)
                .nextId(Namespaces.BEGIN_USER_CUSTOM_ID)
                .register(MemberId.class)
                .build()))
        .build();
election.addListener(new DEElectionListener(resource));
Leadership<MemberId> leadership = election.run(
        atomix.getMembershipService().getLocalMember().id());
and my client looks up the leader like this:
return atomix.getCommunicationService().send(
        producerName, req,
        ProtocolTools.getSerializer()::encode,
        ProtocolTools.getSerializer()::decode,
        election.getLeadership().leader().id(),
        Duration.ofMinutes(3));
and I initialize the election once like this:
LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(groupName)
        .withSerializer(Serializer.using(Namespace.builder()
                .register(Namespaces.BASIC)
                .nextId(Namespaces.BEGIN_USER_CUSTOM_ID)
                .register(MemberId.class)
                .build()))
        .build();
Jordan Halterman
@kuujo
Oct 08 2018 06:48
What does "cannot get the leader" mean? Is there an exception?
william.z
@zwillim
Oct 08 2018 06:49
I got a NullPointerException here:
election.getLeadership().leader().id()
Jordan Halterman
@kuujo
Oct 08 2018 06:50
Okay...
william.z
@zwillim
Oct 08 2018 06:53
This happens after my program runs for several hours
Jordan Halterman
@kuujo
Oct 08 2018 06:53
BTW what version are you running?
william.z
@zwillim
Oct 08 2018 06:53
atomix-3.0.5
I used 1.0.8 before and hit the same problem, so I tried the latest version.
but it happens again
I'm confused
Jordan Halterman
@kuujo
Oct 08 2018 06:57

That seems to imply that the leader’s session expired at some point. Basically, the clients that call run were disconnected from the Raft partitions at some point, so the leader election state machine removed the leader, and no leader was left to take its place.

I think this type of scenario needs to be better described in the documentation. The LeaderElection primitive provides events to notify client code when it becomes disconnected, thus risking a leader being evicted. The correct way to ensure a persistent leader is to listen for those events and re-run the leader election or otherwise ensure a leader still exists once the client is reconnected.

I have some example code from ONOS where we do this.

I’m guessing the leader change happens around the same time as the connection issues. It’s hard to tell why that’s happening. There are some ways to reduce the chance of false positives, though, which is what this is. But they’re impossible to avoid completely and so just need to be handled correctly. For some reason, at some point either the leader’s keep-alive is not making it to a Raft partition quickly enough or the ClusterService detects a failed node, also resulting in a leader change.

election.addStateChangeListener(state -> {
  if (state == PrimitiveState.SUSPENDED) {
    // The client cannot communicate with the cluster, so the leader may or may not be lost
  }
  if (state == PrimitiveState.CONNECTED) {
    // The client reconnected to the cluster after a brief disruption. Recreate leadership elections if necessary
  }
});
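A rough sketch of the full pattern (using the election and MemberId from the snippets above; how you react to SUSPENDED is up to the application) might look like:
MemberId localId = atomix.getMembershipService().getLocalMember().id();

election.addStateChangeListener(state -> {
  if (state == PrimitiveState.SUSPENDED) {
    // The client can't reach the cluster; its leadership may be lost if the session expires
  } else if (state == PrimitiveState.CONNECTED) {
    // Reconnected: call run again so this node re-enters the election.
    // run calls should be idempotent, so calling run while still a candidate is harmless.
    election.run(localId);
  }
});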
Jordan Halterman
@kuujo
Oct 08 2018 07:03
The LeaderElection sends periodic keep-alives to the cluster, and if a keep-alive is for some reason lost, the client will be assumed crashed and the leader will be removed, as it should be. The client just has to recognize that perceived crash and re-inform the cluster that it's ready to be the leader again. For example, this is exactly what would happen in a network partition, and it's how you handle a partition gracefully.
william.z
@zwillim
Oct 08 2018 07:04
I restarted my client and found that the exception is still there.
Jordan Halterman
@kuujo
Oct 08 2018 07:04
Restarted and then ran run again?
You don’t need to restart the client. Just use Recovery.RECOVER in the multi-Raft protocol configuration and the client will recreate itself when it becomes disconnected.
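Roughly like this, assuming the Atomix 3 MultiRaftProtocol builder (groupName and the serializer here are the ones from your snippets; double-check the builder method names against your version):
MultiRaftProtocol protocol = MultiRaftProtocol.builder()
    .withRecoveryStrategy(Recovery.RECOVER) // recreate the session automatically when it's lost
    .build();

LeaderElection<MemberId> election = atomix.<MemberId>leaderElectionBuilder(groupName)
    .withProtocol(protocol)
    .withSerializer(serializer)
    .build();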
Calling election.run is what requests leadership for the client. But when the client's session expires (PrimitiveState.SUSPENDED), its leadership is potentially lost, so it needs to request it again
You shouldn’t need to recreate any objects. I’m pretty sure all the defaults are set to reopen sessions. The client should just have to call run again to run for another election term.
The cause of the expired sessions would be great to investigate. That’s actually what I’m working on tonight
Jordan Halterman
@kuujo
Oct 08 2018 07:09
False positives are not entirely avoidable, but they’re still too frequent with the default Atomix configuration
Let me find this in the ONOS code
This is based on an earlier iteration of Atomix when this portion of Atomix was in ONOS, but the principle is the same:
https://github.com/opennetworkinglab/onos/blob/master/core/store/dist/src/main/java/org/onosproject/store/cluster/impl/DistributedLeadershipStore.java#L143
When the client’s state changes back to ACTIVE (or CONNECTED in Atomix 3), it calls run again for all the pending elections to ensure the client is in the leader queue. The LeaderElector run calls are idempotent, and I think LeaderElection should be the same
So you should be able to just call run as much as you want to keep the client in the election
Of course, one could also argue this should at least be an option of the LeaderElector primitive. I guess that’s another discussion
Once a node runs for election, its interest in being the leader should persist through connection disruptions
Jordan Halterman
@kuujo
Oct 08 2018 07:15
But the client code reading Leadership terms should also account for no leaders being present (the cause of the NPE), which can still happen if all the candidates become disconnected from the Raft partition at the same time for some reason
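For example, a defensive version of the send in your proxy could look roughly like this (same variables as your snippet; what you do when there's no leader is up to you):
Leadership<MemberId> leadership = election.getLeadership();
if (leadership.leader() == null) {
  // No leader at the moment (e.g. all candidates lost their sessions); retry later or fail fast
  throw new IllegalStateException("No leader currently elected");
}
return atomix.getCommunicationService().send(
        producerName, req,
        ProtocolTools.getSerializer()::encode,
        ProtocolTools.getSerializer()::decode,
        leadership.leader().id(),
        Duration.ofMinutes(3));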
Part of the cause of the lost sessions could be the aggressive failure detection algorithm and session timeouts. They're designed for ONOS, where we have to elect new leaders very quickly when a leader crashes.
I'm working on a new cluster membership protocol that will be much more conservative in detecting crashed nodes
william.z
@zwillim
Oct 08 2018 07:43
I'm not so sure about this ..
Do you mean that I need to let my client take part in the leader election?
My system is like this:
there are servers, as Atomix Cluster Members, that all take part in the leader election.
and there is a proxy, as an Atomix client, that tries to send messages to the leader of the election
Or do I need to ask my servers to re-run the LeaderElection every time a node becomes CONNECTED after it was SUSPENDED?
BTW, I use the Raft protocol, and I found that the primary-backup protocol will elect multiple leaders
Jordan Halterman
@kuujo
Oct 08 2018 07:49

Yeah, the latter. The nodes that are trying to get elected leader need to also listen for LeaderElection disconnect/connect events and call run after a reconnect event (CONNECTED). That should be all that’s needed with the default configuration.

I really think we should modify the leader election primitives to handle this as an option.

The confusion in terminology is that there are a bunch of layers to Atomix. At the Raft level there are clients and servers. At the Atomix level there are primitive clients (objects) and servers (partitions). And at the application level you can create client-server architectures on top of all that as you have done.
william.z
@zwillim
Oct 08 2018 07:49
And when I said restart, I meant: I restarted my proxy program, which gets the LeaderElection again, and it still cannot get the leader
Jordan Halterman
@kuujo
Oct 08 2018 07:50
Well, restarting the proxy wouldn't have any effect on the state of the election since it's just reading the leader
william.z
@zwillim
Oct 08 2018 07:51
Oh, I see, and I will try what you said.
Jordan Halterman
@kuujo
Oct 08 2018 07:52
I’m curious about the primary-backup thing. You used just primary-backup? Or a Raft management group + primary backup group for leader election?
william.z
@zwillim
Oct 08 2018 07:52
I used the DataGrid profile
and many members say that they are the leader
Jordan Halterman
@kuujo
Oct 08 2018 07:56

Gotcha. Yeah that’s really unreliable for coordination primitives like locks and elections. Probably shouldn’t be that unreliable, but it will gladly cause split brain because it doesn’t use consensus. You have to at least have a Raft cluster somewhere either managing the primaries or managing the leader election.

IIRC the tests do not run on a profile that doesn’t use a Raft partition for primary election. That suggests the algorithm used for primary election when Raft is not present is unreliable. We probably need to investigate that.

The problem is, when there's no Raft partition available at least for primary election, we have to do a leader election using the eventually consistent group membership protocol. That's always unsafe. But we should be able to get it to be fairly reliable absent network partitions.
Alex Robin
@alexrobin
Oct 08 2018 08:02
@kuujo Please see PR #863, which I submitted and which contains two fixes for the primary-backup protocol. I will keep working on making this part more reliable, so I can contribute if you're interested.
Jordan Halterman
@kuujo
Oct 08 2018 08:02
I saw them thanks! Going to review and merge PRs tomorrow
and absolutely!
Alex Robin
@alexrobin
Oct 08 2018 08:02
excellent thanks.
Alex Robin
@alexrobin
Oct 08 2018 08:11
I'm planning to add options to keep in-memory or persistent operation logs for the primary-backup protocol, mostly to provide a more efficient and robust recovery capability. The idea would be to store only COMMAND operations in the log, not QUERY operations or anything else like heartbeats or reconnection events, the way Raft does. Do you think that's a good idea? I'm assuming I can still gain in performance compared to a Raft partition, but do you think that's true?
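As a rough sketch of the idea (hypothetical names, not the actual primary-backup internals):
// Only state-changing operations get appended to the (hypothetical) backup log;
// queries, heartbeats, etc. are executed but never logged.
void handle(PrimitiveOperation operation) {
  if (operation.id().type() == OperationType.COMMAND) {
    logWriter.append(operation); // e.g. reusing the Raft journal/writer classes
  }
  execute(operation);
}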
Johno Crawford
@johnou
Oct 08 2018 08:12
@kuujo how does that tie in with your Raft log rewrite?
Alex Robin
@alexrobin
Oct 08 2018 08:12
Note that I can probably reuse the journal and log classes that are used for Raft
Johno Crawford
@johnou
Oct 08 2018 08:14
Makes sense to me, as query ops should not mutate the state machine, and it would increase throughput on read-heavy workloads.
Alex Robin
@alexrobin
Oct 08 2018 08:18
yep
@kuujo @johnou As I understand it, the current primary-backup code does a full backup/restore when a backup gets out of sync (perhaps because it was offline for a little while), so what I'm trying to do is allow a faster restore by replaying the last operations from the log instead. I haven't looked at the details, but I'm assuming that's what Raft does.
Alex Robin
@alexrobin
Oct 08 2018 08:24
Just to clarify: the full backup-restore is not acceptable to me because I will be storing lots of data behind a primitive! (database nodes)
Johno Crawford
@johnou
Oct 08 2018 08:24
It uses snapshots, so it doesn't replay all individual ops.
Alex Robin
@alexrobin
Oct 08 2018 08:26
Right, there are snapshots too, but in my case snapshots will be much larger than replaying the end of the log, e.g. when a backup node was down for only a few minutes.
snapshots are not incremental, are they?
Jordan Halterman
@kuujo
Oct 08 2018 08:28
Right. Maybe we ought to do some computations to determine whether it’s more efficient to replicate the logs or a snapshot and only take snapshots if the logs are going to be larger than the snapshot
Alex Robin
@alexrobin
Oct 08 2018 08:29
yes, I was thinking of something like that
Jordan Halterman
@kuujo
Oct 08 2018 08:29
Or just allow a primitive to be configured to keep logs for longer to avoid sending a lot of duplicate data
It will certainly still outperform Raft
Alex Robin
@alexrobin
Oct 08 2018 08:31
It should be possible to replicate the log starting at a given index also, right? If a backup node has been down since receiving the command at index N, it should be able to come back online after replicating the log from index N onward.
@kuujo ok thanks
Jordan Halterman
@kuujo
Oct 08 2018 08:31
Yes exactly. Which is why it may be good to keep logs for longer in that case. We should also add the option to Raft
Alex Robin
@alexrobin
Oct 08 2018 08:32
ah ok I see what you meant.
I thought Raft was already doing that
So Raft either replicates the entire log if there is no snapshot yet, or uses the latest snapshot, right?
Jordan Halterman
@kuujo
Oct 08 2018 08:33
Because the problem is, when you get rid of logs you're replacing them with a persisted snapshot of the entire state. Maybe a backup is only one entry behind that snapshot, but it still needs to be sent the entire snapshot. The longer you keep the logs, the longer you can keep sending from the backup's last index rather than sending a snapshot.
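Something like this hypothetical check could capture that trade-off (the sizes would come from the journal and snapshot store):
// Hypothetical: catch a lagging backup up from the log only if the retained log still
// covers the backup's last index and replaying is cheaper than shipping a full snapshot.
static boolean shouldReplayLog(long backupLastIndex, long firstRetainedIndex,
                               long logBytesAfterBackupIndex, long snapshotBytes) {
  boolean logCoversBackup = backupLastIndex >= firstRetainedIndex - 1;
  return logCoversBackup && logBytesAfterBackupIndex < snapshotBytes;
}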
Alex Robin
@alexrobin
Oct 08 2018 08:33
yes exactly
In my case, replicating a snapshot that contains the entire state would be very inefficient, since the database could reach several TB, or at least hundreds of GB, on a single node.
Jordan Halterman
@kuujo
Oct 08 2018 08:35
Yep
Alex Robin
@alexrobin
Oct 08 2018 08:35
And I want to be able to have my backup back online ASAP
That is also the point of the PR I submitted, since it favors selecting an existing backup as the next primary. The goal there is also to avoid selecting an entirely new node that would need to restore the entire state.
So I will start experimenting with this in the primary-backup protocol and submit something for you to review.
Jordan Halterman
@kuujo
Oct 08 2018 08:38
Yes that’s what should have been happening all along
My brain is deep into the TLA+ spec for SWIM right now. Almost done though. I’ll think about it a little more once I finish the spec tomorrow.
Alex Robin
@alexrobin
Oct 08 2018 08:44
ok no problem, and you probably need to sleep also :-)
Jordan Halterman
@kuujo
Oct 08 2018 09:14
Sleep is a waste of time!
😉
Alex Robin
@alexrobin
Oct 08 2018 09:17
yeah but I noticed sometimes the brain works better after :smile:
Jordan Halterman
@kuujo
Oct 08 2018 09:20
That’s clearly a myth!
Alex Robin
@alexrobin
Oct 08 2018 09:21
yeah, because sometimes it works better before also, so I never know :smile:
Jordan Halterman
@kuujo
Oct 08 2018 09:23
Part of the round earth agenda
Eric Chavez
@echavez
Oct 08 2018 21:45
@kuujo I notice that in the documentation for all, or almost all, of the Atomix primitives there's a cleanup section saying to call close() on the instance to free up resources. What is the impact of calling this as soon as you retrieve data from or modify the primitive, but then quickly accessing it again?
In the ONOS project I don't see any calls to this function, and I'm just wondering if it's necessary or if you've noticed any issues with its use.
Jordan Halterman
@kuujo
Oct 08 2018 21:56

Good question. ONOS actually should be closing primitives when components are deactivated, but we deactivate components so infrequently that it’s never been addressed.

There's a lot of overhead to closing and recreating primitives. Basically, when a new primitive instance is created (via a builder), a new logical Raft session is opened, for example. That's one additional write to each Raft partition used by the primitive. Then when it's closed, another write to each partition used by the primitive tells the partitions the primitive is no longer in use. So it becomes a lot more costly to create and use a primitive than to reuse an existing one. If you plan to continue to use a primitive, it's much more efficient just to keep it.

Where closing the primitive becomes useful is e.g. if you have a leader election primitive and stop using it, you can call close to immediately notify the cluster that it’s no longer in use so e.g. a new leader can be elected. Closing the primitive will free up some resources - e.g. periodic keep-alives - as well, so if you do know you’re not going to use it you should close it, and you certainly don’t want to keep creating primitives without closing them because that will lead to a memory leak and likely clog the network.
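
In code the pattern is just: build the primitive once, reuse it for as long as the component lives, and close it when you're genuinely done (a sketch using AtomicMap as an example; other primitives work the same way):
// Build the primitive once and keep the reference
AtomicMap<String, String> map = atomix.<String, String>atomicMapBuilder("my-map").build();

// Reuse it for every operation; don't rebuild it per call
map.put("key", "value");

// Close only when the component shuts down or the primitive is truly no longer needed;
// this releases the session and periodic keep-alives held for this instance
map.close();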

Eric Chavez
@echavez
Oct 08 2018 22:10
Since you infrequently deactivate components in ONOS, and you mention that creating primitives without closing them could lead to a memory leak and a clogged network, do you simply not have enough primitives open at the same time for it to be an issue, or how are you dealing with this?
Eric Chavez
@echavez
Oct 08 2018 22:21
In my use case I'm accessing and updating a group of primitives for a short period of time, but then they may not be used again for some time, after which the same group will be reopened and used again. As the primitives don't have a TTL, I'll need to manage this myself, but what if the node that created or opened a primitive goes down before closing it?
  1. Will the cluster assume the primitive is still open and one of the other nodes will now need to close this connection?
  2. Is there a way to look for open primitives?