These are chat archives for atomix/atomix

9th
Aug 2016
terrytan
@txm119161336_twitter
Aug 09 2016 02:29
Hi Jordan, there are two ways that we can create the cluster: one of which is join(), the other of which is adding it to the bootstrap. But if we configure it like this, each server needs to be configured with the same bootstrap (address list), right?
I mean the address list
Jordan Halterman
@kuujo
Aug 09 2016 02:29
right...
actually the two ways to create a cluster are:
  • bootstrap(Address… cluster) with the full list of Addresses on each node
  • bootstrap() on a single node and join(address) on all the other nodes
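For illustration, a minimal sketch of the two approaches, assuming the Copycat 1.x CopycatServer API of this period (the AtomixReplica builder mirrors it); the builder options for transport, storage, and state machine are elided here, and exact signatures may differ by version:

import io.atomix.catalyst.transport.Address;
import io.atomix.copycat.server.CopycatServer;

import java.util.Arrays;
import java.util.List;

public class ClusterFormationSketch {
  public static void main(String[] args) {
    Address local = new Address("10.0.0.1", 8700);
    List<Address> cluster = Arrays.asList(
        new Address("10.0.0.1", 8700),
        new Address("10.0.0.2", 8700),
        new Address("10.0.0.3", 8700));

    CopycatServer server = CopycatServer.builder(local)
        // .withTransport(...).withStorage(...).withStateMachine(...) as usual
        .build();

    // Option 1: every node calls bootstrap() with the same full address list.
    server.bootstrap(cluster).join();

    // Option 2: one node bootstraps a single-node cluster and the rest join it:
    //   server.bootstrap().join();   // on the first node only
    //   server.join(cluster).join(); // on each additional node
  }
}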
terrytan
@txm119161336_twitter
Aug 09 2016 02:31
ok, I have seen your code, thank you
terrytan
@txm119161336_twitter
Aug 09 2016 02:39
what is the difference between configure and reconfigure? are you using two-phase commit? it seems you just simply change the members list
when the server's configure method is called
Jordan Halterman
@kuujo
Aug 09 2016 02:42

The configure method is something that’s called internally on the leader to update the configuration. When a configuration change occurs, a new ConfigurationEntry is appended to the log and committed, and cluster membership is based on that configuration. But Raft dictates that configuration changes must be applied immediately for safety, which is why you see the membership being immediately updated as well.

The ReconfigureRequest is a server-to-server request that’s sent by followers/passive members to request a change in the cluster configuration, for example to promote or demote a member between the passive/active states.

Respectively, join adds a member to the configuration, leave removes a member, and reconfigure changes a member's properties
configure logs a configuration change to the Raft log, updates the leader’s configuration, and commits the change
Love all the questions :-)
Jordan Halterman
@kuujo
Aug 09 2016 02:47
Configuration changes are a really delicate process since they affect the quorum size. Removing a leader from the cluster is different from removing a follower, which is different again from removing a passive member. Removing a voting member risks shrinking the cluster to the point where it can no longer make progress, so to be able to commit the configuration change it has to happen in a very specific order. Split brain can occur if more than one configuration change is allowed to occur simultaneously, and so on.
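To make the quorum arithmetic concrete, a small self-contained illustration (not Copycat code) of how the quorum size shifts as voting members are added or removed, which is why only one configuration change can be in flight at a time:

public class QuorumSizeExample {
  // A Raft quorum is a simple majority of the voting members.
  static int quorum(int votingMembers) {
    return votingMembers / 2 + 1;
  }

  public static void main(String[] args) {
    for (int n = 2; n <= 5; n++) {
      System.out.println(n + " voting members -> quorum of " + quorum(n)
          + ", tolerates " + (n - quorum(n)) + " failure(s)");
    }
    // Shrinking from 3 voting members to 2 drops the fault tolerance from one
    // failure to zero, so the order and timing of the change matters.
  }
}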
terrytan
@txm119161336_twitter
Aug 09 2016 03:04
the leader is the only one who can handle configuring, and also the only one with the join method, but I found the joining client will send join requests to all the servers, regardless of whether it is the leader or not. why?
it is an iterator, looping over each active server to send join requests
Jordan Halterman
@kuujo
Aug 09 2016 03:46
Because a joining node doesn't know which node is the leader until it joins, and a node can join through any other node. So, in effect it's just sending the request to the first node that will answer it. If a follower receives a JoinRequest, it will proxy it to the leader if it knows of a leader.
Almost any request can be proxied. Clients can connect to any server. Servers can join any server. This means a client doesn't have to know about all servers to register a session, and a server doesn't have to know about all servers to join. They just have to know of one server that knows who the leader is.
Requiring the server to connect to the leader to join assumes that server knows that the current leader exists at all. That may be a poor assumption considering that cluster membership can change and a server's local stored configuration may be outdated.
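A hypothetical sketch of that forwarding decision (not Copycat's actual server code; the types here are stand-ins), just to show the idea that any server can accept the request and proxy it:

import java.util.Optional;

public class JoinForwardingSketch {

  enum Result { HANDLED_LOCALLY, FORWARDED_TO_LEADER, NO_LEADER_KNOWN }

  // What a server does with a join request it receives from a joining node.
  static Result handleJoin(String self, Optional<String> knownLeader) {
    if (!knownLeader.isPresent()) {
      return Result.NO_LEADER_KNOWN; // the joining node will retry through another server
    }
    return knownLeader.get().equals(self) ? Result.HANDLED_LOCALLY : Result.FORWARDED_TO_LEADER;
  }

  public static void main(String[] args) {
    System.out.println(handleJoin("10.0.0.2:8700", Optional.of("10.0.0.1:8700"))); // FORWARDED_TO_LEADER
    System.out.println(handleJoin("10.0.0.1:8700", Optional.of("10.0.0.1:8700"))); // HANDLED_LOCALLY
    System.out.println(handleJoin("10.0.0.3:8700", Optional.empty()));              // NO_LEADER_KNOWN
  }
}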
terrytan
@txm119161336_twitter
Aug 09 2016 05:36
I have seen it; I misunderstood your design at first. I thought it was violating the Raft paper, but now I see the follower state actually extends the reserved state class, in which there is a forward method
terrytan
@txm119161336_twitter
Aug 09 2016 05:59
I found the initial state of the joining server is active, which means that once the server finishes the configuration it will be one of the voters, but it is not in sync with the leader. will that be a problem?
Jordan Halterman
@kuujo
Aug 09 2016 06:09
There's some risk to it, but it's not a safety issue. If a member is added as a full voting member but isn't largely caught up to the rest of the cluster, the failure of another node could be more likely to result in a loss of availability, but not a safety violation. Ideally it should be added in a PROMOTABLE state and caught up to near the leader before becoming a voting member. There's not really any perfect way to determine what "in-sync" means, but certainly getting close can reduce that risk.
terrytan
@txm119161336_twitter
Aug 09 2016 07:04
for the append entries part, I did not find any check for receiving responses from a majority of the servers; you just commit the log?

/**
 * Handles a {@link Response.Status#OK} response.
 */
protected void handleAppendResponseOk(MemberState member, AppendRequest request, AppendResponse response) {
  // Reset the member failure count and update the member's availability status if necessary.
  succeedAttempt(member);

  // If replication succeeded then trigger commit futures.
  if (response.succeeded()) {
    updateMatchIndex(member, response);

    // If entries were committed to the replica then check commit indexes.
    if (!request.entries().isEmpty()) {
      commitEntries();
    }

    // If there are more entries to send then attempt to send another commit.
    if (hasMoreEntries(member)) {
      appendEntries(member);
    }
  }
  // If we've received a greater term, update the term and transition back to follower.
  else if (response.term() > context.getTerm()) {
    context.setTerm(response.term()).setLeader(0);
    context.transition(CopycatServer.State.FOLLOWER);
  }
  // If the response failed, the follower should have provided the correct last index in their log. This helps
  // us converge on the matchIndex faster than by simply decrementing nextIndex one index at a time.
  else {
    resetMatchIndex(member, response);
    resetNextIndex(member);

    // If there are more entries to send then attempt to send another commit.
    if (hasMoreEntries(member)) {
      appendEntries(member);
    }
  }
}

Jordan Halterman
@kuujo
Aug 09 2016 07:14
It's right there... look at the commitEntries method
An entry is committed when commitIndex is >= the entry index. There's no need to literally count responses from followers; that would be inefficient. It's more efficient for messages to each follower to behave independently. Raft already tracks the highest entry stored on any follower - matchIndex - and commitIndex is a function of all matchIndexes. That is, when a majority of matchIndexes are >= an entry's index, that entry is committed and the commitIndex can be increased.
Jordan Halterman
@kuujo
Aug 09 2016 07:28
The LeaderAppender is designed to pipeline a couple requests to each follower at any given time. Each follower can be at different points in the log, so entries can continue to be committed while a follower lags behind. The matchIndexes are used to determine the commitIndex and thus commitment for all entries on each response from any follower. This is preferable to just sending all the requests outstanding to a follower and counting responses to commit entries.
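A self-contained illustration of that idea (not the actual LeaderAppender code): sort the matchIndexes of the voting members, including the leader's own last log index, and the value at the majority position is the highest index stored on a quorum. (The real implementation also has to respect the Raft rule that only entries from the leader's current term are committed by counting replicas.)

import java.util.Arrays;

public class CommitIndexExample {

  // Highest index known to be stored on a majority of the given servers.
  static long commitIndexFrom(long[] matchIndexes) {
    long[] sorted = matchIndexes.clone();
    Arrays.sort(sorted);
    // In ascending order, the element at position (n - 1) / 2 and everything after it
    // covers a majority of the servers.
    return sorted[(sorted.length - 1) / 2];
  }

  public static void main(String[] args) {
    // 5 voting members: the leader is at index 10; followers have stored up to 9, 9, 7 and 4.
    System.out.println(commitIndexFrom(new long[]{10, 9, 9, 7, 4})); // prints 9
  }
}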
terrytan
@txm119161336_twitter
Aug 09 2016 07:47
that is a really good design; the middle server must already have the middle matchIndex, so it is safe to commit it
but for the method private void appendCommand(long index, CompletableFuture<CommandResponse> future), it is still single-threaded, so it has to wait until at least a majority of the responses have been received, then apply to the state machine
I don't know if what I am saying is correct or not
terrytan
@txm119161336_twitter
Aug 09 2016 07:56
I understand now, you have a kind of check: if it is null, it finishes immediately
terrytan
@txm119161336_twitter
Aug 09 2016 08:53
you made the commit process and append entries together in one call; it will append the entries not yet in the follower's log, carry the commit index to the follower, and make the follower commit and apply the change, right?
this process is the same as what the Raft paper describes
Raghav Babu Subramanian
@RaghavBabu
Aug 09 2016 14:46
@kuujo Hi Jordan, how quickly will a log compaction process be started after a commit is released from the state machine? Each time I restart, my old writes are getting replayed, so does it mean I have not given enough time for compaction to occur?
Jordan Halterman
@kuujo
Aug 09 2016 20:28
Sorry I've been in meetings...
Jordan Halterman
@kuujo
Aug 09 2016 20:34
@RaghavBabu that's totally expected. The log is broken into segments, and compaction only occurs on full segments where all entries are committed. When release is called on a commit, nothing is persisted since it's unnecessary. We could write a flag to disk, but that would mean a costly index lookup and write since commits can be released from any point in the log. So, when release is called a bit is flipped in an in-memory bit array that's lost on restart. But this doesn't negatively impact anything; replaying the commits is just necessary to rebuild that bit array.

There are some other factors that contribute to deciding when to compact a segment too, but Copycat ensures it's done safely. e.g. a segment won't be compacted until all clients have received events related to commands in that segment. That allows the replay of the log to also rebuild events in the event of a crash. If an entry that triggered an event were to be removed from disk too soon, a failure could mean that event is lost.

Similarly, not all released entries will be removed when a segment is compacted. During minor compaction, only non-tombstone entries are removed; tombstones will remain and may be replayed at startup. Major compaction is necessary to sequentially remove all tombstones from all segments, ensuring that any entries that created state later deleted by a tombstone are removed before the tombstone itself.
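A rough self-contained sketch of the idea (not Copycat's internals): the released flags are just bits in memory, rebuilt by replaying the log, and a full segment whose entries are all released becomes a candidate for compaction:

import java.util.BitSet;

public class ReleaseBitsSketch {

  // Rebuild the in-memory "released" bits by replaying commits after a restart;
  // nothing was ever written to disk when release was called.
  static BitSet releasedBits(long[] releasedIndexes) {
    BitSet released = new BitSet();
    for (long index : releasedIndexes) {
      released.set((int) index);
    }
    return released;
  }

  // In this sketch a segment is compactable once every entry in it has been released.
  static boolean segmentCompactable(BitSet released, int firstIndex, int lastIndex) {
    return released.get(firstIndex, lastIndex + 1).cardinality() == lastIndex - firstIndex + 1;
  }

  public static void main(String[] args) {
    BitSet released = releasedBits(new long[]{1, 2, 3});
    System.out.println(segmentCompactable(released, 1, 3)); // true
    System.out.println(segmentCompactable(released, 1, 4)); // false: index 4 not yet released
  }
}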
Jordan Halterman
@kuujo
Aug 09 2016 20:51
That's right. In practice, followers lag slightly behind leaders in what they believe to be committed. So, once a leader increases the commitIndex it is safe to apply the newly committed entries to its own state machine, and on the next AppendEntries RPC to a follower it will get the updated commitIndex. If the leader crashes, the next node that's elected will have all the entries up to that leader's commitIndex and will itself commit them.
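For reference, the follower-side rule from the Raft paper that produces that slight lag, as a tiny self-contained example (illustrative only):

public class FollowerCommitSketch {

  // On AppendEntries, a follower advances its commitIndex to the smaller of the
  // leader's commitIndex and the index of the last entry in its own log.
  static long advanceCommitIndex(long currentCommitIndex, long leaderCommitIndex, long lastLogIndex) {
    return leaderCommitIndex > currentCommitIndex
        ? Math.min(leaderCommitIndex, lastLogIndex)
        : currentCommitIndex;
  }

  public static void main(String[] args) {
    // The leader has committed up to 12, but this follower only has entries up to 10.
    System.out.println(advanceCommitIndex(8, 12, 10)); // 10
    // After the missing entries arrive, the next AppendEntries lets it catch up.
    System.out.println(advanceCommitIndex(10, 12, 12)); // 12
  }
}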
Raghav Babu Subramanian
@RaghavBabu
Aug 09 2016 22:51
@kuujo okay, thank you. But for now, I am just working on a single-node cluster, and I am writing a single value. So each time I restart my node, the log is replayed and a duplicate write command is added to the log. So as per my understanding, does it mean I need to delete the storage directory each time I restart, if I am using disk-based storage?
Jordan Halterman
@kuujo
Aug 09 2016 23:08
If you want to wipe out all the state, e.g. for testing, you can delete the logs. But you have to keep in mind that there are other things stored in the log. It's not just a history of state machine commands; it's also configuration changes, sessions, leader changes, etc. If you're restarting the entire cluster it's fine to get rid of those.

But in a state machine there's no difference between put-delete-put-delete-put and just put. The final state is all that matters. Log compaction necessarily has to happen to get rid of commands that no longer contribute to that final state. The only case where replaying commands could have a negative impact on the state machine is if it's a persistent state machine, i.e. one writing the commit to disk or to some other data store. In that case, the proper way to handle replays of the log is by comparing the Commit.index() to that external data store and doing a compare-and-set to ensure a newer value is not overwritten by an older one. I don't know the context of what you're doing, but that could be relevant.
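A minimal sketch of that compare-and-set pattern for a persistent state machine (illustrative only; the map below is a stand-in for whatever external store you are writing to, and the index would come from Commit.index()):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IndexedValueStore {

  static final class Versioned {
    final long index;
    final String value;
    Versioned(long index, String value) { this.index = index; this.value = value; }
  }

  private final Map<String, Versioned> store = new ConcurrentHashMap<>();

  // Apply the write only if its log index is newer than what is already stored, so
  // replaying old commands after a restart never overwrites a newer value.
  boolean putIfNewer(String key, String value, long commitIndex) {
    Versioned result = store.merge(key, new Versioned(commitIndex, value),
        (existing, proposed) -> proposed.index > existing.index ? proposed : existing);
    return result.index == commitIndex;
  }

  public static void main(String[] args) {
    IndexedValueStore store = new IndexedValueStore();
    System.out.println(store.putIfNewer("value", "new", 7)); // true
    System.out.println(store.putIfNewer("value", "old", 5)); // false: older replayed command ignored
  }
}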