maxSegmentSize when figuring out which segments it could compact, but then using the persisted segment size of the largest segment to create a new segment during compaction. So, if the user increases the maxSegmentSize after the cluster has been running for a while, major compaction can fail.
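To make the mismatch concrete, here's a purely hypothetical sketch of the pattern (made-up names, not Copycat's actual code):

    class SegmentSizeMismatch {
      // The user-configured limit, which can be raised between restarts.
      long configuredMaxSegmentSize = 64 * 1024 * 1024;
      // The limit persisted in an existing segment's descriptor when it was
      // created under the old configuration.
      long persistedMaxSegmentSize = 32 * 1024 * 1024;

      // Eligibility for major compaction is checked against the configured limit...
      boolean isCompactable(long combinedSegmentSize) {
        return combinedSegmentSize <= configuredMaxSegmentSize;
      }

      // ...but the replacement segment is sized from the persisted limit, so
      // entries selected under the larger limit may not fit, and major
      // compaction fails.
      long compactedSegmentCapacity() {
        return persistedMaxSegmentSize;
      }
    }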
IndexOutOfBoundsException yet, though :-(
IndexOutOfBoundsException. Not the same one, but this is really promising.
StackOverflowExceptions in the clients, which should be pretty easy to prevent.
@jhall11 the code looks vaguely correct to me, but wtf do I know? But FYI, what we should be expecting (in Java code) when building and starting a replica is something like:
    AtomixReplica replica = AtomixReplica.builder(new Address("18.104.22.168", 9876))
        .withTransport(new NettyTransport())
        .withStorage(Storage.builder()
            .withStorageLevel(StorageLevel.DISK)
            .build())
        .build();
    replica.bootstrap(cluster).join();
cluster is a list of Addresses (all the initial nodes in the cluster).
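For example, building that list might look something like this (the addresses are just placeholders):

    List<Address> cluster = Arrays.asList(
        new Address("18.104.22.168", 9876),
        new Address("10.0.0.2", 9876),
        new Address("10.0.0.3", 9876));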
So, compared to the original API we basically moved the cluster membership list from the builder method to the bootstrap() method.
Alternatively, one can bootstrap() a single node (with no arguments) and join(bootstrappedAddress) the other nodes to the bootstrapped node. The latter is an unproven aspect of the Raft protocol that I'd love to test in Jepsen. There were bugs found in the Raft protocol for configuration changes just last year.
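In other words, something along these lines (addresses are made up, and error handling is omitted):

    // Bootstrap a single-node cluster.
    AtomixReplica seed = AtomixReplica.builder(new Address("10.0.0.1", 8700))
        .withTransport(new NettyTransport())
        .build();
    seed.bootstrap().join();

    // Join a second node to the bootstrapped node.
    AtomixReplica member = AtomixReplica.builder(new Address("10.0.0.2", 8700))
        .withTransport(new NettyTransport())
        .build();
    member.join(new Address("10.0.0.1", 8700)).join();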
As for the work I've been doing, I found quite a few minor issues. First were the configuration issues, which were simple to fix. But more critically, there are some fundamental flaws with concurrency in the log. The log operates fine for writing and replicating changes, but the compaction process introduces concurrency that the log doesn't handle well enough and that would likely be expensive to prevent. Copycat 2.0's log is designed from the ground up for concurrency, supporting multiple concurrent readers for replication, and that support will solve the concurrency issues that infrequently occur during log compaction. These issues can materialize when catching up a follower or committing old entries in the log during compaction of a segment, but they don't prevent the system from progressing. Copycat guards against failures during compaction: it won't commit a segment until the compaction process is complete, and it will clean up partially compacted segments at startup. The latter process could be extended to perform periodic cleanup of the log during normal operation, but that's just a bandaid, and partially compacted segments could still wreak havoc in unpredictable ways.
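To illustrate the multiple-reader idea, here's a toy sketch (made-up names, not the actual Copycat 2.0 log):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    final class MultiReaderLog {
      // Appends only happen at the tail; CopyOnWriteArrayList lets readers
      // iterate concurrently without blocking the writer.
      private final List<String> entries = new CopyOnWriteArrayList<>();

      void append(String entry) {
        entries.add(entry);
      }

      // Each reader tracks its own position, so replication to one follower
      // and a compaction scan can progress independently.
      Reader openReader() {
        return new Reader();
      }

      final class Reader {
        private int index = 0;

        boolean hasNext() {
          return index < entries.size();
        }

        String next() {
          return entries.get(index++);
        }
      }
    }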
So, I think after the next release of ONOS I should probably work on getting the Copycat 2.0 log into master so it can be done in time for the following release. The new log will allow much more concurrency within the Copycat server, and hopefully we will be able to put tests in place to relax synchronization within the server and feel confident in its continued stability. The new log also makes it possible to remove serialization inside the Copycat server. So, increased concurrency and far less serialization/deserialization will hopefully mean a sizable performance boost.