These are chat archives for atomix/atomix

11th Feb 2017
Jordan Halterman
@kuujo
Feb 11 2017 00:36
I’ll let this run for the night, but it seems to be doing pretty well in general. No real issues so far aside from the one I fixed.
Jon Hall
@jhall11
Feb 11 2017 00:37
:+1:
Jordan Halterman
@kuujo
Feb 11 2017 00:39
I rewrote it with a lot more randomization and I’ll probably commit it somewhere
Jon Hall
@jhall11
Feb 11 2017 00:39
is this a unit test or a test application?
Jordan Halterman
@kuujo
Feb 11 2017 00:39
unit test
well… not really a unit test, I just wrote it as one so it would be easy to debug if I find something. It’s not deterministic. Just trying to break things
the right way to do this would be to use randomization, but also keep track of the expected state of the system based on the commands/queries submitted to it, then at the end of a lengthy test run check the logs and server states to ensure Raft’s invariants were not violated
but I’m just trying to get anything to break for now
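Roughly, a harness like that might look something like this (a sketch only; every name here is made up, not the actual test code):

// Sketch: drive the cluster with random commands, mirror them in a local model,
// then check the model and Raft's invariants at the end of the run.
Random random = new Random();
Map<String, String> expected = new HashMap<>();   // expected state of the system

for (int i = 0; i < iterations; i++) {
  String key = "key-" + random.nextInt(100);
  String value = "value-" + random.nextInt(1000);
  cluster.submitPut(key, value).join();           // hypothetical command submission
  expected.put(key, value);                       // track what the cluster should contain

  if (random.nextInt(10) == 0) {
    cluster.crashAndRestartRandomServer();        // hypothetical fault injection
  }
}

// After the run, the committed state should match the model, and the server logs
// should satisfy Raft's invariants (one leader per term, log matching, etc.).
assertEquals(expected, cluster.readAll());
assertTrue(cluster.verifyRaftInvariants());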
Jordan Halterman
@kuujo
Feb 11 2017 00:45
hmm got another one :-)
Jon Hall
@jhall11
Feb 11 2017 00:47
sweet!
Jordan Halterman
@kuujo
Feb 11 2017 00:52
Hmm this seems to be a bug resulting from my randomization. It’s one that could happen if the server is reconfigured and restarted though, so still valid but pretty obscure.
this is good
Jon Hall
@jhall11
Feb 11 2017 00:53
so the reconfiguration wasn’t committed to all the nodes?
Jordan Halterman
@kuujo
Feb 11 2017 00:54
Actually, this is a reconfiguration of the log segment sizes and stuff like that. The major compaction process was using the configured maxSegmentSize when figuring out which segments it could compact, but then using the persisted segment size of the largest segment to create a new segment during compaction. So, if the user increases the maxSegmentSize after the cluster has been running for a while then major compaction can fail.
because the segment the compactor creates is smaller than the amount of disk that it expects to be able to compact
obscure, but it’s a good thing to find those!
no IndexOutOfBoundsException yet though :-(
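To make the size mismatch concrete, a tiny illustration (made-up numbers and names, not Copycat's actual compaction code):

// The compactor judged what it could compact against the *configured* size...
long configuredMaxSegmentSize = 64 * 1024 * 1024;  // raised by the user after the cluster had been running
long persistedMaxSegmentSize  = 32 * 1024 * 1024;  // size recorded in the existing segment descriptors

long liveBytesToCompact = 48 * 1024 * 1024;

boolean plannedToFit = liveBytesToCompact <= configuredMaxSegmentSize;  // true
// ...but created the new segment using the *persisted* size, so the plan can't succeed.
boolean actuallyFits = liveBytesToCompact <= persistedMaxSegmentSize;   // false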
Jon Hall
@jhall11
Feb 11 2017 00:58
yes, it’s great to catch these little bugs
yeah, I think my test is still running too
it’s on iteration 173, which is pretty good :)
Jordan Halterman
@kuujo
Feb 11 2017 01:09
I set this test up to run in 10 minute iterations too. Going to go grab some dinner and maybe when I come back it will be broken again :-)
Jordan Halterman
@kuujo
Feb 11 2017 01:20
It randomizes just about everything - number of servers, number of clients, log/segment sizes, compaction concurrency/thresholds/frequency, server crashes, commands, queries throughout all the tests. I'll keep improving it, but this should help find some issues if they exist. Really, to improve this I can eventually randomly inject partitions in the test transport, and like I said add assertions to check that Raft's invariants hold.
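Something along these lines for the randomized parameters (a sketch with made-up names and ranges, not the actual test code):

// Randomize roughly everything the test touches each iteration.
Random random = new Random();

int servers                = 3 + random.nextInt(5);                  // cluster size
int clients                = 1 + random.nextInt(10);                 // concurrent clients
int maxSegmentSize         = 1024 * (8 + random.nextInt(120));       // log segment size
int maxEntriesPerSegment   = 128 + random.nextInt(1024);
int compactionThreads      = 1 + random.nextInt(4);                  // compaction concurrency
double compactionThreshold = 0.1 + random.nextDouble() * 0.8;
boolean crashServer        = random.nextInt(10) == 0;                // occasionally kill a server
boolean submitQuery        = random.nextBoolean();                   // otherwise submit a command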
Jon Hall
@jhall11
Feb 11 2017 01:23
:+1:
Jordan Halterman
@kuujo
Feb 11 2017 01:25
Guess this can morph into a fuzz testing framework.
Jordan Halterman
@kuujo
Feb 11 2017 01:58
Got one IndexOutOfBoundsException. Not the same one but this is really promising.
Jordan Halterman
@kuujo
Feb 11 2017 02:58
and another IllegalStateException :-)
most of these have been related to the randomized Storage configuration
Jordan Halterman
@kuujo
Feb 11 2017 03:28
seems to be chugging along pretty well with all those fixes now
Jordan Halterman
@kuujo
Feb 11 2017 06:53
I extended the test to allow the failure of a random number of up to n-1 nodes in the cluster. This led to long periods of unavailability and caused some StackOverflowExceptions in the clients, which should be pretty easy to prevent.
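The crash selection is basically this (hypothetical names, not the actual test code):

// Crash a random number of up to n-1 servers, so the cluster can lose quorum for a while.
List<Server> shuffled = new ArrayList<>(servers);
Collections.shuffle(shuffled, random);
int toCrash = random.nextInt(servers.size());      // 0..n-1
for (Server server : shuffled.subList(0, toCrash)) {
  server.kill();                                   // hypothetical crash injection
}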
Jon Hall
@jhall11
Feb 11 2017 07:52
@jhalterman In this change to trinity, should the nodes argument of replica also have been removed? It doesn’t appear to be used anymore
Jon Hall
@jhall11
Feb 11 2017 08:19
I think I’m at the point where I can now sit down and follow the Clojure code :) but I don’t think the refactor that was done last year was finished
Jordan Halterman
@kuujo
Feb 11 2017 08:52
Yay! Too much on my brain right now but I'll probably have to try to figure it out some day.
Jordan Halterman
@kuujo
Feb 11 2017 09:05

@jhall11 the code looks vaguely correct to me, but wtf do I know? But FYI what we should be expecting (in Java code) when building and starting a replica is something like:

AtomixReplica replica = AtomixReplica.builder(new Address("123.123.123.123", 9876))
  .withTransport(new NettyTransport())
  .withStorage(Storage.builder()
    .withStorageLevel(StorageLevel.DISK)
    .build())
  .build();

replica.bootstrap(cluster).join();

Where cluster is a list of Addresses (all the initial nodes in the cluster).

So, compared to the original API we basically moved the cluster membership list from the builder method to the bootstrap/join methods.

Alternatively, one can bootstrap() a single node (with no arguments) and join(bootstrappedAddress) the other nodes to the bootstrapped node. The latter is an unproven aspect of the Raft protocol that I'd love to test in Jepsen. There were bugs found in the Raft protocol for configuration changes just last year.
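That flow looks roughly like this (the addresses are made up):

// Bootstrap a single node...
AtomixReplica first = AtomixReplica.builder(new Address("123.123.123.123", 9876))
  .withTransport(new NettyTransport())
  .build();
first.bootstrap().join();

// ...then join the other nodes to the bootstrapped node.
AtomixReplica second = AtomixReplica.builder(new Address("123.123.123.124", 9876))
  .withTransport(new NettyTransport())
  .build();
second.join(new Address("123.123.123.123", 9876)).join();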

Obviously it's not correct so I know nothing :-P
Jordan Halterman
@kuujo
Feb 11 2017 09:32

As for the work I've been doing, I found quite a few minor issues. First were the configuration issues, which were simple to fix. But more critically, there really are some fundamental flaws with concurrency in the log. The log operates fine for writing and replicating changes, but the compaction process introduces concurrency that's not handled well enough by the log and would likely be expensive to try to prevent. Copycat 2.0's log is designed from the ground up for concurrency, supporting multiple concurrent readers for replication. That support for multiple concurrent readers will solve the concurrency issues that infrequently occur during log compaction. These issues can materialize when catching up a follower or committing old entries in the log during compaction of a segment, but they don't prevent the system from progressing. Copycat guards against failures during compaction: it won't commit a segment until the compaction process is complete, and it will clean up partially compacted segments at startup. The latter process could be extended to perform periodic cleanup of the log during normal operation, but that's just a bandaid, and partially compacted segments could still potentially wreak havoc in unpredictable ways.

So, I think after the next release of ONOS I should probably work on getting the Copycat 2.0 log into master so it can be done for the following release. The new log will allow much more concurrency within the Copycat server, and hopefully we will be able to put the tests in place to relax synchronization within the server and feel confident in its continued stability. The new log also makes it possible to remove serialization inside the Copycat server. So, increased concurrency and far less serialization/deserialization will hopefully mean a sizable performance boost.

It also consumes a lot less memory, which is relevant for ONOS in particular since it runs multiple partitions per node, and thus multiple logs per node.
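For a rough idea of the multi-reader direction, the log API could look something like this (hypothetical interfaces, not the actual Copycat 2.0 code):

// Appends stay single-writer, but each follower or compactor opens its own reader
// positioned at its own index, so replication and compaction don't contend on one cursor.
public interface RaftLog extends AutoCloseable {
  long append(byte[] entry);

  Reader openReader(long fromIndex);

  interface Reader extends AutoCloseable {
    boolean hasNext();
    byte[] next();
    long nextIndex();
    void close();
  }
}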
Jon Hall
@jhall11
Feb 11 2017 17:08
That sounds great! I look forward to trying it out :)
Jon Hall
@jhall11
Feb 11 2017 17:15
ok, so I got new logs with the partition log files. I think this time x.x.x.2 is the node that had the failure
this time I think it took 262 iterations