These are chat archives for atomix/atomix

27th
Feb 2017
Jordan Halterman
@kuujo
Feb 27 2017 02:45
The fuzz test is going to have to be run a lot to weed out any more potential issues in conjunction with Jepsen and other testing. But it's doing much, much better than it was a week ago. I haven't been able to produce any consistency issues with logs being compared, snapshots being taken, and nodes being shut down or removed. All consistency models are correctly adhered to thus far. I think there continue to be improvements that can be made to how clients manage connections. And the concurrency issues remain in the log wrt compaction. But aside from those things it seems to be doing really well. Jepsen is what can put the protocol under real strain, though, and we can target potential issues much better using that approach.
Going to push a new release after some more tests this evening.
Jienan Zeng
@jienan
Feb 27 2017 03:45
nice
vishwas
@vishwass
Feb 27 2017 05:05

```
21:03:01.853 [copycat-server-localhost/127.0.0.1:5000-copycat] WARN i.a.c.server.state.LeaderAppender - localhost/127.0.0.1:5000 - AppendRequest to localhost/127.0.0.1:5001 failed. Reason: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:5001
```

Could anyone please tell me why the server is showing this?

Jordan Halterman
@kuujo
Feb 27 2017 05:53
@vishwass that means a server is down. The leader is trying to replicate to another server but it can't connect.
vishwas
@vishwass
Feb 27 2017 05:55
How do I start value-state-machine separately for each server? I want to see the logs in real time for each server.
What I did was run the value-state-machine example in 3 separate terminals with different masters.
Is that the right way to do it?
Jordan Halterman
@kuujo
Feb 27 2017 06:42
what do you mean with three different masters?
you can’t choose the leader. The leader is chosen automatically
but here is an example of the commands to run...
java -jar examples/value-state-machine/target/value-state-machine.jar logs/server1 localhost:5000 localhost:5001 localhost:5002
java -jar examples/value-state-machine/target/value-state-machine.jar logs/server2 localhost:5001 localhost:5000 localhost:5002
java -jar examples/value-state-machine/target/value-state-machine.jar logs/server3 localhost:5002 localhost:5000 localhost:5001
@vishwass that will create a cluster of 3 nodes
each one has a different log directory (e.g. logs/server1) and a local address (e.g. localhost:5000) and the addresses of the other two nodes (e.g. localhost:5001 localhost:5002)
The first address is simply the address of the local node. The leader is chosen via the Raft protocol.
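The argument layout described above (log directory, then the local address, then the other members) can be sketched as a small parser. This is only an illustration: `ClusterArgs` and its fields are hypothetical names, not the example's actual source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the argument order used in the commands above.
final class ClusterArgs {
    final String logDirectory;  // e.g. logs/server1
    final String localAddress;  // e.g. localhost:5000
    final List<String> peers;   // e.g. [localhost:5001, localhost:5002]

    private ClusterArgs(String logDirectory, String localAddress, List<String> peers) {
        this.logDirectory = logDirectory;
        this.localAddress = localAddress;
        this.peers = peers;
    }

    // args[0] is the log directory, args[1] the local address, the rest are the other members.
    static ClusterArgs parse(String... args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: <log-dir> <local-address> [peer-address...]");
        }
        List<String> peers = new ArrayList<>(Arrays.asList(args).subList(2, args.length));
        return new ClusterArgs(args[0], args[1], peers);
    }
}
```

Note that nothing in the arguments marks a node as leader; every server gets the same kind of input.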
then you can start the client, and you can kill 1/3 of the nodes (a minority) and the cluster will continue to do its thing
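The reason a 3-node cluster keeps working with one node down is the usual Raft majority rule. A minimal sketch of that arithmetic, assuming a fixed cluster size (the `Quorum` class is a hypothetical name for illustration, not part of Copycat):

```java
// Hypothetical helper illustrating majority arithmetic for a fixed-size cluster.
final class Quorum {
    // Smallest number of nodes that forms a majority.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Number of nodes that can be down while a majority is still available.
    static int tolerableFailures(int clusterSize) {
        return clusterSize - majority(clusterSize);
    }

    // True when enough nodes are alive to elect a leader and commit entries.
    static boolean canMakeProgress(int clusterSize, int liveNodes) {
        return liveNodes >= majority(clusterSize);
    }
}
```

With 3 nodes the majority is 2, so one node can be killed; killing a second leaves the cluster unable to commit until a node comes back.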
Jordan Halterman
@kuujo
Feb 27 2017 08:59
@vishwass also, if you want to see the replication of commands you have to enable DEBUG logging here
Jordan Halterman
@kuujo
Feb 27 2017 09:44

@jhall11 Copycat 1.2.2 and Atomix 1.0.2 have been released! The Copycat changelog lists all the bug fixes, the most important being the fix for the CRITICAL segment file header bug.

I improved the fuzz test a bit over the weekend, and the cluster has been doing really well with these fixes. I’ve only really run into one bug that’s not fixed, and that’s the one that can’t realistically be fixed until Copycat is moved to the new log. But that bug shouldn’t affect any normal operation. In general, I think we’ve made lots of huge improvements over the last week. More will likely have to come from future bug reports and Jepsen tests since my fuzz test seems to be losing its effectiveness now.

If we have an opportunity and any other bugs come up, I’m of course happy to track them down and fix them. But barring that, I’ll be back on ONOS for the next couple weeks.
Jordan Halterman
@kuujo
Feb 27 2017 10:57
And BTW as I mentioned above, I added linearizable/sequential consistency checks to the fuzz test and those haven't failed yet. Correctness is always my number one concern.
William Zhang
@zedware
Feb 27 2017 11:37
I tried lein test, but ran into errors like:
$ lein test
Could not find artifact io.atomix.catalyst:catalyst-local:jar:1.0.0-SNAPSHOT in clojars (https://clojars.org/repo/)
Could not find artifact io.atomix.catalyst:catalyst-local:jar:1.0.0-SNAPSHOT in sonatype-nexus-snapshots (https://oss.sonatype.org/content/repositories/snapshots)
Could not find artifact io.atomix.copycat:copycat-client:jar:1.0.0-SNAPSHOT in clojars (https://clojars.org/repo/)
Could not find artifact io.atomix.copycat:copycat-client:jar:1.0.0-SNAPSHOT in sonatype-nexus-snapshots (https://oss.sonatype.org/content/repositories/snapshots)
Could not find artifact io.atomix.copycat:copycat-server:jar:1.0.0-SNAPSHOT in clojars (https://clojars.org/repo/)
Could not find artifact io.atomix.copycat:copycat-server:jar:1.0.0-SNAPSHOT in sonatype-nexus-snapshots (https://oss.sonatype.org/content/repositories/snapshots)
Could not find artifact io.atomix.catalyst:catalyst-netty:jar:1.0.0-SNAPSHOT in clojars (https://clojars.org/repo/)
Could not find artifact io.atomix.catalyst:catalyst-netty:jar:1.0.0-SNAPSHOT in sonatype-nexus-snapshots (https://oss.sonatype.org/content/repositories/snapshots)
This could be due to a typo in :dependencies or network issues. If you are behind a proxy, try setting the 'http_proxy' environment variable.
vishwas
@vishwass
Feb 27 2017 12:23
@kuujo I'm unable to open the logs in Ubuntu. I opened one using nano and found junk data.
Jordan Halterman
@kuujo
Feb 27 2017 13:45
@vishwass the logs are binary. You can't see what's in them unless we write a log reader of some sort. You can only see what Copycat says it's doing
Jordan Halterman
@kuujo
Feb 27 2017 13:52
Or you can write one yourself :-) Just create a Storage object, call openLog("copycat") on it, and read your entries
vishwas
@vishwass
Feb 27 2017 14:02
OK, got it. Thanks.
Jordan Halterman
@kuujo
Feb 27 2017 15:58
As for the Jepsen tests, master is currently broken. We've generally been working from atomix/atomix-jepsen#1 but there's a bug in Trinity that we need to fix but I personally suck at Clojure :-P
Jordan Halterman
@kuujo
Feb 27 2017 18:37
@zedware ^^
Jon Hall
@jhall11
Feb 27 2017 18:44

@kuujo, thanks for all the work this weekend! I started the fuzz test last night and it ended with

 23:15:09.532 [copycat-server-localhost/127.0.0.1:5003-copycat] INFO  i.a.copycat.server.CopycatServer - Server started successfully!
11 is less than last linearizable index 30690

How would we go about debugging this?

Jordan Halterman
@kuujo
Feb 27 2017 18:45
I set it up to write a debug log: target/fuzz-test/test.log or something like that
have it?
the fact that that index is so low in comparison to the last read index makes me wonder about split brain
hmm
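For readers following along, the failure message above suggests a check that linearizable reads never observe an index lower than one already seen. A minimal sketch of such a monotonicity check, assuming observations are fed to it one at a time in completion order (the class name is hypothetical and this is not the fuzz test's actual code):

```java
// Hypothetical checker: fails when an observed index moves backwards.
final class LinearizableIndexChecker {
    private long lastIndex;

    // Records an index seen by a linearizable operation.
    // Throws if it is lower than an index that was already observed.
    synchronized void observe(long index) {
        if (index < lastIndex) {
            throw new IllegalStateException(
                index + " is less than last linearizable index " + lastIndex);
        }
        lastIndex = index;
    }
}
```

A check like this only catches indexes going backwards. It says nothing about whether the values read were correct, so it can accept histories that a real model checker would reject.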
Jon Hall
@jhall11
Feb 27 2017 18:50
under copycat/target/fuzz-logs I’m seeing folders like:
 ls 2130712288
copycat-1-1.log                copycat-1-20170226231506.snapshot    copycat.meta
Jordan Halterman
@kuujo
Feb 27 2017 18:51
ls target/fuzz-logs
2130712285 2130712286 2130712287 2130712288 2130712289 2130712290 test.log
it seems to delete that test.log file for some reason
Jon Hall
@jhall11
Feb 27 2017 18:51
I’m not seeing test.log
Jordan Halterman
@kuujo
Feb 27 2017 18:51
sometimes, but sometimes not
Jon Hall
@jhall11
Feb 27 2017 18:52
hmm
Jordan Halterman
@kuujo
Feb 27 2017 18:57
Well I’ll just have to keep it running and should run into it eventually… haha. I think the correct workflow would be to use the fuzz test to produce a bug, then dig through the logs and reproduce it in Jepsen. But there’s a lot of work to be done in the Jepsen tests to be able to do that.
Jon Hall
@jhall11
Feb 27 2017 18:58
yep :)
Jordan Halterman
@kuujo
Feb 27 2017 19:09
That thing is a goldmine for minor bugs though once you actually dig into the logs
Jon Hall
@jhall11
Feb 27 2017 19:24
Jepsen or Fuzz?
or both ;)
Jordan Halterman
@kuujo
Feb 27 2017 19:48
The fuzz test is great for producing issues if it's run a lot. Jepsen is a lot better for simulating real crashes and partitions that this test could miss, and its model checkers are a lot better. Those consistency checks in the fuzz test could actually allow a lot of invalid histories. Jepsen can also be a useful part of a development workflow, which the fuzz test can't since it doesn't really do assertions, it just tries to break something
But I do have a bunch of issues to dig into from the fuzz test, so it's awesome. I think the right thing to do is hack through any issues that can be produced with the fuzz test and then move onto Jepsen when it seems stable
Jordan Halterman
@kuujo
Feb 27 2017 19:56

Some of the issues I've seen in the fuzz test:

  • Server bootstraps and joins fail at some points (which they probably shouldn't)
  • Servers join but then can't elect a leader (that they were able to join indicates there was a leader)
  • Servers somehow hang in the inactive state
  • Clients are way too busy when a majority of the cluster is down (back off is done per command, which is good during normal operation but bad when there's no leader)

A lot of these relate to configuration changes, which is not surprising. An evolving cluster creates major challenges with respect to tracking those changes with failures. The fuzz test puts huge strain on the cluster in this regard.
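On the last bullet, one way to picture the alternative to per-command back off is a single back off shared by all of a client's commands, so repeated failures while there is no leader slow the whole client down. A rough sketch under that assumption (`SharedBackoff` and its parameters are hypothetical, not Copycat's client code):

```java
// Hypothetical back off shared across all commands of one client.
final class SharedBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private int failures;

    SharedBackoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    // Called whenever any command fails to reach a leader.
    // Returns how long to wait before the next attempt, doubling up to the cap.
    synchronized long nextDelayMillis() {
        long delay = baseMillis;
        for (int i = 0; i < failures && delay < maxMillis; i++) {
            // Guard against overflow by jumping straight to the cap.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        if (failures < Integer.MAX_VALUE) {
            failures++;
        }
        return Math.min(delay, maxMillis);
    }

    // Called when any command succeeds, meaning a leader is reachable again.
    synchronized void reset() {
        failures = 0;
    }
}
```

Because the failure count is shared, a hundred queued commands back off together instead of each starting its own retry schedule from the base delay.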

Jordan Halterman
@kuujo
Feb 27 2017 20:07
It's likely the consistency check failed because of a configuration change as well.
Jordan Halterman
@kuujo
Feb 27 2017 20:26
I also suspect my fuzz test is a little imperfect still :-P
Jordan Halterman
@kuujo
Feb 27 2017 20:32
hmm well knocked one of those issues out and it’s fine
Jordan Halterman
@kuujo
Feb 27 2017 20:48
eliminated another one
just seems to be a bug in my code :-)