These are chat archives for atomix/atomix

14th
Dec 2015
Richard Pijnenburg
@electrical
Dec 14 2015 13:11
@kuujo I replied to the testing issue, but it still hangs :-(
Richard Pijnenburg
@electrical
Dec 14 2015 15:01
Could it be an issue with the JDK version I'm running? jdk1.8.0_66
Jordan Halterman
@kuujo
Dec 14 2015 18:42
@electrical sorry, I'm around now. I don't think it's a JDK issue. I'm actually seeing the same thing, but I can only get test failures to happen when I run multiple tests. If I try to reproduce a failure by running just the failed test on its own, it passes indefinitely, so it seems like a test cleanup issue to me.
If I run an arbitrary group of tests, every now and then a few will fail, but never the first one
But the OOM could be a hint
I did some profiling of tests yesterday and YourKit showed some pretty constant memory usage.
Copycat is having the same issue, BTW. If I run the ClusterTests (which use basically the same pattern as the Atomix tests), I see it there too
Because configuration changes in Raft are easily the most fragile part of the algorithm, I opted to have all of the tests start and stop clusters before and after every test to exercise that code. I haven't been able to find any configuration-related bugs in the failing tests; I've only seen random failures and logs that don't make much sense (logs from different tests intermixed with each other)
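To illustrate the per-test lifecycle described above, here is a minimal sketch of that start/stop-per-test pattern in TestNG (assumed here; the Server stub and cluster size are illustrative stand-ins, not the actual Atomix/Copycat test harness):

```java
import java.util.ArrayList;
import java.util.List;

import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;

public class ClusterLifecycleTest {

  /** Hypothetical stand-in for a Raft server instance. */
  static class Server {
    void start() { /* bind sockets, join the cluster */ }
    void stop()  { /* leave the cluster, close sockets, shut down threads */ }
  }

  private final List<Server> servers = new ArrayList<>();

  @BeforeMethod
  public void startCluster() {
    // Start a fresh three-node cluster for every test so Raft membership
    // changes are exercised on each setup/teardown cycle.
    for (int i = 0; i < 3; i++) {
      Server server = new Server();
      server.start();
      servers.add(server);
    }
  }

  @AfterMethod(alwaysRun = true)
  public void stopCluster() {
    // Anything left running here bleeds into the next test, which would
    // explain the intermixed logs described above.
    servers.forEach(Server::stop);
    servers.clear();
  }

  @Test
  public void testClusterOperations() {
    // ... exercise the cluster ...
  }
}
```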
Jordan Halterman
@kuujo
Dec 14 2015 19:08
Hmm, I may have actually just reproduced the issue with logs clear enough to figure it out
Jordan Halterman
@kuujo
Dec 14 2015 19:39
So, it seems something is causing unnecessary leader changes. The test failure I reproduced happened when the test timed out because the cluster was unnecessarily converging on a new leader. The test didn't impose any failures, but something led one of the servers to time out and start a new election. More accurately, something caused the leader not to send an AppendRequest to one of the servers within the election timeout. This doesn't break consistency guarantees, since leader changes can happen safely as often as they like, but it does slow the cluster down because the client has to find the new leader and reestablish its connection. The test timed out while the client was still retrying its request.
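For context, a rough sketch of the timing relationship at play: each AppendRequest from the leader resets a follower's election timer, and if none arrives within the (randomized) election timeout, the follower starts an election. All names and intervals below are illustrative assumptions, not Copycat's actual code:

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class FollowerElectionTimer {
  // Illustrative value: the leader's heartbeat interval must be comfortably
  // shorter than this, or followers will call spurious elections.
  private static final long ELECTION_TIMEOUT_MS = 500;

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final Random random = new Random();
  private ScheduledFuture<?> electionTimer;

  /** Called whenever an AppendRequest arrives from the leader. */
  public synchronized void onAppendRequest() {
    if (electionTimer != null) {
      electionTimer.cancel(false);
    }
    // Randomize the timeout so followers don't all start elections at once.
    long timeout = ELECTION_TIMEOUT_MS + random.nextInt((int) ELECTION_TIMEOUT_MS);
    electionTimer = scheduler.schedule(this::startElection, timeout, TimeUnit.MILLISECONDS);
  }

  private void startElection() {
    // Increment the term, vote for self, request votes from peers. A leader
    // that is merely late with one heartbeat triggers this too; that is safe
    // for consistency but forces clients to rediscover the leader.
  }
}
```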
Jordan Halterman
@kuujo
Dec 14 2015 23:57
@electrical I think we tracked down the issue - it's related to cleaning up resources between tests. I’ll submit a PR in a bit, after a bunch more test runs to verify it’s passing consistently.
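The PR itself isn't shown in this log, but the general shape of the problem, an asynchronous shutdown that isn't awaited during teardown, can be sketched as follows. The AsyncResource type and timeout are hypothetical stand-ins, not the actual fix:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public final class Teardown {

  /** Hypothetical stand-in for a server, client, or other closeable resource. */
  interface AsyncResource {
    CompletableFuture<Void> close();
  }

  /**
   * Blocks until every resource has actually shut down. A fire-and-forget
   * close() can leave background threads alive that interfere with the next
   * test, producing exactly the kind of intermittent failures described above.
   */
  static void closeAll(List<AsyncResource> resources) throws Exception {
    CompletableFuture
        .allOf(resources.stream()
            .map(AsyncResource::close)
            .toArray(CompletableFuture[]::new))
        .get(30, TimeUnit.SECONDS);
  }
}
```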