These are chat archives for atomix/atomix

2nd Jan 2016
Jordan Halterman
@kuujo
Jan 02 2016 02:41
these logs are actually really helpful
it seems like the client may not be getting the right error back from the cluster
when the client says State changed: SUSPENDED, this is actually the correct result since it couldn’t commit a keep-alive to the cluster (a leader change occurred), but it should have detected that its session had expired. Gotta try to reproduce this
all the unknown session logs on servers are a result of the client not knowing its session was expired
Jordan Halterman
@kuujo
Jan 02 2016 03:45
Haven’t been able to reproduce that one yet, but I am getting some really good testing done as a result of this. Working on some more PRs
Jordan Halterman
@kuujo
Jan 02 2016 04:14
So, I set up a test where the leader returns an UnknownSessionException for keep-alives on every even session number, and clients properly detect and recover the session in Copycat. This could be an issue with the Atomix RecoveryStrategy though. This is good
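(For illustration, a minimal sketch of that fault-injection idea — the class and method names below are placeholders, not the actual Copycat test harness:)

```java
// Illustrative sketch only, not the actual Copycat test: the idea is to make
// the leader treat keep-alives for every even-numbered session as unknown so
// the client is forced to detect the "expired" session and recover it.
public class EvenSessionFaultInjector {

  /** Returns true if the keep-alive for this session should be rejected. */
  public boolean rejectKeepAlive(long sessionId) {
    return sessionId % 2 == 0;
  }

  public static void main(String[] args) {
    EvenSessionFaultInjector injector = new EvenSessionFaultInjector();
    for (long session = 1; session <= 4; session++) {
      System.out.println("session " + session + " rejected: " + injector.rejectKeepAlive(session));
    }
  }
}
```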
bah it is an issue with the Atomix RecoveryStrategy and it’s an easy fix :-) Thanks @electrical!
I’ll put a lot of testing on this tonight
Jordan Halterman
@kuujo
Jan 02 2016 06:03
This actually helped me find a small design flaw in how Atomix resources are managed and recovered. I’ll submit a PR for that in an hour or two and that should be an awesome improvement.
Richard Pijnenburg
@electrical
Jan 02 2016 08:09
@kuujo glad to hear it's useful and you found some things :)
Jordan Halterman
@kuujo
Jan 02 2016 08:09
just writing some tests and ready for testing again in a bit
Richard Pijnenburg
@electrical
Jan 02 2016 08:30
No
Nice
Uhg. Typing on mobile sux
Jordan Halterman
@kuujo
Jan 02 2016 08:43
haha I know
I use my phone way too much and it always pisses me off
Richard Pijnenburg
@electrical
Jan 02 2016 10:06
Let me know when i can test things again :-)
Jordan Halterman
@kuujo
Jan 02 2016 10:06
sure just running the tests a couple more times probably
Richard Pijnenburg
@electrical
Jan 02 2016 10:07
okay cool :-)
Jordan Halterman
@kuujo
Jan 02 2016 10:24
There’s the PR… gonna let the tests run
ugh
Jordan Halterman
@kuujo
Jan 02 2016 10:30
test configuration broke it, let me try that again
Richard Pijnenburg
@electrical
Jan 02 2016 10:37
that's quite a big change
Jordan Halterman
@kuujo
Jan 02 2016 10:37
yeah… a lot of it is just structural
added an open() method to resources
that requires an update to all the tests
Richard Pijnenburg
@electrical
Jan 02 2016 10:38
ah i see. okay
Jordan Halterman
@kuujo
Jan 02 2016 10:43
Basically, what happens in Copycat is the client will register a session and try to keep it alive. If the client stops communicating with the cluster for long enough, the cluster will expire the client’s session. Once the client reconnects, it has to open a new session. But the problem is a lot of Atomix resources rely on the client’s session. So, in order for Atomix resources to work across sessions, Atomix has to detect the new session and transparently recreate all the opened resources. I just moved the code around a bit to make it do so in a much more elegant way. The CopycatClient interface now exposes States which indicate when the client loses its session, and so Atomix resources use their own CopycatClient implementation (InstanceClient) which monitors the real CopycatClient for state changes and creates a new logical session if the underlying client’s session changes.
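(For illustration, a rough sketch of that recovery pattern — the names here, like SessionAwareClient and ResourceHolder, are assumptions for the sketch, not the actual CopycatClient/InstanceClient API:)

```java
// Rough sketch of the recovery pattern described above: monitor the client's
// state and transparently re-open resources when a new session replaces an
// expired one. Names are illustrative, not the real Copycat/Atomix classes.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class SessionAwareClient {

  /** Simplified client states, mirroring the idea of CopycatClient states. */
  public enum State { CONNECTED, SUSPENDED, CLOSED }

  /** A resource that can transparently re-create itself on a new session. */
  public interface ResourceHolder {
    void recover();
  }

  private final List<ResourceHolder> resources = new CopyOnWriteArrayList<>();
  private final List<Consumer<State>> listeners = new CopyOnWriteArrayList<>();
  private volatile State state = State.CONNECTED;

  public void onStateChange(Consumer<State> listener) {
    listeners.add(listener);
  }

  public void register(ResourceHolder resource) {
    resources.add(resource);
  }

  /** Called when the underlying session changes state. */
  public void changeState(State newState) {
    State previous = this.state;
    this.state = newState;
    listeners.forEach(l -> l.accept(newState));
    // If the client recovered a new session after being suspended,
    // transparently re-create every opened resource on the new session.
    if (previous == State.SUSPENDED && newState == State.CONNECTED) {
      resources.forEach(ResourceHolder::recover);
    }
  }
}
```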
I also just noticed after that other fix to Copycat that those tests are passing now
we’ll see how Atomix does, then I’ll run some local tests and then on EC2 if all goes well
Richard Pijnenburg
@electrical
Jan 02 2016 11:03
Nice! Looking forward to running my small test again
Hmm, seems Travis is stalled
it stays at compacting the log
Richard Pijnenburg
@electrical
Jan 02 2016 11:08
it does sometimes happen for some reason. not sure why
Btw, I still wonder why even with a local setup the cluster will go unstable at times.
Richard Pijnenburg
@electrical
Jan 02 2016 11:18
@kuujo think you'll need to re-trigger the test run
Richard Pijnenburg
@electrical
Jan 02 2016 11:38
or just merge it. lol :-)
Jordan Halterman
@kuujo
Jan 02 2016 11:39
Tests pass… I am not sure what causes that issue on Travis but not much I can do about it ATM other than click the button until it passes so might as well just merge it :-P
Richard Pijnenburg
@electrical
Jan 02 2016 11:39
haha true
I have the same issue sometimes when running tests on my dev system. they stall with the same thing
like teardown of nodes is not happening correctly
Jordan Halterman
@kuujo
Jan 02 2016 11:44
yeah that’s what it is
a deadlock somewhere in there when a test is shutting down
Richard Pijnenburg
@electrical
Jan 02 2016 11:47
okay
Richard Pijnenburg
@electrical
Jan 02 2016 12:37
@kuujo any advice on YAML parsing libraries? want to use it for configuring the cluster stuff :-)
Also wondering what to do with logging. want to be able to control the logging output per part, for example have Atomix logging on normal but other parts on debug if needed
Jordan Halterman
@kuujo
Jan 02 2016 13:43
Hmm... I haven't done YAML in Java so not sure about that one, but Atomix does need some configuration :-) maybe we can turn it into another module.
Logging can be controlled in a logging configuration file. Atomix uses slf4j which is basically an abstraction that allows you to use whatever JVM logging framework you want. Typically, logging frameworks will let you specify the classes or packages that can print messages. e.g. I want logs only from io.atomix.copycat.client or whatever.
The Atomix tests and examples use logback as the framework, and the logback.xml file controls the logging. Other options are standard Java logging and log4j
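(For illustration, a minimal logback.xml along those lines — the logger names and levels here are just an example, not the exact configuration Atomix ships with:)

```xml
<!-- Minimal illustrative logback.xml: per-package log levels. This is a
     sketch, not the configuration shipped with Atomix. -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Quiet the transport layer, keep client logs at DEBUG. -->
  <logger name="io.atomix.catalyst" level="WARN"/>
  <logger name="io.atomix.copycat.client" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```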
Richard Pijnenburg
@electrical
Jan 02 2016 13:47
Ahh okay. cool.
Jordan Halterman
@kuujo
Jan 02 2016 13:47
I think the Copycat logback configurations disable io.atomix.catalyst logging for example, or they set it to WARN or something
If too much logging is done under one namespace or level we can probably move some of that stuff around too. The code can basically choose the logger name, and using the class that's logging it is just a convention in Java
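(For example, standard slf4j usage — the logger name is just whatever you pass in; naming it after the class is only the convention:)

```java
// Standard slf4j usage: the logger name is whatever the code chooses, and the
// logging configuration can then target that namespace. Using the class is
// only a convention.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggerNamingExample {
  // Conventional: logger named after the class.
  private static final Logger classLogger = LoggerFactory.getLogger(LoggerNamingExample.class);

  // Alternative: an explicit namespace chosen by the code.
  private static final Logger namedLogger = LoggerFactory.getLogger("io.atomix.copycat.client");

  public static void main(String[] args) {
    classLogger.debug("logged under the class name");
    namedLogger.debug("logged under an explicit namespace");
  }
}
```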
I'll start testing it some more in a bit. I turned on "Making a Murderer" and it is slowing down my productivity :-P
Richard Pijnenburg
@electrical
Jan 02 2016 13:50
haha lol okay
lol, finally finished building the new jar and I see more new commits
Jordan Halterman
@kuujo
Jan 02 2016 13:56
You can run mvn package -DskipTests to build it quickly
Richard Pijnenburg
@electrical
Jan 02 2016 13:57
ah okay, will remember that for next time.
is it possible to have a single leader running at all? in case people want to try out logstash on a single machine without actual clustering
but want to still use the same code paths
Richard Pijnenburg
@electrical
Jan 02 2016 14:04
@kuujo running my test now
Seeing an 'unknown session: 9' error
also 15:05:32.560 [copycat-client-3] DEBUG i.a.c.client.util.ClientConnection - Failed to connect to the cluster
And also still a lot of i.a.c.server.state.LeaderAppender - localhost/127.0.0.1:5000 - request timed out
on the bright side the client part re-connected nicely
it does re-connect quite a bit though. not sure if that's good or not
Richard Pijnenburg
@electrical
Jan 02 2016 14:15
hmm 2 unknown session numbers
Richard Pijnenburg
@electrical
Jan 02 2016 14:27
hmm sometimes it switches leaders very quickly, one after another.
term 7 in like 2-3 minutes
ah. each client connection has its own session number of course
term 13 now
I'll see if I can gist the logs, but so far it's working much better than before.
the client did have some connection issues several times
or at least the Failed to connect to the cluster
Richard Pijnenburg
@electrical
Jan 02 2016 14:33
i.a.c.client.util.ClientConnection - Failed to connect to the cluster
i see that quite a bit in the logs
Jordan Halterman
@kuujo
Jan 02 2016 15:58
@electrical nice seeing improvements. The failures to connect I'm assuming are probably happening at the same time as leader changes. The log message is actually a little inaccurate. The issue is not that the client can't connect to the cluster - it can - it's that the client can't connect and commit a keep-alive request because there's no leader to talk to. That's expected during a leader change. I saw the leader change in the other log. It looked like what happened was the leader detected a network partition and stepped down. The circumstances under which the leader detected the partition were correct. That is, the leader was correct in stepping down. But the problem seemed to be that responses were not arriving quickly enough. I'll see if that was the same case here.
Jordan Halterman
@kuujo
Jan 02 2016 16:12
Basically, the logic for detecting a network partition exists to prevent clients from communicating with a leader that can't talk to a majority of the cluster when there may be another leader that can talk to a majority of the cluster. If a client knows about leader A, which is partitioned from the rest of the cluster, and the cluster has already elected leader B, the client may never disconnect from leader A and reconnect to leader B.
Jordan Halterman
@kuujo
Jan 02 2016 16:19
However, this is actually no longer true in Copycat. Clients will always disconnect from a leader and try to find another leader if a keep-alive fails, so the logic for detecting partitions isn't even totally necessary from a client's perspective. Leaders can return their term in keep-alive responses to ensure clients can determine which is the most up-to-date leader. But this amounts to fixing a symptom rather than the cause. In reality, leader changes should not happen, and there are still some scenarios where not detecting a partition and stepping down could be detrimental, for instance in unidirectional partitions. If a leader can send an AppendRequest to a follower, but the follower can't send an AppendResponse, the follower will never start a new election, and the leader will never be able to commit any entries, thus preventing state from progressing altogether during the partition.
I wonder if it could just be something about the Netty configuration that's contributing to it.
Richard Pijnenburg
@electrical
Jan 02 2016 16:54
It's hard to be sure if it happens at leader changes since I see it happening quite a bit, more often than the number of term increments.
But since everything is running on the same machine over localhost I wonder how a partition can be caused. It shouldn't be possible, right?
Even if it were over a local network it should be stable, because leader changes will have a big impact on the availability of resources like the queue and locks
And I also want to build in that if a node loses its connection to the cluster it's not able to process data.
Jordan Halterman
@kuujo
Jan 02 2016 18:15
No, a real partition should be impossible. It actually doesn't look like the leader is detecting a partition. Something else is going on with the transport. A leader just stops sending AppendRequests for long enough for a follower to try to get elected
Richard Pijnenburg
@electrical
Jan 02 2016 18:16
Hmm okay.
Jordan Halterman
@kuujo
Jan 02 2016 18:16
hmm actually I think I found the cause
Richard Pijnenburg
@electrical
Jan 02 2016 18:17
Ohw? Do tell
Jordan Halterman
@kuujo
Jan 02 2016 18:18
hmm
Jordan Halterman
@kuujo
Jan 02 2016 18:24
false alarm… I need to try to reproduce this, but it seems like it may just be a bug in the math that determines when to send AppendRequest to followers
Richard Pijnenburg
@electrical
Jan 02 2016 18:25
Okay
If there is anything I can do let me know
Jordan Halterman
@kuujo
Jan 02 2016 18:25
have to see if I can reproduce it
Richard Pijnenburg
@electrical
Jan 02 2016 18:26
Okay
My test machine is a 2 vcore ubuntu vm on kvm in case it matters.
Jordan Halterman
@kuujo
Jan 02 2016 18:28
I gotta try something different
my machine lies to me
just chugs along like nothing
hmm
@electrical have you pulled and installed catalyst too? I fixed some issues there a few days ago
Jordan Halterman
@kuujo
Jan 02 2016 18:39
Hmm I did just find a bug in the leader, but it’s related to handling a loss of quorum. I guess I’ll just keep trying to force failures until it’s cleaned up :-)
Jordan Halterman
@kuujo
Jan 02 2016 19:19
hmmm
hmmmmmmmmm
I may have found the culprit
just gotta take some time to follow the complicated codes
at least one of the culprits that is
Jordan Halterman
@kuujo
Jan 02 2016 19:32
this will definitely have fixed some issues for sure
Richard Pijnenburg
@electrical
Jan 02 2016 22:02
@kuujo re catalyst. Yeah I clean up the .m2 dir every time.
Jordan Halterman
@kuujo
Jan 02 2016 22:14
Yeah there were definitely a few bugs introduced into LeaderAppender when snapshotting was added a while back. @electrical I should have a version with all the bugs I can reproduce fixed by your morning. I haven't been able to reproduce the leader changes you're seeing, but I did find and fix some bugs that were related to at least some of those leader changes. I think a day of cleaning up and testing that specific logic should get the leader back in working order as it was working great before. Jepsen tests have only ever caused leader changes during network partitions. This should put it back in that state and ready for RC.
I would say you can test that PR but I haven't actually tested it yet... Still checking the codes some more
Jordan Halterman
@kuujo
Jan 02 2016 23:05
Actually, it seems like I may just be on a roll. I think I may have just found and fixed a bug I’ve been searching for for a loooong time… the bug that’s the cause of many of the intermittent Copycat/Atomix test failures. This is a good day.
I’ll explain all the bug fixes in PRs
This and the leader bug are the two most critical. Hopefully I can fix the latter without having to reproduce it :-( we’ll see