These are chat archives for atomix/atomix

8th Feb 2017
Jon Hall
@jhall11
Feb 08 2017 00:09
gonna try to reproduce it with a debug port enabled so I can attach a debugger and get some more logging
Jordan Halterman
@kuujo
Feb 08 2017 03:40
I should have some time later tonight to set up a cluster and break it a bit
It's odd because it seems like the only way that future is completed exceptionally is if the log is closed. I don't see where it's otherwise completed exceptionally unless I'm missing something
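For reference, the pattern I'm talking about is roughly this (just a sketch with made-up names, not the actual Copycat code):

import java.util.concurrent.CompletableFuture;

// Sketch only: the "future is completed exceptionally when the log is closed"
// pattern. Names here are made up for illustration.
class CommitWaiterSketch {
  private final CompletableFuture<Long> commitFuture = new CompletableFuture<>();

  // Callers wait on this future for their entry to be committed.
  CompletableFuture<Long> future() {
    return commitFuture;
  }

  // As far as I can tell, closing the log is the only place the future is
  // failed rather than completed normally.
  void close() {
    commitFuture.completeExceptionally(new IllegalStateException("log closed"));
  }
}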
Have to squeeze this in at night ATM, but I'm taking off a few days after tomorrow and may be able to help work out these issues a bit more
Jordan Halterman
@kuujo
Feb 08 2017 04:41
I’m doing a bit of testing and not having any luck getting the cluster to misbehave myself. I can’t even get a session to expire by killing a majority of the cluster, which is always good :-P But I’m going to take a look at the logs more closely and trace back the state to perhaps figure out what’s happening.
Jordan Halterman
@kuujo
Feb 08 2017 05:44
hmmm
So, I picked a random RegisterEntry to poke around at: index=60357. I see it stored on three servers in partition 1, but I only see it applied on the two followers and not the leader. Curious.
so, it seems like this block is reached but the registration is ultimately committed
so, it’s really a partial commit, which is fine, but not fine when it’s a repeated occurrence
hmm that doesn’t seem right either
Jordan Halterman
@kuujo
Feb 08 2017 05:53
that certainly is happening
leaders append RegisterEntry but the RegisterRequest is failed before the entry is applied
followers eventually commit and apply RegisterEntry
hmmm
Jordan Halterman
@kuujo
Feb 08 2017 06:02
I also don’t see them sending back RegisterResponse at all
but the commitIndex is certainly increasing and entries are being committed
Jordan Halterman
@kuujo
Feb 08 2017 06:08
I see… so the RegisterResponses with status=ERROR that are being sent back to the client are always sent from followers that proxied the request to the leader. They’re just RegisterRequests that failed because of a timeout during the hop from the follower to the leader. That’s just a symptom, not the cause. The real issue is that the leader never applies the RegisterEntry even though it’s successfully committed. So that’s what needs to be tracked down. (rough sketch of the follower-side flow below)
the real question is whether this line is ever reached: if it is, why isn’t the entry successfully applied, and if it isn’t, why not?
other entries are being successfully committed and applied by the same partitions
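roughly the follower-side flow I mean, as a sketch (the names and the orTimeout shortcut are assumptions, not the real Copycat API):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch only: a follower proxying a register request to the leader and
// failing it with status=ERROR when the hop times out.
class FollowerProxySketch {
  static final long REQUEST_TIMEOUT_MILLIS = 5000;

  // Stand-in for forwarding the request to the current leader.
  static CompletableFuture<String> forwardToLeader(String registerRequest) {
    return new CompletableFuture<>(); // never completes in this sketch
  }

  static CompletableFuture<String> register(String registerRequest) {
    CompletableFuture<String> future = new CompletableFuture<>();
    forwardToLeader(registerRequest)
        .orTimeout(REQUEST_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)
        .whenComplete((response, error) -> {
          if (error == null) {
            future.complete(response);
          } else {
            // The client just sees an ERROR response, even though the leader
            // may already have appended, and will later commit, the RegisterEntry.
            future.complete("RegisterResponse{status=ERROR}");
          }
        });
    return future;
  }
}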
Jordan Halterman
@kuujo
Feb 08 2017 06:39
so, this is interesting because the commitIndex is definitely being increased correctly here, but either the related future is never completed or the future callbacks are not applying entries
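to make that concrete, the mechanism I'm picturing looks roughly like this (a sketch with made-up names, not the real appender): futures keyed by index, completed when the commitIndex passes them; if that completion were ever skipped, the whenComplete callback would never run

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Sketch only: pending commit futures keyed by log index, completed as the
// commitIndex advances.
class CommitFuturesSketch {
  private final NavigableMap<Long, CompletableFuture<Long>> commitFutures = new TreeMap<>();

  // The leader waits on this future for the entry at `index` to be committed.
  CompletableFuture<Long> appendEntries(long index) {
    return commitFutures.computeIfAbsent(index, i -> new CompletableFuture<>());
  }

  // Called when the commitIndex is advanced; completes every waiting future
  // whose index is now covered, then drops them.
  void commit(long commitIndex) {
    Map<Long, CompletableFuture<Long>> committed = commitFutures.headMap(commitIndex, true);
    committed.values().forEach(future -> future.complete(commitIndex));
    committed.clear();
  }
}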
Jordan Halterman
@kuujo
Feb 08 2017 07:31

@jhall11 that's as far as I could get without reproducing it myself, which I tried to no avail. The logs are pretty clear that the problem is that RegisterEntry is committed but never applied. But why it's not applied is not obvious. The commitIndex is being properly set by the leader, meaning replication is working and followers are committing the entry. But for some reason the leader's whenComplete callback is never called, so it never applies the entry and never responds to the client.

It doesn't seem like isOpen returns false, which would cause the leader to skip the entry; the leader is still operating normally. There's no obvious reason the appendEntries(long) future would not be completed, and no obvious reason context.getStateMachine().apply(entry) would not apply the entry. I'm totally stumped right now. But maybe a nap will help.
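For anyone following along, the leader-side flow I'm describing is roughly this (a simplified sketch, not a copy of LeaderState; only appendEntries(long), isOpen, and context.getStateMachine().apply(entry) are names from the discussion, everything else is a placeholder):

import java.util.concurrent.CompletableFuture;

// Simplified sketch of the leader handling a register: append the entry,
// wait for it to be committed, then apply it and respond to the client.
abstract class LeaderRegisterSketch {
  abstract long appendToLog(Object registerEntry);             // placeholder
  abstract CompletableFuture<Long> appendEntries(long index);  // completes on commit
  abstract boolean isOpen();
  abstract CompletableFuture<Object> apply(Object entry);      // stands in for context.getStateMachine().apply(entry)

  CompletableFuture<Object> register(Object registerEntry) {
    CompletableFuture<Object> future = new CompletableFuture<>();
    long index = appendToLog(registerEntry);
    appendEntries(index).whenComplete((commitIndex, error) -> {
      // This is the callback that never seems to fire for the bad RegisterEntry.
      if (isOpen()) {                  // doesn't appear to be returning false
        if (error == null) {
          apply(registerEntry).whenComplete((result, applyError) -> {
            if (applyError == null) {
              future.complete(result); // would become the RegisterResponse
            } else {
              future.completeExceptionally(applyError);
            }
          });
        } else {
          future.completeExceptionally(error);
        }
      }
    });
    return future;
  }
}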

Jordan Halterman
@kuujo
Feb 08 2017 18:17
but on a brighter note Docker seems to be fixed :-)
Jon Hall
@jhall11
Feb 08 2017 18:31
Small victories :)
Jordan Halterman
@kuujo
Feb 08 2017 18:31
haha indeed
Jordan Halterman
@kuujo
Feb 08 2017 18:45
I'm still trying to reproduce this myself. No such luck so far. I set up a test that randomly skews timestamps in the state machine to force sessions to be expired. They all recover really nicely. This seems to be some state that the leader gets in that's not easily reproducible for me. But I'm going to take another stab at getting Jepsen to run today so perhaps I can try to reproduce it there.
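the test looks roughly like this (a sketch; the clock-skew hook and the names are made up, not a real Atomix API):

import java.util.List;
import java.util.Random;

// Sketch only: randomly skew a server's state machine clock far enough past
// the session timeout to force expiration, then check that clients recover.
class SessionExpirationTestSketch {
  interface Server {
    void skewStateMachineClock(long skewMillis); // hypothetical test hook
  }

  static void run(List<Server> servers, Runnable checkClientsRecover, long sessionTimeoutMillis) {
    Random random = new Random();
    for (int i = 0; i < 100; i++) {
      Server server = servers.get(random.nextInt(servers.size()));
      // Jump the clock well past the session timeout.
      server.skewStateMachineClock(sessionTimeoutMillis * 2 + random.nextInt(10_000));
      checkClientsRecover.run();
    }
  }
}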
Jon Hall
@jhall11
Feb 08 2017 18:59
ok, I’ll check and see if my overnight tests discovered anything
It did take over 200 node failures to hit this bug, so it does seem like a difficult situation to hit
Jordan Halterman
@kuujo
Feb 08 2017 18:59
yeah
Jordan Halterman
@kuujo
Feb 08 2017 19:13
ugh
➜  atomix-jepsen git:(master) ✗ docker rm jepsen
Error response from daemon: Bad response from Docker engine
➜  atomix-jepsen git:(master) ✗ docker ps
Error response from daemon: Bad response from Docker engine
➜  atomix-jepsen git:(master) ✗ docker ps -a
Error response from daemon: Bad response from Docker engine
hahaha
can’t get a break
Jon Hall
@jhall11
Feb 08 2017 19:14
what version are you using?
Jordan Halterman
@kuujo
Feb 08 2017 19:14
1.13.0
Jon Hall
@jhall11
Feb 08 2017 19:14
I had to update everything because they were so out of date, but I don’t know that I’ve had any other issues
hmm, that’s the same as me
Jordan Halterman
@kuujo
Feb 08 2017 19:15
think it was just a random bug
bad Docker luck :-)
running now
Jon Hall
@jhall11
Feb 08 2017 19:16
good news, I reproduced the bug. bad news… I accidentally killed 2 of the 3 nodes
Jordan Halterman
@kuujo
Feb 08 2017 19:17
haha
almost got Jepsen going though
Jon Hall
@jhall11
Feb 08 2017 19:21
great
Jordan Halterman
@kuujo
Feb 08 2017 22:12
:+1:
Jon Hall
@jhall11
Feb 08 2017 23:12
got Jepsen running?
Jordan Halterman
@kuujo
Feb 08 2017 23:46
I got Jepsen hanging :-P
will have some actual time to put towards it tomorrow though
Jon Hall
@jhall11
Feb 08 2017 23:47
I’ve been going through some clojure tutorials in my spare time