These are chat archives for atomix/atomix

7th
Feb 2017
Jordan Halterman
@kuujo
Feb 07 2017 00:06
I’m not entirely confident that PR is complete yet, but it’s an improvement. Also, while it is certainly an issue that clients don’t proactively expire their sessions when a command/query fails, I’m a little concerned that this is just addressing a symptom rather than the cause, so I’ll dig through the logs a bit more to try to figure out why the session was expired in the first place.
atomix/copycat#277 that is
need to add some tests for it actually
Jon Hall
@jhall11
Feb 07 2017 00:13
great! I think we’ve made some great progress recently with respect to stability and tolerance of restarts
Jordan Halterman
@kuujo
Feb 07 2017 00:13
I know! It will only keep getting better, and hopefully fast
I don’t see anything particularly wrong with the client’s session having been expired. We can increase session timeouts to make expiration less likely, but that also increases the amount of time it takes to detect a failure and e.g. change a leader or release a lock, so it’s a balance between fast failover and the possibility of a client’s commands failing if a session is lost.
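To be concrete, the knob here is the client’s session timeout. Something along these lines, going from memory of the 1.x client builder (so treat withSessionTimeout and the connect signature as assumptions to verify against the current API):

```java
import io.atomix.catalyst.transport.Address;
import io.atomix.copycat.client.CopycatClient;

import java.time.Duration;
import java.util.Arrays;
import java.util.List;

public class SessionTimeoutSketch {
  public static void main(String[] args) {
    List<Address> cluster = Arrays.asList(
        new Address("10.0.0.1", 8700),
        new Address("10.0.0.2", 8700),
        new Address("10.0.0.3", 8700));

    // Longer timeout: fewer spurious session expirations, slower failure
    // detection (leader changes, lock releases). Shorter timeout: the opposite.
    CopycatClient client = CopycatClient.builder()
        .withSessionTimeout(Duration.ofSeconds(10)) // assumed builder option
        .build();

    client.connect(cluster).join();
  }
}
```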
I’ll test that PR out and write some tests, but what we should expect to see is really at most one UnknownSessionException and the next query should succeed
Jordan Halterman
@kuujo
Feb 07 2017 00:20
Love being able to find and fix these issues
Jon Hall
@jhall11
Feb 07 2017 00:23
That’s the benefit of open source: you get more eyes and different use cases, which can expose these bugs
Jordan Halterman
@kuujo
Feb 07 2017 00:29
Don't have to sell me on open source :-P (but I've been trying to sell my company on it for years)
Jon Hall
@jhall11
Feb 07 2017 00:35
:)
Jon Hall
@jhall11
Feb 07 2017 01:37
I’m comfortable with merging atomix/copycat#276 and I’ve not seen any side effects of atomix/copycat#277 yet, but it is probably too soon to say on that
Jordan Halterman
@kuujo
Feb 07 2017 01:52
K awesome
Jordan Halterman
@kuujo
Feb 07 2017 02:39
Merged, I will write tests for the other one tonight
Jordan Halterman
@kuujo
Feb 07 2017 08:14
I’m working on much better tests for the client and am working out some issues with it. Will update the PR
belowm
@belowm
Feb 07 2017 15:17

Hi, I'm playing around with a simple cluster scenario consisting of just 2 nodes. I run into the following issue: when one member of the cluster goes away, I see this log: localhost/127.0.0.1:8700 - Suspected network partition. Stepping down. In this state, the replica tries to reconnect but seems no longer usable. Any attempt to write to the replica results in

java.net.ConnectException: failed to connect
    at io.atomix.copycat.client.util.ClientConnection.sendRequest(ClientConnection.java:126) ~[atomix-all-1.0.0-rc9.jar:na]

Is this expected behavior? My expectation was that the replica would still accept writes and would synchronize after the connection is recovered.

Both nodes have a storage configured.
You will need at least 3 nodes to maintain a quorum through any node failure
Jordan Halterman
@kuujo
Feb 07 2017 18:44
@belowm that would be the behavior of an eventually consistent system. Copycat/Atomix are consensus-based. Accepting writes on the minority side of a partition (1 of 2 nodes is a minority) requires conflict resolution and allows split brain (well, usually). Copycat properly detects that it can’t contact a majority of the cluster (2 of 2) and stops allowing writes from clients. You should use an odd number of nodes for the greatest fault tolerance (majorities of 2/3, 3/5, 4/7, etc.). A majority of the cluster is required to service writes and reads.
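To put numbers on it, the majority is floor(n/2) + 1, so fault tolerance only improves when you add the odd node:

```java
public class QuorumMath {
  public static void main(String[] args) {
    for (int n = 1; n <= 7; n++) {
      int majority = n / 2 + 1;       // votes needed to commit a write
      int tolerated = n - majority;   // members that can fail while staying available
      System.out.printf("cluster=%d majority=%d tolerates=%d failure(s)%n",
          n, majority, tolerated);
    }
    // A 2-node cluster tolerates 0 failures, which is why the replica above
    // steps down and rejects writes as soon as its peer goes away.
  }
}
```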
Jon Hall
@jhall11
Feb 07 2017 20:47
so it looks like we ran into a bug when testing atomix/copycat#277. It seems that there are continual attempts to register clients
Jordan Halterman
@kuujo
Feb 07 2017 21:22
hmm
I could see that happening when it’s not possible to register a new session
actually nvm… let me think about this
Jordan Halterman
@kuujo
Feb 07 2017 21:34
hmm this may be an easy fix
just have to dig through the ONOS code a bit
hmm nope… damn
Jon Hall
@jhall11
Feb 07 2017 21:37
I’m seeing a lot of sequences like this in the logs:
16820 2017-02-07 09:58:16,450 | DEBUG | ycat-client-io-1 | ClientConnection                 | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Sending ConnectRequest[client=fe98617f-c4b-bce-8690-d773abe12af]
16821 2017-02-07 09:58:16,451 | DEBUG | ycat-client-io-1 | CopycatTransportServer           | 130 - org.onosproject.onos-core-primitives - 1.9.0.SNAPSHOT | Created new incoming connection 3668970141070841755
16822 2017-02-07 09:58:16,451 | DEBUG | 9876-partition-3 | FollowerState                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.3:9876 - Received ConnectRequest[client=fe98617f-c4b-bce-8690-d773abe12af]
16823 2017-02-07 09:58:16,451 | DEBUG | 9876-partition-3 | FollowerState                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.3:9876 - Sent ConnectResponse[status=OK, leader=/172.17.0.3:9876, members=[/172.17.0.2:9876, /172.17.0.4:9876, /172.17.0.3:9876]]
16824 2017-02-07 09:58:16,451 | DEBUG | ycat-client-io-1 | ClientConnection                 | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Received ConnectResponse[status=OK, leader=/172.17.0.3:9876, members=[/172.17.0.2:9876, /172.17.0.4:9876, /172.17.0.3:9876]]
16825 2017-02-07 09:58:16,451 | DEBUG | 9876-partition-3 | FollowerState                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.3:9876 - Received QueryRequest[session=60453, sequence=89, index=60696, query=InstanceQuery[resource=38, query=ResourceQuery[query=GetLeadership{topic=device:of:000ba2176cd174a}], consistency=null]]
16826 2017-02-07 09:58:16,451 | DEBUG | 9876-partition-3 | FollowerState                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.3:9876 - Forwarded QueryRequest[session=60453, sequence=89, index=60696, query=InstanceQuery[resource=38, query=ResourceQuery[query=GetLeadership{topic=device:of:000ba2176cd174a}], consistency=null]]
16827 2017-02-07 09:58:16,452 | DEBUG | 9876-partition-3 | FollowerState                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.3:9876 - Sent QueryResponse[status=ERROR, index=0, eventIndex=0, result=null]
Jordan Halterman
@kuujo
Feb 07 2017 21:42
I’m interested in these repeated exceptions - in the logs it just seems recovery is not being initiated for some reason:
Failure due to class io.atomix.copycat.session.ClosedSessionException
Jon Hall
@jhall11
Feb 07 2017 21:43
Yes, I expected that too…
I’m not seeing the transition to EXPIRED, just to UNSTABLE after the last restart
Jordan Halterman
@kuujo
Feb 07 2017 21:50

yeah… for some reason the state seems not to have triggered a recovery. But that ClosedSessionException is thrown within the condition state == State.CLOSED || state == State.EXPIRED. The call to state.setState(State.EXPIRED) is what should trigger the client to invoke the RecoveryStrategy, which should call recover()
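For context, the strategy itself is configurable on the client; roughly like this, if I’m remembering the 1.x builder right (withRecoveryStrategy/RecoveryStrategies may have moved, so this is a sketch, not a definitive reference):

```java
import io.atomix.copycat.client.CopycatClient;
import io.atomix.copycat.client.RecoveryStrategies;

public class RecoverySketch {
  public static void main(String[] args) {
    // RECOVER should register a fresh session when the old one expires;
    // CLOSE would shut the client down instead.
    CopycatClient client = CopycatClient.builder()
        .withRecoveryStrategy(RecoveryStrategies.RECOVER)
        .build();
  }
}
```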

grep -r "Recovering session” ./*

I see it happening several times in the logs. I need to take a look at what happens after those recoveries when I have some time later.

Jordan Halterman
@kuujo
Feb 07 2017 22:03
hmm
all the registrations seem to be failing when they’re applied for some reason
it’s not obvious to me why they would be failing after they’re committed
it could be an exception in the state machine
the only reason Copycat intentionally fails a session registration is if the log is closed, meaning the server is being/has been shut down
Jordan Halterman
@kuujo
Feb 07 2017 22:08
The sessions that are being set to UNSTABLE are inside the servers. This call checks whether a session has not been kept alive and marks it as UNSTABLE if it hasn’t. Then the leader can remove the session whenever it wants.
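Paraphrasing the shape of that check (hypothetical sketch, not the actual server code):

```java
import java.util.List;

// Hypothetical sketch of the server-side expiration check described above;
// the real Copycat classes and names differ.
class SessionSweepSketch {
  enum State { OPEN, UNSTABLE, EXPIRED, CLOSED }

  static class Session {
    long lastKeepAlive;       // timestamp of the last KeepAliveRequest applied
    State state = State.OPEN;
  }

  // Called as entries are applied, using the leader's timestamp.
  static void suspectExpired(List<Session> sessions, long now, long sessionTimeout) {
    for (Session session : sessions) {
      if (session.state == State.OPEN && now - session.lastKeepAlive > sessionTimeout) {
        session.state = State.UNSTABLE; // the leader can remove it whenever it wants
      }
    }
  }
}
```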
hmm
I’m just not sure what could be causing this application of the RegisterEntry to fail
Jon Hall
@jhall11
Feb 07 2017 22:30
Should we exclude the session we are registering here?
Jordan Halterman
@kuujo
Feb 07 2017 23:08
The session's timestamp = the current timestamp, so it should be okay, but it couldn't hurt
Some failure is happening in there, and it's not obvious why the registration is failing while being applied. Have to dig through it more tonight but I just got a bunch of stuff put on my plate
Jordan Halterman
@kuujo
Feb 07 2017 23:34
I see what’s happening @jhall11
well, not totally, but I see why the sessions are being set to UNSTABLE
Obviously, the sessions are applied and then stored in memory on the server, but ultimately the registration request fails. That is, registering a session is not an atomic operation. Because the registration fails, the client never sends a keep-alive. Then another RegisterRequest is sent, the new session is stored, the prior session is set to UNSTABLE, and the new session registration fails as well
so, we really just need to figure out why it fails
depending on why that is, the apply(RegisterEntry) code may need to be cleaned up to remove a session when registration fails
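If it does come to that cleanup, I’d picture something like this (hypothetical sketch; completeRegistration and the sessions map are placeholders, not Copycat’s actual apply(RegisterEntry) code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of rolling back a failed registration so a
// half-registered session can't linger and later be marked UNSTABLE.
class RegisterCleanupSketch {
  static class Session {
    final long id;
    Session(long id) { this.id = id; }
  }

  final Map<Long, Session> sessions = new HashMap<>();

  long applyRegister(long sessionId) {
    Session session = new Session(sessionId);
    sessions.put(sessionId, session);         // stored in memory first
    try {
      return completeRegistration(session);   // the step that is mysteriously failing
    } catch (RuntimeException e) {
      sessions.remove(sessionId);             // roll back the in-memory state
      throw e;
    }
  }

  long completeRegistration(Session session) { return session.id; } // placeholder
}
```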
but I don’t think it should be failing
hmm
Jordan Halterman
@kuujo
Feb 07 2017 23:41
need to add the error type to those log messages
Jordan Halterman
@kuujo
Feb 07 2017 23:48
So, those sessions are just being set to UNSTABLE and eventually removed because the client thinks they failed. That's the expected behavior. The failure is the unexpected behavior.