These are chat archives for atomix/atomix

3rd
Jul 2016
niquola
@niquola
Jul 03 2016 18:31
I've hit a small misbehavior: I'm playing with 3 replicas in a Clojure REPL, and when I try to shut down the last replica in the cluster, it does not stop properly and keeps trying to reconnect to the other members forever. I was able to stop the server by hand, but the client does not stop with .close().
niquola
@niquola
Jul 03 2016 18:36
INFO [copycat-client-io-1] default io.atomix.catalyst.transport.NettyClient || Connecting to mobile/127.0.0.1:4444
Jordan Halterman
@kuujo
Jul 03 2016 21:15
is that the case every time?
or does it seem more like a race condition?
niquola
@niquola
Jul 03 2016 21:16
Yes, I've reproduced it at least 10 times
I can stop 2 replicas, it doesn't matter which
but I can't stop the last one
does the reconnect cycle respect client close or server shutdown?
Jordan Halterman
@kuujo
Jul 03 2016 21:22
are you calling shutdown or leave?
or using an old version that has close?
niquola
@niquola
Jul 03 2016 21:23
shutdown on replica
Jordan Halterman
@kuujo
Jul 03 2016 21:23
gotcha
yeah definitely should not be doing that
that should be a pretty simple shutdown of the server
the client is trying to unregister its session probably
niquola
@niquola
Jul 03 2016 21:23
I've looked at the code: it shuts down the clusterManager, shuts down the server, then closes the client
Jordan Halterman
@kuujo
Jul 03 2016 21:24
ahh you’re right
niquola
@niquola
Jul 03 2016 21:24
I was able to shut down the server by hand
but client.close hangs
Jordan Halterman
@kuujo
Jul 03 2016 21:24
hmm so it should be shutting down the client and then server
ahh I know the problem actually
niquola
@niquola
Jul 03 2016 21:26
?
Jordan Halterman
@kuujo
Jul 03 2016 21:28
So, when you call shutdown on a server it just shuts the server down without removing the server from the cluster. So, if you have a three node cluster and you shut down two nodes, you’re losing a majority of the cluster. Alternatively, if you were to leave the cluster then the cluster would shrink to one node. The problem with shutdown when you lose a majority is that the client can’t connect to unregister its session. This could be fixed by just stopping the client if it can’t connect to unregister its session and letting its session expire. Maybe the client should still shut down but return an exception to indicate that it couldn’t explicitly unregister its session, so it’s clear that the session will have to expire in the cluster.
I don’t think we want to silently shut down the client, since some code could depend on the assumption that once a client is closed its session is no longer present. For example, one could close a client that holds a lock and expect the lock to be released when the client is shut down, but that may not be the case if we allow the session to expire instead. So, the client should probably attempt once to unregister its session and shut down gracefully; if it’s unable to unregister its session, it should still shut down but fail the CompletableFuture with some exception.
That would allow the AtomixReplica#shutdown method to progress when a majority of the cluster is down
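For illustration, the close behavior proposed above could look roughly like the sketch below. It is not the Atomix or Copycat implementation: `CloseSketch`, `SessionNotUnregisteredException`, and the `unregister`, `stopLocally`, and `timeoutMillis` parameters are all hypothetical names, and the timeout is an assumption standing in for "attempt once".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are Atomix or Copycat APIs. */
final class CloseSketch {

    /** Signals that the client stopped but its session was left to expire. */
    static final class SessionNotUnregisteredException extends RuntimeException {
        SessionNotUnregisteredException(Throwable cause) {
            super("client closed, but its session was not unregistered and will have to expire", cause);
        }
    }

    /**
     * Makes a single unregister attempt, then stops the client locally
     * whether or not that attempt succeeded.
     */
    static CompletableFuture<Void> close(
            Supplier<CompletableFuture<Void>> unregister, // starts one unregister attempt
            Runnable stopLocally,                         // closes connections, threads, etc.
            long timeoutMillis) {
        return unregister.get()
            // Give up if no majority answers in time (orTimeout requires Java 9+).
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .<Void>handle((ignored, error) -> {
                stopLocally.run(); // always stop; never reconnect forever
                if (error != null) {
                    // Not silent: the caller learns the session may still exist.
                    throw new SessionNotUnregisteredException(error);
                }
                return null;
            });
    }
}
```

A caller that holds a lock could then treat a future failed with `SessionNotUnregisteredException` as "the lock may stay held until the session expires" rather than assuming it was released.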
niquola
@niquola
Jul 03 2016 21:32
Oh, sounds reasonable :)
Jordan Halterman
@kuujo
Jul 03 2016 21:34
easy change…
niquola
@niquola
Jul 03 2016 21:36
This is a funny problem: how to intentionally stop a distributed cluster :)
Roman Pearah
@neverfox
Jul 03 2016 21:36
So what is the proper way to shut down replicas if you're intentionally turning the cluster off? Call shutdown followed by leave on each, or just leave, or just shutdown?
Jordan Halterman
@kuujo
Jul 03 2016 21:37
you can just call leave on all nodes… that will first remove the node from the cluster and then shut it down
Roman Pearah
@neverfox
Jul 03 2016 21:38
ah cool
so that steps the cluster down, like you were saying
rather than causing a failed majority
Jordan Halterman
@kuujo
Jul 03 2016 21:39
right
though if you want to keep the configuration, then with this change it should work fine to call shutdown… in that case at least your data stays on disk
certainly safer if you’re going to start the cluster again
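The difference between shutdown and leave discussed above comes down to majority arithmetic, which can be sketched as follows. This is only an illustration of the counting, assuming a simple majority of the configured members is required; `QuorumSketch` and its methods are hypothetical and not part of Atomix.

```java
/** Hypothetical sketch of majority arithmetic; not an Atomix API. */
final class QuorumSketch {

    /** Smallest number of members that forms a majority of the configuration. */
    static int quorum(int configuredMembers) {
        return configuredMembers / 2 + 1;
    }

    /** True if the live members can still commit changes, such as unregistering a session. */
    static boolean hasMajority(int configuredMembers, int liveMembers) {
        return liveMembers >= quorum(configuredMembers);
    }

    public static void main(String[] args) {
        // shutdown: the configuration stays at 3 while live members drop to 1.
        System.out.println(hasMajority(3, 1)); // prints "false"
        // leave: each departure shrinks the configuration, 3 -> 2 -> 1.
        System.out.println(hasMajority(1, 1)); // prints "true"
    }
}
```

With shutdown the last node is one live member out of a configuration of three, so nothing can be committed; with leave the configuration has already shrunk to one, so the last node is its own majority.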
niquola
@niquola
Jul 03 2016 21:42
Shutting down a cluster for maintenance looks like quite a routine procedure.
Jordan Halterman
@kuujo
Jul 03 2016 21:42
indeed… this definitely needs to be fixed