These are chat archives for atomix/atomix

2nd
Mar 2018
Jordan Halterman
@kuujo
Mar 02 2018 19:53
GitHub is being attacked again?
Jordan Halterman
@kuujo
Mar 02 2018 19:56
They’ve been having some bad days lately
Can’t access the PRs ATM
Jon Hall
@jhall11
Mar 02 2018 19:56
yeah, not sure what the cause is
Johno Crawford
@johnou
Mar 02 2018 19:57
Maybe another memcached amp attack
Jon Hall
@jhall11
Mar 02 2018 19:58
hmm, it looks like I can access the PRs
Johno Crawford
@johnou
Mar 02 2018 20:01
atomix/atomix#411
Jordan Halterman
@kuujo
Mar 02 2018 20:06
Might just be mobile that’s messed up
Just get the pink unicorn when I try to access PRs
Jon Hall
@jhall11
Mar 02 2018 20:07
strange
Jordan Halterman
@kuujo
Mar 02 2018 20:34
One of the EC2 AZs in Oregon was having issues
Jordan Halterman
@kuujo
Mar 02 2018 21:04
I’m going to write some cluster configuration tests and start sending PRs to the test repo. We can work out the cluster configuration issues from there...
@johnou the problem with reconfiguring the cluster when a data node is killed is that it makes split brain possible. If, instead of a node being killed, there’s a partition between node 1 and 2, each would reconfigure its partitions down to a single node and state would diverge on each side with no way to converge again. The only way to prevent that is by forcing users to decide how to reconfigure the cluster - remove node 1 or node 2 - because it can’t be done automatically without risking split brain. So once a cluster has two nodes and all its partitions are reconfigured to two nodes, the two-node partition configurations persist until a user explicitly shuts down one of the data nodes.
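As a toy sketch (plain Python, not Atomix code) of why that diverges:

node1 = {"members": ["node1", "node2"], "data": {}}
node2 = {"members": ["node1", "node2"], "data": {}}

# partition between node 1 and node 2: each side assumes the other is gone
# and automatically reconfigures its partitions down to a single member
node1["members"] = ["node1"]   # quorum of 1 on this side
node2["members"] = ["node2"]   # quorum of 1 on this side

# both sides keep accepting writes independently
node1["data"]["key"] = "written on node 1"
node2["data"]["key"] = "written on node 2"

# when the partition heals, the two histories have diverged and there is
# no safe way to merge them automatically
assert node1["data"] != node2["data"]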
Johno Crawford
@johnou
Mar 02 2018 22:23
@kuujo even if it is shut down gracefully it still fails to start
Jon Hall
@jhall11
Mar 02 2018 22:24
is node 2 still running? Don’t you need a quorum?
Jordan Halterman
@kuujo
Mar 02 2018 22:30
That’s starting a third node after the first two are shut down, meaning 1/3 nodes are running
Hmm wait
Johno Crawford
@johnou
Mar 02 2018 22:31
it works if there are two bootstrap nodes, a new data node joins (server 3), all nodes shut down, then 1 and 2 start up and cluster correctly
seems like another edge case with a single data node?
@jhall11 might be onto something with the quorum
Jordan Halterman
@kuujo
Mar 02 2018 22:33
  • Start node 1
  • Start node 2
  • Stop node 2
  • Stop node 1
  • Start node 3

In this case, node 3 is trying to join node 1 (bootstrapNodes is just node 1) but node 1 isn’t even running any more. Sort of strange that the errors are mentioning node 2 though, not sure how that’s happening.

I’m writing some tests for cluster configuration right now. I’ll send them today. I do suspect there are bugs in cluster reconfiguration
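A first sketch for that scenario, assuming the harness helpers (add_node(), node.stop(), _test_map) start and stop Docker nodes and exercise a primitive the way their names suggest:

@with_cluster(nodes=1)
def test_join_dead_bootstrap_node(cluster):
    # start node 2 against node 1 (the only bootstrap node)
    node2 = cluster.add_node()
    # stop node 2, then node 1
    node2.stop()
    cluster.node(1).stop()
    # node 3 now tries to join node 1, which is no longer running
    node3 = cluster.add_node()
    _test_map(node3)  # expected to fail in the current build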

Johno Crawford
@johnou
Mar 02 2018 22:34
my use case is a little simpler than that though
start node 1 (with bootstrap of node 1)
start node 2 (with bootstrap of node 1)
stop node 2
stop node 1
start node 1 (with bootstrap of node 1) // fails
Jordan Halterman
@kuujo
Mar 02 2018 22:44
Yeah that should be a bug
I’m guessing it’s a race preventing Raft partitions from being reconfigured before node 2 is stopped
The configuration changes need to be transactional - I think I’ve mentioned this before
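Something along these lines should reproduce your case (same assumptions about the harness helpers as the sketch above):

@with_cluster(nodes=1)
def test_restart_bootstrap_node_after_scale_down(cluster):
    # start node 2 with node 1 as its bootstrap node, then stop it again
    node2 = cluster.add_node()
    node2.stop()
    # stop and restart node 1 with its original bootstrap configuration
    node1 = cluster.node(1)
    node1.stop()
    node1.start()
    _test_map(node1)  # currently fails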
Jordan Halterman
@kuujo
Mar 02 2018 23:01
aSD
A
A
A
A
A
A
A
WOW
the Gitter UI did not like that gist
asdf
Jon Hall
@jhall11
Mar 02 2018 23:02
Yes, I had the same issues, had to use the web page until it was past the screen
but if I scroll up I need to restart gitter
Jordan Halterman
@kuujo
Mar 02 2018 23:02
terrible
@with_cluster(nodes=1)
def test_scale_up_down(cluster):
    assert len(cluster.nodes()) == 1
    # scale up: add a second node and exercise a primitive on it
    node = cluster.add_node()
    assert len(cluster.nodes()) == 2
    _test_map(node)
    # scale back down to a single node
    node.remove()
    assert len(cluster.nodes()) == 1
    node = cluster.node(1)
    _test_map(node)
    # restart the remaining node and make sure primitives still work
    node.stop()
    node.start()
    _test_map(node)
the primitives actually fail after the new node is added
hmm actually this may just be a misconfiguration of the Docker containers not sure…
23:23:20.773 [raft-server-coordination-partition-6] WARN  i.a.p.raft.roles.FollowerRole - RaftServer{coordination-partition-6}{role=FOLLOWER} - io.netty.channel.ConnectTimeoutException: connection timed out: /172.18.0.2:5679
Jon Hall
@jhall11
Mar 02 2018 23:28
shouldn’t you try some primitives before you isolate and heal node 1?
Jordan Halterman
@kuujo
Mar 02 2018 23:29
wait nvm that test is using the primary-backup protocol which just doesn’t have enough backups
need to use a Raft primitive
Johno Crawford
@johnou
Mar 02 2018 23:33
how's that done in py
Jordan Halterman
@kuujo
Mar 02 2018 23:38
Through the REST API
Johno Crawford
@johnou
Mar 02 2018 23:39
right, but I mean what you said about the test being wrong
Jordan Halterman
@kuujo
Mar 02 2018 23:41
Hmm actually this test isn’t going to be right. We need to ensure stop() is called on the Atomix instance to allow it to reconfigure the cluster. Probably need to get the signal in the AtomixAgent
ctrl+c probably needs to stop() the node
Shutdown hook is what I need
Jordan Halterman
@kuujo
Mar 02 2018 23:47

FYI to run all tests:

atomix-test run

to run a specific test:

atomix-test run tests/test_cluster.py::test_cluster_restart