These are chat archives for atomix/atomix

2nd Jan 2018
Johno Crawford
@johnou
Jan 02 2018 09:21 UTC
@kuujo it seems the existing cluster receives the PartitionMetaData update (io.atomix.protocols.raft.partition.RaftPartition#update) from server3,
then invokes server.leave...
should it be starting another election?
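For context, the flow being described looks roughly like the sketch below; the names here (PartitionUpdateSketch, the nested RaftServer interface) are hypothetical illustrations, not the actual io.atomix sources. A metadata update that no longer lists the local member causes the partition to tell its Raft server to leave the group.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hedged sketch, not the real Atomix code: a partition reacting to a
// metadata update by leaving the Raft group when the local member has
// been removed from the partition's member list.
class PartitionUpdateSketch {
  interface RaftServer {
    CompletableFuture<Void> leave();
  }

  private final String localMemberId;
  private final RaftServer server;

  PartitionUpdateSketch(String localMemberId, RaftServer server) {
    this.localMemberId = localMemberId;
    this.server = server;
  }

  CompletableFuture<Void> update(Set<String> members) {
    if (!members.contains(localMemberId)) {
      // Leaving triggers a Raft configuration change; the remaining
      // members may then elect a new leader, which is the question above.
      return server.leave();
    }
    return CompletableFuture.completedFuture(null);
  }
}
```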
Jordan Halterman
@kuujo
Jan 02 2018 18:37 UTC
I’m back...
I don’t even know what to do with myself after doing nothing for the last couple weeks
Johno Crawford
@johnou
Jan 02 2018 18:38 UTC
Welcome back :)
Wdym? You relaxed, right? That's super important
I'll be heading off in a couple days for a week or so
Sadly I was one of the few who worked over Christmas
Jordan Halterman
@kuujo
Jan 02 2018 18:44 UTC
Basically just visited a lot of family and watched a lot of sports
Johno Crawford
@johnou
Jan 02 2018 19:02 UTC
Sounds good to me
Paweł Kamiński
@pawel-kaminski-krk
Jan 02 2018 19:12 UTC
@kuujo welcome back! @johnou and I have been talking about joining and leaving the cluster. I'm wondering if I can help in any way rather than just throw problems at you guys. I'm open to any ideas.
Johno Crawford
@johnou
Jan 02 2018 19:13 UTC
@pawel-kaminski-krk I actually opened an issue atomix/atomix#369
Paweł Kamiński
@pawel-kaminski-krk
Jan 02 2018 19:16 UTC
Cool. Meanwhile I'll try to look at the tests and understand them so I can later reason more about the new version.
Johno Crawford
@johnou
Jan 02 2018 19:16 UTC
I debugged it for maybe 30 minutes; it seems the third node joins the cluster and, in doing so, the Raft server closes, the join handler is unregistered, and then the join message fails
But I don't yet know the internals well enough to say how it should behave
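The failure mode being described can be illustrated with a minimal registry sketch (hypothetical names, not the Atomix messaging API): if closing the server unregisters the "join" handler while a join request is still in flight, the dispatch has nowhere to go and fails.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal illustration of the race described above; this is a toy registry,
// not the Atomix messaging service.
class HandlerRaceSketch {
  private final Map<String, Function<String, String>> handlers = new ConcurrentHashMap<>();

  void register(String type, Function<String, String> handler) {
    handlers.put(type, handler);
  }

  void unregister(String type) {
    handlers.remove(type);
  }

  String dispatch(String type, String request) {
    Function<String, String> handler = handlers.get(type);
    if (handler == null) {
      // The observed failure mode: the handler was removed when the server
      // closed, so the pending join message cannot be handled.
      throw new IllegalStateException("No handler for " + type);
    }
    return handler.apply(request);
  }

  public static void main(String[] args) {
    HandlerRaceSketch messaging = new HandlerRaceSketch();
    messaging.register("join", request -> "accepted " + request);
    messaging.unregister("join");          // server closed mid-join
    messaging.dispatch("join", "server3"); // throws IllegalStateException
  }
}
```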
Jordan Halterman
@kuujo
Jan 02 2018 20:55 UTC
K looking
Jordan Halterman
@kuujo
Jan 02 2018 21:16 UTC
Yeah, I don't think there are any higher-level tests for cluster reconfiguration yet - just reconfiguration at the ClusterMetadataService level. I'm working mostly on testing and documentation for the next few weeks. I'll hack out some tests to reproduce it right now and figure out what's going on.
good project to get me back into the swing of things
will check out the PRs too
Johno Crawford
@johnou
Jan 02 2018 21:27 UTC
@kuujo there should be a reproducer in the GitHub issue
most of the PRs are straightforward fixes
There are one or two that might need a second look
Jordan Halterman
@kuujo
Jan 02 2018 22:00 UTC
:+1:
Jordan Halterman
@kuujo
Jan 02 2018 22:08 UTC
Pretty easy to reproduce. Fixed a couple other bugs too, but I should actually go through those PRs first since I’m probably finding and fixing stuff you already found and fixed :-P
Jordan Halterman
@kuujo
Jan 02 2018 22:47 UTC
got one new PR in and now I’ll go through all the others
Jordan Halterman
@kuujo
Jan 02 2018 23:11 UTC

I realized something while playing with those reconfiguration tests... there's a problem with the fact that the number of partitions is still based on the number of nodes in the cluster. When I start a single-node cluster and then add a second node to it, if the number of “coordination” (Raft) partitions is not configured, the first node is started with one partition and the second with two partitions. The second node fails startup because its second partition can't join a non-existent second Raft partition. This is far too difficult to understand IMO.

There are a few options here: either the number of partitions needs to be replicated in the cluster metadata, or maybe the number of partitions needs to be a required configuration, or the default number of partitions needs to be static, or the Raft primitives need to support repartitioning. The last option is a long-term feature that should probably be left for Atomix 2.2.
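A plain-Java illustration of the mismatch described above. The sizing rule below is an assumption for illustration only, not Atomix's exact formula; the point is that a default derived from cluster size gives nodes that start at different times different partition counts, which any of the first three options would avoid.

```java
// Illustration only: assumes one coordination partition per node, which is
// not necessarily Atomix's actual default rule.
public class PartitionCountMismatch {
  static int defaultPartitions(int clusterSize) {
    return clusterSize; // assumed rule for the sake of the example
  }

  public static void main(String[] args) {
    int firstNode = defaultPartitions(1);  // started as a single-node cluster
    int secondNode = defaultPartitions(2); // joined once the cluster had two nodes
    System.out.println("first node created " + firstNode + " Raft partition(s)");
    System.out.println("second node expects " + secondNode + " Raft partition(s)");
    // The second node tries to join Raft partition #2, which the first node
    // never created, so the second node fails startup.
  }
}
```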

Johno Crawford
@johnou
Jan 02 2018 23:36 UTC
don't the first two go hand in hand?
the number of partitions needs to be replicated / the number of partitions needs to be static
until re-partitioning is supported, which definitely should be targeted for a later milestone
Johno Crawford
@johnou
Jan 02 2018 23:48 UTC
how does that work with Atomix storage?
e.g. if I have nodes a, b, c and then later decide I want to add node d
I shut down all nodes and add d to the bootstrap list
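A sketch of the restart procedure being asked about, with illustrative names only (this is not the Atomix API); the actual question here, how on-disk Raft state interacts with the expanded bootstrap list, is left open.

```java
import java.util.List;

// Hedged sketch of the procedure above: stop a, b, c, then restart every
// node with d added to the bootstrap list.
public class BootstrapExpansion {
  public static void main(String[] args) {
    List<String> oldBootstrap = List.of("a", "b", "c");
    List<String> newBootstrap = List.of("a", "b", "c", "d");

    System.out.println("stopping nodes: " + oldBootstrap);
    for (String node : newBootstrap) {
      // Each existing node keeps its Raft logs on disk; whether those logs
      // line up with the new bootstrap list is the open question above.
      System.out.println("starting " + node + " with bootstrap=" + newBootstrap);
    }
  }
}
```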