These are chat archives for atomix/atomix

13th May 2018
Ronnie
@rroller
May 13 2018 20:18
I'm having lots of trouble with Atomix with what I think should be standard operations. I have 3 hosts: member-a, member-b, and member-c. Things work great until I do deployments. I deploy to one host at a time, and the first host may or may not be able to rejoin the cluster. Since I started using Atomix a week or so ago this has been my pain point -- testing around nodes going offline and rejoining doesn't seem to work very well at all (maybe I'm doing something wrong)...
Do you know of any open issues around this?
Ronnie
@rroller
May 13 2018 20:26
Things seem to work better if I do a deployment to all 3 hosts at once instead of a rolling deployment where I deploy to each host and wait for it to start before moving on to the next
Jordan Halterman
@kuujo
May 13 2018 20:27
If you’re using Raft then you have to start multiple nodes to achieve quorum before any one node can finish starting
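(For reference, a minimal three-member setup of the kind being discussed might look roughly like the sketch below, assuming the Atomix 3.x Java builder API, which was still shifting between the 3.0 release candidates at the time; the member names, addresses, and paths are examples, not Ronnie's actual configuration.)

```java
import io.atomix.cluster.Node;
import io.atomix.cluster.discovery.BootstrapDiscoveryProvider;
import io.atomix.core.Atomix;
import io.atomix.protocols.raft.partition.RaftPartitionGroup;
import java.io.File;

public class ClusterBootstrap {
  public static void main(String[] args) {
    Atomix atomix = Atomix.builder()
        .withMemberId("member-a")              // this node's stable ID
        .withAddress("member-a:5679")          // address the other members can reach
        .withMembershipProvider(BootstrapDiscoveryProvider.builder()
            .withNodes(
                Node.builder().withId("member-a").withAddress("member-a:5679").build(),
                Node.builder().withId("member-b").withAddress("member-b:5679").build(),
                Node.builder().withId("member-c").withAddress("member-c:5679").build())
            .build())
        // Raft-backed management group: a majority of these members must be up
        // before start() can complete on any one of them.
        .withManagementGroup(RaftPartitionGroup.builder("system")
            .withMembers("member-a", "member-b", "member-c")
            .withNumPartitions(1)
            .withDataDirectory(new File("/var/lib/atomix/system"))
            .build())
        .withPartitionGroups(RaftPartitionGroup.builder("data")
            .withMembers("member-a", "member-b", "member-c")
            .withNumPartitions(3)
            .withDataDirectory(new File("/var/lib/atomix/data"))
            .build())
        .build();

    // Blocks until the node has joined and the Raft groups have a quorum.
    atomix.start().join();
  }
}
```

(With a Raft group listing three members, any single node needs a majority of them, 2 of 3, reachable before its own startup can finish, which is why one node can't come up alone.)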
Ronnie
@rroller
May 13 2018 20:27
I'm using Raft, yes
but I have 3 hosts
so I should be able to take 1 down and deploy to it, right?
and sometimes that does work, but not always
while 1 host is down, the other 2 are up, and that's sufficient for quorum unless I'm missing something
Jordan Halterman
@kuujo
May 13 2018 20:29
what constitutes “deployed to”?
Ronnie
@rroller
May 13 2018 20:29
stop the running instance and start it again with a newer version
(this is a Docker image... so stop the old Docker image, copy over the new one, start it up)
so for a minute or so, one member is totally offline, then attempts to start up again... this fails most of the time... so I end up having to do a fleet-wide deployment of all 3 at once... and that works
Jordan Halterman
@kuujo
May 13 2018 20:30
Newer version of Atomix?
Ronnie
@rroller
May 13 2018 20:30
no
the Atomix code is not changing
it's newer versions of my service code (just a standard Java API service)
the service has a dependency on Atomix and starts Atomix during startup
similar to that one unit test I sent you a few days ago
also FWIW, that unit test that would verify this type of behavior works... it worked after your fix
but recently stopped working
I'm wondering if there's a regression
running it again to verify
yes, it's failing.
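(For context, the restart flow such a test would exercise is roughly sketched below; buildAtomix(...) is a hypothetical helper standing in for the builder configuration shown earlier, and the member IDs and timeouts are arbitrary.)

```java
import io.atomix.core.Atomix;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class RestartCheck {
  // Hypothetical helper: builds one member of the three-node cluster using the
  // same builder configuration as the earlier sketch, varying only the member ID.
  static Atomix buildAtomix(String memberId) {
    throw new UnsupportedOperationException("fill in with the builder sketch above");
  }

  public static void main(String[] args) throws Exception {
    Atomix a = buildAtomix("member-a");
    Atomix b = buildAtomix("member-b");
    Atomix c = buildAtomix("member-c");

    // All three members come up together so the Raft groups can form a quorum.
    CompletableFuture.allOf(a.start(), b.start(), c.start()).get(1, TimeUnit.MINUTES);

    // Simulate a rolling deployment: take one member down while the other two keep quorum.
    a.stop().get(1, TimeUnit.MINUTES);

    // Bring it back with the same member ID and data directory; the intermittent
    // hang/failure being discussed shows up on this restart.
    Atomix restarted = buildAtomix("member-a");
    restarted.start().get(1, TimeUnit.MINUTES);
  }
}
```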
Jordan Halterman
@kuujo
May 13 2018 20:33
A couple things
Ronnie
@rroller
May 13 2018 20:35
"Raft partition group must be configured with persistent membership"
Jordan Halterman
@kuujo
May 13 2018 20:35
The regression is probably because of 0577ed6a0dbcb61577ddae8c89027b91e0140360 which fixed a bug that retained persistent members in the cluster after they were removed. But really, for this type of use case they probably shouldn’t be removed. If you’re shutting down a node to restart it, it should remain part of the cluster and just stay inactive. You also need to ensure the persistent state of the cluster is preserved across the containers.
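(For reference, the persistent state Jordan mentions lives in the Raft partition groups' data directories, so preserving it across containers amounts to something like the sketch below, assuming the Atomix 3.x RaftPartitionGroup builder; the paths and volume mount are examples only.)

```java
import io.atomix.protocols.raft.partition.RaftPartitionGroup;
import java.io.File;

public class PersistentSystemGroup {
  static RaftPartitionGroup systemGroup() {
    return RaftPartitionGroup.builder("system")
        .withMembers("member-a", "member-b", "member-c")
        .withNumPartitions(1)
        // This directory must sit on storage that outlives the container,
        // e.g. a bind mount: docker run -v /opt/atomix/system:/var/lib/atomix/system ...
        .withDataDirectory(new File("/var/lib/atomix/system"))
        .build();
  }
}
```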
Ronnie
@rroller
May 13 2018 20:35
that error is where it fails now...
it is
I map the state outside of the container
Jordan Halterman
@kuujo
May 13 2018 20:38
There’s probably a bug in re-adding a persistent member after it was removed then (something with the tombstone), but it probably shouldn’t be removed. If you’re shutting down a node that’s still a member of a Raft partition group then it should not be removed from the cluster membership. Maybe just need to add another method to remove() a node from the cluster.
Ronnie
@rroller
May 13 2018 20:41
From my perspective I'll have these two use cases... 1) regular deployments... and 2) dead hosts/reboots/etc. (things that are not normal)
I'm not going to be adding new hosts or removing existing ones... I'll always have the same 3 member IDs
and ideally those two things would work without issue.
is there anything I can do to help with this?
to make that rock solid?
(do I need to open an issue with logs and details?)
Jordan Halterman
@kuujo
it doesn’t have to be there, since the Raft partition group configuration requires the set of members anyway
Ronnie
@rroller
May 13 2018 20:43
nice! I don't know this code at all so I can't comment :)
If you push out a new version please let me know and I can test it
Jordan Halterman
@kuujo
May 13 2018 21:09
It’s Mother’s Day, so I’m limited on time until after my wife goes to bed :-) but I have a really good idea of how to fix the cluster membership issues altogether
Ronnie
@rroller
May 13 2018 21:13
Thank you so much! Really appreciate it.