These are chat archives for atomix/atomix

26th
Mar 2016
Madan Jampani
@madjam
Mar 26 2016 05:40
After a few hours of toil, success at last. Found a bug in event publishing that manifests itself around leadership changes. Should have a PR for it shortly
Jordan Halterman
@kuujo
Mar 26 2016 05:41
Yay!
I think this has been cropping up in a bunch of places and is a huge fix :-)
I'll be around
Madan Jampani
@madjam
Mar 26 2016 05:44
yay! indeed. this one turned out to be quite tricky. perhaps exacerbated by the fact that I first needed to understand the logic before I could debug it :)
Madan Jampani
@madjam
Mar 26 2016 05:49
it's quite simple. when a leader changes, the new leader loses track of the previousIndex value for the session and initializes it to the sessionId. The problem occurs when the session has already acked an event at index > sessionId. So when it's time to publish a new event, the client will reject the published message, complaining about an inconsistent index. However, the retry logic still includes the same old previousIndex. This now goes into a tight loop where the server publishes and the client keeps rejecting, thereby locking up the system.
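To make the failure mode concrete, here is a minimal, self-contained sketch of that loop. It is a toy model with made-up class and variable names, not the actual Copycat code; the index values are borrowed from the logs discussed later in this conversation.

```java
// Illustrative reproduction of the publish/reject loop described above -- a toy model,
// not the Copycat implementation.
public class PublishLoopSketch {

  static final long SESSION_ID = 2155;

  public static void main(String[] args) {
    long serverPreviousIndex = SESSION_ID;   // new leader re-initialized this to the sessionId
    long clientLastEventIndex = 2438;        // client already acked an event at a higher index
    long eventIndexToPublish = 2489;

    for (int attempt = 1; attempt <= 5; attempt++) {
      System.out.printf("Sending PublishRequest[eventIndex=%d, previousIndex=%d]%n",
          eventIndexToPublish, serverPreviousIndex);

      // Client side: previousIndex does not match the last event it has seen, so it
      // rejects and reports the index it last received.
      if (serverPreviousIndex != clientLastEventIndex) {
        System.out.printf("Received PublishResponse[status=ERROR, index=%d]%n",
            clientLastEventIndex);
        // Buggy retry: the server resends with the same stale previousIndex,
        // so the exchange never converges -- this is the tight loop.
        continue;
      }
      System.out.println("Client accepted the event");
      return;
    }
    System.out.println("...still looping (in practice, until another leader is elected)");
  }
}
```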
Jordan Halterman
@kuujo
Mar 26 2016 05:57
hmm
interesting
Let's say previousIndex == sessionId and the response from the client for the PublishRequest is > previousIndex
Jordan Halterman
@kuujo
Mar 26 2016 06:03
yep
Madan Jampani
@madjam
Mar 26 2016 06:03
However, in the resent PublishRequest we still use the same old previousIndex
the resend happens in a tight loop, and in this scenario it is an infinite loop (basically until someone else gets elected leader)
Jordan Halterman
@kuujo
Mar 26 2016 06:08
I think the part I’m not seeing is how the leader change puts it in that state. The ServerSessionContext is only ever created when RegisterEntry is applied, and that’s where the index is initialized to the sessionId. A leader change shouldn’t result in indexes being reset
Also, there’s another aspect to this which is that KeepAliveRequests give clients a chance to resolve their previousIndex with all the servers’ state machines. https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L580 In theory, even if a leader is in an infinite loop with a client, once the client submits a KeepAliveRequest and the KeepAliveEntry is applied on the leader, it should resolve back to the correct index
hmm but that’s not actually any more information than the client is already providing in the response
I’m sure I’m missing something
just have to think about it a bit...
Jordan Halterman
@kuujo
Mar 26 2016 06:14
In the client’s if statement is it that the client’s eventIndex is less than or greater than previousIndex? Greater right? The client has already seen the event?
it seems like the index in resendEvents is just not being used?
other than to clear old events
Madan Jampani
@madjam
Mar 26 2016 06:15
that was the fix I had in mind. Change it from != to >
in ClientSessionListener
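Roughly, the change being discussed would look like this on the client side. This is only a sketch of the comparison, not the actual ClientSessionListener code, and the helper names are made up.

```java
// Sketch of the comparison being discussed (not the actual ClientSessionListener code).
public class PublishCheckSketch {

  // Old behavior: reject unless previousIndex matches the client's last seen event exactly.
  static boolean rejectOld(long previousIndex, long clientEventIndex) {
    return previousIndex != clientEventIndex;
  }

  // Proposed behavior: only reject when previousIndex is ahead of what the client has seen,
  // i.e. an event the client should have received appears to have been skipped.
  static boolean rejectNew(long previousIndex, long clientEventIndex) {
    return previousIndex > clientEventIndex;
  }

  public static void main(String[] args) {
    long clientEventIndex = 2438;
    // Leader-change case: the new leader sends previousIndex == sessionId (2155).
    System.out.println(rejectOld(2155, clientEventIndex)); // true  -> tight retry loop
    System.out.println(rejectNew(2155, clientEventIndex)); // false -> client accepts
    // Genuinely skipped event: previousIndex is beyond what the client has seen.
    System.out.println(rejectNew(2489, clientEventIndex)); // true  -> resend requested
  }
}
```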
Jordan Halterman
@kuujo
Mar 26 2016 06:16
yeah… hmm
Madan Jampani
@madjam
Mar 26 2016 06:16
The issue seems to be the reliance on previousIndex instead of completeIndex in ServerSessionContext when building a PublishRequest
If we just used completeIndex as an argument to PublishRequest.Builder#withPreviousIndex, that was working.
Jordan Halterman
@kuujo
Mar 26 2016 06:19
so...
Madan Jampani
@madjam
Mar 26 2016 06:19
I did not see any place where we were updating previousIndex in ServerSessionContext
Jordan Halterman
@kuujo
Mar 26 2016 06:19
I can provide some insight into this
Madan Jampani
@madjam
Mar 26 2016 06:19
please!
Jordan Halterman
@kuujo
Mar 26 2016 06:25
The reason there's a per-event previousIndex is to ensure all events are received in order. When an event is published by the state machine at a given index, an EventHolder is created with the current index and the previousIndex. The previousIndex is the last index for which an event was published to the session.
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerSessionContext.java#L525-L529
This is because events can skip indexes. A state machine can publish events at indexes 1, 3, and 5. In the event that only the events for indexes 1 and 5 arrive on the client and 3 is lost, the client needs a way to say it should have seen 3 before 5. So, the event message's previousIndex will say 3. If the client's last seen index is not 3, then it rejects the request and tells the server to resend starting at > 1 (its last seen event index).
Servers will hold on to events until the client acknowledges them. The server that sends the event message can remove events from memory when the response is received from the client. Only the server to which the client is connected ever actually sends the event message. For LINEARIZABLE events, the leader will proxy the event through the server to which the client is connected. But all servers store the EventHolder in memory, so if the server to which the client is connected crashes, it should be able to reconnect to another server and continue receiving events.
In order to acknowledge events on servers to which the client is not connected, the KeepAliveRequest also contains the last index for which the client received an event. When KeepAliveEntry is applied, all servers remove events up to the client’s eventIndex and if necessary resend pending events.
The problem is, if we use completeIndex then the client can potentially miss events. Multiple events may be in flight. If there are events for indexes 1, 3, and 5, and the client has acknowledged 1, the server may send 3 and 5 concurrently. In that case, both would have completeIndex of 1, and if the event for index 3 is lost then the client will never know
does that make sense?
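A small self-contained illustration of the ordering scheme just described; the record and variable names are made up for the sketch and are not the Copycat types.

```java
import java.util.List;

// Why each event carries the index of the PREVIOUS published event rather than the
// session's completeIndex. Illustrative names only, not the Copycat classes.
public class EventChainSketch {

  record Event(long index, long previousIndex) {}

  public static void main(String[] args) {
    // The state machine publishes events at indexes 1, 3, and 5; each event's previousIndex
    // is the index of the last event published to the session.
    Event e1 = new Event(1, 0);
    Event e3 = new Event(3, 1);
    Event e5 = new Event(5, 3);

    // Suppose the event at index 3 is lost in transit, so the client sees only 1 and 5.
    long clientEventIndex = 0;
    for (Event event : List.of(e1, e5)) {
      if (event.previousIndex() > clientEventIndex) {
        // Gap detected: the client should have seen index 3 before index 5,
        // so it rejects and asks the server to resend from its last seen index.
        System.out.printf("reject event %d: expected previous %d, last seen %d%n",
            event.index(), event.previousIndex(), clientEventIndex);
        break;
      }
      clientEventIndex = event.index();
    }

    // If events carried completeIndex (the last ACKED index, here 1) instead, events 3 and 5
    // would both carry 1, and the loss of event 3 would go unnoticed by the client.
  }
}
```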
Madan Jampani
@madjam
Mar 26 2016 06:29
So completeIndex is what the client has acked?
Jordan Halterman
@kuujo
Mar 26 2016 06:29
yeah
So, in theory once clearEvents has been called, all the events that remain in the queue are the ones the client hasn’t seen
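For illustration, a toy version of that acknowledgment path: clearing acked events up to the index the client reports (via PublishResponse or KeepAliveRequest) so that only unseen events remain queued. The types here are stand-ins, not ServerSessionContext.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of clearEvents as described above; not the ServerSessionContext code.
public class EventQueueSketch {

  record EventHolder(long eventIndex, long previousIndex) {}

  private final Deque<EventHolder> events = new ArrayDeque<>();

  void publish(EventHolder holder) {
    events.add(holder);
  }

  // Remove every event the client has acknowledged; whatever remains has not been seen.
  void clearEvents(long ackedIndex) {
    while (!events.isEmpty() && events.peek().eventIndex() <= ackedIndex) {
      events.poll();
    }
  }

  public static void main(String[] args) {
    EventQueueSketch session = new EventQueueSketch();
    session.publish(new EventHolder(2489, 2438));
    session.publish(new EventHolder(2491, 2489));

    // KeepAliveEntry[..., eventIndex=2438] applied: nothing at or below 2438 remains queued,
    // and the pending events (2489, 2491) would be resent to the client.
    session.clearEvents(2438);
    System.out.println(session.events);
  }
}
```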
Madan Jampani
@madjam
Mar 26 2016 06:31
is previousIndex ever updated?
scratch that
It should be immutable for a given event
Jordan Halterman
@kuujo
Mar 26 2016 06:32
right
this bit of logic assigns the previousIndex for an event based on the last event that was published if the current state machine index has since changed
Madan Jampani
@madjam
Mar 26 2016 06:34
got it.
So when a follower becomes leader.
Let me pull up some log entries from the run when I saw this.
Jordan Halterman
@kuujo
Mar 26 2016 06:40
It changes when any command is applied to the state machine. An event can be published in response to any command. Prior to the command being applied, the context.index() that is used for eventIndex and previousIndex is set to the command’s index: https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L927 If an event is published by the state machine, the EventHolder is created. Once the command is done being applied, the event is actually sent: https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerSessionContext.java#L524-L529
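A rough, self-contained model of that sequence; the names are made up, and the real flow lives in the linked ServerStateMachine/ServerSessionContext code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the sequence described above: the command's index becomes the event's
// eventIndex, the last published event's index becomes its previousIndex, and events are
// only sent once the command has finished being applied. Acks and retention are omitted.
public class EventFlowSketch {

  record EventHolder(long eventIndex, long previousIndex, String event) {}

  static long currentIndex;                  // stand-in for context.index()
  static long lastEventIndex = 2155;         // initialized to the sessionId, as discussed
  static final Deque<EventHolder> queued = new ArrayDeque<>();

  static void applyCommand(long commandIndex, boolean publishesEvent) {
    currentIndex = commandIndex;             // set before the command is applied
    if (publishesEvent) {                    // the state machine publishes an event
      queued.add(new EventHolder(currentIndex, lastEventIndex, "changeEvents"));
      lastEventIndex = currentIndex;
    }
    while (!queued.isEmpty()) {              // once applied, the queued events are sent
      System.out.println("send " + queued.poll());
    }
  }

  public static void main(String[] args) {
    applyCommand(2489, true);   // send EventHolder[eventIndex=2489, previousIndex=2155, ...]
    applyCommand(2490, false);  // no event published for this command
    applyCommand(2491, true);   // send EventHolder[eventIndex=2491, previousIndex=2489, ...]
  }
}
```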
None of the state of the session should change when a leader change occurs. The only notion of leader changes within the ServerStateMachine is the application of the InitializeEntry (no-op entry) that is required by Raft.
I did fix some bugs a while back in ServerSessionContext that did result from leader changes, so this certainly still may be the case
Trying to remember that
Madan Jampani
@madjam
Mar 26 2016 06:43
Here is the first event published by the new leader and the response from the client
37636 2016-03-25 08:21:41,106 | DEBUG | artition-3-state | ServerSessionContext | 2155 - Sending PublishRequest[session=2155, eventIndex=2491, previousIndex=2489, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...] 37637 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext | 2155 - Received PublishResponse[status=ERROR, index=2438] 37638 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext | 2155 - Sending PublishRequest[session=2155, eventIndex=2489, previousIndex=2155, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
oops. I messed that up :)
37636 2016-03-25 08:21:41,106 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Sending PublishRequest[session=2155, eventIndex=2491, previousIndex=2489, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
 37637 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Received PublishResponse[status=ERROR, index=2438]
 37638 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Sending PublishRequest[session=2155, eventIndex=2489, previousIndex=2155, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
Jordan Halterman
@kuujo
Mar 26 2016 06:46
:-)
hmm
that’s odd
those numbers don’t match up at all
If the client is sending back 2438, then the next PublishRequest sent by the server should have previousIndex=2438
It’s even more strange that it’s actually lower
Madan Jampani
@madjam
Mar 26 2016 06:49
basically reset to sessionId
Jordan Halterman
@kuujo
Mar 26 2016 06:49
yeah
hmmm
I may have a thought
just have to work it out in my head :-)
when does the leader change occur, right before this?
and this is on the new leader?
Madan Jampani
@madjam
Mar 26 2016 06:51
yes.
this is the first publish message sent out from the new leader to this client.
Jordan Halterman
@kuujo
Mar 26 2016 06:51
alright I think I might know the issue… just gotta explore a bit
what was the last KeepAliveEntry applied for that session?
you have that log?
Madan Jampani
@madjam
Mar 26 2016 06:53
I should. Let me look it up.
Jordan Halterman
@kuujo
Mar 26 2016 06:53
thanks
Madan Jampani
@madjam
Mar 26 2016 06:53
you mean the KeepAliveEntry right before this PublishRequest on the new leader?
Received KeepAliveRequest[session=2155, commandSequence=117, eventIndex=2438]
Appended KeepAliveEntry[index=2485, term=9, session=2155, commandSequence=117, eventIndex=2438, timestamp=1458919299850]
Applying KeepAliveEntry[index=2485, term=9, session=2155, commandSequence=117, eventIndex=2438, timestamp=1458919299850]
Sent KeepAliveResponse[status=OK, error=null, leader=/192.168.56.104:9876, members=[/192.168.56.101:9876, /192.168.56.102:9876, /192.168.56.104:9876]]
Jordan Halterman
@kuujo
Mar 26 2016 06:57
yeah
Madan Jampani
@madjam
Mar 26 2016 06:57
These log lines are from the new leader node.
Jordan Halterman
@kuujo
Mar 26 2016 06:57
yup perfect
Madan Jampani
@madjam
Mar 26 2016 07:04
Sending PublishRequest[session=2155, eventIndex=2489, previousIndex=2155
In fact this is the first publish from the new leader to the client.
Jordan Halterman
@kuujo
Mar 26 2016 07:05
That indeed does seem to indicate what you said - that eventIndex was still initialized to the session ID
Because ServerSessionContext is only created when RegisterEntry is applied, that seems to indicate that events were never published on that node
after the RegisterEntry was applied
maybe… gonna go poke through the codes
Madan Jampani
@madjam
Mar 26 2016 07:08
That is most likely the case.
Jordan Halterman
@kuujo
Mar 26 2016 07:11
ooooh
I think I see it
is that the order of these two lines?
maybe
Madan Jampani
@madjam
Mar 26 2016 07:13
hmmm
Jordan Halterman
@kuujo
Mar 26 2016 07:13
If the client already acknowledged events it just ignores published messages
that means eventIndex is never updated, and once the node becomes leader and publishes a new event its eventIndex is still its session ID
It at least needs to update eventIndex there
so it can use that for previousIndex for later published events
Madan Jampani
@madjam
Mar 26 2016 07:15
what is commit.index()?
Jordan Halterman
@kuujo
Mar 26 2016 07:16
the index of the current command being applied to the state machine
hmmm… I’m not sure that actually makes sense though. The only way that can be updated on a server that didn’t publish the event is through a KeepAliveEntry. Because of the order of the log, KeepAliveEntry will always have an eventIndex less than the current index. So, that condition is actually probably never true
Madan Jampani
@madjam
Mar 26 2016 07:17
That is what I was thinking as well.
Are lines 526 and 527 in the right order?
Jordan Halterman
@kuujo
Mar 26 2016 07:18
yeah…
long previousIndex = eventIndex; // eventIndex is now previousIndex
eventIndex = context.index(); // currentIndex is now eventIndex
Madan Jampani
@madjam
Mar 26 2016 07:20
So here's what happened (I think)
node 1 was restarted followed by node 2
between node1 and node2 restarts no new events are published by node2 (the leader)
when node2 was restarted node1 becomes leader
And on node1 Line 525 evaluates to true (because event == null)
Jordan Halterman
@kuujo
Mar 26 2016 07:23
ahh
Madan Jampani
@madjam
Mar 26 2016 07:23
So it initializes previousIndex to eventIndex which is initialized to sessionId
Jordan Halterman
@kuujo
Mar 26 2016 07:23
right
hmmm
but the replay of entries on restart should result in events being recreated
trying to think if there’s a reason that wouldn’t happen
there is a reason...
actually… I just realized something interesting in the logs
Jordan Halterman
@kuujo
Mar 26 2016 07:29
Sending PublishRequest[session=2155, eventIndex=2491, previousIndex=2489, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
This previousIndex indicates that the new leader does have an event at index 2491. That means eventIndex should not be 2155. It seems to just be the case that the first event that the new leader does have is for index 2489. So, the leader has events 2489 and 2491
It never published what the client received last - 2438
Madan Jampani
@madjam
Mar 26 2016 07:30
Yes. That was published by the old leader
Jordan Halterman
@kuujo
Mar 26 2016 07:30
oh gotcha
the second was the first one by the new leader
37636 2016-03-25 08:21:41,106 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Sending PublishRequest[session=2155, eventIndex=2491, previousIndex=2489, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
 37637 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Received PublishResponse[status=ERROR, index=2438]
 37638 2016-03-25 08:21:41,109 | DEBUG | artition-3-state | ServerSessionContext             |  2155 - Sending PublishRequest[session=2155, eventIndex=2489, previousIndex=2155, events=[Event[event=changeEvents, message=InstanceEvent[resource=86, message=[...]
Madan Jampani
@madjam
Mar 26 2016 07:31
The first publish by the new leader is actually this
Jordan Halterman
@kuujo
Mar 26 2016 07:31
those aren’t all from the new leader?
Madan Jampani
@madjam
Mar 26 2016 07:31
Sending PublishRequest[session=2155, eventIndex=2489, previousIndex=2155
Jordan Halterman
@kuujo
Mar 26 2016 07:31
oh ok
makes sense
Madan Jampani
@madjam
Mar 26 2016 07:34
And to this the client responds
PublishResponse[status=ERROR, index=2438]
Jordan Halterman
@kuujo
Mar 26 2016 07:34
How many nodes in the cluster? They’re using persistent storage?
Madan Jampani
@madjam
Mar 26 2016 07:35
3, and yes, it's a disk-based log.
Jordan Halterman
@kuujo
Mar 26 2016 07:37
I think this is probably the same reason I did .withPreviousIndex(Math.max(event.previousIndex, completeIndex)) previously, which seems like it would fix this. But I’m concerned that would be masking the real problem
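As a sketch of what that change amounts to at the point where the PublishRequest is built; everything other than the Math.max(event.previousIndex, completeIndex) expression itself is an illustrative name, not the actual ServerSessionContext code.

```java
// Minimal sketch of the fix under discussion: never advertise a previousIndex lower than
// what the client has already acknowledged.
public class PreviousIndexFixSketch {

  record EventHolder(long eventIndex, long previousIndex) {}

  public static void main(String[] args) {
    long completeIndex = 2438;                        // highest index the client has acked
    EventHolder event = new EventHolder(2489, 2155);  // previousIndex stale: reset to sessionId

    // Value that would be passed to PublishRequest.Builder#withPreviousIndex(...)
    long previousIndex = Math.max(event.previousIndex(), completeIndex);
    System.out.println("withPreviousIndex(" + previousIndex + ")"); // 2438, matching the client
  }
}
```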
Madan Jampani
@madjam
Mar 26 2016 07:37
If the client advertises a last received eventIndex of 2438 and our current previousIndex is 2155, is there something wrong with bumping up previousIndex to the client-advertised event index?
Jordan Halterman
@kuujo
Mar 26 2016 07:37
That’s what that would do ^^
Madan Jampani
@madjam
Mar 26 2016 07:37
yeah.
Jordan Halterman
@kuujo
Mar 26 2016 07:38
Just have to decide whether that could break anything
Madan Jampani
@madjam
Mar 26 2016 07:38
That does fix the issue in my test runs.
Jordan Halterman
@kuujo
Mar 26 2016 07:38
I say submit a PR to do that and I’ll try to think about that in the meantime
Madan Jampani
@madjam
Mar 26 2016 07:38
but I cannot be sure of side effects.
Jordan Halterman
@kuujo
Mar 26 2016 07:40
hmm…. actually
I think it could very well be safe to do
just going to ponder for a while
Madan Jampani
@madjam
Mar 26 2016 07:50
atomix/copycat#200
Jordan Halterman
@kuujo
Mar 26 2016 07:50
awesome
Madan Jampani
@madjam
Mar 26 2016 07:50
It is starting to make sense to me (I could be wrong :) )
the new leader should not even have buffered any of the previously published events.
what could prevent the new leader from creating events on log replay?
I mean from which index does the new leader start creating events?
I guess I can look it up. But I'm too lazy to think now :)
Jordan Halterman
@kuujo
Mar 26 2016 07:57
There are things that can prevent events from being created… For example, if a snapshot is taken then all the prior commands are removed. In that case, if a node crashes and is restarted, those prior events are lost. State machines take care to ensure commands for events that are pending will never be removed from the log, but commands that were acknowledged can be
Also, a new server that joins the cluster may miss a lot of commands
Once a command has been applied to the state machine and has been closed or snapshotted and all related events have been acknowledged, the command can be excluded from replication altogether
I think that’s why I put that Math.max call in in the first place
Hopefully, the Math.max usage is safe since the state machine ensures commands that have pending events can never be compacted from the log or excluded from replication
So, any leader should apply those commands and queue those events
The state machine’s lastCompleted index is the lowest index for which all related events have been acknowledged by clients
That amounts to the index up to which logs can be compacted
That ensures a new leader will always have commands related to events that haven’t yet been acknowledged by all clients
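A toy illustration of that invariant, assuming lastCompleted is simply the minimum of the sessions' acked indexes; the names are illustrative, not the ServerStateMachine API.

```java
import java.util.List;

// Toy illustration: the log may only be compacted up to the lowest index for which every
// session has acknowledged its events, so unacked events always survive on a new leader.
public class CompactionBoundSketch {

  record Session(long completeIndex) {}  // highest event index acked by this session's client

  static long lastCompleted(List<Session> sessions, long lastApplied) {
    long min = lastApplied;
    for (Session session : sessions) {
      min = Math.min(min, session.completeIndex());
    }
    return min;  // commands above this index must be kept for replication/replay
  }

  public static void main(String[] args) {
    List<Session> sessions = List.of(new Session(2438), new Session(2491));
    System.out.println(lastCompleted(sessions, 2495)); // 2438
  }
}
```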
But none of that still seems to totally apply to this scenario
Jordan Halterman
@kuujo
Mar 26 2016 08:03
I’m not sure this had anything to do with how commands are compacted or excluded from replication
Madan Jampani
@madjam
Mar 26 2016 08:04
so it is possible that all the event-generating commands have been compacted from the log.
before the leadership change occurred.
Jordan Halterman
@kuujo
Mar 26 2016 08:04
yeah
totally
I think this fix is safe
considering the invariants
Madan Jampani
@madjam
Mar 26 2016 08:05
so the new leader (which itself was recently restarted) has to first start from the default previousIndex for each session
I too think that it is safe.
Jordan Halterman
@kuujo
Mar 26 2016 08:07
k awesome. I think this fixes a big issue others have seen too. thanks a lot!
Madan Jampani
@madjam
Mar 26 2016 08:08
thanks for your help! It filled some holes in my understanding.
Jordan Halterman
@kuujo
Mar 26 2016 08:08
:-)
Madan Jampani
@madjam
Mar 26 2016 08:09
so when are you planning the next rc?
Jordan Halterman
@kuujo
Mar 26 2016 08:09
I should probably take a nap now
now is good for me
nothing waiting except for this
Madan Jampani
@madjam
Mar 26 2016 08:10
cool. get some sleep now. you can do it tomorrow. :)
alright take care. i'll crash now.
Jordan Halterman
@kuujo
Mar 26 2016 08:10
thanks! adios
Jordan Halterman
@kuujo
Mar 26 2016 08:19
actually running the tests and releasing while I take a shower… Atomix may just be a few days behind though
got some great stuff fixed
err maybe I’ll do Atomix tomorrow too
Jordan Halterman
@kuujo
Mar 26 2016 08:44
Catalyst and Copycat both released
Jordan Halterman
@kuujo
Mar 26 2016 08:59
I actually have some really extensive tests to run that I haven’t been able to get to pass consistently yet. Should be interesting to see where it’s at now
Jordan Halterman
@kuujo
Mar 26 2016 09:45
So far, it looks like this may have resolved those test issues as well. I’ve run them all twice, both times to complete success. Will continue to run them in multiple contexts
Jordan Halterman
@kuujo
Mar 26 2016 09:55
A bit of a post-mortem: This has worked out well. My general philosophy is to favor potentially breaking things over hiding the symptom of a larger issue. I removed that line to adhere to that philosophy, so in that sense this was an expected part of the process. Because it was removed, we were able to detect an obvious misbehavior, discuss it at length, and agree on the fix. That is preferable to me over allowing a more subtle, difficult-to-notice bug. Had that line remained and only fixed a symptom of a larger issue, it would have resulted in infrequently lost messages at the beginning of a leader's term, and that would have been more difficult to detect.
There is still an issue that has been seen by some users that I've been unable to reproduce. Specifically, people are seeing request timeouts with the NettyTransport. It's difficult to tell if this is indicative of any issue. It's not abnormal for timeouts to occur. The algorithm is designed to handle connection issues gracefully, and request timeouts amount to connection issues. Reports have seemed to indicate that it happens frequently, but I've been unable to reproduce this in any environment. I do wonder if the DNS resolution fix could have helped this though.
Jordan Halterman
@kuujo
Mar 26 2016 10:20
We can also implement the performance optimizations for appending to the log and pipelining AppendRequest before the 1.0 release. That should help significantly with performance as well. Then, after the 1.0 release we can worry about threading.
Madan Jampani
@madjam
Mar 26 2016 15:12
I'm fairly confident the Netty timeouts are due to the DNS bug. Since that change I have seen a significant improvement in the stability of my runs (btw, I use a custom transport that is built on an existing messaging substrate, which itself is built on top of Netty)
prior to that bug fix I would see all sorts of timeouts. If I took down a node and brought it back online, it basically made the system unusable. The leader has to transfer a big batch of log entries to the new nodes, and the serialization of each such batch required a ton of reverse DNS lookups. By the time batch construction finished, the leader would be replaced and the problem would repeat at the new leader.
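For context on where such lookups typically come from in Java (this is plain JDK behavior, not the actual Catalyst or custom-transport code): for an address built from a raw IP, InetSocketAddress#getHostName() performs a reverse DNS query, while getHostString() does not.

```java
import java.net.InetSocketAddress;

// Plain-JDK illustration of the reverse-lookup trap; not the Catalyst serialization code.
public class ReverseDnsSketch {
  public static void main(String[] args) {
    InetSocketAddress address = new InetSocketAddress("192.168.56.104", 9876);
    System.out.println(address.getHostString()); // "192.168.56.104" -- no DNS lookup
    System.out.println(address.getHostName());   // may block on a reverse DNS lookup
  }
}
```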
Madan Jampani
@madjam
Mar 26 2016 15:20
Also, the DNS issue was only possible if you were using IP addresses and not host names. If host names are specified, the reverse DNS lookup just uses the cached value. Maybe that is why you were not able to reproduce it?
Jordan Halterman
@kuujo
Mar 26 2016 15:21
Yeah I thought that might be it. Makes sense
samirsdoshi
@samirsdoshi
Mar 26 2016 15:26
Are there any upcoming features which would allow applications with Atomix to be deployed to the cloud?
Richard Pijnenburg
@electrical
Mar 26 2016 15:30
@samirsdoshi how do you mean? it's a Java library so you can build anything on top of that
or do you mean things like auto discovery with EC2 ?
samirsdoshi
@samirsdoshi
Mar 26 2016 15:39
@electrical it requires knowing IP addresses to form a cluster, which may not be possible. also needs ports opened, which may be restrictive in the cloud.
Richard Pijnenburg
@electrical
Mar 26 2016 15:41
Ah okay, I see. Well, you can choose the ports yourself, so you have freedom there. With regards to the IP addresses, I just opened an issue atomix/catalyst#201 for getting the IPs via alternative ways
with Amazon for example you can use a security group and the API to find out what the IPs are
oops. atomix/copycat#201 i mean
the same kind of thing can be implemented for other cloud providers, as long as there is a way to group the nodes and get the IPs.
is there a specific thing you are trying to find a solution for?
samirsdoshi
@samirsdoshi
Mar 26 2016 15:43
Ok, great. Also, what if we add more VMs which register themselves in a group, would Atomix still know the majority of votes required to elect a leader?
Richard Pijnenburg
@electrical
Mar 26 2016 15:44
it should yeah as far as i know.
samirsdoshi
@samirsdoshi
Mar 26 2016 15:44
Trying to evaluate whether applications using Atomix for leader election can be deployed to the cloud and what issues need to be addressed.
Richard Pijnenburg
@electrical
Mar 26 2016 15:44
i see :-)
Kuujo is the main author, i just build small things and test stuff
But having an extension to allow for alternative discovery would be very useful indeed
with EC2 you could add them to a security group with predefined firewall rules and then we can do the discovery using the EC2 API.
samirsdoshi
@samirsdoshi
Mar 26 2016 15:46
Thanks. Yes, Making Atomix cloud ready would be a big win.
Richard Pijnenburg
@electrical
Mar 26 2016 15:46
I’m unsure how other cloud providers work since I don’t have access to them, but if someone is able to provide info / resources for that we can defo look at that :-)
samirsdoshi
@samirsdoshi
Mar 26 2016 15:46
Got it..
Richard Pijnenburg
@electrical
Mar 26 2016 15:47
only thing I’ve implemented so far in this stack is the TLS support :-)
samirsdoshi
@samirsdoshi
Mar 26 2016 15:48
Ok. I am looking at Cloud Foundry...
Richard Pijnenburg
@electrical
Mar 26 2016 15:51
okay. i know of it but never worked with it
ah, Cloud Foundry uses other virtualization solutions underneath
Richard Pijnenburg
@electrical
Mar 26 2016 16:19
Not sure how it would work with CF, but I’m sure in time we can find a solution :-)
samirsdoshi
@samirsdoshi
Mar 26 2016 18:42
Great. Thanks @electrical
Richard Pijnenburg
@electrical
Mar 26 2016 23:16
@kuujo @jhalterman wonder what you guys think of the alternative discovery thing.
perhaps we could do a .withDiscovery(new DiscoveryStatic(members)) and .withDiscovery(new DiscoveryEC2(settings)) ?
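Something along these lines, perhaps; all of the types below are hypothetical, and the Address record is just a stand-in, not Catalyst's Address class. A builder could then accept either implementation without the rest of the stack changing.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of the proposed discovery abstraction -- none of these types exist
// in Atomix/Copycat/Catalyst; they only illustrate the idea.
record Address(String host, int port) {}

interface Discovery {
  Collection<Address> discover();  // resolve the initial set of member addresses
}

class DiscoveryStatic implements Discovery {
  private final Collection<Address> members;

  DiscoveryStatic(Collection<Address> members) {
    this.members = members;
  }

  @Override
  public Collection<Address> discover() {
    return members;  // behaves exactly like today's static member list
  }
}

class DiscoveryEC2 implements Discovery {
  private final String securityGroup;

  DiscoveryEC2(String securityGroup) {
    this.securityGroup = securityGroup;
  }

  @Override
  public Collection<Address> discover() {
    // Would query the EC2 API for instances in the security group; omitted in this sketch.
    return List.of();
  }
}
```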
Jordan Halterman
@kuujo
Mar 26 2016 23:40
Yeah that would be interesting
Richard Pijnenburg
@electrical
Mar 26 2016 23:40
unsure where this abstraction would fit.. copycat or atomix
Jordan Halterman
@kuujo
Mar 26 2016 23:40
As long as configuration changes are always driven by the Raft algorithm
It's perfectly safe
The problem is just at startup
Richard Pijnenburg
@electrical
Mar 26 2016 23:41
in the end it should generate a list of addresses like it’s done now
so I think implementation-wise it won’t change that much, except at startup it will connect to EC2 and find the addresses.
it’s mainly handy for clients for automated discovery
Could perhaps build one for Consul as well :p
Jordan Halterman
@kuujo
Mar 26 2016 23:43
The problem is when a node starts there's a chance of split brain depending on how discovery is done. For example, if a three node cluster is started, and a fourth node is started but can't see the first three nodes, it could form its own cluster. That would mean there are two leaders, each of which can accept writes. When the partition heals, they converge and one side's writes are overwritten by the other side's. This is why Atomix/Copycat have a static list at startup.
Richard Pijnenburg
@electrical
Mar 26 2016 23:44
I see
Jordan Halterman
@kuujo
Mar 26 2016 23:44
The static membership list ensures this can't happen, since when a single node is started it can never be in the majority by itself
Richard Pijnenburg
@electrical
Mar 26 2016 23:45
there is a setting that enforces the number of expected masters right?
Jordan Halterman
@kuujo
Mar 26 2016 23:45
Unless it's the only member in the cluster
Yeah
Richard Pijnenburg
@electrical
Mar 26 2016 23:45
I think this feature will be mainly handy for clients connecting
Jordan Halterman
@kuujo
Mar 26 2016 23:45
The other way it can be done with discovery is by saying this is the first node and this is not the first node. The first node that's started elects itself leader, and all future nodes expect to join that cluster
That's true. That can definitely be done safely
But the same thing can be done with DNS too
Richard Pijnenburg
@electrical
Mar 26 2016 23:47
with Consul they have added a flag that forces a single instance to be elected master. otherwise, if it can’t join an existing cluster, it will fail
mainly for bootstrapping the cluster
DNS can be used as well, but with things like EC2 or Rackspace the DNS names of machines change all the time.
Jordan Halterman
@kuujo
Mar 26 2016 23:50
I'm thinking of Hazelcast, which Atomix is largely modeled on. It uses multicast for discovery in normal networks and the EC2 API in AWS. The problem with that is just determining when a node is the first node started. As long as something says that, it's fine. I'm actually open to replacing the membership list in Atomix with a flag like that. I think that's what the Raft dissertation recommends too
I think the membership list is too confusing and doesn't buy much
You can just start nodes one at a time and get the same result with less complexity
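As a sketch of that bootstrap-versus-join idea (a hypothetical API, not Atomix's): exactly one node is told to form a new cluster, and every other node must join an existing one or fail.

```java
import java.util.List;

// Hypothetical bootstrap/join flag, similar to what Consul does.
public class BootstrapSketch {

  enum StartMode { BOOTSTRAP, JOIN }

  static void start(StartMode mode, List<String> seeds) {
    switch (mode) {
      case BOOTSTRAP:
        System.out.println("Forming a new single-node cluster; this node elects itself leader");
        break;
      case JOIN:
        if (seeds.isEmpty()) {
          // Refusing to form a cluster on its own is what prevents split brain.
          throw new IllegalStateException("no existing cluster to join; refusing to start");
        }
        System.out.println("Joining existing cluster via " + seeds);
        break;
    }
  }

  public static void main(String[] args) {
    start(StartMode.BOOTSTRAP, List.of());
    start(StartMode.JOIN, List.of("192.168.56.101:9876", "192.168.56.102:9876"));
  }
}
```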
Richard Pijnenburg
@electrical
Mar 26 2016 23:51
yeah indeed. but we don’t have multicast in Atomix right?
Jordan Halterman
@kuujo
Mar 26 2016 23:51
No that's planned but not really a priority at all
Madan Jampani
@madjam
Mar 26 2016 23:52
Hazelcast comparison has some limitations in my opinion
Richard Pijnenburg
@electrical
Mar 26 2016 23:52
Elasticsearch moved away from multicast because of ‘security’ related issues
Madan Jampani
@madjam
Mar 26 2016 23:52
Hazelcast set up allows for a split brain. Atomix/Copycat should not.
Richard Pijnenburg
@electrical
Mar 26 2016 23:52
not saying we should follow the same advice, but just something to think about :-)
and multicast is blocked on so many different networks anyway :-)
Jordan Halterman
@kuujo
Mar 26 2016 23:55
Yeah that's what I'm saying above ^^ If there's a flag indicating that a node is starting a new cluster or joining an existing cluster then there is no split brain issue. Split brain is the result of nodes expecting to discover whether they're forming a new cluster. This isn't any different than what Copycat does now. Copycat currently just says these n nodes are the first nodes in the cluster.
This isn't just about multicast, it's about any discovery
Richard Pijnenburg
@electrical
Mar 26 2016 23:55
i agree yeah
Jordan Halterman
@kuujo
Mar 26 2016 23:57
Raft recommends the same thing that would be done for multicast. That is, start a single node cluster, add a node to that cluster, add a node to that cluster. Multicast or any service discovery would just be used to find that first node or cluster of nodes. Something just has to indicate whether there's a cluster already running.
Richard Pijnenburg
@electrical
Mar 26 2016 23:57
agreed
Jordan Halterman
@kuujo
Mar 26 2016 23:58
This is the closest Raft can get to what Hazelcast does safely
Richard Pijnenburg
@electrical
Mar 26 2016 23:58
just on the side of multicast, i’m not sure if it makes sense to implement it.