These are chat archives for atomix/atomix

4th
Apr 2016
Roman Pearah
@neverfox
Apr 04 2016 00:10
@kuujo Sounds good.
Jordan Halterman
@kuujo
Apr 04 2016 01:21
I like this way more. I think it's great. Just gotta update the Javadoc and I'll push it and update Atomix tonight as well. Also merging the last Atomix PR and still planning on a release tomorrow
Roman Pearah
@neverfox
Apr 04 2016 03:18
@kuujo Great!
Jordan Halterman
@kuujo
Apr 04 2016 03:37
// Bootstrap a single-node cluster on the first server:
Address address = new Address("localhost", 5000);
CopycatServer server = CopycatServer.builder(address)
  .withTransport(new NettyTransport())
  .build();
server.bootstrap().join();

// Then join additional servers to it:
Address address = new Address("localhost", 5001);
CopycatServer server = CopycatServer.builder(address)
  .withTransport(new NettyTransport())
  .build();
server.join(new Address("localhost", 5000)).join();
or
// Bootstrap a three-node cluster concurrently:
CopycatServer server1 = CopycatServer.builder(new Address("localhost", 5000))
  .withTransport(new NettyTransport())
  .build();

CopycatServer server2 = CopycatServer.builder(new Address("localhost", 5001))
  .withTransport(new NettyTransport())
  .build();

CopycatServer server3 = CopycatServer.builder(new Address("localhost", 5002))
  .withTransport(new NettyTransport())
  .build();

Collection<Address> cluster = Arrays.asList(
  new Address("localhost", 5000),
  new Address("localhost", 5001),
  new Address("localhost", 5002)
);

CompletableFuture.allOf(server1.bootstrap(cluster), server2.bootstrap(cluster), server3.bootstrap(cluster)).join();

// Then join a fourth server to the running cluster:
CopycatServer server = CopycatServer.builder(new Address("localhost", 5003))
  .withTransport(new NettyTransport())
  .build();
server.join(cluster).join();
Jordan Halterman
@kuujo
Apr 04 2016 04:01
I’m seeing something odd with session events:
20:52:41.400 [copycat-server-localhost/127.0.0.1:5000-copycat-state] DEBUG i.a.c.s.state.ServerSessionContext - 11 - Sending PublishRequest[session=11, eventIndex=18, previousIndex=17, events=[Event[event=join, message=GroupMemberInfo[member=test]], Event[event=resign, message=test], Event[event=term, message=18], Event[event=elect, message=test]]]
20:52:41.400 [copycat-server-localhost/127.0.0.1:5001-copycat] DEBUG i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Received PublishRequest[session=11, eventIndex=18, previousIndex=17, events=[Event[event=join, message=GroupMemberInfo[member=test]], Event[event=resign, message=test], Event[event=term, message=18], Event[event=elect, message=test]]]
20:52:41.400 [copycat-client-2] DEBUG i.a.c.client.session.ClientSession - 11 - Received PublishRequest[session=11, eventIndex=18, previousIndex=17, events=[Event[event=join, message=GroupMemberInfo[member=test]], Event[event=resign, message=test], Event[event=term, message=18], Event[event=elect, message=test]]]
...
20:52:43.855 [copycat-client-2] DEBUG i.a.c.client.session.ClientSession - 11 - Sending KeepAliveRequest[session=11, commandSequence=3, eventIndex=15]
The session logs Received PublishRequest[session=11, eventIndex=18, previousIndex=17, ...] and then two seconds later sends back KeepAliveRequest[session=11, commandSequence=3, eventIndex=15]
Then the next publish attempt fails, as it should:
20:52:43.879 [copycat-client-2] DEBUG i.a.c.client.session.ClientSession - 11 - Received PublishRequest[session=11, eventIndex=18, previousIndex=17, events=[Event[event=join, message=GroupMemberInfo[member=test]], Event[event=resign, message=test], Event[event=term, message=18], Event[event=elect, message=test]]]
20:52:43.879 [copycat-client-2] DEBUG i.a.c.client.session.ClientSession - 11 - Inconsistent event index: 17
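That rejection comes from the client checking previousIndex against the last event index it saw; roughly this (a simplified sketch, not the actual ClientSession code):

// Illustrative sketch only - not the actual ClientSession implementation.
class EventSequencer {
  private long eventIndex; // last event index this client has seen

  // Returns true if the publish is in sequence and can be dispatched;
  // false means "respond with ERROR and eventIndex so the server resends",
  // which produces the "Inconsistent event index: 17" line above.
  boolean handlePublish(long previousIndex, long index) {
    if (previousIndex != eventIndex) {
      return false;
    }
    eventIndex = index;
    return true;
  }
}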
ahh nvm I see it
in the KeepAliveRequest, the ClientSessionManager sends the complete index which hasn’t yet been incremented

But I think there may be a bug wherein error responses:

20:52:43.879 [copycat-server-localhost/127.0.0.1:5001-copycat-state] DEBUG i.a.c.s.state.ServerSessionContext - 11 - Received PublishResponse[status=ERROR, index=18]

Aren’t resulting in linearizable futures being properly completed in the state machine

thus the original command response is never sent back to the client
Jordan Halterman
@kuujo
Apr 04 2016 04:29
Some of my problem was actually user error, but another problem is that SessionManager and SessionListener give the cluster two different kinds of feedback: the former sends completeIndex back to the cluster, and the latter eventIndex. There may need to be two indexes, one indicating the index the client has received and one indicating the index the client has processed, for linearizability.
But it does seem like in some cases event futures are not triggered via error responses or keep-alives. Going to investigate both now
Jordan Halterman
@kuujo
Apr 04 2016 08:36
I pushed a branch that fixes the session event issue, but I'm not totally sure if I want to do this. It basically adds that second index and something like acks for session events. It's just hard to decide on the consistency guarantees for session events. The changes I made will certainly slow events down a bit but provide significantly greater consistency and flexibility, and I think I'm willing to make that tradeoff for linearizable events, which are not designed for performance.
The API changes for cluster configurations are done, but I had to do the same for client connections in order to get it to work well in Atomix: client.connect(cluster). I got sidetracked on the event fixes, so Atomix is still a work in progress. I think quite a few changes need to be made, e.g. to properties files, but that will be fine. The standalone server will have to be run with bootstrap or join flags with a list of Addresses, but I like that anyway. Will continue hacking on those changes tomorrow.
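For reference, the reworked client connect looks roughly like this (a sketch; the builder details are assumed from the server examples above):

// Sketch of the reworked client API; builder details here are assumed
// from the server examples above rather than confirmed.
CopycatClient client = CopycatClient.builder()
  .withTransport(new NettyTransport())
  .build();
client.connect(cluster).join();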
Madan Jampani
@madjam
Apr 04 2016 22:03
@kuujo Are you planning to open a PR for the session events issue you describe above? I’d be interested to take a look
Jordan Halterman
@kuujo
Apr 04 2016 22:03
yeah I will tonight
just need to make one more small change
I realized a little change can make sure we don’t lose performance there
Basically, the problem was the way clients ack events. In the PublishResponse, the client sent its eventIndex, but in the KeepAliveRequest, the client was sending its completeIndex. The difference between the two is that eventIndex indicates the highest event the client has received, while completeIndex indicates the highest event the client has actually processed. completeIndex is tracked to ensure linearizability - that a client’s listeners actually saw an event before it was acknowledged - so that the events triggered by a command happen before the client that submitted the command receives the response. Since completeIndex is always lower than eventIndex, a KeepAliveRequest could contain a lower index than what the client had already acknowledged. So, basically, I just created the notion of separate indexes in the ServerSession: one tracks events the client hasn’t seen, and the other events the client hasn’t processed (for linearizability).
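Roughly, the server-side bookkeeping ends up like this (a simplified sketch; the names are illustrative, not the real ServerSession code):

// Illustrative sketch only - not the actual ServerSession implementation.
class ServerSessionIndexes {
  private long eventIndex;    // highest event the client has received
  private long completeIndex; // highest event the client has processed

  // PublishResponse acks only advance eventIndex.
  void onPublishResponse(long index) {
    eventIndex = Math.max(eventIndex, index);
  }

  // KeepAliveRequest acks advance completeIndex, which is what gates
  // completion of linearizable command futures.
  void onKeepAlive(long index) {
    completeIndex = Math.max(completeIndex, index);
    // ...complete pending linearizable futures up to completeIndex...
  }
}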
Jordan Halterman
@kuujo
Apr 04 2016 22:09
Linearizability is important to me e.g. to ensure all nodes see a group member join the group before the join operation completes. Without having that separate completeIndex, we can only guarantee that the client received the event but not necessarily that event listeners were called (if e.g. the event thread is blocked by something else)
Madan Jampani
@madjam
Apr 04 2016 22:11
Hmm. Does that mean a buggy client can block the entire system?
For example, a client receives the event but can never complete it because one of its listeners blocks indefinitely during processing
Jordan Halterman
@kuujo
Apr 04 2016 22:35
Actually, I fixed that too
One of my tests was blocking in an event callback. So, currently a client can block events by blocking in an event listener, but with the final change, a client won't be able to block events any more except by not acking them. If a client doesn't complete the event, then the command that triggered it (if linearizable) will not be able to complete. But not completing an event won't prevent other events from being received... they just can't be removed from memory on the server. There are definitely still ways this can go awry
If a client doesn't ack an event that is
There could perhaps be a timeout on acks
Jordan Halterman
@kuujo
Apr 04 2016 22:45
But the problem is that’s implementation-dependent. If a linearizable event is timed out and the original command is failed, it’s really a partial completion: the command was written and applied to the state machine, but the resulting events were never handled. Still, I’m fine with that if we need a way to prevent faulty clients from blocking the completion of commands.
Madan Jampani
@madjam
Apr 04 2016 22:50
I think we definitely will need some protection (for the cluster) from faulty clients. Timing out event handling is one option. I'm thinking timing out the session altogether is another.
Jordan Halterman
@kuujo
Apr 04 2016 22:50
that would be better IMO
should be easy to do
better to me because a lost session at least implies that consistency was lost
Madan Jampani
@madjam
Apr 04 2016 22:51
Yeah. I agree. We can piggyback it on top of the existing keep-alive mechanism, right?
Jordan Halterman
@kuujo
Apr 04 2016 22:51
yeah that’s my thought
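something like this on the server side (purely illustrative; the names are hypothetical):

// Purely illustrative; names are hypothetical, not the real Copycat code.
// Periodically expire sessions whose keep-alives have stopped arriving.
// Expiring a session also fails its pending linearizable event futures,
// so a stuck client can no longer block command completion indefinitely.
void checkSessions(long now, Iterable<ServerSession> sessions) {
  for (ServerSession session : sessions) {
    if (now - session.lastKeepAlive() > session.timeout()) {
      session.expire();
    }
  }
}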
Madan Jampani
@madjam
Apr 04 2016 22:51
Yes!
Jordan Halterman
@kuujo
Apr 04 2016 22:51
sounds good
thanks!
Madan Jampani
@madjam
Apr 04 2016 22:52
btw, I just saw the session event issue in my test run as well!
Jordan Halterman
@kuujo
Apr 04 2016 22:52
:-P
Madan Jampani
@madjam
Apr 04 2016 22:52
:)
randomly. I was not trying to repro it :)
Jordan Halterman
@kuujo
Apr 04 2016 23:05
indeed I just accidentally wrote tests that consistently reproduced it last night