These are chat archives for atomix/atomix
k so let me write out what should be happening in the code here and perhaps we can figure out what needs to be fixed…
So, in your issue the client sends a command and then disconnects and tries another server. That command eventually gets logged and committed and applied to the state machine which perhaps publishes an event to the session. When the client switches servers and resubmits the command, the state machine correctly notices that the command has already been applied and therefore returns the existing output:
This is all correct. We don’t want the same command to be applied to the state machine twice since that would break linearizability. Once the client finally does receive a successful response for the command, it should update the responseIndex with the original request’s sequence and index and go on its merry way:
When the next command CommandEntry[index=738, term=2, session=8, sequence=236, timestamp=1482433831887, command=InstanceCommand[resource=103, command=ResourceCommand[… is applied, the sequence number 236 should be next.
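To make the dedup behaviour above concrete, here is a minimal sketch. It is not the actual ServerStateMachine code: SessionResultCache, Result and apply are hypothetical names, and it assumes sequence numbers arrive in order and that cached results are never evicted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical per-session cache of command results, keyed by sequence number. */
class SessionResultCache {

    /** Hypothetical cached result: the log index and output of the first application. */
    static final class Result {
        final long index;
        final Object output;

        Result(long index, Object output) {
            this.index = index;
            this.output = output;
        }
    }

    private final Map<Long, Result> results = new HashMap<>();
    private long lastAppliedSequence;

    /**
     * Applies a command at most once per sequence number.
     * A resubmitted command (same sequence, new log index) is not run again;
     * the result cached from its first application is returned instead.
     */
    Result apply(long index, long sequence, Supplier<Object> command) {
        if (sequence <= lastAppliedSequence) {
            // Already applied: skip the state machine and reuse the original result.
            return results.get(sequence);
        }
        Result result = new Result(index, command.get());
        results.put(sequence, result);
        lastAppliedSequence = sequence;
        return result;
    }
}
```

Note that in this sketch the cached Result carries the index of the first application, not the index at which the retry was logged.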
But since events are sequenced among commands, if the client isn’t able to receive events for that index as well then that could certainly prevent the client from progressing. Madan and I spent a lot of time working out issues with event sequencing. But the way this scenario is handled in the events system is this:
If an event was published by that command that was reapplied at index 737, those events are stored in memory on all servers until they’re acknowledged by the client. So, when the client reconnects to another server and submits a KeepAliveRequest, the application of that KeepAliveEntry should result in events being published back to the client starting at the client’s event index:
This means in this scenario, if the client submits its command, disconnects, reconnects to another server, and resubmits that command, even if the command is ignored in the ServerStateMachine, the events will still be in memory on every server and the client’s next KeepAliveRequest (which is sent when the client reconnects to a new server) should force events to be resent to the client. This perhaps could be improved by resending events after that in-memory Result is returned to the client, but I’m not seeing what’s broken yet.
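A rough sketch of the event handling described above, for reference. This is only a model, not the real session code: SessionEventBuffer, Event, publish and keepAlive are hypothetical names, and it assumes events are published in increasing event index order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory buffer of events a session has not yet acknowledged. */
class SessionEventBuffer {

    /** Hypothetical event record: the index it was published at plus a payload. */
    static final class Event {
        final long eventIndex;
        final String payload;

        Event(long eventIndex, String payload) {
            this.eventIndex = eventIndex;
            this.payload = payload;
        }
    }

    // Oldest unacknowledged event is at the head of the queue.
    private final Deque<Event> pending = new ArrayDeque<>();

    /** Buffers an event until the client acknowledges it. */
    void publish(long eventIndex, String payload) {
        pending.addLast(new Event(eventIndex, payload));
    }

    /**
     * Handles a keep-alive carrying the client's last received event index:
     * drops everything the client has acknowledged and returns the rest,
     * in order, so it can be resent.
     */
    List<Event> keepAlive(long ackedEventIndex) {
        while (!pending.isEmpty() && pending.peekFirst().eventIndex <= ackedEventIndex) {
            pending.removeFirst();
        }
        return new ArrayList<>(pending);
    }
}
```

Because the buffer only shrinks on acknowledgement, a client that reconnects elsewhere and sends a keep-alive gets every event it missed, whether or not the resubmitted command was skipped.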
If you have full logs, and time to sanitize them if you need to, they would be really helpful.
Thanks for the explanation. It aligns with what I'm seeing in code and logs.
The client misses the publishing of its command with sequence=235, but when it later reconnects it gets that event (it was logged at index 634). So this is working. The problem arises from the fact that this same command, when sent again, is appended:
Appended CommandEntry[index=737, term=2, session=8, sequence=235, timestamp=1482433831886,
it goes at index=737. But because the sequence is old it won't generate a commit and will just return the same result, which is correct. The problem now is that subsequent events/queries sent will depend on index=737, but it is never published to the client. 738 (the next one) is published.
QueryRequest[session=8, sequence=236, index=737, query=InstanceQuery
So, this query for example won't ever get its response sequenced at the client because it is waiting for the 737 index to be published.
737, including anything related to events. The response that's sent to the client when the state machine is skipped contains the index and eventIndex of the original command
/172.17.0.2:9876 - Applying CommandEntry[index=737, term=2, session=8, sequence=235, ...
/172.17.0.2:9876 - Sent CommandResponse[status=OK, index=737, eventIndex=630, result
/172.17.0.2:9876 - Applying CommandEntry[index=738, term=2, session=8, sequence=236, ...
/172.17.0.2:9876 - Executing commit ServerCommit[index=738, session=ServerSessionContext[id=8]
when that’s completed this gets called:
to build the response
Result should have contained the original command’s