These are chat archives for atomix/atomix

3rd
Jan 2017
Thiago Santos
@thiagoss
Jan 03 2017 18:29
When a client sends a CommandRequest and times out, but the command is applied anyway. Then the client will keep trying to send the same command again (same sequence number and everything). What should happen to that Request? I'm seeing it getting applied again, but it won't generate a commit()
This particular one that doesn't generate a commit never ends up becoming a PublishRequest and the client gets stuck waiting for this PublishRequest with a particular index. No other queries or command will be used on this client's session because it waits forever for that command that didn't commit
Any ideas on how to fix this?
Thiago Santos
@thiagoss
Jan 03 2017 18:35
@jhall11 This is the last known issue I have with ONOS
Thiago Santos
@thiagoss
Jan 03 2017 18:57
The problem is that this client's will start receiving query/events that depend on the index of that command that was applied but didn't commit
Jordan Halterman
@kuujo
Jan 03 2017 20:25
hmm...
Thiago Santos
@thiagoss
Jan 03 2017 20:26
Reported with a summary at atomix/copycat#261
Jordan Halterman
@kuujo
Jan 03 2017 20:26
thanks
Thiago Santos
@thiagoss
Jan 03 2017 20:27
My first thought is that somehow a PublishRequest should be generated even when the command is a repeated one
So that the client is notified that that particular command was applied already
Jordan Halterman
@kuujo
Jan 03 2017 20:28
yeah… let me think about this
Thiago Santos
@thiagoss
Jan 03 2017 20:29
I have time to write and test the fix, I just don't know what the fix is yet. If you have any ideas let me know and I can write and try them :)
Jordan Halterman
@kuujo
Jan 03 2017 20:30
sweet we can work that out… so… the client submits a command and thinks it failed, but it succeeded. Then the client switches and tries to submit the same command (correctly) but because the command was already committed it’s basically just rejected? Is that what’s going on here?
ahh I see what you’re saying
Thiago Santos
@thiagoss
Jan 03 2017 20:31
It is not rejected. It falls within the check for the sequenceNumber and it will just return the same result again
Jordan Halterman
@kuujo
Jan 03 2017 20:31
yeah
but then the problem is events are not sent to the client?
Thiago Santos
@thiagoss
Jan 03 2017 20:32
But it is applied nonetheless so other events/queries sent by this client will depend on that command index but it wasn't published to the client
Jordan Halterman
@kuujo
Jan 03 2017 20:32
hmm
Thiago Santos
@thiagoss
Jan 03 2017 20:33
In my particular log it was applied at index=737, so queries will have eventIndex=737 but this one wasn't published to the client. It receives the previous one 634 and the following one 738. But the 737 one is missing from the client's perspective. So it gets stuck
My guess is that we should somehow notify the client that 737 was applied
Jordan Halterman
@kuujo
Jan 03 2017 20:36
yeah… lemme think about it
Jordan Halterman
@kuujo
Jan 03 2017 21:16

k so let me write out what should be happening in the code here and perhaps we can figure out what needs to be fixed…

So, in your issue the client sends a command and then disconnects and tries another server. That command eventually gets logged and committed and applied to the state machine which perhaps publishes an event to the session. When the client switches servers and resubmits the command, the state machine correctly notices that the command has already been applied and therefore returns the existing output:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L774-L782

This is all correct. We don’t want the same command to be applied to the state machine twice since that would break linearizability. Once the client finally does receive a successful response for the command, it should update the commandResponse and responseIndex with the original request’s sequence and index and go on its merry way:
https://github.com/atomix/copycat/blob/master/client/src/main/java/io/atomix/copycat/client/session/ClientSessionSubmitter.java#L309-L310

When the next command CommandEntry[index=738, term=2, session=8, sequence=236, timestamp=1482433831887, command=InstanceCommand[resource=103, command=ResourceCommand[… is applied the sequence number 236 should be next.

But since events are sequenced among commands, if the client isn’t able to receive events for that index as well then that could certainly prevent the client from progressing. Madan and I spent a lot of time working out issues with event sequencing. But the way this scenario is handled in the events system is this:

If an event was published by that command that was reapplied at index 737, those events are stored in memory on all servers until they’re acknowledged by the client. So, when the client reconnects to another server and submits a KeepAliveRequest, the application of that KeepAliveEntry should result in events being published back to the client starting at the client’s event index:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L562

This means in this scenario, if the client submits its command, disconnects, reconnects to another server, and resubmits that command, even if the command is ignored in the ServerStateMachine, the events will still be in memory on every server and the client’s next KeepAliveRequest (which is sent when the client reconnects to a new server) should force events to be resent to the client. This perhaps could be improved by resending events after that in-memory Result is returned to the client, but I’m not seeing what’s broken yet.

If you have them and have time to sanitize them if you have to, full logs would be really helpful.

I’m obviously missing something that’s going on here, but that’s a general explanation of how the system should be able to continue to progress when the client switches servers and duplicate commands are submitted
Jordan Halterman
@kuujo
Jan 03 2017 21:23
probably because I just got back from a loooong vacation :-)
Jordan Halterman
@kuujo
Jan 03 2017 21:32
A reproducer would be amazing. Maybe I can take a stab at it
Thiago Santos
@thiagoss
Jan 03 2017 21:33

Thanks for the explanation. It aligns with what I'm seeing in code and logs.

The client misses the publishing of its command with sequence=235 but when it later reconnects it gets that event (was logged at index 634). So this is working. The problem arises from the fact that this same command, when sent again is appended:

Appended CommandEntry[index=737, term=2, session=8, sequence=235, timestamp=1482433831886,

it goes at index=737. But because the sequence is old it won't generate a commit and will just return the same result. Which is correct. The problem is now that subsequence events/queries sent will depend on index=737, but it is never published to the client. 738 (the next one) is published.

QueryRequest[session=8, sequence=236, index=737, query=InstanceQuery

So, this query for example won't ever get its responde sequenced at the client because it is waiting for the 737 index to be published.

My setup is complicated and involves proprietary apps, and even worse, this race isn't easy to reproduce. Perhaps I can try with something simpler by forcing some timeouts at the right moments in a different application. I'll think on how to do it. Meanwhile I can sanitize and upload my logs
The 3 bugs I've found so far were all due to client disconnecting under heavy stress of the system.
Thiago Santos
@thiagoss
Jan 03 2017 21:42
The command at index 634 was re-sent as 737, but the second one never got to be published and the client doesn't know about it. So it waits for 737. Maybe 737 should generate a PublishRequest or maybe the client should somehow tie 634 to 737 because they have the same sequence number and automatically skip waiting for 737?
Jordan Halterman
@kuujo
Jan 03 2017 21:54
So, the command is publishing an event? And the event doesn't get published because it's not applied to the state machine? But the problem is the client is waiting for an event from index 737?
But from the client's perspective, it shouldn't even know anything about index 737, including anything related to events. The response that's sent to the client when the state machine is skipped contains the index and eventIndex of the original command
Thiago Santos
@thiagoss
Jan 03 2017 21:59
/172.17.0.2:9876 - Applying CommandEntry[index=737, term=2, session=8, sequence=235, ...
/172.17.0.2:9876 - Sent CommandResponse[status=OK, index=737, eventIndex=630, result
/172.17.0.2:9876 - Applying CommandEntry[index=738, term=2, session=8, sequence=236, ...
/172.17.0.2:9876 - Executing commit ServerCommit[index=738, session=ServerSessionContext[id=8]
The command response contains the index=737
Jordan Halterman
@kuujo
Jan 03 2017 22:02
hmm
so that seems to be the problem then
Thiago Santos
@thiagoss
Jan 03 2017 22:02
It should be the index of the original command? Hopefully that is an easy fix
err...
Result should have contained the original command’s index and eventIndex
have to figure out why that wasn’t the case
Thiago Santos
@thiagoss
Jan 03 2017 22:05
Ok, thanks for the tips. I'll go back to the logs and try to dig something out of it
Jordan Halterman
@kuujo
Jan 03 2017 22:05
sounds good
this is where it should have been originally created when the original command was applied:
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerStateMachine.java#L860-L863
certainly seems something is broken
thanks!