These are chat archives for atomix/atomix

23rd Feb 2017
Jordan Halterman
@kuujo
Feb 23 2017 00:00
from the server’s perspective, the sequence number returned by getCommandSequence is indeed set when the command is applied to the state machine
so I suppose we have to ensure that command actually was applied
I need to rename them so they’re easier to understand
that should be getLastAppliedCommandSequence or something
it’s not a good sign when you have to figure out your own code :-P
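roughly what I mean, as a hypothetical sketch (not the actual Copycat session code, just the rename I’m suggesting):

```java
// Hypothetical sketch of the rename, not the real Copycat session class:
// track the last *applied* command sequence explicitly so it can't be confused
// with the sequence of the last command merely received from the client.
class SessionStateSketch {
  private long lastAppliedCommandSequence;

  // Called when a CommandEntry for this session is applied to the state machine.
  void setLastAppliedCommandSequence(long sequence) {
    lastAppliedCommandSequence = Math.max(lastAppliedCommandSequence, sequence);
  }

  // Was getCommandSequence(); the new name says what the value actually is.
  long getLastAppliedCommandSequence() {
    return lastAppliedCommandSequence;
  }
}
```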
Jon Hall
@jhall11
Feb 23 2017 00:03
:)
It looks like the command was applied in logs_172.17.0.2_test_cluster_controller_restarts_FAILED/log/karaf.log.1
Jordan Halterman
@kuujo
Feb 23 2017 00:08
I don’t see it applied on 172.17.0.3 though, which is where that memory is
not yet anyways
gotta apply my trusty script
Jon Hall
@jhall11
Feb 23 2017 00:08
:)
Jon Hall
@jhall11
Feb 23 2017 00:16

Here are the logs for applying commands from that session at sequence 113. This seems strange to me, as there are two on one server a few minutes apart:

logs_172.17.0.2_test_cluster_controller_restarts_FAILED/log/karaf.log.1:2017-02-11 10:00:05,356 | DEBUG | 9876-partition-3 | ServerStateMachine               | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.2:9876 - Applying CommandEntry[index=87048, term=185, session=86746, sequence=113, timestamp=1486807205237, command=InstanceCommand[resource=363, command=ResourceCommand[command=org.onosproject.store.primitives.resources.impl.AtomixConsistentMapCommands$Listen@1c53b7ee]]]
logs_172.17.0.4_test_cluster_controller_restarts_FAILED/log/karaf.log.4:2017-02-11 10:02:47,631 | DEBUG | 9876-partition-3 | ServerStateMachine               | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.4:9876 - Applying CommandEntry[index=87048, term=185, session=86746, sequence=113, timestamp=1486807205237, command=InstanceCommand[resource=363, command=ResourceCommand[command=org.onosproject.store.primitives.resources.impl.AtomixConsistentMapCommands$Listen@554c1eab]]]
logs_172.17.0.4_test_cluster_controller_restarts_FAILED/log/karaf.log.8:2017-02-11 10:00:05,247 | DEBUG | 9876-partition-3 | ServerStateMachine               | 90 - io.atomix.all - 1.0.1.SNAPSHOT | /172.17.0.4:9876 - Applying CommandEntry[index=87048, term=185, session=86746, sequence=113, timestamp=1486807205237, command=InstanceCommand[resource=363, command=ResourceCommand[command=org.onosproject.store.primitives.resources.impl.AtomixConsistentMapCommands$Listen@51205517]]]

I also don’t see anything on node 2

hmm, although node 3 did restart right before the second log message
Jordan Halterman
@kuujo
Feb 23 2017 00:20
I see a lot of entries from the session being applied, but a lot of sequence numbers are skipped, and I may have an idea...
I see sequence=85 in karaf.log.4, and then in the next log, karaf.log.3, the next command applied is sequence=94
and then the KeepAliveRequests suddenly have commandSequence=113
I’m imagining the client got those responses from another server
need to go dig through the client’s logs
there’s nothing wrong with those sequence numbers being skipped
well, there may be nothing wrong with it
Copycat is able to exclude entries that don’t contribute to the state machine’s state from replication
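roughly the pattern, as a hypothetical sketch (stand-in types, not the real Copycat API): once a newer command supersedes an older one, the old commit is released and compaction may drop that entry, so gaps in the applied sequence numbers can be perfectly normal

```java
// Hypothetical sketch with stand-in types (not the real Copycat API): closing a
// commit marks its log entry as no longer contributing to state, so compaction
// may exclude the entry from replication. That is why applied sequence numbers
// on different servers can legitimately have gaps.
class SetValue {                              // stand-in command carrying a value
  final long value;
  SetValue(long value) { this.value = value; }
}

class Commit<T> implements AutoCloseable {    // stand-in for a Copycat commit
  private final T operation;
  Commit(T operation) { this.operation = operation; }
  T operation() { return operation; }
  @Override public void close() { /* marks the underlying log entry as releasable */ }
}

class ValueStateMachineSketch {
  private Commit<SetValue> current;           // the only commit still contributing to state

  long set(Commit<SetValue> commit) {
    if (current != null) {
      current.close();                        // released: compaction may drop the old entry
    }
    current = commit;
    return commit.operation().value;
  }
}
```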
Jon Hall
@jhall11
Feb 23 2017 00:25
most of those commands were applying listeners, if that helps
Jordan Halterman
@kuujo
Feb 23 2017 00:25
but that may lead to this state where a query can’t succeed, so I’ll have to use my superpowers of deduction to figure out if that’s the case
Jordan Halterman
@kuujo
Feb 23 2017 00:34
There are some issues with the leader election commands that I haven’t been able to fix yet, but it doesn’t seem like they should be contributing to this type of problem. They should only affect the state of the state machine if a node goes down for a long time.
Jordan Halterman
@kuujo
Feb 23 2017 00:43
I think this should be an easy fix
Jordan Halterman
@kuujo
Feb 23 2017 00:55
yeah, it’s an easy fix, but there are a couple of more fundamental issues I’m going to address tonight and we’ll have a PR in for this issue
Jon Hall
@jhall11
Feb 23 2017 00:56
great!
Jordan Halterman
@kuujo
Feb 23 2017 01:05

I found a couple of correctness issues: one with LINEARIZABLE queries and one with SEQUENTIAL queries. The issue with LINEARIZABLE queries is unlikely in practice, but still important. The one with SEQUENTIAL queries is more likely in practice, and there’s nothing currently preventing it from happening.

Gonna make myself some food and then I’ll have to spend some time working out the protocol. I’ll ping you later.

Jon Hall
@jhall11
Feb 23 2017 01:09
awesome!
Jordan Halterman
@kuujo
Feb 23 2017 03:38
hmm… fortunately, I think I may have come to the conclusion that the sequential read issue may not actually be an issue at all because of followers’ existing consistency checks with the leader...
Jordan Halterman
@kuujo
Feb 23 2017 04:49
I have indeed come to that conclusion… at least as close to it as I can come without actually writing and checking a model which I don’t have time for :-)
so, just two bug fixes
Jordan Halterman
@kuujo
Feb 23 2017 05:03
so, this does actually appear to be a byproduct of a couple of bugs in AtomixLeaderElectorState
that is, a bug in AtomixLeaderElectorState resulted in the server state that exposed this bug, but it is indeed still a bug in Copycat’s protocol as well
there may actually be a couple changes to be made: one to prevent this bug, and one to improve the time it takes to converge on the correct sequence number
I’ll explain it in the PR though
Jordan Halterman
@kuujo
Feb 23 2017 06:10
so, just an FYI this also really needs to be fixed in AtomixLeaderElectorState which I should file an issue for
Jordan Halterman
@kuujo
Feb 23 2017 06:28
@jhall11 atomix/copycat#282
will submit a separate one for the linearizability issue
Jon Hall
@jhall11
Feb 23 2017 07:27
wow, that’s an interesting situation. So is it the AtomixLeaderElectorState that caused the last command to be compacted?
Jordan Halterman
@kuujo
Feb 23 2017 07:28
yeah
atomix/copycat#283
The problem in AtomixLeaderElectorState is that a lot of the commands are being released immediately after they’re applied even though they actually contribute to the state of the system. Listen is properly managed, which is why I’m seeing those as the last command in the log. But on leader changes they’re released and then excluded from replication, causing this issue. There’s nothing wrong with releasing them immediately in principle, but there is in the case of that particular state machine.
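the difference, sketched hypothetically (not the actual AtomixLeaderElectorState code): a commit has to be held as long as its effect is still part of the state, and only released once it genuinely stops contributing

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual ONOS AtomixLeaderElectorState code. A commit
// is modeled as an AutoCloseable handle on a log entry: closing it tells compaction
// the entry no longer contributes to state and may be excluded from replication.
class ElectorCommitHandlingSketch {
  private final Map<Long, AutoCloseable> listenCommits = new HashMap<>();

  // Correct pattern (what Listen does): hold the commit for as long as the listener
  // is registered, so the entry stays in the log for servers that recover later.
  void listen(long sessionId, AutoCloseable commit) throws Exception {
    AutoCloseable previous = listenCommits.put(sessionId, commit);
    if (previous != null) previous.close();
  }

  void unlisten(long sessionId) throws Exception {
    AutoCloseable commit = listenCommits.remove(sessionId);
    if (commit != null) commit.close();
  }

  // Buggy pattern (the leader-change commands): releasing the commit immediately,
  // even though the leadership it established is still part of the machine's state.
  void applyLeadershipChange(AutoCloseable commit) throws Exception {
    // ... update in-memory leadership here ...
    commit.close(); // too early: compaction may exclude the entry from replication
  }
}
```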
Jordan Halterman
@kuujo
Feb 23 2017 07:33
I started working on fixing it a couple of weeks ago but didn’t find the time. It should definitely be fixed soon because even with this bug fix that state machine can still have an incorrect state in this type of scenario. Meaning, a leader change could occur while a server is down, and that server may not see the leader change when it comes back up. That would mean if the server is later elected leader, it could cause weird leader changes or possibly allow multiple leaders depending on how the client is notified of changes (don’t remember). It’s a relatively straightforward fix, I just don’t want to break that state machine, so I have to do it with a little care.
we may just want to store the leader state machine’s state in a snapshot just to be extra safe
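something like this, as a rough sketch (plain Java serialization as a stand-in for Copycat’s snapshot reader/writer, so treat the hooks as assumptions):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Rough sketch, not the actual ONOS code: if the elector's (small) state is written
// to snapshots, a server that was down across a leader change converges on the
// correct leadership when it rejoins, regardless of which entries were compacted.
class LeaderElectorSnapshotSketch {
  private final Map<String, Long> leaders = new HashMap<>(); // topic -> leader session id

  // Assumed snapshot hook, Copycat-style: called when the server takes a snapshot.
  void snapshot(DataOutputStream writer) throws IOException {
    writer.writeInt(leaders.size());
    for (Map.Entry<String, Long> entry : leaders.entrySet()) {
      writer.writeUTF(entry.getKey());
      writer.writeLong(entry.getValue());
    }
  }

  // Assumed install hook: called when a (possibly restarted) server loads a snapshot.
  void install(DataInputStream reader) throws IOException {
    leaders.clear();
    int size = reader.readInt();
    for (int i = 0; i < size; i++) {
      leaders.put(reader.readUTF(), reader.readLong());
    }
  }
}
```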
Jon Hall
@jhall11
Feb 23 2017 07:35
hmm, I think I may have seen that before
Jordan Halterman
@kuujo
Feb 23 2017 07:38
storing that state in a snapshot should eliminate those potential issues from that state machine - which is obviously a very important one - and should be cheap since leader elections are relatively rare and the state is small
hmm
maybe I’ll take a stab at it right now
I have some time
I’m in the process of going through the code base to audit all the distributed systems components. That was the biggest issue I found thus far, but I have a long way and a couple weekends to go still.
the other smaller issues I already fixed
Jon Hall
@jhall11
Feb 23 2017 07:42
How do you find it so far? Is it fairly easy to get started?
Jordan Halterman
@kuujo
Feb 23 2017 07:49
I think it’s awesome! The distributed systems aspects of ONOS basically implement a lot of what has always been planned for Atomix. I love it! All the APIs are really well thought out, and the architecture and abstractions are amazing. Everyone did a really great job. I’ve spent a lot of time in the code, but not actually running it. I set up a VM and played with it only briefly and then got sucked back into actually fixing things, but one of these weekends I’ll actually be able to play with the entire system in general some more. But at least I was able to get it set up pretty quickly and without much hassle once I found the right tutorial.
Jon Hall
@jhall11
Feb 23 2017 07:50
great! That has been one of our goals
Jordan Halterman
@kuujo
Feb 23 2017 07:51
It was actually really easy considering what I was able to set up in a few minutes
I’m a sucker for good documentation, especially for open source projects
Jon Hall
@jhall11
Feb 23 2017 07:53
haha, I think ONOS is kind of hit or miss. We have some really great documentation, but then there are some features that are completely undocumented or have outdated docs
Jordan Halterman
@kuujo
Feb 23 2017 07:53
yep, I did run into a lot of documentation that seemed to be outdated
various tutorials
some distributed systems documentation that references old architectures
but it’s hard to keep up with when you have a lot of documentation and nobody necessarily dedicated to actually making it look nice
there’s a lot of Atomix documentation that’s out of date
nobody merges code and then goes and documents it
these things tend to come in waves
Jon Hall
@jhall11
Feb 23 2017 07:55
yes, there is an effort planned in the next month or so to get some of the user/operator level documentation and tutorials in sync with the code
Jordan Halterman
@kuujo
Feb 23 2017 07:55
documentation will get out of date, and then someone will spend some time updating a portion of it and forget about 10 other portions of it
haha
Jon Hall
@jhall11
Feb 23 2017 07:56
:)
Jordan Halterman
@kuujo
Feb 23 2017 07:56
I’m talking about atomix.io ;-)
<— and that someone is this guy
Jon Hall
@jhall11
Feb 23 2017 07:57
I guess atomix is somewhat easier since there is someone who knows the whole codebase
Jordan Halterman
@kuujo
Feb 23 2017 07:57
yep good point
Jon Hall
@jhall11
Feb 23 2017 07:57
with ONOS, there are many more authors
and some of those authors only stay around long enough to get their code checked in
Jordan Halterman
@kuujo
Feb 23 2017 08:02
that’s a good problem to have!
Jordan Halterman
@kuujo
Feb 23 2017 08:30
hmm actually this is an easy fix for AtomixLeaderElectorState
Madan’s code is right. It’s already storing the state in the snapshot. Just the command CompactionModes are wrong.
we really should get Jepsen running and test the ONOS state machines on it
Jordan Halterman
@kuujo
Feb 23 2017 09:22
k fixed the AtomixLeaderElectorState issue too. I think the state machine looks fine. Just fixed the commands
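for context, the shape of the fix (hypothetical enum values and compaction() hook, written from memory rather than the exact Copycat API): the election commands just needed to declare that their effect lives in the snapshot, instead of being releasable right after they’re applied

```java
// Hypothetical sketch; the enum values and the compaction() hook are assumptions,
// not the exact Copycat API. The point is that the fix is a per-command declaration
// of when its log entry may be compacted, not a change to the state machine itself.
enum CompactionMode {
  SNAPSHOT,        // entry may be compacted only once its effect is captured in a snapshot
  RELEASE_ON_APPLY // entry may be compacted as soon as it has been applied (the buggy choice)
}

interface Command<T> {
  default CompactionMode compaction() { return CompactionMode.RELEASE_ON_APPLY; }
}

// A leadership-changing command whose effect must survive until a snapshot covers it.
class Run implements Command<Void> {
  @Override
  public CompactionMode compaction() { return CompactionMode.SNAPSHOT; }
}
```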
Jordan Halterman
@kuujo
Feb 23 2017 09:34
I also merged the bug fixes and the snapshot is updated. I'll cut a new patch release when all seems good.
Jordan Halterman
@kuujo
Feb 23 2017 10:12
Keep in mind, though, we still have that IndexOutOfBoundsException in the log that will probably show up again. I don't think that was fixed in the various recent PRs. It may have been, but I don't think so.
As for the log in general, the Copycat 2.0 branch is almost stable. I'm working through the last few bugs, and then I'll have to merge the work that's been done on master back into it.
Jon Hall
@jhall11
Feb 23 2017 16:35
cool, I’ll start some tests on these fixes now.
Also, ONOS is planning on cutting the final 1.9 release on Friday. Do you think you can make a final bugfix version for 1.x by then?
Jordan Halterman
@kuujo
Feb 23 2017 18:39
Yep
Tomorrow?
Jon Hall
@jhall11
Feb 23 2017 18:42
Yeah
Jordan Halterman
@kuujo
Feb 23 2017 18:43
:+1:
Jon Hall
@jhall11
Feb 23 2017 18:43
Running some tests, so far so good
Jordan Halterman
@kuujo
Feb 23 2017 20:15
I will push it this evening