These are chat archives for atomix/atomix

22nd Feb 2017
Jordan Halterman
@kuujo
Feb 22 2017 00:46
not entirely sure this is gonna work the easy way...
Jordan Halterman
@kuujo
Feb 22 2017 01:07
well… it was able to read all the logs from all the partitions just fine
Jon Hall
@jhall11
Feb 22 2017 01:08
without getting into that state where the index is 0?
Jordan Halterman
@kuujo
Feb 22 2017 01:08
yep
this particular issue seems a bit different than last time though
I’m looking at the application logs for 172.17.0.2, and what I see is all the partitions successfully applying KeepAliveEntry, but that’s it
that seems really odd
they’re committing way more entries than they’re applying, but only applying that type of entry, which is why I didn’t notice it at first
Jon Hall
@jhall11
Feb 22 2017 01:11
hmm
Jordan Halterman
@kuujo
Feb 22 2017 01:12
that makes me think perhaps command sequencing is off for some reason
that could theoretically prevent commands/queries from being applied
the state machines are waiting for a command that will never come
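as a rough illustration of what I mean (a minimal sketch of the sequencing idea, not Copycat's actual code; names are made up):
import collections

class SequencedStateMachine:
    def __init__(self):
        self.next_sequence = 1   # next client command sequence we expect
        self.pending = {}        # out-of-order commands buffered by sequence

    def on_command(self, sequence, command):
        # Buffer the command, then apply everything that is now in order.
        self.pending[sequence] = command
        while self.next_sequence in self.pending:
            self.apply(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
        # If a sequence number is skipped, everything after it stays buffered
        # here forever -- waiting for a command that will never come.

    def apply(self, command):
        print('applying', command)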
hmm nvm that’s all they’re getting
so I mean, at the end of the logs they’re still getting and storing and applying entries correctly
if the clients were hanging this could be a separate issue in the client communication
have to track back to the last commands that were successful
hmm… this does seem related but unrelated
it happens after a restart, but seems to be a different issue
Jon Hall
@jhall11
Feb 22 2017 01:17
hmm ok
Jordan Halterman
@kuujo
Feb 22 2017 01:22
In the previous case, the clients lost their sessions and kept trying to register a new session. In this case, the clients seem to stop sending commands for some reason. I’m going to write a script to split the logs so I can try to deduce what happened. Having a different type of failure could get closer to the real cause
hard to follow all the partitions
Jon Hall
@jhall11
Feb 22 2017 01:27
I think that is what I find hardest about debugging these, just following everything in the logs. When there are several nodes and multiple partitions it becomes a nightmare to keep track of
Jordan Halterman
@kuujo
Feb 22 2017 01:30
well, this little script made some much more manageable logs:
import sys

inpath = sys.argv[1]    # raw karaf log
outpath = sys.argv[2]   # filtered output
match = sys.argv[3]     # substring to filter on, e.g. a partition name

with open(inpath, 'r') as infile:
    with open(outpath, 'w+') as outfile:
        for line in infile:
            if match in line:
                # karaf log fields are pipe-delimited; field 5 is the message
                outfile.write(line.split('|')[5].strip(' /'))
e.g. python split_logs.py karaf.log karaf_clean.log partition-1
Jon Hall
@jhall11
Feb 22 2017 01:34
ah, I like it
I’ve mostly just been using cat piped to a series of greps
Jordan Halterman
@kuujo
Feb 22 2017 01:37
I’ve gotten too used to Python for hacking out these things the programmer way :-P
but ahh these logs make so much more sense now
cat/grep is the right way :-)
Jon Hall
@jhall11
Feb 22 2017 01:42
yes, but I think I need to learn how to mix in sed to get rid of some of the excess info
Jordan Halterman
@kuujo
Feb 22 2017 01:46
here’s a nicer version that actually aligns all of the logs:
import sys

inpath = sys.argv[1]    # raw karaf log
outpath = sys.argv[2]   # filtered output
match = sys.argv[3]     # substring to filter on, e.g. a partition name
indent = 25             # column at which the log message should start

with open(inpath, 'r') as infile:
    with open(outpath, 'w+') as outfile:
        for line in infile:
            if match in line:
                split = line.split('|')
                cls = split[3].strip()      # logging class name
                log = split[5].strip(' /')  # log message
                # pad the class name so the messages line up in one column
                buf = ' ' * max(indent - len(cls), 0)
                outfile.write(cls + buf + log)
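usage is the same as before, e.g. python split_logs.py karaf.log karaf_clean.log partition-1 (assuming the same pipe-delimited karaf log format)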
Got all the cleaned up logs generated, so I’ll spend some time digging through there and should be able to come up with some answers. Those answers just might be more questions :-P
Jordan Halterman
@kuujo
Feb 22 2017 07:14
hmmm
Jordan Halterman
@kuujo
Feb 22 2017 07:26
So, I'm not sure what to make of this... I loaded all the logs from all the servers, and they can all be read fine. I split all the logs for clients and servers in each partition and looked through them manually. All the clients and servers seem to be making progress. All the servers definitely are as they're all committing and applying KeepAliveEntry. All that seems to be happening is many of the clients stop submitting operations to the cluster. It's not clear from the logs whether or not that's intentional. The last commands/queries/events from those clients are successful and seem to be correctly sequenced. So, they should have completed, but my only idea is that for some reason the client-side sequencer hangs, which seems unlikely. All the indexes look correct and all the necessary information seems to have been received by clients. It simply seems as if the clients just stopped doing things.
It's definitely not the same issue as last time
Johno Crawford
@johnou
Feb 22 2017 09:29
@kuujo thanks, I am evaluating atomix for virtual actors and the messaging between nodes is a vital part of that
Jordan Halterman
@kuujo
Feb 22 2017 09:33
They will indeed remain, but are still considered @Experimental until the new test framework is done and those more complex state machines are better tested.
Johno Crawford
@johnou
Feb 22 2017 09:52
when I get time to implement the proto I'll be happy to provide feedback. Coming from jgroups / aws it's a little bothersome to deal with individual ports via config instead of giving a range, but I think that should be straightforward: just need to define bootstrap nodes, join the cluster with those bootstrap nodes, obviously remembering to define the correct interface so the nodes can connect to one another
and obviously deal with port allocation, any tips for that? have you been setting a port for each node or using some retry mechanism to bind to a port within a range?
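for illustration, the retry approach I mean would look something like this (a generic sketch with plain sockets, not an Atomix API):
import socket

def bind_first_free_port(host, start, end):
    """Try each port in [start, end) and return a socket bound to the first free one."""
    for port in range(start, end):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock, port
        except OSError:
            sock.close()  # port in use, try the next one
    raise RuntimeError('no free port in range %d-%d' % (start, end - 1))

# e.g. sock, port = bind_first_free_port('0.0.0.0', 5000, 5100)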
Jordan Halterman
@kuujo
Feb 22 2017 22:50
@jhall11 have you seen that? ^^ I can't really see any issues in the logs. I see what appears to be normal expected behavior from all the servers, and seemingly normal expected behavior from all the clients. There doesn't seem to be any inconsistency or failure to progress. What was the behavior that led to the test failing? Something certainly seems off, but it's basically just that the clients aren't doing anything.
Jon Hall
@jhall11
Feb 22 2017 22:52
hmm, the tests stopped because the ‘leaders’ command in ONOS timed out. This prints the current topic and leaders for the LeaderElectors, which I think results in a query to each partition
I can try to reproduce it again and get new logs/metadata files
Jon Hall
@jhall11
Feb 22 2017 23:01
Are you seeing a response to this:
6682 2017-02-11 10:04:33,146 | ERROR | sh for user onos | ShellUtil | 27 - org.apache.karaf.shell.console - 3.0.8 | Exception caught while executing command
err
2017-02-11 10:04:32,856 | DEBUG | ycat-client-io-1 | ClientSession | 90 - io.atomix.all - 1.0.1.SNAPSHOT | 86746 - Sending QueryRequest[session=86746, sequence=113, index=87281, query=InstanceQuery[resource=88, query=ResourceQuery[query=org.onosproject.store.primitives.resources.impl.AtomixLeaderElectorCommands$GetAllLeaderships@621542ce], consistency=null]]
Jon Hall
@jhall11
Feb 22 2017 23:09
So it looks like the client from node 1 is sending this query, and the partition-3 leader, which is node 2’s server, is receiving the request, but I’m not seeing a response to that
Jordan Halterman
@kuujo
Feb 22 2017 23:09
:+1: one sec
Jon Hall
@jhall11
Feb 22 2017 23:26
it looks like there is another request a few seconds later from a different session, and that one is answered
Jordan Halterman
@kuujo
Feb 22 2017 23:34
indeed it is
gonna have to track down the last command for 86746
Jordan Halterman
@kuujo
Feb 22 2017 23:39
this seems odd
Received QueryRequest[session=86746, sequence=113, index=87281, query=InstanceQuery[resource=88, query=ResourceQuery[query=org.onosproject.store.primitives.resources.impl.AtomixLeaderElectorCommands$GetAllLeaderships@63f92679], consistency=null]]
let me link to some code
so… clearly in the leader’s logs, the previous KeepAliveRequest has commandSequence=113 and is successfully applied
that would have been my guess as to why a query would just hang, as it is in the server
the fact that another later query with the same consistency requirements succeeds no problem indicates that there’s nothing wrong with the leadership check
that’s independent of sessions
so it seems like it has to be something in the session state
Jon Hall
@jhall11
Feb 22 2017 23:44
yes
When should the commandSequence be changed? just when a command is applied to the state machine?
Jordan Halterman
@kuujo
Feb 22 2017 23:49

So, with each command a client sends to the cluster, it increments a sequence number in the CommandRequest. That’s used for linearizable semantics. When the state machine applies a command, it will store the command’s output keyed by the sequence number. Then, once the client gets a response for that command, it keeps track of the highest commandSequence for which it received a response and sends that in KeepAliveRequest so the state machines can remove the command output since the client has already seen it.

So, sequence in CommandRequest increments on each command from a client, and commandSequence in KeepAliveRequest is incremented when the client receives a response for a command.
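
in pseudo-code, the bookkeeping is roughly this (an illustrative sketch of the description above, not the actual Copycat client/server classes):
class Client:
    def __init__(self):
        self.command_sequence = 0   # sequence sent in each CommandRequest
        self.response_sequence = 0  # highest sequence acknowledged by a response

    def next_command(self):
        self.command_sequence += 1
        return self.command_sequence  # goes out as CommandRequest.sequence

    def on_command_response(self, sequence):
        self.response_sequence = max(self.response_sequence, sequence)

    def keep_alive(self):
        # commandSequence in KeepAliveRequest: "I've seen responses up to here"
        return self.response_sequence

class ServerSession:
    def __init__(self):
        self.results = {}  # command output cached by sequence number

    def apply_command(self, sequence, output):
        self.results[sequence] = output  # kept until the client has seen it

    def on_keep_alive(self, command_sequence):
        # the client has seen everything <= command_sequence, so drop it
        for seq in list(self.results):
            if seq <= command_sequence:
                del self.results[seq]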

hmm...
I’ll go through that algorithm a bit, but it almost has to be that condition that’s true
Jordan Halterman
@kuujo
Feb 22 2017 23:55
there are some naming inconsistencies between the client and server since they’re two different perspectives