These are chat archives for atomix/atomix

2nd Mar 2017
Jordan Halterman
@kuujo
Mar 02 2017 00:44

@jhall11 I improved my script to automatically split all the partitions and all the sessions from all the files:

import sys
import os
import re

# Splits combined debug logs into per-partition and per-session files.
path = sys.argv[1]

INDENT = 25
SEP = ' | '

def find_pattern(f, pattern):
    # Collect every unique value captured by the pattern's first group.
    values = set()
    prog = re.compile(pattern)
    for line in f:
        match = prog.search(line)
        if match:
            values.add(match.group(1))
    f.seek(0)
    return values

def split_line(line):
    # Reformat a raw log line into 'timestamp | class | message', padding the
    # first two columns to a fixed width; lines that don't match are dropped.
    split = line.split('|')
    if len(split) > 5:
        ts = split[0].strip()
        cls = split[3].strip()
        log = split[5].strip(' /')
        tsbuf = ' ' * max(INDENT - len(ts), 0)
        clsbuf = ' ' * max(INDENT - len(cls), 0)
        return ts + tsbuf + SEP + cls + clsbuf + SEP + log
    return ''

def split_file(infile, finder, filemaker, matchmaker):
    # For each value the finder extracts from the file, write the lines
    # containing it to their own output file.
    for match in finder(infile):
        needle = matchmaker(match)
        with open(filemaker(match), 'w+') as outfile:
            for line in infile:
                if needle in line:
                    outfile.write(split_line(line))
        infile.seek(0)

def split_partitions(filepath, f):
    split_file(
        f,
        lambda f: find_pattern(f, '9876-partition-([0-9]+)'),
        lambda partition: filepath + '.partition-' + partition,
        lambda partition: 'partition-' + partition)

def split_sessions(filepath, f):
    split_file(
        f,
        lambda f: find_pattern(f, 'session=([0-9]+)'),
        lambda session: filepath + '.session-' + session,
        lambda session: session)

def split_files(dirpath, filename):
    filepath = dirpath + os.sep + filename
    with open(filepath, 'r') as f:
        split_partitions(filepath, f)
        split_sessions(filepath, f)

if __name__ == '__main__':
    # Walk the directory tree and split every *.log file found.
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if '.log' in filename:
                split_files(dirpath, filename)

It’s imperfect and creates a lot of files, but it seems to be mostly accurate
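You just run it with the log directory as its only argument (e.g. python split_logs.py /path/to/logs, assuming you saved it as split_logs.py), and it walks the tree and splits every *.log file it finds.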

haha
now I just have to figure out what I’m actually looking for
lots of timeouts on node 2 this time
Jordan Halterman
@kuujo
Mar 02 2017 00:52
so, session 77377 seems to receive a CommandResponse that’s never completed for sure
just have to backtrack and figure out why
this is definitely the problem
in ClientSequencer
that session gets Completing response QueryResponse[status=OK, error=null, index=140759, eventIndex=140709, result=[]] and then no more are completed after that
and it looks like that’s because there’s a PublishRequest that it’s waiting for
hmm
I can’t tell for sure yet, but it looks like the client may be responding successfully to a PublishRequest it shouldn’t have
Jordan Halterman
@kuujo
Mar 02 2017 00:58
if it does that then the server will remove events that the sequencer may be expecting
lemme poke around in the Copycat codes
Jordan Halterman
@kuujo
Mar 02 2017 01:48
hmm… I’m actually wrong
I don’t know why this command is not being completed
ahh it’s the command before that
hmm
2017-03-01 21:24:21,653   | ClientSession             | 77377 - Received CommandResponse[status=OK, error=null, index=140761, eventIndex=140711, result=null]
eventIndex=140711 is never received
2017-03-01 21:24:21,032   | ClientSession             | 77377 - Received QueryResponse[status=OK, error=null, index=140759, eventIndex=140709, result=[]]
2017-03-01 21:24:21,032   | ClientSequencer           | 77377 - Completing response QueryResponse[status=OK, error=null, index=140759, eventIndex=140709, result=[]]
2017-03-01 21:24:21,643   | ClientSession             | 77377 - Sending CommandRequest[session=77377, sequence=9746, command=InstanceCommand[resource=74, command=ResourceCommand[command=Withdraw{topic=work-partition-7}]]]
2017-03-01 21:24:21,644   | FollowerState             | 172.17.0.3:9876 - Received CommandRequest[session=77377, sequence=9746, command=InstanceCommand[resource=74, command=ResourceCommand[command=Withdraw{topic=work-partition-7}]]]
2017-03-01 21:24:21,646   | FollowerState             | 172.17.0.3:9876 - Appended CommandEntry[index=140761, term=18, session=77377, sequence=9746, timestamp=1488403461645, command=InstanceCommand[resource=74, command=ResourceCommand[command=Withdraw{topic=work-partition-7}]]] to log at index 140761
2017-03-01 21:24:21,653   | ClientSession             | 77377 - Received CommandResponse[status=OK, error=null, index=140761, eventIndex=140711, result=null]
there should have been a PublishRequest in there, supposedly
but I don’t see it even being sent, let alone received
Jordan Halterman
@kuujo
Mar 02 2017 02:04
there definitely could have been events produced by index 140711, which is an UpdateAndGet command...
something odd is that the client sends the next KeepAliveRequest as if it had received the event
the server seems to be expecting an event to have been sent, but there’s no PublishRequest for the event
just trying to make sense of this in my head
Jordan Halterman
@kuujo
Mar 02 2017 02:11
the client sets eventIndex to the last sequenced CommandResponse index, presumably because it doesn’t think it should have gotten an event, but that’s not right
I think there’s a bug there
I think I can reproduce this in a unit test
Jordan Halterman
@kuujo
Mar 02 2017 02:17
I see
this thing gets into an odd state and this is what needs to be figured out...

the leader responds to this command:

2017-03-01 21:24:21,643   | ClientSession             | 77377 - Sending CommandRequest[session=77377, sequence=9746, command=InstanceCommand[resource=74, command=ResourceCommand[command=Withdraw{topic=work-partition-7}]]]

with this response

2017-03-01 21:24:21,653   | ClientSession             | 77377 - Received CommandResponse[status=OK, error=null, index=140761, eventIndex=140711, result=null]

However, according to the next PublishRequest that does get sent to the client, eventIndex=140711 was never an event to this session:

2017-03-01 21:24:21,730   | ClientSession             | 77377 - Received PublishRequest[session=77377, eventIndex=140761, previousIndex=140709, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=74, message=[Change{oldValue=Leadership{topic=work-partition-7, leader=Leader{nodeId=172.17.0.3, term=22, termStartTime=1488403219390}, candidates=[172.17.0.3, 172.17.0.4, 172.17.0.2]}, newValue=Leadership{topic=work-partition-7, leader=Leader{nodeId=172.17.0.4, term=23, termStartTime=1488403461730}, candidates=[172.17.0.4, 172.17.0.2]}}]]]]]
2017-03-01 21:24:21,730   | ServerSessionContext      | 77377 - Received PublishResponse[status=OK, error=null, index=140761]

It’s a consistent PublishRequest, but the sequencer is expecting an event at index 140711 which will never come because it doesn’t seem to exist

really odd
the notable section of that PublishRequest is: eventIndex=140761 and previousIndex=140709
that says there’s no event between 140709 and 140761
so… this is interesting
how tf did an event that doesn’t exist get created on the server?
the client sequencer appears to be correct… the server is just giving it nonsensical information
Jordan Halterman
@kuujo
Mar 02 2017 02:24
wish I could add a /headbash emoji like I have in Slack :-(
there is a "No handler for message type" message right before that response though
Jordan Halterman
@kuujo
Mar 02 2017 06:33
I think we can definitely say that the No handler issue is contributing to this problem. I’m seeing a bunch of PublishRequests being sent by the leader on server 3 to 77377 and seeing No handler messages around those times on server 2. But I still need to do some more investigation into the client state.
I think what’s happening is: the leader is sending a bunch of events to the client, and then messages get lost. The client switches its connection to server 2, which doesn’t seem to have the same state for events as server 3 did. Server 2 sends the last PublishRequest I pasted above, which is missing a bunch of events that server 3 sent to the client.
Jordan Halterman
@kuujo
Mar 02 2017 06:46
hmmm
hmmmmmmm
Jordan Halterman
@kuujo
Mar 02 2017 07:24
I can’t tell if this is just caused by the No handler issue or not
that doesn’t seem to be the case, but the logs don’t make sense
Jordan Halterman
@kuujo
Mar 02 2017 07:31
I think there may be an issue in AtomixConsistentMapState.updateAndGet
going to check the earlier logs
Jordan Halterman
@kuujo
Mar 02 2017 07:58
So, there's not enough information in the earlier logs to tell exactly what's going on, but I know what the symptoms are...
Jordan Halterman
@kuujo
Mar 02 2017 08:53
I'm going to have to try again with some sleep under my belt
Jordan Halterman
@kuujo
Mar 02 2017 09:03
The good news is, there's an easy fix for this. I'm just not sure it's the right fix.
Jordan Halterman
@kuujo
Mar 02 2017 18:23

Okay, so I got my sleep. @jhall11 this is fixable, but I'm not sure the fix I know of is the right one. Although, judging from some of our past work it may be acceptable.

So, the problem that's happening is different servers have different sequences of events. In the last run, when the client receives events from the leader and then switches to a follower, the leader has already told the client about some events that the follower doesn't have. We actually guarded against this by ensuring the client can reset its event index when switching servers, which forces the new server to start resending events from the point of the missing event so that the client will accept them, but that actually only solved part of the problem. In this case, the client is learning about a missing event through a response. So, to fix this we just have to ensure that the event index in the response can be used by the client to reset the event index.

The risk here, though, is that some events are actually lost if the state machine events are non-deterministic. This is really the root of the problem. The consistent map state machine's events on updateAndGet seem not to be deterministic, and so can result in different event sequences on different servers. This is a byproduct of incremental compaction. The consistent map state machine relies on past values for some events, but those past values may not exist in the log under certain conditions.

At least, that's what seems to be causing the missing events across servers. So, there are really two quick solutions here: fix the client sequencer to ensure it can progress if the server is missing events. We can say then that it's up to the state machine to decide whether events should be deterministic. Or we may be able to just set the UpdateAndGet command to CompactionMode.FULL for writes/updates to ensure that all the events are created on all the servers. This just forces the commands to be applied on all servers, but they can still be compacted pretty efficiently.
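
For illustration, the second option could look roughly like this (a hedged sketch only; the class name, generic type, and other details are placeholders rather than the actual ONOS/Copycat sources):

import io.atomix.copycat.Command;

// Hypothetical UpdateAndGet command marked for FULL compaction, so the entry
// is retained until it has been applied on all servers and the events it
// produces are generated everywhere.
public class UpdateAndGet implements Command<Object> {
    @Override
    public CompactionMode compaction() {
        return CompactionMode.FULL;
    }
}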

The correct solution is probably to do both. Guard against the risk of non-deterministic events in the Copycat client, and ensure events are deterministic for that state machine in ONOS.

Jordan Halterman
@kuujo
Mar 02 2017 18:30
The No handler issue seems to just be exacerbating this problem - making it more likely by making the client switch servers more often and losing more PublishRequests
It's likely that without that issue we wouldn't even see this, or at least it would be a lot more difficult to reproduce
Jordan Halterman
@kuujo
Mar 02 2017 18:39

I think there are a couple of ways to go about fixing this in Copycat: first, submit a PR to allow the client to skip events if they're actually missing from the server. That would mean it's just up to state machines to ensure events are deterministic if they want them to always be delivered. Alternatively, the client could proactively expire/recreate its session when it detects such an inconsistency. Consistency in clients is bounded by the context of a session, so failing the session is effectively what allows for inconsistency on the client. The latter is probably the correct approach, but it would put session recovery mechanisms under strain.

I'd advocate fixing it in the sequencer and then filing an issue to make it more correct when we have more time to do something like that.

Fixing the sequencer is especially acceptable because we're really just fixing a behavior that already existed. Madan seems to have implemented the logic to ensure the client can reset its last event index if a server is missing some events. So, we can just ensure that works and address any problems with it later.
I'll submit a PR for that
Jordan Halterman
@kuujo
Mar 02 2017 19:05
I reproduced the issue in a unit test, but I just need to make sure fixing it is not going to cause unwanted consistency issues...
Jordan Halterman
@kuujo
Mar 02 2017 19:13
That test is showing a client that gets a CommandResponse indicating there's an event at index 10. The client then gets a PublishRequest that says there are no events between 5 and 25. The sequencer is waiting on event 10 to complete the response. What it needs to do when it gets the PublishRequest is somehow ignore the response's event index and use the PublishRequest as the source of truth for the sequence of events. Hopefully it's not too expensive to do that
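
As a rough illustration of the idea (hypothetical names only; this isn't the actual Copycat ClientSequencer code):

// Hedged sketch of the sequencer fix, not the real implementation.
final class SequencerSketch {

    interface PendingResponse {
        long expectedEventIndex();   // event index the response is waiting on
        void complete();             // completes the queued response
    }

    // When a PublishRequest shows there are no events between previousIndex and
    // eventIndex, any response parked on an event index inside that gap (e.g.
    // waiting on 10 when the gap is 5..25) can never be satisfied, so complete
    // it and treat the PublishRequest as the source of truth for the sequence.
    void onPublish(long previousIndex, long eventIndex, Iterable<PendingResponse> pending) {
        for (PendingResponse response : pending) {
            long expected = response.expectedEventIndex();
            if (expected > previousIndex && expected < eventIndex) {
                response.complete();
            }
        }
    }
}
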
Jon Hall
@jhall11
Mar 02 2017 19:41
wow, good work!
are the missing events due to the No handler issue?
or just due to the leadership change?
Jon Hall
@jhall11
Mar 02 2017 21:11
On the performance front, it looks like it may just have been the difference between the clients connected to followers vs. the leader. Switching to ServerSelectionStrategy.LEADER seems to consistently give us the same results we were seeing on the last release
Jordan Halterman
@kuujo
Mar 02 2017 21:14
Awesome!
Jordan Halterman
@kuujo
Mar 02 2017 21:28

Sorry, I was in a meeting.

So, the no handler issue does cause event messages to be lost, but is not the cause of the inconsistency on the client. The missing handlers just cause the client to switch to a new server where the inconsistency exists. The root cause of the inconsistency seems to be non-determinism in the consistent map state machine.

So, there are three issues to be fixed.

First, the No handler issue would be nice to fix but isn't strictly necessary. No doubt, it has a negative effect on the cluster though and can cause commands to take a lot longer than they need to when this scenario is encountered.

Second, the consistent map state machine should be fixed to ensure that events are consistent across servers. That's a one line code change.

Third, the Copycat client sequencer should be fixed to allow for that inconsistency. That inconsistency is actually already allowed by one of the changes Madan made to Copycat sessions a long time ago; this is just a bug in that coordination that needs to be fixed. We can revisit whether allowing that inconsistency at all is a good idea in a later release.

I'll have a PR in for Copycat today, and can do the same for the consistent map state machine. I'll have to look at the NettyMessagingManager still, but that doesn't seem to be a show stopper with the other two fixes.

Jordan Halterman
@kuujo
Mar 02 2017 21:33
The risk we want to avoid is relaxing consistency checks in the client sequencer to the point that invalid histories are allowed. I have to take some time to write some extra tests to make sure that doesn't happen. But the fix in the sequencer alone should fix the issues the cluster is having in general (by allowing the inconsistency). The other two fixes are just to prevent the inconsistency from occurring in the first place.
Jordan Halterman
@kuujo
Mar 02 2017 22:52
I've been finishing up stuff at work, but finally have some time to hack this out after I grab some lunch.