These are chat archives for atomix/atomix

1st
Mar 2017
Jon Hall
@jhall11
Mar 01 2017 00:18
ok, it looks like I was able to reproduce this, I haven’t dug through the logs yet
I am seeing these messages:
2017-02-28 23:35:45,104 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-6849643038023614706
Jordan Halterman
@kuujo
Mar 01 2017 00:24
Hmm
That handler name doesn't look familiar. Have to look at NettyMessagingManager again
Seems like some messages are not getting delivered in the ONOS Transport implementation
Jordan Halterman
@kuujo
Mar 01 2017 00:32
I'll check out the logs
Jordan Halterman
@kuujo
Mar 01 2017 00:40
As for the client/consistency thing, there are probably some ways we can relax consistency for certain resources, but it would break consistency across resources. The most efficient thing to do would be to make FIFO consistency for responses/events configurable, and multiplex resources on clients based on their consistency models. Consistency would be broken across Copycat clients. But that's something the user can decide to do.
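A rough sketch of the multiplexing idea above, not actual Atomix/Copycat code (ConsistencyModel, register, and clientFor are made-up names): resources are routed to one of several clients according to the consistency they need, so FIFO ordering only holds within a client, not across them.

import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration of multiplexing resources onto clients by
// consistency model. Ordering guarantees then only apply among resources
// that share the same underlying client.
public class ConsistencyMultiplexer<C> {

  public enum ConsistencyModel { LINEARIZABLE, SEQUENTIAL, RELAXED_FIFO }

  private final Map<ConsistencyModel, C> clients = new EnumMap<>(ConsistencyModel.class);

  public void register(ConsistencyModel model, C client) {
    clients.put(model, client);
  }

  public C clientFor(ConsistencyModel model) {
    C client = clients.get(model);
    if (client == null) {
      throw new IllegalStateException("No client registered for " + model);
    }
    return client;
  }
}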
Jon Hall
@jhall11
Mar 01 2017 02:38
I’ve been looking into the message handler issue; it looks like the Transport handler is registered and then immediately unregistered. I’m adding some more logging now to see why
Jordan Halterman
@kuujo
Mar 01 2017 02:38
@jhall11 I see the same thing here: server 1/partition 2 gets elected and then starts getting CommandRequest and commits and applies the commands but doesn’t log a response
Jon Hall
@jhall11
Mar 01 2017 02:39
I wasn’t seeing any exceptions caught by the change either
Jordan Halterman
@kuujo
Mar 01 2017 02:43
hmm
actually, I think one of the logResponse calls was removed during recent changes and just needs to be added back, but the client doesn’t seem to be getting the responses regardless
I’m guessing they’re being sent but not received for some reason
Jon Hall
@jhall11
Mar 01 2017 02:44
although we are printing the stack trace there instead of logging it, and I’m not sure where that is going
Jordan Halterman
@kuujo
Mar 01 2017 02:44
probably the reason above
well, that change would still result in INTERNAL_ERROR responses if an exception were thrown
it’s just a missing log message
which I’ll fix
but the clients certainly aren’t getting them
Jon Hall
@jhall11
Mar 01 2017 02:45
yeah, that’s true
Jordan Halterman
@kuujo
Mar 01 2017 02:45
when I search that partition, the client gets only like 3 responses out of hundreds
wait a minute...
it’s getting them
just need to replace that log message so I don’t confuse myself again :-)
the only thing I see then is the No handler message like crazy
Jordan Halterman
@kuujo
Mar 01 2017 02:52
partition 2 seems to go pretty quiet for a while, but it ultimately gets a couple more queries that succeed with no problems towards the end of the logs
well, that annoying mistake is fixed in the PR
Jon Hall
@jhall11
Mar 01 2017 02:57
I’m gonna grab something to eat, then take a look again. I think the “No handler” issue is caused by a connection either not getting cleaned up properly, or getting closed too soon. I saw some connections opened, then almost immediately closed, and then I started seeing these messages
Jordan Halterman
@kuujo
Mar 01 2017 03:04
those messages seem pretty consistent on some of the nodes, but the cluster seems to be operating normally and I haven’t found any issues yet
one more partition to go
Jordan Halterman
@kuujo
Mar 01 2017 03:09
all the partitions have pretty sizeable gaps of inactivity, but from what I’ve seen so far all the writes/reads are completed… although those CommandResponse logs couldn’t hurt :-)
at least on the servers anyways
Jordan Halterman
@kuujo
Mar 01 2017 03:32
Connections are opened and then closed by clients pretty quickly when they're trying to find a leader. That's one of the issues I noted from the fuzz tests that needs to be fixed. Obviously some messages must be getting lost because of the missing handler. If it's on a client, then that can prevent the client from progressing even if the cluster is doing fine. I'll try to deduce what's being lost after I go eat as well.
Jon Hall
@jhall11
Mar 01 2017 06:08
So it looks like what is happening with the No handler messages is that a client tries to connect and opens a connection. The handler is registered for that connection with type onos-copycat-{partition-id}-{connection-id}. The connection then times out on a server and the handler is unregistered, and we start getting messages with that type some time later. I keep getting distracted, so I wanted to write down what I’ve found so far
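A minimal sketch of the race described above, with a hypothetical registry standing in for NettyMessagingManager (registerHandler/unregisterHandler/dispatch here are illustrative names, not the real ONOS API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Illustrative only: handlers are keyed by message type, e.g.
// "onos-copycat-<partition-id>-<connection-id>". If the connection-scoped
// handler is unregistered when the connection times out on the server, any
// message that arrives for that type afterwards has nowhere to go and shows
// up as "No handler for message type ...".
public class HandlerRegistry {

  private final Map<String, Consumer<byte[]>> handlers = new ConcurrentHashMap<>();

  public void registerHandler(String type, Consumer<byte[]> handler) {
    handlers.put(type, handler);
  }

  public void unregisterHandler(String type) {
    handlers.remove(type);
  }

  public void dispatch(String type, byte[] payload) {
    Consumer<byte[]> handler = handlers.get(type);
    if (handler == null) {
      System.out.println("No handler for message type " + type); // message is dropped
      return;
    }
    handler.accept(payload);
  }
}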
Alex Atomei
@AlexAtomei1_twitter
Mar 01 2017 07:57
@kuujo Thanks for the answer. I was using the standalone server and had atomix-all in my classpath. Will try to bootstrap the cluster from my java app.
Jordan Halterman
@kuujo
Mar 01 2017 17:47
this example server configuration is actually missing a section where resource types are added. Those missing resource types are why you get a ClassNotFoundException. TBH nobody seems to use the standalone server much, and that's probably because it obviously sucks so much :-) Most everybody uses the replica API right now. Over the next couple of months the standalone server will probably be rewritten to take a JSON/YAML configuration, with better command-line tools, an HTTP API, better serialization, and better pluggability. That work has been started to some extent but will take a while, as we're working out performance/stability issues in Copycat first.
@jhall11 I have been sick for a couple days and actually fell asleep on the couch last night :-P but I seem to be better now, and I'm going to split up all the sessions and can hopefully find where some request was lost. It may be the case that the missing handlers have no negative effect on the cluster, assuming request futures were properly failed. Missing handlers may just be a product of the semantic differences between ONOS's messaging service and Copycat's transport.
Jordan Halterman
@kuujo
Mar 01 2017 17:52
if that makes sense
Jon Hall
@jhall11
Mar 01 2017 17:53
I’m glad you are feeling somewhat better!
yes, I couldn’t see any obvious side effects from the missing handlers. I didn’t look much more last night, but I’m about to dig in again
Jordan Halterman
@kuujo
Mar 01 2017 17:53
Meaning, as long as the Connection was already closed from Copycat's perspective
Jordan Halterman
@kuujo
Mar 01 2017 19:15
I did find this
org.onosproject.store.service.ConsistentMapException$Timeout: onos-packet-requests
    at org.onosproject.store.primitives.DefaultConsistentMap.complete(DefaultConsistentMap.java:228)
    at org.onosproject.store.primitives.DefaultConsistentMap.lambda$computeIf$2(DefaultConsistentMap.java:110)
org.onosproject.store.service.ConsistentMapException$Timeout: onos-network-configs
    at org.onosproject.store.primitives.DefaultConsistentMap.complete(DefaultConsistentMap.java:228)
    at org.onosproject.store.primitives.DefaultConsistentMap.keySet(DefaultConsistentMap.java:138)
    at org.onosproject.store.config.impl.DistributedNetworkConfigStore.processPendingConfigs(DistributedNetworkConfigStore.java:132)
    at org.onosproject.store.config.impl.DistributedNetworkConfigStore.addConfigFactory(DistributedNetworkConfigStore.java:125)
    at org.onosproject.net.config.impl.NetworkConfigManager.registerConfigFactory(NetworkConfigManager.java:107)
Jon Hall
@jhall11
Mar 01 2017 19:20
hmm
Jordan Halterman
@kuujo
Mar 01 2017 19:21
I split up all the sessions I could find on that node and they’re all getting all the responses for all their requests and don’t seem to be blocked for any reason
need to look at them a little more closely to figure out the state inside the client
it’s still possible that a CompletableFuture may not be completed after a response is received, if the response arrives out of order, so I have to check the indexes, but nothing so far
Jordan Halterman
@kuujo
Mar 01 2017 19:27
there is something odd in session 33836 on server 1: the leader is the local server. When the leader sends a PublishRequest, I don’t see it receiving a PublishResponse back even though the request appears to be successfully received and sequenced by the client. PublishResponse isn’t required for the system to progress since KeepAliveRequest does the same thing, but I wonder if it could be an issue in the ONOS transport.
regardless, that doesn’t seem to break anything
all these logs show clients progressing in all aspects with commands and queries being completed and events being sequenced and completed
need to try to track down the requests that caused the above exceptions
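A hypothetical sketch of why a lost PublishResponse isn’t fatal, per the point above about KeepAliveRequest doing the same thing (SessionState and its methods are made-up names, not the Copycat implementation):

// Illustrative only: the session's acknowledged event index can be advanced
// from either a PublishResponse or the next KeepAliveRequest, so losing the
// former only delays the acknowledgement until the next keep-alive.
class SessionState {
  private long ackedEventIndex;

  void onPublishResponse(long eventIndex) {
    ackedEventIndex = Math.max(ackedEventIndex, eventIndex);
  }

  void onKeepAliveRequest(long eventIndex) {
    ackedEventIndex = Math.max(ackedEventIndex, eventIndex);
  }

  long ackedEventIndex() {
    return ackedEventIndex;
  }
}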
Jordan Halterman
@kuujo
Mar 01 2017 19:40
for the second exception, on keySet, I see keySet requests from three clients that are all completed before that exception
has to be something on the client that’s not happening
according to the state of the client based on later KeepAliveRequests the client should indeed have completed them as well
but what looks like a 5-second timeout elapses anyway
suggesting the future was for some reason not completed
Jon Hall
@jhall11
Mar 01 2017 19:48
could it be that the thread is busy?
Jordan Halterman
@kuujo
Mar 01 2017 19:48
that’s what I was thinking, but that doesn’t look like a thread Copycat should be trying to use
hmm wait
yeah
the stacktrace from the exception doesn’t seem like something happening in one of Copycat’s threads so it should be fine to block there
although that doesn’t mean it’s not blocked elsewhere
Copycat takes some measures to recognize when the IO thread is blocked and uses the event thread to complete futures instead, but I don’t think it’s bulletproof yet
I’d sort of be interested in adding another log message to the Copycat client to show when it’s actually completing operations
all these logs look perfect otherwise, but obviously those exceptions are being thrown so there must be some information missing
one sec
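A rough sketch of the kind of safeguard mentioned above for completing futures off the I/O thread (SafeCompleter, ioThread, and eventExecutor are made-up names; this is not the actual Copycat code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Illustrative only: if completion would run on the I/O thread, hand it off
// to a separate event executor so a caller blocking on the future cannot
// stall the I/O loop.
class SafeCompleter {
  private final Thread ioThread;
  private final Executor eventExecutor;

  SafeCompleter(Thread ioThread, Executor eventExecutor) {
    this.ioThread = ioThread;
    this.eventExecutor = eventExecutor;
  }

  <T> void complete(CompletableFuture<T> future, T value) {
    if (Thread.currentThread() == ioThread) {
      eventExecutor.execute(() -> future.complete(value));
    } else {
      future.complete(value);
    }
  }
}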
Jordan Halterman
@kuujo
Mar 01 2017 20:01
Okay, I added DEBUG logging to the ClientSequencer, which is the last bit of complex logic a response/event goes through after it's received on the client. Madan and I spent a lot of time making that thing work well, but it's a complicated process and could certainly still have issues. This will show whether it does or doesn't. Once a response/event is through that sequencing process, the future will be completed unless the threads are blocked.
It's in atomix/copycat#292
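A simplified sketch of what that sequencing step does conceptually; the real ClientSequencer is considerably more involved, and the names here are illustrative:

import java.util.Map;
import java.util.TreeMap;

// Illustrative only: completions are held until everything with a lower
// sequence number has run, so futures complete in order even when responses
// or events arrive out of order. If an expected sequence number never
// arrives, everything behind it stays queued, which is exactly the kind of
// hang the new DEBUG logging should expose.
class Sequencer {
  private long nextSequence = 1;
  private final Map<Long, Runnable> pending = new TreeMap<>();

  synchronized void sequence(long sequence, Runnable completion) {
    pending.put(sequence, completion);
    Runnable next;
    while ((next = pending.remove(nextSequence)) != null) {
      next.run();
      nextSequence++;
    }
  }
}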
Jon Hall
@jhall11
Mar 01 2017 20:03
ok
Jordan Halterman
@kuujo
Mar 01 2017 21:30
The only idea I've come up with so far is the blocked thread problem. I'm gonna go through the previous run's logs again and maybe I can see something there with this extra knowledge.
vishwas
@vishwass
Mar 01 2017 21:33
Storage storage = Storage.builder()
    .withDirectory("logs")
    .withStorageLevel(StorageLevel.DISK)
    .build();

LOGGER.info("LOGS:" + storage.openLog("copycat").toString());
@kuujo I am using the above code in the shutdown and leave functions of the Copycat server, but I am still unable to access the log and print it. Any pointers?
Jordan Halterman
@kuujo
Mar 01 2017 21:37

@vishwass the Log is a Raft log that's on disk. It's not quite as easy as just printing the object. The Log object is just an interface to read and write the Raft log. You'll have to do something like:

for (long index = log.firstIndex(); index <= log.lastIndex(); index++) {
  try (Entry entry = log.get(index)) {
    if (entry != null) {
      // print entry
    }
  }
}

Look at all the tests for the Log to see how it's used

I guess we could print out all the entries in toString, but that would make it an insanely expensive operation in many cases
vishwas
@vishwass
Mar 01 2017 21:40
I just want to print the entire log when the server is about to shut down and leave the cluster
I guess we can do that in the leave function then!?
Jordan Halterman
@kuujo
Mar 01 2017 22:04
depends on where
the leave() method deletes the logs
I’d do it in shutdown()
before this line
@jhall11 I see just one of these on the previous run:
org.onosproject.store.service.StorageException$Timeout
    at org.onosproject.store.primitives.DefaultLeaderElector.complete(DefaultLeaderElector.java:115)
    at org.onosproject.store.primitives.DefaultLeaderElector.getLeaderships(DefaultLeaderElector.java:80)
    at org.onosproject.store.cluster.impl.DistributedLeadershipStore.getLeaderships(DistributedLeadershipStore.java:170)
    at org.onosproject.cluster.impl.LeadershipManager.getLeaderBoard(LeadershipManager.java:98)
    at org.onosproject.cli.net.LeaderCommand.execute(LeaderCommand.java:140)
    at org.onosproject.cli.AbstractShellCommand.doExecute(AbstractShellCommand.java:150)
    at org.apache.karaf.shell.console.AbstractAction.execute(AbstractAction.java:33)[27:org.apache.karaf.shell.console:3.0.8]
    at org.apache.karaf.shell.commands.basic.AbstractCommand.execute(AbstractCommand.java:33)[27:org.apache.karaf.shell.console:3.0.8]
actually, I see what the No handler messages are losing in that run
2017-02-28 08:12:30,130 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80788, previousIndex=80774, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:07:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:6f:70:65:6e:66:6c:6f:f7, newValue=Versioned{value=byte[]{length=7947, hash=975134002}, version=40, creationTime=2017-02-28T08:08:55.546Z}, oldValue=null}]]]]]
2017-02-28 08:12:30,130 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:30,130 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80794, previousIndex=80788, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0a:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:63:70:f2, newValue=Versioned{value=byte[]{length=7893, hash=-1658661112}, version=41, creationTime=2017-02-28T08:08:55.594Z}, oldValue=null}]]]]]
2017-02-28 08:12:30,130 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:30,131 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80798, previousIndex=80794, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0e:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:76:72:6f:75:74:65:f2, newValue=Versioned{value=byte[]{length=7953, hash=2012870500}, version=42, creationTime=2017-02-28T08:08:55.627Z}, oldValue=null}]]]]]
2017-02-28 08:12:30,131 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
Jordan Halterman
@kuujo
Mar 01 2017 22:10
that’s definitely on the client, and it’s losing PublishRequests, which can prevent the client from making progress
eventually it should resolve, but not if the server doesn’t know the client can’t get the messages and the client doesn’t itself fix the problem
btw I’m looking at logs_172.17.0.4_test_cluster_controller_restarts_FAILED/log/karaf.log

Also, the logs do show that the client is not making any progress. Its next KeepAliveRequest obviously doesn’t contain those events. The server then tries to resend them again:

2017-02-28 08:12:32,635 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80788, previousIndex=80774, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:07:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:6f:70:65:6e:66:6c:6f:f7, newValue=Versioned{value=byte[]{length=7947, hash=975134002}, version=40, creationTime=2017-02-28T08:08:55.546Z}, oldValue=null}]]]]]
2017-02-28 08:12:32,635 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:32,635 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80794, previousIndex=80788, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0a:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:63:70:f2, newValue=Versioned{value=byte[]{length=7893, hash=-1658661112}, version=41, creationTime=2017-02-28T08:08:55.594Z}, oldValue=null}]]]]]
2017-02-28 08:12:32,635 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:32,635 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80798, previousIndex=80794, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0e:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:76:72:6f:75:74:65:f2, newValue=Versioned{value=byte[]{length=7953, hash=2012870500}, version=42, creationTime=2017-02-28T08:08:55.627Z}, oldValue=null}]]]]]
2017-02-28 08:12:32,635 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257

but obviously the client’s still not getting them @jhall11

and again
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80788, previousIndex=80774, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:07:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:6f:70:65:6e:66:6c:6f:f7, newValue=Versioned{value=byte[]{length=7947, hash=975134002}, version=40, creationTime=2017-02-28T08:08:55.546Z}, oldValue=null}]]]]]
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80794, previousIndex=80788, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0a:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:63:70:f2, newValue=Versioned{value=byte[]{length=7893, hash=-1658661112}, version=41, creationTime=2017-02-28T08:08:55.594Z}, oldValue=null}]]]]]
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.2 | 79136 - Sending PublishRequest[session=79136, eventIndex=80798, previousIndex=80794, events=[Event[event=changeEvents, message=InstanceEvent[resource=112, message=[MapEvent{name=, type=INSERT, key=8e:02:01:00:0e:6f:72:67:2e:6f:6e:6f:73:70:72:6f:6a:65:63:74:2e:76:72:6f:75:74:65:f2, newValue=Versioned{value=byte[]{length=7953, hash=2012870500}, version=42, creationTime=2017-02-28T08:08:55.627Z}, oldValue=null}]]]]]
2017-02-28 08:12:35,139 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-8006912965292463257
etc etc
I’ll see if I can see that in the newer run
Jordan Halterman
@kuujo
Mar 01 2017 22:16
yeah…
2017-02-28 23:33:40,980 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Sending PublishRequest[session=34137, eventIndex=34382, previousIndex=34298, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=null, candidates=[]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:40,981 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Sending PublishRequest[session=33050, eventIndex=34382, previousIndex=34298, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=null, candidates=[]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:40,981 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33836 - Sending PublishRequest[session=33836, eventIndex=34382, previousIndex=34298, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=null, candidates=[]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:40,981 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-4437736648291622574
2017-02-28 23:33:40,982 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Received PublishResponse[status=OK, error=null, index=34382]
2017-02-28 23:33:40,983 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Received PublishResponse[status=OK, error=null, index=34382]
so, session 33836 isn’t getting PublishRequest for some time
oddly, it does seem to get some of them but not others
2017-02-28 23:33:43,498 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Sending PublishRequest[session=34137, eventIndex=34386, previousIndex=34382, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}}]]]]]
2017-02-28 23:33:43,498 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Sending PublishRequest[session=33050, eventIndex=34386, previousIndex=34382, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}}]]]]]
2017-02-28 23:33:43,498 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33836 - Sending PublishRequest[session=33836, eventIndex=34386, previousIndex=34382, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}}]]]]]
2017-02-28 23:33:43,498 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-4437736648291622574
2017-02-28 23:33:43,499 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Received PublishResponse[status=OK, error=null, index=34386]
2017-02-28 23:33:43,499 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Received PublishResponse[status=OK, error=null, index=34386]
but right after that
2017-02-28 23:33:43,504 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Sending PublishRequest[session=34137, eventIndex=34387, previousIndex=34386, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:43,504 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Sending PublishRequest[session=33050, eventIndex=34387, previousIndex=34386, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:43,504 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33836 - Sending PublishRequest[session=33836, eventIndex=34387, previousIndex=34386, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2, 172.17.0.3]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324820978}, candidates=[172.17.0.2]}}]]]]]
2017-02-28 23:33:43,505 | DEBUG | artition-1-state | NettyMessagingManager            | 129 - org.onosproject.onos-core-dist - 1.10.0.SNAPSHOT | No handler for message type onos-copycat-1-4437736648291622574
2017-02-28 23:33:43,505 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 34137 - Received PublishResponse[status=OK, error=null, index=34387]
2017-02-28 23:33:43,505 | DEBUG | artition-1-state | ServerSessionContext             | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33050 - Received PublishResponse[status=OK, error=null, index=34387]
2017-02-28 23:33:43,505 | DEBUG | ycat-client-io-1 | ClientSession                    | 90 - io.atomix.all - 1.0.3.SNAPSHOT | 33836 - Received PublishRequest[session=33836, eventIndex=34386, previousIndex=34382, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=86, message=[Change{oldValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324821187}, candidates=[172.17.0.2]}, newValue=Leadership{topic=device:of:0000566d42203142, leader=Leader{nodeId=172.17.0.2, term=157, termStartTime=1488324821187}, candidates=[172.17.0.2, 172.17.0.3]}}]]]]]
confusing
Jordan Halterman
@kuujo
Mar 01 2017 22:21
So, I rescind my previous statement. My best idea is that this has something to do with the No handler message.
I’m guessing if we do another run with the sequencer logging, we’ll see sequencing hang because of missed PublishRequests on a client
this occurs before each of these exceptions
a PublishRequest with a No handler message and then a timeout
I'll have to take a look at the ONOS transport this evening. But it basically seems for some reason that the Copycat server thinks the connection is still open (thus it's sending PublishRequest) while the client or at least the client side of the connection thinks it's closed and has therefore unregistered the handler in the messaging manager.
Jordan Halterman
@kuujo
Mar 01 2017 22:27
Ugh Gitter is losing my messages again. Need to stop typing on my phone
I'll take a look at the ONOS Transport implementation this evening. But in general, it seems the server side believes the connection is still open (it's sending PublishRequest) while the client, or at least the client side of the connection, seems to think the connection has been closed and so it unregistered the handler in the messaging manager.
Ahh there's my message
Pretty consistent phrasing I have :-P
Jon Hall
@jhall11
Mar 01 2017 22:32
yes, almost spot on :)
Jordan Halterman
@kuujo
Mar 01 2017 22:33
Haha
Jon Hall
@jhall11
Mar 01 2017 22:35
I am seeing some handlers opened and then immediately closed, and then later other nodes getting messages for that connection ID. I’m not sure this is correct, but I still don’t fully understand how Netty works
Jon Hall
@jhall11
Mar 01 2017 22:44
oh, never mind, that is not closing the connection, but adding a close listener
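For reference, the distinction being made: in Netty, channel.close() closes the channel, whereas channel.closeFuture().addListener(...) only registers a callback to run when the channel is eventually closed. A minimal example (ConnectionCleanup and onClose are made-up names):

import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;

class ConnectionCleanup {
  // Registers cleanup to run when the channel closes; it does not close the
  // channel itself, which is easy to misread in the transport code.
  static void onClose(Channel channel, Runnable cleanup) {
    channel.closeFuture().addListener((ChannelFutureListener) future -> cleanup.run());
  }
}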
Jon Hall
@jhall11
Mar 01 2017 23:01
let me grab the new logs for you
Jordan Halterman
@kuujo
Mar 01 2017 23:01
Sweet
I'm actually leaving work soon, so I'll look at the CopycatTransport and NettyMessagingManager in just a bit.
Jon Hall
@jhall11
Mar 01 2017 23:05
here they are
Jordan Halterman
@kuujo
Mar 01 2017 23:05
Thanks!
Jon Hall
@jhall11
Mar 01 2017 23:07
I am seeing the No handler issue when I start up a 7-node cluster. I don’t notice any issues besides that warning, though
Jon Hall
@jhall11
Mar 01 2017 23:39
hmm, there is definitely something not correct with the handlers; I’m seeing one partition with 279 connection handlers registered for it
Jordan Halterman
@kuujo
Mar 01 2017 23:50
jeez
k I’m at home now and looking