These are chat archives for cloudera/kudu

5th Jan 2018
Aaron Hiniker
@hindog
Jan 05 2018 20:23

I have a Spark job that appears to be hung (Kudu 1.4.0-cdh5.12.0):
In the driver stacks, I see threads stuck here:

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
com.stumbleupon.async.Deferred.join(Deferred.java:1019)
org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:340)
org.apache.kudu.client.KuduClient.tableExists(KuduClient.java:196)
org.apache.kudu.spark.kudu.KuduContext.tableExists(KuduContext.scala:102)
...

and in the driver logs, I'm seeing these messages get logged:

20:19:07 WARN  ConnectToCluster - Unable to find the leader master (10.50.3.56:7051,10.50.3.209:7051,10.50.3.202:7051), will retry
20:19:07 ERROR TabletClient - [Peer master-10.50.3.56:7051] Unexpected exception from downstream on [id: 0xf129dffb, /10.50.3.72:37744 => /10.50.3.56:7051]
java.lang.RuntimeException: Could not deserialize the response, incompatible RPC? Error is: step
        at org.apache.kudu.client.KuduRpc.readProtobuf(KuduRpc.java:383)
        at org.apache.kudu.client.Negotiator.parseSaslMsgResponse(Negotiator.java:282)
        at org.apache.kudu.client.Negotiator.handleResponse(Negotiator.java:235)
        at org.apache.kudu.client.Negotiator.messageReceived(Negotiator.java:229)

Running kudu cluster ksck <master> reports all tables as healthy.
I should probably mention that we have multiple threads operating on independent KuduContext instances in this case. This same code had been running fine for months until recently.
Aaron Hiniker
@hindog
Jan 05 2018 20:29
Oh, and also this in the Kudu logs: W0105 19:58:16.417731 18382 negotiation.cc:310] Unauthorized connection attempt: Server connection negotiation failed: server connection from 10.50.3.72:53896: authentication token expired
Dan Burkert
@danburkert
Jan 05 2018 23:05
@hindog You'll probably have more luck on the Kudu Slack channel, where discussion is much more active (https://getkudu-slack.herokuapp.com/). KUDU-2013 is probably your issue, especially if that job ran for more than 6 or 7 days.
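
For the hang itself (driver threads parked in Deferred.join under KuduClient.tableExists), one generic way to keep a blocking call from stalling a thread forever is to run it on a worker thread and wait with a deadline. The following is only a minimal sketch using the JDK's java.util.concurrent; the class name BoundedCall and the method callWithTimeout are hypothetical and are not part of the Kudu client or kudu-spark. It does not fix the expired-token condition; it only turns a silent hang into a visible TimeoutException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Kudu API): bounds how long a blocking call may run.
final class BoundedCall {
    private BoundedCall() {}

    /**
     * Runs the given call on a daemon worker thread and waits at most
     * timeoutMs milliseconds for its result. On timeout the worker is
     * interrupted and a TimeoutException is thrown to the caller.
     */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // do not keep the JVM alive if the call never returns
            return t;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the call may or may not react
            throw e;
        } catch (ExecutionException e) {
            // Surface the original failure from the wrapped call when possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a job like the one above, the wrapped call would be something like the tableExists check, with a timeout chosen by the caller; on TimeoutException the job could log the failure and exit instead of hanging.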