These are chat archives for atomix/atomix

4th
Sep 2018
Maxim Manco
@mmanco
Sep 04 2018 22:24
Hi guys, I've been playing with Atomix for the past month and finally came up with a locally working solution (run directly and through Docker) that suits our needs. The problem starts when I try to deploy the service under Kubernetes.
I am seeing a bunch of io.netty.handler.codec.DecoderException:
io.netty.handler.codec.DecoderException: java.net.UnknownHostException: addr is of illegal length
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:392) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:359) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822) [netty-transport-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) [netty-common-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:322) [netty-transport-native-epoll-4.1.27.Final-linux-x86_64.jar!/:4.1.27.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) [netty-common-4.1.27.Final.jar!/:4.1.27.Final]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: java.net.UnknownHostException: addr is of illegal length
        at java.net.InetAddress.getByAddress(InetAddress.java:1042) ~[na:1.8.0_121]
        at java.net.InetAddress.getByAddress(InetAddress.java:1439) ~[na:1.8.0_121]
        at io.atomix.cluster.messaging.impl.MessageDecoder.decode(MessageDecoder.java:87) ~[atomix-cluster-3.0.3.jar!/:na]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428) ~[netty-codec-4.1.27.Final.jar!/:4.1.27.Final]
        ... 16 common frames omitted

Tried to triage the problem based on the stacktrace and I am seeing the following behavior which I would like to confirm with the pros :)

  • MessageEncoder does not write the sender IP to the buffer, since addressWritten is already set to true. (Is it expected that the buffer will be pre-populated with the IP, since the buffer is passed to the encoder from outside?)
  • MessageDecoder lands in the READ_SENDER_VERSION state, which reads a value totally different from the current VERSION. That causes the remaining states to read wrong data and leads to the exception above.

Is there something I may be missing, or could this be a bug?

Please note that I have verified that there are no version conflicts (mismatches), specifically with Netty and Atomix, and exactly the same setup works locally and through Docker (without k8s).
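The innermost frame in the trace is InetAddress.getByAddress, which accepts only a 4-byte (IPv4) or 16-byte (IPv6) array and throws UnknownHostException("addr is of illegal length") for any other length. So whatever IP length MessageDecoder read just before that call was not a real IP length. A minimal standalone JDK check of that behavior (the class and method names are hypothetical):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical check: InetAddress.getByAddress only accepts 4-byte (IPv4)
// or 16-byte (IPv6) arrays; any other length throws UnknownHostException.
class IllegalAddressLength {

    // Returns the exception message for a raw address of the given length,
    // or null if that length was accepted.
    static String tryLength(int length) {
        byte[] raw = new byte[length];
        try {
            InetAddress.getByAddress(raw);
            return null;
        } catch (UnknownHostException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        for (int length : new int[] {4, 16, 7, 0}) {
            String error = tryLength(length);
            System.out.println(length + " bytes -> " + (error == null ? "ok" : error));
        }
    }
}
```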

Jordan Halterman
@kuujo
Sep 04 2018 22:28
A new MessageEncoder is created for each Netty channel, and the sender’s address is written when the first message is sent. Basically, the version, IP, and port are written once per connection, and then addressWritten is set to true so they’re not sent with every message; the MessagingService uses TCP and the connection is persistent anyway.
I’ve definitely never seen this exception… looking through the code rn
seems really strange that the version would be inconsistent because it’s the first thing written and the first thing read
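A simplified sketch of the encoder behavior described above: one encoder per connection, with the version, IP, and port written only before the first message. The exact field widths (short version, byte IP length, int port, int-length-prefixed payloads), the HeaderOnceEncoder class, and the use of java.nio.ByteBuffer instead of a Netty ByteBuf are assumptions for illustration, not Atomix's actual MessageEncoder format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified per-connection encoder. The wire layout (short version,
// byte IP length, IP bytes, int port, then int-length-prefixed payloads) is an
// assumption for illustration, not Atomix's exact MessageEncoder format.
class HeaderOnceEncoder {
    static final short VERSION = 1; // hypothetical protocol version

    private final byte[] senderIp;   // 4 bytes (IPv4) or 16 bytes (IPv6)
    private final int senderPort;
    private boolean addressWritten = false; // one encoder, and one flag, per connection

    HeaderOnceEncoder(byte[] senderIp, int senderPort) {
        this.senderIp = senderIp.clone();
        this.senderPort = senderPort;
    }

    // Appends one message to this connection's outbound buffer.
    void encode(String payload, ByteBuffer out) {
        if (!addressWritten) {
            out.putShort(VERSION);
            out.put((byte) senderIp.length);
            out.put(senderIp);
            out.putInt(senderPort);
            addressWritten = true; // the header is never written again on this connection
        }
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        out.putInt(body.length).put(body);
    }

    public static void main(String[] args) {
        HeaderOnceEncoder encoder = new HeaderOnceEncoder(new byte[] {10, 0, 0, 5}, 5679);
        ByteBuffer out = ByteBuffer.allocate(256);
        encoder.encode("first", out);  // 11-byte header + 4-byte length + 5-byte payload
        System.out.println("after first message: " + out.position() + " bytes");
        encoder.encode("second", out); // 4-byte length + 6-byte payload, no header
        System.out.println("after second message: " + out.position() + " bytes");
    }
}
```

Because the header is only ever at the start of the stream, the receiving side has to consume it exactly once, with a decoder whose lifetime matches the connection.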
Maxim Manco
@mmanco
Sep 04 2018 22:32
@kuujo That is my thought as well. I will continue to look for the root cause and report back if I find something useful. Thanks for checking this out!
Jordan Halterman
@kuujo
Sep 04 2018 22:32
The inverse of the MessageEncoder is the MessageDecoder being initialized to READ_SENDER_VERSION, reading the version, IP, and port, and then never returning to that state
sounds good
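A simplified sketch of the decoder side just described: a fresh decoder starts in READ_SENDER_VERSION, consumes the version, IP, and port once, and then stays in a message-reading state for the rest of the connection. The state names follow the conversation; the field widths, the HeaderOnceDecoder class, and the use of java.nio.ByteBuffer are assumptions, and unlike a real Netty ByteToMessageDecoder this sketch does not handle partial reads.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-connection decoder for the simplified layout assumed here
// (short version, byte IP length, IP bytes, int port, then int-length-prefixed
// payloads). Illustrative only; not Atomix's actual MessageDecoder.
class HeaderOnceDecoder {
    enum State { READ_SENDER_VERSION, READ_SENDER_IP, READ_SENDER_PORT, READ_MESSAGE }

    State state = State.READ_SENDER_VERSION; // every new decoder starts here
    short senderVersion;
    InetAddress senderIp;
    int senderPort;

    // Decodes all messages in 'in'. Assumes every field is fully present.
    List<String> decode(ByteBuffer in) throws UnknownHostException {
        List<String> messages = new ArrayList<>();
        while (in.hasRemaining()) {
            switch (state) {
                case READ_SENDER_VERSION: {
                    senderVersion = in.getShort(); // not validated in this sketch
                    state = State.READ_SENDER_IP;
                    break;
                }
                case READ_SENDER_IP: {
                    byte[] ip = new byte[in.get() & 0xFF]; // unsigned length prefix
                    in.get(ip);
                    senderIp = InetAddress.getByAddress(ip); // needs 4 or 16 bytes
                    state = State.READ_SENDER_PORT;
                    break;
                }
                case READ_SENDER_PORT: {
                    senderPort = in.getInt();
                    state = State.READ_MESSAGE; // header consumed; never revisited
                    break;
                }
                case READ_MESSAGE: {
                    byte[] body = new byte[in.getInt()];
                    in.get(body);
                    messages.add(new String(body, StandardCharsets.UTF_8));
                    break;
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) throws UnknownHostException {
        ByteBuffer wire = ByteBuffer.allocate(64);
        wire.putShort((short) 1).put((byte) 4).put(new byte[] {10, 0, 0, 5}).putInt(5679);
        for (String s : new String[] {"hello", "world"}) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            wire.putInt(b.length).put(b);
        }
        wire.flip();
        HeaderOnceDecoder decoder = new HeaderOnceDecoder();
        List<String> messages = decoder.decode(wire);
        System.out.println(messages + " from " + decoder.senderIp + ":" + decoder.senderPort);
    }
}
```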
Maxim Manco
@mmanco
Sep 04 2018 22:33
That's the thing: I see it returning to READ_SENDER_VERSION several times. Could it be something else up the chain that causes the channel to be recreated?
Jordan Halterman
@kuujo
Sep 04 2018 22:33
curious what the ByteBuf’s pointers look like when the version is read
I see
hmm
yeah it would have to be a new decoder object
but using the same ByteBuf
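To illustrate the hypothesis of a new decoder reading from the middle of an existing stream: if the bytes in front of a decoder in its initial state are actually a message frame rather than a connection header, the frame's length prefix is misread as the version and the IP length. With the simplified layout assumed here, that ends in the same "addr is of illegal length" failure as the trace. The class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: header-reading logic (as in a decoder's initial
// state) applied to bytes that are really a message frame (int length + payload).
class MidStreamHeaderRead {

    // Reads a header the way a fresh decoder would: short version,
    // unsigned byte IP length, then that many IP bytes.
    static InetAddress readHeader(ByteBuffer in) throws UnknownHostException {
        short version = in.getShort();  // really the high two bytes of the frame length
        int ipLength = in.get() & 0xFF; // really the third byte of the frame length
        System.out.println("version=" + version + " ipLength=" + ipLength);
        byte[] ip = new byte[ipLength];
        in.get(ip);
        return InetAddress.getByAddress(ip); // throws unless ipLength is 4 or 16
    }

    // Builds a single message frame with no connection header in front of it.
    static ByteBuffer messageFrame(String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        ByteBuffer frame = ByteBuffer.allocate(4 + body.length);
        frame.putInt(body.length).put(body);
        frame.flip();
        return frame;
    }

    public static void main(String[] args) {
        try {
            readHeader(messageFrame("hello"));
        } catch (UnknownHostException e) {
            System.out.println(e.getMessage()); // addr is of illegal length
        }
    }
}
```

Here the frame length 5 is encoded big-endian as 00 00 00 05, so the first two bytes are read as version 0 and the third as an IP length of 0, which getByAddress rejects.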
Maxim Manco
@mmanco
Sep 04 2018 22:34
I'll try to trace that route
Jordan Halterman
@kuujo
Sep 04 2018 22:34
but the only place MessageDecoder is instantiated is in the Netty handler's initChannel method
strange
Maxim Manco
@mmanco
Sep 04 2018 22:36
indeed!
Jordan Halterman
@kuujo
Sep 04 2018 22:54
FYI my next major task is working on a prototype for k8s deployments, including upgrades and what not
Maxim Manco
@mmanco
Sep 04 2018 22:58
Sounds good, hopefully I'll have a working solution and will be able to provide some thoughts
Jordan Halterman
@kuujo
Sep 04 2018 22:59
Yes that would be awesome. We have worked on some k8s deployments of ONOS with Atomix embedded, but not since the changes in 3.0 so I’m starting from scratch and taking notes from past deployments
Maxim Manco
@mmanco
Sep 04 2018 23:01
Basically, I am planning on deploying a stateful set and a headless service that will act as the cluster seed/management nodes; any other service that needs distributed primitives will then use the seed nodes as its bootstrap nodes.
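With a StatefulSet governed by a headless Service, Kubernetes gives each pod a stable DNS name of the form <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>, so the seed/bootstrap addresses can be computed up front rather than discovered. A sketch of that derivation; the StatefulSet and Service names, namespace, and port used here are hypothetical, and how the resulting list is passed to Atomix's bootstrap configuration is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive stable seed-node addresses for the pods of a StatefulSet
// governed by a headless Service, using the Kubernetes naming scheme
// <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>.
class StatefulSetSeeds {

    static List<String> seedAddresses(String statefulSet, String headlessService,
                                      String namespace, String clusterDomain,
                                      int replicas, int port) {
        List<String> seeds = new ArrayList<>();
        for (int ordinal = 0; ordinal < replicas; ordinal++) {
            // Pod names are stable across restarts: <statefulset>-0, <statefulset>-1, ...
            String host = String.format("%s-%d.%s.%s.svc.%s",
                statefulSet, ordinal, headlessService, namespace, clusterDomain);
            seeds.add(host + ":" + port);
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Hypothetical names: a 3-replica "atomix" StatefulSet behind an
        // "atomix-hs" headless Service in the "default" namespace.
        for (String seed : seedAddresses("atomix", "atomix-hs", "default", "cluster.local", 3, 5679)) {
            System.out.println(seed);
        }
    }
}
```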
Jordan Halterman
@kuujo
Sep 04 2018 23:02
yeah that’s basically my planned approach
I’ve tried so hard to push stateful sets on people, and they always seem to run into problems with them that lead them to abandon them and build something approximating them, so we’ll see 🤷‍♂️
obviously that doesn’t sound like a good idea to me
I don’t remember the specific problems they ran into, but I’ll be going back and talking to the teams I know of that have done it to try to better understand the issues they saw
Maxim Manco
@mmanco
Sep 04 2018 23:05
I found it even more complex to manage with a Deployment. Everything becomes unpredictable and forces you to couple the system with k8s service discovery and what not.
Jordan Halterman
@kuujo
Sep 04 2018 23:05
some of them may have been more related to ONOS which significantly increases the complexity of the deployment on top of consensus though
yeah exactly