These are chat archives for atomix/atomix

14th Nov 2018
Jordan Halterman
@kuujo
Nov 14 2018 00:17
@zwillim which version are you running?
Sometimes this is the result of a blocked thread, sometimes a bug
rarely the network actually going slow
Jordan Halterman
@kuujo
Nov 14 2018 00:55
@coopci I actually did a lot more benchmarking and found that it was just the Docker agent and the profiler that were freaking out. In reality, the nodes were doing just fine and continuing to make progress.
One thing I did notice is the explosion of sockets on large clusters. Each node has a pool of up to 8 connections to each peer.
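For a rough sense of scale (the arithmetic below is mine; the only number taken from the message above is the per-peer pool size of 8), a fully meshed cluster can open on the order of N × (N − 1) × 8 sockets:

// Back-of-the-envelope worst case: every node keeps a pool of up to 8
// outbound connections to every other node, so socket count grows quadratically.
int nodes = 50;             // hypothetical cluster size
int poolSizePerPeer = 8;    // per-peer connection pool size mentioned above
long sockets = (long) nodes * (nodes - 1) * poolSizePerPeer;
System.out.println("Worst-case outbound sockets cluster-wide: " + sockets); // 19600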
Alexander Richter
@alex-richter
Nov 14 2018 15:56

Also, I seem to be having trouble joining a client node to a cluster. I have tried it with both a three-node cluster and a single-node cluster. This is the code I am using to connect the client:

AtomixBuilder atomixBuilder = Atomix.builder()
    .withMemberId("client")
    .withAddress("127.0.0.1:8002")
    .withProfiles(Profile.client())
    .withMembershipProvider(BootstrapDiscoveryProvider.builder()
        .withNodes(Node.builder()
            .withId("node1")
            .withAddress("127.0.0.1:8200")
            .build())
        .build());

Atomix atomix = atomixBuilder.build().start().join();

From the logs, which I have attached below, it looks like the client successfully connects to the cluster, but for some reason the CompletableFuture returned from start() never completes. By debugging, I have found that the last line actually executed is .thenApply(v -> null); in AtomixCluster.java on line 279. The cluster is definitely running, as I can successfully send requests to distributed primitives.
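One generic way to narrow this down (a plain CompletableFuture pattern, not an Atomix-specific API) is to attach a completion callback instead of blocking on join(), so an exceptional completion, if there is one, gets logged instead of silently swallowed:

// Observe the startup future rather than blocking on join(); any exception that
// completes the future is printed instead of leaving the caller hanging.
Atomix atomix = atomixBuilder.build();
atomix.start().whenComplete((result, error) -> {
    if (error != null) {
        error.printStackTrace();
    } else {
        System.out.println("Atomix started");
    }
});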

This is the logging output of Atomix at the debug level. One more thing that might be helpful: I can see heartbeat messages being sent back and forth at the trace level.

2018-11-14 16:41:51,197 [INFO ] [atomix-partition-group-membership-service-0] io.atomix.primitive.partition.impl.DefaultPartitionGroupMembershipService - Started
2018-11-14 16:41:51,210 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.utils.serializer.Namespace - (no name): class io.atomix.primitive.partition.GroupMember already registered as 502. Skipping 553.
2018-11-14 16:41:51,211 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.utils.serializer.Namespace - (no name): class io.atomix.cluster.MemberId already registered as 503. Skipping 554.
2018-11-14 16:41:51,212 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.utils.serializer.Namespace - (no name): class io.atomix.primitive.partition.MemberGroupId already registered as 504. Skipping 555.
2018-11-14 16:41:51,212 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.utils.serializer.Namespace - (no name): class io.atomix.primitive.partition.PartitionId already registered as 505. Skipping 556.
2018-11-14 16:41:51,214 [INFO ] [atomix-cluster-heartbeat-sender] io.atomix.cluster.impl.DefaultClusterMembershipService - node1 - Member updated: Member{id=node1, address=127.0.0.1:8200, properties={}}
2018-11-14 16:41:51,218 [INFO ] [atomix-partition-group-membership-service-0] io.atomix.primitive.partition.impl.HashBasedPrimaryElectionService - Started
2018-11-14 16:41:51,228 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.primitive.partition.impl.HashBasedPrimaryElection - client - Recomputed term for partition PartitionId{id=1, group=system}: PrimaryTerm{term=1, primary=GroupMember{memberId=node1, groupId=node1}, candidates=[]}
2018-11-14 16:41:51,229 [DEBUG] [netty-messaging-event-nio-client-1] io.atomix.cluster.messaging.impl.NettyMessagingService - 127.0.0.1:8002 - No handler for message type primary-election-counter-system-1 from localhost:8200
2018-11-14 16:41:51,245 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.protocols.backup.partition.impl.PrimaryBackupPartitionClient - Successfully started client for PartitionId{id=1, group=system}
2018-11-14 16:41:51,246 [INFO ] [atomix-partition-group-membership-service-0] io.atomix.protocols.backup.partition.PrimaryBackupPartitionGroup - Started
2018-11-14 16:41:51,252 [DEBUG] [atomix-partition-group-membership-service-0] io.atomix.primitive.session.impl.RecoveringSessionClient - SessionClient{system-partition-1} - Opening proxy session
2018-11-14 16:41:52,182 [DEBUG] [atomix-cluster-event-executor-0] io.atomix.cluster.messaging.impl.NettyMessagingService - Established a new connection to 127.0.0.1:8200
2018-11-14 16:41:54,145 [DEBUG] [primary-election-0] io.atomix.primitive.partition.impl.HashBasedPrimaryElection - cli
Wayne Hunter
@incogniro
Nov 14 2018 19:21
I have a strange situation. When I add Atomix to an existing multi-module Maven project (an application server that will be deployed to Kubernetes), the CompletableFuture call at atomix.start().join() never completes. However, if I put the same code into a separate Maven project, the cluster Future completes and the code continues to execute.
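A generic way to make a hang like this fail fast instead of blocking forever (standard java.util.concurrent usage, nothing Atomix-specific; the 30-second timeout is an arbitrary choice):

// Wait a bounded time for startup so a hang surfaces as an exception with a
// stack trace instead of an indefinitely blocked thread.
try {
    atomix.start().get(30, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    throw new IllegalStateException("Atomix did not start within 30 seconds", e);
} catch (InterruptedException | ExecutionException e) {
    throw new IllegalStateException("Atomix startup failed", e);
}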
Jordan Halterman
@kuujo
Nov 14 2018 19:22
Do you see any warnings being logged? Usually if the future doesn’t complete it’s because Atomix can’t find partitions to use, and it will log a warning indicating that. But if partition groups are configured locally then that is indeed strange
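For reference, a minimal sketch of what "partition groups configured locally" could look like on a server member, assuming the primary-backup protocol; the group name system matches the logs above, while the data group name and partition counts are illustrative, not taken from the chat:

// Server-side member that declares its own management and data partition groups,
// so joining clients have partitions to discover and use.
Atomix server = Atomix.builder()
    .withMemberId("node1")
    .withAddress("127.0.0.1:8200")
    .withManagementGroup(PrimaryBackupPartitionGroup.builder("system")
        .withNumPartitions(1)
        .build())
    .withPartitionGroups(PrimaryBackupPartitionGroup.builder("data")
        .withNumPartitions(32)
        .build())
    .build();
server.start().join();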
Wayne Hunter
@incogniro
Nov 14 2018 19:27
The only warning is: 2018-11-14 19:25:54 WARN AbstractChannel:131 - A non-root user can't receive a broadcast packet if the socket is not bound to a wildcard address; binding to a non-wildcard address (localhost/127.0.0.1:8071) anyway as requested.
Even if I run the code in isolation, not coupled to the app server startup but as isolated classes from a test harness, the cluster doesn’t fully start. My colleague and I have been scratching our heads for the last couple of days.
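That warning typically comes from binding a UDP (datagram) socket to a specific address rather than the wildcard address, for example the socket used for broadcast/multicast member discovery. If the member addresses are known up front, a bootstrap discovery provider avoids depending on that socket; a minimal sketch with placeholder member IDs and addresses:

// Explicitly list the peers instead of relying on broadcast/multicast discovery;
// the IDs and addresses below are placeholders.
AtomixBuilder builder = Atomix.builder()
    .withMemberId("member1")
    .withAddress("10.0.0.1:8070")
    .withMembershipProvider(BootstrapDiscoveryProvider.builder()
        .withNodes(
            Node.builder().withId("member1").withAddress("10.0.0.1:8070").build(),
            Node.builder().withId("member2").withAddress("10.0.0.2:8070").build())
        .build());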
Wayne Hunter
@incogniro
Nov 14 2018 19:43

When I kill the process, I notice in the log that Atomix then joins:
2018-11-14 19:40:58 DEBUG ThreadPoolAsynchronousRunner:730 - com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector@6398e5a6 -- Running DeadlockDetector[Exiting. No pending tasks.]
2018-11-14 19:41:04 INFO Member:71 - Stopping: Atomix{partitions=io.atomix.primitive.partition.impl.DefaultPartitionService@17053079}
Joined: Atomix{partitions=io.atomix.primitive.partition.impl.DefaultPartitionService@17053079}
2018-11-14 19:41:04 INFO CorePrimitivesService:381 - Stopped

Process finished with exit code 130 (interrupted by signal 2: SIGINT)

Jordan Halterman
@kuujo
Nov 14 2018 21:03
that’s really strange
what’s the last Started message logged?
seems like killing it is unblocking a thread or something
Alexander Richter
@alex-richter
Nov 14 2018 21:14
In my case, it seems like Atomix has at least found all the partition groups, since they're all bootstrapped from the cluster node, and the DefaultPartitionGroupMembershipService also appears to have started. Is there anything you think could cause Atomix to hang at this point?
Wayne Hunter
@incogniro
Nov 14 2018 22:20
I’m not sure; there are a lot of dependencies in the project. I’ll try starting only Atomix, without anything else, and report back on how I get on.