These are chat archives for atomix/atomix

18th
Aug 2018
Junbo Ruan
@aruanruan
Aug 18 2018 01:55
@kuujo after 12 hours, logs were repeating like:
09:51:19.870 [raft-server-system-partition-1] DEBUG i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node3, type=ACTIVE, updated=2018-08-17T07:53:09.041Z} for term 2832
09:51:19.870 [raft-server-system-partition-1] DEBUG i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node1, type=ACTIVE, updated=2018-08-17T07:53:09.041Z} for term 2832
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received AppendRequest{term=2831, leader=node3, prevLogIndex=1524, prevLogTerm=39, entries=0, commitIndex=1523}
09:51:19.870 [raft-server-system-partition-1] DEBUG i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Rejected AppendRequest{term=2831, leader=node3, prevLogIndex=1524, prevLogTerm=39, entries=0, commitIndex=1523}: request term is less than the current term (2832)
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Sending AppendResponse{status=OK, term=2832, succeeded=false, lastLogIndex=4089}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received AppendRequest{term=2831, leader=node3, prevLogIndex=1524, prevLogTerm=39, entries=0, commitIndex=1523}
09:51:19.870 [raft-server-system-partition-1] DEBUG i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Rejected AppendRequest{term=2831, leader=node3, prevLogIndex=1524, prevLogTerm=39, entries=0, commitIndex=1523}: request term is less than the current term (2832)
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Sending AppendResponse{status=OK, term=2832, succeeded=false, lastLogIndex=4089}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.FollowerRole - RaftServer{system-partition-1}{role=FOLLOWER} - Sending OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.FollowerRole - RaftServer{system-partition-1}{role=FOLLOWER} - Sending OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.FollowerRole - RaftServer{system-partition-1}{role=FOLLOWER} - Sending OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received OpenSessionRequest{node=node2, serviceName=primitives, serviceType=atomic-map, readConsistency=SEQUENTIAL, minTimeout=250, maxTimeout=30000}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Sending OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received OpenSessionRequest{node=node3, serviceName=primitives, serviceType=atomic-map, readConsistency=SEQUENTIAL, minTimeout=250, maxTimeout=30000}
09:51:19.870 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Sending OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}}
09:51:19.870 [raft-client-system-partition-1-10] TRACE i.a.p.r.s.impl.RaftSessionConnection - RaftClient{system-partition-1} - Received OpenSessionResponse{status=ERROR, error=RaftError{type=NO_LEADER, message=null}} from node2
09:51:19.870 [raft-client-system-partition-1-10] TRACE i.a.p.r.s.impl.RaftSessionConnection - RaftClient{system-partition-1} - Resetting connection. Reason: Failed to locate leade
client reports:
java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at xxx.registry2.client.AbstractPrimitiveTest.atomix(AbstractPrimitiveTest.java:71)
    at xxx.registry2.client.graph.UdoGraphTest.atomicUdoGraphBuilder(UdoGraphTest.java:73)
    at xxx.registry2.client.graph.UdoGraphTest.newGraph(UdoGraphTest.java:77)
    at xxx.registry2.client.graph.UdoGraphTest.testdNotificationsOfManyUnhealthy(UdoGraphTest.java:580)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
3 nodes are active, but refuse connecting
Jordan Halterman
@kuujo
Aug 18 2018 06:24
@aruanruan I need to see all the Raft trace logs to understand why that is… or at least a significant chunk of them - several leader elections
This shows leader elections are happening, but it lacks all the information necessary to understand why that is.
I often have to sift through hundreds of megabytes of logs and correlate individual lines across 10MB files to debug issues, so short snippets of logs are rarely useful for fixing issues in the Raft code. Also, we probably need to align the logs on all of the nodes to understand why the leader election happens.
Jordan Halterman
@kuujo
Aug 18 2018 06:30
For example, that node’s logs probably show that it timed out and started a new election, but that just shows that the leader election algorithm is working correctly. The root cause is what caused the leader not to send a heartbeat to the follower quickly enough, eventually leading to a leader election. To determine that, you have to track and time specific hearbeats from the leader to that follower, check their round trip times, ensure log indexes are progressing correctly, etc, and we need a lot of logs for all nodes for that, which I’d absolutely welcome and encourage but I know you have security concerns
there’s definitely leader flapping happening there and I’m very interested in getting enough logs to debug it
Jordan Halterman
@kuujo
Aug 18 2018 06:36
I’ll even take a significant random sample of logs from several nodes
I’ll take what I can get
MiniDigger
@MiniDigger
Aug 18 2018 08:43
Oh sorry, didn't see that you replied yesterday, that pr looks good
Is master published to maven as snapshot so I can test it later?
Johno Crawford
@johnou
Aug 18 2018 10:28
Yes
3.0.0-rc12
@MiniDigger CI actually builds and runs tests Atomix on jdk 10 and 11
Which are passing
MiniDigger
@MiniDigger
Aug 18 2018 10:31
ah, nice
reverted my changes from yesterday to avoid the CME and it doesn't trigger on rc12 now, so its fixed
thanks for the fast response
Johno Crawford
@johnou
Aug 18 2018 10:35
The rc is a release btw not snapshot
But snapshots are also published from master
MiniDigger
@MiniDigger
Aug 18 2018 10:36
ah k
Johno Crawford
@johnou
Aug 18 2018 10:44
Well release candidate but release in maven..
Johno Crawford
@johnou
Aug 18 2018 13:29
@kuujo probably because of the fork option
Johno Crawford
@johnou
Aug 18 2018 19:57
Ah which isn't defined