These are chat archives for atomix/atomix

17th Aug 2018
Junbo Ruan
@aruanruan
Aug 17 2018 00:45
following the log fragments:
```
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.877 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.878 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.878 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.878 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.878 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.878 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.914 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.928 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.928 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1635020653
08:41:54.928 [raft-server-system-partition-2] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-2} - Cannot apply index 994 (hasNext: true, nextIndex: 159)
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.948 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.973 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.973 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length too long: 1919511905
08:41:54.973 [raft-server-system-partition-2] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-2} - Cannot apply index 995 (hasNext: true, nextIndex: 159)
08:41:54.974 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.974 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
08:41:54.974 [raft-server-system-partition-2] ERROR i.a.s.journal.JournalSegmentReader - length < 0: -201129727
```
Johno Crawford
@johnou
Aug 17 2018 00:48
@kuujo so the ClassRule isn't being run, or?
suimi
@suimi
Aug 17 2018 02:33
```
2018-08-16 15:10:49,626 [netty-messaging-event-nio-client-6] DEBUG NettyMessagingService - 127.0.0.1:8804 - No handler for message type raft-group-partition-1-append from 127.0.0.1:8800 [,,]
2018-08-16 15:10:49,703 [raft-client-system-partition-1-8] DEBUG RaftSessionConnection - RaftClient{system-partition-1} - OpenSessionRequest{node=node-E, serviceName=atomix-primary-elector, serviceType=PRIMARY_ELECTOR, readConsistency=SEQUENTIAL, minTimeout=250, maxTimeout=30000} failed! Reason: {} [,,]
java.util.concurrent.TimeoutException: Request timed out in 5024 milliseconds
```
@kuujo the log of node-E
```
2018-08-16 15:11:03,750 [raft-server-system-partition-1] INFO  RaftContext - RaftServer{system-partition-1} - Transitioning to CANDIDATE [,,]
2018-08-16 15:11:03,751 [raft-server-system-partition-1] INFO  CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Starting election [,,]
2018-08-16 15:11:03,764 [raft-server-system-partition-1] DEBUG RaftContext - RaftServer{system-partition-1} - Set term 8 [,,]
2018-08-16 15:11:03,764 [atomix-cluster-heartbeat-sender] INFO  DefaultClusterMembershipService - node-C - Member reachable: Member{id=node-C, address=127.0.0.1:8802, properties={}} [,,]
2018-08-16 15:11:09,331 [atomix-cluster-heartbeat-sender] INFO  DefaultClusterMembershipService - node-B - Member reachable: Member{id=node-B, address=127.0.0.1:8801, properties={}} [,,]
2018-08-16 15:11:09,331 [netty-messaging-event-nio-client-6] DEBUG NettyMessagingService - 127.0.0.1:8804 - No handler for message type raft-group-partition-1-append from 127.0.0.1:8800 [,,]
2018-08-16 15:11:09,331 [netty-messaging-event-nio-client-6] DEBUG NettyMessagingService - 127.0.0.1:8804 - No handler for message type raft-group-partition-1-append from 127.0.0.1:8800 [,,]
2018-08-16 15:11:09,332 [atomix-cluster-heartbeat-sender] INFO  DefaultClusterMembershipService - node-D - Member reachable: Member{id=node-D, address=127.0.0.1:8803, properties={}} [,,]
2018-08-16 15:11:09,334 [raft-server-system-partition-1] DEBUG RaftContext - RaftServer{system-partition-1} - Voted for node-E [,,]
2018-08-16 15:11:09,334 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting votes for term 8 [,,]
2018-08-16 15:11:09,334 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node-D, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} for term 8 [,,]
2018-08-16 15:11:09,335 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node-C, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} for term 8 [,,]
2018-08-16 15:11:09,335 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node-B, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} for term 8 [,,]
2018-08-16 15:11:09,335 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Requesting vote from DefaultRaftMember{id=node-A, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} for term 8 [,,]
2018-08-16 15:11:09,335 [atomix-cluster-heartbeat-sender] INFO  DefaultClusterMembershipService - node-A - Member reachable: Member{id=node-A, address=127.0.0.1:8800, properties={}} [,,]
2018-08-16 15:11:09,335 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Rejected AppendRequest{term=7, leader=node-C, prevLogIndex=290, prevLogTerm=3, entries=0, commitIndex=2177}: request term is less than the current term (8) [,,]
2018-08-16 15:11:09,336 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Rejected AppendRequest{term=7, leader=node-C, prevLogIndex=290, prevLogTerm=3, entries=0, commitIndex=2177}: request term is less than the current term (8) [,,]
2018-08-16 15:11:09,348 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received rejected vote from DefaultRaftMember{id=node-D, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} [,,]
2018-08-16 15:11:09,359 [raft-server-system-partition-1] DEBUG CandidateRole - RaftServer{system-partition-1}{role=CANDIDATE} - Received successful vote from DefaultRaftMember{id=node-B, type=ACTIVE, updated=2018-08-16T06:25:06.888Z} [,,]
2018-08-16 15:11:09,361 [netty-messaging-event-nio-client-1] DEBUG NettyMessagingService - Received a reply for message id:[18210] but was unable to locate the request handle [,,]
```
Junbo Ruan
@aruanruan
Aug 17 2018 06:45
It's the first time my test case has run 3 times in a row after pulling the newest code
Jordan Halterman
@kuujo
Aug 17 2018 10:21
@johnou yeah
@suimi I’m not really seeing what’s wrong with those logs. Messages will time out every now and then; those are debug logs. The system is designed to time out requests fairly aggressively and retry them. The second batch of log messages is just a Raft partition electing a leader, seemingly doing exactly what it’s supposed to be doing.
Is there some exception that’s propagating up to the client code?
Junbo Ruan
@aruanruan
Aug 17 2018 11:47
@kuujo after inserting log data and restarting the cluster, once the cluster finishes replaying its logs the client cannot connect to the cluster
the heartbeats the cluster nodes send to each other are timing out
```
19:45:34.069 [atomix-bootstrap-heartbeat-receiver] DEBUG i.a.c.d.BootstrapDiscoveryProvider - Node{id=node3, address=cloudate3:8000} - Sending heartbeat to 10.78.137.136:8000 failed
java.util.concurrent.TimeoutException: Request timed out in 5050 milliseconds
        at io.atomix.cluster.messaging.impl.NettyMessagingService$AbstractClientConnection.timeoutCallbacks(NettyMessagingService.java:919) ~[atomix-cluster-3.0.0-SNAPSHOT.jar:na]
        at io.atomix.cluster.messaging.impl.NettyMessagingService.timeoutAllCallbacks(NettyMessagingService.java:338) ~[atomix-cluster-3.0.0-SNAPSHOT.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[na:1.8.0_172]
```
Junbo Ruan
@aruanruan
Aug 17 2018 12:02
```
20:01:09.322 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-1} - Reset next index for RaftMemberContext{member=node1, term=54, configIndex=0, snapshotIndex=0, nextSnapshotIndex=0, nextSnapshotOffset=0, matchIndex=0, nextIndex=1740, heartbeatTime=1534507269309, appending=0, appendSucceeded=false, appendTime=1534507269318, configuring=false, installing=false, failures=0} to 1740
20:01:09.327 [raft-server-system-partition-2] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-2} - Sending AppendRequest{term=51, leader=node2, prevLogIndex=2687, prevLogTerm=51, entries=565, commitIndex=1540} to node1
20:01:09.328 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-1} - Sending AppendRequest{term=54, leader=node2, prevLogIndex=1739, prevLogTerm=54, entries=623, commitIndex=1523} to node1
20:01:09.329 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-1} - Sending AppendRequest{term=54, leader=node2, prevLogIndex=2394, prevLogTerm=54, entries=0, commitIndex=1523} to node3
20:01:09.329 [raft-server-system-partition-2] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-2} - Received AppendResponse{status=OK, term=51, succeeded=false, lastLogIndex=2688} from node1
20:01:09.329 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-1} - Received AppendResponse{status=OK, term=54, succeeded=false, lastLogIndex=3003} from node3
20:01:09.329 [raft-server-system-partition-1] TRACE i.a.p.raft.roles.LeaderAppender - RaftServer{system-partition-1} - Reset next index for RaftMemberContext{member=node3, term=54, configIndex=0, snapshotIndex=0, nextSnapshotIndex=0, nextSnapshotOffset=0, matchIndex=0, nextIndex=2395, heartbeatTime=1534507269136, appending=0, appendSucceeded=false, appendTime=1534507269329, configuring=false, installing=false, failures=0} to 3004
```
Jordan Halterman
@kuujo
Aug 17 2018 17:31
What are you seeing in that log above? Doesn’t seem to be enough information. It’s just the Raft leaders resolving their logs with followers.
Jordan Halterman
@kuujo
Aug 17 2018 18:33
Certainly don’t expect the timeout to happen, but need a lot more context to actually see why it’s happening
MiniDigger
@MiniDigger
Aug 17 2018 21:55
finally got time to play with atomix today. my current test setup uses Profile.consensus with one member. I then went on to create a distributed list I want to update when somebody leaves the cluster. when I do that, these exceptions appear (filtered out the failed heartbeats) https://gist.github.com/MiniDigger/c3a58729def05b9733c1f9c5f29ef557
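For reference, a minimal sketch of that kind of setup against the Atomix 3.0 API (builder and profile signatures shifted across the 3.0 snapshots, so the exact calls are an assumption; the member id, address, and list name are made up for illustration):

```java
import io.atomix.cluster.ClusterMembershipEvent;
import io.atomix.core.Atomix;
import io.atomix.core.list.DistributedList;
import io.atomix.core.profile.Profile;

public class ListOnLeave {
  public static void main(String[] args) {
    // Single-member consensus cluster; "member-1" and the address are illustrative.
    Atomix atomix = Atomix.builder()
        .withMemberId("member-1")
        .withAddress("localhost:5000")
        .withProfiles(Profile.consensus("member-1"))
        .build();
    atomix.start().join();

    DistributedList<String> list = atomix.getList("my-list");

    // Update the list when a member leaves the cluster.
    atomix.getMembershipService().addListener(event -> {
      if (event.type() == ClusterMembershipEvent.Type.MEMBER_REMOVED) {
        list.remove(event.subject().id().id());
      }
    });
  }
}
```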
Jordan Halterman
@kuujo
Aug 17 2018 21:56
that’s an exception inside the primitive service (replicated state machine)
can you attach a debugger and see what the exception is?
the logging there kind of sucks
doesn’t unwrap the exception correctly
MiniDigger
@MiniDigger
Aug 17 2018 22:03
not really sure what I am looking for. I set a breakpoint on the line that throws the exceptions; this is the caught exception: https://i.imgur.com/lTL4rOF.png
Jordan Halterman
@kuujo
Aug 17 2018 22:03
look at the causes
there should be a root cause
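If it helps, a tiny helper for walking to the root cause in the debugger; the self-reference guard matters here, as the next message shows:

```java
// Walk the cause chain to the deepest distinct throwable; the guard
// avoids looping forever when an exception lists itself as its own cause.
static Throwable rootCause(Throwable t) {
  while (t.getCause() != null && t.getCause() != t) {
    t = t.getCause();
  }
  return t;
}
```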
MiniDigger
@MiniDigger
Aug 17 2018 22:04
it has itself as cause :D
MiniDigger
@MiniDigger
Aug 17 2018 22:06
added an exception breakpoint; it hit the last one
Jordan Halterman
@kuujo
Aug 17 2018 22:07
okay now that should have the correct cause
need to fix those stupid exceptions
Jordan Halterman
@kuujo
Aug 17 2018 22:07
hmm
oh I know what’s happening
good bug
one sec
MiniDigger
@MiniDigger
Aug 17 2018 22:07
has somebody tested atomix on java 10 before?
Jordan Halterman
@kuujo
Aug 17 2018 22:08
we have Java 10 tests but they have had issues
MiniDigger
@MiniDigger
Aug 17 2018 22:09
ah ok, but this isn't java 10 related then, since you found the bug?
Jordan Halterman
@kuujo
Aug 17 2018 22:09
in this case it’s just throwing a ConcurrentModificationException inside the state machine because your code is modifying the list while iterating it. The list inside the state machine is a Collections.synchronizedList, which doesn’t allow that
we probably need to change the data structure
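A minimal local reproduction of what Jordan describes, using only java.util (this is not Atomix code, just the same failure shape):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CmeDemo {
  public static void main(String[] args) {
    // Same shape as the list inside the primitive service.
    List<String> list = Collections.synchronizedList(
        new ArrayList<>(Arrays.asList("a", "b", "c", "d")));

    // Structurally modifying the list while iterating it fails fast:
    // the iterator's next next() call throws ConcurrentModificationException.
    for (String value : list) {
      if (value.equals("b")) {
        list.remove(value);
      }
    }
  }
}
```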
MiniDigger
@MiniDigger
Aug 17 2018 22:10
I was searching for a removeIf on the distributed list, didn't find it, that's why I am iterating myself
Jordan Halterman
@kuujo
Aug 17 2018 22:10
although CopyOnWriteArrayList is probably unrealistic
It isn’t supported because it takes a closure, which we can’t ship to the replicated state machine. We’d have to read and then do a check-and-remove, using an optimistic lock and retries, to maintain consistency for the removeIf method
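To make that optimistic shape concrete: read a snapshot, compute the filtered result, and swap it in only if nothing changed in between, retrying on conflict. The sketch below shows the loop in plain Java, with an AtomicReference standing in for a versioned replicated value; it is not Atomix API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class OptimisticRemoveIf {
  // Retry until the compare-and-set wins, i.e. until no concurrent
  // writer invalidated the snapshot we computed the result from.
  static <E> void removeIf(AtomicReference<List<E>> ref, Predicate<E> filter) {
    while (true) {
      List<E> current = ref.get();              // optimistic read
      List<E> next = current.stream()           // check
          .filter(filter.negate())
          .collect(Collectors.toList());
      if (ref.compareAndSet(current, next)) {   // and-swap; retry on conflict
        return;
      }
    }
  }
}
```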
MiniDigger
@MiniDigger
Aug 17 2018 22:13
ah ok
well, thanks for the hint on how to debug stuff. I fixed the issue by collecting what should be removed into a list and then calling remove, to avoid the CME; that works
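A sketch of that workaround, assuming the distributed list exposes the plain java.util.List interface (the predicate is a stand-in for the real condition):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CollectThenRemove {
  // Snapshot the entries to delete first, then remove them after the
  // iteration has finished, so nothing mutates the list mid-iteration.
  static <E> void removeMatching(List<E> list, Predicate<E> filter) {
    List<E> toRemove = new ArrayList<>();
    for (E entry : list) {
      if (filter.test(entry)) {
        toRemove.add(entry);
      }
    }
    for (E entry : toRemove) {
      list.remove(entry);
    }
  }
}
```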
but atomix should probably expose these exceptions to the user
Jordan Halterman
@kuujo
Aug 17 2018 22:15
That’s not necessarily easy to do. That assumes the exceptions exist on the client node and the client node is running the same version.
so we have to use a generic exception
although the message is supposed to at least be propagated
not sure why it is not in this case
MiniDigger
@MiniDigger
Aug 17 2018 22:16
ahh. I am new to all this distributed stuff; I am starting to get a feel for the complexity involved
Jordan Halterman
@kuujo
Aug 17 2018 22:18
running different versions of the code certainly can cause a lot of problems
it pains me, but I guess we’re going to have to use CopyOnWriteArrayList in that state machine to avoid this
I’m going to fix that bug right now
Jordan Halterman
@kuujo
Aug 17 2018 22:24
actually I think we can just write a thread-safe iterator for the list
that should be simple enough
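One simple way to do that is to hand out an iterator over a point-in-time copy; this is a guess at the approach, not necessarily what the actual fix does:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Iterate a snapshot so concurrent adds/removes on the backing list
// can't invalidate the iterator. Collections.synchronizedList requires
// holding the list's monitor while copying it.
final class SnapshotIterable<E> implements Iterable<E> {
  private final List<E> backing;

  SnapshotIterable(List<E> backing) {
    this.backing = backing;
  }

  @Override
  public Iterator<E> iterator() {
    synchronized (backing) {
      return new ArrayList<>(backing).iterator();
    }
  }
}
```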
MiniDigger
@MiniDigger
Aug 17 2018 22:37
if I can assist you in verifying it's fixed, let me know
Jordan Halterman
@kuujo
Aug 17 2018 22:56
@MiniDigger #811