These are chat archives for atomix/atomix

9th
Jun 2018
Johno Crawford
@johnou
Jun 09 2018 00:56
@kuujo LinkedList is a pretty cheap object, int, null first and last node
but still I see what you mean
ooh naughty @kuujo
what've you done in RecoveringSessionClient.java
oh boolean boo
was going to suggest using AtomicIntegerFieldUpdater
Johno Crawford
@johnou
Jun 09 2018 01:01
then again, would still be cheaper than using atomicboolean if you used a volatile int and AtomicIntegerFieldUpdater
Jordan Halterman
@kuujo
Jun 09 2018 05:36
@johnou what’s wrong with RecoveringSessionClient?
the futures are fine as they are now… the rule is just that any future that’s returned by a method more than once (the lifecycle management methods like connect() and start() and stop() and close() often return a single future) needs to be OrderedFuture
bah that connect() method is not correct
Jordan Halterman
@kuujo
Jun 09 2018 05:48
#607
Jordan Halterman
@kuujo
Jun 09 2018 08:25
It’s this line I see hanging in tests, so definitely some shutdown issue. Haven’t figured out which service even is not shutting down
the logs do seem to indicate that it at least got to the ClusterMembershipService though, and shutting down everything after that should be simple
Jordan Halterman
@kuujo
Jun 09 2018 08:47
@rroller change the version to 3.0.0-SNAPSHOT and then test again
Tests pass and all the API refactoring is done so I think we can cut the RC whenever we want to. Probably should find and remove the deprecated methods beforehand
Johno Crawford
@johnou
Jun 09 2018 09:10
+1 removal of deprecated methods in major version update
Johno Crawford
@johnou
Jun 09 2018 10:08
@kuujo did you see this? atomix/atomix#526
Johno Crawford
@johnou
Jun 09 2018 10:54
@kuujo io.atomix.cluster.impl.DefaultClusterMembershipService#stop
HEARTBEAT_MESSAGE is registered with heartbeatExecutor
and we unregister it after shutting down the executors
wouldn't it make sense shutting down heartbeatExecutor last
Johno Crawford
@johnou
Jun 09 2018 12:55
made shutdown a little more robust in #611
io.atomix.protocols.backup.roles.SynchronousReplicator#queues this stinks though
if there is a backup in the queue when that member leaves the cluster
nothing cleans it up
Johno Crawford
@johnou
Jun 09 2018 16:47
ah it might be related to atomix/atomix#526
io.atomix.protocols.backup.roles.PrimaryRole#close()
or at least some weirdness i've seen
(not removing the scheduled heartbeat task after the role has closed)
Jordan Halterman
@kuujo
Jun 09 2018 20:53
the Tavis badge seems to show the last build, not the last master build
so it shows failed for failed PRs
Jordan Halterman
@kuujo
Jun 09 2018 21:05
yes the ordering of the executor shutdown in DefaultClusterMembershipService can be changed, but it probably doesn’t make any difference. Just cleaner
both of the *Replicator queues need to be cleaned up
this also needs to wait for the response and only remove the operations from the queue if they’re acknowledged
Jordan Halterman
@kuujo
Jun 09 2018 21:19
well… fixed the build badge at least
Johno Crawford
@johnou
Jun 09 2018 21:36
ahahah
so that bug
atomix/atomix#526
io.atomix.protocols.backup.roles.PrimaryRole#close()
Jordan Halterman
@kuujo
Jun 09 2018 21:37
what about it
Johno Crawford
@johnou
Jun 09 2018 21:37
can't this be invoked from app thread, not state machine thread
Jordan Halterman
@kuujo
Jun 09 2018 21:37
That stacktrace has Raft in it
not primary backup
no it can't
err
maybe it can, but I’m not seeing how the two are related
Johno Crawford
@johnou
Jun 09 2018 21:38
well forget the old stacktrace
Jordan Halterman
@kuujo
Jun 09 2018 21:38
the Scheduled object inside of DefaultServiceExecutor is only ever called from inside state machines
Johno Crawford
@johnou
Jun 09 2018 21:38
yes
io.atomix.protocols.backup.service.impl.PrimaryBackupServiceContext#close()
but this is part of atomix stop no?
Jordan Halterman
@kuujo
Jun 09 2018 21:38
no
Johno Crawford
@johnou
Jun 09 2018 21:39
oh ffs i sent the wrong link
one sec
Jordan Halterman
@kuujo
Jun 09 2018 21:39
Johno Crawford
@johnou
Jun 09 2018 21:42
io.atomix.protocols.backup.roles.PrimaryRole#close() invokes io.atomix.primitive.service.impl.DefaultServiceExecutor.ScheduledTask#cancel
which modifies the arraylist io.atomix.primitive.service.impl.DefaultServiceExecutor#scheduledTasks
and i'm seeing different threads run that close method
23:41:49.963 [atomix-partition-group-membership-service-0] INFO i.a.p.backup.roles.PrimaryRole - PrimaryRole{system-partition-1}{role=PRIMARY} - Closing on atomix-partition-group-membership-service-0
23:42:09.420 [atomix-0] INFO i.a.p.backup.roles.PrimaryRole - PrimaryRole{system-partition-1}{role=PRIMARY} - Closing on atomix-0
Jordan Halterman
@kuujo
Jun 09 2018 21:43
PrimaryRole#close is not calling a Scheduled object that was created by DefaultServiceExecutor
it’s calling one that was created by an Executor
Johno Crawford
@johnou
Jun 09 2018 21:44
aha
Jordan Halterman
@kuujo
Jun 09 2018 21:44
the Scheduled objects that are created via the DefaultServiceExecutor are only created in state machines
but there is still a thread safety issue here
because of the lists in *Replicators
that scheduler is only used in a state machine when getScheduler() is called, e.g. https://github.com/atomix/atomix/blob/master/core/src/main/java/io/atomix/core/lock/impl/DefaultDistributedLockService.java#L76
because it’s a deterministic scheduler
not based on real time but based on the tick method calls which are based on Raft log entries or other deterministic time changes
Johno Crawford
@johnou
Jun 09 2018 21:45
just found it a little dangerous to expose modifying the list of the state machine in a public method like that
but maybe i'm over thinking it
Jordan Halterman
@kuujo
Jun 09 2018 21:46
it’s an interface method
Johno Crawford
@johnou
Jun 09 2018 21:47
maybe i can redeem myself with this one then
Jordan Halterman
@kuujo
Jun 09 2018 21:47
of an implementation that’s only accessible to state machines
haha
yeah that primary-backup code needs a lot of work
it’s basically still a proof of concept
Johno Crawford
@johnou
Jun 09 2018 21:56
and wtf
restore buffer
release
Johno Crawford
@johnou
Jun 09 2018 22:01
testScaleDownPersistent still fails locally
Jordan Halterman
@kuujo
Jun 09 2018 22:03
Yep I will try to figure out what’s wrong with that.
I’m also going to try to run this on the ONOS test infrastructure today, and if all goes well this weekend maybe do the RC. Then I will be able to get it into the next ONOS release so our QA will be testing it regularly for the next couple months.
Johno Crawford
@johnou
Jun 09 2018 22:13
one thing that caught my eye with the new proxy stuff is the use of method.getannotation
Jordan Halterman
@kuujo
Jun 09 2018 22:13
only happens when a new primitive is created
Johno Crawford
@johnou
Jun 09 2018 22:13
that's a bit of a performance killer, but i'm not sure people would be creating atomix objects that much
yes
Jordan Halterman
@kuujo
Jun 09 2018 22:14
creating a primitive also requires a write either to Raft or primary-backup, which is much more costly, but no reason the service/client commands can’t be cached
or shouldn’t be
Johno Crawford
@johnou
Jun 09 2018 22:14
might be a bit shitty to cache with felix too
probably need to wrap class in a weak ref
Jordan Halterman
@kuujo
Jun 09 2018 22:15
as long as it’s keyed by the class I think it’s fine, but yeah classes can be unloaded too
Johno Crawford
@johnou
Jun 09 2018 22:19
java.util.concurrent.ExecutionException: io.atomix.utils.config.ConfigurationException: Duplicate system group detected
at io.atomix.primitive.partition.impl.DefaultPartitionGroupMembershipService.updatePartitionGroups(DefaultPartitionGroupMembershipService.java:237)
at io.atomix.primitive.partition.impl.DefaultPartitionGroupMembershipService.lambda$bootstrap$7(DefaultPartitionGroupMembershipService.java:212)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
00:19:27.490 [netty-messaging-event-epoll-server-0] WARN i.a.c.m.impl.NettyMessagingService - localhost/127.0.0.1 failed to bind to port 5003 due to {}
io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
guess shutdown is still failing
could we be shutting down the executors too quickly
eg. some previous stop call queued a task
Jordan Halterman
@kuujo
Jun 09 2018 22:23
yeah that’s definitely a node still running
Johno Crawford
@johnou
Jun 09 2018 22:26
kind of annoying following code in intellij
always takes me to the managed interface
makes sense, but still
Jordan Halterman
@kuujo
Jun 09 2018 22:27
haha
I know
Johno Crawford
@johnou
Jun 09 2018 22:28
oh
i see a problem
io.atomix.cluster.messaging.impl.NettyMessagingService#stop
Jordan Halterman
@kuujo
Jun 09 2018 22:28
Ugh… servers went down :boom:
Johno Crawford
@johnou
Jun 09 2018 22:29
we aren't waiting for the netty async stop
Jordan Halterman
@kuujo
Jun 09 2018 22:29
yeah that would be a problem
Johno Crawford
@johnou
Jun 09 2018 22:29
i think we should probably wait for the event loops to shut down
i had some code for that
Jordan Halterman
@kuujo
Jun 09 2018 22:31
shutdownGracefully should do that
Johno Crawford
@johnou
Jun 09 2018 22:31
it returns a future
Jordan Halterman
@kuujo
Jun 09 2018 22:31
ahh yes it does
also need to catch exceptions in there and make sure the future is completed too
Johno Crawford
@johnou
Jun 09 2018 22:31
yep
let me grab it
Jordan Halterman
@kuujo
Jun 09 2018 22:53
hmm… looks like I’ve managed to break my ONOS patch again ¯\(ツ)
Johno Crawford
@johnou
Jun 09 2018 22:54
atomix/atomix#617
how'd you manage that?
Jordan Halterman
@kuujo
Jun 09 2018 22:57
ahh just misconfigured the cluster
Johno Crawford
@johnou
Jun 09 2018 22:57
00:56:25.945 [atomix-system-1] ERROR i.a.p.b.i.PrimaryBackupServerContext - Failed closing services
java.util.concurrent.CompletionException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@d74dadd rejected from java.util.concurrent.ScheduledThreadPoolExecutor@dcef898[Shutting down, pool size = 16, active threads = 1, queued tasks = 0, completed tasks = 401]
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[na:1.8.0_131]
at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:991) ~[na:1.8.0_131]
at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124) ~[na:1.8.0_131]
at io.atomix.utils.concurrent.OrderedFuture.thenCompose(OrderedFuture.java:287) ~[classes/:na]
at io.atomix.protocols.backup.impl.PrimaryBackupServerContext.lambda$stop$11(PrimaryBackupServerContext.java:212) ~[classes/:na]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_131]
at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566) ~[na:1.8.0_131]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_131]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_131]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_131]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_131]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_131]
at io.atomix.protocols.backup.impl.PrimaryBackupServerContext.stop(PrimaryBackupServerContext.java:213) ~[classes/:na]
at io.atomix.protocols.backup.PrimaryBackupServer.stop(PrimaryBackupServer.java:100) ~[classes/:na]
at io.atomix.protocols.backup.partition.impl.PrimaryBackupPartitionServer.stop(PrimaryBackupPartitionServer.java:91) ~[classes/:na]
at io.atomix.protocols.backup.partition.PrimaryBackupPartition.lambda$close$4(PrimaryBackupPartition.java:142) ~[classes/:na]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[na:1.8.0_131]
at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:778) ~[na:1.8.0_131]
at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2140) ~[na:1.8.0_131]
at io.atomix.protocols.backup.partition.PrimaryBackupPartition.close(PrimaryBackupPartition.java:140) ~[classes/:na]
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@d74dadd rejected from java.util.concurrent.ScheduledThreadPoolExecutor@dcef898[Shutting down, pool size = 16, active threads = 1, queued tasks = 0, completed tasks = 401]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[na:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[na:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[na:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622) ~[na:1.8.0_131]
at io.atomix.utils.concurrent.ThreadPoolContext$1.execute(ThreadPoolContext.java:52) ~[classes/:na]
at io.atomix.utils.concurrent.ThreadPoolContext.execute(ThreadPoolContext.java:92) ~[classes/:na]
at io.atomix.utils.concurrent.BlockingAwareThreadPoolContext.execute(BlockingAwareThreadPoolContext.java:33) ~[classes/:na]
at io.atomix.protocols.backup.service.impl.PrimaryBackupServiceContext.close(PrimaryBackupServiceContext.java:625) ~[classes/:na]
at io.atomix.protocols.backup.impl.PrimaryBackupServerContext.lambda$null$10(PrimaryBackupServerContext.java:212) ~[classes/:na]
at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981) ~[na:1.8.0_131]
... 38 common frames omitted
doesn't like your fix for atomix/atomix#615
Jordan Halterman
@kuujo
Jun 09 2018 23:00
strange
I don’t even see where that ThreadContext is ever even shutdown
Johno Crawford
@johnou
Jun 09 2018 23:01
me either
but it looks like it might be shutting itself down?
in the sense that service is computed with PrimaryBackupServiceContext service
Jordan Halterman
@kuujo
Jun 09 2018 23:02
well, the ThreadPoolContext from which it was created can be shutdown, but I don’t see where that happens either
I guess it happens all the way up in PrimaryBackupPartitionGroup#close
which should happen after that call
Johno Crawford
@johnou
Jun 09 2018 23:04
unless it is happening after that call, on the executor thread?
or some weird shit
Johno Crawford
@johnou
Jun 09 2018 23:21
java.lang.RuntimeException
at io.atomix.utils.concurrent.BlockingAwareThreadPoolContextFactory$1.close(BlockingAwareThreadPoolContextFactory.java:49)
at io.atomix.protocols.backup.PrimaryBackupClient.close(PrimaryBackupClient.java:150)
at io.atomix.protocols.backup.partition.impl.PrimaryBackupPartitionClient.stop(PrimaryBackupPartitionClient.java:89)
at io.atomix.protocols.backup.partition.PrimaryBackupPartition.close(PrimaryBackupPartition.java:140)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at io.atomix.protocols.backup.partition.PrimaryBackupPartitionGroup.close(PrimaryBackupPartitionGroup.java:187)
at io.atomix.primitive.partition.impl.DefaultPartitionService.stop(DefaultPartitionService.java:212)
closed here
should be it's own though?
Johno Crawford
@johnou
Jun 09 2018 23:31
oh
it's because we call threadContextFactory.close();
from io.atomix.protocols.backup.PrimaryBackupClient#close
testScaleDownPersistent works for me after commenting out threadContextFactory.close();
@kuujo gl, time for bed this end
Jordan Halterman
@kuujo
Jun 09 2018 23:57
oooh nice