These are chat archives for atomix/atomix

11th
Jun 2018
Jordan Halterman
@kuujo
Jun 11 2018 00:32
I think I found the problem with shutdown
testing it now
well… one of them
Jordan Halterman
@kuujo
Jun 11 2018 02:00
The problem is actually primary-backup client close() futures not returning. The root cause seems to be messages not timing out, so trying to investigate that. Should have a fix tonight
Jordan Halterman
@kuujo
Jun 11 2018 06:08
for some reason the ScheduledExecutorService used for timing out requests in NettyMessagingService is not running when the tests hang
but I can’t find any exceptions being thrown inside the callback
Johno Crawford
@johnou
Jun 11 2018 06:27
it's probably because we use shutdownNow
and not an orderly shutdown
Jordan Halterman
@kuujo
Jun 11 2018 06:27
the executor is never shutdown
Johno Crawford
@johnou
Jun 11 2018 06:27
oh
Jordan Halterman
@kuujo
Jun 11 2018 06:27
this issue blocks shutdown
Johno Crawford
@johnou
Jun 11 2018 06:27
so use daemon threads?
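For reference, the daemon-thread idea could look roughly like the sketch below. `DaemonSchedulers`, `daemonFactory`, and `newTimeoutScheduler` are made-up names for illustration, not Atomix code. Note that daemon threads only stop a leftover scheduler from keeping the JVM alive; they would not fix a timeout callback that never fires.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Atomix.
final class DaemonSchedulers {
  private DaemonSchedulers() {}

  /** Returns a factory whose threads are daemons named prefix-1, prefix-2, ... */
  static ThreadFactory daemonFactory(String prefix) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread thread = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
      thread.setDaemon(true); // a daemon thread does not block JVM exit
      return thread;
    };
  }

  /** Single-threaded scheduler backed by daemon threads. */
  static ScheduledExecutorService newTimeoutScheduler(String prefix) {
    return Executors.newSingleThreadScheduledExecutor(daemonFactory(prefix));
  }
}
```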
Jordan Halterman
@kuujo
Jun 11 2018 06:28
ooooh
actually I think I found the bug
just found it
Johno Crawford
@johnou
Jun 11 2018 06:29
pulling in atomix/atomix#617 btw?
assignment to future is to shutdown both groups in parallel
Jordan Halterman
@kuujo
Jun 11 2018 06:29
yeah
stupid bug
another PR coming but I think it probably fixes a lot of these issues
Jordan Halterman
@kuujo
Jun 11 2018 06:34
#623 should be the source of our shutdown woes
Jordan Halterman
@kuujo
Jun 11 2018 06:54
@johnou tests fail like crazy with that PR
testClientJoinLeaveDataGrid(io.atomix.core.AtomixTest)  Time elapsed: 1.06 sec  <<< ERROR!
java.util.concurrent.ExecutionException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
    at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
    at io.netty.channel.unix.Socket.bind(Socket.java:287)
    at io.netty.channel.epoll.AbstractEpollChannel.doBind(AbstractEpollChannel.java:688)
    at io.netty.channel.epoll.EpollServerSocketChannel.doBind(EpollServerSocketChannel.java:70)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1358)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:1019)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:366)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
many many of those
should be compareAndSet(true, false)
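One reading of the `compareAndSet(true, false)` remark: a stop path that flips the flag the wrong way never releases anything, so the next test cannot bind the same port. The sketch below shows the intended start/stop symmetry. `GuardedService` and its counters are hypothetical stand-ins, not the Atomix class in the PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical service, not the Atomix class under discussion.
final class GuardedService {
  private final AtomicBoolean started = new AtomicBoolean();
  private int binds;   // how many times the resource was acquired
  private int unbinds; // how many times it was released

  /** Starts once; later calls are no-ops. Returns true if this call did the start. */
  boolean start() {
    if (started.compareAndSet(false, true)) {
      binds++;
      return true;
    }
    return false;
  }

  /** Stops once; the flag must flip true -> false, the mirror image of start(). */
  boolean stop() {
    if (started.compareAndSet(true, false)) {
      unbinds++;
      return true;
    }
    return false;
  }

  int binds() { return binds; }

  int unbinds() { return unbinds; }
}
```

With `compareAndSet(false, true)` in `stop()` instead, calling `stop()` on a started service would return false and leave the resource held.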
Jordan Halterman
@kuujo
Jun 11 2018 07:12
hmm…. just hanging now
I think the problem is probably that the executors are being shut down in the Netty event loop, so the event loop can’t be shut down because it’s blocked
Jordan Halterman
@kuujo
Jun 11 2018 07:19
this will be easier once we can actually rely on the tests to pass ¯\_(ツ)_/¯
Johno Crawford
@johnou
Jun 11 2018 07:31
Pretty sure that was copy paste
Jordan Halterman
@kuujo
Jun 11 2018 07:33
haha I just found why the com.sun.jmx.remote package was required in the OSGi exports
Johno Crawford
@johnou
Jun 11 2018 07:33
Yeah?
Jordan Halterman
@kuujo
Jun 11 2018 07:34
guess I could have just searched for it before
who knows when that code was written
years ago
Johno Crawford
@johnou
Jun 11 2018 07:35
import com.sun.jmx.remote.internal.ArrayQueue;
oh dude
is that actually used in onos though
Jordan Halterman
@kuujo
Jun 11 2018 07:36
doubtful
probably should have been ArrayDeque
wait that’s there
Johno Crawford
@johnou
Jun 11 2018 07:37
?
oh in the list already
Jordan Halterman
@kuujo
Jun 11 2018 07:37
no it’s not used
Johno Crawford
@johnou
Jun 11 2018 07:37
i'd drop it
before 3.0 cut
Jordan Halterman
@kuujo
Jun 11 2018 07:37
#627
Johno Crawford
@johnou
Jun 11 2018 07:38
checked the rest of the started.compareAndSet
Johno Crawford
@johnou
Jun 11 2018 07:39
oh that's why you used the common fjp
okay
Jordan Halterman
@kuujo
Jun 11 2018 07:39
yeah
Johno Crawford
@johnou
Jun 11 2018 07:39
maybe a little comment would be worth it
Jordan Halterman
@kuujo
Jun 11 2018 07:41
yep
Johno Crawford
@johnou
Jun 11 2018 07:42
oh because it's on a thread context thread?
Jordan Halterman
@kuujo
Jun 11 2018 07:42
yep
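The problem discussed here, a blocking shutdown running on the very thread that has to finish first, and the common `ForkJoinPool` workaround can be sketched as follows. `AsyncShutdown`, `close`, and `closeFromOwnThread` are invented for illustration and are not the code from the PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Atomix code.
final class AsyncShutdown {
  private AsyncShutdown() {}

  /**
   * Shuts the executor down and waits for termination on the common
   * ForkJoinPool, so a task running on the executor itself can call this
   * without waiting for its own completion.
   */
  static CompletableFuture<Boolean> close(ExecutorService executor, long timeoutMillis) {
    return CompletableFuture.supplyAsync(() -> {
      executor.shutdown();
      try {
        return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }, ForkJoinPool.commonPool());
  }

  /** Demo: a task on the executor triggers the executor's own shutdown. */
  static boolean closeFromOwnThread(ExecutorService executor) {
    CompletableFuture<CompletableFuture<Boolean>> handle = new CompletableFuture<>();
    executor.execute(() -> handle.complete(close(executor, 5_000)));
    return handle.join().join();
  }
}
```

Had the task called `awaitTermination` directly on its own executor thread, it would wait for itself and only return after the timeout.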
Johno Crawford
@johnou
Jun 11 2018 07:51
@kuujo did you intentionally leave timeoutFuture.cancel(false); for remote connections?
just remove it for local
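For context, the general pattern behind a `timeoutFuture` is a timer scheduled per request and cancelled once the reply arrives. The sketch below shows that pattern only; `RequestTimeouts` and `withTimeout` are hypothetical names, and this is not the NettyMessagingService implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not the NettyMessagingService implementation.
final class RequestTimeouts {
  private RequestTimeouts() {}

  /** Fails the reply with a TimeoutException unless it completes within timeoutMillis. */
  static <T> CompletableFuture<T> withTimeout(
      CompletableFuture<T> reply, ScheduledExecutorService scheduler, long timeoutMillis) {
    ScheduledFuture<?> timeoutFuture = scheduler.schedule(
        () -> { reply.completeExceptionally(new TimeoutException("request timed out")); },
        timeoutMillis, TimeUnit.MILLISECONDS);
    // Once the reply completes either way, the timer is no longer needed.
    // cancel(false) does not interrupt a timer task that is already running.
    reply.whenComplete((result, error) -> timeoutFuture.cancel(false));
    return reply;
  }
}
```

If the scheduler thread never runs, as in the hang described earlier, the reply future is never failed and anything waiting on it blocks indefinitely.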
Jordan Halterman
@kuujo
Jun 11 2018 07:51
damnit
Jordan Halterman
@kuujo
Jun 11 2018 08:26
this one’s pretty frequent:
08:17:35.053 [backup-client-test-1] ERROR i.a.u.concurrent.ThreadPoolContext - An uncaught exception occurred
java.lang.AssertionError: expected:<1> but was:<2>
    at net.jodah.concurrentunit.Waiter.fail(Waiter.java:205) ~[concurrentunit-0.4.2.jar:na]
    at net.jodah.concurrentunit.Waiter.assertEquals(Waiter.java:40) ~[concurrentunit-0.4.2.jar:na]
    at net.jodah.concurrentunit.ConcurrentTestCase.threadAssertEquals(ConcurrentTestCase.java:20) ~[concurrentunit-0.4.2.jar:na]
    at io.atomix.protocols.backup.PrimaryBackupTest.lambda$testSequentialEvent$0(PrimaryBackupTest.java:158) ~[test-classes/:na]
    at io.atomix.primitive.session.impl.BlockingAwareSessionClient.lambda$null$2(BlockingAwareSessionClient.java:70) ~[classes/:na]
    at io.atomix.utils.concurrent.ThreadPoolContext.lambda$new$0(ThreadPoolContext.java:81) ~[classes/:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_141]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_141]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_141]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[na:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_141]
Tests run: 21, Failures: 1, Errors: 0, Skipped: 2, Time elapsed: 23.311 sec <<< FAILURE!
testOneNodeEvent(io.atomix.protocols.backup.PrimaryBackupTest)  Time elapsed: 1.266 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<2>
    at net.jodah.concurrentunit.Waiter.fail(Waiter.java:205)
    at net.jodah.concurrentunit.Waiter.assertEquals(Waiter.java:40)
    at net.jodah.concurrentunit.ConcurrentTestCase.threadAssertEquals(ConcurrentTestCase.java:20)
    at io.atomix.protocols.backup.PrimaryBackupTest.lambda$testSequentialEvent$0(PrimaryBackupTest.java:158)
Johno Crawford
@johnou
Jun 11 2018 08:34
locally?
Jordan Halterman
@kuujo
Jun 11 2018 08:35
Travis
Johno Crawford
@johnou
Jun 11 2018 08:35
did you want to merge the three in queue then see what fails after that?
Jordan Halterman
@kuujo
Jun 11 2018 08:41
hmm… seems to be a thread not being shutdown here somewhere
Johno Crawford
@johnou
Jun 11 2018 08:42
what's the thread name
Jordan Halterman
@kuujo
Jun 11 2018 08:43
it’s the primary-backup threads
backup-client-* and backup-server-*
Johno Crawford
@johnou
Jun 11 2018 08:44
yay for well named threads
Jordan Halterman
@kuujo
Jun 11 2018 08:44
haha indeed
Johno Crawford
@johnou
Jun 11 2018 08:49
io.atomix.protocols.backup.PrimaryBackupServer.Builder#build
if it creates its own factory it is never closed
Jordan Halterman
@kuujo
Jun 11 2018 08:50
ahh it’s just a test issue
hmm
Johno Crawford
@johnou
Jun 11 2018 08:50
i've generally seen other libraries deal with this using a boolean no?
you either pass in an external executor, and it uses that and you are responsible for shutting it down, or it creates an internal one, then shuts that down
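The ownership rule described here can be sketched with a plain `ExecutorService`. `OwnedExecutorComponent` is a hypothetical class, not `PrimaryBackupServer`, and a null argument is used as the "create your own" signal purely for brevity.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical component, not PrimaryBackupServer.
final class OwnedExecutorComponent implements AutoCloseable {
  private final ExecutorService executor;
  private final boolean ownsExecutor; // true only when we created the executor ourselves

  /** Pass null to let the component create, and later shut down, its own executor. */
  OwnedExecutorComponent(ExecutorService external) {
    this.ownsExecutor = external == null;
    this.executor = ownsExecutor ? Executors.newSingleThreadExecutor() : external;
  }

  ExecutorService executor() {
    return executor;
  }

  @Override
  public void close() {
    // An externally supplied executor stays the caller's responsibility.
    if (ownsExecutor) {
      executor.shutdown();
    }
  }
}
```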
Jordan Halterman
@kuujo
Jun 11 2018 09:00
hmm
makes sense
in this case the PrimaryBackupPartitionGroup creates a ThreadContext for all the partitions, but in the tests it’s just creating a single client and servers
Johno Crawford
@johnou
Jun 11 2018 09:01
yeah and it's closed further up
io.atomix.protocols.backup.partition.PrimaryBackupPartitionGroup#close
Jordan Halterman
@kuujo
Jun 11 2018 09:06
I got this
Johno Crawford
@johnou
Jun 11 2018 09:07
you sure? i have it locally
Jordan Halterman
@kuujo
Jun 11 2018 09:08
had to make some updates to the tests too
Johno Crawford
@johnou
Jun 11 2018 09:11
i structured it like this but i think they're both ok
  final boolean closeThreadContextFactory = this.threadContextFactory == null;
  final ThreadContextFactory threadContextFactory = this.threadContextFactory != null
      ? this.threadContextFactory
      : threadModel.factory("backup-server-" + serverName + "-%d", threadPoolSize, log);
Jordan Halterman
@kuujo
Jun 11 2018 09:12
I’m off to bed after this test
Johno Crawford
@johnou
Jun 11 2018 09:17
looks good to me
Ronnie
@rroller
Jun 11 2018 16:07
Just saw "@rroller change the version to 3.0.0-SNAPSHOT and then test again". Will try it
Johno Crawford
@johnou
Jun 11 2018 16:08
might want to build your own snapshot and install it locally, not sure if there is one in sonatype and if it has all the latest fixes
Jordan Halterman
@kuujo
Jun 11 2018 16:53
Yay tests!
Jordan Halterman
@kuujo
Jun 11 2018 17:01
I will disable merging without tests passing once they pass a few more times.
Jordan Halterman
@kuujo
Jun 11 2018 17:27
Also about to start running ONOS tests and the test framework. Once everything is passing I’ll cut the first RC. Plenty more work to do after that. Biggest thing now is probably adding tests and cleaning up the primary-backup protocol.
Also the REST API, but it probably should be considered beta for a while.
Also, I will modify my ONOS/Atomix 3 slide deck to create an Atomix 3 overview, and maybe I’ll record a new talk just for Atomix 3
Jordan Halterman
@kuujo
Jun 11 2018 17:32
If I can get Atomix 3 in ONOS by Friday I’ll be right on schedule :-)
Jordan Halterman
@kuujo
Jun 11 2018 17:53
The log length has exceeded the limit of 4 MB (this usually means that the test suite is raising the same exception over and over).

The job has been terminated
ugh
I hate that damn limit
Johno Crawford
@johnou
Jun 11 2018 18:03
@kuujo I might try setting it up on gitlab
Gold for open source projects now
Wdyt
Jordan Halterman
@kuujo
Jun 11 2018 18:12
as long as we don’t have to move the repo
Johno Crawford
@johnou
Jun 11 2018 18:17
Nope
Jordan Halterman
@kuujo
Jun 11 2018 18:17
have at it
Johno Crawford
@johnou
Jun 11 2018 18:29
Path has already been taken
someone took atomix already :frowning:
Jordan Halterman
@kuujo
Jun 11 2018 18:29
maybe @jhalterman?
Johno Crawford
@johnou
Jun 11 2018 18:30
private group or something https://gitlab.com/atomix
maybe?
Jordan Halterman
@kuujo
Jun 11 2018 18:30
I will ask him
nope
damnit

HUNT THEM DOWN!

you can use atomixio
Johno Crawford
@johnou
Jun 11 2018 18:33
then maybe ask them to change it
Jordan Halterman
@kuujo
Jun 11 2018 18:33
or we can ask them
Johno Crawford
@johnou
Jun 11 2018 18:34
atomix.io ok?
i see other people are using the domain for the group
Jordan Halterman
@kuujo
Jun 11 2018 18:37
sure
Jordan Halterman
@kuujo
Jun 11 2018 19:22
tests are looking good now :boom:
aside from the fact they take an eternity to run
Nearing an hour
we can’t maintain that
Johno Crawford
@johnou
Jun 11 2018 19:30
yeah that's why I wanted to try gitlab
Jordan Halterman
@kuujo
Jun 11 2018 19:33
moment of truth
Johno Crawford
@johnou
Jun 11 2018 19:33
@kuujo might need you to setup the mirror..
Jordan Halterman
@kuujo
Jun 11 2018 19:33
okay
Johno Crawford
@johnou
Jun 11 2018 19:33
i think it's blowing up because i'm part of the jenkins group
and it's trying to load all my github repos
Jordan Halterman
@kuujo
Jun 11 2018 19:33
haha
Johno Crawford
@johnou
Jun 11 2018 19:33
502
Whoops, GitLab is taking too much time to respond.
have you logged into gitlab with your github account?
don't need to register, can just use github oauth
let me know when done and i'll add you to the group as owner
Jordan Halterman
@kuujo
Jun 11 2018 19:39
I signed in
what do you want me to do?
create a group?
click import from github
CI/CD for external repo
then github button, and select the atomix repo
i ended up creating and using atomix-io because that's the same group format the official gitlab repo group uses
Jordan Halterman
@kuujo
Jun 11 2018 19:41
bah damnit
Johno Crawford
@johnou
Jun 11 2018 19:42
what?
Jordan Halterman
@kuujo
Jun 11 2018 19:42
connected it to the /kuujo/atomix by accident
Johno Crawford
@johnou
Jun 11 2018 19:42
you can move it
or just delete and redo
Jordan Halterman
@kuujo
Jun 11 2018 19:42
I have two groups
atomix-io and atomix
Johno Crawford
@johnou
Jun 11 2018 19:43
really?
Jordan Halterman
@kuujo
Jun 11 2018 19:43
haha yeah
let’s see what happens if I go down that path
Jordan Halterman
@kuujo
Jun 11 2018 19:43
maybe a bug in their UI
Johno Crawford
@johnou
Jun 11 2018 19:43
i added you as owner to atomix-io
Jordan Halterman
@kuujo
Jun 11 2018 19:43
just errored out when I tried to connect it
Johno Crawford
@johnou
Jun 11 2018 19:43
ah
Jordan Halterman
@kuujo
Jun 11 2018 19:44
err… now I guess I have to figure out how to undo the kuujo/atomix one
then settings on the left
should be able to just delete it, or even transfer it to the group
Jordan Halterman
@kuujo
Jun 11 2018 19:47
okay fixed
okay cool
i'll setup CI
Johno Crawford
@johnou
Jun 11 2018 20:29
@kuujo how many years old is atomix?
Jordan Halterman
@kuujo
Jun 11 2018 20:30
2013
Well...
Johno Crawford
@johnou
Jun 11 2018 20:30
i'll just put since 2013
Jordan Halterman
@kuujo
Jun 11 2018 20:30
Copycat was started in 2013 I think, and Atomix grew from that so I say 2013
Looks like I have to finish up the transactions before I can release this thing. Might as well make them fault tolerant while I’m there.
Jordan Halterman
@kuujo
Jun 11 2018 20:47
:+1:
Job's log exceeded limit of 4194304 bytes.
thanks gitlab
where did you configure the log levels for travis?
Jordan Halterman
@kuujo
Jun 11 2018 21:24
environment variables
err
system properties
Johno Crawford
@johnou
Jun 11 2018 21:24
-Droot.logging.level=INFO
gotcha
Johno Crawford
@johnou
Jun 11 2018 21:31
did you want fail fast or fail at end btw
Jordan Halterman
@kuujo
Jun 11 2018 21:33
end would be nice
mm
depends on how long it takes actually
Johno Crawford
@johnou
Jun 11 2018 21:33
okay that's what I have it set to atm
let's see
ah must be fail at end of module
[INFO] Atomix Parent Pom 3.0.0-SNAPSHOT ................... SUCCESS [ 13.960 s]
[INFO] Atomix Utilities ................................... SUCCESS [ 18.813 s]
[INFO] Atomix Cluster ..................................... SUCCESS [01:08 min]
[INFO] Atomix Storage ..................................... SUCCESS [ 7.315 s]
[INFO] Atomix Primitive API ............................... SUCCESS [ 11.786 s]
[INFO] Atomix Protocols Parent ............................ SUCCESS [ 0.070 s]
[INFO] Atomix Protocols :: Raft ........................... FAILURE [04:59 min]
Johno Crawford
@johnou
Jun 11 2018 21:41
@kuujo still thinking of downsides for atomix/atomix.github.io#18 ? :P
they support the current setup you have with an A record, just requires changing the old ones to the new ips then checking the box in the project settings
hm gitlab failed on RaftTest.lambda$testBlockOnEvent$16:862->ConcurrentTestCase.threadFail:76 null two times in a row
Jordan Halterman
@kuujo
Jun 11 2018 21:59
hmm
ugh I’m making a mess of this transaction thing
oh well
Johno Crawford
@johnou
Jun 11 2018 22:36
jvm is still not shutting down between modules
Jordan Halterman
@kuujo
Jun 11 2018 22:44
how did you determine that?
Johno Crawford
@johnou
Jun 11 2018 22:44
jmc
publishing the snapshots is kind of messing up development too
wouldn't take my local pom changes until i bumped the pom version
maybe I need to use a specific maven flag
Jordan Halterman
@kuujo
Jun 11 2018 22:46
Not sure what you mean
publishing snapshots is only going to become more frequent, as it should
Johno Crawford
@johnou
Jun 11 2018 22:49
i made a change to one of the pom.xml files in my IDE
and it wouldn't take those changes
Jordan Halterman
@kuujo
Jun 11 2018 22:49
what wouldn't
Johno Crawford
@johnou
Jun 11 2018 22:49
maven when running a mojo
Jordan Halterman
@kuujo
Jun 11 2018 22:51
need snapshots to be deployed continuously for testing
Johno Crawford
@johnou
Jun 11 2018 23:06
ah it's because the jvm is forked and stopping the tests from within the ide doesn't kill the child proc
Jordan Halterman
@kuujo
Jun 11 2018 23:07
oh weird
Jordan Halterman
@kuujo
Jun 11 2018 23:26
Could probably use that `@Command`/`@Query`/`@Event` caching now
something like this?
private final AnnotationCache<Command> commandCache = new AnnotationCache<>(Command.class);
isAnnotated can obviously be simplified, I must have been in a rush :)
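The actual `AnnotationCache` is not shown in the chat, so the following is only a guess at its shape: one reflective lookup per method, then answers served from a map. The `Command` annotation and `CounterService` below are stand-ins defined for the example, not the Atomix types.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Guess at the shape of the cache mentioned above; the real class is not shown in the chat.
final class AnnotationCache<A extends Annotation> {
  private final Class<A> type;
  private final Map<Method, Optional<A>> cache = new ConcurrentHashMap<>();

  AnnotationCache(Class<A> type) {
    this.type = type;
  }

  /** Looks the annotation up reflectively once per method, then serves it from the map. */
  Optional<A> get(Method method) {
    return cache.computeIfAbsent(method, m -> Optional.ofNullable(m.getAnnotation(type)));
  }

  boolean isAnnotated(Method method) {
    return get(method).isPresent();
  }

  /** Number of distinct methods looked up so far. */
  int size() {
    return cache.size();
  }
}

// Stand-in annotation and service class for the example only.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Command {}

final class CounterService {
  @Command
  public void increment() {}

  public void get() {}
}
```

Caching negative results as `Optional.empty()` matters here, since unannotated methods would otherwise be rescanned on every call.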
Jordan Halterman
@kuujo
Jun 11 2018 23:35
TBH we don’t have to worry about changing classpaths. Atomix has to have a pretty strict requirement that the classpath not change at least as it concerns Atomix extensions because it persists and replays information
Johno Crawford
@johnou
Jun 11 2018 23:36
also with kryo
if the serialisation id changes it freaks out
Jordan Halterman
@kuujo
Jun 11 2018 23:37
indeed
Johno Crawford
@johnou
Jun 11 2018 23:46
hm
Screenshot from 2018-06-12 01-49-05.png
2 GB of allocations in 60 seconds scanning annotations