These are chat archives for atomix/atomix

16th Aug 2018
william.z
@zwillim
Aug 16 2018 01:59
when receiving a communication message, is there any way to get the source IP/memberId?
Junbo Ruan
@aruanruan
Aug 16 2018 02:40
when I add a log statement like this:

RaftServiceManager.java

 /**
   * Applies the next entry in the log up to the given index.
   *
   * @param index the index up to which to apply the entry
   */
  @SuppressWarnings("unchecked")
  private void applyIndex(long index) {
    logger.error("apply index: {}", index);

I changed the log level to ERROR.
After testing for a long time, I see logs like:

10:34:24.100 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 460
10:34:24.398 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41393
10:34:24.398 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41394
10:34:24.398 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41395
10:34:24.398 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41396
10:34:24.399 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41397
10:34:24.421 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41398
10:34:24.449 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41399
10:34:24.449 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41400
10:34:24.477 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41401
10:34:24.509 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41402
10:34:24.534 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41403
10:34:24.535 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41404
10:34:24.551 [raft-server-consensus-partition-2] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-2} - apply index: 19
10:34:24.586 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41405
10:34:24.641 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41406
10:34:24.688 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41407
10:34:24.757 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41408
10:34:24.764 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41409
10:34:24.785 [raft-server-consensus-partition-2] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-2} - apply index: 20
10:34:24.848 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41410
10:34:25.101 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41411
10:34:25.196 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41412
10:34:25.257 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41413
10:34:25.266 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41414
10:34:25.350 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41415
10:34:25.403 [raft-server-consensus-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{consensus-partition-1} - apply index: 41416
Junbo Ruan
@aruanruan
Aug 16 2018 02:58
for the same index:
node1:
10:36:08.167 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 491
10:36:08.782 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 492

node2:
10:36:08.378 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 491
10:36:08.886 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 492

node3:
10:37:35.224 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 491
10:37:35.224 [raft-server-system-partition-1] ERROR i.a.p.raft.impl.RaftServiceManager - RaftServer{system-partition-1} - apply index: 492
the timestamps are different
max - min is about 30s
my cluster is synchronized with NTP
actually, make that about 90s, not 30s
Jordan Halterman
@kuujo
Aug 16 2018 03:15
I don’t understand what you mean by “the timestamps are different”
10:36:08.167 and 10:37:35.224?
there’s nothing wrong with these logs aside from the fact there’s some major hotspotting happening
It’s just showing that one of the nodes is far behind two others. There’s nothing necessarily wrong with that. The system will tend towards the fastest majority of nodes when it’s under load. The slow node doesn’t really make a difference.
Jordan Halterman
@kuujo
Aug 16 2018 03:20
It’s basically just worth understanding why that node is slow, but usually it’s something related to the system
@zwillim we talked about adding a method to provide the source to subscribers but it never happened. Would be easy to add. Currently only the lower level MessagingService provides the source Address, but of course you can also just send it in the message
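For reference, a minimal sketch of those two options — it assumes the MessagingService registerHandler overload that takes a BiConsumer<Address, byte[]> plus an Executor, and the subject name and payload-tagging helper are made up for illustration:

import io.atomix.cluster.messaging.MessagingService;
import io.atomix.utils.net.Address;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class SourceAddressSketch {

  // Option 1: the lower-level MessagingService hands the sender's Address to the handler.
  public static void listen(MessagingService messaging) {
    messaging.registerHandler(
        "my-subject",                               // illustrative subject name
        (Address sender, byte[] payload) ->
            System.out.println("got " + payload.length + " bytes from " + sender),
        Executors.newSingleThreadExecutor());
  }

  // Option 2: with the higher-level communication service, put the local memberId
  // into the payload yourself before sending, e.g. "memberId|message".
  public static byte[] tagWithSender(String localMemberId, String message) {
    return (localMemberId + "|" + message).getBytes(StandardCharsets.UTF_8);
  }
}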
What we need is to understand the state of the log reader when the “Cannot apply index” message is logged
william.z
@zwillim
Aug 16 2018 03:23
I'll add it myself if it won't be supported by the framework.
Jordan Halterman
@kuujo
Aug 16 2018 03:24
I will add it
@aruanruan any idea why that performance test above would not reproduce this issue? Is there something about it that’s inconsistent with what you’re doing?
Junbo Ruan
@aruanruan
Aug 16 2018 03:28
@kuujo I just created a new service that saves <key, JSON object> entries
at the same time, I save them to RocksDB
Jordan Halterman
@kuujo
Aug 16 2018 03:33
rocksdb? How does that fit in?
Is that happening inside a primitive service?
Junbo Ruan
@aruanruan
Aug 16 2018 03:34
yes
Jordan Halterman
@kuujo
Aug 16 2018 03:34
Maybe this can be reproduced by blocking inside the service then
That’s good info thanks
Junbo Ruan
@aruanruan
Aug 16 2018 03:35
now I've removed RocksDB and started testing again
Junbo Ruan
@aruanruan
Aug 16 2018 03:40
but it became disconnected again
using VisualVM, I found about 130 running threads in the 'netty-messaging-event-epoll-client-xxx' thread group, with xxx from 26 to 159
actually from 0 to 159
about 160 threads
Jordan Halterman
@kuujo
Aug 16 2018 03:45
How many cores do you have? Don’t remember how Netty computes the number of threads
Junbo Ruan
@aruanruan
Aug 16 2018 03:47
80 cores
80 processors
[root@cloudate3 atomix]# cat /proc/cpuinfo |grep "processor" | wc -l
80
Jordan Halterman
@kuujo
Aug 16 2018 04:33
Makes sense
💪
Johno Crawford
@johnou
Aug 16 2018 07:28
DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt(
    "io.netty.eventLoopThreads", NettyRuntime.availableProcessors() * 2));

/**
 * Get the configured number of available processors. The default is {@link Runtime#availableProcessors()}.
 * This can be overridden by setting the system property "io.netty.availableProcessors" or by invoking
 * {@link #setAvailableProcessors(int)} before any calls to this method.
 */
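On the 80-core machine above that default works out to 160 event loop threads. Assuming Atomix leaves Netty's default sizing in place (which the 160-thread count suggests), either property named in the snippet can cap it, as long as it is set before Netty's classes load; the values below are only examples:

-Dio.netty.eventLoopThreads=16
-Dio.netty.availableProcessors=8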
Junbo Ruan
@aruanruan
Aug 16 2018 07:32
that is the max pool size, isn't it?
Jordan Halterman
@kuujo
Aug 16 2018 07:33
it’s just the pool size
Junbo Ruan
@aruanruan
Aug 16 2018 07:33
why is it filled up?
Johno Crawford
@johnou
Aug 16 2018 07:33
event loops are not quite a pool though
however
80 cores
after x runtime
using netty
that's going to take a lot of memory
are you monitoring your gc?
Junbo Ruan
@aruanruan
Aug 16 2018 07:35
I set -Xmx to 8G
Johno Crawford
@johnou
Aug 16 2018 07:35
what collector?
Junbo Ruan
@aruanruan
Aug 16 2018 07:35
gc is normal
Jordan Halterman
@kuujo
Aug 16 2018 07:36
It’s not filled full. It’s like Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2). When you create the event loop group Netty just creates that many threads.
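In code form the analogy is roughly this — a sketch only, not Atomix's actual event loop wiring, and it uses the NIO transport just so it runs anywhere:

import io.netty.channel.nio.NioEventLoopGroup;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EventLoopSizing {
  public static void main(String[] args) {
    // 2 * cores, i.e. 160 threads on the 80-core machine discussed above.
    int threads = Runtime.getRuntime().availableProcessors() * 2;

    ExecutorService pool = Executors.newFixedThreadPool(threads); // the fixed-pool analogy
    NioEventLoopGroup group = new NioEventLoopGroup(threads);     // same bound, stated explicitly

    group.shutdownGracefully();
    pool.shutdown();
  }
}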
Junbo Ruan
@aruanruan
Aug 16 2018 07:36
used memory is about 5G
Jordan Halterman
@kuujo
Aug 16 2018 07:36
but it’s definitely overkill
Junbo Ruan
@aruanruan
Aug 16 2018 07:37
ok, so that means the Netty thread pool is fine, right?
Jordan Halterman
@kuujo
Aug 16 2018 07:38
yeah
Johno Crawford
@johnou
Aug 16 2018 07:38
can you try running the stress test again with
-Xss256k -Dio.netty.recycler.maxCapacity=0 -Dio.netty.recycler.maxCapacity.default=0 -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.tinyCacheSize=0 -Dio.netty.allocator.smallCacheSize=0 -Dio.netty.allocator.normalCacheSize=0
Junbo Ruan
@aruanruan
Aug 16 2018 07:38
ok
Junbo Ruan
@aruanruan
Aug 16 2018 07:45
now the netty-messaging-event-epoll-client thread count is under 30
about 40
Johno Crawford
@johnou
Aug 16 2018 07:57
does the memory look better?
Junbo Ruan
@aruanruan
Aug 16 2018 07:57
2g-4g
Johno Crawford
@johnou
Aug 16 2018 07:58
for high performance I would keep the recycler disabled and enable/tune the allocator
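For example, something along these lines — these are standard Netty pooled-allocator properties, but the values are arbitrary starting points to benchmark, not recommendations from this chat:

-Dio.netty.allocator.type=pooled
-Dio.netty.allocator.numDirectArenas=16
-Dio.netty.allocator.maxOrder=9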
Junbo Ruan
@aruanruan
Aug 16 2018 07:59
there are fewer client threads now
before, the client threads filled up within a few minutes
Johno Crawford
@johnou
Aug 16 2018 08:01
so it's working better?
sounds like your gc was going crazy before
Junbo Ruan
@aruanruan
Aug 16 2018 08:01
looks better
Johno Crawford
@johnou
Aug 16 2018 08:01
do you have GC graphs?
Junbo Ruan
@aruanruan
Aug 16 2018 08:01
I need to run the test case a few more times
I don't know why I can't paste a graph image
Johno Crawford
@johnou
Aug 16 2018 08:04
drag and drop
Junbo Ruan
@aruanruan
Aug 16 2018 08:04
here
Johno Crawford
@johnou
Aug 16 2018 08:05
doesn't work on mobile
Junbo Ruan
@aruanruan
Aug 16 2018 08:07
E2A4E2F6-56BF-4D4E-A1F6-C5A22AE6D4DF.png
blue line means gc time
Johno Crawford
@johnou
Aug 16 2018 08:08
hard to see
Junbo Ruan
@aruanruan
Aug 16 2018 08:11
2F4D3B16-DDF5-49E9-9AB5-9233B64E4FE1.png
it looks like gc is normal
Junbo Ruan
@aruanruan
Aug 16 2018 08:18
the client thread count hit 160; when I interrupt the test case and then restart it, it cannot connect to the Atomix cluster any more
16:19:36.477 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
16:19:37.478 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
16:19:39.479 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
16:19:42.481 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
16:19:47.482 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
16:19:52.483 [atomix-partition-group-membership-service-0] WARN  i.a.p.p.i.DefaultPartitionGroupMembershipService - Failed to locate system partition group. Retrying...
Jordan Halterman
@kuujo
Aug 16 2018 08:20
interrupt and then restart?
Junbo Ruan
@aruanruan
Aug 16 2018 08:20
heap is about 6G
I interrupt the test case and then restart it
I have inserted about 40000*500 rows of data
Jordan Halterman
@kuujo
Aug 16 2018 08:23
that may take a long time to restore
doesn’t seem like it would cause that message though
but that depends on which nodes are being killed
Junbo Ruan
@aruanruan
Aug 16 2018 08:26
the cluster logs are the same at the end
Jordan Halterman
@kuujo
Aug 16 2018 08:26
Same as what?
Junbo Ruan
@aruanruan
Aug 16 2018 08:26
the last indexes are the same
Jordan Halterman
@kuujo
Aug 16 2018 08:26
hmm
Junbo Ruan
@aruanruan
Aug 16 2018 08:27
each op includes 500 rows
Jordan Halterman
@kuujo
Aug 16 2018 08:27
I see
the message just indicates the node has not been able to contact one other node to get partition group information
Junbo Ruan
@aruanruan
Aug 16 2018 08:30
the partition indexes are the same across the nodes
Jordan Halterman
@kuujo
Aug 16 2018 08:34
not sure what you mean
Jordan Halterman
@kuujo
Aug 16 2018 08:40
I gotta get some sleep. I’ve been staring at this Raft log all day.
suimi
@suimi
Aug 16 2018 09:40
hi all, I've got a problem. My cluster has 5 nodes (node-A~node-E). When I restart node-E several times, it can't start the management group and keeps opening the proxy session, but after some time it takes the leadership.
Jordan Halterman
@kuujo
Aug 16 2018 17:36
I may have found the cause of the “Cannot apply index” errors
@aruanruan I will have a PR for you to test this afternoon
@suimi it’s not clear to me exactly what’s going on. Can you maybe share some logs that show what you’re seeing?
Jordan Halterman
@kuujo
Aug 16 2018 19:20
@johnou yt?
Wondering if you have any ideas about this: https://github.com/atomix/atomix/tree/primitive-test-cluster-reuse
The idea is to share a cluster and clients among all the primitive tests. Setting up and tearing down clusters in those tests is redundant, and this drastically improves the speed and reliability of those tests. That branch mostly works, but when I run the full tests with mvn it fails and I’m not sure why. The test case never seems to get set up.
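For what it's worth, the shape of that idea as a JUnit sketch — AutoCloseable lambdas stand in for the real Atomix cluster and client here so the snippet is self-contained; the branch's actual harness may look quite different:

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class SharedClusterPrimitiveTest {

  // Built once for the whole class (or suite) instead of once per test.
  private static AutoCloseable cluster; // placeholder for the shared multi-node test cluster
  private static AutoCloseable client;  // placeholder for a shared Atomix client

  @BeforeClass
  public static void setUpSharedCluster() {
    cluster = () -> { };                // would boot the shared cluster once
    client = () -> { };                 // would connect one client to it
  }

  @AfterClass
  public static void tearDownSharedCluster() throws Exception {
    client.close();
    cluster.close();
  }

  @Test
  public void testPrimitive() {
    // Each test only builds primitives against the shared client,
    // so there is no per-test cluster setup or teardown.
  }
}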
Junbo Ruan
@aruanruan
Aug 16 2018 23:49
@kuujo great