These are chat archives for atomix/atomix

7th
Nov 2018
Jordan Halterman
@kuujo
Nov 07 2018 02:42
That’s absolutely the recommended usage! Atomix is architected to allow many different application architectures - client-server, peer-to-peer, or whatever combination of the two makes sense (which is what we do in ONOS).
A Raft node can participate in the Raft protocol itself, or it can serve clients that operate on the Raft protocol.
But if no partition groups are configured, the client will essentially block until it can find some partitions to interact with, which is why startup is blocked for the client node.
If you enable DEBUG logging you’ll see messages saying no partition groups were found
The infinite blocking is done for systems like Kubernetes where many services are deployed at once and it’s often the responsibility of services to wait for each other. It also blocks to wait for certain failure scenarios to resolve, e.g. a network partition or unavailable Raft partitions
But I think we could set a default timeout for bootstrapping, or at least for locating partitions. If a system wants to block indefinitely it can disable it.
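To make the client/server split concrete, a client node can be configured with only cluster membership and discovery, and no management or partition groups of its own. A minimal sketch, assuming Atomix 3.0.x HOCON conventions (the `client-1` id and port 8801 are placeholders; the bootstrap list mirrors coopci's raft.conf below):

```hocon
# Hypothetical client-node config: no management-group or partition-groups are
# defined here, so this node joins purely as a client and blocks on startup
# until it discovers the partition groups hosted by raft-1..raft-3.
cluster {
  node {
    id: client-1
    address: "localhost:8801"
  }
  discovery {
    type: bootstrap
    nodes.1 { id: raft-1, address: "localhost:8701" }
    nodes.2 { id: raft-2, address: "localhost:8702" }
    nodes.3 { id: raft-3, address: "localhost:8703" }
  }
}
```

If the Raft nodes are not yet up, this is the configuration whose startup blocks indefinitely as described above.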
coopci
@coopci
Nov 07 2018 02:47
@kuujo Any idea about the high CPU util?
I managed to create a 3-node Raft cluster with this conf https://github.com/coopci/learn-atomix3/blob/master/raft-cs2/raft.conf:
# This works with 3.0.6
# The cluster configuration defines how nodes discover and communicate with one another
cluster {
  node {
    id: ${atomix.node.id}
    address: ${atomix.node.address}
  }
#  multicast.enabled: true   # Enable multicast discovery
#  discovery.type: multicast # Configure the cluster membership to use multicast
}

cluster.discovery {
  type: bootstrap
  nodes.1 {
    id: raft-1
    address: "localhost:8701"
  }
  nodes.2 {
    id: raft-2
    address: "localhost:8702"
  }
  nodes.3 {
    id: raft-3
    address: "localhost:8703"
  }
}

# The management group coordinates higher level partition groups and is required
# This node configures only a management group and no partition groups since it's
# used only for partition/primitive management
management-group {
  type: raft # Use the Raft consensus protocol for system management
  partitions: 1 # Use only a single partition
  members: [raft-1, raft-2, raft-3] # Raft requires a static membership list
  storage: {
      directory: ${atomix.raft.dir}
  }
}

# Configure a Raft partition group named "raft"
partition-groups.raft {
  type: raft # Use the Raft consensus protocol for this group
  partitions: 7 # Configure the group with 7 partitions
  members: [raft-1, raft-2, raft-3] # Raft requires a static membership list
  storage: {
      directory: ${atomix.raft.data.dir}
  }
}
This code can be used to reproduce it: https://github.com/coopci/learn-atomix3/tree/master/src/main/java/atomix3/examples/raftcs. Raft1.java, Raft2.java, and Raft3.java are the Raft nodes; PClient.java is the client.
Jordan Halterman
@kuujo
Nov 07 2018 04:37
This is just from 30 clients connecting? It doesn’t look like they’re doing anything, right?
Certainly not expected. I’d be curious to run a profiler on it
I actually just finished building a new rack last weekend to run scale tests, so I can try to reproduce it
This is exactly what I want to test. We typically have a few - up to 7 or 9 - nodes with many, many primitive sessions. Theoretically that shouldn’t differ much in terms of load from many clients. But it could be something unrelated to primitives that’s causing the high CPU
Jordan Halterman
@kuujo
Nov 07 2018 04:49
My plan is actually to test more like 500 clients
coopci
@coopci
Nov 07 2018 05:04
Yes, just 30 clients will cause PClient to eat that much CPU.
Jordan Halterman
@kuujo
Nov 07 2018 05:27
Gonna test it out
Jordan Halterman
@kuujo
Nov 07 2018 05:40
okay...
coopci
@coopci
Nov 07 2018 05:41
Found something?
Jordan Halterman
@kuujo
Nov 07 2018 05:42
well one problem is that the thread count skyrockets with that many separate Atomix instances on the same node
with 10 clients I have ~800 threads and like 10% CPU usage
I suspect this has more to do with resource starvation than anything else
I can’t even profile 20 clients let alone 30
coopci
@coopci
Nov 07 2018 05:44
So this is not a problem if there is only 1 client per JVM?
But why do 10 clients require ~800 threads?
Jordan Halterman
@kuujo
Nov 07 2018 05:46
Every Atomix instance has its own thread pools… its own Netty event loop groups, its own primitive thread pools, etc. Those all just add up
Raft client thread pools
probably do need to do some pruning of threads, but the really valuable test is running a bunch of clients on separate JVMs, which I’ll try to do this week
just have to make some changes to the test framework to support clients
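The thread math above can be sketched with a toy program (illustrative only, not Atomix code; the pool names and sizes are invented stand-ins): because each instance owns its own pools, thread count grows linearly with the number of instances even when no work is submitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative only -- not Atomix code. Each "instance" owns its own pools,
// so threads scale with instance count, not with load.
public class ThreadGrowth {
    static class FakeInstance {
        final List<ThreadPoolExecutor> pools = new ArrayList<>();
        FakeInstance() {
            // stand-ins for a Netty event loop group, a Raft client pool, a primitive pool
            for (int size : new int[] {4, 2, 2}) {
                ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(size);
                pool.prestartAllCoreThreads(); // force the threads to exist immediately
                pools.add(pool);
            }
        }
        int threadCount() {
            return pools.stream().mapToInt(ThreadPoolExecutor::getPoolSize).sum();
        }
        void shutdown() { pools.forEach(ThreadPoolExecutor::shutdown); }
    }

    public static void main(String[] args) {
        List<FakeInstance> instances = new ArrayList<>();
        for (int i = 0; i < 10; i++) instances.add(new FakeInstance());
        int total = instances.stream().mapToInt(FakeInstance::threadCount).sum();
        // 10 instances x (4 + 2 + 2) threads = 80 idle pool threads.
        // A real Atomix instance owns far more pools than this toy does,
        // which is how 10 co-located clients reach ~800 threads.
        System.out.println(total);
        instances.forEach(FakeInstance::shutdown);
    }
}
```

The numbers here are arbitrary; the point is that per-instance pools make thread count a function of instance count, which is why sharing one instance per JVM is so much cheaper.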
coopci
@coopci
Nov 07 2018 05:49
It seems to me that even if each client has its own dedicated JVM, as long as the JVMs are on the same box there will be no fewer OS threads, and thus no less CPU utilization. Not sure if this is correct.
Jordan Halterman
@kuujo
Nov 07 2018 05:52
I have beefy servers. What I’m most interested in is whether the servers can handle the traffic from n clients, not really whether a machine can handle the load of n clients, although it’s worth decreasing the resource usage if that’s the issue.
doesn’t really seem right though
even with that many threads
probably close to 2k threads with 30 clients
coopci
@coopci
Nov 07 2018 05:55
There is rarely a real-world use case for creating many clients in one JVM. But supporting many clients per JVM makes benchmarking the "server" nodes much easier.
Jordan Halterman
@kuujo
Nov 07 2018 05:55
there’s basically nothing going on with 10 clients
Why would you need to create many clients in one JVM? Atomix is designed for many disparate applications to share a single Atomix instance, which is much more efficient and is what we do in ONOS
well this is frustrating… I can’t even profile it when I get up to 15 clients
the profiler agent stops reporting
coopci
@coopci
Nov 07 2018 05:59
I wanted to test how many concurrent clients the servers can handle too ...
Jordan Halterman
@kuujo
Nov 07 2018 06:00
going to try another method
Jordan Halterman
@kuujo
Nov 07 2018 06:13
using the test framework/Docker I start to run into problems at node 11
it may have something to do with the time complexity of the cluster membership protocol in 3.0.x
I’m really curious how this does on 3.1
guess I can find out
Jordan Halterman
@kuujo
Nov 07 2018 06:49
Nope. It’s not the group membership protocol. It’s strange: the cluster is fine until I add the 11th container, and then the CPU usage skyrockets. This is something that has to be figured out, but it’s very difficult to debug when I can’t actually see what’s happening in the containers
coopci
@coopci
Nov 07 2018 06:52
Which process is using most of the CPU? Only the 11th client, or the Raft nodes, or do all the nodes use more CPU than before once the 11th container joins?
Maybe you can start 10 containers first, then start the 11th client in an IDE.
Wayne Hunter
@incogniro
Nov 07 2018 09:17
@coopci Thanks, I’ve managed to get up and running with your sample code.