These are chat archives for atomix/atomix
@himanshug thanks for your questions! I know Druid well as my day job until last week was researching query optimization in big data systems (mostly Spark via Apache Calcite).
Out of curiosity, why are you guys looking to move off of ZooKeeper/Curator? Is it just because of the distaste for having that as an external dependency? I've noticed a lot of large open source projects moving this direction (Kafka is another one that's trying to remove the ZK dependency). Or is there some limitation in ZK/Curator itself? I obviously have lots of opinions on that.
Good questions. So, in Atomix when a Raft session is expired the client will typically register a new session. Atomix resources watch the state of the underlying `CopycatClient` for changes and invoke a `recover` method when a session is expired and a new session is opened. When that happens, the `DistributedGroup` implementation ensures any of the members of the group that were added by the local process still exist in the group. There will actually be a brief period of inconsistency after the client loses its session, which is unavoidable. But group clients don't need to handle session recovery. Group membership listeners don't need to be re-registered, since events for all changes in the group go to all open instances of the group.
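The recovery logic described above can be modeled in plain Java (the names here are hypothetical, not the real Atomix API): the group resource remembers which members *this* process added, and when a new session opens it re-registers any of them missing from the server-side group state.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of DistributedGroup's session recovery, assuming a
// simplified model: serverGroup stands in for the replicated group state,
// localMembers for the members added through this process.
public class GroupRecoverySketch {
  static Set<String> serverGroup = new HashSet<>();
  static Set<String> localMembers = new HashSet<>();

  static void join(String memberId) {
    localMembers.add(memberId);
    serverGroup.add(memberId);
  }

  // Invoked when the underlying client's session expires and a new one opens.
  static void recover() {
    for (String member : localMembers) {
      if (!serverGroup.contains(member)) {
        serverGroup.add(member); // re-register under the new session
      }
    }
  }

  public static void main(String[] args) {
    join("member-1");
    join("member-2");
    // Session expiry: the servers remove members owned by the dead session.
    serverGroup.clear();
    recover();
    System.out.println(serverGroup.containsAll(localMembers)); // true
  }
}
```

The point of the sketch is that recovery is driven entirely by the resource, which is why application code never has to re-join or re-register listeners itself.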
There are some aspects of `DistributedGroup` that don't behave as well. The messaging feature in `DistributedGroup` is experimental and may not behave well when sessions expire. But work has been done to ensure that group membership itself remains useful across session expirations.
You never need to re-register serializers.
Worst case memory depends on the configuration of the Raft log. If you're only ever using one resource, you can actually forego the Atomix resource manager and save a lot of overhead. Each Atomix resource is just a Copycat state machine and can run independently in a Copycat cluster. Atomix is really just an API/manager for multiplexing state machines on a single Raft cluster (and in the future partitioning multiple state machines on multiple Raft clusters). If you only need groups, you can create a Copycat cluster with the `GroupState` state machine. Then, by setting the log to n bytes, it shouldn't really grow much larger than n*2 bytes on each server.
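The multiplexing idea can be sketched in a few lines of plain Java (names are hypothetical, not the Copycat API): the resource manager is essentially one state machine that routes each command to a named child state machine, which is the layer you skip by running `GroupState` directly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a resource manager multiplexing several state machines on one
// Raft cluster. Each "resource" is modeled as a function from a command
// string to a result string.
public class ResourceManagerSketch {
  private final Map<String, Function<String, String>> resources = new HashMap<>();

  void register(String name, Function<String, String> stateMachine) {
    resources.put(name, stateMachine);
  }

  // The manager's own apply(): dispatch to the child state machine by name.
  String apply(String resource, String command) {
    return resources.get(resource).apply(command);
  }

  public static void main(String[] args) {
    ResourceManagerSketch manager = new ResourceManagerSketch();
    manager.register("my-group", cmd -> "group:" + cmd);
    manager.register("my-lock", cmd -> "lock:" + cmd);
    System.out.println(manager.apply("my-group", "join")); // group:join
  }
}
```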
The biggest overhead for `DistributedGroup` is just managing sessions. That overhead can be controlled by setting the session timeout on the servers or when a `CopycatClient` is created. Typically, clients submit keep-alives at a rate of 2x the session timeout (i.e. every half timeout). That could make it tempting to simply set a large session timeout, but the session timeout is also used to determine when a node has crashed. So, it's a balance between reducing the overhead of sessions and ensuring changes in the group can be detected and handled quickly. Keep-alives are logged and replicated but will be immediately compacted once the log rolls over to a new segment. For a `DistributedGroup`, the log will contain mostly keep-alives, and if the segment size is 32mb, then once the log grows to 32mb and rolls over to a new segment, the vast majority of that 32mb segment should be compacted (aside from one session register entry and one group join entry per client). Copycat can compact segments of the log much more quickly than it can create them, so using a small session timeout and small segment sizes can typically keep the footprint pretty small while keeping the group sufficiently reactive to changes in the cluster.
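A rough back-of-the-envelope model (the entry size and client count are illustrative assumptions, not measured values) shows why a keep-alive-dominated log stays small after compaction:

```java
// Sketch of the footprint arithmetic: when a 32mb segment rolls over,
// compaction discards all keep-alive entries and retains only one session
// register entry and one group join entry per client.
public class LogFootprintSketch {
  public static void main(String[] args) {
    long segmentBytes = 32L * 1024 * 1024; // configured segment size
    int entryBytes = 64;                   // assumed size of one log entry
    int clients = 10;                      // assumed number of group clients

    // Keep-alives at 2x the session timeout rate = one every half timeout.
    long sessionTimeoutMillis = 5000;
    long keepAliveIntervalMillis = sessionTimeoutMillis / 2;

    // After rollover, the old segment compacts down to the live entries only.
    long retainedEntries = 2L * clients;   // one register + one join per client
    long retainedBytes = retainedEntries * entryBytes;

    System.out.println("keep-alive interval: " + keepAliveIntervalMillis + " ms");
    System.out.println("retained after compaction: " + retainedBytes
        + " of " + segmentBytes + " bytes");
  }
}
```

Under those assumptions, a 32mb segment compacts down to a few kilobytes of live entries, which is why small segments plus short timeouts keep the disk footprint bounded.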
The upper bound on memory utilization is just the upper bound of the smallest replica's memory.
With all those questions answered, I should say users have also been really successful creating their own state machines for Atomix. ONOS and other projects rely heavily on custom Copycat/Atomix state machines to get exactly the behavior they want.