These are chat archives for atomix/atomix

8th
Mar 2017
Richard Pijnenburg
@electrical
Mar 08 2017 00:33
@himanshug Im not super familiar with atomix. but when sessions timeout they get expired from the server and the client will have re-init the connection and rejoin the groups and any resources. I'm not sure about the serializers. but this is all from memory so don't quote me on it ;-)
Jordan Halterman
@kuujo
Mar 08 2017 00:37
I'll answer today when I have a chance. I'm on vacation this week and only really answering emails/chat in the evenings.
Jordan Halterman
@kuujo
Mar 08 2017 06:59

@himanshug thanks for your questions! I know Druid well as my day job until last week was researching query optimization in big data systems (mostly Spark via Apache Calcite).

Out of curiosity, why are you guys looking to move off of ZooKeeper/Curator? Is it just because of the distaste for having that as an external dependency? I've noticed a lot of large open source projects moving this direction (Kafka is another one that's trying to remove the ZK dependency). Or is there some limitation in ZK/Curator itself? I obviously have lots of opinions on that.

Good questions. So, in Atomix when a Raft session is expired the client will typically register a new session. Atomix resources watch the state of the underlying CopycatClient for changes and invoke a recover method when a session is expired and a new session is opened. When that happens, the DistributedGroup implementation ensures any of the members the group that were added by the local process still exist in the group. There will actually be a brief period of inconsistency after the client loses its session, which is unavoidable. But group clients don't need to handle session recovery. Group membership listeners don't need to be re-registered since events for all changes in the group go to all open instances of the group.

There are some aspects of DistributedGroup that don't behave as well. The messaging feature in DistributedGroup is experimental and may not behave well when sessions expire. But work has been done to ensure that group membership itself remains useful across session expirations.

You never need to re-register serializers.

Worst case memory depends on the configuration of the Raft log. If you're only ever using one resource, you can actually forego the Atomix resource manager and save a lot of overhead. Each Atomix resource is just a Copycat state machine and can run independently in a Copycat cluster. Atomix is really just an API/manager for multiplexing state machines on a single Raft cluster (and in the future partitioning multiple state machines on multiple Raft clusters). If you only need groups, you can create a Copycat cluster with the GroupState state machine. Then, my setting the log to n bytes, it shouldn't really grow much larger than n*2 bytes on each server.

The biggest overhead for DistributedGroup is just managing sessions. That overhead can be controlled by setting the session timeout on the servers or when a CopycatClient is created. Typically, clients submit a keep-alive at a rate of 2x the session timeout. That could make it tempting to simply set a large session timeout, but that session timeout is also used to determine when a node crashed. So, it's a balance between reducing the overhead of sessions while ensuring changes in the group can be detected and handled quickly. Keep-alives are logged and replicated but will be immediately compacted once the log rolls over to a new segment. For a DistributedGroup, the log will contain mostly keep-alives, and if the segment size is 32mb then once the log grows to 32mb and rolls over to a new segment, the vast majority of that 32mb segment should be compacted (aside from one session register entry and one group join entry per client). Copycat can compact segments of the log much more quickly than it can create them, so using a small session timeout and small segment sizes can typically keep the footprint pretty small while keeping the group sufficiently reactive to changes in the cluster.

The upper bound on memory utilization is just the upper bound of the smallest replica's memory.

With all those questions answered, I should say users have also be really successful creating their own state machines for Atomix. ONOS and other projects rely heavily on custom Copycat/Atomix state machines to get exactly the behavior they want.

Himanshu
@himanshug
Mar 08 2017 15:05
@kuujo @electrical Thanks for taking time.
For curator/zookeeper, one reason is to remove an external dependency and simplify user's life by not having them to deploy yet another cluster of something and monitor it. but, also because of coding convenience, Druid used zk for many non-essential transient metadata which can be managed without zk also and that metadata puts alot of pressure on zk nodes, so we are also trying to move most of that state out of zk and let druid nodes talk to each other and manage that directly.
In the end, I believe only essential thing, we need zk for, is member discovery and leader elections. Those two are relatively lightweight even for zk. Main reason to explore atomix is raft, most of us have read raft paper and found the protocol easier to understand and believe we can better manage/understand raft cluster than zk.
so, you see, we are only gonna be using DistributedGroup for leader elections and members discovery and nothing else. However, we do have multiple groups so need to have multiplexing of state machines to manage individual group's state machines.
it sounds like, given the use case, it might be better for us to use copycat instead of atomix resources.
and reduce one layer of abstraction
Himanshu
@himanshug
Mar 08 2017 15:31
@kuujo also, once you're back to work next week. would you be available for a hangout? I have more questions and would like to get your opinion on how to set things up correctly for us.
i believe, it might also be possible to write a custom state machine that does all we need.