These are chat archives for atomix/atomix

9th
May 2016
v-arun
@v-arun
May 09 2016 00:53
Is there a performance benchmarking suite for Copycat with a simple in-memory application? How does it compare to ZooKeeper's consensus throughput?
Jordan Halterman
@kuujo
May 09 2016 01:07
I only have benchmarks on my laptop that I use to do comparisons between changes during development. @madjam has done more meaningful benchmarks than I have AFAIK and had good results, whatever that means. I expect that ZooKeeper, being a much more mature project with fewer abstractions, should be faster than Copycat/Atomix. There is overhead to flexibility for sure. Atomix multiplexes resources and sessions on a Copycat state machine, and there's some additional I/O overhead required to do that. But they should and indeed do (in my tests) scale similarly since Copycat and ZooKeeper are architected very similarly (albeit with different consensus algorithms). That is, both enforce FIFO consistency for client operations and linearizable writes, support reads from followers, and support session events. There's also likely more overhead in Copycat/Atomix session events since they're fault tolerant, whereas ZooKeeper events can be lost when a client switches servers. This requires additional memory on servers and acks from clients. But I do expect Copycat/Atomix can get much better performance for linearizable reads than ZooKeeper can, since they're not exactly built in to ZooKeeper but can be achieved through sync/read. Linearizable reads with a leader lease perform much better still since no communication is required (though that does rely on a bounded clock drift). Copycat/Atomix also compact logs using an incremental compaction algorithm that's different from the one in ZooKeeper and most other consensus-based systems. Currently, I don't expect that to have any impact on performance differences, but in the long run it should result in more consistent performance as the projects mature (we haven't implemented e.g. throttling for log compaction yet).
v-arun
@v-arun
May 09 2016 03:07

Thanks for the quick and detailed reply. Most everything you said makes sense to me (except the bounded clock drift part as I think you don't really need that assumption to implement leader leases).

Can I get @madjam's benchmarking code? Or could you give me a rough idea of Copycat's single-node or 3-node throughput with say 1KB requests, where each node is a single-core machine (assuming that more cores will roughly linearly scale throughput until serialization or the NIC becomes a bottleneck)? ZooKeeper can easily get over 10K/s. We have benchmarked Raft's LogCabin implementation before and barely gotten 2K/s.

v-arun
@v-arun
May 09 2016 03:23

Another question: is there a formal definition of "fault-tolerant sessions" in Atomix that I can read about or look at code somewhere? Are you basically caching requestIDs on the server to prevent duplicate executions and simply having the client retransmit uncommitted requests upon a connection failure or a switch to a different server?

(Throttling for log compaction is a good idea. ZooKeeper runs out of storage in minutes when pushed to capacity but doesn't allow reducing its log garbage collection interval to less than an hour.)

Madan Jampani
@madjam
May 09 2016 03:42
@v-arun with Atomix's super simple primitives you can write this benchmarking code in fewer than 150 lines. There are a couple of primitives you could use for this test: DistributedLong and DistributedValue. The reason I suggest these two is because (a) they are simple and (b) they cover all the bases with regard to the log compaction that @kuujo alluded to. From my experience running on a 3 node bare-metal cluster with a FILE based log and 3 clients (co-located on the same 3 hosts) submitting operations, I could hit 13-14k operations per second. In my test each client calls incrementAndGet on a DistributedLong. Now if I use a MEMORY based log I can easily push the system throughput to ~45K operations per second. While not many people might use the system with a MEMORY based log in production, the reason for testing that configuration is to identify hot spots elsewhere in the system. So one conclusion is that as the storage layer gets faster the system can get faster as well.
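For reference, a minimal closed-loop driver along the lines described above might look like the sketch below. It is only a sketch: the Atomix-specific call (obtaining a DistributedLong from a client and invoking incrementAndGet) is assumed from the description and left as a pluggable placeholder rather than real API usage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/**
 * Closed-loop throughput driver in the spirit of the test described above: a client
 * submits one operation at a time and issues the next as soon as the previous one
 * completes. The Atomix-specific operation (e.g. DistributedLong#incrementAndGet)
 * is assumed from the description and plugged in as a supplier of futures.
 */
public class ThroughputBench {

  public static double opsPerSecond(Supplier<CompletableFuture<?>> operation, long durationMillis) {
    long deadline = System.currentTimeMillis() + durationMillis;
    long completed = 0;
    while (System.currentTimeMillis() < deadline) {
      operation.get().join(); // wait for the operation to complete before issuing the next
      completed++;
    }
    return completed / (durationMillis / 1000.0);
  }

  public static void main(String[] args) {
    // Placeholder operation; swap in the real resource call, e.g. () -> counter.incrementAndGet().
    Supplier<CompletableFuture<?>> operation = () -> CompletableFuture.completedFuture(0L);
    System.out.printf("%.0f ops/sec%n", opsPerSecond(operation, 10_000));
  }
}
```

Running one copy per client host and summing the per-client rates roughly reproduces the 3-client setup described above.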
Jordan Halterman
@kuujo
May 09 2016 03:43
Sure, leader leases assume bounded clock drift: "To use clocks instead of messages for read-only queries, the leader would use the normal heartbeat mechanism to maintain a lease. Once the leader’s heartbeats were acknowledged by a majority of the cluster, it would extends its lease to start + election timeout / clock drift bound since the followers shouldn’t time out before then."
On a side note, found a typo in the dissertation there :-)
it would extends
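For concreteness, the quoted formula amounts to the arithmetic below (a sketch with illustrative names, not taken from the Copycat code base).

```java
/** Lease expiry per the formula quoted above (illustrative names, not Copycat's). */
final class LeaderLease {
  /**
   * Once a heartbeat sent at `startMillis` is acknowledged by a majority, the leader
   * may serve reads locally until startMillis + electionTimeout / clockDriftBound.
   */
  static long expiryMillis(long startMillis, long electionTimeoutMillis, double clockDriftBound) {
    return startMillis + (long) (electionTimeoutMillis / clockDriftBound);
  }
  // e.g. expiryMillis(0, 1000, 1.1) == 909: followers shouldn't time out before then,
  // assuming their clocks drift by at most the bound.
}
```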
Madan Jampani
@madjam
May 09 2016 03:47
@kuujo - getting back here after a while. Been away with some travel and stuff. Lot to catch up on :)
v-arun
@v-arun
May 09 2016 03:59

@madjam thanks, that's helpful. We'll check those out.

@kuujo thanks for the pointer. I just meant that there are ways to implement leases without relying on synchrony, but I understand what you are doing now.

Jordan Halterman
@kuujo
May 09 2016 04:03
gotcha
@madjam awesome, nice to see you! I’ve been wondering where you were :-)
@madjam’s numbers sound similar to mine… for sure the storage layer is where performance improvements will be made for a long time. I think there’s a ton of work to do there
Jordan Halterman
@kuujo
May 09 2016 07:24

@v-arun sorry I missed one of your questions. All the algorithms Copycat/Atomix uses are described on the website (http://atomix.io/copycat/docs/session-events/). I’ve been working on a spec in TLA+, but that sort of went by the wayside a while back.

The gist of how sessions in Copycat work is basically like you said. Clients attach a sequence number to each request. If a request times out on the client, it resends it. When a command from a client is applied to the state machine, the command’s output is cached in the state machine by its sequence number. When clients send keep-alives, they send the highest sequence number for which they’ve received a response. When a server applies a keep-alive for a client, the server clears all cached outputs up to that point. This caching is done on all servers so that if a client switches servers and resubmits a command it’s only applied once.
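A hedged sketch of that per-session bookkeeping, with made-up names (the real logic lives in ServerSessionContext and is more involved):

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Function;

/**
 * Illustrative per-session command de-duplication as described above; not the actual
 * Copycat ServerSessionContext, and the names here are hypothetical.
 */
class SessionCommandCache {
  // Outputs of applied commands, keyed by the client-assigned sequence number.
  private final ConcurrentSkipListMap<Long, Object> outputs = new ConcurrentSkipListMap<>();

  /** Apply a command once; a resubmitted sequence number returns the cached output instead. */
  Object apply(long sequence, Function<Long, Object> applyToStateMachine) {
    return outputs.computeIfAbsent(sequence, applyToStateMachine);
  }

  /** A keep-alive carries the highest sequence the client has received; drop outputs up to it. */
  void onKeepAlive(long highestSequenceReceived) {
    outputs.headMap(highestSequenceReceived, true).clear();
  }
}
```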

Session events work in a similar way. All state machines cache session events (published by the state machine to a specific session) with the index at which the event was published. The server to which a client is connected actually sends the event to the client. It’s the responsibility of the client to ensure events are received in order. As with commands, clients send the highest index for which events have been received in their keep-alive requests. If a client switches servers, the new server will send any events the client missed.
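And the event side, sketched the same way (hypothetical names; the actual implementation is in the server session code):

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Illustrative per-session event bookkeeping as described above; names are hypothetical.
 * Events are cached by the log index at which they were published.
 */
class SessionEventCache {
  private final ConcurrentSkipListMap<Long, Object> pending = new ConcurrentSkipListMap<>();

  /** Called when the state machine publishes an event to this session. */
  void publish(long index, Object event) {
    pending.put(index, event);
  }

  /** When a client switches servers, the new server resends everything past its last acked index. */
  Collection<Object> eventsSince(long lastReceivedIndex) {
    return pending.tailMap(lastReceivedIndex, false).values();
  }

  /** Keep-alives carry the highest event index received; acknowledged events can be discarded. */
  void onKeepAlive(long highestIndexReceived) {
    pending.headMap(highestIndexReceived, true).clear();
  }
}
```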

All of those algorithms are described in the architecture documentation. Most of this state is tracked in the ServerSessionContext class if you’re interested in reading code.

There are some other complexities in sessions as well. For example, like ZooKeeper, Copycat ensures FIFO order for all client operations. To do so, the sequence number is used by the leader to ensure operations are written to the Raft log in the order in which they were specified by the client, regardless of the path they take to the leader. Queries evaluated on followers use the sequence number and potentially an index provided by the client to ensure state does not go back in time, etc.
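A sketch of how that leader-side sequencing for a single session might look (again with hypothetical names, not Copycat's actual classes):

```java
import java.util.TreeMap;

/**
 * Illustrative leader-side sequencing of one client's commands, per the description above:
 * commands may arrive out of order over different connections, but are appended to the
 * Raft log strictly in client sequence order. Hypothetical names only.
 */
class SessionSequencer {
  private long nextSequence = 1;                        // next sequence expected from this session
  private final TreeMap<Long, Runnable> pending = new TreeMap<>();

  /** Buffer the append, then drain every command that is now in order. */
  synchronized void submit(long sequence, Runnable appendToLog) {
    pending.put(sequence, appendToLog);
    while (pending.containsKey(nextSequence)) {
      pending.remove(nextSequence).run();
      nextSequence++;
    }
  }
}
```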