These are chat archives for atomix/atomix

17th Dec 2015
Richard Pijnenburg
@electrical
Dec 17 2015 09:53
@kuujo @jhalterman thank you very much for the detailed info! I'm looking forward to experimenting more with it. For some background: I work at Elastic and am looking to see how we can use Atomix with Logstash to build a clustered version of it.
Jordan Halterman
@kuujo
Dec 17 2015 16:00
@electrical that's awesome to hear! We're seeing a ton of interest from a lot of awesome projects, and that's exactly why all effort is focused on stability for the foreseeable future. The status of Copycat/Atomix is this: over the last six months, other contributors and I have been putting a ton of effort into getting the algorithms correct. Copycat implements every part of the Raft algorithm and more (I have a partially written paper on the "more"). I spent a lot of time with @madjam evaluating in particular the log compaction, configuration change, and session event algorithms, and we're confident they're correct (you can see some of these discussions in old GH issues). Atomix has been tested in Jepsen for linearizability via the DistributedValue resource (doing cas operations). @jhalterman is currently working on integrating configuration changes into the Jepsen tests (arbitrarily adding and removing nodes while writing to the cluster and compacting the logs).

So, that's all to say: up until about a week ago, when I finished support for snapshots, all the effort went into algorithms, features, and handling various fault tolerance issues. But snapshots represented the end of that work, at least in Copycat. All we're doing now is testing Copycat/Atomix as much as possible, and it has been getting, and I expect will continue to get, much more stable every day. We'll push a release candidate once Jepsen testing with configuration changes is complete, and a full release if there are no issues with the release candidate. But Atomix will continue to be informed by use cases. Copycat/Atomix makes it incredibly easy to add new tools like a work queue.
I'm finishing up the work queue now BTW... I fell asleep last night :-P
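For reference, the Jepsen linearizability test mentioned above essentially drives compare-and-set operations against a DistributedValue. A minimal sketch of that kind of operation, assuming the 0.1-era client API (the builder, open(), and create() calls reflect my reading of that API and may differ; the addresses are placeholders):

```java
import io.atomix.Atomix;
import io.atomix.AtomixClient;
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.NettyTransport;
import io.atomix.variables.DistributedValue;

import java.util.Arrays;

public class CasTest {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    // Connect a client to an existing cluster (addresses are placeholders).
    Atomix atomix = AtomixClient.builder(
            Arrays.asList(new Address("10.0.0.1", 8700), new Address("10.0.0.2", 8700)))
        .withTransport(new NettyTransport())
        .build();
    atomix.open().join();

    // Get or create the replicated value, then do a linearizable cas.
    DistributedValue<String> value =
        atomix.create("jepsen-value", DistributedValue.class).join();
    value.set("a").join();
    boolean swapped = value.compareAndSet("a", "b").join();
    System.out.println("cas succeeded: " + swapped);

    atomix.close().join();
  }
}
```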
Richard Pijnenburg
@electrical
Dec 17 2015 16:07
@kuujo wow thank you so much for the details :-) and no worries, we all need sleep :p
Jordan Halterman
@kuujo
Dec 17 2015 16:08
Also, distributed logstash <3
Richard Pijnenburg
@electrical
Dec 17 2015 16:08
hehe yeah. It's just something I'm trying out to see if it's possible with Atomix. The actual team is going down another road.
And it's also my first time messing around with Java (mainly a Ruby programmer)
so I might have a lot of questions down the road :-)
Jordan Halterman
@kuujo
Dec 17 2015 16:14
I'm an open book. I always take plenty of time to answer questions (sometimes too much :-P), so I'm glad to help even if asynchronous cross-world communication is a little slow
Richard Pijnenburg
@electrical
Dec 17 2015 16:14
@kuujo btw, that DistributedWorkQueue idea for distributing the workload, how performant do you think it could be?
The idea I have is to have 3 different roles that people can set (input, filter, output) per machine, and depending on the role a node needs to transmit its data to other nodes.
And another question: I assume it would make sense to have a separate thread for all the Atomix stuff to happen in, separate from any other worker threads, so it doesn't lock up cluster management?
Jordan Halterman
@kuujo
Dec 17 2015 16:24
Good question. The need for a DistributedWorkQueue in Atomix is to have something that will push work to clients rather than require them to poll, and to guarantee items are not lost if a node crashes. So, what the work queue state machine does is: when a task is committed and applied to the state machine, it will push the task to a worker (client) if there's one available. Clients/replicas acknowledge a task and get the next task in a single commit. So in that sense it will be much faster than polling a queue, but two commits are still required to process one task (put the task on the queue, acknowledge the task). We could probably make a much more efficient version that acknowledges tasks in batches, though.

As for threading, in AtomixReplica the user always receives messages in a different thread than the server, and always in the same thread. The reason for using the same thread is that session events can be linearizable, so handling a message on the client has to be synchronous to maintain linearizability. But we could make a work queue that allows a single node to process a lot of tasks concurrently and ack them in batches, since the task queue could be asynchronous anyway... if that makes sense. I guess there are several ways to go about it, but the scalability limitations of majorities in Raft will be there at least for writes to the queue.
Ultimately, the next version of Atomix will have sharding by running multiple Raft instances to get around that limitation.
Copycat does allow batches of messages to be pushed to a client. But it all depends on use cases. Does work need to be processed in order? Can the work itself be partitioned to enforce order only within a partition (send one partition to one worker)? Can a single worker process many tasks concurrently? Does the process submitting the work need to know when it's done? Copycat can facilitate all of that, but there are a lot of potential options for different use cases.
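To make the two-commit pattern concrete, here's an illustrative plain-Java sketch of the state machine logic described above. This is not Copycat's actual StateMachine API, and the Session interface is a stand-in for a Copycat session that can push events to a connected client:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only: tasks are pushed to an idle worker on commit,
// and a worker's ack and its next task share a single commit.
class WorkQueueStateMachine<T> {
  // Stand-in for a Copycat session that can push events to its client.
  interface Session<T> {
    void publish(T task);
  }

  private final Queue<T> tasks = new ArrayDeque<>();
  private final Queue<Session<T>> idleWorkers = new ArrayDeque<>();

  // Applied when a "submit" operation is committed (commit #1 for a task):
  // push the task straight to an idle worker if one is available.
  void submit(T task) {
    Session<T> worker = idleWorkers.poll();
    if (worker != null) {
      worker.publish(task);
    } else {
      tasks.add(task);
    }
  }

  // Applied when an "ack" operation is committed (commit #2): acknowledge the
  // finished task and hand the worker its next task in the same commit.
  void ack(Session<T> worker) {
    T next = tasks.poll();
    if (next != null) {
      worker.publish(next);
    } else {
      idleWorkers.add(worker);
    }
  }
}
```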
Jordan Halterman
@kuujo
Dec 17 2015 16:33
Ultimately, though, comparatively speaking, all Atomix resources (or the algorithms that back them) are designed to ensure that everything happens eventually, and that everything happens at a specific point in logical “time,” but not necessarily that anything happens fast or efficiently. That's why they're best for things like leader elections or, in the case of a work queue, submitting work that is infrequent but important.
Richard Pijnenburg
@electrical
Dec 17 2015 16:43
Hmm, good points. What I want to avoid is needing another system to do queuing; I want to make sure there is one system that controls everything
Jordan Halterman
@kuujo
Dec 17 2015 16:58
Gotcha. I think the intended goal of Atomix is to provide the tools to build something like a higher-throughput work queue. In fact, I'm currently building a Kafka clone purely from Atomix resources. But as Kafka does not require a quorum to commit a write, doing something like this still requires a lot of work on top of Atomix. In the case of my Kafka clone, that means using a DistributedMembershipGroup to do cluster management, allow clients to discover servers, and allow a leader to communicate with followers; a DistributedLeaderElection to coordinate and assign topics and partitions; and a DistributedMessageBus to send messages from clients to servers and synchronously replicate between servers according to the topic's replication factor. I expect in the future that use cases will inform the development of more complex Atomix resources to make things like that easier, but for now Atomix resources will just remain low-level enough to allow people to build things that we can perhaps turn into resources in the future.
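As a rough sketch of wiring those three resources together: the resource class names come from the message above, but the packages, the builder, the create() accessor, and the join()/onElection()/open() signatures are my assumptions about the 0.1-era API and may differ:

```java
import io.atomix.Atomix;
import io.atomix.AtomixClient;
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.NettyTransport;
import io.atomix.coordination.DistributedLeaderElection;
import io.atomix.coordination.DistributedMembershipGroup;
import io.atomix.messaging.DistributedMessageBus;

import java.util.Arrays;

public class KafkaCloneBroker {
  public static void main(String[] args) {
    Atomix atomix = AtomixClient.builder(
            Arrays.asList(new Address("10.0.0.1", 8700)))
        .withTransport(new NettyTransport())
        .build();
    atomix.open().join();

    // Cluster management: join the broker group so clients can discover this
    // node and the controller can see which brokers are alive.
    DistributedMembershipGroup brokers =
        atomix.create("brokers", DistributedMembershipGroup.class).join();
    brokers.join().join();

    // Coordination: elect a controller to assign topics and partitions.
    DistributedLeaderElection election =
        atomix.create("controller", DistributedLeaderElection.class).join();
    election.onElection(epoch -> {
      // This node is now the controller; assign partitions to group members.
    }).join();

    // Data path: the message bus carries the actual records between clients
    // and brokers, so they never pass through the Raft log.
    DistributedMessageBus bus =
        atomix.create("bus", DistributedMessageBus.class).join();
    bus.open(new Address("10.0.0.1", 6000)).join();
  }
}
```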
Richard Pijnenburg
@electrical
Dec 17 2015 17:01
completely makes sense yeah :-)
For Logstash, the big important thing is throughput.
And if a clustered solution is slower than a standalone thing with Redis, for example, it's gonna be hard to sell.
Another downside for me is that I run it together with JRuby. Gonna be fun.
Richard Pijnenburg
@electrical
Dec 17 2015 17:08
@kuujo what I could do is write a resource that reads and writes to Redis, right?
just as a PoC
in case internal communication is too slow
but I can always see how fast it is :-)
perhaps by setting up a small cluster and having one client submitting data and another one reading it
and personally I'd rather do it via Raft than what the guys are exploring now, which is 'config sharing' via Elasticsearch itself ;p
I'm pretty sure this system is more suited for it than building a custom thing
Jordan Halterman
@kuujo
Dec 17 2015 17:25
I would say this: in general, the algorithms underlying Atomix are designed to facilitate a large number of resources and to ensure everything happens exactly as it's intended and when it's intended, even across resources. But for high-throughput systems, those resources probably have to be used to coordinate other communication mechanisms. For example, you can use a DistributedMembershipGroup to track the set of nodes that are alive; a node will be removed from the group if it crashes. Then, on each node, use the DistributedMessageBus to receive messages on that node. Clients would use the same DistributedMembershipGroup to determine where to send messages, and individual messages can be acknowledged by periodically updating a DistributedValue or DistributedLong. This, I think, could be one of the common patterns that's turned into an Atomix resource to allow for faster reliable messaging. But the problem with high throughput inside of the Atomix cluster - in a state machine - is that everything is limited by memory right now, so messages can't back up too far. If the client is reading from a persistent store and can replay messages, then I guess that's not an issue... we'd just have to implement some sort of back pressure mechanism. I guess it wouldn't be too hard to create a resource that does that, actually - faster reliable messaging from a persistent store - if the process pushing messages is reading from disk or something.
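A minimal sketch of the acknowledgement half of that pattern, assuming the data path is a direct message-bus connection and that DistributedLong exposes a set() that commits through Raft (the create()/set() signatures are my assumptions about the 0.1-era API):

```java
import io.atomix.Atomix;
import io.atomix.variables.DistributedLong;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BatchAcker {
  // `atomix` is an already-open client, connected as in the earlier sketches.
  static void start(Atomix atomix) {
    // Replicated counter holding this consumer's checkpoint; only the
    // checkpoint goes through Raft, never the messages themselves.
    DistributedLong offset =
        atomix.create("consumer-offset", DistributedLong.class).join();

    // Bumped locally for every message received over the message bus.
    AtomicLong processed = new AtomicLong();

    // Acknowledge in batches: periodically persist the high-water mark so a
    // restarted consumer can resume without acking every message one by one.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () -> offset.set(processed.get()).join(),
        1, 1, TimeUnit.SECONDS);
  }
}
```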
But I'm digressing a bit :-P
But I definitely get where you're coming from. The problem is that in Raft everything still has to go through a single node, so it will most likely be slower than a non-distributed system. You're trading performance for reliability. But that reliability can be used to coordinate a system that doesn't require writes to go through a single node.
Richard Pijnenburg
@electrical
Dec 17 2015 17:40
@kuujo ah I see, okay. I could always set up a test and see how fast it actually is. Not sure if you have time for something like that at some point? Or @jhalterman maybe.
Jordan Halterman
@kuujo
Dec 17 2015 17:51
I am planning on that this weekend. My work is donating some EC2 instances for R&D :-) but maybe I'll try to do it sooner
But @jhalterman is the one that works at a cloud provider ;-)
Jonathan Halterman
@jhalterman
Dec 17 2015 17:52
I know - I can't believe we won't have that public cloud anymore to use for stuff like this.
Jordan Halterman
@kuujo
Dec 17 2015 17:56
But really, I think what's ultimately needed here is sharding. Copycat's abstractions are intentionally designed to make sharding relatively easy to add, but it hasn't been done because stability is the more essential short-term goal, and there are some questions to be answered about how resources should be partitioned and balanced in Atomix.
I'll finish up the work queue and do some performance tests there. I'm pretty interested myself
Jordan Halterman
@kuujo
Dec 17 2015 23:58
Here's how Atomix and any consensus-based system should be used for a high-throughput system: Atomix should be used to ensure that messages are sent to the correct places at the correct times, but not necessarily to handle the actual sending of messages. For instance, a leader election might be used to elect a leader to receive messages for several partitions, but the messages themselves don't go through Raft. What Atomix will guarantee is: if a leader crashes, a single new leader will be elected, no two leaders will exist simultaneously, and the leader can be discovered by a client. That may not sound like much, but it's essential to ensuring that the processes that do receive messages are fault tolerant (when one leader crashes a new one is elected) and messages are always sent to the correct process (so they're not lost by sending them to the wrong one). The DistributedMessageBus is unreliable insofar as it's the sender's and receiver's responsibility to coordinate to ensure messages aren't lost. It would actually be fairly easy to provide some reliability in the sense that messages will be automatically resent if lost, and it could even be feasible to do in-memory replication of messages by coordinating the replication through Atomix (DistributedMessageBus basically uses something like a membership group internally to coordinate messages already), but that wouldn't really make much sense to me since it's in-memory.
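As a minimal sketch of that election pattern, assuming the 0.1-era onElection callback (the resource name and the partition-handoff logic are just illustrative):

```java
import io.atomix.Atomix;
import io.atomix.coordination.DistributedLeaderElection;

public class PartitionLeader {
  // `atomix` is an already-open client, connected as in the earlier sketches.
  static void start(Atomix atomix) {
    DistributedLeaderElection election =
        atomix.create("partition-1-leader", DistributedLeaderElection.class).join();

    // Raft guarantees at most one live leader per epoch, and clients can
    // discover the current leader; the messages the leader receives for its
    // partitions never pass through the Raft log.
    election.onElection(epoch -> {
      System.out.println("elected leader for epoch " + epoch);
      // Start accepting messages for the partitions this node now owns.
    }).join();
  }
}
```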