These are chat archives for atomix/atomix

28th
Jul 2018
Jordan Halterman
@kuujo
Jul 28 2018 00:01
@cow12331 what features do you need from a cache-like object? What constitutes “idle”? It would be straightforward to support things like maximumSize, expireAfterWrite, removalListener, etc, but as I mentioned expireAfterAccess would be extremely costly. It would require a write on every read to replicate the last access time. Maybe the best way to do this would be using the anti-entropy protocol, in which case the cache is always local and eventually consistent, and so extremely efficient.
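For reference, those option names mirror what local cache builders such as Caffeine expose; a minimal local-only sketch of the options under discussion (this is plain Caffeine, not an Atomix primitive, and the values are arbitrary):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.time.Duration;

class CacheOptionsSketch {
  public static void main(String[] args) {
    // Local cache with the "cheap" options: bounded size, expiry after write,
    // and a removal listener. expireAfterAccess exists here too, but replicating
    // the last-access time is what makes it expensive in a distributed primitive.
    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(10))
        .removalListener((String key, String value, RemovalCause cause) ->
            System.out.println("removed " + key + " because " + cause))
        .build();
    cache.put("a", "1");
  }
}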
Junbo Ruan
@aruanruan
Jul 28 2018 00:23
@kuujo I need your help: if I need to write a new primitive service, but it depends on OS resources like another data directory or another RESTful interface, how can I configure it?
Jordan Halterman
@kuujo
Jul 28 2018 00:24
are you referring to an actual implementation of the PrimitiveService interface?
@aruanruan
Junbo Ruan
@aruanruan
Jul 28 2018 00:28
yes
@kuujo yes, I need to implement a new PrimitiveService for my particular use case
Jordan Halterman
@kuujo
Jul 28 2018 00:30
Okay...
Junbo Ruan
@aruanruan
Jul 28 2018 00:31
@kuujo another question: how can I detect when the Raft leader changes?
Jordan Halterman
@kuujo
Jul 28 2018 00:44

So, what you’re really referring to in Raft terms (which also applies to the primary-backup protocol) is a persistent state machine. This is actually a really challenging problem to solve correctly, which is why Atomix doesn’t really do anything to support them currently.

The problem is, the state machine is populated from a history of operations written and replicated in the Raft log, and operations are applied on several nodes. So, if you’re calling a REST API then each node will make the same call to the same REST API n times for an n-node partition. Additionally, when a node crashes and restarts, the history in the Raft log is replayed. What that will mean is you’re repeating old calls to the file system or REST API.

This is even compounded when you introduce the problem of reconfiguring Raft partitions. Typically, when a Raft partition is reconfigured to add a new node, we send the latest snapshot and the rest of the logs to the new node. But in a persistent state machine, the snapshot is the persistent state - e.g. the files you’re writing on the file system or the state behind the REST API - so that is what needs to be sent to the new node instead of a state machine snapshot, and Atomix doesn’t currently provide an API for state machines to provide their own snapshots.

Some of this is avoidable. Raft and the primary-backup protocol provide monotonically increasing operation indexes. Those indexes can be used to deduplicate calls to an external REST API. But the reconfiguration of partitions is the biggest problem for services that access the file system.

You can probably make it work for a REST API but not file system... at least not correctly.
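For illustration, a minimal sketch of that deduplication idea, independent of the actual Atomix service API; the operationIndex parameter stands in for whatever monotonic index the replication protocol hands the state machine, and the endpoint and class names are made up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Applies an external side effect at most once per replicated operation.
// operationIndex is the monotonically increasing index assigned by the
// protocol (Raft log index or primary-backup operation index).
class ExternalCallDeduplicator {
  private final HttpClient client = HttpClient.newHttpClient();
  private long lastAppliedIndex = -1; // must itself be included in snapshots/restores

  void apply(long operationIndex, String payload) throws Exception {
    if (operationIndex <= lastAppliedIndex) {
      return; // log replay or duplicate delivery: skip the external call
    }
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://example.com/api/things"))          // hypothetical endpoint
        .header("Idempotency-Key", Long.toString(operationIndex))  // lets the server dedup too
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
    client.send(request, HttpResponse.BodyHandlers.ofString());
    lastAppliedIndex = operationIndex;
  }
}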

But Atomix is not designed to be used in this way, which is why support for persistent state machines has not been built into it. Instead, users should look to use primitives to build higher-level replication protocols. If you need to coordinate cluster-wide access to a shared resource, use a lock or leader election and use the ClusterCommunicationService to send operations to the leader. In ONOS we use a LeaderElector to elect multiple leaders - one per switch - to control switches and proxy operations through them.

You can access the Raft partition info via the partition group
atomix.getPartitionService().getPartitionGroup("raft").getPartition(1).primary()
There aren’t currently event listeners on that interface but we should probably add them
Jordan Halterman
@kuujo
Jul 28 2018 00:51
Users have always wanted to build persistent Raft state machines, but we’ve always resisted it because IMO you get more flexibility and the ability to better manage the cluster by coordinating such resources using primitives. For example, in ONOS we can balance leaders across the cluster using a LeaderElector, and something like that is not easy in Raft because it’s designed to elect the leader with the most up-to-date information, not the leader in the best position. I’ve never been a fan of using Raft to replicate e.g. persistent databases. In fact, I found a horrible implementation of this type of architecture quite recently...
Here it is
Junbo Ruan
@aruanruan
Jul 28 2018 01:01
@kuujo thanks. In one of my scenarios, some business processes generate a lot of locks (or other resources) with random names, and I need to clear them periodically by creating a special client on the "LEADER" node.
Jordan Halterman
@kuujo
Jul 28 2018 01:15
You should definitely use a LeaderElection for that. Don’t rely on the Raft leader for anything. It’s only exposed for informational purposes.
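A minimal sketch of that pattern, assuming the Atomix 3.x LeaderElection primitive (builder and accessor names are from memory and may differ slightly between versions, and the primitive name is made up); only the node that currently holds leadership runs the periodic cleanup:

import io.atomix.cluster.MemberId;
import io.atomix.core.Atomix;
import io.atomix.core.election.Leader;
import io.atomix.core.election.LeaderElection;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class StaleResourceCleaner {
  // Schedules a cleanup task that is a no-op on every node except the elected leader.
  static void start(Atomix atomix, Runnable cleanUpStaleResources) {
    MemberId localId = atomix.getMembershipService().getLocalMember().id();
    LeaderElection<MemberId> election =
        atomix.<MemberId>leaderElectionBuilder("stale-resource-cleaner").build();
    election.run(localId); // enter the election as a candidate

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      Leader<MemberId> leader = election.getLeadership().leader();
      if (leader != null && localId.equals(leader.id())) {
        cleanUpStaleResources.run(); // application-specific cleanup of randomly named locks etc.
      }
    }, 1, 1, TimeUnit.MINUTES);
  }
}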
Junbo Ruan
@aruanruan
Jul 28 2018 01:16
@kuujo ok
Junbo Ruan
@aruanruan
Jul 28 2018 01:27
Last year, we used a 3-node ZooKeeper cluster in one project. One horrible day, I lost 2 nodes because the VMs were lost, and we could not back up the data and rebuild the cluster. So I think if I save some important data in LevelDB or RocksDB synchronously, maybe I can rebuild the cluster very quickly by just copying the RocksDB files to a new cluster.
Junbo Ruan
@aruanruan
Jul 28 2018 01:49
By the way, https://github.com/pingcap/tidb is a very successful open-source distributed database that is compatible with MySQL and based on Raft.
Jordan Halterman
@kuujo
Jul 28 2018 02:12
@aruanruan that seems to be a distributed database that’s compatible with the MySQL protocol. That’s different from replicating a MySQL database using Raft, which is what I think is misguided. It’s not impossible, but requires a lot more than any existing implementations have done.
The problem is you can’t just use Raft leaders to accept writes (like the project I linked above) because multiple leaders can and will exist. Writes have to go through the Raft implementation to get its strong consistency semantics. Also, writes to the MySQL database itself need to use an insert-or-update that uses a Raft log index column to avoid duplicate writes on log replays.
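A minimal sketch of that insert-or-update idea against MySQL via JDBC, assuming a made-up kv table with a raft_index column; a replayed log entry becomes a no-op because its index is not greater than the stored one:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class IdempotentMySqlWriter {
  // Applies a replicated write at most once per Raft log index.
  static void applyWrite(Connection conn, String key, String value, long raftIndex) throws SQLException {
    String sql =
        "INSERT INTO kv (k, v, raft_index) VALUES (?, ?, ?) "
            + "ON DUPLICATE KEY UPDATE "
            + "v = IF(VALUES(raft_index) > raft_index, VALUES(v), v), "
            + "raft_index = GREATEST(raft_index, VALUES(raft_index))";
    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, key);
      stmt.setString(2, value);
      stmt.setLong(3, raftIndex);
      stmt.executeUpdate();
    }
  }
}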
Jordan Halterman
@kuujo
Jul 28 2018 02:21
I guess you could try to use the Raft leader terms and do an insert-or-update including a term column, but you’d then get reads of uncommitted data
Actually you’d get lost writes
Junbo Ruan
@aruanruan
Jul 28 2018 02:33
I agree
Johno Crawford
@johnou
Jul 28 2018 09:58
@aruanruan did you get a chance to try out the cleaner branch?
Junbo Ruan
@aruanruan
Jul 28 2018 10:00
@kuujo yes, I am using the rc5 version for testing...
Johno Crawford
@johnou
Jul 28 2018 10:01
Ah it's not merged
Junbo Ruan
@aruanruan
Jul 28 2018 10:02
I am testing returning large amounts of data, to see whether the cluster will shut down or not
Johno Crawford
@johnou
Jul 28 2018 10:03
Junbo Ruan
@aruanruan
Jul 28 2018 10:06
I changed it to use the Utils in Atomix. It still does not work on Windows 10 within the first few minutes, but after a long time has passed, it is OK.
I think the Java VM releases the mapped buffers during a full GC, so after that we can delete the segment files.
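For reference, deletion fails on Windows because a MappedByteBuffer keeps the segment file mapped (and locked) until its cleaner runs, which often only happens at a full GC; a commonly used workaround is to invoke the cleaner explicitly through the internal Unsafe API (JDK 9+, unsupported API, shown here only as a sketch of the idea, not as Atomix's actual utility code):

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

class MappedBufferUnmapper {
  // Releases the mapping immediately so the underlying file can be deleted
  // without waiting for a full GC. Relies on sun.misc.Unsafe.invokeCleaner.
  static void unmap(MappedByteBuffer buffer) throws Exception {
    Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
    Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
    theUnsafe.setAccessible(true);
    Object unsafe = theUnsafe.get(null);
    unsafeClass.getMethod("invokeCleaner", ByteBuffer.class).invoke(unsafe, buffer);
  }
}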
Junbo Ruan
@aruanruan
Jul 28 2018 10:37
I will test that in detail