These are chat archives for atomix/atomix

23rd May 2018
Jianwei Mao
@MaoJianwei
May 23 2018 01:38
I found Atomix 2.1.0-SNAPSHOT is not so stable now. I start 5 nodes concurrently, and it takes 1 or 2 minutes to finish atomix.start().join(). Then one of them puts something into a map, and the others get the map via map.toString() periodically. Now if I shut down one getting node, all four other nodes fail to put & get, with an exception thrown. My demo code is below. Did you get this problem?
Jordan Halterman
@kuujo
May 23 2018 01:39
Let’s see the code and we’ll find out
what exception?
Jianwei Mao
@MaoJianwei
May 23 2018 01:40
[image: screenshot]
[image: screenshot]
Jordan Halterman
@kuujo
May 23 2018 01:40
can you please put it in a gist or something so it can be copied
Jianwei Mao
@MaoJianwei
May 23 2018 01:41
:smile: Thanks! Does this link work?
Jordan Halterman
@kuujo
May 23 2018 01:42
yep thanks
Jianwei Mao
@MaoJianwei
May 23 2018 01:44
This code is for the Put node; I changed the reading interval to 1 second for the Get node. The other parts are the same.
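The gist itself is not preserved in the archive. As a rough sketch of the scenario being described, with a plain ConcurrentHashMap standing in for the Atomix distributed map (whose construction through the 2.1.0-SNAPSHOT builder is omitted here), the Put/Get loops might look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Editor's sketch, not the original gist: a ConcurrentHashMap stands in for
// the distributed map obtained from Atomix after atomix.start().join().
public class PutGetDemo {
    public static void main(String[] args) throws Exception {
        Map<String, String> map = new ConcurrentHashMap<>();

        // "Put" node behavior: insert a record every few seconds.
        Thread putter = new Thread(() -> {
            while (true) {
                map.put("time", String.valueOf(System.currentTimeMillis()));
                try {
                    TimeUnit.SECONDS.sleep(3);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        putter.setDaemon(true);
        putter.start();

        // "Get" node behavior: print the whole map via toString() every second.
        while (true) {
            System.out.println(map);
            TimeUnit.SECONDS.sleep(1);
        }
    }
}
```

In the reported setup, each loop runs in its own JVM against the shared distributed map, and the failure appears after one of the Get processes is killed.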
Jordan Halterman
@kuujo
May 23 2018 01:45
working fine for me
Jianwei Mao
@MaoJianwei
May 23 2018 01:52
something like this:
[image: screenshot of exception output]
[image: screenshot of exception output]
cannot Put and Get anymore
Jordan Halterman
@kuujo
May 23 2018 04:22
That looks like a partition. The node seems to be disconnected from all the other nodes (hence the MEMBER_REMOVED events) which is why the operation fails
In the second one, it just looks like an operation failure due to a connection failure after a node was shutdown. That can probably be mitigated by using retries. This code is basically creating the simplest possible map, which isn’t necessarily ideal unless you’re going to handle these types of failures on your own
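A minimal sketch of the retry mitigation mentioned here, assuming an asynchronous operation that returns a CompletableFuture (the helper and its names are illustrative, not an Atomix API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Illustrative retry helper: re-invokes an async operation on failure,
// up to a fixed number of attempts.
public final class Retries {
    public static <T> CompletableFuture<T> withRetries(
            Supplier<CompletableFuture<T>> operation, int attempts) {
        return operation.get()
            .handle((result, error) -> {
                if (error == null) {
                    return CompletableFuture.completedFuture(result);
                }
                if (attempts > 1) {
                    return withRetries(operation, attempts - 1);  // try again
                }
                CompletableFuture<T> failed = new CompletableFuture<T>();
                failed.completeExceptionally(error);
                return failed;
            })
            .thenCompose(future -> future);  // flatten the nested future
    }
}
```

Usage might then look like withRetries(() -> asyncMap.put(key, value), 3).join(), where asyncMap is a hypothetical async handle on the map.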
Also, why the sleeps? Did you run into a problem there? The nodes should block on startup until they’ve been able to form a cluster, so once start().join() returns you should be able to perform normal operations
Also, any idea what commit you were on when it worked and are on now? It’s going to be very difficult to track down a regression without that information and without being able to reproduce it
Jordan Halterman
@kuujo
May 23 2018 04:32
Also, how specifically are you running them? I run all 5 nodes in parallel in my IDE, then kill a couple before it begins timing out (due to lost availability in Raft partitions).
Jianwei Mao
@MaoJianwei
May 23 2018 13:05
MEMBER_REMOVED is normal, but the problem is that one Get node failing should not cause all four other nodes to fail. By the way, yes, the sleep(10s) is not so important. I am using commit e267c5b2, 2.1.0-SNAPSHOT.
I just run them by clicking the green triangle, starting all of them within 10 seconds. After I see the Put node put some records into the map, and all Get nodes read the whole map in the console output, I kill one Get node. Finally, the exception output shown above appears; this should be easy to reproduce. :smile:
Ronnie
@rroller
May 23 2018 16:28
@kuujo sorry, was out on vacation, back today. I'll get this tested
Ronnie
@rroller
May 23 2018 16:55
It would be really helpful, when adding @Deprecated annotations, to also add the Javadoc @deprecated tag and state what to use instead
So far things look good though, tests pass :)
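For illustration, the pattern being requested might look like this on a hypothetical builder method (the class and the Javadoc wording are examples, not the actual Atomix code):

```java
public class NodeBuilder {  // hypothetical class for illustration
    public enum Type { DATA, CLIENT }

    private Type type;

    /**
     * Sets the node type.
     *
     * @deprecated node types are no longer used and there is no replacement;
     *             this method will be removed in a future release
     */
    @Deprecated
    public NodeBuilder withType(Type type) {
        this.type = type;
        return this;
    }
}
```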
Ronnie
@rroller
May 23 2018 17:01
Do we still need to add a local member?
appears we do
Jordan Halterman
@kuujo
May 23 2018 19:46
I thought I added that change?
the @deprecated one that is
ahh not on the withType method, but there is no other method to use. It’s just not used any more
It should be possible to remove the local member requirement as long as the node doesn’t have to be used in consensus (in that case it needs a static identifier)
but we have to determine which interface to use
nodes only need identifying information for consensus
Jordan Halterman
@kuujo
May 23 2018 21:05
I’m finally upgrading Atomix in ONOS, so probably will have lots of these types of small changes coming as I get through the upgrade, then will do an RC when I submit all the patches for ONOS
Ronnie
@rroller
May 23 2018 21:16
what's RC?
Jordan Halterman
@kuujo
May 23 2018 21:17
release candidate
Johno Crawford
@johnou
May 23 2018 21:50
did you mark them deprecated because of large usages in onos?
or could we just remove them completely
Jordan Halterman
@kuujo
May 23 2018 23:08
no just being nice :-)
I’ll remove them after the next RC
Jordan Halterman
@kuujo
May 23 2018 23:44
[image: ONOS Cluster Architecture diagram]
Jordan Halterman
@kuujo
May 23 2018 23:49
I thought I’d share a quick drawing of the new ONOS architecture (this is why it’s nice being paid to work on open source projects). The new architecture, which I started on this week, uses both an external Atomix cluster and embedded nodes. The external cluster contains Raft partitions, and the ONOS controller nodes contain primary-backup partitions. What this allows is for controller nodes to be scaled up/down without any real overhead, and significantly improves performance by co-locating the in-memory primary-backup partitions with the nodes that access that data.
We will essentially hash devices controlled by the network to Atomix partitions, and assign mastership for those devices to the partition leaders. That allows reads on data related to a device to be done locally on the master node.
So, if a device with ID foo connects to the controller, it will be mapped e.g. to primary-backup partition 2, and partition 2’s leader will control that device (be the master). Then all the state related to the device is stored in partition 2, and the master can read up-to-date primitives locally.
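A rough sketch of that device-to-partition mapping (the hashing scheme and names here are illustrative, not the actual ONOS implementation):

```java
// Illustrative only: maps a device ID to one of N primary-backup partitions.
// The leader of the chosen partition becomes the master for that device, and
// all of the device's state lives in that same partition.
public class DevicePartitioner {
    private final int numPartitions;

    public DevicePartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String deviceId) {
        // floorMod keeps the result non-negative for any hash code
        return Math.floorMod(deviceId.hashCode(), numPartitions);
    }
}
```

So partitionFor("foo") might return 2, in which case partition 2's leader controls device foo and can read its primitives locally.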
Jordan Halterman
@kuujo
May 23 2018 23:57
Since the Raft partitions are all in the external Atomix agent cluster, container orchestration systems can add/remove/move controller nodes without much cost, which is a major limitation today (since the controller nodes are Raft nodes)