These are chat archives for atomix/atomix

25th
Feb 2017
Jordan Halterman
@kuujo
Feb 25 2017 00:01 UTC
so...
What seems to be happening is that the segment locks for all the compacted segments are still sitting in the OS cache when the node is killed. Each partition compacts the four segments, then writes a lock to the compacted segment files and deletes the old segment files. But when they’re read, the segments appear to be unlocked (partially compacted) and are therefore deleted.
But AFAICT the compaction process is flushing the locks to disk, so I’m not sure why this is happening yet
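For illustration, here is a minimal sketch of what "flushing the lock to disk" involves in Java. The class name and the one-byte lock flag at offset 0 are assumptions made up for this example, not Copycat's actual segment format. It only shows the flush step under discussion and is not a diagnosis of the bug.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: the one-byte flag at offset 0 is an assumption for illustration only.
final class SegmentLockSketch {
  private static final long LOCK_OFFSET = 0L;

  /** Marks the segment as locked and forces the change to the storage device. */
  static void lock(Path segmentFile) throws IOException {
    try (FileChannel channel = FileChannel.open(segmentFile,
        StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      ByteBuffer flag = ByteBuffer.wrap(new byte[] {1});
      while (flag.hasRemaining()) {
        channel.write(flag, LOCK_OFFSET + flag.position());
      }
      // Without force(), the byte may still be in the OS page cache when the process is killed.
      channel.force(true);
    }
  }

  /** Returns true only if the lock byte is present and set. */
  static boolean isLocked(Path segmentFile) throws IOException {
    try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
      ByteBuffer flag = ByteBuffer.allocate(1);
      int read = channel.read(flag, LOCK_OFFSET);
      return read == 1 && flag.get(0) == 1;
    }
  }
}
```

A reader that finds `isLocked` false would treat the segment as partially compacted, which matches the symptom described above.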
Jordan Halterman
@kuujo
Feb 25 2017 00:08 UTC
I think that PR I submitted was wrong, but I think I have the right one. Going to hack out some tests
Jordan Halterman
@kuujo
Feb 25 2017 00:18 UTC
@jhall11 fixed: atomix/copycat#287
this will have to be released immediately
I verified the bug with a failing test, and this fix resolves it
Jienan Zeng
@jienan
Feb 25 2017 00:33 UTC

@kuujo yeah, the client only sends commands, as many as possible. I can reproduce it easily.

My understanding is that when a follower has failed, the master not only sends heartbeats but also attempts to send an empty appendEntry once it receives a command from the client.

Jordan Halterman
@kuujo
Feb 25 2017 00:33 UTC
right
that problem should be fixed now
it was indeed the problem
the PR I submitted will only allow a heartbeat every n seconds (using exponential backoff) once appends to a follower have failed
even for commands
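A minimal sketch of that kind of gating, assuming a hypothetical `FollowerBackoffSketch` class with a made-up base and maximum delay. This is not the code from the PR, only an illustration of exponential backoff applied to every attempt to contact a failed follower:

```java
import java.time.Duration;

// Hypothetical sketch: names and delay values are assumptions, not Copycat's implementation.
final class FollowerBackoffSketch {
  private final Duration base;
  private final Duration max;
  private int failures;
  private long nextAttemptMillis;

  FollowerBackoffSketch(Duration base, Duration max) {
    this.base = base;
    this.max = max;
  }

  /** True if the leader may contact the follower now, for heartbeats and command appends alike. */
  boolean canSend(long nowMillis) {
    return nowMillis >= nextAttemptMillis;
  }

  /** Doubles the wait after each consecutive failure, capped at the maximum delay. */
  void onFailure(long nowMillis) {
    failures++;
    long delay = Math.min(max.toMillis(), base.toMillis() << Math.min(failures - 1, 20));
    nextAttemptMillis = nowMillis + delay;
  }

  /** A successful append resets the backoff so normal replication resumes. */
  void onSuccess() {
    failures = 0;
    nextAttemptMillis = 0L;
  }

  long nextAttemptMillis() {
    return nextAttemptMillis;
  }
}
```

With a 100 ms base and a 1 s cap, consecutive failures space attempts 100, 200, 400, 800 and then 1000 ms apart, no matter how many client commands arrive in between.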
Jienan Zeng
@jienan
Feb 25 2017 00:37 UTC
So, well, the problem is finally solved.
Jordan Halterman
@kuujo
Feb 25 2017 00:40 UTC
:clap:
Jordan Halterman
@kuujo
Feb 25 2017 00:49 UTC
@jhall11 I’ve double and triple and quadruple checked atomix/copycat#287 and I think it’s good. I’m going to be merging it and pushing a new release of Copycat and Atomix once tests are done since this is a critical bug.
Jienan Zeng
@jienan
Feb 25 2017 00:52 UTC
I am a little confused after reading https://github.com/atomix/copycat/pull/285/files.
When a follower has failed, can the master not send heartbeats or empty appendEntry requests to that follower?
Jordan Halterman
@kuujo
Feb 25 2017 01:12 UTC
What do you mean? We don't want to completely stop the leader from attempting to send heartbeats to a follower because it's possible a configuration change could mean the follower no longer knows about the leader. Clusters are controlled by leaders. If we rely only on the follower being able to contact the leader to get its status changed after a partition, that may never happen in some scenarios. The leader is the only node that can be assumed to have an up-to-date view of the cluster. The leader updating the follower's status on PollRequest is just an optimization to reduce the amount of time it takes for a leader to rediscover the follower and update its status.
Imagine this:
You have a three node cluster. The leader is node A, and B and C are followers. Follower C is partitioned and the leader stops sending heartbeats. Then a new follower D joins the cluster and is caught up. Leader A crashes and follower D is elected leader. Follower C’s partition is healed. How does node C get D to update its status? It doesn't know about leader D since leader D won't send it heartbeats and it never received the configuration change.
For that reason, the leader always needs to attempt to connect to followers at some rate.
@jienan
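The scenario above can be sketched as a tiny model of the leader's view of the cluster. The class, the `Status` enum, and the method names are hypothetical and exist only for this illustration. They are not Copycat's classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a leader's view of the cluster, for illustration only.
final class LeaderViewSketch {
  enum Status { AVAILABLE, UNAVAILABLE }

  private final String leaderId;
  private final Map<String, Status> members = new LinkedHashMap<>();

  LeaderViewSketch(String leaderId, Map<String, Status> configuration) {
    this.leaderId = leaderId;
    this.members.putAll(configuration);
  }

  /** Every other configured member is a heartbeat target, whether it is reachable or not. */
  List<String> heartbeatTargets() {
    List<String> targets = new ArrayList<>();
    for (String id : members.keySet()) {
      if (!id.equals(leaderId)) {
        targets.add(id);
      }
    }
    return targets;
  }

  /** A successful heartbeat response is what flips a member back to AVAILABLE. */
  void onHeartbeatResponse(String memberId) {
    members.replace(memberId, Status.AVAILABLE);
  }

  Status status(String memberId) {
    return members.get(memberId);
  }
}
```

If new leader D left UNAVAILABLE members out of `heartbeatTargets()`, node C would stay marked UNAVAILABLE after its partition healed, because C never learned that D exists and so has no way to reach out first.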
Jienan Zeng
@jienan
Feb 25 2017 01:22 UTC
I get it. Thank you for your detailed explanation. Implementation of raft/paxos is delicate and challenging.
Jordan Halterman
@kuujo
Feb 25 2017 01:28 UTC
Yep. Indeed it is! Especially configuration changes pose all sorts of challenges like those
Jordan Halterman
@kuujo
Feb 25 2017 02:42 UTC
@jhall11 when will the next RC be released and when is the final release?
Jon Hall
@jhall11
Feb 25 2017 02:45 UTC
The plan was for onos-1.9.0-rc2 to be promoted to the onos-1.9 release on Monday unless our testing turns up any bugs. Not sure if we can get this fix into it or not. But we do 3-month releases, so it shouldn’t be a big deal. I’ll make a patch to update to this and we can try to get it in though
Jordan Halterman
@kuujo
Feb 25 2017 02:45 UTC
k cool
Jon Hall
@jhall11
Feb 25 2017 02:47 UTC
Thanks for looking into this. It really helps to have your expertise when looking at these bugs
Jordan Halterman
@kuujo
Feb 25 2017 02:54 UTC
Hey, this is what I do! I love it! I think we’ll be able to see the systems that rely on Atomix become a lot more stable much more quickly once I actually have time to focus on it. I’ve built all this in the evenings and on weekends and work on a totally different large project I created at work. My ability to meet the needs of users typically comes and goes. Now I just have to learn about the ONOS code base enough to do the same thing there. That will take some time, but I’ve had a lot of success jumping into open source projects over the years.
I think this is a very important bug fix. There are actually a couple of other fixes we’ve already merged since last night. One of them was the performance issue when a follower is down (too frequent connections), and the other was followers returning some incorrect information about the cluster. The latter doesn’t really affect ONOS though because it allows clients to connect to any node. The performance fix is nice. The IOOB bug is critical.
gonna push a release momentito
Jon Hall
@jhall11
Feb 25 2017 02:55 UTC
ok great!
It seems like there are still some people having bugs with onos-1.9.0-rc2 which has atomix-1.0.1. I’ll try to debug more with them on Monday to get some more info
so we might end up doing another rc
Jordan Halterman
@kuujo
Feb 25 2017 02:57 UTC
sounds good
Jordan Halterman
@kuujo
Feb 25 2017 03:06 UTC
I’ll be around. I’m not doing much at work next week. Should have time to fix any issues that come up. But next weekend I’m leaving on vacation and disconnecting from the internets for a week! So, we’ll have to just get out as much as possible before then.
I’m excited about taking a break from technology. I doubt I’ll actually be successful, but at least it feels good to think that’s what I’m going to do right now :-P
Jon Hall
@jhall11
Feb 25 2017 03:16 UTC
haha, it is always good to disconnect and not have to check in with everything
Jordan Halterman
@kuujo
Feb 25 2017 03:19 UTC
hmm… there are some issues with the client tests, but we have time. I’ll fix them tonight and probably run the fuzz test some more and then release it this weekend
Jon Hall
@jhall11
Feb 25 2017 03:27 UTC
:+1:
Jordan Halterman
@kuujo
Feb 25 2017 10:27 UTC
I updated the changelog with all the recent work: https://github.com/atomix/copycat/blob/master/CHANGES.md
The fuzz test still seems to be able to produce bugs frequently enough that the release should wait. I'll spend the weekend tracking down and fixing them and should have that done and released by the end of the weekend. Then I have to get back to my ONOS review :-)
Jordan Halterman
@kuujo
Feb 25 2017 11:09 UTC
This is obviously just peanuts though. Really, the next quarterly release is where we can see huge improvements in stability if we want to...