These are chat archives for atomix/atomix

10th Feb 2017
Jordan Halterman
@kuujo
Feb 10 2017 00:02
that particular index check is really simple
Jon Hall
@jhall11
Feb 10 2017 00:06
so if this is coming from ServerStateMachine.java:270, and the index is 1, does that mean lastApplied is 0?
Jordan Halterman
@kuujo
Feb 10 2017 00:07
yep
trying to track back to where this actually starts in the logs
I feel like I’m close
Jordan Halterman
@kuujo
Feb 10 2017 00:25

this is the last entry successfully applied by the leader for partition-1:

Applying KeepAliveEntry[index=60033, term=156, session=59805, commandSequence=91, eventIndex=59854, timestamp=1486461361379]

Then it loses its leadership and eventually appends entries that are never applied

the entries it appends are already committed and should immediately be applied
Jordan Halterman
@kuujo
Feb 10 2017 00:34
it’s odd because all the partitions seem to be appending but not committing entries
err not applying them
Jordan Halterman
@kuujo
Feb 10 2017 00:55
man this is crazy
backporting the log is starting to actually seem attractive
Jordan Halterman
@kuujo
Feb 10 2017 02:18
Hmm... I wonder if it's worth trying to partially reproduce one of the partition's logs in a unit test. That should be doable but will take a while and may not be fruitful.
Shalakha Sidmul
@shalakhansidmul
Feb 10 2017 06:09
@neverfox Thank you for your reply :)
By distributed data management I mean managing the consistency/availability of my application's data across the nodes of a cluster, i.e. in a distributed environment.
Jordan Halterman
@kuujo
Feb 10 2017 06:23

I don't know of any real-world diagrams that aren't from large and complex systems. In general, the concepts are pretty simple. Look at any ZooKeeper diagram for the most basic setup: a client-server architecture. Start 3 or 5 AtomixReplicas or standalone servers, and connect to them with an AtomixClient. With the client API, you create resources, which are distributed objects that intentionally model familiar Java classes. Any diagram will typically just show three replicas and some clients connected to them. The rest is all handled transparently by Atomix.

DistributedLock lock = client.getLock("foo").join();
lock.lock().join();
try {
  doSomething();
} finally {
  lock.unlock().join();
}

Because Atomix is strongly consistent, these distributed objects will behave exactly as you would expect them to behave (more or less) were they actually accessed across threads in the same VM via shared memory.

But AtomixReplicas themselves can be used to create and operate on distributed objects as well, so Atomix can indeed be embedded in a system like Hazelcast is. But the management of embedded Atomix clusters isn’t as straightforward as it is in Hazelcast due to the strict quorum requirements of consensus. Work to support dynamic clusters in Atomix is ongoing and will likely be improved some time in the near future, but as of now it’s really considered experimental. Hazelcast will always be somewhat simpler in that respect.
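
A rough sketch of that three-replica topology, using the Atomix 1.x replica/client builders (API names are from memory of that release line and may differ between versions; the addresses and storage directory are placeholders):

import io.atomix.AtomixClient;
import io.atomix.AtomixReplica;
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport; // package path varies by Catalyst version
import io.atomix.copycat.server.storage.Storage;

import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ThreeReplicaExample {
  public static void main(String[] args) {
    // Placeholder addresses for a three-node cluster.
    List<Address> cluster = Arrays.asList(
        new Address("10.0.0.1", 8700),
        new Address("10.0.0.2", 8700),
        new Address("10.0.0.3", 8700));

    // On each server node: build a replica bound to that node's own address
    // (the first address is used here) and bootstrap the cluster.
    AtomixReplica replica = AtomixReplica.builder(cluster.get(0))
        .withTransport(new NettyTransport())
        .withStorage(Storage.builder().withDirectory(new File("data")).build())
        .build();
    replica.bootstrap(cluster).join();

    // On a client machine: connect to the replicas, then use resources
    // (getLock, getMap, ...) exactly as in the lock snippet above.
    AtomixClient client = AtomixClient.builder()
        .withTransport(new NettyTransport())
        .build();
    client.connect(cluster).join();
  }
}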

Jordan Halterman
@kuujo
Feb 10 2017 07:35

@jhall11 progress is being made towards finding the cause... ever so slowly.

So, this definitely appears to be an issue in the log, and in particular I think with log recovery. What I note from the ONOS logs is that all partitions on node 2 (*.*.*.3) are affected after a crash. This implies that it's not corrupt data, but perhaps is an issue with how the logs are recovered from disk. If the logs were corrupt I would expect just one or two partitions not to work since they operate completely independently of one another.

I'm hoping this can be reproduced in a unit test. Is it possible to list all the files in the data directory(s)? The file names only provide a subset of the metadata that's used when recovering from disk, but could still be informative. Aside from that, I'll just take a stab at beefing up the log recovery tests.

The Copycat 2.0 logs still generally use the same recovery mechanism, so it seems unlikely that backporting them would fix this issue.

What's really odd about this is that servers are typically able to recover from disk after a failure, but after this particular failure all partitions fail to properly recover from disk. That implies some determinism that I don't particularly understand.
The partitions operate completely independently of one another, so again the fact that they all fail to recover from disk is strange. My only guess is that perhaps they all just happened to roll over to a new segment prior to the crash, since the tests seem to be distributing data pretty evenly.
Jordan Halterman
@kuujo
Feb 10 2017 07:40
It might take some more time to narrow it down, but it will be found and when it is, it will feel good :-)
Jordan Halterman
@kuujo
Feb 10 2017 08:07
What I might have to do is just run randomized tests writing, compacting, recovering, and reading logs.
It's ugly, but it may let me reproduce this in a debugger.
Hmm... actually better idea. Can you just send me the data directory from a server that runs into this bug? Just recovering the log should reproduce it if I'm right about where the issue is.
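
The randomized write/compact/recover loop described here might look roughly like the sketch below. TestLog and its methods are hypothetical stand-ins for the log under test (this is not the Copycat storage API); the main points are the random interleaving of operations and the fixed seed so a failing run can be replayed:

import java.util.Random;
import java.util.function.Supplier;

public class RandomizedLogTest {

  // Stand-in for the log implementation under test; not the Copycat storage API.
  interface TestLog {
    long append(byte[] entry);   // append an entry, returning its index
    void compact(long index);    // discard entries up to and including the given index
    byte[] get(long index);      // read an entry back, or null if it is missing
    void close();                // close the log, as a crash would
  }

  // Randomly interleave writes, compactions, crash/recover cycles, and reads.
  static void fuzz(Supplier<TestLog> openLog, int operations, long seed) {
    Random random = new Random(seed); // fixed seed so a failing run can be replayed
    TestLog log = openLog.get();
    long lastIndex = 0;
    long compacted = 0;

    for (int i = 0; i < operations; i++) {
      switch (random.nextInt(4)) {
        case 0: { // append a random entry
          byte[] entry = new byte[1 + random.nextInt(64)];
          random.nextBytes(entry);
          lastIndex = log.append(entry);
          break;
        }
        case 1: { // compact up to a random already-written index
          if (lastIndex > 0) {
            compacted = Math.max(compacted, 1 + Math.floorMod(random.nextLong(), lastIndex));
            log.compact(compacted);
          }
          break;
        }
        case 2: { // simulate a crash: close the log and recover it from disk
          log.close();
          log = openLog.get();
          break;
        }
        default: { // read back a random live index and sanity-check it
          if (lastIndex > compacted) {
            long index = compacted + 1 + Math.floorMod(random.nextLong(), lastIndex - compacted);
            if (log.get(index) == null) {
              throw new AssertionError("missing entry at index " + index);
            }
          }
        }
      }
    }
    log.close();
  }
}
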
Jon Hall
@jhall11
Feb 10 2017 17:08
ok, do you want the logs before and after the node failure?
Jordan Halterman
@kuujo
Feb 10 2017 19:39
I think I just need the data. It seems like just recovering the Raft log from disk could reproduce it. But it's still really odd that this happens on all the partitions on the same node. That's what makes me question it. But I can't imagine any good reason this failure would be deterministic across totally separate instances of the Raft protocol. So hopefully that's just a coincidence.
Jon Hall
@jhall11
Feb 10 2017 19:39
ok, we are rerunning it now
Jordan Halterman
@kuujo
Feb 10 2017 20:12
@jhall11 I reproduced it in randomized tests
yay!
took a while but it worked
I’ll still wait on your tests just in case but I think I at least reproduced the same exception
Jordan Halterman
@kuujo
Feb 10 2017 20:20
hmm…. now I can’t get it again though
Jon Hall
@jhall11
Feb 10 2017 20:31
of course :)
Jordan Halterman
@kuujo
Feb 10 2017 20:31
haha
Jordan Halterman
@kuujo
Feb 10 2017 20:38
digging through the logs again… What's happening after the server restart is that the followers are receiving AppendRequest and committing entries. Then, once that exception is thrown, they fail to reset their election timeouts, so they time out and forget about the leader (assuming it died), then try to get elected, then learn about the leader again, and so on. I think there are some minor improvements to be made there (don't forget about the leader until the node transitions to the candidate state and actually starts a new election). But what's curious to me is that the followers are able to pass the log consistency checks that are required when handling AppendRequest. That is, those checks seem to succeed, and they're almost the same as the checks that throw IndexOutOfBoundsException (calling isEmpty() and lastIndex() on the log).
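
For reference, the log consistency check being described is the standard Raft "previous entry" check a follower performs before appending. A rough illustration (not the actual Copycat code; log.term(index) is a hypothetical accessor, while isEmpty() and lastIndex() are the calls mentioned above):

// Illustrative only: before appending, the follower verifies that it holds the
// entry the leader says immediately precedes the new entries.
boolean checkPreviousEntry(AppendRequest request) {
  if (request.logIndex() == 0) {
    return true;                                   // nothing precedes the first entry
  }
  if (log.isEmpty() || request.logIndex() > log.lastIndex()) {
    return false;                                  // this follower is missing the previous entry
  }
  return log.term(request.logIndex()) == request.logTerm();
}
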
crazy bug
hmm
actually I might see something...
Jordan Halterman
@kuujo
Feb 10 2017 20:56
so… it looks like the commitIndex is actually being somehow set to an invalid value
I don’t know if that’s causing this problem, but it can’t be helping
Received AppendRequest[term=157, leader=-1408161301, logIndex=61799, logTerm=157, entries=[0], commitIndex=62074, globalIndex=61930]
Committed entries up to index 61800
that follower should only be able to commit up to 61799 after that request
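
For reference, the standard Raft guard here is that a follower never advances its commitIndex past the last entry it actually holds. A minimal illustration (field names mirror the AppendRequest above; this is not the actual Copycat code):

// Illustrative only: the last entry this follower holds after the append is the
// leader's logIndex (the index preceding the new entries) plus the number of
// entries appended by this request.
long lastLocalIndex = request.logIndex() + request.entries().size();
if (request.commitIndex() > commitIndex) {
  // Never commit past the local log; with the request above (which appears to
  // carry no new entries) that gives min(62074, 61799) = 61799.
  commitIndex = Math.min(request.commitIndex(), lastLocalIndex);
}
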
Jordan Halterman
@kuujo
Feb 10 2017 21:03
hmm
I wonder if the problem is allowing the globalIndex to increase above the commitIndex on the node
let me think about this...
globalIndex should definitely lag behind commitIndex. I’ll fix that. Not sure if it will fix this bug though
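
A one-line sketch of that invariant (globalIndex and commitIndex here mirror the fields in the AppendRequest above; illustrative only, not the actual Copycat code):

// The index used for compaction decisions never runs ahead of this node's commitIndex.
globalIndex = Math.min(request.globalIndex(), commitIndex);
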
Jordan Halterman
@kuujo
Feb 10 2017 21:46
I'll take one more stab at reproducing it in a unit test. If that doesn't work I think it can definitely be done with the test logs.
Jon Hall
@jhall11
Feb 10 2017 21:48
ok, looks like the test is still running
Jordan Halterman
@kuujo
Feb 10 2017 23:32
I set up a test that submits random commands and queries to a cluster, randomly compacts the logs, and randomly kills and later restarts nodes. I managed to get an exception from the test, but not the same one.
Will keep running it and hopefully track down some issues
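
A rough sketch of that cluster-level fuzz loop is below. TestCluster and all of its methods are hypothetical test-harness stand-ins rather than an Atomix or Copycat API; the sketch only shows the shape of the random interleaving:

import java.util.Random;

public class ClusterFuzzTest {

  // Hypothetical test-harness interface; not an Atomix or Copycat API.
  interface TestCluster {
    int size();                                              // number of nodes in the cluster
    void submitCommand(int node, String key, String value);  // write through the given node
    void submitQuery(int node, String key);                  // read through the given node
    void compactLog(int node);                               // force log compaction on a node
    void kill(int node);                                     // crash a node (no-op if already down)
    void restart(int node);                                  // restart a node (no-op if already up)
    void verifyConverged();                                  // all live nodes agree on state
  }

  static void fuzz(TestCluster cluster, int operations, long seed) {
    Random random = new Random(seed); // fixed seed so a failing run can be replayed
    for (int i = 0; i < operations; i++) {
      int node = random.nextInt(cluster.size());
      switch (random.nextInt(5)) {
        case 0: cluster.submitCommand(node, "key-" + random.nextInt(100),
                                      "value-" + random.nextInt(100)); break;
        case 1: cluster.submitQuery(node, "key-" + random.nextInt(100)); break;
        case 2: cluster.compactLog(node); break;
        case 3: cluster.kill(node); break;
        default: cluster.restart(node); break;
      }
    }
    cluster.verifyConverged();
  }
}
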
Jon Hall
@jhall11
Feb 10 2017 23:33
hmm, that's good that we can catch these issues
BTW, ONOS is looking to do a release in about 2 weeks. Would it be possible to get a bug-fix release, or an RC, sometime next week?
Jordan Halterman
@kuujo
Feb 10 2017 23:36
Yeah whenever it's needed. Should probably hack through some of the various little issues before then.
I think I can probably improve this test a bit more to get it to exercise more cases. Seems like a decent way to get it to misbehave.