These are chat archives for atomix/atomix

22nd
Dec 2015
Richard Pijnenburg
@electrical
Dec 22 2015 00:01
Just got back :) So if the communication goes through Raft and I read an event from the client, is it removed from the Raft log at that point? Or could I use it for queue persistence, so that if something crashes I won't lose events?
And no worries on 1.0 scope. Can always iterate further
My biggest worry is not losing events as soon as I receive them.
Richard Pijnenburg
@electrical
Dec 22 2015 00:06
Would be nice if raft handles
Uhg. Typing on a phone sux.
So. Android app is a bit better.
Richard Pijnenburg
@electrical
Dec 22 2015 00:12
Some plugins in logstash for example are only allowed to run once in the cluster. Having a lock to ensure it doesn't run anywhere else would be useful.
@kuujo ^^
If reading from the queue means it's not in Raft anymore, I will still need a persistent queue per machine.
Richard Pijnenburg
@electrical
Dec 22 2015 08:43
Morning all
Richard Pijnenburg
@electrical
Dec 22 2015 14:11
@kuujo getting this weird error. "incompatible types: io.atomix.Atomix cannot be converted to io.atomix.AtomixClient" when trying to build an example with a client.
@jhalterman ^^ not sure if you may know what this is caused by
Jordan Halterman
@kuujo
Dec 22 2015 14:13
odd
I’m back
Richard Pijnenburg
@electrical
Dec 22 2015 14:14
i'll gist my code. maybe i'm doing something weird?
and sorry for all the questions, it's quite a big project I've started with very little know-how about Java and stuff :p
Jordan Halterman
@kuujo
Dec 22 2015 14:16
yeah a gist would be great
it’s all good :-) Like I said, always happy to talk about it. I think it really helps work out what needs to be documented actually
Richard Pijnenburg
@electrical
Dec 22 2015 14:16
very true
Also when running tests i saw this coming up a few times: https://gist.github.com/electrical/e234708478a95a30b64a not sure if the stacktrace is expected to show up in the tests :-)
Jordan Halterman
@kuujo
Dec 22 2015 14:18
Yeah I actually saw that yesterday too. I haven’t frequently seen it in tests until now. I’m going to look into whether or not that’s a valid state. Could potentially be a race condition
Richard Pijnenburg
@electrical
Dec 22 2015 14:18
okay
Jordan Halterman
@kuujo
Dec 22 2015 14:19
Copycat tends to throw a lot of exceptions to try to catch race conditions like that.
Richard Pijnenburg
@electrical
Dec 22 2015 14:19
with regards to the client thing i was trying out. wanted to see if i could write a producing and consuming client to test out how fast i could send data across :-)
also I still saw the timeout error mentioned in #77
Richard Pijnenburg
@electrical
Dec 22 2015 14:33
Ah, interesting fix :-)
Jordan Halterman
@kuujo
Dec 22 2015 14:34
I had some notes on the issue in #77 somewhere… let me find them
Richard Pijnenburg
@electrical
Dec 22 2015 14:35
hehe okay.
'somewhere' with me usually means i lost it.
Jordan Halterman
@kuujo
Dec 22 2015 14:35
indeed
lol
somewhere searchable hopefully
Richard Pijnenburg
@electrical
Dec 22 2015 14:35
put it all in elasticsearch :-)
hah
Jordan Halterman
@kuujo
Dec 22 2015 14:36
ugh that’s what I need
we’ve talked about doing that for server logs in tests
would make it a lot easier to make sense of what’s going on in 5 servers
Richard Pijnenburg
@electrical
Dec 22 2015 14:36
yeah indeed
I do have all my Jenkins console logs forwarded to ES so i can make nice graphs and stuff of jobs
like runtimes and failed tests
Richard Pijnenburg
@electrical
Dec 22 2015 14:44
still very confused about the error with the AtomixClient thing
can't see any reason why
Jordan Halterman
@kuujo
Dec 22 2015 14:53

So, on #77 I am only able to reproduce this when the cluster is initially misconfigured. What a CopycatServer does at startup is this:

  • If a configuration is stored on disk via the MetaStore, load and use the last known cluster configuration, otherwise...
  • If the local server Address is listed in the list of server addresses provided by the user, start the server as a full member of the cluster
  • If the local server Address is not listed in the list of server addresses provided by the user, attempt to join via the provided list of servers
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerState.java#L97-L109

Essentially, the server uses the user-provided configuration only the first time it is started, and thereafter uses the configuration stored on disk. This allows the cluster to evolve over time without the user-provided configuration having to change. When the server is started, it will either attempt to join an existing cluster by sending a JoinRequest or immediately transition to the FOLLOWER state to participate in the Raft algorithm. Which of the two happens depends on the configuration determined by the code linked above.

// First method: the local address (5000) is NOT in the member list.
Address address = new Address("localhost", 5000);
List<Address> members = Arrays.asList(
  new Address("localhost", 5001),
  new Address("localhost", 5002)
);
CopycatServer server = CopycatServer.builder(address, members).build();

Because the local server’s address Address("localhost", 5000) is not listed in the members list, this code will cause the server to initially attempt to join the cluster.

// Second method: the local address (5000) IS in the member list.
Address address = new Address("localhost", 5000);
List<Address> members = Arrays.asList(
  new Address("localhost", 5000),
  new Address("localhost", 5001),
  new Address("localhost", 5002)
);
CopycatServer server = CopycatServer.builder(address, members).build();

Because the local server’s address Address("localhost", 5000) is listed in the members list, this code will cause the server to initially transition to the FOLLOWER state and begin talking to other servers.
The problem is, if all servers are started using the first method, then every server will be trying to join the others and none of them will be running as a member that can accept the join. The initial cluster needs to have some set of members that are started using the second method so that additional servers can be added. Raft requires that cluster membership changes be done as commits through the Raft log, and so at least one server needs to be started in the Raft FOLLOWER state. I’ve only been able to reproduce the join timeouts by starting all servers using the first method.

Once a server has been started and successfully joined the cluster, thereafter it will use the configuration stored on disk (if the StorageLevel is DISK or MAPPED). So, the second time you start the server in the first example, it will immediately transition to FOLLOWER and continue its participation in the Raft algorithm (assuming it was able to successfully join the first time it started).
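
The startup rules described above can be sketched as a small standalone decision function. This is only an illustration of the three cases, not Copycat's code: `StartupSketch`, `Mode`, `decide`, `anyoneStartsAsMember` and the plain-string addresses are all hypothetical names made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration only; none of these names come from Copycat.
final class StartupSketch {
  enum Mode { USE_STORED_CONFIG, START_AS_MEMBER, JOIN }

  // Mirrors the three cases: stored config wins, otherwise membership in the
  // user-provided list decides between starting as a member and joining.
  static Mode decide(boolean hasStoredConfig, String local, List<String> members) {
    if (hasStoredConfig) {
      return Mode.USE_STORED_CONFIG;
    }
    return members.contains(local) ? Mode.START_AS_MEMBER : Mode.JOIN;
  }

  // On a first start (nothing on disk), checks whether at least one server
  // would start as a full member. Keys are local addresses, values are the
  // member lists handed to each server.
  static boolean anyoneStartsAsMember(Map<String, List<String>> configs) {
    return configs.entrySet().stream()
        .anyMatch(e -> decide(false, e.getKey(), e.getValue()) == Mode.START_AS_MEMBER);
  }

  public static void main(String[] args) {
    List<String> withoutSelf = Arrays.asList("localhost:5001", "localhost:5002");
    List<String> withSelf = Arrays.asList("localhost:5000", "localhost:5001", "localhost:5002");
    System.out.println(decide(false, "localhost:5000", withoutSelf)); // JOIN
    System.out.println(decide(false, "localhost:5000", withSelf));    // START_AS_MEMBER
    System.out.println(decide(true, "localhost:5000", withoutSelf));  // USE_STORED_CONFIG

    // Every server is given a list that omits itself, so all of them try to join.
    Map<String, List<String>> allJoining = new HashMap<>();
    allJoining.put("localhost:5000", Arrays.asList("localhost:5001", "localhost:5002"));
    allJoining.put("localhost:5001", Arrays.asList("localhost:5000", "localhost:5002"));
    allJoining.put("localhost:5002", Arrays.asList("localhost:5000", "localhost:5001"));
    System.out.println(anyoneStartsAsMember(allJoining)); // false
  }
}
```

The last case is the misconfiguration behind the join timeouts: with no server starting as a member, there is nobody to commit the membership change.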

If this is not what’s causing the startup failure then it could be a bug that I can’t reproduce. Also, when testing, keep in mind that with StorageLevel.DISK (the default) state will still be loaded from disk on separate runs. I typically use StorageLevel.MEMORY for a lot of testing and then switch to StorageLevel.DISK for final tests.

BTW you may benefit from using StorageLevel.MEMORY or StorageLevel.MAPPED which are waaaaay faster than disk
Richard Pijnenburg
@electrical
Dec 22 2015 14:56
Okay, my issue is not so much building the cluster anymore, because that works. For me it's that the leader changes for unknown reasons, giving the timeout errors.
Jordan Halterman
@kuujo
Dec 22 2015 14:56
ahh gotcha
Richard Pijnenburg
@electrical
Dec 22 2015 14:57
perhaps i should have opened a new issue about it :-)
but your explanation is very clear and defo makes sense.
Jordan Halterman
@kuujo
Dec 22 2015 14:59
I have been trying to look into the leader change issue and haven’t had much success yet. It’s easy to reproduce in tests, but difficult to tell exactly why it’s happening. In tests I’ve seen all servers pause for as long as 10 seconds, causing a leader change. I haven’t really been able to reproduce it outside of tests though. I’ve found that increasing the election timeout seems to help a lot, but that’s a terrible solution if servers are getting blocked in non-test environments:
// address and members as in the earlier examples
CopycatServer server = CopycatServer.builder(address, members)
  .withElectionTimeout(Duration.ofSeconds(1))
  .build();
Richard Pijnenburg
@electrical
Dec 22 2015 15:00
Hmm yeah indeed.
Jordan Halterman
@kuujo
Dec 22 2015 15:00
I have some logs from tests actually...
14:16:52.190 [5001] DEBUG i.a.c.server.state.LeaderAppender - 5001 - Sent AppendRequest[term=1, leader=2130712285, logIndex=5, logTerm=1, entries=[0], commitIndex=5, globalIndex=5] to 5004
14:16:53.238 [5002] DEBUG i.a.c.server.state.FollowerState - 5002 - Heartbeat timed out in PT1.28S
14:16:53.341 [5003] DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT1.387S
14:16:53.422 [5004] DEBUG i.a.c.server.state.FollowerState - 5004 - Heartbeat timed out in PT1.469S
14:16:53.432 [5005] DEBUG i.a.c.server.state.FollowerState - 5005 - Heartbeat timed out in PT1.479S
In this case, the leader sends a heartbeat to a follower, but the heartbeat doesn’t arrive for a couple seconds. Keep in mind, this isn’t even over a network. This is just threading.
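
The different "timed out in" values in the log are consistent with each follower using a randomized timeout, which is common practice in Raft. Assuming that is what happens here (the chat does not confirm it), a rough sketch of the follower side looks like the following. `HeartbeatTimeoutSketch`, `randomizedTimeout` and `timesOut` are hypothetical names, and the range of one to two times the base timeout is an assumption, not Copycat's documented behavior.

```java
import java.time.Duration;
import java.util.Random;

// Standalone illustration only; not Copycat's implementation.
final class HeartbeatTimeoutSketch {
  // Picks a timeout in [base, 2 * base). The range is an assumption.
  static Duration randomizedTimeout(Duration base, Random random) {
    long baseMillis = base.toMillis();
    return base.plusMillis((long) (random.nextDouble() * baseMillis));
  }

  // A follower gives up on the leader once it has heard nothing for at
  // least as long as its own timeout.
  static boolean timesOut(Duration silence, Duration timeout) {
    return silence.compareTo(timeout) >= 0;
  }

  public static void main(String[] args) {
    Random random = new Random();
    Duration base = Duration.ofSeconds(1);
    Duration stall = Duration.ofSeconds(2); // every thread is paused this long
    for (int port = 5002; port <= 5005; port++) {
      Duration timeout = randomizedTimeout(base, random);
      System.out.println(port + " timeout=" + timeout
          + " timesOut=" + timesOut(stall, timeout));
    }
  }
}
```

Under these assumptions a stall of two seconds exceeds every possible timeout, so all followers time out together, which matches the four timeouts logged within a fraction of a second of each other. A larger base timeout only raises the length of stall needed to trigger this.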
Richard Pijnenburg
@electrical
Dec 22 2015 15:02
interesting
Jordan Halterman
@kuujo
Dec 22 2015 15:02
14:32:09.136 [5001] DEBUG i.a.copycat.server.state.ServerState - 5001 - Sending server identification to 5002
14:32:09.138 [5003] DEBUG i.a.copycat.server.state.ServerState - 5003 - Sending server identification to 5002
14:32:17.633 [5001] DEBUG i.a.c.server.state.FollowerState - 5001 - Heartbeat timed out in PT0.79S
14:32:17.633 [5003] DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT0.85S
This one is over 8 seconds, but I’ve only seen these in Copycat tests and not in Atomix
Richard Pijnenburg
@electrical
Dec 22 2015 15:03
makes you wonder what it is that's blocking it if it's even just local stuff
Jordan Halterman
@kuujo
Dec 22 2015 15:03
note that the heartbeat timeout was 0.79S but it still took 8 seconds to trigger
something in Copycat tests blocks all servers, but I haven’t actually been able to reproduce it outside of Copycat tests
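
The size of that gap can be read straight off the log timestamps. The sketch below does the arithmetic for server 5001, using the last line it logged before the timeout as a rough stand-in for when its timer was last reset (an assumption, since the log does not show the reset itself). `LogGapSketch` and `gap` are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalTime;

// Standalone illustration: measures the time between two log timestamps.
final class LogGapSketch {
  static Duration gap(String earlier, String later) {
    return Duration.between(LocalTime.parse(earlier), LocalTime.parse(later));
  }

  public static void main(String[] args) {
    // Timestamps taken from the 5001 lines in the log above.
    Duration silence = gap("14:32:09.136", "14:32:17.633");
    Duration timeout = Duration.parse("PT0.79S");
    System.out.println(silence);                // PT8.497S
    System.out.println(silence.minus(timeout)); // PT7.707S
  }
}
```

So the timeout fired roughly 7.7 seconds later than a 0.79 second timer should have, which points at the timer thread itself being blocked rather than at a slow heartbeat.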
Richard Pijnenburg
@electrical
Dec 22 2015 15:03
hmm yeah
but it also happens outside the tests
the timeouts i saw were in the leader election example
Jordan Halterman
@kuujo
Dec 22 2015 15:04
I’ll play with it more today
what’s your environment like?
Richard Pijnenburg
@electrical
Dec 22 2015 15:05
I'm currently running it on an Ubuntu 14.04 VM ( with kvm )
with 2 cpu's
Jordan Halterman
@kuujo
Dec 22 2015 15:07
Java version?
Richard Pijnenburg
@electrical
Dec 22 2015 15:07
jre1.8.0_66
Jordan Halterman
@kuujo
Dec 22 2015 15:07
cool
This thing and the client thing are going down today. Getting tired of them. Haha. I should be able to track it down
Richard Pijnenburg
@electrical
Dec 22 2015 15:08
Hehe cool
I'm on days off from Thursday for 2 weeks. so got a bit more time to play around with it
Jordan Halterman
@kuujo
Dec 22 2015 15:10
yeah me too, but I’ll be working on this
Jordan Halterman
@kuujo
Dec 22 2015 15:18
I gotta take a shower and head into the office. I have a lot of good stuff to work on :-)
Richard Pijnenburg
@electrical
Dec 22 2015 15:19
hehe okay. Catch you later. and thank you for all the help so far :-)
Richard Pijnenburg
@electrical
Dec 22 2015 15:30
Interesting. a test that is stuck. https://travis-ci.org/atomix/atomix/builds/98324287
Jordan Halterman
@kuujo
Dec 22 2015 15:51
Yep. I think those may be related to the client issues I'm fixing. Will be a lot easier to tell what's going on when that is fixed I think.
Richard Pijnenburg
@electrical
Dec 22 2015 15:52
Ah okay cool :-)
Jordan Halterman
@kuujo
Dec 22 2015 15:53
I guess we'll find out today!
Richard Pijnenburg
@electrical
Dec 22 2015 15:53
hehe yeah.