These are chat archives for atomix/atomix

22nd
Dec 2015
Richard Pijnenburg
@electrical
Dec 22 2015 00:01 UTC
Just got back :) So if the communication is done via Raft and I read it from the client, will it be removed from the Raft logs at that point? Or could I use it for queue persistence, so that if something crashes, for example, I won't lose events?
And no worries on 1.0 scope. Can always iterate further
My biggest worry is not losing events as soon as I receive them.
Richard Pijnenburg
@electrical
Dec 22 2015 00:06 UTC
Would be nice if raft handles
Uhg. Typing on a phone sux.
So. Android app is a bit better.
Richard Pijnenburg
@electrical
Dec 22 2015 00:12 UTC
Some plugins in logstash for example are only allowed to run once in the cluster. Having a lock to ensure it doesn't run anywhere else would be useful.
@kuujo ^^
If reading from the queue means it's not in Raft anymore, I will still need a persistent queue per machine.
Richard Pijnenburg
@electrical
Dec 22 2015 08:43 UTC
Morning all
Richard Pijnenburg
@electrical
Dec 22 2015 14:11 UTC
@kuujo getting this weird error. "incompatible types: io.atomix.Atomix cannot be converted to io.atomix.AtomixClient" when trying to build an example with a client.
@jhalterman ^^ not sure if you may know what this is caused by
Jordan Halterman
@kuujo
Dec 22 2015 14:13 UTC
odd
I’m back
Richard Pijnenburg
@electrical
Dec 22 2015 14:14 UTC
i'll gist my code. maybe i'm doing something weird?
and sorry for all the questions, it's quite a big project I've gotten started on with very little know-how about Java and stuff :p
Jordan Halterman
@kuujo
Dec 22 2015 14:16 UTC
yeah a gist would be great
it’s all good :-) Like I said, always happy to talk about it. I think it really helps work out what needs to be documented actually
Richard Pijnenburg
@electrical
Dec 22 2015 14:16 UTC
very true
Also when running tests i saw this coming up a few times: https://gist.github.com/electrical/e234708478a95a30b64a not sure if the stacktrace is expected to show up in the tests :-)
Jordan Halterman
@kuujo
Dec 22 2015 14:18 UTC
Yeah I actually saw that yesterday too. I haven’t frequently seen it in tests until now. I’m going to look into whether or not that’s a valid state. Could potentially be a race condition
Richard Pijnenburg
@electrical
Dec 22 2015 14:18 UTC
okay
Jordan Halterman
@kuujo
Dec 22 2015 14:19 UTC
Copycat tends to throw a lot of exceptions to try to catch race conditions like that.
Richard Pijnenburg
@electrical
Dec 22 2015 14:19 UTC
with regards to the client thing i was trying out. wanted to see if i could write a producing and consuming client to test out how fast i could send data across :-)
also I still saw the timeout error mentioned in #77
Richard Pijnenburg
@electrical
Dec 22 2015 14:33 UTC
Ah, interesting fix :-)
Jordan Halterman
@kuujo
Dec 22 2015 14:34 UTC
I had some notes on the issue in #77 somewhere… let me find them
Richard Pijnenburg
@electrical
Dec 22 2015 14:35 UTC
hehe okay.
'somewhere' with me usually means i lost it.
Jordan Halterman
@kuujo
Dec 22 2015 14:35 UTC
indeed
lol
somewhere searchable hopefully
Richard Pijnenburg
@electrical
Dec 22 2015 14:35 UTC
put it all in elasticsearch :-)
hah
Jordan Halterman
@kuujo
Dec 22 2015 14:36 UTC
ugh that’s what I need
we’ve talked about doing that for server logs in tests
would make it a lot easier to make sense of what’s going on in 5 servers
Richard Pijnenburg
@electrical
Dec 22 2015 14:36 UTC
yeah indeed
I do have all my Jenkins console logs forwarded to ES so i can make nice graphs and stuff of jobs
like runtimes and failed tests
Richard Pijnenburg
@electrical
Dec 22 2015 14:44 UTC
still very confused about the error with the AtomixClient thing
can't see any reason why
Jordan Halterman
@kuujo
Dec 22 2015 14:53 UTC

So, on #77 I am only able to reproduce this when the cluster is initially misconfigured. What a CopycatServer does at startup is this:

  • If a configuration is stored on disk via the MetaStore, load and use the last known cluster configuration, otherwise...
  • If the local server Address is listed in the list of server addresses provided by the user, start the server as a full member of the cluster
  • If the local server Address is not listed in the list of server addresses provided by the user, attempt to join via the provided list of servers
https://github.com/atomix/copycat/blob/master/server/src/main/java/io/atomix/copycat/server/state/ServerState.java#L97-L109
Essentially, the server uses the user-provided configuration only the first time it is started, and thereafter uses the configuration stored on disk. This allows the cluster to evolve over time without the user-provided configuration having to change. When the server is started, it will either attempt to join an existing cluster by sending a JoinRequest or immediately transition to the FOLLOWER state to participate in the Raft algorithm. Which of the two happens depends on the configuration determined by the code linked above.
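
The startup decision described in the three bullets can be sketched roughly as follows. This is a simplified model and not Copycat's actual code: the class `StartupDecision`, the `Action` enum, and the use of plain strings for addresses are all hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical model of the startup decision; not taken from Copycat. */
class StartupDecision {
    enum Action { USE_STORED_CONFIGURATION, START_AS_FOLLOWER, JOIN_EXISTING_CLUSTER }

    /**
     * @param storedMembers configuration previously persisted to disk, if any
     * @param local         this server's address, e.g. "localhost:5000"
     * @param userMembers   member addresses supplied by the user
     */
    static Action decide(Optional<List<String>> storedMembers, String local, List<String> userMembers) {
        if (storedMembers.isPresent()) {
            // A configuration on disk always wins over the user-provided one.
            return Action.USE_STORED_CONFIGURATION;
        }
        if (userMembers.contains(local)) {
            // The local address is listed: start as a full member.
            return Action.START_AS_FOLLOWER;
        }
        // The local address is not listed: ask the listed servers to let us join.
        return Action.JOIN_EXISTING_CLUSTER;
    }

    public static void main(String[] args) {
        List<String> withoutSelf = Arrays.asList("localhost:5001", "localhost:5002");
        List<String> withSelf = Arrays.asList("localhost:5000", "localhost:5001", "localhost:5002");
        System.out.println(decide(Optional.empty(), "localhost:5000", withoutSelf));
        System.out.println(decide(Optional.empty(), "localhost:5000", withSelf));
        System.out.println(decide(Optional.of(withSelf), "localhost:5000", withoutSelf));
    }
}
```

The third call shows why a restart behaves differently from a first start: once a configuration has been stored, the user-provided list no longer matters in this model.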


Address address = new Address("localhost", 5000);
List<Address> members = Arrays.asList(
  new Address("localhost", 5001),
  new Address("localhost", 5002)
);
CopycatServer server = CopycatServer.builder(address, members).build();

Because the local server’s address Address("localhost", 5000) is not listed in the members list, this code will cause the server to initially attempt to join the cluster.

Address address = new Address("localhost", 5000);
List<Address> members = Arrays.asList(
  new Address("localhost", 5000),
  new Address("localhost", 5001),
  new Address("localhost", 5002)
);
CopycatServer server = CopycatServer.builder(address, members).build();

Because the local server’s address Address("localhost", 5000) is listed in the members list, this code will cause the server to initially transition to the FOLLOWER state and begin talking to other servers.
The problem is, if all servers are started using the first method, then all servers will effectively be trying to join each other. The initial cluster needs to have some set of members that are started using the second method so that additional servers can be added. Raft requires that cluster membership changes be done as commits through the Raft log, and so at least one server needs to be started in the Raft FOLLOWER state. I’ve only been able to reproduce the join timeouts by starting all servers using the first method.
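
One way to catch this misconfiguration before starting anything is to check that at least one server lists its own address among its members. The sketch below is a hypothetical helper, not part of Copycat; `BootstrapCheck`, `hasBootstrapMember`, and the string addresses are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check for a planned cluster; not a Copycat API. */
class BootstrapCheck {
    /**
     * @param membersByServer for each server address, the member list it will be started with
     * @return true if at least one server lists itself and would therefore start as a follower
     */
    static boolean hasBootstrapMember(Map<String, List<String>> membersByServer) {
        for (Map.Entry<String, List<String>> entry : membersByServer.entrySet()) {
            if (entry.getValue().contains(entry.getKey())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Every server started the "first method" way: nobody lists itself.
        Map<String, List<String>> allJoining = new LinkedHashMap<>();
        allJoining.put("localhost:5000", Arrays.asList("localhost:5001", "localhost:5002"));
        allJoining.put("localhost:5001", Arrays.asList("localhost:5000", "localhost:5002"));
        allJoining.put("localhost:5002", Arrays.asList("localhost:5000", "localhost:5001"));
        System.out.println(hasBootstrapMember(allJoining)); // prints false

        // Start 5000 the "second method" way by including its own address.
        allJoining.put("localhost:5000",
            Arrays.asList("localhost:5000", "localhost:5001", "localhost:5002"));
        System.out.println(hasBootstrapMember(allJoining)); // prints true
    }
}
```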

Once a server has been started and successfully joined the cluster, thereafter it will use the configuration stored on disk (if the StorageLevel is DISK or MAPPED). So, the second time you start the server in the first example, it will immediately transition to FOLLOWER and continue its participation in the Raft algorithm (assuming it was able to successfully join the first time it started).

If this is not what’s causing the startup failure then it could be a bug that I can’t reproduce. Also, when testing, keep in mind that with StorageLevel.DISK (the default), state is still loaded from disk on subsequent runs. I typically use StorageLevel.MEMORY for a lot of testing and then switch to StorageLevel.DISK for final tests.

BTW you may benefit from using StorageLevel.MEMORY or StorageLevel.MAPPED which are waaaaay faster than disk
Richard Pijnenburg
@electrical
Dec 22 2015 14:56 UTC
Okay, my issue is not so much building the cluster anymore, because that works. For me it's that the leader changes for unknown reasons, which gives the timeout errors.
Jordan Halterman
@kuujo
Dec 22 2015 14:56 UTC
ahh gotcha
Richard Pijnenburg
@electrical
Dec 22 2015 14:57 UTC
perhaps i should have opened a new issue about it :-)
but your explanation is very clear and defo makes sense.
Jordan Halterman
@kuujo
Dec 22 2015 14:59 UTC
I have been trying to look into the leader change issue and haven’t had much success yet. It’s easy to reproduce in tests, but difficult to tell exactly why it’s happening. In tests I’ve seen all servers pause for as long as 10 seconds, causing a leader change. I haven’t really been able to reproduce it outside of tests though. I’ve found that increasing the election timeout seems to help a lot, but that’s a terrible solution if servers are getting blocked in non-test environments:
// Reuses the address and members from the examples above
CopycatServer server = CopycatServer.builder(address, members)
  .withElectionTimeout(Duration.ofSeconds(1))
  .build();
Richard Pijnenburg
@electrical
Dec 22 2015 15:00 UTC
Hmm yeah indeed.
Jordan Halterman
@kuujo
Dec 22 2015 15:00 UTC
I have some logs from tests actually...
14:16:52.190 [5001] DEBUG i.a.c.server.state.LeaderAppender - 5001 - Sent AppendRequest[term=1, leader=2130712285, logIndex=5, logTerm=1, entries=[0], commitIndex=5, globalIndex=5] to 5004
14:16:53.238 [5002] DEBUG i.a.c.server.state.FollowerState - 5002 - Heartbeat timed out in PT1.28S
14:16:53.341 [5003] DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT1.387S
14:16:53.422 [5004] DEBUG i.a.c.server.state.FollowerState - 5004 - Heartbeat timed out in PT1.469S
14:16:53.432 [5005] DEBUG i.a.c.server.state.FollowerState - 5005 - Heartbeat timed out in PT1.479S
In this case, the leader sends a heartbeat to a follower, but the heartbeat doesn’t arrive for a couple seconds. Keep in mind, this isn’t even over a network. This is just threading.
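
To illustrate how plain threading can delay a timer without any network involved, here is a minimal sketch using only the JDK. It is not a diagnosis of the Copycat issue and does not use Copycat code; `LateTimerDemo` and `measureTimerDelay` are hypothetical names. A timeout is scheduled on a single-threaded executor whose only thread is busy, so the timer can only fire once the thread is free again.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a timer on a blocked single thread fires late. */
class LateTimerDemo {
    /**
     * Blocks the executor's only thread for blockMillis, schedules a timer for
     * timeoutMillis, and returns how many milliseconds passed before the timer ran.
     */
    static long measureTimerDelay(long timeoutMillis, long blockMillis) {
        ScheduledExecutorService thread = Executors.newSingleThreadScheduledExecutor();
        try {
            final long start = System.nanoTime();
            // Occupy the only thread, standing in for whatever is blocking a server.
            thread.execute(() -> sleepQuietly(blockMillis));
            // The "heartbeat timeout": it reports when it actually got to run.
            Callable<Long> timer = () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            return thread.schedule(timer, timeoutMillis, TimeUnit.MILLISECONDS).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            thread.shutdownNow();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long elapsed = measureTimerDelay(100, 500);
        System.out.println("100 ms timer fired after " + elapsed + " ms");
    }
}
```

The timer is configured for 100 ms but cannot run until the 500 ms blocking task finishes, which is the same shape as a short heartbeat timeout being reported seconds late.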
Richard Pijnenburg
@electrical
Dec 22 2015 15:02 UTC
interesting
Jordan Halterman
@kuujo
Dec 22 2015 15:02 UTC
14:32:09.136 [5001] DEBUG i.a.copycat.server.state.ServerState - 5001 - Sending server identification to 5002
14:32:09.138 [5003] DEBUG i.a.copycat.server.state.ServerState - 5003 - Sending server identification to 5002
14:32:17.633 [5001] DEBUG i.a.c.server.state.FollowerState - 5001 - Heartbeat timed out in PT0.79S
14:32:17.633 [5003] DEBUG i.a.c.server.state.FollowerState - 5003 - Heartbeat timed out in PT0.85S
This one is over 8 seconds, but I’ve only seen these in Copycat tests and not in Atomix
Richard Pijnenburg
@electrical
Dec 22 2015 15:03 UTC
makes you wonder what it is that's blocking it if it's all just local stuff
Jordan Halterman
@kuujo
Dec 22 2015 15:03 UTC
note that the heartbeat timeout was 0.79S but it still took 8 seconds to trigger
something in Copycat tests blocks all servers, but I haven’t actually been able to reproduce it outside of Copycat tests
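
The gap can be worked out from the timestamps in the log lines for server 5001. This is only a rough sketch: it assumes the election timer was last reset at about the time of the preceding log line, which the logs do not confirm, and `HeartbeatOvershoot` is a hypothetical helper name.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical helper for comparing a logged timeout with the observed delay. */
class HeartbeatOvershoot {
    /**
     * @param lastActivity      timestamp of the last log line before the timeout
     * @param timedOutAt        timestamp of the "Heartbeat timed out" line
     * @param configuredTimeout timeout as logged, in ISO-8601 form such as "PT0.79S"
     * @return how much longer than the configured timeout the timer took to fire
     */
    static Duration overshoot(String lastActivity, String timedOutAt, String configuredTimeout) {
        Duration observed = Duration.between(LocalTime.parse(lastActivity), LocalTime.parse(timedOutAt));
        return observed.minus(Duration.parse(configuredTimeout));
    }

    public static void main(String[] args) {
        // Values copied from the 5001 log lines: identification sent, then timeout.
        Duration late = overshoot("14:32:09.136", "14:32:17.633", "PT0.79S");
        System.out.println(late.toMillis() + " ms late"); // prints "7707 ms late"
    }
}
```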
Richard Pijnenburg
@electrical
Dec 22 2015 15:03 UTC
hmm yeah
but it also happens outside the tests
the timeouts i saw were in the leader election example
Jordan Halterman
@kuujo
Dec 22 2015 15:04 UTC
I’ll play with it more today
what’s your environment like?
Richard Pijnenburg
@electrical
Dec 22 2015 15:05 UTC
I'm currently running it on an Ubuntu 14.04 VM ( with kvm )
with 2 cpu's
Jordan Halterman
@kuujo
Dec 22 2015 15:07 UTC
Java version?
Richard Pijnenburg
@electrical
Dec 22 2015 15:07 UTC
jre1.8.0_66
Jordan Halterman
@kuujo
Dec 22 2015 15:07 UTC
cool
This thing and the client thing are going down today. Getting tired of them. Haha. I should be able to track it down
Richard Pijnenburg
@electrical
Dec 22 2015 15:08 UTC
Hehe cool
I'm on days off from Thursday for 2 weeks. so got a bit more time to play around with it
Jordan Halterman
@kuujo
Dec 22 2015 15:10 UTC
yeah me too, but I’ll be working on this
Jordan Halterman
@kuujo
Dec 22 2015 15:18 UTC
I gotta take a shower and head into the office. I have a lot of good stuff to work on :-)
Richard Pijnenburg
@electrical
Dec 22 2015 15:19 UTC
hehe okay. Catch you later. and thank you for all the help so far :-)
Richard Pijnenburg
@electrical
Dec 22 2015 15:30 UTC
Interesting. a test that is stuck. https://travis-ci.org/atomix/atomix/builds/98324287
Jordan Halterman
@kuujo
Dec 22 2015 15:51 UTC
Yep. I think those may be related to the client issues I'm fixing. Will be a lot easier to tell what's going on when that is fixed I think.
Richard Pijnenburg
@electrical
Dec 22 2015 15:52 UTC
Ah okay cool :-)
Jordan Halterman
@kuujo
Dec 22 2015 15:53 UTC
I guess we'll find out today!
Richard Pijnenburg
@electrical
Dec 22 2015 15:53 UTC
hehe yeah.