These are chat archives for atomix/atomix

3rd Aug 2016
Jeff Nelson
@jtnelson
Aug 03 2016 11:31
@kuujo it seems Raghav and I are having issues with bootstrapping the server...the call to bootstrap() returns a Future that never returns. Off the top of your head, can you think of anything that would cause the bootstrap process to fail
Jordan Halterman
@kuujo
Aug 03 2016 15:41
@jtnelson it depends on the arguments to bootstrap. There are two ways you can bootstrap a cluster. First, you can call bootstrap on a single node without any arguments and then join other nodes to it. Second, you can call bootstrap on all nodes and pass the full list of Addresses on each node. That results in the cluster being started with the provided configuration, but it also requires that bootstrap be called on at least a majority of nodes to start the cluster. The CompletableFuture returned by the bootstrap and join methods will be completed once a leader is elected. So, if you bootstrap a single node with no arguments, it will form a single-node cluster, elect itself leader, and then complete the bootstrap. Then, when additional nodes join that cluster, you call join with the bootstrapped node's Address. The joining node will request to join the bootstrapped node, the bootstrapped node will commit a configuration change, and once the joining node finds the leader the returned CompletableFuture will be completed, etc.
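In code, the first approach looks roughly like this (just a sketch - buildServer stands in for whatever builder code you're using, and the host names/ports are placeholders):

```java
import io.atomix.catalyst.transport.Address;
import io.atomix.copycat.server.CopycatServer;

// Bootstrap the first node with no arguments: it forms a single-node
// cluster, elects itself leader, and the returned future completes.
CopycatServer first = buildServer(new Address("host1", 5000)); // buildServer = your own builder code
first.bootstrap().join();

// Join a second node using the bootstrapped node's Address. The returned
// future completes once the configuration change is committed and the
// joining node has found the leader.
CopycatServer second = buildServer(new Address("host2", 5000));
second.join(new Address("host1", 5000)).join();
```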
You may want to enable DEBUG logging. Copycat prints pretty extensive logs about what's going on internally, and if you paste them I'd be glad to look at them. They should show the problem.
Logs will show every configuration change that occurs, every entry that's written to the log, every entry that's applied to the state machine, every request that's sent, and every response that's received
Jordan Halterman
@kuujo
Aug 03 2016 15:47
The example here uses the method of bootstrapping that bootstraps the entire cluster. That just requires a majority of that cluster to be bootstrapped at the same time. Prior to a majority being bootstrapped the call will hang. You can just try bootstrap() to bootstrap a single node. If that hangs then it's likely the server startup is hanging for some reason.
Bootstrapping consensus-based clusters is a rather inelegant process - because of the strictness of the configurations necessary to achieve partition tolerance - that we've tried to simplify as much as possible.
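For the full-cluster method, each node is bootstrapped with the same member list, something like this (a sketch; the addresses are placeholders and must match on every node):

```java
import java.util.Arrays;
import java.util.List;
import io.atomix.catalyst.transport.Address;

// The same member list is passed to bootstrap on every node.
List<Address> cluster = Arrays.asList(
    new Address("localhost", 5000),
    new Address("localhost", 5001),
    new Address("localhost", 5002));

// Blocks until at least a majority of the listed nodes have called
// bootstrap and a leader has been elected.
server.bootstrap(cluster).join();
```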
Jeff Nelson
@jtnelson
Aug 03 2016 16:04
@kuujo how can i turn on DEBUG logging?
is it using logback?
Jordan Halterman
@kuujo
Aug 03 2016 16:21
slf4j... One sec I have a configuration
@jtnelson you'll have to add the logback dependency: https://github.com/atomix/copycat/blob/master/test/src/test/resources/logback.xml
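For reference, a minimal logback.xml in the same vein (with the ch.qos.logback:logback-classic dependency on the classpath) might look like this - the pattern is just an example:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.classic.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- DEBUG on the root logger enables Copycat's internal logging -->
  <root level="DEBUG">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```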
Jeff Nelson
@jtnelson
Aug 03 2016 16:27
@kuujo in the example you provided above for the ValueStateMachine
that code assumes the servers in the cluster were started
somewhere else, right?
Jordan Halterman
@kuujo
Aug 03 2016 16:42
Servers are started in ValueStateMachineExample. The server is configured with ValueStateMachine and the client example connects to the server example to submit commands to it
TBH those examples could be better... I actually use them for performance testing - start a cluster of 3 nodes and then start a bunch of clients. All the clients submit commands non-stop
Run the server(s) with java -jar value-state-machine.jar path/to/log localhost:5000 localhost:5001 localhost:5002. The log path must be unique for each server if they're being run on the same machine. The first host:port is the local server's host/port and the remaining host:port tuples are remote servers.
Once those are started run java -jar value-client.jar localhost:5000 localhost:5001 localhost:5002 to connect a client and it will start writing a bunch of commands. Then you can arbitrarily kill servers and what not to play around with it
Jordan Halterman
@kuujo
Aug 03 2016 16:48
That example will effectively do bootstrap(new Address("localhost", 5000), new Address("localhost", 5001), new Address("localhost", 5002)) in the case of the first example above
Jeff Nelson
@jtnelson
Aug 03 2016 16:51
@kuujo do you have 5 mins to do a screenshare so I can walk you through my code?
Jordan Halterman
@kuujo
Aug 03 2016 16:53
Unfortunately about to walk into a meeting ATM
Jeff Nelson
@jtnelson
Aug 03 2016 16:53
okay, i'll ping you later
Jordan Halterman
@kuujo
Aug 03 2016 16:54
Sounds good
Jeff Nelson
@jtnelson
Aug 03 2016 16:54
the issue I'm having now is my server is never running
after I build the server and call server.isRunning() in a loop, it loops forever
do I need to do anything else to start the server after building it?
Jordan Halterman
@kuujo
Aug 03 2016 16:57
Hmm... The only things the server needs are a configured Transport and configured Storage. That's it. bootstrap() should start it with no problem. The only real risks for hanging I've seen are a failure to elect a leader or a blocked event thread, but the latter shouldn't happen when the server is being started. The ValueStateMachine example should show everything that's necessary to start the server.
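Something like this should be all that's needed (a sketch - the path and port are placeholders, and NettyTransport's package differs between Catalyst versions):

```java
import java.io.File;
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport; // io.atomix.catalyst.transport in older versions
import io.atomix.copycat.server.CopycatServer;
import io.atomix.copycat.server.storage.Storage;
import io.atomix.copycat.server.storage.StorageLevel;

CopycatServer server = CopycatServer.builder(new Address("localhost", 5000))
    .withStateMachine(ValueStateMachine::new) // your StateMachine supplier
    .withTransport(new NettyTransport())
    .withStorage(Storage.builder()
        .withDirectory(new File("path/to/log")) // must be unique per server on one machine
        .withStorageLevel(StorageLevel.DISK)    // or StorageLevel.MEMORY while experimenting
        .build())
    .build();

// Single-node bootstrap; should complete promptly if the server starts cleanly.
server.bootstrap().join();
```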
Jeff Nelson
@jtnelson
Aug 03 2016 19:27
@kuujo FYI I figured out what was going on...stale log file
when the server starts up, it seems to be replaying all the commits in the log but it doesn't restore the snapshot via the #install method. Is that expected behaviour?
Jordan Halterman
@kuujo
Aug 03 2016 20:25
Ahh, I should have suspected that. Typically the reason a stale configuration has that effect is that some old cluster configuration was written to the log and is being used at startup. I'd recommend testing with StorageLevel.MEMORY and only switching to disk later, to avoid having to wipe out the logs while you're messing with configurations...
Jordan Halterman
@kuujo
Aug 03 2016 20:37
When a snapshot is taken and when it's installed depend on a variety of factors, but certainly when a server that has taken a snapshot is restarted, the snapshot should be installed on startup. You should see log messages like Taking snapshot and Installing snapshot. But it won't always be the case that snapshots are installed at the very start of recovery. Copycat actually allows entries to be retained in the log prior to the snapshot. So, certain entries can be snapshotted and others can be compacted via the incremental compaction algorithm. This means snapshots are actually installed at the index at which they were taken. If a snapshot is taken at index 100, during recovery any entries prior to 100 will be replayed and then the snapshot will be installed.
Even when a snapshot is taken, it's not committed until certain criteria are met. For instance, events that are triggered by commands stored in the snapshot need to be received and acknowledged by clients before the snapshot can be completed, otherwise the commands will be compacted and events may never be received by clients.
takeSnapshot requests the snapshot from the state machine and completeSnapshot commits it once it's safe to do so
installSnapshot installs the snapshot to the state machine once it's at the appropriate index
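On the state machine side, those internals correspond to the Snapshottable hooks. A minimal sketch (the class name is hypothetical and command handlers are omitted; state here is a single long):

```java
import io.atomix.copycat.server.Snapshottable;
import io.atomix.copycat.server.StateMachine;
import io.atomix.copycat.server.storage.snapshot.SnapshotReader;
import io.atomix.copycat.server.storage.snapshot.SnapshotWriter;

public class CounterStateMachine extends StateMachine implements Snapshottable {
  private long value;

  @Override
  public void snapshot(SnapshotWriter writer) {
    // Called when Copycat takes a snapshot at the current index.
    writer.writeLong(value);
  }

  @Override
  public void install(SnapshotReader reader) {
    // Called on recovery once replay reaches the index at which
    // the snapshot was taken.
    value = reader.readLong();
  }
}
```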