These are chat archives for atomix/atomix

24th
Feb 2017
kirankumargithub
@kirankumargithub
Feb 24 2017 05:38
Hi. We are trying to use Atomix as part of a microservice built with DropWizard and deployed to DCOS/Docker containers. Is there any way Atomix can auto-discover the cluster members through multicast or some other mechanism, since DCOS dynamically scales the service by deploying it to multiple nodes? Any help is appreciated.
Jienan Zeng
@jienan
Feb 24 2017 06:44

Cluster environment:

master(127.0.0.1:9999)
follower1(127.0.0.1:9998)
follower2(127.0.0.1:9997)

client (connect to follower2)

the client just puts 1KB of bytes to the server
when all members are up, TPS is 200
when follower1 is killed, TPS falls to 20
when follower1 is recovered, TPS returns to 200

This is a very strange phenomenon

Jordan Halterman
@kuujo
Feb 24 2017 06:46
That is indeed really strange
Lemme think...
Jordan Halterman
@kuujo
Feb 24 2017 06:52

Performance should definitely remain the same and probably improve a bit with a follower down. There are instances where that may not be the case, but it certainly shouldn't happen if all the nodes are running on one machine. Quorum-based protocols tend to take advantage of the fastest nodes in the cluster. So, if you kill the follower that has been participating in commits, you may be falling back to a slower follower. But that drop in TPS seems drastic and shouldn't happen when they're all on the same machine. It seems to indicate a flaw in the protocol.

One issue that can occur after a follower is killed is the leader could start trying to replicate large batches of entries to a failed follower. If the follower is down, the leader could be reading and attempting to send entries unnecessarily. But Copycat guards against that by setting the follower's status to UNAVAILABLE and then only sending heartbeats periodically to see if it's alive. So, that shouldn't be what's happening.

I'll see if I can reproduce it

Jienan Zeng
@jienan
Feb 24 2017 07:01
Thank you very much for your timely response.
I also see the same strange phenomenon in a cluster of three separate machines.
I hope I am not mistaken.
Jordan Halterman
@kuujo
Feb 24 2017 07:08
hmm
Jienan Zeng
@jienan
Feb 24 2017 07:56
maybe establishing a new connection to the failed follower is very time-consuming
Jordan Halterman
@kuujo
Feb 24 2017 07:57
all you’re doing is writing 1K?
no events or anything?
I have a test set up… one sec
Jienan Zeng
@jienan
Feb 24 2017 08:02
yeah, just 1K, and ValueStateMachine does nothing. I don't think the 1K matters
Jordan Halterman
@kuujo
Feb 24 2017 08:02
k perfect
Jordan Halterman
@kuujo
Feb 24 2017 08:08
I have a local performance test that uses the test state machine which is basically the same as ValueStateMachine. It starts a cluster of n/m nodes and submits commands as fast as possible. Both 2/3 and 3/3 nodes are doing about 8k/sec. I did notice that when a node is down, the leader attempts far too many connections when it tries to heartbeat the down node, but it’s still a small enough number to not really make a difference. Does need to be fixed though. I’ll modify the test to try blocking commands to see if anything changes
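The gist of the test loop is roughly this, blocking style shown (a sketch from memory, not the literal test code; depending on the version the member addresses go to the builder or to connect(), and the transport package and SetCommand constructor may differ slightly):
```java
// Rough sketch of the blocking perf test loop, NOT the literal test code.
// Builder/connect signatures, the transport package, and the SetCommand
// constructor are from memory and may differ slightly between versions.
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport;
import io.atomix.copycat.client.CopycatClient;
import io.atomix.copycat.examples.SetCommand;

public class ValueClientPerfTestSketch {
  public static void main(String[] args) {
    // Depending on the version, addresses are passed to builder(...) or to connect(...).
    CopycatClient client = CopycatClient.builder(new Address("127.0.0.1", 9997))
        .withTransport(new NettyTransport())
        .build();
    client.connect().join();

    byte[] value = new byte[1024]; // the 1KB payload being written
    long start = System.currentTimeMillis();
    long count = 0;

    while (true) {
      // Blocking style: wait for each write to commit before submitting the next.
      // The async variant just resubmits from the completion callback instead.
      client.submit(new SetCommand(value)).join();
      count++;

      long elapsed = System.currentTimeMillis() - start;
      if (elapsed >= 1000) {
        System.out.println("Completed " + count + " writes in " + elapsed + " milliseconds");
        count = 0;
        start = System.currentTimeMillis();
      }
    }
  }
}
```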
the blocking client does about 2k-3k/sec with all the nodes and >3k/sec with 2/3 nodes
hmm
Jienan Zeng
@jienan
Feb 24 2017 08:18
does your client connect to another follower?
Jienan Zeng
@jienan
Feb 24 2017 08:30
[image]
my test server command is shown in the image above
the code is from the master branch
Jordan Halterman
@kuujo
Feb 24 2017 08:43

Clients don't connect to any specific server unless you force them to by setting a ServerSelectionStrategy. Existing strategies are ANY, LEADER, and FOLLOWERS.

Even if you only give a client one node to connect to, it will initially connect to that node but learn about all the other nodes when it does, so if that node crashes it can reconnect to a new node. Clients are limited only by the configured strategy, which usually relates to the state of the nodes rather than to specific addresses.

But the client switching from one follower to another certainly shouldn't impact performance like that
You didn't kill the leader, so there's no reason it would have switched from a leader to a follower
But I'll try it with different strategies
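FWIW, forcing a strategy is just a client builder option, something like this (from memory, so the exact builder method and enum names might be slightly off):
```java
// Sketch of pinning a client to a selection strategy. The builder method name
// and enum location are from memory, so double-check them against the version
// you're running.
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport;
import io.atomix.copycat.client.CopycatClient;
import io.atomix.copycat.client.ServerSelectionStrategies;

public class ClientStrategyExample {
  public static void main(String[] args) {
    CopycatClient client = CopycatClient.builder(new Address("127.0.0.1", 9997))
        .withTransport(new NettyTransport())
        // ANY is the default; LEADER and FOLLOWERS steer the client toward
        // the leader or the followers respectively.
        .withServerSelectionStrategy(ServerSelectionStrategies.FOLLOWERS)
        .build();

    // The client starts with the one address above but learns the full
    // membership on connect, so it can fail over to other nodes.
    client.connect().join();
  }
}
```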
Jordan Halterman
@kuujo
Feb 24 2017 08:56
whoooooah
Jienan Zeng
@jienan
Feb 24 2017 09:06

I used the latest code from the master branch to reproduce:

Cluster environment:

master(127.0.0.1:9999)
follower1(127.0.0.1:9998)
follower2(127.0.0.1:9997)

client (connect to follower2)
  1. start servers
    command is:
    java -classpath value-state-machine.jar io.atomix.copycat.examples.ValueStateMachineExample /tmp/copycat2 127.0.0.1:9998 127.0.0.1:9999 127.0.0.1:9997
    Cluster environment:
    master(127.0.0.1:9999)
    follower1(127.0.0.1:9998)
    follower2(127.0.0.1:9997)

  2. java -classpath value-client.jar io.atomix.copycat.perftest.ValueClientPerfTest 127.0.0.1:9997|grep Completed
    (I enabled debug logging)
    the tps:
    Completed 218 writes in 999 milliseconds
    Completed 255 writes in 1000 milliseconds
    Completed 271 writes in 1000 milliseconds
    Completed 250 writes in 1000 milliseconds
    Completed 286 writes in 1000 milliseconds
    Completed 267 writes in 1000 milliseconds
    Completed 203 writes in 1000 milliseconds

  3. kill follower1

the tps:
Completed 41 writes in 1000 milliseconds
Completed 40 writes in 1000 milliseconds
Completed 37 writes in 1000 milliseconds
Completed 38 writes in 1000 milliseconds
Completed 35 writes in 1000 milliseconds
Completed 34 writes in 1000 milliseconds
Completed 34 writes in 1000 milliseconds
Completed 27 writes in 1000 milliseconds

  4. recover follower1

the tps:
Completed 152 writes in 1000 milliseconds
Completed 212 writes in 1000 milliseconds
Completed 204 writes in 1000 milliseconds
Completed 220 writes in 1001 milliseconds
Completed 258 writes in 999 milliseconds

Jordan Halterman
@kuujo
Feb 24 2017 09:06
I just fixed a bug… I’ll push and merge it momentito and then try out master again
Jienan Zeng
@jienan
Feb 24 2017 09:07
You found it?
Jordan Halterman
@kuujo
Feb 24 2017 09:08
well, I fixed a definite bug that was affecting performance. Not sure if it's that one. I'll need to see debug logs for the client to figure out what's going on if this doesn't fix it
Jordan Halterman
@kuujo
Feb 24 2017 09:13
@jienan try master now. I’m going to have to get some sleep in a few, but if that fix in master doesn’t fix your problem, can you enable DEBUG logging for the client and send over some logs?
Jienan Zeng
@jienan
Feb 24 2017 09:14
ok
Jordan Halterman
@kuujo
Feb 24 2017 09:20
Copycat 1.2.1 and Atomix 1.0.1 are released
cc @jhall11
Jordan Halterman
@kuujo
Feb 24 2017 09:51

@kirankumargithub the short answer is no.

The long answer is: doing service discovery using those types of protocols risks split brain, and that's an unacceptable risk for a consensus-based system. If servers are allowed to start and bootstrap a cluster when no other nodes exist, then multiple nodes can bootstrap separate clusters under a network partition. It would be possible to bootstrap one node and use service discovery to join the bootstrapped node, but all the nodes can also just join the bootstrapped node directly.

There's also an issue with scalability. Atomix clusters are not designed to scale performance. They're designed for consistency and fault tolerance, so while adding arbitrary nodes can improve fault tolerance, it may ultimately harm performance. You also run into issues if you want to remove nodes automatically: a 2-node cluster can't scale down to 1 after a failure. Additionally, it would reduce the user's control over the size of the quorum.

In Atomix 2.0, a lot of these issues will be addressed, but only to an extent. Sharding mechanisms will be merged back into Atomix, which will allow for dynamically scaling most of the cluster. But a core shard is still required to manage all shards to prevent the split brain scenario above, and service discovery will still have to be done by connecting to that core cluster.
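To make the bootstrap-then-join pattern above concrete, it looks roughly like this with the Atomix 1.x AtomixReplica API (a sketch from memory; method and package names may be slightly off, and the addresses/directories are made up for the example):
```java
// Sketch of bootstrap-then-join cluster formation. Names are approximate and
// the address/directory values are invented for illustration.
import io.atomix.AtomixReplica;
import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport;
import io.atomix.copycat.server.storage.Storage;

import java.util.Arrays;

public class ClusterFormationSketch {
  public static void main(String[] args) {
    AtomixReplica replica = AtomixReplica.builder(new Address("10.0.0.1", 8700))
        .withTransport(new NettyTransport())
        .withStorage(Storage.builder().withDirectory("logs/replica1").build())
        .build();

    // Exactly one well-known node (or a fixed core set) bootstraps the cluster...
    replica.bootstrap().join();

    // ...and every other node joins it. This is where service discovery could
    // plug in safely, as long as it only ever yields addresses to join and
    // never triggers a second bootstrap:
    // replica.join(Arrays.asList(new Address("10.0.0.1", 8700))).join();
  }
}
```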

Jienan Zeng
@jienan
Feb 24 2017 15:40

@kuujo the problem still exists with the new code.

I changed some code, but I'm not sure whether it is correct.
Testing again with the modified code, performance improves a bit with a follower down, as expected.

[image]
Jordan Halterman
@kuujo
Feb 24 2017 18:08
@jienan is the client submitting only commands or is it also querying?
Jordan Halterman
@kuujo
Feb 24 2017 18:18
I guess this is a workable solution - using the server's PollRequest to set its status when it comes back up. It also perhaps can be done on JoinRequest. But I think it's likely just going to hide a deficiency. There's no reason attempting to send heartbeats to a down follower should be causing that much overhead. Once the follower is down, it doesn't need to be counted in commitment, so the failure of those connections should have no impact on write performance. I'm wondering if it's because queries are being submitted and the leader is attempting the heartbeat every time a query is linearized. I'd need to see the leader's logs to tell what's going on, but I'll keep playing with it.
Jordan Halterman
@kuujo
Feb 24 2017 18:27
My tests were doing the same thing over localhost last night too, and I didn't see any performance degradation. Those connection attempts only occur on heartbeats, which happen around once a second by default. They are supposed to back off after 3 failures IIRC, which they don't seem to be doing.
Jordan Halterman
@kuujo
Feb 24 2017 18:40
@jienan I think I reproduced it and will find an elegant way to solve the problem
Jordan Halterman
@kuujo
Feb 24 2017 19:31
@jienan atomix/copycat#285 fixes the problem using a mixture of exponential backoff and the PollRequest status change.
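The backoff half of it is nothing fancy, conceptually just something like this (illustrative only, not the actual patch; the class and field names are made up):
```java
// Purely illustrative sketch of exponential backoff for heartbeating an
// unavailable member; this does not mirror the real Copycat internals.
import java.time.Duration;

final class HeartbeatBackoff {
  private static final Duration BASE = Duration.ofMillis(500);
  private static final Duration MAX = Duration.ofSeconds(30);

  private int failures;

  void onFailure() {     // a heartbeat/connection attempt failed
    failures++;
  }

  void onSuccess() {     // the member responded (or polled us), so reset
    failures = 0;
  }

  Duration nextDelay() { // base * 2^failures, capped at MAX
    long millis = BASE.toMillis() * (1L << Math.min(failures, 16));
    return millis >= MAX.toMillis() ? MAX : Duration.ofMillis(millis);
  }
}
```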
Jon Hall
@jhall11
Feb 24 2017 19:36
I think we were able to reproduce the IndexOutOfBounds exception. I got the logs, but haven’t had a chance to attach a debugger and look at what is actually happening yet
Jordan Halterman
@kuujo
Feb 24 2017 20:24
Yay!
I'll run my test... I still have it laying around
:pray:
Jordan Halterman
@kuujo
Feb 24 2017 20:39
no such luck
lemme know what happens with the debugger, but I’ll poke around in the logs
Jordan Halterman
@kuujo
Feb 24 2017 20:48
@jhall11 this one does definitely look like the same issue
well, mostly I think
vishwas
@vishwass
Feb 24 2017 20:55
Hi guys, I was trying to run the Copycat example. I am able to run the server, but the client is unable to run. Do you have any suggestions?
Jordan Halterman
@kuujo
Feb 24 2017 20:55
what do you mean by unable to run?
Here’s an example command I use to run it:
java -jar examples/value-client/target/value-client.jar localhost:5000 localhost:5001 localhost:5002
Jon Hall
@jhall11
Feb 24 2017 20:57
It looks like the same thing, lastApplied is 0 again, which is causing the IOOB exception
Jordan Halterman
@kuujo
Feb 24 2017 20:57
hmm….
I should be able to reproduce this with the logs, I just need to take a more careful approach
need to go back and look at our chat to refresh my memory
vishwas
@vishwass
Feb 24 2017 21:02
@kuujo it's not showing any logs
Jordan Halterman
@kuujo
Feb 24 2017 21:03
that’s odd
should at least be showing failures
vishwas
@vishwass
Feb 24 2017 21:04
the value-state-machine example looks like it's running fine
but the value client is not going forward
Jordan Halterman
@kuujo
Feb 24 2017 21:07
that’s odd… there are no logs whatsoever printing when you run it?
just tried master and it’s working for me
java -jar examples/value-client/target/value-client.jar localhost:5000 localhost:5001 localhost:5002
13:06:05.680 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Setting up connection to localhost/127.0.0.1:5000
13:06:05.682 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Sending ConnectRequest[client=3f314e2f-1ff3-4e46-bb77-b19738cc754c]
13:06:05.718 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Received ConnectResponse[status=OK, error=null, leader=localhost/127.0.0.1:5000, members=[localhost/127.0.0.1:5000, localhost/127.0.0.1:5001, localhost/127.0.0.1:5002]]
13:06:05.729 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - Received RegisterResponse[status=OK, error=null, session=5, leader=localhost/127.0.0.1:5000, members=[localhost/127.0.0.1:5000, localhost/127.0.0.1:5001, localhost/127.0.0.1:5002]]
13:06:05.732 [copycat-client-io-1] DEBUG i.a.c.client.DefaultCopycatClient - State changed: CONNECTED
13:06:05.732 [copycat-client-io-1] INFO  i.a.c.client.session.ClientSession - Registered session 5
13:06:05.733 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - 5 - Sending KeepAliveRequest[session=5, commandSequence=0, eventIndex=5]
13:06:05.738 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - 5 - Sending CommandRequest[session=5, sequence=1, command=io.atomix.copycat.examples.SetCommand@3be69f76]
13:06:05.739 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Connecting to localhost/127.0.0.1:5000
13:06:05.740 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Setting up connection to localhost/127.0.0.1:5000
13:06:05.740 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Sending ConnectRequest[client=3f314e2f-1ff3-4e46-bb77-b19738cc754c]
13:06:05.743 [copycat-client-io-1] DEBUG i.a.c.client.util.ClientConnection - Received ConnectResponse[status=OK, error=null, leader=localhost/127.0.0.1:5000, members=[localhost/127.0.0.1:5000, localhost/127.0.0.1:5001, localhost/127.0.0.1:5002]]
13:06:05.789 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - 5 - Received CommandResponse[status=OK, error=null, index=6, eventIndex=5, result=null]
13:06:05.790 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - 5 - Received KeepAliveResponse[status=OK, error=null, leader=localhost/127.0.0.1:5000, members=[localhost/127.0.0.1:5000, localhost/127.0.0.1:5001, localhost/127.0.0.1:5002]]
13:06:10.795 [copycat-client-io-1] DEBUG i.a.c.client.session.ClientSession - 5 - Sending CommandRequest[session=5, sequence=2, command=io.atomix.copycat.examples.SetCommand@50f3a0cd]
etc
:worried:
hmmm
vishwas
@vishwass
Feb 24 2017 21:11
do we need to give the classpath for the main class?
Jordan Halterman
@kuujo
Feb 24 2017 21:11
not if you use the command above ^^
the Maven shade plugin is used to build that jar
value-client.jar
vishwas
@vishwass
Feb 24 2017 21:14
oh I just did mvn install. Is that enough?
and are you using the latest version of Copycat?
Jordan Halterman
@kuujo
Feb 24 2017 21:21
yeah
you just have to do mvn package and that will build the jar
then java -jar examples/value-client/target/value-client.jar localhost:5000 localhost:5001 localhost:5002 (using the host:port of the servers you started)
Jordan Halterman
@kuujo
Feb 24 2017 21:33
@jhall11 I reproduced it!
:clap:
sdkjfhnlkasjhncl.kj,shman
salfkcjpnoaislfucjnpaeoisujc;oai4wlejrmf9o4wijmc9iowjfn9ui4hnp439roi4930puifjmoikjanmf co;wilaksjmc ska.,/m!!!!!!!!!!!!!!!!!!!!!!!!!
ugh
I think it should be an easy fix
just have to put on my thinking cap for a few minutes
Jordan Halterman
@kuujo
Feb 24 2017 21:39
hmm actually it would be an easy fix if we want to potentially hide the real issue :-)
So, lastApplied being 0 is indeed correct
that’s the initial state when the server starts, and then it starts applying entries from index 1 (Raft log indexes are usually 1-based)
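in generic Raft terms (not Copycat's actual code) the apply loop is basically just:
```java
// Generic Raft-style apply loop (illustrative, not Copycat's implementation).
// lastApplied starts at 0 and real entries start at index 1, so lastApplied == 0
// on a freshly started server just means "nothing applied yet".
import java.util.List;
import java.util.function.Consumer;

final class ApplyLoopSketch<E> {
  private final List<E> log;              // entry at Raft index i stored at list position i - 1
  private final Consumer<E> stateMachine;
  private long lastApplied = 0;           // initial state on startup

  ApplyLoopSketch(List<E> log, Consumer<E> stateMachine) {
    this.log = log;
    this.stateMachine = stateMachine;
  }

  void applyUpTo(long commitIndex) {
    while (lastApplied < commitIndex) {
      lastApplied++;
      stateMachine.accept(log.get((int) (lastApplied - 1)));
    }
  }
}
```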
but what’s happening when the logs are loaded from disk is absolutely not correct
I need to actually look at the log files on disk to figure out why it’s not loading them all. That may take some time if I have to write some scripts to read the metadata from the log segments, but those scripts are needed anyways TBH
What I’m seeing in my test is the SegmentManager is only loading one segment - the 5th segment where the starting index is 133586.
There should always be a segment 1 with starting index 1
even after compaction, since the log compaction algorithm always merges later segments into earlier segments
should be able to fix it today, but will probably take me some time
Jordan Halterman
@kuujo
Feb 24 2017 21:47
so, node 1 is definitely missing logs for some reason
it only has a segment 5 on disk
I’m going to have to dig through its logs to try to figure out where that actually happened, and whether it’s a bug in Copycat or not
I guess it should be
Jon Hall
@jhall11
Feb 24 2017 22:06
awesome!
so if you connected a new server to the cluster, it would just get these from the leader and not load from disk? What if the node has a way outdated log (it died and wasn't restarted for a few days)? How does it get back in sync?
Jordan Halterman
@kuujo
Feb 24 2017 22:12
Yeah, so a couple things...
vishwas
@vishwass
Feb 24 2017 22:13
@kuujo
```
coder@ubuntu:~/copycat$ ls
bin CHANGES.md client examples LICENSE localhost:5000 pom.xml protocol README.md server test
coder@ubuntu:~/copycat$ java -jar examples/value-client/target/value-client.jar localhost:5000 localhost:5001 localhost:5002
```
still it doesn't seem to run
Jordan Halterman
@kuujo
Feb 24 2017 22:14
This could have happened during recovery. I'll look into that, but when the server recovers after a failure, it checks all the segments and deletes partially compacted segments. Original segments that are in the process of compaction are always kept on disk, so deleting partially compacted segments is safe. But that's where I'll look for an issue...
vishwas
@vishwass
Feb 24 2017 22:14
coder@ubuntu:~/copycat$ java -jar examples/value-state-machine/target/value-state-machine.jar localhost:5000 localhost:5001 localhost:5002
14:11:57.354 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.ServerContext - localhost/127.0.0.1:5001 - Transitioning to FOLLOWER
14:11:58.601 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=localhost/127.0.0.1:5002, clientAddress=null]]
14:12:00.113 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=localhost/127.0.0.1:5002, clientAddress=null]]
14:12:02.110 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=localhost/127.0.0.1:5002, clientAddress=null]]
14:12:04.194 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=localhost/127.0.0.1:5002, clientAddress=null]]
14:12:05.745 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=localhost/127.0.0.1:5002, clientAddress=null]]
14:12:07.948 [copyca
check the above output, is the server running as expected?
@kuujo
it shows clientAddress is null
Jordan Halterman
@kuujo
Feb 24 2017 22:15
No. Did you start all 3 servers? Or at least 2?
The cluster can't elect a leader with just one server up, so the client can't connect
vishwas
@vishwass
Feb 24 2017 22:15
yes, I have given 3 arguments:
localhost:5000 localhost:5001 localhost:5002. Do we need to start the servers in different terminals?
Jordan Halterman
@kuujo
Feb 24 2017 22:16
But you have to start three separate processes. The first argument is the local server's host:port. The second two are the other two remote servers. Then you have to start those remote servers separately. At least 2/3 have to be running to elect a leader. Once you see that, the client will be able to connect
@jhall11 So, I'll track that down. But once that bug is found and fixed, we can perhaps implement some safety mechanisms to prevent this from happening. It's not ideal, but servers can check their logs to ensure they're complete and can safely delete them and recover from the leader if they have to at that point.
vishwas
@vishwass
Feb 24 2017 22:20
oh ok! Right now I am only running one local server. Is it completely necessary to run the remote servers?
Jon Hall
@jhall11
Feb 24 2017 22:20
sure
Jordan Halterman
@kuujo
Feb 24 2017 22:21
No it's not necessary @vishwass just pass one host:port and you'll have a one node cluster
Err "cluster"
Jon Hall
@jhall11
Feb 24 2017 22:22
if you are running multiple servers on a single machine, you also need to make sure they are not writing to the same files on disk
Jordan Halterman
@kuujo
Feb 24 2017 22:24

That reminds me @vishwass I think the command should be

java -jar examples/value-state-machine/target/value-state-machine.jar logs/server1 localhost:5000

The first argument is a log directory
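So a local three node cluster would be started with something like this, one process per terminal, each with its own log directory (I believe the argument order is <log dir> <local host:port> <remote host:port...>, so double-check against the example's usage):

java -jar examples/value-state-machine/target/value-state-machine.jar logs/server1 localhost:5000 localhost:5001 localhost:5002
java -jar examples/value-state-machine/target/value-state-machine.jar logs/server2 localhost:5001 localhost:5000 localhost:5002
java -jar examples/value-state-machine/target/value-state-machine.jar logs/server3 localhost:5002 localhost:5000 localhost:5001

and then run the client as above once at least two of the three are up.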

vishwas
@vishwass
Feb 24 2017 22:25
@kuujo
coder@ubuntu:~/copycat$ java -jar examples/value-client/target/value-client.jar localhost:5000 localhost:5001
14:24:19.164 [copycat-client-io-1] INFO i.a.c.client.session.ClientSession - Registered session 3
Completed 228 writes in 998 milliseconds
Completed 130 writes in 1006 milliseconds
Completed 124 writes in 1009 milliseconds
Completed 115 writes in 989 milliseconds
Completed 119 writes in 996 milliseconds
Completed 71 writes in 1006 milliseconds
Completed 201 writes in 1003 milliseconds
Completed 135 writes in 991 milliseconds
now it looks like it's working for 2 servers.
locally
@kuujo thanks
coder@ubuntu:~/copycat$ java -jar examples/value-state-machine/target/value-state-machine.jar localhost:5000 localhost:5001
14:23:56.118 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.c.server.state.ServerContext - localhost/127.0.0.1:5001 - Transitioning to FOLLOWER
14:23:57.040 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.c.server.state.ServerContext - localhost/127.0.0.1:5001 - Transitioning to CANDIDATE
14:23:57.048 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.c.server.state.CandidateState - localhost/127.0.0.1:5001 - Starting election
14:23:57.056 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.c.server.state.ServerContext - localhost/127.0.0.1:5001 - Transitioning to LEADER
14:23:57.059 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.c.server.state.ServerContext - localhost/127.0.0.1:5001 - Found leader localhost/127.0.0.1:5001
14:23:57.150 [copycat-server-localhost/127.0.0.1:5001-copycat] INFO  i.a.copycat.server.CopycatServer - Server started successfully!
14:26:55.883 [copycat-compactor-1] INFO  i.a.c.s.storage.compaction.Compactor - Compacting log with compaction: MINOR
I don't know why it was not running before!!!
Jordan Halterman
@kuujo
Feb 24 2017 23:07
@vishwass what you're doing is starting a single node cluster with a server on localhost:5001 that writes to a directory called localhost:5000. You should see that directory in pwd. Copycat cannot form a cluster of two nodes with just one node up. A majority of the cluster must be running.
Should pass a proper directory name as the first argument
You can start one node in a two node cluster, but it won't be able to elect a leader and clients won't be able to submit any reads/writes
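(The majority math is just floor(n / 2) + 1, for what it's worth:)
```java
// Quorum size for an n-node cluster: floor(n / 2) + 1 votes are needed to
// elect a leader or commit a write (generic Raft math, shown for illustration).
final class Quorum {
  static int size(int clusterSize) {
    return clusterSize / 2 + 1;
  }

  public static void main(String[] args) {
    System.out.println(size(1)); // 1 -> a single node is its own majority
    System.out.println(size(2)); // 2 -> both nodes of a 2-node cluster must be up
    System.out.println(size(3)); // 2 -> which is why 2 of 3 keeps working
  }
}
```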
@jhall11 what's odd again is that the same thing occurs in all the partitions on server 1. All of them are missing segments prior to segment 5.
Jordan Halterman
@kuujo
Feb 24 2017 23:32
I found where it happens
2017-02-24 16:42:26,480 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Deleting unlocked segment: 4-69 (partition-1-4-69.log)
2017-02-24 16:42:26,481 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Deleting unlocked segment: 2-624 (partition-1-2-624.log)
2017-02-24 16:42:26,560 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Loaded file segment: 5 (partition-1-5-1.log)
2017-02-24 16:42:26,560 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Found segment: 5 (partition-1-5-1.log)
2017-02-24 16:42:26,561 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Deleting unlocked segment: 1-930 (partition-1-1-930.log)
2017-02-24 16:42:26,561 | DEBUG | 9876-partition-1 | SegmentManager                   | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Deleting unlocked segment: 3-333 (partition-1-3-333.log)
2017-02-24 16:42:26,568 | DEBUG | 9876-partition-1 | SnapshotStore                    | 90 - io.atomix.all - 1.0.1.SNAPSHOT | Loaded disk snapshot: 133587 (partition-1-133587-20170224152629.snapshot)
hmm…. this may be difficult to debug without actually having those segment files
Jordan Halterman
@kuujo
Feb 24 2017 23:43
the segments were not locked according to the logs, meaning it indeed did consider them to be partially compacted segments and deleted them
just prior to the crash, minor compaction started for all four segments that were ultimately deleted on recovery
Jordan Halterman
@kuujo
Feb 24 2017 23:49
the compaction tasks also finished for all segments
it seems like perhaps the lock was just cached and never flushed completely to disk before the original segments were deleted
~1 second after compaction tasks finish, the server crashes