These are chat archives for atomix/atomix

3rd Jan 2016
Richard Pijnenburg
@electrical
Jan 03 2016 00:32
@kuujo great to hear! Sorry been out with friends drinking :) I'll do some tests tomorrow.
Jordan Halterman
@kuujo
Jan 03 2016 00:33
cool
Richard Pijnenburg
@electrical
Jan 03 2016 00:36
Too bad travelling home always takes forever. London is way too expensive to live in.
Jordan Halterman
@kuujo
Jan 03 2016 00:36
ditto LA
Richard Pijnenburg
@electrical
Jan 03 2016 00:37
Been to cali
Once for work
And then to Mountain View. Very boring place lol
San Francisco was pretty busy
Made me feel like Amsterdam but bigger
Haven't been to LA yet.
Been to Portland, OR though
Was pretty nice but mainly because I know a lot of people there.
Jordan Halterman
@kuujo
Jan 03 2016 00:59
LA is terrible. You're not missing out.
I like San Francisco. It's an interesting place, but cost of living there is ridiculous.
Richard Pijnenburg
@electrical
Jan 03 2016 01:00
Hehe yeah
London is just as bad.
Jordan Halterman
@kuujo
Jan 03 2016 01:01
LA is a shithole with nothing but too many people and too much traffic. After two years I'm still baffled at why anybody likes living here, and a lot of people do. They think 30-minute 5-mile drives are normal.
Richard Pijnenburg
@electrical
Jan 03 2016 01:02
Ugh. Defo sounds like London.
Jordan Halterman
@kuujo
Jan 03 2016 01:03
I'm getting a remote job and getting out of here some time
Alright. I merged a few huge fixes. One of them is for a bug I've been looking for probably for months now. I fixed what bugs I could find in the leader election case. In my testing the issues I can reproduce were resolved. We'll see how that goes tomorrow. I'll be offline myself for the night.
Richard Pijnenburg
@electrical
Jan 03 2016 01:05
Heh yeah. I did remote for 1.5 years. Got boring at some point. Looking for another job atm.
Okay. I'd.
Argh
Okay bud. I'll catch you tomorrow. I'll do some testing in my morning and let you know
Richard Pijnenburg
@electrical
Jan 03 2016 11:27
@kuujo ran my thing again and the cluster died at some point. https://gist.github.com/electrical/69e02f6e894d894e2aa4
Richard Pijnenburg
@electrical
Jan 03 2016 14:30
seeing a few java.nio.channels.ClosedChannelException errors in my latest run
also still a lot of request timed out on the LeaderAppender
Richard Pijnenburg
@electrical
Jan 03 2016 14:35
perhaps the less-than-1-second timeout is too fast? or is something else going on
I'm confused why it's so unstable
Jordan Halterman
@kuujo
Jan 03 2016 17:42
The ClosedChannelExceptions are a byproduct of closing the connection when an attempt fails. Note they almost always follow a timeout. But that looks like a race condition...
Richard Pijnenburg
@electrical
Jan 03 2016 17:43
ah i see. okay
Have you ever seen the instability i'm getting with the leader example?
Jordan Halterman
@kuujo
Jan 03 2016 17:44
no but I’m testing it again now
Richard Pijnenburg
@electrical
Jan 03 2016 17:44
okay
Jordan Halterman
@kuujo
Jan 03 2016 17:44
the only time I see leader changes are when I kill nodes
but I’ll run it for a long, long time today
Richard Pijnenburg
@electrical
Jan 03 2016 17:44
hmm, i see it without killing nodes.
Jordan Halterman
@kuujo
Jan 03 2016 17:45
yeah
Richard Pijnenburg
@electrical
Jan 03 2016 17:45
and it happens a few times within a few minutes
Jordan Halterman
@kuujo
Jan 03 2016 17:45
something is resulting in responses not making it back to the leader
follower gets a request and sends a response, but the leader doesn’t get it and the request times out
in your logs
Richard Pijnenburg
@electrical
Jan 03 2016 17:45
yeah
Jordan Halterman
@kuujo
Jan 03 2016 17:47
I think the problem is if another heartbeat is supposed to be sent during that ~500 milliseconds in which the response is never received, the leader skips sending the next AppendRequest and so the follower times out. That can be fixed
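A rough sketch of what that fix could look like (hypothetical names only, not the actual LeaderAppender code): track when the last AppendRequest was sent to each member and send an empty heartbeat whenever the interval elapses, even if an earlier request is still outstanding.

// Hypothetical illustration of the described fix; names do not match the real copycat code.
final class HeartbeatSketch {
  interface Member {
    long lastRequestTime();
    void setLastRequestTime(long time);
  }

  void heartbeatTick(Member member, long heartbeatIntervalMillis) {
    long now = System.currentTimeMillis();
    // Old behavior: skip when a prior request is still in flight, so a lost response
    // stalls the follower until it times out and starts an election.
    // Fix: send the next (empty) AppendRequest whenever the heartbeat interval elapses.
    if (now - member.lastRequestTime() >= heartbeatIntervalMillis) {
      member.setLastRequestTime(now);
      sendEmptyAppend(member);
    }
  }

  void sendEmptyAppend(Member member) {
    // placeholder for the real network send
  }
}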
Richard Pijnenburg
@electrical
Jan 03 2016 17:48
it's weird that I'm having all these kinds of issues and you don't :-)
Jordan Halterman
@kuujo
Jan 03 2016 17:49

In server 1’s logs, this is what happens:
The leader sends an AppendRequest to 5001

12:24:52.113 [copycat-server-localhost/127.0.0.1:5000] DEBUG i.a.c.server.state.LeaderAppender - localhost/127.0.0.1:5000 - Sent AppendRequest[term=2, leader=2130712284, logIndex=118, logTerm=2, entries=[0], commitIndex=119, globalIndex=118] to localhost/127.0.0.1:5001

That request times out, but the leader doesn’t send another one until a second later which is too late

that is fixable
Richard Pijnenburg
@electrical
Jan 03 2016 17:50
wonder what the reason is that it times out. it's all local network.
is it just going all too slow?
or are the timeouts too tight?
Jordan Halterman
@kuujo
Jan 03 2016 17:53
if you look at all the rest of the requests/responses, the 500ms timeout is plenty for requests and responses, and the other timeouts should be fine. I’m sure tweaking them would make it work fine, but responses are lost in real networks and so fixing that little issue should be done too
Richard Pijnenburg
@electrical
Jan 03 2016 17:53
trye
*true
even on a normal keyboard i can't type. lol
Jordan Halterman
@kuujo
Jan 03 2016 17:53
haha
Richard Pijnenburg
@electrical
Jan 03 2016 17:54
are all the timeouts configurable btw?
Jordan Halterman
@kuujo
Jan 03 2016 17:54
I started the servers 5 minutes ago
still the same leader so far
Richard Pijnenburg
@electrical
Jan 03 2016 17:55
okay
it would have changed at least 5 times already with me.
Jordan Halterman
@kuujo
Jan 03 2016 17:58
hmm I may have actually found that race condition
one sec for a PR
Richard Pijnenburg
@electrical
Jan 03 2016 17:59
okay
Jordan Halterman
@kuujo
Jan 03 2016 18:06
That PR explains it
Richard Pijnenburg
@electrical
Jan 03 2016 18:06
looking at it now
ah. makes sense yeah
Jordan Halterman
@kuujo
Jan 03 2016 18:07
That sort of fixes how and when to handle a timeout, but still odd that timeouts are happening at all. That’s sort of why I was wondering if it was an issue with the Netty configuration.
BTW my first leader is still the leader
Richard Pijnenburg
@electrical
Jan 03 2016 18:07
hmm. weird that you're having no problems at all
Jordan Halterman
@kuujo
Jan 03 2016 18:08
or maybe a buffer isn’t being flushed
nope all the calls to Netty Channels use writeAndFlush
still leader :-|
Richard Pijnenburg
@electrical
Jan 03 2016 18:13
hmm
Jordan Halterman
@kuujo
Jan 03 2016 18:13
just gonna merge and deploy it
Richard Pijnenburg
@electrical
Jan 03 2016 18:15
okay
Jordan Halterman
@kuujo
Jan 03 2016 18:16
did
Richard Pijnenburg
@electrical
Jan 03 2016 18:20
I can rebuild and retest
Jordan Halterman
@kuujo
Jan 03 2016 18:20
sure
Richard Pijnenburg
@electrical
Jan 03 2016 18:24
running
seeing request time out errors
and already at term 3
term 5
stopped it now. term 8
gisting logs
Jordan Halterman
@kuujo
Jan 03 2016 18:26
hmm
Jordan Halterman
@kuujo
Jan 03 2016 18:26
I can fix the issue of leader changes being caused by timeouts, but the fact that the timeouts are happening at all is the real issue
Richard Pijnenburg
@electrical
Jan 03 2016 18:28
yeah. makes more sense to solve the cause than the effect of it :)
Jordan Halterman
@kuujo
Jan 03 2016 18:28
haha indeed
has to be on the Netty level
still the same leader
Well, the issue of how timeouts are handled is still incorrect, so I’ll still fix that
Richard Pijnenburg
@electrical
Jan 03 2016 18:38
okay
Jordan Halterman
@kuujo
Jan 03 2016 18:41
I'm gonna get it on some EC2 instances today to see what happens there. I also sort of wonder if a VertxTransport would behave better.
@electrical What OS are you running it on again?
Richard Pijnenburg
@electrical
Jan 03 2016 18:47
Ubuntu
with jre1.8.0_66
Jordan Halterman
@kuujo
Jan 03 2016 18:51
hmm
I should remove the netty-transport-native-epoll crap from Catalyst
not sure that will have any effect but needs to be removed anyways
epoll is the major difference between our environments wrt Netty
Richard Pijnenburg
@electrical
Jan 03 2016 18:55
ah okay
Jordan Halterman
@kuujo
Jan 03 2016 18:58
done
Richard Pijnenburg
@electrical
Jan 03 2016 18:59
rebuild?
Jordan Halterman
@kuujo
Jan 03 2016 18:59
yeah
Richard Pijnenburg
@electrical
Jan 03 2016 19:02
within a minute at term 3
oh wait
forgot to clean my .m2 dir
term 3 already in 2 minutes
oh. term 5 now
Jordan Halterman
@kuujo
Jan 03 2016 19:06
not surprised
Richard Pijnenburg
@electrical
Jan 03 2016 19:06
okay
Jordan Halterman
@kuujo
Jan 03 2016 19:06
just gonna get on ubuntu
Richard Pijnenburg
@electrical
Jan 03 2016 19:06
okay
it was stable in the beginning and then it went nuts
Jordan Halterman
@kuujo
Jan 03 2016 19:06
my leader is still there :-P ugh
Richard Pijnenburg
@electrical
Jan 03 2016 19:06
will gist logs
Im gonna have dinner now. back in a bit.
Jordan Halterman
@kuujo
Jan 03 2016 19:07
cool
yep same thing
it takes 2 milliseconds for server 3 to receive an AppendRequest, it sends an AppendResponse 1 millisecond later, and that response is never received by the leader
Richard Pijnenburg
@electrical
Jan 03 2016 19:21
okay. so im not going crazy then :p
Yum. pasta bolognese is always nice :-)
Jordan Halterman
@kuujo
Jan 03 2016 19:23
To be clear, I mean the same thing in your logs. My leader is still the leader haha
that sucks
Richard Pijnenburg
@electrical
Jan 03 2016 19:23
ahh okay
Jordan Halterman
@kuujo
Jan 03 2016 19:23
hard to fix something you don’t see
Richard Pijnenburg
@electrical
Jan 03 2016 19:23
yeah indeed
maybe worth upgrading netty from 4.0.21 to 4.0.33 ?
or won't make much difference?
Jordan Halterman
@kuujo
Jan 03 2016 19:25
yeah maybe
might as well
probably won’t matter but who knows
Richard Pijnenburg
@electrical
Jan 03 2016 19:26
this is in catalyst right?
Jordan Halterman
@kuujo
Jan 03 2016 19:27
yeah
Richard Pijnenburg
@electrical
Jan 03 2016 19:29
mvn clean install -DskipTests should be enough?
Jordan Halterman
@kuujo
Jan 03 2016 19:29
you have to do all three to make sure they’re built with the updated dependencies
Richard Pijnenburg
@electrical
Jan 03 2016 19:29
okay
Jordan Halterman
@kuujo
Jan 03 2016 19:29
mvn clean install -DskipTests in catalyst, copycat, and atomix
Richard Pijnenburg
@electrical
Jan 03 2016 19:30
building now
Richard Pijnenburg
@electrical
Jan 03 2016 19:37
done building. running test again
Jordan Halterman
@kuujo
Jan 03 2016 19:38
I wonder if this could be a race condition… the NettyConnection doesn’t add the response future to the responseFutures map until the flush is complete: https://github.com/atomix/catalyst/blob/master/netty/src/main/java/io/atomix/catalyst/transport/NettyConnection.java#L298 I wonder if it’s possible for the response to be received before that happens for some reason related to thread scheduling.
I also see a thread safety issue there
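A minimal sketch of the kind of race being described, with hypothetical names (not the real NettyConnection code): if the future is only registered in a listener that runs after the flush completes, a very fast response can arrive first and find nothing to complete; registering before the write closes that window.

import io.netty.channel.Channel;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the ordering issue; not the real responseFutures handling.
final class RequestRegistrationSketch {
  private final Map<Long, CompletableFuture<Object>> responseFutures = new ConcurrentHashMap<>();

  CompletableFuture<Object> send(long requestId, Object request, Channel channel) {
    CompletableFuture<Object> future = new CompletableFuture<>();
    // Register before writing so a response that arrives immediately can still be matched.
    responseFutures.put(requestId, future);
    channel.writeAndFlush(request).addListener(f -> {
      if (!f.isSuccess()) {
        responseFutures.remove(requestId);
        future.completeExceptionally(f.cause());
      }
    });
    return future;
  }

  void handleResponse(long requestId, Object response) {
    CompletableFuture<Object> future = responseFutures.remove(requestId);
    if (future != null) {
      future.complete(response);
    }
  }
}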
Richard Pijnenburg
@electrical
Jan 03 2016 19:40
ah okay
same issue still btw, but that was expected
term 8 now
Jordan Halterman
@kuujo
Jan 03 2016 19:41
That piece of code could very well be the culprit actually. The dumb thing about Java’s Executor framework is that it will catch and suppress exceptions in spots like that, and there’s a thread safety issue with the put to responseFutures there
Richard Pijnenburg
@electrical
Jan 03 2016 19:42
hmm okay. still weird i'm seeing it and you aren't.
Jordan Halterman
@kuujo
Jan 03 2016 19:42
yeah
still have the same leader
Richard Pijnenburg
@electrical
Jan 03 2016 19:43
also on the ec2 instances?
Jordan Halterman
@kuujo
Jan 03 2016 19:44
haven’t tried it yet
Richard Pijnenburg
@electrical
Jan 03 2016 19:44
okay
Jordan Halterman
@kuujo
Jan 03 2016 19:44
being lazy since I have to set up new keys and crap
haha
Richard Pijnenburg
@electrical
Jan 03 2016 19:44
haha
Jordan Halterman
@kuujo
Jan 03 2016 19:44
I’m gonna fix the thread safety issues I see here and then try it out
Richard Pijnenburg
@electrical
Jan 03 2016 19:45
okay cool
Jordan Halterman
@kuujo
Jan 03 2016 19:45
calls to close() on NettyConnection are completed in the wrong thread, and iterating the responseFutures map is not thread safe, which is what’s done to time out requests
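For the iteration side, one common pattern (a sketch only, assuming a single context thread like Catalyst's SingleThreadContext) is to confine the pending-request map to one thread so the timeout sweep can never race with puts and removes.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: make request timeouts thread safe by confining the map to a single thread.
final class TimeoutSweepSketch {
  private final ScheduledExecutorService context = Executors.newSingleThreadScheduledExecutor();
  private final Map<Long, PendingRequest> pending = new HashMap<>();

  static final class PendingRequest {
    final CompletableFuture<Object> future = new CompletableFuture<>();
    final long deadline;
    PendingRequest(long deadline) { this.deadline = deadline; }
  }

  void start(long sweepIntervalMillis) {
    // The sweep, all registrations, and all completions run on the same thread,
    // so iterating the map here cannot race with modifications elsewhere.
    context.scheduleAtFixedRate(this::sweep, sweepIntervalMillis, sweepIntervalMillis, TimeUnit.MILLISECONDS);
  }

  void register(long requestId, long timeoutMillis) {
    context.execute(() -> pending.put(requestId, new PendingRequest(System.currentTimeMillis() + timeoutMillis)));
  }

  private void sweep() {
    long now = System.currentTimeMillis();
    Iterator<Map.Entry<Long, PendingRequest>> it = pending.entrySet().iterator();
    while (it.hasNext()) {
      PendingRequest request = it.next().getValue();
      if (now >= request.deadline) {
        it.remove();
        request.future.completeExceptionally(new TimeoutException("request timed out"));
      }
    }
  }
}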
Richard Pijnenburg
@electrical
Jan 03 2016 19:47
hehe okay
Richard Pijnenburg
@electrical
Jan 03 2016 20:33
Any luck?
Jordan Halterman
@kuujo
Jan 03 2016 20:34
almost
testing it out
Richard Pijnenburg
@electrical
Jan 03 2016 20:37
cool :)
Jordan Halterman
@kuujo
Jan 03 2016 20:53
alright I fixed all of the potential issues I saw in the NettyConnection so we will see if any of the obvious issues were the cause, otherwise off to EC2 today
still the same leader BTW
Richard Pijnenburg
@electrical
Jan 03 2016 20:58
Okay. Will re-build it all in a bit and test again.
Richard Pijnenburg
@electrical
Jan 03 2016 21:29
@kuujo running test now
still term 1 :-)
looks like you found it
starting the client now as well
wow, still term 1
lets see what happens when i kill the leader
Richard Pijnenburg
@electrical
Jan 03 2016 21:37
@kuujo all stable. only leader changes if i kill the leader
Richard Pijnenburg
@electrical
Jan 03 2016 21:47
Also cluster discovery worked for the client. only gave a single node and it has all 3 nodes as members
@kuujo my next test will be to create a producer and consumer and use that work queue you built for me.
Richard Pijnenburg
@electrical
Jan 03 2016 22:10
@kuujo whenever you have a moment: it's unclear to me how to use the DistributedTaskQueue thing you built.
Documentation is a bit confusing as well.. some mention atomix.create, others do copycat.create
Richard Pijnenburg
@electrical
Jan 03 2016 22:24
is it something like this?
DistributedTaskQueue<String> queue =
  atomix.create("queue1", DistributedTaskQueue::new).get();

queue.submit("some text");

queue.consumer(task -> {
  // some code that handles the task
});
and if I want to split up the submitter and consumer, I need to call the create part in 2 places, right?
i assume the var 'task' contains the text that i submit?
Jordan Halterman
@kuujo
Jan 03 2016 22:29
Haha wow awesome finally ugh
Yep
Richard Pijnenburg
@electrical
Jan 03 2016 22:29
took a while but it's still running stable :-)
Jordan Halterman
@kuujo
Jan 03 2016 22:29
Sorry I was watching Making a Murderer again
Haha
Richard Pijnenburg
@electrical
Jan 03 2016 22:29
hehe :p
trying to make an example with the distributed task queue one.
Jordan Halterman
@kuujo
Jan 03 2016 22:30
Bah I need to update the docs actually. I think they're way behind now
Richard Pijnenburg
@electrical
Jan 03 2016 22:31
hehe yeah :p
Jordan Halterman
@kuujo
Jan 03 2016 22:34
Probably tons of stuff in there is wrong, but what you want is atomix.create(...). I'm actually changing that interface a little bit today to allow for resource configurations so you can e.g. create an ordered or unordered worker. Also, with that stupid thing fixed I can actually focus on documenting those things and making sure they're correct. So glad you helped me find that, thanks!
Richard Pijnenburg
@electrical
Jan 03 2016 22:34
No problem at all :-) very happy to help out.
and now i can actually try to use it for logstash ;-)
Jordan Halterman
@kuujo
Jan 03 2016 22:36
haha I know
actually fixing another issue I found in the client, but it has made some huge progress now
Richard Pijnenburg
@electrical
Jan 03 2016 22:36
hehe nice :-)
Hmm. trying to compile the new client jar.. no luck yet
Richard Pijnenburg
@electrical
Jan 03 2016 22:42
package io.atomix.messaging does not exist
Jordan Halterman
@kuujo
Jan 03 2016 22:43
hmm odd
Richard Pijnenburg
@electrical
Jan 03 2016 22:44
ah only had coordination as a dependency. switched over to 'all'
still gives cannot find symbol
class DistributedTaskueue
location: package io.atomix.messagin
import io.atomix.messaging.DistributedTaskueue;
Jordan Halterman
@kuujo
Jan 03 2016 22:46
oh yeah
Richard Pijnenburg
@electrical
Jan 03 2016 22:47
already got that line in there. but still fails.
Jordan Halterman
@kuujo
Jan 03 2016 22:47
import io.atomix.messaging.DistributedTaskueue -> import io.atomix.messaging.DistributedTaskQueue?
Richard Pijnenburg
@electrical
Jan 03 2016 22:47
d0h :p
now one left. variable atomix
for the atomix.create
I really need to read better :p
Jordan Halterman
@kuujo
Jan 03 2016 22:51
lol
Richard Pijnenburg
@electrical
Jan 03 2016 22:54
hmm.. that failed horribly
Jordan Halterman
@kuujo
Jan 03 2016 22:54
what did?
Richard Pijnenburg
@electrical
Jan 03 2016 22:54
the client for setting up the resource
i'll gist the log. most likely i'm doing something wrong.
Jordan Halterman
@kuujo
Jan 03 2016 22:55
does the client and server both have the messaging module on the classpath?
Richard Pijnenburg
@electrical
Jan 03 2016 22:56
euhm. server side most likely not. using the distributedleader example
Jordan Halterman
@kuujo
Jan 03 2016 22:56
yeah that'd be the problem
I did that earlier too
I should add atomix-all to all the examples
Richard Pijnenburg
@electrical
Jan 03 2016 22:56
okay
will rebuild the example with the atomix-all
=9, entries=[13], commitIndex=5815, globalIndex=5783]: Request log term does not match local log term 8 for the same entry
23:58:33.237 [copycat-server-localhost/127.0.0.1:5001] DEBUG i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Sent AppendResponse[status=OK, term=9, succeeded=false, logIndex=5781]
23:58:33.251 [copycat-server-localhost/127.0.0.1:5001] DEBUG i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Received AppendRequest[term=9, leader=2130712286, logIndex=5781, logTerm=0, entries=[14], commitIndex=5815, globalIndex=5783]
23:58:33.251 [copycat-server-localhost/127.0.0.1:5001] DEBUG i.a.c.server.state.FollowerState - localhost/127.0.0.1:5001 - Appended entry term does not match local log, removing incorrect entries
23:58:33.252 [copycat-server-localhost/127.0.0.1:5001] ERROR i.a.c.u.c.SingleThreadContext - An uncaught exception occurred
java.lang.IndexOutOfBoundsException: cannot truncate committed entries
        at io.atomix.catalyst.util.Assert.index(Assert.java:45) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.storage.Log.truncate(Log.java:456) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.state.PassiveState.doAppendEntries(PassiveState.java:147) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.state.PassiveState.handleAppend(PassiveState.java:83) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.state.ActiveState.append(ActiveState.java:62) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.state.FollowerState.append(FollowerState.java:292) ~[atomix-leader-election.jar:na]
        at io.atomix.copycat.server.state.ServerState.lambda$connectServer$50(ServerState.java:472) ~[atomix-leader-election.jar:na]
        at io.atomix.catalyst.transport.NettyConnection.handleRequest(NettyConnection.java:102) ~[atomix-leader-election.jar:na]
        at io.atomix.catalyst.transport.NettyConnection.lambda$handleRequest$2(NettyConnection.java:90) ~[atomix-leader-election.jar:na]
        at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$6(Runnables.java:20) ~[atomix-leader-election.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_66]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_66]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_66]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[na:1.8.0_66]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66]
^C
not sure what happened here
2 nodes work fine. 3rd one is failing
Jordan Halterman
@kuujo
Jan 03 2016 23:00
in addition to that exception?
I think that might be an out of order message
let’s see...
I’ve been looking for that exception actually
very fixable
Richard Pijnenburg
@electrical
Jan 03 2016 23:02
all I did was shut down the 3 nodes and start them up again
removing the local log directory for this instance solved it
Jordan Halterman
@kuujo
Jan 03 2016 23:02
oh shit
bad condition actually
hmm nvm
Richard Pijnenburg
@electrical
Jan 03 2016 23:03
hehe. leave it up to me to break things. lol
yay. it printed out the text in the client :D
00:04:49.201 [copycat-client-2] DEBUG i.a.c.client.session.ClientSession - 6207 - Received PublishRequest[session=6207, eventIndex=6212, previousIndex=6207, events=[Event[event=process, message=InstanceEvent[resource=6210, message=some text]]]]
Jordan Halterman
@kuujo
Jan 03 2016 23:06
indeed
Richard Pijnenburg
@electrical
Jan 03 2016 23:07
so that works :-)
now i have to build a separate publisher that just keeps publishing things in a loop as fast as it can
and have a separate consumer that just prints it out
Jordan Halterman
@kuujo
Jan 03 2016 23:07
That PublishRequest and PublishResponse is how the client and server coordinate to send messages from servers to clients. In this case, a command is committed to the Raft cluster and then one server sends a PublishRequest to the client. The eventIndex is the Raft log index at which the event was published, and the previousIndex was the Raft log index at which the previous message was published. The client uses that to verify that it received all messages and that it received them all in order. If the client hasn’t seen previousIndex then it will ask for it
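Roughly, the client-side check being described could look like this (illustrative sketch only, hypothetical names, not the real ClientSession code):

// Sketch of the sequencing check described above.
final class EventSequencerSketch {
  private long lastEventIndex; // starts at the session ID for a new session

  EventSequencerSketch(long sessionId) {
    this.lastEventIndex = sessionId;
  }

  // Returns true if the publish can be applied in order;
  // false means a gap or duplicate was detected and a resend from lastEventIndex is needed.
  boolean onPublish(long previousIndex, long eventIndex) {
    if (previousIndex != lastEventIndex) {
      return false;
    }
    lastEventIndex = eventIndex;
    return true;
  }
}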
Richard Pijnenburg
@electrical
Jan 03 2016 23:08
ahh okay
need to find a way to measure the processing speed of it
Jordan Halterman
@kuujo
Jan 03 2016 23:08
The DistributedWorkQueue has a sync() and async() method you should pay attention to. async() will probably work a lot faster as the cluster will respond when the message is persisted but not necessarily received by the client. sync() will respond once the message is acknowledged by the consumer.
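A guess at how those two modes might be used in practice (assuming sync()/async() switch the acknowledgement behavior as described above; the exact API is an assumption):

// Assumption: async() completes the future once the task is persisted by the cluster,
// sync() completes it only after a consumer has acknowledged the task.
queue.async().submit("fast task");
queue.sync().submit("acknowledged task").join();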
Richard Pijnenburg
@electrical
Jan 03 2016 23:09
ah i see. okay
is there a way to control how big the queue size is btw?
Jordan Halterman
@kuujo
Jan 03 2016 23:09
Also, you can send a bunch of concurrent requests if you don’t block to wait for a response.
Richard Pijnenburg
@electrical
Jan 03 2016 23:10
ohw?
Jordan Halterman
@kuujo
Jan 03 2016 23:10
That’s what I’m working on right now actually. There’s currently no way to configure stuff.
Richard Pijnenburg
@electrical
Jan 03 2016 23:10
hehe okay :p
exposing something for that would be nice yeah
then i can have sane defaults in my application and have a config file for overriding the defaults
Jordan Halterman
@kuujo
Jan 03 2016 23:11
queue.submit("foo");
queue.submit("bar");
queue.submit("baz");
That will send a lot faster than:
queue.submit("foo").join();
queue.submit("bar").join();
queue.submit("baz").join();
Richard Pijnenburg
@electrical
Jan 03 2016 23:11
ahh okay. I'm using the first example yeah
Jordan Halterman
@kuujo
Jan 03 2016 23:11
I have a recursive algorithm I use for testing actually
let me see
Richard Pijnenburg
@electrical
Jan 03 2016 23:11
okay
I'll most likely have a lot of different queues depending on the number of machines and how the pipeline is set up
for example sending from inputs to filters for a single pipeline will be a queue
and then it depends if all the filters live on a single node or not
Jordan Halterman
@kuujo
Jan 03 2016 23:15
private void sendTasks(DistributedTaskQueue<String> queue) {
  for (int i = 0; i < 1000; i++) {
    sendTask(queue);
  }
}

private void sendTask(DistributedTaskQueue<String> queue) {
  queue.submit("foo").whenComplete((result, error) -> {
    sendTask(queue);
  });
}
This is something like what I use in throughput testing to submit 1000 concurrent tasks at a time. What it really needs is some sort of backpressure mechanism. I notice for sure things slow down after 10k/sec in my testing, and that can vary in different environments
haven’t actually tested the task queue though
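One simple way to bound the number of in-flight submits (a sketch, not something Atomix provides itself) is a semaphore around the recursive pattern above:

import io.atomix.messaging.DistributedTaskQueue;
import java.util.concurrent.Semaphore;

// Sketch: cap in-flight submits with a semaphore as a crude form of backpressure.
final class BoundedSubmitter {
  private final Semaphore permits = new Semaphore(1000); // at most 1000 outstanding tasks

  void submitAll(DistributedTaskQueue<String> queue, int count) throws InterruptedException {
    for (int i = 0; i < count; i++) {
      permits.acquire(); // blocks once 1000 submits are outstanding
      queue.submit("task-" + i).whenComplete((result, error) -> permits.release());
    }
  }
}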
Richard Pijnenburg
@electrical
Jan 03 2016 23:17
ah. for which one was this then?
Jordan Halterman
@kuujo
Jan 03 2016 23:18
I just wrote that ^^ I have only done that testing on DistributedValue and DistributedMap me thinks
Richard Pijnenburg
@electrical
Jan 03 2016 23:18
ahh okay
Jordan Halterman
@kuujo
Jan 03 2016 23:19
k I’m gonna fix that exception
Richard Pijnenburg
@electrical
Jan 03 2016 23:19
yeah. having a concurrent / pool size would be nice.
pool up x events and send them in one go
but that means I have another buffer somewhere I need to make sure I don't lose events in
Jordan Halterman
@kuujo
Jan 03 2016 23:20
Anyways, the big benefit of an asynchronous API is being able to submit 1000 requests at a time. I think a blocking request-response in Raft is too slow because that means:
  • Make a request to some server
  • Forward the request to the leader
  • Write to disk
  • Replicate to a majority of the cluster
  • Get a response from a majority of the cluster
  • Send the original response
    …and that is way too slow. The flip side is that not blocking tends to make flow control a lot harder.
yep
but nothing’s guaranteed by Atomix not to be lost until the returned CompletableFuture is completed anyways
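In practice that means holding on to the source event until the future completes, e.g. (sketch, with a hypothetical ackUpstream callback):

import io.atomix.messaging.DistributedTaskQueue;

// Sketch: only acknowledge the upstream source once Atomix has confirmed persistence.
void forward(DistributedTaskQueue<String> queue, String event, Runnable ackUpstream) {
  queue.submit(event).whenComplete((result, error) -> {
    if (error == null) {
      ackUpstream.run(); // safe to drop the event from the input side now
    } else {
      // not persisted: retry or keep it buffered upstream
    }
  });
}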
Richard Pijnenburg
@electrical
Jan 03 2016 23:21
true
but if for each received event i do a submit it should be fine as well
ensuring I don't lose anything is very important as well
Jordan Halterman
@kuujo
Jan 03 2016 23:22
yeah
I think some of the Netty stuff maybe could be tweaked a bit too
no attention has been paid to that
Richard Pijnenburg
@electrical
Jan 03 2016 23:23
otherwise i need to have a persistent queue between the inputs getting the data and publishing it into the queue
and since it's already persisted in the raft queue anyway if no client picks it up..
btw, elasticsearch uses netty, perhaps there are some values there that might make sense?
not sure though :-)
just an idea
Jordan Halterman
@kuujo
Jan 03 2016 23:26
Yeah that sucks… this is always the problem with transitioning data between two systems. It can’t really safely leave the first system until the second system has guaranteed its persistence.
yeah I really need to poke around somewhere
I know one thing is that right now Catalyst flushes on every message which isn’t really efficient
every message is a system call
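The general Netty pattern for avoiding a flush (and syscall) per message is to write() individual messages and flush() once per batch, something like this sketch (not what Catalyst currently does):

import io.netty.channel.Channel;
import java.util.List;

// Sketch: batch writes and flush once, instead of writeAndFlush() per message.
void sendBatch(Channel channel, List<Object> messages) {
  for (Object message : messages) {
    channel.write(message); // queued in Netty's outbound buffer, no syscall yet
  }
  channel.flush();          // single flush for the whole batch
}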
Richard Pijnenburg
@electrical
Jan 03 2016 23:27
ah i see.
Jordan Halterman
@kuujo
Jan 03 2016 23:27
but also buffer sizes and things like that
think this is the part where they set up the netty stuff
i think :-)
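The knobs in question are typically set on the Netty bootstrap, e.g. (sketch only; the values are illustrative, not recommendations):

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

// Sketch of the kind of Netty socket tuning being discussed.
Bootstrap tunedBootstrap() {
  return new Bootstrap()
      .group(new NioEventLoopGroup())
      .channel(NioSocketChannel.class)
      .option(ChannelOption.TCP_NODELAY, true)     // don't delay small packets
      .option(ChannelOption.SO_KEEPALIVE, true)
      .option(ChannelOption.SO_SNDBUF, 64 * 1024)  // socket send buffer size
      .option(ChannelOption.SO_RCVBUF, 64 * 1024); // socket receive buffer size
}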
Richard Pijnenburg
@electrical
Jan 03 2016 23:38
btw, is there an easy way to add some timers in java? for example, time how long a loop of 10k submits takes?
and I want to do the same for pulling the events from the queue and printing them to the console
Jordan Halterman
@kuujo
Jan 03 2016 23:40
I usually just do some math with System.currentTimeMillis() inside the whenComplete callback
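Something like this, counting completions inside whenComplete (a sketch, assuming the DistributedTaskQueue from the earlier examples):

import io.atomix.messaging.DistributedTaskQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: time how long 10k submits take by counting completions.
void timeSubmits(DistributedTaskQueue<String> queue) {
  final int total = 10_000;
  final long start = System.currentTimeMillis();
  final AtomicInteger completed = new AtomicInteger();

  for (int i = 0; i < total; i++) {
    queue.submit("task-" + i).whenComplete((result, error) -> {
      if (completed.incrementAndGet() == total) {
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(total + " submits took " + elapsed + " ms");
      }
    });
  }
}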
hmm I could actually buffer writes in the CopycatClient
that might not be a bad idea
Richard Pijnenburg
@electrical
Jan 03 2016 23:40
hmm yeah indeed
Jordan Halterman
@kuujo
Jan 03 2016 23:42
oh nice
org.elasticsearch.common.network.NetworkService.TcpSettings where is this?
oh nvm found it
Richard Pijnenburg
@electrical
Jan 03 2016 23:42
okay
Richard Pijnenburg
@electrical
Jan 03 2016 23:49
lol. broke things again. Inconsistent event index: 8549
Jordan Halterman
@kuujo
Jan 03 2016 23:51
No that one is normal
Richard Pijnenburg
@electrical
Jan 03 2016 23:52
    for (int i = 0; i < 1000; i++) {
      queue.submit("Task ID: "+i);
    }
broke it with this.
Jordan Halterman
@kuujo
Jan 03 2016 23:52
Just a normal part of sequencing events, which is why it should be a DEBUG message
Just means an event maybe got skipped or duplicated
Richard Pijnenburg
@electrical
Jan 03 2016 23:52
Received CommandResponse[status=OK, index=13482, result=null]
it's not printing out any of the events
Jordan Halterman
@kuujo
Jan 03 2016 23:53
Did it stop getting PublishRequest?
Richard Pijnenburg
@electrical
Jan 03 2016 23:54
yeah
Jordan Halterman
@kuujo
Jan 03 2016 23:54
Oh nice can I see those logs?
Richard Pijnenburg
@electrical
Jan 03 2016 23:54
of course
will do a clean log run
or i got something in the code messing up
    queue.consumer( task -> {
        System.out.println("Task: "+task);
    });
thats the consumer
Jordan Halterman
@kuujo
Jan 03 2016 23:56
hmm actually that looks like it should not be an inconsistent index
let me poke around
Richard Pijnenburg
@electrical
Jan 03 2016 23:57
it didn't happen in the latest run i think
Jordan Halterman
@kuujo
Jan 03 2016 23:57
basically, previousIndex should be the last eventIndex the client received, and for the first event (which this is) it should be the session ID, which it is
but the client is saying inconsistent index, which it isn't
Richard Pijnenburg
@electrical
Jan 03 2016 23:58
ah okay
also funny it had to reconnect after doing all the things
Connecting to localhost/127.0.0.1:5002 00:55:13.882 [copycat-client-2] DEBUG i.a.c.client.util.ClientConnection - Setting up connection to localhost/127.0.0.1:500
Jordan Halterman
@kuujo
Jan 03 2016 23:58
yeah totally
Richard Pijnenburg
@electrical
Jan 03 2016 23:59
it logged the same PublishRequest twice btw