These are chat archives for akkadotnet/akka.net

30th
Apr 2015
Roger Johansson
@rogeralsing
Apr 30 2015 05:04 UTC
It's a rebranded atom.io or atomshell. No project support. Big fat ass fake announcement...
Nikita Tsukanov
@kekekeks
Apr 30 2015 06:32 UTC
It has support for project.json :D
BTW, it seems that they finally got coreclr working on *nix
Not sure what it can already run, but it looks promising
Nikita Tsukanov
@kekekeks
Apr 30 2015 07:07 UTC
Yay
VS 2015 RC
I'm a slowpoke, I know
Arjen Smits
@Danthar
Apr 30 2015 07:13 UTC
so im looking at VS Code
not sure what to think of it yet. It's complaining that I don't have git installed, probably some classpath thing.
I'll probably have to install vNext to get the most out of it.
Bartosz Sypytkowski
@Horusiath
Apr 30 2015 07:17 UTC
@rogeralsing if you want some help with Akka.ClusterSharding, it should be broken into smaller tasks so that contributors don't interfere with each other
Also, why ClusterSharding and not Cluster.Sharding? (There will be conflicts between the Akka.ClusterSharding namespace and the ClusterSharding class ;) )
Arjen Smits
@Danthar
Apr 30 2015 07:28 UTC
leap year triple-writer dirty-mirror asynchronous semi-consistency. uuh.. WUT ? ^^
Arjen Smits
@Danthar
Apr 30 2015 07:33 UTC
@Aaronontheweb #thesaddestmoments funny indeed. This one cracked me up especially: BRYAN: I VERIFY MY VERIFICATION OF MY CLAIM THAT RICH CLAIMS THAT I KNOW CHRIS.
Roger Johansson
@rogeralsing
Apr 30 2015 07:36 UTC
@Horusiath :+1: I'll fix those things. Was home yesterday coding in bed :)
Arjen Smits
@Danthar
Apr 30 2015 08:07 UTC
Seems there is a larger DNS issue over at Microsoft. Several sites are down, including code.visualstudio.com. Nice timing ^^
Raymen Scholten
@raymens
Apr 30 2015 08:09 UTC
Yeah, and NuGet, but there's a workaround for that
Arjen Smits
@Danthar
Apr 30 2015 08:15 UTC
Wonder how they fucked up so many DNS entries at the same time.
jberzy
@jberzy
Apr 30 2015 08:16 UTC
it might be the actual name server
and then it propagates and you're screwed
Arjen Smits
@Danthar
Apr 30 2015 08:31 UTC
heh, that would mean it could be a few hours before things start working again.
Bartosz Sypytkowski
@Horusiath
Apr 30 2015 08:57 UTC
Jeeez, @rogeralsing you've almost ported it all
Roger Johansson
@rogeralsing
Apr 30 2015 09:27 UTC
only 50%
:)
Bartosz Sypytkowski
@Horusiath
Apr 30 2015 09:39 UTC
please don't start Akka.Streams without me ;P
Roger Johansson
@rogeralsing
Apr 30 2015 09:40 UTC
jump in and join the sharding stuff.. there is a persistent actor in there, it could use some love :)
Bartosz Sypytkowski
@Horusiath
Apr 30 2015 09:41 UTC
ok, I'll try this weekend
first I want to send a PR for typed F# actor refs and the new Akka.Persistence.FSharp
Roger Johansson
@rogeralsing
Apr 30 2015 10:44 UTC
alltherage.png
really nice to see that we have some traction :)
Anthony Brown
@bruinbrown
Apr 30 2015 10:46 UTC
It was absolutely packed in that room, some people got turned away because people were blocking the door
I'll be doing another talk or two in Bristol soon for the DNSW group you forwarded on
Roger Johansson
@rogeralsing
Apr 30 2015 10:47 UTC
Are there any photos from the talk? It would be nice to add some news about this kind of thing :)
Anthony Brown
@bruinbrown
Apr 30 2015 10:48 UTC
None that I'm aware of
Roger Johansson
@rogeralsing
Apr 30 2015 10:48 UTC
ok
Arjen Smits
@Danthar
Apr 30 2015 11:35 UTC
Makes docs all the more important
Dan Barua
@danbarua
Apr 30 2015 12:41 UTC
Anthony, did I see a call centre application (akka/twilio) in one of your Visual Studio windows?
Anthony Brown
@bruinbrown
Apr 30 2015 12:45 UTC
You did, yeah, it was an idea I toyed with to show the concepts behind switchable behaviours, it needs a bit more work to be a polished demo but it's on the todo list
Dan Barua
@danbarua
Apr 30 2015 12:46 UTC
I was going to ask if that was demo code or proprietary... I'm going to be working on a cloud telecom platform using akka
I put a FreeSWITCH lib up on GitHub a couple of months back, and the first pull request was about making it play nice with Akka.NET
there are quite a few people already hacking on it in telecoms
Anthony Brown
@bruinbrown
Apr 30 2015 12:47 UTC
No, far from proprietary, I'll aim to finish it up sometime over the next few weeks, some of the logic's a bit iffy and I was planning on porting across Akka-replicated-data
I guess Erlang and Ericsson showed the power of the actor model in telecoms and that success is now continuing on with Akka
I've been playing around with rabbitmq + redis for real-time/hot data but I like the idea of keeping state internal to (event sourced) actors and using events to build up a read model separately
Anthony Brown
@bruinbrown
Apr 30 2015 12:50 UTC
Yeah, it's something I've wanted to learn more about for a while now, CRDTs, vector clocks and how distributed databases work so I thought that would be a decent starting point
Dan Barua
@danbarua
Apr 30 2015 12:50 UTC
I need to play a bit more with akka remoting and akka persistence to see how it all fits together
Riccardo Terrell
@rikace
Apr 30 2015 14:09 UTC
Has anybody started playing with clustering?
i've been playing with the example, trying to find ways of breaking it
Riccardo Terrell
@rikace
Apr 30 2015 14:10 UTC
Thanks! I am sure I can break it :smile:
It would be nice to use Akka.NET in combination with brisk-engine for cloud deployment
I have played with mBrace & brisk-engine... and I am wondering now
Dan Barua
@danbarua
Apr 30 2015 14:20 UTC
it doesn't seem to be working reliably yet..
e.g. I bring up another worker and it just sits there spewing out errors
like I said, I need to understand it a bit more (e.g. which errors being logged to the console are being recovered from and which ones are stopping the cluster from working)
Arjen Smits
@Danthar
Apr 30 2015 14:44 UTC
go do the bootcamp
it explains all the basic stuff, including supervision, which incidentally answers your question about which errors are stopping the cluster from working. Bugs excluded, of course ;)
Dan Barua
@danbarua
Apr 30 2015 14:46 UTC
yeah i'm just working through that
had it working nicely yesterday, today not so much (just running the examples as-is)
i'm also more interested in what's going on under the hood - akka is a nice toolkit but a lot of it is basically magic to the consuming developer
Arjen Smits
@Danthar
Apr 30 2015 14:49 UTC
I still have to dive into the clustering and network stuff that Akka does, the limiting factor being time. But if you have some specific questions, I'm sure some of the core devs will answer them :)
Andrew Skotzko
@skotzko
Apr 30 2015 15:06 UTC
@bruinbrown any video of your talk?
Anthony Brown
@bruinbrown
Apr 30 2015 15:08 UTC
@skotzko unfortunately not I'm afraid
Andrew Skotzko
@skotzko
Apr 30 2015 15:08 UTC
:cry:
Andrew Skotzko
@skotzko
Apr 30 2015 15:09 UTC
kewl
Roger Johansson
@rogeralsing
Apr 30 2015 15:39 UTC
@danbarua AFAIK, regarding "e.g. I bring up another worker and it just sits there spewing out errors": the webcrawler sample does not load balance after a crawl has started; once a crawl is running, new nodes just hang there. cc @Aaronontheweb
Andrew Skotzko
@skotzko
Apr 30 2015 15:45 UTC
@danbarua in that demo (last i checked) new nodes were able to join the ongoing crawl jobs
there was an issue where once disconnected they weren’t reconnecting right
Dan Barua
@danbarua
Apr 30 2015 15:45 UTC
yeah
so I shut the entire actor system down
still no worky
was working great yesterday
Andrew Skotzko
@skotzko
Apr 30 2015 15:46 UTC
sometimes i’ve seen if you let it run for a bit it recovers
s/sometimes/usually
Joshua Benjamin
@annymsMthd
Apr 30 2015 16:09 UTC
Clustering still needs a bit of love. We currently have a cluster running, but we see nodes go down every once in a while.
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:25 UTC
@danbarua as @annymsMthd pointed out, there are some issues with Akka.Cluster at the moment
namely it doesn't handle reconnects very well - which is a problem when you first try to form a cluster with lots of processes starting in parallel
as part of @skotzko's and my effort to market https://petabridge.com/training/, where we run half-day live webinars explaining Akka design / architecture patterns, Akka.Remote, and Akka.Cluster
I'll be publishing some free videos / blog posts explaining how some parts of those modules work
should have my first video explaining the internals of Akka.Remote
Dan Barua
@danbarua
Apr 30 2015 16:28 UTC
cool
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:28 UTC
up today
Dan Barua
@danbarua
Apr 30 2015 16:28 UTC
for now i'm happy to play with examples and read the source code
Arjen Smits
@Danthar
Apr 30 2015 16:28 UTC
Do we have an example for the TestKit where you test an actor that creates a child actor using the DIExtension system? (In other words, a child actor whose dependencies are managed by a DI container.)
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:31 UTC
@rogeralsing dude, that JabbR screenshot is awesome - warms my heart
@bruinbrown lol I love your image on slide 6
that's great
Dan Barua
@danbarua
Apr 30 2015 16:38 UTC
oh this is awesome, my 7 year old's topic for this term: algorithms
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:40 UTC
they're teaching 7 year olds algorithms now?
Dan Barua
@danbarua
Apr 30 2015 16:40 UTC
according to this term's newsletter they are!
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:40 UTC
that's pretty cool
Dan Barua
@danbarua
Apr 30 2015 16:41 UTC
yeah, that's awesome
i'd love to get involved in CodeClub but I don't have the time
also i find the idea of getting up in front of a class of kids terrifying
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:42 UTC
I remember when I first started learning how to program when I was that age (my dad was an old-school Mac and Windows programmer) - he made me write a console C application to do all of my multiplication tables
kids are probably a much better audience than adults - none of them have twitter accounts
I think it's awesome that they're learning this stuff that early
Dan Barua
@danbarua
Apr 30 2015 16:44 UTC
IT teaching in the UK (at school level) has mostly been how-to-use-MS-Office
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:44 UTC
a lot of those skills, when taught young, will translate to other areas of study like math and science
Thomas Tomanek
@thomastomanek
Apr 30 2015 16:44 UTC
I think in the UK they're planning on making coding compulsory at school (like little kids school)
Dan Barua
@danbarua
Apr 30 2015 16:44 UTC
yeah
it's a good skill to have
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:45 UTC
the first thing they'll notice is that they'll start internalizing lots of stuff in terms of if...else... :P
Joshua Benjamin
@annymsMthd
Apr 30 2015 16:45 UTC
as long as they don't do it in terms of goto
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:47 UTC
man this //BUILD seems like a really good one
Dan Barua
@danbarua
Apr 30 2015 16:47 UTC
it's rush hour/home time in the uk
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:47 UTC
the last 2 I went to were pretty boring by comparison
Dan Barua
@danbarua
Apr 30 2015 16:47 UTC
so have to catch up via Twitter/HN in the evening
and on that note.. gotta run!
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:53 UTC
seeing WPF get some love at //BUILD makes me a little nostalgic
for what it does I think it's a pretty kick-ass piece of technology
@bruinbrown just finished going through your slides - very well done!
Thomas Tomanek
@thomastomanek
Apr 30 2015 16:57 UTC
surprised WPF got a mention at all
Anthony Brown
@bruinbrown
Apr 30 2015 16:58 UTC
@Aaronontheweb cheers, it's a shame I pre-emptively rushed through it to fit it all in and finished a bit too early
Aaron Stannard
@Aaronontheweb
Apr 30 2015 16:59 UTC
happens to the best of us
like Scott Hanselman says, you basically have to deliver the same talk over and over and over again before you really nail all of those timing details
you should be submitting talk proposals to some of the other EU conferences
there's a bunch happening this summer
Arjen Smits
@Danthar
Apr 30 2015 17:01 UTC
just joined in on the channel9 Build feed again.. And they are talking about pregnancy rates ??? o_O
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:01 UTC
many of them will assist with travel costs too
Roger Johansson
@rogeralsing
Apr 30 2015 17:01 UTC
today I'm too 1337 for all of you!! Bow, my minions, here comes the WordPress administrator!!
been staring at broken WP plugins the last few hours :-(
Arjen Smits
@Danthar
Apr 30 2015 17:02 UTC
Know about that zero day? Running the latest version I hope.
Anthony Brown
@bruinbrown
Apr 30 2015 17:02 UTC
Yeah, I've done a few other big confs, did CodeMesh, Lambdadays etc but I realised my degree was starting to slip so held off the past few months
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:03 UTC
mind-blown.gif
Arjen Smits
@Danthar
Apr 30 2015 17:03 UTC
@rogeralsing but my hat off to you sir.
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:03 UTC
that was for @rogeralsing
Arjen Smits
@Danthar
Apr 30 2015 17:03 UTC
^^
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:04 UTC
the next conference I'm submitting a proposal to is this guy http://learn.datastax.com/Cassandra-Summit-2015-Call-for-Papers.html
going to talk about stream processing with Cassandra
featuring Akka.Persistence + hopefully some prototype of Akka.Streams :p
btw @Horusiath, we had a suggestion from Roland to just go ahead and release stable versions of Akka.Persistence and Akka.ClusterSharding as part of 1.2
the Akka.Streams integration is an implementation detail that is invisible to the end-user
and might even be configurable as a different service provider internally
Riccardo Terrell
@rikace
Apr 30 2015 17:06 UTC
I will present at http://www.degoesconsulting.com/lambdaconf-2015/ about Akka.NET and F#
and next I will repeat the same presentation at https://www.codeonthebeach.com
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:07 UTC
@rikace dude, I need your travel budget
you get to do all of the fun ones :p
Riccardo Terrell
@rikace
Apr 30 2015 17:07 UTC
yeap... I am looking for the one in Florida!!
I am working on a new presentation about Akka.NET in Azure/DS
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:10 UTC
we will Akka-ify the world!!!!
that's awesome @rikace - loved the talk you gave at the virtual meetup, and I'm not even an F# developer :p
Riccardo Terrell
@rikace
Apr 30 2015 17:11 UTC
thanks... in Europe the majority of attendees were C#pers and all got very interested in Akka.NET even using F# samples
Bartosz Sypytkowski
@Horusiath
Apr 30 2015 17:17 UTC
@Aaronontheweb right now I'm against official release of Akka.Persistence - it's still not tested enough, besides we still have no general idea how it will be integrated with akka-typed (it may possibly affect API design)
Aaron Stannard
@Aaronontheweb
Apr 30 2015 17:17 UTC
Akka.Persistence is a ways out anyways - still have Akka.Cluster to worry about
Andrew Skotzko
@skotzko
Apr 30 2015 17:31 UTC
let the azure sales pitch resume!
Nikita Tsukanov
@kekekeks
Apr 30 2015 18:30 UTC
Speaking of conferences: I'm going to talk about Akka.NET at the .NEXT conference in St. Petersburg (Russia) this June. Would you recommend any good reference materials? Most people there don't know what the heck an actor is and have only heard something about CQRS/Event Sourcing, so my initial plan is to talk about why actor systems are cool and then follow up with some examples of event sourcing using Akka.Persistence.
Joshua Benjamin
@annymsMthd
Apr 30 2015 18:33 UTC
@bot deploy Ops to Production
whoops
lol totally wrong chat
Aaron Stannard
@Aaronontheweb
Apr 30 2015 18:39 UTC
@kekekeks that's awesome!
here's what I recommend doing as a starting point
first, begin with a brief explanation for what the actor model is and why we care (concurrency, Moore's Law, blah blah) - the slideshare that @bruinbrown published in here earlier today is a good example of that
then talk about Akka.NET and its ancestry - Erlang / JVM Akka
so once you've framed Akka.NET that way, you have a good conceptual model to work from - you can dive into the specifics of Akka.NET after that
start off by showing how you define a trivially simple actor
and then show how to create one from an ActorSystem, and explain the purpose of the ActorRef
location transparency, all that jazz
Raymen Scholten
@raymens
Apr 30 2015 18:42 UTC
and then they'll be Akka.NET ninjas!
Aaron Stannard
@Aaronontheweb
Apr 30 2015 18:42 UTC
@skotzko and I did a much longer version of this at .NET Fringe, but that was a 7 hour long workshop :p
the other talks we've done have typically been 30-60 minutes and speak to a .NET audience that doesn't know anything about the actor model
and we use a similar formula to Anthony's
Riccardo Terrell
@rikace
Apr 30 2015 18:44 UTC
I used the same approach in my presentations and it worked out very well; here are my slides if you want to take a look too.
Nikita Tsukanov
@kekekeks
Apr 30 2015 18:45 UTC
Wow
Thanks
Can I have your permission to use/translate some slides?
Riccardo Terrell
@rikace
Apr 30 2015 18:46 UTC
please enjoy :smile:
creative commons
Nikita Tsukanov
@kekekeks
Apr 30 2015 19:01 UTC
Oh, dat picture with workers
Roger Johansson
@rogeralsing
Apr 30 2015 19:09 UTC
:)
Roger Johansson
@rogeralsing
Apr 30 2015 19:42 UTC
batman.jpg
Maximusya
@Maximusya
Apr 30 2015 19:53 UTC
have you guys heard of ZeroC Ice?
if yes, can you tell whether it has a lot in common with akka?
Roger Johansson
@rogeralsing
Apr 30 2015 20:08 UTC
at a quick glance it looks more similar to SignalR (?)
there's some threading-control support too, apparently
Roger Johansson
@rogeralsing
Apr 30 2015 20:14 UTC
Not sure what to make of it; the site doesn't really describe what techniques it builds upon (?)
Aaron Stannard
@Aaronontheweb
Apr 30 2015 20:24 UTC
I have nightly builds up and running on TeamCity, with a public nuget feed now
need to let it sit overnight and absorb configuration changes though before I can share
because there's a bunch of people who haven't synced from dev before sending PRs, it's publishing a bunch of nonsense v1.0.0 packages
those should disappear after a day
Roger Johansson
@rogeralsing
Apr 30 2015 20:26 UTC
:+1:
Aaron Stannard
@Aaronontheweb
Apr 30 2015 20:27 UTC
had to stop publishing per-PR artifacts and NuGet packages
and just stick to nightly builds that publish unique-named ones
Maximusya
@Maximusya
Apr 30 2015 21:04 UTC
ZeroC Ice is for building distributed systems. Location transparency etc etc. Actors in Akka ~ servants in Ice?
Just hoped to get an experienced opinion :)
Joshua Benjamin
@annymsMthd
Apr 30 2015 21:31 UTC
@Aaronontheweb I'm seeing an error in AkkaProtocolTransport.Associate that looks to be unhandled. It is "The remote system refused the association because it is shutting down." have you seen this one?
Aaron Stannard
@Aaronontheweb
Apr 30 2015 21:48 UTC
I think I might have - I've been documenting all of the Akka.Remote plumbing internals trying to find the source of these "can't reconnect" bugs
AkkaProtocolTransport.Associate?
hmm
public Task<AkkaProtocolHandle> Associate(Address remoteAddress, int? refuseUid)
{
    // Prepare a Task and pass its completion source to the manager
    var statusPromise = new TaskCompletionSource<AssociationHandle>();

    manager.Tell(new AssociateUnderlyingRefuseUid(SchemeAugmenter.RemoveScheme(remoteAddress), statusPromise, refuseUid));

    // TaskContinuationOptions is a flags enum: combine options with |, not &
    // (& of these two flags is TaskContinuationOptions.None)
    return statusPromise.Task.ContinueWith(result => ((AkkaProtocolHandle) result.Result),
        TaskContinuationOptions.AttachedToParent | TaskContinuationOptions.ExecuteSynchronously);
}
Joshua Benjamin
@annymsMthd
Apr 30 2015 21:49 UTC
I'm running the multinode tests locally and it throws in Visual Studio and breaks on the exception. It looks like the task stuff that surrounds it is not handling exceptions
yup
in that ContinueWith
Looks like the only place that calls that is the PreStart in the EndpointWriter
Aaron Stannard
@Aaronontheweb
Apr 30 2015 21:51 UTC
so the message handling ends up calling this
private void CreateOutboundStateActor(Address remoteAddress,
    TaskCompletionSource<AssociationHandle> statusPromise, int? refuseUid)
{
    var stateActorLocalAddress = localAddress;
    var stateActorSettings = _settings;
    var stateActorWrappedTransport = _wrappedTransport;
    var failureDetector = CreateTransportFailureDetector();

    Context.ActorOf(RARP.For(Context.System).ConfigureDispatcher(ProtocolStateActor.OutboundProps(
        new HandshakeInfo(stateActorLocalAddress, AddressUidExtension.Uid(Context.System)),
        remoteAddress,
        statusPromise,
        stateActorWrappedTransport,
        stateActorSettings,
        new AkkaPduProtobuffCodec(), failureDetector, refuseUid)),
        ActorNameFor(remoteAddress));
}
could be happening here:
private void InitializeFSM()
{
    When(AssociationState.Closed, fsmEvent =>
    {
        State<AssociationState, ProtocolStateData> nextState = null;
        // Transport layer events for outbound associations
        fsmEvent.FsmEvent.Match()
            .With<Status.Failure>(f => fsmEvent.StateData.Match()
                .With<OutboundUnassociated>(ou =>
                {
                    ou.StatusCompletionSource.SetException(f.Cause);
                    nextState = Stop();
                }))
since we set an exception on the promise right there
so I have a strong feeling I know what's going wrong
with a lot of these "unable to connect" issues
I think I goofed up inside the HeliosTcpTransport
and passed back the wrong type of disassociation reason
Joshua Benjamin
@annymsMthd
Apr 30 2015 21:55 UTC
Looks like one of the exceptions is coming from the FSM thing. There is another one that is coming from somewhere else
Aaron Stannard
@Aaronontheweb
Apr 30 2015 21:55 UTC
throwing the exception right there is a planned thing
designed to just stop the association process if it encounters an error
but as Dave pointed out, something doesn't get properly cleaned up
Joshua Benjamin
@annymsMthd
Apr 30 2015 21:57 UTC
If the exception inside that ContinueWith is unhandled, could that be the reason stuff isn't being cleaned up?
Aaron Stannard
@Aaronontheweb
Apr 30 2015 21:58 UTC
could be - exception may not be happening in the right place
Joshua Benjamin
@annymsMthd
Apr 30 2015 21:59 UTC
Don't you have to explicitly handle exceptions in ContinueWiths?
Aaron Stannard
@Aaronontheweb
Apr 30 2015 21:59 UTC
the protocolstateactor will fail right there
since its next state is Stop
so that should take care of it
yeah, this might be a culprit
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:00 UTC
return statusPromise.Task.ContinueWith(result =>
{
    if (result.IsFaulted) { Handle(); }
    return ((AkkaProtocolHandle) result.Result);
}, TaskContinuationOptions.AttachedToParent | TaskContinuationOptions.ExecuteSynchronously);
bah sorry
something like that
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:00 UTC
give that a try
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:00 UTC
Handle is just a stub
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:00 UTC
I'm testing something else at the moment
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:00 UTC
not sure what to do with the exception
but I'll investigate
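On the open question of what to do with the exception: when the antecedent task is faulted, reading `result.Result` inside the continuation throws an `AggregateException`, so the returned task faults with that wrapper rather than the original cause, and a debugger set to break on thrown exceptions may stop there. One option, sketched below under stated assumptions, is to forward faults and cancellation explicitly into a second `TaskCompletionSource`. `UnderlyingHandle`, `ProtocolHandle` and `CastResult` are hypothetical stand-ins for `AssociationHandle`, `AkkaProtocolHandle` and the cast in `Associate`, not Akka.NET types.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for AssociationHandle / AkkaProtocolHandle.
public class UnderlyingHandle { }
public class ProtocolHandle : UnderlyingHandle { }

public static class ContinuationSketch
{
    // Converts Task<UnderlyingHandle> into Task<ProtocolHandle>, forwarding
    // faults and cancellation instead of touching t.Result on a failed task.
    public static Task<ProtocolHandle> CastResult(Task<UnderlyingHandle> source)
    {
        var promise = new TaskCompletionSource<ProtocolHandle>();
        source.ContinueWith(t =>
        {
            if (t.IsFaulted)
                promise.TrySetException(t.Exception.InnerExceptions); // original causes
            else if (t.IsCanceled)
                promise.TrySetCanceled();
            else if (t.Result is ProtocolHandle handle)
                promise.TrySetResult(handle);
            else
                promise.TrySetException(new InvalidCastException("Expected a ProtocolHandle."));
        }, TaskContinuationOptions.ExecuteSynchronously);
        return promise.Task;
    }
}
```

Callers then see the original exception (for example the "refused the association" failure) instead of a nested `AggregateException`, and a wrong handle type faults the task instead of leaving it pending.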
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:03 UTC
so I think the fact that Helios was reporting most disconnections as "shutdowns" was part of the problem
that means something much more specific than I originally thought
the EndpointManager interprets a shutdown as a planned termination
and applies quarantining much more liberally
truth of the matter is that a lot of these disassociations are intermittent connection failures
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:04 UTC
which would happen with networks
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:05 UTC
often the result of multiple nodes racing to connect first (Akka.Cluster), actual network failures, or missed heartbeats (which we are working on solving)
yeah
so I just changed it so the HeliosTcpTransport does exactly what the NettyTcpTransport does
reports all socket-level disconnects as "unknown"
meaning: this was an accident, please let that node reconnect when it can
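The change above separates what the transport reports from how the endpoint side reacts. A hypothetical sketch of that split follows; the enum mirrors the three reasons JVM Akka's `AssociationHandle.DisassociateInfo` distinguishes (Unknown, Shutdown, Quarantined), but `DisassociateReason`, `EndpointReaction` and `DisassociationPolicy` are illustrative names, not the Akka.Remote API, and the assumption that only an explicit notice from the remote system should yield anything other than `Unknown` is ours.

```csharp
// Hypothetical model of the disassociation handling described above;
// these names are illustrative and are not the Akka.Remote API.
public enum DisassociateReason { Unknown, Shutdown, Quarantined }

public enum EndpointReaction { AllowReconnect, PlannedTermination, Quarantine }

public static class DisassociationPolicy
{
    // Transport side: a dropped socket says nothing about the remote's intent,
    // so without an explicit notice from the remote system it is Unknown.
    public static DisassociateReason FromTransport(DisassociateReason? remoteNotice) =>
        remoteNotice ?? DisassociateReason.Unknown;

    // Endpoint side: only Unknown is treated as an accident worth retrying.
    public static EndpointReaction React(DisassociateReason reason) => reason switch
    {
        DisassociateReason.Shutdown => EndpointReaction.PlannedTermination,
        DisassociateReason.Quarantined => EndpointReaction.Quarantine,
        _ => EndpointReaction.AllowReconnect,
    };
}
```

Reporting every socket drop as `Shutdown` would route intermittent failures into the planned-termination branch, which is the behaviour described above.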
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:06 UTC
nice
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:21 UTC
ouch, my ArgumentNullCheck for cluster pool routers just found a nasty exception
as I suspected, looks like the supervision strategy doesn't get properly passed in from the underlying local RouterConfig
well, that explains a lot
Aaron Stannard
@Aaronontheweb
Apr 30 2015 22:50 UTC
yeah, this happens whenever you load a clustered pool router from configuration
have a patch and a spec for it now
Joshua Benjamin
@annymsMthd
Apr 30 2015 22:51 UTC
\o/
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:12 UTC
think I found another cause to some of the weird behavior in the ProtocolStateActor
had an improperly nested PatternMatch
that caused us to pass a null Exception object back to the statusCompletionPromise
so that might have prevented the error from cascading properly
only happened when the protocolstateactor was shutting down
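Passing `null` to `TaskCompletionSource<T>.SetException` (or `TrySetException`) throws `ArgumentNullException` instead of faulting the promise, so the failure is raised inside the terminating actor and anyone waiting on the promise keeps waiting. A minimal guard, assuming a hypothetical helper named `PromiseGuards.TryFail` (not part of Akka.NET):

```csharp
using System;
using System.Threading.Tasks;

public static class PromiseGuards
{
    // Faults the promise with the given cause; a null cause is replaced by a
    // descriptive exception so the failure still propagates to waiters.
    // Returns false if the promise was already completed.
    public static bool TryFail<T>(TaskCompletionSource<T> promise, Exception cause, string context)
    {
        Exception error = cause ?? new InvalidOperationException(
            $"Association failed without a recorded cause ({context}).");
        return promise.TrySetException(error);
    }
}
```

Using the `Try` variant also keeps a late second failure (for example from `OnTermination` after the promise was already completed) from throwing.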
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:13 UTC
Where is it in the file?
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:13 UTC
in the OnTermination block
of the ProtocolStateActor
I'm basically just applying patch after patch after patch on the WebCrawler sample to see what blows up, since under decent amounts of traffic the webcrawler sample does a reliable job of reproducing this behavior
random disconnects / unable to rejoin
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:15 UTC
awesome. With the changes to MultiNode, I'm just running it in VS and seeing where it fails
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:15 UTC
the OnTermination block error was just a case of a parenthesis being in the wrong place
stupid error :\
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:16 UTC
the match with OutboundUnderlyingAssociated ?
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:16 UTC
yep
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:42 UTC
seeing much better reconnect behavior now
still not 100% fixed
but I think I know what the root cause is right now
quarantining is when the problem happens
and I think we're doing it too aggressively
when two nodes are up and functional
and due to some intermittent issue one quarantines another
(in this case, I had the webcrawler crawling wikipedia and it saturated the threadpool)
when that quarantine happens, the two nodes each start reporting the other as unreachable
and if one of the quarantined nodes is the leader
then things get interesting
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:45 UTC
oh yeah. I have seen our cluster split before
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:45 UTC
so this is interesting - I'm in the middle of documenting when nodes are supposed to be gated vs. quarantined
IMHO, I don't think I understood this distinction very well when I originally ported the Akka.Remote code
and things are being quarantined too liberally
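For reference, the distinction as documented for Akka remoting: gating is a temporary back-off after a failed association (the address is retried once `akka.remote.retry-gate-closed-for` has elapsed), while quarantine applies to one incarnation (UID) of a remote system and is permanent, so only a restarted system with a new UID can associate again. That asymmetry is why quarantining an intermittent failure is so costly. A hypothetical bookkeeping sketch; `EndpointPolicies` is illustrative and is not Akka.Remote's own endpoint registry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical endpoint-policy bookkeeping; illustrative only.
public sealed class EndpointPolicies
{
    private readonly Dictionary<string, DateTime> _gatedUntil = new Dictionary<string, DateTime>();
    private readonly HashSet<(string Address, int Uid)> _quarantined = new HashSet<(string Address, int Uid)>();

    // Temporary: the address may be retried once the gate opens again.
    public void Gate(string address, DateTime until) => _gatedUntil[address] = until;

    // Permanent for this incarnation: only a new UID (a restarted system) gets through.
    public void Quarantine(string address, int uid) => _quarantined.Add((address, uid));

    public bool MayAssociate(string address, int uid, DateTime now)
    {
        if (_quarantined.Contains((address, uid)))
            return false;
        return !_gatedUntil.TryGetValue(address, out var until) || now >= until;
    }
}
```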
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:46 UTC
Speaking of gated: when I see failures in the tests, it looks like a node repeats the gated thing over and over for a while.
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:46 UTC
yeah, I've seen that too
ok - well, now I know what I need to go research for both this internals video and this bug
sounds like really understanding these endpoint policies and how they're supposed to be applied is key
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:46 UTC
I traced it to, I think, the EndpointWriter. I think when it is in the writing state it doesn't handle failures the same way
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:47 UTC
that sounds like an error then
we should go through the JVM source and see if we're doing anything obviously wrong
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:47 UTC
It looked like akka did it the same way
I checked that
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:48 UTC
so man, this is interesting
the leader ended up marking itself as down
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:49 UTC
ouch
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:49 UTC
new node tried to connect to it, because it's also the seed
leader rejected it
welp, looks like this is becoming less of a mystery
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:50 UTC
good to hear!
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:50 UTC
there's some code that affects endpoint policy that is screwed up
and then there's also some code inside the CurrentClusterState and Reachability policies in Akka.Cluster that is messed up
don't know what yet, but that's a lot closer to the right answer than we were yesterday
Joshua Benjamin
@annymsMthd
Apr 30 2015 23:53 UTC
So the gated thing I saw was in ReliableDeliverySupervisor. When it gets a Terminated in the Gated state, it schedules an Ungate message to itself. If the buffers have messages, it becomes OnReceive again
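The cycle just described can be reproduced without actors. In this hypothetical model (class and method names are illustrative, timers are omitted, and `Receiving` stands for the `OnReceive` behaviour), every failure while messages are still buffered ends in another Ungate and another reconnect attempt, so an unreachable peer with pending messages cycles through Gated repeatedly, which would match the repeated gating seen in the test failures.

```csharp
using System.Collections.Generic;

public enum SupervisorState { Idle, Receiving, Gated }

// Hypothetical, actor-free model of the gate/ungate cycle; timers are omitted
// and an Ungate is delivered by calling Ungate() directly.
public sealed class GateCycleModel
{
    private readonly Queue<string> _unacked = new Queue<string>();

    public SupervisorState State { get; private set; } = SupervisorState.Idle;
    public int UngatesScheduled { get; private set; }

    // A new outbound message starts a writer if none is running.
    public void Send(string message)
    {
        _unacked.Enqueue(message);
        if (State == SupervisorState.Idle) State = SupervisorState.Receiving;
    }

    // The remote acknowledged the oldest buffered message.
    public void Ack()
    {
        if (_unacked.Count > 0) _unacked.Dequeue();
    }

    // The writer failed: close the gate.
    public void WriterFailed() => State = SupervisorState.Gated;

    // The writer terminated while gated: schedule an Ungate to self.
    public void WriterTerminated()
    {
        if (State == SupervisorState.Gated) UngatesScheduled++;
    }

    // Ungate: reconnect only if something is still waiting to be delivered.
    public void Ungate()
    {
        if (State != SupervisorState.Gated) return;
        State = _unacked.Count > 0 ? SupervisorState.Receiving : SupervisorState.Idle;
    }
}
```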
Aaron Stannard
@Aaronontheweb
Apr 30 2015 23:54 UTC
I think our ReliableDeliverySupervisor is out of date - I saw some new messaging types that it has on the JVM side
and some boilerplate code inside the Remoting and EndpointManager that use it
noticed that just yesterday