These are chat archives for akkadotnet/akka.net

21st
Jun 2016
to11mtm
@to11mtm
Jun 21 2016 00:06
stupid question... what happens if one is unlucky enough that the same number gets picked twice by the RNG?
Aaron Stannard
@Aaronontheweb
Jun 21 2016 00:06
there is a chance that can happen
it's remote but not zero
in that case you'd have the behavior that Akka.NET has now
system accidentally believes that this incarnation of the new actor is the same as the remote actor that came before it
odds of that happening are very small due to the latencies involved in shutting down and starting up actor systems - given that the RNG is seeded by the system clock
in the PR I'm wrapping up you'll be able to see the UID in all of the logs for each actor reference now
since it's part of the IActorRef.ToString() output now
so the way you'd diagnose that message is, unfortunately, comparing two log files between servers
to11mtm
@to11mtm
Jun 21 2016 00:09
yeah. It's the exact same behavior JVM does too, and I guess if it was an issue they'd have done something about it by now ;)
Aaron Stannard
@Aaronontheweb
Jun 21 2016 00:12
yeah
I guess the tradeoff is having to persist every single UID for every remote actor reference you've been in contact with and comparing against that persisted result
versus the very remote chance of a single collision
qwoz
@qwoz
Jun 21 2016 00:24
it's a random int32?
Aaron Stannard
@Aaronontheweb
Jun 21 2016 00:26
yessir
we use a ThreadLocal<Random> for generating those
qwoz
@qwoz
Jun 21 2016 00:28
I guess a collision could happen... but you have much better odds of winning the lottery. :)
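To put rough numbers on those odds, here is a minimal sketch. It assumes the UID is a uniformly random 32-bit value and ignores the clock-seeding effect mentioned earlier; the class and method names are illustrative only and are not part of Akka.NET.

```csharp
using System;

public static class UidCollisionOdds
{
    // Chance that one freshly drawn UID equals one specific earlier UID,
    // assuming a uniform draw over `bits` random bits (an idealization).
    public static double SingleMatch(int bits) => Math.Pow(2.0, -bits);

    // Birthday-style approximation: chance that at least two of `n`
    // independently drawn UIDs coincide, 1 - exp(-n(n-1) / 2^(bits+1)).
    public static double AnyMatch(long n, int bits) =>
        1.0 - Math.Exp(-(double)n * (n - 1) / Math.Pow(2.0, bits + 1));

    public static void Main()
    {
        Console.WriteLine(SingleMatch(32));     // a restart vs. its predecessor
        Console.WriteLine(AnyMatch(1000, 32));  // any pair among 1,000 incarnations
        Console.WriteLine(AnyMatch(1000, 64));  // same, with a 64-bit UID
    }
}
```

The case discussed in this chat corresponds to SingleMatch: a restarted system only causes confusion if it draws the same UID as the incarnation that previously lived at that address.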
Aaron Stannard
@Aaronontheweb
Jun 21 2016 00:28
yeah
I mean, I could do something that would allow me to do a 64 bit random number, basically using some bitshifting from two 32-bit random numbers to combine them together
but as you say, the odds of it ever happening are extremely small
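A minimal sketch of the bit-shifting idea, assuming a per-thread System.Random like the one described above. This is not the Akka.NET implementation; WideUid and its members are hypothetical names.

```csharp
using System;
using System.Threading;

public static class WideUid
{
    // One Random per thread, mirroring the ThreadLocal<Random> mentioned above.
    private static readonly ThreadLocal<Random> Rng =
        new ThreadLocal<Random>(() => new Random());

    // Random.Next() never returns a negative value, so it yields fewer than
    // 32 random bits; NextBytes fills all 32 bits of each half instead.
    private static uint NextUInt32()
    {
        var buffer = new byte[4];
        Rng.Value.NextBytes(buffer);
        return BitConverter.ToUInt32(buffer, 0);
    }

    // Shift the first 32-bit value into the high half and OR in the second.
    public static long NextInt64()
    {
        ulong high = NextUInt32();
        ulong low = NextUInt32();
        return unchecked((long)((high << 32) | low));
    }

    public static void Main()
    {
        Console.WriteLine(NextInt64());
    }
}
```

Both halves come from the same generator, so the result is only as unpredictable as the seed; the wider value lowers the odds of an accidental match but does nothing about two generators that start from the same clock-based seed.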
I remember freaking out one time when I was debugging some Ruby code and saw a Guid collision happen
which, while mathematically possible
has something like a 1 / 2^64 chance of actually happening
and then I saw it happen a second time
and decided to take a look
sure enough whoever wrote the code was manually cobbling together Guids using a counter
and, of course, being a Ruby program written by a typical Ruby developer, there was no concept of multiple requests all competing to update the Guid at the same time
qwoz
@qwoz
Jun 21 2016 00:32
I ran across it myself, pretty sure it was a legit Guid.NewGuid() collision
or, come to think of, it's statistically more likely it was a bug in the third party library that threw an exception when I tried to add an item it said it already had. I was performance testing it, throwing as many new GUIDs at it as it could handle.
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 06:40
Hi! I have a simple problem getting started with Akka.Cluster and might need a small hint about where to look.
I just have a running Lighthouse (as seed-node) and one node that does nothing but create the actor system at the moment. When I start the Lighthouse and then the node, everything works as expected. When I start the node first and then the Lighthouse, the node never connects to the Lighthouse - it doesn't even seem to retry at all. According to akka.cluster.retry-unsuccessful-join-after = 10s it should, as I understand it? It just says Starting up.... and never continues.
voltcode
@voltcode
Jun 21 2016 07:21
woah, waffle shows 0 issues pending for the next release. is the next step for 1.1 emptying the "needs review" queue?
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 07:23
@ZoolWay the best person to ask would be @Aaronontheweb, but he is in the Pacific time zone (UTC-7 at this time of year)
@voltcode this means that the issues are not properly tagged ;)
If we're going to release at the end of this month, maybe I could keep up the pace and get everything ready for upgrading persistence up to akka-streams? Basically I need to fix the postgres plugin (I think it will take 1-2 evenings) + take a look at the new serializers-by-manifest API made by @alexvaluyskiy. If we could include it, it would be great.
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 08:43
@ZoolWay the seed node (Lighthouse) must be started before all other nodes
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 08:45
@alexvaluyskiy In normal operation sure. But it makes sense to retry the seed-node if it cannot be reached. Think about your servers restarting and the worker node comes up before the seed node. Otherwise what would akka.cluster.retry-unsuccessful-join-after mean?
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 08:57
we try to connect to the seed node for 5 seconds (akka.cluster.seed-node-timeout = 5s) and if that fails we wait for 10 seconds (akka.cluster.retry-unsuccessful-join-after = 10s)
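For reference, a sketch of where those two settings sit in a node's configuration. The system name, host and ports are placeholders (4053 is used here only as an example Lighthouse port), and the helios.tcp transport section is an assumption about the versions discussed in this chat.

```csharp
using Akka.Actor;
using Akka.Configuration;

public static class ClusterNode
{
    public static ActorSystem Start()
    {
        // All names, addresses and ports below are placeholders.
        var config = ConfigurationFactory.ParseString(@"
            akka {
              actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
              remote.helios.tcp {
                hostname = ""127.0.0.1""
                port = 0
              }
              cluster {
                seed-nodes = [""akka.tcp://mysystem@127.0.0.1:4053""]
                # how long to wait for a seed node to answer the join request
                seed-node-timeout = 5s
                # pause before the next join attempt after a failed one
                retry-unsuccessful-join-after = 10s
              }
            }");

        // The system name must match the one the seed node uses.
        return ActorSystem.Create("mysystem", config);
    }
}
```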
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 08:58
yes - and that is what I want it to do, but it never tells me that it is not able to connect and never says anything about retrying...
although I have even explicitly set those two values
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 08:59
Which version of Akka do you use?
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 08:59
1.0.8, 1.0.8.25-beta for Akka.Cluster
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 09:00
You could try Akka from nightly build from this nuget feed https://www.myget.org/F/akkadotnet/api/v2
It has a ton of fixes for remoting/cluster
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 09:33
@alexvaluyskiy Thank you! With 1.0.9.226-beta it works as expected (and even gives a concrete initial error and shows all retry attempts)
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 11:15
@Horusiath I did not implement Serializers with string manifest. They were already implemented, but these serializers didn't work in Remote/Cluster.Tools
@Horusiath Do you plan to update the SqlServer, MySql, etc. plugins or just Sqlite and Postgres?
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 11:30
@alexvaluyskiy I have Sqlite and SqlServer already working
I'll pass on MySql, as I don't want to install another db engine on my home computer ;)
Deniz İrgin
@Blind-Striker
Jun 21 2016 12:41

Hi,

We updated our project to the latest Akka.Net nightly build packages after @Aaronontheweb's suggestion (http://stackoverflow.com/questions/37680960/node-sometimes-doesnt-join-the-akka-net-cluster-after-iis-recycle-apppool)

When I start the Lighthouse application we get a warning about Helios: [WARNING][21.6.2016 12:36:20][Thread 0019][Helios.Channels.Bootstrap.ServerBootstrap] Unknown channel option: Helios.Channels.ChannelOption`1[System.Int32]
Cause: Unknown

Could it be a problem? Should we add some option to the HOCON configuration, or just ignore the warning?

Thank you

Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 12:51
@Horusiath a Linux VM with docker images for all supported databases will save us in the future, when MS releases SQL Server for Linux
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 12:53
so many .net devs are holding back their progress while waiting for .net core that, once it finally appears, there may be no one left to use it
(and I'm not speaking about 1.0 here)
to11mtm
@to11mtm
Jun 21 2016 12:54
@Horusiath Are the SqlServer changes going to be on Github soon? Once they are I can try to get them going on Oracle
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 12:54
but you're right, once it is finally possible, the maintenance of those plugins will be easier
@to11mtm - akkadotnet/akka.net#2038 is the PR with new change set
there could be some small changes over time, but they're really minor
making plugins for specific sql version is a lot easier now
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 12:56
yes, the Mono version of System.Data.SqlClient is not appropriate for production. We should wait for the .NET Core version
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 12:57
@alexvaluyskiy I have some doubts about whether the .net core version is more appropriate for production xD https://github.com/dotnet/corefx/blob/d0dc5fc099946adc1035b34a8b1f6042eddb0c75/src/System.Data.Common/src/System/Data/Common/DbConnection.cs#L123
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 13:05
What's wrong?
They are using sync version of Open for compatibility
Each provider should override this method and use proper async
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 13:06
I'd expect async operation to be actually asynchronous
Bartosz Sypytkowski
@Horusiath
Jun 21 2016 13:09
it would be fun to discover, that 3/4 of the .net async API is actually synchronous wrapped with tasks :)
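The pattern being discussed can be sketched as follows. This is an illustration with hypothetical types (ConnectionBase, SlowConnection), modeled loosely on the description above rather than copied from the linked source.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical base class: the default async method wraps the sync one.
public abstract class ConnectionBase
{
    public abstract void Open();

    // Runs Open() on the calling thread and only then hands back a task
    // that is already completed, faulted or canceled.
    public virtual Task OpenAsync(CancellationToken cancellationToken)
    {
        var tcs = new TaskCompletionSource<object>();
        if (cancellationToken.IsCancellationRequested)
        {
            tcs.SetCanceled();
            return tcs.Task;
        }
        try
        {
            Open();
            tcs.SetResult(null);
        }
        catch (Exception e)
        {
            tcs.SetException(e);
        }
        return tcs.Task;
    }
}

// Hypothetical provider that keeps the inherited default.
public sealed class SlowConnection : ConnectionBase
{
    public override void Open() => Thread.Sleep(200); // stands in for blocking I/O
}

public static class Program
{
    public static void Main()
    {
        Task task = new SlowConnection().OpenAsync(CancellationToken.None);
        // The caller was blocked for the whole Open(); the task is already done.
        Console.WriteLine(task.IsCompleted);
    }
}
```

A provider that overrides OpenAsync with real non-blocking I/O would return before the work finishes, which is the behavior the method name suggests.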
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 13:11
Npgsql became truly async only in 3.0-3.1 releases
3.1 was released on 16.05.2016
I'm not sure about oracle and mysql drivers. They could use synchronous api internally
Roger Johansson
@rogeralsing
Jun 21 2016 13:12
@Aaronontheweb yes, the UID generation needs to be altered for actors. I think that was mentioned in my original PR. I'm not totally impressed with the random number idea because if there is a collision, then it's going to be a hell of a bug to debug or reason about.. but it is currently the best bet imo
to11mtm
@to11mtm
Jun 21 2016 13:25
@Horusiath
thank you!
Chris G. Stevens
@cgstevens
Jun 21 2016 13:27
So I wired up Application Insights and started to play with metrics and such.
I have found that I get about 6k Dead Letters in a matter of minutes.
What is the best way to debug this and find out more information?
Thanks!
Thomas Lazar
@thomaslazar
Jun 21 2016 14:28
nevermind...
Aaron Stannard
@Aaronontheweb
Jun 21 2016 15:28
@cgstevens I would subscribe to DeadLetters and unpack the contents to see what's not being delivered and where
most likely scenario is that a router is misconfigured
@rogeralsing I changed the UID generation to work the same way as the JVM's
@Blind-Striker yeah it can be disregarded - I haven't figured out which Helios setting it is that triggers that. I'm sure it's one of the buffer size settings
I'm going to fix that before the release
it's one of those things that isn't difficult to fix, but it's less urgent than the other stuff being worked on for 1.1
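The DeadLetters subscription suggested above for @cgstevens could look roughly like this. It is a minimal sketch: DeadLetterMonitor and the actor name are hypothetical, and where the output goes (here the Akka logger) is an assumption.

```csharp
using Akka.Actor;
using Akka.Event;

// Hypothetical monitoring actor: logs every dead letter it is handed.
public class DeadLetterMonitor : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public DeadLetterMonitor()
    {
        // Unpack the envelope: what was sent, by whom, and to which target.
        Receive<DeadLetter>(dl =>
            _log.Warning("DeadLetter {0} from {1} to {2}",
                dl.Message, dl.Sender, dl.Recipient));
    }

    protected override void PreStart()
    {
        // Ask the event stream to forward all DeadLetter events to this actor.
        Context.System.EventStream.Subscribe(Self, typeof(DeadLetter));
    }

    protected override void PostStop()
    {
        Context.System.EventStream.Unsubscribe(Self);
    }
}

public static class DeadLetterMonitoring
{
    // Call once after the ActorSystem has been created.
    public static IActorRef Start(ActorSystem system)
    {
        return system.ActorOf(Props.Create(() => new DeadLetterMonitor()),
            "dead-letter-monitor");
    }
}
```

Grouping the logged recipients by path is usually enough to see whether one misconfigured router accounts for most of the volume.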
Deniz İrgin
@Blind-Striker
Jun 21 2016 15:35
@Aaronontheweb thanks for the answer. Anyway, the nightly builds seem more stable for Cluster re-joining. I'll publish to production; I hope it will save me from sleepless nights :smile:
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 15:49
I must have a simple routing problem
I have two programs joined to the same cluster through Akka.Cluster and Lighthouse
the core-program has an actor TestActor ("/user/tester") and the core-program can tell it strings, which it logs...
the client-program should now also tell that actor something
both programs have different roles
the client-program has this deployment/route:
deployment {
    /tester {
        router = round-robin-group
        routees.paths = ["/user/tester"]
        cluster {
            enabled = true
            use-role = "core"
        }
    }
}
But this does not work in the client-program: this.actorSystem.ActorSelection("/user/tester").Tell("hello from client");
What am I missing here?
The result is a dead-letter btw
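One thing that may be worth checking, offered as a sketch rather than a confirmed fix: a deployment section such as /tester only takes effect for an actor that is actually created at that path with a router read from configuration. ActorSelection("/user/tester") in the client only looks for a local actor at that path. The ClientRouting class below is hypothetical.

```csharp
using Akka.Actor;
using Akka.Routing;

public static class ClientRouting
{
    public static void SendHello(ActorSystem actorSystem)
    {
        // Creates the router actor at /user/tester in the client; FromConfig
        // makes it pick up the "/tester" deployment section shown above.
        IActorRef tester = actorSystem.ActorOf(
            Props.Empty.WithRouter(FromConfig.Instance), "tester");

        // Send through the router reference instead of a local ActorSelection.
        tester.Tell("hello from client");
    }
}
```

Even with the router in place, a message sent immediately after startup may still end up as a dead letter if the client has not yet learned about any cluster member with the "core" role.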
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 15:53
enabled = on
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 15:53
Oo
frasermolyneux
@frasermolyneux
Jun 21 2016 15:55
I'm getting a load of "Ignoring received gossip intended for someone else" when I take nodes offline and back online, and the node refuses to reconnect to the cluster - any idea what is going wrong there?
Aaron Stannard
@Aaronontheweb
Jun 21 2016 15:55
@Blind-Striker if #2090 gets merged today that will help even more with the Cluster rejoins
Ricky Blankenaufulland
@ZoolWay
Jun 21 2016 15:55
@alexvaluyskiy Thanks! Unfortunately that cannot be everything. I fixed it but it still produces a dead letter and does not go to the TestActor on the core-program
Aaron Stannard
@Aaronontheweb
Jun 21 2016 15:55
fixed two different issues in Akka.Remote's EndpointManager that could bork a cluster
frasermolyneux
@frasermolyneux
Jun 21 2016 15:56
Essentially I took my two well known nodes offline to check it all continues to work - it did but never recovered when back online
Where can I pull the latest stable from? I assume NuGet isn't updated all the time?
Aaron Stannard
@Aaronontheweb
Jun 21 2016 15:57
latest stable is on NuGet, when we do public releases
the nightly builds can be found here http://getakka.net/docs/akka-developers/nightly-builds
everything that goes into the nightly build has to pass our extensive CI system and human review
frasermolyneux
@frasermolyneux
Jun 21 2016 15:58
So safe for CI/Dev but not prod
Aaron Stannard
@Aaronontheweb
Jun 21 2016 15:58
but those changes are in-flight - we might add something one day and then change it the next, so the API itself is less stable
I'd use it in production if and only if you're using one of our beta modules
and need some of the fixes being packed into a planned release right now
1.1 is going to be a pretty big release in terms of the number of changes that are going into it
frasermolyneux
@frasermolyneux
Jun 21 2016 16:00
I believe I do in terms of the cluster stability but I'll check the current changes to see if it's worth it - thanks :)
Aaron Stannard
@Aaronontheweb
Jun 21 2016 16:00
that's the punchcard for all of the remaining 1.1 items and the closed ones
Aaron Stannard
@Aaronontheweb
Jun 21 2016 16:54
@alexvaluyskiy this router PR is running great
running the MNTR test suite locally after rebasing everything on the dispatcher changes et al
Aaron Stannard
@Aaronontheweb
Jun 21 2016 17:00
I'm having a lot of trouble commenting on the PR itself though due to how big the diff is - Chrome is lagging pretty bad on the diff page
Aaron Stannard
@Aaronontheweb
Jun 21 2016 17:29
@alexvaluyskiy any idea why the cluster singleton startup spec might be failing on #2103?
I've seen it pass locally, so it appears to be racy - but we haven't had that issue on any of the other PRs
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 17:48
It's racy
@Aaronontheweb uncomment all the router specs first ;)
Aaron Stannard
@Aaronontheweb
Jun 21 2016 17:49
I'm having some trouble just scrolling through the diff
freaking javascript lol
ok, I'll start working on the multi-node specs for clustered and remote routers
inside my rebase PR
Aaron Stannard
@Aaronontheweb
Jun 21 2016 19:03
which is a relief
@alexvaluyskiy looks like you might have resolved #1311
that one has been a pain in my ass for a while
decided to take a look at that before I moved onto the clustered router specs
Alex Valuyskiy
@alexvaluyskiy
Jun 21 2016 19:09
Yes, this spec works
Only resizers have racy problems
And Cluster Pool Routers don't work
Aaron Stannard
@Aaronontheweb
Jun 21 2016 19:10
resizers are inherently a bit racy
I discourage their use in production
tuning them to work as expected is pretty difficult in practice
I'll take a look at the cluster pool routers
just doubled the number of agents running on the build server
so we can chew through the build queue a bit faster
Aaron Stannard
@Aaronontheweb
Jun 21 2016 19:44
looks like since we made the dispatcher / actor creation changes there are some racy specs here and there
seen the EventFilter specs for the testkit fail a couple of times
I have an idea for figuring out what is causing this
if it's not a bug with the dispatcher system
then I'm pretty confident it's an issue with how these tests were designed and RepointableActorRef
which now gets returned for 100% of top level actors
Aaron Stannard
@Aaronontheweb
Jun 21 2016 21:45
@alexvaluyskiy alright, I can easily reproduce the problem with pool routers now
unreachable nodes aren't ever being removed from the routing table
I think I can mock this behavior using a spec that doesn't require running a cluster
let me give that a try
Aaron Stannard
@Aaronontheweb
Jun 21 2016 22:12
first though, these racy specs that started popping up today - I think I might need to deal with that first
just to rule out any sort of catastrophic issue with the dispatcher
I doubt it, but I want to be certain