These are chat archives for akkadotnet/akka.net

20th
Oct 2016
Andrew Young
@ayoung
Oct 20 2016 00:15
Anyone here have experience hosting Clusters in Service Fabric?
From what I know, Service Fabric can move your Akka nodes to other SF nodes to rebalance the load. This would effectively change the topology of your cluster.
How do you handle this especially for the SeedNodes which are supposed to have a well-known location?
Corneliu
@corneliutusnea
Oct 20 2016 03:10
@Aaronontheweb Yes, I've started to build a ws transport for Azure AppService... just going way too slow as I'm super busy with other stuff :(
Daniel Söderberg
@raskolnikoov
Oct 20 2016 07:29
Hi, I can not get the custom-mailbox feature to work. Can someone please give me some help in PM?
verilocation
@verilocation
Oct 20 2016 07:51
@qwoz Whether it's right or wrong, I tend to prefer storing things in databases (and perhaps with a purpose-built cache like Redis). The reason is that it's very hard to maintain state within a distributed application. What is the correct state? What if something crashes and loses its state? How do you sync state? For this reason (i.e. my sanity) I prefer to store everything in the database and just take the performance hit, which isn't usually a problem with well-written queries. In this case I'd aim for most of my actors to be stateless (again, this helps with designing for fault tolerance). Some won't or can't be stateless. But if the majority of your app is built this way and you don't have serious performance concerns, then it'll save you a big headache.
John Nicholas
@MrTortoise
Oct 20 2016 08:02
well you need to figure out the source of truth for your system. It's most likely your event store. As for what is correct state, you can answer that but at the cost of availability.
Peter Bergman
@peter-bannerflow
Oct 20 2016 08:12
Question about routers, I have a pooled round robin router. If I somehow want to broadcast a message to that router's routees, what would be the best approach?
a) set up a new broadcast group router that routes to the actor path where the pooled router's routees live
b) use some other mechanism?
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 08:13
@qwoz you need some more DDD modeling for that. Entities are not created equal, i.e. I think having a separate actor for every single book may be overkill. Usually actors are created only for the aggregate roots of your domain
@raskolnikoov you don't need PM, you can just ask here
@peter-bannerflow there's a message called Broadcast(message), if you wrap your custom message with it and send it to router, it will broadcast the content to all routees, no matter what semantic router has
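A minimal sketch of the Broadcast approach described above. The `Worker` actor, the system name, and the pool size are made up for illustration; `Broadcast` and `RoundRobinPool` come from the `Akka.Routing` namespace.

```csharp
using System;
using System.Threading;
using Akka.Actor;
using Akka.Routing;

// Hypothetical routee used only for this illustration.
public class Worker : ReceiveActor
{
    public Worker()
    {
        Receive<string>(msg => Console.WriteLine($"{Self.Path.Name} got: {msg}"));
    }
}

public static class Program
{
    public static void Main()
    {
        using (var system = ActorSystem.Create("demo"))
        {
            // Pooled round-robin router with three routees.
            var router = system.ActorOf(
                Props.Create<Worker>().WithRouter(new RoundRobinPool(3)), "workers");

            // Normal delivery: the router picks one routee.
            router.Tell("to one routee");

            // Wrapped in Broadcast: the payload goes to every routee,
            // regardless of the router's own routing logic.
            router.Tell(new Broadcast("to every routee"));

            // Crude wait so the output is visible before the system is disposed.
            Thread.Sleep(500);
        }
    }
}
```

The routees receive the unwrapped string, so their handlers do not need to know about `Broadcast` at all.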
Peter Bergman
@peter-bannerflow
Oct 20 2016 08:15
Niiice, thanks
Thomas Lazar
@thomaslazar
Oct 20 2016 08:20
not sure if ppl are aware of this already but the "Building Distributed Systems with Akka.NET Clustering" course on pluralsight is free this month. https://app.pluralsight.com/library/courses/akka-dotnet-building-distributed-systems-clustering/table-of-contents
Andrew Buttigieg
@andrewbuttigieg
Oct 20 2016 08:50
That is nice @thomaslazar

I am not sure if I have found a bug in Akka.Clustering or not... but -

works:

var props = Props.Create<HashLoggerActor>().WithRouter(FromConfig.Instance);
var hashRouter = seedSystem.ActorOf(props, "hashLogger");
var someHashableMessage1 = new ConsistentHashableMessage("This is a message from the seed that is hashable.", Guid.NewGuid());
hashRouter.Tell(someHashableMessage1);

does not work:

var props = Props.Create<HashLoggerActor>(some-sort-of-constructor-injection).WithRouter(FromConfig.Instance);
var hashRouter = seedSystem.ActorOf(props, "hashLogger");
var someHashableMessage1 = new ConsistentHashableMessage("This is a message from the seed that is hashable.", Guid.NewGuid());
hashRouter.Tell(someHashableMessage1);
Daniel D'Agostino
@dandago2_twitter
Oct 20 2016 08:53
hey guys, I was just wondering why when you Tell(), it doesn't include the sender by default (you have to pass it in as a second parameter). Is it expensive or something?
Daniel D'Agostino
@dandago2_twitter
Oct 20 2016 09:05
I get deadletters as the sender unless I specify a second parameter for Tell()
Daniel D'Agostino
@dandago2_twitter
Oct 20 2016 09:12
ok looks like this happens when Tell()ing directly from the actorsystem rather than from within an actor
Andrew Buttigieg
@andrewbuttigieg
Oct 20 2016 09:23
So I have narrowed the problem down, it appears to be an issue with the TestProbe and Clustering, going to open a bug on GitHub
himekami
@himekami
Oct 20 2016 09:27
what's wrong with this code ?
let clusterSingletonProperties = system.ActorOf(ClusterSingletonManager.Props(
                                                    singletonProps: Props.Create<LoggerActor>(),         
                                                    terminationMessage: PoisonPill.Instance,                  
                                                    settings: ClusterSingletonManagerSettings.Create(system)),
                                                    name: "manager")
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 09:35
@himekami except that you're using F# let to declare variable, while the rest of the code looks like C#? idk unless you'll show an error ;)
himekami
@himekami
Oct 20 2016 09:35
haha. thank you @Horusiath
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 09:37
@dandago2_twitter there's an extension method for Tell which includes the sender implicitly (in the Akka.Actor namespace afaik). Besides that, the sender is expensive only when serialized/deserialized over the wire. So if you're sending a message to an actor on another node and won't use the sender there, it's better to pass ActorRefs.NoSender - but it's only a micro-optimization
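A small sketch of the three ways of sending discussed here. The `Forwarder` actor and its `target` are hypothetical. Inside an actor the one-argument `Tell` picks up `Self` as the sender, whereas the same call made from outside any actor (for example straight from `Main`) has no sender to pick up, which matches the dead-letters behaviour reported earlier.

```csharp
using Akka.Actor;

// Hypothetical actor that relays every string it receives to `target`.
public class Forwarder : ReceiveActor
{
    public Forwarder(IActorRef target)
    {
        Receive<string>(msg =>
        {
            // Implicit sender: inside an actor this resolves to Self.
            target.Tell(msg);

            // Explicit sender: same effect as the line above.
            target.Tell(msg, Self);

            // No sender at all: the receiver cannot reply to us.
            // Only worth doing when the receiver never uses Sender.
            target.Tell(msg, ActorRefs.NoSender);
        });
    }
}
```

It could be created with `system.ActorOf(Props.Create(() => new Forwarder(someTarget)))`, where `someTarget` is whatever actor reference you already hold.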
John Nicholas
@MrTortoise
Oct 20 2016 10:11
@Horusiath so if actors are usually only for aggregate roots then you consider persistence at actor level as command sourcing?
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 11:31
@MrTortoise yes, I think we could say so.
qwoz
@qwoz
Oct 20 2016 12:18
@Horusiath Can you explain more why a book would not be an actor? It seems to be an aggregate root in that the next level up would either be whole subjects or the entire library itself. And for a book, I'd want things like lending history plus books can be moved between libraries.
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 12:39
@qwoz in this case (so if book is an actor) if you need to retrieve a list of books, it would be better to keep this list as a view/perspective/live stream, so you already have that data in place and don't need to traverse over actors to construct a read state each time
verilocation
@verilocation
Oct 20 2016 12:52
I thought Actors were to be thought of as people doing roles... Would an Actor be more akin to a librarian (or book shelf, etc) in this analogy? Should a Book simply not be a Model rather than an Actor?
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 13:13
@verilocation actors are what you make them. It's more a question about your domain and how you are going to model it: what's an aggregate root, what's an entity, and where the bounded contexts are
(using DDD terminology)
John Nicholas
@MrTortoise
Oct 20 2016 13:15
yeah sorry I started the DDD thing. I'm not sure it's been helpful
Arsene Tochemey GANDOTE
@Tochemey
Oct 20 2016 13:54
Hello Guys
When you kill an actor does it get garbage collected?
Andrew Young
@ayoung
Oct 20 2016 16:41
@Horusiath do you have any insight into my question about how to handle the case where your cluster seed nodes might be brought down and then brought back up in a different location? I'm talking Service Fabric here. Thanks!
Marc Piechura
@marcpiechura
Oct 20 2016 17:24
@Tochemey yep
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 19:11
@ayoung you need to integrate your cluster nodes with service fabric app lifecycle (it must expose some hooks to enable graceful shutdown) and SF service discovery (so cluster node can register/unregister its own location)
I know that @mmisztal1980 had some working version of akka cluster on SF
Syed Hassaan Ahmed
@bannerflow-hassaan
Oct 20 2016 19:47
@Horusiath Thanks for the help with Persistence yesterday .. was able to repro the events replay failure locally by simply pointing to Sandbox Azure Redis instead of localhost .. Now seeing full exception "StackExchange.Redis.RedisConnectionException: No connection is available to service this operation" .. Stacktrace indeed points to RedisJournal.ReadHighestSequenceNrAsync() as you mentioned before.
Andrew Young
@ayoung
Oct 20 2016 20:15
@Horusiath is it true that if your seed nodes get moved, your cluster is basically hosed? Because the other nodes in the cluster are configured with the well-known location of the seed nodes. If they get moved, it's not well-known anymore.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:16
@ayoung not true if you have multiple seeds
as long as one of the original seeds is still connected to the cluster, it won't matter
when the seed node boots it'll re-establish contact with the others
Andrew Young
@ayoung
Oct 20 2016 20:17
i'm guessing that's where service discovery comes in, right?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:17
it can be done statically
but you could use service discovery to do that too
let me grab my Azure bootstrapping sample RQ
Andrew Young
@ayoung
Oct 20 2016 20:17
how can it be done statically when Service Fabric can move it to a completely different VM/IP?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:18
oh, I didn't see the Service Fabric part
only saw your last message
Andrew Young
@ayoung
Oct 20 2016 20:18
:)
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:18
then yeah in that case you'd use service discovery to find the other nodes at startup
I'm like 3 days behind on Gitter chat at least
lol
Andrew Young
@ayoung
Oct 20 2016 20:19
i'm guessing things like Docker Swarm work in a similar fashion?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:19
yeah, they all have similar bits of functionality for that
ditto with Mesos / Kubernetes / etc
https://gist.github.com/Aaronontheweb/29c0f65a2721ba22d44c - this is super old, but it bootstraps a cluster using cloud services
Andrew Young
@ayoung
Oct 20 2016 20:20
so the idea is: multiple seed nodes, ensure at least one seed node is part of the cluster.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:20
cloud services are a terrible place to host an Akka.NET cluster, but this was a proof of concept
yeah, exactly
Andrew Young
@ayoung
Oct 20 2016 20:20
when a seed gets moved, use discovery services to find the other seed node
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:21
yep
Andrew Young
@ayoung
Oct 20 2016 20:21
what's the alternative to cloud services?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:21
cloud services were the very first Azure offering
PaaS v1
Service Fabric is PaaS v2
Andrew Young
@ayoung
Oct 20 2016 20:22
i see.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:22
which is much better suited to sophisticated applications than Cloud Services
Andrew Young
@ayoung
Oct 20 2016 20:22
yep
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:22
Cloud Services were basically for ASP.NET + SQL Azure CRUD applications
couldn't handle anything more complicated than that
tried running a MongoDb availability set inside one while I was at Microsoft
Syed Hassaan Ahmed
@bannerflow-hassaan
Oct 20 2016 20:23
@Aaronontheweb Can App Services be used to host a node in Akka.NET cluster?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:23
this is before IaaS V1 was available
Andrew Young
@ayoung
Oct 20 2016 20:23
if a seed node becomes unreachable, won't the other nodes in the cluster try to continually reconnect to it?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:23
@bannerflow-hassaan based on what a few folks have said, looks like the big issue with App Services is they close down all ports except HTTP/HTTPS/WebSockets
Andrew Young
@ayoung
Oct 20 2016 20:23
especially with auto-down turned off...
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:24
so right now Akka.Cluster won't work
however, I'm thinking about implementing a SignalR / WebSocket transport
which would be a fun work-around for that :p
Syed Hassaan Ahmed
@bannerflow-hassaan
Oct 20 2016 20:24
Would be awesome if it works! :smile:
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:24
have no idea if it would actually work
if someone else wants to take a stab at WebSockets for that, be my guest
I won't be able to spend any time on it for a bit lol
@ayoung yep, they'll try to reach the seed node
the problem is if the seed node thinks it's the only reachable seed node
Andrew Young
@ayoung
Oct 20 2016 20:25
what i'm trying to understand is how do non-seed nodes know to swap addresses to a new seed?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:25
it'll perform a self-join
and become its own cluster
the non-seed nodes don't care about seed nodes any more once they startup
seed nodes only matter when you haven't joined the cluster
so you'd want to have all of your nodes, if you're running on service fabric, use some sort of service discovery mechanism for locating seeds
Andrew Young
@ayoung
Oct 20 2016 20:27
dude... 💡
thanks.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:27
it's what I'm here for
that and watching our build system self-immolate
Andrew Young
@ayoung
Oct 20 2016 20:27
i kept thinking that seed-nodes need always be at a well-known location
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:27
ah
Arjen Smits
@Danthar
Oct 20 2016 20:27

that and watching our build system self-immolate

LOL

Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:28
no, they only need to be at a well-known location when someone uses them to join the cluster
but a service discovery tool itself becomes the "well known" location
Andrew Young
@ayoung
Oct 20 2016 20:29
sorry to beat a dead horse but, if all seed-nodes are unreachable, you're screwed? or how do you recover from that?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:29
if all seed nodes are unreachable, it's like showing up at a bar and all of your friends are gone
you either wait for someone to show up or start drinking alone
maybe that analogy only applies to my lonely life
so let me rephrase
if you can't reach any of the seed nodes, you can't join
Andrew Young
@ayoung
Oct 20 2016 20:30
being lonely is partially self-inflicted
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:30
you'll receive an error message back
inside the logs
the way I'd program this bit of logic myself
"if I haven't received a Welcome message from the cluster in X seconds, signal to the process that I need to be restarted with a new configuration"
easy way to tell if you've been Welcome'd is to have an actor subscribe to any of the cluster events and wait to receive a CurrentClusterState
you get that message before you've officially been marked as Up
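A rough sketch of the "restart me if I was never welcomed" idea outlined above. `JoinWatchdog`, its private timeout message, and the `onJoinFailed` callback are hypothetical names. As an extra precaution that is an assumption of this sketch rather than something stated in the chat, it does not treat the mere arrival of a `CurrentClusterState` as proof of joining; it also checks that the node's own address appears in the member list.

```csharp
using System;
using System.Linq;
using Akka.Actor;
using Akka.Cluster;

// Hypothetical watchdog: calls onJoinFailed if this node has not shown up
// in the cluster membership within `timeout`.
public class JoinWatchdog : ReceiveActor
{
    private sealed class JoinTimedOut
    {
        public static readonly JoinTimedOut Instance = new JoinTimedOut();
    }

    private readonly Cluster _cluster = Cluster.Get(Context.System);
    private bool _joined;

    public JoinWatchdog(TimeSpan timeout, Action onJoinFailed)
    {
        // Initial snapshot sent right after subscribing.
        Receive<ClusterEvent.CurrentClusterState>(state =>
        {
            if (state.Members.Any(m => m.Address.Equals(_cluster.SelfAddress)))
                _joined = true;
        });

        // Later membership change that concerns this node.
        Receive<ClusterEvent.MemberUp>(up =>
        {
            if (up.Member.Address.Equals(_cluster.SelfAddress))
                _joined = true;
        });

        // Deadline reached: signal the host process if we never joined.
        Receive<JoinTimedOut>(_ =>
        {
            if (!_joined) onJoinFailed();
        });

        Context.System.Scheduler.ScheduleTellOnce(
            timeout, Self, JoinTimedOut.Instance, Self);
    }

    protected override void PreStart()
    {
        _cluster.Subscribe(
            Self,
            ClusterEvent.SubscriptionInitialStateMode.InitialStateAsSnapshot,
            new[] { typeof(ClusterEvent.MemberUp) });
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```

What `onJoinFailed` does is up to the host: log an alarm, exit the process so a supervisor restarts it with fresh seed addresses, and so on.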
Andrew Young
@ayoung
Oct 20 2016 20:32
if all the seeds are unreachable, and i start up a new seed, would it be ok if that new seed finds a node, any node, in that cluster and join it?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:33
the magic trick to making a cluster work correctly
in terms of this stuff
is make sure that any individual node can always find at least one other node that is up and is a member of the current cluster
believe it or not, you can do this without seed nodes altogether
Andrew Young
@ayoung
Oct 20 2016 20:34
ok. got it.
i was just about to say that
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:34
an application can call Cluster.Join
and pass in the address of a node you want to join
so theoretically, rather than injecting seed nodes at configuration
you could just query your service discovery API at runtime
and issue a join command to any up node
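A sketch of joining without configured seed nodes, as described above. `DiscoverUpNodes` is a hypothetical stand-in for whatever service discovery API you use (the hard-coded address is only a placeholder); `Cluster.Join` and `Address.Parse` are the Akka.NET calls.

```csharp
using System.Collections.Generic;
using Akka.Actor;
using Akka.Cluster;

public static class ClusterBootstrap
{
    // Hypothetical: replace with a real query against your service discovery
    // tool. It should return addresses of nodes that are already Up.
    public static IReadOnlyList<string> DiscoverUpNodes()
    {
        return new[] { "akka.tcp://demo@10.0.0.1:4053" };
    }

    // Returns false when discovery found nothing to join, so the caller
    // can retry later instead of accidentally forming a separate cluster.
    public static bool TryJoinDiscoveredNode(ActorSystem system)
    {
        var candidates = DiscoverUpNodes();
        if (candidates.Count == 0)
            return false;

        var cluster = Cluster.Get(system);
        cluster.Join(Address.Parse(candidates[0]));
        return true;
    }
}
```

With this approach the `akka.cluster.seed-nodes` list can be left empty in configuration, and the discovery tool becomes the well-known location.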
Andrew Young
@ayoung
Oct 20 2016 20:35
seed nodes just help in preventing split brains, right?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:35
seed nodes just help automate the process of forming and joining a cluster
the split brain prevention stuff is the unreachable vs. down bit
the fact that a cluster can distinguish between a node that is temporarily unreachable
versus one that is permanently dead and not coming back
(this is what the DowningProvider / issuing Down commands do)
is what prevents a split brain
if we automatically assumed that every node that was unreachable was no longer part of the cluster
we could end up breaking up one cluster into several small ones
which is what a split brain is
Andrew Young
@ayoung
Oct 20 2016 20:38
is there sort of a best policy to DowningProviders? or do we typically have to write one based on the design of our cluster?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:38
IMHO, the absolute best DowningProvider strategy
would be one that uses a cloud-specific API
to check to see if an unreachable node is really dead according to the status / resource APIs for that cloud
i.e. is the virtual machine at 10.12.120.11 up or down according to EC2 / ARM / Service Fabric?
if you can reach the cloud management API and verify that the node is down, terminated, missing, etc
you know conclusively that this cluster member is never coming back
Andrew Young
@ayoung
Oct 20 2016 20:39
makes sense.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:40
you could also check to see if that node has unregistered with the service discovery API
that would accomplish the same thing
Andrew Young
@ayoung
Oct 20 2016 20:40
wrt cluster configs (sorry lots of questions since i have you right now).
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:40
although you might have to have the downing provider check multiple times though
since there could be a delay between when the node goes unreachable in the cluster vs. when it gets marked as unhealthy by service discovery / cloud management API
sure, go for it
Andrew Young
@ayoung
Oct 20 2016 20:41
do cluster configs get gossiped ?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:41
nah
they're all private to the ActorSystem that used them
Andrew Young
@ayoung
Oct 20 2016 20:42
ok. what happens if there is a discrepancy between configs between nodes?
for instance min-nr-of-members?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:42
that's a good example
Andrew Young
@ayoung
Oct 20 2016 20:42
or DowningProviders
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:42
also a good one
Andrew Young
@ayoung
Oct 20 2016 20:42
that have different downing policies
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:42
that would be a fun one
essentially, the way that nodes get downed would vary by which one is the leader then
if a node with DowningProviderA is leader
its strategy gets used
if a node with DowningProviderB is leader, its strategy gets used
super fun scenario: network partition separates half the cluster from each other
you're going to have a leader of each half of the cluster, temporarily, until the partition is healed
if either leader has a super aggressive downing policy
split brains, chaos, dying kittens, and a giant meteor striking the Earth will ensue
Andrew Young
@ayoung
Oct 20 2016 20:45
oh good :)
so...cluster restart?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:45
one solution I was asked about a couple of times during our trainings this week
is central configuration management for cluster nodes
that's perfectly doable, and I know some of our users are already doing it
they store the HOCON configuration as a string in Azure Table Storage, Mongo, etc
have each node pull down a copy of that configuration when they start
and override instance-specific properties using HOCON fallbacks
such as the role of each node
and the Akka.Remote transport address
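A sketch of the shared-configuration idea using HOCON fallbacks. `LoadSharedHocon` is a hypothetical placeholder for reading the string from Table Storage, Mongo, or similar, and the setting values are examples only.

```csharp
using Akka.Actor;
using Akka.Configuration;

public static class ConfigLoader
{
    // Hypothetical: in practice this string would be fetched from the
    // central store that all nodes share.
    public static string LoadSharedHocon()
    {
        return @"
            akka {
              actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
              cluster {
                auto-down-unreachable-after = off
                min-nr-of-members = 3
              }
            }";
    }

    public static ActorSystem StartNode(string role, string hostname, int port)
    {
        // Instance-specific values: role and transport address.
        var local = ConfigurationFactory.ParseString($@"
            akka.cluster.roles = [""{role}""]
            akka.remote.helios.tcp.hostname = ""{hostname}""
            akka.remote.helios.tcp.port = {port}");

        // Local values win; anything not set locally comes from the shared config.
        var config = local.WithFallback(
            ConfigurationFactory.ParseString(LoadSharedHocon()));

        return ActorSystem.Create("demo", config);
    }
}
```

Because the local config sits in front of the shared one, cluster-wide settings such as the downing behaviour stay identical on every node unless a node deliberately overrides them.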
Andrew Young
@ayoung
Oct 20 2016 20:46
oh. that's pretty smart
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:47
yeah, that's a reasonable way to do it
people who had these questions, some of them work on larger teams
where there might be many different engineers working on multiple parts of the cluster
centralizing the important parts of the configuration, like downing strategy, serializer, etc
is a good idea
Andrew Young
@ayoung
Oct 20 2016 20:48
helios.tcp.public-hostname vs. hostname
?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:48
ah
JVM Akka added something similar eventually
but TL;DR;
if you want to use something like Elastic IP
Damian Reeves
@DamianReeves
Oct 20 2016 20:49
Hey Aaron, any time frame on the Distributed Data code being available?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:49
to identify a specific node
@DamianReeves akkadotnet/akka.net#2261
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 20:50
@DamianReeves once it passes tests (I have no idea why it doesn't)
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:50
elastic IPs / private host names technically exist outside the virtual machine running the Akka.NET process
so a socket can never directly bind to an elastic IP
Bartosz Sypytkowski
@Horusiath
Oct 20 2016 20:51
@DamianReeves anyway I was able to use ddata over cluster from my branch
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:51
if a socket tried to bind, it'd notice that IP is non-local
and barf up a glorious exception
Andrew Young
@ayoung
Oct 20 2016 20:51
ok, not familiar with elastic IPs
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:51
elastic IP is like an IP address alias for a machine
it performs network address translation
Damian Reeves
@DamianReeves
Oct 20 2016 20:51
cool... will take a look
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:51
and re-routes traffic to that IP to a real private IP for a given VM
it's an AWS concept
there's something similar in Windows Azure
Andrew Young
@ayoung
Oct 20 2016 20:52
ah ok.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:52
but what public-hostname is for
it allows the socket to bind to a locally available address
like 0.0.0.0
which is always reachable
while stating that the formal, publicly visible address is something else
like a hostname or elastic IP
so when that Akka.Remote process connects to someone else
it always identifies itself using the hostname / elastic IP
and therefore can be reached by other nodes
in essence it allows you to decouple the socket's actual bound address from the address you send messages to in order to reach it
Andrew Young
@ayoung
Oct 20 2016 20:54
ah. i'm guessing this is why if i set hostname to 127.0.0.1 but then try to connect to localhost it barfs too.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:54
yeah
Akka.Remote itself is strict
if the incoming address doesn't match its public-hostname
it'll yell at you and drop the message
if public-hostname isn't set, it defaults to the hostname setting
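A sketch of the hostname / public-hostname split described above, with the HOCON embedded in C#. The host name `node1.example.com`, the port, and the system name are placeholders.

```csharp
using Akka.Actor;
using Akka.Configuration;

public static class RemoteConfigExample
{
    public static ActorSystem Start()
    {
        var config = ConfigurationFactory.ParseString(@"
            akka {
              actor.provider = ""Akka.Remote.RemoteActorRefProvider, Akka.Remote""
              remote.helios.tcp {
                # Address the socket actually binds to (must be local).
                hostname = ""0.0.0.0""
                # Address this node advertises to others, e.g. a DNS name
                # or an elastic IP that is translated to this machine.
                public-hostname = ""node1.example.com""
                port = 4053
              }
            }");

        return ActorSystem.Create("demo", config);
    }
}
```

Other nodes would then address this system as `akka.tcp://demo@node1.example.com:4053`; using a different host string for the same machine is what triggers the dropped-message behaviour mentioned above.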
Andrew Young
@ayoung
Oct 20 2016 20:56
when running in the console, is there a way to make it less verbose?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:56
the log messages?
I've been trying to trim Helios' first-chance exceptions out of the log
Andrew Young
@ayoung
Oct 20 2016 20:57
yeah. just trying to make sense of what exceptions i should be paying attention to vs. all the cluster exceptions.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:57
since most of the time it's a non-actionable error
like, there's literally nothing that could be done to prevent it
Andrew Young
@ayoung
Oct 20 2016 20:57
yep.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:57
I need to upgrade Akka.Remote to use Helios 2.1.3, forgot to do that in 1.1.2
eliminated some of the sources of error that way
Damian Reeves
@DamianReeves
Oct 20 2016 20:58
So let's say I want to do something where I have worker nodes that are all in a managed cluster (ClusterW) and I have ClusterA that has other responsibilities but can offload work to workers in ClusterW... have you seen such a design, and how can I accomplish this?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 20:58
and I still have some grooming to do with Akka.Remote itself to silence exceptions that relate to graceful terminations of a connection
@DamianReeves multi-cluster design?
that'd be bad ass - best way to accomplish that would probably be with ClusterW having a ClusterClient to ClusterA
in terms of what can be done today
Andrew Young
@ayoung
Oct 20 2016 21:00
are you aware of any companies running clusters in Docker w/ mono?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:00
cc @annymsMthd might have been last I checked
Damian Reeves
@DamianReeves
Oct 20 2016 21:00
We are thinking of that design because the workers may be short-lived and may barf on work, so there may be a lot of churn of members going up and down
Andrew Young
@ayoung
Oct 20 2016 21:02
do you have any way of detecting split brain?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:03
best way of detecting it is from the outside
grabbing a piece of documentation real fast
Damian Reeves
@DamianReeves
Oct 20 2016 21:03
Can an Actor in ClusterA do a deathwatch on an Actor in ClusterW?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:04
@DamianReeves yep, you can deathwatch any actor you have a reference to
essentially ClusterW would be connecting to ClusterA using the plain old Akka.Remote protocol
Damian Reeves
@DamianReeves
Oct 20 2016 21:05
I thought there was a limitation where remoting can't be done from within a cluster to outside a cluster (am I making that up, or did that used to be the case?)
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:05
bah
let me fix that
there we go
so how to detect a split brain: have every single node in the cluster pump out their Cluster.CurrentClusterState objects periodically
if there's ever an occasion where the SeenBy collections are mutually exclusive, i.e. there is zero overlap
or if SeenBy is absolutely empty after startup
means that one or more nodes is not visible to the cluster
split brains can usually only happen as a result of bad configuration, i.e. leaving auto-down-unreachable-after on with some ridiculously short timespan
or a bad seed node strategy
Andrew Young
@ayoung
Oct 20 2016 21:08
is SeenBy a list of nodes that know about the current one? or is it a list of nodes that the current one knows about?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:09
it's a list of nodes that know about the current one
who has seen my gossip
the other thing you can look at are the unreachable lists
but that's not a split brain - split brains can only happen if the cluster membership changes
actually, yeah
you want to take a look at the member lists instead
Andrew Young
@ayoung
Oct 20 2016 21:10
is there an ACK before an address gets added to SeenBy?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:10
all of them should be identical within a short period of time
if one has a radically different list of members
Andrew Young
@ayoung
Oct 20 2016 21:10
ok.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:10
after a period of time
it's probably not a member of the cluster
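A library-free sketch of the external check described above: collect the member list each node reports and group nodes by the view they hold. The `reports` dictionary (node address to the set of member addresses that node sees) is a hypothetical shape for whatever your monitoring gathers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MembershipCheck
{
    // Groups reporting nodes by the membership view they hold.
    // One group: every node agrees. More than one group that persists
    // past normal gossip convergence suggests a partition or split brain.
    public static List<List<string>> GroupByView(
        IDictionary<string, ISet<string>> reports)
    {
        return reports
            .GroupBy(r => string.Join(
                ",", r.Value.OrderBy(a => a, StringComparer.Ordinal)))
            .Select(g => g
                .Select(r => r.Key)
                .OrderBy(n => n, StringComparer.Ordinal)
                .ToList())
            .ToList();
    }
}
```

A monitor would run this periodically and only raise an alarm when more than one group survives for several consecutive samples, since views legitimately differ for a short time while gossip spreads.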
Andrew Young
@ayoung
Oct 20 2016 21:10
sweet.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:10
that's a much more authoritative way of checking that
SeenBy is a weaker signal
the membership messages come from the leader
SeenBy is peer to peer
sent as a result of receiving gossip back from nodes
@DamianReeves nah, we fixed that
was an issue before
the changes in 1.1 resolved it
we have multi-node specs that test for it
Andrew Young
@ayoung
Oct 20 2016 21:12
i'm getting the feeling that clusters need very careful attention to detail and maintenance processes to keep it healthy. but i guess that's just the case with distributed computing in general.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:13
nah, the cluster isn't as brittle as you think
Andrew Young
@ayoung
Oct 20 2016 21:13
ok. i guess that's just how it seems to me. probably 'cause i don't have much confidence/experience in it yet.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:14
the most tricky part pertaining to the cluster itself
is the downing strategy
I might just implement this akkadotnet/akka.net#2280 and do a point release right away
since that would solve about 99% of issues with having nodes leave gracefully
would guarantee that once the actor system has terminated that the node has gracefully exited the cluster
all you'd have to do then, as part of your design, is guarantee that the actor system is shut down before your process terminates
i.e. when the Windows service begins terminating, block it from exiting until that task completes
the other 1% of issues would be the catastrophe scenario
where a piece of hardware actually does die unexpectedly
having a manual downing tool like https://github.com/cgstevens/Akka.Cluster.Monitor or a good downing provider could take care of that
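A sketch of the "leave gracefully, then terminate" shutdown described above. It assumes your Akka.NET version exposes `Cluster.RegisterOnMemberRemoved`; the `GracefulShutdown` class and the timeout fallback are additions of this sketch, included so that a node which never observes its own removal does not block the process forever.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;

public static class GracefulShutdown
{
    // Returns true if the node left the cluster and the system terminated
    // within `timeout`; false if the fallback termination had to be used.
    public static async Task<bool> LeaveAndTerminateAsync(
        ActorSystem system, TimeSpan timeout)
    {
        var cluster = Cluster.Get(system);

        // Terminate the actor system once this node has been removed.
        cluster.RegisterOnMemberRemoved(() => system.Terminate());

        // Ask the cluster to let this node leave gracefully.
        cluster.Leave(cluster.SelfAddress);

        var finished = await Task.WhenAny(
            system.WhenTerminated, Task.Delay(timeout));
        if (finished == system.WhenTerminated)
            return true;

        // Removal was never observed in time; terminate anyway.
        await system.Terminate();
        return false;
    }
}
```

A Windows service's stop handler could block on `LeaveAndTerminateAsync(system, TimeSpan.FromSeconds(30)).Wait()` before letting the process exit.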
Andrew Young
@ayoung
Oct 20 2016 21:19
yeah, problem with his naive solution in the PR is that if the actor system never successfully joins the cluster, then OnMemberRemoved is never called and it basically hangs the process
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:20
let me check something rq
https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Cluster/Configuration/Cluster.conf#L23 - so by default the system will automatically attempt to re-join after 10s
if it fails the first time
and it will do that indefinitely by the looks of it
I could attach a "poison count" configuration setting
which would state that after N unsuccessful attempts to join
the configuration is poisoned
and this node needs to be rebooted with a new configuration
that way at least, the system could raise an alarm
and let the humans know that the issue won't resolve itself
this would be the "bad seed node" configuration case
but really, that shouldn't be necessary
ultimately what you have to do is pick a strategy where the seed nodes are consistently chosen
Andrew Young
@ayoung
Oct 20 2016 21:24
we could also check if the actor system is part of a cluster first. if it is, leave first, otherwise you can just actorSystem.Terminate() right away.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:24
which is why Lighthouse exists
never reboot them, never upgrade them, never remove them
that's an extreme, IMHO
but for reasons I don't want to get into, was a scenario I had
when I work with Cassandra, which uses the exact same clustering mechanics as Akka.Cluster
I have a startup script that injects a list of IP addresses and ports belonging to nodes that should exist in my cluster into Cassandra.yml
I'm using EC2 / ARM when I'm doing this
I've never had a split brain issue with it in four years
now if my Cassandra network was constantly changing and the addresses could be all over the place
Andrew Young
@ayoung
Oct 20 2016 21:29
haha. then maybe i'm just not confident in Service Fabric :)
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:29
I'd have to do one of two things
one is implement a service discovery mechanism that reports both the C* nodes addresses and their cluster status
and rewrite my bootstrapper script to only have a new C* node join other nodes that are currently members of the same cluster
or I'd go with the lighthouse strategy
park 2-3 nodes outside of my auto-scale group
put them on static IPs
Andrew Young
@ayoung
Oct 20 2016 21:31
that makes it way easier too
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:31
and have a separate process for updating them when it comes time to do servicing
Andrew Young
@ayoung
Oct 20 2016 21:31
maybe i'll just do that.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:31
yeah, it's not the end of the world
either approach would work, you're just trading infrastructure setup overhead for infrastructure maintenance overhead in one case or the other
Andrew Young
@ayoung
Oct 20 2016 21:32
good. thanks for the tips. i think you've given me enough to work with for now.
oh...one more question :)
do you have a good way of debugging message flows between actors? it is often difficult when so many things are happening concurrently on a particular type of actor and i just want to follow a single flow but not bother with the other stuff that comes in...
does that make sense?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:34
yep
here's what I'd do
hmmm, wondering how easy this will be to explain over gitter
Andrew Young
@ayoung
Oct 20 2016 21:35
heh
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:35
for debugging a specific message flow, here's what I'd recommend
are you familiar with the concept of a "tracer round" ?
comes from the military originally
Andrew Young
@ayoung
Oct 20 2016 21:36
yes. basically a flare, right?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:36
when you're firing a big machine gun at night, especially an anti-aircraft gun, every Nth round or so is made of bright phosphorus or some other illuminating agent
it's designed to show you what you're aiming at when visibility is poor
yeah, basically a very fast moving flare
I've not implemented this myself, but I've outlined it before
basically mark each message that can be traced with an interface that includes a correlation ID
so the first message in the sequence is what sets the ID
each subsequent message that is part of this flow
also receives a copy of this ID and exposes it using the same interface
even though the messages may be different
Andrew Young
@ayoung
Oct 20 2016 21:39
oh got it.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:39
you can override inside a base class for your user-defined actors
the AroundReceive method
this will allow you to inspect the message before it's handed off to the actor for processing
you can determine if this message implements ITraceable or whatever
extract its correlation ID
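A sketch of the tracer-round idea outlined above. `ITraceable`, `TracedActor`, and the log format are hypothetical; `AroundReceive` is the hook on the actor base class that sees each message before the normal handlers do.

```csharp
using System;
using Akka.Actor;
using Akka.Event;

// Hypothetical marker interface: every message in a traced flow
// carries the correlation ID set by the first message.
public interface ITraceable
{
    Guid CorrelationId { get; }
}

// Hypothetical base class for user-defined actors that should log traces.
public abstract class TracedActor : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    protected override bool AroundReceive(Receive receive, object message)
    {
        var traceable = message as ITraceable;
        if (traceable != null)
        {
            // A conditional breakpoint on a known CorrelationId works here too.
            _log.Info("trace {0}: {1} received {2} at {3:O}",
                traceable.CorrelationId, Self.Path,
                message.GetType().Name, DateTime.UtcNow);
        }

        // Hand the message on to the actor's normal processing.
        return base.AroundReceive(receive, message);
    }
}
```

Collecting these log lines and sorting them by correlation ID and timestamp gives the timeline of actors a single flow passed through.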
Andrew Young
@ayoung
Oct 20 2016 21:39
yeah. then i can set conditional breakpoints after i know the correlation id.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:39
yep
Andrew Young
@ayoung
Oct 20 2016 21:40
nice.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:40
or you could write all of the different actors and the timestamps of when they received that correlation to a log
which would produce a timeline
that'd be a cool plugin idea
note to self
Andrew Young
@ayoung
Oct 20 2016 21:40
add it to issues
alright. thanks again for your help
Arjen Smits
@Danthar
Oct 20 2016 21:42
Aren't you describing something like a Zipkin integration?
Roger Johansson
@rogeralsing
Oct 20 2016 21:45
Zipkin does all of the above, there is no integration with akka.net yet though afaik. but @Horusiath has written a c# lib for zipkin. and I'm building a new UI for it. the main diff from the usual correlation id is that you can follow each branch (https://raw.githubusercontent.com/openzipkin/zipkin-ui/master/zipkinui.png)
Aaron Stannard
@Aaronontheweb
Oct 20 2016 21:46
looks that way
Daniel Söderberg
@raskolnikoov
Oct 20 2016 23:04
What is best in terms of speed. Should I use ReceiveActor och TypedActor for my messages?
och = or
Andrew Young
@ayoung
Oct 20 2016 23:58
@Aaronontheweb I'm finding that a member cannot Leave() the cluster if the status is MemberStatus.Joining. It is only allowed to leave if it is MemberStatus.Up. https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Cluster/ClusterDaemon.cs#L1195
if you call Leave() when status is Joining, the member is considered unreachable by the cluster indefinitely.
Aaron Stannard
@Aaronontheweb
Oct 20 2016 23:59
ohhhhh interesting
yeah, we need to have some sort of short-circuit there
a means to cancel the joining process
Andrew Young
@ayoung
Oct 20 2016 23:59
should i write up an issue?
Aaron Stannard
@Aaronontheweb
Oct 20 2016 23:59
mind logging an issue for that? I think the best way to handle that is to have the node down itself
yeah