These are chat archives for akkadotnet/akka.net

28th
Apr 2016
Pablo Castilla
@pablocastilla
Apr 28 2016 05:03
I want to start developing a large headend system for an electric utility using Akka.NET. Are clustering, remoting and persistence production ready? (We would develop an Oracle persistence plugin.)
Do you have good experiences with actor-per-entity in the IoT world? Thanks for helping.
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 06:26
@pablocastilla there are production users of Akka clusters. Persistence is pretty solid right now, but the SQL-based plugins are still changing, and some more changes will probably happen after Akka.Persistence.Query and Akka.Streams come out. In that case a migration path is usually described for them.
the rest of the persistence plugins usually see far less aggressive changes
Corneliu
@corneliutusnea
Apr 28 2016 06:39
@pablocastilla I'm testing Akka Clustering 1.0.7 beta and I'm hitting akkadotnet/akka.net#1700 and it's killing me :( just worth knowing
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 06:40
@corneliutusnea personally I've never used routers in any scenario
Corneliu
@corneliutusnea
Apr 28 2016 06:41
I'm not using any router, just the cluster does not come back alive
The cluster itself does not recover
[WARNING][26/04/2016 7:58:32 AM][Thread 0009][[akka://Test/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fTest%40127.0.0.1%3a9912-1]] Association with remote system akka.tcp://Test@127.0.0.1:9912 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.ReceiveActor.<>c__DisplayClass11_0.<Become>b__0(Object m)
   at Akka.Actor.ActorCell.<>c__DisplayClass109_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.ReceivedTerminated(Terminated t)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)]
[ERROR][26/04/2016 7:58:32 AM][Thread 0011][akka://Test/system/endpointManager/endpointWriter-akka.tcp%3a%2f%2fTest%40127.0.0.1%3a9912-2] Disassociated
Cause: Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.ReceivedTerminated(Terminated t)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 06:42
have you tried that after upgrading to 1.0.8?
Corneliu
@corneliutusnea
Apr 28 2016 06:43
oh, I didn't notice 1.0.8 .. I'll update now
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 06:44
btw. @pablocastilla if you have some questions, try asking @annymsMthd - I know he's been using cluster in production for some time
Corneliu
@corneliutusnea
Apr 28 2016 06:52
@Horusiath yey, initial test with 1.0.8 seems to be a lot happier ;)
thanks
@Horusiath is a Lighthouse required for a cluster? it seems to be a nice to have as the primary seed nodes, right ?
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 06:57
it's not required, but it's nice to have - if you take a look at it, you'll see that Lighthouse is basically an empty actor system with cluster configuration
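A minimal sketch of what "an empty actor system with cluster configuration" could look like in C#. The system name `OneSaas`, the ports 9911/9912 and the `Lighthouse` role are borrowed from the logs in this chat; the class name, the `BuildConfig` helper and the use of the 1.0.x-era `helios.tcp` transport section are assumptions for illustration, not the actual Lighthouse source.

```csharp
using System;
using Akka.Actor;
using Akka.Configuration;

public static class LighthouseSketch
{
    // Builds HOCON for one seed node; only the port differs between the two seeds.
    public static Config BuildConfig(int port)
    {
        return ConfigurationFactory.ParseString(@"
            akka {
              actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
              remote.helios.tcp {
                hostname = ""127.0.0.1""
                port = " + port + @"
              }
              cluster {
                seed-nodes = [
                  ""akka.tcp://OneSaas@127.0.0.1:9911"",
                  ""akka.tcp://OneSaas@127.0.0.1:9912""
                ]
                roles = [""Lighthouse""]
              }
            }");
    }

    public static void Main()
    {
        // The system name must match the name used in the seed-node addresses.
        // No user actors are created; the node only takes part in cluster membership.
        var system = ActorSystem.Create("OneSaas", BuildConfig(9911));

        Console.WriteLine("Seed node running. Press Enter to stop.");
        Console.ReadLine();
        system.Shutdown(); // 1.0.x-era call; later releases use Terminate()
    }
}
```

A second seed would run the same code with `BuildConfig(9912)`; other nodes would list the same two seed addresses and pick their own port and role.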
Corneliu
@corneliutusnea
Apr 28 2016 06:57
yes, I wrote my own one based on your sample + akka.cluster.monitor sample
Corneliu
@corneliutusnea
Apr 28 2016 07:14
@Horusiath excitement was short lived ...
[WARNING][28/04/2016 7:12:45 AM][Thread 0010][[akka://OneSaas/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fOneSaas%40127.0.0.1%3a9911-4]] Association with remote system akka.tcp://OneSaas@127.0.0.1:9911 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.ReceivedTerminated(Terminated t)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)]
[ERROR][28/04/2016 7:12:45 AM][Thread 0010][akka://OneSaas/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fOneSaas%40127.0.0.1%3a9911-4] Disassociated
Cause: Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.ReceivedTerminated(Terminated t)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 07:15
Are you trying to send unserializable message or something? /cc @Aaronontheweb
Corneliu
@corneliutusnea
Apr 28 2016 07:15
nope, nothing, just two lighthouses trying to talk to each other .. no user code
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 07:16
maybe Aaron will be able to help here
I'm in work right now and cannot investigate it
Corneliu
@corneliutusnea
Apr 28 2016 07:25
Its ok, I'll keep an eye out as I'm building in this area
Vladyslav Pyshnenko
@Pisha91
Apr 28 2016 07:42
Hi @corneliutusnea, we also had this problem. We fixed it by changing the serializer to Wire
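For reference, switching the default serializer to Wire is done in HOCON. A sketch, assuming the `Akka.Serialization.Wire` package is installed; the class and field names are hypothetical, and binding `System.Object` routes every message without a more specific binding through Wire.

```csharp
using Akka.Configuration;

public static class WireConfigSketch
{
    // Registers the Wire serializer and binds it to System.Object,
    // so it is picked for any message type without a more specific binding.
    public static readonly Config Hocon = ConfigurationFactory.ParseString(@"
        akka.actor {
          serializers {
            wire = ""Akka.Serialization.WireSerializer, Akka.Serialization.Wire""
          }
          serialization-bindings {
            ""System.Object"" = wire
          }
        }");
}
```

The fragment would typically be combined with the rest of the node's configuration, for example `WireConfigSketch.Hocon.WithFallback(otherConfig)`, before being passed to `ActorSystem.Create`. All nodes that talk to each other need the same serializer setup.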
Kris Schepers
@schepersk
Apr 28 2016 07:45

@Horusiath any thoughts on this?

Hmm, anyone else noticing this: When a ClusterClientReceptionist is started on every node of a role (running locally on 1 dev machine), those nodes consume all CPU power.

When you run a single node, everything is fine..
Vagif Abilov
@object
Apr 28 2016 07:45
@Horusiath Can you please have another look at PR #1900 that fixes #1896. I've made some changes based on your comment. Would be nice to get it merged to fix F# issues.
Corneliu
@corneliutusnea
Apr 28 2016 07:49
@Pisha91 thanks, I'll try
@Pisha91 I'm actually using Wire already .. just checked
Arjen Smits
@Danthar
Apr 28 2016 07:52
Akka logger packages are updated to 1.0.8 and available on nuget
testkit.nunit is updated as well
Corneliu
@corneliutusnea
Apr 28 2016 07:55
hm, I'm sure I'm doing something wrong with this clustering
I have 2 x lighthouses, I added a new server in the cluster "Worker"
they are all pretty much empty, no user messages, just initialized and connected to each other
I kill the worker, now both lighthouses freak out
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
[WARNING][28/04/2016 7:56:21 AM][Thread 0020][[akka://OneSaas/user/clusterstatus]] RoleLeader: Engine, No leader found!
[WARNING][28/04/2016 7:56:21 AM][Thread 0020][[akka://OneSaas/user/clusterstatus]] Unreachable Member; Role: Engine, Status: Up, Address: akka.tcp://OneSaas@127.0.0.1:8911,
[WARNING][28/04/2016 7:56:25 AM][Thread 0009][[akka://OneSaas/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fOneSaas%40127.0.0.1%3a8911-28/endpointWriter]] AssociationError [akka.tcp://OneSaas@127.0.0.1:9911] -> akka.tcp://OneSaas@127.0.0.1:8911: Error [Invalid address: akka.tcp://OneSaas@127.0.0.1:8911] []
[WARNING][28/04/2016 7:56:25 AM][Thread 0009][remoting] Tried to associate with unreachable remote address [akka.tcp://OneSaas@127.0.0.1:8911]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Invalid address: akka.tcp://OneSaas@127.0.0.1:8911] Caused by: [Akka.Remote.Transport.InvalidAssociationException: Association failure ---> Helios.Exceptions.HeliosConnectionException: No connection could be made because the target machine actively refused it 127.0.0.1:8911 ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:8911
   at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
   at Helios.Net.Connections.TcpConnection.Open()
   --- End of inner exception stack trace ---
   at Helios.Net.Connections.TcpConnection.Open()
   at Akka.Remote.Transport.Helios.CommonHandlers.Open()
   at Akka.Remote.Transport.Helios.HeliosTcpTransport.AssociateInternal(Address remoteAddress)
if I bring the "worker" role back, everything returns to normal and the worker seems to be elected leader
can I change that? can the lighthouses be the leaders?
Arjen Smits
@Danthar
Apr 28 2016 08:01
leader election churn is normal when you first bring up your cluster
As to the freaking out
its normal as well. You brought down the node which was elected leader
so until you mark that node as down (using cluster tools)
the cluster will try to reconnect
maybe it gives up after a while. don't know, don't think so.
As to the question if you can force a node to be the leader.
don't know, don't expect that's possible.
would be a serious weakness of your cluster
And defeats the purpose of a quorum based cluster :)
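Marking the dead node as down can also be done from code on any surviving member. A minimal sketch, assuming the unreachable worker's address from the logs above; `DownNodeSketch` is a hypothetical helper, while `Cluster.Get`, `Cluster.Down` and `Address.Parse` are existing Akka.NET calls.

```csharp
using Akka.Actor;

public static class DownNodeSketch
{
    // Tells the cluster that the node at the given address is gone for good,
    // so the leader can remove it from the membership list.
    public static void Down(ActorSystem system, string nodeAddress)
    {
        var cluster = Akka.Cluster.Cluster.Get(system);
        cluster.Down(Address.Parse(nodeAddress));
    }
}
```

Usage would be something like `DownNodeSketch.Down(system, "akka.tcp://OneSaas@127.0.0.1:8911");`. The `akka.cluster.auto-down-unreachable-after` setting does the same automatically after a timeout, at the cost of possibly downing nodes that are only temporarily unreachable.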
Corneliu
@corneliutusnea
Apr 28 2016 08:06
I see .. I still can't make this work ... I'm now back to the first problem I reported earlier
all 3 processes are talking to each other, each complain about a different issue
[INFO][28/04/2016 8:05:57 AM][Thread 0017][[akka://OneSaas/system/cluster/core/daemon]] Leader can currently not perform its duties, reachability status: [akka.tcp://OneSaas@127.0.0.1:8911 -> UniqueAddress: (akka.tcp://OneSaas@127.0.0.1:9911, 1723357465): Unreachable [Unreachable] (3), ], member status: [$akka.tcp://OneSaas@127.0.0.1:8911 $Up seen=$True, $akka.tcp://OneSaas@127.0.0.1:9911 $Up seen=$False, $akka.tcp://OneSaas@127.0.0.1:9912 $Up seen=$True]
[WARNING][28/04/2016 8:06:02 AM][Thread 0030][[akka://OneSaas/user/clusterstatus]] Unreachable Member; Role: Lighthouse, Status: Up, Address: akka.tcp://OneSaas@127.0.0.1:9911,
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
[WARNING][28/04/2016 8:07:22 AM][Thread 0020][[akka://OneSaas/user/clusterstatus]] RoleLeader: Engine, No leader found!
[WARNING][28/04/2016 8:07:22 AM][Thread 0020][[akka://OneSaas/user/clusterstatus]] Unreachable Member; Role: Engine, Status: Up, Address: akka.tcp://OneSaas@127.0.0.1:8911,
[WARNING][28/04/2016 8:07:23 AM][Thread 0010][[akka://OneSaas/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fOneSaas%40127.0.0.1%3a8911-107]] Association with remote system akka.tcp://OneSaas@127.0.0.1:8911 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
Corneliu
@corneliutusnea
Apr 28 2016 08:16
another question: maybe someone has an answer: Sharding & Persistent actors: What's happening to child actors when an actor is moved across from one shard/server to another ?
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 08:50
@corneliutusnea they are killed
kariem-ali
@kariem-ali
Apr 28 2016 09:23
Hi. Does persistence in Akka.Net support something like this: http://doc.akka.io/docs/akka/2.4.0/scala/persistence.html#id3
Marc Piechura
@marcpiechura
Apr 28 2016 09:46
@Danthar DI plugins are done, I also created a PR with build support for SimpleInjector but will leave it up to @CarlosTorrecillas to merge it in
Carlos Torrecillas
@CarlosTorrecillas
Apr 28 2016 09:49
@Silv3rcircl3 just merged that in
I will add the readme file later on and in theory that should be it
and we should be ready to publish the NuGet package
Marc Piechura
@marcpiechura
Apr 28 2016 09:51
:+1: we need to wait until the TeamCity builds are ready for the DI plugins. When that's done you would need to create a PR from dev to master, which triggers the release build in TeamCity; that's also the reason why I created the dev branch
nightly builds are created from dev and release builds on every commit against master
Carlos Torrecillas
@CarlosTorrecillas
Apr 28 2016 09:52
ok, that makes sense
Vagif Abilov
@object
Apr 28 2016 12:11
@Silv3rcircl3 @Horusiath Rebased PR #1900.
Alex Valuyskiy
@alexvaluyskiy
Apr 28 2016 13:02
The WebCrawler docs say there can only be one Tracker node. What happens if that node fails? Does the cluster just stop working?
Kris Schepers
@schepersk
Apr 28 2016 13:04
@Horusiath I've created an issue and an example regarding my issue mentioned this morning. Can you take a look at #1918 ?
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 13:16
@schepersk thanks, will take a look (but I don't know when I'll be able to fix it, even if I find the reason)
f**k, really need some more time for that
Kris Schepers
@schepersk
Apr 28 2016 13:21
@Horusiath Thanks! Any comment or tip on this is very welcome.. I'm trying to implement some basic monitoring for our cluster. I thought I'd be smart and reuse the lighthouse instances for receptionists, and hit this annoying issue.
I guess the workaround for this would be finding a role with only 1 instance, or letting the monitoring web site become a node in the cluster itself and not using the cluster client.
Ralf
@Ralf1108
Apr 28 2016 13:40
hi... is the Wire serializer a full replacement for the Json serializer? To me it looks like Wire is almost Protobuf without the required attributes on the members in the DTO classes
Arjen Smits
@Danthar
Apr 28 2016 14:15
@Silv3rcircl3 ill finish up the TC config for the DI plugins tonight.
kariem-ali
@kariem-ali
Apr 28 2016 14:19
Hi. I have a persistent actor that persists to SQLite without any problems, but just changing the configuration to use SQL Server or PostgreSQL causes the following exception:
[ERROR][28/04/2016 2:17:04 PM][Thread 0010][akka://server/system/akka.persistence.journal.sql-server] Object reference not set to an instance of an object.
Cause: [akka://server/system/akka.persistence.journal.sql-server]: Akka.Actor.ActorInitializationException: Exception during creation ---> System.NullReferenceException: Object reference not set to an instance of an object.
at Akka.Actor.Props.NewActor()
at Akka.Actor.ActorCell.CreateNewActorInstance()
at Akka.Actor.ActorCell.<>c__DisplayClass113_0.<NewActor>b__0()
at Akka.Actor.ActorCell.UseThreadContext(Action action)
at Akka.Actor.ActorCell.NewActor()
at Akka.Actor.ActorCell.Create(Exception failure)
--- End of inner exception stack trace ---
at Akka.Actor.ActorCell.Create(Exception failure)
at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
Does anyone know what might cause this?
Bartosz Sypytkowski
@Horusiath
Apr 28 2016 14:51
@kariem-ali which version of akka are you using? (try 1.0.8)
Chris G. Stevens
@cgstevens
Apr 28 2016 14:57
So I have 2 services that are just using remoting. I have a SubscriberActor that holds a list of 2 addresses and just sits there validating their identity.
Every 10 seconds it checks them with node.Value.Tell(new Identify(node.Key), Self);
When I get the Receive<ActorIdentity> I then watch the ActorRef and use it to call _akkaWorkerIActorRef[key].Tell(new SubscribeToAllJobs(), _subscriberActorRef);
Everything works great... I stop one of the services and the other gets the Receive<Terminated> message.
The next time the schedule checks, it finds the identity again and we are back to subscribing to that service's updates.
My problem is that if the outage lasts longer than about 5 seconds, I never get a Receive<ActorIdentity> for the service that was restarted.
I even removed the use of the ActorRef and just used ActorSelection and I get the same result. My message never makes it.
What setting controls this and what is the correct way to get the communication happening again?
Before submitting this I went ahead and changed it to an .Ask(ActorIdentity), and now I can see that the task gets cancelled after my 5 seconds.
Both systems can't see each other until they are both shut down and brought back up, which to me is wrong. How do I get this going again without doing a full restart?
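The identify-and-watch loop described in this message can be sketched roughly as follows. This only illustrates the pattern and is not claimed to fix the reconnection problem; `SubscribeToAllJobs` is the message name from the chat with an assumed shape, and `CheckNodes`, the dictionaries and the constructor argument are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Akka.Actor;

// Message name taken from the chat; its shape is assumed.
public sealed class SubscribeToAllJobs { }

public sealed class SubscriberActor : ReceiveActor
{
    private sealed class CheckNodes { } // internal tick message (hypothetical)

    private readonly Dictionary<string, ActorSelection> _nodes = new Dictionary<string, ActorSelection>();
    private readonly Dictionary<string, IActorRef> _workers = new Dictionary<string, IActorRef>();
    private ICancelable _tick;

    // nodePaths maps a key to a remote actor path such as
    // "akka.tcp://service@host:port/user/worker" (placeholder values).
    public SubscriberActor(IDictionary<string, string> nodePaths)
    {
        foreach (var pair in nodePaths)
            _nodes[pair.Key] = Context.ActorSelection(pair.Value);

        // Every tick, ask each node we are not currently watching to identify itself.
        Receive<CheckNodes>(_ =>
        {
            foreach (var node in _nodes.Where(n => !_workers.ContainsKey(n.Key)))
                node.Value.Tell(new Identify(node.Key), Self);
        });

        // Subject is null when the remote system answers but no actor lives at that path.
        Receive<ActorIdentity>(identity =>
        {
            if (identity.Subject == null) return;
            _workers[(string)identity.MessageId] = identity.Subject;
            Context.Watch(identity.Subject);
            identity.Subject.Tell(new SubscribeToAllJobs(), Self);
        });

        // Forget the worker so the next tick tries to identify it again.
        Receive<Terminated>(terminated =>
        {
            var key = _workers.Where(w => w.Value.Equals(terminated.ActorRef))
                              .Select(w => w.Key)
                              .FirstOrDefault();
            if (key != null) _workers.Remove(key);
        });
    }

    protected override void PreStart()
    {
        _tick = Context.System.Scheduler.ScheduleTellRepeatedlyCancelable(
            TimeSpan.Zero, TimeSpan.FromSeconds(10), Self, new CheckNodes(), Self);
    }

    protected override void PostStop()
    {
        _tick.Cancel();
    }
}
```

Keeping the `ActorSelection` around and re-sending `Identify` after every `Terminated` means a restarted service is picked up on the next tick, provided the remoting layer is willing to reconnect to that address.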
Chris G. Stevens
@cgstevens
Apr 28 2016 15:14
@schepersk Not sure what kind of basic monitoring you are looking for but you can take a look at https://github.com/cgstevens/Akka.Cluster.Monitor
Just need to Start the system and then subscribe and it will feed it all of the cluster Events.
kariem-ali
@kariem-ali
Apr 28 2016 15:31
@Horusiath I am using 1.0.8 along with the latest Akka.Persistence packages
Alex Valuyskiy
@alexvaluyskiy
Apr 28 2016 15:32
@kariem-ali the latest packages for Akka.Persistence are not compatible with Akka >= 1.0.7
kariem-ali
@kariem-ali
Apr 28 2016 15:33
@alexvaluyskiy Do you have any recommendation regarding specific versions?
Alex Valuyskiy
@alexvaluyskiy
Apr 28 2016 15:34
@kariem-ali you should build it from the sources until the Akka team releases the package. The PR for 1.0.7 was created more than a month ago
kariem-ali
@kariem-ali
Apr 28 2016 15:35
@alexvaluyskiy OK. Thank you.
wdspider
@wdspider
Apr 28 2016 18:52
Can someone give me some guidance on how to ensure an actor's PersistenceId is unique when you don't know how many instances of the actor there are going to be?
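One common way to keep `PersistenceId` unique without knowing the number of instances up front is to derive it from a stable business key passed to each actor. A sketch under that assumption; `DeviceActor`, `ReadingRecorded`, `BuildPersistenceId` and the `device-` prefix are hypothetical names chosen for illustration.

```csharp
using Akka.Actor;
using Akka.Persistence;

// Hypothetical event, used only for illustration.
public sealed class ReadingRecorded
{
    public ReadingRecorded(double value) { Value = value; }
    public double Value { get; private set; }
}

public sealed class DeviceActor : ReceivePersistentActor
{
    private readonly string _deviceId;
    private double _lastReading;

    public DeviceActor(string deviceId)
    {
        _deviceId = deviceId;

        // Replay: rebuild state from journaled events.
        Recover<ReadingRecorded>(evt => { _lastReading = evt.Value; });

        // Command: persist the event first, then update state in the callback.
        Command<double>(value =>
        {
            Persist(new ReadingRecorded(value), evt => { _lastReading = evt.Value; });
        });
    }

    // One journal stream per device: stable across restarts, and unique
    // as long as the device id itself is unique.
    public override string PersistenceId
    {
        get { return BuildPersistenceId(_deviceId); }
    }

    public static string BuildPersistenceId(string deviceId)
    {
        return "device-" + deviceId;
    }

    public static Props CreateProps(string deviceId)
    {
        return Props.Create(() => new DeviceActor(deviceId));
    }
}
```

The important property is that two live actors never share an id and that the same logical entity gets the same id after a restart; a random GUID generated in the constructor would satisfy the first but not the second.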
Kris Schepers
@schepersk
Apr 28 2016 18:59
@cgstevens Yes I know of your great project, thanks! In fact, the basic monitoring we're implementing in our cluster is based on it :smile: But we would like to replace the dull winform with a fancy web page.. The cluster client seemed to be the most elegant solution for it, just receive messages from the cluster without being part of it..
Chris G. Stevens
@cgstevens
Apr 28 2016 19:03
Cool and thanks. That is going to be my next step.
Take what's in the WinForms app and place it in my website; instead of connecting to one cluster, I want to subscribe to or poll members from multiple clusters, or even the same cluster...
Kris Schepers
@schepersk
Apr 28 2016 19:07
Nice! What we would like to add next is get the state of the shards in the cluster. Live reporting of the number of shards and the number of entities in those shards.
Chris G. Stevens
@cgstevens
Apr 28 2016 19:09
that would be awesome. I got pulled off my project for over 2 months and am now getting back into it, and I have missed the Shards and Persistence. That will be next.
Now if I can just figure out why my dummy project works great with my remoting subscriber while in my real project the nodes just stop talking to each other.
Maxim Cherednik
@maxcherednik
Apr 28 2016 19:16
Hi @corneliutusnea. I am also trying to do a very basic test of cluster failover.
I am doing pretty similar things: 2 seed nodes, 2 worker nodes, auto-down is off, all nodes have the same inbound port number 4053, no firewalls, each node is on a different virtual machine. I try to imitate the death of one of the nodes: kill it and start it again after a couple of seconds. All the consoles show those exceptions, gated for 5000 ms and so on. Then the seed node sees that another client has connected with a known host and port but a different UID. It says: hah, it must be a new one, I will let you in once the previous one is out. OK, it removes the previous one, then lets the new one join the cluster, and then the seed node goes crazy: it constantly prints one line (I don't remember it exactly): leader is moving the new node to the state UP, leader is moving the new node to the state UP, ...
Kris Schepers
@schepersk
Apr 28 2016 19:20
@cgstevens Those issues are often related to configuration or to version differences in the libraries used.. Or some other subtle difference in the actual code that one does not see. Try stripping down your real project until only the essence remains?
Chris G. Stevens
@cgstevens
Apr 28 2016 19:35
@schepersk That is what I did basically to create my test project... I wasn't really expecting it to work :( It does and now for the past hour I have been trying to see what the heck the difference is...
Maxim Cherednik
@maxcherednik
Apr 28 2016 19:39
guys, can someone explain the versioning of the project? it's 1.0.8 at the moment, I knew about the version 1.1. But now I see from time to time that the upcoming version is 1.5. How so?
Kris Schepers
@schepersk
Apr 28 2016 19:42
@cgstevens Auw.. That s*cks..
Chris G. Stevens
@cgstevens
Apr 28 2016 19:45
@schepersk
The issue is when I have a `node = ActorSelection` and then do `node.Value.Ask(new Identify(node.Key), TimeSpan.FromSeconds(5))`
In my internal app, when I shut down one node, the other ends up getting IsCanceled when asking for the identity, even after I bring the first node back up.
But in the sample I just put up, if I launch both services and shut one down, it doesn't time out, and in fact ActorIdentity.Subject is null like it should be.
https://github.com/cgstevens/AkkaRemoting
Then once I bring the other back up... I am able to get the identity again.
Kris Schepers
@schepersk
Apr 28 2016 19:53
@cgstevens Interesting.. If I can find some time tomorrow, I'll explore your sample..
Maxim Cherednik
@maxcherednik
Apr 28 2016 19:59
1 node is joining. Why do I see this 3 times?
[INFO][4/28/2016 7:58:24 PM][Thread 0016][[akka://riskengine/system/cluster/core/daemon]] Leader is moving node [akka.tcp://riskengine@10.0.0.10:4053] to [Up]
[INFO][4/28/2016 7:58:25 PM][Thread 0015][[akka://riskengine/system/cluster/core/daemon]] Leader is moving node [akka.tcp://riskengine@10.0.0.10:4053] to [Up]
[INFO][4/28/2016 7:58:26 PM][Thread 0016][[akka://riskengine/system/cluster/core/daemon]] Leader is moving node [akka.tcp://riskengine@10.0.0.10:4053] to [Up]
Chris G. Stevens
@cgstevens
Apr 28 2016 20:02
@schepersk tonight/tomorrow I am going to start taking things out to see if something else could be causing it. Just at a loss now.
Maxim Cherednik
@maxcherednik
Apr 28 2016 20:20
guys, when a node is joining the cluster, do I need code that tracks the state of the joining process?
Arjen Smits
@Danthar
Apr 28 2016 20:35
@maxcherednik about your thing with the gated connection and the different UID
that's normal
What happens is that when a node goes down, the cluster assumes it's because the connectivity is gone. And it's gating that node, waiting for the connectivity to return.
When you restart the node, it gets a new UID, and the cluster sees it as a new node, which in fact it is, because it's squeaky clean in terms of what it knows about the cluster. So it needs to be told all the stuff that the previous node already knew about. That's why it gets a new UID on restart, to detect that kind of thing.
Now when you stop/kill a node
With the intention of restarting it, for whatever reason
you need to let it tell the cluster that it's going down.
Usually that's done in the exit code of your node. That way the cluster knows you're bringing the node down on purpose and it won't start freaking out.
So in short: testing cluster resilience by manually crashing nodes is not the right way, because it does not reflect a scenario in which the cluster can auto-heal.
When a node crashes, the cluster cannot detect if it was done on purpose or not. That's why you need tools to tell the cluster if a node is really down, or not, when it crashes due to a problem.
That is where @cgstevens' cluster monitor really comes in handy. You can build it yourself though by using the akka.cluster.tools
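Letting a node tell the cluster that it is going down can be sketched as below. `GracefulLeaveSketch` and the fixed grace period are assumptions for illustration; `Cluster.Get`, `Leave` and `SelfAddress` are existing Akka.NET members.

```csharp
using System;
using System.Threading;
using Akka.Actor;

public static class GracefulLeaveSketch
{
    // Call this from the node's normal exit path (service stop, Ctrl+C handler, ...).
    public static void LeaveAndShutdown(ActorSystem system, TimeSpan grace)
    {
        var cluster = Akka.Cluster.Cluster.Get(system);

        // Announce the departure so other members see the node leave on purpose
        // instead of seeing it become unreachable.
        cluster.Leave(cluster.SelfAddress);

        // Crude: give gossip some time to spread before the process goes away.
        // The right duration is an assumption to tune.
        Thread.Sleep(grace);

        system.Shutdown(); // 1.0.x-era call; later releases use Terminate()
    }
}
```

This covers planned restarts only. A process that crashes never reaches this code, which is why some form of downing, manual or automatic, is still needed for real failures.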
Arjen Smits
@Danthar
Apr 28 2016 20:40
@Silv3rcircl3 I haven't been able to finish the DI CI templates. Private stuff got in the way. I think I can clear some time tomorrow.
Marc Piechura
@marcpiechura
Apr 28 2016 20:46
No hurry, 1.0.7 plugins should run fine with 1.0.8 at least I can't remember any DI related changes
Corneliu
@corneliutusnea
Apr 28 2016 23:15
@Danthar I think a node crashing and coming back quickly is a very realistic scenario ... Run the node as a service, there is a crash, the process exits, the service restarts automatically
offline-restart-online in less than 15 seconds
similar for an app recycle