Ricky Blankenaufulland
@ZoolWay
:D
@crucifieddreams Restarting that node in your case means the node leaves the cluster gracefully, the process terminates, and a new process starts?
Alex Gibson
@crucifieddreams
Yes exactly :), it's a restart of a Windows service with code for a graceful exit. When a node is about to get into this state the leader sees the exit (it logs this every second) but never carries out the exit process. The node restarts, the leader removes the old incarnation, and then the new incarnation won't rejoin (the leader tries to bring it up and logs this fact every second). Other nodes join happily even while the rejoining node is stuck at Joining, which is confusing because the behaviour looks like a convergence problem.
Maxim Cherednik
@maxcherednik
@crucifieddreams ports are static?
Alex Gibson
@crucifieddreams
Yes, all configured statically in HOCON. We set up 53500-53520 as our port range and each node uses a different port in that range. 10 nodes exist on a single server and the other nodes are split across two other servers. The nodes don't share ports; each has its own allocated port even when running on another server.
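(Editor's note: a minimal sketch of the static port setup being described, assuming the Helios TCP transport from the Akka.NET 1.1.x line and made-up host names; each node hard-codes its own port from the 53500-53520 range instead of using port 0.)

    // Hypothetical per-node configuration; every node gets a distinct,
    // statically assigned port in the 53500-53520 range.
    using Akka.Actor;
    using Akka.Configuration;

    public static class NodeConfig
    {
        public static ActorSystem Create()
        {
            var config = ConfigurationFactory.ParseString(@"
                akka {
                    actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
                    remote.helios.tcp {
                        hostname = ""node-a""   # assumed host name
                        port = 53500            # unique per node, 53500-53520
                    }
                    cluster.seed-nodes = [
                        ""akka.tcp://MySystem@node-a:53500"",
                        ""akka.tcp://MySystem@node-b:53501""
                    ]
                }");

            return ActorSystem.Create("MySystem", config);
        }
    }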
Maxim Cherednik
@maxcherednik
Usually a node gets stuck in the Joining state if the cluster does not have convergence. Are you sure that while some node is stuck in Joining, it's the only node that is out?
Lutando Ngqakaza
@Lutando

I am getting some odd logs when I schedule a message using Quartz; this is what the logger says to me:

[DEBUG][2017/01/31 12:28:14 PM][Thread 0037][akka://MySystem/user/my-system/my-coordinator/5kcfZkKW0ku4Uk-A6j8MFA/MPp3gd5y8EK1m-8snEuZZA] Unhandled message from akka://MySystem/user/quartz : DEFAULT.f6bdcd16-9950-41d1-894a-9453368679d2 with trigger DEFAULT.d3e56bf7-2c8d-48a6-bf3e-86a6646924d9/MPp3gd5y8EK1m-8snEuZZA has been created.
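(Editor's note: one way to see what that unhandled reply actually is, as a sketch rather than the poster's code: give the receiving actor a catch-all handler so the scheduler's acknowledgement is logged or dropped explicitly instead of surfacing as an "Unhandled message" debug entry. The actor and message types here are hypothetical.)

    // Hypothetical coordinator actor with a catch-all handler; ReceiveAny
    // captures whatever the Quartz scheduler actor replies with instead of
    // letting it land in the unhandled-message stream.
    using Akka.Actor;
    using Akka.Event;

    public class CoordinatorActor : ReceiveActor
    {
        private readonly ILoggingAdapter _log = Context.GetLogger();

        public CoordinatorActor()
        {
            Receive<string>(msg => _log.Info("Handled: {0}", msg));

            // Anything else (e.g. the scheduler's confirmation) is logged
            // here rather than going unhandled.
            ReceiveAny(msg => _log.Debug("Ignoring scheduler reply: {0}", msg));
        }
    }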

Ricky Blankenaufulland
@ZoolWay
@crucifieddreams If you list the cluster members from another node (I made myself some kind of monitoring and admin tool), is the exiting node really removed from the cluster? Most problems like that happen for me when the original was not really removed; I manually down it in that case.
Maxim Cherednik
@maxcherednik
if the node restarts, it should kick out the old one
Alex Gibson
@crucifieddreams
When a node gets into this state it doesn't leave cleanly. It tries to; I have a monitor running in all service discovery nodes (2 of them). They both report the cluster status that they see. When this problem happens the cluster status is everything Up and everything Seen. The leader gets the request that the node is exiting, and it logs every second that it is moving the node to Exiting, but it never exits. After 15 seconds my Windows service will kill the service and failure detection will kick in. I manually down the node, although just starting it again causes the cluster to see the new node and remove the old one. The cluster monitors both report that the node is removed. The node rejoins and gets stuck.
Ricky Blankenaufulland
@ZoolWay
Does the Windows service wait until the exit has completed before it shuts down itself? I got lots of problems without graceful shutdown; I even put some examples up on GitHub showing how I get it working. This is the code for a Windows service using TopShelf: https://github.com/ZoolWay/akka-net-cluster-graceful-shutdown-samples/blob/master/TopShelfNode3/Worker.cs
The critical part is to delay the process exit until the member is really removed. @crucifieddreams
Windows will restart it immediately otherwise, which is too early.
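(Editor's note: a minimal sketch of that pattern, not the linked Worker.cs itself, assuming Akka.NET's Cluster extension and the ManualResetEvent mentioned further down; the node asks to leave and the process only exits once the MemberRemoved callback fires.)

    // Sketch of a graceful-shutdown hook: leave the cluster and block the
    // service's Stop() path until this node has actually been removed.
    using System;
    using System.Threading;
    using Akka.Actor;
    using Akka.Cluster;

    public static class GracefulShutdown
    {
        public static void LeaveClusterAndWait(ActorSystem system)
        {
            var cluster = Cluster.Get(system);
            var removed = new ManualResetEventSlim(false);

            // Signalled once this member reaches the Removed state.
            cluster.RegisterOnMemberRemoved(() => removed.Set());

            cluster.Leave(cluster.SelfAddress);

            // Wait for removal; whether to cap this with a timeout is
            // exactly the trade-off discussed below (a node that never
            // joined would otherwise wait forever).
            removed.Wait(TimeSpan.FromSeconds(60));

            system.Terminate().Wait();
        }
    }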
Alex Gibson
@crucifieddreams
That's a good call, we have similar code
But it times out after 15 seconds
Thanks :)
Maxim Cherednik
@maxcherednik
It's not that fast - sometimes it takes time.
Put in a longer timeout, just to see how it goes.
The cluster will move it down only if the cluster is operating well.
Alex Gibson
@crucifieddreams
I will remove the timeout from the
Maxim Cherednik
@maxcherednik
if there are other nodes missing - you will get a timeout
Alex Gibson
@crucifieddreams
Manual reset event we have there
And see how that goes
Maxim Cherednik
@maxcherednik
another thing with this approach - if a node never joined the cluster and you try to stop it, it will get stuck forever here
Bartosz Sypytkowski
@Horusiath

@maxcherednik

if a node never joined the cluster and you try to stop it, it will get stuck forever here

This sounds like a design issue /cc @Aaronontheweb

Maxim Cherednik
@maxcherednik
yep - I just didn't have time to report it :)
Alex Gibson
@crucifieddreams
Interesting, that is a useful piece of information; I didn't realise that.
:) thanks
Maxim Cherednik
@maxcherednik
btw Alex, just create an empty cluster without any logic and try to play around with all those edge cases
it helped me a lot
and 1.1.3 is way cleaner in terms of logging.
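(Editor's note: a bare-bones version of what "an empty cluster without any logic" could look like, assuming a single self-joining seed node on localhost and a hypothetical system name; just enough to watch joins, leaves, and restarts in the logs.)

    // Minimal, logic-free cluster node for experimenting with join/leave
    // edge cases; it only listens for cluster membership events.
    using System;
    using Akka.Actor;
    using Akka.Cluster;
    using Akka.Configuration;

    public class MemberListener : UntypedActor
    {
        protected override void PreStart()
        {
            Cluster.Get(Context.System).Subscribe(Self, typeof(ClusterEvent.IMemberEvent));
        }

        protected override void PostStop()
        {
            Cluster.Get(Context.System).Unsubscribe(Self);
        }

        protected override void OnReceive(object message)
        {
            Console.WriteLine(message);   // just print every membership event
        }
    }

    public static class Program
    {
        public static void Main()
        {
            var config = ConfigurationFactory.ParseString(@"
                akka {
                    actor.provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
                    remote.helios.tcp {
                        hostname = localhost
                        port = 53500
                    }
                    cluster.seed-nodes = [""akka.tcp://TestSystem@localhost:53500""]
                }");

            var system = ActorSystem.Create("TestSystem", config);
            system.ActorOf(Props.Create<MemberListener>(), "listener");
            Console.ReadLine();
            system.Terminate().Wait();
        }
    }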
Alex Gibson
@crucifieddreams
It might be worth upgrading while I am making changes. Thanks for all your help folks.
Aaron Stannard
@Aaronontheweb
@crucifieddreams @maxcherednik akkadotnet/akka.net#2347 would that fix it?
sounds like that's what you need
Maxim Cherednik
@maxcherednik
Maybe, but I am not sure :)
Alex Gibson
@crucifieddreams
Looks promising, certainly it was an issue I wasn't aware of. I am just running some tests to see if that is the problem I am seeing.
Aaron Stannard
@Aaronontheweb
In this state it doesn't leave cleanly. It tries to; I have a monitor running in all service discovery nodes (2 of them). They both report the cluster status that they see. When this problem happens the cluster status is everything UP and everything Seen. The leader gets the request that the node is exiting and it is logged every second that it is moving the node to exiting, but it never exits.
whoops
there we go
so I've suspected that we have an issue with MemberRemoved not firing correctly
I've not been sure under what circumstances this occurs
no idea if that report gets shown to guests or not
but either way, this is the flaky test report for
ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed
that information you just mentioned is very helpful. Confirms for me that this is a bug.
that under some circumstances, the MemberRemoved event is not received or processed correctly
if you have some logs from that situation you described, that would be helpful
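(Editor's note: for context, a sketch of what hangs here, with hypothetical names rather than the actual spec code: LeaveAsync returns a task that only completes once the local node's MemberRemoved event arrives, so if that event is never delivered the await below never finishes unless it is raced against a timeout.)

    // Illustrative workaround sketch: don't await LeaveAsync unguarded,
    // race it against a timeout so a missing MemberRemoved event cannot
    // hang shutdown forever.
    using System;
    using System.Threading.Tasks;
    using Akka.Actor;
    using Akka.Cluster;

    public static class LeaveWithTimeout
    {
        public static async Task LeaveAsync(ActorSystem system, TimeSpan timeout)
        {
            var leave = Cluster.Get(system).LeaveAsync();
            var finished = await Task.WhenAny(leave, Task.Delay(timeout));

            if (finished != leave)
                Console.WriteLine("MemberRemoved never observed; forcing shutdown.");

            await system.Terminate();
        }
    }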
Aaron Stannard
@Aaronontheweb
opened an issue, #2492
Alex Gibson
@crucifieddreams
I'll gather up some logs of what we see and post them on the issue log. Thanks!
Thomas Tomanek
@thomastomanek
#2491 has been opened btw
Chris Ochs
@gamemachine
So, back to trying to debug why DistributedPubSub isn't working for me. Basically, after some time period publish just stops working. I enabled DEBUG logging and for a while I see 'Received Akka.Cluster.GossipStatus' messages, and then it just stops after some time, and that's when publish stops working too.
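(Editor's note: for reference, the basic mediator usage being debugged, as a sketch with a made-up topic name, using the DistributedPubSub extension from Akka.Cluster.Tools.)

    // Basic DistributedPubSub usage for reference: a subscriber registers a
    // topic with the local mediator, and any node can publish to it.
    using Akka.Actor;
    using Akka.Cluster.Tools.PublishSubscribe;

    public class TopicSubscriber : ReceiveActor
    {
        public TopicSubscriber()
        {
            // "game-events" is a made-up topic name for illustration.
            DistributedPubSub.Get(Context.System).Mediator
                .Tell(new Subscribe("game-events", Self));

            Receive<SubscribeAck>(ack => { /* now registered with the mediator */ });
            Receive<string>(msg => { /* handle published messages */ });
        }
    }

    // Publishing side (any node in the cluster):
    //   var mediator = DistributedPubSub.Get(system).Mediator;
    //   mediator.Tell(new Publish("game-events", "hello"));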