Hi guys, I have a question related to remoting.
The main question is whether remoting should be resilient/robust against temporary network issues (network partitioning, host not responding, not receiving any death watch heartbeat responses...).
To be more specific, is it acceptable that an ActorSystem can become quarantined because of a temporary network issue?
I see no issue with heartbeat systems that try to detect network problems and drop messages when they detect them, but I find it problematic that a system gets quarantined because of some temporary network issues, because in Akka this means that the quarantined system needs to restart!
I don't find this very "Reactive", since no recovery is possible (except the really drastic recovery of restarting the ActorSystem, which in a server application is perhaps not possible).
We have an application in production (a lot of clients connecting to one server) that uses remoting, and because of network errors a client marks the remote server system as quarantined.
This means that the client will not be able to connect until the server restarts/recycles (or at least restarts its ActorSystem, which is not really feasible/desirable).
I have no problem with the existence of a "quarantined" state, but I do have a problem with something getting quarantined because of (temporary) network errors or because the death watch heartbeat responses are not received. The system should not get corrupted by such errors and so should not get quarantined.
What do you guys think about this? Is this a bug that needs to be fixed? (I do not mean that quarantining itself is a bug, but that getting quarantined because of temporary network issues might be one.)
Am I looking at this the wrong way?
What are the options for handling this (network errors are not that rare)?
My current solution is to set the parameter prune-quarantine-marker-after = 0 s (which is not recommended in the docs!).
I also tried increasing some of the other heartbeat parameters (acceptable-heartbeat-pause in the transport-failure-detector and the watch-failure-detector), but the effect was rather that the system would not recover at all.
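For reference, here is a rough sketch of where the settings mentioned above live in the HOCON config. It assumes the standard akka.remote section; the two pause values are placeholders for illustration only, not recommendations and not values taken from a real deployment.

```hocon
akka {
  remote {
    # Workaround described above: forget quarantine markers immediately.
    # The docs advise against setting this to 0 s.
    prune-quarantine-marker-after = 0 s

    transport-failure-detector {
      # Placeholder value (assumption), shown only to illustrate the key.
      acceptable-heartbeat-pause = 60 s
    }

    watch-failure-detector {
      # Placeholder value (assumption), shown only to illustrate the key.
      acceptable-heartbeat-pause = 30 s
    }
  }
}
```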
If I'm not using the death watch, the system can recover (meaning that after being gated it tries to associate/connect again). But once death watch is enabled (by watching an actor), there is suddenly some interaction that prevents it from reassociating, or even trying to (this seems to be a bug). As a result the death watch heartbeats get dropped until the acceptable pause threshold is reached, which in turn triggers the quarantining.
Version info: using Akka.NET 126.96.36.199 (but I also ran a test with the version in the dev git branch at the beginning of this week).