These are chat archives for akkadotnet/akka.net

2nd Mar 2015
Håkan Canberger
@HCanber
Mar 02 2015 06:35
@rodrigovidal NP :)
Roger Johansson
@rogeralsing
Mar 02 2015 06:52
@jcwrequests, I saw your AOP monitoring blog post, I've been thinking of doing exactly the same myself. That's great :)
Bartosz Sypytkowski
@Horusiath
Mar 02 2015 07:03

Yes, but the router should be ConsistentHashingPool. Would that be the consistent-hashing-pool?

@jcwrequests @Aaronontheweb Didn't we talk about potential danger of using resizer with consistent hashing pool?

Roger Johansson
@rogeralsing
Mar 02 2015 07:25
Yes, but @Aaronontheweb has implemented the virtual nodes thing from the JVM. This should be safer now (I still don't understand how)
Aaron Stannard
@Aaronontheweb
Mar 02 2015 07:42
it's still not super safe
all virtual nodes do is spread the affected hash areas more evenly across the ring on a membership change
so rather than losing a giant contiguous chunk of the hash range all at once
you lose N virtual nodes worth of the ring - it adds up to the same amount
but just spread out across different parts of the range - makes it so you don't end up with a big dead-zone for a large contiguous range all at once
(that dead-zone only lasts for a short period of time, until the hashing logic is able to perform a CAS operation to update its node ring)
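To make the virtual-nodes idea concrete, here is a minimal, self-contained sketch of a consistent hash ring (illustration only, not Akka.NET's actual ring implementation; the string-keyed HashRing type is made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class HashRing
{
    // ring position -> owning node; SortedDictionary keeps positions ordered
    private readonly SortedDictionary<int, string> _ring = new SortedDictionary<int, string>();
    private readonly int _virtualNodes;

    public HashRing(IEnumerable<string> nodes, int virtualNodes = 100)
    {
        _virtualNodes = virtualNodes;
        foreach (var node in nodes) Add(node);
    }

    public void Add(string node)
    {
        // Each node owns many small, scattered slices of the ring instead of one
        // big contiguous slice, so adding or removing a node shifts lots of small
        // ranges rather than one giant dead-zone.
        for (var i = 0; i < _virtualNodes; i++)
            _ring[Hash(node + ":" + i)] = node;
    }

    public void Remove(string node)
    {
        foreach (var pos in _ring.Where(kv => kv.Value == node).Select(kv => kv.Key).ToList())
            _ring.Remove(pos);
    }

    public string NodeFor(object key)
    {
        if (_ring.Count == 0) throw new InvalidOperationException("empty ring");
        var h = Hash(key.ToString());
        // first virtual node clockwise from the key's position, wrapping around
        foreach (var kv in _ring)
            if (kv.Key >= h) return kv.Value;
        return _ring.First().Value;
    }

    // A real implementation would use a stable hash (MurmurHash, MD5, ...);
    // string.GetHashCode is enough for a sketch.
    private static int Hash(string s) { return s.GetHashCode() & 0x7fffffff; }
}
```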
I'd recommend not using those two in combination together
Resizer will work just fine if your routees aren't stateful
but consistent hashing is really meant for stateful routees
I'll talk to @jcwrequests about that - if he's using Akka.Cluster already then you can use consistent hashing with that and achieve the same results
auto-scaling routees, but only when your network topology changes - happens much less often than the resizer
Aaron Stannard
@Aaronontheweb
Mar 02 2015 07:50
@rogeralsing submitted my question to the akka-dev mailing list, btw
should hear back from them
in the next day or so
Natan Vivo
@nvivo
Mar 02 2015 10:21
@rogeralsing, if you wanna talk about the async/await stuff I'm available. I think it's getting hard to discuss that in the issues.
Roger Johansson
@rogeralsing
Mar 02 2015 10:27
Shoot :) writing on my phone so it might be slow
First, do you get what the ActorTaskScheduler is doing?
Natan Vivo
@nvivo
Mar 02 2015 10:33
Yes, I did
But there is nothing wrong with ActorTaskScheduler per se
What's wrong is to believe you have to use it every time
The whole point of async/await is to not force async if not needed. Maybe MS should have named it better to avoid this confusion...
Roger Johansson
@rogeralsing
Mar 02 2015 10:38
Ok, you mean that if the task is already completed, we should have a different code path, right?
Natan Vivo
@nvivo
Mar 02 2015 10:38
Yes
But the point is that if you try to do that, you'll see that any attempt to achieve this correctly turns into reimplementing exactly what is already there, and then you will just support async/await out of the box
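A tiny sketch of the "don't force async when the task is already done" point (method names are made up): awaiting an already completed task never touches a TaskScheduler, only a genuinely pending task causes the rest of the method to be queued as a continuation.

```csharp
using System.Threading.Tasks;

public static class CompletedTaskFastPath
{
    public static async Task<int> HandleAsync(Task<int> work)
    {
        // If 'work' has already completed, execution falls straight through here
        // synchronously; only a pending task registers a continuation.
        var result = await work;
        return result + 1;
    }
}

// e.g. HandleAsync(Task.FromResult(41)) completes without ever leaving the caller's thread.
```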
Roger Johansson
@rogeralsing
Mar 02 2015 10:40
But that is just a small optimization inside runtask, right? Other than that, we handle all state changes
Natan Vivo
@nvivo
Mar 02 2015 10:40
That's what I'm not sure about. I tried to change the code to do that, you can see my comment later
It seems to me that what you are doing there is not needed at all if you just support tasks in actors as @HCanber proposed
Roger Johansson
@rogeralsing
Mar 02 2015 10:41
For non-completed tasks we need the scheduler. Otherwise we can't reset our implicit state
Natan Vivo
@nvivo
Mar 02 2015 10:41
I agree you need the scheduler
I just think there might be a better way to do this that doesn't need to break or reimplement anything
I'm trying to come up with an example
It seems to me that any "await" done in a task inside a specific task pool will already return to that task pool
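A minimal sketch of the kind of scheduler this is getting at (not Akka.NET's actual ActorTaskScheduler): a TaskScheduler that funnels every queued task through a single queue. While a task started on it is running, TaskScheduler.Current is this scheduler, so with no SynchronizationContext its await continuations (those not using ConfigureAwait(false)) are posted back through the same queue, which is the behaviour described above.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class SingleQueueScheduler : TaskScheduler
{
    private readonly Queue<Task> _queue = new Queue<Task>();
    private bool _draining;

    protected override void QueueTask(Task task)
    {
        bool startDrain;
        lock (_queue)
        {
            _queue.Enqueue(task);
            startDrain = !_draining;
            _draining = true;
        }
        // only one drain loop at a time, so tasks (and await continuations
        // posted back to this scheduler) run strictly one after another
        if (startDrain) ThreadPool.QueueUserWorkItem(_ => Drain());
    }

    private void Drain()
    {
        while (true)
        {
            Task next;
            lock (_queue)
            {
                if (_queue.Count == 0) { _draining = false; return; }
                next = _queue.Dequeue();
            }
            TryExecuteTask(next);
        }
    }

    // never run inline on the caller's thread; always go through the queue
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return false;
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        lock (_queue) return _queue.ToArray();
    }
}

// usage: Task.Factory.StartNew(action, CancellationToken.None,
//        TaskCreationOptions.None, new SingleQueueScheduler());
```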
Roger Johansson
@rogeralsing
Mar 02 2015 10:43
Yes, but with the current api design in JVM akka, this is what we have to deal with.
Natan Vivo
@nvivo
Mar 02 2015 10:43
That would mean the act of calling the scheduler should be somehow implicit
Well, then we will have a half baked solution to a problem that is already solved =)
because JVM and .NET are different, right?
Roger Johansson
@rogeralsing
Mar 02 2015 10:44
I know, but we can't change our API one week before v1
Natan Vivo
@nvivo
Mar 02 2015 10:44
I don't think it should
and I'm not proposing that
Roger Johansson
@rogeralsing
Mar 02 2015 10:44
but I agree that for v2 we should address this
Functional + scheduler and it's all done
Natan Vivo
@nvivo
Mar 02 2015 10:44
I think @HCanber got the idea of why this is important. The thing is that this cannot be a "hack" on top of the framework if it is to work correctly
Because it's just like how you wouldn't implement Futures in C# just because Java has them. You'd use Tasks
In the same way, any attempt to reimplement async/await behavior will just cause extra work for something that is complete and there to use
at least that is the conclusion I'm having by trying to come up with alternative solutions
the only problem is that we don't seem to agree on what is the problem. =)
Roger Johansson
@rogeralsing
Mar 02 2015 10:58
The problem today is to reset the implicit state. Everything else is just hacks to do so
Natan Vivo
@nvivo
Mar 02 2015 11:01
Ok then.
David Smith
@smalldave
Mar 02 2015 11:15
@rogeralsing can't we use some mechanism other than thread statics to flow the context? MS gets HttpContext.Current to work with async/await. We'd stay closer to the JVM implementation doing this
Roger Johansson
@rogeralsing
Mar 02 2015 11:20
That is what LogicalCallContext is for, that's why we use it. The thread-static stuff can be replaced with this too, but that has not yet been done. LogicalCallContext is also a lot more expensive than thread statics, and that's why we have the 50% overhead
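A quick illustration of the difference (full .NET Framework CallContext API; the key name is arbitrary): data set through the logical call context flows with the ExecutionContext across awaits and thread hops, while a [ThreadStatic] field stays on whichever thread set it.

```csharp
using System;
using System.Runtime.Remoting.Messaging;
using System.Threading.Tasks;

public static class ContextFlowDemo
{
    [ThreadStatic] private static string _threadStatic;

    public static async Task Run()
    {
        _threadStatic = "before await";
        CallContext.LogicalSetData("actor-context", "before await");

        await Task.Delay(100).ConfigureAwait(false); // usually resumes on a different thread

        // The thread-static value is typically gone; the logical call context value flowed.
        Console.WriteLine(_threadStatic ?? "<lost>");
        Console.WriteLine(CallContext.LogicalGetData("actor-context"));
    }
}
```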
David Smith
@smalldave
Mar 02 2015 11:22
ah. I saw the 50% overhead mentioned. wasn't sure what that was in relation to. fair enough
Roger Johansson
@rogeralsing
Mar 02 2015 11:22
Flow context + marshal completions back to the actor mailbox run to ensure the actor concurrency constraint (if reentrancy is on)
It's likely not noticeable in a real app, but the PingPong benchmark shows the 50% hit, and that is just pure message passing with no logic
Natan Vivo
@nvivo
Mar 02 2015 11:59
I just provided a "probably" working implementation there.
Roger Johansson
@rogeralsing
Mar 02 2015 11:59
@nvivo I think the biggest communication problem here is reentrancy. Reentrancy mutates the actor cell for each message, and this forces us to do funky marshalling and state flow. Without reentrancy, your PR would have worked if we had used LogicalCallContext instead of thread statics
Does that make sense?
Natan Vivo
@nvivo
Mar 02 2015 12:02
@rogeralsing, yes, I think the issue started as how to provide non-reentrancy. And that suggested that supporting async/await is the cleanest and most performant solution.
Agree?
Because if we don't agree on this, then we are discussing different things.
Roger Johansson
@rogeralsing
Mar 02 2015 12:07
Do the tests succeed using your modification? I don't have a computer for a week now
Natan Vivo
@nvivo
Mar 02 2015 12:08
Let me test that. I didn't compile it, I thought about it in the car
Roger Johansson
@rogeralsing
Mar 02 2015 12:09
But I do agree it should work. The await bit does need the continuation; not sure why, but the exception does not flow correctly without it
Natan Vivo
@nvivo
Mar 02 2015 12:10
no, it doesn't. in fact, all the code in that entire method I implemented is exactly what "await" does
when you use await, you don't need to handle continuations
another thing you should note is that continuations force task scheduling, and that forces async
I think that's why you got a 50% drop in performance in your test even with completed tasks
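A small sketch of that claim: an explicit ContinueWith is dispatched through a TaskScheduler even when the antecedent task has already completed (unless you opt into ExecuteSynchronously), whereas await takes the synchronous fast path for completed tasks.

```csharp
using System.Threading.Tasks;

public static class ContinuationCost
{
    // Always goes back through TaskScheduler.Current, even if 't' is already done.
    public static Task<int> ViaContinueWith(Task<int> t)
    {
        return t.ContinueWith(x => x.Result + 1);
    }

    // Completed task: no scheduling at all, the method simply keeps running.
    public static async Task<int> ViaAwait(Task<int> t)
    {
        return (await t) + 1;
    }
}
```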
Natan Vivo
@nvivo
Mar 02 2015 15:09
@rogeralsing, added my final words on #700 and this async/await stuff. if there are alternatives or anything I'm missing, I'd be glad to discuss.
Aaron Stannard
@Aaronontheweb
Mar 02 2015 16:50
@smalldave I'm going to open an issue for that Cluster InternalStats bug you and I talked about over email
I'm going to classify it as a live-lock, because that's the behavior I've observed inside the ClusterCoreDaemon whenever that configuration setting is set to 0
the daemon would do stuff like never process Join messages before the cluster heartbeat timed out
David Smith
@smalldave
Mar 02 2015 16:53
it's pretty consistent then? any guesses where the problem lies?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 16:54
yeah, it's really consistent
all it needs is a small value, like 100ms, to work properly
but when the value is 0 the actor floods its own mailbox
so two theories
David Smith
@smalldave
Mar 02 2015 16:55
is the default configuration correct?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 16:55
  1. the actor just fills up its mailbox with a ton of internal stats tick messages
let me check, I don't think 0 is the default value
hah, the default value turns the feature off altogether
and that appears to work
I got super frustrated a week ago when I ported the JoinSeedNodeSpec
because that test failed every time, for no apparent reason
couldn't form a cluster
yet one of the cluster samples I wrote a long time ago, which uses a similar cluster formation
has worked just fine for several months
realized that it was this value
so the other possibility is this
  2. internal stats ticks up the gossip history when its value is set to zero
which means that valid gossip might be incorrectly assumed to be too many generations old
so important events, like MemberUp, might never get gossiped correctly
I'm pretty confident that it's something more along the lines of theory 1 though - in my next PR I added a bunch of additional debug logging inside the cluster daemon
never once saw any of the Join messages make it into the inbox when the value was zero
David Smith
@smalldave
Mar 02 2015 17:00
that sounds a lot like what I remember happening
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:00
yeah
I was pulling my hair out trying to figure out what was going on lol
didn't help that I intentionally ignored all of the ITick messages from my logging because there are so many of them
David Smith
@smalldave
Mar 02 2015 17:01
confused on default value. where is zero coming from? what does it mean?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:01
the value of zero comes from the multi-node testkit, MultiNodeClusterSpec
it's defined in the default cluster settings there
David Smith
@smalldave
Mar 02 2015 17:02
ah, and that is ported correctly from the JVM version?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:02
and within the cluster core daemon it has a special meaning
yes, it is
David Smith
@smalldave
Mar 02 2015 17:02
why would that be set for testing but not for default cluster settings?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:02
publish-stats-interval = 0 s # always, when it happens
from the comments
to be honest, I have no idea why
this feature is straight up disabled by default in production
my guess is that it's used only for debugging and testing
ironic, right?
David Smith
@smalldave
Mar 02 2015 17:03
but the jvm version handles it. we don't for some reason?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:03
that something designed to help you figure out what's going wrong with cluster formations actually prevents them from happening
David Smith
@smalldave
Mar 02 2015 17:03
yes
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:04
I took a close look at the cluster core code
and it looks like it aligns pretty well with JVM
so I'm really not sure
David Smith
@smalldave
Mar 02 2015 17:15
in ClusterDaemon around line 684, what is the impact of passing zero to the scheduler?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:17
not sure
David Smith
@smalldave
Mar 02 2015 17:17
```scala
oops
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:17
although I change that line in my next PR
```csharp
if (settings.PublishStatsInterval != null && settings.PublishStatsInterval >= TimeSpan.Zero && settings.PublishStatsInterval != TimeSpan.MaxValue)
{
    _publishStatsTaskTaskCancellable = new CancellationTokenSource();
    _publishStatsTask =
        scheduler.Schedule(
            settings.PeriodicTasksInitialDelay.Max(settings.PublishStatsInterval.Value),
            settings.PublishStatsInterval.Value,
            Self,
            InternalClusterAction.PublishStatsTick.Instance,
            _publishStatsTaskTaskCancellable.Token);
}
```
OH CRAP
>= zero
instead of >
David Smith
@smalldave
Mar 02 2015 17:17
the jvm version is
```scala
// start periodic publish of current stats
val publishStatsTask: Option[Cancellable] = PublishStatsInterval match {
  case Duration.Zero | _: Duration.Infinite ⇒ None
  case d: FiniteDuration ⇒ Some(scheduler.schedule(PeriodicTasksInitialDelay.max(d), d, self, PublishStatsTick))
}
```
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:18
yeah, if we set a task for 0
then I don't think the scheduler actually uses the TPL at all
I think it just executes it inline
but the thing is, that would technically put the actor into an infinite loop, wouldn't it?
if the interval is 0?
yeah, that would spam up the mailbox
David Smith
@smalldave
Mar 02 2015 17:19
yep
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:19
first time the scheduler runs it has a real delay
but from that point onward the thread would write an infinite number of messages to itself
so yeah, that's got to be it
If I change my >= to > that should do it
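For reference, a sketch of what the corrected guard would look like, assuming the surrounding code stays as pasted above (only the comparison changes, matching the JVM pattern match that skips Duration.Zero):

```csharp
if (settings.PublishStatsInterval != null
    && settings.PublishStatsInterval > TimeSpan.Zero        // was '>=', which let 0 through
    && settings.PublishStatsInterval != TimeSpan.MaxValue)
{
    // schedule PublishStatsTick as before
}
```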
David Smith
@smalldave
Mar 02 2015 17:20
hope so. sorry about that. pretty nasty bug
would explain why I turned it off!
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:21
actually I think that's a two-part bug
the scheduler should absolutely not allow a value of zero for the interval
for this very reason
should have a Guard.Assert in there
this was helpful Dave - mystery solved
hopefully
now I just need to figure out how I broke programmatically configured cluster routers, and then I'm all set :sparkles:
David Smith
@smalldave
Mar 02 2015 17:23
excellent
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:24
been porting the routing multi-node specs
so far so good - favorite test is the one where I kill off a node and watch the consistent hash router automatically adjust its hash ring
David Smith
@smalldave
Mar 02 2015 17:24
i need to get back on to this. i've got some stuff to get finished up at work and then I should have some time to look at this again
Aaron Stannard
@Aaronontheweb
Mar 02 2015 17:25
sounds good - I've been really impressed at how well Akka.Cluster has performed in all of the multinode tests
with the exception of that bug
David Smith
@smalldave
Mar 02 2015 17:25
it is nice to watch that kind of functionality working in the multi-node tests. hopefully it will happen a bit more often now
the majority of the issues actually seem to be in the test framework :)
jcwrequests
@jcwrequests
Mar 02 2015 23:37
@rogeralsing Thanks it's nice to know that at least someone reads my blog. :smile:
jcwrequests
@jcwrequests
Mar 02 2015 23:51
@Horusiath @rogeralsing @Aaronontheweb So it's not a good idea to use resizers with the hash pools, because while a resizing action occurs I risk a message that should have gone to Actor B going to, say, Actor C. Do I have that right? The main reason I want to use this combination is to safely allow for multiple writers to my EventStore and to ensure that the same writer is always used for a particular aggregate. I wonder if there is a way to stop all message processing, resize, then rebalance all the messages, much like what happens in a Cassandra cluster when a new node is added. To mitigate the costly rebalancing I would have to have a larger number of actors pre-allocated on startup. Is that something that could theoretically be done? Even if it's not, I might be able to mitigate the risk on the data store end. What are your thoughts?
Aaron Stannard
@Aaronontheweb
Mar 02 2015 23:52
@jcwrequests yes, that's correct - the risk really comes from how rapidly the resizer fires
can be tuned with settings, but if you have a bursty workload it can be really unpredictable
the amount of time it takes the hash range to change in the new release is pretty close to instant - it's an atomic CAS operation
but under really large workloads, it's possible that some messages will be mis-routed
the real problem comes from having state fragments spread across the routees
whenever the routee pool changes
without using Akka.Persistence to merge state
it's going to be an issue
so my recommendation is to avoid the Resizer there - I should have mentioned this to you earlier; my fault
are you using Akka.Cluster at all?
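For the EventStore-writer scenario above, a minimal sketch of the per-aggregate routing piece (the PersistEvent and AggregateWriter types are made up): with a fixed-size ConsistentHashingPool and no Resizer, every message carrying the same aggregate id hashes to the same routee.

```csharp
using Akka.Actor;
using Akka.Routing;

// Message that tells the router which hash bucket it belongs to.
public class PersistEvent : IConsistentHashable
{
    public PersistEvent(string aggregateId, object @event)
    {
        AggregateId = aggregateId;
        Event = @event;
    }

    public string AggregateId { get; private set; }
    public object Event { get; private set; }

    // Same AggregateId => same routee, as long as the ring itself doesn't change
    // (which is why combining this with a Resizer is risky).
    public object ConsistentHashKey { get { return AggregateId; } }
}

public class AggregateWriter : ReceiveActor
{
    public AggregateWriter()
    {
        Receive<PersistEvent>(e =>
        {
            // append e.Event to the stream for e.AggregateId
        });
    }
}

// usage: a fixed-size pool, deliberately without a Resizer
// var writers = system.ActorOf(
//     Props.Create<AggregateWriter>().WithRouter(new ConsistentHashingPool(10)),
//     "writers");
```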