These are chat archives for paypal/squbs

30th
Dec 2016
Anil Gursel
@anilgursel
Dec 30 2016 01:23
Found a flaw in my new Timeout Bidi design. It keeps an internal buffer-like structure with the start times of each element. There is also a timer which periodically wakes up and marks items that timed out. The timer seems to be running in parallel to the flow, so the buffer-like structure gets accessed concurrently :worried: I was so close to creating a PR. Realized this issue while changing the structure for the ordered vs unordered scenario..
I had not seen a problem prior to changing the internal structure for unordered.. Probably because LinkedHashMap somehow synchronizes.. Need to confirm though.
Grrr.. I feel like going back to square one.
Akara Sucharitakul
@akara
Dec 30 2016 01:42
This is actually not too bad. Please check the Delay code base for their synchronization. I'd actually prefer not to have a timer and rely on pull alone. But in low volume flows that could cause stagnation.
The rationale is that if there is no demand, timeout messages won't go anywhere, so it is not worth waking up. We need to carefully check the semantics for how much we can rely on the stream itself vs the timer. For a busy stream with 100 msgs/sec or more we really don't need a timer.
Anil Gursel
@anilgursel
Dec 30 2016 02:13
That's something we somewhat discussed earlier. Let's say a request takes 5 seconds and the timeout is 1 second; without a timer, we won't be able to send the timeout failure until 5 seconds have passed
Akara Sucharitakul
@akara
Dec 30 2016 02:15
Yes, unless there is another event causing us to do so. Anything can be an event causing us to check. An onPull for instance is a good event to check.
If the pull + push frequency is low then the timer will have to deal with the time gap between events.
Anil Gursel
@anilgursel
Dec 30 2016 02:16
Checking with every onPull would be expensive though. We need to do it occasionally.
Akara Sucharitakul
@akara
Dec 30 2016 02:16
That's easy to do.
Hold a variable lastChecked and make sure the time from lastChecked is at least 10ms before you check again.
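The lastChecked idea above could be sketched roughly as follows. This is a hypothetical illustration, not squbs code; the names (TimeoutCheckThrottle, maybeCheck, sweepCount) are made up for the example, and the sweep itself is stubbed out as a counter.

```scala
// Hypothetical sketch of the throttled-check idea: only sweep the buffer
// for timed-out elements if at least 10ms have passed since the last sweep.
object TimeoutCheckThrottle {
  val checkIntervalNanos: Long = 10L * 1000 * 1000 // 10 ms

  // Initialized far in the past so the very first check always runs.
  private var lastChecked: Long = Long.MinValue / 2
  private var sweeps: Int = 0

  // Called from onPull; returns true only when a sweep actually happened.
  def maybeCheck(nowNanos: Long): Boolean =
    if (nowNanos - lastChecked >= checkIntervalNanos) {
      lastChecked = nowNanos
      sweeps += 1 // stand-in for "scan buffer and mark timed-out elements"
      true
    } else false

  def sweepCount: Int = sweeps
}
```

Since the check runs on the stage's own thread inside onPull, no synchronization is needed for lastChecked.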
Anil Gursel
@anilgursel
Dec 30 2016 02:17
Yes, that's something we can do
Akara Sucharitakul
@akara
Dec 30 2016 02:17
That's for high frequency cases.
What I'm worried about is the low frequency cases. But if the back-pressure comes from downstream, it is easy. No demand, no check.
Anil Gursel
@anilgursel
Dec 30 2016 02:19
Exactly. So, checking in onPull should be good for both high and low frequency cases
Akara Sucharitakul
@akara
Dec 30 2016 02:20
Then it is better to let things not time out. Let's say there is no demand and the response comes after the timeout.
Anil Gursel
@anilgursel
Dec 30 2016 02:20
Even with timer, if there is no downstream demand, we cannot send a timeout failure
Akara Sucharitakul
@akara
Dec 30 2016 02:20
Yes, onPull should be enough, except if onPull is not coming as an event. We need to understand the event reliability.
Yes, absolutely.
And that's why it does not make sense to time out if there is no demand.
Let's say onPull happened a second ago. Demand is there, but nothing was sent down just yet. onPull may not get invoked again. That's where we need the timer.
Anil Gursel
@anilgursel
Dec 30 2016 02:24
The current design I have is: wake up every x ms, mark items as timed out based on their start time. But this step does not push any timeout event; it just makes elements ready to be pushed down in a queue as a failure. Once downstream demand comes in, an element from the queue (whether a timeout failure or a successful response) will be sent.
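The mark-and-queue design described above could be modeled roughly like this. This is an illustrative sketch, not the actual squbs stage; TimeoutBuffer and its methods are invented names, with the id-based tracking standing in for whatever correlation the real stage uses.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}
import java.util.concurrent.TimeoutException

// Illustrative model: a sweep (run from onTimer or onPull) moves timed-out
// elements into a ready queue as failures; nothing is pushed downstream
// until demand arrives, at which point pull() hands over one element.
class TimeoutBuffer[T](timeoutNanos: Long) {
  private val inFlight = mutable.LinkedHashMap.empty[Long, Long] // id -> startTime
  private val ready = mutable.Queue.empty[Try[T]]

  def start(id: Long, nowNanos: Long): Unit = inFlight(id) = nowNanos

  // Original flow produced a response; only enqueue it if not already timed out.
  def complete(id: Long, elem: T): Unit =
    if (inFlight.remove(id).isDefined) ready.enqueue(Success(elem))

  // Mark items whose start time is too old and queue them as failures.
  def sweep(nowNanos: Long): Unit = {
    val expired = inFlight.collect { case (id, t) if nowNanos - t >= timeoutNanos => id }.toList
    expired.foreach { id =>
      inFlight.remove(id)
      ready.enqueue(Failure(new TimeoutException(s"element $id timed out")))
    }
  }

  // Called when downstream demand comes in.
  def pull(): Option[Try[T]] = if (ready.nonEmpty) Some(ready.dequeue()) else None
}
```

Note that sweep only stages failures; pushing still waits for demand, matching the point made above that a timeout should not be "called" until it can actually go downstream.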
Akara Sucharitakul
@akara
Dec 30 2016 02:25
I think if the timeout event is not being pushed down, we should not call it a timeout.
Anil Gursel
@anilgursel
Dec 30 2016 02:25
So if I move the logic from onTimer to onPull, it should be ok.
Akara Sucharitakul
@akara
Dec 30 2016 02:25
Generally yes. I just want to think it through whether it covers all cases, as in the example I gave above.
Anil Gursel
@anilgursel
Dec 30 2016 02:26
How can we push down unless there is demand?
Akara Sucharitakul
@akara
Dec 30 2016 02:27
That's the point. Having a demand and having an event indicating the demand could be two different things.
From the pulling party, if you pulled and there is nothing coming down just yet, would you expect messages to come down when they are available? You should not have to do a periodic pull, right?
That's what I'm worried about.
Demand is there, but pull event happened sometime in the past.
We should look at Delay very closely how they handle this.
Anil Gursel
@anilgursel
Dec 30 2016 02:31
Whether in the past or now, if there is a pull then onPull gets called. If there is an available element to push down (whether timeout failure or success), it will be pushed down. If not, it (timeout failure or success) will be pushed down when available.
so, we might need to add another checkpoint in “push” scenarios for timeouts, in case the pull happened in the past and we could not mark anything timed out.
Akara Sucharitakul
@akara
Dec 30 2016 02:32
Timeout failure will be pushed down when available. So that "when available" can be triggered by a pull, a push, or a timer.
Anil Gursel
@anilgursel
Dec 30 2016 02:36
so, if we drop the timer.. Then, we need to do a periodic check in onPull (e.g., if 10ms passed since the last check), and also when an element becomes available from the original flow, we can add another time check (unless the element is already marked as timed out).
Akara Sucharitakul
@akara
Dec 30 2016 02:38
That periodic part needs to come from a timer, right?
In high flow, a timer should not be used at all. The question is, how would we know?
Anil Gursel
@anilgursel
Dec 30 2016 02:38
periodic might be the wrong word. It will basically be in onPull.
I think we should drop the timer altogether
Akara Sucharitakul
@akara
Dec 30 2016 02:38
If we can rely on that to be periodic.
Anil Gursel
@anilgursel
Dec 30 2016 02:39
rely on onPull for periodic checking, plus check the time when an element becomes available from the original flow. These two should be enough, IMO.
Given that the “timeout” we talk about is the time taken in the original flow. It does not take into account waiting for demand, etc.
Akara Sucharitakul
@akara
Dec 30 2016 02:41
Think about a scenario with low flow and onPull was called sometime ago without anything being pushed down. You may still need the timer. But I'll leave that choice to you.
Anil Gursel
@anilgursel
Dec 30 2016 02:42
for that scenario, we basically wait for onPush of the Bidi input 2 (from the original flow). Once onPush gets called, we should do another time check (not periodic; every time, for that single item).
Akara Sucharitakul
@akara
Dec 30 2016 02:42
And if that's not happening?
Anil Gursel
@anilgursel
Dec 30 2016 02:44
alright, that was a scenario that I implemented with onTimer earlier today :smile: My memory must be pretty bad, not being able to think of it.
now I see the point..
Anil Gursel
@anilgursel
Dec 30 2016 02:49
actually, whether it is high frequency or low frequency, we may encounter this problem. If the original flow has an issue and for whatever reason does not send a message (maybe dropping messages, or stuck), then we won’t be able to send a timeout failure message downstream. So, I do not think the problem is limited to low frequency scenarios.
Anil Gursel
@anilgursel
Dec 30 2016 02:55

My current design would not have this problem, as it has the timer. The timer seems to be a safer choice, but it has other problems:

  1. When there is no problem in the flow, the timer would cause an interrupt. We may not want this for high frequency scenarios.
  2. The timer runs on a separate thread and causes concurrency issues. I wonder if we can find a way to run it within the same thread as the stream. I remember seeing an Akka utility for that, will look for it.

So, we should use the timer, but try to minimize its use. Instead of scheduling periodically, we can schedule once. Once the scheduled event comes in, we schedule again. And in onPush we can do scheduleOnce again. AFAIR, when you call schedule again, it cancels the previous timer. If canceling and scheduling a new one is not expensive, this is a hybrid approach we can take.

  /**
   * Schedule timer to call [[#onTimer]] after given delay.
   * Any existing timer with the same key will automatically be canceled before
   * adding the new timer.
   */
  final protected def scheduleOnce(timerKey: Any, delay: FiniteDuration): Unit = {
    cancelTimer(timerKey)
    val id = timerIdGen.next()
    val task = interpreter.materializer.scheduleOnce(delay, new Runnable {
      def run() = getTimerAsyncCallback.invoke(Scheduled(timerKey, id, repeating = false))
    })
    keyToTimers(timerKey) = Timer(id, task)
  }
the cost seems to be a mutable map assignment.
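The hybrid "reschedule on every push" idea could be modeled like this. This is an illustrative sketch of the semantics only, not the real TimerGraphStageLogic; OneShotTimers and its method names are made up, and the model mirrors the behavior of the Akka snippet above, where scheduling with an existing key cancels the previous timer first.

```scala
import scala.collection.mutable

// Minimal model of the hybrid approach: scheduleOnce with the same key
// cancels any existing timer, so rescheduling on each onPush means the
// timer only ever fires when the stream has gone quiet for a full delay.
class OneShotTimers {
  private val keyToTimer = mutable.Map.empty[Any, Long] // key -> firing time (nanos)
  private var cancellations = 0

  def scheduleOnce(key: Any, nowNanos: Long, delayNanos: Long): Unit = {
    if (keyToTimer.remove(key).isDefined) cancellations += 1 // previous timer canceled
    keyToTimer(key) = nowNanos + delayNanos
  }

  // On each push we simply reschedule the sweep timer.
  def onPush(nowNanos: Long, delayNanos: Long): Unit =
    scheduleOnce("timeoutSweep", nowNanos, delayNanos)

  def firesAt(key: Any): Option[Long] = keyToTimer.get(key)
  def cancellationCount: Int = cancellations
}
```

In a busy stream the timer would thus be canceled and re-armed on every push and never actually fire; it only fires when pushes stop coming, which is exactly the low-frequency case that needs it.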
Anil Gursel
@anilgursel
Dec 30 2016 03:33
@akara the concurrency issue with onTimer turned out to be a non-issue. It was a bug in my own code. So, we’re good with onTimer. However, I am glad that I made that mistake, because it triggered a really good discussion.
I will make the PR ready; however, I will also study the effects of the timer a little more. If needed, we can try a hybrid approach.
Akara Sucharitakul
@akara
Dec 30 2016 06:56
Thanks @anilgursel! Really take a look at how Delay makes use of the timer. I'm sure we can learn a lot from that.
Anil Gursel
@anilgursel
Dec 30 2016 17:48

I was for a second excited thinking there is already support for request timeouts in streams https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/impl/engine/server/HttpServerBluePrint.scala#L237. But this assumes the message ordering does not change, and it is specific to HTTP.

The client side does not have timeout functionality http://doc.akka.io/docs/akka-http/current/scala/http/client-side/connection-level.html#timeouts. Maybe one of the reasons is that it complicates things when message ordering is not guaranteed. Anyway, the timeout functionality we are providing is definitely going to help.

The question I have now is: if there is no timeout functionality internally, how do the HTTP messages get drained from the HTTP client flow when the corresponding server is responding slowly? It would cause the stream to get stuck and not send any demand.

Here is what I see happening:

  1. https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/scaladsl/Http.scala#L663
  2. https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/impl/engine/client/PoolGateway.scala#L37
  3. https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/scaladsl/Http.scala#L471
  4. https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/scaladsl/Http.scala#L644
  5. So, it is the PoolMasterActor that handles the request ultimately. https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/impl/engine/client/PoolMasterActor.scala#L75
  6. Then, goes here https://github.com/akka/akka-http/blob/master/akka-http-core/src/main/scala/akka/http/impl/engine/client/PoolInterfaceActor.scala#L106

Following the chain, the response message is sent only if there is a failure (e.g., buffer is full, or the server responds).

On one side, it makes sense: if it is slow, it back-pressures so it does not get further requests. But on the other side, if it is slow I do not want my server to be stuck and not accepting any requests. I may instead prefer going through a separate branch.

We will somehow achieve this with the Circuit Breaker; however, ultimately the demand should be controlled by the HTTP flow. If slow responses from the server stop the demand, our Circuit Breaker would not be helping. I would rather time out the internal messages and let the timeout configuration determine the demand of the flow when things go bad, instead of the other server. So, I am a little concerned at this point.

Anil Gursel
@anilgursel
Dec 30 2016 17:59

so, in other words, I do not agree with the statement in this doc http://doc.akka.io/docs/akka-http/current/scala/http/client-side/connection-level.html#timeouts

Currently Akka HTTP doesn’t implement client-side request timeout checking itself as this functionality can be regarded as a more general purpose streaming infrastructure feature.

IMO, it should be an internal part of HTTP.

But anyway, I think the Circuit Breaker max failure count would be set to a very small number, 5-10 etc., while the number of requests a pool can handle should be a lot higher. So, I would expect the stream to continue functioning and circuit-breaking properly.

now, I should go back and resume focusing on the timeout. Finding some problem with each design change, but this is a lot of fun :smile:
Akara Sucharitakul
@akara
Dec 30 2016 18:11
What we do definitely seems to help with that.
Just one note thinking over the concurrency model.
The encapsulation of the concurrent block is doing no more than checking the buffer and flushing the timed out events downstream.
So in the event of concurrency, we should never block. We just give up, skip, and not do the job, as the other side/thread is already doing exactly the same job.
There is no point of blocking to do it again.
Akara Sucharitakul
@akara
Dec 30 2016 18:17
So, synchronized is not working here. We probably want to resort to some kind of flag or semaphore. Whether we want to use the Java semaphore or just a volatile boolean (or something else) as a flag, I'll leave it to you.
Anil Gursel
@anilgursel
Dec 30 2016 18:57
sorry, I could not understand the context exactly. Are you talking about the concurrent access to the “buffer-like structure” to find out timeouts? If so, it does not seem to be concurrent access. I was mistaken on that yesterday. It is the same thread as the flow is running on. In other words, there will not be a scenario where the flow and the timer are running in parallel.
Anil Gursel
@anilgursel
Dec 30 2016 19:05
by the way, while similar, our use case is different than Delay's. So, while I keep some internal structures for tracking, I do not have a buffer that’s used for back-pressuring. I want the original flow to be responsible for back-pressuring (anyway, the original flow is the downstream in this scenario), while our stage should not be creating any extra demand accidentally (with the timeout use cases).