    Gil Tene
    @giltene
    More specifically, you want to check what (highest) sustained throughput can be carried without causing latency or response time to become unacceptable. You can’t find this by picking some % of the max observed throughput (at the max itself, latency will virtually always be unacceptable, if there is a latency requirement). In some systems the sustainable point would be at 80% of max throughput, in some at 5%, and in some at 99.9%.
    Gil Tene
    @giltene
    You need to test a range of throughputs to find the highest sustainable throughput. And you need to test at any given throughput for a prolonged period (long enough for side effects like accumulated background work to exhibit their symptoms), which on many systems means tens of minutes. The “classic” ramp-up technique (e.g. a 100-minute test ramping from 0 to 1000 clients at a rate of 10 clients per minute) virtually always gives you wrong data, since it tends to detect “breakage” at a throughput that is much higher than what is actually sustainable (if you ran at that throughput for e.g. 100 minutes)
    Gil Tene
    @giltene
    Doing a quick set of tests to identify the likely “knee” point (finding things that break for sure is quick) is a good way to focus the remaining tests. But often, when you continue to test for real, many systems will show that it is hard to maintain a throughput that is even 20% of where “things stopped breaking in 2 minute tests” without the common impacts of accumulated background-debt work causing service level failures (and e.g. flipped circuit breakers). Things like periodic journal or other buffer flushing, data merging (e.g. table compaction), re-indexing, cache or catalog refreshing, garbage collection of all sorts, and even exhausted scheduler quotas are all examples of accumulated debt that is paid after some time, and they have delayed effects which happen anywhere from tens of milliseconds to tens of minutes after the actual operations that incurred the debt have completed. Such accumulated debt will cause future operations to cross latency/response time requirement boundaries, which is why you need to keep going at a given throughput for quite a while if you wish to know that it is sustainable.
    Unfortunately, sustainable throughput (which, per the above, is much more time consuming to establish in experiments) is the thing you need to know in order to answer “how much can this instance take” or “how many instances do I need to carry a load of X” questions. A temporary max established in a two- to ten-minute test is useless for estimating that in most systems.
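    For example, a stepped sweep along these lines (just a sketch; the rates, duration, and URL are placeholders):

    # hold each candidate rate long enough for deferred background
    # work (GC, compaction, buffer flushes) to show its symptoms
    for rate in 1000 2000 5000 10000 20000; do
        wrk2 -t 4 -c 100 -d 30m -R $rate http://example.com/api > rate-$rate.txt
    done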
    Andriy Plokhotnyuk
    @plokhotnyuk
    @giltene Gil, please share what you think about the following benchmarks, which aim to test the throughput of all contemporary web libraries and frameworks: https://github.com/TechEmpower/FrameworkBenchmarks
    Here are the results of the latest sprint for simple HTTP/1.1 with JSON serialization: https://www.techempower.com/benchmarks/#section=test&runid=3da523ee-fff1-45d8-9044-7feb532bf9ee&hw=ph&test=json
    A sample of raw logs with results for some of the top-10 frameworks from the previous chart: https://tfb-status.techempower.com/unzip/results.2018-06-28-04-18-09-819.zip/results/20180625165335/colossus/json/raw.txt
    Results of other sprints for both types of servers (physical and cloud) are gathered on this page: https://tfb-status.techempower.com/
    Samuel Williams
    @ioquatix
    How do you maintain constant throughput if the server is not fast enough?
    Gil Tene
    @giltene
    The constant throughput is for the load, not the server. It defines the model of when requests are supposed to be initiated, regardless of what the server can actually do. The response time of a request is then measured from when it was supposed to start to when it actually completed. If the server is not fast enough, it will “fall behind”
    Gil Tene
    @giltene
    and the response times will start growing linearly with time (the longer you spend being slower than the incoming rate, the longer the response times for incoming requests get). If the slowness was temporary (as it often is, with e.g. glitches, pauses, short stalls), it will eventually recover. If it never catches up, response times grow to infinity. (Think of a line of people at a coffee shop, and a barista that can’t make coffee as fast as people are coming in, making the line grow, and with it the “how long did it take to get my coffee” metric.)
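    A toy version of the coffee shop in code (a sketch, not wrk2 internals): requests arrive at a fixed rate, the server is slower than that rate, and latency is measured from each request’s intended start:

    -- toy FIFO queue: a request every 1 ms, but 2 ms of work each
    local interval = 0.001   -- intended gap between requests (seconds)
    local service  = 0.002   -- time the server needs per request
    local free_at  = 0       -- when the server next becomes free
    for i = 0, 9 do
        local intended = i * interval
        local start    = math.max(intended, free_at)
        free_at        = start + service
        -- measured from when the request *should* have started:
        print(string.format("req %d: %.0f ms", i, (free_at - intended) * 1000))
    end
    -- prints 2, 3, 4, ... ms: latency grows linearly while the line grows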
    Samuel Williams
    @ioquatix
    You should put that explanation in the readme.
    Gil Tene
    @giltene
    Do you mean the coffee shop example? The readme already explains the technique used: “The model I chose for avoiding Coordinated Omission in wrk2 combines the use of constant throughput load generation with latency measurement that takes the intended constant throughput into account. Rather than measure response latency from the time that the actual transmission of a request occurred, wrk2 measures response latency from the time the transmission should have occurred according to the constant throughput configured for the run. When responses take longer than normal (arriving later than the next request should have been sent), the true latency of the subsequent requests will be appropriately reflected in the recorded latency stats.”
    Samuel Williams
    @ioquatix
    Coffee shop example
    I read the README and didn't get it, even the bit quoted above
    but 🤷‍♂️
    Fatima Tahir
    @Fatahir
    Hi, does anyone know how I can print all the requests and latencies generated by wrk2?
    Samuel Williams
    @ioquatix
    Fork the code and write some C :p
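    Or, short of that, the done() hook can at least dump the recorded distribution - per-request latencies aren’t exposed to Lua, but the histogram is (a sketch using the documented hook):

    -- done() runs once at the end of the run; `latency` is the
    -- recorded latency histogram, in microseconds
    done = function(summary, latency, requests)
        for _, p in ipairs({ 50, 90, 99, 99.9, 99.99 }) do
            io.write(string.format("%g%%: %d us\n", p, latency:percentile(p)))
        end
        io.write(string.format("total requests: %d\n", summary.requests))
    end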
    Fatima Tahir
    @Fatahir
    I want to set each thread's address differently. Using wrk.lookup, what I understand is that it can only take a host and port. If I want to set a thread address like "localhost:8080/index.html", how can I set that on thread.addr? Also, I have four threads, and each thread has a different address; how can I generate the latency of each address? The latency of the overall script can be computed in the done function using latency:percentile, but I don't know how to compute the latency of each thread.
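    What I have so far is passing a per-thread path through setup()/thread:set(), since thread.addr only holds a host address (the paths here are just examples):

    -- give each thread its own path; thread:set() makes the value
    -- a global variable in that thread's Lua state
    local counter = 0
    local paths = { "/index.html", "/users", "/orders", "/health" }

    setup = function(thread)
        counter = counter + 1
        thread:set("path", paths[counter])
    end

    request = function()
        return wrk.format(nil, path)   -- nil keeps the default GET method
    end

    But since the latency histogram is global, I guess per-address latency would need one wrk2 process per address?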
    samcgardner
    @samcgardner
    Is there a good rule of thumb for the number of threads/connections to use with wrk/wrk2?
    Radim Vansa
    @rvansa
    Hi folks, we're seeing wrk constantly outperform wrk2 in throughput, even when -R is set well above the actually achieved request rate. Neither tool is CPU-bound, nor network-bound (short requests don't saturate bandwidth). Has anyone already explained where the difference comes from?
    Samuel Williams
    @ioquatix
    What do you mean “outperform”?
    Laurent Demailly
    @ldemailly
    and what is your server, avg latency, how to reproduce etc
    Radim Vansa
    @rvansa
    By outperforming I mean the maximum throughput that wrk/wrk2 can throw at the server (ignoring the latency aspect completely).
    Right now I am using a somewhat more complex setup in OpenShift, load-balancing traffic through Istio to multiple Quarkus-based simple REST apps.
    But I remember trying this out a while ago in a local-only setup, with similar results: a difference in throughput of tens of percent.
    Laurent Demailly
    @ldemailly
    just doing back-to-back requests with no delay is pretty much always going to give the max throughput, as long as the target isn’t pushed into degraded mode by it. usually you want to test at a fixed qps, at no more than about 80% of that max, to get useful latency numbers; at least that’s what happens with fortio
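    e.g. if the measured max was around 10k, something like (numbers made up):

    fortio load -qps 8000 -c 64 -t 60s http://example.com/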
    Radim Vansa
    @rvansa
    @ldemailly I totally agree - if you're concerned about latency, running at fixed qps is the thing. I am concerned about a different thing, though: the max throughput itself. It might be a useful metric, e.g. for quick regression tests, as it's easier to compare than latencies, since it's a single number and not a vector of values. And if two tools give you different throughput, there's obviously a difference in what these tools are doing.
    I'd assume that, unless throttled by CPU, wrk2 with a high enough qps target should report an actual throughput roughly equal to plain wrk.
    And I am running in a cluster, separating driver nodes from the system under test - and I can still see that, even when running with enough threads, the cores are not fully utilized (with either of the tools, so the bottleneck should be the system under test and not the driver).
    Still, the throughput differs.
    Laurent Demailly
    @ldemailly
    try fortio :) curious if you get a 3rd set of numbers
    Radim Vansa
    @rvansa
    @ldemailly Seems I am getting into a different ballpark with Fortio: that gives me ~ 64k reqs/s (each core is at 15-20%), wrk gets to 220k reqs/s (each core at 5-7%) and wrk2 to 145k reqs/s (10-12% CPU).
    Radim Vansa
    @rvansa
    I am running that from a single node (28 cores, HT off), and actually using 24 instances of fortio/wrk/wrk2 to hit 24 endpoints concurrently (testing the routing layer in between). The commands look like:
    wrk -d 60 -t 4 -c 400 -H 'x-variant: stable' https://app-$i.my.cloud/some/path
    wrk2 -R 10000 -d 60 -t 4 -c 400 -H 'x-variant: stable' https://app-$i.my.cloud/some/path
    fortio load -k -qps 0 -c 400 -t 60s -H 'x-variant: stable' https://app-$i.my.cloud/some/path
    (that's for 12 of the commands, the other 12 use a different header and are downscaled 4x to wrk -t 1 -c 100 ...)
    Radim Vansa
    @rvansa
    And the CPU stats are for user time only; there's more in sys and softirq handling.
    But even with wrk, mpstat tells me that the machine is > 35% idle
    Laurent Demailly
    @ldemailly
    interesting, I can get up to 500k/sec with fortio against itself on a big enough machine, but then again the goal for fortio is to test “slow” stuff in the 10k range
    Laurent Demailly
    @ldemailly
    oh, https is handled by golang, so it’s slower than the custom client
    fightlight
    @fightlight

    Hi, guys!
    I have an issue when I'm trying to dynamically generate a request body using this lua script:

    local random = math.random
    math.randomseed(os.time())
    
    local function uuid()
        return string.gsub('xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx', '[xy]', function (c)
            local v = (c == 'x') and random(0, 0xf) or random(8, 0xb)
            return string.format('%x', v)
        end)
    end
    
    local function data()
      return '{"id": "' .. uuid() .. '","text": "some text"}'
    end
    
    request = function()
        return wrk.format('POST', '/my-endpoint', {["Content-Type"] = "application/json"}, data())
    end

    I run wrk2 like this: wrk2 -t3 -c3 -d60s -R4000 -s my-script.lua http://localhost:8080

    When I pass -t3 to wrk I see 3 identical UUIDs in the logs, so I conclude that wrk generates the body only once and uses the same result for all N threads.

    I want wrk to generate a unique request for each thread. Is there a way to solve this?
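    Could the cause be that every thread runs math.randomseed(os.time()), and os.time() has one-second resolution, so all threads get the same seed and the same “random” sequence? Maybe seeding each thread differently would help, something like this (just a guess):

    -- give every thread a distinct RNG seed via setup()/thread:set()
    local counter = 0
    setup = function(thread)
        counter = counter + 1
        thread:set("seed", os.time() + counter)
    end

    init = function(args)
        math.randomseed(seed)   -- re-seed this thread's Lua state
    end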

    Radim Vansa
    @rvansa
    @ldemailly I would understand if some implementation of TLS performed worse, e.g. because it doesn't use fancier instructions, or allocates unnecessary objects (and then has to pay GC to tidy that up). And it might be the reason why with fortio I am seeing higher CPU usage even at lower throughput. However, we're not even close to maxing out the CPU, so there must be some other bottleneck in the system.
    And the bottleneck is somehow tied to the tool we're using; it's not just the network or something in the kernel. Unless these tools use different kernel APIs - I would expect all of them to work through epoll. I'll try strace at low throughput
    Radim Vansa
    @rvansa
    Hmm, I've tried switching the implementation from NIO to epoll in my benchmark (which is quite close to wrk2) and observed no change.
    Laurent Demailly
    @ldemailly
    it won’t make fortio double its qps or get near what you seem to get with wrk, but for http my client is about 20% faster than golang’s (because it’s specialized and reuses the same buffer, etc.)
    Radim Vansa
    @rvansa

    From wrk2:

    #[Mean    =       56.879, StdDeviation   =      722.732]
    #[Max     =    15458.304, Total count    =      1406480]
    #[Buckets =           27, SubBuckets     =         2048]
    ----------------------------------------------------------
      2425538 requests in 29.98s, 305.34MB read
      Socket errors: connect 0, read 0, write 0, timeout 79576
    Requests/sec:  80915.04
    Transfer/sec:     10.19MB

    I wonder how many requests it really did. The total count is 1.4M, but then it says 2.4M requests in 30s, and 80k timed out...

    Hmm, the same discrepancy seems to be present even in the readme (39500 vs 60018); does it mean that only a certain subset of the requests is actually measured?
    Hmm, I guess the difference belongs to the 'calibration period'
    Gil Tene
    @giltene
    Yes. If you do longer runs, where the post-calibration period is much longer, you'll likely see much smaller discrepancies in totals.
    Ben Sless
    @bsless

    Hello, I'm having trouble compiling wrk2 on Linux, which I think relates to LuaJIT. The instructions aren't complicated; I made sure I have the dependencies installed and ran make.
    When I reach

    LUAJIT src/wrk.lua
    LINK wrk

    I get a bunch of errors such as

    /usr/bin/ld: deps/luajit/src/libluajit.a(lj_err.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
    /usr/bin/ld: deps/luajit/src/libluajit.a(lj_str.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE

    Followed by

    collect2: error: ld returned 1 exit status

    I'm not sure how I should recompile with -fPIE, because I don't know if it needs to be passed to wrk's compilation or to LuaJIT's.
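    Maybe rebuilding the bundled LuaJIT as position-independent code and then relinking would do it? Just a guess (assuming LuaJIT's Makefile honors XCFLAGS for extra compiler flags):

    # rebuild LuaJIT with -fPIC, then relink wrk
    make -C deps/luajit clean
    make -C deps/luajit XCFLAGS=-fPIC
    make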