These are chat archives for ManageIQ/manageiq/performance

29th
Jan 2016
Alex Krzos
@akrzos
Jan 29 2016 13:49
@carbonin Do you have the system resource usage with those graphs? I'd be interested to see if bdr uses less cpu/mem in addition to the faster speed
With the delay it's odd that 200ms delay performed better than the 50ms delay on bdr (Unless I'm reading that incorrectly)
Dennis Metzger
@dmetzger57
Jan 29 2016 14:01
@akrzos system resources weren’t captured during those runs, I’m going to see about running again while capturing system resource utilization
Nick Carboni
@carbonin
Jan 29 2016 14:01
Yeah, I had to rerun those particular tests quite a few times before I was convinced that's how it turned out.
Alex Krzos
@akrzos
Jan 29 2016 14:01
ok cool if you want an appliance on server hardware to re-run those same tests on I can get one for ya
Nick Carboni
@carbonin
Jan 29 2016 14:01
Is there a particular way/tool you guys use to measure the resource usage?
Dennis Metzger
@dmetzger57
Jan 29 2016 14:04
sar -o <ResultsFile> -A <Interval> is a good simple overall view. I tend to use 5 (seconds) as my interval.
Alex Krzos
@akrzos
Jan 29 2016 14:04
@carbonin Depending on the test or its nature I might use pbench - https://github.com/distributed-system-analysis/pbench, or collectd, or munin, or just watch the tools interactively or dump their output to a text file if there isn't a good way to capture that output over time for that particular situation
pbench is literally just automation over a number of mostly built-in tools
sar, as dmetzger57 suggested, captures nearly everything too
if you're running that on a linux box you can use turbostat to view the cpu freq and c-state too, to make sure you're not benchmarking how well your intel proc turbos
i doubt you need that precision in this case though given the data you already captured
Joe Rafaniello
@jrafanie
Jan 29 2016 17:05
@tenderlove I have good news, I have 2.3.0 running our app on an appliance with systemtap
Chris Arcand
@chrisarcand
Jan 29 2016 17:21
:clap:
Joe Rafaniello
@jrafanie
Jan 29 2016 17:27
systemtap provides lots of useful information in terms of cow page faults that lead to a page copy.... unfortunately, it's way too much info
adding instrumentation to inject a summation of Private vs. Shared page information into the same file that systemtap is logging the page faults to, to help find the page faults that cause the shift from Shared -> Private
after this, I'm done looking at it
ETOOMANYCOWS
Jason Frey
@Fryguy
Jan 29 2016 17:29
:cow: :cow2: :cow: :cow2: :cow: :cow2: :cow: :cow2: :cow: :cow2: :cow: :cow2: :cow: :cow2:
Joe Rafaniello
@jrafanie
Jan 29 2016 17:30
:cow: :gun:
Chris Arcand
@chrisarcand
Jan 29 2016 17:30
:scream_cat:
Jason Frey
@Fryguy
Jan 29 2016 17:31
== :meat_on_bone:
Joe Rafaniello
@jrafanie
Jan 29 2016 17:32
I bet Linus or someone had fun naming some of these variables: https://github.com/torvalds/linux/blob/v3.13/mm/memory.c#L2977
even_cows?
Chris Arcand
@chrisarcand
Jan 29 2016 17:33
HAH that’s awesome.
is_cow
Chris Arcand
@chrisarcand
Jan 29 2016 17:35
BTRFS_ROOT_REF_COWS
Now I’m intrigued.
Joe Rafaniello
@jrafanie
Jan 29 2016 17:49
lol
Aaron Patterson
@tenderlove
Jan 29 2016 18:09
@jrafanie nice! How are you adding the instrumentation?
Joe Rafaniello
@jrafanie
Jan 29 2016 18:36
@tenderlove I'm dumping GC.stat and sum of Private_Dirty and Shared_Dirty to the same file that stap is writing for the process
so when I can tell that memory went Shared -> Private, I can look at which page faults happened in that window
much of this could be legit modifications in the child, but if there's a GC thing causing it, hopefully narrowing the window of the enormous number of page faults will help
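The instrumentation described above might look roughly like the following. This is only a sketch, not the code that was actually used: it assumes the page totals come from the `Private_Dirty` and `Shared_Dirty` fields of `/proc/<pid>/smaps`, and the helper names `sum_dirty_kb` and `memory_snapshot_line` and the log line layout are made up for illustration.

```ruby
# Sum the Private_Dirty and Shared_Dirty fields (in kB) from smaps-formatted text.
def sum_dirty_kb(smaps_text)
  totals = { "Private_Dirty" => 0, "Shared_Dirty" => 0 }
  smaps_text.each_line do |line|
    # Matching lines look like "Private_Dirty:        12 kB"
    if (m = line.match(/\A(Private_Dirty|Shared_Dirty):\s+(\d+) kB/))
      totals[m[1]] += m[2].to_i
    end
  end
  totals
end

# Build one log line: timestamp, page totals, and the GC run count from GC.stat.
# The line format is hypothetical; pick whatever is easy to correlate with stap output.
def memory_snapshot_line(smaps_text, gc_stat = GC.stat, now = Time.now)
  dirty = sum_dirty_kb(smaps_text)
  format("%s private_dirty_kb=%d shared_dirty_kb=%d gc_count=%d",
         now.utc.strftime("%Y-%m-%dT%H:%M:%S.%6NZ"),
         dirty["Private_Dirty"], dirty["Shared_Dirty"], gc_stat[:count])
end
```

On Linux, a process could append `memory_snapshot_line(File.read("/proc/self/smaps"))` to the log at intervals; a drop in `shared_dirty_kb` with a matching rise in `private_dirty_kb` between two lines marks the window of page faults worth reading.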
Keenan Brock
@kbrock
Jan 29 2016 19:09
Does metrics collection use a bunch of queues? is it that dynamic?
For 100vms, we're calling metrics_collector_queue_name 130 times.
If it is dynamic, then I'll come up with a solution, but if it is static...
ok, 1 per vendor. hmm
Jason Frey
@Fryguy
Jan 29 2016 20:02
the queue name should be per vendor, I would think
that is, it should make its way to the PerEmsWorkerMixin
Keenan Brock
@kbrock
Jan 29 2016 20:03
yea, I'm thinking cutting up our work per ems will save us a few hundred queries
Jason Frey
@Fryguy
Jan 29 2016 20:03
it already is cut up that way
Keenan Brock
@kbrock
Jan 29 2016 20:03
cool - that is what I thought
the workload
Jason Frey
@Fryguy
Jan 29 2016 20:03
ah ok
Keenan Brock
@kbrock
Jan 29 2016 20:03
when putting into the queue, we lookup each ems hundreds of times
Jason Frey
@Fryguy
Jan 29 2016 20:03
whoa weird
maybe once per metric collected or something like that?
Keenan Brock
@kbrock
Jan 29 2016 20:05
perfect
Jason Frey
@Fryguy
Jan 29 2016 20:06
is that expensive though? If I recall, getting the name is not very expensive
Keenan Brock
@kbrock
Jan 29 2016 20:06
putting entries into the queue for 100vms is 600 queries
maybe more
we're looking up the same ems (and region) hundreds of times
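One common way to avoid looking up the same EMS repeatedly is to cache the result per EMS while building the queue entries. The sketch below only illustrates that idea and is not ManageIQ code: `EmsLookup`, `queue_name_for`, and `build_queue_entries` are hypothetical names, and the counter stands in for real database queries.

```ruby
# Hypothetical stand-in for a database-backed lookup; counts how often it is called.
class EmsLookup
  attr_reader :query_count

  def initialize(names_by_id)
    @names_by_id = names_by_id
    @query_count = 0
  end

  def queue_name_for(ems_id)
    @query_count += 1 # each call models one query
    "ems_#{ems_id}_#{@names_by_id.fetch(ems_id)}"
  end
end

# Build queue entries for the targets, resolving each EMS's queue name only once.
def build_queue_entries(targets, lookup)
  # The hash's default block runs the lookup on the first miss and stores the result.
  queue_names = Hash.new { |cache, ems_id| cache[ems_id] = lookup.queue_name_for(ems_id) }
  targets.map { |t| { target_id: t[:id], queue_name: queue_names[t[:ems_id]] } }
end
```

With 100 targets spread over two EMSes, the lookup runs twice instead of 100 times. Whether that matters in practice depends on how much of the total time those lookups actually take.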
Jason Frey
@Fryguy
Jan 29 2016 20:07
ok, but is that expensive? I mean, is that the bottleneck?
600 queries is meaningless if it takes 0.1s :)
Keenan Brock
@kbrock
Jan 29 2016 20:07
1/3 of perf_capture_timer
and it is timing out
Jason Frey
@Fryguy
Jan 29 2016 20:07
ah ok...cool
Keenan Brock
@kbrock
Jan 29 2016 20:07
wait
Jason Frey
@Fryguy
Jan 29 2016 20:07
:+1:
Keenan Brock
@kbrock
Jan 29 2016 20:07
so if you see an N+1, and it only takes 0.1s - punt?
Jason Frey
@Fryguy
Jan 29 2016 20:08
no, but I might put it on my future TODO list
Keenan Brock
@kbrock
Jan 29 2016 20:08
k
Oleg Barenboim
@chessbyte
Jan 29 2016 20:08
@kbrock measure what you are trying to refactor
Matthew Draper
@matthewd
Jan 29 2016 20:08
If you're trying to fix something that's taking 10s, then yes
Oleg Barenboim
@chessbyte
Jan 29 2016 20:08
performance is ALL about fixing what is taking a big percentage of time or resources
Keenan Brock
@kbrock
Jan 29 2016 20:08
well, this is addressing 4 BZs
that say it is thrashing
I'll collect numbers
Jason Frey
@Fryguy
Jan 29 2016 20:09
not saying that you shouldn't focus on the perf_capture_timer issue...just making sure that the thing you are looking at will have tangible benefit
Matthew Draper
@matthewd
Jan 29 2016 20:09
Yeah, not forget it exists.. but if it's only accounting for 0.1s of what you're looking for, it's obviously not The Thing
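A rough way to check whether a suspect step is "The Thing" is to time each step and compare its share of the total. The helper below is a minimal sketch of that; `time_shares` and the step names are hypothetical, and the `sleep` calls only stand in for real work.

```ruby
# Time each named step and report its share of the total elapsed time,
# largest share first.
def time_shares(steps)
  timings = steps.transform_values do |step|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    step.call
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  total = timings.values.sum
  timings.map do |name, seconds|
    share = total.zero? ? 0.0 : seconds / total
    { name: name, seconds: seconds, share: share }
  end.sort_by { |row| -row[:share] }
end

# Hypothetical steps standing in for parts of a slow job.
report = time_shares({
  "queue name lookups" => -> { sleep 0.01 },
  "everything else"    => -> { sleep 0.03 }
})
report.each do |row|
  puts format("%-20s %.3fs %5.1f%%", row[:name], row[:seconds], row[:share] * 100)
end
```

A step that accounts for a third of a job that is timing out is worth fixing now; one that accounts for 0.1s of a 10s problem can wait on the TODO list.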