These are chat archives for ManageIQ/manageiq/performance

21st
Oct 2015
Dennis Metzger
@dmetzger57
Oct 21 2015 02:50
Haven't looked at / evaluated all the data yet. So far, in the few runs I looked at, PostgreSQL is consuming about 60 MB more in 5.5 & Master than in 5.4. Here's how the max memory used in the small VMware environment plays out.
MaxConsumedMem.png
The next charts show memory use during VMware environment runs: initial inventory refresh followed by CnU enablement. The 5.5 large environment did not complete; I killed it. As you can see in the chart, the appliance was running out of available memory, regaining memory by killing processes, and then repeating.
SmallEnv MemUsed.png
MedEnv MemUsed.png
LargeEnv MemUsed.png
I need to process the data collected throughout each test run on all running processes to summarize growth of non-worker processes.
In the small VMware environment we're about 10% larger than 5.4 now, primarily due to the two new (Automate) workers.
akrzos @akrzos is back
Oleg Barenboim
@chessbyte
Oct 21 2015 13:00
@dmetzger I spoke to @gmcculloug about the automate workers -- 2 issues
  • code is not routing work to automate workers (yet) - he will address
  • drop default automate workers from 2 to 1
Dennis Metzger
@dmetzger57
Oct 21 2015 13:02
fyi the current 2 combined add about 300 MB of RSS memory growth
Oleg Barenboim
@chessbyte
Oct 21 2015 13:02
right - that is what prompted my conversation with him
Dennis Metzger
@dmetzger57
Oct 21 2015 13:04
i'm plotting the growth between 5.4 & tuned Master by functional area (vmdb commands, http, postgres, system, etc.)
@akrzos welcome back
Matthew Draper
@matthewd
Oct 21 2015 13:09
Is it feasible for us to retain a config option, by which people could opt to do automate work in the generic workers, and thereby implicitly disable the extra automate worker?
Alex Krzos
@akrzos
Oct 21 2015 13:12
@dmetzger57 Thanks!
@dmetzger57 sounds like all reasonable things to track between releases with a given workload
I had looked into this python script called ps_mem.py that sums major processes and when I looked into it I also noted higher memory usage on the newer postgres as well, but I wasn't entirely sure if that script sums the memory usage correctly
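A minimal Ruby sketch of the per-command summing idea, assuming input shaped like `ps -eo rss,comm` output. The sample readings are made up, and this is not how ps_mem.py itself works: plain RSS counts shared pages once per process, so these totals overstate real usage, which may be part of why summed numbers are hard to trust.

```ruby
# Sum resident memory (RSS, in kB) per command name from `ps -eo rss,comm` style text.
# Caveat: RSS counts shared pages once per process, so totals overstate real usage.
def rss_by_command(ps_output)
  totals = Hash.new(0)
  ps_output.each_line do |line|
    rss, command = line.strip.split(/\s+/, 2)
    # Skip the header row, blank lines, and anything without a numeric RSS column.
    next unless command && rss.to_s.match?(/\A\d+\z/)
    totals[command] += rss.to_i
  end
  # Largest consumers first.
  totals.sort_by { |_command, kb| -kb }.to_h
end

# Made-up sample readings, for illustration only.
SAMPLE_PS = <<~PS
  RSS COMMAND
  120000 ruby
  90000 postgres
  130000 ruby
  30000 postgres
  8000 httpd
PS

rss_by_command(SAMPLE_PS).each do |command, kb|
  puts format("%-10s %8d kB", command, kb)
end
```

On a live appliance the input could come from running `ps -eo rss,comm` and passing its output to `rss_by_command`.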
Oleg Barenboim
@chessbyte
Oct 21 2015 13:14
@matthewd I think the difference would be in the queue_name parameter to MiqQueue from 'generic' to 'automate'
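A rough sketch of how such an opt-out could look, assuming the only difference really is the queue name. The helper and the `dedicated_automate_workers` option are hypothetical names invented for illustration, not existing ManageIQ settings.

```ruby
# Hypothetical helper: choose the queue name for automate work from a config flag.
# `dedicated_automate_workers` is an invented option name, not a real ManageIQ setting.
def automate_queue_name(config)
  config.fetch(:dedicated_automate_workers, true) ? "automate" : "generic"
end

# The returned name would be what gets passed as queue_name when the work is queued.
puts automate_queue_name({ dedicated_automate_workers: false })
puts automate_queue_name({})
```

With the flag off, automate work would land on the generic workers and the extra automate worker would have nothing to pick up.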
Alex Krzos
@akrzos
Oct 21 2015 13:15
just another perf thing to dig into, perhaps more tunables for postgres now with the newer version or some untuned defaults that result in "higher" memory usage
Oleg Barenboim
@chessbyte
Oct 21 2015 13:15
@akrzos welcome back!
good to have you back, partner!
Matthew Draper
@matthewd
Oct 21 2015 13:16
We should probably do a run with prepared_statements: false in database.yml, just in case we are actually hitting rails/rails#21992
That would manifest as a small increase in ruby memory usage, and "some" increase in the PG processes
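For anyone setting up that run, a small sketch that flips the setting in a database.yml document. The `production` environment name and the sample settings are assumptions for illustration; `prepared_statements` itself is the standard Rails PostgreSQL adapter option.

```ruby
require "yaml"

# Return a copy of a database.yml document with prepared statements disabled
# for one environment. The environment name is an assumption.
# Note: YAML.safe_load rejects YAML aliases by default, so a file that uses
# anchors/aliases would need a different loader call.
def disable_prepared_statements(database_yml, env = "production")
  config = YAML.safe_load(database_yml)
  config.fetch(env)["prepared_statements"] = false
  YAML.dump(config)
end

# Illustrative settings only, not copied from a real appliance.
SAMPLE_DATABASE_YML = <<~YML
  production:
    adapter: postgresql
    database: vmdb_production
    pool: 5
YML

puts disable_prepared_statements(SAMPLE_DATABASE_YML)
```

The server processes would need a restart for the change to take effect.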
Alex Krzos
@akrzos
Oct 21 2015 13:17
@chessbyte Thanks, been reading back the conversation here for highlights, looks like the environments have proven very useful, makes me :smile:
Dennis Metzger
@dmetzger57
Oct 21 2015 13:21
if possible I'd like to schedule some time(s) to run against your environments, till we get our own reference environments up.
Alex Krzos
@akrzos
Oct 21 2015 13:21
@dmetzger57 yeah absolutely, no worries, still just reading to get up to speed
@dmetzger57 One thing we/I don't know is if/when an environment can be "oversubscribed", e.g. how many cfme environments can manage a provider before it creates performance issues
hence testing oversubscribed would become an issue in itself, I know QE has come across some issues and tried to solve that with a proxy caching things, however I don't think they have any specific metrics around it with VMware
Jason Frey
@Fryguy
Oct 21 2015 14:01
I got memory_analyzer to create treemaps from heap dumps, but when I did the full small env, it blew up Chrome. However, we don't need to see it from the root, we can pick a layer that's a little lower
Matthew Draper
@matthewd
Oct 21 2015 14:06
@Fryguy I've had some success in Firefox when Chrome dies
Jason Frey
@Fryguy
Oct 21 2015 14:06
oh thanks...I'll try that
Joe Rafaniello
@jrafanie
Oct 21 2015 14:44
wow, this profiler works pretty well... http://rbkit.codemancers.com/
it seems to bog down a bit locally during high allocations in the broker... but you can take snapshots of the heap at any time to view the objects in a gui... and even compare the heaps...
Oleg Barenboim
@chessbyte
Oct 21 2015 14:49
@jrafanie why can't manageiq.org be as elegant as the profiler web site?
Joe Rafaniello
@jrafanie
Oct 21 2015 14:50
:wink:
Keenan Brock
@kbrock
Oct 21 2015 14:51

I can just see it now:

Using ManageIQ is simple:

rpm install manageiq
service manageiq start
Oleg Barenboim
@chessbyte
Oct 21 2015 14:52
@kbrock are you saying that ManageIQ is more complex than the profiler? ;-)
Keenan Brock
@kbrock
Oct 21 2015 14:52
for now...
but it would be cool if...
service manageiq-cu start
service manageiq-ui start
service manageiq-automate start

open http://localhost/
Matthew Draper
@matthewd
Oct 21 2015 14:54
:100:
Oleg Barenboim
@chessbyte
Oct 21 2015 14:55
@kbrock I thought you would be moving us off of MiqQueue and onto a message bus
Keenan Brock
@kbrock
Oct 21 2015 14:55
@chessbyte right after we are complete with tenancy... :/
Matthew Draper
@matthewd
Oct 21 2015 14:55
:sparkles: activejob :sparkles:
Oleg Barenboim
@chessbyte
Oct 21 2015 14:56
@kbrock I thought you would be moving us off of PG and onto a TSDB for metrics and events
Keenan Brock
@kbrock
Oct 21 2015 14:56
oh noes
more service calls
Alex Krzos
@akrzos
Oct 21 2015 15:05
@dmetzger57 / @jrafanie Any tests today that you could use my help getting started on? I see more 5.5 builds I can get my benchmarks running against
Joe Rafaniello
@jrafanie
Oct 21 2015 15:06
welcome back @akrzos
Alex Krzos
@akrzos
Oct 21 2015 15:06
@jrafanie Thanks!
Jason Frey
@Fryguy
Oct 21 2015 15:13
@jrafanie that's funny...I contributed to rbkit like a year ago and totally forgot about it :/
Joe Rafaniello
@jrafanie
Oct 21 2015 15:14
it's pretty awesome, doing it now on the broker to compare the broker before/after priming on the small vsphere
Jason Frey
@Fryguy
Oct 21 2015 15:14
yeah I remember it being pretty sweet
Joe Rafaniello
@jrafanie
Oct 21 2015 15:14
it gets seriously bogged down with the medium and large vsphere though
Oleg Barenboim
@chessbyte
Oct 21 2015 15:15
and did it get bogged down in previous versions too?
Matthew Draper
@matthewd
Oct 21 2015 15:17
Did we make a decision on what "acceptable growth" looked like? If we're now around 10%…
Keenan Brock
@kbrock
Oct 21 2015 15:18
seems to me that the issue is more that the new refresh uses more and/or bigger objects
Jason Frey
@Fryguy
Oct 21 2015 15:18
I'm not so sure it's all in refresh
Keenan Brock
@kbrock
Oct 21 2015 15:18
ok
Jason Frey
@Fryguy
Oct 21 2015 15:18
I believe there is a baseline bump
Joe Rafaniello
@jrafanie
Oct 21 2015 15:18
@matthewd can we peg to 4-2-stable? or is a rc coming out soon?
Matthew Draper
@matthewd
Oct 21 2015 15:19
Also, if someone's in a position to do so, I'd love a heap dump of the large environment, after the refresh
Jason Frey
@Fryguy
Oct 21 2015 15:19
and then some specific workers have inordinate bumps
but I admit I don't know the numbers anymore, and we probably need new readings
I believe @jrafanie has that (the heap dump)
Matthew Draper
@matthewd
Oct 21 2015 15:19
The jump at ~400s looks much larger than the equivalent ~750s one on 5.4
@jrafanie peg for now; I'll poke about getting an rc
Joe Rafaniello
@jrafanie
Oct 21 2015 15:21
@matthewd I guess I'm concerned that any numbers that @dmetzger57 or @akrzos finds will be against 4.2.4 without the fix we know prevented some things from being GC'd
I'll make a PR
@akrzos are you in a position to test using your scripts off of upstream MIQ appliances?
Alex Krzos
@akrzos
Oct 21 2015 15:22
@jrafanie I can take a look
@jrafanie nightly build new enough or will I need to patch it?
Joe Rafaniello
@jrafanie
Oct 21 2015 15:22
Also, @akrzos I'd be curious what the numbers look like without the 2 additional automate workers so we're comparing properly
Matthew Draper
@matthewd
Oct 21 2015 15:23
@akrzos if you have an upstream nightly, it's just a git pull away from being up to date anyway ¯\_(ツ)_/¯
Alex Krzos
@akrzos
Oct 21 2015 15:24
nightlies break after the 16th? That's the latest I'm seeing
Joe Rafaniello
@jrafanie
Oct 21 2015 15:24
yeah, @akrzos looks like the build hung again due to yum
grab the 16th... then do this...
stop the server processes, vmdb... git pull, appliance... git pull...reboot
vmdb and appliance are aliases to cd to the appropriate git repo
Alex Krzos
@akrzos
Oct 21 2015 15:25
easy enough, I'll let you know what I find
Matthew Draper
@matthewd
Oct 21 2015 15:26
@jrafanie should there be a "just in case" migrate in there?
Joe Rafaniello
@jrafanie
Oct 21 2015 15:26
then verify set |grep RUBY_GC shows some of the GC knobs tuned
yes, @matthewd , good point
and probably a bundle ;-)
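The same `RUBY_GC` check can be done from inside a Ruby process, which confirms the worker itself sees the tuned values. A minimal sketch; it matches variables whose names start with `RUBY_GC`, which is slightly stricter than the grep.

```ruby
# List GC tuning variables visible to the current process,
# similar in spirit to `set | grep RUBY_GC`.
def ruby_gc_settings(env = ENV)
  env.to_h.select { |name, _value| name.start_with?("RUBY_GC") }
end

settings = ruby_gc_settings
if settings.empty?
  puts "no RUBY_GC_* variables set"
else
  settings.sort.each { |name, value| puts "#{name}=#{value}" }
end
```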
Jason Frey
@Fryguy
Oct 21 2015 15:27
sounds like we need a simple "pull me everything on the appliance" script
Joe Rafaniello
@jrafanie
Oct 21 2015 15:27
stop the server processes, vmdb... git pull, bundle, bundle exec rake db:migrate, appliance... git pull
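A sketch of what a "pull me everything" script could look like, following that order of steps. The repo directories and the stop command are placeholders supplied by the caller, since the chat only says that `vmdb` and `appliance` are aliases for the two repos. It defaults to a dry run that only prints the plan.

```ruby
# Build the ordered update steps as [directory, command] pairs.
# Both directories and the stop command are caller-supplied placeholders.
def update_steps(vmdb_dir:, appliance_dir:, stop_command:)
  [
    [nil,           stop_command],
    [vmdb_dir,      "git pull"],
    [vmdb_dir,      "bundle"],
    [vmdb_dir,      "bundle exec rake db:migrate"],
    [appliance_dir, "git pull"]
  ]
end

# Print the plan (dry run), or execute it and stop at the first failing step.
def run_steps(steps, dry_run: true)
  steps.each do |dir, command|
    prefix = dir ? "(in #{dir}) " : ""
    puts prefix + command
    next if dry_run

    ok = dir ? system(command, chdir: dir) : system(command)
    raise "step failed: #{command}" unless ok
  end
end

# Dry run with placeholder values; nothing is executed.
run_steps(
  update_steps(
    vmdb_dir: "/path/to/vmdb",
    appliance_dir: "/path/to/appliance",
    stop_command: "echo stop-server-placeholder"
  )
)
```

A reboot afterwards, as in the earlier version of the steps, is left to the operator.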
Alex Krzos
@akrzos
Oct 21 2015 15:27
as far as disabling automate workers, though, that should be irrelevant for my refresh benchmarks as the automate role is disabled. In the past I had some trouble with the workers hanging around while that role was disabled, but I couldn't reproduce the issue a second time
Jason Frey
@Fryguy
Oct 21 2015 15:27
wow...nice @matthewd
Joe Rafaniello
@jrafanie
Oct 21 2015 15:27
ok, @akrzos as long as you don't have automate workers running, that's good
Keenan Brock
@kbrock
Oct 21 2015 15:28
@matthewd heh - nice: sometimes the simplest of things require rocket science
Matthew Draper
@matthewd
Oct 21 2015 15:32
@akrzos note that @dmetzger57 has recently been charting mostly around the whole-of-VM RAM consumption, in an attempt to better keep things in perspective / measured against the thing we actually have to constrain
Alex Krzos
@akrzos
Oct 21 2015 15:34
@matthewd I saw a few of those charts
I'd have to add some things to my automation for that, or even better would be to just have something collect all of that in a TSDB so I don't need to automate that portion. Are those the preferred measurements you guys want/need for comparison, or is individual task/workload rss/virt memory + GC.stat before/after still needed?
Matthew Draper
@matthewd
Oct 21 2015 15:40
In my opinion, that's the best goal-post measurement, while the others remain helpful in finding the most likely source of wins… but I'm not Team Perf, so the relevance of my opinion is limited :)
Joe Rafaniello
@jrafanie
Oct 21 2015 15:48
@akrzos I think different measurements are useful but the whole system: "consumed memory", cpu time and wall time to perform specific measurable sets of tasks is probably helpful
for example, @dmetzger57 did a baseline of idle workers... no ems, just starting the server 5.4 vs. 5.5 is really helpful... note, to be even more helpful, we should eliminate the automate workers
another one that measures the introduction of the first ems and first refresh is another one that @dmetzger57 was doing... from there we can dive into specific worker problem areas
Alex Krzos
@akrzos
Oct 21 2015 15:51
@jrafanie gotcha, I believe we discussed a test centered around that as well, just an idle system turned on and baseline memory usage, then add a provider, baseline memory usage, turn on another feature, baseline, repeat
I think both sides of the picture are most optimal, entire system baseline + individual features, so we can see if specific parts regress, of course that means more work now :smiley:
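A minimal sketch of that step-by-step baseline idea: take a whole-system reading after each step and log it with a label. The "consumed" formula (total minus free, buffers and cache) is one possible definition and an assumption here, and the sample numbers are made up.

```ruby
# Parse /proc/meminfo text into a hash of kB values.
def parse_meminfo(text)
  text.each_line.each_with_object({}) do |line, fields|
    match = line.match(/\A(\w+):\s+(\d+)/)
    fields[match[1]] = match[2].to_i if match
  end
end

# One possible definition of "consumed" memory: total minus free, buffers and cache.
def consumed_kb(fields)
  fields.fetch("MemTotal") - fields.fetch("MemFree") -
    fields.fetch("Buffers", 0) - fields.fetch("Cached", 0)
end

# Append a labelled reading (idle, provider added, feature enabled, ...) to a log.
def record_baseline(label, meminfo_text, log = [])
  log << { label: label, consumed_kb: consumed_kb(parse_meminfo(meminfo_text)) }
end

# Made-up sample in /proc/meminfo format.
SAMPLE_MEMINFO = <<~MEMINFO
  MemTotal:        8000000 kB
  MemFree:         2000000 kB
  Buffers:          100000 kB
  Cached:           900000 kB
MEMINFO

log = record_baseline("idle", SAMPLE_MEMINFO)
log.each { |entry| puts "#{entry[:label]}: #{entry[:consumed_kb]} kB" }
```

On a Linux appliance the text would come from `File.read("/proc/meminfo")`, called once after each step.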
Alex Krzos
@akrzos
Oct 21 2015 16:06
ugh, so I git stashed my differences in appliance and after reboot:
Oct 21 12:04:57 localhost systemd: Starting EVM server daemon...
Oct 21 12:04:57 localhost sh: /usr/bin/env: ruby: No such file or directory
missing this:
+export PATH=$PATH:/opt/rubies/ruby-2.2.3/bin
Keenan Brock
@kbrock
Oct 21 2015 16:08
YES!
Alex Krzos
@akrzos
Oct 21 2015 16:08
looks like it's booting now
Keenan Brock
@kbrock
Oct 21 2015 16:08
I hate this
Alex Krzos
@akrzos
Oct 21 2015 16:09
looks like my workers are starting up
Joe Rafaniello
@jrafanie
Oct 21 2015 17:02
@matthewd what dump do you want? vmware refresh worker after refresh + GC.start?
Matthew Draper
@matthewd
Oct 21 2015 17:03
@jrafanie yes please.. on large
Joe Rafaniello
@jrafanie
Oct 21 2015 17:04
with 4-2-stable i assume, right?
Matthew Draper
@matthewd
Oct 21 2015 17:04
Sure
I'm basically hoping the different ratios on large will better reveal what happens in that one big jump
Matthew Draper
@matthewd
Oct 21 2015 17:12
@jrafanie Oh, oops… I missed a rather vital "with allocation tracing, at least for refresh" in there
Joe Rafaniello
@jrafanie
Oct 21 2015 17:13
oh, that would be hard
allocation_tracer segv's with too much stuff going on
I might be able to get a dump with the medium
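For reference, a minimal sketch of a heap dump with allocation tracing using only Ruby's bundled `objspace` library. This is not the allocation_tracer gem mentioned here, and the array of strings is a stand-in for the real refresh work.

```ruby
require "objspace"
require "tmpdir"

# Placeholder location for the dump file.
HEAP_DUMP_PATH = File.join(Dir.tmpdir, "heap-#{Process.pid}.json")

# Record allocation sites (file and line) for objects created from now on.
ObjectSpace.trace_object_allocations_start

# Stand-in for the work under study (for example, a refresh).
retained = Array.new(1_000) { |i| "inventory item #{i}" }

# Collect garbage first so the dump mostly reflects objects that are still live.
GC.start

# Write the heap as one JSON document per line.
File.open(HEAP_DUMP_PATH, "w") { |io| ObjectSpace.dump_all(output: io) }
ObjectSpace.trace_object_allocations_stop

puts "#{retained.size} strings retained, dump at #{HEAP_DUMP_PATH} " \
     "(#{File.size(HEAP_DUMP_PATH)} bytes)"
```

Tracing adds overhead per allocation, so on the large environment it may be more practical to enable it only around the refresh itself.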
Matthew Draper
@matthewd
Oct 21 2015 17:20
Maybe all I really need is some logging correlation, to work out approximately what we're doing at the various interesting points on the graph
Joe Rafaniello
@jrafanie
Oct 21 2015 17:38
@matthewd found a thing in the router
let me gist it
Dennis Metzger
@dmetzger57
Oct 21 2015 17:45
@matthewd at 120 seconds elapsed a medium sized vmware provider was added and at the next jump (405 - 485 depending on the version) Cap n U was enabled.
Matthew Draper
@matthewd
Oct 21 2015 17:46
@dmetzger57 the jump that has my interest is the one at ~255 on that graph
Alex Krzos
@akrzos
Oct 21 2015 17:47
End of refresh perhaps
close to the end of refresh
Matthew Draper
@matthewd
Oct 21 2015 17:47
.. because it seems to correspond to the far more pronounced jump at 425 on https://files.gitter.im/ManageIQ/manageiq/performance/mraT/LargeEnv-MemUsed.png
Alex Krzos
@akrzos
Oct 21 2015 17:48
ugh, what is going on there, worker thrashing?
Matthew Draper
@matthewd
Oct 21 2015 17:50
@jrafanie independent of anything else, and without much preference to provider size, I did think we should try a prepared_statements: false run, just to double-check the statement cache isn't causing a problem
Joe Rafaniello
@jrafanie
Oct 21 2015 17:50
good point, I forgot about that @matthewd
Matthew Draper
@matthewd
Oct 21 2015 17:51
@akrzos yeah, that's worker vs oom killer
Dennis Metzger
@dmetzger57
Oct 21 2015 17:51
at 225 elapsed we began saving the inventory to the database for that run
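A small sketch of the correlation idea: keep a timeline of known events and look up which one most recently preceded a point of interest on the graph. The timings are the ones quoted in this chat for that run. Using 405 for C&U is an assumption, since the range given was 405 - 485 depending on version.

```ruby
# Timeline built from the timings quoted in this chat.
# 405 is the low end of the 405-485 range given for enabling C&U.
EVENTS = [
  [120, "medium VMware provider added"],
  [225, "started saving inventory to the database"],
  [405, "C&U enabled (405-485 depending on version)"]
].freeze

# Return the most recent [time, label] event at or before the given time, or nil.
def event_before(elapsed_seconds, events = EVENTS)
  events.select { |time, _label| time <= elapsed_seconds }.max_by(&:first)
end

time, label = event_before(255)
puts "jump at ~255s follows: #{label} (at #{time}s)"
```

With real log timestamps converted to elapsed seconds, the same lookup could annotate every jump on a chart.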
Joe Rafaniello
@jrafanie
Oct 21 2015 17:52
yeah, I think out of box ruby 2.2.3 from most recent 5.5 build is probably not worth investigating other than for marvelling at how much it loves to eat memory
Matthew Draper
@matthewd
Oct 21 2015 17:52
@dmetzger57 so does that mean we started saving at about 300 on the large run?
Dennis Metzger
@dmetzger57
Oct 21 2015 17:54
yes
Matthew Draper
@matthewd
Oct 21 2015 17:57
@dmetzger57 and on large, is 5.4 saving at 500, or 250? I guess either way, it looks like it's the saving that's showing the largest comparative increase (assuming my eyeballing is accurate)
Jason Frey
@Fryguy
Oct 21 2015 18:24
@dmetzger57 On your comparison graphs, can you redraw them such that the graphs overlay? that is, subtract the x-axis-0 value across the board for a particular line
that way we can see if they are really the same with just shifted baselines, or if there are deltas in specific spots
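The overlay amounts to subtracting each line's first reading from all of its points. A minimal sketch with made-up readings:

```ruby
# Shift a series so it starts at zero by subtracting its first value from every point.
def rebase(series)
  return [] if series.empty?

  baseline = series.first
  series.map { |value| value - baseline }
end

# Made-up memory readings (MB) for two versions, sampled at the same elapsed times.
runs = {
  "5.4" => [3000, 3100, 3600, 3700],
  "5.5" => [3300, 3400, 4200, 4300]
}

rebased = runs.transform_values { |series| rebase(series) }
deltas  = rebased["5.5"].zip(rebased["5.4"]).map { |newer, older| newer - older }

rebased.each { |version, series| puts "#{version}: #{series.inspect}" }
puts "5.5 minus 5.4 after rebasing: #{deltas.inspect}"
```

A delta that stays near zero means the lines differ only by a shifted baseline; a delta that steps up at some point marks where the newer version grows more.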
Alex Krzos
@akrzos
Oct 21 2015 19:19
Well, the QE automation can't add providers anymore now that the time profile bar isn't on MIQ master, so my automated benchmarks won't work on master, or on newer 5.5 releases if they're also missing the time bar at the bottom of the web UI
Jason Frey
@Fryguy
Oct 21 2015 19:20
those things don't sound related at all :)
Alex Krzos
@akrzos
Oct 21 2015 19:29
selenium
can I add providers via rest api yet?
and automated configuration of server roles too
Joe Rafaniello
@jrafanie
Oct 21 2015 19:30
not sure about rest-api, i have a rails runner script I use to avoid hitting the ui
e = ManageIQ::Providers::InfraManager.create(:name => "small", :hostname => "10.12.20.65", :type => "ManageIQ::Providers::Vmware::InfraManager", :zone => Zone.default_zone)
e.update_authentication(:default => {:userid => "username", :password => "password"})
e.authentication_check
Alex Krzos
@akrzos
Oct 21 2015 19:31
@jrafanie could you share? If it can configure the providers and then configure server roles that would save a lot of selenium pain in using their automation
nice
Joe Rafaniello
@jrafanie
Oct 21 2015 19:32
make sure you do the authentication_check; otherwise, the ems' creds won't be valid and workers won't start
Alex Krzos
@akrzos
Oct 21 2015 19:33
gotcha
Joe Rafaniello
@jrafanie
Oct 21 2015 19:35
server roles... brute force: cp and modify the vmdb.tmpl.yml to vmdb.yml, change the role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,web_services,automate line to whatever you need, it should be picked up in 1-2 minutes, I think...
still needs to be a better interface but that will work
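A sketch of scripting that brute-force edit, assuming the roles sit on a single `role:` line as described. The surrounding keys in the sample are invented for illustration and are not the real file layout.

```ruby
# Replace the value of the first `role:` line in vmdb.yml-style text.
# Works on plain text so the rest of the file (comments, ordering) is untouched.
def set_server_roles(yaml_text, roles)
  yaml_text.sub(/^([ \t]*role:[ \t]*).*$/) do
    "#{Regexp.last_match(1)}#{roles.join(',')}"
  end
end

# Trimmed, illustrative stand-in for the real file; only the role line matters.
SAMPLE_VMDB_YML = <<~YML
  server:
    role: database_operations,event,reporting,scheduler,smartstate,ems_operations,ems_inventory,user_interface,web_services,automate
    zone: default
YML

puts set_server_roles(SAMPLE_VMDB_YML, %w[database_operations ems_inventory ems_operations])
```

The result would be written out as vmdb.yml, for example with `File.write`, after copying the template.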
Oleg Barenboim
@chessbyte
Oct 21 2015 21:05
so, I have an idea about a possibility to significantly reduce the footprint of VimBroker memory
currently, it uses in-memory cache
Joe Rafaniello
@jrafanie
Oct 21 2015 21:06
ok
Oleg Barenboim
@chessbyte
Oct 21 2015 21:06
what if we added another cache strategy - to cache things in Redis, Mongo, or some other file store
I am pretty sure that the majority of the memory used is sitting in its cache
main broker thread writes to this cache
Joe Rafaniello
@jrafanie
Oct 21 2015 21:07
Yes, for sure... much of it is VimHash's with VimStrings, VimArrays and VimHashes inside it
Oleg Barenboim
@chessbyte
Oct 21 2015 21:07
all the other callers just read from it
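A rough sketch of the pluggable cache strategy, with one writer and many readers behind a common interface. All class names are hypothetical, and PStore (a file-backed store that ships with Ruby) stands in for Redis or Mongo purely so the example runs without extra services.

```ruby
require "pstore"
require "tmpdir"

# Default strategy: keep everything in the process heap.
class MemoryCacheStore
  def initialize
    @data = {}
  end

  def write(key, value)
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end

# Alternative strategy: keep entries in a file via PStore, so cached objects
# are not held in the broker's heap between reads.
class FileCacheStore
  def initialize(path)
    @store = PStore.new(path)
  end

  def write(key, value)
    @store.transaction { @store[key] = value }
  end

  def read(key)
    # `true` opens a read-only transaction.
    @store.transaction(true) { @store[key] }
  end
end

# Hypothetical cache front end; the backend is chosen once at startup.
class InventoryCache
  def initialize(store)
    @store = store
  end

  def update(ref, props)
    @store.write(ref, props)
  end

  def fetch(ref)
    @store.read(ref)
  end
end

CACHE_PATH = File.join(Dir.tmpdir, "broker-cache-#{Process.pid}.pstore")
cache = InventoryCache.new(FileCacheStore.new(CACHE_PATH))
cache.update("vm-101", { "name" => "example-vm", "powerState" => "poweredOn" })
puts cache.fetch("vm-101").inspect
```

Swapping `FileCacheStore` for `MemoryCacheStore` leaves callers unchanged, which would make it cheap to compare memory use between the two strategies.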
Joe Rafaniello
@jrafanie
Oct 21 2015 21:08
and Strings
Jason Frey
@Fryguy
Oct 21 2015 21:08
I agree, but @roliveri has insisted forever that the actual inventory size is not that big
But that doesn't jibe with the enormous worker size
Oleg Barenboim
@chessbyte
Oct 21 2015 21:08
well - that would be a good way to validate that
Jason Frey
@Fryguy
Oct 21 2015 21:08
I'm concerned that if he's correct, it won't do anything at all
so, we need to figure out what's actually taking up the memory...perhaps we can change the broker to just not inventory anything
like, as it's unmarshaling, it's just removing stuff behind it or something
I think that would still let us vet out the claim without implementing an entire backend, possibly for nothing
Jason Frey
@Fryguy
Oct 21 2015 21:17
I still think the idea of an alternate store is a good idea (always have), but I'd like to prove out that it would gain us anything
Oleg Barenboim
@chessbyte
Oct 21 2015 21:21
I just want to throw that idea out there - especially if it is not too complicated to implement
Jason Frey
@Fryguy
Oct 21 2015 21:21
I think it will be :(
Oleg Barenboim
@chessbyte
Oct 21 2015 21:21
if not for this year, then next year
Jason Frey
@Fryguy
Oct 21 2015 21:21
yeah