These are chat archives for ManageIQ/manageiq/performance

6th Jan 2016
Looks like something in 5.5.2.0 introduces more memory growth to the schedule worker FYI ^^
Joe Rafaniello
@jrafanie
Jan 06 2016 14:57
is this with the same user schedules? scheduled reports, widgets, etc.?
Alex Krzos
@akrzos
Jan 06 2016 15:01
It's the same set of scenarios
so should be the default schedules shipping with an appliance
I'll grab the logs for that pid from my 5.5.2.0 appliance to see what shows up at those time frames
Joe Rafaniello
@jrafanie
Jan 06 2016 15:05
is it the same version OS?
Alex Krzos
@akrzos
Jan 06 2016 15:06
should be rhel 7 on both 5.5.0.13 and 5.5.2.0
Joe Rafaniello
@jrafanie
Jan 06 2016 15:06
7.1 on both?
Alex Krzos
@akrzos
Jan 06 2016 15:11
5.5.2.0 is on 7.2
akrzos @akrzos checks 5.5.0.13-2
Joe Rafaniello
@jrafanie
Jan 06 2016 15:13
the only other change that seems even remotely related to ONLY the schedule worker is the switch from ntpdate to chronyd (I don't know that the former was even affecting the schedule worker, although it was the target since time is really important for schedules)
we pulled in a new linux_admin and ovirt gem
a few additional routes in routes.rb
nothing really jumping out at me
@akrzos can you fleece the two and compare the rpm changes?
or use rpm -qa or whatever to see the rpm changes
something else must have changed on the two appliances
Alex Krzos
@akrzos
Jan 06 2016 15:15
I can take a look
5.5.0.13-2 is 7.1
Joe Rafaniello
@jrafanie
Jan 06 2016 15:16
^ Cool, that's an interesting difference
Alex Krzos
@akrzos
Jan 06 2016 15:16
I attached the evm.log grepped on the pid and added the specific timestamps for the rss memory growth as well
Joe Rafaniello
@jrafanie
Jan 06 2016 15:16
any idea how much memory room 5.5.0.13-2 had to grow before it would hit the threshold?
something simple such as a single gem that is globally required and consumes more memory could also be the difference if 5.5.0.13-2 was very close to the threshold
Joe Rafaniello
@jrafanie
Jan 06 2016 15:21
if that's not in a good format, we should fix it so it makes comparing these things easier
Alex Krzos
@akrzos
Jan 06 2016 15:22
@jrafanie Here's the rpms compared
hmm is that build time?
well first boot I assume
Joe Rafaniello
@jrafanie
Jan 06 2016 15:23
at boot, yes
wow, gem list should be bundle list or something
Alex Krzos
@akrzos
Jan 06 2016 15:24
hmm well I wouldn't say it's the best format for that data since it appends to it on each boot
But it gives you a log then of what was installed or updated
Joe Rafaniello
@jrafanie
Jan 06 2016 15:26
well, at least cloud-init wasn't changed ;-)
glibc changed along with a bunch of other things
patch release though
@akrzos it would be useful to know if memory grew across the board for all workers or just schedule worker
Alex Krzos
@akrzos
Jan 06 2016 15:30
Digging into it
ScheduleWorker stuck out as it passed its threshold
Joe Rafaniello
@jrafanie
Jan 06 2016 15:31

hmm well I wouldn't say it's the best format for that data since it appends it each boot

those files were written to make this easier so please do provide suggestions to make it more usable

gem_list ^
truncated down to the latest dump on gems
Jason Frey
@Fryguy
Jan 06 2016 16:00
note that the gemset rpm has a manifest.txt if that helps
Alex Krzos
@akrzos
Jan 06 2016 16:02
MiqScheduleWorkersCompared.png
MiqScheduleWorker Compared^
During same scenario
with same provider
Joe Rafaniello
@jrafanie
Jan 06 2016 16:04
thanks @akrzos, how about generic/reporting workers for comparison?
Alex Krzos
@akrzos
Jan 06 2016 16:04
Checking those out now
Joe Rafaniello
@jrafanie
Jan 06 2016 16:05
looks like just loading the gems in the bundle plus our code adds ~25 MB from 5.5.0.13-2 to 5.5.2.0
Jason Frey
@Fryguy
Jan 06 2016 16:06
are any of the gem changes also removing the :require => false?
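(For context: :require => false tells Bundler to install a gem but not load it during Bundler.require, so it adds no memory until something requires it explicitly. A minimal sketch; the predicate method is hypothetical, just for illustration:)

# Gemfile
gem "ovirt", "~>0.7.1", :require => false  # installed, but not auto-loaded at boot

# application code pays the require cost only when the gem is actually needed
require "ovirt" if rhev_provider_enabled?  # hypothetical predicate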
Joe Rafaniello
@jrafanie
Jan 06 2016 16:06
that would be the added/updated gems, new routes, change in required files
$ git diff 5.5.0.13 5.5.2.0 Gemfile gems/pending/Gemfile
diff --git a/Gemfile b/Gemfile
index 565ee48..4892c1b 100644
--- a/Gemfile
+++ b/Gemfile
@@ -7,6 +7,9 @@ eval_gemfile(File.expand_path("gems/pending/Gemfile", __dir__))
 gem "activerecord-deprecated_finders", "~>1.0.4",  :require => "active_record/deprecated_finders"
 gem "rails",                           "~>4.2.5"

+# Temporarily restrict Sprockets to < 3.0 while we deal with compatibility issues
+gem "sprockets-rails", "< 3.0.0"
+
 # Local gems
 path "gems/" do
   gem "manageiq_foreman", :require => false
diff --git a/gems/pending/Gemfile b/gems/pending/Gemfile
index 6150ea7..5fc8c21 100644
--- a/gems/pending/Gemfile
+++ b/gems/pending/Gemfile
@@ -25,7 +25,7 @@ gem "fog-core",                "!=1.31.1",          :require => false
 gem "httpclient",              "~>2.5.3",           :require => false
 gem "kubeclient",              "=0.8.0",            :require => false
 gem "hawkular-client",         "~>0.1.2",           :require => false
-gem "linux_admin",             "~>0.12.1",          :require => false
+gem "linux_admin",             "~>0.13.0",          :require => false
 gem "log4r",                   "=1.1.8",            :require => false
 gem "memoist",                 "~>0.11.0",          :require => false
 gem "memory_buffer",           ">=0.1.0",           :require => false
@@ -34,7 +34,7 @@ gem "net-sftp",                "~>2.1.2",           :require => false
 gem "net-scp",                 "~>1.2.1",           :require => false
 gem "nokogiri",                "~>1.6.0",           :require => false
 gem "openshift_client",        "=0.2.0",            :require => false
-gem "ovirt",                   "~>0.7.0",           :require => false
+gem "ovirt",                   "~>0.7.1",           :require => false
 gem "parallel",                "~>0.5.21",          :require => false
 gem "pg",                      "~>0.18.2",          :require => false
 gem "psych",                   "~>2.0.12"
Jason Frey
@Fryguy
Jan 06 2016 16:07
wow
Joe Rafaniello
@jrafanie
Jan 06 2016 16:07
so, no
Jason Frey
@Fryguy
Jan 06 2016 16:07
that's weird
Joe Rafaniello
@jrafanie
Jan 06 2016 16:08
Jason Frey
@Fryguy
Jan 06 2016 16:09
yeah I saw that...what brought that in?
@akrzos Do you have the Gemfile.lock?
Alex Krzos
@akrzos
Jan 06 2016 16:10
for which version
5.5.2.0?
Joe Rafaniello
@jrafanie
Jan 06 2016 16:10
yes
Jason Frey
@Fryguy
Jan 06 2016 16:10
yeah
Joe Rafaniello
@jrafanie
Jan 06 2016 16:10
requiring concurrent-ruby increases memory around 3 MB locally
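(A minimal sketch of that kind of local measurement: compare the process RSS before and after a bare require. On Linux, ps reports RSS in KB; the gem name here is just the one under discussion:)

def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i  # resident set size in KB
end

before = rss_kb
require "concurrent"  # swap in "sprockets" etc. to test other gems
puts "require added ~#{((rss_kb - before) / 1024.0).round(1)} MB RSS"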
Jason Frey
@Fryguy
Jan 06 2016 16:11
In my local I see concurrent-ruby is brought in by sprockets but I am at 3.5.2...the Gemfile restricts to < 3.0
wth...why is sprockets 3.5 there?
oh..we restricted sprockets-rails
Joe Rafaniello
@jrafanie
Jan 06 2016 16:12
we only changed sprockets-rails
although the comment is wrong
# Temporarily restrict Sprockets to < 3.0 while we deal with compatibility issues
gem "sprockets-rails", "< 3.0.0"
Jason Frey
@Fryguy
Jan 06 2016 16:12
yeah...that's confusing (I wonder who did that (<_<) (>_>) )
Joe Rafaniello
@jrafanie
Jan 06 2016 16:12
I read the comment and not the gem line
git praise
Chris Arcand
@chrisarcand
Jan 06 2016 16:15
(<_<) (>_>)
Jason Frey
@Fryguy
Jan 06 2016 16:15
oh haha...I thought it was me
chrisarcand @chrisarcand backs away slowly
Chris Arcand
@chrisarcand
Jan 06 2016 16:16
Catching up, what now? sprockets-rails 3 doesn’t line up with the sprockets version?
Probably should have checked that 😬
Joe Rafaniello
@jrafanie
Jan 06 2016 16:17
yeah, is sprockets 3.5 a problem? or was it sprockets-rails itself?
Jason Frey
@Fryguy
Jan 06 2016 16:17
I'm not sure that matters...but sprockets bumped as well
Keenan Brock
@kbrock
Jan 06 2016 16:17
sprockets is the issue
Jason Frey
@Fryguy
Jan 06 2016 16:17
3.4.1 to 3.5.2 (which is what brings in concurrent-ruby)
Keenan Brock
@kbrock
Jan 06 2016 16:18
it is not as lenient about forgetting to precompile files
I'm going through and fixing those bugs
Jason Frey
@Fryguy
Jan 06 2016 16:18
wait, is it broken?
Keenan Brock
@kbrock
Jan 06 2016 16:18
our app? yes
Jason Frey
@Fryguy
Jan 06 2016 16:18
on 5.5.2.0?
Keenan Brock
@kbrock
Jan 06 2016 16:18
sprockets 2 = fail silently
sprockets 3 = blow up
Alex Krzos
@akrzos
Jan 06 2016 16:18
So comparing memory at the appliance level for the same scenario as presented in the bz I opened this morning, I'm seeing ~300MiB more used in 5.5.2.0, but that's looking at the end of the test; need to look at the overall picture
Jason Frey
@Fryguy
Jan 06 2016 16:18
we were never on sprockets 2
Keenan Brock
@kbrock
Jan 06 2016 16:18
ok
Jason Frey
@Fryguy
Jan 06 2016 16:18
the sprockets version is 3.4.1 or 3.5.2
Keenan Brock
@kbrock
Jan 06 2016 16:18
what we have now = silent
Joe Rafaniello
@jrafanie
Jan 06 2016 16:18
FYI, we're looking at increased memory usage on the schedule worker causing it to exceed the threshold, not specifically looking at assets in the UI
Jason Frey
@Fryguy
Jan 06 2016 16:19
let's be very clear ...are you talking sprockets or sprockets-rails
Keenan Brock
@kbrock
Jan 06 2016 16:19
sprockets
changing sprockets-rails causes sprockets to upgrade (I think)
Jason Frey
@Fryguy
Jan 06 2016 16:19
we are locked on sprockets-rails on purpose right now because upgrading breaks stuff
Keenan Brock
@kbrock
Jan 06 2016 16:19
+1
Jason Frey
@Fryguy
Jan 06 2016 16:19
(I think it brings in sprockets 4)
Keenan Brock
@kbrock
Jan 06 2016 16:19
but the stuff it "breaks" is already broken
Jason Frey
@Fryguy
Jan 06 2016 16:19
define broken
Keenan Brock
@kbrock
Jan 06 2016 16:20
404
we link to files that don't exist
Jason Frey
@Fryguy
Jan 06 2016 16:20
does QE know about this?
Keenan Brock
@kbrock
Jan 06 2016 16:20
they should
dunno
Jason Frey
@Fryguy
Jan 06 2016 16:20
i.e. is there a BZ? because I can't imagine that CloudForms 5.5.2 would have shipped
Keenan Brock
@kbrock
Jan 06 2016 16:20
no BZ
there are bugs that are fixed in dev (not production) that are marked "fixed"
but they fail
due to 404
(and it couldn't find the files)
typical developer "works for me" (in dev mode)
so the sprockets issues are real issues in our currently shipped app, it just isn't blowing up
Jason Frey
@Fryguy
Jan 06 2016 16:22
so it's possible QE may not see it?
Keenan Brock
@kbrock
Jan 06 2016 16:22
there are so many bugs...
Jason Frey
@Fryguy
Jan 06 2016 16:22
I'm confused what broken means to you :D Is the app usable, unusable, degraded, is it only in the log?
Keenan Brock
@kbrock
Jan 06 2016 16:22
looking
Jason Frey
@Fryguy
Jan 06 2016 16:22
there are levels of broken...haha
@akrzos Can we get the Gemfile.lock from the 5.5.0.13-2 appliance as well?
Keenan Brock
@kbrock
Jan 06 2016 16:24
@Fryguy ManageIQ/manageiq#5584 is marked as fixed, but @matthewd and I agree there is no way this could be fixed
Jason Frey
@Fryguy
Jan 06 2016 16:24
I'm curious what the dependencies of sprockets are (on that version)
Matthew Draper
@matthewd
Jan 06 2016 16:24
I'm 99% sure @kbrock is talking about a different brokenness than @Fryguy is
Keenan Brock
@kbrock
Jan 06 2016 16:24
+1
I need to get this PR merged - handing the floor over to @matthewd
Joe Rafaniello
@jrafanie
Jan 06 2016 16:25
concurrent-ruby is clearly a main difference
Matthew Draper
@matthewd
Jan 06 2016 16:25
(and a much less interesting one, tbh… some sort of bug in the SPICE viewer thingy)
Jason Frey
@Fryguy
Jan 06 2016 16:25
I'm concerned about shipping a broken product...is the product broken (i.e. will a customer boot it and nothing works and they will lose their :shit:)
Keenan Brock
@kbrock
Jan 06 2016 16:26
nope
it just won't work as expected, and there's a BZ that we said we fixed but we didn't
Jason Frey
@Fryguy
Jan 06 2016 16:26
ok, then to me it's not "broken", just a bug :D
Joe Rafaniello
@jrafanie
Jan 06 2016 16:26
just requiring sprockets 3.4.1 consumes 2-3 MB locally, 3.5.2 consumes 7-9 MB
Keenan Brock
@kbrock
Jan 06 2016 16:27
@matthewd what happens if you reference a stylesheet from a tag and it is trying to look up the sha, but it is not precompiled / not in the metadata - does it blow up?
Jason Frey
@Fryguy
Jan 06 2016 16:27
@jrafanie so strange that would result in a 300MB bump, unless it was just really close to the threshold to begin with
Joe Rafaniello
@jrafanie
Jan 06 2016 16:27
yes
if it's sprockets, this would affect all processes
Jason Frey
@Fryguy
Jan 06 2016 16:27
yea
Joe Rafaniello
@jrafanie
Jan 06 2016 16:28
Matthew Draper
@matthewd
Jan 06 2016 16:28
@kbrock I think it'll still just 404. But either way, you've jumped into the middle of @Fryguy's conversation about a different sprockets-related thing.
Jason Frey
@Fryguy
Jan 06 2016 16:28
autoprefixer-rails also bumped a Y version
as well as bundler
Joe Rafaniello
@jrafanie
Jan 06 2016 16:28
linux_admin/ovirt, and a few others
Jason Frey
@Fryguy
Jan 06 2016 16:28
yeah but we control those
Keenan Brock
@kbrock
Jan 06 2016 16:29
@matthewd +1 / ignore
Jason Frey
@Fryguy
Jan 06 2016 16:29
was just looking at Y level releases...though a Z release could do weird stuff
Keenan Brock
@kbrock
Jan 06 2016 16:29
I hope to remove that pegging of the sprockets version by noon
Jason Frey
@Fryguy
Jan 06 2016 16:30
:+1: sounds good @kbrock ...glad you are figuring that out
Alex Krzos
@akrzos
Jan 06 2016 16:30
Jason Frey
@Fryguy
Jan 06 2016 16:30
@jrafanie What is that crazy bump at the end
I would be ok with the initial memory being higher if the lines caught up to each other, but that huge bump concerns me
Joe Rafaniello
@jrafanie
Jan 06 2016 16:31
we'd need to dig into the logs
it's hard to judge jumps like that since it greatly depends on how much the prior heap growth was
Jason Frey
@Fryguy
Jan 06 2016 16:32
# 5.5.2.0
    sprockets (3.5.2)
      concurrent-ruby (~> 1.0)
      rack (> 1, < 3)

# 5.5.0.13-2
    sprockets (3.4.1)
      rack (> 1, < 3)
just as an FYI
Oleg Barenboim
@chessbyte
Jan 06 2016 16:34
@kbrock @Fryguy - the sprockets discussion does NOT belong in Performance room. If CF is broken, we need a private email thread on that discussion.
Matthew Draper
@matthewd
Jan 06 2016 16:35
@chessbyte @Fryguy's does — it's about how a change of sprockets version has affected memory usage
Jason Frey
@Fryguy
Jan 06 2016 16:35
agreed...was mostly curious how far @kbrock's issue stems...i.e. is it in capablanca, master, etc
Oleg Barenboim
@chessbyte
Jan 06 2016 16:36
thanks for the clarification -- but the discussion about a version being broken (404s) had nothing to do with performance
Keenan Brock
@kbrock
Jan 06 2016 16:36
@akrzos If I recall, putting a basic diff into a gist displays great
@chessbyte +1
I was trying to say I'll get it unpegged soon
mea culpa
Alex Krzos
@akrzos
Jan 06 2016 16:37
So digging through the data the most notable changes in memory usage are: MiqScheduleWorker (Noted in bz already), MiqWebServiceWorker (5.5.2.0 gains about 44MiB in RSS), RefreshCoreWorker (Gains about 24MiB RSS), VIMBroker (Gains ~11MiB RSS), RefreshWorker (Gains ~32MiB RSS), EventCatcher (~26MiB RSS), EventHandler (~28MiB), Metrics Collectors (19-26MiB RSS ea), MetricsProcessors (18-37MiB RSS ea), ReportingWorkers (23-29MiB ea)
You want a diff on the Gemfile.locks?
Matthew Draper
@matthewd
Jan 06 2016 16:38
@Fryguy sprockets depends on concurrent-ruby, but it's only loading concurrent/future (and its transitive deps), so I wouldn't expect that to have much impact :confused:
Alex Krzos
@akrzos
Jan 06 2016 16:39
So with this particular scenario there definitely is a bit of a systemic memory gain across workers, not necessarily just the ScheduleWorker
Jason Frey
@Fryguy
Jan 06 2016 16:39
@matthewd Yeah, @jrafanie only sees about a +5MB or so difference locally
but if it's right on that heap slab boundary, it would jump
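(Rough sketch of the slab/boundary idea: MRI grows its object heap in whole pages by a growth factor - RUBY_GC_HEAP_GROWTH_FACTOR, default 1.8 if I have that right - so a small extra allocation near a boundary can trigger a disproportionate RSS step. The GC.stat keys below are the Ruby 2.2 names:)

s = GC.stat
# crossing a boundary allocates a whole new batch of heap pages at once,
# which shows up as a step in RSS rather than a smooth ramp
puts "allocated pages: #{s[:heap_allocated_pages]}"
puts "live slots: #{s[:heap_live_slots]}, free slots: #{s[:heap_free_slots]}"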
Joe Rafaniello
@jrafanie
Jan 06 2016 16:39
note, that's just requiring it
Jason Frey
@Fryguy
Jan 06 2016 16:39
ah
Joe Rafaniello
@jrafanie
Jan 06 2016 16:40
not doing any rails-y stuff
it's a bit harder because time has passed and the rails-i18n master branch no longer works
Alex Krzos
@akrzos
Jan 06 2016 16:40
Also note that the measurements compared are after 30 minutes of running and perhaps that's not enough burn-in to see if they level off (5.5.0.13-2 vs 5.5.2.0)
but suffice it to say 5.5.2.0 grows memory faster than 5.5.0.13
Jason Frey
@Fryguy
Jan 06 2016 16:41
I think that graph is pretty clear on that :D
Alex Krzos
@akrzos
Jan 06 2016 16:41
Just to be clear, that's just the worker that stuck out
Jason Frey
@Fryguy
Jan 06 2016 16:41
kind of freaks me out that such minor (looking) changes are having a relatively big impact
Oleg Barenboim
@chessbyte
Jan 06 2016 16:42
at the same time, we are blessed that we have QE and Engineers dedicated to performance
Jason Frey
@Fryguy
Jan 06 2016 16:42
yes! :clap:
Alex Krzos
@akrzos
Jan 06 2016 16:42
I think you guys read my mind on what I was about to say
akrzos @akrzos apologizes for starting a performance fire again
Joe Rafaniello
@jrafanie
Jan 06 2016 16:44
dependencies are hard... it's so easy to add them, never to remove them :cry:
Alex Krzos
@akrzos
Jan 06 2016 16:46
I'll dig into my smaller scenarios as perhaps this problem doesn't affect those scenarios as much
the above data was all based on my vmware-large provider (3000 vms)
So I've been able to reproduce this on my smartstate analysis scenario (much smaller providers, <100 VMs)
the MiqScheduleWorker issue
occasionally the ScheduleWorker literally grows to 249.99
and stays there
Alex Krzos
@akrzos
Jan 06 2016 16:51
Example:
Keenan Brock
@kbrock
Jan 06 2016 16:51
many other projects (like resque-scheduler) have a scheduler that is ignorant of what is being scheduled, so they avoid this. wonder if we can do something similar
Alex Krzos
@akrzos
Jan 06 2016 16:51
MiqScheduleWorker-59593.png
@kbrock So they don't grow in memory?
Looking at the log, nothing stuck out as unusual (to me) about what work the ScheduleWorker was queueing
I think a longer term scenario run will show the MiqScheduleWorker recycling periodically
akrzos @akrzos sets up a 2 hour scenario
Joe Rafaniello
@jrafanie
Jan 06 2016 17:02
@akrzos I'm thinking we should take this opportunity to bump the memory thresholds regardless for all workers, they were never changed for the generational GC and some are set to levels that aren't even close to "crazy high memory usage"
Alex Krzos
@akrzos
Jan 06 2016 17:04
@jrafanie agreed with the new generational GC we should bump memory thresholds
Joe Rafaniello
@jrafanie
Jan 06 2016 17:05
I guess my impression of the thresholds was that they'd detect workers that had exceeded what we thought was excessive
some of the values are not excessive for even ruby 2.0
Alex Krzos
@akrzos
Jan 06 2016 17:07
One issue I think could occur though is that we look purely at RSS memory usage to determine if a worker should be recycled
if the appliance is swapping
Joe Rafaniello
@jrafanie
Jan 06 2016 17:07
yeah
Alex Krzos
@akrzos
Jan 06 2016 17:07
you could end up with a worker exceeding that threshold but most of its memory is in swap, and RSS won't show that
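(One way to see both sides on Linux, since RSS only counts resident pages; /proc exposes the swapped-out portion separately:)

status = File.read("/proc/#{Process.pid}/status")
rss_kb  = status[/^VmRSS:\s+(\d+)/, 1].to_i   # resident pages only
swap_kb = status[/^VmSwap:\s+(\d+)/, 1].to_i  # swapped-out pages, invisible to RSS
puts "resident: #{rss_kb / 1024} MiB, swapped: #{swap_kb / 1024} MiB"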
Joe Rafaniello
@jrafanie
Jan 06 2016 17:08
yeah, we need better ways to measure what a process is consuming
all of this will be more apparent when we do forking workers
Alex Krzos
@akrzos
Jan 06 2016 17:09
More shared memory then, correct?
Joe Rafaniello
@jrafanie
Jan 06 2016 17:09
yes, eventually if we preload the app, gems, libraries before we fork
Alex Krzos
@akrzos
Jan 06 2016 17:09
Might want to decide if we should base memory size on USS (Unique Set Size) or PSS (Proportional Set Size)
Joe Rafaniello
@jrafanie
Jan 06 2016 17:10
I think we'll need your help there
I can't even measure it clearly with my test appliances
Alex Krzos
@akrzos
Jan 06 2016 17:10
My latest graphs show all of those, I started using smem for memory measurements
rss, pss, uss, vss, swap
per process
Joe Rafaniello
@jrafanie
Jan 06 2016 17:10
appliance consumed memory is the only measurement that seems reliable
how do you get those values?
Using RSS as the sole measurement becomes a huge problem with shared memory, take postgres for instance
When adding all RSS up on my scale environment, postgres "uses" greater than 76GiB of RSS
The appliance only has 16GiB of memory
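(Back-of-the-envelope illustration of why summing RSS overcounts - the numbers below are made up, not measured: RSS bills every process for the whole shared mapping, while PSS splits shared pages across the processes sharing them:)

shared_mib  = 1024  # e.g. postgres shared buffers, mapped by every backend
private_mib = 50    # private pages per backend
backends    = 72

puts "sum of RSS: #{backends * (shared_mib + private_mib)} MiB"  # wildly inflated
puts "sum of PSS: #{backends * private_mib + shared_mib} MiB"    # close to reality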
Joe Rafaniello
@jrafanie
Jan 06 2016 17:13
let me try my forking workers appliance
I thought I tried smem before though
Alex Krzos
@akrzos
Jan 06 2016 17:13
Actual used appliance memory is ~2.66GiB, with a lot in cache
Joe Rafaniello
@jrafanie
Jan 06 2016 17:14
is smem 1.4-1 new enough?
it seems like the latest
Alex Krzos
@akrzos
Jan 06 2016 17:15
yes
Thats what i have from epel
I do patch their command line display though
it truncates to like 27 characters
I push it to >100 characters to get more "information" on what the process is, without having to match pids to ps output (which would give me that)
Joe Rafaniello
@jrafanie
Jan 06 2016 17:17
here's what it looks like with forking workers...
[root@new-host-3 ~]# smem -P "iq"
  PID User     Command                         Swap      USS      PSS      RSS
 4171 root     python /usr/bin/smem -P iq         0     6808     7119     8284
 3940 root     ManageIQ Worker: MiqEventHa        0   183316   188519   235040
 3968 root     ManageIQ Worker: MiqReporti        0   182852   188610   237300
 3973 root     ManageIQ Worker: MiqReporti        0   182868   188626   237312
 3950 root     ManageIQ Worker: MiqGeneric        0   183956   189231   236452
 3945 root     ManageIQ Worker: MiqGeneric        0   184208   189467   236416
 3982 root     ManageIQ Worker: MiqSchedul        0   185732   191058   238240
 3992 root     ManageIQ Worker: MiqUiWorke        0   185316   191978   240532
 4003 root     ManageIQ Worker: MiqWebServ        0   185292   192123   240608
 3962 root     ManageIQ Worker: MiqPriorit        0   192792   197334   241364
 3956 root     ManageIQ Worker: MiqPriorit        0   193372   197883   241744
 3768 root     /var/www/miq/vmdb/lib/worke        0   191392   198151   246372
no eager load of code, gems, libraries, etc.
last one is the main server process I think
aside: we need to change the setproctitle for the main process
Joe Rafaniello
@jrafanie
Jan 06 2016 17:23
here's without fork, normal spawn
[root@new-host-3 vmdb]# smem -P "iq"
  PID User     Command                         Swap      USS      PSS      RSS
 4693 root     python /usr/bin/smem -P iq         0     6568     6854     8044
 4466 root     MiqReportingWorker id: 97,         0   167012   167546   174424
 4464 root     MiqReportingWorker id: 96,         0   168172   168706   175588
 4444 root     MiqEventHandler id: 91, que        0   169424   169957   176832
 4488 root     MiqWebServiceWorker id: 100        0   181276   181866   188812
 4475 root     MiqUiWorker id: 99, uri: ht        0   181296   181886   188832
 4472 root     MiqScheduleWorker id: 98           0   186160   186707   193604
 4458 root     MiqPriorityWorker id: 95, q        0   205040   205576   212472
 4448 root     MiqGenericWorker id: 92, qu        0   205080   205616   212512
 4451 root     MiqGenericWorker id: 93, qu        0   206464   207000   213896
 4455 root     MiqPriorityWorker id: 94, q        0   207176   207712   214608
 4256 root     /var/www/miq/vmdb/lib/worke        0   239232   239789   246728
Jason Frey
@Fryguy
Jan 06 2016 17:23

aside: we need to change the setproctitle for the main process

I remember that one was a little more complicated than the others so I punted on it at the time

but I agree
Joe Rafaniello
@jrafanie
Jan 06 2016 17:24
So, yeah, it looks like USS/PSS doesn't include the roughly 50 MB of shared memory that RSS shows with fork
Joe Rafaniello
@jrafanie
Jan 06 2016 17:31
ok, @akrzos any suggestions?
  worker_base:
    :defaults:
      :memory_threshold: 200.megabytes
    :ems_refresh_core_worker:
      :memory_threshold: 400.megabytes
    :queue_worker_base:
      :defaults:
        :memory_threshold: 400.megabytes
      :ems_metrics_processor_worker:
        :memory_threshold: 400.megabytes
    :schedule_worker:
      :memory_threshold: 250.megabytes
those look like the ones that are very pessimistic
should we add 100 or 200 to each?
Jason Frey
@Fryguy
Jan 06 2016 17:35
you might want to get some field (or TomH) input
Alex Krzos
@akrzos
Jan 06 2016 17:35
Well, looking through my long-term (8-hour) runs of memory usage per worker
300 seems like a wise base choice
200 works for very few workers on 5.5.0.13-2
also I don't necessarily have a complete set of workers vs providers to base all that data off
so field / QE data would be helpful
Joe Rafaniello
@jrafanie
Jan 06 2016 17:38
yeah, not sure how to get that
I'm fine with changing only the ones we know are wrong for this bz
Alex Krzos
@akrzos
Jan 06 2016 17:43
the refresh core worker I only hit with an xlarge environment
in my tests I have it bump the core worker to 500MiB and it never recycles over 8 hours
Generic/Priority workers start to recycle around large-sized environments over time (2 hours in)
large == 3k vms, 1.5k online
Joe Rafaniello
@jrafanie
Jan 06 2016 17:46
So, maybe changing the 200 and 250 to 300 is enough for now?
Alex Krzos
@akrzos
Jan 06 2016 17:46
I would bump it to 300 IMO, but does any worker inherit that value?
I thought every worker had a value spelled out
Joe Rafaniello
@jrafanie
Jan 06 2016 17:47
not all of them do
it looks like only the replication worker would get that base 200
of course, schedule worker will get its 250 value
Alex Krzos
@akrzos
Jan 06 2016 17:50
For the schedule worker I would recommend we find out what causes what IMO is a severely large increase in memory usage
Joe Rafaniello
@jrafanie
Jan 06 2016 17:50
event handler, generic, priority, reporting, storage metrics collector look to get the 400 MB queue worker setting
Alex Krzos
@akrzos
Jan 06 2016 17:53
So I think the goal should be that small/medium sized environments, as well as two-provider environments (small/medium sized), work out of the box with zero or minimal tuning to the config we ship, do you agree with that?
Because in that case, we only really need to tune the ScheduleWorker based on my data
Joe Rafaniello
@jrafanie
Jan 06 2016 17:53
yeah, @akrzos I'm trying it locally on an upstream appliance to see if it restarts
Alex Krzos
@akrzos
Jan 06 2016 17:54
Could I git pull my master appliance to get the latest on it?
I have everything configured on it for monitoring with collectd to influxdb (dashboards with grafana) and would hate to have to copy those configs over and fight to get that working again
His DOB is 12/8/2015 and doesn't show the ScheduleWorker issue
He shows an RSS ~160MiB
Jason Frey
@Fryguy
Jan 06 2016 17:56
it should be good to pull
Alex Krzos
@akrzos
Jan 06 2016 17:56
also the ScheduleWorker has a lot of threads
not sure if anyone else noticed that
Jason Frey
@Fryguy
Jan 06 2016 17:57
just make sure you db:migrate and also bundle install
more threads than previously?
Alex Krzos
@akrzos
Jan 06 2016 17:57
will it nuke the current db?
not sure what previous is
Jason Frey
@Fryguy
Jan 06 2016 17:57
it will migrate it...shouldn't nuke it
but you can back it up
Joe Rafaniello
@jrafanie
Jan 06 2016 17:59
@akrzos you might want to git pull on the appliance repo too... use the appliance alias to cd to it
Alex Krzos
@akrzos
Jan 06 2016 17:59
got it
Joe Rafaniello
@jrafanie
Jan 06 2016 18:11
running a local appliance off of master, the schedule worker is up to 206 MB after a few minutes
RSS
Alex Krzos
@akrzos
Jan 06 2016 18:17
on my master env now he has grown 4MiB larger than what he was before
according to my tests on 5.5.2.0 it needs about 20 minutes before it hits 250
Joe Rafaniello
@jrafanie
Jan 06 2016 18:17
mine is up to 212 now
Alex Krzos
@akrzos
Jan 06 2016 18:18
there are a few things that are delayed IIRC
from startup, hence the waiting
Joe Rafaniello
@jrafanie
Jan 06 2016 18:18
yeah, some schedules are not every x minutes
Alex Krzos
@akrzos
Jan 06 2016 18:33
hmm, 20 minutes have passed now and it's still at 164MiB
Joe Rafaniello
@jrafanie
Jan 06 2016 18:33
hmm
Alex Krzos
@akrzos
Jan 06 2016 18:34
is the last merged commit on appliance on dec 14th?
I ran git pulls on the appliance directory and vmdb, then did the bundle install on vmdb and a db:migrate in that directory
perhaps I should do another bundle install on the appliance directory?
Jason Frey
@Fryguy
Jan 06 2016 18:40
appliance directory shouldn't matter (bundle-wise)
Alex Krzos
@akrzos
Jan 06 2016 18:45
trying a yum update now
to see if the packages affect the memory growth
Joe Rafaniello
@jrafanie
Jan 06 2016 18:51
yes, what @Fryguy said
Alex Krzos
@akrzos
Jan 06 2016 19:10
So I don't think that updated my gems
according to gem_list.txt
I still have the older bundler
Is there something I should run to update gems?
Joe Rafaniello
@jrafanie
Jan 06 2016 19:12
bundle update in vmdb
Keenan Brock
@kbrock
Jan 06 2016 19:12
@akrzos
also scheduleworker has a lot of threads
not sure if anyone else noticed that
Alex Krzos
@akrzos
Jan 06 2016 19:12
@kbrock 16 threads, most workers have 2 threads
Joe Rafaniello
@jrafanie
Jan 06 2016 19:13
yeah, I"m seeing 19 here
Keenan Brock
@kbrock
Jan 06 2016 19:13
did you notice the count for previous versions?
Joe Rafaniello
@jrafanie
Jan 06 2016 19:13
that should be the same in the older version
Alex Krzos
@akrzos
Jan 06 2016 19:13
Now I'm at 12 threads on my schedule worker
on master
not sure what it was in the past, never tracked it
just from playing with the per-worker metrics tracking with influxdb I can track threads/processes, and noticed that it was more than the others
Joe Rafaniello
@jrafanie
Jan 06 2016 19:15
yeah, rufus-scheduler does job work in threads
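(A minimal sketch of why the thread count climbs, assuming the rufus-scheduler 3.x API; the job bodies are placeholders:)

require "rufus-scheduler"

scheduler = Rufus::Scheduler.new
# each triggered job runs on a worker thread from the scheduler's pool, so a
# process with many schedules carries far more threads than a typical worker
scheduler.every("30s")      { puts "placeholder job" }
scheduler.cron("0 * * * *") { puts "placeholder job" }

sleep 65
puts Thread.list.size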
Jason Frey
@Fryguy
Jan 06 2016 19:22
@akrzos what is gem_list.txt...did you write that out yourself, or are you reading that from somewhere?
just want to make sure it's not some artifact that you aren't updating, but are referring to
Alex Krzos
@akrzos
Jan 06 2016 19:22
read from vmdb/log; it's updated on every evmserverd restart
I think i got it now
Jason Frey
@Fryguy
Jan 06 2016 19:22
ah ok
Alex Krzos
@akrzos
Jan 06 2016 19:22
I ran bundle update
Alex Krzos
@akrzos
Jan 06 2016 19:29
So even with that, bundle install still says Using bundler 1.10.6
Jason Frey
@Fryguy
Jan 06 2016 19:30
oh yeah, you'll have to manually update bundler
gem install bundler
bundler can't update itself via a Gemfile reference
Joe Rafaniello
@jrafanie
Jan 06 2016 19:37
So, unless glibc changed, I can't explain the change in the schedule worker memory that is not common to all workers
Alex Krzos
@akrzos
Jan 06 2016 19:46
OK, upgraded bundler, and the ScheduleWorker is now using more memory
@ 175MiB
It's going up again
@ 189 now
Jason Frey
@Fryguy
Jan 06 2016 19:51
did we bump the ruby version?
we may have gone from 2.2.3 to 2.2.4
Alex Krzos
@akrzos
Jan 06 2016 19:53
going up
@197MiB now
@ ruby 2.2.3p173
but I yum updated
and there was no change
Jason Frey
@Fryguy
Jan 06 2016 19:54
if this is a master nightly, then it wouldn't update with yum
Joe Rafaniello
@jrafanie
Jan 06 2016 19:54
so, maybe bundler?
Jason Frey
@Fryguy
Jan 06 2016 19:54
even so, in your case you wouldn't have updated either
Joe Rafaniello
@jrafanie
Jan 06 2016 19:54
I'll be back, on and off, gotta pick up the kids
Jason Frey
@Fryguy
Jan 06 2016 19:55
I'd be shocked if it were bundler but anything is possible
Alex Krzos
@akrzos
Jan 06 2016 19:59
@ 207MiB now
doubt it's the version of ruby, 5.5.2.0 is ruby 2.2.2p95
Alex Krzos
@akrzos
Jan 06 2016 20:08
hmm, would have expected it to hit ~250ish by now
It's at 208
Alex Krzos
@akrzos
Jan 06 2016 20:19
MiqScheduleWorker-all.png
Two hour capture on vmware-small
might be why my master hasn't bumped itself to >250MiB yet
Alex Krzos
@akrzos
Jan 06 2016 20:34
I'm also tracking that evm_server.rb grew ~30MiB on 5.5.2.0