These are chat archives for ManageIQ/manageiq/performance

10th Nov 2017
Joe Rafaniello
@jrafanie
Nov 10 2017 19:59
@dmetzger57 sorry, I was looking at some travis issues with gaprindashvili... I'm looking at your graphs of the miq server process... are you saying that RSS according to top is growing? If so, can you DM me the location of those logs?
Dennis Metzger
@dmetzger57
Nov 10 2017 20:05
@jrafanie the data for the miq server memory graph is from the log_status messages recorded by the server in the evm.log. I pulled the logs off the QE test appliance
Joe Rafaniello
@jrafanie
Nov 10 2017 20:06
ok
Joe Rafaniello
@jrafanie
Nov 10 2017 20:13
@dmetzger57 do you have any "normal" size appliances that have been running a while? Can you add the nTH column to top by hitting f to see the thread count?
or you can do something like this... ls -1 /proc/28300/task/ |wc -l where 28300 is the pid of the server
I'm seeing 49 threads in that 3 GB server process
file descriptors seems reasonable
# lsof -p 28300 |wc -l
172
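For reference, a minimal Ruby sketch (not from the chat) of the same checks: it counts threads via /proc/<pid>/task and open file descriptors via /proc/<pid>/fd. The PID 28300 is just the example server PID from the commands above.

# Count threads and open file descriptors for a process by reading /proc.
# 28300 is only the example server PID from this conversation.
pid = ARGV.fetch(0, "28300")

threads = Dir.children("/proc/#{pid}/task").size
fds     = Dir.children("/proc/#{pid}/fd").size

# Note: lsof -p also lists memory-mapped files, cwd, txt, etc.,
# so its line count will normally be higher than open_fds.
puts "pid=#{pid} threads=#{threads} open_fds=#{fds}"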
Joe Rafaniello
@jrafanie
Nov 10 2017 20:20
the schedule worker has 52 threads for reference but it doesn't start a DRb server
Dennis Metzger
@dmetzger57
Nov 10 2017 20:22
my only other “long” running appliance (graphed in the last email I sent) has 46 threads. that appliance is handling 10K VMs and has a ~650 MB (PSS) MIQ Server (flat for days) with RSS in the 1.1 GB range, running with Cap & U enabled
Joe Rafaniello
@jrafanie
Nov 10 2017 20:23
looking at smaps....
Dennis Metzger
@dmetzger57
Nov 10 2017 20:23
because finding memory issues is fun …...
Joe Rafaniello
@jrafanie
Nov 10 2017 20:25
$ grep -B 10 -E "[0-9]{5,10}\s*[km]B" /proc/28300/smaps

01128000-c5ac7000 rw-p 00000000 00:00 0                                  [heap]
Size:            3221116 kB
Rss:             3212108 kB
Pss:              521601 kB
Shared_Clean:          0 kB
Shared_Dirty:    2929720 kB
Private_Clean:         0 kB
Private_Dirty:    282388 kB
Referenced:      3203412 kB
Anonymous:       3212108 kB
AnonHugePages:    217088 kB
...
7f99825d0000-7f9988af9000 r--p 00000000 fd:00 228356                     /usr/lib/locale/locale-archive
Size:             103588 kB
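The same totals can be pulled out of smaps programmatically. A small Ruby sketch (an illustration, not something from the chat) that sums the per-mapping Rss, Pss, Shared_Dirty, and Private_Dirty fields for a PID:

# Sum selected per-mapping fields from /proc/<pid>/smaps.
# 28300 is the example server PID used in this conversation.
pid    = ARGV.fetch(0, "28300")
fields = %w[Rss Pss Shared_Dirty Private_Dirty]
totals = Hash.new(0)

File.foreach("/proc/#{pid}/smaps") do |line|
  next unless (m = line.match(/\A(\w+):\s+(\d+) kB/))
  totals[m[1]] += m[2].to_i if fields.include?(m[1])
end

totals.each { |field, kb| puts format("%-14s %12d kB", field, kb) }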
Joe Rafaniello
@jrafanie
Nov 10 2017 20:30
Everything seems fine from a PSS perspective
# smem -P MIQ
  PID User     Command                         Swap      USS      PSS      RSS
33320 root     python /usr/bin/smem -P MIQ        0     6304     6646     8484
28797 root     puma 3.7.1 (tcp://127.0.0.1        0   232840   237789   312016
28751 root     MIQ: MiqEventHandler id: 6,        0   236652   241272   313912
28788 root     puma 3.7.1 (tcp://127.0.0.1        0   244544   248953   318656
28713 root     MIQ: MiqScheduleWorker id:         0   262436   267285   340272
62832 root     MIQ: Openstack::NetworkMana        0   301084   313065   422148
62814 root     MIQ: Openstack::CloudManage        0   301400   313363   422228
28769 root     MIQ: MiqReportingWorker id:        0   315628   318864   377528
28761 root     MIQ: MiqReportingWorker id:        0   320524   323759   382428
28695 root     MIQ: MiqPriorityWorker id:         0   370904   374235   433248
28703 root     MIQ: MiqPriorityWorker id:         0   372412   375743   434828
28779 root     puma 3.7.1 (tcp://127.0.0.1        0   397800   400571   448860
33221 root     MIQ: MiqGenericWorker id: 8        0   280004   601553  3217024
33215 root     MIQ: MiqGenericWorker id: 8        0   280396   601841  3217064
33270 root     MIQ: Openstack::CloudManage        0   297076   610216  3217536
28300 root     MIQ Server                         0   298748   612394  3222912
27653 root     MIQ: MiqEmsMetricsProcessor        0   286072   634586  3213056
27645 root     MIQ: MiqEmsMetricsProcessor        0   286072   634590  3213068
19374 root     MIQ: StorageManager::SwiftM        0   440360   718650  3217504
19369 root     MIQ: StorageManager::Cinder        0   538500   805395  3229976
19203 root     MIQ: Openstack::NetworkMana        0   569312   833212  3237820
19113 root     MIQ: Openstack::CloudManage        0   652288   885186  3235832
Dennis Metzger
@dmetzger57
Nov 10 2017 20:35
agreed, PSS is reasonable. the RSS is “surprising” for the MIQ Server
Joe Rafaniello
@jrafanie
Nov 10 2017 20:35
3.2 GB / 24 forked processes = 133 MB per process.
Joe Rafaniello
@jrafanie
Nov 10 2017 20:41
If workers are cycled over time, there might not be many workers started at roughly the same time, so each worker is basically only sharing with the server process (until they hit a CoW fault). So the memory the server shares with each worker effectively gets counted twice: once in the server's RSS and once in that worker's RSS.
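A standalone Ruby sketch of the accounting being described here (an illustration, not ManageIQ code): right after a fork the child's pages are still shared with the parent, so the same memory is counted in full in both RSS numbers while PSS splits it between them; once the child writes to the data, CoW faults turn those pages private and its PSS climbs toward its RSS.

# Illustrate copy-on-write accounting with RSS vs PSS, by summing /proc/<pid>/smaps.
def mem_kb(pid, field)
  File.foreach("/proc/#{pid}/smaps").sum do |line|
    line =~ /\A#{field}:\s+(\d+) kB/ ? $1.to_i : 0
  end
end

def report(label)
  pid = Process.pid
  puts format("%-22s pid=%-6d Rss=%9d kB  Pss=%9d kB",
              label, pid, mem_kb(pid, "Rss"), mem_kb(pid, "Pss"))
end

payload = "x" * (200 * 1024 * 1024)   # ~200 MB allocated in the parent before forking

child = fork do
  report("child after fork")      # pages still shared: Pss is roughly half of Rss
  sleep 1                         # let the parent report while the pages are shared
  payload.upcase!                 # writing forces CoW copies, making the pages private
  report("child after writes")    # Pss now approaches Rss
end

sleep 0.2
report("parent while sharing")    # parent Rss counts the shared pages in full, Pss ~half
Process.wait(child)
report("parent after child")      # sharing gone; the parent is charged its copy in full again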
Dennis Metzger
@dmetzger57
Nov 10 2017 20:45
so you’re thinking the server presents a high RSS because workers are being cycled, so no harm / no foul
Joe Rafaniello
@jrafanie
Nov 10 2017 20:47
I wouldn't say no harm
Dennis Metzger
@dmetzger57
Nov 10 2017 20:47
which would lead us back to why the workers are ballooning past their threshold
@NickLaMuro let out a sigh
Joe Rafaniello
@jrafanie
Nov 10 2017 20:48
We've already seen that workers forked from big (RSS) server processes tend to have larger PSS
in the snippet above, all of the 3.2 GB RSS workers have more than 600 MB PSS
Dennis Metzger
@dmetzger57
Nov 10 2017 20:51
so is the trigger / sequence: a worker (or workers) exceeding its threshold and being restarted a few times, causing the server to present a large RSS, which then leads to larger PSS for all new workers, eventually leading to disaster?
Joe Rafaniello
@jrafanie
Nov 10 2017 20:52
yeah, that's what i'm telling myself
Dennis Metzger
@dmetzger57
Nov 10 2017 20:53
plausible
Joe Rafaniello
@jrafanie
Nov 10 2017 20:54
So, I had tested cycling workers previously, but it was very consistent: start the worker, give it just enough memory threshold to do one or two things before it exceeds the threshold and restarts
Maybe a more random cycling of workers would expose this better
@Fryguy suggested randomly throwing Kernel.exit on the queue
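A generic sketch of that idea (purely illustrative; this is not the MiqQueue API or real ManageIQ worker code): a worker loop that exits at random with a small probability, so the server has to respawn workers at unpredictable times.

# Hypothetical worker loop that randomly exits to simulate erratic worker cycling.
# Not ManageIQ code, just the shape of the "randomly throw Kernel.exit" idea.
EXIT_PROBABILITY = 0.05

loop do
  # ... dequeue and process one piece of work here ...

  if rand < EXIT_PROBABILITY
    puts "pid #{Process.pid} exiting to simulate being cycled"
    Kernel.exit 1   # the monitoring (server) process would then start a replacement
  end

  sleep 1
end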
Nick LaMuro
@NickLaMuro
Nov 10 2017 20:55
heh
Joe Rafaniello
@jrafanie
Nov 10 2017 20:55
Personally, I would use Process.kill(0)
Nick LaMuro
@NickLaMuro
Nov 10 2017 20:55
:expressionless:
Dennis Metzger
@dmetzger57
Nov 10 2017 20:57
we’re heading towards the rule “don't undersize your workers and all will be fine”
Dennis Metzger
@dmetzger57
Nov 10 2017 21:04
I’ll try to perturb my current long-running (stable) appliance and see if I can induce the memory issue as we are thinking