These are chat archives for ManageIQ/manageiq/performance

15th Nov 2017
Keenan Brock
@kbrock
Nov 15 2017 13:11
Is that the VC sim we've been pounding? :(
Dennis Metzger
@dmetzger57
Nov 15 2017 13:13
Yep, it stopped accepting its credentials (including not letting me ssh or log into the console)
Adam Grare
@agrare
Nov 15 2017 13:25
@dmetzger57 was your sim on dev-vc65?
smitty just cleaned up a lot of those
Dennis Metzger
@dmetzger57
Nov 15 2017 13:31
no, dev-vc60. I’ve seen VCs get their authentication corrupted, and actually jumped through many hoops to get one back; not worth the hours of google/trial/error for this sim :smile:
Adam Grare
@agrare
Nov 15 2017 13:33
:+1: just making sure (since I told him no one else was using those haha)
Joe Rafaniello
@jrafanie
Nov 15 2017 14:56
@dmetzger57 from your appliance logs that are growing... do you still have access to that appliance?
I'm wondering how many threads and open file descriptors there are in the 13314 process
Dennis Metzger
@dmetzger57
Nov 15 2017 14:59
sadly no, that env is gone as a result of the simulator failure. I’ve got two new appliances (one managing 1,000 VMs and one managing 3,000 VMs); I’ll be watching these today
Joe Rafaniello
@jrafanie
Nov 15 2017 14:59
ls -1 /proc/13314/task | wc -l and lsof -p 13314 | wc -l
doh, oh well
So, from the logs, the server started only 71 workers in the span of nearly 4 days... which I think is really not that much
Joe Rafaniello
@jrafanie
Nov 15 2017 15:06
@dmetzger57 the only thing consistent throughout the log, which seems to coincide with the steady but slow growth of the process, is the fetching of queue messages:
[----] I, [2017-11-10T21:15:31.743868 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 5 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:15:31.754738 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 11 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
[----] I, [2017-11-10T21:15:36.764437 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 4 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:15:36.776128 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 8 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
[----] I, [2017-11-10T21:15:44.310287 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 10 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:15:44.319302 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 61 miq_queue rows for queue_name=vmware, wcount=4, priority=200
[----] I, [2017-11-10T21:15:44.322837 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 12 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
[----] I, [2017-11-10T21:15:44.323027 #13314:505140]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.004149198532104492, :worker_monitor=>2.5261240005493164, :worker_dequeue=>0.015494346618652344, :total_time=>2.546069383621216}
[----] I, [2017-11-10T21:15:49.330720 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 4 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:15:49.340125 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 7 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
[----] I, [2017-11-10T21:15:54.346457 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 7 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:15:54.356502 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 7 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
[----] I, [2017-11-10T21:16:01.683182 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 8 miq_queue rows for queue_name=generic, wcount=8, priority=200
[----] I, [2017-11-10T21:16:01.693168 #13314:505140]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 11 miq_queue rows for queue_name=ems_metrics_processor, wcount=4, priority=200
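If anyone wants to sanity-check that, something like this rough Ruby sketch (assuming the evm.log format above; not code we ship anywhere) tallies the fetched rows per queue:
# Rough one-off: sum the "Fetched N miq_queue rows for queue_name=..." lines
# to see how steady the dequeue traffic is per queue.
totals = Hash.new(0)
polls  = Hash.new(0)

File.foreach("evm.log") do |line|
  next unless line =~ /Fetched (\d+) miq_queue rows for queue_name=(\w+)/
  totals[$2] += $1.to_i
  polls[$2]  += 1
end

totals.each { |queue, rows| puts "#{queue}: #{rows} rows over #{polls[queue]} fetches" }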
Keenan Brock
@kbrock
Nov 15 2017 15:08
@jrafanie to be fair - it could be the queue messages or it could be the steady flow of work?
Joe Rafaniello
@jrafanie
Nov 15 2017 15:09
Yes, either
that's the only consistent thing going on
Dennis Metzger
@dmetzger57
Nov 15 2017 15:09
eventually there will be one suspect :smile:
Joe Rafaniello
@jrafanie
Nov 15 2017 15:14
the only other thing that seems to stand out is that worker_monitor timings are often > 1 second
[----] I, [2017-11-14T12:01:01.027448 #13314:505140]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.004273891448974609, :server_monitor=>0.26223230361938477, :worker_monitor=>4.407755136489868, :worker_dequeue=>0.012911558151245117, :total_time=>4.687545537948608}
how many times worker_monitor took more than x seconds:
> 1 second: 18383 times
> 2 seconds: 13952 times
> 3 seconds: 6398 times
> 4 seconds: 2779 times
> 5 seconds: 1539 times
> 6 seconds: 655 times
> 7 seconds: 214 times
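Something along these lines (a throwaway sketch, assuming the same Timings log format shown above) would reproduce those buckets:
# Rough one-off: count how often :worker_monitor exceeded each threshold
# in the "Server Monitoring Complete - Timings" lines.
thresholds = (1..7).to_a
over = Hash.new(0)

File.foreach("evm.log") do |line|
  next unless line =~ /Server Monitoring Complete - Timings:.*:worker_monitor=>([\d.]+)/
  secs = $1.to_f
  thresholds.each { |t| over[t] += 1 if secs > t }
end

thresholds.each { |t| puts "> #{t} second#{'s' if t > 1}: #{over[t]} times" }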
Dennis Metzger
@dmetzger57
Nov 15 2017 15:20
here’s the MiqServer RSS on the new appliance managing 1,000 VMs; it’s been running about 9 hours (@jrafanie I’ll pm you the address)
miq_server_rss.pdf
Joe Rafaniello
@jrafanie
Nov 15 2017 19:19
@gtanzillo @dmetzger57 here are the PRs to hack uss into the pss column for the various branches:
ManageIQ/manageiq-gems-pending#312 (master)
ManageIQ/manageiq-gems-pending#314 (gaprindashvili)
ManageIQ/manageiq-gems-pending#313 (fine)
ManageIQ/manageiq#16480 (euwe)
Joe Rafaniello
@jrafanie
Nov 15 2017 19:26
@kbrock @NickLaMuro, in case you are curious and weren't following the discussion... PSS turns out to be a very poor thing to check in workers when the server gets to be large in RSS... each new worker instantly gets a large PSS since the server RSS is immediately shared with however many new processes are forked. PSS is better than RSS, but when the amount of shared memory is huge it becomes a very bad way to check for memory-leaking workers. USS is a better way to detect and kill workers that are growing beyond reasonable thresholds, since memory leaks in workers will nearly always be in their private pages, not in pages shared with other processes.
The PRs above are still up for debate; I don't like hacking uss into the pss column, but it works and is easy to backport. We can rename the column later if that's the direction we want to go (I think it is).
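For anyone who wants to see the difference concretely, here's a minimal sketch of reading both numbers from Linux /proc/<pid>/smaps (illustrative only, not the code in the PRs above): USS sums only the private pages, while PSS also picks up a proportional share of everything the worker still shares with the server after fork, which is why a freshly forked worker starts with a big PSS.
# Illustrative only -- not the manageiq-gems-pending implementation.
# USS = private pages; PSS = private pages + proportional share of shared pages.
def uss_and_pss_kb(pid)
  uss = pss = 0
  File.foreach("/proc/#{pid}/smaps") do |line|
    case line
    when /\APss:\s+(\d+) kB/                     then pss += $1.to_i
    when /\APrivate_(?:Clean|Dirty):\s+(\d+) kB/ then uss += $1.to_i
    end
  end
  [uss, pss]
end

uss, pss = uss_and_pss_kb(Process.pid)
puts "USS: #{uss} kB, PSS: #{pss} kB"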
Nick LaMuro
@NickLaMuro
Nov 15 2017 19:32
wow... now I really feel like people forgot I was there at the meeting where you explained all of this...
Joe Rafaniello
@jrafanie
Nov 15 2017 19:34
(that was more for Keenan, but I didn't want to leave out @NickLaMuro and face grumpy Nick face)
@NickLaMuro, how can I forget that you were larger than life?
Nick LaMuro
@NickLaMuro
Nov 15 2017 19:37
...
Keenan Brock
@kbrock
Nov 15 2017 19:55
Thanks for including me in the cc list tho
ugh, we have a pss column in the db
Keenan Brock
@kbrock
Nov 15 2017 20:00
good news is we only have it referenced in a dozen places
Joe Rafaniello
@jrafanie
Nov 15 2017 20:01
the good news is we can shove uss in the pss column using a single character change ;-)
Keenan Brock
@kbrock
Nov 15 2017 20:16
yay!
Nick LaMuro
@NickLaMuro
Nov 15 2017 20:23

ugh, we have a pss column in the db

The alternative being we ssh to each appliance to check the memory info manually from whatever process is doing the monitoring

Keenan Brock
@kbrock
Nov 15 2017 22:54
would be nice if we could just punt on all of this