These are chat archives for ManageIQ/manageiq/performance

19th Dec 2017
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:08

:warning: Large NLN ("Nick LaMuro Novel") Detected! :warning:

Just had a pretty long talk with @kbrock about a bunch of things regarding the memory leaks, but specifically I wanted to put down somewhere why we are pretty sure it is not DRb (or not solely DRb) causing the MiqServer process leak. Mostly so it is written down somewhere, but also in case there are holes in my thinking on this.

So to start off, the "obvious" reason is that the following script doesn't seem to replicate the leak:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_server_leak_discovery/08_drb_heartbeat_loop_simulation_test.rb

The script tries to use as many of the MiqServer mechanisms as possible (DRb, read/write locks, shared hash structure, etc.) to simulate how things are actually done, and also includes frequent GC.start calls to rule out the initial growth of the process being the cause. Now, this script alone would be suspect as proof, since it is a simplified version of things and very prone to "me writing it", so it could be flawed in many ways.
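
(For context, a very rough sketch of the kind of loop that script exercises; the names below are illustrative stand-ins, not the actual script, which lives at the link above:)

require 'drb'
require 'sync'

# Stand-ins for the MiqServer pieces being simulated: a shared hash
# guarded by a read/write lock, exposed to workers over DRb.
class FakeWorkerMonitor
  def initialize
    @lock    = Sync.new
    @workers = {}
  end

  def worker_heartbeat(pid)
    @lock.synchronize(:EX) { @workers[pid] = Time.now.utc }
  end

  def worker_get_messages(pid)
    @lock.synchronize(:SH) { [] }  # the simulation hands back no queue work
  end
end

DRb.start_service("druby://127.0.0.1:0", FakeWorkerMonitor.new)
monitor = DRbObject.new_with_uri(DRb.uri)

loop do
  monitor.worker_heartbeat(Process.pid)
  monitor.worker_get_messages(Process.pid)
  GC.start   # frequent GC.start to separate a real leak from initial heap growth
  sleep 0.1
end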

A week ago, I found that running a MiqServer process with MiqServer#monitor stripped down to only run MiqServer#heartbeat and MiqServer#process_miq_queue (which handle the heartbeating of the server process and the processing of the server process's queue items, respectively) doesn't leak whatsoever. Something of note with this setup (which I will reference later) is that running in this configuration will only spawn a Generic, Priority, and Schedule worker, even if you are configured to have many more (monitor_workers is what eventually kicks up those other workers).
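
(Roughly what that stripped-down loop looked like; illustrative only, the real MiqServer#monitor does considerably more:)

class MiqServer
  def monitor
    heartbeat          # keep the server's own heartbeat timestamp fresh
    process_miq_queue  # work off queue items targeted at the server process
    # monitor_workers  # commented out for this test: no worker management at all
  end
end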

My conclusion then was that it is either something in MiqServer#monitor_workers or the DRb communication causing the issue. I then set up a test with my two test appliances where both had MiqServer#monitor_workers uncommented as well, but one had its roles changed so it would only include the Generic, Priority, and Schedule workers. These are what the graphs for that look like:

(graphs: 20171212_335.png, 20171212_28550.png)
Similar to what was shared here and here, but just after letting it run for a week now.
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:14

Basically, the leak still exists in both cases, but the number of workers affects the rate of the leak.

I had originally concluded that DRb could still be the cause, but what I noticed today is that the appliance with only three workers was still leaking, whereas it wasn't when #monitor_workers was commented out. In all of the runs, the worker processes were still calling the two methods on the DRbObject on a regular basis: heartbeating and fetching queue messages. But when monitor_workers was commented out, the server didn't leak at all, regardless of whether the 3 workers were heartbeating to it or not.

Curious about others' thoughts on this, but I think focusing my time now on what in MiqServer#monitor_workers is leaking is probably the best course of action.

A collection of some of the scripts I have tried to use to determine the cause for the leak can be found here:

https://github.com/NickLaMuro/miq_tools/tree/master/miq_server_leak_discovery

But because I implemented those tests poorly, I think I need to re-run most of them to check I didn't improperly write them off (I have re-run the DRb one, though).

Jason Frey
@Fryguy
Dec 19 2017 19:18
@NickLaMuro You're going to hate me, but are you running your leak test with our monkey patch of DRb?
I'm not sure it's actually loaded in this particular case, but it does exist (was originally for the broker)
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:19
@Fryguy uhh.... maybe? What is our monkey patch of DRb?
Jason Frey
@Fryguy
Dec 19 2017 19:19
getting the link
Adam Grare
@agrare
Dec 19 2017 19:19
"now you tell me" - Nick
Adam Grare
@agrare
Dec 19 2017 19:22
that one is used
Jason Frey
@Fryguy
Dec 19 2017 19:22
oh ok
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:22
well, there is a quick way for me to find out
Jason Frey
@Fryguy
Dec 19 2017 19:22
and then there is a patch to Timeout, but I can't find it :/
Adam Grare
@agrare
Dec 19 2017 19:23
that's what dumps the buffer if there's an error, ManageIQ/vmware_web_service@760374d
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:23
I can just add a print to the top of those two files on the appliance I am testing this with, and run the script
LOL yeah
Jason Frey
@Fryguy
Dec 19 2017 19:24
ah ok, but that would still be in that particular worker
Jason Frey
@Fryguy
Dec 19 2017 19:25
I think @roliveri tried to get those changes back into Ruby core, but they didn't take them
can't remember why
Adam Grare
@agrare
Dec 19 2017 19:25
@NickLaMuro what can possibly go wrong?
Jason Frey
@Fryguy
Dec 19 2017 19:25
it's hairy as hell
but it does tighten the race condition
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:25

@NickLaMuro what can possibly go wrong?

@agrare see graphs above

Adam Grare
@agrare
Dec 19 2017 19:26
assume -snark
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:26
sorry, I was joking as well... should have added #amirite or something I guess...
Adam Grare
@agrare
Dec 19 2017 19:26
haha gotcha
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:27
@jrafanie just assumes everything I say is with snark...
Adam Grare
@agrare
Dec 19 2017 19:27
eww now I wish I never read that timeout patch
that's some gross stuff going on there
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:28
:dragon:
NickLaMuro @NickLaMuro learned today there is a :dragon: emoji...
Adam Grare
@agrare
Dec 19 2017 19:29
that's a weak dragon, I thought it was a snake until I hovered over it
Jason Frey
@Fryguy
Dec 19 2017 19:29
:snake:
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:29
fine... :dragon_face:
Adam Grare
@agrare
Dec 19 2017 19:29
that's better
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:29
there is actually two
Richard Oliveri
@roliveri
Dec 19 2017 19:29
@Fryguy I think those timeout changes were only needed because there was a bug in Linux that caused a problem in Ruby.
Adam Grare
@agrare
Dec 19 2017 19:29
:fire: :fire: :dragon_face:
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:29
TROGDOR!!!
Jason Frey
@Fryguy
Dec 19 2017 19:30
@roliveri Oh really? Maybe we can remove it then?
BURNINATING THE COUNTRYSIIIIDEEE
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:30
well, I will check to see if those are getting loaded or not. That said, how long have those patches been around?
one other point that Dennis has made is that this got worse between 5.7 and 5.8 (assuming I got the right version numbers...)
Adam Grare
@agrare
Dec 19 2017 19:31
the broker_timeout and DMiqVimSync ones have been there for a long time
the other was recent, ManageIQ/vmware_web_service#20
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:32
oh... and it was backported...
hmmm... then that is kind of suspect now
Adam Grare
@agrare
Dec 19 2017 19:32
yeah, this was also recent ManageIQ/vmware_web_service#17
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:33
geez @Fryguy, why didn't you say so sooner!!1!
Jason Frey
@Fryguy
Dec 19 2017 19:33
sorry :(
I honestly just remembered :/
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:33
I know, I kid...
Jason Frey
@Fryguy
Dec 19 2017 19:33
:D
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:34
anyway, I will do some tests against this then, and confirm if it is getting loaded
Adam Grare
@agrare
Dec 19 2017 19:34
those two aren't needed for broker operation so you can revert them to test
Jason Frey
@Fryguy
Dec 19 2017 19:34
:+1:
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:34
possibly try my script again with them loaded to see if they do leak
@agrare good idea
Richard Oliveri
@roliveri
Dec 19 2017 19:41
@Fryguy As far as I can recall, yes. But we should test it to be safe.
Nick LaMuro
@NickLaMuro
Dec 19 2017 19:50
so of note, none of that code seems to have been loaded in the DRb testing script
Jason Frey
@Fryguy
Dec 19 2017 19:50
:+1:
jrafanie @jrafanie reads @NickLaMuro's :book:
Joe Rafaniello
@jrafanie
Dec 19 2017 20:00
@NickLaMuro I like the book. Where can I find the cliff notes version? ;-)
@NickLaMuro to be safe, you can just raise at the top of the monkey patch files... you'll know if they get loaded then ;-)
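
(A minimal sketch of the check being suggested; broker_timeout.rb is used here only as an example filename from the requires further down:)

# At the very top of the monkey-patch file on the appliance:
raise "broker_timeout.rb was loaded!"                  # impossible to miss
# ...or the less disruptive variant:
puts "broker_timeout.rb loaded by PID #{Process.pid}"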
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:11
@jrafanie I was doing a puts...
Joe Rafaniello
@jrafanie
Dec 19 2017 20:12
That works too. :+1: I like to raise so i don't have to look for a message
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:12
well, I am doing it in my DRb script, not the server process
there is like two lines of output
Joe Rafaniello
@jrafanie
Dec 19 2017 20:13
Nice
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:13
so I modified the DRb script to do this at the top:
puts "PID: #{Process.pid}"

require 'config/environment.rb'
require 'drb'
require 'drb/acl'
require '/opt/rh/cfme-gemset/gems/vmware_web_service-0.2.2/lib/VMwareWebService/DMiqVim.rb'
require '/opt/rh/cfme-gemset/gems/vmware_web_service-0.2.2/lib/VMwareWebService/MiqVimDrbDebug.rb'
require '/opt/rh/cfme-gemset/gems/vmware_web_service-0.2.2/lib/VMwareWebService/broker_timeout.rb'

# Server object and DRb::DRbServer setup
Basically, this thing isn't leaking at all, even with the monkey patches
Joe Rafaniello
@jrafanie
Dec 19 2017 20:15
But doesn't it take a few hours to know if it's leaking at least in the appliance setup?
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:16
@jrafanie yeah, but that script is meant to expedite how often the heartbeat is hit
Joe Rafaniello
@jrafanie
Dec 19 2017 20:16
yeah
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:16
and I am looking at it using exact values, versus a graph
$ date; wc -l /proc/15954/smaps; grep -A 10 "\[heap\]" /proc/15954/smaps; echo; grep VmRSS /proc/15954/status | cut -d " " -f3,4
within the 20 minutes of that running, the size hasn't budged at all
I will let it go for a bit while I get some lunch, but thinking we would have seen a slight uptick by now
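
(For reference, a small Ruby polling sketch roughly equivalent to re-running that one-liner on an interval; the PID and interval are just the values from this example, not anything from the actual setup:)

pid = 15954  # the server PID from the one-liner above
loop do
  rss  = File.read("/proc/#{pid}/status")[/^VmRSS:\s+\d+ kB/]
  heap = `grep -A 10 '\\[heap\\]' /proc/#{pid}/smaps`
  puts "#{Time.now.utc}  #{rss}"
  puts heap
  sleep 60   # poll once a minute and watch for the slow drip
end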
Joe Rafaniello
@jrafanie
Dec 19 2017 20:19
yeah, probably but I don't know, this leak is more like a drip
Nick LaMuro
@NickLaMuro
Dec 19 2017 20:20
well, when looking at the top_output, the drip seems to happen about every 10ish minutes, but I guess this script is only using "5 workers", so that might be why it is slow at the moment
again, I will let it run for a few hours and check back
I will also revert that stuff and let the servers run with that code over a longer period of time before I call it a day today
anyway, back in a bit
Keenan Brock
@kbrock
Dec 19 2017 21:22
@NickLaMuro I install git into all my appliances (at least for lib and app) - then I can just git status to know what I've changed
Nick LaMuro
@NickLaMuro
Dec 19 2017 21:24
which is installed as a gem, so that wouldn't have a .git dir in it
also, not sure we ship appliances with the .git folder in the /var/www/miq/vmdb dir
probably not since that is a lot of extra weight to add to a package
Jason Frey
@Fryguy
Dec 19 2017 21:25
upstream appliances we do
Nick LaMuro
@NickLaMuro
Dec 19 2017 21:25
mmk, I stand corrected
Keenan Brock
@kbrock
Dec 19 2017 21:25
nope
I git init in cfme
Nick LaMuro
@NickLaMuro
Dec 19 2017 21:25
cheater
Jason Frey
@Fryguy
Dec 19 2017 21:25
downstream we don't
Keenan Brock
@kbrock
Dec 19 2017 21:26
that way I can git status and see what you did to app/models/miq_server.rb :)
anyway - I have fun doing that. /FWIW