These are chat archives for ManageIQ/manageiq/performance

12th Jan 2018
Keenan Brock
@kbrock
Jan 12 2018 23:13
@NickLaMuro you fill up with food and find the culprit?
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:13
still leaking since when I left it
So I am probably going to call it here for today. I just ate, but I am pretty much out of it at this point. I plan on working on this more this weekend though.
Keenan Brock
@kbrock
Jan 12 2018 23:13
yay
k
think the next steps are pretty straight forward
so excited
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:14
going to let it run with the laptop open to be "super secret ultra double extra sure"
Keenan Brock
@kbrock
Jan 12 2018 23:14
aah
yes, put it under super secret probation
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:14
told LJ to try running the script as well
Keenan Brock
@kbrock
Jan 12 2018 23:14
yay
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:14
just so we have more than one person confirming it
guess I should probably give the short version here...
will do that before I call it quits... hold please while I type a :book:
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:23

So, for those following along at home, the script we used to reproduce the leak can be found here:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_server_leak_discovery/10_miq_process_loop_test.rb

Keenan and I were on a call for most of the day, and happened to stumble across this while debugging against a leaking server using gdb, running some of the following while monitoring the /proc/[SERVER_PID]/smaps file of the server process:
$ gdb -p [SERVER_PID]
(gdb) p rb_eval_string("GC.start")
$1 = 8
(gdb) p rb_eval_string("s = MiqServer.my_server; 500.times { s.sync_workers }; s = nil")
$2 = 8
(gdb) p rb_eval_string("GC.start")
$3 = 8
(gdb) p rb_eval_string("s = MiqServer.my_server; 500.times { s.sync_workers; MiqWorker.status_update_all }; s = nil")
$4 = 8
(gdb) p rb_eval_string("GC.start")
$5 = 8
(gdb) p rb_eval_string("s = MiqServer.my_server; 500.times { MiqWorker.status_update_all }; s = nil")
$6 = 8
(of note, running gdb -p [SERVER_PID] halts execution wherever the process happened to be, so while monitoring it we could see the existing server process stop leaking until we ran the lines above)
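for reference, the monitoring side was basically just tallying up the Rss lines out of smaps every few seconds, something roughly like this (just a sketch I'm typing from memory, not the exact thing we ran):
#!/usr/bin/env ruby
# rough sketch: sum the Rss of every mapping in /proc/[PID]/smaps so you can
# watch a process grow (or stop growing) over time
pid = ARGV[0] || abort("usage: #{$PROGRAM_NAME} SERVER_PID")

loop do
  rss_kb = File.read("/proc/#{pid}/smaps")
               .scan(/^Rss:\s+(\d+) kB$/)
               .map { |(kb)| kb.to_i }
               .inject(0, :+)
  puts "#{Time.now.strftime('%H:%M:%S')}  Rss total: #{rss_kb} kB"
  sleep 5
end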
The first test, s = MiqServer.my_server; 500.times { s.sync_workers }; s = nil, didn't leak at all, which told us that the server leak wasn't in sync_workers (which had been a working theory for the past week)
but the next two tests did, pretty heavily as well
Nick LaMuro
@NickLaMuro
Jan 12 2018 23:28
The script I linked previously is a stripped-down version of that: it gathers all of the same data, but just doesn't run the update_attributes! method for each of the workers.
The crazy part is that we only actually observe a leak with that script when the require "config/environment" line is present; otherwise, it runs without leaking at all
this leads us to believe that the "witches' brew" of gems that we have in ManageIQ is causing the leak
this would also possibly explain why we are seeing a leak in other workers as well; the MiqServer case is just more reproducible because it has less variability to it (a lot of our other workers do work based on jobs in the queue, which can make the leak a bit more sporadic to observe)
we need to zero in a bit more to figure out the exact location, but this is getting us much closer to finding the leak
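to give a rough idea of the shape of that loop test (this is just an illustration, not the actual script linked above, and the appliance path is an assumption on my part):
#!/usr/bin/env ruby
# simplified illustration only: the real per-worker data gathering lives in
# the linked 10_miq_process_loop_test.rb; the point here is just the structure
require "/var/www/miq/vmdb/config/environment"  # <-- comment this out and the leak goes away

def rss_kb
  # read our own resident set size straight from the kernel
  File.read("/proc/self/status")[/^VmRSS:\s+(\d+) kB/, 1].to_i
end

loop do
  # ... gather the per-worker process data here, skipping the
  # update_attributes! call on each worker ...
  GC.start
  puts format("%s  VmRSS: %d kB", Time.now.strftime("%H:%M:%S"), rss_kb)
  sleep 1
end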
Dennis Metzger
@dmetzger57
Jan 12 2018 23:53
Thanks for the short update 😏
And great to see progress