These are chat archives for ManageIQ/manageiq/performance

13th
Dec 2017
Nick LaMuro
@NickLaMuro
Dec 13 2017 01:40

Hours wasted trying to figure out how this line ever gets hit (SPOILER: it doesn't): ~2

https://github.com/ManageIQ/manageiq/blob/470ce7b/app/models/miq_queue_worker_base/runner.rb#L135

For the record, string-based messages that are sent to the workers are actually processed on #heartbeat, and not when processing queue messages (I guess I forgot this fact... or glossed over it):

https://github.com/ManageIQ/manageiq/blob/29d1636/app/models/miq_worker/runner.rb#L373

Regardless, super confusing if starting from the #deliver_message code...
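A loose, self-contained sketch of the pattern being described (made-up names, not the actual MiqWorker::Runner code): string control messages and queue work travel through different paths, and the strings only get drained when the heartbeat fires.

```ruby
# Sketch only -- hypothetical class, not ManageIQ code.
class FakeWorkerRunner
  def initialize
    @control_messages = []   # strings like "exit_requested" or "sync_config"
    @queue            = []   # stand-in for normal MiqQueue work
  end

  def post_control(msg)
    @control_messages << msg
  end

  def enqueue(work)
    @queue << work
  end

  def run_once
    heartbeat                       # control strings are handled here...
    deliver_message(@queue.shift)   # ...never in the delivery path
  end

  private

  def heartbeat
    while (msg = @control_messages.shift)
      puts "heartbeat handled control message: #{msg}"
    end
  end

  def deliver_message(work)
    puts "delivered queue work: #{work.inspect}" if work
  end
end

runner = FakeWorkerRunner.new
runner.post_control("sync_config")
runner.enqueue(:some_job)
runner.run_once
```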

Nick LaMuro
@NickLaMuro
Dec 13 2017 02:08
For the record, I am trying to create a trimmed-down reproduction script that just runs a DRbServer in the main process and then forks some processes that hit the DRb process rapidly, trying to replicate the memory growth. Trying to use as much as I can from the MiqServer code, but keep it runnable without interfering with an existing evmserverd process.
LJ's finding sounds somewhat promising now that we know the change only landed in more recent versions of centos/rhel, which might explain why this only recently started showing up, but I will have to touch base tomorrow to see how we can check on that.
Stepping away for now.
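A minimal sketch of that reproduction idea using only stdlib DRb (no MiqServer code): the parent runs a DRbServer, forked children call into it rapidly, and we compare the parent's RSS before and after. Class and method names here are made up, and the RSS read is Linux-only.

```ruby
require 'drb/drb'

# Hypothetical front object standing in for the server-side worker monitor
class FakeWorkerMonitor
  def worker_heartbeat(pid)
    { :pid => pid, :last_heartbeat => Time.now }
  end
end

DRB_URI = "druby://127.0.0.1:8787"
DRb.start_service(DRB_URI, FakeWorkerMonitor.new)

def parent_rss_kb
  File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
end

puts "parent RSS before: #{parent_rss_kb} kB"

child_pids = 4.times.map do
  fork do
    monitor = DRbObject.new_with_uri(DRB_URI)
    10_000.times { monitor.worker_heartbeat(Process.pid) }
  end
end

child_pids.each { |pid| Process.wait(pid) }
puts "parent RSS after:  #{parent_rss_kb} kB"
DRb.stop_service
```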
Joe Rafaniello
@jrafanie
Dec 13 2017 03:25
I was trying the drb angle too last week and it's worth investigating especially if it adds fuel to the fire that eventually removes drb from the server. It's best to continue to try different angles because this huge pages thing might not be the cause.
Keenan Brock
@kbrock
Dec 13 2017 03:30
good luck. looking good
Nick LaMuro
@NickLaMuro
Dec 13 2017 03:31
@jrafanie yeah, my thoughts exactly.
Joe Rafaniello
@jrafanie
Dec 13 2017 03:34
My gut still tells me that the ruby heap page count not changing for days means it's something low level that's causing the resident memory to grow slowly
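A quick way to check that hunch (a sketch; the RSS read is Linux-only): if GC.stat's heap page count stays flat while VmRSS keeps climbing, the growth is coming from allocations outside ruby's managed object space.

```ruby
def heap_vs_rss
  {
    :heap_allocated_pages => GC.stat[:heap_allocated_pages],
    :heap_live_slots      => GC.stat[:heap_live_slots],
    :rss_kb               => File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
  }
end

# log this periodically, e.g. alongside the heartbeat
puts heap_vs_rss.inspect
```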
Nick LaMuro
@NickLaMuro
Dec 13 2017 04:08
I think we are both on the same page there, just trying to figure out "what" that thing is
the things I think it could be are weird things we do with global vars, weird things we do with threads/DRb, etc. that could be causing helper mallocs outside of ruby's managed object space to grow
if it is related to the anonhugepage thing, it would be tied to our usage of ruby in specific scenarios, methinks, since it doesn't always present itself
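One way to test the anonhugepage angle on a live server/worker pid (a Linux-only sketch): if AnonHugePages stays near zero while RSS keeps growing, transparent huge pages probably aren't the culprit; if it tracks the growth, they probably are.

```ruby
def anon_huge_pages_kb(pid = Process.pid)
  File.readlines("/proc/#{pid}/smaps")
      .grep(/^AnonHugePages:/)
      .map { |line| line[/\d+/].to_i }   # each value is reported in kB
      .inject(0, :+)
end

puts "AnonHugePages for #{Process.pid}: #{anon_huge_pages_kb} kB"
# system-wide THP setting: cat /sys/kernel/mm/transparent_hugepage/enabled
```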
@kbrock @Fryguy or anyone dealing with the ancestry gem might find this interesting
Nick LaMuro
@NickLaMuro
Dec 13 2017 18:17
So prelim data from the tests:
  • MiqServer#monitor_workers uncommented with only Generic, Priority, and Schedule workers running
20171212_335.png
Keenan Brock
@kbrock
Dec 13 2017 18:17
@chrisarcand I totally want to try out ltree and other algorithms, but don't have a good way of loading test data
Nick LaMuro
@NickLaMuro
Dec 13 2017 18:18
  • MiqServer#monitor_workers uncommented with all workers running (events, cap&u, ui, etc., plus the ones from above)
20171212_28550.png

so the number of workers seems to have an effect on the speed at which the leak happens

going to try to spend the rest of the day getting two tests going:

  • try to simulate the DRb workers requesting stuff from the main server
  • simulate the MiqServer#monitor_workers method being looped at a faster rate (rough sketch below)

Hopefully one of those will replicate the leak
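A rough sketch of that second test, run from a Rails console in a ManageIQ appliance environment: loop MiqServer#monitor_workers far faster than the normal monitor frequency and sample RSS, so any per-iteration growth shows up in minutes rather than days. The iteration count and sleep are arbitrary.

```ruby
server = MiqServer.my_server

1_000.times do |i|
  server.monitor_workers
  if (i % 50).zero?
    rss_kb = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1]
    puts "iteration #{i}: RSS #{rss_kb} kB"
  end
  sleep 0.5
end
```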

Keenan Brock
@kbrock
Dec 13 2017 18:20
@chrisarcand wanted a test suite - I was trying to get the taxonomy database loaded so I'd know whether or not I was improving speed for my benchmarks, e.g.: https://spark-in.me/post/birds-voices-taxonomy
Dennis Metzger
@dmetzger57
Dec 13 2017 18:21
@NickLaMuro good datapoint
Keenan Brock
@kbrock
Dec 13 2017 18:22
yay @NickLaMuro very big find
Nick LaMuro
@NickLaMuro
Dec 13 2017 22:21

This is the script I came up with for the DRb message simulation:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_server_leak_discovery/08_drb_heartbeat_loop_simulation_test.rb

Wouldn't mind a couple of eyes on it to see if I got things right, or to catch anything I might be missing.

right now it doesn't seem to be showing a leak locally, but that might be something with my setup
Keenan Brock
@kbrock
Dec 13 2017 22:23
in production, both sides are allocating memory and doing things
maybe receive the message and create a string that is that long?
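One way to act on that suggestion, riffing on the made-up FakeWorkerMonitor from the sketch earlier (still hypothetical, not the actual script): have the DRb side allocate something proportional to each incoming message, so both ends are doing real work.

```ruby
class FakeWorkerMonitor
  def worker_heartbeat(pid, message = "")
    padding = "x" * message.length    # allocate a string as long as the message
    { :pid => pid, :echoed_bytes => padding.bytesize, :at => Time.now }
  end
end
```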
Nick LaMuro
@NickLaMuro
Dec 13 2017 22:24
@kbrock The thing is that only the portions that interact with drb in this case should be affecting the server's memory
hence why I didn't put a large emphasis on the work of the workers
Keenan Brock
@kbrock
Dec 13 2017 22:25
sounds good
Nick LaMuro
@NickLaMuro
Dec 13 2017 22:25
I probably could have left out the part that adds messages to the :messages hash