These are chat archives for ManageIQ/manageiq/performance

4th Dec 2017
Keenan Brock
@kbrock
Dec 04 2017 04:03
<3
Keenan Brock
@kbrock
Dec 04 2017 15:17
@NickLaMuro [review 1] old code change I was thinking about putting into a rails PR - especially after reading rails/rails#26815
@NickLaMuro [review 2] also curious about the best way for us to get the stubs in - are you still good with an active notification route? It may be tricky to collect memory or stack trace info from a notification endpoint
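A minimal sketch of what that could look like, assuming the "notification route" here means ActiveSupport::Notifications (the event name and payload below are illustrative, not from this chat):

require "active_support/notifications"

# Hypothetical event name; whatever the stubs end up instrumenting.
ActiveSupport::Notifications.subscribe("miq.memory_sample") do |name, started, finished, _id, payload|
  # Collecting memory/stack info from inside a subscriber is the tricky part:
  # GC.stat and caller_locations are cheap-ish, full ObjectSpace dumps are not.
  live_slots = GC.stat[:heap_live_slots]
  puts "#{name} (#{finished - started}s) live_slots=#{live_slots} from=#{payload[:caller]&.first}"
end

# Emitting side:
ActiveSupport::Notifications.instrument("miq.memory_sample", :caller => caller_locations(1, 5))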
Jason Frey
@Fryguy
Dec 04 2017 15:46
nice @NickLaMuro :)
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:17

so regarding LJ's comments from Friday: https://gitter.im/ManageIQ/manageiq/performance?at=5a219803ffa3e37919e4378e

I was looking at some of the ObjectSpace memory dumps Friday evening, and comparing several of them over time confirms what LJ was talking about

Joe Rafaniello
@jrafanie
Dec 04 2017 16:18
I'd love more data points to confirm it appears we're growing in the OS heap but not in the ruby heap (live slots)
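A minimal sketch of how one could collect those data points (the sampling loop and field names are assumptions, not a script from this chat):

# Periodically log Ruby live slots next to the OS-level RSS so the two can be
# plotted against each other; RSS growing while live_slots stays flat points
# at growth outside the Ruby heap (C extensions, malloc fragmentation, etc.).
def sample_memory
  rss_kb = File.read("/proc/self/status")[/^VmRSS:\s+(\d+)/, 1].to_i
  { :time => Time.now.utc, :live_slots => GC.stat[:heap_live_slots], :rss_kb => rss_kb }
end

loop do
  puts sample_memory.inspect
  sleep 60
end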
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:18
basically, the memory that is still around either has a relatively low generation count (so whatever was autoloaded in the application, and some of that comes in slowly over time), or is whatever is currently in scope
when looking at two different dumps, you can see this in the generation values (getting a paste together):

Dump 1

# File:  miq_server_1512059342_13164.dump.gz
irb(main):048:0> $heap.by_generation.keys.compact.sort
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
# 22, 23, 24, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
# 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
# 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
# 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
# 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121,
# 123, 124, 125, 126, 127, 128, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139,
# 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
# 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
# 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187,
# 188, 189, 190, 191, 192, 193, 194, 196, 200, 201, 213, 345, 803, 806, 808, 809,
# 810, 812, 814, 870, 1393, 1435, 1436, 1438, 3032, 3664, 3665, 4203, 5253, 5254,
# 5255, 5714, 5915, 5916, 5917, 6193, 6240, 6282, 6283, 6284, 6320, 6329, 6331,
# 6332, 6333, 6334, 6335, 6336, 6337, 6338, 6339, 6340]

Dump 2

# File:  miq_server_1512165033_13164.dump.gz
irb(main):002:0> $heap.by_generation.keys.compact.sort
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
# 22, 23, 24, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
# 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
# 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
# 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
# 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121,
# 123, 124, 125, 126, 127, 128, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139,
# 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
# 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
# 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187,
# 188, 189, 190, 191, 192, 193, 194, 196, 200, 201, 213, 345, 803, 806, 808, 809,
# 810, 812, 814, 870, 1393, 1435, 1436, 1438, 3032, 4203, 9395, 9397, 9759,
# 10665, 10666, 15072, 15514, 15515, 15517, 15530, 15614, 15667, 15679, 15680,
# 15681, 15682, 15683, 15684, 15685, 15686, 15687, 15688, 15689, 15690, 15691,
# 15692, 15693, 15694, 15695, 15696, 15697]
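
A rough sketch of what a by_generation-style helper could look like (the $heap helper itself isn't shown in this chat), assuming the dumps are gzipped ObjectSpace.dump_all output with one JSON object per line and a "generation" field where allocation tracing recorded one:

require "json"
require "zlib"

# Group heap-dump entries by the GC generation they were allocated in.
def by_generation(path)
  Zlib::GzipReader.open(path) do |gz|
    gz.readlines.map { |line| JSON.parse(line) }.group_by { |obj| obj["generation"] }
  end
end

heap = by_generation("miq_server_1512059342_13164.dump.gz")
p heap.keys.compact.sort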
Keenan Brock
@kbrock
Dec 04 2017 16:24

When I see

3032, 3664, 3665, 4203 => 
3032,             4203

I wonder if we are getting fragmentation
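
A quick way to spot those disappearing generations by script rather than by eye, reusing the hypothetical by_generation helper sketched above:

gens_1 = by_generation("miq_server_1512059342_13164.dump.gz").keys.compact
gens_2 = by_generation("miq_server_1512165033_13164.dump.gz").keys.compact

# Generations present in the first dump whose objects are gone in the second
# (e.g. 3664 and 3665 here) - freed slots that could leave holes in the
# OS-level heap pages if surrounding objects are still live.
p gens_1 - gens_2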

Nick LaMuro
@NickLaMuro
Dec 04 2017 16:25
So with that, I think we have 3 possibilities for what the leak might be (though I might be missing something):
  1. Leak in ruby core (I hope not... as this is probably the hardest to fix)
  2. Leak in a c-extension (not sure which one though)
  3. Possibly a leak in DRB (not sure how DRB works under the hood, so this might not show up in memory dumps, etc.)
@kbrock I looked at those specifically, and I think those are references to the DRB threads that get spun up and down as workers are cycled
one of the workers on this simulated appliance (MetricsCollector) gets cycled within a day or two
this is because there is a faulty job in the simulated environment that has a single Storage for all of the 1k VMs, which spikes the memory growth for the process and doesn't let the memory go after that
(of note, the above doesn't seem to show up in actual real-world appliances, since the number of VMs for a single storage is only around 25 or so, not 1K/3K like in our simulations)
Dennis Metzger
@dmetzger57
Dec 04 2017 16:29
So how about a DRb test: modify the MiqServer to just pound DRb with nothing else going on, and see if the leak presents itself under that load?
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:29
  1. Possibly a leak in DRB (not sure how DRB works under the hood, so this might not show up in memory dumps, etc.)
I think what we could try is spinning up an appliance that doesn't have the Metrics worker running, and see if we still see the leak (I forget if we have tried this or not)
((or even capture stuff for the MiqServer in an idle state))
Dennis Metzger
@dmetzger57
Dec 04 2017 16:30
FYI, here's a plot of 5.8 vs 5.9 Metrics Collector memory growth (same 400 VM VMware provider with C&U enabled); you can see the new memory consumption behavior
metrics_collector.pdf
I've not seen the MiqServer grow without a load (it hasn't ballooned up until I've turned on C&U; not sure what other work would trigger it)
when I get a few free minutes later I'll start up an idle test to verify that
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:32
cool, I think it only needs to be up a few hours
I can look into what the Server might be doing, see if I can comment out some of its actions, and see if that makes a difference (and doesn't explode things in the process)
the other thing I was considering doing (but don't know how feasible it is) is attaching valgrind or something similar to the MiqServer process and seeing what comes up. I don't have much experience working with valgrind, so this would be an "adventure" to say the least, and possibly fruitless
Keenan Brock
@kbrock
Dec 04 2017 16:35
@dmetzger57 if you run refresh, does it balloon?
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:36

cool, I think it only needs to be up a few hours

Scratch that, possibly around a day, to start seeing things confidently...

Keenan Brock
@kbrock
Dec 04 2017 16:37
@NickLaMuro @dmetzger57 can you postulate what you see in that pdf graph?
Those downward spikes with relatively uniform timings look like cache clears to me - the C&U timers.
And the shape of the change for the 5.9 series looks like it could be different gems handling the data? (are the counters a different size?)
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:39
I wouldn't look too deep into those, good chance it is a top_output parsing error
Keenan Brock
@kbrock
Dec 04 2017 16:39
thnx
Nick LaMuro
@NickLaMuro
Dec 04 2017 16:39
they are not the easiest to digest into tabular data since we only get full timestamps every hour
Joe Rafaniello
@jrafanie
Dec 04 2017 16:40
I was going to try recreating the growth in RSS for a ruby process with a drb server and 15 drb clients over time
Dennis Metzger
@dmetzger57
Dec 04 2017 16:42
@NickLaMuro that's why I don't even bother trying to map a date/time to each data point - we have all the data points for the time period, it's just that the mapping is hard
@jrafanie that would be a great test
Keenan Brock
@kbrock
Dec 04 2017 16:55
@jrafanie would you do something like this? https://marc.info/?l=ruby-core&m=124394952706835&w=2
Joe Rafaniello
@jrafanie
Dec 04 2017 16:56
yeah, something like that
but with forked children as clients
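A minimal sketch of that kind of reproducer, assuming a trivial DRb front object and forked clients hammering it while RSS is watched from outside (none of this is the actual script being discussed):

require "drb/drb"

# Trivial DRb service the forked clients will call repeatedly.
class Pinger
  def ping(payload)
    payload.bytesize
  end
end

DRb.start_service("druby://127.0.0.1:8787", Pinger.new)
puts "server pid #{Process.pid} - watch it with: ps -o rss= -p #{Process.pid}"

# Fork 15 children, each acting as a DRb client.
pids = 15.times.map do
  fork do
    client = DRb::DRbObject.new_with_uri("druby://127.0.0.1:8787")
    10_000.times { client.ping("x" * 1_024) }
  end
end

pids.each { |pid| Process.wait(pid) }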
Keenan Brock
@kbrock
Dec 04 2017 16:57
aah
children are tricky - and require a lot of resources ;)
Joe Rafaniello
@jrafanie
Dec 04 2017 17:00
Yup, they often take up lots of time, leak, and consume lots of your resources