These are chat archives for ManageIQ/manageiq/performance

10th Jan 2018
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:00

@jrafanie so getting mixed results with the "Openstack-provider-less" tests I have been doing. The 3 appliances I am running seem to be leaking just the same as they did previously:

14 wrks w/ sync_workers

Before

20180108_15845.png
After
20180109_14052.png

3 wrks w/ sync_workers

Before
20180108_25490.png
After
20180109_24488.png

3 wrks (configured for 14) w/o sync_workers

Before
20180108_8392.png
After
20180109_11997.png
Joe Rafaniello
@jrafanie
Jan 10 2018 17:04
Interesting
which one is the "Openstack-provider-less" test? The w/o sync_workers one?
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:05
all of the "after" ones
so "before"/"after" under each heading is the same server, just one with, and one without the Openstack provider
Joe Rafaniello
@jrafanie
Jan 10 2018 17:08
So, that's without the openstack worker class names or something else? You mentioned something about removing the provider gem entirely the other day. I'm just trying to understand what's different.
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:09
It is possible that my method for removing the provider is flawed. Effectively, my method was:
  • Remove manageiq-providers-openstack from the Gemfile
  • Remove all of the Openstack workers from lib/workers/miq_worker_types.rb
  • Comment out a few lines in manageiq/providers/redhat/network_manager.rb and app/models/mixins/authentication_mixin.rb where those constants were referenced directly.
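
For anyone trying to reproduce that, a quick sanity check along these lines can confirm the provider is really gone. This is illustrative only; the MIQ_WORKER_TYPES constant name and running it via bin/rails runner are assumptions, not something from the conversation:

# Rough sanity-check sketch (assumptions: MIQ_WORKER_TYPES is the constant
# defined in lib/workers/miq_worker_types.rb and this runs from the app root,
# e.g. `bin/rails runner check_openstack_removed.rb`).
require Rails.root.join("lib/workers/miq_worker_types").to_s unless defined?(MIQ_WORKER_TYPES)

# Any worker type still mentioning Openstack means miq_worker_types.rb wasn't
# fully cleaned up.
leftovers = MIQ_WORKER_TYPES.select { |name, _| name.to_s.include?("Openstack") }
puts "Openstack worker types still registered: #{leftovers.inspect}" unless leftovers.empty?

# If the gem was really dropped from the Gemfile, this constant should no
# longer resolve.
puts "Openstack provider constant is still loadable!" if defined?(ManageIQ::Providers::Openstack)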
Joe Rafaniello
@jrafanie
Jan 10 2018 17:10
Interesting. That sounds like it should work.
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:12
That said, I ran the script for just looping through the monitor worker with and without the Openstack gem, and got different results. When Openstack was removed from the list, the leak didn't present itself
that said, the leak was also weird, in that it took a couple of hours for it to materialize (not sure if that is due to me closing the laptop lid with the VM running or something)
Joe Rafaniello
@jrafanie
Jan 10 2018 17:14
That stinks
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:14
that said, similar behavior was shown on the MIQ Server process on that VM, where it was effectively running with no leak, then started leaking like a sieve...
(getting a graph... hold please...)
Joe Rafaniello
@jrafanie
Jan 10 2018 17:14
I had confirmed that the ruby heap didn't have anything in it that I didn't expect, just like you were seeing
It certainly still seems to be off the ruby heap in c land
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:15
cool, glad that wasn't something we had differing data on at least
Joe Rafaniello
@jrafanie
Jan 10 2018 17:17
It's still possible there are ruby objects created in the ruby heap that are in the "area" that is leaking off the ruby heap, but it doesn't really make itself obvious
Beni Cherniavsky-Paskin
@cben
Jan 10 2018 17:21
is there some kind of low-overhead "sampling" C heap profiler? something that can say "estimated 30% of your current heap was allocated from this point in code"?
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:22
@cben if you know of one, I am all ears :D
Beni Cherniavsky-Paskin
@cben
Jan 10 2018 17:24
I don't. Hmm, if the eventual size is way bigger than normal, then most of the heap is leak, right? In theory, if you just manually looked at the memory contents of 5-10 random things in the heap, you could get ideas of what it is.
(not that I know how to "manually get pointer to random thing in heap")
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:27

then most of the heap is leak, right?

Yeah, I actually wrote a script to diff /proc/[pid]/smaps every 10 seconds:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_server_leak_discovery/smaps_monitor.rb

Basically everything that is leaking is in the [heap] section

For the vagrant VM that I was mentioning getting a graph for a few minutes ago:
20180106_11007.png
This is a sample of the smaps_monitor.rb monitoring that process right now:
NO Difference detected in /proc/11007/smaps
NO Difference detected in /proc/11007/smaps
SMAPS Difference detected in /proc/11007/smaps
49 => old: {:loc=>"[heap]", :size=>584128, :rss=>568892, :pss=>486304, :uss=>0, :swap=>0}
      new: {:loc=>"[heap]", :size=>584368, :rss=>569116, :pss=>486528, :uss=>0, :swap=>0}
NO Difference detected in /proc/11007/smaps
SMAPS Difference detected in /proc/11007/smaps
49 => old: {:loc=>"[heap]", :size=>584368, :rss=>569116, :pss=>486528, :uss=>0, :swap=>0}
      new: {:loc=>"[heap]", :size=>584588, :rss=>569340, :pss=>486752, :uss=>0, :swap=>0}
NO Difference detected in /proc/11007/smaps
NO Difference detected in /proc/11007/smaps
NO Difference detected in /proc/11007/smaps
SMAPS Difference detected in /proc/11007/smaps
49 => old: {:loc=>"[heap]", :size=>584588, :rss=>569340, :pss=>486752, :uss=>0, :swap=>0}
      new: {:loc=>"[heap]", :size=>585036, :rss=>569784, :pss=>487196, :uss=>0, :swap=>0}
NO Difference detected in /proc/11007/smaps
NO Difference detected in /proc/11007/smaps
NO Difference detected in /proc/11007/smaps
(should probably put in timestamps, but basically "NO" or "SMAPS" shows up every 10 seconds, so that is 2 minutes of line data right there)
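
For context, a stripped-down sketch of what a monitor like that does. This is not the linked smaps_monitor.rb, just the general idea, and the index-based diff assumes the set of mappings stays stable between samples:

#!/usr/bin/env ruby
# Minimal smaps-diffing sketch: every `interval` seconds, re-read
# /proc/PID/smaps, reduce each mapping to the counters we care about (kB),
# and report which mappings changed.

pid      = ARGV.fetch(0) { abort "usage: #{$PROGRAM_NAME} PID [interval]" }
interval = (ARGV[1] || 10).to_i

def read_smaps(pid)
  mappings = []
  File.foreach("/proc/#{pid}/smaps") do |line|
    case line
    when /\A\h+-\h+\s/                        # mapping header, e.g. "00400000-00452000 r-xp ... /usr/bin/ruby"
      loc = line.split(" ", 6)[5].to_s.strip
      mappings << { :loc => loc.empty? ? "[anon]" : loc }
    when /\A(Size|Rss|Pss|Swap):\s+(\d+) kB/  # per-mapping counters
      mappings.last[$1.downcase.to_sym] = $2.to_i if mappings.last
    end
  end
  mappings
end

previous = read_smaps(pid)
loop do
  sleep interval
  current = read_smaps(pid)
  changes = current.each_index.select { |i| current[i] != previous[i] }

  if changes.empty?
    puts "NO Difference detected in /proc/#{pid}/smaps"
  else
    puts "SMAPS Difference detected in /proc/#{pid}/smaps"
    changes.each do |i|
      puts "#{i} => old: #{previous[i].inspect}"
      puts "      new: #{current[i].inspect}"
    end
  end
  previous = current
end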
Nick LaMuro
@NickLaMuro
Jan 10 2018 17:58
Protip... don't do what I did if you don't want to bork your Gemfile.lock and have to rebuild it manually...
Joe Rafaniello
@jrafanie
Jan 10 2018 18:36

Basically everything that is leaking is in the [heap] section

@cben @NickLaMuro keep in mind, there's the OS heap and the ruby heap. smaps is showing the OS-level heap of the ruby process growing, but the heap visible from within ruby doesn't appear to be growing, hence our thoughts that perhaps a c extension is leaking or possibly something in the OS/malloc is growing the process heap outside of ruby
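
A quick way to see that mismatch for yourself, using standard Ruby and Linux facilities (illustrative, not the exact commands used in this debugging): the Ruby-visible heap numbers stay roughly flat while the OS-visible resident set keeps climbing.

# Illustrative check: compare what ruby thinks it's using with what the OS
# sees for the same process (Linux-specific /proc read).
require "objspace"

stat   = GC.stat
rss_kb = File.read("/proc/self/status")[/^VmRSS:\s+(\d+) kB/, 1].to_i

puts "ruby heap pages:       #{stat[:heap_allocated_pages]}"
puts "ruby live slots:       #{stat[:heap_live_slots]}"
puts "ruby object memsize:   #{ObjectSpace.memsize_of_all / 1024} kB"
puts "OS resident set (RSS): #{rss_kb} kB"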

Nick LaMuro
@NickLaMuro
Jan 10 2018 18:55
^ when making my statements above, I assumed that info since it has been mentioned a bunch in the past, but worth repeating
(in case that was directed at me... but I was aware of this probably being the case, since I mentioned yesterday that I thought the ObjectSpace.dumps weren't going to come up with much ;) )
Joe Rafaniello
@jrafanie
Jan 10 2018 19:00
@NickLaMuro No, I just included you since you were in the conversation... I know you are very well aware of these points
Nick LaMuro
@NickLaMuro
Jan 10 2018 19:03
:+1:
Joe Rafaniello
@jrafanie
Jan 10 2018 20:14
@NickLaMuro @dmetzger57 @kbrock as a "workaround" for the slow server leak, @gtanzillo and I worked on the simplest safeguard. https://github.com/ManageIQ/manageiq/compare/master...jrafanie:exit_server_on_large_memory_usage
We rely on systemd restarting the server if the server exits with non-zero exit code
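
Rough shape of the idea (a hedged sketch, not the code in the linked branch; the 2 GB value and the method names here are placeholders): have the server check its own RSS in its monitor loop and exit non-zero so systemd brings it back.

# Hedged sketch only. Threshold and names are illustrative; the real change
# lives in the branch linked above.
MEMORY_THRESHOLD_KB = 2 * 1024 * 1024 # ~2 GB, the strawman value discussed below

def server_rss_kb
  File.read("/proc/self/status")[/^VmRSS:\s+(\d+) kB/, 1].to_i
end

def exit_if_memory_exceeded!
  rss = server_rss_kb
  return if rss <= MEMORY_THRESHOLD_KB

  warn "MIQ Server RSS #{rss} kB exceeded #{MEMORY_THRESHOLD_KB} kB, exiting non-zero so systemd restarts us"
  exit 1 # the non-zero status is what triggers systemd's Restart= handling
end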
Keenan Brock
@kbrock
Jan 10 2018 20:16
def cyanide_pill()
  die if other_user == "James Bond"
end
Joe Rafaniello
@jrafanie
Jan 10 2018 20:16
Clearly, this doesn't fix the problem, and it's not graceful since workers will exit due to no drb server, but it might be worth having just as a safeguard
;-)
Keenan Brock
@kbrock
Jan 10 2018 20:17
yea - we are not letting the user know if we are killing off so many workers :(
Joe Rafaniello
@jrafanie
Jan 10 2018 20:17
I can turn it into a PR if you want to comment on it. I just threw it out there.
Keenan Brock
@kbrock
Jan 10 2018 20:17
aah, this is the server
Thought you and GT already turned it into a PR?
probably misunderstood today's standup
Dennis Metzger
@dmetzger57
Jan 10 2018 20:18
I lean toward restarting if we are say >25% into swap usage
but either way is a safety net
Joe Rafaniello
@jrafanie
Jan 10 2018 20:18
No, we ran it locally on my appliance
Keenan Brock
@kbrock
Jan 10 2018 20:18
I say check the configuration, and if the server is configured to run more workers than the memory on the machine can support, let the user know
Joe Rafaniello
@jrafanie
Jan 10 2018 20:19
@dmetzger57 so, check the server memory only if it's >25% into swap?
Keenan Brock
@kbrock
Jan 10 2018 20:19
@jrafanie no, dm said to make the threshold dynamic - based upon memory + 25% swap
hmm - but that assumes that the server is the culprit for the swap
Joe Rafaniello
@jrafanie
Jan 10 2018 20:20
yeah, that's the next problem
Dennis Metzger
@dmetzger57
Jan 10 2018 20:20
I really don't care if the Miq Server is at 7Gb if there's enough memory for everything to run - i.e. not using swap.
Keenan Brock
@kbrock
Jan 10 2018 20:21
well, if they are running a ton of workers, and they go too high, who is saying that miq server is the one that needs to get killed?
Dennis Metzger
@dmetzger57
Jan 10 2018 20:21
well I care, but let it run :smile:
Joe Rafaniello
@jrafanie
Jan 10 2018 20:22
@dmetzger57 so, swap 25% and check for >2 GB server process?
we have the data
irb(main):002:0> MiqServer.my_server.system_swap_used
=> 0
irb(main):003:0> MiqServer.my_server.system_swap_free
=> 6438252544
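For reference, the ">25% into swap" check falls out of those two values. A sketch only (it assumes used + free is roughly the total swap, and is not code from the branch):

# Illustrative: percent of swap in use, built from the server attributes shown
# above (assumes used + free ≈ total swap on the appliance).
server = MiqServer.my_server
used   = server.system_swap_used.to_f
total  = used + server.system_swap_free.to_f
pct    = total.zero? ? 0.0 : (used / total) * 100

puts format("swap used: %.1f%%", pct)
restart_needed = pct > 25.0 # the kind of condition being proposed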
Keenan Brock
@kbrock
Jan 10 2018 20:22
we have the technology... (sorry - Six Million Dollar Man reference)
Joe Rafaniello
@jrafanie
Jan 10 2018 20:22
that too
Dennis Metzger
@dmetzger57
Jan 10 2018 20:23
my leaning is that as long as we're not using swap, there's no reason to restart anything - from a memory usage perspective. so from that perspective, I think the server size check isn't needed.
Joe Rafaniello
@jrafanie
Jan 10 2018 20:25
Well, if we're at 26% swap and the server process is 350 MB, that seems wrong ;-)
Keenan Brock
@kbrock
Jan 10 2018 20:25
well, I'm just thinking of Smitty's case. The MiqServer case was good, even though swap was at 95%
Dennis Metzger
@dmetzger57
Jan 10 2018 20:25
but still a reason to restart things, something is running way too large or, as @kbrock says, perhaps the appliance just has too many workers configured
if the issue is the latter, we're left with the possibility of a restart loop
Keenan Brock
@kbrock
Jan 10 2018 20:26
"something is running way to large" - true. but does that mean that MiqServer is the issue?
hmm
Dennis Metzger
@dmetzger57
Jan 10 2018 20:27
it doesn't have to be the miq server; if we're getting heavily into swap, we're heading for a non-functional appliance
Keenan Brock
@kbrock
Jan 10 2018 20:27
if we are already in a restart loop, guess it doesn't matter if we add one more to the stack - not too much worse
+1
Joe Rafaniello
@jrafanie
Jan 10 2018 20:29
How would this work if the appliance is misconfigured with too many workers? Won't the server keep restarting even though it's not the problem?
Dennis Metzger
@dmetzger57
Jan 10 2018 20:31
if you try bringing up an appliance that needs more RAM to run its configured workers than it has, you're essentially dead - just give it a little while to get into nothing but swap activity. So yes, that's a possible problem; we have it today, and this would cause constant restarts instead of a non-responsive appliance
ideally the restart logic could detect that we've restarted more than X times in Y minutes and just stop (it would be nice to configure the web server at that point to display a static "You need more memory to run" page in place of the login :smile: )
Keenan Brock
@kbrock
Jan 10 2018 20:33
echo "you are an idiot" > public/index.html
Joe Rafaniello
@jrafanie
Jan 10 2018 20:37
ok, maybe we should discuss this when @gtanzillo is available too. My concern is we already have a different safeguard that is not nearly as aggressive (kill workers at 80% swap, stop starting workers at 60% swap): https://github.com/ManageIQ/manageiq/blob/master/config/settings.yml#L987-L994
Gregg Tanzillo
@gtanzillo
Jan 10 2018 20:47
Perhaps we can set up a discussion about this. My feeling is that we set out to code a simple workaround for the memory leak that we're struggling to fix. It seems that if we have it look at swap usage instead, we'll be trying to solve a different problem.
Not saying that I disagree with that approach. Just want to focus on mitigating the memory leak in the server process.
Joe Rafaniello
@jrafanie
Jan 10 2018 20:50

I mentioned to @gtanzillo that if we do "exit the server when swap > 25%", we can delete the system limits because they'll never be hit, since exiting the server at 25% will gracefully stop or kill the workers:

    :kill_algorithm:
      :name: :used_swap_percent_gt_value
      :value: 80
    :start_algorithm:
      :name: :used_swap_percent_lt_value
      :value: 60

I don't mind deleting the kill/start algorithm but it feels like it's not the same problem we're looking at here

Dennis Metzger
@dmetzger57
Jan 10 2018 20:55
I'm fine with the targeted solution. That does address, in a focused way, the issue of having a leak in the server that we can't find. I brought up the swap perspective as that covers handling when we (the application) are consuming more memory than is available to run, no matter what process is consuming that RAM. Guess I hate imposing artificial limits. Sorry for the distraction.
Joe Rafaniello
@jrafanie
Jan 10 2018 20:57
Yeah, I agree. They're related problems and I like the idea of deleting this code: https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/system_limits.rb
;-)
but yes, for the targeted problem, we can just do the memory usage check. is 2 GB too high/low?
Dennis Metzger
@dmetzger57
Jan 10 2018 20:58
that's the $64,000 question.
Gregg Tanzillo
@gtanzillo
Jan 10 2018 20:58
No problem @dmetzger57, the discussion is important to have.
Dennis Metzger
@dmetzger57
Jan 10 2018 21:01
@jrafanie I can come up with reasons for <2Gb, 2Gb and >2Gb .... so I'm no help
Joe Rafaniello
@jrafanie
Jan 10 2018 21:01
:laughing: @dmetzger57, me too
setting default values for knobs is sometimes harder than naming the knob
Dennis Metzger
@dmetzger57
Jan 10 2018 21:02
I think we know it has to be >1Gb, and much bigger than 2Gb means the appliance is probably getting into memory trouble if it's a "normal" size appliance, so :+1: for 2Gb
Joe Rafaniello
@jrafanie
Jan 10 2018 21:12
ok, cool. I'll fix up some of the code and open a PR later. Which BZ is this for? No need to reply now. I need to run home. I'll be online later tonight.
Beni Cherniavsky-Paskin
@cben
Jan 10 2018 21:33
Not sure it's a useful thought, but "What Would Orchestrator Do?", post-rearch, and should we steer MiqServer towards that?
https://docs.openshift.com/container-platform/3.6/admin_guide/out_of_resource_handling.html
https://docs.openshift.com/container-platform/3.6/admin_guide/overcommit.html
  • Swap is probably disabled. Several docs recommend disabling it, otherwise k8s can't recognize node memory pressure.
  • Each pod has independent memory limits. Above limit = killed & restarted, I think?
  • It's harder to know when "whole system" has memory problems. But (assuming you use similar values for requests & limits), pods will stay unscheduled when there is not enough memory.
@NickLaMuro apparently jemalloc can sample where memory was allocated from.
http://www.be9.io/2015/09/21/memory-leak/ "How I spent two weeks hunting a memory leak in Ruby"
Beni Cherniavsky-Paskin
@cben
Jan 10 2018 21:39
(I suspect you already found all this, just in case)
Beni Cherniavsky-Paskin
@cben
Jan 10 2018 21:47
tcmalloc can profile too, sounds like it does a complete profile(?) https://github.com/gperftools/gperftools/wiki
Dennis Metzger
@dmetzger57
Jan 10 2018 21:53
@cben not allowing VMs (or pods) to swap follows an old line of thinking - that it's the hypervisor's job to handle memory over-commit in the aggregate. I personally follow that line (I run personal VMs with swap disabled): configure the machines / pods with the memory you think they need, and if this puts pressure on the physical resource, let the hypervisor deal with it.
Nick LaMuro
@NickLaMuro
Jan 10 2018 22:30

@cben thanks, some of those are new to me. I ran across the blog post before and read through it; I feel like we might have tried switching some malloc settings to no avail, but it might not be a bad thing to consider again at this point.

I have known about gperftools before, but if it has some heap profiling, that might be nice

Keenan Brock
@kbrock
Jan 10 2018 23:23
@NickLaMuro did you see ManageIQ/manageiq-appliance-build#254 ? just making sure
Nick LaMuro
@NickLaMuro
Jan 10 2018 23:50
@kbrock yeah, we are still using the one in the repo for the application, but for users writing custom automate scripts that rely on rest-client, that will end up getting used
Was referenced here: ManageIQ/manageiq#16581