These are chat archives for ManageIQ/manageiq/performance

11th
Dec 2017
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:39 UTC
So here is the update for over the weekend for the two VMs I have currently:
  • The one in which I have basically turned MiqServer#monitor into a no-op is currently not leaking memory:
20171208_28903.png
  • which narrows down the search to the remaining methods that are not commented out on the other VM, which still is leaking
20171208_20421.png
Of note, I am still doing memory dumps on this appliance every 3 hours or so, and the hiccup near the end was probably related to me getting back and gzipping those dumps just prior to the VM running out of disk space last night (forgot to do it sooner this weekend... sorry)
Keenan Brock
@kbrock
Dec 11 2017 16:42 UTC

The one in which I have basically turned MiqServer#monitor into a no-op is currently not leaking memory:

Did you just say you introduced a leak to monitor, or that there is a leak in the existing monitor?

Dennis Metzger
@dmetzger57
Dec 11 2017 16:42 UTC
@NickLaMuro how many methods are left in the leaking version?
Jason Frey
@Fryguy
Dec 11 2017 16:43 UTC
nice...that seems good as you've bookended the leak to the guts of the monitor loop
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:43 UTC
currently this is what the leaking version looks like:
def monitor
  now = Time.now.utc
  Benchmark.realtime_block(:heartbeat)               { heartbeat }                        if threshold_exceeded?(:heartbeat_frequency, now)
  Benchmark.realtime_block(:server_dequeue)          { process_miq_queue }                if threshold_exceeded?(:server_dequeue_frequency, now)

  # Benchmark.realtime_block(:server_monitor) do
  #   monitor_servers
  #   monitor_server_roles if self.is_master?
  # end if threshold_exceeded?(:server_monitor_frequency, now)

  # Benchmark.realtime_block(:log_active_servers)      { log_active_servers }               if threshold_exceeded?(:server_log_frequency, now)
  Benchmark.realtime_block(:worker_monitor)          { monitor_workers }                  if threshold_exceeded?(:worker_monitor_frequency, now)
  # Benchmark.realtime_block(:worker_dequeue)          { populate_queue_messages }          if threshold_exceeded?(:worker_dequeue_frequency, now)
rescue SystemExit, SignalException
  #...
end
Jason Frey
@Fryguy
Dec 11 2017 16:43 UTC
my :moneybag: is still on DRb
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:43 UTC
@kbrock to answer your question, the other one is basically all of the methods commented out
Jason Frey
@Fryguy
Dec 11 2017 16:43 UTC
cool..keep going! :)
Adam Grare
@agrare
Dec 11 2017 16:44 UTC
so running through valgrind now and it is slooooow
seeding timed out after 20 minutes, trying again with seending disabled
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:44 UTC
@Fryguy I had some preliminary tests just running a simple script that was just doing #heartbeat on a vagrant VM (with periodic GC's every other minute), and it seemed to show some leaks
there was no DRB involved with that
but I wasn't able to run in long enough to have conclusive data
Jason Frey
@Fryguy
Dec 11 2017 16:46 UTC
ah interesting (though I thought heartbeat did use DRb but maybe that's the part the @gtanzillo removed)
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:47 UTC
here is the script, incase you are curious
ENV["BUNDLE_WITHOUT"] = "test:metric_fu:development"
ENV["BUNDLE_GEMFILE"] = "/var/www/miq/vmdb/Gemfile"
ENV["PATH"]           = "#{ENV['PATH']}:/opt/rubies/ruby-2.3.1/bin"

puts "PID: #{Process.pid}"

require 'config/environment.rb'
server = MiqServer.my_server

puts "starting loop..."
loop do
  server.heartbeat
  GC.start && do_gc = false if do_gc  && Time.now.min % 2 == 0
  do_gc = true              if !do_gc && Time.now.min % 2 != 0
  sleep 0.5
end
@Fryguy I think you are thinking of worker heartbeat
Jason Frey
@Fryguy
Dec 11 2017 16:47 UTC
ah yes I am
Nick LaMuro
@NickLaMuro
Dec 11 2017 16:47 UTC
versus server heartbeat
Jason Frey
@Fryguy
Dec 11 2017 16:47 UTC
ahhh
that's very interesting...especially with the GC in there
Joe Rafaniello
@jrafanie
Dec 11 2017 16:56 UTC
Cool, @NickLaMuro
Keenan Brock
@kbrock
Dec 11 2017 17:05 UTC

micro benchmark using ActiveSupport::Notification, starting = Time.now ; ... ; ending = Time.now, and Benchmark.realtime_block [gist]

This has me wanting to remove all of our Benchmark.realtime_block calls and go with notifications instead.
They would be quicker, but more imporant, it would be easier to extract the timings.

Joe Rafaniello
@jrafanie
Dec 11 2017 17:06 UTC
@NickLaMuro what scenario consistently leaks for you? 1 provider with cap and u running? Which provider?
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:07 UTC
1 provider with cap and u running using vmware
@dmetzger57 was the one that set it up, so he might have more details
Joe Rafaniello
@jrafanie
Dec 11 2017 17:07 UTC
ugh, that's what I did on friday and couldn't recreate it
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:07 UTC
pretty sure it has 1k VMs against a VMWare simulator
Joe Rafaniello
@jrafanie
Dec 11 2017 17:07 UTC
upstream?
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:07 UTC
checking
Dennis Metzger
@dmetzger57
Dec 11 2017 17:10 UTC
I see leaks on 5.8 (190) and 5.9 (98) appliances, with no provider added / no C&U … just an idle deployed appliances
these do not have @NickLaMuro diag code installed, running pure appliance to verify leak
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:12 UTC
@jrafanie from what I can tell, the VMs that @dmetzger57 has provided me are upstream, based on the VERSION file (5.9 in the file v.s. 4.6)
Dennis should be able to verify
Dennis Metzger
@dmetzger57
Dec 11 2017 17:13 UTC
@NickLaMuro @jrafanie 5.9.0.9-1
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:13 UTC
upstream, correct?
Joe Rafaniello
@jrafanie
Dec 11 2017 17:14 UTC
no, that'd downstream... I'll deploy that one and try it
@dmetzger57 did you enable any roles for the idle deployed appliances?
Nick LaMuro
@NickLaMuro
Dec 11 2017 17:26 UTC

Added the scripts that I have been using for testing (as well as a hopefully working copy of the smaps differ that I have been using:

NickLaMuro/miq_tools@f9b4a6f

Dennis Metzger
@dmetzger57
Dec 11 2017 17:29 UTC
@jrafanie default roles/config, appliances deployed - DB configured - left sitting idle
Joe Rafaniello
@jrafanie
Dec 11 2017 17:29 UTC
thanks @dmetzger57
Dennis Metzger
@dmetzger57
Dec 11 2017 17:30 UTC
Leak at idle is more pronounced on 5.9
Adam Grare
@agrare
Dec 11 2017 18:14 UTC
so running just @NickLaMuro 's script through valgrind I get 678KiB lost every 5 minutes (that's definitely + indirectly lost)
going to double the duration and see if the bytes lost doubles as well
Nick LaMuro
@NickLaMuro
Dec 11 2017 18:34 UTC
@agrare as a control, did you try running the other scripts to see if it is as bad?
Adam Grare
@agrare
Dec 11 2017 18:35 UTC
@NickLaMuro my baseline was running your script but exiting right before the loop started
Nick LaMuro
@NickLaMuro
Dec 11 2017 18:35 UTC
mmk
@agrare more curious that if we also leak though when just looping without the MIQ stuff in place
just because we saw the ruby -e "puts 'foo'" stuff "leak" pretty heavily as well
Adam Grare
@agrare
Dec 11 2017 18:36 UTC
hm so running for twice as long doesn't leak twice as much stuff in your script
Nick LaMuro
@NickLaMuro
Dec 11 2017 18:37 UTC
what I was afraid of :sob:
Adam Grare
@agrare
Dec 11 2017 18:37 UTC
yeah i wanted to check just the diff to remove the leaks that happen only once
Joe Rafaniello
@jrafanie
Dec 11 2017 18:53 UTC
@dmetzger57 how long before the idle appliance server process starts to "leak"
Dennis Metzger
@dmetzger57
Dec 11 2017 18:58 UTC
it takes a few hours for the noticeabe leak
miq_server_res.png