These are chat archives for ManageIQ/manageiq/performance

9th
Jan 2018
Joe Rafaniello
@jrafanie
Jan 09 2018 01:13
So, @NickLaMuro, whatever my testing is actually showing, clearly, it's the openstack provider workers
I'll let it run for the 60-90 minutes but I all but eliminated the vmware and redhat provider workers from being the reason for the large server growth with my 40.times { sync_workers }
40.times { sync_workers } + normal workers + Openstack, Redhat, Vmware provider workers

17:12:46 321192
17:42:48 332404 (+11212 30 minutes)
18:01:49 338476 (+17284 49 minutes)
Joe Rafaniello
@jrafanie
Jan 09 2018 01:19
40.times { sync_workers } + normal workers + Redhat + Vmware provider workers

18:07:30 315264
18:37:32 318752 (+3488, 30 minutes)
19:07:33 319220 (+3956, 60 minutes)
I'm now, just running openstack and the normal workers
MIQ_WORKER_TYPES = {
  "ManageIQ::Providers::Openstack::CloudManager::EventCatcher"                  => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::CloudManager::MetricsCollectorWorker"        => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::CloudManager::RefreshWorker"                 => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::InfraManager::EventCatcher"                  => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::InfraManager::MetricsCollectorWorker"        => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::InfraManager::RefreshWorker"                 => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::NetworkManager::EventCatcher"                => %i(manageiq_default),
  "ManageIQ::Providers::Openstack::NetworkManager::MetricsCollectorWorker"      => %i(manageiq_default),
  "EmbeddedAnsibleWorker"                                                       => %i(manageiq_default),
  "MiqCockpitWsWorker"                                                          => %i(manageiq_default),
  "MiqEmsMetricsProcessorWorker"                                                => %i(manageiq_default),
  "MiqEmsRefreshCoreWorker"                                                     => %i(manageiq_default),
  "MiqEventHandler"                                                             => %i(manageiq_default),
  "MiqGenericWorker"                                                            => %i(manageiq_default),
  "MiqPriorityWorker"                                                           => %i(manageiq_default),
  "MiqReportingWorker"                                                          => %i(manageiq_default),
  "MiqScheduleWorker"                                                           => %i(manageiq_default),
  "MiqSmartProxyWorker"                                                         => %i(manageiq_default),
  "MiqUiWorker"                                                                 => %i(manageiq_default ui_dependencies),
  "MiqVimBrokerWorker"                                                          => %i(manageiq_default),
  "MiqWebServiceWorker"                                                         => %i(manageiq_default),
  "MiqWebsocketWorker"                                                          => %i(manageiq_default),
}.freeze
After just 11 minutes, it's already up 11-12 MB compared to the start
I'll see if I can leave my laptop running on caffeine so it will run all night
Nick LaMuro
@NickLaMuro
Jan 09 2018 04:04
cool, I have my laptop closed for the day, but I will have status update in the morning for the scenarios I was running earlier
Keenan Brock
@kbrock
Jan 09 2018 04:49
@NickLaMuro tomorrow... could you give me a pointer?
# ruby ~/src/gems/miq_tools/miq_top_output_processor/top_processor.rb --pid 8942 top_output.log
# ls top_outputs
# grep 8942 top_output.log
 8942 root      27   7 1869732 1.436g   3100 S   1.5 12.4   0:00.91 ruby
 8942 root      27   7 1869732 1.436g   3100 S   0.0 12.4   0:00.94 ruby
 8942 root      27   7 1869732 1.436g   3100 S   0.1 12.4   0:00.98 ruby
 8942 root      27   7 1869732 1.436g   3100 S   0.0 12.4   0:01.01 ruby
 ...
 8942 root      27   7 1869732 1.437g   3244 S   0.0 12.4   0:03.53 ruby
KiB Mem : 12137360 total,  6089420 free,  5447772 used,   600168 buff/cache
KiB Mem : 12137360 total,  5894264 free,  5614064 used,   629032 buff/cache
ideas why top_processor.rb would not generate any files in top_outputs?
I'm assuming I'm calling the command incorrectly
Keenan Brock
@kbrock
Jan 09 2018 05:06
@NickLaMuro another question. in gnuplot, is there a way to draw an impulse without specifying a height - just tell it "draw the whole height" Thinking it would look like using 1:100% or something like that.
there seemed to be a number of options for variables and constants, but never figured out how to do the full axis
Keenan Brock
@kbrock
Jan 09 2018 15:29
evm-xx_MetricsCollectorWorker_pid_12250.png
color coding by job type :)

still waiting on the numbers for the jobs that are running with storage commented out, but I'm thinking that @NickLaMuro is right. It is probably a task that is setup for 1 hour after startup.

the fact that Storage#perf_capture_hourlyis run at the beginning that has me thinking it is not it

  • GC (every 15 minutes)
  • scheduler for the collection worker (and generic for some extent)
  • config refresh
Nick LaMuro
@NickLaMuro
Jan 09 2018 15:37

ideas why top_processor.rb would not generate any files in top_outputs?

@kbrock I have had multiple commits to make the regexp for top_processor.rb more flexible. I think in your case, you need to pass the --no-ppid since it doesn't look like there is one in that output

Nick LaMuro
@NickLaMuro
Jan 09 2018 16:04

@jrafanie Ugh... this was what I was afraid of... compare to what I had above after only 6 hours:

https://gitter.im/ManageIQ/manageiq/performance?at=5a53f4025a9ebe4f7561c089

2 different servers (14 and 3 workers respectively):


  def monitor_workers
    sync_workers
    MiqWorker.status_update_all
  end
20180108_15845.png
20180108_25490.png

Other server

# Effective code in MiqServer.monitor_workers
  def monitor_workers
    self.class.my_server_clear_cache
    # sync_workers   # Note: this was configured to run with 14 wrks and this line was triggered to run once, then never again.  Didn't work, so only 3 workers run in this case
    MiqWorker.status_update_all
    do_system_limit_exceeded if self.kill_workers_due_to_resources_exhausted?
  end
20180108_8392.png
Joe Rafaniello
@jrafanie
Jan 09 2018 16:07
Interesting
So, maybe two or more leaks?
Nick LaMuro
@NickLaMuro
Jan 09 2018 16:08
yeah, unsure
I guess it is possible it is MiqWorker.status_update_all
Joe Rafaniello
@jrafanie
Jan 09 2018 16:08
in this second appliance example, I'd be curious if the openstack workers are in the $LOADED_FEATURES
Nick LaMuro
@NickLaMuro
Jan 09 2018 16:28
I remember there being some changes where they switched from curb to straight curl bindings back a few releases... maybe something is mucking stuff up with that?
I am going to try running those three servers as is again, with all of the OpenStack stuff commented out in the worker types. See if it helps. (these appliances are using a VMWare simulator, so that shouldn't have an effect).
though, descendant loader might botch that... might have to comment things out in the Gemfile as well.... we'll see
Chris Arcand
@chrisarcand
Jan 09 2018 16:34
You guys should try Pampers Swaddlers, I've found them to be really good. No leaks.
Nick LaMuro
@NickLaMuro
Jan 09 2018 16:34
(-‸ლ)
Chris Arcand
@chrisarcand
Jan 09 2018 16:35
You're welcome @jrafanie
Joe Rafaniello
@jrafanie
Jan 09 2018 16:36
@chrisarcand I bet it lets you capture dumps really easily too
Nick LaMuro
@NickLaMuro
Jan 09 2018 16:37
lol
Chris Arcand
@chrisarcand
Jan 09 2018 16:43
zing!
Nick LaMuro
@NickLaMuro
Jan 09 2018 17:18

@jrafanie progress!

Was running the following for a while in a vagrant VM, and recently it started leaking like a sieve:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_server_leak_discovery/09_sync_workers_loop_test.rb

Joe Rafaniello
@jrafanie
Jan 09 2018 17:18
cool
I'm tracing allocations in my test setup and looking at the dumps periodically
Nick LaMuro
@NickLaMuro
Jan 09 2018 17:19
I assume this is just the ObjectSpace ones, right?
Joe Rafaniello
@jrafanie
Jan 09 2018 17:23
yes
Nick LaMuro
@NickLaMuro
Jan 09 2018 17:23

...and recently it started leaking like a sieve

Of note, I was also running a similar test that was only constantizing all of the worker types. That has been running longer, and didn't start leaking

Joe Rafaniello
@jrafanie
Jan 09 2018 17:24
+gem 'rbtrace' to gemfile
diff --git a/app/models/miq_worker.rb b/app/models/miq_worker.rb
index be23a3b..19453eb 100644
--- a/app/models/miq_worker.rb
+++ b/app/models/miq_worker.rb
@@ -358,6 +358,9 @@ class MiqWorker < ApplicationRecord
       exit!
     end

+   require 'objspace'
+   ObjectSpace.trace_object_allocations_start
+
     Process.detach(pid)
     pid
   end
diff --git a/config/initializers/active_metrics.rb b/config/initializers/active_metrics.rb
index 14aa45d..a009cdd 100644
--- a/config/initializers/active_metrics.rb
+++ b/config/initializers/active_metrics.rb
@@ -1,2 +1,3 @@
+require 'rbtrace'
 # Note: The legacy Postgresql adapter ignores everything except the adapter name here, just stealing the current ActiveRecord connection.
 ActiveMetrics::Base.establish_connection(:adapter => "miq_postgres", :database => "manageiq_metrics")
then a periodic: rbtrace -p 1684 -e 'Thread.new{GC.start;require "objspace";io=File.open("/var/www/miq_tmp/ruby-heap_server7.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'
Nick LaMuro
@NickLaMuro
Jan 09 2018 17:24
@jrafanie if this is a c-extension issue, then this might be something that you don't see in the ruby heap. Just and FYI.
Previously, when I was logging the more verbose info in the MIQ server process, the ruby object allocations were staying relatively constant, while the RSS continued to climb
Joe Rafaniello
@jrafanie
Jan 09 2018 17:26
yeah, I'm just hoping I can see "other" allocations in the same generation that might lead to something of value
Keenan Brock
@kbrock
Jan 09 2018 18:21
@NickLaMuro thanks for pointers - found out I needed to get the latest code
Nick LaMuro
@NickLaMuro
Jan 09 2018 18:23
oh yeah, sorry... I think I just pushed that up yesterday
NickLaMuro @NickLaMuro was able to figure out what @kbrock was referring to by scrolling up... surprisingly
Nick LaMuro
@NickLaMuro
Jan 09 2018 19:36
anyway, I finally figured out a way to set up my tests where they are running without the Openstack provider code loaded at all
err... well, I hope (it is commented out from the Gemfile, and caused other issues when doing that, so I think we are okay... assuming DescendantLoader is not doing anything I don't expect...)
anyway, will check in with it later to see if it is still leaking. If none of them are, then we might be closer to our smoking gun, but I am cautiously optimistic at the moment.
Dennis Metzger
@dmetzger57
Jan 09 2018 19:43
I read that as "caustically optimistic" ......
Keenan Brock
@kbrock
Jan 09 2018 19:56
Linux question: I've noticed that ps shows a process as MIQ: ... vs top shows it as ruby.
Dennis Metzger
@dmetzger57
Jan 09 2018 20:05
I believe we use setproctitle (or equivalent magic) to change to the name, which ps picks up / shows while top is showing the original process name (ruby).
Nick LaMuro
@NickLaMuro
Jan 09 2018 20:05
Also, just depends on the installation of the appliance and the defaults for top. Sometimes it is shown as ruby, and sometimes it is shown as the setproctitle
hence why I needed things like --no-ppid in the top processor... because it just depended on what was configured and set on the system
Keenan Brock
@kbrock
Jan 09 2018 20:40
And the answer is:
Inside of top, type the letter c and you will see the proctitle values
or run top -c
Nick LaMuro
@NickLaMuro
Jan 09 2018 21:10
M (shift m) to sort by mem as well
Keenan Brock
@kbrock
Jan 09 2018 21:14
nice --> top -c -o %MEM if you want command line to do it all for you
wish top -b -d 60 in our miq_top used the -c
Nick LaMuro
@NickLaMuro
Jan 09 2018 21:14
again, I think that in some systems, it is default, while others it is not
might be something that was made a default in newer versions of linux, because I have seen full commands in the top_output.log
and other places where it is not full commands
Keenan Brock
@kbrock
Jan 09 2018 21:16
aah
Nick LaMuro
@NickLaMuro
Jan 09 2018 21:17
("proctitles", whatever)
Keenan Brock
@kbrock
Jan 09 2018 21:17
in fine-4 cfme it is the short form
NickLaMuro @NickLaMuro relocating now, so AFK