These are chat archives for ManageIQ/manageiq/performance

8th Jan 2018
Nick LaMuro
@NickLaMuro
Jan 08 2018 15:54

Okay, so it looks like changing the settings drastically sped up how quickly the leak presents itself:

Without Settings Change

3 Workers

20180104_10802.png
14 Workers
20180104_20752.png

With 1 second poll time settings

20180105_16500.png
Joe Rafaniello
@jrafanie
Jan 08 2018 15:55
@NickLaMuro how many workers were in the last graph?
Nick LaMuro
@NickLaMuro
Jan 08 2018 15:57

Where the settings were changed:

The third graph is from the appliance that had the changed settings, which was a copy of the appliance with 14 workers enabled

They were all running with MiqServer#sync_needed? commented out in the MiqServer#monitor_workers method
Going to change things up this morning so all of the appliances are running with those settings so we can have some new data in about 12 hours
I will try a different set of changes on the 3rd appliance since we have it available, but keep running the same test on both the 3- and 14-worker appliances (though a different test than the one run over the weekend)
Joe Rafaniello
@jrafanie
Jan 08 2018 16:14
I'm in a meeting but it appears the 40.times { sync_workers } change definitely speeds up the "leak"
that would also cause an increase in the leak rate for your smaller poll test so we might be confirming each other, which is good
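The change being described is roughly the following (a simplified sketch, not the full MiqServer#monitor_workers body): call sync_workers 40 times per monitor loop so that any per-call leak becomes visible in hours instead of days.

# Simplified sketch of the amplification experiment; the real monitor_workers
# method does more than this. Calling sync_workers 40 times per monitor loop
# multiplies any per-call allocation, so slow growth shows up much sooner.
def monitor_workers
  40.times { sync_workers }   # normally just: sync_workers
  MiqWorker.status_update_all # rest of the monitor loop left unchanged
end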
Nick LaMuro
@NickLaMuro
Jan 08 2018 16:16
I will re-confirm on the new appliance that commenting out sync_workers stops the leak with these new settings. If not, then it isn't so simple.
for what it is worth, I did try this test already, and it didn't seem to show a change in the short time I ran it
that said, I don't think I shared those results, because it was over the holidays or something... worth doing again though
Keenan Brock
@kbrock
Jan 08 2018 16:21
@NickLaMuro do you think sync_workers could be leaking in some of the workers as well?
I'm assuming no, since some workers do not seem to leak
just curious
whatever the worker part of sync_workers is
remind me why the monitor_class_names (String) version is used again? I assume constantize is probably fine
wonder if loading that is calling some of our code loading libraries or something
Nick LaMuro
@NickLaMuro
Jan 08 2018 16:25

do you think sync_workers could be leaking in some of the workers as well?

@kbrock I think MiqWorker.sync_workers is still only called by the MiqServer object in the code, and never by the MiqWorker instances themselves.

Keenan Brock
@kbrock
Jan 08 2018 16:25
but don't they send a message to the workers to do something?
nvr mind - I think no as well
Nick LaMuro
@NickLaMuro
Jan 08 2018 16:26
yeah, they tell workers to shut down if there are too many, but the worker monitor is still the one that figures out which workers to shut down
Joe Rafaniello
@jrafanie
Jan 08 2018 16:26

I will re-confirm on the new appliance that commenting out sync_workers stops the leak with these new settings. If not, then it isn't so simple.

@NickLaMuro let's not assume there's only 1 leak ;-)

Nick LaMuro
@NickLaMuro
Jan 08 2018 16:26
pretty sure at least..
@jrafanie That was actually what I was thinking
it is probably some kind of shared method that exists not only in sync_workers but also somewhere in those other remaining methods
Joe Rafaniello
@jrafanie
Jan 08 2018 16:30

rough numbers from my local appliance show

+25252 memory increase in 2 hours 11 minutes (normal sync_workers, deployed miq appliance)
+49880 memory increase in 2 hours 11 minutes (40.times { sync_workers })

@NickLaMuro ^
@NickLaMuro Note, the removal of sync_workers via commenting it out might cause other problems so I don't really know if you're comparing apples to apples
But I'm thinking I'll try comparing the 40.times { sync_workers } results to the same test with only the basic workers in the class names. I have a feeling it's one worker, or a subset of workers, that's involved here
Nick LaMuro
@NickLaMuro
Jan 08 2018 16:42

Note, the removal of sync_workers via commenting it out might cause other problems so I don't really know if you're comparing apples to apples

It was more to say that the leak was still present even when not executing sync_workers, not that it is apples to apples

Joe Rafaniello
@jrafanie
Jan 08 2018 16:52
Right
Joe Rafaniello
@jrafanie
Jan 08 2018 17:04
I'm doing a 40.times { sync_workers } test with all of the ManageIQ::Providers worker classes removed from MIQ_WORKER_TYPES
Nick LaMuro
@NickLaMuro
Jan 08 2018 17:04
:+1:
I think that is a good test
I was looking into that a bit as well
since I stumbled across this: ManageIQ/manageiq@0d1eca1
Joe Rafaniello
@jrafanie
Jan 08 2018 18:03

@NickLaMuro

40.times { sync_workers } + all provider workers removed from miq worker types:

11:57:40 328584
12:27:41 333976 (+5392 in 30 minutes)
12:57:43 334660 (+6076 in 1 hour)
for comparison, here's what "40.times { sync_workers }" with all workers looked like
17:04:08 325488
17:34:09 332024 (+6500 30 minutes)
18:04:11 344744 (+19220 1 hour)
19:15:14 376000 (+49880 2 hours 11 minutes)
I'll let it run another 30 minutes and see if it's still not growing that fast... then I'll look at the provider workers I disabled and see if I can add a small subset
in other words, what is currently disabled (below) is probably the cause of the much faster memory growth:
 MIQ_WORKER_TYPES = {
-  "ManageIQ::Providers::Amazon::AgentCoordinatorWorker"                         => %i(manageiq_default),
-  "ManageIQ::Providers::Amazon::CloudManager::EventCatcher"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Amazon::CloudManager::MetricsCollectorWorker"           => %i(manageiq_default),
-  "ManageIQ::Providers::Amazon::CloudManager::RefreshWorker"                    => %i(manageiq_default),
-  "ManageIQ::Providers::AnsibleTower::AutomationManager::EventCatcher"          => %i(manageiq_default),
-  "ManageIQ::Providers::AnsibleTower::AutomationManager::RefreshWorker"         => %i(manageiq_default),
-  "ManageIQ::Providers::Azure::CloudManager::EventCatcher"                      => %i(manageiq_default),
-  "ManageIQ::Providers::Azure::CloudManager::MetricsCollectorWorker"            => %i(manageiq_default),
-  "ManageIQ::Providers::Azure::CloudManager::RefreshWorker"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Azure::NetworkManager::RefreshWorker"                   => %i(manageiq_default),
-  "ManageIQ::Providers::EmbeddedAnsible::AutomationManager::EventCatcher"       => %i(manageiq_default),
-  "ManageIQ::Providers::EmbeddedAnsible::AutomationManager::RefreshWorker"      => %i(manageiq_default),
-  "ManageIQ::Providers::Foreman::ConfigurationManager::RefreshWorker"           => %i(manageiq_default),
-  "ManageIQ::Providers::Foreman::ProvisioningManager::RefreshWorker"            => %i(manageiq_default),
-  "ManageIQ::Providers::Google::CloudManager::EventCatcher"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Google::CloudManager::MetricsCollectorWorker"           => %i(manageiq_default),
-  "ManageIQ::Providers::Google::CloudManager::RefreshWorker"                    => %i(manageiq_default),
-  "ManageIQ::Providers::Google::NetworkManager::RefreshWorker"                  => %i(manageiq_default),
-  "ManageIQ::Providers::Hawkular::DatawarehouseManager::EventCatcher"           => %i(manageiq_default),
-  "ManageIQ::Providers::Hawkular::DatawarehouseManager::RefreshWorker"          => %i(manageiq_default),
-  "ManageIQ::Providers::Hawkular::MiddlewareManager::EventCatcher"              => %i(manageiq_default),
-  "ManageIQ::Providers::Hawkular::MiddlewareManager::RefreshWorker"             => %i(manageiq_default),
-  "ManageIQ::Providers::Kubernetes::ContainerManager::EventCatcher"             => %i(manageiq_default),
-  "ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCollectorWorker"   => %i(manageiq_default),
-  "ManageIQ::Providers::Kubernetes::ContainerManager::InventoryCollectorWorker" => %i(manageiq_default),
-  "ManageIQ::Providers::Kubernetes::ContainerManager::RefreshWorker"            => %i(manageiq_default),
-  "ManageIQ::Providers::Kubernetes::MonitoringManager::EventCatcher"            => %i(manageiq_default),
-  "ManageIQ::Providers::Kubevirt::InfraManager::RefreshWorker"                  => %i(manageiq_default),
-  "ManageIQ::Providers::Lenovo::PhysicalInfraManager::EventCatcher"             => %i(manageiq_default),
-  "ManageIQ::Providers::Lenovo::PhysicalInfraManager::RefreshWorker"            => %i(manageiq_default),
-  "ManageIQ::Providers::Microsoft::InfraManager::RefreshWorker"                 => %i(manageiq_default),
-  "ManageIQ::Providers::Nuage::NetworkManager::RefreshWorker"                   => %i(manageiq_default),
-  "ManageIQ::Providers::Nuage::NetworkManager::EventCatcher"                    => %i(manageiq_default),
-  "ManageIQ::Providers::Openshift::ContainerManager::EventCatcher"              => %i(manageiq_default),
-  "ManageIQ::Providers::Openshift::ContainerManager::MetricsCollectorWorker"    => %i(manageiq_default),
-  "ManageIQ::Providers::Openshift::ContainerManager::InventoryCollectorWorker"  => %i(manageiq_default),
-  "ManageIQ::Providers::Openshift::ContainerManager::RefreshWorker"             => %i(manageiq_default),
-  "ManageIQ::Providers::Openshift::MonitoringManager::EventCatcher"             => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::CloudManager::EventCatcher"                  => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::CloudManager::MetricsCollectorWorker"        => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::CloudManager::RefreshWorker"                 => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::InfraManager::EventCatcher"                  => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::InfraManager::MetricsCollectorWorker"        => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::InfraManager::RefreshWorker"                 => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::NetworkManager::EventCatcher"                => %i(manageiq_default),
-  "ManageIQ::Providers::Openstack::NetworkManager::MetricsCollectorWorker"      => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::InfraManager::EventCatcher"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker"           => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::InfraManager::RefreshWorker"                    => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::NetworkManager::EventCatcher"                   => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::NetworkManager::MetricsCollectorWorker"         => %i(manageiq_default),
-  "ManageIQ::Providers::Redhat::NetworkManager::RefreshWorker"                  => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::CloudManager::EventCatcher"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::CloudManager::RefreshWorker"                    => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::InfraManager::EventCatcher"                     => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::InfraManager::MetricsCollectorWorker"           => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::InfraManager::RefreshWorker"                    => %i(manageiq_default),
-  "ManageIQ::Providers::Vmware::NetworkManager::RefreshWorker"                  => %i(manageiq_default),
Joe Rafaniello
@jrafanie
Jan 08 2018 18:09
Note, I have no providers added; I just deployed the manageiq appliance and ran it. I reboot after making the changes and see what happens
At this point, it feels like a 1 hour binary search
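The binary search here is manual: split the provider worker list, run each half for about an hour, and keep whichever half still shows the abnormal growth. A toy sketch of that process, where leaks? is a hypothetical stand-in for the manual watch-RSS-for-an-hour step:

# Toy illustration of bisecting the provider worker list; `leaks?` is a
# hypothetical placeholder for the manual one-hour RSS check on an appliance.
def leaks?(worker_types)
  raise NotImplementedError, "deploy with only these worker types and watch RSS"
end

def find_leaking_workers(worker_types)
  return worker_types if worker_types.size <= 1
  left, right = worker_types.each_slice((worker_types.size / 2.0).ceil).to_a
  if leaks?(left)
    find_leaking_workers(left)
  elsif leaks?(right)
    find_leaking_workers(right)
  else
    [] # growth only appears with a combination, or isn't provider-specific
  end
end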
Jason Frey
@Fryguy
Jan 08 2018 18:26
:point_up: January 8, 2018 10:54 AM Can you clarify this? Are you saying that the settings might be the culprit, or that when the settings are on, it happens more quickly, or just that setting a lower value makes it happen more quickly (i.e. the settings themselves aren't the problem)?
Keenan Brock
@kbrock
Jan 08 2018 18:26
@jrafanie did you just say: "one provider is including a gem that is screwing us"
Jason Frey
@Fryguy
Jan 08 2018 18:26
just trying to understand, to see if I can help out if it's believed to be settings
Keenan Brock
@kbrock
Jan 08 2018 18:27
I'd just group by vendor type - would decrease the number of combinations
Joe Rafaniello
@jrafanie
Jan 08 2018 18:31
@Fryguy I believe @NickLaMuro is referring to the fact that changing the poll to 1 second drastically accelerated the memory growth... which makes sense when combined with my comparison of 2 hours of a normal, freshly deployed miq appliance vs. changing sync_workers to 40.times { sync_workers } :point_up: January 8, 2018 11:30 AM
Jason Frey
@Fryguy
Jan 08 2018 18:35
ahh ok
very interesting
Keenan Brock
@kbrock
Jan 08 2018 18:36
ok
how about
go through the list and
1000.times {
  40.times { "ManageIQ::Providers::Vmware::NetworkManager::RefreshWorker".constantize }
  GC.start
}
puts PSS # can't remember this line
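A minimal way to fill in that last line, assuming a Linux appliance where /proc is available (printing RSS rather than PSS, which matches the later messages):

# Prints the resident set size (RSS) of the current process, in kB, by reading
# /proc. Only a sketch of the "put my RSS" line; `ps -o rss= -p <pid>` would
# work just as well.
def print_rss
  rss_kb = File.readlines("/proc/#{Process.pid}/status")
               .grep(/^VmRSS:/).first.to_s[/\d+/].to_i
  puts "RSS: #{rss_kb} kB"
end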
Jason Frey
@Fryguy
Jan 08 2018 18:37
Don't forget to GC in there ;)
Keenan Brock
@kbrock
Jan 08 2018 18:40
can't remember the "put my RSS" line
@NickLaMuro - separating by pid makes everything better - thanks for helping me get there
of course I spent way too much time on the scripts that now generate the new graphs :(
Joe Rafaniello
@jrafanie
Jan 08 2018 18:45
haha, no, I'll continue on this path... it's slow and tedious but I don't want to jump to conclusions yet
I'm putting back Amazon, AnsibleTower, Azure, EmbeddedAnsible, Foreman, Google and Hawkular providers and restarting my tests
(aka, Kubernetes, Kubevirt, Lenovo, Nuage, Openshift, Openstack, Redhat, Vmware are removed to see if the above leaks :point_up:)
90 minutes seems to be more than enough to confirm
Keenan Brock
@kbrock
Jan 08 2018 18:49
darn - wonder where I had that table of the changed gem versions of the 2 appliances
thought I put it into a chat window somewhere... :(
then your binary search could be a little better informed
Joe Rafaniello
@jrafanie
Jan 08 2018 18:51
I fixed my comment above; I'm testing with the first several providers alphabetically to see if they leak, basically 40.times { sync_workers } on the [A-H] providers + non-provider workers
Nick LaMuro
@NickLaMuro
Jan 08 2018 20:21
Sorry, was away... playing catchup
@jrafanie it would be interesting to see if you bumped the 40.times up to something much larger (200 or so), and whether it still grows a bit
@Fryguy yes, I was simply trying to decrease the time it takes to determine whether a leak is present. The advantage of this is that it doesn't introduce code changes or arguably nonsensical extra work (doing the same thing over and over 40 times is probably not something we would do in reality, but it works if we know the culprit)
Joe Rafaniello
@jrafanie
Jan 08 2018 20:26
yeah, I'm deploying another local appliance to do that exact thing. I want another run confirming the greatly increased memory growth
Nick LaMuro
@NickLaMuro
Jan 08 2018 20:27
sorry, I meant to say to try that on the trimmed-down set of worker types
we are technically doing 40 or so fewer units of work every time we go through the 40.times iteration
so the idea is that we could confirm whether the worker type makes a difference in the leak, or whether it is just the number of times going through the constantize + sync_workers code that is the culprit
Joe Rafaniello
@jrafanie
Jan 08 2018 21:09
so @NickLaMuro, I agree. These aren't apples-to-apples amounts of work, but I'm just looking at "is it leaking in a crazy way"
For example...
40.times { sync_workers } + no provider workers [1]

11:57:40 328584
12:27:41 333976 (+5392 30 minutes)
12:57:43 334660 (+6076 60 minutes)
13:27:44 335176 (+6592 90 minutes)
13:41:45 335352 (+6768 104 minutes)
40.times { sync_workers } + normal workers + A-H provider workers

13:48:26 321964
14:18:28 324056 (+2092 30 minutes)
14:48:29 324424 (+2460 60 minutes)
15:08:30 324572 (+2608 80 minutes)
40.times { sync_workers } + normal workers + K-Z provider workers [1]

15:11:59 318840
15:42:01 327800 (+8960 30 minutes)
16:00:02 334552 (+15712 48 minutes)
16:06:02 337440 (+18600 54 minutes)
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:12
The last one seems pretty telling, interesting
Joe Rafaniello
@jrafanie
Jan 08 2018 21:12
Remember, this is with all the workers
40.times { sync_workers }

17:04:08 325488
17:34:09 332024 (+6500 30 minutes)
18:04:11 344744 (+19220 60 minutes)
18:34:12 357304 (+31816 90 minutes)
19:15:14 376000 (+49880 2 hours 11 minutes)
I think the K-Z provider workers are clearly the most "similar" to this heavily synced "all workers" test
Keenan Brock
@kbrock
Jan 08 2018 21:14
@jrafanie so you're suggesting we rename the providers? lower ascii values use up fewer bits?
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:15
@jrafanie by the way, what are you using for memory metrics here?
Joe Rafaniello
@jrafanie
Jan 08 2018 21:15
top_output.log
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:15
I assume this is RSS then?
Joe Rafaniello
@jrafanie
Jan 08 2018 21:15
grep -E "top -|timesync|1639" log/top_output.log # where 1639 is the server pid
yes
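A rough Ruby equivalent of that grep, which prints the RES column for one pid along with the timestamp of each top sample; it assumes the default top -b batch layout with RES as the 6th column, and 1639 is just the example server pid:

# Rough helper for eyeballing RSS growth in log/top_output.log. Assumes each
# sample starts with a "top - HH:MM:SS ..." header line and that process rows
# have RES in the 6th column (default `top -b` layout). Pass the server pid
# as the first argument, e.g. `ruby rss_from_top.rb 1639`.
pid       = ARGV.fetch(0, "1639")
timestamp = nil
File.foreach("log/top_output.log") do |line|
  timestamp = line[/top - (\S+)/, 1] if line.start_with?("top -")
  cols = line.split
  puts "#{timestamp} RES=#{cols[5]}" if cols.first == pid
end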
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:15
okay, just making sure I am looking at the same metric
Joe Rafaniello
@jrafanie
Jan 08 2018 21:19
I'm going to start testing K-Openshift (removing Openstack/Redhat/Vmware)

@jrafanie so you're suggesting we rename the providers? lower ascii values use up fewer bits?

@kbrock :laughing:

Nick LaMuro
@NickLaMuro
Jan 08 2018 21:28

For posterity, I ran a modified form of @kbrock's test:

puts "PID: #{Process.pid}"

require 'config/environment.rb'

MiqServer

puts "starting loop..."
loop do
  MiqServer.monitor_class_names.each do |class_name|
    class_name.constantize
  end
  (GC.start; do_gc = false)   if do_gc  && Time.now.min % 2 == 0
  do_gc = true                if !do_gc && Time.now.min % 2 != 0
  sleep 0.5
end

And currently it doesn't seem like it is leaking. Of note, I am running it on my machine in a Vagrant VM, so possibly that is affecting things.

Joe Rafaniello
@jrafanie
Jan 08 2018 21:30
My bet is some shenanigans we're doing in the sync_workers for some provider
Keenan Brock
@kbrock
Jan 08 2018 21:30
you need to write class EveryOtherMinute
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:30

Also looking into the internals of MiqWorker.sync_workers and seeing if there is a script I can build up for that. If I am not mistaken, the current != desired block should never be executed, but maybe I should go over some of the logs from my previous runs and see if the pids ever changed for the workers.
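For context, the current != desired logic being referred to is roughly the following (a stripped-down sketch, not the actual MiqWorker.sync_workers implementation): compare how many workers of a class are running against how many are configured, and only start or stop workers when the two differ.

# Stripped-down sketch of the comparison being discussed; names and shapes are
# simplified, not the real ManageIQ code. On a healthy steady-state appliance
# current == desired, so neither branch should ever run.
def sync_workers_sketch(current_pids, desired_count)
  result = { :adds => [], :deletes => [] }
  if current_pids.size < desired_count
    (desired_count - current_pids.size).times { result[:adds] << :start_worker }
  elsif current_pids.size > desired_count
    result[:deletes] = current_pids.last(current_pids.size - desired_count)
  end
  result
end

sync_workers_sketch([101, 102], 2) # => {:adds=>[], :deletes=>[]}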
Keenan Brock
@kbrock
Jan 08 2018 21:30
yea - that sounds like something
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:31
@jrafanie so I looked into that, and we don't have too many different versions of .sync_workers across the different providers
but I guess one of the workers could be overriding one of the methods that .sync_workers uses
Joe Rafaniello
@jrafanie
Jan 08 2018 21:32
yeah
I've given up on guessing ;-)
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:32
trying to multi-task and work from small scripts while waiting on the more concrete long tests to do their thing
Joe Rafaniello
@jrafanie
Jan 08 2018 21:33
We finally have a test that takes less than 90 minutes to confirm out-of-the-norm memory growth. It's entirely possible it's not exactly what happens in our other tests, but it's definitely "interesting"
yeah, true
Nick LaMuro
@NickLaMuro
Jan 08 2018 21:34
@jrafanie the one thing I am wary about with that test is that I have had tests "grow" for a bit, but then eventually level off
I hope you are right with your tests, but I have gotten my hopes up a few times already doing this in the past, and have usually been wrong a day later
:disappointed:
Joe Rafaniello
@jrafanie
Jan 08 2018 21:37
Good point. I'll leave the all workers 40.times { sync_workers } running over night
Joe Rafaniello
@jrafanie
Jan 08 2018 22:13
I'm now testing Openstack/Redhat/Vmware providers + the normal workers since K-Openshift was pretty boring
40.times { sync_workers } + normal workers + K-Openshift provider workers [1]

16:23:40 322016
16:53:41 326436 (+4420 30 minutes)
16:59:41 326436 (+4420 36 minutes)
17:08:42 326436 (+4420 45 minutes)
Nick LaMuro
@NickLaMuro
Jan 08 2018 22:16
@jrafanie oh, were you thinking this was going to be interesting? :trollface:
Joe Rafaniello
@jrafanie
Jan 08 2018 22:17
I assumed it would be our oldest providers
But, I wanted to confirm the boringness of the others
That still doesn't prove anything yet... We need to get it down to as small a subset as possible and do longer-term tests with just one or two providers removed to see if it helps
Nick LaMuro
@NickLaMuro
Jan 08 2018 22:43

So this has only run for 6 hours, and I think letting it continue is needed to confirm some things, but there is some interesting data so far:

2 different servers (3 and 14 workers respectively):

This is basically the remaining code in MiqServer#monitor_workers on these two:

  def monitor_workers
    sync_workers
    MiqWorker.status_update_all
  end
20180108_25490.png
Keenan Brock
@kbrock
Jan 08 2018 22:44
@NickLaMuro does this have the 40.times {} code?
I can't tell. does this look slower than a typical leak?
Nick LaMuro
@NickLaMuro
Jan 08 2018 22:45
20180108_25490.png

Other server

Code on this server
  def monitor_workers
    self.class.my_server_clear_cache
    # sync_workers
    MiqWorker.status_update_all
    do_system_limit_exceeded if self.kill_workers_due_to_resources_exhausted?
  end
20180108_8392.png

@NickLaMuro does this have the 40.times {} code?

@kbrock no, it doesn't, see the code snippet. It does however have the settings set to 1.second like I described above

The last one is interesting because it seems to level off near the end, while the others seem to continue to grow after plateauing for a time

I can't tell. does this look slower than a typical leak?

You should be comparing these with some of the graphs here: https://gitter.im/ManageIQ/manageiq/performance?at=5a53941d03838b2f2a860a3c

they are in the same order in both
That said, this is only the first 6 hours, whereas the graphs from above were after running all weekend
Nick LaMuro
@NickLaMuro
Jan 08 2018 22:50
the last one in the link that I posted this morning is the only one that had the settings tweaked, the others had the default settings at the time of posting
I have since changed that so we can get a faster turnaround