These are chat archives for ManageIQ/manageiq/performance

8th
Dec 2017
Keenan Brock
@kbrock
Dec 08 2017 01:04 UTC
@bdunne thanks for the polite reminder that this stuff is hard
Oleg Barenboim
@chessbyte
Dec 08 2017 01:06 UTC
@kbrock untangle that require spaghetti
Beni Cherniavsky-Paskin
@cben
Dec 08 2017 08:42 UTC
a pie-in-the-sky thought: if type columns stored the ancestors chain (something like VmOrTemplate>Vm>Providers::Infra::Vm>Providers::Infra::Ovirt::Vm), queries like Vm.all could be based on prefix (LIKE 'VmOrTemplate>Vm>%'), without having to know all descendants. descendant loader could go away!
downside (besides being a major Rails change): inserting or otherwise changing ancestors would require a migration.
Beni Cherniavsky-Paskin
@cben
Dec 08 2017 08:48 UTC
hmm, that's not very different from persisting the descendants.yml somewhere into the DB.
but you don't need a migration to add descendants, prob. the most common case.
there are other ways (e.g. a "types" table storing all ancestor-descendant relations, and queries becoming joins through it)...
closer to reality: if we care about startup speed, maybe instead of actually loading all descendants before a query, we could just report to ActiveRecord what the descendant names are (and they'd autoload lazily, only if/when such records are constructed)?
anyway, I'm surprised we had to make this thing and it's not solved in Rails... STI + autoloading just don't combine well.
Beni Cherniavsky-Paskin
@cben
Dec 08 2017 09:11 UTC
say, are we planning to use Bootsnap?
it also scans sources, and caches bytecode — I wonder if that it might have enough data to figure out descendants. And if we must maintain a crazy caching hack, Bootsnap might be a good place to upstream it...
and could Bootsnap + full eager loading be fast enough to just forget about autoloading!?
Brandon Dunne
@bdunne
Dec 08 2017 12:25 UTC
@cben Starting at a leaf subclass, everything is fine. Rails knows how to autoload all the way down to the base class. But if you start at the base class, rails won't auto-load all the way up. For example:
# Without descendant loader
irb(main):001:0> VmOrTemplate.descendants.collect(&:name)
=> []
irb(main):002:0> Vm
=> Vm(id: integer, ....
irb(main):003:0> VmOrTemplate.descendants.collect(&:name)
=> ["Vm"]
# With descendant loader
irb(main):001:0> VmOrTemplate.descendants.collect(&:name)
=> ["Vm", "MiqTemplate", "VmServer", "ManageIQ::Providers::InfraManager::Vm", "ManageIQ::Providers::CloudManager::Vm", "VmXen", "ManageIQ::Providers::Redhat::InfraManager::Vm", "ManageIQ::Providers::Microsoft::InfraManager::Vm", "ManageIQ::Providers::Vmware::InfraManager::Vm", "ManageIQ::Providers::Amazon::CloudManager::Vm", "ManageIQ::Providers::Azure::CloudManager::Vm", "ManageIQ::Providers::Google::CloudManager::Vm", "ManageIQ::Providers::Openstack::CloudManager::Vm", "ManageIQ::Providers::Vmware::CloudManager::Vm", "ManageIQ::Providers::InfraManager::Template", "ManageIQ::Providers::CloudManager::Template", "ManageIQ::Providers::Redhat::InfraManager::Template", "ManageIQ::Providers::Microsoft::InfraManager::Template", "ManageIQ::Providers::Vmware::InfraManager::Template", "TemplateXen", "ManageIQ::Providers::Openstack::InfraManager::Template", "ManageIQ::Providers::Amazon::CloudManager::Template", "ManageIQ::Providers::Azure::CloudManager::Template", "ManageIQ::Providers::Google::CloudManager::Template", "ManageIQ::Providers::Openstack::CloudManager::Template", "ManageIQ::Providers::Vmware::CloudManager::Template", "ManageIQ::Providers::Openstack::CloudManager::VolumeSnapshotTemplate", "ManageIQ::Providers::Openstack::CloudManager::VolumeTemplate"]
The official rails solution is to eager-load everything
Brandon Dunne
@bdunne
Dec 08 2017 12:31 UTC
But we don't want enormous worker processes that don't use 80% of the loaded code
Also, constantizing a leaf subclass only loads that branch of the descendant tree. i.e. Constantizing ManageIQ::Providers::InfraManager::Vm won't load ManageIQ::Providers::CloudManager::Vm
Keenan Brock
@kbrock
Dec 08 2017 14:18 UTC
@bdunne but I think I did discover 1 thing: If VmOrTemplate.all had no filter, and Vm.all only searched for template = false, then it will not need to know the subclasses. All other searches will be from the class it is searching - so it should work just fine.
Only caviot is searching for all ManageIQ::Providers::CloudManager::Vm. Not sure how valid this is.
Yes, we do not want to load all classes into memory. Especially if we are in an OpenShift process. But it seems like there may be a simpler solution.
Writing a boot process that parses ruby files to determine the subclasses and then cache them seems like something better done in ruby's core.
And checking file mtimes to determine if the cache is correct is quite complicated.
Joe Rafaniello
@jrafanie
Dec 08 2017 14:27 UTC
Yeah, that's a good point @kbrock but honestly, unless we can pinpoint clear problems we can't mitigate, it will continue to not be on the radar. Most of our rails boot process is not doing the sti loader. I've found it's mostly parsing routes, parsing localization catalogs, etc... then of course seeding (for the MIQ Server process)
Keenan Brock
@kbrock
Dec 08 2017 14:43 UTC
Fair enough. I would like to configure which languages are supported, and which providers are supported.
My proposals for removing seating and have not had much success. So I am targeting my next pain point.
Seeding
Keenan Brock
@kbrock
Dec 08 2017 14:52 UTC
Maybe focusing on visibility into our processes ishould be my main goal
Brandon Dunne
@bdunne
Dec 08 2017 15:29 UTC
@kbrock I know it's not the case for VmOrTemplate, but.. There are several classes (may or may not be AR based) that can be instantiated based on UI selections. In some cases, there is a drop-down in the UI to choose the thing based on something like Thing.descendants.collect(&:display_name). All descendants of Thing are valid choices. There may not yet be any Things in the database, or may not be all types of Things in the database yet. In these cases I believe our only choice is DescendantLoader or eager-load everything.
Keenan Brock
@kbrock
Dec 08 2017 15:32 UTC
@bdunne GREAT use case. thanks
Jason Frey
@Fryguy
Dec 08 2017 15:54 UTC
To follow onto that point, the Vm being a template = false is a very specific use case. DescendantLoader solves the general use case. We can't always come up with a tailored query for each of the classes that have descendants
Incidentally 72 / 323 tables have the type column (implying STI, and possibly the descendants problem)
Nick LaMuro
@NickLaMuro
Dec 08 2017 15:57 UTC
so I guess my question with this is why we went with the custom DescendantLoader hack in the first place? Since we usually air on the argument of "let's do this the Rails way" in general?
Jason Frey
@Fryguy
Dec 08 2017 15:58 UTC
yeah...and also if any team were to introduce new tables with STI, they would be required to know this esoteric issue and solve it (and reviewers would have to always be on the lookout for it)
Nick LaMuro
@NickLaMuro
Dec 08 2017 15:59 UTC
the only thing I can think of for "why" is for faster boot times, which only seems valid only up to a point
Jason Frey
@Fryguy
Dec 08 2017 15:59 UTC
we initially did it differently (can't remember anymore how, but I thikn it involved ensuring a mixin was present), and then matthewd solved it generically
he tried to convince Rails core to take it, but it's a gross hack and they didn't want it
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:00 UTC
still really doesn't answer the "why"... (and "why not just eager_load")
Jason Frey
@Fryguy
Dec 08 2017 16:00 UTC
but even they couldn't figure out how to solve it, so they just said "meh if you're eager loading it's solved, so whatevs"
"why" they didn't take it, or "why" the issue happens at all?
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:01 UTC
this to me seems like a workaround to eager loading
Jason Frey
@Fryguy
Dec 08 2017 16:02 UTC
it sort of is, but even in "normal" Rails installation you don't eager load in dev or test and the issue still presents itself there
it's technically a problem with "autoload"...we just happen to use autload instead of eager load in production
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:03 UTC
so we could in theory just eager load in dev and test then, right?
Jason Frey
@Fryguy
Dec 08 2017 16:03 UTC
yup
which would then make boot times longer for @kbrock ;)
not to mention the memory aspects :)
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:03 UTC
right, but we just get to those high memory situations eventually anyway
and the issue isn't with "boot times", it is with "on first load of a query, there is a chance something is being eagerloaded"
skewing a performance benchmark
granted, nobody likes slow boot times
Jason Frey
@Fryguy
Dec 08 2017 16:04 UTC
maybe? The memory stuff only happens where you have long running processes (typically production)...if I'm firing up a Rails console why would I want to eager load the world
right, technically it happens when someone calls ".descendants" the very first time which is typically on the very first query of an STI model
so, basically almost immediately the first thing anyone "does" with a Rails console :)
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:06 UTC

if I'm firing up a Rails console why would I want to eager load the world

making this point kinda moot

Jason Frey
@Fryguy
Dec 08 2017 16:06 UTC
now you lost me
DescendantLoader does not do an eager load...it doesn't "load the world" in the same sense
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:11 UTC
well, I was going to do a quick test show that switching to eager loading is effectively the same thing as doing /bin/rails c; Vm.first with descendant loader, but turns out we have actually busted eager loading in our application
or at least in my current setup
Jason Frey
@Fryguy
Dec 08 2017 16:11 UTC
interesting
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:12 UTC
breaks on trying to load Hawkular::MiddlewareManager::Inventory, for what it is worth
but not worth me solving now
Jason Frey
@Fryguy
Dec 08 2017 16:13 UTC
very interesting....wonder if there's a include_concern type bug there
(stupid question...how do you enable eager loading?)
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:13 UTC
when into config/environments/development.rb and switched the eager_load and cache_classes vals to true
Jason Frey
@Fryguy
Dec 08 2017 16:14 UTC
:+1:
Keenan Brock
@kbrock
Dec 08 2017 16:29 UTC
@NickLaMuro think there is no inventory.rb stub file
Keenan Brock
@kbrock
Dec 08 2017 16:34 UTC
module ManageIQ::Providers
  class Hawkular::Inventory::Collector::MiddlewareManager
    ...
  end
end
what is Inventory? is that a module?
annoying that they do the lazy class definition - all on 1 line
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:35 UTC
one of my bet peeves in ruby...
but in other news... turns out if you comment out all of MiqServer#monitor you still leak a bit of memory
@jrafanie ^
Joe Rafaniello
@jrafanie
Dec 08 2017 16:36 UTC
@NickLaMuro over how long a period?
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:36 UTC
so my thought with that is we might have an issue with settings
@jrafanie only 12 hours
Keenan Brock
@kbrock
Dec 08 2017 16:36 UTC
are you palying with the ruby version?
wonder if you change the ruby version if it will go away
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:36 UTC
nope, just commenting out lines on an appliance
Keenan Brock
@kbrock
Dec 08 2017 16:37 UTC
also wonder if you can drop the gem versions and try again
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:37 UTC
the other thought though is it might be something in the Settings code
Keenan Brock
@kbrock
Dec 08 2017 16:37 UTC
using the version of ruby / gems that weren't leaking
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:37 UTC
since that is still used in monitor_poll
Joe Rafaniello
@jrafanie
Dec 08 2017 16:37 UTC
@NickLaMuro interesting... I wish we could get it to definitively leak faster
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:37 UTC
MiqServer#monitor isn't the looping code by the way
@jrafanie I might just try setting up a loop that just keeps checking Settings
Joe Rafaniello
@jrafanie
Dec 08 2017 16:38 UTC
I have an idle appliance running for an hour or two and it's roughly the same RSS as before
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:39 UTC
it is a slow leak, for what it is worth
20171208_28903.png
Jason Frey
@Fryguy
Dec 08 2017 16:43 UTC
is that really a leak? If the GC is never invoked (i.e. you don't run out of live slots), then it will never clean up, and yeah RSS will keep going up
Nick LaMuro
@NickLaMuro
Dec 08 2017 16:50 UTC
hard to say after only running it for such a short amount of time
Joe Rafaniello
@jrafanie
Dec 08 2017 16:52 UTC
Another dumb idea I had was to add a GC.start in the server process roughly every 30-60 seconds. From your logging @NickLaMuro, sometimes it's 4-5 minutes between full GCs(major_gc_count gets incremented)
it's hard to say though if it takes several hours to reliably call it a leak
Jason Frey
@Fryguy
Dec 08 2017 17:06 UTC
@NickLaMuro I got past the hawkular issue, but the next issue with eager load is /Users/jfrey/.gem/ruby/2.4.2/bundler/gems/manageiq-providers-nuage-a1f64a1cc408/app/models/manageiq/providers/nuage/network_manager/event_catcher/messaging_handler.rb:1:in `<top (required)>': uninitialized constant Qpid (NameError)
which makes sense, but :/
Keenan Brock
@kbrock
Dec 08 2017 17:08 UTC
@Fryguy it needs a require "qpid_proton"
Jason Frey
@Fryguy
Dec 08 2017 17:08 UTC
which we can't do, because qpid_proton is optional
Keenan Brock
@kbrock
Dec 08 2017 17:08 UTC
and it is a subclass of a global scope
well, I build the gem
Jason Frey
@Fryguy
Dec 08 2017 17:08 UTC
yup
Keenan Brock
@kbrock
Dec 08 2017 17:08 UTC
and made it non-optional
but not able to require it bundle exec irb
Jason Frey
@Fryguy
Dec 08 2017 17:08 UTC
which makes sense, but :/
Keenan Brock
@kbrock
Dec 08 2017 17:09 UTC
ended up just commenting out that one and moved on
wait
Jason Frey
@Fryguy
Dec 08 2017 17:09 UTC
yup that's what I figured I'd do
Keenan Brock
@kbrock
Dec 08 2017 17:09 UTC
does it make sense to modify our code (put in PRs) to make this work?
Jason Frey
@Fryguy
Dec 08 2017 17:09 UTC
eventually, maybe
Keenan Brock
@kbrock
Dec 08 2017 17:09 UTC
this looks like a 1 day exercise and a number of changes
I ended up creating a bunch of rb files for each tier of the requires
it fixed a few things
Jason Frey
@Fryguy
Dec 08 2017 17:10 UTC
don't know what that means
Keenan Brock
@kbrock
Dec 08 2017 17:10 UTC
when the requires failed, I created a ruby file with the package name
it tended to resolve it correctly
Jason Frey
@Fryguy
Dec 08 2017 17:10 UTC
did you get past the consumption one?
Keenan Brock
@kbrock
Dec 08 2017 17:11 UTC
got to:
manageiq-providers-vmware/app/models/manageiq/providers/vmware/infra_manager/metrics_fetcher_worker.rb:1:in
'<top (required)>': uninitialized constant ManageIQ::Providers::BaseManager::MetricsFetcherWorker (NameError)
Jason Frey
@Fryguy
Dec 08 2017 17:11 UTC
ah, ok..I'm not hacking them per se but finding actual solutions
or at least something I can open up as an issue in each repo
Keenan Brock
@kbrock
Dec 08 2017 17:12 UTC
cool. GL
Jason Frey
@Fryguy
Dec 08 2017 17:12 UTC
some of them are straight up bugs like the one in consumption
Keenan Brock
@kbrock
Dec 08 2017 17:12 UTC
sweet
er - yea
I remember that heroku required everything
would find a lot of problems just in first boot
if only there were a ruby like language that compiled and found problems ... ;)
Adam Grare
@agrare
Dec 08 2017 17:13 UTC
@kbrock what's a metrics_fetcher_worker?
Keenan Brock
@kbrock
Dec 08 2017 17:13 UTC
aaaahhhh
thanks
lol
Adam Grare
@agrare
Dec 08 2017 17:14 UTC
;)
Keenan Brock
@kbrock
Dec 08 2017 17:14 UTC
@agrare delete it?
Adam Grare
@agrare
Dec 08 2017 17:14 UTC
you're probably on a local branch
its not in master
Keenan Brock
@kbrock
Dec 08 2017 17:14 UTC
we're not doing that ever. right?
it was not in git at all. right
@agrare thanks so much. I totally didn't see the problem there
Adam Grare
@agrare
Dec 08 2017 17:16 UTC
we had some branches for it I thought
Keenan Brock
@kbrock
Dec 08 2017 17:16 UTC
probably
Jason Frey
@Fryguy
Dec 08 2017 17:16 UTC
oh that consumption issue was "fixed" a week or so ago and by "fixed" I mean they deleted the code haha
bin/update for me :/
ok, I'm through the issues (only 2)
now to disable descendant loader
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:20 UTC
"And this, ladies and gentlemen, has been a lesson in how to distract @Fryguy."
Jason Frey
@Fryguy
Dec 08 2017 17:21 UTC
:squirrel:
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:21 UTC
"Tune in next week when we explore how to exploit this distraction into getting him to do work for you"
Jason Frey
@Fryguy
Dec 08 2017 17:21 UTC
I was more interested in the fact that you couldn't eager load which to me implied "uh oh bugs"
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:22 UTC
HOLY S---!! SUCCESS!!!1!
found way to reproduce the leak with a script!
Jason Frey
@Fryguy
Dec 08 2017 17:23 UTC
O_O
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:23 UTC
it is with Benchmark.realtime_block
Adam Grare
@agrare
Dec 08 2017 17:23 UTC
????
Jason Frey
@Fryguy
Dec 08 2017 17:23 UTC
:trophy:
really? wow
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:23 UTC
need to let it run longer to confirm, but let me get show you some output... getting some stuff together... hold plox
# loop_monitor.rb
require 'sys-proctable'

loop do
  puts Sys::ProcTable.ps(ARGV[0].to_i).rss
  sleep 5
end
puts "PID: #{Process.pid}"
# loop_test_1.rb
require 'config/environment.rb'

puts "starting loop..."
loop do
  ::Settings.server.monitor_poll.to_i_with_method
  sleep 0.5
end
# loop_test_2.rb
puts "PID: #{Process.pid}"

require 'config/environment.rb'

puts "starting loop..."

def monitor
  # intentional no-op
end

loop do
  _dummy, timings = Benchmark.realtime_block(:total_time) { monitor }
  sleep 0.5
end
Keenan Brock
@kbrock
Dec 08 2017 17:25 UTC
holly cow
Jason Frey
@Fryguy
Dec 08 2017 17:25 UTC
interesting
Keenan Brock
@kbrock
Dec 08 2017 17:25 UTC
that is very short - (aka very cool)
Adam Grare
@agrare
Dec 08 2017 17:25 UTC
but that doesn't show up in the ruby heap?
Jason Frey
@Fryguy
Dec 08 2017 17:25 UTC
Please throw a GC in there to confirm
it's always going to increase until it hits GC thresholds
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:26 UTC
mmk
Jason Frey
@Fryguy
Dec 08 2017 17:27 UTC
maybe I shoudl tweak that to not do a query, in case the query is part of the numbers
Keenan Brock
@kbrock
Dec 08 2017 17:27 UTC
Vm.all.to_sql
Jason Frey
@Fryguy
Dec 08 2017 17:27 UTC
ohh...good idea
Keenan Brock
@kbrock
Dec 08 2017 17:28 UTC
I thought eager load would have been much higher
I wonder the difference in filesystem accesses. (mac - cheap, in a vm, wonder)
Jason Frey
@Fryguy
Dec 08 2017 17:29 UTC
note that I am completely ignoring the caveat that if you use eager_load you can no longer reload classes in developement
Keenan Brock
@kbrock
Dec 08 2017 17:29 UTC
yes, eager_load is not a long term solution
Jason Frey
@Fryguy
Dec 08 2017 17:29 UTC
it's probably more the performance of RipperRubyParser vs just loading a class (for all files)
dammit...I deleted my comment :point_up: :angry:

but .to_sql is about the same numbers...

DescendantLoader uncached:

bundle exec rails r 'Vm.all.to_sql'  48.99s user 2.29s system 99% cpu 51.430 total

DescendantLoader cached:

bundle exec rails r 'Vm.all.to_sql'  8.07s user 2.03s system 98% cpu 10.207 total

DescendantLoader disabled and eager_load enabled (with fixes):

bundle exec rails r 'Vm.all.to_sql'  11.28s user 2.75s system 99% cpu 14.138 total
Keenan Brock
@kbrock
Dec 08 2017 17:32 UTC
@NickLaMuro loop_test_2.rb needs a loop?
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:32 UTC
huh? are you saying it doesn't have one?
Adam Grare
@agrare
Dec 08 2017 17:33 UTC
loop do
  _dummy, timings = Benchmark.realtime_block(:total_time) { monitor }
  sleep 0.5
end
?
looks like a loop to me
Keenan Brock
@kbrock
Dec 08 2017 17:33 UTC
aah
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:33 UTC
me too...
Keenan Brock
@kbrock
Dec 08 2017 17:33 UTC
that is what loop does?
ooooh
we monkey patch that
do you think it is our patch?
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:35 UTC
we monkey patch loop? Or Benchmark?
(I assume he means we patch realtime_block)
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:35 UTC
because I was suggesting a few days ago that it could be Benchmark
Jason Frey
@Fryguy
Dec 08 2017 17:35 UTC
@NickLaMuro I'm not seeing a leak there:
[5] pry(main)> loop do
[5] pry(main)*   _dummy, timings = Benchmark.realtime_block(:total_time) { monitor }
[5] pry(main)*   sleep 1
[5] pry(main)*   GC.start
[5] pry(main)*   puts MiqProcess.processInfo[:memory_usage]
[5] pry(main)* end
174018560
174018560
174018560
174018560...forever
Adam Grare
@agrare
Dec 08 2017 17:35 UTC
how long did you wait?
:laughing:
Jason Frey
@Fryguy
Dec 08 2017 17:36 UTC
does it matter?
Brandon Dunne
@bdunne
Dec 08 2017 17:36 UTC
looks like until forever :smile:
Adam Grare
@agrare
Dec 08 2017 17:36 UTC
haha :)
Jason Frey
@Fryguy
Dec 08 2017 17:36 UTC
yeah duh it says it right there haha
Keenan Brock
@kbrock
Dec 08 2017 17:36 UTC
or maybe 1 day before forever
Jason Frey
@Fryguy
Dec 08 2017 17:36 UTC
just like the loop :trollface:
-1/12
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:36 UTC
this was a "short version" of my output:
$ ruby loop_monitor.rb 22621
36418
36418 X 19
36482
36482 X 19
36548
# note, I ctrl-c'd right here to get some stuff
36548 X 32
36614
36614 X 31
36680
...
that said, at the moment, we are about 50X + on 36680...
(left the first one running to help confirm @Fryguy 's theory that this could be invalid)
Adam Grare
@agrare
Dec 08 2017 17:38 UTC
what does the current_realtime have?
Jason Frey
@Fryguy
Dec 08 2017 17:38 UTC
(For those who don't knwo the -1/12 thing up there... https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF and https://www.youtube.com/watch?v=w-I6XTVZXww )
Adam Grare
@agrare
Dec 08 2017 17:38 UTC
is it not overwriting the hash but growing
Jason Frey
@Fryguy
Dec 08 2017 17:39 UTC
current_realtime will always have a single float (and the thread id but that won't change)
actually technically it will be deleted after the block closes
so empty hash
there's not much else going on there that's the only thing that stuck out
or maybe we're never calling delete_current_realtime
Jason Frey
@Fryguy
Dec 08 2017 17:43 UTC
realtime_store is not (should not be) called directly
even so, if there was a leak in Benchmark.realtime_* then all of this would be evidenced in the ruby memory stats, but we are not leaking ruby objects according to earlier investigations
Adam Grare
@agrare
Dec 08 2017 17:45 UTC
yeah
Keenan Brock
@kbrock
Dec 08 2017 17:45 UTC
I'm still in my external fragmentation statement
allocate a lot, store 1 object, unallocate the first lot
continue over an over again, waiting a little time, free that 1 object
Jason Frey
@Fryguy
Dec 08 2017 17:46 UTC
external fragmentation could def be an issue
Keenan Brock
@kbrock
Dec 08 2017 17:46 UTC
in theory, we've allocated no objects, but the memory is screwed up
Jason Frey
@Fryguy
Dec 08 2017 17:46 UTC
the MALLOC_ARENA_MAX should minimize that, but ¯\_(ツ)_/¯
Keenan Brock
@kbrock
Dec 08 2017 17:46 UTC
never occured to me that the benchmark code would be making it that much worse
but per @jrafanie said, using jemalloc may fix it...
Well, that allocator has less of an issue for this type of fragmentation
So that may support his hypothesis of "the problem is malloc"
Jason Frey
@Fryguy
Dec 08 2017 17:47 UTC
yup, it's worth investigating
Keenan Brock
@kbrock
Dec 08 2017 17:48 UTC
that is great @NickLaMuro
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:48 UTC
welp, looks like @Fryguy 's theory is correct that the GC.start does clear up the leak...
back to the drawing board... again
Jason Frey
@Fryguy
Dec 08 2017 17:49 UTC
sorry...I was hoping you were right :confounded:
Nick LaMuro
@NickLaMuro
Dec 08 2017 17:49 UTC
might have to let these current VMs curn then over the weekend and not touch them, just to make sure that GC.start gets called
Keenan Brock
@kbrock
Dec 08 2017 17:49 UTC
well, can you allocate something in the loop?
Jason Frey
@Fryguy
Dec 08 2017 17:49 UTC
you can dump GC stats on some interval and see if GCs are getting called (and how often for that matter)
Keenan Brock
@kbrock
Dec 08 2017 17:50 UTC
x = 500 * [500 * "x"]
the variable would go out of scope - but keep it around at lest until after the benchmark block
in theory, that should just be a minor load and shouldn't do anything
hmm. we allocate 10's of millions of objects in some of these workers - so the 500 may be too trivial
Keenan Brock
@kbrock
Dec 08 2017 17:58 UTC
wonder if this minor load will introduce the fragmentation that we need
Joe Rafaniello
@jrafanie
Dec 08 2017 18:59 UTC
OMG, don't you people take lunch
ESOMANYMESSAGES
Nick LaMuro
@NickLaMuro
Dec 08 2017 19:00 UTC
<--- central timezone
:sunglasses:
Joe Rafaniello
@jrafanie
Dec 08 2017 19:01 UTC
yeah, I take a late eastern timezone lunch and you're still there ;-)
Nick LaMuro
@NickLaMuro
Dec 08 2017 19:01 UTC
oh god... I am turning into a emoji user...
Keenan Brock
@kbrock
Dec 08 2017 19:01 UTC
isn't an emoji user just a computer user?
Joe Rafaniello
@jrafanie
Dec 08 2017 19:02 UTC
I'm not convinced doing a GC.start is fair in a script we're trying to recreate a leak for
We don't GC.start in the server. We do it when there are enough allocations to warrant one.
NickLaMuro @NickLaMuro has just delayed is lunch a bit more...
Nick LaMuro
@NickLaMuro
Dec 08 2017 19:03 UTC
well, to @Fryguy 's point, after a while, my script stopped leaking, even without the GC.start in place
Joe Rafaniello
@jrafanie
Dec 08 2017 19:03 UTC
The script has to just run long enough and LOG the major_gc_count so that over time, a major GC is done
yeah, I was seeing the same with my Drb tests
it would grow to a high water mark and stay there
Keenan Brock
@kbrock
Dec 08 2017 19:03 UTC
if you don't have load in your monitor, then it will be hard to find a leak
Nick LaMuro
@NickLaMuro
Dec 08 2017 19:04 UTC
well, the point was simulating the small leak that "might" be there in the graph I shared above
^--- that is with all of monitor commented out
slight amount of growth over time, but because we are unsure of the GC situation, hard to tell if that is only a temp bit of growth
anyway, since @jrafanie is getting on my case about it, time for lunch... back in a bit
Joe Rafaniello
@jrafanie
Dec 08 2017 19:08 UTC
Yes, thank you @NickLaMuro. You're not yourself when you're hungry.
Nick LaMuro
@NickLaMuro
Dec 08 2017 19:09 UTC
I am not going to have a snickers for lunch...
Joe Rafaniello
@jrafanie
Dec 08 2017 19:13 UTC
haha
Jason Frey
@Fryguy
Dec 08 2017 19:31 UTC

I'm not convinced doing a GC.start is fair in a script we're trying to recreate a leak for
We don't GC.start in the server. We do it when there are enough allocations to warrant one.

Partially agree. Doing a GC forces you to not accidentally claim that a graph that looks like :chart_with_upwards_trend: is automatically a leak. It's very highly probable that the monitor loop ends up GC'ing at some point and putting in a manual GC just "speeds that up". That being said, if you think the issue is more of a small Ruby object with a large memory footprint, then yeah, you wouldn't do that, but I don't think we have any evidence pointing to that kind of situation here (unlike how we most obviously had that in vmware land with the giant XMLs)

Also, if we expect c-based literal leaks, then GC is perfect for that, because it removes all Ruby from the picture and the only thing left has to be mallocs
Keenan Brock
@kbrock
Dec 08 2017 21:14 UTC
@NickLaMuro finding any more neat stuff?
Nick LaMuro
@NickLaMuro
Dec 08 2017 21:16 UTC
no, but I am going to be looking into your theory shortly by modifying heapfrag to take the dumps I have and if the heap gets fragmented over time
though, I guess we could just look at total heap slots in the GC.stats to determine that
Joe Rafaniello
@jrafanie
Dec 08 2017 21:47 UTC
@NickLaMuro I can't get my appliances to grow miq server processes :shrug: I guess I'll look at other reported logs
It looks like you enable cap and u collector and processor and that's about it
Nick LaMuro
@NickLaMuro
Dec 08 2017 21:49 UTC
@jrafanie do you have it hooked up to a provider?
Joe Rafaniello
@jrafanie
Dec 08 2017 21:50 UTC
yes
a small provider
vmware
I have a broker and everything
Nick LaMuro
@NickLaMuro
Dec 08 2017 21:51 UTC
Mmk, just checking the obvious.