These are chat archives for ManageIQ/manageiq/performance

13th Oct 2015
Joe Rafaniello
@jrafanie
Oct 13 2015 12:57
lots of old objects, going to look more at what old objects there are and see what we can do to remove some of them
not sure what else to do. Ruby handles the flood of new allocations by expanding its heap and eventually cleans up, but the OS doesn't reclaim any memory... I don't know if it's because pages/slots are fragmented or something else... OSX seems to reclaim some memory
I'll try to get the same statistics on 2.0.0 where these values exist
Matthew Draper
@matthewd
Oct 13 2015 13:21
@jrafanie does any of that matter?
What are we trying to fix?
I thought the problem was that peak usage was too high
Joe Rafaniello
@jrafanie
Oct 13 2015 13:23
Yes, that's the problem and the large number of old objects created in refresh not being collected until much later drives up the memory usage
Right?
If fragmentation of those old objects, unnecessary allocations, or aggressive GC heap growth factors are driving that number higher, addressing any of them could be a way to decrease memory
Also, if the high water mark memory was eventually released after the "storm", maybe the temporary higher memory would be tolerable
Matthew Draper
@matthewd
Oct 13 2015 13:28
Fragmentation won't drive it up… fragmentation means retention of the heap pages, which will then continue to be used
If ultimate release is all we need, then we just do the work in a short-lived fork, or restart the worker.
^^ why we need to know what we're aiming for
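A minimal sketch of the short-lived fork idea; `do_refresh` and `ems_id` are hypothetical placeholders, not actual worker code:

```ruby
# Run the allocation-heavy refresh in a child process. Whatever heap the
# child grows is returned to the OS in full when it exits, so the
# long-lived worker retains nothing.
pid = fork do
  do_refresh(ems_id) # hypothetical refresh entry point
end
Process.wait(pid)
```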
Joe Rafaniello
@jrafanie
Oct 13 2015 13:29
Yeah, @matthewd, I don't know if that's satisfactory
Matthew Draper
@matthewd
Oct 13 2015 13:29
Exactly
I don't see how we can fix anything, if no-one knows what is satisfactory :worried:
Joe Rafaniello
@jrafanie
Oct 13 2015 13:31
@dmetzger57 any idea what's satisfactory?
Keep in mind, we will run into this problem in other high allocation sections of code not in a tight loop
I wonder if cap & u have the same problem
Also, I assume event storms will also be a problem
Dennis Metzger
@dmetzger57
Oct 13 2015 13:34
my understanding is the goal is for peak memory usage to match 5.4. realistically, there’s gonna be some growth, but not 4 - 8 times growth. yes, Alex saw the memory load issue with cap & u also.
Jason Frey
@Fryguy
Oct 13 2015 13:36
How and when will Ruby give back to the OS? If at all?
Joe Rafaniello
@jrafanie
Oct 13 2015 13:40
> my understanding is the goal is for peak memory usage to match 5.4. realistically, there’s gonna be some growth, but not 4 - 8 times growth. yes, Alex saw the memory load issue with cap & u also.

I don't know how that's achievable if we're 30+% faster
Oleg Barenboim
@chessbyte
Oct 13 2015 13:41
I am blown away that with moving to latest Ruby and Rails that our performance/memory has taken a step back
Joe Rafaniello
@jrafanie
Oct 13 2015 13:45
The things that made us faster before (avoiding N+1 queries by prefetching data in an outer loop) bite us on 2.2.3 because of how long those variables stay in scope
So, these charts I think are most telling: https://docs.google.com/spreadsheets/d/1LH9JpLJPoWSlpWhQmEK-jT7HYqhy8glo6gsIvZNMat4/edit#gid=2088514534, especially the live slots vs. free slots... it's interesting that the tomb pages are very very slowly reclaimed over the span of the charts (~1 hour)
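For reference, a sketch of the Ruby 2.2 `GC.stat` counters behind those charts (these are the standard 2.2 key names, not the spreadsheet script itself):

```ruby
# Print the slot and page counters being graphed. heap_tomb_pages are
# empty pages the GC is still holding but has not yet released.
s = GC.stat
puts "live slots: #{s[:heap_live_slots]}"
puts "free slots: #{s[:heap_free_slots]}"
puts "eden pages: #{s[:heap_eden_pages]}"
puts "tomb pages: #{s[:heap_tomb_pages]}"
```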
Dennis Metzger
@dmetzger57
Oct 13 2015 13:51
note, i did say some growth (yes, that needs to be defined) is acceptable; what is not acceptable is requiring the customer to add 4 to 8 times the vRAM to an appliance, if they even have that much memory available for allocation. speed vs. size is a fine dance; in the end there is only so much memory available (allocatable to us) to grow into. being faster helps when asking the customer to allocate more memory, but there is a limit (and that limit does need to be clearly defined / required).
Oleg Barenboim
@chessbyte
Oct 13 2015 13:52
very sad situation
Dennis Metzger
@dmetzger57
Oct 13 2015 13:54
it’s even harder when you’re going to a current customer saying give us X more memory, where X is substantial. we need to articulate how much the speed increase can help by requiring fewer appliances; that will help, just not sure if the requested increase can be physically accommodated.
Oleg Barenboim
@chessbyte
Oct 13 2015 13:56
glad we have a performance/scalability team - but not so happy that we are hitting new issues that are not easily solved
Matthew Draper
@matthewd
Oct 13 2015 14:01
It was my understanding that we'd seen 4-8x growth in the amount of additional memory that gets allocated, by the one process actually doing a refresh, during the refresh.
Multiplier-sized changes to total appliance allocation sounds much worse than that.
Dennis Metzger
@dmetzger57
Oct 13 2015 14:05
QE was (is) running tests on 5.5. from what I was told, they initially tried going from 6Gb to 8Gb (then 10Gb) and were not able to complete their tests; they ran into swapping on the appliance. those are the only hard numbers i’ve gotten thus far (well below 4 - 6 times total); waiting to hear the current QE results running with larger sized appliances.
the 4 - 8 times is a number that has been brought up on calls, but i’ve not seen documented results backing that up.
Matthew Draper
@matthewd
Oct 13 2015 14:07
@jrafanie on that spreadsheet, what's the refresher actually doing for that hour?
Joe Rafaniello
@jrafanie
Oct 13 2015 14:08
it does the initial refresh and not much after
Matthew Draper
@matthewd
Oct 13 2015 14:09
Okay, so it's ~idle after 12:52… it's not taking the whole time to do the thing.
Dennis Metzger
@dmetzger57
Oct 13 2015 14:09
@chessbyte we need to work with john (or whomever) to define the minimum required memory that is acceptable for 5.5.
Matthew Draper
@matthewd
Oct 13 2015 14:10
@jrafanie I guess you'll see what we can see (based on what's measurable with older GC), but it would surprise me to learn the slot live/free graph, for example, looked measurably different on 2.0
That is, Ruby's inability to efficiently return memory to the OS is not a new thing
Joe Rafaniello
@jrafanie
Oct 13 2015 14:17
it looks like 2.0 has heap_live_num vs. heap_free_num, whatever that means... not sure if that is slot count or what: {:count=>19, :heap_used=>138, :heap_length=>138, :heap_increment=>0, :heap_live_num=>47146, :heap_free_num=>15278, :heap_final_num=>0, :total_allocated_object=>404379, :total_freed_object=>357233}
I could generate graphs based on those numbers if it's helpful
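A sketch of normalizing the renamed keys so both runs can feed the same charts; the 2.0-to-2.2 mapping below is inferred from the two hashes, not an official table:

```ruby
# Map Ruby 2.0 GC.stat key names onto their apparent 2.2 equivalents.
KEY_MAP = {
  heap_live_num:          :heap_live_slots,
  heap_free_num:          :heap_free_slots,
  heap_final_num:         :heap_final_slots,
  total_allocated_object: :total_allocated_objects,
  total_freed_object:     :total_freed_objects,
}.freeze

def normalized_gc_stat
  GC.stat.each_with_object({}) do |(key, value), out|
    out[KEY_MAP.fetch(key, key)] = value
  end
end
```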
Matthew Draper
@matthewd
Oct 13 2015 14:18
They do sound like the same thing
And it seems familiar that various GC stats got renamed to clarify their meaning.. circa 2.1, maybe?
Joe Rafaniello
@jrafanie
Oct 13 2015 14:27
yeah, Aman Gupta's blog has a lot of this info and how GitHub tweaks the GC using env variables: http://tmm1.net/ruby21-rgengc/, although that was before 2.2 changed some of that information yet again
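Those tunables are plain environment variables read at boot, so they have to be in place before the worker process starts; a sketch with illustrative values (roughly the ones from the blog post, not the numbers tested here) and a hypothetical worker script path:

```ruby
# GC tuning variables must be in the environment before Ruby boots, so
# set them in the parent and spawn the worker. Values are illustrative.
gc_env = {
  "RUBY_GC_HEAP_INIT_SLOTS"       => "600000",
  "RUBY_GC_HEAP_FREE_SLOTS"       => "600000",
  "RUBY_GC_HEAP_GROWTH_FACTOR"    => "1.25",
  "RUBY_GC_HEAP_GROWTH_MAX_SLOTS" => "300000",
}
pid = spawn(gc_env, "ruby", "lib/workers/bin/worker.rb") # hypothetical path
Process.wait(pid)
```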
Oleg Barenboim
@chessbyte
Oct 13 2015 15:00
so, I just had an out-of the-box thought on performance
for queue-based workers, what if we exit the process after handling a message from the queue whenever memory is above a certain threshold?
would that solve the memory issue?
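A sketch of that idea, assuming hypothetical `dequeue`/`handle` helpers and Linux procfs for the RSS reading; the point is that the check happens between messages, never mid-message:

```ruby
MEMORY_THRESHOLD = 500 * 1024 * 1024 # 500 MB, illustrative value

# Resident set size of this process, via Linux procfs.
def rss_bytes
  File.read("/proc/self/status")[/VmRSS:\s+(\d+) kB/, 1].to_i * 1024
end

loop do
  message = dequeue              # hypothetical queue helper
  (sleep 1; next) unless message
  handle(message)                # hypothetical message handler
  # Exit cleanly *between* messages so a fresh worker can be started.
  exit(0) if rss_bytes > MEMORY_THRESHOLD
end
```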
Matthew Draper
@matthewd
Oct 13 2015 15:05
@chessbyte yeah, I mentioned that above… it all depends on what the memory issue actually is
That helps if the worker's retaining memory, and that's causing us a problem… but it doesn't change the peak usage during a refresh
Oleg Barenboim
@chessbyte
Oct 13 2015 15:07
@matthewd I agree - but maybe it would help to make a patch like that and see what @akrzos sees with that
Joe Rafaniello
@jrafanie
Oct 13 2015 15:40
Ok, so @chessbyte and @matthewd I did the same test on 2.2.3/2.0.0 with the same single vmware ems, collecting memory usage and size as measured by MiqProcess.processInfo along with a csv of GC.stat information: https://docs.google.com/spreadsheets/d/1LH9JpLJPoWSlpWhQmEK-jT7HYqhy8glo6gsIvZNMat4/edit#gid=2088514534
I updated the 2.2.3 example to contain nearly the same span of time, 15 minutes
Note, we create many more strings in 2.0.0 compared to 2.2.3
there's two sheets there, one for 2.0.0 and one for 2.2.3
Oleg Barenboim
@chessbyte
Oct 13 2015 15:42
so we use much less memory in 2.2.3 than in 2.0.0, BUT Ruby allocates more memory in 2.2.3 than 2.0.0
Matthew Draper
@matthewd
Oct 13 2015 15:42
Yeah, so that seems consistent with my expectation… we were hanging on to heap pages just as hard before
Joe Rafaniello
@jrafanie
Oct 13 2015 15:47
@chessbyte I think it's slightly different: 2.2.3 allocates fewer objects but consumes 200+ MB more resident memory as reported by the OS
Jason Frey
@Fryguy
Oct 13 2015 20:25
Some updates
1) @jrafanie changed some of the graphs on the spreadsheet ... in particular stacking the heap free/live graph makes it clearer
2) @jrafanie used tuning numbers from Aman Gupta just to see, and the results came out really nice...they are in the second tab of the spreadsheet
3) We reviewed the VMware refresh code and we found some places where we can release the raw data early, allowing the GC to throw it away. I've made a code patch to reorganize the code a little, and @jrafanie is testing now
Note that if that goes well, I think that #3 might make a real difference on targeted refreshes
with targeted refreshes we filter the VC data and can throw away unfiltered data early...in the full EMS refresh case, there is no filtering, so that particular line of code is a noop
However, we also found we can throw away the filtered VC data early as well, which would help both targeted and full
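A sketch of the shape of that change; the method names are illustrative, not the actual VMware refresh code:

```ruby
# Drop references to big intermediate data as soon as it is no longer
# needed, instead of keeping it in scope for the whole refresh.
def refresh(ems, targets)
  raw      = fetch_inventory(ems)     # large VC payload
  filtered = filter_for(raw, targets) # a no-op for a full EMS refresh
  raw      = nil                      # raw data is now collectable
  hashes   = parse(filtered)
  filtered = nil                      # filtered data collectable too
  save_inventory(ems, hashes)
end
```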
Oleg Barenboim
@chessbyte
Oct 13 2015 20:34
and what about exiting the process after a full refresh on VMware?
Jason Frey
@Fryguy
Oct 13 2015 20:37
that is really overkill
plus the code already does that if you set the memory policy on a worker properly
Matthew Draper
@matthewd
Oct 13 2015 20:38
We theoretically could apply a more nuanced policy, because unlike a straight limit (which I'm blindly assuming is what the existing policy option is), we know how much of our current usage is empty calories
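One way to read "empty calories" is the free-but-retained slots `GC.stat` reports; a sketch of a policy check that discounts them (the 40-byte slot size is the usual 64-bit RVALUE size, and none of this is the existing policy code):

```ruby
SLOT_SIZE = 40 # bytes per RVALUE slot on 64-bit MRI

# Estimate "real" usage by discounting free-but-retained heap slots.
def live_estimate(rss_bytes)
  rss_bytes - GC.stat[:heap_free_slots] * SLOT_SIZE
end

def over_limit?(rss_bytes, limit_bytes)
  live_estimate(rss_bytes) > limit_bytes
end
```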
Oleg Barenboim
@chessbyte
Oct 13 2015 20:39
just that Satoe configured our appliance with 10gb instead of 6gb, because of the memory issues
that won't fly for GA
Joe Rafaniello
@jrafanie
Oct 13 2015 20:45
> empty calories

:laughing:
Matthew Draper
@matthewd
Oct 13 2015 20:47
If our overall goal is determined by the appliance size (which obviously does make sense), I guess we need to look at the by-process breakdown, and confirm that the refresher's the only one that's grown disproportionately
.. and that the others' proportionate growth is within our acceptable range
Otherwise we may end up trying very hard to squeeze the refresher, when it's not all that exclusively at fault
Jason Frey
@Fryguy
Oct 13 2015 20:48
THIS ^
that's why I was trying to understand early on if the entire appliance bumped or just specific workers
Oleg Barenboim
@chessbyte
Oct 13 2015 20:54
@Fryguy @matthewd @jrafanie @dmetzger57 just forwarded you the email about going to 10gb on the appliance
please sync up with dajo on his findings
Joe Rafaniello
@jrafanie
Oct 13 2015 20:55
@dmetzger57 mentioned something about Alex seeing higher usage on c & u workers although I didn't see concrete numbers for them here: https://docs.google.com/spreadsheets/d/18PTQUCgh-gnvJjXHL5qaTPQnJ5ZoYyHpuFe6ZhQz4uU/edit#gid=1719100508
Oleg Barenboim
@chessbyte
Oct 13 2015 20:57
telling customers that upgrading to CF 4.0 from CF 3.2 will require 66% more RAM per appliance will be a very bitter pill to swallow
Jason Frey
@Fryguy
Oct 13 2015 20:57
yeah, I don't think anyone really wants that
email sucks...that is all :)
Oleg Barenboim
@chessbyte
Oct 13 2015 21:00
yes, I was just telling an open-source meetup group yesterday what I think of mailing list for open-source projects ;-)
Dennis Metzger
@dmetzger57
Oct 13 2015 21:00
i’m looking through my notes / email now, Alex at some point indicated he saw memory usage higher on 5.5 than 5.4 when he ran initial cap n u tests. I don’t recall seeing specific value(s)
Oleg Barenboim
@chessbyte
Oct 13 2015 21:01
dajo is providing more outside-in evidence about how often the appliance needs to be rebooted
that is why I was suggesting more draconian solutions like having some workers terminate themselves after some memory threshold is reached
not being killed externally at an unknown point in their processing, but right after a worker processed a message and before it looks for the next one
again - I am VERY surprised that Ruby 2.2 is consuming so much more memory from the OS
Dennis Metzger
@dmetzger57
Oct 13 2015 21:04
in bz 1267697 Alex implies roughly a 600MB delta between 5.4 and 5.5 when enabling C&U
Oleg Barenboim
@chessbyte
Oct 13 2015 21:04
that is an insane amount of additional memory to consume
is that the C&U Collectors or Processors?
Dennis Metzger
@dmetzger57
Oct 13 2015 21:08
ok, his breakdown says about 148MB per collector and 107MB per processor; given 2 of each by default, that's approaching 600
Matthew Draper
@matthewd
Oct 13 2015 21:09
@dmetzger57 that's additional? Do we know what their baselines are?
Dennis Metzger
@dmetzger57
Oct 13 2015 21:09
baselines are not in the ticket, at least I didn’t see them there
Jason Frey
@Fryguy
Oct 13 2015 21:17
@matthewd Remember the thing where the root machine_context would hold a reference?
How does that get cleared?
or did we ever figure out how that gets cleared?
Matthew Draper
@matthewd
Oct 13 2015 21:18
@Fryguy do a require
The most recent call to require leaves its self rooted
Or, for a different flavour of megahack, move the require from the EMS to something else, maybe ¯\_(ツ)_/¯
I guess even Kernel.send :require, '..' should do it?
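Purely speculative, but the workaround being floated would look something like this (`ems.refresh` is a hypothetical stand-in for the call whose data ends up rooted):

```ruby
ems.refresh # a require issued in here leaves its receiver rooted
# One more cheap, already-loaded require becomes "the most recent call",
# displacing whatever the previous require left rooted.
Kernel.send(:require, 'English')
```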
Jason Frey
@Fryguy
Oct 13 2015 21:40
o_O
is that expected behavior?
Matthew Draper
@matthewd
Oct 13 2015 21:48
I'm gonna go with "no" ;)
Joe Rafaniello
@jrafanie
Oct 13 2015 22:19
@dmetzger57 it would be really helpful to confirm that upstream has the exact same memory increase problem as 5.5 and have others run baselines off of upstream if that is the case... we can backport any changes to 5.5
Would it be helpful to get this in upstream for baseline numbers? https://github.com/jrafanie/manageiq/blob/performance_testing/gems/pending/util/ruby_gc_logger.rb... I can make it optional on the workers... that's what I'm using to log the res, virt memory, GC.stat, ObjectSpace as csv for the charts
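For anyone who can't pull the branch, a rough sketch of what such a logger does; this is not the actual ruby_gc_logger.rb, and it assumes Linux procfs for the resident size:

```ruby
require 'csv'
require 'time'

# Background thread that samples resident memory and GC.stat to a CSV
# on a fixed interval.
def start_gc_statistics_thread(path, interval = 30)
  Thread.new do
    keys = GC.stat.keys
    CSV.open(path, "w") do |csv|
      csv << ["time", "rss_kb", *keys]
      loop do
        rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+) kB/, 1]
        csv << [Time.now.utc.iso8601, rss_kb, *GC.stat.values_at(*keys)]
        sleep interval
      end
    end
  end
end
```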
Dennis Metzger
@dmetzger57
Oct 13 2015 22:28
i did some runs today, including upstream, that show utilization as an aggregate (not per process / worker); i’ll be looking at that data tonight and will put together a sheet
Joe Rafaniello
@jrafanie
Oct 13 2015 22:29
Doing just the Aman Gupta GC parameters, we drop ~730 MB RES -> ~600 MB and ~1666 MB VIRT -> ~1260 MB, before any other code changes to shorten object lifetimes in the refresh or reduce allocations
Dennis Metzger
@dmetzger57
Oct 13 2015 22:30
nice step in the right direction.
Joe Rafaniello
@jrafanie
Oct 13 2015 22:30
For reference, upstream with 2.0.0 had ~488 MB RES, ~806 MB VIRT
VIRT is still ridiculously higher
Dennis Metzger
@dmetzger57
Oct 13 2015 22:32
where are you calling start_gc_statistics_thread for your collection runs?
Joe Rafaniello
@jrafanie
Oct 13 2015 22:32
We can obviously tweak those numbers more for our application... I just wanted to confirm the GC env variables can help
look at my branch, one sec...
@dmetzger57 I can clean it up and give you a branch you can test against upstream tomorrow if you want
from the csv, we can probably generate charts automatically instead of the manual google spreadsheets process I'm using now
Dennis Metzger
@dmetzger57
Oct 13 2015 22:39
sounds good. long term project is to get performance runs included in the nightly process (raw data and charts) so we can catch any significant growth issues early in the release cycle
Joe Rafaniello
@jrafanie
Oct 13 2015 22:41
yeah, agreed
Dennis Metzger
@dmetzger57
Oct 13 2015 22:41
@jrafanie it’s nice having @akrzos’ environment to work with :smile:
Joe Rafaniello
@jrafanie
Oct 13 2015 22:41
yeah
Matthew Draper
@matthewd
Oct 13 2015 22:42
Hopefully getting nightly runs might involve scripting the creation of a vcsim thingy
Joe Rafaniello
@jrafanie
Oct 13 2015 22:42
Good night all, more performance tomorrow ;-)
Dennis Metzger
@dmetzger57
Oct 13 2015 22:42
good night @jrafanie
Joe Rafaniello
@jrafanie
Oct 13 2015 22:42
Good morning @matthewd ;-)
Dennis Metzger
@dmetzger57
Oct 13 2015 22:46
@matthewd better than scripting dynamic creation of the vc simulators, gonna try to get dedicated resources to run our own instances.
@matthewd what time is it there?
Matthew Draper
@matthewd
Oct 13 2015 22:48
I'd still like to commit a script that builds it… both for external people, and anyone with a higher-than-average latency to a US datacenter ;)
0915… bed time :P
.. though I have no idea what impact licencing constraints would have on either of the above.
Dennis Metzger
@dmetzger57
Oct 13 2015 22:52
i need to revisit the license requirements to run (well, and obtain) the vc from vmware
last time i looked was well over a year ago