These are chat archives for ManageIQ/manageiq/performance

28th Jan 2019
Daniel Berger
@djberg96
Jan 28 13:23
Added to .bash_profile:
# Preload jemalloc if it's installed
LIBJEMALLOC=$(whereis -b libjemalloc | cut -d ' ' -f2)
if [ -n "$LIBJEMALLOC" ] && [ -f "$LIBJEMALLOC" ]; then
  echo "libjemalloc preloaded"
  export LD_PRELOAD="$LIBJEMALLOC"
else
  echo "libjemalloc not found, skipping"
fi
Himanshu Roy
@hroyrh
Jan 28 13:58
@kbrock ping
Keenan Brock
@kbrock
Jan 28 13:58
ping
Himanshu Roy
@hroyrh
Jan 28 13:59
Hi Keenan, this is regarding the recommendation you shared for the postgres-tuning experiment results. You mentioned changing the "performance_collection_interval" from 3 mins -> 10 mins
If a provider is generating events continuously, will this change make a significant impact? If we increase the interval, there will be more processing to do in each window, which over time should average out to a similar number of collector_workers
just wanted to clarify this, before we repeat the experiment
@kbrock ^
Keenan Brock
@kbrock
Jan 28 14:14

@hroyrh we only process the events every 50 minutes (or hour for storage)
so checking every 3 minutes is not really necessary. The process of checking is non-trivial, so it ends up generating more work. I believe checking every 10 minutes will have the same data but with less work.

As a second plan, I wonder about changing the timings from 50 minutes to 1 hour, and checking every 30 minutes (instead of every 3). I think it will generate the same results and reduce work even further (you want to pick a check frequency that divides easily into the collection interval)

Now if the end user has realtime alerting on metrics, then this would potentially delay alerts from every 10 minutes to every 30 minutes. But since we are seeing large backlogs, I don't think alerts are being sent out every 10 minutes but rather much less frequently. So the delay may actually decrease from hours to 30 minutes.
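Rough sketch of the interval arithmetic behind that suggestion (illustrative values only, not actual settings keys):
capture_interval = 50 * 60                  # seconds between captures for any one object
[3, 10, 30].each do |check_minutes|
  passes = (capture_interval / (check_minutes * 60.0)).ceil
  puts "checking every #{check_minutes} min => ~#{passes} coordinator passes per capture cycle"
end
# => ~17 passes at 3 minutes, 5 at 10 minutes, 2 at 30 minutes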

Peter McGowan
@pemcg
Jan 28 14:30
@kbrock as I understand it (I'm happy to be wrong), the 50 minutes comes from the fact that VMware only keeps realtime metrics for 60 minutes, so queueing a new data collection message every 50 minutes allows for up to 10 minutes of message queue time without losing data
the 3 minutes is how often the coordinator checks that any object’s metrics are older than the threshold
So if we change the 3 minutes to 10, we’re only likely to generate ~3x more messages per check
so the mean queue length will stay the same, but with a bigger saw-tooth profile (won’t it?)
that’s how Tom Hennessy explained it to me
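A back-of-the-envelope check of that intuition (assuming a hypothetical 10,000-VM estate and the 50-minute capture window discussed above):
vms = 10_000                                  # hypothetical estate size
[3, 10].each do |check_min|
  per_check = (vms * check_min / 50.0).round  # objects gone stale since the last check
  checks_hr = (60.0 / check_min).round
  puts "every #{check_min} min: ~#{per_check} msgs/check, #{checks_hr} checks/hr, ~#{per_check * checks_hr} msgs/hr"
end
# ~12,000 messages/hour either way; the longer interval just queues them in bigger, less frequent bursts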
Keenan Brock
@kbrock
Jan 28 15:07
@pemcg that logic is 100% sound but... (you knew that was coming)
I think vmware keeps metrics for 4 hours. (storages being different)
If we collect all metrics every 50 minutes, why are we checking every 3?
Seems we could get away with checking every 50 minutes. no?
(there is a slight fallacy in that last statement, but it does make you wonder)

also we check every 3 minutes, get a list of metrics to collect, and generate that work.
and that work will take ~20-30 minutes to do
Then we check 3 minutes later, "are we done yet?"
we gather a bunch of records that "need to be done" and we submit them.
Most (all) of the work requested the second time was just requested 3 minutes ago.

So this is putting a bunch of load on the queue

checking if we have alerts enabled on a particular object ends up being the most expensive part of all
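The pattern being described, as a rough sketch (method and field names are illustrative, not the actual ManageIQ code):
# one coordinator pass, repeated every 3 minutes
def coordinator_pass(vms, queue, now)
  stale = vms.select { |vm| now - vm[:last_capture] > 50 * 60 }
  stale.each { |vm| queue << vm[:id] unless queue.include?(vm[:id]) }
end
# the captures queued by one pass take ~20-30 minutes to drain, so the next
# several passes mostly re-discover (and re-check alerts for) the same objects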
Adam Grare
@agrare
Jan 28 15:15

@kbrock @pemcg from vSphere Performance Data Collection

Real-time data collection – An ESXi Server collects data for each performance counter every 20 seconds and maintains that data for one hour.

Keenan Brock
@kbrock
Jan 28 15:15
and when we have a backlog in our queue that is days long, why do we still run the messages?
if the data is gone?
@agrare you should just rewrite this man ;)
Peter McGowan
@pemcg
Jan 28 15:16
I agree that having a backlog is bad
Adam Grare
@agrare
Jan 28 15:16
@kbrock you and I had it working :)
Keenan Brock
@kbrock
Jan 28 15:16
the way we implemented back pressure (the feedback loop where a system says it is bogged down and requests less work to be done) is broken here
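Back pressure in the abstract, as a minimal sketch (illustrative only, not the MiqQueue implementation):
MAX_BACKLOG = 10_000
def enqueue_capture(queue, vm_id)
  return :shed if queue.size > MAX_BACKLOG   # bogged down: ask for less work instead of piling on
  queue << [:capture_metrics, vm_id]
end
# today we keep queueing captures even when the backlog is days deep and the
# realtime data they refer to has already aged out on the provider side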
Adam Grare
@agrare
Jan 28 15:16
maintaining compat with all the existing features wasn't really feasible though
Peter McGowan
@pemcg
Jan 28 15:16
I think we check every 3 so that at any point we’re only generating messages for approximately 1/16th of the VM estate
Keenan Brock
@kbrock
Jan 28 15:17
well, the first time in, we request for all
then 3 minutes later, we request a little bit more
I guess the delay in processing the objects does cause it to distribute
Peter McGowan
@pemcg
Jan 28 15:18
isn’t it - then 3 minutes later check which objects have metrics older than 50 minutes, and only queue a new message for such objects
over time it evens out pretty well
Adam Grare
@agrare
Jan 28 15:18
are we actually checking fewer objects every 3 minutes? I thought we were checking everything that was tagged for alerts
Keenan Brock
@kbrock
Jan 28 15:19
we check everything, and get a bunch of objects, then submit that full list (but the queue throws out duplicates in a way)
Peter McGowan
@pemcg
Jan 28 15:19
I think we still check every 3, but if an alert is defined then we queue a new message if the existing metrics are older than 20 minutes
Keenan Brock
@kbrock
Jan 28 15:20
but 45% of our processing is determining if the objects have realtime alerts
and 25% of our processing is determining if we just asked for this collection 3 minutes ago
that last 25% puts a bunch of contention on the queue too
Peter McGowan
@pemcg
Jan 28 15:20
so personally I think we should drop the concept of realtime alerting
noticing something that happened maybe 20 minutes ago isn't really realtime :-)
Keenan Brock
@kbrock
Jan 28 15:21
that was the #1 thing I put into Dennis's request for next version of the product - dropping realtime alerts
heh
@pemcg this is a good question for you
we have the ability to turn on metrics for a particular vm
I assume that is because we're trying to reduce load
if it were "free" to check metrics for all of an ems, could we just drop that concept?
just turn on for an ems
to be honest, implementing that ended up being very complicated - much easier/quicker to just check for a full ems
also we have this concept of checking each object every 50 minutes.
if it were cheaper to just check all vms in an ems every 50 minutes, could we drop the 50 minute timer per vm?
Peter McGowan
@pemcg
Jan 28 15:24
@kbrock do you mean collecting ad-hoc metrics?
Keenan Brock
@kbrock
Jan 28 15:24
ugh
knew you'd find something wrong with my plan
Peter McGowan
@pemcg
Jan 28 15:24
I think that was an OpenShift provider feature
Keenan Brock
@kbrock
Jan 28 15:25
our typical workload:
check every vm in an ems, how long has it been, is it turned on? request it
Adam Grare
@agrare
Jan 28 15:25
that just lets you see the native provider metrics, but we don't do anything (e.g. rollups) with ad-hoc metrics
Keenan Brock
@kbrock
Jan 28 15:25
desired workflow:
check ems with enabled collection - collect all objects in that ems
Peter McGowan
@pemcg
Jan 28 15:25
Exactly, so we still need regular metrics collections for rollups, chargeback, rightsizing etc
Keenan Brock
@kbrock
Jan 28 15:26
it will get rid of an N+1 (fetching each vm in an ems one at a time vs all vms in an ems at once)
Peter McGowan
@pemcg
Jan 28 15:26
but we need to spread the load of the collection through the time period, otherwise we hammer the ems and the cfme workers
Adam Grare
@agrare
Jan 28 15:26
@pemcg the biggest problem that we found re: vmware metrics collection is that we are asking 1 vm at a time
Keenan Brock
@kbrock
Jan 28 15:26
the 50 minutes and the enabled flag - they seem like artificial business logic introduced because it was too slow to collect metrics
Peter McGowan
@pemcg
Jan 28 15:27
so little and often would seem to be the most efficient?
Adam Grare
@agrare
Jan 28 15:27
the right way to do anything in vSphere is in batches, we found that with ~250-500 VMs in a batch we could collect metrics from 10,000 vms in ~3-4 minutes
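A minimal sketch of that batching idea (collect_in_batches and query_perf are hypothetical stand-ins, e.g. for a single vSphere QueryPerf call carrying many query specs):
BATCH_SIZE = 250
def collect_in_batches(vm_refs)
  vm_refs.each_slice(BATCH_SIZE) do |batch|
    yield batch    # one provider round trip for the whole batch instead of one per VM
  end
end
# usage sketch: collect_in_batches(vm_refs) { |batch| query_perf(batch) }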
Peter McGowan
@pemcg
Jan 28 15:27
@agrare yes if we could collect for multiple VMs in one go that would be great
Adam Grare
@agrare
Jan 28 15:27
but right now there is so much logic on the enqueue side, and we enqueue vm-by-vm, that it prevents us from batching anything
IMO fix the batching problem and the rest of the issues become far less of a problem
Keenan Brock
@kbrock
Jan 28 15:28
all_vms = Vm.where(:ems_id => ems_id).pluck(:id)
all_vms.each { |vm_id| collect(vm_id) } # N+1 queries
# vs
collect(ems_id) # 1 query
adam - verify
it takes ~30 seconds to collect all
it takes 10 seconds to collect 1 (but it is N+1 times)
Peter McGowan
@pemcg
Jan 28 15:29
Could we merge/batch collection messages like we do for targeted refresh messages?
Adam Grare
@agrare
Jan 28 15:29
@kbrock ack something like that
Keenan Brock
@kbrock
Jan 28 15:30
but the problem:
when we wanted to rewrite it, we were told we had to keep the per vm logic.
my question to you - is that really a business requirement? or is it because we fear it will be too slow without it?
Adam Grare
@agrare
Jan 28 15:30
@pemcg we tried that but since we have per-ems-type not per-ems metrics collector workers we can't guarantee that we are dequeuing vms from the same provider
Peter McGowan
@pemcg
Jan 28 15:30
hmm, ok
I think the business requirement is relatively simple - all metrics on all objects should be 'current'
Keenan Brock
@kbrock
Jan 28 15:31
do people really want to disable a single vm / or even a single ems?
Peter McGowan
@pemcg
Jan 28 15:31
with no gaps, which would render chargeback inaccurate
Keenan Brock
@kbrock
Jan 28 15:32
but we can't collect the gaps because that data would probably be from >50 minutes ago :(
Peter McGowan
@pemcg
Jan 28 15:32
@kbrock I’ve never heard that, I’ve only ever known all or nothing for C&U requirement
Keenan Brock
@kbrock
Jan 28 15:32
thnx
I assumed that was the case but... I'm not all-knowing when it comes to talking with customers. Someone like yourself knows better
well, the POC was quick, but trying to implement all this convoluted timer logic made it not work so well
Adam Grare
@agrare
Jan 28 15:34
Yeah we can collect for specific clusters, hosts, and datastores
And also separately label individual objects for alerting
Figuring out what to collect for is a huge perf issue by itself, right @kbrock?
Keenan Brock
@kbrock
Jan 28 15:41
yea, it is frustrating that determining realtime alerts (useless) and determining if we already collected exactly 50 minutes ago (a problem from a bad implementation) are the vast majority of our performance issue
then the only problem left is saving metrics - which is surprisingly slow. Again, we're trying to not store duplicates and we do odd stuff (possibly a problem from a bad implementation)
Himanshu Roy
@hroyrh
Jan 28 15:55
@kbrock @pemcg @agrare thanks a lot for clarifying things with the metrics-collection-interval
So, from my understanding, when we do a check every 3 mins, it incurs an overhead because we have to figure out which objects' metrics collection has to be enqueued. In that case, shouldn't increasing the 3-min interval reduce the overhead a bit, since we will be doing that extra work fewer times?
What could the possible impact(s) on users be, in real-world scenarios, if we increase that interval?
Keenan Brock
@kbrock
Jan 28 16:01
realtime alert users will get messages less frequently (every 10-20 minutes currently)
but I challenge that number if we have backlogs
if the check every 10 minutes reduces the backlog, then in theory, we would send out realtime alerts more frequently. (but again, that is suspect and why I want @pemcg to chime in - he has better business knowledge than I do)
Peter McGowan
@pemcg
Jan 28 16:18
@kbrock happy to help all I can
Keenan Brock
@kbrock
Jan 28 16:18
@pemcg really appreciate it
frustrated with the metrics collection process. Hard to understand which constraints are real (real business / vmware needs) and which ones are introduced because the current process was slow.
Peter McGowan
@pemcg
Jan 28 16:21
one of the challenges of C&U has been that it silently fails, or at least silently fails to keep up. Many (most even) customers wouldn't know to scale the number of workers, wouldn't realise that the queue was backing up to 10000’s of messages, and wouldn't realise that their metrics/chargeback etc are hopelessly inaccurate
auto-scaling of workers would be great
Keenan Brock
@kbrock
Jan 28 16:22
sure would
Adam Grare
@agrare
Jan 28 16:22
well i think the biggest reason we need so many workers (for collection) is the inefficiency of how we collect
Peter McGowan
@pemcg
Jan 28 16:22
I think the VMware provider is the only one with such tight realtime retention constraints
Adam Grare
@agrare
Jan 28 16:23
one tiny worker can do thousands if it does it the "right way"
Keenan Brock
@kbrock
Jan 28 16:23
but we've gone down a road of working around that inefficiency that has kinda painted us into a corner
Peter McGowan
@pemcg
Jan 28 16:23
I think we’re all in agreement that something needs to be done :-)
Adam Grare
@agrare
Jan 28 16:23
yeah agreed, we tried to fix the scaling issue the wrong way
Peter McGowan
@pemcg
Jan 28 16:24
it’s never too late :-)
Keenan Brock
@kbrock
Jan 28 16:24
but when we came forward with a solution, we were told we needed to implement the per-vm timers and a lot of the complexities that are really only needed for the old solution (vs a business requirement)
Adam Grare
@agrare
Jan 28 16:24
@pemcg for reference this is what @kbrock and I were working on during rearch https://github.com/agrare/manageiq-providers-vsphere/blob/master/metric/metrics_collector.rb
it simply collects for all powered on vms in a loop and is extremely fast compared to how we currently collect
Keenan Brock
@kbrock
Jan 28 16:25
you needed a bold around extremely
Adam Grare
@agrare
Jan 28 16:26
ha true, i'm going to try to find the numbers we got but it was like 1/10th the time
Keenan Brock
@kbrock
Jan 28 16:26
new bottleneck = saving the metrics. the actual collection-from-vmware problem went away
Peter McGowan
@pemcg
Jan 28 16:27
looks good @agrare
Keenan Brock
@kbrock
Jan 28 16:27
does rhevm have a "collect all metrics" - or are they per vm?
Peter McGowan
@pemcg
Jan 28 16:27
yes saving the metrics is also a challenge, as they say bottlenecks are moving targets
Keenan Brock
@kbrock
Jan 28 16:28
nice to move the bottleneck from "pounding the ems" and "pounding the queue" to "pounding the metrics table for saving"
Adam Grare
@agrare
Jan 28 16:28
rhevm doesn't really have a metrics API, we connect to their internal postgres
Keenan Brock
@kbrock
Jan 28 16:29
ok, so we could do a select * where instead of the N+1
Peter McGowan
@pemcg
Jan 28 16:29
interesting @kbrock we might need to rethink RHV anyway because aren't they deprecating the DWH database?
Adam Grare
@agrare
Jan 28 16:29
so we could definitely collect for all vms :)
Keenan Brock
@kbrock
Jan 28 16:29
lol. yup
another problem for another day (week?)
Peter McGowan
@pemcg
Jan 28 16:32
coincidentally I’ve been trying out the pg_buffercache extension this week to look at shared_buffers efficiency
vmdb_metrics is the second biggest user in my setup
vmdb_production=# SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE pg_relation_size(c.oid) > 0
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
                   relname                    | buffered | buffers_percent | percent_of_relation
----------------------------------------------+----------+-----------------+---------------------
 event_streams                                | 421 MB   |            13.7 |               100.0
 vmdb_metrics                                 | 412 MB   |            13.4 |                25.9
 index_vmdb_metrics_on_resource_and_timestamp | 201 MB   |             6.5 |                18.3
 vmdb_metrics_pkey                            | 194 MB   |             6.3 |                73.9
 guest_devices                                | 113 MB   |             3.7 |               100.0
 metric_rollups_02                            | 48 MB    |             1.6 |               100.0
 metric_rollups_12                            | 38 MB    |             1.2 |                41.5
 metric_rollups_01                            | 30 MB    |             1.0 |               100.1
 vim_performance_states                       | 27 MB    |             0.9 |                 1.7
 metric_rollups_07                            | 20 MB    |             0.6 |               100.2
(10 rows)
I’m trying to get comparable stats for a ‘real’ production database
Himanshu Roy
@hroyrh
Jan 28 16:56
@pemcg @kbrock Considering all that, does it make sense to repeat the postgres-tuning experiment right now?
Also, if I understand correctly, the bottleneck with fetching is gone. Does that mean that for VMware the bottleneck is metrics saving now @agrare @kbrock?
Keenan Brock
@kbrock
Jan 28 17:50
@hroyrh no. that code is not working, approved or anything. THIS is the current bottleneck
well, it is working, but not committed into manageiq
@pemcg vmdb_metrics always confuses me. but I think we're talking about metrics_00 / metrics_rollups_00 tables
but yes, miq, event streams, vmdb_metrics, and metrics et al are the vast majority of every customer database
guest_devices and vim_performance_states are also up there too, but closer to the bottom of the top 10
each event stream index tends to be bigger than the other tables too.
Adam Grare
@agrare
Jan 28 19:36
@pemcg figured since we were trying to remember numbers from over a year ago @kbrock and I tested our batch collector again
we have a vc simulator with 512 hosts and 10240 vms
I, [2019-01-28T14:34:22.271950 #2880]  INFO -- : Collecting metrics for 10752 targets...
I, [2019-01-28T14:34:40.570605 #2880]  INFO -- : Collecting metrics for 10752...Complete
18 seconds to collect metrics from all of them
Keenan Brock
@kbrock
Jan 28 19:37
heh. "our" is liberal. My contribution was "yea, adam, please do the work"
Adam Grare
@agrare
Jan 28 19:38
lol
Keenan Brock
@kbrock
Jan 28 19:39
I'm an ideas man. but they love me at vmware. just love me
too soon? :(
Adam Grare
@agrare
Jan 28 19:41
are you going to make metrics faster and make vmware pay for it?
Keenan Brock
@kbrock
Jan 28 19:44
Well, I didn't actually mean THEY would pay for it. but believe me, they'll pay for it
ugh, I'm sure it is a pita dealing with a liberal from Boston
@agrare this change has me so excited, but I do worry that we'll change it and then it will be faster for a few vendors but another vendor will slow down. because they need the N+1 approach.
Adam Grare
@agrare
Jan 28 19:46
yeah that's why I like being able to override how we schedule/queue/collect metrics per provider
Dennis Metzger
@dmetzger57
Jan 28 20:08
+1 ..... one size never fits all well