These are chat archives for ManageIQ/manageiq/performance

16th Feb 2018
Keenan Brock
@kbrock
Feb 16 2018 00:00
@NickLaMuro I put in a gist -- it is impressive
the table scans aren't real - those tables are empty
not sure why it sometimes bitmap heap and other times index scans those other tables - they should all be the same (if there is data)
NickLaMuro @NickLaMuro admittedly doesn't really know how to read SQL explains yet
Keenan Brock
@kbrock
Feb 16 2018 00:02
as lad will tell you - the subquery I have there is bad
but the query is seriously wacked - here. I'll paste the explain from the metrics_21 query
Nick LaMuro
@NickLaMuro
Feb 16 2018 00:04
I don't know why you would do a subquery when you are already in the metrics_21 table...
DELETE FROM "metrics_21"
WHERE "metrics_21"."timestamp" <= '2018-02-01 19:56:37.519979'
Keenan Brock
@kbrock
Feb 16 2018 00:05
aah
the theory is we only want to delete 100 records at a time
updated that gist - so you can see the explain plans for metrics_21. and also added id = any(array(select ...)) - which I don't know how to generate from rails easily.
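(For what it's worth, one way to get that `id = ANY(ARRAY(...))` shape out of Rails is to splice the subquery SQL in by hand. A rough standalone sketch, not the actual ManageIQ code — plain string building here so it runs anywhere; the table/column names are illustrative:)

```ruby
# Rough sketch: build the "id = ANY(ARRAY(SELECT ...))" form by splicing
# a subquery string into an outer DELETE. In a Rails app the subquery
# string could instead come from relation.limit(100).select(:id).to_sql.
def any_array_delete_sql(table, cutoff, batch = 100)
  sub = "SELECT id FROM #{table} WHERE timestamp <= '#{cutoff}' LIMIT #{batch}"
  "DELETE FROM #{table} WHERE id = ANY(ARRAY(#{sub}))"
end
```

In an app you would then hand the string to the connection, e.g. `Metric.where("id = ANY(ARRAY(#{sub}))").delete_all` (with the caveat that splicing raw SQL bypasses AR's quoting).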
Nick LaMuro
@NickLaMuro
Feb 16 2018 00:06

the theory is we only want to delete 100 records at a time

curious what the rationale for that is. I am sure one of the old guard can explain.

Keenan Brock
@kbrock
Feb 16 2018 00:06
well, ignoring the 100 (I think it should be 10,000 - but I digress)
we want to delete in batches to reduce contention
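(The batching idea, as a toy pure-Ruby sketch — an Array stands in for the table, and each pass removes at most `batch` rows, the way each pass in the real thing would be one bounded DELETE in its own short transaction. Names here are made up for illustration:)

```ruby
# Toy illustration of batched purging: delete matching rows at most
# `batch` at a time, so each pass -- each transaction, in the real SQL
# version -- stays small and short-lived instead of locking everything.
def purge_in_batches(rows, batch, &doomed)
  passes = 0
  loop do
    victims = rows.select(&doomed).first(batch)
    break if victims.empty?
    rows -= victims # in SQL: one bounded DELETE per pass
    passes += 1
  end
  [rows, passes]
end
```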
Daniel Berger
@djberg96
Feb 16 2018 00:08
@NickLaMuro deleting huge numbers of rows can lock the db (I think, Keenan can correct me)
Nick LaMuro
@NickLaMuro
Feb 16 2018 00:09
ah, okay, yeah, that makes sense
Keenan Brock
@kbrock
Feb 16 2018 00:09
thanks for the translation @ db
Daniel Berger
@djberg96
Feb 16 2018 00:09
been too long for me, guess i thought cascading deletes "just worked" without the need for additional queries
Keenan Brock
@kbrock
Feb 16 2018 00:09
it is the limit
I'm surprised sql doesn't just work
delete from table limit 100 seems good to me
but alas, nope
@djberg96 @NickLaMuro but you have to admit, that explain plan is humorous
when you add the analyze, it shows that most of those tables are not even accessed, but it takes ~4 minutes
just curious why this would be failing so badly for one customer when it has been working for others for the past 1-2 years
Daniel Berger
@djberg96
Feb 16 2018 00:14
same version of cloudforms and rails?
Keenan Brock
@kbrock
Feb 16 2018 00:15
that update went into master ~January 2016
the same sql is generated
maybe the version of postgres has changed
Daniel Berger
@djberg96
Feb 16 2018 00:27
(since it doesn't look to me, just viewing the ddl in pgadmin, like cascading delete is used)
and i could be on the wrong track completely here
well, that was weird
gitter just bombed out there for a second
while my microwave was going....hmmm.......
Keenan Brock
@kbrock
Feb 16 2018 00:30
lol - it has been dropping stuff for me as well
Daniel Berger
@djberg96
Feb 16 2018 00:30
anyway, i was just saying i would have to fiddle with postgresql, parent and child tables, and cascading deletes
Keenan Brock
@kbrock
Feb 16 2018 00:30
well, we insert into metrics, but a trigger deletes from metrics and inserts into the appropriate metrics_*
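(The routing that trigger does boils down to picking a shard by the UTC hour of the row's timestamp — here's a hypothetical Ruby rendering of that logic, under the assumption that the metrics_* tables are keyed one-per-hour, 00 through 23:)

```ruby
require 'time'

# Hypothetical Ruby rendering of what the insert trigger does: route a
# row to an hourly shard table based on the UTC hour of its timestamp.
# (Assumption: 24 metrics_NN tables, one per hour -- 00 through 23.)
def shard_for(timestamp)
  "metrics_%02d" % timestamp.utc.hour
end
```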
Daniel Berger
@djberg96
Feb 16 2018 00:33
and we do that because....? i can't remember any more
Keenan Brock
@kbrock
Feb 16 2018 00:33
I've given up on that question
Daniel Berger
@djberg96
Feb 16 2018 00:33
sorry :(
Keenan Brock
@kbrock
Feb 16 2018 00:33
sharding is webscale
it was either that or switching over to mongo (or /dev/null)
Daniel Berger
@djberg96
Feb 16 2018 00:34
or oracle!
:-P
alright, time for dinner, gitter is dropping messages now, cya later
Jason Frey
@Fryguy
Feb 16 2018 13:44
@kbrock Reading back I don't understand where you are getting that expanded query from since the limit is not in the AR query
If I recall, PG is smart enough to not even consider the subtables that don't fit into the time range, so it's not 24 queries... Unless something had changed
Keenan Brock
@kbrock
Feb 16 2018 14:00

@Fryguy thnx - it should be:

Metric.where(:id => Metric.where('timestamp < ?', 4.hours.ago).limit(100)).delete_all

otherwise we'd just use:

Metric.where('timestamp < ?', 4.hours.ago).delete_all

also, we use arel/config values in the actual generation there. I simplified for the room

Chris Arcand
@chrisarcand
Feb 16 2018 14:40
@kbrock Answering your original question, yes...? There's nothing stopping you from removing the trigger and writing things in Ruby/SQL to do the sharding and fine tune deleting things per table shard instead of leaving it to PG, if that really is a pain point. I thought that was already one of your known goals.
I mean I don't know if that's a good idea or if you'll benefit from it, but I don't see why it's not possible.
Keenan Brock
@kbrock
Feb 16 2018 15:26
@chrisarcand ooh - I was asking if it were possible to Metric.with_table(:metrics_2).create(:timestamp => xxx)
wasn't looking at modifying our sharding technique (yes I don't like it but... seems to work well enough)
Jason Frey
@Fryguy
Feb 16 2018 15:27
Ah, I see what you're trying to do now... I like that with table idea
Keenan Brock
@kbrock
Feb 16 2018 15:27
well, that is the insert version. wanted to use it for delete version
(I assumed the research was more of an insert mode rather than delete)
@Fryguy should we just implement a truncate strategy instead of our current delete with batches mode?
not ready for it with MetricRollup since that has a dependent table, but Metric / metrics seems ripe for the picking
I just don't know how to say to ActiveRecord, for this one query, could you change the query. I think tablename can be a lambda. maybe that will be it
Jason Frey
@Fryguy
Feb 16 2018 15:31
I did it a while ago, but can't remember how
But yeah, truncate strategy is good... Just have to be careful if someone sets the config higher than the number of tables... Maybe we prevent that for realtime
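(That guard could be as simple as checking the retention window against the shard count — a sketch under the assumption of 24 hourly tables; the names are illustrative, not existing ManageIQ methods:)

```ruby
# Sketch of the safety check described above: truncating whole hourly
# tables is only safe if the realtime retention window fits inside the
# shard ring. (Assumption: 24 metrics_NN tables, one per UTC hour.)
SHARD_COUNT = 24

def truncate_safe?(keep_seconds)
  hours_kept = (keep_seconds / 3600.0).ceil
  hours_kept < SHARD_COUNT
end
```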
Keenan Brock
@kbrock
Feb 16 2018 15:34
@Fryguy I did post the current metrics deletes [ref again] -- it is very impressive
Jason Frey
@Fryguy
Feb 16 2018 15:35
Is it possible to run again with Explain Analyze and some data?
Chris Arcand
@chrisarcand
Feb 16 2018 15:40

You'd have to make your own with_table and just set the table metadata on the model if you want that sort of thing over just using arel. Rails doesn't provide any sort of dynamic 'use this table name for this query' that I'm aware of.

def self.with_table(table_name)
  # Save the old table name somehow, whatever...
  @old_table_name = self.table_name

  begin
    self.table_name = table_name
    yield
  ensure
    self.table_name = @old_table_name
  end
end

^ totally making this up, haven't tried it.

  Metric.with_table(:metrics_2) do
    create(:timestamp => blah)
  end
maybe?
Jason Frey
@Fryguy
Feb 16 2018 15:41
If I recall, that's similar to what I had done in the past
Chris Arcand
@chrisarcand
Feb 16 2018 15:42
Who knows how that screws up the model's state, too, you might have to reset_column_stuff whatever that method is called at the end.
Jason Frey
@Fryguy
Feb 16 2018 15:43
That should be ok because the child tables have the exact same structure
Keenan Brock
@kbrock
Feb 16 2018 16:19
@Fryguy I have a table with data - but it takes > 4 minutes to run
Jason Frey
@Fryguy
Feb 16 2018 16:19
the explain analyze does?
Keenan Brock
@kbrock
Feb 16 2018 16:19
yup
Jason Frey
@Fryguy
Feb 16 2018 16:19
cool
throw it in here when it's done: https://explain.depesz.com/
Keenan Brock
@kbrock
Feb 16 2018 16:20
@chrisarcand yea. well, you used to be able to set table_name = Proc.new {}
Jason Frey
@Fryguy
Feb 16 2018 16:20
blog post with more info on the second link: http://tatiyants.com/postgres-query-plan-visualization/
Chris Arcand
@chrisarcand
Feb 16 2018 16:20
can you not anymore?
Keenan Brock
@kbrock
Feb 16 2018 17:39
@chrisarcand that does work (I assume)
would be cool if the query came into the lambda :)
Keenan Brock
@kbrock
Feb 16 2018 17:55
@Fryguy I ran the query with data. tried sending it to the service. the json is 23,332 lines - they can't handle it
I'll try again with metrics_21 in the outer query.
Jason Frey
@Fryguy
Feb 16 2018 18:03
gist?
Keenan Brock
@kbrock
Feb 16 2018 18:03
the tried and true one still rocks: https://explain.depesz.com/s/UuPX ( the newer one is pretty, but doesn't get me what I want) http://tatiyants.com/pev/#/plans/plan_1518804077072
smaller metrics_21 gist
the larger one metrics gist is practically unresponsive
yea, the full explain is too long to paste into gist - makes the page unresponsive :(
guess that says the query is probably wrong ;)
yea, this looks like metrics_18 in particular has bad statistics. they are table scanning it - taking 10 seconds
Keenan Brock
@kbrock
Feb 16 2018 18:13

also notice that they append all table results together before applying the LIMIT

so it is

SELECT *
from (
  SELECT * from metrics_00
  union all
  SELECT * from metrics_01
  union all
  SELECT * from metrics_02
  -- ...
)
LIMIT 100
was at least hoping for:
SELECT *
from (
  SELECT * from metrics_00 LIMIT 100
  union all
  SELECT * from metrics_01 LIMIT 100
  union all
  SELECT * from metrics_02 LIMIT 100
  -- ...
)
LIMIT 100
ok, that is the problem ^ - the Append of every metrics row together before limiting 100
Jason Frey
@Fryguy
Feb 16 2018 18:16
they have to append them all
in our case, outside in, we, as humans, know that our tables are sharded a particular way
Keenan Brock
@kbrock
Feb 16 2018 18:18
maybe that intermediate limit is "clever", but they are able to apply the where clause from the outside of the query to the inside
Jason Frey
@Fryguy
Feb 16 2018 18:18
but if you were to take a different column value, there's no way for PG to know that
(haven't thought it through all the way, but I think I could come up with an example)
Keenan Brock
@kbrock
Feb 16 2018 18:19
@Fryguy so truncate is the way we want to go?
if we keep 4 hours worth - just figure out the hour number and truncate all other tables?
they are able to do the WHERE, you'd think they could also do the LIMIT
Jason Frey
@Fryguy
Feb 16 2018 18:19
as long as they don't set the value to 2.days (i.e. a number higher than the number of tables)
Keenan Brock
@kbrock
Feb 16 2018 18:20
2 days for realtime metrics?
Keenan Brock
@kbrock
Feb 16 2018 18:25
# running on an empty database (not the customers)
((4.hours.ago.utc.hour + 24)..(0.hours.ago.utc.hour + 24)).map { |h| h % 24 }.each do |tbl|
  ActiveRecord::Base.connection.execute("truncate metrics_#{tbl}")
end
   (24.2ms)  truncate metrics_14
   (10.0ms)  truncate metrics_15
   (5.6ms)  truncate metrics_16
   (4.5ms)  truncate metrics_17
   (3.0ms)  truncate metrics_18
hmm - not right, we want the opposite. but that gives you where I'm leaning
Keenan Brock
@kbrock
Feb 16 2018 18:51
frequency = ::Settings.workers.worker_base.schedule_worker.performance_realtime_purging_interval.to_i_with_method
keep = ::Settings.performance.history.keep_realtime_performances.to_i_with_method

# defaults: frequency: 21.minutes, keep: 4.hours -- could probably check against 24.hours
if frequency + (keep + 1.hour) > 12.hours
  puts "warning - this needs to be run more frequently"
end

now = Time.now.utc
good_hours = (((now - keep).hour + 24)..(0.hours.ago.utc.hour + 24)).map { |h| h % 24 }
target_hours = (0..23).to_a - good_hours.to_a
puts "keep: #{(now - keep).strftime("%H:%M")} - #{now.strftime("%H:%M")}"
puts "keep: #{good_hours}"
puts "remove: #{target_hours}"

target_hours.each do |tbl|
  ActiveRecord::Base.connection.execute("truncate metrics_%02d" % tbl)
end

keep: 14:48 - 18:48
remove: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 19, 20, 21, 22, 23]
   (7.5ms)  truncate metrics_00
   ...
   (2.4ms)  truncate metrics_13
   (4.2ms)  truncate metrics_19
   ...
   (2.6ms)  truncate metrics_23
@Fryguy ^ I'll put together a PR for that
Keenan Brock
@kbrock
Feb 16 2018 19:52
fun:
query                                                                                  cost  time
delete from metrics_21 where id in (select id from metrics_21 limit 100)                875  40,000ms
delete from metrics_21 where id = any(array(select id from metrics_21 limit 100))        72  107ms
delete from metrics_21 where ctid = any(array(select ctid from metrics_21 limit 100))    64  30ms
with timings it is hard to know if it is a hot cache and stuff, so cost seems to work better for me (even if cost is estimated and not actual)
Keenan Brock
@kbrock
Feb 16 2018 19:58
@dmetzger57 would it be helpful to see the max memory for the workers in evm:status_full?
Worker Type Status ID PID SPID Server id Queue Name / URL Started On Last Heartbeat MB Usage
MiqEmsMetricsProcessorWorker started 1657 5932 5956 1 ems_metrics_processor 2018-02-16T14:15:11Z 2018-02-16T19:55:04Z 248/600
MiqEmsMetricsProcessorWorker started 1656 5923 5952 1 ems_metrics_processor 2018-02-16T14:15:11Z 2018-02-16T19:55:04Z 251/600
MiqGenericWorker started 1705 26571 26580 1 generic 2018-02-16T19:12:50Z 2018-02-16T19:55:04Z 301/500
MiqGenericWorker started 1710 29366 29375 1 generic 2018-02-16T19:54:20Z 2018-02-16T19:55:06Z 270/500
MiqPriorityWorker started 1708 28558 28568 1 generic 2018-02-16T19:42:29Z 2018-02-16T19:55:06Z 498/600
MiqPriorityWorker started 1709 28752 28762 1 generic 2018-02-16T19:45:32Z 2018-02-16T19:55:06Z 541/600
MiqScheduleWorker started 1658 5941 5955 1 2018-02-16T14:15:12Z 2018-02-16T19:54:59Z 258/500
MiqUiWorker started 1617 20047 1 http://127.0.0.1:3000 2018-02-16T01:49:10Z 2018-02-16T19:54:59Z 252/1024
MiqWebServiceWorker started 1618 20056 1 http://127.0.0.1:4000 2018-02-16T01:49:10Z 2018-02-16T19:54:59Z 252/1024
last column 252/1024 (enhancement) vs just 252 (current)
Dennis Metzger
@dmetzger57
Feb 16 2018 20:16
@kbrock those are max-Used / Threshold? I think that would be helpful for the high level overview. How are you getting max-Used?
Keenan Brock
@kbrock
Feb 16 2018 20:17
Um, that is built into the rake task - think we save to the database. I'll check
we need to display different colors in the console, so you can see the "red over" and "orange - getting close" ;)
Also would be nice to display all the expected workers, so you could see when something is missing... but again, leaning towards brainstorming here
Keenan Brock
@kbrock
Feb 16 2018 20:42
@dmetzger57 sorry - answer: yes, it is from the database: evm:status_full calls: evm_application.rb:84
Dennis Metzger
@dmetzger57
Feb 16 2018 20:49
:+1: adding it