These are chat archives for ManageIQ/manageiq/performance

23rd
Jun 2016
Keenan Brock
@kbrock
Jun 23 2016 02:24
wow - fix_serialized_reports_for_rails_four is slow
66k blobs and 79k blob parts but... 45 minutes in and still running
Jason Frey
@Fryguy
Jun 23 2016 02:56
That's a known bad one... The stuff has to change, so not sure how to make it any better
Not sure on that PR.. Leave it open until I'm back from PTO
Keenan Brock
@kbrock
Jun 23 2016 12:22
Wow, the parallel gem is very cool. they need an example for fanning (main iterator reads AR records, fans work out to 8 processes, then the main thread collects the results and writes to the database)
I would have thought that the IO was the problem, but the solution suggests that it is more of a CPU crunching problem
Keenan Brock
@kbrock
Jun 23 2016 14:24
@jrafanie your PR ManageIQ/manageiq#9373 rocks
Here is what I found:
Joe Rafaniello
@jrafanie
Jun 23 2016 14:25
We ran it in console with sql logging and the picture N+1 was super obvious
Keenan Brock
@kbrock
Jun 23 2016 14:27
ms ms- queries query (ms) rows comments
23,749.8 0.0 9,491 2,245.3 9,496 no name-1
2,241.9 9,482 27.5 9,481 .SELECT "services".*
0.6 1 0.3 5 .SELECT "service_templates".*
0.4 1 0.3 5 .SELECT "pictures".*
1.1 5 0.4 1 .SELECT "binary_blobs".*
3.7 1.8 1 0.3 .Rbac.search
0.3 0.3 ..SELECT "users".*
read line #2
ALL of our queries are reading each service
better yet, we are using ancestry, so we CAN trun that into 1 query
Oleg Barenboim
@chessbyte
Jun 23 2016 14:28
@kbrock - keep in mind that I want much of this work to be backported to Darga
Keenan Brock
@kbrock
Jun 23 2016 14:30
+1
Joe Rafaniello
@jrafanie
Jun 23 2016 14:30
@kbrock do you have idea why we're using a virtual_has_one picture with this as a method instead of a real has_one through?
  def picture
    st = service_template
    return nil if st.nil?
    st.picture
  end
Keenan Brock
@kbrock
Jun 23 2016 14:30
as long as you are saying Darga not C - then we are good
@jrafanie I delete that method
Joe Rafaniello
@jrafanie
Jun 23 2016 14:30
was it just an oversight?
Keenan Brock
@kbrock
Jun 23 2016 14:30
seems so
we are ruby happy :(
Joe Rafaniello
@jrafanie
Jun 23 2016 14:30
it seems like a no brainer
Keenan Brock
@kbrock
Jun 23 2016 14:31
+1
Joe Rafaniello
@jrafanie
Jun 23 2016 14:31
good eye
Keenan Brock
@kbrock
Jun 23 2016 14:31
that is why I ran these numbers
they are slightly different
but once we are using rails, then we can use includes and lots of good stuff
@chessbyte thanks for the reminder
Joe Rafaniello
@jrafanie
Jun 23 2016 14:32
Maybe it makes sense to get this one in as is and open a PR to change it to a real has_one through
in case we want to backport one but not the other
Keenan Brock
@kbrock
Jun 23 2016 14:32
@jrafanie is there a place to put the tree from an "ancestry, please load the whole tree" ?
+1
Joe Rafaniello
@jrafanie
Jun 23 2016 14:32
@NickLaMuro , Jason and I were thinking of that, couldn't really find it
Nick LaMuro
@NickLaMuro
Jun 23 2016 14:33
@kbrock that is what we were going to look into next
Keenan Brock
@kbrock
Jun 23 2016 14:33
good
I already have that in b
didn't do in c
Joe Rafaniello
@jrafanie
Jun 23 2016 14:33
Jason started looking at rewriting the tree building to do it like the vm tree builder
Keenan Brock
@kbrock
Jun 23 2016 14:33
but I changed over to ancestry so we could do it in d
sorry I didn't also change this builder
Joe Rafaniello
@jrafanie
Jun 23 2016 14:34
Personally, I want to make note of these "tougher but worthwhile" things and continue on to other screens to get as many simple changes first
Keenan Brock
@kbrock
Jun 23 2016 14:34
well here
this is not a simple change
this was ~8 PRs to do it
thanks for taking over the finishing line
Nick LaMuro
@NickLaMuro
Jun 23 2016 14:35
@jrafanie that is fair, and makes sense. And you are right, the other N+1 is less obvious of a change
errr... harder to implement I would assume
Joe Rafaniello
@jrafanie
Jun 23 2016 14:35
yeah, I mean the ancestry N+1 is harder to do and worth investigating but probably can we added as an issue "here's how to recreate it and ideas on how to fix..." while we try other screens
Keenan Brock
@kbrock
Jun 23 2016 14:36
@jrafanie you want to merge this change. even thought it is identical to has_one PR?
Joe Rafaniello
@jrafanie
Jun 23 2016 14:37
@kbrock I think so, I don't know what this relationship looks like on other branches. I agree with you that's it's the same thing but feels like a refactoring to a simpler interface which is mostly unrelated to the slow screen
what do you think?
Keenan Brock
@kbrock
Jun 23 2016 14:37
+1
@NickLaMuro there is a hash that stores the tree nodes. you are poised to just call ancestry roots and populate it?
and change the "get this node" to look it up in the hash?
@jrafanie if we merge this, and I put in a PR for using rails relations instead of our own. it will just hang because it has no performance benefit...
and there is no way to prove that it doesn't break anything
@NickLaMuro or are you punting for now?
Joe Rafaniello
@jrafanie
Jun 23 2016 14:41
@kbrock I think it's fairly straightforward to write a test based on the TreeBuilderServices.new example... even if it's a throwaway test
Keenan Brock
@kbrock
Jun 23 2016 14:42
ok
Oleg Barenboim
@chessbyte
Jun 23 2016 14:44
let's get some wins that we can backport to Darga and deliver in CF 4.1 errata
Joe Rafaniello
@jrafanie
Jun 23 2016 14:46
@kbrock @NickLaMuro I'll open an issue to change the virtual_has_one to a has_one... I think it's worth investigating to simplify things. It doesn't make sense to have virtual relationships when a regular one seems like it will do what we want
Nick LaMuro
@NickLaMuro
Jun 23 2016 14:48
:+1:
Joe Rafaniello
@jrafanie
Jun 23 2016 14:49
@kbrock should I assign you? ManageIQ/manageiq#9407
Keenan Brock
@kbrock
Jun 23 2016 14:49
+1
Joe Rafaniello
@jrafanie
Jun 23 2016 14:50
I don't think it's a huge priority, but we should really see if it's simpler with no change in behavior
Cool, done, now onto the next thing
Keenan Brock
@kbrock
Jun 23 2016 14:53
@jrafanie and the PR is created / assigned to you. Cool, done, now onto the next thing
@jrafanie @NickLaMuro I'm confused. are you guys working on removing service's tree builder N+1 for darga, or just moving on?
aka: do you want me to take a peek?
Joe Rafaniello
@jrafanie
Jun 23 2016 14:54
@kbrock I'll forward you something, I'm not looking at it... Jason is
Keenan Brock
@kbrock
Jun 23 2016 14:55
thanks
I guess I jumped on that because I did so much work so we could do this... and then dropped the ball on the last mile.
Joe Rafaniello
@jrafanie
Jun 23 2016 14:55
just so you don't waste time on something that you should probably talk to Jason with when he's back from PTO
Keenan Brock
@kbrock
Jun 23 2016 15:00
ok. it looks like a ~5 line change
Alex Krzos
@akrzos
Jun 23 2016 15:41
@kbrock Does this look ok?
[----] I, [2016-06-23T11:39:22.300976 #18487:11e9988]  INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::HostEsx#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Vmware::InfraManager::HostEsx name: [VC0DC0_C15_H9.eng.vmware.com], id: [31] for time: [2016-06-23T14:00:00Z]...Complete - Timings: {:server_dequeue=>0.0021080970764160156, :db_find_prev_perf=>133.37837290763855, :rollup_perfs=>3798.018982410431, :db_update_perf=>1512.491857290268, :process_perfs_tag=>2.6754729747772217, :total_time=>6292.7560403347015, :process_bottleneck=>494.3214576244354, :purge_metrics=>48.55723810195923}
Thats seconds right
6000s
Keenan Brock
@kbrock
Jun 23 2016 15:42
6000s doesn't look so good
@akrzos where do you see 6000?
Alex Krzos
@akrzos
Jun 23 2016 15:43
:total_time=>6292.7560403347015
digging at purging
Those timing values are confusing to me
Keenan Brock
@kbrock
Jun 23 2016 15:48
yea, that looks like seconds
so if you get 6000, then that would probably be a timeout. right?
Alex Krzos
@akrzos
Jun 23 2016 15:48
yeah typically 600s or 3600s depends on what the timeout was set to
not sure what it's set to for that hourly rollup
That just stuck out to me
but looking at purging
I see :total_time=>1223.9979286193848, :purge_metrics=>4.451541900634766
total time a lot greater than the purge_metrics time
on the flip side that seems a lot faster for the number of records but i gotta find something comparable (When looking at :purge_metrics time)
Keenan Brock
@kbrock
Jun 23 2016 16:21
This message was deleted
This message was deleted
It would be nice if we could find a way to not store pictures in the database.
Keenan Brock
@kbrock
Jun 23 2016 19:57

@jrafanie have you noticed:

[ExtManagementSystem, Vm, Host].map { |m| Benchmark.ms { m.first } } 
# => [1403, 769, 182] (ms)

It takes 2.5s to load those 3 models

Nick LaMuro
@NickLaMuro
Jun 23 2016 19:58
even the second time around?
Keenan Brock
@kbrock
Jun 23 2016 19:58
nope
but this causes all sorts of chaos when I'm trying to benchmark
Joe Rafaniello
@jrafanie
Jun 23 2016 19:58
to autoload the models?
Keenan Brock
@kbrock
Jun 23 2016 19:58
and test runs
yea
I know it takes time and all but...
Joe Rafaniello
@jrafanie
Jun 23 2016 19:59
yeah, they all load Host which loads the world
Keenan Brock
@kbrock
Jun 23 2016 19:59
that seems like a lot
@NickLaMuro yea, it takes ~1-2 ms each the second time around
it is hard, because I need to reload to get changes to some libraries
and there is caching coming into play
anyway, thought I'd share that our extension mechanism is quite expensive
Nick LaMuro
@NickLaMuro
Jun 23 2016 20:02
no, definitely good to know, especially as I start doing some more Benchmarking myself
my 2cents is that it probably isn't something worth fixing ATM since it is just cached eventually
Joe Rafaniello
@jrafanie
Jun 23 2016 20:09
Yeah, it's probably worthwhile eager loading models before you benchmark stuff
making Host not so expensive is worth doing though, although probably would require fixing one thing at a time
Joe Rafaniello
@jrafanie
Jun 23 2016 20:19
wow
I'm looking at slow services -> requests page
initially getting to the page is relatively fast (less than 2 seconds)
changing the request date and clicking apply...
it takes many seconds
the only difference...
Keenan Brock
@kbrock
Jun 23 2016 20:21
nice
Joe Rafaniello
@jrafanie
Jun 23 2016 20:22
Initial page
exp = MiqExpression.new(
  {"and"=>
    [{"AFTER"=>{"value"=>"7 Days Ago", "field"=>"MiqRequest-created_on"}},
     {"or"=>
       [{"="=>
          {"value"=>"MiqProvisionRequest", "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"MiqProvisionConfiguredSystemRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"VmReconfigureRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"VmMigrateRequest", "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"ServiceTemplateProvisionRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"ServiceReconfigureRequest",
           "field"=>"MiqRequest-resource_type"}}]}]}
)
now, wait for it...
Keenan Brock
@kbrock
Jun 23 2016 20:22
lol
Joe Rafaniello
@jrafanie
Jun 23 2016 20:22
change to 30 days and apply
exp = MiqExpression.new(
  {"and"=>
    [{"AFTER"=>{"value"=>"1 Days Ago", "field"=>"MiqRequest-created_on"}},
     {"or"=>
       [{"="=>
          {"value"=>"pending_approval", "field"=>"MiqRequest-approval_state"}},
        {"="=>{"value"=>"approved", "field"=>"MiqRequest-approval_state"}},
        {"="=>{"value"=>"denied", "field"=>"MiqRequest-approval_state"}}]},
     {"or"=>
       [{"="=>
          {"value"=>"MiqProvisionRequest", "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"MiqProvisionConfiguredSystemRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"VmReconfigureRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"VmMigrateRequest", "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"ServiceTemplateProvisionRequest",
           "field"=>"MiqRequest-resource_type"}},
        {"="=>
          {"value"=>"ServiceReconfigureRequest",
           "field"=>"MiqRequest-resource_type"}}]}]}
)
the only difference is this
The or for the states:
    [{"AFTER"=>{"value"=>"1 Days Ago", "field"=>"MiqRequest-created_on"}},
     {"or"=>
       [{"="=>
          {"value"=>"pending_approval", "field"=>"MiqRequest-approval_state"}},
        {"="=>{"value"=>"approved", "field"=>"MiqRequest-approval_state"}},
        {"="=>{"value"=>"denied", "field"=>"MiqRequest-approval_state"}}]},
It almost looks like it's evaluating the "or" before the "AFTER"
in other words, instantiate all of the requests, check for the three states
then filter by date
:confused:
Keenan Brock
@kbrock
Jun 23 2016 20:30
should that OR get into the original or block?
would that make a difference?
Joe Rafaniello
@jrafanie
Jun 23 2016 20:33
I don't know
I need to figure out in what order MiqExpression evaluates things
it feels like it's pulling in all MiqRequests to check for those 3 different approval_states
so, tl;dr adding a single or with 3 states makes it go from 1-2 seconds to 30+ seconds
Keenan Brock
@kbrock
Jun 23 2016 22:00
@jrafanie in other places I'm noticing that rbac is pulling in all of the tree via relations
it keeps coming up for me
I keep seeing relationship_mixin in my stack traces
Joe Rafaniello
@jrafanie
Jun 23 2016 22:24
Interesting, I think this is the problem with the slow Services -> Requests screen on master... not sure why it's slow on older versions (seems to be a regression with the MiqExpression refactorings)
ManageIQ/manageiq#9427
Tim Wade
@imtayadeway
Jun 23 2016 22:25
@jrafanie working on a fix
Joe Rafaniello
@jrafanie
Jun 23 2016 22:25
Thanks @imtayadeway
Tim Wade
@imtayadeway
Jun 23 2016 22:49
ManageIQ/manageiq#9428
Tim Wade
@imtayadeway
Jun 23 2016 22:56
Looks like the build failed for that spectacularly.....
Keenan Brock
@kbrock
Jun 23 2016 23:45
That looks like not your issue. @imtayadeway those pass for you locally?
That fix looks good. Although, I liked how the previous one was using individual arel nodes. The helpers are so inconsistently available
Tim Wade
@imtayadeway
Jun 23 2016 23:47
My tests pass locally, yes. Is there a better way to do it? I'm confused why you can do And.new([any, number, of, nodes]) but only Or.new(left, right), unless I'm mistaken?
Keenan Brock
@kbrock
Jun 23 2016 23:48
Yea. I remember seeing the difference in expression code for or
May spelunk in arel to figure that out. Probably precedence issues