These are chat archives for ManageIQ/manageiq/performance

5th
Nov 2015
Alex Krzos
@akrzos
Nov 05 2015 19:42
@kbrock / @dmetzger57 I got my entire scale implementation feeding my influxdb / grafana machine with cpu/memory metrics now, getting closer to reproducing the perf_capture_timer timeout issue here (https://bugzilla.redhat.com/show_bug.cgi?id=1227008) on 5.4.3.1. Also, my generic/priority workers are consuming > 900MiB of memory and are being cycled, so I'm going to bump their limit again
@kbrock Also using a templated dashboard so each appliance (4 appliances) can be monitored
Keenan Brock
@kbrock
Nov 05 2015 19:43
dude. you are on fire
Alex Krzos
@akrzos
Nov 05 2015 19:43
@kbrock haha, more like the data center might be on fire, considering how much I'm stressing C&U on the scale collecting on 7k vms now
Dennis Metzger
@dmetzger57
Nov 05 2015 19:43
@akrzos what size environment has the workers cycling?
Alex Krzos
@akrzos
Nov 05 2015 19:44
5-6k vms turned on, it's a xlarge vmware provider
I'm just turning on more vms so it must collect on more
at 5k perf_capture_timer was taking ~330s to complete
Dennis Metzger
@dmetzger57
Nov 05 2015 19:44
@akrzos ok, i'll start breathing again, I feared this might have been small / medium ......
Alex Krzos
@akrzos
Nov 05 2015 19:44
at 6k now, ~440s
@dmetzger57 I started building a two provider test with 5.5 now too, you might be interested in some of the results
still need to build a better results page so it becomes easier to compare
Dennis Metzger
@dmetzger57
Nov 05 2015 19:47
@akrzos definite interest here
Alex Krzos
@akrzos
Nov 05 2015 19:48
for instance 5.4.3.1 vs 5.5.0.8:
5.4.3.1-appliance_memory.png
5.5.0.8-beta1.4-appliance_memory.png
crud, forgot to have version printed into the png
you can guess which one is 5.5
actually version in filename
two providers
vmware-medium + rhevm-medium
Dennis Metzger
@dmetzger57
Nov 05 2015 19:51
@akrzos so we can't squeeze the two medium providers into 6GB .....
Alex Krzos
@akrzos
Nov 05 2015 19:55
We did with 5.4.3.1 for that same given time period
Dennis Metzger
@dmetzger57
Nov 05 2015 19:55
right, but we can't now :worried:
Alex Krzos
@akrzos
Nov 05 2015 19:55
granted perhaps if the test is lengthened, maybe 5.4.3.1 won't fit two medium providers either
that is about a 30-40 minute snapshot
QE was mentioning that they couldn't get two providers without swapping on 5.5 attached to environments with 200-300 vms
so my environments are a bit bigger but more "static"
@kbrock also I crashed influxdb yesterday, not entirely sure what occurred, but the vm went from say 250MiB to all 8GiB of memory used
but just food for thought if we want to use them as our TSDB
Keenan Brock
@kbrock
Nov 05 2015 20:57
@akrzos heh - I have no doubt that if influx survives you, it will work for anyone's use cases
Alex Krzos
@akrzos
Nov 05 2015 21:03
@kbrock I pumped him up to 16GiB and updated to the latest influxdb so hopefully that won't happen again, though I think it was my fault for poking grafana into too large of a non-downsampled time frame
Keenan Brock
@kbrock
Nov 05 2015 21:03
eh. it should take it and like it
Alex Krzos
@akrzos
Nov 05 2015 21:15
@dmetzger57 so I'm having perf_capture_timer timeouts on the scale environment now, I had to push it to 7k running vms with 5.4
also it's running on haswell instead of sandy bridge so that likely improved the timings for scheduling each target
Still not at purge spiral yet
Keenan Brock
@kbrock
Nov 05 2015 21:27
@dmetzger57 there are ~50 files with rescue nil - and most of them look dangerous to me. Do you have a particular plan or should I just create a few PRs?
Jason Frey
@Fryguy
Nov 05 2015 21:33
Is that a performance thing @kbrock ?
Keenan Brock
@kbrock
Nov 05 2015 21:33
so on an appliance
they are trying to kill a worker
the worker is not dying
and then they start up a new worker
Jason Frey
@Fryguy
Nov 05 2015 21:33
how are "they" killing a worker
Keenan Brock
@kbrock
Nov 05 2015 21:34
it times out
not sure who "they" are
Jason Frey
@Fryguy
Nov 05 2015 21:34
is this the purge spiral?
Keenan Brock
@kbrock
Nov 05 2015 21:34
huh?
think purge is linear
lol
what does that mean?
matthewd @matthewd braces self; clicks link… oh thank goodness :sweat_smile:
Keenan Brock
@kbrock
Nov 05 2015 21:38
had to include link just because they used the "I dunno guy"
Jason Frey
@Fryguy
Nov 05 2015 21:39
purge spiral is an open BZ for timing out during purge, which makes us not purge, and then next time it takes even longer to purge cause there's more data
downward spiral
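A toy model of the spiral described above, as a rough sketch only. Every name and number here is hypothetical (`simulate_purge`, the growth rate, the per-row cost, the timeout); it assumes a purge that exceeds its timeout deletes nothing, which is the behaviour described in the message above, not something checked against the actual purge code.

```ruby
# Toy model, all numbers hypothetical. Each cycle adds `growth` rows to the
# backlog. A purge needs `secs_per_row * backlog` seconds and is abandoned
# (deleting nothing) when that exceeds `timeout`.
def simulate_purge(cycles:, growth:, secs_per_row:, timeout:)
  backlog = 0
  cycles.times.map do
    backlog += growth
    needed = backlog * secs_per_row
    backlog = 0 if needed <= timeout # purge finished in time
    backlog                          # rows left over after this cycle
  end
end

# Purge keeps up: 1_000 rows * 0.5s = 500s, under the 600s timeout.
healthy = simulate_purge(cycles: 5, growth: 1_000, secs_per_row: 0.5, timeout: 600)
# healthy => [0, 0, 0, 0, 0]

# Purge misses once (1_500 rows * 0.5s = 750s) and can never catch up again.
spiral = simulate_purge(cycles: 5, growth: 1_500, secs_per_row: 0.5, timeout: 600)
# spiral => [1500, 3000, 4500, 6000, 7500]
```

In this model a single missed purge is enough: the backlog only grows afterwards, so every later attempt needs even more time than the one that timed out.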
Keenan Brock
@kbrock
Nov 05 2015 21:39
aah - gotcha
Jason Frey
@Fryguy
Nov 05 2015 21:44
oh yeah I remember that one
didn't that come to "Doctor, it hurts when I do this...Doctor: Then don't do that"
don't really think that's high prio
IMO
Keenan Brock
@kbrock
Nov 05 2015 21:45
Dennis felt it was us not cleaning up after ourselves
Matthew Draper
@matthewd
Nov 05 2015 21:45
This is related to rescue nil? :confused:
Keenan Brock
@kbrock
Nov 05 2015 21:45
it is marked high/high
Jason Frey
@Fryguy
Nov 05 2015 21:45
oh, that could be...but if you read closely they were trying to manage 1 of every provider on 1 appliance
Keenan Brock
@kbrock
Nov 05 2015 21:45
yea
Jason Frey
@Fryguy
Nov 05 2015 21:45
literally 10 providers on one appliance
Keenan Brock
@kbrock
Nov 05 2015 21:45
they were being stupid
Jason Frey
@Fryguy
Nov 05 2015 21:46
it was high/high before it was closed
then someone reopened as RFE and never re-prioed
Keenan Brock
@kbrock
Nov 05 2015 21:46
I'm having trouble connecting the dots from Dennis' conclusion and the reported error.
but I do agree we need to remove so many of these rescue nil
ooh
Jason Frey
@Fryguy
Nov 05 2015 21:46
probably should go back to triage group
Keenan Brock
@kbrock
Nov 05 2015 21:47
yea
you might know
we have provider.close rescue nil
all over the place
Jason Frey
@Fryguy
Nov 05 2015 21:47
yeah
Keenan Brock
@kbrock
Nov 05 2015 21:47
are these rescue needed?
what are we trying to rescue?
"what are you sinking about?"
Matthew Draper
@matthewd
Nov 05 2015 21:47
That seems like the textbook case where a blind rescue is reasonable, to me
Jason Frey
@Fryguy
Nov 05 2015 21:47
The ensure can't know if the handle is sane
Keenan Brock
@kbrock
Nov 05 2015 21:48
yes
Jason Frey
@Fryguy
Nov 05 2015 21:48
so it tries to close, and if the handle is not sane, it just moves on
Keenan Brock
@kbrock
Nov 05 2015 21:48
I understand why rescue is there
but doesn't that also catch control-c and stuff?
Matthew Draper
@matthewd
Nov 05 2015 21:48
No
Jason Frey
@Fryguy
Nov 05 2015 21:48
no
StandardError only
Keenan Brock
@kbrock
Nov 05 2015 21:48
huh
did that change?
Matthew Draper
@matthewd
Nov 05 2015 21:49
No
Keenan Brock
@kbrock
Nov 05 2015 21:49
wtf
so these are the same?
begin
  xxx
rescue
  nil
end

xxx rescue nil
Matthew Draper
@matthewd
Nov 05 2015 21:49
Correct
Joe Rafaniello
@jrafanie
Nov 05 2015 21:50
yes
Matthew Draper
@matthewd
Nov 05 2015 21:50
rubocop even wants to autocorrect them like that
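A small sketch to check the claim above: both the inline and the block form of a bare `rescue` catch `StandardError` only, so an `Interrupt` (what Ctrl-C raises) passes straight through. The method names `risky`, `inline_form`, `block_form` and `outcome` are made up for this demo.

```ruby
# Raises whatever exception class it is given.
def risky(error_class)
  raise error_class, "boom"
end

# Inline (modifier) form: `expr rescue nil`
def inline_form(error_class)
  risky(error_class) rescue nil
  :swallowed
end

# Block form with a bare `rescue`
def block_form(error_class)
  begin
    risky(error_class)
  rescue
    nil
  end
  :swallowed
end

# Reports what happened. The `rescue Exception` here exists only so the demo
# can observe exceptions that escaped the forms above.
def outcome(form, error_class)
  send(form, error_class)
rescue Exception => e
  e.class
end

outcome(:inline_form, RuntimeError) # => :swallowed
outcome(:block_form,  RuntimeError) # => :swallowed
outcome(:inline_form, Interrupt)    # => Interrupt (not swallowed)
outcome(:block_form,  Interrupt)    # => Interrupt (not swallowed)
```

`Interrupt` descends from `SignalException`, which sits outside the `StandardError` branch of the hierarchy, which is why neither form touches it.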
Keenan Brock
@kbrock
Nov 05 2015 21:50
I could have sworn xxx rescue nil was catching Exception
well...
anyway
so control-c / hup / ...
those are not messed up
very cool
dang - back to the drawing board then :)
Matthew Draper
@matthewd
Nov 05 2015 21:51
Unless you have a deeper rescue Exception, that turns them into a StandardError, or something
Jason Frey
@Fryguy
Nov 05 2015 21:52
begin
  xxx
rescue Exception
  raise "It's not *that* bad"
end
Matthew Draper
@matthewd
Nov 05 2015 21:53
Yeah. Not that we'd ever do something like that. 🌚
Keenan Brock
@kbrock
Nov 05 2015 21:53
cool - I'm so glad I was wrong on this one
Jason Frey
@Fryguy
Nov 05 2015 21:53
:flushed:
Joe Rafaniello
@jrafanie
Nov 05 2015 21:54
@matthewd what, like catching timeout's internal exception and re-raising something else?
Keenan Brock
@kbrock
Nov 05 2015 21:55
lol
Matthew Draper
@matthewd
Nov 05 2015 21:56
haha
So, we actually do that even more often than I thought
app/models/host.rb:    rescue Exception => err
app/models/host.rb-      _log.warn("#{err.inspect}")
app/models/host.rb-      raise MiqException::MiqHostError, "Unexpected response returned from system, see log for details"
Looks like some goodly portion of our "talk to a remote system" calls (which are network-slow, and thus most likely to be running when we receive an interrupt) do some variant of the above
Keenan Brock
@kbrock
Nov 05 2015 21:58
matthewd - so what happens in that case?
when an interrupt occurs there
Jason Frey
@Fryguy
Nov 05 2015 21:58
(-‸ლ)
Matthew Draper
@matthewd
Nov 05 2015 21:58
If an interrupt occurs inside that rescue's begin block, it'll get converted to a regular exception there
Keenan Brock
@kbrock
Nov 05 2015 21:58
because it is capturing Exception, not StandardError
Matthew Draper
@matthewd
Nov 05 2015 21:59
So any then-surrounding rescue will treat it as any other StandardError
Keenan Brock
@kbrock
Nov 05 2015 21:59
so inline vs non-inline != problem. rescue Exception == problem
Matthew Draper
@matthewd
Nov 05 2015 22:00
Correct
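A minimal sketch of the conversion being discussed, assuming a simulated interrupt rather than a real signal. `RemoteCallError`, `remote_call` and `close_quietly` are hypothetical names standing in for the `host.rb` pattern quoted earlier; this is not the actual ManageIQ code.

```ruby
# Hypothetical stand-in for an application error such as MiqException::MiqHostError.
class RemoteCallError < StandardError; end

# Mimics the pattern quoted above: rescue Exception, then raise an
# application-level (StandardError) error in its place.
def remote_call
  raise Interrupt, "simulated Ctrl-C" # stand-in for an interrupt arriving mid-call
rescue Exception => err
  raise RemoteCallError, "Unexpected response (was #{err.class})"
end

# A caller with an innocent-looking inline rescue.
def close_quietly
  remote_call rescue nil
  :interrupt_swallowed
end

close_quietly # => :interrupt_swallowed
```

The inline `rescue nil` is not the culprit here. It only sees a `RemoteCallError`, because the deeper `rescue Exception` already turned the `Interrupt` into a `StandardError`.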
Joe Rafaniello
@jrafanie
Nov 05 2015 22:01
well, it's usually not a good idea to blindly swallow standard exceptions
Keenan Brock
@kbrock
Nov 05 2015 22:01
dang
so I went from 50 inline rescues
to 122 'rescue Exception' clauses
Matthew Draper
@matthewd
Nov 05 2015 22:01
My general guideline is that rescue Exception is only okay if you do a bare raise inside it — so, you can explicitly intervene for all abnormal exits, but it's only appropriate for that sort of "if we're failing, do this thing, then continue failing" handler… not something that thinks it can recover
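One way that guideline might look in code, as a sketch only. `with_cleanup` and the `log` array are invented for illustration; the point is the bare `raise`, which re-raises the original exception unchanged after the handler has done its work.

```ruby
# Runs the block; on any abnormal exit, records it and keeps failing.
def with_cleanup(log)
  yield
rescue Exception => err
  log << "failing with #{err.class}; releasing resources"
  raise # bare raise: re-raises the original exception, class and all
end

log = []
result =
  begin
    with_cleanup(log) { raise Interrupt, "simulated Ctrl-C" }
  rescue Interrupt => e
    e.class
  end

result # => Interrupt
log    # => ["failing with Interrupt; releasing resources"]
```

Because the `Interrupt` leaves `with_cleanup` as an `Interrupt`, a surrounding `rescue nil` or bare `rescue` still ignores it, and the process can shut down as intended.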
Jason Frey
@Fryguy
Nov 05 2015 22:02
Agreed @matthewd
Keenan Brock
@kbrock
Nov 05 2015 22:02
quick grep didn't distinguish those cases
Joe Rafaniello
@jrafanie
Nov 05 2015 22:02
again, what are you trying to solve? I know it's bad, but unless you're in there fixing an issue, not sure it's easy to just fix without better understanding the surrounding code
Keenan Brock
@kbrock
Nov 05 2015 22:02
anyone else using ag (the silver searcher) - it is like Ack, but better
Joe Rafaniello
@jrafanie
Nov 05 2015 22:07
I'm an :older_man: , I use git grep for most things
Chris Arcand
@chrisarcand
Nov 05 2015 23:52
@kbrock I do, works great. Your description is pretty much what I say as well ;)
Jason Frey
@Fryguy
Nov 05 2015 23:53
Oh hai @ChrisArcand :)
Chris Arcand
@chrisarcand
Nov 05 2015 23:53
Oh hai :wave: :smile:
Dennis Metzger
@dmetzger57
Nov 05 2015 23:59
@kbrock now that I've heard the story of BZ 1232490 and read the ticket, let's put that on the shelf and move on to more pressing issues. Let's talk in the morning