These are chat archives for ManageIQ/manageiq/performance

8th Nov 2017
Keenan Brock
@kbrock
Nov 08 2017 02:06
@Ladas figured out the sed issue: it uses \s to represent whitespace, but Apple's sed doesn't like it, so I swapped it with ' '
cben @cben is amused that the performance room is discussing cut, less, sed :), and is reminded of https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Ladislav Smola
@Ladas
Nov 08 2017 07:21
@kbrock these apple seds :-D
Keenan Brock
@kbrock
Nov 08 2017 14:15
ikr
@Ladas where do we go from here?
Ladislav Smola
@Ladas
Nov 08 2017 15:04
@kbrock so from what I see, with 95% probability, this issue was caused by the refresh worker being dead while 9 other workers kept queuing targets into 1 MiqQueue record. So this will always fail in time, no matter the size of the targets we have
@kbrock @dmetzger57 the other issue is that there is a kill loop of 1 or many pods, causing a huge amount of events, which are flooding both evm.log and automation.log
@kbrock @dmetzger57 not sure what tools we have for this, I know we log-rotate and delete in time. But this is producing like 7GB of automation.log in a few hours
and about half of that in evm.log
and the env has only 20GB for the whole log disk
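A minimal sketch of the failure mode described above, assuming the usual merge-into-an-existing-message queueing pattern; the options and merge logic here are illustrative, not the actual EmsRefresh code:

```ruby
# Hypothetical illustration: every event handler merges its refresh
# targets into the single READY refresh message. If the refresh worker
# is dead, nothing ever dequeues that message, so its args keep growing
# until processing it is bound to fail, regardless of target size.
new_targets = [["ContainerGroup", 123]]

MiqQueue.put_or_update(
  :class_name  => "EmsRefresh",
  :method_name => "refresh",
  :state       => MiqQueue::STATE_READY
) do |msg, item|
  existing = msg ? msg.args.first : []
  item.merge(:args => [existing | new_targets]) # targets accumulate unboundedly
end
```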
Keenan Brock
@kbrock
Nov 08 2017 15:07
automation.log has been getting out of control, sometimes logging 20-30 messages for a method that I thought deserved 1
Ladislav Smola
@Ladas
Nov 08 2017 15:07
@kbrock well it does like 40 lines per received event?
@kbrock and we get like 100k events in a few hours (probably caused by the unhealthy Pod)
@kbrock on the evm.log side, we also log stuff around each event, and every 20-40 events cause a refresh (it refreshes like every 1.5 minutes)
@kbrock so we might need to limit the size of the logging around events and refresh, but the biggest part is still the generic Automate logging
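A rough back-of-envelope check of the figures above, using only the approximate numbers quoted in this chat:

```ruby
events          = 100_000      # events over a few hours
lines_per_event = 40           # automate lines logged per event
log_bytes       = 7 * 1024**3  # ~7GB of automation.log

total_lines    = events * lines_per_event  # => 4,000,000 lines
bytes_per_line = log_bytes / total_lines   # => ~1879 bytes per line
```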
Keenan Brock
@kbrock
Nov 08 2017 15:15
@Ladas maybe we need to ping @gmcculloug and tell him that the automate log is too verbose and blowing up?
Ladislav Smola
@Ladas
Nov 08 2017 15:18
@kbrock I guess :-) The other thing is that the env is small and the excessive amount of events is generated by a failing Pod, but we can expect that to happen in production
Adam Grare
@agrare
Nov 08 2017 15:20
wasn't the automate log just blowing up when it was logging the msg_data though?
can't really blame automate for that (I got your back @gmcculloug :wink:)
Greg McCullough
@gmcculloug
Nov 08 2017 15:23
Thanks @agrare. Automate only does what you tell it to do, and sometimes it doesn't even do that.
Probably need to review automate logging with a focus on events
Keenan Brock
@kbrock
Nov 08 2017 15:24
@agrare @gmcculloug no. @Ladas was talking about an event storm just causing too many log lines in automate :point_up: November 8, 2017 10:04 AM
Adam Grare
@agrare
Nov 08 2017 15:26
didn't it still have that hex dump at the end of it though?
we can clear the payload in post_refresh_ems_cleanup
Keenan Brock
@kbrock
Nov 08 2017 15:28
thnx
Adam Grare
@agrare
Nov 08 2017 15:28
doesn't cover the case where the worker dies in the middle but it will still clear the blobs on a refresh exception
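A hedged sketch of that cleanup idea; the hook name post_refresh_ems_cleanup comes from the chat, but the target filtering and the full_data column are assumptions, not the actual ManageIQ implementation:

```ruby
# Hypothetical sketch: once a refresh completes (or raises), drop the
# raw event payloads that were only needed as refresh targets, so the
# large blobs stop being carried around and re-logged with the queue item.
def post_refresh_ems_cleanup(ems, targets)
  event_ids = targets.select { |t| t.kind_of?(EventStream) }.map(&:id)
  EventStream.where(:id => event_ids).update_all(:full_data => nil)
end
```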
Ladislav Smola
@Ladas
Nov 08 2017 15:30
@agrare I didn't see that in automation.log, only in evm.log
Adam Grare
@agrare
Nov 08 2017 15:30
hm it was in the automate log also I thought
Ladislav Smola
@Ladas
Nov 08 2017 15:31
@agrare right, maybe that was only the OutOfMemory when trying to queue, but not in the healthy workflow
Adam Grare
@agrare
Nov 08 2017 15:33
so now that we're only sending deletes, we should be able to see what their automate log looks like with just events coming in and the occasional delete
Ladislav Smola
@Ladas
Nov 08 2017 15:59
@agrare right, I think we already saw this over the past day, but let's clear the logs and wait 1 more day
@agrare the watch messages should not go through automate, right?
Adam Grare
@agrare
Nov 08 2017 16:00
correct, but it was logging the refresh queue item, which included the whole data column
Ladislav Smola
@Ladas
Nov 08 2017 16:01
@agrare also in automation? I haven't noticed that
Beni Cherniavsky-Paskin
@cben
Nov 08 2017 17:03
can you send (by mail; disconnecting now, and I want to track this) details on the "pod kill loop" causing tons of events? Is it the same pod crashing, or is the pod deleted and a new one created (EDIT: so are these events or watch notices or both)? Which event type(s)?
in the past, we saw the FailedSync event was very noisy in such situations and ended up blacklisting it; maybe we should blacklist more...
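If more event types do turn out to need blacklisting, a minimal sketch of what that could look like, assuming ManageIQ's BlacklistedEvent model with event_name/provider_model columns; the "BackOff" name is just an example of a noisy crash-loop event, not taken from this chat:

```ruby
# Hypothetical example: blacklist another noisy Kubernetes event type
# the same way FailedSync was, so a crash-looping pod doesn't flood
# the event pipeline and the evm/automation logs.
BlacklistedEvent.create!(
  :event_name     => "BackOff",
  :provider_model => "ManageIQ::Providers::Kubernetes::ContainerManager"
)
```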