@kbrock so from what I see, with 95% probability, this issue was caused by the refresh worker being dead while 9 other workers kept queuing targets into a single MiqQueue record. So this will always fail eventually, no matter the size of the targets we have
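To illustrate the failure mode above: a minimal sketch (not actual ManageIQ code — the variable names and counts are made up) of many producers merging targets into one shared queue record while the consumer that should drain it is dead, so the record grows without bound:

```ruby
require "set"

# Stands in for the single MiqQueue row the workers all merge into.
queue_record = { targets: Set.new }

producers = 9     # the 9 other workers from the discussion
rounds    = 1000  # arbitrary number of queuing rounds for the sketch

producers.times do |p|
  rounds.times do |r|
    # Each worker merges its targets into the same record instead of
    # creating new ones; nothing drains it because the refresh worker
    # is dead, so the record only ever grows.
    queue_record[:targets] << "ems_#{p}_target_#{r}"
  end
end

puts queue_record[:targets].size
```

Bounding the record size or detecting the dead consumer would both break the loop; the sketch only shows why failure is inevitable given enough time.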
@kbrock @dmetzger57 the other issue is a kill loop of one or more pods, causing a huge number of events that are flooding both evm.log and automation.log
@kbrock @dmetzger57 not sure what tools we have for this; I know we log-rotate and delete in time, but this is generating something like 7GB of automation.log in a few hours
can you send (by mail — disconnecting now, and I want to track this) details on the "pod kill loop" causing tons of events? is it the same pod crashing, or is the pod deleted and a new one created (EDIT: so are these events, or watch notices, or both)? which event type(s)?
in the past, we saw the FailedSync event was very noisy in such situations and ended up blacklisting it; maybe we should blacklist more...
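The blacklisting idea above can be sketched as a simple filter applied before events are logged or queued. This is a hedged illustration, not ManageIQ's real API: `FailedSync` comes from the discussion, but `BLACKLISTED_EVENT_TYPES`, `blacklisted?`, and the event hashes are invented for the example.

```ruby
require "set"

# Illustrative blacklist of noisy event types (FailedSync is the one
# mentioned above; any others would be added here).
BLACKLISTED_EVENT_TYPES = Set.new(%w[FailedSync])

# Hypothetical predicate: true if the event should be dropped.
def blacklisted?(event)
  BLACKLISTED_EVENT_TYPES.include?(event[:type])
end

events = [
  { type: "FailedSync", pod: "pod-a" },  # noisy, would flood the logs
  { type: "Started",    pod: "pod-b" },  # useful, keep it
]

kept = events.reject { |e| blacklisted?(e) }
puts kept.map { |e| e[:type] }.join(",")
```

Filtering at ingest like this keeps the noise out of both the queue and the logs, which is what made blacklisting FailedSync effective in the earlier incident.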