These are chat archives for ManageIQ/manageiq/performance

7th
Nov 2017
niroy
@niroy
Nov 07 2017 11:36
@Fryguy how much time does it take for the data to get purged? this needs to be done only on the appliances having C & U roles enabled right?
Ladislav Smola
@Ladas
Nov 07 2017 14:40
@kbrock here?
Keenan Brock
@kbrock
Nov 07 2017 14:40
@Ladas sounds like a plan
Ladislav Smola
@Ladas
Nov 07 2017 14:41
@kbrock so, I was saying to @agrare that maybe we should put the uniq check back
@kbrock in combination with only DELETED watch events, the target size should be small
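(a minimal sketch of the "only DELETED watch events" idea using kubeclient's watch API; illustrative only, the URL/token are placeholders and the real worker would queue a targeted refresh instead of printing)

require 'kubeclient'

ems_url = "https://openshift.example.com:8443/api"   # placeholder endpoint
token   = ENV["BEARER_TOKEN"]                         # placeholder credential
client  = Kubeclient::Client.new(ems_url, 'v1', auth_options: { bearer_token: token })

client.watch_pods.each do |notice|
  next unless notice.type == "DELETED"   # drop ADDED/MODIFIED notices so the queued target list stays small
  # the real collector would queue a targeted refresh for the deleted pod here
  puts "would queue a targeted refresh for #{notice.object.metadata.name}"
end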
Keenan Brock
@kbrock
Nov 07 2017 14:42
trying not to rant about the size of the message
or the quantity of refreshes we ask for
or the updating of the event in the first place :(
Ladislav Smola
@Ladas
Nov 07 2017 14:42
@kbrock in the background, we can store the payload in BinaryBlob as @Fryguy and @agrare want, which would optimize it for the largest environments
Keenan Brock
@kbrock
Nov 07 2017 14:43
well, loic had a small environment
and it died within a few minutes (not sure how many). only 42 updates
Ladislav Smola
@Ladas
Nov 07 2017 14:44
@kbrock but another thing is that dequeue should have higher priority, not sure how to achieve this, maybe we should be able to quickly take an explicit lock on the row?
Adam Grare
@agrare
Nov 07 2017 14:44
@kbrock yeah that's what scared me, it's a tiny env and it brought the queue to its knees
Keenan Brock
@kbrock
Nov 07 2017 14:44
I don't know if it is priority or if the refreshes are slow
Oleg Barenboim
@chessbyte
Nov 07 2017 14:44
guys, we are scheduled to do a Gaprindashvili build later today -- can you get the code base into a bearable safe state - and work on the optimization after that
Keenan Brock
@kbrock
Nov 07 2017 14:44
um
not saying they are slow in a bad way
but they are much slower than when we generate a "refresh this and that and ...."
Adam Grare
@agrare
Nov 07 2017 14:44
@chessbyte we already disabled this worker
Oleg Barenboim
@chessbyte
Nov 07 2017 14:45
thank you!
Ladislav Smola
@Ladas
Nov 07 2017 14:45
@kbrock right, but in a few minutes, there were 72k events queuing refresh and 32k watch events also queuing refresh, so the 'dequeue job' was starved
Oleg Barenboim
@chessbyte
Nov 07 2017 14:45
then carry on ;-)
Keenan Brock
@kbrock
Nov 07 2017 14:45
thnx chess - eye on the prize
Ladislav Smola
@Ladas
Nov 07 2017 14:47
@chessbyte the watch api was a bugfix, where we were missing anything created and deleted between 2 refreshes, which can happen a lot in OSE env, so we should put it back soon :-)
Keenan Brock
@kbrock
Nov 07 2017 14:47
@Ladas so we can't just throw away this code? ;)
Oleg Barenboim
@chessbyte
Nov 07 2017 14:48
@Ladas totally agree -- at the same time, first Gaprindashvili build was DOA (dead on arrival) -- want today's build to actually work
Ladislav Smola
@Ladas
Nov 07 2017 14:48
@kbrock I believe this one tries to dequeue in refresh worker @agrare right?
@chessbyte sound good :-)
Keenan Brock
@kbrock
Nov 07 2017 14:48
grep "stale, retrying" log/*.log 
... nothing ...
^ checked every log - showing nothing
Adam Grare
@agrare
Nov 07 2017 14:49
@kbrock is that the log live on the appliance or did you grab a copy?
Keenan Brock
@kbrock
Nov 07 2017 14:50
ooh
Ladislav Smola
@Ladas
Nov 07 2017 14:50
@kbrock yeah, it's all debug :-(
Keenan Brock
@kbrock
Nov 07 2017 14:50
@agrare thanks - perfect
Adam Grare
@agrare
Nov 07 2017 14:50
I nil'd out the log so the server could start
Keenan Brock
@kbrock
Nov 07 2017 14:50
darn
Adam Grare
@agrare
Nov 07 2017 14:50
and applied our deleted only patch, looks a lot better
yeah sorry didn't realize you were actively looking at those
Ladislav Smola
@Ladas
Nov 07 2017 14:50
I downloaded the evm log
Adam Grare
@agrare
Nov 07 2017 14:51
I can un-do the patch and bring the server down again haha
Keenan Brock
@kbrock
Nov 07 2017 14:51
only need to delete the last line ;)
lol
@agrare too soon for @Ladas
Ladislav Smola
@Ladas
Nov 07 2017 14:51
let me upload it to my Google Drive :-)
Keenan Brock
@kbrock
Nov 07 2017 14:51
just delete the last line - honestly
I pasted the important part of the last line in the PR
I learned tail -c which was pretty cool
Adam Grare
@agrare
Nov 07 2017 14:52
you mean delete the last line to free up space?
Keenan Brock
@kbrock
Nov 07 2017 14:53
well, if we are sharing the log, best to delete last line so it wasn't too big / too much $ / too slow to download
Ladislav Smola
@Ladas
Nov 07 2017 14:53
@kbrock @agrare gah, it's 3GB, give me a few secs to upload it :-)
I have just the evm.log
Keenan Brock
@kbrock
Nov 07 2017 14:54
du -sh / wc -l per file:
6.1G   1,633  automation.log
4.0G  97,526  evm.log
automate was at 1.6k lines - that is very early in the cycle
Ladislav Smola
@Ladas
Nov 07 2017 14:55
@agrare btw. the automation log is also huge, let's download it and clean it?
Adam Grare
@agrare
Nov 07 2017 14:55
@kbrock do you want me to reproduce?
Keenan Brock
@kbrock
Nov 07 2017 14:55
no
Ladislav Smola
@Ladas
Nov 07 2017 14:56
@kbrock @agrare let's ask @Loicavenel for the actual creds and try it locally with many generic workers
Keenan Brock
@kbrock
Nov 07 2017 14:56
@Ladas could you grep and see if we do have an issue with stale?
Ladislav Smola
@Ladas
Nov 07 2017 14:56
@kbrock the stale logs are all debug, so we will not see them
Keenan Brock
@kbrock
Nov 07 2017 14:56
his environment is pretty robust
thnx - (you already said that :( ) - I didn't "hear" you
Adam Grare
@agrare
Nov 07 2017 14:57
so how about we turn on debug, back out the deleted only patch and get it to fail again?
Ladislav Smola
@Ladas
Nov 07 2017 14:57
@kbrock so we should try locally, with those log lines on info, and see if we can do a better lock for dequeue, since that should always go first, before other queue jobs
Keenan Brock
@kbrock
Nov 07 2017 14:58
ok, @Ladas so your hypothesis is that there is a race condition on the queue giving write priority over read?
(not intentionally write priority over read but... think that is what is there)
Ladislav Smola
@Ladas
Nov 07 2017 14:59
@kbrock well, I am blaming the size, yes; on an env this small, the refresh should be fast
Keenan Brock
@kbrock
Nov 07 2017 14:59
sorry. edited (that was ugly - sorry)
Ladislav Smola
@Ladas
Nov 07 2017 14:59
@kbrock but we needed the DELETED filter too, since there are many changes
@kbrock @agrare shared the log with you
Keenan Brock
@kbrock
Nov 07 2017 15:00
well, all of my checks assumed that we were just appending onto the msg_data field
but since it is encoded, and the targets we are tacking on are long... kinda goes out the window
it says you could embed a virus in here... hmm
Ladislav Smola
@Ladas
Nov 07 2017 15:01
@kbrock yes, but the periodic refresh should be cleaning these up
Keenan Brock
@kbrock
Nov 07 2017 15:02
lol - you kept the last line intact?
Ladislav Smola
@Ladas
Nov 07 2017 15:02
@kbrock eh, no idea, I downloaded it like 1h ago
Keenan Brock
@kbrock
Nov 07 2017 15:02
the serialization (uuencode or something) for msg_data just makes it worse
yea, and this failed when there were only a few hundred messages in the queue
the current queue is much bigger
type                              vm_count  host_count  pod_count
Amazon::CloudManager                    41
AnsibleTower::AutomationManager
Azure::CloudManager                     85
Google::CloudManager                   900
Hawkular::MiddlewareManager
Lenovo::PhysicalInfraManager
Microsoft::InfraManager                  6           3
Openshift::ContainerManager                                  394
Openstack::CloudManager                 23
Openstack::InfraManager                  5           3
Redhat::InfraManager                    71           3
StorageManager::CinderManager
StorageManager::SwiftManager
Vmware::InfraManager                    74           3
Ladislav Smola
@Ladas
Nov 07 2017 15:06
@kbrock right, so it prints a query with 120k targets into the log, for 10 workers that are touching it
@kbrock btw. could we prevent logging whole queries in general, for exceptions?
Keenan Brock
@kbrock
Nov 07 2017 15:07
think that is a rails thing
Ladislav Smola
@Ladas
Nov 07 2017 15:08
I am trying to delete the big lines in the log, otherwise I can't even grep it :-D
Keenan Brock
@kbrock
Nov 07 2017 15:08
when you find the command please share
I was thinking about using head or something
@Ladas head -n 7226 loic_evm.log > loic_evm.head.log
I tried 7227 and tail would not work on the file
but with 7226 I could run tail
ooh
@Ladas you are right - that is US printing the line
that text got into the exception message
Ladislav Smola
@Ladas
Nov 07 2017 15:17
@kbrock do you see there some :save_inventory
?
Keenan Brock
@kbrock
Nov 07 2017 15:17
ok, so the last line in the log is 0.5G
but the file is still 2.4G
are there others in there? it is killing vim
Ladislav Smola
@Ladas
Nov 07 2017 15:18
@kbrock i am trying sed -i '/SET\s"msg_data"/d' loic_evm.log
Keenan Brock
@kbrock
Nov 07 2017 15:18
did that work?
Ladislav Smola
@Ladas
Nov 07 2017 15:18
@kbrock and sed -i '/PG::InternalError: ERROR/d' loic_evm.log
@kbrock cleaned up the long lines
Keenan Brock
@kbrock
Nov 07 2017 15:19
Do we have a bunch of those lines?
Ladislav Smola
@Ladas
Nov 07 2017 15:20
@kbrock those should be the ones that have the whole msg_data logged
@kbrock but I see randomly placed binary data, seems like the log write was way bigger than the buffer size, so the log write was not atomic :-)
@kbrock anyway last_refresh_date: Sat, 04 Nov 2017 06:58:54 UTC +00:00
@kbrock so that looks like the log with the successful refresh was rotated 2 days ago
Ladislav Smola
@Ladas
Nov 07 2017 15:25
@agrare so did you revert the change?
@agrare the log disk is full again
Adam Grare
@agrare
Nov 07 2017 15:27
i didn't touch it, you guys said you wanted to test it locally
and figured i shouldn't intentionally break loic's appliance w/o asking first :)
Ladislav Smola
@Ladas
Nov 07 2017 15:27
@agrare ah, ok
@agrare well the appliance is dead
Adam Grare
@agrare
Nov 07 2017 15:28
-rw-r--r--. 1 root root 8.5G Nov 7 04:01 log/automation.log
Ladislav Smola
@Ladas
Nov 07 2017 15:28
@agrare we could allow the stale logging there, clean the MiqQueue refresh item, clean the logs and restart
@agrare it should at least show us if the optimistic lock (stale record) is being hit a lot
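(for reference, a rough sketch of the put-or-update pattern the "stale, retrying" log line comes from: ActiveRecord optimistic locking on MiqQueue's lock_version, retried on conflict; simplified, new_targets is a stand-in and this is not the exact MiqQueue code)

new_targets = [["Vm", 42]]   # stand-in for whatever the event handler wants to append
msg = MiqQueue.find_by(:method_name => "refresh", :queue_name => "ems_12")
begin
  msg.data = (msg.data || []) | new_targets   # merge the new targets into msg_data
  msg.save!                                   # bumps lock_version; raises if another worker updated the row first
rescue ActiveRecord::StaleObjectError
  msg.reload                                  # lost the race: reload the row and try the merge again
  retry
end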
Adam Grare
@agrare
Nov 07 2017 15:29
be my guest haha
Keenan Brock
@kbrock
Nov 07 2017 15:31
yea - it is quick
Ladislav Smola
@Ladas
Nov 07 2017 15:32
@kbrock @agrare I will download evm and automate logs and clean them
Keenan Brock
@kbrock
Nov 07 2017 15:33
whoa - sed is taking some time :(
Ladislav Smola
@Ladas
Nov 07 2017 15:37

@agrare @kbrock manually running full refresh

[----] I, [2017-11-07T04:33:46.930549 #30261:9bd138] INFO -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager::Refresher#refresh) EMS: [OpenShift], id: [12] Refreshing targets for EMS...Complete - Timings {:collect_inventory_for_targets=>1.0771350860595703, :parse_targeted_inventory=>2.738220453262329, :save_inventory=>2.0162055492401123, :manager_refresh_post_processing=>4.172325134277344e-05, :ems_refresh=>5.8321452140808105}

so it's quick
Keenan Brock
@kbrock
Nov 07 2017 15:37
yes
Keenan Brock
@kbrock
Nov 07 2017 15:44
whoa - automation log is out of control
Ladislav Smola
@Ladas
Nov 07 2017 15:44
yes
basically every event handler should be blowing up with the same error
@kbrock I have downloaded the automation.log, I will clean it now
Keenan Brock
@kbrock
Nov 07 2017 15:45
lol - takes 5 minutes to download at work
Ladislav Smola
@Ladas
Nov 07 2017 15:45
@kbrock are you still downloading it?
Keenan Brock
@kbrock
Nov 07 2017 15:46
yea
Oleg Barenboim
@chessbyte
Nov 07 2017 15:46
I thought this is the performance room -- downloading things should be fast!
Ladislav Smola
@Ladas
Nov 07 2017 15:46
hehe
Keenan Brock
@kbrock
Nov 07 2017 15:46
and I'm wired too ;)
Oleg Barenboim
@chessbyte
Nov 07 2017 15:46
oh I know you are wired, @kbrock
Keenan Brock
@kbrock
Nov 07 2017 15:46
oh no. better get this download working quicker or I'm out of a job
lol
Ladislav Smola
@Ladas
Nov 07 2017 15:46
@chessbyte I have only 14.1MB/s in the office
:-)
Keenan Brock
@kbrock
Nov 07 2017 15:47
I'm @ 27MB/s :(
Ladislav Smola
@Ladas
Nov 07 2017 15:47
heh
well I have around 20MB/s at home
Keenan Brock
@kbrock
Nov 07 2017 15:47
think I'm 60MB/s at home - wired
Ladislav Smola
@Ladas
Nov 07 2017 15:47
so you still win :-)
heh
Oleg Barenboim
@chessbyte
Nov 07 2017 15:47
wow -- I have over 120MB/s at home on wireless
Keenan Brock
@kbrock
Nov 07 2017 15:47
haven't tried speed test at home for wireless
Ladislav Smola
@Ladas
Nov 07 2017 15:48
wait, I meant 20Mb/s, why it shows me bits
Keenan Brock
@kbrock
Nov 07 2017 15:48
wc -l filename
372 evm.log
880 automation.log
yea - this fails FAST
Ladislav Smola
@Ladas
Nov 07 2017 15:49
@kbrock can I truncate it now?
Keenan Brock
@kbrock
Nov 07 2017 15:49
43 seconds for automation
haven't tried evm.log yet but that should be quick
ok @Ladas it is all yours
Ladislav Smola
@Ladas
Nov 07 2017 15:50
@kbrock I've already truncated the evm.log
@kbrock but shared it with you on Google Drive :-)
Keenan Brock
@kbrock
Nov 07 2017 15:50
ooh
eh
I know the punch line: there was some error w/ the queue
I probably should have let you truncate THEN downloaded it
Oleg Barenboim
@chessbyte
Nov 07 2017 15:51
ooh-eh-ooh-eh ==> sounds that monkeys make :-)
Ladislav Smola
@Ladas
Nov 07 2017 15:53
@agrare so you haven't applied the DELETED filter there?
@kbrock ok, we can restart the appliance, I've cleaned up the queue record and logs and added logging for the stale object
@kbrock restarting
Keenan Brock
@kbrock
Nov 07 2017 15:57
thnx
@Ladas the sed is not working so well for me
takes a long time - and the file is the same size :(
Ladislav Smola
@Ladas
Nov 07 2017 15:57
@kbrock hum weird
Keenan Brock
@kbrock
Nov 07 2017 15:57
ooh
do you need to play with the print stuff to make sure it is deleted? <== nope
Ladislav Smola
@Ladas
Nov 07 2017 15:58
@kbrock what do you mean?
Keenan Brock
@kbrock
Nov 07 2017 15:58
ignore
works like a charm:
echo -e "a\nb\na\nb"| sed '/b/d'
Ladislav Smola
@Ladas
Nov 07 2017 15:59
@kbrock hm, restart appliance reboots the machine? Or did I just kill it? :-D
@kbrock I've used the appliance console, instead of rake evm:restart
Adam Grare
@agrare
Nov 07 2017 16:00
@Ladas I left the deleted filter in
Ladislav Smola
@Ladas
Nov 07 2017 16:00
@agrare ok, so let's watch what it will do
@agrare then let's put it back, I've switched the stale log line to info, I would like to see if we are really starving out the 'dequeue'
Keenan Brock
@kbrock
Nov 07 2017 16:04
yea, sed is confusing me. ran both of those commands you wrote and:
wc -l try2*
     868 try2_automation.log
     880 try2_automation.log.bak
    2120 total
[fetch_collector_collect] manageiq/work $ du -sh try2*
8.4G    try2_automation.log
8.4G    try2_automation.log.bak
got rid of 12 lines, but same size
Ladislav Smola
@Ladas
Nov 07 2017 16:04
hum
are you on mac?
Keenan Brock
@kbrock
Nov 07 2017 16:05
aaah
Ladislav Smola
@Ladas
Nov 07 2017 16:05
then the format would be sed -i '' '/pattern/d' ./infile
Keenan Brock
@kbrock
Nov 07 2017 16:05
(yea)
Ladislav Smola
@Ladas
Nov 07 2017 16:05
hehe
Keenan Brock
@kbrock
Nov 07 2017 16:05
ooh - yea, I did -ibak
um

cool. I like -i '' better

If a zero-length extension is given, no backup will be saved

Ladislav Smola
@Ladas
Nov 07 2017 16:08
@kbrock hm, i've started it, but nothing queues the refresh
Keenan Brock
@kbrock
Nov 07 2017 16:08
yea
I don't think it gets that far
did you truncate tables yet?
Ladislav Smola
@Ladas
Nov 07 2017 16:09
@kbrock which tables?
@kbrock I;ve just deleted the 1 queued refresh for openshift
@kbrock at least the first refresh should be queued though
Keenan Brock
@kbrock
Nov 07 2017 16:10
class_name method_name count
MiqAeEngine deliver 1412
MiqEvent raise_evm_event 967
EmsEvent add 58
MiqServer stop_worker 51
MiqAlert evaluate_alerts 48
Openstack::InfraManager::Host perf_capture_historical 22
MiqServer shutdown_and_exit 18
VmdbDatabaseConnection log_statistics 2
MiqServer status_update 2
EmsRefresh refresh 2
MiqServer log_status 2
MiqWorker log_status_all 2
MiqServer queue_update_registration_status 2
Metric::Purging purge_rollup_timer 1
MiqTask destroy_older_by_condition 1
MiqEnterprise perf_rollup 1
Metric::Capture perf_capture_timer 1
MiqServer delete_active_log_collections 1
Metric::Purging purge_realtime_timer 1
Openstack::CloudManager::Provision poll_destination_in_vmdb 1
Ladislav Smola
@Ladas
Nov 07 2017 16:11
@kbrock there are not refresh workers, looks like metrics appliance
@kbrock @agrare is there actually refresh worker running?
Keenan Brock
@kbrock
Nov 07 2017 16:12
select id, lock_version,length(args) args,length(msg_data) msg_data from miq_queue where method_name = 'refresh';
   id    | lock_version | args | msg_data 
---------+--------------+------+----------
 7472718 |            7 |      |     1095
 7472717 |          111 |      |    12809
Ladislav Smola
@Ladas
Nov 07 2017 16:13
@agrare this might be a +1 for the unique check, so that the queue item will not grow indefinitely
@kbrock ok, yeah I see them now
Keenan Brock
@kbrock
Nov 07 2017 16:14
the other one - took only 42 tries to get too big
Ladislav Smola
@Ladas
Nov 07 2017 16:16
@kbrock there are 2 other appliances with refresh worker enabled
Keenan Brock
@kbrock
Nov 07 2017 16:16
aah - so you deleted the openshift - wonder how big the msg_data was
how is msg_data encoded? can we see it in sql?
<rant>using ruby only logicdata structures is killing us</rant>
@kbrock uh, forgot the size
Keenan Brock
@kbrock
Nov 07 2017 16:20
so there is no way to do unique in sql.
we keep shooting ourselves with this ruby-only crap
^ ignore / <rant>
Ladislav Smola
@Ladas
Nov 07 2017 16:21
@kbrock actually I think @agrare removed the ruby uniq, now we have only concat
@kbrock maybe we will need the uniq back though
Keenan Brock
@kbrock
Nov 07 2017 16:21
we want to get rid of the ruby uniq because we don't want it in ruby
Ladislav Smola
@Ladas
Nov 07 2017 16:21
@kbrock as a trade-off of speed vs. max queue item size
Keenan Brock
@kbrock
Nov 07 2017 16:21
IF it is in ruby, then the uniq is kinda a no-op
ok, not a no-op
and the longer it takes, the more chance of a failing race condition (to your point)
@Ladas what is happening on that machine? it is still running
Ladislav Smola
@Ladas
Nov 07 2017 16:25
@kbrock I am checking the other 2 appliances
Keenan Brock
@kbrock
Nov 07 2017 16:25
@Ladas also, if you are checking for duplicates and there isn't anything new, maybe we won't even have to write the queue entry
Ladislav Smola
@Ladas
Nov 07 2017 16:26
@kbrock possibly
@kbrock so, we replace the uniq with concat here
Keenan Brock
@kbrock
Nov 07 2017 16:26
for serialized columns (even for dates), I've added model.col = val if model.col != val
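(roughly this guard, as a generic sketch: avoid assigning an equal value, so the serialized attribute isn't flagged as changed and rewritten on every save; Msg and new_val are placeholders, not a real model)

class Msg < ActiveRecord::Base
  serialize :col
end

new_val = {:targets => []}                # stand-in for the value we were about to write
msg = Msg.first
msg.col = new_val if msg.col != new_val   # skip the assignment (and the dirty flag) when nothing changed
msg.save! if msg.changed?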
Keenan Brock
@kbrock
Nov 07 2017 16:29
Is the problem 1) that we have lots of duplicates? Or 2) that each element is so big that duplicates are very expensive?
the idea is to revert lines 83 and 174?
Ladislav Smola
@Ladas
Nov 07 2017 16:30
@kbrock @agrare so, I am thinking that this might have been caused by refresh appliance dying
Keenan Brock
@kbrock
Nov 07 2017 16:31
that makes it pretty unstable
Ladislav Smola
@Ladas
Nov 07 2017 16:31
@kbrock @agrare at which point the MiqQueue record rose a lot and killed other appliances, just by blowing up all logs :-)
Keenan Brock
@kbrock
Nov 07 2017 16:31
heh
"one appliance stopped working. lets all take the day off"
ugh
Ladislav Smola
@Ladas
Nov 07 2017 16:40
@kbrock @agrare just the line 174, we should put it back to msg.data | targets ?
  
@Fryguy ^
Adam Grare
@agrare
Nov 07 2017 16:41
@Ladas if msg.data wasn't so large, would the duplicates be an issue?
Ladislav Smola
@Ladas
Nov 07 2017 16:41
@kbrock btw. how are we planning to do this with a normal queue? if the appliance with refresh dies, we will be pushing into the queue without reading, and at some point we will fill it
@agrare I think that in the case that refresh dies, it will just take like 10x or 20x longer to see the same exception
Keenan Brock
@kbrock
Nov 07 2017 16:43
the goal is to avoid "put unless exists" / "put or update"
@Ladas I always saw "this needs refresh" different from "this is the work that needs to be done"
Ladislav Smola
@Ladas
Nov 07 2017 16:44
@kbrock right but we will need to solve "queue is 80% full, you better scale your X workers, or it will start to refuse messages"
Keenan Brock
@kbrock
Nov 07 2017 16:44
are we going to see many duplicates?
Adam Grare
@agrare
Nov 07 2017 16:44
I think that's the persister autoscaler @Fryguy was talking about, we need a way to introspect the queue depth
Ladislav Smola
@Ladas
Nov 07 2017 16:45
@kbrock well, we will push every change and full refresh data, so unless it's being consumed, it will fill the queue or disk
Keenan Brock
@kbrock
Nov 07 2017 16:45
um - you either look at queue depth (non trivial) or you look at latency (easy)
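(for illustration, a rough way to read both from MiqQueue in a Rails console; assumes the standard state/created_on columns, not an actual autoscaler)

depth   = MiqQueue.where(:state => "ready", :queue_name => "ems_12").count
oldest  = MiqQueue.where(:state => "ready", :queue_name => "ems_12").minimum(:created_on)
latency = oldest ? Time.now.utc - oldest : 0   # seconds the oldest ready message has been waiting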
Jason Frey
@Fryguy
Nov 07 2017 16:45
if you uniq the targets, then how do you correlate multiple payloads for the same target?
Ladislav Smola
@Ladas
Nov 07 2017 16:45
@Fryguy for now, we decided to only collect 'DELETED Pod'
Jason Frey
@Fryguy
Nov 07 2017 16:46
(sorry, in a call atm, so responses will be delayed)
Adam Grare
@agrare
Nov 07 2017 16:46
I don't think duplicates are a problem for watch events
Jason Frey
@Fryguy
Nov 07 2017 16:46
well, you can't duplicately delete a pod, so what's the problem?
Ladislav Smola
@Ladas
Nov 07 2017 16:46
@Fryguy so there should be just 1 message per 1 pod
Keenan Brock
@kbrock
Nov 07 2017 16:46
so why are we having troubles?
Adam Grare
@agrare
Nov 07 2017 16:46
@Ladas if msg.data wasn't so large, would the duplicates be an issue?
if we didn't put the payload in data would we still be having issues?
Keenan Brock
@kbrock
Nov 07 2017 16:47

same question as

Is the problem 1) that we have lots of duplicates? Or 2) that each element is so big that duplicates are very expensive?

Adam Grare
@agrare
Nov 07 2017 16:47
i don't see duplicates being an issue when the payload is [Vm, 1], [Vm, 1], ... 1000x
Ladislav Smola
@Ladas
Nov 07 2017 16:47
@Fryguy right, so I think we are looking at 'refresh worker died', and the queue record size rises until it throws an exception, which bloats the logs, which seems to kill all appliances
@Fryguy so we might need to partially revert the https://github.com/ManageIQ/manageiq/pull/16271/files
Keenan Brock
@kbrock
Nov 07 2017 16:48
@agrare so you just said option 2 - cool
Ladislav Smola
@Ladas
Nov 07 2017 16:48
@Fryguy to re-add the uniq check
Keenan Brock
@kbrock
Nov 07 2017 16:49
there are 2 lines of interest + uniq and concat -> |
we are just thinking the second one?
Ladislav Smola
@Ladas
Nov 07 2017 16:49
@agrare without payload, the OutOfMemory exception will take longer to occur
Adam Grare
@agrare
Nov 07 2017 16:49
@Ladas do you think there are a lot of duplicates though?
they're all unique payloads from kubernetes
Jason Frey
@Fryguy
Nov 07 2017 16:50
how about we move to the binary blobs and then see if the uniq is still an issue?
Adam Grare
@agrare
Nov 07 2017 16:50
without payload, the OutOfMemory exception will take longer to occur
exactly
Jason Frey
@Fryguy
Nov 07 2017 16:50
locally I did 5000 records in msg_data without a hiccup
Ladislav Smola
@Ladas
Nov 07 2017 16:50
@agrare well, in this particular env, I see 120k targets, for ~20 unique targets
Adam Grare
@agrare
Nov 07 2017 16:50
right, but i bet you they aren't actually duplicates
Ladislav Smola
@Ladas
Nov 07 2017 16:50
@agrare while 90k was the same EMS
Adam Grare
@agrare
Nov 07 2017 16:50
they're unique updates
Keenan Brock
@kbrock
Nov 07 2017 16:51
again, the lock_version was 42 and 111 for the failures. So it happened pretty quickly.
Jason Frey
@Fryguy
Nov 07 2017 16:51
90k deletes?
Ladislav Smola
@Ladas
Nov 07 2017 16:51
@agrare yes
@Fryguy it was 90k events, 30k watch changes
Adam Grare
@agrare
Nov 07 2017 16:51
i still think we should move payload out of data to blobs, then see if we need to add the unique check back
Jason Frey
@Fryguy
Nov 07 2017 16:51
let's talk real numbers...how many of those were deletes
Adam Grare
@agrare
Nov 07 2017 16:51
none of them were deletes
Jason Frey
@Fryguy
Nov 07 2017 16:51
i don't think we should optimize a problem we don't have
Ladislav Smola
@Ladas
Nov 07 2017 16:51
but the numbers could have taken like 2 days
Jason Frey
@Fryguy
Nov 07 2017 16:52
agreed @agrare
Ladislav Smola
@Ladas
Nov 07 2017 16:52
@agrare @Fryguy some of them might have been deletes, we do not store the type into MiqQueue
Jason Frey
@Fryguy
Nov 07 2017 16:52
payload MUST be removed from MiqQueue messages anyway, so it makes no sense to perf test anything with those still in there
Adam Grare
@agrare
Nov 07 2017 16:54
unless they started out with 30,020 pods and deleted 30,000 of them, I'm willing to bet that the number of deletes is a rounding error compared to the number of updates
Jason Frey
@Fryguy
Nov 07 2017 16:54
and this is also only if they all come in between refreshes
Ladislav Smola
@Ladas
Nov 07 2017 16:54
@agrare no I checked that there were 20 unique ems_refs in the 30k targets queued
Jason Frey
@Fryguy
Nov 07 2017 16:55
because once you pick up the refresh record, then you start over from 0 targets again
Adam Grare
@agrare
Nov 07 2017 16:55
@Ladas idk what we're arguing about anymore
they have like 22 pods
they didn't have the delete filter before
Ladislav Smola
@Ladas
Nov 07 2017 16:55
@agrare I've seen some Terminated statuses of the Container, though
Adam Grare
@agrare
Nov 07 2017 16:56
and it led to 30000 updates
Ladislav Smola
@Ladas
Nov 07 2017 16:58
@agrare yes, I think they were probably stress testing it
@agrare so, the DELETED filter should remove the 30k updates
Adam Grare
@agrare
Nov 07 2017 16:59
agreed
Ladislav Smola
@Ladas
Nov 07 2017 16:59
@agrare but we would still have 90k duplicates of EMS full refresh
@agrare the memory limit for that is in millions probably
Adam Grare
@agrare
Nov 07 2017 17:00
but if those are from the event catcher shouldn't they just be class+id pairs?
Ladislav Smola
@Ladas
Nov 07 2017 17:00
@agrare yes
Adam Grare
@agrare
Nov 07 2017 17:00
this should be easy to test
lets just queue refresh of the same vm 100k times
and see how big the msg_data is, and how long it takes the refresh worker to dequeue it
my money is on it being negligible but we'll see
Adam Grare
@agrare
Nov 07 2017 17:14
okay it is 56kib for 1000 targets
each target is ["ManageIQ::Providers::Vmware::InfraManager::Vm", 1]
Jason Frey
@Fryguy
Nov 07 2017 17:14
is that the column size?
Adam Grare
@agrare
Nov 07 2017 17:15
msg_data.size
Jason Frey
@Fryguy
Nov 07 2017 17:15
ah ok
Marshal data is "compressed" in an interesting way, so that string is effectively a couple bytes when duplicated
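(roughly what that means: Marshal writes a back-reference when it meets the same object again within one dump, so repeating one object is far cheaper than repeating equal-but-distinct copies; a console sketch)

target   = ["ManageIQ::Providers::Vmware::InfraManager::Vm", 1]
same_obj = Array.new(1000) { target }                  # 1000 references to the one array
copies   = Array.new(1000) { [target.first.dup, 1] }   # 1000 distinct strings/arrays

Marshal.dump(same_obj).bytesize   # small: one full entry plus ~3-byte links
Marshal.dump(copies).bytesize     # ~50+ bytes per target, in line with the 56KiB per 1000 measured above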
Adam Grare
@agrare
Nov 07 2017 17:15
so 5 MiB for 100k targets
(extrapolating, didn't actually queue)
Jason Frey
@Fryguy
Nov 07 2017 17:16
ah...yeah I'm not sure you can extrapolate exactly like that
Adam Grare
@agrare
Nov 07 2017 17:16
okay let me try actually queueing that many
Jason Frey
@Fryguy
Nov 07 2017 17:16
how long did it take to enqueue 1000 targets
(part of switching to concat was because enqueueing was faster)
Adam Grare
@agrare
Nov 07 2017 17:17
30s
Jason Frey
@Fryguy
Nov 07 2017 17:17
oh wow...slow
wait, did you do EmsRefresh.queue_refresh in a loop one-by-one?
Adam Grare
@agrare
Nov 07 2017 17:18
yes not batching
Jason Frey
@Fryguy
Nov 07 2017 17:18
ah ok yeah, that makes sense...that's why we talked about batching because doing that one-by-one is terrible
Adam Grare
@agrare
Nov 07 2017 17:18
yeah let me do that, don't feel like waiting 100 * 30s
Jason Frey
@Fryguy
Nov 07 2017 17:18
I thought you were calling EmsRefresh.queue_refresh([t1, t2, t3...x1000])
Adam Grare
@agrare
Nov 07 2017 17:27
must be doing something wrong but it isn't duplicating them when I pass them all as args
vm = Vm.first
targets = 100000.times.map { vm }
Benchmark.realtime_block(:queue_refresh) { EmsRefresh.queue_refresh(targets) }
MiqQueue.first.data.count
=> 1
Adam Grare
@agrare
Nov 07 2017 17:33
duh, that's because we're still doing .uniq for targets passed to queue_refresh https://github.com/ManageIQ/manageiq/blob/master/app/models/ems_refresh.rb#L59
Ladislav Smola
@Ladas
Nov 07 2017 17:34
@agrare we do :-)
Adam Grare
@agrare
Nov 07 2017 17:35
disabling that for now :)
Ladislav Smola
@Ladas
Nov 07 2017 17:37
@agrare ok, so @Loicavenel appliance is up and running
Adam Grare
@agrare
Nov 07 2017 17:39
>> MiqQueue.first.msg_data.size
=> 5600008
Guess we can extrapolate like that haha
it was dramatically faster though, 1000 targets in a slice 100 times took 100 seconds
Ladislav Smola
@Ladas
Nov 07 2017 17:40
@agrare so we have the DELETED filter there?
Adam Grare
@agrare
Nov 07 2017 17:41
@Ladas I just queued the same vm 100k times to see how big the msg_data column would get
to see if we need to add the unique check back in
Ladislav Smola
@Ladas
Nov 07 2017 17:42
@agrare right, so the limit is much higher
@agrare but I suppose the bigger envs can easily produce millions of 'queue_refresh' in a few days
Adam Grare
@agrare
Nov 07 2017 17:43
this is also assuming the refresh worker is completely down the entire time
Ladislav Smola
@Ladas
Nov 07 2017 17:43
@agrare can you check Loic's server now?
Adam Grare
@agrare
Nov 07 2017 17:44
check it for what?
Ladislav Smola
@Ladas
Nov 07 2017 17:44
MiqQueue.where(:method_name => "refresh", :queue_name => "ems_12").first.data.select {|x| x.first == "ManagerRefresh::Target"}.map {|x| x.second.try(:[], :manager_ref) }.uniq.count
Adam Grare
@agrare
Nov 07 2017 17:45
18
the automate log filled up again though
MiqQueue.where(:method_name => "refresh", :queue_name => "ems_12").first.data.count
=> 83288
MiqQueue.where(:method_name => "refresh", :queue_name => "ems_12").first.msg_data.size
=> 188150335
Ladislav Smola
@Ladas
Nov 07 2017 17:47
@agrare yes
Adam Grare
@agrare
Nov 07 2017 17:47
I'm going to delete that queue item, it keeps filling up automate log
Ladislav Smola
@Ladas
Nov 07 2017 17:47
@agrare 40k ManagerRefresh::Target
@agrare the deleted filter might not work?
@agrare wait, it's a new one
Adam Grare
@agrare
Nov 07 2017 17:47
that's probably from before the delete filter
Ladislav Smola
@Ladas
Nov 07 2017 17:48
created_on: Mon, 09 Oct 2017 11:06:48 UTC +00:00, updated_on: Tue, 07 Nov 2017 11:33:51 UTC +00:00,
no, I deleted that one
this is from few hours ago
lol to created at, that is from the future :-D
no, what, 09 oct
Adam Grare
@agrare
Nov 07 2017 17:49
okay deleted the queue item, going to restart the service
Ladislav Smola
@Ladas
Nov 07 2017 17:49
how is that possible
Adam Grare
@agrare
Nov 07 2017 17:49
09 oct is the past ladas ;)
Ladislav Smola
@Ladas
Nov 07 2017 17:49
@agrare i was deleting it today, this was a fresh one
Adam Grare
@agrare
Nov 07 2017 17:49
well, depends on what year
so I'll leave it up to @Fryguy if he thinks 5MB for 100k targets is too much
for now we need to move the managerrefresh::target payload to binary blobs
Ladislav Smola
@Ladas
Nov 07 2017 17:51
@agrare what service have you restarted?
Adam Grare
@agrare
Nov 07 2017 17:51
evmserverd
Ladislav Smola
@Ladas
Nov 07 2017 17:51
@agrare ok, so this queue item was really created a few hours ago
@agrare so something is wrong, the DELETED filter probably doesn't work, or it sends way too many deleted messages
Adam Grare
@agrare
Nov 07 2017 17:52
okay we should see Received change for pod
i'll tail the logs for that
Ladislav Smola
@Ladas
Nov 07 2017 17:53
@agrare also, check the 'stale', I've switched that to info
Adam Grare
@agrare
Nov 07 2017 17:53
so tail -f log/evm.log | egrep 'Received change for pod|stale' ?
just 'stale'?
Ladislav Smola
@Ladas
Nov 07 2017 17:53
yeah
Adam Grare
@agrare
Nov 07 2017 17:53
+1
Ladislav Smola
@Ladas
Nov 07 2017 17:54
would be nice to get count of them from all appliances
Adam Grare
@agrare
Nov 07 2017 17:56
which appliance is the refresh worker on?
Ladislav Smola
@Ladas
Nov 07 2017 17:57
@agrare hm, the InventoryCollectorWorker was on more appliances
damned, disks are full again
Adam Grare
@agrare
Nov 07 2017 17:57
they are?
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/VG--CFME-lv_os             4.5G  2.3G  2.3G  51% /
devtmpfs                                16G     0   16G   0% /dev
tmpfs                                   16G  132K   16G   1% /dev/shm
tmpfs                                   16G  720K   16G   1% /run
tmpfs                                   16G     0   16G   0% /sys/fs/cgroup
/dev/sda1                             1014M  164M  851M  17% /boot
/dev/mapper/VG--CFME-lv_home          1014M   33M  982M   4% /home
/dev/mapper/VG--CFME-lv_var             12G  897M   12G   8% /var
/dev/mapper/VG--CFME-lv_log             10G   89M   10G   1% /var/www/miq/vmdb/log
/dev/mapper/vg_pg-lv_pg                300G   79G  222G  27% /var/opt/rh/rh-postgresql95/lib/pgsql
Ladislav Smola
@Ladas
Nov 07 2017 17:58
just on 119
Adam Grare
@agrare
Nov 07 2017 17:58
oh okay i'm on 10.9.62.110
Ladislav Smola
@Ladas
Nov 07 2017 17:58
and 122 is kind of dead
i will restart them
Adam Grare
@agrare
Nov 07 2017 17:59
okay i need to grab something to eat brb
Ladislav Smola
@Ladas
Nov 07 2017 17:59
hm
@agrare I think we need a role check for ManageIQ::Providers::Openshift::ContainerManager::InventoryCollectorWorker
@agrare i see it on all appliances
@agrare it should run just once with refresh worker?
@agrare I see it killed on many places
Adam Grare
@agrare
Nov 07 2017 18:02
It should only be on the server with the active ems_inventory role
I didn't realize until now they had more than one server in the zone
Ladislav Smola
@Ladas
Nov 07 2017 18:05
@agrare ok, it might have that many deleted pods, it keeps killing kibana-proxy
@agrare ok, I don't see the refresh worker running anywhere
ManageIQ::Providers::Openshift::ContainerManager::InventoryCollectorWorker runs on 112
Adam Grare
@agrare
Nov 07 2017 18:08
few stale messages, all metrics/generic worker
Ladislav Smola
@Ladas
Nov 07 2017 18:09
@agrare ok, refresh worker is on 112
@agrare so I think this was really caused by the fact the refresh worker was dead; in 2-3 hours, this env produced 40k watch messages and 40k full refreshes
Adam Grare
@agrare
Nov 07 2017 18:10
if the collector worker is running on the other appliance, i didn't apply the deleted filter there
Ladislav Smola
@Ladas
Nov 07 2017 18:10
@agrare so even without payload, this might end up with the same error after 1 weekend e.g.
@agrare ok, so that might be the reason
Adam Grare
@agrare
Nov 07 2017 18:11
yeah
let me do that and restart
Ladislav Smola
@Ladas
Nov 07 2017 18:12
@agrare no, ManageIQ::Providers::Openshift::ContainerManager::InventoryCollectorWorker is running twice
Adam Grare
@agrare
Nov 07 2017 18:12
yeah its a problem with has_required_role?
Ladislav Smola
@Ladas
Nov 07 2017 18:12
@agrare 110 and 112
Adam Grare
@agrare
Nov 07 2017 18:12
i'm on it
Ladislav Smola
@Ladas
Nov 07 2017 18:12
@agrare 119 does not have any worker, not sure why
@agrare probably a wrong check and we need to make sure it runs just once per queue name, like refresh worker, right?
Adam Grare
@agrare
Nov 07 2017 18:13
yeah
Ladislav Smola
@Ladas
Nov 07 2017 18:13
@agrare so we might have received the messages 3x, from 3 appliances
Adam Grare
@agrare
Nov 07 2017 18:13
all it was checking was if it was disabled
Ladislav Smola
@Ladas
Nov 07 2017 18:14
also, I am logging the 'stale' only on 110 :-)
@agrare eh, 119 has the same roles as 112, so no workers are starting there?
@agrare probably try to restart it, I have to run home :-D
Jason Frey
@Fryguy
Nov 07 2017 18:32
so that's 100,000 VM targets right?
Adam Grare
@agrare
Nov 07 2017 18:32
right
Jason Frey
@Fryguy
Nov 07 2017 18:32
Those Manager::Refresh::Target targets are much larger because they are hashes
not sure if everything in that hash really needs to be on the queue
Adam Grare
@agrare
Nov 07 2017 18:32
probably not, but we also won't have many duplicates of those either
Jason Frey
@Fryguy
Nov 07 2017 18:33
right
is 100k targeted refreshes a reasonable number?
Keenan Brock
@kbrock
Nov 07 2017 18:33
Will a uniq detect duplicate hashes?
Jason Frey
@Fryguy
Nov 07 2017 18:33
yes
we avoided the .uniq because it made enqueueing 1-by-1 slower
it's a balance between speed of enqueueing and speed of dequeueing
so we moved the .uniq to the dequeue side to make it faster to enqueue
one note is that the uniq with the Manager::Refresh::Target hashes is expensive
because it uniqs the entire payload instead of something more specific
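(i.e. something like this on the dequeue side: dedupe on a small key instead of comparing whole payloads; a sketch only, assuming the [class_name, payload] pairs seen in msg_data above)

targets = MiqQueue.find_by(:method_name => "refresh", :queue_name => "ems_12").data

# expensive: compares/hashes every full ManagerRefresh::Target payload hash
targets.uniq

# cheaper: compare only the identifying bits, falling back to the raw id for [klass, id] pairs
targets.uniq { |klass, payload| [klass, payload.kind_of?(Hash) ? payload[:manager_ref] : payload] }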
Adam Grare
@agrare
Nov 07 2017 21:12
@kbrock @Fryguy @Ladas ManageIQ/manageiq#16413
throw :tomato: as LJ likes to say :)
Keenan Brock
@kbrock
Nov 07 2017 21:12
:)
@agrare if there are duplicates, we'll end up with a bunch of BinaryBlob rows?
Jason Frey
@Fryguy
Nov 07 2017 21:13
there are no duplicates
no literal duplicates (maybe same resource, but the payload will always be different)
Adam Grare
@agrare
Nov 07 2017 21:14
right, the whole point is kubernetes is telling us that something changed, so that kind of guarantees the payload will be different
Keenan Brock
@kbrock
Nov 07 2017 21:15
so every time it comes in, we'll delete the old blob and create another
Jason Frey
@Fryguy
Nov 07 2017 21:15
no we just add another blob (multiple different payloads)
Adam Grare
@agrare
Nov 07 2017 21:16
so this will create the blob when we create the target in the collector, and load+delete the blob in the refresh_worker
1 blob per target
Keenan Brock
@kbrock
Nov 07 2017 21:22

ok - makes sense
A little confusing that if you are passed a payload_id - you swap to a payload and if you are passed a payload, you swap to a payload_id
seems you can't figure out what you want

But yes, I understand that it is 2 different callers/use cases

Adam Grare
@agrare
Nov 07 2017 21:23
yeah, we were doing the blob.binary+delete in the load method
but that just called new and created a new blob for it haha
so i figured all paths go through initialize
i'm open to other approaches though
Keenan Brock
@kbrock
Nov 07 2017 21:48
Does make me a little nervous to have the constructor accessing the database and creating / deleting records
Oleg Barenboim
@chessbyte
Nov 07 2017 21:53
@kbrock me too -- the code in ManageIQ/manageiq#16413 is very hard to follow
I read it twice and still could not grok it
Keenan Brock
@kbrock
Nov 07 2017 21:53
aah, I see, so we use Target.load and Target.new in just a few places
but 2 repos, kubernetes and manageiq
Adam Grare
@agrare
Nov 07 2017 21:56
Maybe an explicit payload accessor so we aren't trying to mash this into the options?
Keenan Brock
@kbrock
Nov 07 2017 21:57
I like passing in the :payload and :payload_id into the initializer.
but modifying the db in the initializer is a bit much
Oleg Barenboim
@chessbyte
Nov 07 2017 21:57
I put a suggestion of if/elsif into the PR
but not thrilled with that either
Keenan Brock
@kbrock
Nov 07 2017 22:01
I did have to ask if the two would conflict with each other
Jason Frey
@Fryguy
Nov 07 2017 22:01
it's the way the target serializes itself that makes it weird
Adam Grare
@agrare
Nov 07 2017 22:01
So @kbrock you'd rather create the blob in #dump?
Jason Frey
@Fryguy
Nov 07 2017 22:01
@agrare what if the code lived in dump and load respectively
(we just have to make sure we don't load them when adding more targets into the queue)
Keenan Brock
@kbrock
Nov 07 2017 22:02
think load / new / create should not mess with the db
Oleg Barenboim
@chessbyte
Nov 07 2017 22:04
+1 element of least surprise
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:04
it's in initialize to run on serialization to/from the queue ?
Keenan Brock
@kbrock
Nov 07 2017 22:05
dump - not sure what that means, but maybe that can modify the db. create sounds good
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:06
can't we create binary blob before enqueuing in watch worker code, and load+delete after dequeuing in refresh code?
Adam Grare
@agrare
Nov 07 2017 22:07
Well dump is aliased to id
Yeah we can, I was trying to keep the blob handling transparent to the caller
But yeah, we could create the blob in the collector worker and just pass the id in the target options, then load and delete in the preprocess targets method
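(roughly that flow, as a console-level sketch of the idea; assumes BinaryBlob#binary= / #binary, and the payload / payload_id names follow this discussion rather than the merged PR)

payload = {"kind" => "Pod", "metadata" => {"name" => "example"}}   # stand-in for the watch notice body

# collector worker: park the payload in a BinaryBlob, queue only the id
blob = BinaryBlob.create!(:binary => payload.to_json)
target_options = {:payload_id => blob.id}

# refresh worker (preprocess targets): swap the id back for the payload, then drop the blob
blob    = BinaryBlob.find(target_options[:payload_id])
payload = JSON.parse(blob.binary)
blob.destroy   # 1 blob per target, consumed exactly once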
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:12
btw, is there risk of leaking blobs, if refresh fails before deleting blob? or losing them if fails after?
can blob delete be coupled to consuming the MiqQueue row?
Loicavenel
@Loicavenel
Nov 07 2017 22:13
guys, on the 10.9.62.110
it works for some time and now, things are getting crazy
evm.log and automation.log are 1763287040 and 8912879616 bytes
tail crashes
logs shows things like: e672d63757261746f722d646f636b65726366672d7877713766227d5d7d2c22737461747573223a7b227068617365223a2252756e6e696e67222c22636f6e646974696f6e73223a5b7b2274797065223a22496e697469616c697a6564222c22737461747573223a2254727565222c226c61737450726f626554696d65223a6e756c6c2c226c6173745472616e736974696f6e54696d65223a22323031372d30372d30375430303a30323a31305a227d2c7b2274797065223a225265616479222c22737461747573223a2246616c7365222c226c61737450726f626554696d65223a6e756c6c2c226c6173745472616e736974696f6e54696d65223a22323031372d31312d30375432303a30363a30395a222c22726561736f6e223a22436f6e7461696e6572734e6f745265616479222c226d657373616765223a22636f6e7461696e657273207769746820756e7265616479207374617475733a205b63757261746f725d227d2c7b2274797065223a22506f645363686564756c6564222c22737461747573223a2254727565222c226c61737450726f626554696d65223a6e756c6c2c226c6173745472616e736974696f6e54696d65223a22323031372d30372d30375430303a30323a31305a227d5d2c22686f73744950223a223139322e3136382e34372e3138222c22706f644950223a223137322e31362e342e313833222c22737461727454696d65223a22323031372d30372d30375430303a30323a31305a222c22636f6e7461696e65725374617475736573223a5b7b226e616d65223a2263757261746f72222c227374617465223a7b227465726d696e61746564223a7b2265786974436f6465223a3235352c22726561736f6e223a224572726f72222c22737461727465644174223a22323031372d31312d30375432303a30323a33355a222c2266696e69736865644174223a22323031372d31312d30375432303a30363a30395a222c22636f6e7461696e65724944223a22646f636b65723a2f2f30383838306638373462643036353164373761306264313365646639613730326162373332313334323933393933336465373637636262303834353135393061227d7d2c226c6173745374617465223a7b227465726d696e61746564223a7b2265786974436f6465223a3235352c22726561736f6e223a224572726f72222c22737461727465644174223a22323031372d31312d30375431393a35333a34315a222c2266696e69736865644174223a22323031372d31312d30375431393a35373a32305a222c22636f6e7461696e65724944223a22646f636b65723a2f2f62323838623463643362353264383266346239643030376331636163326230376666313961336239643464373535366530353534666533656235313166623166227d7d2c227265616479223a66616c73652c2272657374617274436f756e74223a31323336332c22696d616765223a2272
Jason Frey
@Fryguy
Nov 07 2017 22:15
Hexify all the things!
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:18
xxd -r -plain => g-curator-dockercfg-xwq7f"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2017-07-07T00:02:10Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2017-11-07T20:06:09Z","reason":"ContainersNotReady","message":"containers with unready status: [curator]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2017-07-07T00:02:10Z"}],"hostIP":"192.168.47.18","podIP":"172.16.4.183","startTime":"2017-07-07T00:02:10Z","containerStatuses":[{"name":"curator","state":{"terminated":{"exitCode":255,"reason":"Error","startedAt":"2017-11-07T20:02:35Z","finishedAt":"2017-11-07T20:06:09Z","containerID":"docker://08880f874bd0651d77a0bd13edf9a702ab7321342939933de767cbb08451590a"}},"lastState":{"terminated":{"exitCode":255,"reason":"Error","startedAt":"2017-11-07T19:53:41Z","finishedAt":"2017-11-07T19:57:20Z","containerID":"docker://b288b4cd3b52d82f4b9d007c1cac2b07ff19a3b9d4d7556e0554fe3eb511fb1f"}},"ready":false,"restartCount":12363,"image":"r
Jason Frey
@Fryguy
Nov 07 2017 22:19
@Loicavenel Are you running debug mode?
@cben So we are writing the API calls into the logs?
Loicavenel
@Loicavenel
Nov 07 2017 22:20
Hum… I did not touch it… @agrare when you accessed it to disable the faulty worker, did you touch anything else?
Adam Grare
@agrare
Nov 07 2017 22:22
@Loicavenel I didn't disable the worker, I just made it only send deleted notices
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:23
dunno. yeah, this is part of a /pod[s] api response. watcher does log when receiving but IIRC only name not whole payload? no idea where hex came from.
Adam Grare
@agrare
Nov 07 2017 22:27
The refresher is logging the target IDs
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:27

tail crashes

awesome LOL :boom:! try @kbrock's trick of cut -b 1-250, and/or less -S

Adam Grare
@agrare
Nov 07 2017 22:27
That includes the payloads
That's what ladas fixed recently
Keenan Brock
@kbrock
Nov 07 2017 22:28
@cben thanks for less -S
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:28
ah, right, saw those commits. #16405 ?
Adam Grare
@agrare
Nov 07 2017 22:29
The fact that target.id serializes the whole target is what makes this so strange
Beni Cherniavsky-Paskin
@cben
Nov 07 2017 22:29
(F inside less switches to tail -f mode, Ctrl-C exits it. don't know if it actually survives huge file with huge lines, it may try to calculate line numbers, but it tends to allow aborting that with Ctrl-C.)
Joe Rafaniello
@jrafanie
Nov 07 2017 22:29
@kbrock for when less gives you more ;-)