These are chat archives for ManageIQ/manageiq/containers

14th Mar 2017
Mooli Tayer
@moolitayer
Mar 14 2017 15:36
@zeari @yaacov could you help @jdeubel? He has some questions about metric rollups (not really my area)
Jared Deubel
@jdeubel
Mar 14 2017 15:36
Ok thanks @moolitayer
Have a customer with a big environment that is running into high timings for the metrics. I wanted to know how accurate the log lines are
[----] I, [2017-02-27T03:56:44.061769 #24998:1131130] INFO -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::Container#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Kubernetes::ContainerManager::Container name: [author], id: [10000000041243] for time: [2017-02-14T06:00:00Z]...Complete - Timings: {:server_dequeue=>0.0023474693298339844, :db_find_prev_perf=>1020.0081889629364, :rollup_perfs=>13430.216671705246, :db_update_perf=>3029.258082151413, :process_perfs_tag=>4.839548587799072, :process_bottleneck=>1764.1716132164001, :total_time=>29022.528948545456, :purge_metrics=>4419.638503074646, :query_batch=>0.5897281169891357, :purge_vim_performance_tag_values=>0.18726515769958496}
You can see there that the total_time is showing 8 hours
Does this have anything to do with network latency or are there other factors that might affect this? @zeari @yaacov
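For reference, the Timings hash in that log line can be broken down with a short Ruby sketch. This is not ManageIQ code, just the hash from the excerpt above; the values are assumed to be cumulative seconds, which matches the ~8 hours of total_time:

```ruby
# Timings hash copied from the perf_rollup log line above (values in seconds).
timings = {
  :server_dequeue                   => 0.0023474693298339844,
  :db_find_prev_perf                => 1020.0081889629364,
  :rollup_perfs                     => 13430.216671705246,
  :db_update_perf                   => 3029.258082151413,
  :process_perfs_tag                => 4.839548587799072,
  :process_bottleneck               => 1764.1716132164001,
  :total_time                       => 29022.528948545456,
  :purge_metrics                    => 4419.638503074646,
  :query_batch                      => 0.5897281169891357,
  :purge_vim_performance_tag_values => 0.18726515769958496,
}

total = timings[:total_time]
puts format("total_time: %.2f hours", total / 3600.0)

# Print each step, largest first, with its share of the total.
timings.sort_by { |_, secs| -secs }.each do |name, secs|
  next if name == :total_time
  puts format("  %-33s %9.1fs (%4.1f%%)", name, secs, 100.0 * secs / total)
end
```

This makes it easy to see where the time actually went: rollup_perfs dominates, purge_metrics is roughly 15% of the total, and db_find_prev_perf is a few percent.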
Ari Zellner
@zeari
Mar 14 2017 15:41
@jdeubel that's a rollup for a single container? :O
Jared Deubel
@jdeubel
Mar 14 2017 15:41
Yeah, I believe so... crazy, right?
lol
Ari Zellner
@zeari
Mar 14 2017 15:41
yeah....
Jared Deubel
@jdeubel
Mar 14 2017 15:42
When they are doing the refreshes they are hitting the message timeout of 7200 seconds
Ari Zellner
@zeari
Mar 14 2017 15:43
  • trying to figure it out *
Jared Deubel
@jdeubel
Mar 14 2017 15:44
[----] I, [2017-02-27T03:55:38.909739 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) Refreshing all targets...
[----] I, [2017-02-27T03:55:38.909862 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) EMS: [CTC Core], id: [10000000000001] Refreshing targets for EMS...
[----] I, [2017-02-27T03:55:38.909936 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh) EMS: [CTC Core], id: [10000000000001]   ManageIQ::Providers::OpenshiftEnterprise::ContainerManager [CTC Core] id [10000000000001]
[----] I, [2017-02-27T03:55:38.910040 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::Refresher#refresh_targets_for_ems) EMS: [CTC Core], id: [10000000000001] Refreshing target ManageIQ::Providers::OpenshiftEnterprise::ContainerManager [CTC Core] id [10000000000001]...
[----] I, [2017-02-27T03:55:38.910125 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager#with_provider_connection) Connecting through ManageIQ::Providers::OpenshiftEnterprise::ContainerManager: [CTC Core]
[----] W, [2017-02-27T03:55:47.206733 #20001:1131130]  WARN -- : Unexpected Exception during refresh: HTTP status code 403, User "system:serviceaccount:management-infra:management-admin" cannot list all componentstatuses in the cluster
[----] I, [2017-02-27T03:55:47.207338 #20001:1131130]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager#with_provider_connection) Connecting through ManageIQ::Providers::OpenshiftEnterprise::ContainerManager: [CTC Core]
[----] I, [2017-02-27T04:03:47.933493 #20001:1131130]  INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_builds] Deleting id: [10000000002831] name: [springbootdemo]
[----] I, [2017-02-27T04:10:24.200474 #20001:1131130]  INFO -- : MIQ(EmsRefresh.save_inventory_multi) [container_build_pods] Deleting id: [10000000014367] name: [springbootdemo-1], id: [10000000014377] name: [springbootdemo-2], id: [10000000016284] name: [springbootdemo-3], id: [10000000017189] name: [springbootdemo-4], id: [10000000017190] name: [springbootdemo-5], id: [10000000017555] name: [springbootdemo-6], id: [10000000017556] name: [springbootdemo-7], id: [10000000017566] name: [springbootdemo-8], id: [10000000017585] name: [springbootdemo-10], id: [10000000021336] name: [springbootdemo-11]
[----] E, [2017-02-27T05:55:38.898721 #20001:1131130] ERROR -- : MIQ(MiqQueue#deliver) Message id: [10000184683443], timed out after 7200.007051923 seconds.  Timeout threshold [7200]
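A quick way to sanity-check that timeout against the log excerpt: diff the first and last timestamps of the refresh. This is just a Ruby sketch using the two timestamps from the paste above:

```ruby
require "time"

# First and last timestamps of the refresh in the evm.log excerpt above.
start  = Time.parse("2017-02-27T03:55:38.909739")
finish = Time.parse("2017-02-27T05:55:38.898721")

# Elapsed wall-clock seconds; lands right at the 7200s message timeout.
puts finish - start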
Ari Zellner
@zeari
Mar 14 2017 15:46
well it looks like network latency
how about if they just try to re-authenticate?
Jared Deubel
@jdeubel
Mar 14 2017 15:46
To see how long it takes?
Ari Zellner
@zeari
Mar 14 2017 15:46
if it's a truly giant env then inventory refreshes might be awful...

To see how long it takes?

yeah

Jared Deubel
@jdeubel
Mar 14 2017 15:47
I don't think we have any good scaling best practices for Openshift environments as far as tuning the workers
Jason Frey
@Fryguy
Mar 14 2017 15:47
@zeari What is purge_metrics in there? Why would we purge during a rollup?
Jared Deubel
@jdeubel
Mar 14 2017 15:47
How would they get the timings when re-authenticating? Is there a log line that I can look for?
Ari Zellner
@zeari
Mar 14 2017 15:49

@zeari What is purge_metrics in there? Why would we purge during a rollup?

@Fryguy no clue. I don't think we purge metrics right when we perform the rollups

Jason Frey
@Fryguy
Mar 14 2017 15:50
that's really interesting... because that's about 1/6 of the total time
but yeah, DB latency seems in effect there because db_find_prev_perf is really high
Ari Zellner
@zeari
Mar 14 2017 15:50

How would they get the timings when re-authenticating? Is there a log line that I can look for?

let me check

Jared Deubel
@jdeubel
Mar 14 2017 15:54
@Fryguy @zeari Thanks... that's what it seemed to be: high latency.
@Fryguy Is :db_find_prev_perf a good value to go by to see high latency? Does that call the Openshift API?
Jason Frey
@Fryguy
Mar 14 2017 15:56
no that's pure database...we are looking to see if there are existing rows in the database so we can know to update or create
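The update-or-create pattern described here can be illustrated with a hypothetical Ruby sketch. This is not the actual ManageIQ rollup code; `Rollup`, `save_rollup`, and the in-memory `store` are made up for illustration — the point is just that the lookup for an existing row is the step timed as :db_find_prev_perf:

```ruby
# Hypothetical illustration (not ManageIQ code) of update-or-create:
# before writing a rollup row, first look for an existing one.
Rollup = Struct.new(:resource_id, :timestamp, :value)

def save_rollup(store, resource_id, timestamp, value)
  # This lookup is the part that would hit the database (:db_find_prev_perf).
  existing = store.find { |r| r.resource_id == resource_id && r.timestamp == timestamp }
  if existing
    existing.value = value  # row already exists: update it
    existing
  else
    store << Rollup.new(resource_id, timestamp, value)  # no row yet: create one
    store.last
  end
end

store = []
save_rollup(store, 41243, "2017-02-14T06:00:00Z", 1.0)
save_rollup(store, 41243, "2017-02-14T06:00:00Z", 2.0)  # second save updates, no duplicate
puts store.size
```

Since rollups mostly create new rows and match at most a couple of existing ones, that lookup should be cheap; 1020 seconds there points at DB latency or a DB that needs tuning, not at the amount of data found.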
Jared Deubel
@jdeubel
Mar 14 2017 15:56
Oh ok so latency to our internal DB
Jason Frey
@Fryguy
Mar 14 2017 15:56
yes
I'd expect that number to be low, because for rollups we are generally creating new records, and even so, it should only find like 1 or 2 records for rollups anyway
either that or their database is in some serious need of tuning
What version is that?
Jared Deubel
@jdeubel
Mar 14 2017 15:59
Let me look

What version is that?

5.7.0.17

Yaacov Zamir
@yaacov
Mar 14 2017 16:01
How would they get the timings when re-authenticating? Is there a log line that I can look for?
WARN -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager#put_authentication_check) Previous authentication_check_types for [EngLab] [5] with opts: [[["hawkular", "bearer"], {:save=>true}]] is still running, skipping...Check for a running server in zone: [default] with role: [ems_operations]
[----] I, [2017-03-14T17:56:34.182680 #18330:57bf5ac]  INFO -- : MIQ(MiqQueue.put) Message id: [889],  id: [], Zone: [default], Role: [ems_operations], Server: [], Ident: [generic], Target id: [], Instance id: [5], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["hawkular", "bearer"], {:save=>true}]
[----] I, [2017-03-14T17:56:34.875833 #28283:52510c]  INFO -- : MIQ(MiqQueue#deliver) Message id: [889], Delivering...
[----] I, [2017-03-14T17:57:00.555273 #28283:52510c]  INFO -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager#with_provider_connection) Connecting through ManageIQ::Providers::Openshift::ContainerManager: [EngLab]
[----] I, [2017-03-14T17:57:00.760921 #28283:52510c]  INFO -- : MIQ(AuthToken#validation_successful) [ExtManagementSystem] [5], previously valid/invalid on: [2017-03-05 12:48:02 UTC]/[], previous status: [Valid]
[----] I, [2017-03-14T17:57:00.899463 #28283:52510c]  INFO -- : MIQ(MiqQueue.put) Message id: [890],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["ManageIQ::Providers::Openshift::ContainerManager", 5], "ems_auth_valid", {}]
[----] I, [2017-03-14T17:57:02.035743 #28283:52510c]  INFO -- : MIQ(AuthToken#validation_successful) [ExtManagementSystem] [5], previously valid/invalid on: [2017-03-05 12:48:03 UTC]/[], previous status: [Valid]
[----] I, [2017-03-14T17:57:02.047376 #28283:52510c]  INFO -- : MIQ(MiqQueue.put) Message id: [891],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["ManageIQ::Providers::Openshift::ContainerManager", 5], "ems_auth_valid", {}]
[----] I, [2017-03-14T17:57:02.047546 #28283:52510c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [889], State: [ok], Delivered in [27.171750228] seconds
[----] I, [2017-03-14T17:57:02.054479 #28283:52510c]  INFO -- : MIQ(MiqQueue#deliver) Message id: [890], Delivering...
[----] I, [2017-03-14T17:57:02.210357 #28283:52510c]  INFO -- : <AutomationEngine> MiqAeEvent.build_evm_event >> event=<"ems_auth_valid"> inputs=<{"MiqEvent::miq_event"=>190, :miq_event_id=>190, "EventStream::event_stream"=>190, :event_stream_id=>190}>
Jared Deubel
@jdeubel
Mar 14 2017 16:02
@yaacov anything to look for in particular in that ?
Yaacov Zamir
@yaacov
Mar 14 2017 16:05
@jdeubel you re-check authentication in the UI, then in evm.log look for the timestamp of the first line MIQ(MiqQueue.put) ... authentication_check_types
and the timestamp of the last line event="ems_auth_valid"
it is not precise but it will give you a feeling for the times (ms vs minutes)
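Applied to the log excerpt above, the method is just a timestamp diff. A Ruby sketch using the two timestamps from that paste:

```ruby
require "time"

# First MiqQueue.put of authentication_check_types in the excerpt above...
put_line   = Time.parse("2017-03-14T17:56:34.182680")
# ...and the final ems_auth_valid event line.
event_line = Time.parse("2017-03-14T17:57:02.210357")

# Rough end-to-end auth-check time in seconds.
puts (event_line - put_line).round(1)
```

For that healthy EngLab example the whole check takes about 28 seconds end to end (including queue wait); in a high-latency environment the same diff would come out in minutes.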
Jared Deubel
@jdeubel
Mar 14 2017 16:06
Ok thanks @yaacov
I can look at the previous logs that I have to maybe get that information or I will ask them.
@yaacov @zeari This looks a little different but
[----] I, [2017-02-27T09:11:44.307778 #24853:1131130] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [10000185139465], MiqWorker id: [10000000948620], Zone: [CTC Core], Role: [ems_operations], Server: [], Ident: [generic], Target id: [], Instance id: [10000000000001], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [2017-02-27 15:11:36 UTC], Data: [], Args: [[], {:attempt=>1}], Dequeued in: [67.624205364] seconds
That queue time is rather high; wondering if the DB is in need of some maintenance.
@Fryguy ^
I don't think that the ems_operations role has a lot of work in the miq_queue
[----] I, [2017-02-27T09:11:42.518215 #24861:1131130] INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (24706)] MiqQueue count for state=["ready"] by zone and role: {"Core"=>{"smartproxy"=>4, "ems_metrics_processor"=>3104, "event"=>36, "automate"=>10, "ems_metrics_collector"=>4239, nil=>1, "ems_inventory"=>1, "ems_operations"=>1}, "CTC DMZ"=>{"ems_metrics_collector"=>842, "ems_inventory"=>1, nil=>2, "ems_metrics_processor"=>1866, "smartproxy"=>1, "automate"=>6, "event"=>6}, "default"=>{"event"=>6, "smartstate"=>1, "smartproxy"=>2, "ems_metrics_processor"=>3, nil=>1, "ems_inventory"=>2}}
"ems_operations"=>1},
Ari Zellner
@zeari
Mar 14 2017 16:19
could the high Dequeued in time be a result of the miq_queue filling up? There's an issue that pops up sometimes when the automation engine keeps filling up the DB with events and 'refreshes' in miq_queue
Jason Frey
@Fryguy
Mar 14 2017 16:21
yes, a full queue would contribute to high Dequeued in times
which could also imply there aren't enough workers
Ari Zellner
@zeari
Mar 14 2017 16:23
@jdeubel is it reasonable to disable the automation engine for a couple of minutes and see if the queue clears up?
(and if the latency goes down)
@Fryguy that issue is the premise for ManageIQ/manageiq#13204