Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Activity
Martin Hradil
@himdel
the server is a worker that listens for HTTP requests starting with /api
the client is a tool that creates that http request and sends it to the server
a browser could be an api client, as could wget/curl, as could the manageiq-api-client gem
bigo1981
@bigo1981
ok so there is an external job that is making curl calls to the miq server with the request /api/service_catalogs/5000000000001/service_templates/5000000000004 and the logs are from the server receiving this request
Martin Hradil
@himdel
ok perfect, thanks :) and .. does that job always use the same authentication token?
bigo1981
@bigo1981
so i do see this user has 2 token for different controllers
MIQ(Api::ServicesController.log_request) Authentication: {:type=>"token", :token=>"5a4611
MIQ(Api::ApiController.log_request) Authentication: {:type=>"token", :token=>"ce04e8
Martin Hradil
@himdel
interesting, so, is only one of those failing and the other not? or is it random?
(also note that there are 3 newer hammer releases available, as well as ivanchuk, jansa, kasparov and eventually lasker, if we're talking timing issues, the issue could have been resolved by now)
bigo1981
@bigo1981
so it is a little hard to tell but i see api request successful for both tokens
bigo1981
@bigo1981
could having multiple tokens for the same user be an issue? @himdel
Martin Hradil
@himdel
I have no idea how the system tokens work, sorry. I'm guessing not, but maybe using 2 at the same time for the same user could?
you could try to work around it by having the user do a http basic authentication to get a user token, and send that instead of a system token
but not sure if that'd be any different or not

hmm ... ManageIQ/manageiq@921a3b8

any chance @bigo1981 your servers are not timesynced?

looks like the timestamp inside the system token has to be within 5 minutes of server time, maybe that could cause some failures too
bigo1981
@bigo1981
So @himdel what servers should be timesynced? How do i see what the timestamp is within the token?
Martin Hradil
@himdel
well, the server generating the token with the server processing the API request, if they're not the same server
if you have multiple servers serving the API, all of them, since you'd never know where the request lands
but also.. the 5 minute interval implies the user can't just be reusing the same token forever
so maybe the problem is just that their token reneval takes 5 minutes and 1 second, and uses an old token for that one extra second, or something like that
as for how to tell.. on the appliance, in the rails console, you could run YAML.load(ManageIQ::Password.decrypt(x_miq_token)) to get a hash with the data in the token (x_miq_token being the actual token)
outside the appliance, you can't
bigo1981
@bigo1981

Hi @himdel I finally found the token error in the log

MIQ(Api::RequestsController.log_request) API Request: {:requested_at=>"2021-03-01 17:43:24 UTC"
MIQ(Api::RequestsController.rescue in authenticate_with_system_token) Authentication Failed with System Token
MIQ(Api::RequestsController.rescue in authenticate_with_system_token) X-MIQ-Token: v2:{A7WSSelfFYB2O.....
MIQ(Api::RequestsController.rescue in authenticate_with_system_token) Error: Invalid timestamp 2021-03-01 17:42:47 UTC specified

it seems that the request was made 37 sec after the timestamp. when is this v2 token created and is there a way to increase how long it lasts for?

Jason Frey
@Fryguy
with tokens it's very important that all servers have their time synced to NTP properly
Martin Hradil
@himdel

the token seems to come from the core repo, app/models/miq_region.rb api_system_auth_token
and the timeout is not 5 minutes, only 30 secons now...

app/controllers/api/base_controller/authentication.rb
6:      SYSTEM_TOKEN_TTL = 30.seconds
136:        raise "Invalid timestamp #{timestamp} specified" if SYSTEM_TOKEN_TTL.ago.utc > timestamp

I don't think there's a toggle to change it

Jason Frey
@Fryguy
oh nice find @himdel - I wonder why it's so small
Martin Hradil
@himdel
yeah, sounds like it could get lost in the queue for longer
Jason Frey
@Fryguy
I guess the next question is what is taking up those 37 seconds? Is it the queue, or something on the caller side?
Martin Hradil
@himdel
^ is the change from 5 minutes to 30s
Jason Frey
@Fryguy
yeah i can't find anything in the BZ exaplining why
though that PR was a "revert"
Martin Hradil
@himdel
yeah, not sure either
but it looks like the only caller of that logic is inter_region_api_method_relay api_client_connection_for_region,
called from process_tasks_mixin invoke_tasks_remote
Jason Frey
@Fryguy
@abellotti do you remember?
Alberto Bellotti
@abellotti
We had shifted the System token to an OTP for security reasons, but we couldn't keep it as such (was breaking central admin), so we brought it back as it was, and shortened it's TTL. Same question here, why is it taking 37 seconds.
Jason Frey
@Fryguy
usually those tokens, when used for central admin, are in a immediate request, so 37 seconds seems like a long time
for example, someone hits "power on a vm" at the global region, and while we put the request on the queue, i thought the token isn't generated until immediately before we make the remote call, so I would expect it to be immediately used (and then thrown away)
that was why I was thinking maybe the global region and the remote region have their system times drifting by more than 30 seconds
Alberto Bellotti
@abellotti
Ahh, yep that would do it too. They should be NTP synced.
bigo1981
@bigo1981

So thanks for all the replies. The global region and remote region times are fully synced to the second. we see that time increase as the load on the remote region db increases. If the load average on the db is below 10 the the calls happen in around 15 seconds but when the load gets to 20 -30 we see the times go to 30 to 60 seconds. It is a black box to us as well as to what it is doing in this time.

[----] I, [2021-03-01T15:15:14.713011 #26355:3e09950] INFO -- : MIQ(Api::ServiceCatalogsController.log_request) Response: {:completed_at=>"2021-03-01 20:15:14 UTC", :size=>"1.580 KBytes", :time_taken=>"15.014 Seconds", :status=>200}

[----] I, [2021-03-01T13:49:01.386888 #26389:3d5016c] INFO -- : MIQ(Api::ServiceCatalogsController.log_request) Response: {:completed_at=>"2021-03-01 18:49:01 UTC", :size=>"0.123 KBytes", :time_taken=>"60.334 Seconds", :status=>500}

Also we see that the more active retire jobs we have in the remote region the more load it puts on the db which we are also not sure why retires have that effect on the db. If we have 20 simultaneous retires the load average on the db goes up to 20 and it increases faster the more we have active.
Jason Frey
@Fryguy
oh wow that is interesting
bigo1981
@bigo1981
Would appreciate any information on how these things correlate with each other and any possible solutions to help alleviate this issue
bigo1981
@bigo1981

Just to provide some more data we updated the logs to show debug data and we are seeing a single vm retire task perform this query 1200 times with a different id each time. Is this expected behavior? In total we had over 70K lines for one retire.

Q-task_id([r5000000588240_vm_retire_task_5000000757649]) (85.3ms) SELECT "vms"."id" FROM "vms" WHERE (LOWER((SELECT "users"."userid" FROM "users" WHERE "users"."id" = "vms"."evm_owner_id")) = '**')

Jason Frey
@Fryguy
cc @gtanzillo ^
Nick LaMuro
@NickLaMuro
worth nothing that the LOWER stuff is probably from the AuthenticationMixin, and is probably either a scope, or a virtual_attribute
bigo1981
@bigo1981
Is this something that can be easily disabled or is this because of how we have something configured? It doesn't make sense to make this call for every user when we already know the vm owner.
Jason Frey
@Fryguy
if you're somehow able to narrow down where that line is coming from that might help
It feels like an N+1 error, but in retirement I don't understand why that would happen, since we are retiring a single entity
bigo1981
@bigo1981
@Fryguy Any suggestion on some ways i can try to narrow down where this is coming from? I tried to look through the appliance code to try and figure out what was calling this code but nothing stood out right away.