In `showq -c ...` the state was `CNCLD(271)`. In Slurm they are (I think) shown as `drained*`, where `*` means that Slurm hasn't heard of the instance for a while (so it is likely offline). I don't know the TARDIS state before they go to `NotAvailable`. There should be a few around tomorrow morning. I'll check again then and let you know. Thanks! :)
```
$ showq -c -w user=$(whoami) | grep 7728XXX
7728XXX V CNCLD tor 1.01 1.0 - fr_XXXXXX bwXXXXXX nXXXX.nemo.priva 20 00:09:15 Tue Mar 17 00:10:56
```
(I've censored it a bit and used the non-XML version.)

`sinfo` will just return nothing for the corresponding node (therefore it should be considered `NotAvailable` in TARDIS). But at this point I can't guarantee that it is like that immediately after the drone goes down. There may be different transient states in Slurm right after the drone goes down which may confuse TARDIS.
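As an illustration of that check, here is a minimal sketch (not the actual TARDIS batch system adapter; the `sinfo` invocation, the enum and the function name are assumptions for this example):

```python
import subprocess
from enum import Enum, auto


# Illustrative stand-in for the machine states mentioned in this thread;
# it is not imported from TARDIS itself.
class MachineStatus(Enum):
    Available = auto()
    Drained = auto()
    NotAvailable = auto()


def machine_status_from_sinfo(node_name: str) -> MachineStatus:
    """Ask sinfo about a single node and translate its state."""
    result = subprocess.run(
        ["sinfo", "--noheader", "--format=%T", "--nodes", node_name],
        capture_output=True,
        text=True,
    )
    state = result.stdout.strip()
    if not state:
        # sinfo returns nothing for the node -> the drone is gone.
        return MachineStatus.NotAvailable
    if state.startswith("drain"):
        # drained/draining, possibly flagged with a trailing '*' because
        # Slurm has not heard from the node for a while.
        return MachineStatus.Drained
    return MachineStatus.Available
```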
```
root: 2020-03-17 09:17:07 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 7728XXX, 'created': datetime.datetime(2020, 3, 17, 0, 1, 8, 58991), 'updated': datetime.datetime(2020, 3, 17, 9, 17, 7, 206896), 'drone_uuid': 'nemo-7728XXX', 'resource_status': <ResourceStatus.Stopped: 3>}
root: 2020-03-17 09:17:07 Destroying VM with ID 7728XXX
```
This results in an `AttributeError` and TARDIS crashes.
```
prometheusmonitoring: 2020-04-16 13:31:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065787, 'created': datetime.datetime(2020, 4, 16, 13, 26, 12, 506621), 'updated': datetime.datetime(2020, 4, 16, 13, 31, 9, 748663), 'drone_uuid': 'nemo-8065787', 'resource_status': 'ResourceStatus.Running', 'state': 'IntegrateState', 'meta': 'atlsch', 'timestamp': 1587036609718, 'revision': 1} has changed state to IntegratingState
prometheusmonitoring: 2020-04-16 13:31:09 TTTT: <class 'str'>
```
(`TTTT` is a debug output showing the type of `resource_status`.)
```
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
```
In the `ElasticsearchPlugin`, I'm turning `state` into a `str` so that Elasticsearch isn't confused, but since `resource_attributes` is passed by reference it affects everything else.
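For illustration only (plain Python, not the actual plugin or state-machine code; `monitoring_plugin` and `next_consumer` are made-up names), this is the aliasing problem in miniature:

```python
from enum import Enum


# Illustrative stand-in for TARDIS's ResourceStatus enum (only Stopped = 3
# is confirmed by the logs above).
class ResourceStatus(Enum):
    Running = 2
    Stopped = 3


def monitoring_plugin(resource_attributes: dict) -> None:
    # Mutates the dict it was given: the enum member is replaced by its
    # string representation, e.g. 'ResourceStatus.Running'.
    resource_attributes["resource_status"] = str(
        resource_attributes["resource_status"]
    )


def next_consumer(resource_attributes: dict) -> str:
    # Still expects the enum member and touches an enum-only attribute.
    return resource_attributes["resource_status"].name


attrs = {"resource_status": ResourceStatus.Running}
monitoring_plugin(attrs)
next_consumer(attrs)  # AttributeError: 'str' object has no attribute 'name'
```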
This can be avoided by creating a new dict (`{}`) with unpacking (`**`) of the original dict:

```python
resource_attributes = {
    **resource_attributes,
    "state": str(state),
    "meta": self._meta,
    "timestamp": int(time() * 1000),
    "resource_status": resource_attributes["resource_status"],
}
```
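Because `{**resource_attributes, ...}` builds a new dict object, the keys added for the monitoring output (`state`, `meta`, `timestamp`) no longer leak into the dict shared with the state machine and the other plugins.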
`CANCELLED` is also not yet handled there.
Mapping `TimeLimit` and `Vacated` to `ResourceStatus.Deleted` in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems which required us to restart the service at least once a day. I thought this was related to a bug in the Slurm batch system adapter, but apparently it wasn't. Anyway, with the updated mapping it has been running for more than a week without needing to be restarted.
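For illustration, a sketch of what such a mapping could look like (the dict name, the key spellings and most entries are assumptions; only `TimeLimit`, `Vacated` and the cancelled state come from this thread):

```python
from enum import Enum


# Stand-in for TARDIS's ResourceStatus; only Stopped = 3 is confirmed by the
# log output above, the other values are illustrative.
class ResourceStatus(Enum):
    Booting = 1
    Running = 2
    Stopped = 3
    Deleted = 4
    Error = 5


# Hypothetical MOAB state translation. Mapping TimeLimit, Vacated and the
# cancelled state (CNCLD in showq -c) to Deleted lets TARDIS clean up the
# corresponding drones instead of keeping them around as zombies.
moab_status_translation = {
    "Idle": ResourceStatus.Booting,
    "Running": ResourceStatus.Running,
    "Completed": ResourceStatus.Deleted,
    "TimeLimit": ResourceStatus.Deleted,
    "Vacated": ResourceStatus.Deleted,
    "CNCLD": ResourceStatus.Deleted,
}
```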