prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
In the ElasticsearchPlugin, I'm turning state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference, the change affects everything else that uses the same dict.
{}) with unpacking (**) of the original dict.

    resource_attributes = {
        **resource_attributes,
        "state": str(state),
        "meta": self._meta,
        "timestamp": int(time() * 1000),
        "resource_status": resource_attributes["resource_status"],
    }
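To illustrate why building a new dict matters here, the following is a minimal sketch rather than the plugin's actual code. The function names `mutate_in_place` and `build_document`, the `meta` argument, and the one-member `ResourceStatus` stand-in are hypothetical. Only the dict keys and the `Stopped = 3` value come from the log and snippet above.

```python
from enum import Enum
from time import time


class ResourceStatus(Enum):
    # Stand-in for the real enum; only Stopped = 3 is taken from the log above.
    Stopped = 3


def mutate_in_place(resource_attributes, state):
    # Problematic variant: writes into the caller's dict, so every other
    # consumer holding a reference to it sees the stringified value too.
    resource_attributes["state"] = str(state)
    return resource_attributes


def build_document(resource_attributes, state, meta):
    # Safer variant: a new dict built with ** unpacking. The caller's dict is
    # left untouched. Note this is a shallow copy: nested mutable values
    # would still be shared between the two dicts.
    return {
        **resource_attributes,
        "state": str(state),
        "meta": meta,
        "timestamp": int(time() * 1000),
    }


original = {"drone_uuid": "nemo-8065030", "resource_status": ResourceStatus.Stopped}

document = build_document(original, "CleanupState", meta="demo")
unchanged_after_copy = "state" not in original  # the copy did not leak back

mutate_in_place(original, "CleanupState")
changed_after_mutation = "state" in original  # the in-place write did leak
```

Because keys listed after `**resource_attributes` override the unpacked ones, the new dict can carry a string `state` while the original keeps whatever type it had.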
CANCELLED is also not yet handled there.
Mapping TimeLimit and Vacated to ResourceStatus.Deleted in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems that forced us to restart the service at least once a day. I thought this was related to a bug in the Slurm batch system adapter, but apparently it wasn't. Anyway, with the updated mapping, the service has been running for more than a week without needing a restart.
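As a rough sketch of what such a mapping could look like (not the MOAB site adapter's actual code), the snippet below translates batch system job states into resource statuses. The names `MOAB_STATUS_MAP` and `translate_status` are hypothetical, and so is the fallback to `Error` for unknown states. Only `Stopped = 3` is confirmed by the log above; the other enum values are assumptions.

```python
from enum import Enum


class ResourceStatus(Enum):
    # Stand-in enum: Stopped = 3 appears in the log above, the remaining
    # members and their values are assumed for illustration only.
    Stopped = 3
    Deleted = 4
    Error = 5


# Hypothetical lookup table; only the two entries discussed above are listed.
MOAB_STATUS_MAP = {
    "TimeLimit": ResourceStatus.Deleted,
    "Vacated": ResourceStatus.Deleted,
}


def translate_status(moab_state: str) -> ResourceStatus:
    # Unknown states fall back to Error here. That fallback is an assumption
    # of this sketch, not documented adapter behaviour.
    return MOAB_STATUS_MAP.get(moab_state, ResourceStatus.Error)


time_limit_status = translate_status("TimeLimit")
vacated_status = translate_status("Vacated")
unknown_status = translate_status("SomethingElse")
```

With both terminal states mapped to `Deleted`, a drone whose job hit its time limit or was vacated can be cleaned up instead of lingering as a zombie.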
STARTD) but don't need them. C/T will opportunistically add them to your system. There is no problem having multiple C/T instances add resources, nor in having resources from other sources as well.