Max Fischer
@maxfischer2781
feel free to bump us if we should document some things better. ;)
Stefan Kroboth
@stefan-k
Thanks, I will :)
I was wondering what the difference between the Drained and NotAvailable statuses is, because to my understanding any drained drone is not available.
Manuel Giffels
@giffels
NotAvailable means the drone is not (yet) registered in the overlay batch system, so its status is unknown. A drone in the Drained state is registered in the overlay batch system, but it does not accept new jobs.
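In code terms, the distinction might be sketched like this (hypothetical names for illustration only, not the actual TARDIS state classes):

```python
from enum import Enum


class DroneState(Enum):
    # Hypothetical sketch of the two states discussed above.
    NotAvailable = 1  # not (yet) registered in the overlay batch system; status unknown
    Drained = 2       # registered in the overlay batch system, but accepts no new jobs


def is_registered(state: DroneState) -> bool:
    # The practical difference: a Drained drone is known to the overlay
    # batch system, while a NotAvailable one is not.
    return state is DroneState.Drained
```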
Stefan Kroboth
@stefan-k
Thanks!
Stefan Kroboth
@stefan-k
Hi all! How does cobald/tardis decide whether a drone still exists? Currently, we see that cobald/tardis tries to kill drones which are long gone via canceljob XXXXX. The drones do not show up in showq and are invisible to our SLURM adapter, so I would assume they are considered non-existent. The result of canceljob on one of these dead drones is ERROR: invalid job specified (XXXXXXX), so it is retried again and again. These dead drones accumulate, and the excessive canceljob calls potentially cause quite some strain on the site. I guess the garbage collector will take care of them at some point, but it seems this is not fast enough. We therefore need to regularly stop cobald/tardis, delete the database, stop all remaining drones, and start it again. The non-existent drones are considered NotAvailable in the Slurm adapter (not Drained). Could this be the cause of the problem? Any pointers are highly appreciated :)
Manuel Giffels
@giffels
@stefan-k : We are currently pushing hard to finish the CHEP proceedings. ;-) We will come back to your question afterwards. Sorry for that.
Stefan Kroboth
@stefan-k
@giffels : No worries, this is not at all urgent :) Good luck!
Manuel Giffels
@giffels
Hi @stefan-k, do I get it right that the drones show up neither in Moab's showq nor in your SLURM overlay batch system?
We use the Moab command showq --xml -w user=$(whoami) && showq -c --xml -w user=$(whoami) to get the status of the drones running at NEMO. The second command will also list completed ones. If the status in Moab is COMPLETED, the drone should go to DownState in tardis and the garbage collector should take care of it.
Do you know the previous state of the drones before they go to NotAvailable?
Manuel Giffels
@giffels
What is the output of showq -c --xml -w user=$(whoami)?
Stefan Kroboth
@stefan-k
Unfortunately I just restarted everything for a different reason, so I don't have any of these zombies around. But I remember that the state of the ones I checked with showq -c ... was CNCLD(271). In Slurm they are (I think) shown as drained*, where * means that Slurm hasn't heard from the instance for a while (so it is likely offline). I don't know the TARDIS state before they go to NotAvailable. There should be a few around tomorrow morning. I'll check again then and let you know. Thanks! :)
Manuel Giffels
@giffels
CNCLD seems to mean Canceled.
Let us check tomorrow, once you have new zombies around.
Stefan Kroboth
@stefan-k
$ showq -c  -w user=$(whoami) | grep 7728XXX
7728XXX             V CNCLD         tor   1.01      1.0  - fr_XXXXXX bwXXXXXX nXXXX.nemo.priva    20    00:09:15   Tue Mar 17 00:10:56
(I've censored it a bit and used the non-XML version.)
Slurm's sinfo will just return nothing for the corresponding node (therefore it should be considered NotAvailable in TARDIS). But at this point I can't guarantee that it is like that immediately after the drone goes down. There may be different transient states in Slurm right after the drone goes down which may confuse TARDIS.
Stefan Kroboth
@stefan-k
just for completeness:
$ canceljob 7728XXX
ERROR:  invalid job specified (7728XXX)
Stefan Kroboth
@stefan-k
I just realized that invalid job specified is handled by the site adapter.
Manuel Giffels
@giffels
What is the current state of this drone in TARDIS?
Stefan Kroboth
@stefan-k
root: 2020-03-17 09:17:07 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 7728XXX, 'created': datetime.datetime(2020, 3, 17, 0, 1, 8, 58991), 'updated': datetime.datetime(2020, 3, 17, 9, 17, 7, 206896), 'drone_uuid': 'nemo-7728XXX', 'resource_status': <ResourceStatus.Stopped: 3>}
root: 2020-03-17 09:17:07 Destroying VM with ID 7728XXX
Manuel Giffels
@giffels
Are there any changes applied by you to the Moab site adapter?
Manuel Giffels
@giffels
This is a known issue and @rfvc is currently working on adding all states known to Moab.
Stefan Kroboth
@stefan-k
Great, thanks! :) I've only made a minor change that should not make a difference: I changed the email address in the msub command.
Manuel Giffels
@giffels
Okay, that should not be the problem.
Stefan Kroboth
@stefan-k
@giffels regarding #142: I will investigate this further. This was pretty weird, because I could see in Grafana that we sometimes got metrics and sometimes we didn't. It crashes tardis, which is then restarted by systemd.
also, there's another bug in the most recent commit ;)
Stefan Kroboth
@stefan-k
prometheusmonitoring: 2020-04-16 13:31:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065787, 'created': datetime.datetime(2020, 4, 16, 13, 26, 12, 506621), 'updated': datetime.datetime(2020, 4, 16, 13, 31, 9, 748663), 'drone_uuid': 'nemo-8065787', 'resource_status': 'ResourceStatus.Running', 'state': 'IntegrateState', 'meta': 'atlsch', 'timestamp': 1587036609718, 'revision': 1} has changed state to IntegratingState
prometheusmonitoring: 2020-04-16 13:31:09 TTTT: <class 'str'>
interestingly, this happened once in about ~50 cases
(TTTT indicates the type of resource_status)
in all other cases, it is an enum:
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
ah I think I know what may be happening
"crosstalk" from the Elasticsearch plugin
Stefan Kroboth
@stefan-k
yep. Sorry for the fuss. resource_attributes is modified in the ElasticsearchPlugin. I never know whether arguments are passed by reference or by value in Python...
In the ElasticsearchPlugin, I'm turning state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference it affects everything else
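The behaviour described here can be reproduced in a few lines of plain Python (no TARDIS imports; the dict and plugin function below are illustrative stand-ins, not the real code):

```python
from enum import Enum


class ResourceStatus(Enum):
    Running = 1
    Stopped = 3


def plugin_notify(resource_attributes):
    # A plugin that stringifies the enum in place mutates the caller's dict,
    # because dicts are passed by object reference in Python.
    resource_attributes["resource_status"] = str(resource_attributes["resource_status"])


attrs = {"resource_status": ResourceStatus.Running}
plugin_notify(attrs)
print(type(attrs["resource_status"]))  # <class 'str'> -- the enum is gone for every other plugin too
```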
Max Fischer
@maxfischer2781
Can you link where this translation happens? You should be able to directly access the enum's name, without converting it.
(Sorry for not being available to give hints earlier, by the way. Trying to catch up now.)
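Accessing the enum's name directly, as suggested, yields a plain string without any conversion or mutation of the shared dict:

```python
from enum import Enum


class ResourceStatus(Enum):
    Running = 1


status = ResourceStatus.Running
print(status.name)  # 'Running' -- already a str, no cast needed
print(str(status))  # 'ResourceStatus.Running' -- what an explicit str() gives instead
```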
Stefan Kroboth
@stefan-k
No worries at all! I'm glad that I didn't get you sucked into this more ;)
But I think I will completely redesign this anyway
so I think it is not necessary that you waste any time on this :)
Max Fischer
@maxfischer2781
Alright, I'll try to sprinkle some random wisdom on the next version, then. ;)
Stefan Kroboth
@stefan-k
I'm looking forward to it :D
Manuel Giffels
@giffels
Sorry for being late to the party, I was stuck in a Zoom meeting. Good that you found the problem. Dictionaries are mutable objects in Python. Passing a mutable object to a function behaves similarly to pass-by-reference in other languages; everything immutable behaves similarly to pass-by-value.
Max Fischer
@maxfischer2781
For reference, a simple/fast way to convert-with-copy some keys is to use a dict literal ({}) with unpacking (**) of the original dict.
resource_attributes = {
    **resource_attributes,
    "state": str(state),
    "meta": self._meta,
    "timestamp": int(time() * 1000),
    "resource_status": resource_attributes["resource_status"],
}
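Because the dict literal builds a new dict, the caller's original is left untouched; a quick check (illustrative names only, mirroring the snippet above):

```python
from enum import Enum
from time import time


class ResourceStatus(Enum):
    Stopped = 3


original = {"resource_status": ResourceStatus.Stopped, "drone_uuid": "nemo-0000000"}

# Shallow copy via unpacking, adding/overriding keys in the new dict only.
attrs_copy = {
    **original,
    "state": "CleanupState",
    "timestamp": int(time() * 1000),
}
attrs_copy["resource_status"] = str(attrs_copy["resource_status"])

print(type(original["resource_status"]))   # still <enum 'ResourceStatus'>
print(type(attrs_copy["resource_status"]))  # <class 'str'>
```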
Stefan Kroboth
@stefan-k
Thanks to both of you!
Stefan Kroboth
@stefan-k
BTW: I'm using PR #141 in production for a while now and it does not solve the problem that tardis tries to cancel drones which are long gone. I guess this may be related to @giffels comment here: https://github.com/MatterMiners/tardis/pull/141#discussion_r409591396
Manuel Giffels
@giffels
Okay, thanks for the information. @rfvc could you address the remaining issue in #141, please? I found that CANCELLED is also not yet handled there.
Stefan Kroboth
@stefan-k
@giffels @rfvc: I just wanted to let you know that mapping TimeLimit and Vacated to ResourceStatus.Deleted in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems which required us to restart the service at least once a day. I thought this was related to a bug in the Slurm batch system adapter, but apparently it wasn't. Anyway, with the updated mapping it has been running for more than a week without needing a restart.
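The fix amounts to extending the Moab-to-TARDIS status translation table; a hedged sketch of the idea (the real adapter's table, names, and fallback behaviour may differ):

```python
from enum import Enum


class ResourceStatus(Enum):
    Booting = 1
    Running = 2
    Stopped = 3
    Deleted = 4
    Error = 5


# Hypothetical translation table: Moab job states -> TARDIS resource status.
# With "TimeLimit" and "Vacated" mapped to Deleted, finished drones are
# garbage-collected instead of lingering as zombies.
moab_status_map = {
    "Running": ResourceStatus.Running,
    "Completed": ResourceStatus.Deleted,
    "Canceled": ResourceStatus.Deleted,
    "TimeLimit": ResourceStatus.Deleted,
    "Vacated": ResourceStatus.Deleted,
}


def translate(moab_state: str) -> ResourceStatus:
    # Treat unknown states as Deleted so drones never get stuck forever.
    return moab_status_map.get(moab_state, ResourceStatus.Deleted)
```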