Stefan Kroboth
@stefan-k
Unfortunately I just restarted everything for a different reason, so I don't have any of these zombies around. But I remember that the state of the ones I checked with showq -c ... was CNCLD(271). In Slurm they are (I think) shown as drained*, where * means that Slurm hasn't heard from the instance for a while (so it is likely offline). I don't know the TARDIS state before they go to NotAvailable. There should be a few around tomorrow morning. I'll check again then and let you know. Thanks! :)
Manuel Giffels
@giffels
CNCLD seems to be Canceled.
Let us check tomorrow, once you have new zombies around.
Stefan Kroboth
@stefan-k
$ showq -c  -w user=$(whoami) | grep 7728XXX
7728XXX             V CNCLD         tor   1.01      1.0  - fr_XXXXXX bwXXXXXX nXXXX.nemo.priva    20    00:09:15   Tue Mar 17 00:10:56
(I've censored it a bit and used the non-XML version.)
Slurm's sinfo will just return nothing for the corresponding node (therefore it should be considered NotAvailable in TARDIS). But at this point I can't guarantee that it is like that immediately after the drone goes down. There may be different transient states in Slurm right after the drone goes down, which may confuse TARDIS.
Stefan Kroboth
@stefan-k
just for completeness:
$ canceljob 7728XXX
ERROR:  invalid job specified (7728XXX)
Stefan Kroboth
@stefan-k
I just realized that invalid job specified is handled by the site adapter.
Manuel Giffels
@giffels
What is the current state of this drone in TARDIS?
Stefan Kroboth
@stefan-k
root: 2020-03-17 09:17:07 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 7728XXX, 'created': datetime.datetime(2020, 3, 17, 0, 1, 8, 58991), 'updated': datetime.datetime(2020, 3, 17, 9, 17, 7, 206896), 'drone_uuid': 'nemo-7728XXX', 'resource_status': <ResourceStatus.Stopped: 3>}
root: 2020-03-17 09:17:07 Destroying VM with ID 7728XXX
Manuel Giffels
@giffels
Are there any changes applied by you to the Moab site adapter?
Manuel Giffels
@giffels
This is a known issue and @rfvc is currently working on adding all states known to Moab.
Stefan Kroboth
@stefan-k
Great, thanks! :) I've only made a minor change that should not make a difference: I changed the email address in the msub command.
Manuel Giffels
@giffels
Okay, that should not be the problem.
Stefan Kroboth
@stefan-k
@giffels regarding #142: I will investigate this further. This was pretty weird because I could see in Grafana that we sometimes got metrics and sometimes didn't. It crashes TARDIS, which is then restarted by systemd.
Also, there's another bug in the most recent commit ;)
Stefan Kroboth
@stefan-k
prometheusmonitoring: 2020-04-16 13:31:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065787, 'created': datetime.datetime(2020, 4, 16, 13, 26, 12, 506621), 'updated': datetime.datetime(2020, 4, 16, 13, 31, 9, 748663), 'drone_uuid': 'nemo-8065787', 'resource_status': 'ResourceStatus.Running', 'state': 'IntegrateState', 'meta': 'atlsch', 'timestamp': 1587036609718, 'revision': 1} has changed state to IntegratingState
prometheusmonitoring: 2020-04-16 13:31:09 TTTT: <class 'str'>
interestingly, this happened in about 1 out of 50 cases
(TTTT indicates the type of resource_status)
in all other cases, it is an enum:
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
ah I think I know what may be happening
"crosstalk" from the Elasticsearch plugin
Stefan Kroboth
@stefan-k
yep. sorry for the fuss. resource_attributes is modified in the ElasticsearchPlugin. I never know whether arguments are passed by reference or value in Python...
In the ElasticsearchPlugin, I'm turning state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference it affects everything else
Max Fischer
@maxfischer2781
Can you link where this translation happens? You should be able to directly access the enum's name, without converting it.
(Sorry for not being available to give hints earlier, by the way. Trying to catch up now.)
Stefan Kroboth
@stefan-k
No worries at all! I'm glad that I didn't get you sucked into this more ;)
But I think I will completely redesign this anyway
so I think it is not necessary that you waste any time on this :)
Max Fischer
@maxfischer2781
Alright, I'll try to sprinkle some random wisdom on the next version, then. ;)
Stefan Kroboth
@stefan-k
I'm looking forward to it :D
Manuel Giffels
@giffels
Sorry for being late to the party. I was stuck in a Zoom meeting. Good that you found the problem. Dictionaries are mutable objects in Python. Passing a mutable object to a function behaves similarly to pass-by-reference in other languages: changes made inside the function are visible to the caller. Everything that is not mutable behaves similarly to pass-by-value in other languages.
Max Fischer
@maxfischer2781
For reference, a simple/fast way to convert-with-copy some keys is to use a dict literal ({}) with unpacking (**) of the original dict.
resource_attributes = {
    **resource_attributes,
    "state": str(state),
    "meta": self._meta,
    "timestamp": int(time() * 1000),
    "resource_status": resource_attributes["resource_status"],
}
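A minimal, self-contained sketch of the "crosstalk" discussed above and of the copy-based fix. The ResourceStatus class here is only a stand-in for the TARDIS enum (Stopped = 3 matches the logs above, the Running value is made up), and both function names are hypothetical, not part of TARDIS. It also shows the .name access mentioned earlier as an alternative to str().

```python
from enum import Enum
from time import time


class ResourceStatus(Enum):
    # Stand-in for the TARDIS enum. Stopped = 3 matches the logs above;
    # the value of Running is an assumption for illustration only.
    Running = 2
    Stopped = 3


def notify_in_place(state, resource_attributes):
    # Hypothetical plugin method: it writes into the dict it was handed,
    # so every other holder of that dict now sees a str, not an enum.
    resource_attributes["resource_status"] = str(
        resource_attributes["resource_status"]
    )
    resource_attributes["state"] = str(state)
    return resource_attributes


def notify_with_copy(state, resource_attributes):
    # Hypothetical plugin method: builds a new dict via ** unpacking,
    # overriding selected keys. The caller's dict is left untouched.
    return {
        **resource_attributes,
        "state": str(state),
        "timestamp": int(time() * 1000),
        # .name gives the member name without the class prefix.
        "resource_status": resource_attributes["resource_status"].name,
    }


shared = {"drone_uuid": "nemo-1", "resource_status": ResourceStatus.Running}
document = notify_with_copy("IntegrateState", shared)
still_enum = type(shared["resource_status"])  # unchanged: the enum class

notify_in_place("IntegrateState", shared)
now_str = type(shared["resource_status"])  # the shared dict was modified
```

Note that the unpacking only makes a shallow copy: nested mutable values would still be shared between the two dicts.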
Stefan Kroboth
@stefan-k
Thanks to both of you!
Stefan Kroboth
@stefan-k
BTW: I've been using PR #141 in production for a while now and it does not solve the problem that TARDIS tries to cancel drones which are long gone. I guess this may be related to @giffels' comment here: https://github.com/MatterMiners/tardis/pull/141#discussion_r409591396
Manuel Giffels
@giffels
Okay, thanks for the information. @rfvc could you address the remaining issue in #141, please? I found that CANCELLED is also not yet handled there.
Stefan Kroboth
@stefan-k
@giffels @rfvc: I just wanted to let you know that mapping TimeLimit and Vacated to ResourceStatus.Deleted in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems which required us to restart the service at least once a day. I thought this was related to a bug in the Slurm batchsystem adapter, but apparently it wasn't. Anyway, with the updated mapping it has been running for more than a week without needing to be restarted.
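For illustration only, the mapping described above could look roughly like the sketch below. This is not the code of the actual Moab site adapter: ResourceStatus is a stand-in enum with made-up values (except Stopped = 3, which appears in the logs above), and MOAB_STATE_TO_STATUS and translate_moab_state are hypothetical names. Only the two states mentioned in the message are mapped; unknown states raise an error instead of being silently ignored.

```python
from enum import Enum


class ResourceStatus(Enum):
    # Stand-in for the TARDIS enum; the value of Deleted is an assumption.
    Stopped = 3
    Deleted = 4


# Hypothetical translation table: only the two Moab states named above.
MOAB_STATE_TO_STATUS = {
    "TimeLimit": ResourceStatus.Deleted,
    "Vacated": ResourceStatus.Deleted,
}


def translate_moab_state(moab_state):
    # Fail loudly on states without a mapping, so that unhandled states
    # surface early instead of leaving zombie drones behind.
    try:
        return MOAB_STATE_TO_STATUS[moab_state]
    except KeyError:
        raise ValueError(f"unhandled Moab state: {moab_state!r}") from None


vacated_status = translate_moab_state("Vacated")
```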
Manuel Giffels
@giffels
@stefan-k Thanks for reporting this. I have updated the pull request accordingly. @rfvc any chance we can finish this pull request this week?
rfvc
@rfvc
@stefan-k, thanks for your testing and input on it!
@giffels, I think so, I’m on it
Stefan Kroboth
@stefan-k
Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...
Stefan Kroboth
@stefan-k
Sorry about the noise in the PR, I thought I was in our local GitLab... embarrassing...
Manuel Giffels
@giffels

Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...

I have restarted CI at Travis, which should actually update codecov afterwards. Fingers crossed.

Stefan Kroboth
@stefan-k
@giffels: Thanks! Unfortunately it looks like it didn't work. At least the diff seems to be missing the last commit :/
Max Fischer
@maxfischer2781
It seems there was an issue mapping the report to the commit. codecov complains that it was "Unable to find commit in GitHub"
Max Fischer
@maxfischer2781
Looks like the metadata was garbled, not sure if we can reset that. The easiest is probably for you to push a new commit, so that an entirely new CI+Codecov run is triggered.
Stefan Kroboth
@stefan-k
Thanks a lot! I'll try to push something today