Stefan Kroboth
@stefan-k
root: 2020-03-17 09:17:07 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 7728XXX, 'created': datetime.datetime(2020, 3, 17, 0, 1, 8, 58991), 'updated': datetime.datetime(2020, 3, 17, 9, 17, 7, 206896), 'drone_uuid': 'nemo-7728XXX', 'resource_status': <ResourceStatus.Stopped: 3>}
root: 2020-03-17 09:17:07 Destroying VM with ID 7728XXX
Manuel Giffels
@giffels
Are there any changes applied by you to the Moab site adapter?
Manuel Giffels
@giffels
This is a known issue, and @rfvc is currently working on adding all statuses known to Moab.
Stefan Kroboth
@stefan-k
Great, thanks! :) I've only made a minor change that should not make a difference: I changed the email address in the msub command.
Manuel Giffels
@giffels
Okay, that should not be the problem.
Stefan Kroboth
@stefan-k
@giffels regarding #142: I will investigate this further. This was pretty weird because I could see in Grafana that sometimes we got metrics and sometimes we didn't. It crashes tardis, which is then restarted by systemd.
Also, there's another bug in the most recent commit ;)
Stefan Kroboth
@stefan-k
prometheusmonitoring: 2020-04-16 13:31:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065787, 'created': datetime.datetime(2020, 4, 16, 13, 26, 12, 506621), 'updated': datetime.datetime(2020, 4, 16, 13, 31, 9, 748663), 'drone_uuid': 'nemo-8065787', 'resource_status': 'ResourceStatus.Running', 'state': 'IntegrateState', 'meta': 'atlsch', 'timestamp': 1587036609718, 'revision': 1} has changed state to IntegratingState
prometheusmonitoring: 2020-04-16 13:31:09 TTTT: <class 'str'>
interestingly, this happened in only about 1 in 50 cases
(TTTT indicates the type of resource_status)
in all other cases, it is an enum:
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
ah I think I know what may be happening
"crosstalk" from the Elasticsearch plugin
Stefan Kroboth
@stefan-k
yep. sorry for the fuss. resource_attributes is modified in the ElasticsearchPlugin. I never know whether arguments are passed by reference or value in Python...
In the ElasticsearchPlugin, I'm turning state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference it affects everything else
Max Fischer
@maxfischer2781
Can you link where this translation happens? You should be able to directly access the enum's name, without converting it.
(Sorry for not being available to give hints earlier, by the way. Trying to catch up now.)
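As a rough illustration of reading an enum's name without converting the stored value: the sketch below uses a hypothetical stand-in for ResourceStatus (only Stopped = 3 is visible in the logs above; the other member and its value are assumptions). The name is read for output while the enum member itself stays untouched.

```python
from enum import Enum


class ResourceStatus(Enum):
    # Hypothetical stand-in, not the actual tardis class.
    Running = 2  # value assumed for illustration
    Stopped = 3  # matches <ResourceStatus.Stopped: 3> in the log above


status = ResourceStatus.Stopped

# .name gives the bare member name as a str, leaving `status` an enum member.
status_name = status.name

# str() on a plain Enum member includes the class name as a prefix.
status_text = str(status)

print(status_name)   # Stopped
print(status_text)   # ResourceStatus.Stopped
print(type(status))  # still the enum type
```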
Stefan Kroboth
@stefan-k
No worries at all! I'm glad that I didn't get you sucked into this more ;)
But I think I will completely redesign this anyway
so I think it is not necessary that you waste any time on this :)
Max Fischer
@maxfischer2781
Alright, I'll try to sprinkle some random wisdom on the next version, then. ;)
Stefan Kroboth
@stefan-k
I'm looking forward to it :D
Manuel Giffels
@giffels
Sorry for being late to the party; I was stuck in a Zoom meeting. Good that you found the problem. Dictionaries are mutable objects in Python. Passing a mutable object to a function behaves similarly to pass-by-reference in other languages, while anything immutable behaves similarly to pass-by-value.
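A minimal sketch of that behaviour, with hypothetical function names: Python hands the function a reference to the same object, so in-place changes to a dict are visible to the caller, while rebinding the parameter name inside the function is not.

```python
def stringify_status(attributes):
    # In-place mutation: the caller's dict is the same object, so it changes too.
    attributes["resource_status"] = str(attributes["resource_status"])


def rebind(attributes):
    # Rebinding only changes the local name; the caller's dict is untouched.
    attributes = {"resource_status": "replaced"}


shared = {"resource_status": 3}

rebind(shared)
after_rebind = dict(shared)  # snapshot: still {"resource_status": 3}

stringify_status(shared)     # now shared["resource_status"] is the str "3"

print(after_rebind)
print(shared)
```

This is the same kind of "crosstalk" as described above: one consumer converts a value in place and every other holder of the dict sees the converted value.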
Max Fischer
@maxfischer2781
For reference, a simple/fast way to convert-with-copy some keys is to use a dict literal ({}) with unpacking (**) of the original dict.
resource_attributes = {
    **resource_attributes,
    "state": str(state),
    "meta": self._meta,
    "timestamp": int(time() * 1000),
    "resource_status": resource_attributes["resource_status"],
}
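To see that the unpacking approach leaves the original dict alone, here is a small self-contained sketch; the sample values are made up for illustration. Note that the copy is shallow, so nested mutable values (lists, dicts) would still be shared between the two dicts.

```python
from time import time

# Made-up sample data standing in for resource_attributes.
original = {"drone_uuid": "nemo-0000000", "resource_status": 3}

# Builds a new dict: all keys of `original`, plus added or overridden keys.
converted = {
    **original,
    "state": "CleanupState",
    "timestamp": int(time() * 1000),
}

print(original)   # unchanged, no "state" or "timestamp" keys
print(converted)  # copy with the extra keys
```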
Stefan Kroboth
@stefan-k
Thanks to both of you!
Stefan Kroboth
@stefan-k
BTW: I've been using PR #141 in production for a while now, and it does not solve the problem that tardis tries to cancel drones which are long gone. I guess this may be related to @giffels' comment here: https://github.com/MatterMiners/tardis/pull/141#discussion_r409591396
Manuel Giffels
@giffels
Okay, thanks for the information. @rfvc could you address the remaining issue in #141, please? I found that CANCELLED is also not yet handled there.
Stefan Kroboth
@stefan-k
@giffels @rfvc: I just wanted to let you know that mapping TimeLimit and Vacated to ResourceStatus.Deleted in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems which required us to restart the service at least once a day. I thought this was related to a bug in the Slurm batchsystem adapter, but apparently it wasn't. Anyway, with the updated mapping it has been running for more than a week without needing to be restarted.
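A rough sketch of what such a mapping could look like. This is not the actual Moab site adapter code: the enum is a hypothetical stand-in, the table name and the translate helper are invented for illustration, and falling back to an Error status for unknown states is an assumption. Only the TimeLimit and Vacated entries come from the message above.

```python
from enum import Enum


class ResourceStatus(Enum):
    # Hypothetical stand-in; values other than Stopped = 3 are assumed.
    Running = 2
    Stopped = 3
    Deleted = 4
    Error = 5


# Hypothetical lookup table from batch system job state to ResourceStatus.
MOAB_STATE_TO_STATUS = {
    "Running": ResourceStatus.Running,    # assumed entry for illustration
    "TimeLimit": ResourceStatus.Deleted,  # from the message above
    "Vacated": ResourceStatus.Deleted,    # from the message above
}


def translate(state, default=ResourceStatus.Error):
    # Unknown states fall back to `default` instead of raising KeyError.
    return MOAB_STATE_TO_STATUS.get(state, default)


print(translate("TimeLimit"))
print(translate("Vacated"))
print(translate("SomeUnknownState"))
```

With a table like this, a drone whose job ended in TimeLimit or Vacated is reported as deleted, so nothing tries to cancel it again later.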
Manuel Giffels
@giffels
@stefan-k Thanks for reporting this. I have updated the pull request accordingly. @rfvc any chance we can finish this pull request this week?
rfvc
@rfvc
@stefan-k, thanks for your testing and input on it!
@giffels, I think so, I’m on it
Stefan Kroboth
@stefan-k
Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...
Stefan Kroboth
@stefan-k
Sorry about the noise in the PR, I thought I was in our local GitLab... embarrassing...
Manuel Giffels
@giffels

Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...

I have restarted CI on Travis, which should actually update Codecov afterwards. Fingers crossed.

Stefan Kroboth
@stefan-k
@giffels: Thanks! unfortunately it looks like it didn't work. At least the diff seems to be missing the last commit :/
Max Fischer
@maxfischer2781
It seems there was an issue mapping the report to the commit. Codecov complains that it was "Unable to find commit in GitHub".
Max Fischer
@maxfischer2781
Looks like the metadata was garbled, not sure if we can reset that. The easiest is probably for you to push a new commit, so that an entirely new CI+Codecov run is triggered.
Stefan Kroboth
@stefan-k
Thanks a lot! I'll try to push something today