Stefan Kroboth
@stefan-k
Thanks for the quick reply. I don't think this will be a major problem in production, but just in case it is, I think it could be easily solved by also monitoring the partition/queue in the BatchsystemAdapter and simply returning 0 utilisation/allocation if the queue is empty.
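
A minimal, self-contained sketch of that idea, purely illustrative and not TARDIS code: the adapter additionally watches the partition/queue and reports zero utilisation and allocation while it is empty. The probe names and the min/max reduction over the ratios are assumptions.

import asyncio

class IdleAwareSlurmAdapter:
    def __init__(self, queued_jobs_probe, ratios_probe):
        self._queued_jobs = queued_jobs_probe    # hypothetical: e.g. wraps squeue on the partition
        self._ratios = ratios_probe              # hypothetical: wraps the existing ratio lookup

    async def get_utilisation(self, drone_uuid):
        if await self._queued_jobs() == 0:
            return 0.0                           # empty queue: report no utilisation
        return min(await self._ratios(drone_uuid), default=0.0)

    async def get_allocation(self, drone_uuid):
        if await self._queued_jobs() == 0:
            return 0.0                           # empty queue: report no allocation
        return max(await self._ratios(drone_uuid), default=0.0)

async def _demo():
    adapter = IdleAwareSlurmAdapter(
        queued_jobs_probe=lambda: asyncio.sleep(0, result=0),            # pretend the queue is empty
        ratios_probe=lambda uuid: asyncio.sleep(0, result=[0.8, 0.5]),
    )
    assert await adapter.get_utilisation("nemo-0000000") == 0.0

asyncio.run(_demo())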
Manuel Giffels
@giffels
@stefan-k No, this is exactly the difference between ROCED and COBalD/TARDIS. Our decisions are based on allocation and utilisation of resources, and the scheduling is done by the overlay batch system. We would like to keep both as separated as possible. However, @mschnepf proposed some time ago a controller that boots up these "test" resources from time to time, instead of running them all the time. Maybe he can comment.

We managed to set up a test environment and hooray, it seems to work more or less. However, one problem remains: utilisation and consumption both stay at 1.0, even if there are idle drones running and no jobs in the queue. As I said before, get_resource_ratios(...) is not being called (in the case of our SlurmAdapter). I'm not sure how to debug this without knowing which conditions lead to a call of this method. I would assume that this method is called early on and quite regularly, given that the decision whether or not to start new drones is based on those values. But even in the case of the FakeBatchSystem it takes >10min until it is called for the first time.

@stefan-k : get_resource_ratios is only called for drones in state Available. Is this the case for you?

By default this should happen roughly every minute.
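
To make the condition above concrete, here is a rough paraphrase in code, not the actual COBalD/TARDIS loop; the MachineStatus stand-in and the one-minute interval follow the description above, everything else is assumed.

import asyncio
from enum import Enum

class MachineStatus(Enum):   # local stand-in mirroring the TARDIS enum members (assumed)
    Available = 1
    Draining = 2
    Drained = 3
    NotAvailable = 4

async def poll_drone(adapter, drone_uuid, interval=60):
    while True:
        status = await adapter.get_machine_status(drone_uuid)
        if status is MachineStatus.Available:
            # the ratios are only fetched for Available drones
            ratios = await adapter.get_resource_ratios(drone_uuid)
            print(drone_uuid, ratios)
        await asyncio.sleep(interval)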
Stefan Kroboth
@stefan-k
Thank you, these were exactly the pointers we needed! They were in fact not Available because I forgot about the mapping of VM to DroneUuid... :grimacing:
Stefan Kroboth
@stefan-k
We got a lot further thanks to your help!
However, we see strange behaviour when our drones power themselves down due to being idle. COBalD/TARDIS then tries to kill the drones via the underlying batch system, which fails. It then retries until the garbage collector cleans up the dead drones. This somehow blocks the creation of new drones. This makes me assume that VMs should not power themselves down, is this correct?
Max Fischer
@maxfischer2781
VMs should be able to power down iff the adapter properly reports this. COBalD should consider a drone "dead" once its supply drops to 0. @giffels can you comment how this works for HTCondor?
Matthias Schnepf
@mschnepf
@stefan-k do you use the MOAB site adapter? COBalD/TARDIS does not know about the VM in Freiburg; TARDIS only sees a slot in a batch system. When the VM powers itself down, TARDIS sets the drone to ShutDownState. As long as the batch job which started the VM is still running, TARDIS tries to release the resources by killing this job. (see https://github.com/MatterMiners/tardis/blob/86899b1905ca790eb1a990991e20e42508e40848/tardis/resources/dronestates.py#L202)
Matthias Schnepf
@mschnepf
We also have an auto shutdown of the VMs. If the job which started the VM is still there, it gets killed; if the job is already dead, TARDIS switches the drone to the cleanup state.
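
A rough paraphrase in code of the behaviour described above; the real logic lives in dronestates.py (linked above), and the method and state names here are assumptions for illustration only.

async def handle_shut_down(site_adapter, resource_attributes):
    status = await site_adapter.resource_status(resource_attributes)
    if status.get("resource_status") != "Stopped":
        # the batch job that started the VM is still around: try to kill it
        await site_adapter.stop_resource(resource_attributes)
        return "ShutDownState"   # stay here and retry on the next cycle
    return "CleanupState"        # job is gone: hand over to cleanup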
Stefan Kroboth
@stefan-k
@mschnepf Yes, we use the MOAB site adapter. We feed the MOAB job id (as TardisDroneUuid) into Slurm as node features (this is done by a script within the VM, which is aware of the MOAB job id). This is then available to the SlurmBatchsystemAdapter, therefore the state of the drone/vm is known to TARDIS (I'm not sure if this is what you mean).
I assume your and our startVM.py scripts have the same ancestry? ;) Ours notices that the VM powered itself down, deletes the instance in OpenStack and ends itself, thereby ending the MOAB job. COBalD/TARDIS then still keeps trying to kill the MOAB job (unsuccessfully), even though it is known from the BatchsystemAdapter that the node is in down or drained state.
It may be that the mapping of Slurm node state to TARDIS MachineStatus is somehow not correct in our batchsystem adapter...
Matthias Schnepf
@mschnepf
I think it is the same script, too ;-) So this part should work accordingly.
In which state is the drone according to TARDIS when TARDIS tries to kill it?
HTCondor removes the worker node automatically by stopping the VM or after a timeout, so we need no further "disintegrating" steps. If the drone is no longer in HTCondor, TARDIS switches the drone into MachineStatus.NotAvailable.
Stefan Kroboth
@stefan-k
I'll check the state tomorrow. You've pointed me to a potential problem: reading the HTCondor code, I assumed that NotAvailable is to be interpreted as NotAvailableYet (in other words, I thought NotAvailable precedes Available; I did not expect that a shut-down drone is also NotAvailable rather than Drained). This may cause the issues. Thanks! :) I'll get back to you tomorrow with details.
Max Fischer
@maxfischer2781
feel free to bump us if we should document some things better. ;)
Stefan Kroboth
@stefan-k
Thanks, I will :)
I was wondering what the difference between the Drained and NotAvailable status is, because to my understanding any drained drone is not available.
Manuel Giffels
@giffels
NotAvailable means the drone is not (yet) registered in the overlay batch system, so its status is unknown. A drone in state Drained is registered in the overlay batch system, but it does not accept new jobs.
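
For illustration, a sketch of how a Slurm batch system adapter could translate sinfo node states into the two statuses explained above; the state strings and this particular mapping are assumptions, not the project's actual table.

SLURM_STATE_TO_MACHINE_STATUS = {
    "idle": "Available",
    "mixed": "Available",
    "allocated": "Available",
    "draining": "Draining",
    "drained": "Drained",    # registered in the overlay batch system, refusing new jobs
}

def machine_status_for(sinfo_state):
    if sinfo_state is None:
        # node not reported by sinfo at all: not registered (yet) -> NotAvailable
        return "NotAvailable"
    return SLURM_STATE_TO_MACHINE_STATUS.get(sinfo_state.rstrip("*"), "NotAvailable")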
Stefan Kroboth
@stefan-k
Thanks!
Stefan Kroboth
@stefan-k
Hi all! How does cobald/tardis decide whether a drone still exists? Currently, we see that cobald/tardis tries to kill drones which are long gone via canceljob XXXXX. The drones do not show up in showq and are invisible to our SLURM adapter, so I would assume that they are considered non-existent. The result of canceljob on one of these dead drones is ERROR: invalid job specified (XXXXXXX), so it is retried again and again. These dead drones accumulate, and the excessive canceljob calls potentially put quite some strain on the site. I guess the garbage collector will take care of them at some point, but it seems as if this is not fast enough. We therefore need to regularly stop cobald/tardis, delete the database, stop all remaining drones and start it again. The non-existent drones are considered NotAvailable in the Slurm adapter (not Drained). Could this be the cause of this problem? Any pointers are highly appreciated :)
Manuel Giffels
@giffels
@stefan-k : We are currently pushing hard to finish the CHEP proceedings. ;-) We will come back to your question afterwards, sorry for that.
Stefan Kroboth
@stefan-k
@giffels : No worries, this is not at all urgent :) Good luck!
Manuel Giffels
@giffels
Hi @stefan-k, do I get it right that the drones are showing up neither in Moab's showq nor in your SLURM overlay batch system?
We use the Moab command showq --xml -w user=$(whoami) && showq -c --xml -w user=$(whoami) to get the status of the drones running at NEMO. The second command also lists the completed ones. If the status in Moab is COMPLETED, the drone should go to DownState in tardis and the garbage collector should take care of it.
Do you know the previous state of the drones before they go to NotAvailable?
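
For illustration, a small helper that checks the completed-jobs listing from the command quoted above for a given drone; the XML attribute names (JobID, State) are assumptions about showq's output, not verified against the Moab documentation.

import getpass
import subprocess
import xml.etree.ElementTree as ET

def moab_completed_state(job_id):
    user = getpass.getuser()
    xml_out = subprocess.run(
        ["showq", "-c", "--xml", "-w", f"user={user}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for job in ET.fromstring(xml_out).iter("job"):
        if job.get("JobID") == str(job_id):
            return job.get("State")   # e.g. "Completed": the drone should go to DownState
    return None                       # not listed in the completed queue (yet)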
Manuel Giffels
@giffels
What is the output of showq -c --xml -w user=$(whoami)?
Stefan Kroboth
@stefan-k
Unfortunately I just restarted everything for a different reason, so I don't have any of these zombies around. But I remember that the state of the ones I checked with showq -c ... was CNCLD(271). In Slurm they are (I think) shown as drained*, where * means that Slurm hasn't heard from the instance for a while (so it is likely offline). I don't know the TARDIS state before they go to NotAvailable. There should be a few around tomorrow morning. I'll check again then and let you know. Thanks! :)
Manuel Giffels
@giffels
CNCLD seems to be Canceled.
Let us check tomorrow, once you have new zombies around.
Stefan Kroboth
@stefan-k
$ showq -c  -w user=$(whoami) | grep 7728XXX
7728XXX             V CNCLD         tor   1.01      1.0  - fr_XXXXXX bwXXXXXX nXXXX.nemo.priva    20    00:09:15   Tue Mar 17 00:10:56
(I've censored it a bit and used the non-XML version.)
Slurm's sinfo will just return nothing for the corresponding node (therefore it should be considered NotAvailable in TARDIS). But at this point I can't guarantee that it is like that immediately after the drone goes down. There may be different transient states in Slurm right after the drone goes down which may confuse TARDIS.
Stefan Kroboth
@stefan-k
just for completeness:
$ canceljob 7728XXX
ERROR:  invalid job specified (7728XXX)
Stefan Kroboth
@stefan-k
I just realized that invalid job specified is handled by the site adapter.
Manuel Giffels
@giffels
What is the current state of this drone in TARDIS?
Stefan Kroboth
@stefan-k
root: 2020-03-17 09:17:07 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 7728XXX, 'created': datetime.datetime(2020, 3, 17, 0, 1, 8, 58991), 'updated': datetime.datetime(2020, 3, 17, 9, 17, 7, 206896), 'drone_uuid': 'nemo-7728XXX', 'resource_status': <ResourceStatus.Stopped: 3>}
root: 2020-03-17 09:17:07 Destroying VM with ID 7728XXX
Manuel Giffels
@giffels
Are there any changes applied by you to the Moab site adapter?
Manuel Giffels
@giffels
This is a known issue and @rfvc is currently working on adding all states known to Moab.
Stefan Kroboth
@stefan-k
Great, thanks! :) I've only made a minor change that should not make a difference: I changed the email address in the msub command.
Manuel Giffels
@giffels
Okay, that should not be the problem.
Stefan Kroboth
@stefan-k
@giffels regarding #142: I will investigate this further. This was pretty weird because I could see in Grafana that we sometimes got metrics and sometimes we didn't. It crashes tardis, which is then restarted by systemd.
also, there's another bug in the most recent commit ;)
Stefan Kroboth
@stefan-k
prometheusmonitoring: 2020-04-16 13:31:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065787, 'created': datetime.datetime(2020, 4, 16, 13, 26, 12, 506621), 'updated': datetime.datetime(2020, 4, 16, 13, 31, 9, 748663), 'drone_uuid': 'nemo-8065787', 'resource_status': 'ResourceStatus.Running', 'state': 'IntegrateState', 'meta': 'atlsch', 'timestamp': 1587036609718, 'revision': 1} has changed state to IntegratingState
prometheusmonitoring: 2020-04-16 13:31:09 TTTT: <class 'str'>
interestingly, this happened only once in about 50 cases
(TTTT indicates the type of resource_status)
in all other cases, it is an enum:
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
ah I think I know what may be happening
"crosstalk" from the Elasticsearch plugin
Stefan Kroboth
@stefan-k
yep, sorry for the fuss. resource_attributes is modified in the ElasticsearchPlugin. I never know whether arguments are passed by reference or by value in Python...
In the ElasticsearchPlugin, I'm turning the state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference, it affects everything else.
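
A minimal illustration of that pitfall: Python passes object references, so mutating the shared resource_attributes dict inside one plugin is visible to every other consumer of the same dict, while a shallow copy keeps the change local. The function names and the trimmed-down ResourceStatus enum below are made up for the example.

from enum import Enum

class ResourceStatus(Enum):   # trimmed-down stand-in for the TARDIS enum
    Running = 1
    Stopped = 3

def notify_mutating(resource_attributes):
    # BAD: the caller's dict now holds a str instead of the enum
    resource_attributes["resource_status"] = str(resource_attributes["resource_status"])

def notify_copying(resource_attributes):
    # BETTER: stringify on a copy and leave the shared dict untouched
    doc = dict(resource_attributes)
    doc["resource_status"] = str(doc["resource_status"])
    return doc

attrs = {"drone_uuid": "nemo-8065787", "resource_status": ResourceStatus.Running}
notify_copying(attrs)
assert isinstance(attrs["resource_status"], ResourceStatus)   # still the enum
notify_mutating(attrs)
assert isinstance(attrs["resource_status"], str)              # "crosstalk": now a str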