Stefan Kroboth
@stefan-k
prometheusmonitoring: 2020-04-16 13:29:09 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m120', 'remote_resource_uuid': 8065030, 'created': datetime.datetime(2020, 4, 16, 12, 2, 52, 850664), 'updated': datetime.datetime(2020, 4, 16, 13, 29, 8, 464074), 'drone_uuid': 'nemo-8065030', 'resource_status': <ResourceStatus.Stopped: 3>} has changed state to CleanupState
prometheusmonitoring: 2020-04-16 13:29:09 TTTT: <enum 'ResourceStatus'>
ah I think I know what may be happening
"crosstalk" from the Elasticsearch plugin
Stefan Kroboth
@stefan-k
yep. sorry for the fuss. resource_attributes is modified in the ElasticsearchPlugin. I never know whether arguments are passed by reference or by value in Python...
In the ElasticsearchPlugin, I'm turning state into a str so that Elasticsearch isn't confused, but since resource_attributes is passed by reference this affects everything else
Max Fischer
@maxfischer2781
Can you link where this translation happens? You should be able to directly access the enum's name, without converting it.
(Sorry for not being available to give hints earlier, by the way. Trying to catch up now.)
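For illustration, the suggestion above can be sketched as follows (the enum here is a stand-in modeled on the `ResourceStatus.Stopped: 3` seen in the logs, not the actual TARDIS class):

```python
from enum import Enum

class ResourceStatus(Enum):
    # Stand-in for the TARDIS ResourceStatus enum seen in the logs above
    Booting = 1
    Running = 2
    Stopped = 3

status = ResourceStatus.Stopped

# str() of a plain Enum member includes the class name, which can confuse
# downstream consumers such as Elasticsearch:
print(str(status))   # "ResourceStatus.Stopped"

# .name yields just the member name, .value the underlying value,
# with no conversion needed:
print(status.name)   # "Stopped"
print(status.value)  # 3
```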
Stefan Kroboth
@stefan-k
No worries at all! I'm glad that I didn't get you sucked into this more ;)
But I think I will completely redesign this anyway
so I think it is not necessary that you waste any time on this :)
Max Fischer
@maxfischer2781
Alright, I'll try to sprinkle some random wisdom on the next version, then. ;)
Stefan Kroboth
@stefan-k
I'm looking forward to it :D
Manuel Giffels
@giffels
Sorry for being late to the party. I was stuck in a Zoom meeting. Good that you found the problem. Dictionaries are mutable objects in Python. Passing a mutable object to a function behaves similarly to pass-by-reference in other languages. Everything that is immutable behaves similarly to pass-by-value in other languages.
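A minimal sketch of that behaviour (generic names, not the actual TARDIS code): mutating a dict inside a function is visible to the caller, while rebinding an immutable value is not.

```python
def mutate(attrs):
    # The dict object itself is shared with the caller,
    # so this change is visible outside the function.
    attrs["state"] = "CleanupState"

def rebind(value):
    # This only rebinds the local name; the caller's
    # variable is unaffected.
    value = value + 1

resource_attributes = {"state": "AvailableState"}
mutate(resource_attributes)
print(resource_attributes["state"])  # "CleanupState"

count = 1
rebind(count)
print(count)  # still 1
```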
Max Fischer
@maxfischer2781
For reference, a simple/fast way to convert-with-copy some keys is to use a dict literal ({}) with unpacking (**) of the original dict.
resource_attributes = {
    **resource_attributes,
    "state": str(state),
    "meta": self._meta,
    "timestamp": int(time() * 1000),
    "resource_status": resource_attributes["resource_status"],
}
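Applied to the copy-with-unpacking idea above (illustrative values, not the real TARDIS attributes), the `{}` literal plus `**` creates a new dict, so overrides never touch the caller's original:

```python
from enum import Enum

class ResourceStatus(Enum):
    # Stand-in for the TARDIS ResourceStatus enum
    Stopped = 3

original = {
    "resource_status": ResourceStatus.Stopped,
    "drone_uuid": "nemo-8065030",
}

# New dict: unpack the original, then add/override keys.
converted = {
    **original,
    "state": ResourceStatus.Stopped.name,
}

print(converted["state"])   # "Stopped"
print("state" in original)  # False -- original is unchanged
```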
Stefan Kroboth
@stefan-k
Thanks to both of you!
Stefan Kroboth
@stefan-k
BTW: I'm using PR #141 in production for a while now and it does not solve the problem that tardis tries to cancel drones which are long gone. I guess this may be related to @giffels comment here: https://github.com/MatterMiners/tardis/pull/141#discussion_r409591396
Manuel Giffels
@giffels
Okay, thanks for the information. @rfvc could you address the remaining issue in #141, please? I found that CANCELLED is also not yet handled there.
Stefan Kroboth
@stefan-k
@giffels @rfvc: I just wanted to let you know that mapping TimeLimit and Vacated to ResourceStatus.Deleted in the MOAB site adapter seems to solve our problem of zombie drones outlined above. The zombie drones caused a couple of problems which required us to restart the service at least once a day. I thought this was related to a bug in the Slurm batchsystem adapter, but apparently it wasn't. Anyway, with the updated mapping it has been running for more than a week without needing to be restarted.
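The mapping fix described above can be sketched as a simple state-translation table (a hypothetical layout with a stand-in enum; the actual MOAB site adapter code differs). The key point is that terminal batch-system states such as TimeLimit and Vacated map to Deleted, so drones in those states are no longer treated as alive:

```python
from enum import Enum

class ResourceStatus(Enum):
    # Stand-in for the TARDIS ResourceStatus enum
    Booting = 1
    Running = 2
    Stopped = 3
    Deleted = 4

# Hypothetical MOAB-state -> ResourceStatus translation table.
moab_status_map = {
    "Running": ResourceStatus.Running,
    "Completed": ResourceStatus.Deleted,
    "TimeLimit": ResourceStatus.Deleted,  # job hit its walltime limit
    "Vacated": ResourceStatus.Deleted,    # job was evicted from the node
}

print(moab_status_map["TimeLimit"].name)  # "Deleted"
```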
Manuel Giffels
@giffels
@stefan-k Thanks for reporting this. I have updated the pull request accordingly. @rfvc any chance we can finish this pull request this week?
rfvc
@rfvc
@stefan-k, thanks for your testing and input on it!
@giffels, I think so, I’m on it
Stefan Kroboth
@stefan-k
Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...
Stefan Kroboth
@stefan-k
Sorry about the noise in the PR, I thought I was in our local GitLab... embarrassing...
Manuel Giffels
@giffels

Can I somehow trigger codecov for the slurm PR? It seems like the current codecov result is not based on the most recent commit...

I have re-started the CI at Travis, which should actually update codecov afterwards. Fingers crossed.

Stefan Kroboth
@stefan-k
@giffels: Thanks! unfortunately it looks like it didn't work. At least the diff seems to be missing the last commit :/
Max Fischer
@maxfischer2781
It seems there was an issue mapping the report to the commit. codecov complains that it was "Unable to find commit in GitHub"
Max Fischer
@maxfischer2781
Looks like the metadata was garbled, not sure if we can reset that. The easiest is probably for you to push a new commit, so that an entirely new CI+Codecov run is triggered.
Stefan Kroboth
@stefan-k
Thanks a lot! I'll try to push something today
WestByNoreaster
@mtwest2718
Good evening all.
Is this community forum still in use?
Max Fischer
@maxfischer2781
@mtwest2718 Yes it is. Latency is a bit higher when there isn't much going on, but we channel owners get notified of messages.
WestByNoreaster
@mtwest2718
I know you have a very pretty new github.io page
But there wasn't a clear way to contact y'all.
WestByNoreaster
@mtwest2718
So to introduce myself, I am a sys-admin at the University of Exeter.
  • We are setting up a 2500 core OpenStack system, which mostly will be used to serve researchers bespoke interactive VMs. I will admit I am a very new sys-admin so a lot of this stuff is confusing to me.
  • I would love to use COBalD/TARDIS to utilize idle CPU resources for batch compute. But the docs feel a bit light on setting up a new pool.
WestByNoreaster
@mtwest2718
I understand this is asking a bit from your (collective) time but a walk through on setting up a new pool of OpenStack nodes would be greatly beneficial.
Max Fischer
@maxfischer2781
Right, we've wanted to brush up the github.io page for a while but there are still some parts missing.
In case you want other ways of contacting us as well:
You can reach the core team at matterminers@lists.kit.edu at any time for practically any questions.
We've got a Mattermost at https://chat.eudat.eu/ where you can reach other users of C/T as well. Though it's mostly in German, as is usual for the community, people can switch to English at any time and will gladly do so.
Finally, a lot of us are also on the CERN mattermost.
Now that we've got that covered... :sweat_smile:
@giffels has more experience using C/T with OpenStack, but I should be able to walk you through the rough steps and we can work our way up from there on. ;)
Before we dive into C/T itself, do you have an existing batch system through which users will access the resources? I see you've been active on the HTCondor mailing list, so I assume you're familiar with running an HTCondor cluster?
WestByNoreaster
@mtwest2718
The batch system on our general purpose cluster is Slurm and I'd like to avoid touching that production system as much as possible. The OpenStack cluster is new and we have more flexibility to play with it.
I am an experienced HTCondor user, not admin.
So I would have to spin up a central manager and submit hosts. How my boss wants storage managed, I am not sure.
We just had the machines installed last week, so a bunch is still in flux. I just wanted to reach out beforehand. This also helps me make a clearer case to my supervisor if I might need an extra node for running the submit host and/or CM.
Max Fischer
@maxfischer2781
So, the rough outline of what you should be planning with: (I'll add them as separate message, feel free to reply-in-thread)
  • You'll need some HTCondor submit hosts to get jobs into the system. C/T doesn't really care about these, but be prepared that they must be able to communicate with your opportunistic resources. A public IP address is advisable, since the resources then don't need one themselves.
  • You'll need a central manager for the Collector and Negotiator. This one must be reachable from all opportunistic resources. I recommend configuring the Collector as a Condor Connection Broker (CCB) as well, since then only one side of the submit/worker connection must be publicly reachable.
Max Fischer
@maxfischer2781
  • You'll need a machine that runs COBalD/TARDIS. This one must be able to reach your resource provider (i.e. OpenStack) and the Collector. It's fine to just use the central manager for this as well; C/T is pretty flexible about being redeployed later on, so you don't lock yourself in.
  • You can have static worker nodes (STARTD) but don't need them. C/T will opportunistically add them to your system. There is no problem having multiple C/T instances add resources, nor in having resources from other sources as well.
Max Fischer
@maxfischer2781
If you need recommendations on any of these, feel free to ask. Generally, C/T doesn't care much about how HTCondor is set up, so we try to avoid giving people wrong ideas. But we do have the expertise for the entire HTCondor stack, so we can give you more information as needed.
WestByNoreaster
@mtwest2718
Is it better to run HTCondor (and relevant container engine) on bare metal on these systems or have a VM instance that auto-starts as a worker node and connects to the pool, so as to use the OpenStack provisioning tools?
Max Fischer
@maxfischer2781
As for getting started with C/T: I recommend going through our tutorial first, which will walk you through setting up COBalD and TARDIS with a dummy resource pool. We'll walk you through using the OpenStack resource pool afterwards.