Ward Vandewege
@cure
hmm, maybe those scancels are not normal
Tom Schoonjans
@tschoonj
Here is our slurm.conf, is there something that looks fishy?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819

# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP

NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
Ward Vandewege
@cure
I'm not spotting anything weird here
maybe in NodeName, you have State=UNKNOWN
hmm no we have that too
Ward Vandewege
@cure
we have this on one of our systems:
# COMPUTE NODES
NodeName=DEFAULT CPUs=20 SOCKETs=1 CORESPERSOCKET=20 THREADSPERCORE=1 State=UNKNOWN RealMemory=32109 Weight=32109
PartitionName=DEFAULT MaxTime=INFINITE State=UP
#PartitionName=compute Default=YES Shared=yes

NodeName=compute[0-63]

PartitionName=compute Nodes=compute[0-63] Default=YES Shared=yes
that's just the bit about compute nodes
Tom Schoonjans
@tschoonj
well, thanks for the help anyway. We will revisit this after the weekend. I will first try getting rid of the cgroup config, as I am not sure it is working properly on our VMs
Ward Vandewege
@cure
sounds good! Happy to help
Tom Schoonjans
@tschoonj
many thanks Ward!
Tom Schoonjans
@tschoonj

Hi, we got things working by dropping cgroups support in the Slurm config.

We now also got our first CWL workflow running with arvados-cwl-runner, but is it possible to do this through the REST API? It looks possible to create/delete/modify a workflow via the REST API, but I couldn't find how to start one except through the aforementioned tool.

Andrey Kartashov
@portah
@tschoonj Can you share your experience running Arvados on Slurm?
Peter Amstutz
@tetron
@tschoonj there isn't a high-level API for submitting workflows. the way arvados-cwl-runner works is by constructing a container request that runs the workflow runner, and that actually manages the particular instance of the running workflow
@tschoonj so you can programmatically submit a container request that follows the same pattern, there's just more detail to account for
Ward Vandewege
@cure
@tetron maybe we should add a code example for that to our docs
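For illustration, a minimal sketch of that pattern using the Arvados Python SDK: it submits a container request whose command runs arvados-cwl-runner, and the runner then manages the workflow. The arvados/jobs image tag, mount layout, resource numbers, and the empty workflow/input documents are placeholder assumptions, not values taken from this setup.

import arvados

api = arvados.api('v1')

# Container request that runs the workflow runner (arvados-cwl-runner);
# the runner in turn manages the individual steps of the workflow.
container_request = {
    "name": "example workflow run",
    "state": "Committed",                # submit for scheduling immediately
    "priority": 500,
    "container_image": "arvados/jobs",   # image providing arvados-cwl-runner
    "command": [
        "arvados-cwl-runner", "--local", "--api=containers",
        "/var/lib/cwl/workflow.json#main",
        "/var/lib/cwl/cwl.input.json",
    ],
    "cwd": "/var/spool/cwl",
    "output_path": "/var/spool/cwl",
    "mounts": {
        "/var/spool/cwl": {"kind": "tmp", "capacity": 1 << 30},
        # placeholder documents: the packed CWL workflow and its inputs
        "/var/lib/cwl/workflow.json": {"kind": "json", "content": {}},
        "/var/lib/cwl/cwl.input.json": {"kind": "json", "content": {}},
    },
    "runtime_constraints": {"vcpus": 1, "ram": 1 << 30},
}

cr = api.container_requests().create(
    body={"container_request": container_request}).execute()
print(cr["uuid"], cr["state"])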
Tom Schoonjans
@tschoonj
Thanks for the replies!
Ward Vandewege
@cure
is everything working OK otherwise, @tschoonj?
Tom Schoonjans
@tschoonj
yes, we did get our minimal crunch-dispatch-slurm working
Ward Vandewege
@cure
excellent
Andrey Kartashov
@portah
@tschoonj Do you have a recommendation for a newbie? or just follow the docs?
Tom Schoonjans
@tschoonj
just follow the docs
one thing we are still discussing is federation
is it possible with 2.2 to submit a CWL workflow to cluster A, which has a Keep but no Crunch, and have that workflow get picked up by cluster B, which has no Keep but does have Crunch support?
Andrey Kartashov
@portah
Arvados federation? Or some integration with your current infrastructure?
Tom Schoonjans
@tschoonj
Arvados federation
Peter Amstutz
@tetron
@tschoonj cluster B still needs to have Keep for storing intermediate results
there's some federation token handling that we need to work on to make that case of "cluster B uses federated data from cluster A" work smoothly
Tom Schoonjans
@tschoonj
@tetron so this won't be possible even with 2.3, even with cluster B having its own Keep?
Peter Amstutz
@tetron
there might be a workaround
you can override the token it uses
actually
it depends on how they are federated
Tom Schoonjans
@tschoonj
in our case cluster A would be the LoginCluster
so I assume that its tokens should work also on cluster B?
Peter Amstutz
@tetron
so here's how it works: when a process (container) runs, it gets an ephemeral token for the lifetime of the process
that's created by the API server that owns the container
the problem is that if it is a satellite cluster, not the main login cluster, a token created on the satellite is only good for accessing data on the satellite
so what it needs to do is get a new token from the login cluster
but that feature doesn't exist yet
Tom Schoonjans
@tschoonj
and there's no workaround for this currently?
Peter Amstutz
@tetron
the API allows you to provide an explicit token to use when submitting a container
Tom Schoonjans
@tschoonj
aha
that sounds promising
Peter Amstutz
@tetron
I'm checking to see under what circumstances the workflow runner passes that parameter
the parameter is "runtime_token" on container_request
Tom Schoonjans
@tschoonj
I assume that this is the runtime_token in the container_requests API?
:)
does arvados-cwl-runner have a similar option?
Peter Amstutz
@tetron
it does not. this is actually isolated from the workflow runner. the workflow runner has the ability to request that a container be submitted to a different cluster than the main one, and runtime_token is how the controller provides credentials
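As a rough sketch of that workaround with the Arvados Python SDK: create the container request on the satellite cluster (cluster B) but pass an explicit runtime_token issued by the login cluster (cluster A), so the running container keeps access to federated data. The host name, token string, image, mounts, and resource numbers below are placeholder assumptions.

import arvados

# Token issued by the login cluster (cluster A); placeholder value.
login_cluster_token = "v2/xxxxx-gj3su-xxxxxxxxxxxxxxx/yyyyyyyy"

# Submit to the satellite cluster (cluster B), which has Crunch.
api_b = arvados.api('v1', host='cluster-b.example.com',
                    token=login_cluster_token)

cr = api_b.container_requests().create(body={"container_request": {
    "name": "federated workflow run",
    "state": "Committed",
    "priority": 500,
    "container_image": "arvados/jobs",
    "command": ["arvados-cwl-runner", "--local", "--api=containers",
                "/var/lib/cwl/workflow.json#main",
                "/var/lib/cwl/cwl.input.json"],
    "cwd": "/var/spool/cwl",
    "output_path": "/var/spool/cwl",
    "mounts": {
        "/var/spool/cwl": {"kind": "tmp", "capacity": 1 << 30},
        # placeholder documents: the packed CWL workflow and its inputs
        "/var/lib/cwl/workflow.json": {"kind": "json", "content": {}},
        "/var/lib/cwl/cwl.input.json": {"kind": "json", "content": {}},
    },
    "runtime_constraints": {"vcpus": 1, "ram": 1 << 30},
    # the workaround: explicit token the container uses at runtime,
    # instead of the ephemeral token minted by cluster B
    "runtime_token": login_cluster_token,
}}).execute()
print(cr["uuid"])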