Ward Vandewege
@cure
wrt slurm being unhappy, does a basic test like srun -N 4 hostname work?
and you probably already know that slurm is extremely sensitive to a) time sync between nodes and b) dns resolution for the hostnames of the controller and workers
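A quick way to spot-check both of those from the controller and each worker is a small script along the lines of the sketch below; it is only illustrative, and the hostnames are placeholders for the actual slurmctld host and compute nodes.
# Illustrative sketch, not part of any Arvados or Slurm tooling: run it on the
# controller and on each compute node and compare the output. The hostnames
# are placeholders for your own slurmctld host and workers.
import socket
import time

HOSTS = [
    "crunch-dispatcher-slurm-controller-blue-dispatcher",  # controller (example)
    "slurm-worker-blue-dispatcher-1",                      # worker (example)
    "slurm-worker-blue-dispatcher-2",                      # worker (example)
]

for host in HOSTS:
    try:
        print(f"{host} -> {socket.gethostbyname(host)}")
    except socket.gaierror as err:
        print(f"{host} does NOT resolve: {err}")

# Compare this timestamp across nodes; Slurm tolerates only a small clock skew.
print("local unix time:", time.time())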
Ward Vandewege
@cure
it looks from the logs like things are more or less working
perhaps there's a firewalling/communication problem between api server and/or slurmctld and the compute nodes
are there actual errors in the container logs?
Tom Schoonjans
@tschoonj
Hi Ward
regular srun commands work perfectly fine
from the crunch side of things, everything looks good
Ward Vandewege
@cure
great
Tom Schoonjans
@tschoonj
it appears that the crunch-dispatch-slurm service is explicitly trying to scancel the job after it has successfully completed
Ward Vandewege
@cure
yeah those messages in the logs may just be noise that can be ignored
(and we can file a bug to get them fixed)
Tom Schoonjans
@tschoonj
they're not just noise I am afraid, as they drain the node :-(
Ward Vandewege
@cure
are the jobs completing OK from the arvados perspective?
Tom Schoonjans
@tschoonj
we are now manually setting them to idle after every crunch job
as far as I can tell they are
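One way that manual "set it back to idle" step could be automated is a small wrapper around scontrol, sketched below; this is purely illustrative, assumes scontrol is on PATH with sufficient Slurm privileges, and the node name is an example.
# Illustrative sketch: resume a drained node so it accepts jobs again.
# Assumes scontrol is on PATH and the caller has Slurm admin rights;
# the node name below is an example.
import subprocess

def resume_node(node_name: str) -> None:
    # "scontrol update NodeName=<node> State=RESUME" clears the DRAIN flag.
    subprocess.run(
        ["scontrol", "update", f"NodeName={node_name}", "State=RESUME"],
        check=True,
    )

if __name__ == "__main__":
    resume_node("slurm-worker-blue-dispatcher-1")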
Ward Vandewege
@cure
so it's maybe a slurm configuration issue then
Tom Schoonjans
@tschoonj
maybe indeed
have you seen this before where the crunch-dispatch-slurm service is actively trying to run scancel on finished jobs?
Ward Vandewege
@cure
I think that is normal
the part where that turns into draining of the node is weird
does that happen if you try using scancel manually, too?
Tom Schoonjans
@tschoonj
we haven't tried that yet, as we only tested with short-running jobs
Ward Vandewege
@cure
hmm, maybe those scancels are not normal
Tom Schoonjans
@tschoonj
Here is our slurm.conf, is there something that looks fishy?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819

# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP

NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
Ward Vandewege
@cure
I'm not spotting anything weird here
maybe in NodeName, you have State=UNKNOWN
hmm no we have that too
Ward Vandewege
@cure
we have this on one of our systems:
# COMPUTE NODES
NodeName=DEFAULT CPUs=20 SOCKETs=1 CORESPERSOCKET=20 THREADSPERCORE=1 State=UNKNOWN RealMemory=32109 Weight=32109
PartitionName=DEFAULT MaxTime=INFINITE State=UP
#PartitionName=compute Default=YES Shared=yes

NodeName=compute[0-63]

PartitionName=compute Nodes=compute[0-63] Default=YES Shared=yes
that's just the bit about compute nodes
Tom Schoonjans
@tschoonj
well, thanks for the help anyway. We will revisit this after the weekend. I will first try getting rid of the cgroup config, as I am not sure it is working properly on our VMs
Ward Vandewege
@cure
sounds good! Happy to help
Tom Schoonjans
@tschoonj
many thanks Ward!
Tom Schoonjans
@tschoonj

Hi, we got things working by dropping cgroups support in the Slurm config.

We now also got our first CWL workflow running with arvados-cwl-runner, but is it possible to do this through the REST API? It looks possible to create/delete/modify a workflow via the REST API, but I couldn't find how to start one except through the aforementioned tool.

Andrey Kartashov
@portah
@tschoonj Can you share your experience running Arvados on Slurm?
Peter Amstutz
@tetron
@tschoonj there isn't a high-level API for submitting workflows. The way arvados-cwl-runner works is by constructing a container request that runs the workflow runner, which then manages that particular instance of the running workflow
@tschoonj so you can programmatically submit a container request that follows the same pattern; there's just more detail to account for
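A rough sketch of what that could look like with the Arvados Python SDK follows; the runner image, command line, mounts, and runtime constraints are assumptions modeled on what arvados-cwl-runner typically generates, so a container request actually created by the tool on your cluster is the authoritative reference.
# Hedged sketch: programmatically submit a "workflow runner" container request
# via the Arvados Python SDK, mimicking what arvados-cwl-runner does. The image,
# command, mount paths, and runtime constraints are assumptions -- compare them
# with a container request that arvados-cwl-runner really created.
import arvados

api = arvados.api("v1")

container_request = {
    "name": "cwl-workflow-run",
    "state": "Committed",
    "container_image": "arvados/jobs",          # assumed runner image
    "command": [
        "arvados-cwl-runner", "--api=containers",
        "/var/lib/cwl/workflow.json#main",      # packed workflow (assumed path)
        "/var/lib/cwl/cwl.input.json",          # workflow inputs (assumed path)
    ],
    "cwd": "/var/spool/cwl",
    "output_path": "/var/spool/cwl",
    "mounts": {
        # Assumed mounts: the packed CWL and the input JSON would also need to
        # be provided here (for example as "json" mounts), plus a writable
        # working directory for the runner's output.
        "/var/spool/cwl": {"kind": "tmp", "capacity": 1 << 30},
    },
    "runtime_constraints": {"vcpus": 1, "ram": 1 << 30},
}

resp = api.container_requests().create(
    body={"container_request": container_request}
).execute()
print("submitted container request", resp["uuid"])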
Ward Vandewege
@cure
@tetron maybe we should add a code example for that to our docs
Tom Schoonjans
@tschoonj
Thanks for the replies!
Ward Vandewege
@cure
is everything working OK otherwise @tschoonj ?
Tom Schoonjans
@tschoonj
yes, we did get our minimal crunch-dispatch-slurm working
Ward Vandewege
@cure
excellent
Andrey Kartashov
@portah
@tschoonj Do you have a recommendation for a newbie? or just follow the docs?
Tom Schoonjans
@tschoonj
just follow the docs
one thing we are still discussing is federation
is it possible with 2.2 to submit a CWL workflow to cluster A, which has Keep but no Crunch, and have that workflow get picked up by cluster B, which has no Keep but does have Crunch support?
Andrey Kartashov
@portah
Arvados federation? Or some integration with your current infrastructure?
Tom Schoonjans
@tschoonj
Arvados federation
Peter Amstutz
@tetron
@tschoonj cluster B still needs to have Keep for storing intermediate results
there's some federation token handling that we need to work on to make that case of "cluster B uses federated data from cluster A" work smoothly