crunch-dispatch-slurm logs corresponding to the previously mentioned job:
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Submitting container 88d80-dz642-22qdy08ukoae458 to slurm
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 running sbatch ["--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 "/usr/bin/sbatch" ["sbatch" "--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]: "Submitted batch job 4"
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Start monitoring container 88d80-dz642-22qdy08ukoae458 in state "Locked"
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:26:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:32 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
...repeats several more times...
Oct 08 09:27:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:27:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:32 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:42 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:42 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:52 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:02 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:02 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:12 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:12 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:22 Done monitoring container 88d80-dz642-22qdy08ukoae458
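When scancel seems to have no effect like this, it is usually worth asking Slurm directly why the job is still around. A sketch of the commands I would try (the job name is the container UUID from the log above, and job id 4 is the one sbatch reported):

# Show the job's state and the scheduler's reason for it
squeue --name=88d80-dz642-22qdy08ukoae458 -o "%.10i %.9P %.30j %.2t %.10M %R"
# Full job record from slurmctld, including Reason=
scontrol show job 4
# Node health, in case the job is stuck on a down/drained node
sinfo -N -l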
Does srun -N 4 hostname work?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819
# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP
NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
In your NodeName line, you have State=UNKNOWN.
# COMPUTE NODES
NodeName=DEFAULT CPUs=20 Sockets=1 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN RealMemory=32109 Weight=32109
PartitionName=DEFAULT MaxTime=INFINITE State=UP
#PartitionName=compute Default=YES Shared=yes
NodeName=compute[0-63]
PartitionName=compute Nodes=compute[0-63] Default=YES Shared=yes
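If the nodes came up in a bad state, you can also check what slurmctld thinks of them and return them to service by hand; roughly this (node names taken from your config):

# Per-node state as the controller sees it
sinfo -N -l
# Detailed record for one node, including State and Reason
scontrol show node slurm-worker-blue-dispatcher-1
# If they are stuck DOWN/DRAINED, put them back
scontrol update NodeName=slurm-worker-blue-dispatcher-[1-2] State=RESUME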
Hi, we got things working by dropping cgroups support in the Slurm config.
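For anyone who finds this later: the change amounted to swapping the cgroup plugins for their non-cgroup equivalents. Roughly this, though not our exact diff (these are standard Slurm plugin names):

ProctrackType=proctrack/pgid             # was proctrack/cgroup
TaskPlugin=task/affinity                 # was task/cgroup
JobAcctGatherType=jobacct_gather/linux   # was jobacct_gather/cgroup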
We've now also got our first CWL workflow running with arvados-cwl-runner, but is it possible to do this through the REST API? It looks possible to create/delete/modify a workflow via the REST API, but I couldn't find how to start one except through the aforementioned tool.
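For context, the shape I would expect is creating a container request via the REST API; a rough, unverified sketch (cluster host, token, and workflow UUID are placeholders, and I'm not sure the body is complete; running arvados-cwl-runner with --debug should show the real request it sends, which is probably a safer template):

curl -s -X POST https://zzzzz.example.com/arvados/v1/container_requests \
  -H "Authorization: Bearer $ARVADOS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"container_request": {
        "name": "workflow run via API",
        "state": "Committed",
        "container_image": "arvados/jobs",
        "command": ["arvados-cwl-runner", "--api=containers", "zzzzz-7fd4e-xxxxxxxxxxxxxxx"],
        "output_path": "/var/spool/cwl",
        "mounts": {"/var/spool/cwl": {"kind": "collection", "writable": true}},
        "runtime_constraints": {"vcpus": 1, "ram": 1073741824}}}'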