Hi all,
We managed to get the test CWL workflow from the docs working, in the sense that it succeeds and we can fetch the logs from Keep afterwards. However, Slurm is not happy. The stdout file written on the worker node contains:
2021/10/08 09:25:54 crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712529567Z crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712579936Z Executing container '88d80-dz642-22qdy08ukoae458'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712620617Z Executing on host 'slurm-worker-blue-dispatcher-2'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.877163363Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.908683532Z Using Docker image id 'sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.913563419Z Loading Docker image from keep
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.816683221Z Docker response: {"stream":"Loaded image ID: sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002\n"}
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.830237887Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:11.500969976Z Creating Docker container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:19.974405474Z Attaching container streams
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:20.337499740Z Starting Docker container id '1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.005081598Z Waiting for container to finish
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.704143751Z Container exited with code: 0
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.799195935Z Complete
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.017988309Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.577220567Z crunch-run finished
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 ON slurm-worker-blue-dispatcher-2 CANCELLED AT 2021-10-08T09:26:22 ***
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681660707Z caught signal: terminated
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681745921Z removing container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.683569836Z error removing container: Error: No such container: 1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 STEPD TERMINATED ON slurm-worker-blue-dispatcher-2 AT 2021-10-08T09:27:23 DUE TO JOB NOT ENDING WITH SIGNALS ***
We created the user crunch, a member of the docker group, on all Slurm nodes and the controller for this purpose, as suggested in the installation guide. Is there a configuration setting for this?
crunch-dispatch-slurm logs corresponding to the previously mentioned job:
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Submitting container 88d80-dz642-22qdy08ukoae458 to slurm
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 running sbatch ["--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 "/usr/bin/sbatch" ["sbatch" "--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]: "Submitted batch job 4"
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Start monitoring container 88d80-dz642-22qdy08ukoae458 in state "Locked"
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:26:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:32 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
...repeats several more times...
Oct 08 09:27:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:27:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:32 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:42 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:42 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:52 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:02 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:02 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:12 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:12 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:22 Done monitoring container 88d80-dz642-22qdy08ukoae458
Does srun -N 4 hostname work?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819
# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP
NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
In your NodeName line, you have State=UNKNOWN.
# COMPUTE NODES
NodeName=DEFAULT CPUs=20 SOCKETs=1 CORESPERSOCKET=20 THREADSPERCORE=1 State=UNKNOWN RealMemory=32109 Weight=32109
PartitionName=DEFAULT MaxTime=INFINITE State=UP
#PartitionName=compute Default=YES Shared=yes
NodeName=compute[0-63]
PartitionName=compute Nodes=compute[0-63] Default=YES Shared=yes
Hi, we got things working by dropping cgroups support in the Slurm config.
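(The chat doesn't show the exact change, but for a slurm.conf like the one posted earlier, "dropping cgroups support" would typically mean switching the cgroup plugins to their non-cgroup equivalents, along these lines; this is a sketch of one common choice, not the poster's actual diff:)
# Assumed non-cgroup equivalents of the plugins used above
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux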
We now also got our first CWL workflow running with arvados-cwl-runner, but is it possible to do this through the REST API? It looks possible to create/delete/modify a workflow via the REST API, but we couldn't find how to start one except through the aforementioned tool.
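(For reference, a rough sketch of what starting a run through the API could look like, using the Arvados Python SDK, which wraps the same REST endpoints: a run is started by creating a container_request that invokes arvados-cwl-runner inside a container. The workflow UUID, image, mounts and resource numbers below are placeholders, whether arvados-cwl-runner accepts a registered workflow UUID directly should be verified, and the exact fields should be checked against the container_requests API reference.)

import arvados

# Uses ARVADOS_API_HOST / ARVADOS_API_TOKEN from the environment.
api = arvados.api('v1')

# Placeholder container request that runs arvados-cwl-runner on a registered workflow.
cr = api.container_requests().create(body={'container_request': {
    'name': 'example CWL run via API',
    'state': 'Committed',            # 'Committed' asks Arvados to schedule the container
    'priority': 1,
    'container_image': 'arvados/jobs',
    'command': ['arvados-cwl-runner', '--local',
                'zzzzz-7fd4e-xxxxxxxxxxxxxxx'],  # placeholder workflow UUID
    'cwd': '/var/spool/cwl',
    'output_path': '/var/spool/cwl',
    'runtime_constraints': {'vcpus': 1, 'ram': 1 << 30},
    'mounts': {
        '/var/spool/cwl': {'kind': 'tmp', 'capacity': 1 << 30},
        'stdout': {'kind': 'file', 'path': '/var/spool/cwl/cwl.output.json'},
    },
}}).execute()

print(cr['uuid'], cr['state'])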