Q&A, support and general discussion about the Arvados project; for development, see https://gitter.im/arvados/development
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl get svc
NAME                         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                       AGE
arvados-api-server           LoadBalancer   10.88.12.90    34.89.54.152   444:31588/TCP                 31m
arvados-keep-proxy           LoadBalancer   10.88.11.130   34.89.54.152   25107:31630/TCP               31m
arvados-keep-store           ClusterIP      None           <none>         25107/TCP                     31m
arvados-keep-web             LoadBalancer   10.88.5.66     34.89.54.152   9002:32663/TCP                31m
arvados-postgres             ClusterIP      10.88.12.232   <none>         5432/TCP                      31m
arvados-slurm-compute        ClusterIP      None           <none>         6818/TCP                      31m
arvados-slurm-controller-0   ClusterIP      10.88.14.128   <none>         6817/TCP                      31m
arvados-workbench            LoadBalancer   10.88.8.200    <pending>      443:30734/TCP,445:32051/TCP   31m
arvados-ws                   LoadBalancer   10.88.5.207    34.89.54.152   9003:30153/TCP                31m
kubernetes                   ClusterIP      10.88.0.1      <none>         443/TCP                       22h
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl describe service/arvados-workbench
Name:                     arvados-workbench
Namespace:                default
Labels:                   app=arvados
                          app.kubernetes.io/managed-by=Helm
                          chart=arvados-0.1.0
                          heritage=Helm
                          release=arvados
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: arvados
                          meta.helm.sh/release-namespace: default
Selector:                 app=arvados-workbench
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.88.8.200
IPs:                      10.88.8.200
IP:                       34.89.54.152
Port:                     wb2  443/TCP
TargetPort:               443/TCP
NodePort:                 wb2  30734/TCP
Endpoints:                10.84.2.18:443
Port:                     wb  445/TCP
TargetPort:               445/TCP
NodePort:                 wb  32051/TCP
Endpoints:                10.84.2.18:445
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   EnsuringLoadBalancer    2m38s (x11 over 28m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m34s (x11 over 28m)  service-controller  Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (ae0291ffb3043451580fc197edd8a34e(default/arvados-workbench)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.89.54.152'. Specified IP address is in-use and would result in a conflict., invalid
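The Warning event at the end looks like the real problem: several of these LoadBalancer services request the same spec.loadBalancerIP (34.89.54.152, visible as the second "IP:" line above), and GCE refuses to create another forwarding rule for an address that is already in use. The usual fix is to give each LoadBalancer service its own reserved address. A minimal sketch, assuming GKE; the region and address name below are hypothetical:

# Reserve a distinct regional static IP for each LoadBalancer service
gcloud compute addresses create arvados-workbench-ip --region europe-west2
gcloud compute addresses describe arvados-workbench-ip --region europe-west2 --format='value(address)'

# Point the workbench service at its own address instead of the shared one
kubectl patch svc arvados-workbench -p '{"spec":{"loadBalancerIP":"<reserved address>"}}'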
Hi all,
We are testing the Arvados Slurm dispatch and are running into trouble:
$ sudo journalctl -o cat -fu crunch-dispatch-slurm.service
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:36:08.672769209Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:00.794084728Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 121.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:01.919705722Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:54.030722405Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 122.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:55.167350562Z"}
Started Arvados Crunch Dispatcher for SLURM.
This is bizarre, as we are able to run arv api_client_authorization current without problems from the VM running the dispatcher when using the root API token. Any thoughts? Thanks!
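One observation, hedged: the fatal message is a plain TCP "connection refused" to 10.93.111.119:443, i.e. the dispatcher's configured controller URL resolves to an address where nothing is listening on 443, while the arv CLI may be pointed at a different ARVADOS_API_HOST. A quick way to compare the two from the dispatcher VM (hostname copied from the log above):

# What the arv CLI actually talks to
echo "$ARVADOS_API_HOST"

# What the dispatcher's URL resolves to, and whether port 443 answers
getent hosts 88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com
curl -v https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current -H "Authorization: Bearer $ARVADOS_API_TOKEN"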
Hello, I am looking at using SLURM dispatch with @tschoonj.
We have tried running a job with the example command provided here: https://doc.arvados.org/v2.2/install/crunch2-slurm/install-test.html but get hit with this error:
Error: //railsapi.internal/arvados/v1/container_requests: 422 Unprocessable Entity: #<ArvadosModel::UnresolvableContainerError: docker image "arvados/jobs:latest" not found> (req-ecdzw2wz1qq5r24xfuus)
The documentation here: https://doc.arvados.org/v2.2/api/methods/container_requests.html suggests that the "container_image" property should be set to the portable data hash (PDH) of a collection containing the image, but in the example script mentioned above it is set to "arvados/jobs:latest", which is obviously not a PDH.
Could you advise on exactly what the value should be here? If putting the image into a collection is necessary, will we need to do this for every image we use in the future? Thanks in advance.
"arvados/jobs:latest"
is what you want to do
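To expand on that answer a little: container_image accepts a repository:tag as well as a PDH, but the name only resolves if the image has already been uploaded to Keep. A minimal sketch of the upload step using arv-keepdocker, which pushes a local Docker image into Keep and records the name/tag link the API server uses for resolution:

# Pull the image locally, then push it into Keep so the API server
# can resolve "arvados/jobs:latest" to a collection
docker pull arvados/jobs:latest
arv-keepdocker arvados/jobs latest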
Hi all,
We managed to get the test CWL from the docs working, in the sense that it succeeds and that we can fetch the logs from Keep afterwards. However, Slurm is not happy. The stdout file that was written on the worker node contains:
2021/10/08 09:25:54 crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712529567Z crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712579936Z Executing container '88d80-dz642-22qdy08ukoae458'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712620617Z Executing on host 'slurm-worker-blue-dispatcher-2'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.877163363Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.908683532Z Using Docker image id 'sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.913563419Z Loading Docker image from keep
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.816683221Z Docker response: {"stream":"Loaded image ID: sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002\n"}
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.830237887Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:11.500969976Z Creating Docker container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:19.974405474Z Attaching container streams
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:20.337499740Z Starting Docker container id '1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.005081598Z Waiting for container to finish
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.704143751Z Container exited with code: 0
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.799195935Z Complete
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.017988309Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.577220567Z crunch-run finished
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 ON slurm-worker-blue-dispatcher-2 CANCELLED AT 2021-10-08T09:26:22 ***
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681660707Z caught signal: terminated
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681745921Z removing container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.683569836Z error removing container: Error: No such container: 1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 STEPD TERMINATED ON slurm-worker-blue-dispatcher-2 AT 2021-10-08T09:27:23 DUE TO JOB NOT ENDING WITH SIGNALS ***
We created the user crunch, a member of the docker group on all slurm nodes and the controller for this purpose, as suggested in the installation guide. Is there a configuration setting for this?
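For the record, the group setup itself is one line per host; a sketch, with the crunch user name taken from the install guide as quoted above:

# Run on every slurm node and on the controller
sudo usermod -aG docker crunch
# Verify (the group change needs a fresh login/session to take effect)
sudo -u crunch docker ps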
The crunch-dispatch-slurm logs corresponding to the previously mentioned job:
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Submitting container 88d80-dz642-22qdy08ukoae458 to slurm
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 running sbatch ["--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 "/usr/bin/sbatch" ["sbatch" "--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]: "Submitted batch job 4"
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Start monitoring container 88d80-dz642-22qdy08ukoae458 in state "Locked"
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:26:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:32 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
...repeats several more times...
Oct 08 09:27:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:27:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:32 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:42 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:42 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:52 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:02 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:02 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:12 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:12 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:22 Done monitoring container 88d80-dz642-22qdy08ukoae458
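While a job is wedged in that scancel loop, it may help to capture what Slurm itself thinks is still running; a diagnostic sketch using the job from the log above:

# Is the job still pending/completing, and on which node?
squeue --name=88d80-dz642-22qdy08ukoae458 -o "%i %T %r %N"
# Full job record, including the reason it has not finished
scontrol show job 4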
Does srun -N 4 hostname work?
Our slurm.conf:
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819
# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP
NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
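A few generic sanity checks against a config like this, nothing Arvados-specific (a sketch; run from the controller):

scontrol ping        # confirms slurmctld is reachable
sinfo -N -l          # node states as the controller sees them
srun -N 2 hostname   # trivial job across slurm-worker-blue-dispatcher-[1-2]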