Tom Schoonjans
@tschoonj
ok thanks will investigate
Peter Amstutz
@tetron
however, if you aren't using keepstore-level replication (i.e. DefaultReplication: 1) and are instead relying on replication at a lower level (object store or RAID), then it doesn't matter
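For reference, one way to double-check which replication value a running cluster actually has (a minimal sketch; it assumes arvados-server is installed on the host that carries the cluster config):

# Print the merged cluster configuration and look for the replication setting
# (the key lives at Clusters.<ClusterID>.Collections.DefaultReplication)
sudo arvados-server config-dump | grep DefaultReplication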
Tom Schoonjans
@tschoonj
yes, we are using DefaultReplication: 1 in this setup
Peter Amstutz
@tetron
ok then you just need to figure out where the VPN fits in your network topology
Tom Schoonjans
@tschoonj
Ok, we got it fixed now. The proxy is now used everywhere except when using arv on the arvados VM itself
Andrey Kartashov
@portah
@tetron Is there a preinstalled version of Arvados available on a cloud?
Peter Amstutz
@tetron
@portah to try it out or to do real workloads?
Andrey Kartashov
@portah
@tetron to check api and try with cwl
Peter Amstutz
@tetron
Andrey Kartashov
@portah
Thank you
Cibin S B
@cibinsb
Hi there, I have been trying to deploy Arvados on GKE and ran into the following load balancer error from one of the Arvados services. How can I fix this problem?
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl get svc
NAME                         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                       AGE
arvados-api-server           LoadBalancer   10.88.12.90    34.89.54.152   444:31588/TCP                 31m
arvados-keep-proxy           LoadBalancer   10.88.11.130   34.89.54.152   25107:31630/TCP               31m
arvados-keep-store           ClusterIP      None           <none>         25107/TCP                     31m
arvados-keep-web             LoadBalancer   10.88.5.66     34.89.54.152   9002:32663/TCP                31m
arvados-postgres             ClusterIP      10.88.12.232   <none>         5432/TCP                      31m
arvados-slurm-compute        ClusterIP      None           <none>         6818/TCP                      31m
arvados-slurm-controller-0   ClusterIP      10.88.14.128   <none>         6817/TCP                      31m
arvados-workbench            LoadBalancer   10.88.8.200    <pending>      443:30734/TCP,445:32051/TCP   31m
arvados-ws                   LoadBalancer   10.88.5.207    34.89.54.152   9003:30153/TCP                31m
kubernetes                   ClusterIP      10.88.0.1      <none>         443/TCP                       22h
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl describe service/arvados-workbench
Name:                     arvados-workbench
Namespace:                default
Labels:                   app=arvados
                          app.kubernetes.io/managed-by=Helm
                          chart=arvados-0.1.0
                          heritage=Helm
                          release=arvados
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: arvados
                          meta.helm.sh/release-namespace: default
Selector:                 app=arvados-workbench
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.88.8.200
IPs:                      10.88.8.200
IP:                       34.89.54.152
Port:                     wb2  443/TCP
TargetPort:               443/TCP
NodePort:                 wb2  30734/TCP
Endpoints:                10.84.2.18:443
Port:                     wb  445/TCP
TargetPort:               445/TCP
NodePort:                 wb  32051/TCP
Endpoints:                10.84.2.18:445
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   EnsuringLoadBalancer    2m38s (x11 over 28m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m34s (x11 over 28m)  service-controller  Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (ae0291ffb3043451580fc197edd8a34e(default/arvados-workbench)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.89.54.152'. Specified IP address is in-use and would result in a conflict., invalid
Peter Amstutz
@tetron
@cure might know
Tom Schoonjans
@tschoonj

Hi all,

We are testing the Arvados Slurm dispatcher and are running into trouble:

$ sudo journalctl -o cat -fu crunch-dispatch-slurm.service
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:36:08.672769209Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:00.794084728Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 121.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:01.919705722Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:54.030722405Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 122.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:55.167350562Z"}
Started Arvados Crunch Dispatcher for SLURM.

This is bizarre, as we are able to use arv api_client_authorization current without problems from the VM running the dispatcher, when using the root API token. Any thoughts? Thanks!

Tom Schoonjans
@tschoonj
Please ignore, our config was wrong
Ward Vandewege
@cure
ok!
Callum-Joyce
@Callum-Joyce

Hello, I am looking at using SLURM dispatch with @tschoonj.

We have tried running a job with the example command provided here: https://doc.arvados.org/v2.2/install/crunch2-slurm/install-test.html but get hit with this error:

Error: //railsapi.internal/arvados/v1/container_requests: 422 Unprocessable Entity: #<ArvadosModel::UnresolvableContainerError: docker image "arvados/jobs:latest" not found> (req-ecdzw2wz1qq5r24xfuus)

The documentation here: https://doc.arvados.org/v2.2/api/methods/container_requests.html suggests that the "container_image" property should be set to the PDH of a collection containing the image, but in the example script mentioned above it is set to "arvados/jobs:latest", which is obviously not a PDH.

Could you advise on exactly what the value should be here? If putting the image into a collection is necessary, will we need to do this for every image we need to use in the future? Thanks in advance.

Ward Vandewege
@cure
@Callum-Joyce you need to load the images in Keep, see https://doc.arvados.org/v2.2/user/topics/arv-docker.html
the example script is correct, listing something like "arvados/jobs:latest" is what you want to do
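Roughly what that looks like in practice (a sketch based on the linked arv-docker page; it assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set for a user who can write to Keep):

# Pull the image locally, then upload it to Keep and register it with the API
# server, so that "arvados/jobs:latest" in a container request resolves to a
# collection identified by its PDH.
docker pull arvados/jobs:latest
arv-keepdocker arvados/jobs latest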
Callum-Joyce
@Callum-Joyce
@cure Thanks, I can confirm that jobs submit correctly now
Ward Vandewege
@cure
ok! Hopefully it also runs successfully :)
Tom Schoonjans
@tschoonj

Hi all,

We managed to get the test CWL from the docs working, in the sense that it succeeds and that we can fetch the logs from Keep afterwards. However, Slurm is not happy. The stdout file that was written on the worker node contains:

2021/10/08 09:25:54 crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712529567Z crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712579936Z Executing container '88d80-dz642-22qdy08ukoae458'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712620617Z Executing on host 'slurm-worker-blue-dispatcher-2'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.877163363Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.908683532Z Using Docker image id 'sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.913563419Z Loading Docker image from keep
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.816683221Z Docker response: {"stream":"Loaded image ID: sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002\n"}
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.830237887Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:11.500969976Z Creating Docker container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:19.974405474Z Attaching container streams
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:20.337499740Z Starting Docker container id '1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.005081598Z Waiting for container to finish
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.704143751Z Container exited with code: 0
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.799195935Z Complete
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.017988309Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.577220567Z crunch-run finished
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 ON slurm-worker-blue-dispatcher-2 CANCELLED AT 2021-10-08T09:26:22 ***
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681660707Z caught signal: terminated
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681745921Z removing container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.683569836Z error removing container: Error: No such container: 1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 STEPD TERMINATED ON slurm-worker-blue-dispatcher-2 AT 2021-10-08T09:27:23 DUE TO JOB NOT ENDING WITH SIGNALS ***
We were surprised to see that the crunch jobs run as user root: is there a way to change this? For this purpose we created a crunch user, a member of the docker group, on all Slurm nodes and on the controller, as suggested in the installation guide. Is there a configuration setting for this?
The logs on the Slurm nodes are being created directly in /: is there a way to change this? Thanks in advance!
Tom Schoonjans
@tschoonj
Here are the crunch-dispatch-slurm logs corresponding to the previously mentioned job:
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Submitting container 88d80-dz642-22qdy08ukoae458 to slurm
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 running sbatch ["--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 "/usr/bin/sbatch" ["sbatch" "--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]: "Submitted batch job 4"
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Start monitoring container 88d80-dz642-22qdy08ukoae458 in state "Locked"
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:26:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:32 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
...repeats several more times...

Oct 08 09:27:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:27:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:32 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:42 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:42 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:52 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:02 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:02 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:12 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:12 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:22 Done monitoring container 88d80-dz642-22qdy08ukoae458
Ward Vandewege
@cure
@tschoonj the slurm user would be the user that the crunch-dispatch-slurm service runs as (since it executes sbatch etc directly)
wrt slurm being unhappy, does a basic test like srun -N 4 hostname work?
and you probably already know that Slurm is extremely sensitive to (a) time sync between nodes and (b) DNS resolution for the hostnames of the controller and workers
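On the question of which user the jobs run as: if the goal is to have the dispatcher (and therefore the Slurm jobs it submits) run as the crunch user rather than root, one possible approach is a systemd override for the service shown in the journalctl output above (a sketch; whether this fits your setup is a judgment call):

# Hypothetical systemd override: run the dispatcher as the "crunch" user
sudo systemctl edit crunch-dispatch-slurm
# in the editor that opens, add:
#   [Service]
#   User=crunch
#   Group=crunch
sudo systemctl restart crunch-dispatch-slurm

And a few quick checks for the srun, time-sync, and DNS points (a sketch; adjust -N to your node count, and the hostnames to the ones in your slurm.conf):

# Basic Slurm sanity check: every node should answer with its hostname
srun -N 2 hostname

# Time sync: clocks should agree across the controller and the workers
timedatectl status    # or chronyc tracking / ntpq -p, depending on your setup

# DNS: controller and worker hostnames should resolve consistently everywhere
getent hosts crunch-dispatcher-slurm-controller-blue-dispatcher
getent hosts slurm-worker-blue-dispatcher-1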
Ward Vandewege
@cure
it looks from the logs like things are more or less working
perhaps there's a firewalling/communication problem between the API server and/or slurmctld and the compute nodes
are there actual errors in the container logs?
Tom Schoonjans
@tschoonj
Hi Ward
regular srun commands work perfectly fine
from the crunch side of things, everything looks good
Ward Vandewege
@cure
great
Tom Schoonjans
@tschoonj
it appears that the crunch-dispatch-slurm service is explicitly trying to scancel the job after it has successfully completed
Ward Vandewege
@cure
yeah those messages in the logs may just be noise that can be ignored
(and we can file a bug to get them fixed)
Tom Schoonjans
@tschoonj
they're not just noise I am afraid, as they drain the node :-(
Ward Vandewege
@cure
are the jobs completing OK from the arvados perspective?
Tom Schoonjans
@tschoonj
we are now manually setting them to idle after every crunch job
as far as I can tell they are
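For reference, the manual reset mentioned above is usually something along these lines (a sketch; the node name is the one from the worker log earlier):

# Show why the node is drained (the Reason field), then return it to service
sinfo -R
sudo scontrol update NodeName=slurm-worker-blue-dispatcher-2 State=RESUME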
Ward Vandewege
@cure
so it's maybe a slurm configuration issue then
Tom Schoonjans
@tschoonj
maybe indeed
have you seen this before, where the crunch-dispatch-slurm service actively tries to run scancel on finished jobs?
Ward Vandewege
@cure
I think that is normal
the part where that turns into draining of the node is weird
does that happen if you try using scancel manually, too?
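One way to test that in isolation (a minimal sketch):

# Submit a throwaway job, cancel it by hand, then see whether the node drains
jobid=$(sbatch --parsable --wrap 'sleep 120')
scancel "$jobid"
sleep 10
sinfo -N -l    # check the STATE column for the node that ran the job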
Tom Schoonjans
@tschoonj
we haven't tried that yet, as we have only tested with short-running jobs
Ward Vandewege
@cure
hmm, maybe those scancels are not normal
Tom Schoonjans
@tschoonj
Here is our slurm.conf; does anything look fishy?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819

# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP

NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES