Tom Schoonjans
@tschoonj
thanks Peter!!
Peter Amstutz
@tetron:matrix.org
by the way, we have an Arvados user group meeting today in half an hour
Tom Schoonjans
@tschoonj
I know, but won't be able to make it due to childcare :-(
Peter Amstutz
@tetron
@/all The user group video chat is happening soon https://forum.arvados.org/t/arvados-user-group-video-chat/47/8
Tom Schoonjans
@tschoonj

Hello again,

We are still setting up our test Arvados infrastructure, and now have a single VM running the API server, PostgreSQL, keepstore and keepproxy. Our issue now is with the keepproxy: the docs stipulate that the output of arv keep_service accessible should contain a reference to the keepproxy server. This works fine when running the command on the office network, but not from home over VPN, where the output contains the keepstore domain name instead.

I assume that this is related to the geo settings in the nginx config?

Thanks in advance!
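For reference, a quick way to compare the two vantage points is to run the command from the docs on the office network and again over the VPN; a minimal sketch, assuming the Arvados CLI tools are installed and ARVADOS_API_HOST / ARVADOS_API_TOKEN are set:

# list the Keep services the API server advertises to this client
arv keep_service accessible
# compare the service_host / service_port fields in the JSON output:
# external clients should be offered the keepproxy host, internal
# clients the keepstore node(s)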

Peter Amstutz
@tetron
yes
it is controlled by the geo setting
is the home VPN considered to be on the same network?
Tom Schoonjans
@tschoonj
apparently not :-)
I will ask our IT department what IP range we need to add to support our VPN connections
Peter Amstutz
@tetron
if you are outside the private network, you should get keepproxy from "keep_services accessible"; if you are inside the private network, you should get the keepstore servers instead. it doesn't matter which one you get as long as it is reachable
so it sounds like either the keepstore needs to be reachable from the home VPN, or your geo section needs to send the home VPN to keepproxy (which in turn needs to be reachable)
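For reference, the routing Peter describes is typically driven by an nginx geo block in front of the API server; a rough sketch, assuming the layout from the install guide (the variable name, header and IP ranges are examples, not your actual values):

# nginx config on the API server: classify client IPs
geo $external_client {
  default         1;   # anything not listed below is treated as external
  10.20.30.0/24   0;   # example office range -- replace with your own
  # add the VPN client range here with "0" if it should count as internal,
  # or leave it out so VPN users are sent to keepproxy
}

# passed to the API server so "keep_services accessible" can decide
# whether to return keepproxy or the keepstore nodes
proxy_set_header X-External-Client $external_client;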
Tom Schoonjans
@tschoonj
aha
so what we are seeing here is actually ok?
Peter Amstutz
@tetron
does it work?
does arv-get work?
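A minimal round-trip test from the VPN client, assuming the CLI tools are installed and ARVADOS_API_HOST / ARVADOS_API_TOKEN are set (the identifier below is a placeholder for whatever arv-put prints):

# upload a small test file; arv-put prints the new collection's identifier
echo "keep connectivity test" > /tmp/keeptest.txt
arv-put /tmp/keeptest.txt
# fetch it back using the identifier printed above
arv-get <collection_uuid_or_pdh>/keeptest.txt /tmp/keeptest-roundtrip.txt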
Tom Schoonjans
@tschoonj
my colleague on VPN just tested arv-put and that fails
Peter Amstutz
@tetron
well so either all the keepstore servers, or the keepproxy server, need to be reachable by home VPN
so you need to figure that out first
one particular advantage of using keepproxy in this case: if you have keepstore-level replication enabled, it handles replicating the upload at the keepproxy level instead of the client having to send the data twice
Tom Schoonjans
@tschoonj
ok thanks will investigate
Peter Amstutz
@tetron
however, if you aren't using keepstore-level replication (DefaultReplication: 1) and instead rely on replication at a lower level (object store or RAID), then it doesn't matter
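For context, DefaultReplication is set in the cluster configuration; a minimal sketch of the relevant fragment, assuming the standard /etc/arvados/config.yml layout (cluster ID and value are placeholders):

Clusters:
  xxxxx:                    # your five-character cluster ID
    Collections:
      # number of copies keepstore-level replication should maintain;
      # 1 means no keepstore-level replication, so durability comes from
      # the storage backend (object store, RAID, ...) instead
      DefaultReplication: 1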
Tom Schoonjans
@tschoonj
yes, we are using DefaultReplication: 1 in this setup
Peter Amstutz
@tetron
ok then you just need to figure out where the VPN fits in your network topology
Tom Schoonjans
@tschoonj
Ok, we got it fixed. The proxy is now used everywhere except when using arv on the Arvados VM itself
Andrey Kartashov
@portah
@tetron Does Arvados have a preinstalled version in the cloud?
Peter Amstutz
@tetron
@portah to try it out or to do real workloads?
Andrey Kartashov
@portah
@tetron to check api and try with cwl
Peter Amstutz
@tetron
Andrey Kartashov
@portah
Thank you
Cibin S B
@cibinsb
Hi there, I have been trying to deploy Arvados on GKE and came across the following load balancer error from one of the Arvados services. How do I fix this problem?
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl get svc
NAME                         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                       AGE
arvados-api-server           LoadBalancer   10.88.12.90    34.89.54.152   444:31588/TCP                 31m
arvados-keep-proxy           LoadBalancer   10.88.11.130   34.89.54.152   25107:31630/TCP               31m
arvados-keep-store           ClusterIP      None           <none>         25107/TCP                     31m
arvados-keep-web             LoadBalancer   10.88.5.66     34.89.54.152   9002:32663/TCP                31m
arvados-postgres             ClusterIP      10.88.12.232   <none>         5432/TCP                      31m
arvados-slurm-compute        ClusterIP      None           <none>         6818/TCP                      31m
arvados-slurm-controller-0   ClusterIP      10.88.14.128   <none>         6817/TCP                      31m
arvados-workbench            LoadBalancer   10.88.8.200    <pending>      443:30734/TCP,445:32051/TCP   31m
arvados-ws                   LoadBalancer   10.88.5.207    34.89.54.152   9003:30153/TCP                31m
kubernetes                   ClusterIP      10.88.0.1      <none>         443/TCP                       22h
cibin@cibins-beast-13-9380:~/EBI/arvados-k8s/charts/arvados$ kubectl describe service/arvados-workbench
Name:                     arvados-workbench
Namespace:                default
Labels:                   app=arvados
                          app.kubernetes.io/managed-by=Helm
                          chart=arvados-0.1.0
                          heritage=Helm
                          release=arvados
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: arvados
                          meta.helm.sh/release-namespace: default
Selector:                 app=arvados-workbench
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.88.8.200
IPs:                      10.88.8.200
IP:                       34.89.54.152
Port:                     wb2  443/TCP
TargetPort:               443/TCP
NodePort:                 wb2  30734/TCP
Endpoints:                10.84.2.18:443
Port:                     wb  445/TCP
TargetPort:               445/TCP
NodePort:                 wb  32051/TCP
Endpoints:                10.84.2.18:445
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   EnsuringLoadBalancer    2m38s (x11 over 28m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m34s (x11 over 28m)  service-controller  Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (ae0291ffb3043451580fc197edd8a34e(default/arvados-workbench)): googleapi: Error 400: Invalid value for field 'resource.IPAddress': '34.89.54.152'. Specified IP address is in-use and would result in a conflict., invalid
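One detail visible in the kubectl get svc output above is that several LoadBalancer services already report the same external IP (34.89.54.152), and the arvados-workbench service then fails because, per the error message, that address is already in use by another forwarding rule. A way to inspect this, assuming the gcloud CLI is configured for the same project (the address is just the one from the error):

# list the GCE forwarding rules currently holding the contested address
gcloud compute forwarding-rules list --filter="IPAddress=34.89.54.152"
# options from here: give arvados-workbench its own static address via
# spec.loadBalancerIP, or put the HTTP services behind a single Ingress
# instead of separate LoadBalancer services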
Peter Amstutz
@tetron
@cure might know
Tom Schoonjans
@tschoonj

Hi all,

We are testing the Arvados Slurm dispatcher and are running into trouble:

$ sudo journalctl -o cat -fu crunch-dispatch-slurm.service
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:36:08.672769209Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:00.794084728Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 121.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:01.919705722Z"}
Started Arvados Crunch Dispatcher for SLURM.
{"level":"fatal","msg":"error getting my token UUID: Get \"https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current\": dial tcp 10.93.111.119:443: connect: connection refused","time":"2021-10-07T15:37:54.030722405Z"}
crunch-dispatch-slurm.service: Main process exited, code=exited, status=1/FAILURE
crunch-dispatch-slurm.service: Failed with result 'exit-code'.
crunch-dispatch-slurm.service: Scheduled restart job, restart counter is at 122.
Stopped Arvados Crunch Dispatcher for SLURM.
Starting Arvados Crunch Dispatcher for SLURM...
{"level":"info","msg":"crunch-dispatch-slurm 2.2.2 started","time":"2021-10-07T15:37:55.167350562Z"}
Started Arvados Crunch Dispatcher for SLURM.

This is bizarre, as we are able to use arv api_client_authorization current without problems from the VM running the dispatcher, when using the root API token. Any thoughts? Thanks!
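A quick way to reproduce, from the dispatcher VM, exactly what the dispatcher does at startup is to hit the same endpoint shown in the log; a sketch assuming the token is exported as ARVADOS_API_TOKEN (the URL is copied from the fatal message above):

# the dispatcher looks up its own token via this endpoint on startup
curl -v -H "Authorization: Bearer $ARVADOS_API_TOKEN" \
  https://88d80-crunch-dispatcher-slurm-controller-dispatcher.dev.core.genomicsplc.com/arvados/v1/api_client_authorizations/current
# "connection refused" here points at the controller URL or routing in the
# dispatcher's configuration rather than at the token itself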

Tom Schoonjans
@tschoonj
Please ignore, our config was wrong
Ward Vandewege
@cure
ok!
Callum-Joyce
@Callum-Joyce

Hello, I am looking at using SLURM dispatch with @tschoonj.

We have tried running a job with the example command provided here: https://doc.arvados.org/v2.2/install/crunch2-slurm/install-test.html but get hit with this error:

Error: //railsapi.internal/arvados/v1/container_requests: 422 Unprocessable Entity: #<ArvadosModel::UnresolvableContainerError: docker image "arvados/jobs:latest" not found> (req-ecdzw2wz1qq5r24xfuus)

The documentation here: https://doc.arvados.org/v2.2/api/methods/container_requests.html suggests that the "container_image" property should be set to the PDH of a collection containing the image, but in the example script mentioned above it is set to "arvados/jobs:latest", which is obviously not a PDH.

Could you advise on exactly what the value should be here? If putting the image into a collection is necessary, will we need to do this for every image we need to use in the future? Thanks in advance.

Ward Vandewege
@cure
@Callum-Joyce you need to load the images in Keep, see https://doc.arvados.org/v2.2/user/topics/arv-docker.html
the example script is correct; listing something like "arvados/jobs:latest" is what you want to do
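For reference, a sketch of loading the image into Keep as described at that link, assuming arv-keepdocker and a local Docker daemon are available on a node with Arvados credentials:

# pull the image locally, then upload it to Keep and register it so that
# "arvados/jobs:latest" in the container request can be resolved
docker pull arvados/jobs:latest
arv-keepdocker arvados/jobs latest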
Callum-Joyce
@Callum-Joyce
@cure Thanks, I can confirm that jobs submit correctly now
Ward Vandewege
@cure
ok! Hopefully it also runs successfully :)
Tom Schoonjans
@tschoonj

Hi all,

We managed to get the test CWL from the docs working, in the sense that it succeeds and that we can fetch the logs from Keep afterwards. However, Slurm is not happy. The stdout file that was written on the worker node contains:

2021/10/08 09:25:54 crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712529567Z crunch-run 2.2.2 (go1.16.3) started
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712579936Z Executing container '88d80-dz642-22qdy08ukoae458'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.712620617Z Executing on host 'slurm-worker-blue-dispatcher-2'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.877163363Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.908683532Z Using Docker image id 'sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:25:54.913563419Z Loading Docker image from keep
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.816683221Z Docker response: {"stream":"Loaded image ID: sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002\n"}
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:06.830237887Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:11.500969976Z Creating Docker container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:19.974405474Z Attaching container streams
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:20.337499740Z Starting Docker container id '1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0'
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.005081598Z Waiting for container to finish
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.704143751Z Container exited with code: 0
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:21.799195935Z Complete
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.017988309Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.88d80-dz642-22qdy08ukoae458.388893745/keep083816412]
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.577220567Z crunch-run finished
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 ON slurm-worker-blue-dispatcher-2 CANCELLED AT 2021-10-08T09:26:22 ***
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681660707Z caught signal: terminated
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.681745921Z removing container
88d80-dz642-22qdy08ukoae458 2021-10-08T09:26:22.683569836Z error removing container: Error: No such container: 1b880c597d712a871580345588d704bdd2b6eaaeb46dc2dfe56b9f1bf3197bb0
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: _handle_signal_container: failed signal 15 pid 1471 job 4.4294967294 No such process
slurmstepd-slurm-worker-blue-dispatcher-2: error: *** JOB 4 STEPD TERMINATED ON slurm-worker-blue-dispatcher-2 AT 2021-10-08T09:27:23 DUE TO JOB NOT ENDING WITH SIGNALS ***

We were surprised to see that the crunch jobs run as user root: is there a way to change this? We created a crunch user, a member of the docker group on all Slurm nodes and the controller, for this purpose, as suggested in the installation guide. Is there a configuration setting for this?
Also, the logs on the Slurm nodes are being created directly in /: is there a way to change this? Thanks in advance!!
Tom Schoonjans
@tschoonj
Here are the crunch-dispatch-slurm logs corresponding to the previously mentioned job:
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Submitting container 88d80-dz642-22qdy08ukoae458 to slurm
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 running sbatch ["--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 "/usr/bin/sbatch" ["sbatch" "--job-name=88d80-dz642-22qdy08ukoae458" "--nice=10000" "--no-requeue" "--mem=520" "--cpus-per-task=1" "--tmp=640"]: "Submitted batch job 4"
Oct 08 09:25:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:25:52 Start monitoring container 88d80-dz642-22qdy08ukoae458 in state "Locked"
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:26:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:26:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:26:32 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
...repeats several more times...

Oct 08 09:27:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:22 container 88d80-dz642-22qdy08ukoae458 is still in squeue after scancel
Oct 08 09:27:23 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:23 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:32 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:32 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:42 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:42 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:27:52 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:27:52 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:02 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:02 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:12 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:12 container 88d80-dz642-22qdy08ukoae458 is done: cancel slurm job
Oct 08 09:28:22 crunch-dispatcher-slurm-controller-blue-dispatcher crunch-dispatch-slurm[17079]: 2021/10/08 09:28:22 Done monitoring container 88d80-dz642-22qdy08ukoae458
Ward Vandewege
@cure
@tschoonj the slurm user would be the user that the crunch-dispatch-slurm service runs as (since it executes sbatch etc directly)
wrt slurm being unhappy, does a basic test like srun -N 4 hostname work?
and you probably already know that slurm is extremely sensitive to a) time sync between nodes and b) dns resolution for the hostnames of the controller and workers
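On the run-as-root question above: if the dispatcher should submit jobs as the crunch user instead, one common approach, assuming the stock systemd unit from the packages, is a drop-in override (the unit name matches the journalctl output earlier):

# /etc/systemd/system/crunch-dispatch-slurm.service.d/override.conf
# (for example created with: sudo systemctl edit crunch-dispatch-slurm.service)
[Service]
User=crunch
Group=crunch

# then reload and restart; the crunch user needs to be able to run
# sbatch/squeue/scancel and, on the compute nodes, be in the docker group
sudo systemctl daemon-reload
sudo systemctl restart crunch-dispatch-slurm.service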
Ward Vandewege
@cure
it looks from the logs like things are more or less working
perhaps there's a firewalling/communication problem between api server and/or slurmctld and the compute nodes
are there actual errors in the container logs?