    Michael R. Crusoe
    @mr-c
    hey Toil peeps, here's a question on Biostars for y'all: https://www.biostars.org/p/448085/
    Hannes Schmidt
    @hannes-ucsc

    [Adam Novak, UCSC GI] I don't believe that Toil makes any attempt to preserve file permissions when copying them around, even the execute bit. A workaround might be to tar up the script and then untar it before executing it, or just cram a chmod +x in somewhere in the CWL workflow somehow.

    It would be good if Toil could make sure to set the execute bit when downloading stuff that was executable when it was stored. You can open a bug for that and we might be able to do it.
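    A minimal sketch of that workaround in shell form, assuming a hypothetical script name run_me.sh; packing the script in a tar archive keeps its mode bits through staging, and the chmod is a belt-and-braces fallback:

    # Before handing the script to the workflow: pack it so its mode bits travel with it
    tar -czf script.tar.gz run_me.sh

    # Inside the tool's command: unpack and re-set the execute bit before running
    tar -xzf script.tar.gz
    chmod +x run_me.sh
    ./run_me.sh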

    Asha Rostamianfar
    @arostamianfar
    Given Docker's recent announcement [1] about new limits for anonymous users (100 pulls per 6hr period), should we consider using a different host for prom/node-exporter (the only dockerhub image we currently pull during initialization) or provide more built-in ways for folks to authenticate to dockerhub with pro/team accounts when running large workflows?
    [1] https://www.docker.com/blog/scaling-docker-to-serve-millions-more-developers-network-egress/
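    A hedged sketch of the authenticated-pull option: logging in on a node before the workflow starts makes pulls count against the account's (higher) quota rather than the anonymous limit; the environment variable names are placeholders:

    # Log in so subsequent pulls are counted against the account, not the anonymous limit
    echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
    docker pull prom/node-exporter:v0.15.2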
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] Looks like the devs there are ahead of us. The README for the Prometheus node exporter now points to Quay: https://github.com/prometheus/node_exporter#using-docker
    [Adam Novak, UCSC GI] Quay is of course also eventually going to realize that hosting everyone's everything for free forever is a money-losing idea, but they seem to have more money.
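    Per that README, the pull would just point at Quay instead of Docker Hub (image path taken from the README; the tag here is illustrative):

    docker pull quay.io/prometheus/node-exporter:latest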
    Asha Rostamianfar
    @arostamianfar
    awesome! I filed DataBiosphere/toil#3172 to update it.
    T. Thyer
    @tthyer
    Hi all, I am trying out running a workflow with toil-cwl-runner on EKS with the kubernetes batch system option. I submit a Job (the leader job) to my k8s cluster that uses toil-cwl-runner in its container command. It is failing early in the workflow on a step that has a DockerRequirement, with "cwltool.errors.UnsupportedRequirement: Docker is required to run this tool" and "Is the docker daemon running?" What it looks like is that the leader creates another k8s Job which starts a Pod using the toil image, and then when that runs it tries to start another docker container inside itself using the image specified by the tool, so basically, it attempts dind, except it is not set up to do that. A gist of my leader job is here: https://gist.github.com/tthyer/3d867adce99bfc5f910b9f5778cf822b. I guess my question is, is the k8s batch system not quite ready to run CWL workflows? Is there an alternative way to set this up that I missed?
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] The Kubernetes batch system doesn't know to start a Docker daemon, or to mark the worker pods privileged to let them start their own Docker daemons, and the CWL runner I believe assumes that Docker is just available and doesn't do anything to set it up.
    [Adam Novak, UCSC GI] At UCSC we've been running all our Kubernetes Toil workflows using Singularity (with user-mode namespaces) instead of Docker to manage containers within jobs. But the code we use to do that hasn't been taken up into Toil, and when Toil runs CWL workflows it uses cwlrunner to do all its Docker stuff anyway.
    [Adam Novak, UCSC GI] The Kubernetes batch system also doesn't know how to read the secrets mounted in the leader pod and also mount them in the workers, so if your Synapse secret is supposed to be available to your jobs, that's another blocking issue.
    [Adam Novak, UCSC GI] In general I don't think using Docker from inside Kubernetes is going to work well; setting it up is a huge pain and requires a privileged pod. I don't think even cwltool running within a pod can handle a DockerRequirement unless you write a bunch of logic to specially prepare the pod and bring up the Docker daemon inside it.
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] You could give your Kubernetes pods access to the host Docker, but that's probably even worse for security than marking them privileged.
    [Adam Novak, UCSC GI] If cwlrunner could use singularity with user-mode namespaces to handle DockerRequirement when the Docker daemon isn't available and/or can't start, that would solve the problem.
    [Adam Novak, UCSC GI] If they can't do that, we can try to put together a setting for the Kubernetes batch system that makes it launch all its worker pods as privileged and do the magic required to bring up Docker inside them. We do something like this already for our CI system I believe, so it can be done.
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] I've opened up DataBiosphere/toil#3223 to track this.
    T. Thyer
    @tthyer
    Cool, thank you -- that was a very helpful explanation. Regarding the secret, yes, I had to modify the kubernetes.py code and build my own toil image to make it work. It doesn't handle any secret right now, just that single one that almost all of our workflows will depend on. It's a hack.
    I think I'll follow up on the suggestion of using singularity with cwlrunner; I'd prefer that to running privileged
    Michael R. Crusoe
    @mr-c
    @tthyer I've added a comment about the existing --singularity feature of toil-cwl-runner to the issue that Adam Novak made. However I've never used the Toil k8s batch system, let alone with that option. Please let us know if that works!
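    For reference, a hedged sketch of running the workflow that way; apart from --singularity (mentioned above), the flag spellings and job store name are assumptions to check against toil-cwl-runner --help:

    # Use the Kubernetes batch system and satisfy DockerRequirement via Singularity
    # instead of a Docker daemon (job store and file names are placeholders)
    toil-cwl-runner \
        --batchSystem kubernetes \
        --singularity \
        --jobStore aws:us-east-1:my-jobstore \
        workflow.cwl inputs.yml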
    Michael Milton
    @TMiguelT
    Are there release notes anywhere for what changed in the new versions?
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] I don't think we've been maintaining changelogs, unfortunately.
    Michael Milton
    @TMiguelT
    Oh, that's a shame
    Lon Blauvelt
    @DailyDreaming
    @TMiguelT You're right. I'll try and update the changelogs in the release section.
    Michael Milton
    @TMiguelT
    I gather that Python 3 is the big new thing, but there have been a few other releases that I'd be interested to understand
    Yilong Li
    @yl3
    Hello
    I keep getting the following message when I try to run toil launch-cluster
    [2020-10-18T21:48:20+0000] [MainThread] [I] [toil.provisioners.node] Attempting to establish SSH connection...
    [2020-10-18T21:48:20+0000] [MainThread] [I] [toil.provisioners.node] Executing the command "ps" on the appliance returned a non-zero exit code 255 with stdout b'' and stderr b"Warning: Permanently added '54.205.90.97' (ECDSA) to the list of known hosts.\r\ncore@54.205.90.97: Permission denied (publickey,password,keyboard-interactive).\r\n"
    [2020-10-18T21:48:20+0000] [MainThread] [I] [toil.provisioners.node] Connection rejected, waiting for public SSH key to be propagated. Trying again in 10s.

    After about a minute, the process fails with the following error.

    RuntimeError: Key propagation failed on machine with ip 54.205.90.97

    However, the instance itself seems to be configured correctly? I am able to ssh into the instance manually and the docker processes are running:

    $ docker ps
    CONTAINER ID        IMAGE                                                                        COMMAND                  CREATED             STATUS              PORTS                    NAMES
    4b746d4ee9ba        quay.io/ucsc_cgl/toil:4.2.0-3aa1da130141039cb357efe36d7df9b9f6ae9b5b-py3.6   "mesos-master --log_…"   14 minutes ago      Up 14 minutes                                toil_leader
    f4b09dc519c1        prom/node-exporter:v0.15.2                                                   "/bin/node_exporter …"   15 minutes ago      Up 14 minutes       0.0.0.0:9100->9100/tcp   node-exporter
    I wonder if anyone knows whether I'm doing something wrong?
    I am getting this same problem whether I try to launch the cluster on the default AWS VPC or in my own custom VPC
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] Does it just loop on that message about not being able to log in to the node?
    [Adam Novak, UCSC GI] One way that can happen is if your SSH key has a password on it. Toil needs your SSH identity file to be unencrypted so it can use it noninteractively to log in and set up the node.
    [Adam Novak, UCSC GI] Another way you can get this, I think, is if you don't give the right public key name. You need to make sure to upload your SSH public key to AWS beforehand, give it a name, and give that name to Toil so it can tell AWS to load the right public key into the node when it sets it up.
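    A hedged sketch of that setup with the AWS provisioner; the key, cluster, and zone names are placeholders, and the flag spellings should be checked against toil launch-cluster --help:

    # Upload your (unencrypted) public key to AWS under a name Toil can reference
    aws ec2 import-key-pair \
        --key-name my-toil-key \
        --public-key-material fileb://~/.ssh/my-toil-key.pub

    # Point launch-cluster at that key pair name
    toil launch-cluster my-cluster \
        --provisioner aws \
        --zone us-east-1a \
        --keyPairName my-toil-key \
        --leaderNodeType t2.medium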
    Yilong Li
    @yl3
    Hi Adam, thanks for looking into this. After trying different things, I realised that I hadn't added my private key to ssh-agent, since I had named it differently to id_rsa and thus ssh-add didn't see it by default.
    So you definitely had the right hunch
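    For anyone hitting the same thing, the fix amounts to adding the non-default key explicitly (the path is a placeholder):

    # ssh-agent only auto-loads the default key names (id_rsa etc.); add a custom key by path
    ssh-add ~/.ssh/my-toil-key
    ssh-add -l   # confirm the key is loaded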
    I am still playing around with Toil. If I want to build a multi-user AWS cluster for submitting batch jobs on AWS, what is the best practice for allowing multiple users access to the cluster? Creating new users on the toil cluster manually, or having each user launch their own cluster using launch-cluster?
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] A single Toil cluster isn't really designed to run multiple workflows at the same time. I think if it isn't autoscaling (you just launch-cluster with a certain fixed number of nodes), there's no good reason for it not to work. You'd just have to add everyone's SSH keys to the leader node (outside the appliance, at the CoreOS level: ssh core@whatever.ip), and then I think toil ssh-cluster should work for everyone to log into the appliance.
    [Adam Novak, UCSC GI] I don't think Flatcar wants to have multiple users, and I don't think the appliance container wants to have multiple users either. Everyone would be the one user account.
    [Adam Novak, UCSC GI] Here at UCSC for our multitenant stuff we use a Kubernetes cluster and Toil's Kubernetes support. I'm working on eventually getting launch-cluster to build autoscaling Kubernetes clusters instead of autoscaling Mesos clusters, and I suppose you could take one of the auto-built clusters that setup will make and decide to keep it around forever and manually admin it and make users and namespaces for everyone.
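    A minimal sketch of adding a colleague's key at the CoreOS/Flatcar level, as described above; the key file and leader IP are placeholders:

    # Append the colleague's public key to the core user's authorized_keys on the leader
    cat teammate_key.pub | ssh core@<leader-ip> 'cat >> ~/.ssh/authorized_keys'
    # They can then reach the appliance with toil ssh-cluster, or plain ssh as core@<leader-ip>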
    Peter Amstutz
    @tetron
    @yl3 if you need a multi-user AWS cluster to run CWL workflows you might want to look at https://arvados.org
    Yilong Li
    @yl3
    Adam, really appreciate your detailed input. Peter, thank you and I'll have a look.
    Yilong Li
    @yl3
    When I run a job with aws and mesos, where are the log files (STDOUT and STDERR) stored?
    Yilong Li
    @yl3
    Also, is it possible to log in to an AWS worker node for debugging purposes?
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] You might be able to log into the workers with the SSH key you set for the cluster (or maybe with an ssh key that gets made on the leader?) as core@<ip>, since the base image is Flatcar OS (forked from CoreOS).
    [Adam Novak, UCSC GI] Then you need to shell into the Docker container where Toil actually runs.
    [Adam Novak, UCSC GI] If you want the actual stdout/stderr from the Toil worker processes under Mesos, I think Mesos has them.
    [Adam Novak, UCSC GI] But you need to get your browser into a position where it can talk directly to the worker nodes to get into their sandboxes through the web UI.
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] I remember toil ssh-cluster maybe being able to help with this: toil ssh-cluster -D8080 core@<leader>, then setting your browser to use the SOCKS proxy at localhost:8080, and then browsing to http://<leader>:5050 ought to work.
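    Pulling those steps together as a hedged sequence; the container ID, IPs, and invocation details are placeholders, and 5050 is the standard Mesos web UI port:

    # 1. Shell into a worker as the core user, using the cluster's SSH key
    ssh core@<worker-ip>

    # 2. Find and enter the appliance container where Toil actually runs
    docker ps
    docker exec -it <toil-container-id> /bin/bash

    # 3. Or, from your own machine, open a SOCKS proxy through the leader and browse the Mesos UI
    toil ssh-cluster -D8080 core@<leader>   # argument form as written above; check toil ssh-cluster --help
    # then set your browser's SOCKS proxy to localhost:8080 and browse to http://<leader>:5050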
    Yilong Li
    @yl3
    core@<ip> worked, thanks!
    Yilong Li
    @yl3
    For the STDOUT, I guess I should just redirect it to a file in CWL and capture the output file
    Hannes Schmidt
    @hannes-ucsc
    [Adam Novak, UCSC GI] Yeah, if you actually want the command output as an output of the workflow, I would recommend grabbing it as part of the workflow. Toil doesn't really archive all the output in a coherent way, I don't think.