Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] I think 120 is what you get when you treat -9 in a byte as unsigned, so that suggests a signal 9 (SIGKILL) killing the process. This will happen when an angry admin, an HPC batch system, or the out of memory killer kills your Toil job.
Rupert Nash
@rupertnash
Hello - I'm trying out toil-cwl-runner for running workflows containing "real HPC" steps (by HPC I mean MPI across O(100) nodes). First question: are there any plans to update the version of cwltool used by toil? (I also need some of the improvements in SoftwareRequirement handling)
Cibin S B
@cibinsb

Hi all,
I'm trying to run Toil on an internal Kubernetes cluster. This is the command I used:

toil-cwl-runner --logDebug --enable-dev --batchSystem kubernetes --jobStore aws:us-east-1:toil-test --stats --singularity --defaultCores 1 md_launch.cwl md_list_input_descriptions.yml

but I'm getting a permission error:

<Response><Errors><Error><Code>AuthorizationFailure</Code><Message>User (arn:aws:iam::07445xxxxxx:user/cibin) does not have permission to perform (sdb:Select) on resource (arn:aws:sdb:us-east-1:074455289529:domain/toil-registry). Contact account owner.</Message><BoxUsage>0.0000137200</BoxUsage></Error></Errors><RequestID>0e71b5c9-150a-b570-a5cf-31f1f751abca</RequestID></Response>

Toil version is 5.4.0

crusoe
@mr-c:matrix.org
[m]
Also, does it make sense to put the Jobstore in AWS? Seems expensive to not have it local
crusoe
@mr-c:matrix.org
[m]
So until a local/k8s Jobstore option is added, I wouldn't recommend using toil-cwl-runner with Kubernetes in production outside of AWS. However, I'm sure assistance with adding other jobstore backends would be very welcome!
If you are quite keen on k8s, @cibinsb, Arvados has preliminary support and manages its own data (but please note the current limitations): https://doc.arvados.org/v2.2/install/arvados-on-kubernetes.html
crusoe
@mr-c:matrix.org
[m]
Not CWL v1.2, they are using an old version of the reference runner under the hood (1.0.20191022103248, so from 2019-10-22)
Douglas Lowe
@douglowe
ahh - that wouldn't be suitable for @cibinsb's workflow then - he's testing out the same workflow as I am on HPC - which requires v1.2
I got over keen on using conditionals when writing it
crusoe
@mr-c:matrix.org
[m]
@douglowe @cibinsb maybe let them know you'd like to see CWL v1.2 over in https://gitter.im/reanahub/reana ?
Douglas Lowe
@douglowe
@mr-c:matrix.org - that is a good plan - thanks for opening issues with them for this :)
Lon Blauvelt
@DailyDreaming
[Lon Blauvelt, UCSC GI] Hi cibinsb. That looks like a SimpleDB permissions issue. SimpleDB needs its own permissions set in AWS, so that might be the problem. Can you try "aws sdb list-domains" and see if that works?
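
The same check can be sketched from Python. This is only an illustration: `can_list_sdb_domains` is a hypothetical helper, not part of Toil, and it assumes it is handed a boto3-style SimpleDB client that exposes `list_domains()`.

```python
def can_list_sdb_domains(sdb_client):
    """Return (ok, detail) for a SimpleDB ListDomains call.

    Hypothetical helper, not part of Toil. `sdb_client` is assumed to be a
    boto3-style SimpleDB client with a list_domains() method.
    """
    try:
        response = sdb_client.list_domains()
    except Exception as error:
        # A missing permission surfaces here as an authorization error.
        return False, str(error)
    # The response carries the visible domains under "DomainNames",
    # which is absent when there are none.
    return True, response.get("DomainNames", [])
```

With boto3 installed and credentials in ~/.aws/credentials, something like `can_list_sdb_domains(boto3.client("sdb", region_name="us-east-1"))` should show whether the user can reach SimpleDB at all. Note that the error above was for sdb:Select, so a successful list does not prove every needed permission is granted.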
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @cibinsb, if you are setting up your own AWS roles/credentials (instead of using toil launch-cluster), you need to make sure you are granting access to SimpleDB in addition to access to S3, for the AWS job store to work.

As described in https://toil.readthedocs.io/en/latest/running/cloud/kubernetes.html#aws-job-store-for-kubernetes you need to grab some AWS credentials, put them in a Kubernetes secret, and use TOIL_AWS_SECRET_NAME when you run the workflow, to grant the workers access. You also need to make sure the leader has access, either by running it in a pod with the secret mounted in to ~/.aws, or by setting up ~/.aws on whatever non-pod machine you are running the leader on.

[Adam Novak, UCSC GI] Lon is working on a better AWS job store that will work without SimpleDB, and so ought to be usable with any S3 clone you could deploy on your Kubernetes cluster, but until that's done Kubernetes only works with genuine AWS S3 and SimpleDB.
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] @cibinsb, are you sure you have credentials in ~/.aws/credentials on cibins-beast-13-9380 that correspond to an IAM user with access to S3 and SimpleDB?
[Adam Novak, UCSC GI] Also are you sure you own the bucket named toil-test--files? That sounds like a very generic name, and bucket names must be unique across all of AWS. It is quite possible that someone else is already using the jobstore named toil-test and that you will have to pick a different, unique name.
Cibin S B
@cibinsb
I confirm that AWS user credentials are saved in ~/.aws/credentials on cibins-beast-13-9380. The S3 bucket name I gave was toil-test, and I'm not sure why Toil is trying to access `toil-test--files`.
Lon Blauvelt
@DailyDreaming
@cibinsb Toil creates and uses a bucket named after the jobstore plus --files and disallows -- in jobstore names because of this.
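
As a rough sketch of that naming rule (the helper name `files_bucket_name` is made up for illustration and is not Toil's actual code), the derivation looks something like this:

```python
def files_bucket_name(jobstore_name: str) -> str:
    """Derive the S3 bucket name for a job store name.

    Hypothetical helper illustrating the rule described above; Toil's
    real implementation may differ.
    """
    # "--" is reserved as the separator before the "files" suffix,
    # so it is rejected inside the job store name itself.
    if "--" in jobstore_name:
        raise ValueError(f"job store name may not contain '--': {jobstore_name!r}")
    return f"{jobstore_name}--files"
```

So a job store named toil-test maps to the bucket toil-test--files, which is why that bucket shows up even though only toil-test was given on the command line.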
Rohith B S
@rohith-bs
Traceback (most recent call last):
  File "/home/test-user/toil-scripts/script.py", line 282, in <module>
    Job.Runner.startToil(main_job, options)
  File "/usr/local/lib/python3.7/site-packages/toil/job.py", line 1743, in startToil
    return toil.restart()
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 874, in restart
    return self._runMainLoop(rootJobDescription)
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 229, in run
    self.innerLoop()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 614, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 573, in _gatherUpdatedJobs
    self.processFinishedJob(jobID, exitStatus, wallTime=wallTime, exitReason=exitReason)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 959, in processFinishedJob
    replacementJob = self.jobStore.load(jobStoreID)
  File "/usr/local/lib/python3.7/site-packages/toil/jobStores/fileJobStore.py", line 209, in load
    with open(jobFile, 'rb') as fileHandle:
FileNotFoundError: [Errno 2] No such file or directory: 'jobStore/jobs/kind-FunctionWrappingJob/instance-pb6jcg2c/job'
Kindly suggest how to resolve this issue. The pipeline runs successfully and does whatever it was intended to do, but Toil ends like this.
crusoe
@mr-c:matrix.org
[m]
Welcome @rohith-bs Which pipeline are you running?
Rohith B S
@rohith-bs
@mr-c:matrix.org Thank you. It is a custom developed pipeline.
crusoe
@mr-c:matrix.org
[m]
Okay, I can only help with CWL pipelines
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @cibinsb, Toil automatically appends --files to the job store name to derive the S3 bucket name, because a job store is more than just an S3 bucket; it currently includes some SimpleDB stuff, and only files go in S3.

I would try a different name than toil-test, maybe something with cibinsb in it, and see if that works.

Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @rohith-bs Are you running your job store on a shared network filesystem that might be lagging behind messages sent through the job scheduler/not globally consistent in real time? The job appears to exist, but then is gone by the time Toil goes to load it: https://github.com/DataBiosphere/toil/blob/77a39f507b729525926c5efc9e07377483cdd005/src/toil/leader.py#L955-L959

We may be best off extending the special case handling for stale reads we have for the AWS job store to also cover the file job store, so that when the job is slow to disappear we don't crash.
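
A minimal sketch of what tolerating that race could look like, assuming the job file is read from a path on the shared filesystem. `load_job_or_none` is a hypothetical name and this is not the actual Toil fix:

```python
from pathlib import Path
from typing import Optional


def load_job_or_none(job_file: Path) -> Optional[bytes]:
    """Read a serialized job, or return None if it has already been deleted.

    Hypothetical sketch: on a filesystem that is not globally consistent,
    the job can vanish between the existence check and the read, so a
    missing file is treated as "job already gone" instead of an error.
    """
    try:
        with open(job_file, "rb") as file_handle:
            return file_handle.read()
    except FileNotFoundError:
        return None
```

The caller would then skip further processing when it gets None back, rather than letting the FileNotFoundError propagate up through the leader loop as in the traceback above.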

Rohith B S
@rohith-bs
@mr-c:matrix.org Thank you for your time.
@DailyDreaming Yes. I was running in a network job store. Thank you for the support.
Adam Novak
@adamnovak
Hey Lon, are you in here?
How did you get the SameRoom bot to be you on this end?
And did that fix the usage limits we are running into in the vg channel?
Or did you pay them or something?
Hmmm
Looks like I can send unlimited messages.
Actually, after a few messages on this end they get dropped on the Slack end as well, we just don't get ads for SameRoom spammed to the channel to notify us.
crusoe
@mr-c:matrix.org
[m]

@cibinsb @adamnovak I started working on this, though I'm very unfamiliar with all things boto3/AWS :-)

DataBiosphere/toil#3710

@adamnovak: Any thoughts on DataBiosphere/toil#3694 ?
Adam Novak
@adamnovak
I just reviewed #3694. It looks OK to me, but it changes over to a git branch dependency on cwltool, so we couldn't release Toil if we merged it as is.
crusoe
@mr-c:matrix.org
[m]
Right, just wanted to confirm the general direction before we make the changes on the cwltool side
Will update with a released version of cwltool before merging
crusoe
@mr-c:matrix.org
[m]
Huzzah!
@cibinsb: Hopefully we will fix the issue for good, so no documentation update will be needed
Ian
@ionox0
what is the difference between --workDir and the TMPDIR environment variable?
crusoe
@mr-c:matrix.org
[m]
the latter is used by the former as a fallback
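
Roughly, and only as a sketch of the fallback idea (`resolve_work_dir` is a hypothetical function, and Toil's real lookup may involve more steps), that behaves like:

```python
import tempfile
from typing import Optional


def resolve_work_dir(work_dir: Optional[str] = None) -> str:
    """Pick the working directory: an explicit --workDir wins.

    Hypothetical sketch. When no work_dir is given, fall back to Python's
    default temp directory, which consults the TMPDIR environment
    variable first.
    """
    if work_dir is not None:
        return work_dir
    return tempfile.gettempdir()
```

So `resolve_work_dir("/scratch/toil")` returns the explicit path, while `resolve_work_dir()` ends up wherever TMPDIR points, if it is set to a usable directory.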
Ian
@ionox0
:thumbsup:
Kevin Chau
@kkchau
Hello, new to toil here. I was curious as to the reason the default AMI was CoreOS (and now Flatcar) instead of something like Ubuntu LTS?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] CoreOS and Flatcar update themselves, and are designed to just sit there and run containers, I think was the logic. They also have a nice easy way to fetch the current best AMI to use. For Ubuntu, the system kind of expects an administrator to exist and periodically come by and sudo apt upgrade.
Kevin Chau
@kkchau
Makes perfect sense, thank you