FileNotFoundError: [Errno 2] No such file or directory: '_:file:///juno/work/access/testing/users/johnsoni/access_qc/work5/f26b1779cb2f5ea39419d80a2066faf9/e81b/360a/tmpb4cbwihh/out/simplex_bam_pool_a_dir'
toilfs:16305:0:files/for-job/kind-CWLJob/instance-7ousmklr/file-184521b84be2407b8ea6ef01398c6f81/histogram.pdf
[Adam Novak, UCSC GI] @ionox0 This is part of the not-actually-correct logic that Toil has for preparing directories for CWL workflows. The underscore prefix is something that cwltool uses to indicate a directory that is to be created and presented to a CWL tool. Toil tries to generate this to control cwltool, but in practice what we have in 5.4 only works when running on a single machine or otherwise using a shared filesystem between nodes.
I've been redoing all that logic in DataBiosphere/toil#3628 so that Toil can be responsible for setting up the directory structures that CWL tools expect to see, whether there's a shared filesystem or not, but I still don't have it fully working yet. When it's done, it should be much harder to break.
Hi all,
I'm trying to run Toil on the internal Kubernetes cluster; the following is the command I used:
toil-cwl-runner --logDebug --enable-dev --batchSystem kubernetes --jobStore aws:us-east-1:toil-test --stats --singularity --defaultCores 1 md_launch.cwl md_list_input_descriptions.yml
but I'm getting a permission error
<Response><Errors><Error><Code>AuthorizationFailure</Code><Message>User (arn:aws:iam::07445xxxxxx:user/cibin) does not have permission to perform (sdb:Select) on resource (arn:aws:sdb:us-east-1:074455289529:domain/toil-registry). Contact account owner.</Message><BoxUsage>0.0000137200</BoxUsage></Error></Errors><RequestID>0e71b5c9-150a-b570-a5cf-31f1f751abca</RequestID></Response>
Toil version is 5.4.0
[Adam Novak, UCSC GI] @cibinsb, if you are setting up your own AWS roles/credentials (instead of using toil launch-cluster), you need to make sure you are granting access to SimpleDB in addition to access to S3, for the AWS job store to work.
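For illustration only, here is a minimal sketch of granting that SimpleDB access with the AWS CLI. The policy name is made up, and the blanket sdb:* permission on all resources is just for the sketch; in practice you would scope it to your account's Toil domains alongside the S3 permissions you already have:

# Attach an inline policy allowing the SimpleDB calls the AWS job store makes (illustrative names)
aws iam put-user-policy --user-name cibin --policy-name toil-simpledb-access \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["sdb:*"],"Resource":"*"}]}'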
As described in https://toil.readthedocs.io/en/latest/running/cloud/kubernetes.html#aws-job-store-for-kubernetes, you need to grab some AWS credentials, put them in a Kubernetes secret, and set TOIL_AWS_SECRET_NAME when you run the workflow, to grant the workers access. You also need to make sure the leader has access, either by running it in a pod with the secret mounted into ~/.aws, or by setting up ~/.aws on whatever non-pod machine you are running the leader on.
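A rough sketch of that setup follows. The secret name toil-aws-credentials is arbitrary, and the assumption that the secret holds your ~/.aws/credentials file under the key "credentials" follows the linked docs; double-check the exact key name there:

# Package the local AWS credentials file as a Kubernetes secret for the workers
kubectl create secret generic toil-aws-credentials --from-file=credentials=$HOME/.aws/credentials
# Point Toil at the secret, then run the workflow from a leader that also has ~/.aws set up
export TOIL_AWS_SECRET_NAME=toil-aws-credentials
toil-cwl-runner --batchSystem kubernetes --jobStore aws:us-east-1:<job-store-name> md_launch.cwl md_list_input_descriptions.yml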
toil-test--files? That sounds like a very generic name, and bucket names must be unique across all of AWS. It is quite possible that someone else is already using the jobstore named toil-test and that you will have to pick a different, unique name.
Traceback (most recent call last):
File "/home/test-user/toil-scripts/script.py", line 282, in <module>
Job.Runner.startToil(main_job, options)
File "/usr/local/lib/python3.7/site-packages/toil/job.py", line 1743, in startToil
return toil.restart()
File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 874, in restart
return self._runMainLoop(rootJobDescription)
File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 1132, in _runMainLoop
jobCache=self._jobCache).run()
File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 229, in run
self.innerLoop()
File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 614, in innerLoop
self._gatherUpdatedJobs(updatedJobTuple)
File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 573, in _gatherUpdatedJobs
self.processFinishedJob(jobID, exitStatus, wallTime=wallTime, exitReason=exitReason)
File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 959, in processFinishedJob
replacementJob = self.jobStore.load(jobStoreID)
File "/usr/local/lib/python3.7/site-packages/toil/jobStores/fileJobStore.py", line 209, in load
with open(jobFile, 'rb') as fileHandle:
FileNotFoundError: [Errno 2] No such file or directory: 'jobStore/jobs/kind-FunctionWrappingJob/instance-pb6jcg2c/job'
[Adam Novak, UCSC GI] @cibinsb, Toil automatically appends --files to the job store name to derive the S3 bucket name, because a job store is more than just an S3 bucket; it currently includes some SimpleDB stuff, and only files go in S3.
I would try a different name than toil-test, maybe something with cibinsb in it, and see if that works.
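For example (the job store name below is just an illustration of a more unique choice), the original command re-run with a renamed job store:

toil-cwl-runner --logDebug --enable-dev --batchSystem kubernetes --jobStore aws:us-east-1:cibinsb-md-test --stats --singularity --defaultCores 1 md_launch.cwl md_list_input_descriptions.yml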
[Adam Novak, UCSC GI] @rohith-bs Are you running your job store on a shared network filesystem that might be lagging behind messages sent through the job scheduler (i.e. not globally consistent in real time)? The job appears to exist, but then is gone by the time Toil goes to load it: https://github.com/DataBiosphere/toil/blob/77a39f507b729525926c5efc9e07377483cdd005/src/toil/leader.py#L955-L959
We may be best off extending the special-case handling for stale reads that we have for the AWS job store to also cover the file job store, so that we don't crash when the job is slow to disappear.