So when I run make diff-cover on that PR locally I get:
cwltool/command_line_tool.py (62.5%): Missing lines 207,256-257
Those lines are in revmap_file. Looking at the new test, I get the impression that the test (in its current setup) can only check that what you put in as a filename also comes out (i.e. the external filename representation). I guess that's why only the if clause is covered by the test. I guess the else clause will only be executed if you supply an internal filename representation (at least, that's what I'm guessing right now). I'm not sure how I would have to supply an internal filename representation in that current test, because it uses a CommandLineTool, which is an external thingy.
internal in this case refers to a path within a software (docker) container
Maybe adding a DockerRequirement, as at https://github.com/common-workflow-language/cwltool/pull/1446/files#diff-39c8c56d7c38aab05d7eb4a8a765fcc4ea98d28bc4d0fedd22bce834e28dc843R123, is enough?
Or set RuntimeContext.outdir to some file:/// reference to a tmpdir, e.g. (tmp_path / "outdir").as_uri().
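As a rough illustration of that suggestion, here is a minimal sketch of a pytest-style test that points cwltool's RuntimeContext.outdir at a file:// URI under tmp_path; the test name and assertion are placeholders, not the actual test from the PR:

```python
# Minimal sketch, assuming cwltool is installed and pytest provides tmp_path.
from cwltool.context import RuntimeContext

def test_outdir_as_file_uri(tmp_path):
    runtime_context = RuntimeContext()
    # Point the output directory at a file:/// reference to a temporary directory,
    # so the run writes into a known, disposable location.
    runtime_context.outdir = (tmp_path / "outdir").as_uri()
    assert runtime_context.outdir.startswith("file://")
```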
I encounter NFS issues when running Toil with Slurm on one of our clusters, causing jobs to fail. Two typical tracebacks follow below:
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/bin/_toil_worker", line 8, in <module>
    sys.exit(main())
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 710, in main
    with in_contexts(options.context):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 684, in in_contexts
    with manager:
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractBatchSystem.py", line 505, in __enter__
    self.arena.enter()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 438, in enter
    with global_mutex(self.workDir, self.mutex):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 340, in global_mutex
    fd_stats = os.fstat(fd)
OSError: [Errno 116] Stale file handle
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/deferred.py", line 215, in cleanupWorker
    robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 51, in robust_rmtree
    robust_rmtree(child_path)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 64, in robust_rmtree
    os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/project/rapthor/Share/prefactor/L667520/working/f7a704078c8f54fc8a7ccb44a8d5d5f6/deferred/.nfs00000000000e74070000e57a'
Both types of error seem to occur during clean-up.
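For context only, here is a minimal sketch of the kind of retry wrapper that can tolerate the transient NFS errors shown above (EBUSY/ESTALE); this is not Toil's actual code, and a real fix would still need to address the stale mutex handle itself:

```python
# Hypothetical retry helper for transient NFS errors; not part of Toil.
import errno
import time

def retry_on_nfs_errors(operation, attempts=5, delay=1.0):
    """Run operation(), retrying on EBUSY (16) and ESTALE (116)."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as err:
            if err.errno not in (errno.EBUSY, errno.ESTALE) or attempt == attempts - 1:
                raise
            time.sleep(delay)
```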
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
Exit reason: None
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
[2021-11-01T11:50:54+0530] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl with ID kind-JobFunctionWrappingJob/instance-l0jq0ypl to 2
Kindly suggest the reason this happens. I do not see any errors as such in the execution. I am currently using the Slurm batch system.
Anybody ever see anything like this from a Toil worker? Maybe @mr-c:matrix.org ?
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 376, in workerScript
    job = Job.loadJob(jobStore, jobDesc)
  File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2251, in loadJob
    job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
  File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 1876, in _unpickle
    runnable = unpickler.load()
AttributeError: 'Comment' object has no attribute '_end'
I'm trying to run some CWL CI tests, which broke when we rebuilt our GitLab, with a local leader against our Kubernetes, and I'm getting this. I'd say it's a cwltool version mismatch, but as far as I can tell I have cwltool==3.1.20211020155521 in both my container and my leader virtualenv. Does CWL have a Comment object that recently grew or lost a _end?
Hey,
My team is working on a cloud/HPC platform for image processing. We are using Toil and CWL for our HPC platform. One request on our platform is to be able to monitor individual steps of a workflow.
We are calling Toil via a REST API and submitting our workflows to a Slurm cluster. I would like to be able to monitor individual steps of the workflow, but I don't know how I can know ahead of time what the name of each job would be. I was hoping to use squeue or sacct to get the status of each step of the workflow.
For our cloud platform, we use the ID of the workflow and we can query Argo to get the status of each step of the workflow.
I'm hoping to be able to give each step of the workflow a name such as echo for the echo step and sleep for the sleep step. The workflow is below:
cwlVersion: v1.0
class: Workflow
id: echo-sleep
inputs:
  echoMessage: string
  sleepParam: int
outputs: {}
steps:
  echo:
    run: echo.cwl
    in:
      hello: echoMessage
    out: []
  sleep:
    run: sleep.cwl
    in:
      sleepTime: sleepParam
    out: []
Currently, the jobName would be toil_job_3_CwlJob. I would like to modify this behavior so that the name of the step is the name of the job. Is there any suggestion on allowing this?
We send jobName off as the grid scheduler job name component here: https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/batchSystems/abstractGridEngineBatchSystem.py#L372-L373
Maybe it should use instanceName instead of jobName anyway. For CWL, jobName really should be the tool ID, and instanceName would be tool ID x step ID.
@mr-c:matrix.org and @adamnovak thanks for pointing out in the code where that is. I will poke around with logging to see what is going on.
So I have a JavaScript process that runs toil-cwl-runner. I saw that I can specify various batch systems, so I was hoping that I could use Toil to get job IDs and statuses for the individual jobs.
Currently, I have nothing tying me into an HPC system in this pattern. But I notice that Toil doesn't really report the status of the jobs. If I do toil status jobStore, it says what jobs are running, but it doesn't really tell me whether a job is pending/running/submitting. So I think I would have to query an HPC CLI to get the status of the job.
So I am thinking I would have a separate process that runs a status check to verify the status of each step of the workflow. Since we are really aiming for Slurm support right now, we would have to use sacct or squeue to get the status of each job. But I realize I don't know which job IDs correspond to which steps.
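As a rough sketch of that polling idea, here is a hypothetical helper that asks squeue for the state of a job submitted under a known name; the job name below is just a placeholder, and jobs that have already finished would need sacct instead:

```python
# Hypothetical status poll using standard squeue flags; not Toil code.
import subprocess

def slurm_job_state(job_name: str) -> str:
    # -h suppresses the header, -n filters by job name, -o %T prints only the state.
    result = subprocess.run(
        ["squeue", "-h", "-n", job_name, "-o", "%T"],
        capture_output=True, text=True, check=True,
    )
    state = result.stdout.strip()
    # An empty result usually means the job has already left the queue;
    # sacct would be needed to distinguish COMPLETED from FAILED.
    return state or "NOT_IN_QUEUE"

print(slurm_job_state("toil_job_3_CwlJob"))  # placeholder name
```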
Hey, I posted the issue above to give more descriptive job names for HPC schedulers.
I talked with my manager and since this directly impacts my project, I have the bandwidth to fix this.
So for workflows, we are thinking toiljob{id}_workflowId.stepId. For single tools, we would use toiljob{id}_toolId. What would I use for scatter/gather job names?
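To make that proposal concrete, here is a hypothetical sketch of how such names could be composed (this is not Toil's implementation, and the scatter/gather case is left open, as asked above):

```python
# Hypothetical name builder for the proposed scheme, sanitized for Slurm job names.
import re

def make_job_name(job_id: int, workflow_id: str = "", step_id: str = "", tool_id: str = "") -> str:
    # Workflows: toil_job_{id}_{workflowId}.{stepId}; single tools: toil_job_{id}_{toolId}.
    suffix = f"{workflow_id}.{step_id}" if workflow_id and step_id else (tool_id or "CwlJob")
    safe_suffix = re.sub(r"[^A-Za-z0-9_.-]", "_", suffix)
    return f"toil_job_{job_id}_{safe_suffix}"

print(make_job_name(3, workflow_id="echo-sleep", step_id="echo"))  # toil_job_3_echo-sleep.echo
print(make_job_name(4, tool_id="echo"))                            # toil_job_4_echo
```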