You could set RuntimeContext.outdir to some file:/// reference to a tmpdir, e.g. (tmp_path / "outdir").as_uri()
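A minimal sketch of that suggestion, assuming a pytest test and cwltool's RuntimeContext (which toil-cwl-runner builds on); whether outdir accepts a file:// URI rather than a plain path depends on the runner, so treat this as illustrative only:

from cwltool.context import RuntimeContext

def test_run_workflow(tmp_path):
    # Point the output directory at the test's temporary directory as a URI.
    runtime_context = RuntimeContext()
    runtime_context.outdir = (tmp_path / "outdir").as_uri()
    # ... invoke the workflow with runtime_context here ...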
I encounter NFS issues when running Toil with Slurm on one of our clusters, causing jobs to fail. Two typical tracebacks follow below:
Traceback (most recent call last):
File "/project/rapthor/Software/rapthor/bin/_toil_worker", line 8, in <module>
sys.exit(main())
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 710, in main
with in_contexts(options.context):
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 684, in in_contexts
with manager:
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractBatchSystem.py", line 505, in __enter__
self.arena.enter()
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 438, in enter
with global_mutex(self.workDir, self.mutex):
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 340, in global_mutex
fd_stats = os.fstat(fd)
OSError: [Errno 116] Stale file handle
Traceback (most recent call last):
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/deferred.py", line 215, in cleanupWorker
robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 51, in robust_rmtree
robust_rmtree(child_path)
File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 64, in robust_rmtree
os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/project/rapthor/Share/prefactor/L667520/working/f7a704078c8f54fc8a7ccb44a8d5d5f6/deferred/.nfs00000000000e74070000e57a'
Both types of error seem to occur during clean-up.
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
Exit reason: None
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
[2021-11-01T11:50:54+0530] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl with ID kind-JobFunctionWrappingJob/instance-l0jq0ypl to 2
Kindly suggest the reason this happens. I do not see any other errors in the execution. I am currently using the Slurm batchSystem.
Anybody ever see anything like this from a Toil worker? Maybe @mr-c:matrix.org ?
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 376, in workerScript
job = Job.loadJob(jobStore, jobDesc)
File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2251, in loadJob
job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 1876, in _unpickle
runnable = unpickler.load()
AttributeError: 'Comment' object has no attribute '_end'
I'm trying to run some CWL CI tests, which broke when we rebuilt our Gitlab, with a local leader against our Kubernetes, and I'm getting this. I'd say it's a cwltool version mismatch, but as far as I can tell I have cwltool==3.1.20211020155521 in both my container and my leader virtualenv. Does CWL have a Comment object that recently grew or lost a _end?
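In case it helps narrow this down: the Comment class in that traceback most likely comes from ruamel.yaml (used by cwltool/schema-salad), so a ruamel.yaml version skew between the leader virtualenv and the container is a plausible culprit even when cwltool itself matches. A quick diagnostic sketch to run in both environments and compare:

import ruamel.yaml
from ruamel.yaml.comments import Comment

print("ruamel.yaml version:", ruamel.yaml.__version__)
print("Comment slots:", getattr(Comment, "__slots__", None))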
Hey,
My team is working on a cloud/HPC platform for image processing. We are using Toil and CWL for our HPC platform. One request for our platform is to be able to monitor individual steps of a workflow.
We are calling Toil via a REST API and submitting our workflows to a Slurm cluster. I would like to be able to monitor individual steps of the workflow, but I don't know how to find out ahead of time what the name of each job will be. I was hoping to use squeue or sacct to get the status of each step of the workflow.
For our cloud platform, we use the ID of the workflow and we can query Argo to get the status of each step of the workflow.
I'm hoping to be able to give each step of the workflow a name such as echo for the echo step and sleep for the sleep step. The workflow is below:
cwlVersion: v1.0
class: Workflow
id: echo-sleep
inputs:
  echoMessage: string
  sleepParam: int
outputs: {}
steps:
  echo:
    run: echo.cwl
    in:
      hello: echoMessage
    out: []
  sleep:
    run: sleep.cwl
    in:
      sleepTime: sleepParam
    out: []
Currently, the jobName would be toil_job_3_CwlJob. I would like to modify this behavior so that the name of the step becomes the name of the job. Any suggestions on how to allow this?
We pass jobName off as the grid scheduler job name component here: https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/batchSystems/abstractGridEngineBatchSystem.py#L372-L373
Maybe we should use instanceName instead of jobName there anyway. jobName really should be the tool ID, and instanceName would be tool ID x step ID.
@mr-c:matrix.org and @adamnovak thanks for pointing out in the code where that is. I will poke around with logging to see what is going on.
So I have a JavaScript process that runs toil-cwl-runner. I saw that I can specify various batch systems, so I was hoping that I could use Toil to get job IDs and status for the individual jobs.
Currently, I have nothing tying me into an HPC system in this pattern. But I notice that Toil doesn't really report the status of the jobs. If I do toil status jobStore, it says what jobs are running, but it doesn't really tell me if a job is pending/running/submitting. So I think I would have to query an HPC CLI to get the status of the job.
So I am thinking I would have a separate process that runs a status check to verify the status of each step of the workflow. Since we are really aiming for Slurm support right now, we would have to use sacct or squeue to get the status of each job. But I realize I don't know which job IDs correspond to which steps.
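If it helps, here's a rough sketch (my own, not something Toil provides) of polling sacct by job name from Python; it assumes sacct is on PATH and that the Slurm job was submitted with a predictable name, e.g. today's toil_job_3_CwlJob or a step-derived name once such a feature exists:

import subprocess
from typing import List, Tuple

def slurm_job_states(job_name: str) -> List[Tuple[str, str]]:
    # Ask sacct for (JobID, State) pairs of all jobs matching the given name.
    out = subprocess.run(
        ["sacct", "--name", job_name, "--format=JobID,State",
         "--noheader", "--parsable2"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        job_id, state = line.split("|")[:2]
        pairs.append((job_id, state))
    return pairs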
Hey, I posted the issue above to give more descriptive job names for HPC schedulers.
I talked with my manager and since this directly impacts my project, I have the bandwidth to fix this.
So for workflows, we are thinking toiljob{id}_workflowId.stepId. For single tools, we would use toiljob{id}_toolId. What would I use for scatter/gather job names?
And I created a feature request. DataBiosphere/toil#3885
Let me know if I can clear that up more or if I posted it incorrectly.
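To make the proposal concrete, here is a hypothetical helper for building such names (the function, the character filtering, and the 64-character cap are my assumptions, not Toil's actual behaviour; schedulers differ in what they accept):

import re

def grid_job_name(job_id, workflow_id=None, step_id=None, tool_id=None):
    # toiljob{id}_workflowId.stepId for workflow steps, toiljob{id}_toolId for single tools.
    if workflow_id and step_id:
        suffix = "{}.{}".format(workflow_id, step_id)
    else:
        suffix = tool_id
    name = "toiljob{}_{}".format(job_id, suffix)
    # Keep only characters that are broadly safe in scheduler job names.
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    return name[:64]

# e.g. grid_job_name(3, workflow_id="echo-sleep", step_id="echo") -> "toiljob3_echo-sleep.echo"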
Hi, I've made some changes in my branch and I'm trying to run the unit tests. I am running into some problems, but I don't think it's related to my code.
When I do make test, I get the following 22 failures.
======================================================================================================= ERRORS ========================================================================================================
______________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/index.py ______________________________________________________________________________
Traceback (most recent call last):
File "/Users/kevinhannon/Work/GIT/toil/src/toil/test/cwl/spec_v11/tests/index.py", line 14, in <module>
main = open(mainfile)
FileNotFoundError: [Errno 2] No such file or directory: '--durations=0'
______________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/index.py ______________________________________________________________________________
Traceback (most recent call last):
File "/Users/kevinhannon/Work/GIT/toil/src/toil/test/cwl/spec_v11/tests/index.py", line 14, in <module>
main = open(mainfile)
FileNotFoundError: [Errno 2] No such file or directory: '--durations=0'
_____________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/search.py ______________________________________________________________________________
Traceback (most recent call last):
This also happens if I use the quick test run.
Hi, I submitted a PR for issue 3884. DataBiosphere/toil#3893
I wanted to get some feedback on this. This works for my purposes but I'm not sure if I handle scatter/gather workflows or subworkflows correctly.
Toil does not seem to clean up the temporary output directories (--tmp-outdir-prefix), at least when using Slurm as the batch control system. Since I cannot use Toil's internal file store (I need to set --bypass-file-store, because some of my workflows use InplaceUpdateRequirement), I had to put this temporary output directory on a shared (NFS) disk. Can someone acknowledge this behaviour?
--bypass-file-store is quite recent, and I don't think we added any logic to figure out what can safely be deleted (not used by a downstream step) and what can't. If deferring clean-up to the end would suffice, then that won't be so hard to implement (though care must be taken to not delete any InplaceUpdateRequirement files that are part of the original inputs or the final outputs).
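For what it's worth, a rough sketch of what "deferring clean-up to the end" could look like (purely illustrative, not Toil's implementation): register each step's temporary output directory as it is created, then remove everything at workflow completion except directories that still contain final outputs:

import shutil
from pathlib import Path
from typing import Iterable, List

class DeferredCleanup:
    def __init__(self) -> None:
        self._dirs: List[Path] = []

    def register(self, tmp_outdir) -> None:
        # Remember a per-step temporary output directory for later removal.
        self._dirs.append(Path(tmp_outdir).resolve())

    def cleanup(self, keep_paths: Iterable[str]) -> None:
        # Remove registered directories, skipping any that contain a kept path.
        keep = [Path(p).resolve() for p in keep_paths]
        for d in self._dirs:
            if any(d == k or d in k.parents for k in keep):
                continue
            shutil.rmtree(str(d), ignore_errors=True)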
[…] the --logFile option, like you get on the console? I couldn't find any mention of it in the description of the options. The work-around I use at the moment is to redirect all console output to a file, which makes the --logFile option quite useless in that case.