crusoe
@mr-c:matrix.org
[m]
the latter is used by the former as a fallback
Ian
@ionox0
:thumbsup:
Kevin Chau
@kkchau
Hello, new to toil here. I was curious as to the reason the default AMI was CoreOS (and now Flatcar) instead of something like Ubuntu LTS?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] CoreOS and Flatcar update themselves, and are designed to just sit there and run containers; I think that was the logic. They also have a nice easy way to fetch the current best AMI to use. For Ubuntu, the system kind of expects an administrator to exist and periodically come by and sudo apt upgrade.
Kevin Chau
@kkchau
Makes perfect sense, thank you
Martín Beracochea
@mberacochea
Hi, I have a small issue with toil (toil-cwl-runner is the one we use). I can't run it with a shared --workDir because the only shared filesystems we have are GPFS and NFS (https://github.com/DataBiosphere/toil/issues/3497#issuecomment-803021190). That is OK, I can use the workers' tmp directories just fine. The problem is that it's very hard to debug when a job fails, as the logs are not reported to the leader. Would it be possible to use the --writeLogs directory here -> https://github.com/DataBiosphere/toil/blob/master/src/toil/batchSystems/abstractBatchSystem.py#L369 (that way we could record all the logs)? Thanks
1 reply
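A minimal sketch of the idea Martín describes, in case it helps frame the PR: after a job fails on a worker whose temp directory is not shared with the leader, copy its worker-local log into the shared --writeLogs directory. The function and argument names here are illustrative, not Toil's existing API.

import os
import shutil

def preserve_worker_log(job_log_path: str, write_logs_dir: str, job_name: str) -> None:
    # Copy a worker-local job log into a shared log directory, if one was given,
    # so it survives for debugging even when the worker's tmp dir is not shared.
    if not write_logs_dir:
        return  # the user did not ask for logs to be kept
    os.makedirs(write_logs_dir, exist_ok=True)
    destination = os.path.join(write_logs_dir, job_name + ".log")
    shutil.copyfile(job_log_path, destination)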
Lon Blauvelt
@DailyDreaming
@mberacochea If you have a solution that works for you, then we would happily accept and review a PR to add it into the main code base.
Martín Beracochea
@mberacochea
Super then, I'll send the PR as soon as I can. Thanks!
Lon Blauvelt
@DailyDreaming
Thank you!
anthfm
@anthfm
Hi, I am using toil-cwl-runner in order to parallelize jobs across a few nodes on a Slurm cluster. However, I keep receiving the following error: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. I increased outdirMin in my CWL file, but it is still showing the error with 56.0 KiB used and 0 B requested. I am not sure how to approach this error further. Thank you.
6 replies
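The thread above is about a CWL workflow, but for reference the "increasing the disk requirement" advice in that error maps to a per-job setting in Toil's Python API. A minimal, unrelated sketch (a toy job, not anthfm's workflow):

from toil.common import Toil
from toil.job import Job

def touch_output(job):
    # Write a small file into the job's local temp space and store it globally.
    path = job.fileStore.getLocalTempFile()
    with open(path, "w") as f:
        f.write("hello\n")
    return job.fileStore.writeGlobalFile(path)

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    # Ask for more disk up front so the job is not flagged for using more than requested.
    root = Job.wrapJobFn(touch_output, disk="2G", memory="512M", cores=1)
    with Toil(options) as workflow:
        workflow.start(root)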
Marcel Loose
@gmloose

Hi, one of my Toil jobs, using Slurm as batch system, crashed with the following error:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 3231, in main
    outobj = toil.start(wf1)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 841, in start
    return self._runMainLoop(rootJobDescription)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 246, in run
    self.innerLoop()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 667, in innerLoop
    assert self.toilState.successorCounts == {}
AssertionError

Has anyone seen this error before?

crusoe
@mr-c:matrix.org
[m]
@gmloose: it is an old check, but not one that I've seen triggered before. Can you open an issue with your version of toil and your CWL workflow + inputs?
Adam Novak
@adamnovak
Hmm. Which job is it complaining about @anthfm?
Adam Novak
@adamnovak
@gmloose That looks like some of the bookkeeping structures we use for working out when jobs with multiple predecessors have all their predecessors done. If it's not empty at the end of the workflow, we conclude something went wrong and bail. We think there are race conditions in the core scheduler that can cause some updates to clobber other updates; I'm trying to refactor that part so this sort of thing can't happen, but that's not ready yet. If you have the whole log at debug level we might be able to find what got dropped, which might help us figure out what the race condition is. Otherwise the best thing to do might be to --restart with the same --jobStore and resume the workflow.
Marcel Loose
@gmloose
It's quite a complicated workflow, consisting of a number of sub-workflows and several tens of steps (commands and expressions). The actual input files that need to be processed (not the JSON file) are huge. I don't have debug logs of this particular run, and it is the first time I have ever seen this error message. I can try to restart it (with --logDebug) and see if it fails again.
crusoe
@mr-c:matrix.org
[m]
@gmloose: Thanks, do let us know what happens!
Marcel Loose
@gmloose
As was (almost) to be expected, the restarted workflow completed without error. So this type of error is probably very hard to track down.
anthfm
@anthfm
Hi, I have a CWL workflow (subworkflow with scatter) that takes in input files, processes them with step1, and subsequently uses step1's output to run step2, resulting in the final output files. It executes with no errors using Toil; however, I noticed that after the step1 and step2 jobs are executed and completed, Toil re-issues empty step1 jobs that terminate successfully immediately. Is this normal behaviour for CWL scatter/subworkflows using Toil? Thank you
Adam Novak
@adamnovak
@anthfm That's normal behavior for Toil; jobs are issued once on the way down through the workflow graph, and then again on the way back up to do some cleanup. It shouldn't actually redo any CWL work, although the resource requirements might be excessive for the cleanup because I think it still asks for the same resources as the original job.
1 reply
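The "on the way down / back up" traversal Adam describes corresponds to children and follow-ons in Toil's Python API; a toy illustration (not anthfm's CWL workflow, and the CWL internals differ):

from toil.common import Toil
from toil.job import Job

def step(job, name):
    # Just log which phase of the traversal we are in.
    job.fileStore.logToMaster("running " + name)

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    root = Job.wrapJobFn(step, "root")
    child = Job.wrapJobFn(step, "child")      # runs after root, on the way down
    cleanup = Job.wrapJobFn(step, "cleanup")  # runs after root's subtree, on the way back up
    root.addChild(child)
    root.addFollowOn(cleanup)
    with Toil(options) as workflow:
        workflow.start(root)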
Martín Beracochea
@mberacochea
Hey, I have a CWL pipeline which I run in LSF. I'm using TOIL_LSF_ARGS to set the queue, all good. But, I want to run one step in a different queue... is this possible?
crusoe
@mr-c:matrix.org
[m]
@mberacochea just a single step and skip the rest of the workflow?
There is a cwltool option to extract a single step from a workflow and either print it out or run it
I forget if we exposed that in toil-cwl-runner. If we didn't, you can use cwltool --single-step name --print-subgraph and then take the result to toil-cwl-runner
Martín Beracochea
@mberacochea
hi @mr-c:matrix.org , I need to run the whole workflow (but that step has to run in a different queue).
crusoe
@mr-c:matrix.org
[m]
Okay. What is special about that queue? Longer walltime is allowed? Special hardware? Bigmem?
Martín Beracochea
@mberacochea
bigmem
Adam Novak
@adamnovak
Yeah, Toil doesn't have this feature. If you have an idea for how to improve the LSF batch system code in Toil so that it can know what queues jobs need to go in based on their memory requirements, and you can figure out how to code it so that it will still work on everybody else's LSF clusters, we could take a PR.
Or if CWL grew a way to add an LSF queue annotation to a job, we could maybe punch a hole through to the batch system so that it could know about it.
crusoe
@mr-c:matrix.org
[m]
@mberacochea do you have to specify a queue? Toil does put the memory requirements in the batch job. You could ask your LSF admin if that'd be enough
There is an unimplemented proposal for overriding the queue for a certain job: common-workflow-language/common-workflow-language#581
Martín Beracochea
@mberacochea
Thank you both. I need to specify the queue, otherwise LSF rejects the job (I'll see if the admins have something to sort that out).
yeah, the overrides seem to be the way to go in my case
crusoe
@mr-c:matrix.org
[m]
Okay, if they can't cope then we could try implementing a vendor extension version of BatchQueue for toil-cwl-runner
Martín Beracochea
@mberacochea
All right. Sounds like a plan. Thanks
crusoe
@mr-c:matrix.org
[m]
@adamnovak that would require a way to pass per-Toil-job options to the batch systems, yes
Adam Novak
@adamnovak
In general we need to revise requirements to be a bit more free-form to support things like GPUs. I think we'll end up with some kind of dict of keys that the batch system can consult.
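A purely hypothetical sketch of that "dict of keys" idea as it might apply to the LSF queue question above; none of these names are existing Toil API:

from typing import Any, Dict

def pick_lsf_queue(requirements: Dict[str, Any]) -> str:
    # Choose an LSF queue from a free-form requirements dict: an explicit queue
    # annotation wins, otherwise fall back to a memory-based rule.
    if "lsf_queue" in requirements:
        return requirements["lsf_queue"]
    memory_bytes = requirements.get("memory", 0)
    if memory_bytes > 256 * 1024 ** 3:
        return "bigmem"
    return "standard"

print(pick_lsf_queue({"memory": 300 * 1024 ** 3}))                    # bigmem
print(pick_lsf_queue({"memory": 8 * 1024 ** 3, "lsf_queue": "gpu"}))  # gpu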
Michael Milton
@multimeric
If I run a Python workflow, and it completes, then I edit the workflow and re-run it with --restart, Toil still thinks it's finished and I get [2021-09-24T14:33:48+1000] [MainThread] [W] [toil.common] Requested restart but the workflow has already been completed; allowing exports to rerun.. What I actually want it to do here is cache the jobs that are unchanged and re-run those that have changed. Is this possible at all in toil?
Adam Novak
@adamnovak
@multimeric Unfortunately that's not a feature that we have. We don't keep any record of what particular Python values or class or function definitions each workflow task depends on, or what version of those definitions it ran with. In fact, we don't really keep records of the jobs at all after they and their descendants complete, and we sever the connection between a job description and its Python code after that code runs successfully, unless it's a checkpoint job.
If we wanted to do this, we'd have to basically make all jobs checkpoint jobs, except even more so because we'd have to keep them around after they and their descendants all finished. Then we'd have to come up with a new way to enumerate the jobs that still need to happen (which right now is basically 1:1 with the jobs that still exist). We'd also need to come up with a way to traverse the possible call and constant access graph of a Python function, determine an identifier for the version of each function or constant that is used, and store that along with the finished job.
And anything that accessed code via dynamic lookup would need to either always or never rerun, because we wouldn't be able to find the code.
Adam Novak
@adamnovak
That all being said, WDL runners are able to do this with WDL code, so it might not be impossible.
Michael Milton
@multimeric
Thanks for the answer. One angle I've seen used in another system is to annotate each job with a hash that the user is allowed to calculate. Then the user can try simple solutions like just hashing the file that the job resides in, or alternatively just keeping a manual version number for each task.
If WDL already does this, then I guess Toil has some concept of "has this job changed", which I would just need to plug this logic into
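A minimal sketch of the hashing approach Michael mentions, using the source file that defines the job's function as the hash input; illustrative only, not something Toil does today:

import hashlib
import inspect

def job_source_hash(func) -> str:
    # Hash the file containing func; editing that file changes the hash,
    # so unchanged code keeps the same hash between runs.
    path = inspect.getsourcefile(func)
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def my_task():
    return 42

print(job_source_hash(my_task))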
Marcel Loose
@gmloose
Can anyone explain to me how to interpret the output of toil --stats? The documentation is quite limited.
Marcel Loose
@gmloose
Today, I've been bitten by the fact that CWLTool URL-encodes a + character in a filename to %2B. This results in an error: Cannot make job: Invalid filename: 'P233%2B35_structure.txt' contains illegal characters
I saw there are several issues that refer to this:
common-workflow-language/cwltool#1260,
common-workflow-language/cwltool#1098, and
common-workflow-language/cwltool#1445.
The last one even contains an almost-finished pull request.
So I was wondering, what's the status of this issue? Is it indeed a bug in CWLTool, or is this a (too) strict limitation by CWLTool on allowed characters in a filename?
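The encoding Marcel hit can be reproduced with Python's urllib, and unquoting recovers the original name; this just illustrates the behaviour, it is not a fix:

from urllib.parse import quote, unquote

name = "P233+35_structure.txt"
encoded = quote(name)             # 'P233%2B35_structure.txt'
assert unquote(encoded) == name   # round-trips back to the original filename
print(encoded)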
crusoe
@mr-c:matrix.org
[m]
Sorry to hear that @gmloose ; https://github.com/common-workflow-language/cwltool/pull/1446#issuecomment-850896086 shows that the PR needs some assistance. Would you like to finish it up?
It is indeed a bug in cwltool, and it should be fixed.
Marcel Loose
@gmloose
I could have a look, though I have limited time.