Adam Novak
@adamnovak
Hmmm
Looks like I can send unlimited messages.
Actually, after a few messages on this end they get dropped on the Slack end as well; we just don't get ads for SameRoom spammed to the channel to notify us.
crusoe
@mr-c:matrix.org
[m]

@cibinsb @adamnovak I started working on this, though I'm very unfamiliar with all things boto3/AWS :-)

DataBiosphere/toil#3710

@adamnovak: Any thoughts on DataBiosphere/toil#3694 ?
Adam Novak
@adamnovak
I just reviewed #3694. It looks OK to me, but it changes over to a git branch dependency on cwltool, so we couldn't release Toil if we merged it as is.
crusoe
@mr-c:matrix.org
[m]
Right, just wanted to confirm the general direction before we make the changes on the cwltool side
Will update with a released version of cwltool before merging
crusoe
@mr-c:matrix.org
[m]
Huzzah!
@cibinsb: Hopefully we will fix the issue for good, so no documentation update will be needed
Ian
@ionox0
what is the difference between --workDir and the TMPDIR environment variable?
crusoe
@mr-c:matrix.org
[m]
the latter is used by the former as a fallback
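
A rough sketch of that precedence (illustrative only, not Toil's actual code): an explicit --workDir wins, and otherwise the temp location is taken from the environment, which is where TMPDIR comes in.

import tempfile

def resolve_work_dir(cli_work_dir=None):
    """Illustrative only: an explicit --workDir takes priority; otherwise
    fall back to the system temp directory, which honours TMPDIR."""
    if cli_work_dir is not None:
        return cli_work_dir
    # tempfile.gettempdir() consults TMPDIR (then TEMP, TMP) before
    # defaulting to /tmp.
    return tempfile.gettempdir()

# resolve_work_dir("/scratch/toil")  -> "/scratch/toil"
# resolve_work_dir()                 -> $TMPDIR if set, else /tmp
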
Ian
@ionox0
:thumbsup:
Kevin Chau
@kkchau
Hello, new to toil here. I was curious why the default AMI was CoreOS (and is now Flatcar) instead of something like Ubuntu LTS?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] CoreOS and Flatcar update themselves, and are designed to just sit there and run containers, I think was the logic. They also have a nice easy way to fetch the current best AMI to use. For Ubuntu, the system kind of expects an administrator to exist and periodically come by and sudo apt upgrade.
Kevin Chau
@kkchau
Makes perfect sense, thank you
Martín Beracochea
@mberacochea
Hi, I have a small issue with toil (toil-cwl-runner is the one we use). I can't run it with a shared --workDir because the only shared filesystems we have are GPFS and NFS (https://github.com/DataBiosphere/toil/issues/3497#issuecomment-803021190). That is OK, I can use the workers' tmp directories just fine. The problem is that it's very hard to debug when a job fails, as the logs are not reported to the leader. Would it be possible to use the --writeLogs directory here -> https://github.com/DataBiosphere/toil/blob/master/src/toil/batchSystems/abstractBatchSystem.py#L369 , so that we can record all the logs? Thanks
Lon Blauvelt
@DailyDreaming
@mberacochea If you have a solution that works for you, then we would happily accept and review a PR to add it into the main code base.
Martín Beracochea
@mberacochea
Super then, I'll send the PR as soon as I can. Thanks!
Lon Blauvelt
@DailyDreaming
Thank you!
anthfm
@anthfm
Hi, I am using toil-cwl-runner in order to parallelize jobs across a few nodes on a Slurm cluster. However, I keep receiving the following error: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. I increased outdirMin in my CWL file, but it is still showing the error with 56.0 KiB used and 0 B requested. I am not sure how to approach this error further. Thank you.
Marcel Loose
@gmloose

Hi, one of my Toil jobs, using Slurm as batch system, crashed with the following error:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 3231, in main
    outobj = toil.start(wf1)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 841, in start
    return self._runMainLoop(rootJobDescription)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 246, in run
    self.innerLoop()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 667, in innerLoop
    assert self.toilState.successorCounts == {}
AssertionError

Has anyone seen this error before?

crusoe
@mr-c:matrix.org
[m]
@gmloose: it is an old check, but not one that I've seen triggered before. Can you open an issue with your version of toil and your CWL workflow + inputs?
Adam Novak
@adamnovak
Hmm. Which job is it complaining about @anthfm?
Adam Novak
@adamnovak
@gmloose That looks like some of the bookkeeping structures we use for working out when jobs with multiple predecessors have all their predecessors done. If it's not empty at the end of the workflow, we conclude something went wrong and bail. We think there are race conditions in the core scheduler that can cause some updates to clobber other updates; I'm trying to refactor that part so this sort of thing can't happen, but that's not ready yet. If you have the whole log at debug level we might be able to find what got dropped, which might help us figure out what the race condition is. Otherwise the best thing to do might be to --restart with the same --jobStore and resume the workflow.
Marcel Loose
@gmloose
It's quite a complicated workflow, consisting of a number of sub-workflows and several tens of steps (commands and expressions). The actual input files that need to be processed (not the JSON file) are huge. I don't have debug logs of this particular run, and it is the first time I have ever seen this error message. I can try to restart it (with --logDebug) and see if it fails again.
crusoe
@mr-c:matrix.org
[m]
@gmloose: Thanks, do let us know what happens!
Marcel Loose
@gmloose
As was (almost) to be expected, the restarted workflow completed without error. So this type of error is probably very hard to track down.
anthfm
@anthfm
Hi, I have a CWL workflow (subworkflow with scatter) that takes in input files, processes them with step1, and subsequently uses step1's output to run step2, resulting in the final output files. It executes with no errors using Toil; however, I noticed that after the step1 and step2 jobs are executed and completed, Toil re-issues empty step1 jobs that terminate successfully immediately. Is this normal behaviour for CWL scatter/subworkflows using Toil? Thank you
Adam Novak
@adamnovak
@anthfm That's normal behavior for Toil; jobs are issued once on the way down through the workflow graph, and then again on the way back up to do some cleanup. It shouldn't actually redo any CWL work, although the resource requirements might be excessive for the cleanup because I think it still asks for the same as the original job.
Martín Beracochea
@mberacochea
Hey, I have a CWL pipeline which I run in LSF. I'm using TOIL_LSF_ARGS to set the queue, all good. But, I want to run one step in a different queue... is this possible?
crusoe
@mr-c:matrix.org
[m]
@mberacochea just a single step and skip the rest of the workflow?
There is a cwltool option to extract a single step from a workflow and either print it out or run it
I forget if we exposed that in toil-cwl-runner. If we didn't, you can use cwltool --single-step name --print-subgraph and then take the result to toil-cwl-runner
Martín Beracochea
@mberacochea
hi @mr-c:matrix.org, I need to run the whole workflow (but that step has to run in a different queue).
crusoe
@mr-c:matrix.org
[m]
Okay. What is special about that queue? Longer walltime is allowed? Special hardware? Bigmem?
Martín Beracochea
@mberacochea
bigmem
Adam Novak
@adamnovak
Yeah, Toil doesn't have this feature. If you have an idea for how to improve the LSF batch system code in Toil so that it can know what queues jobs need to go in based on their memory requirements, and you can figure out how to code it so that it will still work on everybody else's LSF clusters, we could take a PR.
Or if CWL grew a way to add an LSF queue annotation to a job, we could maybe punch a hole through to the batch system so that it could know about it.
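
A purely hypothetical sketch of the memory-to-queue mapping described above; the function, the queue names, and the threshold are invented for illustration, and nothing like this exists in Toil's LSF batch system today.

def pick_queue(memory_bytes, bigmem_threshold=256 * 2**30):
    """Hypothetical helper: route a job to an LSF queue based on its
    memory requirement. The threshold would have to be site-specific."""
    return "bigmem" if memory_bytes >= bigmem_threshold else "standard"

# pick_queue(512 * 2**30)  -> "bigmem"
# pick_queue(8 * 2**30)    -> "standard"
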
crusoe
@mr-c:matrix.org
[m]
@mberacochea do you have to specify a queue? Toil does put the memory requirements in the batch job. You could ask your LSF admin if that'd be enough
There is an unimplemented proposal for overriding the queue for a certain job: common-workflow-language/common-workflow-language#581
Martín Beracochea
@mberacochea
Thank you both. I need to specify the queue, otherwise LSF rejects the job (I'll see if the admins have something to sort that out).
yeah, the overrides seem to be the way to go in my case
crusoe
@mr-c:matrix.org
[m]
Okay, if they can't cope then we could try implementing a vendor extension version of BatchQueue for toil-cwl-runner
Martín Beracochea
@mberacochea
All right. Sounds like a plan. Thanks
crusoe
@mr-c:matrix.org
[m]
@adamnovak that would require a way to pass per-Toil-job options to the batch systems, yes
Adam Novak
@adamnovak
In general we need to revise requirements to be a bit more free-form to support things like GPUs. I think we'll end up with some kind of dict of keys that the batch system can consult.
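
As a purely hypothetical illustration of that idea: the memory/cores/disk keyword arguments below are existing Toil job options, while the free-form requirements dict mentioned in the comment is invented and does not exist today.

from toil.job import Job

def run_step(job):
    return "ran the bigmem step"

# Existing, real resource options on a Toil job:
step = Job.wrapJobFn(run_step, memory="300G", cores=4, disk="50G")

# Hypothetical extension: an extra free-form dict the batch system could
# consult, e.g. requirements={"lsf_queue": "bigmem", "gpus": 1}.
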
Michael Milton
@multimeric
If I run a Python workflow, and it completes, then I edit the workflow and re-run it with --restart, Toil still thinks it's finished and I get [2021-09-24T14:33:48+1000] [MainThread] [W] [toil.common] Requested restart but the workflow has already been completed; allowing exports to rerun. What I actually want it to do here is cache the jobs that are unchanged and re-run those that have changed. Is this possible at all in Toil?
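
For context, the documented start/restart pattern for a Python workflow looks roughly like this (a minimal sketch following the Toil quickstart; the job itself is a placeholder). Restarting resumes whatever is recorded in the existing job store; it does not compare the edited code against the previous run.

from toil.common import Toil
from toil.job import Job

def hello(job, name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    with Toil(options) as toil:
        if options.restart:
            # Resume the unfinished jobs recorded in the existing job store.
            output = toil.restart()
        else:
            output = toil.start(Job.wrapJobFn(hello, "world"))
    print(output)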