Lon Blauvelt
@DailyDreaming
@cibinsb Toil creates and uses a bucket named after the job store plus --files, and disallows -- in job store names because of this.
Rohith B S
@rohith-bs
Traceback (most recent call last):
  File "/home/test-user/toil-scripts/script.py", line 282, in <module>
    Job.Runner.startToil(main_job, options)
  File "/usr/local/lib/python3.7/site-packages/toil/job.py", line 1743, in startToil
    return toil.restart()
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 874, in restart
    return self._runMainLoop(rootJobDescription)
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 229, in run
    self.innerLoop()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 614, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 573, in _gatherUpdatedJobs
    self.processFinishedJob(jobID, exitStatus, wallTime=wallTime, exitReason=exitReason)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 959, in processFinishedJob
    replacementJob = self.jobStore.load(jobStoreID)
  File "/usr/local/lib/python3.7/site-packages/toil/jobStores/fileJobStore.py", line 209, in load
    with open(jobFile, 'rb') as fileHandle:
FileNotFoundError: [Errno 2] No such file or directory: 'jobStore/jobs/kind-FunctionWrappingJob/instance-pb6jcg2c/job'
Kindly suggest how to resolve this issue. The pipeline runs successfully and does whatever was intended, but Toil ends like this.
crusoe
@mr-c:matrix.org
[m]
Welcome @rohith-bs Which pipeline are you running?
Rohith B S
@rohith-bs
@mr-c:matrix.org Thank you. It is a custom developed pipeline.
crusoe
@mr-c:matrix.org
[m]
Okay, I can only help with CWL pipelines
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @cibinsb, Toil automatically appends --files to the job store name to derive the S3 bucket name, because a job store is more than just an S3 bucket; it currently includes some SimpleDB stuff, and only files go in S3.

I would try a different name than toil-test, maybe something with cibinsb in it, and see if that works.

11 replies
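For illustration, a minimal sketch of the naming convention described above; the helper name is hypothetical, not Toil's API, and the job store name is just an example:

def files_bucket_name(job_store_name: str) -> str:
    """Derive the S3 bucket the AWS job store uses for files from the job store name."""
    # "--" acts as a separator here, which is why it is disallowed inside job store names.
    return job_store_name + "--files"   # e.g. "cibinsb-toil-test" -> "cibinsb-toil-test--files"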
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @rohith-bs Are you running your job store on a shared network filesystem that might be lagging behind messages sent through the job scheduler/not globally consistent in real time? The job appears to exist, but then is gone by the time Toil goes to load it: https://github.com/DataBiosphere/toil/blob/77a39f507b729525926c5efc9e07377483cdd005/src/toil/leader.py#L955-L959

We may be best off extending the special case handling for stale reads we have for the AWS job store to also cover the file job store, so that when the job is slow to disappear we don't crash.

1 reply
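A minimal sketch of the stale-read handling suggested above, assuming a file job store whose load() raises FileNotFoundError as in the traceback earlier; the helper and the retry parameters are illustrative, not Toil's actual code:

import time

def load_with_grace_period(job_store, job_store_id, attempts=5, delay=1.0):
    """Retry loading a job briefly so a lagging NFS/GPFS mount doesn't crash the leader."""
    for attempt in range(attempts):
        try:
            return job_store.load(job_store_id)
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # still missing after the grace period; treat the job as really gone
            time.sleep(delay)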
Rohith B S
@rohith-bs
@mr-c:matrix.org Thank you for your time.
@DailyDreaming Yes. I was running in a network job store. Thank you for the support.
Adam Novak
@adamnovak
Hey Lon, are you in here?
How did you get the SameRoom bot to be you on this end?
And did that fix the usage limits we are running into in the vg channel?
Or did you pay them or something?
Hmmm
Looks like I can send unlimited messages.
Actually, after a few messages on this end they get dropped on the Slack end as well; we just don't get ads for SameRoom spammed to the channel to notify us.
crusoe
@mr-c:matrix.org
[m]

@cibinsb @adamnovak I started working on this, though I'm very unfamiliar with all things boto3/AWS :-)

DataBiosphere/toil#3710

@adamnovak: Any thoughts on DataBiosphere/toil#3694 ?
Adam Novak
@adamnovak
I just reviewed #3694. It looks OK to me, but it changes over to a git branch dependency on cwltool, so we couldn't release Toil if we merged it as is.
crusoe
@mr-c:matrix.org
[m]
Right, just wanted to confirm the general direction before we make the changes on the cwltool side
Will update with a released version of cwltool before merging
crusoe
@mr-c:matrix.org
[m]
Huzzah!
@cibinsb: Hopefully we will fix the issue for good, so no documentation update will be needed
Ian
@ionox0
what is the difference between --workDir and the TMPDIR environment variable?
crusoe
@mr-c:matrix.org
[m]
the latter is used by the former as a fallback
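A minimal sketch of that fallback order (illustrative, not Toil's exact code): an explicit --workDir wins, otherwise TMPDIR decides where temporary files go, and the system default is used last.

import os
import tempfile

def resolve_work_dir(work_dir_option=None):
    """Return the directory to use for temporary files, Toil-style."""
    if work_dir_option:
        return work_dir_option  # --workDir was given explicitly
    # tempfile.gettempdir() already honours TMPDIR, so this covers the TMPDIR fallback
    return os.environ.get("TMPDIR") or tempfile.gettempdir()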
Ian
@ionox0
:thumbsup:
Kevin Chau
@kkchau
Hello, new to Toil here. I was curious why the default AMI was CoreOS (and now Flatcar) instead of something like Ubuntu LTS?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] CoreOS and Flatcar update themselves, and are designed to just sit there and run containers, I think was the logic. They also have a nice easy way to fetch the current best AMI to use. For Ubuntu, the system kind of expects an administrator to exist and periodically come by and sudo apt upgrade.
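As an aside, the "nice easy way to fetch the current best AMI" can be sketched roughly like this; the feed URL and JSON layout below are assumptions based on Flatcar's public release feed, not necessarily what Toil does internally:

import json
import urllib.request

FLATCAR_AMI_FEED = "https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_all.json"

def latest_flatcar_ami(region="us-west-2"):
    """Look up the current stable Flatcar HVM AMI for a region."""
    with urllib.request.urlopen(FLATCAR_AMI_FEED) as resp:
        feed = json.load(resp)
    for entry in feed["amis"]:
        if entry["name"] == region:
            return entry["hvm"]
    raise ValueError(f"no Flatcar AMI listed for region {region}")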
Kevin Chau
@kkchau
Makes perfect sense, thank you
Martín Beracochea
@mberacochea
Hi, I have a small issue with toil (toil-cwl-runner is the one we use). I can't run it with a shared --workDir because the only shared filesystems we have are GPFS or NFS (https://github.com/DataBiosphere/toil/issues/3497#issuecomment-803021190). That is OK, I can use the workers' tmp directories just fine. The problem is that it's very hard to debug when a job fails, as the logs are not reported to the leader. Would it be possible to use the --writeLogs directory here -> https://github.com/DataBiosphere/toil/blob/master/src/toil/batchSystems/abstractBatchSystem.py#L369 , so that we can record all the logs? Thanks
1 reply
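A rough sketch of the idea in the message above (names and the call site are hypothetical, not Toil's actual batch-system API): have the batch system copy a failed worker's log into the --writeLogs directory so it survives even without a shared --workDir.

import os
import shutil

def save_worker_log(worker_log_path, write_logs_dir, job_name, batch_job_id):
    """Copy a worker's log into the --writeLogs directory, if one was configured."""
    if not write_logs_dir or not os.path.exists(worker_log_path):
        return  # --writeLogs not set, or the worker never produced a log
    os.makedirs(write_logs_dir, exist_ok=True)
    dest = os.path.join(write_logs_dir, f"{job_name}_{batch_job_id}.log")
    shutil.copyfile(worker_log_path, dest)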
Lon Blauvelt
@DailyDreaming
@mberacochea If you have a solution that works for you, then we would happily accept and review a PR to add it into the main code base.
Martín Beracochea
@mberacochea
Super then, I'll send the PR as soon as I can. Thanks!
Lon Blauvelt
@DailyDreaming
Thank you!
anthfm
@anthfm
Hi, I am using toil-cwl-runner to parallelize jobs across a few nodes on a Slurm cluster. However, I keep receiving the following error: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. I increased outdirMin in my CWL file, but it still shows the error with 56.0 KiB used and 0 B requested. I am not sure how to approach this error further? Thank you.
6 replies
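For the non-CWL case the error message mentions, the disk requirement is set where the job is defined in Python; a minimal, generic example (not the asker's workflow):

from toil.common import Toil
from toil.job import Job

def do_work(job, n):
    return n * 2

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("file:jobStore")
    # Ask for 10 GiB of disk (plus memory and cores) so the worker has headroom for outputs.
    root = Job.wrapJobFn(do_work, 21, disk="10G", memory="2G", cores=1)
    with Toil(options) as toil:
        print(toil.start(root))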
Marcel Loose
@gmloose

Hi, one of my Toil jobs, using Slurm as the batch system, crashed with the following error:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 3231, in main
    outobj = toil.start(wf1)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 841, in start
    return self._runMainLoop(rootJobDescription)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 246, in run
    self.innerLoop()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/leader.py", line 667, in innerLoop
    assert self.toilState.successorCounts == {}
AssertionError

Has anyone seen this error before?

crusoe
@mr-c:matrix.org
[m]
@gmloose: it is an old check, but not one that I've seen triggered before. Can you open an issue with your version of toil and your CWL workflow + inputs?
Adam Novak
@adamnovak
Hmm. Which job is it complaining about @anthfm?
Adam Novak
@adamnovak
@gmloose That looks like some of the bookkeeping structures we use for working out when jobs with multiple predecessors have all their predecessors done. If it's not empty at the end of the workflow, we conclude something went wrong and bail. We think there are race conditions in the core scheduler that can cause some updates to clobber other updates; I'm trying to refactor that part so this sort of thing can't happen, but that's not ready yet. If you have the whole log at debug level we might be able to find what got dropped, which might help us figure out what the race condition is. Otherwise the best thing to do might be to --restart with the same --jobStore and resume the workflow.
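A minimal sketch of that restart path, following the usual Toil Python pattern (the script and job here are illustrative): rerun with --restart and the same --jobStore so toil.restart() resumes instead of starting over.

from toil.common import Toil
from toil.job import Job

def hello(job):
    return "done"

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()   # e.g. ./script.py file:jobStore --restart
    with Toil(options) as toil:
        if options.restart:
            result = toil.restart()                    # resume the interrupted workflow
        else:
            result = toil.start(Job.wrapJobFn(hello))  # first run
        print(result)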
Marcel Loose
@gmloose
It's quite a complicated workflow, consisting of a number of sub-workflows and several tens of steps (commands and expressions). The actual input files that need to be processed (not the JSON file) are huge. I don't have debug logs of this particular run, and it is the first time I have ever seen this error message. I can try to restart it (with --logDebug) and see if it fails again.
crusoe
@mr-c:matrix.org
[m]
@gmloose: Thanks, do let us know what happens!
Marcel Loose
@gmloose
As was (almost) to be expected, the restarted workflow completed without error. So this type of error is probably very hard to track down.
anthfm
@anthfm
Hi, I have a CWL workflow (subworkflow with scatter) that takes input files, processes them with step1, and then uses the step1 output to run step2, producing the final output files. It executes with no errors using Toil; however, I noticed that after the step1 and step2 jobs are executed and completed, Toil re-issues empty step1 jobs that terminate successfully immediately. Is this normal behaviour for CWL scatter/subworkflows using Toil? Thank you
Adam Novak
@adamnovak
@anthfm That's normal behavior for Toil; jobs are issued once on the way down through the workflow graph, and then again on the way back up to do some cleanup. It shouldn't actually redo any CWL work, although the resource requirements might be excessive for the cleanup because I think it still asks for the same as the original job.
1 reply
Martín Beracochea
@mberacochea
Hey, I have a CWL pipeline which I run in LSF. I'm using TOIL_LSF_ARGS to set the queue, all good. But, I want to run one step in a different queue... is this possible?
crusoe
@mr-c:matrix.org
[m]
@mberacochea just a single step and skip the rest of the workflow?
There is a cwltool option to extract a single step from a workflow and either print it out or run it
I forget if we exposed that in toil-cwl-runner. If we didn't, you can use cwltool --single-step name --print-subgraph and then take the result to toil-cwl-runner
Martín Beracochea
@mberacochea
hi @mr-c:matrix.org , I need to run the whole workflow (but that step has to run in a different queue).