crusoe
@mr-c:matrix.org
[m]
you have to run all the tests to see how the other changes may have impacted the code coverage. You aren't doing anything wrong, no.
crusoe
@mr-c:matrix.org
[m]

So when I run make diff-cover on that PR locally I get:

cwltool/command_line_tool.py (62.5%): Missing lines 207,256-257

Which matches what codecov.io reports, so that is good 🙂
Marcel Loose
@gmloose
I've tried to get my head around what exactly is going on in revmap_file and the new test. I get the impression that the test (in its current setup) can only check that what you put in as filename, also gets out (i.e. the external filename representation). I guess that's why only the if clause is covered by the test. I guess the else clause will only be executed if you supply an internal filename representation (at least, that's what I'm guessing right now). I'm not sure how I would have to supply an internal filename representation in that current test, because it uses a CommandLineTool, which is an external thingy.
crusoe
@mr-c:matrix.org
[m]
internal in this case refers to a path within a software (docker) container
(if that works then we can collapse the code duplication later, don't worry about that for now)
Marcel Loose
@gmloose
Only adding a DockerRequirement doesn't help much. I still have an empty scheme in line 203 of command_line_tool.py. I know I can call as_uri() on the Path variable that is used in the test, but I don't know how to tweak the test to do so without completely breaking it.
crusoe
@mr-c:matrix.org
[m]
Ah, for that set RuntimeContext.outdir to some file:/// reference to a tmpdir
(tmp_path / "outdir").as_uri()
hmm.. no, that doesn't work
That is some very old code, the check for a scheme in outdir
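For readers following along, the "empty scheme" mentioned above is what URL parsing reports for a plain filesystem path. Here is a minimal stdlib-only sketch of the difference. It does not touch cwltool or RuntimeContext at all; `scheme_of` and the `outdir` directory name are hypothetical, chosen only to mirror the message above.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def scheme_of(location: str) -> str:
    """Return the URL scheme of a location string ('' when there is none)."""
    return urlsplit(location).scheme


with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "outdir"  # hypothetical directory name
    plain = str(outdir)            # plain filesystem path
    as_uri = outdir.as_uri()       # Path.as_uri() requires an absolute path

    # Show the scheme for each spelling of the same location.
    print("plain path scheme:", repr(scheme_of(plain)))
    print("as_uri() scheme:  ", repr(scheme_of(as_uri)))
```

On a POSIX system the plain path has an empty scheme and the as_uri() form has the scheme file. As noted in the conversation, passing the URI form as outdir did not actually exercise the uncovered branch, so this only illustrates the terminology.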
crusoe
@mr-c:matrix.org
[m]
Toil-cwl-runner doesn't use that. Maybe arvados-cwl-runner does? Paging @tetron ..
@gmloose: Let's ignore that part for the moment (and thanks for looking into this!); what about the other two lines that have no coverage?
huh, even more ancient code, from January 2018..
I'm tempted to remove both uncovered branches...
crusoe
@mr-c:matrix.org
[m]
@gmloose: I've removed the unused code and added a docstring; thanks for the reminder about this PR!
Marcel Loose
@gmloose
So, it's ready to be merged? That would be great. I'm one of the first that would like to give it a test spin.
crusoe
@mr-c:matrix.org
[m]
If all the CI tests pass, I'll merge and make a new release; yep 🙂
Marcel Loose
@gmloose
Great!
Marcel Loose
@gmloose
Thanks. I'll give it a go today.
Marcel Loose
@gmloose
BTW: This probably means that not only issue common-workflow-language/cwltool#1445 can be closed, but also common-workflow-language/cwltool#1260 and common-workflow-language/cwltool#1098.
crusoe
@mr-c:matrix.org
[m]
@gmloose: Huzzah, thanks for noticing!
Marcel Loose
@gmloose
Maybe a bit of a naive question. But why is it that the Toil Workflow progress bar doesn't know the total number of jobs to run beforehand? The workflow is validated and parsed completely before processing starts, right? However, when I run my workflow, I see the total number of jobs increasing during the run. This makes the progress bar quite useless for measuring progress.
crusoe
@mr-c:matrix.org
[m]
I think that feature was made for traditional (Python only) Toil workflows; it probably needs some work for toil-cwl-runner
Adam Novak
@adamnovak
@gmloose I didn't really understand the leader when I wrote it, and I wasn't willing to add any sort of traversal of the job graph. So we're using just the jobs that are currently ready to run as the progress denominator: https://github.com/DataBiosphere/toil/blob/eb2ae8365ae2ebdd50132570b20f7d480eb40cac/src/toil/leader.py#L854-L856
Maybe we should really be using just every job ID the leader has ever heard of, now that we have a cache in the ToilState that ought to have copies of everything.
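To make the two choices of denominator concrete, here is a toy model. It is not Toil's leader code; the event names ("discovered", "ready", "done") and the function `progress_totals` are hypothetical. It only assumes that the leader can learn a job ID before that job becomes ready to run.

```python
from typing import Iterable, List, Set, Tuple


def progress_totals(events: Iterable[Tuple[str, str]]) -> List[Tuple[int, int, int]]:
    """Replay (kind, job_id) events and record, after each one,
    (jobs done, jobs that have ever been ready, job IDs ever heard of)."""
    discovered: Set[str] = set()
    ready: Set[str] = set()
    done: Set[str] = set()
    snapshots = []
    for kind, job_id in events:
        if kind == "discovered":
            discovered.add(job_id)
        elif kind == "ready":
            discovered.add(job_id)  # a ready job has necessarily been heard of
            ready.add(job_id)
        elif kind == "done":
            done.add(job_id)
        else:
            raise ValueError(f"unknown event kind: {kind}")
        snapshots.append((len(done), len(ready), len(discovered)))
    return snapshots


# A three-job chain a -> b -> c whose IDs are all known up front.
events = [
    ("discovered", "a"), ("discovered", "b"), ("discovered", "c"),
    ("ready", "a"), ("done", "a"),
    ("ready", "b"), ("done", "b"),
    ("ready", "c"), ("done", "c"),
]
for done_count, ready_total, seen_total in progress_totals(events):
    print(f"done={done_count} ready-based total={ready_total} seen-based total={seen_total}")
```

In this model the ready-based total creeps up as each job in the chain becomes runnable, which matches the growing total described above, while the seen-based total is stable once every job ID is known. Jobs that are only created at run time would still make either total grow.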
Marcel Loose
@gmloose
I know this is against the idea behind Toil doing its utmost best to do as much work as possible and provide a means to recover from (spurious) errors in one of the steps in your workflow, but ... is it possible to let Toil fail early and let it exit upon the first error it encounters? This would be very helpful in debugging.
Marcel Loose
@gmloose

I encounter NFS issues when running Toil with Slurm on one of our clusters, causing jobs to fail. Two typical tracebacks follow below:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/bin/_toil_worker", line 8, in <module>
    sys.exit(main())
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 710, in main
    with in_contexts(options.context):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 684, in in_contexts
    with manager:
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractBatchSystem.py", line 505, in __enter__
    self.arena.enter()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 438, in enter
    with global_mutex(self.workDir, self.mutex):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 340, in global_mutex
    fd_stats = os.fstat(fd)
OSError: [Errno 116] Stale file handle

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/deferred.py", line 215, in cleanupWorker
    robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 51, in robust_rmtree
    robust_rmtree(child_path)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 64, in robust_rmtree
    os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/project/rapthor/Share/prefactor/L667520/working/f7a704078c8f54fc8a7ccb44a8d5d5f6/deferred/.nfs00000000000e74070000e57a'

Both types of error seem to occur during clean-up.
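As an illustration of how clean-up code could tolerate these two errno values, here is a hedged sketch of an unlink with a short retry loop. It is not what Toil's robust_rmtree does; `unlink_with_retry`, its parameters, and the retry policy are all hypothetical. Retrying may also not help when another process still holds the `.nfs*` file open.

```python
import errno
import os
import time

# The errno values from the tracebacks above: EBUSY (16) and ESTALE (116 on Linux).
TRANSIENT_NFS_ERRNOS = {errno.EBUSY, errno.ESTALE}


def unlink_with_retry(path, attempts=3, delay=0.5, unlink=os.unlink):
    """Try to delete `path`, retrying on EBUSY/ESTALE.

    Returns True if the file is gone, False if it was still busy or stale
    after the last attempt. Any other OSError is re-raised unchanged.
    The `unlink` parameter exists only so the function can be tested.
    """
    for attempt in range(attempts):
        try:
            unlink(path)
            return True
        except FileNotFoundError:
            # Someone else already removed it, which is fine for clean-up.
            return True
        except OSError as err:
            if err.errno not in TRANSIENT_NFS_ERRNOS:
                raise
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A caller could log a warning when this returns False instead of failing the whole job, since a leftover `.nfs*` file usually disappears once the last open handle is closed.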

Rohith B S
@rohith-bs
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
Exit reason: None
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
[2021-11-01T11:50:54+0530] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl with ID kind-JobFunctionWrappingJob/instance-l0jq0ypl to 2
Kindly suggest the reason this happens. I do not see any errors as such in the execution. I am currently using the slurm batchSystem.
Adam Novak
@adamnovak

Anybody ever see anything like this from a Toil worker? Maybe @mr-c:matrix.org ?

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 376, in workerScript
        job = Job.loadJob(jobStore, jobDesc)
      File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2251, in loadJob
        job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
      File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 1876, in _unpickle
        runnable = unpickler.load()
    AttributeError: 'Comment' object has no attribute '_end'

I'm trying to run some CWL CI tests, which broke when we rebuilt our Gitlab, with a local leader against our Kubernetes, and I'm getting this. I'd say it's a cwltool version mismatch, but as far as I can tell I have cwltool==3.1.20211020155521 in both my container and my leader virtualenv. Does CWL have a Comment object that recently grew or lost a _end?
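One way to rule a version mismatch in or out is to print the installed versions from the same interpreter on both sides and compare them. A minimal sketch, assuming Python 3.8 or newer for `importlib.metadata` (the tracebacks above are from Python 3.6, where the `importlib_metadata` backport would be needed instead); `installed_versions` is a hypothetical helper, not part of Toil or cwltool.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run this in the leader virtualenv and inside the worker container,
    # then diff the two outputs.
    for name, version in installed_versions(["toil", "cwltool"]).items():
        print(f"{name}=={version}")
```

Since unpickling rebuilds objects from whatever classes are importable on the worker, it may be worth adding the dependencies of cwltool to the list as well.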

Kevin Hannon
@kannon92

Hey,

My team is working on a cloud/hpc platform for image processing. We are using toil and CWL for our hpc platform. One request on our platform is to be able to monitor individual steps of a workflow.

We are calling toil via a rest api and submitting our workflows to a slurm cluster. I would like to be able to monitor individual steps of the workflow but I don't know how I can know ahead of time what the name of the job would be. I was hoping to use squeue or sacct to get the status of each step of the workflow.

For our cloud platform, we use the ID of the workflow and we can query Argo to get the status of each step of the workflow.

I'm hoping to be able to give each step of the workflow a name such as echo for the echo step and sleep for the sleep step. The workflow is below:

cwlVersion: v1.0
class: Workflow
id: echo-sleep
inputs:
  echoMessage: string
  sleepParam: int
outputs: {}
steps:
  echo:
    run: echo.cwl
    in:
      hello: echoMessage
    out: []
  sleep:
    run: sleep.cwl
    in:
      sleepTime: sleepParam
    out: []

Currently, the jobName would be toil_job_3_CwlJob. I would like to modify this behavior so that the name of the step is the name of the job. Is there any suggestion on allowing this?

crusoe
@mr-c:matrix.org
[m]
Hello @kannon92; we do try to set good job names https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1862 so that needs investigating
It does look like the whole section at https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1840-L1847 and https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1856-L1866 could be revised to set the names differently, but I think at this point we might not have easy access to what the workflow step's key was; we might need to pass it into the CWLJob constructor from whatever code is looping over the workflow.
@kannon92 You should probably open a Toil feature request issue for this, and spec out e.g. exactly what you want to happen when a workflow calls another workflow and they have steps with the same names.
Adam Novak
@adamnovak
How exactly are you observing the Toil job names? Are you independently querying Slurm? Maybe we just need to pass more pieces of the Toil internal job names to Slurm when naming the jobs for Slurm.
And it gets composed with the batch-system integer ID for the job and a prefix here: https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/batchSystems/slurm.py#L266
Adam Novak
@adamnovak
I think we would want to keep toil_job_3_ at the front, but we could change the cwltoil code to use the step key instead of or in addition to the tool ID in that last piece, right @mr-c:matrix.org ?
And we might want to tell Slurm instanceName instead of jobName anyway.
jobName really should be the tool ID and instanceName would be tool ID x step ID
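The naming scheme being discussed can be sketched as follows. This does not reproduce the code in slurm.py or cwltoil.py; both functions and the "." separator are hypothetical. The only thing taken from the conversation is the shape of the current name, `toil_job_3_CwlJob`, that is, a prefix, the batch-system integer ID, and a job name.

```python
from typing import Optional


def slurm_job_name(batch_job_id: int, toil_job_name: str, prefix: str = "toil_job") -> str:
    """Compose a Slurm job name from a prefix, the batch-system ID, and a job name."""
    return f"{prefix}_{batch_job_id}_{toil_job_name}"


def step_aware_name(tool_id: str, step_key: Optional[str] = None) -> str:
    """Hypothetical last piece of the name: the tool ID, plus the step key if known."""
    return f"{tool_id}.{step_key}" if step_key else tool_id


# The current behaviour described above.
print(slurm_job_name(3, "CwlJob"))
# One possible shape for the proposal, using the echo step of the example workflow.
print(slurm_job_name(3, step_aware_name("echo.cwl", "echo")))
```

Keeping the `toil_job_3_` part at the front means existing tooling that matches on the prefix keeps working, while the last piece becomes something a user can recognize. What to do about nested workflows with identical step names is left open here, as it is in the discussion.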
crusoe
@mr-c:matrix.org
[m]
Who is making the issue for this?
Kevin Hannon
@kannon92

@mr-c:matrix.org and @adamnovak thanks for pointing out in the code where that is. I will poke around with logging to see what is going on.

So I have a javascript process that runs the toil-cwl-runner. I saw that I can specify various batchsystems so I was hoping that I could use toil to get job ids and status for the individual jobs.

Currently, I have nothing tying me into an HPC system in this pattern. But I notice that toil doesn't really report the status of the jobs. If I do toil status jobStore, it says what jobs are running but it doesn't really tell me if the job is pending/running/submitting. So I think I would have to query an HPC CLI to get the status of the job.

So I am thinking I would have a separate process that runs a status check to verify the status of each step of the workflow. Since we are really aiming to support Slurm right now, we would have to use sacct or squeue to get the status of the job. But I realize I don't know which job IDs correspond to which steps.

@adamnovak I'd be up for making a feature request for this but it also sounds like this is a defect? I can make an issue on the github page if you all want.
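A separate status checker along the lines described above might look like the sketch below. It assumes squeue is on the PATH and that the `%i`, `%j`, and `%T` format fields (job ID, job name, state) behave as documented in the squeue man page. The `toil_job_` prefix comes from the job name quoted earlier. `parse_squeue` and `query_squeue` are hypothetical helpers, and mapping a Slurm job back to a workflow step still depends on the job name carrying the step key.

```python
import subprocess
from typing import Dict, Tuple


def parse_squeue(output: str, prefix: str = "toil_job_") -> Dict[str, Tuple[str, str]]:
    """Parse 'jobid|name|state' lines into {job name: (Slurm job ID, state)}.

    Only jobs whose name starts with `prefix` are kept. If two jobs share
    a name, the later line wins.
    """
    jobs: Dict[str, Tuple[str, str]] = {}
    for line in output.splitlines():
        line = line.strip()
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        job_id, rest = line.split("|", 1)
        name, state = rest.rsplit("|", 1)
        if name.startswith(prefix):
            jobs[name] = (job_id, state)
    return jobs


def query_squeue(user: str) -> str:
    """Run squeue for one user and return its raw output (requires Slurm)."""
    result = subprocess.run(
        ["squeue", "--noheader", f"--user={user}", "--format=%i|%j|%T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Parsing can be tried without a cluster by feeding it sample text.
sample = "101|toil_job_3_CwlJob|RUNNING\n102|unrelated|PENDING\n"
print(parse_squeue(sample))
```

squeue only lists jobs that are pending, running, or recently finished, so a checker that also needs final states would have to fall back to sacct for completed jobs.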
crusoe
@mr-c:matrix.org
[m]
Yep; feel free to open two separate issues
Kevin Hannon
@kannon92
I'll post two issues on Monday.