Marcel Loose
@gmloose
Only adding a DockerRequirement doesn't help much. I still have an empty scheme in line 203 of command_line_tool.py. I know I can call as_uri() on the Path variable that is used in the test, but I don't know how to tweak the test to do so without completely breaking it.
crusoe
@mr-c:matrix.org
[m]
Ah, for that set RuntimeContext.outdir to some file:/// reference to a tmpdir
(tmp_path / "outdir").as_uri()
hmm.. no, that doesn't work
That is some very old code, the check for a schema in outdir
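(For reference, a minimal standard-library illustration of the as_uri() suggestion above; the tmp_path / "outdir" names are taken from the chat, and nothing here is Toil-specific:)

```python
import tempfile
from pathlib import Path

# as_uri() turns an absolute Path into a file:// URI, the kind of value
# suggested above for RuntimeContext.outdir.
tmp_path = Path(tempfile.mkdtemp())
outdir_uri = (tmp_path / "outdir").as_uri()
print(outdir_uri.startswith("file:///"))  # True
```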
crusoe
@mr-c:matrix.org
[m]
Toil-cwl-runner doesn't use that. Maybe arvados-cwl-runner does? Paging @tetron ..
@gmloose: Let's ignore that part for the moment (and thanks for looking into this!); what about the other two lines that have no coverage?
huh, even more ancient code, from January 2018..
I'm tempted to remove both uncovered branches...
crusoe
@mr-c:matrix.org
[m]
@gmloose: I've removed the unused code and added a docstring; thanks for the reminder about this PR!
Marcel Loose
@gmloose
So, it's ready to be merged? That would be great. I'm one of the first that would like to give it a test spin.
crusoe
@mr-c:matrix.org
[m]
If all the CI tests pass, I'll merge and make a new release; yep 🙂
Marcel Loose
@gmloose
Great!
Marcel Loose
@gmloose
Thanks. I'll give it a go today.
Marcel Loose
@gmloose
BTW: This probably means that not only issue common-workflow-language/cwltool#1445 can be closed, but common-workflow-language/cwltool#1260, and common-workflow-language/cwltool#1098 too.
crusoe
@mr-c:matrix.org
[m]
@gmloose: Huzzah, thanks for noticing!
Marcel Loose
@gmloose
Maybe a bit of a naive question. But why is it that the Toil Workflow progress bar doesn't know the total number of jobs to run beforehand? The workflow is validated and parsed completely before processing starts, right? However, when I run my workflow, I see the total number of jobs increasing during the run. This makes the progress bar quite useless for measuring progress.
crusoe
@mr-c:matrix.org
[m]
I think that feature was made for traditional (Python only) Toil workflows; it probably needs some work for toil-cwl-runner
Adam Novak
@adamnovak
@gmloose I didn't really understand the leader when I wrote it, and I wasn't willing to add any sort of traversal of the job graph. So we're using just the jobs that are currently ready to run as the progress denominator: https://github.com/DataBiosphere/toil/blob/eb2ae8365ae2ebdd50132570b20f7d480eb40cac/src/toil/leader.py#L854-L856
Maybe we should really be using just every job ID the leader has ever heard of, now that we have a cache in the ToilState that ought to have copies of everything.
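(A toy sketch, not Toil code, of why the total grows mid-run when the denominator is only the jobs currently known to the leader:)

```python
# Illustrative only: if the progress denominator is "jobs known so far",
# the total rises whenever the leader discovers newly issued jobs.
known_jobs = {"parse", "step1"}
completed = set()

completed.add("parse")
known_jobs.update({"step2", "step3"})  # new child jobs become known mid-run

print(f"{len(completed)}/{len(known_jobs)}")  # 1/4 -- the total moved from 2 to 4
```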
Marcel Loose
@gmloose
I know this goes against the idea behind Toil doing its utmost to do as much work as possible and to provide a means to recover from (spurious) errors in one of the steps in your workflow, but... is it possible to let Toil fail early and exit upon the first error it encounters? This would be very helpful for debugging.
Marcel Loose
@gmloose

I encounter NFS issues when running Toil with Slurm on one of our clusters, causing jobs to fail. Two typical tracebacks follow below:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/bin/_toil_worker", line 8, in <module>
    sys.exit(main())
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 710, in main
    with in_contexts(options.context):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/worker.py", line 684, in in_contexts
    with manager:
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractBatchSystem.py", line 505, in __enter__
    self.arena.enter()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 438, in enter
    with global_mutex(self.workDir, self.mutex):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/threading.py", line 340, in global_mutex
    fd_stats = os.fstat(fd)
OSError: [Errno 116] Stale file handle
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/deferred.py", line 215, in cleanupWorker
    robust_rmtree(os.path.join(stateDirBase, cls.STATE_DIR_STEM))
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 51, in robust_rmtree
    robust_rmtree(child_path)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/io.py", line 64, in robust_rmtree
    os.unlink(path)
OSError: [Errno 16] Device or resource busy: b'/project/rapthor/Share/prefactor/L667520/working/f7a704078c8f54fc8a7ccb44a8d5d5f6/deferred/.nfs00000000000e74070000e57a'

Both types of error seem to occur during clean-up.

Rohith B S
@rohith-bs
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
Exit reason: None
[2021-11-01T11:50:40+0530] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl
[2021-11-01T11:50:54+0530] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/instance-l0jq0ypl with ID kind-JobFunctionWrappingJob/instance-l0jq0ypl to 2
Kindly suggest the reason this happens. I do not see any errors as such in the execution. I am currently using the Slurm batchSystem.
Adam Novak
@adamnovak

Anybody ever see anything like this from a Toil worker? Maybe @mr-c:matrix.org ?

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 376, in workerScript
        job = Job.loadJob(jobStore, jobDesc)
      File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 2251, in loadJob
        job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
      File "/usr/local/lib/python3.6/dist-packages/toil/job.py", line 1876, in _unpickle
        runnable = unpickler.load()
    AttributeError: 'Comment' object has no attribute '_end'

I'm trying to run some CWL CI tests, which broke when we rebuilt our Gitlab, with a local leader against our Kubernetes, and I'm getting this. I'd say it's a cwltool version mismatch, but as far as I can tell I have cwltool==3.1.20211020155521 in both my container and my leader virtualenv. Does CWL have a Comment object that recently gained or lost an _end attribute?

Kevin Hannon
@kannon92

Hey,

My team is working on a cloud/hpc platform for image processing. We are using toil and CWL for our hpc platform. One request on our platform is to be able to monitor individual steps of a workflow.

We are calling toil via a rest api and submitting our workflows to a slurm cluster. I would like to be able to monitor individual steps of the workflow but I don't know how I can know ahead of time what the name of the job would be. I was hoping to use squeue or sacct to get the status of each step of the workflow.

For our cloud platform, we use the ID of the workflow and we can query Argo to get the status of each step of the workflow.

I'm hoping to be able to give each step of the workflow a name such as echo for the echo step and sleep for the sleep step. The workflow is below:

cwlVersion: v1.0
class: Workflow
id: echo-sleep
inputs:
  echoMessage: string
  sleepParam: int
outputs: {}
steps:
  echo:
    run: echo.cwl
    in:
      hello: echoMessage
    out: []
  sleep:
    run: sleep.cwl
    in:
      sleepTime: sleepParam
    out: []

Currently, the jobName would be toil_job_3_CwlJob. I would like to modify this behavior so that the name of the step is the name of the job. Is there any suggestion on allowing this?

crusoe
@mr-c:matrix.org
[m]
Hello @kannon92 ; we do try to set good job names https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1862 so that needs investigating
It does look like the whole section at https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1840-L1847 and https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/cwl/cwltoil.py#L1856-L1866 could be revised to set the names differently, but I think at this point we might not have easy access to what the workflow step's key was; we might need to pass it into the CWLJob constructor from whatever code is looping over the workflow.
@kannon92 You should probably open a Toil feature request issue for this, and spec out e.g. exactly what you want to happen when a workflow calls another workflow and they have steps with the same names.
Adam Novak
@adamnovak
How exactly are you observing the Toil job names? Are you independently querying Slurm? Maybe we just need to pass more pieces of the Toil internal job names to Slurm when naming the jobs for Slurm.
And it gets composed with the batch-system integer ID for the job and a prefix here: https://github.com/DataBiosphere/toil/blob/312b6e1f221ee7f7f187dd6dbfce1aecffd00e09/src/toil/batchSystems/slurm.py#L266
Adam Novak
@adamnovak
I think we would want to keep toil_job_3_ at the front, but we could change the cwltoil code to use the step key instead of or in addition to the tool ID in that last piece, right @mr-c:matrix.org ?
And we might want to tell Slurm instanceName instead of jobName anyway.
jobName really should be the tool ID and instanceName would be tool ID x step ID
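(For illustration, the composed pattern discussed above — prefix, batch-system integer ID, then the job name, yielding toil_job_3_CwlJob — can be sketched as follows; the underscore sanitisation rule is an assumption for the sketch, not Toil's actual code:)

```python
import re

def slurm_job_name(batch_id: int, job_name: str) -> str:
    # Hypothetical sketch: prefix + batch-system integer ID + job name,
    # with characters Slurm might reject replaced by underscores (assumption).
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", job_name)
    return f"toil_job_{batch_id}_{safe}"

print(slurm_job_name(3, "CwlJob"))  # toil_job_3_CwlJob
```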
crusoe
@mr-c:matrix.org
[m]
Who is making the issue for this?
Kevin Hannon
@kannon92

@mr-c:matrix.org and @adamnovak thanks for pointing out in the code where that is. I will poke around with logging to see what is going on.

So I have a javascript process that runs the toil-cwl-runner. I saw that I can specify various batchsystems so I was hoping that I could use toil to get job ids and status for the individual jobs.

Currently, I have nothing tying me into an HPC system in this pattern. But I notice that Toil doesn't really report the status of the jobs. If I do toil status jobStore, it says which jobs are running, but it doesn't really tell me whether a job is pending/running/submitting. So I think I would have to query an HPC CLI to get the status of the job.

So I am thinking I would have a separate process that runs a status check to verify the status of each step of the workflow. Since we are really aiming for Slurm support right now, we would have to use sacct or squeue to get the status of each job. But I realize I don't know which job IDs correspond to which steps.

@adamnovak I'd be up for making a feature request for this but it also sounds like this is a defect? I can make an issue on the github page if you all want.
crusoe
@mr-c:matrix.org
[m]
Yep; feel free to open two separate issues
Kevin Hannon
@kannon92
I'll post two issues on monday.
Kevin Hannon
@kannon92

Hey, I posted the issue above to give more descriptive job names for HPC schedulers.

DataBiosphere/toil#3884

I talked with my manager and since this directly impacts my project, I have the bandwidth to fix this.

So for workflows, we are thinking toiljob{id}_workflowId.stepId. For single tools, we would use toiljob{id}_toolId. What would I use for scatter/gather job names?

Kevin Hannon
@kannon92

And I created a feature request. DataBiosphere/toil#3885

Let me know if I can clarify it further, or whether I posted it correctly.

Kevin Hannon
@kannon92

Hi, I've made some changes in my branch and I'm trying to run the unit tests. I am running into some problems, but I don't think it's related to my code.

When I do make test, I get the following 22 failures.

======================================================================================================= ERRORS ========================================================================================================
______________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/index.py ______________________________________________________________________________
Traceback (most recent call last):
  File "/Users/kevinhannon/Work/GIT/toil/src/toil/test/cwl/spec_v11/tests/index.py", line 14, in <module>
    main = open(mainfile)
FileNotFoundError: [Errno 2] No such file or directory: '--durations=0'
______________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/index.py ______________________________________________________________________________
Traceback (most recent call last):
  File "/Users/kevinhannon/Work/GIT/toil/src/toil/test/cwl/spec_v11/tests/index.py", line 14, in <module>
    main = open(mainfile)
FileNotFoundError: [Errno 2] No such file or directory: '--durations=0'
_____________________________________________________________________________ ERROR collecting src/toil/test/cwl/spec_v11/tests/search.py ______________________________________________________________________________
Traceback (most recent call last):

This also happens if I use the quick test run.

Kevin Hannon
@kannon92

Hi, I submitted a PR for issue 3884. DataBiosphere/toil#3893

I wanted to get some feedback on this. This works for my purposes but I'm not sure if I handle scatter/gather workflows or subworkflows correctly.

Viktor Gal
@vigsterkr
Hi! I'm wondering how one would pass a FileID between dependent jobs? I suppose the idea is to use promises? Namely, say I have 2 jobs where A -> B, i.e. B is A's child, and B's input depends on A's output; I'm wondering how one would pass that output of A (or rather its FileID) to B.
Marcel Loose
@gmloose
I have the impression that Toil does not clean up the files it creates in the temporary output directory (the location of which can be controlled by --tmp-outdir-prefix), at least when using Slurm as batch control system. Since I cannot use Toil's internal file store (I need to set --bypass-file-store, because some of my workflows use InplaceUpdateRequirement), I had to put this temporary output directory on a shared (NFS) disk. Can someone acknowledge this behaviour?
crusoe
@mr-c:matrix.org
[m]
That wouldn't surprise me. --bypass-file-store is quite recent, and I don't think we added any logic to figure out what can safely be deleted (not used by a downstream step) and what can't. If deferring clean-up to the end would suffice, then that won't be so hard to implement (though care must be taken not to delete InplaceUpdateRequirement files that are part of the original inputs or the final outputs).
Marcel Loose
@gmloose
Is there a way to get timestamps in the log file created with the --logFile option, like you get on the console? I couldn't find any mention of it in the description of the options. The work-around I use at the moment is to redirect all console output to a file, which makes the --logFile option quite useless in that case.
Basically, I would like to be able to somehow control the formatting of the output to logFile.
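(As a Python-level work-around, illustrative only and using just the standard logging module rather than any Toil option, a file handler can be given a timestamped format similar to what the console shows:)

```python
import logging

# Attach a file handler whose formatter includes timestamps, roughly
# mimicking the console format seen in the log excerpts above.
handler = logging.FileHandler("workflow.log")
handler.setFormatter(logging.Formatter(
    "[%(asctime)s] [%(threadName)s] [%(levelname).1s] [%(name)s] %(message)s"))

log = logging.getLogger("example")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.info("workflow started")  # written to workflow.log with a timestamp
```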