crusoe
@mr-c:matrix.org [m]
Google just "alerted" me to this 11 month old toil-cwl-runner question on biostars https://www.biostars.org/p/448085
Lon Blauvelt
@DailyDreaming
@mr-c:matrix.org Thanks for the link! Good workaround hack, I agree. We should definitely make it less hacky. I'll create a Toil-side issue.
mareq
@mareq
Hello, I am not sure if this is the correct place to ask, but anyway: I would like to run CWL workflows in the Azure cloud, and Toil seems like one of the good ways to go. The k8s option looks quite good (based on the documentation I have read), except it says that it requires AWS for the job store. Hence the question: is there some other way (possibly even without k8s: it would be great to do it in a way independent of any particular cloud, but that is not a hard requirement) to use Toil for running CWL workflows on Azure and Azure only?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] @mareq, we used to have an Azure job store and an Azure template for stamping out clusters, but we dropped them. The AWS job store is not able to run on top of just any S3-compatible storage system; we currently also need SimpleDB, and we're swapping that out for DynamoDB which also isn't part of S3.
[Adam Novak, UCSC GI] In the future to run on Azure we would use the Kubernetes batch system, because everybody has a good way to provide Kubernetes, but there's still the need for a job store. The only one that isn't tied to a cloud provider is the one that requires a shared filesystem.
[Adam Novak, UCSC GI] We might be able to get the file job store working with Kubernetes if you can get all your nodes a shared strongly-consistent filesystem (which NFS sort of isn't) and we can get it exposed to the Toil kubernetes pods, maybe with some kind of extra host path volume.
[Adam Novak, UCSC GI] But currently we don't have a way to do what you want out of the box, unfortunately.
mareq
@mareq
@DailyDreaming throwing in additional assumptions: the workflows I need to run are tiny, and the data they process is on the order of MBs for each job, so I do not really need a super-distributed system (performance is not an issue at all). I just need some CWL-running backend to put behind an Azure-based webapp, which allows its users to run CWL workflows one job at a time (the webapp itself is multi-user, so there will be parallelism coming from there, but I guess that is just an irrelevant detail for the purpose of this discussion). Based on your answer, this would fly with some sort of filesystem job store, right?
Lon Blauvelt
@DailyDreaming
@mareq If you want to just run toil scripts as single_machine on an Azure instance, that would work. But we no longer have any native Azure support (dropped due to funding): DataBiosphere/toil#2860
mareq
@mareq
i see. thank you :thumbsup:
Peter Amstutz
@tetron
hi @mareq for running CWL workflows on Azure another option is http://arvados.org
Douglas Lowe
@douglowe
I'm testing a CWL workflow with Toil and Slurm, and finding that my runtime is being dominated by queue wait times, as a lot of the tasks are preparatory work for the main task and don't take a lot of time to run. Is there a way of asking toil-cwl-runner to submit tasks together within the same job, or (as I'm running the Toil job manager in its own serial job) can I mark jobs that should be run within the primary job (I guess the reverse of the intent of --runCwlInternalJobsOnWorkers)?
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] Hm. Toil has a "chaining" feature where single direct child or follow-on jobs with the same resource requirements will run in the same execution as the previous job. If you set your resource requirements so the child jobs are no larger than the parent ones, you might be able to use that.
[Adam Novak, UCSC GI] To get stuff to run on the Toil leader, right now we match on particular internal job names for jobs that are part of Toil's CWL interpreter machinery. We don't have a feature currently to let you add other particular jobs to that set.
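The chaining condition Adam describes can be sketched roughly like this (a simplified illustration only, not Toil's actual worker code, which also considers job structure, not just resource numbers):

```python
def can_chain(parent, child):
    """Simplified sketch of Toil's chaining condition: a single direct child
    (or follow-on) whose resource requirements fit within the parent's can run
    in the same worker invocation, skipping another trip through the queue."""
    return all(child[req] <= parent[req] for req in ("cores", "memory", "disk"))

parent      = {"cores": 2, "memory": 4 * 1024**3, "disk": 10 * 1024**3}
small_child = {"cores": 1, "memory": 4 * 1024**3, "disk": 10 * 1024**3}
big_child   = {"cores": 4, "memory": 8 * 1024**3, "disk": 10 * 1024**3}

print(can_chain(parent, small_child))  # True: may chain on the same worker
print(can_chain(parent, big_child))    # False: goes back through the scheduler
```

This is why Douglas's plan below (making child requirements no larger than the parent's) is the lever available to the workflow author.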
Douglas Lowe
@douglowe
Possibly the 'chaining' feature might be the best approach for me to try. I'll try explicitly setting resource requirements for some of the child jobs, and see if this causes them to share executions.
Ian
@ionox0
Would anyone have any guess as to how I might be getting file prefixes that look like _:file:///... in my workflow?
    FileNotFoundError: [Errno 2] No such file or directory: '_:file:///juno/work/access/testing/users/johnsoni/access_qc/work5/f26b1779cb2f5ea39419d80a2066faf9/e81b/360a/tmpb4cbwihh/out/simplex_bam_pool_a_dir'
I'm on toil 5.4
most of them look like
toilfs:16305:0:files/for-job/kind-CWLJob/instance-7ousmklr/file-184521b84be2407b8ea6ef01398c6f81/histogram.pdf
Ian
@ionox0
but some of them, specifically Directory objects, have the _:file:// prefix
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @ionox0 This is part of the not-actually-correct logic that Toil has for preparing directories for CWL workflows. The underscore prefix is something that cwltool uses that indicates a directory is to be created to present to a CWL tool. Toil tries to generate this to control cwltool, but in practice what we have in 5.4 only works when running on a single machine or otherwise using a shared filesystem between nodes.

I've been redoing all that logic in DataBiosphere/toil#3628 so that Toil can be responsible for setting up the directory structures that CWL tools expect to see, whether there's a shared filesystem or not, but I still don't have it fully working yet. When it's done, it should be much harder to break.
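The two location prefixes seen in Ian's logs can be illustrated like this (a hypothetical helper for clarity, not Toil's or cwltool's actual code; the example paths are made up):

```python
def classify_location(location):
    """Hypothetical sketch: classify a CWL File/Directory location string.
    '_:'      -> a directory literal that cwltool must create on disk;
    'toilfs:' -> a file managed by Toil's job store."""
    if location.startswith("_:"):
        return "literal-directory"
    if location.startswith("toilfs:"):
        return "toil-managed-file"
    return "ordinary-path"

print(classify_location("_:file:///tmp/out/some_dir"))          # literal-directory
print(classify_location("toilfs:1:0:files/example/report.pdf"))  # toil-managed-file
```

The bug discussed here is that a `_:`-prefixed location can leak out to code that expects a real on-disk path, producing the FileNotFoundError above.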

Ian
@ionox0
Thanks for the quick response, that makes sense. I'm getting this issue on a shared filesystem, but perhaps our workflow is a bit complicated in this case, because I'm dealing with a Directory object that also has Directory objects inside of it.
For now, would you suggest we avoid using Directory objects in 5.4?
Ian
@ionox0
I should note that the issue I'm having is specific to the InitialWorkDirRequirement, so I've gotten around it by using regular CWL inputs. I'll try to submit a minimal reproducible example as an issue.
Vijay Lakhujani
@v-lakhujani
(image attachment: image.png)
Can someone suggest what exit code 120 is pointing to?
Ian
@ionox0
if you are on an HPC system you should check the exit reason given by your scheduler
for that job ID
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] I think 120 is what you get when you treat -9 in a byte as unsigned, so that suggests a signal 9 (SIGKILL) killing the process. This will happen when an angry admin, an HPC batch system, or the out of memory killer kills your Toil job.
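As a point of reference, how a signal death is reported depends on who is reporting it, and batch systems may remap the code before you see it. A small sketch of the two common conventions (POSIX only):

```python
import signal
import subprocess
import sys

# The child kills itself with SIGKILL, mimicking the OOM killer or an admin.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(proc.returncode)       # -9: Python reports the negative signal number
print(128 + signal.SIGKILL)  # 137: the usual shell-reported exit status
```

Checking the scheduler's own exit reason for the job, as Ian suggests, is the most reliable route.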
Rupert Nash
@rupertnash
Hello - I'm trying out toil-cwl-runner for running workflows containing "real HPC" steps (by HPC I mean MPI across O(100) nodes). First question: are there any plans to update the version of cwltool used by toil? (I also need some of the improvements in SoftwareRequirement handling)
Cibin S B
@cibinsb

Hi all,
I'm trying to run toil on the internal Kubernetes cluster; the following is the command I used:

toil-cwl-runner --logDebug --enable-dev --batchSystem kubernetes --jobStore aws:us-east-1:toil-test --stats --singularity --defaultCores 1 md_launch.cwl md_list_input_descriptions.yml

but I'm getting a permission error

<Response><Errors><Error><Code>AuthorizationFailure</Code><Message>User (arn:aws:iam::07445xxxxxx:user/cibin) does not have permission to perform (sdb:Select) on resource (arn:aws:sdb:us-east-1:074455289529:domain/toil-registry). Contact account owner.</Message><BoxUsage>0.0000137200</BoxUsage></Error></Errors><RequestID>0e71b5c9-150a-b570-a5cf-31f1f751abca</RequestID></Response>

Toil version is 5.4.0

crusoe
@mr-c:matrix.org [m]
Also, does it make sense to put the Jobstore in AWS? Seems expensive to not have it local
crusoe
@mr-c:matrix.org [m]
So until a local/k8s jobstore option is added, I wouldn't recommend using toil-cwl-runner with Kubernetes in production unless you are on AWS. However, I'm sure assistance with adding other jobstore backends would be very welcome!
If you are quite keen on k8s @cibinsb, Arvados has preliminary support and manages its own data (but please note the current limitations) https://doc.arvados.org/v2.2/install/arvados-on-kubernetes.html
crusoe
@mr-c:matrix.org [m]
Not CWL v1.2; they are using an old version of the reference runner under the hood (1.0.20191022103248, so from 2019-10-22)
Douglas Lowe
@douglowe
ahh - that wouldn't be suitable for @cibinsb's workflow then - he's testing out the same workflow as I am on HPC - which requires v1.2
I got over keen on using conditionals when writing it
crusoe
@mr-c:matrix.org [m]
@douglowe @cibinsb maybe let them know you'd like to see CWL v1.2 over in https://gitter.im/reanahub/reana ?
Douglas Lowe
@douglowe
@mr-c:matrix.org - that is a good plan - thanks for opening issues with them for this :)
Lon Blauvelt
@DailyDreaming
[Lon Blauvelt, UCSC GI] Hi cibinsb. That looks like a SimpleDB permissions issue. SimpleDB needs special permissions set in AWS, so that might be the problem. Can you try "aws sdb list-domains" and see if that works?
Lon Blauvelt
@DailyDreaming

[Adam Novak, UCSC GI] @cibinsb, if you are setting up your own AWS roles/credentials (instead of using toil launch-cluster), you need to make sure you are granting access to SimpleDB in addition to access to S3, for the AWS job store to work.

As described in https://toil.readthedocs.io/en/latest/running/cloud/kubernetes.html#aws-job-store-for-kubernetes you need to grab some AWS credentials, put them in a Kubernetes secret, and use TOIL_AWS_SECRET_NAME when you run the workflow, to grant the workers access. You also need to make sure the leader has access, either by running it in a pod with the secret mounted in to ~/.aws, or by setting up ~/.aws on whatever non-pod machine you are running the leader on.

[Adam Novak, UCSC GI] Lon is working on a better AWS job store that will work without SimpleDB, and so ought to be usable with any S3 clone you could deploy on your Kubernetes cluster, but until that's done Kubernetes only works with genuine AWS S3 and SimpleDB.
Lon Blauvelt
@DailyDreaming
[Adam Novak, UCSC GI] @cibinsb, are you sure you have credentials in ~/.aws/credentials on cibins-beast-13-9380 that correspond to an IAM user with access to S3 and SimpleDB?
[Adam Novak, UCSC GI] Also are you sure you own the bucket named toil-test--files? That sounds like a very generic name, and bucket names must be unique across all of AWS. It is quite possible that someone else is already using the jobstore named toil-test and that you will have to pick a different, unique name.
Cibin S B
@cibinsb
I confirm that the AWS user credentials are saved in ~/.aws/credentials on cibins-beast-13-9380. The given S3 bucket name was toil-test, and I'm not sure why Toil is trying to access `toil-test--files`
Lon Blauvelt
@DailyDreaming
@cibinsb Toil creates and uses a bucket named after the jobstore plus --files and disallows -- in jobstore names because of this.
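The naming rule Lon describes can be sketched like this (a simplified illustration of the behavior, not Toil's actual implementation):

```python
def files_bucket_name(jobstore_name):
    """Sketch: Toil derives the S3 bucket for job files by appending
    '--files' to the jobstore name, which is why '--' is disallowed
    inside jobstore names (it would make the mapping ambiguous)."""
    if "--" in jobstore_name:
        raise ValueError("jobstore names may not contain '--'")
    return jobstore_name + "--files"

print(files_bucket_name("toil-test"))  # toil-test--files
```

Since bucket names are global across all of AWS, a generic jobstore name like toil-test can also collide with a bucket someone else already owns, as Adam noted above.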
Rohith B S
@rohith-bs
Traceback (most recent call last):
  File "/home/test-user/toil-scripts/script.py", line 282, in <module>
    Job.Runner.startToil(main_job, options)
  File "/usr/local/lib/python3.7/site-packages/toil/job.py", line 1743, in startToil
    return toil.restart()
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 874, in restart
    return self._runMainLoop(rootJobDescription)
  File "/usr/local/lib/python3.7/site-packages/toil/common.py", line 1132, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 229, in run
    self.innerLoop()
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 614, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 573, in _gatherUpdatedJobs
    self.processFinishedJob(jobID, exitStatus, wallTime=wallTime, exitReason=exitReason)
  File "/usr/local/lib/python3.7/site-packages/toil/leader.py", line 959, in processFinishedJob
    replacementJob = self.jobStore.load(jobStoreID)
  File "/usr/local/lib/python3.7/site-packages/toil/jobStores/fileJobStore.py", line 209, in load
    with open(jobFile, 'rb') as fileHandle:
FileNotFoundError: [Errno 2] No such file or directory: 'jobStore/jobs/kind-FunctionWrappingJob/instance-pb6jcg2c/job'