Patrick Cudahy
@pgcudahy
So if I want to figure out why a job failed, I should run sos status?
Bo
@BoPeng
Yes. The error messages are absorbed, so sos status -v4 is currently the only way to go. From a notebook, you can run %task status jobid -q queue to the same effect.
Actually, if you hover the mouse over a task, there is a little icon that submits the %task status magic for you. However, because the %run -q magic is currently blocking, that magic would not run until %run finishes.
I have been thinking of making %run non-blocking once all the tasks have been submitted.
Patrick Cudahy
@pgcudahy
That would be nice
Bo
@BoPeng
I am using this mechanism heavily these days because it makes running "small" scripts very easy. Debugging failed jobs is not particularly easy, though, and I have found myself logging into the cluster to run sos status (because %task is blocked)... definitely something that needs to be improved. Please feel free to submit tickets for problems and feature requests so that we know what the "pain points" are.
Patrick Cudahy
@pgcudahy
Thanks for walking me through things. Half of my issues are because I'm not familiar with slurm either, so it's a very steep learning curve.
Bo
@BoPeng

Yes, it will take some configuration and time to get used to, and you might want to add

module load {" ".join(modules)}

to your template, which lets you write

task: modules=['module1', 'module2']

to load particular modules for the scripts in the task. Again, feel free to let me know if you get into trouble so that we can make this process as easy and error-proof as possible.
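
For reference, that line goes into the task template of the queue definition in ~/.sos/hosts.yml. A rough sketch, with a placeholder host name and with key and placeholder names following the documented examples (they may differ from your existing template):

hosts:
  cluster:
    # existing keys such as address, paths, submit_cmd, etc. stay as they are
    task_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      #SBATCH --job-name={task}
      # load whatever the task asked for via task: modules=[...]
      module load {" ".join(modules)}
      cd {workdir}
      sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode}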

Patrick Cudahy
@pgcudahy

Okay, now I'm trying to run a real job but I'm having an issue. Per your suggestion I set a scratch path for my local machine as scratch: /data2 and for the cluster as scratch: /home/pgc29/scratch60. Then I try to run

%run test -q 'yale_hpc_slurm'
[test]
input: f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    cd /data2/helen_mixed_infection/data/tb-profiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

It ends up hanging forever with INFO: Waiting for the completion of 1 task.
sos status b5eb33955aee5a77 -v4 gives

b5eb33955aee5a77    submitted

Created 33 min ago
TASK:
=====
run(fr"""module load miniconda
conda activate tbprofiler
cd /data2/helen_mixed_infection/data/tb-profiler
tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

""")

TAGS:
=====
1a2a7669047f0dc5 notebooks test

GLOBAL:
=======
(<_ast.Module object at 0x2b96dfcc8ca0>, {})

ENVIRONMENT:
============
__signature_vars__    {'_input', 'run'}
_depends              []
_index                0
_input                [file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz'), file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz')]
_output               [file_target('/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json')]
_runtime              {'mem': 2000000000,
 'queue': 'yale_hpc_slurm',
 'run_mode': 'interactive',
 'sig_mode': 'default',
 'verbosity': 2,
 'walltime': '00:15:00',
 'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name             'test'
workflow_id           '1a2a7669047f0dc5'


b5eb33955aee5a77.sh:
====================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=b5eb33955aee5a77
#SBATCH --output=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.out
#SBATCH --error=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.err
cd /data2/helen_mixed_infection/notebooks
sos execute b5eb33955aee5a77 -v 2 -s default -m interactive


b5eb33955aee5a77.job_id:
========================
job_id: 31773933

But when I run on the cluster sacct -j 31773933

sacct -j 31773933
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31773933     b5eb33955+    general cohen_the+          1     FAILED      1:0 
31773933.ba+      batch            cohen_the+          1     FAILED      1:0 
31773933.ex+     extern            cohen_the+          1  COMPLETED      0:0
I'd expect the submitted job to have translated /data2/helen_mixed_infection/notebooks to /home/pgc29/scratch60/helen_mixed_infection/notebooks but b5eb33955aee5a77.sh has cd /data2/helen_mixed_infection/notebooks
Patrick Cudahy
@pgcudahy
Doh, I think I see it now. Going to try with
[test]
input: f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'#scratch/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001
Patrick Cudahy
@pgcudahy
That ran and finished successfully, but the output file isn't where I expected it. I think it's because the execution script didn't expand #scratch
execution script:
================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=a904366ca2a493bc
#SBATCH --output=/home/pgc29/.sos/tasks/a904366ca2a493bc.out
#SBATCH --error=/home/pgc29/.sos/tasks/a904366ca2a493bc.err
cd #scratch/helen_mixed_infection/data/tb-profiler
/home/pgc29/.local/bin/sos execute a904366ca2a493bc -v 2 -s default -m interactive
Patrick Cudahy
@pgcudahy
Not sure where to go from here, but it is very late here now, so I'm headed to bed. Thanks very much for your help today, Bo.
Bo
@BoPeng
Good. workdir on a non-shared path is a problem... I think I have a ticket for that but have not looked into the details... nbconvert 6.0.0 introduced new template structures that broke sos convert, and I have to fix that first.
I sort of hate it when upstream makes incompatible changes.
Patrick Cudahy
@pgcudahy

Hello, I have what I think is a simple question but haven't been able to get it to work. I'm processing genomes that are a mix of single-end and paired-end reads. Before mapping them to a reference, the commands differ between single and paired reads, so I have two parallel pipelines. But after mapping, they're all bam files and I'd like to continue processing with just one pipeline. To show you what I mean, first I grab all the fastq files and group the ones that are paired. The filenames have the sample name followed by "_R1.fastq.gz" or "_R2.fastq.gz" to indicate a forward or reverse read.

[global]
import glob
import itertools
import os

fastq_files = sorted(glob.glob("/data/*.fastq.gz"))

grouped_fastq_dict = dict()
for k, v in itertools.groupby(fastq_files, lambda a: os.path.split(a)[1].split("_R", 1)[0]):
    grouped_fastq_dict[k] = list(v)

single_read, paired_read = dict(), dict()
for k,v in grouped_fastq_dict.items():
    if len(v) == 1:
        single_read[k] = v
    elif len(v) == 2:
        paired_read[k] = v
    else:
        print(f'Error: {k} has {len(v)} associated fastq files (expected 1 or 2)')

Then I process them and map them to a reference

[trimmomatic-single]
input: single_read, group_by=1
output: trim_single = f'/data/{_input.labels[0]}/{_input:bnn}_trimmed.fastq.gz'
run: expand=True
    trimmomatic SE -phred33 {_input} {_output} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[trimmomatic-paired]
input: paired_read, group_by=2
output: trim_paired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed.fastq.gz',
        trim_unpaired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed_unpaired.fastq.gz',
        trim_paired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed.fastq.gz',
        trim_unpaired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed_unpaired.fastq.gz'
run: expand=True
    trimmomatic PE -phred33 {_input} {_output["trim_paired_1"]} {_output["trim_unpaired_1"]} \
    {_output["trim_paired_2"]} {_output["trim_unpaired_2"]} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[map-single]
input: output_from("trimmomatic-single"), group_by=1
output: bam = f'/data/{_input.name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input.name.split("_R")[0]
rg=f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

[map-paired]
input: output_from("trimmomatic-paired")["trim_paired_1"], output_from("trimmomatic-paired")["trim_paired_2"], group_by="pairs"
output: bam = f'/data/{_input["trim_paired_1"].name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input["trim_paired_1"].name.split("_R")[0]
rg = f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

But now I want to combine the output of the two parallel pipelines into the next step

[duplicate_marking]
input: output_from("map-single"),  output_from("map-paired"), group_by=1
output: dedup=f'{_input:n}_dedup.bam'
bash: expand=True
    export JAVA_OPTS='-Xmx3g'
    picard MarkDuplicates I={_input} O={_output} M={_output:n}.duplicate_metrics \
    REMOVE_DUPLICATES=false ASSUME_SORT_ORDER=coordinate

But SoS complains because the outputs from map-single and map-paired are of different lengths. How can I use the output from both steps as the input to my duplicate_marking step?

Bo
@BoPeng
This is because SoS tries to aggregate the groups of inputs when two grouped output_from are combined. To ungroup the output, you need to use group_by='all' inside output_from.
Bo
@BoPeng

Running sos run test combined with test.sos containing the following workflow,

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(2))
output: f'double_{i}.bam'

_output.touch()

[combined]
input: output_from('single'), output_from('double')

print(_input)

You will see that the two groups from single and double are combined to form two groups, each with one output from single and one output from double.

If the two steps produce different numbers of groups, you can instead flatten and regroup:

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(3))
output: f'double_{i}.bam'

_output.touch()

[combined]
input: output_from('single', group_by='all'), output_from('double', group_by='all'), group_by=1

print(_input)
basically "flatten" and join both output_from into a single group before separating them into groups with one file (group_by=).
Bo
@BoPeng
This is documented here but perhaps a more explicit example should be given.
Patrick Cudahy
@pgcudahy
That works well, thanks! I had read that documentation but couldn't figure out how to put it all together.
Patrick Cudahy
@pgcudahy
Hello, another quick question. The login nodes for my cluster get pretty congested, and during peak hours I start to see a lot of ERROR: ERROR workflow_executor.py:1206 - Failed to connect to yale_hpc_slurm: ssh connection to pgc29@xxx.xxx.xxx.xxx time out with prompt: b'' - None errors. Is there a way to adjust the timeout to make it longer?
Bo
@BoPeng
There is an option to adjust the frequency of task status checks (30s if you copied the examples), but as far as I know there is no option to adjust the timeout of the underlying ssh command.
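If it helps, that interval is set per host in ~/.sos/hosts.yml; I believe the key in the documented examples is status_check_interval (in seconds), e.g.

hosts:
  yale_hpc_slurm:
    status_check_interval: 60

but double-check the key name against the example you copied.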
Patrick Cudahy
@pgcudahy
My dataset has gotten to be so large (several thousand genomes) that at every step there will likely be one or two failures, due to subtle race conditions, an outlier requiring much more memory or runtime so the cluster kills it, an initial file that was a contaminant, etc. So my workflow has started to bog down into a cycle of 1) submit the job, 2) check in a few hours later and see which substeps failed, 3) adjust parameters or just resubmit (it often just works the second time), 4) check in a few hours later to see how step 2 failed, 5) resubmit, 6) repeat. With a pipeline of >10 steps this is tedious, and I'd prefer not to have to babysit runs as much. Is there a way to have SoS continue to the next step even if some substeps fail? That way 99% of my samples will make it from fastq file to a final VCF, and I can then tweak things in a second run to finish up the failed 1%. Any other suggestions on how to improve robustness would be welcome.
Bo
@BoPeng
Sorry, been a busy day. If you run sos run -h, there is an option -e ERRORMODE; I think what you want is -e ignore.
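For example (the workflow name here is just a placeholder):

%run my_workflow -q yale_hpc_slurm -e ignore

The same flag should work with sos run on the command line.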
@pgcudahy
Patrick Cudahy
@pgcudahy
Thanks!
Patrick Cudahy
@pgcudahy
I'm having some issues with steps that have remote inputs and outputs: SoS is not noticing changed files, so it skips steps with saved signatures. Where exactly are signatures stored for jobs run remotely, and how can I clear them? I've tried !sos remove -s from within my notebook, but steps still get skipped.
Bo
@BoPeng
@pgcudahy The contents of "remote" files are not checked at this point, which is why signatures involving remote files are inaccurate. The current mechanism for remote files works for some cases but is broken for others (I have a ticket for returning remote files when workdir is set) and certainly needs improvement.
This is now vatlab/sos#1411
Bo
@BoPeng
@pgcudahy Just released sos notebook 0.22.4, which makes task execution non-blocking so that you can check status, remove tasks, etc. with the buttons. Let me know if you notice any problem.
Bo
@BoPeng
vatlab/sos#1411 is also implemented, although more testing is needed for the next release.
Patrick Cudahy
@pgcudahy
Thank you Bo. I only see 0.22.3 on pip. Are there instructions somewhere on how to install from github?
0.22.3 seems to have broken my scripts, so I'm trying to figure out what's wrong
Bo
@BoPeng
0.22.3 should also be on conda. To install from github you will have to use commands such as pip install git+https://github.com/vatlab/sos.git
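For sos notebook itself the repository should be vatlab/sos-notebook, so something like

pip install git+https://github.com/vatlab/sos-notebook.git

should give you 0.22.4 before it reaches pip.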
shashj
@shashj
Hi, is it possible to change the sigil in a workflow?
@BoPeng ?
shashj
@shashj
found it thanks
Bo
@BoPeng
@shashj Yeah, that is a basic feature but we recently added a warning message for the inclusion of scripts without indentation. I just updated the documentation.
Patrick Cudahy
@pgcudahy
Hello Bo, when submitting workflows and tasks to a cluster (e.g. %run check_validation -q yale_hpc_task_spooler -r yale_hpc_task_spooler), is there a way to synchronize the output back to my local computer? Using a named path like #scratch fails with WARNING: Error from step check_validation is ignored: [check_validation]: Failed to process step output (f'#scratch/helen_mixed_infection/data/fqtools/good_files.txt'): 'NoneType' object has no attribute 'expanduser'
Bo
@BoPeng
@pgcudahy I will have a look through #1437
Bo
@BoPeng
You seem to be describing two problems. For the first one, -r is designed to execute everything on the remote host, while -q is designed to execute only part of the work remotely and therefore has a mechanism to sync files back. There are sos remote push/pull commands, but rsync can be more straightforward.
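For example, to pull results from the cluster scratch back to your local scratch (CLUSTER standing in for your login node address):

rsync -av pgc29@CLUSTER:/home/pgc29/scratch60/helen_mixed_infection/data/ /data2/helen_mixed_infection/data/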
The second issue, with the anchored path, looks more like a bug, but I need more background on how it happened.