Bo
@BoPeng
There is actually a gnuplot kernel. https://github.com/has2k1/gnuplot_kernel If you use it, you can use SoS Notebook, and run the script directly in the kernel.

If you are using a SoS workflow, in a SoS cell the gnuplot: action is not defined, but you can try

run:
    #!/bin/env gnuplot
    script...

which will use the gnuplot command to run the script. Or you can do

script: args='gnuplot {filename}'
    script ...

to specify the interpreter. See https://vatlab.github.io/sos-docs/doc/user_guide/sos_actions.html#Option--args for details.
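For example, a cell along these lines should work (just a sketch; the gnuplot commands and output file name are placeholders):

script: args='gnuplot {filename}'
    set terminal png size 640,480
    set output 'sine.png'
    plot sin(x) title 'sin(x)'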

D. R. Evans
@N7DR

There is actually a gnuplot kernel. https://github.com/has2k1/gnuplot_kernel

Yes, that's part of what I was trying to get across :-)

I'll try your first suggestion above; that seems cleaner in the absence of explicit support for the gnuplot kernel.

Bo
@BoPeng
I saw a message that was later deleted. I think https://vatlab.github.io/sos-docs/doc/user_guide/sos_actions.html#Option-template-and-template_name was what was asked.
Patrick Cudahy
@pgcudahy

Hello, I'm trying to set up a remote host for my university cluster but having trouble getting started
My hosts.yml is

localhost: macbook
hosts:
  yale_farnam:
    address: farnam.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60
  macbook:
    address: 127.0.0.1
    paths:
      home: /Users/pgcudahy

But when I run something simple like

%run -r yale_farnam -c ~/.sos/hosts.yml
sh:
    echo Working on `pwd` of $HOSTNAME

I get ERROR: Failed to connect to yale_farnam: pgcudahy@farnam.hpc.yale.edu: Permission denied (publickey).

I also tried sos remote setup but got the error

INFO: scp -P 22 /Users/pgcudahy/.ssh/id_rsa.pub farnam.hpc.yale.edu:id_rsa.pub.yale_farnam
ERROR: Failed to copy public key to farnam.hpc.yale.edu

Perhaps the issue is that my usernames for my computer and the cluster are different. A simple ssh pgc29@farnam.hpc.yale.edu works for me, but ssh farnam.hpc.yale.edu does not. The scp command referenced in the error message doesn't prepend a username to the cluster's domain name. Any help on how to move forward would be great. Thanks

Bo
@BoPeng
Could you try changing address: farnam.hpc.yale.edu to address: pgc29@farnam.hpc.yale.edu?
Patrick Cudahy
@pgcudahy
Ah, thanks. Of course the cluster just went down for maintenance until Wednesday! I'll try it as soon as it's back up.
Bo
@BoPeng
Since it is a cluster, you will have to install sos on it, set sos: /path/to/sos in the config file (to avoid changing $PATH on the server), and then add a template to submit jobs. I usually define two hosts, one for the headnode and one for the cluster queue, in case I want to run something directly on the headnode. Let me know if you encounter any problem.
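A rough sketch of that layout (addresses, paths, and the template itself are placeholders to adapt):

hosts:
  cluster_head:
    address: user@login.cluster.edu
    sos: /path/to/sos
    paths:
      home: /path/to/home
  cluster_queue:
    address: user@login.cluster.edu
    sos: /path/to/sos
    paths:
      home: /path/to/home
    queue_type: pbs
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
    task_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      #SBATCH --job-name={task}
      cd {workdir}
      {command}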
Bo
@BoPeng
BTW -c ~/.sos/hosts.yml is not needed. That file is always read.
Patrick Cudahy
@pgcudahy
Adding my username to the address worked well, thanks! My cluster recommends installing pip modules like sos with anaconda, but I couldn't figure out how to get my local notebook to access sos in the remote conda environment. Instead I was able to install sos with pip install --user sos and then change my PYTHONPATH in ~/.bash_profile
Patrick Cudahy
@pgcudahy

Okay, so now that I can directly run remote jobs, I'm trying to get SLURM set up. The cluster documentation has this example task that I want to replicate

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --out="slurm-%j.out"
#SBATCH --time=01:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --mail-type=ALL

mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
mem_gbytes=$(( $mem_bytes / 1024 **3 ))

echo "Starting at $(date)"
echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

My updated hosts.yml is now

%save ~/.sos/hosts.yml -f

localhost: ubuntu
hosts:
  ubuntu:
    address: 127.0.0.1
    paths:
      home: /data2
  yale_farnam:
    address: pgc29@farnam1.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60/
  yale_hpc_slurm:
    address: pgc29@farnam.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60/
    queue_type: pbs
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
    status_check_interval: 120
    max_running_jobs: 100
    max_cores: 200 
    max_walltime: "72:00:00"
    max_mem: 1280G
    task_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node={cores}
        #SBATCH --job-name={task}
        #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
        #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
        cd {workdir}
        {command}

But when I run

%run -q yale_hpc_slurm 

bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

It tries to run on my local machine and returns

INFO: Running default:

/tmp/tmp9o84b7hs.sh: line 1: /sys/fs/cgroup/memory/slurm/uid_/job_/memory.limit_in_bytes: No such file or directory
/tmp/tmp9o84b7hs.sh: line 2: / 1024 **3 : syntax error: operand expected (error token is "/ 1024 **3 ")
Starting at Wed Oct  7 04:14:57 EDT 2020
Job submitted to the  partition, the default partition on 
Job name: , Job ID: 
  I have  CPUs and GiB of RAM on compute node ubuntu

INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step.

I tried

%run -r yale_hpc_slurm

bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

and got

ERROR: No workflow engine or invalid engine definition defined for host yale_hpc_slurm

Workflow exited with code 1

Where have I messed up my template? Thanks

Patrick Cudahy
@pgcudahy

Okay, got it to work with

%run -q 'yale_hpc_slurm'

task: walltime='00:05:00'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

Not sure why I had to quote the remote host name. Also it does not produce a .out file in ~/.sos/tasks/ on the remote computer or my local computer, just job_id, pulse, task and sh files.

Bo
@BoPeng
OK, first, if you add sos: /path/to/sos, you do not need to set up PATH or PYTHONPATH on the cluster. In our experience it is better to leave $PATH on the cluster alone, because your job might be using a different Python than the one sos uses.
Second, task is needed to define a portion of the step as an external task; using only bash would not work. Basically the template executes the task with sos execute, and the task itself can contain bash, python, R, or any other scripts...
Third, the "Starting at $(date)" stuff usually belongs in the template, since your notebook should focus on the "real" work rather than anything directly related to the cluster. Note that there is a problem with interpolation of ${ }: since sos expands { }, you will have to write ${{ }} to avoid that.
mem_bytes is read from your job file. Actually sos provides the variable mem, which is exactly that.
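For example (just a sketch based on your template; the exact echo lines do not matter), moving those lines into task_template would need the SLURM variables doubled so that sos's { } expansion leaves them alone, while $(date) needs no escaping because it uses parentheses:

    task_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --job-name={task}
        echo "Starting at $(date)"
        echo "Job ${{SLURM_JOB_NAME}}, Job ID ${{SLURM_JOB_ID}}, partition ${{SLURM_JOB_PARTITION}}"
        cd {workdir}
        {command}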
Bo
@BoPeng
Finally, I do not see where you specify mem in your template; is it needed at all for your system?

Also, at least on our cluster, running stuff in $HOME is not recommended, so I have things like

    cluster:
        paths:
            scratch: /path/to/scratch/

and

task: workdir='#scratch/project/etc'

to run the tasks under the scratch directory.
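For the #scratch name to be translated between machines, the same named path has to be defined for both hosts, e.g. (a sketch with placeholder paths):

hosts:
  my_laptop:
    paths:
      scratch: /path/to/local/scratch
  cluster:
    paths:
      scratch: /path/to/cluster/scratch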

Patrick Cudahy
@pgcudahy

Thanks, I guess my main question right now is when I run the example task

%run -q 'yale_hpc_slurm'

task: walltime='00:05:00', mem='1G'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

The notebook reports

INFO: cdb813b50789fb95 submitted to yale_hpc_slurm with job id 31774151
INFO: Waiting for the completion of 1 task.
INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step and 1 completed task.

But there are no .out or .err files in ~/.sos/tasks

Just a .task and a .pulse
Bo
@BoPeng
Hmm, on the cluster, if you run sos status cdb813b50789fb95 -v4, what do you see? sos "absorbs" these files into the task file to keep the number of files low.
Patrick Cudahy
@pgcudahy
Ah, there it is. It says
cdb813b50789fb95        completed

Created 9 hr ago
Started 5 min ago
Signature checked
TASK:
=====
bash('mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)\nmem_gbytes=$(( $mem_bytes / 1024 **3 ))\n\necho "Starting at $(date)"\necho "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"\necho "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"\necho "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"\n')

TAGS:
=====
187c6b10c86049c7 default notebooks

GLOBAL:
=======
(<_ast.Module object at 0x7f6b8f19bb50>, {})

ENVIRONMENT:
============
__signature_vars__    {'bash'}
_depends              []
_index                0
_input                []
_output               Unspecified
_runtime              {'queue': 'yale_hpc_slurm',
 'run_mode': 'interactive',
 'sig_mode': 'default',
 'verbosity': 2,
 'walltime': '00:05:00',
 'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name             'default'
workflow_id           '187c6b10c86049c7'

EXECUTION STATS:
================
Duration:       0s
Peak CPU:       0.0 %
Peak mem:       36.7 MiB

execution script:
================
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=cdb813b50789fb95
#SBATCH --output=/home/pgc29/.sos/tasks/cdb813b50789fb95.out
#SBATCH --error=/home/pgc29/.sos/tasks/cdb813b50789fb95.err
cd /data2/helen_mixed_infection/notebooks
sos execute cdb813b50789fb95 -v 2 -s default -m interactive


standard output:
================
Starting at Wed Oct  7 05:49:27 EDT 2020
Job submitted to the general partition, the default partition on farnam
Job name: cdb813b50789fb95, Job ID: 31767135
  I have 2 CPUs and 10GiB of RAM on compute node c23n12.farnam.hpc.yale.internal


standard error:
================
/var/spool/slurmd/job31767135/slurm_script: line 8: cd: /data2/helen_mixed_infection/notebooks: No such file or directory
INFO: cdb813b50789fb95 started
So if I want to figure out why a job failed, I should run sos status?
Bo
@BoPeng
Yes. The error messages are absorbed, so sos status -v4 is currently the only way to go. From the notebook, you can run %task status jobid -q queue with the same effect.
Actually, if you hover the mouse over a task, there is a little icon to submit the %task status magic for you. However, because the %run -q magic is currently blocking, that magic would not run until %run finishes.
I have been thinking of making %run non-blocking once the tasks are all submitted.
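For the task above that would be something like

%task status cdb813b50789fb95 -q yale_hpc_slurm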
Patrick Cudahy
@pgcudahy
That would be nice
Bo
@BoPeng
I am using this mechanism heavily these days because it makes running "small" scripts very easy. Debugging failed jobs is not particularly easy, and I have found myself logging into the cluster to run sos status (because %task is blocked)... definitely something that needs to be improved. Please feel free to submit tickets for problems and feature requests so that we know what the "pain points" are.
Patrick Cudahy
@pgcudahy
Thanks for walking me through things, half of my issues are because I'm not familiar with slurm either, so it's a very steep learning curve
Bo
@BoPeng

Yes, it will take some configuration and time to get used to, and you might want to add

module load {" ".join(modules)}

to your template, which allows you to do

task: modules=['module1', 'module2']

to load particular modules for the scripts in the task. Again, feel free to let me know if you get into trouble so that we can make this process as easy and error-proof as possible.
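In the template above that could look like the following (a sketch; with this line every task would then need to supply a modules list):

    task_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --job-name={task}
        module load {" ".join(modules)}
        cd {workdir}
        {command}

and in the step:

task: walltime='00:15:00', mem='2G', modules=['miniconda']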

Patrick Cudahy
@pgcudahy

Okay, now I'm trying to run a real job but having an issue. Per your suggestion I set a scratch path for my local machine as scratch: /data2 and for the cluster as scratch: /home/pgc29/scratch60. Then I try to run

%run test -q 'yale_hpc_slurm'
[test]
input: f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    cd /data2/helen_mixed_infection/data/tb-profiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

It ends up hanging forever with INFO: Waiting for the completion of 1 task.
sos status b5eb33955aee5a77 -v4 gives

b5eb33955aee5a77    submitted

Created 33 min ago
TASK:
=====
run(fr"""module load miniconda
conda activate tbprofiler
cd /data2/helen_mixed_infection/data/tb-profiler
tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

""")

TAGS:
=====
1a2a7669047f0dc5 notebooks test

GLOBAL:
=======
(<_ast.Module object at 0x2b96dfcc8ca0>, {})

ENVIRONMENT:
============
__signature_vars__    {'_input', 'run'}
_depends              []
_index                0
_input                [file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz'), file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz')]
_output               [file_target('/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json')]
_runtime              {'mem': 2000000000,
 'queue': 'yale_hpc_slurm',
 'run_mode': 'interactive',
 'sig_mode': 'default',
 'verbosity': 2,
 'walltime': '00:15:00',
 'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name             'test'
workflow_id           '1a2a7669047f0dc5'


b5eb33955aee5a77.sh:
====================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=b5eb33955aee5a77
#SBATCH --output=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.out
#SBATCH --error=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.err
cd /data2/helen_mixed_infection/notebooks
sos execute b5eb33955aee5a77 -v 2 -s default -m interactive


b5eb33955aee5a77.job_id:
========================
job_id: 31773933

But when I run on the cluster sacct -j 31773933

sacct -j 31773933
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31773933     b5eb33955+    general cohen_the+          1     FAILED      1:0 
31773933.ba+      batch            cohen_the+          1     FAILED      1:0 
31773933.ex+     extern            cohen_the+          1  COMPLETED      0:0
I'd expect the submitted job to have translated /data2/helen_mixed_infection/notebooks to /home/pgc29/scratch60/helen_mixed_infection/notebooks but b5eb33955aee5a77.sh has cd /data2/helen_mixed_infection/notebooks
Patrick Cudahy
@pgcudahy
Doh, I think I see it now. Going to try with
[test]
input: f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'#scratch/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001
Patrick Cudahy
@pgcudahy
That ran and finished successfully, but the output file isn't where I expect it. I think it's because the execution script didn't replace #scratch
execution script:
================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=a904366ca2a493bc
#SBATCH --output=/home/pgc29/.sos/tasks/a904366ca2a493bc.out
#SBATCH --error=/home/pgc29/.sos/tasks/a904366ca2a493bc.err
cd #scratch/helen_mixed_infection/data/tb-profiler
/home/pgc29/.local/bin/sos execute a904366ca2a493bc -v 2 -s default -m interactive
Patrick Cudahy
@pgcudahy
Not sure where to go from here, but it is very late here now, so I'm headed to bed. Thanks very much for your help today, Bo
Bo
@BoPeng
Good. workdir on a non-shared path is a problem... I think I have a ticket for that but have not looked into the details... nbconvert 6.0.0 introduced a new template structure, which broke sos convert, and I have to fix that first.
I sort of hate it when upstream makes incompatible changes.
Patrick Cudahy
@pgcudahy

Hello, I have what I think is a simple question but haven't been able to get it to work. I'm processing genomes that are a mix of single-end and paired-end reads. Before mapping them to a reference, the commands differ between single and paired reads, so I have two parallel pipelines. But after mapping, they're all bam files, and I'd like to continue processing with just one pipeline. To show you what I mean, first I grab all the fastq files and join the ones that are paired. The filenames have the sample name followed by "_R1.fastq.gz" or "_R2.fastq.gz" to indicate a forward or reverse read.

[global]
import glob
import itertools
import os

fastq_files = sorted(glob.glob("/data/*.fastq.gz"))

grouped_fastq_dict = dict()
for k, v in itertools.groupby(fastq_files, lambda a: os.path.split(a)[1].split("_R", 1)[0]):
    grouped_fastq_dict[k] = list(v)

single_read, paired_read = dict(), dict()
for k,v in grouped_fastq_dict.items():
    if len(v) == 1:
        single_read[k] = v
    elif len(v) == 2:
        paired_read[k] = v
    else:
        print(f'Error: {k} has < 1 or more than 2 associated fastq files')

Then I process them and map them to a reference

[trimmomatic-single]
input: single_read, group_by=1
output: trim_single = f'/data/{_input.labels[0]}/{_input:bnn}_trimmed.fastq.gz'
run: expand=True
    trimmomatic SE -phred33 {_input} {_output} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[trimmomatic-paired]
input: paired_read, group_by=2
output: trim_paired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed.fastq.gz',
        trim_unpaired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed_unpaired.fastq.gz',
        trim_paired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed.fastq.gz',
        trim_unpaired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed_unpaired.fastq.gz'
run: expand=True
    trimmomatic PE -phred33 {_input} {_output["trim_paired_1"]} {_output["trim_unpaired_1"]} \
    {_output["trim_paired_2"]} {_output["trim_unpaired_2"]} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[map-single]
input: output_from("trimmomatic-single"), group_by=1
output: bam = f'/data/{_input.name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input.name.split("_R")[0]
rg=f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

[map-paired]
input: output_from("trimmomatic-paired")["trim_paired_1"], output_from("trimmomatic-paired")["trim_paired_2"], group_by="pairs"
output: bam = f'/data/{_input["trim_paired_1"].name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input["trim_paired_1"].name.split("_R")[0]
rg = f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

But now I want to combine the output of the two parallel pipelines into the next step

[duplicate_marking]
input: output_from("map-single"),  output_from("map-paired"), group_by=1
output: dedup=f'{_input:n}_dedup.bam'
bash: expand=True
    export JAVA_OPTS='-Xmx3g'
    picard MarkDuplicates I={_input} O={_output} M={_output:n}.duplicate_metrics \
    REMOVE_DUPLICATES=false ASSUME_SORT_ORDER=coordinate

But SoS complains because the outputs from map-single and map-paired are of different lengths. How can I use the output from both steps as the input to my duplicate_marking step?

Bo
@BoPeng
This is because sos tries to aggregate groups of inputs when two output_from are grouped. To ungroup the output, you need to use group_by='all' inside output_from.
Bo
@BoPeng

Running sos run test combined, with test.sos containing the following workflow,

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[combined]
input: output_from('single'), output_from('double')

print(_input)

You will see that the two groups from single and double are combined to form two groups with one output from single and one output from double.

If the two steps produce different numbers of groups (here range(2) vs range(3)), you can instead write

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(3))
output: f'single_{i}.bam'

_output.touch()

[combined]
input: output_from('single', group_by='all'), output_from('double', group_by='all'), group_by=1

print(_input)
basically "flatten" and join both output_from into a single group before separating them into groups with one file (group_by=).
Bo
@BoPeng
This is documented here but perhaps a more explicit example should be given.
Patrick Cudahy
@pgcudahy
That works well, thanks! I had read that documentation but couldn't figure out how to put it all together
Patrick Cudahy
@pgcudahy
Hello, another quick question. The login nodes for my cluster get pretty congested, and during peak hours I start to see a lot of ERROR: ERROR workflow_executor.py:1206 - Failed to connect to yale_hpc_slurm: ssh connection to pgc29@xxx.xxx.xxx.xxx time out with prompt: b'' - None errors. Is there a way to adjust the timeout to make it longer?
Bo
@BoPeng
There is an option to adjust the frequency of task status checks (30s if you copied the examples), but as far as I know there is no option to adjust the timeout of the underlying ssh command.
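That frequency is presumably the status_check_interval key already in your host definition; raising it should reduce how often sos has to go through the login node, e.g. (value is arbitrary):

  yale_hpc_slurm:
    status_check_interval: 600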
Patrick Cudahy
@pgcudahy
My dataset has gotten so large (several thousand genomes) that at every step there will likely be one or two failures: subtle race conditions, an outlier needing much more memory or runtime so the cluster kills it, an initial file that turns out to be a contaminant, etc. So my workflow has bogged down into a cycle of 1) submit the job, 2) check a few hours later to see which substeps failed, 3) adjust parameters or just resubmit (it often just works the second time), 4) check a few hours later to see how step 2 failed, 5) resubmit, 6) repeat. With a pipeline of >10 steps this is tedious, and I'd prefer not to babysit runs as much. Is there a way to have SoS continue to the next step even if some substeps fail? That way 99% of my samples will make it from fastq file to a final VCF, and I can then tweak things in a second run to finish the failed 1%. Any other suggestions on how to improve robustness would be welcome.
Bo
@BoPeng
Sorry, it's been a busy day. If you run sos run -h, there is an option -e ERRORMODE; I think you want to use -e ignore.
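For example, from the command line (the workflow file name is a placeholder):

sos run pipeline.sos -q yale_hpc_slurm -e ignore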
Patrick Cudahy
@pgcudahy
Thanks!