D. R. Evans
@N7DR
I really don't understand this whole indentation issue, even though you've tried to explain it :-( The cognitive dissonance I experience when having to indent shell code is not at all trivial :-( Isn't there some other way to write bash code so that it doesn't have to be indented... it feels like going back to FORTRAN IV :-) How about allowing something like:
D. R. Evans
@N7DR

Stupid thing was in chat mode... I hate the way it does that, basically requiring one to remember to check which mode it's in... anyway, as I was saying, how about allowing something like:

bash: expand=True, end='FRED'

cat << 'EOF' > phase-0.gplt
set terminal png
set output "phase-0.png"

set key off

#set xrange [180:0] etc. etc.

plot "phase-0.txt" using (log10($6)):(abs($14-180)) with lines

EOF
gnuplot phase-0.gplt
FRED

so that one can write the bash script without indentation up until one hits the string 'FRED' on a line by itself. ['FRED', of course, could be any string one likes, defined by the parameter to "end=" on the "bash:" line.]

D. R. Evans
@N7DR

SoS supports all kernels; it is just that there is no variable exchange for kernels of unsupported languages

I had tried the following:

[C]
input:
output:
parameter:
gnuplot:

plot "phase-0.txt" using (log10($6)):14 with lines

But running the step produced NameError: name 'gnuplot' is not defined, so I thought that meant I couldn't use the gnuplot kernel. What do I need to change in order for it to work? (Although it still wouldn't be usable for many of my plots anyway, because of the Unicode problem.)

Bo
@BoPeng
You are using the sos kernel, not the gnuplot kernel, if there is such a thing. The sh: stuff is a sos function/action. Since sos does not provide a gnuplot action, you can use
script:
Bo
@BoPeng
There is actually a gnuplot kernel. https://github.com/has2k1/gnuplot_kernel If you use it, you can use SoS Notebook, and run the script directly in the kernel.

If you are using a SoS workflow, the gnuplot: action is not defined in a SoS cell, but you can try

run:
#!/usr/bin/env gnuplot
    script...

which will use the gnuplot command to run the script. Or you can do

script: args='gnuplot {filename}'
    script ...

to specify the interpreter. See https://vatlab.github.io/sos-docs/doc/user_guide/sos_actions.html#Option--args for details.
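
For example, a complete step along those lines could be (a sketch reusing the gnuplot commands from earlier in the thread; the step header [plot] is arbitrary):

[plot]
script: args='gnuplot {filename}'
    set terminal png
    set output "phase-0.png"
    set key off
    plot "phase-0.txt" using (log10($6)):(abs($14-180)) with lines

Here {filename} is replaced by the temporary file sos writes the script to, so the effect is just running gnuplot on that file.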

D. R. Evans
@N7DR

There is actually a gnuplot kernel. https://github.com/has2k1/gnuplot_kernel

Yes, that's part of what I was trying to get across :-)

I'll try your first suggestion above; that seems cleaner in the absence of explicit support for the gnuplot kernel.

Bo
@BoPeng
I saw a message that was later deleted. I think https://vatlab.github.io/sos-docs/doc/user_guide/sos_actions.html#Option-template-and-template_name was what was asked.
Patrick Cudahy
@pgcudahy

Hello, I'm trying to set up a remote host for my university cluster but having trouble getting started
My hosts.yml is

localhost: macbook
hosts:
  yale_farnam:
    address: farnam.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60
  macbook:
    address: 127.0.0.1
    paths:
      home: /Users/pgcudahy

But when I run something simple like

%run -r yale_farnam -c ~/.sos/hosts.yml
sh:
    echo Working on `pwd` of $HOSTNAME

I get ERROR: Failed to connect to yale_farnam: pgcudahy@farnam.hpc.yale.edu: Permission denied (publickey).

I also tried sos remote setup but got the error

INFO: scp -P 22 /Users/pgcudahy/.ssh/id_rsa.pub farnam.hpc.yale.edu:id_rsa.pub.yale_farnam
ERROR: Failed to copy public key to farnam.hpc.yale.edu

Perhaps the issue is that my usernames for my computer and the cluster are different. A simple ssh pgc29@farnam.hpc.yale.edu works for me, but ssh farnam.hpc.yale.edu does not. The scp command referenced in the error message doesn't prepend a username to the cluster's domain name. Any help on how to move forward would be great. Thanks

Bo
@BoPeng
Could you try to change address: farnam.hpc.yale.edu to address: pgc29@farnam.hpc.yale.edu?
Patrick Cudahy
@pgcudahy
Ah, thanks. Of course the cluster just went down for maintenance until Wednesday! I'll try it as soon as it's back up.
Bo
@BoPeng
Since it is a cluster, you will have to install sos on it, set sos: /path/to/sos in the config file (to avoid changing $PATH on the server), and then add a template to submit jobs. I usually have two hosts for the headnode and the cluster, in case I want to run something on the headnode. Let me know if you encounter any problems.
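For example, the host entry could gain a sos entry along these lines (a sketch; the path shown is hypothetical and should point to wherever sos ends up installed on the cluster):

hosts:
  yale_farnam:
    address: pgc29@farnam.hpc.yale.edu
    sos: /home/pgc29/.local/bin/sos    # hypothetical; wherever sos lives on the cluster
    paths:
      home: /home/pgc29/scratch60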
Bo
@BoPeng
BTW -c ~/.sos/hosts.yml is not needed. That file is always read.
Patrick Cudahy
@pgcudahy
Adding my username to the address worked well, thanks! My cluster recommends installing pip modules like sos with anaconda, but I couldn't figure out how to get my local notebook to access sos in the remote conda environment. Instead I was able to install sos with pip install --user sos and then change my PYTHONPATH in ~/.bash_profile
Patrick Cudahy
@pgcudahy

Okay, so now that I can directly run remote jobs, I'm trying to get SLURM set up. The cluster documentation has this example task that I want to replicate

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --out="slurm-%j.out"
#SBATCH --time=01:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --mail-type=ALL

mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
mem_gbytes=$(( $mem_bytes / 1024 **3 ))

echo "Starting at $(date)"
echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

My updated hosts.yml is now

%save ~/.sos/hosts.yml -f

localhost: ubuntu
hosts:
  ubuntu:
    address: 127.0.0.1
    paths:
      home: /data2
  yale_farnam:
    address: pgc29@farnam1.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60/
  yale_hpc_slurm:
    address: pgc29@farnam.hpc.yale.edu
    paths:
      home: /home/pgc29/scratch60/
    queue_type: pbs
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
    status_check_interval: 120
    max_running_jobs: 100
    max_cores: 200 
    max_walltime: "72:00:00"
    max_mem: 1280G
    task_template: |
        #!/bin/bash
        #SBATCH --time={walltime}
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node={cores}
        #SBATCH --job-name={task}
        #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
        #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
        cd {workdir}
        {command}

But when I run

%run -q yale_hpc_slurm 

bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

It tries to run on my local machine and returns

INFO: Running default:

/tmp/tmp9o84b7hs.sh: line 1: /sys/fs/cgroup/memory/slurm/uid_/job_/memory.limit_in_bytes: No such file or directory
/tmp/tmp9o84b7hs.sh: line 2: / 1024 **3 : syntax error: operand expected (error token is "/ 1024 **3 ")
Starting at Wed Oct  7 04:14:57 EDT 2020
Job submitted to the  partition, the default partition on 
Job name: , Job ID: 
  I have  CPUs and GiB of RAM on compute node ubuntu

INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step.

I tried

%run -r yale_hpc_slurm

bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

and got

ERROR: No workflow engine or invalid engine definition defined for host yale_hpc_slurm

Workflow exited with code 1

Where have I messed up my template? Thanks

Patrick Cudahy
@pgcudahy

Okay, got it to work with

%run -q 'yale_hpc_slurm'

task: walltime='00:05:00'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

Not sure why I had to quote the remote host name. Also it does not produce a .out file in ~/.sos/tasks/ on the remote computer or my local computer, just job_id, pulse, task and sh files.

Bo
@BoPeng
OK, first, if you add sos: /path/to/sos, you do not need to set up PATH or PYTHONPATH on the cluster. In our experience it is better to leave $PATH on the cluster alone, because your job might be using a different Python than the one sos uses.
Second, task is needed to define a portion of the step as an external task; using only bash would not work. Basically, the template executes the task with sos execute, and the task itself can contain bash, python, R, or any other scripts...
Third, the "Starting at $(date)" stuff usually belongs in the template, since your notebook should focus on the "real" work rather than anything directly related to the cluster. There is a problem with the interpolation of ${ }, because sos expands { }, so you will have to use ${{ }} to avoid that.
Fourth, mem_bytes is read from your job file; sos actually provides a variable mem which is exactly that.
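With that escaping, the tail of the task_template could look something like this (a sketch based on the template above; {walltime}, {cores}, {task}, {user_name}, {workdir} and {command} are filled in by sos, while the doubled braces come through as literal shell variables):

task_template: |
    #!/bin/bash
    #SBATCH --time={walltime}
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node={cores}
    #SBATCH --job-name={task}
    #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
    #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
    cd {workdir}
    echo "Starting at $(date) on $(hostname), job ${{SLURM_JOB_ID}} in partition ${{SLURM_JOB_PARTITION}}"
    {command}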
Bo
@BoPeng
Finally, I do not see where you specify mem in your template; is it needed at all for your system?

Also, at least on our cluster, running stuff in $HOME is not recommended so I have things like

    cluster:
        paths:
            scratch: /path/to/scratch/

and

task: workdir='#scratch/project/etc'

to run the tasks under the scratch directory.
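For example, with the hosts from your config, the same named path could be declared on both sides (a sketch; the actual directories are up to you) so that paths written as #scratch/... map to the corresponding directory on each host:

hosts:
  ubuntu:
    paths:
      home: /data2
      scratch: /data2
  yale_hpc_slurm:
    paths:
      home: /home/pgc29/scratch60/
      scratch: /home/pgc29/scratch60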

Patrick Cudahy
@pgcudahy

Thanks, I guess my main question right now is when I run the example task

%run -q 'yale_hpc_slurm'

task: walltime='00:05:00', mem='1G'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))

    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"

The notebook reports

INFO: cdb813b50789fb95 submitted to yale_hpc_slurm with job id 31774151
INFO: Waiting for the completion of 1 task.
INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step and 1 completed task.

But there are no .out or .err files in ~/.sos/tasks

Just a .task and a .pulse
Bo
@BoPeng
Hmm, on the cluster, if you run sos status cdb813b50789fb95 -v4, what do you see? sos "absorbs" these files into the task file to keep the number of files low.
Patrick Cudahy
@pgcudahy
Ah, there it is. It says
cdb813b50789fb95        completed

Created 9 hr ago
Started 5 min ago
Signature checked
TASK:
=====
bash('mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)\nmem_gbytes=$(( $mem_bytes / 1024 **3 ))\n\necho "Starting at $(date)"\necho "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"\necho "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"\necho "  I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"\n')

TAGS:
=====
187c6b10c86049c7 default notebooks

GLOBAL:
=======
(<_ast.Module object at 0x7f6b8f19bb50>, {})

ENVIRONMENT:
============
__signature_vars__    {'bash'}
_depends              []
_index                0
_input                []
_output               Unspecified
_runtime              {'queue': 'yale_hpc_slurm',
 'run_mode': 'interactive',
 'sig_mode': 'default',
 'verbosity': 2,
 'walltime': '00:05:00',
 'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name             'default'
workflow_id           '187c6b10c86049c7'

EXECUTION STATS:
================
Duration:       0s
Peak CPU:       0.0 %
Peak mem:       36.7 MiB

execution script:
================
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=cdb813b50789fb95
#SBATCH --output=/home/pgc29/.sos/tasks/cdb813b50789fb95.out
#SBATCH --error=/home/pgc29/.sos/tasks/cdb813b50789fb95.err
cd /data2/helen_mixed_infection/notebooks
sos execute cdb813b50789fb95 -v 2 -s default -m interactive


standard output:
================
Starting at Wed Oct  7 05:49:27 EDT 2020
Job submitted to the general partition, the default partition on farnam
Job name: cdb813b50789fb95, Job ID: 31767135
  I have 2 CPUs and 10GiB of RAM on compute node c23n12.farnam.hpc.yale.internal


standard error:
================
/var/spool/slurmd/job31767135/slurm_script: line 8: cd: /data2/helen_mixed_infection/notebooks: No such file or directory
INFO: cdb813b50789fb95 started
So if I want to figure out why a job failed, I should run sos status?
Bo
@BoPeng
Yes. The error messages are absorbed, so sos status -v4 is currently the only way to go. From the notebook, you can run %task status jobid -q queue to the same effect.
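For the task above, that would be something like

%task status cdb813b50789fb95 -q yale_hpc_slurm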
Actually, if you hover the mouse over a task, there is a little icon for you to submit the %task status magic. However, because the %run -q magic is currently blocking, that magic would not be run until after the end of %run.
I have been thinking of making %run non-blocking once the tasks are all submitted.
Patrick Cudahy
@pgcudahy
That would be nice
Bo
@BoPeng
I am using this mechanism heavily these days because it makes running "small" scripts very easy. Debugging failed jobs is not particularly easy, and I have found myself logging into the cluster to run sos status (because %task is blocked)... definitely something that needs to be improved. Please feel free to submit tickets for problems and feature requests so that we know what the "pain points" are.
Patrick Cudahy
@pgcudahy
Thanks for walking me through things; half of my issues are because I'm not familiar with slurm either, so it's a very steep learning curve.
Bo
@BoPeng

Yes, it will take some configuration and time to get used to, and you might want to add

module load {" ".join(modules)}

to your template (a fragment showing where it goes is sketched below), which allows you to do

task: modules=['module1', 'module2']

to load particular modules for the scripts in the task. Again, feel free to let me know if you get into trouble so that we can make this process as easy and error-proof as possible.
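
In the template above, that line would go right before {command}, e.g. (a sketch; it assumes tasks sent to this queue always pass modules=, otherwise the line would need a default):

    cd {workdir}
    module load {" ".join(modules)}
    {command}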

Patrick Cudahy
@pgcudahy

Okay, now I'm trying to run a real job but having an issue. Per your suggestion I set a scratch path for my local machine as scratch: /data2 and for the cluster as scratch: /home/pgc29/scratch60. Then I try to run

%run test -q 'yale_hpc_slurm'
[test]
input: f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    cd /data2/helen_mixed_infection/data/tb-profiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

It ends up hanging forever with INFO: Waiting for the completion of 1 task.
sos status b5eb33955aee5a77 -v4 gives

b5eb33955aee5a77    submitted

Created 33 min ago
TASK:
=====
run(fr"""module load miniconda
conda activate tbprofiler
cd /data2/helen_mixed_infection/data/tb-profiler
tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

""")

TAGS:
=====
1a2a7669047f0dc5 notebooks test

GLOBAL:
=======
(<_ast.Module object at 0x2b96dfcc8ca0>, {})

ENVIRONMENT:
============
__signature_vars__    {'_input', 'run'}
_depends              []
_index                0
_input                [file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz'), file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz')]
_output               [file_target('/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json')]
_runtime              {'mem': 2000000000,
 'queue': 'yale_hpc_slurm',
 'run_mode': 'interactive',
 'sig_mode': 'default',
 'verbosity': 2,
 'walltime': '00:15:00',
 'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name             'test'
workflow_id           '1a2a7669047f0dc5'


b5eb33955aee5a77.sh:
====================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=b5eb33955aee5a77
#SBATCH --output=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.out
#SBATCH --error=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.err
cd /data2/helen_mixed_infection/notebooks
sos execute b5eb33955aee5a77 -v 2 -s default -m interactive


b5eb33955aee5a77.job_id:
========================
job_id: 31773933

But when I run on the cluster sacct -j 31773933

sacct -j 31773933
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31773933     b5eb33955+    general cohen_the+          1     FAILED      1:0 
31773933.ba+      batch            cohen_the+          1     FAILED      1:0 
31773933.ex+     extern            cohen_the+          1  COMPLETED      0:0
I'd expect the submitted job to have translated /data2/helen_mixed_infection/notebooks to /home/pgc29/scratch60/helen_mixed_infection/notebooks, but b5eb33955aee5a77.sh has cd /data2/helen_mixed_infection/notebooks
Patrick Cudahy
@pgcudahy
Doh, I think I see it now. Going to try with
[test]
input: f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz', 
        f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'#scratch/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001
Patrick Cudahy
@pgcudahy
That ran and finished with success, but the output file isn't where I expect it. I think it's because the execution script didn't replace #scratch:
execution script:
================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=a904366ca2a493bc
#SBATCH --output=/home/pgc29/.sos/tasks/a904366ca2a493bc.out
#SBATCH --error=/home/pgc29/.sos/tasks/a904366ca2a493bc.err
cd #scratch/helen_mixed_infection/data/tb-profiler
/home/pgc29/.local/bin/sos execute a904366ca2a493bc -v 2 -s default -m interactive
Patrick Cudahy
@pgcudahy
Not sure where to go from here, but it is very late here now, so I'm headed to bed. Thanks very much for your help today, Bo.
Bo
@BoPeng
Good. workdir on a non-shared path is a problem... I think I have a ticket for that but have not looked into the details... nbconvert 6.0.0 introduced new template structures that broke sos convert, and I have to fix that first.
I sort of hate it when upstream makes incompatible changes.
Patrick Cudahy
@pgcudahy

Hello, I have what I think is a simple question but haven't been able to get it to work. I'm processing genomes that are a mix of single-end and paired-end reads. Before mapping them to a reference, the commands differ between single and paired reads, so I have two parallel pipelines. But after mapping, they're all bam files and I'd like to continue processing with just one pipeline. To show you what I mean, first I grab all the fastq files and group the ones that are paired. The filenames have the sample name followed by "_R1.fastq.gz" or "_R2.fastq.gz" to indicate a forward or reverse read.

[global]
import glob
import itertools
import os

fastq_files = sorted(glob.glob("/data/*.fastq.gz"))

grouped_fastq_dict = dict()
for k, v in itertools.groupby(fastq_files, lambda a: os.path.split(a)[1].split("_R", 1)[0]):
    grouped_fastq_dict[k] = list(v)

single_read, paired_read = dict(), dict()
for k,v in grouped_fastq_dict.items():
    if len(v) == 1:
        single_read[k] = v
    elif len(v) == 2:
        paired_read[k] = v
    else:
        print(f'Error: {k} has < 1 or more than 2 associated fastq files')

Then I process them and map them to a reference

[trimmomatic-single]
input: single_read, group_by=1
output: trim_single = f'/data/{_input.labels[0]}/{_input:bnn}_trimmed.fastq.gz'
run: expand=True
    trimmomatic SE -phred33 {_input} {_output} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[trimmomatic-paired]
input: paired_read, group_by=2
output: trim_paired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed.fastq.gz',
trim_unpaired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed_unpaired.fastq.gz',
trim_paired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed.fastq.gz',
trim_unpaired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed_unpaired.fastq.gz'
run: expand=True
    trimmomatic PE -phred33 {_input} {_output["trim_paired_1"]} {_output["trim_unpaired_1"]} \
    {_output["trim_paired_2"]} {_output["trim_unpaired_2"]} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[map-single]
input: output_from("trimmomatic-single"), group_by=1
output: bam = f'/data/{_input.name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input.name.split("_R")[0]
rg=f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

[map-paired]
input: output_from("trimmomatic-paired")["trim_paired_1"], output_from("trimmomatic-paired")["trim_paired_2"], group_by="pairs"
output: bam = f'/data/{_input["trim_paired_1"].name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'

id=_input["trim_paired_1"].name.split("_R")[0]
rg = f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'

run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
    samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

But now I want to combine the output of the two parallel pipelines into the next step

[duplicate_marking]
input: output_from("map-single"),  output_from("map-paired"), group_by=1
output: dedup=f'{_input:n}_dedup.bam'
bash: expand=True
    export JAVA_OPTS='-Xmx3g'
    picard MarkDuplicates I={_input} O={_output} M={_output:n}.duplicate_metrics \
    REMOVE_DUPLICATES=false ASSUME_SORT_ORDER=coordinate

But SoS complains because the outputs from map-single and map-paired are of different lengths. How can I use the output from both steps as the input to my duplicate_marking step?

Bo
@BoPeng
This is because sos tries to aggregate the groups of inputs when two output_from are combined. To ungroup the output, you need to use group_by='all' inside output_from.
Bo
@BoPeng

Running sos run test combined, with test.sos containing the following workflow,

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[combined]
input: output_from('single'), output_from('double')

print(_input)

You will see that the two groups from single and double are combined to form two groups, each with one output from single and one output from double.

However, if the two steps generate different numbers of groups, as in your case, you can do

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'

_output.touch()

[double]
input: for_each=dict(i=range(3))
output: f'single_{i}.bam'

_output.touch()

[combined]
input: output_from('single', group_by='all'), output_from('double', group_by='all'), group_by=1

print(_input)
basically "flatten" and join both output_from into a single group before separating them into groups with one file (group_by=).
Bo
@BoPeng
This is documented here but perhaps a more explicit example should be given.
Patrick Cudahy
@pgcudahy
That works well, thanks! I had read that documentation but couldn't figure out how to put it all together.