SoS supports all kernels; there is just no variable exchange for kernels of unsupported languages.
I had tried the following:
[C]
input:
output:
parameter:
gnuplot:
plot "phase-0.txt" using (log10($6)):14 with lines
But running the step produced NameError: name 'gnuplot' is not defined, so I thought that meant I couldn't use the gnuplot kernel. What do I need to change to make it work? (Although it still wouldn't be usable for many of my plots anyway, because of the Unicode problem.)
If you are using a SoS workflow, the gnuplot: action is not defined in a SoS cell, but you can try

run:
    #!/bin/env gnuplot
    script...

which will use the gnuplot command to run the script. Or you can do

script: args='gnuplot {filename}'
    script ...

to specify the interpreter. See https://vatlab.github.io/sos-docs/doc/user_guide/sos_actions.html#Option--args for details.
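For example, applied to the plot above, something along these lines should work (a sketch; the PNG terminal and output settings are just an illustration for a batch run, and the step header is carried over from the earlier attempt):

[C]
script: args='gnuplot {filename}'
    # the script body is written to a temporary file and run as: gnuplot <filename>
    set terminal png size 800,600
    set output "phase-0.png"
    plot "phase-0.txt" using (log10($6)):14 with lines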
There is actually a gnuplot kernel. https://github.com/has2k1/gnuplot_kernel
Yes, that's part of what I was trying to get across :-)
I'll try your first suggestion above; that seems cleaner in the absence of explicit support for the gnuplot kernel.
Hello, I'm trying to set up a remote host for my university cluster but am having trouble getting started.
My hosts.yml is
localhost: macbook
hosts:
    yale_farnam:
        address: farnam.hpc.yale.edu
        paths:
            home: /home/pgc29/scratch60
    macbook:
        address: 127.0.0.1
        paths:
            home: /Users/pgcudahy
But when I run something simple like
%run -r yale_farnam -c ~/.sos/hosts.yml
sh:
    echo Working on `pwd` of $HOSTNAME
I get ERROR: Failed to connect to yale_farnam: pgcudahy@farnam.hpc.yale.edu: Permission denied (publickey).
I also tried sos remote setup, but get the error

INFO: scp -P 22 /Users/pgcudahy/.ssh/id_rsa.pub farnam.hpc.yale.edu:id_rsa.pub.yale_farnam
ERROR: Failed to copy public key to farnam.hpc.yale.edu

Perhaps the issue is that my usernames on my computer and the cluster are different. A simple ssh pgc29@farnam.hpc.yale.edu works for me, but ssh farnam.hpc.yale.edu does not, and the scp command in the error message doesn't prepend a username to the cluster's domain name. Any help on how to move forward would be great. Thanks
You can put the username in the address (e.g. pgc29@farnam.hpc.yale.edu), install sos on it, set sos: /path/to/sos in the config file (to avoid changing $PATH on the server), and then add a template to submit jobs. I usually have two hosts for the headnode and the cluster in case I want to run something on the headnode. Let me know if you encounter any problem.
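For reference, a sketch of such a host entry, with the address and sos path taken from later in this thread:

hosts:
    yale_farnam:
        address: pgc29@farnam.hpc.yale.edu   # include the username since it differs from the local one
        sos: /home/pgc29/.local/bin/sos      # path to sos on the cluster
        paths:
            home: /home/pgc29/scratch60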
I first tried installing sos with anaconda, but I couldn't figure out how to get my local notebook to access sos in the remote conda environment. Instead I was able to install sos with pip install --user sos and then change my PYTHONPATH in ~/.bash_profile.
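For reference, a rough sketch of that setup on the login node (the Python version in the site-packages path is an assumption and depends on the cluster):

# install sos into the user site
pip install --user sos
# added to ~/.bash_profile so that remote invocations of python can find the sos package
export PYTHONPATH=$HOME/.local/lib/python3.8/site-packages:$PYTHONPATH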
Okay, so now that I can directly run remote jobs, I'm trying to get SLURM set up. The cluster documentation has this example task that I want to replicate
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --out="slurm-%j.out"
#SBATCH --time=01:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --mail-type=ALL
mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
mem_gbytes=$(( $mem_bytes / 1024 **3 ))
echo "Starting at $(date)"
echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
echo " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"
My updated hosts.yml is now
%save ~/.sos/hosts.yml -f
localhost: ubuntu
hosts:
    ubuntu:
        address: 127.0.0.1
        paths:
            home: /data2
    yale_farnam:
        address: pgc29@farnam1.hpc.yale.edu
        paths:
            home: /home/pgc29/scratch60/
    yale_hpc_slurm:
        address: pgc29@farnam.hpc.yale.edu
        paths:
            home: /home/pgc29/scratch60/
        queue_type: pbs
        submit_cmd: sbatch {job_file}
        submit_cmd_output: "Submitted batch job {job_id}"
        status_cmd: squeue --job {job_id}
        kill_cmd: scancel {job_id}
        status_check_interval: 120
        max_running_jobs: 100
        max_cores: 200
        max_walltime: "72:00:00"
        max_mem: 1280G
        task_template: |
            #!/bin/bash
            #SBATCH --time={walltime}
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node={cores}
            #SBATCH --job-name={task}
            #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
            #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
            cd {workdir}
            {command}
But when I run
%run -q yale_hpc_slurm
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))
    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"
It tries to run on my local machine and returns
INFO: Running default:
/tmp/tmp9o84b7hs.sh: line 1: /sys/fs/cgroup/memory/slurm/uid_/job_/memory.limit_in_bytes: No such file or directory
/tmp/tmp9o84b7hs.sh: line 2: / 1024 **3 : syntax error: operand expected (error token is "/ 1024 **3 ")
Starting at Wed Oct 7 04:14:57 EDT 2020
Job submitted to the partition, the default partition on
Job name: , Job ID:
I have CPUs and GiB of RAM on compute node ubuntu
INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step.
I tried
%run -r yale_hpc_slurm
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))
    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"
and got
ERROR: No workflow engine or invalid engine definition defined for host yale_hpc_slurm
Workflow exited with code 1
Where have I messed up my template? Thanks
Okay, got it to work with
%run -q 'yale_hpc_slurm'
task: walltime='00:05:00'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))
    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"
Not sure why I had to quote the remote host name. Also it does not produce a .out
file in ~/.sos/tasks/ on the remote computer or my local computer, just job_id, pulse, task and sh files.
task is needed to define a portion of the step as an external task; using only bash would not work. Basically the template executes the task with sos execute, and the task can contain bash, python, R and any other scripts... Also, watch out for ${ }: since sos expands { }, you will have to use ${{ }} to avoid that.
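For instance, a minimal sketch (the [demo] step and the nthread variable are just illustrative; run it with %run -q as above):

[demo]
nthread = 2
task: walltime='00:05:00', mem='1G'
bash: expand=True
    # {nthread} is expanded by SoS before the script is submitted;
    # ${{HOSTNAME}} is written out as ${HOSTNAME} and expanded by bash at run time
    echo "using {nthread} threads on node ${{HOSTNAME}}"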
mem_bytes is read from your job file. Actually, sos provides the variable mem, which is exactly that.
Also, at least on our cluster, running stuff in $HOME is not recommended, so I have things like

cluster:
    paths:
        scratch: /path/to/scratch/

and

task: workdir='#scratch/project/etc'

to run the tasks under the scratch directory.
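In this setup that would mean matching scratch entries on both hosts (values taken from elsewhere in this thread), so that #scratch resolves correctly on each side:

hosts:
    ubuntu:
        paths:
            home: /data2
            scratch: /data2
    yale_hpc_slurm:
        paths:
            home: /home/pgc29/scratch60/
            scratch: /home/pgc29/scratch60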
Thanks. I guess my main question right now is: when I run the example task
%run -q 'yale_hpc_slurm'
task: walltime='00:05:00', mem='1G'
bash:
    mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)
    mem_gbytes=$(( $mem_bytes / 1024 **3 ))
    echo "Starting at $(date)"
    echo "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"
    echo "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"
    echo " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"
The notebook reports
INFO: cdb813b50789fb95 submitted to yale_hpc_slurm with job id 31774151
INFO: Waiting for the completion of 1 task.
INFO: Workflow default (ID=187c6b10c86049c7) is executed successfully with 1 completed step and 1 completed task.
But there are no .out or .err files in ~/.sos/tasks, just a .task and a .pulse file. The task status shows
cdb813b50789fb95 completed
Created 9 hr ago
Started 5 min ago
Signature checked
TASK:
=====
bash('mem_bytes=$(</sys/fs/cgroup/memory/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes)\nmem_gbytes=$(( $mem_bytes / 1024 **3 ))\n\necho "Starting at $(date)"\necho "Job submitted to the ${SLURM_JOB_PARTITION} partition, the default partition on ${SLURM_CLUSTER_NAME}"\necho "Job name: ${SLURM_JOB_NAME}, Job ID: ${SLURM_JOB_ID}"\necho " I have ${SLURM_CPUS_ON_NODE} CPUs and ${mem_gbytes}GiB of RAM on compute node $(hostname)"\n')
TAGS:
=====
187c6b10c86049c7 default notebooks
GLOBAL:
=======
(<_ast.Module object at 0x7f6b8f19bb50>, {})
ENVIRONMENT:
============
__signature_vars__ {'bash'}
_depends []
_index 0
_input []
_output Unspecified
_runtime {'queue': 'yale_hpc_slurm',
'run_mode': 'interactive',
'sig_mode': 'default',
'verbosity': 2,
'walltime': '00:05:00',
'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name 'default'
workflow_id '187c6b10c86049c7'
EXECUTION STATS:
================
Duration: 0s
Peak CPU: 0.0 %
Peak mem: 36.7 MiB
execution script:
================
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=cdb813b50789fb95
#SBATCH --output=/home/pgc29/.sos/tasks/cdb813b50789fb95.out
#SBATCH --error=/home/pgc29/.sos/tasks/cdb813b50789fb95.err
cd /data2/helen_mixed_infection/notebooks
sos execute cdb813b50789fb95 -v 2 -s default -m interactive
standard output:
================
Starting at Wed Oct 7 05:49:27 EDT 2020
Job submitted to the general partition, the default partition on farnam
Job name: cdb813b50789fb95, Job ID: 31767135
I have 2 CPUs and 10GiB of RAM on compute node c23n12.farnam.hpc.yale.internal
standard error:
================
/var/spool/slurmd/job31767135/slurm_script: line 8: cd: /data2/helen_mixed_infection/notebooks: No such file or directory
INFO: cdb813b50789fb95 started
Is there a way to check on a running task, something like sos status?
You can use the %task status magic. However, because the %run -q magic is currently blocking, that magic would not be run until after the end of %run. Ideally %run should become non-blocking once all the tasks are submitted. For now you can run sos status from a terminal (because %task is blocked)... definitely something that needs to be improved. Please feel free to submit tickets for problems and feature requests so that we know what the "pain points" are.
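For example, from a terminal (task ID taken from above; the -q option points sos status at the remote task queue):

sos status cdb813b50789fb95 -q yale_hpc_slurm -v 4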
Yes, it will take some configuration and time to get used to, and you might want to add

module load {" ".join(modules)}

to your template, which allows you to do

task: modules=['module1', 'module2']

to load particular modules for the scripts in the task. Again, feel free to let me know if you get into trouble so that we can make this process as easy and error-proof as possible.
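A sketch of how the yale_hpc_slurm template above might look with that line added. This assumes every task then supplies a modules list (the template would fail to expand otherwise); the --mem line is also an assumption, converting the task's mem option from bytes to GB:

task_template: |
    #!/bin/bash
    #SBATCH --time={walltime}
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node={cores}
    #SBATCH --mem={mem // 10**9}G
    #SBATCH --job-name={task}
    #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
    #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
    module load {" ".join(modules)}
    cd {workdir}
    {command}

Each task that uses this template would then need to pass modules=[...] as shown above.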
Okay, now I'm trying to run a real job but am having an issue. Per your suggestion I set a scratch path for my local machine as scratch: /data2 and for the cluster as scratch: /home/pgc29/scratch60. Then I try to run
%run test -q 'yale_hpc_slurm'
[test]
input: f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz',
       f'/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    cd /data2/helen_mixed_infection/data/tb-profiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001
It ends up hanging forever with INFO: Waiting for the completion of 1 task.
sos status b5eb33955aee5a77 -v4
gives
b5eb33955aee5a77 submitted
Created 33 min ago
TASK:
=====
run(fr"""module load miniconda
conda activate tbprofiler
cd /data2/helen_mixed_infection/data/tb-profiler
tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001
""")
TAGS:
=====
1a2a7669047f0dc5 notebooks test
GLOBAL:
=======
(<_ast.Module object at 0x2b96dfcc8ca0>, {})
ENVIRONMENT:
============
__signature_vars__ {'_input', 'run'}
_depends []
_index 0
_input [file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz'), file_target('/data2/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz')]
_output [file_target('/data2/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json')]
_runtime {'mem': 2000000000,
'queue': 'yale_hpc_slurm',
'run_mode': 'interactive',
'sig_mode': 'default',
'verbosity': 2,
'walltime': '00:15:00',
'workdir': path('/data2/helen_mixed_infection/notebooks')}
step_name 'test'
workflow_id '1a2a7669047f0dc5'
b5eb33955aee5a77.sh:
====================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=b5eb33955aee5a77
#SBATCH --output=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.out
#SBATCH --error=/home/pgcudahy/.sos/tasks/b5eb33955aee5a77.err
cd /data2/helen_mixed_infection/notebooks
sos execute b5eb33955aee5a77 -v 2 -s default -m interactive
b5eb33955aee5a77.job_id:
========================
job_id: 31773933
But when I run sacct -j 31773933 on the cluster, I get
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
31773933 b5eb33955+ general cohen_the+ 1 FAILED 1:0
31773933.ba+ batch cohen_the+ 1 FAILED 1:0
31773933.ex+ extern cohen_the+ 1 COMPLETED 0:0
It looks like workdir should have been translated from /data2/helen_mixed_infection/notebooks to /home/pgc29/scratch60/helen_mixed_infection/notebooks, but b5eb33955aee5a77.sh has cd /data2/helen_mixed_infection/notebooks
I also tried using #scratch paths directly:

[test]
input: f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R1_001.fastq.gz',
       f'#scratch/helen_mixed_infection/dataraw/R9994_CATCAAGT_S35_L001_R2_001.fastq.gz'
output: f'#scratch/helen_mixed_infection/data/tb-profiler/results/R9994_CATCAAGT_S35_L001.results.json'
task: walltime='00:15:00', mem='2G', workdir='#scratch/helen_mixed_infection/data/tb-profiler'
run: expand=True
    module load miniconda
    conda activate tbprofiler
    tb-profiler profile -1 {_input[0]} -2 {_input[1]} -p R9994_CATCAAGT_S35_L001

but #scratch is not expanded in the execution script:
execution script:
================
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5G
#SBATCH --job-name=a904366ca2a493bc
#SBATCH --output=/home/pgc29/.sos/tasks/a904366ca2a493bc.out
#SBATCH --error=/home/pgc29/.sos/tasks/a904366ca2a493bc.err
cd #scratch/helen_mixed_infection/data/tb-profiler
/home/pgc29/.local/bin/sos execute a904366ca2a493bc -v 2 -s default -m interactive
Hello, I have what I think is a simple question but haven't been able to get it to work. I'm processing genomes that are a mix of single-end and paired-end reads. Before mapping to a reference the commands differ between single-end and paired-end reads, so I have two parallel pipelines. But after mapping they're all bam files and I'd like to continue processing with just one pipeline. To show you what I mean, first I grab all the fastq files and join the ones that are paired. The filenames have the sample name followed by "_R1.fastq.gz" or "_R2.fastq.gz" to indicate a forward or reverse read.
[global]
import glob
import itertools
import os

fastq_files = sorted(glob.glob("/data/*.fastq.gz"))
grouped_fastq_dict = dict()
for k, v in itertools.groupby(fastq_files, lambda a: os.path.split(a)[1].split("_R", 1)[0]):
    grouped_fastq_dict[k] = list(v)

single_read, paired_read = dict(), dict()
for k, v in grouped_fastq_dict.items():
    if len(v) == 1:
        single_read[k] = v
    elif len(v) == 2:
        paired_read[k] = v
    else:
        print(f'Error: {k} has < 1 or more than 2 associated fastq files')
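To illustrate with hypothetical filenames, the two dicts end up keyed by sample name:

# e.g. with /data/sampleA_R1.fastq.gz, /data/sampleA_R2.fastq.gz and /data/sampleB_R1.fastq.gz
# the loops above would produce:
#   paired_read == {'sampleA': ['/data/sampleA_R1.fastq.gz', '/data/sampleA_R2.fastq.gz']}
#   single_read == {'sampleB': ['/data/sampleB_R1.fastq.gz']}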
Then I process them and map them to a reference
[trimmomatic-single]
input: single_read, group_by=1
output: trim_single = f'/data/{_input.labels[0]}/{_input:bnn}_trimmed.fastq.gz'
run: expand=True
    trimmomatic SE -phred33 {_input} {_output} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[trimmomatic-paired]
input: paired_read, group_by=2
output: trim_paired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed.fastq.gz',
        trim_unpaired_1=f'/data/{_input.labels[0]}/{_input[0]:bnn}_trimmed_unpaired.fastq.gz',
        trim_paired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed.fastq.gz',
        trim_unpaired_2=f'/data/{_input.labels[0]}/{_input[1]:bnn}_trimmed_unpaired.fastq.gz'
run: expand=True
    trimmomatic PE -phred33 {_input} {_output["trim_paired_1"]} {_output["trim_unpaired_1"]} \
        {_output["trim_paired_2"]} {_output["trim_unpaired_2"]} LEADING:10 TRAILING:10 SLIDINGWINDOW:4:16 MINLEN:40

[map-single]
input: output_from("trimmomatic-single"), group_by=1
output: bam = f'/data/{_input.name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'
id = _input.name.split("_R")[0]
rg = f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'
run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
        samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}

[map-paired]
input: output_from("trimmomatic-paired")["trim_paired_1"], output_from("trimmomatic-paired")["trim_paired_2"], group_by="pairs"
output: bam = f'/data/{_input["trim_paired_1"].name.split("_R")[0]}_GCF_000195955.2_filtered_sorted.bam'
id = _input["trim_paired_1"].name.split("_R")[0]
rg = f'\"@RG\\tID:{id}\\tPL:Illumina\\tSM:{id}\"'
run: expand=True
    bwa mem -v 3 -Y -R {rg} {reference} {_input} | samtools view -bu - | \
        samtools sort -T /data2/helen_mixed_infection/data/bam/tmp.{id} -o {_output}
But now I want to combine the output of the two parallel pipelines into the next step
[duplicate_marking]
input: output_from("map-single"), output_from("map-paired"), group_by=1
output: dedup=f'{_input:n}_dedup.bam'
bash: expand=True
    export JAVA_OPTS='-Xmx3g'
    picard MarkDuplicates I={_input} O={_output} M={_output:n}.duplicate_metrics \
        REMOVE_DUPLICATES=false ASSUME_SORT_ORDER=coordinate
But SoS complains because the outputs from map-single and map-paired have different numbers of groups. How can I use the output from both steps as the input to my duplicate_marking step?
Running sos run test combined
with test.sos
having the following workflow,
[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'
_output.touch()

[double]
input: for_each=dict(i=range(2))
output: f'double_{i}.bam'
_output.touch()

[combined]
input: output_from('single'), output_from('double')
print(_input)
You will see that the two groups from single
and double
are combined to form two groups with one output from single
and one output from double
.
If the two steps produce different numbers of groups (as in your case), you can do

[single]
input: for_each=dict(i=range(2))
output: f'single_{i}.bam'
_output.touch()

[double]
input: for_each=dict(i=range(3))
output: f'double_{i}.bam'
_output.touch()

[combined]
input: output_from('single', group_by='all'), output_from('double', group_by='all'), group_by=1
print(_input)
basically "flatten" and join both output_from
into a single group before separating them into groups with one file (group_by=
).
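With the touch-only example above, [combined] would then run five substeps, one per file, roughly:

# _input of each substep of [combined]
#   single_0.bam
#   single_1.bam
#   double_0.bam
#   double_1.bam
#   double_2.bam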
I occasionally get

ERROR: ERROR workflow_executor.py:1206 - Failed to connect to yale_hpc_slurm: ssh connection to pgc29@xxx.xxx.xxx.xxx time out with prompt: b'' - None

errors. Is there a way to adjust the timeout to make it longer?