These are chat archives for nextflow-io/nextflow

28th
Mar 2018
Maxime Garcia
@MaxUlysse
Mar 28 2018 07:24
Impressive @mes5k
Stephen J Newhouse
@snewhouse
Mar 28 2018 09:12
Hi Nextflow folks and the awesome @pditommaso! I'm tasked with putting together a short course on applied bioinformatics next year and would like to include nextflow as part of the teaching - I'll have half a day only. It will be for a mix of medics, nurses, and science grads... very few with CS or health/bioinformatics experience. Has anyone done something like this before with a student mix like this? Any advice and/or material would be much appreciated. The aim is to teach largely via hands-on workshops with premade scripts etc.; the focus is on NGS processing for Genomic Medicine. If this is the wrong place to post, then apologies! :) Thanks in advance
Paolo Di Tommaso
@pditommaso
Mar 28 2018 09:15
Hi Stephen, that looks good
In my experience a basic knowledge of the Linux shell helps a lot
Stephen J Newhouse
@snewhouse
Mar 28 2018 09:17
aye - we’ll be having basic Unix in the first 2 days of the 5-day course - it will be tough, but do-able-ish... a win would be just getting them to run a few ready-made scripts and pipelines from the command line. I’d like to show them nextflow as one of the latest, greatest things to hit bioinformatics :)
Paolo Di Tommaso
@pditommaso
Mar 28 2018 09:18
ahah, makes sense!
my suggestion, provide self-contained tutorial/examples
so they can run/try and see the results
IMO it's the best way to learn
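something as small as this already shows the whole write-run-inspect cycle (a hypothetical hello-world, just for illustration):
// a self-contained example: no external data or setup needed
Channel.from('Bonjour', 'Ciao', 'Hello', 'Hola').set{ greetings }

process sayHello {

    input:
    val greeting from greetings

    output:
    stdout into result

    script:
    """
    echo '${greeting} world!'
    """
}

result.view()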
Kevin Sayers
@KevinSayers
Mar 28 2018 09:29
@snewhouse https://github.com/nextflow-io/hack17-tutorial may be a good starting point for some tutorial ideas. It worked fairly well for the Nextflow workshop with a mixed crowd in my opinion. Not sure if @pditommaso agrees? :smile:
Stephen J Newhouse
@snewhouse
Mar 28 2018 09:35
@KevinSayers nice - thanks! I’ll have a play with this :thumbsup:
Paolo Di Tommaso
@pditommaso
Mar 28 2018 09:36
definitely
Fredrik Boulund
@boulund
Mar 28 2018 14:01
Is it possible to query/see, from inside the workflow script, which scheduler Nextflow is running on? I can't seem to find this information in any bound variable (args, params, baseDir, workDir, workflow, nextflow). I can see the profile that is currently running, but that's not really what I want. I want the workflow behavior to change a bit depending on the scheduler used.
Paolo Di Tommaso
@pditommaso
Mar 28 2018 14:02
that's not possible by design
you can use a custom parameter to distinguish that (but it sounds like a bad idea)
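e.g. a sketch of that (discouraged) approach, with a made-up parameter name:
// the scheduler is declared by hand, e.g. nextflow run main.nf --scheduler slurm
params.scheduler = 'local'

process foo {

    script:
    if( params.scheduler == 'slurm' )
        """
        echo 'running with slurm-specific tweaks'
        """
    else
        """
        echo 'running with the default behaviour'
        """
}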
Fredrik Boulund
@boulund
Mar 28 2018 14:05
Ah, good to know then!
Then I can stop looking :)
What are the risks that you are protecting me from? :D
Paolo Di Tommaso
@pditommaso
Mar 28 2018 14:13
well, tasks should not depend on the execution platform
this can potentially make the pipeline non-portable
Fredrik Boulund
@boulund
Mar 28 2018 14:18
Sure, that makes sense. I wasn't going to do something as nasty as that, but you never know what other people might do... ;)
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:12
Snakemake is a major part of my workflow, but there seem to be some advantages to nextflow. There are a couple of really easy things to do in snakemake that I can't figure out from the documentation (it's there, I think, but not sinking in). The first is splitting and joining files within the shell script.
In snakemake, I put an additional {thing} in the output file name that's not in the input when I want to split. That {thing} is in downstream inputs and outputs until the desired final files.
In order to join, I do the reverse by providing a list with the expand() function in the input.
Is there anyone familiar with both pipelining tools that can clarify this for me?
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:16
do you mean the scatter-gather pattern?
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:17
Yes, but those seem to be for text files. I'm thinking of generic patterns where the pipelining tool expects many-in, few-out or vice versa.
An example would be bcftools merge or bcftools concat for gathering.
and for the reverse:
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:21
the main thing to understand with nextflow is that tasks are driven by data, not the other way around
therefore the multiplicity of task execution, i.e. how many times a task is executed, is controlled by the input data it receives
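for example, a channel with three values makes this (made-up) task run three times, with no loop anywhere:
Channel.from('sampleA', 'sampleB', 'sampleC').set{ samples }

process process_sample {

    input:
    val sample from samples

    script:
    """
    echo 'processing ${sample}'
    """
}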
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:22
for i in $(seq 1 22); do bcftools view -r $i -Oz -o chr${i}.vcf.gz; done
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:23
nearly
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:23
I understand that. Snakemake is similar, but I just can't figure it out
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:24
a merge is generally managed with a collect or a groupTuple
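e.g. with toy values, just to show the shape of the two operators:
// collect gathers *all* items into a single emission -> one merge task downstream
Channel.from('chr1.vcf', 'chr2.vcf', 'chr3.vcf')
       .collect()
       .view()    // => [chr1.vcf, chr2.vcf, chr3.vcf]

// groupTuple gathers items sharing a key -> one merge task per key
Channel.from(['s1','a.vcf'], ['s1','b.vcf'], ['s2','c.vcf'])
       .groupTuple()
       .view()    // => [s1, [a.vcf, b.vcf]] then [s2, [c.vcf]]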
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:24
and a split?
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:27
you can split the input dataset
but I guess it's not what you are looking for
generally you simply have multiple samples/files, and for each of them you want to execute a task
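for example (the paths here are made up), one task is spawned per matching file:
// ten files matching the glob => ten parallel tasks
Channel.fromPath('data/*.vcf.gz').set{ vcfs }

process index_vcf {

    input:
    file vcf from vcfs

    output:
    file "${vcf}.tbi" into indexed

    script:
    """
    tabix -p vcf ${vcf}
    """
}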
Mike Smoot
@mes5k
Mar 28 2018 18:28
@BEFH I think your for loop would be expressed something like this:
Channel.from(1..22).set{ inch }

process run_bcf {

    input:
    val(i) from inch

    output:
    file("${i}.vcf.gz") into outch

    script:
    """
    bcftools view -r ${i} -Oz -o chr${i}.vcf.gz
    """
}

outch.view()
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:30
or
process run_bcf {

    input:
    each i from (1..22)

    output:
    file("${i}.vcf.gz") into outch

    script:
    """
    bcftools view -r ${i} -Oz -o chr${i}.vcf.gz
    """
}

outch.view()
Mike Smoot
@mes5k
Mar 28 2018 18:32
It's probably worth trying both to see what the difference is!
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:32
:+1:
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:37
so probably something like this for splitting then combining?
process split_bcf {

    input:
    val(i) from inch
    file(infile) "in.vcf.gz"

    output:
    file("${i}.vcf.gz") into outCH

    script:
    """
    bcftools view -r ${i} -Oz -o chr${i}.vcf.gz $infile
    """
}

process join_bcf {

  input:
  file(inp) from outCH.collect()

  output:
  file(out) "all.vcf.gz"

  script:
  """
  bcftools concat -Oz -o all.vcf.gz $inp
  """

}
that's probably horribly broken, but...
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:40
nearly
is "in.vcf.gz" in the current path ?
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:42
In the MWE, yes. In real usage, it would be in an input directory, and there would be cohort names, rules in between, etc.
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:43
ok, in any case you should replace "in.vcf.gz" with a typed object, i.e.
process split_bcf {

    input:
    val(i) from inch
    file(infile) from file("in.vcf.gz")

    output:
    file("${i}.vcf.gz") into outCH

    script:
    """
    bcftools view -r ${i} -Oz -o chr${i}.vcf.gz $infile
    """
}
the file(infile) from file("in.vcf.gz") is a bit bizarre (and will likely be improved in the future) but it's required
or something more like this:
params.vcf_file = "in.vcf.gz"
vcf_file = file(params.vcf_file)

process split_bcf {

    input:
    val(i) from inch
    file(infile) from vcf_file

    output:
    file("${i}.vcf.gz") into outCH

    script:
    """
    bcftools view -r ${i} -Oz -o chr${i}.vcf.gz $infile
    """
}
this allows you to specify --vcf_file as a command line parameter, with in.vcf.gz as the default
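e.g., assuming the script is saved as main.nf:
nextflow run main.nf --vcf_file /path/to/my_cohort.vcf.gz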
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:53

and I would use .combine() and some sort of string formatting to do combinations of cohort and chromosome? What if I want to do nC3 or nC4?

eg: "{cohort}_chr{chrom}_filtered.bgen" where cohort is all cohorts and chrom is all chromosomes and I want the operation to run in parallel on all generated files.

Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:54
yes
Brian Fulton-Howard
@BEFH
Mar 28 2018 18:54
is combine overloaded from Groovy?
Paolo Di Tommaso
@pditommaso
Mar 28 2018 18:55
Nope, it's NF code
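it produces the cartesian product of the two channels; you can see what it emits with a couple of toy channels:
Channel.from(1, 2)
       .combine( Channel.from('a', 'b') )
       .view()
// => [1, a], [1, b], [2, a], [2, b]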
Brian Fulton-Howard
@BEFH
Mar 28 2018 19:04
so would you call something like file(collected) from sprintf("A:$s_B:$s_C:$s.file", a.combine(b).combine(c).flatten())?
Seems a bit much...
Paolo Di Tommaso
@pditommaso
Mar 28 2018 19:13
not sure I understand ..
Mike Smoot
@mes5k
Mar 28 2018 19:17
Me neither! I'd really recommend starting with simple examples with mock data and mock processes that don't compute anything, but demonstrate how data flows through nextflow. Something like this:
Channel.from('a','b','c').set{ cohort }
Channel.from(1..22).set{ chrom }

cohort.combine(chrom).set{ co_chrom }

process run_bcf {

    input:
    set val(coh), val(chr) from co_chrom

    output:
    file("${coh}_${chr}.vcf.gz") into outch

    script:
    """
    touch ${coh}_${chr}.vcf.gz
    """
}

outch.view()
Brian Fulton-Howard
@BEFH
Mar 28 2018 19:34
If you had three things you wanted to combine, would you just use two combine()s, or would you also need to use something like flatten()?
Mike Smoot
@mes5k
Mar 28 2018 19:36
three combines.
sorry, two
Michael L Heuer
@heuermh
Mar 28 2018 19:37
Outsider opinion here, but it appears you're trying to coerce the workflow into parallelizing things a certain way. I believe that makes the workflow harder to conceptualize and may fight against the parallelization that nextflow provides.
Brian Fulton-Howard
@BEFH
Mar 28 2018 19:37
Channel.from('a','b','c').set{ cohort }
Channel.from(1..22).set{ chrom }
Channel.from('a','b','c').set{ other }

cohort.combine(chrom).combine(other).set{ co_chrom }

process run_bcf {

    input:
    set val(coh), val(chr), val(other) from co_chrom

    output:
    file("${coh}_${chr}_${other}.vcf.gz") into outch

    script:
    """
    touch ${coh}_${chr}_${other}.vcf.gz
    """
}

outch.view()
Mike Smoot
@mes5k
Mar 28 2018 19:38
Does that produce what you'd expect to see?
Michael L Heuer
@heuermh
Mar 28 2018 19:38
E.g. splitting by chromosome does not balance execution across a cluster well
Brian Fulton-Howard
@BEFH
Mar 28 2018 19:47

Oh, I'm well aware that splitting by chromosome is not best in most cases, but some tools require single chromosomes, and some split even further. For instance, imputation methods often split things into 10-kilobase contigs and run machine learning algorithms to generate haplotypes and infer ungenotyped variants. Other times, splitting the files and operating in parallel is the easiest way to spread operations across nodes. Do you have an alternative way that doesn't use OpenMP?

I generally tend towards merging, though...

@mes5k I don't know. This is a theoretical exercise so far, because I don't want to deal with Java dependencies right now. I'm going to save this for a weekend project.
Mike Smoot
@mes5k
Mar 28 2018 19:50
I think experimenting with mock data will clarify things quite a bit.
Brian Fulton-Howard
@BEFH
Mar 28 2018 19:51
I'll probably do that this weekend.

Snakemake seems to have the advantage over nextflow of a robust implicit wildcard system, and of being clearer about exactly what the input and output files are.

However, things seem much more explicit in nextflow, which might prevent things from breaking in unpredictable ways. It also seems like a shorter path from shell script to nextflow than to Snakemake. I also like the reporting and visualization features.

Brian Fulton-Howard
@BEFH
Mar 28 2018 20:02
Thanks for all your help! I think I have a much better idea of how things work and will be able to play around with some toy data.
Michael L Heuer
@heuermh
Mar 28 2018 22:03
Do you have an alternative way that doesn't use OpenMP?
Brian Fulton-Howard
@BEFH
Mar 28 2018 22:05
Yes, well, we're moving towards Spark, but our cluster doesn't yet support it.
Michael L Heuer
@heuermh
Mar 28 2018 22:06
Understood. Spark on Slurm/LSF is an option, Spark on Kubernetes is maturing, but yeah it is a significant barrier to entry.
Brian Fulton-Howard
@BEFH
Mar 28 2018 22:07
Our cluster is currently on an older version of LSF that doesn't support Spark. When they do their major upgrade this year to CentOS 7 with the new LSF, I've requested they support Spark.
Are those things better than Hail, or is their main advantage legacy file format support?
Michael L Heuer
@heuermh
Mar 28 2018 22:13
Better than Hail, yes, but then I am biased. ;)
Hail implements one use case (variants with variant and sample annotations), ADAM &c. supports that use case and many others. I have on my long list of things to do a Jupyter/Zeppelin notebook that follows a Hail example with ADAM.
I wish the two projects could work more closely together; I reach out often and receive no reply.
Brian Fulton-Howard
@BEFH
Mar 28 2018 22:17
It looks like there's no VDS reader? If my data is in VDS, I would have to export to VCF?
Michael L Heuer
@heuermh
Mar 28 2018 22:19
We like to think of it as Hail not having an ADAM Parquet+Avro reader. :) But yeah, that's what I mean by the two projects needing to work together.
From what I understand, VDS is already Parquet, we'd only need to convert from their dataset/dataframe schema to ours.
Brian Fulton-Howard
@BEFH
Mar 28 2018 22:21
They're probably not snubbing you. They're really busy upgrading major versions right now and afaik, they're in a feature freeze.
Michael L Heuer
@heuermh
Mar 28 2018 22:22
Perhaps this conversation should move over to https://gitter.im/bigdatagenomics/adam. Sorry again, nextflow-ers.