These are chat archives for nextflow-io/nextflow

24th
May 2017
Karin Lagesen
@karinlag
May 24 2017 10:42
I'd like to set my number of cpus in the profile file, with process.$trimmomatic.cpus.
trimmomatic requires me to give it a number of threads with an option, otherwise it will just use 1.
how do I access the number of cpus I have specified in the profile in the script? do I just use $cpus since it is defined within that process's scope?
Paolo Di Tommaso
@pditommaso
May 24 2017 10:45
like this
process trimmomatic {

    """
    command_here --threads $task.cpus
    """
}
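On the profile side, the per-process setting Karin mentioned would look something like this minimal nextflow.config sketch (the profile name and cpu count are illustrative):

```nextflow
// nextflow.config -- hypothetical profile assigning cpus to the trimmomatic process
profiles {
    cluster {
        process.$trimmomatic.cpus = 8
    }
}
```

With this, $task.cpus inside the trimmomatic process resolves to 8 whenever the cluster profile is active.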
Karin Lagesen
@karinlag
May 24 2017 10:46
ah, that makes sense.
thanks!
Paolo Di Tommaso
@pditommaso
May 24 2017 10:49
welcome
Ted Toal
@tedtoal
May 24 2017 19:53
I'm taking my first look at nextflow, and I'm coming from a makefile background in my current pipeline. So, I'm trying to understand nextflow pipelines from that point of view. And so far, I don't understand whether or not nextflow accomplishes the same thing as a basic makefile does: only running a process on a file if the input file is newer than the output file. I can see that processes can be connected together without using files, e.g. as pipes between them, and that it is possible to specify a file/files for the input, but I've seen nothing saying how the executor decides whether or not a process needs to be run, leading me to believe it always runs the entire pipeline?
Ted Toal
@tedtoal
May 24 2017 20:01
Another big area of questions I have involves the flexibility (or lack) in structuring directories and filenames for a project. In my current multi-regional sequencing project, I've chosen to have a top-level data directory containing one subdirectory for each PERSON that is sequenced, and within the person subdirectory, one subdirectory for each TUMOR SAMPLE OR NORMAL SAMPLE that is sequenced, and within the sample subdirectories are the files generated from the sequencing files for that sample, and those filenames include the full sample name (which is a combination of the person ID and the sample ID). So for example two files that might exist, for persons JP23 and CN34, are data/JP23/T4/JP23T4.bam and data/CN34/N2/CN34N2_B02_L03_R2.fastq. Part of my challenge in a pipeline is to take a table of person IDs, sample type IDs for each person, and FASTQ files for each sample, and automatically generate appropriate filenames at each stage in the pipeline. The occurrence of the person ID twice in a file pathname has been a bit of an issue, because make doesn't like that sort of thing. How flexible is nextflow with file and directory naming?
Mike Smoot
@mes5k
May 24 2017 20:03
@tedtoal the -resume flag will tell nextflow to re-use the results of a process assuming the inputs to the process haven't changed (so in your case files). You have some control over what about the inputs is considered when re-running a process: https://www.nextflow.io/docs/latest/process.html?highlight=cache#cache
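A minimal sketch of the directive Mike links to (the process name and tool invocation are placeholders):

```nextflow
process align {
    // 'deep' hashes input file content, not just path + timestamp,
    // so re-staged but unchanged files still hit the cache
    cache 'deep'

    input:
    file reads from reads_ch

    """
    run_aligner $reads
    """
}
```

Re-launching with `nextflow run main.nf -resume` then skips any task whose cached inputs are unchanged.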
Ted Toal
@tedtoal
May 24 2017 20:10
Thanks, Mike. The cache part of nextflow is what makes it accomplish make-like dependencies. Now another question: makefiles often have "clean" targets, which would be equivalent to marking the cache entry as dirty or out-of-date. Is there a way to do that in nextflow?
And a question about modularity: does nextflow provide for easily breaking a pipeline into modules that are contained in separate files? Would this be accomplished with the include statement I saw somewhere?
Mike Smoot
@mes5k
May 24 2017 20:12
Given an input file/dir, nextflow (and groovy and java) provide a lot of tools for slicing and dicing the path to extract the various names you might want. You'll probably use the map operator on a channel to produce a channel of tuples: [person_id, sample_id, fileObj]. Then you can use the publishDir directive in a process to put things in the appropriate output dir.
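A rough sketch of that pattern, assuming paths like data/JP23/T4/JP23T4.bam (the process body and output layout are illustrative):

```nextflow
// build [person_id, sample_id, fileObj] tuples from the directory layout
Channel.fromPath('data/*/*/*.bam')
       .map { f -> tuple(f.parent.parent.name,   // person id, e.g. JP23
                         f.parent.name,          // sample id, e.g. T4
                         f) }
       .set { bams_ch }

process analyse {
    // dynamic publishDir: one output directory per person/sample
    publishDir { "results/${person}/${sample}" }

    input:
    set person, sample, file(bam) from bams_ch

    """
    do_something $bam
    """
}
```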
yes, nextflow has a clean command. Nextflow's version gives you a lot of control over what in your history you might want to delete.
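The CLI side of that might look like the following (the run name is a placeholder; see `nextflow clean -h` for the full option list):

```shell
# list past runs to find the name of the one to keep
nextflow log

# dry run: show what would be deleted for runs before 'gloomy_kay'
nextflow clean -before gloomy_kay -n

# actually delete those work directories
nextflow clean -before gloomy_kay -f
```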
Mike Smoot
@mes5k
May 24 2017 20:18
Modularity is still an outstanding feature request: nextflow-io/nextflow#238. You're very far from alone in wanting this feature.
Paolo Di Tommaso
@pditommaso
May 24 2017 20:26
How flexible is nextflow with file and directory naming?
the first thing to realise is that NF keeps the working directory completely separate from the final pipeline output
that can be confusing coming from Make
that said, NF provides great flexibility in managing/structuring output files and directories
your use case, a separate directory for each person/sample/etc, is very common; it's enough to provide that path in the associated process
See here for example
Ted Toal
@tedtoal
May 24 2017 20:31
Ok, thanks, Mike and Paolo. It looks like the channel commands use shell-like wildcards for handling multiple filenames. My file names don't map well with shell wildcards. I looked at map operator and publishDir briefly, but it's still vague to me how well NF would work with my file naming. We use multiple file extensions to indicate the state of processing of a file, e.g. JP24T.sorted.bam, JP24T.recal.bam. I want to write a module that I could plug into my pipeline that would process .bed files, and its output filenames would be derived from input filenames by adding an additional extension, e.g. input = Targets.sorted.merged.bed, output = Targets.norgns.sorted.merged.bed. What I want is a sort of filename wildcarding syntax that is much smarter than the typical shell globbing. But I want something that is easy to specify with a clear pipeline statement, that doesn't involve any processing of the filename with a regular expression that I write.
If NF keeps the working dir separate from the final output, it isn't unnecessarily copying what could be extremely large files from working dir to final dir, is it?
Paolo Di Tommaso
@pditommaso
May 24 2017 20:33
If NF keeps the working dir separate from the final output, it isn't unnecessarily copying what could be extremely large files from working dir to final dir, is it?
by default, NF creates a symlink in the final output folder; you will need to consolidate the output before cleaning up the temporary files
otherwise you can choose to create a hard link or copy the file
about file handling, the use of wildcards is the quickest way; however, you can provide the inputs with a configuration file, a YAML input file, or a custom-formatted file that you will need to parse
there's no limitation on how you can handle inputs; NF is a DSL on top of a general-purpose programming lang, so you don't have the limitations typical of a pure declarative system like Make
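A sketch of the custom-file route, assuming a comma-separated sample sheet with a header (file name and column names hypothetical):

```nextflow
// samples.csv columns: person,samptype,lane,pe,rawpath
Channel.fromPath('samples.csv')
       .splitCsv(header: true)
       .map { row -> tuple(row.person, row.samptype, row.lane, row.pe,
                           file(row.rawpath)) }
       .set { reads_ch }
```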
Ted Toal
@tedtoal
May 24 2017 20:38
Ok, that sounds good.
Thanks for info.
Paolo Di Tommaso
@pditommaso
May 24 2017 20:38
Have a look at this config for a possible way to specify the inputs
Mike Smoot
@mes5k
May 24 2017 20:39
What you want to do with file naming is almost certainly possible. I'd recommend starting with a dummy pipeline that just touches file or writes simple output and see if you can generate what you want before trying to add real work to your pipeline.
Paolo Di Tommaso
@pditommaso
May 24 2017 20:39
good tip
Mike Smoot
@mes5k
May 24 2017 20:40
I'd also guess that the way nextflow passes files around will make complicated file naming patterns less of a necessity.
Ted Toal
@tedtoal
May 24 2017 20:45
What would be ideal is if there was a way that the user could specify his file naming patterns in one place, and pipeline modules could be written without knowing anything about the naming scheme. People do lots of different sorts of naming and directory hierarchy schemes, and forcing them to adopt a particular one is not a good way to make a pipeline popular.
It may be that the NF cache could be used in place of a lot of intermediate files that we don't ever actually examine.
I believe the recommendation these days re sequence data for humans is to store all data for one person under one directory. Including the sample ID in the file is almost essential to avoid accidental mixups. Encoding things like run number, lane number, replicate number, or paired end number in the filename is a pretty natural and logical thing to do.
Paolo Di Tommaso
@pditommaso
May 24 2017 20:48
for this reason we tend to use glob patterns; this makes it easy to traverse a directory structure and fetch all files matching one or more file extensions
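For example (patterns illustrative):

```nextflow
// '**' crosses directory boundaries, so this walks the whole tree under data/
Channel.fromPath('data/**/*.fastq.gz')

// brace expansion matches more than one extension in a single pattern
Channel.fromPath('data/**/*.{bam,bai}')
```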
Ted Toal
@tedtoal
May 24 2017 20:58
Generally sequencing pipelines go from more files to fewer, and there are typically several steps where multiple input files produce one output file. The user should be able to provide the set of names used at each such step, and an overall pattern for file naming at each step, and the pipeline would use appropriate names. For example, user could specify 4 "levels": person ID, sample type, lane number, and paired end number, and provide a table of all combinations (because some combinations might not exist, so simply enumerating the possibilities at each level is not enough). A short example would help. I need to figure out how to enter code here.
Paolo Di Tommaso
@pditommaso
May 24 2017 20:59
if it may help there's an introductory tutorial here
Ted Toal
@tedtoal
May 24 2017 21:13
For example:
Table 1: raw input data information table

    Person  SampType  Lane  PE  RawPath
    JP24    N1        4     R1  JP-24-N_1-CBG001_L4_1.fastq.gz
    JP24    N1        4     R2  JP-24-N_1-CBG001_L4_2.fastq.gz
    JP24    N1        5     R1  JP-24-N_1-CBG001_L5_1.fastq.gz
    JP24    N1        5     R2  JP-24-N_1-CBG001_L5_2.fastq.gz
    JP24    T1        4     R1  JP-24-T_1-CBG002_L4_1.fastq.gz
    JP24    T1        4     R2  JP-24-T_1-CBG002_L4_2.fastq.gz
    JP24    T1        5     R1  JP-24-T_1-CBG002_L5_1.fastq.gz
    JP24    T1        5     R2  JP-24-T_1-CBG002_L5_2.fastq.gz
    JP24    T2        4     R1  JP-24-T_1-CBG003_L4_1.fastq.gz
    JP24    T2        4     R2  JP-24-T_1-CBG003_L4_2.fastq.gz
    JP24    T2        5     R1  JP-24-T_1-CBG003_L5_1.fastq.gz
    JP24    T2        5     R2  JP-24-T_1-CBG003_L5_2.fastq.gz
    CN33    N2        6     R1  CN-02-33-N_2-5K01_L6_1.fastq.gz
    CN33    N2        6     R2  CN-02-33-N_2-5K01_L6_2.fastq.gz
    CN33    T1        6     R1  CN-02-33-T_1-5K01_L6_1.fastq.gz
    CN33    T1        6     R2  CN-02-33-T_1-5K01_L6_2.fastq.gz

Table 2: file and directory path patterns at each level

Level     Pattern
Person    data/${Person}/${Person}*
SampType  data/${Person}/${SampType}/${Person}-${SampType}*
Lane      data/${Person}/${SampType}/${Person}-${SampType}_L${Lane:%02d}*
PE        data/${Person}/${SampType}/${Person}-${SampType}_L${Lane:%02d}_${PE}*
Karin Lagesen
@karinlag
May 24 2017 21:16
@tedtoal as far as I know, NF does not do what you want it to do
however, I could see this being a preprocessing script that ends up writing your config file which sorts out all sorts of params
Ted Toal
@tedtoal
May 24 2017 21:17

Ok, table columns don't line up well. But the idea here is that there are four levels at which files undergoing processing have a different number of files per person. I named those levels Person, SampType, Lane, PE. And the raw input sequencing files, .fastq.gz, often have bizarre naming not conforming to standards. Each level might have different combos for each person. E.g. JP24 was sequenced on 2 lanes, CN33 only on one. JP24 has two tumor samples, CN33 only 1.

And each level has a pattern that specifies how files at that level will be named. * in the pattern means all the file extensions for whatever kind of file you are dealing with at that level.

Karin Lagesen
@karinlag
May 24 2017 21:18
nf makes heavy use of the params scope, which is where you define all sorts of options for things such as output directories, variable names etc
note, that's where you can define them (you can do it in other ways too)
Ted Toal
@tedtoal
May 24 2017 21:18
Another Q: does NF have the ability to access MySQL databases? Does it have hooks where you can write Java code if need be?
Karin Lagesen
@karinlag
May 24 2017 21:18
afaik, no sql directly, but you can run any code from nf
anything you can run on the command line you can run from nf
Ted Toal
@tedtoal
May 24 2017 21:20
Can the code you run from NF manipulate the config parameters that are used by other processes, e.g. code would read an SQL table and add parameters to the config info from the table?
Paolo Di Tommaso
@pditommaso
May 24 2017 21:21
NF script is a superset of the Java programming lang, you can import and write any regular java code
Ted Toal
@tedtoal
May 24 2017 21:21
Ok, good.
Paolo Di Tommaso
@pditommaso
May 24 2017 21:22
(actually NF is a superset of Groovy, which is a superset of Java)
regarding that input table, you can store it in a file or a multiline string and then process it accordingly
regarding SQL table, you can use plain Java JDBC code, or Groovy SQL provided you import the groovy-sql library
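A minimal Groovy SQL sketch inside a NF script (the connection URL, credentials, and table are hypothetical; assumes the MySQL JDBC driver and groovy-sql are on the classpath):

```nextflow
import groovy.sql.Sql

// open a connection and turn query rows into a channel of tuples
def db = Sql.newInstance('jdbc:mysql://localhost/seqdb', 'user', 'secret',
                         'com.mysql.jdbc.Driver')
def rows = db.rows('SELECT person, sample, path FROM fastq_files')
db.close()

Channel.from(rows)
       .map { r -> tuple(r.person, r.sample, file(r.path)) }
       .set { samples_ch }
```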
Ted Toal
@tedtoal
May 24 2017 21:46

Say you have the above tables of info. You want to write a pipeline module that is blind to what the data files and directory structure might be. Say it is a module that combines .bam files from the Lane level into single .bam files at the SampType level. I'd want the input and output specifications for this module to be (for example): input=level("Lane"), output=level("SampType"), and from that knowledge and from the table data the module could deduce which set of input files corresponds to which output files (except for the * part of the name, the extensions). The module would be invoked 5 times in the example above.

Say the module requires that both the input and output files occur in pairs because each set of data has a ".bam" file and a ".bai" file. The module wants to add ".merge" before ".bam" and ".bai" in the output filenames. It has no idea which module (process, in NF case?) might be connected to it to provide input, but it knows that that module must provide .bam/.bai file pairs and they must have filenames conforming to the pattern for level "Lane", i.e. that module's output should be at level "Lane". So, the module spec might say something like: input_fileextension_pattern=[#.bam, #.bai] and output_fileextension_pattern=[.merge.bam, .merge.bai].

The input and output filenames for the five runs would be (using # = .sorted as the pre-extension used by the input module that was chosen):

Run 1
Input:
data/JP24/N1/JP24-N1_L04.sorted.bam
data/JP24/N1/JP24-N1_L04.sorted.bai
data/JP24/N1/JP24-N1_L05.sorted.bam
data/JP24/N1/JP24-N1_L05.sorted.bai
Output:
data/JP24/N1/JP24-N1.merge.bam
data/JP24/N1/JP24-N1.merge.bai

Run 2
Input:
data/JP24/T1/JP24-T1_L04.sorted.bam
data/JP24/T1/JP24-T1_L04.sorted.bai
data/JP24/T1/JP24-T1_L05.sorted.bam
data/JP24/T1/JP24-T1_L05.sorted.bai
Output:
data/JP24/T1/JP24-T1.merge.bam
data/JP24/T1/JP24-T1.merge.bai

Run 3
Input:
data/JP24/T2/JP24-T2_L04.sorted.bam
data/JP24/T2/JP24-T2_L04.sorted.bai
data/JP24/T2/JP24-T2_L05.sorted.bam
data/JP24/T2/JP24-T2_L05.sorted.bai
Output:
data/JP24/T2/JP24-T2.merge.bam
data/JP24/T2/JP24-T2.merge.bai

Run 4
Input:
data/CN33/N2/CN33-N2_L06.sorted.bam
data/CN33/N2/CN33-N2_L06.sorted.bai
Output:
data/CN33/N2/CN33-N2.merge.bam
data/CN33/N2/CN33-N2.merge.bai

Run 5
Input:
data/CN33/T1/CN33-T1_L06.sorted.bam
data/CN33/T1/CN33-T1_L06.sorted.bai
Output:
data/CN33/T1/CN33-T1.merge.bam
data/CN33/T1/CN33-T1.merge.bai

Does this make sense? I think this shows the fundamental naming problem that sequencing pipelines have to solve.
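In NF terms, the five runs above would usually fall out of a groupTuple keyed on (person, sample) rather than out of filename patterns; a rough sketch under that assumption (channel names and the samtools commands are illustrative):

```nextflow
// lane_ch emits one [person, samp, bam, bai] tuple per lane-level pair
lane_ch
    .groupTuple(by: [0, 1])     // one emission per (person, sample) group
    .set { grouped_ch }

process mergeBams {
    publishDir { "data/${person}/${samp}" }

    input:
    set person, samp, file(bams), file(bais) from grouped_ch

    output:
    set person, samp, file("${person}-${samp}.merge.bam"),
        file("${person}-${samp}.merge.bai") into merged_ch

    """
    samtools merge ${person}-${samp}.merge.bam $bams
    samtools index ${person}-${samp}.merge.bam ${person}-${samp}.merge.bai
    """
}
```

With this shape the process runs once per (person, sample) pair, five times for the example data, regardless of how many lanes each pair has.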

Paolo Di Tommaso
@pditommaso
May 24 2017 21:47
you need to interact with @skptic who's struggling with a similar problem
Paolo Di Tommaso
@pditommaso
May 24 2017 21:55
that said, a pure rule/declarative approach is inherently limited; you will always find a use case you are not able to handle w/o writing complex module logic, which ultimately will bring more problems than the ones it solves
Karin Lagesen
@karinlag
May 24 2017 21:55
I do something that slightly resembles this, however, I do it without a config file
Paolo Di Tommaso
@pditommaso
May 24 2017 21:56
NF is data structure agnostic, you need only to feed a task with the right files that are needed
Karin Lagesen
@karinlag
May 24 2017 21:56
In my config file, I specify what the input and the output file endings are
I have all my inputs in one directory and I use the fromFilePairs to stick datasets together
Paolo Di Tommaso
@pditommaso
May 24 2017 21:57
exactly
Karin Lagesen
@karinlag
May 24 2017 21:57
that way I can get hold of the prefix for the files, which I then use for the naming
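The operator Karin mentions, sketched (the glob is illustrative):

```nextflow
// matches e.g. reads/JP24-N1_1.fastq.gz + reads/JP24-N1_2.fastq.gz and
// emits [ 'JP24-N1', [read1, read2] ] -- the shared prefix is the grouping key
Channel.fromFilePairs('reads/*_{1,2}.fastq.gz')
       .set { read_pairs_ch }
```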
Ted Toal
@tedtoal
May 24 2017 22:05
It does seem that each new analysis is somehow different and throws a new curve at you, file-naming-wise. But the names are important, so there need to be good ways to handle naming. glob rules are inadequate. makefile % patterns are inadequate. I can't yet tell if NF might be able to handle the above. I'd need to learn more details about NF. I have makefiles that work, but they've grown way too big and complex, I want something more modular, and easier to work with than makefiles.
Karin Lagesen
@karinlag
May 24 2017 22:05
my advice is to keep things simple
what I do is that I keep names at one, and only one, level
everything else is named generically
Ted Toal
@tedtoal
May 24 2017 22:05
Thank you! :-) I am notorious for NOT doing that. But you are right.
Karin Lagesen
@karinlag
May 24 2017 22:06
it is tempting to keep information in file names, because that's what we have been trained to do
Ted Toal
@tedtoal
May 24 2017 22:07
I have a scheme now where I'm making makefile "modules" that I include with the make include command. The system works, but I haven't converted my large cumbersome makefiles to it, and I'm thinking I shouldn't. Instead, I should pick a good pipeline workflow manager and go with it.
Karin Lagesen
@karinlag
May 24 2017 22:07
but - you are creating things based on info from a db
how about just keeping all that info in the db where it belongs and just running with a number or some other name that you can then back-translate to?
Ted Toal
@tedtoal
May 24 2017 22:08
Sequencing data is often transferred around, and if you don't have sample info in the filename, almost for sure sooner or later samples will get mixed up. They get mixed up anyway, for other reasons.
Often you want to view a particular person's data manually. You open his data in a genome browser.
Karin Lagesen
@karinlag
May 24 2017 22:08
so add something (like the flowcell id number etc) and then a tracking number?
Ted Toal
@tedtoal
May 24 2017 22:08
You want to be able to find that file without looking in a database and searching for a GUID or something.
The IDs also show up in reports.
We are moving towards using databases, but only just starting. Currently it is all file-oriented.
Karin Lagesen
@karinlag
May 24 2017 22:09
yeah, but the reports != processed files, you can grab that in the reporting stage
(also known as, you should create a reporting process :))
Ted Toal
@tedtoal
May 24 2017 22:10
And I guess I just don't LIKE the idea of filenames being pure numbers, associated with something only by referencing a database.
Karin Lagesen
@karinlag
May 24 2017 22:10
I will be creating one myself later this summer
Ted Toal
@tedtoal
May 24 2017 22:10
Reporting process, yes, tell me about it. Probably our biggest problem.
Karin Lagesen
@karinlag
May 24 2017 22:10
:)
I found something that writes both html and pdf and more stuff.... let me find the name