These are chat archives for nextflow-io/nextflow

11th
Jul 2018
Shawn Rynearson
@srynobio
Jul 11 2018 02:42
@pditommaso & @bioinforad_twitter does the time directive apply to aws-batch?
Rad Suchecki
@rsuchecki
Jul 11 2018 04:38
apparently it does @srynobio https://www.nextflow.io/docs/latest/process.html?highlight=time but I haven't tried that
Paolo Di Tommaso
@pditommaso
Jul 11 2018 07:23
@srynobio yes, since version 0.30.x nextflow-io/nextflow#648
@acerj_twitter a workaround is to have tasks work in local storage by setting process.scratch = true in the config file
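For reference, a minimal sketch of that workaround in nextflow.config (it applies the setting to every process):

// execute each task in the node's local scratch storage (outputs are copied back to the work dir)
process.scratch = true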
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:27
question for the broader NF community: how do you test your Nextflow pipelines? I mean, is anyone using some sort of unit tests, ones that fit with the NF model, to ensure that the pipeline is running correctly and producing what is expected?
Phil Ewels
@ewels
Jul 11 2018 09:29
Interesting question @fstrozzi - we've been thinking about this problem a bit with our pipelines (now nf-core)
We have written a linting tool for nf-core that checks a bunch of stuff (key files and variables are present, version numbers are consistent, various templates are used properly etc)
So that runs on every CI check, plus we also use a minimal dataset to test that the pipeline runs without error
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:30
interesting
Phil Ewels
@ewels
Jul 11 2018 09:30
Using both the latest version of nextflow and also the defined "minimum version" in the pipeline
Currently just docker, but we may expand to singularity as well
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:31
Can I find this on the nf-core repository?
Phil Ewels
@ewels
Jul 11 2018 09:31
These are the linting tests that we have at the moment: http://nf-co.re/errors
The linting tool itself is at https://github.com/nf-core/tools
A travis config for a typical pipeline is at https://github.com/nf-core/methylseq/blob/master/.travis.yml
At the suggestion of @pditommaso, we're aiming for all pipelines to have a config profile called test which basically does everything. See https://github.com/nf-core/methylseq/blob/master/conf/test.config
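As a rough sketch, such a test profile in nextflow.config might look like this (the parameter names and values below are illustrative, not the actual nf-core ones):

profiles {
  test {
    // hypothetical params pointing at a small test dataset
    params.reads  = 'https://github.com/nf-core/test-datasets/raw/methylseq/testdata/test_R1.fastq.gz'
    params.genome = 'test_genome'
    // keep resource requests small so CI runners can cope
    process.cpus   = 2
    process.memory = '6 GB'
    process.time   = '2h'
  }
}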
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:32
that’s great
Phil Ewels
@ewels
Jul 11 2018 09:32
So the actual nextflow command to run the test workflow is super easy: nextflow run <pipeline> -profile test
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:32
@ewels it would be great if you could present the tools you are developing in your presentation at the NF workshop
Phil Ewels
@ewels
Jul 11 2018 09:32
Finally, we keep the test data in a separate repo so as not to bloat the pipeline downloads: https://github.com/nf-core/test-datasets
Each pipeline has its own branch there, so you don't have to download the test data for every pipeline just to run one test
This isn't perfect, but it's the best solution we've come up with so far
@pditommaso - I'd love to! I already submitted an abstract about the project :wink:
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:34
I haven't read it yet, but you know..!
also, this could be material for a paper
have you ever thought about that?
Phil Ewels
@ewels
Jul 11 2018 09:35
Yup, hoping to do something soon-ish..
I want to get a few more pipelines finished with stable releases first
hence the upcoming nf-core hackathon week :)
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:35
"fostering genomic pipelines reproducibility with nf-core stack"
hence the upcoming nf-core hackathon week
I can't join :(
Phil Ewels
@ewels
Jul 11 2018 09:37
:frowning: :-1:
hah, no I understand :wink:
It started off as a small internal thing, I'm amazed at how much interest there has been. Going to struggle to physically fit everyone into our floor space as it stands..
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:37
@pditommaso @ewels I think CI and pipeline testing is a topic deserving a lot of attention; we are really seeing NF enabling agile pipeline development, so testing is a critical feature we need to get right. I am submitting an abstract right now on this (agile pipelines with NF) for the workshop
Phil Ewels
@ewels
Jul 11 2018 09:38
Nice! :+1:
Yes, our testing is not super-deep, we don't check the outputs of the pipelines for example, or have code coverage etc.
So definitely some room for improvement
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:38
Phil, I'll look into the linting from nf-core, it seems great, and thanks for pointing that out
Phil Ewels
@ewels
Jul 11 2018 09:39
no problem! We have a nf-core gitter channel if you want to discuss further
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:39
regarding testing, nextflow has an embedded test feature, but I've never promoted it
you can use nextflow run -test <script>
and it runs all methods with a testXxx prefix
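A minimal sketch of such a test method inside a pipeline script (the function name and assertion are illustrative):

// picked up by `nextflow run -test <script>` because the name starts with "test"
def testSampleNameParsing() {
    assert 'sample1' == 'sample1.txt'.replaceAll(/\.txt$/, '')
}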
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:41
I’ll start from that…
Phil Ewels
@ewels
Jul 11 2018 09:42
Nice! :+1: @pditommaso - could there be a way to use these test methods in combination with a pipeline run to validate the output files?
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:44
I think that would require a separate feature to validate a process output
something like
process foo {
  output: 
  file x 

  script:
  """
  your_command
  """

  validate: 
  """
  validation_command --data $x
  """
}
or were you thinking of something else?
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:47
It would be great if we could add something like this for the testing process
Phil Ewels
@ewels
Jul 11 2018 09:47
that would be very clear, which would be nice :+1: However it could make the code quite bloated..
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:47
integrating unit testing in NF can be “disruptive” :smile:
Phil Ewels
@ewels
Jul 11 2018 09:47
I was thinking the methods were quite nice as you could keep that in a separate file and just include it to keep things clean
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:48
I see
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:48
good point
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:48
but should the validation be native groovy/java code or external scripts?
I guess both .. !
Phil Ewels
@ewels
Jul 11 2018 09:49
what about having a special subdir called test which could have a test.config file in it and a test.nf file with processTest foo { } blocks...?
Francesco Strozzi
@fstrozzi
Jul 11 2018 09:49
or bash
@ewels in the processTest you would then have only the validation part, I think
Phil Ewels
@ewels
Jul 11 2018 09:50
yes exactly
basically the same thing as above, except removing it all out into a separate directory
more work to build this into nextflow of course :wink: Maybe a little too crazy..
but processTest could be like a regular process, except it would just stage the work directory and have a script block again, as suggested above
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:51
would that run only as a special "test" execution, or automatically after the processes in the main script are executed?
Phil Ewels
@ewels
Jul 11 2018 09:51
yes exactly
special "test" execution
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:51
yes exactly
Phil Ewels
@ewels
Jul 11 2018 09:51
sorry
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:51
ok
Phil Ewels
@ewels
Jul 11 2018 09:51
that was my thinking with having the config file in the same directory
so you run -test and it picks up that config file (with input test data and everything else)
and then runs the validation tests when complete
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:52
how is that different from having a test profile?
Phil Ewels
@ewels
Jul 11 2018 09:52
You have the validation code blocks
for just the config bit alone there is no difference
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:52
wait, you want to run a pipeline and then launch a separate validation phase?
Phil Ewels
@ewels
Jul 11 2018 09:53
possibly...... not sure
I was thinking basically identical execution to your example above with the validation block below script
but just reorganising where that code is kept into a separate subdir to keep things clean and tidy
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:54
yes, but I think you are envisioning this only for testing the pipeline
instead, for validation I mean verifying data integrity checks and so on after a real run
Phil Ewels
@ewels
Jul 11 2018 09:57
ah I see
ah yeah, that's a much more broad thing, I hadn't considered that
any reason you couldn't put that in the normal script block though?
Paolo Di Tommaso
@pditommaso
Jul 11 2018 09:59
that's true, that would be just syntax sugar
even if I was also thinking of something more detailed, such as the ability to specify a different validation strategy for each output file
but I've concluded that it would be overkill
one thing we should also discuss, tool annotation and tool version fetching
Phil Ewels
@ewels
Jul 11 2018 10:03
so - for our minimal case, where we want unit testing with specific test data: I could imagine wanting to use samtools to count the number of aligned reads and check that this is the expected number, for example. Is that possible somehow?
I guess we could have a bunch of extra processes at the bottom and have when: workflow.profile == 'test'
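As a rough sketch of that idea (the process name, channel, and expected count are all illustrative):

process check_alignment_count {
    // only run this check when the pipeline is launched with -profile test
    when:
    workflow.profile == 'test'

    input:
    file bam from ch_bam_for_checks   // hypothetical channel carrying the test BAM

    script:
    """
    count=\$(samtools view -c -F 4 $bam)
    test "\$count" -eq 1000   # expected aligned-read count for the test data (made-up number)
    """
}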

one thing we should also discuss, tool annotation and tool version fetching

Sure :smile: What do you mean by tool annotation?

Paolo Di Tommaso
@pditommaso
Jul 11 2018 10:05
the ability to annotate a process with the tool used
so that it can be included in a provenance report
Phil Ewels
@ewels
Jul 11 2018 10:05
ah I see
yup, it would be nice to have a more standardised method of doing this
Paolo Di Tommaso
@pditommaso
Jul 11 2018 10:05
we have had some requests for this, do you have this kind of need?
* standard *
:smile:
Phil Ewels
@ewels
Jul 11 2018 10:06
we already grab the version numbers and put them in the final MultiQC report
But it's not super pretty, and it only works (nicely) when you have a single container for the entire pipeline
Previously we tried using regexes to grab version numbers from the tool stdout, but not all tools print this information and it got messy really fast
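For context, a minimal sketch of the kind of version-collection process being described (the tool list and file name are illustrative, not the actual nf-core implementation):

process get_software_versions {
    output:
    file 'software_versions.txt' into ch_software_versions   // later passed to the MultiQC report

    script:
    """
    echo "Nextflow ${workflow.nextflow.version}" >  software_versions.txt
    fastqc --version                             >> software_versions.txt
    samtools --version | head -n 1               >> software_versions.txt
    """
}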
Francesco Strozzi
@fstrozzi
Jul 11 2018 10:07

so - for our minimal case, where we want unit testing with specific test data: I could imagine wanting to use samtools to count the number of aligned reads and check that this is the expected number, for example. Is that possible somehow?

instead, for validation I mean verifying data integrity checks and so on after a real run

Output checking is the key point here in my opinion. Ideally we should be able to capture the validation checks when the pipeline runs in testing mode and produce a sort of unit test report with green/red processes that passed/failed the tests

Paolo Di Tommaso
@pditommaso
Jul 11 2018 10:07
maybe two special processes could be introduced, called validation and version (or whatever), specialised for these purposes
I mean
validation {
  """
  validation_script  
  """
}
the advantage is that it would automatically collect the output data and synchronise with the pipeline execution
Phil Ewels
@ewels
Jul 11 2018 10:09
Would validation and version be per-process somehow?
Paolo Di Tommaso
@pditommaso
Jul 11 2018 10:09
no, once at termination
tho, that could not work for "version"
uff, a lot of things to discuss! :smile:
sorry, need to go now, leave your comments here
Phil Ewels
@ewels
Jul 11 2018 10:11
yeah.. possibly overkill I wonder. I'm fairly happy with how our version number stuff works currently. Though it uses MultiQC for reporting instead of nextflow
validation stuff would be great, but again we could probably implement everything that we want in groovy with methods without much core nextflow work.
Francesco Strozzi
@fstrozzi
Jul 11 2018 10:15
I think there is enough to start developing something and maybe expand / improve it during the NF Workshop / Hackathon
Sven F.
@sven1103
Jul 11 2018 10:19
I was just out for lunch, 110 new messages :D
Phil Ewels
@ewels
Jul 11 2018 10:19
I came here to ask a super specific small question but was distracted by @fstrozzi's question instead :laughing:
Sven F.
@sven1103
Jul 11 2018 10:19
I like the idea very much, having a more rigorous testing of the pipeline scripts
Phil Ewels
@ewels
Jul 11 2018 10:19
I can't even remember what my original question was now
Sven F.
@sven1103
Jul 11 2018 10:20
hahahhaa
Francesco Strozzi
@fstrozzi
Jul 11 2018 10:20

I came here to ask a super specific small question but was distracted by @fstrozzi's question instead :laughing:

:+1:

micans
@micans
Jul 11 2018 10:29
I want to process per-file-output like this: https://github.com/nextflow-io/patterns/tree/master/process-per-file-output, but then later I need to merge them. I've clumsily attempted things like this: file '*.[ABC].txt' into Channel.flatten().map { file -> return tuple[samplename, file] }.set { ch_onion }; the point is that I want to keep track of a grouping value (samplename in this case) to later merge them. The previous obviously does not work; I aim to marry flatten with samplename-in-tuple. What to do?
LukeGoodsell
@LukeGoodsell
Jul 11 2018 11:16
Hi there. Is there a built-in way to get a task id or other unique identifier of a job within a script/shell block? I can add a map command to my channel to add one, but would rather avoid extra code if possible
LukeGoodsell
@LukeGoodsell
Jul 11 2018 11:32
@micans I’m not entirely sure I know what you’re trying to do - maybe if you could provide some more context that would help. However, does this get you some of the way? :
#!/usr/bin/env nextflow

sampleChannel = Channel.from(["sample_a", "sample_b"])

process sampleProcess {
    input:
    val sample from sampleChannel

    output:
    set(val(sample), file('*.[ABC].txt')) into sampleOutputChannel

    shell:
    '''
    echo -e "line 1\nline 2" > aaa.A.txt
    echo -e "line 1\nline 2" > aaa.B.txt
    '''
}

sampleOutputChannel
    .flatMap { item ->
        sample = item[0];
        files = item[1];
        files.collect { this_file ->
            return [ sample, this_file ]
        }
    }
    .subscribe{ println it }
This will output something like:
$ ./test2.nf 
N E X T F L O W  ~  version 0.30.1
Launching `./test2.nf` [cranky_morse] - revision: b59d571941
[warm up] executor > local
[cf/7380e5] Submitted process > sampleProcess (1)
[44/12ab42] Submitted process > sampleProcess (2)
[sample_a, /home/l.goodsell/tmp/nf-stdout/work/cf/7380e558c070164aad4c073944c3a8/aaa.A.txt]
[sample_a, /home/l.goodsell/tmp/nf-stdout/work/cf/7380e558c070164aad4c073944c3a8/aaa.B.txt]
[sample_b, /home/l.goodsell/tmp/nf-stdout/work/44/12ab42c909b809fa97e456ba88486a/aaa.A.txt]
[sample_b, /home/l.goodsell/tmp/nf-stdout/work/44/12ab42c909b809fa97e456ba88486a/aaa.B.txt]
Paolo Di Tommaso
@pditommaso
Jul 11 2018 11:44
@micans what about this or this ?
micans
@micans
Jul 11 2018 12:01
Back from lunch ... thanks guys will try these things now
@pditommaso I knew about collect(), but the second one, groupTuple looks like what I want (I had that in my crosshairs). I was struggling with the split step, but with the example by @LukeGoodsell hopefully I can put things together.
micans
@micans
Jul 11 2018 12:53
I have a toy example with four processes genesis, sample_split, sample_parallel and sample_reconstitute that I think does what I want. It's 71 lines; is it OK to post here? Otherwise I can put it on github if people are interested. I assume this must be a common pattern, I'm sure what I did can be improved if not wrong to begin with.
Paolo Di Tommaso
@pditommaso
Jul 11 2018 13:03
as you prefer
micans
@micans
Jul 11 2018 14:09
here goes ...
    // multiple samples.
process genesis {
  output:
  file '*.txt' into ch_genesis

  script:
  '''
  echo amazingly few discotheques > sample1.txt
  echo a quart jar of oil > sample2.txt
  echo about sixty codfish eggs > sample3.txt
  '''
}
    // each sample generates multiple files
process sample_split {
  input:
  file x from ch_genesis.flatten()

  output:
  set val(samplename), file('*.[ABC].txt') into ch_onionprep

          // Below mimicks one sample encoded in multiple cram files
  script:
  samplename = x.toString() - ~/.txt$/
  """
  (echo A; cat $x) > ${samplename}.A.txt
  (echo B; cat $x) > ${samplename}.B.txt
  (echo C; cat $x) > ${samplename}.C.txt
  """
}
    // track each sub-sample file with the sample ID, example provided by Luke Goodsell.
ch_onionprep
    .flatMap { item ->
        sample = item[0];
        files  = item[1];
        files.collect { onefile -> return [ sample, onefile ] }
    } .set { ch_onion_root }

    // Process the sub-sample file.
process sample_parallel {
  tag "${samplename}-${x}"

  input:
  set val(samplename), file(x) from ch_onion_root

  output:
  set val(samplename), file('*.pal') into ch_onion_middle

  script:
  """
  (echo "Done and dusted"; cat $x) > out.${samplename}.${x}.pal
  """
}

ch_onion_middle.map { key, file -> return tuple(key.toString(), file) } .groupTuple().set { ch_onion_tip }

  // Merge the results back to the sample level
process sample_reconstitute {
  publishDir "onion", mode: 'copy'

  input:
  set samplename, file(x) from ch_onion_tip

  output:
  file('*.concat')

                  // Use ls as we need different batches sorted in the same lexicographic way.
  script:
  """
  cat \$(ls $x) > out.${samplename}.concat
  """
}
Paolo Di Tommaso
@pditommaso
Jul 11 2018 14:24
and the question is? :smile:
micans
@micans
Jul 11 2018 14:25
Whether it seems reasonable. But I know it's long, I don't mind if it's TL;DR
Is a NF set pretty much the same as a Groovy tuple?
Paolo Di Tommaso
@pditommaso
Jul 11 2018 14:27
set is just syntax sugar in place of =
micans
@micans
Jul 11 2018 14:29
That's the .set on channels I assume though? I mean set samplename, file(x)
Paolo Di Tommaso
@pditommaso
Jul 11 2018 14:29
ahh
yes, that should be read: the process is receiving as input a tuple with the val, file structure
micans
@micans
Jul 11 2018 14:36
ok thanks :+1:
Shawn Rynearson
@srynobio
Jul 11 2018 16:39
@pditommaso so to use the time directive in aws-batch I need to create a Job Definition that uses the timeout Parameter.
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:39
no
you don't need to create Job defs with NF
Shawn Rynearson
@srynobio
Jul 11 2018 16:42

I've added the following to my processes:

process fastp {
    tag { sample_id }
    time '2h'
}

but when I look at the aws-batch job details it's still using a nextflow default job definition.

Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:42
but why is everyone using this notation: tag { sample_id }
when this one is much simpler: tag "$sample_id" :smile: ?
Shawn Rynearson
@srynobio
Jul 11 2018 16:43
it allows me to tie a sample to it's .... hahahaha :)
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:43
but when I look at the aws-batch job details it's still using a nextflow default job definition.
what do you mean ?
Shawn Rynearson
@srynobio
Jul 11 2018 16:44
On my aws-batch JD page this is the one NF created.
nf-726197484957-dkr-ecr-us-west-2-amazonaws-com-ucgd-docker
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:44
and ?
Shawn Rynearson
@srynobio
Jul 11 2018 16:45
It's the one still used for the process, with or without the time directive added.
by default
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:45
the timeout is not added to the job definition
but to the job container overrides
you should see it on the job page
Shawn Rynearson
@srynobio
Jul 11 2018 16:48

strange I do not.
I just relaunched the job and this is the definition it's using.

from the jobs page:

Job definition: arn:aws:batch:us-west-2:726197484957:job-definition/nf-726197484957-dkr-ecr-us-west-2-amazonaws-com-ucgd-docker:5
This is what it says:
Execution timeout --
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:50
weird
other settings such as mem? cpus?
Shawn Rynearson
@srynobio
Jul 11 2018 16:51
it shouldn't really change if I add it to the config file, right?
process {
   $fastp {
        time = 2h
   }
}
Ya the mem and cpus are added correctly.
vCPUs 8
Memory 13312 MiB
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:52
time = '2h' OR time = 2.h
Shawn Rynearson
@srynobio
Jul 11 2018 16:52
2h
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:53
no, mine was an assertion, not a question
Shawn Rynearson
@srynobio
Jul 11 2018 16:53
sorry, I don't understand
Paolo Di Tommaso
@pditommaso
Jul 11 2018 16:54
I mean in the config you need to write
time = '2h' OR time = 2.h
Shawn Rynearson
@srynobio
Jul 11 2018 16:54
Let me relaunch with this added to the config and see if it changes.
didn't change it.
    $fastp {
        cpus = 8
        memory = '13 GB'
        time = '2h'
        queue = 'ucgd-medium'
        container = '726197484957.dkr.ecr.us-west-2.amazonaws.com/ucgd-docker'
    }
Paolo Di Tommaso
@pditommaso
Jul 11 2018 17:00
Screen Shot 2018-07-11 at 18.59.30.png
I have it, are you sure you are using the latest NF version?
Shawn Rynearson
@srynobio
Jul 11 2018 17:03
I can check my version.
okay, I just updated to: nextflow version 0.30.2.4867
I'll run again.
Shawn Rynearson
@srynobio
Jul 11 2018 17:11
Yeah, the update fixed it. Execution timeout 7200.
I like it when it's something simple that I didn't do. :)
thanks @pditommaso
Paolo Di Tommaso
@pditommaso
Jul 11 2018 17:15
:v:
Tim Dudgeon
@tdudgeon
Jul 11 2018 17:52
Hi, I would welcome some advice on a matter. I have a workflow that needs to process a set of directories, each containing a set of input files (5 files in this case) which always have the same names and need to be processed in the same way. I have the workflow working in a single directory and now need to move it up a level and handle each directory (in parallel). Does anyone have an example of this sort of thing?
Paolo Di Tommaso
@pditommaso
Jul 11 2018 18:37
no #ENGCRO football match? that's bad! :smile:
is there any semantics associated with the dir, or do you just need to process these files independently of the folder where they are located?
Tim Dudgeon
@tdudgeon
Jul 11 2018 19:25
Yes, each dir is independent.
Paolo Di Tommaso
@pditommaso
Jul 11 2018 19:58
still not understanding: do you need to parallelise per file or per dir?
Tim Dudgeon
@tdudgeon
Jul 11 2018 20:01
Per dir. But within the dir there is a workflow that itself is parallelised (that bit is done already). Each dir has 5 input files that generate one output file.
Paolo Di Tommaso
@pditommaso
Jul 11 2018 20:02
within the dir there is a workflow that itself is parallelised
how are you planning to handle this ?
Tim Dudgeon
@tdudgeon
Jul 11 2018 20:13
I already have NF workflow that handles an individual dir. I now need to move it up one level so that it can handle multiple dirs.
Paolo Di Tommaso
@pditommaso
Jul 11 2018 20:16
it's the same as this, using a pattern on the directory names instead of the file names
Tim Dudgeon
@tdudgeon
Jul 11 2018 20:20
So for me it's more like:
Channel.fromPath('**/somefile.gz')
(other than there are 5 input files, not 1)
Paolo Di Tommaso
@pditommaso
Jul 11 2018 20:21
you said you want to process per dir, therefore the pattern needs to capture the dir, not the file
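A minimal sketch of that, assuming the sample directories live under data/ and each contains the same five fixed file names:

// emit each sample directory as a single channel item
Channel
    .fromPath('data/*', type: 'dir')
    .set { ch_sample_dirs }

process per_dir {
  tag "$sample_dir"

  input:
  file sample_dir from ch_sample_dirs   // the whole directory is staged into the task

  output:
  file 'result.txt'

  script:
  """
  # the five known input files are available inside the staged directory
  cat $sample_dir/*.txt > result.txt
  """
}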
Netsanet Gebremedhin
@gnetsanet
Jul 11 2018 21:52
Hello everyone
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,Sample_Project,Description

ControlIndex1,,,,A013,AGTCAA,,email@gmail.com
ControlIndex2,,,,A014,AGTTCC,,email@email.com
ControlIndex1,,,,A015,ATGTCA,,email@email.com
newlotIndex12,,,,A016,CCGTCC,,email@email.com
newlotIndex13,,,,A018,GTCCGC,,email@email.com
newlotIndex14,,,,A019,GTGAAA,,email@email.com
I have a csv file with the above contents. splitCsv fails to handle the empty line in the file
I was not able to even use the .filter operator as none of the lines beneath the empty line are emitted.
Any thoughts?
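One possible workaround sketch (the file name is illustrative): drop the blank line before parsing, rather than relying on splitCsv to cope with it:

Channel
    .fromPath('SampleSheet.csv')
    .splitText()                                     // emit one line per item
    .map { it.trim() }
    .filter { it && !it.startsWith('Sample_ID') }    // skip the blank line and the header
    .map { it.split(',', -1) as List }               // split keeps empty fields, unlike tokenize()
    .subscribe { println it }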