These are chat archives for nextflow-io/nextflow

17th Aug 2017
Luca Cozzuto
@lucacozzuto
Aug 17 2017 09:26 UTC
Dear all,
I noticed that sometimes the caching does not work. Basically some processes that are already finished are submitted again if you simply rerun the pipeline.
Jean-Christophe Houde
@jchoude
Aug 17 2017 12:53 UTC
hi @pditommaso , I'll try to create a minimal "working" example. I'm pretty sure it's related to the rsync call, since it tries to run rsync bundle/sub1__*endpoints_metrics.nii.gz, which raises an error since not even one file matches the pattern
It might not be a problem, I'm just wondering if there is another way to do the same and make it work :)
Simone Baffelli
@baffelli
Aug 17 2017 12:58 UTC
@lucacozzuto did you try checking -dump-hashes?
It prints all the parameters for all processes and the corresponding hashes
you could run it twice using two different log files with nextflow -log first_log.txt run yourpipeline.nf and nextflow -log second_log.txt run yourpipeline.nf, look for the matching processes and see whether the parameter changed
To be sure to find your processes, you could give them a unique name using the tag directive
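A minimal sketch of the tag directive suggestion (process, channel, and variable names are illustrative, not from the conversation):

```nextflow
process align {
    // Tag each task with a human-readable identifier so it is easy
    // to match process instances between the two dumped log files.
    tag "${sample_id}"

    input:
    set val(sample_id), file(reads) from reads_ch

    script:
    """
    echo "aligning ${sample_id}"
    """
}
```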
Simone Baffelli
@baffelli
Aug 17 2017 13:07 UTC
Actually, it would be really nice if the hashes could be dumped in a separate file, because they tend to mess up the log a lot
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:21 UTC
@baffelli it looks more related to a process than to a single instance. I just reran the script twice, so I didn't change any parameters. However I'll try to use two different logs to see what's going on
Simone Baffelli
@baffelli
Aug 17 2017 13:23 UTC
@lucacozzuto that is exactly my point: if you do it the way I suggested, you will be able to compare the parameters for each process instance
and see if something changed
tip: if you are collecting files for a downstream process, the order matters
I learned it the hard way
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:25 UTC
uh. I'll check.
Simone Baffelli
@baffelli
Aug 17 2017 13:25 UTC
let me know because I'm curious
I often encounter problems with caching
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:27 UTC
is there a way to "force" the order with .collect ?
Jean-Christophe Houde
@jchoude
Aug 17 2017 13:30 UTC
also, is there a way to keep the /tmp directories to inspect what was effectively created by the processes?
Simone Baffelli
@baffelli
Aug 17 2017 13:35 UTC
@lucacozzuto you can use toSortedList instead of collect. As I discussed with Paolo, the process execution order is non-deterministic
actually a parameter allowing to ignore the order would be useful. Another option that could solve the problem would be to use sets instead of arrays for collect
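A minimal sketch of the toSortedList suggestion (channel and process names are illustrative): sorting the collected files gives the downstream task a stable input order, and therefore a stable cache hash across runs.

```nextflow
// Collect all files in a deterministic (sorted-by-name) order,
// so the task's input hash does not change between runs.
bam_ch
    .toSortedList { a, b -> a.name <=> b.name }
    .set { sorted_bams }

process merge {
    input:
    file(bams) from sorted_bams

    script:
    """
    echo ${bams}
    """
}
```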
@jchoude I think you can see everything by going into the process instance folder
except for the groovy script, if you use one. I heavily rely on groovy scripts to construct my commands and that makes debugging a little bit harder
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:40 UTC
@baffelli another question: when I use collect I don't keep the names of the files I'm collecting. They are replaced with numbers (for instance FASTQ_ becomes FASTQ_1, FASTQ_2, etc.). Is there a way to keep the original names? (maybe using toSortedList?)
Simone Baffelli
@baffelli
Aug 17 2017 13:42 UTC
To be honest I don't know. I usually use file(slcs: "*.slc") for files called 1.slc to n.slc
But why would you keep the original names?
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:55 UTC
for instance using multiQC on them
Simone Baffelli
@baffelli
Aug 17 2017 13:56 UTC
but does the command require specific names?
I am not a bioinformatician, so I cannot help in that sense
Luca Cozzuto
@lucacozzuto
Aug 17 2017 13:56 UTC
the software uses file names for making a report
so preserving the file names can be sometimes useful
Simone Baffelli
@baffelli
Aug 17 2017 13:58 UTC
I see. As far as I know, there is no way to preserve the names when you are collecting several files using file(name: "pattern")
Jean-Christophe Houde
@jchoude
Aug 17 2017 13:58 UTC
@pditommaso just a heads up: I can't seem to reproduce the problem with a simple example... Something deeper is at work here.
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:04 UTC
@lucacozzuto can’t you read the samples names from a simple list and pass each sample name explicitly to MultiQC using something like --file-list ?
assuming your data are organized in subdirectories corresponding to samples names, of course
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:07 UTC
@fstrozzi I have subdirectories like SAMPLE_A SAMPLE_B etc. but once you do "collect" they become SAMPLE_1 SAMPLE_2 etc. so I was wondering if there is an easy way to keep them (not only for multiQC but in general)
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:13 UTC
I see your point, maybe having a map keeping the correspondence originalName -> stagedName could be helpful. No way to construct it from the list returned by "collect"?
I mean, “by hand”
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:14 UTC
maybe using toSortedList as suggested by @baffelli. I'll try and keep you posted
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:15 UTC
:+1: yes please, I think it’s a fairly common case
Simone Baffelli
@baffelli
Aug 17 2017 14:25 UTC
Well I guess you could build it manually by getting the location the file points to
because when nextflow stages files, it just links them to the original file
something like (the channel name is illustrative)
input:
 file(thefiles: "my_files_*.something") from files_ch
script:
 // staged files are symlinks, so resolving them recovers the original
 // path; build a map of original file name -> staged file
 def paths = thefiles as List
 def mapping = paths.collectEntries { f -> [(f.toRealPath().fileName.toString()): f] }
Simone Baffelli
@baffelli
Aug 17 2017 14:31 UTC
That's completely untested, but I suppose something similar could work
Let me know if it works, I'm quite curious
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:36 UTC
slightly related question: what's the simplest way to initiate multiple tasks for a list of samples? For instance, I have a list of samples; for each of them I have fastq files in a separate folder named after the sample, and I want a process, let's say for trimming, to start on each sample in parallel. So on one side I have a channel reading the list of samples; on the other, I don't quite understand the best way to create N channels using Channel.fromFilePairs, one for each sample, in the same process
(I hope it’s clear what I mean, I think it’s a very common case but I don’t find a clear example to look at)
Simone Baffelli
@baffelli
Aug 17 2017 14:37 UTC
Well, I think nextflow automatically starts a process for each sample it receives
if the channel receiving the N samples is connected to Channel.fromFilePairs
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:38 UTC
can you point me to some example ?
Simone Baffelli
@baffelli
Aug 17 2017 14:38 UTC
letters = Channel.from("A".."Z")

process showThem {
    input:
    val let from letters

    script:
    """
    echo ${let}
    """
}
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:38 UTC
I keep having errors when trying to create Channel.fromFilePairs
Simone Baffelli
@baffelli
Aug 17 2017 14:38 UTC
Probably the pattern does not return any file
or something similar
(I should become a nextflow consultant somehow :grimacing: )
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:39 UTC
I think the code you posted has some problem, I don’t see anything :)
Simone Baffelli
@baffelli
Aug 17 2017 14:40 UTC
Still working on it, sorry
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:40 UTC
oops
Simone Baffelli
@baffelli
Aug 17 2017 14:41 UTC
For each emission, the process will start, up to the configured number of tasks in parallel
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:42 UTC
yes, that’s clear. But I need to then create a Channel.fromFilePairs for each element in the list (i.e. each sample)
Simone Baffelli
@baffelli
Aug 17 2017 14:42 UTC
why don't you create a process that reads the list
and then emits the filenames?
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:43 UTC
and I get a "No such variable: let” (to match your example)
yeah probably that’s better I think
Simone Baffelli
@baffelli
Aug 17 2017 14:44 UTC
sorry, typos
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:44 UTC
this way I can declare as input the Channel.fromFilePairs in the receiving process
Simone Baffelli
@baffelli
Aug 17 2017 14:44 UTC
fromFilePairs is not an operator, I don't think it can receive values
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:45 UTC
mmm
Simone Baffelli
@baffelli
Aug 17 2017 14:45 UTC
I think you should parse the file, copy the files in the process working directory and then send them to the outputs
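One hedged way to sketch the suggestion above without fromFilePairs (the file 'samples.txt' and the data/ directory layout are illustrative assumptions): read the sample list, pair each sample with a glob over its directory, and let the downstream process start once per sample.

```nextflow
// Read one sample name per line and pair it with its fastq files;
// 'samples.txt' and the data/<sample>/ layout are assumptions.
samples_ch = Channel
    .fromPath('samples.txt')
    .splitText()
    .map { it.trim() }
    .map { sample -> tuple(sample, file("data/${sample}/*.fastq.gz")) }

process trim {
    tag "${sample}"

    input:
    set val(sample), file(reads) from samples_ch

    script:
    """
    echo "trimming ${sample}: ${reads}"
    """
}
```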
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:49 UTC
I’ll try it, thanks
Simone Baffelli
@baffelli
Aug 17 2017 14:50 UTC
I can't really help because soon is time for me to leave and go for some biking
but I could help tomorrow
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:50 UTC
no problem
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:50 UTC
@baffelli I tried to rerun the pipeline and the two logs are similar. For some strange reason, without changing anything, the process is executed instead of being cached
Simone Baffelli
@baffelli
Aug 17 2017 14:51 UTC
Similar or the same?
Did you compare the hashes one to one?
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:51 UTC
well the time is different
Simone Baffelli
@baffelli
Aug 17 2017 14:52 UTC
did you use -dump-hashes?
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:52 UTC
ehm :smile:
Simone Baffelli
@baffelli
Aug 17 2017 14:52 UTC
I guess that means no
:grinning:
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:53 UTC
enjoy the bike!
:)
Simone Baffelli
@baffelli
Aug 17 2017 14:54 UTC
I hope to catch the last sunshine and warmth
a cold front is nearing
Luca Cozzuto
@lucacozzuto
Aug 17 2017 14:55 UTC
here 31 degrees and plenty of sun
Francesco Strozzi
@fstrozzi
Aug 17 2017 14:57 UTC
:)
here is humid and grey
Simone Baffelli
@baffelli
Aug 17 2017 15:15 UTC
Here it's fine atm, but it will not improve
But that does not stop me from biking
Francesco Strozzi
@fstrozzi
Aug 17 2017 15:19 UTC
@baffelli I think I have understood what I need to do, will do some test and then share the results
good biking
Luca Cozzuto
@lucacozzuto
Aug 17 2017 15:22 UTC
btw I see that @pditommaso is already working on the file mapping... nextflow-io/nextflow#402
Francesco Strozzi
@fstrozzi
Aug 17 2017 15:28 UTC
:+1:
Luca Cozzuto
@lucacozzuto
Aug 17 2017 15:59 UTC
terrorist attack here in BCN :(
Francesco Strozzi
@fstrozzi
Aug 17 2017 16:01 UTC
:(
just saw the news
Simone Baffelli
@baffelli
Aug 17 2017 20:58 UTC
Hope everyone is doing well