These are chat archives for nextflow-io/nextflow

7th Apr 2017
marchoeppner
@marchoeppner
Apr 07 2017 09:48
hi - thanks for looking into my little pipeline issue. Seems I am having trouble with the whole Channel concept (coming from bpipe, i.e. more graph based)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:48
Hi Marc, welcome
can you also share the printed stdout ?
marchoeppner
@marchoeppner
Apr 07 2017 09:50
if that is not in the logs somewhere, I'd have to rerun, takes a minute. Nothing suspicious there tho, just submitted stages, no errors
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:50
but it says Submitted process xxx .. ?
marchoeppner
@marchoeppner
Apr 07 2017 09:51
yea, and most of the expected outputs are in work/
seems to fail to properly bring it all together for publishing in the end, so I am guessing it is a problem with how I am collecting from the output channel(s)
but a channel should be able to a) read a list of files and b) split each file into chunks in one go?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:52
by fail do you mean a runtime error or it's not producing what's expected ?
marchoeppner
@marchoeppner
Apr 07 2017 09:52
not producing what I am expecting
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:52
ok, which process is not producing the expected results ?
marchoeppner
@marchoeppner
Apr 07 2017 09:53
so say I have five vcf files (as in the example and script linked on gg) - it reads these, splits each file into 2 chunks (test data) and runs the two annotation tools on each chunk - and then it should merge the annotated chunks back together to produce 2 output files (one per tool) for each input file
collectVep and collectAnnovar (last procedures)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:55
ok, the problem is that collect() cannot be used in that way
apologies if I've suggested that .. i don't remember :)
marchoeppner
@marchoeppner
Apr 07 2017 09:56
you suggested collectFile, true - would that work, theoretically?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:56
now I remember .. yes that would work
marchoeppner
@marchoeppner
Apr 07 2017 09:57
I see, will give it a go, fingers crossed
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:57
wait
marchoeppner
@marchoeppner
Apr 07 2017 09:57
ok
Paolo Di Tommaso
@pditommaso
Apr 07 2017 09:58
ok, yes it makes more sense to use collectFile for just appending the result to a file instead of a process
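e.g. something roughly like this - just a sketch, the channel and param names here are placeholders, not your actual ones:
```
// sketch only: append the annotated chunks emitted by the channel
// into one file in the results folder, with no extra merge process
vep_chunks_ch
    .collectFile(name: 'vep_annotated.vcf', storeDir: params.outdir)
```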
marchoeppner
@marchoeppner
Apr 07 2017 10:01
stupid question perhaps, but in your example - how is the output file name chosen? I have tried to carry over the original file name (before splitting) so I can use it to name the output. Not sure how that is accomplished with collectFile
so the Channel I would be applying collectFile to is not a flat list but a bunch of arrays with [ id , some_output_chunk ]
Paolo Di Tommaso
@pditommaso
Apr 07 2017 10:15
so the Channel I would be applying collectFile to is not a flat list but a bunch of arrays with [ id , some_output_chunk ]
exactly
when doing so, the first item is supposed to be the grouping key and it's also used as the resulting file name
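i.e. if the channel emits [ id, chunk ] tuples, a plain collectFile should already do the per-id grouping - sketch only, names made up:
```
// sketch: annotated_ch emits [ id, chunk_file ] tuples; with no closure
// the first element is used as the grouping key and as the merged file name
annotated_ch
    .collectFile(storeDir: "${params.outdir}/merged")
```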
marchoeppner
@marchoeppner
Apr 07 2017 10:16
ah, built-in magic then
Paolo Di Tommaso
@pditommaso
Apr 07 2017 10:16
convention-over-configuration
but you can change it obviously, by doing something like that
x.collectFile { id, file -> [ "your-file-name-with-${id}.extension", file ] }
makes sense ?
if you are coming from Bpipe, shouldn't be too complicated ;)
marchoeppner
@marchoeppner
Apr 07 2017 10:20
yes that makes sense, thanks - will try this now
Paolo Di Tommaso
@pditommaso
Apr 07 2017 15:33
hope that all our Swedish friends and their families are fine :/
Karin Lagesen
@karinlag
Apr 07 2017 15:44
so do I
if i have the following:
```
set pair_id, file(reads) from in_read_pairs
```
what kind of thing is reads really, then?
is it a list, a channel, or...?
Phil Ewels
@ewels
Apr 07 2017 15:54
We're all ok over here thanks. Not very nice news and all a bit chaotic though!
😓🇸🇪
Karin Lagesen
@karinlag
Apr 07 2017 15:55
are you anywhere close to where things are happening?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:07
what kind of thing is reads really, then?
it's a list of files, or at least it's supposed to be ..
Karin Lagesen
@karinlag
Apr 07 2017 16:12
I figured it out :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:12
:ok_hand:
Karin Lagesen
@karinlag
Apr 07 2017 16:12
I now know enough nextflow to (almost) debug and figure out things myself!
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:13
great! very encouraging ! :)
Karin Lagesen
@karinlag
Apr 07 2017 16:13
well, it's most definitely my win for today :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:15
I'm already in weekend mode sipping a glass of wine .. :sunglasses:
Karin Lagesen
@karinlag
Apr 07 2017 16:15
I will be doing the same once I just get this one thing done....
enabling people to run things through with 2 or 4 files, without having to change code :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:16
this is the classic way to work until midnight :)
ah, this with NF should be trivial at this point ..
Karin Lagesen
@karinlag
Apr 07 2017 16:16
:grinning:
yeah, well, still new syntax, so :)
trying my hand at if statements :)
Karin Lagesen
@karinlag
Apr 07 2017 16:29
and I need regexes...
not the thing that I am the most in love with :)
Karin Lagesen
@karinlag
Apr 07 2017 16:31
:grin:
:laughing:
...how would I select everything in a list that matches the pattern R1.. ?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:33
/^R1.*/ ?
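e.g. on a list of files something like this should do it - a sketch, assuming the items are Nextflow file objects so .name is available:
```
// hypothetical: keep only the entries whose file name contains R1
def r1_files = reads.findAll { it.name =~ /R1/ }
```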
Karin Lagesen
@karinlag
Apr 07 2017 16:34
"now use it in a sentence" :grin:
where does my list variable go, and where does the output end up, and what is the output....?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:35
wat ?
Karin Lagesen
@karinlag
Apr 07 2017 16:35
yes
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:35
what do you mean ?
Karin Lagesen
@karinlag
Apr 07 2017 16:35
I have a variable containing 4 fastq files
I want a variable that contains those that match a certain pattern
from that variable that contains the 4 files
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:35
um
Karin Lagesen
@karinlag
Apr 07 2017 16:36
I have to merge all the R1 files and the R2 files, so I need to ensure that I merge the right set of files :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:37
how are these lists defined ?
are they the output of a NF process ?
Karin Lagesen
@karinlag
Apr 07 2017 16:38
I use fromFilePairs to get a channel of tuples, each with an id and a list of files
I then take that channel into a process
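roughly like this - the glob pattern here is made up, not my real one:
```
// rough sketch of my channel setup (glob pattern is a placeholder);
// params.setsize is 2 or 4 depending on how many files belong to each prefix
in_read_pairs = Channel
    .fromFilePairs("${params.reads}/*_R{1,2}*.fastq.gz", size: params.setsize)
```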
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:38
yes
Karin Lagesen
@karinlag
Apr 07 2017 16:39
and I have input like this:
    set pair_id, file(reads) from in_read_pairs
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:39
fine
then?
Karin Lagesen
@karinlag
Apr 07 2017 16:40
then I go into an if statement - I ask users to specify how many files belong to each prefix, and that is what I'm using in my if statement
what I have done so far is just:
    script:
    if ( params.setsize == 2 )
        """
        echo "file set is", ${params.setsize}
        echo cp ${reads[0]} ${pair_id}_R1.fastq.gz
        echo cp ${reads[1]} ${pair_id}_R2.fastq.gz
        """

    else if (params.setsize == 4)
        """
        echo "file set is", ${params.setsize}
        echo cat ${reads[0]} ${reads[1]} > ${pair_id}_R1.fastq.gz
        echo cat ${reads[2]} ${reads[3]} > ${pair_id}_R2.fastq.gz
        """
just to see what things were
but the way I'm cating things together isn't robust, I need to ensure that I only cat R1s and R2s together
em, only cat R1s together, and only cat R2s together
hence: how do I select items in a list that matches a pattern
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:43
thinking of alternatives ..
Karin Lagesen
@karinlag
Apr 07 2017 16:43
I could just depend on order, but that is a bit weak
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:44
well, you could do
cat R1* > ${pair_id}_R1.fastq.gz
cat R2* > ${pair_id}_R2.fastq.gz
bash automatically expands it to the matching file names
tho, not sure the file names are ordered
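so the whole thing could look roughly like this - still just a sketch, adapt the names; it assumes the input names contain R1/R2 and the pair_id doesn't:
```
// sketch only: merge whatever R1/R2 files were staged for this pair_id,
// regardless of whether there are 2 or 4 of them
process merge_reads {
    input:
    set pair_id, file(reads) from in_read_pairs

    output:
    set pair_id, file("${pair_id}_R1.fastq.gz"), file("${pair_id}_R2.fastq.gz") into merged_reads

    """
    cat *R1* > ${pair_id}_R1.fastq.gz
    cat *R2* > ${pair_id}_R2.fastq.gz
    """
}
```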
Karin Lagesen
@karinlag
Apr 07 2017 16:46
doh
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:46
sorry ? :)
Karin Lagesen
@karinlag
Apr 07 2017 16:47
I keep thinking of difficult solutions when easy ones will do :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:47
:)
Karin Lagesen
@karinlag
Apr 07 2017 16:47
hence the doh :)
I can drop the if statement too, which makes everything better too :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:48
This message was deleted
Karin Lagesen
@karinlag
Apr 07 2017 16:48
thanks!
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:49
but in this case why copy ?
         cp ${reads[0]} ${pair_id}_R1.fastq.gz
         cp ${reads[1]} ${pair_id}_R2.fastq.gz
Karin Lagesen
@karinlag
Apr 07 2017 16:49
the first one was if there was only 1 set of files
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:49
would it not be enough to rename it ?
Karin Lagesen
@karinlag
Apr 07 2017 16:49
I am a bit paranoid
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:49
about what ?
file naming ? :)
Karin Lagesen
@karinlag
Apr 07 2017 16:49
about overwriting data
and of tracing things
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:50
you can't, each task runs in its own dir ..
Karin Lagesen
@karinlag
Apr 07 2017 16:50
I know :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:50
so it can't be overwritten !
Karin Lagesen
@karinlag
Apr 07 2017 16:50
but for somebody reading the code, and the output so to speak, having every step there makes it easier for a user to follow
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:51
ok...
Karin Lagesen
@karinlag
Apr 07 2017 16:51
so for instance, the ariba program always produces a file called report.tsv
now, in my script I make a copy of that with a new name which I shove into a channel
If I didn't do that, any user who knew ariba might get confused because they can't see the output they expect
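roughly like this - the db path and names here are just placeholders:
```
// sketch of what I mean: give report.tsv a sample-specific name
// before it goes into the channel (ariba db / names are placeholders)
process ariba_amr {
    input:
    set sample_id, file(reads) from ariba_in

    output:
    file("${sample_id}_ariba_report.tsv") into ariba_reports

    """
    ariba run ${params.ariba_db} ${reads[0]} ${reads[1]} ${sample_id}_ariba
    cp ${sample_id}_ariba/report.tsv ${sample_id}_ariba_report.tsv
    """
}
```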
then again, this might just be because 1. I'm paranoid and 2. I have worked with people who confuse easily :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:52
that's good as long as you have plenty of cheap storage .. ;)
Karin Lagesen
@karinlag
Apr 07 2017 16:53
I do
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:53
good!
Karin Lagesen
@karinlag
Apr 07 2017 16:53
which makes everything easier :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:53
ariba, I guess it's a Spanish author ..
Karin Lagesen
@karinlag
Apr 07 2017 16:53
but then again, I also clearly delineate between "experimental" runs, and "production" runs
nope, Martin Hunt at Sanger
it's for antibiotic/virulence/MLST finding
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:54
I didn't know it
Karin Lagesen
@karinlag
Apr 07 2017 16:54
and I'll be running it through ...around 1000 isolates or thereabouts
and also other things
anyhow, I'm pretty much ready to commit here, so I'll let you get back to your wine in the sun :)
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:56
I will, have a great week-end !
Karin Lagesen
@karinlag
Apr 07 2017 16:56
you too!
Paolo Di Tommaso
@pditommaso
Apr 07 2017 16:57
tx
Félix C. Morency
@fmorency
Apr 07 2017 18:01
Anyone here uses the CIFS multiuser feature?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:03
not me
Félix C. Morency
@fmorency
Apr 07 2017 18:04
it works, just not without an active user session. been hacking on it the past few days. i already have full file audit capabilities via cifs so i was hoping to keep using it
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:05
is it the same as automount ?
Félix C. Morency
@fmorency
Apr 07 2017 18:06
it allows multiple users (with their respective credentials/account) to use the same mountpoint
ie. bob writes to /mnt/share under bob account and alice writes to /mnt/share under alice account
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:08
do you have any specific problem with that ?
I mean related to NF ?
Félix C. Morency
@fmorency
Apr 07 2017 18:09
yeah. -resume doesn't work if the input paths aren't the same. bob can't work on alice's data and vice-versa
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:10
ahh
yes, because the hash is computed on the file metadata
Félix C. Morency
@fmorency
Apr 07 2017 18:10
exactly :)
my idea was to use the cifs multiuser feature so the path would always stay the same but it's not as simple as I thought
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:11
well, it's even possible to compute it on the content, but not sure I want to suggest that ..
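(for the record that would be the 'deep' cache mode, which hashes the file content instead of the metadata, but it can get slow on big files)
```
// nextflow.config sketch: hash input file content instead of
// path/size/timestamp metadata when resuming (slower for large files)
process {
    cache = 'deep'
}
```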
would it not be better to use a common name for shared data ?
Félix C. Morency
@fmorency
Apr 07 2017 18:13
you mean a single cifs account mounted on all nodes?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:14
I don't have a clear overview of how it works in detail, so maybe this doesn't make much sense
but you could always mount the same path for all users
Félix C. Morency
@fmorency
Apr 07 2017 18:16
without the multiuser feature, that would imply a single cifs account. I can't do that because of FDA regulations
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:16
oh
is that for auditing purposes ?
Félix C. Morency
@fmorency
Apr 07 2017 18:17
yes, exactly
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:18
I see
thus you want alice launching the pipeline and bob resuming it ?
Félix C. Morency
@fmorency
Apr 07 2017 18:19
yes, I would like to be able to do that. we had this exact use case last week
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:21
interesting, it was designed to be used at the user level
Félix C. Morency
@fmorency
Apr 07 2017 18:23
is there anything preventing multiple users from working on the same dataset?
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:24
on the same input dataset nothing
in the same launching directory it's not suggested
Félix C. Morency
@fmorency
Apr 07 2017 18:25
why not?
i mean, not at the same time, obviously
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:26
if so it's fine
in general the idea was to organise each experiment in its own working folder, tho you can resume the same experiment multiple times
Félix C. Morency
@fmorency
Apr 07 2017 18:27
yes this is how we're working atm
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:27
ok
Félix C. Morency
@fmorency
Apr 07 2017 18:29
are you aware if there is any mechanism to detect that two different mountpoints point to the exact same data?
it seems /proc/mounts contains that information
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:48
umm, not sure how much it can help
anyhow you may want to open a feature request providing more details on a possible solution
Félix C. Morency
@fmorency
Apr 07 2017 18:49
well, if NF could detect that it points to the same data even when the path is different, it might work
thx. i wrote on the samba ml. i'll wait for their answer before entering a feature request
Paolo Di Tommaso
@pditommaso
Apr 07 2017 18:50
ok