These are chat archives for nextflow-io/nextflow

23rd
May 2017
Shaz
@shashiranjan_86_twitter
May 23 2017 09:42
Can Nextflow also use SGE?
Shaz
@shashiranjan_86_twitter
May 23 2017 09:47
Thanks
Paolo Di Tommaso
@pditommaso
May 23 2017 09:53
@shashiranjan_86_twitter definitely, have a look at this white paper for an overview
Robert Syme
@robsyme
May 23 2017 14:33
Heya all. When executing on aws with nextflow cloud, is it necessary to use elastic file system? If no EFS id is provided, does ignite automatically take care of getting the job data dependencies to the necessary worker nodes?
Paolo Di Tommaso
@pditommaso
May 23 2017 15:22
@robsyme If you do not use EFS, you will need to use a shared work directory allocated in an S3 bucket
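A minimal sketch of the S3 option as a config fragment, assuming a bucket named `my-bucket` (a placeholder, not from the discussion) that all nodes can read and write:

```groovy
// nextflow.config
// Assumption: 'my-bucket' is a hypothetical bucket name, replace it with your own
workDir = 's3://my-bucket/work'
```

The same location can also be given on the command line with the `-w` option of `nextflow run`.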
Simone Baffelli
@baffelli
May 23 2017 15:25
Hello! Is there any reason for a cached process to fail?
Paolo Di Tommaso
@pditommaso
May 23 2017 15:26
if it fails there's a reason :)
Simone Baffelli
@baffelli
May 23 2017 15:27
but it worked in a previous run
and now I resume and some randomly fail
Robert Syme
@robsyme
May 23 2017 15:27
Thanks Paolo.
Paolo Di Tommaso
@pditommaso
May 23 2017 15:27
@robsyme welcome
@baffelli without an error trace I can't say more
Simone Baffelli
@baffelli
May 23 2017 15:27
sure, I'll see if that happens again
Simone Baffelli
@baffelli
May 23 2017 15:35
I suspect it's due to the fact that I'm using maps in several places
Paolo Di Tommaso
@pditommaso
May 23 2017 15:35
ummm, possible
it could be a race condition
Simone Baffelli
@baffelli
May 23 2017 15:36
Caused by:
  Process `unwrap (10)` terminated with an error exit status (255)

Command executed:

  width=$(get_value cc_mask.bmp interferogram_width)
  mcf input.2 off_par input.4 unw ${width} - - - - - - - 343 57 1


Command exit status:
  255

Command output:
  *** Phase unwrapping using Minimum Cost Flow (MCF) and triangulation ***
  *** Copyright 2015, Gamma Remote Sensing, v2.0 clw/uw 6-Dec-2015 ***
  input interferogram file: input.2
  weight file: off_par

Command error:

  ERROR ras_type(): Unsupported output image format, only SUN raster, BMP, or TIFF formats are currently supported: input.4
Indeed, somehow the files are not obtained in the right order
I gave them names, but the process is receiving input.n
Paolo Di Tommaso
@pditommaso
May 23 2017 15:37
how are you passing the map ?
Simone Baffelli
@baffelli
May 23 2017 15:37
like that
data_for_unw
            .map({it->map_outputs(it, ['cc', 'ifgram', 'off_par', 'baseline', 'master_id', 'slave_id'])})
            .filter({it->filter_baseline(baseline(it['master_id'],it['slave_id']), params.pair_bl)})
            .cross(unw_mask_named){it-> [it.master_id, it.slave_id]}//synchronize based on the ids
            .map({it->it[0]+it[1]})//merge the pairs into a single mapping
            .into({ifgram_to_unw})






/*
* Unwrap
*/
process unwrap{
        input:
            set file(cc), file(ifgram), file(off_par), val(baseline), val(master_id), val(slave_id), file(unw_mask) from ifgram_to_unw
            each unw_start from ref_pix_unw


        output:
            set file(unw), file(off_par), val(baseline), val(master_id), val(slave_id) into unw

        shell:
            println("${unw_start}")
            log.info "Unwrapping !{ifgram}"
            '''
            width=$(get_value !{off_par} interferogram_width)
            mcf !{ifgram} !{cc} !{unw_mask} unw ${width} - - - - - - - !{make_string(unw_start)} 1
            '''


}
I suspect that relying on the order given by the map is not a good idea
maps aren't necessarily ordered
Indeed, input.2 contains the value of 'baseline'
but what happens if I just pass the map to the shell and unpack it there by name? Will the necessary files still be staged?
Paolo Di Tommaso
@pditommaso
May 23 2017 15:41
which is the map here, sorry ?
Simone Baffelli
@baffelli
May 23 2017 15:41
I'm unpacking it
the map is ifgram_to_unw
I construct it with the chains of operators above
Paolo Di Tommaso
@pditommaso
May 23 2017 15:43
uhh no, you should not do that
Simone Baffelli
@baffelli
May 23 2017 15:43
But I have so many files :worried:
process unwrap{
        input:
            val unw_data from ifgram_to_unw
            each unw_start from ref_pix_unw


        output:
            set file(unw), file(off_par), val(baseline), val(master_id), val(slave_id) into unw

        shell:
            println("${unw_start}")
            log.info "Unwrapping !{ifgram}"
            cc = unw_data['cc']
            ifgram = unw_data['ifgram']
            off_par = unw_data['off_par']
            baseline = unw_data['baseline']
            master_id=unw_data['master_id']
            slave_id=unw_data['slave_id']
            unw_mask=unw_data['mask']
            '''
            width=$(get_value !{off_par} interferogram_width)
            mcf !{ifgram} !{cc} !{unw_mask} unw ${width} - - - - - - - !{make_string(unw_start)} 1
            '''


}
and of course doing that does not work
It complains about files being out of scope
Paolo Di Tommaso
@pditommaso
May 23 2017 15:47
are the names of the files known?
Simone Baffelli
@baffelli
May 23 2017 15:47
well I'm not using any known file name throughout the pipeline
i did not like to rely on name patterns
Paolo Di Tommaso
@pditommaso
May 23 2017 15:48
makes sense
and I understand your feature request, but this is not implemented yet
Simone Baffelli
@baffelli
May 23 2017 15:48
I know, I thought I could kind of hack it
by using map
but it seems to be quite fragile
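One way to make this less fragile, sketched under assumptions: convert each map into a list with an explicit key order just before the process, so the positions in the `set` declaration no longer depend on how the map was built. The helper `toTuple` and the sample values are hypothetical, not part of the pipeline above.

```groovy
// Hypothetical helper: turn a map into a list with an explicit, fixed key order.
// Fails loudly if a key is missing instead of silently shifting positions.
def toTuple(Map m, List<String> keys) {
    keys.collect { k ->
        if (!m.containsKey(k)) throw new IllegalArgumentException("Missing key: ${k}")
        m[k]
    }
}

// Key order expected by the process input declaration (assumed names)
def unwKeys = ['cc', 'ifgram', 'off_par', 'baseline', 'master_id', 'slave_id', 'mask']

// A record whose keys were inserted in a different order (made-up values)
def record = [mask: 'mask.bmp', slave_id: 2, master_id: 1, baseline: 12.5,
              off_par: 'off.par', ifgram: 'ifgram.dat', cc: 'cc.bmp']

assert toTuple(record, unwKeys) == ['cc.bmp', 'ifgram.dat', 'off.par', 12.5, 1, 2, 'mask.bmp']
```

In the operator chain this could be applied as a final `.map { toTuple(it, unwKeys) }` step before the channel is consumed.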
Paolo Di Tommaso
@pditommaso
May 23 2017 15:49
where are you creating the map?
Simone Baffelli
@baffelli
May 23 2017 15:50
data_for_unw
            .map({it->map_outputs(it, ['cc', 'ifgram', 'off_par', 'baseline', 'master_id', 'slave_id'])})
            .filter({it->filter_baseline(baseline(it['master_id'],it['slave_id']), params.pair_bl)})
            .cross(unw_mask_named){it-> [it.master_id, it.slave_id]}//synchronize based on the ids
            .map({it->it[0]+it[1]})//merge the pairs into a single mapping
            .into({ifgram_to_unw})
here
map_outputs just creates a map from a list of keys
Paolo Di Tommaso
@pditommaso
May 23 2017 15:51
are you using a HashMap or a LinkedHashMap ?
Simone Baffelli
@baffelli
May 23 2017 15:51
i initialize it with [:]
I guess that's a bad habit from python?
Paolo Di Tommaso
@pditommaso
May 23 2017 15:52
not at all, that's the groovy map constructor
Simone Baffelli
@baffelli
May 23 2017 15:52
so if I replace it with a LinkedHashMap
the order should be ensured
Paolo Di Tommaso
@pditommaso
May 23 2017 15:52
that creates a LinkedHashMap, which is supposed to maintain the key insertion order
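A small plain-Groovy check of that point, with made-up values for illustration:

```groovy
// Groovy's [:] literal is a java.util.LinkedHashMap, so keys keep insertion order.
def m = [:]
m.cc = 'cc.bmp'          // hypothetical values, for illustration only
m.ifgram = 'ifgram.dat'
m.off_par = 'off.par'

assert m instanceof LinkedHashMap
assert m.keySet().toList() == ['cc', 'ifgram', 'off_par']

// A plain HashMap makes no such promise: its iteration order is unspecified.
def h = new HashMap(m)
assert h.keySet() == m.keySet()   // same keys, but do not rely on their order
```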
Simone Baffelli
@baffelli
May 23 2017 15:53
but perhaps the sum
changes the order
Paolo Di Tommaso
@pditommaso
May 23 2017 15:53
!
what sum ?
Simone Baffelli
@baffelli
May 23 2017 15:54
sorry i mean the second to last line
where I merge the two mappings
it is really strange because in most cases it works
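A sketch of what the merge does in plain Groovy, assuming both sides are maps created with the `[...]` literal (the values are made up). The `+` operator keeps the key order of the left map and appends new keys from the right one, so the order itself is stable. Positions only shift when a key is absent from the merged map, which is one possible way a process could receive a value in the wrong slot.

```groovy
def left  = [cc: 'cc.bmp', ifgram: 'ifgram.dat', master_id: 1, slave_id: 2]
def right = [master_id: 1, slave_id: 2, mask: 'mask.bmp']

// Keys already present keep their position; new keys are appended at the end
def merged = left + right
assert merged.keySet().toList() == ['cc', 'ifgram', 'master_id', 'slave_id', 'mask']

// If one side is missing a key, every later value moves up by one position
def partial = left.findAll { k, v -> k != 'ifgram' } + right
assert partial.values().toList() == ['cc.bmp', 1, 2, 'mask.bmp']
```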
Simone Baffelli
@baffelli
May 23 2017 16:04
it seems that it happens when one of the two channels I'm calling 'phase' on does not send me the desired value
Karin Lagesen
@karinlag
May 23 2017 16:08
good evening :)
If I make my channel like this: """
Channel
.fromFilePairs( params.reads, size:params.setsize )
.ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
.into{ read_pairs }
"""
Paolo Di Tommaso
@pditommaso
May 23 2017 16:10
Hi Karin !
Karin Lagesen
@karinlag
May 23 2017 16:10
I get a list structure like this [pair_id, [R11 R12 R21 R2....]]
Paolo Di Tommaso
@pditommaso
May 23 2017 16:11
but ?
Karin Lagesen
@karinlag
May 23 2017 16:11
is there any way to recreate that for an output channel from a process?
(and hi again, Paolo!)
Paolo Di Tommaso
@pditommaso
May 23 2017 16:12
:)
I guess so
Karin Lagesen
@karinlag
May 23 2017 16:12
not sure, but I am hesitant to send pairs of stuff from a process via two different channels, not sure that I won't mess up pairing
Paolo Di Tommaso
@pditommaso
May 23 2017 16:13
process foo {
  output: 
  set val(pair_id), file('*{R11,R12,R21,R2}') into some_ch 

}
Karin Lagesen
@karinlag
May 23 2017 16:13
nice!
thanks!
Paolo Di Tommaso
@pditommaso
May 23 2017 16:13
the whole point is to provide a pattern that captures the files you need
you can even specify names one by one
I mean file('{A,B,C}')
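A slightly fuller sketch of the name-by-name variant, in the same process syntax used above. The channel `ids_ch`, the `.txt` extension and the `touch` command are placeholders for illustration only.

```groovy
// Hypothetical input channel carrying one sample id
ids_ch = Channel.from('sample1')

process foo {
    input:
    val pair_id from ids_ch

    output:
    // One tuple per task: the id plus the files matched by the brace pattern
    set val(pair_id), file('{A,B,C}.txt') into some_ch

    script:
    """
    touch A.txt B.txt C.txt
    """
}
```

Because the id and the files travel in the same tuple, downstream processes do not need to re-pair them.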
Karin Lagesen
@karinlag
May 23 2017 16:20
hmmm.... do you usually go for camel case or underscores in process names?
Paolo Di Tommaso
@pditommaso
May 23 2017 16:21
latest trend is an underscore-separated string beginning with a number :satisfied:
Evan Floden
@evanfloden
May 23 2017 16:22
Does NF console support the use of a config file?
Karin Lagesen
@karinlag
May 23 2017 16:22
@skptic yes, and it makes NF even more awesome
Paolo Di Tommaso
@pditommaso
May 23 2017 16:22
ummm, it should pick up the default one
brava karin !
:)
Karin Lagesen
@karinlag
May 23 2017 16:22
you can have both config files and profiles (that would be for the machine types)
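A minimal sketch of a config file with one profile per machine type. The profile name `cluster` and the memory value are assumptions for illustration.

```groovy
// nextflow.config
profiles {
    // Used when no -profile option is given
    standard {
        process.executor = 'local'
    }
    // Hypothetical profile for the cluster
    cluster {
        process.executor = 'slurm'
        process.memory = '8 GB'     // made-up value
    }
}
```

A profile is selected at launch time with the `-profile` option, for example `-profile cluster`.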
Evan Floden
@evanfloden
May 23 2017 16:23
Great thanks!
Karin Lagesen
@karinlag
May 23 2017 16:23
chdem
@chdem
May 23 2017 16:51
good evening !
I've a strange behaviour with groupTuple
perhaps it's my mistake but I don't understand why :
I've got a channel containing
```
[patientM, [file_patientM.gz, file_patientM.tbi]]
[patientM, [file_patientM.gz, file_patientM.tbi]]
```
chdem
@chdem
May 23 2017 16:56
sorry, wrong text
[patientM, [file_patientM_EVAL.gz, file_patientM_EVAL.tbi]]
[patientQ, [file_patientQ_EVAL.gz, file_patientQ_EVAL.tbi]]
[patientU, [file_patientU_EVAL.gz, file_patientU_EVAL.tbi]]
[patientX, [file_patientX_EVAL.gz, file_patientX_EVAL.tbi]]
[patientM, [file_patientM_TRUTH.gz, file_patientM_TRUTH.tbi]]
[patientQ, [file_patientQ_TRUTH.gz, file_patientQ_TRUTH.tbi]]
[patientU, [file_patientU_TRUTH.gz, file_patientU_TRUTH.tbi]]
[patientX, [file_patientX_TRUTH.gz, file_patientX_TRUTH.tbi]]
I'm doing a groupTuple() on this channel
chdem
@chdem
May 23 2017 17:04
and I obtain this :
[patientM, [file_patientM_EVAL.gz, file_patientM_TRUTH.gz], [file_patientM_EVAL.gz.tbi, file_patientM_TRUTH.gz.tbi]]
[patientU, [file_patientU_EVAL.gz, file_patientU_TRUTH.gz], [file_patientU_EVAL.gz.tbi, file_patientU_TRUTH.gz.tbi]]
[patientX, [file_patientX_TRUTH.gz, file_patientX_EVAL.gz], [file_patientX_TRUTH.gz.tbi, file_patientX_EVAL.gz.tbi]]
[patientQ, [file_patientQ_EVAL.gz, file_patientQ_TRUTH.gz], [file_patientQ_EVAL.gz.tbi, file_patientQ_TRUTH.gz.tbi]]
as you can see, the third line shows a different order for the EVAL/TRUTH files....
Why ? How to correct this ?
Thank you for your help...
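One possible workaround, sketched in plain Groovy with file names as strings: sort each grouped list after the grouping step so the order no longer depends on which item arrived first. The helper name `normalize` is hypothetical, and the sketch assumes each grouped item has the shape `[id, gzList, tbiList]` shown above.

```groovy
// Hypothetical helper: sort the grouped lists so EVAL always precedes TRUTH.
// sort(false) returns a sorted copy and leaves the original list untouched.
def normalize(List grouped) {
    def (id, gz, tbi) = grouped
    [id, gz.sort(false), tbi.sort(false)]
}

def item = ['patientX',
            ['file_patientX_TRUTH.gz', 'file_patientX_EVAL.gz'],
            ['file_patientX_TRUTH.gz.tbi', 'file_patientX_EVAL.gz.tbi']]

assert normalize(item) == ['patientX',
                           ['file_patientX_EVAL.gz', 'file_patientX_TRUTH.gz'],
                           ['file_patientX_EVAL.gz.tbi', 'file_patientX_TRUTH.gz.tbi']]
```

In a pipeline this could follow the grouping as `.map { normalize(it) }`; with real file objects, sorting by name (`sort(false) { it.name }`) would be the closer equivalent.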
Paolo Di Tommaso
@pditommaso
May 23 2017 17:07
Looks more like an issue, can you open a ticket on GitHub?
chdem
@chdem
May 23 2017 17:08
of course @pditommaso ....
Karin Lagesen
@karinlag
May 23 2017 17:36
ok, so, regarding config files and process specific things
how do I specify the path to a specific program from the profile file, so that I can use it in the process script?
Karin Lagesen
@karinlag
May 23 2017 17:43
what do I write in my script if I need to access the memory value of the cluster? process.memory?
Paolo Di Tommaso
@pditommaso
May 23 2017 17:45
one question at a time!
Karin Lagesen
@karinlag
May 23 2017 17:45
(that was supposed to be one question.... )
Paolo Di Tommaso
@pditommaso
May 23 2017 17:45
path, profile or memory? which is the first ?:)
Karin Lagesen
@karinlag
May 23 2017 17:45
ok
in my profile file, I want to specify where a program lives
then I want to use that in my script
Paolo Di Tommaso
@pditommaso
May 23 2017 17:46
you should never use an absolute path in your script
what about adding the program path to the env $PATH ?
Karin Lagesen
@karinlag
May 23 2017 17:47
can I do that in my profile?
Paolo Di Tommaso
@pditommaso
May 23 2017 17:47
yep
Karin Lagesen
@karinlag
May 23 2017 17:47
that makes things easier
what would that look like?
Paolo Di Tommaso
@pditommaso
May 23 2017 17:48
env.PATH = '/foo/bar:$PATH'
Karin Lagesen
@karinlag
May 23 2017 17:48
do I do that inside of the process statement, or outside of that?
hmm... just realized that that is a stupid question :)
Paolo Di Tommaso
@pditommaso
May 23 2017 17:48
outside, that's the same for
env {
  PATH = ..
}
caveat
Karin Lagesen
@karinlag
May 23 2017 17:49
thanks!
Paolo Di Tommaso
@pditommaso
May 23 2017 17:49
with single quotes, $PATH is expanded on the node
with double quotes, $PATH is resolved by NF on the login node
BTW most of the time it's the same so you should not worry about that
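The difference comes from Groovy string rules, which a plain-Groovy sketch can show. The local variable `PATH` below is only a stand-in for the value seen on the login node.

```groovy
// Stand-in for the login node's PATH value (assumption for this demo)
def PATH = '/usr/bin'

// Single quotes: no interpolation, the literal text $PATH is kept
// and left for the shell on the compute node to expand
def single = '/foo/bar:$PATH'

// Double quotes: Groovy substitutes the value immediately
def dbl = "/foo/bar:$PATH"

assert single == '/foo/bar:$PATH'
assert dbl.toString() == '/foo/bar:/usr/bin'
```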
Karin Lagesen
@karinlag
May 23 2017 17:51
I'm using slurm, so I am pretty certain something like that will come up...
Paolo Di Tommaso
@pditommaso
May 23 2017 17:52
umm, are you using modules ?
Karin Lagesen
@karinlag
May 23 2017 17:54
on the cluster, yes
but not on my local machine
trying to make things run locally before I mess with the slurm things
one variable at a time :)
Paolo Di Tommaso
@pditommaso
May 23 2017 17:54
use single quotes, maybe that's better
:D
good luck, having some food
Karin Lagesen
@karinlag
May 23 2017 17:55
enjoy!
Karin Lagesen
@karinlag
May 23 2017 18:28
ok, I have a bit of a systematic question for you
Paolo Di Tommaso
@pditommaso
May 23 2017 18:28
!
Karin Lagesen
@karinlag
May 23 2017 18:30
I would like my results to be published in this kind of structure: results/raw_data/sample_name/fastqfiles, results/trimmed/sample_name/trimmed_fq, results/assembly/sample_name/contigs.fa
Paolo Di Tommaso
@pditommaso
May 23 2017 18:30
yep
Karin Lagesen
@karinlag
May 23 2017 18:30
I suspect this basically means that I just ship the sample_name directory around as a channel?
and that I should simplify and just keep all file names the same, and just keep the dir name specific to the sample?
a part of me would like to keep the sample name in the file name, but that is starting to look a bit messy
Paolo Di Tommaso
@pditommaso
May 23 2017 18:32
something like this?
Karin Lagesen
@karinlag
May 23 2017 18:32
I think that looks like something, yes
so, keep the dir name as a variable, but hard code the rest
Paolo Di Tommaso
@pditommaso
May 23 2017 18:33
the rest what? the remaining part of the directory structure ?
Karin Lagesen
@karinlag
May 23 2017 18:33
all of the other files
i.e. rename fastq files to "R1.fq", "R1_trimmed.fq", etc
instead of using names with variables in them
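A sketch of that layout for the trimming step, assuming an upstream channel `raw_reads` that emits `[sample_name, R1 file, R2 file]` tuples and a placeholder command `my_trimmer` (both hypothetical). Only the directory carries the sample name; the file names stay fixed.

```groovy
process trim {
    // Publishes to results/trimmed/<sample_name>/
    publishDir "results/trimmed/${sample_name}", mode: 'copy'

    input:
    // Inputs are staged under fixed names regardless of the original file names
    set val(sample_name), file('R1.fq'), file('R2.fq') from raw_reads

    output:
    set val(sample_name), file('R1_trimmed.fq'), file('R2_trimmed.fq') into trimmed_reads

    script:
    """
    my_trimmer R1.fq R2.fq R1_trimmed.fq R2_trimmed.fq
    """
}
```

Passing `sample_name` along in every tuple is what lets each later process publish into its own `results/<step>/<sample_name>` directory.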
Paolo Di Tommaso
@pditommaso
May 23 2017 18:35
ah, that's the way I prefer
Karin Lagesen
@karinlag
May 23 2017 18:35
I can sort of start to tell :)
Paolo Di Tommaso
@pditommaso
May 23 2017 18:35
;)
Karin Lagesen
@karinlag
May 23 2017 18:42
you are making me draw stuff on envelopes!
/confused
Paolo Di Tommaso
@pditommaso
May 23 2017 18:43
why ?
Karin Lagesen
@karinlag
May 23 2017 18:44
just me having to draw my channels to understand what the hell I'm doing :grin:
Paolo Di Tommaso
@pditommaso
May 23 2017 18:44
:D
Michael L Heuer
@heuermh
May 23 2017 21:06
@pditommaso let me know if you need this part to go faster, I may have just the thing ;)
Paolo Di Tommaso
@pditommaso
May 23 2017 21:07
ah!
GATK 4 ?
Michael L Heuer
@heuermh
May 23 2017 21:11
Them's fighting words! ADAM does mark duplicates, bqsr, etc.
Just need to figure out that same Apache Spark as service-job-in-Nextflow/CWL issue, perhaps at this year's BOSC ;)
Paolo Di Tommaso
@pditommaso
May 23 2017 21:13
ahah, thus ADAM can be used in place of GATK
interesting
does it depend on Spark ?
I mean how do you achieve better performance?
by leveraging Spark parallelisation or by implementing a better algorithm ?
Michael L Heuer
@heuermh
May 23 2017 21:16
Yes, on Spark, and both, we can scale linearly.
Paolo Di Tommaso
@pditommaso
May 23 2017 21:17
cool
unfortunately I didn't have much time to continue on the Spark integration, but it's still on the radar!
two days at BOSC is too little !