These are chat archives for nextflow-io/nextflow

3rd
Nov 2016
Johan Viklund
@viklund
Nov 03 2016 06:58
Nextflow is the only Groovy thing I've used. Previously I've only seen $ used for string interpolation. Is it something similar here?
Is it nextflow that parses the configuration file? That is, is this a convention you have introduced?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 07:28
@viklund that's only a convention used in the nextflow config file
anyhow, don't confuse "$something" in a string, which is used to interpolate variables,
with the $ character itself, which in Java and Groovy is a valid identifier character
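A minimal sketch of that distinction in plain Groovy, outside of any Nextflow config file. The names name, greeting, literal and $settings are hypothetical and chosen only for illustration:

```groovy
def name = 'world'

// Inside a double-quoted string, $ triggers variable interpolation.
def greeting = "hello $name"
assert greeting == 'hello world'

// Inside a single-quoted string, $ is just a literal character.
def literal = 'hello $name'
assert literal == 'hello $name'

// Outside a string, $ is an ordinary identifier character,
// so a variable name may start with it.
def $settings = [queue: 'long']
assert $settings.queue == 'long'
```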
Johan Viklund
@viklund
Nov 03 2016 07:51
Ok. Thanks. Since I don't know Groovy that well, it's sometimes hard to know where Groovy ends and Nextflow starts.
Paolo Di Tommaso
@pditommaso
Nov 03 2016 07:52
you are welcome, sometimes happens to me as well ;)
Maxime Garcia
@MaxUlysse
Nov 03 2016 09:42
Hello @pditommaso
I have this warning about a duplicate channel output :
WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
How can I assign specific names to Channels ?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 09:47
what's tiny? an output file name?
Maxime Garcia
@MaxUlysse
Nov 03 2016 09:59
it's my PatientID
In this case it's a small sample file
Paolo Di Tommaso
@pditommaso
Nov 03 2016 09:59
can I see the snippet of code where u are using/declaring it?
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:01
in almost each process as an output we have a set like:
set idPatient, idSample, file("${idSample}.recal.bam"), file("${idSample}.recal.bai") into recalibratedBams
set idPatient, idSample, file(realignedBamFile), file(realignedBaiFile), file("${idSample}.recal.table") into recalibrationTable
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:03
but that warning doesn't make much sense in this context
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:04
OK, so it might not be in here, I'll keep looking to see where it might come from, but since there is no line number in the warning message, it's a little complicated to find where the problem might be
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:05
try to look for all references of tiny in the script
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:06
thanks
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:09
I've realised that message is confusing
that warning is produced by a piece of code like the following
A = 'tiny'

process foo {
  output:
  file 'x' into A

  '''
  touch x
  '''

}
WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
[7c/57c7f5] Submitted process > foo (1)
I will improve that message, thanks for reporting it
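Going by the example above, the warning appears when the name after into is already bound to a value ('tiny') in the script. A possible workaround, sketched here on the assumption that this is the only trigger, is to give the existing variable and the output channel different names. The names patientId and resultFiles are hypothetical:

```groovy
// The string value keeps its own name ...
patientId = 'tiny'

process foo {
  output:
  // ... and the output channel uses a name that is not bound to anything else.
  file 'x' into resultFiles

  '''
  touch x
  '''
}
```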
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:11
Thanks a lot for your help
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:11
welcome
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:11
You're very responsive, it's impressive ;-)
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:12
:)
Johan Viklund
@viklund
Nov 03 2016 11:15
is it possible to have the input and output files of workflow steps stored on S3 (or other object storage) instead of having them as symlinks in the filesystem?
Ideally this should be transparent to whoever writes the process and just some configuration setting. I didn't find anything obvious in the docs.
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:18
you can use s3 both as input, publishDir target and the entire pipeline work dir
Johan Viklund
@viklund
Nov 03 2016 11:18
work dir can be compute-node local, but input/output and publishDir
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:19
yes
Johan Viklund
@viklund
Nov 03 2016 11:19
hmm, nice
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:19
when the input is an s3 file, it's downloaded locally
then the output will continue locally as usual
Johan Viklund
@viklund
Nov 03 2016 11:20
of course, but how do I get an output file into S3?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:20
by using publishDir 's3://your/bucket/'
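A rough sketch of how that directive could sit in a process, reusing the placeholder bucket path from the message. The process name foo and the channel name results are hypothetical, and valid AWS credentials are assumed to be available to Nextflow:

```groovy
process foo {
  // Declared outputs are published to this bucket path.
  publishDir 's3://your/bucket/'

  output:
  file 'x' into results

  '''
  touch x
  '''
}
```

The published copy is only an export: a downstream process reading from results still receives the file from the work directory, not from S3.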
Johan Viklund
@viklund
Nov 03 2016 11:21
ok, but then the next process will not get the file from S3, it will assume it's on a shared filesystem
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:21
yes exactly
Johan Viklund
@viklund
Nov 03 2016 11:21
what if I can't assume a shared filesystem
I don't have a usecase for this yet
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:22
if you are using local or grid executor that's not an option
if you are deploying in the (AWS) cloud you can use s3 as shared storage or EFS
Johan Viklund
@viklund
Nov 03 2016 11:23
Ok
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:24
which of the two is your use case ?
Johan Viklund
@viklund
Nov 03 2016 11:25
I don't have one
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:25
:)
Johan Viklund
@viklund
Nov 03 2016 11:25
I'm just thinking about the future
I expect HPC clusters to stop using shared filesystems and go all in on object storage
it's just a matter of when
especially for bio-data where the files are big
not for physics and chemistry perhaps
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:26
ah, I hadn't noticed this trend
Johan Viklund
@viklund
Nov 03 2016 11:26
it's not a trend yet
I've just seen the problems with using, esp. NFS, for bioinfo compute
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:27
do you think this also applies to on-premises HPC?
Johan Viklund
@viklund
Nov 03 2016 11:27
If I could just decide for everyone else, I would do that.
But I'm just a developer
:)
but we're probably not there yet, but we can't really handle the load from the sequencing machines as it is
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:29
yep, there are too many legacy applications, especially in the scientific field
in my opinion shared file systems will remain a critical component for a while
cloud providers are also starting to recognise that
you may be interested in this