These are chat archives for nextflow-io/nextflow

3rd
Nov 2016
Johan Viklund
@viklund
Nov 03 2016 06:58 UTC
Nextflow is the only Groovy thing I've used. Previously I've only seen $ used for string interpolation. Is it something similar here?
Is it nextflow that parses the configuration file? That is, is this a convention you have introduced?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 07:28 UTC
@viklund that's only a convention used in the nextflow config file
anyhow, don't confuse "$something" in a string, which is used to interpolate variables,
with the $ character, which in Java and Groovy is a valid identifier character
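The distinction above can be sketched in plain Groovy. This is only an illustration, not Nextflow configuration code; the variable names `$total`, `name`, `greeting` and `literal` are made up for the example.

```groovy
// '$' is an ordinary identifier character in Groovy (and Java),
// so a variable name may start with it
def $total = 3

def name = 'world'

// inside a double-quoted string, "$name" interpolates the variable
def greeting = "hello $name"

// single-quoted strings are never interpolated
def literal = 'hello $name'

println $total
println greeting
println literal
```

In the first case `$` is simply part of a name; only in the double-quoted string does it trigger interpolation.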
Johan Viklund
@viklund
Nov 03 2016 07:51 UTC
Ok. Thanks. Since I don't know Groovy that well, it's sometimes hard to know where Groovy ends and Nextflow starts.
Paolo Di Tommaso
@pditommaso
Nov 03 2016 07:52 UTC
you are welcome, it sometimes happens to me as well ;)
Maxime Garcia
@MaxUlysse
Nov 03 2016 09:42 UTC
Hello @pditommaso
I have this warning about a duplicate channel output :
WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
How can I assign specific names to Channels ?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 09:47 UTC
what's tiny? an output file name?
Maxime Garcia
@MaxUlysse
Nov 03 2016 09:59 UTC
it's my PatientID
In this case it's a small sample file
Paolo Di Tommaso
@pditommaso
Nov 03 2016 09:59 UTC
can I see the snippet of code where you are using/declaring it?
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:01 UTC
in almost each process as an output we have a set like:
set idPatient, idSample, file("${idSample}.recal.bam"), file("${idSample}.recal.bai") into recalibratedBams
set idPatient, idSample, file(realignedBamFile), file(realignedBaiFile), file("${idSample}.recal.table") into recalibrationTable
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:03 UTC
but that warning doesn't make much sense in this context
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:04 UTC
OK, so it might not be in here, I'll keep looking to see where it might come from, but since there is no line number in the warning message, it's a little complicated to find where the problem might be
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:05 UTC
try to look for all references of tiny in the script
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:06 UTC
thanks
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:09 UTC
I've realised that message is confusing
that warning is produced by a piece of code like the following:
A = 'tiny'

process foo {
  output:
  file 'x' into A

  '''
  touch x
  '''

}
WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
[7c/57c7f5] Submitted process > foo (1)
I will improve that message, thanks for reporting it
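A minimal sketch of one way to avoid the warning, assuming it is triggered, as in the snippet above, by an `into` target whose name is already assigned in the script. `sampleLabel` and `fooOutput` are hypothetical names chosen for illustration; the process syntax is the same as in the snippet above.

```groovy
// the plain value keeps its own name (hypothetical: sampleLabel)
sampleLabel = 'tiny'

process foo {
  output:
  // the output channel uses a name that is not assigned anywhere else
  // in the script (hypothetical: fooOutput)
  file 'x' into fooOutput

  '''
  touch x
  '''
}
```

Searching the script for every place where the reported value (`tiny` here) is assigned should point to the variable name that clashes with a channel name.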
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:11 UTC
Thanks a lot for your help
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:11 UTC
welcome
Maxime Garcia
@MaxUlysse
Nov 03 2016 10:11 UTC
You're very responsive, it's impressive ;-)
Paolo Di Tommaso
@pditommaso
Nov 03 2016 10:12 UTC
:)
Johan Viklund
@viklund
Nov 03 2016 11:15 UTC
is it possible to have the input and output files of workflow steps stored on S3 (or other object storage) instead of having them as symlinks in the filesystem?
Ideally this should be transparent to whoever writes the process and just be some configuration setting. I didn't find anything obvious in the docs.
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:18 UTC
you can use S3 as input, as publishDir target, and as the entire pipeline work dir
Johan Viklund
@viklund
Nov 03 2016 11:18 UTC
work dir can be compute-node local, but input/output and publishDir
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:19 UTC
yes
Johan Viklund
@viklund
Nov 03 2016 11:19 UTC
hmm, nice
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:19 UTC
when the input is an S3 file, it's downloaded locally
then the output will continue locally as usual
Johan Viklund
@viklund
Nov 03 2016 11:20 UTC
of course, but how do I get an output file into S3?
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:20 UTC
by using publishDir 's3://your/bucket/'
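As a rough sketch, the `publishDir` directive goes inside the process definition. The process name `foo`, the file `x` and the channel name `results` are placeholders, and `s3://your/bucket/` is the placeholder path from the message above; this assumes AWS credentials are already available to Nextflow.

```groovy
process foo {
  // copy the declared outputs to the bucket once the task completes
  publishDir 's3://your/bucket/'

  output:
  file 'x' into results

  '''
  touch x
  '''
}
```

The published copy is an extra: any downstream process reading from `results` would still get the file from the work directory, not from the bucket.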
Johan Viklund
@viklund
Nov 03 2016 11:21 UTC
ok, but then the next process will not get the file from S3, it will assume it's on a shared filesystem
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:21 UTC
yes exactly
Johan Viklund
@viklund
Nov 03 2016 11:21 UTC
what if I can't assume a shared filesystem
I don't have a usecase for this yet
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:22 UTC
if you are using the local or a grid executor that's not an option
if you are deploying in the (AWS) cloud you can use S3 or EFS as shared storage
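A hedged sketch of what pointing the work dir at S3 could look like in `nextflow.config`, assuming a cloud deployment as described above (it is not an option with the local or grid executors). The bucket path is a placeholder.

```groovy
// nextflow.config
// use an S3 location as the pipeline work directory (placeholder bucket path)
workDir = 's3://your/bucket/work'
```

The same location can also be given on the command line with the `-w` (work dir) option of `nextflow run`.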
Johan Viklund
@viklund
Nov 03 2016 11:23 UTC
Ok
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:24 UTC
which of the two is your use case ?
Johan Viklund
@viklund
Nov 03 2016 11:25 UTC
I don't have one
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:25 UTC
:)
Johan Viklund
@viklund
Nov 03 2016 11:25 UTC
I'm just thinking about the future
I expect HPC clusters to stop using shared filesystems and go all in on object storage
it's just a matter of when
especially for bio-data where the files are big
not for physics and chemistry perhaps
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:26 UTC
ah, I hadn't noticed this trend
Johan Viklund
@viklund
Nov 03 2016 11:26 UTC
it's not a trend yet
I've just seen the problems with using, esp. NFS, for bioinfo compute
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:27 UTC
do you think this also applies to on-premises HPC?
Johan Viklund
@viklund
Nov 03 2016 11:27 UTC
If I could just decide for everyone else, I would do that.
But I'm just a developer
:)
but we're probably not there yet, but we can't really handle the load from the sequencing machines as it is
Paolo Di Tommaso
@pditommaso
Nov 03 2016 11:29 UTC
yep, there are too many legacy applications, especially in the scientific field
in my opinion shared file systems will remain a critical component for a while
cloud providers are also starting to recognise that
you may be interested in this