These are chat archives for nextflow-io/nextflow

15th
Nov 2018
Riccardo Giannico
@giannicorik_twitter
Nov 15 2018 08:31
Hi guys, Is there a list of methods I can apply to a variable in a process bash script?
For instance, in the example below I know I can use the ".baseName" method, but I suppose there are more. I also suppose that if it's a list I could use something like ".first" or "[0]", and so on.
process myproc {
   input: 
   file (bam) from c_bamfile
   """
   echo ${bam.baseName}
   """
}
Tobias Neumann
@t-neumann
Nov 15 2018 08:41

To people with AWS batch experience:

I have set up a batch queue with fixed c5.large instances. According to their listing, these provide 2 vCPUs and 4 GB of memory https://aws.amazon.com/de/ec2/instance-types/

Now I have configured my jobs to require exactly those 2 CPUs and 4 GB of memory:

cpus = 2
memory = { 4.GB * task.attempt }

However, when I run Nextflow with this config, the jobs are stuck in the Runnable state, but no instances are fired up. I played around and with 2 CPUs and 3 GB it works - so the limiting step seems to be the memory. Has anybody experienced the same? And can this be circumvented without going for larger instances? Or how did you do this? Specify 3.99 GB of memory?
This is super inconvenient if the instance specs do not translate into nextflow specs

Tobias Neumann
@t-neumann
Nov 15 2018 09:21

@pditommaso how would you specify fractions of e.g. gigabytes?

memory = { 3.9GB * task.attempt }

This is giving me an error:

Nov-15 09:49:52.230 [Actor Thread 8] ERROR nextflow.processor.TaskProcessor - Error executing process > 'bamToFastq (570d9e4b-eced-42e2-bb69-3d3f73d90649_gdc_realn_rehead)'

Caused by:
  No signature of method: groovy.util.ConfigObject.multiply() is applicable for argument types: (java.lang.Integer) values: [1]

groovy.lang.MissingMethodException: No signature of method: groovy.util.ConfigObject.multiply() is applicable for argument types: (java.lang.Integer) values: [1]
Rad Suchecki
@rsuchecki
Nov 15 2018 10:31
@giannicorik_twitter this depends on the class of the variable (object) you are referring to. Forget for a moment about script: block specifically. Nextflow variable's value simply gets inserted into a script (.command.sh in a process directory). There may be better ways of checking what is available, but you may start by calling yourVariable.getClass() and looking at a relevant doc and/or perhaps try this: https://stackoverflow.com/questions/22065290/how-to-get-all-method-names-of-a-class-without-inherited-methods-with-groovy
Rad Suchecki
@rsuchecki
Nov 15 2018 10:46
@t-neumann memory = { 3900.MB * task.attempt }?
Tobias Neumann
@t-neumann
Nov 15 2018 10:53
@rsuchecki ok so no way of multiplying task.attempt with a float
@pditommaso Is there any Nextflow memory and CPU -> AWS memory and CPU conversion table? Because even with 3.9 GB on a 4 GB memory machine the jobs won't start.
Is there a certain part of the memory taken up by the AMI that one would have to subtract? But even so, why is it that I don't have to do this for CPU then?
Rad Suchecki
@rsuchecki
Nov 15 2018 11:25
actually this should work @t-neumann where are you setting this? Try memory = { 3.9.GB * task.attempt }
Riccardo Giannico
@giannicorik_twitter
Nov 15 2018 11:35

@rsuchecki thank you for your tips, but I tried the following:

process myproc {
   input: 
   file (bam) from c_bamfile
   """
   echo "first try: ${bam.getClass().declaredMethods.findAll { !it.synthetic }.name} "
   echo "second try: ${bam.getClass().methods.collect { it.name }} "
   """
}

obtaining something weird from the 2 echoes in the .command.sh file:

echo "first try: [copyTo, copyTo, moveTo, moveTo, getBaseName, isLink, setPermissions, setPermissions, resolveSymLink, createDirIfNotExists, rollFile, mklink, mklink, mklink, mklink, mklink, mklink, mklink, mklink, readAttributes, getUri, minus, div, div, deleteDir, complete, equals, equals, toString, register, register, hashCode, compareTo, getName, getName, startsWith, startsWith, endsWith, endsWith, matches, size, iterator, getSimpleName, getParent, isAbsolute, delete, resolve, resolve, getPermissions, setReadOnly, getRoot, normalize, canRead, canWrite, exists, isDirectory, isFile, isHidden, lastModified, deleteOnExit, mkdir, mkdirs, renameTo, renameTo, setLastModified, setWritable, setWritable, setReadable, setReadable, setExecutable, setExecutable, canExecute, getFileSystem, getFileName, empty, getNameCount, subpath, resolveSibling, resolveSibling, relativize, toUri, toAbsolutePath, toRealPath, toFile, or, or, getExtension, plus, plus]" 
echo "second try: [copyTo, copyTo, moveTo, moveTo, getBaseName, isLink, setPermissions, setPermissions, resolveSymLink, createDirIfNotExists, rollFile, mklink, mklink, mklink, mklink, mklink, mklink, mklink, mklink, readAttributes, getUri, minus, div, div, deleteDir, complete, setProperty, getProperty, equals, equals, toString, register, register, hashCode, compareTo, compareTo, getName, getName, startsWith, startsWith, endsWith, endsWith, matches, size, iterator, getSimpleName, getParent, isAbsolute, delete, resolve, resolve, getPermissions, setReadOnly, getRoot, normalize, canRead, canWrite, exists, isDirectory, isFile, isHidden, lastModified, deleteOnExit, mkdir, mkdirs, renameTo, renameTo, setLastModified, setWritable, setWritable, setReadable, setReadable, setExecutable, setExecutable, canExecute, getFileSystem, getFileName, empty, getNameCount, subpath, resolveSibling, resolveSibling, relativize, toUri, toAbsolutePath, toRealPath, toFile, or, or, getExtension, plus, plus, getMetaClass, setMetaClass, invokeMethod, wait, wait, wait, getClass, notify, notifyAll, spliterator, forEach]"

weird because, as you can see, neither of them shows me the ".baseName" method, but I do find a ".getBaseName". That's why I think I'm missing something.

Tobias Neumann
@t-neumann
Nov 15 2018 11:36
@rsuchecki ah I missed that second dot - I will try in a sec
Rad Suchecki
@rsuchecki
Nov 15 2018 11:37
@giannicorik_twitter I assume .baseName is just a groovy shorthand for more Java-ish .getBaseName()
Riccardo Giannico
@giannicorik_twitter
Nov 15 2018 11:41

@rsuchecki ok, but .getBaseName is NOT a method I can use:
If I write this:

echo "this: ${bam.getBaseName}"

I get an error:

ERROR ~ Error executing process > 'myproc (2)'
Caused by:
  No such property: getBaseName for class: nextflow.processor.TaskPath
Rad Suchecki
@rsuchecki
Nov 15 2018 11:41
by the way no reason to exclude inherited methods, so go for the second list.
and you are missing () at the end to make this a method call
Riccardo Giannico
@giannicorik_twitter
Nov 15 2018 11:43
@rsuchecki aaahhh!!! That's the trick! .baseName is a shortcut for .getBaseName() with the () !!! Great! Now I get it! :D Thank you so much :D
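
A minimal plain-Groovy sketch of the point above, runnable outside Nextflow. The class FakeBam is a hypothetical stand-in with a single getter; it is not Nextflow's TaskPath and only illustrates how Groovy maps a getXxx() method to the property xxx.

```groovy
// Hypothetical stand-in for the file object seen in a process script.
// It is NOT Nextflow's TaskPath, just a class with one JavaBean-style getter.
class FakeBam {
    private final String fileName
    FakeBam(String fileName) { this.fileName = fileName }

    // Groovy exposes this getter as the property `baseName`.
    String getBaseName() {
        int dot = fileName.lastIndexOf('.')
        dot > 0 ? fileName.substring(0, dot) : fileName
    }
}

def bam = new FakeBam('sample1.bam')

assert bam.baseName == 'sample1'        // property shorthand
assert bam.getBaseName() == 'sample1'   // explicit method call, with ()

// Without the parentheses Groovy looks for a property named `getBaseName`,
// which does not exist, so the access fails.
def failed = false
try {
    bam.getBaseName
} catch (MissingPropertyException e) {
    failed = true
}
assert failed

// Public method names, inherited ones included (the "second try" in the chat).
def methodNames = bam.getClass().methods*.name.unique().sort()
println methodNames
```

The same reading applies to the other getters in the lists above: a method shown as getName or getParent can presumably be reached as .name or .parent, while methods that are not getters need the explicit call with ().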
Rad Suchecki
@rsuchecki
Nov 15 2018 11:48
:thumbsup:
Hugues Fontenelle
@huguesfontenelle
Nov 15 2018 13:31
Hello :)
Very interesting question (of course, I asked it ;-) ) on the forums: https://groups.google.com/forum/#!topic/nextflow/bL9lZpvRFPE
topic: Docker-in-Docker; running nextflow with Docker in CI (nextflow itself is dockerized) <3
micans
@micans
Nov 15 2018 13:35

Basic question about groovy code and processing logic:

    tag "$samplename"
    publishDir "${params.outdir}/mixcr/$bucket/$samplename", mode: 'copy'
    def bucket = samplename.md5()[-1..-2]

    input:
    set val(samplename), file(reads) from ch_mixcr

This indicates what I am trying to achieve, but yields ERROR ~ No such variable: samplename. Incidentally, I may have missed it, but I haven't found a place where the logic of processing in a process is explained, what its scopes are, what variables are visible where etc.

Riccardo Giannico
@giannicorik_twitter
Nov 15 2018 13:42
@micans I think you need tag {samplename} instead of tag "$samplename" https://www.nextflow.io/docs/latest/process.html?highlight=tag#tag
micans
@micans
Nov 15 2018 13:42
No, the first samplename is fine. It's the second usage that is the problem.
I mean the third
def bucket = samplename.md5()[-1..-2]
micans
@micans
Nov 15 2018 13:50
I'm trying
saveAs: { filename -> "${samplename.md5()[-1..-2]}/$filename" }
now ... maybe that's the way to do it.
micans
@micans
Nov 15 2018 14:03
OK, my final answer ...
    tag {samplename}
    publishDir "${params.outdir}/mixcr/", mode: 'copy',
      saveAs: { filename -> "${samplename.md5()[-1..-2]}/$samplename/$filename" }
that worked
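
A plain-Groovy sketch of the bucketing idea in the saveAs closure above, assuming only the JDK's MessageDigest and Groovy's encodeHex(). The helper names md5Hex, bucketFor and publishPath are made up for illustration and are not Nextflow functions.

```groovy
import java.security.MessageDigest

// Hex MD5 of a string, using the JDK digest plus Groovy's encodeHex().
String md5Hex(String text) {
    byte[] raw = MessageDigest.getInstance('MD5').digest(text.getBytes('UTF-8'))
    raw.encodeHex().toString()
}

// Two-character bucket taken from the end of the digest.
// The chat uses [-1..-2], which yields the same two characters in reverse order.
String bucketFor(String sampleName) {
    md5Hex(sampleName)[-2..-1]
}

// Relative publish path: <bucket>/<sample>/<file>
String publishPath(String sampleName, String fileName) {
    "${bucketFor(sampleName)}/${sampleName}/${fileName}".toString()
}

assert md5Hex('hello') == '5d41402abc4b2a76b9719d911017c592'
assert bucketFor('hello') == '92'
println publishPath('hello', 'clones.txt')   // prints 92/hello/clones.txt
```

Two hex characters give at most 256 buckets, so the number of sample directories per bucket shrinks by roughly that factor if the names hash evenly.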
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:04
excellent !
micans
@micans
Nov 15 2018 14:06
Thanks Paolo! I'm a bit annoyed with myself ... I am scratching to find the right question to ask, to understand better what goes on in the process processing, but this is not yet it :grinning: As you can see slowly getting some groovy idioms .. ahh all that power
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:08
great progress, actually I didn't know it was possible to calc a md5 signature in that way
but the md5 is computed on the file name or the file content ?
micans
@micans
Nov 15 2018 14:09
Grazing the internets ... on the file name, that is fine for my purpose actually
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:09
:ok_hand:
micans
@micans
Nov 15 2018 14:09
I've seen snippets that indicate computing it on file contents is much harder, and I haven't seen a standard idiomatic way.
Well, not hard, but just not anything that I could snap up. Don't need it now, however.
(actually not the file name, but the samplename ... its our primary token throughout the pipeline, unique by necessity)
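
For the file-content variant mentioned above, one possible sketch in plain Groovy, assuming the JDK's MessageDigest and java.nio.file.Files. The function name md5OfContent is hypothetical; the file is read in chunks so that a large BAM does not have to fit in memory.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// MD5 of a file's content, streamed in 8 KiB chunks.
String md5OfContent(Path path) {
    def digest = MessageDigest.getInstance('MD5')
    Files.newInputStream(path).withStream { stream ->
        byte[] buffer = new byte[8192]
        int read
        while ((read = stream.read(buffer)) != -1) {
            digest.update(buffer, 0, read)
        }
    }
    digest.digest().encodeHex().toString()
}

// Demo on a throwaway temp file containing the text "hello".
def tmp = Files.createTempFile('md5-demo', '.txt')
Files.write(tmp, 'hello'.getBytes('UTF-8'))
assert md5OfContent(tmp) == '5d41402abc4b2a76b9719d911017c592'
Files.delete(tmp)
```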
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:16
should not be that hard, but definitely that makes sense to add to the publishDir options as built-in feature
micans
@micans
Nov 15 2018 14:23
Yes, I've been thinking about the feature request. Then work happened and steamrollered me. Maybe it's trickier to get the interface right when there are multiple files (or file types) produced in the process. And it would probably need options for what to md5 on, e.g. filename, or other string (like in my case), or file content. Then the number of levels ... is it too much to envision an auto-scaling option, based on a counter available to the program (or set in the program)? Maybe overkill for now. I am curious what is going to happen with the number of files produced by bioinformatics pipelines and if it is going to require changes in the way we organise things.
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:26
frankly I don't see the added value of md5 signature of the sample name, I understand the file content because that guarantees the file has not been altered
how is an md5 of a string value useful?
micans
@micans
Nov 15 2018 14:27
ah. It's purely to distribute files
7000K or more files in a single directory can kill file system performance
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:28
7000K or more files in a single directory can kill file system performance
micans
@micans
Nov 15 2018 14:28
I've been told to stop at ~500 often
so I'm not an expert .. this is what I have been told. Maybe depends on the file system
but my experience is that this is correct
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:28
this is why NF organises work into hash-generated subdirs
micans
@micans
Nov 15 2018 14:30
yes, I thought I was doing the same thing, and in my case md5 (or any hash) based on samplename is sufficient for my needs. I don't really need the alteration thing ... I'm disregarding the rest of the hash anyway.
it was what came to mind quickest
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:30
I thought I was doing the same thing,
I've realised that only now :smile:
micans
@micans
Nov 15 2018 14:31
hehe
Paolo Di Tommaso
@pditommaso
Nov 15 2018 14:31
I was not parsing the /
actually this is a good candidate for a new pattern
micans
@micans
Nov 15 2018 14:52
Is it possible to have a process that publishes its input?
I'm thinking
    when:
    params.save_star_bam

    input:
    set val(samplename), file(thebam) from ch_publishbam

    output:
    set val(samplename), file("*.bam")
maybe ignore val(samplename) in output.
Can I do this with no script section? I assume I can fudge it anyway by perhaps linking a second file, but is a fudge necessary? This flow looks good to me as I can use the when: section.
(I mean the fudge would be if I have a script section and create an actual output file)
The start of this would be
ch_bam_hisat2
  .mix(ch_bam_star)
  .into{ ch_indexbam; ch_publishbam }
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:02
I think I've lost something. Short question ?
micans
@micans
Nov 15 2018 15:03
I want to publish a file from the input section of a process
preferably without a script section, is that possible?
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:04
wildcard does not capture inputs, you need to declare explicitly that input as output
micans
@micans
Nov 15 2018 15:05
fair enough
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:05
even if it's not used then in a channel, but only to get it published
micans
@micans
Nov 15 2018 15:05
if input has file(thebam), can I use the same in the output?
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:06
I guess so
micans
@micans
Nov 15 2018 15:06
hehe that indicates severe doubts
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:06
LOL
micans
@micans
Nov 15 2018 15:06
let me show you my tiny process ...
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:07
try it
micans
@micans
Nov 15 2018 15:07
process publisbam {
    tag "${samplename}"
    publishDir "${params.outdir}/${params.aligner}-bams/", mode: 'link',
      saveAs: { filename -> "${samplename.md5()[-1..-2]}/$samplename/$filename" }

    when:
    params.save_bam

    input:
    set val(samplename), file(thebam) from ch_publishbam

    output:
    file(thebam)
}
sorry about that. yes I will try. it makes sense in my head but maybe I miss something obvious
It needs a script section or it fails (ERROR ~ Invalid process definition -- Make sure the process ends with a script wrapped by quote characters) but otherwise it's running now
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:12
ahhhhhhhhhhhh
well, this makes no sense
why use a process if you are not processing anything ..
I know for the publishDir, right?
micans
@micans
Nov 15 2018 15:13
well, the when: section, really
that's a convenient switch I can flip on the command line
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:14
no script => no process
micans
@micans
Nov 15 2018 15:15
I guess I could stick that in a saveAs closure and return null unless params.save_bam is set ... but this feels cleaner actually
right now I have an empty script. that might not work. Maybe it needs a single true
So what do you think is the cleanest way to do this publishing of bam files conditional on params.save_bam? also ... these bams can come from different aligners, depending on the run. So I'd prefer to do the publishing not separately in each of the align processes.
Hugues Fontenelle
@huguesfontenelle
Nov 15 2018 15:23
save_bam = false

process mapping {
    if (save_bam) {
        publishDir "data"
    }

    output:
    file("sample.bam")

"""
touch sample.bam
"""
}
Not as a separate process, but where you produce the bam.
What do you think?
micans
@micans
Nov 15 2018 15:25
(1) I produce bams in multiple places (2) those places publish other stuff as well. So if my solution above works then that's my prefered option, for now, until new insights develop :-)
Paolo Di Tommaso
@pditommaso
Nov 15 2018 15:50
don't use a process, a map or a subscribe is enough, and then copy the files https://www.nextflow.io/docs/latest/script.html#copy-files
micans
@micans
Nov 15 2018 15:59
Nice, thanks. can use that with until { ! params.save_bams }, and then I presume something like subscribe { it.copyTo("${params.outdir}/stuff/morestuff") }. Will figure it out!
I'll use mklink, hard=true to save space
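
A rough sketch of what the body of such a subscribe closure could do, written in plain Groovy with JDK calls only (Files.createDirectories, Files.createLink, Files.copy) rather than the Nextflow copyTo/mklink helpers mentioned above. The function publishInto and the demo paths are hypothetical; falling back to a copy when hard-linking fails is an assumption about the desired behaviour, not something stated in the chat.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Place `source` into `destDir` (created if missing) as a hard link,
// falling back to a plain copy when linking is not possible
// (for example across file systems).
Path publishInto(Path source, Path destDir) {
    Files.createDirectories(destDir)              // like mkdir -p
    Path target = destDir.resolve(source.fileName)
    Files.deleteIfExists(target)                  // allow re-publishing
    try {
        Files.createLink(target, source)          // hard link: no extra space used
    } catch (UnsupportedOperationException | IOException e) {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING)
    }
    target
}

// Demo in a throwaway directory.
def work = Files.createTempDirectory('publish-demo')
def bam = Files.write(work.resolve('sample1.bam'), 'fake bam'.getBytes('UTF-8'))
def out = publishInto(bam, work.resolve('results').resolve('dir1'))
assert Files.isRegularFile(out)
assert new String(Files.readAllBytes(out), 'UTF-8') == 'fake bam'
work.toFile().deleteDir()
```

Because the destination directory is created first, the file ends up inside results/dir1 rather than being written to a file named dir1.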
Tobias Neumann
@t-neumann
Nov 15 2018 17:14
@pditommaso I took quite some time to figure out how to calculate the discrepancy between listed EC2 memory and actual available memory to the ECS container agent. I think this https://docs.aws.amazon.com/batch/latest/userguide/memory-management.html should go in the documentation.
Paolo Di Tommaso
@pditommaso
Nov 15 2018 17:52
you may want to write a blog post about that for the sake of the community
Tobias Neumann
@t-neumann
Nov 15 2018 19:55
I would be happy to post something, I just don't have a blog of my own. Open to suggestions
Alexander Peltzer
@apeltzer
Nov 15 2018 20:05
medium ?
Tobias Neumann
@t-neumann
Nov 15 2018 20:15
I'll check it out, thx
gawells
@gawells
Nov 15 2018 21:57
@pditommaso Hi, only got back to this now, how do I mkdir in the subscribe body? I wasn't clear the first time, I want the files to be copied to a subdirectory ./results/dir1, not the file results/dir1
gawells
@gawells
Nov 15 2018 22:52
Is there a nextflow way to separate channels x number of times (with indexing and without having to come up with new names for each channel)? My current hack is to use a process to do something like for x in \$(seq -w 1 100); do ln -sf ${channelName} ${channelName}.\$x ;done; in the script section and output the links