These are chat archives for nextflow-io/nextflow

17th
Jan 2019
Paolo Di Tommaso
@pditommaso
Jan 17 09:12
well, that message is not really an error: the S3 file system does not define its own path matcher and falls back on the default one
please provide an example
Raoul J.P. Bonnal
@helios
Jan 17 09:48

This is the NF working directory

s3://mybucket/projects/nf_work/e1/8b1ea3d68c33d93bf4b7ae787e13e8/sampleNameX

this is the content of the inner directory sampleNameX above

[screenshot: NXF_S3.PNG]

but I do not want to publish the sub-directory

SC_RNA_COUNTER_CS

and in command.run

...
# copies output files to target
if [[ ${ret:=0} == 0 ]]; then
  uploads=()
  uploads+=("nxf_s3_upload '*' s3://mybucket/nf_work/b8/d7124331d4574dbf2115e5c547dd01")
  uploads+=("nxf_s3_upload 'sampleNameX/outs/metrics_summary.csv' s3://mybucket/nf_work/b8/d7124331d4574dbf2115e5c547dd01")
  nxf_parallel "${uploads[@]}"
fi
...

2 questions:
1) how can I publish the whole sampleNameX directory and its sub-directories, except SC_RNA_COUNTER_CS?
2) does publishDir run on the nextflow client and transfer the data locally before sending it back to the target storage? I see a lot of traffic on my local server (where nextflow is running). In my case publishDir only has to move data between different directories of the same bucket

Paolo Di Tommaso
@pditommaso
Jan 17 09:50
not clear
Raoul J.P. Bonnal
@helios
Jan 17 09:58
yeah I was fighting with gitter interface
Paolo Di Tommaso
@pditommaso
Jan 17 09:58
lol
Raoul J.P. Bonnal
@helios
Jan 17 10:00
this is a 10xgenomics single cell analysis (cellranger software)
Paolo Di Tommaso
@pditommaso
Jan 17 10:01
2) publishDir logic is executed by the main NF app, not by the Batch job
(actually this could be a potential improvement)
1) ...
not sure it's possible to have a negative pattern
therefore you can do something like
publishDir 's3://something', saveAs: { name -> name!='SC_RNA_COUNTER_CS' ? name : null  }
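The closure above publishes a file when it returns its name and skips it when it returns null. A minimal sketch of that decision in plain Java follows, for readers who want to test the rule outside a pipeline. The class and method names are hypothetical, and it assumes (not verified against Nextflow) that the name may arrive as a `/`-separated relative path, so it checks every path component rather than the whole string.

```java
import java.util.Arrays;

public class SaveAsSketch {
    // Hypothetical stand-in for a saveAs rule: return the name to publish the file,
    // or null to skip it when any path component equals the excluded directory.
    // Assumption: names are '/'-separated relative paths.
    static String saveAs(String name, String excludedDir) {
        boolean excluded = Arrays.asList(name.split("/")).contains(excludedDir);
        return excluded ? null : name;
    }

    public static void main(String[] args) {
        String skip = "SC_RNA_COUNTER_CS";
        // Published: no component matches the excluded directory.
        System.out.println(saveAs("sampleNameX/outs/metrics_summary.csv", skip));
        // Skipped: the excluded directory appears as a path component.
        System.out.println(saveAs("sampleNameX/SC_RNA_COUNTER_CS/run.log", skip));
    }
}
```

Comparing components instead of the full string also covers the case where the excluded directory is nested below the declared output.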
Tobias Neumann
@t-neumann
Jan 17 10:13
@pditommaso Is it possible to list the SSH connection string for clusters previously created and now listed in nextflow cloud list?
Paolo Di Tommaso
@pditommaso
Jan 17 10:13
don't think so
Tobias Neumann
@t-neumann
Jan 17 10:32
ok so one has to read them from the AWS dashboard then
Paolo Di Tommaso
@pditommaso
Jan 17 10:37
it could be an improvement maybe
Raoul J.P. Bonnal
@helios
Jan 17 10:38
I will try. When S3FileHelper has no matcher it falls back to getDefaultPathMatcher, but the syntaxAndPattern has been predefined as glob; I do not know if it is possible to configure the pattern as a regexp
publishDir 's3://something', pattern: 'regexp:.*(?!SC_RNA_COUNTER_CS).*$' // regexp not validated
Paolo Di Tommaso
@pditommaso
Jan 17 10:39
nope only glob is supported
Raoul J.P. Bonnal
@helios
Jan 17 10:41
ok. saveAs should be enough. Could you point me to the publishDir logic, specifically where it handles S3?
Paolo Di Tommaso
@pditommaso
Jan 17 10:41
there's nothing special for S3
the one I've posted should work
Raoul J.P. Bonnal
@helios
Jan 17 10:46

ok but I am talking about

2) publishDir logic is executed by the main NF app, not by the Batch job
(actually this could be a potential improvement)

how does publishDir perform the copy internally, and where should I start if I want to contribute and improve it?

that is invoked here
in case the publishDir does not specify any saveAs, it should be possible to offload the copy to the task script
Raoul J.P. Bonnal
@helios
Jan 17 11:04
ok great
Paolo Di Tommaso
@pditommaso
Jan 17 11:05
the strategy should be:
1) determine if the publishDir is eligible to be managed at task level
2) if yes, add the relevant information here
3) generate the appropriate copy command, along the same line of outputFiles
4) otherwise fallback on the default behavior
Paolo Di Tommaso
@pditommaso
Jan 17 11:13
interestingly the task already has the list of output file names, therefore it's just a matter of applying the same logic to a different target dir (and pattern) given by the publishDir
Raoul J.P. Bonnal
@helios
Jan 17 13:30
@pditommaso I need to learn groovy better and it will take me a bit of time to digest the design, but I am confident
it is a good exercise for the brain
Paolo Di Tommaso
@pditommaso
Jan 17 13:31
do you know java ?
Raoul J.P. Bonnal
@helios
Jan 17 13:31
yes, though I have not used it for a long time
but I should be fine
Paolo Di Tommaso
@pditommaso
Jan 17 13:31
so you already know groovy ;)
Raoul J.P. Bonnal
@helios
Jan 17 13:32
:)
right now I need to crack TCR
next NF
Paolo Di Tommaso
@pditommaso
Jan 17 13:33
what's TCR ?
Alexander Peltzer
@apeltzer
Jan 17 13:36
to my knowledge that means t-cell receptor, but I guess he's speaking about this one? https://medium.com/@tdeniffel/tcr-test-commit-revert-a-test-alternative-to-tdd-6e6b03c22bec
Raoul J.P. Bonnal
@helios
Jan 17 13:36
yep, using 10xgenomics and mixing it with their transcriptomics
Paolo Di Tommaso
@pditommaso
Jan 17 13:36
cool
Paolo Di Tommaso
@pditommaso
Jan 17 14:27
opened an issue to keep track of it nextflow-io/nextflow#1002
micans
@micans
Jan 17 15:01

With -resume I get

WARN: Killing pending tasks (2)
ERROR ~ index is out of range 0..-1 (index = 0)

 -- Check script 'main.nf' at line: 552 or see '.nextflow.log' file for more details
ERROR ~ Unexpected error [ClosedByInterruptException]

 -- Check script 'main.nf' at line: 409 or see '.nextflow.log' file for more details

where that line points to within a function

def star_filter(logs) {
    def percent_aligned = 0
    logs.eachLine { line ->
        if ((matcher = line =~ /Uniquely mapped reads %\s*\|\s*([\d\.]+)%/)) {
            percent_aligned = matcher[0][1]
        }
    }

that is used to filter out tuples in a channel based on the log files in that tuple. I'm trying to find the piece of the puzzle that moves. What I don't get is that no process is submitted at all ... trace.txt ends with

314 71/5963ab 5262119 featureCounts (GSM1901310)  CACHED  0 2019-01-16 19:04:04.997 15m 18s 13m 44s 204.9%  622.7 MB  788.6 MB  17.7 GB 5.3 GB
330 24/b93605 5262220 featureCounts (GSM1901316)  CACHED  0 2019-01-16 19:12:53.161 5m 55s  5m 47s  200.1%  824 MB  1 GB  20.3 GB 7.4 GB
78  18/017dad 5292776 get_fastq_files (GSM1901343)  ABORTED - 2019-01-17 14:05:56.720 - - - - - - -
228 79/a83dbd 5292777 workflow_manifest ABORTED - 2019-01-17 14:05:57.516 - - - - - - -

I don't quite understand why that function is being run. This is pretty vague ... just hoping for a nudge in the right direction, or some way to test. I've tried moving aside very specific work directories for the inputs that need to be rerun e.g.

Paolo Di Tommaso
@pditommaso
Jan 17 15:08
I guess the if is evaluated even if the regex does not match
I think you need if( matcher.matches() ) { .. }
micans
@micans
Jan 17 15:11
ooo. blimey. So.
matcher = line =~ /Uniquely mapped reads %\s*\|\s*([\d\.]+)%/
if( matcher.matches() ) { ... }
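Groovy's `=~` operator produces a `java.util.regex.Matcher`, so the behaviour can be checked in plain Java. One point worth testing before adopting the change: `matches()` succeeds only when the whole line matches the pattern, while `find()` looks for the pattern anywhere in the line. The sketch below uses the pattern from the chat; the class name, the helper and the sample log lines (with leading whitespace) are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarLogMatch {
    // Same pattern as in the chat: capture the percentage after "Uniquely mapped reads % |".
    static final Pattern UNIQ =
        Pattern.compile("Uniquely mapped reads %\\s*\\|\\s*([\\d\\.]+)%");

    // Hypothetical helper: returns the captured percentage,
    // or null when the line does not contain the pattern.
    static String percentAligned(String line) {
        Matcher m = UNIQ.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample lines; real log lines may differ.
        String hit = "      Uniquely mapped reads % |\t89.31%";
        String miss = "      Number of input reads |\t1000";

        // matches() needs the entire line to match, so leading spaces make it fail.
        System.out.println(UNIQ.matcher(hit).matches()); // false
        // find() scans for the pattern anywhere in the line.
        System.out.println(percentAligned(hit));         // 89.31
        System.out.println(percentAligned(miss));        // null
    }
}
```

Guarding the group access behind a successful `find()` means a non-matching line never reaches the indexing step that would otherwise fail.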
Paolo Di Tommaso
@pditommaso
Jan 17 15:12
likely, try it !
nextflow console
micans
@micans
Jan 17 15:13
The strange thing is that resume generally works fine. In specific cases (I don't see the pattern yet) I've seen this come up. In the instance now, just one of the inputs changed. (but in the failed run before, it is true that the failing process would have created the log file that is being tested). Thanks! Will test!