These are chat archives for nextflow-io/nextflow

17th
May 2019
Alaa Badredine
@AlaaBadredine_twitter
May 17 08:02
in which case does the -resume flag become unusable ?
Alexander Peltzer
@apeltzer
May 17 08:18
I have a metadata sheet that sometimes contains incomplete information - e.g. R1,R2, FastQ, a Path and a value
My idea was that I just pass these as a set .... to processes (which works obviously) but what should I do if something is missing there?
Evan Floden
@evanfloden
May 17 08:23
@apeltzer If I remember correctly, there were some examples of passing metadata as a map object. Are you doing this already? If someone else doesn't chip in, I can try to dig it up.
Paolo Di Tommaso
@pditommaso
May 17 09:04
@apeltzer there's a good example here and here
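(The links above are lost in this archive, but the pattern being referenced is carrying a metadata map as the first element of each channel tuple. A minimal sketch in Nextflow DSL, with hypothetical file names and columns, where missing fields get a placeholder instead of breaking the set:)

```groovy
// Sketch only: sample sheet columns ('sample', 'r1', 'r2', 'value') are made up.
Channel
    .fromPath('samples.tsv')
    .splitCsv(header: true, sep: '\t')
    .map { row ->
        // Fill absent fields with 'NA' so downstream processes always
        // receive a complete metadata map.
        def meta = [id: row.sample ?: 'NA', value: row.value ?: 'NA']
        tuple(meta, file(row.r1), file(row.r2))
    }
    .set { reads_ch }
```

Processes can then pick what they need out of `meta` and branch on the placeholder values.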
Alexander Peltzer
@apeltzer
May 17 10:44
Nice, thanks guys!
Paolo Di Tommaso
@pditommaso
May 17 11:37
:+1:
Laurent Modolo
@l-modolo
May 17 11:37

Hi guys, I am trying to run nf on a shared environment with conda. I was planning to use a shared miniconda install readable by everyone to avoid multiple copies of the envs
here is a process configuration example

      withName: index_fasta {
        beforeScript = "source /sps/lbmc/common/miniconda3/init.sh"
        conda = "/sps/lbmc/common/miniconda3/envs/bowtie2_2.3.4.1"
        scratch = true
        stageInMode = "copy"
        stageOutMode = "rsync"
        executor = "sge"
        clusterOptions = "-P P_lbmc -l os=cl7 -l sps=1 -r n -o ~/logs/ -e ~/logs/"
        cpus = 1
        queue = 'huge'
      }

and here is the error I get:

ERROR ~ Error executing process > 'index_fasta (Gallus_gallus.GRCg6a.cdna.ercc.all)'

Caused by:
 java.io.FileNotFoundException: /sps/lbmc/common/miniconda3/envs/.bowtie2_2.3.4.1.lock (Permission denied)

I am not sure why nextflow is trying to create a .lock file in the env nor how to prevent it from doing that (it should not have to write anything in the env directory)

Paolo Di Tommaso
@pditommaso
May 17 11:39
what is the full error stack trace ?
Laurent Modolo
@l-modolo
May 17 11:44
let me ask my intern, who has the error and doesn't have write permissions in the /sps/lbmc/common/miniconda3/envs/ directory
Laurent Modolo
@l-modolo
May 17 12:04
N E X T F L O W  ~  version 19.04.0
Launching `/sps/lbmc/rseraphi/mars-seq/src/Mars_seq_V23.nf` [distracted_blackwell] - revision: b460a7bd7c
fastq files : /sps/lbmc/rseraphi/mars-seq/Sequences/sample_1000000_EV116_S1_CF03_S2_R{1,2}.fastq.gz
tags files : /sps/lbmc/rseraphi/mars-seq/Sequences/tags2.fa
transcriptome files : /sps/lbmc/rseraphi/mars-seq/Sequences/Gallus_gallus.GRCg6a.cdna.ercc.all.fa
gtf file : /sps/lbmc/rseraphi/mars-seq/Sequences/Combined_Gallus_Gallus_ERCC.gtf
results : /sps/lbmc/rseraphi/results
val : 5
index :
whitelist :
ERROR ~ Error executing process > 'sample_control_qual (1)'

Caused by:
 java.io.FileNotFoundException: /sps/lbmc/common/miniconda3/envs/.multiqc_1.7.lock (Permission denied)


-- Check '.nextflow.log' file for details
Paolo Di Tommaso
@pditommaso
May 17 12:04
Check '.nextflow.log' file for details
^^^^^^^^^^^^^^^^^^^^^^
Laurent Modolo
@l-modolo
May 17 12:04
Yes, I was copying the file :p
Paolo Di Tommaso
@pditommaso
May 17 12:07
umm .. it should not try to create that lock file, please report an issue
Laurent Modolo
@l-modolo
May 17 12:10

Ok I will do that. Could it be a problem with my configuration ? are the lines

beforeScript = "source /sps/lbmc/common/miniconda3/init.sh"  # to load conda
conda = "/sps/lbmc/common/miniconda3/envs/bowtie2_2.3.4.1"  # to specify where the env is installed

correct ?

Paolo Di Tommaso
@pditommaso
May 17 12:11
it could make sense that it's a read-only dir; that lock file is useless
and in your case it's stopping the execution
Laurent Modolo
@l-modolo
May 17 12:12
ok
evanbiederstedt
@evanbiederstedt
May 17 15:20

Apologies for the delays

I think you should also have the info on failed tasks in HTML report as well as in the trace
@rsuchecki This is a good point, but the HTML report alone would require some parsing to automatically debug the samples that failed. I could be missing something, however.

Thanks for the help @rsuchecki @lebernstein
I think something like this would help for debugging.

evanbiederstedt
@evanbiederstedt
May 17 15:26

@lebernstein @evanbiederstedt @rsuchecki This issue is very relevant: nextflow-io/nextflow#903
CC @micans @stevekm @rsuchecki

This does look very relevant. I'm trying to figure out where this discussion currently stands

I think NF should provide a more declarative approach, i.e. the process should have a meta directive that allows you to declare the metadata attributes you want to track. Then the system can collect all this info and save it automatically to a file.

Yes, something like this from @pditommaso would be extraordinarily useful, especially if one could read in the file and access paths. Otherwise, you're parsing post-processing reports for debugging.

The issue for running NF with many samples is either the case of unforeseen process failures (as @micans mentions) or a certain percentage of samples failing for various QC-related reasons.

I'll respond at the github issue: nextflow-io/nextflow#903

Perhaps we can organize a PR plan

Stephen Kelly
@stevekm
May 17 15:30

I have a metadata sheet that sometimes contains incomplete information - e.g. R1,R2, FastQ, a Path and a value
My idea was that I just pass these as a set .... to processes (which works obviously) but what should I do if something is missing there?

@apeltzer another option that I typically use is to pack this kind of metadata into either a .tsv file or a .json file, and leave 'NA' values in there (.tsv), or 'null' (.json). Then, my scripts that parse the file to do something with the information have their own handling for missing data. In these kinds of cases I find it convenient to use either Python or R, since in Python you can load .tsv with csv.DictReader, which gives you a dict (map) that you can work with easily, or in R you can read into a dataframe with na.strings set to auto-fill with NA.
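(The csv.DictReader approach described above can be sketched roughly like this; the sample sheet contents and column names are made up for illustration:)

```python
import csv
import io

# Hypothetical sample sheet; S2 has a missing value recorded as 'NA'.
tsv_text = (
    "sample\tr1\tr2\tvalue\n"
    "S1\ts1_R1.fq.gz\ts1_R2.fq.gz\t5\n"
    "S2\ts2_R1.fq.gz\ts2_R2.fq.gz\tNA\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

def get(row, key):
    """Treat 'NA' or empty fields as missing (returns None)."""
    v = row.get(key, "")
    return None if v in ("", "NA") else v

values = [get(r, "value") for r in rows]
print(values)  # ['5', None]
```

The missing-data handling then lives in one place (`get`) instead of being scattered through the pipeline.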

@l-modolo is someone else accessing that location that could be creating a lock file?
might have to chmod -R a+rx the install location?
also what version of conda are you using? I never ever do this:
beforeScript = "source /sps/lbmc/common/miniconda3/init.sh"  # to load conda
because on our system the environment is wacky and sourcing the conda loader sometimes breaks for certain versions of conda; instead I just do export PATH=/path/to/conda/bin:${PATH}, either within the beforeScript or in the wrapper script that I execute the Nextflow command from
Stephen Kelly
@stevekm
May 17 15:36
it might make your life easier, instead of using the envs, to just have a dedicated conda install for each software stack you are using, maybe
Laurent Modolo
@l-modolo
May 17 15:45
@stevekm no, when I activate the environment from multiple users with conda everything is fine and no lock files are created
@stevekm It's just a small script to activate conda; it's easier for our lab to share Nextflow .config files than to have everyone maintain correct .bashrc or .profile files
Laurence E. Bernstein
@lebernstein
May 17 17:50

@evanbiederstedt @stevekm @rsuchecki For my purposes, the html report would probably be difficult to use since it would require a lot of parsing. My Nextflow workflow is part of a larger process and my goal was to provide a simple file for post-processing that could be used by follow on processing to tell which samples were processed successfully and which were not so that appropriate actions could be taken (possibly automatically).
In the end my workflow generates a JSON file that looks like:

{
  "Status" : "Failure",
  "Sample_1" : "Successful",
  "Sample_2" : "Successful",
  "Sample_3" : "Failed"
}

The report COULD then be used to track down the problem, but I also have my status files (1 per sample) that are plain text and can be read to see everything you need to know (hopefully) about the issue.
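(A status file like that is trivial to consume downstream; here is a sketch whose keys mirror the example above:)

```python
import json

# Parse a per-run status report (structure matches the JSON shown above).
report = json.loads("""{
  "Status": "Failure",
  "Sample_1": "Successful",
  "Sample_2": "Successful",
  "Sample_3": "Failed"
}""")

# Collect every per-sample entry that failed, skipping the
# overall "Status" field.
failed = [k for k, v in report.items() if k != "Status" and v == "Failed"]
print(failed)  # ['Sample_3']
```

A follow-on process can then act on the failed list automatically instead of scraping an HTML report.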

evanbiederstedt
@evanbiederstedt
May 17 19:40

Hey @stevekm

@evanbiederstedt do not use retries for this, you should use filtering in your Channels to try and figure out which data might be bad and prevent them from making it to processes where they will fail. I use this strategy a lot on my exome pipeline;

Apologies for the delay, and thanks for this! This would be useful QC filtering.

evanbiederstedt
@evanbiederstedt
May 17 21:15

Hey @lebernstein

In the end my workflow generates a JSON file that looks like:

Do you have an example on github somewhere?

Laurence E. Bernstein
@lebernstein
May 17 21:18
@evanbiederstedt Well... no.. I work for Quest Diagnostics and I'm not sure how much of this could be considered proprietary, but I may be able to generate a stripped/simplified version if I get a free moment. Which part(s) are you interested in?
Stephen Kelly
@stevekm
May 17 21:21

@lebernstein sorry, cannot remember, did I share this spot in my pipeline where I determine the failed samples and then log them? It's here: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/50f937e253eebd79d26222445cdc548f2859177b/main.nf#L2620

the log entries end up getting collected in a .tsv file which I can then parse later;

https://user-images.githubusercontent.com/10505524/57956755-1c979800-78c8-11e9-986a-cfa83ea5eddb.png

for example, I include it in my custom HTML report at the completion of the pipeline:

https://user-images.githubusercontent.com/10505524/57956800-36d17600-78c8-11e9-85d3-e94d6a7adbee.png

something like that could be fed to other programs as well, and you could make it JSON if you wanted to; Groovy has native JSON libraries, or just hand the text file off to a custom Python script or such and convert it there
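(The text-file-to-JSON handoff mentioned here could look like this in Python; the .tsv contents and column names are invented for the sketch:)

```python
import csv
import io
import json

# Hypothetical .tsv of logged per-sample statuses, as a pipeline might emit.
log_text = (
    "sample\tstatus\n"
    "Sample_1\tSuccessful\n"
    "Sample_2\tFailed\n"
)

# Convert the flat log into a {sample: status} mapping and emit JSON.
rows = csv.DictReader(io.StringIO(log_text), delimiter="\t")
status = {r["sample"]: r["status"] for r in rows}
print(json.dumps(status, indent=2))
```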
evanbiederstedt
@evanbiederstedt
May 17 21:42
@lebernstein I would be really interested in creating a final file to parse, which allows me to check which samples failed. Do you have this code by chance?

@stevekm This is done for the channel filters, correct? That's helpful, but there are probably a priori issues which could cause pipeline failures.

I just have a file with paths and success status in mind to be honest. That would make debugging far easier

Laurence E. Bernstein
@lebernstein
May 17 22:03

@stevekm I think I looked at this a while ago.. but.. it's 4000 lines of code and I am way too noob-ish to even begin to understand it. :) But also, a lot of what you (seem to be) doing there is using filters (as you were mentioning) to try to filter out bad data. I have literally no idea what my issues are going to be, so I don't know what to look for a priori. I am trying to keep my pipeline "simple" so that non-CS peeps can make sense of it and hopefully write their own new pipelines given a simple template. People around here know python so I am using python scripts wherever possible to pull the pieces together. Also.. I don't know Groovy. :)

So.. what I was trying to do was set up a system where anything that happens will not stop good samples from being processed, and will log failures. This is critical since we are not running single jobs, but run primarily in a "production" mode where we want 24/7 uptime. The method must also integrate nicely with the sample tracking that we already have in place at multiple locations in the company.

In order to do that I decided (at least for now) on 1 JSON file in, 1 JSON file out, to limit the amount of interfacing others will have to do with this system.

Laurence E. Bernstein
@lebernstein
May 17 22:31
@evanbiederstedt Check your email. :)
evanbiederstedt
@evanbiederstedt
May 17 23:15
@lebernstein Thanks!