Create a `mydata` directory and then add only the `_R1_001.fastq.gz` and `_R2_001.fastq.gz` files; clean your `mydata` folder so that it has only these files. Also remove the `output_directory` so pema builds a new one.
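For instance, staging the files could look like this (a sketch; `raw_reads/` is a placeholder for wherever your raw data lives):

```shell
# Hypothetical example: keep only the paired-end read files in mydata/
# (raw_reads/ is a placeholder path for your raw data).
mkdir -p mydata
cp raw_reads/*_R1_001.fastq.gz raw_reads/*_R2_001.fastq.gz mydata/
# Remove any previous output so pema builds a fresh one.
rm -rf output_directory
```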
Hi Haris, I have an issue with ITS analysis at the pandaseq step.
Task failed:
Program & line : '/home/pema_latest.bds', line 444
Task Name : ''
Task ID : 'pema_latest.bds.20210107_223115_440/task.pema_latest.line_444.id_924'
Task PID : '15358'
Task hint : '/home/tools/PANDAseq/bin/pandaseq -f filtered_max_ERR0000033_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz -r filtered_max_ERR0000033_2.fastq.gz.2P.fastq.00'
Task resources : 'cpus: 1 mem: -1.0 B timeout: 86400 wall-timeout: 86400'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[]'
Output files : '[]'
Script file : '/home1/naurasd/pema_latest.bds.20210107_223115_440/task.pema_latest.line_444.id_924.sh'
Exit status : '1'
StdErr (10 lines) :
0xa9ce40:2 STAT OVERLAPS 0
0xa9ce40:1 STAT TIME Thu Jan 7 23:46:16 2021
0xa9ce40:1 STAT ELAPSED 0
0xa9ce40:1 STAT READS 0
0xa9ce40:1 STAT NOALGN 0
0xa9ce40:1 STAT LOWQ 0
0xa9ce40:1 STAT BADR 0
0xa9ce40:1 STAT SLOW 0
0xa9ce40:1 STAT OK 0
0xa9ce40:1 STAT OVERLAPS 0
Fatal error: /home/pema_latest.bds, line 446, pos 4. Task/s failed.
pema_latest.bds, line 418 : for ( string correctFile : correct ) {
pema_latest.bds, line 446 : wait
The reads in this sample contain a lot of N bases; I don't know if this causes the problem. Thanks for your help!
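To check whether the N content really is high, a quick sketch that reports the percentage of N bases in a gzipped FASTQ (`reads.fastq.gz` is a placeholder filename):

```shell
# Rough sketch: percentage of N bases across all reads in a gzipped FASTQ
# (reads.fastq.gz is a placeholder filename).
zcat reads.fastq.gz | awk 'NR % 4 == 2 {
    total += length($0)
    n += gsub(/[Nn]/, "")   # gsub returns the number of substitutions
}
END { printf "%.2f%% N bases\n", 100 * n / total }'
```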
Hi @/all !
Here is a short note for the pema v.2.0.2 release!
Midori 2 is now available, including not just Metazoa sequences but all eukaryotic taxa.
In addition, you can now use your custom, in-house database for the taxonomy assignment.
You can find the new version through Docker and SingularityHub.
A PEMA documentation site is now available as well, with some hints and tips.
Hi @hariszaf. I was running PEMA on my COI dataset with the new version, using the new parameters.tsv file it requires, but I got the following error during the taxonomy assignment step:
Fatal error: /home/pema_latest.bds, line 1291, pos 1. Map 'paramsOfTaxonomy' does not have key 'midori_version'.
pema_latest.bds, line 1254 : if ( ( paramsOfTaxonomy{'gene'} == 'gene_COI' || paramsOfTaxonomy{'gene'} == "gene_ITS") && paramsOfTaxonomy{'clusteringAlgoForCOI_ITS'} == 'algo_SWARM' ) {
pema_latest.bds, line 1266 : if ( paramsOfTaxonomy{'gene'} == "gene_COI" ) {
pema_latest.bds, line 1290 : if ( paramsOfTaxonomy{'custom_ref_db'} != 'Yes' ) {
pema_latest.bds, line 1291 : task java -Xmx64g -jar $path/tools/RDPTools/classifier.jar classify -t $path/tools/RDPTools/TRAIN/$paramsOfTaxonomy{'midori_version'}/rRNAClassifier.properties -o tax_assign_swarm_COI_temp.txt rdpClassifierInput.fasta
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 10 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792 -> 6827 -> 6833 -> 6834 -> 6835 -> 6837
Node Id : 6840
bdsNode Id : 6837
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 9 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792 -> 6827 -> 6833 -> 6834 -> 6835
Node Id : 6837
bdsNode Id : 6835
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 8 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792 -> 6827 -> 6833 -> 6834
Node Id : 6835
bdsNode Id : 6834
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 7 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792 -> 6827 -> 6833
Node Id : 6834
bdsNode Id : 6833
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 6 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792 -> 6827
Node Id : 6833
bdsNode Id : 6827
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 5 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785 -> 6792
Node Id : 6827
bdsNode Id : 6792
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 4 / 0, nodes: 1422 -> 6727 -> 6746 -> 6785
Node Id : 6792
bdsNode Id : 6785
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 3 / 0, nodes: 1422 -> 6727 -> 6746
Node Id : 6785
bdsNode Id : 6746
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 2 / 0, nodes: 1422 -> 6727
Node Id : 6746
bdsNode Id : 6727
ProgramCounter.pop(100): Node ID does not match!
PC : PC: size 1 / 0, nodes: 1422
Node Id : 6727
bdsNode Id : 1422
I don't know what the problem is, and I don't know why the error points to
if ( paramsOfTaxonomy{'custom_ref_db'} != 'Yes' )
I did specify No for custom_ref_db in the parameters file; I checked again. Your help is much appreciated.
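In case it helps anyone debugging the same thing, a quick sanity check is to confirm the key from the error message is actually present in the file passed to pema (a sketch; `parameters.tsv` is a placeholder path):

```shell
# Check that the parameters file contains the midori_version key
# named in the error message (parameters.tsv is a placeholder path).
if grep -q '^midori_version' parameters.tsv; then
    echo "midori_version is set"
else
    echo "midori_version is missing -- add it to parameters.tsv"
fi
```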
@hariszaf: General question about the taxonomy format. I've been working for the past few weeks on a series of bash scripts to get everything into the right format for the .fasta file and the taxonomy.tsv. I noticed in your examples here
the first 3 columns are Seq_ID, Domain, Kingdom
however, under Domain you have "Root" and under Kingdom you have "Eukaryota" (I've never seen "Root" as a domain under any system). For the first line example, "Eunice_vittata", I've found online (GenBank) that the Domain is Eukaryota and the Kingdom is Metazoa. Everything else is the same.
My question is: does the software assume a certain format (2-empire, 3-empire, etc.), or is it just important that the system be consistent between the taxonomy and .fasta files?
e.g. for Eunice_vittata, I'd like the columns to be:
Seq_ID, Domain, Kingdom, Phylum, Class, ...
>SEQXX,Eukaryota, Metazoa, Annelida, Polychaeta, ...
instead of
Seq_ID, Domain, Kingdom, Phylum, Class, ...
>SEQXX, Root, Eukaryota, Annelida, Polychaeta, ...
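Whichever rank scheme applies, one consistency check that can be scripted is that every Seq_ID in the taxonomy table also appears in the FASTA headers (a sketch; `ref.fasta` and `taxonomy.tsv` are placeholder filenames, and the header layout follows the comma-separated example above):

```shell
# Compare sequence IDs between the FASTA headers and the taxonomy table
# (ref.fasta and taxonomy.tsv are placeholder filenames).
grep '^>' ref.fasta | sed 's/^>//' | cut -d',' -f1 | sort > fasta_ids.txt
cut -f1 taxonomy.tsv | sort > tax_ids.txt
# Print IDs present in only one of the two files (empty output = consistent):
comm -3 fasta_ids.txt tax_ids.txt
```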
The `Root` part you saw in my example was, I think, due to the initial Midori taxonomies.
In the java part of step 2 here, is the RDPClassifier coming from https://github.com/rdpstaff/RDPTools? I couldn't find it anywhere on the website or on PEMA's GitHub; I only came across that repository through a Google search, so if it is indeed that one (or another), it would be great to include a link.
-- On a related note, their RDPClassifier build seems to be failing, as outlined many times in their issues. If you have an environment (conda or otherwise) that works, providing an environment.yml or a list of supporting packages and versions would be very helpful.
Here to add more to the mayhem: running with the RDPClassifier, I'm getting this error after running singularity run -B analysis_directory/:/mnt/analysis pema_v.2.0.2.sif:
mkdir: cannot create directory 'TRAIN/CLEAN_COI_custom_db': Read-only file system
/home/scripts/trainDbForRDPClassifier.sh: line 15: ready4train_taxonomy.txt: Read-only file system
/home/scripts/trainDbForRDPClassifier.sh: line 17: ready4train_seqs.fasta: Read-only file system
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Exception in thread "main" java.io.FileNotFoundException: ready4train_taxonomy.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at java.io.FileReader.<init>(FileReader.java:58)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:60)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)
cp: cannot create regular file 'TRAIN/CLEAN_COI_custom_db/': Not a directory
Fatal error: /home/pema_latest.bds, line 177, pos 7. Exec failed.
Exit value : 1
Command : bash /home/scripts/trainDbForRDPClassifier.sh /mnt/analysis/custom_ref_db/*.tsv /mnt/analysis/custom_ref_db/*.fasta CLEAN_COI_custom_db
pema_latest.bds, line 163 : if ( paramsFirstAssignment{'custom_ref_db'} == 'Yes' ){
pema_latest.bds, line 169 : if ( paramsFirstAssignment{'gene'} == 'gene_COI' ){
pema_latest.bds, line 177 : sys bash $path/scripts/trainDbForRDPClassifier.sh $taxonomyFile $sequenceFile $paramsFirstAssignment{'name_of_custom_db'}
Sounds like it's saying the environment is read-only and so it can't create the files. Is this a Singularity issue or a PEMA issue? I don't have sudo permission on the system.
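Part of the failure may be inside the image itself, but a quick first step is to confirm the directory being bound is writable for the current user (a sketch; `analysis_directory/` is the path from the command above):

```shell
# Before launching the container, verify that the directory you bind
# with -B is writable for the current user (analysis_directory/ as above).
if touch analysis_directory/.write_test 2>/dev/null; then
    rm analysis_directory/.write_test
    echo "analysis_directory is writable"
else
    echo "analysis_directory is read-only for this user"
fi
```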
Do you have a guide to what the output from PEMA means (specifically from the RDPClassifier)? I couldn't find anything on GitHub. Here's a list of the output files that get created. What's important to focus on?
1.quality_control/ 7.gene_dependent/
2.trimmomatic_output/ 8.output_per_sample/
3.correct_by_BayesHammer/ all_samples.fasta*
4.merged_by_PANDAseq/ checkpoints_for_COI_RDPClassifier_test_10feb/
5.dereplicate_by_obiuniq/ final_all_samples.fasta*
6.linearized_files/ parameters0f.COI_RDPClassifier_test_10feb.tsv*
One critical file I'm expecting to find is a "percent match", i.e. something that returns all the sequences that have a match to entries in my personal database (custom_ref_db) and tells me, for example, that there's a 98% match with SEQ7. I found a file that has a number associated with each taxonomy.
The most promising seems to be the 7.gene_dependent/gene_COI/SWARM subdirectory. There's a "tax_assign_SWARM_COI.txt" in there, but these numbers seem to deal with something else.
Check the `final_table.tsv` file that you will find in the path you mentioned.
@Hariszaf
Hmm, then is it possible there's an error with PEMA or with my data format? (Normally I would keep this in a thread, but threads on gitter don't seem to allow adding photos.) It's still hard to see that the first entry is the same as the second; it is classified under OTU and seems to have been arbitrarily cut off. The remaining ones fall under the second column (ERR1), as per the single tab.
No numbers that I can see, either.
@hariszaf, do you still have the output folders from running my data? I'm asking because I'm wondering whether there's a gap in my understanding of the parameters.tsv or a bug in PEMA. For example, there's a threshold of 0.6, which makes me think that matches below 0.6 will be discarded. But when I look under the directory 7.gene_dependent/gene_COI/SWARM, there's a file called tax_assign_swarm_COI.txt which has numbers associated with each phylum, class, etc., and some of those (down at the species level) are pretty low, like 0.37 and so on.
Am I misunderstanding this cutoff number?
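If it turns out the tool reports rather than discards low scores, the table can be filtered by hand. A sketch, assuming (unverified) that the confidence score is the last tab-separated field of each line, as in RDP-style output:

```shell
# Keep only assignments whose last field (assumed here to be the
# confidence score, as in RDP-style output) meets the 0.6 cutoff.
awk -F'\t' '$NF + 0 >= 0.6' tax_assign_swarm_COI.txt
```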
I know you have a lot to deal with from me; hopefully this will just be a quick one. I just ran something overnight and it timed out (as discussed above, I'm running this on a virtual machine, so I don't have the advantage of a speedy and powerful HPC). But I see from the printout that a checkpoint file has been created. After the timed-out error, this gets printed:
Creating checkpoint file '/home/pema_latest.bds.line_889.chp'
INFO: Cleaning up image...
I don't want to lose the 20+ hours of work so far. Can I simply rerun the command
(singularity run --fakeroot --writable -B ./analysis_directory:/mnt/analysis pema_fixperms)
and it will continue where it left off? I'm uncertain whether this is something you built into pema or part of one of the tools you used. It's of course in the clustering section.
You can edit your `parameters.tsv` file and use the corresponding `.chp` to run pema from the ASV inference step and downstream.
Put them in your `my_analysis` directory and mount those along with everything else. Once you do that, you need to move them into the `/home/tools/CREST/LCAClassifier/parts/flatdb/unite` directory; then go back to the `/home` directory and run `./pema_latest.bds`.