Thanks for your answer! Sorry to bother you again, but now I'm having a problem running the pipeline.
This time I'm using my own dataset and it worked well, except for three chromosomes (1, 9 and 16) that didn't complete the analysis correctly.
The command that fails is:
multcore_ihh command = Rscript /sfw/selectionTools/selectionTools/corescripts/multicore_iHH.R -p AYM -i AYM_genetic_dist.haps -c 9 --window 5000000 --overlap 2000000 --maf 0.0 --big_gap 200000 --small_gap 20000 --small_gap_penalty 20000 --haplo_hh --physical_map_haps AYM_genetic_dist.pos --cores 40 --working_dir . --offset 1 --ihs
When I run it, it returns this message:
*tmp*, , 2, value = c(45440, 109810, 175050, :
replacement has 14060 rows, data has 13391
Calls: [<- -> [<-.data.frame
I tried changing the window size to 8 Mb and it worked, but I'm not sure whether that's the right way to solve it or whether there is a better way.
I would really appreciate your advice on this matter. Thanks a lot!
I have had this problem with my own data, and I changed the way multicore_iHH.R rejoins the individual iHH files after the calculation step.
The new method has been pushed to the selectionTools1.1 branch. It requires running the updated haps_interpolate step before the updated multicore_iHH.R.
Changing the window size is valid. Ideally a larger window is used where possible, so that the EHH decay for the SNPs isn't prematurely stopped at the window boundary.
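To make the boundary effect concrete, here is a minimal Python sketch of how overlapping windows might be laid out. It assumes each window starts `window - overlap` bases after the previous one. That layout, the helper names `window_bounds` and `best_margin`, and the 12 Mb region length are illustrative assumptions, not the logic in multicore_iHH.R.

```python
def window_bounds(region_length, window, overlap):
    """Return (start, end) pairs of overlapping windows covering 1..region_length.

    Assumption: consecutive windows are offset by (window - overlap) bases.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    bounds = []
    start = 1
    while start <= region_length:
        end = min(start + window - 1, region_length)
        bounds.append((start, end))
        if end == region_length:
            break
        start += step
    return bounds


def best_margin(position, bounds):
    """Largest distance from `position` to the nearer edge of any window containing it."""
    margins = [min(position - start, end - position)
               for start, end in bounds
               if start <= position <= end]
    if not margins:
        raise ValueError("position is not covered by any window")
    return max(margins)


# Hypothetical 12 Mb region, comparing the 5 Mb and 8 Mb window sizes from the thread.
small_windows = window_bounds(12_000_000, 5_000_000, 2_000_000)
large_windows = window_bounds(12_000_000, 8_000_000, 2_000_000)

snp = 4_000_000
print(len(small_windows), best_margin(snp, small_windows))
print(len(large_windows), best_margin(snp, large_windows))
```

Under these assumptions, the larger window leaves the SNP further from the nearest window edge, which is the room EHH needs to decay before it is cut off.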
[.data.frame(map_positions, , 2) : undefined columns selected
*tmp*, , 3, value = 16) :
6.89177428-89377427 runs fine, but 19.20959639-21159638 fails, as does X.109745664-109945663. The test lactase dataset works well. selection_stderr.tmp is as follows:
Traceback (most recent call last):
  File "/usr/local/bin/selection_pipeline", line 9, in <module>
    load_entry_point('selectionTools==1.0', 'console_scripts', 'selection_pipeline')()
  File "/usr/local/lib/python2.7/dist-packages/selectionTools-1.0-py2.7.egg/selection_pipeline/selection_pipeline.py", line 240, in main
  File "/usr/local/lib/python2.7/dist-packages/selectionTools-1.0-py2.7.egg/selection_pipeline/standard_run.py", line 248, in run_pipeline
    fayandwus = self.variscan_fayandwus(haps2_haps)
  File "/usr/local/lib/python2.7/dist-packages/selectionTools-1.0-py2.7.egg/selection_pipeline/standard_run.py", line 536, in variscan_fayandwus
    start_position = int(start_pos)
ValueError: invalid literal for int() with base 10: 'pos\n'
It seems to be failing in variscan now, though earlier runs made it through to iHH and failed there. I'm initiating this with the CEU and YRI population files provided with the lactase test data, and using data downloaded directly from the 1000g browser:
click Get VCF data, confirm the coordinates in the text entry box, click next, then right click to save and extract.
Snippets from my actual data file:
(~250 lines snipped. The following lines are intentionally truncated to fit in the chat.)
19 20862396 esv3643896;esv3643897 A <CN0>,<CN2> 100 PASS AC=2,2;AF=0.000399361,0.000399361;AN=5008;CS=DUP_gs;END=20974015;NS=2504;SVTYPE=CNV;DP=13403;EAS_AF=0,0;AMR_AF=0,0.0029;AFR_AF=0,0;EUR_AF=0,0;SAS_AF=0.002,0;VT=SV GT 0|0 0|0
19 20959646 rs73543372 C T 100 PASS AC=61;AF=0.0121805;AN=5008;NS=2504;DP=12049;EAS_AF=0;AMR_AF=0.0029;AFR_AF=0.0446;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP GT 0|0 0|0
19 20959677 rs541375182 T G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=14387;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP GT 0|0 0|0 0|0
19 20959686 rs554447352 C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=14414;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP GT 0|0 0|0 0|0
Thanks very much!
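For anyone hitting the same traceback: the `ValueError` is what Python raises when `int()` is handed a header token instead of a number, here the string 'pos\n'. Below is a minimal sketch that reproduces the message and shows one way to find the first numeric position in a file that starts with a header line. `first_position` is a hypothetical helper, and the one-position-per-line layout is an assumption for illustration; it is not the code in standard_run.py.

```python
import io


def first_position(lines):
    """Return the first line that parses as an integer position, skipping the rest.

    Hypothetical helper: assumes one position per line, possibly preceded by a header.
    """
    for line in lines:
        token = line.strip()
        if token.isdigit():
            return int(token)
    raise ValueError("no numeric position found")


# Reproduce the failure from the traceback: a header token passed to int().
try:
    int("pos\n")
except ValueError as err:
    message = str(err)
    print(message)

# The same data read with the header skipped (positions taken from the VCF snippet above).
sample = io.StringIO("pos\n20959646\n20959677\n")
start_position = first_position(sample)
print(start_position)
```

If the intermediate file passed to the variscan step begins with a `pos` header like this, that would explain the error, though which file that is and why it has a header is not something the traceback alone shows.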