These are chat archives for dereneaton/ipyrad

18th May 2017
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 14:28
Hello, I'm beginning to process RADseq data that has already been demultiplexed and has the cutsites and adapters removed. I recall from pyrad that you could put @ at the beginning of the path to your sorted *.fq.gz files to indicate that the cut sites have already been removed. Is there a way to indicate this in ipyrad? When running step 2, I get the following error message:
2017-05-17 17:19:33,601 pid=17645 [rawedit.py] ERROR error in run_cutadapt(): ImportError(No module named indexes.base)
That error is listed for each sample in my data set
and is then followed by the error message:
2017-05-17 17:19:33,835 pid=17645 [assembly.py] ERROR No Samples ready to be clustered. First run step 2.
Thank you for any advice!
Deren Eaton
@dereneaton
May 18 2017 15:15
Hi @LilyRobertLewis_twitter, the input methods have changed a bit for ipyrad. No need for the @ symbol any more; ipyrad will recognize your data just fine whether or not there is a cutsite overhang at the beginning. The big difference now is that you must run steps 1 and 2, even if your data are already demultiplexed and filtered. Step 1 will simply read in the fastq files, and step 2 will perform filtering on them, or not, depending on your param settings. This ensures that the data are properly formatted for step 3, when the reads are either clustered or mapped. From the error message above it looks like you were skipping step 1.
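For example, a minimal sketch of that workflow (the assembly name "data" here is just a placeholder):

ipyrad -n data                     ## writes a new params file: params-data.txt
## edit params-data.txt so the sorted_fastq_path parameter points to your sorted *.fq.gz files
ipyrad -p params-data.txt -s 12    ## step 1 loads the fastqs, step 2 filters them (or not, per your params)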
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 15:22
Thank you so much, Deren. Let me go back and check this. Before the error was produced, ipyrad produced .edits/s2_rawedit_stats.txt, as well as a .json and a _s1_demultiplex_stats.txt in my working directory. My _s1_demultiplex_stats.txt file has a list of all my samples and the number of raw reads for each.
in my .edits/s2_rawedit_stats.txt file there is a list of my samples, which looks like: Empty DataFrame
Columns: []
Index: [aacu-21-14-tk, aacu-24-14-tk, aacu-25-14-tk, apal-1-6-fb, apal-11-10-fb, apal-12-7-fb, apal-13-7-fb, apal-14-10-fb, apal-15-12-tk, apal-16-12-tk, apal-17-6-fb, apal-18-6-fb, apal-19-10-fb, apal-2-13-tk, apal-20-8-fb, apal-22-12-tk, apal-26-3-fb, apal-3-8-fb, apal-4-8-fb, apal-44-10-fb, apal-5-5-fb, apal-7-5-fb, apal-8-10-fb, apal-9-5-fb, atur-10-10-fb, atur-23-12-tk, atur-27-11-tk, atur-28-14-tk, atur-29-14-tk, atur-30-5-fb, atur-31-8-fb, atur-32-13-tk, atur-33-12-tk,
the list ends with ...]
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 16:00
Hi Deren, okay, so I double-checked, and it appears that step 1 ran properly for me, but I'm still having problems with step 2. Here is the output in my log file:

ipyrad [v.0.6.20]

Interactive assembly and analysis of RAD-seq data

loading Assembly: aul
from saved path: /ufrc/mcdaniel/lilyrlewis/aul.json
host compute node: [32 cores] on c21a-s2.ufhpc

Step 1: Loading sorted fastq data to Samples
Skipping: 189 Samples already found in Assembly aul.
(can overwrite with force argument)

Step 2: Filtering reads
[####################] 100% processing reads | 0:00:21
found an error in step2; see ipyrad_log.txt

oops I mean in my output file
and in the log file I receive this message: 2017-05-18 11:52:41,228 pid=9887 [rawedit.py] ERROR error in run_cutadapt(): ImportError(No module named indexes.base)
2017-05-18 11:52:41,387 pid=9887 [assembly.py] ERROR No Samples ready to be clustered. First run step 2.
Deren Eaton
@dereneaton
May 18 2017 16:21
Hi @LilyRobertLewis_twitter, did you install ipyrad with conda, as described in the installation instructions? It seems to be missing a dependency.
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 16:23
Hmmm, okay, that makes sense. I just requested that my institution's cluster administrator install ipyrad. Perhaps they missed a dependency. I'll contact them now with my error messages. Is it obvious to you which dependency I'm missing?
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 16:30
Thank you for your help, Deren. I submitted a request to my cluster administrator asking them to check the dependencies.
Deren Eaton
@dereneaton
May 18 2017 16:31
I thought that might be the problem. It's a little complicated to have your cluster administrator install ipyrad, since you will need to load the system software slightly differently depending on how they installed it. Hopefully they used conda, in which case you would probably just need to do something like module load conda. That's why we instead recommend a local installation of ipyrad, meaning that you install the software yourself into your home directory. The ipyrad installation instructions for this are quite simple: you just need to copy and paste about two lines of code. You do not need administrator privileges to do this, since conda installs software into a local folder. Conda is designed for exactly this, allowing users on a large system to maintain control of their own software and to update it whenever they want. See here: http://ipyrad.readthedocs.io/installation.html#linux-install-instructions-for-conda
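The local install amounts to something like this (a sketch; check the linked page for the current installer URL and conda channel):

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh    ## installs conda into your home directory; no admin rights needed
conda install -c ipyrad ipyrad            ## then pull in ipyrad and all of its dependencies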
But in case you do want to stick with the system-wide installation, the specific module that does not seem to be properly installed is called cutadapt. The fact that it is missing suggests that your administrator installed ipyrad through a method other than conda, which will likely lead to other problems later. So we strongly recommend the simple conda install.
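If you do want to debug the system install first, you could check for cutadapt directly with something like this (a sketch):

python -c "import cutadapt; print(cutadapt.__version__)"    ## raises ImportError if the module is missing
conda list | grep cutadapt                                  ## shows it only if it was installed with conda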
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 16:34
ooooo! Thank you! I'm going to do that right now!
Deren Eaton
@dereneaton
May 18 2017 16:36
great
Lily Roberta Lewis
@LilyRobertLewis_twitter
May 18 2017 17:29
All is working as expected now! Thanks again for helping me troubleshoot basic stuff. I appreciate your help and patience:)
Deren Eaton
@dereneaton
May 18 2017 17:41
No problem. conda is super useful, but takes a little bit to wrap your head around. Definitely worth learning.
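A couple of everyday commands worth knowing (a sketch):

conda list ipyrad                ## show which ipyrad version is installed
conda update -c ipyrad ipyrad    ## update to the newest release whenever you like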
Jenny Archibald
@jenarch
May 18 2017 19:14

@isaacovercast @dereneaton Hello, I've hit another problem with my run. It was on step 6 and seemed almost done before stopping again due to walltime limits:

Step 6: Clustering at 0.9 similarity across 288 samples
[####################] 100% concat/shuffle input | 0:02:15
[####################] 100% clustering across | 3 days, 16:55:37
[####################] 100% building clusters | 0:04:02
[####################] 100% aligning clusters | 1:44:17
[####################] 100% database indels | 0:21:33
[####################] 100% indexing clusters | 3:28:41
[###### ] 31% building database | 2 days, 11:05:59 =>> PBS: job killed: walltime 604836 exceeded limit 604800

When I started it again, I ran it for only 24 hrs (usually I can run for a week at a time), because they were going to do maintenance on the cluster. That run gave a lot of errors in the log and also seemed to have switched where it was within step 6 (claiming it was at 33% building clusters instead of building database):

Step 6: Clustering at 0.9 similarity across 288 samples
[###### ] 33% building clusters | 0:41:14 ERROR:tornado.general:Uncaught exception, closing connection.
Traceback (most recent call last):
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "<decorator-gen-140>", line 2, in _dispatch_reply
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/ipyparallel/client/client.py", line 71, in unpack_message
return f(self, msg)
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/ipyparallel/client/client.py", line 885, in _dispatch_reply
raise KeyError("Unhandled reply message type: %s" % msg_type)
KeyError: u'Unhandled reply message type: apply_request'
ERROR:tornado.general:Uncaught exception, closing connection.
Traceback (most recent call last):
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "<decorator-gen-140>", line 2, in _dispatch_reply
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/ipyparallel/client/client.py", line 71, in unpack_message
return f(self, msg)
File "/home/jkarch/miniconda2/lib/python2.7/site-packages/ipyparallel/client/client.py", line 885, in _dispatch_reply
raise KeyError("Unhandled reply message type: %s" % msg_type)
[etc, portion deleted]
[###### ] 33% building clusters | 23:59:30 =>> PBS: job killed: walltime 86411 exceeded limit 86400

I did another 24 hr run once the cluster was back up, as a test, and it did not give errors, but it didn't seem to make any progress either (still said 33% building clusters) and our cluster help guy said it was using almost no CPU and only one node (despite me requesting 20).

Any suggestions?

Deren Eaton
@dereneaton
May 18 2017 19:17
Hi @jenarch, can you tell me which version you are on?
Jenny Archibald
@jenarch
May 18 2017 19:18
v.0.6.19
Deren Eaton
@dereneaton
May 18 2017 19:18
We just added a checkpoint in the middle of step 6 that was intended to make it easier to restart that step if you finished clustering but did not finish indexing/databasing. It might need another look, I guess.
is your assembly method denovo or reference mapped?
Jenny Archibald
@jenarch
May 18 2017 19:18
denovo
Deren Eaton
@dereneaton
May 18 2017 19:19
OK. We're working on making that databasing step much faster. I'm not sure why it is not progressing past "building clusters"; that step should be quite fast.
Deren Eaton
@dereneaton
May 18 2017 19:26
I'll make a note for us to look into restarting interrupted step 6 jobs. #243
Jenny Archibald
@jenarch
May 18 2017 19:28
Thank you! Do you have any ideas on getting this one moving again? I could do a run with -d, I haven't tried that before. Do you know of problems with running it for 24 hrs to debug, or should I be giving it the full week?
Deren Eaton
@dereneaton
May 18 2017 19:29
For now, you could restart step 6 with the force flag (-f) and it will rerun from the beginning of step 6, including clustering. It looks like it might take over a week to run on your data set, though. But it should work. I think we can find the problem without needing you to run -d, so don't use it for now, since it will slow down your analysis a bit.
Hopefully the newer faster version we have in the works will be able to fly through the databasing step, but we're pretty swamped right now, so it could take a little while before we have that up.
Jenny Archibald
@jenarch
May 18 2017 19:39
ok, I'll try that - and check to see if there is a way to get more than a week on our cluster. Is this how it would be set up: ipyrad -p params-m04c90.txt -f -s 67 -c 20 --MPI ? Basically, that's what I've done before but with the -f added before -s
Deren Eaton
@dereneaton
May 18 2017 19:40
yeah, the order doesn't matter. Are you running on a single node with 20 cores, or getting 20 cores from multiple nodes? You only need the --MPI flag if it is the latter.
Jenny Archibald
@jenarch
May 18 2017 19:42
It is supposedly multiple nodes, although I'm not sure how many - it's set up like this: procs=20,pmem=30gb
Deren Eaton
@dereneaton
May 18 2017 19:54
we have some recommended workflows here, depending on your job submission system: http://ipyrad.readthedocs.io/HPC_script.html
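For a PBS/Torque system like yours, a submission script along these lines would make the 20-cores-across-nodes request explicit (a hypothetical sketch; adjust names, memory, and walltime for your cluster):

#!/bin/bash
#PBS -N ipyrad-step6
#PBS -l nodes=2:ppn=10,pmem=30gb,walltime=168:00:00
cd $PBS_O_WORKDIR                                   ## run from the directory you submitted from
ipyrad -p params-m04c90.txt -s 6 -f -c 20 --MPI     ## --MPI because the 20 cores span two nodes

Requesting nodes=2:ppn=10 pins down the layout explicitly, whereas procs=20 lets PBS scatter the cores across nodes however it likes, which is why --MPI matters in your case.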