These are chat archives for dereneaton/ipyrad

26th
Sep 2017
Deren Eaton
@dereneaton
Sep 26 2017 01:49
Hi @emhudson , step 5 should run quite quickly (step 6, on the other hand, is typically the slow step on large assemblies), so I'm guessing the sbatch setup may not be quite right. Your setup varies slightly from our recommended sbatch terms here (http://ipyrad.readthedocs.io/HPC_script.html). I'm not sure how ntasks= vs ntasks-per-node= affects things, but there are several options for ntasks-per-{something} which allocate tasks in different ways (see https://slurm.schedmd.com/mc_support.html). Similarly, --mem= versus --mem-per-cpu= may not be allocating sufficient memory to all cores. You may also need to load the MPI module to use MPI mode with ipcluster. Also, it looks like you are loading a system-wide anaconda distribution as opposed to installing a local version for ipyrad. Are you using a recently updated version, or an older copy of ipyrad? The current version is 0.7.13. It's typically easier to keep your version up to date if you do a local install. Let me know if that doesn't help.
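For example, a quick sketch of one way to check which ipyrad installation and version a job is actually picking up, run from Python:

## print the ipyrad version and the path it was imported from
import ipyrad as ip
print(ip.__version__)   ## current release is 0.7.13
print(ip.__file__)      ## shows whether this is the system-wide or a local install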
Ollie White
@Ollie_W_White_twitter
Sep 26 2017 09:44

Hello. I would like to remove low-quality samples following steps 1 and 2 of ipyrad. These are samples that have a low number of reads and generally a lot of missing data at the end of the assembly. I did this on the command line with the command below, which prints the names of samples with more than 500,000 filtered reads to a new text file. I could then create a new branch assembling only these samples.

awk ' $7 >= 500000 { print $1 } ' s2_rawedit_stats.txt  > samples-to-keep.txt

Is anyone aware of a way to do this using the API notebooks?

Deren Eaton
@dereneaton
Sep 26 2017 14:08
Hi @Ollie_W_White_twitter , sure, this is exactly what the API is designed for, though we need to better document how to access all the attributes of the Assembly object.
sub = [i.name for i in data.samples.values() if i.stats.reads_raw > 500000]
new = data.branch("new-assembly", subsamples=sub)
data.samples is a dictionary mapping sample names to Sample objects, which is useful for this.
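For example, a fuller sketch along these lines, assuming a saved assembly and filtering on the step 2 stats instead (the JSON file name is a placeholder, and the reads_passed_filter column name may differ slightly between versions):

## load a saved assembly ("my-assembly.json" is a placeholder name)
import ipyrad as ip
data = ip.load_json("my-assembly.json")

## keep only samples with >500000 reads passing the step 2 filters
sub = [i.name for i in data.samples.values()
       if i.stats.reads_passed_filter > 500000]
new = data.branch("new-assembly", subsamples=sub)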
Ollie White
@Ollie_W_White_twitter
Sep 26 2017 14:24
Thanks @dereneaton , yes I am hoping it will be useful to use the API for this reason. Thanks for the filtering advice too.
I'm not sure that my notebook is using all 16 threads available on the compute node. I entered ipcluster start, as you suggested, before opening the notebook. Is there a way to test whether all the threads are being used? Step 1 for 95 samples is taking over 2 hours; I seem to remember this being quite a quick step using the CLI, so I might have missed something.
Deren Eaton
@dereneaton
Sep 26 2017 15:14
@Ollie_W_White_twitter yes, see the API docs here (http://ipyrad.readthedocs.io/API_user-guide.html#the-run-command). You can use the ipyparallel library to check your connection with ipcluster. Step 1 should run very fast like you said. You can restart the ipcluster at any time if needed, before or after the notebook is started.
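For example, a small sketch of that check from within the notebook; it should report 16 engines if one was launched per thread on the node:

## connect to the running ipcluster and count its engines
import ipyparallel as ipp
ipyclient = ipp.Client()
print(len(ipyclient.ids))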
Hi @toczydlowski , there have been enormous changes between v0.5 and v0.7; you will find things run a lot faster and use less memory and disk space (step 6 in particular). We recommend that you update whenever new versions come out, as we're probably done with breaking backwards-compatibility now that things are more stable. I can't think of any reason that steps 4, 5, 6, or 7 would need to know where the raw data is located. Copy the error here if you encounter it again.
Deren Eaton
@dereneaton
Sep 26 2017 15:20
btw @emhudson , to format code in Gitter, wrap it between a line of three backtick markers at the beginning and another at the end.
Isaac Overcast
@isaacovercast
Sep 26 2017 15:43
@emhudson The problem is almost certainly a memory allocation issue. In your job script you have:
#SBATCH --mem=16000
This specifies memory per node in megabytes, so you're allocating 16GB for the whole node. We normally recommend 4GB per core, so with 16 cores, for example, you'd want --mem=64000 (64GB).
Isaac Overcast
@isaacovercast
Sep 26 2017 15:49
@toczydlowski The raw_fastq_path isn't required for steps beyond 2, but the assembly has a habit of checking for it every time it's loaded, so it expects the path to exist even though it doesn't use it. It's good practice not to move ipyrad directories around (we recommend against it), so if you're moving stuff around you'll have to deal with occasional issues like this. You don't have to copy the raw_fastq directory to the cluster; it's probably enough just to create an empty directory with the name the assembly expects.
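For example, a rough sketch of that workaround (the path is a placeholder; use whatever raw_fastq_path your params file or assembly actually lists):

## recreate the directory the assembly expects, without copying any raw data
import os
expected = "/path/from/your/params/raw_fastq_dir"   ## placeholder path
if not os.path.exists(expected):
    os.makedirs(expected)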
Ollie White
@Ollie_W_White_twitter
Sep 26 2017 17:08

Thanks @dereneaton, that's running a lot quicker now. I would like to subsample my taxon list further, by species. For example, selecting taxa with brbr in their name:

data2 = [i for i in allsamples if "brbr" in i]

Is there a way to subsample by multiple species? For example, selecting taxa with "brbr", "frfr", and "crcr" in their names? Apologies, I need to work on my Python... I have always relied upon Linux.

Deren Eaton
@dereneaton
Sep 26 2017 18:12
@Ollie_W_White_twitter
## a list of sample names 
subnames = [i.name for i in data.samples.values() if 'brbr' in i.name]

## or a list of the sample objects
subsamples = [i for i in data.samples.values() if 'brbr' in i.name]
Ollie White
@Ollie_W_White_twitter
Sep 26 2017 18:20
Cheers @dereneaton, can I edit this to make a list of sample names that include both 'brbr' and 'frfr' in i.name, for example?
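One way to extend the pattern above to several species tags at once, as a sketch (assuming the same data.samples structure and that the tags appear verbatim in the sample names):

## keep samples whose names contain any of several species tags
tags = ("brbr", "frfr", "crcr")
subnames = [i.name for i in data.samples.values()
            if any(tag in i.name for tag in tags)]
newdata = data.branch("multi-species", subsamples=subnames)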
toczydlowski
@toczydlowski
Sep 26 2017 20:48
@dereneaton @isaacovercast Thanks! Hoping to get to manageable disk space usage with the new version and gzipped input files. Isaac, I wondered about making a blank input directory as you suggested for later steps; will try that. And yes, I know all about the perils of moving files around, but sadly we are now required to write everything to the execute node and then transfer one final tarball back to the home node to comply with our campus cluster policies. Before, we could set the project directory to live on a huge mounted drive with a static address and write back and forth between there and the execute node. We solved the path issues created by moving files by running ipyrad via Docker, so paths are all relative within the Docker container and stay constant regardless of the execute node the job lands on. Pretty cool. Thanks as always for your help. - Rachel
Deren Eaton
@dereneaton
Sep 26 2017 21:42
@toczydlowski that sounds like a nice trick.