De novo assembly of RADseq loci without a genome
We will be working on a set of six population pairs of Heliosperma, each population being represented by 5 individuals (so altogether 6 mountains x 2 ecotypes x 5 individuals = 60 individuals). Each population pair contains a low elevation population (montaine) and a neighboring alpine population. We want to understand how these populations relate to one another.
denovo_map.pl from the STACKS suit normally takes at least one full day for one run even if used in the best available computers. Taking this into account I have substracted the data to have less than 1000 loci (instead of tens to hundred thousands) in the working data. This should speed up significantly the analyses and make them possible even in a laptop computer.
Constructing de novo a RAD catalog with denovo_map.pl
If not already open, open a terminal with shortcut Ctrl+Alt+T or click the terminal icon.
Go to the folder RADLECTURE with cd. Create a new directory STACKS in the folder
cd ~/RADlecture
mkdir stacks
Create a new MySQL database called lecture_radtags with
mysql -u root -p
with password as written on your computer.
You are now in mysql (the cursor symbol "$" must have changed to “>”). Type
>CREATE DATABASE lecture_radtags;
Check if you really created the database with
>SHOW DATABASES;
then
>EXIT
Populate the tables by loading the table definitions with
mysql -u root -p lecture_radtags < /usr/local/share/stacks/sql/stacks.sql
again with the password as indicated on the computer.
Run Stacks’ denovo_map.pl pipeline program. This program will run ustacks, cstacks, and sstacks on the individuals in our study. [Once you get denovo_map.pl running, it will hopefully take approximately an hour.] In the terminal type
nice -n 19 denovo_map.pl -m 10 -M 1 -n 4 -T 4 -B lecture_radtags -b 1 -t -D “m10 M1 n4 indels” --gapped -o ./stacks/ -O ./popmap --samples ./samples/
The options explained:
nice -n 19 - this sets a low priority for the process denovo_map.pl
-m — specify a minimum number of identical, raw reads required to create a stack.
-M — specify the number of mismatches allowed between loci when processing a single individual (default 2).
-n — specify the number of mismatches allowed between loci when building the catalog (default 0).
-T — specify the number of threads to execute.
-B — specify a database to load data into.
-b — batch ID representing this dataset.
-t — remove, or break up, highly repetitive RAD-Tags in the ustacks program.
-D — batch description, "batch description" (e. g. specify your settings to keep an overview)
--gapped — allow for indels in ustacks, cstacks, and sstacks
-o — path to write pipeline output files.
-O — specify path to a population map file (format is "[name] TAB [pop]", one sample per line), the file indicades if all samples belong to the same or different populations
--samples — specify a path to the directory of samples (samples will be read from population map)
denovo_map.pl is a wrapper perl script which will call several programs one after the other. Normally it is better to run denovo_map.pl with several settings for m, M and n and decide for the best settings for the respective dataset, but this has been already done in this case.
- Examine the Stacks log and output files when execution is complete. From the log: How many reads are used in each ustacks execution?
- What is the depth of coverage for each sample?
- How many markers were identified?
View the results of Stacks analysis thorugh the web interface
denovo_map.pl loads the results in the mysql database that we created before, and this is linked and can be queried from a user-friendly web interface.
Open Firefox or any other web browser of choice.
Type in the address bar 'localhost/stacks'.
- Explore the web interface. Why are some markers found in several samples? Why are the number of genotypes sometimes less than the number of total samples?
- Expand the details for one locus by clicking on its Id. Click on Allele Depths to view additional information. Click on one sample in that respective locus to view the SNPs in the sequence.
- The web interface also indicates for each population the expected/observed heterzygosity, π (nucleotide diversity), P (frequency of the most frequent allele in the population) and FIS (inbreeding coefficient) at each nucleotide position.
View more population details of the results of Stacks analysis
Denovo_map.pl produces files with results of pairwise FST calculations for each nucleotide position. These can be found in the STACKS folder. It also summarizes the results for each population in the file batch_1.sumstats_summary.tsv. Open by double clicking on it.
- Do you notice any difference in inbreeding coefficient or nucleotide diversity (eg, π) measures between the two taxa?