Abundance estimation
FeatureCounts sumarizes a set of uniquely-mapping reads, and quantifies the expression patterns of features that could be represented by exons or gene models. FeatureCounts will output a table of counts and a summary of the assigned reads.
cd ~/students/YourName/Counts
featureCounts -a ../Mapping/refgenome.subset.gff -o ./counts.txt -g ID -t gene -G ../Mapping/refgenome.subset.fasta -T 4 ~/data/Mapped/*.bam
The options explained:
-a: specifies an annotation file;
-g: specifies the attribute type to be extracted from annotation and used for summarization;
-t: specifies the feature type used in GFF annotation.
cat counts.txt.summary
- A great portion of reads are indicated as Unassigned_NoFeatures. What is represented by this category of reads? Is this something to worry about?
Here we produced a table of counts only for one mountain/locality in the interest of time. We are expanding now our analyses to 3 mountains/localities, each with the two ecotypes included. The coresponding table of counts has already been filetered as an example to retain only features/genes that are expressed over all individuals at least a number of times equal with twice the number of individuals (i.e., 36). For a scientific paper, the count tables should be filtered based on CPM (i.e., counts per million) of some value, for example 0.5 or 1 across a number of accessions equal to the smallest group in the data.
We want to run tests of differential expression for each locality separately (due to their likely independent history), so we need to include only the columns that are needed for our next analytical steps. So, use awk to prepare 3 files with counts for pair 1, for pair 4 and, respectively, for pair 5. Make use of the option -v OFS to define the Output Field Separator as tabs. For example for locality 1:
cd ~/students/YourName/
mkdir Loc1 Loc4 Loc5
head -1 145counts.txt
awk -v OFS='\t' '{print $1, $2, $3, $4, $11, $12, $13}' 145counts.txt >> ./Loc1/1counts.txt
- How many individuals are included in 1counts.txt? What about how many genes?
Download the datasets in your local computer. We will continue to work locally in R for the next steps.