Assembly

De novo transcriptome reconstruction

Transcriptomic analyses can be performed even in the absence of a reference genome or transcriptome. In such cases, a reference transcriptome can be reconstructed directly from the RNA-seq data prior to differential expression analysis. An assembled transcriptome can also prove very useful for the structural annotation of a reference genome.

Trinity is a widely used method for efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. It comes with a Perl wrapper that combines the three modules of the pipeline: Inchworm, Chrysalis, and Butterfly.

If problems appear while running the program, try adding --bypass_java_version_check, --no_bowtie, or both to the Trinity command.

On VetGrid10:
cd ~/students/YourName/Trimmed

export PATH="/home/student/bin/subread-2.0.6-Linux-x86_64/bin:$PATH"

export PATH="/home/student/bin/STAR-2.7.11a/source:$PATH"

export PATH="/home/student/bin/trinityrnaseq-v2.15.1:$PATH"

export PATH="/home/student/bin/bowtie2-2.5.1-linux-x86_64:$PATH"

export PATH="/home/student/bin:$PATH"

export LD_LIBRARY_PATH="/home/student/bin/salmon-1.10.1/lib/:$LD_LIBRARY_PATH"

export PATH="/home/student/bin/salmon-1.10.1/bin/:$PATH"

Trinity --seqType fq --max_memory 6G --left ./incTrimP.1.fq --right ./incTrimP.2.fq --CPU 4 --SS_lib_type FR --output ../TrinityOut

Running Trinity on this "toy" dataset will take roughly 5 minutes; a full transcriptome assembly, however, takes multiple hours. In the terminal you can follow the progress through the stages of the pipeline: Jellyfish first generates the k-mer catalog, followed by Inchworm, Chrysalis, and finally Butterfly.

Figure: Close-up of alpine (left) and montane (right) plants of Heliosperma pusillum. Notice the difference in the absence/presence of trichomes.

The options explained:

--seqType: type of reads (fa or fq);
--max_memory: suggested maximum memory for Trinity to use, in GB of RAM (e.g. 6G);
--left: left reads, one or more file names (separated by commas, not spaces);
--right: right reads, one or more file names (separated by commas, not spaces);
--SS_lib_type: strand-specific RNA-seq read orientation, if paired: RF or FR;
--CPU: number of CPUs to use, default: 2;
--output: name of directory for output (will be created if it doesn't already exist).
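
Because --left and --right expect comma-separated lists, several libraries can be passed in one run. The following is only a minimal sketch: the sampleA/sampleB file names and the temporary directory are hypothetical placeholders, not files from this course, and the Trinity command is printed rather than executed.

```shell
# Hypothetical input files, created empty just to demonstrate the joining step.
workdir=$(mktemp -d)
touch "$workdir"/sampleA_1.fq "$workdir"/sampleA_2.fq \
      "$workdir"/sampleB_1.fq "$workdir"/sampleB_2.fq

# Join all arguments with commas (no spaces), as --left/--right require.
# The function body runs in a subshell so the IFS change does not leak out.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Globs expand in sorted order, so left and right files stay paired.
left=$(join_by_comma "$workdir"/sample*_1.fq)
right=$(join_by_comma "$workdir"/sample*_2.fq)

# Print the command that would be run; remove 'echo' to actually launch it.
echo Trinity --seqType fq --max_memory 6G \
     --left "$left" --right "$right" \
     --CPU 4 --SS_lib_type FR --output "$workdir/TrinityOut"
```

The order of files in the two lists has to match, which is why both lists are built from the same sorted glob pattern.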

Examine assembly stats

The assembled transcripts will be found in ~/students/YourName/TrinityOut.Trinity.fasta.

cd ~/students/YourName/

Have a look at the first few lines of the file. Try to find out how many contigs have been assembled, for example using grep.
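
One possible way to do this is sketched below on a small made-up FASTA file, so that it can be tried without the real assembly. The toy sequences are invented for illustration; the headers only imitate Trinity's naming style. To apply it to your own data, replace "$toy" with ./TrinityOut.Trinity.fasta.

```shell
# Toy FASTA file; the sequences are invented for illustration only.
toy=$(mktemp)
cat > "$toy" <<'EOF'
>TRINITY_DN0_c0_g1_i1 len=12
ACGTACGTACGT
>TRINITY_DN0_c0_g1_i2 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i1 len=10
GGCCGGCCAA
EOF

# Look at the first few lines of the file.
head -n 4 "$toy"

# Every FASTA record starts with '>', so counting header lines counts contigs.
n_contigs=$(grep -c '^>' "$toy")
echo "contigs: $n_contigs"
```

The '^' anchors the match to the start of the line, so a '>' appearing elsewhere in a header is not counted twice.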

Use the TrinityStats.pl script on the FASTA output to obtain some basic statistics about your assembly.

~/bin/TrinityStats ./TrinityOut.Trinity.fasta

Have a look at the number of transcripts and the number of genes. Why are these numbers different?
What is the GC content? What is the minimum length of the largest half of the contigs? What is the N50 value of your assembly?
Compare the results with those of other groups. Do you expect that they are the same? Why?
Evaluating different assemblies is very difficult in the absence of a reference genome. Can you think of why?
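
To make the quantities in the questions above more concrete, the following sketch recomputes them by hand on a made-up FASTA file. It is an illustration under stated assumptions, not a replacement for TrinityStats.pl, whose exact calculations may differ in detail. It assumes headers of the form TRINITY_DN..._c..._g..._i..., where the part before _i identifies the gene, and it takes N50 to be the length of the contig at which the cumulative length, summed from the longest contig downwards, first reaches half of the total assembled bases.

```shell
# Toy FASTA file; the sequences are invented for illustration only.
toy=$(mktemp)
cat > "$toy" <<'EOF'
>TRINITY_DN0_c0_g1_i1 len=12
ACGTACGTACGT
>TRINITY_DN0_c0_g1_i2 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i1 len=10
GGCCGGCCAA
>TRINITY_DN2_c0_g1_i1 len=4
ATAT
EOF

# Transcripts: one per header line.
n_transcripts=$(grep -c '^>' "$toy")

# Genes: strip the isoform suffix (_i<number>) and count unique identifiers.
n_genes=$(grep '^>' "$toy" | cut -d' ' -f1 | sed 's/_i[0-9]*$//' \
          | sort -u | wc -l | tr -d ' ')

# Length of each sequence (also works if a sequence spans several lines).
lengths=$(awk '/^>/ { if (len) print len; len = 0; next }
               { len += length($0) }
               END { if (len) print len }' "$toy")

# N50: walk the lengths from longest to shortest until half the bases are covered.
n50=$(printf '%s\n' "$lengths" | sort -rn | awk '
    { l[NR] = $1; total += $1 }
    END {
        for (i = 1; i <= NR; i++) {
            cum += l[i]
            if (cum >= total / 2) { print l[i]; exit }
        }
    }')

# GC content: fraction of G and C among all bases, as a percentage.
gc=$(grep -v '^>' "$toy" | awk '
    { s = $0; total += length(s); gc += gsub(/[GCgc]/, "", s) }
    END { printf "%.2f", 100 * gc / total }')

echo "transcripts: $n_transcripts, genes: $n_genes, N50: $n50, GC: $gc%"
```

For this toy file the sketch reports 4 transcripts, 3 genes, an N50 of 10 and a GC content of 52.94%. The gene count is lower than the transcript count because two of the toy transcripts share the same gene identifier and differ only in their isoform suffix.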

More about what Trinity can do and its downstream pipelines can be found in the Trinity documentation.