Multiple Alignments

 

Software and data sets


Before starting this practical, the software needed and example data must be downloaded onto the computer.

To download the software and data used in the examples, do the following:

    Open a terminal.
    Create a new folder in the student directory, i.e. mkdir lastname_phy
    Click here, to download the compressed (gzipped tar-ball) to your working directory.

    For example:
    cd Downloads
    mv multiplealign.tar /Users/student/lastname_phy
    cd
    cd lastname_phy
    tar -xvf multiplealign.tar
    cd multiplealign
    ls

The globin data comes from the Pfam globin protein family.


Programs

We will play with the following Multiple Sequence Alignment (MSA) programs ( see also this link):

  • Clustal Omega is a multiple sequence alignment program for protein sequences by Higgins et al. You can download it from here or run it remotely using the EBI server.
  • Prank is probabilistic multiple alignment program that treats insertions correctly from an evolutionary viewpoint and avoids over-estimation of the number of deletion events. It can be downloaded from here. We will use the oneline version webPRANK in this tutorial.
  • MUSCLE is, then, the second most accurate protein aligner (though, it is much faster), developed by Bob Edgar and downloadable from here.
  • BLAST is a set of similarity search programs designed to explore all of the available sequence databases. It can be used online at NCBI
  • SeaView is a graphical tool to modify sequence alignments developed by Galtier et al. in France and downloadable from here


In the following examples, you can save some effort by copying the command given in bold and pasting it in the terminal window.


Working with webBLAST


Go to http://blast.ncbi.nlm.nih.gov/

  • Open nerd.fasta in a plain text editor and copy it to the clipboard.


Click on nucleotide blast and paste the sequence into the query sequence space. Then under the program selection, click on somewhat similar, then click BLAST.

This is one of the more common exercises for BLAST, but it is fun so forgive me. This is the sequence used by Crichton in The Lost World. What do you notice about the sequence? What is it closely related to (top hit).

Go back and run a blastx on the same sequence. Scroll to the first few alignments. What do you notice in the gaps? If you are wondering Mark Boguski was a scientist at NCBI.

  • Conduct a blast on the r4.fasta sequence. Does the label match the sequence? Click on the accession for the first hit and you can see all the info for the sequence. Including some analyses. If you go back you can see some interesting tools. At the descriptions, click select all and then graphics. This gives a nice overview of the comparison of the sequences.


Go back and click on taxonomy reports at the top of the page. This will give you a rundown of the hits in taxa.

Go to the first blast page and go to the bottom and click on algorithm parameters. This will expand the options for the blast searches.

Change the word size to 15

match/mismatch to 1,-4

If you click open results in new window you can compare things more easily. Compare this with the one that you did before. (You can rerun with default parameters if you didn't leave them open).

  • Go to blastp and in the box type

>test
NIRVANA

Then click show results in new window and then click BLAST
Now go back and click algorithm, unclick change parameters for short input sequences. Click BLAST again.

Now go back and click algorithm, make sure change parameters for short input sequences is still off and increase Expect threshold to 100000 then click BLAST again. The Expect threshold is the number of hits you expect by random with larger being less stringent. Lower values are important for significance.

  • Conduct a blastp on r3.fasta but change the organism to Amborella. Is the first hit significant? This is another way to shrink the results but not the database. Compare the search summary of the first to a search summary without the organism limit. The statistics are the same so the size of the database didn't change. Click on the Amborella sequence that we hit.

Click on fasta.

Copy and paste the sequence to a plain text file and blast it against our protein database r_p_3.fasta.

  • Working with webBLAST if you have extra time

Go to blastp and try some different names that happen to overlap with the amino acid alphabet (ARNDCQEGHILKMFPSTWYVBZ). Note the evalues as you increase the letters.

You can try, NEIL, FLEA, ELVIS, SLASH, EMINEM, HANNAH, NICKCAVE, STEPHEN

Report your best


Algorithms & Parameters


Different alignment programs, and even the same method when different parameter values are used, may produce very different alignment solutions for the same sequence data. We test this with a set of globin sequences.


Progressive alignment
We compare four different programs in the alignment of the globin_seed.fas data set. The globin data comes the Pfam globin protein family.

  1. To make a MUSCLE-alignment, open a terminal window, go to the new folder (type cd ~/lastname_phy/multiplealign and press "Enter") and type the command ./muscle -in globin_seed.fas -out globin_seed_muscle.fas. The resulting alignment can be found in the file globin_seed_muscle.fas. (You can use SeaView to browse the resulting alignment. Click the icon to start the program).
  2. To make a prank-alignment, use the on-line version webPRANK. Download the fasta file.
  3. For ClustalW/ClustalX, type the command ./clustalw2 /infile=globin_seed.fas /output=fasta /outfile=globin_seed_clustalw.fas in the terminal window.
  4. To compare the three alignments type: ./Seaview globin_seed_clustalw.fas & ./Seaview globin_seed_muscle.fas & ./Seaview prank...fas &

Which alignment do you like best?


Alignment Algorithms and Dynamic Programming

The best way to understand how an algorithm works is to try it in practice with an easy example. To simplify this, we will use a java applet that will do computations for you. There may compatibility issues with this example, so please ensure that you use either Firefox as the browser. You will also need to add http://www.ebi.ac.uk to java security exception.

Click here to launch the java applet.

How to use the alignment applet:

  1. Enter two nucleic-acid strings in the fields seq1 and seq2. Strings can be of different length but no longer than 15 characters.
  2. Using the scroll down button, choose between the Needleman-Wunsch (standard) or the Gotoh (affine-gap-cost) algorithms.
  3. Specify the model parameters using the fields opening cost, extension cost and the scoring matrix.
  4. After 1.-3., press the button Align.
  5. You will find the alignment as well as its score below the matrices.
  6. The matrices explain graphically the steps taken by the algorithm:
      Matrices are filled from the upper-left corner towards the lower-right corner. The cell (0,0) is set to 0; all other cells show the score for the best sub-alignment from (0,0) to that point . The best full alignment is found by following the pointers (\, --, |) from the lower-right corner to the upper-left corner.
  7. You can also find the optimal alignment by pressing the Path button and then Align again. The path will be displayed in dark blue.

A brief description of the two alternative alignment algorithms can be found at the bottom of this page.



Exercises

Run the alignment applet using the following input sequences and parameters. In the case of any problems, have a look on the instructions above. Note that if you do the examples in the given order, you only need to change the parameters highlighted in blue.

 

Example 1
seq1 ATGAAATG
seq2 ATGATG
scoring matrix:
  match = +10
  mismatch = 0
gap cost = -5
method = Needleman

Run the alignment multiple times and see what happens. What do you think is the right solution? How many "good" solutions there are? And what about the rest?

Example 2
seq1 ATGAAATG
seq2 ATGATG
scoring matrix:
  match = +10
  mismatch = 0
gap (opening) cost = -5
gap extension cost= -1
method = Gotoh

Do like above. What happens?

Example 3
seq1 AGCT
seq2 AACGT
scoring matrix:
  match = +10
  mismatch = 0
gap cost = -5
method = Needleman

Example 4
seq1 AGCT
seq2 AACGT
Scoring matrix (press Return after changing any entry of the table):
  match = 10
  transition (A <-> G or C <-> T) = 6
gap cost = -5
method = Needleman

Compare the alignments from 3 and 4. What happens? Can you explain it? What do you know about frequencies of different types of substititutions?

Example 5
seq1 AGGT
seq2 ATGGAT
scoring matrix:
  match = +10
  mismatch = 0
gap cost = -12
method = Needleman

Example 6
seq1 AGGT
seq2 ATGGAT
scoring matrix:
  match = +10
  mismatch = 0
gap cost = -12
gap extension cost= -1
method = Gotoh

Compare the alignments from 5 and 6. Can you explain what happens?

 

A brief description of the two variants of the algorithm

Needleman-Wunch's algorithm:

The score of an alignment will be the sum of scores for individual nucleic acid pairs minus the penalty for creating gaps. We start with a score of zero and then fill in the matrix recursively using the maximum of three possibilities:

  1. Two characters are matched (diagonal move).
    The score of the sub-alignment is the sum of the partial alignment in the diagonally preceding cell plus the score of the next two nucleic acids matching.
  2. A gap is inserted in sequence i (horisontal move).
    The score of the sub-alignment is the sum of the partial alignment in the horisontally preceding cell plus the negative gap penalty
  3. A gap is inserted in sequence j (vertical move).
    The score of the sub-alignment is the sum of the partial alignment in the vertically preceding cell plus the negative gap penalty.

For each cell in the matrix, we choose the maximum of the three possibilities, and record from which predecessor cell the move was taken. When the entire matrix is full, we can follow the path of the backward pointers from the lower right corner to the upper left to yield the alignment.

NB: For Needleman-Wunsch algorithm the gap extension cost is not used and each horizontal or vertical move is penalised by the gap cost! Thus, the penalty for a gap grows linearly with its length.

Gotoh's algorithm:

Gotoh's algorithm separates the gap opening and extension events, and, hence allows the assignment of a high penalty for the opening of a new gap (gap opening cost), and a lower cost for the extension of an existing gap (gap extension cost).

To visualise Gotoh's algorithm we will fill three matrices instead of one. We will call them M-, X- and Y- matrices, the first always matching the characters in the two strings, and the two others creating either horizontal or vertical gaps.

  1. At cell (i,j) of the M-matrix, the characters are matched and the optimal move is chosen from one of the three diagonal predecessors, i.e., from the cell (i-1, j-1) in the matrix M, X or Y. A gap is closed if the best move is from matrices X or Y.
  2. At cell (i,j) of the X matrix, the move is chosen from the cell (i-1,j) in the matrices M, or X. A move from the M-matrix opens a new gap, and it is penalized by its cost, whereas a move from the X matrix extends an existing gap, and is assigned a lower extension penalty.
  3. Similarly, at cell (i,j) of the Y-matrix, the move is chosen from the cell (i,j-1) in the matrices M, or Y.