Before starting this practical, download the required software and example data onto your computer.
To download the software and the data used in the examples, do the following:
Open a terminal.
Create a new folder in the student directory, e.g. mkdir lastname_phy
Click here to download the compressed file (a gzipped tarball) and move it to your working directory.
For example:
cd Downloads
mv multiplealign.tar /Users/student/lastname_phy
cd
cd lastname_phy
tar xvf multiplealign.tar
cd multiplealign
ls
The globin data comes from the Pfam globin protein family.
We will play with the following Multiple Sequence Alignment (MSA) programs (see also this link):
In the following examples, you can save some effort by copying the commands given in bold and pasting them into the terminal window.
Go to http://blast.ncbi.nlm.nih.gov/
Click on nucleotide blast and paste the sequence into the query sequence box. Then, under program selection, choose somewhat similar and click BLAST.
This is one of the more common BLAST exercises, but it is fun, so forgive me. This is the sequence used by Crichton in The Lost World. What do you notice about the sequence? What is it most closely related to (the top hit)?
Go back and run blastx on the same sequence. Scroll to the first few alignments. What do you notice in the gaps? In case you are wondering, Mark Boguski was a scientist at NCBI.
Go back and click on taxonomy reports at the top of the page. This gives a breakdown of the hits by taxon.
Go to the first blast page and go to the bottom and click on algorithm parameters. This will expand the options for the blast searches.
Change the word size to 15.
Set the match/mismatch scores to 1,-4.
If you tick show results in a new window, you can compare results more easily. Compare this result with the previous one. (You can rerun the search with default parameters if you did not leave the earlier results open.)
>test
NIRVANA
Then click show results in new window and then click BLAST
Now go back, click algorithm parameters, and untick change parameters for short input sequences. Click BLAST again.
Now go back and click algorithm parameters, make sure change parameters for short input sequences is still off, increase the Expect threshold to 100000, and click BLAST again. The Expect threshold is the number of hits expected by chance; larger values are less stringent. Hits with lower E-values are more significant.
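To see why a very short query only produces hits at a very high Expect threshold, it can help to compute an E-value by hand. The sketch below uses the standard relation E = m * n * 2^(-S'), where m is the query length, n is the database length and S' is the bit score. The function name and the numbers in the usage lines are illustrative assumptions, not values taken from NCBI, and real BLAST uses effective lengths that are slightly shorter than the raw ones.

```python
def expected_hits(query_length, database_length, bit_score):
    """Approximate E-value: E = m * n * 2 ** (-S').

    query_length (m) and database_length (n) are raw lengths here;
    BLAST itself applies an edge correction (assumption: ignored).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers: a 7-letter query against a database of
    # 100 million residues, for a range of bit scores.
    for bits in (15, 20, 25, 30):
        print(bits, expected_hits(7, 100_000_000, bits))
```

Each additional bit of score halves the number of hits expected by chance, while doubling the database size doubles it.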
Click on fasta.
Copy and paste the sequence to a plain text file and blast it against our protein database r_p_3.fasta.
Go to blastp and try some different names that can be spelled with the amino acid alphabet (ARNDCQEGHILKMFPSTWYVBZ). Note how the E-values change as the number of letters increases.
You can try NEIL, FLEA, ELVIS, SLASH, EMINEM, HANNAH, NICKCAVE, STEPHEN.
Report your best result.
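Before submitting a name, you can check whether it can be spelled with the amino acid alphabet given above. This is a minimal sketch; the function name invalid_letters is made up for this example.

```python
# Alphabet exactly as listed in the exercise text.
AMINO_ALPHABET = set("ARNDCQEGHILKMFPSTWYVBZ")


def invalid_letters(name):
    """Return the sorted letters of `name` that are not in the alphabet."""
    return sorted(set(name.upper()) - AMINO_ALPHABET)


if __name__ == "__main__":
    for name in ("NEIL", "ELVIS", "NICKCAVE", "JOHN"):
        bad = invalid_letters(name)
        print(name, "ok" if not bad else "cannot use: " + "".join(bad))
```

An empty list means the name is a valid query for blastp.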
Different alignment programs, and even the same method when different parameter values are used, may produce very different alignment solutions for the same sequence data. We test this with a set of globin sequences.
Progressive alignment
We compare four different programs on the alignment of the globin_seed.fas data set. The globin data comes from the Pfam globin protein family.
Which alignment do you like best?
The best way to understand how an algorithm works is to try it in practice with an easy example. To simplify this, we will use a Java applet that does the computations for you. There may be compatibility issues with this example, so please use Firefox as the browser. You will also need to add http://www.ebi.ac.uk to the Java security exceptions.
Click here to launch the Java applet.
How to use the alignment applet:
A brief description of the two alternative alignment algorithms can be found at the bottom of this page.
Run the alignment applet using the following input sequences and parameters. If you run into any problems, have a look at the instructions above. Note that if you do the examples in the given order, you only need to change the parameters highlighted in blue.
Example 1
seq1 ATGAAATG
seq2 ATGATG
scoring matrix:
match = +10
mismatch = 0
gap cost = 5
method = Needleman
Run the alignment multiple times and see what happens. What do you think is the right solution? How many "good" solutions are there? And what about the rest?
Example 2
seq1 ATGAAATG
seq2 ATGATG
scoring matrix:
match = +10
mismatch = 0
gap (opening) cost = 5
gap extension cost= 1
method = Gotoh
Do the same as above. What happens?
Example 3
seq1 AGCT
seq2 AACGT
scoring matrix:
match = +10
mismatch = 0
gap cost = 5
method = Needleman
Example 4
seq1 AGCT
seq2 AACGT
Scoring matrix (press Return after changing any entry of the table):
match = 10
transition (A <> G or C <> T) = 6
gap cost = 5
method = Needleman
Compare the alignments from 3 and 4. What happens? Can you explain it? What do you know about the frequencies of different types of substitutions?
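The scoring scheme of Example 4 can be written as a small function, which makes the difference from Example 3 explicit. This is only a sketch: the names are invented here, and the transversion score of 0 is an assumption carried over from the mismatch score of Example 3, since Example 4 does not list it.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_score(a, b, match=10, transition=6, transversion=0):
    """Score one aligned nucleotide pair.

    transversion=0 is an assumption (not stated in Example 4).
    """
    if a == b:
        return match
    # A <> G and C <> T stay within one chemical class: transitions.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return transition
    return transversion


if __name__ == "__main__":
    bases = "ACGT"
    print("  " + "  ".join(bases))
    for a in bases:
        print(a, " ".join(f"{pair_score(a, b):2d}" for b in bases))
```

Running the script prints the full 4 x 4 scoring table, which can be compared with the table shown in the applet.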
Example 5
seq1 AGGT
seq2 ATGGAT
scoring matrix:
match = +10
mismatch = 0
gap cost = 12
method = Needleman
Example 6
seq1 AGGT
seq2 ATGGAT
scoring matrix:
match = +10
mismatch = 0
gap cost = 12
gap extension cost= 1
method = Gotoh
Compare the alignments from 5 and 6. Can you explain what happens?
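Needleman-Wunsch algorithm: The score of an alignment is the sum of the scores for the individual nucleotide pairs minus the penalties for gaps. We start with a score of zero in the upper left corner and then fill in the matrix recursively, taking for each cell the maximum of three possibilities: a diagonal move that aligns the two characters and adds their pair score, or a vertical or a horizontal move that inserts a gap and subtracts the gap cost.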
For each cell in the matrix, we choose the maximum of the three possibilities and record which predecessor cell the move came from. When the entire matrix is full, we follow the backward pointers from the lower right corner to the upper left corner to obtain the alignment. NB: The Needleman-Wunsch algorithm does not use the gap extension cost; each horizontal or vertical move is penalised by the gap cost. Thus, the penalty for a gap grows linearly with its length.
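The matrix fill and the traceback described above can be sketched in a few lines of Python. This is an illustrative implementation, not the applet's code: the function name, the default parameters (taken from Example 1) and the penalised end gaps are assumptions. Ties are broken in a fixed order here, so repeated runs always return the same alignment.

```python
def needleman_wunsch(seq1, seq2, match=10, mismatch=0, gap=5):
    """Global alignment with a linear gap cost.

    Returns (score, aligned_seq1, aligned_seq2).
    Assumption: gaps at the sequence ends are penalised like any other.
    """
    n, m = len(seq1), len(seq2)
    # score[i][j]: best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = -gap * i, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = -gap * j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),  # align two characters
                (score[i - 1][j] - gap, "up"),         # gap in seq2
                (score[i][j - 1] - gap, "left"),       # gap in seq1
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Follow the backward pointers from the lower right corner.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif step == "up":
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    s, a1, a2 = needleman_wunsch("ATGAAATG", "ATGATG")
    print(s)
    print(a1)
    print(a2)
```

Changing the order of the three candidates changes which of several equally good alignments is returned, but not the score.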
Gotoh's algorithm: Gotoh's algorithm separates the gap opening and gap extension events and hence allows a high penalty to be assigned for opening a new gap (gap opening cost) and a lower penalty for extending an existing gap (gap extension cost). To visualise Gotoh's algorithm, we fill three matrices instead of one. We call them the M, X and Y matrices: the first always matches the characters in the two strings, and the other two create either horizontal or vertical gaps.
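The three-matrix recursion can be sketched as follows. This is a score-only illustration under stated assumptions: a gap of length k costs gap_open + (k - 1) * gap_extend (the applet may use a different convention), X is taken to hold vertical gaps and Y horizontal gaps, and the function name is made up for this example.

```python
NEG_INF = float("-inf")


def gotoh_score(seq1, seq2, match=10, mismatch=0, gap_open=5, gap_extend=1):
    """Best global alignment score with affine gap costs.

    Assumption: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(seq1), len(seq2)
    # M: last column aligns two characters
    # X: last column is a gap in seq2 (vertical move)
    # Y: last column is a gap in seq1 (horizontal move)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            X[i][j] = max(M[i - 1][j] - gap_open,    # open a new gap
                          X[i - 1][j] - gap_extend,  # extend the current gap
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # Parameters of Example 2 and Example 6.
    print(gotoh_score("ATGAAATG", "ATGATG", gap_open=5, gap_extend=1))
    print(gotoh_score("AGGT", "ATGGAT", gap_open=12, gap_extend=1))
```

Because extending a gap is cheaper than opening one, this scheme can prefer one long gap with fewer matches over several short gaps with more matches, which a linear gap cost cannot do.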
