Hypothesis testing

This example is taken from

  • Wesley M. Brown, Ellen M. Prager, Alice Wang and Allan C. Wilson, 1982. Mitochondrial DNA sequences of primates: Tempo and mode of evolution. Journal of Molecular Evolution 18 (4): 225-239.

We will use BASEML (part of the PAML package by Ziheng Yang) because it requires a good understanding of the procedure involved in the hypothesis test, and hence has greater didactic power. Additionally, this introduces you to a general procedure that is used by BASEML (and other software in the PAML package) to test many other evolutionary hypotheses in a similar fashion.

BASEML operates differently from the other programmes you have used so far. Rather than choosing the run options through an interactive menu, you specify them in a so-called “control file”. Please download brown_data.zip, which contains the baseml control file, the tree file and the alignment file.

This control file specifies, amongst other things, parameters describing:

·      the name and location of the alignment and tree files to be used in the calculations;

·      the name and location of the file containing the substitution matrix to be used in the calculation;

·      whether a model of between-site rate heterogeneity should be used;

·      if between-site rate heterogeneity is modelled, the details of this model;

To set the name of the alignment file and tree file needed by the analysis you would include the following two lines in the control file:

seqfile = brown.nuc * name of sequence data file

treefile = 5s.unrooted.trees * name of tree structure file
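
Because each option follows the same "name = value * comment" layout, a control file can also be edited with a short script rather than by hand. The Python sketch below is one possible way to do this. The helper name set_option and the placeholder file names old.nuc and old.trees are assumptions made for illustration and are not part of PAML; the sketch also assumes each option sits on its own line.

```python
import re

def set_option(control_text, name, value):
    """Set `name = value` in control-file text, keeping any trailing * comment.

    Hypothetical helper: if the option is absent, it is appended at the end.
    """
    pattern = re.compile(rf"^(\s*{re.escape(name)}\s*=\s*)(\S+)", re.MULTILINE)
    # A function is used as the replacement so that `value` is inserted literally.
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), control_text)
    if count == 0:
        new_text = control_text.rstrip("\n") + f"\n{name} = {value}\n"
    return new_text

# Placeholder starting text; a real control file contains more options.
example = (
    "seqfile = old.nuc * name of sequence data file\n"
    "treefile = old.trees * name of tree structure file\n"
)
example = set_option(example, "seqfile", "brown.nuc")
example = set_option(example, "treefile", "5s.unrooted.trees")
print(example)
```

The same helper could be used to change other options between runs, provided the option names match those in the control file you downloaded.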

To test a hypothesis with BASEML, you need to run the program several times. For example, you would first run the analysis under the JC model and then run it again under the K2P model. After each run, examine the output file (whose name is set by the "outfile" parameter in the control file) to find the log likelihood score of the tree under that model.
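
If you prefer not to search the output file by eye, the log likelihood can be pulled out with a few lines of Python. This is a minimal sketch that assumes the output contains a line beginning with "lnL(" followed by the score, as shown in the comment; the numbers in that comment are placeholders, not results for this dataset, and the function names are hypothetical.

```python
import re

# Assumed layout of the likelihood line in the baseml output file, e.g.
#   lnL(ntime:  7  np:  8):  -2616.073763    +0.000000
# (placeholder numbers, not results from the Brown et al. data)
LNL_PATTERN = re.compile(r"lnL\(.*?\):\s*(-?\d+(?:\.\d+)?)")

def extract_lnl(output_text):
    """Return the first log likelihood found in the text of an output file."""
    match = LNL_PATTERN.search(output_text)
    if match is None:
        raise ValueError("no lnL line found in the output text")
    return float(match.group(1))

def read_lnl(path):
    """Read an output file from disk and return its log likelihood."""
    with open(path) as handle:
        return extract_lnl(handle.read())
```

Check the value against the output file itself at least once, since the layout of the line may differ between PAML versions.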

To run the baseml program, double-click on baseml (Windows) or type ./baseml at the command line (UNIX).
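
On UNIX the runs can also be launched from a script. The sketch below assumes that baseml accepts the name of a control file as its first command-line argument and that the executable sits in the current directory; the control file names jc.ctl and k2p.ctl in the usage note are hypothetical.

```python
import subprocess

def baseml_command(control_file, executable="./baseml"):
    """Build the command line for one baseml run (assumed form: baseml <control file>)."""
    return [executable, control_file]

def run_baseml(control_file, executable="./baseml"):
    """Run baseml once and return what it printed to the screen.

    Raises subprocess.CalledProcessError if the program exits with an error.
    """
    result = subprocess.run(
        baseml_command(control_file, executable),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, calling run_baseml("jc.ctl") and then run_baseml("k2p.ctl") would perform the two runs, provided each control file names a different output file so that the second run does not overwrite the first.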

With BASEML you need to calculate the likelihood ratio statistic yourself: it is twice the difference between the log likelihoods estimated under the two models. You can then assess the probability of observing a difference at least this large if the simpler model (JC) were adequate, using a chi-squared distribution whose degrees of freedom equal the difference in the number of parameters between the two models. For the JC versus K2P comparison this difference is 1.
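
The calculation can be sketched in Python as follows. The function names are hypothetical, and the log likelihoods in the example are made-up placeholders, not results for this dataset. For 1 degree of freedom, the upper tail probability of the chi-squared distribution can be written exactly with the complementary error function, so only the standard library is needed; comparisons with more degrees of freedom would need a statistics library or a table.

```python
import math

def lrt_statistic(lnl_simple, lnl_complex):
    """Twice the difference in log likelihood between the two models."""
    return 2.0 * (lnl_complex - lnl_simple)

def chi2_p_value_df1(statistic):
    """P(X > statistic) for a chi-squared variable X with 1 degree of freedom.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)), valid for 1 df only.
    """
    if statistic <= 0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))

# Placeholder log likelihoods (NOT real results): JC is the simpler model.
lnl_jc = -1000.0
lnl_k2p = -995.0

statistic = lrt_statistic(lnl_jc, lnl_k2p)
p_value = chi2_p_value_df1(statistic)
print(f"2 * delta lnL = {statistic:.3f}, p = {p_value:.6f}")
```

A small p-value (conventionally below 0.05, which corresponds to a statistic above about 3.84 for 1 degree of freedom) suggests that the extra parameter of the K2P model gives a significantly better fit than JC.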