Case study: Louisiana gastroenterologist

 

Software and data sets


Before starting this practical, the software needed and example data must be downloaded onto the computer.

To download the software and data used in the examples, do the following:

    Open a terminal.
    Create a new folder in the student directory, e.g. mkdir lastname_phy
    Click here to download the compressed archive (a gzipped tarball) to your working directory.

    For example:
    cd Downloads
    mv branchsupport.tar.gz /Users/student/lastname_phy
    cd
    cd lastname_phy
    tar -zxvf branchsupport.tar.gz
    cd branchsupport
    ls

    NB: If you run out of time, a fast version of this tutorial can be run at this stage with:
    ./mb
    execute thirtySeqsLouisianaGastro_trimmed.nex


Programs

We will play with the following programs:

  • SeaView, a program for manipulating multiple sequence alignments (executables for other operating systems and source code are available here).
  • MrBayes, a program for Bayesian phylogenetic inference (executables for other operating systems and source code are available here).
  • PHYML, a program for maximum likelihood inference of trees that we already used in the last practical. It can be downloaded at the following link.
  • RAxML, which, like PHYML, is a program for maximum likelihood phylogeny. It is very fast and therefore particularly good for large trees and for bootstrapping. It can be run online or downloaded here.

Data sets: Louisiana gastroenterologist

Metzker et al. PNAS 2002 describe a phylogenetic analysis of HIV env/gp120 sequences that was used as evidence in the trial of a Louisiana gastroenterologist accused of deliberately infecting someone (the "victim") with HIV-infected blood from one of the gastroenterologist's patients.

In this exercise, you are asked to carry out an analysis similar to that of Metzker et al., with the aim of deciding how well the data support the hypothesis that the victim (whose sequence identifiers all begin with a "V") was directly infected by blood taken from the patient (whose sequence identifiers all begin with a "P").

Carry out the analysis beginning with this file of 30 unaligned env/gp120 sequences. This contains:

  1. a selection of sequences from the complete set analysed by Metzker et al. (reduced so that the analysis does not take too long to run)
  2. two additional, more divergent "reference" sequences
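
Before aligning, it can help to check that the file really contains the expected number of sequences and to see how many belong to each group. The following is a minimal sketch, assuming the input is in FASTA format. The toy identifiers below are made up for illustration and are not taken from the real data set; only the convention that victim sequences begin with "V" and patient sequences with "P" comes from the description above.

```python
from collections import Counter


def read_fasta_names(text):
    """Return the identifier (first word after '>') of every FASTA record."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def tally_by_prefix(names):
    """Count sequences by the first letter of their identifier."""
    return Counter(name[:1].upper() for name in names)


# Hypothetical toy input; replace with the contents of the real FASTA file.
example = """>P1.ENV.01
ACGT
>V1.ENV.01
ACGA
>LA03.ENV
ACGG
"""

names = read_fasta_names(example)
counts = tally_by_prefix(names)
print(len(names), "sequences;", dict(counts))
```

To use it on a real file, read the file contents into a string first and pass that string to read_fasta_names.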

To place the analysis in context, consider that the aim of the prosecution's questioning of the expert witnesses in this case was to establish whether the virus samples from the victim and the patient were "closely related". The jury was (at least as far as I understand it) charged with deciding whether it was beyond reasonable doubt that the victim was infected as a result of the actions of the gastroenterologist; the results of the phylogenetic analysis were only a small part of the evidence they considered when answering this question.

You might also be interested to read the following documents, which are the decisions on the first and second appeals against the guilty verdict; they contain some comments/quotes from some of the expert phylogenetic witnesses:

For the sake of completeness, here are files containing

  1. the full set of env/gp120 nucleotide sequences used in the study
  2. the full set of RT-pol sequences described in the study

Phylogenetic analysis


Carry out the analysis beginning with this file of 30 unaligned env/gp120 sequences.

  1. Begin by aligning the initial set of nucleotide sequences using webPRANK, asking webPRANK to estimate its own tree. Alternatively, use Muscle.
  2. Using SeaView (or JalView)
    • remove sequences from the alignment that are either unnecessary or seem to contain errors
    • remove columns from the alignment that contain gaps or where you are not confident that all residues in the column are related by substitutions
  3. Realign the sequences automatically using webPRANK; if it is unavailable, use Muscle
  4. Using SeaView
    • save as FASTA format and change the names to be 10 or fewer alpha-numeric (or underscore) characters
    • save as PHYLIP format
  5. Estimate the phylogeny using PHYML, beginning with a GTR + gamma substitution model (which is usually/often a good choice)
  6. Run PHYML to identify the nucleotide substitution model that best describes the sequences in your alignment. To speed up this analysis, only examine the HKY and GTR models, with the add-on +Gamma (i.e., allowing for several substitution rate categories according to a gamma distribution).
  7. Use likelihood ratio tests (LRTs) to determine which of the above models performs best. Note that for +Gamma the shape parameter alpha is estimated, which adds one parameter. The test statistic is twice the difference in log-likelihood between two nested models. Using a chi-squared distribution, you can assess the probability of obtaining a difference at least this large if the simpler model were adequate. The number of degrees of freedom is the difference in the number of parameters between the two models.
  8. Calculate branch support values using the non-parametric bootstrap option in PHYML. To start, set the number of replicates to 100 (normally you should use at least 1000). Alternatively, you can use RAxML online or calculate posterior probabilities using MrBayes.
  9. Visualize the tree using NJplot. Save it as a PDF file.
  10. Finally, what do you feel is the right way of thinking about the relationship between branch support (e.g. bootstrap values or posterior probabilities) and whether the data supports particular conclusions "beyond reasonable doubt"?
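
The likelihood ratio test in step 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-likelihood values are invented placeholders, to be replaced with the values reported by PHYML for your own alignment, and the function names are hypothetical. The test applies only to nested models; for example, HKY is nested in GTR (4 extra parameters), and a model without +Gamma is nested in the same model with +Gamma (1 extra parameter).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-squared distribution
    with an integer number of degrees of freedom df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i < df/2} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: start from df = 1 (complementary error function) and
    # add one correction term for each further pair of degrees of freedom.
    total = math.erfc(math.sqrt(y))
    for j in range(df // 2):
        total += math.exp(-y) * y ** (j + 0.5) / math.gamma(j + 1.5)
    return total


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return (test statistic, p-value) for two nested models."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf(stat, extra_params)


# Placeholder log-likelihoods (NOT real results): HKY versus HKY+Gamma.
stat, p = likelihood_ratio_test(-3200.0, -3150.0, extra_params=1)
print(f"LRT statistic = {stat:.2f}, p = {p:.3g}")
```

A small p-value suggests that the extra parameters of the more complex model give a significantly better fit. The chi-squared approximation is the one described in step 7; it is an approximation, so p-values close to the chosen threshold should be read with caution.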