Maximum likelihood estimation of
phylogeny under protein covarion models
The covarion hypothesis of protein evolution proposes that
selective pressures on an amino acid or nucleotide site change
over time, resulting in changes in the evolutionary rates of
sites along the branches of a phylogenetic tree (W. M. Fitch
& E. Markowitz, Biochem. Genet. 4: 479-593, 1970). Covarion-like
evolution is now recognized as an important mode of molecular
evolution in proteins, structural RNA genes and protein-coding
genes. Empirical studies have shown that phylogenetic estimation
under a covarion model may recover different optimal topologies
than estimation that ignores covarion effects. Simulation studies
have demonstrated that, under some edge-length conditions,
rates-across-sites models that ignore covarion effects may cause
long-branch repulsion biases in the resulting phylogenetic
estimates (Wang, Susko, Spencer & Roger, 2008).
PROCOV implements a number of covarion models of protein evolution
(Tuffley and Steel, 1998; Galtier, 2001; Huelsenbeck, 2002;
Wang et al., 2007). It evaluates the maximum likelihood of a given
tree under these covarion models and optimizes the tree topology
using the subtree pruning and regrafting tree-searching algorithm.
Covarion models may be especially useful for phylogenetic
estimation when ancient divergences between sequences have
occurred and rates of evolution at sites are likely to have
changed over the tree. PROCOV can also be used to study functional
shifts in protein families that result in changes in site rates
in subtrees.
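As a rough illustration of how a covarion model of the Tuffley and
Steel type extends an ordinary substitution model, the sketch below
builds the doubled rate matrix in which every residue has an ON copy
(substitutions allowed) and an OFF copy (no substitutions). The
function name, the toy two-residue matrix and the switching rates
are assumptions made for illustration only. This is not PROCOV's
code, and for protein data the input matrix would be 20 x 20.

```python
def covarion_rate_matrix(q, s_on_off, s_off_on):
    """Build a 2n x 2n covarion generator from an n x n rate matrix q.

    States 0..n-1 are the ON copies of each residue and states
    n..2n-1 are the OFF copies (hypothetical layout for this sketch).
    """
    n = len(q)
    size = 2 * n
    big = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i != j:
                big[i][j] = q[i][j]   # substitutions occur only while ON
        big[i][n + i] = s_on_off      # ON -> OFF, residue unchanged
        big[n + i][i] = s_off_on      # OFF -> ON, residue unchanged
    for i in range(size):
        # Diagonal entries make every row sum to zero.
        big[i][i] = -sum(big[i][j] for j in range(size) if j != i)
    return big


# Toy two-residue example (illustrative values only).
toy_q = [[-1.0, 1.0],
         [1.0, -1.0]]
g = covarion_rate_matrix(toy_q, 0.5, 0.25)
for row in g:
    print(row)
```

Rows 2 and 3 of the result (the OFF states) have a non-zero
off-diagonal rate only for the switch back to ON, which is what
makes a site temporarily invariable under this kind of model.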
Wang H-C, Spencer M, Susko E & A J Roger, Topological estimation
biases with covarion evolution. J. Mol. Evol. 66: 50-60, 2008.
(Simulation studies that show the impact of covarions on
phylogenetic inference. Ignoring covarion effects may cause
long-branch repulsion bias in phylogeny.)
January 2009: Procov_2.0 released. This version adds a tree-search function to version 1.0.
January 2007: Procov_1.0 released. This version evaluates the maximum
likelihood for a fixed topology and protein alignment.
covTests implements three statistical tests for detecting whether a
protein sequence alignment has the heterotachy property. The test
statistics are:
The w-statistic: compares amino acid substitution patterns between
two monophyletic groups of protein sequences. It is defined as the
difference between the fraction of sites that are varied in both
groups and the fraction of sites that are varied in each group
(Lockhart et al., Mol. Biol. Evol. 15:1183–1188, 1998).
The w'-statistic: using site entropy as a measure of the variability
of a sequence site, the w-statistic is generalized to a w'-statistic
by reassigning sites that are varied in both groups but have a large
entropy difference to the class of sites that are variable in only
one group, thus modifying the fractions of varied sites in each
group and in both groups.
Pearson correlation coefficient (r): Under a rates-across-sites (RAS)
model, the r of site entropies between two monophyletic groups is
positive. Under the covarion models, r is also positive but smaller
than under RAS, because the switching of sites from ON to OFF and
from OFF to ON diminishes the correlation. Under an equal-rate (ER)
model, r is statistically indistinguishable from 0.
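The correlation statistic above can be sketched in a few lines. The
example below computes the Shannon entropy of each alignment column
within two groups and then the Pearson r between the two entropy
vectors. The function names and the toy sequences are assumptions
made for illustration; this is not the covTests implementation, and
it ignores gaps, ambiguous characters and significance testing.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (natural log) of the residues in one column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pearson_r(x, y):
    """Pearson correlation; undefined if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def entropy_correlation(group1, group2):
    """r between per-site entropies of two aligned groups of sequences."""
    e1 = [site_entropy(col) for col in zip(*group1)]
    e2 = [site_entropy(col) for col in zip(*group2)]
    return pearson_r(e1, e2)


# Toy aligned groups (made-up sequences, four sites each).
group_a = ["AAKL", "AGKL", "ASRL"]
group_b = ["AAKL", "ACKM", "ADRL"]
r = entropy_correlation(group_a, group_b)
print(round(r, 3))
```

In this toy data the last site varies in only one group, which pulls
r below 1 while leaving it positive.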
Simulating sequence evolution: various versions of Seq-gen under
profile mixture models (C20 and C60), the site-specific frequency
(SSF) model, covarion models and other heterotachy models
In developing profile mixture models and PMSF models, as well as the
covarion and heterotachy models, we frequently need to simulate
sequences under these models to evaluate their performance. The
sequence simulation programs we often use are listed below.
Seq-gen: a general program developed by Andrew Rambaut and Nicholas
Grassly for simulating sequence alignments based on a given
phylogenetic tree and common models of the substitution process,
described in Rambaut A & N C Grassly, Computer Applications in the
Biosciences 13:235-238, 1997.
Seq-gen-cov: Cécile Ané modified Seq-gen to include two covarion
models (the Tuffley and Steel model and the Huelsenbeck model for
nucleotide sequences), described in Ané C, Burleigh J, McMahon M &
M Sanderson, Mol. Biol. Evol. 22:914–924, 2005.
Seq-gen-aminocov: my former colleague, Matthew Spencer, further
modified Seq-gen and Seq-gen-cov so that the simulator can simulate
nucleotide and protein evolution under various covarion models,
including the Tuffley and Steel model, the Huelsenbeck model, the
Galtier model and the general covarion model. The program is
described in Wang H-C, Spencer M, Susko E & A J Roger, J. Mol. Evol.
66: 50-60, 2008.
indel-seq-gen: developed by Strope C L, Abel K, Scott S D & E N
Moriyama (Mol. Biol. Evol. 26: 2581-2593, 2009) to allow simulation
of insertion and deletion events and the use of multiple related
root sequences in simulation.
LineageSpecificSeqgen: generates sequence data with lineage-specific
variation in the proportion of variable sites, described in Shavit
Grievink L, Penny D, Hendy MD & B R Holland, BMC Evol. Biol. 8:317,
2008.
SiteSpecific.seq-gen: generates sequence data under C20+F, C60+F or
site-specific frequencies (SSF), designed by me and described in
Wang H-C, Susko E, Minh BQ & A J Roger: Modeling Site Heterogeneity
with Posterior Mean Site Frequency Profiles Accelerates Accurate
Phylogenomic Estimation (in submission).
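To make concrete what simulators of this kind do for a single site
on a single branch, the sketch below draws exponential waiting times
from a rate matrix and jumps between states until the branch length
is used up. It is a generic continuous-time Markov chain simulation
and not the algorithm of Seq-gen or of any program listed above. The
function name, the four-state toy matrix (two residues, each with an
ON and an OFF copy) and the rates are assumptions made for
illustration.

```python
import random


def simulate_branch(generator, start_state, branch_length, rng):
    """Return the state at the end of a branch, simulating jump by jump."""
    state = start_state
    remaining = branch_length
    while True:
        exit_rate = -generator[state][state]
        if exit_rate <= 0.0:
            return state                      # absorbing state
        wait = rng.expovariate(exit_rate)     # time until the next event
        if wait >= remaining:
            return state                      # branch ends before the event
        remaining -= wait
        targets = [j for j in range(len(generator)) if j != state]
        weights = [generator[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]


# Toy covarion-style generator: states 0,1 are ON copies, 2,3 are OFF copies.
toy_generator = [
    [-1.5, 1.0, 0.5, 0.0],
    [1.0, -1.5, 0.0, 0.5],
    [0.25, 0.0, -0.25, 0.0],
    [0.0, 0.25, 0.0, -0.25],
]

rng = random.Random(2009)                     # fixed seed for repeatability
end_states = [simulate_branch(toy_generator, 0, 1.0, rng) for _ in range(10)]
print(end_states)
```

Repeating this independently for every site and every branch of a
tree, with each child branch starting from its parent's end state,
gives the basic outline of sequence simulation under such a model.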