
PROCOV

PROtein COVarion analysis

Maximum likelihood estimation of phylogeny under protein covarion models

The covarion hypothesis of protein evolution proposes that selective pressures on an amino acid or nucleotide site change over time, so that the evolutionary rate of a site changes along the branches of a phylogenetic tree (W. M. Fitch & E. Markowitz, Biochem. Genet. 4: 579-593, 1970). Covarion-like evolution is now recognized as an important mode of molecular evolution in proteins, structural RNA genes and protein-coding genes. Empirical studies have shown that phylogenetic estimation under a covarion model may recover different optimal topologies than estimation that ignores covarion effects. Simulation studies have demonstrated that, under some edge-length conditions, rates-across-sites models that ignore covarion effects may cause long-branch repulsion biases in the resulting phylogenetic estimates (Wang, Spencer, Susko & Roger, 2008).

PROCOV implements a number of covarion models of protein evolution (Tuffley and Steel, 1998; Galtier, 2001; Huelsenbeck, 2002; Wang et al., 2007). It evaluates the maximum likelihood of a given tree under these covarion models and optimizes the tree topology using the subtree pruning and regrafting (SPR) tree-searching algorithm. Covarion models may be especially useful for phylogenetic estimation when the sequences have diverged anciently and the rates of evolution at sites are likely to have changed over the tree. PROCOV can also be used to study functional shifts in protein families that change site rates in subtrees.
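To make the structure of such a model concrete, the following is a minimal sketch of a Tuffley and Steel style covarion rate matrix, in which each site is either ON (substitutions occur) or OFF (no substitutions) and switches between the two states. It is an illustration only and not PROCOV's code: the function name `covarion_rate_matrix`, the parameter names `s_on_off` and `s_off_on`, the state ordering and the toy 3-state matrix `Q` are all assumptions made for the example.

```python
def covarion_rate_matrix(Q, s_on_off, s_off_on):
    """Build a Tuffley-Steel style covarion rate matrix (illustrative sketch).

    Q        : n x n substitution rate matrix used while a site is ON
    s_on_off : rate of switching ON -> OFF (hypothetical parameter name)
    s_off_on : rate of switching OFF -> ON (hypothetical parameter name)

    States 0..n-1 are (ON, residue i); states n..2n-1 are (OFF, residue i).
    """
    n = len(Q)
    R = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                R[i][j] = Q[i][j]      # substitutions happen only while ON
        R[i][n + i] = s_on_off         # ON -> OFF, residue unchanged
        R[n + i][i] = s_off_on         # OFF -> ON, residue unchanged
    # Each diagonal entry makes its row sum to zero.
    for i in range(2 * n):
        R[i][i] = -sum(R[i][j] for j in range(2 * n) if j != i)
    return R


# Toy 3-state substitution matrix standing in for a 20-state amino acid model.
Q = [[-2.0, 1.0, 1.0],
     [1.0, -2.0, 1.0],
     [1.0, 1.0, -2.0]]
R = covarion_rate_matrix(Q, s_on_off=0.5, s_off_on=0.25)
```

With a real amino acid model such as JTT, WAG or LG, `Q` would be 20 x 20 and the covarion matrix 40 x 40. The OFF rows contain only the switching rate, which is what lets a site stay frozen on some branches and evolve on others.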

New version of PROCOV available: Procov_2.0

Source code

Test datasets

Features of Procov_2.0 compared with Procov_1.0:

Tree searching available
New user interface - command-line arguments are used
Three amino acid substitution models available: JTT, WAG and LG
Uses numerical libraries (BLAS) for matrix manipulations, increasing PROCOV's running speed about 3-fold
Can print log likelihoods at sites

Citations

Wang H-C, Spencer M., Susko E. & A. J. Roger, Testing for covarion-like evolution in protein sequences. Mol. Biol. Evol. 24: 294-305, 2007. (Proposing a general covarion model and model tests; release Procov_1.0)

Wang H-C, Spencer M., Susko E. & A. J. Roger, Topological estimation biases with covarion evolution. J. Mol. Evol. 66: 50-60, 2008. (Simulation studies that show the impact of covarions on phylogenetic inference. Ignoring covarion effects may cause long-branch repulsion bias in phylogeny)

Wang H-C, Susko E. & A. J. Roger, PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis. BMC Evol. Biol. 9:225, 2009. (release Procov_2.0; added tree search function and more features)

PROCOV history:

January 2009: Procov_2.0 released. This version adds a tree search function to version 1.0.

January 2007: Procov_1.0 released. This version evaluates maximum likelihood for a fixed topology and protein alignment.

covTests

covTests implements three statistical tests for detecting whether a protein sequence alignment exhibits heterotachy. The test statistics are:

The w-statistic: compares amino acid substitution patterns between two monophyletic groups of protein sequences. It is defined as the difference between the fraction of varied sites in both groups and the fraction of varied sites in each group (Lockhart et al. Mol. Biol. Evol. 15:1183–1188, 1998).
The w'-statistic: generalizes the w-statistic by using site entropy as a measure of the variability of a site. Sites that vary in both groups but have a large entropy difference between the groups are reassigned to the class of sites that vary in only one group, which modifies the fractions of varied sites in each group and in both groups.
Pearson correlation coefficient (r): Under a rates-across-sites (RAS) model, the r of site entropies between two monophyletic groups is positive. Under the covarion models, r is also positive but smaller than under RAS, because sites switching from ON to OFF and from OFF to ON diminish the correlation. Under an equal rate (ER) model, r is statistically indistinguishable from 0.
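As an illustration of the third statistic, the sketch below computes per-site entropies for two groups of aligned sequences and the Pearson correlation between them. It is not the covTests source: the function names are hypothetical, the entropy uses natural logarithms, and the example assumes gap-free alignments given as lists of equal-length strings. It computes only r and does not perform the significance test.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def entropy_correlation(group1, group2):
    """Correlate site entropies between two groups of aligned sequences."""
    h1 = [site_entropy(col) for col in zip(*group1)]   # zip(*rows) yields columns
    h2 = [site_entropy(col) for col in zip(*group2)]
    return pearson_r(h1, h2)


# Toy data: the variable site differs between the two groups.
group_a = ["AK", "AR"]   # site 1 conserved, site 2 variable
group_b = ["AK", "GK"]   # site 1 variable, site 2 conserved
r_opposite = entropy_correlation(group_a, group_b)
r_same = entropy_correlation(group_a, group_a)
```

In the toy data the two groups vary at different sites, so the correlation is strongly negative, while comparing a group with itself gives a correlation of 1. Real alignments would need a policy for gaps and ambiguous residues, which this sketch leaves out.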

Source code and EF-related data
Reference: Wang H-C, Susko E. & A. J. Roger, Fast statistical tests for detecting heterotachy in protein evolution. Mol. Biol. Evol. 28: 2305-2315, 2011.

Simulating sequence evolution: various versions of Seq-gen for profile mixture models (C20 and C60), the site-specific frequency (SSF) model, covarion models and other heterotachy models

In developing profile mixture models and PMSF models as well as the covarion and heterotachy models, we frequently need to simulate sequences under these models to evaluate their performance. The sequence simulation programs we often use are listed below.

Seq-gen: a general program developed by Andrew Rambaut and Nicholas Grassly for simulating sequence alignments based on a given phylogenetic tree and common models of the substitution process, described in Rambaut A & N C Grassly, Computer Applications in the Biosciences 13:235-238, 1997.
Seq-gen-cov: Cécile Ané modified Seq-gen to include two covarion models (Tuffley and Steel model and Huelsenbeck model for nucleotide sequences), described in Ané C, Burleigh J, McMahon M & M Sanderson, Mol. Biol. Evol. 22:914–924, 2005.
Seq-gen-aminocov: my former colleague, Matthew Spencer, further modified Seq-gen and Seq-gen-cov so that the simulator can simulate nucleotide and protein evolution under various covarion models, including the Tuffley and Steel model, the Huelsenbeck model, the Galtier model and the general covarion model. The program is described in Wang H-C, Spencer M., Susko E. & A J Roger, J. Mol. Evol. 66: 50-60, 2008.
indel-seq-gen: developed by Strope C L, Abel K, Scott S D & E N Moriyama (Mol. Biol. Evol. 26: 2581-2593, 2009) to allow simulation of insertion and deletion events and the use of multiple related root sequences.
LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites, described in Shavit Grievink L, Penny D, Hendy MD & B R Holland, BMC Evol. Biol. 8:317, 2008.
SiteSpecific.seq-gen: generating sequence data under C20+F, C60+F or site-specific frequencies (SSF), designed by me and described in Wang H-C, Susko E, Minh BQ & A J Roger: Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic Estimation (in submission).
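For readers unfamiliar with what a covarion simulator does at a single site, here is a minimal sketch of a Gillespie-style simulation along one branch under a Tuffley and Steel style process. It is not taken from Seq-gen or any of its variants: the function name `simulate_site`, its parameters and the toy 3-state matrix `Q` are assumptions made for illustration, and a real simulator would repeat this over every site and every branch of a tree.

```python
import random


def simulate_site(start_residue, start_on, branch_length, Q,
                  s_on_off, s_off_on, rng):
    """Simulate one site along one branch; returns (residue, is_on).

    While ON, the site substitutes according to Q or switches OFF at rate
    s_on_off. While OFF, the only possible event is switching ON at rate
    s_off_on. All names here are hypothetical.
    """
    residue, is_on, t = start_residue, start_on, 0.0
    n = len(Q)
    while True:
        if is_on:
            targets = [j for j in range(n) if j != residue]
            weights = [Q[residue][j] for j in targets] + [s_on_off]
        else:
            targets, weights = [], [s_off_on]
        total = sum(weights)
        if total <= 0.0:
            return residue, is_on          # no event can ever happen
        t += rng.expovariate(total)        # waiting time to the next event
        if t >= branch_length:
            return residue, is_on          # branch ends before the event
        if not is_on:
            is_on = True                   # OFF -> ON
            continue
        # Pick a substitution target, or None for the ON -> OFF switch.
        choice = rng.choices(targets + [None], weights=weights)[0]
        if choice is None:
            is_on = False
        else:
            residue = choice


# Toy 3-state substitution matrix standing in for an amino acid model.
Q = [[-2.0, 1.0, 1.0],
     [1.0, -2.0, 1.0],
     [1.0, 1.0, -2.0]]
rng = random.Random(1)                     # seeded for reproducibility
end_residue, end_on = simulate_site(0, True, 1.0, Q, 0.5, 0.25, rng)
```

A site that starts OFF with `s_off_on` set to 0 never changes, which is a quick sanity check on the switching logic.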

Last updated: 10/24/2016