Use an empirical protein substitution model¶
Section author: Gavin Huttley
This file contains an example of importing an empirically determined protein substitution matrix such as Dayhoff et al 1978 and using it to create a substitution model. The globin alignment is from the PAML distribution.
>>> from cogent3 import load_aligned_seqs, make_tree
>>> from cogent3.evolve.substitution_model import EmpiricalProteinMatrix
>>> from cogent3.parse.paml_matrix import PamlMatrixParser
Make a tree object. In this case from a string.
>>> treestring="(((rabbit,rat),human),goat-cow,marsupial);"
>>> t = make_tree(treestring)
Import the alignment, explicitly setting the moltype
to be protein
>>> al = load_aligned_seqs('data/abglobin_aa.phylip',
... moltype="protein",
... )
Open the file that contains the empirical matrix and parse the matrix and frequencies.
>>> matrix_file = open('data/dayhoff.dat')
The PamlMatrixParser
will import the matrix and frequency from files designed for Yang’s PAML package. This format is the lower half of the matrix in three letter amino acid name order white space delineated followed by motif frequencies in the same order.
>>> empirical_matrix, empirical_frequencies = PamlMatrixParser(matrix_file)
Create an Empirical Protein Matrix Substitution model object. This will take the unscaled empirical matrix and use it and the motif frequencies to create a scaled Q matrix.
>>> sm = EmpiricalProteinMatrix(empirical_matrix, empirical_frequencies)
Make a parameter controller, likelihood function object and optimise.
>>> lf = sm.make_likelihood_function(t)
>>> lf.set_alignment(al)
>>> lf.optimise(show_progress=False)
>>> print(lf.get_log_likelihood())
-1706...
>>> print(lf)
Likelihood function statistics
log-likelihood = -1706.2489
number of free parameters = 7
=============================
edge parent length
-----------------------------
rabbit edge.0 0.0785
rat edge.0 0.1750
edge.0 edge.1 0.0324
human edge.1 0.0545
edge.1 root 0.0269
goat-cow root 0.0972
marsupial root 0.2424
-----------------------------
============================================================================
A C D E F G H I
----------------------------------------------------------------------------
0.0871 0.0335 0.0469 0.0495 0.0398 0.0886 0.0336 0.0369
----------------------------------------------------------------------------...