Perform a coevolutionary analysis on biological sequence alignments

Section author: Greg Caporaso

This document describes how to perform a coevolutionary analysis on a ArrayAlignment object. Coevolutionary analyses identify correlated substitution patterns between ArrayAlignment positions (columns). Several coevolution detection methods are currently provided via the PyCogent3 coevolution module. ArrayAlignment objects must always be used as input to these functions.

Before using an alignment in a coevolutionary analysis, you should be confident in the alignment. Poorly aligned sequences can yield very misleading results. There can be no ambiguous residue/base codes (e.g., B/Z/X in protein alignments) – while some of the algorithms could tolerate them (e.g. Mutual Information), others which rely on information such as background residue frequencies (e.g. Statistical Coupling Analysis) cannot handle them. Some recoded amino acid alphabets will also not handle ambiguous residues. The best strategy is just to exclude ambiguous codes all together. To test for invalid characters before starting an analysis you can do the following:

>>> from cogent3 import LoadSeqs, PROTEIN, DNA, RNA
>>> from cogent3.core.alignment import ArrayAlignment
>>> from cogent3.evolve.coevolution import validate_alignment
>>> aln = LoadSeqs(data={'1':'GAA','2':'CTA', '3':'CTC','4':'-TC'},
...                moltype=PROTEIN, array_align=True)
>>> validate_alignment(aln)

To run a coevolutionary analysis, first create a ArrayAlignment:

>>> from cogent3 import LoadSeqs, PROTEIN, DNA, RNA
>>> from cogent3.core.alignment import ArrayAlignment
>>> aln = LoadSeqs(data={'1':'AAA','2':'CTA', '3':'CTC','4':'-TC'},
...                moltype=PROTEIN, array_align=True)

Perform a coevolutionary analysis on a pair of positions in the alignment using mutual information (mi):

>>> from cogent3.evolve.coevolution import coevolve_pair_functions, coevolve_pair
>>> coevolve_pair(coevolve_pair_functions['mi'],aln,pos1=1,pos2=2)
0.31127...

Perform a coevolutionary analysis on a pair of positions in the alignment using statistical coupling analysis (sca):

>>> from cogent3.evolve.coevolution import coevolve_pair_functions, coevolve_pair
>>> coevolve_pair(coevolve_pair_functions['sca'],aln,pos1=1,pos2=2,cutoff=0.5)
0.98053...

Perform a coevolutionary analysis on one position and all other positions in the alignment using mutual information (mi):

>>> from cogent3.evolve.coevolution import coevolve_position_functions, coevolve_position
>>> coevolve_position(coevolve_position_functions['mi'],aln,position=1)
array([        nan,  0.81127812,  0.31127812])

Perform a coevolutionary analysis on all pairs of positions in the alignment using mutual information (mi):

>>> from cogent3.evolve.coevolution import coevolve_alignment_functions, coevolve_alignment
>>> coevolve_alignment(coevolve_alignment_functions['mi'],aln)
array([[        nan,         nan,         nan],
       [        nan,  0.81127812,  0.31127812],
       [        nan,  0.31127812,  1.        ]])

View the available algorithms for computing coevolution values:

>>> print(coevolve_pair_functions.keys())
dict_keys(['an', 'rmi', 'sca', 'mi', 'nmi'])

Perform an intermolecular coevolutionary analysis using mutual information (mi). Note that there are strict requirements on the sequence identifiers for intermolecular analyses, and some important considerations involved in preparing alignments for these analyses. See the coevolve_alignments docstring (i.e., help(coevolve_alignments) from the python interpreter) for information. Briefly, sequence identifiers are split on + symbols. The ids before the + must match perfectly between the two alignments as these are used to match the sequences between alignments. In the following example, these are common species names: human, chicken, echidna, and pig. The text after the + can be anything, and should probably be the original database identifiers of the sequences.

>>> from cogent3.evolve.coevolution import coevolve_alignment_functions,\
...   coevolve_alignments
>>> aln1 = LoadSeqs(data={'human+protein1':'AAA','pig+protein1':'CTA',
...  'chicken+protein1':'CTC','echidna+weird_db_identifier':'-TC'},
...   moltype=PROTEIN, array_align=True)
>>> aln2 = LoadSeqs(data={'pig+protein2':'AAAY','chicken+protein2':'CTAY',
...  'echidna+protein2':'CTCF','human+protein2':'-TCF'},
...   moltype=PROTEIN, array_align=True)
>>> coevolve_alignments(coevolve_alignment_functions['mi'],aln1,aln2)
array([[        nan,         nan,         nan],
       [        nan,  0.12255625,  0.31127812],
       [        nan,  0.31127812,  0.        ],
       [        nan,  0.31127812,  0.        ]])