Genetic distance calculation

Fast pairwise distance estimation

A fast implementation is available for a limited number of evolutionary models.

[1]:
from cogent3 import available_distances

available_distances()
[1]:
Specify a pairwise genetic distance calculator using 'Abbreviation' (case insensitive).
Abbreviation  Suitable for moltype
paralinear    dna, rna, protein
logdet        dna, rna, protein
jc69          dna, rna
tn93          dna, rna
hamming       dna, rna, protein, text
percent       dna, rna, protein, text

6 rows x 2 columns
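To give a feel for what a fast pairwise calculator computes, here is a minimal pure-Python sketch of two of the simpler measures: a count of differing positions and the closed-form JC69 distance, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites. The function names are hypothetical and this is not the cogent3 implementation; in particular, it assumes aligned sequences of equal length and does nothing special for gaps or ambiguous characters.

```python
import math


def hamming(seq1, seq2):
    """Count aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def proportion_different(seq1, seq2):
    """Proportion of aligned positions that differ (the p-distance)."""
    return hamming(seq1, seq2) / len(seq1)


def jc69_distance(p):
    """Closed-form JC69 distance from the proportion of differing sites.

    Only defined for p < 3/4; beyond that the logarithm has no real value.
    """
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy sequences (made up for illustration): they differ at the last two sites.
s1 = "ACGTACGTAC"
s2 = "ACGTACGTTT"

p = proportion_different(s1, s2)
print(hamming(s1, s2), p, round(jc69_distance(p), 4))
```

Because the JC69 correction accounts for multiple substitutions at the same site, the corrected distance is always at least as large as the raw proportion of differences.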

Computing genetic distances using the Alignment object

Any abbreviation listed by available_distances() can be passed as the calc argument of the alignment's distance_matrix() method, i.e. distance_matrix(calc=<abbreviation>).

[2]:
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists
[2]:
            Chimpanzee  Galago  Gorilla  HowlerMon  Human  Orangutan  Rhesus
Chimpanzee       0.000   0.192    0.005      0.070  0.009      0.014   0.040
Galago           0.192   0.000    0.192      0.216  0.196      0.194   0.196
Gorilla          0.005   0.192    0.000      0.070  0.009      0.014   0.039
HowlerMon        0.070   0.216    0.070      0.000  0.074      0.072   0.074
Human            0.009   0.196    0.009      0.074  0.000      0.017   0.042
Orangutan        0.014   0.194    0.014      0.072  0.017      0.000   0.041
Rhesus           0.040   0.196    0.039      0.074  0.042      0.041   0.000
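Conceptually, a distance matrix like the one above is the result of applying a pairwise calculator to every pair of sequences and storing each value symmetrically. The sketch below shows that pattern in plain Python with a simple proportion-of-differences measure standing in for the calculator. The helper names and toy sequences are hypothetical, and this is an illustration of the idea rather than how cogent3 builds its matrix.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def pairwise_matrix(seqs, calc):
    """Apply `calc` to every pair of named sequences.

    Returns a nested dict so that matrix[a][b] == matrix[b][a],
    with zeros on the diagonal.
    """
    names = list(seqs)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        dist = calc(seqs[a], seqs[b])
        matrix[a][b] = dist
        matrix[b][a] = dist
    return matrix


# Toy aligned sequences, made up for illustration.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGA"}
matrix = pairwise_matrix(toy, p_distance)

for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in toy])
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry, which halves the number of pairwise comparisons.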

Using the distance calculator directly

[3]:
from cogent3 import load_aligned_seqs, get_distance_calculator
aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")
dist_calc = get_distance_calculator("tn93", alignment=aln)
dist_calc
[3]:
<cogent3.evolve.fast_distance.TN93Pair at 0x1163fc820>
[4]:
dist_calc.run(show_progress=False)
dists = dist_calc.get_pairwise_distances()
dists
[4]:
            Chimpanzee  Galago  Gorilla  HowlerMon  Human  Orangutan  Rhesus
Chimpanzee       0.000   0.192    0.005      0.070  0.009      0.014   0.040
Galago           0.192   0.000    0.192      0.216  0.196      0.194   0.196
Gorilla          0.005   0.192    0.000      0.070  0.009      0.014   0.039
HowlerMon        0.070   0.216    0.070      0.000  0.074      0.072   0.074
Human            0.009   0.196    0.009      0.074  0.000      0.017   0.042
Orangutan        0.014   0.194    0.014      0.072  0.017      0.000   0.041
Rhesus           0.040   0.196    0.039      0.074  0.042      0.041   0.000

The distance calculator object can provide more information, for instance the standard errors of the estimated distances.

[5]:
dist_calc.stderr
[5]:
Standard Error of Pairwise Distances
Seq1 \ Seq2  Galago  HowlerMon  Rhesus  Orangutan  Gorilla   Human  Chimpanzee
Galago            0     0.0103  0.0096     0.0095   0.0095  0.0096      0.0095
HowlerMon    0.0103          0  0.0054     0.0053   0.0053  0.0054      0.0053
Rhesus       0.0096     0.0054       0     0.0039   0.0039  0.0040      0.0039
Orangutan    0.0095     0.0053  0.0039          0   0.0022  0.0025      0.0023
Gorilla      0.0095     0.0053  0.0039     0.0022        0  0.0018      0.0014
Human        0.0096     0.0054  0.0040     0.0025   0.0018       0      0.0018
Chimpanzee   0.0095     0.0053  0.0039     0.0023   0.0014  0.0018           0

7 rows x 8 columns
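One common use of a standard error is to form an approximate confidence interval around the distance estimate. The sketch below does this for the Human and Chimpanzee pair using the rounded values printed in the tables above (distance 0.009, standard error 0.0018). It assumes the estimate is roughly normally distributed, which is an assumption of this example and not something the calculator guarantees; the helper name is hypothetical.

```python
def confidence_interval(dist, stderr, z=1.96):
    """Approximate interval dist +/- z * stderr (z=1.96 gives about 95%).

    Assumes an approximately normal sampling distribution. The lower
    bound is truncated at zero because a distance cannot be negative.
    """
    lower = max(0.0, dist - z * stderr)
    upper = dist + z * stderr
    return lower, upper


# Rounded values copied from the printed tables for Human vs Chimpanzee.
lower, upper = confidence_interval(0.009, 0.0018)
print(f"{lower:.4f} to {upper:.4f}")
```

Since the inputs are the rounded table values, the interval is only indicative; working from the unrounded estimates would be preferable in practice.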

Likelihood based pairwise distance estimation

The standard cogent3 likelihood function can also be used to estimate distances. Because this approach requires numerical optimisation, it can be significantly slower than the fast estimation approach above.
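To illustrate where the extra cost comes from, the following pure-Python sketch estimates a JC69 distance for a single pair in two ways: from the closed-form expression, and by numerically maximising the pairwise log-likelihood with a golden-section search. JC69 is used here only because its likelihood is easy to write down; the function names, the search bounds and the site counts are all hypothetical, and this is not the optimiser cogent3 uses.

```python
import math


def jc69_closed_form(n_sites, n_diff):
    """Closed-form JC69 distance from counts of sites and differences."""
    p = n_diff / n_sites
    return -0.75 * math.log(1 - 4 * p / 3)


def jc69_log_likelihood(d, n_sites, n_diff):
    """Pairwise JC69 log-likelihood at distance d, up to a constant."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site is unchanged
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


def golden_section_max(func, lower, upper, tol=1e-7):
    """Maximise a unimodal function on [lower, upper].

    Returns the estimated maximiser and the number of function evaluations.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    evaluations = 0
    while upper - lower > tol:
        x1 = upper - inv_phi * (upper - lower)
        x2 = lower + inv_phi * (upper - lower)
        if func(x1) > func(x2):
            upper = x2  # the maximum lies to the left of x2
        else:
            lower = x1  # the maximum lies to the right of x1
        evaluations += 2
    return (lower + upper) / 2, evaluations


# Hypothetical counts: 100 aligned sites, 20 of which differ.
n_sites, n_diff = 100, 20
closed = jc69_closed_form(n_sites, n_diff)
ml, n_evals = golden_section_max(
    lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 3.0
)
print(round(closed, 4), round(ml, 4), n_evals)
```

The two estimates agree, but the numerical route needs many likelihood evaluations for one pair and one parameter, whereas the closed form is a single expression. Models without a closed-form distance have no such shortcut.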

The following will use the F81 nucleotide substitution model and perform numerical optimisation.

[6]:
from cogent3 import load_aligned_seqs, get_model
from cogent3.evolve import distance

aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype="dna")
d = distance.EstimateDistances(aln, submodel=get_model("F81"))
d.run(show_progress=False)
dists = d.get_pairwise_distances()
dists
[6]:
            Chimpanzee  Galago  Gorilla  HowlerMon  Human  Orangutan  Rhesus
Chimpanzee       0.000   0.189    0.005      0.070  0.009      0.014   0.039
Galago           0.189   0.000    0.189      0.211  0.193      0.192   0.193
Gorilla          0.005   0.189    0.000      0.069  0.009      0.014   0.039
HowlerMon        0.070   0.211    0.069      0.000  0.073      0.071   0.073
Human            0.009   0.193    0.009      0.073  0.000      0.017   0.042
Orangutan        0.014   0.192    0.014      0.071  0.017      0.000   0.041
Rhesus           0.039   0.193    0.039      0.073  0.042      0.041   0.000

All cogent3 substitution models can be used for distance calculation via this approach, with the caveat that identifiability issues mean this is not possible for some non-stationary model classes.