{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Genetic distance calculation\n",
"\n",
"## Fast pairwise distance estimation\n",
"\n",
"For a limited number of evolutionary models a fast implementation is available."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"Specify a pairwise genetic distance calculator using 'Abbreviation' (case insensitive).\n",
"\n",
"Abbreviation | \n",
"Suitable for moltype | \n",
"\n",
"\n",
"\n",
"paralinear | \n",
"dna, rna, protein | \n",
"
\n",
"\n",
"logdet | \n",
"dna, rna, protein | \n",
"
\n",
"\n",
"jc69 | \n",
"dna, rna | \n",
"
\n",
"\n",
"tn93 | \n",
"dna, rna | \n",
"
\n",
"\n",
"hamming | \n",
"dna, rna, protein, text | \n",
"
\n",
"\n",
"percent | \n",
"dna, rna, protein, text | \n",
"
\n",
"\n",
"
\n",
"\n",
"6 rows x 2 columns
"
],
"text/plain": [
"Specify a pairwise genetic distance calculator using 'Abbreviation' (case insensitive).\n",
"=======================================\n",
"Abbreviation Suitable for moltype\n",
"---------------------------------------\n",
" paralinear dna, rna, protein\n",
" logdet dna, rna, protein\n",
" jc69 dna, rna\n",
" tn93 dna, rna\n",
" hamming dna, rna, protein, text\n",
" percent dna, rna, protein, text\n",
"---------------------------------------\n",
"\n",
"6 rows x 2 columns"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from cogent3 import available_distances\n",
"\n",
"available_distances()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Computing genetic distances using the `Alignment` object\n",
"\n",
"Abbreviations listed from `available_distances()` can be used as values for the `distance_matrix(calc=)`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" | \n",
"Chimpanzee | \n",
"Galago | \n",
"Gorilla | \n",
"HowlerMon | \n",
"Human | \n",
"Orangutan | \n",
"Rhesus | \n",
"\n",
"\n",
"\n",
"Chimpanzee | \n",
"0.000 | \n",
"0.192 | \n",
"0.005 | \n",
"0.070 | \n",
"0.009 | \n",
"0.014 | \n",
"0.040 | \n",
"
\n",
"\n",
"Galago | \n",
"0.192 | \n",
"0.000 | \n",
"0.192 | \n",
"0.216 | \n",
"0.196 | \n",
"0.194 | \n",
"0.196 | \n",
"
\n",
"\n",
"Gorilla | \n",
"0.005 | \n",
"0.192 | \n",
"0.000 | \n",
"0.070 | \n",
"0.009 | \n",
"0.014 | \n",
"0.039 | \n",
"
\n",
"\n",
"HowlerMon | \n",
"0.070 | \n",
"0.216 | \n",
"0.070 | \n",
"0.000 | \n",
"0.074 | \n",
"0.072 | \n",
"0.074 | \n",
"
\n",
"\n",
"Human | \n",
"0.009 | \n",
"0.196 | \n",
"0.009 | \n",
"0.074 | \n",
"0.000 | \n",
"0.017 | \n",
"0.042 | \n",
"
\n",
"\n",
"Orangutan | \n",
"0.014 | \n",
"0.194 | \n",
"0.014 | \n",
"0.072 | \n",
"0.017 | \n",
"0.000 | \n",
"0.041 | \n",
"
\n",
"\n",
"Rhesus | \n",
"0.040 | \n",
"0.196 | \n",
"0.039 | \n",
"0.074 | \n",
"0.042 | \n",
"0.041 | \n",
"0.000 | \n",
"
\n",
"\n",
"
\n"
],
"text/plain": [
"===========================================================================================\n",
" Chimpanzee Galago Gorilla HowlerMon Human Orangutan Rhesus\n",
"-------------------------------------------------------------------------------------------\n",
"Chimpanzee 0.0000 0.1921 0.0054 0.0704 0.0089 0.0140 0.0396\n",
" Galago 0.1921 0.0000 0.1923 0.2157 0.1965 0.1944 0.1962\n",
" Gorilla 0.0054 0.1923 0.0000 0.0700 0.0086 0.0137 0.0393\n",
" HowlerMon 0.0704 0.2157 0.0700 0.0000 0.0736 0.0719 0.0736\n",
" Human 0.0089 0.1965 0.0086 0.0736 0.0000 0.0173 0.0423\n",
" Orangutan 0.0140 0.1944 0.0137 0.0719 0.0173 0.0000 0.0411\n",
" Rhesus 0.0396 0.1962 0.0393 0.0736 0.0423 0.0411 0.0000\n",
"-------------------------------------------------------------------------------------------"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from cogent3 import load_aligned_seqs\n",
"aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype=\"dna\")\n",
"dists = aln.distance_matrix(calc=\"tn93\", show_progress=False)\n",
"dists"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Using the distance calculator directly"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from cogent3 import load_aligned_seqs, get_distance_calculator\n",
"aln = load_aligned_seqs('../data/primate_brca1.fasta')\n",
"dist_calc = get_distance_calculator(\"tn93\", alignment=aln)\n",
"dist_calc"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" | \n",
"Chimpanzee | \n",
"Galago | \n",
"Gorilla | \n",
"HowlerMon | \n",
"Human | \n",
"Orangutan | \n",
"Rhesus | \n",
"\n",
"\n",
"\n",
"Chimpanzee | \n",
"0.000 | \n",
"0.192 | \n",
"0.005 | \n",
"0.070 | \n",
"0.009 | \n",
"0.014 | \n",
"0.040 | \n",
"
\n",
"\n",
"Galago | \n",
"0.192 | \n",
"0.000 | \n",
"0.192 | \n",
"0.216 | \n",
"0.196 | \n",
"0.194 | \n",
"0.196 | \n",
"
\n",
"\n",
"Gorilla | \n",
"0.005 | \n",
"0.192 | \n",
"0.000 | \n",
"0.070 | \n",
"0.009 | \n",
"0.014 | \n",
"0.039 | \n",
"
\n",
"\n",
"HowlerMon | \n",
"0.070 | \n",
"0.216 | \n",
"0.070 | \n",
"0.000 | \n",
"0.074 | \n",
"0.072 | \n",
"0.074 | \n",
"
\n",
"\n",
"Human | \n",
"0.009 | \n",
"0.196 | \n",
"0.009 | \n",
"0.074 | \n",
"0.000 | \n",
"0.017 | \n",
"0.042 | \n",
"
\n",
"\n",
"Orangutan | \n",
"0.014 | \n",
"0.194 | \n",
"0.014 | \n",
"0.072 | \n",
"0.017 | \n",
"0.000 | \n",
"0.041 | \n",
"
\n",
"\n",
"Rhesus | \n",
"0.040 | \n",
"0.196 | \n",
"0.039 | \n",
"0.074 | \n",
"0.042 | \n",
"0.041 | \n",
"0.000 | \n",
"
\n",
"\n",
"
\n"
],
"text/plain": [
"===========================================================================================\n",
" Chimpanzee Galago Gorilla HowlerMon Human Orangutan Rhesus\n",
"-------------------------------------------------------------------------------------------\n",
"Chimpanzee 0.0000 0.1921 0.0054 0.0704 0.0089 0.0140 0.0396\n",
" Galago 0.1921 0.0000 0.1923 0.2157 0.1965 0.1944 0.1962\n",
" Gorilla 0.0054 0.1923 0.0000 0.0700 0.0086 0.0137 0.0393\n",
" HowlerMon 0.0704 0.2157 0.0700 0.0000 0.0736 0.0719 0.0736\n",
" Human 0.0089 0.1965 0.0086 0.0736 0.0000 0.0173 0.0423\n",
" Orangutan 0.0140 0.1944 0.0137 0.0719 0.0173 0.0000 0.0411\n",
" Rhesus 0.0396 0.1962 0.0393 0.0736 0.0423 0.0411 0.0000\n",
"-------------------------------------------------------------------------------------------"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dist_calc.run(show_progress=False)\n",
"dists = dist_calc.get_pairwise_distances()\n",
"dists"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The distance calculation object can provide more information. For instance, the standard errors.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"Standard Error of Pairwise Distances\n",
"\n",
"Seq1 \\ Seq2 | \n",
"Galago | \n",
"HowlerMon | \n",
"Rhesus | \n",
"Orangutan | \n",
"Gorilla | \n",
"Human | \n",
"Chimpanzee | \n",
"\n",
"\n",
"\n",
"Galago | \n",
"0 | \n",
"0.0103 | \n",
"0.0096 | \n",
"0.0095 | \n",
"0.0095 | \n",
"0.0096 | \n",
"0.0095 | \n",
"
\n",
"\n",
"HowlerMon | \n",
"0.0103 | \n",
"0 | \n",
"0.0054 | \n",
"0.0053 | \n",
"0.0053 | \n",
"0.0054 | \n",
"0.0053 | \n",
"
\n",
"\n",
"Rhesus | \n",
"0.0096 | \n",
"0.0054 | \n",
"0 | \n",
"0.0039 | \n",
"0.0039 | \n",
"0.0040 | \n",
"0.0039 | \n",
"
\n",
"\n",
"Orangutan | \n",
"0.0095 | \n",
"0.0053 | \n",
"0.0039 | \n",
"0 | \n",
"0.0022 | \n",
"0.0025 | \n",
"0.0023 | \n",
"
\n",
"\n",
"Gorilla | \n",
"0.0095 | \n",
"0.0053 | \n",
"0.0039 | \n",
"0.0022 | \n",
"0 | \n",
"0.0018 | \n",
"0.0014 | \n",
"
\n",
"\n",
"Human | \n",
"0.0096 | \n",
"0.0054 | \n",
"0.0040 | \n",
"0.0025 | \n",
"0.0018 | \n",
"0 | \n",
"0.0018 | \n",
"
\n",
"\n",
"Chimpanzee | \n",
"0.0095 | \n",
"0.0053 | \n",
"0.0039 | \n",
"0.0023 | \n",
"0.0014 | \n",
"0.0018 | \n",
"0 | \n",
"
\n",
"\n",
"
\n",
"\n",
"7 rows x 8 columns
"
],
"text/plain": [
"Standard Error of Pairwise Distances\n",
"============================================================================================\n",
"Seq1 \\ Seq2 Galago HowlerMon Rhesus Orangutan Gorilla Human Chimpanzee\n",
"--------------------------------------------------------------------------------------------\n",
" Galago 0 0.0103 0.0096 0.0095 0.0095 0.0096 0.0095\n",
" HowlerMon 0.0103 0 0.0054 0.0053 0.0053 0.0054 0.0053\n",
" Rhesus 0.0096 0.0054 0 0.0039 0.0039 0.0040 0.0039\n",
" Orangutan 0.0095 0.0053 0.0039 0 0.0022 0.0025 0.0023\n",
" Gorilla 0.0095 0.0053 0.0039 0.0022 0 0.0018 0.0014\n",
" Human 0.0096 0.0054 0.0040 0.0025 0.0018 0 0.0018\n",
" Chimpanzee 0.0095 0.0053 0.0039 0.0023 0.0014 0.0018 0\n",
"--------------------------------------------------------------------------------------------\n",
"\n",
"7 rows x 8 columns"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dist_calc.stderr"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Likelihood based pairwise distance estimation\n",
"\n",
"The standard ``cogent3`` likelihood function can also be used to estimate distances. Because these require numerical optimisation they can be significantly slower than the fast estimation approach above.\n",
"\n",
"The following will use the F81 nucleotide substitution model and perform numerical optimisation."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" | \n",
"Chimpanzee | \n",
"Galago | \n",
"Gorilla | \n",
"HowlerMon | \n",
"Human | \n",
"Orangutan | \n",
"Rhesus | \n",
"\n",
"\n",
"\n",
"Chimpanzee | \n",
"0.000 | \n",
"0.189 | \n",
"0.005 | \n",
"0.070 | \n",
"0.009 | \n",
"0.014 | \n",
"0.039 | \n",
"
\n",
"\n",
"Galago | \n",
"0.189 | \n",
"0.000 | \n",
"0.189 | \n",
"0.211 | \n",
"0.193 | \n",
"0.192 | \n",
"0.193 | \n",
"
\n",
"\n",
"Gorilla | \n",
"0.005 | \n",
"0.189 | \n",
"0.000 | \n",
"0.069 | \n",
"0.009 | \n",
"0.014 | \n",
"0.039 | \n",
"
\n",
"\n",
"HowlerMon | \n",
"0.070 | \n",
"0.211 | \n",
"0.069 | \n",
"0.000 | \n",
"0.073 | \n",
"0.071 | \n",
"0.073 | \n",
"
\n",
"\n",
"Human | \n",
"0.009 | \n",
"0.193 | \n",
"0.009 | \n",
"0.073 | \n",
"0.000 | \n",
"0.017 | \n",
"0.042 | \n",
"
\n",
"\n",
"Orangutan | \n",
"0.014 | \n",
"0.192 | \n",
"0.014 | \n",
"0.071 | \n",
"0.017 | \n",
"0.000 | \n",
"0.041 | \n",
"
\n",
"\n",
"Rhesus | \n",
"0.039 | \n",
"0.193 | \n",
"0.039 | \n",
"0.073 | \n",
"0.042 | \n",
"0.041 | \n",
"0.000 | \n",
"
\n",
"\n",
"
\n"
],
"text/plain": [
"===========================================================================================\n",
" Chimpanzee Galago Gorilla HowlerMon Human Orangutan Rhesus\n",
"-------------------------------------------------------------------------------------------\n",
"Chimpanzee 0.0000 0.1892 0.0054 0.0697 0.0089 0.0140 0.0395\n",
" Galago 0.1892 0.0000 0.1891 0.2112 0.1934 0.1915 0.1930\n",
" Gorilla 0.0054 0.1891 0.0000 0.0693 0.0086 0.0136 0.0391\n",
" HowlerMon 0.0697 0.2112 0.0693 0.0000 0.0729 0.0713 0.0729\n",
" Human 0.0089 0.1934 0.0086 0.0729 0.0000 0.0173 0.0421\n",
" Orangutan 0.0140 0.1915 0.0136 0.0713 0.0173 0.0000 0.0410\n",
" Rhesus 0.0395 0.1930 0.0391 0.0729 0.0421 0.0410 0.0000\n",
"-------------------------------------------------------------------------------------------"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from cogent3 import load_aligned_seqs, get_model\n",
"from cogent3.evolve import distance\n",
"\n",
"aln = load_aligned_seqs('../data/primate_brca1.fasta', moltype=\"dna\")\n",
"d = distance.EstimateDistances(aln, submodel=get_model(\"F81\"))\n",
"d.run(show_progress=False)\n",
"dists = d.get_pairwise_distances()\n",
"dists"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"All `cogent3` substitution models can be used for distance calculation via this approach, with the caveat that identifiability issues mean this is not possible for some non-stationary model classes."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:c3dev] *",
"language": "python",
"name": "conda-env-c3dev-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}