Calculate pairwise distances between sequences

Section author: Gavin Huttley

This example shows how to calculate the pairwise distances for a set of aligned sequences.

>>> from cogent3 import load_aligned_seqs
>>> from cogent3.evolve import distance

Import a substitution model (or create your own)

>>> from cogent3.evolve.models import HKY85

Load the alignment.

>>> al = load_aligned_seqs("data/long_testseqs.fasta")

Create a pairwise distances object with your alignment and substitution model

>>> d = distance.EstimateDistances(al, submodel=HKY85())

Printing d before execution shows its status.

>>> print(d)
=========================================================================
Seq1 \ Seq2       Human    HowlerMon       Mouse    NineBande    DogFaced
-------------------------------------------------------------------------
      Human           *     Not Done    Not Done     Not Done    Not Done
  HowlerMon    Not Done            *    Not Done     Not Done    Not Done
      Mouse    Not Done     Not Done           *     Not Done    Not Done
  NineBande    Not Done     Not Done    Not Done            *    Not Done
   DogFaced    Not Done     Not Done    Not Done     Not Done           *
-------------------------------------------------------------------------

In this case, the status simply indicates that no distances have been computed yet.

>>> d.run(show_progress=False)
>>> print(d)
=====================================================================
Seq1 \ Seq2     Human    HowlerMon     Mouse    NineBande    DogFaced
---------------------------------------------------------------------
      Human         *       0.0730    0.3363       0.1804      0.1972
  HowlerMon    0.0730            *    0.3487       0.1865      0.2078
      Mouse    0.3363       0.3487         *       0.3813      0.4022
  NineBande    0.1804       0.1865    0.3813            *      0.2019
   DogFaced    0.1972       0.2078    0.4022       0.2019           *
---------------------------------------------------------------------
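
To make the structure of this table concrete, the following sketch builds a symmetric pairwise matrix in plain Python. It is not how cogent3 estimates distances: instead of fitting HKY85, it uses the much simpler proportion of differing sites (the p-distance). The toy sequences and the helper names `p_distance` and `pairwise_distances` are made up for illustration.

```python
from itertools import combinations

# Toy aligned sequences (hypothetical data, not taken from long_testseqs.fasta).
seqs = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTTT",
}


def p_distance(s1, s2):
    """Proportion of aligned, ungapped positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    diffs = sum(a != b for a, b in pairs)
    return diffs / len(pairs)


def pairwise_distances(seqs):
    """Return a dict keyed by (name1, name2), storing both orderings of each pair."""
    dists = {}
    for n1, n2 in combinations(seqs, 2):
        d = p_distance(seqs[n1], seqs[n2])
        dists[(n1, n2)] = dists[(n2, n1)] = d
    return dists


dists = pairwise_distances(seqs)
for pair in sorted(dists):
    print(pair, round(dists[pair], 2))
```

Each unordered pair is evaluated once and stored under both orderings, which is why the printed matrix is symmetric about the diagonal.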

Note that the pairwise distance calculations can be distributed across multiple CPUs. In that case, when statistics (such as distances) are requested, only the master CPU returns data.

We’ll write a PHYLIP-formatted distance matrix to file.

>>> d.write('dists_for_phylo.phylip', format="phylip")
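
For readers unfamiliar with the format, here is a rough sketch of what a square PHYLIP distance matrix looks like. The helper `format_phylip` is hypothetical and is not cogent3's writer; it assumes the strict PHYLIP layout (a taxon count on the first line, then names padded to 10 characters followed by the distances), and the file cogent3 produces may differ in spacing or precision.

```python
def format_phylip(names, dists):
    """Format a square distance matrix in an assumed strict PHYLIP layout."""
    lines = [f"{len(names):5d}"]
    for n1 in names:
        row = [0.0 if n1 == n2 else dists[(n1, n2)] for n2 in names]
        # Strict PHYLIP pads or truncates names to 10 characters.
        lines.append(f"{n1[:10]:<10}" + "".join(f"{v:9.4f}" for v in row))
    return "\n".join(lines)


names = ["Human", "HowlerMon", "Mouse"]
# Three values copied from the distance table above.
dists = {
    ("Human", "HowlerMon"): 0.0730,
    ("Human", "Mouse"): 0.3363,
    ("HowlerMon", "Mouse"): 0.3487,
}
# Store the reverse ordering of each pair so lookups work in both directions.
dists.update({(b, a): v for (a, b), v in list(dists.items())})

print(format_phylip(names, dists))
```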

We’ll also save the distances to file in Python’s pickle format.

>>> import pickle
>>> with open('dists_for_phylo.pickle', "wb") as f:
...     pickle.dump(d.get_pairwise_distances(), f)
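
The saved distances can later be read back with `pickle.load`. The sketch below is a self-contained round trip that uses a stand-in dictionary instead of a real result. It assumes the object returned by `get_pairwise_distances()` behaves like a dict keyed by pairs of sequence names, which may not hold for every cogent3 version, and it writes to a temporary directory so that no existing file is overwritten.

```python
import os
import pickle
import tempfile

# Stand-in for the object returned by d.get_pairwise_distances();
# assumed here to behave like a dict keyed by (name1, name2) tuples.
saved = {("Human", "HowlerMon"): 0.0730, ("Human", "Mouse"): 0.3363}

path = os.path.join(tempfile.mkdtemp(), "dists_for_phylo.pickle")

# Write the distances; the with-block closes the file even if an error occurs.
with open(path, "wb") as f:
    pickle.dump(saved, f)

# Read them back in a later session.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded[("Human", "Mouse")])
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.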