# Building alignments¶

## Using the cogent3 aligners¶

### Running a progressive aligner¶

We import useful functions and then load the sequences to be aligned.

>>> from cogent3 import LoadSeqs, LoadTree, DNA
>>> seqs = LoadSeqs('data/test2.fasta', aligned=False, moltype=DNA)


#### For nucleotides¶

We load a canned nucleotide substitution model and the progressive aligner TreeAlign function.

>>> from cogent3.evolve.models import HKY85
>>> from cogent3.align.progressive import TreeAlign


We first align without providing a guide tree. The TreeAlign algorithm builds pairwise alignments and estimates the substitution model parameters and pairwise distances. The distances are used to build a neighbour joining tree and the median value of substitution model parameters are provided to the substitution model for the progressive alignment step.

>>> aln, tree = TreeAlign(HKY85(), seqs)
Param Estimate Summary Stats: kappa
==============================
Statistic        Value
------------------------------
Count           10
Sum        1e+06
Median        4.256
Mean        1e+05
StandardDeviation    3.162e+05
Variance        1e+11
------------------------------
>>> aln
5 x 60 bytes alignment: NineBande[-C-----GCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...


We then align using a guide tree (pre-estimated) and specifying the ratio of transitions to transversions (kappa).

>>> tree = LoadTree(treestring='(((NineBande:0.0128202449453,Mouse:0.184732725695):0.0289459522137,DogFaced:0.0456427810916):0.0271363715538,Human:0.0341320714654,HowlerMon:0.0188456837006)root;')
>>> params={'kappa': 4.0}
>>> aln, tree = TreeAlign(HKY85(), seqs, tree=tree, param_vals=params)
>>> aln
5 x 60 bytes alignment: NineBande[-C-----GCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...


#### For codons¶

We load a canned codon substitution model and use a pre-defined tree and parameter estimates.

>>> from cogent3.evolve.models import MG94HKY
>>> params={'kappa': 4.0, 'omega': 1.3}
>>> aln, tree = TreeAlign(MG94HKY(), seqs, tree=tree, param_vals=params)
>>> aln
5 x 60 bytes alignment: NineBande[------CGCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...


## Converting gaps from aa-seq alignment to nuc seq alignment¶

We load some unaligned DNA sequences and show their translation.

>>> from cogent3 import LoadSeqs, DNA, PROTEIN
>>> seqs = [('hum', 'AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT'),
...         ('mus', 'AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG'),
...         ('rat', 'CTGAACAAGCAGCCACTTTCAAACAAGAAA')]
>>> unaligned_DNA = LoadSeqs(data=seqs, moltype=DNA, aligned=False)
>>> print(unaligned_DNA.to_fasta())
>hum
AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT
>mus
AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG
>rat
CTGAACAAGCAGCCACTTTCAAACAAGAAA
>>> print(unaligned_DNA.get_translation())
>hum
KQIQESSENGSLAARQERQAQVNLT
>mus
KQIQESGESGSLAARQERQAQVNLT
>rat
LNKQPLSNKK


We load an alignment of these protein sequences.

>>> aligned_aa_seqs = [('hum', 'KQIQESSENGSLAARQERQAQVNLT'),
...                    ('mus', 'KQIQESGESGSLAARQERQAQVNLT'),
...                    ('rat', 'LNKQ------PLS---------NKK')]


We then obtain an alignment of the DNA sequences from the alignment of their translation.

>>> aligned_DNA = aligned_aa.replace_seqs(unaligned_DNA, aa_to_codon=True)
>>> print(aligned_DNA)
>hum
AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT
>mus
AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG
>rat
CTGAACAAGCAG------------------CCACTTTCA---------------------------AACAAGAAA


Setting the argument aa_to_codons=False is only useful when the sequences have exactly the length. One use case is to allow introducing the gaps onto another copy of the alignment where there are annotations.