Annotations

Annotations with coordinates

For more extensive documentation about annotations see Advanced sequence handling.

Automated introduction from reading genbank files

We load a sample genbank file with plenty of features and grab the CDS features.

>>> from cogent3.parse.genbank import RichGenbankParser
>>> parser = RichGenbankParser(open('data/ST_genome_part.gb'))
>>> for accession, seq in parser:
...     print(accession)
...
AE006468
>>> cds = seq.get_annotations_matching('CDS')
>>> print(cds)
[CDS "thrL" at [189:255]/10020, CDS "thrA" at ...

Customising annotation construction from reading a genbank file

You can write your own code to construct annotation objects. One reason you might do this is some genbank files do not have a /gene tag on gene related features, instead only possessing a /locus_tag. For illustrating the approach we only create annotations for CDS features. We write a custom callback function that uses the locus_tag as the Feature name.

>>> from cogent3.core.annotation import Feature
>>> def add_annotation(seq, feature, spans):
...     type_ = feature['type']
...     if type_ != 'CDS':
...         return
...     name = feature.get('locus_tag', None)
...     if name and not isinstance(name, str):
...         name = ' '.join(name)
...     seq.add_annotation(Feature, type_, name, spans)
...
>>> parser = RichGenbankParser(open('data/ST_genome_part.gb'),
...          add_annotation=add_annotation)
>>> for accession, seq in parser: # just reading one accession,sequence
...     break
...
>>> genes = seq.get_annotations_matching('CDS')
>>> print(genes)
[CDS "STM0001" at [189:255]/10020, CDS "STM0002" at [336:2799]/10020...

Creating directly on a sequence

>>> from cogent3 import DNA
>>> from cogent3.core.annotation import Feature
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> print(s1[10:15]) # this will be exon 1
CCCCC
>>> print(s1[30:40]) # this will be exon 2
TTTTTAAAAA
>>> print(s1[45:48]) # this will be exon 3
CCC
>>> s2 = DNA.make_seq("CGAAACGTTT", name="seq2")
>>> s3 = DNA.make_seq("CGAAACGTTT", name="seq3")

Via

add_annotation

>>> from cogent3 import DNA
>>> from cogent3.core.annotation import Feature
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> exon1 = s1.add_annotation(Feature, 'exon', 'A', [(10,15)])
>>> exon2 = s1.add_annotation(Feature, 'exon', 'B', [(30,40)])

add_feature

>>> from cogent3 import DNA
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> exon3 = s1.add_feature('exon', 'C', [(45, 48)])

There are other annotation types.

Adding as a series or item-wise

>>> from cogent3 import DNA
>>> s2 = DNA.make_seq("CGAAACGTTT", name="seq2")
>>> cpgs_series = s2.add_feature('cpgsite', 'cpg', [(0,2), (5,7)])
>>> s3 = DNA.make_seq("CGAAACGTTT", name="seq3")
>>> cpg1 = s3.add_feature('cpgsite', 'cpg', [(0,2)])
>>> cpg2 = s3.add_feature('cpgsite', 'cpg', [(5,7)])

Taking the union of annotations

Construct a pseudo-feature (cds) that’s a union of other features (exon1, exon2, exon3).

>>> from cogent3 import DNA
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> exon1 = s1.add_feature('exon', 'A', [(10,15)])
>>> exon2 = s1.add_feature('exon', 'B', [(30,40)])
>>> exon3 = s1.add_feature('exon', 'C', [(45, 48)])
>>> cds = s1.get_region_covering_all([exon1, exon2, exon3])

Getting annotation coordinates

These are useful for doing custom things, e.g. you could construct intron features using the below.

>>> cds.get_coordinates()
[(10, 15), (30, 40), (45, 48)]

Annotations have shadows

A shadow is a span representing everything but the annotation.

>>> not_cds = cds.get_shadow()
>>> not_cds
region "not exon" at [0:10, 15:30, 40:45, 48:49]/49

Compare to the coordinates of the original.

>>> cds
region "exon" at [10:15, 30:40, 45:48]/49

Adding to a sequence member of an alignment

The following annotation is directly applied onto the sequence and so is in ungapped sequence coordinates.

>>> from cogent3 import make_aligned_seqs
>>> aln1 = make_aligned_seqs(data=[['x','-AAACCCCCA'],
...                       ['y','TTTT--TTTT']], array_align=False)
>>> seq_exon = aln1.get_seq('x').add_feature('exon', 'A', [(3,8)])

Adding to an alignment

We add an annotation directly onto an alignment. In this example we add a Variable that can be displayed as a red line on a drawing. The resulting annotation (red_data here) is in alignment coordinates!

>>> from cogent3.core.annotation import Variable
>>> red_data = aln1.add_annotation(Variable, 'redline', 'align',
...              [((0,15),1),((15,30),2),((30,45),3)])
...

Slicing sequences and alignments by annotations

By a feature or coordinates returns same sequence span

>>> from cogent3 import DNA
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> exon1 = s1.add_feature('exon', 'A', [(10,15)])
>>> exon2 = s1.add_feature('exon', 'B', [(30,40)])
>>> s1[exon1]
DnaSequence(CCCCC)
>>> s1[10:15]
DnaSequence(CCCCC)

Using the annotation object get_slice method returns the same thing.

>>> s1[exon2]
DnaSequence(TTTTTAAAAA)
>>> exon2.get_slice()
DnaSequence(TTTTTAAAAA)

Slicing by pseudo-feature or feature series

>>> from cogent3 import DNA
>>> s1 = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAA"\
...                      "TTTTTTTTTTAAAAAGGGAACCCT",
...                      name="seq1")
...
>>> exon1 = s1.add_feature('exon', 'A', [(10,15)])
>>> exon2 = s1.add_feature('exon', 'B', [(30,40)])
>>> exon3 = s1.add_feature('exon', 'C', [(45, 48)])
>>> cds = s1.get_region_covering_all([exon1, exon2, exon3])
>>> print(s1[cds])
CCCCCTTTTTAAAAACCC
>>> print(s1[exon1, exon2, exon3])
CCCCCTTTTTAAAAACCC

Warning

Slices are applied in order!

>>> print(s1)
AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAGGGAACCCT
>>> print(s1[exon1, exon2, exon3])
CCCCCTTTTTAAAAACCC
>>> print(s1[exon2])
TTTTTAAAAA
>>> print(s1[exon3])
CCC
>>> print(s1[exon1, exon3, exon2])
CCCCCCCCTTTTTAAAAA

Slice series must not be overlapping

>>> s1[1:10, 9:15]
Traceback (most recent call last):
ValueError: Uninvertable. Overlap: 9 < 10
>>> s1[exon1, exon1]
Traceback (most recent call last):
ValueError: Uninvertable. Overlap: 10 < 15

But get_region_covering_all resolves this, ensuring no overlaps.

>>> print(s1.get_region_covering_all([exon3, exon3]).get_slice())
CCC

You can slice an annotation itself

>>> print(s1[exon2])
TTTTTAAAAA
>>> ex2_start = exon2[0:3]
>>> print(s1[ex2_start])
TTT
>>> ex2_end = exon2[-3:]
>>> print(s1[ex2_end])
AAA

Sequence vs Alignment slicing

You can’t slice an alignment using an annotation from a sequence.

>>> aln1[seq_exon]
Traceback (most recent call last):
ValueError: Can't map exon "A" at [3:8]/9 onto 2 x 10 text alignment: x[-AAACCCCCA], y[TTTT--TTTT] via []

Copying annotations

You can copy annotations onto sequences with the same name, even if the length differs

>>> aln2 = make_aligned_seqs(data=[['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']],
...                 array_align=False)
>>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x')
>>> match_exon = seq.add_feature('exon', 'A', [(3,8)])
>>> aln2.get_seq('x').copy_annotations(seq)
>>> copied = list(aln2.get_annotations_from_seq('x', 'exon'))
>>> copied
[exon "A" at [4:9]/10]

but if the feature lies outside the sequence being copied to, you get a lost span

>>> aln2 = make_aligned_seqs(data=[['x', '-AAAA'], ['y', 'TTTTT']], array_align=False)
>>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x')
>>> match_exon = seq.add_feature('exon', 'A', [(5,8)])
>>> aln2.get_seq('x').copy_annotations(seq)
>>> copied = list(aln2.get_annotations_from_seq('x', 'exon'))
>>> copied
[exon "A" at [5:5, -4-]/5]
>>> copied[0].get_slice()
2 x 4 text alignment: x[----], y[----]

You can copy to a sequence with a different name, in a different alignment if the feature lies within the length

>>> # new test
>>> aln2 = make_aligned_seqs(data=[['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']],
...                  array_align=False)
>>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x')
>>> match_exon = seq.add_feature('exon', 'A', [(5,8)])
>>> aln2.get_seq('y').copy_annotations(seq)
>>> copied = list(aln2.get_annotations_from_seq('y', 'exon'))
>>> copied
[exon "A" at [7:10]/10]

If the sequence is shorter, again you get a lost span.

>>> aln2 = make_aligned_seqs(data=[['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']],
...                 array_align=False)
>>> diff_len_seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCCCCCCCCCC', 'x')
>>> nonmatch = diff_len_seq.add_feature('repeat', 'A', [(12,14)])
>>> aln2.get_seq('y').copy_annotations(diff_len_seq)
>>> copied = list(aln2.get_annotations_from_seq('y', 'repeat'))
>>> copied
[repeat "A" at [10:10, -6-]/10]

Querying

You need to get a corresponding annotation projected into alignment coordinates via a query.

>>> aln_exon = aln1.get_annotations_from_any_seq('exon')
>>> print(aln1[aln_exon])
>x
CCCCC
>y
--TTT

Querying produces objects only valid for their source

>>> cpgsite2 = s2.get_annotations_matching('cpgsite')
>>> print(s2[cpgsite2])
CGCG
>>> cpgsite3 = s3.get_annotations_matching('cpgsite')
>>> s2[cpgsite3]
Traceback (most recent call last):
ValueError: Can't map cpgsite "cpg" at [0:2]/10 onto DnaSequence(CGAAACGTTT) via []

Querying for absent annotation

You get back an empty list, and slicing with this returns an empty sequence.

>>> # this test is new
>>> dont_exist = s2.get_annotations_matching('dont_exist')
>>> dont_exist
[]
>>> s2[dont_exist]
DnaSequence()

Querying features that span gaps in alignments

If you query for a feature from a sequence, it’s alignment coordinates may be discontinuous.

>>> aln3 = make_aligned_seqs(data=[['x', 'C-CCCAAAAA'], ['y', '-T----TTTT']],
...                 array_align=False)
>>> exon = aln3.get_seq('x').add_feature('exon', 'ex1', [(0,4)])
>>> print(exon.get_slice())
CCCC
>>> aln_exons = list(aln3.get_annotations_from_seq('x', 'exon'))
>>> print(aln_exons)
[exon "ex1" at [0:1, 2:5]/10]
>>> print(aln3[aln_exons])
>x
CCCC
>y
----

Note

The T opposite the gap is missing since this approach only returns positions directly corresponding to the feature.

as_one_span unifies features with discontinuous alignment coordinates

To get positions spanned by a feature, including gaps, use as_one_span.

>>> unified = aln_exons[0].as_one_span()
>>> print(aln3[unified])
>x
C-CCC
>y
-T---

Behaviour of annotations on nucleic acid sequences

Reverse complementing a sequence does not reverse annotations, that is they retain the reference to the frame for which they were defined.

>>> plus = DNA.make_seq("CCCCCAAAAAAAAAATTTTTTTTTTAAAGG")
>>> plus_rpt = plus.add_feature('blah', 'a', [(5,15), (25, 28)])
>>> print(plus[plus_rpt])
AAAAAAAAAAAAA
>>> minus = plus.rc()
>>> print(minus)
CCTTTAAAAAAAAAATTTTTTTTTTGGGGG
>>> minus_rpt = minus.get_annotations_matching('blah')
>>> print(minus[minus_rpt])
AAAAAAAAAAAAA

Masking annotated regions

We mask the CDS regions.

>>> from cogent3.parse.genbank import RichGenbankParser
>>> parser = RichGenbankParser(open('data/ST_genome_part.gb'))
>>> seq = [seq for accession, seq in parser][0]
>>> no_cds = seq.with_masked_annotations('CDS')
>>> print(no_cds[150:400])
CAAGACAGACAAATAAAAATGACAGAGTACACAACATCC?????????...

The above sequence could then have positions filtered so no position with the ambiguous character ‘?’ was present.

Masking annotated regions on alignments

We mask exon’s on an alignment.

>>> from cogent3 import make_aligned_seqs
>>> aln = make_aligned_seqs(data=[['x', 'C-CCCAAAAAGGGAA'],
...                      ['y', '-T----TTTTG-GTT']],
...                moltype="dna", array_align=False)
>>> exon = aln.get_seq('x').add_feature('exon', 'norwegian', [(0,4)])
>>> print(aln.with_masked_annotations('exon', mask_char='?'))
>x
?-???AAAAAGGGAA
>y
-T----TTTTG-GTT

These also persist through reverse complement operations.

>>> rc = aln.rc()
>>> print(rc)
>x
TTCCCTTTTTGGG-G
>y
AAC-CAAAA----A-

>>> print(rc.with_masked_annotations('exon', mask_char='?'))
>x
TTCCCTTTTT???-?
>y
AAC-CAAAA----A-

You can take mask of the shadow

>>> from cogent3 import DNA
>>> s = DNA.make_seq('CCCCAAAAAGGGAA', 'x')
>>> exon = s.add_feature('exon', 'norwegian', [(0,4)])
>>> rpt = s.add_feature('repeat', 'norwegian', [(9, 12)])
>>> rc = s.rc()
>>> print(s.with_masked_annotations('exon', shadow=True))
CCCC??????????
>>> print(rc.with_masked_annotations('exon', shadow=True))
??????????GGGG
>>> print(s.with_masked_annotations(['exon', 'repeat'], shadow=True))
CCCC?????GGG??
>>> print(rc.with_masked_annotations(['exon', 'repeat'], shadow=True))
??CCC?????GGGG

What features of a certain type are available?

>>> from cogent3 import DNA
>>> s = DNA.make_seq('ATGACCCTGTAAAAAATGTGTTAACCC',
...    name='a')
>>> cds1 = s.add_feature('cds','cds1', [(0,12)])
>>> cds2 = s.add_feature('cds','cds2', [(15,24)])
>>> all_cds = s.get_annotations_matching('cds')
>>> all_cds
[cds "cds1" at [0:12]/27, cds "cds2" at [15:24]/27]

Getting all features of a type, or everything but that type

The annotation methods get_region_covering_all and get_shadow can be used to grab all the coding sequences or non-coding sequences in a DnaSequence object.

>>> from cogent3.parse.genbank import RichGenbankParser
>>> parser = RichGenbankParser(open('data/ST_genome_part.gb'))
>>> seq = [seq for accession, seq in parser][0]
>>> all_cds = seq.get_annotations_matching('CDS')
>>> coding_seqs = seq.get_region_covering_all(all_cds)
>>> coding_seqs
region "CDS" at [189:255, 336:2799, 2800:3730, 3733...
>>> coding_seqs.get_slice()
DnaSequence(ATGAACC... 9063)
>>> noncoding_seqs = coding_seqs.get_shadow()
>>> noncoding_seqs
region "not CDS" at [0:189, 255:336, 2799:2800, ...
>>> noncoding_seqs.get_slice()
DnaSequence(AGAGATT... 957)

Getting sequence features when you have an alignment object

Sequence features can be accessed via a containing Alignment.

>>> from cogent3 import make_aligned_seqs
>>> aln = make_aligned_seqs(data=[['x','-AAAAAAAAA'], ['y','TTTT--TTTT']],
...                array_align=False)
>>> print(aln)
>x
-AAAAAAAAA
>y
TTTT--TTTT

>>> exon = aln.get_seq('x').add_feature('exon', '1', [(3,8)])
>>> aln_exons = aln.get_annotations_from_seq('x', 'exon')
>>> aln_exons = aln.get_annotations_from_any_seq('exon')
>>> aln_exons
[exon "1" at [4:9]/10]

Annotation display on sequences

We can display annotations on sequences, writing to file.

Note

This requires matplotlib be installed.

We first make a sequence and add some annotations.

>>> from cogent3 import DNA
>>> seq = DNA.make_seq('aaaccggttt' * 10)
>>> v = seq.add_feature('exon', 'exon', [(20,35)])
>>> v = seq.add_feature('repeat_unit', 'repeat_unit', [(39,49)])
>>> v = seq.add_feature('repeat_unit', 'rep2', [(49,60)])