Complete version of manipulating sequence annotations¶
Section author: Peter Maxwell, Gavin Huttley
A Sequence with a couple of exons on it.
>>> from cogent3 import DNA
>>> from cogent3.core.annotation import Feature
>>> s = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAAAAAAAAA",
... name="Orig")
>>> exon1 = s.add_annotation(Feature, 'exon', 'fred', [(10,15)])
>>> exon2 = s.add_annotation(Feature, 'exon', 'trev', [(30,40)])
The corresponding sequence can be extracted either with slice notation or by asking the feature to do it, since the feature knows what sequence it belongs to.
>>> s[exon1]
DnaSequence(CCCCC)
>>> exon1.get_slice()
DnaSequence(CCCCC)
Usually the only way to get a Feature
object like exon1
is to ask the sequence for it. There is one method for querying annotations by type and optionally by name:
>>> exons = s.get_annotations_matching('exon')
>>> print(exons)
[exon "fred" at [10:15]/48, exon "trev" at [30:40]/48]
If the sequence does not have a matching feature you get back an empty list, and slicing the sequence with that returns a sequence of length 0.
>>> dont_exist = s.get_annotations_matching('dont_exist')
>>> dont_exist
[]
>>> s[dont_exist]
DnaSequence()
To construct a pseudo-feature covering (or excluding) multiple features, use get_region_covering_all
:
>>> print(s.get_region_covering_all(exons))
region "exon" at [10:15, 30:40]/48
>>> print(s.get_region_covering_all(exons).get_shadow())
region "not exon" at [0:10, 15:30, 40:48]/48
eg: all the exon sequence:
>>> s.get_region_covering_all(exons).get_slice()
DnaSequence(CCCCCTT... 15)
or with slice notation:
>>> s[exon1, exon2]
DnaSequence(CCCCCTT... 15)
Though .get_region_covering_all
also guarantees no overlaps within the result, slicing does not:
>>> print(s.get_region_covering_all(exons+exons))
region "exon" at [10:15, 30:40]/48
>>> s[exon1, exon1, exon1, exon1, exon1]
Traceback (most recent call last):
ValueError: Uninvertable. Overlap: 10 < 15
You can use features, maps, slices or integers, but non-monotonic slices are not allowed:
>>> s[15:20, 5:16]
Traceback (most recent call last):
ValueError: Uninvertable. Overlap: 15 < 16
Features are themselves sliceable:
>>> exon1[0:3].get_slice()
DnaSequence(CCC)
When sequences are concatenated they keep their (non-overlapping) annotations:
>>> c = s[exon1[4:]]+s
>>> print(len(c))
49
>>> for feat in c.annotations:
... print(feat)
...
exon "fred" at [-4-, 0:1]/49
exon "fred" at [11:16]/49
exon "trev" at [31:41]/49
Since features know their parents you can’t use a feature from one sequence to slice another:
>>> print(c[exon1])
Traceback (most recent call last):
ValueError: Can't map exon "fred" at [10:15]/48 onto ...
Features are generally attached to the thing they annotate, but in those cases where a free-floating feature is created it can later be attached:
>>> len(s.annotations)
2
>>> region = s.get_region_covering_all(exons)
>>> len(s.annotations)
2
>>> region.attach()
>>> len(s.annotations)
3
>>> region.detach()
>>> len(s.annotations)
2
When dealing with sequences that can be reverse complemented (e.g. DnaSequence
) features are not reversed. Features are considered to have strand specific meaning (.e.g CDS, exons) and so stay on their original strands. We create a sequence with a CDS that spans multiple exons, and show that after getting the reverse complement we have exactly the same result from getting the CDS annotation.
>>> plus = DNA.make_seq("AAGGGGAAAACCCCCAAAAAAAAAATTTTTTTTTTAAA",
... name="plus")
>>> plus_cds = plus.add_annotation(Feature, 'CDS', 'gene',
... [(2,6),(10,15),(25,35)])
>>> print(plus_cds.get_slice())
GGGGCCCCCTTTTTTTTTT
>>> minus = plus.rc()
>>> minus_cds = minus.get_annotations_matching('CDS')[0]
>>> print(minus_cds.get_slice())
GGGGCCCCCTTTTTTTTTT
Sequence features can be accessed via a containing Alignment
:
>>> from cogent3 import make_aligned_seqs
>>> aln = make_aligned_seqs([['x','-AAAAAAAAA'], ['y','TTTT--TTTT']], array_align=False)
>>> print(aln)
>x
-AAAAAAAAA
>y
TTTT--TTTT
>>> exon = aln.get_seq('x').add_annotation(Feature, 'exon', 'fred', [(3,8)])
>>> aln_exons = aln.get_annotations_from_seq('x', 'exon')
>>> aln_exons = aln.get_annotations_from_any_seq('exon')
But these will be returned as alignment features with locations in alignment coordinates.
>>> print(exon)
exon "fred" at [3:8]/9
>>> print(aln_exons[0])
exon "fred" at [4:9]/10
>>> print(aln_exons[0].get_slice())
>x
AAAAA
>y
--TTT
>>> aln_exons[0].attach()
>>> len(aln.annotations)
1
Similarly alignment features can be projected onto the aligned sequences, where they may end up falling across gaps:
>>> exons = aln.get_projected_annotations('y', 'exon')
>>> print(exons)
[exon "fred" at [-2-, 4:7]/8]
>>> print(aln.get_seq('y')[exons[0].map.without_gaps()])
TTT
We copy the annotations from another sequence,
>>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--CCCC']], array_align=False)
>>> s = DNA.make_seq("AAAAAAAAA", name="x")
>>> exon = s.add_annotation(Feature, 'exon', 'fred', [(3,8)])
>>> exon = aln.get_seq('x').copy_annotations(s)
>>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon'))
>>> print(aln_exons)
[exon "fred" at [4:9]/10]
even if the name is different.
>>> exon = aln.get_seq('y').copy_annotations(s)
>>> aln_exons = list(aln.get_annotations_from_seq('y', 'exon'))
>>> print(aln_exons)
[exon "fred" at [3:4, 6:10]/10]
>>> print(aln[aln_exons])
>x
AAAAA
>y
TCCCC
If the feature lies outside the sequence being copied to, you get a lost span
>>> aln = make_aligned_seqs([['x', '-AAAA'], ['y', 'TTTTT']], array_align=False)
>>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x')
>>> exon = seq.add_feature('exon', 'A', [(5,8)])
>>> aln.get_seq('x').copy_annotations(seq)
>>> copied = list(aln.get_annotations_from_seq('x', 'exon'))
>>> copied
[exon "A" at [5:5, -4-]/5]
>>> copied[0].get_slice()
2 x 4 text alignment: x[----], y[----]
You can copy to a sequence with a different name, in a different alignment if the feature lies within the length
>>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']], array_align=False)
>>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x')
>>> match_exon = seq.add_feature('exon', 'A', [(5,8)])
>>> aln.get_seq('y').copy_annotations(seq)
>>> copied = list(aln.get_annotations_from_seq('y', 'exon'))
>>> copied
[exon "A" at [7:10]/10]
If the sequence is shorter, again you get a lost span.
>>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']], array_align=False)
>>> diff_len_seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCCCCCCCCCC', 'x')
>>> nonmatch = diff_len_seq.add_feature('repeat', 'A', [(12,14)])
>>> aln.get_seq('y').copy_annotations(diff_len_seq)
>>> copied = list(aln.get_annotations_from_seq('y', 'repeat'))
>>> copied
[repeat "A" at [10:10, -6-]/10]
We consider cases where there are terminal gaps.
>>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', '------TTTT']], array_align=False)
>>> exon = aln.get_seq('x').add_feature('exon', 'fred', [(3,8)])
>>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon'))
>>> print(aln_exons)
[exon "fred" at [4:9]/10]
>>> print(aln_exons[0].get_slice())
>x
AAAAA
>y
--TTT
>>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--T---']], array_align=False)
>>> exon = aln.get_seq('x').add_feature('exon', 'fred', [(3,8)])
>>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon'))
>>> print(aln_exons[0].get_slice())
>x
AAAAA
>y
--T--
In this case, only those residues included within the feature are covered - note the omission of the T in y
opposite the gap in x
.
>>> aln = make_aligned_seqs([['x', 'C-CCCAAAAA'], ['y', '-T----TTTT']],
... moltype="dna", array_align=False)
>>> print(aln)
>x
C-CCCAAAAA
>y
-T----TTTT
>>> exon = aln.get_seq('x').add_feature('exon', 'ex1', [(0,4)])
>>> print(exon)
exon "ex1" at [0:4]/9
>>> print(exon.get_slice())
CCCC
>>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon'))
>>> print(aln_exons)
[exon "ex1" at [0:1, 2:5]/10]
>>> print(aln_exons[0].get_slice())
>x
CCCC
>y
----
Feature.as_one_span()
, is applied to the exon that straddles the gap in x
. The result is we preserve that feature.
>>> print(aln_exons[0].as_one_span().get_slice())
>x
C-CCC
>y
-T---
These properties also are consistently replicated with reverse complemented sequences.
>>> aln_rc = aln.rc()
>>> rc_exons = list(aln_rc.get_annotations_from_any_seq('exon'))
>>> print(aln_rc[rc_exons]) # not using as_one_span, so gap removed from x
>x
CCCC
>y
----
>>> print(aln_rc[rc_exons[0].as_one_span()])
>x
C-CCC
>y
-T---
Features can provide their coordinates, useful for custom analyses.
>>> all_exons = aln.get_region_covering_all(aln_exons)
>>> coords = all_exons.get_coordinates()
>>> assert coords == [(0,1),(2,5)]
Annotated regions can be masked (observed sequence characters replaced by another), either through the sequence on which they reside or by projection from the alignment. Note that mask_char
must be a valid character for the sequence MolType
. Either the features (multiple can be named), or their shadow, can be masked.
We create an alignment with a sequence that has two different annotation types.
>>> aln = make_aligned_seqs([['x', 'C-CCCAAAAAGGGAA'], ['y', '-T----TTTTG-GTT']],
... array_align=False)
>>> print(aln)
>x
C-CCCAAAAAGGGAA
>y
-T----TTTTG-GTT
>>> exon = aln.get_seq('x').add_feature('exon', 'norwegian', [(0,4)])
>>> print(exon.get_slice())
CCCC
>>> repeat = aln.get_seq('x').add_feature('repeat', 'blue', [(9,12)])
>>> print(repeat.get_slice())
GGG
>>> repeat = aln.get_seq('y').add_feature('repeat', 'frog', [(5,7)])
>>> print(repeat.get_slice())
GG
Each sequence should correctly mask either the single feature, it’s shadow, or the multiple features, or shadow.
>>> print(aln.get_seq('x').with_masked_annotations('exon', mask_char='?'))
????AAAAAGGGAA
>>> print(aln.get_seq('x').with_masked_annotations('exon', mask_char='?',
... shadow=True))
CCCC??????????
>>> print(aln.get_seq('x').with_masked_annotations(['exon', 'repeat'],
... mask_char='?'))
????AAAAA???AA
>>> print(aln.get_seq('x').with_masked_annotations(['exon', 'repeat'],
... mask_char='?', shadow=True))
CCCC?????GGG??
>>> print(aln.get_seq('y').with_masked_annotations('exon', mask_char='?'))
TTTTTGGTT
>>> print(aln.get_seq('y').with_masked_annotations('repeat', mask_char='?'))
TTTTT??TT
>>> print(aln.get_seq('y').with_masked_annotations('repeat', mask_char='?',
... shadow=True))
?????GG??
The same methods can be applied to annotated Alignment’s.
>>> print(aln.with_masked_annotations('exon', mask_char='?'))
>x
?-???AAAAAGGGAA
>y
-T----TTTTG-GTT
>>> print(aln.with_masked_annotations('exon', mask_char='?', shadow=True))
>x
C-CCC??????????
>y
-?----?????-???
>>> print(aln.with_masked_annotations('repeat', mask_char='?'))
>x
C-CCCAAAAA???AA
>y
-T----TTTT?-?TT
>>> print(aln.with_masked_annotations('repeat', mask_char='?', shadow=True))
>x
?-????????GGG??
>y
-?----????G-G??
>>> print(aln.with_masked_annotations(['repeat', 'exon'], mask_char='?'))
>x
?-???AAAAA???AA
>y
-T----TTTT?-?TT
>>> print(aln.with_masked_annotations(['repeat', 'exon'],shadow=True))
>x
C-CCC?????GGG??
>y
-?----????G-G??
It shouldn’t matter whether annotated coordinates are entered separately, or as a series.
>>> data = [['human', 'CGAAACGTTT'], ['mouse', 'CTAAACGTCG']]
>>> as_series = make_aligned_seqs(data, array_align=False)
>>> as_items = make_aligned_seqs(data, array_align=False)
We add annotations to the sequences as a series.
>>> as_series.get_seq('human').add_feature('cpgsite', 'cpg', [(0,2), (5,7)])
cpgsite "cpg" at [0:2, 5:7]/10
>>> as_series.get_seq('mouse').add_feature('cpgsite', 'cpg', [(5,7), (8,10)])
cpgsite "cpg" at [5:7, 8:10]/10
We add the annotations to the sequences one segment at a time.
>>> as_items.get_seq('human').add_feature('cpgsite', 'cpg', [(0,2)])
cpgsite "cpg" at [0:2]/10
>>> as_items.get_seq('human').add_feature('cpgsite', 'cpg', [(5,7)])
cpgsite "cpg" at [5:7]/10
>>> as_items.get_seq('mouse').add_feature('cpgsite', 'cpg', [(5,7)])
cpgsite "cpg" at [5:7]/10
>>> as_items.get_seq('mouse').add_feature('cpgsite', 'cpg', [(8,10)])
cpgsite "cpg" at [8:10]/10
These different constructions should generate the same output.
>>> serial = as_series.with_masked_annotations(['cpgsite'])
>>> print(serial)
>human
??AAA??TTT
>mouse
CTAAA??T??
>>> itemwise = as_items.with_masked_annotations(['cpgsite'])
>>> print(itemwise)
>human
??AAA??TTT
>mouse
CTAAA??T??
Annotations should be correctly masked, whether the sequence has been reverse complemented or not. We use the plus/minus strand CDS containing sequences created above.
>>> print(plus.with_masked_annotations("CDS"))
AA????AAAA?????AAAAAAAAAA??????????AAA
>>> print(minus.with_masked_annotations("CDS"))
TTT??????????TTTTTTTTTT?????TTTT????TT