Complete version of manipulating sequence annotations ===================================================== .. sectionauthor:: Peter Maxwell, Gavin Huttley A Sequence with a couple of exons on it. .. doctest:: >>> from cogent3 import DNA >>> from cogent3.core.annotation import Feature >>> s = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAAAAAAAAA", ... name="Orig") >>> exon1 = s.add_annotation(Feature, 'exon', 'fred', [(10,15)]) >>> exon2 = s.add_annotation(Feature, 'exon', 'trev', [(30,40)]) The corresponding sequence can be extracted either with slice notation or by asking the feature to do it, since the feature knows what sequence it belongs to. .. doctest:: >>> s[exon1] DnaSequence(CCCCC) >>> exon1.get_slice() DnaSequence(CCCCC) Usually the only way to get a ``Feature`` object like ``exon1`` is to ask the sequence for it. There is one method for querying annotations by type and optionally by name: .. doctest:: >>> exons = s.get_annotations_matching('exon') >>> print(exons) [exon "fred" at [10:15]/48, exon "trev" at [30:40]/48] If the sequence does not have a matching feature you get back an empty list, and slicing the sequence with that returns a sequence of length 0. .. doctest:: >>> dont_exist = s.get_annotations_matching('dont_exist') >>> dont_exist [] >>> s[dont_exist] DnaSequence() To construct a pseudo-feature covering (or excluding) multiple features, use ``get_region_covering_all``: .. doctest:: >>> print(s.get_region_covering_all(exons)) region "exon" at [10:15, 30:40]/48 >>> print(s.get_region_covering_all(exons).get_shadow()) region "not exon" at [0:10, 15:30, 40:48]/48 eg: all the exon sequence: .. doctest:: >>> s.get_region_covering_all(exons).get_slice() DnaSequence(CCCCCTT... 15) or with slice notation: .. doctest:: >>> s[exon1, exon2] DnaSequence(CCCCCTT... 15) Though ``.get_region_covering_all`` also guarantees no overlaps within the result, slicing does not: .. doctest:: >>> print(s.get_region_covering_all(exons+exons)) region "exon" at [10:15, 30:40]/48 >>> s[exon1, exon1, exon1, exon1, exon1] Traceback (most recent call last): ValueError: Uninvertable. Overlap: 10 < 15 You can use features, maps, slices or integers, but non-monotonic slices are not allowed: .. doctest:: >>> s[15:20, 5:16] Traceback (most recent call last): ValueError: Uninvertable. Overlap: 15 < 16 Features are themselves sliceable: .. doctest:: >>> exon1[0:3].get_slice() DnaSequence(CCC) When sequences are concatenated they keep their (non-overlapping) annotations: .. doctest:: >>> c = s[exon1[4:]]+s >>> print(len(c)) 49 >>> for feat in c.annotations: ... print(feat) ... exon "fred" at [-4-, 0:1]/49 exon "fred" at [11:16]/49 exon "trev" at [31:41]/49 Since features know their parents you can't use a feature from one sequence to slice another: .. doctest:: >>> print(c[exon1]) Traceback (most recent call last): ValueError: Can't map exon "fred" at [10:15]/48 onto ... Features are generally attached to the thing they annotate, but in those cases where a free-floating feature is created it can later be attached: .. doctest:: >>> len(s.annotations) 2 >>> region = s.get_region_covering_all(exons) >>> len(s.annotations) 2 >>> region.attach() >>> len(s.annotations) 3 >>> region.detach() >>> len(s.annotations) 2 When dealing with sequences that can be reverse complemented (e.g. ``DnaSequence``) features are **not** reversed. Features are considered to have strand specific meaning (.e.g CDS, exons) and so stay on their original strands. We create a sequence with a CDS that spans multiple exons, and show that after getting the reverse complement we have exactly the same result from getting the CDS annotation. .. doctest:: >>> plus = DNA.make_seq("AAGGGGAAAACCCCCAAAAAAAAAATTTTTTTTTTAAA", ... name="plus") >>> plus_cds = plus.add_annotation(Feature, 'CDS', 'gene', ... [(2,6),(10,15),(25,35)]) >>> print(plus_cds.get_slice()) GGGGCCCCCTTTTTTTTTT >>> minus = plus.rc() >>> minus_cds = minus.get_annotations_matching('CDS')[0] >>> print(minus_cds.get_slice()) GGGGCCCCCTTTTTTTTTT Sequence features can be accessed via a containing ``Alignment``: .. doctest:: >>> from cogent3 import make_aligned_seqs >>> aln = make_aligned_seqs([['x','-AAAAAAAAA'], ['y','TTTT--TTTT']], array_align=False) >>> print(aln) >x -AAAAAAAAA >y TTTT--TTTT >>> exon = aln.get_seq('x').add_annotation(Feature, 'exon', 'fred', [(3,8)]) >>> aln_exons = aln.get_annotations_from_seq('x', 'exon') >>> aln_exons = aln.get_annotations_from_any_seq('exon') But these will be returned as **alignment** features with locations in alignment coordinates. .. doctest:: >>> print(exon) exon "fred" at [3:8]/9 >>> print(aln_exons[0]) exon "fred" at [4:9]/10 >>> print(aln_exons[0].get_slice()) >x AAAAA >y --TTT >>> aln_exons[0].attach() >>> len(aln.annotations) 1 Similarly alignment features can be projected onto the aligned sequences, where they may end up falling across gaps: .. doctest:: >>> exons = aln.get_projected_annotations('y', 'exon') >>> print(exons) [exon "fred" at [-2-, 4:7]/8] >>> print(aln.get_seq('y')[exons[0].map.without_gaps()]) TTT We copy the annotations from another sequence, .. doctest:: >>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--CCCC']], array_align=False) >>> s = DNA.make_seq("AAAAAAAAA", name="x") >>> exon = s.add_annotation(Feature, 'exon', 'fred', [(3,8)]) >>> exon = aln.get_seq('x').copy_annotations(s) >>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon')) >>> print(aln_exons) [exon "fred" at [4:9]/10] even if the name is different. .. doctest:: >>> exon = aln.get_seq('y').copy_annotations(s) >>> aln_exons = list(aln.get_annotations_from_seq('y', 'exon')) >>> print(aln_exons) [exon "fred" at [3:4, 6:10]/10] >>> print(aln[aln_exons]) >x AAAAA >y TCCCC If the feature lies outside the sequence being copied to, you get a lost span .. doctest:: >>> aln = make_aligned_seqs([['x', '-AAAA'], ['y', 'TTTTT']], array_align=False) >>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x') >>> exon = seq.add_feature('exon', 'A', [(5,8)]) >>> aln.get_seq('x').copy_annotations(seq) >>> copied = list(aln.get_annotations_from_seq('x', 'exon')) >>> copied [exon "A" at [5:5, -4-]/5] >>> copied[0].get_slice() 2 x 4 text alignment: x[----], y[----] You can copy to a sequence with a different name, in a different alignment if the feature lies within the length .. doctest:: >>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']], array_align=False) >>> seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCC', 'x') >>> match_exon = seq.add_feature('exon', 'A', [(5,8)]) >>> aln.get_seq('y').copy_annotations(seq) >>> copied = list(aln.get_annotations_from_seq('y', 'exon')) >>> copied [exon "A" at [7:10]/10] If the sequence is shorter, again you get a lost span. .. doctest:: >>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--TTTT']], array_align=False) >>> diff_len_seq = DNA.make_seq('CCCCCCCCCCCCCCCCCCCCCCCCCCCC', 'x') >>> nonmatch = diff_len_seq.add_feature('repeat', 'A', [(12,14)]) >>> aln.get_seq('y').copy_annotations(diff_len_seq) >>> copied = list(aln.get_annotations_from_seq('y', 'repeat')) >>> copied [repeat "A" at [10:10, -6-]/10] We consider cases where there are terminal gaps. .. doctest:: >>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', '------TTTT']], array_align=False) >>> exon = aln.get_seq('x').add_feature('exon', 'fred', [(3,8)]) >>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon')) >>> print(aln_exons) [exon "fred" at [4:9]/10] >>> print(aln_exons[0].get_slice()) >x AAAAA >y --TTT >>> aln = make_aligned_seqs([['x', '-AAAAAAAAA'], ['y', 'TTTT--T---']], array_align=False) >>> exon = aln.get_seq('x').add_feature('exon', 'fred', [(3,8)]) >>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon')) >>> print(aln_exons[0].get_slice()) >x AAAAA >y --T-- In this case, only those residues included within the feature are covered - note the omission of the T in ``y`` opposite the gap in ``x``. .. doctest:: >>> aln = make_aligned_seqs([['x', 'C-CCCAAAAA'], ['y', '-T----TTTT']], ... moltype="dna", array_align=False) >>> print(aln) >x C-CCCAAAAA >y -T----TTTT >>> exon = aln.get_seq('x').add_feature('exon', 'ex1', [(0,4)]) >>> print(exon) exon "ex1" at [0:4]/9 >>> print(exon.get_slice()) CCCC >>> aln_exons = list(aln.get_annotations_from_seq('x', 'exon')) >>> print(aln_exons) [exon "ex1" at [0:1, 2:5]/10] >>> print(aln_exons[0].get_slice()) >x CCCC >y ---- ``Feature.as_one_span()``, is applied to the exon that straddles the gap in ``x``. The result is we preserve that feature. .. doctest:: >>> print(aln_exons[0].as_one_span().get_slice()) >x C-CCC >y -T--- These properties also are consistently replicated with reverse complemented sequences. .. doctest:: >>> aln_rc = aln.rc() >>> rc_exons = list(aln_rc.get_annotations_from_any_seq('exon')) >>> print(aln_rc[rc_exons]) # not using as_one_span, so gap removed from x >x CCCC >y ---- >>> print(aln_rc[rc_exons[0].as_one_span()]) >x C-CCC >y -T--- Features can provide their coordinates, useful for custom analyses. .. doctest:: >>> all_exons = aln.get_region_covering_all(aln_exons) >>> coords = all_exons.get_coordinates() >>> assert coords == [(0,1),(2,5)] Annotated regions can be masked (observed sequence characters replaced by another), either through the sequence on which they reside or by projection from the alignment. Note that ``mask_char`` must be a valid character for the sequence ``MolType``. Either the features (multiple can be named), or their shadow, can be masked. We create an alignment with a sequence that has two different annotation types. .. doctest:: >>> aln = make_aligned_seqs([['x', 'C-CCCAAAAAGGGAA'], ['y', '-T----TTTTG-GTT']], ... array_align=False) >>> print(aln) >x C-CCCAAAAAGGGAA >y -T----TTTTG-GTT >>> exon = aln.get_seq('x').add_feature('exon', 'norwegian', [(0,4)]) >>> print(exon.get_slice()) CCCC >>> repeat = aln.get_seq('x').add_feature('repeat', 'blue', [(9,12)]) >>> print(repeat.get_slice()) GGG >>> repeat = aln.get_seq('y').add_feature('repeat', 'frog', [(5,7)]) >>> print(repeat.get_slice()) GG Each sequence should correctly mask either the single feature, it's shadow, or the multiple features, or shadow. .. doctest:: >>> print(aln.get_seq('x').with_masked_annotations('exon', mask_char='?')) ????AAAAAGGGAA >>> print(aln.get_seq('x').with_masked_annotations('exon', mask_char='?', ... shadow=True)) CCCC?????????? >>> print(aln.get_seq('x').with_masked_annotations(['exon', 'repeat'], ... mask_char='?')) ????AAAAA???AA >>> print(aln.get_seq('x').with_masked_annotations(['exon', 'repeat'], ... mask_char='?', shadow=True)) CCCC?????GGG?? >>> print(aln.get_seq('y').with_masked_annotations('exon', mask_char='?')) TTTTTGGTT >>> print(aln.get_seq('y').with_masked_annotations('repeat', mask_char='?')) TTTTT??TT >>> print(aln.get_seq('y').with_masked_annotations('repeat', mask_char='?', ... shadow=True)) ?????GG?? The same methods can be applied to annotated Alignment's. .. doctest:: >>> print(aln.with_masked_annotations('exon', mask_char='?')) >x ?-???AAAAAGGGAA >y -T----TTTTG-GTT >>> print(aln.with_masked_annotations('exon', mask_char='?', shadow=True)) >x C-CCC?????????? >y -?----?????-??? >>> print(aln.with_masked_annotations('repeat', mask_char='?')) >x C-CCCAAAAA???AA >y -T----TTTT?-?TT >>> print(aln.with_masked_annotations('repeat', mask_char='?', shadow=True)) >x ?-????????GGG?? >y -?----????G-G?? >>> print(aln.with_masked_annotations(['repeat', 'exon'], mask_char='?')) >x ?-???AAAAA???AA >y -T----TTTT?-?TT >>> print(aln.with_masked_annotations(['repeat', 'exon'],shadow=True)) >x C-CCC?????GGG?? >y -?----????G-G?? It shouldn't matter whether annotated coordinates are entered separately, or as a series. .. doctest:: >>> data = [['human', 'CGAAACGTTT'], ['mouse', 'CTAAACGTCG']] >>> as_series = make_aligned_seqs(data, array_align=False) >>> as_items = make_aligned_seqs(data, array_align=False) We add annotations to the sequences as a series. .. doctest:: >>> as_series.get_seq('human').add_feature('cpgsite', 'cpg', [(0,2), (5,7)]) cpgsite "cpg" at [0:2, 5:7]/10 >>> as_series.get_seq('mouse').add_feature('cpgsite', 'cpg', [(5,7), (8,10)]) cpgsite "cpg" at [5:7, 8:10]/10 We add the annotations to the sequences one segment at a time. .. doctest:: >>> as_items.get_seq('human').add_feature('cpgsite', 'cpg', [(0,2)]) cpgsite "cpg" at [0:2]/10 >>> as_items.get_seq('human').add_feature('cpgsite', 'cpg', [(5,7)]) cpgsite "cpg" at [5:7]/10 >>> as_items.get_seq('mouse').add_feature('cpgsite', 'cpg', [(5,7)]) cpgsite "cpg" at [5:7]/10 >>> as_items.get_seq('mouse').add_feature('cpgsite', 'cpg', [(8,10)]) cpgsite "cpg" at [8:10]/10 These different constructions should generate the same output. .. doctest:: >>> serial = as_series.with_masked_annotations(['cpgsite']) >>> print(serial) >human ??AAA??TTT >mouse CTAAA??T?? >>> itemwise = as_items.with_masked_annotations(['cpgsite']) >>> print(itemwise) >human ??AAA??TTT >mouse CTAAA??T?? Annotations should be correctly masked, whether the sequence has been reverse complemented or not. We use the plus/minus strand CDS containing sequences created above. .. doctest:: >>> print(plus.with_masked_annotations("CDS")) AA????AAAA?????AAAAAAAAAA??????????AAA >>> print(minus.with_masked_annotations("CDS")) TTT??????????TTTTTTTTTT?????TTTT????TT .. todo:: Not documented, Source features.