RnaSequence¶
-
class
cogent3.core.sequence.
RnaSequence
(seq='', name=None, info=None, check=True, preserve_case=False, gaps_allowed=True, wildcards_allowed=True)¶
- Attributes
- PROTEIN
- line_wrap
Methods
annotate_from_gff
(self, f[, pre_parsed])annotates a Sequence from a gff file where each entry has the same SeqID
annotate_matches_to
(self, pattern, …[, …])Adds an annotation at sequence positions matching pattern.
can_match
(self, other)Returns True if every pos in self could match same pos in other.
can_mismatch
(self, other)Returns True if any position in self could mismatch with other.
can_mispair
(self, other)Returns True if any position in self could mispair with other.
can_pair
(self, other)Returns True if self and other could pair.
codon_alphabet
(ignore, \*args, \*\*kwargs)If CodonAlphabet is set as a property, it gets self as extra 1st arg.
complement
(self)Returns complement of self, using data from MolType.
copy
(self)returns a copy of self
count
(self, item)count() delegates to self._seq.
count_degenerate
(self)Counts the degenerate bases in the specified sequence.
count_gaps
(self)Counts the gaps in the specified sequence.
counts
(self[, motif_length, …])returns dict of counts of motifs
degap
(self)Deletes all gap characters from sequence.
diff
(self, other)Returns number of differences between self and other.
disambiguate
(self[, method])Returns a non-degenerate sequence from a degenerate one.
distance
(self, other[, function])Returns distance between self and other using function(i,j).
first_degenerate
(self)Returns the index of first degenerate symbol in sequence, or None.
first_gap
(self)Returns the index of the first gap in the sequence, or None.
first_invalid
(self)Returns the index of first invalid symbol in sequence, or None.
first_non_strict
(self)Returns the index of first non-strict symbol in sequence, or None.
frac_diff
(self, other)Returns fraction of positions where self and other differ.
frac_diff_gaps
(self, other)Returns fraction of positions where self and other’s gap states differ.
frac_diff_non_gaps
(self, other)Returns fraction of non-gap positions where self differs from other.
frac_same
(self, other)Returns fraction of positions where self and other are the same.
frac_same_gaps
(self, other)Returns fraction of positions where self and other share gap states.
frac_same_non_gaps
(self, other)Returns fraction of non-gap positions where self matches other.
frac_similar
(self, other, similar_pairs)Returns fraction of positions where self[i] is similar to other[i].
gap_indices
(self)Returns list of indices of all gaps in the sequence, or [].
gap_maps
(self)Returns dicts mapping between gapped and ungapped positions.
gap_vector
(self)Returns vector of True or False according to which pos are gaps.
get_annotations_matching
(self, annotation_type)Returns a list of annotations matching annotation_type.
get_by_annotation
(self, annotation_type[, …])yields the sequence segments corresponding to the specified annotation_type and name one at a time.
get_drawable
(self[, width, vertical])returns Drawable instance
get_drawables
(self)returns a dict of drawables, keyed by type
get_in_motif_size
(self[, motif_length, …])returns sequence as list of non-overlapping motifs
get_name
(self)Return the sequence name – should just use name instead.
get_translation
(self[, gc, incomplete_ok])translate to amino acid sequence
gettype
(self)Return the sequence type.
has_terminal_stop
(self[, gc, allow_partial])Return True if the sequence has a terminal stop codon.
is_annotated
(self)returns True if sequence has any annotations
is_degenerate
(self)Returns True if sequence contains degenerate characters.
is_gap
(self[, char])Returns True if char is a gap.
is_gapped
(self)Returns True if sequence contains gaps.
is_strict
(self)Returns True if sequence contains only monomers.
is_valid
(self)Returns True if sequence contains no items absent from alphabet.
matrix_distance
(self, other, matrix)Returns distance between self and other using a score matrix.
must_match
(self, other)Returns True if all positions in self must match positions in other.
must_pair
(self, other)Returns True if all positions in self must pair with other.
mw
(self[, method, delta])Returns the molecular weight of (one strand of) the sequence.
possibilities
(self)Counts number of possible sequences matching the sequence.
rc
(self)Converts a nucleic acid sequence to its reverse complement.
replace
(self, oldchar, newchar)return new instance with oldchar replaced by newchar
resolveambiguities
(self)Returns a list of tuples of strings.
reverse_complement
(self)Converts a nucleic acid sequence to its reverse complement.
shuffle
(self)returns a randomized copy of the Sequence object
sliding_windows
(self, window, step[, start, end])Generator function that yields new sequence objects of a given length at a given interval.
strand_symmetry
(self[, motif_length])returns G-test for strand symmetry
strip_bad
(self)Removes any symbols not in the alphabet.
strip_bad_and_gaps
(self)Removes any symbols not in the alphabet, and any gaps.
strip_degenerate
(self)Removes degenerate bases by stripping them out of the sequence.
to_dna
(self)Returns copy of self as DNA.
to_fasta
(self[, make_seqlabel, block_size])Return string of self in FASTA format, no trailing newline
to_json
(self)returns a json formatted string
to_moltype
(self, moltype)returns copy of self with moltype seq
to_rich_dict
(self)returns {‘name’: name, ‘seq’: sequence, ‘moltype’: moltype.label}
to_rna
(self)Returns copy of self as RNA.
translate
(self, \*args, \*\*kwargs)translate() delegates to self._seq.
trim_stop_codon
(self[, gc, allow_partial])Removes a terminal stop codon from the sequence
with_masked_annotations
(self, annot_types[, …])returns a sequence with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
with_termini_unknown
(self)Returns copy of sequence with terminal gaps remapped as missing.
add_annotation
add_feature
attach_annotations
clear_annotations
copy_annotations
detach_annotations
gapped_by_map
gapped_by_map_motif_iter
gapped_by_map_segment_iter
get_color_scheme
get_colour_scheme
get_orf_positions
get_region_covering_all
parse_out_gaps
-
PROTEIN
= None¶
-
add_annotation
(self, klass, *args, **kw)¶
-
add_feature
(self, type, name, spans)¶
-
annotate_from_gff
(self, f, pre_parsed=False)¶ annotates a Sequence from a gff file where each entry has the same SeqID
-
annotate_matches_to
(self, pattern, annot_type, name, allow_multiple=False)¶ Adds an annotation at sequence positions matching pattern.
- Parameters
- patternstring
The search string for which annotations are made. IUPAC ambiguities are converted to regex on sequences with the appropriate MolType.
- annot_typestring
The type of the annotation (e.g. “domain”).
- namestring
The name of the annotation.
- allow_multipleboolean
If True, allows multiple occurrences of the input pattern. Otherwise only the first match is used.
- Returns
- Returns a list of Annotation instances.
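The conversion of IUPAC ambiguities to a regex can be illustrated with a minimal pure-Python sketch. It is not cogent3's implementation: the names `IUPAC_RNA`, `iupac_to_regex` and `match_spans` are hypothetical, and the table holds the standard IUPAC RNA ambiguity codes.

```python
import re

# Standard IUPAC ambiguity codes for RNA (assumed table, not taken from cogent3).
IUPAC_RNA = {
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def iupac_to_regex(pattern):
    """Expand each ambiguity code into a regex character class."""
    return "".join(
        f"[{IUPAC_RNA[c]}]" if c in IUPAC_RNA else re.escape(c) for c in pattern
    )


def match_spans(seq, pattern, allow_multiple=False):
    """Return (start, end) spans of matches; only the first unless allow_multiple."""
    spans = [m.span() for m in re.finditer(iupac_to_regex(pattern), seq)]
    return spans if allow_multiple else spans[:1]
```

In this sketch, `match_spans(seq, pattern)` mirrors the documented default of using only the first match.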
-
annotations
= ()¶
-
attach_annotations
(self, annots)¶
-
can_match
(self, other)¶ Returns True if every pos in self could match same pos in other.
Truncates at length of shorter sequence. Gaps are only allowed to match other gaps.
-
can_mismatch
(self, other)¶ Returns True if any position in self could mismatch with other.
Truncates at length of shorter sequence. Gaps are always counted as matches.
-
can_mispair
(self, other)¶ Returns True if any position in self could mispair with other.
Pairing occurs in reverse order, i.e. last position of other with first position of self, etc.
Truncates at length of shorter sequence. Gaps are always counted as possible mispairs, as are weak pairs like GU.
-
can_pair
(self, other)¶ Returns True if self and other could pair.
Pairing occurs in reverse order, i.e. last position of other with first position of self, etc.
Truncates at length of shorter sequence. Gaps are only allowed to pair with other gaps, and are counted as ‘weak’ (same category as GU and degenerate pairs).
NOTE: other must be reversible, since pairing is evaluated against it in reverse order.
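The reverse-order pairing rule can be sketched in plain Python. This is an illustration only, not cogent3's code: `PAIRS` and `can_pair_sketch` are hypothetical names, and degenerate symbols are ignored for brevity.

```python
# Watson-Crick pairs plus the GU wobble pair; a gap pairs only with a gap.
PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"), ("-", "-"),
}


def can_pair_sketch(first, second):
    """True if each position of first pairs with second read in reverse."""
    # zip() stops at the end of the shorter input, which gives the truncation.
    return all((a, b) in PAIRS for a, b in zip(first, reversed(second)))
```

Here the first position of `first` is compared with the last position of `second`, and so on inwards.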
-
clear_annotations
(self)¶
-
codon_alphabet
(ignore, *args, **kwargs)¶ If CodonAlphabet is set as a property, it gets self as extra 1st arg.
-
complement
(self)¶ Returns complement of self, using data from MolType.
Always tries to return same type as item: if item looks like a dict, will return list of keys.
-
copy
(self)¶ returns a copy of self
-
copy_annotations
(self, other)¶
-
count
(self, item)¶ count() delegates to self._seq.
-
count_degenerate
(self)¶ Counts the degenerate bases in the specified sequence.
-
count_gaps
(self)¶ Counts the gaps in the specified sequence.
-
counts
(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)¶ returns dict of counts of motifs
Only non-overlapping motifs are counted.
- Parameters
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
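Counting non-overlapping motifs can be sketched as follows. `motif_counts` is a hypothetical helper, not the library method, and dropping an incomplete terminal motif is an assumption of this sketch.

```python
from collections import Counter


def motif_counts(seq, motif_length=1):
    """Count non-overlapping motifs of motif_length in a plain string."""
    # Assumption: an incomplete motif at the 3' end is dropped.
    limit = len(seq) - len(seq) % motif_length
    # Stepping by motif_length is what makes the motifs non-overlapping.
    return Counter(seq[i:i + motif_length] for i in range(0, limit, motif_length))
```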
-
degap
(self)¶ Deletes all gap characters from sequence.
-
detach_annotations
(self, annots)¶
-
diff
(self, other)¶ Returns number of differences between self and other.
NOTE: truncates at the length of the shorter sequence. Case-sensitive.
-
disambiguate
(self, method='strip')¶ Returns a non-degenerate sequence from a degenerate one.
method can be ‘strip’ (deletes any characters not in monomers or gaps) or ‘random’ (assigns the possibilities at random, using equal frequencies).
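The two methods can be sketched on plain strings. The tables and the `disambiguate_sketch` function are hypothetical stand-ins for the MolType data, using the standard IUPAC RNA codes; this is not the library's implementation.

```python
import random

MONOMERS = "ACGU"  # assumed RNA monomers
GAPS = "-"         # assumed gap character
DEGENERATES = {    # standard IUPAC RNA ambiguity codes
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def disambiguate_sketch(seq, method="strip", rng=None):
    """Return a non-degenerate string from a possibly degenerate one."""
    if method == "strip":
        # Keep only monomers and gaps.
        return "".join(c for c in seq if c in MONOMERS or c in GAPS)
    if method == "random":
        rng = rng or random.Random()
        # Each degenerate symbol is replaced by one of its possibilities,
        # chosen with equal probability.
        return "".join(
            rng.choice(DEGENERATES[c]) if c in DEGENERATES else c for c in seq
        )
    raise ValueError(f"unknown method: {method!r}")
```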
-
distance
(self, other, function=None)¶ Returns distance between self and other using function(i,j).
other must be a sequence.
function should be a function that takes two items and returns a number. To turn a 2D matrix into a function, use cogent3.util.misc.DistanceFromMatrix(matrix).
NOTE: Truncates at the length of the shorter sequence.
Note that the function acts on two _elements_ of the sequences, not the two sequences themselves (i.e. the behavior will be the same for every position in the sequences, such as identity scoring or a function derived from a distance matrix as suggested above). One limitation of this approach is that the distance function cannot use properties of the sequences themselves: for example, it cannot use the lengths of the sequences to normalize the scores as percent similarities or percent differences.
If you want functions that act on the two sequences themselves, there is no particular advantage in making these functions methods of the first sequence by passing them in as parameters like the function in this method. It makes more sense to use them as standalone functions. The factory function cogent3.util.transform.for_seq is useful for converting per-element functions into per-sequence functions, since it takes as parameters a per-element scoring function, a score aggregation function, and a normalization function (which itself takes the two sequences as parameters), returning a single function that combines these functions and that acts on two complete sequences.
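A per-element distance can be sketched like this. `distance_sketch` and `distance_from_matrix` are hypothetical names, the mismatch-counting default is an assumption, and the dict-of-dicts matrix layout is chosen only for illustration.

```python
def distance_sketch(first, second, function=None):
    """Sum function(a, b) over aligned positions of two strings."""
    if function is None:
        # Assumed default: count mismatches.
        function = lambda a, b: 0 if a == b else 1
    # zip() truncates at the length of the shorter sequence.
    return sum(function(a, b) for a, b in zip(first, second))


def distance_from_matrix(matrix):
    """Wrap a dict-of-dicts matrix as a per-element function."""
    return lambda a, b: matrix[a][b]
```

Because `function` sees only two elements at a time, it cannot normalize by sequence length, which is the limitation described above.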
-
first_degenerate
(self)¶ Returns the index of first degenerate symbol in sequence, or None.
-
first_gap
(self)¶ Returns the index of the first gap in the sequence, or None.
-
first_invalid
(self)¶ Returns the index of first invalid symbol in sequence, or None.
-
first_non_strict
(self)¶ Returns the index of first non-strict symbol in sequence, or None.
-
frac_diff
(self, other)¶ Returns fraction of positions where self and other differ.
Truncates at length of shorter sequence. Note that frac_same and frac_diff are both 0 if one sequence is empty.
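The truncation and empty-sequence behavior can be sketched on plain strings. The function names are hypothetical and this is not cogent3's code.

```python
def frac_same_sketch(first, second):
    """Fraction of compared positions that are identical."""
    n = min(len(first), len(second))  # truncate at the shorter sequence
    if n == 0:
        return 0.0  # documented special case: an empty sequence gives 0
    return sum(a == b for a, b in zip(first, second)) / n


def frac_diff_sketch(first, second):
    """Fraction of compared positions that differ."""
    n = min(len(first), len(second))
    if n == 0:
        return 0.0  # so frac_same + frac_diff is 0, not 1, in this case
    return sum(a != b for a, b in zip(first, second)) / n
```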
-
frac_diff_gaps
(self, other)¶ Returns fraction of positions where self and other’s gap states differ.
In other words, if self and other are both all gaps, or both all non-gaps, or both have gaps in the same places, frac_diff_gaps will return 0.0. If self is all gaps and other has no gaps, frac_diff_gaps will return 1.0.
Returns 0 if one sequence is empty.
Uses self’s gap characters for both sequences.
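As a sketch of the gap-state comparison, assuming a single gap character `-` and a hypothetical function name:

```python
def frac_diff_gaps_sketch(first, second, gaps="-"):
    """Fraction of positions where exactly one of the two strings has a gap."""
    n = min(len(first), len(second))
    if n == 0:
        return 0.0
    # The same gap characters are applied to both sequences.
    return sum((a in gaps) != (b in gaps) for a, b in zip(first, second)) / n
```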
-
frac_diff_non_gaps
(self, other)¶ Returns fraction of non-gap positions where self differs from other.
Doesn’t count any position where self or other has a gap. Truncates at the length of the shorter sequence.
Returns 0 if one sequence is empty. Note that this means that frac_diff_non_gaps is _not_ the same as 1 - frac_same_non_gaps, since both return 0 if one sequence is empty.
-
frac_same
(self, other)¶ Returns fraction of positions where self and other are the same.
Truncates at length of shorter sequence. Note that frac_same and frac_diff are both 0 if one sequence is empty.
-
frac_same_gaps
(self, other)¶ Returns fraction of positions where self and other share gap states.
In other words, if self and other are both all gaps, or both all non-gaps, or both have gaps in the same places, frac_same_gaps will return 1.0. If self is all gaps and other has no gaps, frac_same_gaps will return 0.0. Returns 0 if one sequence is empty.
Uses self’s gap characters for both sequences.
-
frac_same_non_gaps
(self, other)¶ Returns fraction of non-gap positions where self matches other.
Doesn’t count any position where self or other has a gap. Truncates at the length of the shorter sequence.
Returns 0 if one sequence is empty.
-
frac_similar
(self, other, similar_pairs)¶ Returns fraction of positions where self[i] is similar to other[i].
similar_pairs must be a dict such that d[(i,j)] exists if i and j are to be counted as similar. Use PairsFromGroups in cogent3.util.misc to construct such a dict from a list of lists of similar residues.
Truncates at the length of the shorter sequence.
Note: current implementation re-creates the distance function each time, so may be expensive compared to creating the distance function using for_seq separately.
Returns 0 if one sequence is empty.
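The role of similar_pairs can be sketched as follows. `pairs_from_groups` is a hypothetical stand-in for PairsFromGroups, not the library function, and `frac_similar_sketch` works on plain strings.

```python
from itertools import product


def pairs_from_groups(groups):
    """Build a dict keyed by every ordered pair drawn from the same group."""
    return {
        (i, j): None for group in groups for i, j in product(group, repeat=2)
    }


def frac_similar_sketch(first, second, similar_pairs):
    """Fraction of positions whose (a, b) pair is a key of similar_pairs."""
    n = min(len(first), len(second))
    if n == 0:
        return 0.0
    return sum((a, b) in similar_pairs for a, b in zip(first, second)) / n
```

For example, grouping purines (`"AG"`) and pyrimidines (`"CU"`) treats transitions as similar and transversions as dissimilar.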
-
gap_indices
(self)¶ Returns list of indices of all gaps in the sequence, or [].
-
gap_maps
(self)¶ Returns dicts mapping between gapped and ungapped positions.
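A minimal sketch of such position maps, assuming `-` as the gap character. The function name and the order of the returned dicts are assumptions of this sketch.

```python
def gap_maps_sketch(seq, gaps="-"):
    """Return (ungapped -> gapped, gapped -> ungapped) index dicts."""
    ungapped_to_gapped = {}
    gapped_to_ungapped = {}
    ungapped_index = 0
    for gapped_index, char in enumerate(seq):
        if char not in gaps:
            ungapped_to_gapped[ungapped_index] = gapped_index
            gapped_to_ungapped[gapped_index] = ungapped_index
            ungapped_index += 1
    return ungapped_to_gapped, gapped_to_ungapped
```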
-
gap_vector
(self)¶ Returns vector of True or False according to which pos are gaps.
-
gapped_by_map
(self, map, recode_gaps=False)¶
-
gapped_by_map_motif_iter
(self, map)¶
-
gapped_by_map_segment_iter
(self, map, allow_gaps=True, recode_gaps=False)¶
-
get_annotations_matching
(self, annotation_type, name=None, extend_query=False)¶
- Parameters
- annotation_typestring
name of the annotation type. Wild-cards allowed.
- namestring
name of the instance. Wild-cards allowed.
- extend_queryboolean
queries sub-annotations if True
- Returns
- list of AnnotatableFeatures
-
get_by_annotation
(self, annotation_type, name=None, ignore_partial=False)¶ yields the sequence segments corresponding to the specified annotation_type and name one at a time.
- Parameters
- ignore_partial
if True, annotations that extend beyond the current sequence are ignored.
-
get_color_scheme
(self, colors)¶
-
get_colour_scheme
(self, colours)¶
-
get_drawable
(self, width=600, vertical=False)¶ returns Drawable instance
-
get_drawables
(self)¶ returns a dict of drawables, keyed by type
-
get_in_motif_size
(self, motif_length=1, log_warnings=True)¶ returns sequence as list of non-overlapping motifs
- Parameters
- motif_length
length of the motifs
- log_warnings
whether to notify of an incomplete terminal motif
-
get_name
(self)¶ Return the sequence name – should just use name instead.
-
get_orf_positions
(self, gc=None, atg=False)¶
-
get_region_covering_all
(self, annotations, feature_class=None, extend_query=False)¶
-
get_translation
(self, gc=None, incomplete_ok=False)¶ translate to amino acid sequence
- Parameters
- gc
name or ID of genetic code
- incomplete_okbool
if True, codons that mix nucleotides and gaps are converted to ‘?’; if False, such codons raise a ValueError
- Returns
- sequence of PROTEIN moltype
-
gettype
(self)¶ Return the sequence type.
-
has_terminal_stop
(self, gc=None, allow_partial=False)¶ Return True if the sequence has a terminal stop codon.
- Parameters
- gc
genetic code object
- allow_partial
if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon
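The effect of allow_partial can be sketched on a plain RNA string. This assumes the standard genetic code's stop codons and assumes that a length not divisible by 3 raises ValueError when allow_partial is False; the function name is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code, RNA alphabet


def has_terminal_stop_sketch(seq, allow_partial=False):
    """True if the last complete codon of seq is a stop codon."""
    remainder = len(seq) % 3
    if remainder:
        if not allow_partial:
            # Assumed behavior for this sketch.
            raise ValueError("sequence length is not divisible by 3")
        seq = seq[:-remainder]  # ignore the incomplete 3' codon
    return seq[-3:] in STOP_CODONS
```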
-
is_annotated
(self)¶ returns True if sequence has any annotations
-
is_degenerate
(self)¶ Returns True if sequence contains degenerate characters.
-
is_gap
(self, char=None)¶ Returns True if char is a gap.
If char is not supplied, tests whether self is gaps only.
-
is_gapped
(self)¶ Returns True if sequence contains gaps.
-
is_strict
(self)¶ Returns True if sequence contains only monomers.
-
is_valid
(self)¶ Returns True if sequence contains no items absent from alphabet.
-
line_wrap
= None¶
-
matrix_distance
(self, other, matrix)¶ Returns distance between self and other using a score matrix.
WARNING: the matrix must explicitly contain scores for the case where a position is the same in self and other (e.g. for a distance matrix, an identity between U and U might have a score of 0). The reason the scores for the ‘diagonals’ need to be passed explicitly is that for some kinds of distance matrices, e.g. log-odds matrices, the ‘diagonal’ scores differ from each other. If these elements are missing, this function will raise a KeyError at the first position that the two sequences are identical.
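The warning about diagonal scores can be illustrated with a small sketch. The dict-of-dicts layout and the function name are assumptions made for illustration, not the library's matrix type.

```python
def matrix_distance_sketch(first, second, matrix):
    """Sum matrix[a][b] over aligned positions of two strings."""
    return sum(matrix[a][b] for a, b in zip(first, second))


# A matrix with explicit diagonal scores works for identical positions.
full = {"A": {"A": 0, "U": 1}, "U": {"A": 1, "U": 0}}

# Without the diagonal, the first identical position raises KeyError.
no_diagonal = {"A": {"U": 1}, "U": {"A": 1}}
```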
-
moltype
= MolType(('U', 'C', 'A', 'G'))¶
-
must_match
(self, other)¶ Returns True if all positions in self must match positions in other.
-
must_pair
(self, other)¶ Returns True if all positions in self must pair with other.
Pairing occurs in reverse order, i.e. last position of other with first position of self, etc.
-
mw
(self, method='random', delta=None)¶ Returns the molecular weight of (one strand of) the sequence.
If the sequence is ambiguous, uses method (random or strip) to disambiguate the sequence.
If delta is passed in, adds delta per strand. The default is None, which uses the alphabet default; typically, this adds 18 Da for terminal water. However, note that the default nucleic acid weight assumes 5’ monophosphate and 3’ OH: pass in delta=18.0 if you want 5’ OH as well.
Note that this method only calculates the MW of the coding strand. If you want the MW of the reverse strand, add self.rc().mw(). DO NOT just multiply the MW by 2: the results may not be accurate due to strand bias, e.g. in mitochondrial genomes.
-
parse_out_gaps
(self)¶
-
possibilities
(self)¶ Counts number of possible sequences matching the sequence.
Uses self.degenerates to decide how many possibilities there are at each position in the sequence.
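The count is the product of the number of possibilities at each position. A sketch with a hypothetical `DEGENERATES` table (standard IUPAC RNA codes) standing in for self.degenerates:

```python
from math import prod

DEGENERATES = {
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def possibilities_sketch(seq):
    """Number of non-degenerate strings that could match seq."""
    # A symbol missing from the table contributes exactly one possibility.
    return prod(len(DEGENERATES.get(c, c)) for c in seq)
```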
-
protein
= MolType(('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y'))¶
-
rc
(self)¶ Converts a nucleic acid sequence to its reverse complement.
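For RNA, the operation amounts to complementing each base and reversing the result. A plain-string sketch with hypothetical names, ignoring degenerate symbols:

```python
# Complement table for the four RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_sketch(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]
```

Applying the sketch twice returns the original string.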
-
replace
(self, oldchar, newchar)¶ return new instance with oldchar replaced by newchar
-
resolveambiguities
(self)¶ Returns a list of tuples of strings.
-
reverse_complement
(self)¶ Converts a nucleic acid sequence to its reverse complement. Synonym for rc.
-
shuffle
(self)¶ returns a randomized copy of the Sequence object
-
sliding_windows
(self, window, step, start=None, end=None)¶ Generator function that yields new sequence objects of a given length at a given interval.
- Parameters
- window
The length of the returned sequence
- step
The interval between the start of the returned sequence objects
- start
first window start position
- end
last window start position
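The window and step parameters can be sketched with a generator over a plain string. The function name is hypothetical, and treating end as an exclusive bound on window start positions is an assumption of this sketch.

```python
def sliding_windows_sketch(seq, window, step, start=None, end=None):
    """Yield substrings of length window whose starts are step apart."""
    start = 0 if start is None else start
    # Never start a window that would run past the end of the sequence.
    last = len(seq) - window + 1
    end = last if end is None else min(end, last)
    for pos in range(start, end, step):
        yield seq[pos:pos + window]
```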
-
strand_symmetry
(self, motif_length=1)¶ returns G-test for strand symmetry
-
strip_bad
(self)¶ Removes any symbols not in the alphabet.
-
strip_bad_and_gaps
(self)¶ Removes any symbols not in the alphabet, and any gaps.
-
strip_degenerate
(self)¶ Removes degenerate bases by stripping them out of the sequence.
-
to_dna
(self)¶ Returns copy of self as DNA.
-
to_fasta
(self, make_seqlabel=None, block_size=60)¶ Return string of self in FASTA format, no trailing newline
- Parameters
- make_seqlabel
callback function that takes the seq object and returns a label str
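A sketch of FASTA formatting on plain strings, assuming block_size is the number of sequence characters per line. The function name is hypothetical and the label is passed in directly rather than produced by a callback.

```python
def to_fasta_sketch(name, seq, block_size=60):
    """Format a label and sequence as FASTA text without a trailing newline."""
    lines = [f">{name}"]
    lines.extend(seq[i:i + block_size] for i in range(0, len(seq), block_size))
    return "\n".join(lines)
```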
-
to_json
(self)¶ returns a json formatted string
-
to_moltype
(self, moltype)¶ returns copy of self with moltype seq
- Parameters
- moltypestr
molecular type
-
to_rich_dict
(self)¶ returns {‘name’: name, ‘seq’: sequence, ‘moltype’: moltype.label}
-
to_rna
(self)¶ Returns copy of self as RNA.
-
translate
(self, *args, **kwargs)¶ translate() delegates to self._seq.
-
trim_stop_codon
(self, gc=None, allow_partial=False)¶ Removes a terminal stop codon from the sequence
- Parameters
- gc
genetic code object
- allow_partial
if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon
-
with_masked_annotations
(self, annot_types, mask_char=None, shadow=False, extend_query=False)¶ returns a sequence with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
- Parameters
- annot_types
annotation type(s)
- mask_char
must be a character valid for the seq MolType. The default value is the most ambiguous character, e.g. ‘?’ for DNA
- shadow
whether to mask the annotated regions, or everything but the annotated regions
- extend_queryboolean
queries sub-annotations if True
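The effect of shadow can be sketched on a plain string, with annotated regions given as hypothetical (start, end) spans. This illustrates the masking rule only and does not use the library's annotation objects.

```python
def mask_regions(seq, spans, mask_char="?", shadow=False):
    """Mask the spans, or everything outside them when shadow is True."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    # A position is masked when its membership in the spans differs from shadow.
    return "".join(
        mask_char if (i in covered) != shadow else c for i, c in enumerate(seq)
    )
```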
-
with_termini_unknown
(self)¶ Returns copy of sequence with terminal gaps remapped as missing.