Alignment¶
-
class
cogent3.core.alignment.
Alignment
(*args, **kwargs)¶ An annotatable alignment class
- Attributes
Methods
add_from_ref_aln
(self, ref_aln[, …])Insert sequence(s) to self based on their alignment to a reference sequence.
add_seqs
(self, other[, before_name, after_name])Returns new object of class self with sequences from other added.
alignment_quality
(self[, equifreq_mprobs])Computes the alignment quality for an alignment based on eq.
annotate_from_gff
(self, f)Copies annotations from gff-format file to self.
apply_pssm
(self[, pssm, path, background, …])scores sequences using the specified pssm
coevolution
(self[, method, segments, …])performs pairwise coevolution measurement
copy
(self)Returns deep copy of self.
copy_annotations
(self, unaligned)Copies annotations from seqs in unaligned to self, matching by name.
count_gaps_per_pos
(self[, include_ambiguity])return counts of gaps per position as a DictArray
count_gaps_per_seq
(self[, induced_by, …])return counts of gaps per sequence as a DictArray
counts
(self[, motif_length, …])returns dict of counts of motifs
counts_per_pos
(self[, motif_length, …])return DictArray of counts per position
counts_per_seq
(self[, motif_length, …])returns dict of counts of non-overlapping motifs per sequence
deepcopy
(self[, sliced])Returns deep copy of self.
degap
(self, **kwargs)Returns copy in which sequences have no gaps.
distance_matrix
(self[, calc, show_progress, …])Returns pairwise distances between sequences.
dotplot
(self[, name1, name2, window, …])make a dotplot between specified sequences.
entropy_per_pos
(self[, motif_length, …])returns shannon entropy per position
entropy_per_seq
(self[, motif_length, …])returns the Shannon entropy per sequence
filtered
(self, predicate[, motif_length, …])The alignment positions where predicate(column) is true.
get_ambiguous_positions
(self)Returns dict of seq:{position:char} for ambiguous chars.
get_annotations_matching
(self, annotation_type)
get_by_annotation
(self, annotation_type[, …])yields the sequence segments corresponding to the specified annotation_type and name one at a time.
get_degapped_relative_to
(self, name)Remove all columns with gaps in sequence with given name.
get_drawable
(self[, width, vertical])returns Drawable instance
get_drawables
(self)returns a dict of drawables, keyed by type
get_gap_array
(self[, include_ambiguity])returns bool array with gap state True, False otherwise
get_gapped_seq
(self, seq_name[, …])Return a gapped Sequence object for the specified seqname.
get_identical_sets
(self[, mask_degen])returns sets of names for sequences that are identical
get_lengths
(self[, include_ambiguity, allow_gap])returns {name: seq length, …}
get_motif_probs
(self[, alphabet, …])Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
get_position_indices
(self, f[, native, negate])Returns list of column indices for which f(col) is True.
get_seq
(self, seqname)Return an ungapped Sequence object for the specified seqname.
get_seq_indices
(self, f[, negate])Returns list of keys of seqs where f(row) is True.
get_similar
(self, target[, min_similarity, …])Returns new Alignment containing sequences similar to target.
get_translation
(self[, gc, incomplete_ok])translate from nucleic acid to protein
has_terminal_stops
(self[, gc, allow_partial])Returns True if any sequence has a terminal stop codon.
information_plot
(self[, width, height, …])plot information per position
is_ragged
(self)Returns True if alignment has sequences of different lengths.
iter_positions
(self[, pos_order])Iterates over positions in the alignment, in order.
iter_selected
(self[, seq_order, pos_order])Iterates over elements in the alignment.
iter_seqs
(self[, seq_order])Iterates over values (sequences) in the alignment, in order.
iupac_consensus
(self[, alphabet])Returns string containing IUPAC consensus sequence of the alignment.
majority_consensus
(self)Returns list containing most frequent item at each position.
matching_ref
(self, ref_name, gap_fraction, …)Returns new alignment with seqs well aligned with a reference.
no_degenerates
(self[, motif_length, allow_gap])returns new alignment without degenerate characters
omit_bad_seqs
(self[, quantile])Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
omit_gap_pos
(self[, allowed_gap_frac, …])Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
omit_gap_runs
(self[, allowed_run])Returns new alignment where all seqs have runs of gaps <=allowed_run.
omit_gap_seqs
(self[, allowed_gap_frac])Returns new alignment with seqs that have <= allowed_gap_frac.
pad_seqs
(self[, pad_length])Returns copy in which sequences are padded to same length.
probs_per_pos
(self[, motif_length, …])returns MotifFreqsArray per position
probs_per_seq
(self[, motif_length, …])return MotifFreqsArray per sequence
quick_tree
(self[, calc, bootstrap, …])Returns a phylogenetic tree estimated from pairwise distances between sequences.
rc
(self)Returns the reverse complement alignment
rename_seqs
(self, renamer)returns new instance with sequences renamed
replace_seqs
(self, seqs[, aa_to_codon])Returns new alignment with same shape but with data taken from seqs.
reverse_complement
(self)Returns the reverse complement alignment.
sample
(self[, n, with_replacement, …])Returns random sample of positions from self, e.g.
seqlogo
(self[, width, height, wrap, vspace, …])returns Drawable sequence logo using mutual information
set_repr_policy
(self[, num_seqs, num_pos, …])specify policy for repr(self)
sliding_windows
(self, window, step[, start, end])Generator yielding new Alignments of given length and interval.
strand_symmetry
(self[, motif_length])returns dict of strand symmetry test results per seq
take_positions
(self, cols[, negate])Returns new Alignment containing only specified positions.
take_positions_if
(self, f[, negate])Returns new Alignment containing cols where f(col) is True.
take_seqs
(self, seqs[, negate])Returns new Alignment containing only specified seqs.
take_seqs_if
(self, f[, negate])Returns new Alignment containing seqs where f(row) is True.
to_dict
(self)Returns the alignment as dict of names -> strings.
to_dna
(self)returns copy of self as an alignment of DNA moltype seqs
to_fasta
(self)Return alignment in Fasta format
to_html
(self[, name_order, interleave_len, …])returns html with embedded styles for sequence colouring
to_json
(self)returns json formatted string
to_moltype
(self, moltype)returns copy of self with moltype seqs
to_nexus
(self, seq_type[, interleave_len])Return alignment in NEXUS format and mapping to sequence ids
to_phylip
(self)Return alignment in PHYLIP format and mapping to sequence ids
to_pretty
(self[, name_order, interleave_len])returns a string representation of the alignment in pretty print format
to_protein
(self)returns copy of self as an alignment of PROTEIN moltype seqs
to_rich_dict
(self)returns detailed content including info and moltype attributes
to_rna
(self)returns copy of self as an alignment of RNA moltype seqs
to_type
(self[, array_align, moltype, alphabet])returns alignment of type indicated by array_align
trim_stop_codons
(self[, gc, allow_partial])Removes any terminal stop codons from the sequences
variable_positions
(self[, include_gap_motif])Return a list of variable position indexes.
with_gaps_from
(self, template)Same alignment but overwritten with the gaps from ‘template’
with_masked_annotations
(self, annot_types[, …])returns an alignment with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
with_modified_termini
(self)Changes the termini to include termini char instead of gapmotif.
write
(self[, filename, format])Write the alignment to a file, preserving order of sequences.
add_annotation
add_feature
attach_annotations
clear_annotations
detach_annotations
gapped_by_map
get_annotations_from_any_seq
get_annotations_from_seq
get_by_seq_annotation
get_projected_annotations
get_region_covering_all
project_annotation
-
add_annotation
(self, klass, *args, **kw)¶
-
add_feature
(self, type, name, spans)¶
-
add_from_ref_aln
(self, ref_aln, before_name=None, after_name=None)¶ Insert sequence(s) to self based on their alignment to a reference sequence. Assumes the first sequence in ref_aln (ref_aln.names[0]) is the reference.
By default the sequence is appended to the end of the alignment; this can be changed by using either the before_name or after_name argument.
Returns Alignment object of the same class.
- Parameters
- ref_aln
reference alignment (Alignment object/series) of reference sequence and sequences to add. New sequences in ref_aln (ref_aln.names[1:]) are sequences to add. If a series is used as ref_aln, it must have the structure [[‘ref_name’, SEQ], [‘name’, SEQ]]
- before_name
name of the sequence before which sequence is added
- after_name
name of the sequence after which sequence is added. If both before_name and after_name are specified, seqs will be inserted using before_name.
- Example:
- Aln1:
- -AC-DEFGHI (name: seq1)
- XXXXXX--XX (name: seq2)
- YYYY-YYYYY (name: seq3)
- Aln2:
- ACDEFGHI (name: seq1)
- KL--MNPR (name: seqX)
- KLACMNPR (name: seqY)
- KL--MNPR (name: seqZ)
- Out:
- -AC-DEFGHI (name: seq1)
- XXXXXX--XX (name: seq2)
- YYYY-YYYYY (name: seq3)
- -KL---MNPR (name: seqX)
- -KL-ACMNPR (name: seqY)
- -KL---MNPR (name: seqZ)
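The projection in the example above can be sketched in plain Python. This is a minimal illustration on strings, not cogent3's implementation: the helper name project_onto_reference is hypothetical, and it assumes the reference is ungapped in ref_aln (as seq1 is in Aln2).

```python
def project_onto_reference(ref_in_target, new_seq, gap="-"):
    """Insert a gap into new_seq wherever the reference is gapped in the target.

    Assumption of this sketch: the reference has no gaps in the reference
    alignment, so new_seq has one character per ungapped reference position.
    """
    n_ungapped = sum(c != gap for c in ref_in_target)
    if n_ungapped != len(new_seq):
        raise ValueError("new_seq must match the ungapped reference length")
    chars = iter(new_seq)
    # Copy reference gaps through; otherwise consume the next aligned character.
    return "".join(gap if c == gap else next(chars) for c in ref_in_target)


ref_in_aln1 = "-AC-DEFGHI"  # seq1 as it appears in Aln1
to_add = {"seqX": "KL--MNPR", "seqY": "KLACMNPR", "seqZ": "KL--MNPR"}
projected = {
    name: project_onto_reference(ref_in_aln1, seq) for name, seq in to_add.items()
}
```

Each projected sequence has the length of Aln1, so it can be appended to that alignment directly.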
-
add_seqs
(self, other, before_name=None, after_name=None)¶ Returns new object of class self with sequences from other added.
- Parameters
- other
same class as self or coerceable to that class
- before_name : str
name of the sequence before which sequence is added
- after_name : str
name of the sequence after which sequence is added
Notes
If both before_name and after_name are specified, the seqs will be inserted using before_name.
By default the sequence is appended to the end of the alignment; this can be changed by using either the before_name or after_name argument.
-
alignment_quality
(self, equifreq_mprobs=True)¶ Computes the alignment quality for an alignment based on eq. (2) in noted reference.
- Parameters
- equifreq_mprobs : bool
If True, specifies equally frequent motif probabilities.
Notes
G. Z. Hertz, G. D. Stormo (1999), Bioinformatics, vol. 15, pp. 563-577.
The alignment quality statistic is a log-likelihood ratio (computed using log2) of the observed alignment column freqs versus the expected.
-
annotate_from_gff
(self, f)¶ Copies annotations from gff-format file to self.
Matches by name of sequence. This method accepts a string path, a pathlib.Path or a file-like object (e.g. StringIO).
Skips sequences in the file that are not in self.
-
annotations
= ()¶
-
apply_pssm
(self, pssm=None, path=None, background=None, pseudocount=0, names=None, ui=None)¶ scores sequences using the specified pssm
- Parameters
- pssm : profile.PSSM
if not provided, will be loaded from path
- path
path to either a jaspar or cisbp matrix (path must have a suffix matching the format).
- pseudocount
adjustment for zero in matrix
- names
returns only scores for these sequences and in the name order
- Returns
- numpy array of log2 based scores at every position
-
attach_annotations
(self, annots)¶
-
clear_annotations
(self)¶
-
coevolution
(self, method='nmi', segments=None, drawable=None, show_progress=False, ui=None)¶ performs pairwise coevolution measurement
- Parameters
- method : str
coevolution metric, defaults to ‘nmi’ (Normalized Mutual Information). Other valid choices are ‘rmi’ (Resampled Mutual Information) and ‘mi’ (Mutual Information).
- segments : coordinate series
coordinates of the form [(start, end), …] where all possible pairs of alignment positions within and between segments are examined.
- drawable : None or str
If a str, the result object is capable of plotting the data with the specified plot type, which must be one of ‘box’, ‘heatmap’, ‘violin’.
- show_progress : bool
shows a progress bar
- Returns
- DictArray of results with lower-triangular values. Upper triangular elements and estimates that could not be computed for numerical reasons are set as nan
-
copy
(self)¶ Returns deep copy of self.
-
copy_annotations
(self, unaligned)¶ Copies annotations from seqs in unaligned to self, matching by name.
Alignment programs like ClustalW don’t preserve annotations, so this method is available to copy annotations off the unaligned sequences.
unaligned should be a dictionary of Sequence instances.
Ignores sequences that are not in self, so safe to use on larger dict of seqs that are not in the current collection/alignment.
-
count_gaps_per_pos
(self, include_ambiguity=True)¶ return counts of gaps per position as a DictArray
- Parameters
- include_ambiguity : bool
if True, ambiguity characters that include the gap state are included
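As a rough illustration of what a per-position gap count contains, the sketch below counts gap characters column by column on plain strings. The function is a hypothetical stand-in, not the cogent3 method: it returns a list rather than a DictArray, and it treats both '-' and '?' as gap states (the values listed under gap_chars on this page).

```python
def count_gaps_per_pos(seqs, gap_chars=("-", "?")):
    """Return the number of gap characters in each alignment column."""
    columns = zip(*seqs.values())  # transpose rows (sequences) into columns
    return [sum(c in gap_chars for c in col) for col in columns]


aln = {"seq1": "-AC-DEFGHI", "seq2": "XXXXXX--XX", "seq3": "YYYY-YYYYY"}
gap_counts = count_gaps_per_pos(aln)
```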
-
count_gaps_per_seq
(self, induced_by=False, unique=False, include_ambiguity=True, drawable=False)¶ return counts of gaps per sequence as a DictArray
- Parameters
- induced_by : bool
a gapped column is considered to be induced by a seq if the seq has a non-gap character in that column.
- unique : bool
count is limited to gaps uniquely induced by each sequence
- include_ambiguity : bool
if True, ambiguity characters that include the gap state are included
- drawable : bool or str
if True, resulting object is capable of plotting data via specified plot type ‘bar’, ‘box’ or ‘violin’
-
counts
(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)¶ returns dict of counts of motifs
- Parameters
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
Notes
only non-overlapping motifs are counted
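The note about non-overlapping motifs can be made concrete with a small sketch. It works on a single plain string and uses a hypothetical helper name; dropping the incomplete trailing motif is an assumption of this sketch, and ambiguity and gap handling are left out.

```python
from collections import Counter


def motif_counts(seq, motif_length=1):
    """Count non-overlapping motifs, ignoring an incomplete trailing motif."""
    usable = len(seq) - len(seq) % motif_length
    # Step by motif_length so that consecutive motifs never overlap.
    return Counter(seq[i:i + motif_length] for i in range(0, usable, motif_length))


counts = motif_counts("ATGATGCC", motif_length=3)
```

With motif_length=3 the sequence is read as ATG, ATG, and the trailing CC is discarded, so overlapping trinucleotides such as TGA are never counted.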
-
counts_per_pos
(self, motif_length=1, include_ambiguity=False, allow_gap=False, alert=False)¶ return DictArray of counts per position
- Parameters
- alert
warns if motif_length > 1 and alignment trimmed to produce motif columns
-
counts_per_seq
(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, alert=False)¶ returns dict of counts of non-overlapping motifs per sequence
- Parameters
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if False, all canonical states included
- alert
warns if motif_length > 1 and alignment trimmed to produce motif columns
-
deepcopy
(self, sliced=True)¶ Returns deep copy of self.
-
default_gap
= '-'¶
-
degap
(self, **kwargs)¶ Returns copy in which sequences have no gaps.
-
detach_annotations
(self, annots)¶
-
distance_matrix
(self, calc='percent', show_progress=False, drop_invalid=False)¶ Returns pairwise distances between sequences.
- Parameters
- calc : str
a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
- show_progress : bool
controls progress display for distance calculation
- drop_invalid : bool
If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
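To show the shape of a pairwise distance calculation, here is a minimal sketch using the proportion of differing positions. It is an illustration only and does not reproduce any calculator listed in available_distances. The function names are hypothetical, and raising ArithmeticError when no positions are comparable mirrors the behaviour described for drop_invalid=False.

```python
from itertools import combinations


def p_distance(a, b, gap_chars="-?"):
    """Proportion of differing positions, skipping columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        raise ArithmeticError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def pairwise_distances(seqs):
    """Return {(name1, name2): distance} for every unordered pair of names."""
    return {(n1, n2): p_distance(seqs[n1], seqs[n2])
            for n1, n2 in combinations(seqs, 2)}


aln = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "AC-TTCGA"}
dists = pairwise_distances(aln)
```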
-
dotplot
(self, name1=None, name2=None, window=20, threshold=None, min_gap=0, width=500, title=None, rc=False, show_progress=False)¶ make a dotplot between specified sequences. Random sequences chosen if names not provided.
- Parameters
- name1, name2 : str or None
names of sequences. If one is not provided, a random choice is made
- window : int
k-mer size for comparison between sequences
- threshold : int
windows where the sequences are identical >= threshold are a match
- min_gap : int
permitted gap for joining adjacent line segments, default is no gap joining
- width : int
figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
- title
title for the plot
- rc : bool or None
include dotplot of reverse complement also. Only applies to nucleic acid moltypes
- Returns
- a Drawable or AnnotatedDrawable
-
entropy_per_pos
(self, motif_length=1, include_ambiguity=False, allow_gap=False, alert=False)¶ returns shannon entropy per position
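Shannon entropy per position can be sketched as follows for motif_length=1. This works on a list of plain strings, uses log2, and counts every character as a state. It is a hypothetical helper for illustration, and the library's treatment of gaps and ambiguity codes may differ.

```python
from collections import Counter
from math import log2


def entropy_per_pos(seqs):
    """Return the Shannon entropy (in bits) of each alignment column."""
    entropies = []
    for col in zip(*seqs):
        counts = Counter(col)
        total = sum(counts.values())
        # H = -sum(p * log2(p)) over the states observed in this column.
        entropies.append(-sum((n / total) * log2(n / total)
                              for n in counts.values()))
    return entropies


entropies = entropy_per_pos(["AAAC", "AACG", "ACGT", "ACTA"])
```

An invariant column gives 0 bits, a column split evenly between two states gives 1 bit, and four equally frequent states give 2 bits.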
-
entropy_per_seq
(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=True, alert=False)¶ returns the Shannon entropy per sequence
- Parameters
- motif_length
number of characters per tuple.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
Notes
For motif_length > 1, it’s advisable to specify exclude_unobserved=True; this avoids unnecessary calculations.
-
filtered
(self, predicate, motif_length=1, drop_remainder=True, **kwargs)¶ The alignment positions where predicate(column) is true.
- Parameters
- predicate : callable
a callback function that takes a tuple of motifs and returns True/False
- motif_length : int
length of the motifs the sequences should be split into, e.g. 3 for filtering aligned codons.
- drop_remainder : bool
If the alignment length is not a multiple of motif_length, allow dropping the terminal remaining columns
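The interaction of predicate, motif_length and drop_remainder can be sketched on a dict of plain strings. The function below is a hypothetical stand-in rather than the cogent3 method, and raising ValueError when drop_remainder is False and columns are left over is an assumption of this sketch.

```python
def filtered(seqs, predicate, motif_length=1, drop_remainder=True):
    """Keep the motif columns for which predicate(column) is true."""
    length = len(next(iter(seqs.values())))
    remainder = length % motif_length
    if remainder and not drop_remainder:
        raise ValueError("alignment length is not a multiple of motif_length")
    keep = []
    for start in range(0, length - remainder, motif_length):
        end = start + motif_length
        # A column is a tuple holding one motif per sequence.
        column = tuple(s[start:end] for s in seqs.values())
        if predicate(column):
            keep.append((start, end))
    return {name: "".join(s[a:b] for a, b in keep) for name, s in seqs.items()}


aln = {"s1": "ATGAAATTTCC", "s2": "ATG---TTTCC"}
no_gap_codons = filtered(
    aln, lambda col: all("-" not in motif for motif in col), motif_length=3
)
```

Here the middle codon column is dropped because s2 is gapped there, and the two trailing columns are dropped because they do not form a complete codon.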
-
gap_chars
= {'-': None, '?': None}¶
-
gapped_by_map
(self, keep, **kwargs)¶
-
get_ambiguous_positions
(self)¶ Returns dict of seq:{position:char} for ambiguous chars.
Used in likelihood calculations.
-
get_annotations_from_any_seq
(self, annotation_type='*', **kwargs)¶
-
get_annotations_from_seq
(self, seq_name, annotation_type='*', **kwargs)¶
-
get_annotations_matching
(self, annotation_type, name=None, extend_query=False)¶
- Parameters
- annotation_type : string
name of the annotation type. Wild-cards allowed.
- name : string
name of the instance. Wild-cards allowed.
- extend_query : boolean
queries sub-annotations if True
- Returns
- list of AnnotatableFeatures
-
get_by_annotation
(self, annotation_type, name=None, ignore_partial=False)¶ yields the sequence segments corresponding to the specified annotation_type and name one at a time.
- Parameters
- ignore_partial
if True, annotations that extend beyond the current sequence are ignored.
-
get_by_seq_annotation
(self, seq_name, *args)¶
-
get_degapped_relative_to
(self, name)¶ Remove all columns with gaps in sequence with given name.
Returns Alignment object of the same class. Note that the seqs in the new Alignment are always new objects.
- Parameters
- name
sequence name
-
get_drawable
(self, width=600, vertical=False)¶ returns Drawable instance
-
get_drawables
(self)¶ returns a dict of drawables, keyed by type
-
get_gap_array
(self, include_ambiguity=True)¶ returns bool array with gap state True, False otherwise
- Parameters
- include_ambiguity : bool
if True, ambiguity characters that include the gap state are included
-
get_gapped_seq
(self, seq_name, recode_gaps=False, moltype=None)¶ Return a gapped Sequence object for the specified seqname.
Note: always returns Sequence object, not ArraySequence.
-
get_identical_sets
(self, mask_degen=False)¶ returns sets of names for sequences that are identical
- Parameters
- mask_degen
if True, degenerate characters are ignored
-
get_lengths
(self, include_ambiguity=False, allow_gap=False)¶ returns {name: seq length, …}
- Parameters
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
-
get_motif_probs
(self, alphabet=None, include_ambiguity=False, exclude_unobserved=False, allow_gap=False, pseudocount=0)¶ Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
- Parameters
- include_ambiguity
if True resolved ambiguous codes are included in estimation of frequencies, default is False.
- exclude_unobserved
if True, motifs that are not present in the alignment are excluded from the returned dictionary, default is False.
- allow_gap
allow gap motif
Notes
only non-overlapping motifs are counted
-
get_position_indices
(self, f, native=False, negate=False)¶ Returns list of column indices for which f(col) is True.
- f : callable
function that returns true/false given an alignment position
- native : boolean
if True, and self is an ArrayAlignment, f is provided with a slice of the array; otherwise the string is used
- negate : boolean
if True, not f() is used
-
get_projected_annotations
(self, seq_name, *args)¶
-
get_region_covering_all
(self, annotations, feature_class=None, extend_query=False)¶
-
get_seq
(self, seqname)¶ Return an ungapped Sequence object for the specified seqname.
Note: always returns Sequence object, not ArraySequence.
-
get_seq_indices
(self, f, negate=False)¶ Returns list of keys of seqs where f(row) is True.
List will be in the same order as self.names, if present.
-
get_similar
(self, target, min_similarity=0.0, max_similarity=1.0, metric=<cogent3.util.transform.for_seq object>, transform=None)¶ Returns new Alignment containing sequences similar to target.
- Parameters
- target
sequence object to compare to. Can be in the alignment.
- min_similarity
minimum similarity that will be kept. Default 0.0.
- max_similarity
maximum similarity that will be kept. Default 1.0. (Note that both min_similarity and max_similarity are inclusive.)
- metric
similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence classes have lots of methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
- transform
transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.
WARNING: if the transformation changes the type of the sequence (e.g. extracting a string from an RnaSequence object), distance metrics that depend on instance data of the original class may fail.
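How min_similarity, max_similarity, metric and transform fit together can be sketched on plain strings. Both functions are hypothetical illustrations: frac_same here is a simple fraction of identical positions, standing in for the default metric described above.

```python
def frac_same(a, b):
    """Fraction of aligned positions at which two equal-length strings agree."""
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def get_similar(seqs, target, min_similarity=0.0, max_similarity=1.0,
                metric=frac_same, transform=None):
    """Return the sequences whose similarity to target lies in the closed range."""
    if transform is not None:
        target = transform(target)  # the transform applies to the target too
    selected = {}
    for name, seq in seqs.items():
        candidate = transform(seq) if transform is not None else seq
        if min_similarity <= metric(target, candidate) <= max_similarity:
            selected[name] = seq  # keep the untransformed sequence
    return selected


aln = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGA"}
close = get_similar(aln, aln["a"], min_similarity=0.8)
tail_only = get_similar(aln, aln["a"], min_similarity=0.7,
                        transform=lambda s: s[4:])
```

Restricting the comparison to the last four columns with transform lets sequence c through, because it differs from the target mainly at its 5' end.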
-
get_translation
(self, gc=None, incomplete_ok=False, **kwargs)¶ translate from nucleic acid to protein
- Parameters
- gc
genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
- incomplete_ok : bool
if True, codons that are mixes of nucleotides and gaps are converted to ‘?’. Raises a ValueError if False
- kwargs
related to construction of the resulting object
- Returns
- A new instance of self translated into protein
-
has_terminal_stops
(self, gc=None, allow_partial=False)¶ Returns True if any sequence has a terminal stop codon.
- Parameters
- gc
genetic code object
- allow_partial
if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon
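The terminal stop check can be sketched for a single ungapped DNA string. This illustration hard-codes the stop codons of the standard genetic code instead of taking a genetic code object, and raising ValueError for an incomplete final codon when allow_partial is False is an assumption of the sketch. The helper name is hypothetical.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def has_terminal_stop(seq, allow_partial=False, stops=STANDARD_STOPS):
    """Return True if the last complete codon of seq is a stop codon."""
    remainder = len(seq) % 3
    if remainder:
        if not allow_partial:
            raise ValueError("sequence length is not divisible by 3")
        seq = seq[:-remainder]  # ignore the 3' terminal incomplete codon
    return seq[-3:].upper() in stops


seqs = {"s1": "ATGAAATAA", "s2": "ATGAAA"}
any_stops = any(has_terminal_stop(s) for s in seqs.values())
```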
-
information_plot
(self, width=None, height=None, window=None, stat='median', include_gap=True)¶ plot information per position
- Parameters
- width : int
figure width in pixels
- height : int
figure height in pixels
- window : int or None
used for smoothing line, defaults to sqrt(length)
- stat : str
‘mean’ or ‘median’, used as the summary statistic for each window
- include_gap
whether to include gap counts, shown on right y-axis
-
is_array
= {'array', 'array_seqs'}¶
-
is_ragged
(self)¶ Returns True if alignment has sequences of different lengths.
-
iter_positions
(self, pos_order=None)¶ Iterates over positions in the alignment, in order.
pos_order refers to a list of indices (ints) specifying the column order. This lets you rearrange positions if you want to (e.g. to pull out individual codon positions).
Note that self.iter_positions() always returns new objects, by default lists of elements. Use map(f, self.iter_positions()) to apply the constructor or function f to the resulting lists (f must take a single list as a parameter).
Will raise IndexError if one of the indices in pos_order exceeds the sequence length. This will always happen on ragged alignments: assign to self.seq_len to set all sequences to the same length.
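The use of pos_order to pull out individual codon positions can be sketched with a plain generator over a list of strings. The function is a hypothetical stand-in for the method and yields a new list per column.

```python
def iter_positions(seqs, pos_order=None):
    """Yield each alignment column as a new list, in pos_order if given."""
    order = range(len(seqs[0])) if pos_order is None else pos_order
    for i in order:
        # s[i] raises IndexError when an index exceeds a sequence's length.
        yield [s[i] for s in seqs]


seqs = ["ATGCCA", "ATGTCA"]
third_positions = list(iter_positions(seqs, pos_order=range(2, len(seqs[0]), 3)))
as_strings = list(map("".join, iter_positions(seqs)))
```

The second call shows the map(f, ...) pattern from the note above, with "".join as the function applied to each column list.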
-
iter_selected
(self, seq_order=None, pos_order=None)¶ Iterates over elements in the alignment.
seq_order (names) can be used to select a subset of seqs. pos_order (positions) can be used to select a subset of positions.
Always iterates along a seq first, then down a position (transposes normal order of a[i][j]; possibly, this should change).
WARNING: Alignment.iter_selected() is not the same as alignment.iteritems() (which is the built-in dict iteritems that iterates over key-value pairs).
-
iter_seqs
(self, seq_order=None)¶ Iterates over values (sequences) in the alignment, in order.
seq_order: list of keys giving the order in which seqs will be returned. Defaults to self.names. Note that only these sequences will be returned, and that KeyError will be raised if there are sequences in seq_order that have been deleted from the Alignment. If self.names is None, returns the sequences in the same order as self.named_seqs.values().
Use map(f, self.iter_seqs()) to apply the constructor f to each seq. f must accept a single list as an argument.
Always returns references to the same objects that are values of the alignment.
-
iupac_consensus
(self, alphabet=None)¶ Returns string containing IUPAC consensus sequence of the alignment.
-
majority_consensus
(self)¶ Returns list containing most frequent item at each position.
Optional parameter transform gives constructor for type to which result will be converted (useful when consensus should be same type as originals).
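A majority consensus can be sketched in a few lines on plain strings. This is an illustration with a hypothetical helper; ties are resolved in favour of the state seen first in the column, which may not match the library's tie-breaking.

```python
from collections import Counter


def majority_consensus(seqs):
    """Return a list holding the most frequent character of each column."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]


consensus = majority_consensus(["ACGT", "ACGA", "TCGA"])
consensus_str = "".join(consensus)  # convert the list to a string if needed
```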
-
matching_ref
(self, ref_name, gap_fraction, gap_run)¶ Returns new alignment with seqs well aligned with a reference.
- gap_fraction
fraction of positions that either have a gap in the template but not in the seq or in the seq but not in the template
- gap_run
number of consecutive gaps tolerated in query relative to sequence or sequence relative to query
-
moltype
= MolType(('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'))¶
-
no_degenerates
(self, motif_length=1, allow_gap=False)¶ returns new alignment without degenerate characters
- Parameters
- motif_length
sequences are segmented into units of this size
- allow_gap
whether gaps are to be treated as a degenerate character (default, most evolutionary modelling treats gaps as N) or not.
-
property
num_seqs
¶ Returns the number of sequences in the alignment.
-
omit_bad_seqs
(self, quantile=None)¶ Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
Uses count_gaps_per_seq(unique=True) to obtain the counts of gaps uniquely introduced by a sequence. The cutoff is the quantile of this distribution.
- Parameters
- quantile : float or None
sequences whose unique gap count is in a quantile larger than this cutoff are excluded. The default quantile is (num_seqs - 1) / num_seqs
-
omit_gap_pos
(self, allowed_gap_frac=0.999999, motif_length=1)¶ Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
- Parameters
- allowed_gap_frac
specifies the proportion of gaps allowed in each column (default is just < 1, i.e. only cols with at least one non-gap character are preserved). Set to 1 - 1e-6 to exclude strictly gapped columns.
- motif_length
sets the “column” width, e.g. setting to 3 corresponds to codons. A motif that includes a gap at any position is included in the counting. Default is 1.
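The effect of allowed_gap_frac can be sketched for motif_length=1 on a dict of plain strings. The function below is a hypothetical stand-in, not the cogent3 method, and treats '-' and '?' as gaps.

```python
def omit_gap_pos(seqs, allowed_gap_frac=1 - 1e-6, gap_chars="-?"):
    """Keep only the columns whose gap fraction is <= allowed_gap_frac."""
    names = list(seqs)
    kept = [
        col for col in zip(*seqs.values())
        if sum(c in gap_chars for c in col) / len(col) <= allowed_gap_frac
    ]
    # Rebuild each sequence from its character in every kept column.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


aln = {"s1": "AC-GT-", "s2": "A--GTA", "s3": "AC-G--"}
default_result = omit_gap_pos(aln)            # drops only the all-gap column
strict_result = omit_gap_pos(aln, 0)          # keeps only gap-free columns
half_result = omit_gap_pos(aln, 0.5)          # keeps columns with <= 50% gaps
```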
-
omit_gap_runs
(self, allowed_run=1)¶ Returns new alignment where all seqs have runs of gaps <=allowed_run.
Note that seqs with exactly allowed_run gaps are not deleted. Default is for allowed_run to be 1 (i.e. no consecutive gaps allowed).
Because the test for whether the current gap run exceeds the maximum allowed gap run is only triggered when there is at least one gap, even negative values for allowed_run will still let sequences with no gaps through.
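The rule described above, including the edge case for negative allowed_run, can be sketched on plain strings. Both helpers are hypothetical, and the explicit check for a run length of 0 is this sketch's way of reproducing the statement that sequences with no gaps always pass.

```python
from itertools import groupby


def longest_gap_run(seq, gap_chars="-?"):
    """Length of the longest run of consecutive gap characters (0 if none)."""
    runs = (sum(1 for _ in group)
            for is_gap, group in groupby(seq, key=lambda c: c in gap_chars)
            if is_gap)
    return max(runs, default=0)


def omit_gap_runs(seqs, allowed_run=1):
    """Keep sequences whose longest gap run does not exceed allowed_run."""
    kept = {}
    for name, seq in seqs.items():
        run = longest_gap_run(seq)
        # A sequence without gaps never triggers the run-length test.
        if run == 0 or run <= allowed_run:
            kept[name] = seq
    return kept


aln = {"a": "AC-GT", "b": "A--GT", "c": "ACGGT"}
default_kept = omit_gap_runs(aln)
negative_kept = omit_gap_runs(aln, allowed_run=-1)
```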
-
omit_gap_seqs
(self, allowed_gap_frac=0)¶ Returns new alignment with seqs that have a fraction of gaps <= allowed_gap_frac.
allowed_gap_frac should be a fraction between 0 and 1 inclusive. Default is 0.
-
pad_seqs
(self, pad_length=None, **kwargs)¶ Returns copy in which sequences are padded to same length.
- Parameters
- pad_length
Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.
-
property
positions
¶ Iterates over positions in the alignment, in order.
pos_order refers to a list of indices (ints) specifying the column order. This lets you rearrange positions if you want to (e.g. to pull out individual codon positions).
Note that self.iter_positions() always returns new objects, by default lists of elements. Use map(f, self.iter_positions()) to apply the constructor or function f to the resulting lists (f must take a single list as a parameter).
Will raise IndexError if one of the indices in pos_order exceeds the sequence length. This will always happen on ragged alignments: assign to self.seq_len to set all sequences to the same length.
-
probs_per_pos
(self, motif_length=1, include_ambiguity=False, allow_gap=False, alert=False)¶ returns MotifFreqsArray per position
-
probs_per_seq
(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, alert=False)¶ return MotifFreqsArray per sequence
- Parameters
- motif_length
number of characters per tuple.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
-
project_annotation
(self, seq_name, annot)¶
-
quick_tree
(self, calc='percent', bootstrap=None, drop_invalid=False, show_progress=False, ui=None)¶ Returns a phylogenetic tree estimated from pairwise distances between sequences.
- Parameters
- calc : str
a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
- show_progress : bool
controls progress display for distance calculation
- drop_invalid : bool
If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
- bootstrap : int or None
Number of non-parametric bootstrap replicates. Resamples alignment columns with replacement and builds a phylogeny for each such resampling.
- Returns
- a phylogenetic tree. If bootstrap specified, returns the weighted majority consensus. Support for each node is stored as edge.params[‘params’].
Notes
Sequences in the observed alignment for which distances could not be computed are omitted. Bootstrap replicates are required to have distances for all seqs present in the observed data distance matrix.
-
rc
(self)¶ Returns the reverse complement alignment
-
rename_seqs
(self, renamer)¶ returns new instance with sequences renamed
- Parameters
- renamer : callable
function that takes the current sequence name and returns the new name
-
replace_seqs
(self, seqs, aa_to_codon=True)¶ Returns new alignment with same shape but with data taken from seqs.
- Parameters
- aa_to_codon
If True (default) aligns codons from protein alignment, or, more generally, substituting in codons from a set of protein sequences (not necessarily aligned). For this reason, it takes characters from seqs three at a time rather than one at a time (i.e. 3 characters in seqs are put in place of 1 character in self). If False, seqs must be the same lengths.
- If seqs is an alignment, any gaps in it will be ignored.
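The codon substitution described for aa_to_codon can be pictured with a small sketch. It assumes plain strings, a "-" gap character, and a hypothetical helper thread_codons; it is not the library's implementation.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Put one codon from dna under each amino acid of an aligned protein.

    Gaps in the protein become three gap characters; gaps in dna are ignored.
    """
    dna = dna.replace(gap, "")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    out = []
    for aa in aligned_protein:
        out.append(gap * 3 if aa == gap else next(codons))
    return "".join(out)


protein_aln = {"a": "M-K", "b": "MSK"}
dna = {"a": "ATGAAA", "b": "ATGTCAAAA"}
codon_aln = {name: thread_codons(protein_aln[name], dna[name]) for name in protein_aln}
```

The result has three times as many columns as the protein alignment, with the gap pattern preserved.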
-
reverse_complement
(self)¶ Returns the reverse complement alignment. A synonym for rc.
-
sample
(self, n=None, with_replacement=False, motif_length=1, randint=numpy.random.randint, permutation=numpy.random.permutation)¶ Returns random sample of positions from self, e.g. to bootstrap.
- Parameters
- n
the number of positions to sample from the alignment. Default is alignment length
- with_replacement
boolean flag for determining if positions are sampled with replacement
- random_series
a random number generator with .randint(min,max) .random() methods
- Notes:
By default (resampling all positions without replacement), generates a permutation of the positions of the alignment.
Setting with_replacement to True and otherwise leaving parameters as defaults generates a standard bootstrap resampling of the alignment.
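The difference between the two modes can be sketched with the standard library alone. The function sample_columns and its rng argument are hypothetical stand-ins, the alignment is a plain dict of strings, and motif_length is not modelled.

```python
import random


def sample_columns(aln, n=None, with_replacement=False, rng=None):
    """Sample alignment columns, applying the same column choice to every sequence."""
    rng = rng or random.Random()
    length = len(next(iter(aln.values())))
    n = length if n is None else n
    if with_replacement:
        # bootstrap style: a column may be drawn more than once
        cols = [rng.randrange(length) for _ in range(n)]
    else:
        # with n equal to the alignment length this is a permutation of the columns
        cols = rng.sample(range(length), n)
    return {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}


aln = {"s1": "ACGT", "s2": "AGGA"}
permuted = sample_columns(aln, rng=random.Random(1))
bootstrap = sample_columns(aln, with_replacement=True, rng=random.Random(1))
```

In both cases whole columns are drawn, so the pairing of characters across sequences within a column is never broken.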
-
seqlogo
(self, width=700, height=100, wrap=None, vspace=0.005, colours=None)¶ returns Drawable sequence logo using mutual information
- Parameters
- width, height: float
plot dimensions in pixels
- wrap: int
number of alignment columns per row
- vspace: float
vertical separation between rows, as a proportion of total plot
- colours: dict
mapping of characters to colours. If not provided, defaults to custom for everything except protein, which uses protein moltype colours.
Notes
Computes MI based on log2 and includes the gap state, so the maximum possible value is -log2(1/num_states)
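As a worked example of the stated bound, the expression -log2(1/num_states) is the same as log2(num_states). The state counts used here (four nucleotides plus a gap, twenty amino acids plus a gap) are assumptions made for illustration.

```python
from math import log2


def max_information(num_states):
    """Upper bound on the per-column value, in bits."""
    return -log2(1 / num_states)


dna_max = max_information(5)       # A, C, G, T plus the gap state
protein_max = max_information(21)  # 20 amino acids plus the gap state
```

For DNA this gives a little over 2.3 bits, compared with exactly 2 bits when the gap state is not counted.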
-
property
seqs
¶
-
set_repr_policy
(self, num_seqs=None, num_pos=None, ref_name=None)¶ specify policy for repr(self)
- Parameters
- num_seqs: int or None
number of sequences to include in represented display.
- num_pos: int or None
length of sequences to include in represented display.
- ref_name: str or None
name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.
-
sliding_windows
(self, window, step, start=None, end=None)¶ Generator yielding new Alignments of given length and interval.
- Parameters
- window
The length of each returned alignment.
- step
The interval between the start of the successive alignment objects returned.
- start
first window start position
- end
last window start position
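A minimal sketch of how window, step, start and end interact, using a dict of strings in place of an Alignment. It assumes that end is an inclusive start position and that windows never run past the end of the alignment; the library may treat these bounds differently.

```python
def sliding_windows(aln, window, step, start=None, end=None):
    """Yield sub-alignments of `window` columns whose starts are `step` apart."""
    length = len(next(iter(aln.values())))
    last_start = length - window  # final start that still gives a full window
    start = 0 if start is None else start
    end = last_start if end is None else min(end, last_start)
    for pos in range(start, end + 1, step):
        yield {name: seq[pos:pos + window] for name, seq in aln.items()}


aln = {"s1": "ACGTACGT", "s2": "ACGAACGA"}
windows = list(sliding_windows(aln, window=4, step=2))
```

With eight columns, a window of 4 and a step of 2, the windows start at positions 0, 2 and 4.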
-
strand_symmetry
(self, motif_length=1)¶ returns dict of strand symmetry test results per seq
-
take_positions
(self, cols, negate=False)¶ Returns new Alignment containing only specified positions.
By default, the seqs will be lists, but an alternative constructor can be specified.
Note that take_positions will fail on ragged positions.
-
take_positions_if
(self, f, negate=False)¶ Returns new Alignment containing cols where f(col) is True.
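Column filtering with a predicate, and the effect of negate, might look like the following sketch. It uses a dict of strings and passes each column to f as a tuple of characters, which is an assumption about the column representation.

```python
def take_positions_if(aln, f, negate=False):
    """Keep columns where f(column) is true, or where it is false if negate is set."""
    names = list(aln)
    columns = list(zip(*(aln[name] for name in names)))
    keep = [i for i, col in enumerate(columns) if bool(f(col)) != negate]
    return {name: "".join(aln[name][i] for i in keep) for name in names}


def has_no_gap(col):
    return "-" not in col


aln = {"s1": "AC-GT", "s2": "A-CGA"}
ungapped = take_positions_if(aln, has_no_gap)
gapped = take_positions_if(aln, has_no_gap, negate=True)
```

The two results partition the original columns between them.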
-
take_seqs
(self, seqs, negate=False, **kwargs)¶ Returns new Alignment containing only specified seqs.
Note that the seqs in the new alignment will be references to the same objects as the seqs in the old alignment.
-
take_seqs_if
(self, f, negate=False, **kwargs)¶ Returns new Alignment containing seqs where f(row) is True.
Note that the seqs in the new Alignment are the same objects as the seqs in the old Alignment, not copies.
-
to_dict
(self)¶ Returns the alignment as dict of names -> strings.
Note: returns strings, NOT Sequence objects.
-
to_dna
(self)¶ returns copy of self as an alignment of DNA moltype seqs
-
to_fasta
(self)¶ Return alignment in Fasta format
- Parameters
- make_seqlabel
callback function that takes the seq object and returns a label str
-
to_html
(self, name_order=None, interleave_len=60, limit=None, ref_name='longest', colors=None, font_size=12, font_family='Lucida Console')¶ returns html with embedded styles for sequence colouring
- Parameters
- name_order
order of names for display.
- interleave_len
maximum number of printed bases, defaults to 60
- limit
truncate alignment to this length
- ref_name
Name of an existing sequence or ‘longest’. If the latter, the longest sequence (excluding gaps and ambiguities) is selected as the reference.
- colors
{character: colour} mapping. Defaults to the colours of the moltype.
- font_size
in points. Affects labels and sequence and line spacing (proportional to value)
- font_family
string denoting font family
- To display in jupyter notebook:
>>> from IPython.core.display import HTML
>>> HTML(aln.to_html())
-
to_json
(self)¶ returns json formatted string
-
to_moltype
(self, moltype)¶ returns copy of self with moltype seqs
-
to_nexus
(self, seq_type, interleave_len=50)¶ Return alignment in NEXUS format and mapping to sequence ids
- Note that every sequence in the alignment MUST come from a different species! (You can concatenate multiple sequences from the same species together before building a tree.)
seq_type: dna, rna, or protein
Raises an exception if the alignment is invalid.
-
to_phylip
(self)¶ Return alignment in PHYLIP format and mapping to sequence ids
raises exception if invalid alignment
-
to_pretty
(self, name_order=None, interleave_len=None)¶ returns a string representation of the alignment in pretty print format
- Parameters
- name_order
order of names for display.
- interleave_len
maximum number of printed bases, defaults to alignment length
-
to_protein
(self)¶ returns copy of self as an alignment of PROTEIN moltype seqs
-
to_rich_dict
(self)¶ returns detailed content including info and moltype attributes
-
to_rna
(self)¶ returns copy of self as an alignment of RNA moltype seqs
-
to_type
(self, array_align=False, moltype=None, alphabet=None)¶ returns alignment of type indicated by array_align
- Parameters
- array_align: bool
if True, returns as ArrayAlignment. Otherwise as “standard” Alignment class. Conversion to ArrayAlignment loses annotations.
- moltype: MolType instance
overrides self.moltype
- alphabet: Alphabet instance
overrides self.alphabet
- If array_align would result in no change (class is same as self), returns self
-
trim_stop_codons
(self, gc=None, allow_partial=False, **kwargs)¶ Removes any terminal stop codons from the sequences
- Parameters
- gc
genetic code object
- allow_partial
if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon
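The sketch below shows one possible reading of these parameters for a single ungapped sequence. It hard-codes the stop codons of the standard genetic code instead of taking a genetic code object, and it interprets "ignores" as setting the incomplete trailing codon aside while the last complete codon is checked; both are assumptions, and the library's behaviour may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def trim_stop_codon(seq, allow_partial=False):
    """Remove a terminal stop codon from an ungapped DNA sequence."""
    remainder = len(seq) % 3
    if remainder and not allow_partial:
        raise ValueError("sequence length is not divisible by 3")
    # split off the incomplete 3' codon, if any, before checking for a stop
    complete = seq[:len(seq) - remainder]
    partial = seq[len(seq) - remainder:]
    if complete[-3:].upper() in STOP_CODONS:
        complete = complete[:-3]
    return complete + partial


trimmed = trim_stop_codon("ATGAAATAA")
```

Only a stop codon in the final complete codon position is removed; in-frame stops elsewhere are left in place.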
-
variable_positions
(self, include_gap_motif=True)¶ Return a list of variable position indexes.
- Parameters
- include_gap_motif
if False, sequences with a gap motif in a column are ignored.
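The effect of include_gap_motif can be sketched as follows for single-character motifs. The function operates on a dict of strings and assumes "-" is the gap character.

```python
def variable_positions(aln, include_gap_motif=True, gap="-"):
    """Return indexes of columns that contain more than one distinct state."""
    positions = []
    for index, column in enumerate(zip(*aln.values())):
        states = set(column)
        if not include_gap_motif:
            states.discard(gap)  # gapped sequences do not count towards variation
        if len(states) > 1:
            positions.append(index)
    return positions


aln = {"s1": "ACG-T", "s2": "ACGAT", "s3": "AGGAA"}
with_gaps = variable_positions(aln)
without_gaps = variable_positions(aln, include_gap_motif=False)
```

Column 3 differs only by a gap, so it is reported as variable in the first call but not in the second.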
-
with_gaps_from
(self, template)¶ Same alignment but overwritten with the gaps from ‘template’
-
with_masked_annotations
(self, annot_types, mask_char=None, shadow=False)¶ returns an alignment with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
- Parameters
- annot_types
annotation type(s)
- mask_char
must be a character valid for the seq moltype. The default value is the most ambiguous character, e.g. ‘?’ for DNA
- shadow
whether to mask the annotated regions, or everything but the annotated regions
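To illustrate the shadow flag, this sketch masks a single string given annotated regions as half-open (start, end) spans. The span representation and the function name are hypothetical; real annotations carry more structure.

```python
def mask_regions(seq, spans, mask_char="?", shadow=False):
    """Mask the annotated spans, or everything outside them when shadow is True."""
    annotated = set()
    for start, end in spans:
        annotated.update(range(start, end))
    return "".join(
        mask_char if (index in annotated) != shadow else char
        for index, char in enumerate(seq)
    )


seq = "ACGTACGT"
masked = mask_regions(seq, [(2, 5)])
shadowed = mask_regions(seq, [(2, 5)], shadow=True)
```

The two outputs are complementary: every position is masked in exactly one of them.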
-
with_modified_termini
(self)¶ Changes the termini to include termini char instead of gapmotif.
Useful to correct the standard gap char output by most alignment programs when aligned sequences have different ends.
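A minimal sketch of the idea for one aligned string, assuming "-" as the gap character and "?" as the termini character; internal gaps are left alone.

```python
def modify_termini(seq, gap="-", terminus="?"):
    """Replace leading and trailing runs of gaps with the termini character."""
    without_leading = seq.lstrip(gap)
    leading = len(seq) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    return terminus * leading + core + terminus * trailing


modified = modify_termini("--AC-GT-")
```

This distinguishes positions where a sequence was simply not observed from gaps that lie inside the sequence.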
-
write
(self, filename=None, format=None, **kwargs)¶ Write the alignment to a file, preserving order of sequences.
- Parameters
- filename
name of the sequence file
- format
format of the sequence file
Notes
If format is None, will attempt to infer format from the filename suffix.
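Inferring a format from a filename suffix could work along these lines. The suffix table is purely hypothetical and is not the list of formats the library recognises.

```python
from pathlib import Path

# Hypothetical mapping for illustration; the supported suffixes are not taken from the library.
SUFFIX_TO_FORMAT = {
    "fa": "fasta",
    "fasta": "fasta",
    "phy": "phylip",
    "phylip": "phylip",
    "nex": "nexus",
}


def infer_format(filename, format=None):
    """Return the explicit format if given, otherwise guess it from the suffix."""
    if format is not None:
        return format
    suffix = Path(filename).suffix.lstrip(".").lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer format from {filename!r}")
    return SUFFIX_TO_FORMAT[suffix]


guessed = infer_format("primates.fasta")
```

An explicit format argument always takes precedence over the suffix.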