SequenceCollection¶

class cogent3.core.alignment.SequenceCollection(data, names=None, alphabet=None, moltype=None, name=None, info=None, conversion_f=None, is_array=False, force_same_data=False, remove_duplicate_names=False, label_to_name=None, suppress_named_seqs=False)¶

Container for unaligned sequences

Attributes

num_seqs: Returns the number of sequences in the alignment.
seqs

Methods

`add_seqs`(self, other[, before_name, after_name])	Returns new object of class self with sequences from other added.
`annotate_from_gff`(self, f)	Copies annotations from gff-format file to self.
`apply_pssm`(self[, pssm, path, background, …])	scores sequences using the specified pssm
`copy`(self)	Returns deep copy of self.
`copy_annotations`(self, unaligned)	Copies annotations from seqs in unaligned to self, matching by name.
`counts`(self[, motif_length, …])	returns dict of counts of motifs
`counts_per_seq`(self[, motif_length, …])	returns dict of counts of motifs per sequence
`deepcopy`(self[, sliced])	Returns deep copy of self.
`degap`(self, **kwargs)	Returns copy in which sequences have no gaps.
`dotplot`(self[, name1, name2, window, …])	make a dotplot between specified sequences.
`entropy_per_seq`(self[, motif_length, …])	Returns the Shannon entropy per sequence.
`get_ambiguous_positions`(self)	Returns dict of seq:{position:char} for ambiguous chars.
`get_identical_sets`(self[, mask_degen])	returns sets of names for sequences that are identical
`get_lengths`(self[, include_ambiguity, allow_gap])	returns {name: seq length, …}
`get_motif_probs`(self[, alphabet, …])	Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
`get_seq`(self, seqname)	Return a sequence object for the specified seqname.
`get_seq_indices`(self, f[, negate])	Returns list of keys of seqs where f(row) is True.
`get_similar`(self, target[, min_similarity, …])	Returns new Alignment containing sequences similar to target.
`get_translation`(self[, gc, incomplete_ok])	translate from nucleic acid to protein
`has_terminal_stops`(self[, gc, allow_partial])	Returns True if any sequence has a terminal stop codon.
`is_ragged`(self)	Returns True if alignment has sequences of different lengths.
`iter_selected`(self[, seq_order, pos_order])	Iterates over elements in the alignment.
`iter_seqs`(self[, seq_order])	Iterates over values (sequences) in the alignment, in order.
`omit_gap_runs`(self[, allowed_run])	Returns new alignment where all seqs have runs of gaps <=allowed_run.
`omit_gap_seqs`(self[, allowed_gap_frac])	Returns new alignment with seqs that have a gap fraction <= allowed_gap_frac.
`pad_seqs`(self[, pad_length])	Returns copy in which sequences are padded to same length.
`probs_per_seq`(self[, motif_length, …])	return MotifFreqsArray per sequence
`rc`(self)	Returns the reverse complement alignment
`rename_seqs`(self, renamer)	returns new instance with sequences renamed
`reverse_complement`(self)	Returns the reverse complement alignment.
`set_repr_policy`(self[, num_seqs, num_pos, …])	specify policy for repr(self)
`strand_symmetry`(self[, motif_length])	returns dict of strand symmetry test results per seq
`take_seqs`(self, seqs[, negate])	Returns new Alignment containing only specified seqs.
`take_seqs_if`(self, f[, negate])	Returns new Alignment containing seqs where f(row) is True.
`to_dict`(self)	Returns the alignment as dict of names -> strings.
`to_dna`(self)	returns copy of self as an alignment of DNA moltype seqs
`to_fasta`(self)	Return alignment in Fasta format
`to_json`(self)	returns json formatted string
`to_moltype`(self, moltype)	returns copy of self with moltype seqs
`to_nexus`(self, seq_type[, interleave_len])	Return alignment in NEXUS format and mapping to sequence ids
`to_phylip`(self)	Return alignment in PHYLIP format and mapping to sequence ids
`to_protein`(self)	returns copy of self as an alignment of PROTEIN moltype seqs
`to_rich_dict`(self)	returns detailed content including info and moltype attributes
`to_rna`(self)	returns copy of self as an alignment of RNA moltype seqs
`trim_stop_codons`(self[, gc, allow_partial])	Removes any terminal stop codons from the sequences
`with_modified_termini`(self)	Changes the termini to include termini char instead of gapmotif.
`write`(self[, filename, format])	Write the alignment to a file, preserving order of sequences.

add_seqs(self, other, before_name=None, after_name=None)¶

Returns new object of class self with sequences from other added.

Parameters

other: same class as self or coerceable to that class
before_name: str: name of the sequence before which the new sequences are inserted
after_name: str: name of the sequence after which the new sequences are inserted

Notes

If both before_name and after_name are specified, the seqs will be inserted using before_name.

By default the sequences are appended to the end of the alignment. This can be changed by using either the before_name or the after_name argument.
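
The placement rules in these notes can be illustrated on a plain list of names. This is a minimal sketch of the described ordering only, not cogent3's implementation. `insert_names` is a hypothetical helper, and it assumes new sequences go immediately before `before_name` or immediately after `after_name`.

```python
def insert_names(names, new_names, before_name=None, after_name=None):
    """Return a new name order with new_names placed per the notes above."""
    names = list(names)
    if before_name is not None:
        # before_name takes precedence when both arguments are given
        index = names.index(before_name)
    elif after_name is not None:
        index = names.index(after_name) + 1
    else:
        # default: append to the end
        index = len(names)
    return names[:index] + list(new_names) + names[index:]


order = ["a", "b", "c"]
print(insert_names(order, ["x"]))                   # ['a', 'b', 'c', 'x']
print(insert_names(order, ["x"], before_name="b"))  # ['a', 'x', 'b', 'c']
print(insert_names(order, ["x"], after_name="b"))   # ['a', 'b', 'x', 'c']
```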

annotate_from_gff(self, f)¶

Copies annotations from gff-format file to self.

Matches by name of sequence. This method accepts string path or pathlib.Path or file-like object (e.g. StringIO)

Skips sequences in the file that are not in self.

apply_pssm(self, pssm=None, path=None, background=None, pseudocount=0, names=None, ui=None)¶

scores sequences using the specified pssm

Parameters

pssm: profile.PSSM: if not provided, will be loaded from path
path: path to either a jaspar or cisbp matrix (the path must have a suffix matching the format).
pseudocount: adjustment for zero in matrix
names: returns only scores for these sequences and in the name order

Returns

numpy array of log2 based scores at every position
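
To illustrate what log2-based scores at every position look like, here is a minimal pure-Python sketch. It is not cogent3's implementation: `make_pssm` and `score_positions` are hypothetical helpers, the count matrix and uniform background are made-up example values, and the result is a plain list rather than a numpy array.

```python
import math


def make_pssm(counts, background, pseudocount=0):
    """Convert per-position base counts into log2(observed / background)."""
    pssm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(background)
        # With pseudocount=0, a zero count gives log2(0) and raises ValueError
        pssm.append(
            {
                base: math.log2(
                    (column.get(base, 0) + pseudocount) / total / background[base]
                )
                for base in background
            }
        )
    return pssm


def score_positions(seq, pssm):
    """Sum the per-position log2 scores for every window of the motif width."""
    width = len(pssm)
    return [
        sum(pssm[k][seq[i + k]] for k in range(width))
        for i in range(len(seq) - width + 1)
    ]


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
counts = [
    {"A": 2, "C": 2, "G": 2, "T": 2},
    {"A": 4, "C": 2, "G": 1, "T": 1},
]
print(score_positions("ACAG", make_pssm(counts, background)))  # [0.0, 1.0, -1.0]
```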

copy(self)¶: Returns deep copy of self.

copy_annotations(self, unaligned)¶

Copies annotations from seqs in unaligned to self, matching by name.

Alignment programs like ClustalW don’t preserve annotations, so this method is available to copy annotations off the unaligned sequences.

unaligned should be a dictionary of Sequence instances.

Ignores sequences that are not in self, so safe to use on larger dict of seqs that are not in the current collection/alignment.

counts(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)¶

returns dict of counts of motifs

Parameters

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.

Notes

only non-overlapping motifs are counted
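
Non-overlapping counting can be sketched with the standard library as follows. This is an illustration of the idea, not the library's code. `count_motifs` is a hypothetical helper; it assumes an incomplete trailing motif is dropped and does not model include_ambiguity, allow_gap or exclude_unobserved.

```python
from collections import Counter


def count_motifs(seq, motif_length=1):
    """Count non-overlapping motifs, reading from the first position."""
    usable = len(seq) - len(seq) % motif_length  # drop an incomplete tail
    return Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )


print(count_motifs("ACGTAC", motif_length=2))
# Counter({'AC': 2, 'GT': 1})
```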

counts_per_seq(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)¶

returns dict of counts of motifs per sequence

Parameters

motif_length: number of characters per tuple.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.

Notes

only non-overlapping motifs are counted

deepcopy(self, sliced=True)¶: Returns deep copy of self.

degap(self, **kwargs)¶: Returns copy in which sequences have no gaps.

dotplot(self, name1=None, name2=None, window=20, threshold=None, min_gap=0, width=500, title=None, rc=False, show_progress=False)¶

make a dotplot between specified sequences. Random sequences chosen if names not provided.

Parameters

name1, name2: str or None: names of sequences. If one is not provided, a random choice is made
window: int: k-mer size for comparison between sequences
threshold: int: windows where the sequences are identical >= threshold are a match
min_gap: int: permitted gap for joining adjacent line segments, default is no gap joining
width: int: figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
title: title for the plot
rc: bool or None: include dotplot of reverse complement also. Only applies to nucleic acid moltypes

Returns

a Drawable or AnnotatedDrawable
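
The window and threshold parameters can be understood through a brute-force sketch of window matching. This is not how cogent3 computes dotplots. `matching_windows` is a hypothetical helper, and treating a missing threshold as "all positions in the window must match" is an assumption.

```python
def matching_windows(seq1, seq2, window=20, threshold=None):
    """Return (i, j) start coordinates of windows with >= threshold identities."""
    if threshold is None:
        threshold = window  # assumption: exact window match by default
    matches = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            same = sum(
                a == b
                for a, b in zip(seq1[i : i + window], seq2[j : j + window])
            )
            if same >= threshold:
                matches.append((i, j))
    return matches


print(matching_windows("ACGTAC", "CGTA", window=3))  # [(1, 0), (2, 1)]
```

Each (i, j) pair corresponds to one point on the plot; min_gap joining of adjacent segments is not modelled here.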

entropy_per_seq(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=True, alert=False)¶

Returns the Shannon entropy per sequence.

Parameters

motif_length: int: number of characters per tuple.
include_ambiguity: bool: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: bool: if True, motifs containing a gap character are included.
exclude_unobserved: bool: if True, unobserved motif combinations are excluded.

Notes

For motif_length > 1, it is advisable to specify exclude_unobserved=True, which avoids unnecessary calculations.
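
A minimal sketch of Shannon entropy over non-overlapping motif frequencies follows. It is illustrative only: `shannon_entropy` is a hypothetical helper, and using log base 2 (bits) is an assumption. Because only observed motifs appear in the counts, unobserved motifs contribute nothing, which is why excluding them saves work for longer motifs.

```python
import math
from collections import Counter


def shannon_entropy(seq, motif_length=1):
    """Entropy in bits of the non-overlapping motif frequencies in seq."""
    usable = len(seq) - len(seq) % motif_length
    counts = Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )
    total = sum(counts.values())
    # Only observed motifs are present, so no zero-probability terms arise
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("ACGT"))  # 2.0
```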

get_ambiguous_positions(self)¶

Returns dict of seq:{position:char} for ambiguous chars.

Used in likelihood calculations.

get_identical_sets(self, mask_degen=False)¶

returns sets of names for sequences that are identical

Parameters

mask_degen: if True, degenerate characters are ignored
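
The grouping can be sketched on a plain dict of name -> string. This is an illustration rather than the library's implementation; `identical_sets` is a hypothetical helper and the mask_degen handling is not modelled.

```python
from collections import defaultdict


def identical_sets(seqs):
    """Group names whose sequence strings are exactly equal."""
    by_seq = defaultdict(set)
    for name, seq in seqs.items():
        by_seq[seq].add(name)
    # Keep only groups with more than one member
    return [names for names in by_seq.values() if len(names) > 1]


data = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGA", "s4": "ACGT"}
print(identical_sets(data))
```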

get_lengths(self, include_ambiguity=False, allow_gap=False)¶

returns {name: seq length, …}

Parameters

include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.

get_motif_probs(self, alphabet=None, include_ambiguity=False, exclude_unobserved=False, allow_gap=False, pseudocount=0)¶

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

Parameters

include_ambiguity: if True resolved ambiguous codes are included in estimation of frequencies, default is False.
exclude_unobserved: if True, motifs that are not present in the alignment are excluded from the returned dictionary, default is False.
allow_gap: allow gap motif

Notes

only non-overlapping motifs are counted

get_seq(self, seqname)¶: Return a sequence object for the specified seqname.

get_seq_indices(self, f, negate=False)¶

Returns list of keys of seqs where f(row) is True.

List will be in the same order as self.names, if present.
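
The selection logic, including negate, can be sketched on a plain dict. `seq_indices` is a hypothetical helper and dict insertion order stands in for self.names.

```python
def seq_indices(seqs, f, negate=False):
    """Return names, in order, where f(seq) is True (or False when negate)."""
    return [name for name, seq in seqs.items() if bool(f(seq)) != negate]


data = {"a": "ACGT", "b": "AC", "c": "ACGTAC"}
print(seq_indices(data, lambda s: len(s) >= 4))               # ['a', 'c']
print(seq_indices(data, lambda s: len(s) >= 4, negate=True))  # ['b']
```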

get_similar(self, target, min_similarity=0.0, max_similarity=1.0, metric=<cogent3.util.transform.for_seq object>, transform=None)¶

Returns new Alignment containing sequences similar to target.

Parameters

target: sequence object to compare to. Can be in the alignment.
min_similarity: minimum similarity that will be kept. Default 0.0.
max_similarity: maximum similarity that will be kept. Default 1.0. (Note that both min_similarity and max_similarity are inclusive.)
metric: similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence classes have lots of methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
transform: transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.

WARNING: if the transformation changes the type of the sequence (e.g. extracting a string from an RnaSequence object), distance metrics that depend on instance data of the original class may fail.
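
How the inclusive bounds, metric and transform interact can be sketched on plain strings. This is a hedged illustration, not the library's code: `frac_same` and `similar_names` are hypothetical helpers that return names rather than a new alignment.

```python
def frac_same(first, second):
    """Fraction of compared positions that are identical (0.0 to 1.0)."""
    pairs = list(zip(first, second))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def similar_names(seqs, target, min_similarity=0.0, max_similarity=1.0,
                  metric=frac_same, transform=None):
    """Names of sequences whose similarity to target is within the bounds."""
    if transform is not None:
        target = transform(target)
    kept = []
    for name, seq in seqs.items():
        if transform is not None:
            seq = transform(seq)  # applied to both target and each sequence
        # both bounds are inclusive
        if min_similarity <= metric(target, seq) <= max_similarity:
            kept.append(name)
    return kept


data = {"s1": "ACGT", "s2": "ACGA", "s3": "TTTT"}
print(similar_names(data, "ACGT", min_similarity=0.75))  # ['s1', 's2']
```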

get_translation(self, gc=None, incomplete_ok=False, **kwargs)¶

translate from nucleic acid to protein

Parameters

gc: genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
incomplete_ok: bool: if True, codons that are mixes of nucleotides and gaps are converted to ‘?’. If False, a ValueError is raised.
kwargs: related to construction of the resulting object

Returns

A new instance of self translated into protein

has_terminal_stops(self, gc=None, allow_partial=False)¶

Returns True if any sequence has a terminal stop codon.

Parameters

gc: genetic code object
allow_partial: if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon
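
The effect of allow_partial can be sketched for a single DNA string. This is an assumption-laden illustration: `has_terminal_stop` is a hypothetical helper, only the standard genetic code's stop codons are used, and raising ValueError for a length not divisible by 3 when allow_partial is False is an assumption about the behaviour.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def has_terminal_stop(seq, allow_partial=False):
    """True if the last complete codon of seq is a stop codon."""
    remainder = len(seq) % 3
    if remainder:
        if not allow_partial:
            # assumed behaviour for lengths not divisible by 3
            raise ValueError("sequence length is not divisible by 3")
        seq = seq[:-remainder]  # ignore the incomplete 3' codon
    return seq[-3:] in STOP_CODONS


seqs = {"s1": "ATGAAA", "s2": "ATGTAA"}
print(any(has_terminal_stop(s) for s in seqs.values()))  # True
```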

is_array = {'array', 'array_seqs'}¶

is_ragged(self)¶: Returns True if alignment has sequences of different lengths.

iter_selected(self, seq_order=None, pos_order=None)¶

Iterates over elements in the alignment.

seq_order (names) can be used to select a subset of seqs. pos_order (positions) can be used to select a subset of positions.

Always iterates along a seq first, then down a position (transposes normal order of a[i][j]; possibly, this should change).

WARNING: Alignment.iter_selected() is not the same as alignment.iteritems() (which is the built-in dict iteritems that iterates over key-value pairs).
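
The iteration order described here (along each sequence first, then on to the next sequence) can be sketched with a generator over a plain dict. `iter_selected` below is a hypothetical stand-in, not the method itself.

```python
def iter_selected(seqs, seq_order=None, pos_order=None):
    """Yield elements sequence by sequence, in the requested orders."""
    names = list(seqs) if seq_order is None else seq_order
    for name in names:
        seq = seqs[name]
        positions = range(len(seq)) if pos_order is None else pos_order
        for pos in positions:
            yield seq[pos]


data = {"a": "AC", "b": "GT"}
print(list(iter_selected(data)))                                   # ['A', 'C', 'G', 'T']
print(list(iter_selected(data, seq_order=["b"], pos_order=[1, 0])))  # ['T', 'G']
```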

iter_seqs(self, seq_order=None)¶

Iterates over values (sequences) in the alignment, in order.

seq_order: list of keys giving the order in which seqs will be returned. Defaults to self.names. Note that only these sequences will be returned, and that KeyError will be raised if seq_order contains sequences that have been deleted from the Alignment. If self.names is None, returns the sequences in the same order as self.named_seqs.values().

Use map(f, self.seqs) to apply the constructor f to each seq. f must accept a single list as an argument.

Always returns references to the same objects that are values of the alignment.

moltype = MolType(('\x00', '\x01', …, 'ÿ'))¶: class-level default whose alphabet spans all 256 single-byte characters, '\x00' through 'ÿ'.

property num_seqs¶: Returns the number of sequences in the alignment.

omit_gap_runs(self, allowed_run=1)¶

Returns new alignment where all seqs have runs of gaps <=allowed_run.

Note that seqs with exactly allowed_run gaps are not deleted. Default is for allowed_run to be 1 (i.e. no consecutive gaps allowed).

Because the test for whether the current gap run exceeds the maximum allowed gap run is only triggered when there is at least one gap, even negative values for allowed_run will still let sequences with no gaps through.
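
The behaviour described in these two paragraphs, including the negative allowed_run case, can be reproduced with a small sketch. This is illustrative rather than the library's code; `exceeds_gap_run` and `omit_gap_runs` are hypothetical helpers and "-" is assumed to be the gap character.

```python
def exceeds_gap_run(seq, allowed_run, gap="-"):
    """True if any run of consecutive gaps is longer than allowed_run."""
    run = 0
    for char in seq:
        if char == gap:
            run += 1
            # the check only fires on a gap, so gap-free seqs always pass
            if run > allowed_run:
                return True
        else:
            run = 0
    return False


def omit_gap_runs(seqs, allowed_run=1):
    """Keep sequences with no gap run longer than allowed_run."""
    return {
        name: seq
        for name, seq in seqs.items()
        if not exceeds_gap_run(seq, allowed_run)
    }


data = {"a": "AC-GT", "b": "A--GT", "c": "ACGT"}
print(omit_gap_runs(data))                  # {'a': 'AC-GT', 'c': 'ACGT'}
print(omit_gap_runs(data, allowed_run=-1))  # {'c': 'ACGT'}
```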

omit_gap_seqs(self, allowed_gap_frac=0)¶

Returns new alignment with seqs that have a gap fraction <= allowed_gap_frac.

allowed_gap_frac should be a fraction between 0 and 1 inclusive. Default is 0.
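
A minimal sketch of filtering by gap fraction, assuming "-" is the gap character. `gap_fraction` and `omit_gap_seqs` are hypothetical helpers operating on a plain dict, not the library's implementation.

```python
def gap_fraction(seq, gap="-"):
    """Fraction of positions in seq that are gaps (0.0 for an empty seq)."""
    return seq.count(gap) / len(seq) if seq else 0.0


def omit_gap_seqs(seqs, allowed_gap_frac=0, gap="-"):
    """Keep sequences whose gap fraction is <= allowed_gap_frac."""
    return {
        name: seq
        for name, seq in seqs.items()
        if gap_fraction(seq, gap) <= allowed_gap_frac
    }


data = {"a": "ACGT", "b": "A-GT", "c": "A---"}
print(omit_gap_seqs(data))                         # {'a': 'ACGT'}
print(omit_gap_seqs(data, allowed_gap_frac=0.25))  # {'a': 'ACGT', 'b': 'A-GT'}
```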

pad_seqs(self, pad_length=None, **kwargs)¶

Returns copy in which sequences are padded to same length.

Parameters

pad_length: Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.
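
The pad_length rule can be sketched as follows. This is a hedged illustration: `pad_seqs` here is a hypothetical stand-in working on plain strings, and padding at the end with "-" is an assumption.

```python
def pad_seqs(seqs, pad_length=None, pad_char="-"):
    """Pad every sequence to a common length."""
    max_len = max((len(seq) for seq in seqs.values()), default=0)
    # fall back to the longest sequence when pad_length is None or too small
    if pad_length is None or pad_length < max_len:
        pad_length = max_len
    return {name: seq.ljust(pad_length, pad_char) for name, seq in seqs.items()}


data = {"a": "ACG", "b": "A"}
print(pad_seqs(data))                # {'a': 'ACG', 'b': 'A--'}
print(pad_seqs(data, pad_length=5))  # {'a': 'ACG--', 'b': 'A----'}
```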

probs_per_seq(self, motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, alert=False)¶: return MotifFreqsArray per sequence

rc(self)¶: Returns the reverse complement alignment

rename_seqs(self, renamer)¶

returns new instance with sequences renamed

Parameters

renamer: callable: function that takes the current sequence name and returns the new one

reverse_complement(self)¶: Returns the reverse complement alignment. A synonym for rc.
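
For a single DNA string, reverse complementing can be sketched with `str.translate`. This is an illustration of the operation, not the library's implementation. The `reverse_complement` function below is a hypothetical helper that handles only A, C, G and T; other characters, such as gaps, pass through unchanged.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AAGC"))  # GCTT
```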

property seqs¶

set_repr_policy(self, num_seqs=None, num_pos=None, ref_name=None)¶

specify policy for repr(self)

Parameters

num_seqs: int or None: number of sequences to include in represented display.
num_pos: int or None: length of sequences to include in represented display.
ref_name: str or None: name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.

strand_symmetry(self, motif_length=1)¶: returns dict of strand symmetry test results per seq

take_seqs(self, seqs, negate=False, **kwargs)¶

Returns new Alignment containing only specified seqs.

Note that the seqs in the new alignment will be references to the same objects as the seqs in the old alignment.

take_seqs_if(self, f, negate=False, **kwargs)¶

Returns new Alignment containing seqs where f(row) is True.

Note that the seqs in the new Alignment are the same objects as the seqs in the old Alignment, not copies.

to_dict(self)¶

Returns the alignment as dict of names -> strings.

Note: returns strings, NOT Sequence objects.

to_dna(self)¶: returns copy of self as an alignment of DNA moltype seqs

to_fasta(self)¶

Return alignment in Fasta format

Parameters

make_seqlabel: callback function that takes the seq object and returns a label str
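
The general shape of FASTA output with a label callback can be sketched as below. This is not the library's formatter: the `to_fasta` function here is a hypothetical helper, the 60-character line width is an assumption, and because plain strings carry no label the callback receives the name rather than a seq object.

```python
def to_fasta(seqs, make_seqlabel=None, block_size=60):
    """Format a dict of name -> string as FASTA text."""
    lines = []
    for name, seq in seqs.items():
        label = make_seqlabel(name) if make_seqlabel else name
        lines.append(f">{label}")
        # wrap the sequence into fixed-width lines
        lines.extend(
            seq[i : i + block_size] for i in range(0, len(seq), block_size)
        )
    return "\n".join(lines) + "\n"


print(to_fasta({"s1": "ACGT", "s2": "AC"}), end="")
```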

to_json(self)¶: returns json formatted string

to_moltype(self, moltype)¶: returns copy of self with moltype seqs

to_nexus(self, seq_type, interleave_len=50)¶

Return alignment in NEXUS format and mapping to sequence ids

NOTE: every sequence in the alignment MUST come from a different species. (You can concatenate multiple sequences from the same species together before building a tree.)

seq_type: dna, rna, or protein

Raises exception if invalid alignment

to_phylip(self)¶

Return alignment in PHYLIP format and mapping to sequence ids

raises exception if invalid alignment

to_protein(self)¶: returns copy of self as an alignment of PROTEIN moltype seqs

to_rich_dict(self)¶: returns detailed content including info and moltype attributes

to_rna(self)¶: returns copy of self as an alignment of RNA moltype seqs

trim_stop_codons(self, gc=None, allow_partial=False, **kwargs)¶

Removes any terminal stop codons from the sequences

Parameters

gc: genetic code object
allow_partial: if True and the sequence length is not divisible by 3, ignores the 3’ terminal incomplete codon

with_modified_termini(self)¶

Changes the termini to include termini char instead of gapmotif.

Useful to correct the standard gap char output by most alignment programs when aligned sequences have different ends.

write(self, filename=None, format=None, **kwargs)¶

Write the alignment to a file, preserving order of sequences.

Parameters

filename: name of the sequence file
format: format of the sequence file

Notes

If format is None, will attempt to infer format from the filename suffix.
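
Suffix-based inference can be sketched with `pathlib`. The suffix table below is hypothetical and for illustration only; it is not the set of formats cogent3 recognises, and `infer_format` is not a library function.

```python
from pathlib import Path

# Hypothetical suffix-to-format table, for illustration only
SUFFIX_TO_FORMAT = {
    "fasta": "fasta",
    "fa": "fasta",
    "phylip": "phylip",
    "nex": "nexus",
}


def infer_format(filename, format=None):
    """Return format if given, else look it up from the filename suffix."""
    if format is not None:
        return format
    suffix = Path(filename).suffix.lstrip(".").lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer format from suffix {suffix!r}")
    return SUFFIX_TO_FORMAT[suffix]


print(infer_format("seqs.fasta"))                  # fasta
print(infer_format("seqs.dat", format="phylip"))   # phylip
```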
