Alignment class
- class pyaln.Alignment(file_or_iter=None, fileformat=None)
Represents a multiple sequence alignment.
Alignment can contain sequences of any type (nucleotide, protein, or custom). Gaps must be encoded as -.
Each entry is uniquely identified by a name, with an optional description. When reading alignment files, sequence titles are split into the name (the first word) and descriptions (the remainder of the title).
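The title-splitting rule described above can be sketched in plain Python. This is only an illustration of the rule (first word becomes the name, the remainder becomes the description); the helper name split_title is hypothetical and is not part of pyaln.

```python
def split_title(title):
    """Split a sequence title into (name, description).

    Hypothetical helper, not pyaln code: the name is the first
    whitespace-delimited word, the description is everything after it.
    """
    parts = title.split(None, 1)  # split once, on the first run of whitespace
    name = parts[0] if parts else ""
    desc = parts[1].strip() if len(parts) > 1 else ""
    return name, desc


print(split_title("seq1 this is a seq"))  # ('seq1', 'this is a seq')
print(split_title("seq3"))                # ('seq3', '')
```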
An Alignment can be instantiated from a filename (or file buffer), or from any iterable of [title, sequence] pairs. A variety of file formats are supported through Bio.AlignIO (see the Biopython documentation for a full list). When a filename or buffer is provided without a file format, Alignment tries to guess the format from the extension.
You can take portions of an Alignment (i.e. take some sequences and/or some columns) by indexing it.
The format is Alignment[rows_selector, column_selector], where:
The rows_selector can be an integer (i.e., the vertical position of the sequence in the alignment), or a slice thereof (e.g. 2:5), or a list of sequence names.
The column_selector is an integer index (i.e. the horizontal position in the alignment), or a slice thereof, or a list of (start, end) tuples, or a boolean Numpy array / Pandas Series. See examples below.
Iterating over an Alignment will yield tuples like (name, sequence). To get the description of a sequence, use Alignment.get_desc(name).
- Parameters
file_or_iter (str | TextIO | iterable) – Filename to sequence file to be loaded, or TextIO buffer already opened on it, or iterable of [title, seq] objects.
fileformat (str, optional) – When a filename or TextIO is provided, specifies the file format (e.g. fasta, clustal, stockholm, ...)
Examples
>>> ali=Alignment(pyaln_folder+'/examples/fep15_protein.fa')
Default representation (note, it does not contain descriptions):
>>> ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes
Many file formats are supported:
>>> ali=Alignment(pyaln_folder+'/examples/fep15_protein.stockholm', fileformat='stockholm')
>>> ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes
Initializing from iterable (in this case a list):
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
ATTCG- seq3
To visualize sequence descriptions, use the fasta format:
>>> ali=Alignment([ ('seq1 this is a seq', 'ATTCG-'), ('seq2 another seq', '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is a seq
ATTCG-
>seq2 another seq
--TTGG
>seq3
ATTCG-
Indexing an alignment
Get alignment of first two sequences only:
>>> ali[:2,:]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
Trim off the first and last alignment columns:
>>> ali[:,1:-1]
# Alignment of 3 sequences and 4 positions
TTCG seq1
-TTG seq2
TTCG seq3
Get subalignment of two sequences, by their name:
>>> ali[ ['seq1', 'seq3'], : ]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
ATTCG- seq3
Index columns by providing list of (start, end) elements:
>>> ali[:, [(0,2), (5, 6)]]
# Alignment of 3 sequences and 3 positions
AT- seq1
--G seq2
AT- seq3
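The effect of indexing columns with a list of (start, end) elements can be reproduced on plain strings. The sketch below assumes each pair is a half-open Python-style range, which is consistent with the output shown above; select_column_ranges is a hypothetical helper, not a pyaln function.

```python
def select_column_ranges(seq, ranges):
    """Concatenate the half-open column ranges [start, end) of one aligned sequence."""
    return "".join(seq[start:end] for start, end in ranges)


rows = [("seq1", "ATTCG-"), ("seq2", "--TTGG"), ("seq3", "ATTCG-")]
selected = [(name, select_column_ranges(seq, [(0, 2), (5, 6)])) for name, seq in rows]
print(selected)  # [('seq1', 'AT-'), ('seq2', '--G'), ('seq3', 'AT-')]
```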
Iterating over an alignment:
>>> [(name, len(seq)) for name, seq in ali]
[('seq1', 6), ('seq2', 6), ('seq3', 6)]
- add_seq(title, sequence, desc=None, index=None)
Add a sequence to the alignment.
The sequence name (i.e., its unique id) is derived from title, taking its first word. The rest of title is taken as sequence description. By default, the sequence is added to the bottom of the alignment.
- Parameters
title (str) – Sequence title, from which name and description are derived
sequence (str) – Actual sequence, with gaps encoded as “-” characters
desc (str, optional) – The description can be directly provided here. If so, title is taken as name instead
index (int, optional) – The position at which the sequence is inserted. If not provided, it goes last
- Returns
None
- Return type
None
Examples
>>> ali=Alignment()
>>> ali.add_seq('seq1 custom nt seq', 'ATTCG-')
>>> ali.add_seq('seq2 another seq', '--TTGG')
>>> print(ali.fasta())
>seq1 custom nt seq
ATTCG-
>seq2 another seq
--TTGG
>>> ali.add_seq('seq3', 'ATT---', desc='some desc')
>>> ali.add_seq('seq4', 'ATTGG-', index=0)
>>> print(ali.fasta())
>seq4
ATTGG-
>seq1 custom nt seq
ATTCG-
>seq2 another seq
--TTGG
>seq3 some desc
ATT---
- ali_length()
Returns the number of columns in the alignment (i.e., its width)
- Returns
The number of columns in the alignment
- Return type
int
Examples
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.ali_length()
6
Warning
For best performance, the Alignment class does not check that all sequences have the same length. This method simply returns the length of the first sequence. To check for homogeneous sequence length, see same_length().
See also
same_length
check that all sequences are truly aligned, i.e. have the same length
- column_weights(method='m')
Computes weights indicating the relative importance of the different alignment columns, based on their level of conservation.
- Parameters
method (str) – One of these arguments:
- ‘m’ : maximum frequency of non-gap character in self
- ‘i’ : information content, i.e. 2 - sum(p*log2(p)) where p is frequency of non-gap characters
- ‘q’ : quadratic sum, i.e. sum(p*p) where p is frequency of non-gap characters
- Returns
Numpy array of float numbers, of the same length as the alignment (n. of columns) indicating the different weights
- Return type
np.ndarray
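As a rough illustration of the ‘m’ and ‘q’ weighting schemes, the following pure-Python sketch computes one weight per column. It is not pyaln's implementation: the function name sketch_column_weights is hypothetical, and it assumes that frequencies are taken over all rows of the column (gaps count in the denominator, as in conservation_map) while gap characters themselves are never scored. The ‘i’ scheme is left out on purpose.

```python
from collections import Counter


def sketch_column_weights(seqs, method="m"):
    """Return one weight per alignment column (illustrative only).

    Assumption: p = count(character) / number of sequences, gaps excluded
    from the characters scored but included in the denominator.
    """
    if method not in ("m", "q"):
        raise ValueError("this sketch only covers methods 'm' and 'q'")
    n = len(seqs)
    weights = []
    for column in zip(*seqs):
        counts = Counter(ch for ch in column if ch != "-")
        freqs = [c / n for c in counts.values()]
        if method == "m":
            weights.append(max(freqs, default=0.0))  # most frequent non-gap character
        else:
            weights.append(sum(p * p for p in freqs))  # quadratic sum
    return weights


seqs = ["ATTCG-", "--TTGG", "--TT--"]
print(sketch_column_weights(seqs, "m"))
print(sketch_column_weights(seqs, "q"))
```

Under these assumptions, a fully conserved column (the third one here) gets weight 1.0 with both methods.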
- concatenate(other)
Concatenate two alignments, i.e., add their sequences one next to the other
The two alignments must have the same names in the same order or an AlignmentError exception is raised. The sequence descriptions in returned alignment are taken from self
- Parameters
other (Alignment or str) – alignment that will be concatenated to the right of the self in the returned Alignment. If a string is provided, this same sequence is added to each sequence in self
- Returns
alignment with same names as inputs, and sequences resulting from their concatenation
- Return type
Alignment
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali2=Alignment([ ('seq1 first', 'TTGC-TAG'), ('seq2 this is 2nd' , '-ATGGGGC'), ('seq3', 'AATCGGCC')])
>>> ali.concatenate(ali2)
# Alignment of 3 sequences and 14 positions
ATTCG-TTGC-TAG seq1
--TTGG-ATGGGGC seq2
ATTCG-AATCGGCC seq3
Note that descriptions in the second alignment are ignored:
>>> ali3= Alignment([ ('seq1 this desc is ignored', 'TTGC-TAG'), ('seq2' , '-ATGGGGC'), ('seq3 this also', 'AATCGGCC')])
>>> print( ali.concatenate(ali3).fasta() )
>seq1 first
ATTCG-TTGC-TAG
>seq2 this is 2nd
--TTGG-ATGGGGC
>seq3
ATTCG-AATCGGCC
- consensus(ignore_gaps=None)
Compute the consensus sequence, taking the most represented character for each column
- Parameters
ignore_gaps (float, optional) – By default, gaps are treated as any other character, so that they are returned for columns in which they are the most common character. If you provide ignore_gaps with a float from 0.0 to 1.0, gaps are not present on the output except for columns with a frequency equal or greater than the value provided. For example, a value of 1.0 implies gaps are included only if a column is entirely made of gaps
- Returns
The consensus sequence
- Return type
str
Examples
>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2' , '-TTCGT'), ('seq3', 'ACGCG-'), ('seq4', 'CTTGGT'), ('seq5', '-TGCT-'), ('seq6', '-TGGG-')])
>>> ali
# Alignment of 6 sequences and 6 positions
ATTCG- seq1
-TTCGT seq2
ACGCG- seq3
CTTGGT seq4
-TGCT- seq5
-TGGG- seq6
>>> ali.conservation_map()
          0         1    2         3         4         5
-  0.500000  0.000000  0.0  0.000000  0.000000  0.666667
A  0.333333  0.000000  0.0  0.000000  0.000000  0.000000
C  0.166667  0.166667  0.0  0.666667  0.000000  0.000000
G  0.000000  0.000000  0.5  0.333333  0.833333  0.000000
T  0.000000  0.833333  0.5  0.000000  0.166667  0.333333
>>> ali.consensus()
'-TGCG-'
>>> ali.consensus(0.6)
'ATGCG-'
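The ignore_gaps rule can be sketched on plain strings as follows. This is an illustration, not pyaln's code: sketch_consensus is a hypothetical name, and ties between equally frequent characters are broken alphabetically, an assumption chosen because it reproduces the third column ('G' rather than 'T') of the example above.

```python
from collections import Counter


def sketch_consensus(seqs, ignore_gaps=None):
    """Most represented character per column, with optional gap threshold."""
    result = []
    for column in zip(*seqs):
        counts = Counter(column)
        if ignore_gaps is not None and len(counts) > 1:
            gap_freq = counts["-"] / len(column)
            if gap_freq < ignore_gaps:
                counts.pop("-", None)  # gaps below the threshold cannot win
        # sorted() + max() picks the alphabetically first of the top-scoring characters
        result.append(max(sorted(counts), key=lambda ch: counts[ch]))
    return "".join(result)


seqs = ["ATTCG-", "-TTCGT", "ACGCG-", "CTTGGT", "-TGCT-", "-TGGG-"]
print(sketch_consensus(seqs))       # -TGCG-
print(sketch_consensus(seqs, 0.6))  # ATGCG-
```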
- conservation_map(counts=None)
Computes the frequency of characters (nucleotides/amino acids) at each column of the alignment
Gaps are considered as any other character during computation. The returned object reports frequencies at each position, for all characters which are observed at least once in the alignment. This may not correspond to the full nucleotide or protein alphabet, if some characters are not present in the alignment.
- Returns
The returned dataframe has one row per observed character (i.e., nucleotide / amino acid) and one column per alignment position. Each value is a float ranging from 0 to 1 representing the frequency of that character in that alignment column.
- Return type
pd.DataFrame
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', '--TT--')])
>>> ali.conservation_map()
          0         1    2         3         4         5
-  0.666667  0.666667  0.0  0.000000  0.333333  0.666667
A  0.333333  0.000000  0.0  0.000000  0.000000  0.000000
C  0.000000  0.000000  0.0  0.333333  0.000000  0.000000
G  0.000000  0.000000  0.0  0.000000  0.666667  0.333333
T  0.000000  0.333333  1.0  0.666667  0.000000  0.000000
Warning
This function is cached for best performance. Thus, do not directly modify the returned object. The hash key for caching is derived from sequences only: names are not considered.
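For readers who want to see the computation spelled out, here is a dependency-free sketch of the same per-column frequency table, returned as a dict of lists rather than a DataFrame. It is an illustration only; column_frequencies is a hypothetical helper, and it assumes all sequences have the same length.

```python
from collections import Counter


def column_frequencies(seqs):
    """Map each observed character to its list of per-column frequencies."""
    n = len(seqs)
    alphabet = sorted(set("".join(seqs)))  # only characters seen at least once
    table = {ch: [] for ch in alphabet}
    for column in zip(*seqs):
        counts = Counter(column)  # gaps are counted like any other character
        for ch in alphabet:
            table[ch].append(counts[ch] / n)
    return table


table = column_frequencies(["ATTCG-", "--TTGG", "--TT--"])
for ch, freqs in table.items():
    print(ch, [round(f, 6) for f in freqs])
```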
- descriptions()
Returns the descriptions for all sequences in the alignment as list
- Returns
An ordered list of descriptions for each sequence in the alignment
- Return type
list of str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.descriptions()
['this is first', 'this is 2nd', '']
- fasta(nchar=60)
Returns the alignment in (aligned) fasta format
- Parameters
nchar (int, default=60) – The number of characters per line for sequences
- Returns
A multiline string with the alignment in fasta format, including sequence descriptions
- Return type
str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3
ATTCG-
See also
write
generic function supporting many output formats
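The role of nchar in fasta() is easier to see with sequences longer than one line. The sketch below wraps sequences at a fixed width in plain Python; format_fasta is a hypothetical helper written for illustration and is not how pyaln produces its output.

```python
def format_fasta(entries, nchar=60):
    """Format (name, description, sequence) triples as fasta text, nchar characters per line."""
    lines = []
    for name, desc, seq in entries:
        lines.append(">" + (name + " " + desc if desc else name))
        for start in range(0, len(seq), nchar):
            lines.append(seq[start:start + nchar])
    return "\n".join(lines)


entries = [("seq1", "this is first", "ATTCG-"), ("seq3", "", "ATTCG-")]
print(format_fasta(entries, nchar=4))
```

With nchar=4, each six-character sequence is split into a four-character line followed by a two-character line.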
- classmethod from_numpy(nparray, names, descriptions=None)
Class method to instance an Alignment object from a numpy array.
- Parameters
nparray (np.ndarray) – analogous to the object returned by Alignment.to_numpy(), it must have one row per sequence, and one column per alignment position. Its dtype must be np.str_
names (list of str) – ordered list of sequence names (i.e. identifiers)
descriptions (list of str, optional) – ordered list of sequence descriptions. If not provided, all descriptions are set to ‘’
- Returns
new alignment object
- Return type
Alignment
- get_desc(name)
Returns the description for a sequence entry
If no sequence with that name is present in the alignment, an AlignmentError exception is raised.
- Parameters
name (str) – The name (i.e. identifier) of the sequence
- Returns
The description stored for the sequence
- Return type
str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.get_desc('seq1')
'this is first'
>>> ali.get_desc('seq3')
''
- get_seq(name)
Returns the sequence with the requested name
If no sequence with that name is present in the alignment, an AlignmentError exception is raised.
- Parameters
name (str) – The name (i.e. identifier) of the sequence requested
- Returns
The sequence requested, including any gaps it may contain
- Return type
str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.get_seq('seq1')
'ATTCG-'
>>> ali.get_seq('seq2')
'--TTGG'
- has_name(name)
Checks whether the alignment contains a sequence with the name provided
- Parameters
name (str) – The name (i.e. identifier) searched in the alignment.
- Returns
A bool indicating whether the name is present
- Return type
bool
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.has_name('seq1')
True
>>> ali.has_name('seq2 this is 2nd') # note: the name would be 'seq2' only
False
- n_seqs()
Returns the number of sequences in the alignment (i.e. the number of rows, or alignment height)
- Returns
Number of sequences in the alignment
- Return type
int
Examples
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.n_seqs()
3
See also
ali_length
length of the alignment (i.e. number of columns)
- names()
Returns a list of all sequence names in the alignment
The names returned do not include the sequence descriptions.
- Returns
An ordered list of sequence names (identifiers) in the alignment
- Return type
list of str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.names()
['seq1', 'seq2', 'seq3']
See also
titles
get all sequence titles, including their name and description
- position_in_ali(name, pos_in_seq)
Maps a position in a certain sequence (without counting gaps) to its corresponding position in the alignment
If the requested position is invalid, an IndexError is raised.
- Parameters
name (str) – the name of the sequence
pos_in_seq (int) – 0-based sequence position, i.e. the index mapping to the requested sequence without gaps
- Returns
0-based alignment position, i.e. the column index where the requested sequence position is found
- Return type
int
Examples
>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2' , '--TTG-'), ('seq3', 'AT-CCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
AT-CCG seq3
>>> ali.position_in_ali('seq1', 4)
4
>>> ali.position_in_ali('seq2', 0)
2
>>> ali.position_in_ali('seq3', 2)
3
See also
position_in_seq
maps an alignment position to sequence position for a single sequence
position_map
maps all alignment positions for all sequences
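The mapping performed by position_in_ali can be illustrated on a single gapped string. This is a minimal sketch of the idea (count non-gap characters until the requested one is reached), with a hypothetical helper name; it makes no claim about how pyaln implements it.

```python
def ungapped_to_column(aligned_seq, pos_in_seq):
    """Return the 0-based column holding the pos_in_seq-th non-gap character."""
    if pos_in_seq < 0:
        raise IndexError("sequence position must be non-negative")
    seen = -1  # index of the last non-gap character encountered so far
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            seen += 1
            if seen == pos_in_seq:
                return col
    raise IndexError("sequence position beyond the end of the ungapped sequence")


print(ungapped_to_column("ATTCG-", 4))  # 4
print(ungapped_to_column("--TTG-", 0))  # 2
print(ungapped_to_column("AT-CCG", 2))  # 3
```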
- position_in_seq(name, pos_in_ali)
Maps an alignment column position to the corresponding position in a certain sequence
- Parameters
name (str) – the name of the sequence to map to
pos_in_ali (int) – 0-based alignment position, i.e. the column index
- Returns
0-based position in sequence, i.e. the index of the sequence without counting gaps at the requested column.
Note that for gap positions, the position of the closest non-gap to the left is reported. For left terminal gaps, the value reported is -1.
- Return type
int
Examples
>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2' , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.position_in_seq('seq1', 4)
4
>>> ali.position_in_seq('seq2', 2)
0
Checking a left terminal gap returns -1:
>>> ali.position_in_seq('seq2', 0)
-1
Checking a gap position returns the index of the closest non gap char on the left:
>>> ali.position_in_seq('seq1', 5)
4
See also
position_in_ali
maps a sequence position to alignment position for a single sequence
position_map
maps all alignment positions for all sequences
- position_map()
Compute a numerical matrix instrumental to map alignment positions to sequence positions (and reverse)
- Returns
Returns a Pandas DataFrame with one row per sequence (indexed by name), and one column per alignment position. Each value is the index of that particular sequence (without gaps) in that alignment column. All positions are 0-based. For gap positions, the position of the closest non-gap to the left is reported. For left terminal gaps, the value reported is -1
- Return type
pd.DataFrame
Examples
>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2' , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.position_map()
      0  1  2  3  4  5
seq1  0  1  2  3  4  4
seq2 -1 -1  0  1  2  2
seq3  0  1  2  3  4  5
Note
Computing this matrix makes sense only if you will use positions for many or all sequences. For the corresponding operations on single sequences, see functions position_in_seq() and position_in_ali().
See also
position_in_seq
maps an alignment position to sequence position for a single sequence
position_in_ali
maps a sequence position to alignment position for a single sequence
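Each row of the position map follows a simple running-counter rule: the counter starts at -1 and advances on every non-gap character. The sketch below applies that rule to plain strings; position_row is a hypothetical helper used only to illustrate the convention documented above (gaps report the closest non-gap to the left, left terminal gaps report -1).

```python
def position_row(aligned_seq):
    """Return, for each column, the 0-based index in the ungapped sequence."""
    row = []
    current = -1  # stays -1 until the first non-gap character is seen
    for ch in aligned_seq:
        if ch != "-":
            current += 1
        row.append(current)
    return row


for name, seq in [("seq1", "ATTCG-"), ("seq2", "--TTG-"), ("seq3", "ATTCCG")]:
    print(name, position_row(seq))
# seq1 [0, 1, 2, 3, 4, 4]
# seq2 [-1, -1, 0, 1, 2, 2]
# seq3 [0, 1, 2, 3, 4, 5]
```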
- positions()
Returns an iterator over the column indices of the alignment
This is equivalent to range(self.ali_length())
- Returns
An iterator of int
- Return type
range
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> for i in ali.positions(): print( ali.get_seq('seq1')[i] )
A
T
T
C
G
-
- remove_by_index(*seqindices)
Remove one or more sequences in the alignment by their index, in-place.
The input indices refer to the position of the sequence in the alignment, i.e. their row number. Note that the modification is done in place. To obtain a new object instead, see examples below.
- Parameters
*seqindices (tuple) – index or indices of sequences to be removed from the alignment
- Returns
None
- Return type
None
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.remove_by_index(0)
>>> ali
# Alignment of 2 sequences and 6 positions
--TTGG seq2
ATTCG- seq3
To return a new alignment without certain sequences, do not use this function. Instead, use indexing by rows:
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> indices_to_omit=set( [0, 2] )
>>> ali[ [n for i,n in enumerate(ali.names()) if not i in indices_to_omit], :]
# Alignment of 1 sequences and 6 positions
--TTGG seq2
- remove_by_name(*names)
Remove one or more sequences in the alignment by name in-place.
Note that the modification is done in place. To obtain a new object instead, see examples below.
- Parameters
*names (tuple) – name or names of sequences to be removed from the alignment
- Returns
None
- Return type
None
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.remove_by_name('seq1')
>>> ali
# Alignment of 2 sequences and 6 positions
--TTGG seq2
ATTCG- seq3
>>> ali.remove_by_name('seq2', 'seq3')
>>> ali
# Empty alignment
To return a new alignment without certain names, do not use this function. Instead, use indexing by rows:
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> names_to_omit=set( ['seq2'] )
>>> ali[ [n for n in ali.names() if not n in names_to_omit], :]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
ATTCG- seq3
- remove_empty_seqs(inplace=True)
Remove all sequences which are entirely made of gaps or that are empty.
By default, removal is done in place.
- Parameters
inplace (bool, default:True) – whether the removal should be done in place. If not, a new Alignment is returned instead
- Returns
If inplace==True, None is returned; otherwise, a new Alignment without empty sequences is returned
- Return type
None or Alignment
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', '------')])
>>> ali.remove_empty_seqs()
>>> ali
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
- same_length()
Check whether sequences are aligned, i.e. they have the same length
- Returns
True if all sequences have the same length, False otherwise
- Return type
bool
Examples
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.same_length()
True
>>> ali.add_seq('seqX', 'TATTCGGT-')
>>> ali.same_length()
False
See also
ali_length
length of the alignment (i.e. number of columns)
- score_similarity(targets=None, gaps='y', metrics='i', weights='m', method=2)
Computes metrics of similarity between some target sequences and sequences in the (self) alignment
The rationale of the function is to quantify how much target sequences ‘fit’ in the alignment. By default, similarity metrics are computed for all sequences in self, meaning targets=self. In that case, they provide a measure of how much sequences resemble each other, i.e. the global agreement of alignment, and it can be instrumental to identify outliers.
The default use of this function is to compute the Average Sequence Identity (ASI) of targets, obtained by performing pairwise comparisons between each target and all sequences in self, and averaging. The definition of sequence identity varies as gaps may be counted in different ways, as specified by the gaps argument. For explanatory examples, see this page.
Besides ASI, a variant called AWSI (Average Weighted Sequence Identity) is available, wherein different alignment columns are given different weight when averaging. Various built-in methods to define weights are available, all based on the concept that conserved alignment positions are given more weight. Custom weights may also be provided.
- Parameters
targets (Alignment, optional) – sequences for which the similarity metrics are requested, provided as an Alignment instance. The sequences must be aligned to the self Alignment. If not provided, self is taken as targets, meaning that metrics are reported for each sequence in self, compared to the full alignment.
gaps (str, default:'y') –
defines how to take gaps into account when comparing sequences pairwise. Possible values:
- ‘y’ : gaps are considered, and counted as mismatches. Positions that are gaps in both sequences are ignored.
- ‘n’ : gaps are not considered. Positions that are gaps in either of the sequences compared are ignored.
- ‘t’ : terminal gaps are trimmed. Terminal gap positions in either sequence are ignored, others are considered as in ‘y’.
- ‘a’ : gaps are considered as any other character; even gap-to-gap matches are scored as identities.
Multiple arguments may be concatenated (e.g. ‘yn’) to compute all of the possibilities.
metrics (str, default:'i') –
defines which metrics are computed. Possible values:
- ‘i’ : average sequence identity, aka ASI
- ‘w’ : weighted sequence identity, aka AWSI
Multiple arguments may be concatenated (e.g. ‘iw’) to compute all of the possibilities.
weights (str or list or np.ndarray, default: 'm') –
if AWSI is computed, defines how weights are computed for each alignment column. Possible values:
- ‘m’ : maximum frequency of non-gap character in self
- ‘i’ : information content, i.e. 2 - sum(p*log2(p)) where p is frequency of non-gap characters
- ‘q’ : quadratic sum, i.e. sum(p*p) where p is frequency of non-gap characters
Multiple arguments may be concatenated (e.g. ‘mi’) to compute all of the possibilities.
Alternatively, custom weights may be provided as an iterable (e.g. list or Numpy ndarray) of numbers (the weights), with as many elements as the alignment columns.
- Returns
a DataFrame with one row per target (indexed by sequence names), and one column for each sequence metrics requested. If multiple parameters were specified for gaps, metrics and/or weights, the output columns are a MultiIndex which covers all combinations requested.
- Return type
pd.DataFrame
See also
sequtils.sequence_identity
function that compares sequences pairwise and returns their sequence identity. Accepts the same gaps argument as score_similarity. Though sequtils.sequence_identity is not run internally by score_similarity, it can be used to replicate its results.
Examples
>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2' , '--TTG-'), ('seq3', 'AT-CCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
AT-CCG seq3
>>> fep_ali=Alignment(pyaln_folder + '/examples/fep15_protein.fa', fileformat='fasta')
>>> fep_ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes
>>> fep_ali.score_similarity()
metrics                    ASI
Fep15_danio_rerio     0.777778
Fep15_S_salar         0.826334
Fep15_O_mykiss        0.822684
Fep15_T_rubripes      0.829599
Fep15_T_nigroviridis  0.815000
Fep15_O_latipes       0.767438
If we choose to not consider gap positions, the score increases:
>>> fep_ali.score_similarity(gaps='n')
metrics                    ASI
Fep15_danio_rerio     0.793051
Fep15_S_salar         0.838283
Fep15_O_mykiss        0.834522
Fep15_T_rubripes      0.842566
Fep15_T_nigroviridis  0.835351
Fep15_O_latipes       0.805693
Requesting AWSI as well as ASI, and testing both considering and omitting gaps:
>>> fep_ali.score_similarity(gaps='yn', metrics='iw')
gaps                         y                   n
metrics                    ASI      AWSI       ASI      AWSI
Fep15_danio_rerio     0.777778  0.847123  0.793051  0.856044
Fep15_S_salar         0.826334  0.885040  0.838283  0.893412
Fep15_O_mykiss        0.822684  0.882183  0.834522  0.890497
Fep15_T_rubripes      0.829599  0.887255  0.842566  0.896094
Fep15_T_nigroviridis  0.815000  0.874389  0.835351  0.891288
Fep15_O_latipes       0.767438  0.834809  0.805693  0.860639
Computing AWSI with all possible weights available, considering only internal gaps:
>>> fep_ali.score_similarity(gaps='t', metrics='w', weights='iqm')
metrics                 AWSI.i    AWSI.q    AWSI.m
Fep15_danio_rerio     0.881664  0.913531  0.847123
Fep15_S_salar         0.912174  0.937773  0.885040
Fep15_O_mykiss        0.910436  0.936245  0.882183
Fep15_T_rubripes      0.915546  0.938932  0.887255
Fep15_T_nigroviridis  0.901011  0.929572  0.874389
Fep15_O_latipes       0.872316  0.903712  0.834809
- sequences()
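To make the gaps options more concrete, the sketch below scores a single pair of aligned sequences under the ‘y’, ‘n’ and ‘a’ rules. It is an illustration written from the parameter descriptions above, not pyaln's implementation: pairwise_identity is a hypothetical name, the ‘t’ option is omitted, and taking the number of compared positions as the denominator is an assumption.

```python
def pairwise_identity(a, b, gaps="y"):
    """Fraction of identical positions between two aligned sequences.

    gaps='y': gap vs residue is a mismatch; gap vs gap columns are skipped
    gaps='n': any column with a gap in either sequence is skipped
    gaps='a': gaps are ordinary characters, gap vs gap counts as a match
    """
    if gaps not in ("y", "n", "a"):
        raise ValueError("this sketch only covers gaps='y', 'n' and 'a'")
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (same length)")
    matches = compared = 0
    for x, y in zip(a, b):
        gap_x, gap_y = x == "-", y == "-"
        if gaps == "y" and gap_x and gap_y:
            continue
        if gaps == "n" and (gap_x or gap_y):
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


for mode in "yna":
    print(mode, round(pairwise_identity("ATTCG-", "--TTG-", mode), 4))
# y 0.4
# n 0.6667
# a 0.5
```

Averaging such pairwise values over the sequences of an alignment gives a quantity in the spirit of ASI, although the exact set of pairs that pyaln averages over is not specified here.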
Returns a list of all the sequences in the alignment
- Returns
An ordered list of sequences in the alignment (without names or descriptions)
- Return type
list of str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.sequences()
['ATTCG-', '--TTGG', 'ATTCG-']
- set_desc(name, desc)
Change the description of an entry in-place.
- Parameters
name (str) – The name (i.e. identifier) of the entry to be altered
desc (str) – The new description to be used
- Returns
None
- Return type
None
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3 not sure', 'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3 not sure
ATTCG-
>>> ali.set_desc('seq3', 'obviously third')
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3 obviously third
ATTCG-
- set_seq(name, sequence)
Change the sequence of an entry in-place.
- Parameters
name (str) – The name (i.e. identifier) of the sequence to be altered
sequence (str) – The new sequence to be set
- Returns
None
- Return type
None
Examples
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
ATTCG- seq3
>>> ali.set_seq('seq1', 'CHANGE')
>>> ali
# Alignment of 3 sequences and 6 positions
CHANGE seq1
--TTGG seq2
ATTCG- seq3
- property shape
Return the size of the alignment in its two dimensions, i.e. the number of sequences and alignment columns
- Returns
(height, width), where height is the number of sequences in the alignment and width the number of columns
- Return type
tuple of int
Examples
>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.shape
(3, 6)
Note
This method is presented as a property for symmetry with the Numpy array .shape attribute. However, this Alignment property is read-only.
- titles()
Returns a list of all sequence titles in the alignment
Each title is the concatenation of sequence name and description, separated by a space. If the description is empty for an entry, only the name is returned
- Returns
An ordered list of sequence titles in the alignment
- Return type
list of str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.titles()
['seq1 this is first', 'seq2 this is 2nd', 'seq3']
See also
names
get all sequence names (their unique identifiers, without description)
- to_biopython()
Returns a copy of the alignment as a Bio.Align.MultipleSeqAlignment object
The SeqRecord instances in the returned MultipleSeqAlignment have their id and name attributes set to sequence names, and also possess the description attribute.
- Returns
Alignment in biopython format (Bio.Align.MultipleSeqAlignment)
- Return type
MultipleSeqAlignment
- to_numpy()
Returns a numpy 2-D array representation of the alignment, useful for vectorized sequence methods
- Returns
The returned array has one row per sequence and one column per alignment position. Each value is a single character. The dtype is np.str_. Note that rows are not indexed by sequence names, just by their order index
- Return type
np.ndarray
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', '--TT--')])
>>> print(ali.to_numpy())
[['A' 'T' 'T' 'C' 'G' '-']
 ['-' '-' 'T' 'T' 'G' 'G']
 ['-' '-' 'T' 'T' '-' '-']]
Warning
This function is cached for best performance. Thus, do not directly modify the returned object. The hash key for caching is derived from sequences only: names are not considered.
- to_pandas(use_names=False)
Returns a pandas DataFrame representation of the alignment
- Parameters
use_names (bool, optional) – Normally, the returned DataFrame has a simple RangeIndex as index. Specify this to instead use sequence names as the index.
- Returns
The returned dataframe has one row per sequence and one column per alignment position. Each value is a single character. The dtype is object. Rows are indexed by the sequence names if use_names==True, or by a RangeIndex by default
- Return type
pd.DataFrame
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', '--TT--')])
>>> ali.to_pandas()
   0  1  2  3  4  5
0  A  T  T  C  G  -
1  -  -  T  T  G  G
2  -  -  T  T  -  -
>>> ali.to_pandas(use_names=True)
      0  1  2  3  4  5
seq1  A  T  T  C  G  -
seq2  -  -  T  T  G  G
seq3  -  -  T  T  -  -
- trim_gaps(pct=1.0, count=None, inplace=False)
Removes the alignment columns with more gaps than specified
By default, a new alignment without the columns identified as ‘too gappy’ is returned.
- Parameters
pct (float, default:1.0) – minimal gap frequency (from 0.0 to 1.0) for a column to be removed
count (int, optional) – defines the minimum absolute number of gaps for a column to be removed. It is an alternative way to select columns, which overrides the pct argument
inplace (bool, default:False) – whether the column removal should be done in place. If not, a new Alignment is returned instead
- Returns
By default, a new Alignment without the columns identified as too gappy is returned; if inplace==True, None is returned
- Return type
None or Alignment
Examples
>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.trim_gaps(0.5)
# Alignment of 3 sequences and 5 positions
ATTCG seq1
--TTG seq2
ATTCC seq3
>>> ali.trim_gaps(count=1)
# Alignment of 3 sequences and 3 positions
TCG seq1
TTG seq2
TCC seq3
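The column filter behind trim_gaps can be sketched on plain strings. The helper below (drop_gappy_columns, a hypothetical name) removes a column when its gap frequency is at least pct, or, if count is given, when it holds at least count gaps; this reading of the thresholds is an assumption that matches the two examples above, and the code is not taken from pyaln.

```python
def drop_gappy_columns(seqs, pct=1.0, count=None):
    """Return the sequences with 'too gappy' columns removed."""
    n = len(seqs)
    keep = []
    for col, column in enumerate(zip(*seqs)):
        n_gaps = column.count("-")
        if count is not None:
            too_gappy = n_gaps >= count  # absolute threshold overrides pct
        else:
            too_gappy = n_gaps / n >= pct
        if not too_gappy:
            keep.append(col)
    return ["".join(seq[i] for i in keep) for seq in seqs]


seqs = ["ATTCG-", "--TTG-", "ATTCCG"]
print(drop_gappy_columns(seqs, 0.5))      # ['ATTCG', '--TTG', 'ATTCC']
print(drop_gappy_columns(seqs, count=1))  # ['TCG', 'TTG', 'TCC']
```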
- write(fileformat='fasta', to_file=None)
Returns a string representation of the alignment in a format of choice
Internally uses Bio.Align to generate output. Supported fileformat arguments include clustal, stockholm, phylip and many others. The full list of supported fileformat arguments is given in the Biopython documentation.
- Parameters
fileformat (str, default='fasta') – text format requested
to_file (str | TextIO, optional) – filename or buffer to write into. If not specified, the output is returned instead
- Returns
String representation of alignment in the requested format
- Return type
str
Examples
>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd' , '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.write('phylip'))
3 6
seq1 ATTCG-
seq2 --TTGG
seq3 ATTCG-