Alignment class

class pyaln.Alignment(file_or_iter=None, fileformat=None)

Represents a multiple sequence alignment.

Alignment can contain sequences of any type (nucleotide, protein, or custom). Gaps must be encoded as -.

Each entry is uniquely identified by a name, with an optional description. When reading alignment files, sequence titles are split into the name (the first word) and the description (the remainder of the title).
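The title-splitting rule can be sketched in plain Python. The helper name split_title is hypothetical and not part of pyaln; the sketch assumes that words are delimited by whitespace.

```python
def split_title(title):
    """Split a sequence title into (name, description).

    Hypothetical helper, not part of pyaln: the name is the first
    whitespace-delimited word, the description is whatever follows.
    """
    parts = title.split(None, 1)  # split once, on any run of whitespace
    name = parts[0] if parts else ''
    desc = parts[1] if len(parts) > 1 else ''
    return name, desc


print(split_title('seq1 this is a seq'))  # ('seq1', 'this is a seq')
print(split_title('seq3'))                # ('seq3', '')
```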

An Alignment can be instantiated from a filename (or file buffer), or from any iterable of [title, sequence] pairs. A variety of file formats are supported through Bio.AlignIO (see the Biopython documentation for the full list). When a filename or buffer is provided but not the file format, Alignment tries to guess it from the extension.

You can take portions of an Alignment (i.e. take some sequences and/or some columns) by indexing it.

The format is Alignment[rows_selector, column_selector], where:

  • The rows_selector can be an integer (i.e., the vertical position of the sequence in the alignment), or a slice thereof (e.g. 2:5), or a list of sequence names.

  • The column_selector is an integer index (i.e. the horizontal position in the alignment), or a slice thereof, or a list of (start, end) tuples, or a boolean Numpy array / Pandas Series. See examples below.

Iterating over an Alignment will yield tuples like (name, sequence). To get the description of a sequence, use Alignment.get_desc(name).

Parameters
  • file_or_iter (str | TextIO | iterable) – Name of the sequence file to be loaded, or a TextIO buffer already opened on it, or an iterable of [title, seq] objects.

  • fileformat (str, optional) – When a filename or TextIO is provided, specifies the file format (e.g. fasta, clustal, stockholm)

Examples

>>> ali=Alignment(pyaln_folder+'/examples/fep15_protein.fa')

Default representation (note, it does not contain descriptions):

>>> ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes

Many file formats are supported:

>>> ali=Alignment(pyaln_folder+'/examples/fep15_protein.stockholm', fileformat='stockholm')
>>> ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes

Initializing from iterable (in this case a list):

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
ATTCG- seq3

To visualize sequence descriptions, use the fasta format:

>>> ali=Alignment([ ('seq1 this is a seq', 'ATTCG-'), ('seq2 another seq', '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is a seq
ATTCG-
>seq2 another seq
--TTGG
>seq3
ATTCG-

Indexing an alignment

Get alignment of first two sequences only:

>>> ali[:2,:]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
--TTGG seq2

Trim off the first and last alignment columns:

>>> ali[:,1:-1]
# Alignment of 3 sequences and 4 positions
TTCG seq1
-TTG seq2
TTCG seq3

Get subalignment of two sequences, by their name:

>>> ali[ ['seq1', 'seq3'], : ]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
ATTCG- seq3

Index columns by providing list of (start, end) elements:

>>> ali[:, [(0,2), (5, 6)]]
# Alignment of 3 sequences and 3 positions
AT- seq1
--G seq2
AT- seq3

Iterating over an alignment:

>>> [(name, len(seq))   for name, seq in ali]
[('seq1', 6), ('seq2', 6), ('seq3', 6)]
add_seq(title, sequence, desc=None, index=None)

Add a sequence to the alignment.

The sequence name (i.e., its unique id) is derived from title, taking its first word. The rest of title is taken as sequence description. By default, the sequence is added to the bottom of the alignment.

Parameters
  • title (str) – Sequence title, from which name and description are derived

  • sequence (str) – Actual sequence, with gaps encoded as “-” characters

  • desc (str, optional) – The description can be directly provided here. If so, title is taken as name instead

  • index (int, optional) – The position at which the sequence is inserted. If not provided, it goes last
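
The interplay of title, desc and index can be sketched on a plain list of records. This is an illustration under stated assumptions, not the pyaln implementation: add_record is a hypothetical helper, entries are stored as (name, desc, sequence) tuples, and no check for duplicate names is made.

```python
def add_record(records, title, sequence, desc=None, index=None):
    """Insert a (name, desc, sequence) tuple into `records`, in place."""
    if desc is None:
        # Derive both name and description from the title.
        name, _, desc = title.partition(' ')
        desc = desc.strip()
    else:
        # An explicit description means the title is used as the name.
        name = title
    entry = (name, desc, sequence)
    if index is None:
        records.append(entry)         # default: bottom of the alignment
    else:
        records.insert(index, entry)  # requested row position


records = []
add_record(records, 'seq1 custom nt seq', 'ATTCG-')
add_record(records, 'seq2 another seq', '--TTGG')
add_record(records, 'seq3', 'ATT---', desc='some desc')
add_record(records, 'seq4', 'ATTGG-', index=0)
print([name for name, _, _ in records])
```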

Returns

None

Return type

None

Examples

>>> ali=Alignment()
>>> ali.add_seq('seq1 custom nt seq', 'ATTCG-')
>>> ali.add_seq('seq2 another seq',   '--TTGG')
>>> print(ali.fasta())
>seq1 custom nt seq
ATTCG-
>seq2 another seq
--TTGG
>>> ali.add_seq('seq3', 'ATT---', desc='some desc')
>>> ali.add_seq('seq4', 'ATTGG-', index=0)
>>> print(ali.fasta())
>seq4
ATTGG-
>seq1 custom nt seq
ATTCG-
>seq2 another seq
--TTGG
>seq3 some desc
ATT---
ali_length()

Returns the number of columns in the alignment (i.e., its width)

Returns

The number of columns in the alignment

Return type

int

Examples

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.ali_length()
6

Warning

For best performance, the Alignment class does not check that all sequences have the same length. This method simply returns the length of the first sequence. To check for homogeneous sequence length, see same_length()
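
The difference between the two checks can be illustrated on a plain list of strings. Both functions below are hypothetical stand-ins written for this sketch; returning 0 for an empty list is an assumption, not documented behaviour.

```python
def ali_length_sketch(seqs):
    """Length of the first sequence only (no consistency check)."""
    return len(seqs[0]) if seqs else 0  # 0 for empty input is an assumption


def same_length_sketch(seqs):
    """True when every sequence has the same length."""
    return len({len(s) for s in seqs}) <= 1


seqs = ['ATTCG-', '--TTGG', 'ATTCG-']
aligned_before = same_length_sketch(seqs)   # True
seqs.append('TATTCGGT-')                    # longer than the others
aligned_after = same_length_sketch(seqs)    # False
reported_length = ali_length_sketch(seqs)   # still 6: only the first row is read
print(aligned_before, aligned_after, reported_length)
```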

See also

same_length

check that all sequences are truly aligned, i.e. have the same length

column_weights(method='m')

Computes weights indicating the relative importance of the different alignment columns, based on their level of conservation.

Parameters

method (str) – One of these arguments:

  - ‘m’ : maximum frequency of non-gap character in self
  - ‘i’ : information content, i.e. 2- sum(p*log2(p)) where p is frequency of non-gap characters
  - ‘q’ : quadratic sum, i.e. sum(p*p) where p is frequency of non-gap characters
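
A rough sketch of the ‘m’ and ‘q’ options on a list of aligned strings follows. It is not the pyaln implementation: column_weights_sketch is a hypothetical name, and the sketch assumes that each frequency p is the count of a non-gap character divided by the total number of rows (gaps included in the denominator). The ‘i’ option is left out.

```python
from collections import Counter


def column_weights_sketch(seqs, method='m'):
    """Per-column weights for methods 'm' (max frequency) and 'q' (sum of p*p)."""
    n_rows = len(seqs)
    weights = []
    for column in zip(*seqs):
        counts = Counter(char for char in column if char != '-')
        # Assumption: frequencies are relative to the number of rows.
        freqs = [n / n_rows for n in counts.values()]
        if method == 'm':
            weights.append(max(freqs, default=0.0))
        elif method == 'q':
            weights.append(sum(p * p for p in freqs))
        else:
            raise ValueError("this sketch only covers 'm' and 'q'")
    return weights


seqs = ['ATTCG-', '--TTGG', '--TT--']
weights_m = column_weights_sketch(seqs, 'm')
weights_q = column_weights_sketch(seqs, 'q')
print(weights_m)
print(weights_q)
```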

Returns

Numpy array of float numbers, of the same length as the alignment (n. of columns) indicating the different weights

Return type

np.ndarray

See also

score_similarity

concatenate(other)

Concatenate two alignments, i.e., add their sequences one next to the other

The two alignments must have the same names in the same order or an AlignmentError exception is raised. The sequence descriptions in returned alignment are taken from self

Parameters

other (Alignment or str) – alignment that will be concatenated to the right of self in the returned Alignment. If a string is provided, this same sequence is added to each sequence in self

Returns

alignment with same names as inputs, and sequences resulting from their concatenation

Return type

Alignment

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali2=Alignment([ ('seq1 first', 'TTGC-TAG'), ('seq2 this is 2nd'  , '-ATGGGGC'), ('seq3', 'AATCGGCC')])
>>> ali.concatenate(ali2)
# Alignment of 3 sequences and 14 positions
ATTCG-TTGC-TAG seq1
--TTGG-ATGGGGC seq2
ATTCG-AATCGGCC seq3
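
The name check and the joining step can be sketched on lists of (name, sequence) pairs. This is an illustration only: concatenate_sketch is a hypothetical helper, and it raises ValueError as a stand-in for the AlignmentError raised by pyaln.

```python
def concatenate_sketch(left, right):
    """Join sequences row by row; `right` is a list of pairs or a plain str."""
    if isinstance(right, str):
        # The same string is appended to every sequence.
        return [(name, seq + right) for name, seq in left]
    if [name for name, _ in left] != [name for name, _ in right]:
        # Stand-in for AlignmentError in this sketch.
        raise ValueError('names differ or are in a different order')
    return [(name, seq + other_seq)
            for (name, seq), (_, other_seq) in zip(left, right)]


left = [('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')]
right = [('seq1', 'TTGC-TAG'), ('seq2', '-ATGGGGC'), ('seq3', 'AATCGGCC')]
joined = concatenate_sketch(left, right)
padded = concatenate_sketch(left, 'NN')
print(joined)
```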

Note that descriptions in the second alignment are ignored:

>>> ali3= Alignment([ ('seq1 this desc is ignored', 'TTGC-TAG'), ('seq2'  , '-ATGGGGC'), ('seq3 this also', 'AATCGGCC')])
>>> print( ali.concatenate(ali3).fasta() )
>seq1 first
ATTCG-TTGC-TAG
>seq2 this is 2nd
--TTGG-ATGGGGC
>seq3
ATTCG-AATCGGCC
consensus(ignore_gaps=None)

Compute the consensus sequence, taking the most represented character for each column

Parameters

ignore_gaps (float, optional) – By default, gaps are treated as any other character, so that they are returned for columns in which they are the most common character. If you provide ignore_gaps with a float from 0.0 to 1.0, gaps are not present in the output except for columns whose gap frequency is equal to or greater than the value provided. For example, a value of 1.0 implies gaps are included only if a column is entirely made of gaps
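
One possible reading of the ignore_gaps rule is sketched below on a list of aligned strings. It is not the pyaln implementation: consensus_sketch is a hypothetical name, gaps are assumed to stay eligible only in columns whose gap frequency reaches the threshold, and ties are assumed to be broken alphabetically (with ‘-’ sorting first).

```python
from collections import Counter


def consensus_sketch(seqs, ignore_gaps=None):
    """Most common character per column, with optional gap filtering."""
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        if ignore_gaps is not None and counts['-'] / len(column) < ignore_gaps:
            # Gap frequency below the threshold: gaps cannot be chosen here.
            counts.pop('-', None)
        # Highest count wins; ties go to the alphabetically first character
        # (an assumption made for this sketch).
        best = min(counts, key=lambda char: (-counts[char], char))
        out.append(best)
    return ''.join(out)


seqs = ['ATTCG-', '-TTCGT', 'ACGCG-', 'CTTGGT', '-TGCT-', '-TGGG-']
default_consensus = consensus_sketch(seqs)
filtered_consensus = consensus_sketch(seqs, 0.6)
print(default_consensus, filtered_consensus)
```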

Returns

The consensus sequence

Return type

str

Examples

>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2'  , '-TTCGT'), ('seq3', 'ACGCG-'),  ('seq4', 'CTTGGT'), ('seq5', '-TGCT-'), ('seq6', '-TGGG-')])
>>> ali
# Alignment of 6 sequences and 6 positions
ATTCG- seq1
-TTCGT seq2
ACGCG- seq3
CTTGGT seq4
-TGCT- seq5
-TGGG- seq6
>>> ali.conservation_map()
          0         1    2         3         4         5
-  0.500000  0.000000  0.0  0.000000  0.000000  0.666667
A  0.333333  0.000000  0.0  0.000000  0.000000  0.000000
C  0.166667  0.166667  0.0  0.666667  0.000000  0.000000
G  0.000000  0.000000  0.5  0.333333  0.833333  0.000000
T  0.000000  0.833333  0.5  0.000000  0.166667  0.333333
>>> ali.consensus()
'-TGCG-'
>>> ali.consensus(0.6)
'ATGCG-'
conservation_map(counts=None)

Computes the frequency of characters (nucleotides/amino acids) at each column of the alignment

Gaps are considered as any other character during computation. The returned object reports frequencies at each position, for all characters which are observed at least once in the alignment. This may not correspond to the full nucleotide or protein alphabet, if some characters are not present in the alignment.

Returns

The returned dataframe has one row per observed character (i.e., nucleotide / amino acid) and one column per alignment position. Each value is a float ranging from 0 to 1 representing the frequency of that character in that alignment column.

Return type

pd.DataFrame

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', '--TT--')])
>>> ali.conservation_map()
          0         1    2         3         4         5
-  0.666667  0.666667  0.0  0.000000  0.333333  0.666667
A  0.333333  0.000000  0.0  0.000000  0.000000  0.000000
C  0.000000  0.000000  0.0  0.333333  0.000000  0.000000
G  0.000000  0.000000  0.0  0.000000  0.666667  0.333333
T  0.000000  0.333333  1.0  0.666667  0.000000  0.000000
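
The same frequencies can be reproduced with the standard library alone. The sketch below uses a hypothetical helper, conservation_map_sketch, which returns a dict of lists rather than a DataFrame; it only illustrates the computation.

```python
from collections import Counter


def conservation_map_sketch(seqs):
    """Map each observed character to its per-column frequency list."""
    n_rows = len(seqs)
    alphabet = sorted(set(''.join(seqs)))        # observed characters only
    col_counts = [Counter(column) for column in zip(*seqs)]
    # Counter returns 0 for characters absent from a column.
    return {char: [counts[char] / n_rows for counts in col_counts]
            for char in alphabet}


seqs = ['ATTCG-', '--TTGG', '--TT--']
freqs = conservation_map_sketch(seqs)
for char, row in freqs.items():
    print(char, [round(value, 6) for value in row])
```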

Warning

This function is cached for best performance. Thus, do not directly modify the returned object. The hash key for caching is derived from sequences only: names are not considered.
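
Why modifying a cached result is risky can be shown with functools.lru_cache. The sketch is an analogy, assuming the cache hands back the same object on every call; the actual caching mechanism used by pyaln is not shown here, and cached_counts is a hypothetical function.

```python
import copy
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_counts(seqs):
    """Per-column character counts; `seqs` is a tuple of str (the cache key)."""
    return [{char: column.count(char) for char in set(column)}
            for column in zip(*seqs)]


seqs = ('AT', 'A-')

safe = copy.deepcopy(cached_counts(seqs))  # private copy of the result
safe[0]['A'] = 999                         # the cache is not affected
untouched = cached_counts(seqs)[0]['A']

shared = cached_counts(seqs)               # the cached object itself
shared[0]['A'] = 999                       # silently corrupts the cache
corrupted = cached_counts(seqs)[0]['A']
print(untouched, corrupted)
```

Working on a copy keeps later calls with the same sequences reliable.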

copy()

Returns a copy of the alignment

Returns

copy of the self alignment

Return type

Alignment

descriptions()

Returns the descriptions for all sequences in the alignment as a list

Returns

An ordered list of descriptions for each sequence in the alignment

Return type

list of str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.descriptions()
['this is first', 'this is 2nd', '']

See also

names

get all sequence names (their unique identifiers, without description)

titles

get all sequence titles (name and description separated by space)

fasta(nchar=60)

Returns the alignment in (aligned) fasta format

Parameters

nchar (int, default=60) – The number of characters per line for sequences

Returns

A multiline string with the alignment in fasta format, including sequence descriptions

Return type

str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3
ATTCG-

See also

write

generic function supporting many output formats

classmethod from_numpy(nparray, names, descriptions=None)

Class method to instance an Alignment object from a numpy array.

Parameters
  • nparray (np.ndarray) – analogous to the object returned by Alignment.to_numpy(), it must have one row per sequence, and one column per alignment position. Its dtype must be np.str_

  • names (list of str) – ordered list of sequence names (i.e. identifiers)

  • descriptions (list of str, optional) – ordered list of sequence descriptions. If not provided, all descriptions are set to ‘’

Returns

new alignment object

Return type

Alignment

See also

to_numpy

get_desc(name)

Returns the description for a sequence entry

If no sequence with that name is present in the alignment, an AlignmentError exception is raised.

Parameters

name (str) – The name (i.e. identifier) of the sequence

Returns

The description stored for the sequence

Return type

str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.get_desc('seq1')
'this is first'
>>> ali.get_desc('seq3')
''

See also

set_desc

get_seq(name)

Returns the sequence with the requested name

If no sequence with that name is present in the alignment, an AlignmentError exception is raised.

Parameters

name (str) – The name (i.e. identifier) of the sequence requested

Returns

The sequence requested, including any gaps it may contain

Return type

str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.get_seq('seq1')
'ATTCG-'
>>> ali.get_seq('seq2')
'--TTGG'

See also

set_seq

has_name(name)

Checks whether the alignment contains a sequence with the name provided

Parameters

name (str) – The name (i.e. identifier) searched in the alignment.

Returns

A bool indicating whether the name is present

Return type

bool

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.has_name('seq1')
True
>>> ali.has_name('seq2 this is 2nd')  # note: the name would be 'seq2' only
False
n_seqs()

Returns the number of sequences in the alignment (i.e. the number of rows, or alignment height)

Returns

Number of sequences in the alignment

Return type

int

Examples

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.n_seqs()
3

See also

ali_length

length of the alignment (i.e. number of columns)

names()

Returns a list of all sequence names in the alignment

The names returned do not include the sequence descriptions.

Returns

An ordered list of sequence names (identifiers) in the alignment

Return type

list of str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.names()
['seq1', 'seq2', 'seq3']

See also

titles

get all sequence titles, including their name and description

position_in_ali(name, pos_in_seq)

Maps a position in a certain sequence (without counting gaps) to its corresponding position in the alignment

If the requested position is invalid, an IndexError is raised

Parameters
  • name (str) – the name of the sequence

  • pos_in_seq (int) – 0-based sequence position, i.e. the index mapping to the requested sequence without gaps

Returns

0-based alignment position, i.e. the column index where the requested sequence position is found

Return type

int

Examples

>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2'  , '--TTG-'), ('seq3', 'AT-CCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
AT-CCG seq3
>>> ali.position_in_ali('seq1', 4)
4
>>> ali.position_in_ali('seq2', 0)
2
>>> ali.position_in_ali('seq3', 2)
3
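
The mapping amounts to counting non-gap characters until the requested one is reached. A stand-alone sketch on a single gapped string (position_in_ali_sketch is a hypothetical helper; treating negative positions as invalid is an assumption):

```python
def position_in_ali_sketch(seq, pos_in_seq):
    """Column index of the pos_in_seq-th (0-based) non-gap character."""
    if pos_in_seq < 0:
        # Assumption: negative positions are treated as invalid.
        raise IndexError('negative sequence position')
    seen = -1  # index of the last non-gap character met so far
    for column, char in enumerate(seq):
        if char != '-':
            seen += 1
            if seen == pos_in_seq:
                return column
    raise IndexError('position beyond the end of the ungapped sequence')


print(position_in_ali_sketch('ATTCG-', 4))  # 4
print(position_in_ali_sketch('--TTG-', 0))  # 2
print(position_in_ali_sketch('AT-CCG', 2))  # 3
```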

See also

position_in_seq

maps an alignment position to sequence position for a single sequence

position_map

maps all alignment positions for all sequences

position_in_seq(name, pos_in_ali)

Maps an alignment column position to the corresponding position in a certain sequence

Parameters
  • name (str) – the name of the sequence to map to

  • pos_in_ali (int) – 0-based alignment position, i.e. the column index

Returns

0-based position in sequence, i.e. the index of the sequence without counting gaps at the requested column.

Note that for gap positions, the position of the closest non-gap to the left is reported. For left terminal gaps, the value reported is -1.

Return type

int

Examples

>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2'  , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.position_in_seq('seq1', 4)
4
>>> ali.position_in_seq('seq2', 2)
0

Checking a left terminal gap returns -1:

>>> ali.position_in_seq('seq2', 0)
-1

Checking a gap position returns the index of the closest non gap char on the left:

>>> ali.position_in_seq('seq1', 5)
4

See also

position_in_ali

maps a sequence position to alignment position for a single sequence

position_map

maps all alignment positions for all sequences

position_map()

Compute a numerical matrix instrumental to map alignment positions to sequence positions (and reverse)

Returns

Returns a Pandas DataFrame with one row per sequence (indexed by name), and one column per alignment position. Each value is the index of that particular sequence (without gaps) in that alignment column. All positions are 0-based. For gap positions, the position of the closest non-gap to the left is reported. For left terminal gaps, the value reported is -1

Return type

pd.DataFrame

Examples

>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2'  , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.position_map()
      0  1  2  3  4  5
seq1  0  1  2  3  4  4
seq2 -1 -1  0  1  2  2
seq3  0  1  2  3  4  5
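
Each row of this matrix is a running count of non-gap characters, minus one. A stand-alone sketch for a single gapped string (position_row is a hypothetical helper, returning a list rather than a DataFrame row):

```python
def position_row(seq):
    """0-based ungapped position for every alignment column of `seq`."""
    row = []
    last = -1  # stays -1 until the first non-gap character (left terminal gaps)
    for char in seq:
        if char != '-':
            last += 1
        row.append(last)  # gaps repeat the closest non-gap position to the left
    return row


for name, seq in [('seq1', 'ATTCG-'), ('seq2', '--TTG-'), ('seq3', 'ATTCCG')]:
    print(name, position_row(seq))
```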

Note

Computing this matrix makes sense only if you will use positions for many or all sequences. For the corresponding operations on single sequences, see functions position_in_seq() and position_in_ali()

See also

position_in_seq

maps an alignment position to sequence position for a single sequence

position_in_ali

maps a sequence position to alignment position for a single sequence

positions()

Returns an iterator over the column indices of the alignment

This is equivalent to range(self.ali_length())

Returns

An iterator of int

Return type

range

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> for i in ali.positions():      print( ali.get_seq('seq1')[i] )
A
T
T
C
G
-
remove_by_index(*seqindices)

Remove one or more sequences in the alignment by their index, in-place.

The input indices refer to the position of the sequence in the alignment, i.e. their row number. Note that the modification is done in place. To obtain a new object instead, see examples below.

Parameters

*seqindices (tuple) – index or indices of sequences to be removed from the alignment

Returns

None

Return type

None

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.remove_by_index(0)
>>> ali
# Alignment of 2 sequences and 6 positions
--TTGG seq2
ATTCG- seq3

To return a new alignment without certain sequences, do not use this function. Instead, use indexing by rows:

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> indices_to_omit=set( [0, 2] )
>>> ali[ [n for i,n in enumerate(ali.names()) if not i in indices_to_omit], :]
# Alignment of 1 sequences and 6 positions
--TTGG seq2

See also

remove_by_name

remove_by_name(*names)

Remove one or more sequences in the alignment by name in-place.

Note that the modification is done in place. To obtain a new object instead, see examples below.

Parameters

*names (tuple) – name or names of sequences to be removed from the alignment

Returns

None

Return type

None

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.remove_by_name('seq1')
>>> ali
# Alignment of 2 sequences and 6 positions
--TTGG seq2
ATTCG- seq3
>>> ali.remove_by_name('seq2', 'seq3')
>>> ali
# Empty alignment

To return a new alignment without certain names, do not use this function. Instead, use indexing by rows:

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> names_to_omit=set(  ['seq2']  )
>>> ali[ [n for n in ali.names() if not n in names_to_omit], :]
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
ATTCG- seq3

See also

remove_by_index

remove_empty_seqs(inplace=True)

Remove all sequences which are entirely made of gaps or that are empty.

By default, removal is done in place.

Parameters

inplace (bool, default:True) – whether the removal should be done in place. If not, a new Alignment is returned instead
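
The filtering rule and the two return modes can be sketched on a list of (name, sequence) pairs. This is an illustration only; remove_empty_sketch is a hypothetical helper, not the pyaln method.

```python
def remove_empty_sketch(records, inplace=True):
    """Drop entries whose sequence is empty or made only of gaps."""
    kept = [(name, seq) for name, seq in records if seq.replace('-', '')]
    if inplace:
        records[:] = kept  # modify the caller's list, return nothing
        return None
    return kept            # leave the input untouched


records = [('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', '------')]
filtered = remove_empty_sketch(records, inplace=False)
size_before = len(records)               # input list still has 3 entries
result = remove_empty_sketch(records)    # in place: returns None
print(filtered, size_before, result, records)
```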

Returns

If inplace==True, None is returned; otherwise, a new Alignment without empty sequences is returned

Return type

None or Alignment

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', '------')])
>>> ali.remove_empty_seqs()
>>> ali
# Alignment of 2 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
same_length()

Check whether sequences are aligned, i.e. they have the same length

Returns

True if all sequences have the same length, False otherwise

Return type

bool

Examples

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.same_length()
True
>>> ali.add_seq('seqX', 'TATTCGGT-')
>>> ali.same_length()
False

See also

ali_length

length of the alignment (i.e. number of columns)

score_similarity(targets=None, gaps='y', metrics='i', weights='m', method=2)

Computes metrics of similarity between some target sequences and sequences in the (self) alignment

The rationale of the function is to quantify how much target sequences ‘fit’ in the alignment. By default, similarity metrics are computed for all sequences in self, meaning targets=self. In that case, they provide a measure of how much sequences resemble each other, i.e. the global agreement of alignment, and it can be instrumental to identify outliers.

The default use of this function is to compute the Average Sequence Identity (ASI) of targets, obtained by performing pairwise comparisons between each target and all sequences in self, and averaging. The definition of sequence identity varies as gaps may be counted in different ways, as specified by the gaps argument. Explanatory examples are provided in a dedicated page of the pyaln documentation.

Besides ASI, a variant called AWSI (Average Weighted Sequence Identity) is available, wherein different alignment columns are given different weight when averaging. Various built-in methods to define weights are available, all based on the concept that conserved alignment positions are given more weight. Custom weights may also be provided.

Parameters
  • targets (Alignment, optional) – sequences for which the similarity metrics are requested, provided as an Alignment instance. The sequences must be aligned to the self Alignment. If not provided, self is taken as targets, meaning that metrics are reported for each sequence in self, compared to the full alignment.

  • gaps (str, default:'y') –

    defines how gaps are taken into account when comparing sequences pairwise. Possible values:

    - ‘y’ : gaps are considered, and counted as mismatches. Positions that are gaps in both sequences are ignored.
    - ‘n’ : gaps are not considered. Positions that are gaps in either of the sequences compared are ignored.
    - ‘t’ : terminal gaps are trimmed. Terminal gap positions in either sequence are ignored; other positions are treated as in ‘y’.
    - ‘a’ : gaps are treated as any other character; even gap-to-gap matches are scored as identities.

    Multiple arguments may be concatenated (e.g. ‘yn’) to compute all of the possibilities.

  • metrics (str, default:'i') –

    defines which metrics are computed. Possible values:

    - ‘i’ : average sequence identity, aka ASI
    - ‘w’ : weighted sequence identity, aka AWSI

    Multiple arguments may be concatenated (e.g. ‘iw’) to compute all of the possibilities.

  • weights (str or list or np.ndarray, default: 'm') –

    if AWSI is computed, defines how weights are computed for each alignment column. Possible values:

    - ‘m’ : maximum frequency of non-gap character in self
    - ‘i’ : information content, i.e. 2- sum(p*log2(p)) where p is frequency of non-gap characters
    - ‘q’ : quadratic sum, i.e. sum(p*p) where p is frequency of non-gap characters

    Multiple arguments may be concatenated (e.g. ‘mi’) to compute all of the possibilities.

    Alternatively, custom weights may be provided as an iterable (e.g. list or Numpy ndarray) of numbers (the weights), with as many elements as the alignment columns.
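
The four values of gaps can be made concrete with a pairwise sketch on two aligned strings. This is not the code run by score_similarity: pairwise_identity_sketch is a hypothetical helper, identity is assumed to be the number of identical positions divided by the number of positions compared, and returning 0.0 when nothing is compared is an arbitrary choice.

```python
def pairwise_identity_sketch(a, b, gaps='y'):
    """Fraction of identical positions between two aligned sequences."""
    if gaps not in ('y', 'n', 't', 'a'):
        raise ValueError("gaps must be one of 'y', 'n', 't', 'a'")
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    positions = range(len(a))
    if gaps == 't':
        # Keep only columns outside the terminal gaps of both sequences.
        start = max(len(s) - len(s.lstrip('-')) for s in (a, b))
        end = min(len(s.rstrip('-')) for s in (a, b))
        positions = range(start, end)
    matches = compared = 0
    for i in positions:
        x, y = a[i], b[i]
        if gaps in ('y', 't') and x == '-' and y == '-':
            continue  # gap in both sequences: ignored
        if gaps == 'n' and (x == '-' or y == '-'):
            continue  # gap in either sequence: ignored
        compared += 1
        matches += x == y  # under 'a', gap-to-gap counts as a match
    return matches / compared if compared else 0.0


for mode in 'ynta':
    print(mode, pairwise_identity_sketch('ATTCG-', '--TTG-', mode))
```

For these two sequences, ‘y’ compares 5 positions, ‘n’ and ‘t’ compare 3, and ‘a’ compares all 6.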

Returns

a DataFrame with one row per target (indexed by sequence names), and one column for each sequence metrics requested. If multiple parameters were specified for gaps, metrics and/or weights, the output columns are a MultiIndex which covers all combinations requested.

Return type

pd.DataFrame

See also

sequtils.sequence_identity

function that compares sequences pairwise and returns their sequence identity. Accepts the same gaps argument as score_similarity. Though sequtils.sequence_identity is not run internally by score_similarity, it can be used to replicate its results.

Examples

>>> ali= Alignment([ ('seq1', 'ATTCG-'), ('seq2'  , '--TTG-'), ('seq3', 'AT-CCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
AT-CCG seq3
>>> fep_ali=Alignment(pyaln_folder + '/examples/fep15_protein.fa', fileformat='fasta')
>>> fep_ali
# Alignment of 6 sequences and 138 positions
MWLTLVALLALCATGRTAENLSESTTDQDKLVIARGKLVAPSVVGUSIKKMPELYNFLM...L Fep15_danio_rerio
MWAFLLLTLAFSATGMTEE-DVTDTAIEERPVIAKGILKAPSVVGUAIKKMPALYMFLM...L Fep15_S_salar
MWIFLLLTLAFSATGMTEE-NVTDTAIEERPVIAKGILKAPSVVGUAIKKMPELYTFLM...L Fep15_O_mykiss
MWAFLVLTFAVAA-GASET-VDNHTAAEEKLLIARGKLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_rubripes
MWALLVLTFAVTV-GASEE-VKNQTAAEEKLVIARGTLLAPSVVGUGIKKMPELHHFLM...L Fep15_T_nigroviridis
MWAFVLIAFSV---GASDS--SNSTAE----VIARGKLMAPSVVGUAIKKLPELNRFLM...L Fep15_O_latipes
>>> fep_ali.score_similarity()
metrics                    ASI
Fep15_danio_rerio     0.777778
Fep15_S_salar         0.826334
Fep15_O_mykiss        0.822684
Fep15_T_rubripes      0.829599
Fep15_T_nigroviridis  0.815000
Fep15_O_latipes       0.767438

If we choose to not consider gap positions, the score increases:

>>> fep_ali.score_similarity(gaps='n')
metrics                    ASI
Fep15_danio_rerio     0.793051
Fep15_S_salar         0.838283
Fep15_O_mykiss        0.834522
Fep15_T_rubripes      0.842566
Fep15_T_nigroviridis  0.835351
Fep15_O_latipes       0.805693

Requesting AWSI as well as ASI, and testing both considering and omitting gaps:

>>> fep_ali.score_similarity(gaps='yn', metrics='iw')
gaps                         y                   n
metrics                    ASI      AWSI       ASI      AWSI
Fep15_danio_rerio     0.777778  0.847123  0.793051  0.856044
Fep15_S_salar         0.826334  0.885040  0.838283  0.893412
Fep15_O_mykiss        0.822684  0.882183  0.834522  0.890497
Fep15_T_rubripes      0.829599  0.887255  0.842566  0.896094
Fep15_T_nigroviridis  0.815000  0.874389  0.835351  0.891288
Fep15_O_latipes       0.767438  0.834809  0.805693  0.860639

Computing AWSI with all possible weights available, considering only internal gaps:

>>> fep_ali.score_similarity(gaps='t', metrics='w', weights='iqm')
metrics                 AWSI.i    AWSI.q    AWSI.m
Fep15_danio_rerio     0.881664  0.913531  0.847123
Fep15_S_salar         0.912174  0.937773  0.885040
Fep15_O_mykiss        0.910436  0.936245  0.882183
Fep15_T_rubripes      0.915546  0.938932  0.887255
Fep15_T_nigroviridis  0.901011  0.929572  0.874389
Fep15_O_latipes       0.872316  0.903712  0.834809
sequences()

Returns a list of all the sequences in the alignment

Returns

An ordered list of sequences in the alignment (without names or descriptions)

Return type

list of str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.sequences()
['ATTCG-', '--TTGG', 'ATTCG-']
set_desc(name, desc)

Change the description of an entry in-place.

Parameters
  • name (str) – The name (i.e. identifier) of the entry to be altered

  • desc (str) – The new description to be used

Returns

None

Return type

None

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3 not sure',   'ATTCG-')])
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3 not sure
ATTCG-
>>> ali.set_desc('seq3', 'obviously third')
>>> print(ali.fasta())
>seq1 this is first
ATTCG-
>seq2 this is 2nd
--TTGG
>seq3 obviously third
ATTCG-
set_seq(name, sequence)

Change the sequence of an entry in-place.

Parameters
  • name (str) – The name (i.e. identifier) of the sequence to be altered

  • sequence (str) – The new sequence to be set

Returns

None

Return type

None

Examples

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTGG seq2
ATTCG- seq3
>>> ali.set_seq('seq1', 'CHANGE')
>>> ali
# Alignment of 3 sequences and 6 positions
CHANGE seq1
--TTGG seq2
ATTCG- seq3

See also

add_seq, get_seq

property shape

Return the size of the alignment in its two dimensions, i.e. the number of sequences and alignment columns

Returns

(height, width), where height is the number of sequences in the alignment and width the number of columns

Return type

tuple of int

Examples

>>> ali=Alignment([ ('seq1', 'ATTCG-'), ('seq2', '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.shape
(3, 6)

Note

This method is presented as property for symmetry with Numpy array .shape. However, this Alignment property is read-only.

titles()

Returns a list of all sequence titles in the alignment

Each title is the concatenation of sequence name and description, separated by a space. If the description is empty for an entry, only the name is returned

Returns

An ordered list of sequence titles in the alignment

Return type

list of str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> ali.titles()
['seq1 this is first', 'seq2 this is 2nd', 'seq3']

See also

names

get all sequence names (their unique identifiers, without description)

to_biopython()

Returns a copy of the alignment as a Bio.Align.MultipleSeqAlignment object

The SeqRecord instances in the returned MultipleSeqAlignment have their id and name attributes set to sequence names, and also possess the description attribute.

Returns

Alignment in biopython format (Bio.Align.MultipleSeqAlignment)

Return type

MultipleSeqAlignment

See also

to_numpy, to_pandas

to_numpy()

Returns a numpy 2-D array representation of the alignment, useful for vectorized sequence methods

Returns

The returned array has one row per sequence and one column per alignment position. Each value is a single character. The dtype is np.str_. Note that rows are not indexed by sequence names, just by their order index

Return type

np.ndarray

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', '--TT--')])
>>> print(ali.to_numpy())
[['A' 'T' 'T' 'C' 'G' '-']
 ['-' '-' 'T' 'T' 'G' 'G']
 ['-' '-' 'T' 'T' '-' '-']]

Warning

This function is cached for best performance. Thus, do not directly modify the returned object. The hash key for caching is derived from sequences only: names are not considered.

to_pandas(use_names=False)

Returns a pandas DataFrame representation of the alignment

Parameters

use_names (bool, optional) – Normally, the returned DataFrame has a simple RangeIndex as index. Set this to True to use sequence names as the index instead.

Returns

The returned dataframe has one row per sequence and one column per alignment position. Each value is a single character. The dtype is object. Rows are indexed by the sequence names if use_names==True, or by a RangeIndex by default

Return type

pd.DataFrame

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', '--TT--')])
>>> ali.to_pandas()
   0  1  2  3  4  5
0  A  T  T  C  G  -
1  -  -  T  T  G  G
2  -  -  T  T  -  -
>>> ali.to_pandas(use_names=True)
      0  1  2  3  4  5
seq1  A  T  T  C  G  -
seq2  -  -  T  T  G  G
seq3  -  -  T  T  -  -
trim_gaps(pct=1.0, count=None, inplace=False)

Removes the alignment columns with more gaps than specified

By default, a new alignment without the columns identified as ‘too gappy’ is returned.

Parameters
  • pct (float, default:1.0) – minimal gap frequency (from 0.0 to 1.0) for a column to be removed

  • count (int, optional) – defines the minimum absolute number of gaps for a column to be removed. It is an alternative way to select columns, which overrides the pct argument

  • inplace (bool, default:False) – whether the column removal should be done in place. If not, a new Alignment is returned instead
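
The column selection governed by pct and count can be sketched on a list of aligned strings. This is an illustration, not the pyaln implementation: trim_gaps_sketch is a hypothetical helper that always returns new strings, and it assumes a column is removed when its gap frequency (or gap count) is equal to or greater than the threshold.

```python
def trim_gaps_sketch(seqs, pct=1.0, count=None):
    """Return the sequences without the columns judged too gappy."""
    n_rows = len(seqs)
    keep = []
    for i, column in enumerate(zip(*seqs)):
        n_gaps = column.count('-')
        if count is not None:
            too_gappy = n_gaps >= count         # count overrides pct
        else:
            too_gappy = n_gaps / n_rows >= pct  # gap frequency threshold
        if not too_gappy:
            keep.append(i)
    return [''.join(seq[i] for i in keep) for seq in seqs]


seqs = ['ATTCG-', '--TTG-', 'ATTCCG']
print(trim_gaps_sketch(seqs, 0.5))      # ['ATTCG', '--TTG', 'ATTCC']
print(trim_gaps_sketch(seqs, count=1))  # ['TCG', 'TTG', 'TCC']
```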

Returns

By default, a new Alignment without the trimmed columns is returned; if inplace==True, None is returned

Return type

None or Alignment

Examples

>>> ali= Alignment([ ('seq1 first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTG-'), ('seq3', 'ATTCCG')])
>>> ali
# Alignment of 3 sequences and 6 positions
ATTCG- seq1
--TTG- seq2
ATTCCG seq3
>>> ali.trim_gaps(0.5)
# Alignment of 3 sequences and 5 positions
ATTCG seq1
--TTG seq2
ATTCC seq3
>>> ali.trim_gaps(count=1)
# Alignment of 3 sequences and 3 positions
TCG seq1
TTG seq2
TCC seq3
write(fileformat='fasta', to_file=None)

Returns a string representation of the alignment in a format of choice

Internally uses Bio.Align to generate output. Supported fileformat arguments include clustal, stockholm, phylip and many others; see the Biopython documentation for the full list of supported fileformat arguments.

Parameters
  • fileformat (str, default='fasta') – text format requested

  • to_file (str | TextIO, optional) – filename or buffer to write into. If not specified, the output is returned instead
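
Accepting either a filename or an open buffer is a common pattern, sketched below with the standard library. emit_text is a hypothetical helper, not part of pyaln; returning None after writing is an assumption made for this sketch.

```python
import io


def emit_text(text, to_file=None):
    """Return `text`, or write it to a filename or an open text buffer."""
    if to_file is None:
        return text                    # nothing to write into: hand it back
    if isinstance(to_file, str):
        with open(to_file, 'w') as handle:  # a filename: open, write, close
            handle.write(text)
    else:
        to_file.write(text)            # an already opened buffer
    return None                        # assumption: nothing returned after writing


buffer = io.StringIO()
returned = emit_text('>seq1\nATTCG-\n', to_file=buffer)
print(repr(returned), repr(buffer.getvalue()))
```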

Returns

String representation of alignment in the requested format

Return type

str

Examples

>>> ali=Alignment([ ('seq1 this is first', 'ATTCG-'), ('seq2 this is 2nd'  , '--TTGG'), ('seq3', 'ATTCG-')])
>>> print(ali.write('phylip'))
 3 6
seq1       ATTCG-
seq2       --TTGG
seq3       ATTCG-