Sequtils submodule

pyaln.sequtils.sequence_identity(a, b, gaps='y')

Compute the sequence identity between two sequences.

The definition of sequence identity is ambiguous because it depends on how gaps are treated; here this is controlled by the gaps argument. For details and examples, see this page.

Parameters
  • a (str) – first sequence, with gaps encoded as "-"

  • b (str) – second sequence, with gaps encoded as "-"

  • gaps (str) – defines how gaps are taken into account when comparing the sequences pairwise. Possible values:

      - ‘y’ : gaps are considered and scored as mismatches. Positions that are gaps in both sequences are ignored.
      - ‘n’ : gaps are not considered. Positions that are gaps in either of the compared sequences are ignored.
      - ‘t’ : terminal gaps are trimmed. Terminal gap positions in either sequence are ignored; other positions are treated as in ‘y’.
      - ‘a’ : gaps are treated like any other character; even gap-to-gap matches are scored as identities.

Returns

sequence identity between the two sequences

Return type

float

Examples

>>> sequence_identity('ATGCA',
...                   'ATGCC')
0.8
>>> sequence_identity('--ATC-GGG-',
...                   'AAATCGGGGC',
...                   gaps='y')
0.6
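
The four gap modes listed under Parameters can be made concrete with a small pure-Python sketch. This is not pyaln's implementation: identity_sketch is a hypothetical name, and two behaviours are assumptions made for illustration only (raising ValueError for sequences of unequal length, and returning 0.0 when no position is left to compare).

```python
def identity_sketch(a, b, gaps="y"):
    """Illustrative re-implementation of the documented gap modes (not pyaln code)."""
    if len(a) != len(b):
        # Assumption: both inputs are rows of the same alignment
        raise ValueError("sequences must have the same length")

    start, end = 0, len(a)
    if gaps == "t":
        # Drop columns where either sequence has a leading or trailing gap
        start = max(len(s) - len(s.lstrip("-")) for s in (a, b))
        end = min(len(s.rstrip("-")) for s in (a, b))
    columns = list(zip(a, b))[start:end]

    if gaps == "n":
        # Ignore any column containing a gap
        kept = [(x, y) for x, y in columns if x != "-" and y != "-"]
    elif gaps in ("y", "t"):
        # Ignore only gap-to-gap columns; gap-to-residue counts as a mismatch
        kept = [(x, y) for x, y in columns if not (x == "-" and y == "-")]
    elif gaps == "a":
        # Gaps behave like any other character, so "-" vs "-" is an identity
        kept = columns
    else:
        raise ValueError("gaps must be one of 'y', 'n', 't', 'a'")

    if not kept:
        return 0.0  # Assumption: nothing to compare yields 0.0
    return sum(x == y for x, y in kept) / len(kept)


a, b = "--ATC-GGG-", "AAATCGGGGC"
print(identity_sketch(a, b, gaps="y"))            # 0.6
print(identity_sketch(a, b, gaps="n"))            # 1.0
print(round(identity_sketch(a, b, gaps="t"), 3))  # 0.857
print(identity_sketch(a, b, gaps="a"))            # 0.6
```

In this sketch the same pair of sequences scores 6/10 with ‘y’, 6/6 with ‘n’ and 6/7 with ‘t’, which shows why the gaps argument has to be stated whenever an identity value is reported.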

Note

To compute sequence identity efficiently among many sequences, use score_similarity() instead.

pyaln.sequtils.weighted_sequence_identity(a, b, weights, gaps='y')

Compute the sequence identity between two sequences, weighting different positions differently.

The definition of sequence identity is ambiguous because it depends on how gaps are treated; here this is controlled by the gaps argument. For details and examples, see this page.

Parameters
  • a (str) – first sequence, with gaps encoded as "-"

  • b (str) – second sequence, with gaps encoded as "-"

  • weights (list of float) – list of weights, one per position. Any iterable with the same length as the two input sequences (including gaps) is accepted. The final score is divided by the sum of the weights, excluding the positions that are not considered, as defined by the gaps argument.

  • gaps (str) – defines how gaps are taken into account when comparing the sequences pairwise. Possible values:

      - ‘y’ : gaps are considered and scored as mismatches. Positions that are gaps in both sequences are ignored.
      - ‘n’ : gaps are not considered. Positions that are gaps in either of the compared sequences are ignored.
      - ‘t’ : terminal gaps are trimmed. Terminal gap positions in either sequence are ignored; other positions are treated as in ‘y’.
      - ‘a’ : gaps are treated like any other character; even gap-to-gap matches are scored as identities.

Returns

weighted sequence identity between the two sequences

Return type

float

Examples

>>> weighted_sequence_identity('ATGCA',
...                            'ATGCC', weights=[1, 1, 1, 1, 6])
0.4
>>> weighted_sequence_identity('ATGCA',
...                            'ATGCC', weights=[1, 1, 1, 1, 1])
0.8
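
The arithmetic behind these values is the summed weight of the matching positions divided by the summed weight of the positions considered. The sketch below illustrates that calculation under stated assumptions; it is not pyaln's implementation. The name weighted_identity_sketch is hypothetical, and both the ValueError on mismatched lengths and the 0.0 returned when the considered weights sum to zero are choices made for this illustration.

```python
def weighted_identity_sketch(a, b, weights, gaps="y"):
    """Illustrative weighted identity following the documented gap modes (not pyaln code)."""
    weights = list(weights)  # any iterable of numbers is accepted
    if not (len(a) == len(b) == len(weights)):
        # Assumption: one weight per alignment column
        raise ValueError("a, b and weights must have the same length")

    start, end = 0, len(a)
    if gaps == "t":
        # Drop columns where either sequence has a leading or trailing gap
        start = max(len(s) - len(s.lstrip("-")) for s in (a, b))
        end = min(len(s.rstrip("-")) for s in (a, b))
    columns = list(zip(a, b, weights))[start:end]

    if gaps == "n":
        kept = [c for c in columns if c[0] != "-" and c[1] != "-"]
    elif gaps in ("y", "t"):
        kept = [c for c in columns if not (c[0] == "-" and c[1] == "-")]
    elif gaps == "a":
        kept = columns
    else:
        raise ValueError("gaps must be one of 'y', 'n', 't', 'a'")

    # Only the weights of the positions actually considered enter the denominator
    total = sum(w for _, _, w in kept)
    if total == 0:
        return 0.0  # Assumption: nothing to compare yields 0.0
    matched = sum(w for x, y, w in kept if x == y)
    return matched / total


# Four matching positions of weight 1, one mismatch of weight 6: 4 / 10
print(weighted_identity_sketch("ATGCA", "ATGCC", [1, 1, 1, 1, 6]))  # 0.4
# Uniform weights reduce to the unweighted identity: 4 / 5
print(weighted_identity_sketch("ATGCA", "ATGCC", [1, 1, 1, 1, 1]))  # 0.8
```

With uniform weights the result equals the unweighted identity, so weights only change the score when mismatches fall on positions weighted more or less than average.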

Note

To compute sequence identity efficiently among many sequences, use score_similarity() instead.