Module sequences

count_coding_changes(cds, cds2, silent=True, cpg=False, nonsense=False)

Counts the number of synonymous and non-synonymous changes between two sequences.

This function is instrumental for any dN/dS analysis. Sequences should be aligned, with gaps represented as “-”. Gapped positions are not counted. Codons containing any character other than A, C, G or T (for example, N) are skipped, and a message is printed to stderr unless silent is True.

Parameters:

cds (str) – coding sequence in DNA format (characters: ATGC), aligned to cds2
cds2 (str) – second coding sequence in DNA format, aligned to cds
silent (bool) – defaults to True; if True, do not print warning messages for non-ATGC characters
cpg (bool) – if True, separate counts for CpG sites only are provided, altering the structure of the returned tuple (see below). Note that only cds is inspected to identify CpG sites, so with cpg=True the output may differ between (cds, cds2) and (cds2, cds). The total counts are still provided. To obtain the number of non-CpG changes, subtract the number of CpG changes from the total number.
nonsense (bool) – if True, nonsense mutations (i.e. stop codons or mutations to a stop codon) are not counted as non-synonymous, and are returned separately, altering the structure of the returned tuple (see below).

Returns:

counts –

With default options, counts contains: (nonsyn, syn)
If cpg is True (but nonsense is not), counts contains: (nonsyn, syn, cpg_nonsyn, cpg_syn)
If nonsense is True (but cpg is not), counts contains: (nonsyn, syn, nonsense)
If both cpg and nonsense are true, counts contains: (nonsyn, syn, nonsense, cpg_nonsyn, cpg_syn, cpg_nonsense)

Return type:

tuple of ints
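
Because the tuple layout changes with cpg and nonsense, it can help to label the values before using them. The following is a minimal sketch, not part of the library: label_counts is a hypothetical helper that only mirrors the four layouts listed above, and the numbers in the usage lines are made up for illustration.

```python
def label_counts(counts, cpg=False, nonsense=False):
    """Map a counts tuple to a dict, following the documented layouts.

    Hypothetical helper: pass the same cpg/nonsense flags that were
    used to produce the tuple.
    """
    names = ["nonsyn", "syn"]
    if nonsense:
        names.append("nonsense")
    if cpg:
        names += ["cpg_nonsyn", "cpg_syn"]
        if nonsense:
            names.append("cpg_nonsense")
    if len(counts) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(counts)}")
    return dict(zip(names, counts))


# Illustrative values only, not real output of any function
labeled = label_counts((12, 30, 1, 4, 9, 0), cpg=True, nonsense=True)

# Non-CpG changes are obtained by subtracting the CpG counts from the totals
non_cpg_nonsyn = labeled["nonsyn"] - labeled["cpg_nonsyn"]
```

The same helper applies to any function on this page that documents these four layouts.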

count_coding_sites(cds, silent=False, cpg=False, nonsense=False)

Counts the number of synonymous and non-synonymous sites for an input nucleotide coding sequence.

This function is instrumental for any dN/dS analysis. As a single site can be partly non-synonymous and partly synonymous, the numbers returned are floats (always a multiple of one third). Gaps “-”, if present, are removed silently. Codons containing any character other than A, C, G or T (for example, N) are skipped, and a message is printed to stderr unless silent is True.

Parameters:

cds (str) – coding sequence in DNA format (characters: ATGC)
silent (bool) – do not print warning messages for non-ATGC characters
cpg (bool) – if True, separate counts for CpG sites only are provided, altering the structure of the returned tuple (see below). Note that the total counts are still provided. To obtain the number of non-CpG sites, subtract the number of CpG sites from the total number.
nonsense (bool) – if True, non-sense mutations (i.e. stop codons or mutations to a stop codon) are not counted as non-synonymous, and are returned separately, altering the structure of the returned tuple (see below).

Returns:

counts –

With default options, counts contains: (nonsyn, syn)
If cpg is True (but nonsense is not), counts contains: (nonsyn, syn, cpg_nonsyn, cpg_syn)
If nonsense is True (but cpg is not), counts contains: (nonsyn, syn, nonsense)
If both cpg and nonsense are true, counts contains: (nonsyn, syn, nonsense, cpg_nonsyn, cpg_syn, cpg_nonsense)

Return type:

tuple of floats
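
To illustrate why site counts come in thirds, here is a simplified sketch of site counting under the standard genetic code. It is not the library's implementation: sketch_count_sites is a hypothetical name, it covers only the default (nonsyn, syn) output, and it assumes that changes to or from a stop codon count as non-synonymous. Each codon position has three possible substitutions, and each one contributes one third of a site.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sketch_count_sites(cds):
    """Return (nonsyn, syn) site counts as floats; illustration only."""
    cds = cds.replace("-", "")  # gaps are dropped silently
    nonsyn_thirds = 0
    syn_thirds = 0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon not in CODON_TABLE:
            continue  # skip codons with non-ATGC characters, e.g. N
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn_thirds += 1
                else:
                    nonsyn_thirds += 1
    # Thirds are counted as integers and divided once, to limit rounding error
    return nonsyn_thirds / 3, syn_thirds / 3


nonsyn_sites, syn_sites = sketch_count_sites("ATGGGG")
```

In this sketch, ATG contributes three non-synonymous sites (every substitution changes the amino acid), while GGG contributes two non-synonymous sites and one synonymous site (its third position is fully degenerate).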

count_unique_changes(cds, other_cds_list, silent=True, cpg=False, nonsense=False)

Counts the number of unique synonymous and non-synonymous changes between one reference CDS sequence and a set of others.

This function is instrumental for any dN/dS analysis. Sequences should be aligned, with gaps represented as “-”. Gapped positions are not counted. Codons containing any character other than A, C, G or T (for example, N) are skipped, and a message is printed to stderr unless silent is True. If the same mutation is observed between cds and multiple sequences in other_cds_list, it is counted only once.

Parameters:

cds (str) – reference coding sequence in DNA format (characters: ATGC)
other_cds_list (list of str) – coding sequences to compare against cds, each aligned to it
silent (bool) – defaults to True; if True, do not print warning messages for non-ATGC characters
cpg (bool) – if True, separate counts for CpG sites only are provided, altering the structure of the returned tuple (see below). Note that only cds is inspected to identify CpG sites, so with cpg=True the output depends on which sequence is used as the reference. The total counts are still provided. To obtain the number of non-CpG changes, subtract the number of CpG changes from the total number.
nonsense (bool) – if True, nonsense mutations (i.e. stop codons or mutations to a stop codon) are not counted as non-synonymous, and are returned separately, altering the structure of the returned tuple (see below).

Returns:

counts –

With default options, counts contains: (nonsyn, syn)
If cpg is True (but nonsense is not), counts contains: (nonsyn, syn, cpg_nonsyn, cpg_syn)
If nonsense is True (but cpg is not), counts contains: (nonsyn, syn, nonsense)
If both cpg and nonsense are true, counts contains: (nonsyn, syn, nonsense, cpg_nonsyn, cpg_syn, cpg_nonsense)

Return type:

tuple of ints
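
Change counts are typically divided by the corresponding site counts to obtain per-site proportions. The sketch below shows that arithmetic only, under stated assumptions: crude_dn_ds is a hypothetical helper, it takes tuples in the default (nonsyn, syn) layout, and it applies no correction for multiple substitutions at the same site, so the result is a pN/pS proportion ratio rather than a corrected dN/dS estimate. The input values are made up.

```python
def crude_dn_ds(changes, sites):
    """Return (pN, pS, ratio) from (nonsyn, syn) change and site counts.

    Hypothetical helper; ratio is None when pS is zero, since the
    ratio is undefined in that case.
    """
    nonsyn_changes, syn_changes = changes[:2]
    nonsyn_sites, syn_sites = sites[:2]
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    p_n = nonsyn_changes / nonsyn_sites  # non-synonymous changes per site
    p_s = syn_changes / syn_sites        # synonymous changes per site
    ratio = p_n / p_s if p_s > 0 else None
    return p_n, p_s, ratio


# Illustrative values only: 6 non-synonymous and 4 synonymous changes,
# over 300 non-synonymous and 100 synonymous sites
p_n, p_s, ratio = crude_dn_ds((6, 4), (300.0, 100.0))
```

A ratio below 1 is commonly read as a sign of purifying selection, but with few changes the estimate is noisy, so treat small counts with caution.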

reverse_complement(seq, is_RNA=False)

Returns the reverse complement of a DNA sequence (or of an RNA sequence if is_RNA is True).

Parameters:

seq (str) – nucleotide sequence in DNA format (characters: ATGC)
is_RNA (bool) – use this to provide the input seq in RNA format instead (characters: AUGC)

Returns:

revcompseq – reverse complement nucleotide sequence in DNA format (or RNA if is_RNA was set to True)

Return type:

str

Note

Characters that are not upper or lowercase ATGC (or AUGC if is_RNA) are left unchanged
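
The behaviour described in the note can be sketched with a character translation table. This is an illustration and not the library's code: sketch_reverse_complement is a hypothetical name, and preserving the case of each character is an assumption.

```python
# Complement tables; any character not listed is passed through unchanged
DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
RNA_COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")


def sketch_reverse_complement(seq, is_rna=False):
    """Complement each base, then reverse the sequence; illustration only."""
    table = RNA_COMPLEMENT if is_rna else DNA_COMPLEMENT
    return seq.translate(table)[::-1]


dna_result = sketch_reverse_complement("ATGCn")             # 'n' is left as is
rna_result = sketch_reverse_complement("AUGC", is_rna=True)
```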

translate(seq, genetic_code='1', unknown='X', cache=False)

Translate a coding sequence into protein

Parameters:

seq (str) – nucleotide sequence in DNA format (characters: ATGC)
genetic_code (str | dict) – one of: a string-converted NCBI index for the genetic code (see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi); a string-converted NCBI index with a ‘+U’ suffix, to translate UGA as selenocysteine (U character); or a dictionary with a key for each codon and amino acids as values (remember to include the translation for gaps, '---':'-')
unknown (str | None) – codons that are not found in the genetic code table are translated as this character; if None, finding an unknown codon raises an exception instead
cache (bool | int) – speeds up translation by caching the translations of multi-codon strings. The first call is slow because all results are precomputed; after that, translation is ~3 times faster than without caching. With cache=True, all 3-codon translations are cached (memory ~10 MB, precomputation ~230 ms). Provide an int to define how many codons to cache; this is approximately the speedup that will be obtained. Note: memory and precomputation time grow exponentially with the number of codons cached; use easybioinfo.clear_kmer_memory() to free memory.

Returns:

pep – protein sequence resulting from translation, with gaps as ‘-’ and unknown codons as the unknown character (‘X’ by default)

Return type:

str

Warning

This function expects uppercase DNA as input. Codons containing ‘U’ or lowercase characters are translated as ‘X’. To accept more flexible input, pre-process it with seq.upper().replace('U', 'T')
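
The effect of this pre-processing can be seen with a small stand-in translator. The sketch below is not the library's translate: sketch_translate is a hypothetical function that uses a dictionary-based standard code (NCBI table 1) with the gap codon added, and it assumes that stop codons are written as ‘*’ and that a trailing incomplete codon is ignored.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STANDARD_CODE["---"] = "-"  # translation for gap codons


def sketch_translate(seq, code=STANDARD_CODE, unknown="X"):
    """Translate codon by codon with a dictionary lookup; illustration only."""
    peptide = []
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon in code:
            peptide.append(code[codon])
        elif unknown is None:
            raise KeyError(f"unknown codon: {codon}")
        else:
            peptide.append(unknown)
    return "".join(peptide)


raw = "augGCn---uaa"
clean = raw.upper().replace("U", "T")   # "ATGGCN---TAA"

pep_raw = sketch_translate(raw)      # lowercase and 'u' codons are unknown
pep_clean = sketch_translate(clean)  # only the codon containing N is unknown
```

With the raw input, every codon except the gap is unknown; after pre-processing, only the codon containing N remains untranslatable.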