predictions

Machine learning models for predicting the quality of a TR (Template Repeat). Only about 9% of random TRs show more than 10% mutagenesis, so predicting TR quality before testing is essential. The module includes functions to score individual sequences, predict mutagenesis percentages, and optimize TR sequences with beam search, balancing protein function, TR folding, and amino acid accessibility under mutagenesis.

Scoring functions

Predict TR quality scores and mutagenesis percentages for individual sequences or batches.



score


def score(
    TR_seq:str, # The TR DNA sequence
    features:int=1, # Number of features used by the classifier model (1 by default, so it usually does not need to be set). If 2, the two-feature model is used
):

Calculates the predicted score of a given TR sequence, from 0 (poorly performing TR) to 1 (perfect TR). If `features=2`, returns one score per feature; a good TR should score high on both.

TR_bad='TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGG'
print('TR bad score =',score(TR_bad))
TR_good='AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAAATACTTTTTCTACTCAAACATTAT'
print('TR good score =',score(TR_good))
TR bad score = 0.23
TR good score = 0.84


score_list


def score_list(
    TR_seq_list:list, # A list of TR DNA sequences (strings)
    TR_name_list:list, # A list of TR names (strings), one per sequence
    features:int=1, # The number of features to use
):

Calculates the score of every TR in the list and returns the results as a dataframe. If `features=2`, returns one score per feature; a good TR should score high on both.

TR_bad=[
     'TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGG',
     'AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGGACAAAGGTCGTGATTTCGCTA',
    'GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGCTGATCGCGCGTAAGAAGGACTGGG',
     'CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCG',
     'ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAAGGATCTGATTATCAAGCTGCCGAA',
     'AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGG',
     'CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTTCTAAGCGCGTGATTCTGGCGG',
     'ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAAGCCGATCCGTGAGCAGGCGGA',   
 ]

score_list(TR_bad,['TR_bad_'+str(k) for k in range (1,9)])
TR_Name TR_Seq TR_Score
0 TR_bad_1 TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACG... 0.23
1 TR_bad_2 AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGT... 0.05
2 TR_bad_3 GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGC... 0.00
3 TR_bad_4 CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGT... 0.00
4 TR_bad_5 ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAA... 0.01
5 TR_bad_6 AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGC... 0.08
6 TR_bad_7 CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGT... 0.06
7 TR_bad_8 ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGA... 0.12
TR_good=[
     'AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAAATACTTTTTCTACTCAAACATTAT',
     'TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATGGCGAAATTCGTAAACGCCCTCTGA',
     'ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCGGTTTCTCTAAGGAGTCCATTCTGC',
     'GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTACTCTGTTCTGGTGGTCGCCAAGGTC',
     'AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAA',
     'GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTCCTGTACCTGGCCTCGCACTACGAG',
     'CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTT',
     'CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAA'
     ]

score_list(TR_good,['TR_good_'+str(k) for k in range (1,9)])
TR_Name TR_Seq TR_Score
0 TR_good_1 AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAA... 0.84
1 TR_good_2 TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATG... 0.82
2 TR_good_3 ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCG... 0.76
3 TR_good_4 GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTAC... 0.74
4 TR_good_5 AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTT... 0.83
5 TR_good_6 GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTC... 0.55
6 TR_good_7 CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGA... 0.81
7 TR_good_8 CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTC... 0.81
TR_bad=[
     'TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGG',
     'AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGGACAAAGGTCGTGATTTCGCTA',
    'GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGCTGATCGCGCGTAAGAAGGACTGGG',
     'CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCG',
     'ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAAGGATCTGATTATCAAGCTGCCGAA',
     'AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGG',
     'CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTTCTAAGCGCGTGATTCTGGCGG',
     'ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAAGCCGATCCGTGAGCAGGCGGA',   
 ]

score_list(TR_bad,['TR_bad_'+str(k) for k in range (1,9)],2)
TR_Name TR_Seq TR_Score_Sp TR_Score_Avd
0 TR_bad_1 TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACG... 0.23 0.63
1 TR_bad_2 AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGT... 0.05 0.42
2 TR_bad_3 GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGC... 0.00 0.30
3 TR_bad_4 CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGT... 0.00 0.51
4 TR_bad_5 ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAA... 0.01 0.59
5 TR_bad_6 AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGC... 0.08 0.54
6 TR_bad_7 CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGT... 0.06 0.29
7 TR_bad_8 ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGA... 0.12 0.04
TR_good=[
     'AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAAATACTTTTTCTACTCAAACATTAT',
     'TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATGGCGAAATTCGTAAACGCCCTCTGA',
     'ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCGGTTTCTCTAAGGAGTCCATTCTGC',
     'GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTACTCTGTTCTGGTGGTCGCCAAGGTC',
     'AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAA',
     'GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTCCTGTACCTGGCCTCGCACTACGAG',
     'CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTT',
     'CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAA'
     ]

score_list(TR_good,['TR_good_'+str(k) for k in range (1,9)],2)
TR_Name TR_Seq TR_Score_Sp TR_Score_Avd
0 TR_good_1 AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAA... 0.84 0.80
1 TR_good_2 TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATG... 0.82 0.78
2 TR_good_3 ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCG... 0.76 0.84
3 TR_good_4 GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTAC... 0.74 0.75
4 TR_good_5 AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTT... 0.83 0.88
5 TR_good_6 GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTC... 0.55 0.58
6 TR_good_7 CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGA... 0.81 0.34
7 TR_good_8 CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTC... 0.81 0.82
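
Since a good TR should score high on both features, a natural follow-up is to keep only the TRs whose two scores both clear a cutoff. The sketch below does this in plain Python on the values printed in the table above. The helper `passes_both` is hypothetical (it is not part of this module), and the cutoff of 0.7 is borrowed from the default `threshold` of `optimize_sequence` as an assumption, not a documented rule for `score_list`.

```python
# Rows copied from the two-feature `score_list` output above:
# (TR name, TR_Score_Sp, TR_Score_Avd)
rows = [
    ("TR_good_1", 0.84, 0.80),
    ("TR_good_2", 0.82, 0.78),
    ("TR_good_3", 0.76, 0.84),
    ("TR_good_4", 0.74, 0.75),
    ("TR_good_5", 0.83, 0.88),
    ("TR_good_6", 0.55, 0.58),
    ("TR_good_7", 0.81, 0.34),
    ("TR_good_8", 0.81, 0.82),
]


def passes_both(score_sp, score_avd, threshold=0.7):
    """Hypothetical helper: True when both scores reach the threshold."""
    return score_sp >= threshold and score_avd >= threshold


# Keep the names of the TRs that clear the cutoff on both features.
kept = [name for name, sp, avd in rows if passes_both(sp, avd)]
print(kept)
```

With these values, `TR_good_6` is dropped because both scores are below the cutoff, and `TR_good_7` is dropped because only its first score is high.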


DGR_percentage


def DGR_percentage(
    TR_seq:str, # The TR DNA sequence
):

Calculates the predicted DGR mutagenesis percentage of a given TR sequence, from 0 (poorly performing TR) to 100 (perfect TR).



DGR_percentage_list


def DGR_percentage_list(
    TR_seq_list:list, # A list of TR DNA sequences (strings)
    TR_name_list:list, # A list of TR names (strings), one per sequence
):

Calculates the predicted DGR mutagenesis percentage of every TR in the list and returns the results as a dataframe.

TR_bad=[
     'TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGG',
     'AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGTGGGACAAAGGTCGTGATTTCGCTA',
    'GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGCTGATCGCGCGTAAGAAGGACTGGG',
     'CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCG',
     'ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAAGGATCTGATTATCAAGCTGCCGAA',
     'AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGG',
     'CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTTCTAAGCGCGTGATTCTGGCGG',
     'ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAAGCCGATCCGTGAGCAGGCGGA',   
 ]

DGR_percentage_list(TR_bad,['TR_bad_'+str(k) for k in range (1,9)])
TR_Name TR_Seq TR_rates
0 TR_bad_1 TTAGCGAATGGCGAAATTCGTAAACGCCCTCTGATCGAAACCAACG... 0.427049
1 TR_bad_2 AAACGCCCTCTGATCGAAACCAACGGCGAAACGGGTGAGATCGTGT... 0.110769
2 TR_bad_3 GGTTTCTCTAAGGAGTCCATTCTGCCGAAGCGCAACTCCGACAAGC... 0.019986
3 TR_bad_4 CAAGCTGATCGCGCGTAAGAAGGACTGGGATCCGAAGAAGTACGGT... 0.025612
4 TR_bad_5 ACCCGATTGACTTCCTCGAGGCGAAGGGGTACAAGGAGGTGAAGAA... 0.045752
5 TR_bad_6 AGTACTCCCTGTTCGAGCTGGAGAATGGTCGTAAGCGTATGCTGGC... 0.172833
6 TR_bad_7 CAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGT... 0.092111
7 TR_bad_8 ACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGA... 0.039965
TR_good=[
     'AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAAATACTTTTTCTACTCAAACATTAT',
     'TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATGGCGAAATTCGTAAACGCCCTCTGA',
     'ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCGGTTTCTCTAAGGAGTCCATTCTGC',
     'GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTACTCTGTTCTGGTGGTCGCCAAGGTC',
     'AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAA',
     'GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTCCTGTACCTGGCCTCGCACTACGAG',
     'CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGATTATTGAGCAGATTTCTGAGTTTT',
     'CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTCTGCCTACAATAAGCACCGTGATAA'
     ]

DGR_percentage_list(TR_good,['TR_good_'+str(k) for k in range (1,9)])
TR_Name TR_Seq TR_rates
0 TR_good_1 AAATGATCGCCAAATCTGAACAGGAAATTGGCAAAGCAACCGCTAA... 3.067841
1 TR_good_2 TCAAACATTATGAATTTCTTCAAAACCGAAATCACCTTAGCGAATG... 2.721600
2 TR_good_3 ATGCCTCAAGTAAACATCGTTAAAAAGACTGAGGTGCAGACTGGCG... 2.808427
3 TR_good_4 GGATCCGAAGAAGTACGGTGGCTTCGATTCTCCGACCGTGGCGTAC... 1.919474
4 TR_good_5 AGCGTATGCTGGCGTCTGCGGGTGAGCTGCAGAAGGGGAACGAGTT... 4.142115
5 TR_good_6 GCAGAAGGGGAACGAGTTGGCCCTTCCGTCCAAGTACGTGAACTTC... 0.811313
6 TR_good_7 CAGAAGCAGCTGTTCGTGGAGCAGCACAAGCACTACCTGGACGAGA... 0.940305
7 TR_good_8 CTAAGCGCGTGATTCTGGCGGACGCGAATCTGGATAAGGTCCTGTC... 3.040041

Sequence optimization utilities

Helper functions for codon manipulation and Pareto-optimal sequence selection.
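
To illustrate what Pareto-optimal selection means when two scores are involved, here is a minimal sketch. It is not the module's helper: the function name `pareto_front`, the tuple layout, and the candidate values are all assumptions made for the example. A candidate is kept when no other candidate is at least as good on both scores and strictly better on one.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is a tuple (label, score_a, score_b); higher is better
    for both scores. Hypothetical helper, for illustration only.
    """
    front = []
    for label, a, b in candidates:
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for _, a2, b2 in candidates
        )
        if not dominated:
            front.append((label, a, b))
    return front


# Illustrative values only.
candidates = [
    ("v1", 0.79, 0.84),
    ("v2", 0.72, 0.71),
    ("v3", 0.68, 0.72),
    ("v4", 0.85, 0.60),
]
front = pareto_front(candidates)
print(front)
```

Here `v2` and `v3` are dominated by `v1`, while `v4` survives because it beats `v1` on the first score.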

TR sequence optimization

Beam-search optimization to improve TR sequences while maintaining protein function.
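
As a rough picture of how a beam search over synonymous codon changes can work, the sketch below keeps the best few sequences at each step and expands each of them by one codon substitution. Everything in it is an assumption made for illustration: `SYNONYMS` is a tiny hand-written subset of the genetic code, `toy_score` is a placeholder that stands in for the real scoring models, and `beam_width`, `max_changes`, and `threshold` are hypothetical parameters. It is not the implementation of `optimize_sequence`.

```python
# Tiny synonymous-codon table (subset of the standard genetic code).
SYNONYMS = {
    "AAA": ["AAA", "AAG"],  # Lys
    "AAG": ["AAA", "AAG"],
    "GAA": ["GAA", "GAG"],  # Glu
    "GAG": ["GAA", "GAG"],
    "GAT": ["GAT", "GAC"],  # Asp
    "GAC": ["GAT", "GAC"],
}


def toy_score(seq):
    """Placeholder score (fraction of adenines), not the real model."""
    return seq.count("A") / len(seq)


def neighbours(seq):
    """All sequences that differ from `seq` by one synonymous codon."""
    out = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        for alt in SYNONYMS.get(codon, []):
            if alt != codon:
                out.append(seq[:i] + alt + seq[i + 3:])
    return out


def beam_search(seq, max_changes=3, beam_width=2, threshold=0.6):
    """Keep the `beam_width` best variants per step; stop early on success."""
    beam = [seq]
    best = seq
    for _ in range(max_changes):
        pool = {n for s in beam for n in neighbours(s)}
        if not pool:
            break
        # Sort by descending score, then alphabetically for determinism.
        beam = sorted(pool, key=lambda s: (-toy_score(s), s))[:beam_width]
        if toy_score(beam[0]) > toy_score(best):
            best = beam[0]
        if toy_score(best) >= threshold:
            break  # early stop once the threshold is met
    return best


start = "GAGAAGGAC"  # Glu-Lys-Asp
result = beam_search(start)
print(result)
```

Because only synonymous codons are proposed, the encoded protein is unchanged at every step; widening the beam trades run time for a lower chance of missing a good combination of changes.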



optimize_sequence


def optimize_sequence(
    original_seq, # Original DNA sequence to optimize.
    frame_offset:int=0, # Reading-frame offset (0, 1, or 2) used when grouping codons.
    dict_allowed_AAs:NoneType=None, # Dictionary mapping positions (keys) to lists of amino acids (values) that the codon at that position should be able to reach. If omitted, the previous behavior applies. Codons that cannot reach a stop codon by adenine mutation are preferred; if none is available, codons that can reach one are allowed anyway.
    CHANGES:int=6, # Maximum number of codon substitutions allowed (in addition to those required by `dict_allowed_AAs`).
    freq_min:float=0.2, # Lowest acceptable codon usage frequency.
    N:int=1, # Number of putative TRs to output.
    forbidden_positions:list=[], # Nucleotide positions that must not be modified.
    threshold:float=0.7, # Minimum value that both `Score_TRSp` and `Score_TRSpAvd` must reach for a sequence to be accepted as optimal.
    codon_usage:dict={'F': {'TTT': 0.57, 'TTC': 0.43}, 'L': {'TTA': 0.15, 'TTG': 0.12, 'CTT': 0.12, 'CTC': 0.1, 'CTA': 0.05, 'CTG': 0.46}, 'S': {'TCT': 0.11, 'TCC': 0.11, 'TCA': 0.15, 'TCG': 0.16, 'AGT': 0.14, 'AGC': 0.33}, 'Y': {'TAT': 0.53, 'TAC': 0.47}, '*': {'TAA': 0.64, 'TAG': 0.0, 'TGA': 0.36}, 'C': {'TGT': 0.42, 'TGC': 0.58}, 'W': {'TGG': 1.0}, 'P': {'CCT': 0.17, 'CCC': 0.13, 'CCA': 0.14, 'CCG': 0.55}, 'H': {'CAT': 0.55, 'CAC': 0.45}, 'Q': {'CAA': 0.3, 'CAG': 0.7}, 'R': {'CGT': 0.36, 'CGC': 0.44, 'CGA': 0.07, 'CGG': 0.07, 'AGA': 0.07, 'AGG': 0.0}, 'I': {'ATT': 0.58, 'ATC': 0.35, 'ATA': 0.07}, 'M': {'ATG': 1.0}, 'T': {'ACT': 0.16, 'ACC': 0.47, 'ACA': 0.13, 'ACG': 0.24}, 'N': {'AAT': 0.47, 'AAC': 0.53}, 'K': {'AAA': 0.73, 'AAG': 0.27}, 'V': {'GTT': 0.25, 'GTC': 0.18, 'GTA': 0.17, 'GTG': 0.4}, 'A': {'GCT': 0.11, 'GCC': 0.31, 'GCA': 0.2, 'GCG': 0.38}, 'D': {'GAT': 0.65, 'GAC': 0.35}, 'E': {'GAA': 0.7, 'GAG': 0.3}, 'G': {'GGT': 0.29, 'GGC': 0.46, 'GGA': 0.13, 'GGG': 0.12}}, # Codon usage table of E. coli, mapping each amino acid to its codons and their frequencies.
): # Dictionary (one per returned variant) containing:
- `Original_Sequence` : str  
  Input DNA sequence.
- `New_Variant` : str  
  Optimized DNA sequence.
- `Rank` : int or None  
  Rank of the variant (by score).
- `Score` : float or None  
  Score of the selected variant (geometric mean of `Score_TRSp` and `Score_TRSpAvd`).
- `Score_TRSp` : float or None  
  TR+Sp score of the selected variant.
- `Score_TRSpAvd` : float or None  
  Avd+TR+Sp score of the selected variant.

Optimize a DNA sequence via synonymous codon substitutions.

This function performs a beam-search-based optimization of a nucleotide sequence by iteratively proposing single-codon synonymous changes and evaluating them with the two scoring functions. The search stops early if a variant meets the specified score thresholds; otherwise, the best Pareto-optimal solution is returned.
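
The `dict_allowed_AAs` argument and the stop-codon rule both depend on which amino acids a codon can reach by adenine mutation. The sketch below shows one way to reason about this, under stated assumptions: every A in a codon may be replaced by any base, `frame_offset` is read as the number of nucleotides skipped before the first codon, and the keys of `dict_allowed_AAs` are read as codon indices. The helpers `split_codons` and `reachable_aas` are hypothetical and not part of this module.

```python
from itertools import product

# Standard genetic code, laid out in the usual TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(seq, frame_offset=0):
    """Group `seq` into complete codons, starting at `frame_offset`."""
    return [seq[i:i + 3] for i in range(frame_offset, len(seq) - 2, 3)]


def reachable_aas(codon):
    """Amino acids reachable if every A in `codon` may become any base."""
    options = [BASES if base == "A" else base for base in codon]
    return {CODON_TABLE["".join(c)] for c in product(*options)}


seq = 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC'
codons = split_codons(seq)
print(codons[0], sorted(reachable_aas(codons[0])))  # original first codon
print("AAT", sorted(reachable_aas("AAT")))          # candidate replacement
```

Under these assumptions, the original first codon `GAC` cannot reach tyrosine, whereas `AAT` can reach both D and Y without reaching a stop codon; `AAA` can reach every codon, stop codons included.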

seq = 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC'
dict_allowed={0:['D','Y'],1:['T']}
optimize_sequence(seq,N=5,CHANGES=6,dict_allowed_AAs=dict_allowed)

[{'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': 'AAAACGTGCTATGGATTAAAACGCCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAAGCACATC',
  'Score': 0.8146164741766521,
  'Rank': 1,
  'Score_TRSp': 0.79,
  'Score_TRSpAvd': 0.84},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': 'AATACGTGCTATGGATTAAAACGCCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAAGCACATC',
  'Score': 0.8146164741766521,
  'Rank': 2,
  'Score_TRSp': 0.79,
  'Score_TRSpAvd': 0.84},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': 'AATAATTGTTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCCCATC',
  'Score': 0.714982517268779,
  'Rank': 3,
  'Score_TRSp': 0.72,
  'Score_TRSpAvd': 0.71},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': 'AATAATTGCTACGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCCCATC',
  'Score': 0.6997142273814361,
  'Rank': 4,
  'Score_TRSp': 0.68,
  'Score_TRSpAvd': 0.72},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': 'AATAATTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTGCCAGGTCGCGGCACCTAACTGCAGGCCCATC',
  'Score': 0.6841052550594827,
  'Rank': 5,
  'Score_TRSp': 0.65,
  'Score_TRSpAvd': 0.72}]
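
The reported `Score` can be checked against the two component scores. The short sketch below recomputes the geometric mean for three of the variants printed above and applies the default `threshold` of 0.7 to both components. The helper names `combined_score` and `meets_threshold` are hypothetical; only the numbers come from the output above.

```python
import math


def combined_score(score_trsp, score_trspavd):
    """Geometric mean of the two component scores (hypothetical helper)."""
    return math.sqrt(score_trsp * score_trspavd)


def meets_threshold(score_trsp, score_trspavd, threshold=0.7):
    """True when both component scores reach the threshold."""
    return score_trsp >= threshold and score_trspavd >= threshold


# (Rank, Score_TRSp, Score_TRSpAvd, reported Score) from the output above.
reported = [
    (1, 0.79, 0.84, 0.8146164741766521),
    (3, 0.72, 0.71, 0.714982517268779),
    (5, 0.65, 0.72, 0.6841052550594827),
]

for rank, sp, avd, score in reported:
    matches = math.isclose(combined_score(sp, avd), score, rel_tol=1e-9)
    print(rank, matches, meets_threshold(sp, avd))
```

Ranks 1 and 3 clear the threshold on both components, while rank 5 does not because its `Score_TRSp` is 0.65.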