lstm

Deep learning models that predict the population of VR (Variable Region) sequences generated by a given TR (template repeat) design. LSTM neural networks trained on DGRec experimental data model the position-dependent, context-dependent, error-prone reverse transcription performed by bRT, including the snowball effect, in which prior mutations increase subsequent error rates.

Model components

Custom loss function, metrics, and model separation utilities.


generate_sequences_oneTR


def generate_sequences_oneTR(
    TR:str, # TR sequence
    n:int=1000, # Number of VR to generate
)->list: # List of `n` generated VR sequences.

Generate multiple VR sequences from a single TR sequence.


generate_sequences


def generate_sequences(
    X_seq:list, # list of TR sequences
)->list: # List of generated VR sequences, in the same order as `X_seq`.

Generate VR sequences from a list of TR sequences.

Each TR sequence produces exactly one VR sequence.

generate_sequences_oneTR('CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC',n=10)
['CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTCACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTCCATATCCCCGTAACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTTACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGGTTAGTTCTTAGGCCAAGGTGCATTTCCCCGTTACATAAGGCGCGACTGGGCCC',
 'CGTGTACCGGACCTGGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGTTTAGTTCTTAGCCCGGGGTACATGTCCCCGTAACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTTACATGAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTAGTTTAGTTCTTAGCCCCTGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGACCTGGTTTAGTTCTTGGACCAAGGTCCATATCCCCGTAACATAAGACGCGACTGGGCCC',
 'CGTAAACCGGCCCTGGTTTGGTTCTTTGGCCGGGGTACATTTCCCCGTTACATAAGACGCGACTGGGCCC']
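
To see where generated VRs diverge from the template, it can help to list mismatches position by position. The sketch below is plain Python; `mismatches` and `mutation_profile` are hypothetical helper names that are not part of this module, and the two VRs are the first two sequences of the output above.

```python
from collections import Counter

def mismatches(tr: str, vr: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, TR base, VR base) wherever VR differs from TR."""
    if len(tr) != len(vr):
        raise ValueError("TR and VR must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(tr, vr)) if a != b]

def mutation_profile(tr: str, vrs: list[str]) -> Counter:
    """Count, per TR position, how many VRs carry a substitution there."""
    return Counter(i for vr in vrs for i, _, _ in mismatches(tr, vr))

TR = 'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC'
VRS = [
    'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTCACATAAGACGCGACTGGGCCC',
    'CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTCCATATCCCCGTAACATAAGACGCGACTGGGCCC',
]

for vr in VRS:
    print(mismatches(TR, vr))

profile = mutation_profile(TR, VRS)
```

In these two sequences, each substitution replaces an adenine of the TR. Running the same helpers on a larger sample from `generate_sequences_oneTR` would give a per-position mutation profile.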

VR sequence generation

Generate predicted VR sequences from TR inputs using the LSTM model.

Protein diversity evaluation

Evaluate and visualize protein-level diversity from generated sequences.


EvaluateTR_to_prot


def EvaluateTR_to_prot(
    TR:str, # The TR sequence
    NDGR:int=100, # Number of VR to generate for the protein logo
    offset:int=0, # The offset for protein translation
)->Counter: # Counts of translated protein sequences.

Evaluate protein diversity accessible from a TR sequence via DGR.

Generates VR sequences from a single TR, translates them into proteins, and displays a protein sequence logo based on amino-acid frequencies.
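
The translate-and-count step can be sketched without the model. This is an illustration only: `translate` and `count_proteins` are hypothetical names, the standard genetic code is assumed, and `offset` is assumed to mean the number of leading nucleotides skipped before the first codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq: str, offset: int = 0) -> str:
    """Translate complete codons starting at `offset`; trailing bases are ignored."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def count_proteins(vrs: list[str], offset: int = 0) -> Counter:
    """Count how often each translated protein occurs among the VR sequences."""
    return Counter(translate(vr, offset) for vr in vrs)

TR = 'CGTAAACCGGACCTAGTTAACTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC'
unmutated_protein = translate(TR)
```

Here `unmutated_protein` is the translation of the TR itself, which gives a baseline against which the proteins counted from generated VRs can be compared.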

c = EvaluateTR_to_prot('CGTAAACCGGACCTAGTTAACTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC')
c.most_common(5)

[('RKPDLVNFLDQGTYPRNIRRDWA', 13),
 ('RKPDLVNFLDQGTYPRYIRRDWA', 4),
 ('RKPDLVYFLDQGTYPRNIRRDWA', 2),
 ('RKPDLVNFLDHGTYPRNIRRDWA', 2),
 ('RKPGLVNFLDQGTYPRNIRRDWA', 2)]

optimize_sequence_display_proteins


def optimize_sequence_display_proteins(
    original_seq:str, # Original DNA sequence to optimize.
    frame_offset:int=0, # Reading-frame offset (0, 1, or 2) used when grouping codons.
    dict_allowed_AAs:dict=defaultdict(<class 'list'>, {}), # Maps positions (keys) to lists of amino acids (values) that the codon chosen at that position must be able to reach. Positions not listed keep the default behavior. Codons that cannot reach a stop codon by adenine mutation are preferred; if none is available, such codons are allowed anyway.
    CHANGES:int=6, # Maximum number of codon substitutions allowed (on top of the AAs requirements from the previous argument).
    freq_min:float=0.2, # Lowest usage frequency acceptable.
    N:int=1, # Number of putative TR to output.
    forbidden_positions:list=[], # Nucleotide positions that must not be modified.
    threshold:float=0.7, # Minimum value that both `Score_TRSp` and `Score_TRSpAvd` must reach for a sequence to be accepted as optimal.
    codon_usage:dict={'F': {'TTT': 0.57, 'TTC': 0.43}, 'L': {'TTA': 0.15, 'TTG': 0.12, 'CTT': 0.12, 'CTC': 0.1, 'CTA': 0.05, 'CTG': 0.46}, 'S': {'TCT': 0.11, 'TCC': 0.11, 'TCA': 0.15, 'TCG': 0.16, 'AGT': 0.14, 'AGC': 0.33}, 'Y': {'TAT': 0.53, 'TAC': 0.47}, '*': {'TAA': 0.64, 'TAG': 0.0, 'TGA': 0.36}, 'C': {'TGT': 0.42, 'TGC': 0.58}, 'W': {'TGG': 1.0}, 'P': {'CCT': 0.17, 'CCC': 0.13, 'CCA': 0.14, 'CCG': 0.55}, 'H': {'CAT': 0.55, 'CAC': 0.45}, 'Q': {'CAA': 0.3, 'CAG': 0.7}, 'R': {'CGT': 0.36, 'CGC': 0.44, 'CGA': 0.07, 'CGG': 0.07, 'AGA': 0.07, 'AGG': 0.0}, 'I': {'ATT': 0.58, 'ATC': 0.35, 'ATA': 0.07}, 'M': {'ATG': 1.0}, 'T': {'ACT': 0.16, 'ACC': 0.47, 'ACA': 0.13, 'ACG': 0.24}, 'N': {'AAT': 0.47, 'AAC': 0.53}, 'K': {'AAA': 0.73, 'AAG': 0.27}, 'V': {'GTT': 0.25, 'GTC': 0.18, 'GTA': 0.17, 'GTG': 0.4}, 'A': {'GCT': 0.11, 'GCC': 0.31, 'GCA': 0.2, 'GCG': 0.38}, 'D': {'GAT': 0.65, 'GAC': 0.35}, 'E': {'GAA': 0.7, 'GAG': 0.3}, 'G': {'GGT': 0.29, 'GGC': 0.46, 'GGA': 0.13, 'GGG': 0.12}}, # Codon usage table of E. coli, mapping each amino acid to its codons and their usage frequencies.
    NDGR:int=100, # Number of sequences to generate via the LSTM for sequence logo estimation. 
):

Optimize a DNA sequence via synonymous codon substitutions and show the sequence logo for each of the optimal sequences.

This function performs a beam-search-based optimization of a nucleotide sequence by iteratively proposing single-codon synonymous changes and evaluating them with the two scoring functions, `Score_TRSp` and `Score_TRSpAvd`. The search stops early if a variant meets the specified score thresholds; otherwise, the best Pareto-optimal solution is returned.
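
As a rough illustration of the proposal step only, the sketch below enumerates every sequence that differs from the input by one synonymous codon. It is not the module's implementation: `synonymous_neighbors` is a hypothetical name, the usage table is a three-amino-acid excerpt of the default `codon_usage`, and positions in `forbidden_positions` are assumed to be 0-based nucleotide indices.

```python
# Excerpt of the default codon usage table (three amino acids only).
CODON_USAGE = {
    "K": {"AAA": 0.73, "AAG": 0.27},
    "D": {"GAT": 0.65, "GAC": 0.35},
    "R": {"CGT": 0.36, "CGC": 0.44, "CGA": 0.07, "CGG": 0.07, "AGA": 0.07, "AGG": 0.0},
}
CODON_TO_AA = {codon: aa for aa, codons in CODON_USAGE.items() for codon in codons}

def synonymous_neighbors(seq, frame_offset=0, freq_min=0.2, forbidden_positions=()):
    """List sequences that differ from `seq` by exactly one synonymous codon."""
    forbidden = set(forbidden_positions)
    variants = []
    for start in range(frame_offset, len(seq) - 2, 3):
        codon = seq[start:start + 3]
        aa = CODON_TO_AA.get(codon)
        if aa is None:
            continue  # codon not covered by the excerpt table
        for alt, freq in CODON_USAGE[aa].items():
            if alt == codon or freq < freq_min:
                continue  # skip the current codon and rarely used codons
            changed = {start + k for k in range(3) if alt[k] != codon[k]}
            if changed & forbidden:
                continue  # substitution would touch a protected nucleotide
            variants.append(seq[:start] + alt + seq[start + 3:])
    return variants

candidates = synonymous_neighbors("AAAGATAGG")
protected = synonymous_neighbors("AAAGATAGG", forbidden_positions=[2])
```

In a beam search, each candidate would then be scored and only the best few kept for the next round of substitutions, up to `CHANGES` rounds.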

seq = 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC'
dict_allowed={10:['R','Y']}
optimize_sequence_display_proteins(seq,N=5,CHANGES=6,dict_allowed_AAs=dict_allowed)

[{'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': np.str_('GACACCTGCTATGGTTTAAAACGCCGCTCCAACTGGGTACCAGGTCGCGGCACCTAACTGCAGGCGCATC'),
  'Score': np.float64(0.7794228634059948),
  'Rank': 1,
  'Score_TRSp': 0.81,
  'Score_TRSpAvd': 0.75,
  'Proteins': Counter({'DTCYGLKRRSNWVPGRGP*LQAH': 5,
           'DTCYGLKRRSNWVPGRGT*LQAH': 5,
           'DTCYGLKRRSYWVPGRGT*LQAH': 4,
           'DTCYGLKRRSNWVPGRGTLLQAH': 4,
           'DTCYGLKRRSNWVPGRGTSLQAH': 4,
           'DTCYGLKRRSHWVPGRGT*LQAH': 2,
           'DTCYGLKRRSYWVPGRGP*LQAH': 2,
           'DTCYGLKRRSSWVPGRGT*LQAH': 2,
           'DTCYGLVRRSNWVPGRGTLLQAH': 2,
           'VTCYGFLRRSLWVPGRGAFLRAH': 1,
           'DTCYGLKRRSSWVPGRGP*LQAH': 1,
           'DPCYGFLRRSFWVPGRGSWLQAH': 1,
           'DTCYGLVRRSAWVPGRGALLRAH': 1,
           'DTCYGLVRRSDWVPGRGA*LQAH': 1,
           'DTCYGLKRRSSWVPGRGPLLQAH': 1,
           'DTCYGLRRRSYWVPGRGPLLQAH': 1,
           'DTCYGLKRRSNWVPGRGTFLQAH': 1,
           'DTCYGLKRRANWVPGRGPLLQAH': 1,
           'GPCCGLSRRSNWVAGRGP*LQAH': 1,
           'DTCYGLLRRSYWVPGRGT*LQAH': 1,
           'DTCYGFKRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGTYLQAH': 1,
           'DTCFGLKRRSSWVPGRGP*LQAH': 1,
           'DPCYGLKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGSYLQAH': 1,
           'DTCYGLVRRSNWVPGRGASLQAH': 1,
           'DTCYGLGRRAYWVPGRGT*LQAH': 1,
           'DTCYGLTRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSAWVPGRGPELRAH': 1,
           'DTCCGLVRRSYWVPGRGT*LQAH': 1,
           'DTCYGLIRRSSWVPGRGPLLQAH': 1,
           'GTCYGLLRRSNWVPGRGT*LQAH': 1,
           'DTWCGLKRRSSWVPGRGT*LQAH': 1,
           'DTCYGLKRRSSWGPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGA*LQAH': 1,
           'DTCYGLKRRAYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGTLLQAH': 1,
           'DTCYGFKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLIRRSNWVPGRGPWLRAH': 1,
           'DTCYGLKRRSDWVPGRGT*LQAH': 1,
           'DTCYGLGRRSYWVPGRGP*LQAH': 1,
           'DTCYGLKRRPNWVPGRGP*LQAH': 1,
           'DTCYGFLRRSSWVPGRGT*LQAH': 1,
           'DTCFGLKRRANWVPGRGT*LQAH': 1,
           'DTCYGLRRRSAWVPGRGP*LQAH': 1,
           'DTCYGL*RRSAWVPGRGT*LQAH': 1,
           'DTCCGLKRRSYWVPGRGTLLQAH': 1,
           'DTCYGLLRRSNWVPGRGP*LQAH': 1,
           'DTCYGLKRRSNWVPGRGALLQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LQAP': 1,
           'DTCFGLKRRSYWVPGRGT*LQAH': 1,
           'DTCYGLGRRSSWVPGRGP*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LPAH': 1,
           'DTCYGLERRSSWVPGRGALLQAH': 1,
           'DTCYGLKRRSNWGPGRGPYLQAH': 1,
           'DTCYGLGRRSYWVPGRGT*LQAH': 1,
           'DTCYGLNRRSNWVPGRGT*LQAH': 1,
           'DTCYGFCRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSLWVPGRGTLLQAH': 1,
           'DTCYGLSRRSSWVPGRGT*LQAH': 1,
           'DTCYGLRRRSNWVPGRGTLLQAH': 1,
           'DTCYGLKRRSNWVPGRGTCLRAH': 1,
           'DTCYGLKRRSNWVPGRGPLLQAH': 1,
           'DTCYGLLRRSDWVPGRGT*LQAH': 1,
           'DTCYGLIRRSSWVPGRGT*LQAH': 1,
           'DTCCGFKRRSYWVPGRGTLLQAH': 1,
           'DTCYGLVRRSSWVPGRGPLLQAH': 1,
           'DTCYGFLRRSNWVPGRGTLLQAH': 1,
           'GTCYGLVRRSTWVPGRGT*LQAH': 1,
           'DTCYGFLRRSYWVPGRGT*LQAH': 1,
           'DTCYGLLRRSSWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGTSLQAH': 1,
           'DTCYGLKRRSNWGPGRGTLLRAH': 1,
           'DTCYGLERRSSWVPGRGTLLQAH': 1,
           'DTCYGLKRRSGWVPGRGTLLQAH': 1,
           'DTCYGLSRRAYWVPGRGT*LQAH': 1,
           'DTCYGLVRRSDWVPGRGT*LQAH': 1,
           'DPCYGLERRSNWVPGRGPYLQAH': 1,
           'DTCFGLERRSNWVPGRGS*LQAH': 1})},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': np.str_('GACACGTGCTATGGATTAAAACGCCGCTCCAACTGGGTACCAGGTCGCGGCACCTAACTGCAAGCACATC'),
  'Score': np.float64(0.7730459236035074),
  'Rank': 2,
  'Score_TRSp': 0.72,
  'Score_TRSpAvd': 0.83,
  'Proteins': Counter({'DTCYGLKRRSNWVPGRGT*LQAH': 7,
           'DTCYGLKRRSSWVPGRGT*LQAH': 6,
           'DTCYGLKRRSNWVPGRGTLLQAH': 4,
           'DTCYGLKRRSYWVPGRGT*LQAH': 3,
           'DTCYGLKRRSYWVPGRGP*LQAH': 2,
           'DTCYGLKRRSNWVPGRGT*LPAH': 2,
           'DTCYGLKRRSTWVPGRGT*LQAH': 2,
           'DTCYGLLRRSNWVPGRGP*LQAH': 2,
           'DTCYGLKRRSNWVPGRGTSLQAH': 2,
           'DTCYGLKRRSNWVPGRGP*LQAH': 2,
           'DTCYGLLRRSSWVPGRGTLLQAH': 2,
           'DTCFGFKRRSSWVPGRGALLQAH': 1,
           'DTCYGLVRRSYWVPGRGPVLQAH': 1,
           'DTCYGLLRRSFWVPGRGT*LKAH': 1,
           'DTCYGLKRRSNWVPGRGA*LQAH': 1,
           'DTCYGLRRRSIWVPGRGT*LQEH': 1,
           'DTCYGFLRRANWVPGRGT*LQAH': 1,
           'DTCYGLIRRSSWVPGRGT*LQAH': 1,
           'DTCYGFLRRSNWVPGRGT*LQAH': 1,
           'DTCCGLARRSYWVPGRGPFLQAH': 1,
           'DTCYGLARRSSWVPGRGT*LQAH': 1,
           'DTCCGLRRRSSWVPGRGT*LQAH': 1,
           'DTCYGLKRRSAWVPGRGT*LQAH': 1,
           'DTCYGLVRRSNWVPGRGT*LQAH': 1,
           'DTCCGLRRRSYWVPGRGNCLQAH': 1,
           'DTCYGLKRRSSWVPGRGTLLQAH': 1,
           'DTCYGLRRRSCWGPGRGT*LQAH': 1,
           'DTCYGLKRRSRWVPGRGT*LQAH': 1,
           'DTCYGLKRRSPWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LHAH': 1,
           'DTCCGLKRRSPWVPGRGT*LQAH': 1,
           'DTCYGFKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LQAP': 1,
           'DTCFGLIRRSNWVPGRGT*LQAH': 1,
           'DTCYGLLRRSSWVPGRGT*LQAH': 1,
           'DACCGLKRRSNWVPGRGTFLQAH': 1,
           'DTCYGFCRRSDWVPGRGAFLQAH': 1,
           'DTCCGFGRRSLWVPGRGP*LQAH': 1,
           'DTCYGLERRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGTYLQAH': 1,
           'ETCSGLKRRSYWVPGRGT*LQAP': 1,
           'DTCYGLKRRSDWVPGRGT*LQAH': 1,
           'DTCYGLKRRAFWVPGRGTLLQAH': 1,
           'DPCYGLVRRSNWVPGRGT*LQAH': 1,
           'DACYGLLRRSNWVPGRGPLLQAH': 1,
           'DTCCGLTRRSSWVPGRGALLPAH': 1,
           'DTCGGLLRRSNWAPGRGPLLQAH': 1,
           'DTCYGLLRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSGWVPGRGTLLQAH': 1,
           'DTCYGLKRRSYWVPGRGS*LQAH': 1,
           'DTCYGLIRRSYWVPGRGTELQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LPAN': 1,
           'DTCYGLVRRSYWVPGRGT*LQAH': 1,
           'DTCFGLGRRSYWVPGRVPFLQAH': 1,
           'DTCDGLERRSNWVPGRGP*LQAH': 1,
           'DACYGLLRRSNWVPGRGT*LQAH': 1,
           'DTCYGLERRSSWVPGRGP*LQAH': 1,
           'GTCCGVMRRSYWGPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGALLQAH': 1,
           'GTCFGLGRRSSWVPGRGT*LQAH': 1,
           'DTCSGLLRRSYWVPGRGT*LQAH': 1,
           'DTCCGLGRRSFWVPGRGT*LQAH': 1,
           'DTCYGLSRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGPLLQAH': 1,
           'DTCYGLKRRSNWVPARGP*LQAH': 1,
           'DTCYGLVRRSLWVPGRGT*LQAH': 1,
           'DTCSGLKRRSYWVPGRGPLLQAH': 1,
           'DTCYGLKRRYNWGPGRGPELQAH': 1,
           'DTCYGLKRRSYWVPGRGPCLRAH': 1,
           'DTCFGLRRRSSWVPGRGT*LQAH': 1,
           'DTCYGLKRRSHWVPGRGT*LQAH': 1,
           'DTCSGLKRRSAWVPGRGPFLQAH': 1,
           'DTCFGLERRSYWVPGRGT*LQAH': 1,
           'DTCYGLVRRSYWVPGRGSFLQAH': 1,
           'ATCYVLIRRSNWGPGRGT*LQAH': 1,
           'DTCCGLKRRSYWVPGRGTCLQAH': 1,
           'DTCYGLLRRSNWVPGRGT*LQAH': 1})},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': np.str_('GACACCTGCTATGGTTTAAAAAGGCGCTCCAACTGGGTGCCAGGTCGCGGCACCTAACTGCAGGCCCATC'),
  'Score': np.float64(0.7360706487831178),
  'Rank': 3,
  'Score_TRSp': 0.86,
  'Score_TRSpAvd': 0.63,
  'Proteins': Counter({'DTCYGLKRRSNWVPGRGP*LQAH': 9,
           'DTCYGLKRRSSWVPGRGT*LQAH': 4,
           'DTCYGLKRRSNWVPGRGTLLQAH': 4,
           'DTCYGLKRRSNWVPGRGTSLQAH': 3,
           'DTCYGLKRRSHWVPGRGT*LQAH': 2,
           'DTCYGFKRRSNWVPGRGT*LQAH': 2,
           'DTCYGLKRRSYWVPGRGA*LQAH': 2,
           'DTCYGLKRRSYWVPGRGT*LQAH': 2,
           'DTCYGLKRRSSWVPGRGP*LQAH': 2,
           'DTCYGLKRRSNWVPGRGT*LPAH': 1,
           'DTCCGLKRRSTWVPGRGT*LQAH': 1,
           'DTCYVLERRSNWVPGRGT*LQAH': 1,
           'DTCYGLGRRSYWVPGRGT*LQAH': 1,
           'DTCCGLKGRSLWVPGRGT*LQAH': 1,
           'DTCYGLKRRSDWVPGRGP*LQDH': 1,
           'DTCYGLRRRSPWVPGRGT*LQAH': 1,
           'DPCCGLKGRSFWVPGRGT*LQAH': 1,
           'ATCYGLKRRSNWVPGRGTLLQAH': 1,
           'DTCYGLKGRSNWVPGRGP*LQAH': 1,
           'DPCYGFEGRSSWVPGRGT*LQAH': 1,
           'DTCFGLKRRSDWVPGRGTYLQAH': 1,
           'DPCYALKRRSGWVQGRGA*LQAH': 1,
           'DTCYGLRRRSYWVPGRGTLLQAH': 1,
           'DTCYGLGRRSSWVPGRGP*LQAH': 1,
           'DTCYGLKRRSLWVPGRGTLLQAH': 1,
           'DTCYGLRRRSNWVPGRGT*LQAH': 1,
           'DTCYGLLGRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGTYLRAH': 1,
           'DTCYGLKRRASWVPGRGT*LPAH': 1,
           'DTCYGLLRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSGWVPGRGPLLQAH': 1,
           'DTCYGLARRSTWVPGRGP*LQAH': 1,
           'DTRYGLKGRSNWVPGRGT*LQAH': 1,
           'DPCYGLGGRSFWVPGRGT*LQAH': 1,
           'DTCYGLGRRSAWVPGRGTSLQAH': 1,
           'DTCYGLARRSYWVPGRGTLLQAH': 1,
           'DTCYGLTRRSNWVPGRGP*LQAH': 1,
           'DTCYGLKRRSSWVPGRGPFLQAH': 1,
           'DTCYVLERRSSWVPGRGT*LQAH': 1,
           'DTCYGLKRRSGWVPGRGP*LQAH': 1,
           'DTCYGLKGRSTWVPGRGT*LQAH': 1,
           'DTCYGLKRRSAWVPGRGTLLPAH': 1,
           'DTCYGFKGRSNWVPGRGT*LQAH': 1,
           'DTCYGLERRSNWVPGRGTLLQAH': 1,
           'DTCYGLGRRSNWVPGRGTLLQAH': 1,
           'DTCSGLKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLSGSSGWVPGRGSLLPAH': 1,
           'DTCYGLKRRSIWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGPLLRAH': 1,
           'DTCCGLFRRSNWVPGRGT*LQAH': 1,
           'DPCCGLRGRSNWVPGRGT*LQAH': 1,
           'DTCYGFKRRSNWVPGRGT*LPAH': 1,
           'DTCCGLKRRSSWVPGRGTCLQAH': 1,
           'DTCYGLKRRSIWVPGRGALLQAH': 1,
           'DTCFGLGGRASWVPGRGT*LQAH': 1,
           'DTCYGFGRRSYWVPGRGT*LQAH': 1,
           'DTCYGLGRRSNWVPGRGP*LQAH': 1,
           'DTCSGLKRRSYWVPGRGTYLQAH': 1,
           'DTCYGLVRRSSWVPGRGT*LPAH': 1,
           'DTCYGLKGRSNWVPGRGT*LQAH': 1,
           'DTCYGLRGRSNWVPGRGPYLQAH': 1,
           'DTCYGLKRRSYWVPGRGTYLQAH': 1,
           'DTCYGLRRRSYWVPGRGT*LQAH': 1,
           'DTCYGLERRSTWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGPLLQAH': 1,
           'DTCYGLKWRSYWVPGRGT*LQAH': 1,
           'DPCYGLRRRSTWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGP*LQAH': 1,
           'DTCYGLKRRSPWVPGRGT*LQAH': 1,
           'DTCYGLKRRSLWVPGRGT*LQAH': 1,
           'DPCFGLRRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSTWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGALLQAH': 1,
           'DTCYGLKRRSAWVPGRGT*LQAH': 1,
           'DTCYGLEGRSNWVPGRGT*LQAH': 1,
           'DPCYGLKRRSNWVPGRGTLLQAH': 1,
           'DTCCGLSRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWVPGRGPSLQAH': 1})},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': np.str_('GACACGTGCTATGGATTAAAACGTCGCTCCAACTGGGTACCAGGTCGCGGCACCTAACTGCAAGCACATC'),
  'Score': np.float64(0.723118247591637),
  'Rank': 4,
  'Score_TRSp': 0.63,
  'Score_TRSpAvd': 0.83,
  'Proteins': Counter({'DTCYGLKRRSNWVPGRGT*LQAH': 13,
           'DTCYGLKRRSSWVPGRGT*LQAH': 5,
           'DTCYGLKRRSAWVPGRGT*LQAH': 3,
           'DTCYGLKRRSYWVPGRGT*LQAH': 3,
           'DTCYGLKRRSNWVPGRGTSLQAH': 2,
           'DTCYGLKRRSTWVPGRGT*LQAH': 2,
           'DTCYGLKRRSNWVPGRGPLLQAH': 2,
           'DTCYGLKRRSLWVPGRGT*LQAH': 2,
           'DTCYGLKRRSNWVPGRGTYLQAH': 2,
           'DTCYGLKRRSNWVPGRGTLLQAH': 2,
           'DTCYGLKRRSNWVPGRGT*LHAH': 2,
           'DTCYGLERRSSWVPGRGT*LQAH': 2,
           'DTCYGLKRRSNWGPGRGT*LQAH': 2,
           'DTCYGLTRRSYWVPGRGT*LQAH': 2,
           'DTCYGLKRRSNWVPGRGP*LQAH': 2,
           'DTCYGLKRRSIWVPGRGT*LQAH': 1,
           'DTCCGLKRRSSWVPGRGTYLQAH': 1,
           'DTCCGLKRRSFWVPGRGT*LQAH': 1,
           'DTCCG*KRRAYWVPGRGT*LQAH': 1,
           'DTCYWLIRRSNWVPGRGT*LQAH': 1,
           'DTCYGLIRRSTWVPGRGT*LQAH': 1,
           'DTCYGLKRRSDWVPGRGP*LQAH': 1,
           'DTCYGLYRRSNWVPGRGT*LQAH': 1,
           'DTCDGLERRSNWVPGRGT*LQAH': 1,
           'DTCYGLARRSNWVPGRGT*LQAH': 1,
           'DTCYGLQRRSNWVPGRGP*LQAH': 1,
           'DTCYGLARRSHWVPGRGPLLQAH': 1,
           'DPCYGLIRRSNWVPGRGT*LQAH': 1,
           'DTCFGLVRRSSWVPGRGT*LQAH': 1,
           'DTCYGLLRRSNWVPGRGP*LQAH': 1,
           'DTCYGLKRRSGWVPGRVP*LQAH': 1,
           'DTCIGLRRRSYWGPGRGT*LQAH': 1,
           'DTCFGLIRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSFWVPGRGTLLQAH': 1,
           'DPCYGLKRRSNWVPGRGT*LQAH': 1,
           'DACYGLGRRSYWVPGRGTLLQAH': 1,
           'DTCCGLKRRSSWVPGRGT*LQAH': 1,
           'DTCYGLDRRSYWVPGRGP*LQAH': 1,
           'DTCSGFLRRSSWVPGRGP*LQAH': 1,
           'DTCYGLKRRSDWVPGRGPLLQAH': 1,
           'DACFGLRRRSNWVPGRGP*LQAH': 1,
           'GTCYGLGRRSNWVPGRGT*LQAH': 1,
           'AACYGFKRRSSWVPGRGT*LQAH': 1,
           'DSCYGLLRRSTWVPGRGT*LQAH': 1,
           'GTCDGLLRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSSWVPGRGP*LQAH': 1,
           'DTCDGLKLRSSWVPGRGA*LQAH': 1,
           'DTCYGLARRSNWVPGRGTLLQAH': 1,
           'DTCSGLKRRSNWVPGRGT*LQAH': 1,
           'DTCVGLKRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRAYWVPGRGTSLPAH': 1,
           'DTCYGLKRRSNWVPGRGAFLQAH': 1,
           'DTCYGLVRRSTWVPGRGT*LQAH': 1,
           'DTCYGLNRRSNWVPGRGPYLQAH': 1,
           'DTCYGLIRRSNWVPGRGT*LQAH': 1,
           'GTCYGLKRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRASWVPGRGTLLQAH': 1,
           'DTCYGLLRRSSWVPGRGT*LQAH': 1,
           'DTCYGLKSRSNWGPGRGT*LQAH': 1,
           'DTCYGLVRRSSWVPGRGTYLQAH': 1,
           'DTCYGLSRRSAWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVAGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGALLQAH': 1,
           'DTCYGLIRRS*WVPGRGPLLQAH': 1,
           'DTCYGLKRRSNWVPGRGPFLQAH': 1,
           'DTCCGLRRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSYWGPGRGP*LQAH': 1,
           'DTCCGLLRRSYWVPGRGTLLQAH': 1,
           'DTCCGFVRRSDWVPGRGT*LQAH': 1})},
 {'Original_Sequence': 'GACACCTGCTATGGATTAAAAAGGCGCTCCCGTTGGGTACCAGGTCGCGGCACCTAACTGCAGGCACATC',
  'New_Variant': np.str_('GACACGTGCTACGGATTAAAAAGGCGCTCCAACTGGGTACCAGGTCGCGGCACCTAACTGCAAGCACATC'),
  'Score': np.float64(0.7173562573784381),
  'Rank': 5,
  'Score_TRSp': 0.62,
  'Score_TRSpAvd': 0.83,
  'Proteins': Counter({'DTCYGLKRRSNWVPGRGT*LQAH': 10,
           'DTCYGLKRRSYWVPGRGT*LQAH': 9,
           'DTCYGLKRRSAWVPGRGT*LQAH': 4,
           'DTCYGLKGRSNWVPGRGT*LQAH': 4,
           'DTCYGLKRRSNWVPGRGTLLQAH': 3,
           'DTCYGLKRRSSWVPGRGP*LQAH': 3,
           'DTCYGLKRRSNWVPGRGPLLQAH': 3,
           'DTCYGLRRRSGWVPGRGT*LQAH': 3,
           'DTCCGLKRRSNWVPGRGT*LQAH': 3,
           'DTCYGLKRRSSWVPGRGT*LQAH': 2,
           'DTCYGLKRRSTWVPGRGT*LQAH': 2,
           'DTCYGLKRRSDWVPGRGP*LQAH': 2,
           'DTCYGLKRRSNWVPGRGTSLQAH': 2,
           'DTCYGLKRRSYWVPGRGP*LQAH': 2,
           'DTCFGLGRRSNWVPGRGTLLQAH': 1,
           'DTCYGL*RRSNWVPGRGSLLQAH': 1,
           'DTCYGLKGRSFWVPGRGT*LQAH': 1,
           'ATCYGLKRRSNWVPGRGT*LQAP': 1,
           'DPCYGLKRRSYWVPGRGT*LQAH': 1,
           'DACYGLGRRSLWVPGRGT*LQAH': 1,
           'DTCCGLRGRSYWVPGRGTLLQAH': 1,
           'DTCYGLKRRSLWVPGRGT*LQAH': 1,
           'DTCCGLKRRSNWVPGRGP*LQAH': 1,
           'DTCYGLKRRSNWVPGRGA*LQAH': 1,
           'DTCYGLKRRSYWVPGGGSLLQAH': 1,
           'DTCYGLKRRSDWVPGRGA*LQAH': 1,
           'DTCYGLTRRSNWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LPAH': 1,
           'DTCYGLGRRSNWGPGRGT*LQAH': 1,
           'DTCCGLKGRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGT*LHAH': 1,
           'DTCYGLKRRSNWVPGRGTYLQAH': 1,
           'DTCYGLEGRSNWVPGRGT*LQAH': 1,
           'DTCYGLRRRSYWVPGRGPLLQAH': 1,
           'DTCYGLERRSYWVPGRGPCLQAH': 1,
           'DTCYGLRRRSSWVPGRGALLQAH': 1,
           'ATCFGLKRRSDWGPGRGS*LQAH': 1,
           'ATCYGLKRRSYWVPGRGP*LQAH': 1,
           'GTCYGLKRRSNSVPGRGT*LQAH': 1,
           'DTCFGLERRSSWVPGRGT*LQAH': 1,
           'DTCYGLGRRSDWVPGRGT*LQAH': 1,
           'GTCYGLKRRSNWVPGRGT*LQAH': 1,
           'DTCDGLKRRSGWVPGRGP*LQAH': 1,
           'ATCYGLKGRSSWGPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGP*LQAH': 1,
           'DTCFGLFRRSGWVPGRGT*LQAH': 1,
           'DTCYGLGRRAYWVPGRGT*LQAR': 1,
           'DTCFGLKGRSYWVPGRGTVLQAH': 1,
           'DTCYGLKRRSRWVPGRGPLLQAH': 1,
           'DTCYGLGGRSSWVPGRGTVLQAH': 1,
           'DTCDGLRRRSYWVPGRGT*LQAH': 1,
           'DTCYGLKRRSDWVPGRGT*LQAH': 1,
           'DTCYGLRRRSNWVAGRGTLLQAH': 1,
           'DTCSGF*RRSCWVPGRGP*LQAH': 1,
           'DTCYGLKRRSPWVPGRGPYLQAH': 1,
           'DTCYGLKRRSFWVPGRGT*LQAH': 1,
           'DTCYGLKRRSNWVPGRGTFLQAH': 1,
           'DTCYGLKRRSYWGPGRGTYLQAH': 1,
           'DTCFGLRGRSNWVPGRGS*LQAH': 1,
           'DTCYGLKRRSNWVPGRGALLQAH': 1,
           'DTCYGLKRRSAWVPGRGPWVRAH': 1,
           'DTCYGLKRRSRWVPGRGA*LQAH': 1})}]
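
The five results above can be checked against the default `threshold` of 0.7 and filtered down to their Pareto front. The snippet is a sketch that uses only the scores printed above. The observation that `Score` matches the geometric mean of the two component scores is read off these numbers and is not a documented definition.

```python
import math

# (Rank, Score_TRSp, Score_TRSpAvd, Score) copied from the output above.
results = [
    (1, 0.81, 0.75, 0.7794228634059948),
    (2, 0.72, 0.83, 0.7730459236035074),
    (3, 0.86, 0.63, 0.7360706487831178),
    (4, 0.63, 0.83, 0.723118247591637),
    (5, 0.62, 0.83, 0.7173562573784381),
]
threshold = 0.7

# Variants whose two component scores both reach the threshold.
passing = [rank for rank, sp, avd, _ in results if sp >= threshold and avd >= threshold]

def dominates(p, q):
    """True if p is at least as good as q on both scores and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q

# Variants not dominated by any other variant on (Score_TRSp, Score_TRSpAvd).
pareto_front = [
    rank
    for rank, sp, avd, _ in results
    if not any(dominates((sp2, avd2), (sp, avd)) for _, sp2, avd2, _ in results)
]

# Gap between the reported Score and sqrt(Score_TRSp * Score_TRSpAvd).
geo_mean_gap = max(abs(math.sqrt(sp * avd) - score) for _, sp, avd, score in results)

print(passing, pareto_front, geo_mean_gap)
```

Ranks 1 and 2 reach the threshold on both scores. Rank 3 stays on the Pareto front because of its high `Score_TRSp`, although its `Score_TRSpAvd` falls below 0.7.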

Likelihood computation

Compute log-likelihood of observed VR sequences given a TR.


compute_likelihood


def compute_likelihood(
    TR, # Reference/template sequence.
    VR, # Variant sequence to score.
): # Log-likelihood of VR given TR.

Compute the log-likelihood of generating a variant sequence (VR) given a template/reference sequence (TR).

TR='CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC'
VR='CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC'
compute_likelihood(TR, VR)
np.float32(-4.534629)
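
How `compute_likelihood` combines the model outputs is not spelled out here. One common construction, shown purely as an assumption, sums the log-probabilities that a per-position distribution assigns to the observed VR bases. In the sketch below, `sequence_log_likelihood` and the toy `probs` array are hypothetical, and natural logarithms are assumed.

```python
import math
import numpy as np

BASES = "ACGT"

def sequence_log_likelihood(probs: np.ndarray, vr: str) -> float:
    """Sum of log-probabilities of the observed bases.

    `probs` has shape (len(vr), 4); each row is a distribution over A, C, G, T
    (a stand-in for whatever the model predicts at that position).
    """
    idx = np.array([BASES.index(base) for base in vr])
    return float(np.log(probs[np.arange(len(vr)), idx]).sum())

# Toy distributions: 0.97 on the observed base, 0.01 on each of the others.
vr = "ACGT"
probs = np.full((len(vr), 4), 0.01)
probs[np.arange(len(vr)), [BASES.index(b) for b in vr]] = 0.97

toy_ll = sequence_log_likelihood(probs, vr)

# Converting a log-likelihood back to a probability (natural log assumed).
prob_unmutated = math.exp(-4.534629)
```

Under the natural-log assumption, the value returned above corresponds to a probability of roughly 1% for the VR being identical to this 70-nucleotide TR.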

compute_likelihood_list


def compute_likelihood_list(
    TR_list, # Reference/template sequences.
    VR_list, # Variant sequences.
): # Log-likelihoods in the same order as input.

Compute log-likelihoods for (TR, VR) pairs.

TR_list=['CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC']*2
VR_list=['CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC','CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTATCATAAGACGCGACTGGGCCC']
compute_likelihood_list(TR_list, VR_list)
[np.float32(-4.5359917), np.float32(-8.759228)]

compute_likelihood_matrix


def compute_likelihood_matrix(
    TR_list, # List of reference/template sequences.
    VR_list, # List of variant sequences.
    batch_size:int=64, # (Currently unused) Intended batch size for future optimization.
): # Log-likelihood matrix of shape (len(TR_list), len(VR_list)).

Compute a matrix of log-likelihoods where each entry (i, j) corresponds to the log-likelihood of generating VR_list[j] from TR_list[i].

Sequences with mismatched lengths are assigned -inf.
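
The shape of the result and the `-inf` convention can be sketched with a pluggable scoring function. `likelihood_matrix` and `toy_score` are hypothetical names; `toy_score` merely counts mismatches and stands in for the model-based score.

```python
import math
from typing import Callable

def likelihood_matrix(
    tr_list: list[str],
    vr_list: list[str],
    score_fn: Callable[[str, str], float],
) -> list[list[float]]:
    """Entry (i, j) scores vr_list[j] against tr_list[i]; length mismatches get -inf."""
    return [
        [score_fn(tr, vr) if len(tr) == len(vr) else -math.inf for vr in vr_list]
        for tr in tr_list
    ]

def toy_score(tr: str, vr: str) -> float:
    """Toy stand-in for a log-likelihood: minus the number of mismatched bases."""
    return -float(sum(a != b for a, b in zip(tr, vr)))

matrix = likelihood_matrix(["ACGT", "ACGA"], ["ACGT", "ACGA", "AC"], toy_score)
```

Using `-inf` for length mismatches keeps the matrix rectangular, and those pairs can never win a later maximum over TRs.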

TR_list=['CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC','CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTATCATAAGACGCGACTGGGCCC']
VR_list=['CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTAACATAAGACGCGACTGGGCCC','CGTAAACCGGACCTAGTTTAGTTCTTAGACCAAGGTACATATCCCCGTATCATAAGACGCGACTGGGCCC']
compute_likelihood_matrix(TR_list, VR_list)
[[np.float32(-4.5359917), np.float32(-8.759228)],
 [np.float32(-16.45727), np.float32(-5.387972)]]
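
One possible use of such a matrix is to assign each VR to the TR that most likely produced it. The sketch below works on the values printed above. The uniform prior over TRs used for the posterior is an assumption made for illustration, as is the use of natural logarithms.

```python
import numpy as np

# Rows: TRs, columns: VRs (values copied from the output above).
log_lik = np.array([
    [-4.5359917, -8.759228],
    [-16.45727, -5.387972],
])

# Index of the most likely TR for each VR.
best_tr = log_lik.argmax(axis=0)

# Posterior over TRs for each VR under a uniform prior, computed stably
# by subtracting the column maximum before exponentiating.
shifted = log_lik - log_lik.max(axis=0, keepdims=True)
posterior = np.exp(shifted)
posterior /= posterior.sum(axis=0, keepdims=True)

print(best_tr)
print(posterior.round(4))
```

Each VR is assigned to the TR it is identical to, as expected for this two-sequence example.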