genotypes

Functions for calling genotypes from single-end FASTQ sequencing data. This module handles the full pipeline from raw reads to filtered genotype lists, including UMI-based deduplication and consensus genotype calling. Use get_genotypes() as the main entry point.

UMI processing


source

get_UMI_genotype


def get_UMI_genotype(
    fastq_path:str, # path to the input fastq file
    ref_seq:str, # sequence of the reference amplicon
    umi_size:int=10, # number of nucleotides at the beginning of the read that will be used as the UMI
    ref_read_size:int=None, # number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
    quality_threshold:int=30, # threshold value used to filter out reads of poor average quality
    ignore_pos:list=[], # list of positions that are ignored in the genotype
    kwargs:VAR_KEYWORD
)->dict: # alignment parameters can be passed here (match, mismatch, gap_open, gap_extend)

Takes as input a fastq_file of single read amplicon sequencing, and a reference amplicon sequence. Returns a dictionary containing as keys UMIs and as values a Counter of all genotype strings read for that UMI.

fastq_file="sacB_example.fastq.gz"
fastq_path=os.path.join(data_path,fastq_file)

read_ref_file="sacB_ref.fasta"
ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))
ref_seq=str(ref.seq)

UMI_gencounter = get_UMI_genotype(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-5)

for umi in itertools.islice(UMI_gencounter,20):
    print(umi, list(UMI_gencounter[umi].items()))
n reads:    1000
n_reads pass filter:    847
n_reads aligned:    824
Number of UMIs: 814
Median number of reads per UMI: 1.0
GCATANCTCA [('A61G,-63T,A76T,A91T', 1)]
CGCATNTATA [('', 1)]
CCTTGNAGTA [('', 1)]
GGCGCNAGAA [('', 1)]
TCTCTTGTGA [('', 1)]
ATTACAGAAT [('', 1)]
CTTTTACTAT [('', 1)]
TCAAAGTTTT [('A79T,A91G', 1)]
TTAGCTCATA [('', 1)]
TCATAATGTA [('', 1)]
ATGTGCGGAT [('', 1)]
TGTGTTTATA [('', 1)]
CCATACATCC [('', 1)]
AGGGACGTTT [('A61G,A72G,A76G,A79T', 1)]
GTGTAATAGC [('', 1)]
ATGTCTTTTA [('', 1)]
TATCGGTAGT [('', 1)]
GTCGGGGGGG [('', 1)]
AAGTGGCACA [('', 1)]
AATAGAACCT [('T108A,G127T,G132T', 1)]

Genotype calling


source

correct_UMI_genotypes


def correct_UMI_genotypes(
    UMI_gencounter:dict, # the output of the get_UMI_genotype function
    reads_per_umi_thr:int=2, # only assign a genotype to a UMI if we have reads_per_umi_thr reads for that genotype or more
)->dict:

Keeps only the genotype with the most reads for each UMI. Returns a dictionary with UMIs as keys and the genotype string as value.

correct_UMI_genotypes(UMI_gencounter)
{'CTCCGGGGAG': '',
 'TGCTTGAGTG': 'A79T',
 'AGGGCGGGCT': '',
 'ATTTCTGTTT': '',
 'TGGGGGGGCT': '',
 'GATTGGTAGA': '',
 'GAACTCTAGT': '',
 'TAACTAATCG': 'A79G,A86G,A91G'}

source

genotype_UMI_counter


def genotype_UMI_counter(
    UMI_gen_dict
):

Takes as input the output of correct_UMI_genotypes() and returns a list of genotypes sorted by the number of UMIs detected corresponding to each genotype.

UMI_gen_dict=correct_UMI_genotypes(UMI_gencounter, reads_per_umi_thr=0)
gen_list = genotype_UMI_counter(UMI_gen_dict)
for g in gen_list[:20]:
    print(f"{g[1]}\t{g[0]}")
675 
3   C56A
3   A76G
3   A91G
3   A91T
2   C69T
2   T122A
2   A91C
2   A105G
2   C116A
2   T60A
2   T59A
2   A68G
2   T134A
1   A61G,-63T,A76T,A91T
1   A79T,A91G
1   A61G,A72G,A76G,A79T
1   T108A,G127T,G132T
1   A48T,A86G
1   A61T,A68T,A72G,A79C,A91G

Main pipeline


source

get_genotypes


def get_genotypes(
    fastq_path:str, # path to the input fastq file
    ref_seq:str, # sequence of the reference amplicon
    umi_size:int=10, # number of nucleotides at the beginning of the read that will be used as the UMI
    ref_read_size:int=None, # number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
    quality_threshold:int=30, # threshold value used to filter out reads of poor average quality
    ignore_pos:list=[], # list of positions that are ignored in the genotype
    reads_per_umi_thr:int=0, # minimum number of reads required to take a UMI into account. Using a number >2 enables to perform error correction for UMIs with multiple reads.
    save_umi_data:str=None, # path to the csv file where to save the details of the genotypes reads for each UMI. If None the data isn't saved.
    kwargs:VAR_KEYWORD
):

Processes a single-end FASTQ file to extract UMI-corrected genotypes. Returns a sorted list of (genotype_string, count) tuples.

fastq_file="sacB_example.fastq.gz"
fastq_path=os.path.join(data_path,fastq_file)
read_ref_file="sacB_ref.fasta"
ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))
ref_seq=str(ref.seq)
gen_list = get_genotypes(fastq_path, ref_seq, 
                         ignore_pos=[0,1,2,138,139,140,141],
                         gap_open=-4, 
                         save_umi_data="test.csv")
for g in gen_list[:20]:
    print(f"{g[1]}\t{g[0]}")
n reads:    1000
n_reads pass filter:    847
n_reads aligned:    824
Number of UMIs: 814
Median number of reads per UMI: 1.0
Number of genotypes: 123
675 
3   C56A
3   A76G
3   A91G
3   A91T
2   C69T
2   T122A
2   A91C
2   A105G
2   C116A
2   T60A
2   T59A
2   A68G
2   T134A
1   A61G,-63T,A76T,A91T
1   A79T,A91G
1   A61G,A72G,A76G,A79T
1   T108A,G127T,G132T
1   A48T,A86G
1   A61T,A68T,A72G,A79C,A91G