Functions for calling genotypes from single-end FASTQ sequencing data. This module handles the full pipeline from raw reads to filtered genotype lists, including UMI-based deduplication and consensus genotype calling. Use get_genotypes() as the main entry point.
def get_UMI_genotype( fastq_path:str, # path to the input fastq file ref_seq:str, # sequence of the reference amplicon umi_size:int=10, # number of nucleotides at the beginning of the read that will be used as the UMI ref_read_size:int=None, # number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used. quality_threshold:int=30, # threshold value used to filter out reads of poor average quality ignore_pos:list=[], # list of positions that are ignored in the genotype kwargs:VAR_KEYWORD)->dict: # alignment parameters can be passed here (match, mismatch, gap_open, gap_extend)
Takes as input a fastq_file of single read amplicon sequencing, and a reference amplicon sequence. Returns a dictionary containing as keys UMIs and as values a Counter of all genotype strings read for that UMI.
fastq_file="sacB_example.fastq.gz"fastq_path=os.path.join(data_path,fastq_file)read_ref_file="sacB_ref.fasta"ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))ref_seq=str(ref.seq)UMI_gencounter = get_UMI_genotype(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-5)for umi in itertools.islice(UMI_gencounter,20):print(umi, list(UMI_gencounter[umi].items()))
def correct_UMI_genotypes( UMI_gencounter:dict, # the output of the get_UMI_genotype function reads_per_umi_thr:int=2, # only assign a genotype to a UMI if we have reads_per_umi_thr reads for that genotype or more)->dict:
Keeps only the genotype with the most reads for each UMI. Returns a dictionary with UMIs as keys and the genotype string as value.
Takes as input the output of correct_UMI_genotypes() and returns a list of genotypes sorted by the number of UMIs detected corresponding to each genotype.
UMI_gen_dict=correct_UMI_genotypes(UMI_gencounter, reads_per_umi_thr=0)gen_list = genotype_UMI_counter(UMI_gen_dict)for g in gen_list[:20]:print(f"{g[1]}\t{g[0]}")
def get_genotypes( fastq_path:str, # path to the input fastq file ref_seq:str, # sequence of the reference amplicon umi_size:int=10, # number of nucleotides at the beginning of the read that will be used as the UMI ref_read_size:int=None, # number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used. quality_threshold:int=30, # threshold value used to filter out reads of poor average quality ignore_pos:list=[], # list of positions that are ignored in the genotype reads_per_umi_thr:int=0, # minimum number of reads required to take a UMI into account. Using a number >2 enables to perform error correction for UMIs with multiple reads. save_umi_data:str=None, # path to the csv file where to save the details of the genotypes reads for each UMI. If None the data isn't saved. kwargs:VAR_KEYWORD):
Processes a single-end FASTQ file to extract UMI-corrected genotypes. Returns a sorted list of (genotype_string, count) tuples.
fastq_file="sacB_example.fastq.gz"fastq_path=os.path.join(data_path,fastq_file)read_ref_file="sacB_ref.fasta"ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))ref_seq=str(ref.seq)gen_list = get_genotypes(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-4, save_umi_data="test.csv")for g in gen_list[:20]:print(f"{g[1]}\t{g[0]}")
n reads: 1000
n_reads pass filter: 847
n_reads aligned: 824
Number of UMIs: 814
Median number of reads per UMI: 1.0
Number of genotypes: 123
675
3 C56A
3 A76G
3 A91G
3 A91T
2 C69T
2 T122A
2 A91C
2 A105G
2 C116A
2 T60A
2 T59A
2 A68G
2 T134A
1 A61G,-63T,A76T,A91T
1 A79T,A91G
1 A61G,A72G,A76G,A79T
1 T108A,G127T,G132T
1 A48T,A86G
1 A61T,A68T,A72G,A79C,A91G