Functions for calling genotypes from paired-end FASTQ sequencing data. This module extends the single-end pipeline to handle forward and reverse reads with configurable span regions. Use get_genotypes_paired() as the main entry point.
def get_UMI_genotype_paired( fastq_path_fwd:str, # path to the input fastq file reading the ref_seq in the forward orientation fastq_path_rev:str, # path to the input fastq file reading the ref_seq in the reverse orientation ref_seq:str, # sequence of the reference amplicon fwd_span:tuple, # span of the ref_seq that is read in the forward orientation format: (start, end) rev_span:tuple, # span of the ref_seq that is read in the reverse orientation format: (start, end) fwd_ref_read_size:int=None, # number of nucleotides in the fwd read expected to align to the ref_seq. If None the whole read will be used. rev_ref_read_size:int=None, # number of nucleotides in the rev read expected to align to the ref_seq. If None the whole read will be used. require_perfect_pair_agreement:bool=True, # if True only pairs of reads that perfectly agree on the sequence within the common span will be used. If False the fwd sequence will be used. Will be set to False by default if there is no overlap. umi_size_fwd:int=10, # number of nucleotides at the beginning of the fwd read that will be used as the UMI umi_size_rev:int=0, # number of nucleotides at the beginning of the rev read that will be used as the UMI (if both are provided the umi will be the concatenation of both) quality_threshold:int=30, # threshold value used to filter out reads of poor average quality. Both reads have to pass the threshold. ignore_pos:list=[], # list of positions that are ignored in the genotype N:NoneType=None, # number of reads to consider (useful to get a quick view of the data without going through the whole fastq files). If None the whole data will be used. kwargs:VAR_KEYWORD)->dict: # alignment parameters can be passed here (match, mismatch, gap_open, gap_extend)
Processes paired-end FASTQ files to extract UMI-grouped genotypes. Aligns forward and reverse reads to the reference, optionally requiring perfect agreement in overlapping regions.
def get_genotypes_paired( fastq_path_fwd:str, # path to the input fastq file reading the ref_seq in the forward orientation fastq_path_rev:str, # path to the input fastq file reading the ref_seq in the reverse orientation ref_seq:str, # sequence of the reference amplicon fwd_span:tuple, # span of the ref_seq that is read in the forward orientation format: (start, end) rev_span:tuple, # span of the ref_seq that is read in the reverse orientation format: (start, end) fwd_ref_read_size:int=None, # number of nucleotides in the fwd read expected to align to the ref_seq. If None the whole read will be used. rev_ref_read_size:int=None, # number of nucleotides in the rev read expected to align to the ref_seq. If None the whole read will be used. require_perfect_pair_agreement:bool=True, # if True only pairs of reads that perfectly agree on the sequence within the common span will be used. If False the fwd sequence will be used. Will be set to False by default if there is no overlap. umi_size_fwd:int=10, # number of nucleotides at the beginning of the fwd read that will be used as the UMI umi_size_rev:int=0, # number of nucleotides at the beginning of the rev read that will be used as the UMI (if both are provided the umi will be the concatenation of both) quality_threshold:int=30, # threshold value used to filter out reads of poor average quality ignore_pos:list=[], # list of positions that are ignored in the genotype reads_per_umi_thr:int=0, # minimum number of reads required to take a UMI into account. Using a number >2 enables to perform error correction for UMIs with multiple reads. save_umi_data:str=None, # path to the csv file where to save the details of the genotypes reads for each UMI. If None the data isn't saved. N:NoneType=None, # number of reads to consider (useful to get a quick view of the data without going through the whole fastq files). If None the whole data will be used. kwargs:VAR_KEYWORD):
Processes paired-end FASTQ files to extract UMI-corrected genotypes. Returns a sorted list of (genotype_string, count) tuples.
gen_list = get_genotypes_paired(fastq_path_fwd, fastq_path_rev, ref_seq, fwd_span=(0,150), rev_span=(0,0), umi_size_fwd=0, umi_size_rev=10, ignore_pos=[0,1,150,151], N=100, save_umi_data="test.csv")for g in gen_list[:20]:print(f"{g[1]}\t{g[0]}")
n reads: 100
n_reads pass filter: 96
n_reads aligned: 94
n_pairs agree: 0
Number of UMIs: 85
Median number of reads per UMI: 1.0
Number of genotypes: 12
49
23 T50A,T51G
2 T33G,T34A,T51G
2 T33C,T34C,T36C,T50A
2 T37C,T42C,T45A,T50C,T51C
1 T130A
1 A52C
1 G8A
1 T51C
1 A52C,G131C
1 G69A
1 T37C,T42C,T45A,T50C,T51C,T146A