genotypes

Functions to go a fastq file to a list of genotypes

get_UMI_genotype

 get_UMI_genotype (fastq_path:str, ref_seq:str, umi_size:int=10,
                   ref_read_size:int=None, quality_threshold:int=30,
                   ignore_pos:list=[], **kwargs)

Takes as input a fastq_file of single read amplicon sequencing, and a reference amplicon sequence. Returns a dictionnary containing as keys UMIs and as values a Counter of all genotype strings read for that UMI.

	Type	Default	Details
fastq_path	str		path to the input fastq file
ref_seq	str		sequence of the reference amplicon
umi_size	int	10	number of nucleotides at the begining of the read that will be used as the UMI
ref_read_size	int	None	number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
quality_threshold	int	30	threshold value used to filter out reads of poor average quality
ignore_pos	list	[]	list of positions that are ignored in the genotype
kwargs	VAR_KEYWORD
Returns	dict		alignment parameters can be passed here (match, mismatch, gap_open, gap_extend)

fastq_file="sacB_example.fastq.gz"
fastq_path=os.path.join(data_path,fastq_file)

read_ref_file="sacB_ref.fasta"
ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))
ref_seq=str(ref.seq)

UMI_gencounter = get_UMI_genotype(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-5)

for umi in itertools.islice(UMI_gencounter,20):
    print(umi, list(UMI_gencounter[umi].items()))

n reads:    1000
n_reads pass filter:    847
n_reads aligned:    824
Number of UMIs: 814
Median number of reads per UMI: 1.0
GCATANCTCA [('A61G,-63T,A76T,A91T', 1)]
CGCATNTATA [('', 1)]
CCTTGNAGTA [('', 1)]
GGCGCNAGAA [('', 1)]
TCTCTTGTGA [('', 1)]
ATTACAGAAT [('', 1)]
CTTTTACTAT [('', 1)]
TCAAAGTTTT [('A79T,A91G', 1)]
TTAGCTCATA [('', 1)]
TCATAATGTA [('', 1)]
ATGTGCGGAT [('', 1)]
TGTGTTTATA [('', 1)]
CCATACATCC [('', 1)]
AGGGACGTTT [('A61G,A72G,A76G,A79T', 1)]
GTGTAATAGC [('', 1)]
ATGTCTTTTA [('', 1)]
TATCGGTAGT [('', 1)]
GTCGGGGGGG [('', 1)]
AAGTGGCACA [('', 1)]
AATAGAACCT [('T108A,G127T,G132T', 1)]

source

correct_UMI_genotypes

 correct_UMI_genotypes (UMI_gencounter:dict, reads_per_umi_thr=2)

Keeps only the genotype with the most reads for each UMI. Returns a dictionary with UMIs as keys and a tuple as value: (genotype string, number of reads)

	Type	Default	Details
UMI_gencounter	dict		the output of the get_UMI_genotype function
reads_per_umi_thr	int	2	only assign a genotype to a UMI if we have reads_per_umi_thr reads for that genotype or more
Returns	dict

correct_UMI_genotypes(UMI_gencounter)

{'CTCCGGGGAG': '',
 'TGCTTGAGTG': 'A79T',
 'AGGGCGGGCT': '',
 'ATTTCTGTTT': '',
 'TGGGGGGGCT': '',
 'GATTGGTAGA': '',
 'GAACTCTAGT': '',
 'TAACTAATCG': 'A79G,A86G,A91G'}

source

genotype_UMI_counter

 genotype_UMI_counter (UMI_gen_dict)

Takes as input the output of correct_UMI_genotypes() and returns a list of genotypes sorted by the number of UMIs detected corresponding that each genotype.

UMI_gen_dict=correct_UMI_genotypes(UMI_gencounter, reads_per_umi_thr=0)
gen_list = genotype_UMI_counter(UMI_gen_dict)
for g in gen_list[:20]:
    print(f"{g[1]}\t{g[0]}")

675 
3   C56A
3   A76G
3   A91G
3   A91T
2   C69T
2   T122A
2   A91C
2   A105G
2   C116A
2   T60A
2   T59A
2   A68G
2   T134A
1   A61G,-63T,A76T,A91T
1   A79T,A91G
1   A61G,A72G,A76G,A79T
1   T108A,G127T,G132T
1   A48T,A86G
1   A61T,A68T,A72G,A79C,A91G

source

get_genotypes

 get_genotypes (fastq_path:str, ref_seq:str, umi_size:int=10,
                ref_read_size:int=None, quality_threshold:int=30,
                ignore_pos:list=[], reads_per_umi_thr:int=0,
                save_umi_data:str=None, **kwargs)

Putting things together in a single wrapper function that takes the fastq as input and returns the list of genotypes.

	Type	Default	Details
fastq_path	str		path to the input fastq file
ref_seq	str		sequence of the reference amplicon
umi_size	int	10	number of nucleotides at the begining of the read that will be used as the UMI
ref_read_size	int	None	number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
quality_threshold	int	30	threshold value used to filter out reads of poor average quality
ignore_pos	list	[]	list of positions that are ignored in the genotype
reads_per_umi_thr	int	0	minimum number of reads required to take a UMI into account. Using a number >2 enables to perform error correction for UMIs with multiple reads.
save_umi_data	str	None	path to the csv file where to save the details of the genotypes reads for each UMI. If None the data isn’t saved.
kwargs	VAR_KEYWORD

fastq_file="sacB_example.fastq.gz"
fastq_path=os.path.join(data_path,fastq_file)
read_ref_file="sacB_ref.fasta"
ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))
ref_seq=str(ref.seq)
gen_list = get_genotypes(fastq_path, ref_seq, 
                         ignore_pos=[0,1,2,138,139,140,141],
                         gap_open=-4, 
                         save_umi_data="test.csv")
for g in gen_list[:20]:
    print(f"{g[1]}\t{g[0]}")

n reads:    1000
n_reads pass filter:    847
n_reads aligned:    824
Number of UMIs: 814
Median number of reads per UMI: 1.0
Number of genotypes: 123
675 
3   C56A
3   A76G
3   A91G
3   A91T
2   C69T
2   T122A
2   A91C
2   A105G
2   C116A
2   T60A
2   T59A
2   A68G
2   T134A
1   A61G,-63T,A76T,A91T
1   A79T,A91G
1   A61G,A72G,A76G,A79T
1   T108A,G127T,G132T
1   A48T,A86G
1   A61T,A68T,A72G,A79C,A91G