Takes as input a fastq_file of single read amplicon sequencing, and a reference amplicon sequence. Returns a dictionnary containing as keys UMIs and as values a Counter of all genotype strings read for that UMI.
Type
Default
Details
fastq_path
str
path to the input fastq file
ref_seq
str
sequence of the reference amplicon
umi_size
int
10
number of nucleotides at the begining of the read that will be used as the UMI
ref_read_size
int
None
number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
quality_threshold
int
30
threshold value used to filter out reads of poor average quality
ignore_pos
list
[]
list of positions that are ignored in the genotype
kwargs
Returns
dict
alignment parameters can be passed here (match, mismatch, gap_open, gap_extend)
fastq_file="sacB_example.fastq.gz"fastq_path=os.path.join(data_path,fastq_file)read_ref_file="sacB_ref.fasta"ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))ref_seq=str(ref.seq)UMI_gencounter = get_UMI_genotype(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-5)for umi in itertools.islice(UMI_gencounter,20):print(umi, list(UMI_gencounter[umi].items()))
Keeps only the genotype with the most reads for each UMI. Returns a dictionary with UMIs as keys and a tuple as value: (genotype string, number of reads)
Type
Default
Details
UMI_gencounter
dict
the output of the get_UMI_genotype function
reads_per_umi_thr
int
2
only assign a genotype to a UMI if we have reads_per_umi_thr reads for that genotype or more
Takes as input the output of correct_UMI_genotypes() and returns a list of genotypes sorted by the number of UMIs detected corresponding that each genotype.
UMI_gen_dict=correct_UMI_genotypes(UMI_gencounter, reads_per_umi_thr=0)gen_list = genotype_UMI_counter(UMI_gen_dict)for g in gen_list[:20]:print(f"{g[1]}\t{g[0]}")
Putting things together in a single wrapper function that takes the fastq as input and returns the list of genotypes.
Type
Default
Details
fastq_path
str
path to the input fastq file
ref_seq
str
sequence of the reference amplicon
umi_size
int
10
number of nucleotides at the begining of the read that will be used as the UMI
ref_read_size
int
None
number of nucleotides in the read expected to align to the ref_seq. If None the whole read will be used.
quality_threshold
int
30
threshold value used to filter out reads of poor average quality
ignore_pos
list
[]
list of positions that are ignored in the genotype
reads_per_umi_thr
int
0
minimum number of reads required to take a UMI into account. Using a number >2 enables to perform error correction for UMIs with multiple reads.
save_umi_data
str
None
path to the csv file where to save the details of the genotypes reads for each UMI. If None the data isn’t saved.
kwargs
fastq_file="sacB_example.fastq.gz"fastq_path=os.path.join(data_path,fastq_file)read_ref_file="sacB_ref.fasta"ref=next(SeqIO.parse(os.path.join(data_path,read_ref_file),"fasta"))ref_seq=str(ref.seq)gen_list = get_genotypes(fastq_path, ref_seq, ignore_pos=[0,1,2,138,139,140,141], gap_open=-4, save_umi_data="test.csv")for g in gen_list[:20]:print(f"{g[1]}\t{g[0]}")
n reads: 1000
n_reads pass filter: 847
n_reads aligned: 824
Number of UMIs: 814
Median number of reads per UMI: 1.0
Number of genotypes: 123
675
3 C56A
3 A76G
3 A91G
3 A91T
2 C69T
2 T122A
2 A91C
2 A105G
2 C116A
2 T60A
2 T59A
2 A68G
2 T134A
1 A61G,-63T,A76T,A91T
1 A79T,A91G
1 A61G,A72G,A76G,A79T
1 T108A,G127T,G132T
1 A48T,A86G
1 A61T,A68T,A72G,A79C,A91G