Introduction

GenoTan is a free tool to identify length variation of microsatellites from short sequence reads. Inferring lengths of inherited microsatellite alleles with single base pair resolution from short sequence reads is challenging due to several sources of noise including PCR amplification errors, individual cell mutation, misalignment or mis-mapping caused by the repetitive nature of the microsatellites. We have developed a method using a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation of microsatellite loci from short sequence reads, which effectively distinguishes length variants from INDEL errors at homopolymers.

License: GPL V3 (http://www.gnu.org/licenses/gpl-3.0.html)

How to use GenoTan

Table of Contents

Introduction. 1

Methods used in GenoTan. 2

1. Discretized Gaussian mixture distribution. 2

2. Decision of genotyping. 5

Getting started. 6

Installation. 6

Searching microsatellites. 7

1. Using TRF. 7

2. Searching repeats of short motifs. 7

Realignment of sequence reads mapped to the microsatellite loci 7

4. Test with simulated data. 9

Methods used in GenoTan

1. Discretized Gaussian mixture distribution

GenoTan uses a discretized Gaussian distribution to address homopolymer errors from short sequence reads mapped to microsatellite loci. Since probabilities of insertion and deletion errors in a sequence are proportional, they can be calculated from the cumulative distribution function of the Gaussian distribution.

Let l_L be the length of a candidate allele L at a target locus and let x be the observed length of the microsatellite sequence with INDEL errors in a read mapped to the locus with an assumption in which the length x is derived from the original length l_L. Let F_L(t) and f_L(t) denote the distribution and the density functions of a Gaussian random variable with mean l_L and variance respectively. Then the probability mass function p_L(x) of x is

(1)

where x=0,1,2,... , and is a scale factor (Since x is a natural number, we need to rescale to make the distribution a proper probability mass function).

B.suis GGCGCATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGTTGGAGCGT

read1 GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGT--------

read2 GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGG-----

read3 GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGGAG---

read4 ------TTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read5 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAG-----------------

read6 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGC--------------

read7 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGGGCA-------------

read8 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAGGCGCATACG---------

read9 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTT-------

read10 GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTG------

read11 ---GCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read12 ------TTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read13 --------------CTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read14 GGCGCATTTTGCAACTGATTCTATGCC--GGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read15 ---GCATTTTGCAACTGATTCTATGCC--GGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read16 GGCGCATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGT--------

read17 -----ATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGTTGGAGCGT

Recurrence of INDEL errors in sequence reads. The reference is a genomic sequence of Brucella suis 1330 which is a prokaryote with a haploid genome, and the locus is a unique region in the genome. The sequence reads were generated from the Illumina 101 cycle paired-end sequencing technology. Insertions or deletion errors by sequencing errors or other sources, such as PCR amplification error or individual cell mutation, results in various lengths of G homopolymers in sequence reads.

Gaussian distribution to express the bidirectional INDEL errors in sequence reads. The graph shows a Gaussian distribution for the locus at the previous figure. The chances of insertion errors and deletion errors are proportional.

For the heterozygous loci with allele lengths, l_L₁ and l_L₂, we can use the mixture distribution of (1) as follows

(2)

, where is the unknown mixture proportion parameter for reads derived from one of the two alleles, regardless of the repeat sequence length x. It is also assumed that the associated parameters and are both unknown. These parameters can be estimated by a nonlinear least squares (NLS) regression function.

Heterozygous fitting. Graphs represent the process of heterozygous fitting for 4 and 6 mapped reads containing microsatellite sequences with lengths 13 and 16 bases respectively at a locus.

If the sequence reads mapped to a same microsatellite locus contain INDEL errors, the number of observed lengths of the microsatellite at the locus would be equal to or more than 2. Since the inherited alleles are unknown, all observed lengths can be allele candidates. Then we can apply the g(x) function for each combination of two allele candidates (two of the same candidate for a homozygous genotype), calculate the squared error of each combination, and select the allele pair, L₁^*and L₂^*, that generates the minimum squared error as follows

(3)

, where o_x is an observed proportion of reads containing a length x microsatellite sequence, a is the minimum observed length minus a fixed amount k, and b is the maximum observed length plus k, where k is set to be five as default value. This is necessary because the g(x) function generates output values for all possible sequence lengths, the comparison between observed proportions and expected proportions need to be extended beyond the minimum and maximum observed lengths. So, the boundaries of the calculation are extended by an additional value k.

As an example, suppose that we have 2, 8 and 4 mapped reads containing microsatellite sequences with lengths 14, 15 and 16 bases respectively at a locus. The list of possible genotype candidates G(l_L1, l_L2) for the locus are G(14, 14), G(14, 15), G(14, 16), G(15, 15), G(15, 16), and G(16, 16). In the example, the observed minimum and maximum lengths are 14 and 16 respectively, and the observed and expected values from the equation 3 are compared for x ranging from 9 to 21. While the observed ratio of read counts between the highest read frequency allele (l_L₁=15) and the second highest read frequency allele (l_L₂=16) is 0.5 (=4/8), the read ratio of those two alleles estimated by the NLS function was 0.163 (= (1- )/= 0.14/0.86). The difference between the two estimated ratios may result a different decision of genotype calls, depending on the cutoff ratio to determine if the second highest read frequency allele candidate is noise.

2. Decision of genotyping

The most probable genotype for a given sequence reads mapped to a locus is decided by equation 3, but it is bias toward calling heterozygous genotypes, because the Gaussian mixture model is better fit to the training data when more distributions are mixed. However since reads supporting one or both predicted alleles may be from noise including individual cell mutation, PCR amplification error, sequencing error and mismapping, an evaluation method is necessary.

We use a rule-based approach to choose alleles and to decide the homozygosity of each locus because the frequencies of INDEL error reads derived from mis-mapping, PCR amplification error and individual cell mutation are more difficult to measure than that from the sequencing error. For this approach, we assign a confidence score to each allele instead of calculating the probability of a genotype (a two allele set) for a locus. The probability of each allele can be generated by the equation 1 as p_L1(l_L₁) or p_L2(l_L₂) if we assume the read frequencies from two different alleles at the heterozygotic locus are not correlated. However DNA fragments from two paired chromosomes have the same chances of being sequenced and the read frequencies of two alleles would be tend to be similar. If the proportion of reads for an allele candidate L_low with lower read frequency is too small compared to that for another allele candidate L_high with higher read frequency, we may conclude that the reads for the allele candidate L_low are from noise and the locus is homozygous. Considering this condition, we multiplied ratio of to and the output of p_Llow(l_Llow), where is the output of MIN{, 1-} and is the output of MAX{, 1-}. The confidence scores of two allele candidates were calculated by

(4)

In the final results, an allele candidate from the predicted genotype is removed when its confidence score is lower than given cutoff values (we set 0.35 for L_high and 0.25 for L_low). When a confidence score of L_low is lower than the cutoff value, GenoTan generates a partial genotyping call for the locus in which only one allele is called while the other allele is reported as unknown. It only reports the genotype of the locus as homozygous when the number of reads supporting the selected allele is more than 4 and its confidence score is equal to or higher than 0.9. The confidence score of the second allele, L_high₂, at a homozygous locus is calculated by

(5)

where [0.5ⁿ] represents the probability of that another unobserved allele exists when n reads support the selected allele.

Getting started

GenoTan is available at the Download page and is distributed with three additional PERL programs including makeListFromTRF.pl, searchTandemRepeats.pl and realignReadsOnRepeats.pl.

Installation

The current release requires GSL (http://www.gnu.org/software/gsl, developed with GSL 1.14) and runs in the LINUX environment.

$ tar xzf genotan.[VERSION].tar.gz

$ cd genotan.[VERSION]

If the GSL library was installed in a personal directory, edit the 'Makefile' file in the directory.

Ex) If the GSL library was installed in the '$(HOME)/opt' directory,

INCLUDES= -I. -I$(HOME)/opt/include

LIBPATH= -L$(HOME)/opt/lib

Change '$(HOME)/opt/include' to a full path of a directory containing GSL headers and '$(HOME)/opt/lib' to a full path of a directory containing GSL libs.

Then, type 'make' to compile it.

$ make

Searching microsatellites

We provide two methods to search microsatellite sequences. The first method is to use TRF (Tandem Repeat Finder, http://tandem.bu.edu/trf/trf.html) which searches imperfect repeats of short motifs from reference sequences. The second method is to search perfect repeats of short motifs from reference sequences directly.

1. Using TRF

To convert the output of TRF to the input file of GenoTan, we provide makeListFromTRF.pl.

$ trf reference.fa 2 7 5 80 10 14 6 -h

$ [GenoTan_path]/makeListFromTRF.pl -r reference.fa -t reference.fa.2.7.5.80.10.14.6.dat -o mic.lst

2. Searching repeats of short motifs

We also provide searchTandemRepeats.pl to search repeats of short motifs from reference sequences directly.

$ [GenoTan_path]/searchTandemRepeats.pl reference.fa -o mic.lst

Realignment of sequence reads mapped to the microsatellite loci

Misalignment of sequence reads mapped to the repeat sequences is frequently observed. Since GenoTan does not have a process to realign sequence reads, its results significantly rely on the alignment quality. To reduce the misalignment problem we provide realignReadsOnRepeats.pl which is a realignment program for the sequence reads.

$ [GenoTan_path]/realignReadsOnRepeats.pl -r reference.fa -m mic.lst -s input.sam -o realigned.sam

The sequence reads can be also realigned by GATK (Genome Analysis Toolkit, http://www.broadinstitute.org/gsa/wiki).

$ cat mic.lst | perl -ne '@arr=split/\t/; print "$arr[0]:$arr[1]-".($arr[1]+$arr[2]-1)."\n";' > mic.intervals

$ java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -targetIntervals mic.intervals -I input.bam
-o realigned.bam

Run GenoTan

$ [GenoTan_path]/genotan -m mic.lst realigned.sam -o genotype.txt

Usage : genotan -m <microsatellite list file> <sam or bam file> [-o <output file>] [-L off]
[-C <0.01-0.99>] [-c <0.01-0.99>] [-q <0-255 >]

1. Input files

GenoTan takes a microsatellite list file (which requires '-m') and a SAM or BAM file containing mapped reads. The microsatellite list file has the following format.

[chromosome], [starting position], [length], [motif], [sequence] in a tab-delimited format

Ex)

chr1 26454 12 GT GTGTGTGTGTGT

chr1 28589 15 T TTTTTTTTTTTTTTT

chr1 44836 32 AAAT AAATAAATAAATAAATAAATAAATAAATAAAT

chr1 722365 20 ATTT ATTTATTTATTTATTTATTT

2. Other options

-L off : Turn off the normalization of read counts using the allele lengths (default : on)

-C FLOAT [0.01-0.99] : If a confidence score of a higher read frequency allele at a locus is lower than the cutoff value, a genotype of the locus will not be reported. (default : 0.35)

-c FLOAT [0.01-0.99] : If a confidence score of a lower read frequency allele at a locus is lower than the cutoff value, the allele will not be reported. (default : 0.25)

-q INT [0-255] : if a mapping quality of a read is lower than the cutoff value, the read will not be used. (default : 5)

-O INT [0: no sequence, 1: sequence output]: It decides the output style. (default : 1)

3. Output file

Ex)

chr2	20243	20253	11	11(0:0)	-1(0:0)	0.554	0	AAAAAAAAAAA	?
chr2	23247	23259	13	13(0:0)	13(0:0)	0.958	0.943	AATAAATAAATAA	-
chr2	25281	25294	14	14(0:0)	14(0:0)	0.977	0.971	AAATAAATAAATAA	-
chr2	26216	26223	8	8(0:0)	8(0:0)	0.962	0.895	TTTTTTTT	-
chr2	29366	29375	10	10(0:0)	9(0:0)	0.731	0.420	AAAAAAAAAA	AAAAAAAAA
chr2	29462	29471	10	10(0:0)	-1(0:0)	0.866	0	AAAAAAAAAA	?
chr2	38985	39001	17	-1(0:0)	-1(0:0)	0	0	?	?

The output file of GenoTan is a tab-delimited file. Each line consists of

Col.	Field	Description
1	chr.	Reference sequence name
2	start	Starting position of the locus in the reference sequence
3	end	Ending position of the locus in the reference sequence
4	ref. len.	Length of the locus in the reference sequence
5	len. allele 1 (ext. del. in flanking left:right)	Length of allele 1 (higher read frequency allele candidate) with additional deletions in flanking sequences (left:right). See below.
6	len. allele 2 (ext. del. in flanking left:right)	Length of allele 2 (lower read frequency allele candidate) with additional deletions in flanking sequences
7	score of allele 1	Confidence score of allele 1. See 'Decision of genotyping'
8	score of allele 2	Confidence score of allele 2
9	seq. of allele 1	Sequence of allele 1. If the allele is not identified, '?' is displayed.
10	seq. of allele 2	Sequence of allele 2. If the sequence of the allele is the same as allele 1, '-' is displayed.

'ext. del. in flanking left:right' : Lengths of deletions adjoined to the microsatellite locus. For example :

reference : ATTCGAGGGATATATATATATCAAGTAC

read 1 : ATTCGAG----ATATATATATCAAG

read 2 : CGAG----ATATATATATCAAGTAC

In this example, the length of the microsatellite in the reference and the length of the allele candidate in the reads are 12 bases and 10 bases respectively. But there are two additional deleted bases in the left flanking sequence of the microsatellite. To separate the deletions in the repeat sequence from that in the flanking sequences, the allele is displayed as '10(2:0)' in the output file.

4. Test with simulated data

The simulated data is available online [Download]. It was created from human genome data with artificial microsatellite sequences of 1~8-mer motifs. Repeat sequences in sequence reads for the microsatellite loci were generated by the Binomial random function (p=0.5) and two Gaussian random functions. The inherited alleles (artificial) for each locus are in the 'test.allele.lst' file. There are three different BAM files and each file contains reads mapped to the microsatellite loci in different coverage.

$ [GenoTan_path]/genotan -m test.mic.lst test.cov.10.bam -L off -o genotype.cov.10.txt

Genotyping of microsatellite loci Genotyping of microsatellite loci GenoTan GenoTan