Genotyping of microsatellite loci
Genotyping of microsatellite loci
GenoTan
GenoTan

Introduction

GenoTan is a free tool to identify length variation of microsatellites from short sequence reads. Inferring lengths of inherited microsatellite alleles with single base pair resolution from short sequence reads is challenging due to several sources of noise including PCR amplification errors, individual cell mutation, misalignment or mis-mapping caused by the repetitive nature of the microsatellites. We have developed a method using a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation of microsatellite loci from short sequence reads, which effectively distinguishes length variants from INDEL errors at homopolymers.

License: GPL V3 (http://www.gnu.org/licenses/gpl-3.0.html)

 

 

How to use GenoTan

Table of Contents

Introduction. 1

Methods used in GenoTan. 2

1. Discretized Gaussian mixture distribution. 2

2. Decision of genotyping. 5

Getting started. 6

Installation. 6

Searching microsatellites. 7

1. Using TRF. 7

2. Searching repeats of short motifs. 7

Realignment of sequence reads mapped to the microsatellite loci 7

Run GenoTan. 8

1. Input files. 8

2. Other options. 8

3. Output file. 8

4. Test with simulated data. 9

 

Methods used in GenoTan

1. Discretized Gaussian mixture distribution

GenoTan uses a discretized Gaussian distribution to address homopolymer errors from short sequence reads mapped to microsatellite loci. Since probabilities of insertion and deletion errors in a sequence are proportional, they can be calculated from the cumulative distribution function of the Gaussian distribution.

Let lL be the length of a candidate allele L at a target locus and let x be the observed length of the microsatellite sequence with INDEL errors in a read mapped to the locus with an assumption in which the length x is derived from the original length lL. Let FL(t) and f L(t) denote the distribution and the density functions of a Gaussian random variable with mean lL and variance  respectively. Then the probability mass function pL(x) of x is

 

(1)

 

where x=0,1,2,... , and is a scale factor (Since x is a natural number, we need to rescale to make the distribution a proper probability mass function).

 

 

B.suis  GGCGCATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGTTGGAGCGT

read1   GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGT--------

read2   GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGG-----

read3   GGCGCATTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGGAG---

read4   ------TTTTGCAACTGATTCTATGCCGGGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read5   GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAG-----------------

read6   GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGC--------------

read7   GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGGGCA-------------

read8   GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAGGCGCATACG---------

read9   GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTT-------

read10  GGCGCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTG------

read11  ---GCATTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read12  ------TTTTGCAACTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read13  --------------CTGATTCTATGCC-GGGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read14  GGCGCATTTTGCAACTGATTCTATGCC--GGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read15  ---GCATTTTGCAACTGATTCTATGCC--GGGGGGGGGGGAAGCGCATACGTTGGAGCGT

read16  GGCGCATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGT--------

read17  -----ATTTTGCAACTGATTCTATGCC---GGGGGGGGGGAAGCGCATACGTTGGAGCGT

 

Recurrence of INDEL errors in sequence reads. The reference is a genomic sequence of Brucella suis 1330 which is a prokaryote with a haploid genome, and the locus is a unique region in the genome. The sequence reads were generated from the Illumina 101 cycle paired-end sequencing technology. Insertions or deletion errors by sequencing errors or other sources, such as PCR amplification error or individual cell mutation, results in various lengths of G homopolymers in sequence reads.

 

Gaussian distribution to express the bidirectional INDEL errors in sequence reads. The graph shows a Gaussian distribution for the locus at the previous figure. The chances of insertion errors and deletion errors are proportional.

 

For the heterozygous loci with allele lengths, lL1 and lL2, we can use the mixture distribution of (1) as follows

 

,

(2)

 

, where is the unknown mixture proportion parameter for reads derived from one of the two alleles, regardless of the repeat sequence length x. It is also assumed that the associated parameters  and  are both unknown. These parameters can be estimated by a nonlinear least squares (NLS) regression function.

 

 

Heterozygous fitting. Graphs represent the process of heterozygous fitting for 4 and 6 mapped reads containing microsatellite sequences with lengths 13 and 16 bases respectively at a locus.

 

 

If the sequence reads mapped to a same microsatellite locus contain INDEL errors, the number of observed lengths of the microsatellite at the locus would be equal to or more than 2. Since the inherited alleles are unknown, all observed lengths can be allele candidates. Then we can apply the g(x) function for each combination of two allele candidates (two of the same candidate for a homozygous genotype), calculate the squared error of each combination, and select the allele pair, L1*and L2*, that generates the minimum squared error as follows

 

(3)

 

 

, where ox is an observed proportion of reads containing a length x microsatellite sequence, a is the minimum observed length minus a fixed amount k, and b is the maximum observed length plus k, where k is set to be five as default value. This is necessary because the g(x) function generates output values for all possible sequence lengths, the comparison between observed proportions and expected proportions need to be extended beyond the minimum and maximum observed lengths. So, the boundaries of the calculation are extended by an additional value k.

As an example, suppose that we have 2, 8 and 4 mapped reads containing microsatellite sequences with lengths 14, 15 and 16 bases respectively at a locus. The list of possible genotype candidates G(lL1, lL2) for the locus are G(14, 14), G(14, 15), G(14, 16), G(15, 15), G(15, 16), and G(16, 16). In the example, the observed minimum and maximum lengths are 14 and 16 respectively, and the observed and expected values from the equation 3 are compared for x ranging from 9 to 21. While the observed ratio of read counts between the highest read frequency allele (lL1=15) and the second highest read frequency allele (lL2=16) is 0.5 (=4/8), the read ratio of those two alleles estimated by the NLS function was 0.163 (= (1- )/= 0.14/0.86). The difference between the two estimated ratios may result a different decision of genotype calls, depending on the cutoff ratio to determine if the second highest read frequency allele candidate is noise.

 

 

2. Decision of genotyping

The most probable genotype for a given sequence reads mapped to a locus is decided by equation 3, but it is bias toward calling heterozygous genotypes, because the Gaussian mixture model is better fit to the training data when more distributions are mixed. However since reads supporting one or both predicted alleles may be from noise including individual cell mutation, PCR amplification error, sequencing error and mismapping, an evaluation method is necessary.

We use a rule-based approach to choose alleles and to decide the homozygosity of each locus because the frequencies of INDEL error reads derived from mis-mapping, PCR amplification error and individual cell mutation are more difficult to measure than that from the sequencing error. For this approach, we assign a confidence score to each allele instead of calculating the probability of a genotype (a two allele set) for a locus. The probability of each allele can be generated by the equation 1 as pL1(lL1) or pL2(lL2) if we assume the read frequencies from two different alleles at the heterozygotic locus are not correlated. However DNA fragments from two paired chromosomes have the same chances of being sequenced and the read frequencies of two alleles would be tend to be similar. If the proportion of reads for an allele candidate Llow with lower read frequency is too small compared to that for another allele candidate Lhigh with higher read frequency, we may conclude that the reads for the allele candidate Llow are from noise and the locus is homozygous. Considering this condition, we multiplied ratio of  to  and the output of pLlow(lLlow), where  is the output of MIN{, 1-} and  is the output of MAX{, 1-}. The confidence scores of two allele candidates were calculated by

 

(4)

 

In the final results, an allele candidate from the predicted genotype is removed when its confidence score is lower than given cutoff values (we set 0.35 for Lhigh and 0.25 for Llow). When a confidence score of Llow is lower than the cutoff value, GenoTan generates a partial genotyping call for the locus in which only one allele is called while the other allele is reported as unknown. It only reports the genotype of the locus as homozygous when the number of reads supporting the selected allele is more than 4 and its confidence score is equal to or higher than 0.9. The confidence score of the second allele, Lhigh2, at a homozygous locus is calculated by

 

(5)

 

where [0.5n] represents the probability of that another unobserved allele exists when n reads support the selected allele.

 

Getting started

GenoTan is available at the Download page and is distributed with three additional PERL programs including makeListFromTRF.pl, searchTandemRepeats.pl and realignReadsOnRepeats.pl.

 

Installation

The current release requires GSL (http://www.gnu.org/software/gsl, developed with GSL 1.14) and runs in the LINUX environment.

 

$ tar xzf genotan.[VERSION].tar.gz

$ cd  genotan.[VERSION]

 

If the GSL library was installed in a personal directory, edit the 'Makefile' file in the directory.

Ex) If the GSL library was installed in the '$(HOME)/opt' directory,

 

INCLUDES= -I. -I$(HOME)/opt/include

LIBPATH= -L$(HOME)/opt/lib

 

Change '$(HOME)/opt/include' to a full path of a directory containing GSL headers and '$(HOME)/opt/lib' to a full path of a directory containing GSL libs.

Then, type 'make' to compile it.

 

$ make

 

Searching microsatellites

We provide two methods to search microsatellite sequences. The first method is to use TRF (Tandem Repeat Finder, http://tandem.bu.edu/trf/trf.html) which searches imperfect repeats of short motifs from reference sequences. The second method is to search perfect repeats of short motifs from reference sequences directly.

 

1. Using TRF

To convert the output of TRF to the input file of GenoTan, we provide makeListFromTRF.pl.

 

$ trf  reference.fa 2 7 5 80 10 14 6 -h

$ [GenoTan_path]/makeListFromTRF.pl -r  reference.fa  -t  reference.fa.2.7.5.80.10.14.6.dat  -o  mic.lst

 

2. Searching repeats of short motifs

We also provide searchTandemRepeats.pl to search repeats of short motifs from reference sequences directly.

 

$ [GenoTan_path]/searchTandemRepeats.pl  reference.fa  -o  mic.lst

 

Realignment of sequence reads mapped to the microsatellite loci

Misalignment of sequence reads mapped to the repeat sequences is frequently observed. Since GenoTan does not have a process to realign sequence reads, its results significantly rely on the alignment quality. To reduce the misalignment problem we provide realignReadsOnRepeats.pl which is a realignment program for the sequence reads.

 

$ [GenoTan_path]/realignReadsOnRepeats.pl  -r  reference.fa  -m  mic.lst  -s  input.sam  -o  realigned.sam

 

The sequence reads can be also realigned by GATK (Genome Analysis Toolkit, http://www.broadinstitute.org/gsa/wiki).

 

$ cat mic.lst | perl -ne '@arr=split/\t/; print "$arr[0]:$arr[1]-".($arr[1]+$arr[2]-1)."\n";' > mic.intervals

$ java -jar GenomeAnalysisTK.jar  -T  IndelRealigner  -R  reference.fa  -targetIntervals  mic.intervals  -I  input.bam
-o realigned.bam

 

 

Run GenoTan

 

$ [GenoTan_path]/genotan  -m  mic.lst   realigned.sam  -o  genotype.txt

 

Usage : genotan -m <microsatellite list file> <sam or bam file> [-o <output file>] [-L off]
[-C <0.01-0.99>] [-c <0.01-0.99>] [-q <0-255 >]

 

1. Input files

GenoTan takes a microsatellite list file (which requires '-m') and a SAM or BAM file containing mapped reads. The microsatellite list file has the following format.

 

[chromosome], [starting position], [length], [motif], [sequence] in a tab-delimited format

 

Ex)

chr1                26454    12           GT          GTGTGTGTGTGT

chr1                28589    15           T             TTTTTTTTTTTTTTT

chr1                44836    32           AAAT    AAATAAATAAATAAATAAATAAATAAATAAAT

chr1                722365  20           ATTT     ATTTATTTATTTATTTATTT

 

2. Other options

-L off : Turn off the normalization of read counts using the allele lengths (default : on)

-C FLOAT [0.01-0.99] : If a confidence score of a higher read frequency allele at a locus is lower than the cutoff value, a genotype of the locus will not be reported. (default : 0.35)

-c FLOAT [0.01-0.99] : If a confidence score of a lower read frequency allele at a locus is lower than the cutoff value, the allele will not be reported. (default : 0.25)

-q INT [0-255] : if a mapping quality of a read is lower than the cutoff value, the read will not be used. (default : 5)

-O INT [0: no sequence, 1: sequence output]: It decides the output style. (default : 1)

 

3. Output file

Ex)

chr2

20243

20253

11

11(0:0)

-1(0:0)

0.554

0

AAAAAAAAAAA

?

chr2

23247

23259

13

13(0:0)

13(0:0)

0.958

0.943

AATAAATAAATAA

-

chr2

25281

25294

14

14(0:0)

14(0:0)

0.977

0.971

AAATAAATAAATAA

-

chr2

26216

26223

8

8(0:0)

8(0:0)

0.962

0.895

TTTTTTTT

-

chr2

29366

29375

10

10(0:0)

9(0:0)

0.731

0.420

AAAAAAAAAA

AAAAAAAAA

chr2

29462

29471

10

10(0:0)

-1(0:0)

0.866

0

AAAAAAAAAA

?

chr2

38985

39001

17

-1(0:0)

-1(0:0)

0

0

?

?

 

 

The output file of GenoTan is a tab-delimited file. Each line consists of

 

Col.

Field

Description

1

chr.

Reference sequence name

2

start

Starting position of the locus in the reference sequence

3

end

Ending position of the locus in the reference sequence

4

ref. len.

Length of the locus in the reference sequence

5

len. allele 1

(ext. del. in

flanking left:right)

Length of allele 1 (higher read frequency allele candidate) with additional deletions in flanking sequences (left:right). See below.

6

len. allele 2

(ext. del. in

flanking left:right)

Length of allele 2 (lower read frequency allele candidate) with additional deletions in flanking sequences

7

score of allele 1

Confidence score of allele 1. See 'Decision of genotyping'

8

score of allele 2

Confidence score of allele 2

9

seq. of allele 1

Sequence of allele 1. If the allele is not identified, '?' is displayed.

10

seq. of allele 2

Sequence of allele 2. If the sequence of the allele is the same as allele 1, '-' is displayed.

 

'ext. del. in flanking left:right' : Lengths of deletions adjoined to the microsatellite locus. For example :

 

reference :    ATTCGAGGGATATATATATATCAAGTAC

read 1 :       ATTCGAG----ATATATATATCAAG

read 2 :          CGAG----ATATATATATCAAGTAC

 

In this example, the length of the microsatellite in the reference and the length of the allele candidate in the reads are 12 bases and 10 bases respectively. But there are two additional deleted bases in the left flanking sequence of the microsatellite. To separate the deletions in the repeat sequence from that in the flanking sequences, the allele is displayed as '10(2:0)' in the output file.

 

4. Test with simulated data

The simulated data is available online [Download]. It was created from human genome data with artificial microsatellite sequences of 1~8-mer motifs. Repeat sequences in sequence reads for the microsatellite loci were generated by the Binomial random function (p=0.5) and two Gaussian random functions. The inherited alleles (artificial) for each locus are in the 'test.allele.lst' file. There are three different BAM files and each file contains reads mapped to the microsatellite loci in different coverage.

 

$ [GenoTan_path]/genotan  -m  test.mic.lst  test.cov.10.bam  -L off  -o  genotype.cov.10.txt