Gene prediction is the identification of coding regions in DNA nucleotide sequences. Each coding region codes for a specific protein. Gene prediction involves detecting open reading frames (ORFs), identifying splice sites, and separating exons from introns. As the genomes of more and more organisms are sequenced, coding regions need to be predicted from the whole sequence. In this way, gene prediction enables us to analyze and annotate large, uncharacterized genomic sequences.
Eukaryotic genome
Eukaryotic genomes are larger than prokaryotic genomes, ranging in length from about 10 Mbp to 670 Gbp. They have a very low gene density, meaning that coding regions (exons) make up much less of the genome than non-coding regions (introns and intergenic DNA). Predicting genes computationally in eukaryotes is difficult for several reasons. Because eukaryotic genes are interrupted by introns, they require more complex gene prediction methods than prokaryotic genes. Moreover, through alternative splicing, the exons of one gene can be joined in different combinations, leading to various protein isoforms. One helpful feature of eukaryotic genes is that splice junctions can be identified using the GT-AG rule: the 5’ end of an intron has the conserved sequence GTAAGT, and the 3’ end has the conserved sequence (Py)12NCAG.
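The GT-AG rule can be turned into a simple scan for candidate introns. The following is a minimal sketch in Python, assuming the strict consensus sequences quoted above; real splice sites deviate from the consensus, so actual gene finders score sites probabilistically instead of matching exact patterns. The function name `candidate_introns`, the `min_len` parameter, and the toy sequence are illustrative assumptions.

```python
import re

# Consensus patterns taken from the text. Real splice sites vary, so these
# strict patterns are for illustration only.
DONOR = re.compile(r"(?=GTAAGT)")                # 5' end of an intron
ACCEPTOR = re.compile(r"(?=[CT]{12}[ACGT]CAG)")  # (Py)12 N CAG at the 3' end
ACCEPTOR_LEN = 16                                # 12 pyrimidines + N + CAG


def candidate_introns(seq, min_len=20):
    """Return (start, end) pairs, 0-based and end-exclusive, that begin with
    the donor consensus and end with the acceptor consensus."""
    seq = seq.upper()
    donors = [m.start() for m in DONOR.finditer(seq)]
    # An intron ends right after the AG of the acceptor pattern.
    acceptors = [m.start() + ACCEPTOR_LEN for m in ACCEPTOR.finditer(seq)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Hypothetical toy gene: two short exons around one intron.
exon1, exon2 = "ATGGCC", "GGATAA"
intron = "GTAAGT" + "AAAA" + "CTCTCTCTCTCT" + "A" + "CAG"
seq = exon1 + intron + exon2

introns = candidate_introns(seq)      # [(6, 32)]
start, end = introns[0]
spliced = seq[:start] + seq[end:]     # exons joined after removing the intron
```

Every donor is paired with every downstream acceptor, so on a real sequence this produces many false candidates. That is why the methods described below combine splice signals with other evidence.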
Programs for computational gene prediction
Programs that perform gene prediction use three types of approaches:
Ab initio
Similarity based
Consensus based
Ab initio method
The ab initio method works by first separating coding sequences (exons) from non-coding sequences (introns) and then joining the exons in the correct order to reconstruct the functional coding sequence. To locate coding sequences, it looks for start and stop codons and splice sites. Characteristic nucleotide distributions, such as hexamer frequencies (the non-random occurrence of six-nucleotide words in coding regions), also help indicate where coding regions lie.
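Two of the signals mentioned above, start/stop codons and hexamer frequencies, are easy to compute. The sketch below is a simplified illustration: it scans only the forward strand, ignores introns, and uses hypothetical function names (`find_orfs`, `hexamer_counts`) and a hypothetical `min_codons` threshold. A real ab initio program would compare hexamer counts against frequencies learned from known coding and non-coding sequences.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop run on the forward strand.
    Coordinates are 0-based and end-exclusive, and include the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this run
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


def hexamer_counts(seq):
    """Count overlapping six-nucleotide words in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))


orfs = find_orfs("CCATGAAATTTGGGTAACC")    # [(2, 17, 2)]
counts = hexamer_counts("ATGATGATG")       # "ATGATG" occurs twice
```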
Models used for ab initio method
The ab initio method achieves its goal using different models:
1. Neural network
This model is first trained to recognize patterns in sequences, much as the human nervous system learns from experience; this resemblance is why it is called a neural network. It consists of several layers. The first is the input layer, which receives the input sequence. Next come one or more inner (hidden) layers, which process the input using what the network learned during training. The final layer is the output layer, which predicts coding regions based on that training.
GRAIL is a program based on this model.
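The layered structure can be illustrated with a tiny feed-forward network. This is only a sketch of the general idea, not how GRAIL is implemented: the window size, the one-hot encoding, the layer sizes, and all function names are assumptions, and the weights are random placeholders where a real predictor would use weights learned from annotated genes.

```python
import math
import random

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def one_hot(window):
    """Input layer: encode each base as four 0/1 values."""
    return [bit for base in window.upper() for bit in ONE_HOT[base]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_network(n_inputs, n_hidden, seed=0):
    """Untrained placeholder weights; training would replace these."""
    rng = random.Random(seed)
    hidden_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    output_w = [[rng.uniform(-1, 1) for _ in range(n_hidden)]]
    return hidden_w, [0.0] * n_hidden, output_w, [0.0]


def coding_score(window, network):
    """Pass a sequence window through the hidden layer and the output layer."""
    hidden_w, hidden_b, output_w, output_b = network
    hidden = layer(one_hot(window), hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)[0]   # value between 0 and 1


network = random_network(n_inputs=4 * 6, n_hidden=3)
score = coding_score("ATGGCC", network)
```

With trained weights, a score near 1 would be read as "coding" and a score near 0 as "non-coding". With the random weights used here the score carries no biological meaning.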
2. Discriminant analysis
It works by plotting candidate 3’ splice site positions on the x-axis against coding scores on the y-axis. It separates coding regions from non-coding ones by drawing a straight line in linear discriminant analysis (LDA) or a curved line in quadratic discriminant analysis (QDA). The boundary is learned from a training set of known gene sequences and is then used to find the most likely coding regions in unknown sequences.
FGENES and MZEF use LDA and QDA, respectively, to predict coding regions.
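To make the "straight line" of LDA concrete, here is a minimal two-feature sketch. It assumes each candidate region is described by a pair of scores (a splice site score and a coding score), uses the standard Fisher formula with equal class priors, and is fitted on made-up training points. The feature choice, the data, and the function names are illustrative assumptions and do not reproduce what any of the programs named above actually computes.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def pooled_covariance(groups):
    """Pooled within-class 2x2 covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for points in groups:
        mx, my = mean(points)
        for x, y in points:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
        dof += len(points) - 1
    return sxx / dof, sxy / dof, syy / dof


def fit_lda(coding, noncoding):
    """Return (w, c) such that w[0]*x + w[1]*y > c predicts 'coding'."""
    m1, m0 = mean(coding), mean(noncoding)
    sxx, sxy, syy = pooled_covariance([coding, noncoding])
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix
    w = [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]
    # Threshold halfway between the class means (equal priors assumed)
    c = w[0] * (m1[0] + m0[0]) / 2 + w[1] * (m1[1] + m0[1]) / 2
    return w, c


def is_coding(point, w, c):
    return w[0] * point[0] + w[1] * point[1] > c


# Hypothetical training data: (splice site score, coding score)
coding = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.7), (0.6, 0.9)]
noncoding = [(0.2, 0.1), (0.3, 0.3), (0.1, 0.2), (0.4, 0.1)]
w, c = fit_lda(coding, noncoding)
```

The line `w[0]*x + w[1]*y = c` is the separating boundary. QDA would estimate a separate covariance matrix for each class, which makes the boundary curved.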
3. Hidden Markov model
A hidden Markov model (HMM) uses transition probabilities and emission probabilities to predict coding regions in a sequence. A simple example shows what these are. Suppose you find a piece of paper with the word “HPT” typed on it. This is not an English word, so we can hypothesize that the writer intended either HOT or HAT. HOT is the more probable of the two, because P is closer to O than to A on a keyboard. An HMM applies the same probabilistic reasoning to gene prediction. The intended letters correspond to hidden states (or simply states), which here are introns, exons, and splice sites, and the mistyped letters correspond to the observations (the target sequence). The probability that one state is followed by another is the transition probability. The probability that a given state produces a particular observed symbol is the emission probability. Most programs use a more comprehensive form, the generalized HMM (GHMM), in which a state can emit a segment of variable length instead of a single nucleotide.
GENSCAN and HMMgene are programs that use HMMs to make gene predictions.
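The way transition and emission probabilities combine can be sketched with the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. The two-state model below (exon and intron only) and all of its probabilities are invented for illustration: exons are assumed to be GC-rich and introns AT-rich. Real gene finders use many more states and trained parameters.

```python
import math

STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
# Transition probabilities: how likely one state is to follow another.
TRANSITION = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
# Emission probabilities: how likely each state is to produce each base.
EMISSION = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    seq = seq.upper()
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMISSION[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        pointers, scores = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANSITION[p][s]))
            pointers[s] = prev
            scores[s] = (best[prev] + math.log(TRANSITION[prev][s])
                         + math.log(EMISSION[s][base]))
        back.append(pointers)
        best = scores
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("GCGCGCATATAT")   # six "exon" labels followed by six "intron" labels
```

Log-probabilities are used so that long sequences do not underflow to zero when many small probabilities are multiplied.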
Similarity-based method
This is an empirical gene-finding approach in which coding regions are identified by comparing a query sequence with expressed sequence tags (ESTs) or with protein and gene sequences already stored in databases. ESTs are short sequences read from complementary DNA (cDNA), which is reverse-transcribed from messenger RNA. This method always requires a homolog in the database: without one, it cannot identify coding regions in the input sequence.
The SGP-1 and GenomeScan algorithms are based on this method.
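The dependence on a database homolog can be shown with a crude similarity search. The sketch below counts shared k-mers between a query and each database entry; this is a simplified stand-in for the alignment-based searches that real tools perform, and the function names, the value of `k`, and the toy "database" are all assumptions.

```python
def kmers(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query, database, k=8):
    """Return (name, shared_kmers) for the entry sharing the most k-mers with
    the query, or None when no entry shares any k-mer."""
    query_kmers = kmers(query.upper(), k)
    hits = {name: len(query_kmers & kmers(seq.upper(), k))
            for name, seq in database.items()}
    name = max(hits, key=hits.get, default=None)
    if name is None or hits[name] == 0:
        return None
    return name, hits[name]


# Hypothetical database of known expressed sequences
database = {
    "est_1": "ATGGCCAAGCTGACCGAT",
    "est_2": "TTTTTTTTTTTTTTTT",
}

hit = best_hit("CCCATGGCCAAGCTGGGG", database)   # ("est_1", 5)
no_hit = best_hit("ACACACACACAC", database)      # None: no homolog in the database
```

The second query returns nothing even if it were a real gene, which is exactly the limitation described above.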
Consensus-based method
This method combines the results of multiple programs. Because every program has its own specificity and sensitivity, it compares their results, keeps the predictions on which they agree, and discards those that most programs do not make. In this way it improves prediction quality and removes false positives.
DIGIT is a consensus-based algorithm that combines the results of FGENESH, GENSCAN, and HMMgene.
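The idea of keeping only consistent predictions can be sketched as a per-base majority vote. This is a generic illustration and not DIGIT's actual procedure; the function name, the `min_votes` parameter, and the three prediction tracks are made up.

```python
def consensus(predictions, min_votes=None):
    """Combine per-base labels (1 = coding, 0 = non-coding) from several
    programs. A base is called coding when at least min_votes programs
    agree; the default is a strict majority."""
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    return [1 if sum(column) >= min_votes else 0 for column in zip(*predictions)]


# Hypothetical per-base predictions from three programs for a 7-base region
program_a = [0, 1, 1, 1, 0, 0, 1]
program_b = [0, 1, 1, 0, 0, 0, 0]
program_c = [0, 0, 1, 1, 0, 1, 0]

combined = consensus([program_a, program_b, program_c])   # [0, 1, 1, 1, 0, 0, 0]
```

Bases called by only one program (the last two positions) are dropped as likely false positives. Raising `min_votes` to the number of programs would keep only unanimous predictions, trading sensitivity for specificity.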
Conclusion:
Various computational methods have been developed for predicting genes in eukaryotes. These programs help researchers work with complex genome structure and function. Every program has its own strengths and limitations, so it is necessary to select the most suitable program for your work. Combining different methods results in more accurate gene prediction. Growing knowledge in computational biology is leading to more precise algorithms for gene annotation.