Decoding the Blueprint of Life for Healthier Future

Computational Methods for Gene Prediction in Eukaryotes

Gene prediction is the identification of protein-coding regions in DNA nucleotide sequences. In eukaryotes it involves detecting open reading frames (ORFs), identifying splice sites, and separating exons from introns. As the genomes of more and more organisms are sequenced, coding regions have to be located computationally within the raw sequence. Gene prediction therefore lets us analyze and annotate large, uncharacterized genomic sequences.

Eukaryotic genome 

Eukaryotic genomes are larger than prokaryotic genomes, ranging in length from about 10 Mbp to 670 Gbp. Their gene density is very low: coding sequence (exons) makes up far less of the genome than non-coding sequence (introns and intergenic regions). This makes computational gene prediction harder than in prokaryotes. Because eukaryotic genes are interrupted by introns, more complex prediction methods are required. Moreover, through alternative splicing the exons of a gene can be joined in different combinations, giving rise to several protein isoforms. One helpful feature is that splice junctions can be recognized using the GT-AG rule: an intron almost always begins with GT and ends with AG. The 5’ end of the intron has the conserved sequence GTAAGT, and the 3’ end has the conserved sequence (Py)12NCAG, that is, a run of pyrimidines followed by any nucleotide and then CAG.
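The GT-AG rule can be illustrated with a short Python sketch. It is only a minimal illustration, not how any particular gene finder works: the function name `candidate_introns` and the length limits are assumptions made for this example, and real programs score the whole consensus around each site instead of matching two letters.

```python
import re

def candidate_introns(seq, min_len=20, max_len=10000):
    """Return (start, end) pairs of stretches that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive. The length limits are
    illustrative assumptions, not biological constants.
    """
    seq = seq.upper()
    # Lookahead matches let us find every GT/AG, including overlapping ones.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    pairs = []
    for d in donors:
        for a in acceptor_ends:
            if min_len <= a - d <= max_len:
                pairs.append((d, a))
    return pairs

# Toy sequence: exon + intron (GTAAGT ... pyrimidine run ... CAG) + exon
toy = "ATGGCC" + "GTAAGT" + "T" * 12 + "CAG" + "GGATAA"
print(candidate_introns(toy))  # [(6, 27)]
```

On real genomic sequence this simple scan returns a very large number of candidates, which is why splice sites are combined with other evidence such as coding scores.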

Programs for computational gene prediction

Gene prediction programs use three types of approaches:

  • Ab initio 

  • Similarity based

  • Consensus based

Ab initio method

The ab initio method first distinguishes coding sequences (exons) from non-coding sequences (introns) and then joins the exons in the correct order to reconstruct the complete coding sequence. To locate coding sequence, it looks for start codons, stop codons and splice sites. The distribution of nucleotides also gives an indication of where a coding region lies, for example hexamer frequencies (the non-random occurrence of particular six-nucleotide words in coding regions).
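The idea behind hexamer frequencies can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names, the pseudocount and the toy training sequences are invented for the example, and the reading frame is ignored, whereas real programs typically use frame-specific tables built from large training sets.

```python
import math
from collections import Counter

def hexamer_counts(seqs):
    """Count every overlapping six-nucleotide word in a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts

def coding_score(window, coding, noncoding, pseudo=1.0):
    """Sum of log-odds (coding vs. non-coding) over the hexamers in a window.

    A positive score means the window looks more like the coding training set.
    The pseudocount avoids division by zero for unseen hexamers.
    """
    c_total = sum(coding.values()) + pseudo * 4 ** 6
    n_total = sum(noncoding.values()) + pseudo * 4 ** 6
    window = window.upper()
    score = 0.0
    for i in range(len(window) - 5):
        h = window[i:i + 6]
        p_coding = (coding[h] + pseudo) / c_total
        p_noncoding = (noncoding[h] + pseudo) / n_total
        score += math.log(p_coding / p_noncoding)
    return score

# Toy training sets (hypothetical, far too small for real use)
coding = hexamer_counts(["ATGGCCGCCGCCGCC" * 3])
noncoding = hexamer_counts(["ATATATATATATATAT" * 3])
print(coding_score("GCCGCCGCC", coding, noncoding) > 0)  # True
```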

Models used for ab initio method

Ab initio programs achieve this using different models:

  1. Neural network 

A neural network is first trained on known sequences so that it learns to recognize patterns, much as the human nervous system learns from experience, which is where the name comes from. It consists of several layers. The first is the input layer, which receives the input sequence. Next come the inner (hidden) layers, which process the input using what was learned during training. The last is the output layer, which gives the prediction of the coding region.

  • GRAIL is a program based on this model.
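The three-layer structure described above can be sketched as a single forward pass in plain Python. This is not GRAIL's architecture: the window length, the layer sizes, the one-hot input encoding and the random weights are all assumptions made for illustration, and training is omitted, so the output value carries no biological meaning.

```python
import math
import random

BASES = "ACGT"

def one_hot(window):
    """Input layer: encode each nucleotide as four 0/1 values."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Random weights; a real network would learn these from known genes."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def forward(window, hidden_w, output_w):
    """Pass a sequence window through the hidden layer to a single output."""
    x = one_hot(window)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in hidden_w]
    # Output layer: one value between 0 and 1, read as "probability of coding"
    return sigmoid(sum(w * h for w, h in zip(output_w[0], hidden)))

rng = random.Random(0)
hidden_w = init_layer(4 * 9, 5, rng)   # 9-nucleotide window, 5 hidden units
output_w = init_layer(5, 1, rng)
print(forward("ATGGCCAAG", hidden_w, output_w))
```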

  2. Discriminant analysis

Discriminant analysis can be pictured as a two-dimensional plot with candidate 3’ splice sites on the x-axis and coding scores on the y-axis. Coding regions are separated from non-coding ones by a straight line in linear discriminant analysis (LDA) and by a curved line in quadratic discriminant analysis (QDA). The boundary is learned from a training set of known gene sequences and is then used to find the most likely coding regions in unknown sequences.

  • FGENES and MZEF use LDA and QDA, respectively, to predict coding regions.
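The "straight line" of LDA can be made concrete with a small two-feature example in Python. It is a sketch under assumptions, not the procedure of any program named above: the two features (a splice-site score and a coding score), the training points and the equal class priors are all hypothetical.

```python
def mean(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(2)]

def pooled_cov(groups):
    """Pooled within-class 2x2 covariance matrix (shared by both classes in LDA)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    n = 0
    for vecs in groups:
        m = mean(vecs)
        for v in vecs:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        n += len(vecs)
    return [[s[i][j] / (n - len(groups)) for j in range(2)] for i in range(2)]

def fit_lda(coding, noncoding):
    """Return (w, b) so that w.x + b > 0 on the coding side of the line."""
    mc, mn = mean(coding), mean(noncoding)
    s = pooled_cov([coding, noncoding])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [mc[0] - mn[0], mc[1] - mn[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(mc[0] + mn[0]) / 2, (mc[1] + mn[1]) / 2]  # equal priors assumed
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def is_coding(x, w, b):
    return w[0] * x[0] + w[1] * x[1] + b > 0

# Hypothetical training points: (splice-site score, coding score)
coding_pts = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.7), (0.85, 0.95)]
noncoding_pts = [(0.2, 0.1), (0.3, 0.25), (0.1, 0.3), (0.25, 0.15)]
w, b = fit_lda(coding_pts, noncoding_pts)
print(is_coding((0.8, 0.8), w, b), is_coding((0.2, 0.2), w, b))  # True False
```

QDA differs in that each class keeps its own covariance matrix, which is what bends the boundary into a curve.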

  3. Hidden Markov Model

A hidden Markov model (HMM) uses transition probabilities and emission probabilities to predict the coding regions in a sequence. A simple example shows what these are. Suppose you find a piece of paper with the word “HPT” written on it. This is not an English word, so we can hypothesize that the writer intended either HOT or HAT. HOT is the more probable, because on a keyboard P is much closer to O than to A. An HMM reasons about genes in the same way. The intended letters correspond to the hidden states (or simply states), which here are introns, exons and splice sites. The mistyped letters correspond to the observations, which here are the nucleotides of the target sequence. The probability that one state is followed by another is the transition probability. The probability that a given state produces a particular observed symbol is the emission probability. Most programs use a more comprehensive form, the generalized HMM (GHMM), in which a state can emit a segment of variable length instead of a single nucleotide, so that the lengths of exons and introns can be modeled explicitly.

  • GENSCAN and HMMgene are programs that use HMMs to make gene predictions.
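How transition and emission probabilities combine to label a sequence can be shown with the Viterbi algorithm on a toy two-state HMM. Everything numeric here is an assumption made for illustration: the model has only "exon" and "intron" states, the probabilities are invented (GC-rich exons, AT-rich introns), and real gene finders use many more states, including splice sites.

```python
import math

STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
# Transition probabilities: how likely one state is to follow another
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
# Emission probabilities: how likely each state is to produce each nucleotide
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA sequence."""
    seq = seq.upper()
    # scores[s]: best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCGCGC" + "ATATATATATAT"))
```

With these invented parameters, the GC-rich first half is labeled as exon and the AT-rich second half as intron, because a single state switch costs less than emitting twelve unlikely nucleotides.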

Similarity based

This is an empirical gene-finding approach in which coding regions are identified by comparing the query sequence with expressed sequence tags (ESTs) or with gene and protein sequences already stored in databases. ESTs are short stretches of complementary DNA (cDNA) obtained by reverse transcription of messenger RNA. The method always requires a homolog in the database: coding regions in the input sequence that are not similar to any stored sequence cannot be identified.

  • The SGP-1 and GenomeScan algorithms are based on this method.
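The principle of locating exons by similarity to an EST can be sketched with exact k-mer matching in Python. This is a crude stand-in for a real sequence alignment, written only to show the idea: the function name `matching_regions`, the word size `k` and the toy sequences are assumptions, and mismatches or gaps are not tolerated.

```python
def matching_regions(query, est, k=8):
    """Return (start, end) regions of the query covered by k-mers found in the EST.

    Coordinates are 0-based and end-exclusive. Overlapping or adjacent hits
    are merged into one region.
    """
    query, est = query.upper(), est.upper()
    est_kmers = {est[i:i + k] for i in range(len(est) - k + 1)}
    regions = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in est_kmers:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + k      # extend the current region
            else:
                regions.append([i, i + k])  # start a new region
    return [tuple(r) for r in regions]

# Toy genomic query: exon 1 + intron + exon 2; the EST lacks the intron
exon1 = "ATGGCCAAGCTGACC"
intron = "GTAAGT" + "T" * 12 + "CAG"
exon2 = "CATGACGAGTTCTAA"
print(matching_regions(exon1 + intron + exon2, exon1 + exon2))
# [(0, 15), (36, 51)]
```

If the EST shared no k-mer with the query, the function would return an empty list, which mirrors the limitation noted above: without a homolog in the database, nothing is found.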

Consensus based

This method combines the results of multiple programs. Because every program has its own specificity and sensitivity, the results of the different programs are compared: predictions that are consistent across programs are kept, and those that most programs do not make are ignored. This improves prediction quality and reduces false positives.

  • DIGIT is a consensus-based algorithm that combines the results of FGENESH, GENSCAN and HMMgene.
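The consensus idea can be sketched as a per-nucleotide majority vote over the exon predictions of several programs. This is not the procedure DIGIT itself uses; it is a minimal illustration in which the function name `consensus_exons`, the vote threshold and the example coordinates are all assumptions.

```python
def consensus_exons(predictions, seq_len, min_votes=2):
    """Keep the regions that at least `min_votes` programs call exonic.

    `predictions` holds one list of (start, end) exon intervals per program,
    with 0-based, end-exclusive coordinates.
    """
    votes = [0] * seq_len
    for exons in predictions:
        covered = set()  # each program votes at most once per position
        for start, end in exons:
            covered.update(range(max(start, 0), min(end, seq_len)))
        for pos in covered:
            votes[pos] += 1
    regions, start = [], None
    for pos, v in enumerate(votes):
        if v >= min_votes and start is None:
            start = pos
        elif v < min_votes and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, seq_len))
    return regions

# Hypothetical output of three programs on a 300-nucleotide sequence
program_a = [(10, 50), (100, 150)]
program_b = [(12, 50), (100, 148)]
program_c = [(10, 48), (200, 230)]
print(consensus_exons([program_a, program_b, program_c], 300))
# [(10, 50), (100, 148)]
```

The exon at 200-230 is predicted by only one program and is dropped, which is how a vote of this kind removes likely false positives.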

Conclusion:

Various computational methods have been developed for gene prediction in eukaryotes. These programs help researchers work with complex genome structure and function. Every program has its own strengths and limitations, so it is important to select the program best suited to your work, and combining different methods gives more accurate gene predictions. Growing knowledge in computational biology continues to lead to more precise algorithms for gene annotation.

 

References

  • Stormo, G. D. (2000). Gene-finding approaches for eukaryotes. Genome Research, 10(4), 394-397.
  • Mathé, C., Sagot, M. F., Schiex, T., & Rouzé, P. (2002). Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research, 30(19), 4103-4117.
  • Wang, Z., Chen, Y., & Li, Y. (2004). A brief review of computational gene prediction methods. Genomics, Proteomics and Bioinformatics, 2(4), 216-221.
  • Brent, M. R. (2007). How does eukaryotic gene prediction work? Nature Biotechnology, 25(8), 883-885.

 

Where biomolecules meet bytes, I find my curiosity sparked. As a biochemistry student at the University of Punjab, Lahore, navigating the exciting field of bioinformatics, I will share my learning with you.
