Elsevier

Neurocomputing

Volume 71, Issues 4–6, January 2008, Pages 538-543

An artificial neural network method for combining gene prediction based on equitable weights

https://doi.org/10.1016/j.neucom.2007.07.019

Abstract

Gene prediction remains an important step in genome annotation. In this paper, we propose a novel method for recognizing genes in genomes. The method combines three well-known gene-finding programs. After the accuracy parameters are calculated, an equitable weight for each parameter is computed using a genetic algorithm, and an integrative evaluation is then performed. The integrative evaluation is used to guide the training of an artificial neural network. The simulation results show that the proposed method integrates the advantages of the three programs and that the accuracy is clearly improved, which indicates that the proposed method has a powerful capability for gene prediction.

Introduction

Millions of genomes of many organisms have been sequenced over the last few years. However, biological annotation is not keeping pace with this avalanche of raw sequence data. There is a real need for accurate and fast tools to analyze these sequences and, especially, to analyze the genes and determine their functions. Tools for annotating the content of these genomes are more important than ever, and a great number of gene-finding programs have been developed for annotating newly sequenced genomes. Gene finding is an essential step in genome annotation. Many gene-finding programs are claimed to possess high prediction accuracy, and many researchers believe that the available tools are good enough for practical use. However, this is often not the case. In general, gene-finding tools have two major problems: first, the high prediction accuracy holds only for a specific domain; second, accuracy is high at the nucleotide level but low at the exon level. For example, the original genome annotation of the hyperthermophilic archaeon Pyrococcus furiosus contained 2065 open reading frames (ORFs). The genome was subsequently automatically annotated in two public databases by the Institute for Genomic Research (TIGR) and the National Center for Biotechnology Information (NCBI). Remarkably, more than 500 of the originally annotated ORFs differ in size between the two databases, many of them very significantly. More than 170 of the predicted proteins differ at their N termini by more than 25 amino acids. Similar discrepancies were observed in the TIGR and NCBI databases for the other archaeal and bacterial genomes examined [1].

Therefore, more accurate gene prediction methods are still needed. In this paper, we propose a method for gene prediction that uses a neural network to combine three well-known gene-finding programs. The network is trained using an integrative evaluation, and a genetic algorithm is used to compute the equitable weights on which this evaluation is based.

Over the past 20 years, many gene-finding programs have been developed. The major existing methods can be divided into three classes. The first class verifies the predictions of these programs by searching for homologues in databases; however, it is already known that about 50% of newly discovered genes have no homologues in the protein sequence databases [9], [10]. The second class is based on pattern recognition methods such as artificial neural networks [11], [12], discriminant analysis [13], [14], and hidden Markov models [15], [16], [17], [18]. The third class relies on other methods, such as ZCURVE [19], which uses a geometric approach.

Some methods already exist that combine the predictions of several gene-finding programs. Murakami and Takagi [2] proposed five combination methods, namely the AND, OR, HIGHEST, RULE, and BOUNDARY methods, to integrate the predictions of FEXH, GeneParser3, GENSCAN, and GRAIL2. These five methods are simple and ad hoc, and cannot accommodate the correlations between programs and between adjacent nucleotides. Rogic et al. [3] proposed three methods for combining the predictions of GENSCAN and HMMgene. They focused on improving exon-level accuracy by taking the union or intersection of predicted exon regions while considering probabilistic scores and reading frame consistency. The accuracy improvement on a newly assembled dataset is 7.9% over the single best program. Nevertheless, these methods are also rule-based and cannot model complex correlations among programs and adjacent nucleotides. Pavlovic et al. [4] provided a full Bayesian framework and adopted hidden input/output Markov models for combining the gene predictions produced by a set of program experts. Prior observations of the programs' predictions can be used to train the Bayesian network, which models the correlations between programs and adjacent nucleotides.
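To make the rule-based style of combination concrete, the following is a minimal Python sketch of an intersection-style (AND) and a union-style (OR) combiner. It assumes that each program's output has been reduced to a per-nucleotide label (1 for coding, 0 for non-coding); the function names and the toy label vectors are hypothetical and are not taken from [2] or [3], whose methods work on predicted exon regions and use additional information such as scores and reading frames.

```python
def combine_and(predictions):
    """Mark a position as coding only if every program marks it as coding."""
    return [int(all(column)) for column in zip(*predictions)]


def combine_or(predictions):
    """Mark a position as coding if at least one program marks it as coding."""
    return [int(any(column)) for column in zip(*predictions)]


# Toy per-nucleotide labels from three hypothetical programs (assumed data).
program_a = [0, 1, 1, 1, 0, 0]
program_b = [0, 0, 1, 1, 1, 0]
program_c = [0, 1, 1, 0, 1, 0]

programs = [program_a, program_b, program_c]
print(combine_and(programs))  # [0, 0, 1, 0, 0, 0]
print(combine_or(programs))   # [0, 1, 1, 1, 1, 0]
```

Each position is decided independently and every program is treated identically, which illustrates why such rules cannot capture correlations between programs or between adjacent nucleotides.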

Methodologies

The motivation of the proposed method is to predict genes in genomes with high accuracy. Using an artificial neural network to combine three current gene prediction programs rests on the idea that none of these programs is 100% accurate and that some of them use different information in their predictions; integrating them effectively could therefore lead to a more powerful tool that takes advantage of the strengths of the different programs. The question is how to

Experimental results

We use the HMR195 dataset, the E. coli K12 genome, and the Arabidopsis thaliana genome to train and test the prediction results of these tools. The simulation results are shown in Table 1, Table 2, and Table 3. The tables show that the average accuracies on Sn and Sp increase by about 2.3% at the nucleotide level, and that the accuracies on both ESn and ESp increase by 6% at the exon level, in each case compared with the maximum accuracy obtained by the three individual programs. Considering that the
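As an illustration of the accuracy measures named above, the following Python sketch computes Sn and Sp at the nucleotide level and ESn and ESp at the exon level. It assumes the definitions conventionally used in gene-finding evaluations (Sn = TP / (TP + FN), Sp = TP / (TP + FP), and an exon counted as correct only when both boundaries match exactly); this excerpt does not state the definitions, so they may differ in detail from those used in the paper. The function names and the toy data are hypothetical.

```python
def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp) for per-nucleotide labels, 1 = coding, 0 = non-coding."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(actual_exons, predicted_exons):
    """Return (ESn, ESp) for exons given as (start, end) tuples.

    An exon counts as correct only if both boundaries match exactly.
    """
    actual_set = set(actual_exons)
    predicted_set = set(predicted_exons)
    correct = len(actual_set & predicted_set)
    esn = correct / len(actual_set) if actual_set else 0.0
    esp = correct / len(predicted_set) if predicted_set else 0.0
    return esn, esp


# Toy data (assumed, not from the paper's datasets).
actual = [0, 1, 1, 1, 1, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
print(nucleotide_accuracy(actual, predicted))  # (0.75, 0.75)

actual_exons = [(10, 50), (80, 120)]
predicted_exons = [(10, 50), (85, 120), (200, 240)]
esn, esp = exon_accuracy(actual_exons, predicted_exons)
print(esn)  # 0.5
```

In the toy example, the second predicted exon overlaps an actual exon but misses its start boundary, so it counts fully toward nucleotide-level accuracy yet not at all toward exon-level accuracy. This is one reason exon-level figures are usually lower than nucleotide-level ones.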

Conclusions

Gene prediction is an important step in annotating genomes, and accurate prediction of genes in genomes remains a challenging topic. Therefore, a highly effective method is needed. In this paper, we proposed the RBFN-Combining method for combining the gene predictions of three gene-finding programs, Genscan, Glimmer, and HMMgene, which can be used successfully to improve prediction accuracy at the exon level. The improvement was obtained at the exon level on the test genomic dataset. For

Acknowledgements

The authors are grateful for the support of the National Natural Science Foundation of China (60433020, 60673023), the science-technology development project of Jilin Province of China (20050705-2), the European Commission under Grant no. TH/Asia Link/010 (111084), and the “985” project of Jilin University.

References (19)

  • E.C. Uberbacher et al., Discovering and understanding genes in human DNA sequence using GRAIL, Methods Enzymol. (1996)
  • E. Snyder et al., Identification of protein coding regions in genomic DNA, J. Mol. Biol. (1995)
  • C. Burge et al., Prediction of complete gene structures in human genomic DNA, J. Mol. Biol. (1997)
  • F.L. Poole et al., Defining genes in the genome of the hyperthermophilic archaeon Pyrococcus furiosus: implications for all microbial genomes, J. Bacteriol. (2005)
  • K. Murakami et al., Gene recognition by combination of several gene-finding programs, Bioinformatics (1998)
  • S. Rogic et al., Improving gene recognition accuracy by combining predictions from two gene-finding programs, Bioinformatics (2002)
  • V. Pavlovic et al., A Bayesian framework for combining gene predictions, Bioinformatics (2002)
  • S. Rogic et al., Evaluation of gene-finding programs on mammalian sequences, Genome Res. (2001)
  • C. Dewey et al., Accurate identification of novel human genes through simultaneous gene prediction in human, mouse, and rat, Genome Res. (2004)
There are more references available in the full text version of this article.

You Zhou, Master. Mr. Zhou is now a docent in the College of Computer Science and Technology, Jilin University. He received his Bachelor's degree from the College of Computer Science and Technology of Jilin University in 2002 and his Master's degree from the College of Computer Science and Technology of Harbin Institute of Technology in 2004. His research interests focus on computational intelligence, artificial neural networks, and bioinformatics. His research involves gene prediction, software evaluation, essential gene prediction, and non-coding RNA.

Yanchun Liang, Ph.D., Professor. Dr. Liang is now a professor in the College of Computer Science and Technology, Jilin University. He graduated from the Department of Mathematics of Jilin University in 1977. He was a visiting scholar at Manchester University, UK, from 1990 to 1991, a visiting professor at the National University of Singapore from 2000 to 2001, and a visiting professor at the Institute of High Performance Computing of Singapore from 2002 to 2004. His research interests focus on computational intelligence and intelligence engineering, including related theories, models, and algorithms of artificial neural networks, fuzzy systems, and evolutionary computation, as well as applications of intelligent computational methods to combinatorial optimization, control of ultrasonic motors, MEMS modeling, prediction of economic time series, and bioinformatics.

Chengquan Hu is a professor in the College of Computer Science and Technology of Jilin University. His research interests include bioinformatics and embedded systems.

Liupu Wang was born in Changchun, Jilin, China. He received the B.S. degree in Computer Science from the College of Computer Science and Technology, Jilin University, in 2002, and the M.T. degree in Computer Science from the same college in 2006. His interests are bioinformatics and artificial neural networks.

Xiaohu Shi is now a lecturer in the College of Computer Science and Technology, Jilin University, China. He received the M.S. degree in Fluid Mechanics in July 2002 from the College of Mathematics, Jilin University. His current research interests include computational intelligence and bioinformatics.
