A novel ensemble of classifiers for microarray data classification
Introduction
Microarray technology makes it possible to measure the expression levels of thousands of genes simultaneously in a single experiment. Each spot on a microarray chip contains the clone of a gene from a tissue sample. The mRNA samples are labelled with two different dyes, for example Cy5 (red) and Cy3 (green). After the mRNA interacts with the genes on the chip, i.e., hybridization, the color of each spot changes. The resulting image reflects the characteristics of the tissue at the molecular level [1].
In recent years, research has shown that accurate cancer diagnosis can be achieved by classifying microarray data, and various intelligent methods have been applied in this area. However, a microarray dataset typically consists of a few hundred samples and thousands or even tens of thousands of genes, and it is extremely difficult to apply traditional classification methods directly in such a high-dimensional space. Gene selection methods have therefore been proposed to reduce the dimensionality, including principal component analysis (PCA) [8], Fisher-ratio, t-test, and correlation analysis. Alongside these feature selection methods, intelligent classifiers have been applied to microarray classification, such as the support vector machine (SVM) [7], K-nearest neighbor (KNN) [6], and artificial neural network (ANN) [1]. Highly accurate classification remains difficult to achieve, however, because most intelligent classifiers are prone to over-fitting. In recent years, ensemble approaches have been proposed that combine multiple classifiers into a committee so as to make better decisions when classifying microarray data instances, offering improved accuracy and reliability. Much research has shown that a necessary and sufficient condition for an ensemble to outperform its individual members is that the base classifiers be accurate and diverse: an accurate classifier is one whose error rate is better than random guessing on new instances, and two classifiers are diverse if they make different errors on common data instances [9]. Ensemble approaches therefore involve two important aspects.
The first aspect is how to generate diverse base classifiers. Traditionally, re-sampling has been widely used to generate the training datasets on which base classifiers learn. This method is highly random, and because microarray datasets contain so few samples, the resulting training sets may be very similar to one another. In this paper, different methods such as correlation analysis and Fisher-ratio are first applied to generate feature subsets. Because the selection criteria differ, the subsets contain partly different features, all of which are informative genes. The feature subsets are then re-sampled to form the learning datasets. Since the datasets are drawn from different feature subsets, they are likely to be more varied and to yield a more effective ensemble.
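The idea of giving each base learner its own feature-selection criterion plus a bootstrap resample can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names are invented, and the use of absolute correlation and a |t|-statistic as the two scoring criteria is an assumption.

```python
import numpy as np

def correlation_scores(X, y):
    """Absolute Pearson correlation of each gene (column) with the class label."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)

def t_scores(X, y):
    """Absolute two-sample t-statistic per gene for a two-class problem."""
    X1, X2 = X[y == 0], X[y == 1]
    se = np.sqrt(X1.var(axis=0) / len(X1) + X2.var(axis=0) / len(X2)) + 1e-12
    return np.abs(X1.mean(axis=0) - X2.mean(axis=0)) / se

def diverse_training_sets(X, y, k=50, n_sets=4, seed=0):
    """Alternate the scoring criterion, keep the top-k genes under each,
    then bootstrap the rows: the resulting sets differ in both features
    and samples, unlike plain re-sampling of a single dataset."""
    rng = np.random.default_rng(seed)
    out = []
    for i in range(n_sets):
        score = correlation_scores if i % 2 == 0 else t_scores
        genes = np.argsort(score(X, y))[::-1][:k]
        rows = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        out.append((X[np.ix_(rows, genes)], y[rows]))
    return out
```

Each tuple in the output is one base learner's training set; because the gene subsets overlap only partly, the base classifiers see genuinely different views of the data.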
The second aspect is how to combine the base classifiers. In this paper, an intelligent approach to constructing ensemble classifiers is proposed. The method first trains the base classifiers with the particle swarm optimization (PSO) algorithm, and then selects appropriate classifiers to construct a high-performance classification committee with an estimation of distribution algorithm (EDA). Experiments show that the proposed method produces the best recognition rates.
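The committee-selection stage can be illustrated with the simplest EDA, the univariate marginal distribution algorithm (UMDA), searching over a binary inclusion vector for the trained base classifiers. This is a hedged sketch: the paper does not specify its EDA variant, and majority voting on a held-out validation set is an assumption.

```python
import numpy as np

def umda_select(preds, y_val, pop=30, iters=40, elite=10, seed=0):
    """UMDA over committee membership: maintain a per-classifier inclusion
    probability, sample candidate committees, and re-estimate the
    probabilities from the best-scoring samples.
    preds: (n_classifiers, n_val) array of 0/1 base-classifier outputs."""
    rng = np.random.default_rng(seed)
    n = preds.shape[0]
    p = np.full(n, 0.5)                          # inclusion probabilities

    def fitness(mask):
        if not mask.any():
            return 0.0                           # empty committee is useless
        vote = (preds[mask].mean(axis=0) >= 0.5).astype(int)
        return np.mean(vote == y_val)            # validation accuracy

    best_mask, best_fit = None, -1.0
    for _ in range(iters):
        masks = rng.random((pop, n)) < p         # sample committees
        fits = np.array([fitness(m) for m in masks])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best_fit, best_mask = fits[order[0]], masks[order[0]].copy()
        p = masks[order[:elite]].mean(axis=0)    # re-estimate marginals
        p = np.clip(p, 0.05, 0.95)               # keep exploration alive
    return best_mask, best_fit
```

Unlike a genetic algorithm, UMDA carries no crossover or mutation: the distribution itself is the search state, which is what distinguishes EDAs as a family.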
The paper is organized as follows. The feature selection methods are introduced in Section 2. The particle swarm optimization algorithm used to train the neural networks employed as base classifiers is described in Section 3. The optimal design method for constructing ensemble classifiers is described in Section 4. Section 5 gives the simulation results. Finally, we present some concluding remarks.
Section snippets
Gene selection methods
Although there are a large number of genes in a microarray, only a small fraction of them have a great impact on classification. Many genes behave similarly in cancerous and normal cases, and some may even act as "noise" that undermines classification accuracy. Hence, to obtain good classification accuracy, we need to pick out the genes that benefit classification most. In addition, reducing the number of genes cuts down the inputs for computation, so the classifiers are
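One standard filter criterion mentioned in the introduction is the Fisher ratio, which for a two-class problem scores each gene by the squared difference of class means over the sum of class variances. A minimal sketch (function names are illustrative):

```python
import numpy as np

def fisher_ratio(X, y):
    """Fisher ratio per gene for a two-class problem:
    (mu1 - mu2)^2 / (var1 + var2). Higher = more discriminative."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0) + 1e-12  # guard against zero variance
    return num / den

def select_top_genes(X, y, k):
    """Indices of the k highest-scoring genes."""
    return np.argsort(fisher_ratio(X, y))[::-1][:k]
```

Because the score is computed independently per gene, the filter runs in a single pass even over tens of thousands of genes, which is why such criteria are the usual first step before any classifier is trained.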
Learning the datasets with neural networks
There are many kinds of classification methods. In recent years, most researchers have applied the support vector machine (SVM) as a classifier to learn microarray datasets and have obtained very good results. But the SVM is computationally complex and training it costs a lot of time; if many SVMs are used as base classifiers for an ensemble, the training time may be very long and the ensemble inefficient. Moreover, SVMs tend to be similar to one another because of their learning algorithms, the
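Training a small one-hidden-layer network by global-best PSO, as an alternative to gradient descent, can be sketched as follows. The inertia and acceleration constants here are typical textbook values, not the paper's settings, and the network layout is an assumption.

```python
import numpy as np

def nn_forward(w, X, n_in, n_hid):
    """One-hidden-layer net; w is a particle's flat weight vector."""
    i = n_in * n_hid
    W1 = w[:i].reshape(n_in, n_hid)
    b1 = w[i:i + n_hid]
    W2 = w[i + n_hid:i + 2 * n_hid]
    b2 = w[-1]
    h = np.tanh(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid output in (0, 1)

def pso_train(X, y, n_hid=4, n_particles=20, iters=100, seed=0):
    """Global-best PSO over the flattened weights; fitness is the
    mean squared classification error on the training set."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1] * n_hid + 2 * n_hid + 1          # W1 + b1 + W2 + b2
    pos = rng.uniform(-1, 1, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_err = pos.copy(), np.full(n_particles, np.inf)
    gbest, gbest_err = pos[0].copy(), np.inf
    for _ in range(iters):
        for i in range(n_particles):
            err = np.mean((nn_forward(pos[i], X, X.shape[1], n_hid) - y) ** 2)
            if err < pbest_err[i]:
                pbest_err[i], pbest[i] = err, pos[i].copy()
            if err < gbest_err:
                gbest_err, gbest = err, pos[i].copy()
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # velocity update: inertia 0.7, cognitive/social weights 1.5
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
    return gbest, gbest_err
```

Because PSO only needs fitness evaluations, the same loop works for any network topology, which is convenient when each base classifier sees a different feature subset.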
Optimal design method for constructing ensemble classifiers
Selecting some of the classifiers to construct the committee can be better than using all of them [2], so we should select appropriate classifiers to form the classification committee. Traditionally, many approaches can accomplish this task, such as greedy hill climbing: it evaluates all possible local changes to the current set, such as adding one classifier or removing one, and chooses the best change, or simply the first, that improves the performance of the subset. Once a change is made to a subset, it
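The forward (add-only) form of greedy hill climbing described here can be sketched as follows, assuming majority voting and a held-out validation set; the helper names are illustrative.

```python
import numpy as np

def majority_vote(preds, subset):
    """0/1 majority vote of the chosen base classifiers.
    preds: (n_classifiers, n_val) array of 0/1 predictions."""
    return (preds[list(subset)].mean(axis=0) >= 0.5).astype(int)

def greedy_select(preds, y_val):
    """Forward hill climbing: repeatedly add the single classifier that
    most improves committee accuracy on the validation set; stop when
    no single addition helps (a local optimum)."""
    chosen, best_acc = [], 0.0
    remaining = set(range(preds.shape[0]))
    while remaining:
        gains = [(np.mean(majority_vote(preds, chosen + [j]) == y_val), j)
                 for j in remaining]
        acc, j = max(gains)
        if acc <= best_acc:
            break  # no candidate improves the committee
        chosen.append(j)
        remaining.discard(j)
        best_acc = acc
    return chosen, best_acc
```

The weakness the section alludes to is visible in the stopping rule: once the search commits to a subset it never reconsiders earlier additions, so it can stall at a local optimum, which motivates the global, distribution-based search of the EDA.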
Experiments
We performed extensive experiments on four benchmark cancer datasets, namely the Leukemia, Colon, Ovarian and Lungcancer datasets.
Conclusions
In this paper, a novel ensemble of classifiers based on correlation analysis is proposed for cancer classification. The Leukemia and Colon databases are used for conducting all the experiments. Gene features are first extracted by the correlation analysis technique, which greatly reduces dimensionality while retaining the informative features. Then the EDA is employed to construct the classifier committee for classification. Comparing the results with some advanced artificial techniques, the
Acknowledgments
This research was supported by the NSFC under grant No. 60573065, the Key Subject Research Foundation of Shandong Province and the Natural Science Foundation of Shandong Province (grant Y2007G33).
References (19)
- et al., Ensembling neural networks: many could be better than all, Artif. Intell. (2002)
- et al., DNA arrays for analysis of gene expression, Methods Enzymol. (1999)
- et al., Monitoring gene expression using DNA microarrays, Curr. Opin. Microbiol. (2000)
- et al., Use of proteomic patterns in serum to identify ovarian cancer, Lancet (2002)
- et al., The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming, Artif. Intell. Med. (2006)
- et al., Applications of support vector machines to cancer classification with microarray data, Int. J. Neural Syst. (2005)
- et al., Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation (2001)
- et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
- et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. (1999)
Yuehui Chen was born in 1964. He received his B.Sc. degree from the Department of Mathematics (major in control theory) of Shandong University, China, in 1985, and his Ph.D. degree from the Department of Electrical Engineering and Computer Science of Kumamoto University, Japan, in 2001. During 2001–2003 he worked as a Senior Researcher at the Memory-Tech Corporation in Tokyo. Since 2003 he has been a faculty member of the School of Information Science and Engineering at the University of Jinan, where he currently heads the Laboratory of Computational Intelligence. His research interests include evolutionary computation, neural networks, fuzzy systems, hybrid computational intelligence and their applications in time-series prediction, system identification and intelligent control. He is the author or co-author of more than 100 papers. Professor Yuehui Chen is a member of IEEE, the IEEE Systems, Man and Cybernetics Society and the Computational Intelligence Society. He is also a member of the editorial boards of several technical journals and a member of the program committees of several international conferences.
Yaou Zhao was born in 1982. He received the B.S. degree in computer science from Jinan University, Jinan, China, in 2005. Since 2005, he has been a master's student in the School of Information Science and Engineering, Jinan University. His research interests include evolutionary neural networks, bioinformatics and face recognition.