Applied Soft Computing

Volume 8, Issue 4, September 2008, Pages 1664-1669

A novel ensemble of classifiers for microarray data classification

https://doi.org/10.1016/j.asoc.2008.01.006

Abstract

Microarray data are often extremely asymmetric in dimensionality, with thousands or even tens of thousands of genes but only a few hundred samples. Such extreme asymmetry between the number of genes and the number of samples presents several challenges to conventional clustering and classification methods. In this paper, a novel ensemble method is proposed. First, in order to extract useful features and reduce dimensionality, different feature selection methods, such as correlation analysis and Fisher-ratio, are used to form different feature subsets. Then a pool of candidate base classifiers is generated by learning datasets re-sampled from the different feature subsets, with the classifiers trained by the PSO (Particle Swarm Optimization) algorithm. Finally, appropriate classifiers are selected to construct the classification committee using EDAs (Estimation of Distribution Algorithms). Experiments show that the proposed method produces the best recognition rates on four benchmark databases.

Introduction

Microarray technology has provided the ability to measure the expression levels of thousands of genes simultaneously in a single experiment. Each spot on a microarray chip contains the clone of a gene from a tissue sample. Some mRNA samples are labelled with two different kinds of dyes, for example, Cy5 (red) and Cy3 (blue). After the mRNA interacts with the genes, i.e., hybridization, the color of each spot on the chip changes. The resulting image reflects the characteristics of the tissue at the molecular level [1].

In recent years, research has shown that accurate cancer diagnosis can be achieved by performing microarray data classification, and various intelligent methods have been applied in this area. However, microarray data consist of a few hundred samples and thousands or even tens of thousands of genes, and it is extremely difficult to work in such a high-dimensional space with traditional classification methods directly. Gene selection methods have therefore been proposed and developed to reduce the dimensionality. These include principal components analysis (PCA) [8], Fisher-ratio, the t-test, and correlation analysis. Along with the feature selection methods, intelligent methods have been applied for microarray classification, such as the support vector machine (SVM) [7], K-nearest neighbor (KNN) [6], and artificial neural networks (ANN) [1]. But highly accurate classification is difficult to achieve, and most intelligent classifiers are apt to over-fit. In recent years, ensemble approaches have been proposed. An ensemble combines multiple classifiers into a committee to make more appropriate decisions when classifying microarray data instances, offering improved accuracy and reliability. Much research has shown that a necessary and sufficient condition for an ensemble to outperform its individual members is that the base classifiers be accurate and diverse. An accurate classifier is one whose error rate is better than random guessing on new instances, and two classifiers are diverse if they make different errors on common data instances [9]. There are therefore two important aspects of ensemble approaches to focus on.

The first aspect is how to generate diverse base classifiers. Traditionally, re-sampling has been widely used to generate training datasets for learning base classifiers. This approach is largely random, and because of the small number of samples the resulting datasets may be very similar. In this paper, different methods such as correlation analysis and Fisher-ratio are first applied to generate feature subsets. Because the selection methods differ, the subsets contain somewhat different features, yet all of them consist of informative genes. Learning datasets are then formed by re-sampling from these feature subsets. Since the datasets derive from different feature subsets, they tend to be more varied and more effective.
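The paper does not give code for this step; the following is a minimal numpy sketch of forming diverse learning datasets by bootstrap re-sampling the rows from each pre-computed feature subset. The function name, the number of sets per subset, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bootstrap_learning_sets(X, y, feature_subsets, sets_per_subset=3, rng=None):
    """Given feature subsets chosen by different selection methods
    (lists of gene column indices), draw bootstrap samples of the rows
    from each subset to form diverse learning datasets."""
    rng = np.random.default_rng(rng)
    datasets = []
    for genes in feature_subsets:
        for _ in range(sets_per_subset):
            # sample rows with replacement so each base classifier
            # sees a different view of the same small sample set
            rows = rng.integers(0, len(y), size=len(y))
            datasets.append((X[np.ix_(rows, genes)], y[rows]))
    return datasets
```

Because each learning set differs both in which genes it contains and in which samples it repeats, the base classifiers trained on them tend to make different errors, which is the diversity the ensemble relies on.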

The second aspect is how to combine the base classifiers. In this paper, an intelligent approach for constructing ensemble classifiers is proposed. The method first trains the base classifiers with the particle swarm optimization (PSO) algorithm, and then selects appropriate classifiers to construct a high-performance classification committee with an estimation of distribution algorithm (EDA). Experiments show that the proposed method produces the best recognition rates.

The paper is organized as follows. The feature selection methods are introduced in Section 2. The particle swarm optimization algorithm used to train the neural networks employed as base classifiers is described in Section 3. The optimal design method for constructing ensemble classifiers is described in Section 4. Section 5 gives the simulation results. Finally, we present some concluding remarks.


Gene selection methods

Although there are a large number of genes in a microarray, only a small fraction of them have a great impact on classification. Many genes behave similarly in cancerous and normal cases. Even worse, some genes may act as "noise" and undermine the classification accuracy. Hence, to obtain good classification accuracy, we need to pick out the genes that benefit the classification most. In addition, reducing the number of genes cuts down the inputs for computation, so the classifiers are simpler and faster to train.
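For concreteness, here is a minimal numpy sketch of the two gene-scoring criteria named in the paper, the Fisher-ratio and the absolute correlation with the class label, for a two-class problem. The small epsilon guarding against division by zero and the helper names are assumptions for illustration.

```python
import numpy as np

def fisher_ratio(X, y):
    """Fisher ratio per gene for two classes:
    (mean0 - mean1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den

def correlation_score(X, y):
    """Absolute Pearson correlation of each gene with the class label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)

def top_k_genes(X, y, score, k):
    """Indices of the k highest-scoring genes under a scoring function."""
    return np.argsort(score(X, y))[::-1][:k]
```

Ranking all genes and keeping only the top k per criterion yields one informative feature subset per selection method; different criteria rank genes differently, which is what makes the resulting subsets diverse.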

Learning the datasets with neural networks

There are many kinds of classification methods. In recent years, most researchers have applied the SVM (support vector machine) as a classifier to learn microarray datasets and obtained very good results. But the SVM is computationally complex and training it costs a lot of time; if many SVMs are used as base classifiers for an ensemble, the training time may be very long and the ensemble inefficient. Moreover, SVMs tend to be similar to one another because of their learning algorithm, so the resulting ensemble lacks diversity.
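The paper instead trains neural-network base classifiers with PSO. As a sketch only, the following trains a one-hidden-layer network by global-best PSO over its flattened weight vector, minimizing training MSE. The network size, swarm size, and the inertia/acceleration constants (0.7, 1.5, 1.5) are common textbook choices assumed here, not the paper's settings.

```python
import numpy as np

def nn_predict(w, X, n_hidden=5):
    """One-hidden-layer network; w packs all weights in a flat vector."""
    d = X.shape[1]
    W1 = w[:d * n_hidden].reshape(d, n_hidden)
    b1 = w[d * n_hidden:d * n_hidden + n_hidden]
    W2 = w[d * n_hidden + n_hidden:-1]
    b2 = w[-1]
    h = np.tanh(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output

def pso_train(X, y, n_hidden=5, n_particles=20, iters=100, rng=None):
    """Global-best PSO over the network weights, minimizing MSE."""
    rng = np.random.default_rng(rng)
    dim = X.shape[1] * n_hidden + 2 * n_hidden + 1
    err = lambda w: np.mean((nn_predict(w, X, n_hidden) - y) ** 2)
    pos = rng.normal(scale=0.5, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_err = np.array([err(w) for w in pos])
    gbest = pbest[pbest_err.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # velocity update: inertia + pull toward personal and global bests
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        e = np.array([err(w) for w in pos])
        improved = e < pbest_err
        pbest[improved], pbest_err[improved] = pos[improved], e[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return gbest
```

Because PSO needs only fitness evaluations, not gradients, each base network can be trained on its own re-sampled dataset with no backpropagation code, which keeps the candidate pool cheap to generate.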

Optimal design method for constructing ensemble classifiers

Selecting a subset of the classifiers to construct the committee is better than using all of them [2], so we should select appropriate classifiers to form the classification committee. Traditionally, many approaches can accomplish this task, such as greedy hill climbing, which evaluates all possible local changes to the current set, such as adding one classifier or removing one, and chooses the best change, or simply the first change, that improves the performance of the subset. Once a change is made to the subset, it is never reconsidered.
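An EDA searches the space of committees globally rather than by single local moves. The sketch below uses UMDA, the simplest EDA: maintain an inclusion probability per candidate classifier, sample committees from it, score each by majority-vote accuracy, and refit the probabilities to the elite samples. The population sizes and probability clipping margins are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def committee_accuracy(mask, preds, y):
    """Majority vote of the selected classifiers' 0/1 predictions."""
    if mask.sum() == 0:
        return 0.0
    vote = preds[mask.astype(bool)].mean(axis=0) > 0.5
    return float((vote == y).mean())

def umda_select(preds, y, pop=30, elite=10, iters=40, rng=None):
    """UMDA over binary inclusion masks: sample committees, keep the
    elite, and refit one Bernoulli probability per classifier."""
    rng = np.random.default_rng(rng)
    n = preds.shape[0]
    p = np.full(n, 0.5)          # initial inclusion probabilities
    best_mask, best_fit = None, -1.0
    for _ in range(iters):
        popn = (rng.random((pop, n)) < p).astype(int)
        fits = np.array([committee_accuracy(m, preds, y) for m in popn])
        top = popn[np.argsort(fits)[::-1][:elite]]
        # refit probabilities to the elite, clipped to keep exploration alive
        p = np.clip(top.mean(axis=0), 0.05, 0.95)
        if fits.max() > best_fit:
            best_fit = float(fits.max())
            best_mask = popn[fits.argmax()].copy()
    return best_mask, best_fit
```

Unlike greedy hill climbing, a poor early choice is not locked in: each generation re-samples whole committees from the learned distribution, so the search can escape local optima of the subset-selection landscape.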

Experiments

We performed extensive experiments on four benchmark cancer datasets, namely Leukemia, Colon, Ovarian, and Lung cancer.

Conclusions

In this paper, a novel ensemble of classifiers based on correlation analysis is proposed for cancer classification. The leukemia and colon databases are used for conducting all the experiments. Gene features are first extracted by the correlation analysis technique, which greatly reduces dimensionality while maintaining the informative features. Then the EDA is employed to construct the classifier committee for classification. Comparing the results with some advanced artificial intelligence techniques, the proposed method achieves the best recognition rates.

Acknowledgments

This research was supported by the NSFC under grant No. 60573065, the Key Subject Research Foundation of Shandong Province, and the Natural Science Foundation of Shandong Province (grant Y2007G33).




Yuehui Chen was born in 1964. He received his B.Sc. degree in the Department of Mathematics (major in control theory) from Shandong University, China, in 1985, and his Ph.D. degree in the Department of Electrical Engineering and Computer Science from Kumamoto University, Japan, in 2001. During 2001–2003, he worked as a Senior Researcher at the Memory-Tech Corporation in Tokyo. Since 2003 he has been a faculty member of the School of Information Science and Engineering at the University of Jinan, where he currently heads the Laboratory of Computational Intelligence. His research interests include evolutionary computation, neural networks, fuzzy systems, hybrid computational intelligence, and their applications in time-series prediction, system identification, and intelligent control. He is the author or co-author of more than 100 papers. Professor Yuehui Chen is a member of the IEEE, the IEEE Systems, Man and Cybernetics Society, and the Computational Intelligence Society. He is also a member of the editorial boards of several technical journals and a member of the program committees of several international conferences.

Yaou Zhao was born in 1982. He received the B.S. degree in computer science from Jinan University, Jinan, China, in 2005. Since 2005, he has been a master's student in the School of Information Science and Engineering, Jinan University. His research interests include evolutionary neural networks, bioinformatics, and face recognition.
