Applied Soft Computing

Volume 8, Issue 4, September 2008, Pages 1664-1669

A novel ensemble of classifiers for microarray data classification

https://doi.org/10.1016/j.asoc.2008.01.006

Abstract

Microarray data are often extremely asymmetric in dimensionality, with thousands or even tens of thousands of genes but only a few hundred samples. Such extreme asymmetry between the number of genes and the number of samples presents several challenges to conventional clustering and classification methods. In this paper, a novel ensemble method is proposed. First, in order to extract useful features and reduce dimensionality, different feature selection methods, such as correlation analysis and Fisher-ratio, are used to form different feature subsets. Then a pool of candidate base classifiers is generated by learning datasets re-sampled from the different feature subsets, with the classifiers trained by the PSO (Particle Swarm Optimization) algorithm. Finally, appropriate classifiers are selected to construct the classification committee using EDAs (Estimation of Distribution Algorithms). Experiments show that the proposed method produces the best recognition rates on four benchmark databases.

Introduction

Microarray technology has provided the ability to measure the expression levels of thousands of genes simultaneously in a single experiment. Each spot on a microarray chip contains the clone of a gene from a tissue sample. Some mRNA samples are labelled with two different kinds of dyes, for example, Cy5 (red) and Cy3 (blue). After the mRNA interacts with the genes, i.e., hybridization, the color of each spot on the chip changes. The resulting image reflects the characteristics of the tissue at the molecular level [1].

In recent years, research has shown that accurate cancer diagnosis can be achieved by performing microarray data classification, and various intelligent methods have been applied in this area. However, microarray data consist of a few hundred samples and thousands or even tens of thousands of genes, and it is extremely difficult to work in such a high-dimensional space with traditional classification methods directly. Gene selection methods have therefore been proposed and developed to reduce the dimensionality. These include principal components analysis (PCA) [8], Fisher-ratio, the t-test, and correlation analysis. Along with the feature selection methods, intelligent methods have been applied for microarray classification, such as the support vector machine (SVM) [7], K-nearest neighbor (KNN) [6], and artificial neural networks (ANN) [1]. But highly accurate classification is difficult to achieve, and most intelligent classifiers are apt to over-fit. In recent years, ensemble approaches have been proposed. An ensemble combines multiple classifiers into a committee to make more appropriate decisions when classifying microarray data instances, offering improved accuracy and reliability. Much research has shown that a necessary and sufficient condition for an ensemble to outperform its individual members is that the base classifiers be accurate and diverse. An accurate classifier is one whose error rate is better than random guessing on new instances, and two classifiers are diverse if they make different errors on common data instances [9]. There are therefore two important aspects of ensemble approaches to focus on.

The first aspect is how to generate diverse base classifiers. Traditionally, re-sampling has been widely used to generate training datasets for learning base classifiers. This approach is largely random, and because of the small number of samples the resulting datasets may be very similar. In this paper, different methods such as correlation analysis and Fisher-ratio are first applied to generate feature subsets. Because the selection methods differ, the subsets contain somewhat different features, yet all of them consist of informative genes. Learning datasets are then formed by re-sampling from these feature subsets. Since the datasets derive from different feature subsets, they tend to be more varied and more effective.
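The paper does not give code for this step; the following is a minimal numpy sketch of forming diverse learning datasets by bootstrap re-sampling the rows from each pre-computed feature subset. The function name, the number of sets per subset, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bootstrap_learning_sets(X, y, feature_subsets, sets_per_subset=3, rng=None):
    """Given feature subsets chosen by different selection methods
    (lists of gene column indices), draw bootstrap samples of the rows
    from each subset to form diverse learning datasets."""
    rng = np.random.default_rng(rng)
    datasets = []
    for genes in feature_subsets:
        for _ in range(sets_per_subset):
            # sample rows with replacement so each base classifier
            # sees a different view of the same small sample set
            rows = rng.integers(0, len(y), size=len(y))
            datasets.append((X[np.ix_(rows, genes)], y[rows]))
    return datasets
```

Because each learning set differs both in which genes it contains and in which samples it repeats, the base classifiers trained on them tend to make different errors, which is the diversity the ensemble relies on.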

The second aspect is how to combine the base classifiers. In this paper, an intelligent approach for constructing ensemble classifiers is proposed. The method first trains the base classifiers with the particle swarm optimization (PSO) algorithm, and then selects appropriate classifiers to construct a high-performance classification committee with an estimation of distribution algorithm (EDA). Experiments show that the proposed method produces the best recognition rates.

The paper is organized as follows. The feature selection methods are introduced in Section 2. The particle swarm optimization algorithm used to train the neural networks employed as base classifiers is described in Section 3. The optimal design method for constructing ensemble classifiers is described in Section 4. Section 5 gives the simulation results. Finally, we present some concluding remarks.


Gene selection methods

Although there are a large number of genes in a microarray, only a small fraction of them have a great impact on classification. Many genes behave similarly in cancerous and normal cases. Even worse, some genes may act as "noise" and undermine the classification accuracy. Hence, to obtain good classification accuracy, we need to pick out the genes that benefit the classification most. In addition, reducing the number of genes cuts down the inputs for computation, so the classifiers are simpler and faster to train.
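For concreteness, here is a minimal numpy sketch of the two gene-scoring criteria named in the paper, the Fisher-ratio and the absolute correlation with the class label, for a two-class problem. The small epsilon guarding against division by zero and the helper names are assumptions for illustration.

```python
import numpy as np

def fisher_ratio(X, y):
    """Fisher ratio per gene for two classes:
    (mean0 - mean1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den

def correlation_score(X, y):
    """Absolute Pearson correlation of each gene with the class label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)

def top_k_genes(X, y, score, k):
    """Indices of the k highest-scoring genes under a scoring function."""
    return np.argsort(score(X, y))[::-1][:k]
```

Ranking all genes and keeping only the top k per criterion yields one informative feature subset per selection method; different criteria rank genes differently, which is what makes the resulting subsets diverse.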

Learning the datasets with neural networks

There are many kinds of classification methods. In recent years, most researchers have applied the SVM (support vector machine) as a classifier to learn microarray datasets and obtained very good results. But the SVM is computationally complex and training it costs a lot of time; if many SVMs are used as base classifiers for an ensemble, the training time may be very long and the ensemble inefficient. Moreover, SVMs tend to be similar to one another because of their learning algorithm, so the resulting ensemble lacks diversity.
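The paper instead trains neural-network base classifiers with PSO. As a sketch only, the following trains a one-hidden-layer network by global-best PSO over its flattened weight vector, minimizing training MSE. The network size, swarm size, and the inertia/acceleration constants (0.7, 1.5, 1.5) are common textbook choices assumed here, not the paper's settings.

```python
import numpy as np

def nn_predict(w, X, n_hidden=5):
    """One-hidden-layer network; w packs all weights in a flat vector."""
    d = X.shape[1]
    W1 = w[:d * n_hidden].reshape(d, n_hidden)
    b1 = w[d * n_hidden:d * n_hidden + n_hidden]
    W2 = w[d * n_hidden + n_hidden:-1]
    b2 = w[-1]
    h = np.tanh(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output

def pso_train(X, y, n_hidden=5, n_particles=20, iters=100, rng=None):
    """Global-best PSO over the network weights, minimizing MSE."""
    rng = np.random.default_rng(rng)
    dim = X.shape[1] * n_hidden + 2 * n_hidden + 1
    err = lambda w: np.mean((nn_predict(w, X, n_hidden) - y) ** 2)
    pos = rng.normal(scale=0.5, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_err = np.array([err(w) for w in pos])
    gbest = pbest[pbest_err.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # velocity update: inertia + pull toward personal and global bests
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        e = np.array([err(w) for w in pos])
        improved = e < pbest_err
        pbest[improved], pbest_err[improved] = pos[improved], e[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return gbest
```

Because PSO needs only fitness evaluations, not gradients, each base network can be trained on its own re-sampled dataset with no backpropagation code, which keeps the candidate pool cheap to generate.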

Optimal design method for constructing ensemble classifiers

Selecting a subset of the classifiers to construct the committee is better than using all of them [2], so we should select appropriate classifiers to form the classification committee. Traditionally, many approaches can accomplish this task, such as greedy hill climbing, which evaluates all possible local changes to the current set, such as adding one classifier or removing one, and chooses the best change, or simply the first change, that improves the performance of the subset. Once a change is made to the subset, it is never reconsidered.
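An EDA searches the space of committees globally rather than by single local moves. The sketch below uses UMDA, the simplest EDA: maintain an inclusion probability per candidate classifier, sample committees from it, score each by majority-vote accuracy, and refit the probabilities to the elite samples. The population sizes and probability clipping margins are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def committee_accuracy(mask, preds, y):
    """Majority vote of the selected classifiers' 0/1 predictions."""
    if mask.sum() == 0:
        return 0.0
    vote = preds[mask.astype(bool)].mean(axis=0) > 0.5
    return float((vote == y).mean())

def umda_select(preds, y, pop=30, elite=10, iters=40, rng=None):
    """UMDA over binary inclusion masks: sample committees, keep the
    elite, and refit one Bernoulli probability per classifier."""
    rng = np.random.default_rng(rng)
    n = preds.shape[0]
    p = np.full(n, 0.5)          # initial inclusion probabilities
    best_mask, best_fit = None, -1.0
    for _ in range(iters):
        popn = (rng.random((pop, n)) < p).astype(int)
        fits = np.array([committee_accuracy(m, preds, y) for m in popn])
        top = popn[np.argsort(fits)[::-1][:elite]]
        # refit probabilities to the elite, clipped to keep exploration alive
        p = np.clip(top.mean(axis=0), 0.05, 0.95)
        if fits.max() > best_fit:
            best_fit = float(fits.max())
            best_mask = popn[fits.argmax()].copy()
    return best_mask, best_fit
```

Unlike greedy hill climbing, a poor early choice is not locked in: each generation re-samples whole committees from the learned distribution, so the search can escape local optima of the subset-selection landscape.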

Experiments

We performed extensive experiments on four benchmark cancer datasets, namely Leukemia, Colon, Ovarian, and Lung cancer.

Conclusions

In this paper, a novel ensemble of classifiers based on correlation analysis is proposed for cancer classification. The leukemia and colon databases are used for conducting all the experiments. Gene features are first extracted by the correlation analysis technique, which greatly reduces dimensionality while maintaining the informative features. Then the EDA is employed to construct the classifier committee for classification. Comparing the results with some advanced artificial intelligence techniques, the proposed method achieves the best recognition rates.

Acknowledgments

This research was supported by the NSFC under grant No. 60573065, the Key Subject Research Foundation of Shandong Province, and the Natural Science Foundation of Shandong Province (grant Y2007G33).




Yuehui Chen was born in 1964. He received his B.Sc. degree in the Department of Mathematics (major in control theory) from Shandong University, China, in 1985, and his Ph.D. degree in the Department of Electrical Engineering and Computer Science from Kumamoto University, Japan, in 2001. During 2001–2003, he worked as a Senior Researcher at the Memory-Tech Corporation in Tokyo. Since 2003 he has been a faculty member of the School of Information Science and Engineering at the University of Jinan, where he currently heads the Laboratory of Computational Intelligence. His research interests include evolutionary computation, neural networks, fuzzy systems, hybrid computational intelligence, and their applications in time-series prediction, system identification, and intelligent control. He is the author or co-author of more than 100 papers. Professor Yuehui Chen is a member of the IEEE, the IEEE Systems, Man and Cybernetics Society, and the Computational Intelligence Society. He is also a member of the editorial boards of several technical journals and a member of the program committees of several international conferences.

Yaou Zhao was born in 1982. He received the B.S. degree in computer science from Jinan University, Jinan, China, in 2005. Since 2005, he has been a master's student in the School of Information Science and Engineering, Jinan University. His research interests include evolutionary neural networks, bioinformatics, and face recognition.
