A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

Abstract

Classification algorithms are among the most commonly used data mining models and are widely applied to extract valuable knowledge from large amounts of data. The criteria typically used to evaluate classifiers are accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares the accuracy, speed (CPU time consumed) and robustness of classification algorithms across various datasets and implementation techniques. Because the data miner selects a model mainly with respect to classification accuracy, the performance of each classifier plays a crucial role in selection. Complexity is dominated mostly by the time required for classification; here, complexity is measured as the CPU time consumed by each classifier. The study first applies several classification models to multiple datasets in three stages: first, running the algorithms on the original datasets; second, running them on the same datasets after discretising the continuous variables; and third, running them on the same datasets after applying principal component analysis. The resulting accuracies and speeds are then compared. The relationship between dataset characteristics, implementation attributes, accuracy and CPU time is also examined and discussed. Moreover, a regression model is introduced to quantify the effect of dataset and implementation conditions on classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repeated experiments on both noisy and cleaned datasets.
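The three-stage comparison described in the abstract can be sketched in code. The following is a minimal illustrative example, not the authors' implementation: the dataset, the three classifiers, the binning and PCA settings, and the use of Python/scikit-learn are all assumptions chosen only to show how accuracy and CPU time might be measured on the original, discretised, and PCA-transformed versions of a dataset.

```python
# Illustrative sketch only: dataset, classifiers, and parameters are placeholder
# assumptions, not those used in the study.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the benchmark datasets

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
}

# Three preprocessing stages: original data, discretised continuous variables,
# and principal component analysis retaining 95% of the variance.
stages = {
    "original": lambda clf: clf,
    "discretised": lambda clf: make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"), clf),
    "pca": lambda clf: make_pipeline(StandardScaler(), PCA(n_components=0.95), clf),
}

for stage_name, wrap in stages.items():
    for clf_name, clf in classifiers.items():
        model = wrap(clf)
        start = time.process_time()                   # CPU time, not wall-clock time
        scores = cross_val_score(model, X, y, cv=10)  # 10-fold accuracy
        cpu = time.process_time() - start
        print(f"{stage_name:11s} {clf_name:13s} "
              f"accuracy={scores.mean():.3f} cpu={cpu:.2f}s")
```

Timing the cross-validation loop with time.process_time() approximates the CPU-time criterion described above; in a full study each stage-classifier pair would be repeated across many datasets, and on noisy and cleaned variants, to assess robustness.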




Author information

Corresponding author

Correspondence to Neslihan Dogan.

About this article

Cite this article

Dogan, N., Tanrikulu, Z. A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Inf Technol Manag 14, 105–124 (2013). https://doi.org/10.1007/s10799-012-0135-8
