A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

Abstract

Classification algorithms are among the most commonly used data mining models and are widely applied to extract valuable knowledge from large amounts of data. The criteria typically used to evaluate classifiers are accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares the accuracy, speed (CPU time consumed) and robustness of classification algorithms across various datasets and implementation techniques. Because the data miner selects a model mainly with respect to classification accuracy, the performance of each classifier plays a crucial role in selection. Complexity is dominated mostly by the time required for classification; here, complexity is measured as the CPU time consumed by each classifier. The study first applies several classification models to multiple datasets in three stages: first, running the algorithms on the original datasets; second, running them on the same datasets after discretising the continuous variables; and third, running them on the same datasets after applying principal component analysis. The resulting accuracies and speeds are then compared. The relationship between dataset characteristics, implementation attributes, accuracy and CPU time is also examined and discussed. Moreover, a regression model is introduced to quantify the effect of dataset and implementation conditions on classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repeated experiments on both noisy and cleaned datasets.
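The three-stage comparison described in the abstract can be sketched in code. The following is a minimal illustrative example, not the authors' implementation: the dataset, the three classifiers, the binning and PCA settings, and the use of Python/scikit-learn are all assumptions chosen only to show how accuracy and CPU time might be measured on the original, discretised, and PCA-transformed versions of a dataset.

```python
# Illustrative sketch only: dataset, classifiers, and parameters are placeholder
# assumptions, not those used in the study.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the benchmark datasets

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
}

# Three preprocessing stages: original data, discretised continuous variables,
# and principal component analysis retaining 95% of the variance.
stages = {
    "original": lambda clf: clf,
    "discretised": lambda clf: make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"), clf),
    "pca": lambda clf: make_pipeline(StandardScaler(), PCA(n_components=0.95), clf),
}

for stage_name, wrap in stages.items():
    for clf_name, clf in classifiers.items():
        model = wrap(clf)
        start = time.process_time()                   # CPU time, not wall-clock time
        scores = cross_val_score(model, X, y, cv=10)  # 10-fold accuracy
        cpu = time.process_time() - start
        print(f"{stage_name:11s} {clf_name:13s} "
              f"accuracy={scores.mean():.3f} cpu={cpu:.2f}s")
```

Timing the cross-validation loop with time.process_time() approximates the CPU-time criterion described above; in a full study each stage-classifier pair would be repeated across many datasets, and on noisy and cleaned variants, to assess robustness.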




Author information

Corresponding author

Correspondence to Neslihan Dogan.

About this article

Cite this article

Dogan, N., Tanrikulu, Z. A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Inf Technol Manag 14, 105–124 (2013). https://doi.org/10.1007/s10799-012-0135-8
