Abstract
We propose a generic decision tree framework that supports reusable component design. The framework decomposes decision tree induction into sub-problems, which we identified by analyzing well-known induction algorithms: ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE. From these algorithms, and from several of their partial improvements, we extracted reusable components that serve as solutions to the sub-problems of the generic framework. The identified components can now be used outside the algorithms they originate from. Combining reusable components allows the replication of the original algorithms, their modification, and the creation of new decision tree induction algorithms. Each original algorithm can outperform the others under specific conditions but may perform poorly when those conditions change; reusable components allow solutions to be exchanged between algorithms and new algorithms to be designed quickly. We offer a generic framework for component-based algorithm design that enhances the understanding, testing, and usability of decision tree algorithm parts.
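To illustrate the component-based idea in miniature, the sketch below treats the split-evaluation measure as an interchangeable component: the same split-search loop can behave like ID3 (entropy) or CART (Gini index) depending on which impurity function is plugged in. This is a minimal illustration of the general approach, not the paper's actual framework; all function names here are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, the impurity measure used by ID3/C4.5."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index, the impurity measure used by CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, impurity):
    """Find the (feature, threshold) pair minimizing weighted impurity.

    `impurity` is the pluggable component: swapping it changes the
    split behavior without touching the generic search loop.
    """
    best, best_score, n = None, float("inf"), len(labels)
    for f in range(len(rows[0])):
        for value in {row[f] for row in rows}:
            left = [y for row, y in zip(rows, labels) if row[f] <= value]
            right = [y for row, y in zip(rows, labels) if row[f] > value]
            if not left or not right:
                continue  # degenerate split, skip
            score = (len(left) / n) * impurity(left) \
                  + (len(right) / n) * impurity(right)
            if score < best_score:
                best_score, best = score, (f, value)
    return best, best_score

# The same loop replicates different algorithms' split selection:
split_e, _ = best_split([[1], [2], [3], [4]], ["a", "a", "b", "b"], entropy)
split_g, _ = best_split([[1], [2], [3], [4]], ["a", "a", "b", "b"], gini)
```

In the same spirit, the paper's framework exchanges components for other sub-problems as well (e.g. stopping and pruning), which is what allows original algorithms to be replicated, modified, or recombined into new ones.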
Cite this article
Suknovic, M., Delibasic, B., Jovanovic, M. et al. Reusable components in decision tree induction algorithms. Comput Stat 27, 127–148 (2012). https://doi.org/10.1007/s00180-011-0242-8