Abstract
Including categorical variables with many levels in a logistic regression model easily leads to a sparse design matrix. This can result in a large, ill-conditioned optimization problem, causing overfitting, extreme coefficient values and long run times. Inspired by recent developments in matrix factorization, we propose four new strategies for overcoming this problem. Each strategy uses a Factorization Machine that transforms the categorical variables with many levels into a few numeric variables, which are subsequently used in the logistic regression model. The application of Factorization Machines also allows for including interactions between the categorical variables with many levels, often substantially increasing model accuracy. The four strategies have been tested on four data sets, demonstrating superiority of our approach over other methods of handling categorical variables with many levels. In particular, our approach has been successfully used for developing high-quality risk models at the Netherlands Tax and Customs Administration.
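The abstract does not spell out the four strategies, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes a second-order Factorization Machine fitted by stochastic gradient descent on the log loss, two synthetic categorical variables, and a hypothetical choice of derived features (the linear part and the pairwise-interaction part of the FM score). All function names, hyperparameters and data are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fm_parts(model, row):
    """Return (linear part, pairwise part) of the FM score for one row.
    row[j] is the level index of categorical variable j (one-hot, so x = 1)."""
    w0, w, V = model
    linear = w0 + sum(w[j][lvl] for j, lvl in enumerate(row))
    vecs = [V[j][lvl] for j, lvl in enumerate(row)]
    pair = 0.0
    for f in range(len(vecs[0])):
        s = sum(v[f] for v in vecs)
        # Identity: sum_{j<l} a_j * a_l = 0.5 * ((sum a)^2 - sum a^2)
        pair += 0.5 * (s * s - sum(v[f] ** 2 for v in vecs))
    return linear, pair

def train_fm(rows, y, n_levels, k=2, lr=0.05, reg=0.001, epochs=30, seed=0):
    """Fit the FM with SGD on the log loss (hyperparameters are assumptions)."""
    rng = random.Random(seed)
    w = [[0.0] * n for n in n_levels]
    V = [[[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
         for n in n_levels]
    model = [0.0, w, V]
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            row = rows[i]
            g = sigmoid(sum(fm_parts(model, row))) - y[i]  # d(loss)/d(score)
            vecs = [V[j][lvl] for j, lvl in enumerate(row)]
            sums = [sum(v[f] for v in vecs) for f in range(k)]
            model[0] -= lr * g
            for j, lvl in enumerate(row):
                w[j][lvl] -= lr * (g + reg * w[j][lvl])
                v = V[j][lvl]
                for f in range(k):
                    # Gradient of the pairwise term w.r.t. v[f] is the sum
                    # of the other variables' factors in dimension f.
                    v[f] -= lr * (g * (sums[f] - v[f]) + reg * v[f])
    return model

def log_loss(model, rows, y):
    eps = 1e-12
    total = 0.0
    for row, t in zip(rows, y):
        p = min(max(sigmoid(sum(fm_parts(model, row))), eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(rows)

def fit_logreg(X, y, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression with an intercept."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, t in zip(X, y):
            g = sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))) - t
            grad[0] += g
            for j, xi in enumerate(x):
                grad[j + 1] += g * xi
        beta = [b - lr * gj / len(X) for b, gj in zip(beta, grad)]
    return beta

# Synthetic data: two categorical variables with 30 and 20 levels.
data_rng = random.Random(42)
n_levels = [30, 20]
rows = [[data_rng.randrange(30), data_rng.randrange(20)] for _ in range(600)]
y = [1 if (a % 2 == 0) != (b % 5 == 0) else 0 for a, b in rows]

model = train_fm(rows, y, n_levels)
# Two numeric columns replace 50 one-hot columns in the design matrix.
features = [list(fm_parts(model, row)) for row in rows]
beta = fit_logreg(features, y)
```

In practice the derived columns would sit alongside the other predictors of the logistic regression model. Fitting the Factorization Machine and the regression on the same rows, as done here for brevity, can leak target information, so a held-out split for the first stage is a sensible precaution.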
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pijnenburg, M., Kowalczyk, W. (2017). Extending Logistic Regression Models with Factorization Machines. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_32
DOI: https://doi.org/10.1007/978-3-319-60438-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60437-4
Online ISBN: 978-3-319-60438-1
eBook Packages: Computer Science; Computer Science (R0)