

DATA WITH PARTIAL MULTICOLLINEARITY HELPS TO RESOLVE OVERFIT PROBLEM IN

LINEAR MODELS

Solovei O.

Candidate of Technical Sciences (PhD), ORCID: 0000-0001-8774-7243, Kyiv University of Civil Building and Architecture, Kyiv, Povitroflotsky Avenue, 31, 03680

Abstract

Linear regression models are built on raw data that is assumed to have a linear relation between predictors and target and no multicollinearity between predictors [1]. However, multicollinearity can be complete or partial, and the second type may be successfully utilized in Ridge regression algorithms to solve the overfit problem.

Keywords: overfit, multicollinearity, Ridge regression, MSE, determination coefficient

In machine learning, the term overfit refers to the problem when a model fits its purpose only on a certain set of data and fails on held-out data. Overfit is visible when the mean square error (MSE) on the data used to train the model (train data) is less than the MSE on the data used to test the model (test data), and the determination coefficient R² on train data is greater than R² on test data. Overfit in linear regression may happen when the dataset has either a small number of informative variables or a small number of samples. In the first case, the data is shrunk to a smaller dimension with maximum information preserved [2]; in the second case, new features are constructed as polynomial combinations of the existing predictors up to a certain degree [3]. After the data is pre-processed in the mentioned ways, a linear model is built with a ridge (1) or lasso (2) regularization term added to the cost function of the linear regression [4].
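The overfit signature described above (train MSE below test MSE, train R² above test R²) can be checked numerically. A minimal sketch on synthetic data, not the article's dataset; the helper names `mse` and `r2` are introduced here for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

# Few samples and many features: ordinary least squares can interpolate
# the train set, which is exactly the overfit scenario described above.
n_train, n_test, p = 20, 30, 20
X = rng.normal(size=(n_train + n_test, p))
y = X[:, 0] + rng.normal(scale=2.0, size=n_train + n_test)  # one informative feature
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
print("train:", mse(y_tr, X_tr @ w), r2(y_tr, X_tr @ w))
print("test :", mse(y_te, X_te @ w), r2(y_te, X_te @ w))
# overfit signature: train MSE < test MSE and train R^2 > test R^2
```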

(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + α‖w‖₂² → min   (1)

(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + α‖w‖₁ → min   (2)

where ŷᵢ is the model's prediction for the sample with index i in the dataset; X is the matrix of feature values, so ŷᵢ = (Xw)ᵢ; w is the vector of the linear model's coefficients; α is a constant which balances the shrinkage of the model's coefficients against the model's fit to the data.
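Objective (1) has a closed-form minimizer, obtained by setting its gradient to zero; a NumPy sketch is given below (the lasso objective (2) has no closed form and needs an iterative solver such as coordinate descent [4]). The data here is synthetic and the function name `ridge_fit` is an assumption of this sketch:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize (1/n)*||y - X w||^2 + alpha*||w||_2^2.

    Setting the gradient to zero gives (X^T X / n + alpha*I) w = X^T y / n.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + alpha * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=50)

# A larger alpha shrinks the coefficient vector toward zero.
norms = [np.linalg.norm(ridge_fit(X, y, a)) for a in (0.0, 1.0, 100.0)]
print(norms)
```

With alpha = 0 the expression reduces to the ordinary least-squares solution; increasing alpha trades fit to the train data for smaller coefficients.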

For linear regression analysis to be performed, the standard assumptions demand that the raw data have no multicollinearity between predictors. However, when new features are added to the raw data as combinations of existing features, partial multicollinearity is introduced into the dataset, and under the mentioned standards the new features could not be included in the model. The current research demonstrates that partial multicollinearity among predictors, where the correlation coefficient is not equal to -1 or 1, can be used with ridge regression to resolve the overfit problem.

Let us consider the construction of a linear regression model for a dataset with 50 samples, three independent features x1, x2, x3, and target y. The raw data contains no samples with empty values, and a linear relation exists only between predictor x3 and target y, so the linear model can be built on the single predictor x3. After splitting the dataset into train and test parts and finding the linear model coefficients w0, w1 as the solution of w = (XᵀX)⁻¹Xᵀy, the received quality metrics are R² = 0.81, MSE = 93 on the train sub-set and R² = 0.67, MSE = 132 on the test sub-set (Pic 1), i.e., the model got overfitted and can't be used for predictions on new data.

Pic 1. Linear regression metrics for train and test data sets
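The experiment can be reproduced in outline. The original dataset is not published, so the sketch below generates a synthetic stand-in (50 samples, only x3 linearly related to y); the exact metric values will therefore differ from the R² = 0.81 / 0.67 reported above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x1 = rng.normal(size=n)  # present in the dataset but unrelated to y
x2 = rng.normal(size=n)  # present in the dataset but unrelated to y
x3 = rng.normal(size=n)
y = 2.0 * x3 + rng.normal(scale=1.0, size=n)  # only x3 is informative

# Design matrix with an intercept column and the single predictor x3.
X = np.column_stack([np.ones(n), x3])
X_tr, y_tr, X_te, y_te = X[:35], y[:35], X[35:], y[35:]

# Normal-equation solution w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)

def r2(Xs, ys):
    resid = ys - Xs @ w
    return float(1.0 - resid @ resid / np.sum((ys - ys.mean()) ** 2))

print("w0, w1 =", w)
print("train R^2 =", r2(X_tr, y_tr), " test R^2 =", r2(X_te, y_te))
```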

Ridge regression applied to the same dataset with different values of the regularization parameter alpha did not solve the overfit problem, as the model's metrics remained better on the train sub-set than on the test sub-set (Pic 2).

Pic 2. Ridge regression fails to resolve overfit problem
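The failed alpha sweep can be sketched as follows, again on synthetic stand-in data; with a single informative raw predictor, varying alpha mostly trades bias for variance without closing the train/test gap:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X_raw = rng.normal(size=(n, 3))
y = 2.0 * X_raw[:, 2] + rng.normal(size=n)
X_tr, y_tr, X_te, y_te = X_raw[:35], y[:35], X_raw[35:], y[35:]

def ridge(X, ys, alpha):
    # closed-form ridge solution (X^T X + alpha*I) w = X^T ys
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ ys)

def r2(X, ys, w):
    resid = ys - X @ w
    return float(1.0 - resid @ resid / np.sum((ys - ys.mean()) ** 2))

# Sweep the regularization parameter and compare train vs test quality.
for alpha in (0.1, 1.0, 10.0, 50.0):
    w = ridge(X_tr, y_tr, alpha)
    print(alpha, round(r2(X_tr, y_tr, w), 3), round(r2(X_te, y_te, w), 3))
```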

With the new polynomial features added, collinearity became visible between the predictors x2²x3, x1²x3 and the target y, with correlation coefficients 0.63 and 0.61 correspondingly; at the same time, partial multicollinearity is also introduced between the predictors x2²x3 and x1²x3 themselves (Pic 3).

Pic 3. Correlations in constructed data set
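Constructing such polynomial features and measuring the correlations can be sketched as below. The data is synthetic, so the correlation values will not match the 0.63/0.61 reported above; the point is that the inter-predictor correlation stays strictly between -1 and 1, i.e., the multicollinearity is partial:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x3 + rng.normal(size=n)

# Polynomial combinations of the raw predictors, as in the text.
f1 = x2 ** 2 * x3  # x2^2 * x3
f2 = x1 ** 2 * x3  # x1^2 * x3

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

print("corr(f1, y)  =", round(corr(f1, y), 2))
print("corr(f2, y)  =", round(corr(f2, y), 2))
print("corr(f1, f2) =", round(corr(f1, f2), 2))  # partial multicollinearity: |r| < 1
```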

Now we construct a new matrix X' with the values of the features x2²x3 and x1²x3 and apply Ridge regression. Pic 4 shows that when alpha equals 50, the quality metrics on the test sub-set become better than on the train sub-set, and the determination coefficient R² on the test sub-set reaches 0.5, which is good enough for the regression model to make predictions in the future.

Pic 4. Ridge regression resolves the overfit problem when partial multicollinearity is present
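The final step above can be sketched as follows: build X' from the constructed features and fit ridge with alpha = 50. The data is a synthetic stand-in, so the sketch illustrates the procedure rather than reproducing the reported R² = 0.5:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x3 + rng.normal(size=n)

# New design matrix X' built from the constructed polynomial features.
Xp = np.column_stack([np.ones(n), x2 ** 2 * x3, x1 ** 2 * x3])
X_tr, y_tr, X_te, y_te = Xp[:35], y[:35], Xp[35:], y[35:]

def ridge(X, ys, alpha):
    # closed-form ridge solution (X^T X + alpha*I) w = X^T ys
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ ys)

def r2(X, ys, w):
    resid = ys - X @ w
    return float(1.0 - resid @ resid / np.sum((ys - ys.mean()) ** 2))

w = ridge(X_tr, y_tr, 50.0)  # alpha = 50, as in the text
print("test R^2 =", round(r2(X_te, y_te, w), 3))
```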

Based on the received results, it can be concluded that partial multicollinearity between new features constructed as polynomial combinations of existing features can help to solve the overfit problem with the Ridge regression algorithm.

References

1. Gujarati, Damodar (2009). "Multicollinearity: What Happens if the Regressors Are Correlated?" Basic Econometrics (4th ed.). McGraw-Hill, p. 363.

2. Szlam, A., et al. (2014). "An Implementation of a Randomized Algorithm for Principal Component Analysis."

3. Blondel, M., Ishihata, M., Fujino, A., and Ueda, N. (2016). "Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms." Proc. of ICML 2016 (the 33rd International Conference on Machine Learning), pp. 850-858, New York, USA, June 2016.

4. Friedman, J., Hastie, T., and Tibshirani, R. (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software.
