DATA WITH PARTIAL MULTICOLLINEARITY HELPS TO RESOLVE OVERFIT PROBLEM IN LINEAR MODELS
Solovei O.
Candidate of Technical Sciences (PhD)
ORCID: 0000-0001-8774-7243
Kyiv University of Civil Building and Architecture
Kyiv, Povitroflotsky Avenue, 31, 03680
Abstract
Linear regression models are built on raw data that is assumed to have a linear relation between predictors and target and no multicollinearity between predictors [1]. However, multicollinearity can be complete or partial, and the second type of multicollinearity can be successfully utilized in Ridge regression algorithms to solve the overfit problem.
Keywords: overfit, multicollinearity, Ridge regression, MSE, determination coefficient
The term overfit in machine learning refers to a problem when a model fits its purpose only for a certain set of data and fails on held-out data. The overfit problem is visible when the mean squared error (MSE) on the data used to train the model (train data) is less than the MSE on the data used to test the model (test data), and the determination coefficient R² on train data is greater than R² on test data. Overfit in linear regression may happen when the dataset has either a small number of informative variables or a small number of samples. In the first case, the data is shrunk to a smaller dimension with maximum information preserved [2]; in the second case, new features are constructed as polynomial combinations of the existing predictors up to a certain degree [3]. After the data is pre-processed in the mentioned ways, a linear model is built with a ridge (1) or lasso (2) regularization term added to the cost function of the linear regression [4].
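The overfit signature described above (train MSE below test MSE, train R² above test R²) can be checked directly. A minimal numpy sketch (illustrative, not code from the article; the function names are the author's own):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Determination coefficient R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def looks_overfit(y_train, pred_train, y_test, pred_test):
    """Overfit signature from the text: lower MSE and higher R^2
    on train data than on test data."""
    return (mse(y_train, pred_train) < mse(y_test, pred_test)
            and r2(y_train, pred_train) > r2(y_test, pred_test))
```
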
\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2+\alpha\|w\|_2^2\to\min \quad (1)

\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2+\alpha\|w\|_1\to\min \quad (2)
where ŷᵢ is the model's prediction for the sample with index i in the dataset; X is the matrix of feature values; w is the vector of the linear model's coefficients; α is a constant which balances the shrinkage of the model's coefficients against the model's fit to the data.
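The two cost functions differ only in the norm applied to w. A minimal numpy sketch of both (an illustration under the notation above, not the article's code):

```python
import numpy as np

def ridge_cost(X, y, w, alpha):
    # Expression (1): mean squared residual + alpha * squared L2 norm of w
    residuals = y - X @ w
    return float(np.mean(residuals ** 2) + alpha * np.sum(w ** 2))

def lasso_cost(X, y, w, alpha):
    # Expression (2): mean squared residual + alpha * L1 norm of w
    residuals = y - X @ w
    return float(np.mean(residuals ** 2) + alpha * np.sum(np.abs(w)))
```
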
Linear regression analysis demands that the raw data have no multicollinearity between predictors. However, when new features are added to the raw data as combinations of existing features, partial multicollinearity is introduced into the dataset, and to follow the mentioned linear regression standards the new features could not be included in the model. The current research demonstrates that partial multicollinearity among predictors, when the correlation coefficient is not equal to −1 or 1, can be used with ridge regression to resolve the overfit problem.
Let us consider the construction of a linear regression model for a dataset with 50 samples, 3 independent features x1, x2, x3, and target y. In the raw data there are no samples with empty values, and a linear relation exists only between predictor x3 and target y, so a linear model can be built on the single predictor x3. After splitting the dataset into train and test parts and finding the linear model coefficients w0, w1 as the solution of the expression w = (XᵀX)⁻¹Xᵀy, the received model's quality metrics on the train sub-set are R² = 0.81, MSE = 93; on the test sub-set: R² = 0.67, MSE = 132 (Pic. 1), i.e., the model got overfitted and can't be used for predictions on new data.
Pic 1. Linear regression metrics for train and test data sets
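The experiment above can be sketched with the normal-equation solution from the text. This uses synthetic stand-in data (the article's dataset is not published), so the exact metric values R² = 0.81 / 0.67 will not be reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the article's data: 50 samples, target
# linearly related to x3 only (assumption for illustration).
n = 50
x3 = rng.uniform(0.0, 10.0, n)
y = 3.0 * x3 + rng.normal(0.0, 5.0, n)

# Design matrix with an intercept column; ordinary least squares via
# the normal equation w = (X^T X)^{-1} X^T y from the text.
X = np.column_stack([np.ones(n), x3])
train, test = slice(0, 35), slice(35, None)
w = np.linalg.inv(X[train].T @ X[train]) @ (X[train].T @ y[train])

pred_train = X[train] @ w
pred_test = X[test] @ w
```

Comparing MSE and R² of `pred_train` and `pred_test` against `y[train]` and `y[test]` then exposes any train/test gap.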
Ridge regression applied to the same dataset with different values of the regularization parameter alpha didn't solve the overfit problem, as the model's metrics remained better on the train sub-set compared to the test sub-set (Pic. 2).
Pic 2. Ridge regression fails to resolve overfit problem
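Sweeping alpha, as in this experiment, only requires the closed-form ridge solution, which adds alpha·I to the normal equation. A hedged numpy sketch (the function name is the author's own):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha*I)^{-1} X^T y.
    With alpha = 0 this reduces to ordinary least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```

Fitting with several alpha values (e.g. 0.1, 1, 10, 50, 100) and comparing train/test MSE and R² for each reproduces the kind of comparison shown in Pic. 2.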
With the new polynomial features added, collinearity became visible between the predictors x2²x3 and x1²x3 (Pic. 3). However, partial multicollinearity is also introduced between the predictors x2²x3, x1²x3 and the target y, with correlation coefficients 0.63 and 0.61 correspondingly.
Pic 3. Correlations in constructed data set
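Constructing such polynomial features and checking their pairwise correlations is straightforward. A sketch on synthetic stand-in data (the specific coefficients 0.63 and 0.61 from the article will not be reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 50-sample stand-in for the article's three predictors.
x1, x2, x3 = rng.uniform(1.0, 5.0, (3, 50))
y = 2.0 * x3 + rng.normal(0.0, 1.0, 50)

# Polynomial combinations analogous to the article's x2^2*x3 and x1^2*x3.
f1 = x2 ** 2 * x3
f2 = x1 ** 2 * x3

# Pairwise Pearson correlations; "partial" multicollinearity means the
# coefficient is large in absolute value but not exactly -1 or 1.
corr = np.corrcoef(np.vstack([f1, f2, y]))
corr_f1_f2 = corr[0, 1]
corr_f1_y = corr[0, 2]
```
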
Now we construct a new matrix X′ with the values of the features x2²x3 and x1²x3 and apply Ridge. Pic. 4 shows that when alpha equals 50, the quality metrics on the test sub-set became better compared to the train sub-set, and the determination coefficient R² on test reaches the value 0.5, which is good enough for the regression model to make predictions in the future.
Pic 4. Ridge regression resolves the overfit problem when partial multicollinearity is present
Based on the received results, it can be concluded that partial multicollinearity between new features constructed as polynomial combinations of existing features can help to solve the overfit problem with the Ridge regression algorithm.
References
1. Gujarati, Damodar (2009). "Multicollinearity: what happens if the regressors are correlated?". Basic Econometrics (4th ed.). McGraw-Hill. pp. 363.
2. Szlam, A., et al. (2014). "An implementation of a randomized algorithm for principal component analysis".
3. M. Blondel, M. Ishihata, A. Fujino, and N. Ueda, "Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms," Proc. of ICML 2016 (the 33rd International Conference on Machine Learning), pp. 850-858, New York, USA, June 2016.
4. Friedman, J., Hastie, T., Tibshirani, R. (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent". Journal of Statistical Software.