
An Exploration of Online Missing Value Imputation in Non-stationary Data Stream

  • Original Research
  • SN Computer Science

Abstract

Missing value imputation (MVI) is an important data preprocessing technique. Over the past decades, MVI has been widely studied, and most MVI approaches have been built on either statistical or machine learning techniques. However, previous methods focus only on static data and ignore imputation for dynamic online data. Intuitively, imputation errors may increase significantly when concept drift exists in the data stream. In this paper, we investigate the impact of applying conventional MVI methods to non-stationary data streams. We also propose two sliding time window-based strategies to alleviate this impact: a plain average strategy, and a logarithmic weighted average strategy that gradually increases the weights of instances along the time axis. The proposed strategies are combined with three popular MVI techniques, mean imputation (MI), KNN imputation (KNNI) and Bayesian principal component analysis imputation (BPCAI), to indicate that the effect of the strategies is independent of the specific MVI technique. Experimental results on synthetic data sets with three different types of concept drift and on two real-world drifting data sets demonstrate the effectiveness and feasibility of the proposed strategies. Moreover, the impact of the time window size is investigated to guide parameter settings in future practical applications.
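
The two window strategies can be illustrated with a minimal sketch for the mean imputation case. The abstract does not give the exact weighting formula, so the weights `ln(1 + i)` (with `i = 1` for the oldest instance in the window) are an assumption chosen only to make newer instances count more. The function names `plain_average`, `log_weighted_average` and `impute_stream` are hypothetical, as is the choice to keep only observed values in the window.

```python
import math
from collections import deque


def plain_average(window):
    """Plain strategy: every value in the window counts equally."""
    return sum(window) / len(window) if window else None


def log_weighted_average(window):
    """Weighted strategy: newer values get larger weights.

    Assumption: weight of the i-th oldest value is ln(1 + i); the
    paper's actual weighting function may differ.
    """
    if not window:
        return None
    weights = [math.log(1 + i) for i in range(1, len(window) + 1)]
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


def impute_stream(stream, window_size, strategy):
    """Replace None entries with a statistic of the most recent observed values."""
    window = deque(maxlen=window_size)  # oldest value is dropped automatically
    result = []
    for value in stream:
        if value is None:
            # Stays None if nothing has been observed yet.
            result.append(strategy(window))
        else:
            result.append(value)
            window.append(value)  # only observed values enter the window
    return result


# A stream whose mean drifts from 1.0 to 5.0 before a value goes missing.
stream = [1.0, 1.0, 1.0, 5.0, 5.0, None]
plain = impute_stream(stream, window_size=4, strategy=plain_average)
weighted = impute_stream(stream, window_size=4, strategy=log_weighted_average)
```

In this example the weighted estimate lies closer to the post-drift value than the plain one, which is the behaviour the logarithmic strategy is meant to provide. The same windowing could in principle wrap KNNI or BPCAI by fitting them on the window contents only.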



Acknowledgements

This work was supported by the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20191457, the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant No. KYCX20_3147, the Open Project of Artificial Intelligence Key Laboratory of Sichuan Province under Grant No. 2019RYJ02, the Natural Science Foundation of the Jiangsu Higher Education Institute of China under Grant No. 18KJB520050, the National Natural Science Foundation of China under Grants No. 61305058 and No. 61572242, the China Postdoctoral Science Foundation under Grants No. 2013M540404 and No. 2015T80481, and the Jiangsu Province 333 Project.

Author information

Corresponding author

Correspondence to Hualong Yu.

Ethics declarations

Conflict of Interest

The authors have declared that no conflict of interest exists.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dong, W., Gao, S., Yang, X. et al. An Exploration of Online Missing Value Imputation in Non-stationary Data Stream. SN COMPUT. SCI. 2, 57 (2021). https://doi.org/10.1007/s42979-021-00459-1

  • DOI: https://doi.org/10.1007/s42979-021-00459-1
