Incremental Statistical Measures

Learning in Non-Stationary Environments

Abstract

Statistical measures provide essential information about data and are needed for any kind of data analysis. They can be used in a purely exploratory context to describe properties of the data, but also as estimators for model parameters or in the context of hypothesis testing. For example, the mean value is a measure of location, but also an estimator for the expected value of the probability distribution from which the data are sampled. Statistical moments of higher order than the mean provide information about the variance, the skewness, and the kurtosis of a probability distribution. The Pearson correlation coefficient is a measure of linear dependency between two variables. In robust statistics, quantiles play an important role, since they are less sensitive to outliers. The median is an alternative measure of location, the interquartile range an alternative measure of dispersion. The application of statistical measures to data streams requires online calculation. Since data arrive step by step, incremental calculations are needed to avoid restarting the computation each time new data arrive and to save memory, so that the whole data set need not be kept in memory. Statistical measures like the mean, the variance, moments in general, and the Pearson correlation coefficient lend themselves readily to incremental computation, whereas recursive or incremental algorithms for quantiles are not as simple or obvious. Nonstationarity is another important aspect of data streams that needs to be taken into account: the parameters of the underlying sampling distribution might change over time. Change detection and online adaptation of statistical estimators are therefore required for nonstationary data streams. Hypothesis tests like the χ²-test or the t-test can be a basis for change detection, since they can also be calculated in an incremental fashion. Based on change detection strategies, one can derive information on the sampling strategy, for instance the optimal size of a time window for parameter estimation in nonstationary data streams.
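
As an illustration of the incremental computation the abstract refers to, the following is a minimal sketch of a running mean and variance using Welford's well-known update rule. It is not taken from the chapter, whose own formulas may differ; the class name `RunningStats` and its fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance, updated one observation at a time.

    Hypothetical helper for illustration; uses Welford's update rule,
    which may differ from the formulas given in the chapter.
    """
    n: int = 0          # number of observations seen so far
    mean: float = 0.0   # current mean
    m2: float = 0.0     # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Only n, mean and m2 are stored; past observations are not kept.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Unbiased sample variance; NaN for fewer than two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.variance)
```

Each update costs constant time and memory, which is the property that makes such measures suitable for data streams.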

Notes

  1. For precise definitions, see Sect. 2.4.

  2. We use capital letters here to distinguish between random variables and real numbers that are denoted by small letters.

  3. The interquartile range is the midrange containing 50% of the data and it is computed as the difference between the 75%- and the 25%-quantiles: \(\mathrm{IQR} = {x}_{0.75} - {x}_{0.25}\).

  4. Let \({x}_{{r}_{1}},{x}_{{r}_{2}},\ldots ,{x}_{{r}_{n}}\) be a sample in ascending order from the random variables \({X}_{1},\ldots ,{X}_{n}\). Then the empirical distribution function of the sample is given by

    $${S}_{X,n}\left (x\right ) = \left \{\begin{array}{lcl} 0 & \mbox{ if } & x \leq {x}_{{r}_{1}}, \\ \frac{k} {n} & \mbox{ if } & {x}_{{r}_{k}} < x \leq {x}_{{r}_{k+1}},\quad k = 1,\ldots ,n - 1, \\ 1 & \mbox{ if } & x > {x}_{{r}_{n}}. \end{array} \right.$$
    (2.59)

  5. This applies also to the t-test and the χ²-test.
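
The empirical distribution function in note 4 can be read as the fraction of sample values strictly smaller than x. A minimal sketch under that reading, assuming a sample without ties; the function name `empirical_cdf` is hypothetical and not from the chapter:

```python
from bisect import bisect_left


def empirical_cdf(sample, x):
    """Fraction of sample values strictly less than x (left-continuous form).

    Hypothetical helper for illustration: returns 0 for x at or below the
    smallest value, k/n between the k-th and (k+1)-th ordered values,
    and 1 above the largest value.
    """
    ordered = sorted(sample)
    # bisect_left returns the number of elements strictly less than x.
    return bisect_left(ordered, x) / len(ordered)


sample = [3.0, 1.0, 4.0, 2.0]
print(empirical_cdf(sample, 1.0))  # at the smallest value
print(empirical_cdf(sample, 2.5))  # between the 2nd and 3rd ordered values
print(empirical_cdf(sample, 5.0))  # above the largest value
```

Sorting on every call keeps the sketch short; for repeated queries one would sort once and reuse the ordered sample.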

Author information

Corresponding author

Correspondence to Katharina Tschumitschew.

Copyright information

© 2012 Springer Science+Business Media New York

About this chapter

Cite this chapter

Tschumitschew, K., Klawonn, F. (2012). Incremental Statistical Measures. In: Sayed-Mouchaweh, M., Lughofer, E. (eds) Learning in Non-Stationary Environments. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-8020-5_2

  • DOI: https://doi.org/10.1007/978-1-4419-8020-5_2

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-8019-9

  • Online ISBN: 978-1-4419-8020-5

  • eBook Packages: Engineering (R0)
