Abstract
Statistical measures provide essential and valuable information about data and are needed for any kind of data analysis. Statistical measures can be used in a purely exploratory context to describe properties of the data, but also as estimators for model parameters or in the context of hypothesis testing. For example, the mean value is a measure for location, but also an estimator for the expected value of a probability distribution from which the data are sampled. Statistical moments of higher order than the mean provide information about the variance, the skewness, and the kurtosis of a probability distribution. The Pearson correlation coefficient is a measure for linear dependency between two variables. In robust statistics, quantiles play an important role, since they are less sensitive to outliers. The median is an alternative measure of location, the interquartile range an alternative measure of dispersion. The application of statistical measures to data streams requires online calculation. Since data come in step by step, incremental calculations are needed to avoid to start the computation process each time new data arrive and to save memory so that not the whole data set needs to be kept in the memory. Statistical measures like the mean, the variance, moments in general, and the Pearson correlation coefficient render themselves easily to incremental computations, whereas recursive or incremental algorithms for quantiles are not as simple or obvious. Nonstationarity is another important aspect of data streams that needs to be taken into account. This means that the parameters of the underlying sampling distribution might change over time. Change detection and online adaptation of statistical estimators is required for nonstationary data streams. Hypothesis tests like the χ2- or the t-test can be a basis for change detection, since they can also be calculated in an incremental fashion. Based on change detection strategies, one can derive information on the sampling strategy, for instance the optimal size of a time window for parameter estimations of nonstationary data streams.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
For precise definitions, see Sect. 2.4.
- 2.
We use capital letters here to distinguish between random variables and real numbers that are denoted by small letters.
- 3.
The interquartile range is the midrange containing 50% of the data and it is computed as the difference between the 75%- and the 25%-quantiles: IQR = x 0. 75 − x 0. 25.
- 4.
Let \({x}_{{r}_{1}},{x}_{{r}_{2}},\ldots {x}_{{r}_{n}}\) be a sample in ascending order from the random variables X 1, …, X n . Then the empirical distribution function of the sample is given by
$${ S}_{X,n}\left (x\right ) = \left \{\begin{array}{lcl} 0 & \mbox{ if } & x \leq {x}_{{r}_{1}}, \\ \frac{k} {n} &\mbox{ if } & {x}_{{r}_{k}} < x \leq {x}_{{r}_{k+1}}, \\ 1 & \mbox{ if } & x > {x}_{{r}_{k}}. \end{array} \right.$$(2.59) - 5.
This applies also to the t-test and the χ2-test.
References
Aho, A.V., Ullman, J.D., Hopcroft, J.E.: Data Structures and Algorithms. Addison Wesley, Boston (1987)
Basseville, M., Nikiforov, I.: Detection of Abrupt Changes: Theory and Application (Prentice Hall information and system sciences series). Prentice Hall, Upper Saddle River, New Jersey (1993)
Beringer, J., Hüllermeier, E.: Effcient instance-based learning on data streams. Intelligent Data Analysis 11, 627–650 (2007)
Crawley, M.: Statistics: An Introduction using R. Wiley, New York (2005)
Dutta, S., Chattopadhyay, M.: A change detection algorithm for medical cell images. In: Proc. Intern. Conf. on Scientific Paradigm Shift in Information Technology and Management, pp. 524–527. IEEE, Kolkata (2011)
Fischer, R.: Moments and product moments of sampling distributions. In: Proceedings of the London Mathematical Society, Series 2, 30, pp. 199–238 (1929)
Fisz, M.: Probability Theory and Mathematical Statistics. Wiley, New York (1963)
Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining data streams under block evolution. SIGKDD Explorations 3, 1–10 (2002)
Gelper, S., Schettlinger, K., Croux, C., Gather, U.: Robust online scale estimation in time series: A model-free approach. Journal of Statistical Planning & Inference 139(2), 335–349 (2008)
Grieszbach, G., Schack, B.: Adaptive quantile estimation and its application in analysis of biological signals. Biometrical journal 35, 166–179 (1993)
Gustafsson, F.: Adaptive Filtering and Change Detection. Wiley, New York (2000)
Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70 (1979)
Hulten, G., Spencer, L., Domingos, P.: Mining time changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (2001)
Ikonomovska, E., Gama, J., Sebastião, R., Gjorgjevik, D.: Regression trees from data streams with drift detection. In: 11th int conf on discovery science, LNAI, vol 5808, pp. 121–135. Springer, Berlin (2009)
Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proc. 30th VLDB Conf., pp. 199–238. Toronto, Canada (2004)
Lai, T.: Sequential changepoint detection in quality control and dynamic systems. Journal of the Royal Statistical Society, Series B 57, 613–658 (1995)
Möller, E., Grieszbach, G., Schack, B., Witte, H.: Statistical properties and control algorithms of recursive quantile estimators. Biometrical Journal 42, 729–746 (2000)
Nevelson, M., Chasminsky, R.: Stochastic approximation and recurrent estimation. Verlag Nauka, Moskau (1972)
Qiu, G.: An improved recursive median filtering scheme for image processing. IEEE Transactions on Image Processing 5, 646–648 (1996)
Ruusunen, M., Paavola, M., Pirttimaa, M., Leiviska, K.: Comparison of three change detection algorithms for an electronics manufacturing process. In: Proc. 2005 IEEE International Symposium on Computational Intelligence in Robotics and Automation, pp. 679–683 (2005)
Shaffer, J.P.: Multiple hypothesis testing. Ann. Rev. Psych 46, 561–584 (1995)
Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC-Press, Boca Raton, Florida (1997)
Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-dimensional data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 667–676. ACM, New York (2007)
Spitzer, F.: Principles of Random Walk (2nd edition). Springer, Berlin (2001)
Tschumitschew, K., Klawonn, F.: Incremental quantile estimation. Evolving Systems 1, 253–264 (2010)
Tschumitschew, K., Klawonn, F.: The need for benchmarks with data from stochastic processes and meta-models in evolving systems. In: N.K.P. Angelov D. Filev (ed.) International Symposium on Evolving Intelligent Systems. SSAISB, Leicester, pp. 30–33 (2010)
Wang, K., Stolfo, S.: Anomalous payload-based network intrusion detection. In: E. Jonsson, A. Valdes, M. Almgren (eds.) Recent Advances in Intrusion Detection, pp. 203–222. Springer, Berlin (2004)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer Science+Business Media New York
About this chapter
Cite this chapter
Tschumitschew, K., Klawonn, F. (2012). Incremental Statistical Measures. In: Sayed-Mouchaweh, M., Lughofer, E. (eds) Learning in Non-Stationary Environments. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-8020-5_2
Download citation
DOI: https://doi.org/10.1007/978-1-4419-8020-5_2
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-8019-9
Online ISBN: 978-1-4419-8020-5
eBook Packages: EngineeringEngineering (R0)