Generalised linear model-based algorithm for detection of outliers in environmental data and comparison with semi-parametric outlier detection methods

https://doi.org/10.1016/j.apr.2019.01.010

Highlights

  • Method for detection of outliers in PM10 concentrations measured simultaneously with accompanying variables is proposed.

  • The suggested method is based on GLM describing PM10 concentrations based on known accompanying variables.

  • In the second step of the presented procedure the differences of original measurements from GLM fit are analysed.

  • The method is compared with two semi-parametric outlier detection methods formerly suggested by the authors.

  • It was shown that the presented procedure outperforms both distribution-free outlier detection methods for time series.

Abstract

Outliers are often present in large datasets of air pollutant concentrations. Existing methods for detection of outliers in environmental data can be divided into three groups depending on the character of the data: methods for time series, methods for time series measured simultaneously with accompanying variables and methods for spatial data. A number of methods suggested for the automatic detection of outliers in time series data are limited by the assumption of a known distribution of the analysed variable. Since environmental variables are often influenced by accompanying factors, their distribution is difficult to estimate. Using the known information about accompanying variables, together with appropriate methods for detection of outliers in time series measured simultaneously with accompanying variables, can significantly improve outlier detection. This paper presents a method for the automatic detection of outliers in PM10 aerosols measured simultaneously with accompanying variables. The method is based on a generalised linear model and subsequent analysis of the residuals. It exploits the additional information provided by the available accompanying variables. The results of the suggested procedure are compared with the results obtained using two distribution-free outlier detection methods for time series formerly suggested by the authors. The simulation-based comparison of the performance of all three procedures showed that the procedure presented in this paper effectively detects outliers that deviate by at least 5 standard deviations from the mean value of the neighbouring observations and outperforms both distribution-free outlier detection methods for time series.

Introduction

The negative effects of air pollution on human health, ecosystem and climate are widely discussed at a local, regional and global level. While primary pollutants are emitted into the atmosphere directly from various anthropogenic and natural sources, secondary pollutants are formed in the atmosphere by secondary chemical reactions from precursors.

Emissions of many pollutants (e.g. SO2, NOx) have decreased over recent years. However, a significant proportion of the population is still exposed to increased risk due to air pollution because they live in areas where air quality standards are exceeded (EEA, 2018). The most hazardous concentrations of pollutants occur mainly in large urban areas, where the increase may be also caused by long-distance transport.

An improvement in air quality and a reduction in the risk of air pollution is therefore one of the main objectives. The limits for ambient concentrations of air pollutants are set by the Ambient Air Quality Directive of the Council and the European Parliament (EU, 2008) and by the Air Quality Guidelines (WHO, 2005) of the WHO.

For air pollution evaluation and investigation, continuous monitoring of the chemical composition and concentrations of significant air pollutants is needed. However, large datasets of concentrations of air pollutants often include outliers, observations which appear inconsistent with the rest of the dataset (Barnett and Lewis, 1978). These outliers may result from unusual measurement conditions, experimental errors, or from the natural variability of the analysed variable.

Because the outliers in a dataset can significantly affect future analysis and modelling, their detection, usually performed within data validation, plays an important role in environmental data analysis (Filzmoser, 2005, Garces and Sbarbaro, 2011). Of course, outlier detection must be followed by outlier treatment, because a correct outlying measurement gives useful information about abnormal behaviour of the analysed variable. For this reason, the quality of the detected outliers must be further evaluated by a specialised operator who decides on the removal, revision or retention of the outliers in the data.

Despite the large data sets that are being measured, data control and outlier detection are often performed manually. This approach is inadequate from a statistical point of view, because manual inspection detects only the observations that clearly deviate from the other measurements. Measurements whose deviations from other measurements are less obvious remain unnoticed in the data.

The problem of the detection of outlier values is widely discussed in the scientific literature and a number of parametric as well as non-parametric methods for its solution have been proposed. In general, methods for the detection of outliers in environmental data can be divided into three groups as follows, based on the character of data: methods for time series data measured without accompanying variables, hereinafter denoted as type 1 data; methods for data measured simultaneously with accompanying variables, hereinafter denoted as type 2 data; methods for spatio-temporal data obtained from a grid of monitoring stations, hereinafter denoted as type 3 data.

Methods and algorithms for the detection of outlier values in both one-dimensional and multi-dimensional type 1 data sets can be found in (Gupta et al., 2014, Barnett, 2004, Ben-Gal, 2005, Chandola et al., 2009, Iglewicz and Hoaglin, 1993). Procedures for the detection of outlier values in time series are given, for example in (Fox, 1972, Burman and Otto, 1988). In (Bobia et al., 2015, Shaadan et al., 2015; O'Leary et al., 2016) the methods for the identification of outliers in spatial environmental data of type 3 are given.

For the use of most methods for type 1 data, an assumption about the distribution or model of the analysed data is required. However, concentrations of air pollutants are influenced by many factors that are quite often unknown, and the distribution of such data cannot be easily estimated. Therefore in the event that the concentrations are measured simultaneously with accompanying variables the use of methods suitable for type 2 data may lead to an improvement in the results.

This paper presents a method for the automatic detection of outliers in particulate matter (PM10 aerosols) measured simultaneously with accompanying variables.

Particulate matter is one of the most significant European pollutants that continues to exceed EU limits (EEA, 2017). It originates from natural sources (e.g. forest fires, volcanoes, dust storms, sea spray) as well as from anthropogenic sources (automotive transportation, industrial and agricultural activities, coal combustion, burning of waste and biomass, road dust etc.) (Kim et al., 2015).

A large number of epidemiological studies (Abrutzky et al., 2012, Pope et al., 1995, Pope and Dockery, 2006, Restrepo et al., 2012) have reported the existence of a statistical association between health effects and ambient PM10 concentrations. As shown in (Hrdličková et al., 2008; Hübnerová and Michálek, 2014; Mikuška et al., 2017, Křůmal et al., 2017), PM10 concentrations are influenced by miscellaneous factors including meteorological variables, the heating season or the specific day of the week. Continuous monitoring of the concentrations and composition of PM10 particles is essential for the prediction and evaluation of periods with a high concentration of PM10.

The dependence of particulate matter on various covariates including meteorological variables is given in (Hübnerová and Michálek, 2014, Mikuška et al., 2017, Křůmal et al., 2017, Hrdličková et al., 2008). Therefore the influence of accompanying variables should be considered within the data analysis method.

In (Čampulová et al., 2017, 2018b; Holešovský et al., 2018) the authors proposed outlier detection methods for type 1 data based on local kernel smoothing. The principle of these methods is to smooth the original data and thus remove the variable frequency term. This approach partly compensates for the influence of unknown variables, because the smoothing residuals, then further analysed, are free of any data trend caused by accompanying variables. The approach based on local kernel smoothing is appropriate especially in a situation when the measurement of accompanying covariates is unavailable.
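The smoothing idea described above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the Gaussian Nadaraya-Watson smoother, the bandwidth, the synthetic hourly series and the median/MAD rule with a threshold of 5 are all assumptions chosen for illustration, and the residual analysis used in the cited methods is not described in this section.

```python
import math
import random
import statistics


def kernel_smooth(t, y, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel at each time point."""
    fitted = []
    for t0 in t:
        w = [math.exp(-0.5 * ((ti - t0) / bandwidth) ** 2) for ti in t]
        fitted.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return fitted


def robust_flags(residuals, threshold=5.0):
    """Indices whose residual lies more than `threshold` robust SDs from the median."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad  # MAD rescaled to estimate a normal standard deviation
    return [i for i, r in enumerate(residuals) if abs(r - med) > threshold * scale]


# Synthetic hourly series (assumption): daily cycle plus noise, one injected spike.
random.seed(1)
t = list(range(240))
y = [30 + 10 * math.sin(2 * math.pi * ti / 24) + random.gauss(0, 2) for ti in t]
y[100] += 40  # hypothetical outlier

fitted = kernel_smooth(t, y, bandwidth=2.0)
residuals = [yi - fi for yi, fi in zip(y, fitted)]
flagged = robust_flags(residuals)
print(flagged)
```

Because the trend is removed before thresholding, the rule reacts to the spike rather than to the daily cycle. The choice of bandwidth governs how much of the trend, and how much of the spike itself, is absorbed by the smoother.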

As already mentioned, observations of the factors influencing the concentrations of the analysed pollutants, if available, can significantly improve outlier detection algorithms. One way to include information about the behaviour of accompanying factors in the analysis is to use a suitable model describing the relationship between the studied variable and the influencing factors. Fitting such a model is then one of the core parts of the outlier detection procedure.

The approaches for modelling spatiotemporal variability of environmental type 3 data based on variogram models are given for example in (O'Leary and Lemke, 2014, Miller et al., 2010). However, such approaches cannot be applied to the data presented in our study, since we do not have enough monitoring stations to estimate variograms. For this reason the analysis presented in this paper is based on models suitable for type 2 data.

An iterative algorithm for outlier detection based on robust scaling of Partial Least Squares regression was presented in (Bao and Dai, 2009). In (Garces and Sbarbaro, 2011) outliers in the environmental data were detected using nonlinear regression with Multilayer Perceptron neural networks for the inner model in Partial Least Squares regression (Baffi et al., 1999). Kriging-based detection of outliers in spatial data was performed in (Araki et al., 2017). In (Rahman et al., 2012) the outliers were diagnosed based on DFFITS and Cook's Distance from multiple linear regression.

If a simple linear regression model is used, outliers can be detected using the robust outlier test (Rice and Spiegelhalter, 2006). The robust outlier test (Rice and Spiegelhalter, 2006) was further extended in (Lourenço and Pires, 2014) for the use in multiple M-regression. In (She and Owen, 2011) non-convex penalized regression was used as a tool for outlier detection.

The relationship between concentrations of PM10 aerosols and accompanying factors can be described using multiple linear regression or generalised linear models. In (Hormann et al., 2005), the square root of PM10 concentrations was predicted based on covariates using multiple linear regression. In (Chaloulakou et al., 2003, Stadlober et al., 2008, Stadlober et al., 2012) the relationship between PM10 aerosols and accompanying variables was modelled using linear regression with logarithmic transformations and a square root transform. In (Hrdličková et al., 2008) and (Hübnerová and Michálek, 2014) the PM10 concentrations were predicted based on a generalised linear model (GLM) with a gamma distribution of dependent variables and logarithmic link function.
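For reference, a gamma GLM with a logarithmic link can be written as follows. The notation is an assumption made here for illustration ($Y_i$ for the PM10 concentration at time $i$, $x_i$ for the vector of accompanying variables, $\beta$ for the regression coefficients, $\nu$ for the gamma shape parameter) and need not match the notation of the cited works.

```latex
Y_i \sim \mathrm{Gamma}(\mu_i, \nu), \qquad
\mathrm{E}[Y_i] = \mu_i, \qquad
\mathrm{Var}(Y_i) = \frac{\mu_i^{2}}{\nu}, \qquad
\log \mu_i = x_i^{\top} \beta .
```

Under the logarithmic link the covariates act multiplicatively on the mean, and the variance grows with the square of the mean, which suits positive, right-skewed concentration data.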

The aim of this paper is to detect observations inconsistent with the rest of the measurements of PM10 concentrations and thus simplify time-consuming manual validation of the data. For this purpose, a procedure for outlier detection in type 2 data (PM10 concentrations measured simultaneously with accompanying variables) is presented. The core idea of the method, as briefly described in (Čampulová et al., 2018a), is to model the concentrations of PM10 aerosols using a generalised linear model and to subsequently analyse the residuals.
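The two-stage idea (fit a GLM, then analyse the residuals) can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of the paper: it uses a single hypothetical covariate, a gamma GLM with log link fitted by iteratively reweighted least squares, synthetic data, and a median/MAD rule with a threshold of 5 in place of the residual analysis, which is not specified in this section.

```python
import math
import random
import statistics


def ols(x, z):
    """Intercept and slope of the least-squares line of z on x."""
    mx, mz = statistics.fmean(x), statistics.fmean(z)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    slope = sxz / sxx
    return mz - slope * mx, slope


def fit_gamma_glm_log_link(x, y, n_iter=25):
    """Fit log(mu) = b0 + b1 * x for a gamma response by IRLS.

    With the gamma family and a log link the IRLS weights are constant,
    so each iteration is an unweighted least-squares fit of the working
    response z = eta + (y - mu) / mu on x.
    """
    b0, b1 = ols(x, [math.log(v) for v in y])  # starting values
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        b0, b1 = ols(x, z)
    return b0, b1


def flag_outliers(x, y, threshold=5.0):
    """Step 1: fit the GLM. Step 2: flag large relative residuals."""
    b0, b1 = fit_gamma_glm_log_link(x, y)
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    res = [(yi - m) / m for yi, m in zip(y, mu)]  # Pearson-type residuals
    med = statistics.median(res)
    mad = statistics.median([abs(r - med) for r in res])
    scale = 1.4826 * mad  # robust estimate of the residual spread
    return [i for i, r in enumerate(res) if abs(r - med) > threshold * scale]


# Synthetic data (assumption): one covariate, gamma response with shape 20.
random.seed(0)
x = [random.uniform(0.0, 1.0) for _ in range(200)]
true_mu = [math.exp(3.0 + 0.5 * xi) for xi in x]
y = [random.gammavariate(20.0, m / 20.0) for m in true_mu]
y[50] = 10.0 * true_mu[50]  # hypothetical outlier

flagged = flag_outliers(x, y)
print(flagged)
```

Residuals are taken relative to the fitted mean because the gamma variance grows with the mean, so a fixed absolute threshold would over-flag periods of high concentration. In practice a statistical package would replace the hand-written IRLS loop and allow several covariates.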

The results of the algorithm presented are compared with the results obtained using two methods for type 1 data suggested in (Čampulová et al., 2018b) and (Holešovský et al., 2018) for the case that the measurements of accompanying variables are unknown. This way, we verify that the use of methods for type 2 data (when the accompanying variables are included in the analysis) improves the results when compared to methods for type 1 data.

All three outlier detection procedures were applied with the aim of detecting outliers in concentrations of PM10 aerosols measured at a monitoring station in Brno, Czech Republic and their performance was compared based on the simulations.

Although the outlier detection problem has been addressed by many authors, as cited, the approach presented in this paper is specific and original since it is based on a GLM and takes into consideration information about accompanying variables. Moreover, the core idea of the suggested procedure is expected to be an improvement when compared to previously cited methods that are often based only on interquartile ranges (or histograms) or that are limited by assumptions about the distribution type of the analysed variable (without considering the measurement of accompanying variables). Thus, the difference from the previous research referred to is that the procedure presented in this paper is suggested for type 2 data and based on a model suitable for prediction of PM10 concentrations. The comparison with the methods for type 1 data formerly proposed by the authors is covered in a real data example as well as in a simulation study.

The paper is organized as follows: in the next section we describe the data and give an overview of the characteristics of the monitoring station. In Section 3 the methodology is described. Our focus is on introducing the GLM as well as on the proposed outlier detection procedure itself. Further, we briefly describe Method I and Method II for outlier detection. Section 4 covers the results obtained by analysing both real and simulated data. The discussion is given in Section 5 and the conclusions are summarised in the final Section 6.

Section snippets

Data

The algorithm being presented is applied to detect outliers in concentrations of atmospheric aerosol (particulate matter) PM10 measured hourly at the Zvonarka monitoring station, which is situated in Brno, Czech Republic and operated by the Brno City Municipality (BCM). The monitoring period was from November 2006 until November 2015.

The concentrations were measured using a GRIMM 180 that operates on the principle of the optical method and radiation scattering on particles.

Brno, with 430,000

Methodology

As already mentioned in the introduction, the number of stations from which we have data is not sufficient for spatial statistics, since we cannot estimate the variogram necessary for applying spatio-temporal models. For this reason we model PM10 concentrations using a GLM. Observations from neighbouring stations could be included in the model as regressors, but this is not the purpose of the paper.

In the following paragraphs a description of the GLM based procedure suggested in this

Results

In this section we illustrate the GLM based outlier detection procedure as presented with the aim of detecting outliers in PM10 concentrations. Detected outliers are compared with the results obtained using Method I and Method II discussed briefly earlier and then a comparison of individual procedures is performed on the basis of simulations.

Discussion

Comparing Fig. 4 graph a) and Fig. 6 graph a) we can see that the GLM fit and kernel regression estimate differ most in the time instants where some of the evidently outlying observations occur – see 3.12, 7.12, 8.12, 19.12, and 27.12. The outliers occurring on 3.12, 8.12, 19.12, and 27.12 can be explained by accompanying regressors since they are well modelled by the curve representing the fit of GLM (6) (see Fig. 4 graph a)). The outlying observations occurring on 3.12, 8.12, and 27.12 were

Conclusion

This paper presents a two-step method for the automatic identification of outliers in type 2 environmental data - that is, data measured simultaneously with accompanying variables.

In the first step, a GLM predicting the observations of the analysed variable based on the known measurements of the accompanying variables is fitted. Subsequently, in the second step, outlier differences of measurements from the values fitted by the GLM are identified. The result is a set of potential outliers that

Acknowledgement

The paper was written with the support of the research project DZRO PASVŘ II, Ministry of Defence, Czech Republic.

References (62)

  • V.M. Lourenço et al.

    M-regression, false discovery rates and outlier detection with application to genetic association studies

    Comput. Stat. Data Anal.

    (2014)
  • P. Mikuška et al.

    Seasonal variability of monosaccharide anhydrides, resin acids, methoxyphenols and saccharides in PM2.5 in Brno, the Czech Republic

    Atmos. Pollut. Res.

    (2017)
  • L. Miller et al.

    Intra-urban correlation and spatial variability of air toxics across an international airshed in Detroit, Michigan (USA) and Windsor, Ontario (Canada)

    Atmos. Environ.

    (2010)
  • B. O'Leary et al.

    Modeling spatiotemporal variability of intra-urban air pollutants in Detroit: a pragmatic approach

    Atmos. Environ.

    (2014)
  • B. O'Leary et al.

    Identification and influence of spatio-temporal outliers in urban air quality measurements

    Sci. Total Environ.

    (2016)
  • N. Shaadan et al.

    Anomaly detection and assessment of PM10 functional data at several locations in the Klang Valley, Malaysia

    Atmos. Pollut. Res.

    (2015)
  • E. Stadlober et al.

    Quality and performance of a PM10 daily forecasting model

    Atmos. Environ.

    (2008)
  • R. Abrutzky et al.

    Health effects of climate and air pollution in Buenos Aires: a first time series analysis

    J. Environ. Protect.

    (2012)
  • A. Agresti

    Categorical Data Analysis

    (2002)
  • H. Akaike

    A new look at the statistical model identification

    IEEE Trans. Automat. Contr.

    (1974)
  • V. Barnett

    Environmental Statistics: Methods and Applications

    (2004)
  • V. Barnett et al.

    Outliers in Statistical Data

    (1978)
  • J. Beirlant et al.

    Statistics of Extremes: Theory and Applications

    (2004)
  • Ben-Gal

    Outlier detection

  • M. Bobia et al.

    Spatial outlier detection in the PM10 monitoring network of Normandy (France)

    Atmos. Pollut. Res.

    (2015)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2009. Brno Municipality

    (2010)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2010. Brno Municipality

    (2011)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2011. Brno Municipality

    (2012)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2012. Brno Municipality

    (2013)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2013. Brno Municipality

    (2014)
  • a.s Brněnské komunikace

    Ročenka Dopravy Brno 2014. Brno Municipality

    (2015)

Peer review under responsibility of Turkish National Committee for Air Pollution Research and Control.