Generalised linear model-based algorithm for detection of outliers in environmental data and comparison with semi-parametric outlier detection methods
Introduction
The negative effects of air pollution on human health, ecosystems and climate are widely discussed at local, regional and global levels. While primary pollutants are emitted into the atmosphere directly from various anthropogenic and natural sources, secondary pollutants are formed in the atmosphere by chemical reactions from precursors.
Emissions of many pollutants (e.g. SO2, NOx) have decreased over recent years. However, a significant proportion of the population is still exposed to increased risk due to air pollution because they live in areas where air quality standards are exceeded (EEA, 2018). The most hazardous concentrations of pollutants occur mainly in large urban areas, where the increase may be also caused by long-distance transport.
An improvement in air quality and a reduction in the risk of air pollution is therefore one of the main objectives. The limits for ambient concentrations of air pollutants are set by the Ambient Air Quality Directive of the Council and the European Parliament (EU, 2008) and by the Air Quality Guidelines (WHO, 2005) of the WHO.
For air pollution evaluation and investigation, continuous monitoring of the chemical composition and concentrations of significant air pollutants is needed. However, large datasets of concentrations of air pollutants often include outliers, observations which appear inconsistent with the rest of the dataset (Barnett and Lewis, 1978). These outliers may result from unusual measurement conditions, experimental errors, or from the natural variability of the analysed variable.
Because the outliers in a dataset can significantly affect future analysis and modelling, their detection, usually performed within data validation, plays an important role in environmental data analysis (Filzmoser, 2005, Garces and Sbarbaro, 2011). Of course, outlier detection must be followed by outlier treatment, because a correct outlying measurement gives useful information about abnormal behaviour of the analysed variable. For this reason, the quality of the detected outliers must be further evaluated by a specialised operator who decides on the removal, revision or retention of the outliers in the data.
Despite the large volume of measured data, data control and outlier detection are often performed manually. From a statistical point of view this approach is inadequate, because manual inspection detects only the observations that clearly deviate from the other measurements. Measurements whose deviations are less obvious remain unnoticed in the data.
The problem of the detection of outlier values is widely discussed in the scientific literature and a number of parametric as well as non-parametric methods for its solution have been proposed. In general, methods for the detection of outliers in environmental data can be divided into three groups as follows, based on the character of data: methods for time series data measured without accompanying variables, hereinafter denoted as type 1 data; methods for data measured simultaneously with accompanying variables, hereinafter denoted as type 2 data; methods for spatio-temporal data obtained from a grid of monitoring stations, hereinafter denoted as type 3 data.
Methods and algorithms for the detection of outlier values in both one-dimensional and multi-dimensional type 1 data sets can be found in (Gupta et al., 2014, Barnett, 2004, Ben-Gal, 2005, Chandola et al., 2009, Iglewicz and Hoaglin, 1993). Procedures for the detection of outlier values in time series are given, for example in (Fox, 1972, Burman and Otto, 1988). In (Bobia et al., 2015, Shaadan et al., 2015; O'Leary et al., 2016) the methods for the identification of outliers in spatial environmental data of type 3 are given.
Most methods for type 1 data require an assumption about the distribution or model of the analysed data. However, concentrations of air pollutants are influenced by many factors that are quite often unknown, and the distribution of such data cannot be easily estimated. Therefore, when the concentrations are measured simultaneously with accompanying variables, the use of methods suitable for type 2 data may improve the results.
This paper presents a method for the automatic detection of outliers in particulate matter (PM10 aerosols) measured simultaneously with accompanying variables.
Particulate matter is one of the most significant European pollutants that continues to exceed EU limits (EEA, 2017). It originates from natural sources (e.g. forest fires, volcanoes, dust storms, sea spray) as well as from anthropogenic sources (automotive transportation, industrial and agricultural activities, coal combustion, burning of waste and biomass, road dust etc.) (Kim et al., 2015).
A large number of epidemiological studies (Abrutzky et al., 2012, Pope et al., 1995, Pope and Dockery, 2006, Restrepo et al., 2012) have reported the existence of a statistical association between health effects and ambient PM10 concentrations. As shown in (Hrdličková et al., 2008; Hübnerová and Michálek, 2014; Mikuška et al., 2017, Křůmal et al., 2017), PM10 concentrations are influenced by miscellaneous factors including meteorological variables, the heating season or the specific day of the week. Continuous monitoring of the concentrations and composition of PM10 particles is essential for the prediction and evaluation of periods with a high concentration of PM10.
The dependence of particulate matter on various covariates including meteorological variables is given in (Hübnerová and Michálek, 2014, Mikuška et al., 2017, Křůmal et al., 2017, Hrdličková et al., 2008). Therefore the influence of accompanying variables should be considered within the data analysis method.
In (Čampulová et al., 2017, 2018b; Holešovský et al., 2018) the authors proposed outlier detection methods for type 1 data based on local kernel smoothing. The principle of these methods is to smooth the original data and thus remove the variable frequency term. This approach partly compensates for the influence of unknown variables, because the smoothing residuals, then further analysed, are free of any data trend caused by accompanying variables. The approach based on local kernel smoothing is appropriate especially in a situation when the measurement of accompanying covariates is unavailable.
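The smoothing-and-residuals idea behind these type 1 methods can be illustrated with a minimal sketch. It assumes a Gaussian-kernel Nadaraya-Watson smoother with a fixed bandwidth and flags residuals with a simple median/MAD rule. The function names, the bandwidth, the threshold of 3.5 and the simulated series are illustrative assumptions and do not reproduce the procedures of the cited papers.

```python
import math
import random
import statistics


def gaussian_kernel_smooth(t, y, bandwidth):
    """Nadaraya-Watson estimate of the trend at every time point in t."""
    fitted = []
    for t0 in t:
        weights = [math.exp(-0.5 * ((ti - t0) / bandwidth) ** 2) for ti in t]
        fitted.append(sum(w * yi for w, yi in zip(weights, y)) / sum(weights))
    return fitted


def flag_outliers(residuals, k=3.5):
    """Indices whose residual lies more than k robust SDs from the median.

    The median/MAD rule is an assumed stand-in for the residual analysis.
    """
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad  # MAD rescaled to match the SD under normality
    return [i for i, r in enumerate(residuals) if abs(r - med) > k * scale]


# Simulated series (hypothetical): smooth trend + noise + one injected spike.
random.seed(0)
t = list(range(300))
y = [30 + 10 * math.sin(2 * math.pi * ti / 100) + random.gauss(0, 2) for ti in t]
y[120] += 25  # injected outlier

trend = gaussian_kernel_smooth(t, y, bandwidth=5.0)
residuals = [yi - fi for yi, fi in zip(y, trend)]  # trend-free part of the data
flagged = flag_outliers(residuals)
print(flagged)
```

The bandwidth controls how much of the slow variation is absorbed by the trend estimate: a value that is too small lets the smoother follow the spike itself, while a value that is too large leaves part of the trend in the residuals.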
As already mentioned, observations of the factors influencing the concentrations of the analysed pollutants, if available, can significantly improve outlier detection algorithms. One way to include information about the behaviour of accompanying factors in the analysis is to use a suitable model describing the relationship between the studied variable and the influencing factors. Fitting such a model is then one of the core parts of the outlier detection procedure.
The approaches for modelling spatiotemporal variability of environmental type 3 data based on variogram models are given for example in (O'Leary and Lemke, 2014, Miller et al., 2010). However, such approaches cannot be applied to the data presented in our study, since we do not have enough monitoring stations to estimate variograms. For this reason the analysis presented in this paper is based on models suitable for type 2 data.
An iterative algorithm for outlier detection based on robust scaling of Partial Least Squares regression was presented in (Bao and Dai, 2009). In (Garces and Sbarbaro, 2011) outliers in environmental data were detected using nonlinear regression with Multilayer Perceptron neural networks for the inner model in Partial Least Squares regression (Baffi et al., 1999). Kriging-based detection of outliers in spatial data was performed in (Araki et al., 2017). In (Rahman et al., 2012) the outliers were diagnosed based on DFFITS and Cook's Distance from multiple linear regression.
If a simple linear regression model is used, outliers can be detected using the robust outlier test (Rice and Spiegelhalter, 2006). The robust outlier test (Rice and Spiegelhalter, 2006) was further extended in (Lourenço and Pires, 2014) for the use in multiple M-regression. In (She and Owen, 2011) non-convex penalized regression was used as a tool for outlier detection.
The relationship between concentrations of PM10 aerosols and accompanying factors can be described using multiple linear regression or generalised linear models. In (Hormann et al., 2005), the square root of PM10 concentrations was predicted based on covariates using multiple linear regression. In (Chaloulakou et al., 2003, Stadlober et al., 2008, Stadlober et al., 2012) the relationship between PM10 aerosols and accompanying variables was modelled using linear regression with logarithmic transformations and a square root transform. In (Hrdličková et al., 2008) and (Hübnerová and Michálek, 2014) the PM10 concentrations were predicted based on a generalised linear model (GLM) with a gamma distribution of dependent variables and logarithmic link function.
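For reference, a gamma GLM with a logarithmic link can be written in the generic form below. The notation (Y_i for the PM10 concentration, x_ij for the covariates, beta_j for the coefficients, nu for the shape parameter) is assumed here for illustration and is not taken from the cited papers.

```latex
Y_i \sim \mathrm{Gamma}(\mu_i, \nu), \qquad
\mathrm{E}[Y_i] = \mu_i, \qquad
\mathrm{Var}[Y_i] = \frac{\mu_i^2}{\nu}, \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.
```

Because the variance grows with the square of the mean, this model suits positive, right-skewed concentration data with a roughly constant coefficient of variation.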
The aim of this paper is to detect observations inconsistent with the rest of the measurements of PM10 concentrations and thus simplify time-consuming manual validation of the data. For this purpose, a procedure for outlier detection in type 2 data (PM10 concentrations measured simultaneously with accompanying variables) is presented. The core idea of the method, as briefly described in (Čampulová et al., 2018a), is to model the concentrations of PM10 aerosols using a generalised linear model and to subsequently analyse the residuals.
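The two-step idea (fit a GLM, then analyse the residuals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it uses a gamma GLM with log link and a single hypothetical covariate (temperature), fits it by iteratively reweighted least squares, and flags Pearson residuals with an assumed median/MAD rule. All names, the simulated data and the threshold are illustrative.

```python
import math
import random
import statistics


def ols(x, z):
    """Least-squares intercept and slope of z on a single covariate x."""
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    slope = sxz / sxx
    return mz - slope * mx, slope


def fit_gamma_glm_log(x, y, n_iter=100, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x for a gamma response by IRLS.

    With a log link and variance function V(mu) = mu^2 the IRLS weights
    are constant, so each step is an ordinary least-squares fit of the
    working response z = eta + (y - mu) / mu on x.
    """
    b0, b1 = ols(x, [math.log(v) for v in y])  # starting values
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        z = [e + (yi - math.exp(e)) / math.exp(e) for e, yi in zip(eta, y)]
        new_b0, new_b1 = ols(x, z)
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1


def flag_outliers(residuals, k=3.5):
    """Assumed rule: flag residuals more than k robust SDs from the median."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    return [i for i, r in enumerate(residuals) if abs(r - med) > k * 1.4826 * mad]


# Simulated stand-in for PM10 with one hypothetical covariate (temperature).
random.seed(1)
n, shape = 500, 20.0
temp = [random.uniform(-10.0, 25.0) for _ in range(n)]
true_mu = [math.exp(3.5 - 0.03 * t) for t in temp]
pm10 = [random.gammavariate(shape, m / shape) for m in true_mu]
injected = [50, 150, 250, 350, 450]
for i in injected:
    pm10[i] = 5.0 * true_mu[i]  # injected outliers

# Step 1: fit the GLM. Step 2: analyse the Pearson residuals.
b0, b1 = fit_gamma_glm_log(temp, pm10)
fitted = [math.exp(b0 + b1 * t) for t in temp]
pearson = [(y - m) / m for y, m in zip(pm10, fitted)]
flagged = flag_outliers(pearson)
print(b0, b1, flagged)
```

Gamma residuals are right-skewed, so a symmetric rule like this one may flag a few large but regular observations in addition to the injected ones; the flagged set is best read as a list of candidates for expert review.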
The results of the algorithm presented are compared with the results obtained using two methods for type 1 data suggested in (Čampulová et al., 2018b) and (Holešovský et al., 2018) for the case that the measurements of accompanying variables are unknown. This way, we verify that the use of methods for type 2 data (when the accompanying variables are included in the analysis) improves the results when compared to methods for type 1 data.
All three outlier detection procedures were applied with the aim of detecting outliers in concentrations of PM10 aerosols measured at a monitoring station in Brno, Czech Republic and their performance was compared based on the simulations.
Although the outlier detection problem has been addressed by many authors, as cited, the approach presented in this paper is original in that it is based on a GLM and takes into consideration information about accompanying variables. Moreover, the core idea of the suggested procedure is expected to be an improvement over the previously cited methods, which are often based only on interquartile ranges (or histograms) or are limited by assumptions about the distribution type of the analysed variable (without considering the measurement of accompanying variables). Thus, the difference from the previous research referred to is that the procedure presented in this paper is suggested for type 2 data and based on a model suitable for prediction of PM10 concentrations. A comparison with the methods for type 1 data previously proposed by the authors is carried out on a real data example as well as in a simulation study.
The paper is organized as follows: in the next section we describe the data and give an overview of the characteristics of the monitoring station. In section 3 the methodology is described. Our focus is on introducing the GLM as well as on the proposed outlier detection procedure itself. Further, we briefly describe Method I and Method II for outlier detection. Section 4 covers the results obtained by analysing both real and simulated data. The discussion is given in Section 5 and the conclusions are summarised in the final Section 6.
Section snippets
Data
The algorithm being presented is applied to detect outliers in concentrations of atmospheric aerosol (particulate matter) PM10 measured hourly at the Zvonarka monitoring station, which is situated in Brno, Czech Republic and operated by the Brno City Municipality (BCM). The monitoring period was from November 2006 until November 2015.
The concentrations were measured using a GRIMM 180 that operates on the principle of the optical method and radiation scattering on particles.
Brno, with 430,000
Methodology
As already mentioned in the introduction, the number of stations from which we have data is not sufficient for spatial statistics, since we cannot estimate the variogram necessary for applying spatio-temporal models. For this reason we model PM10 concentrations using a GLM. Observations from neighbouring stations could be included in the model as regressors, but this is beyond the scope of the paper.
In the following paragraphs a description of the GLM based procedure suggested in this
Results
In this section we apply the presented GLM-based outlier detection procedure to detect outliers in PM10 concentrations. The detected outliers are compared with the results obtained using Method I and Method II, discussed briefly earlier, and the individual procedures are then compared on the basis of simulations.
Discussion
Comparing Fig. 4 graph a) and Fig. 6 graph a) we can see that the GLM fit and kernel regression estimate differ most in the time instants where some of the evidently outlying observations occur – see 3.12, 7.12, 8.12, 19.12, and 27.12. The outliers occurring on 3.12, 8.12, 19.12, and 27.12 can be explained by accompanying regressors since they are well modelled by the curve representing the fit of GLM (6) (see Fig. 4 graph a)). The outlying observations occurring on 3.12, 8.12, and 27.12 were
Conclusion
This paper presents a two-step method for the automatic identification of outliers in type 2 environmental data - that is, data measured simultaneously with accompanying variables.
In the first step, a GLM predicting the observations of the analysed variable based on the known measurements of the accompanying variables is fitted. Subsequently, in the second step, outlier differences of measurements from the values fitted by the GLM are identified. The result is a set of potential outliers that
Acknowledgement
The paper was written with the support of the research project DZRO PASVŘ II, Ministry of Defence, Czech Republic.
References (62)
- et al.
Effect of spatial outliers on the regression modelling of air pollutant concentrations: a case study in Japan
Atmos. Environ.
(2017) - et al.
Non-linear projection to latent structures revisited (the neural network PLS algorithm)
Comput. Chem. Eng.
(1999) - et al.
Partial least squares with outlier detection in spectral analysis: a tool to predict gasoline properties
Fuel
(2009) - et al.
Control chart and six sigma based algorithms for identification of outliers in experimental data, with an application to particulate matter PM10
Atmos. Pollut. Res.
(2017) - et al.
Measurements of PM10 and PM2.5 particle concentrations in Athens, Greece
Atmos. Environ.
(2003) - et al.
Outliers detection in environmental monitoring databases
Eng. Appl. Artif. Intell.
(2011) - et al.
Semiparametric outlier detection in nonstationary times series: case study for atmospheric pollution in Brno, Czech Republic
Atmos. Pollut. Res.
(2018) - et al.
Identification of factors affecting air pollution by dust aerosol PM10 in Brno City, Czech Republic
Atmos. Pollut. Res.
(2008) - et al.
A review on the human health impact of airborne particulate matter
Environ. Int.
(2015) - et al.
Characterization of organic compounds in winter PM1 aerosols in a small industrial town
Atmos. Pollut. Res.
(2017)
M-regression, false discovery rates and outlier detection with application to genetic association studies
Comput. Stat. Data Anal.
Seasonal variability of monosaccharide anhydrides, resin acids, methoxyphenols and saccharides in PM2.5 in Brno, the Czech Republic
Atmos. Pollut. Res.
Intra-urban correlation and spatial variability of air toxics across an international airshed in Detroit, Michigan (USA) and Windsor, Ontario (Canada)
Atmos. Environ.
Modeling spatiotemporal variability of intra-urban air pollutants in Detroit: a pragmatic approach
Atmos. Environ.
Identification and influence of spatio-temporal outliers in urban air quality measurements
Sci. Total Environ.
Anomaly detection and assessment of PM10 functional data at several locations in the Klang Valley, Malaysia
Atmos. Pollut. Res.
Quality and performance of a PM10 daily forecasting model
Atmos. Environ.
Health effects of climate and air pollution in Buenos Aires: a first time series analysis
J. Environ. Protect.
Categorical Data Analysis
A new look at the statistical model identification
IEEE Trans. Automat. Contr.
Environmental Statistics: Methods and Applications
Outliers in Statistical Data
Statistics of Extremes: Theory and Applications
Outlier detection
Spatial outlier detection in the PM10 monitoring network of Normandy (France)
Atmos. Pollut. Res.
Ročenka Dopravy Brno 2009. Brno Municipality
Ročenka Dopravy Brno 2010. Brno Municipality
Ročenka Dopravy Brno 2011. Brno Municipality
Ročenka Dopravy Brno 2012. Brno Municipality
Ročenka Dopravy Brno 2013. Brno Municipality
Ročenka Dopravy Brno 2014. Brno Municipality
Peer review under responsibility of Turkish National Committee for Air Pollution Research and Control.