Data mining and exploration techniques

doi:10.1016/S0166-2481(04)30002-4

Developments in Soil Science

Volume 30, 2004, Pages 21-32

Publisher Summary

Data mining and exploration methods introduce algorithms that automate the selection of predictors and equations. This chapter describes three methods that have recently been used in pedotransfer function (PTF) development: artificial neural networks, the group method of data handling (GMDH), and regression trees. Each of these methods has its advantages and disadvantages. For example, the advantage of regression trees is the transparency of results, whereas the advantage of neural networks is the ability to mimic practically any relationship. The disadvantage of all these techniques compared with statistical regression is the heuristic element involved, which makes rigorous statistical judgment hard. The three techniques produce practically identical PTF accuracy. Database exploration is a useful step that may generate PTFs that are sufficient for the intended application, or it may suggest further application of more rigorous or more flexible PTF-building techniques.

Section snippets

Artificial neural networks

Artificial neural networks (ANNs) are becoming a common tool for modeling complex “input–output” dependencies (Maren et al., 1990; McCord and Illingworth, 1990). The advantage of ANNs is their ability to mimic the behavior of complex systems by varying the strength of the influence of network components on each other and by varying the structure of the interconnections among components. After establishing network structure and finding coefficients to express the strength of influence of the network
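To make the idea of "strength of influence" between network components concrete, here is a minimal sketch of the forward pass of a feed-forward network with one hidden layer. The tanh activation, the layer sizes, the weights, and the three soil inputs are all assumptions chosen for illustration; none of them comes from the chapter, and no training step is shown.

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer network with tanh hidden units."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical inputs: sand fraction, clay fraction, bulk density (g/cm^3)
x = [0.40, 0.25, 1.35]

# Hypothetical connection strengths (one row of weights per hidden unit)
hidden_weights = [[0.5, -1.0, 0.2], [-0.3, 0.8, 0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [0.6, -0.4]
output_bias = 0.3

y = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)

# Changing a single connection strength changes the network response
stronger = [[0.5, -2.0, 0.2], [-0.3, 0.8, 0.1]]
y_stronger = forward(x, stronger, hidden_biases, output_weights, output_bias)
print(y, y_stronger)
```

Fitting a network amounts to searching for the weights and biases that make such outputs match measured values; the number of hidden units and the connection pattern are the structural choices.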

Group method of data handling

After the original set of predictors for PTFs has been selected, the subsequent PTF development may: (a) retain all selected predictors in the PTF; (b) eliminate some of them based on statistical tests; or (c) eliminate some predictors and define the relative importance of the remaining predictors. The regression method (Chapter 1) can eliminate insignificant predictors, but regression equations presume a priori knowledge about the type of dependence that should exist in PTFs. The neural network
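The excerpt above does not spell out the GMDH steps, so the following is only a sketch of the commonly described first layer of a GMDH-type procedure, not the chapter's own formulation. It assumes quadratic partial models for every pair of predictors, coefficients fitted on a development subset, and ranking by error on a separate checking subset. All function names and the synthetic data are hypothetical.

```python
import itertools
import random

def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def features(u, v):
    """Terms of the quadratic partial model for one pair of predictors."""
    return [1.0, u, v, u * v, u * u, v * v]

def fit_pair(rows, y, i, j):
    """Least-squares coefficients (normal equations) for predictors i and j."""
    f = [features(r[i], r[j]) for r in rows]
    k = len(f[0])
    ata = [[sum(fr[p] * fr[q] for fr in f) for q in range(k)] for p in range(k)]
    aty = [sum(fr[p] * yi for fr, yi in zip(f, y)) for p in range(k)]
    return solve(ata, aty)

def mse(coef, rows, y, i, j):
    """Mean squared error of one partial model on the given rows."""
    pred = [sum(c * t for c, t in zip(coef, features(r[i], r[j]))) for r in rows]
    return sum((p - yi) ** 2 for p, yi in zip(pred, y)) / len(y)

def gmdh_layer(train, y_train, check, y_check):
    """Rank all predictor pairs by their error on the checking subset."""
    ranked = []
    for i, j in itertools.combinations(range(len(train[0])), 2):
        coef = fit_pair(train, y_train, i, j)
        ranked.append((mse(coef, check, y_check, i, j), (i, j)))
    return sorted(ranked)

# Synthetic data (an assumption): the response depends on predictors 0 and 2 only
rng = random.Random(0)
rows = [[rng.random() for _ in range(4)] for _ in range(60)]
y = [1.0 + 2.0 * r[0] - 1.5 * r[2] + 0.5 * r[0] * r[2] for r in rows]

ranking = gmdh_layer(rows[:40], y[:40], rows[40:], y[40:])
best_error, best_pair = ranking[0]
print(best_pair, best_error)
```

In a multi-layer version, the outputs of the best-ranked partial models would serve as inputs to the next layer, and predictors that never appear in a retained model are effectively eliminated.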

Regression trees

PTF dependencies can be very different in different parts of the database, and using the same PTF equation for the whole database may be misleading. It may be beneficial to subdivide the database into more homogeneous parts and then to build essentially different PTFs for the different parts.

Regression tree modeling is an exploratory technique based on uncovering structure in data (Clark and Pregibon, 1992). The resulting model partitions data first into two groups, then into four groups, and
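The first partition into two groups can be sketched as a search for the threshold on one predictor that makes the two resulting groups as homogeneous as possible. The sketch below assumes homogeneity is measured by the sum of squared deviations from the group mean, which is a common choice for regression trees but is not stated in the excerpt; the clay and water content values are invented for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split on one predictor."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split: a single group
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal predictor values
        left = [p[1] for p in pairs[:k]]
        right = [p[1] for p in pairs[k:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (threshold, total)
    return best

# Hypothetical data: clay content (%) and water content at -33 kPa (m3/m3)
clay = [5, 8, 12, 15, 30, 35, 40, 45]
theta = [0.10, 0.12, 0.11, 0.13, 0.30, 0.32, 0.31, 0.33]

threshold, total = best_split(clay, theta)
print(threshold, total)
```

Applying the same search again inside each of the two groups gives four groups, and so on; with several predictors, the search would be repeated for each predictor and the best one kept at every node.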

Cross-Validation procedures

Both regression trees and the GMDH iteratively build models of progressively increasing complexity. The process has to be stopped to prevent over-fitting; otherwise the predictive capability of the resulting models with respect to new data will be poor. In practice, this means that the dataset has to be split into development and testing subsets, and the CP multiplier in Equation (2) or the cost-complexity parameter K in Equation (5) has to be varied to provide the level of
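Splitting into development and testing subsets can be sketched as k-fold cross-validation, where each fold serves once as the testing subset. This is a generic illustration under stated assumptions: the helper names are hypothetical, a straight-line fit stands in for any PTF-building method, and the complexity parameters of Equations (2) and (5) are not modeled because they are not given in this excerpt.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validate(x, y, fit, predict, k=5):
    """Average testing-subset mean squared error over k splits."""
    folds = k_fold_indices(len(x), k)
    errors = []
    for test in folds:
        held_out = set(test)
        dev = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in dev], [y[i] for i in dev])
        sq = [(predict(model, x[i]) - y[i]) ** 2 for i in test]
        errors.append(sum(sq) / len(sq))
    return sum(errors) / k

def fit_line(xs, ys):
    """Ordinary least-squares line, standing in for any model-building method."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def predict_line(model, x):
    """Evaluate the fitted line at x."""
    intercept, slope = model
    return intercept + slope * x

# Synthetic, noise-free data (an assumption) so the expected error is near zero
x = [float(i) for i in range(20)]
y = [0.5 + 0.02 * xi for xi in x]

cv_error = cross_validate(x, y, fit_line, predict_line, k=5)
print(cv_error)
```

To choose a complexity setting, one would repeat the call for each candidate value and keep the value with the lowest cross-validated error.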

Concluding remarks

The main reason for using data mining and database exploration techniques in PTF development is probably the complexity of relationships between soil properties. Other database exploratory techniques have been successfully used for this purpose, e.g., numerical classification (Williams et al., 1983). Each of the methods in this chapter has its advantages and disadvantages. For example, the advantage of regression trees is the transparency of results, whereas the advantage of the

References (37)

N.J. McKenzie et al. Spatial prediction of soil properties using environmental correlation. Geoderma (1999)
B. Minasny et al. Comparison of different approaches to the development of pedotransfer functions for water-retention curves. Geoderma (1999)
F. Ungaro et al. Using existing soil databases for estimating water-retention properties for soils of the Pianura Padano-Veneta region of North Italy. Geoderma (2001)
J.H.M. Wösten et al. Pedotransfer functions: bridging the gap between available basic soil data and missing soil hydraulic characteristics. J. Hydrol (2001)
I. Alexander et al. An Introduction to Neural Computing (1990)
F.A. Baker. Classification and regression tree analysis for assessing hazard of pine mortality caused by Heterobasidion annosum. Plant Dis (1993)
A.R. Barron. Predicted Square Error: a Criterion for Automatic Model Selection
J.F. Bell. Tree-based methods
L. Breiman et al. Regression Trees (1993)
L.A. Clark et al. Tree-based models
H. Demuth et al. Neural Network Toolbox Manual (1992)
B. Efron et al. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability (1993)
S.J. Farlow. The GMDH algorithm
D. Gimènez et al. Prediction of a pore distribution factor from soil textural and mechanical parameters. Soil Sci (2001)
P.I. Good. Resampling Methods: A Practical Guide to Data Analysis (1999)
S. Haykin. Neural Networks, a Comprehensive Foundation (1994)
R. Hecht-Nielsen. Neurocomputing (1990)
E. Koekkoek et al. Development of a neural network model to predict soil water retention. Eur. J. Soil Sci (1997)

Cited by (26)

Pedotransfer functions to estimate hydraulic properties of tropical Sri Lankan soils
2019, Soil and Tillage Research
Citation Excerpt :
This process is repeated 10 times, with each subsample used once as the testing data. The results are then averaged to produce a single estimation (Pachepsky and Schaap, 2004). Attribute selection is a procedure that searches all possible combinations of attributes in a dataset to find the combination that yields the best prediction.
Knowledge of the hydraulic properties of soil is a vital factor in evaluating and managing environmental and agricultural problems. The expense and difficulty of measurements have prompted the development of other approaches to estimate soil hydraulic properties. Pedotransfer functions (PTFs) are predictive functions used to estimate soil properties on the basis of easily measurable soil parameters. Although PTFs are in use for most temperate regions, few attempts have been made to develop them for locations in the tropics. This study aimed to establish suitable PTFs for tropical soils of Sri Lanka to estimate soil hydraulic properties (field capacity and permanent wilting point) by a multiple linear regression method from inputs consisting of different combinations of four easily measured parameters: sand content; sand, silt, and clay content; bulk density; and organic carbon concentration. This analysis used the open-source data mining software in the Waikato Environment for Knowledge Analysis. We found that all the PTFs developed using different input levels showed similar performances. Our functional evaluation showed that the output of the PTFs performed essentially as well as measured data for estimating available water content and generating irrigation schedules for the selected localities. Hence, even using sand percentage alone, volumetric water contents at −10, −33, and −1500 kPa can be successfully estimated using PTFs developed for Sri Lankan soil conditions.
Weighted recalibration of the Rosetta pedotransfer model with improved estimates of hydraulic parameter distributions and summary statistics (Rosetta3)
2017, Journal of Hydrology
Citation Excerpt :
Due to the use of artificial neural networks (ANNs) as well as the bootstrap method (Efron and Tibshirani, 1993), the exact mathematical structure of Rosetta was never explicitly published (Rosetta consists of hundreds of matrices, each with tens of coefficients). Instead, Rosetta1 was released in 2001 as a Graphical User Interface (GUI) based Windows 98/XP application (Pachepsky and Schaap, 2004; Schaap et al., 2001). The GUI has become unsupportable, and the only current practical use of Rosetta1 is through the Hydrus1D and 3D applications (Šimůnek et al., 2012, 2008) and limited informal releases of source code by the authors.
Pedotransfer functions (PTFs) have been widely used to predict soil hydraulic parameters in favor of expensive laboratory or field measurements. Rosetta (Schaap et al., 2001, denoted as Rosetta1) is one of many PTFs and is based on artificial neural network (ANN) analysis coupled with the bootstrap re-sampling method which allows the estimation of van Genuchten water retention parameters (van Genuchten, 1980, abbreviated here as VG), saturated hydraulic conductivity (K_s), and their uncertainties. In this study, we present an improved set of hierarchical pedotransfer functions (Rosetta3) that unify the water retention and K_s submodels into one. Parameter uncertainty of the fit of the VG curve to the original retention data is used in the ANN calibration procedure to reduce bias of parameters predicted by the new PTF. One thousand bootstrap replicas were used to calibrate the new models compared to 60 or 100 in Rosetta1, thus allowing the uni-variate and bi-variate probability distributions of predicted parameters to be quantified in greater detail. We determined the optimal weights for VG parameters and K_s, the optimal number of hidden nodes in ANN, and the number of bootstrap replicas required for statistically stable estimates. Results show that matric potential-dependent bias was reduced significantly while root mean square error (RMSE) for water content were reduced modestly; RMSE for K_s was increased by 0.9% (H3w) to 3.3% (H5w) in the new models on log scale of K_s compared with the Rosetta1 model. It was found that estimated distributions of parameters were mildly non-Gaussian and could instead be described rather well with heavy-tailed α-stable distributions. On the other hand, arithmetic means had only a small estimation bias for most textures when compared with the mean-like “shift” parameter of the α-stable distributions. Arithmetic means and (co-)variances are therefore still recommended as summary statistics of the estimated distributions. 
However, it may be necessary to parameterize the distributions in different ways if the new estimates are used in stochastic analyses of vadose zone flow and transport. Rosetta1 and Rosetta3 were implemented in the Python programming language, and the source code as well as additional documentation is available at: http://www.cals.arizona.edu/research/rosettav3.html.
Comparison of statistical regression and data-mining techniques in estimating soil water retention of tropical delta soils
2017, Biosystems Engineering
Citation Excerpt :
Indeed, Perkins and Nimmo (2009) and Botula et al. (2013) have stressed that the predictive capability of PTFs derived by pattern recognition techniques depends on the quality and the representability of the training data to soils for which one needs to predict SWRC. The well-defined ability of the ANN technique to mimic the inputs–outputs relationship of complex soil water systems (Pachepsky & Schaap, 2004) might probably explain the adequate performance of ANN-PTFs in both training and testing phases of point and pseudo-continuous estimation. Inversely, the MLR models were constructed based on rigorous structural assumptions of the relationship between SWRC and other soil variables.
Although a great number of studies have been devoted to develop and evaluate pedotransfer functions (PTFs), several questions still are to be addressed, particularly pertaining to tropical delta soils which received very little attention. One such question relates to the optimal structural dependency between basic soil properties and soil water retention characteristics (SWRC), which could be formulated by various regression methods. It is hypothesised that data mining techniques provide more accurate SWRC-PTFs than statistical linear regression. However, data-mining techniques are often proven as highly data-demanding techniques. The aim of this study was, therefore, to verify that hypothesis for a limited data set of tropical delta soils by comparing the predictive capabilities of point PTFs and pseudo-continuous (PC) PTFs developed by Multiple Linear Regression (MLR), Artificial Neural Networks (ANN), Support Vector Machine for Regression (SVR), and k-Nearest Neighbours (kNN) methods. The results show that point-PTFs derived from data-mining techniques (i.e. ANN, SVR, kNN) offer accurate and reliable estimation of soil water content at several matric potentials. In case of PC-PTFs, ANN and kNN models outperformed SVR and MLR PTFs in validation phase (RMSE of ANN and kNN PTFs were around 0.05 m³ m⁻³, while those of SVR PTFs and MLR PTFs rose up to 0.068 and 0.066 m³ m⁻³). Our findings confirm the superiority of data-mining approaches in modelling the complex system of soil and water, even when a limited data set is available. The non-parametric kNN method, though being constrained in estimating SWRC in pseudo-continuous manner, has great benefits due to its flexibility, simplicity, accuracy and capacity to append new observations.
Evaluation of pedotransfer functions for predicting water retention of soils in Lower Congo (D.R. Congo)
2012, Agricultural Water Management
Citation Excerpt :
Although most “temperate” PTFs showed a poor performance compared to “tropical” PTFs in estimating water content, it is remarkable to see that the ANN PTF by Schaap et al. (2001) showed a good performance for soils in Lower Congo. This is probably due to its flexibility and ability to mimic the complex behaviour of soils (Pachepsky and Schaap, 2004). The PTFs of Gupta and Larson (1979) and Rawls and Brakensiek (1982) did not perform better than any “tropical” PTF.
The soil water retention curve (SWRC) is important to solve many soil and water management problems related to agriculture, ecology, and environmental issues. However, it is well recognized that its direct measurement is laborious, time-consuming and expensive. An alternative is the estimation of the SWRC by pedotransfer functions (PTFs), which are well documented for temperate soils. Few works, however, have been devoted to PTFs for tropical soils. The main objective of this study was to evaluate the ability of a number of published “point” and “parametric” PTFs to predict water retention of soils in the Lower Congo (the South-Western region of the Democratic Republic of Congo). The “point” PTFs of Oliveira et al. (2002) and Dijkerman (1988) performed best at −33 kPa, while those of Arruda et al. (1987) and Pidgeon (1972) were best at −1500 kPa. Regarding the parametric PTFs which predicted the van Genuchten (1980) parameters, the “tropical” PTFs of Hodnett and Tomasella (2002) and the “temperate” PTFs of Schaap et al. (2001) gave the best results. Preliminary results of this evaluation study suggest that estimates of water content by several existing “temperate” as well as “tropical” PTFs may induce errors in the outputs of watershed models used in various agricultural studies under the humid tropics. Large discrepancies in the derived soil hydraulic data can substantially reduce the quality of the modelling results particularly in regions where soils may have been formed and evolved in similar climatological and pedological conditions as soils from the Lower Congo. We further found that a predictor such as dithionite-citrate-bicarbonate-extractable iron (DCB-Fe) has great potential to reduce the uncertainty of PTFs for predicting water retention parameters of tropical soils.
Necessary meta-data for pedotransfer functions
2011, Geoderma
Citation Excerpt :
A positive value indicates over-prediction, and a negative value indicates under-prediction. The prediction method should be described, i.e., whether it is a linear model or a data-mining technique (Pachepsky and Schaap, 2004). The prediction formula needs to be explicitly written down.
Although pedotransfer functions have been published for more than 25 years, most published functions have very little information on the functions themselves and where they can be used. In this paper, we recommend 3 tables to accompany every published PTF so that users can decide whether they can potentially use a published PTF on their data. The first table contains the information and statistics of the training data. The second table provides information and statistics about the variable to be predicted on the calibration set. The third table contains the statistics of the validation data. Furthermore, the function should be expressed, and uncertainty measures of the function should also be included.
Discriminating sources of nitrate pollution in an unconfined sandy aquifer
2009, Journal of Hydrology
Correctly assessing the origin of groundwater pollution is an important prerequisite for efficient groundwater management. In this paper, statistical modelling tools are applied to discriminate the sources of nitrate pollution in the unconfined deep sandy aquifer of the Brusselian sands (Belgium). Multiple regression and regression tree were compared to identify the factors affecting the nitrate concentration in this vulnerable groundwater body. The explanatory factors were related to land and land use properties in the capture zone. The tree model and the low fitting power of the multiple regression model showed the highly complex interaction pattern between explanatory variables. In the region, one explicative variable taken alone could not be considered responsible for the groundwater pollution by nitrate. However, both methods indicated the negative influence of residential land on the nitrate concentrations and a slight protective effect of low slope values. Furthermore, we showed the importance of delineating capture zones on the basis of topography, the type of monitoring station and a simplified water mass balance, compared to circular capture zones centered on the monitoring stations.

Chapter 2
Data mining and exploration techniques