Chapter 2 Data mining and exploration techniques
Section snippets
Artificial neural networks
Artificial neural networks (ANNs) are becoming a common tool for modeling complex “input–output” dependencies (Maren et al., 1990; McCord and Illingworth, 1990). The advantage of ANNs is their ability to mimic the behavior of complex systems by varying the strength of the influence of network components on each other and by varying the structure of the interconnections among components. After establishing network structure and finding coefficients to express the strength of influence of the network
Group method of data handling
After the original set of predictors for PTFs has been selected, the subsequent PTF development may: (a) retain all selected predictors in the PTF; (b) eliminate part of them based on statistical tests; (c) eliminate some predictors and define the relative importance of the remaining predictors. The regression method (Chapter 1) can eliminate insignificant predictors, but regression equations presume a priori knowledge about the type of dependence that should exist in PTFs. The neural network
Regression trees
PTF dependencies can differ substantially between parts of a database, and using the same PTF equation for the whole database may be misleading. It may be beneficial to subdivide the database into more homogeneous parts and then to build essentially different PTFs for the different parts.
Regression tree modeling is an exploratory technique based on uncovering structure in data (Clark and Pregibon, 1992). The resulting model partitions data first into two groups, then into four groups, and
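The first partition into two groups can be sketched as a search for the single threshold on one predictor that makes the two resulting groups as homogeneous as possible. The sketch below is only an illustration under stated assumptions: the least-squares splitting criterion is a common choice for regression trees but is not confirmed by the text above, and the function names and the clay and water-content values are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split of y on x.

    Assumption: homogeneity is measured by the within-group sum of squares.
    Returns (None, sse(y)) if no split improves on the unsplit data.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse([v for _, v in pairs]))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot place a threshold between equal predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((lo + hi) / 2, total)
    return best


# Hypothetical data: clay content (%) and volumetric water content.
clay = [10, 12, 15, 40, 45, 50]
theta = [0.10, 0.12, 0.11, 0.30, 0.32, 0.31]
threshold, total = best_split(clay, theta)
```

Applying the same search again within each of the two groups would give the four-group partition; a full tree would also search over all predictors, not just one.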
Cross-validation procedures
Both regression trees and the GMDH iteratively build models of progressively increasing complexity. The process has to be stopped to prevent over-fitting; otherwise the resulting models will predict new data poorly. In practice, this means that the dataset has to be split into development and testing subsets, and the CP multiplier in Equation (2) or the cost-complexity parameter K in Equation (5) has to be varied to provide the level of
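The idea of splitting the data into development and testing subsets while varying a complexity setting can be sketched as follows. This is a minimal illustration, not the procedure of this chapter: Equations (2) and (5) are not reproduced here, so a piecewise-constant model stands in for the regression tree or GMDH model, and its number of bins plays the role of the complexity parameter. All function names and the data are hypothetical.

```python
import random
from statistics import fmean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_mse(x, y, fit, k=5, seed=0):
    """Average test MSE over k folds; fit(xs, ys) must return predict(v)."""
    errors = []
    for test in k_fold_indices(len(x), k, seed):
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        errors.append(fmean((predict(x[i]) - y[i]) ** 2 for i in test))
    return fmean(errors)


def make_binned_fit(n_bins, lo, hi):
    """Stand-in model: mean of y within n_bins equal-width bins on [lo, hi]."""
    def bin_of(v):
        b = int((v - lo) / (hi - lo) * n_bins)
        return min(max(b, 0), n_bins - 1)

    def fit(xs, ys):
        overall = fmean(ys)  # fallback for bins with no development data
        groups = {}
        for xv, yv in zip(xs, ys):
            groups.setdefault(bin_of(xv), []).append(yv)
        means = {b: fmean(v) for b, v in groups.items()}
        return lambda v: means.get(bin_of(v), overall)

    return fit


# Hypothetical noise-free data with one step at x = 0.5.
x = [i / 20 for i in range(20)]
y = [0.1 if v < 0.5 else 0.3 for v in x]
scores = {n: cross_validated_mse(x, y, make_binned_fit(n, 0.0, 1.0))
          for n in (1, 2, 4, 8)}
```

Comparing the entries of `scores` shows how the testing error responds to the complexity setting; the setting with the lowest testing error would be the natural stopping point in this sketch.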
Concluding remarks
The main reason for using data mining and database exploration techniques in PTF development is probably the complexity of relationships between soil properties. Other database exploration techniques have been successfully used for this purpose, e.g., numerical classification (Williams et al., 1983). Each of the methods in this chapter has its advantages and disadvantages. For example, the advantage of the regression trees is in the transparency of results, whereas the advantage of the
References (37)
- et al. Spatial prediction of soil properties using environmental correlation. Geoderma (1999)
- et al. Comparison of different approaches to the development of pedotransfer functions for water-retention curves. Geoderma (1999)
- et al. Using existing soil databases for estimating water-retention properties for soils of the Pianura Padano-Veneta region of North Italy. Geoderma (2001)
- et al. Pedotransfer functions: bridging the gap between available basic soil data and missing soil hydraulic characteristics. J. Hydrol (2001)
- et al. An Introduction to Neural Computing (1990)
- Classification and regression tree analysis for assessing hazard of pine mortality caused by Heterobasidion annosum. Plant Dis (1993)
- Predicted Square Error: a Criterion for Automatic Model Selection
- Tree-based methods
- et al. Regression Trees (1993)
- et al. Tree-based models
- Neural Network Toolbox Manual
- An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability
- The GMDH algorithm
- Prediction of a pore distribution factor from soil textural and mechanical parameters. Soil Sci
- Resampling Methods: A Practical Guide to Data Analysis
- Neural Networks, a Comprehensive Foundation
- Neurocomputing
- Development of a neural network model to predict soil water retention. Eur. J. Soil Sci
Cited by (26)
Pedotransfer functions to estimate hydraulic properties of tropical Sri Lankan soils
2019, Soil and Tillage Research
Citation excerpt: This process is repeated 10 times, with each subsample used once as the testing data. The results are then averaged to produce a single estimation (Pachepsky and Schaap, 2004). Attribute selection is a procedure that searches all possible combinations of attributes in a dataset to find the combination that yields the best prediction.
Weighted recalibration of the Rosetta pedotransfer model with improved estimates of hydraulic parameter distributions and summary statistics (Rosetta3)
2017, Journal of Hydrology
Citation excerpt: Due to the use of artificial neural networks (ANNs) as well as the bootstrap method (Efron and Tibshirani, 1993), the exact mathematical structure of Rosetta was never explicitly published (Rosetta consists of hundreds of matrices, each with tens of coefficients). Instead, Rosetta1 was released in 2001 as a Graphical User Interface (GUI) based Windows 98/XP application (Pachepsky and Schaap, 2004; Schaap et al., 2001). The GUI has become unsupportable, and the only current practical use of Rosetta1 is through the Hydrus1D and 3D applications (Šimůnek et al., 2012, 2008) and limited informal releases of source code by the authors.
Comparison of statistical regression and data-mining techniques in estimating soil water retention of tropical delta soils
2017, Biosystems Engineering
Citation excerpt: Indeed, Perkins and Nimmo (2009) and Botula et al. (2013) have stressed that the predictive capability of PTFs derived by pattern recognition techniques depends on the quality and the representability of the training data to soils for which one needs to predict SWRC. The well-defined ability of the ANN technique to mimic the inputs–outputs relationship of complex soil water systems (Pachepsky & Schaap, 2004) might probably explain the adequate performance of ANN-PTFs in both training and testing phases of point and pseudo-continuous estimation. Inversely, the MLR models were constructed based on rigorous structural assumptions of the relationship between SWRC and other soil variables.
Evaluation of pedotransfer functions for predicting water retention of soils in Lower Congo (D.R. Congo)
2012, Agricultural Water Management
Citation excerpt: Although most “temperate” PTFs showed a poor performance compared to “tropical” PTFs in estimating water content, it is remarkable to see that the ANN PTF by Schaap et al. (2001) showed a good performance for soils in Lower Congo. This is probably due to its flexibility and ability to mimic the complex behaviour of soils (Pachepsky and Schaap, 2004). The PTFs of Gupta and Larson (1979) and Rawls and Brakensiek (1982) did not perform better than any “tropical” PTF.
Necessary meta-data for pedotransfer functions
2011, Geoderma
Citation excerpt: A positive value indicates over-prediction, and a negative value indicates under-prediction. The prediction method should be described, i.e., whether it is a linear model or a data-mining technique (Pachepsky and Schaap, 2004). The prediction formula needs to be explicitly written down.
Discriminating sources of nitrate pollution in an unconfined sandy aquifer
2009, Journal of Hydrology