Chapter 2
Data mining and exploration techniques

https://doi.org/10.1016/S0166-2481(04)30002-4

Publisher Summary

Data mining and exploration methods provide algorithms that automate the selection of predictors and equations. This chapter describes three methods that have recently been used in pedotransfer function (PTF) development: artificial neural networks, the group method of data handling (GMDH), and regression trees. Each of these methods has its advantages and disadvantages. For example, the advantage of regression trees is the transparency of results, whereas the advantage of neural networks is the ability to mimic practically any relationship. The disadvantage of all these techniques compared with statistical regression is the heuristic element involved, which makes rigorous statistical judgment hard. The three techniques produce practically identical PTF accuracy. Database exploration is a useful step: it may generate PTFs that are sufficient for the intended application, or it may suggest further application of more rigorous or more flexible PTF-building techniques.

Section snippets

Artificial neural networks

Artificial neural networks (ANNs) are becoming a common tool for modeling complex “input–output” dependencies (Maren et al., 1990; McCord and Illingworth, 1990). The advantage of ANNs is their ability to mimic the behavior of complex systems by varying the strength of the influence of network components on each other and by varying the structure of the interconnections among components. After establishing network structure and finding coefficients to express the strength of influence of the network
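
The idea of coefficients that express the strength of influence between network components can be illustrated with a small feed-forward network. This is a minimal sketch, not the chapter's own model: the single tanh hidden layer, the three texture inputs, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def ann_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with tanh units followed by a linear output unit."""
    hidden = np.tanh(x @ w_hidden + b_hidden)  # influence of inputs on hidden units
    return hidden @ w_out + b_out              # influence of hidden units on output

# Hypothetical inputs: sand, silt, and clay fractions for two soil samples.
x = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.4, 0.4]])

# Untrained, randomly drawn weights; training would adjust these coefficients.
rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(3, 4))   # 3 inputs -> 4 hidden units (assumed structure)
b_hidden = np.zeros(4)
w_out = rng.normal(size=(4, 1))      # 4 hidden units -> 1 output
b_out = np.zeros(1)

y = ann_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y.shape)
```

Changing the number of hidden units changes the structure of the interconnections, while training changes the weight values.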

Group method of data handling

After the original set of predictors for PTFs has been selected, the subsequent PTF development may: (a) retain all selected predictors in the PTF; (b) eliminate some of them based on statistical tests; (c) eliminate some predictors and define the relative importance of the remaining predictors. The regression method (Chapter 1) can eliminate insignificant predictors, but regression equations presume a priori knowledge about the type of dependence that should exist in PTFs. The neural network
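
Predictor elimination without a preset equation form can be sketched with one selection layer in the general style of GMDH. The quadratic polynomial of predictor pairs and the split into fitting and selection subsets follow the commonly described form of the method; the function names, the synthetic data, and the `keep` parameter are assumptions, and the chapter's own formulation may differ.

```python
import itertools
import numpy as np

def quad_features(a, b):
    """Design matrix of a full quadratic polynomial in two predictors."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

def gmdh_layer(x_fit, y_fit, x_sel, y_sel, keep=2):
    """Fit a quadratic model for every predictor pair and rank the pairs
    by mean squared error on the separate selection subset."""
    candidates = []
    for i, j in itertools.combinations(range(x_fit.shape[1]), 2):
        design = quad_features(x_fit[:, i], x_fit[:, j])
        coef, *_ = np.linalg.lstsq(design, y_fit, rcond=None)
        pred = quad_features(x_sel[:, i], x_sel[:, j]) @ coef
        candidates.append((float(np.mean((pred - y_sel) ** 2)), (i, j), coef))
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]  # pairs that never survive are effectively eliminated

# Synthetic data (assumed): only predictors 0 and 2 influence the response.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
y = x[:, 0] * x[:, 2] + 0.01 * rng.normal(size=200)

best_models = gmdh_layer(x[:100], y[:100], x[100:], y[100:])
print([pair for _, pair, _ in best_models])
```

In a multi-layer version, the outputs of the surviving models would serve as inputs to the next layer.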

Regression trees

PTF dependencies can be very different in different parts of the database, and using the same PTF equation for the whole database may be misleading. It may be beneficial to subdivide the database into more homogeneous parts and then to build essentially different PTFs for the different parts.
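
Subdividing a database and fitting a different PTF in each part can be sketched with a single split on one predictor. This is an illustrative assumption rather than the chapter's procedure: it fits a straight line in each part, whereas classic regression trees predict a constant in each leaf, and the data are synthetic.

```python
import numpy as np

def sse_line(x, y):
    """Sum of squared errors of a least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

def best_split(x, y, min_size=5):
    """Find the threshold on x that minimizes the total SSE of two separate lines."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    for k in range(min_size, len(xs) - min_size + 1):
        if xs[k - 1] == xs[k]:
            continue  # cannot place a threshold between equal values
        total = sse_line(xs[:k], ys[:k]) + sse_line(xs[k:], ys[k:])
        if total < best_sse:
            best_sse, best_threshold = total, 0.5 * (xs[k - 1] + xs[k])
    return best_threshold, best_sse

# Synthetic example (assumed): the dependence changes at x = 0.5.
x = np.linspace(0.0, 1.0, 40)
y = np.where(x < 0.5, 2.0 * x, 3.0 - x)

threshold, sse = best_split(x, y)
print(threshold, sse)
```

Applying the same search recursively within each part yields a tree of progressively smaller and more homogeneous groups.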

Regression tree modeling is an exploratory technique based on uncovering structure in data (Clark and Pregibon, 1992). The resulting model partitions data first into two groups, then into four groups, and

Cross-validation procedures

Both regression trees and the GMDH iteratively build models of progressively increasing complexity. The process has to be stopped to prevent over-fitting; otherwise, the resulting models will predict new data poorly. In practice, this means that the dataset has to be split into development and testing subsets, and the CP multiplier in Equation (2) or the cost-complexity parameter K in Equation (5) has to be varied to provide the level of
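
Splitting the data into development and testing subsets while varying a complexity setting can be sketched as follows. Polynomial degree is used here purely as a stand-in for the CP multiplier or the cost-complexity parameter K, whose equations are not reproduced in this text; the function name, the 30% testing fraction, and the synthetic data are assumptions.

```python
import numpy as np

def select_complexity(x, y, degrees, test_fraction=0.3, seed=0):
    """Fit models of increasing complexity on the development subset and
    score each one by root-mean-square error on the held-out testing subset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test, dev = idx[:n_test], idx[n_test:]
    errors = {}
    for d in degrees:
        coef = np.polyfit(x[dev], y[dev], d)           # development step
        resid = y[test] - np.polyval(coef, x[test])    # testing step
        errors[d] = float(np.sqrt(np.mean(resid ** 2)))
    best = min(errors, key=errors.get)  # complexity with the lowest testing error
    return best, errors

# Synthetic data (assumed): a cubic dependence with small noise.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 60)
y = x**3 - x + 0.05 * rng.normal(size=x.size)

degrees = [1, 2, 3, 4, 5, 6]
best_degree, errors = select_complexity(x, y, degrees)
print(best_degree, errors)
```

Repeating the split several times and averaging the testing errors gives a cross-validated estimate that is less sensitive to one particular partition.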

Concluding remarks

The main reason for using data mining and database exploration techniques in PTF development is probably the complexity of relationships between soil properties. Other database exploratory techniques have been successfully used for this purpose, e.g., numerical classification (Williams et al., 1983). Each of the methods in this chapter has its advantages and disadvantages. For example, the advantage of the regression trees is in the transparency of results, whereas the advantage of the

References (37)

  • H. Demuth et al., Neural Network Toolbox Manual (1992)
  • B. Efron et al., An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability (1993)
  • S.J. Farlow, The GMDH algorithm
  • D. Gimènez et al., Prediction of a pore distribution factor from soil textural and mechanical parameters, Soil Sci (2001)
  • P.I. Good, Resampling Methods: A Practical Guide to Data Analysis (1999)
  • S. Haykin, Neural Networks, a Comprehensive Foundation (1994)
  • R. Hecht-Nielsen, Neurocomputing (1990)
  • E. Koekkoek et al., Development of a neural network model to predict soil water retention, Eur. J. Soil Sci (1997)

Cited by (26)

    • Pedotransfer functions to estimate hydraulic properties of tropical Sri Lankan soils

      2019, Soil and Tillage Research
      Citation Excerpt :

      This process is repeated 10 times, with each subsample used once as the testing data. The results are then averaged to produce a single estimation (Pachepsky and Schaap, 2004). Attribute selection is a procedure that searches all possible combinations of attributes in a dataset to find the combination that yields the best prediction.

    • Weighted recalibration of the Rosetta pedotransfer model with improved estimates of hydraulic parameter distributions and summary statistics (Rosetta3)

      2017, Journal of Hydrology
      Citation Excerpt :

      Due to the use of artificial neural networks (ANNs) as well as the bootstrap method (Efron and Tibshirani, 1993), the exact mathematical structure of Rosetta was never explicitly published (Rosetta consists of hundreds of matrices, each with tens of coefficients). Instead, Rosetta1 was released in 2001 as a Graphical User Interface (GUI) based Windows 98/XP application (Pachepsky and Schaap, 2004; Schaap et al., 2001). The GUI has become unsupportable, and the only current practical use of Rosetta1 is through the Hydrus1D and 3D applications (Šimůnek et al., 2012, 2008) and limited informal releases of source code by the authors.

    • Comparison of statistical regression and data-mining techniques in estimating soil water retention of tropical delta soils

      2017, Biosystems Engineering
      Citation Excerpt :

      Indeed, Perkins and Nimmo (2009) and Botula et al. (2013) have stressed that the predictive capability of PTFs derived by pattern recognition techniques depends on the quality and the representability of the training data to soils for which one needs to predict SWRC. The well-defined ability of the ANN technique to mimic the inputs–outputs relationship of complex soil water systems (Pachepsky & Schaap, 2004) might probably explain the adequate performance of ANN-PTFs in both training and testing phases of point and pseudo-continuous estimation. Inversely, the MLR models were constructed based on rigorous structural assumptions of the relationship between SWRC and other soil variables.

    • Evaluation of pedotransfer functions for predicting water retention of soils in Lower Congo (D.R. Congo)

      2012, Agricultural Water Management
      Citation Excerpt :

      Although most “temperate” PTFs showed a poor performance compared to “tropical” PTFs in estimating water content, it is remarkable to see that the ANN PTF by Schaap et al. (2001) showed a good performance for soils in Lower Congo. This is probably due to its flexibility and ability to mimic the complex behaviour of soils (Pachepsky and Schaap, 2004). The PTFs of Gupta and Larson (1979) and Rawls and Brakensiek (1982) did not perform better than any “tropical” PTF.

    • Necessary meta-data for pedotransfer functions

      2011, Geoderma
      Citation Excerpt :

      A positive value indicates over-prediction, and a negative value indicates under-prediction. The prediction method should be described, i.e., whether it is a linear model or a data-mining technique (Pachepsky and Schaap, 2004). The prediction formula needs to be explicitly written down.
