A typical
multivariate calibration procedure needs several separate data sets. The calibration
or training data set is needed for setting up the model by estimating the parameters
of an equation or for training a neural network. Often a second data set is
needed to determine when to stop the training or to determine how many and which
model components and variables to include. This second data set is usually called
the monitor data set. If several models are developed, a third data set, called the
test set, is required to select the most appropriate model. Finally, a validation
data set is essential to estimate the quality of the final model. It has been
shown that each of these data sets must consist of different samples, as otherwise
the models and the estimates are biased [9]-[12].
For example, if the same data set is used for calibration and validation,
the estimate of the prediction ability is overly optimistic. Additionally,
each data set should be as large as possible. The larger the calibration data
set, the better the model, and the larger the validation data set, the better the
estimate of the predictivity. If many data are available, representative large
independent samples can be used for training, monitoring, testing and validating
by simply partitioning the large pool of all samples. In analytical chemistry,
however, typically only data sets of limited size are available, as measurements
are expensive and labor-intensive. To solve the dilemma of partitioning a small pool
of data into independent subsets that should each be as large and as representative
as possible, subsampling procedures, also known as resampling procedures,
have become the de facto standard in chemometrics. There are many subsampling
techniques; the most important ones are described below.
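
The simple partitioning of a large sample pool into calibration, monitor, test and validation sets, as mentioned above, can be sketched as follows. This is only a minimal illustration: the function name `partition_pool`, the split fractions and the fixed random seed are assumptions chosen for the example and are not taken from the text.

```python
import random


def partition_pool(n_samples, fractions=(0.5, 0.15, 0.15, 0.2), seed=0):
    """Randomly split sample indices into four disjoint subsets.

    The fractions (calibration, monitor, test, validation) are
    illustrative assumptions, not recommended values.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    names = ("calibration", "monitor", "test", "validation")
    indices = list(range(n_samples))
    # Shuffle once so that every subset is a random draw from the pool.
    random.Random(seed).shuffle(indices)

    subsets = {}
    start = 0
    for position, (name, fraction) in enumerate(zip(names, fractions)):
        if position == len(names) - 1:
            # The last subset takes all remaining samples.
            stop = n_samples
        else:
            stop = min(start + int(round(fraction * n_samples)), n_samples)
        subsets[name] = sorted(indices[start:stop])
        start = stop
    return subsets


subsets = partition_pool(200)
for name, members in subsets.items():
    print(name, len(members))
```

Because each index is assigned to exactly one subset, no sample used for calibration can leak into the validation set. With a small pool, however, each subset obtained this way becomes too small to be representative, which is the situation the subsampling procedures address.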