Basis of comparison
This study focuses mainly on the prediction performance of the methods, with particular emphasis on the interaction between the prediction methods and the properties of the data controlled by the simulation parameters. Prediction performance is measured by the following:
- The average prediction error that a method attains when it is free to use an arbitrary number of components, and
- The average number of components the method needs to attain its minimum prediction error
Let us define,
\[\begin{equation} \mathcal{PE}_{ijkl} = \frac{1}{\sigma_{y_{ij}|x}^2} \mathsf{E}{\left[\left(\boldsymbol{\beta}_{ij} - \boldsymbol{\hat{\beta}_{ijkl}}\right)^t \left(\boldsymbol{\Sigma}_{xx}\right)_i \left(\boldsymbol{\beta}_{ij} - \boldsymbol{\hat{\beta}_{ijkl}}\right)\right]} + 1 \tag{7} \end{equation}\] as the prediction error of response \(j = 1, \ldots, 4\) for a given design \(i = 1, 2, \ldots, 32\) and method \(k = 1(\text{PCR}), \ldots, 5(\text{Senv})\) using \(l = 0, \ldots, 10\) components. Here, \(\left(\boldsymbol{\Sigma}_{xx}\right)_i\) is the true covariance matrix of the predictors, unique to design \(i\), and \(\sigma_{y_{ij}|x}^2\) is the true model error of response \(j\) under design \(i\). The prediction error is scaled by the true model error to remove the influence of the residual variances, which differ across responses and designs. Since both the expectation and the variance of \(\hat{\boldsymbol{\beta}}\) are unknown, the prediction error is estimated using data from 50 replications as follows,
\[\begin{equation} \widehat{\mathcal{PE}_{ijkl}} = \frac{1}{\sigma_{y_{ij}|x}^2} \frac{1}{50} \sum_{r=1}^{50}{\left[\left(\boldsymbol{\beta}_{ij} - \boldsymbol{\hat{\beta}_{ijklr}}\right)^t \left(\boldsymbol{\Sigma}_{xx}\right)_i \left(\boldsymbol{\beta}_{ij} - \boldsymbol{\hat{\beta}_{ijklr}}\right)\right]} + 1 \tag{8} \end{equation}\] where \(\widehat{\mathcal{PE}_{ijkl}}\) is the estimated prediction error averaged over the 50 replicates \(r = 1, \ldots, 50\).
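As an illustration only, the replicate-averaged estimate and the two performance measures can be sketched in a few lines of Python. This is a minimal sketch, not the code used in the study: the function names, the argument layout, and the toy numbers are all assumptions made for this example. The estimate is taken as the mean of the quadratic loss over the replicates, as the text describes.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quadratic_form(d: Vector, sigma: Matrix) -> float:
    """Return d^t Sigma d for a vector d and a square matrix Sigma."""
    n = len(d)
    return sum(d[a] * sigma[a][b] * d[b] for a in range(n) for b in range(n))


def estimated_prediction_error(
    beta: Vector,
    beta_hats: Sequence[Vector],
    sigma_xx: Matrix,
    model_error_var: float,
) -> float:
    """Scaled quadratic loss averaged over replicates, plus 1.

    beta            : true coefficients for one response and one design
    beta_hats       : one estimated coefficient vector per replicate
    sigma_xx        : true predictor covariance matrix of the design
    model_error_var : true model error variance used for scaling
    """
    losses: List[float] = []
    for beta_hat in beta_hats:
        diff = [b - bh for b, bh in zip(beta, beta_hat)]
        losses.append(quadratic_form(diff, sigma_xx))
    return sum(losses) / len(losses) / model_error_var + 1.0


def minimum_error_and_components(pe_by_components: Sequence[float]) -> Tuple[float, int]:
    """Return (minimum PE, number of components attaining it).

    Position l of the input holds the estimated PE with l components,
    starting at l = 0.
    """
    best = min(range(len(pe_by_components)), key=lambda l: pe_by_components[l])
    return pe_by_components[best], best


# Toy example with made-up numbers (two predictors, two replicates).
beta = [1.0, 0.0]
beta_hats = [[1.0, 0.0], [0.0, 0.0]]
sigma_xx = [[1.0, 0.0], [0.0, 1.0]]
print(estimated_prediction_error(beta, beta_hats, sigma_xx, 0.5))  # 2.0
print(minimum_error_and_components([3.0, 2.0, 1.5, 1.7]))  # (1.5, 2)
```

In the toy example the two replicates give quadratic losses of 0 and 1, so the mean loss is 0.5; dividing by the assumed model error variance of 0.5 and adding 1 gives 2.0. The second call returns both measures for one method at once: the minimum prediction error and the number of components at which it occurs.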
The following section describes the data used to estimate these prediction errors, which are then used in the two models corresponding to the two performance measures listed at the beginning of this section.