Statistical Analysis

This section models the error data and the component data as functions of the simulation parameters, using multivariate analysis of variance (MANOVA), to better understand the connection between data properties and prediction methods.

Consider a model with up to third-order interactions among the simulation parameters (p, gamma, eta and relpos) and Methods, as in (11) and (12), fitted to the datasets \(\mathbf{u}\) and \(\mathbf{v}\), respectively. We refer to these as the error model and the component model.

Error Model: \[\begin{equation}\mathbf{u}_{abcdef} = \boldsymbol{\mu}_u + (\texttt{p}_a + \texttt{gamma}_b + \texttt{eta}_c + \texttt{relpos}_d + \texttt{Methods}_e)^3 + \left(\boldsymbol{\varepsilon}_u\right)_{abcdef} \tag{11} \end{equation}\]
Component Model: \[\begin{equation}\mathbf{v}_{abcdef} = \boldsymbol{\mu}_v + (\texttt{p}_a + \texttt{gamma}_b + \texttt{eta}_c + \texttt{relpos}_d + \texttt{Methods}_e)^3 + \left(\boldsymbol{\varepsilon}_v\right)_{abcdef} \tag{12} \end{equation}\]

where, \(\mathbf{u}_{abcdef}\) is a vector of prediction errors in the error model and \(\mathbf{v}_{abcdef}\) is a vector of the number of components used by a method to obtain minimum prediction error in the component model.
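To make the \((\cdot)^3\) notation in (11) and (12) concrete, the following sketch enumerates the terms it stands for, assuming the usual formula convention that the power expands to all main effects plus all two-factor and three-factor interactions. The helper name `expand_terms` and the `:` separator are illustrative choices, not part of the original analysis.

```python
from itertools import combinations

# The five factors appearing in models (11) and (12).
factors = ["p", "gamma", "eta", "relpos", "Methods"]


def expand_terms(factors, max_order):
    """List every main effect and interaction up to max_order (hypothetical helper)."""
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


terms = expand_terms(factors, 3)

# Group the terms by interaction order (1 = main effect).
by_order = {
    order: [t for t in terms if t.count(":") == order - 1]
    for order in (1, 2, 3)
}

for order, group in by_order.items():
    print(f"order {order}: {len(group)} terms")
    for term in group:
        print("   ", term)
```

With five factors this gives 5 main effects, 10 two-factor interactions and 10 three-factor interactions, which are the candidate effects whose Pillai statistics are compared later in the section.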

Although there are several test statistics for MANOVA, all are essentially equivalent for large samples (Johnson and Wichern 2018). Here we use Pillai’s trace statistic, which is defined as

\[\begin{equation} \text{Pillai statistic} = \text{tr}\left[ \left(\mathbf{E} + \mathbf{H}\right)^{-1}\mathbf{H} \right] = \sum_{i=1}^m{\frac{\nu_i}{1 + \nu_i}} \tag{13} \end{equation}\]
Here the matrix \(\mathbf{H}\) holds the between sums of squares and sums of products for each of the predictors, and the matrix \(\mathbf{E}\) holds the corresponding within sums of squares and sums of products. The \(\nu_i\), \(i = 1, \ldots, m\), are the eigenvalues of \(\mathbf{E}^{-1}\mathbf{H}\) (Rencher 2003).
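As a numerical check of the two forms in (13), the sketch below computes Pillai's trace both as a matrix trace and from the eigenvalues of \(\mathbf{E}^{-1}\mathbf{H}\). The data are a synthetic one-way layout (three groups, two responses) generated only for illustration; they are unrelated to the simulation study, and the function names are hypothetical.

```python
import numpy as np


def pillai_trace(H, E):
    """Pillai's trace as tr[(E + H)^{-1} H]."""
    return float(np.trace(np.linalg.solve(E + H, H)))


def pillai_from_eigenvalues(H, E):
    """Pillai's trace as the sum of nu_i / (1 + nu_i), nu_i eigenvalues of E^{-1} H."""
    nu = np.linalg.eigvals(np.linalg.solve(E, H)).real
    return float(np.sum(nu / (1.0 + nu)))


# Synthetic toy data (assumption for illustration): 3 groups, 2 responses each.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=shift, size=(20, 2)) for shift in (0.0, 0.5, 1.0)]
grand_mean = np.vstack(groups).mean(axis=0)

# Between-group (hypothesis) sums of squares and products.
H = sum(
    len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
    for g in groups
)
# Within-group (error) sums of squares and products.
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

v_trace = pillai_trace(H, E)
v_eigen = pillai_from_eigenvalues(H, E)
print(v_trace, v_eigen)
```

The two values agree up to floating-point error because the eigenvalues of \((\mathbf{E} + \mathbf{H})^{-1}\mathbf{H}\) are \(\nu_i / (1 + \nu_i)\).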

For both models (11) and (12), Pillai’s trace statistic is used to assess the effect of each factor, and the associated F-value indicates the strength of its significance. Figure 7 plots the Pillai’s trace statistics as bars, with the corresponding F-values as text labels, for both models.
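The text does not state how the F-values are derived from Pillai's trace. A minimal sketch of the standard F approximation is given below, on the assumption that this common conversion is the one in use. The function name and argument names are hypothetical: `p` here denotes the number of response variables (not the simulation parameter p), `q` the degrees of freedom of the tested term, and `df_error` the residual degrees of freedom.

```python
def pillai_to_f(V, p, q, df_error):
    """Approximate F statistic and degrees of freedom for Pillai's trace V.

    Assumed standard approximation:
      s = min(p, q), m = (|p - q| - 1) / 2, n = (df_error - p - 1) / 2
      F = (2n + s + 1) / (2m + s + 1) * V / (s - V)
      with s(2m + s + 1) and s(2n + s + 1) degrees of freedom.
    """
    s = min(p, q)
    m = 0.5 * (abs(p - q) - 1)
    n = 0.5 * (df_error - p - 1)
    df1 = s * (2 * m + s + 1)
    df2 = s * (2 * n + s + 1)
    F = (2 * n + s + 1) / (2 * m + s + 1) * V / (s - V)
    return F, df1, df2


# Illustrative numbers only: two responses, a term with 2 df, 57 residual df.
F, df1, df2 = pillai_to_f(V=0.5, p=2, q=2, df_error=57)
print(F, df1, df2)
```

Because the conversion depends on the degrees of freedom of each term, two factors with similar Pillai statistics can still have quite different F-values, which is why Figure 7 reports both.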

Figure 7: Pillai Statistic and F-value for the MANOVA model. The bar represents the Pillai Statistic and the text labels are F-value for the corresponding factor.

Error Model: Figure 7 (left) shows Pillai’s trace statistic for the factors of the error model. The main effect of Method, followed by relpos, eta and gamma, has the largest influence on the model. A highly significant two-factor interaction of Method with gamma, followed by relpos and eta, clearly shows that the methods perform differently at different levels of these data properties. The significant third-order interaction between Method, eta and gamma suggests that the performance of a method differs for a given level of multicollinearity and of correlation between the responses. Since only some methods model predictors and responses together, the prediction of a given method is affected by the level of correlation between the responses (eta).
Component Model: Figure 7 (right) shows Pillai’s trace statistic for the factors of the component model. As in the error model, the main effects of Method, relpos, gamma and eta have a significantly large effect on the number of components that a method uses to obtain minimum prediction error. The two-factor interactions of Method with the simulation parameters are larger in this case. This shows that Method and these interactions have a larger effect on the number of components used than on the prediction error itself. In addition, a significant third-order interaction similar to the one found in the error model is also observed in this model.

The following section will continue to explore the effects of different levels of the factors in the case of these interactions.

Effect Analysis of Error Model

The large difference in the prediction error for the envelope models in Figure 8 (left) is intensified when the relevant predictors are at positions 5, 6, 7, 8. The results also show that the envelope methods are more sensitive to the levels of eta than the rest of the methods. For PCR and PLS, the difference between the effects of the levels of eta is small.

In Figure 8 (right), we can see that multicollinearity (controlled by gamma) affects all the methods. However, the envelope methods perform better under low multicollinearity than under high multicollinearity, whereas PCR, PLS1 and PLS2 are robust to high multicollinearity. Despite handling high multicollinearity well, these methods have higher prediction error than the envelope methods at both levels of multicollinearity.

Figure 8: Effect plot of some interactions of the multivariate linear model of prediction error

Effect Analysis of Component Model

Figure 9: Effect plot of some interactions of the multivariate linear model of the number of components to get minimum prediction error

Unlike for prediction errors, Figure 9 (left) shows that the number of components used by the methods to obtain minimum prediction error is less affected by the levels of eta. All methods appear to use, on average, more components when eta increases. Envelope methods are able to obtain minimum prediction error using 1 to 3 components in both cases of relpos. This number is much higher for PCR, as its prediction is based only on the principal components of the predictor matrix. The number of components used by PCR ranges from 3 to 5 when the relevant components are at positions 1, 2, 3, 4, and from 5 to 8 when the relevant components are at positions 5, 6, 7, 8.

When the relevant components are at positions 5, 6, 7, 8, the eigenvalues of the relevant predictors become smaller and the responses are relatively difficult to predict. This becomes more critical in cases of high multicollinearity. Figure 9 (right) shows that the envelope methods are less influenced by the level of relpos and are particularly good at achieving minimum prediction error with fewer components than the other methods.