Discussions and Conclusion
Analysis of both simulated and real data has shown that the envelope methods are more stable, less influenced by relpos and gamma, and in general perform better than the PCR and PLS methods. These methods are also found to be less dependent on the number of components.
Each facet in Figures 5 and 6 has its own scale, so the large prediction errors visible in the right tails should be read with that in mind. Even with these, the envelope methods still have smaller prediction errors and use fewer components than the other methods.
The envelope methods can become caught in a local optimum of the objective function. If such cases of sub-optimal convergence had been identified and rerun to obtain better convergence, the envelope results might have been even better. In the case of the simultaneous envelope in particular, since users can specify the number of dimensions for the response envelope, the method can leverage the relevant space of the response, while PCR, PLS and Xenv are constrained to operate only on the predictor space.
Furthermore, we have fixed the coefficient of determination (\(R^2\)) as a constant throughout all the designs. Initial simulations (not shown) indicated that low \(R^2\) affects all methods in a similar manner and that the MANOVA is highly dominated by \(R^2\). Keeping the value of \(R^2\) fixed has allowed us to analyze other factors properly.
Two clear comments can be made about the effect of correlation in the response on the prediction methods. Highly correlated responses have shown the highest prediction error in general, and the effect is most distinct in the envelope methods. Since the envelope methods identify the relevant space as the span of relevant eigenvectors, they are able to obtain the minimum average prediction error using fewer components for all levels of eta.
To our knowledge, the effect of correlation in the response on PCR and PLS methods has been explored only to a limited extent. In this regard, it is interesting to see that these methods have used a large number of components and returned a larger prediction error than the envelope methods in the case of highly correlated responses. To fully understand the effect of eta, it is necessary to study the estimation performance of these methods with different numbers of components.
In addition, since using principal components or the actual variables as predictors in envelope methods has shown similar results, we have used the principal components that explain 97.5% of the variation, as mentioned previously, for the envelope methods in the designs where \(p>n\). The choice of 97.5% is somewhat arbitrary, but for the chosen simulation designs this proportion captured a fair amount of the variation in the predictor variables and also reduced the dimension significantly, while enabling us to use envelope methods in all settings. The analyst should choose this number to balance the amount of variation explained against the number of components that is practical for fitting the envelope model. The methodology used to adapt envelopes to settings in which \(p>n\) is in fact the same as that used by PLS: reduce by principal components, run the method, and then back transform to the original scale. The minor relative impact of \(p\) shown in Figure 7 suggests that this adaptation method is useful.
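The two bookkeeping steps of this adaptation, choosing how many principal components reach a target proportion of explained variation and mapping coefficients fitted on the component scores back to the original predictors, can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names `n_components_for` and `back_transform` and the numbers in the usage lines are hypothetical, the component variances and loadings are assumed to have been computed already from centered predictors, and the method fitted on the scores (envelope or otherwise) is left out.

```python
from itertools import accumulate


def n_components_for(variances, threshold=0.975):
    """Smallest k such that the k largest component variances account for
    at least `threshold` of the total variance (hypothetical helper)."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= threshold:
            return k
    # Reached only if rounding keeps the last ratio just below the threshold.
    return len(ordered)


def back_transform(loadings, gamma):
    """Map coefficients on k component scores back to the p original predictors.

    loadings: p x k nested list whose columns are the retained principal directions.
    gamma:    k x m nested list of coefficients fitted on the component scores.
    Returns the p x m coefficient matrix, i.e. the product loadings * gamma.
    """
    k = len(gamma)
    m = len(gamma[0])
    return [
        [sum(row[j] * gamma[j][c] for j in range(k)) for c in range(m)]
        for row in loadings
    ]


# Illustrative numbers only: five component variances, threshold 97.5%.
k = n_components_for([6.0, 3.0, 0.8, 0.15, 0.05])

# Three predictors, two retained components, one response.
beta = back_transform([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [[2.0], [3.0]])
```

Because the scores are linear combinations of the centered predictors, the back-transformed coefficients apply to centered predictors as well, so an intercept has to be restored separately when predicting on the original scale.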
The results from this study will help researchers understand how these methods perform on various kinds of linear model data and encourage them to use newly developed methods such as the envelopes. Since this study has focused entirely on prediction performance, further analysis of the estimative properties of these methods is required. A study of estimation error and of the performance of the methods with a non-optimal number of components can give a deeper understanding of these methods.
A shiny application (Chang et al. 2018) is available at http://therimalaya.shinyapps.io/Comparison where all the results related to this study can be visualized. In addition, a GitHub repository at https://github.com/therimalaya/03-prediction-comparison can be used to reproduce this study.