Prediction Methods
Partial least squares regression (PLS) and principal component regression (PCR) have been used in many disciplines, such as chemometrics, econometrics, bioinformatics and machine learning, where wide predictor matrices, i.e. \(p\) (number of predictors) > \(n\) (number of observations), are common. These methods are popular in multivariate analysis, especially for exploratory studies and prediction. More recently, the envelope concept introduced by Cook, Li, and Chiaromonte (2007), based on reducing subspaces of the regression model, has been used to develop several new estimators. This study compares these prediction methods based on their prediction performance on data simulated with different controlled properties.
- Principal Components Regression (PCR):
- Principal components are linear combinations of the predictor variables, constructed so that the new variables are uncorrelated and so that the variation of the original dataset captured by them is sorted in descending order. In other words, each successive component captures the maximum variation in the predictor variables left by the preceding components (Jolliffe 2002). Principal components regression uses these principal components as new predictors to explain the variation in the response (a minimal usage sketch follows this list).
- Partial Least Squares (PLS):
- Two variants of PLS are used for comparison: PLS1 and PLS2. The former models each response variable separately, i.e. each response is predicted with a single-response model, while the latter models all response variables jointly. In PLS regression, the components are determined so as to maximize the covariance between response and predictors (Jong 1993). There are three main PLS algorithms: NIPALS, SIMPLS and the kernel algorithm, all of which remove the extracted information through deflation, making the resulting new variables orthogonal. The algorithms differ in their deflation strategy and in the computation of the various weight vectors (Alin 2009); here we have used the kernel version of PLS. The R-package `pls` (Mevik, Wehrens, and Liland 2018) is used for both the PCR and PLS methods (see the sketches after this list).
- Envelopes:
- The envelope, introduced by Cook, Li, and Chiaromonte (2007), was first used to define the response envelope (Cook, Li, and Chiaromonte 2010): the smallest reducing subspace of \(\Sigma_{y|x}\) that contains the span of the regression coefficients. Since a multivariate linear regression model contains relevant (material) and irrelevant (immaterial) variation in both the response and the predictors, the relevant part provides information, while the irrelevant part inflates the estimation variance. Envelope estimation uses the relevant part while excluding the irrelevant part, consequently increasing the efficiency of the model (Cook and Zhang 2016).
- The concept was later extended to the predictor space, where the predictor envelope was defined (Cook, Helland, and Su 2013). Cook and Zhang (2015) further used envelopes for the joint reduction of responses and predictors, and argued that this yields efficiency gains greater than those obtained by using individual envelopes for either the responses or the predictors alone. All the variants of envelope estimation are based on maximum likelihood estimation. Here we have used the predictor envelope (Xenv) and the simultaneous envelope (Senv) for the comparison. The R-package `Renvlp` (Lee and Su 2018) is used for both the Xenv and Senv methods (see the envelope sketch after this list).
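
Below is a minimal sketch of a PCR fit with the `pls` package. The simulated data, object names and component counts are illustrative assumptions, not the study's actual simulation design.

```r
library(pls)

## Illustrative data (not the study's simulation design)
set.seed(1)
n <- 50; p <- 15
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(1, 0.5, -0.5)) + rnorm(n)
dat <- data.frame(y = y, X = I(X))

## PCR: regress y on the principal components of X (here 5 of them)
fit_pcr <- pcr(y ~ X, ncomp = 5, data = dat, validation = "CV")

## Cumulative % of X-variation captured by successive components;
## each component captures the maximum variation left by its predecessors
cumsum(explvar(fit_pcr))
```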
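A corresponding sketch of the PLS fits, contrasting PLS2 (all responses modelled jointly) with PLS1 (one single-response model per response). The data and names are again illustrative; `method = "kernelpls"` selects the kernel algorithm used in this study.

```r
library(pls)

## Illustrative data with m responses
set.seed(1)
n <- 50; p <- 15; m <- 4
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:2] %*% matrix(rnorm(2 * m), 2, m) + matrix(rnorm(n * m), n, m)
dat <- data.frame(Y = I(Y), X = I(X))

## PLS2: all m responses modelled jointly with the kernel algorithm
fit_pls2 <- plsr(Y ~ X, ncomp = 5, data = dat, method = "kernelpls")

## PLS1: a separate single-response kernel-PLS model for each response
fit_pls1 <- lapply(seq_len(m), function(j)
  plsr(Y[, j] ~ X, ncomp = 5, data = dat, method = "kernelpls"))
```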
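For the envelope methods, a sketch under the assumption that the `xenv()` and `stenv()` functions of the `Renvlp` package are used for Xenv and Senv, respectively; the dimensions and data are illustrative, and \(n > p\) here since the maximum likelihood fits require it.

```r
library(Renvlp)

## Illustrative data with n > p
set.seed(1)
n <- 100; p <- 8; m <- 4
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:2] %*% matrix(rnorm(2 * m), 2, m) + matrix(rnorm(n * m), n, m)

## Predictor envelope (Xenv) with envelope dimension u = 2
fit_xenv <- xenv(X, Y, u = 2)
fit_xenv$beta                     # estimated regression coefficients

## Simultaneous envelope (Senv): q = predictor-envelope dimension,
## u = response-envelope dimension
fit_senv <- stenv(X, Y, q = 2, u = 2)
```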
Modification in envelope estimation
Since the envelope estimators (Xenv and Senv) are based on maximum likelihood estimation (MLE), they fail to produce estimates for wide matrices, i.e. \(p > n\). To include these methods in our comparison, for the designs where \(p > n\) we replaced the predictor variables \((\mathbf{x})\) with their principal components \((\mathbf{z})\), using the number of components required to capture 97.5% of the variation in \(\mathbf{x}\). The new set of variables \(\mathbf{z}\) was then used for envelope estimation. The regression coefficients \((\hat{\boldsymbol{\alpha}})\) corresponding to these new variables \(\mathbf{z}\) were transformed back to obtain coefficients for the original predictor variables as \[\hat{\boldsymbol{\beta}} = \mathbf{e}_k\hat{\boldsymbol{\alpha}}_k,\] where \(\mathbf{e}_k\) is the matrix whose columns are the eigenvectors of the first \(k\) components. A sketch of this procedure is given below.

Among the methods considered, only the simultaneous envelope allows the dimension of the response envelope to be specified; since all the simulations are based on a single latent dimension in the response, this dimension is fixed at two in the simulation study. In the case of Senv, when the envelope dimension for the response equals the number of responses, the method degenerates to Xenv, and when the envelope dimension for the predictors equals the number of predictors, it degenerates to standard multivariate linear regression (Lee and Su 2018).
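The following sketch illustrates this workaround, assuming low-rank simulated predictors so that a few components reach the 97.5% threshold; all names and data are illustrative, and the centring of \(\mathbf{x}\) (and hence the intercept) is glossed over.

```r
library(Renvlp)

## Illustrative wide design (p > n) with few latent dimensions in x
set.seed(1)
n <- 40; p <- 60; m <- 4
T_lat <- matrix(rnorm(n * 5), n, 5)
X <- T_lat %*% matrix(rnorm(5 * p), 5, p) + 0.1 * matrix(rnorm(n * p), n, p)
Y <- T_lat[, 1:2] %*% matrix(rnorm(2 * m), 2, m) + matrix(rnorm(n * m), n, m)

## Principal components of x; keep enough to capture 97.5% of the variation
pc  <- prcomp(X)
k   <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= 0.975)[1]
Z   <- pc$x[, 1:k, drop = FALSE]            # scores: the new predictors z
e_k <- pc$rotation[, 1:k, drop = FALSE]     # eigenvectors e_k (p x k)

## Envelope fit on z, then back-transformation: beta-hat = e_k alpha-hat
fit   <- xenv(Z, Y, u = 2)
alpha <- fit$beta                           # coefficients for z (k x m)
beta  <- e_k %*% alpha                      # coefficients for the original x
```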