Experimental Design

This study compares prediction methods based on their prediction ability. Data with specific properties are simulated, some of which are easier to predict than others. These data are simulated using the R package simrel, which is discussed in Sæbø, Almøy, and Helland (2015) and Rimal, Almøy, and Sæbø (2018). Here we have used four factors to vary the properties of the data: a) Number of predictors (p), b) Multicollinearity in the predictor variables (gamma), c) Correlation among the response variables (eta) and d) Position of the predictor components relevant for the response (relpos). Using two levels of p, gamma and relpos and four levels of eta, 32 sets of distinct properties are designed for the simulation.

Number of predictors:
To observe the performance of the methods on tall and wide predictor matrices, 20 and 250 predictor variables are simulated with the number of observations fixed at 100. Parameter p controls these properties in the simrel function.
Multicollinearity in predictor variables:

Highly collinear predictors can be explained completely by a few components. The parameter gamma (\(\gamma\)) in simrel controls the decline in the eigenvalues of the predictor variables as in (5).

\[\begin{equation} \lambda_i = e^{-\gamma(i - 1)}, \gamma > 0 \text{ and } i = 1, 2, \ldots, p \tag{5} \end{equation}\]

Here, \(\lambda_i, i = 1, 2, \ldots, p\) are the eigenvalues of the predictor variables. We have used 0.2 and 0.9 as the two levels of gamma. The higher the value of gamma, the higher the multicollinearity will be, and vice versa. In our simulations, the higher and lower gamma values corresponded to maximum correlations between the predictors of 0.990 and 0.709, respectively, in the case of \(p = 20\) variables. In the case of \(p = 250\), the corresponding maximum correlations were 0.998 and 0.923.
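The effect of the two gamma levels on the decay in (5) can be sketched directly in R (an illustrative computation, not part of the simrel package itself):

```r
## Eigenvalue decay of equation (5): lambda_i = exp(-gamma * (i - 1)).
eigen_decay <- function(gamma, p) exp(-gamma * (seq_len(p) - 1))

lambda_low  <- eigen_decay(0.2, 20)  ## slow decay: mild multicollinearity
lambda_high <- eigen_decay(0.9, 20)  ## fast decay: strong multicollinearity

## With gamma = 0.9 the fifth eigenvalue is already below 3% of the first,
## whereas with gamma = 0.2 it is still about 45% of the first.
round(lambda_low[1:5], 3)
round(lambda_high[1:5], 3)
```

The faster the eigenvalues vanish, the fewer components are needed to span the predictor space, which is exactly the multicollinearity property the gamma factor manipulates.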

Correlation in response variables:

Correlation among response variables has been explored to a lesser extent in the literature. Here we investigate this aspect using four levels of correlation in the response variables. We have used the eta (\(\eta\)) parameter of simrel to control the decline in the eigenvalues corresponding to the response variables as in (6).

\[\begin{equation} \kappa_j = e^{-\eta(j - 1)}, \eta > 0 \text{ and } j = 1, 2, \ldots, m \tag{6} \end{equation}\]

Here, \(\kappa_j, j = 1, 2, \ldots, m\) are the eigenvalues of the response variables and \(m\) is the number of response variables. We have used 0, 0.4, 0.8 and 1.2 as the levels of eta. The larger the value of eta, the larger the correlation between the response variables will be, and vice versa. In our simulations, the levels of eta from small to large correspond to maximum correlations of 0, 0.442, 0.729 and 0.878 between the response variables, respectively.
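For \(m = 4\) responses, the eigenvalues implied by (6) at the four eta levels can be tabulated with a short R sketch (illustrative only; simrel performs this internally):

```r
## Response eigenvalues of equation (6): kappa_j = exp(-eta * (j - 1)).
kappa <- function(eta, m = 4) exp(-eta * (seq_len(m) - 1))

## Columns correspond to eta = 0, 0.4, 0.8 and 1.2.
round(sapply(c(0, 0.4, 0.8, 1.2), kappa), 3)
## eta = 0 gives four equal eigenvalues (uncorrelated responses), while
## eta = 1.2 concentrates the variance in the first response component.
```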

Position of predictor components relevant to the response:
The principal components of the predictors are ordered. The first principal component captures most of the variation in the predictors. The second captures most of the remainder left by the first, and so on. In highly collinear predictors, the variation captured by the first few components is relatively high. However, if those components are not relevant for the response, prediction becomes difficult (Helland and Almøy 1994). Here, two levels for the positions of these relevant components are used: components 1, 2, 3, 4 and components 5, 6, 7, 8.

A complete factorial design over the levels of the above parameters gave us 32 designs. Each design is associated with a dataset having unique properties. Figure 2 shows all the designs. For each design and prediction method, 50 datasets were simulated as replicates. In total, there were \(5 \times 32 \times 50\), i.e. 8000, simulated datasets.
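The factorial crossing can be reproduced with base R's expand.grid (a sketch using the levels described above; the relpos labels are shorthand for the two component sets):

```r
## Full factorial crossing of the four simulation factors:
## 2 (p) x 2 (gamma) x 4 (eta) x 2 (relpos) = 32 designs.
design <- expand.grid(
  p      = c(20, 250),
  gamma  = c(0.2, 0.9),
  eta    = c(0, 0.4, 0.8, 1.2),
  relpos = c("1:4", "5:8"),
  stringsAsFactors = FALSE
)
nrow(design)  ## 32 distinct data properties
```

With 5 prediction methods and 50 replicates per design, this yields the 8000 simulated datasets stated above.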


Figure 2: Experimental Design of simulation parameters. Each point represents a unique data property.

Common parameters:
Each dataset was simulated with \(n = 100\) observations and \(m = 4\) response variables. Furthermore, the coefficient of determination corresponding to each response component in all the designs is set to 0.8. The informative and uninformative latent components are generated according to (3). Since \(\boldsymbol{\Sigma}_{ww}\) and \(\boldsymbol{\Sigma}_{zz}\) are diagonal matrices, the components are independent within \(\mathbold{w}\) and \(\mathbold{z}\), but dependence between the latent spaces of \(\mathbold{x}\) and \(\mathbold{y}\) is secured through the non-zero elements of \(\boldsymbol{\Sigma}_{wz}\), whose positions are defined by the relpos and ypos parameters. The latent components are subsequently rotated to obtain the population covariance structure of the response and predictor variables. In addition, we have assumed that there is only one informative response component. Hence, the informative response component after the orthogonal rotation, together with three uninformative response components, generates four response variables. This spreads out the information in all simulated response variables. For further details on the simulation tool, see Rimal, Almøy, and Sæbø (2018).
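The block structure described above can be sketched in R for one design (the off-diagonal value 0.3 is an arbitrary small covariance chosen for illustration, not the value simrel derives from the R2 constraint):

```r
## Latent covariance structure: diagonal Sigma_zz (predictor components,
## eq. (5)) and Sigma_ww (response components, eq. (6)), with Sigma_wz
## non-zero only where the single informative response component meets
## the relevant predictor components (relpos = 1:4).
p <- 20; m <- 4; relpos <- 1:4
Sigma_zz <- diag(exp(-0.2 * (seq_len(p) - 1)))   ## gamma = 0.2
Sigma_ww <- diag(exp(-0   * (seq_len(m) - 1)))   ## eta = 0: identity
Sigma_wz <- matrix(0, m, p)
Sigma_wz[1, relpos] <- 0.3   ## illustrative covariance, not derived from R2

Sigma <- rbind(cbind(Sigma_ww, Sigma_wz),
               cbind(t(Sigma_wz), Sigma_zz))
## A valid covariance matrix must be positive definite:
all(eigen(Sigma, only.values = TRUE)$values > 0)
```

Rotating this latent covariance with orthogonal matrices is what turns the sparse component-level relevance into covariance spread across all observed variables.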

An example of simulation parameters for the first design is as follows:

simrel(
    n       = 100,                 ## Training samples
    p       = 20,                  ## Predictors
    m       = 4,                   ## Responses
    q       = 20,                  ## Relevant predictors
    relpos  = list(c(1, 2, 3, 4)), ## Relevant predictor components index
    eta     = 0,                   ## Decay factor of response eigenvalues
    gamma   = 0.2,                 ## Decay factor of predictor eigenvalues
    R2      = 0.8,                 ## Coefficient of determination
    ypos    = list(c(1, 2, 3, 4)), ## Response components to rotate together
    type    = "multivariate"       ## Multivariate response simulation
    type    = "multivariate"
)

Figure 3: (left) Covariance structure of latent components (right) Covariance structure of predictor and response

The covariance structure of the data simulated with this design, shown in Figure 3, shows that the predictor components at positions 1, 2, 3 and 4 are relevant for the first response component. After the rotation with an orthogonal rotation matrix, all predictor variables are somewhat relevant for all response variables, satisfying the other desired properties such as multicollinearity and the coefficient of determination. For the same design, Figure 4 (top left) shows that predictor components 1, 2, 3 and 4 are relevant for the first response component. All other predictor components are irrelevant, and all other response components are uninformative. However, due to the orthogonal rotation of the informative response component together with the uninformative response components, all response variables in the population have similar covariance with the relevant predictor components (Figure 4, top right). The sample covariances of the predictor components and the predictor variables with the response variables are shown in Figure 4 (bottom left) and (bottom right), respectively.


Figure 4: Expected Scaled absolute covariance between predictor components and response components (top left). Expected Scaled absolute covariance between predictor components and response variables (top right). Sample scaled absolute covariance between predictor components and response variables (bottom left). Sample scaled absolute covariance between predictor variables and response variables (bottom right). The bar graph in the background represents eigenvalues corresponding to each component in the population (top plots) and in the sample (bottom plots). One can compare the top-right plot (true covariance of the population) with bottom-left (covariance in the simulated data) which shows a similar pattern for different components.

A similar description can be made for all 32 designs, each of which defines the properties of the data it simulates. These data are used by the prediction methods discussed in the previous section. Each prediction method is given independently simulated datasets in order to give the methods an equal opportunity to capture the dynamics in the data.