How ANOVA analyze the variance

Statistics
ANOVA
Author

TheRimalaya

Published

March 29, 2021

Modified

October 8, 2024

Analysis of Variance (ANOVA) is a powerful statistical tool used to analyze the variance among group means and determine whether these differences are statistically significant. It’s commonly used in various fields such as agriculture, biology, psychology, and more to test hypotheses about different groups. Below are some practical examples:

Data for ANOVA could be collected through designed experiments or by sampling from populations. This article helps explain how ANOVA analyzes variance and identifies situations when these differences are significant, using both simulated and real data.

Often we Analysis of Variance (ANOVA) to analyze the variances to find if different cases results in similar outcome and if the difference is significant. Following are some simple examples,

Consider the following model with \(i=3\) groups and \(j=n\) observations,

\[ y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \; i = 1, 2, 3 \texttt{ and } j = 1, 2, \ldots n \]

where, \(\tau_i\) represetns the effect corresponding to group \(i\) and \(\varepsilon_{ij} \sim \mathrm{N}(0, \sigma^2)\), the usual assumption of linear model. In order to better understand how ANOVA finds the differences between groups and how the group mean and their standard deviation influence the results from ANOVA, we will explore the following four cases:

Simultion design
Design <- tidytable::bind_rows(
    Case1 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(10, 10, 10),
        SD = c(5, 5, 5)
    ),
    Case2 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(10, 10, 10),
        SD = c(1, 1, 1)
    ),
    Case3 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(5, 10, 15),
        SD = c(5, 5, 5)
    ),
    Case4 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(5, 10, 15),
        SD = c(1, 1, 1)
    ), .id = "Cases")
  • Case 1: Similar group means with high variation within the groups
  • Case 2: Similar group means with low variation within the groups
  • Case 3: Distant group means with high variation within the groups
  • Case 4: Distant group means with low variation within the groups
Cases
Group1
Group2
Group3
Mean SD Mean SD Mean SD
Case1 10 5 10 5 10 5
Case2 10 1 10 1 10 1
Case3 5 5 10 5 15 5
Case4 5 1 10 1 15 1

Fitting ANOVA model for each cases

Simulate and fit ANOVA
generate_data <- function(mean, sd, nobs = 50) {
    Response <- rnorm(nobs, mean = mean, sd)
    tidytable(ID = 1:nobs, Response = Response)
}

Model <- Design %>% 
    mutate(Data = map2(Mean, SD, generate_data)) %>% 
    unnest() %>%
    nest(.by = "Cases") %>% 
    mutate(fit = map(data, function(dta) {
        lm(Response ~ Group, data = dta)
    }))

Model
# A tidytable: 4 × 3
  Cases data                  fit   
  <chr> <list>                <list>
1 Case1 <tidytable [150 × 5]> <lm>  
2 Case2 <tidytable [150 × 5]> <lm>  
3 Case3 <tidytable [150 × 5]> <lm>  
4 Case4 <tidytable [150 × 5]> <lm>  

Distribution of data

Data distribution
Model[, map_df(fit, broom::augment), by = Cases] %>%
  ggplot(aes(Response, Group)) +
  geom_boxplot(
    aes(fill = Group, color = Group), 
    alpha = 0.25, width = 0.25
  ) +
  geom_point(aes(fill = Group),
    position = position_jitter(height = 0.1),
    shape = 21, size = 2, stroke = 0.25, alpha = 0.25
  ) +
  stat_summary(
    fun = mean, geom = "point", aes(color = Group),
    size = 2, shape = 21, fill = "whitesmoke", stroke = 0.75
  ) +
  facet_wrap(facets = vars(Cases), scales = "free_x") +
  ggridges::geom_density_ridges(
    aes(color = Group),
    fill = NA,
    panel_scaling = FALSE
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme(
    legend.position = "none"
  )

Model comparison

ANOVA for the four cases
anova_result <- Model[, map_df(
  .x = fit,
  .f = ~ broom::tidy(anova(.x))
), by = Cases] %>% rename(
  DF = df,
  SSE = sumsq,
  MSE = meansq,
  Statistic = statistic,
  `p value` = p.value
)

anova_result %>% 
  mutate(
    Cases = case_when(
        Cases == "Case1" ~ "Case 1: Similar group means with high variation within the groups",
        Cases == "Case2" ~ "Case 2: Similar group means with low variation within the groups",
        Cases == "Case3" ~ "Case 3: Distant group means with high variation within the groups",
        TRUE ~ "Case 4: Distant group means with low variation within the groups"
    )
  ) %>% 
  gt::gt(groupname_col = "Cases", rowname_col = "term") %>%
  gt::fmt_number(columns = 4:6) %>%
  gt::fmt_number(columns = 7, decimals = 4) %>% 
  gt::sub_missing(missing_text = "") %>% 
  gt::tab_options(
    table.width = "100%",
    row_group.font.weight = "600"
  )
DF SSE MSE Statistic p value
Case 1: Similar group means with high variation within the groups
Group 2 32.09 16.05 0.64 0.5304
Residuals 147 3,704.26 25.20

Case 2: Similar group means with low variation within the groups
Group 2 1.98 0.99 1.00 0.3714
Residuals 147 146.03 0.99

Case 3: Distant group means with high variation within the groups
Group 2 1,951.86 975.93 50.15 0.0000
Residuals 147 2,860.66 19.46

Case 4: Distant group means with low variation within the groups
Group 2 2,516.51 1,258.26 1,263.82 0.0000
Residuals 147 146.35 1.00

Interpretetion

The results show a high p-value, indicating no significant difference between the groups due to high within-group variability.

Here, the p-value is still high, suggesting no significant difference, but the small variance within groups provides clearer insights compared to Case 1.

Despite the high variation within groups, the distant group means lead to a low p-value, indicating statistically significant differences among the groups.

With low within-group variation and distant means, the p-value remains extremely low, strongly indicating significant group differences.

In conclusion, ANOVA helps determine if there are significant differences between multiple group means by comparing variances within groups to variances between groups. The power of ANOVA lies in its ability to detect even subtle differences when variations are minimal within groups.