Factors in R

Mastering Categorical Data for Analysis and Visualization

Author: Raju Rimal
Published: November 30, 2024
Modified: March 19, 2025

Factors are an essential data type in R, designed to handle categorical data. They enable efficient data storage, sorting, visualization, and modeling. This article explores the fundamentals of factors, factor manipulation techniques, and the use of the forcats package for advanced functionality.

1. Understanding Factors

Factors are used to represent categorical data. They store unique categories as levels and map them to integer codes for efficiency.

Creating a Factor

# Creating a factor
categories <- c("Low", "Medium", "High", "Medium", "Low")
factor_categories <- factor(categories)
print(factor_categories)

[1] Low    Medium High   Medium Low   
Levels: High Low Medium

Checking Factor Levels

# Levels of the factor
levels(factor_categories)

[1] "High"   "Low"    "Medium"
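To make the mapping between levels and integer codes concrete, here is a minimal sketch in base R. The vector name sizes is a hypothetical stand-in that mirrors the example above.

```r
# Hypothetical example vector; mirrors the one used above
sizes <- factor(c("Low", "Medium", "High", "Medium", "Low"))

# Levels are the sorted unique values
levels(sizes)            # "High" "Low" "Medium"

# Each element is stored as the position of its level
codes <- as.integer(sizes)
codes                    # 2 3 1 3 2

# Indexing the levels by the codes recovers the original labels
levels(sizes)[codes]
```

Because only the codes are stored per element, each distinct label is kept once, however often it occurs.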

2. Converting Data to Factors

# Convert a character vector to a factor
char_vector <- c("A", "B", "A", "C")
factor_vector <- factor(char_vector)
print(factor_vector)

[1] A B A C
Levels: A B C

# Convert numeric data to a factor
numeric_vector <- c(1, 2, 1, 3)
factor_numeric <- factor(numeric_vector)
print(factor_numeric)

[1] 1 2 1 3
Levels: 1 2 3
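Going the other way, from a factor of numbers back to numeric values, is a common stumbling block, because as.numeric() returns the internal codes rather than the labels. The sketch below uses a hypothetical vector named readings to show the difference.

```r
# Hypothetical numeric data stored as a factor
readings <- factor(c(10, 20, 10, 30))

# as.numeric() returns the internal codes, not the original values
as.numeric(readings)                    # 1 2 1 3

# Go through the labels to recover the numbers
as.numeric(as.character(readings))      # 10 20 10 30

# Equivalent, converting each level only once
as.numeric(levels(readings))[readings]  # 10 20 10 30
```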

3. Reordering Factor Levels

Specifying a Custom Order

By default, factor levels are sorted alphabetically (for character data, in the sort order of the current locale). To specify a custom order, pass the levels argument:

# Custom order for levels
ordered_factor <- factor(
  categories, 
  levels = c("Low", "Medium", "High")
)
print(ordered_factor)

[1] Low    Medium High   Medium Low   
Levels: Low Medium High

Releveling Factors

# Change the reference level
releveled_factor <- relevel(ordered_factor, ref = "High")
print(releveled_factor)

[1] Low    Medium High   Medium Low   
Levels: High Low Medium
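The reference level matters mainly in modeling, where it serves as the baseline that the other levels are compared against. The following sketch assumes R's default treatment contrasts and uses a hypothetical factor named dose to show how relevel() changes the dummy columns that a model would receive.

```r
# Hypothetical factor with "Low" as the first (reference) level
dose <- factor(
  c("Low", "Medium", "High", "Medium", "Low"),
  levels = c("Low", "Medium", "High")
)

# Dummy columns are created for every level except the reference
cols_low_ref <- colnames(model.matrix(~ dose))
cols_low_ref
# "(Intercept)" "doseMedium" "doseHigh"

# After relevel(), "High" becomes the baseline instead
dose_high_ref <- relevel(dose, ref = "High")
cols_high_ref <- colnames(model.matrix(~ dose_high_ref))
cols_high_ref
# "(Intercept)" "dose_high_refLow" "dose_high_refMedium"
```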

4. Relabeling Factor Levels

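Renaming Levels

# Rename factor levels in place; the new names are matched by position,
# so they must follow the current level order (Low, Medium, High)
levels(ordered_factor) <- c("L", "M", "H")
print(ordered_factor)

[1] L M H M L
Levels: L M H

Using forcats for Relabeling

The forcats package provides functions for working with factors. Here, fct_recode maps the abbreviated levels back to their full names using new = "old" pairs.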

library(forcats)

# Relabel levels using fct_recode
relabelled_factor <- fct_recode(
  ordered_factor, 
  Low = "L", 
  Medium = "M", 
  High = "H"
)
print(relabelled_factor)

[1] Low    Medium High   Medium Low   
Levels: Low Medium High
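Renaming by position is easy to get wrong if the level order is not what you expect. As a base R alternative to fct_recode, the sketch below renames levels by name through a lookup vector. The names grade and lookup are hypothetical and chosen only for illustration.

```r
# Hypothetical factor with abbreviated levels
grade <- factor(c("L", "M", "H", "M", "L"), levels = c("L", "M", "H"))

# Named lookup: names are the old levels, values are the new ones
lookup <- c(L = "Low", M = "Medium", H = "High")

# Index the lookup by the current levels so the mapping does not
# depend on the order in which the levels happen to be stored
levels(grade) <- unname(lookup[levels(grade)])
print(grade)
```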

5. Combining and Collapsing Levels

Collapsing Levels

# Combine levels (relabelled_factor still has the Low, Medium, High levels)
collapsed_factor <- fct_collapse(
  relabelled_factor,
  Low_Med = c("Low", "Medium"),
  High = "High"
)
print(collapsed_factor)

[1] Low_Med Low_Med High    Low_Med Low_Med
Levels: Low_Med High

Lump Rare Levels

# Keep the 2 most frequent levels and lump the rest into "Other"
lumped_factor <- fct_lump(factor_categories, n = 2)
print(lumped_factor)

[1] Low    Medium Other  Medium Low   
Levels: Low Medium Other
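To lump by a minimum count rather than by the number of levels to keep, forcats also provides fct_lump_min (available in forcats 0.5.0 and later, where fct_lump itself is superseded by more specific variants such as fct_lump_n). A minimal sketch, using a hypothetical vector named sizes that mirrors the earlier example:

```r
library(forcats)

# Hypothetical example vector; mirrors the one used earlier
sizes <- factor(c("Low", "Medium", "High", "Medium", "Low"))

# Lump every level that appears fewer than 2 times into "Other"
lumped_min <- fct_lump_min(sizes, min = 2)
table(lumped_min)
```

Here only "High" occurs once, so it is the only level moved into "Other".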

6. Ordering Factors

Alphabetical Ordering

# Default alphabetical order
alphabetical_order <- factor(categories)
print(alphabetical_order)

[1] Low    Medium High   Medium Low   
Levels: High Low Medium

Ordering by Frequency

# Order by frequency (most frequent level first)
freq_ordered_factor <- fct_infreq(factor_categories)
print(freq_ordered_factor)

[1] Low    Medium High   Medium Low   
Levels: Low Medium High

7. Visualizing Factors

Factors control how categories appear in plots: ggplot2 draws a discrete axis in the order of the factor levels, so reordering the levels reorders the bars.

library(ggplot2)

dta <- data.frame(categories = factor_categories)

plt1 <- ggplot(dta, aes(categories)) + 
  geom_bar() +
  labs(title = "Without ordering categories") +
  theme_grey(base_size = 18)

plt2 <- ggplot(dta, aes(fct_infreq(categories))) + 
  geom_bar() +
  labs(title = "Ordering by frequency of category") +
  theme_grey(base_size = 18)

plt1
plt2

8. Common Factor Operations

Dropping Unused Levels

# Dropping unused levels
categories <- factor(
  c("A", "B", "C"), 
  levels = c("A", "B", "C", "D", "E")
)
print(categories)

[1] A B C
Levels: A B C D E

clean_categories <- droplevels(categories)
print(clean_categories)

[1] A B C
Levels: A B C

Checking Factor Properties

# Check if an object is a factor
is.factor(ordered_factor)

[1] TRUE
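Unused levels most often appear after subsetting, since a subset keeps the full level set of its parent. The sketch below combines that case with a few other base R checks (nlevels, is.ordered, summary). The name group is a hypothetical example.

```r
# Hypothetical factor with three observed levels
group <- factor(c("A", "B", "C", "B", "A"))

# Subsetting keeps the full set of levels, even unobserved ones
subset_group <- group[group != "C"]
nlevels(subset_group)        # 3
table(subset_group)          # "C" is still listed, with a count of 0

# droplevels() removes levels that no longer occur
trimmed <- droplevels(subset_group)
nlevels(trimmed)             # 2

# Other quick checks
is.ordered(trimmed)          # FALSE: the levels carry no ranking
summary(trimmed)             # counts per level
```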

9. Practical Example

Scenario: Categorizing and Sorting Survey Responses

# Survey data
responses <- c(
  "Agree", "Neutral", "Disagree", 
  "Agree", "Disagree", "Agree"
)

# Convert to factor with custom levels
factor_responses <- factor(
  responses, 
  levels = c("Disagree", "Neutral", "Agree")
)

# Plot the responses
response_df <- data.frame(responses = factor_responses)
ggplot(response_df, aes(x = responses)) +
  geom_bar() +
  labs(
    title = "Survey Responses", 
    x = "Response", 
    y = "Count"
  )
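Since survey responses like these have a natural ranking, one possible extension is to declare the factor as ordered. This is a sketch rather than part of the original scenario; the name ordered_responses is hypothetical.

```r
# Survey responses as in the scenario above
responses <- c(
  "Agree", "Neutral", "Disagree",
  "Agree", "Disagree", "Agree"
)

# ordered = TRUE records that the levels have a ranking
ordered_responses <- factor(
  responses,
  levels = c("Disagree", "Neutral", "Agree"),
  ordered = TRUE
)

# Counts are reported in level order, not alphabetical order
table(ordered_responses)

# Ordered factors support comparisons between levels
sum(ordered_responses >= "Neutral")   # 4
```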

10. Exercises for Practice

Convert a numeric dataset to a factor and assign meaningful labels to its levels.
Reorder a factor based on the median value of another variable.
Visualize a dataset with multiple factors using ggplot2.

Conclusion

Factors are a cornerstone of R programming for handling categorical data. Whether you’re visualizing responses, sorting data, or building models, understanding how to work with factors is essential. By leveraging tools like the forcats package, you can efficiently manage and manipulate categorical data to unlock deeper insights.