tidyverse in R

Streamlining Data with tibble and dplyr

Author

Raju Rimal

Published

November 30, 2024

Modified

March 19, 2025

The tidyverse is a collection of R packages designed to make data manipulation, visualization, and analysis consistent and intuitive. Among the tools it offers, tibble and dplyr are the core packages for working with tabular data. If you’re used to working with base R’s data.frame and want a more readable approach to data handling, the tidyverse is a good toolkit to reach for.


Introduction to Tidyverse

The tidyverse is an ecosystem of packages for data wrangling, visualization, and more. Loading it with library(tidyverse) attaches its core packages, which include:

  • ggplot2 for visualization.
  • dplyr for data manipulation.
  • tibble for modern data frames.
  • tidyr for reshaping data.

However, when working with tabular data, the two most relevant packages you’ll interact with are dplyr and tibble.


Introducing tibble

A tibble is a modern take on the traditional data.frame. It is part of the tidyverse and comes with a few key advantages:

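  • Prints in a cleaner format: By default, printing shows only the first 10 rows and the columns that fit on screen, along with each column’s type, so you don’t get overwhelmed with the full dataset.
  • No row names: Tibbles do not support row names, which avoids unnecessary clutter. Anything you would store there belongs in a regular column.
  • Supports non-standard column names: You can have spaces or special characters in column names, as long as you wrap those names in backticks when you refer to them.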

Creating a tibble is as easy as using the tibble() function:

library(tidyverse)

# Creating a tibble
students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

# Inspect the tibble
students
# A tibble: 3 × 3
  Name      Age Grade
  <chr>   <dbl> <chr>
1 Alice      25 A    
2 Bob        30 B    
3 Charlie    22 A    

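Notice that the output is more informative than a data.frame printout: the header reports the dimensions, and each column’s type appears under its name. With large datasets the difference is bigger still, because only the first few rows are printed.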
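As a small sketch of the points above, the example below converts an existing data.frame with as_tibble() and builds a tibble whose column names contain spaces and special characters. The objects scores_df and exam, and their column names, are made up for illustration.

```r
library(tibble)

# Hypothetical data, used only for illustration
scores_df <- data.frame(
  Name = c("Alice", "Bob"),
  Score = c(85, 88)
)

# Convert an existing data.frame to a tibble
scores <- as_tibble(scores_df)
class(scores)

# Non-syntactic names are kept as written;
# wrap them in backticks when creating or referring to them
exam <- tibble(
  `Student Name` = c("Alice", "Bob"),
  `Score (%)` = c(85, 88)
)
names(exam)
exam$`Score (%)`
```

By contrast, data.frame() with its default check.names = TRUE would rewrite such names into syntactic ones (for example Student.Name).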


Working with dplyr for Data Manipulation

Once you’re comfortable with tibbles, the real power of tidyverse comes from dplyr. This package provides intuitive and efficient ways to manipulate tabular data with verbs that describe what you want to do to the data.

Selecting Columns

To select specific columns, use select():

# Select specific columns
students %>% select(Name, Grade)
# A tibble: 3 × 2
  Name    Grade
  <chr>   <chr>
1 Alice   A    
2 Bob     B    
3 Charlie A    
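select() also accepts helpers, so you don’t have to list every column by name. The sketch below rebuilds the students tibble so it can run on its own and shows three common patterns; which helper fits best depends on your data.

```r
library(dplyr)
library(tibble)

students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

# Drop a column by negating it
students %>% select(-Age)

# Keep only the numeric columns
students %>% select(where(is.numeric))

# Keep columns whose names start with "G"
students %>% select(starts_with("G"))
```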

Filtering Rows

Use filter() to subset rows that meet a certain condition:

# Filter rows where Age > 25
students %>% filter(Age > 25)
# A tibble: 1 × 3
  Name    Age Grade
  <chr> <dbl> <chr>
1 Bob      30 B    
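Conditions can be combined. As a rough sketch using the same students data (rebuilt here so the block runs on its own), conditions separated by commas are joined with AND, while | means OR and %in% tests membership in a set of values:

```r
library(dplyr)
library(tibble)

students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

# Both conditions must hold (comma means AND): Alice and Charlie
students %>% filter(Age < 30, Grade == "A")

# Either condition may hold (| means OR): all three students
students %>% filter(Age > 25 | Grade == "A")

# Match against a set of values
students %>% filter(Grade %in% c("A", "B"))
```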

Mutating and Creating New Columns

Use mutate() to add or modify columns:

# Create a new column
students %>% mutate(Passed = Grade == "A")
# A tibble: 3 × 4
  Name      Age Grade Passed
  <chr>   <dbl> <chr> <lgl> 
1 Alice      25 A     TRUE  
2 Bob        30 B     FALSE 
3 Charlie    22 A     TRUE  
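mutate() can create several columns at once, and it pairs well with if_else() and case_when() for conditional values. The column names and labels below (Age_Next_Year, Age_Group, Status) are invented for this sketch and are not part of the original example.

```r
library(dplyr)
library(tibble)

students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

students_extra <- students %>%
  mutate(
    # Simple arithmetic on an existing column
    Age_Next_Year = Age + 1,
    # Two-way condition
    Age_Group = if_else(Age >= 25, "25 and over", "Under 25"),
    # Multi-way condition; the final TRUE acts as a fallback
    Status = case_when(
      Grade == "A" ~ "Excellent",
      Grade == "B" ~ "Good",
      TRUE ~ "Needs review"
    )
  )
students_extra
```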

Summarizing Data

To summarize your data, summarize() is your go-to function:

# Average Age of students
students %>% summarize(Average_Age = mean(Age))
# A tibble: 1 × 1
  Average_Age
        <dbl>
1        25.7

You can also group data with group_by() and then apply a summary function:

# Group by Grade and summarize Age
students %>%
  group_by(Grade) %>%
  summarize(Average_Age = mean(Age))
# A tibble: 2 × 2
  Grade Average_Age
  <chr>       <dbl>
1 A            23.5
2 B            30  
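A grouped summary can return several statistics at once. The sketch below adds a row count with n() and the maximum age per grade. It assumes dplyr 1.0.0 or later for the .groups argument, and the column names Count and Oldest are illustrative.

```r
library(dplyr)
library(tibble)

students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

by_grade <- students %>%
  group_by(Grade) %>%
  summarize(
    Count = n(),              # number of rows in each group
    Average_Age = mean(Age),
    Oldest = max(Age),
    .groups = "drop"          # return an ungrouped tibble
  )
by_grade

# count() is a shortcut for group_by() followed by summarize(n = n())
students %>% count(Grade)
```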

Chaining Operations with Pipes (%>%)

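A key part of the dplyr workflow is the pipe operator %>%, which comes from the magrittr package and is available once dplyr or the tidyverse is loaded. It passes the result on its left as the first argument of the function on its right, so you can chain multiple operations together into a readable sequence. Instead of nesting functions inside each other, you write them as a series of steps, which makes the code easier to read and debug.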

# Chaining operations
students %>%
  filter(Age > 25) %>%
  mutate(Passed = Grade == "A") %>%
  select(Name, Passed)
# A tibble: 1 × 2
  Name  Passed
  <chr> <lgl> 
1 Bob   FALSE 

Here, you’re filtering rows, creating a new column, and selecting columns, all in one readable pipeline.
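To see what the pipe buys you, here is a sketch that writes the same steps three ways: nested calls, the %>% pipe, and R’s native pipe |>. The native pipe assumes R 4.1.0 or later; the object names nested, piped, and native are just for this comparison.

```r
library(dplyr)
library(tibble)

students <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Grade = c("A", "B", "A")
)

# Nested version: has to be read from the inside out
nested <- select(
  mutate(filter(students, Age > 25), Passed = Grade == "A"),
  Name, Passed
)

# Piped version: reads from top to bottom
piped <- students %>%
  filter(Age > 25) %>%
  mutate(Passed = Grade == "A") %>%
  select(Name, Passed)

# Same steps with the native pipe (assumes R >= 4.1.0)
native <- students |>
  filter(Age > 25) |>
  mutate(Passed = Grade == "A") |>
  select(Name, Passed)

# All three approaches give the same result
all.equal(nested, piped)
all.equal(piped, native)
```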


Merging and Joining Data with dplyr

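Just like data.table and base R’s merge(), dplyr lets you combine datasets on a shared key. An inner join keeps only the rows whose key appears in both tables, while a left join keeps every row of the left table and fills the unmatched columns with NA.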

Inner Join

# Two example data frames
class1 <- tibble(
  ID = 1:3, 
  Name = c("Alice", "Bob", "Charlie")
)
class2 <- tibble(ID = 2:4, Score = c(85, 88, 90))

# Inner join
class1 %>% inner_join(class2, by = "ID")
# A tibble: 2 × 3
     ID Name    Score
  <int> <chr>   <dbl>
1     2 Bob        85
2     3 Charlie    88

Left Join

# Left join to keep all rows from class1
class1 %>% left_join(class2, by = "ID")
# A tibble: 3 × 3
     ID Name    Score
  <int> <chr>   <dbl>
1     1 Alice      NA
2     2 Bob        85
3     3 Charlie    88
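dplyr has other joins beyond these two. As a brief sketch with the same class1 and class2 tables (rebuilt here so the block runs on its own), full_join() keeps rows from both sides and anti_join() returns the rows of the left table that have no match in the right one:

```r
library(dplyr)
library(tibble)

class1 <- tibble(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)
class2 <- tibble(ID = 2:4, Score = c(85, 88, 90))

# Full join: every ID from either table, with NA where data is missing
all_rows <- class1 %>% full_join(class2, by = "ID")
all_rows

# Anti join: rows of class1 with no matching ID in class2
unmatched <- class1 %>% anti_join(class2, by = "ID")
unmatched
```

An anti join is a handy check before an inner join, since it shows which rows would be dropped.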

Reshaping Data with tidyr

While dplyr focuses on manipulation, tidyr helps reshape data. Here’s how you can transform wide data to long format and vice versa.

Wide to Long with pivot_longer()

# tidyr is already attached by library(tidyverse); loading it again is harmless
library(tidyr)

# Example data
df <- tibble(
  Name = c("Alice", "Bob"),
  Math = c(95, 88),
  English = c(90, 85)
)

# Convert from wide to long
df_long <- df %>% 
  pivot_longer(
    cols = c(Math, English), 
    names_to = "Subject", 
    values_to = "Score"
  )
df_long
# A tibble: 4 × 3
  Name  Subject Score
  <chr> <chr>   <dbl>
1 Alice Math       95
2 Alice English    90
3 Bob   Math       88
4 Bob   English    85

Long to Wide with pivot_wider()

# Convert back from long to wide
df_wide <- df_long %>% pivot_wider(names_from = "Subject", values_from = "Score")
df_wide
# A tibble: 2 × 3
  Name   Math English
  <chr> <dbl>   <dbl>
1 Alice    95      90
2 Bob      88      85
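The round trip above works cleanly because every student has a score in every subject. The sketch below uses a made-up long table in which Bob has no English score, to show how pivot_wider() handles the gap and how the values_fill argument can supply a replacement. Filling with 0 is only an illustration; whether that makes sense depends on your data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long data with one missing combination (Bob, English)
scores_long <- tibble(
  Name = c("Alice", "Alice", "Bob"),
  Subject = c("Math", "English", "Math"),
  Score = c(95, 90, 88)
)

# The missing combination becomes NA in the wide table
wide_na <- scores_long %>%
  pivot_wider(names_from = Subject, values_from = Score)
wide_na

# values_fill replaces those implicit missing values
wide_filled <- scores_long %>%
  pivot_wider(names_from = Subject, values_from = Score, values_fill = 0)
wide_filled
```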

Strengths and Limitations of tidyverse

Strengths

  • Intuitive syntax: The verbs (select, filter, mutate, etc.) make the code easy to read and understand.
  • Seamless integration: The tidyverse packages are designed to work together, providing a consistent workflow.
  • Data wrangling made easy: Tasks like filtering, grouping, and summarizing data are straightforward.

Limitations

  • Performance: For very large datasets, data.table is often faster; the tidyverse trades some speed for readability.
  • Memory consumption: Unlike data.table, tidyverse packages do not modify data by reference, so they can use more memory when working with large datasets.

Conclusion

The tidyverse provides a modern and highly readable approach to data manipulation. It’s the go-to tool for many R users because it streamlines everyday tasks like filtering, summarizing, and reshaping data. By embracing tibble for modern data frames and dplyr for intuitive data manipulation, you can quickly become proficient in handling tabular data.

While it may not always be the fastest for large datasets (that’s where data.table shines), tidyverse’s user-friendly syntax and powerful functionality make it an essential part of the R ecosystem.