Merging multiple datasets

Joining Tables using dplyr in R: Concepts with Examples

Author

Raju Rimal

Published

March 18, 2025

This post explores the fundamentals of joining tables in R using the dplyr package, with a focus on both core concepts and practical edge cases. You’ll learn how keys link datasets, how different join types handle unmatched rows, and strategies for tackling real-world challenges like missing values, multi-key joins, and pre-processing with intermediate variables. To bridge the gap for database users, every join example includes its SQL equivalent. Finally, a reproducible dummy dataset is provided so you can follow along and experiment.

Example dataset

# Employees: 
#> emp_id 3 has NA dept_id; 
#> emp_id 5's dept_id (4) is missing
employees <- tibble(
  emp_id = c(1, 2, 3, 4, 5, 6),
  first_name = c(
    "Alice Susanna", "Bob", "Mary Anne", 
    "Diana", "Eve", "Frank"
  ),
  last_name = c(
    "Smith", "Johnson", "Brown", 
    "Lee", "Davis", "DeVito Jr."
  ),
  dept_id = c(1, 2, NA, 3, 4, 2),
  salary = c(
    60000, 75000, 80000, 
    90000, 65000, 70000
  )
)

print(employees)
# A tibble: 6 × 5
  emp_id first_name    last_name  dept_id salary
   <dbl> <chr>         <chr>        <dbl>  <dbl>
1      1 Alice Susanna Smith            1  60000
2      2 Bob           Johnson          2  75000
3      3 Mary Anne     Brown           NA  80000
4      4 Diana         Lee              3  90000
5      5 Eve           Davis            4  65000
6      6 Frank         DeVito Jr.       2  70000
#> Departments: 
#> - dept_id 5 has no employees
#> - dept_id 2 appears once
departments <- tibble(
  dept_id = c(1, 2, 3, 5),
  dept_name = c(
    "HR", "Engineering", "Marketing", "Finance"
  ),
  manager_first = c(
    "alice", "robert", "diana", "carol"
  ),
  manager_last = c(
    "smith", "johnson", "lee", "taylor"
  ),
  budget = c(
    500000, 1000000, 750000, 600000
  )
)

print(departments)
# A tibble: 4 × 5
  dept_id dept_name   manager_first manager_last  budget
    <dbl> <chr>       <chr>         <chr>          <dbl>
1       1 HR          alice         smith         500000
2       2 Engineering robert        johnson      1000000
3       3 Marketing   diana         lee           750000
4       5 Finance     carol         taylor        600000
library(RSQLite)
library(DBI)
con <- DBI::dbConnect(SQLite(), "database.db")
if (!RSQLite::dbExistsTable(con, "departments")) {
  dbWriteTable(con, name = "departments", value = departments)
}
if (!RSQLite::dbExistsTable(con, "employees")) {
  dbWriteTable(con, name = "employees", value = employees)
}
PRAGMA table_info(employees);
5 records
cid name type notnull dflt_value pk
0 emp_id REAL 0 NA 0
1 first_name TEXT 0 NA 0
2 last_name TEXT 0 NA 0
3 dept_id REAL 0 NA 0
4 salary REAL 0 NA 0
PRAGMA table_info(departments);
5 records
cid name type notnull dflt_value pk
0 dept_id REAL 0 NA 0
1 dept_name TEXT 0 NA 0
2 manager_first TEXT 0 NA 0
3 manager_last TEXT 0 NA 0
4 budget REAL 0 NA 0

Mutating Joins

Mutating joins combine columns from two tables based on matching keys, preserving rows depending on the join type.

Inner Join | inner_join()

We can use inner_join() function from dplyr to inner join the data. Inner join retains only rows with matching keys in both tables. The output table contains all columns, excluding the unmatched rows from both tables.

employees %>% 
  inner_join(departments, by = "dept_id") %>% 
  select(emp_id, first_name, last_name, salary, dept_name)
# A tibble: 4 × 5
  emp_id first_name    last_name  salary dept_name  
   <dbl> <chr>         <chr>       <dbl> <chr>      
1      1 Alice Susanna Smith       60000 HR         
2      2 Bob           Johnson     75000 Engineering
3      4 Diana         Lee         90000 Marketing  
4      6 Frank         DeVito Jr.  70000 Engineering
SELECT
 emp.emp_id,
 emp.first_name,
 emp.last_name,
 emp.salary,
 dept.dept_name 
 FROM employees AS emp
  INNER JOIN departments as dept
    ON emp.dept_id = dept.dept_id;
4 records
emp_id first_name last_name salary dept_name
1 Alice Susanna Smith 60000 HR
2 Bob Johnson 75000 Engineering
4 Diana Lee 90000 Marketing
6 Frank DeVito Jr. 70000 Engineering

Here,

  • Rows with dept_id = NA (Charlie) and dept_id = 4 (Eve) were dropped (no match in departments).
  • dept_id = 5 (Finance) was dropped (no match in employees).
  • dept_id = 2 appears twice (Bob and Frank), causing row expansion.

Left Join | left_join()

Left Join Keeps all rows from the left table, adding matched columns from the right table. The unmatched rows from the left table get NA for right-table columns. The left table is preserved entirely and the right table is attached where possible.

employees |> 
  left_join(departments, by = "dept_id") |> 
  select(names(employees), dept_name)
# A tibble: 6 × 6
  emp_id first_name    last_name  dept_id salary dept_name  
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>      
1      1 Alice Susanna Smith            1  60000 HR         
2      2 Bob           Johnson          2  75000 Engineering
3      3 Mary Anne     Brown           NA  80000 <NA>       
4      4 Diana         Lee              3  90000 Marketing  
5      5 Eve           Davis            4  65000 <NA>       
6      6 Frank         DeVito Jr.       2  70000 Engineering
SELECT 
  emp.*, dept.dept_name
  FROM employees AS emp
  LEFT JOIN departments  AS dept
    ON emp.dept_id = dept.dept_id;
6 records
emp_id first_name last_name dept_id salary dept_name
1 Alice Susanna Smith 1 60000 HR
2 Bob Johnson 2 75000 Engineering
3 Mary Anne Brown NA 80000 NA
4 Diana Lee 3 90000 Marketing
5 Eve Davis 4 65000 NA
6 Frank DeVito Jr. 2 70000 Engineering

here,

  • All employees are retained, even those with NA or unmatched dept_id.
  • dept_id = 4 (Eve) and NA (Charlie) have NA for department columns.

Right Join | right_join()

Similar to Left Join, Right join Keeps all rows from the right table, adding matched columns from the left table. The unmatched rows from the right table get NA for left-table columns. The right table is preserved entirely and the left table is attached where possible.

employees |> 
  right_join(departments, by = "dept_id")
# A tibble: 5 × 9
  emp_id first_name    last_name  dept_id salary dept_name   manager_first
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>       <chr>        
1      1 Alice Susanna Smith            1  60000 HR          alice        
2      2 Bob           Johnson          2  75000 Engineering robert       
3      4 Diana         Lee              3  90000 Marketing   diana        
4      6 Frank         DeVito Jr.       2  70000 Engineering robert       
5     NA <NA>          <NA>             5     NA Finance     carol        
# ℹ 2 more variables: manager_last <chr>, budget <dbl>
SELECT *
  FROM employees AS emp
  RIGHT JOIN departments  AS dept
    ON emp.dept_id = dept.dept_id;
5 records
emp_id first_name last_name dept_id salary dept_id dept_name manager_first manager_last budget
1 Alice Susanna Smith 1 60000 1 HR alice smith 500000
2 Bob Johnson 2 75000 2 Engineering robert johnson 1000000
4 Diana Lee 3 90000 3 Marketing diana lee 750000
6 Frank DeVito Jr. 2 70000 2 Engineering robert johnson 1000000
NA NA NA NA NA 5 Finance carol taylor 600000

here,

  • All departments are retained, even those with unmatched dept_id. The corresponding emp_id was set to NA if not matched with items in the departments.
  • dept_id = 5 (Finance department) where carol is the manager is retained even if she des not have any matched records in employees table.

Full Join | full_join()

Full join keeps all rows from both tables, filling NA where no match exists. Output is the union of both tables, with missing values for unmatched rows. In other words, it combines all data preserving everything.

employees %>% 
    full_join(departments, by = "dept_id")
# A tibble: 7 × 9
  emp_id first_name    last_name  dept_id salary dept_name   manager_first
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>       <chr>        
1      1 Alice Susanna Smith            1  60000 HR          alice        
2      2 Bob           Johnson          2  75000 Engineering robert       
3      3 Mary Anne     Brown           NA  80000 <NA>        <NA>         
4      4 Diana         Lee              3  90000 Marketing   diana        
5      5 Eve           Davis            4  65000 <NA>        <NA>         
6      6 Frank         DeVito Jr.       2  70000 Engineering robert       
7     NA <NA>          <NA>             5     NA Finance     carol        
# ℹ 2 more variables: manager_last <chr>, budget <dbl>
SELECT * 
FROM employees 
FULL OUTER JOIN departments 
    ON employees.dept_id = departments.dept_id;
7 records
emp_id first_name last_name dept_id salary dept_id dept_name manager_first manager_last budget
1 Alice Susanna Smith 1 60000 1 HR alice smith 500000
2 Bob Johnson 2 75000 2 Engineering robert johnson 1000000
3 Mary Anne Brown NA 80000 NA NA NA NA NA
4 Diana Lee 3 90000 3 Marketing diana lee 750000
5 Eve Davis 4 65000 NA NA NA NA NA
6 Frank DeVito Jr. 2 70000 2 Engineering robert johnson 1000000
NA NA NA NA NA 5 Finance carol taylor 600000

here,

  • Includes all employees and all departments, even unmatched ones.
  • dept_id = 5 (Finance) appears with NA for employee columns.

Filtering Joins

Filtering joins subset rows from one table based on another, without merging columns.

Semi-Join | semi_join()

Semi-Join Returns rows from the left table that have a match in the right table. It returns a subset of the left table but no columns from the right table are added. For example: filtering the employees who belongs to one of the department in the departments table based on their dept_id information.

employees %>% 
  semi_join(departments, by = "dept_id")
# A tibble: 4 × 5
  emp_id first_name    last_name  dept_id salary
   <dbl> <chr>         <chr>        <dbl>  <dbl>
1      1 Alice Susanna Smith            1  60000
2      2 Bob           Johnson          2  75000
3      4 Diana         Lee              3  90000
4      6 Frank         DeVito Jr.       2  70000
SELECT employees.* 
FROM employees 
WHERE EXISTS (
  SELECT 1 FROM departments 
  WHERE employees.dept_id = departments.dept_id
);
4 records
emp_id first_name last_name dept_id salary
1 Alice Susanna Smith 1 60000
2 Bob Johnson 2 75000
4 Diana Lee 3 90000
6 Frank DeVito Jr. 2 70000

here,

  • Keeps only employees with dept_id present in departments.
  • Drops emp_id=3 (NA) and emp_id=5 (unmatched dept_id=4).

Anti-Join | anti_join()

Anti-Join returns rows from the left table with no match in the right table. For example: filtering the employees who do not belong to any of the department in the departments table.

employees %>% 
  anti_join(departments, by = "dept_id")
# A tibble: 2 × 5
  emp_id first_name last_name dept_id salary
   <dbl> <chr>      <chr>       <dbl>  <dbl>
1      3 Mary Anne  Brown          NA  80000
2      5 Eve        Davis           4  65000
SELECT employees.* 
FROM employees 
WHERE NOT EXISTS (
  SELECT 1 FROM departments 
  WHERE employees.dept_id = departments.dept_id
);
2 records
emp_id first_name last_name dept_id salary
3 Mary Anne Brown NA 80000
5 Eve Davis 4 65000

here, only employees with dept_id not in departments (or NA) are kept.

Advanced Cases: Preprocessing and Multiple Keys

Suppose we need find the employees with their departments who are also manager in their department. Here we have two issues to deal with:

  1. We need multiple keys (fields/ variables) to join. For example: dept_id in both table and first_name/ last_name from employees table and manager_first/ manager_last from departments table.

    To solve this we can use multiple keys to join the tables. We can use semi-join or inner-join on employees table using department tables.

  2. The first_name and last_name from employees table has uppercase characters and middle name in first_name field while manager_first and manager_last from departments tables are all lowercase. We need to preprocess them before using them as a joining key.

    To solve this, we lowercase the first word from first_name and manager_first and the last word from last_name and manager_last from tables employees and departments respectively.

employees |> 
  mutate(
    fname = stringr::word(first_name, 1) |> tolower(),
    lname = stringr::word(last_name, -1) |> tolower()
  ) |> inner_join(
    departments |> mutate(
      fname = stringr::word(manager_first, 1) |> tolower(),
      lname = stringr::word(manager_last, -1) |> tolower()
    ) |> select(dept_name, fname, lname, dept_id),
    by = join_by('fname', 'lname', 'dept_id')
  ) |> select(-dept_id, -fname, -lname)
# A tibble: 2 × 5
  emp_id first_name    last_name salary dept_name
   <dbl> <chr>         <chr>      <dbl> <chr>    
1      1 Alice Susanna Smith      60000 HR       
2      4 Diana         Lee        90000 Marketing
SELECT 
  emp.emp_id, emp.first_name, emp.last_name, 
  emp.salary, dept.dept_name
FROM employees emp
JOIN departments dept ON emp.dept_id = dept.dept_id
WHERE 
    -- First name: match first word
    LOWER(SUBSTR(
      emp.first_name, 1, 
      INSTR(emp.first_name || ' ', ' ') - 1
    )) = 
    LOWER(SUBSTR(
      dept.manager_first, 1, 
      INSTR(dept.manager_first || ' ', ' ') - 1
    ))
    
    -- Last name: match last word
    AND 
    LOWER(SUBSTR(
      emp.last_name, 
      INSTR(' ' || emp.last_name, ' ') + 1
    )) = 
    LOWER(SUBSTR(
      dept.manager_last, 
      INSTR(' ' || dept.manager_last, ' ') + 1
    ));
2 records
emp_id first_name last_name salary dept_name
1 Alice Susanna Smith 60000 HR
4 Diana Lee 90000 Marketing

Since the employees table uses first_name/last_name while departments uses manager_first/manager_last, I standardized the keys by:

  1. Extracting the first word from first_name and the last word from last_name (to handle middle names/suffixes).
  2. Converting them to lowercase for case-insensitive matching.

Also, check out a post, I have used all these concepts in a real dataset.

Joins are essential for combining data in R. With dplyr, you can handle mismatched keys, missing values, and multi-column joins—just clean your keys first and verify results with anti_join(). Try these techniques on your own data, or explore more examples here.