Merging multiple datasets

Joining Tables using dplyr in R: Concepts with Examples

Author

Raju Rimal

Published

March 18, 2025

This post explores the fundamentals of joining tables in R using the dplyr package, with a focus on both core concepts and practical edge cases. You’ll learn how keys link datasets, how different join types handle unmatched rows, and strategies for tackling real-world challenges like missing values, multi-key joins, and pre-processing with intermediate variables. To bridge the gap for database users, every join example includes its SQL equivalent. Finally, a reproducible dummy dataset is provided so you can follow along and experiment.

# Employees: 
#> emp_id 3 has NA dept_id; 
#> emp_id 5's dept_id (4) is missing
employees <- tibble(
  emp_id = c(1, 2, 3, 4, 5, 6),
  first_name = c(
    "Alice Susanna", "Bob", "Mary Anne", 
    "Diana", "Eve", "Frank"
  ),
  last_name = c(
    "Smith", "Johnson", "Brown", 
    "Lee", "Davis", "DeVito Jr."
  ),
  dept_id = c(1, 2, NA, 3, 4, 2),
  salary = c(
    60000, 75000, 80000, 
    90000, 65000, 70000
  )
)

print(employees)

# A tibble: 6 × 5
  emp_id first_name    last_name  dept_id salary
   <dbl> <chr>         <chr>        <dbl>  <dbl>
1      1 Alice Susanna Smith            1  60000
2      2 Bob           Johnson          2  75000
3      3 Mary Anne     Brown           NA  80000
4      4 Diana         Lee              3  90000
5      5 Eve           Davis            4  65000
6      6 Frank         DeVito Jr.       2  70000

#> Departments: 
#> - dept_id 5 has no employees
#> - dept_id 2 appears once
departments <- tibble(
  dept_id = c(1, 2, 3, 5),
  dept_name = c(
    "HR", "Engineering", "Marketing", "Finance"
  ),
  manager_first = c(
    "alice", "robert", "diana", "carol"
  ),
  manager_last = c(
    "smith", "johnson", "lee", "taylor"
  ),
  budget = c(
    500000, 1000000, 750000, 600000
  )
)

print(departments)

# A tibble: 4 × 5
  dept_id dept_name   manager_first manager_last  budget
    <dbl> <chr>       <chr>         <chr>          <dbl>
1       1 HR          alice         smith         500000
2       2 Engineering robert        johnson      1000000
3       3 Marketing   diana         lee           750000
4       5 Finance     carol         taylor        600000

library(RSQLite)
library(DBI)
con <- DBI::dbConnect(SQLite(), "database.db")
if (!RSQLite::dbExistsTable(con, "departments")) {
  dbWriteTable(con, name = "departments", value = departments)
}
if (!RSQLite::dbExistsTable(con, "employees")) {
  dbWriteTable(con, name = "employees", value = employees)
}

PRAGMA table_info(employees);

5 records
cid	name	type	dflt_value
0	emp_id	REAL	NA
1	first_name	TEXT	NA
2	last_name	TEXT	NA
3	dept_id	REAL	NA
4	salary	REAL	NA

PRAGMA table_info(departments);

5 records
cid	name	type	dflt_value
0	dept_id	REAL	NA
1	dept_name	TEXT	NA
2	manager_first	TEXT	NA
3	manager_last	TEXT	NA
4	budget	REAL	NA

Mutating Joins

Mutating joins combine columns from two tables based on matching keys, preserving rows depending on the join type.

Inner Join | `inner_join()`

We can use inner_join() function from dplyr to inner join the data. Inner join retains only rows with matching keys in both tables. The output table contains all columns, excluding the unmatched rows from both tables.

employees %>% 
  inner_join(departments, by = "dept_id") %>% 
  select(emp_id, first_name, last_name, salary, dept_name)

# A tibble: 4 × 5
  emp_id first_name    last_name  salary dept_name  
   <dbl> <chr>         <chr>       <dbl> <chr>      
1      1 Alice Susanna Smith       60000 HR         
2      2 Bob           Johnson     75000 Engineering
3      4 Diana         Lee         90000 Marketing  
4      6 Frank         DeVito Jr.  70000 Engineering

SELECT
 emp.emp_id,
 emp.first_name,
 emp.last_name,
 emp.salary,
 dept.dept_name 
 FROM employees AS emp
  INNER JOIN departments as dept
    ON emp.dept_id = dept.dept_id;

4 records
emp_id	first_name	last_name	salary	dept_name
1	Alice Susanna	Smith	60000	HR
2	Bob	Johnson	75000	Engineering
4	Diana	Lee	90000	Marketing
6	Frank	DeVito Jr.	70000	Engineering

Here,

Rows with dept_id = NA (Charlie) and dept_id = 4 (Eve) were dropped (no match in departments).
dept_id = 5 (Finance) was dropped (no match in employees).
dept_id = 2 appears twice (Bob and Frank), causing row expansion.

Left Join | `left_join()`

Left Join Keeps all rows from the left table, adding matched columns from the right table. The unmatched rows from the left table get NA for right-table columns. The left table is preserved entirely and the right table is attached where possible.

employees |> 
  left_join(departments, by = "dept_id") |> 
  select(names(employees), dept_name)

# A tibble: 6 × 6
  emp_id first_name    last_name  dept_id salary dept_name  
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>      
1      1 Alice Susanna Smith            1  60000 HR         
2      2 Bob           Johnson          2  75000 Engineering
3      3 Mary Anne     Brown           NA  80000 <NA>       
4      4 Diana         Lee              3  90000 Marketing  
5      5 Eve           Davis            4  65000 <NA>       
6      6 Frank         DeVito Jr.       2  70000 Engineering

SELECT 
  emp.*, dept.dept_name
  FROM employees AS emp
  LEFT JOIN departments  AS dept
    ON emp.dept_id = dept.dept_id;

6 records
emp_id	first_name	last_name	dept_id	salary	dept_name
1	Alice Susanna	Smith	1	60000	HR
2	Bob	Johnson	2	75000	Engineering
3	Mary Anne	Brown	NA	80000	NA
4	Diana	Lee	3	90000	Marketing
5	Eve	Davis	4	65000	NA
6	Frank	DeVito Jr.	2	70000	Engineering

here,

All employees are retained, even those with NA or unmatched dept_id.
dept_id = 4 (Eve) and NA (Charlie) have NA for department columns.

Right Join | `right_join()`

Similar to Left Join, Right join Keeps all rows from the right table, adding matched columns from the left table. The unmatched rows from the right table get NA for left-table columns. The right table is preserved entirely and the left table is attached where possible.

employees |> 
  right_join(departments, by = "dept_id")

# A tibble: 5 × 9
  emp_id first_name    last_name  dept_id salary dept_name   manager_first
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>       <chr>        
1      1 Alice Susanna Smith            1  60000 HR          alice        
2      2 Bob           Johnson          2  75000 Engineering robert       
3      4 Diana         Lee              3  90000 Marketing   diana        
4      6 Frank         DeVito Jr.       2  70000 Engineering robert       
5     NA <NA>          <NA>             5     NA Finance     carol        
# ℹ 2 more variables: manager_last <chr>, budget <dbl>

SELECT *
  FROM employees AS emp
  RIGHT JOIN departments  AS dept
    ON emp.dept_id = dept.dept_id;

5 records
emp_id	first_name	last_name	dept_id	salary	dept_id	dept_name	manager_first	manager_last	budget
1	Alice Susanna	Smith	1	60000	1	HR	alice	smith	500000
2	Bob	Johnson	2	75000	2	Engineering	robert	johnson	1000000
4	Diana	Lee	3	90000	3	Marketing	diana	lee	750000
6	Frank	DeVito Jr.	2	70000	2	Engineering	robert	johnson	1000000
NA	NA	NA	NA	NA	5	Finance	carol	taylor	600000

here,

All departments are retained, even those with unmatched dept_id. The corresponding emp_id was set to NA if not matched with items in the departments.
dept_id = 5 (Finance department) where carol is the manager is retained even if she des not have any matched records in employees table.

Full Join | `full_join()`

Full join keeps all rows from both tables, filling NA where no match exists. Output is the union of both tables, with missing values for unmatched rows. In other words, it combines all data preserving everything.

employees %>% 
    full_join(departments, by = "dept_id")

# A tibble: 7 × 9
  emp_id first_name    last_name  dept_id salary dept_name   manager_first
   <dbl> <chr>         <chr>        <dbl>  <dbl> <chr>       <chr>        
1      1 Alice Susanna Smith            1  60000 HR          alice        
2      2 Bob           Johnson          2  75000 Engineering robert       
3      3 Mary Anne     Brown           NA  80000 <NA>        <NA>         
4      4 Diana         Lee              3  90000 Marketing   diana        
5      5 Eve           Davis            4  65000 <NA>        <NA>         
6      6 Frank         DeVito Jr.       2  70000 Engineering robert       
7     NA <NA>          <NA>             5     NA Finance     carol        
# ℹ 2 more variables: manager_last <chr>, budget <dbl>

SELECT * 
FROM employees 
FULL OUTER JOIN departments 
    ON employees.dept_id = departments.dept_id;

7 records
emp_id	first_name	last_name	dept_id	salary	dept_id	dept_name	manager_first	manager_last	budget
1	Alice Susanna	Smith	1	60000	1	HR	alice	smith	500000
2	Bob	Johnson	2	75000	2	Engineering	robert	johnson	1000000
3	Mary Anne	Brown	NA	80000	NA	NA	NA	NA	NA
4	Diana	Lee	3	90000	3	Marketing	diana	lee	750000
5	Eve	Davis	4	65000	NA	NA	NA	NA	NA
6	Frank	DeVito Jr.	2	70000	2	Engineering	robert	johnson	1000000
NA	NA	NA	NA	NA	5	Finance	carol	taylor	600000

here,

Includes all employees and all departments, even unmatched ones.
dept_id = 5 (Finance) appears with NA for employee columns.

Filtering Joins

Filtering joins subset rows from one table based on another, without merging columns.

Semi-Join | `semi_join()`

Semi-Join Returns rows from the left table that have a match in the right table. It returns a subset of the left table but no columns from the right table are added. For example: filtering the employees who belongs to one of the department in the departments table based on their dept_id information.

employees %>% 
  semi_join(departments, by = "dept_id")

# A tibble: 4 × 5
  emp_id first_name    last_name  dept_id salary
   <dbl> <chr>         <chr>        <dbl>  <dbl>
1      1 Alice Susanna Smith            1  60000
2      2 Bob           Johnson          2  75000
3      4 Diana         Lee              3  90000
4      6 Frank         DeVito Jr.       2  70000

SELECT employees.* 
FROM employees 
WHERE EXISTS (
  SELECT 1 FROM departments 
  WHERE employees.dept_id = departments.dept_id
);

4 records
emp_id	first_name	last_name	dept_id	salary
1	Alice Susanna	Smith	1	60000
2	Bob	Johnson	2	75000
4	Diana	Lee	3	90000
6	Frank	DeVito Jr.	2	70000

here,

Keeps only employees with dept_id present in departments.
Drops emp_id=3 (NA) and emp_id=5 (unmatched dept_id=4).

Anti-Join | `anti_join()`

Anti-Join returns rows from the left table with no match in the right table. For example: filtering the employees who do not belong to any of the department in the departments table.

employees %>% 
  anti_join(departments, by = "dept_id")

# A tibble: 2 × 5
  emp_id first_name last_name dept_id salary
   <dbl> <chr>      <chr>       <dbl>  <dbl>
1      3 Mary Anne  Brown          NA  80000
2      5 Eve        Davis           4  65000

SELECT employees.* 
FROM employees 
WHERE NOT EXISTS (
  SELECT 1 FROM departments 
  WHERE employees.dept_id = departments.dept_id
);

2 records
emp_id	first_name	last_name	dept_id	salary
3	Mary Anne	Brown	NA	80000
5	Eve	Davis	4	65000

here, only employees with dept_id not in departments (or NA) are kept.

Advanced Cases: Preprocessing and Multiple Keys

Suppose we need find the employees with their departments who are also manager in their department. Here we have two issues to deal with:

We need multiple keys (fields/ variables) to join. For example: dept_id in both table and first_name/ last_name from employees table and manager_first/ manager_last from departments table.

To solve this we can use multiple keys to join the tables. We can use semi-join or inner-join on employees table using department tables.
The first_name and last_name from employees table has uppercase characters and middle name in first_name field while manager_first and manager_last from departments tables are all lowercase. We need to preprocess them before using them as a joining key.

To solve this, we lowercase the first word from first_name and manager_first and the last word from last_name and manager_last from tables employees and departments respectively.

employees |> 
  mutate(
    fname = stringr::word(first_name, 1) |> tolower(),
    lname = stringr::word(last_name, -1) |> tolower()
  ) |> inner_join(
    departments |> mutate(
      fname = stringr::word(manager_first, 1) |> tolower(),
      lname = stringr::word(manager_last, -1) |> tolower()
    ) |> select(dept_name, fname, lname, dept_id),
    by = join_by('fname', 'lname', 'dept_id')
  ) |> select(-dept_id, -fname, -lname)

# A tibble: 2 × 5
  emp_id first_name    last_name salary dept_name
   <dbl> <chr>         <chr>      <dbl> <chr>    
1      1 Alice Susanna Smith      60000 HR       
2      4 Diana         Lee        90000 Marketing

SELECT 
  emp.emp_id, emp.first_name, emp.last_name, 
  emp.salary, dept.dept_name
FROM employees emp
JOIN departments dept ON emp.dept_id = dept.dept_id
WHERE 
    -- First name: match first word
    LOWER(SUBSTR(
      emp.first_name, 1, 
      INSTR(emp.first_name || ' ', ' ') - 1
    )) = 
    LOWER(SUBSTR(
      dept.manager_first, 1, 
      INSTR(dept.manager_first || ' ', ' ') - 1
    ))
    
    -- Last name: match last word
    AND 
    LOWER(SUBSTR(
      emp.last_name, 
      INSTR(' ' || emp.last_name, ' ') + 1
    )) = 
    LOWER(SUBSTR(
      dept.manager_last, 
      INSTR(' ' || dept.manager_last, ' ') + 1
    ));

2 records
emp_id	first_name	last_name	salary	dept_name
1	Alice Susanna	Smith	60000	HR
4	Diana	Lee	90000	Marketing

Since the employees table uses first_name/last_name while departments uses manager_first/manager_last, I standardized the keys by:

Extracting the first word from first_name and the last word from last_name (to handle middle names/suffixes).
Converting them to lowercase for case-insensitive matching.

Also, check out a post, I have used all these concepts in a real dataset.

Joins are essential for combining data in R. With dplyr, you can handle mismatched keys, missing values, and multi-column joins—just clean your keys first and verify results with anti_join(). Try these techniques on your own data, or explore more examples here.

Example dataset

Mutating Joins

Inner Join | inner_join()

Left Join | left_join()

Right Join | right_join()

Full Join | full_join()

Filtering Joins

Semi-Join | semi_join()

Anti-Join | anti_join()

Advanced Cases: Preprocessing and Multiple Keys

Inner Join | `inner_join()`

Left Join | `left_join()`

Right Join | `right_join()`

Full Join | `full_join()`

Semi-Join | `semi_join()`

Anti-Join | `anti_join()`