Joining Tables using dplyr in R: Concepts with Examples
Author
Raju Rimal
Published
March 18, 2025
This post explores the fundamentals of joining tables in R using the dplyr package, with a focus on both core concepts and practical edge cases. You’ll learn how keys link datasets, how different join types handle unmatched rows, and strategies for tackling real-world challenges like missing values, multi-key joins, and pre-processing with intermediate variables. To bridge the gap for database users, every join example includes its SQL equivalent. Finally, a reproducible dummy dataset is provided so you can follow along and experiment.
# A tibble: 6 × 5
emp_id first_name last_name dept_id salary
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Alice Susanna Smith 1 60000
2 2 Bob Johnson 2 75000
3 3 Mary Anne Brown NA 80000
4 4 Diana Lee 3 90000
5 5 Eve Davis 4 65000
6 6 Frank DeVito Jr. 2 70000
# A tibble: 4 × 5
dept_id dept_name manager_first manager_last budget
<dbl> <chr> <chr> <chr> <dbl>
1 1 HR alice smith 500000
2 2 Engineering robert johnson 1000000
3 3 Marketing diana lee 750000
4 5 Finance carol taylor 600000
library(RSQLite)library(DBI)con <- DBI::dbConnect(SQLite(), "database.db")if (!RSQLite::dbExistsTable(con, "departments")) {dbWriteTable(con, name ="departments", value = departments)}if (!RSQLite::dbExistsTable(con, "employees")) {dbWriteTable(con, name ="employees", value = employees)}
PRAGMA table_info(employees);
5 records
cid
name
type
notnull
dflt_value
pk
0
emp_id
REAL
0
NA
0
1
first_name
TEXT
0
NA
0
2
last_name
TEXT
0
NA
0
3
dept_id
REAL
0
NA
0
4
salary
REAL
0
NA
0
PRAGMA table_info(departments);
5 records
cid
name
type
notnull
dflt_value
pk
0
dept_id
REAL
0
NA
0
1
dept_name
TEXT
0
NA
0
2
manager_first
TEXT
0
NA
0
3
manager_last
TEXT
0
NA
0
4
budget
REAL
0
NA
0
Mutating Joins
Mutating joins combine columns from two tables based on matching keys, preserving rows depending on the join type.
Inner Join | inner_join()
We can use inner_join() function from dplyr to inner join the data. Inner join retains only rows with matching keys in both tables. The output table contains all columns, excluding the unmatched rows from both tables.
Left Join Keeps all rows from the left table, adding matched columns from the right table. The unmatched rows from the left table get NA for right-table columns. The left table is preserved entirely and the right table is attached where possible.
employees |>left_join(departments, by ="dept_id") |>select(names(employees), dept_name)
# A tibble: 6 × 6
emp_id first_name last_name dept_id salary dept_name
<dbl> <chr> <chr> <dbl> <dbl> <chr>
1 1 Alice Susanna Smith 1 60000 HR
2 2 Bob Johnson 2 75000 Engineering
3 3 Mary Anne Brown NA 80000 <NA>
4 4 Diana Lee 3 90000 Marketing
5 5 Eve Davis 4 65000 <NA>
6 6 Frank DeVito Jr. 2 70000 Engineering
SELECT emp.*, dept.dept_nameFROM employees AS empLEFTJOIN departments AS deptON emp.dept_id = dept.dept_id;
6 records
emp_id
first_name
last_name
dept_id
salary
dept_name
1
Alice Susanna
Smith
1
60000
HR
2
Bob
Johnson
2
75000
Engineering
3
Mary Anne
Brown
NA
80000
NA
4
Diana
Lee
3
90000
Marketing
5
Eve
Davis
4
65000
NA
6
Frank
DeVito Jr.
2
70000
Engineering
here,
All employees are retained, even those with NA or unmatched dept_id.
dept_id = 4 (Eve) and NA (Charlie) have NA for department columns.
Right Join | right_join()
Similar to Left Join, Right join Keeps all rows from the right table, adding matched columns from the left table. The unmatched rows from the right table get NA for left-table columns. The right table is preserved entirely and the left table is attached where possible.
employees |>right_join(departments, by ="dept_id")
# A tibble: 5 × 9
emp_id first_name last_name dept_id salary dept_name manager_first
<dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 Alice Susanna Smith 1 60000 HR alice
2 2 Bob Johnson 2 75000 Engineering robert
3 4 Diana Lee 3 90000 Marketing diana
4 6 Frank DeVito Jr. 2 70000 Engineering robert
5 NA <NA> <NA> 5 NA Finance carol
# ℹ 2 more variables: manager_last <chr>, budget <dbl>
SELECT*FROM employees AS empRIGHTJOIN departments AS deptON emp.dept_id = dept.dept_id;
5 records
emp_id
first_name
last_name
dept_id
salary
dept_id
dept_name
manager_first
manager_last
budget
1
Alice Susanna
Smith
1
60000
1
HR
alice
smith
500000
2
Bob
Johnson
2
75000
2
Engineering
robert
johnson
1000000
4
Diana
Lee
3
90000
3
Marketing
diana
lee
750000
6
Frank
DeVito Jr.
2
70000
2
Engineering
robert
johnson
1000000
NA
NA
NA
NA
NA
5
Finance
carol
taylor
600000
here,
All departments are retained, even those with unmatched dept_id. The corresponding emp_id was set to NA if not matched with items in the departments.
dept_id = 5 (Finance department) where carol is the manager is retained even if she des not have any matched records in employees table.
Full Join | full_join()
Full join keeps all rows from both tables, filling NA where no match exists. Output is the union of both tables, with missing values for unmatched rows. In other words, it combines all data preserving everything.
employees %>%full_join(departments, by ="dept_id")
# A tibble: 7 × 9
emp_id first_name last_name dept_id salary dept_name manager_first
<dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1 Alice Susanna Smith 1 60000 HR alice
2 2 Bob Johnson 2 75000 Engineering robert
3 3 Mary Anne Brown NA 80000 <NA> <NA>
4 4 Diana Lee 3 90000 Marketing diana
5 5 Eve Davis 4 65000 <NA> <NA>
6 6 Frank DeVito Jr. 2 70000 Engineering robert
7 NA <NA> <NA> 5 NA Finance carol
# ℹ 2 more variables: manager_last <chr>, budget <dbl>
SELECT*FROM employees FULLOUTERJOIN departments ON employees.dept_id = departments.dept_id;
7 records
emp_id
first_name
last_name
dept_id
salary
dept_id
dept_name
manager_first
manager_last
budget
1
Alice Susanna
Smith
1
60000
1
HR
alice
smith
500000
2
Bob
Johnson
2
75000
2
Engineering
robert
johnson
1000000
3
Mary Anne
Brown
NA
80000
NA
NA
NA
NA
NA
4
Diana
Lee
3
90000
3
Marketing
diana
lee
750000
5
Eve
Davis
4
65000
NA
NA
NA
NA
NA
6
Frank
DeVito Jr.
2
70000
2
Engineering
robert
johnson
1000000
NA
NA
NA
NA
NA
5
Finance
carol
taylor
600000
here,
Includes all employees and all departments, even unmatched ones.
dept_id = 5 (Finance) appears with NA for employee columns.
Filtering Joins
Filtering joins subset rows from one table based on another, without merging columns.
Semi-Join | semi_join()
Semi-Join Returns rows from the left table that have a match in the right table. It returns a subset of the left table but no columns from the right table are added. For example: filtering the employees who belongs to one of the department in the departments table based on their dept_id information.
employees %>%semi_join(departments, by ="dept_id")
# A tibble: 4 × 5
emp_id first_name last_name dept_id salary
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Alice Susanna Smith 1 60000
2 2 Bob Johnson 2 75000
3 4 Diana Lee 3 90000
4 6 Frank DeVito Jr. 2 70000
SELECT employees.*FROM employees WHEREEXISTS (SELECT1FROM departments WHERE employees.dept_id = departments.dept_id);
4 records
emp_id
first_name
last_name
dept_id
salary
1
Alice Susanna
Smith
1
60000
2
Bob
Johnson
2
75000
4
Diana
Lee
3
90000
6
Frank
DeVito Jr.
2
70000
here,
Keeps only employees with dept_id present in departments.
Drops emp_id=3 (NA) and emp_id=5 (unmatched dept_id=4).
Anti-Join | anti_join()
Anti-Join returns rows from the left table with no match in the right table. For example: filtering the employees who do not belong to any of the department in the departments table.
employees %>%anti_join(departments, by ="dept_id")
# A tibble: 2 × 5
emp_id first_name last_name dept_id salary
<dbl> <chr> <chr> <dbl> <dbl>
1 3 Mary Anne Brown NA 80000
2 5 Eve Davis 4 65000
SELECT employees.*FROM employees WHERENOTEXISTS (SELECT1FROM departments WHERE employees.dept_id = departments.dept_id);
2 records
emp_id
first_name
last_name
dept_id
salary
3
Mary Anne
Brown
NA
80000
5
Eve
Davis
4
65000
here, only employees with dept_id not in departments (or NA) are kept.
Advanced Cases: Preprocessing and Multiple Keys
Suppose we need find the employees with their departments who are also manager in their department. Here we have two issues to deal with:
We need multiple keys (fields/ variables) to join. For example: dept_id in both table and first_name/ last_name from employees table and manager_first/ manager_last from departments table.
To solve this we can use multiple keys to join the tables. We can use semi-join or inner-join on employees table using department tables.
The first_name and last_name from employees table has uppercase characters and middle name in first_name field while manager_first and manager_last from departments tables are all lowercase. We need to preprocess them before using them as a joining key.
To solve this, we lowercase the first word from first_name and manager_first and the last word from last_name and manager_last from tables employees and departments respectively.
# A tibble: 2 × 5
emp_id first_name last_name salary dept_name
<dbl> <chr> <chr> <dbl> <chr>
1 1 Alice Susanna Smith 60000 HR
2 4 Diana Lee 90000 Marketing
SELECT emp.emp_id, emp.first_name, emp.last_name, emp.salary, dept.dept_nameFROM employees empJOIN departments dept ON emp.dept_id = dept.dept_idWHERE-- First name: match first wordLOWER(SUBSTR( emp.first_name, 1, INSTR(emp.first_name ||' ', ' ') -1 )) =LOWER(SUBSTR( dept.manager_first, 1, INSTR(dept.manager_first ||' ', ' ') -1 ))-- Last name: match last wordANDLOWER(SUBSTR( emp.last_name, INSTR(' '|| emp.last_name, ' ') +1 )) =LOWER(SUBSTR( dept.manager_last, INSTR(' '|| dept.manager_last, ' ') +1 ));
2 records
emp_id
first_name
last_name
salary
dept_name
1
Alice Susanna
Smith
60000
HR
4
Diana
Lee
90000
Marketing
Since the employees table uses first_name/last_name while departments uses manager_first/manager_last, I standardized the keys by:
Extracting the first word from first_name and the last word from last_name (to handle middle names/suffixes).
Converting them to lowercase for case-insensitive matching.
Also, check out a post, I have used all these concepts in a real dataset.
Joins are essential for combining data in R. With dplyr, you can handle mismatched keys, missing values, and multi-column joins—just clean your keys first and verify results with anti_join(). Try these techniques on your own data, or explore more examples here.