library(tidyverse)
library(tidytable)
library(data.table)
library(microbenchmark)
# Create a synthetic dataset
<- 10^6
n <- data.frame(
df id = sample(1:1000, n, replace = TRUE),
value = rnorm(n)
)<- as_tidytable(df)
tt <- as.data.table(df) dt
Comparison
Comparison between data.frame, data.table, and tidytable
Syntax Comparison Table
Operation | Syntax |
---|---|
Creating a table | df <- data.frame(x = 1:5, y = letters[1:5]) |
Selecting columns | df[c("x", "y")] |
Filtering rows | df[df$x > 3, ] |
Adding a column | df$new_col <- df$x + 1 |
Modifying a column | df$x <- df$x * 2 |
Summarizing | aggregate(x ~ y, df, mean) |
Grouped operation | tapply(df$x, df$y, mean) |
Joining tables | merge(df1, df2, by = "key") |
Binding rows | rbind(df1, df2) |
Reshaping (wide → long) | reshape(df, varying, v.names, timevar, ...) |
Reshaping (long → wide) | reshape(df, direction = "wide") |
Row-wise operation | apply(df, 1, sum) |
Sorting rows | df[order(df$x), ] |
Renaming columns | names(df)[1] <- "new_name" |
Checking structure | str(df) |
Using pipes | Not natively supported, use %>% from magrittr |
Operation | Syntax |
---|---|
Creating a table | dt <- data.table(x = 1:5, y = letters[1:5]) |
Selecting columns | dt[, .(x, y)] |
Filtering rows | dt[x > 3] |
Adding a column | dt[, new_col := x + 1] |
Modifying a column | dt[, x := x * 2] |
Summarizing | dt[, .(mean_x = mean(x)), by = y] |
Grouped operation | dt[, .(mean_x = mean(x)), by = y] |
Joining tables | dt1[dt2, on = "key"] |
Binding rows | rbind(dt1, dt2) |
Reshaping (wide → long) | melt(dt, id.vars, measure.vars) |
Reshaping (long → wide) | dcast(dt, formula, value.var) |
Row-wise operation | dt[, row_sum := rowSums(.SD), .SDcols = x:y] |
Sorting rows | dt[order(x)] |
Renaming columns | setnames(dt, "old_name", "new_name") |
Checking structure | str(dt) |
Using pipes | Supported with %>% or native pipes |
Operation | Syntax |
---|---|
Creating a table | tt <- tidytable(x = 1:5, y = letters[1:5]) |
Selecting columns | tt %>% select(x, y) |
Filtering rows | tt %>% filter(x > 3) |
Adding a column | tt %>% mutate(new_col = x + 1) |
Modifying a column | tt %>% mutate(x = x * 2) |
Summarizing | tt %>% summarize(mean_x = mean(x), .by = y) |
Grouped operation | tt %>% summarize(mean_x = mean(x), .by = y) |
Joining tables | left_join(tt, tt2, by = "key") |
Binding rows | bind_rows(tt1, tt2) |
Reshaping (wide → long) | pivot_longer(tt, cols, names_to, values_to) |
Reshaping (long → wide) | pivot_wider(tt, names_from, values_from) |
Row-wise operation | tt %>% mutate(row_sum = c_across(x:y, sum)) |
Sorting rows | tt %>% arrange(x) |
Renaming columns | tt %>% rename(new_name = old_name) |
Checking structure | tt %>% glimpse() |
Using pipes | Fully integrated, %>% and ` |
Performance Benchmarking
When working with tabular data in R, selecting the right framework—data.frame
, data.table
, tidyverse
, or tidytable
—can make a significant difference in performance and usability. This article benchmarks these frameworks for common operations like selecting, filtering, and summarizing.
Setup: The Dataset and Benchmarking Framework
We’ll use a synthetic dataset with 1 million rows for benchmarking:
We’ll compare the frameworks on three key operations: selecting columns, filtering rows, and grouped summarization. Benchmarks are conducted using the microbenchmark
package.
Benchmarking Operations
1. Selecting Columns
Extracting specific columns is a common operation. Here’s how it’s done across the frameworks:
# Benchmark
<- microbenchmark(
mb_select base = df[c("id", "value")],
datatable = dt[, .(id, value)],
dplyr = dplyr::select(df, id, value),
tidytable = tt %>% tidytable::select(id, value),
times = 25
)summary(mb_select) %>% select(-cld)
# A tidytable: 4 × 8
expr min lq mean median uq max neval
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 base 31.8 39.6 51.4 44.9 50.5 110. 25
2 datatable 8004. 12465. 38210. 13735. 31066. 117135. 25
3 dplyr 1527. 1588. 1794. 1609. 2104. 2668. 25
4 tidytable 1423. 1462. 1856. 1495. 1919. 6591. 25
2. Filtering Rows
Filtering rows based on a condition:
# Benchmark
<- microbenchmark(
mb_filter base = df[df$id == 500, ],
datatable = dt[id == 500],
dplyr = dplyr::filter(df, id == 500),
tidytable = tt %>% tidytable::filter(id == 500),
times = 25
)summary(mb_filter) %>% select(-cld)
# A tidytable: 4 × 8
expr min lq mean median uq max neval
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 base 5.37 5.79 7.90 7.61 9.60 12.5 25
2 datatable 4.03 4.32 5.84 4.69 6.69 15.9 25
3 dplyr 11.1 11.7 22.3 15.3 17.7 196. 25
4 tidytable 5.63 6.30 7.91 6.60 7.87 27.4 25
3. Grouped Summarization
Computing the mean of a column, grouped by another column:
# Benchmark
<- microbenchmark(
mb_aggregate base = aggregate(value ~ id, data = df, FUN = mean),
datatable = dt[, .(mean_value = mean(value)), by = id],
dplyr = df %>% dplyr::group_by(id) %>% dplyr::summarize(mean_value = mean(value)),
tidytable = tt %>% tidytable::summarize(mean_value = mean(value), .by = id),
times = 25
)summary(mb_aggregate) %>% select(-cld)
# A tidytable: 4 × 8
expr min lq mean median uq max neval
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 base 437. 482. 594. 625. 652. 890. 25
2 datatable 16.3 17.9 35.2 18.6 23.1 212. 25
3 dplyr 40.3 43.1 46.1 44.2 49.4 57.0 25
4 tidytable 24.2 25.1 35.3 29.0 30.1 214. 25
4. Joining Two Datasets
Joining tables by a key column:
# Create a second dataset
<- data.frame(id = 1:1000, key_value = rnorm(1000))
df2 <- as.data.table(df2)
dt2 <- as_tidytable(df2)
tt2
# Benchmark
<- microbenchmark(
mb_join base = merge(df, df2, by = "id"),
datatable = dt[dt2, on = "id"],
dplyr = dplyr::left_join(df, df2, by = "id"),
tidytable = tidytable::left_join(tt, tt2, by = "id"),
times = 25
)summary(mb_join) %>% select(-cld)
# A tidytable: 4 × 8
expr min lq mean median uq max neval
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 base 2004. 2185. 2440. 2348. 2734. 2943. 25
2 datatable 22.4 23.1 76.9 24.4 82.2 439. 25
3 dplyr 79.6 87.3 141. 100. 146. 458. 25
4 tidytable 36.0 36.8 91.4 39.4 93.2 386. 25
Performance Results
Below is a summary of the average execution time (in milliseconds) across 50 runs for each operation.
<- bind_rows(
avg_performance select = summary(mb_select),
filter = summary(mb_filter),
summarize = summary(mb_aggregate),
joining = summary(mb_join),
.id = "Operation"
%>% select(expr, Operation, mean)
)
%>%
avg_performance pivot_wider(names_from = "expr", values_from = "mean")
# A tidytable: 4 × 5
Operation base datatable dplyr tidytable
<chr> <dbl> <dbl> <dbl> <dbl>
1 filter 7.90 5.84 22.3 7.91
2 joining 2440. 76.9 141. 91.4
3 select 51.4 38210. 1794. 1856.
4 summarize 594. 35.2 46.1 35.3
Analysis
Key Observations:
- Speed:
data.table
andtidytable
are significantly faster thandata.frame
andtidyverse
for all operations.tidytable
closely matches the performance ofdata.table
, as it uses the same backend.
- Ease of Use:
data.frame
syntax is verbose and lacks modern conveniences.tidyverse
offers intuitive syntax, making it suitable for beginners.tidytable
combines the syntax oftidyverse
with the speed ofdata.table
.
- Scalability:
- For datasets with millions of rows,
data.table
andtidytable
remain efficient, whiledata.frame
andtidyverse
struggle.
- For datasets with millions of rows,
Conclusion
Framework | Best For |
---|---|
data.frame | Beginners handling small datasets; foundational learning. |
data.table | Advanced users managing large datasets with performance-critical tasks. |
tidyverse | Beginners/intermediate users valuing syntax simplicity and readability. |
tidytable | Users who want the best of both worlds: tidyverse syntax + data.table speed. |
If you work with large datasets and value performance, data.table
or tidytable
is your go-to. If readability and ease of use are more important, stick with tidyverse
.