Comparison

Comparison between data.frame, data.table, and tidytable

Author

Raju Rimal

Published

November 30, 2024

Modified

March 19, 2025

Syntax Comparison Table

Operation Syntax
Creating a table df <- data.frame(x = 1:5, y = letters[1:5])
Selecting columns df[c("x", "y")]
Filtering rows df[df$x > 3, ]
Adding a column df$new_col <- df$x + 1
Modifying a column df$x <- df$x * 2
Summarizing aggregate(x ~ y, df, mean)
Grouped operation tapply(df$x, df$y, mean)
Joining tables merge(df1, df2, by = "key")
Binding rows rbind(df1, df2)
Reshaping (wide → long) reshape(df, varying, v.names, timevar, ...)
Reshaping (long → wide) reshape(df, direction = "wide")
Row-wise operation apply(df, 1, sum)
Sorting rows df[order(df$x), ]
Renaming columns names(df)[1] <- "new_name"
Checking structure str(df)
Using pipes Not natively supported, use %>% from magrittr

Operation Syntax
Creating a table dt <- data.table(x = 1:5, y = letters[1:5])
Selecting columns dt[, .(x, y)]
Filtering rows dt[x > 3]
Adding a column dt[, new_col := x + 1]
Modifying a column dt[, x := x * 2]
Summarizing dt[, .(mean_x = mean(x)), by = y]
Grouped operation dt[, .(mean_x = mean(x)), by = y]
Joining tables dt1[dt2, on = "key"]
Binding rows rbind(dt1, dt2)
Reshaping (wide → long) melt(dt, id.vars, measure.vars)
Reshaping (long → wide) dcast(dt, formula, value.var)
Row-wise operation dt[, row_sum := rowSums(.SD), .SDcols = x:y]
Sorting rows dt[order(x)]
Renaming columns setnames(dt, "old_name", "new_name")
Checking structure str(dt)
Using pipes Supported with %>% or native pipes

Operation Syntax
Creating a table tt <- tidytable(x = 1:5, y = letters[1:5])
Selecting columns tt %>% select(x, y)
Filtering rows tt %>% filter(x > 3)
Adding a column tt %>% mutate(new_col = x + 1)
Modifying a column tt %>% mutate(x = x * 2)
Summarizing tt %>% summarize(mean_x = mean(x), .by = y)
Grouped operation tt %>% summarize(mean_x = mean(x), .by = y)
Joining tables left_join(tt, tt2, by = "key")
Binding rows bind_rows(tt1, tt2)
Reshaping (wide → long) pivot_longer(tt, cols, names_to, values_to)
Reshaping (long → wide) pivot_wider(tt, names_from, values_from)
Row-wise operation tt %>% mutate(row_sum = c_across(x:y, sum))
Sorting rows tt %>% arrange(x)
Renaming columns tt %>% rename(new_name = old_name)
Checking structure tt %>% glimpse()
Using pipes Fully integrated, %>% and `

Performance Benchmarking

When working with tabular data in R, selecting the right framework—data.frame, data.table, tidyverse, or tidytable—can make a significant difference in performance and usability. This article benchmarks these frameworks for common operations like selecting, filtering, and summarizing.


Setup: The Dataset and Benchmarking Framework

We’ll use a synthetic dataset with 1 million rows for benchmarking:

library(tidyverse)
library(tidytable)
library(data.table)
library(microbenchmark)

# Create a synthetic dataset
n <- 10^6
df <- data.frame(
  id = sample(1:1000, n, replace = TRUE),
  value = rnorm(n)
)
tt <- as_tidytable(df)
dt <- as.data.table(df)

We’ll compare the frameworks on three key operations: selecting columns, filtering rows, and grouped summarization. Benchmarks are conducted using the microbenchmark package.


Benchmarking Operations

1. Selecting Columns

Extracting specific columns is a common operation. Here’s how it’s done across the frameworks:

# Benchmark
mb_select <- microbenchmark(
  base = df[c("id", "value")],
  datatable = dt[, .(id, value)],
  dplyr = dplyr::select(df, id, value),
  tidytable = tt %>% tidytable::select(id, value),
  times = 25
)
summary(mb_select) %>% select(-cld)
# A tidytable: 4 × 8
  expr         min      lq    mean  median      uq     max neval
  <fct>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
1 base        31.8    39.6    51.4    44.9    50.5    110.    25
2 datatable 8004.  12465.  38210.  13735.  31066.  117135.    25
3 dplyr     1527.   1588.   1794.   1609.   2104.    2668.    25
4 tidytable 1423.   1462.   1856.   1495.   1919.    6591.    25

2. Filtering Rows

Filtering rows based on a condition:

# Benchmark
mb_filter <- microbenchmark(
  base = df[df$id == 500, ],
  datatable = dt[id == 500],
  dplyr = dplyr::filter(df, id == 500),
  tidytable = tt %>% tidytable::filter(id == 500),
  times = 25
)
summary(mb_filter) %>% select(-cld)
# A tidytable: 4 × 8
  expr        min    lq  mean median    uq   max neval
  <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 base       5.37  5.79  7.90   7.61  9.60  12.5    25
2 datatable  4.03  4.32  5.84   4.69  6.69  15.9    25
3 dplyr     11.1  11.7  22.3   15.3  17.7  196.     25
4 tidytable  5.63  6.30  7.91   6.60  7.87  27.4    25

3. Grouped Summarization

Computing the mean of a column, grouped by another column:

# Benchmark
mb_aggregate <- microbenchmark(
  base = aggregate(value ~ id, data = df, FUN = mean),
  datatable = dt[, .(mean_value = mean(value)), by = id],
  dplyr = df %>% dplyr::group_by(id) %>% dplyr::summarize(mean_value = mean(value)),
  tidytable = tt %>% tidytable::summarize(mean_value = mean(value), .by = id),
  times = 25
)
summary(mb_aggregate) %>% select(-cld)
# A tidytable: 4 × 8
  expr        min    lq  mean median    uq   max neval
  <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 base      437.  482.  594.   625.  652.  890.     25
2 datatable  16.3  17.9  35.2   18.6  23.1 212.     25
3 dplyr      40.3  43.1  46.1   44.2  49.4  57.0    25
4 tidytable  24.2  25.1  35.3   29.0  30.1 214.     25

4. Joining Two Datasets

Joining tables by a key column:

# Create a second dataset
df2 <- data.frame(id = 1:1000, key_value = rnorm(1000))
dt2 <- as.data.table(df2)
tt2 <- as_tidytable(df2)

# Benchmark
mb_join <- microbenchmark(
  base = merge(df, df2, by = "id"),
  datatable = dt[dt2, on = "id"],
  dplyr = dplyr::left_join(df, df2, by = "id"),
  tidytable = tidytable::left_join(tt, tt2, by = "id"),
  times = 25
)
summary(mb_join) %>% select(-cld)
# A tidytable: 4 × 8
  expr         min     lq   mean median     uq   max neval
  <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
1 base      2004.  2185.  2440.  2348.  2734.  2943.    25
2 datatable   22.4   23.1   76.9   24.4   82.2  439.    25
3 dplyr       79.6   87.3  141.   100.   146.   458.    25
4 tidytable   36.0   36.8   91.4   39.4   93.2  386.    25

Performance Results

Below is a summary of the average execution time (in milliseconds) across 50 runs for each operation.

avg_performance <- bind_rows(
  select = summary(mb_select),
  filter = summary(mb_filter),
  summarize = summary(mb_aggregate),
  joining = summary(mb_join),
  .id = "Operation"
) %>% select(expr, Operation, mean)

avg_performance %>% 
  pivot_wider(names_from = "expr", values_from = "mean")
# A tidytable: 4 × 5
  Operation    base datatable  dplyr tidytable
  <chr>       <dbl>     <dbl>  <dbl>     <dbl>
1 filter       7.90      5.84   22.3      7.91
2 joining   2440.       76.9   141.      91.4 
3 select      51.4   38210.   1794.    1856.  
4 summarize  594.       35.2    46.1     35.3 

Analysis

Key Observations:

  1. Speed:
    • data.table and tidytable are significantly faster than data.frame and tidyverse for all operations.
    • tidytable closely matches the performance of data.table, as it uses the same backend.
  2. Ease of Use:
    • data.frame syntax is verbose and lacks modern conveniences.
    • tidyverse offers intuitive syntax, making it suitable for beginners.
    • tidytable combines the syntax of tidyverse with the speed of data.table.
  3. Scalability:
    • For datasets with millions of rows, data.table and tidytable remain efficient, while data.frame and tidyverse struggle.

Conclusion

Framework Best For
data.frame Beginners handling small datasets; foundational learning.
data.table Advanced users managing large datasets with performance-critical tasks.
tidyverse Beginners/intermediate users valuing syntax simplicity and readability.
tidytable Users who want the best of both worlds: tidyverse syntax + data.table speed.

If you work with large datasets and value performance, data.table or tidytable is your go-to. If readability and ease of use are more important, stick with tidyverse.