Data Table in R

A Powerful Upgrade for Tabular Data

Author

Raju Rimal

Published

November 30, 2024

Modified

March 19, 2025

The data.table package is a game-changer for handling tabular data in R. Designed to overcome the limitations of data.frame, it brings speed, simplicity, and flexibility to data manipulation. Whether you’re dealing with large datasets or looking for concise and readable code, data.table is an essential tool in any R programmer’s arsenal.


Introduction to data.table

At its core, data.table is an enhanced version of data.frame. It maintains compatibility with data.frame while adding:

  • Fast performance for large datasets.
  • Concise syntax for common operations.
  • Built-in chaining for efficient workflows.

Let’s explore why data.table has become a favorite among data analysts and statisticians.


Creating and Exploring a data.table

You can create a data.table just like you’d create a data.frame, but using the data.table() function:

library(data.table)

# Creating a data.table
dt <- data.table(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Score = c(90, 85, 88)
)

# View the structure
str(dt)
Classes 'data.table' and 'data.frame':  3 obs. of  3 variables:
 $ Name : chr  "Alice" "Bob" "Charlie"
 $ Age  : num  25 30 22
 $ Score: num  90 85 88
 - attr(*, ".internal.selfref")=<externalptr> 
# Inspect the first few rows
head(dt)
      Name   Age Score
    <char> <num> <num>
1:   Alice    25    90
2:     Bob    30    85
3: Charlie    22    88

Notice that data.table inherits many properties of data.frame, so if you’re familiar with data.frame, you’re already halfway there!


The Basics of data.table Syntax

The hallmark of data.table is its [i, j, by] syntax, where:

  • i specifies rows (like filtering).
  • j specifies columns (like selecting or transforming).
  • by is used for grouping.

Selecting Rows and Columns

# Select rows where Age > 25
dt[Age > 25]
     Name   Age Score
   <char> <num> <num>
1:    Bob    30    85
# Select specific columns
dt[, .(Name, Score)]
      Name Score
    <char> <num>
1:   Alice    90
2:     Bob    85
3: Charlie    88

Adding or Updating Columns

# Add a new column
dt[, Grade := ifelse(Score > 85, "A", "B")]

# Update a column
dt[Name == "Alice", Score := 95]

Notice the use of := for adding/updating columns. It modifies data.table by reference, making it memory efficient.


Efficient Data Aggregation

Grouping and summarizing data is where data.table truly shines:

# Calculate mean score by Grade
dt[, .(Mean_Score = mean(Score)), by = Grade]
    Grade Mean_Score
   <char>      <num>
1:      A       91.5
2:      B       85.0

You can also group by multiple columns:

# Group by Grade and another variable
dt[, .(Count = .N), by = .(Grade, Age > 25)]
    Grade    Age Count
   <char> <lgcl> <int>
1:      A  FALSE     2
2:      B   TRUE     1

The .N symbol represents the count of rows in each group—just one of data.table’s many built-in conveniences.


Joining Tables

Merging datasets in data.table is fast and intuitive, using the merge() function or direct syntax:

# Two example tables
dt1 <- data.table(
  ID = 1:3, 
  Name = c("Alice", "Bob", "Charlie")
)
dt2 <- data.table(
  ID = 2:4, 
  Score = c(85, 88, 90)
)

# Inner join
merged <- merge(dt1, dt2, by = "ID")

# Left join
merged_left <- merge(dt1, dt2, by = "ID", all.x = TRUE)

Alternatively, you can join using the on argument directly:

# Fast join
dt1[dt2, on = "ID"]
      ID    Name Score
   <int>  <char> <num>
1:     2     Bob    85
2:     3 Charlie    88
3:     4    <NA>    90

Reshaping Data with data.table

The data.table package makes reshaping data straightforward with melt() and dcast() functions:

# Wide to long
long <- melt(
  data = dt, 
  id.vars = "Name", 
  measure.vars = c("Age", "Score")
)

# Long to wide
wide <- dcast(
  data = long, 
  formula = Name ~ variable, 
  value.var = "value"
)

Built-In Optimizations

data.table is designed for speed. It optimizes operations like:

  1. Fast row filtering: Using binary search when the data is sorted.
  2. Efficient memory use: Modifies objects by reference.
  3. Automatic indexing: Speeds up repeated queries.

For example:

# Set a key for fast filtering
setkey(dt, Name)

# Perform fast lookups
dt["Alice"]
Key: <Name>
     Name   Age Score  Grade
   <char> <num> <num> <char>
1:  Alice    25    95      A

Common Functions in data.table

Here are some additional handy functions:

Sorting Data

# Sort by Score (descending)
dt[order(-Score)]
      Name   Age Score  Grade
    <char> <num> <num> <char>
1:   Alice    25    95      A
2: Charlie    22    88      A
3:     Bob    30    85      B

Chaining Operations

With data.table, you can chain multiple operations for concise, readable code:

dt[Score > 85, .(Name, Grade)][order(Name)]
Key: <Name>
      Name  Grade
    <char> <char>
1:   Alice      A
2: Charlie      A

Using .SD for Advanced Group Operations

The .SD object contains all columns except the ones in by:

# Calculate stats for each Grade
dt[
  , lapply(.SD, mean), 
  by = Grade, 
  .SDcols = c("Age", "Score")
]
    Grade   Age Score
   <char> <num> <num>
1:      A  23.5  91.5
2:      B  30.0  85.0

Strengths and Limitations of data.table

Strengths

  • Blazing-fast performance with large datasets.
  • Elegant syntax for common operations.
  • Memory efficiency with by-reference updates.

Limitations

  • Steeper learning curve compared to data.frame.
  • Less flexible with non-standard evaluation (e.g., programming with column names).

Conclusion

data.table is a powerful, high-performance alternative to data.frame. Once you understand its [i, j, by] syntax, it’s hard to go back. Whether you’re handling small datasets or millions of rows, data.table equips you with the tools to manipulate data efficiently.