Strings

String Manipulation and Regular Expressions with Practical Examples

Author

Raju Rimal

Published

November 30, 2024

Modified

March 19, 2025

Strings are a crucial part of data manipulation and analysis in R. From cleaning messy datasets to extracting specific information, working efficiently with text saves time and improves the quality of your results. This post covers string manipulation in R, focusing on base R string functions, the stringr package, and regular expressions.


1. The Basics of Strings in R

In R, a string is a value of type character, written as text enclosed in double or single quotes.

# A simple string
my_string <- "Hello, R!"
print(my_string)
[1] "Hello, R!"
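
To make the "character type" point concrete, here is a minimal sketch that inspects the string with base R only. It reuses the my_string value from above and introduces no new names.

```r
# A single string is a character vector of length one
my_string <- "Hello, R!"

class(my_string)         # "character"
is.character(my_string)  # TRUE
length(my_string)        # 1: one element, not nine
nchar(my_string)         # 9: characters inside that element

# Single and double quotes create the same value
same_value <- identical('Hello, R!', "Hello, R!")  # TRUE
```

Note the difference between length(), which counts elements, and nchar(), which counts characters.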

Strings are often stored in vectors, making them compatible with R’s vectorized operations:

# A character vector
fruits <- c("Apple", "Banana", "Cherry")
print(fruits)
[1] "Apple"  "Banana" "Cherry"
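
As a small illustration of what "vectorized" means here, the sketch below applies base R functions to the whole fruits vector at once. The variable names fruit_lengths, fruit_upper, and long_fruits are chosen for this example only.

```r
fruits <- c("Apple", "Banana", "Cherry")

# Each function is applied to every element; no loop is needed
fruit_lengths <- nchar(fruits)    # 5 6 6
fruit_upper   <- toupper(fruits)  # "APPLE" "BANANA" "CHERRY"

# The vectorized result can be used directly for logical indexing
long_fruits <- fruits[fruit_lengths > 5]  # "Banana" "Cherry"
```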

2. Core String Functions in Base R

A. Basic Operations

  1. nchar(): Count Characters
nchar("Banana")
[1] 6
  2. toupper() and tolower(): Change Case
toupper("hello")
[1] "HELLO"
tolower("HELLO")
[1] "hello"
  3. substr(): Extract or Replace Substrings
my_string <- "Banana"

# Extract characters 1 to 3
substr(my_string, 1, 3)
[1] "Ban"
# Replace characters 1 to 3
substr(my_string, 1, 3) <- "Pan"
print(my_string)
[1] "Panana"
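
The functions above can be combined. The following is a sketch, not the only way to do it: it capitalizes the first letter of each word using substr() and toupper(). The vectors words and capitalized are made up for this example.

```r
words <- c("apple", "banana", "cherry")

# Upper-case only the first character of each word.
# The replacement form of substr() is vectorized over the whole vector.
capitalized <- words
substr(capitalized, 1, 1) <- toupper(substr(capitalized, 1, 1))
capitalized  # "Apple" "Banana" "Cherry"

# Replacement never changes the length of the string:
# only as many characters as fit in positions 1 to 3 are used.
short <- "Banana"
substr(short, 1, 3) <- "Pineapple"
short  # "Pinana"
```

Because substr() replacement keeps the string length fixed, it suits fixed-width edits; for replacements of a different length, a pattern-based function is the better fit.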

3. Advanced String Manipulation with stringr

The stringr package, part of the tidyverse, simplifies string operations with a consistent syntax: its functions share the str_ prefix and take the string as their first argument.

A. Detecting Patterns with str_detect()

library(stringr)

# Check if a string contains "ana"
str_detect("Banana", "ana")
[1] TRUE
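
str_detect() is vectorized as well, so it returns one TRUE or FALSE per element. The sketch below uses it to filter a vector and shows that matching is case-sensitive unless the pattern is wrapped in regex(ignore_case = TRUE). The name has_an is illustrative.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Cherry")

# One logical value per element
has_an <- str_detect(fruits, "an")  # FALSE TRUE FALSE
fruits[has_an]                      # "Banana"

# Matching is case-sensitive by default
lower_a <- str_detect(fruits, "a")                             # FALSE TRUE FALSE
any_a   <- str_detect(fruits, regex("a", ignore_case = TRUE))  # TRUE TRUE FALSE
```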

B. Extracting Patterns with str_extract()

# Extract the first occurrence of a digit
str_extract("Order123", "\\d")
[1] "1"
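
Two details are worth seeing side by side: "\\d" matches one digit while "\\d+" matches a whole run of digits, and str_extract() returns NA when nothing matches. The orders vector below is invented for illustration.

```r
library(stringr)

one_digit  <- str_extract("Order123", "\\d")   # "1"
all_digits <- str_extract("Order123", "\\d+")  # "123"

# Hypothetical labels, some without any digits
orders <- c("Order123", "Order7", "Pending")
ids <- str_extract(orders, "\\d+")  # "123" "7" NA
```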

C. Replacing Patterns with str_replace()

# Replace "ana" with "XYZ"
str_replace("Banana", "ana", "XYZ")
[1] "BXYZna"
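
str_replace() changes only the first match in each string. To change every match, stringr provides str_replace_all(), as this short sketch shows (the variable names are illustrative).

```r
library(stringr)

first_only <- str_replace("Banana", "a", "o")      # "Bonana"
every_one  <- str_replace_all("Banana", "a", "o")  # "Bonono"

# Matches do not overlap: after "ana" is consumed at positions 2 to 4,
# only "na" is left, so there is no second match.
no_overlap <- str_replace_all("Banana", "ana", "XYZ")  # "BXYZna"
```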

D. Splitting Strings with str_split()

# Split a string by commas
str_split("Apple, Banana, Cherry", ", ")
[[1]]
[1] "Apple"  "Banana" "Cherry"
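
The [[1]] in the output above indicates that str_split() returns a list with one element per input string. A minimal sketch of how to work with that result follows; rows and pieces are example names, not part of the original post.

```r
library(stringr)

parts <- str_split("Apple, Banana, Cherry", ", ")
first_input <- parts[[1]]  # plain character vector
first_input[2]             # "Banana"

# With several inputs, the list has one element per input
rows <- c("a,b", "c,d,e")
pieces <- str_split(rows, ",")
lengths(pieces)  # 2 3

# simplify = TRUE returns a character matrix instead,
# padding shorter rows with empty strings
piece_matrix <- str_split(rows, ",", simplify = TRUE)
dim(piece_matrix)  # 2 3
```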

4. The Magic of Regular Expressions

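Regular expressions (regex) describe text patterns compactly, so a single pattern can match, extract, or replace many different strings. In ordinary R string literals each backslash must be doubled, which is why the digit class \d is written "\\d".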

A. Regex Basics

Symbol   Meaning                    Example
.        Any character              "a.c" matches "abc"
*        Zero or more occurrences   "a*" matches "aaa"
+        One or more occurrences    "a+" matches "aa"
\\d      Any digit                  "\\d" matches "1"
^        Start of a string          "^A" matches "Apple"
$        End of a string            "e$" matches "Apple"
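
Each symbol in the table can be checked directly with stringr. The sketch below pairs a matching case with a non-matching one where that helps; the test strings are made up for illustration.

```r
library(stringr)

dot_yes  <- str_detect("abc", "a.c")     # TRUE:  "." stands in for "b"
dot_no   <- str_detect("ac", "a.c")      # FALSE: "." needs one character
star_yes <- str_detect("bc", "^a*bc$")   # TRUE:  "a*" may match zero times
plus_no  <- str_detect("bc", "^a+bc$")   # FALSE: "a+" needs at least one "a"
run_of_a <- str_extract("caaat", "a+")   # "aaa"
digit    <- str_extract("Room 7", "\\d") # "7"
starts_A <- str_detect("Apple", "^A")    # TRUE
ends_e   <- str_detect("Apple", "e$")    # TRUE
ends_A   <- str_detect("Apple", "A$")    # FALSE: "A" is not at the end
```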

B. Practical Examples with Regex

  1. Find All Words That Start with “B”
fruits <- c("Apple", "Banana", "Cherry", "Blueberry")
str_subset(fruits, "^B")
[1] "Banana"    "Blueberry"
  2. Remove Non-Alphanumeric Characters
text <- "Hello, World! @2024"
str_replace_all(text, "[^a-zA-Z0-9]", "")
[1] "HelloWorld2024"
  3. Extract All Numbers from Text
text <- "Order123 arrived at 4pm."
str_extract_all(text, "\\d+")
[[1]]
[1] "123" "4"  
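
The extracted values are still character strings. If they are needed as numbers, one possible follow-up step is to take the first list element and convert it, as sketched here (numbers is an illustrative name).

```r
library(stringr)

text <- "Order123 arrived at 4pm."

# str_extract_all() returns a list; [[1]] takes the matches for the first input
numbers <- as.numeric(str_extract_all(text, "\\d+")[[1]])
numbers       # 123 4
sum(numbers)  # 127
```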

5. String Manipulation in Data Frames

A. Adding a Prefix or Suffix

library(dplyr)

# Add a prefix "Fruit: " to each fruit name
fruits <- data.frame(name = c("Apple", "Banana", "Cherry"))
fruits <- fruits %>% mutate(name = str_c("Fruit: ", name))
print(fruits)
           name
1  Fruit: Apple
2 Fruit: Banana
3 Fruit: Cherry
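
The same pattern extends to other stringr functions inside mutate(). The sketch below assumes a hypothetical orders data frame with a label column, invented for this example, and derives new columns from it.

```r
library(dplyr)
library(stringr)

# Hypothetical data, invented for illustration
orders <- data.frame(label = c("Order123", "Order7", "Pending"))

orders <- orders %>%
  mutate(
    has_id = str_detect(label, "\\d"),                # does the label contain a digit?
    id     = as.integer(str_extract(label, "\\d+")),  # NA where no digits exist
    label  = str_to_upper(label)                      # changed last, so the lines above see the original
  )
print(orders)
```

Since mutate() evaluates its arguments in order, overwriting label last keeps the earlier expressions working on the original values.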

B. Cleaning Text Data

# Remove leading and trailing spaces
dirty_text <- c("  Hello  ", "  World ")
cleaned_text <- str_trim(dirty_text)
print(cleaned_text)
[1] "Hello" "World"
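
str_trim() only touches the ends of a string. Two related options, sketched below, are its side argument and str_squish(), which also collapses repeated whitespace inside the string. The example strings are made up.

```r
library(stringr)

left_only  <- str_trim("  Hello  ", side = "left")   # "Hello  "
right_only <- str_trim("  Hello  ", side = "right")  # "  Hello"

# str_squish() trims both ends and reduces inner runs of whitespace to one space
squished <- str_squish("  Hello    World  ")  # "Hello World"
```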

6. Performance Considerations

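For large datasets, consider calling stringi directly. stringr is built on top of stringi, so both share the same underlying engine, but stringi exposes a larger set of functions and avoids stringr's wrapper layer, which can help in complex or performance-sensitive text processing.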

library(stringi)

# Count occurrences of a pattern
stri_count_regex(c("apple", "banana", "cherry"), "a")
[1] 1 3 0
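
stringi function names follow a stri_<action>_<engine> scheme, so the stringr calls used earlier have close counterparts. The sketch below shows a few of them on the same vector; it does not measure speed, which is best checked on your own data.

```r
library(stringi)

x <- c("apple", "banana", "cherry")

detected <- stri_detect_regex(x, "an")              # FALSE TRUE FALSE
first_vowel <- stri_extract_first_regex(x, "[aeiou]")  # "a" "a" "e"
replaced <- stri_replace_all_regex(x, "a", "o")     # "opple" "bonono" "cherry"
```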

7. Practical Applications

A. Validating Email Addresses

emails <- c("test@example.com", "invalid_email", "user@domain.com")
valid_emails <- str_subset(emails, "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
print(valid_emails)
[1] "test@example.com" "user@domain.com" 
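
When the invalid entries matter too, str_detect() gives a logical flag per address instead of a filtered vector. The sketch below reuses the same pattern, stored in an illustrative variable email_pattern. Treat the pattern as a pragmatic approximation rather than a full standards check: it still accepts some malformed addresses.

```r
library(stringr)

emails <- c("test@example.com", "invalid_email", "user@domain.com")
email_pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

is_valid <- str_detect(emails, email_pattern)  # TRUE FALSE TRUE
rejected <- emails[!is_valid]                  # "invalid_email"

# Limitation: consecutive dots in the domain still pass
double_dot <- str_detect("a@b..cc", email_pattern)  # TRUE
```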

B. Parsing Logs

logs <- c("ERROR: Disk full", "INFO: Process started", "WARNING: Low memory")
error_logs <- str_subset(logs, "^ERROR")
print(error_logs)
[1] "ERROR: Disk full"
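
Beyond filtering, log lines can be split into fields. One way to do this, sketched here, uses str_match() with capture groups to separate the level from the message. The log_table data frame and its column names are assumptions made for this example.

```r
library(stringr)

logs <- c("ERROR: Disk full", "INFO: Process started", "WARNING: Low memory")

# Column 1 is the full match; columns 2 and 3 are the two capture groups
parsed <- str_match(logs, "^(\\w+): (.*)$")

log_table <- data.frame(
  level   = parsed[, 2],
  message = parsed[, 3]
)
print(log_table)
```

A line that does not fit the pattern yields NA in every column, which makes unparsed entries easy to spot.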

8. Exercise: Practice Your String Skills

  1. Extract all words that end with “ing” from a sentence.
  2. Replace all vowels in a text with “*”.
  3. Split a sentence into individual words and count the occurrences of each word.
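
If you want to compare notes after trying the exercises, here is one possible approach to each, using only functions covered in this post plus base R's table(). The sample sentences are invented, and other solutions are equally valid.

```r
library(stringr)

# 1. Words ending in "ing": \\b marks a word boundary, \\w+ the rest of the word
sentence <- "I was singing and dancing while reading"
ing_words <- str_extract_all(sentence, "\\b\\w+ing\\b")[[1]]
ing_words  # "singing" "dancing" "reading"

# 2. Replace every vowel, upper or lower case, with "*"
masked <- str_replace_all("Hello World", "[aeiouAEIOU]", "*")
masked  # "H*ll* W*rld"

# 3. Split on whitespace, then count each word with table()
words <- str_split(str_to_lower("The cat and the hat"), "\\s+")[[1]]
word_counts <- table(words)
word_counts[["the"]]  # 2
```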

Conclusion

Strings are more than just text—they’re data waiting to be transformed. By mastering R’s string manipulation functions and the power of regular expressions, you can efficiently clean, extract, and analyze text data.