Diving Deeper

Advanced Geometries and Statistical Layers

As you become more familiar with basic ggplot2 plots, it’s time to dive deeper and explore advanced plotting techniques. This topic focuses on visualizing distributions, adding statistical summaries to plots, and working with specialized geometries like heatmaps and 2D density plots. These advanced features of ggplot2 will enable you to create more insightful and sophisticated visualizations that can reveal deeper patterns in your data.


1. Visualizing Distributions

Visualizing the distribution of your data is crucial for understanding its underlying structure. ggplot2 offers several geometries for distribution plots, including histograms, density plots, and boxplots. These plots help you see the spread, skewness, and possible outliers in your data.

Histograms

Histograms are great for visualizing the distribution of a single variable, especially when you have continuous data. In a histogram, the data is divided into bins, and the frequency of data points in each bin is plotted.

Example: Histogram

library(ggplot2)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Histogram of Miles Per Gallon")

In this example:

  • geom_histogram(binwidth = 2): Creates a histogram with a bin width of 2.
  • fill = "blue", color = "black": Customizes the colors of the bars and borders.
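
The bin width strongly affects what a histogram appears to show, so it can help to try more than one setting. The following is a minimal sketch, assuming the built-in mtcars data; the object names (p_width, p_bins) and the values 5 and 10 are illustrative choices, not recommendations. It bins the same variable by a fixed width and by a fixed number of bins, then uses layer_data() to inspect the counts ggplot2 computed.

```r
library(ggplot2)

# Same variable, two binning choices: a fixed bin width versus a fixed number of bins.
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black")

p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "blue", color = "black")

# layer_data() returns the values ggplot2 computed for a layer,
# including the number of observations that fell into each bin.
counts_width <- layer_data(p_width)$count
counts_bins <- layer_data(p_bins)$count

print(counts_width)
print(counts_bins)
```

Whichever setting is used, every observation lands in exactly one bin, so the counts always add up to the number of rows in the data.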

Density Plots

Density plots provide a smoothed version of the histogram, which is useful for understanding the shape of a distribution without depending on a choice of bin width (the result does still depend on the smoothing bandwidth). They are especially helpful when comparing the distribution of a variable across several groups.

Example: Density Plot

ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Density Plot of Miles Per Gallon")

In this example:

  • geom_density(fill = "blue", alpha = 0.5): Adds a density plot with a semi-transparent blue fill.
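
Density plots are often most useful for comparing groups. As a hedged sketch, the version below maps fill to factor(cyl) so that one curve is drawn per cylinder count; the grouping variable, the object name p_density, and the legend title are illustrative choices. The adjust argument scales the default smoothing bandwidth (1 leaves it unchanged; larger values give smoother curves).

```r
library(ggplot2)

# One semi-transparent density curve per cylinder group.
p_density <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.5, adjust = 1) +
  labs(title = "Density of MPG by Cylinder Count", fill = "Cylinders")

print(p_density)
```

Because the curves overlap, a low alpha value keeps all groups visible.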

Boxplots

Boxplots give a visual summary of the distribution, including the median, quartiles, and possible outliers. They are especially useful for comparing distributions across multiple groups.

Example: Boxplot

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "Boxplot of MPG by Cylinder Count")

In this example:

  • factor(cyl): Treats the cyl variable (number of cylinders) as a categorical variable for comparison.
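
To see which numbers a boxplot actually draws, the computed statistics can be pulled out of the plot. This is a minimal sketch under the same mtcars setup; the names p_box, box_stats, and group_medians are illustrative. It compares the median line of each box with medians computed directly from the data.

```r
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Values ggplot2 computed for each box: whisker ends, hinges, and median.
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]

# Medians computed directly from the data, one per cylinder group.
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)

print(box_stats)
print(group_medians)
```

The middle column of box_stats should match group_medians, one value per cylinder group.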

2. Adding Statistical Summaries

In addition to raw data visualization, ggplot2 allows you to add statistical summaries such as smooth lines and confidence intervals. These layers help you understand trends and relationships in your data.

Smooth Lines

Smooth lines are often used to display trends or relationships in the data. The geom_smooth() function fits a statistical model to the data and adds a line representing that model. By default the method is chosen automatically: loess (local polynomial regression fitting) for fewer than 1,000 observations and a generalized additive model (gam) for larger data sets. You can also request a method explicitly, such as a linear model (lm).

Example: Adding a Smooth Line

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Scatter Plot with Linear Regression Line")

In this example:

  • geom_smooth(method = "lm"): Adds a linear regression line.
  • se = TRUE: Includes the shaded confidence interval around the regression line. This is the default; set se = FALSE to hide it.
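
The line drawn by geom_smooth(method = "lm") corresponds to an ordinary lm() fit of y on x. The sketch below, with illustrative object names (p_lm, fit, smooth_line, direct_pred), fits the same model directly to read off its coefficients and checks that its predictions agree with the line stored in the plot. The formula = y ~ x argument is written out only to make the default explicit.

```r
library(ggplot2)

p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red")

# Fit the same model directly to read off its intercept and slope.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# The smooth layer (layer 2) stores the fitted line as x/y pairs.
smooth_line <- layer_data(p_lm, i = 2)

# Predictions from the direct fit at the same x positions.
direct_pred <- predict(fit, newdata = data.frame(wt = smooth_line$x))
```

Fitting the model separately is a common way to report the slope and intercept, since the plot itself only shows the line.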

Confidence Intervals

A confidence interval gives a range of plausible values for an estimated quantity and helps you assess the uncertainty of the estimate. In ggplot2, confidence intervals are usually shown alongside a fitted model such as a smooth line.

In the above example, the confidence interval is represented by the shaded region around the regression line. By default, geom_smooth() draws a 95% confidence interval.
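
The width of the shaded band depends on the confidence level, which geom_smooth() exposes through its level argument (0.95 by default). As a small sketch with illustrative names (p95, p99, width95, width99), the code below builds the same regression at two levels and compares the band widths.

```r
library(ggplot2)

# Same linear fit, drawn with a 95% and a 99% confidence band.
p95 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

p99 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99)

# ymin and ymax hold the lower and upper edges of the band at each x position.
width95 <- with(layer_data(p95), ymax - ymin)
width99 <- with(layer_data(p99), ymax - ymin)

print(summary(width95))
print(summary(width99))
```

A higher confidence level produces a wider band at every point along the line.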


3. Working with Specialized Geometries

Beyond basic plots, ggplot2 also supports specialized geoms like heatmaps and 2D density plots, which are useful for visualizing complex relationships in multivariate data.

Heatmaps

Heatmaps display data values using a color scale, with colors representing different levels of intensity. They are particularly useful for visualizing matrices or gridded data, where each cell is identified by a pair of x and y values (often categories or grid positions) and its color encodes a third, numeric value.

Example: Heatmap

# Create a sample data frame
set.seed(123)
data <- data.frame(
  x = rep(1:10, each = 10),
  y = rep(1:10, times = 10),
  value = rnorm(100)
)

ggplot(data, aes(x = x, y = y, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "yellow", midpoint = 0) +
  labs(title = "Heatmap of Random Values")

In this example:

  • geom_tile(): Draws one tile for each row of the data frame, that is, for each combination of x and y.
  • scale_fill_gradient2(): Customizes the color scale, with blue representing low values, yellow as the midpoint, and red for high values.
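
A common practical use of this pattern is a correlation heatmap. The sketch below is one possible approach, not the only one: the choice of columns, the names cor_mat, cor_long, var1, var2, and correlation, and the white midpoint are all illustrative. It reshapes a correlation matrix into the long format that geom_tile() expects, using base R only.

```r
library(ggplot2)

# Correlation matrix for a few numeric columns of mtcars.
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])

# Reshape the matrix to long format: one row per (row variable, column variable) pair.
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "correlation")

p_cor <- ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap of Selected mtcars Variables")

print(p_cor)
```

Fixing the limits at -1 and 1 keeps the color scale comparable across different sets of variables.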

2D Density Plots

2D density plots are used to visualize the distribution of two continuous variables simultaneously, creating a contour plot that shows areas of high density.

Example: 2D Density Plot

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_density_2d() +
  labs(title = "2D Density Plot of Car Weight and MPG")

In this example:

  • geom_density_2d(): Adds a 2D density plot, showing contours of density.
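
Contour lines alone can be hard to read, so it is often helpful to shade the density bands and overlay the raw points. The following is a hedged sketch that assumes ggplot2 3.3.0 or later, where geom_density_2d_filled() is available; the object name p_filled, the alpha value, and the legend title are illustrative.

```r
library(ggplot2)

# Filled density bands (layer 1) with the raw observations drawn on top (layer 2).
p_filled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_density_2d_filled(alpha = 0.6) +
  geom_point() +
  labs(title = "Filled 2D Density of Car Weight and MPG", fill = "Density level")

print(p_filled)
```

Drawing the points last keeps them visible above the shaded bands, which makes it easier to judge whether a dense region reflects many observations or only a few.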

Summary

In this topic, you’ve learned advanced techniques for visualizing your data using ggplot2, including:

  • Visualizing distributions with histograms, density plots, and boxplots.
  • Adding statistical summaries with smooth lines, regression models, and confidence intervals.
  • Working with specialized geoms like heatmaps and 2D density plots to visualize complex relationships.