Plotting different Confidence Intervals around Fitted Line using R and ggplot2
In this blog post, I’ll show that how we can plot different confidence intervals i.e., 90%
, 95%
or 99%
around the fitted line in R using ggplot2
library.
Load libraries
library(tidyverse)
library(stargazer)
library(patchwork)
Load data
For this blog post, I’ll use the built in cars data set. The data set contains only 2
variables i.e., speed and distance.
data("cars")
glimpse(cars)
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
Simple Linear Regression
At first, I’ll fit a simple linear regression model for distance against speed.
model <- lm(dist ~ speed, data = cars)
stargazer::stargazer(model, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## dist
## -----------------------------------------------
## speed 3.932***
## (0.416)
##
## Constant -17.579**
## (6.758)
##
## -----------------------------------------------
## Observations 50
## R2 0.651
## Adjusted R2 0.644
## Residual Std. Error 15.380 (df = 48)
## F Statistic 89.567*** (df = 1; 48)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
- We can see a significant positive association between distance and speed.
- As the speed increases, the distance also increases.
Now, we want to visualize this fitted model. An easy way is that we can use the geom_smooth
function of ggplot2
package to show a trend line of association between speed and distance. geom_smooth
also provides a built in function to plot the confidence interval around the fitted line.
Note: 95% confidence interval is the default
level. However, we can change the levels manually also.
Fitted line with 95% confidence level
p1 <- cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(colour = "black") +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "95% Confidence Interval") +
theme_bw() +
theme(plot.title = element_text(face = "bold",hjust = 0.5))
p1
Fitted line with 90% confidence level
p2 <- cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(colour = "black") +
geom_smooth(method = "lm", se = TRUE, level = 0.90) +
labs(title = "90% Confidence Interval") +
theme_bw() +
theme(plot.title = element_text(face = "bold",hjust = 0.5))
p2
Fitted line with 90% confidence level
p3 <- cars %>%
ggplot(aes(x = speed, y = dist)) +
geom_point(colour = "black") +
geom_smooth(method = "lm", se = TRUE, level = 0.99) +
labs(title = "99% Confidence Interval") +
theme_bw() +
theme(plot.title = element_text(face = "bold",hjust = 0.5))
p3
Stacking
p1 + p2 + p3 + plot_layout(ncol = 2)
This is an easiest way to plot confidence intervals in R
and ggplot2
even without fitting a regression model separately. However, sometime we want to compare different confidence interval bounds in a single scatter plot. In that case we need some creativity and extra lines of code. I know there can be different ways to achieve this but here is how I had accomplished this task.
- First, I’ll extract the confidence intervals of fitted line using the
predict
function.
Extract confidence intervals data
# CI
ci90 <- predict(model, cars, interval = "confidence", level = 0.90)
ci95 <- predict(model, cars, interval = "confidence", level = 0.95)
ci99 <- predict(model, cars, interval = "confidence", level = 0.99)
- After that, I’ll bind these data sets using the
bind_rows
function. Before binding, I’ll create an ID variable for eachdata.frame
to keep track of the data.
# bind data
ci_pred <- as.data.frame(ci90) %>%
mutate(ID = "90%") %>%
bind_rows(as.data.frame(ci95) %>%
mutate(ID = "95%"),
as.data.frame(ci99) %>%
mutate(ID = "99%")
)
head(ci_pred)
## fit lwr upr ID
## 1...1 -1.849460 -10.591699 6.892779 90%
## 2...2 -1.849460 -10.591699 6.892779 90%
## 3...3 9.947766 3.050136 16.845396 90%
## 4...4 9.947766 3.050136 16.845396 90%
## 5...5 13.880175 7.563250 20.197100 90%
## 6...6 17.812584 12.050540 23.574627 90%
- After that, I’ll join this data back with
cars
data usingbind_cols
function followed by some data wrangling.
# Joining with cars data
final_data <- cars %>%
bind_cols(
ci_pred %>%
remove_rownames() %>%
pivot_wider(names_from = ID, values_from = c(lwr, upr)) %>%
unnest()
) %>%
pivot_longer(names_to = "key", values_to = "value", cols = 4:9) %>%
separate(key, into = c("Type", "Level"), sep = "_") %>%
pivot_wider(names_from = Type, values_from = value) %>%
unnest()
head(final_data)
## # A tibble: 6 × 6
## speed dist fit Level lwr upr
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 4 2 -1.85 90% -10.6 6.89
## 2 4 2 -1.85 95% -12.3 8.63
## 3 4 2 -1.85 99% -15.8 12.1
## 4 4 10 -1.85 90% -10.6 6.89
## 5 4 10 -1.85 95% -12.3 8.63
## 6 4 10 -1.85 99% -15.8 12.1
- Now we have the cleaner data that contains fitted data along with confidence intervals at different levels.
- We can use this data to plot the confidence bounds around the fitted line in a single scatter chart.
Final Plot
final_data %>%
ggplot(aes(x = speed, y = dist, color = Level)) +
geom_point(colour = "midnightblue", alpha = 0.5) +
geom_smooth(method = "lm", se = F, colour = "brown") +
geom_line(aes(y = lwr), lty = 2) +
geom_line(aes(y = upr), lty = 2) +
labs(x = "Speed", y = "Distance", color = NULL) +
theme_test() +
theme(legend.position = c(0.08, 0.85))
That’s it!
Feel free to reach me out if you got any questions.