34  Simple Linear Regression

Author

Jarad Niemi

library("tidyverse"); theme_set(theme_bw())
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.2          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("Sleuth3")
library("ggResidpanel")

Generally, the simple linear regression model has a continuous response variable and a continuous explanatory variable.

34.1 Model

The structure of a simple linear regression model is \[ Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1X_{i}, \sigma^2) \] for \(i=1,\ldots,n\) or, equivalently, \[ Y_i = \beta_0 + \beta_1X_{i} + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0, \sigma^2). \]

where, for observation \(i\),

  • \(Y_i\) is the value of the response variable,
  • \(X_{i}\) is the value of the explanatory variable, and
  • \(\epsilon_i\) is the error.

The model parameters are

  • the intercept \(\beta_0\),
  • the slope \(\beta_1\), and
  • the error variance \(\sigma^2\).
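
To make this notation concrete, here is a minimal simulation sketch; the parameter values below are arbitrary illustrative choices, not estimates from any data.

# Simulate n observations from a simple linear regression model
# (illustrative parameter values, chosen arbitrarily)
set.seed(20240401)
n     <- 50
beta0 <- -17   # intercept
beta1 <- 4     # slope
sigma <- 15    # error standard deviation

x <- runif(n, min = 4, max = 25)            # explanatory variable
y <- beta0 + beta1 * x + rnorm(n, 0, sigma) # response = mean + error

sim <- data.frame(x = x, y = y)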

34.2 Assumptions

The assumptions in every regression model are

  • errors are independent,
  • errors are normally distributed, and
  • errors have constant variance.

For simple linear regression,

  • the expected response, \(E[Y_i] = \beta_0 + \beta_1X_i\), is a linear function of the explanatory variable.

34.3 Visualize

Let’s start with an example where the simple linear regression model is straightforward. When constructing a plot, the explanatory variable goes on the x-axis and the response variable goes on the y-axis.

# Simple linear regression model
g <- ggplot(cars, 
       aes(x = speed,
           y = dist)) +
  geom_point() 

g

To fit a linear regression model, use lm().

# Regression model fit
m <- lm(dist ~ speed, data = cars)

It is typically helpful to visualize the data with the line added.

# Manually add regression line
g + 
  geom_abline(intercept = coef(m)[1],
              slope     = coef(m)[2],
              color     = "blue")

Or we can use geom_smooth(), which does not require that we fit the model first.

# Add line using geom_smooth()
g + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# Options
g + geom_smooth(
  method = "lm",
  formula = y ~ x, # remove 'message'
  se = FALSE       # remove uncertainty
)

34.4 Inference

We now turn to understanding the output of the regression analysis.

# Regression model fit
m

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
# Summary
s <- summary(m)
s

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Recall that the coefficients can be calculated from the formulas \[\widehat{\beta}_1 = \frac{SXY}{SXX} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i-\overline{y})}{\sum_{i=1}^n (x_i-\overline{x})^2}\] and \[\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}\] where \(\overline{x}\) and \(\overline{y}\) are the sample means for the explanatory and response variables, respectively.

# Manual calculation
x    <- cars$speed
y    <- cars$dist

xbar <- mean(x)
ybar <- mean(y)

sxx  <- sum((x-xbar)^2)
sxy  <- sum((x-xbar)*(y-ybar))
syy  <- sum((y-ybar)^2) # used later

beta1 <- sxy/sxx
beta0 <- ybar - beta1 * xbar

beta0
[1] -17.57909
beta1
[1] 3.932409
# Intercept (beta0) and slope (beta1)
m$coefficients
(Intercept)       speed 
 -17.579095    3.932409 
coef(m)
(Intercept)       speed 
 -17.579095    3.932409 

Residuals are the estimated errors and can be calculated from the formula \[r_i = \widehat{e}_i = y_i - \left(\widehat{\beta}_0 + \widehat{\beta}_1 x_i\right).\] These residuals can be used to calculate the estimated error variance using the formula \[\widehat{\sigma}^2 = \frac{1}{n-2} RSS = \frac{1}{n-2} \sum_{i=1}^n r_i^2\] where \(n-2\) is the degrees of freedom.

# Calculate residuals
r <- y - (beta0 + beta1*x)
head(r)
[1]  3.849460 11.849460 -5.947766 12.052234  2.119825 -7.812584
head(m$residuals)
        1         2         3         4         5         6 
 3.849460 11.849460 -5.947766 12.052234  2.119825 -7.812584 
# Calculate estimated error variance
n   <- length(y)
rss <- sum(r^2)
df  <- n - 2

df
[1] 48
m$df.residual
[1] 48
sigma2 <- rss / df
sigma2
[1] 236.5317
s$sigma^2
[1] 236.5317
# Estimated error sd
sqrt(sigma2)
[1] 15.37959
s$sigma
[1] 15.37959

The coefficient of determination, or \(R^2\), is the proportion of variability explained by the model. It can be calculated using the formula

\[R^2 = \frac{SYY - RSS}{SYY} = 1 - \frac{RSS}{SYY} = 1 - \frac{\sum_{i=1}^n r_i^2}{\sum_{i=1}^n (y_i-\overline{y})^2}.\]

# Coefficient of determination
1 - rss/syy
[1] 0.6510794
s$r.squared
[1] 0.6510794

Confidence intervals and p-values for the intercept and slope require the associated standard errors. The standard error for the slope is \[SE(\widehat{\beta}_1) = \sqrt{\frac{\widehat{\sigma}^2}{SXX}}\] and the standard error for the intercept is \[SE(\widehat{\beta}_0) = SE(\widehat{\beta}_1)\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}.\]

# Standard errors
se_beta1 <- sqrt(s$sigma^2 / sxx)
se_beta0 <- se_beta1 * sqrt(mean(x^2))

se_beta0
[1] 6.75844
se_beta1
[1] 0.4155128
s$coefficients[,1:2]
              Estimate Std. Error
(Intercept) -17.579095  6.7584402
speed         3.932409  0.4155128

P-values (for the null hypothesis \(\beta = 0\) against a two-sided alternative) can be calculated using \[ \mbox{p-value} = 2\times P\left(T_{n-2} > |t|\right) = 2\times P\left(T_{n-2} > \left|\frac{\widehat{\beta}}{SE(\widehat\beta)}\right|\right) \] where \(T_{n-2}\) is a \(t\)-distributed random variable with \(n-2\) degrees of freedom and \(t\) is the observed test statistic.

# t statistics
t_beta0 <- beta0 / se_beta0
t_beta1 <- beta1 / se_beta1

# pvalues
2 * (1 - pt(abs(t_beta0), df = df))
[1] 0.01231882
2 * (1 - pt(abs(t_beta1), df = df))
[1] 1.489919e-12
s$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
speed         3.932409  0.4155128  9.463990 1.489836e-12

Confidence intervals can be constructed using \[ \widehat{\beta} \pm t_{n-2,\alpha/2} SE(\widehat{\beta}). \]

# Confidence intervals
alpha  <- 0.05
t_crit <- qt(1-alpha/2, df = df)

beta0 + c(-1,1) * t_crit * se_beta0
[1] -31.16785  -3.99034
beta1 + c(-1,1) * t_crit * se_beta1
[1] 3.096964 4.767853
# Confidence intervals for intercept and slope
confint(m)
                 2.5 %    97.5 %
(Intercept) -31.167850 -3.990340
speed         3.096964  4.767853

34.5 Diagnostics

Diagnostic plots are used to help evaluate model assumptions. As a general rule, we are looking for patterns in these diagnostic plots. Sometimes, those patterns are explainable by the structure of the data. Other times, these patterns are due to violations of the model assumptions.

34.5.1 Residuals vs fitted values

# Residuals vs fitted values
ggplot( # no data argument
  mapping = aes(x = m$fitted.values, 
                y = m$residuals)) +
  geom_point()

# Alternative residuals v fitted
ggplot( # no data argument
  mapping = aes(x = m$fitted.values, 
                y = sqrt(abs(m$residuals)))) +
  geom_point() 

34.5.2 Residuals vs index

Since data are often recorded in temporal order, the row number for an observation can be used as a proxy for time. Thus, we can plot residuals vs index (row number) to (possibly) assess violations of independence due to time. Of course, if the rows were not recorded in temporal order, this plot is less informative.

# Residuals versus index
ggplot(
  mapping = aes(x = 1:length(m$residuals),
                y = m$residuals)) +
  geom_point()

34.5.3 Normal qq-plots

While we could use histograms or density plots to try to evaluate normality, a much better plot for this purpose is a normal quantile-quantile (qq) plot. In this plot, data should generally fall along the line. Severe deviations from this line are indications of non-normality.

# Normal qq plot
ggplot( # no data argument
  mapping = aes(sample = m$residuals)) +
  geom_qq() +
  geom_qq_line()

34.5.4 Residuals v explanatory

A good plot for evaluating the linearity assumption is residuals versus the explanatory variable.

# Residual versus explanatory
ggplot( # no data argument
  mapping = aes(x = cars$speed, 
                y = m$residuals)) +
  geom_point()

34.5.5 Default

A collection of plots can be produced quickly by plotting the fitted regression object.

# Default regression diagnostic plots
opar <- par(mfrow=c(2,2))
plot(m, ask = FALSE)

par(opar)

# Additional "hidden" plots
opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask = FALSE)

par(opar)

34.5.6 ggResidpanel

The ggResidpanel package provides residual diagnostic plots for regression models using ggplot2.

# Default ggResidpanel plots
resid_panel(m)

# All ggResidpanel plots
resid_panel(m, plots = "all")

# ggResidpanel R plots
resid_panel(m, plots = "R")

# ggResidpanel SAS plots
resid_panel(m, plots = "SAS")

# My preferred plots
my_plots <- c("resid",
              "index",
              "qq",
              "cookd")

# w/ options
resid_panel(m, 
            plots    = my_plots,
            smoother = TRUE, 
            qqbands  = TRUE) 
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

The ggResidpanel package also has a convenient function to plot the residuals versus explanatory variables.

# Residuals v explanatory
resid_xpanel(m)

34.6 Examples

Here we introduce three examples. The first example illustrates the origin of the term “regression”. The second example illustrates a situation where the diagnostic plots suggest the use of logarithms. The third example illustrates the use of a binary categorical variable within the context of a simple linear regression model.

34.6.1 Regression to the mean

This data set contains the heights of parents and their children. It is the origin of the term “regression”: children are, on average, closer to the mean height (for their sex) than their parents. This phenomenon is known as “regression to the mean”.

# Wrangle data
d <- ex0726 |>
         filter(Gender == "female")

# The data are clearly not independent
# since there are multiple observations
# from the same family
d |>
  group_by(Family) |>
  summarize(
    n    = n()) |>
  filter(n > 1)
# A tibble: 116 × 2
   Family     n
    <int> <int>
 1      1     3
 2      2     2
 3      4     3
 4      5     3
 5      7     2
 6      8     3
 7     11     6
 8     16     4
 9     17     2
10     18     3
# ℹ 106 more rows
# Plot data with regression line
ggplot(d,
       aes(x = Mother,
           y = Height)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# Fit model
m <- lm(Height ~ Mother, 
        data = d)

# Assess diagnostic plots
resid_panel(m, 
            plots    = my_plots,
            qqbands  = TRUE,
            smoother = TRUE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
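
As a quick numeric check (output not shown here), we can look at the estimated slope: a value below 1, as is typical for parent-child height data, means a mother who is one inch taller than average is predicted to have a daughter who is less than one inch taller than the average daughter, which is exactly the regression-to-the-mean phenomenon.

# Slope estimate; a value below 1 is the numerical signature
# of regression to the mean
coef(m)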

34.6.2 Logarithmic transformation

# Exploratory
summary(case0801)
      Area            Species      
 Min.   :    1.0   Min.   :  7.00  
 1st Qu.:   18.5   1st Qu.: 13.50  
 Median : 3435.0   Median : 45.00  
 Mean   :11615.1   Mean   : 48.57  
 3rd Qu.:16807.5   3rd Qu.: 76.50  
 Max.   :44218.0   Max.   :108.00  
# Plot data
g <- ggplot(case0801,
       aes(x = Area,
           y = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
g
`geom_smooth()` using formula = 'y ~ x'

# Fit model
m <- lm(Species ~ Area, 
        data = case0801)

# Diagnostic plots
resid_panel(m, plots = my_plots)

resid_xpanel(m)

Try taking logarithms.

# Consider logarithms of both variables
g + scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'

g + scale_x_log10() + scale_y_log10()
`geom_smooth()` using formula = 'y ~ x'

# Fit model with logarithms
m <- lm(log(Species) ~ log(Area), 
        data = case0801)

# Diagnostic plots
resid_panel(m, plots = my_plots, qqbands = TRUE)

resid_xpanel(m)
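
As an additional sketch (not part of the original output), the fitted log-log relationship can be displayed back on the original scale by exponentiating the fitted values; because the response is logged, the back-transformed curve estimates the median number of species.

# Display the log-log fit on the original scale
# (uses the model object 'm' fit just above)
case0801 |>
  mutate(fit = exp(predict(m))) |> # back-transform fitted log(Species)
  ggplot(aes(x = Area)) +
  geom_point(aes(y = Species)) +
  geom_line(aes(y = fit), color = "blue")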

34.6.3 Categorical explanatory variable

We can also fit a simple linear regression model when we have a binary categorical explanatory variable. Since the mathematics requires that \(X_i\) be numeric, the standard approach to a binary categorical variable is to encode the levels as 0 and 1. To do this, we create a dummy (or indicator) variable.

Generally, \[X_i = \left\{ \begin{array}{ll} 0 & \mbox{obs $i$ is reference level} \\ 1 & \mbox{obs $i$ is NOT the reference level} \end{array} \right.\]

Thus, we have to choose one of the two levels as the reference level. The reference level gets encoded as 0 and the other level as 1.

# Categorical explanatory variable
summary(case0102)
     Salary         Sex    
 Min.   :3900   Female:61  
 1st Qu.:4980   Male  :32  
 Median :5400              
 Mean   :5420              
 3rd Qu.:6000              
 Max.   :8100              
lm(Salary ~ Sex, 
   data = case0102)

Call:
lm(formula = Salary ~ Sex, data = case0102)

Coefficients:
(Intercept)      SexMale  
       5139          818  

R indicates a categorical variable level by including the variable name and the level, e.g. SexMale. We can verify that this level has been encoded as a 1 and the reference level Female as a 0.

# Create dummy variable
d <- case0102 |>
  mutate(X = as.numeric(Sex == "Male"))

table(d$Sex, d$X)
        
          0  1
  Female 61  0
  Male    0 32
lm(Salary ~ X, 
   data = d)

Call:
lm(formula = Salary ~ X, data = d)

Coefficients:
(Intercept)            X  
       5139          818  

If you want to change the reference level, you can reorder the factor levels, for example with relevel().

lm(Salary ~ Sex, 
   data = case0102 |>
     mutate(Sex = relevel(Sex, ref = "Male")))

Call:
lm(formula = Salary ~ Sex, data = mutate(case0102, Sex = relevel(Sex, 
    ref = "Male")))

Coefficients:
(Intercept)    SexFemale  
       5957         -818  

When the explanatory variable is categorical, the “slope” is the difference between the means of the two groups.
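
We can check this directly by computing the group means: from the fits above, the Male mean (about 5957) minus the Female mean (about 5139) is 818, matching the SexMale coefficient.

# Group means; their difference should match the SexMale coefficient
case0102 |>
  group_by(Sex) |>
  summarize(mean_salary = mean(Salary))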

# Interpretation
m <- lm(Salary ~ Sex,
        data = case0102)

# 95% CI for mean difference in Salary (Male - Female)
confint(m)[2,] |> # Do these data prove there is a causal relationship
  round()         # between Salary and Sex? 
 2.5 % 97.5 % 
   560   1076 

Always remember to investigate diagnostic plots.

# Diagnostic plots
resid_panel(m, 
            plots   = my_plots,
            qqbands = TRUE)

An alternative is to model the logarithm of Salary. If you do, \(e^{\beta_1}\) is the multiplicative change in median salary from the reference level to the other level.

# Interpretation
m <- lm(log(Salary) ~ Sex,
        data = case0102)

# Diagnostics
resid_panel(m, 
            plots   = my_plots,
            qqbands = TRUE)

# 95% CI for multiplicative change in median Salary (Male / Female)
confint(m)[2,] |> 
  exp() |>
  round(2)     
 2.5 % 97.5 % 
  1.10   1.21 

34.7 Summary

Simple linear regression models are used when the response variable is continuous and the explanatory variable is continuous or a binary categorical variable. When the explanatory variable is a binary categorical variable, then the results are the same as the two-sample t test when the variance in the two groups is assumed to be equal.
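
A brief sketch of this equivalence, using the salary data from the example above; depending on the order of the factor levels, the estimated difference and confidence interval may have the opposite sign from the regression output.

# Equal-variance two-sample t-test; the estimated difference, t statistic,
# p-value, and confidence interval match lm(Salary ~ Sex, data = case0102)
t.test(Salary ~ Sex, data = case0102, var.equal = TRUE)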