32  Normal analyses

Author

Jarad Niemi

Normal analyses are (typically) relevant for independent, continuous observations. These observations may need to be transformed to satisfy the normality assumption, most commonly by taking a logarithm of the data.
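For example, here is a minimal sketch of such a transformation, assuming a data frame d with a positive, right-skewed column y (both hypothetical):

# Log-transform a positive, right-skewed variable (hypothetical data)
d <- data.frame(y = rexp(100, rate = 1/10))  # simulated skewed, positive data
d$log_y <- log(d$y)                          # analyze log_y with normal methods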

library("tidyverse"); theme_set(theme_bw())
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.2          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("Sleuth3")

32.1 One group

The single group normal analysis begins with the assumptions \[ Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2) \] for \(i=1,\ldots,n\).

An alternative way to write this model is \[ Y_i = \mu + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0,\sigma^2) \]

which has the following assumptions

  • errors are independent,
  • errors are normally distributed,
  • errors have a constant variance,

and

  • common mean.
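To make these assumptions concrete, the following sketch simulates data from this model (the values of \(\mu\), \(\sigma\), and \(n\) are hypothetical):

# Simulate from the one-group normal model (hypothetical parameter values)
set.seed(20240101)
n     <- 24
mu    <- 20
sigma <- 5
y     <- rnorm(n, mean = mu, sd = sigma)  # Y_i ~ N(mu, sigma^2), independent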

32.1.1 Summary statistics

# One-sample data
creativity <- case0101 |>
  filter(Treatment == "Intrinsic")

A first step in any analysis is to look at summary statistics.

# One-sample summary
creativity |>
  summarize(
    n    = n(),
    mean = mean(Score),
    sd   = sd(Score)
  )
   n     mean       sd
1 24 19.88333 4.439513

Plotting the data will also help provide insight.

# One-group histogram
ggplot(creativity, aes(x = Score)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# One-group boxplot
ggplot(creativity, aes(x = Score)) + 
  geom_boxplot()

32.1.2 Test and confidence interval

In a one-sample t-test, we are typically interested in the hypothesis \[ H_0: \mu = \mu_0 \] for some value \(\mu_0\).

# One-sample t-test
t.test(Score ~ 1, data = creativity)

    One Sample t-test

data:  Score
t = 21.941, df = 23, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 18.00869 21.75798
sample estimates:
mean of x 
 19.88333 

To extract components from this object, check out its names().

# Save output as an R object
t <- t.test(Score ~ 1, data = creativity)

# Check out the names of the object
names(t)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  

Use $ to access these components.

# Group mean
t$estimate
mean of x 
 19.88333 
# Degrees of freedom
t$parameter # n - 1
df 
23 
# Standard error
t$stderr
[1] 0.9062118
# t-statistic
t$statistic
       t 
21.94116 
with(t, estimate / stderr)
mean of x 
 21.94116 
# p-value and null value
t$p.value
[1] 6.360338e-17
t$null.value
mean 
   0 
# Estimated standard deviation: stderr = sd / sqrt(n) and parameter = n - 1
with(t, stderr * sqrt(parameter + 1))
      df 
4.439513 
# Confidence interval for the mean
t$conf.int
[1] 18.00869 21.75798
attr(,"conf.level")
[1] 0.95
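These components can be combined to reconstruct the interval by hand: the estimate plus or minus a t quantile times the standard error. This sketch reproduces the 95% interval above.

# Reconstruct the 95% confidence interval from its components
with(t, estimate + qt(c(0.025, 0.975), df = parameter) * stderr)
[1] 18.00869 21.75798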

If you are interested in an alternative value for \(\mu_0\), a different alternative hypothesis, or a different confidence level, there are arguments that can be adjusted.

# t.test arguments
t <- t.test(creativity$Score, 
            mu          = 20, 
            alternative = "greater", 
            conf.level  = 0.9)

t # print

    One Sample t-test

data:  creativity$Score
t = -0.12874, df = 23, p-value = 0.5507
alternative hypothesis: true mean is greater than 20
90 percent confidence interval:
 18.68762      Inf
sample estimates:
mean of x 
 19.88333 

32.2 Two groups

When you have two groups and they are independent, a common assumption is

\[ Y_{gi} \stackrel{ind}{\sim} N(\mu_g, \sigma_g^2) \] for \(i = 1, \ldots, n_g\) and \(g=1,2\). Alternatively

\[ Y_{gi} = \mu_g + \epsilon_{gi}, \qquad \epsilon_{gi} \stackrel{ind}{\sim} N(0,\sigma_g^2) \] which assumes

  • errors are independent,
  • errors are normally distributed,
  • errors have constant variance within a group,

and

  • each group has its own constant mean.
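A parallel simulation sketch for the two-group model (all parameter values hypothetical):

# Simulate from the two-group normal model (hypothetical parameter values)
set.seed(20240102)
y1 <- rnorm(23, mean = 16, sd = 5)  # group 1: N(mu_1, sigma_1^2)
y2 <- rnorm(24, mean = 20, sd = 4)  # group 2: N(mu_2, sigma_2^2)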

32.2.1 Summary statistics

# Two-group summary
case0101 |>
  group_by(Treatment) |>
  summarize(
    n    = n(),
    mean = mean(Score),
    sd   = sd(Score),
    .groups = "drop"
  )
# A tibble: 2 × 4
  Treatment     n  mean    sd
  <fct>     <int> <dbl> <dbl>
1 Extrinsic    23  15.7  5.25
2 Intrinsic    24  19.9  4.44
# Two-group histograms
ggplot(case0101, 
       aes(x = Score)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ Treatment, 
              ncol = 1)

# Two-group boxplots
ggplot(case0101, 
       aes(x = Treatment,
           y = Score)) +
  geom_boxplot() 

As you look at these summary statistics and plots, you can start to evaluate the model assumptions. Here, the two groups appear to have approximately equal variances, while the means look different.

32.2.2 Test and confidence intervals

Typically we are interested in the hypothesis \[ H_0: \mu_1 = \mu_2 \qquad \mbox{or, equivalently,} \qquad H_0: \mu_1-\mu_2 = 0 \] and a confidence interval for the difference \(\mu_1-\mu_2\).

32.2.2.1 Unequal variances

To run the test, we again use the t.test() function. By default, this function assumes the variances are unequal.

t.test(Score ~ Treatment, 
       data = case0101)

    Welch Two Sample t-test

data:  Score by Treatment
t = -2.9153, df = 43.108, p-value = 0.005618
alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
95 percent confidence interval:
 -7.010803 -1.277603
sample estimates:
mean in group Extrinsic mean in group Intrinsic 
               15.73913                19.88333 

Given our earlier visualization of the data, it is unsurprising that the p-value is small, indicating evidence against the model that assumes the two group means are equal.

32.2.2.2 Equal variances

Often, we will make the assumption that the two groups have equal variances.

# Assume variances are equal
t <- t.test(Score ~ Treatment, 
            data      = case0101, 
            var.equal = TRUE)

t

    Two Sample t-test

data:  Score by Treatment
t = -2.9259, df = 45, p-value = 0.005366
alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
95 percent confidence interval:
 -6.996973 -1.291432
sample estimates:
mean in group Extrinsic mean in group Intrinsic 
               15.73913                19.88333 

When we assume the two variances are equal, our results are very similar.
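To see what the equal-variance assumption buys us computationally, the following sketch computes the pooled standard deviation and the t-statistic by hand from the group summaries; it reproduces the t-statistic from t.test() above.

# Hand computation of the pooled two-sample t-statistic
s <- case0101 |>
  group_by(Treatment) |>
  summarize(n = n(), mean = mean(Score), var = var(Score))

sp <- with(s, sqrt(sum((n - 1) * var) / sum(n - 1)))  # pooled standard deviation
se <- sp * sqrt(sum(1 / s$n))                         # standard error of the difference
(s$mean[1] - s$mean[2]) / se                          # t-statistic on sum(s$n) - 2 df
[1] -2.925876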

As we did with the one-group data, we can extract components from the object created by the t.test() function.

# Group means
t$estimate
mean in group Extrinsic mean in group Intrinsic 
               15.73913                19.88333 
# Degrees of freedom
t$parameter # n - 2
df 
45 
# Standard error
t$stderr
[1] 1.416397
# t-statistic
t$statistic
        t 
-2.925876 
with(t, -diff(estimate) / stderr) 
mean in group Intrinsic 
              -2.925876 
# p-value and null value
t$p.value
[1] 0.005366476
t$null.value
difference in means between group Extrinsic and group Intrinsic 
                                                              0 
# Pooled standard deviation
# (the one-sample shortcut stderr * sqrt(parameter + 1) does not apply here;
#  instead, divide the standard error of the difference by sqrt(1/n1 + 1/n2))
round(with(t, stderr / sqrt(1/23 + 1/24)), 3)
[1] 4.854
# Confidence interval for the difference in means
t$conf.int # Group 1 (Extrinsic) - Group 2 (Intrinsic)
[1] -6.996973 -1.291432
attr(,"conf.level")
[1] 0.95
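As in the one-group case, the interval can be reconstructed from its components.

# Reconstruct the 95% confidence interval for the difference in means
with(t, -diff(estimate) + qt(c(0.025, 0.975), df = parameter) * stderr)
[1] -6.996973 -1.291432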

32.2.3 Paired data

A paired analysis can be used when the data naturally group into pairs of observations. Within each pair, one observation has one level of the explanatory variable and the other observation has the other level.

In the following example, the data are naturally paired because each pair of observations comes from a set of identical twins. One twin has schizophrenia while the other does not. The data compare the volumes of the left hippocampus in the two twins to understand its relationship to schizophrenia.

32.2.3.1 Summary statistics

We could compute summary statistics separately for the affected and unaffected twins.

# Column summaries for unaffected and affected twins
summary(case0202)
   Unaffected       Affected   
 Min.   :1.250   Min.   :1.02  
 1st Qu.:1.600   1st Qu.:1.31  
 Median :1.770   Median :1.59  
 Mean   :1.759   Mean   :1.56  
 3rd Qu.:1.935   3rd Qu.:1.78  
 Max.   :2.080   Max.   :2.02  

The power in these data arises from the pairing. In order to make use of this pairing, we should compute the difference (or ratio) within each pair.

# Summary of within-pair differences
case0202 |>
  mutate(diff = Affected - Unaffected) |>
  summarize(
    n    = n(),
    mean = mean(diff),
    sd   = sd(diff)
  )
   n       mean        sd
1 15 -0.1986667 0.2382935

For plotting, it is useful to wrangle the data into a longer format.

# Prepare data for plotting
schizophrenia <- case0202 |>
  mutate(pair = LETTERS[1:n()]) |>    # Create a variable for the pair
  pivot_longer(-pair, 
               names_to = "diagnosis", 
               values_to = "volume")

In this plot, each pair is a line with the affected twin on the left and the unaffected twin on the right.

# Paired line plot
ggplot(schizophrenia, 
       aes(x     = diagnosis, 
           y     = volume, 
           group = pair)) + 
  geom_line()

It is important to connect the two points within each pair because we are looking for a pattern across the pairs.

32.2.3.2 Paired t-test

When the data are paired, we can perform a paired t-test.

# Paired t-test
t.test(case0202$Unaffected, 
       case0202$Affected, 
       paired = TRUE)

    Paired t-test

data:  case0202$Unaffected and case0202$Affected
t = 3.2289, df = 14, p-value = 0.006062
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.0667041 0.3306292
sample estimates:
mean difference 
      0.1986667 

This analysis is equivalent to taking the difference in each pair and performing a one-sample t-test.

# One-sample t-test on the difference
t.test(case0202$Unaffected - case0202$Affected)

    One Sample t-test

data:  case0202$Unaffected - case0202$Affected
t = 3.2289, df = 14, p-value = 0.006062
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.0667041 0.3306292
sample estimates:
mean of x 
0.1986667 

32.3 More than two groups

If we have a single categorical variable with \(G > 2\) groups, the model generalizes to \[ Y_{gi} \stackrel{ind}{\sim} N(\mu_g, \sigma^2) \] for \(i = 1, \ldots, n_g\) and \(g = 1, \ldots, G\), where all groups are assumed to share a common variance.

32.3.1 Summary statistics

# Summary for more than two groups
case0501 |>
  group_by(Diet) |>
  summarize(
    n    = n(),
    mean = mean(Lifetime),
    sd   = sd(Lifetime),
    .groups = "drop"
  )
# A tibble: 6 × 4
  Diet      n  mean    sd
  <fct> <int> <dbl> <dbl>
1 N/N85    57  32.7  5.13
2 N/R40    60  45.1  6.70
3 N/R50    71  42.3  7.77
4 NP       49  27.4  6.13
5 R/R50    56  42.9  6.68
6 lopro    56  39.7  6.99
# Jitterplot
ggplot(case0501, 
       aes(x = Diet, 
           y = Lifetime)) +
  geom_jitter(width = 0.1)

# Boxplots
ggplot(case0501, 
       aes(x = Diet, 
           y = Lifetime)) +
  geom_boxplot()

We can see from these boxplots that the data appear a bit left-skewed, but that the variability in each group is pretty similar.

32.3.2 ANOVA

An ANOVA F-test considers the null hypothesis \[ H_0: \mu_g = \mu \quad \forall\, g \] against the alternative that at least one group mean differs.

# ANOVA
m <- lm(Lifetime ~ Diet, 
        data = case0501)
anova(m)
Analysis of Variance Table

Response: Lifetime
           Df Sum Sq Mean Sq F value    Pr(>F)    
Diet        5  12734  2546.8  57.104 < 2.2e-16 ***
Residuals 343  15297    44.6                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
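Equivalently, the F-test compares the single-mean (null) model to the separate-means model. The following nested-model comparison is another way to obtain the same F-statistic of 57.104.

# Compare the common-mean (null) model to the separate-means model
m0 <- lm(Lifetime ~ 1, data = case0501)
anova(m0, m)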

32.4 Summary

We introduced a number of analyses for normal (continuous) data depending on how many groups we have and the structure of those groups. In the future, we will introduce linear regression models that can be used for continuous data. We will see that these models allow us to perform all of the analyses above and analyze data with a more complicated structure.