6  Scripts

Author

Jarad Niemi

R scripts are the primary file type for running analyses in R. By convention, R scripts are plain text files that use the .R extension.

In this chapter, we will run through a few R scripts as examples to demonstrate a data science pipeline. On aspect that will not be represented very well is the feedback loops that occur during data analysis.

6.1 Code

Here is a basic script that utilizes R to calculate the area and circumference of a circle.

Copy-and-paste this following code into a new script and then run the code.

radius <- 3

area          <- pi * radius^2 
circumference <- 2 * pi * radius

area
circumference

For readability in scripts, it is often helpful to align code. Notice the area and circumference lines are aligned at the arrow. This helps the reader focus on the important aspects of the code: the new objects being constructed and the formulas for their calculation.

6.2 Comments

The previous script was sufficiently short and used informative variable names and thus did not require the user to include any comments. Most code should include sufficient comments so that a later user (who might be yourself) can understand what the code is doing. The script below includes comments to help the user understand what is being done.

Copy-and-paste this following code into a new script and then run the code.

# Cone dimensions
r <- 3 # radius
h <- 8 # height

pi * r^2 * h / 3 # volume

surface_area <- pi * r * (r + sqrt(h^2 + r^2))
base_area    <- pi * r^2

# lateral area
pi * r * sqrt(h^2 + r^2) # direct calculation
surface_area - base_area # by calculating difference

A comment is indicated by the # character and anything after that character is not executed. Thus we can put the comment on its own line or we can put the comment at the end of a line.

The first comment indicates to the user that we are working with a cone and not a cylinder (which would also have a radius and height).

The next comments indicate that we will use r and h as shorthand for radius and height. This allows the calculation lines to not be so wide and use formulas we have memorized (at some point in the past).

6.3 Sections

It is helpful to break your code up into sections. You can do this simply by creating a simple comments, e.g.

# Section

But this can often get overlooked quite easily. A more dramatic section uses a # followed by a repeating symbol. Here are a couple of examples

################################################################################
# Section
################################################################################
#-------------------------------------------------------------------------------
# Section
#-------------------------------------------------------------------------------
#===============================================================================
# Section
#===============================================================================

6.4 Metadata

Metadata can be included at the beginning of a script as a set of comments. I encourage you to always include the author (yourself), date(s) the file was written, and the purpose of the script. So metadata might look something like the following

# Author:  Jarad Niemi
# Date:    2024-01-24
# Purpose: Demonstrate the use of metadata

6.5 Packages

R packages used within an R script should come at the beginning of the file but after the metadata. For example,

# Author:  Jarad Niemi
# Date:    2024-01-24
# Purpose: Demonstrate the loading of a package immediately after the metadata
#-------------------------------------------------------------------------------
library("tidyverse")

6.6 Whitespace

R ignores whitespace, but you can use this to your advantage to create easy to read code. For example, compare the following two scripts that do exactly the same thing.

library("tidyverse")
ChickWeight|>filter(Time==0)|>group_by(Diet)|>summarize(mean=mean(weight),sd=sd(weight),n=n())
library("tidyverse")
ChickWeight |>
  filter(Time == 0) |>
  group_by(Diet) |>
  summarize(
    mean = mean(weight),
    sd   = sd(weight),
    n    = n())

Which of these two scripts is easier to read? Can you guess what the scripts do (even though we haven’t talked about any of this functionality)?

6.7 Style

There are a number of aspects of style that relate to writing a good R script. I suggest you take a look at the tidyverse style guide for advice.

In RStudio, you can utilize

  • Code > Reindent Lines
  • Code > Reformat Code

to help reformat code. From this post, Reindent Lines, as the name suggests, will only Reindent Lines while Reformat Code is more aggressive.

An alternative is to use the styler package to help with a variety of tasks related to making your code easily readable.

6.8 Example

Here we have a script that performs much of the data science pipeline. Copy-and-paste this following code into a new script and then run the code. Alternatively, you can click on the code button here and save the file.

R Code Button

While running the code, feel free to run additional commands in the Console to understand what objects are being created.

# Author: Jarad Niemi
# Date:   2024-01-24
# Purpose: Demonstrate data analysis pipeline using the warpbreaks data set as
#          an example. It is also intended to show good coding practices. 
#-------------------------------------------------------------------------------
library("tidyverse")

# Import
#   `warpbreaks` already exists as a data set in R. If you want more details
#   about the data set, run 
# ?warpbreaks
#-------------------------------------------------------------------------------

# Transform data for better plotting
#-------------------------------------------------------------------------------
d <- warpbreaks |>
  mutate(
    tension = fct_recode(tension,
      Low    = "L",
      Medium = "M",
      High   = "H"
    ),
    
    # These are completely made up
    wool = fct_recode(wool,
      Lambswool     = "A",
      `Merino Wool` = "B"    # not a valid name, so we need backticks ` `
    )
    
  ) |>
  rename(
    Tension = tension,
    Wool    = wool,
    Breaks  = breaks
  )


# Exploratory statistics 
#-------------------------------------------------------------------------------
# Calculate confidence intervals for use in the graphic below
summaries <- d |> 
  group_by(Wool, Tension) |>
  summarize(
    n    = n(),
    mean = mean(Breaks),
    sd   = sd(Breaks),
    
    .groups = "drop"
  ) |>
  mutate(
    lcl = mean - qt(.975, df = n-1) * sd / sqrt(n),
    ucl = mean + qt(.975, df = n-1) * sd / sqrt(n),
  )

# For consistency across plots, use these values
jitterwidth <- 0.1
dodgewidth  <- 0.1


# Graphical statistics
#-------------------------------------------------------------------------------
ggplot(summaries, 
       
       # These values get used by geom_point, geom_pointrange, and geom_line below
       aes(x = Tension, 
           y = mean,
           color = Wool)) +
  
  # Add raw data points
  geom_point(
    data = d,
       aes(y = Breaks,
           shape = Wool),
    position = 
               # To avoid overlapping points
               position_jitterdodge(
                 jitter.width = jitterwidth, # adds random noise
                 dodge.width  = dodgewidth)) + # puts wool type slight left and right of tension values
  
  # Include means and confidence intervals
  geom_pointrange(
    aes(
      shape = Wool,
      ymax  = ucl,
      ymin  = lcl),
    position = position_dodge(width = dodgewidth)
  ) +
  
  # Add lines to connect the means
  geom_line(
    aes(
      linetype = Wool,
      group    = Wool),
    position = position_dodge(width = dodgewidth)
  ) +
  
  labs(
    y = "Number of Breaks per Loom",
    title = "Raw data with means: Lambswool v Merino Wool"
  ) +
  
  # Remove gray background with white grid and replace with the opposite
  theme_bw()


# Modeling
#-------------------------------------------------------------------------------
# Multiple regression model with number of breaks as the response and 
#   explanatory variables: wool, tension, wool-tension interaction.
m <- lm(Breaks ~ Wool + Tension + Wool:Tension,
        data = d)

# Test if interaction is significant
anova(m)

# Leave the interaction out, even though it was significant in this case
m <- lm(Breaks ~ Wool + Tension, # No interaction
        data = d)

# Use predictive means for plotting purposes
nd <- d |>
  select(Wool, Tension) |>
  unique()                 # only keep unique combinations of wool and tension

p <- predict(m, newdata = nd, interval = "confidence")

# Predicted values and confidence intervals are in the same order as the nd
predictions <- bind_cols(nd, p)

# Visualize predictions from this multiple regression model with no interaction
#   between wool and tension
ggplot(predictions, 
       aes(x     = Tension,
           y     = fit,
           color = Wool)) + 
  
  # Add raw data
  geom_point(
    data = d,
    aes(y = Breaks,
        shape = Wool),
    position = 
               # To avoid overlapping points
               position_jitterdodge(
                 jitter.width = jitterwidth, # adds random noise
                 dodge.width  = dodgewidth)) +
  
  # Add predicted points
  geom_pointrange(
    aes(
      y     = fit,
      shape = Wool,
      ymax  = upr,
      ymin  = lwr),
    position = position_dodge(width = dodgewidth)
  ) +

  # Add lines to connect the means
  geom_line(
    aes(
      linetype = Wool,
      group    = Wool),
    position = position_dodge(width = dodgewidth)
  ) +
  
  labs(
    y     = "Number of Breaks per Loom",
    title = "No-interaction Model: Lambswool v Merino Wool"
  ) +
  
  # Remove gray background with white grid and replace with the opposite
  theme_bw()