In this chapter, we will discuss the basics of wrangling an individual data set.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
20.1 Pipe
Pipe operators let you build a data pipeline in R code that is relatively easy to read because the steps run sequentially from top to bottom.
For example, can you guess what the following code does?
# Example data pipeline
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n    = n(),
    mean = mean(len),
    sd   = sd(len),
    .groups = "drop"
  ) |>
  mutate(
    mean = round(mean, 2),
    sd   = round(sd, 2)
  ) |>
  arrange(mean)
This code is made much easier to read (than equivalent un-piped R code) due to 1) functions being written so that the first argument is always a data.frame and 2) the pipe operator being used to send the results of the previous operation to the next operation.
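To make the comparison with un-piped code concrete, here is a minimal sketch that writes the same steps as nested function calls and checks that both forms give the same result. The object names `piped` and `nested` are introduced here for illustration only.

```r
library("dplyr")

# Piped version: read from top to bottom
piped <- ToothGrowth |>
  group_by(supp, dose) |>
  summarize(n = n(), mean = mean(len), sd = sd(len), .groups = "drop") |>
  mutate(mean = round(mean, 2), sd = round(sd, 2)) |>
  arrange(mean)

# Un-piped version: the same steps, read from the inside out
nested <- arrange(
  mutate(
    summarize(
      group_by(ToothGrowth, supp, dose),
      n = n(), mean = mean(len), sd = sd(len), .groups = "drop"
    ),
    mean = round(mean, 2), sd = round(sd, 2)
  ),
  mean
)

# Both forms produce the same table
identical(piped, nested)
```

In the nested form, the first operation (`group_by()`) is buried in the middle of the expression, which is what makes it harder to follow.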
The pipe operator is a relatively new feature of R. The base R version |> was introduced in May 2021 (R version 4.1.0), the magrittr version %>% was introduced in Dec 2013, and the very first version was introduced in a Stack Overflow post in Jan 2012.
The idea behind the pipe operator is simple: it passes the contents of its left-hand side as the first argument to the function on its right-hand side.
# Calculate mean
c(1, 2, 3, 4) |>
  mean()
[1] 2.5
As the pipe only replaces the first argument, we can also use additional arguments of the following function.
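As a small sketch of this, the vector below contains a missing value, so we pass `na.rm = TRUE` as an additional argument to `mean()`. The object names `x`, `piped`, and `unpiped` are chosen here for illustration.

```r
x <- c(1, 2, NA, 4)

# The left-hand side fills the first argument of mean();
# na.rm is supplied as usual inside the parentheses
piped <- x |>
  mean(na.rm = TRUE)

# Equivalent un-piped call
unpiped <- mean(x, na.rm = TRUE)

identical(piped, unpiped)
```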
We will see the pipe operator used extensively in data pipelines.
20.2 Rename
We can use the rename function to rename an existing variable in the data set. If we want to use names that are not valid R object names, we need to enclose the name using backticks.
# We didn't actually change the object, so the names in the
# ToothGrowth data set have not changed.
ToothGrowth |>
  names()
[1] "len" "supp" "dose"
# Save the renamed data set
d <- ToothGrowth |>
  rename(
    `Dose (mg/day)` = dose,
    Supplement      = supp,
    `Length (mm)`   = len
  )

# Saved data set has the new names
d |>
  names()
[1] "Length (mm)" "Supplement" "Dose (mg/day)"
20.3 Mutate
The mutate() function allows you to create new variables in a data.frame or overwrite existing ones. Typically this involves changing units, performing calculations with other variables, and recoding or reordering the levels of factors.
20.3.1 Numeric
# Mutate numeric
ToothGrowth |>
  mutate(
    `Length (cm)` = len / 10 # `len` is in the data.frame
  ) |>
  summary() # `len` is in mm while `Length (cm)` is in cm
len supp dose Length (cm)
Min. : 4.20 OJ:30 Min. :0.500 Min. :0.420
1st Qu.:13.07 VC:30 1st Qu.:0.500 1st Qu.:1.308
Median :19.25 Median :1.000 Median :1.925
Mean :18.81 Mean :1.167 Mean :1.881
3rd Qu.:25.27 3rd Qu.:2.000 3rd Qu.:2.527
Max. :33.90 Max. :2.000 Max. :3.390
# Overwrite existing variable
ToothGrowth |>
  mutate(
    len = len / 10 # `len` is in the data.frame
  ) |>
  summary() # `len` is in cm
len supp dose
Min. :0.420 OJ:30 Min. :0.500
1st Qu.:1.308 VC:30 1st Qu.:0.500
Median :1.925 Median :1.000
Mean :1.881 Mean :1.167
3rd Qu.:2.527 3rd Qu.:2.000
Max. :3.390 Max. :2.000
# Use object outside data.frame
mm_per_cm <- 10

ToothGrowth |>
  mutate(
    len = len / mm_per_cm
  ) |>
  summary()
len supp dose
Min. :0.420 OJ:30 Min. :0.500
1st Qu.:1.308 VC:30 1st Qu.:0.500
Median :1.925 Median :1.000
Mean :1.881 Mean :1.167
3rd Qu.:2.527 3rd Qu.:2.000
Max. :3.390 Max. :2.000
# Use updated variable
ToothGrowth |>
  mutate(
    # Scale len between 0 and 1
    len = len - min(len),
    len = len / max(len) # reused `len` here
  ) |>
  summary() # len now ranges between 0 and 1
len supp dose
Min. :0.0000 OJ:30 Min. :0.500
1st Qu.:0.2988 VC:30 1st Qu.:0.500
Median :0.5067 Median :1.000
Mean :0.4920 Mean :1.167
3rd Qu.:0.7096 3rd Qu.:2.000
Max. :1.0000 Max. :2.000
20.3.2 Categorical
We commonly need to convert between factor and character representations of categorical variables.
# Convert between character and factor
d <- ToothGrowth |>
  select(supp) |> # only keep `supp` column
  mutate(
    supp_ch = as.character(supp), # convert to character
    supp_fa = as.factor(supp_ch)  # convert to factor
  )

d |>
  summary() # summary() only informative for factor
# table() is always informative
d$supp_ch |>
  table()
OJ VC
30 30
d$supp_fa |>
  table()
OJ VC
30 30
At this point, the main distinction between character variables and factor variables is that you can change the order of the levels of a factor variable, while the values of a character variable are always sorted alphabetically.
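The difference in ordering can be seen in a minimal base R sketch. The object names `supp_ch` and `supp_fa` mirror the column names used in this section but are standalone vectors here.

```r
# Character version of the supplement variable
supp_ch <- as.character(ToothGrowth$supp)

# Factor version with VC deliberately placed before OJ
supp_fa <- factor(supp_ch, levels = c("VC", "OJ"))

# table() sorts character values alphabetically ...
names(table(supp_ch))

# ... but follows the level order for a factor
names(table(supp_fa))
```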
# Change order of factor variable
# Cannot change order of character variables
d |>
  mutate(
    supp_fa = factor(supp_fa,
      levels = c(
        "VC", # put VC first
        "OJ"  # then OJ
      )
    )
  ) |>
  summary()
# Recode character or factor levels
d |>
  mutate(
    supp_ch = fct_recode(supp_ch,
      `Ascorbic Acid` = "VC",
      `Orange Juice`  = "OJ"
    ),
    supp_fa = fct_recode(supp_fa,
      `Ascorbic Acid` = "VC",
      `Orange Juice`  = "OJ"
    )
  ) |>
  summary()
supp supp_ch supp_fa
OJ:30 Orange Juice :30 Orange Juice :30
VC:30 Ascorbic Acid:30 Ascorbic Acid:30
20.3.3 Both
Here we will show you how to utilize the mutate() function to perform a large number of calculations.
# Diamonds
d <- diamonds |>
  # Rather than precalculating depth
  # we will calculate depth in a script
  select(-depth) |>
  mutate(
    # Calculate depth
    depth = 2 * z / (x + y), # see ?diamonds for formula
    depth = 100 * depth,     # make depth a percent

    # Calculate $/weight
    price_per_carat = price / carat,

    # Reorder cut
    cut = factor(cut,
      levels = c(
        "Ideal",
        "Premium",
        "Very Good",
        "Good",
        "Fair"
      )
    )
  )

# View calculated variables
d |>
  select(price_per_carat, depth, cut) |>
  summary()
price_per_carat depth cut
Min. : 1051 Min. : 0.00 Ideal :21551
1st Qu.: 2478 1st Qu.: 61.04 Premium :13791
Median : 3495 Median : 61.84 Very Good:12082
Mean : 4008 Mean : 61.74 Good : 4906
3rd Qu.: 4950 3rd Qu.: 62.53 Fair : 1610
Max. :17829 Max. :619.28
NA's :7
# View observations with NA depth
d |>
  filter(is.na(depth)) |> # see filter() below
  select(depth, x, y, z)  # NaN stands for `Not a number`
# A tibble: 7 × 4
depth x y z
<dbl> <dbl> <dbl> <dbl>
1 NaN 0 0 0
2 NaN 0 0 0
3 NaN 0 0 0
4 NaN 0 0 0
5 NaN 0 0 0
6 NaN 0 0 0
7 NaN 0 0 0
20.4 Arrange
You can arrange (sort) a data set by a collection of variables. For factor variables, the order follows the factor levels, which are alphabetical by default.
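A minimal sketch of `arrange()` on the ToothGrowth data follows; the object names `sorted` and `resorted` are introduced here for illustration. The second pipeline assumes we want VC sorted before OJ, which requires changing the factor levels first.

```r
library("dplyr")

# Sort by supplement (factor level order: OJ, then VC),
# then by decreasing length within each supplement
sorted <- ToothGrowth |>
  arrange(supp, desc(len))

head(sorted, 3)

# Changing the level order changes how arrange() sorts the factor
resorted <- ToothGrowth |>
  mutate(supp = factor(supp, levels = c("VC", "OJ"))) |>
  arrange(supp, len)

head(resorted, 3)
```

Wrapping a variable in `desc()` sorts it in decreasing order; without it, `arrange()` sorts in increasing order.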
20.5 Pull
Data pipelines work best when functions return a data.frame, as the other functions in this chapter do. If you want to investigate a single variable, you can use the pull() function. This is equivalent to accessing a column with $, but it can be included in a dplyr pipeline.
# Pull a variable
ToothGrowth |>
  pull(len) |>
  summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.20 13.07 19.25 18.81 25.27 33.90
diamonds |>
  pull(cut) |>
  summary()
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
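As a quick check of the equivalence between `pull()` and `$`, the sketch below extracts the same column both ways. The object names `via_pull` and `via_dollar` are hypothetical.

```r
library("dplyr")

# Extract the column inside a pipeline
via_pull <- ToothGrowth |>
  pull(len)

# Extract the column with $
via_dollar <- ToothGrowth$len

# Both are the same plain numeric vector
identical(via_pull, via_dollar)
```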
20.6 Filter
The filter() function allows you to keep observations (rows) by some criteria.
# Filter by numeric variable
ToothGrowth |>
  filter(len < 6)
ToothGrowth |>
  filter(dose == 2) |>
  summary()
len supp dose
Min. :18.50 OJ:10 Min. :2
1st Qu.:23.52 VC:10 1st Qu.:2
Median :25.95 Median :2
Mean :26.10 Mean :2
3rd Qu.:27.82 3rd Qu.:2
Max. :33.90 Max. :2
ToothGrowth |>
  filter(dose != 2) |>
  summary()
len supp dose
Min. : 4.200 OJ:20 Min. :0.50
1st Qu.: 9.925 VC:20 1st Qu.:0.50
Median :15.200 Median :0.75
Mean :15.170 Mean :0.75
3rd Qu.:19.775 3rd Qu.:1.00
Max. :27.300 Max. :1.00
You can also filter by character (or factor) variables.
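A minimal sketch of filtering on a factor variable: the level name is written as a quoted string. The object name `oj` is introduced here for illustration.

```r
library("dplyr")

# Keep only the orange juice rows; the level name is quoted
oj <- ToothGrowth |>
  filter(supp == "OJ")

# filter() keeps unused factor levels, so VC remains a level with no rows
levels(oj$supp)

oj |>
  summary()
```

Because the unused level is retained, a summary of the filtered data still lists VC, with a count of 0.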
len supp dose
Min. : 8.20 OJ:30 Min. :0.500
1st Qu.:15.53 VC: 0 1st Qu.:0.500
Median :22.70 Median :1.000
Mean :20.66 Mean :1.167
3rd Qu.:25.73 3rd Qu.:2.000
Max. :30.90 Max. :2.000
ToothGrowth |>
  filter(supp != "VC") |>
  summary()
len supp dose
Min. : 8.20 OJ:30 Min. :0.500
1st Qu.:15.53 VC: 0 1st Qu.:0.500
Median :22.70 Median :1.000
Mean :20.66 Mean :1.167
3rd Qu.:25.73 3rd Qu.:2.000
Max. :30.90 Max. :2.000
Fair Good Very Good Premium Ideal
0 0 0 13791 21551
You can also filter using multiple variables.
# Filter on multiple variables
ToothGrowth |>
  filter(supp == "OJ", dose == 0.5) |>
  summary()
len supp dose
Min. : 8.20 OJ:10 Min. :0.5
1st Qu.: 9.70 VC: 0 1st Qu.:0.5
Median :12.25 Median :0.5
Mean :13.23 Mean :0.5
3rd Qu.:16.18 3rd Qu.:0.5
Max. :21.50 Max. :0.5
diamonds |>
  filter(
    cut %in% c("Premium", "Ideal"),
    carat <= .75,
    color == "D",
    !(clarity %in% c("VS1", "VS2")) # not VS1 or VS2
  ) |>
  select(cut, carat, color, clarity) |>
  summary()
cut carat color clarity
Fair : 0 Min. :0.2300 D:1798 SI1 :920
Good : 0 1st Qu.:0.3300 E: 0 SI2 :382
Very Good: 0 Median :0.4100 F: 0 VVS2 :298
Premium : 634 Mean :0.4516 G: 0 VVS1 :166
Ideal :1164 3rd Qu.:0.5400 H: 0 IF : 26
Max. :0.7500 I: 0 I1 : 6
J: 0 (Other): 0
20.7 Slice
The slice functions allow you to subset the data in a variety of ways.
# Top of data.frame
ToothGrowth |>
  slice_head()
len supp dose
1 4.2 VC 0.5
# Bottom of data.frame
ToothGrowth |>
  slice_tail()
len supp dose
1 23 OJ 2
# Random rows
ToothGrowth |>
  slice_sample(n = 5) # number of rows
ToothGrowth |>
  slice_sample(prop = 2 / 60) # proportion of rows
len supp dose
1 23.6 VC 2.0
2 10.0 VC 0.5
# Filter
ToothGrowth |>
  slice_min(
    len, # variable to order data by
    prop = 2 / 60
  )
len supp dose
1 4.2 VC 0.5
2 5.2 VC 0.5
20.8 Summarize
Previously we have seen the summary() function which can be used to provide default summaries for numeric and factor variables. The summarize() function can be used to calculate user-determined values. One commonly used function is n() which counts the number of observations in the data.frame.
ToothGrowth |>
  summarize(
    n        = n(), # number of observations
    mean_len = mean(len),
    sdlen    = sd(len)
  )
n mean_len sdlen
1 60 18.81333 7.649315
20.8.1 Group
When summarizing a data.frame, we can use the group_by() function so that the summary is calculated within each combination of the grouping variables.
The summarize() function requires that each argument returns a single value per group. Most of the time this is what you want, but sometimes you want more flexibility. If you use summarize() with an expression that returns more than one value per group, you will receive a warning.
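A minimal sketch of a grouped summary on the ToothGrowth data is shown here. The object name `by_group` and the column names `mean_len` and `sd_len` are chosen for illustration.

```r
library("dplyr")

# One row per combination of supplement and dose
by_group <- ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n        = n(),       # observations in each group
    mean_len = mean(len),
    sd_len   = sd(len),
    .groups  = "drop"     # return an ungrouped result
  )

by_group
```

Setting `.groups = "drop"` removes the grouping from the result, so later steps in a pipeline operate on the whole table rather than within groups.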
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'supp', 'dose'. You can override using the
`.groups` argument.
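As the warning suggests, `reframe()` is meant for summaries that return more than one row per group. The sketch below assumes dplyr 1.1.0 or later and uses quartiles of `len` as an illustrative multi-value summary; the names `probs`, `quartiles`, `prob`, and `len_quantile` are hypothetical.

```r
library("dplyr")

# Probabilities for the lower and upper quartiles
probs <- c(0.25, 0.75)

# reframe() allows each group to contribute several rows
quartiles <- ToothGrowth |>
  group_by(supp, dose) |>
  reframe(
    prob         = probs,
    len_quantile = unname(quantile(len, probs = probs))
  )

quartiles
```

The result has two rows for each of the six supplement-by-dose groups and, unlike `summarize()`, is always returned ungrouped.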