In this chapter, we will discuss a variety of ways to joining multiple data sets into one data set.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_bw())
22.1 ID variables
In important concept in joining multiple data frames is the idea of an id variables (in database systems this is often referred to as a key). These variables unambiguously identify an observation or a record that is common across different data sets.
For example, all students (and faculty and staff) at Iowa State University have university id (a numeric value) as well as a NetID (alphanumeric). Both of these ID variables uniquely identify an individual.
When combining data frames, the process is made much easier when the data frames have a common ID variable. These ID variables will be included in the by argument of the functions. While it is best practice to include a single variable that uniquely identifies an observation, sometimes you can identify a single observation using a collection of variables. In this case, you can pass a character vector to the by argument.
22.2 Join
Joining operations occur when you want to combine exactly two data frames. Left, right, inner, and full joins combine two data frames, but differ in which row and variables are kept between the two data frames. Semi-join and anti-join can be used to filter on data frame against another.
For simplicity in understanding joining operations, we will construct two data frames that will be joined.
# Construct data framesd1 <-tribble( # Use tribble to allow construction of the data frame by rows~id, ~char1, ~num1,"id1", "A", 1,"id2", "B", 2)d2 <-tribble( # Use tribble to allow construction of the data frame by rows~id, ~char2, ~num2,"id1", "C", 11,"id3", "E", 33)
22.2.1 Left join
A left join will keep all observations in the first data frame and add the variables from the second data frame.
# Left joind1 |>left_join(d2, by ="id")
# A tibble: 2 × 5
id char1 num1 char2 num2
<chr> <chr> <dbl> <chr> <dbl>
1 id1 A 1 C 11
2 id2 B 2 <NA> NA
d2 |>left_join(d1, by ="id")
# A tibble: 2 × 5
id char2 num2 char1 num1
<chr> <chr> <dbl> <chr> <dbl>
1 id1 C 11 A 1
2 id3 E 33 <NA> NA
22.2.2 Right join
A right join will keep all observations in the second data frame and add the variables from the first data frame. Thus, in this way, is the opposite of a left join.
# Right joind1 |>right_join(d2, by ="id")
# A tibble: 2 × 5
id char1 num1 char2 num2
<chr> <chr> <dbl> <chr> <dbl>
1 id1 A 1 C 11
2 id3 <NA> NA E 33
d2 |>right_join(d1, by ="id")
# A tibble: 2 × 5
id char2 num2 char1 num1
<chr> <chr> <dbl> <chr> <dbl>
1 id1 C 11 A 1
2 id2 <NA> NA B 2
22.2.3 Inner join
An inner join will keep observations that exist in both data frames.
# Inner joind1 |>inner_join(d2, by ="id")
# A tibble: 1 × 5
id char1 num1 char2 num2
<chr> <chr> <dbl> <chr> <dbl>
1 id1 A 1 C 11
d2 |>inner_join(d1, by ="id")
# A tibble: 1 × 5
id char2 num2 char1 num1
<chr> <chr> <dbl> <chr> <dbl>
1 id1 C 11 A 1
22.2.4 Full join
A full join will keep all observations. If an observation doesn’t exist in one of the two data frames, then its corresponding variables from that data frame will be missing and thus be treated as an “NA” (not available).
# Full joind1 |>full_join(d2, by ="id")
# A tibble: 3 × 5
id char1 num1 char2 num2
<chr> <chr> <dbl> <chr> <dbl>
1 id1 A 1 C 11
2 id2 B 2 <NA> NA
3 id3 <NA> NA E 33
d2 |>full_join(d1, by ="id")
# A tibble: 3 × 5
id char2 num2 char1 num1
<chr> <chr> <dbl> <chr> <dbl>
1 id1 C 11 A 1
2 id3 E 33 <NA> NA
3 id2 <NA> NA B 2
22.2.5 Semi join
A semi-join will return the observations in the first data frame that exist in the second data frame, but will not include the variables from the second data frame.
# Semi joind1 |>semi_join(d2, by ="id")
# A tibble: 1 × 3
id char1 num1
<chr> <chr> <dbl>
1 id1 A 1
d2 |>semi_join(d1, by ="id")
# A tibble: 1 × 3
id char2 num2
<chr> <chr> <dbl>
1 id1 C 11
22.2.6 Anti join
An anti-join will include all observations that exist in the first data frame but not in the second data frame.
# Anti joind1 |>anti_join(d2, by ="id")
# A tibble: 1 × 3
id char1 num1
<chr> <chr> <dbl>
1 id2 B 2
d2 |>anti_join(d1, by ="id")
# A tibble: 1 × 3
id char2 num2
<chr> <chr> <dbl>
1 id3 E 33
22.3 Bind
While joining requires exactly two data frames, binding can be used to combine as many data frames as you want.
22.3.1 Bind rows
When binding rows, columns will be matched by name. If a column doesn’t exist in one of the data frames, then the associated values in the bound data frame will have missing values.
# A tibble: 4 × 4
id var1 var2 var3
<chr> <chr> <dbl> <dbl>
1 id1 A 1 NA
2 id2 B 2 NA
3 id3 C NA 10
4 id4 D NA 20
bind_rows(d2, d1)
# A tibble: 4 × 4
id var1 var3 var2
<chr> <chr> <dbl> <dbl>
1 id3 C 10 NA
2 id4 D 20 NA
3 id1 A NA 1
4 id2 B NA 2
22.3.2 Bind columns
Binding by columns
# Data frames for binding columnsd1 <-tribble(~id, ~var1, "id1", "A","id2", "B")d2 <-tribble(~id, ~var2,"id1", 1,"id2", 2)# If rows are in the correct order,# we get what we are expectingbind_cols(d1, d2 |>select(-id))
# A tibble: 2 × 3
id var1 var2
<chr> <chr> <dbl>
1 id1 A 1
2 id2 B 2
# If rows are in an incorrect order,# we are likely not getting what we expectedbind_cols(d1, d2 |>select(-id) |>arrange(desc(var2)))
# A tibble: 2 × 3
id var1 var2
<chr> <chr> <dbl>
1 id1 A 2
2 id2 B 1
# If we try to keep the ID variable,# new names will be createdbind_cols(d1, d2)
New names:
• `id` -> `id...1`
• `id` -> `id...3`
# A tibble: 2 × 4
id...1 var1 id...3 var2
<chr> <chr> <chr> <dbl>
1 id1 A id1 1
2 id2 B id2 2
Rather than using bind_cols(), it is probably better practice to join the data frames.
# Prefer joining by ID variableleft_join(d1, d2, by ="id")
# A tibble: 2 × 3
id var1 var2
<chr> <chr> <dbl>
1 id1 A 1
2 id2 B 2
22.4 Summary
Combining data sets from different sources is an extremely common practice when tidying data for visualization or analysis. Joining operations based on an ID variable are extremely useful while binding operations, especially bind_cols(), should be used with care.