22 Join

Author

Jarad Niemi

In this chapter, we will discuss a variety of ways to joining multiple data sets into one data set.

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.2          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

theme_set(theme_bw())

22.1 ID variables

In important concept in joining multiple data frames is the idea of an id variables (in database systems this is often referred to as a key). These variables unambiguously identify an observation or a record that is common across different data sets.

For example, all students (and faculty and staff) at Iowa State University have university id (a numeric value) as well as a NetID (alphanumeric). Both of these ID variables uniquely identify an individual.

When combining data frames, the process is made much easier when the data frames have a common ID variable. These ID variables will be included in the by argument of the functions. While it is best practice to include a single variable that uniquely identifies an observation, sometimes you can identify a single observation using a collection of variables. In this case, you can pass a character vector to the by argument.

22.2 Join

Joining operations occur when you want to combine exactly two data frames. Left, right, inner, and full joins combine two data frames, but differ in which row and variables are kept between the two data frames. Semi-join and anti-join can be used to filter on data frame against another.

For simplicity in understanding joining operations, we will construct two data frames that will be joined.

# Construct data frames
d1 <- tribble( # Use tribble to allow construction of the data frame by rows
  ~id, ~char1, ~num1,
  "id1", "A", 1,
  "id2", "B", 2
)

d2 <- tribble( # Use tribble to allow construction of the data frame by rows
  ~id, ~char2, ~num2,
  "id1", "C", 11,
  "id3", "E", 33
)

22.2.1 Left join

A left join will keep all observations in the first data frame and add the variables from the second data frame.

# Left join
d1 |> left_join(d2, by = "id")

# A tibble: 2 × 5
  id    char1  num1 char2  num2
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   A         1 C        11
2 id2   B         2 <NA>     NA

d2 |> left_join(d1, by = "id")

# A tibble: 2 × 5
  id    char2  num2 char1  num1
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   C        11 A         1
2 id3   E        33 <NA>     NA

22.2.2 Right join

A right join will keep all observations in the second data frame and add the variables from the first data frame. Thus, in this way, is the opposite of a left join.

# Right join
d1 |> right_join(d2, by = "id")

# A tibble: 2 × 5
  id    char1  num1 char2  num2
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   A         1 C        11
2 id3   <NA>     NA E        33

d2 |> right_join(d1, by = "id")

# A tibble: 2 × 5
  id    char2  num2 char1  num1
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   C        11 A         1
2 id2   <NA>     NA B         2

22.2.3 Inner join

An inner join will keep observations that exist in both data frames.

# Inner join
d1 |> inner_join(d2, by = "id")

# A tibble: 1 × 5
  id    char1  num1 char2  num2
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   A         1 C        11

d2 |> inner_join(d1, by = "id")

# A tibble: 1 × 5
  id    char2  num2 char1  num1
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   C        11 A         1

22.2.4 Full join

A full join will keep all observations. If an observation doesn’t exist in one of the two data frames, then its corresponding variables from that data frame will be missing and thus be treated as an “NA” (not available).

# Full join
d1 |> full_join(d2, by = "id")

# A tibble: 3 × 5
  id    char1  num1 char2  num2
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   A         1 C        11
2 id2   B         2 <NA>     NA
3 id3   <NA>     NA E        33

d2 |> full_join(d1, by = "id")

# A tibble: 3 × 5
  id    char2  num2 char1  num1
  <chr> <chr> <dbl> <chr> <dbl>
1 id1   C        11 A         1
2 id3   E        33 <NA>     NA
3 id2   <NA>     NA B         2

22.2.5 Semi join

A semi-join will return the observations in the first data frame that exist in the second data frame, but will not include the variables from the second data frame.

# Semi join
d1 |> semi_join(d2, by = "id")

# A tibble: 1 × 3
  id    char1  num1
  <chr> <chr> <dbl>
1 id1   A         1

d2 |> semi_join(d1, by = "id")

# A tibble: 1 × 3
  id    char2  num2
  <chr> <chr> <dbl>
1 id1   C        11

22.2.6 Anti join

An anti-join will include all observations that exist in the first data frame but not in the second data frame.

# Anti join
d1 |> anti_join(d2, by = "id")

# A tibble: 1 × 3
  id    char1  num1
  <chr> <chr> <dbl>
1 id2   B         2

d2 |> anti_join(d1, by = "id")

# A tibble: 1 × 3
  id    char2  num2
  <chr> <chr> <dbl>
1 id3   E        33

22.3 Bind

While joining requires exactly two data frames, binding can be used to combine as many data frames as you want.

22.3.1 Bind rows

When binding rows, columns will be matched by name. If a column doesn’t exist in one of the data frames, then the associated values in the bound data frame will have missing values.

# Data frames for binding rows
d1 <- tribble(
  ~id, ~var1, ~var2,
  "id1", "A", 1,
  "id2", "B", 2
)

d2 <- tribble(
  ~id, ~var1, ~var3,
  "id3", "C", 10,
  "id4", "D", 20
)

# Bind rows
bind_rows(d1, d2)

# A tibble: 4 × 4
  id    var1   var2  var3
  <chr> <chr> <dbl> <dbl>
1 id1   A         1    NA
2 id2   B         2    NA
3 id3   C        NA    10
4 id4   D        NA    20

bind_rows(d2, d1)

# A tibble: 4 × 4
  id    var1   var3  var2
  <chr> <chr> <dbl> <dbl>
1 id3   C        10    NA
2 id4   D        20    NA
3 id1   A        NA     1
4 id2   B        NA     2

22.3.2 Bind columns

Binding by columns

# Data frames for binding columns
d1 <- tribble(
  ~id, ~var1, 
  "id1", "A",
  "id2", "B"
)

d2 <- tribble(
  ~id, ~var2,
  "id1", 1,
  "id2", 2
)

# If rows are in the correct order,
# we get what we are expecting
bind_cols(d1, 
          d2 |> select(-id))

# A tibble: 2 × 3
  id    var1   var2
  <chr> <chr> <dbl>
1 id1   A         1
2 id2   B         2

# If rows are in an incorrect order,
# we are likely not getting what we expected
bind_cols(d1, 
          d2 |> 
            select(-id) |>
            arrange(desc(var2)))

# A tibble: 2 × 3
  id    var1   var2
  <chr> <chr> <dbl>
1 id1   A         2
2 id2   B         1

# If we try to keep the ID variable,
# new names will be created
bind_cols(d1, d2)

New names:
• `id` -> `id...1`
• `id` -> `id...3`

# A tibble: 2 × 4
  id...1 var1  id...3  var2
  <chr>  <chr> <chr>  <dbl>
1 id1    A     id1        1
2 id2    B     id2        2

Rather than using bind_cols(), it is probably better practice to join the data frames.

# Prefer joining by ID variable
left_join(d1, d2, by = "id")

# A tibble: 2 × 3
  id    var1   var2
  <chr> <chr> <dbl>
1 id1   A         1
2 id2   B         2

22.4 Summary

Combining data sets from different sources is an extremely common practice when tidying data for visualization or analysis. Joining operations based on an ID variable are extremely useful while binding operations, especially bind_cols(), should be used with care.