18 Import

Author

Jarad Niemi

We will start with loading up the necessary packages using the library() command.

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.2          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Before we can start reading data in from our file system, we need to 1) understand the concept of a working directory and 2) have a file that we can read.

18.1 File system

The figure below shows an example directory structure based on a Mac (or Linux) system.

18.2 Working directory

# Get the current working directory
getwd()

[1] "/Users/niemi/git/teaching/STAT5860/chapters"

18.2.1 Set working directory

You can set the working directory in RStudio by going to

Session > Set Working Directory > Choose Working Directory

Alternatively, we can use the setwd() function.

# Set the current working directory
?setwd
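As a minimal sketch of setwd() in use: the function invisibly returns the previous working directory, so you can store it and restore it afterwards. Here tempdir() is only a stand-in for a real project folder.

```r
# Change the working directory, remembering the previous one
# (tempdir() is a stand-in for a real project folder)
old_wd <- setwd(tempdir())

# Confirm the change
getwd()

# Restore the original working directory
setwd(old_wd)
```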

Use RStudio projects!! This will set your working directory to the same folder every time you open that project.

18.2.2 Absolute paths

An absolute path will work no matter what your current working directory is.

d <- readr::read_csv("~/Downloads/example.csv")
d <- readr::read_csv("/Users/niemi/Downloads/example.csv")

While absolute paths are convenient, I highly recommend you don’t use them because their use creates scripts that are not reproducible. The scripts won’t work correctly for you on any other system. The scripts won’t work for somebody else on their system.
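One way to see why these paths do not transfer between systems is to expand the ~ shortcut, which resolves to a different directory for each user and machine. A small sketch using base R (the file name is the same illustrative one used above and probably does not exist on your machine):

```r
# "~" is shorthand for the current user's home directory
expanded <- path.expand("~/Downloads/example.csv")
expanded

# The file is only found if this particular machine has it in this location
file.exists(expanded)
```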

18.2.3 Relative paths

A relative path is evaluated relative to the current working directory. To move up a directory (to the parent directory), use “..”. To move down into a directory, use the name of that directory.

d <- readr::read_csv("data/example.csv")
d <- readr::read_csv("../data/example.csv")
d <- readr::read_csv("../../example.csv")
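To make this concrete, here is a sketch that builds a small, hypothetical project in a temporary directory and reads the same file from two different working directories. The folder names (data, scripts) are assumptions chosen for illustration, not a required layout.

```r
library("readr")

# Build a small, hypothetical project inside a temporary directory:
#   project/data/example.csv
#   project/scripts/
project <- tempfile("project")
dir.create(file.path(project, "data"), recursive = TRUE)
dir.create(file.path(project, "scripts"))
write_csv(data.frame(x = 1:3), file.path(project, "data", "example.csv"))

old_wd <- setwd(project)

# From the project directory, move down into data/
d1 <- read_csv("data/example.csv", show_col_types = FALSE)

# From project/scripts, move up one level with "..", then down into data/
setwd("scripts")
d2 <- read_csv("../data/example.csv", show_col_types = FALSE)

# Restore the working directory and remove the temporary project
setwd(old_wd)
unlink(project, recursive = TRUE)

# Both relative paths pointed at the same file
identical(d1$x, d2$x)
```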

In order to use relative paths, you need to know what directory is expected to be the current working directory. I often put this information in the metadata of the file.

#
# Execute this script from its directory
#
#                OR
#
# Execute this script from the project directory
#

The former is convenient because you always know where the script should be executed (as long as nobody moves the file). The latter is convenient because all scripts can be executed from the same folder and thus, when you are working within a project, you do not need to constantly change your working directory. A downside is that it is not always obvious what the project directory is.

In the large projects I have been part of, the latter is typical.

18.3 Comma-separated value (CSV)

In order to read a file, we need an appropriate file. Comma-separated value (csv) files are a de facto standard for transporting data. In this file type, each column is separated by a comma and each row is separated by a newline.

18.3.1 Write

Before we try to read a csv file, we need to make sure there is a csv file available to be read. We will accomplish this by writing a csv file to the current working directory.

# Write csv file
readr::write_csv(ToothGrowth, 
                 file = "ToothGrowth.csv")

We will discuss more about writing a csv file when discussing exporting data.
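To see the format itself, it can help to look at the raw text rather than the parsed data. The sketch below writes the same data set to a temporary file (so it does not depend on your working directory) and prints the first few lines as plain text.

```r
# Write the built-in ToothGrowth data to a temporary csv file
tmp <- tempfile(fileext = ".csv")
readr::write_csv(ToothGrowth, file = tmp)

# Print the header row and the first three data rows as raw text
raw_lines <- readLines(tmp, n = 4)
print(raw_lines)

# Remove the temporary file
unlink(tmp)
```

The first line is the header row, with column names separated by commas; each later line is one row of data.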

18.3.2 Read

To read in a csv file you can use the read_csv function in the readr package.

# Read a csv file
d <- read_csv("ToothGrowth.csv")

Rows: 60 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): supp
dbl (2): len, dose

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a look at the data.

# Explore data
dim(d)

[1] 60  3

head(d)

# A tibble: 6 × 3
    len supp   dose
  <dbl> <chr> <dbl>
1   4.2 VC      0.5
2  11.5 VC      0.5
3   7.3 VC      0.5
4   5.8 VC      0.5
5   6.4 VC      0.5
6  10   VC      0.5

summary(d)

      len            supp                dose      
 Min.   : 4.20   Length:60          Min.   :0.500  
 1st Qu.:13.07   Class :character   1st Qu.:0.500  
 Median :19.25   Mode  :character   Median :1.000  
 Mean   :18.81                      Mean   :1.167  
 3rd Qu.:25.27                      3rd Qu.:2.000  
 Max.   :33.90                      Max.   :2.000

str(d)

spc_tbl_ [60 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ len : num [1:60] 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: chr [1:60] "VC" "VC" "VC" "VC" ...
 $ dose: num [1:60] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   len = col_double(),
  ..   supp = col_character(),
  ..   dose = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

table(d$supp)


OJ VC 
30 30

18.3.3 Options

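# Read a csv file with explicit column types
d <- read_csv("ToothGrowth.csv",
              col_types = cols(
                len  = col_double(),
                supp = col_factor(),
                dose = col_double()
              ))

Specifying col_types makes sure each column is read in with the type that you are expecting. Here, supp is read as a factor rather than as a character.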
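readr also accepts a compact string for col_types, with one letter per column (for example, d for double and f for factor). The sketch below uses a temporary copy of the data so that it is self-contained; the object name d_compact is just for illustration.

```r
library("readr")

# Write a temporary copy of the data to read back in
tmp <- tempfile(fileext = ".csv")
write_csv(ToothGrowth, file = tmp)

# Compact specification: d = double, f = factor, one letter per column
d_compact <- read_csv(tmp, col_types = "dfd")

# supp is now a factor; with no levels supplied, the levels follow
# the order in which the values first appear in the file
class(d_compact$supp)
levels(d_compact$supp)

# Remove the temporary file
unlink(tmp)
```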

18.3.4 Format

There is no official standard for CSV file formats, but there are attempts to define a standard or at least outline what is standard.

Library of Congress

The Internet Society

Generally, a csv file will contain commas (or semi-colons) that separate the columns in the data and the same number of columns in every row. The file should also have a header row, immediately preceding the first data row, that provides the names for each of the columns. Optionally, the file could contain additional metadata above the header row.

Some systems (Canvas….ahem!) add additional row(s) between the header and the data. This should not be done because standard software, including R, assumes the data comes immediately after the header, so these files are difficult to read.
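If you are stuck with such a file, one possible workaround is to read the header and the data in two steps. This is only a sketch: the file contents below are made up, and it assumes exactly one extra row sits between the header and the data.

```r
library("readr")

# A hypothetical export with an extra row between the header and the data
tmp <- tempfile(fileext = ".csv")
writeLines(c("student,id,score",
             "Points Possible,,100",
             "A,1,90",
             "B,2,85"), tmp)

# Step 1: read only the header row to get the column names
header_names <- names(read_csv(tmp, n_max = 0, show_col_types = FALSE))

# Step 2: skip the header and the extra row, supplying the names by hand
d_fixed <- read_csv(tmp, skip = 2, col_names = header_names,
                    show_col_types = FALSE)
d_fixed

# Remove the temporary file
unlink(tmp)
```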

If you want to include additional rows, you can include rows above the header row. The skip argument of read_csv() can be used to skip rows before the header row.
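A short sketch of the skip argument, using a made-up file with two metadata lines above the header row:

```r
library("readr")

# A hypothetical file with two lines of metadata above the header row
tmp <- tempfile(fileext = ".csv")
writeLines(c("Source: hypothetical export",
             "Units: see codebook",
             "len,supp,dose",
             "4.2,VC,0.5",
             "11.5,VC,0.5"), tmp)

# Skip the two metadata lines; the next line is used as the header
d_skip <- read_csv(tmp, skip = 2, show_col_types = FALSE)
d_skip

# Remove the temporary file
unlink(tmp)
```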

A common use of spreadsheets is to calculate subtotals either by adding additional rows or columns. Although these can be removed, it is annoying and therefore should be avoided. Simply calculate subtotals in a different Excel sheet or copy the data over to a new Excel sheet.

18.4 Excel

install.packages("readxl")
?readxl::read_excel
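As a sketch of typical usage, the readxl package ships with example workbooks, so you can try read_excel() without creating an Excel file yourself. This assumes readxl is already installed.

```r
library("readxl")

# Path to an example workbook that ships with the readxl package
path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook
sheets <- excel_sheets(path)
sheets

# Read one sheet by name
d_xl <- read_excel(path, sheet = "mtcars")
dim(d_xl)
```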

18.5 Databases

To read other types of files, including databases, start with these suggestions in R4DS.

18.6 Binary R file formats

Generally, I don’t suggest storing data in binary formats, but these formats can be useful for storing intermediate data. For example, if there are some important results from a statistical analysis that takes a long time to perform (I’m looking at you Bayesians), you might want to store the results in a binary format.

18.6.1 RData

There are two functions that will save RData files: save() and save.image(). The latter will save everything in the environment while the former will only save what it is specifically told to save. Both of these are read using the load() function.

a <- 1
save(a, file = "a.RData")
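To illustrate the read side, here is a small round trip with load(). It uses a temporary file so that it does not interfere with a.RData; note that load() restores objects under the names they were saved with and returns those names.

```r
# Save an object to a temporary RData file
a <- 1
rdata_file <- tempfile(fileext = ".RData")
save(a, file = rdata_file)

# Remove the object, then restore it from the file
rm(a)
loaded <- load(rdata_file)

# load() returns the names of the objects it restored
loaded
a

# Remove the temporary file
unlink(rdata_file)
```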

18.6.2 RDS

An RDS file contains a single R object. These files can be written using saveRDS() and read using readRDS().

saveRDS(a, file = "a.RDS")
rm(a)

When you read this file, you need to assign the result to an R object. Unlike load(), readRDS() lets you choose the name of that object.

b <- readRDS("a.RDS")
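One reason binary formats suit intermediate results is that they preserve R-specific structure such as factors, which a csv file stores only as text. The sketch below compares the two round trips using temporary files; the object names are just for illustration.

```r
library("readr")

rds_file <- tempfile(fileext = ".RDS")
csv_file <- tempfile(fileext = ".csv")

# Store the same data frame in both formats
saveRDS(ToothGrowth, file = rds_file)
write_csv(ToothGrowth, file = csv_file)

# Read both back in
from_rds <- readRDS(rds_file)
from_csv <- read_csv(csv_file, show_col_types = FALSE)

# Compare the RDS round trip with the original object
identical(from_rds, ToothGrowth)

# supp is still a factor after the RDS round trip,
# but a character column after the csv round trip
class(from_rds$supp)
class(from_csv$supp)

# Remove the temporary files
unlink(c(rds_file, csv_file))
```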

18.7 Cleanup

unlink("ToothGrowth.csv")
unlink("a.RData")
unlink("a.RDS")