We will start with loading up the necessary packages using the library() command.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Before we can start reading data in from our file system, we need to 1) understand the concept of a working directory and 2) have a file that we can read.
18.1 File system
The figure below shows an example directory structure based on a Mac (or Linux) system.
18.2 Working directory
# Get the current working directory
getwd()
[1] "/Users/niemi/git/teaching/STAT5860/chapters"
18.2.1 Set working directory
You can set the working directory in RStudio by going to
Session > Set Working Directory > Choose Working Directory
Alternatively, we can use the setwd() function.
# Set the current working directory
?setwd
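As a minimal sketch of setwd() in a script, the code below switches to R's temporary directory (chosen here only so the example runs anywhere) and then restores the original directory. setwd() invisibly returns the previous working directory, which makes restoring it straightforward.

```r
# Remember where we started; setwd() returns the previous directory invisibly
old_wd <- setwd(tempdir())

# The working directory is now the session's temporary directory
new_wd <- getwd()
print(new_wd)

# Restore the original working directory
setwd(old_wd)
restored_wd <- getwd()
```

Saving and restoring the old directory keeps the rest of a session unaffected by the change.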
Use RStudio projects!! This will set your working directory to the same folder every time you open that project.
18.2.2 Absolute paths
An absolute path will work no matter what your current working directory is.
d <- readr::read_csv("~/Downloads/example.csv")
d <- readr::read_csv("/Users/niemi/Downloads/example.csv")
While absolute paths are convenient, I highly recommend you don’t use them because their use creates scripts that are not reproducible. The scripts won’t work correctly for you on any other system, and they won’t work for somebody else on their system.
18.2.3 Relative paths
A relative path is evaluated relative to the current working directory. To move up a directory (to the parent directory), use “..”. To move down a directory, use the name of the directory.
d <- readr::read_csv("data/example.csv")
d <- readr::read_csv("../data/example.csv")
d <- readr::read_csv("../../example.csv")
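To see how relative paths resolve, here is a small sketch that builds a throwaway directory tree under tempdir() and checks for the same file from two different working directories. The folder names project, data, and scripts are hypothetical and chosen only for illustration.

```r
# Build a hypothetical project layout in a temporary location
root <- file.path(tempdir(), "project")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "scripts"), showWarnings = FALSE)
writeLines("x,y\n1,2", file.path(root, "data", "example.csv"))

old_wd <- setwd(root)

# From the project directory, move down into data/
from_project <- file.exists("data/example.csv")

# From scripts/, ".." moves up to the project directory first
setwd("scripts")
from_scripts <- file.exists("../data/example.csv")

# The same bare relative path no longer resolves from scripts/
not_found <- file.exists("data/example.csv")

# Restore the original working directory
setwd(old_wd)
```

The same file is reachable from both directories, but only with a path written for that particular working directory.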
In order to use relative paths, you need to know what directory is expected to be the current working directory. I often put this information in the metadata of the file.
#
# Execute this script from its directory
#
# OR
#
# Execute this script from the project directory
#
The former is convenient because you always know where the script should be executed (as long as nobody moves the file). The latter is convenient because all scripts can be executed from the same folder and thus, when you are working within a project, you do not need to constantly change your working directory. A downside is that it is not always obvious what the project directory is.
In the large projects I have been part of, the latter is typical.
18.3 Comma-separated value (CSV)
In order to read a file, we need an appropriate file. Comma-separated value (csv) files are a de facto standard for transporting data. In this file type, the columns are separated by commas and the rows are separated by newlines.
18.3.1 Write
Before we try to read a csv file, we need to make sure there is a csv file available to be read. We will accomplish this by writing a csv file to the current working directory.
We will discuss more about writing a csv file when discussing exporting data.
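The writing step might look like the following sketch. It assumes the file is created from R's built-in ToothGrowth data set, which is consistent with the file name and columns read in the next section, and it uses readr's write_csv(). The exact call used to create the file is an assumption.

```r
library("readr")

# Write the built-in ToothGrowth data set to the current working directory
# (assumption: this is the data behind "ToothGrowth.csv")
write_csv(ToothGrowth, file = "ToothGrowth.csv")

# Confirm the file now exists
file.exists("ToothGrowth.csv")
```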
18.3.2 Read
To read in a csv file you can use the read_csv function in the readr package.
# Read a csv file
d <- read_csv("ToothGrowth.csv")
Rows: 60 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): supp
dbl (2): len, dose
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
      len            supp                dose      
 Min.   : 4.20   Length:60          Min.   :0.500  
 1st Qu.:13.07   Class :character   1st Qu.:0.500  
 Median :19.25   Mode  :character   Median :1.000  
 Mean   :18.81                      Mean   :1.167  
 3rd Qu.:25.27                      3rd Qu.:2.000  
 Max.   :33.90                      Max.   :2.000  
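As the column specification message above suggests, the column types can be stated explicitly. A minimal sketch follows. It writes the built-in ToothGrowth data to a temporary file first so that the example is self-contained; the temporary file and the choice of source data are assumptions made for illustration.

```r
library("readr")

# Create a csv file to read (temporary file used only for this example)
csv_file <- file.path(tempdir(), "ToothGrowth.csv")
write_csv(ToothGrowth, file = csv_file)

# Declare the column types up front; this also silences the message
d <- read_csv(
  csv_file,
  col_types = cols(
    len  = col_double(),
    supp = col_character(),
    dose = col_double()
  )
)

summary(d)
```

Stating the types guards against a column being silently read as a different type if the file changes later.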
Generally, a csv file will contain commas (or semi-colons) that separate the columns in the data and the same number of columns in every row. The file should also have a header row, immediately preceding the first data row, that provides the names for each of the columns. Optionally, the file could contain additional metadata above the header row.
Some software (Canvas... ahem!) adds additional row(s) between the header and the data. This should not be done, as it makes the file difficult for standard software, including R, to read, since the data are assumed to come immediately after the header.
If you want to include additional rows, you can include rows above the header row. The skip argument of read_csv() can be used to skip rows before the header row.
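A minimal sketch of the skip argument, using a small hypothetical file with two metadata lines above the header row (the file contents are made up for illustration):

```r
library("readr")

# A hypothetical csv file with two metadata lines above the header row
csv_file <- file.path(tempdir(), "with_metadata.csv")
writeLines(
  c("First metadata line",
    "Second metadata line",
    "len,supp,dose",
    "4.2,VC,0.5",
    "11.5,VC,0.5"),
  csv_file
)

# Skip the two metadata lines so the third line is treated as the header
d <- read_csv(csv_file, skip = 2, show_col_types = FALSE)
names(d)
```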
A common use of spreadsheets is to calculate subtotals by adding either additional rows or additional columns. Although these can be removed, doing so is annoying, and therefore subtotals within the data should be avoided. Simply calculate subtotals in a different Excel sheet or copy the data over to a new Excel sheet.
Generally, I don’t suggest storing data in binary formats, but these formats can be useful to store intermediate data. For example, if there are some important results from a statistical analysis that takes a long time to perform (I’m looking at you Bayesians), you might want to store the results in a binary format.
18.6.1 RData
There are two functions that will save RData files: save() and save.image(). The latter will save everything in the environment while the former will only save what it is specifically told to save. Both of these are read using the load() function.
a <- 1
save(a, file = "a.RData")
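A short sketch of the round trip with load(), using a temporary file (an assumption, so that the example does not clutter the working directory). Note that load() restores the object under its original name.

```r
a <- 1

# Save the object, then remove it from the environment
rdata_file <- file.path(tempdir(), "a.RData")
save(a, file = rdata_file)
rm(a)

# load() recreates `a` under its original name and
# invisibly returns the names of the objects it loaded
loaded <- load(rdata_file)
loaded
a
```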
18.6.2 RDS
An RDS file contains a single R object. These files can be written using saveRDS() and read using readRDS().
saveRDS(a, file = "a.RDS")
rm(a)
When you read this file, you need to save it into an R object.
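For example, the following sketch writes an object to a temporary RDS file (the temporary location and the name b are assumptions for illustration) and reads it back into a new name:

```r
a <- 1

# Write a single object to a temporary RDS file, then remove it
rds_file <- file.path(tempdir(), "a.RDS")
saveRDS(a, file = rds_file)
rm(a)

# readRDS() returns the object, so assign it to a name of your choosing
b <- readRDS(rds_file)
b
```

Unlike load(), readRDS() does not restore the original name, so the object can be assigned to any name without overwriting something already in the environment.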