This page introduces commonly used R object types and their dimensions. It begins with object dimensions, including scalars, vectors, matrices, and arrays. It then covers the basic object types: logical, numeric, and character. It finishes with more advanced objects, including factors, dates, data frames, and lists.
So far we have dealt with all R objects as scalars, i.e. a single value of that data type. For example, the following are all scalars.
# Scalars
TRUE
[1] TRUE
2.5
[1] 2.5
"a scalar"
[1] "a scalar"
Note that “a scalar” is considered a scalar character.
If we want to combine multiple scalars, we will typically put them into a vector. We construct vectors using the c() function.
# Vectors
c(TRUE, FALSE)
[1] TRUE FALSE
c(2.5, 6, 7.2)
[1] 2.5 6.0 7.2
c("a", "character", "vector")
[1] "a" "character" "vector"
Vector length can be determined using the length() function.
# Vector length
vl <- c(TRUE, FALSE)
length(vl)
[1] 2
vn <- c(1, 2, 3, 4)
length(vn)
[1] 4
vc <- c("a", "character", "vector")
length(vc)
[1] 3
The data type can be determined using the class() function.
# Vector type
class(vl)
[1] "logical"
class(vn)
[1] "numeric"
class(vc)
[1] "character"
Sometimes it is useful to give names to elements of the vector. Names can be assigned using the names() function.
# Surface area of Great Lakes
area <- c(82100, 57800, 59600, 25670, 19010) # km^2
# Add names
names(area) <- c("Superior", "Ontario", "Huron", "Michigan", "Erie")
area
Superior Ontario Huron Michigan Erie
82100 57800 59600 25670 19010
Sometimes these names are created from a function.
# Named vector
table(OrchardSprays$treatment)
A B C D E F G H
8 8 8 8 8 8 8 8
Accessing vector elements can be done using indices or a logical vector. Indexing in R starts at 1 (rather than 0 as in some languages). To use a logical vector, supply one with the same length as the vector being accessed; the elements at the TRUE positions are returned.
# Construct vector
v <- c("one", "two", "three")
# Access using indices
v[1]
[1] "one"
v[-2]
[1] "one" "three"
v[2:3]
[1] "two" "three"
v[-c(1,2)]
[1] "three"
# Access using logical vector
v[c( TRUE, FALSE, FALSE)]
[1] "one"
v[c( TRUE, FALSE, TRUE)]
[1] "one" "three"
v[c(FALSE, TRUE, TRUE)]
[1] "two" "three"
v[c(FALSE, FALSE, FALSE)]
character(0)
When using indices, you can repeat elements.
# Index repeats
v[c(1,1,2,2,3)]
[1] "one" "one" "two" "two" "three"
Care is needed when using a logical vector because the logical vector will be (if possible) replicated to be the same length as the original vector. This replication happens silently, without a warning or an error.
# TRUE is replicated 3 times
v[TRUE]
[1] "one" "two" "three"
# New vector
v <- LETTERS[1:4]
v[c(TRUE, FALSE)] # Replicated
[1] "A" "C"
If the vector has names, the names can be used to access member elements.
area[c("Ontario", "Erie")]
Ontario Erie
57800 19010
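In practice, the logical vector used for indexing is usually produced by a comparison rather than typed by hand. The following is a minimal sketch using a small made-up named vector, `depth`, whose name and values are invented for illustration and are not part of the examples above.

```r
# Hypothetical named vector (values invented for illustration)
depth <- c(a = 3, b = 10, c = 7)

# A comparison returns a logical vector the same length as `depth`
deep <- depth > 5
deep

# Use the logical vector to keep only the elements where the condition holds
depth[deep]

# which() converts the logical vector into the equivalent indices
which(deep)
```

Because the logical vector is built from `depth` itself, it always has the right length, which avoids the replication issue described above.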
We can extend from vector, a one-dimensional object, to a matrix, a two-dimensional object. There are a variety of ways to construct a matrix including matrix(), cbind(), and rbind().
# Construct a matrix
matrix(1:6, nrow = 3, ncol = 2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
# Construct a matrix inputting elements by rows
matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
# Construct a matrix by binding columns
cbind(1:3, 4:6)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
# Construct a matrix by binding rows
rbind(1:3, 4:6)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
The dimensions of the matrix can be found using dim(), nrow(), and ncol().
# Dimensions
m <- matrix(1:6, nrow = 2)
dim(m) # row by column
[1] 2 3
nrow(m)
[1] 2
ncol(m)
[1] 3
The type of matrix can be determined using typeof().
# Matrix type
typeof(matrix(c(TRUE, FALSE), nrow = 1))
[1] "logical"
typeof(matrix(c(1L, 2L), nrow = 1))
[1] "integer"
typeof(matrix(c("a", "b"), nrow = 1))
[1] "character"
A matrix can have row and column names.
# Construct a matrix
m <- cbind(
  area,
  c(12100, 4920, 3540, 484, 1640)
)
# Assign row and column names
rownames(m) # taken from area
[1] "Superior" "Ontario" "Huron" "Michigan" "Erie"
colnames(m)[2] <- "volume"
# Print matrix
m
area volume
Superior 82100 12100
Ontario 57800 4920
Huron 59600 3540
Michigan 25670 484
Erie 19010 1640
Matrices can be accessed in the same way vectors were, but now there are two dimensions. If a dimension is blank, then we will get all elements in that dimension.
# Matrix accessing
m[1,2]
[1] 12100
m[1:2,]
area volume
Superior 82100 12100
Ontario 57800 4920
m[-3,]
area volume
Superior 82100 12100
Ontario 57800 4920
Michigan 25670 484
Erie 19010 1640
m[,"volume"]
Superior Ontario Huron Michigan Erie
12100 4920 3540 484 1640
m["Ontario",]
area volume
57800 4920
It is common to forget the comma, in which case you get something you probably don’t expect.
# Forgot the comma
m[4]
[1] 25670
m[7]
[1] 4920
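The single-index results above follow from how R stores a matrix: as one vector filled column by column, so a single index counts down the first column and then continues into the next. Here is a minimal sketch with an unnamed matrix of the same shape; the object `mm` is introduced only for this illustration.

```r
# A 5 x 2 matrix filled column by column
mm <- matrix(1:10, nrow = 5, ncol = 2)

# Index 7 is row 2 of column 2, because column 1 occupies indices 1 to 5
mm[7]
mm[2, 2]

# In general, element (i, j) sits at position i + (j - 1) * nrow(mm)
i <- 2
j <- 2
mm[i + (j - 1) * nrow(mm)] == mm[i, j]
```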
Arrays can be constructed that extend beyond the two-dimensional objects to higher dimensions. These are not commonly used and thus we will only show a quick example of how to construct one.
# Construct an array
a <- array(1:12,
           dim = 4:2)
# Print the array
a
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
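Array elements are accessed in the same way as matrix elements, with one index per dimension. The sketch below reuses the construction above; note that `1:12` is recycled to fill all 24 cells, which is why the two slices print identically.

```r
# Same construction as above: 1:12 is recycled to fill 4 * 3 * 2 = 24 cells
a <- array(1:12, dim = 4:2)

# Dimensions: rows, columns, and the third dimension
dim(a)

# Access one element: row 2, column 3, second slice
a[2, 3, 2]

# Leave dimensions blank to extract a whole slice (a 4 x 3 matrix)
a[, , 1]
```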
Logical (or boolean) variables have only two values: TRUE and FALSE.
# Logical
TRUE
[1] TRUE
FALSE
[1] FALSE
is.logical(TRUE)
[1] TRUE
is.logical(FALSE)
[1] TRUE
# Other variable types are not logicals
is.logical(1)
[1] FALSE
is.logical("a")
[1] FALSE
By default, T and F are assigned the values TRUE and FALSE.
# Use TRUE and FALSE
T
[1] TRUE
F
[1] FALSE
is.logical(T)
[1] TRUE
is.logical(F)
[1] TRUE
Caution: the objects T and F can be redefined.
# Confusing redefinition
T <- "a"
is.logical(T)
[1] FALSE
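A redefinition like this only masks the built-in object. As a minimal sketch, removing the user-defined `T` with rm() makes the built-in value visible again. TRUE and FALSE themselves are reserved words and cannot be reassigned, which is why writing them out in full is the safer habit.

```r
# Redefine T, as in the example above
T <- "a"
is.logical(T)

# Removing the user-defined object makes the built-in value visible again
rm(T)
is.logical(T)

# TRUE and FALSE are reserved words, so they are always safe to use
isTRUE(TRUE)
```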
Numeric is the default data type for any number.
# Numeric
4
[1] 4
is.numeric(4)
[1] TRUE
2.5
[1] 2.5
is.numeric(2.5)
[1] TRUE
is.numeric(pi)
[1] TRUE
x <- sqrt(2)
is.numeric(x)
[1] TRUE
Numeric vectors can be constructed in a variety of ways.
# Sequential integer vector construction
2:5
[1] 2 3 4 5
5:1
[1] 5 4 3 2 1
-1:3
[1] -1 0 1 2 3
1:5.99
[1] 1 2 3 4 5
# General vector construction
seq(from = 2, to = 10, by = 2)
[1] 2 4 6 8 10
seq(from = 0, to = 1, by = 0.2)
[1] 0.0 0.2 0.4 0.6 0.8 1.0
In R, a numeric can be thought of as equivalent to a double in other languages. You can also explicitly create and use integers. To create an integer, append L to the number.
# Integer
3L
[1] 3
is.integer(1L)
[1] TRUE
I mention integers so that if you see code where a number is followed by an L you will know what is going on. Otherwise, this distinction is rarely important because 1) an integer is treated as a numeric by R and 2) many calculations, such as sqrt() or division, convert the integer to a double.
# Integer conversion
x <- 3L
is.integer(x)
[1] TRUE
is.numeric(x)
[1] TRUE
is.integer(sqrt(x^2))
[1] FALSE
is.numeric(sqrt(x^2))
[1] TRUE
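To see which operations keep an integer and which convert it to a double, typeof() is more specific than is.numeric(). The following sketch checks a few common operations.

```r
x <- 3L

typeof(x + 2L)   # integer arithmetic stays integer
typeof(x / 2L)   # division returns a double
typeof(x * 2)    # mixing with a double returns a double
typeof(sqrt(x))  # sqrt() returns a double

# Either way, the result is still numeric
is.numeric(x + 2L)
is.numeric(x / 2L)
```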
A character is the default data type for any non-logical and non-numeric value. A character variable must be enclosed in quotes. If not, R will look for an object with the same name.
# Character
is.character("a")
[1] TRUE
is.character("2b")
[1] TRUE
is.character("2023-07-06")
[1] TRUE
is.character("Is this really a character?")
[1] TRUE
As you can see from the example above, the character refers to single characters but also strings of multiple characters.
To construct character vectors, the paste() and paste0() functions are often useful. By default, paste() inserts a space as a separator while paste0() uses no separator.
# Example paste() usage
paste( "group", 1:2)
[1] "group 1" "group 2"
paste0("group", 1:2)
[1] "group1" "group2"
paste( "group", 1:2, sep = "-")
[1] "group-1" "group-2"
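The sep argument controls how the arguments are joined element by element. A related argument, collapse, joins the resulting elements into a single string. A short sketch follows; the object name `grp` is chosen only for this illustration.

```r
# sep joins the arguments element by element
grp <- paste("group", 1:3, sep = "_")
grp

# collapse joins the resulting elements into a single string
paste(grp, collapse = ", ")
```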
To extract individual characters, use the substr() function.
# Extract characters
s <- "This is my very long character."
substr(s, start = 1, stop = 4)
[1] "This"
substr(s, start = nchar(s)-9, stop = nchar(s))
[1] "character."
As we will see, characters form the basis for many advanced data types including dates and factors.
Here we introduce a variety of advanced data types including factors, dates, data frames, and lists.
Factors are a special type of character vector that allows the user more control over how the elements of the vector are used in visualizations and modeling.
When sorted, character vectors are ordered alphabetically, with lowercase letters coming before their uppercase equivalents (the exact order depends on the locale).
# Character vector sorting examples
v <- c("A", "B", "a", "b")
sort(v)
[1] "a" "A" "b" "B"
v <- c("1", "2", "10")
sort(v)
[1] "1" "10" "2"
# Alphabetical ordering may not be desired
treatment <- c(
  paste0("A_dose", c(10,20,100)),
  paste0("D_dose", c(10,20,100)),
  "control")
sort(treatment)
[1] "A_dose10" "A_dose100" "A_dose20" "control" "D_dose10" "D_dose100"
[7] "D_dose20"
If we are creating a visualization or performing an analysis, the order we probably wanted was “control” followed by treatment A doses 10, 20, and 100 followed by treatment D doses 10, 20, and 100.
To create a factor vector use the factor() function.
# Construct factor
f <- factor(c("A","A","B","B","B"))
# Check if `f` is a factor
is.factor(f)
[1] TRUE
# Check class of `f`
class(f)
[1] "factor"
# Look at `f`
f
[1] A A B B B
Levels: A B
Compare the output above to a character vector:
# Convert factor to character
as.character(f)
[1] "A" "A" "B" "B" "B"
Notice that there are no quotes and there is a second line that indicates the levels of the factor.
The internal representation of factors in R is a numeric vector with a lookup table for the levels of the factor.
# Numeric vector
as.numeric(f)
[1] 1 1 2 2 2
# Lookup table
levels(f)
[1] "A" "B"
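The two pieces fit together as follows: indexing the lookup table by the integer codes rebuilds the original values. The sketch below also shows a common consequence of this representation when the labels look like numbers; the factor `g` is introduced only for this illustration.

```r
f <- factor(c("A", "A", "B", "B", "B"))

# Rebuild the character values from the integer codes and the lookup table
levels(f)[as.integer(f)]

# With number-like labels, as.numeric() returns the codes, not the labels
g <- factor(c("10", "20", "10"))
as.numeric(g)                 # codes
as.numeric(as.character(g))   # labels converted to numbers
```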
You can use the levels function to change the values for the factor.
# Change levels
levels(f) <- c("C","D")
f
[1] C C D D D
Levels: C D
Rather than changing the levels this way, I suggest you use forcats::fct_recode() or something similar.
By default, factor levels will be ordered alphabetically. For example,
# Factor default ordering
f <- factor(rep(c("Dose10","Dose20","Dose100"), each = 2))
f
[1] Dose10 Dose10 Dose20 Dose20 Dose100 Dose100
Levels: Dose10 Dose100 Dose20
levels(f)
[1] "Dose10" "Dose100" "Dose20"
This ordering will affect all aspects of statistical analysis including descriptive statistics and visualizations. For example,
# Descriptive statistics
summary(f)
Dose10 Dose100 Dose20
2 2 2
Thus, you may want to put the factor levels in a different order. To set the order when the factor is constructed, use the levels argument.
# Construct a factor in order
f_in_order <- factor(f, levels = c("Dose10", "Dose20", "Dose100"))
# Check factor level order
f_in_order
[1] Dose10 Dose10 Dose20 Dose20 Dose100 Dose100
Levels: Dose10 Dose20 Dose100
levels(f_in_order)
[1] "Dose10" "Dose20" "Dose100"
summary(f_in_order)
Dose10 Dose20 Dose100
2 2 2
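The same approach gives the ordering wanted for the treatment example earlier: control first, then each treatment by increasing dose. This is a minimal sketch; the object names `desired` and `treatment_f` are introduced here for illustration.

```r
# The character vector from the sorting example
treatment <- c(
  paste0("A_dose", c(10, 20, 100)),
  paste0("D_dose", c(10, 20, 100)),
  "control")

# Spell out the desired level order explicitly
desired <- c(
  "control",
  paste0("A_dose", c(10, 20, 100)),
  paste0("D_dose", c(10, 20, 100)))

treatment_f <- factor(treatment, levels = desired)
levels(treatment_f)

# Sorting a factor follows the level order, not alphabetical order
sort(treatment_f)
```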
In the factor function, there is an argument called ordered. This serves a different purpose and is beyond the scope of this course. Thus, just ignore the ordered argument.
Sometimes you just need to move one level to the first position. This is particularly true when performing a regression analysis and you are trying to set the reference level.
To move a single level to the first position, use the relevel function. For example,
# Construct a factor
f <- factor(c("Control","A10","A20","A30"))
# Check factor order
levels(f)
[1] "A10" "A20" "A30" "Control"
# Put `Control` as first factor level
f <- relevel(f, ref = "Control")
levels(f)
[1] "Control" "A10" "A20" "A30"
In regression modeling, the first factor level will be treated as the reference level, i.e. the level associated with the intercept.
Dates can be extremely difficult to work with, for reasons ranging from inconsistency in how dates are formatted to dealing with time zones.
For any organization that utilizes dates, those dates should be recorded using the ISO 8601 standard. Specifically, the date should be represented as YYYY-MM-DD, where YYYY represents the 4 digit year, MM represents the 2 digit month, and DD represents the 2 digit day. For example, 2023-07-02 is July 2, 2023. Note that the leading zeros are required.
Dates start off as a character.
# Character
d1 <- "2023-07-02"
d1
[1] "2023-07-02"
class(d1)
[1] "character"
This date can be converted to a Date object by using the as.Date() function.
# Date
d2 <- as.Date(d1)
d2
[1] "2023-07-02"
class(d2)
[1] "Date"
The date will be printed out (by default) using the YYYY-MM-DD standard.
# ISO Format
as.Date("2023-07-02")
[1] "2023-07-02"
Most other formats will not give the date you are expecting.
# Alternative formats
as.Date("07-02-2023")
[1] "0007-02-20"
as.Date("02/07/2023")
[1] "0002-07-20"
as.Date("07-02-23")
[1] "0007-02-23"
as.Date("02/07/23")
[1] "0002-07-23"
If you need to read dates that are not in the standard YYYY-MM-DD format, you can use the format argument to specify the correct format.
# Read dates using format
as.Date("07-02-2023", format = "%m-%d-%Y")
[1] "2023-07-02"
as.Date("07/02/2023", format = "%m/%d/%Y")
[1] "2023-07-02"
as.Date("07-02-23", format = "%m-%d-%y")
[1] "2023-07-02"
as.Date("07/02/23", format = "%m/%d/%y")
[1] "2023-07-02"
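A common workflow is to parse a non-standard date once and then carry it forward in ISO 8601 form. The sketch below also shows that, when a format is supplied, a string that does not correspond to a real date becomes NA rather than raising an error, so it is worth checking for NA after conversion. The object name `d` is used only for this illustration.

```r
# Parse a US-style date, then print it back in ISO 8601 form
d <- as.Date("07/02/2023", format = "%m/%d/%Y")
format(d, "%Y-%m-%d")

# With an explicit format, a string that is not a valid date becomes NA
as.Date("02/30/2023", format = "%m/%d/%Y")

# Check for failed conversions before using the result
is.na(as.Date(c("07/02/2023", "02/30/2023"), format = "%m/%d/%Y"))
```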
Read the help files for strftime and strptime for more options for setting the format of date (and time) objects.
When printing out the date, the default is to print out using the YYYY-MM-DD standard. If you want to print out in a different format, you will need to specify the desired format using the format function.
# Format date output
d2 # default
[1] "2023-07-02"
format(d2, "%m/%d/%y")
[1] "07/02/23"
format(d2, "%a %b %d, %Y")
[1] "Sun Jul 02, 2023"
Vectors, matrices, and arrays can only have one type of data.
# Mixing types
c(TRUE, FALSE, 2) # logicals converted to numeric
[1] 1 0 2
c("a", 2, pi) # numeric converted to character
[1] "a" "2" "3.14159265358979"
c(TRUE, "a", FALSE) # logicals converted to character
[1] "TRUE" "a" "FALSE"
c(TRUE, "a", 2) # everything converted to character
[1] "TRUE" "a" "2"
c(c(TRUE, 1), "a") # logical converted to numeric first
[1] "1" "1" "a"
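The conversions above happen implicitly. When you want control over the result, the as.*() functions perform the same conversions explicitly, as in this short sketch.

```r
# Explicit conversion with the as.*() family
as.numeric(c(TRUE, FALSE))      # logical to numeric
as.character(c(1, 2.5))         # numeric to character
as.numeric(c("1", "2.5"))       # character to numeric
as.logical(c("TRUE", "FALSE"))  # character to logical

# Because logicals convert to 1 and 0, sum() counts the TRUE values
sum(c(TRUE, FALSE, TRUE))
```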
Data frames are special matrices that allow each column to be a different data type.
Construct a data frame using the data.frame function.
# Construct data frame
d <- data.frame(
  var1 = c(1:2),
  var2 = c(TRUE, FALSE),
  var3 = c("a","b")
)
is.data.frame(d)
[1] TRUE
The techniques to access matrix elements can also be used with data frames.
# Access using indices
d[1, 2]
[1] TRUE
d[1, ]
var1 var2 var3
1 1 TRUE a
d[ , -2]
var1 var3
1 1 a
2 2 b
# Access using names
rownames(d)
[1] "1" "2"
colnames(d)
[1] "var1" "var2" "var3"
d$var1
[1] 1 2
d[,c("var2","var3")]
var2 var3
1 TRUE a
2 FALSE b
# Type conversion
is.vector(d$var1) # converted to vector
[1] TRUE
is.data.frame(d[,c("var2","var3")]) # still a data frame
[1] TRUE
is.vector( d[,"var1"])
[1] TRUE
is.data.frame(d[,"var1"])
[1] FALSE
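The type conversion shown above, where selecting a single column returns a vector, can be switched off with the drop argument. A minimal sketch using the same data frame:

```r
d <- data.frame(var1 = 1:2, var2 = c(TRUE, FALSE), var3 = c("a", "b"))

# Selecting a single column drops to a vector by default
is.data.frame(d[, "var1"])

# drop = FALSE keeps the result as a one-column data frame
is.data.frame(d[, "var1", drop = FALSE])

# Double brackets extract a single column as a vector, like $
identical(d[["var1"]], d$var1)
```

Using drop = FALSE is helpful in code where the number of selected columns varies and the result must always be a data frame.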
There are a couple of other functions that make constructing data frames easier. The expand.grid() function constructs a data frame with every combination of the variables provided.
# Every combination
eg <- expand.grid(
  var1 = c(1:2),
  var2 = c(TRUE, FALSE),
  var3 = c("a","b")
)
eg
var1 var2 var3
1 1 TRUE a
2 2 TRUE a
3 1 FALSE a
4 2 FALSE a
5 1 TRUE b
6 2 TRUE b
7 1 FALSE b
8 2 FALSE b
Another function is tribble(), from the tibble package, which allows us to construct data frames in a more user-readable way. The first row in this function gives the column names.
# Construct data frame by rows
d2 <- tibble::tribble(
  ~var1, ~var2, ~var3,
  1, TRUE, "a",
  2, FALSE, "b"
)
is.data.frame(d2)
[1] TRUE
Generally tidyverse functions are constructed to 1) have a data.frame as the first argument and 2) return a data.frame. This allows for a tidyverse data pipeline.
Vectors in R can only contain one data type. A list is an alternative type of vector that can contain any other type of object in each element of the list.
To construct a list, you can use the list() function.
# Construct a simple list
l <- list(1, "a", TRUE)
# Check if this is a list
is.list(l)
[1] TRUE
List elements can be named.
# By default there are no names
names(l)
NULL
# Assign some names
names(l) <- c("one", "two", "three")
# View the assigned names
names(l)
[1] "one" "two" "three"
If the list elements are named, they can be accessed using those names and a $ sign.
l$one # first element of the list
[1] 1
l$two # second element of the list
[1] "a"
Lists can also be constructed by using the names during construction.
# Construct a list using names
l <- list(
  numbers. = c(1, 2, 3.5),
  characters = c("a", "character", "vector", "in", "a", "list"),
  logicals = c(TRUE, FALSE, TRUE)
)
l[[1]] # first list element
[1] 1.0 2.0 3.5
l$characters # list element named `characters`
[1] "a" "character" "vector" "in" "a" "list"
As we have already seen, list elements can be any type of object. Thus you can have lists within lists.
# Construct a list within a list
l <- list(
  this = 1,
  that = list(a = "list element within a list")
)
is.list(l)
[1] TRUE
Lists can be accessed in a variety of ways similar to vectors, but you need to use double square brackets.
# Access using indices
l[[1]]
[1] 1
l[[2]]
$a
[1] "list element within a list"
# Access using names
l[['that']]
$a
[1] "list element within a list"
l[['that']][['a']]
[1] "list element within a list"
# Accessing using $
l$that
$a
[1] "list element within a list"
l$that$a
[1] "list element within a list"
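Single square brackets also work on lists, but they return a list containing the selected elements rather than the elements themselves. The following sketch contrasts the two forms using the same list.

```r
l <- list(this = 1, that = list(a = "list element within a list"))

# Single brackets return a (shorter) list
is.list(l[1])
length(l[1])

# Double brackets return the element itself
is.list(l[[1]])
is.numeric(l[[1]])

# Single brackets can select several elements at once
names(l[c("this", "that")])
```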
Many functions that perform statistical analyses return output that is a list.
# Regression
m <- lm(breaks ~ wool + tension,
        data = warpbreaks)
is.list(m)
[1] TRUE
s <- summary(m)
is.list(s)
[1] TRUE
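Since the fitted model is a list, its pieces can be inspected with the same tools used for any other list. A brief sketch, using the same regression:

```r
m <- lm(breaks ~ wool + tension, data = warpbreaks)

# The names of the list elements
names(m)

# Extract one element with $
m$coefficients

# An accessor function such as coef() returns the same values
coef(m)
```

Where an accessor function exists, it is generally a safer choice than reaching into the list directly, because it does not depend on the internal element names.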