In this chapter, we will look at how to explore a data set in R. Many base R functions are useful for finding out what a data set contains. When the data set is small, you can simply look at the whole data.frame.

BOD # built-in data set, small enough to print in full
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8

Most data sets these days are too big to inspect this way. Thus, we need some ways to understand what is contained in the data before we can visualize or model it.

19.1 Metadata

The metadata of a data set is the information about the data. We might want to know how many observations there are, how many variables there are, what types of variables the data set has, and so on.

Let’s first look at the dimensions of the data set.

# Dimensions
nrow(diamonds) # number of rows
[1] 53940
ncol(diamonds) # number of columns
[1] 10
dim(diamonds)  # both
[1] 53940    10

Now let’s take a look at the variables in the data set.

# Names
names(diamonds)    # variable names
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      
colnames(diamonds) # column names (same as variable names)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      
# rownames(diamonds) # row names (typically unimportant)

The str() function provides a nice overview that includes the dimensions, variable names, variable types, and first few values.

# Overview
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
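
When only the variable types are of interest, a more compact alternative to str() is to ask for the class of each column. The following is a minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical, not part of diamonds):

```r
# Hypothetical data frame with one column of each common type
toy <- data.frame(
  id    = 1:3,                               # integer
  size  = c(0.5, 1.2, 0.9),                  # numeric (double)
  grade = factor(c("low", "high", "low")),   # factor
  note  = c("a", "b", "c"),                  # character
  stringsAsFactors = FALSE                   # keep `note` as character on R < 4.0
)

# Class of every column, returned as a named character vector
col_types <- sapply(toy, class)
col_types

# Count how many columns there are of each type
table(col_types)
```

On diamonds, the ordered factors carry two classes ("ordered" and "factor"), so sapply() would return a list rather than a character vector there.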

19.2 View

It is often handy to take a quick look at (part of) the data.frame.

# Quick view
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# Full view
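
As a hedged illustration of both kinds of view: head() and tail() accept an optional row count n, and View() opens a spreadsheet-style viewer in an interactive session. The sketch below uses the built-in mtcars data set rather than diamonds so that it runs in base R. The View() call is commented out because it needs an interactive session.

```r
# Built-in data set used here so the example runs without extra packages
first3 <- head(mtcars, n = 3)   # first 3 rows
last2  <- tail(mtcars, n = 2)   # last 2 rows
first3
last2

# Full view: opens a spreadsheet-style viewer (interactive sessions only)
# View(mtcars)
```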

To access an individual variable, we can use the $ operator.

# View variable
head(diamonds$carat)
[1] 0.23 0.21 0.23 0.29 0.31 0.24
head(diamonds$cut)
[1] Ideal     Premium   Good      Premium   Good      Very Good
Levels: Fair < Good < Very Good < Premium < Ideal
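
Besides $, a column can also be extracted with [[ ]], which takes the name as a string and is therefore convenient when the name is stored in a variable. A minimal sketch on a made-up data frame (`toy` and `col` are hypothetical names):

```r
# Hypothetical data frame with two columns
toy <- data.frame(carat = c(0.23, 0.21, 0.29),
                  price = c(326L, 326L, 334L))

by_dollar  <- toy$carat        # extract by name with $
by_bracket <- toy[["carat"]]   # same vector, name given as a string

col <- "price"                 # column name held in a variable
by_variable <- toy[[col]]      # works; toy$col would look for a column called "col"

identical(by_dollar, by_bracket)
```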

19.3 Numeric

For numeric variables, we have a wide variety of functions available, including the usual descriptive statistics.

# Central tendency
mean(diamonds$carat)
[1] 0.7979397
median(diamonds$carat)
[1] 0.7
quantile(diamonds$carat, c(.25, .75))
 25%  75% 
0.40 1.04 
# Spread
sd(diamonds$carat)
[1] 0.4740112
var(diamonds$carat)
[1] 0.2246867
range(diamonds$carat)
[1] 0.20 5.01
# 6-number summary
summary(diamonds$carat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2000  0.4000  0.7000  0.7979  1.0400  5.0100 
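
To make the link between these functions and their output concrete, here is a small sketch on a made-up vector (`x` is hypothetical) whose statistics can be checked by hand. It also shows na.rm = TRUE, which is needed once missing values are present.

```r
x <- c(2, 4, 4, 5, 10)   # hypothetical values

mean(x)      # (2 + 4 + 4 + 5 + 10) / 5 = 5
median(x)    # middle value of the sorted data = 4
var(x)       # sum of squared deviations (36) / (n - 1) = 9
sd(x)        # square root of the variance = 3
range(x)     # smallest and largest value: 2 10
IQR(x)       # 75th percentile minus 25th percentile

# With a missing value, these functions return NA unless na.rm = TRUE
x_na <- c(x, NA)
mean(x_na)                 # NA
mean(x_na, na.rm = TRUE)   # 5
```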

19.4 Categorical

For categorical data, we typically just have the number of observations for each level.

nlevels(diamonds$cut) # number of factor levels
[1] 5
levels(diamonds$cut)  # factor level values
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
table(diamonds$cut)   # number of observations for each factor level

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 
summary(diamonds$cut) # same as table()
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 
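
Counts are often easier to compare as proportions. A minimal sketch, assuming a made-up factor (`grade` is hypothetical), applies prop.table() to the result of table():

```r
# Hypothetical factor with an explicit level order
grade <- factor(c("low", "high", "low", "mid", "low"),
                levels = c("low", "mid", "high"))

counts <- table(grade)        # observations per level: low 3, mid 1, high 1
props  <- prop.table(counts)  # share of observations per level

counts
round(props, 2)
```

The same pattern should carry over to a real factor such as diamonds$cut.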

19.5 Summary

A quick way to summarize all of the variables in the data set is to use summary() on the entire data set.

summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  

This summary is not as effective if a variable is a character rather than a factor. By default, functions such as read.csv() read non-numeric variables in as character (since R 4.0.0).

# Summary of character
summary(as.character(diamonds$cut))
   Length     Class      Mode 
    53940 character character 
table(as.character(diamonds$cut)) # still informative

     Fair      Good     Ideal   Premium Very Good 
     1610      4906     21551     13791     12082
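
Converting a character variable to a factor restores the more informative summary. The sketch below uses a made-up character vector (`cut_chr` is hypothetical). Passing levels to factor() also replaces the alphabetical ordering seen in the table() output above with a meaningful one.

```r
cut_chr <- c("Good", "Ideal", "Fair", "Ideal", "Good", "Ideal")  # hypothetical values

summary(cut_chr)          # only Length, Class, and Mode

# Convert to a factor, stating the level order explicitly
cut_fct <- factor(cut_chr, levels = c("Fair", "Good", "Ideal"))
summary(cut_fct)          # counts per level: Fair 1, Good 2, Ideal 3
```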