Day 1: Tidying data

In-class worksheet, solutions

June 21st, 2021

In this worksheet, we will use the library tidyverse:

library(tidyverse)

Tidying data with pivot_longer()and pivot_wider()

1.1 Making tables longer

Consider the following two data sets, male_haireyecolor and female_haireyecolor. The data sets record the occurrence of hair and eye color phenotype combinations in a class of statistics students. Use head() to preview these data sets; are they tidy?

# download male data set
male_haireyecolor <- read_csv("https://rachaelcox.github.io/classes/datasets/male_haireyecolor.csv")
head(male_haireyecolor)
## # A tibble: 4 x 5
##   Hair  Brown  Blue Hazel Green
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Black    32    11    10     3
## 2 Brown    53    50    25    15
## 3 Red      10    10     7     7
## 4 Blond     3    30     5     8
# download female data set
female_haireyecolor <- read_csv("https://rachaelcox.github.io/classes/datasets/female_haireyecolor.csv")
head(female_haireyecolor)
## # A tibble: 4 x 5
##   Hair  Brown  Blue Hazel Green
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Black    36     9     5     2
## 2 Brown    66    34    29    14
## 3 Red      16     7     7     7
## 4 Blond     4    64     5     8

The data set is not tidy, because the columns Brown, Blue, Hazel, and Green are observations for a single variable, eye color. The following versions of the tables are tidy:

# pivot all columns except the column `Hair`
male_colors_tidy <- male_haireyecolor %>%
  pivot_longer(-Hair, names_to = "EyeColor", values_to = "Count")
head(male_colors_tidy)
## # A tibble: 6 x 3
##   Hair  EyeColor Count
##   <chr> <chr>    <dbl>
## 1 Black Brown       32
## 2 Black Blue        11
## 3 Black Hazel       10
## 4 Black Green        3
## 5 Brown Brown       53
## 6 Brown Blue        50
# pivot all columns except the column `Hair`
female_colors_tidy <- female_haireyecolor %>%
  pivot_longer(-Hair, names_to = "EyeColor", values_to = "Count")
head(female_colors_tidy)
## # A tibble: 6 x 3
##   Hair  EyeColor Count
##   <chr> <chr>    <dbl>
## 1 Black Brown       36
## 2 Black Blue         9
## 3 Black Hazel        5
## 4 Black Green        2
## 5 Brown Brown       66
## 6 Brown Blue        34

There is another syntax that specifies a range of columns that also works:

# pivot all columns in the range of `Brown` to `Green`
male_colors_tidy <- male_haireyecolor %>%
  pivot_longer(Brown:Green, names_to = "EyeColor", values_to = "Count")
head(male_colors_tidy)
## # A tibble: 6 x 3
##   Hair  EyeColor Count
##   <chr> <chr>    <dbl>
## 1 Black Brown       32
## 2 Black Blue        11
## 3 Black Hazel       10
## 4 Black Green        3
## 5 Brown Brown       53
## 6 Brown Blue        50
# pivot all columns except the column `Hair`
female_colors_tidy <- female_haireyecolor %>%
  pivot_longer(Brown:Green, names_to = "EyeColor", values_to = "Count")
head(female_colors_tidy)
## # A tibble: 6 x 3
##   Hair  EyeColor Count
##   <chr> <chr>    <dbl>
## 1 Black Brown       36
## 2 Black Blue         9
## 3 Black Hazel        5
## 4 Black Green        2
## 5 Brown Brown       66
## 6 Brown Blue        34

1.2 Making tables wider

Consider the following data set persons, which contains information about the sex, weight, and height of 200 individuals:

persons <- read_csv("https://rachaelcox.github.io/classes/datasets/persons.csv")
head(persons)
## # A tibble: 6 x 3
##   subject indicator value
##     <dbl> <chr>     <chr>
## 1       1 sex       M    
## 2       1 weight    77   
## 3       1 height    182  
## 4       2 sex       F    
## 5       2 weight    58   
## 6       2 height    161

Is this data set tidy? And can you rearrange it so that you have one column for subject, one for sex, one for weight, and one for height?

The data set is not tidy, because neither indicator nor value are variables. The variables are subject, sex, weight, height. The following version of the table is tidy:

persons %>%
  pivot_wider(names_from = "indicator", values_from = "value") %>%
  head()
## # A tibble: 6 x 4
##   subject sex   weight height
##     <dbl> <chr> <chr>  <chr> 
## 1       1 M     77     182   
## 2       2 F     58     161   
## 3       3 F     53     161   
## 4       4 M     68     177   
## 5       5 F     59     157   
## 6       6 M     76     170