June 21st, 2021
In this worksheet, we will use the library tidyverse:
library(tidyverse)
pivot_longer()
and pivot_wider()
Consider the following two data sets, male_haireyecolor
and female_haireyecolor
. The data sets record the occurrence of hair and eye color phenotype combinations in a class of statistics students. Use head()
to preview these data sets; are they tidy?
# download male data set
male_haireyecolor <- read_csv("https://rachaelcox.github.io/classes/datasets/male_haireyecolor.csv")
head(male_haireyecolor)
## # A tibble: 4 x 5
## Hair Brown Blue Hazel Green
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Black 32 11 10 3
## 2 Brown 53 50 25 15
## 3 Red 10 10 7 7
## 4 Blond 3 30 5 8
# download female data set
female_haireyecolor <- read_csv("https://rachaelcox.github.io/classes/datasets/female_haireyecolor.csv")
head(female_haireyecolor)
## # A tibble: 4 x 5
## Hair Brown Blue Hazel Green
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Black 36 9 5 2
## 2 Brown 66 34 29 14
## 3 Red 16 7 7 7
## 4 Blond 4 64 5 8
The data set is not tidy, because the columns Brown
, Blue
, Hazel
, and Green
are observations for a single variable, eye color. The following versions of the tables are tidy:
# pivot all columns except the column `Hair`
male_colors_tidy <- male_haireyecolor %>%
pivot_longer(-Hair, names_to = "EyeColor", values_to = "Count")
head(male_colors_tidy)
## # A tibble: 6 x 3
## Hair EyeColor Count
## <chr> <chr> <dbl>
## 1 Black Brown 32
## 2 Black Blue 11
## 3 Black Hazel 10
## 4 Black Green 3
## 5 Brown Brown 53
## 6 Brown Blue 50
# pivot all columns except the column `Hair`
female_colors_tidy <- female_haireyecolor %>%
pivot_longer(-Hair, names_to = "EyeColor", values_to = "Count")
head(female_colors_tidy)
## # A tibble: 6 x 3
## Hair EyeColor Count
## <chr> <chr> <dbl>
## 1 Black Brown 36
## 2 Black Blue 9
## 3 Black Hazel 5
## 4 Black Green 2
## 5 Brown Brown 66
## 6 Brown Blue 34
There is another syntax that specifies a range of columns that also works:
# pivot all columns in the range of `Brown` to `Green`
male_colors_tidy <- male_haireyecolor %>%
pivot_longer(Brown:Green, names_to = "EyeColor", values_to = "Count")
head(male_colors_tidy)
## # A tibble: 6 x 3
## Hair EyeColor Count
## <chr> <chr> <dbl>
## 1 Black Brown 32
## 2 Black Blue 11
## 3 Black Hazel 10
## 4 Black Green 3
## 5 Brown Brown 53
## 6 Brown Blue 50
# pivot all columns except the column `Hair`
female_colors_tidy <- female_haireyecolor %>%
pivot_longer(Brown:Green, names_to = "EyeColor", values_to = "Count")
head(female_colors_tidy)
## # A tibble: 6 x 3
## Hair EyeColor Count
## <chr> <chr> <dbl>
## 1 Black Brown 36
## 2 Black Blue 9
## 3 Black Hazel 5
## 4 Black Green 2
## 5 Brown Brown 66
## 6 Brown Blue 34
Consider the following data set persons
, which contains information about the sex, weight, and height of 200 individuals:
persons <- read_csv("https://rachaelcox.github.io/classes/datasets/persons.csv")
head(persons)
## # A tibble: 6 x 3
## subject indicator value
## <dbl> <chr> <chr>
## 1 1 sex M
## 2 1 weight 77
## 3 1 height 182
## 4 2 sex F
## 5 2 weight 58
## 6 2 height 161
Is this data set tidy? And can you rearrange it so that you have one column for subject, one for sex, one for weight, and one for height?
The data set is not tidy, because neither indicator
nor value
are variables. The variables are subject
, sex
, weight
, height
. The following version of the table is tidy:
persons %>%
pivot_wider(names_from = "indicator", values_from = "value") %>%
head()
## # A tibble: 6 x 4
## subject sex weight height
## <dbl> <chr> <chr> <chr>
## 1 1 M 77 182
## 2 2 F 58 161
## 3 3 F 53 161
## 4 4 M 68 177
## 5 5 F 59 157
## 6 6 M 76 170