November 3rd, 2021
There are many external packages that computational scientists rely on for data analysis (e.g., DESeq2
for RNA-seq processing), but perhaps one of the most important is the tidyverse
library. The tidyverse
library is a collection of R packages made for streamlining interactive data analysis. Most of the core functions are designed to interact with tabular data in a “tidy” format, where there is 1 column per variable and and 1 row per observation (see slides for examples). Though there are many packages within the tidyverse
, we’re going to focus on two of the most commonly used: dplyr
for data manipulation, and ggplot
for data visualization.
select()
, filter()
)Given a large and/or complicated data set, we may only be interested in examining a few variables or conditions. We will use the built-in msleep
data set to demonstrate how to subset data. Run the code chunk below to take a glance at the data.
head(msleep)
## # A tibble: 6 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chee… Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl … Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Moun… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Grea… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Thre… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
The msleep
data set contains sleep cycle data for a number of mammals, in addition to information related to each mammals’ diet, taxonomic classification, conservation status, body weight, and brain weight. Let’s say we are only interested in in the name
and sleep_total
variables for each mammal.
We can extract these columns with the select()
function, and then save the result to a new data frame called msleep_subset
:
# extract the "name" and "sleep_total" columns
msleep_subset <- msleep %>%
select(name, sleep_total)
Let’s say we want to know which mammals in these data sleep for less than 8 hours per day. To obtain this information, we need to use filter()
to extract the rows where sleep_total
is less than 8. Notice that, in this example, I do not assign the result to a new data frame, and thus the result of the function is not “saved” (i.e., only displayed on the console or inline output).
# filter subsetted msleep data for mammals that
# sleep less than 8 hours
msleep_subset %>%
filter(sleep_total < 8)
## # A tibble: 21 x 2
## name sleep_total
## <chr> <dbl>
## 1 Cow 4
## 2 Vesper mouse 7
## 3 Roe deer 3
## 4 Goat 5.3
## 5 Tree hyrax 5.3
## 6 Asian elephant 3.9
## 7 Horse 2.9
## 8 Donkey 3.1
## 9 Giraffe 1.9
## 10 Pilot whale 2.7
## # … with 11 more rows
One of the best features of the tidyverse
is that it encourages piping, which is a process that executes multiple functions at once on a data set. This feature can save us many, many lines of codes while still maintaining excellent readability. Referring to the above example, we can chain both steps together using %>%
, also known as the tidyverse
pipe:
msleep_filt <- msleep %>%
select(name, sleep_total) %>%
filter(sleep_total < 8)
msleep_filt
## # A tibble: 21 x 2
## name sleep_total
## <chr> <dbl>
## 1 Cow 4
## 2 Vesper mouse 7
## 3 Roe deer 3
## 4 Goat 5.3
## 5 Tree hyrax 5.3
## 6 Asian elephant 3.9
## 7 Horse 2.9
## 8 Donkey 3.1
## 9 Giraffe 1.9
## 10 Pilot whale 2.7
## # … with 11 more rows
Problem Set #1:
msleep
data set and use select()
to extract the columns name
, vore
, and awake
. Assign it to a new data frame called msleep_new
.msleep_new
such that awake
is less than 8.# your R code here
Challenge Problem: If that was easy… Which carnivores have total sleep greater than or equal to 4 hours and recorded body weight greater than or equal to 100 kg?
# your R code here
group_by()
and summarize()
)Most of the time, our goal is to learn something about the relationships between the variables in our data set. For instance, the msleep
data set samples a diversity of mammals and their sleeping habits. For instance, does average time slept vary by diet? With group_by
and summarize
, we can do a quick investigation.
# calculate the mean and median sleep total
# for each type of diet
msleep %>%
group_by(vore) %>%
summarize(median_sleep = median(sleep_total),
mean_sleep = mean(sleep_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## vore median_sleep mean_sleep
## <chr> <dbl> <dbl>
## 1 carni 10.4 10.4
## 2 herbi 10.3 9.51
## 3 insecti 18.1 14.9
## 4 omni 9.9 10.9
## 5 <NA> 10.6 10.2
Note the way I’ve structured the above code. At the beginning, the data set we’re manipulating is on the first line, and each subsequent function the data passes through gets a new line/identation. Though not functionally necessary, styling your R/tidyverse code this way greatly improves readability. Readability of your code should ALWAYS be high priority.
We can also probe potential biases in our data set–for example, the sample size for each of the diets in question.
# add a calculation for the # of mammals sampled per diet
msleep %>%
group_by(vore) %>%
summarize(median_sleep = median(sleep_total),
mean_sleep = mean(sleep_total),
count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## vore median_sleep mean_sleep count
## <chr> <dbl> <dbl> <int>
## 1 carni 10.4 10.4 19
## 2 herbi 10.3 9.51 32
## 3 insecti 18.1 14.9 5
## 4 omni 9.9 10.9 20
## 5 <NA> 10.6 10.2 7
The number of insectivores sampled is fairly small compared to the other types of diets.
Problem Set #2:
msleep
data set, again group_by()
the vore
column and summarize()
the median()
of bodywt
.# your R code here
Challenge Problem: If that was easy… How many species within the order “Rodentia” have carnivore, herbivore, and omnivore diets respectively? Hint: This will require using the filter()
function first, before group_by()
and summarize()
.
# your R code here
Arguably the Tidyverse is the most used and well-known for its data visualization package, ggplot2
. Its syntax is similar to but distinct from other tidyverse
packages. In our previous examples using dplyr
, we chained functions together using the pipe (%>%
). When making a ggplot, we layer plot functions using the plus sign (+
). Continuing to use the msleep
data set, let’s visualize the relationship between mammal bodywt
and sleep_total
.
# plot total sleep vs body weight for each mammal
ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
geom_point()
It looks like there might be correlation between body mass and time spent asleep. However, most of the data points are bunched up at small values of bodywt
. We can add an additional layer, scale_x_log10()
, to better visualize these samples.
# convert x-axis to log scale
ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
geom_point() +
scale_x_log10()
Another feature of ggplot()
is the ability to color, shape, or size points by other variables in the data set. Earlier, we were interested in the correlation between diet and sleep total. Now, we’ll visualize the diet variable on the same plot with color.
# color the points by diet
ggplot(msleep, aes(x = bodywt, y = sleep_total, color = vore)) +
geom_point() +
scale_x_log10()
Next, we can apply a number of layers to the plot to make it publication ready. The layer labs()
will allow us to change the legends, we can specify point size inside geom_point()
, and we can manually set our desired color palette. Additionally, we can pre-filter NAs from our data set before passing the data to ggplot
. Note that the dplyr::filter
function pipes into ggplot()
, and therefore we do not have to specify the data set as our first argument to the ggplot()
function.
# make it pretty
msleep %>%
filter(!is.na(vore)) %>%
ggplot(aes(x = bodywt, y = sleep_total, color = vore)) +
geom_point(size = 3) +
scale_x_log10() +
labs(x = "Body weight (kg)", y = "Total time asleep (hr)", color = "Diet") +
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2",
"#D55E00", "#CC79A7", "#999999"))
Problem Set #3:
ggplot
and geom_point
. Specify the x-axis as brainwt
and the y-axis as sleep_rem
.scale_x_log10()
.color = vore
.labs()
.# your R code here
Challenge Problem: If that was easy… For all the samples within the “Rodentia” order, plot the distribution of body weight by diet using geom_boxplot()
. Hint: You’ll need to filter()
the data before passing it to ggplot()
.
# your R code here