Introduction to R

In-class worksheet #2

November 3rd, 2021

There are many external packages that computational scientists rely on for data analysis (e.g., DESeq2 for RNA-seq processing), but perhaps one of the most important is the tidyverse library. The tidyverse library is a collection of R packages made for streamlining interactive data analysis. Most of the core functions are designed to interact with tabular data in a “tidy” format, where there is 1 column per variable and and 1 row per observation (see slides for examples). Though there are many packages within the tidyverse, we’re going to focus on two of the most commonly used: dplyr for data manipulation, and ggplot for data visualization.


1. Data extraction (select(), filter())

Given a large and/or complicated data set, we may only be interested in examining a few variables or conditions. We will use the built-in msleep data set to demonstrate how to subset data. Run the code chunk below to take a glance at the data.

head(msleep)
## # A tibble: 6 x 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Chee… Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl … Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Moun… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Grea… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow   Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Thre… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>

The msleep data set contains sleep cycle data for a number of mammals, in addition to information related to each mammals’ diet, taxonomic classification, conservation status, body weight, and brain weight. Let’s say we are only interested in in the name and sleep_total variables for each mammal.

We can extract these columns with the select() function, and then save the result to a new data frame called msleep_subset:

# extract the "name" and "sleep_total" columns
msleep_subset <- msleep %>%
  select(name, sleep_total)

Let’s say we want to know which mammals in these data sleep for less than 8 hours per day. To obtain this information, we need to use filter() to extract the rows where sleep_total is less than 8. Notice that, in this example, I do not assign the result to a new data frame, and thus the result of the function is not “saved” (i.e., only displayed on the console or inline output).

# filter subsetted msleep data for mammals that
# sleep less than 8 hours
msleep_subset %>%
  filter(sleep_total < 8)
## # A tibble: 21 x 2
##    name           sleep_total
##    <chr>                <dbl>
##  1 Cow                    4  
##  2 Vesper mouse           7  
##  3 Roe deer               3  
##  4 Goat                   5.3
##  5 Tree hyrax             5.3
##  6 Asian elephant         3.9
##  7 Horse                  2.9
##  8 Donkey                 3.1
##  9 Giraffe                1.9
## 10 Pilot whale            2.7
## # … with 11 more rows

One of the best features of the tidyverse is that it encourages piping, which is a process that executes multiple functions at once on a data set. This feature can save us many, many lines of codes while still maintaining excellent readability. Referring to the above example, we can chain both steps together using %>%, also known as the tidyverse pipe:

msleep_filt <- msleep %>%
  select(name, sleep_total) %>%
  filter(sleep_total < 8)
msleep_filt
## # A tibble: 21 x 2
##    name           sleep_total
##    <chr>                <dbl>
##  1 Cow                    4  
##  2 Vesper mouse           7  
##  3 Roe deer               3  
##  4 Goat                   5.3
##  5 Tree hyrax             5.3
##  6 Asian elephant         3.9
##  7 Horse                  2.9
##  8 Donkey                 3.1
##  9 Giraffe                1.9
## 10 Pilot whale            2.7
## # … with 11 more rows

Problem Set #1:

  1. Take the msleep data set and use select() to extract the columns name, vore, and awake. Assign it to a new data frame called msleep_new.
  2. Filter the result from msleep_new such that awake is less than 8.
# your R code here

Challenge Problem: If that was easy… Which carnivores have total sleep greater than or equal to 4 hours and recorded body weight greater than or equal to 100 kg?

# your R code here

2. Data aggregation (group_by() and summarize())

Most of the time, our goal is to learn something about the relationships between the variables in our data set. For instance, the msleep data set samples a diversity of mammals and their sleeping habits. For instance, does average time slept vary by diet? With group_by and summarize, we can do a quick investigation.

# calculate the mean and median sleep total
# for each type of diet
msleep %>%
  group_by(vore) %>%
  summarize(median_sleep = median(sleep_total),
            mean_sleep = mean(sleep_total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   vore    median_sleep mean_sleep
##   <chr>          <dbl>      <dbl>
## 1 carni           10.4      10.4 
## 2 herbi           10.3       9.51
## 3 insecti         18.1      14.9 
## 4 omni             9.9      10.9 
## 5 <NA>            10.6      10.2

Note the way I’ve structured the above code. At the beginning, the data set we’re manipulating is on the first line, and each subsequent function the data passes through gets a new line/identation. Though not functionally necessary, styling your R/tidyverse code this way greatly improves readability. Readability of your code should ALWAYS be high priority.

We can also probe potential biases in our data set–for example, the sample size for each of the diets in question.

# add a calculation for the # of mammals sampled per diet
msleep %>%
  group_by(vore) %>%
  summarize(median_sleep = median(sleep_total),
            mean_sleep = mean(sleep_total),
            count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   vore    median_sleep mean_sleep count
##   <chr>          <dbl>      <dbl> <int>
## 1 carni           10.4      10.4     19
## 2 herbi           10.3       9.51    32
## 3 insecti         18.1      14.9      5
## 4 omni             9.9      10.9     20
## 5 <NA>            10.6      10.2      7

The number of insectivores sampled is fairly small compared to the other types of diets.

Problem Set #2:

  1. How does the median body weight of mammals vary by diet? Using the msleep data set, again group_by() the vore column and summarize() the median() of bodywt.
# your R code here

Challenge Problem: If that was easy… How many species within the order “Rodentia” have carnivore, herbivore, and omnivore diets respectively? Hint: This will require using the filter() function first, before group_by() and summarize().

# your R code here

3. Data visualization

Arguably the Tidyverse is the most used and well-known for its data visualization package, ggplot2. Its syntax is similar to but distinct from other tidyverse packages. In our previous examples using dplyr, we chained functions together using the pipe (%>%). When making a ggplot, we layer plot functions using the plus sign (+). Continuing to use the msleep data set, let’s visualize the relationship between mammal bodywt and sleep_total.

# plot total sleep vs body weight for each mammal
ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point()

It looks like there might be correlation between body mass and time spent asleep. However, most of the data points are bunched up at small values of bodywt. We can add an additional layer, scale_x_log10(), to better visualize these samples.

# convert x-axis to log scale
ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  scale_x_log10()

Another feature of ggplot() is the ability to color, shape, or size points by other variables in the data set. Earlier, we were interested in the correlation between diet and sleep total. Now, we’ll visualize the diet variable on the same plot with color.

# color the points by diet
ggplot(msleep, aes(x = bodywt, y = sleep_total, color = vore)) +
  geom_point() +
  scale_x_log10()

Next, we can apply a number of layers to the plot to make it publication ready. The layer labs() will allow us to change the legends, we can specify point size inside geom_point(), and we can manually set our desired color palette. Additionally, we can pre-filter NAs from our data set before passing the data to ggplot. Note that the dplyr::filter function pipes into ggplot(), and therefore we do not have to specify the data set as our first argument to the ggplot() function.

# make it pretty
msleep %>%
  filter(!is.na(vore)) %>% 
  ggplot(aes(x = bodywt, y = sleep_total, color = vore)) +
    geom_point(size = 3) +
    scale_x_log10() +
    labs(x = "Body weight (kg)", y = "Total time asleep (hr)", color = "Diet") +
    scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2", 
                                  "#D55E00", "#CC79A7", "#999999"))

Problem Set #3:

  1. Plot the relationship between REM sleep and brain weight using ggplot and geom_point. Specify the x-axis as brainwt and the y-axis as sleep_rem.
  2. Convert the x-axis to a log scale using scale_x_log10().
  3. Color the points by specifying color = vore.
  4. Rename the axes labels using labs().
# your R code here

Challenge Problem: If that was easy… For all the samples within the “Rodentia” order, plot the distribution of body weight by diet using geom_boxplot(). Hint: You’ll need to filter() the data before passing it to ggplot().

# your R code here