Day 3: Data manipulation & analysis

In-class worksheet

June 23rd, 2021

1. Row/column addition (mutate(), bind_rows(), left_join())

The raw output from many types of high-throughput molecular biology experiments comes a unique sample ID (e.g., protein ID, gene ID) and that samples raw output response. (e.g., peptide abundance via mass spectrometry experiments, mRNA abundance via RNA-seq). Often we would like to accomplish the following:

  1. Add new columns that perform calculations on the raw data or contain “meta data” associated with that experiment (mutate()).
  2. Join multiple experiments into one data frame (bind_rows()).
  3. Import annotations to sample IDs with the hope of extracting broad trends about the behavior of biological systems in different contexts (left_join()).

To demonstrate these concepts, we’ll use real affinity purification mass spectrometry (APMS) data sets: frog_dnai2 and frog_heatr2. Download them below:

# download the dnai2 pull down data set
frog_dnai2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_dnai2.csv")
head(frog_dnai2)
## # A tibble: 6 x 6
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2 
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76
## 4 ENOG411CT94         1       35  17.7       4.15        5.64
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16
# # download the heatr pull down data set
frog_heatr2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_heatr2.csv")
head(frog_heatr2)
## # A tibble: 6 x 6
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411CUGY         0      274 241.         7.91      16.0 
## 2 ENOG411CWA4        31       79   2.19       1.13       3.98
## 3 ENOG411CTH9        17       54   2.68       1.42       3.94
## 4 ENOG411CSRQ        17       49   2.43       1.28       3.49
## 5 ENOG411CTE0         0       10   9.64       3.27       3.05
## 6 ENOG411CYEZ         5       22   3.36       1.75       3.01

1.1 Adding a new column

For each APMS data frame, use mutate() to add a new column named bait that contains a string naming the bait used in that experiment. For instance, the frog_dnai2 data frame should have a new column simply reading “dnai2” for each row in the new column.

The purpose of this is to prepare the data frames to be bound together; without this new column, we would not know observation belongs to which experiment once the data frames are combined.

# create a new column called 'bait' with dnai2 label
dnai2_labeled <- frog_dnai2 %>%
  mutate(bait = 'dnai2')
head(dnai2_labeled)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait 
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2  dnai2
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90 dnai2
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76 dnai2
## 4 ENOG411CT94         1       35  17.7       4.15        5.64 dnai2
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47 dnai2
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16 dnai2
# create a new column called `bait` with heatr2 label
heatr2_labeled <- frog_heatr2 %>%
  mutate(bait = 'heatr2')
head(heatr2_labeled)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait  
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr> 
## 1 ENOG411CUGY         0      274 241.         7.91      16.0  heatr2
## 2 ENOG411CWA4        31       79   2.19       1.13       3.98 heatr2
## 3 ENOG411CTH9        17       54   2.68       1.42       3.94 heatr2
## 4 ENOG411CSRQ        17       49   2.43       1.28       3.49 heatr2
## 5 ENOG411CTE0         0       10   9.64       3.27       3.05 heatr2
## 6 ENOG411CYEZ         5       22   3.36       1.75       3.01 heatr2

1.2 Binding two experiments together

Now, using bind_rows() combine the two new data frames together.

combined_df <- bind_rows(dnai2_labeled, heatr2_labeled)
head(combined_df)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait 
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2  dnai2
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90 dnai2
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76 dnai2
## 4 ENOG411CT94         1       35  17.7       4.15        5.64 dnai2
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47 dnai2
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16 dnai2

Another valid way to do this is with rbind(); the difference is that rbind() requires data frames with absolutely identifical column names to ensure a clean concatenation, whereas bind_rows() isn’t so picky.

Bonus challenge: Plot the distribution of z-scores (PSM_zscore) for your combined data frame. Color by experiment. Your choice of geom; see Section 3 of yesterday’s worksheet for plotting distributions. Plot horizontal lines at the critical z-score value of 1.6, corresponding to a roughly 5% false positive rate (or p-value = 0.05) cutoff for a one-sided test (geom_hline or geom_vline will accomplish this).

ggplot(combined_df, aes(x = bait, y = PSM_zscore, fill = bait, color = bait)) +
  geom_violin(alpha = 0.5) +
  geom_jitter(size = 0.5, alpha = 0.5) +   # overlay data points on distribution
  geom_hline(yintercept = 1.6, alpha = 0.75) +    # add lines at z-value cutoffs
  scale_color_viridis_d(option = "B", end = 0.8) +   # color data points
  scale_fill_viridis_d(option = "E")  # color distribution

1.3 Joining an experiment table with an annotation table

Now we have a data frame that contains protein interaction data for both dnai2 and heatr2, but the actual identifiers in the ID are not informative. To investigate the interacting biology, we need to join an annotation table to the APMS data frame. Download the annotations with the code chunk below.

# download the frog gene annotations
frog_annotations <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_enog_annotations.csv")
head(frog_annotations)
## # A tibble: 6 x 6
##   ID    OG_category XENLA_GeneNames HUMAN_GeneNames OG_annotation
##   <chr> <chr>       <chr>           <chr>           <chr>        
## 1 ENOG… S           rhpn1.L, rhpn2… RHPN2P1, RHPN2… Rhophilin, R…
## 2 ENOG… S           arl14epl.L, ar… ARL14EPL        ADP-ribosyla…
## 3 ENOG… Z           myh1-2-1.S, my… MYH3, MYH7B     myosin, heav…
## 4 ENOG… S           loc101733025.L… SHISAL1         kiaa1644     
## 5 ENOG… T           epha3.L, epha4… EPHA5, EPHA3, … Eph receptor 
## 6 ENOG… O           txndc11.L       TXNDC11         thioredoxin …
## # … with 1 more variable: OG_category_annotation <chr>

Note that the annotation data frame has an ID column that matches the ID column in the APMS data frame. Also notice that the annotation data frame has 14292 rows, and our APMS data frame has 3356 rows. The left_join() function keeps all the rows from the data frame used as the “left” input, while joining in places that match the common column (ID in this case) to the right data frame. Use left_join() to combine the APMS data frame with the annotation data frame, such that the final data frame has 3356 rows and 12 columns.

annotated_df <- combined_df %>%
  left_join(frog_annotations, by=c("ID"))

Now each observation is connected to a X. laevis gene name, in addition to the gene’s human cousin. Not all genes have annotations, even though X. laevis is considered a model organism. This is common in bioinformatics.

Bonus challenge: Plot a bar chart with OG_category_annotation on the x-axis. Color by bait. Flip the chart so long descriptions are more readable using the coord_flip() layer. Try facet_wrap(), does this improve clarity? Remove the y-axis label.

ggplot(annotated_df, aes(x = OG_category_annotation, fill = bait)) +
  geom_bar() +
  xlab("") +
  coord_flip() +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8)

2. Subsetting data (filter(), select())

In biological data analysis, we’re often most interested in a specific subset of data–usually scores above some threshold (p-values, z-scores, fold changes in gene expression) or the behavior of a specific gene/protein family relative to time or different conditions..

Perform the following operations to pare down the combined APMS data frame. Use the pipe (%>%) to chain data operations together and save the results to a new data frame in a single step:

  1. Filter the combined and annotated APMS data frame (generated in the previous step) to only contain observations with PSM_zscore > 1.6 (p ~ 0.05) to pull out proteins significantly interacting with dnai2 and heatr2 (use the filter() function).
  2. Then, in the same line of code, pipe (%>%) into select() to extract only the columns named ID, bait, PSM_log2fc, PSM_zscore, OG_category_annotation.
  3. Finally, again in the same line of code, pipe (%>%) into arrange() to re-sort the data frame to descend by PSM_zscore (i.e., highest z-scores should be at the top).
filtered_df <- annotated_df %>%
  filter(PSM_zscore > 1.6) %>%
  select(ID, bait, PSM_log2fc, PSM_zscore, OG_category_annotation) %>%
  arrange(desc(PSM_zscore))
head(filtered_df)
## # A tibble: 6 x 5
##   ID          bait   PSM_log2fc PSM_zscore OG_category_annotation             
##   <chr>       <chr>       <dbl>      <dbl> <chr>                              
## 1 ENOG411CUGY heatr2       7.91      16.0  Function Unknown                   
## 2 ENOG411DGHA dnai2        7.64      14.2  Cytoskeleton                       
## 3 ENOG411CW1E dnai2        1.69       9.90 Lipid metabolism                   
## 4 ENOG411DGH7 dnai2        5.53       6.76 Cytoskeleton                       
## 5 ENOG411CT94 dnai2        4.15       5.64 Cytoskeleton                       
## 6 ENOG411CQ77 dnai2        2.38       5.47 Nucleotide metabolism and transport

Plot another bar plot, the same way we made the plot in Section 1.3; compare the distribution of functional categories for the most significant proteins versus the unfiltered data set depicted in Section 1.3. Bonus challenge: Reorder the bar plots so that the functional categories with the highest counts are grouped together using fct_infreq().

ggplot(filtered_df, aes(x = fct_rev(fct_infreq(OG_category_annotation)), 
                        fill = bait)) +
  geom_bar() +
  xlab("") +
  coord_flip() +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8)

3. Aggregating data (group_by(), summarize(), arrange())

To find broad patterns in big data sets, we can use the group_by() and summarize() functions to get summary information for groups of variables. For instance, the bait column has two groups: dnai2 and heatr2. The OG_category_annotations column has 22 groups if you count the “NA” values resulting from unannotated genes. Use select() and unique() to look at the unique values (groups) in the columns bait and OG_category_annotation.

filtered_df %>% 
  select(bait) %>% 
  unique()
## # A tibble: 2 x 1
##   bait  
##   <chr> 
## 1 heatr2
## 2 dnai2
filtered_df %>%
  select(OG_category_annotation) %>%
  unique()
## # A tibble: 22 x 1
##    OG_category_annotation                
##    <chr>                                 
##  1 Function Unknown                      
##  2 Cytoskeleton                          
##  3 Lipid metabolism                      
##  4 Nucleotide metabolism and transport   
##  5 Energy production and conversion      
##  6 Tranlsation                           
##  7 Intracellular trafficing and secretion
##  8 Amino Acid metabolis and transport    
##  9 Cell wall/membrane/envelop biogenesis 
## 10 <NA>                                  
## # … with 12 more rows

When you group_by() a variable, you are collapsing all the information for the unique values associated with that variable. Then, we use summarize() to extract information associated with the collapsed information.

We want to learn the functional categories for the proteins that are most signficantly measured to be interacting with both dnai2 and heatr2. To do this, code the following steps (as a chain of functions connected by %>%):

  1. Use group_by() to group both the bait and OG_category_annotation columns
  2. Use summarize() to create a new column called joint_zscore that calculates a weighted z-score for all the proteins in each category. The equation for this is:
    • sum(PSM_zscore)/sqrt(length(PSM_zscore)) # sum of all z-scores divided by the number of observations per group
  3. Use arrange() to sort the data frame by bait and then by joint_zscore to get top hits.
summary_df <- filtered_df %>%
  group_by(bait, OG_category_annotation) %>%
  summarize(joint_zscore = sum(PSM_zscore)/sqrt(length(PSM_zscore))) %>%
  arrange(desc(joint_zscore))
## `summarise()` regrouping output by 'bait' (override with `.groups` argument)

Plot a bar chart for the functional categories, but this time put z-score on the y-axis instead of counts. Since you’re specifying the y = argument, you’ll have to use geom_col() instead of geom_bar().

ggplot(summary_df, aes(x = OG_category_annotation,
                       y = joint_zscore, fill = bait)) +
  geom_col() +
  xlab("") +
  coord_flip() +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8)

4. If that was easy…

Bonus challenge: Take the sepal lengths from the iris data set and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you figure out why?

# If you remove the # signs in the lines below you will get an error; this code doesn't work
# iris %>% 
#   select(Sepal.Length, Species) %>%
#   pivot_wider(names_from = "Species", values_from = "Sepal.Length")

The problem is that pivot_wider() does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, pivot_wider() prints a warning about values that are not uniquely identified.

We can avoid this issue by adding a row column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  pivot_wider(names_from = "Species", 
              values_from = "Sepal.Length")
## # A tibble: 50 x 4
##      row setosa versicolor virginica
##    <int>  <dbl>      <dbl>     <dbl>
##  1     1    5.1        7         6.3
##  2     2    4.9        6.4       5.8
##  3     3    4.7        6.9       7.1
##  4     4    4.6        5.5       6.3
##  5     5    5          6.5       6.5
##  6     6    5.4        5.7       7.6
##  7     7    4.6        6.3       4.9
##  8     8    5          4.9       7.3
##  9     9    4.4        6.6       6.7
## 10    10    4.9        5.2       7.2
## # … with 40 more rows

(At the end, if you want, you can delete this column again:)

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  pivot_wider(names_from = "Species", 
              values_from = "Sepal.Length") %>%
  select(-row)
## # A tibble: 50 x 3
##    setosa versicolor virginica
##     <dbl>      <dbl>     <dbl>
##  1    5.1        7         6.3
##  2    4.9        6.4       5.8
##  3    4.7        6.9       7.1
##  4    4.6        5.5       6.3
##  5    5          6.5       6.5
##  6    5.4        5.7       7.6
##  7    4.6        6.3       4.9
##  8    5          4.9       7.3
##  9    4.4        6.6       6.7
## 10    4.9        5.2       7.2
## # … with 40 more rows

Bonus challenge: For the data set diamonds from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use pivot_wider() to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).

price_by_cut <-
  diamonds %>%
  group_by(cut, clarity) %>% 
  summarize(mean.price = mean(price))
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)