Day 3: Data transformation

In-class worksheet, solutions

July 1, 2020

In this worksheet, we will use the library tidyverse:

library(tidyverse)
theme_set(theme_bw(base_size = 12)) # set the default plot theme for the ggplot2

1. Row/column addition (mutate(), bind_rows(), left_join())

The raw output from many types of high-throughput molecular biology experiments comes a unique sample ID (e.g., protein ID, gene ID) and that samples raw output response. (e.g., peptide abundance via mass spectrometry experiments, mRNA abundance via RNA-seq). Often we would like to accomplish the following:

  1. Add new columns that perform calculations on the raw data or contain “meta data” associated with that experiment (mutate()).
  2. Join multiple experiments into one dataframe (bind_rows()).
  3. Import annotations to sample IDs with the hope of extracting broad trends about the behavior of biological systems in different contexts (left_join()).

To demonstrate these concepts, we’ll use real affinity purification mass spectrometry (APMS) data sets: frog_dnai2 and frog_heatr2. Download them below:

# download the dnai2 pull down data set
frog_dnai2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_dnai2.csv")
head(frog_dnai2)
## # A tibble: 6 x 6
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2 
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76
## 4 ENOG411CT94         1       35  17.7       4.15        5.64
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16
# # download the heatr pull down data set
frog_heatr2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_heatr2.csv")
head(frog_heatr2)
## # A tibble: 6 x 6
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411CUGY         0      274 241.         7.91      16.0 
## 2 ENOG411CWA4        31       79   2.19       1.13       3.98
## 3 ENOG411CTH9        17       54   2.68       1.42       3.94
## 4 ENOG411CSRQ        17       49   2.43       1.28       3.49
## 5 ENOG411CTE0         0       10   9.64       3.27       3.05
## 6 ENOG411CYEZ         5       22   3.36       1.75       3.01

1.1 Adding a new column

For each APMS data frame, use mutate() to add a new column named bait that contains a string naming the bait used in that experiment. For instance, the frog_dnai2 data frame should have a new column simply reading “dnai2” for each row in the new column.

The purpose of this is to prepare the data frames to be bound together; without this new column, we would not know observation belongs to which experiment once the data frames are combined.

frog_dnai2_mutated <- frog_dnai2 %>%
  mutate(bait = 'dnai2')
head(frog_dnai2_mutated)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait 
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2  dnai2
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90 dnai2
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76 dnai2
## 4 ENOG411CT94         1       35  17.7       4.15        5.64 dnai2
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47 dnai2
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16 dnai2
frog_heatr2_mutated <- frog_heatr2 %>%
  mutate(bait = 'heatr2')
head(frog_heatr2_mutated)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait  
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr> 
## 1 ENOG411CUGY         0      274 241.         7.91      16.0  heatr2
## 2 ENOG411CWA4        31       79   2.19       1.13       3.98 heatr2
## 3 ENOG411CTH9        17       54   2.68       1.42       3.94 heatr2
## 4 ENOG411CSRQ        17       49   2.43       1.28       3.49 heatr2
## 5 ENOG411CTE0         0       10   9.64       3.27       3.05 heatr2
## 6 ENOG411CYEZ         5       22   3.36       1.75       3.01 heatr2

1.2 Binding two experiments together

Now, using bind_rows() combine the two new dataframes together.

frog_combined <- bind_rows(frog_dnai2_mutated, frog_heatr2_mutated)
head(frog_combined)
## # A tibble: 6 x 7
##   ID          ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait 
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2  dnai2
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90 dnai2
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76 dnai2
## 4 ENOG411CT94         1       35  17.7       4.15        5.64 dnai2
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47 dnai2
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16 dnai2

Another valid way to do this is with rbind(); the difference is that rbind() requires data frames with absolutely identifical column names to ensure a clean concatenation, whereas bind_rows() isn’t so picky.

Bonus challenge: Plot the distribution of z-scores (PSM_zscore) for your combined dataframe. Color by experiment. Your choice of geom; see Section 3 of yesterday’s worksheet for plotting distributions. Plot horizontal lines at a z-score values of 1.6 and -1.6, corresponding to roughly a 5% false discovery rate (geom_hline or geom_vline will accomplish this).

ggplot(frog_combined, aes(x = bait, y = PSM_zscore, fill = bait, color = bait)) +
  geom_violin(alpha = 0.5) +
  geom_jitter(size = 0.5, alpha = 0.5) +   # overlay data points on distribution
  geom_hline(yintercept = 1.6, alpha = 0.75) +    # add lines at z-value cutoffs
  geom_hline(yintercept = -1.6, alpha = 0.75) +
  scale_color_viridis_d(option = "B", end = 0.8) +   # color data points
  scale_fill_viridis_d(option = "E")  # color distribution

1.3 Joining an experiment table with an annotation table

Now we have a dataframe that contains protein interaction data for both dnai2 and heatr2, but the actual identifiers in the ID are not informative. To investigate the interacting biology, we need to join an annotation table to the APMS dataframe. Download the annotations with the code chunk below.

# download the frog gene annotations
frog_annotations <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_enog_annotations.csv")
head(frog_annotations)
## # A tibble: 6 x 6
##   ID    OG_category XENLA_GeneNames HUMAN_GeneNames OG_annotation
##   <chr> <chr>       <chr>           <chr>           <chr>        
## 1 ENOG… S           rhpn1.L, rhpn2… RHPN2P1, RHPN2… Rhophilin, R…
## 2 ENOG… S           arl14epl.L, ar… ARL14EPL        ADP-ribosyla…
## 3 ENOG… Z           myh1-2-1.S, my… MYH3, MYH7B     myosin, heav…
## 4 ENOG… S           loc101733025.L… SHISAL1         kiaa1644     
## 5 ENOG… T           epha3.L, epha4… EPHA5, EPHA3, … Eph receptor 
## 6 ENOG… O           txndc11.L       TXNDC11         thioredoxin …
## # … with 1 more variable: OG_category_annotation <chr>

Note that the annotation dataframe has an ID column that matches the ID column in the APMS dataframe. Also notice that the annotation dataframe has 14292 rows, and our APMS dataframe has 3356 rows. The left_join() function keeps all the rows from the dataframe used as the “left” input, while joining in places that match the common column (ID in this case) to the right dataframe. Use left_join() to combine the APMS dataframe with the annotation dataframe, such that the final data frame has 3356 rows and 12 columns.

frog_combined_annotated <- frog_combined %>%
  left_join(frog_annotations, by = c("ID"))
head(frog_combined_annotated)
## # A tibble: 6 x 12
##   ID    ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait  OG_category
##   <chr>     <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr> <chr>      
## 1 ENOG…         0      202 200.        7.64       14.2  dnai2 Z          
## 2 ENOG…        80      265   3.23      1.69        9.90 dnai2 I          
## 3 ENOG…         0       46  46.2       5.53        6.76 dnai2 Z          
## 4 ENOG…         1       35  17.7       4.15        5.64 dnai2 Z          
## 5 ENOG…         9       52   5.21      2.38        5.47 dnai2 F          
## 6 ENOG…        72      133   1.81      0.853       4.16 dnai2 C          
## # … with 4 more variables: XENLA_GeneNames <chr>, HUMAN_GeneNames <chr>,
## #   OG_annotation <chr>, OG_category_annotation <chr>

Now each observation is connected to a X. laevis gene name, in addition to the gene’s human cousin. Not all genes have annotations, even though X. laevis is considered a model organism. This is common in bioinformatics.

Bonus challenge: Plot a bar chart with OG_category_annotation on the x-axis. Color by bait. Flip the chart so long descriptions are more readable using the coord_flip() layer. Try facet_wrap(), does this improve clarity? Remove the y-axis label.

ggplot(frog_combined_annotated, aes(x = OG_category_annotation, fill = bait)) +
  geom_bar(position = "dodge") +
  theme(axis.title.y = element_blank()) +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
  coord_flip()

2. Subsetting data (filter(), select())

In biological data analysis, we’re often most interested in a specific subset of data–usually scores above some threshold (p-values, z-scores, fold changes in gene expression) or the behavior of a specific gene/protein family relative to time or different conditions.

Perform the following operations to pare down the combined APMS dataframe. Use the pipe (%>%) to save the results of operation to a new dataframe in a single step:

  1. Filter the combined and annotated APMS dataframe (generated in the previous step) to only contain observations with PSM_zscore > 1.6 to pull out statistically significant interactors with the proteins dnai2 and heatr2 (use the filter() function).
  2. Then, in the same line of code, pipe (%>%) into select() to extract only the columns named ID, bait, PSM_log2fc, and PSM_zscore.
  3. Finally, again in the same line of code, pipe (%>%) into arrange() to re-sort the dataframe to descend by PSM_zscore (i.e., highest z-scores should be at the top).
frog_significant <- frog_combined_annotated %>%
  filter(PSM_zscore > 1.6) %>%
  select(-ctrl_PSMs, -exp_PSMs, -PSM_fc) %>%
  arrange(desc(PSM_zscore))
frog_significant
## # A tibble: 270 x 9
##    ID    PSM_log2fc PSM_zscore bait  OG_category XENLA_GeneNames
##    <chr>      <dbl>      <dbl> <chr> <chr>       <chr>          
##  1 ENOG…      7.91       16.0  heat… S           dnaaf5.L, dnaa…
##  2 ENOG…      7.64       14.2  dnai2 Z           dnai2.S        
##  3 ENOG…      1.69        9.90 dnai2 I           loc100494633.L…
##  4 ENOG…      5.53        6.76 dnai2 Z           dnai1.L        
##  5 ENOG…      4.15        5.64 dnai2 Z           loc101730961.L…
##  6 ENOG…      2.38        5.47 dnai2 F           cps1.L, cad.L  
##  7 ENOG…      0.853       4.16 dnai2 C           pc.S           
##  8 ENOG…      0.582       4.02 dnai2 Z           tubb.S, tubb2b…
##  9 ENOG…      1.13        3.98 heat… J           gcn1.S, gcn1.L 
## 10 ENOG…      1.42        3.94 heat… U           ipo7.L         
## # … with 260 more rows, and 3 more variables: HUMAN_GeneNames <chr>,
## #   OG_annotation <chr>, OG_category_annotation <chr>

Plot another bar plot, the same way we made the plot in Section 2.3; compare the distribution of functional categories for the most significant proteins versus the unfiltered data set depicted in Section 2.3. Challenge: Reorder the bar plots so that the functional categories with the highest counts are grouped together using fct_infreq().

ggplot(frog_significant, aes(x = fct_rev(fct_infreq(OG_category_annotation)), 
                             fill = bait)) +
  geom_bar(position = "dodge") +
  theme(axis.title.y = element_blank()) +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
  coord_flip()

3. Aggregating data (group_by(), summarize(), arrange())

To find broad patterns in big datasets, we can use the group_by() and summarize() functions to get summary information for groups of variables. For instance, the bait column has two groups: dnai2 and heatr2. The OG_category_annotations column has 22 groups if you count the “NA” values resulting from unannotated genes. Run the code below to see the groups.

frog_significant$bait %>% unique()
## [1] "heatr2" "dnai2"
frog_significant$OG_category_annotation %>% unique()
##  [1] "Function Unknown"                      
##  [2] "Cytoskeleton"                          
##  [3] "Lipid metabolism"                      
##  [4] "Nucleotide metabolism and transport"   
##  [5] "Energy production and conversion"      
##  [6] "Tranlsation"                           
##  [7] "Intracellular trafficing and secretion"
##  [8] "Amino Acid metabolis and transport"    
##  [9] "Cell wall/membrane/envelop biogenesis" 
## [10] NA                                      
## [11] "Post-translational modification"       
## [12] "Replication and repair"                
## [13] "Carbohydrate metabolism and transport" 
## [14] "Cell cycle control and mitosis"        
## [15] "Chromatin Structure and dynamics"      
## [16] "Signal Transduction"                   
## [17] "RNA processing and modification"       
## [18] "Transcription"                         
## [19] "Secondary Structure"                   
## [20] "Cell motility"                         
## [21] "Inorganic ion transport and metabolism"
## [22] "Coenzyme metabolis"

When you group_by() a variable, you are collapsing all the information for the unique values associated with that variable. Then, we use summarize() to extract information associated with the collapsed information.

We want to learn the functional categories for the proteins that are most signficantly measured to be interacting with both dnai2 and heatr2. To do this, code the following steps (as a chain of functions connected by %>%):

  1. Use group_by() to group both the bait and OG_category_annotation columns
  2. Use summarize() to create a new column called mean_zscore that calculates the mean z-score for all the proteins in each category.
  3. Use arrange() to sort the dataframe by bait and then by mean_zscore to get top hits.
frog_final <- frog_significant %>%
  group_by(bait, OG_category_annotation) %>%
  summarize(mean_zscore = mean(PSM_zscore)) %>%
  arrange(bait, desc(mean_zscore))
head(frog_final)
## # A tibble: 6 x 3
## # Groups:   bait [1]
##   bait  OG_category_annotation                mean_zscore
##   <chr> <chr>                                       <dbl>
## 1 dnai2 Nucleotide metabolism and transport          5.47
## 2 dnai2 Cytoskeleton                                 3.64
## 3 dnai2 Lipid metabolism                             3.59
## 4 dnai2 Cell wall/membrane/envelop biogenesis        3.45
## 5 dnai2 Energy production and conversion             2.66
## 6 dnai2 Amino Acid metabolis and transport           2.55

Plot a bar chart for the functional categories, but this time put z-score on the y-axis instead of counts. Since you’re specifying the y = argument, you’ll have to use geom_col() instead of geom_bar().

ggplot(frog_final, aes(x = fct_rev(fct_infreq(OG_category_annotation)), 
                       y = mean_zscore, fill = bait)) +
  geom_col(position = "dodge") +
  theme(axis.title.y = element_blank()) +
  facet_wrap(~bait) +
  scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
  coord_flip()

4. If this was easy…

Bonus challenge: Take the sepal lengths from the iris dataset and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you explain why?

# If you remove the # signs in the lines below you will get an error; this code doesn't work
# iris %>% 
#   select(Sepal.Length, Species) %>%
#   pivot_wider(names_from = "Species", values_from = "Sepal.Length")

The problem is that pivot_wider() does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, pivot_wider() prints a warning about values that are not uniquely identified.

We can avoid this issue by adding a row column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  pivot_wider(names_from = "Species", values_from = "Sepal.Length")
## # A tibble: 50 x 4
##      row setosa versicolor virginica
##    <int>  <dbl>      <dbl>     <dbl>
##  1     1    5.1        7         6.3
##  2     2    4.9        6.4       5.8
##  3     3    4.7        6.9       7.1
##  4     4    4.6        5.5       6.3
##  5     5    5          6.5       6.5
##  6     6    5.4        5.7       7.6
##  7     7    4.6        6.3       4.9
##  8     8    5          4.9       7.3
##  9     9    4.4        6.6       6.7
## 10    10    4.9        5.2       7.2
## # … with 40 more rows

(At the end, if you want, you can delete this column again:)

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  pivot_wider(names_from = "Species", values_from = "Sepal.Length") %>%
  select(-row)
## # A tibble: 50 x 3
##    setosa versicolor virginica
##     <dbl>      <dbl>     <dbl>
##  1    5.1        7         6.3
##  2    4.9        6.4       5.8
##  3    4.7        6.9       7.1
##  4    4.6        5.5       6.3
##  5    5          6.5       6.5
##  6    5.4        5.7       7.6
##  7    4.6        6.3       4.9
##  8    5          4.9       7.3
##  9    4.4        6.6       6.7
## 10    4.9        5.2       7.2
## # … with 40 more rows

Bonus challenge: For the data set diamonds from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use pivot_wider() to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).

price_by_cut <-
  diamonds %>%
  group_by(cut, clarity) %>% 
  summarize(mean.price = mean(price))
price_by_cut
## # A tibble: 40 x 3
## # Groups:   cut [5]
##    cut   clarity mean.price
##    <ord> <ord>        <dbl>
##  1 Fair  I1           3704.
##  2 Fair  SI2          5174.
##  3 Fair  SI1          4208.
##  4 Fair  VS2          4175.
##  5 Fair  VS1          4165.
##  6 Fair  VVS2         3350.
##  7 Fair  VVS1         3871.
##  8 Fair  IF           1912.
##  9 Good  I1           3597.
## 10 Good  SI2          4580.
## # … with 30 more rows
price_by_cut %>% 
  pivot_wider(names_from = "cut", values_from = "mean.price")
## # A tibble: 8 x 6
##   clarity  Fair  Good `Very Good` Premium Ideal
##   <ord>   <dbl> <dbl>       <dbl>   <dbl> <dbl>
## 1 I1      3704. 3597.       4078.   3947. 4336.
## 2 SI2     5174. 4580.       4989.   5546. 4756.
## 3 SI1     4208. 3690.       3932.   4455. 3752.
## 4 VS2     4175. 4262.       4216.   4550. 3285.
## 5 VS1     4165. 3801.       3805.   4485. 3490.
## 6 VVS2    3350. 3079.       3038.   3795. 3250.
## 7 VVS1    3871. 2255.       2459.   2831. 2468.
## 8 IF      1912. 4098.       4396.   3856. 2273.