July 1, 2020
In this worksheet, we will use the library tidyverse:
library(tidyverse)
theme_set(theme_bw(base_size = 12)) # set the default plot theme for the ggplot2
mutate()
, bind_rows()
, left_join()
)The raw output from many types of high-throughput molecular biology experiments comes a unique sample ID (e.g., protein ID, gene ID) and that samples raw output response. (e.g., peptide abundance via mass spectrometry experiments, mRNA abundance via RNA-seq). Often we would like to accomplish the following:
mutate()
).bind_rows()
).left_join()
).To demonstrate these concepts, we’ll use real affinity purification mass spectrometry (APMS) data sets: frog_dnai2
and frog_heatr2
. Download them below:
# download the dnai2 pull down data set
frog_dnai2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_dnai2.csv")
head(frog_dnai2)
## # A tibble: 6 x 6
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411DGHA 0 202 200. 7.64 14.2
## 2 ENOG411CW1E 80 265 3.23 1.69 9.90
## 3 ENOG411DGH7 0 46 46.2 5.53 6.76
## 4 ENOG411CT94 1 35 17.7 4.15 5.64
## 5 ENOG411CQ77 9 52 5.21 2.38 5.47
## 6 ENOG411DGHH 72 133 1.81 0.853 4.16
# # download the heatr pull down data set
frog_heatr2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_heatr2.csv")
head(frog_heatr2)
## # A tibble: 6 x 6
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411CUGY 0 274 241. 7.91 16.0
## 2 ENOG411CWA4 31 79 2.19 1.13 3.98
## 3 ENOG411CTH9 17 54 2.68 1.42 3.94
## 4 ENOG411CSRQ 17 49 2.43 1.28 3.49
## 5 ENOG411CTE0 0 10 9.64 3.27 3.05
## 6 ENOG411CYEZ 5 22 3.36 1.75 3.01
For each APMS data frame, use mutate()
to add a new column named bait
that contains a string naming the bait used in that experiment. For instance, the frog_dnai2
data frame should have a new column simply reading “dnai2” for each row in the new column.
The purpose of this is to prepare the data frames to be bound together; without this new column, we would not know observation belongs to which experiment once the data frames are combined.
# your R code here
Now, using bind_rows()
combine the two new dataframes together.
# your R code here
Another valid way to do this is with rbind()
; the difference is that rbind()
requires data frames with absolutely identifical column names to ensure a clean concatenation, whereas bind_rows()
isn’t so picky.
Bonus challenge: Plot the distribution of z-scores (PSM_zscore
) for your combined dataframe. Color by experiment. Your choice of geom; see Section 3 of yesterday’s worksheet for plotting distributions. Plot horizontal lines at a z-score values of 1.6 and -1.6, corresponding to roughly a 5% false discovery rate (geom_hline
or geom_vline
will accomplish this).
# your R code here
Now we have a dataframe that contains protein interaction data for both dnai2 and heatr2, but the actual identifiers in the ID
are not informative. To investigate the interacting biology, we need to join an annotation table to the APMS dataframe. Download the annotations with the code chunk below.
# download the frog gene annotations
frog_annotations <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_enog_annotations.csv")
head(frog_annotations)
## # A tibble: 6 x 6
## ID OG_category XENLA_GeneNames HUMAN_GeneNames OG_annotation
## <chr> <chr> <chr> <chr> <chr>
## 1 ENOG… S rhpn1.L, rhpn2… RHPN2P1, RHPN2… Rhophilin, R…
## 2 ENOG… S arl14epl.L, ar… ARL14EPL ADP-ribosyla…
## 3 ENOG… Z myh1-2-1.S, my… MYH3, MYH7B myosin, heav…
## 4 ENOG… S loc101733025.L… SHISAL1 kiaa1644
## 5 ENOG… T epha3.L, epha4… EPHA5, EPHA3, … Eph receptor
## 6 ENOG… O txndc11.L TXNDC11 thioredoxin …
## # … with 1 more variable: OG_category_annotation <chr>
Note that the annotation dataframe has an ID
column that matches the ID
column in the APMS dataframe. Also notice that the annotation dataframe has 14292 rows, and our APMS dataframe has 3356 rows. The left_join()
function keeps all the rows from the dataframe used as the “left” input, while joining in places that match the common column (ID
in this case) to the right dataframe. Use left_join()
to combine the APMS dataframe with the annotation dataframe, such that the final data frame has 3356 rows and 12 columns.
# your R code here
Now each observation is connected to a X. laevis gene name, in addition to the gene’s human cousin. Not all genes have annotations, even though X. laevis is considered a model organism. This is common in bioinformatics.
Bonus challenge: Plot a bar chart with OG_category_annotation
on the x-axis. Color by bait. Flip the chart so long descriptions are more readable using the coord_flip()
layer. Try facet_wrap()
, does this improve clarity? Remove the y-axis label.
# your R code here
filter()
, select()
)In biological data analysis, we’re often most interested in a specific subset of data–usually scores above some threshold (p-values, z-scores, fold changes in gene expression) or the behavior of a specific gene/protein family relative to time or different conditions.
Perform the following operations to pare down the combined APMS dataframe. Use the pipe (%>%) to save the results of operation to a new dataframe in a single step:
PSM_zscore
> 1.6 to pull out statistically significant interactors with the proteins dnai2 and heatr2 (use the filter()
function).%>%
) into select()
to extract only the columns named ID
, bait
, PSM_log2fc
, and PSM_zscore
.%>%
) into arrange()
to re-sort the dataframe to descend by PSM_zscore
(i.e., highest z-scores should be at the top).# your R code here
Plot another bar plot, the same way we made the plot in Section 2.3; compare the distribution of functional categories for the most significant proteins versus the unfiltered data set depicted in Section 2.3. Challenge: Reorder the bar plots so that the functional categories with the highest counts are grouped together using fct_infreq()
.
# your R code here
group_by()
, summarize()
, arrange()
)To find broad patterns in big datasets, we can use the group_by()
and summarize()
functions to get summary information for groups of variables. For instance, the bait
column has two groups: dnai2 and heatr2. The OG_category_annotations
column has 22 groups if you count the “NA” values resulting from unannotated genes. Run the code below to see the groups.
# your R code here
When you group_by()
a variable, you are collapsing all the information for the unique values associated with that variable. Then, we use summarize()
to extract information associated with the collapsed information.
We want to learn the functional categories for the proteins that are most signficantly measured to be interacting with both dnai2 and heatr2. To do this, code the following steps (as a chain of functions connected by %>%
):
group_by()
to group both the bait
and OG_category_annotation
columnssummarize()
to create a new column called mean_zscore
that calculates the mean z-score for all the proteins in each category.arrange()
to sort the dataframe by bait
and then by mean_zscore
to get top hits.# your R code here
Plot a bar chart for the functional categories, but this time put z-score on the y-axis instead of counts. Since you’re specifying the y =
argument, you’ll have to use geom_col()
instead of geom_bar()
.
# your R code here
Bonus challenge: Take the sepal lengths from the iris
dataset and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you explain why?
# If you remove the # signs in the lines below you will get an error; this code doesn't work
# iris %>%
# select(Sepal.Length, Species) %>%
# pivot_wider(names_from = "Species", values_from = "Sepal.Length")
The problem is that pivot_wider()
does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, pivot_wider()
prints a warning about values that are not uniquely identified.
We can avoid this issue by adding a row
column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)
# your R code here
(At the end, if you want, you can delete this column again:)
# your R code here
Bonus challenge: For the data set diamonds
from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use pivot_wider()
to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).
# your R code here