July 1, 2020
In this worksheet, we will use the library tidyverse:
library(tidyverse)
theme_set(theme_bw(base_size = 12)) # set the default plot theme for the ggplot2
mutate()
, bind_rows()
, left_join()
)The raw output from many types of high-throughput molecular biology experiments comes a unique sample ID (e.g., protein ID, gene ID) and that samples raw output response. (e.g., peptide abundance via mass spectrometry experiments, mRNA abundance via RNA-seq). Often we would like to accomplish the following:
mutate()
).bind_rows()
).left_join()
).To demonstrate these concepts, we’ll use real affinity purification mass spectrometry (APMS) data sets: frog_dnai2
and frog_heatr2
. Download them below:
# download the dnai2 pull down data set
frog_dnai2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_dnai2.csv")
head(frog_dnai2)
## # A tibble: 6 x 6
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411DGHA 0 202 200. 7.64 14.2
## 2 ENOG411CW1E 80 265 3.23 1.69 9.90
## 3 ENOG411DGH7 0 46 46.2 5.53 6.76
## 4 ENOG411CT94 1 35 17.7 4.15 5.64
## 5 ENOG411CQ77 9 52 5.21 2.38 5.47
## 6 ENOG411DGHH 72 133 1.81 0.853 4.16
# # download the heatr pull down data set
frog_heatr2 <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_apms_heatr2.csv")
head(frog_heatr2)
## # A tibble: 6 x 6
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411CUGY 0 274 241. 7.91 16.0
## 2 ENOG411CWA4 31 79 2.19 1.13 3.98
## 3 ENOG411CTH9 17 54 2.68 1.42 3.94
## 4 ENOG411CSRQ 17 49 2.43 1.28 3.49
## 5 ENOG411CTE0 0 10 9.64 3.27 3.05
## 6 ENOG411CYEZ 5 22 3.36 1.75 3.01
For each APMS data frame, use mutate()
to add a new column named bait
that contains a string naming the bait used in that experiment. For instance, the frog_dnai2
data frame should have a new column simply reading “dnai2” for each row in the new column.
The purpose of this is to prepare the data frames to be bound together; without this new column, we would not know observation belongs to which experiment once the data frames are combined.
frog_dnai2_mutated <- frog_dnai2 %>%
mutate(bait = 'dnai2')
head(frog_dnai2_mutated)
## # A tibble: 6 x 7
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ENOG411DGHA 0 202 200. 7.64 14.2 dnai2
## 2 ENOG411CW1E 80 265 3.23 1.69 9.90 dnai2
## 3 ENOG411DGH7 0 46 46.2 5.53 6.76 dnai2
## 4 ENOG411CT94 1 35 17.7 4.15 5.64 dnai2
## 5 ENOG411CQ77 9 52 5.21 2.38 5.47 dnai2
## 6 ENOG411DGHH 72 133 1.81 0.853 4.16 dnai2
frog_heatr2_mutated <- frog_heatr2 %>%
mutate(bait = 'heatr2')
head(frog_heatr2_mutated)
## # A tibble: 6 x 7
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ENOG411CUGY 0 274 241. 7.91 16.0 heatr2
## 2 ENOG411CWA4 31 79 2.19 1.13 3.98 heatr2
## 3 ENOG411CTH9 17 54 2.68 1.42 3.94 heatr2
## 4 ENOG411CSRQ 17 49 2.43 1.28 3.49 heatr2
## 5 ENOG411CTE0 0 10 9.64 3.27 3.05 heatr2
## 6 ENOG411CYEZ 5 22 3.36 1.75 3.01 heatr2
Now, using bind_rows()
combine the two new dataframes together.
frog_combined <- bind_rows(frog_dnai2_mutated, frog_heatr2_mutated)
head(frog_combined)
## # A tibble: 6 x 7
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ENOG411DGHA 0 202 200. 7.64 14.2 dnai2
## 2 ENOG411CW1E 80 265 3.23 1.69 9.90 dnai2
## 3 ENOG411DGH7 0 46 46.2 5.53 6.76 dnai2
## 4 ENOG411CT94 1 35 17.7 4.15 5.64 dnai2
## 5 ENOG411CQ77 9 52 5.21 2.38 5.47 dnai2
## 6 ENOG411DGHH 72 133 1.81 0.853 4.16 dnai2
Another valid way to do this is with rbind()
; the difference is that rbind()
requires data frames with absolutely identifical column names to ensure a clean concatenation, whereas bind_rows()
isn’t so picky.
Bonus challenge: Plot the distribution of z-scores (PSM_zscore
) for your combined dataframe. Color by experiment. Your choice of geom; see Section 3 of yesterday’s worksheet for plotting distributions. Plot horizontal lines at a z-score values of 1.6 and -1.6, corresponding to roughly a 5% false discovery rate (geom_hline
or geom_vline
will accomplish this).
ggplot(frog_combined, aes(x = bait, y = PSM_zscore, fill = bait, color = bait)) +
geom_violin(alpha = 0.5) +
geom_jitter(size = 0.5, alpha = 0.5) + # overlay data points on distribution
geom_hline(yintercept = 1.6, alpha = 0.75) + # add lines at z-value cutoffs
geom_hline(yintercept = -1.6, alpha = 0.75) +
scale_color_viridis_d(option = "B", end = 0.8) + # color data points
scale_fill_viridis_d(option = "E") # color distribution
Now we have a dataframe that contains protein interaction data for both dnai2 and heatr2, but the actual identifiers in the ID
are not informative. To investigate the interacting biology, we need to join an annotation table to the APMS dataframe. Download the annotations with the code chunk below.
# download the frog gene annotations
frog_annotations <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_enog_annotations.csv")
head(frog_annotations)
## # A tibble: 6 x 6
## ID OG_category XENLA_GeneNames HUMAN_GeneNames OG_annotation
## <chr> <chr> <chr> <chr> <chr>
## 1 ENOG… S rhpn1.L, rhpn2… RHPN2P1, RHPN2… Rhophilin, R…
## 2 ENOG… S arl14epl.L, ar… ARL14EPL ADP-ribosyla…
## 3 ENOG… Z myh1-2-1.S, my… MYH3, MYH7B myosin, heav…
## 4 ENOG… S loc101733025.L… SHISAL1 kiaa1644
## 5 ENOG… T epha3.L, epha4… EPHA5, EPHA3, … Eph receptor
## 6 ENOG… O txndc11.L TXNDC11 thioredoxin …
## # … with 1 more variable: OG_category_annotation <chr>
Note that the annotation dataframe has an ID
column that matches the ID
column in the APMS dataframe. Also notice that the annotation dataframe has 14292 rows, and our APMS dataframe has 3356 rows. The left_join()
function keeps all the rows from the dataframe used as the “left” input, while joining in places that match the common column (ID
in this case) to the right dataframe. Use left_join()
to combine the APMS dataframe with the annotation dataframe, such that the final data frame has 3356 rows and 12 columns.
frog_combined_annotated <- frog_combined %>%
left_join(frog_annotations, by = c("ID"))
head(frog_combined_annotated)
## # A tibble: 6 x 12
## ID ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore bait OG_category
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ENOG… 0 202 200. 7.64 14.2 dnai2 Z
## 2 ENOG… 80 265 3.23 1.69 9.90 dnai2 I
## 3 ENOG… 0 46 46.2 5.53 6.76 dnai2 Z
## 4 ENOG… 1 35 17.7 4.15 5.64 dnai2 Z
## 5 ENOG… 9 52 5.21 2.38 5.47 dnai2 F
## 6 ENOG… 72 133 1.81 0.853 4.16 dnai2 C
## # … with 4 more variables: XENLA_GeneNames <chr>, HUMAN_GeneNames <chr>,
## # OG_annotation <chr>, OG_category_annotation <chr>
Now each observation is connected to a X. laevis gene name, in addition to the gene’s human cousin. Not all genes have annotations, even though X. laevis is considered a model organism. This is common in bioinformatics.
Bonus challenge: Plot a bar chart with OG_category_annotation
on the x-axis. Color by bait. Flip the chart so long descriptions are more readable using the coord_flip()
layer. Try facet_wrap()
, does this improve clarity? Remove the y-axis label.
ggplot(frog_combined_annotated, aes(x = OG_category_annotation, fill = bait)) +
geom_bar(position = "dodge") +
theme(axis.title.y = element_blank()) +
facet_wrap(~bait) +
scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
coord_flip()
filter()
, select()
)In biological data analysis, we’re often most interested in a specific subset of data–usually scores above some threshold (p-values, z-scores, fold changes in gene expression) or the behavior of a specific gene/protein family relative to time or different conditions.
Perform the following operations to pare down the combined APMS dataframe. Use the pipe (%>%) to save the results of operation to a new dataframe in a single step:
PSM_zscore
> 1.6 to pull out statistically significant interactors with the proteins dnai2 and heatr2 (use the filter()
function).%>%
) into select()
to extract only the columns named ID
, bait
, PSM_log2fc
, and PSM_zscore
.%>%
) into arrange()
to re-sort the dataframe to descend by PSM_zscore
(i.e., highest z-scores should be at the top).frog_significant <- frog_combined_annotated %>%
filter(PSM_zscore > 1.6) %>%
select(-ctrl_PSMs, -exp_PSMs, -PSM_fc) %>%
arrange(desc(PSM_zscore))
frog_significant
## # A tibble: 270 x 9
## ID PSM_log2fc PSM_zscore bait OG_category XENLA_GeneNames
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 ENOG… 7.91 16.0 heat… S dnaaf5.L, dnaa…
## 2 ENOG… 7.64 14.2 dnai2 Z dnai2.S
## 3 ENOG… 1.69 9.90 dnai2 I loc100494633.L…
## 4 ENOG… 5.53 6.76 dnai2 Z dnai1.L
## 5 ENOG… 4.15 5.64 dnai2 Z loc101730961.L…
## 6 ENOG… 2.38 5.47 dnai2 F cps1.L, cad.L
## 7 ENOG… 0.853 4.16 dnai2 C pc.S
## 8 ENOG… 0.582 4.02 dnai2 Z tubb.S, tubb2b…
## 9 ENOG… 1.13 3.98 heat… J gcn1.S, gcn1.L
## 10 ENOG… 1.42 3.94 heat… U ipo7.L
## # … with 260 more rows, and 3 more variables: HUMAN_GeneNames <chr>,
## # OG_annotation <chr>, OG_category_annotation <chr>
Plot another bar plot, the same way we made the plot in Section 2.3; compare the distribution of functional categories for the most significant proteins versus the unfiltered data set depicted in Section 2.3. Challenge: Reorder the bar plots so that the functional categories with the highest counts are grouped together using fct_infreq()
.
ggplot(frog_significant, aes(x = fct_rev(fct_infreq(OG_category_annotation)),
fill = bait)) +
geom_bar(position = "dodge") +
theme(axis.title.y = element_blank()) +
facet_wrap(~bait) +
scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
coord_flip()
group_by()
, summarize()
, arrange()
)To find broad patterns in big datasets, we can use the group_by()
and summarize()
functions to get summary information for groups of variables. For instance, the bait
column has two groups: dnai2 and heatr2. The OG_category_annotations
column has 22 groups if you count the “NA” values resulting from unannotated genes. Run the code below to see the groups.
frog_significant$bait %>% unique()
## [1] "heatr2" "dnai2"
frog_significant$OG_category_annotation %>% unique()
## [1] "Function Unknown"
## [2] "Cytoskeleton"
## [3] "Lipid metabolism"
## [4] "Nucleotide metabolism and transport"
## [5] "Energy production and conversion"
## [6] "Tranlsation"
## [7] "Intracellular trafficing and secretion"
## [8] "Amino Acid metabolis and transport"
## [9] "Cell wall/membrane/envelop biogenesis"
## [10] NA
## [11] "Post-translational modification"
## [12] "Replication and repair"
## [13] "Carbohydrate metabolism and transport"
## [14] "Cell cycle control and mitosis"
## [15] "Chromatin Structure and dynamics"
## [16] "Signal Transduction"
## [17] "RNA processing and modification"
## [18] "Transcription"
## [19] "Secondary Structure"
## [20] "Cell motility"
## [21] "Inorganic ion transport and metabolism"
## [22] "Coenzyme metabolis"
When you group_by()
a variable, you are collapsing all the information for the unique values associated with that variable. Then, we use summarize()
to extract information associated with the collapsed information.
We want to learn the functional categories for the proteins that are most signficantly measured to be interacting with both dnai2 and heatr2. To do this, code the following steps (as a chain of functions connected by %>%
):
group_by()
to group both the bait
and OG_category_annotation
columnssummarize()
to create a new column called mean_zscore
that calculates the mean z-score for all the proteins in each category.arrange()
to sort the dataframe by bait
and then by mean_zscore
to get top hits.frog_final <- frog_significant %>%
group_by(bait, OG_category_annotation) %>%
summarize(mean_zscore = mean(PSM_zscore)) %>%
arrange(bait, desc(mean_zscore))
head(frog_final)
## # A tibble: 6 x 3
## # Groups: bait [1]
## bait OG_category_annotation mean_zscore
## <chr> <chr> <dbl>
## 1 dnai2 Nucleotide metabolism and transport 5.47
## 2 dnai2 Cytoskeleton 3.64
## 3 dnai2 Lipid metabolism 3.59
## 4 dnai2 Cell wall/membrane/envelop biogenesis 3.45
## 5 dnai2 Energy production and conversion 2.66
## 6 dnai2 Amino Acid metabolis and transport 2.55
Plot a bar chart for the functional categories, but this time put z-score on the y-axis instead of counts. Since you’re specifying the y =
argument, you’ll have to use geom_col()
instead of geom_bar()
.
ggplot(frog_final, aes(x = fct_rev(fct_infreq(OG_category_annotation)),
y = mean_zscore, fill = bait)) +
geom_col(position = "dodge") +
theme(axis.title.y = element_blank()) +
facet_wrap(~bait) +
scale_fill_viridis_d(option = "E", begin = 0.2, end = 0.8) +
coord_flip()
Bonus challenge: Take the sepal lengths from the iris
dataset and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you explain why?
# If you remove the # signs in the lines below you will get an error; this code doesn't work
# iris %>%
# select(Sepal.Length, Species) %>%
# pivot_wider(names_from = "Species", values_from = "Sepal.Length")
The problem is that pivot_wider()
does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, pivot_wider()
prints a warning about values that are not uniquely identified.
We can avoid this issue by adding a row
column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)
iris %>%
select(Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row = 1:n()) %>%
pivot_wider(names_from = "Species", values_from = "Sepal.Length")
## # A tibble: 50 x 4
## row setosa versicolor virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 5.1 7 6.3
## 2 2 4.9 6.4 5.8
## 3 3 4.7 6.9 7.1
## 4 4 4.6 5.5 6.3
## 5 5 5 6.5 6.5
## 6 6 5.4 5.7 7.6
## 7 7 4.6 6.3 4.9
## 8 8 5 4.9 7.3
## 9 9 4.4 6.6 6.7
## 10 10 4.9 5.2 7.2
## # … with 40 more rows
(At the end, if you want, you can delete this column again:)
iris %>%
select(Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row = 1:n()) %>%
pivot_wider(names_from = "Species", values_from = "Sepal.Length") %>%
select(-row)
## # A tibble: 50 x 3
## setosa versicolor virginica
## <dbl> <dbl> <dbl>
## 1 5.1 7 6.3
## 2 4.9 6.4 5.8
## 3 4.7 6.9 7.1
## 4 4.6 5.5 6.3
## 5 5 6.5 6.5
## 6 5.4 5.7 7.6
## 7 4.6 6.3 4.9
## 8 5 4.9 7.3
## 9 4.4 6.6 6.7
## 10 4.9 5.2 7.2
## # … with 40 more rows
Bonus challenge: For the data set diamonds
from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use pivot_wider()
to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).
price_by_cut <-
diamonds %>%
group_by(cut, clarity) %>%
summarize(mean.price = mean(price))
price_by_cut
## # A tibble: 40 x 3
## # Groups: cut [5]
## cut clarity mean.price
## <ord> <ord> <dbl>
## 1 Fair I1 3704.
## 2 Fair SI2 5174.
## 3 Fair SI1 4208.
## 4 Fair VS2 4175.
## 5 Fair VS1 4165.
## 6 Fair VVS2 3350.
## 7 Fair VVS1 3871.
## 8 Fair IF 1912.
## 9 Good I1 3597.
## 10 Good SI2 4580.
## # … with 30 more rows
price_by_cut %>%
pivot_wider(names_from = "cut", values_from = "mean.price")
## # A tibble: 8 x 6
## clarity Fair Good `Very Good` Premium Ideal
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I1 3704. 3597. 4078. 3947. 4336.
## 2 SI2 5174. 4580. 4989. 5546. 4756.
## 3 SI1 4208. 3690. 3932. 4455. 3752.
## 4 VS2 4175. 4262. 4216. 4550. 3285.
## 5 VS1 4165. 3801. 3805. 4485. 3490.
## 6 VVS2 3350. 3079. 3038. 3795. 3250.
## 7 VVS1 3871. 2255. 2459. 2831. 2468.
## 8 IF 1912. 4098. 4396. 3856. 2273.