June 24th, 2021
Ideally, a hypothesis test should proceed in 4 steps: (1) formulate a null hypothesis, (2) compute a test statistic that measures deviation from that null, (3) convert the test statistic to a p-value, and (4) compare the p-value to a pre-chosen significance level to decide whether to reject the null.
The most commonly used test in biology might be the t-test, which is useful for determining if there is a significant difference between your control and test cases (e.g., treated and untreated samples).
The PlantGrowth data set contains yields (as measured by the dried weight of plants in the weight column) obtained under a control and two different treatment conditions. Plot the distribution of plant weight measurements for each condition using geom_violin and formulate a hypothesis.
# your R code here
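A minimal sketch of one possible answer, using ggplot2 from the tidyverse (PlantGrowth is built into R, so no download is needed):

```r
library(tidyverse)

# PlantGrowth has two columns: `weight` (dried plant weight)
# and `group` (ctrl, trt1, trt2)
ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
  geom_violin() +
  theme_classic()
```

A reasonable hypothesis from this plot: trt2 shifts plant weight relative to the control, while trt1 does not.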
The plot shows that many “trt1” data points overlap with the control, while “trt2” might be significantly different. Formally evaluate this conclusion using the t.test() function by first comparing “ctrl” to “trt1” in the group column, and then comparing “ctrl” to “trt2”.
# your R code here
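One way to set this up (a sketch; the exact p-values are left for you to inspect):

```r
library(tidyverse)

# pull out the weight measurements for each condition
ctrl <- PlantGrowth %>% filter(group == "ctrl") %>% pull(weight)
trt1 <- PlantGrowth %>% filter(group == "trt1") %>% pull(weight)
trt2 <- PlantGrowth %>% filter(group == "trt2") %>% pull(weight)

t.test(ctrl, trt1, var.equal = TRUE)  # compare the reported p-value...
t.test(ctrl, trt2, var.equal = TRUE)  # ...to this one, which is much smaller
```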
When we set var.equal = TRUE, we’re making a simplifying assumption that the two groups we’re comparing have equal or near-equal variance. You can set var.equal = FALSE to remove this assumption, and R will perform Welch’s t-test instead. How does the p-value change?
The tidyverse ecosystem also contains a function called t_test() for setting up this test with tidy input/output.
# your R code here
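Core tidyverse does not itself export t_test(); it typically comes from a companion package such as rstatix (an assumption here — your class setup may differ). A sketch:

```r
library(rstatix)  # assumed installed; provides a tidy t_test() wrapper

# one row per comparison against the control group, as a tibble
PlantGrowth %>%
  t_test(weight ~ group, ref.group = "ctrl", var.equal = TRUE)
```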
The p-value is the end result uniting all hypothesis tests. Built-in statistical tests in R such as t.test() (or t_test() in the tidyverse) generally calculate this value for you. However, depending on the experiment and the size of the data, we might find it easier to manually compute our own test statistics and resulting p-values.
We will use the APMS data analyzed yesterday to demonstrate this. These data have a manually computed z-score for each observed protein. This paper contains the exact equation used to calculate the PSM_zscore column: https://elifesciences.org/articles/58662
# download the APMS data for dnai2
dnai2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "dnai2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regex, not a glob
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(dnai2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnai2.S 0 202 200. 7.64 14.2
## 2 loc100494633.L, loc100494633.… 80 265 3.23 1.69 9.90
## 3 dnai1.L 0 46 46.2 5.53 6.76
## 4 loc101730961.L, dnah11.L, dna… 1 35 17.7 4.15 5.64
## 5 cps1.L, cad.L 9 52 5.21 2.38 5.47
## 6 pc.S 72 133 1.81 0.853 4.16
To calculate p-values and create a new column containing those p-values, we will use a combination of the mutate() function from the tidyverse and the pnorm() function built into R. If we want to specify that this is a one-sided test, we need to include the argument lower.tail = FALSE. For visualization purposes, let’s also create a new column that labels each point as either “significant” or “not significant” based on a 0.05 significance level.
# your R code here
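A sketch, continuing from the dnai2_apms data frame loaded above (column names as shown by head()):

```r
dnai2_apms <- dnai2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE),  # one-sided, upper tail
         significance = ifelse(pval < 0.05,
                               "significant", "not significant"))
head(dnai2_apms)
```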
If we threshold the data at a 0.05 significance level, our results indicate that there are 144 “significant” interactors. Let’s visualize the raw PSM counts on a log-log plot of the control versus the experiment. If we color by our new significance column, we can see where each “significant” point lies on the distribution.
# your R code here
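One way to draw this, assuming the significance column from the previous step; note that zero PSM counts are silently dropped on a log scale, so a pseudocount keeps them visible:

```r
ggplot(dnai2_apms, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1,  # + 1 pseudocount
                       color = significance)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic()
```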
Looking at this plot, a lot of proteins we’re calling “significant” lie directly along the edge of the “background proteins.” Since we are performing 1,685 hypothesis tests (i.e., “is this protein interacting with dnai2?”), at a 0.05 threshold we expect roughly 84 of them (1,685 × 0.05) to be false positives by chance alone. To correct for this, we use the p.adjust() function built into R. There are a number of ways to correct for the multiple-testing problem, ranging from extremely conservative (e.g., Bonferroni) to less conservative (e.g., Benjamini-Hochberg). We will specify the argument method = "BH" to indicate we want to use the Benjamini-Hochberg method. Let’s also create a new column called signif_corrected that labels each point as either “significant” or “not significant” based on a 0.05 false discovery rate threshold.
# your R code here
# your R code here
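A sketch, assuming the pval column was created as in the earlier step:

```r
dnai2_apms <- dnai2_apms %>%
  mutate(pval_bh = p.adjust(pval, method = "BH"),
         signif_corrected = ifelse(pval_bh < 0.05,
                                   "significant", "not significant"))

# re-draw the log-log plot, coloring by the corrected calls
ggplot(dnai2_apms, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1,
                       color = signif_corrected)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic()
```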
Now, do the same for the heatr2_apms data.
# download the APMS data for heatr2
heatr2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "heatr2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regex, not a glob
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(heatr2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnaaf5.L, dnaaf5.S 0 274 241. 7.91 16.0
## 2 gcn1.S, gcn1.L 31 79 2.19 1.13 3.98
## 3 ipo7.L 17 54 2.68 1.42 3.94
## 4 ipo9.L, ipo9.S 17 49 2.43 1.28 3.49
## 5 myo5b.S, myo5c.L, myo5b.L, my… 0 10 9.64 3.27 3.05
## 6 hsph1.L, hsph1.S, hspa4.L, hs… 5 22 3.36 1.75 3.01
Step 1: Calculate p-values from the test statistic (in this case, the z-score).
# your R code here
Step 2: Visualize how many results are called “significant” at a 5% false positive rate.
# your R code here
Step 3: Calculate the false discovery rate using the Benjamini-Hochberg method.
# your R code here
Step 4: Visualize how many results are called “significant” at a 5% false discovery rate.
# your R code here
The iris dataset has four measurements per observational unit (iris plant):
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid
In this particular case, it seems that petal length and petal width are the most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it also works when there are many more variables than just four.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():
# your R code here
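A sketch of the three steps, using the iris data from above:

```r
library(tidyverse)

pca <- iris %>%
  select(-Species) %>%  # (i) keep only the numeric measurement columns
  scale() %>%           # (ii) center to mean 0, scale to unit variance
  prcomp()              # (iii) run PCA
summary(pca)
```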
The main outputs of PCA are the standard deviations and the rotation matrix. However, we’re most interested in the principal component (PC) coordinates, available as pca$x:
# your R code here
But since we removed the categorical Species column in order to scale the matrix, we have lost that information. We need to add it back onto the data frame using the data.frame() function we discussed on day 1:
# your R code here
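A sketch, assuming pca is the prcomp object from the previous step:

```r
# reattach the Species labels to the PC coordinates
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
```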
And now we can plot as usual:
# your R code here
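A sketch, assuming pca_data holds the PC coordinates plus Species:

```r
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  scale_color_colorblind()  # from ggthemes, as in the plots above
```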
In the above plot, the species are much better separated. Now, if we look at the rotation matrix, we can learn which of the 4 flower measurements contribute most to the variation between species (i.e., how much each variable contributes to each principal component).
# your R code here
It’s most useful to plot this information using arrows, which let us visualize the magnitude of contribution to each PC:
# your R code here
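A sketch, assuming pca from above; the axis limits are arbitrary choices to leave room for the labels:

```r
# convert the rotation matrix to a data frame for ggplot
rotation_data <- data.frame(pca$rotation,
                            variable = row.names(pca$rotation))

ggplot(rotation_data) +
  geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0,
               arrow = arrow(length = unit(0.1, "in"))) +
  geom_text(aes(x = PC1, y = PC2, label = variable),
            hjust = 0, size = 3, color = "red") +
  xlim(-1, 1.25) +
  coord_fixed()
```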
We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:
# your R code here
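A sketch, assuming pca from above:

```r
percent <- 100 * pca$sdev^2 / sum(pca$sdev^2)
percent  # roughly 73, 23, 4, and 0.5 for the scaled iris data
```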
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:
# your R code here
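A sketch, assuming the percent vector from the previous step:

```r
perc_data <- data.frame(percent = percent, PC = 1:4)

ggplot(perc_data, aes(x = PC, y = percent)) +
  geom_col() +
  geom_text(aes(label = round(percent, 1)), vjust = -0.5) +
  ylim(0, 80)
```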
Now do it yourself using the wine_features data set. The wine_features data set contains 6,497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine’s relative quality (indicated by the quality column) on a scale of 3 to 9, as graded by experts performing a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, and alcohol.
wine_features <-
read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
mutate(type = as.factor(type)) %>%
select(-quality_grade, -alcohol_grade, -acidity_grade)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## type = col_character(),
## quality = col_double(),
## quality_grade = col_character(),
## alcohol = col_double(),
## alcohol_grade = col_character(),
## pH = col_double(),
## acidity_grade = col_character(),
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## sulphates = col_double()
## )
Step 1: Remove the categorical variable type from the wine_features data frame, scale the data around 0, then use prcomp() to perform PCA. Reattach type to the transformed PC values made available in pca$x. Hint: Refer to the example above for code.
# your R code here
Step 2: Make a scatter plot of PC2 vs PC1, coloring by type.
# your R code here
Step 3: Convert the rotation matrix to a data frame, then plot the rotation data using arrows with geom_segment() and geom_text(). Hint: Refer to the example above for code.
# your R code here
Step 4: Evaluate how much of the variance is explained by PC1 and PC2.
# your R code here