June 24th, 2021
Ideally, a hypothesis test should proceed in four steps: (1) formulate a hypothesis, (2) compute a test statistic, (3) convert the test statistic into a p-value, and (4) decide whether to reject the null hypothesis, correcting for multiple tests where necessary.
Perhaps the most commonly used test in biology is the t-test, which determines whether there is a significant difference between the means of two groups (e.g., treated and untreated samples).
The PlantGrowth data set contains yields (as measured by the dried weight of plants in the weight column) obtained under a control and two different treatment conditions. Plot the distribution of plant weight measurements for each condition using geom_violin() and formulate a hypothesis.
ggplot(PlantGrowth, aes(y = weight, x = group, color = group)) +
  geom_violin() +
  geom_jitter() +
  theme(legend.position = "none")
The plot shows that many “trt1” data points overlap with the control, while “trt2” might be significantly different. Formally evaluate this conclusion using the t.test() function, first comparing “ctrl” to “trt1” in the group column, and then comparing “ctrl” to “trt2”.
# compare control and trt1 means
tt_pg_trt1 <- with(PlantGrowth,
                   t.test(weight[group == "ctrl"],
                          weight[group == "trt1"],
                          var.equal = TRUE))
tt_pg_trt1
##
## Two Sample t-test
##
## data: weight[group == "ctrl"] and weight[group == "trt1"]
## t = 1.1913, df = 18, p-value = 0.249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2833003 1.0253003
## sample estimates:
## mean of x mean of y
## 5.032 4.661
# compare control and trt2 means
tt_pg_trt2 <- with(PlantGrowth,
                   t.test(weight[group == "ctrl"],
                          weight[group == "trt2"],
                          var.equal = TRUE))
tt_pg_trt2
##
## Two Sample t-test
##
## data: weight[group == "ctrl"] and weight[group == "trt2"]
## t = -2.134, df = 18, p-value = 0.04685
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.980338117 -0.007661883
## sample estimates:
## mean of x mean of y
## 5.032 5.526
When we set var.equal = TRUE, we’re making a simplifying assumption that the two groups we’re comparing have equal or nearly equal variance. You can set var.equal = FALSE to remove this assumption, and R will perform Welch’s t-test instead. How does the p-value change?
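As a quick check, here is a sketch of the same “ctrl” vs. “trt1” comparison with var.equal = FALSE (Welch’s test estimates fractional degrees of freedom, so expect df below 18):

```r
# Welch's t-test: drops the equal-variance assumption
tt_welch <- with(PlantGrowth,
                 t.test(weight[group == "ctrl"],
                        weight[group == "trt1"],
                        var.equal = FALSE))
tt_welch
```

Because the two groups have fairly similar spreads here, the p-value changes only modestly.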
The tidyverse ecosystem also provides a t_test() function (from the infer package) for setting up this test with tidy input/output.
# tidy t-tests
tidy_trt1 <- PlantGrowth %>%
  filter(group == "trt1" | group == "ctrl") %>%
  t_test(weight ~ group,
         order = c("trt1", "ctrl"))

tidy_trt2 <- PlantGrowth %>%
  filter(group == "trt2" | group == "ctrl") %>%
  t_test(weight ~ group,
         order = c("trt2", "ctrl"))
The p-value is the end result uniting all hypothesis tests. Built-in statistical tests in R such as t.test() (or t_test() in the tidyverse) generally calculate this value for you. However, depending on the experiment and the size of the data, we might find it easier to manually compute our own test statistics and resulting p-values.
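For instance, here is a generic, entirely hypothetical sketch (not the equation used for the APMS data below) of turning one observation into a z-score and a one-sided p-value by hand:

```r
# made-up background counts for one protein across control runs
background <- c(8, 12, 10, 9, 11, 13, 10, 7)
observed <- 30                                 # made-up experimental count

# z-score: how many standard deviations above the background mean?
z <- (observed - mean(background)) / sd(background)
z                                              # 10
# one-sided p-value from the standard normal distribution
pnorm(z, lower.tail = FALSE)
```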
We will use the APMS data analyzed yesterday to demonstrate this. These data have a manually computed z-score for each observed protein. This paper contains the exact equation used to calculate the PSM_zscore column: https://elifesciences.org/articles/58662
# download the APMS data for dnai2
dnai2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "dnai2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regular expression
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(dnai2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnai2.S 0 202 200. 7.64 14.2
## 2 loc100494633.L, loc100494633.… 80 265 3.23 1.69 9.90
## 3 dnai1.L 0 46 46.2 5.53 6.76
## 4 loc101730961.L, dnah11.L, dna… 1 35 17.7 4.15 5.64
## 5 cps1.L, cad.L 9 52 5.21 2.38 5.47
## 6 pc.S 72 133 1.81 0.853 4.16
To calculate p-values, we use the pnorm() function built into R. To specify a one-sided test (we only care about enrichment above background), we include the argument lower.tail = FALSE. For visualization purposes, let’s also create a new column that labels each point as either “significant” or “not significant” based on a 0.05 significance level.
# calculate p-values
dnai2_pvals <- dnai2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE)) %>%
  arrange(pval) %>%
  mutate(significance = case_when(pval <= 0.05 ~ "Significant",
                                  pval > 0.05 ~ "Not significant"))
dnai2_pvals
## # A tibble: 1,685 x 8
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore pval
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnai2.S 0 202 200. 7.64 14.2 4.61e-46
## 2 loc100494633.L… 80 265 3.23 1.69 9.90 2.15e-23
## 3 dnai1.L 0 46 46.2 5.53 6.76 7.05e-12
## 4 loc101730961.L… 1 35 17.7 4.15 5.64 8.44e- 9
## 5 cps1.L, cad.L 9 52 5.21 2.38 5.47 2.31e- 8
## 6 pc.S 72 133 1.81 0.853 4.16 1.57e- 5
## 7 tubb.S, tubb2b… 158 241 1.50 0.582 4.02 2.93e- 5
## 8 MGC97820.L 65 117 1.76 0.815 3.76 8.50e- 5
## 9 oplah.L 1 16 8.36 3.06 3.62 1.48e- 4
## 10 gfpt1.S, gfpt2… 18 46 2.43 1.28 3.45 2.84e- 4
## # … with 1,675 more rows, and 1 more variable: significance <chr>
If we threshold the data at a 0.05 significance level, our results indicate 144 “significant” interactors. Let’s visualize the raw PSM counts on a log-log plot of the control versus the experiment. If we color by our new significance column, we can see where each “significant” point lies on the distribution.
# visualize what is significant at 5% false positive rate (p < 0.05)
ggplot(dnai2_pvals, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = significance)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
Looking at this plot, a lot of proteins we’re calling “significant” lie directly along the edge of the “background” proteins. Since we are performing 1,685 hypothesis tests (i.e., “is this protein interacting with dnai2?”), we expect roughly 1,685 × 0.05 ≈ 84 false positives at this threshold. To correct for this, we use the p.adjust() function built into R. There are a number of ways to correct for the multiple testing problem, spanning from extremely conservative (e.g., Bonferroni) to less conservative (e.g., Benjamini-Hochberg). We will specify the argument method = "BH" to indicate we want to use the Benjamini-Hochberg method. Let’s also create a new column called signif_corrected that labels each point as either “significant” or “not significant” based on a 0.05 false discovery threshold.
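As a toy illustration of how the two corrections differ (the p-values here are made up):

```r
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.2)
p.adjust(p_raw, method = "bonferroni")  # each p multiplied by n = 5, capped at 1
p.adjust(p_raw, method = "BH")          # gentler; controls the false discovery rate
```

Every BH-adjusted value is no larger than its Bonferroni counterpart, which is why BH retains more discoveries.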
# calculate the false discovery rate
dnai2_fdr <- dnai2_pvals %>%
  mutate(fdr = p.adjust(pval, method = "BH", n = length(pval))) %>%
  mutate(signif_corrected = case_when(fdr <= 0.05 ~ "Significant",
                                      fdr > 0.05 ~ "Not significant")) %>%
  arrange(fdr)
# visualize what is significant at 5% false discovery rate (fdr < 0.05)
ggplot(dnai2_fdr, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = signif_corrected)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
Now, do the same for the heatr2_apms data.
# download the APMS data for heatr2
heatr2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "heatr2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regular expression
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(heatr2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnaaf5.L, dnaaf5.S 0 274 241. 7.91 16.0
## 2 gcn1.S, gcn1.L 31 79 2.19 1.13 3.98
## 3 ipo7.L 17 54 2.68 1.42 3.94
## 4 ipo9.L, ipo9.S 17 49 2.43 1.28 3.49
## 5 myo5b.S, myo5c.L, myo5b.L, my… 0 10 9.64 3.27 3.05
## 6 hsph1.L, hsph1.S, hspa4.L, hs… 5 22 3.36 1.75 3.01
Step 1: Calculate p-values from the test statistic (in this case, the z-score).
# calculate p-values
heatr2_pvals <- heatr2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE)) %>%
  arrange(pval) %>%
  mutate(significance = case_when(pval <= 0.05 ~ "Significant",
                                  pval > 0.05 ~ "Not significant"))
heatr2_pvals
## # A tibble: 1,671 x 8
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore pval
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnaaf5.L, dnaa… 0 274 241. 7.91 16.0 2.91e-58
## 2 gcn1.S, gcn1.L 31 79 2.19 1.13 3.98 3.51e- 5
## 3 ipo7.L 17 54 2.68 1.42 3.94 4.13e- 5
## 4 ipo9.L, ipo9.S 17 49 2.43 1.28 3.49 2.43e- 4
## 5 myo5b.S, myo5c… 0 10 9.64 3.27 3.05 1.13e- 3
## 6 hsph1.L, hsph1… 5 22 3.36 1.75 3.01 1.29e- 3
## 7 Xelaev18029563… 2 15 4.67 2.22 2.97 1.48e- 3
## 8 pgm1.L, pgm5.S… 6 23 3.00 1.59 2.88 1.99e- 3
## 9 ppp6r3.L, ppp6… 2 14 4.38 2.13 2.82 2.39e- 3
## 10 xpo7.S, xpo7.L 1 11 5.26 2.39 2.74 3.04e- 3
## # … with 1,661 more rows, and 1 more variable: significance <chr>
Step 2: Visualize how many results are called “significant” at a 5% false positive rate.
# visualize what is significant at 5% false positive rate (p < 0.05)
ggplot(heatr2_pvals, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = significance)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
Step 3: Calculate the false discovery rate using the Benjamini-Hochberg method.
# calculate the false discovery rate
heatr2_fdr <- heatr2_pvals %>%
  mutate(fdr = p.adjust(pval, method = "BH", n = length(pval))) %>%
  mutate(signif_corrected = case_when(fdr <= 0.05 ~ "Significant",
                                      fdr > 0.05 ~ "Not significant")) %>%
  arrange(fdr)
Step 4: Visualize how many results are called “significant” at a 5% false discovery rate.
# visualize what is significant at 5% false discovery rate (fdr < 0.05)
ggplot(heatr2_fdr, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = signif_corrected)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
The iris dataset has four measurements per observational unit (iris plant):
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid
In this particular case, it seems that petal length and petal width are the most distinguishing measurements for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it also works when there are many more variables than just four.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():
pca <- iris %>%
  select(-Species) %>%  # remove Species column
  scale() %>%           # scale to 0 mean and unit variance
  prcomp()              # perform the PCA
# now display the results from the PCA analysis
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
The main outputs of PCA are the standard deviations and the rotation matrix. However, we’re most interested in the principal component (PC) coordinates, available as pca$x:
head(pca$x)
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.12727962 0.024087508
## [2,] -2.074013 0.6718827 0.23382552 0.102662845
## [3,] -2.356335 0.3407664 -0.04405390 0.028282305
## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
But since we removed the categorical Species column in order to scale the matrix, we have lost that information. First, we need to add it back onto the data frame using the data.frame() function we discussed on day 1:
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
## PC1 PC2 PC3 PC4 Species
## 1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa
## 2 -2.074013 0.6718827 0.23382552 0.102662845 setosa
## 3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa
## 4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa
## 6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
And now we can plot as usual:
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  scale_color_colorblind()
In the above plot, the species are much better separated. Now, if we look at the rotation matrix, we can learn which of the 4 flower measurements most contributes to the variation between species (i.e., how much each variable contributes to each principal component).
head(pca$rotation)
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
It’s most useful to plot this information using arrows, which let us visualize the magnitude of contribution to each PC:
# capture the rotation matrix in a data frame
rotation_data <- data.frame(pca$rotation,
                            variable = row.names(pca$rotation))
# define a nice arrow style
arrow_style <- arrow(length = unit(0.05, "inches"),
                     type = "closed")
# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) +
  geom_segment(aes(xend = PC1, yend = PC2),
               x = 0, y = 0, arrow = arrow_style) +
  geom_text(aes(x = PC1, y = PC2, label = variable),
            hjust = 0, size = 3, color = "#b80058") +
  xlim(-1., 1.25) +
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1
We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
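This can be confirmed numerically by ranking the absolute loadings; the snippet below recomputes the same PCA so that it stands alone:

```r
# same PCA as above: drop Species, scale, then prcomp()
pca <- prcomp(scale(iris[, -5]))
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)  # three variables load heavily on PC1
sort(abs(pca$rotation[, "PC2"]), decreasing = TRUE)  # Sepal.Width dominates PC2
```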
Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618 3.6689219 0.5178709
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:
perc_data <- data.frame(percent = percent, PC = 1:length(percent))
ggplot(perc_data, aes(x = PC, y = percent)) +
  geom_col() +
  geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) +
  ylim(0, 80)
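Alternatively, the summary() method for prcomp objects reports the same proportions (plus their cumulative sum) directly; the snippet recomputes the PCA so that it stands alone:

```r
# same PCA as above: drop Species, scale, then prcomp()
pca <- prcomp(scale(iris[, -5]))
summary(pca)$importance  # rows: standard deviation, proportion of variance, cumulative proportion
```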
Now do it yourself using the wine_features data set. The wine_features data set contains 6,497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine’s relative quality (indicated by the quality column) on a scale of 3 to 9, as graded by experts performing a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, and alcohol.
wine_features <-
  read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
  mutate(type = as.factor(type)) %>%
  select(-quality_grade, -alcohol_grade, -acidity_grade)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## type = col_character(),
## quality = col_double(),
## quality_grade = col_character(),
## alcohol = col_double(),
## alcohol_grade = col_character(),
## pH = col_double(),
## acidity_grade = col_character(),
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## sulphates = col_double()
## )
Step 1: Remove the categorical variable type from the wine_features data frame, scale the data around 0, then use prcomp() to perform PCA. Reattach type to the transformed PC values made available in pca$x. Hint: Refer to the example above for code.
# perform PCA on the `wine_features` data set
pca <- wine_features %>%
  select(-type) %>%
  scale() %>%
  prcomp()
# get the transformed data from the PCA object for further analysis
pca_data <- data.frame(pca$x, type = wine_features$type)
Step 2: Make a scatter plot of PC2 vs. PC1, coloring by type.
# color PC2 vs. PC1 scatterplot by type of wine
ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) +
  geom_point(alpha = 0.75, size = 2) +
  scale_color_manual(values = c("#5b0b0b", "#bcb37b"))
Step 3: Convert the rotation matrix to a data frame, then plot the rotation data using arrows with geom_segment() and geom_text(). Hint: Refer to the example above for code.
# capture the rotation matrix in a data frame
rotation_data <- data.frame(pca$rotation,
                            variable = row.names(pca$rotation))
# define a nice arrow style
arrow_style <- arrow(length = unit(0.05, "inches"),
                     type = "closed")
# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) +
  geom_segment(aes(xend = PC1, yend = PC2),
               x = 0, y = 0, arrow = arrow_style) +
  geom_text(aes(x = PC1, y = PC2, label = variable),
            hjust = 0, size = 3, color = "#5b0b0b") +
  xlim(-1., 1.25) +
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1
Step 4: Evaluate how much of the variance is explained by PC1 and PC2.
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 25.3462261 22.0821166 13.6792235 8.9052105 7.0041705 5.5033265
## [7] 4.6985537 4.2998570 3.8197690 2.4917742 1.8965627 0.2732097
perc_data <- data.frame(percent = percent, PC = 1:length(percent))
ggplot(perc_data, aes(x = PC, y = percent)) +
  geom_col() +
  geom_text(aes(label = round(percent, 2)), size = 3, vjust = -0.5) +
  ylim(0, 30)
Together, PC1 and PC2 explain only about 47% of the variance (25.35% + 22.08%), far less than the ~96% captured by the first two components for the iris data.