Day 4: Statistics & advanced data analysis

In-class worksheet, solutions

June 24th, 2021

1. Hypothesis testing

Ideally, a hypothesis test should proceed in four steps:

  1. Decide on an effect you are interested in, design an appropriate experiment, and pick a test statistic.
  2. Frame your null hypothesis – what does your null distribution look like? Is it a one-sided or two-sided hypothesis? If necessary, calculate your null distribution.
  3. Collect the data and compute the test statistic.
  4. Compute the p-value for your test statistic and decide whether it falls above or below your significance threshold.

1.1 t-test

The most commonly used test in biology is arguably the t-test, which is useful for determining whether there is a significant difference in means between your control and test cases (e.g., treated and untreated samples).

The PlantGrowth data set contains yields (as measured by the dried weight of plants in the weight column) obtained under a control and two different treatment conditions. Plot the distribution of plant weight measurements for each condition using geom_violin and formulate a hypothesis.

ggplot(PlantGrowth, aes(y = weight, x = group, color = group)) +
  geom_violin() +
  geom_jitter() + 
  theme(legend.position = "none")

The plot shows that many “trt1” data points overlap with the control, while “trt2” might be significantly different. Formally evaluate this conclusion using the t.test() function by first comparing “ctrl” to “trt1” in the group column, and then comparing “ctrl” to “trt2”.

# compare control and trt1 means
tt_pg_trt1 <- with(PlantGrowth,
                   t.test(weight[group == "ctrl"],
                          weight[group == "trt1"],
                          var.equal = TRUE))
tt_pg_trt1
## 
##  Two Sample t-test
## 
## data:  weight[group == "ctrl"] and weight[group == "trt1"]
## t = 1.1913, df = 18, p-value = 0.249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2833003  1.0253003
## sample estimates:
## mean of x mean of y 
##     5.032     4.661
# compare control and trt2 means
tt_pg_trt2 <- with(PlantGrowth,
                   t.test(weight[group == "ctrl"],
                          weight[group == "trt2"],
                          var.equal = TRUE))
tt_pg_trt2
## 
##  Two Sample t-test
## 
## data:  weight[group == "ctrl"] and weight[group == "trt2"]
## t = -2.134, df = 18, p-value = 0.04685
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.980338117 -0.007661883
## sample estimates:
## mean of x mean of y 
##     5.032     5.526

When we set var.equal = TRUE, we’re making a simplifying assumption that the two groups we’re comparing have equal or near-equal variance. You can set var.equal = FALSE (the default for t.test()) to remove this assumption, and R will perform Welch’s t-test instead. How does the p-value change?
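One way to check (a quick sketch; the object name tt_pg_trt2_welch is just for illustration):

# rerun ctrl vs. trt2 without assuming equal variances (Welch's t-test)
tt_pg_trt2_welch <- with(PlantGrowth,
                         t.test(weight[group == "ctrl"],
                                weight[group == "trt2"],
                                var.equal = FALSE))
tt_pg_trt2_welch$p.value  # compare with the 0.04685 from the equal-variance test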

The tidyverse ecosystem also contains a function called t_test() (from the infer package) for setting up this test with tidy input/output.

# tidy t test
tidy_trt1 <- PlantGrowth %>%
  filter(group == "trt1" | group == "ctrl") %>%
  t_test(weight ~ group,
         order = c("trt1", "ctrl"))

tidy_trt2 <- PlantGrowth %>%
  filter(group == "trt2" | group == "ctrl") %>%
  t_test(weight ~ group,
         order = c("trt2", "ctrl"))

1.2 p-values and false discovery rate

The p-value is the common end point of every hypothesis test. Built-in statistical tests in R such as t.test() (or t_test() in the tidyverse) generally calculate this value for you. However, depending on the experiment and the size of the data, we might find it easier to manually compute our own test statistics and resulting p-values.

We will use the APMS data analyzed yesterday to demonstrate this. These data have a manually computed z-score for each observed protein. This paper contains the exact equation used to calculate the PSM_zscore column: https://elifesciences.org/articles/58662

# download the APMS data for dnai2
dnai2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "dnai2") %>%
  select(XENLA_GeneNames, matches("PSM"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ID = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double(),
##   bait = col_character(),
##   OG_category = col_character(),
##   XENLA_GeneNames = col_character(),
##   HUMAN_GeneNames = col_character(),
##   OG_annotation = col_character(),
##   OG_category_annotation = col_character()
## )
head(dnai2_apms)
## # A tibble: 6 x 6
##   XENLA_GeneNames                ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>                              <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 dnai2.S                                0      202 200.        7.64       14.2 
## 2 loc100494633.L, loc100494633.…        80      265   3.23      1.69        9.90
## 3 dnai1.L                                0       46  46.2       5.53        6.76
## 4 loc101730961.L, dnah11.L, dna…         1       35  17.7       4.15        5.64
## 5 cps1.L, cad.L                          9       52   5.21      2.38        5.47
## 6 pc.S                                  72      133   1.81      0.853       4.16

To calculate p-values, we use the pnorm() function built into R. Because this is a one-sided test in the upper tail (we only care about large positive z-scores), we include the argument lower.tail = FALSE so that pnorm() returns P(Z > z) rather than P(Z ≤ z). For visualization purposes, let’s also create a new column that labels each point as either “significant” or “not significant” based on a 0.05 significance level.
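As a quick illustration of what lower.tail = FALSE does (not part of the analysis itself):

# upper-tail probability P(Z > 1.96) under the standard normal; ~0.025
pnorm(1.96, lower.tail = FALSE)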

# calculate p-values
dnai2_pvals <- dnai2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE)) %>%
  arrange(pval) %>%
  mutate(significance = case_when(pval <= 0.05 ~ "Significant",
                                  pval > 0.05 ~ "Not significant"))
dnai2_pvals
## # A tibble: 1,685 x 8
##    XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore     pval
##    <chr>               <dbl>    <dbl>  <dbl>      <dbl>      <dbl>    <dbl>
##  1 dnai2.S                 0      202 200.        7.64       14.2  4.61e-46
##  2 loc100494633.L…        80      265   3.23      1.69        9.90 2.15e-23
##  3 dnai1.L                 0       46  46.2       5.53        6.76 7.05e-12
##  4 loc101730961.L…         1       35  17.7       4.15        5.64 8.44e- 9
##  5 cps1.L, cad.L           9       52   5.21      2.38        5.47 2.31e- 8
##  6 pc.S                   72      133   1.81      0.853       4.16 1.57e- 5
##  7 tubb.S, tubb2b…       158      241   1.50      0.582       4.02 2.93e- 5
##  8 MGC97820.L             65      117   1.76      0.815       3.76 8.50e- 5
##  9 oplah.L                 1       16   8.36      3.06        3.62 1.48e- 4
## 10 gfpt1.S, gfpt2…        18       46   2.43      1.28        3.45 2.84e- 4
## # … with 1,675 more rows, and 1 more variable: significance <chr>

If we threshold the data at p ≤ 0.05, our results indicate that there are 144 “significant” interactors. Let’s visualize the raw PSM counts on a log-log plot of the control versus the experiment. If we color by our new significance column, we can see where each “significant” point lies in the distribution.

# visualize what is significant at 5% false positive rate (p < 0.05)
ggplot(dnai2_pvals, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = significance)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')

Looking at this plot, a lot of proteins we’re calling “significant” lie directly along the edge of the “background” proteins. Since we are performing 1,685 hypothesis tests (i.e., “is this protein interacting with dnai2?”), we expect roughly 1,685 × 0.05 ≈ 84 false positives at this threshold by chance alone. To correct for this, we use the p.adjust() function built into R. There are a number of ways to correct for the multiple-testing problem, spanning from extremely conservative (e.g., Bonferroni) to less conservative (e.g., Benjamini-Hochberg). We will specify the argument method = "BH" to indicate that we want to use the Benjamini-Hochberg method. Let’s also create a new column called signif_corrected that labels each point as either “significant” or “not significant” based on a 0.05 false discovery rate threshold.

# calculate the false discovery rate
dnai2_fdr <- dnai2_pvals %>%
  mutate(fdr = p.adjust(pval, method = "BH", n = length(pval))) %>%
  mutate(signif_corrected = case_when(fdr <= 0.05 ~ "Significant",
                                      fdr > 0.05 ~ "Not significant")) %>%
  arrange(fdr)
# visualize what is significant at 5% false discovery rate (fdr < 0.05)
ggplot(dnai2_fdr, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = signif_corrected)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
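To see how much the correction matters here, we can tally the calls before and after adjustment with dplyr’s count() (a quick sketch; the numbers come from the columns created above):

# tally calls at p <= 0.05 vs. FDR <= 0.05
dnai2_fdr %>%
  count(significance, signif_corrected)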

Now, do the same for the heatr2_apms data.

# download the APMS data for heatr2
heatr2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "heatr2") %>%
  select(XENLA_GeneNames, matches("PSM"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ID = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double(),
##   bait = col_character(),
##   OG_category = col_character(),
##   XENLA_GeneNames = col_character(),
##   HUMAN_GeneNames = col_character(),
##   OG_annotation = col_character(),
##   OG_category_annotation = col_character()
## )
head(heatr2_apms)
## # A tibble: 6 x 6
##   XENLA_GeneNames                ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>                              <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 dnaaf5.L, dnaaf5.S                     0      274 241.         7.91      16.0 
## 2 gcn1.S, gcn1.L                        31       79   2.19       1.13       3.98
## 3 ipo7.L                                17       54   2.68       1.42       3.94
## 4 ipo9.L, ipo9.S                        17       49   2.43       1.28       3.49
## 5 myo5b.S, myo5c.L, myo5b.L, my…         0       10   9.64       3.27       3.05
## 6 hsph1.L, hsph1.S, hspa4.L, hs…         5       22   3.36       1.75       3.01

Step 1: Calculate p-values from the test statistic (in this case, the z-score).

# calculate p-values
heatr2_pvals <- heatr2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE)) %>%
  arrange(pval) %>%
  mutate(significance = case_when(pval <= 0.05 ~ "Significant",
                                  pval > 0.05 ~ "Not significant"))
heatr2_pvals
## # A tibble: 1,671 x 8
##    XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore     pval
##    <chr>               <dbl>    <dbl>  <dbl>      <dbl>      <dbl>    <dbl>
##  1 dnaaf5.L, dnaa…         0      274 241.         7.91      16.0  2.91e-58
##  2 gcn1.S, gcn1.L         31       79   2.19       1.13       3.98 3.51e- 5
##  3 ipo7.L                 17       54   2.68       1.42       3.94 4.13e- 5
##  4 ipo9.L, ipo9.S         17       49   2.43       1.28       3.49 2.43e- 4
##  5 myo5b.S, myo5c…         0       10   9.64       3.27       3.05 1.13e- 3
##  6 hsph1.L, hsph1…         5       22   3.36       1.75       3.01 1.29e- 3
##  7 Xelaev18029563…         2       15   4.67       2.22       2.97 1.48e- 3
##  8 pgm1.L, pgm5.S…         6       23   3.00       1.59       2.88 1.99e- 3
##  9 ppp6r3.L, ppp6…         2       14   4.38       2.13       2.82 2.39e- 3
## 10 xpo7.S, xpo7.L          1       11   5.26       2.39       2.74 3.04e- 3
## # … with 1,661 more rows, and 1 more variable: significance <chr>

Step 2: Visualize how many results are called “significant” at a 5% false positive rate.

# visualize what is significant at 5% false positive rate (p < 0.05)
ggplot(heatr2_pvals, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = significance)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')

Step 3: Calculate the false discovery rate using the Benjamini-Hochberg method.

# calculate the false discovery rate
heatr2_fdr <- heatr2_pvals %>%
  mutate(fdr = p.adjust(pval, method = "BH", n = length(pval))) %>%
  mutate(signif_corrected = case_when(fdr <= 0.05 ~ "Significant",
                                      fdr > 0.05 ~ "Not significant")) %>%
  arrange(fdr)

Step 4: Visualize how many results are called “significant” at a 5% false discovery rate.

# visualize what is significant at 5% false discovery rate (fdr < 0.05)
ggplot(heatr2_fdr, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1, color = signif_corrected)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(type = 'qual', palette = 'Set1')
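As with dnai2, a quick tally (sketch) shows how many heatr2 hits survive the correction:

# tally calls at p <= 0.05 vs. FDR <= 0.05
heatr2_fdr %>%
  count(significance, signif_corrected)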

2. Principal Component Analysis (PCA)

The iris dataset has four measurements per observational unit (iris plant):

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

If we want to find out which characteristics best distinguish the three iris species, we have to make many individual plots and hope we can spot patterns:

p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() +
  scale_color_colorblind()

p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()

p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_colorblind()

p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()

p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid

In this particular case, it seems that petal length and petal width are the most distinct for the three species. Principal component analysis (PCA) allows us to discover such patterns systematically, and it also works when there are many more variables than just four.

The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():

pca <- iris %>% 
  select(-Species) %>%  # remove Species column
  scale() %>%           # scale to 0 mean and unit variance
  prcomp()              # perform the PCA

# now display the results from the PCA analysis
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

The main outputs of PCA are the standard deviations and the rotation matrix. However, we’re usually most interested in the coordinates of each observation in principal-component space (PC1, PC2, and so on), available as pca$x:

head(pca$x)
##            PC1        PC2         PC3          PC4
## [1,] -2.257141 -0.4784238  0.12727962  0.024087508
## [2,] -2.074013  0.6718827  0.23382552  0.102662845
## [3,] -2.356335  0.3407664 -0.04405390  0.028282305
## [4,] -2.291707  0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825  0.006586116

But since we removed the categorical Species column in order to scale the matrix, we have lost that information. We need to add it back to the data frame using the data.frame() function we discussed on day 1:

pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
##         PC1        PC2         PC3          PC4 Species
## 1 -2.257141 -0.4784238  0.12727962  0.024087508  setosa
## 2 -2.074013  0.6718827  0.23382552  0.102662845  setosa
## 3 -2.356335  0.3407664 -0.04405390  0.028282305  setosa
## 4 -2.291707  0.5953999 -0.09098530 -0.065735340  setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870  setosa
## 6 -2.068701 -1.4842053 -0.02687825  0.006586116  setosa

And now we can plot as usual:

ggplot(pca_data, aes(x = PC1, y = PC2, 
                     color = Species)) + 
  geom_point() +
  scale_color_colorblind()

In the above plot, the species are much better separated. Now, if we look at the rotation matrix, we can learn which of the four flower measurements contribute most to the variation between species (i.e., how much each variable contributes to each principal component).

head(pca$rotation)
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

It’s most useful to plot this information using arrows, which let us visualize the magnitude of contribution to each PC:

# capture the rotation matrix in a data frame
rotation_data <- data.frame(pca$rotation, 
                            variable = row.names(pca$rotation))

# define a nice arrow style
arrow_style <- arrow(length = unit(0.05, "inches"),
                     type = "closed")

# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, arrow = arrow_style) + 
  geom_text(aes(x = PC1, y = PC2, 
                label = variable), 
            hjust = 0, 
            size = 3, 
            color = "#b80058") + 
  xlim(-1., 1.25) + 
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1

We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
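If you prefer numbers to arrows, the same conclusion can be read directly off the rotation matrix by sorting the absolute loadings (a quick sketch):

# rank variables by the size of their contribution to PC1 and PC2
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
sort(abs(pca$rotation[, "PC2"]), decreasing = TRUE)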

Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum of all squared standard deviations:

percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618  3.6689219  0.5178709

The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:

perc_data <- data.frame(percent = percent, PC = 1:length(percent))

ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) + 
  ylim(0, 80)
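If we also want the cumulative variance explained, cumsum() gives it directly (a quick sketch); from the values above, the first two components together account for roughly 96% of the variance in iris:

# cumulative percent variance explained
cumsum(percent)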

Try it yourself

Now do it yourself using the wine_features data set. The data set wine_features contains 6497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine’s relative quality (indicated by the quality column) on a scale of 3 to 9, as graded by experts in a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, and alcohol.

wine_features <- 
  read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
  mutate(type = as.factor(type)) %>%
  select(-quality_grade, -alcohol_grade, -acidity_grade)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   type = col_character(),
##   quality = col_double(),
##   quality_grade = col_character(),
##   alcohol = col_double(),
##   alcohol_grade = col_character(),
##   pH = col_double(),
##   acidity_grade = col_character(),
##   fixed_acidity = col_double(),
##   volatile_acidity = col_double(),
##   citric_acid = col_double(),
##   residual_sugar = col_double(),
##   chlorides = col_double(),
##   free_sulfur_dioxide = col_double(),
##   total_sulfur_dioxide = col_double(),
##   density = col_double(),
##   sulphates = col_double()
## )

Step 1: Remove the categorical variable type from the wine_features data frame, scale the data to zero mean and unit variance, then use prcomp() to perform PCA. Reattach type to the transformed PC values made available in pca$x. Hint: refer to the example above for the code.

# perform PCA on `wine_features` dataset
pca <- wine_features %>%
  select(-type) %>%
  scale() %>%
  prcomp()

# get transformation data from PCA object for further analysis
pca_data <- data.frame(pca$x, type = wine_features$type)

Step 2: Make a scatter plot of PC2 vs PC1, coloring by type.

# color PC2 vs. PC1 scatterplot by type of wine
ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) + 
  geom_point(alpha = 0.75, size = 2) + 
  scale_color_manual(values = c("#5b0b0b", "#bcb37b"))

Step 3: Convert the rotation matrix to a data frame, then plot the rotation data using arrows with geom_segment() and labels with geom_text(). Hint: refer to the example above for the code.

# capture the rotation matrix in a data frame
rotation_data <- data.frame(pca$rotation, 
                            variable = row.names(pca$rotation))

# define a nice arrow style
arrow_style <- arrow(length = unit(0.05, "inches"),
                     type = "closed")

# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, arrow = arrow_style) + 
  geom_text(aes(x = PC1, y = PC2, 
                label = variable), 
            hjust = 0, 
            size = 3, 
            color = "#5b0b0b") + 
  xlim(-1., 1.25) + 
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1

Step 4: Evaluate how much of the variance is explained by PC1 and PC2.

percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
##  [1] 25.3462261 22.0821166 13.6792235  8.9052105  7.0041705  5.5033265
##  [7]  4.6985537  4.2998570  3.8197690  2.4917742  1.8965627  0.2732097
perc_data <- data.frame(percent = percent, PC = 1:length(percent))

ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), size = 3, vjust = -0.5) + 
  ylim(0, 30)
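To answer Step 4 with a single number, we can sum the first two entries of percent (a quick sketch); from the values above, PC1 and PC2 together explain roughly 47% of the variance in wine_features:

# total percent variance explained by PC1 and PC2
sum(percent[1:2])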