June 24th, 2021
Ideally, a hypothesis test should proceed in 4 steps: (1) formulate a null hypothesis, (2) compute a test statistic that measures deviation from that null, (3) convert the test statistic to a p-value, and (4) compare the p-value to a pre-chosen significance level to decide whether to reject the null.
The most commonly used test in biology might be the t-test, which is useful for determining if there is a significant difference between your control and test cases (e.g., treated and untreated samples).
The PlantGrowth data set contains yields (as measured by the dried weight of plants in the weight column) obtained under a control and two different treatment conditions. Plot the distribution of plant weight measurements for each condition using geom_violin and formulate a hypothesis.
# your R code here
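A minimal sketch of one possible answer, using ggplot2 from the tidyverse (PlantGrowth is built into R, so no download is needed):

```r
library(tidyverse)

# PlantGrowth has two columns: `weight` (dried plant weight)
# and `group` (ctrl, trt1, trt2)
ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
  geom_violin() +
  theme_classic()
```

A reasonable hypothesis from this plot: trt2 shifts plant weight relative to the control, while trt1 does not.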
The plot shows that many “trt1” data points overlap with the control, while “trt2” might be significantly different. Formally evaluate this conclusion using the t.test() function by first comparing “ctrl” to “trt1” in the group column, and then comparing “ctrl” to “trt2”.
# your R code here
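One way to set this up (a sketch; the exact p-values are left for you to inspect):

```r
library(tidyverse)

# pull out the weight measurements for each condition
ctrl <- PlantGrowth %>% filter(group == "ctrl") %>% pull(weight)
trt1 <- PlantGrowth %>% filter(group == "trt1") %>% pull(weight)
trt2 <- PlantGrowth %>% filter(group == "trt2") %>% pull(weight)

t.test(ctrl, trt1, var.equal = TRUE)  # compare the reported p-value...
t.test(ctrl, trt2, var.equal = TRUE)  # ...to this one, which is much smaller
```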
When we set var.equal = TRUE, we’re making a simplifying assumption that the two groups we’re comparing have equal or near-equal variance. You can set var.equal = FALSE to remove this assumption, and R will perform Welch’s t-test instead. How does the p-value change?
The tidyverse ecosystem also contains a function called t_test() for setting up this test with tidy input/output.
# your R code here
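Core tidyverse does not itself export t_test(); it typically comes from a companion package such as rstatix (an assumption here — your class setup may differ). A sketch:

```r
library(rstatix)  # assumed installed; provides a tidy t_test() wrapper

# one row per comparison against the control group, as a tibble
PlantGrowth %>%
  t_test(weight ~ group, ref.group = "ctrl", var.equal = TRUE)
```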
The p-value is the end result uniting all hypothesis tests. Built-in statistical tests in R such as t.test() (or t_test() in the tidyverse) generally calculate this value for you. However, depending on the experiment and the size of the data, we might find it easier to manually compute our own test statistics and resulting p-values.
We will use the APMS data analyzed yesterday to demonstrate this. These data have a manually computed z-score for each observed protein. This paper contains the exact equation used to calculate the PSM_zscore column: https://elifesciences.org/articles/58662
# download the APMS data for dnai2
dnai2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "dnai2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regex, not a glob
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(dnai2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnai2.S 0 202 200. 7.64 14.2
## 2 loc100494633.L, loc100494633.… 80 265 3.23 1.69 9.90
## 3 dnai1.L 0 46 46.2 5.53 6.76
## 4 loc101730961.L, dnah11.L, dna… 1 35 17.7 4.15 5.64
## 5 cps1.L, cad.L 9 52 5.21 2.38 5.47
## 6 pc.S 72 133 1.81 0.853 4.16
To calculate p-values and create a new column containing those p-values, we will use a combination of the mutate() function from the tidyverse and the pnorm() function built into R. If we want to specify that this is a one-sided test, we need to include the argument lower.tail = FALSE. For visualization purposes, let’s also create a new column that labels each point as either “significant” or “not significant” based on a 0.05 significance level.
# your R code here
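A sketch, continuing from the dnai2_apms data frame loaded above (column names as shown by head()):

```r
dnai2_apms <- dnai2_apms %>%
  mutate(pval = pnorm(PSM_zscore, lower.tail = FALSE),  # one-sided, upper tail
         significance = ifelse(pval < 0.05,
                               "significant", "not significant"))
head(dnai2_apms)
```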
If we threshold the data at a 0.05 significance level, our results indicate that there are 144 “significant” interactors. Let’s visualize the raw PSM counts on a log-log plot of the control versus the experiment. If we color by our new significance column, we can see where each “significant” point lies on the distribution.
# your R code here
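One way to draw this, assuming the significance column from the previous step; note that zero PSM counts are silently dropped on a log scale, so a pseudocount keeps them visible:

```r
ggplot(dnai2_apms, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1,  # + 1 pseudocount
                       color = significance)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic()
```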
Looking at this plot, a lot of proteins we’re calling “significant” lie directly along the edge of the “background proteins.” Since we are performing 1,685 hypothesis tests (i.e., “is this protein interacting with dnai2?”), at a 0.05 threshold we expect roughly 84 of them (1,685 × 0.05) to be false positives by chance alone. To correct for this, we use the p.adjust() function built into R. There are a number of ways to correct for the multiple-testing problem, ranging from extremely conservative (e.g., Bonferroni) to less conservative (e.g., Benjamini-Hochberg). We will specify the argument method = "BH" to indicate we want to use the Benjamini-Hochberg method. Let’s also create a new column called signif_corrected that labels each point as either “significant” or “not significant” based on a 0.05 false discovery rate threshold.
# your R code here
# your R code here
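A sketch, assuming the pval column was created as in the earlier step:

```r
dnai2_apms <- dnai2_apms %>%
  mutate(pval_bh = p.adjust(pval, method = "BH"),
         signif_corrected = ifelse(pval_bh < 0.05,
                                   "significant", "not significant"))

# re-draw the log-log plot, coloring by the corrected calls
ggplot(dnai2_apms, aes(x = ctrl_PSMs + 1, y = exp_PSMs + 1,
                       color = signif_corrected)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic()
```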
Now, do the same for the heatr2_apms data.
# download the APMS data for heatr2
heatr2_apms <- read_csv("https://rachaelcox.github.io/classes/datasets/frog_heatr2_dnai2_annotated.csv") %>%
  filter(bait == "heatr2") %>%
  select(XENLA_GeneNames, matches("PSM"))  # matches() takes a regex, not a glob
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double(),
## bait = col_character(),
## OG_category = col_character(),
## XENLA_GeneNames = col_character(),
## HUMAN_GeneNames = col_character(),
## OG_annotation = col_character(),
## OG_category_annotation = col_character()
## )
head(heatr2_apms)
## # A tibble: 6 x 6
## XENLA_GeneNames ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dnaaf5.L, dnaaf5.S 0 274 241. 7.91 16.0
## 2 gcn1.S, gcn1.L 31 79 2.19 1.13 3.98
## 3 ipo7.L 17 54 2.68 1.42 3.94
## 4 ipo9.L, ipo9.S 17 49 2.43 1.28 3.49
## 5 myo5b.S, myo5c.L, myo5b.L, my… 0 10 9.64 3.27 3.05
## 6 hsph1.L, hsph1.S, hspa4.L, hs… 5 22 3.36 1.75 3.01
Step 1: Calculate p-values from the test statistic (in this case, the z-score).
# your R code here
Step 2: Visualize how many results are called “significant” at a 5% false positive rate.
# your R code here
Step 3: Calculate the false discovery rate using the Benjamini-Hochberg method.
# your R code here
Step 4: Visualize how many results are called “significant” at a 5% false discovery rate.
# your R code here
The iris dataset has four measurements per observational unit (iris plant):
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid
In this particular case, it seems that petal length and petal width are the most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it also works when there are many more variables than just four.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():
# your R code here
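A sketch of the three steps, using the iris data from above:

```r
library(tidyverse)

pca <- iris %>%
  select(-Species) %>%  # (i) keep only the numeric measurement columns
  scale() %>%           # (ii) center to mean 0, scale to unit variance
  prcomp()              # (iii) run PCA
summary(pca)
```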
The main outputs of PCA are the standard deviations and the rotation matrix. However, we’re most interested in the principal component (PC) coordinates, available as pca$x:
# your R code here
But since we removed the categorical Species column in order to scale the matrix, we have lost that information. We need to add it back onto the data frame using the data.frame() function we discussed on day 1:
# your R code here
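A sketch, assuming pca is the prcomp object from the previous step:

```r
# reattach the Species labels to the PC coordinates
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
```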
And now we can plot as usual:
# your R code here
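A sketch, assuming pca_data holds the PC coordinates plus Species:

```r
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  scale_color_colorblind()  # from ggthemes, as in the plots above
```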
In the above plot, the species are much better separated. Now, if we look at the rotation matrix, we can learn which of the 4 flower measurements contribute most to the variation between species (i.e., how much each variable contributes to each principal component).
# your R code here
It’s most useful to plot this information using arrows, which let us visualize the magnitude of contribution to each PC:
# your R code here
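A sketch, assuming pca from above; the axis limits are arbitrary choices to leave room for the labels:

```r
# convert the rotation matrix to a data frame for ggplot
rotation_data <- data.frame(pca$rotation,
                            variable = row.names(pca$rotation))

ggplot(rotation_data) +
  geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0,
               arrow = arrow(length = unit(0.1, "in"))) +
  geom_text(aes(x = PC1, y = PC2, label = variable),
            hjust = 0, size = 3, color = "red") +
  xlim(-1, 1.25) +
  coord_fixed()
```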
We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:
# your R code here
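A sketch, assuming pca from above:

```r
percent <- 100 * pca$sdev^2 / sum(pca$sdev^2)
percent  # roughly 73, 23, 4, and 0.5 for the scaled iris data
```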
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:
# your R code here
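A sketch, assuming the percent vector from the previous step:

```r
perc_data <- data.frame(percent = percent, PC = 1:4)

ggplot(perc_data, aes(x = PC, y = percent)) +
  geom_col() +
  geom_text(aes(label = round(percent, 1)), vjust = -0.5) +
  ylim(0, 80)
```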
Now do it yourself using the wine_features data set. The wine_features data set contains 6,497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine’s relative quality (indicated by the quality column) on a scale of 3 to 9, as graded by experts performing a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, and alcohol.
wine_features <-
read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
mutate(type = as.factor(type)) %>%
select(-quality_grade, -alcohol_grade, -acidity_grade)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## type = col_character(),
## quality = col_double(),
## quality_grade = col_character(),
## alcohol = col_double(),
## alcohol_grade = col_character(),
## pH = col_double(),
## acidity_grade = col_character(),
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## sulphates = col_double()
## )
Step 1: Remove the categorical variable type from the wine_features data frame, scale the data around 0, then use prcomp() to perform PCA. Reattach type to the transformed PC values made available in pca$x. Hint: Refer to the example above for code.
# your R code here
Step 2: Make a scatter plot of PC2 vs PC1, coloring by type.
# your R code here
Step 3: Convert the rotation matrix to a data frame, then plot the rotation data using arrows with geom_segment() and geom_text(). Hint: Refer to the example above for code.
# your R code here
Step 4: Evaluate how much of the variance is explained by PC1 and PC2.
# your R code here