Day 4: Machine learning & advanced data visualization

In-class worksheet

June 3rd, 2022

There are two primary types of machine learning: unsupervised learning (unlabeled data) and supervised learning (labeled data). Within those two broad categories, there are myriad algorithms and strategies you can choose from depending on your data and your goals. For the sake of practicality, we will learn how to execute a few of the most common strategies in bioinformatics and data science in general: dimensionality reduction and classification.

1. Dimensionality reduction

1.1 Principal component analysis

The basic steps in PCA are to (1) prepare a data frame that holds only the numerical columns of interest, (2) scale the data to 0 mean and unit variance, and (3) do the PCA with the function prcomp(). I will demonstrate this now with the iris dataset.

pca <- iris %>% 
  select(-Species) %>%   # remove Species column
  scale() %>%            # scale to 0 mean and unit variance
  prcomp()               # do PCA

# display the results from the PCA analysis
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

As we can see, these data don’t tell us which species each observation belongs to. We have to add the species information back in, and then plot as usual:

# add species information back into PCA data
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
##         PC1        PC2         PC3          PC4 Species
## 1 -2.257141 -0.4784238  0.12727962  0.024087508  setosa
## 2 -2.074013  0.6718827  0.23382552  0.102662845  setosa
## 3 -2.356335  0.3407664 -0.04405390  0.028282305  setosa
## 4 -2.291707  0.5953999 -0.09098530 -0.065735340  setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870  setosa
## 6 -2.068701 -1.4842053 -0.02687825  0.006586116  setosa
# PCA plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) + 
  geom_point() +
  scale_color_colorblind()

Now, we are interested in learning how much each variable contributes to each principal component. This information comes from the rotation matrix, which is accessed with pca$rotation and can be visualized with arrows using geom_segment() and geom_text_repel() from the ggrepel package.

# capture the rotation matrix in a data frame
rotation_data <- data.frame(
  pca$rotation, 
  variable = row.names(pca$rotation))

# define an arrow style
arrow_style <- arrow(
  length = unit(0.05, "inches"),
  type = "closed")

# now plot, using geom_segment() for arrows and geom_text_repel() for labels
ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, arrow = arrow_style) + 
  geom_text_repel(aes(x = PC1, y = PC2, label = variable),
                  color = "#3C5488",
                  nudge_x = 0.1,
                  segment.size = 0) + 
  xlim(-1., 1.25) + 
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1

Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations.

# calculate variance
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618  3.6689219  0.5178709
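As a quick sanity check, the same proportions appear in the “Proportion of Variance” row when we call summary() on the prcomp object:

# equivalent: summary() reports the proportion of variance per component
summary(pca)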

The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:

# put calculation in a data frame
perc_data <- data.frame(percent = percent, PC = 1:length(percent))

# plot explained variance per PC
ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), 
            size = 4, vjust = -0.5) + 
  ylim(0, 80)

Now, perform these same steps using the wine data set. This data set contains 6497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine’s relative quality (indicated by the quality/quality_grade column) on a scale of 3 to 9 as graded by experts performing a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol.

# download the wine data set
wine <- read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
  select(-matches("grade"))  # drop the quality_grade column, keep the numeric quality

# custom colors for plotting
pal_wine <- c("#790000", "#9e934d")
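
Before running the PCA, it can help to take a quick look at the columns we just read in (this assumes the tidyverse, which provides glimpse(), is already loaded as in the earlier code):

# quick look at the columns and their types
glimpse(wine)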

Follow these steps to execute a PCA on the wine data set:

  1. Prepare the data frame so that it contains only numerical columns.
  2. Standardize the data by scaling to 0 mean and unit variance.
  3. Perform PCA with function prcomp().
pca <- wine %>% 
  select(-type) %>%      # remove type column
  scale() %>%            # scale to 0 mean and unit variance
  prcomp()               # do PCA

# display the results from the PCA analysis
pca
## Standard deviations (1, .., p=12):
##  [1] 1.7440032 1.6278372 1.2812130 1.0337433 0.9167881 0.8126495 0.7508838
##  [8] 0.7183195 0.6770320 0.5468207 0.4770613 0.1810667
## 
## Rotation (n x k) = (12 x 12):
##                              PC1        PC2         PC3         PC4
## quality               0.08747571  0.2966009 -0.29583773  0.47243936
## alcohol              -0.05892408  0.4927275 -0.21293142  0.08003179
## pH                   -0.20806957  0.1529219  0.40678741  0.47147768
## fixed_acidity        -0.25692873 -0.2618431 -0.46748619 -0.14396377
## volatile_acidity     -0.39493118 -0.1051983  0.27968932 -0.08005785
## citric_acid           0.14646061 -0.1440935 -0.58807557  0.05551036
## residual_sugar        0.31890519 -0.3425850  0.07550170  0.11245623
## chlorides            -0.31344994 -0.2697701 -0.04676921  0.16529004
## free_sulfur_dioxide   0.42269137 -0.1111788  0.09899801  0.30330631
## total_sulfur_dioxide  0.47441968 -0.1439475  0.10128143  0.13223199
## density              -0.09243753 -0.5549205  0.05156338  0.15057853
## sulphates            -0.29985192 -0.1196342 -0.16869128  0.58801992
##                               PC5         PC6         PC7          PC8
## quality              -0.459129140  0.27788835 -0.27317740  0.093046206
## alcohol              -0.116023190  0.16947538  0.33890566  0.226040866
## pH                    0.001457502 -0.56089714  0.07932113  0.362281828
## fixed_acidity        -0.165362611 -0.03003708  0.39343530  0.001155415
## volatile_acidity     -0.147774077  0.38266373  0.44511080  0.310077574
## citric_acid           0.234621394 -0.36224839  0.04769762  0.444962196
## residual_sugar       -0.507921181  0.06331719 -0.09576310  0.081944829
## chlorides             0.393896604  0.42544212 -0.47329609  0.375531558
## free_sulfur_dioxide   0.248451958  0.28318017  0.36271398  0.120097626
## total_sulfur_dioxide  0.223966811  0.10676882  0.23481304  0.011279269
## density              -0.330357303 -0.15455292  0.01328572  0.042943625
## sulphates             0.193245549  0.02014082  0.17023612 -0.592220645
##                              PC9        PC10         PC11          PC12
## quality               0.35664731  0.30782956  0.018082206  0.0082880344
## alcohol              -0.41706026 -0.41605905  0.191289020  0.3320120790
## pH                    0.13666158 -0.11240888  0.139083153 -0.2067626762
## fixed_acidity         0.42416913 -0.27243245  0.276932032 -0.3350925313
## volatile_acidity     -0.12323319  0.49394624 -0.140799120 -0.0824207438
## citric_acid          -0.24623265  0.33035538 -0.229276179  0.0013466535
## residual_sugar       -0.48802361 -0.20717448 -0.005144195 -0.4512148285
## chlorides            -0.04404975 -0.23887258  0.193398397 -0.0432778637
## free_sulfur_dioxide   0.30139683 -0.30344826 -0.486158400 -0.0009050256
## total_sulfur_dioxide  0.00181386  0.29478016  0.720162498  0.0640632122
## density               0.07108094 -0.07681483  0.003324491  0.7156665472
## sulphates            -0.29740108  0.08546912 -0.047219109 -0.0781995574

Next, visualize the results of your PCA.

  1. Plot a scatter plot of PC2 vs PC1 using geom_point(). Note: First, you will need to re-combine your PCA results with the type column.
# add type information back into PCA data
pca_data <- data.frame(pca$x, type = wine$type)
head(pca_data)
##         PC1        PC2       PC3        PC4        PC5        PC6       PC7
## 1 -3.348180 -0.5688824  2.727176 -0.2237595 -0.6213606 -0.2315842 0.1248412
## 2 -3.228347 -1.1972425  1.998750 -0.3771252 -0.1103048  1.9457356 0.9383406
## 3 -3.237219 -0.9525067  1.746443 -0.4727542 -0.2253865  1.0824773 0.4519407
## 4 -1.672432 -1.6004598 -2.856332 -0.4383301 -0.2130769 -0.9706900 0.2224124
## 5 -3.348180 -0.5688824  2.727176 -0.2237595 -0.6213606 -0.2315842 0.1248412
## 6 -3.151994 -0.5562308  2.680877 -0.1631769 -0.5342549 -0.2947880 0.0980166
##             PC8       PC9        PC10        PC11        PC12 type
## 1  0.0005152445 0.6462178  0.06913538 -0.10424102  0.02764104  red
## 2 -0.4254996250 0.1147934  0.44912501 -0.25823103 -0.01446823  red
## 3 -0.4113187173 0.1206255  0.27286696 -0.08160722  0.05395677  red
## 4  0.2796681050 1.2948003 -0.21255728  0.12051766 -0.10056640  red
## 5  0.0005152445 0.6462178  0.06913538 -0.10424102  0.02764104  red
## 6 -0.0725326436 0.7218271 -0.04260129 -0.05377719  0.06508351  red
# PCA plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) + 
  geom_point(alpha = 0.5, size = 2) +
  scale_color_manual(values = pal_wine)

  2. Plot the relative contribution of each variable to each principal component using geom_segment(). Note: First, you will need to capture the rotation matrix in a data frame.
# capture the rotation matrix in a data frame
rotation_data <- data.frame(
  pca$rotation, 
  variable = row.names(pca$rotation))

# now plot, using geom_segment() for arrows and geom_text_repel() for labels
ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, arrow = arrow_style) + 
  geom_text_repel(aes(x = PC1, y = PC2, label = variable),
                  color = pal_wine[1],
                  segment.size = 0,
                  size = 3,
                  nudge_x = 0.06,
                  nudge_y = 0.001) + 
  xlim(-0.75, 1.25) + 
  ylim(-0.75, 1) +
  coord_fixed() # fix aspect ratio to 1:1

  3. Plot the percent of variance explained by each principal component using geom_col().
# calculate variance
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
##  [1] 25.3462261 22.0821166 13.6792235  8.9052105  7.0041705  5.5033265
##  [7]  4.6985537  4.2998570  3.8197690  2.4917742  1.8965627  0.2732097
# put calculation in a data frame
perc_data <- data.frame(percent = percent, PC = 1:length(percent))

# plot explained variance per PC
ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), 
            size = 4, vjust = -0.5) + 
  ylim(0, 80)

Bonus challenge: Create a panel of your PCA plots using the patchwork package. On your panel, label your plots with “A”, “B”, and “C” tags. In order to pass your plots to patchwork, you will need to assign your ggplot objects to variables.

# PC2 vs PC1 plot
plot1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) + 
  geom_point(alpha = 0.5, size = 2) +
  scale_color_manual(values = pal_wine)

# rotation matrix
plot2 <- ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), 
               x = 0, y = 0, arrow = arrow_style) + 
  geom_text_repel(aes(x = PC1, y = PC2, label = variable),
                  color = pal_wine[1],
                  segment.size = 0,
                  size = 3,
                  nudge_x = 0.06,
                  nudge_y = 0.001) + 
  xlim(-0.75, 1.25) + 
  ylim(-0.75, 1) +
  coord_fixed() # fix aspect ratio to 1:1

# percent variance explained per PC
plot3 <- ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), 
            size = 4, vjust = -0.5) + 
  ylim(0, 80)

# plot as a panel
panel <- plot1 / (plot2 + plot3)
panel
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# add plot tags
panel_tagged <- panel + plot_annotation(tag_levels = 'A')
panel_tagged
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
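
If you want to keep the figure, ggsave() works on patchwork objects just like on single ggplots; here is a minimal sketch (the file name and dimensions are arbitrary choices):

# save the tagged panel to disk
ggsave("wine_pca_panel.png", panel_tagged, width = 10, height = 8)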

2. Supervised classification

In the previous section, we used PCA to investigate how different categories of data are described by variables in the data, and the relative contribution of each of those variables. Now, we are going to put that variance to the test and see if we are able to use these chemical features of wine to predict whether a given wine is red or white.

2.0 Test & training data

Before implementing any supervised classification pipeline, the data must be split into test and training sets. The model is built on the training set, and then the performance of the model is assessed on the test set. Typical splits fall around 60-80% training data, 20-40% test data. Let’s do that now with the same wine data set we performed PCA on.

set.seed(13) # set the seed to make the "random" partition reproducible
train_fraction <- 0.6 # fraction of data for training purposes
train_size <- floor(train_fraction * nrow(wine)) # number of observations in training set
train_indices <- sample(1:nrow(wine), size = train_size)

wine_train <- wine[train_indices, ] # get training data
wine_test <- wine[-train_indices, ] # get test data
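
As an aside, the caret package (loaded later in this worksheet) offers createDataPartition(), which draws a stratified split so that the red/white proportions are preserved in both sets. A sketch, not required for the rest of the exercise:

# stratified 60/40 split on wine type (alternative to the manual sample() above)
library(caret)
train_idx <- createDataPartition(wine$type, p = 0.6, list = FALSE)[, 1]
wine_train_strat <- wine[train_idx, ]
wine_test_strat <- wine[-train_idx, ]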

2.1 Logistic regression

If you have sufficiently descriptive data, a logistic regression or “generalized linear model” (GLM) should be your go-to model for straightforward binary classification problems. The built-in glm() function in R works extremely well for this task, and it is one of the ML tools we will demonstrate today. Let’s see if a GLM can predict whether a wine is red or white from its chemical properties.

# model to use: type ~ volatile_acidity + total_sulfur_dioxide + alcohol + density
glm_out <- glm(factor(type) ~ volatile_acidity + total_sulfur_dioxide + alcohol + density,
                     data = wine_train, 
                     family = binomial)

summary(glm_out)
## 
## Call:
## glm(formula = factor(type) ~ volatile_acidity + total_sulfur_dioxide + 
##     alcohol + density, family = binomial, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5862   0.0013   0.0318   0.0929   3.5211  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           9.806e+02  7.245e+01  13.534  < 2e-16 ***
## volatile_acidity     -1.234e+01  8.206e-01 -15.036  < 2e-16 ***
## total_sulfur_dioxide  6.618e-02  3.545e-03  18.670  < 2e-16 ***
## alcohol              -7.997e-01  1.355e-01  -5.902  3.6e-09 ***
## density              -9.773e+02  7.179e+01 -13.613  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4365.19  on 3897  degrees of freedom
## Residual deviance:  587.71  on 3893  degrees of freedom
## AIC: 597.71
## 
## Number of Fisher Scoring iterations: 8
# results data frame for training data
wine_train_glm <- data.frame(
  predictor = predict(glm_out, wine_train),
  known_truth = wine_train$type,
  model = "training"
)

# results data frame for test data
wine_test_glm <- data.frame(
  predictor = predict(glm_out, wine_test),
  known_truth = wine_test$type,
  model = "test"
)

# combining data frames with results
wine_combined <- rbind(wine_train_glm, wine_test_glm)

# create a density plot that shows how the linear predictor separates the two wine types in the test data
ggplot(wine_test_glm, aes(x = predictor, fill = factor(known_truth))) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = pal_wine)
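
To turn the linear predictor into something more interpretable, we can ask predict() for probabilities and check a simple accuracy on the test set; a quick sketch, where the 0.5 cutoff is an arbitrary choice for illustration:

# predicted probability that each test wine is white
# (glm models the probability of the second factor level, here "white")
pred_prob <- predict(glm_out, wine_test, type = "response")
pred_type <- ifelse(pred_prob > 0.5, "white", "red")
mean(pred_type == wine_test$type) # fraction of test wines classified correctly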

# visualize the performance of the model
p <- ggplot(wine_combined, aes(d = known_truth, m = predictor, color = model)) +
  geom_roc(n.cuts = 0) +
  coord_fixed() +
  scale_color_manual(values = pal_wine)
p
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!

# we can also calculate the area under the curve (AUC) to get a numeric summary of the performance of our model
calc_auc(p)
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
##   PANEL group    model       AUC
## 1     1     1 training 0.9921338
## 2     1     2     test 0.9930421
# but we've lost the name of our groups!
# here is how we add them back
model_names <- unique(wine_combined$model)
model_info <- data.frame(
  model_names,
  group = order(model_names))

left_join(model_info, calc_auc(p)) %>%
  select(-group, -PANEL) %>%
  arrange(desc(AUC))
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
## Joining, by = "group"
##   model_names    model       AUC
## 1        test     test 0.9930421
## 2    training training 0.9921338

2.2 Random forest classifier

While more of a “black box” than a logistic regression, random forest classifiers are powerful algorithms for biological data. This is due to a number of inherent advantages, such as:

  • They make no statistical assumptions about the independence or collinearity of the data
  • They are relatively robust to outliers and overfitting
  • Their random sampling of features at each split tends to produce good predictions

With that being said, RFCs take significantly longer to run and the results are not as intuitive to assess in comparison to GLMs. Thus, it’s a good idea to think of RFCs as a “bear trap” and a GLM as a “mouse trap.” What are you trying to capture with your data? It’s possible to catch a mouse with a bear trap, but doing so is overkill!

Additionally, with any machine learning model, we need some kind of assurance that the model has extracted most of the patterns from the data correctly. We need to make sure that it’s not picking up too much noise, i.e., that the model has low bias and variance. This process is known as validation. We didn’t do this with our GLM (though we could have), but we will demonstrate it here. To figure out how well our classifier will generalize to an independent/unseen dataset, we’ll use the cross-validation parameters specified by the fitControl variable below, which is then passed to the train() function.

# load caret library
library(caret)

# designate validation method
fitControl <- trainControl(method = "cv", summaryFunction = twoClassSummary, 
                           savePredictions = T, classProbs = T)


# fit a model using the random forest algorithm
rf_out <- train(type ~ ., data = wine_train, method = "rf",
                trControl = fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_out
## Random Forest 
## 
## 3898 samples
##   12 predictor
##    2 classes: 'red', 'white' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3508, 3509, 3508, 3508, 3508, 3508, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9988021  0.9824098  0.9986360
##    7    0.9980928  0.9782431  0.9959091
##   12    0.9972662  0.9730670  0.9938636
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# evaluate feature importance
rf_feat_importance <- varImp(rf_out)
ggplot(rf_feat_importance)
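
We can also check how well the fitted model generalizes to the held-out test data; a short sketch using predict() on the caret model and caret’s confusionMatrix():

# predict wine type for the held-out test set
rf_test_pred <- predict(rf_out, newdata = wine_test)

# compare predictions against the known types
confusionMatrix(data = rf_test_pred, reference = factor(wine_test$type))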

As with our GLM, we want to assess the performance of the model. We can extract the ROC info from the caret model with the following code:

# select a parameter setting
selectedIndices <- rf_out$pred$mtry == 2
head(rf_out$pred[selectedIndices, ])
##    pred   obs   red white rowIndex mtry Resample
## 1 white white 0.008 0.992        4    2   Fold01
## 2 white white 0.002 0.998       20    2   Fold01
## 3   red   red 0.992 0.008       24    2   Fold01
## 4 white white 0.004 0.996       28    2   Fold01
## 5 white white 0.002 0.998       55    2   Fold01
## 6   red   red 0.958 0.042       56    2   Fold01
# plot the ROC curve
rf_plot <- ggplot(rf_out$pred[selectedIndices, ], 
            aes(m=white, d=factor(obs, levels = c("red", "white")))) + 
  geom_roc(n.cuts=0)
rf_plot
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
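
As with the GLM, calc_auc() from plotROC converts this ROC curve into a single number summarizing the cross-validated performance of the random forest:

# numeric AUC for the cross-validated random forest predictions
calc_auc(rf_plot)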