June 3rd, 2022
There are two primary types of machine learning: unsupervised learning (unlabeled data) and supervised learning (labeled data). Within those two broad categories, there are myriad algorithms and strategies you can choose from depending on your data and your goals. For the sake of practicality, we will learn how to execute a few of the most common strategies in bioinformatics and data science in general: dimensionality reduction and classification.
The basic steps in PCA are to (1) prepare a data frame that holds only the numerical columns of interest, (2) scale the data to 0 mean and unit variance, and (3) do the PCA with the function prcomp(). I will demonstrate this now with the iris dataset.
pca <- iris %>%
select(-Species) %>% # remove Species column
scale() %>% # scale to 0 mean and unit variance
prcomp() # do PCA
# display the results from the PCA analysis
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
As we can see, these data don't tell us which observation belongs to which species. We have to add the species information back in, and then plot as usual:
# add species information back into PCA data
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
## PC1 PC2 PC3 PC4 Species
## 1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa
## 2 -2.074013 0.6718827 0.23382552 0.102662845 setosa
## 3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa
## 4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa
## 6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
# PCA plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
geom_point() +
scale_color_colorblind()
Now, we are interested in learning how much each variable contributes to each principal component. This information comes from the rotation matrix, which is accessed with pca$rotation and can be visualized with arrows using geom_segment() and geom_text_repel() from the ggrepel package.
# capture the rotation matrix in a data frame
rotation_data <- data.frame(
pca$rotation,
variable = row.names(pca$rotation))
# define an arrow style
arrow_style <- arrow(
length = unit(0.05, "inches"),
type = "closed")
# now plot, using geom_segment() for arrows and geom_text_repel() for labels
ggplot(rotation_data) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0, y = 0, arrow = arrow_style) +
geom_text_repel(aes(x = PC1, y = PC2, label = variable),
color = "#3C5488",
nudge_x = 0.1,
segment.size = 0) +
xlim(-1., 1.25) +
ylim(-1., 1.) +
coord_fixed() # fix aspect ratio to 1:1
Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations.
# calculate variance
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618 3.6689219 0.5178709
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:
# put calculation in a data frame
perc_data <- data.frame(percent = percent, PC = 1:length(percent))
# plot explained variance per PC
ggplot(perc_data, aes(x = PC, y = percent)) +
geom_col() +
geom_text(aes(label = round(percent, 2)),
size = 4, vjust = -0.5) +
ylim(0, 80)
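It can also be handy to look at the cumulative variance explained, e.g. to decide how many components are worth keeping. A minimal sketch using the percent vector computed above:
# cumulative percent variance explained by the first n principal components;
# the first two PCs together capture ~96% of the variance in the iris data
cumsum(percent)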
Now, perform these same steps using the wine data set. This data set contains 6497 rows describing various chemical attributes of red and white wine (indicated by the type column) and each wine's relative quality (indicated by the quality/quality_grade column) on a scale of 3 to 9 as graded by experts performing a blind taste test. Chemical attributes recorded for each wine include fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, and alcohol.
# download the wine data set
wine <- read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv") %>%
select(-matches("*grade"))
# custom colors for plotting
pal_wine <- c("#790000", "#9e934d")
Follow these steps to execute a PCA on the wine data set:
Remove the type column, scale the data, and perform the PCA with prcomp().
pca <- wine %>%
select(-type) %>% # remove type column
scale() %>% # scale to 0 mean and unit variance
prcomp() # do PCA
# display the results from the PCA analysis
pca
## Standard deviations (1, .., p=12):
## [1] 1.7440032 1.6278372 1.2812130 1.0337433 0.9167881 0.8126495 0.7508838
## [8] 0.7183195 0.6770320 0.5468207 0.4770613 0.1810667
##
## Rotation (n x k) = (12 x 12):
## PC1 PC2 PC3 PC4
## quality 0.08747571 0.2966009 -0.29583773 0.47243936
## alcohol -0.05892408 0.4927275 -0.21293142 0.08003179
## pH -0.20806957 0.1529219 0.40678741 0.47147768
## fixed_acidity -0.25692873 -0.2618431 -0.46748619 -0.14396377
## volatile_acidity -0.39493118 -0.1051983 0.27968932 -0.08005785
## citric_acid 0.14646061 -0.1440935 -0.58807557 0.05551036
## residual_sugar 0.31890519 -0.3425850 0.07550170 0.11245623
## chlorides -0.31344994 -0.2697701 -0.04676921 0.16529004
## free_sulfur_dioxide 0.42269137 -0.1111788 0.09899801 0.30330631
## total_sulfur_dioxide 0.47441968 -0.1439475 0.10128143 0.13223199
## density -0.09243753 -0.5549205 0.05156338 0.15057853
## sulphates -0.29985192 -0.1196342 -0.16869128 0.58801992
## PC5 PC6 PC7 PC8
## quality -0.459129140 0.27788835 -0.27317740 0.093046206
## alcohol -0.116023190 0.16947538 0.33890566 0.226040866
## pH 0.001457502 -0.56089714 0.07932113 0.362281828
## fixed_acidity -0.165362611 -0.03003708 0.39343530 0.001155415
## volatile_acidity -0.147774077 0.38266373 0.44511080 0.310077574
## citric_acid 0.234621394 -0.36224839 0.04769762 0.444962196
## residual_sugar -0.507921181 0.06331719 -0.09576310 0.081944829
## chlorides 0.393896604 0.42544212 -0.47329609 0.375531558
## free_sulfur_dioxide 0.248451958 0.28318017 0.36271398 0.120097626
## total_sulfur_dioxide 0.223966811 0.10676882 0.23481304 0.011279269
## density -0.330357303 -0.15455292 0.01328572 0.042943625
## sulphates 0.193245549 0.02014082 0.17023612 -0.592220645
## PC9 PC10 PC11 PC12
## quality 0.35664731 0.30782956 0.018082206 0.0082880344
## alcohol -0.41706026 -0.41605905 0.191289020 0.3320120790
## pH 0.13666158 -0.11240888 0.139083153 -0.2067626762
## fixed_acidity 0.42416913 -0.27243245 0.276932032 -0.3350925313
## volatile_acidity -0.12323319 0.49394624 -0.140799120 -0.0824207438
## citric_acid -0.24623265 0.33035538 -0.229276179 0.0013466535
## residual_sugar -0.48802361 -0.20717448 -0.005144195 -0.4512148285
## chlorides -0.04404975 -0.23887258 0.193398397 -0.0432778637
## free_sulfur_dioxide 0.30139683 -0.30344826 -0.486158400 -0.0009050256
## total_sulfur_dioxide 0.00181386 0.29478016 0.720162498 0.0640632122
## density 0.07108094 -0.07681483 0.003324491 0.7156665472
## sulphates -0.29740108 0.08546912 -0.047219109 -0.0781995574
Next, visualize the results of your PCA.
Plot PC2 vs PC1 with geom_point(), coloring points by wine type. Note: First, you will need to re-combine your PCA results with the type column.
# add type information back into PCA data
pca_data <- data.frame(pca$x, type = wine$type)
head(pca_data)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 1 -3.348180 -0.5688824 2.727176 -0.2237595 -0.6213606 -0.2315842 0.1248412
## 2 -3.228347 -1.1972425 1.998750 -0.3771252 -0.1103048 1.9457356 0.9383406
## 3 -3.237219 -0.9525067 1.746443 -0.4727542 -0.2253865 1.0824773 0.4519407
## 4 -1.672432 -1.6004598 -2.856332 -0.4383301 -0.2130769 -0.9706900 0.2224124
## 5 -3.348180 -0.5688824 2.727176 -0.2237595 -0.6213606 -0.2315842 0.1248412
## 6 -3.151994 -0.5562308 2.680877 -0.1631769 -0.5342549 -0.2947880 0.0980166
## PC8 PC9 PC10 PC11 PC12 type
## 1 0.0005152445 0.6462178 0.06913538 -0.10424102 0.02764104 red
## 2 -0.4254996250 0.1147934 0.44912501 -0.25823103 -0.01446823 red
## 3 -0.4113187173 0.1206255 0.27286696 -0.08160722 0.05395677 red
## 4 0.2796681050 1.2948003 -0.21255728 0.12051766 -0.10056640 red
## 5 0.0005152445 0.6462178 0.06913538 -0.10424102 0.02764104 red
## 6 -0.0725326436 0.7218271 -0.04260129 -0.05377719 0.06508351 red
# PCA plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) +
geom_point(alpha = 0.5, size = 2) +
scale_color_manual(values = pal_wine)
Visualize the rotation matrix with geom_segment() and geom_text_repel(). Note: First, you will need to capture the rotation matrix in a data frame.
# capture the rotation matrix in a data frame
rotation_data <- data.frame(
pca$rotation,
variable = row.names(pca$rotation))
# now plot, using geom_segment() for arrows and geom_text_repel() for labels
ggplot(rotation_data) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0, y = 0, arrow = arrow_style) +
geom_text_repel(aes(x = PC1, y = PC2, label = variable),
color = pal_wine[1],
segment.size = 0,
size = 3,
nudge_x = 0.06,
nudge_y = 0.001) +
xlim(-0.75, 1.25) +
ylim(-0.75, 1) +
coord_fixed() # fix aspect ratio to 1:1
Visualize the percent variance explained by each principal component with geom_col().
# calculate variance
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 25.3462261 22.0821166 13.6792235 8.9052105 7.0041705 5.5033265
## [7] 4.6985537 4.2998570 3.8197690 2.4917742 1.8965627 0.2732097
# put calculation in a data frame
perc_data <- data.frame(percent = percent, PC = 1:length(percent))
# plot explained variance per PC
ggplot(perc_data, aes(x = PC, y = percent)) +
geom_col() +
geom_text(aes(label = round(percent, 2)),
size = 4, vjust = -0.5) +
ylim(0, 80)
Bonus challenge: Create a panel of your PCA plots using the patchwork package. On your panel, label your plots with "A", "B", and "C" tags. In order to pass your plots to patchwork, you will need to assign your ggplot objects to variables.
# PC2 vs PC1 plot
plot1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = type)) +
geom_point(alpha = 0.5, size = 2) +
scale_color_manual(values = pal_wine)
# rotation matrix
plot2 <- ggplot(rotation_data) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0, y = 0, arrow = arrow_style) +
geom_text_repel(aes(x = PC1, y = PC2, label = variable),
color = pal_wine[1],
segment.size = 0,
size = 3,
nudge_x = 0.06,
nudge_y = 0.001) +
xlim(-0.75, 1.25) +
ylim(-0.75, 1) +
coord_fixed() # fix aspect ratio to 1:1
# variance captured per PC
plot3 <- ggplot(perc_data, aes(x = PC, y = percent)) +
geom_col() +
geom_text(aes(label = round(percent, 2)),
size = 4, vjust = -0.5) +
ylim(0, 80)
# plot as a panel
panel <- plot1 / (plot2 + plot3)
panel
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# add plot tags
panel_tagged <- panel + plot_annotation(tag_levels = 'A')
panel_tagged
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
In the previous section, we used PCA to investigate how different categories of data are described by variables in the data, and the relative contribution of each of those variables. Now, we are going to put that variance to the test and see if we are able to use these chemical features of wine to predict whether a given wine is red or white.
Before implementing any supervised classification pipeline, the data must be split into training and test sets. The model is built on the training set, and then its performance is assessed on the test set. Typical splits fall around 60-80% training data and 20-40% test data. Let's do that now with the same wine data set we performed PCA on.
set.seed(13) # set the seed to make the "random" partition reproducible
train_fraction <- 0.6 # fraction of data for training purposes
train_size <- floor(train_fraction * nrow(wine)) # number of observations in training set
train_indices <- sample(1:nrow(wine), size = train_size)
wine_train <- wine[train_indices, ] # get training data
wine_test <- wine[-train_indices, ] # get test data
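As an optional sanity check (a minimal sketch, assuming dplyr is loaded as part of the tidyverse used above), we can confirm that both wine types are well represented in each partition:
# count red and white wines in the training and test sets
wine_train %>% count(type)
wine_test %>% count(type)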
If you have sufficiently descriptive data, a logistic regression or "generalized linear model" (GLM) should be your go-to model for straightforward binary classification problems. The built-in GLM function in R works extremely well for this task, and it is one of the ML tools we will demonstrate today. Let's see if a GLM can predict whether a wine is red or white from its chemical properties.
# model to use: type ~ volatile_acidity + total_sulfur_dioxide + alcohol + density
glm_out <- glm(factor(type) ~ volatile_acidity + total_sulfur_dioxide + alcohol + density,
data = wine_train,
family = binomial)
summary(glm_out)
##
## Call:
## glm(formula = factor(type) ~ volatile_acidity + total_sulfur_dioxide +
## alcohol + density, family = binomial, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.5862 0.0013 0.0318 0.0929 3.5211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.806e+02 7.245e+01 13.534 < 2e-16 ***
## volatile_acidity -1.234e+01 8.206e-01 -15.036 < 2e-16 ***
## total_sulfur_dioxide 6.618e-02 3.545e-03 18.670 < 2e-16 ***
## alcohol -7.997e-01 1.355e-01 -5.902 3.6e-09 ***
## density -9.773e+02 7.179e+01 -13.613 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4365.19 on 3897 degrees of freedom
## Residual deviance: 587.71 on 3893 degrees of freedom
## AIC: 597.71
##
## Number of Fisher Scoring iterations: 8
# results data frame for training data
wine_train_glm <- data.frame(
predictor = predict(glm_out, wine_train),
known_truth = wine_train$type,
model = "training"
)
# results data frame for test data
wine_test_glm <- data.frame(
predictor = predict(glm_out, wine_test),
known_truth = wine_test$type,
model = "test"
)
# combining data frames with results
wine_combined <- rbind(wine_train_glm, wine_test_glm)
# create a density plot that shows how the linear predictor separates the two wine types in the test data
ggplot(wine_test_glm, aes(x = predictor, fill = factor(known_truth))) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = pal_wine)
# visualize the performance of the model
p <- ggplot(wine_combined, aes(d = known_truth, m = predictor, color = model)) +
geom_roc(n.cuts = 0) +
coord_fixed() +
scale_color_manual(values = pal_wine)
p
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
# we can also calculate the area of the curve (AUC) to have a numeric representation of the performance of our model
calc_auc(p)
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
## PANEL group model AUC
## 1 1 1 training 0.9921338
## 2 1 2 test 0.9930421
# but we've lost the name of our groups!
# here is how we add them back
model_names <- unique(wine_combined$model)
model_info <- data.frame(
model_names,
group = order(model_names))
left_join(model_info, calc_auc(p)) %>%
select(-group, -PANEL) %>%
arrange(desc(AUC))
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
## Joining, by = "group"
## model_names model AUC
## 1 test test 0.9930421
## 2 training training 0.9921338
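The ROC curve and AUC summarize performance across all possible cutoffs. If we instead want hard class predictions, a minimal sketch (not part of the original walkthrough) is to convert the linear predictor to a probability with the inverse logit and apply a 0.5 cutoff. Because "white" is the second factor level, the GLM models the probability that a wine is white:
# convert the test-set linear predictor to P(white) and tabulate
# predicted vs. known wine type (assumes dplyr is loaded)
wine_test_pred <- wine_test_glm %>%
  mutate(prob_white = plogis(predictor),  # inverse logit
         predicted = ifelse(prob_white > 0.5, "white", "red"))
table(truth = wine_test_pred$known_truth, prediction = wine_test_pred$predicted)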
While more of a "black box" than a logistic regression, random forest classifiers (RFCs) are powerful algorithms for biological data, owing to a number of advantages inherent in how they behave.
That said, RFCs take significantly longer to run and the results are not as intuitive to assess in comparison to GLMs. Thus, it's a good idea to think of an RFC as a "bear trap" and a GLM as a "mouse trap." What are you trying to capture with your data? It's possible to catch a mouse with a bear trap, but doing so is overkill!
Additionally, with any machine learning model, we need some kind of assurance that the model has extracted most of the patterns from the data correctly. We need to make sure that it's not picking up too much noise, i.e., that the model has low bias and variance. This process is known as validation. We didn't do this with our GLM (though we could have), but we will demonstrate it here. To figure out how well our classifier will generalize to an independent/unseen dataset, we'll use the cross-validation parameters specified by the fitControl variable below, which is then passed to the train() function.
# load caret library
library(caret)
# designate validation method
fitControl <- trainControl(method = "cv", summaryFunction = twoClassSummary,
savePredictions = T, classProbs = T)
# fit a model using the random forest algorithm
rf_out <- train(type ~ ., data = wine_train, method = "rf",
trControl = fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_out
## Random Forest
##
## 3898 samples
## 12 predictor
## 2 classes: 'red', 'white'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3508, 3509, 3508, 3508, 3508, 3508, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9988021 0.9824098 0.9986360
## 7 0.9980928 0.9782431 0.9959091
## 12 0.9972662 0.9730670 0.9938636
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# evaluate feature importance
rf_feat_importance <- varImp(rf_out)
ggplot(rf_feat_importance)
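Note that trainControl(method = "cv") defaults to 10 folds, which is why the output above reports "Cross-Validated (10 fold)". caret supports other validation schemes as well; for example, a hedged sketch (not run here) of 5-fold cross-validation repeated 3 times:
# repeated k-fold cross-validation: 5 folds, repeated 3 times
fitControl_repeated <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                                    summaryFunction = twoClassSummary,
                                    savePredictions = TRUE, classProbs = TRUE)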
Similarly to our GLM, we want to assess the performance of the model. We can extract the ROC info from the caret model with the following code:
# select a parameter setting
selectedIndices <- rf_out$pred$mtry == 2
head(rf_out$pred[selectedIndices, ])
## pred obs red white rowIndex mtry Resample
## 1 white white 0.008 0.992 4 2 Fold01
## 2 white white 0.002 0.998 20 2 Fold01
## 3 red red 0.992 0.008 24 2 Fold01
## 4 white white 0.004 0.996 28 2 Fold01
## 5 white white 0.002 0.998 55 2 Fold01
## 6 red red 0.958 0.042 56 2 Fold01
# plot the ROC curve
rf_plot <- ggplot(rf_out$pred[selectedIndices, ],
aes(m=white, d=factor(obs, levels = c("red", "white")))) +
geom_roc(n.cuts=0)
rf_plot
## Warning in verify_d(data$d): D not labeled 0/1, assuming red = 0 and white = 1!
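As with the GLM, we can reduce this ROC curve to a single number; a minimal sketch using calc_auc() from the plotROC package (the same 0/1 labeling warning may appear):
# area under the ROC curve for the cross-validated random forest predictions
calc_auc(rf_plot)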