Machine learning (also referred to as predictive analytics or predictive modeling) is a branch of artificial intelligence that uses statistics to extract patterns from big datasets. In a nutshell, machine learning algorithms are designed to identify and exploit the most important/descriptive features of your data.
Today, we’re going to use caret (http://topepo.github.io/caret/index.html), a sort of “all-inclusive” R package for machine learning, to build and test our legendary Pokemon model.
Not only does caret allow you to run a wide range of ML methods, but it also provides built-in functionality for auxiliary techniques such as:
1. Data preparation
2. Data splitting
3. Variable selection
4. Model evaluation
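Each of these steps maps onto a caret function we’ll meet below:
# preProcess()          - data preparation (imputation, centering, scaling)
# createDataPartition() - data splitting
# varImp()              - variable importance ranking (a form of variable selection)
# confusionMatrix()     - model evaluation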
First we need to set up how we want our output to look (knitr, a feature of R Markdown, will take care of this for us) and install/load the R packages we will need for our experiment today.
PRO-TIP: you can execute a single line of code with the shortcut ‘Ctrl+Enter’ or a whole chunk of code with ‘Ctrl+Shift+Enter’
#install.packages("tidyverse")
#install.packages("skimr")
#install.packages("caret")
#install.packages("RANN")
#install.packages("plotly")
#install.packages("yardstick")
library(tidyverse) # collection of packages for easy data import, tidying, manipulation and visualization
library(skimr) # package for getting dataset stats at a glance
library(caret) # package for "Classification And REgression Training"
library(RANN) # required for some caret functionalities
library(plotly) # package for interacting with plots
library(yardstick) # package for evaluating statistical models
setwd("/stor/work/Marcotte/project/rmcox/github_repos/pokemon_machine_learning_demo/")
Next, we will need to load the Pokemon data into our environment. This is a publicly available dataset scraped a couple of years ago from http://serebii.net/; it contains information on 800 Pokemon spanning Generations 1-7.
The information contained in this dataset includes Base Stats, Performance against Other Types, Height, Weight, Classification, Egg Steps, Experience Points, Abilities, etc.
PRO-TIP: you can make a pipe (%>%) with the shortcut ‘Ctrl+Shift+M’
# The read_csv() function loads in the data and automatically detects it as comma-delimited.
poke_data <- read_csv("pokemon_data_all.csv", col_names=TRUE) %>%
select(name, everything())
## Parsed with column specification:
## cols(
## .default = col_double(),
## abilities = col_character(),
## classfication = col_character(),
## japanese_name = col_character(),
## name = col_character(),
## type1 = col_character(),
## type2 = col_character()
## )
## See spec(...) for full column specifications.
# The glimpse() function is nice for looking at all the variable names and types.
# You can see there are 800 pokemon ("observations") and 41 features ("variables").
glimpse(poke_data)
## Observations: 800
## Variables: 41
## $ name <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Charmande…
## $ abilities <chr> "['Overgrow', 'Chlorophyll']", "['Overgrow', '…
## $ against_bug <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 0.25, 1.00, 1.00…
## $ against_dark <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_dragon <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_electric <dbl> 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1…
## $ against_fairy <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1…
## $ against_fight <dbl> 0.50, 0.50, 0.50, 1.00, 1.00, 0.50, 1.00, 1.00…
## $ against_fire <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2…
## $ against_flying <dbl> 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2…
## $ against_ghost <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
## $ against_grass <dbl> 0.25, 0.25, 0.25, 0.50, 0.50, 0.25, 2.00, 2.00…
## $ against_ground <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0…
## $ against_ice <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 1.0, 0.5, 0.5, 0.5, 1…
## $ against_normal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_poison <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1…
## $ against_psychic <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1…
## $ against_rock <dbl> 1, 1, 1, 2, 2, 4, 1, 1, 1, 2, 2, 4, 2, 2, 2, 2…
## $ against_steel <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1…
## $ against_water <dbl> 0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 1…
## $ attack <dbl> 49, 62, 100, 52, 64, 104, 48, 63, 103, 30, 20,…
## $ base_egg_steps <dbl> 5120, 5120, 5120, 5120, 5120, 5120, 5120, 5120…
## $ base_happiness <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70…
## $ base_total <dbl> 318, 405, 625, 309, 405, 634, 314, 405, 630, 1…
## $ capture_rate <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120, …
## $ classfication <chr> "Seed Pokémon", "Seed Pokémon", "Seed Pokémon"…
## $ defense <dbl> 49, 63, 123, 43, 58, 78, 65, 80, 120, 35, 55, …
## $ experience_growth <dbl> 1059860, 1059860, 1059860, 1059860, 1059860, 1…
## $ height_m <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6, 0…
## $ hp <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60…
## $ japanese_name <chr> "Fushigidaneフシギダネ", "Fushigisouフシギソウ", "Fushig…
## $ percentage_male <dbl> 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1…
## $ pokedex_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ sp_attack <dbl> 65, 80, 122, 60, 80, 159, 50, 65, 135, 20, 25,…
## $ sp_defense <dbl> 65, 80, 120, 50, 65, 115, 64, 80, 115, 20, 25,…
## $ speed <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 7…
## $ type1 <chr> "grass", "grass", "grass", "fire", "fire", "fi…
## $ type2 <chr> "poison", "poison", "poison", NA, NA, "flying"…
## $ weight_kg <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5, …
## $ generation <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ is_legendary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
# The head() function lets us look at the first few rows of the dataframe and its format.
head(poke_data)
## # A tibble: 6 x 41
## name abilities against_bug against_dark against_dragon against_electric
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bulb… ['Overgr… 1 1 1 0.5
## 2 Ivys… ['Overgr… 1 1 1 0.5
## 3 Venu… ['Overgr… 1 1 1 0.5
## 4 Char… ['Blaze'… 0.5 1 1 1
## 5 Char… ['Blaze'… 0.5 1 1 1
## 6 Char… ['Blaze'… 0.25 1 1 2
## # … with 35 more variables: against_fairy <dbl>, against_fight <dbl>,
## # against_fire <dbl>, against_flying <dbl>, against_ghost <dbl>,
## # against_grass <dbl>, against_ground <dbl>, against_ice <dbl>,
## # against_normal <dbl>, against_poison <dbl>, against_psychic <dbl>,
## # against_rock <dbl>, against_steel <dbl>, against_water <dbl>,
## # attack <dbl>, base_egg_steps <dbl>, base_happiness <dbl>,
## # base_total <dbl>, capture_rate <dbl>, classfication <chr>,
## # defense <dbl>, experience_growth <dbl>, height_m <dbl>, hp <dbl>,
## # japanese_name <chr>, percentage_male <dbl>, pokedex_number <dbl>,
## # sp_attack <dbl>, sp_defense <dbl>, speed <dbl>, type1 <chr>,
## # type2 <chr>, weight_kg <dbl>, generation <dbl>, is_legendary <dbl>
# Luckily our dataset is already in the right format, with individual observations as rows and features as columns.
# If this were not the case, we would have to use gather() or spread() to get it in the right format.
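# For instance, a toy sketch on hypothetical data (not this dataset):
# wide <- tibble(pokemon = c("Bulbasaur", "Charmander"), attack = c(49, 52), defense = c(49, 43))
# long <- wide %>% gather(key = "stat", value = "value", attack, defense)  # wide -> long
# wide2 <- long %>% spread(key = "stat", value = "value")                  # long -> wide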
# Some of these variables are not going to be helpful for training our model: "japanese_name" is always unique, and "abilities" and "classfication" (note the misspelling in the source data) are nearly always unique. Let's take these out for simplicity's sake. We'll keep "name" (also unique) around for now to identify rows, and drop it just before training.
poke_data <- poke_data %>%
select(-japanese_name, -abilities, -classfication)
# Also right now, legendary status is described with 1's and 0's so let's change that to TRUE and FALSE.
poke_data <- poke_data %>%
mutate(is_legendary = as.logical(is_legendary))
# The next step is to look for missing data, a common problem in big datasets. The skimr package provides a nice solution for this, along with showing key descriptive stats for each column.
skim(poke_data)
Name | poke_data |
Number of rows | 800 |
Number of columns | 38 |
_______________________ | |
Column type frequency: | |
character | 3 |
logical | 1 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 12 | 0 | 800 | 0 |
type1 | 0 | 1.00 | 3 | 8 | 0 | 18 | 0 |
type2 | 384 | 0.52 | 3 | 8 | 0 | 18 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
is_legendary | 0 | 1 | 0.09 | FAL: 730, TRU: 70 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
against_bug | 0 | 1.00 | 1.00 | 0.60 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_dark | 0 | 1.00 | 1.06 | 0.44 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_dragon | 0 | 1.00 | 0.97 | 0.35 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 2.0 | ▁▁▇▁▁ |
against_electric | 0 | 1.00 | 1.07 | 0.65 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▅▇▃▁▁ |
against_fairy | 0 | 1.00 | 1.07 | 0.52 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_fight | 0 | 1.00 | 1.07 | 0.72 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▇▅▁▁ |
against_fire | 0 | 1.00 | 1.14 | 0.69 | 2.50e-01 | 0.50 | 1.000e+00 | 2.00 | 4.0 | ▇▁▂▁▁ |
against_flying | 0 | 1.00 | 1.19 | 0.60 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_ghost | 0 | 1.00 | 0.98 | 0.56 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▂▇▂▁▁ |
against_grass | 0 | 1.00 | 1.03 | 0.79 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_ground | 0 | 1.00 | 1.10 | 0.74 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▃▇▃▁▁ |
against_ice | 0 | 1.00 | 1.21 | 0.74 | 2.50e-01 | 0.50 | 1.000e+00 | 2.00 | 4.0 | ▇▁▃▁▁ |
against_normal | 0 | 1.00 | 0.89 | 0.27 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 1.0 | ▁▁▁▁▇ |
against_poison | 0 | 1.00 | 0.98 | 0.55 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▃▇▂▁▁ |
against_psychic | 0 | 1.00 | 1.01 | 0.50 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▂▇▂▁▁ |
against_rock | 0 | 1.00 | 1.25 | 0.70 | 2.50e-01 | 1.00 | 1.000e+00 | 2.00 | 4.0 | ▇▁▃▁▁ |
against_steel | 0 | 1.00 | 0.98 | 0.50 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_water | 0 | 1.00 | 1.06 | 0.61 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
attack | 0 | 1.00 | 77.83 | 32.17 | 5.00e+00 | 55.00 | 7.500e+01 | 100.00 | 185.0 | ▂▇▆▂▁ |
base_egg_steps | 0 | 1.00 | 7192.00 | 6562.26 | 1.28e+03 | 5120.00 | 5.120e+03 | 5440.00 | 30720.0 | ▇▁▁▁▁ |
base_happiness | 0 | 1.00 | 65.36 | 19.61 | 0.00e+00 | 70.00 | 7.000e+01 | 70.00 | 140.0 | ▁▁▇▁▁ |
base_total | 0 | 1.00 | 428.29 | 119.25 | 1.80e+02 | 320.00 | 4.350e+02 | 505.00 | 780.0 | ▃▆▇▂▁ |
capture_rate | 0 | 1.00 | 98.76 | 76.26 | 3.00e+00 | 45.00 | 6.000e+01 | 170.00 | 255.0 | ▇▃▂▂▂ |
defense | 0 | 1.00 | 73.03 | 30.78 | 5.00e+00 | 50.00 | 7.000e+01 | 90.00 | 230.0 | ▅▇▂▁▁ |
experience_growth | 0 | 1.00 | 1054989.82 | 160356.00 | 6.00e+05 | 1000000.00 | 1.000e+06 | 1059860.00 | 1640000.0 | ▂▇▅▅▁ |
height_m | 20 | 0.98 | 1.17 | 1.08 | 1.00e-01 | 0.60 | 1.000e+00 | 1.50 | 14.5 | ▇▁▁▁▁ |
hp | 0 | 1.00 | 68.97 | 26.59 | 1.00e+00 | 50.00 | 6.500e+01 | 80.00 | 255.0 | ▃▇▁▁▁ |
percentage_male | 97 | 0.88 | 55.16 | 20.26 | 0.00e+00 | 50.00 | 5.000e+01 | 50.00 | 100.0 | ▁▁▇▁▂ |
pokedex_number | 0 | 1.00 | 400.53 | 231.14 | 1.00e+00 | 200.75 | 4.005e+02 | 600.25 | 801.0 | ▇▇▇▇▇ |
sp_attack | 0 | 1.00 | 71.27 | 32.36 | 1.00e+01 | 45.00 | 6.500e+01 | 91.00 | 194.0 | ▅▇▅▁▁ |
sp_defense | 0 | 1.00 | 70.92 | 27.96 | 2.00e+01 | 50.00 | 6.600e+01 | 90.00 | 230.0 | ▇▇▂▁▁ |
speed | 0 | 1.00 | 66.27 | 28.86 | 5.00e+00 | 45.00 | 6.500e+01 | 85.00 | 180.0 | ▃▇▆▁▁ |
weight_kg | 20 | 0.98 | 61.41 | 109.42 | 1.00e-01 | 9.00 | 2.715e+01 | 64.85 | 999.9 | ▇▁▁▁▁ |
generation | 0 | 1.00 | 3.69 | 1.93 | 1.00e+00 | 2.00 | 4.000e+00 | 5.00 | 7.0 | ▇▅▃▅▅ |
# In our columns with character values, almost half the dataset is missing a value for "type2."
# If you're familiar with Pokemon, you know it's perfectly valid for a Pokemon to have only one type.
# Since having only one type might be important to a Pokemon's legendary status, we can feel justified in replacing all the "NA" values in the "type2" column with "none."
poke_data$type2[is.na(poke_data$type2)] <- "none"
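# Equivalent tidyverse alternative (same result, pipe-friendly):
# poke_data <- poke_data %>% mutate(type2 = replace_na(type2, "none"))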
# We also have missing values in our numeric columns. Let's check out how many legendary Pokemon are affected by this missing data.
poke_data %>% filter(is.na(weight_kg)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 1
poke_data %>% filter(is.na(height_m)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 1
poke_data %>% filter(is.na(percentage_male)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 63
# Let's remove the problematic "percentage_male" column. See text below for explanation.
poke_data <- poke_data %>%
select(-percentage_male)
# Plot how non-legendary vs. legendary Pokemon are distributed across some of the variables.
ggplot(poke_data, aes(x=type1)) +
geom_bar(aes(fill=is_legendary, group=is_legendary)) +
coord_flip()
ggplot(poke_data, aes(x=weight_kg, y=height_m, color=is_legendary)) +
geom_point(size=3, alpha=0.5)
## Warning: Removed 20 rows containing missing values (geom_point).
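Since we loaded plotly, we can make any of these ggplots interactive by wrapping them in ggplotly(); a quick optional aside (hover over points to see their values):
# ggplotly(ggplot(poke_data, aes(x=weight_kg, y=height_m, color=is_legendary)) +
#            geom_point(size=3, alpha=0.5))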
# Plot the distribution of the target (or response) variable and see if there is any class imbalance.
ggplot(poke_data, aes(x=is_legendary)) +
geom_bar() +
xlab("Legendary status") +
ylab("# Pokemon")
Missing values in the “percentage_male” column are particularly problematic, since a lot of legendary Pokemon fall in this subset (63 of the 70 legendaries, per the tally above). The reason for this is that legendary Pokemon are commonly genderless.
We can’t replace the N/A values with “none” like we did for the types above, because then we would be mixing numeric and character values in one column. We can’t replace them with 0, because that would imply 100% female, which is false. And we can’t impute them, because imputation would predict some value between 0 and 100, implying a gender ratio for Pokemon that don’t have one.
We need to split the dataset into training data (75%) and test data (25%). When building the predictive model, the algorithm should see the training data and ONLY the training data to learn the relationship between Pokemon stats and legendary status. What it learns about these relationships becomes our machine learning model.
# Now let's divide the dataset into training and test sets. The caret package has a nice function, createDataPartition(), for this purpose.
set.seed(13)
partition_index <- createDataPartition(poke_data$is_legendary, p=0.75, list=FALSE)
trainPoke <- poke_data[partition_index,]
testPoke <- poke_data[-partition_index,]
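As a quick sanity check, the fraction of legendary Pokemon should come out roughly similar in the two sets; we can eyeball that with mean():
# mean(trainPoke$is_legendary)
# mean(testPoke$is_legendary)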
There is no one-size-fits-all method for dealing with missing data. This data science blog post has a nice overview of the different ways you can deal with missing values in predictive analytics (https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4):
(Figure: ways to deal with missing data.)
The missing values in the “weight_kg” and “height_m” columns are less complex to deal with. They make up a smaller portion of the dataset (20 rows each), and they result from a bug in the web-scraping script used to generate the csv.
Let’s take this as an opportunity to demonstrate what imputation (value prediction) might look like. We can impute the missing values with the k-Nearest Neighbors algorithm, treating the rest of the available variables as predictors. Luckily, caret has a built-in function for this: preProcess().
This is why we partitioned the data into training and test sets first: we want our test set to be completely independent of our training set, and that includes keeping it out of the imputation.
# The "name" column is not important to model training, so let's take it out to avoid having to encode the labels.
trainPoke <- trainPoke %>%
select(-name) %>%
mutate(is_legendary = as.factor(is_legendary))
# Now let's make a model for predicting the missing values in the weight and height columns.
poke_missingdata_model <- preProcess(trainPoke, method='knnImpute')
poke_missingdata_model
## Created from 586 samples and 36 variables
##
## Pre-processing:
## - centered (33)
## - ignored (3)
## - 5 nearest neighbor imputation (33)
## - scaled (33)
# The output shows that the model has centered (subtracted the mean from) 33 variables, ignored 3 variables, used k=5 (considered 5 nearest neighbors) to predict missing values, and finally scaled (divided by the standard deviation) 33 variables.
# Now let's use this model to predict the missing values.
trainPoke_pp <- predict(poke_missingdata_model, newdata = trainPoke)
# Check to see if any N/A values remain. If FALSE, all values have been successfully imputed.
anyNA(trainPoke_pp)
## [1] FALSE
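We can also spot-check the centering and scaling: each preprocessed numeric column should now have a mean of ~0 and a standard deviation of ~1 (visible in the skim() output below):
# round(mean(trainPoke_pp$attack), 2)  # expect ~0
# round(sd(trainPoke_pp$attack), 2)    # expect ~1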
# We can also revisit the skim() function to confirm no missing values are present.
skim(trainPoke_pp)
Name | trainPoke_pp |
Number of rows | 601 |
Number of columns | 36 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 33 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
type1 | 0 | 1 | 3 | 8 | 0 | 18 | 0 |
type2 | 0 | 1 | 3 | 8 | 0 | 19 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
is_legendary | 0 | 1 | FALSE | 2 | FAL: 548, TRU: 53 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
against_bug | 0 | 1 | 0 | 1.00 | -1.24 | -0.83 | -0.01 | -0.01 | 4.91 | ▇▁▂▁▁ |
against_dark | 0 | 1 | 0 | 1.00 | -1.83 | -0.14 | -0.14 | -0.14 | 6.60 | ▇▁▁▁▁ |
against_dragon | 0 | 1 | 0 | 1.00 | -2.80 | 0.08 | 0.08 | 0.08 | 2.96 | ▁▁▇▁▁ |
against_electric | 0 | 1 | 0 | 1.00 | -1.63 | -0.87 | -0.10 | -0.10 | 4.48 | ▅▇▃▁▁ |
against_fairy | 0 | 1 | 0 | 1.00 | -1.56 | -0.13 | -0.13 | -0.13 | 5.59 | ▇▁▁▁▁ |
against_fight | 0 | 1 | 0 | 1.00 | -1.48 | -0.78 | -0.09 | -0.09 | 4.07 | ▇▇▅▁▁ |
against_fire | 0 | 1 | 0 | 1.00 | -1.27 | -0.91 | -0.20 | 1.21 | 4.05 | ▆▇▅▁▁ |
against_flying | 0 | 1 | 0 | 1.00 | -1.53 | -0.33 | -0.33 | 1.26 | 4.43 | ▇▁▂▁▁ |
against_ghost | 0 | 1 | 0 | 1.00 | -1.75 | 0.02 | 0.02 | 0.02 | 5.33 | ▂▇▂▁▁ |
against_grass | 0 | 1 | 0 | 1.00 | -0.98 | -0.67 | -0.05 | -0.05 | 3.70 | ▇▁▂▁▁ |
against_ground | 0 | 1 | 0 | 1.00 | -1.47 | -0.80 | -0.13 | -0.13 | 3.90 | ▅▇▃▁▁ |
against_ice | 0 | 1 | 0 | 1.00 | -1.33 | -0.98 | -0.28 | 1.11 | 3.91 | ▇▁▃▁▁ |
against_normal | 0 | 1 | 0 | 1.00 | -3.21 | 0.43 | 0.43 | 0.43 | 0.43 | ▁▁▁▁▇ |
against_poison | 0 | 1 | 0 | 1.00 | -1.80 | -0.87 | 0.05 | 0.05 | 5.62 | ▃▇▂▁▁ |
against_psychic | 0 | 1 | 0 | 1.00 | -2.07 | 0.00 | 0.00 | 0.00 | 6.18 | ▂▇▂▁▁ |
against_rock | 0 | 1 | 0 | 1.00 | -1.42 | -0.36 | -0.36 | 1.06 | 3.90 | ▇▁▃▁▁ |
against_steel | 0 | 1 | 0 | 1.00 | -1.50 | -0.98 | 0.05 | 0.05 | 6.26 | ▅▇▂▁▁ |
against_water | 0 | 1 | 0 | 1.00 | -1.32 | -0.91 | -0.09 | -0.09 | 4.85 | ▅▇▂▁▁ |
attack | 0 | 1 | 0 | 1.00 | -2.24 | -0.73 | -0.11 | 0.69 | 3.31 | ▂▇▆▂▁ |
base_egg_steps | 0 | 1 | 0 | 1.00 | -0.89 | -0.32 | -0.32 | -0.32 | 3.49 | ▇▁▁▁▁ |
base_happiness | 0 | 1 | 0 | 1.00 | -3.23 | 0.25 | 0.25 | 0.25 | 3.72 | ▁▁▇▁▁ |
base_total | 0 | 1 | 0 | 1.00 | -2.03 | -0.88 | 0.02 | 0.64 | 2.89 | ▃▆▇▂▁ |
capture_rate | 0 | 1 | 0 | 1.00 | -1.30 | -0.75 | -0.35 | 1.03 | 2.02 | ▇▅▂▃▂ |
defense | 0 | 1 | 0 | 1.00 | -2.17 | -0.73 | -0.15 | 0.56 | 5.04 | ▅▇▂▁▁ |
experience_growth | 0 | 1 | 0 | 1.00 | -2.77 | -0.35 | -0.35 | 0.01 | 3.53 | ▂▇▅▅▁ |
height_m | 0 | 1 | 0 | 0.99 | -0.96 | -0.51 | -0.16 | 0.28 | 11.82 | ▇▁▁▁▁ |
hp | 0 | 1 | 0 | 1.00 | -2.69 | -0.75 | -0.15 | 0.44 | 7.18 | ▃▇▁▁▁ |
pokedex_number | 0 | 1 | 0 | 1.00 | -1.69 | -0.84 | 0.00 | 0.87 | 1.76 | ▇▇▇▇▇ |
sp_attack | 0 | 1 | 0 | 1.00 | -1.87 | -0.80 | -0.19 | 0.58 | 3.75 | ▅▇▅▁▁ |
sp_defense | 0 | 1 | 0 | 1.00 | -1.80 | -0.75 | -0.18 | 0.66 | 5.58 | ▇▇▂▁▁ |
speed | 0 | 1 | 0 | 1.00 | -2.08 | -0.70 | -0.15 | 0.67 | 3.94 | ▅▇▆▁▁ |
weight_kg | 0 | 1 | 0 | 0.99 | -0.58 | -0.49 | -0.31 | 0.04 | 9.04 | ▇▁▁▁▁ |
generation | 0 | 1 | 0 | 1.00 | -1.37 | -0.85 | 0.19 | 0.71 | 1.76 | ▇▅▃▅▅ |
Categorical data are variables that contain character values rather than numeric values. For instance, Pokemon types are categories like “fairy” and “fire.” Many machine learning algorithms can’t operate on categorical data directly, requiring all input and output variables to be numeric.
However, the tree-based algorithms we’re going to use handle it just fine, so we’re not going to bother with it here (there’s a quick encoding demo at the very end).
There are many, many different machine learning algorithms (for both classification and regression) available via the caret package. For today’s demo, we will be training a Random Forest model because it plays nice with categorical data and datasets with imbalanced classes.
With any machine learning model, we need some kind of assurance that it has correctly extracted most of the patterns from the data. We need to make sure that it’s not picking up too much noise, i.e., that the model has low bias and variance. This process is known as validation. To figure out how well our classifier will generalize to an independent/unseen dataset, we’ll use k-fold cross-validation. This method divides the training data into k subsets. One of the k subsets is held out as the validation set, and the other k-1 subsets are pooled to form a training set. This is repeated k times, holding out a different subset each time.
Since every data point is in a validation set exactly once, and in a training set k-1 times, this significantly reduces bias (we use most of the data for fitting) and variance (most of the data is also used for validation).
# designate validation method
fitControl <- trainControl(method="repeatedcv", number=5,
repeats=5)
# fit a "Random Forest" model; ~206.871 seconds
pokeFit_ranf <- train(is_legendary~., data=trainPoke_pp, method = "rf",
trControl=fitControl)
pokeFit_ranf
## Random Forest
##
## 601 samples
## 35 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 481, 481, 480, 481, 481, 481, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9820438 0.8729632
## 35 0.9953443 0.9708658
## 68 0.9953415 0.9710380
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 35.
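train() tuned mtry, the number of features randomly sampled at each split, for us automatically. If we wanted to steer that search ourselves, we could hand train() a custom grid; a minimal sketch (not run here), assuming the same data and fitControl object:
# pokeFit_tuned <- train(is_legendary~., data=trainPoke_pp, method="rf",
#                        trControl=fitControl,
#                        tuneGrid=expand.grid(mtry=c(2, 10, 20, 35)))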
# Which features drive the predictions? varImp() ranks them.
varimp_ranf <- varImp(pokeFit_ranf)
ggplot(varimp_ranf)
# Preprocess the test set using the SAME imputation model we fit on the training data.
testPoke_pp <- predict(poke_missingdata_model, testPoke) %>%
mutate(is_legendary=as.factor(is_legendary))
anyNA(testPoke_pp)
## [1] FALSE
# Use the trained model to predict legendary status for the held-out test set.
predictionPoke <- predict(pokeFit_ranf, testPoke_pp)
predictionPoke
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [111] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [199] TRUE
## Levels: FALSE TRUE
# Attach the predictions to the test set so we can compare them to the true labels.
testPoke_pp$predicted <- predict(pokeFit_ranf, testPoke_pp)
target_outcomes <- testPoke_pp %>%
select(name, is_legendary, predicted) %>%
arrange(desc(is_legendary), desc(predicted))
# Evaluate performance with a confusion matrix ("TRUE", i.e. legendary, is the positive class).
confusionMatrix(reference=testPoke_pp$is_legendary, data=predictionPoke, mode='everything', positive='TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 180 1
## TRUE 2 16
##
## Accuracy : 0.9849
## 95% CI : (0.9566, 0.9969)
## No Information Rate : 0.9146
## P-Value [Acc > NIR] : 2.387e-05
##
## Kappa : 0.906
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.94118
## Specificity : 0.98901
## Pos Pred Value : 0.88889
## Neg Pred Value : 0.99448
## Precision : 0.88889
## Recall : 0.94118
## F1 : 0.91429
## Prevalence : 0.08543
## Detection Rate : 0.08040
## Detection Prevalence : 0.09045
## Balanced Accuracy : 0.96509
##
## 'Positive' Class : TRUE
##
A confusion matrix is a table that is often used to describe the performance of a classifier on a set of test data for which the true values are known. This is the basic structure:
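 | Reference: FALSE | Reference: TRUE |
---|---|---|
Prediction: FALSE | true negatives (TN) | false negatives (FN) |
Prediction: TRUE | false positives (FP) | true positives (TP) |
(Same orientation as caret’s output above: predictions as rows, true labels as columns.)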
Confusion matrices are used to calculate precision and recall, like so:
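precision = TP / (TP + FP)
recall = TP / (TP + FN)
Plugging in our results from above: precision = 16 / (16 + 2) ≈ 0.889 and recall = 16 / (16 + 1) ≈ 0.941, matching the Precision and Recall fields in the output.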
Here is a list of definitions for the rates that are often computed to evaluate a binary classifier (like ours):
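1. Sensitivity (= recall = true positive rate): TP / (TP + FN)
2. Specificity (true negative rate): TN / (TN + FP)
3. Precision (positive predictive value): TP / (TP + FP)
4. Negative predictive value: TN / (TN + FN)
5. Accuracy: (TP + TN) / (TP + TN + FP + FN)
6. Prevalence: (TP + FN) / (TP + TN + FP + FN)
7. Balanced accuracy: (sensitivity + specificity) / 2
8. F1 score: 2 * (precision * recall) / (precision + recall)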
### remove feature
# Re-run the entire pipeline without "base_egg_steps" to see how the model performs without it.
poke_data_noegg <- poke_data %>%
select(-base_egg_steps)
### split data
set.seed(13)
partition_index_noegg <- createDataPartition(poke_data_noegg$is_legendary, p=0.75, list=FALSE)
trainPoke_noegg <- poke_data_noegg[partition_index_noegg,]
testPoke_noegg <- poke_data_noegg[-partition_index_noegg,]
### train model
trainPoke_noegg <- trainPoke_noegg %>%
select(-name) %>%
mutate(is_legendary = as.factor(is_legendary))
poke_missingdata_model_noegg <- preProcess(trainPoke_noegg, method='knnImpute')
trainPoke_pp_noegg <- predict(poke_missingdata_model_noegg, newdata = trainPoke_noegg)
pokeFit_ranf_noegg <- train(is_legendary~., data=trainPoke_pp_noegg, method = "rf",
trControl=fitControl)
varimp_ranf_noegg <- varImp(pokeFit_ranf_noegg)
ggplot(varimp_ranf_noegg)
### test model
testPoke_pp_noegg <- predict(poke_missingdata_model_noegg, testPoke_noegg) %>%
mutate(is_legendary=as.factor(is_legendary))
anyNA(testPoke_pp_noegg)
## [1] FALSE
predictionPoke_noegg <- predict(pokeFit_ranf_noegg, testPoke_pp_noegg)
testPoke_pp_noegg$predicted <- predict(pokeFit_ranf_noegg, testPoke_pp_noegg)
target_outcomes_noegg <- testPoke_pp_noegg %>%
select(name, is_legendary, predicted) %>%
arrange(desc(is_legendary), desc(predicted))
### evaluate model
confusionMatrix(reference=testPoke_pp_noegg$is_legendary, data=predictionPoke_noegg, mode='everything', positive='TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 180 1
## TRUE 2 16
##
## Accuracy : 0.9849
## 95% CI : (0.9566, 0.9969)
## No Information Rate : 0.9146
## P-Value [Acc > NIR] : 2.387e-05
##
## Kappa : 0.906
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.94118
## Specificity : 0.98901
## Pos Pred Value : 0.88889
## Neg Pred Value : 0.99448
## Precision : 0.88889
## Recall : 0.94118
## F1 : 0.91429
## Prevalence : 0.08543
## Detection Rate : 0.08040
## Detection Prevalence : 0.09045
## Balanced Accuracy : 0.96509
##
## 'Positive' Class : TRUE
##
Even with “base_egg_steps” removed, the model’s performance on the held-out test set is identical. Finally, random forest is just one of many algorithms available through caret; modelLookup() lists them all, and we can filter for classification-only methods:
modelLookup() %>% filter(forReg==FALSE)
There’s plenty more in caret that we didn’t cover today:
1. Regression models
2. Encoding categorical variables (a quick dummyVars() demo below)
3. Feature engineering
4. Tuning your model
Bye!
# Convert categorical variables to as many binary variables as there are categories
dummyPoke_model <- dummyVars(is_legendary~., data=trainPoke)
# Create dummy variables using predict
trainPoke_encoded <- predict(dummyPoke_model, newdata=trainPoke)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$lvls): variable 'is_legendary' is not a factor
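The result, trainPoke_encoded, is a numeric matrix in which each level of a categorical variable (like type1 and type2) gets its own 0/1 indicator column. The warning appears to concern the outcome variable in the formula, which predict() on a dummyVars object doesn’t return anyway, so it’s safe to ignore here.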