Machine learning (also referred to as predictive analytics or predictive modeling) is a branch of artificial intelligence that uses statistics to extract patterns from big datasets. In a nutshell, machine learning algorithms are designed to identify and exploit the most important/descriptive features of your data.
Today, we’re going to use caret (http://topepo.github.io/caret/index.html), a sort of “all-inclusive” R package for machine learning, to build and test our legendary Pokemon model.
Not only does caret allow you to run a wide range of ML methods, but it also provides built-in functionality for auxiliary techniques such as:
1. Data preparation
2. Data splitting
3. Variable selection
4. Model evaluation
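Each of these steps maps onto a caret function we’ll meet below:
# preProcess()          - data preparation (imputation, centering, scaling)
# createDataPartition() - data splitting
# varImp()              - variable importance ranking (a form of variable selection)
# confusionMatrix()     - model evaluation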
First we need to set up how we want our output to look (knitr, a feature of R Markdown, will take care of this for us) and install/load the R packages we will need for our experiment today.
PRO-TIP: you can execute a single line of code with the shortcut ‘Ctrl+Enter’ or a whole chunk of code with ‘Ctrl+Shift+Enter’
#install.packages("tidyverse")
#install.packages("skimr")
#install.packages("caret")
#install.packages("RANN")
#install.packages("plotly")
#install.packages("yardstick")
library(tidyverse) # collection of packages for easy data import, tidying, manipulation and visualization
library(skimr) # package for getting dataset stats at a glance
library(caret) # package for "Classification And REgression Training"
library(RANN) # required for some caret functionalities
library(plotly) # package for interacting with plots
library(yardstick) # package for evaluating statistical models
setwd("/stor/work/Marcotte/project/rmcox/github_repos/pokemon_machine_learning_demo/")
Next, we will need to load the Pokemon data into our environment. This is a publicly available dataset scraped a couple of years ago from http://serebii.net/; it contains information on 800 Pokemon spanning Generations 1-7.
The information contained in this dataset includes Base Stats, Performance against Other Types, Height, Weight, Classification, Egg Steps, Experience Points, Abilities, etc.
PRO-TIP: you can make a pipe (%>%) with the shortcut ‘Ctrl+Shift+M’
# The read_csv() function loads in the data and automatically detects it as comma-delimited.
poke_data <- read_csv("pokemon_data_all.csv", col_names=TRUE) %>%
select(name, everything())
## Parsed with column specification:
## cols(
## .default = col_double(),
## abilities = col_character(),
## classfication = col_character(),
## japanese_name = col_character(),
## name = col_character(),
## type1 = col_character(),
## type2 = col_character()
## )
## See spec(...) for full column specifications.
# The glimpse() function is nice for looking at all the variable names and types.
# You can see there are 800 pokemon ("observations") and 41 features ("variables").
glimpse(poke_data)
## Observations: 800
## Variables: 41
## $ name <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Charmande…
## $ abilities <chr> "['Overgrow', 'Chlorophyll']", "['Overgrow', '…
## $ against_bug <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 0.25, 1.00, 1.00…
## $ against_dark <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_dragon <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_electric <dbl> 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1…
## $ against_fairy <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1…
## $ against_fight <dbl> 0.50, 0.50, 0.50, 1.00, 1.00, 0.50, 1.00, 1.00…
## $ against_fire <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2…
## $ against_flying <dbl> 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2…
## $ against_ghost <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
## $ against_grass <dbl> 0.25, 0.25, 0.25, 0.50, 0.50, 0.25, 2.00, 2.00…
## $ against_ground <dbl> 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0…
## $ against_ice <dbl> 2.0, 2.0, 2.0, 0.5, 0.5, 1.0, 0.5, 0.5, 0.5, 1…
## $ against_normal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ against_poison <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1…
## $ against_psychic <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1…
## $ against_rock <dbl> 1, 1, 1, 2, 2, 4, 1, 1, 1, 2, 2, 4, 2, 2, 2, 2…
## $ against_steel <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1…
## $ against_water <dbl> 0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 1…
## $ attack <dbl> 49, 62, 100, 52, 64, 104, 48, 63, 103, 30, 20,…
## $ base_egg_steps <dbl> 5120, 5120, 5120, 5120, 5120, 5120, 5120, 5120…
## $ base_happiness <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70…
## $ base_total <dbl> 318, 405, 625, 309, 405, 634, 314, 405, 630, 1…
## $ capture_rate <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 255, 120, …
## $ classfication <chr> "Seed Pokémon", "Seed Pokémon", "Seed Pokémon"…
## $ defense <dbl> 49, 63, 123, 43, 58, 78, 65, 80, 120, 35, 55, …
## $ experience_growth <dbl> 1059860, 1059860, 1059860, 1059860, 1059860, 1…
## $ height_m <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6, 0…
## $ hp <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60…
## $ japanese_name <chr> "Fushigidaneフシギダネ", "Fushigisouフシギソウ", "Fushig…
## $ percentage_male <dbl> 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1, 88.1…
## $ pokedex_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ sp_attack <dbl> 65, 80, 122, 60, 80, 159, 50, 65, 135, 20, 25,…
## $ sp_defense <dbl> 65, 80, 120, 50, 65, 115, 64, 80, 115, 20, 25,…
## $ speed <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 7…
## $ type1 <chr> "grass", "grass", "grass", "fire", "fire", "fi…
## $ type2 <chr> "poison", "poison", "poison", NA, NA, "flying"…
## $ weight_kg <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5, …
## $ generation <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ is_legendary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
# The head() function lets us look at the first few rows of the dataframe and its format.
head(poke_data)
## # A tibble: 6 x 41
## name abilities against_bug against_dark against_dragon against_electric
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bulb… ['Overgr… 1 1 1 0.5
## 2 Ivys… ['Overgr… 1 1 1 0.5
## 3 Venu… ['Overgr… 1 1 1 0.5
## 4 Char… ['Blaze'… 0.5 1 1 1
## 5 Char… ['Blaze'… 0.5 1 1 1
## 6 Char… ['Blaze'… 0.25 1 1 2
## # … with 35 more variables: against_fairy <dbl>, against_fight <dbl>,
## # against_fire <dbl>, against_flying <dbl>, against_ghost <dbl>,
## # against_grass <dbl>, against_ground <dbl>, against_ice <dbl>,
## # against_normal <dbl>, against_poison <dbl>, against_psychic <dbl>,
## # against_rock <dbl>, against_steel <dbl>, against_water <dbl>,
## # attack <dbl>, base_egg_steps <dbl>, base_happiness <dbl>,
## # base_total <dbl>, capture_rate <dbl>, classfication <chr>,
## # defense <dbl>, experience_growth <dbl>, height_m <dbl>, hp <dbl>,
## # japanese_name <chr>, percentage_male <dbl>, pokedex_number <dbl>,
## # sp_attack <dbl>, sp_defense <dbl>, speed <dbl>, type1 <chr>,
## # type2 <chr>, weight_kg <dbl>, generation <dbl>, is_legendary <dbl>
# Luckily our dataset is already in the right format, with individual observations as rows and features as columns.
# If this were not the case, we would have to use gather() or spread() to get it in the right format.
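# For instance, a toy sketch on hypothetical data (not this dataset):
# wide <- tibble(pokemon = c("Bulbasaur", "Charmander"), attack = c(49, 52), defense = c(49, 43))
# long <- wide %>% gather(key = "stat", value = "value", attack, defense)  # wide -> long
# wide2 <- long %>% spread(key = "stat", value = "value")                  # long -> wide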
# Some of these variables are not going to be helpful for training our model: "japanese_name" is always unique, and "abilities" and "classfication" (note the misspelling in the source data) are nearly always unique. Let's take these out for simplicity's sake. We'll keep "name" (also unique) around for now to identify rows, and drop it just before training.
poke_data <- poke_data %>%
select(-japanese_name, -abilities, -classfication)
# Also right now, legendary status is described with 1's and 0's so let's change that to TRUE and FALSE.
poke_data <- poke_data %>%
mutate(is_legendary = as.logical(is_legendary))
# The next step is to look for missing data, a common problem in big datasets. The skimr package provides a nice solution for this, along with showing key descriptive stats for each column.
skim(poke_data)
Name | poke_data |
Number of rows | 800 |
Number of columns | 38 |
_______________________ | |
Column type frequency: | |
character | 3 |
logical | 1 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 12 | 0 | 800 | 0 |
type1 | 0 | 1.00 | 3 | 8 | 0 | 18 | 0 |
type2 | 384 | 0.52 | 3 | 8 | 0 | 18 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
is_legendary | 0 | 1 | 0.09 | FAL: 730, TRU: 70 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
against_bug | 0 | 1.00 | 1.00 | 0.60 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_dark | 0 | 1.00 | 1.06 | 0.44 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_dragon | 0 | 1.00 | 0.97 | 0.35 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 2.0 | ▁▁▇▁▁ |
against_electric | 0 | 1.00 | 1.07 | 0.65 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▅▇▃▁▁ |
against_fairy | 0 | 1.00 | 1.07 | 0.52 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_fight | 0 | 1.00 | 1.07 | 0.72 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▇▅▁▁ |
against_fire | 0 | 1.00 | 1.14 | 0.69 | 2.50e-01 | 0.50 | 1.000e+00 | 2.00 | 4.0 | ▇▁▂▁▁ |
against_flying | 0 | 1.00 | 1.19 | 0.60 | 2.50e-01 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_ghost | 0 | 1.00 | 0.98 | 0.56 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▂▇▂▁▁ |
against_grass | 0 | 1.00 | 1.03 | 0.79 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
against_ground | 0 | 1.00 | 1.10 | 0.74 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▃▇▃▁▁ |
against_ice | 0 | 1.00 | 1.21 | 0.74 | 2.50e-01 | 0.50 | 1.000e+00 | 2.00 | 4.0 | ▇▁▃▁▁ |
against_normal | 0 | 1.00 | 0.89 | 0.27 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 1.0 | ▁▁▁▁▇ |
against_poison | 0 | 1.00 | 0.98 | 0.55 | 0.00e+00 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▃▇▂▁▁ |
against_psychic | 0 | 1.00 | 1.01 | 0.50 | 0.00e+00 | 1.00 | 1.000e+00 | 1.00 | 4.0 | ▂▇▂▁▁ |
against_rock | 0 | 1.00 | 1.25 | 0.70 | 2.50e-01 | 1.00 | 1.000e+00 | 2.00 | 4.0 | ▇▁▃▁▁ |
against_steel | 0 | 1.00 | 0.98 | 0.50 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▁▁▁ |
against_water | 0 | 1.00 | 1.06 | 0.61 | 2.50e-01 | 0.50 | 1.000e+00 | 1.00 | 4.0 | ▇▁▂▁▁ |
attack | 0 | 1.00 | 77.83 | 32.17 | 5.00e+00 | 55.00 | 7.500e+01 | 100.00 | 185.0 | ▂▇▆▂▁ |
base_egg_steps | 0 | 1.00 | 7192.00 | 6562.26 | 1.28e+03 | 5120.00 | 5.120e+03 | 5440.00 | 30720.0 | ▇▁▁▁▁ |
base_happiness | 0 | 1.00 | 65.36 | 19.61 | 0.00e+00 | 70.00 | 7.000e+01 | 70.00 | 140.0 | ▁▁▇▁▁ |
base_total | 0 | 1.00 | 428.29 | 119.25 | 1.80e+02 | 320.00 | 4.350e+02 | 505.00 | 780.0 | ▃▆▇▂▁ |
capture_rate | 0 | 1.00 | 98.76 | 76.26 | 3.00e+00 | 45.00 | 6.000e+01 | 170.00 | 255.0 | ▇▃▂▂▂ |
defense | 0 | 1.00 | 73.03 | 30.78 | 5.00e+00 | 50.00 | 7.000e+01 | 90.00 | 230.0 | ▅▇▂▁▁ |
experience_growth | 0 | 1.00 | 1054989.82 | 160356.00 | 6.00e+05 | 1000000.00 | 1.000e+06 | 1059860.00 | 1640000.0 | ▂▇▅▅▁ |
height_m | 20 | 0.98 | 1.17 | 1.08 | 1.00e-01 | 0.60 | 1.000e+00 | 1.50 | 14.5 | ▇▁▁▁▁ |
hp | 0 | 1.00 | 68.97 | 26.59 | 1.00e+00 | 50.00 | 6.500e+01 | 80.00 | 255.0 | ▃▇▁▁▁ |
percentage_male | 97 | 0.88 | 55.16 | 20.26 | 0.00e+00 | 50.00 | 5.000e+01 | 50.00 | 100.0 | ▁▁▇▁▂ |
pokedex_number | 0 | 1.00 | 400.53 | 231.14 | 1.00e+00 | 200.75 | 4.005e+02 | 600.25 | 801.0 | ▇▇▇▇▇ |
sp_attack | 0 | 1.00 | 71.27 | 32.36 | 1.00e+01 | 45.00 | 6.500e+01 | 91.00 | 194.0 | ▅▇▅▁▁ |
sp_defense | 0 | 1.00 | 70.92 | 27.96 | 2.00e+01 | 50.00 | 6.600e+01 | 90.00 | 230.0 | ▇▇▂▁▁ |
speed | 0 | 1.00 | 66.27 | 28.86 | 5.00e+00 | 45.00 | 6.500e+01 | 85.00 | 180.0 | ▃▇▆▁▁ |
weight_kg | 20 | 0.98 | 61.41 | 109.42 | 1.00e-01 | 9.00 | 2.715e+01 | 64.85 | 999.9 | ▇▁▁▁▁ |
generation | 0 | 1.00 | 3.69 | 1.93 | 1.00e+00 | 2.00 | 4.000e+00 | 5.00 | 7.0 | ▇▅▃▅▅ |
# In our columns with character values, almost half the dataset is missing a value for "type2."
# If you're familiar with Pokemon, you know it's perfectly valid for a Pokemon to have only one type.
# Since having only one type might be important to a Pokemon's legendary status, we can feel justified in replacing all the "NA" values in the "type2" column with "none."
poke_data$type2[is.na(poke_data$type2)] <- "none"
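# Equivalent tidyverse alternative (same result, pipe-friendly):
# poke_data <- poke_data %>% mutate(type2 = replace_na(type2, "none"))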
# We also have missing values in our numeric columns. Let's check out how many legendary Pokemon are affected by this missing data.
poke_data %>% filter(is.na(weight_kg)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 1
poke_data %>% filter(is.na(height_m)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 1
poke_data %>% filter(is.na(percentage_male)) %>% tally(is_legendary > 0)
## # A tibble: 1 x 1
## n
## <int>
## 1 63
# Let's remove the problematic "percentage_male" column. See text below for explanation.
poke_data <- poke_data %>%
select(-percentage_male)
# Plot how non-legendary vs. legendary Pokemon are distributed across some of the variables.
ggplot(poke_data, aes(x=type1)) +
geom_bar(aes(fill=is_legendary, group=is_legendary)) +
coord_flip()
ggplot(poke_data, aes(x=weight_kg, y=height_m, color=is_legendary)) +
geom_point(size=3, alpha=0.5)
## Warning: Removed 20 rows containing missing values (geom_point).
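Since we loaded plotly, we can make any of these ggplots interactive by wrapping them in ggplotly(); a quick optional aside (hover over points to see their values):
# ggplotly(ggplot(poke_data, aes(x=weight_kg, y=height_m, color=is_legendary)) +
#            geom_point(size=3, alpha=0.5))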
# Plot the distribution of the target (or response) variable and see if there is any class imbalance.
ggplot(poke_data, aes(x=is_legendary)) +
geom_bar() +
xlab("Legendary status") +
ylab("# Pokemon")
Missing values in the “percentage_male” column are particularly problematic, since a lot of legendary Pokemon fall in this subset (63 of the 70 legendaries, per the tally above). The reason for this is that legendary Pokemon are commonly genderless.
We can’t replace the N/A values with “none” like we did for the types above, because then we would be mixing numeric and character values in one column. We can’t replace them with 0, because that would imply 100% female, which is false. And we can’t impute them, because imputation would predict some value between 0 and 100, implying a gender ratio for Pokemon that don’t have one.
We need to split the dataset into training data (75%) and test data (25%). When building the predictive model, the algorithm should see the training data and ONLY the training data to learn the relationship between Pokemon stats and legendary status. What it learns about these relationships becomes our machine learning model.
# Now let's divide the dataset into training and test sets. The caret package has a nice function, createDataPartition(), for this purpose.
set.seed(13)
partition_index <- createDataPartition(poke_data$is_legendary, p=0.75, list=FALSE)
trainPoke <- poke_data[partition_index,]
testPoke <- poke_data[-partition_index,]
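As a quick sanity check, the fraction of legendary Pokemon should come out roughly similar in the two sets; we can eyeball that with mean():
# mean(trainPoke$is_legendary)
# mean(testPoke$is_legendary)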
There is no one-size-fits-all method for dealing with missing data. This data science blog post has a nice overview of the different ways you can deal with missing values in predictive analytics (https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4):
(Figure: ways to deal with missing data.)
The missing values in the “weight_kg” and “height_m” columns are less complex to deal with. They make up a smaller portion of the dataset (20 rows each), and they result from a bug in the web-scraping script used to generate the csv.
Let’s take this as an opportunity to demonstrate what imputation (value prediction) might look like. We can impute the missing values with the k-Nearest Neighbors algorithm, treating the rest of the available variables as predictors. Luckily, caret has a built-in function for this: preProcess().
This is why we partitioned the data into training and test sets first: we want our test set to be completely independent of our training set, and that includes keeping it out of the imputation.
# The "name" column is not important to model training, so let's take it out to avoid having to encode the labels.
trainPoke <- trainPoke %>%
select(-name) %>%
mutate(is_legendary = as.factor(is_legendary))
# Now let's make a model for predicting the missing values in the weight and height columns.
poke_missingdata_model <- preProcess(trainPoke, method='knnImpute')
poke_missingdata_model
## Created from 586 samples and 36 variables
##
## Pre-processing:
## - centered (33)
## - ignored (3)
## - 5 nearest neighbor imputation (33)
## - scaled (33)
# The output shows that the model has centered (subtracted the mean from) 33 variables, ignored 3 variables, used k=5 (considered 5 nearest neighbors) to predict missing values, and finally scaled (divided by the standard deviation) 33 variables.
# Now let's use this model to predict the missing values.
trainPoke_pp <- predict(poke_missingdata_model, newdata = trainPoke)
# Check to see if any N/A values remain. If FALSE, all values have been successfully imputed.
anyNA(trainPoke_pp)
## [1] FALSE
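We can also spot-check the centering and scaling: each preprocessed numeric column should now have a mean of ~0 and a standard deviation of ~1 (visible in the skim() output below):
# round(mean(trainPoke_pp$attack), 2)  # expect ~0
# round(sd(trainPoke_pp$attack), 2)    # expect ~1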
# We can also revisit the skim() function to confirm no missing values are present.
skim(trainPoke_pp)
Name | trainPoke_pp |
Number of rows | 601 |
Number of columns | 36 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 33 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
type1 | 0 | 1 | 3 | 8 | 0 | 18 | 0 |
type2 | 0 | 1 | 3 | 8 | 0 | 19 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
is_legendary | 0 | 1 | FALSE | 2 | FAL: 548, TRU: 53 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
against_bug | 0 | 1 | 0 | 1.00 | -1.24 | -0.83 | -0.01 | -0.01 | 4.91 | ▇▁▂▁▁ |
against_dark | 0 | 1 | 0 | 1.00 | -1.83 | -0.14 | -0.14 | -0.14 | 6.60 | ▇▁▁▁▁ |
against_dragon | 0 | 1 | 0 | 1.00 | -2.80 | 0.08 | 0.08 | 0.08 | 2.96 | ▁▁▇▁▁ |
against_electric | 0 | 1 | 0 | 1.00 | -1.63 | -0.87 | -0.10 | -0.10 | 4.48 | ▅▇▃▁▁ |
against_fairy | 0 | 1 | 0 | 1.00 | -1.56 | -0.13 | -0.13 | -0.13 | 5.59 | ▇▁▁▁▁ |
against_fight | 0 | 1 | 0 | 1.00 | -1.48 | -0.78 | -0.09 | -0.09 | 4.07 | ▇▇▅▁▁ |
against_fire | 0 | 1 | 0 | 1.00 | -1.27 | -0.91 | -0.20 | 1.21 | 4.05 | ▆▇▅▁▁ |
against_flying | 0 | 1 | 0 | 1.00 | -1.53 | -0.33 | -0.33 | 1.26 | 4.43 | ▇▁▂▁▁ |
against_ghost | 0 | 1 | 0 | 1.00 | -1.75 | 0.02 | 0.02 | 0.02 | 5.33 | ▂▇▂▁▁ |
against_grass | 0 | 1 | 0 | 1.00 | -0.98 | -0.67 | -0.05 | -0.05 | 3.70 | ▇▁▂▁▁ |
against_ground | 0 | 1 | 0 | 1.00 | -1.47 | -0.80 | -0.13 | -0.13 | 3.90 | ▅▇▃▁▁ |
against_ice | 0 | 1 | 0 | 1.00 | -1.33 | -0.98 | -0.28 | 1.11 | 3.91 | ▇▁▃▁▁ |
against_normal | 0 | 1 | 0 | 1.00 | -3.21 | 0.43 | 0.43 | 0.43 | 0.43 | ▁▁▁▁▇ |
against_poison | 0 | 1 | 0 | 1.00 | -1.80 | -0.87 | 0.05 | 0.05 | 5.62 | ▃▇▂▁▁ |
against_psychic | 0 | 1 | 0 | 1.00 | -2.07 | 0.00 | 0.00 | 0.00 | 6.18 | ▂▇▂▁▁ |
against_rock | 0 | 1 | 0 | 1.00 | -1.42 | -0.36 | -0.36 | 1.06 | 3.90 | ▇▁▃▁▁ |
against_steel | 0 | 1 | 0 | 1.00 | -1.50 | -0.98 | 0.05 | 0.05 | 6.26 | ▅▇▂▁▁ |
against_water | 0 | 1 | 0 | 1.00 | -1.32 | -0.91 | -0.09 | -0.09 | 4.85 | ▅▇▂▁▁ |
attack | 0 | 1 | 0 | 1.00 | -2.24 | -0.73 | -0.11 | 0.69 | 3.31 | ▂▇▆▂▁ |
base_egg_steps | 0 | 1 | 0 | 1.00 | -0.89 | -0.32 | -0.32 | -0.32 | 3.49 | ▇▁▁▁▁ |
base_happiness | 0 | 1 | 0 | 1.00 | -3.23 | 0.25 | 0.25 | 0.25 | 3.72 | ▁▁▇▁▁ |
base_total | 0 | 1 | 0 | 1.00 | -2.03 | -0.88 | 0.02 | 0.64 | 2.89 | ▃▆▇▂▁ |
capture_rate | 0 | 1 | 0 | 1.00 | -1.30 | -0.75 | -0.35 | 1.03 | 2.02 | ▇▅▂▃▂ |
defense | 0 | 1 | 0 | 1.00 | -2.17 | -0.73 | -0.15 | 0.56 | 5.04 | ▅▇▂▁▁ |
experience_growth | 0 | 1 | 0 | 1.00 | -2.77 | -0.35 | -0.35 | 0.01 | 3.53 | ▂▇▅▅▁ |
height_m | 0 | 1 | 0 | 0.99 | -0.96 | -0.51 | -0.16 | 0.28 | 11.82 | ▇▁▁▁▁ |
hp | 0 | 1 | 0 | 1.00 | -2.69 | -0.75 | -0.15 | 0.44 | 7.18 | ▃▇▁▁▁ |
pokedex_number | 0 | 1 | 0 | 1.00 | -1.69 | -0.84 | 0.00 | 0.87 | 1.76 | ▇▇▇▇▇ |
sp_attack | 0 | 1 | 0 | 1.00 | -1.87 | -0.80 | -0.19 | 0.58 | 3.75 | ▅▇▅▁▁ |
sp_defense | 0 | 1 | 0 | 1.00 | -1.80 | -0.75 | -0.18 | 0.66 | 5.58 | ▇▇▂▁▁ |
speed | 0 | 1 | 0 | 1.00 | -2.08 | -0.70 | -0.15 | 0.67 | 3.94 | ▅▇▆▁▁ |
weight_kg | 0 | 1 | 0 | 0.99 | -0.58 | -0.49 | -0.31 | 0.04 | 9.04 | ▇▁▁▁▁ |
generation | 0 | 1 | 0 | 1.00 | -1.37 | -0.85 | 0.19 | 0.71 | 1.76 | ▇▅▃▅▅ |
Categorical data are variables that contain character values rather than numeric values. For instance, Pokemon types are categories like “fairy” and “fire.” Many machine learning algorithms can’t operate on categorical data directly, requiring all input and output variables to be numeric.
However, the tree-based algorithms we’re going to use handle it just fine, so we’re not going to bother with it here (there’s a quick encoding demo at the very end).
There are many, many different machine learning algorithms (for both classification and regression) available via the caret package. For today’s demo, we will be training a Random Forest model because it plays nice with categorical data and datasets with imbalanced classes.
With any machine learning model, we need some kind of assurance that it has correctly extracted most of the patterns from the data. We need to make sure that it’s not picking up too much noise, i.e., that the model has low bias and variance. This process is known as validation. To figure out how well our classifier will generalize to an independent/unseen dataset, we’ll use k-fold cross-validation. This method divides the training data into k subsets. One of the k subsets is held out as the validation set, and the other k-1 subsets are pooled to form a training set. This is repeated k times, holding out a different subset each time.
Since every data point is in a validation set exactly once, and in a training set k-1 times, this significantly reduces bias (we use most of the data for fitting) and variance (most of the data is also used for validation).
# designate validation method
fitControl <- trainControl(method="repeatedcv", number=5,
repeats=5)
# fit a "Random Forest" model; ~206.871 seconds
pokeFit_ranf <- train(is_legendary~., data=trainPoke_pp, method = "rf",
trControl=fitControl)
pokeFit_ranf
## Random Forest
##
## 601 samples
## 35 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 481, 481, 480, 481, 481, 481, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9820438 0.8729632
## 35 0.9953443 0.9708658
## 68 0.9953415 0.9710380
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 35.
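train() tuned mtry, the number of features randomly sampled at each split, for us automatically. If we wanted to steer that search ourselves, we could hand train() a custom grid; a minimal sketch (not run here), assuming the same data and fitControl object:
# pokeFit_tuned <- train(is_legendary~., data=trainPoke_pp, method="rf",
#                        trControl=fitControl,
#                        tuneGrid=expand.grid(mtry=c(2, 10, 20, 35)))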
# Which features drive the predictions? varImp() ranks them.
varimp_ranf <- varImp(pokeFit_ranf)
ggplot(varimp_ranf)
# Preprocess the test set using the SAME imputation model we fit on the training data.
testPoke_pp <- predict(poke_missingdata_model, testPoke) %>%
mutate(is_legendary=as.factor(is_legendary))
anyNA(testPoke_pp)
## [1] FALSE
# Use the trained model to predict legendary status for the held-out test set.
predictionPoke <- predict(pokeFit_ranf, testPoke_pp)
predictionPoke
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [111] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [199] TRUE
## Levels: FALSE TRUE
# Attach the predictions to the test set so we can compare them to the true labels.
testPoke_pp$predicted <- predict(pokeFit_ranf, testPoke_pp)
target_outcomes <- testPoke_pp %>%
select(name, is_legendary, predicted) %>%
arrange(desc(is_legendary), desc(predicted))
# Evaluate performance with a confusion matrix ("TRUE", i.e. legendary, is the positive class).
confusionMatrix(reference=testPoke_pp$is_legendary, data=predictionPoke, mode='everything', positive='TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 180 1
## TRUE 2 16
##
## Accuracy : 0.9849
## 95% CI : (0.9566, 0.9969)
## No Information Rate : 0.9146
## P-Value [Acc > NIR] : 2.387e-05
##
## Kappa : 0.906
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.94118
## Specificity : 0.98901
## Pos Pred Value : 0.88889
## Neg Pred Value : 0.99448
## Precision : 0.88889
## Recall : 0.94118
## F1 : 0.91429
## Prevalence : 0.08543
## Detection Rate : 0.08040
## Detection Prevalence : 0.09045
## Balanced Accuracy : 0.96509
##
## 'Positive' Class : TRUE
##
A confusion matrix is a table that is often used to describe the performance of a classifier on a set of test data for which the true values are known. This is the basic structure:
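 | Reference: FALSE | Reference: TRUE |
---|---|---|
Prediction: FALSE | true negatives (TN) | false negatives (FN) |
Prediction: TRUE | false positives (FP) | true positives (TP) |
(Same orientation as caret’s output above: predictions as rows, true labels as columns.)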
Confusion matrices are used to calculate precision and recall, like so:
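precision = TP / (TP + FP)
recall = TP / (TP + FN)
Plugging in our results from above: precision = 16 / (16 + 2) ≈ 0.889 and recall = 16 / (16 + 1) ≈ 0.941, matching the Precision and Recall fields in the output.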
Here is a list of definitions for the rates that are often computed to evaluate a binary classifier (like ours):
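1. Sensitivity (= recall = true positive rate): TP / (TP + FN)
2. Specificity (true negative rate): TN / (TN + FP)
3. Precision (positive predictive value): TP / (TP + FP)
4. Negative predictive value: TN / (TN + FN)
5. Accuracy: (TP + TN) / (TP + TN + FP + FN)
6. Prevalence: (TP + FN) / (TP + TN + FP + FN)
7. Balanced accuracy: (sensitivity + specificity) / 2
8. F1 score: 2 * (precision * recall) / (precision + recall)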
### remove feature
# Re-run the entire pipeline without "base_egg_steps" to see how the model performs without it.
poke_data_noegg <- poke_data %>%
select(-base_egg_steps)
### split data
set.seed(13)
partition_index_noegg <- createDataPartition(poke_data_noegg$is_legendary, p=0.75, list=FALSE)
trainPoke_noegg <- poke_data_noegg[partition_index_noegg,]
testPoke_noegg <- poke_data_noegg[-partition_index_noegg,]
### train model
trainPoke_noegg <- trainPoke_noegg %>%
select(-name) %>%
mutate(is_legendary = as.factor(is_legendary))
poke_missingdata_model_noegg <- preProcess(trainPoke_noegg, method='knnImpute')
trainPoke_pp_noegg <- predict(poke_missingdata_model_noegg, newdata = trainPoke_noegg)
pokeFit_ranf_noegg <- train(is_legendary~., data=trainPoke_pp_noegg, method = "rf",
trControl=fitControl)
varimp_ranf_noegg <- varImp(pokeFit_ranf_noegg)
ggplot(varimp_ranf_noegg)
### test model
testPoke_pp_noegg <- predict(poke_missingdata_model_noegg, testPoke_noegg) %>%
mutate(is_legendary=as.factor(is_legendary))
anyNA(testPoke_pp_noegg)
## [1] FALSE
predictionPoke_noegg <- predict(pokeFit_ranf_noegg, testPoke_pp_noegg)
testPoke_pp_noegg$predicted <- predict(pokeFit_ranf_noegg, testPoke_pp_noegg)
target_outcomes_noegg <- testPoke_pp_noegg %>%
select(name, is_legendary, predicted) %>%
arrange(desc(is_legendary), desc(predicted))
### evaluate model
confusionMatrix(reference=testPoke_pp_noegg$is_legendary, data=predictionPoke_noegg, mode='everything', positive='TRUE')
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 180 1
## TRUE 2 16
##
## Accuracy : 0.9849
## 95% CI : (0.9566, 0.9969)
## No Information Rate : 0.9146
## P-Value [Acc > NIR] : 2.387e-05
##
## Kappa : 0.906
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.94118
## Specificity : 0.98901
## Pos Pred Value : 0.88889
## Neg Pred Value : 0.99448
## Precision : 0.88889
## Recall : 0.94118
## F1 : 0.91429
## Prevalence : 0.08543
## Detection Rate : 0.08040
## Detection Prevalence : 0.09045
## Balanced Accuracy : 0.96509
##
## 'Positive' Class : TRUE
##
Even with “base_egg_steps” removed, the model’s performance on the held-out test set is identical. Finally, random forest is just one of many algorithms available through caret; modelLookup() lists them all, and we can filter for classification-only methods:
modelLookup() %>% filter(forReg==FALSE)
There’s plenty more in caret that we didn’t cover today:
1. Regression models
2. Encoding categorical variables (a quick dummyVars() demo below)
3. Feature engineering
4. Tuning your model
Bye!
# Convert categorical variables to as many binary variables as there are categories
dummyPoke_model <- dummyVars(is_legendary~., data=trainPoke)
# Create dummy variables using predict
trainPoke_encoded <- predict(dummyPoke_model, newdata=trainPoke)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$lvls): variable 'is_legendary' is not a factor
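The result, trainPoke_encoded, is a numeric matrix in which each level of a categorical variable (like type1 and type2) gets its own 0/1 indicator column. The warning appears to concern the outcome variable in the formula, which predict() on a dummyVars object doesn’t return anyway, so it’s safe to ignore here.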