ggplot2
This one-hour webinar focuses on making three kinds of visualizations: time-series line graphs, bar charts, and distribution plots. For examples of other kinds of plots (and for data transformation vignettes), see the class website for the four-day R course taught by Cory DuPai and me: https://rachaelcox.github.io/classes/IntroR_summer_2020.html
September 30, 2020
We will use the brewmats dataset, which is sourced from the US Alcohol and Tobacco Tax and Trade Bureau (TTB, https://www.ttb.gov/beer/statistics), scraped by the TidyTuesday group and cleaned by me. The brewmats dataset contains the following variables:
data_type = denotes that all amounts of material used are given in pounds (lbs)
material_type = materials belong to one of two categories, grain (e.g., wheat) or non-grain (e.g., sugar)
type = specific material details (e.g., malt, corn, rice, barley, wheat, hops, sugar, etc.)
year = year the amount of materials used was recorded (2008-2015)
month = month the amount of materials used was recorded (1=January, 2=February, 3=March, etc.)
month_usage_by_type = total amount of material used by type per month (lbs)
month_sum_all_types = total amount of material used for all types per month (lbs)
year_sum_by_type = total amount of material used by type per year (lbs)

# load the packages used in the code below
library(tidyverse) # provides read_csv(), ggplot() and fct_relevel()
library(ggthemes) # provides scale_color_colorblind()
# download the `brewmats` dataset
brewmats <- read_csv("https://rachaelcox.github.io/classes/datasets/brewmats.csv")
## Parsed with column specification:
## cols(
## data_type = col_character(),
## material_type = col_character(),
## year = col_double(),
## month = col_double(),
## type = col_character(),
## month_usage_by_type = col_double(),
## month_sum_all_types = col_double(),
## year_sum_by_type = col_double()
## )
Code Along: Plot the total monthly usage of all brewing materials (month_sum_all_types) on the y-axis for every month in the dataset on the x-axis, as a line graph colored by year using geom_line().
# plot the variables as specified
ggplot(brewmats, aes(x = month, y = month_sum_all_types, # aes maps the variables
                     group = year, color = year)) +
  geom_line() # draws the line graph

# ggplot is interpreting the numerical variables as continuous
# we use as.factor() to tell R that these variables are discrete
ggplot(brewmats, aes(x = as.factor(month), y = month_sum_all_types,
                     group = as.factor(year), color = as.factor(year))) +
  geom_line() +
  scale_color_colorblind() # use a colorblind-friendly palette

# clean it up and make it pretty
ggplot(brewmats, aes(x = as.factor(month), y = month_sum_all_types,
                     group = as.factor(year), color = as.factor(year))) +
  geom_line(size = 1.5) + # thicker lines; ggplot2 >= 3.4.0 prefers `linewidth = 1.5`
  scale_color_colorblind(name = "Year") + # change legend title
  xlab("Month") + # rename x-axis
  ylab("Monthly Material Usage (lbs)") # rename y-axis
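To see why as.factor() changes the plot, it can help to ask ggplot2 which kind of scale it would pick for a variable. The sketch below is only an illustration: it uses a small made-up data frame (toy, not the real brewmats data) and ggplot2's scale_type() helper.

```r
library(ggplot2)

# Hypothetical stand-in for `brewmats`; the values are invented for illustration
toy <- data.frame(
  year  = rep(c(2008, 2009), each = 3),
  month = rep(1:3, times = 2),
  usage = c(10, 12, 11, 9, 13, 14)
)

# A numeric column gets a continuous scale (color gradient, numeric axis)
type_numeric <- scale_type(toy$year)

# The same column wrapped in as.factor() gets a discrete scale (one color per level)
type_factor <- scale_type(as.factor(toy$year))

print(type_numeric)
print(type_factor)
```

This is why the first plot shows a color gradient and fractional month labels, while the second shows one color per year and one axis tick per month.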
Practice: Plot the yearly usage (year_sum_by_type) of each type of brewing material on the y-axis for every year in the dataset on the x-axis, as a line graph colored by type using geom_line(). Remember to map group = type so that ggplot knows which points to connect into lines.
ggplot(brewmats, aes(x = as.factor(year), y = year_sum_by_type,
                     group = type, color = type)) +
  geom_line(size = 1.5) + # thicker lines; ggplot2 >= 3.4.0 prefers `linewidth = 1.5`
  scale_color_colorblind(name = "Material Type") + # change legend title
  # scale the y-axis logarithmically so the smaller amounts are easier to see
  scale_y_log10() +
  xlab("Year") + # rename x-axis
  ylab("Usage (lbs)") # rename y-axis
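As a rough illustration of what scale_y_log10() does, the sketch below plots three made-up values that differ by orders of magnitude and then inspects where ggplot2 places them. The toy data frame is hypothetical; layer_data() is ggplot2's helper for looking at the data a layer actually draws.

```r
library(ggplot2)

# Hypothetical usage values spanning several orders of magnitude
toy <- data.frame(
  year  = factor(c(2008, 2009, 2010)),
  usage = c(100, 10000, 1000000)
)

p <- ggplot(toy, aes(x = year, y = usage, group = 1)) +
  geom_line() +
  scale_y_log10()

# y positions used for drawing, after the log10 transformation
y_positions <- layer_data(p)$y
print(y_positions)

# the same numbers computed by hand
print(log10(toy$usage))
```

Because the points are positioned by their log10 values, a hundredfold change takes up the same vertical distance anywhere on the axis, which keeps the small series from being flattened against the bottom of the plot.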
For this section, we will use the mushrooms dataset (obtained from Kaggle and cleaned by me), which contains the following information:
class = whether the mushroom is edible or poisonous
cap_shape = shape of the mushroom cap, e.g., bell, conical, convex, flat, etc.
cap_color = color of the mushroom cap, e.g., brown, buff, cinnamon, gray, etc.
odor = smell of the mushroom (almond, anise, creosote, fishy, foul, musty, pungent, none)
gill_spacing = spacing between mushroom gills, aka the underside of the mushroom cap (close, crowded or distant)
gill_size = size of mushroom gills (broad or narrow)
gill_color = color of mushroom gills (black, brown, etc.)
stalk_shape = shape of the mushroom stalk (enlarging or tapering)
stalk_root = type of stalk root (bulbous, club, cup, rooted, etc.)
veil_type = type of veil on the mushroom (partial or universal)
veil_color = color of the veil (brown, orange, white or yellow)
ring_number = number of rings on the stalk of the mushroom (0, 1 or 2)
ring_type = description of the ring(s), if any, found on the stalk, e.g., flaring, large, pendant, etc.
spore_print_color = color of spores collected on a sheet of paper as a print (black, brown, yellow, etc.)
population = description of nearby mushrooms of the same species, if any; can be abundant, clustered, numerous, scattered, several, or solitary
habitat = where the mushroom was found (grasses, leaves, meadows, paths, urban, waste, woods)

# download the `mushrooms` dataset
mushrooms <- read_csv("https://rachaelcox.github.io/classes/datasets/mushrooms.csv")
## Parsed with column specification:
## cols(
## class = col_character(),
## cap_shape = col_character(),
## cap_surface = col_character(),
## cap_color = col_character(),
## odor = col_character(),
## gill_spacing = col_character(),
## gill_size = col_character(),
## gill_color = col_character(),
## stalk_shape = col_character(),
## stalk_root = col_character(),
## veil_type = col_character(),
## veil_color = col_character(),
## ring_number = col_double(),
## ring_type = col_character(),
## spore_print_color = col_character(),
## population = col_character(),
## habitat = col_character()
## )
Code Along: Plot a bar graph for counts of mushrooms found in each habitat, colored by class (edible or poisonous) using geom_bar().
# plot the variables as specified
# note we need to use `fill =` instead of `color =` for this kind of plot
ggplot(mushrooms, aes(x = habitat, fill = class)) +
  geom_bar() # default bar position is "stack"

# change bar positioning and fill colors
ggplot(mushrooms, aes(x = habitat, fill = class)) +
  geom_bar(position = "dodge") + # place the bars side-by-side
  scale_fill_brewer() # change the default colors

# make it pretty
ggplot(mushrooms, aes(x = habitat, fill = class)) +
  # `preserve = "single"` keeps every bar the same width, even when a habitat has only one class
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_brewer(palette = "Set1") + # `palette = "Set1"` selects a qualitative color palette
  coord_flip() # flips the x- and y-axes for better readability

# add more information
ggplot(mushrooms, aes(x = habitat, fill = class)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_brewer(palette = "Set1", direction = -1) + # `direction = -1` reverses the color order
  coord_flip() +
  facet_wrap(~population) # one panel of habitat and class counts per population type
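Note that geom_bar() is never given a y variable in the plots above: it counts the rows in each x and fill combination itself. A minimal sketch of that behavior, using a hypothetical six-row data frame (toy) in place of the real mushrooms data:

```r
library(ggplot2)

# Hypothetical observations; each row stands for one mushroom
toy <- data.frame(
  habitat = c("woods", "woods", "woods", "grasses", "grasses", "paths"),
  class   = c("edible", "poisonous", "edible", "edible", "edible", "poisonous")
)

p <- ggplot(toy, aes(x = habitat, fill = class)) +
  geom_bar(position = "dodge")

# The counts ggplot2 computed for each bar
bar_counts <- layer_data(p)$count
print(bar_counts)

# The same tally done by hand
hand_counts <- table(toy$habitat, toy$class)
print(hand_counts)
```

If the counts are already summarized in a column, geom_col() (which takes an explicit y) is the usual alternative.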
Practice: Plot a bar graph for counts of mushrooms of each type of odor, colored by class (edible or poisonous) using geom_bar().
ggplot(mushrooms, aes(x = odor, fill = class)) +
  geom_bar(position = "fill") + # `position = "fill"` shows the proportion of each class instead of counts
  scale_fill_brewer(palette = "Set2") + # change the color palette
  coord_flip()

ggplot(mushrooms, aes(x = odor, fill = class)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set3") + # change the color palette again
  coord_flip() +
  facet_wrap(~cap_shape)
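With position = "fill", every bar is rescaled to a total height of 1, so each colored segment shows a within-odor proportion rather than a count. The sketch below checks this on a hypothetical data frame (toy, with invented rows) by comparing a hand-computed proportion table with the segment heights ggplot2 draws.

```r
library(ggplot2)

# Hypothetical observations; each row stands for one mushroom
toy <- data.frame(
  odor  = c("almond", "almond", "foul", "foul", "foul",
            "none", "none", "none", "none"),
  class = c("edible", "edible", "poisonous", "poisonous", "poisonous",
            "edible", "edible", "edible", "poisonous")
)

# Share of each class within each odor; every row sums to 1
props <- prop.table(table(toy$odor, toy$class), margin = 1)
print(props)

p <- ggplot(toy, aes(x = odor, fill = class)) +
  geom_bar(position = "fill")

# Height of each drawn segment is ymax - ymin
bars <- layer_data(p)
bars$height <- bars$ymax - bars$ymin
print(bars[, c("x", "count", "height")])
```

Because the raw counts are hidden in this view, a rare odor and a common one look equally tall, so it is worth checking the counts separately before reading too much into a bar.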
For this section, we will use the wine dataset (obtained from Kaggle and cleaned by me), which contains the following variables:
type: whether the wine is red or white
quality: median score between 0 and 10 as blindly graded by wine experts
quality_grade: quality category given to each rating based on distribution of ratings (low, med, high)
alcohol: the percent alcohol content of the wine (% by volume)
alcohol_grade: relative amount of alcohol compared to all wines (low, med, high)
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
acidity_grade: acidity intensity (low, med, high)
fixed_acidity: most acids involved with wine are fixed, or nonvolatile, meaning they do not evaporate readily (tartaric acid - g / dm^3)
volatile_acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste (acetic acid - g / dm^3)
citric_acid: found in small quantities, citric acid can add freshness and flavor to wines (g / dm^3)
residual_sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet (g / dm^3)
chlorides: the amount of salt in the wine (sodium chloride - g / dm^3)
free_sulfur_dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine (mg / dm^3)
total_sulfur_dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine (mg / dm^3)
density: degree of consistency measured by mass per unit volume (g / cm^3)
sulphates: a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant (potassium sulphate - g / dm^3)

[Figure: Typical pH of Wine Types]
# download the `wine` dataset
wine <- read_csv("https://rachaelcox.github.io/classes/datasets/wine_features.csv")
## Parsed with column specification:
## cols(
## type = col_character(),
## quality = col_double(),
## quality_grade = col_character(),
## alcohol = col_double(),
## alcohol_grade = col_character(),
## pH = col_double(),
## acidity_grade = col_character(),
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## sulphates = col_double()
## )
Code Along: Plot the distribution of pH for each wine type (i.e., red or white) by mapping pH to the x-axis, coloring by type, and calling geom_density(). Then, use geom_boxplot() to visualize the distribution of pH across quality_grade, again coloring by type.
# plot the distribution of pH for red and white wine
# note we need to use `fill =` instead of `color =` for this kind of plot
ggplot(wine, aes(x = pH, fill = type)) +
  geom_density()

# make it pretty
ggplot(wine, aes(x = pH, fill = type)) +
  geom_density(alpha = 0.70) + # reduce opacity of distributions to 70%
  scale_fill_manual(values = c("#790000", "#9e934d")) # specify custom colors

# plot the distribution of pH for each quality grade for red and white wine
ggplot(wine, aes(x = quality_grade, y = pH, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#790000", "#9e934d"))

# make it pretty by reordering the x-axis
ggplot(wine, aes(x = fct_relevel(quality_grade, c("low", "med", "high")),
                 y = pH, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#790000", "#9e934d"))
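The reordering step works because ggplot2 lays out a discrete axis in the order of the factor levels, and character data defaults to alphabetical order. A small sketch with a made-up vector of grades (grade is hypothetical, not a column from the wine data) shows the difference:

```r
library(forcats)

# Hypothetical grades, in no particular order
grade <- c("med", "low", "high", "low", "med")

# Default factor levels are alphabetical, which is the order the axis would use
default_levels <- levels(factor(grade))
print(default_levels)

# fct_relevel() moves the named levels to the front, in the order given
ordered_levels <- levels(fct_relevel(grade, "low", "med", "high"))
print(ordered_levels)
```

Base R's factor(grade, levels = c("low", "med", "high")) gives the same ordering; fct_relevel() is convenient when only a few levels need to move.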
Practice: Choose a numeric variable you are interested in. Plot its distribution relative to a categorical variable, e.g., type, quality_grade, alcohol_grade or acidity_grade. Use geom_density(), geom_boxplot(), and/or geom_violin().
# visualize residual sugar as it relates to alcohol grade for both red and white wine
ggplot(wine, aes(x = alcohol_grade, y = residual_sugar, fill = type)) +
  geom_violin() + # violin plots are an alternative to boxplots
  scale_fill_manual(values = c("#790000", "#9e934d"))

# make it pretty
ggplot(wine, aes(x = fct_relevel(alcohol_grade, c("low", "med", "high")),
                 y = residual_sugar, fill = type)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) + # `draw_quantiles` draws lines at the quartiles
  scale_y_log10() + # log-scale the y-axis for better readability
  scale_fill_manual(values = c("#790000", "#9e934d"), name = "Wine\nType") +
  ylab("Residual sugar (g/dm^3)") +
  xlab("Alcohol grade")