June 1st, 2022
We will work with the iris
data set available in R. This data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica:
# view the first several rows of the iris data set
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
geom_point()
and you will need to specify color = Species
inside of aes()
.# plot petal length vs sepal length, colored by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point()
Species
using facet_wrap()
.# plot petal length vs sepal length, faceted by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
geom_point() +
facet_wrap(~Species)
The omp
data set contains a subset of DNA microarray data measuring the differential expression of E. coli outer membrane proteins (omp) in nutrient-limited chemostatic cultures. In this particular experiment, the medium is glucose-limited. The gene
column denotes one 8 genes that code for outer membrane proteins, time_min
denotes the time point sampled in minutes, and au
denotes the change in gene expression detected by the microarray chip (arbitrary units of fluorescence intensity).
# download the `omp` data set
omp <- read_csv("https://rachaelcox.github.io/classes/datasets/ecoli_omp_expression.csv")
## Parsed with column specification:
## cols(
## gene = col_character(),
## time_min = col_double(),
## au = col_double()
## )
head(omp)
## # A tibble: 6 x 3
## gene time_min au
## <chr> <dbl> <dbl>
## 1 ompA 5 -0.32
## 2 ompA 15 0.46
## 3 ompA 30 0.18
## 4 ompA 60 -0.23
## 5 ompC 5 -0.48
## 6 ompC 15 -0.56
geom_line()
, coloring each line by gene. Notice anything off about this plot?# plot the expression of each gene over time
ggplot(omp, aes(x = time_min, y = au, color = gene)) +
geom_line()
size
argument inside of geom_line()
.scale_color_colorblind()
to convert the legend colors to a colorblind-friendly palette.xlab()
and ylab
to give the figure pretty labels.# plot the expression of each gene over time & make it pretty
ggplot(omp, aes(x = time_min, y = au, color = gene)) +
geom_line(size = 2) +
scale_color_colorblind() +
xlab("Time (min)") +
ylab("Relative abundance (au)")
The bacteria
data set contains data from tests of the presence of the bacterium H. influenzae in children with otitis media in the Northern Territory of Australia. We are interested in two columns of this data set: presence
reports the presence (y) or absence (n) of the bacterium, treatment
reports the treatment, which was placebo, drug, or drug+ (drug plus high adherence).
# download the `bacteria` data set
bacteria <- read_csv("https://rachaelcox.github.io/classes/datasets/bacteria.csv")
## Parsed with column specification:
## cols(
## presence = col_character(),
## ap = col_character(),
## hilo = col_character(),
## week = col_double(),
## ID = col_character(),
## treatment = col_character()
## )
head(bacteria)
## # A tibble: 6 x 6
## presence ap hilo week ID treatment
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 y p hi 0 X01 placebo
## 2 y p hi 2 X01 placebo
## 3 y p hi 4 X01 placebo
## 4 y p hi 11 X01 placebo
## 5 y a hi 0 X02 drug+
## 6 y a hi 2 X02 drug+
geom_bar()
, make a bar plot with the treatment
column on the x-axis, assigning the presence
column to the fill
argument to visualize the number of patients with and without bacteria for each type of treatment.ggplot(bacteria, aes(x = treatment, fill = presence)) +
geom_bar()
Notice that, by default, geom_bar()
stacks the counts for each class of presence
on top of each other. Generally, this isn't the clearest way to visualize this type of data. Now:
position='dodge'
in geom_bar()
.scale_fill_brewer()
to change the plot colors.ggplot(bacteria, aes(x = treatment, fill = presence)) +
geom_bar(position = 'dodge') +
scale_fill_brewer()
position
option in geom_bar()
to achieve this effect? Use ?geom_bar
to find out.scale_fill_brewer()
to use by specifying the type
and palette
arguments (see ?scale_fill_brewer
for details).ggplot(bacteria, aes(x = treatment, fill = presence)) +
geom_bar(position = 'fill') +
scale_fill_brewer(type = 'qual', palette = 'Set1')
The dandelion
data set contains RNA-seq reads for a subset of genes differentially expressed in response to five conditions.
# download the dandelion differential expression data set
dandelion <- read_csv("https://rachaelcox.github.io/classes/datasets/dandelion_diffexp_tidy.csv")
## Parsed with column specification:
## cols(
## transcript_dandelion = col_character(),
## baseMean = col_double(),
## z_score = col_double(),
## loci_arabidopsis = col_character(),
## protein_annotation = col_character(),
## gene_names_primary = col_character(),
## condition = col_character(),
## log2_foldchange = col_double()
## )
head(dandelion)
## # A tibble: 6 x 8
## transcript_dand~ baseMean z_score loci_arabidopsis protein_annotat~
## <chr> <dbl> <dbl> <chr> <chr>
## 1 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## 2 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## 3 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## 4 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## 5 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## 6 DN42754_c0_g1_i1 167. 1.28 AT5G67060 Transcription f~
## # ... with 3 more variables: gene_names_primary <chr>, condition <chr>,
## # log2_foldchange <dbl>
geom_tile
where each condition is on the x-axis and each gene (either transcript_dandelion
or loci_arabidopsis
) is on the y-axis.log2_foldchange
column. Use scale_fill_distiller()
to specify a continuous diverging color palette.(Note that I have told R Markdown to make a larger figure, by starting the code block with {r fig.height=6, fig.width=10}
instead of {r}
, because the default figure size is too narrow to show the resulting axes and map.)
ggplot(dandelion, aes(x = condition, y = transcript_dandelion, fill = log2_foldchange)) +
geom_tile() +
scale_fill_distiller(type = 'div',
palette = 'Spectral',
guide = 'colourbar',
direction = 1)
scale_fill_distiller()
(see ?scale_fill_distiller
) or you can use a different continuous color function, such as scale_fill_viridis_c()
or scale_fill_gradient()
.ggplot(dandelion, aes(x = condition, y = transcript_dandelion, fill = log2_foldchange)) +
geom_tile() +
scale_fill_viridis_c() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
guides(title = 'log2(fc)')
biopsy
data set, make boxplots of clump_thickness
for each type of outcome
. The geom you need to use is geom_boxplot()
.# download the biopsy data set
biopsy <- read_csv("https://rachaelcox.github.io/classes/datasets/biopsy.csv")
## Parsed with column specification:
## cols(
## clump_thickness = col_double(),
## uniform_cell_size = col_double(),
## uniform_cell_shape = col_double(),
## marg_adhesion = col_double(),
## epithelial_cell_size = col_double(),
## bare_nuclei = col_double(),
## bland_chromatin = col_double(),
## normal_nucleoli = col_double(),
## mitoses = col_double(),
## outcome = col_character()
## )
head(biopsy)
## # A tibble: 6 x 10
## clump_thickness uniform_cell_si~ uniform_cell_sh~ marg_adhesion
## <dbl> <dbl> <dbl> <dbl>
## 1 5 1 1 1
## 2 5 4 4 5
## 3 3 1 1 1
## 4 6 8 8 1
## 5 4 1 1 3
## 6 8 10 10 8
## # ... with 6 more variables: epithelial_cell_size <dbl>, bare_nuclei <dbl>,
## # bland_chromatin <dbl>, normal_nucleoli <dbl>, mitoses <dbl>, outcome <chr>
summary(biopsy)
## clump_thickness uniform_cell_size uniform_cell_shape marg_adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## epithelial_cell_size bare_nuclei bland_chromatin normal_nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## mitoses outcome
## Min. : 1.000 Length:683
## 1st Qu.: 1.000 Class :character
## Median : 1.000 Mode :character
## Mean : 1.603
## 3rd Qu.: 1.000
## Max. :10.000
ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) +
geom_boxplot() +
scale_fill_viridis_d(direction = -1, alpha = 0.75)
geom_violin()
. What do you notice?ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
scale_fill_viridis_d(direction = -1, alpha = 0.75, option = "E")
clump_thickness
distribution for each value of mitoses
using the facet_wrap()
function.ncol = 3
argument inside of the facet_wrap()
function.ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
scale_fill_viridis_d(direction = -1, alpha = 0.75, option = "A") +
facet_wrap(~mitoses, ncol = 3)
iris
data set, using the default histogram settings. Use geom_histogram()
.binwidth =
or bins =
. See ?geom_histogram
for more information.# default settings
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# wider bins
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.2)
# even wider bins
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.4)
# divide all observations by a set # of bins
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 10)
geom_histogram()
, now use geom_density()
and fill the area under the curves by species identity.ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_density()
alpha
argument inside of geom_density()
, so the overlap of the various distributions becomes clearly visible.ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.7)
Bonus challenge: For the iris
data set, make a plot of the 2d distribution of petal length vs. sepal length, by making an x-y plot that shows the individual data points as well as contour lines indicating the density of points in a given spatial region.
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
geom_density2d()
Now instead of contour lines, add a fitted straight black line (not a curve, and no confidence band!) to each group of points. You'll need to check ?geom_smooth
to see which arguments you'll need to specify.
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
geom_smooth(
aes(group = Species), # make a line for each species
method = lm, # fit a linear model
color = 'black',
se = FALSE) # no confidence interval
## `geom_smooth()` using formula 'y ~ x'
In this last example, because we are manually overriding the color of the lines, we need to set the group aesthetic to tell ggplot2 to draw a separate line for each species.