Day 2: Data visualization

In-class worksheet, solutions

June 1st, 2022

1. Plotting x-y relationships

1.1 Correlations

We will work with the iris data set available in R. This data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica:

# view the first several rows of the iris data set
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Using ggplot, make a scatter plot of petal length vs. sepal length for the three species; the function you need for this is geom_point() and you will need to specify color = Species inside of aes().

# load the tidyverse, which provides ggplot() and read_csv() used throughout
library(tidyverse)

# plot petal length vs sepal length, colored by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + 
  geom_point()

Do the same plot but facet by Species using facet_wrap().

# plot petal length vs sepal length, faceted by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + 
  geom_point() + 
  facet_wrap(~Species)
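
The scatter plots show the relationship visually. As a complement, the strength of the linear association can be put into a number with a correlation coefficient. The following is a minimal sketch using base R's cor() (Pearson correlation by default); the names cor_overall and cor_by_species are illustrative and not part of the worksheet.

```r
# Pearson correlation between sepal length and petal length, all species pooled
cor_overall <- cor(iris$Sepal.Length, iris$Petal.Length)

# the same correlation computed separately within each species
cor_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)

cor_overall
cor_by_species
```

Comparing the pooled value with the per-species values mirrors the comparison between the colored plot and the faceted plot: a pooled correlation can differ from the correlation within each group.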

1.2 Time series

The omp data set contains a subset of DNA microarray data measuring the differential expression of E. coli outer membrane proteins (omp) in nutrient-limited chemostatic cultures. In this particular experiment, the medium is glucose-limited. The gene column denotes one of 8 genes that code for outer membrane proteins, time_min denotes the time point sampled in minutes, and au denotes the change in gene expression detected by the microarray chip (arbitrary units of fluorescence intensity).

# download the `omp` data set
omp <- read_csv("https://rachaelcox.github.io/classes/datasets/ecoli_omp_expression.csv")

## Parsed with column specification:
## cols(
##   gene = col_character(),
##   time_min = col_double(),
##   au = col_double()
## )

head(omp)

## # A tibble: 6 x 3
##   gene  time_min    au
##   <chr>    <dbl> <dbl>
## 1 ompA         5 -0.32
## 2 ompA        15  0.46
## 3 ompA        30  0.18
## 4 ompA        60 -0.23
## 5 ompC         5 -0.48
## 6 ompC        15 -0.56

Plot the expression of each gene over time using geom_line(), coloring each line by gene. Notice anything off about this plot?

# plot the expression of each gene over time
ggplot(omp, aes(x = time_min, y = au, color = gene)) +
  geom_line()
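
When reading a line plot, it helps to know how ggplot2 decides which points belong to the same line. The sketch below uses a toy data frame copied from the first rows of omp printed above (two genes only) and compares a plot without grouping to one where gene is mapped to color. The names omp_toy and n_lines are illustrative helpers, not part of the worksheet.

```r
library(ggplot2)

# toy data copied from the first six rows of `omp` shown above
omp_toy <- data.frame(
  gene     = c("ompA", "ompA", "ompA", "ompA", "ompC", "ompC"),
  time_min = c(5, 15, 30, 60, 5, 15),
  au       = c(-0.32, 0.46, 0.18, -0.23, -0.48, -0.56)
)

# no grouping: one line connects the points of both genes in order of x
p_single <- ggplot(omp_toy, aes(x = time_min, y = au)) +
  geom_line()

# mapping a discrete variable to color also groups the data: one line per gene
p_grouped <- ggplot(omp_toy, aes(x = time_min, y = au, color = gene)) +
  geom_line()

# helper: count the distinct groups (lines) in the first layer of a plot
n_lines <- function(p) length(unique(layer_data(p, 1)$group))

n_lines(p_single)
n_lines(p_grouped)
```

Mapping gene to color is what keeps each gene on its own line; dropping that mapping would need an explicit group = gene to get the same result.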

Make the same plot but make the size of the lines thicker by specifying the size argument inside of geom_line().
Use scale_color_colorblind() to convert the legend colors to a colorblind-friendly palette.
Use xlab() and ylab() to give the figure pretty labels.

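# plot the expression of each gene over time & make it pretty
library(ggthemes)  # provides scale_color_colorblind()

ggplot(omp, aes(x = time_min, y = au, color = gene)) +
  geom_line(size = 2) +  # in ggplot2 >= 3.4.0, use linewidth = 2 instead of size
  scale_color_colorblind() +
  xlab("Time (min)") +
  ylab("Relative abundance (au)")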
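
If the ggthemes package is not available, a similar colorblind-friendly palette can be built by hand. The sketch below is an alternative to the solution above, not a replacement for it: it assumes R 4.0.0 or later, where base R's palette.colors() provides the Okabe-Ito colors, and it uses a toy data frame copied from the first rows of omp. The names omp_toy and okabe_ito are illustrative.

```r
library(ggplot2)

# toy data copied from the first six rows of `omp` shown above
omp_toy <- data.frame(
  gene     = c("ompA", "ompA", "ompA", "ompA", "ompC", "ompC"),
  time_min = c(5, 15, 30, 60, 5, 15),
  au       = c(-0.32, 0.46, 0.18, -0.23, -0.48, -0.56)
)

# Okabe-Ito colors from base R; unname() so that scale_color_manual()
# assigns the colors by position rather than matching them by name
okabe_ito <- unname(palette.colors(n = 8, palette = "Okabe-Ito"))

p <- ggplot(omp_toy, aes(x = time_min, y = au, color = gene)) +
  geom_line() +
  scale_color_manual(values = okabe_ito) +
  labs(x = "Time (min)", y = "Relative abundance (au)")
```

Printing p draws the figure. Eight colors are requested here because the full omp data set contains 8 genes; the toy data only uses the first two.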

2. Plotting amounts

2.1 Bar plots

The bacteria data set contains data from tests of the presence of the bacterium H. influenzae in children with otitis media in the Northern Territory of Australia. We are interested in two columns of this data set: presence reports the presence (y) or absence (n) of the bacterium, and treatment reports the treatment, which was placebo, drug, or drug+ (drug plus high adherence).

# download the `bacteria` data set
bacteria <- read_csv("https://rachaelcox.github.io/classes/datasets/bacteria.csv")

## Parsed with column specification:
## cols(
##   presence = col_character(),
##   ap = col_character(),
##   hilo = col_character(),
##   week = col_double(),
##   ID = col_character(),
##   treatment = col_character()
## )

head(bacteria)

## # A tibble: 6 x 6
##   presence ap    hilo   week ID    treatment
##   <chr>    <chr> <chr> <dbl> <chr> <chr>    
## 1 y        p     hi        0 X01   placebo  
## 2 y        p     hi        2 X01   placebo  
## 3 y        p     hi        4 X01   placebo  
## 4 y        p     hi       11 X01   placebo  
## 5 y        a     hi        0 X02   drug+    
## 6 y        a     hi        2 X02   drug+

Using geom_bar(), make a bar plot with the treatment column on the x-axis, assigning the presence column to the fill argument to visualize the number of patients with and without bacteria for each type of treatment.

ggplot(bacteria, aes(x = treatment, fill = presence)) + 
  geom_bar()

Notice that, by default, geom_bar() stacks the counts for each class of presence on top of each other. Generally, this isn't the clearest way to visualize this type of data. Now:

Modify the plot so that bars indicating the patients with or without the bacterium are shown side-by-side. Hint: This requires the argument position='dodge' in geom_bar().
Use scale_fill_brewer() to change the plot colors.

ggplot(bacteria, aes(x = treatment, fill = presence)) + 
  geom_bar(position = 'dodge') +
  scale_fill_brewer()

Modify the plot so that bars represent the relative number of cases with or without the bacterium. What is the appropriate position option in geom_bar() to achieve this effect? Use ?geom_bar to find out.
Apply an alternative color palette for scale_fill_brewer() to use by specifying the type and palette arguments (see ?scale_fill_brewer for details).

ggplot(bacteria, aes(x = treatment, fill = presence)) + 
  geom_bar(position = 'fill') +
  scale_fill_brewer(type = 'qual', palette = 'Set1')
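
To see what the different position options display, it can help to compute the underlying numbers directly. The following sketch uses a small hypothetical data frame (toy, invented for illustration and not the real bacteria data) with the same two columns, and base R's table() and prop.table().

```r
# hypothetical toy data in the same layout as the two columns of interest
toy <- data.frame(
  treatment = c("placebo", "placebo", "placebo",
                "drug", "drug", "drug", "drug",
                "drug+", "drug+"),
  presence  = c("y", "y", "n",
                "y", "n", "n", "y",
                "y", "n")
)

# counts per treatment and presence: the bar heights for the stacked
# and dodged bar plots
counts <- table(toy$treatment, toy$presence)

# proportions within each treatment (rows sum to 1): the bar heights
# for position = 'fill'
props <- prop.table(counts, margin = 1)

counts
props
```

With position = 'fill', every bar has the same total height, so differences in group size are hidden and only the relative proportions can be compared.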

2.2 Heat maps

The dandelion data set contains RNA-seq reads for a subset of genes differentially expressed in response to five conditions.

# download the dandelion differential expression data set
dandelion <- read_csv("https://rachaelcox.github.io/classes/datasets/dandelion_diffexp_tidy.csv")

## Parsed with column specification:
## cols(
##   transcript_dandelion = col_character(),
##   baseMean = col_double(),
##   z_score = col_double(),
##   loci_arabidopsis = col_character(),
##   protein_annotation = col_character(),
##   gene_names_primary = col_character(),
##   condition = col_character(),
##   log2_foldchange = col_double()
## )

head(dandelion)

## # A tibble: 6 x 8
##   transcript_dand~ baseMean z_score loci_arabidopsis protein_annotat~
##   <chr>               <dbl>   <dbl> <chr>            <chr>           
## 1 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## 2 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## 3 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## 4 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## 5 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## 6 DN42754_c0_g1_i1     167.    1.28 AT5G67060        Transcription f~
## # ... with 3 more variables: gene_names_primary <chr>, condition <chr>,
## #   log2_foldchange <dbl>

Plot a heat map using geom_tile() where each condition is on the x-axis and each gene (either transcript_dandelion or loci_arabidopsis) is on the y-axis.
Fill the color of the heat map using the log2_foldchange column. Use scale_fill_distiller() to specify a continuous diverging color palette.

(Note that I have told R Markdown to make a larger figure, by starting the code block with {r fig.height=6, fig.width=10} instead of {r}, because the default figure size is too narrow to show the resulting axes and map.)

ggplot(dandelion, aes(x = condition, y = transcript_dandelion, fill = log2_foldchange)) +
  geom_tile() +
  scale_fill_distiller(type = 'div', 
                       palette = 'Spectral', 
                       guide = 'colourbar', 
                       direction = 1)
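
A diverging palette is easiest to read when its midpoint corresponds to a meaningful value, here a log2 fold change of zero. One way to arrange this, sketched below under the assumption that symmetric limits are wanted, is to pass limits that extend equally far on both sides of zero. The data frame toy and its values are hypothetical, and max_abs is an illustrative name.

```r
library(ggplot2)

# hypothetical toy data: 3 transcripts measured in 2 conditions
toy <- data.frame(
  condition       = rep(c("cond_A", "cond_B"), each = 3),
  transcript      = rep(c("t1", "t2", "t3"), times = 2),
  log2_foldchange = c(-1.5, 0.2, 3.0, 0.8, -0.4, 1.1)
)

# largest absolute fold change, used as a symmetric limit around zero
max_abs <- max(abs(toy$log2_foldchange))

p <- ggplot(toy, aes(x = condition, y = transcript, fill = log2_foldchange)) +
  geom_tile() +
  scale_fill_distiller(type = 'div',
                       palette = 'RdBu',
                       limits = c(-max_abs, max_abs))
```

Printing p draws the heat map. Without symmetric limits, the neutral middle color of the palette falls at the midpoint of the observed range, which is generally not zero.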

Make the same plot again, but rotate the x-axis labels so they don't overlap.
Apply a different color palette this time. You can use a different palette within scale_fill_distiller() (see ?scale_fill_distiller) or you can use a different continuous color function, such as scale_fill_viridis_c() or scale_fill_gradient().

ggplot(dandelion, aes(x = condition, y = transcript_dandelion, fill = log2_foldchange)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(fill = 'log2(fc)')  # set the legend title for the fill scale

3. Plotting distributions

3.1 Boxplots, violin plots

Using the biopsy data set, make boxplots of clump_thickness for each type of outcome. The geom you need to use is geom_boxplot().

# download the biopsy data set
biopsy <- read_csv("https://rachaelcox.github.io/classes/datasets/biopsy.csv")

## Parsed with column specification:
## cols(
##   clump_thickness = col_double(),
##   uniform_cell_size = col_double(),
##   uniform_cell_shape = col_double(),
##   marg_adhesion = col_double(),
##   epithelial_cell_size = col_double(),
##   bare_nuclei = col_double(),
##   bland_chromatin = col_double(),
##   normal_nucleoli = col_double(),
##   mitoses = col_double(),
##   outcome = col_character()
## )

head(biopsy)

## # A tibble: 6 x 10
##   clump_thickness uniform_cell_si~ uniform_cell_sh~ marg_adhesion
##             <dbl>            <dbl>            <dbl>         <dbl>
## 1               5                1                1             1
## 2               5                4                4             5
## 3               3                1                1             1
## 4               6                8                8             1
## 5               4                1                1             3
## 6               8               10               10             8
## # ... with 6 more variables: epithelial_cell_size <dbl>, bare_nuclei <dbl>,
## #   bland_chromatin <dbl>, normal_nucleoli <dbl>, mitoses <dbl>, outcome <chr>

summary(biopsy)

##  clump_thickness  uniform_cell_size uniform_cell_shape marg_adhesion  
##  Min.   : 1.000   Min.   : 1.000    Min.   : 1.000     Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000    1st Qu.: 1.000     1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000    Median : 1.000     Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151    Mean   : 3.215     Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000    3rd Qu.: 5.000     3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000    Max.   :10.000     Max.   :10.00  
##  epithelial_cell_size  bare_nuclei     bland_chromatin  normal_nucleoli
##  Min.   : 1.000       Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000       1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000       Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234       Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000       3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000       Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     mitoses         outcome         
##  Min.   : 1.000   Length:683        
##  1st Qu.: 1.000   Class :character  
##  Median : 1.000   Mode  :character  
##  Mean   : 1.603                     
##  3rd Qu.: 1.000                     
##  Max.   :10.000

ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) + 
  geom_boxplot() +
  scale_fill_viridis_d(direction = -1, alpha = 0.75)
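
Each box drawn by geom_boxplot() spans the first to the third quartile of a group, with the median drawn as a line inside the box. The sketch below computes these three numbers per group with base R's quantile(). It uses a small hypothetical data frame (toy_biopsy, invented for illustration) rather than the real biopsy data.

```r
# hypothetical toy data with the same two columns used in the plot
toy_biopsy <- data.frame(
  outcome         = rep(c("benign", "malignant"), each = 8),
  clump_thickness = c(1, 1, 2, 3, 3, 4, 5, 5,
                      5, 6, 7, 8, 8, 9, 10, 10)
)

# first quartile, median, and third quartile for each outcome;
# the result is a matrix with one column per outcome
quartiles <- sapply(
  split(toy_biopsy$clump_thickness, toy_biopsy$outcome),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

quartiles
```

Running the same computation on the real biopsy data gives the numbers behind the boxes in the plot above.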

Make the same plot again, instead using geom_violin(). What do you notice?

ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) + 
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  scale_fill_viridis_d(direction = -1, alpha = 0.75, option = "E")

Add a layer to your code that creates side-by-side plots of clump_thickness distribution for each value of mitoses using the facet_wrap() function.
Try this with and without specifying the ncol = 3 argument inside of the facet_wrap() function.

ggplot(biopsy, aes(y = clump_thickness, x = outcome, fill = outcome)) + 
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  scale_fill_viridis_d(direction = -1, alpha = 0.75, option = "A") +
  facet_wrap(~mitoses, ncol = 3)

3.2 Histograms, density plots

Make a histogram plot of sepal lengths in the iris data set, using the default histogram settings. Use geom_histogram().
Make two more such plots, with different bin widths. You can change the bin width by specifying the argument binwidth = or bins =. See ?geom_histogram for more information.

# default settings
ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# wider bins
ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(binwidth = 0.2)

# even wider bins
ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(binwidth = 0.4)

# divide the range of the data into a set number of bins
ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 10)
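
The bins and binwidth arguments are two ways of controlling the same thing: for a fixed data range, more bins means narrower bins. As a rough back-of-the-envelope sketch (ggplot2's exact bin placement differs slightly from this simple division), the default of 30 bins corresponds to the following approximate width. The names x_range, n_bins, and approx_width are illustrative.

```r
# range of the sepal lengths in the iris data set
x_range <- range(iris$Sepal.Length)

# default number of bins used by geom_histogram()
n_bins <- 30

# approximate width of each bin: data range divided by the number of bins
approx_width <- diff(x_range) / n_bins

x_range
approx_width
```

This is why binwidth = 0.2 and binwidth = 0.4 are labeled as wider bins in the examples above: both exceed the approximate default width.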

Instead of geom_histogram(), now use geom_density() and fill the area under the curves by species identity.

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_density()

Make the areas under the curve partially transparent using the alpha argument inside of geom_density(), so the overlap of the various distributions becomes clearly visible.

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_density(alpha = 0.7)

4. Layering geoms

Bonus challenge: For the iris data set, make a plot of the 2d distribution of petal length vs. sepal length, by making an x-y plot that shows the individual data points as well as contour lines indicating the density of points in a given spatial region.

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + 
  geom_point() + 
  geom_density2d()

Now instead of contour lines, add a fitted straight black line (not a curve, and no confidence band!) to each group of points. You'll need to check ?geom_smooth to see which arguments you'll need to specify.

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + 
  geom_point() + 
  geom_smooth(
    aes(group = Species),  # make a line for each species
    method = lm,  # fit a linear model
    color = 'black',
    se = FALSE)  # no confidence interval

## `geom_smooth()` using formula 'y ~ x'

In this last example, because we are manually overriding the color of the lines, we need to set the group aesthetic to tell ggplot2 to draw a separate line for each species.
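
The straight lines drawn by geom_smooth(method = lm) are ordinary least-squares fits of y on x within each group. As a minimal sketch, the same fits can be reproduced outside of ggplot2 with lm(); the name fits is illustrative and not part of the worksheet.

```r
# fit Petal.Length ~ Sepal.Length separately for each species and
# collect the coefficients in a matrix (rows: intercept and slope)
fits <- sapply(
  split(iris, iris$Species),
  function(d) coef(lm(Petal.Length ~ Sepal.Length, data = d))
)

fits
```

Each column holds the intercept and slope of one of the three lines in the plot, which makes it possible to compare the species numerically as well as visually.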