Day 4: Data transformation

In-class worksheet

July 2, 2020

In this worksheet, we will use the following libraries:

library(tidyverse)
library(tximport) # Bioconductor
library(rhdf5) # Bioconductor
library(DESeq2) # Bioconductor
library(pheatmap)
library(dada2) # Bioconductor
library(phyloseq) # Bioconductor
library(Biostrings) # Bioconductor
theme_set(theme_bw(base_size = 12)) # set the default plot theme for the ggplot2

1. RNAseq analysis with DESeq2

DESeq2 is a great package for analyzing and interpretting RNAseq results. We covered limited portions of the vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). Using that and the reference manual (https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf), solve these problems based on the data we looked at together in class.

# download the txi object with all samples
txi <- readRDS(url("https://rachaelcox.github.io/classes/datasets/txi.RDS"))

kallisto_df <- read_csv("https://rachaelcox.github.io/classes/datasets/SraRunTable.csv")
row.names(kallisto_df) <- kallisto_df$Run
## Warning: Setting row names on a tibble is deprecated.

1.1 Setting up a coldata object

Setup a coldata object using all samples in kallisto_df. This object should have data for all samples and information about phenotype, od600_nm, and the sample Run id.

# R code here

1.2 Create dds object

Create a dds object using DESeqDataSetFromTximport() that incorporates both the optical density and phenotype data. Your design should be more complex this time and should include a plus sign (HINT: Check the “Quick Start” section of the DESeq2 vignette for guidance)

# R code here

1.3 Do full analysis

Get a data.frame with the results from your DESeq analysis. It should contain log2FC values for OD600 0.5 values/OD600 2 values.

# R code here

Bonus challenge: In the lecture slides we used the counts function to make a data.frame with the input counts for each condition. Look into the counts( ) function in DESeq2, is there a better way that we could have compared expression counts rather than using the RAW input values? How would you do this? Looking at the phenotype variable, what does the heatmap we made in class look like when you do this and include all eight samples? What about the PCA plot?

# R code here

2. 16S analysis with dada2

The dada object (generated via a call to the dada() function) stores inferred amplicon information. In the lecture we generated a dada object containing this information for all of the “Forward” reads in our dataset, dadaFS. How would you extract the unique amplicon sequences for sample # 4 (F3D142) from dadaFS?

dadaFs <- readRDS(url("https://rachaelcox.github.io/classes/datasets/dadaFs.RDS"))

# R code here

Extract and visualize the abundances from the same sample. Try to use the most informative plot.

# R code here

3 Other Bioconductor packages

Pick a Bioconductor package or packages that you might use for your own research (use lecture slides or Google, try to find one with a vignette available). Identify some classes and methods that you would use in your analysis and write a pipeline using pseudocode that starts with raw input data (e.g. a fastq file) and outputs something meaningful. This may not be easy but focus on thinking about the steps your pipeline would need and try to work through them logically. For an easier example, you could think about redoing the RNAseq analysis from the lecture slides using the edgeR package.

# Pseudocode here