Day 4: Data transformation

In-class worksheet

July 2, 2020

In this worksheet, we will use the following libraries:

library(tidyverse)
library(tximport) # Bioconductor
library(rhdf5) # Bioconductor
library(DESeq2) # Bioconductor
library(pheatmap)
library(dada2) # Bioconductor
library(phyloseq) # Bioconductor
library(Biostrings) # Bioconductor
theme_set(theme_bw(base_size = 12)) # set the default plot theme for the ggplot2

1. RNAseq analysis with DESeq2

DESeq2 is a great package for analyzing and interpretting RNAseq results. We covered limited portions of the vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). Using that and the reference manual (https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf), solve these problems based on the data we looked at together in class.

# download the txi object with all samples
txi <- readRDS(url("https://rachaelcox.github.io/classes/datasets/txi.RDS"))

kallisto_df <- read_csv("https://rachaelcox.github.io/classes/datasets/SraRunTable.csv")
row.names(kallisto_df) <- kallisto_df$Run

## Warning: Setting row names on a tibble is deprecated.

1.1 Setting up a coldata object

Setup a coldata object using all samples in kallisto_df. This object should have data for all samples and information about phenotype, od600_nm, and the sample Run id.

coldata <- kallisto_df %>%
  select(phenotype,
         od600_nm,
         Run) %>%
  mutate(od600_nm = as.factor(od600_nm))

1.2 Create dds object

Create a dds object using DESeqDataSetFromTximport() that incorporates both the optical density and phenotype data. Your design should be more complex this time and should include a plus sign (HINT: Check the “Quick Start” section of the DESeq2 vignette for guidance)

dds <- dds <- DESeqDataSetFromTximport(txi,
                                colData = coldata,
                                design = ~ phenotype + od600_nm)

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

## using counts and average transcript lengths from tximport

1.3 Do full analysis

Get a data.frame with the results from your DESeq analysis. It should contain log2FC values for OD600 0.5 values/OD600 2 values.

dds <- DESeq(dds) # This runs the standard DESeq normalization pipeline

## estimating size factors

## using 'avgTxLength' from assays(dds), correcting for library size

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

resultsNames(dds) # Show contrast levels

## [1] "Intercept"              "phenotype_WT_vs_HNS.KO"
## [3] "od600_nm_2_vs_0.5"

res <- results(dds, contrast = c("od600_nm", "0.5", "2")) # Get log2FC
results_df <- as.data.frame(res)

Bonus challenge: In the lecture slides we used the counts function to make a data.frame with the input counts for each condition. Look into the counts( ) function in DESeq2, is there a better way that we could have compared expression counts rather than using the RAW input values? How would you do this? Looking at the phenotype variable, what does the heatmap we made in class look like when you do this and include all eight samples? What about the PCA plot?

counts_df <- counts(dds, normalized = T) # Normalize your reads
colnames(counts_df) <- coldata$Run # Fix column names
results_df <- cbind(counts_df, results_df)# Bind counts and results dfs

select <- order(rowMeans(counts_df), decreasing=TRUE)[1:20] # Select top 20 genes
df <- as.data.frame(colData(dds)[,c("phenotype")])
rownames(df) <- coldata$Run # Fix rownames (must matach counts_df columns)
colnames(df) <- "phenotype" # Fix column name
# Make a heatmap
pheatmap(counts_df[select,],
         cluster_rows=FALSE,
         show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)

ntd <- normTransform(dds) # Normalizes to gives log2(n + 1)
colnames(ntd) <- coldata$Run # Fix column names
plotPCA(ntd, intgroup="phenotype") # Plot PCA

2. 16S analysis with dada2

The dada object (generated via a call to the dada() function) stores inferred amplicon information. In the lecture we generated a dada object containing this information for all of the “Forward” reads in our dataset, dadaFS. How would you extract the unique amplicon sequences for sample # 4 (F3D142) from dadaFS?

dadaFs <- readRDS(url("https://rachaelcox.github.io/classes/datasets/dadaFs.RDS"))
F3D142 <- dadaFs[["F3D142"]]$sequence # This returns just the sequences
F3D142.seqs2 <- dada2::getSequences(dadaFs[["F3D142"]]) # This also returns just the sequences
F3D142.df <- dadaFs[["F3D142"]]$clustering # This returns a data.frame with a bunch of info

Extract and visualize the abundances from the same sample. Try to use the most informative plot.

F3D142.df <- dadaFs[["F3D142"]]$clustering

# Bar plot
ggplot(F3D142.df) +
  geom_col(aes(x = sequence,
               y = abundance)) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

# Histogram (could also use geom_density)
ggplot(F3D142.df, aes(x = abundance)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3 Other Bioconductor packages

Pick a Bioconductor package or packages that you might use for your own research (use lecture slides or Google, try to find one with a vignette available). Identify some classes and methods that you would use in your analysis and write a pipeline using pseudocode that starts with raw input data (e.g. a fastq file) and outputs something meaningful. This may not be easy but focus on thinking about the steps your pipeline would need and try to work through them logically. For an easier example, you could think about redoing the RNAseq analysis from the lecture slides using the edgeR package.

x # Counts df
group # Group list
y <- DGEList(counts=x,group=group)
keep <- filterByExpr(y)
y <- calcNormFactors(y)
design <- model.matrix(~group)

y <- estimateDisp(y,design)
# To perform quasi-likelihood F-tests:
fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf)

# To perform likelihood ratio tests:
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)