Preparation (30 min)

Story

The story we’ll work with is a ‘drug repurposing’ effort to use an old diuretic drug as a cancer treatment:
Amiloride to treat multiple myeloma

Note that this is publically available data under GEO accession ID GSE95077, but the data shipped with this repository is altered for educational purposes.
In this study, RNA-seq data is available to compare amiloride (AMIL) to a state-of-the-art drug (TG003) in two cancer (multiple myeloma) cell lines with different mutations (BM: p53mut, JJ: del(17p)).

In total we have 18 samples:

  • 3 conditions (drugs): AMIL, TG and DMSO (control)
  • 2 cell types: BM and JJ
  • 3 replicates for each combination of conditions and cell-types



Tools

Poll 1.1: How do you get information about the current R-version and loaded packages?


Packages

Many useful functions are already available with R: base packages. Most analysis tools come with software packages that will need to be installed before use.

For R, two of the most common and reliable package repositories are:

Reminder

  1. All packages come for a specific R-version.
  2. All packages have specific package version, and can come from different sources (CRAN, Bioconductor, github, …).
  3. Most packages have dependencies on other packages that should also be installed
  4. Installation can be very time consuming but needs to be done only once.

Load packages

Task: Load the following packages: DESeq2, tidyverse, pheatmap

Poll 1.2: Did you manage to load DESeq2, tidyverse & pheatmap ?

library(DESeq2)
library(tidyverse)
library(pheatmap)

Reminder: Loading new packages enables new functionalities for a given R-session.

This may result in multiple (sometimes redundant) functions with the same name: e.g. sd()

In case of doubt, use package name explicitly: e.g. stats::sd()

Task: Uses “sessionInfo()” to show all available packages in your R session.


Data and Metadata

We assume that the sequencing data has already been processed (QC, mapping, counting) to obtain a so-called count matrix. Data received in-house (given that it’s a model organism) will have count matrices generated that you can start with immediately. A count matrix contains the number of reads that have been mapped to each gene, for each sample.


Example of a count matrix. Image from github.com/hbctraining
Example of a count matrix. Image from github.com/hbctraining

In principle, you can download the published data for our tutorial from GEO, and many published papers will (or should) have a GEO / SRA / data accession code available. For this course, we have already prepared the data that will be used in in a convenient format.

Recommendation: It is good style to put data into a separate directory (e.g. data/).

Import Data

Data can come in various formats. Here we have prepared the data as tab-separated file with header information.

Task: Read the data e.g. using read_tsv()

dfile <- "data/myeloma/myeloma_counts.tsv"
data <- read_tsv(file=dfile)

Take some time to inspect the data structure. What are the variables? How many samples are in the data ?

data %>% head()

Poll 1.3: What is the median expression count in sample “BM_CTRL_3”?

Recommendation: Some functions that could be of help: summary (baseR), pull (tidyverse)

Don’t worry if the syntax is confusing at first. It takes time and practice to familiarize yourself.

data %>% summary
data %>% pull("BM_CTRL_3") %>% median()

Where are the gene names, and which convention is chosen for them?

Import Metadata

Metadata (=data about data); typically contains important information on the samples and their origin. Often these are categorical data such as cell type, experimental condition or other batch assignments. But it could also be quantitative data (e.g. age, weight, …). Often the metadata is hidden in the paper (or in the filenames). This is bad practise - make sure to have a proper ‘samplesheet’. Metadata is just as important as data.

Our metadata file is located here:

‘data/myeloma/myeloma_meta.tsv’

Task: Read the metadata into an R-object “metadata”. Inspect the data frame - which data types are included? Make sure that the rownames of the metadata are valid sample names.

mfile <- "data/myeloma/myeloma_meta.tsv"               # location of metadata file
metadata <- read_tsv(file=mfile) 
## Rows: 18 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): sample, celltype, condition
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
metadata %>% head()

Poll 1.4: What is a factor?

Task Convert the character variables “condition” and “celltype” into factors.

# convert text strings to categorical variable (characters --> factors)
metadata$condition <- as.factor(metadata$condition)
metadata$celltype  <- as.factor(metadata$celltype)
metadata$sample    <- as.factor(metadata$sample)

# alternatively
#metadata <- metadata %>% mutate_if(is.character,as.factor)

Design (30 min)

For each gene (=row), the data matrix contains a vector of counts (\(y\)) which depends on the vector of samples (\(x\))

Tasks: Extract the 42nd gene in the data matrix and plot it against all samples

gene = data %>%
  slice(42) %>%                                                      # pick gene
  pivot_longer(-gene_id, names_to = "sample", values_to = "count")   # convert to long

gene_mean = gene %>% pull(count) %>% mean()                          # calculate mean

ggplot(gene, aes(x=sample, y=count)) +
  geom_point() +
  geom_hline(yintercept = gene_mean, color=2) +
  theme(axis.text.x = element_text(angle = 90))

#base-R
#gene_b = data[42,-1]
#boxplot(gene_b, xlab="sample", ylab="count", names=FALSE) 
#abline(h=rowMeans(gene_b), col="red", lty=2)

Hope

Gene expression can be predicted (and controlled) given relevant sample information \(x\) and a model \(f(x, \theta)\)

\[ y_i = f(x_i, \theta) ~~~~~~(i = 1\ldots n)\\ \] > Poll 1.5: What is \(n\) in our example?

Qualifications:

  • probabilitstic model: predict expectations up to some noise term \(\to E[y_i] + \epsilon_i\)
  • for count data: model the logarithm of the expectation \(\to \log E[y_i] = f(x_i, \theta)\)

To keep notation simple we use \(y_i = \log E[y_i]\) below.

Today we will focus on the function \(f\). Day 2 will cover more about \(\epsilon \propto P(\mu=0, \sigma^2)\)

What should \(x\) be?

Reminder: The researcher needs to decide which factors should be included in the model.

  • none: all samples have their own values \(y\)
  • simple average: samples are spread around some global mean \(\to y = \mu + \epsilon\)
  • quantitative variable: e.g. sequence depth \(\to y = f(sequencing~depth) + \epsilon\)
  • categorical variable: e.g. condition \(\to y = f(condition) + \epsilon\)
  • combination of variables: e.g. condition and celltype \(\to y = f(condition, celltype) + \epsilon\)
gene %>%
  # join gene with metadata (by sample)
  inner_join(metadata, by="sample") %>%
  #plot against condition
  ggplot(aes(x=condition, y=count, col=condition)) + 
  geom_point(size=3)  

Task: Plot gene counts against celltype.

Notice:

  • ggplot understands factors for plotting
  • R understands formulae with factors: \(Y \sim F\)

Under the hood: the model matrix

How are factorial variables (e.g. condition) encoded into something numerical ?

Use dummy variables: binary encoding of categorical variable \(x\) (switch different levels on and off)

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \\ \end{array} \right] = \left[ \begin{array}{ccc} 1 & 0 & \ldots & 0 \\ 1 & 0 & \ldots & 0 \\ \vdots \\ 0 & 1 & \ldots & 0 \\ \vdots \\ 0 & 0 & \ldots & 1 \\ \end{array} \right] \cdot \left[ \begin{array}{c} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \\ \end{array} \right] \] and in compact notation: \[ y = X \mu \] \(X\): \(n\times p\) incidence model matrix \(\to\) multivariate linear regression.

Recall: \(y=log E[y]\) so this amounts to a generalized linear regression (GLM).


An equivalent alternative parameterization is: \[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \\ \end{array} \right] = \left[ \begin{array}{ccc} 1 & 0 & \ldots & 0 \\ 1 & 0 & \ldots & 0 \\ \vdots \\ 1 & 1 & \ldots & 0 \\ \vdots \\ 1 & 0 & \ldots & 1 \\ \end{array} \right] \cdot \left[ \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \\ \end{array} \right] \] and in compact notation: \[ y = \beta_0 + X' \beta \] \(X'\): \(n \times (p-1)\) model matrix with intersept. With this parameterization one factor level serves as reference level \(\beta_0 = \mu_1\).

Sometimes this choice is preferable since then all the other parameters \(\beta_i=\mu_{i+1} - \mu_1\) can be interpreted as effects of level \(i \ge 1\) with respect to reference level.

We are free to chose from many different parameterizations \(\mu \leftrightarrow \beta\) which will have different interpretations.

Linear Model
Linear Model



Design In Action

In R, the model matrix \(X\) can be obtained easily from metadata + design formula. \[ \mbox{R syntax:} ~~~Y \sim V \]

where the formula should only refer to variables \(V\) that are defined as columns of the metadata.

metadata %>% colnames
## [1] "sample"    "celltype"  "condition"

Examples are given below.

Simple designs (1 factor, p-levels)

We may presuppose that gene expression (or rather the logarithm of the expected counts) is a function of “condition” only

\[ Y \sim condition \]

Per default this notation refers to equation \(y = \beta_0 + X' \beta\) with an intercept term \(\beta_0\), but the latter can also be switched off.

my_design <- ~ 1 + condition   # with intersect (identical to " ~ condition")
my_design <- ~ 0 + condition   # without intersect

The design formula together with a metadata implicitly defines the model matrix \(X\).

We need \(X\) only occasionally, but it may be helpful to visualize it here:

X <- model.matrix(my_design, data=metadata)
rownames(X) = metadata$sample
pheatmap(X, cluster_cols = FALSE, cluster_rows=FALSE)

Notice: A factor with \(p\) levels will have \(p\) dummy variables and \(p\) corresponding parameters (e.g. the means for \(p\) different groups).

Task Try the alternative design formula and inspect the effect on the matrix.

Multifactorial designs

More complicated designs are possible and often necessary (\(\to\) day 2 and 3)

Tasks: Include an additional factor (condition + celltype) and observe the design matrix
Bonus: Include an interaction term (+ condition*celltype) and discuss its interpretation

X <- model.matrix(~ 1 + condition * celltype, data=metadata)
rownames(X) = metadata$sample
pheatmap(X, cluster_cols = FALSE, cluster_rows=FALSE)


Experimental Design (10 min)

A more elaborate (highly recommended) tutorial can be found here

A typical experiment
A typical experiment



The problem: more factors!

Usually there are other factors in addition to a simple treatment:

  • known and interesting: multi-factorial design and interaction effects (condition * celltype)
  • known and uninteresting: batch, covariates
  • unknown and unvoidable: noise

… all contribute to total variability and limit our ability to detect treatment-related differences \(\to\) account for them as much as possible !


Poll 1.6 You are planning an experiment with 3 treatment levels \((X_1, X_2, X_3)\) and aim for 3 replicates each. Unfortunately, you can process at most 3 samples per day. So you’ll need three days \((Z_1, Z_2, Z_3)\). How should you assign your samples to different treatments?

Experimental Design - Poll
Experimental Design - Poll

Design Checks

For sanity checks it is useful to prepare a table of the metadata

metadata %>% select(-sample) %>% table
##         condition
## celltype AMIL DMSO TG
##       BM    3    3  3
##       JJ    3    3  3

This should answer a few common design questions - even before any data \(Y\) is collected !

  • Are all cells occupied (full design)?
  • Do all cells have the same number of observations (balanced design)
  • Are there missing observations - empty cells?
  • Is there confounding between factors?

Sources of Variability

Source What can be done about it ?
Treatment Control - or observe?
Biological Estimate with replication
Technical Control with sequencing depth (read sampling)
Sample Preparation Estimate with batch controls (blocking)
Unknown Confounders Reduce risk by randomization
Errors (e.g. sample swaps) Detection, Documentation, SOPs
Analysis Control software version and parameters

Some nuisance factors (e.g. batches) may be known and measurable, but are not of primary interest - they should be included as metadata and in the model.

\[ \mbox{R syntax:} ~~~Y \sim X + Z \]

Reminder: The choice of factors rests with the researcher.

Factor Choices.
Factor Choices.

Break (10 min)


Create DESeq2 object (5 min)

Each software has their own data representation :-(

A DESeq2 data object combines data, metadata and the design formula in one single object

dds = data + metadata + design

All subsequent calculations will use this object (and modify it).

Task: Create a DESeq data set using DESeq::DESeqDataSetFromMatrix()

dds <- DESeqDataSetFromMatrix(
  # countData= expects matrix, so remove "gene_id" column but keep as rowname
  countData=data %>% column_to_rownames("gene_id"),
  colData=metadata, 
  design= my_design)
## converting counts to integer mode

Data Exploration (40 min)

1. Data Access

Tasks:

  • Inspect the class, structure and dimensions of dds.
  • Re-extract the count matrix (Hint: ?counts) and save as Y
  • Re-extract the metadata (Hint: ?colData), remove column “sample” and save as meta
## class: DESeqDataSet 
## dim: 57905 18 
## metadata(1): version
## assays(1): counts
## rownames(57905): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(0):
## colnames(18): JJ_CTRL_1 BM_CTRL_2 ... JJ_TG_2 JJ_TG_3
## colData names(3): sample celltype condition

Tasks:

  • Extract the first 10 rows (genes) from count matrix Y and safe this data subset as Yr (reduced size for further use)
  • Optional: Extract 10 randomly sampled genes.
Yr <- Y %>% head(10)                        # extract first 10 rows (for illustrations only)
#Yr <- Y %>% data.frame() %>% sample_n(10)  # random rows

2. Data Visualization

Task: Create heatmap for the reduced data set Yr (?pheatmap). You might want to disable the default clustering of rows and columns

pheatmap(Yr, cluster_rows=FALSE, cluster_cols=FALSE, main="First glimpse")

Task: Add metadata meta as annotation to pheatmap

Customizing Metadata (5 min)

Task:

  • understand the object “meta” and how it is encoded as colors in pheatmap
  • modify “meta” by adding sequence depth as additional metadata for all samples (additional column)
  • rerun the pheatmap to yield

Notice: ‘meta=colData(dds)’ is almost identical to the original ‘metadata’ (read from file). But meta has rownames, and rownames(meta) = colnames(Y)

Customizing Colours (5 min)

Often we need to adjust colours or ensure consistency across a project. In the example above, the colour representation of the metadata may also be improved.

Use > library(RColorBrewer) for systematic and educated colour choices.

Goal: Define list of colours to be used for metadata?

# celltype colours
ct_col <- c("white", "black")                   
names(ct_col) <- levels(meta$celltype)

# condition colours
cn_col <- c("#1B9E77", "#D95F02", "#7570B3")   # = RColorBrewer::brewer.pal(n=3, "Dark2")
names(cn_col) <- levels(meta$condition)

# create list of colour choices
my_colors <- list( celltype= ct_col, condition = cn_col )

pheatmap(Yr, cluster_rows=FALSE, cluster_cols=FALSE, 
         annotation=meta, annotation_colors = my_colors, 
         main="Data + Metadata")

The colour encoding of the heatmap can also be customized (code not run)

blue_colors = RColorBrewer::brewer.pal(9,"Blues")
pheatmap(Yr, cluster_rows=FALSE, cluster_cols=FALSE,
         annotation = meta, annotation_colors = my_colors, 
         color=blue_colors,
         main="Data + Metadata") 

3. Data transformation

A more elaborate (highly recommended) tutorial can be found here

Task Inspect the quantiles and means of the first 10 genes (from Yr) and create a boxplot

Hint: Many R functions (e.g. summary, boxplot) treat rows as observations, and columns as variables. This is different from the common choice for count matrix Y its submatrix Yr. To utilize these functions efficiently we first have to transpose.

##  ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 
##  Min.   :0       Min.   :0       Min.   :1155    Min.   : 431.0  
##  1st Qu.:0       1st Qu.:0       1st Qu.:1696    1st Qu.: 615.2  
##  Median :0       Median :0       Median :2114    Median : 790.0  
##  Mean   :0       Mean   :0       Mean   :2315    Mean   : 765.5  
##  3rd Qu.:0       3rd Qu.:0       3rd Qu.:2450    3rd Qu.: 866.5  
##  Max.   :0       Max.   :0       Max.   :4808    Max.   :1342.0  
##  ENSG00000000460  ENSG00000000938 ENSG00000000971   ENSG00000001036 
##  Min.   : 616.0   Min.   :0.000   Min.   :   0.00   Min.   :  40.0  
##  1st Qu.: 897.2   1st Qu.:1.000   1st Qu.:   1.25   1st Qu.:  81.5  
##  Median :1204.0   Median :2.500   Median : 997.00   Median : 721.0  
##  Mean   :1279.2   Mean   :2.444   Mean   :1510.50   Mean   : 931.2  
##  3rd Qu.:1483.5   3rd Qu.:3.000   3rd Qu.:2902.75   3rd Qu.:1535.5  
##  Max.   :2569.0   Max.   :6.000   Max.   :4345.00   Max.   :2411.0  
##  ENSG00000001084 ENSG00000001167
##  Min.   : 552    Min.   :1178   
##  1st Qu.:1302    1st Qu.:2080   
##  Median :2008    Median :2594   
##  Mean   :2109    Mean   :2707   
##  3rd Qu.:2890    3rd Qu.:3212   
##  Max.   :3963    Max.   :4837

Discussion: What is the problem and what could we do about it ?

Yt <- log2(Y + 1)   # simple log-transform; (why + 1 ?)
Yt[1:10,] %>% t %>% boxplot(las=2, cex.axis=0.5)

More boxplot customization with ggplot

Yt %>% head(10) %>%
  as_tibble(rownames="gene_id") %>% 
  pivot_longer(-gene_id, names_to = "sample", values_to = "log_count") %>% 
  inner_join(metadata, by="sample") %>%
  ggplot(aes(x=gene_id, y=log_count, color=condition)) +
  geom_boxplot() +
  facet_wrap(~celltype) +
  theme(axis.text.x = element_text(angle = 90))

What happened to zero counts ? What might be the source of remaining variability ?

  • Transformation: make distributions more pleasing
  • Normalization: make samples comparable

4. Sample-Sample correlations

Here the goal is to derive the overall similarity of samples based on the genome-wide expression data.

Question: How to calculate the sample-sample correlation matrix?

C <- cor(Yt)       # get sample-sample correlations: other correlations?
pheatmap(C, annotation = meta, annotation_colors = my_colors)

Notice the overall high correlations. How would you explore this further?

# correlation coeff. is only a single number
smoothScatter(Yt[,"JJ_TG_1"], Yt[,"JJ_AMIL_1"])

5. PCA plot

PCA aims to project samples from a high-dimensional space (M=20k+ genes) to a lower dimension (D=2). Similar to correlation analysis, the main purpose is to identify distinct and similar samples.

# vst() is one of many variance stabilization methods: log, vst, rlog, ...
# DESeq::plotPCA() requires a 'DESeqTransform' object as input

PCA <- dds %>% vst %>% plotPCA(intgroup=c("condition","celltype"), returnData=TRUE)
  
# get percentage explained variation per PC
percentVar <- round(100 * attr(PCA, "percentVar"), 1)
  
ggplot(PCA, aes(x=PC1, y=PC2, color=condition, shape=celltype)) +
  geom_point(size=3, alpha=0.5) +
  xlab(paste0("PC1: ",percentVar[1])) +
  ylab(paste0("PC2: ",percentVar[2]))

Homework

  1. Observe and interpret the PCA plot.
  2. Which factor explains the separation?
  3. Can we go ahead with analysis ?
  4. Optional: Repeat PCA analysis with more sophisticated normalization from the DESeq2 package “rlog()”. Notice that the output is not a matrix but a more complicated DESeqTransform object.
rld <- rlog(dds)  # more sophisticated log-transform to account for seq.depth=lib.size --> str(rld)

plotPCA(rld)
rld_PCA <- plotPCA(rld, intgroup=c("condition", "celltype"), returnData=TRUE)
percentVar <- round(100 * attr(rld_PCA, "percentVar"), 1)

ggplot(rld_PCA, aes(PC1, PC2, color=condition, shape=celltype)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1])) +
  ylab(paste0("PC2: ",percentVar[2])) +
  scale_colour_manual(values=cn_col)