suppressPackageStartupMessages({
library("tidyverse")
library("DESeq2")
library("pheatmap")
})
# define some convenience function for plotting
source('funcs.R')
Remember how important sessionInfo() is:
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/runner/micromamba/envs/Rdeseq2/lib/libopenblasp-r0.3.27.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 lubridate_1.9.3
## [3] forcats_1.0.0 stringr_1.5.1
## [5] dplyr_1.1.4 purrr_1.0.2
## [7] readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.2.1 ggplot2_3.5.1
## [11] tidyverse_2.0.0 DESeq2_1.38.0
## [13] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [15] MatrixGenerics_1.10.0 matrixStats_1.3.0
## [17] GenomicRanges_1.50.0 GenomeInfoDb_1.34.9
## [19] IRanges_2.32.0 S4Vectors_0.36.0
## [21] BiocGenerics_0.44.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fontawesome_0.5.2 bit64_4.0.5
## [4] RColorBrewer_1.1-3 httr_1.4.7 tools_4.2.3
## [7] bslib_0.7.0 utf8_1.2.4 R6_2.5.1
## [10] KernSmooth_2.23-24 DBI_1.2.2 colorspace_2.1-0
## [13] withr_3.0.0 tidyselect_1.2.1 bit_4.0.5
## [16] compiler_4.2.3 cli_3.6.2 DelayedArray_0.24.0
## [19] labeling_0.4.3 sass_0.4.9 scales_1.3.0
## [22] genefilter_1.80.0 digest_0.6.35 rmarkdown_2.27
## [25] XVector_0.38.0 pkgconfig_2.0.3 htmltools_0.5.8.1
## [28] fastmap_1.2.0 highr_0.10 rlang_1.1.3
## [31] RSQLite_2.3.4 jquerylib_0.1.4 generics_0.1.3
## [34] farver_2.1.2 jsonlite_1.8.8 BiocParallel_1.32.5
## [37] vroom_1.6.5 RCurl_1.98-1.14 magrittr_2.0.3
## [40] GenomeInfoDbData_1.2.9 Matrix_1.6-5 Rcpp_1.0.12
## [43] munsell_0.5.1 fansi_1.0.6 lifecycle_1.0.4
## [46] stringi_1.8.4 yaml_2.3.8 zlibbioc_1.44.0
## [49] grid_4.2.3 blob_1.2.4 parallel_4.2.3
## [52] crayon_1.5.2 lattice_0.22-6 Biostrings_2.66.0
## [55] splines_4.2.3 annotate_1.76.0 hms_1.1.3
## [58] KEGGREST_1.38.0 locfit_1.5-9.9 knitr_1.46
## [61] pillar_1.9.0 geneplotter_1.76.0 codetools_0.2-20
## [64] XML_3.99-0.16.1 glue_1.7.0 evaluate_0.23
## [67] png_0.1-8 vctrs_0.6.5 tzdb_0.4.0
## [70] gtable_0.3.5 cachem_1.1.0 xfun_0.44
## [73] xtable_1.8-4 survival_3.6-4 AnnotationDbi_1.60.0
## [76] memoise_2.0.1 timechange_0.3.0
data <- readr::read_tsv("data/myeloma/myeloma_counts.tsv")
## Rows: 57905 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gene_id
## dbl (18): JJ_CTRL_1, BM_CTRL_2, BM_CTRL_3, BM_AMIL_1, BM_AMIL_2, BM_AMIL_3, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data %>% head()
metadata <- readr::read_tsv("data/myeloma/myeloma_meta.tsv")
## Rows: 18 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): sample, celltype, condition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
metadata <- metadata %>% mutate_if(is.character,as.factor)
metadata %>% head()
my_design <- ~ celltype + condition
dds = data + metadata + design
dds <- DESeqDataSetFromMatrix(
countData=data %>% column_to_rownames("gene_id"),
colData=metadata,
design=my_design)
## converting counts to integer mode
class(dds)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
dim(dds)
## [1] 57905 18
Previously you noticed that the PCA shows a problem with our data.
dds %>% PCA_tm
# DESeq2::plotPCA does not show this very clearly
# dds %>% rlog %>% plotPCA(intgroup=c("celltype", "condition")) # rlog slow
# dds %>% vst %>% plotPCA(intgroup=c("celltype", "condition")) # vst faster
Poll 2.1: What should we do next ?
Pre-processing is easier if done at the level of the dds object - and more consistent to keep data and metadata in sync.
It also tends to be iterative.
Task (3 min): Filter the problematic samples (columns 1 and 7) from dds object and produce a new PCA plot.
Hint: Although the dds
object is a more complex data
class it also has dimensions assigned – check dim(dds)
–
and samples can be un/selected like columns in a standard
dataframes.
dds_clean <- dds[, -c(1,7)]
dds_clean %>% PCA_tm
Task (5 min) Filter out the genes from dds_clean that have a total amount of only zero or 1 counts. Based on this filtering define a new data set and inspect its dimensions
Hint 1: Have a look at the output of
dds_clean %>% counts %>% rowSums() %>% head()
Poll 2.2 How many samples and genes are left in the filtered data set?
Remember: factors are used to represent categorical data !
Internally they are stored as integers and per default they are sorted alphanumerically. It may be more convenient to sort them by some user defined order - this is called relevelling
The values and default level order of factor “condition”:
#initial factor arrangement:
dds_clean$condition
## [1] DMSO DMSO AMIL AMIL AMIL DMSO DMSO AMIL AMIL AMIL TG TG TG TG TG
## [16] TG
## Levels: AMIL DMSO TG
Task (5 min) Reorder the condition with ‘DMSO’ as baseline.
Hint: Redefine dds_clean$condition
using the
relevel
function
This should be the order after releveling:
## [1] DMSO DMSO AMIL AMIL AMIL DMSO DMSO AMIL AMIL AMIL TG TG TG TG TG
## [16] TG
## Levels: DMSO AMIL TG
Notice: Per default DESeq2 will make comparisons with respect to the 1st level, unless specified more explicitly using “contrasts” (see tomorrow).
We have now covered the most important preprocessing steps:
counts()
– not yet normalized. data frame
of integerscolData()
– added data on samples. data
frame of numerical or categorical variablesdesign()
– variables to be modeled. formula
will be converted to design matrix with dummy variablesThis is before any statistical analysis ! A real-world analysis would likely be iterative
Message: Document your decisions, parameters, workflows and software versions carefully - use scripts or notebooks !
meta = dds_clean %>% colData %>% data.frame
( design_table <- meta %>% select(-sample) %>% table )
## condition
## celltype DMSO AMIL TG
## BM 2 3 3
## JJ 2 3 3
design_table %>% pheatmap(cluster_cols=F, cluster_rows=F, display_numbers=T, fontsize_number=10)
Notice: the original design may have changed as a result of sample filtering
To highlight the goal of the subsequent analysis it will be useful to focus on a single gene and highlight the effects of the different factor level combinations (!) on the mean expression
gene_select <- "ENSG00000003509"
gene <- dds_clean %>% counts %>% as_tibble(rownames='gene_id') %>%
filter(gene_id==gene_select) %>%
pivot_longer(-gene_id, names_to = "sample", values_to = "count") %>% # convert to long
inner_join(meta, by="sample") # add sample metadata from meta
gene %>%
ggplot(aes(x=condition, y=count, col=celltype, group=celltype)) +
geom_point(size=1) +
stat_summary(fun=mean, geom="line",linewidth=2)
In the following we will assess observed differences in gene expressions for different factor levels and contrasts.
The key idea is that some variations can be modeled (e.g. by the mean expression values in each cell shown as lines) , while other variability remains unaccounted for.
The combination of explainable changes and remaining variability will determine whether we consider a gene as differential for given contrasts of interest.
Everything we did until now was setup - and before any normalization, parameter estimates or statistical testing.
Luckily for much of the following DESeq2 provides a convenient wrapper function which takes a dds object, modifies it and returns a new dds object. To save memory we often just overwrite the old objects
dds_clean <- DESeq(dds_clean)
dds_clean <- DESeq(dds_clean)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
As you can tell from the verbose output, several distinct steps have been performed that are wrapped together in the DESeq() function. The 3 main steps are of the workflow are:
We will briefly describe them below.
As mentioned yesterday, raw counts cannot be directly used to compare samples. To allow comparison we need to normalize. Note that this is also needed for exploratory analysis and visualisation.
A couple of reasons for normalization include:
sequencing depth
gene length
compositional bias
A number of commonly used normalization techniques:
method | takes into account: |
---|---|
CPM | sequencing depth |
RPKM | sequencing depth and gene length |
DESeq2 size factor | sequencing depth and RNA composition |
… and many others exist.
DESeq2 Workflow for the ‘median of ratios’:
Task (5 min): Plot a heatmap of the first 10 genes (rows) of the counts matrix and add the metadata after DESeq was run as above.
Hint: Remember that both the data and the metadata can be obtained from the dds object (counts(), colData()), but you might want to convert the output of colData() to a data frame, such that it can be passed as ‘annotation’ to pheatmap
Y <- dds_clean %>% counts %>% head(10)
meta <- dds_clean %>% colData %>% data.frame %>% select(-sample)
pheatmap(Y, annotation=meta)
Notice:
Task (2 min): The counts() function has a parameter (normalized=TRUE) to extract the normalized counts after DESeq2 is run. Modify the above to plot the same heatmap with normalized counts and observe the differences.
Remember that the size factor for each sample (meta$sizeFactor) is a sophisticated way to account for sequencing depth and underpins the normalization used above. It is good practice to inspect the size factors and try to detect outlier samples
Task (5 mnin):
meta
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6668 0.8884 1.1038 1.0350 1.1731 1.4179
Observation:. No outliers. sizeFactors behaves as expected.
Task (5 min): Study the effect of sizeFactor normalization for a single gene. The code below is for unnormalized count data - with a simple modification you can obtain the normalized data.
gene <- dds_clean %>% counts() %>%
as_tibble(rownames='gene_id') %>% # tibble does not like rownames
filter(gene_id=="ENSG00000003509") %>% # pick gene (previously slice 42)
pivot_longer(-gene_id, names_to = "sample", values_to = "count") %>% # convert to long
inner_join(meta, by="sample")
gene %>%
ggplot(aes(x=condition, y=count, col=condition,shape=celltype)) +
geom_point(size=3)
Discussion: What do you observe?
Poll 2.3 : what kind of distribution would be the most appropriate for this data ?
To determine if the expression of a gene differs between different factor levels, we will also need to estimate its variance.
In an ideal world we would have many replicates (per level combination) to estimate this variance.
In RNA-seq data we tend to only have a few replicates (\(n \propto 3\)) and we need strong assumptions on the expected distributions.
More specifically, it would be great if the variance could be estimated from the mean gene expression \(\mu\) (c.f. Poisson)
This hope drives much of the work in gene expression analysis, and the discussion below.
Let’s look at the empirical variance-mean relationship for all genes. To keep things simple let’s focus on 3 replicate samples from the same celltype (BM) and condition (AMIL):
# pick 3 replicates (3,4,5) from same celltype-condition combination (BM-AMIL)
dds_clean[, 3:5] %>% var_mu_plot
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 4709 rows containing non-finite outside the scale range
## (`stat_density2d()`).
Notice:
These properties are commonly observed for all RNA-seq experiments and these observations have guided the choice of distributions to be used for modelling of sequence counts.
Upshot: RNA-seq data (gene expression counts) are over-dispersed
\[ Var = \mu + \alpha \mu^2 \\ \]
and can be modeled with a negative binomial distribution
\[ Y \propto NB(\mu, \alpha) \\ \log \mu = X \beta \] with gene-specific dispersion parameter \(\alpha\).
Dispersion \(\alpha\) should account for the residual/unmodelled variation - assumed to be the same for all samples.
Empirically it has been found that \(\alpha\) may also depend on \(\mu\)
\[ \alpha = \alpha_0 + \alpha_1 / \mu \]
The estimation of \(\alpha(\mu)\) is part of the DESeq2 workflow and can be illustrated as follows
dds_clean %>% plotDispEsts
The black dots denote gene-wise estimates for dispersion \(\alpha\). For small number of samples, these dispersion estimates fluctuate strongly, and sometimes the estimates fail (\(NA \to 10^{-8}\))
The red line denotes an empirically trend for all genes there \(\alpha=\alpha(\mu)\) that we seek to exploit. In DESeq2 parlance, we share information across genes for a more robust (and opinionated) estimates of dispersion and variance.
The blue dots denote the adjusted estimates of \(\alpha\). Often the dispersion is correctecd upwards towards the empirical trend line. But genes with very high dispersion will not be adjusted - as this might overly reduce their variance (and increase the false positive).
Observations:
DESeq2 provides a simple tools to generate synthetic RNA-seq data from a Negative Binomial distribution with known dispersion relation – default: \(\alpha(\mu) = 0.1 + 4 / \mu\).
To illustrate the challenges in estimating dispersion parameter, we simulate expression data with \(n=5000\) genes and \(m=6\) samples in two condition (A,B) with known \(\mu\) and \(\alpha\).
set.seed(42) # for reproducibility
# simulate RNA-seq data from NB with two different conditions (A,B) | run DESEQ
dds_sim <- makeExampleDESeqDataSet(n=5000, m=6, betaSD = 2) %>% DESeq(quiet=TRUE)
dds_sim %>% PCA_tm(groups=c("condition")) + ggtitle("Synthetic Data")
# dds_sim %>% plotDispEsts() # does not produce ggplot
p1 <- dds_sim %>% plotDisp + ggtitle("design: ~ 1 + condition")
p2 <- dds_sim %>% plotDispErr
# poor design: does not account for factor 'condition'
design(dds_sim) <- ~ 1
# sample swaps
# dds_sim$condition = dds_sim$condition %>% sample
dds_sim <- dds_sim %>% DESeq(quiet=TRUE) # rerun DESeq
p3 <- dds_sim %>% plotDisp + ggtitle("design: ~ 1")
p4 <- dds_sim %>% plotDispErr
# requires gridExtra package
grid.arrange(p1, p3, p2, p4, ncol=2)
Notice (left):
MLE estimates of dispersion parameters are difficult; both technically (iterations) and statistically (few samples, low counts). Therefore large uncertainty in those estimates is to be expected \(\to\) need to moderate
For a given sample number and reasonable designs, dispersion estimates tend to be more reliable for large baseMean. They will become even more reliable for larger sample number
The main concern is an underestimate of dispersion and variance \(\to\) overestimate of significance \(\to\) increase in false positives differential calls. Therefore the final dispersion estimates (blue) have been biased towards higher values, given the trend over all genes.
Notice (right):
Poor design choices (unaccounted factors/batch effects, sample swaps, …) account poorly for the observed variance and inflate “residual variation”. This can result in poor dispersion models and poor estimates.
DESeq assumes that the gene-specific dispersion \(\alpha_i\) is constant for each gene \(i\), after accounting for all sample-specific effects.
Pay attention: does the fit look reasonable? Poor fits may be related to poor designs or poor filtering of lowly expressed genes
Once the dispersion parameters (for each gene) are obtained, the “best” parameters \(\beta\) can be estimated by Maximum Likelihood.
For Normal distributions \(y \sim N(\mu, \sigma^2)\) this is easy and corresponds to the least squares formula for both \(\mu\) and \(\sigma^2\), But recall that we assume that the expression counts follow a negative bionmial distribution
\[ y \sim NB(\mu, \alpha) \\ \log \mu = X \beta \\ \alpha = \alpha_0 + \alpha_1 / \mu \]
In general the \(\beta\) will be obtained through an iterative maximization processes. Finally we get a fit for each gene, and a Wald test can be used to test for significance of coefficients of interest.
Notice: In our factorial analysis, \(\beta\) summarizes multiple parameters that may correspond to the different cell means or the differences with respect to some reference level - depending on our encoding of the model matrix.
By default DESeq tests the difference of the last coefficient and its deviation from zero, but in general we will be interested in specific differences (contrasts, linear combinations) among the \(\beta\) and test their deviation from zero
\[ c^T \beta = 0 \]
To be continued - tomorrow.
To obtain the test results for all genes, DESeq2 offers a convenient function: results().
Task (2 min) Please run results() on the dds_clean object and have a look at the result.
( dds_res <- dds_clean %>% results )
## log2 fold change (MLE): condition TG vs DMSO
## Wald test p-value: condition TG vs DMSO
## DataFrame with 27351 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000419 2285.01560 0.317449 0.1790600 1.772863 7.62514e-02
## ENSG00000000457 775.42002 0.334962 0.0873551 3.834485 1.25827e-04
## ENSG00000000460 1300.43696 0.963262 0.0939353 10.254526 1.12928e-24
## ENSG00000000938 2.45443 -0.325046 0.7933991 -0.409688 6.82035e-01
## ENSG00000000971 1626.92877 0.309393 0.1271822 2.432678 1.49876e-02
## ... ... ... ... ... ...
## ENSG00000273486 40.217168 0.0672792 0.204936 0.328295 0.742689
## ENSG00000273487 0.162242 0.5041292 3.950877 0.127599 0.898466
## ENSG00000273488 13.780769 0.4674108 0.375660 1.244237 0.213412
## ENSG00000273489 1.064893 -1.1261042 1.341504 -0.839434 0.401226
## ENSG00000273492 0.856644 0.9590323 1.652038 0.580515 0.561568
## padj
## <numeric>
## ENSG00000000419 1.59065e-01
## ENSG00000000457 5.89362e-04
## ENSG00000000460 2.57159e-23
## ENSG00000000938 7.98272e-01
## ENSG00000000971 4.10219e-02
## ... ...
## ENSG00000273486 0.840324
## ENSG00000273487 NA
## ENSG00000273488 0.356482
## ENSG00000273489 0.565374
## ENSG00000273492 0.707109
Poll 2.4: What does a positive log2 foldChange and a significant padj value reflect ?
dds_res %>% subset(padj < 0.05) %>% nrow()
## [1] 8532
dds_res %>%
data.frame() %>%
subset(padj < 0.05) %>%
arrange(-abs(log2FoldChange)) %>% head(20)
Task (5 min): MA plots Plot the log2Foldchange from the results DESeq results object against the logarithm of the baseMean.
Optional: highlight (with different colour) the differentially expressed genes based on some threshold on p-value, log2FoldChange etc. This is a very common plots, so DESeq also has a function for this: plotMA().
Task (5 min): Volcano plots Plot the adjusted p-value against estimate effect (log2FoldChange).
dds_res %>%
data.frame() %>%
mutate(differential = padj < 0.05) %>%
ggplot(aes(x=log2FoldChange, y=-log10(padj), colour=differential)) +
geom_point(alpha=0.5) +
xlim(c(-2,2)) +
ylim(c(0, 10))
## Warning: Removed 8471 rows containing missing values or values outside the scale range
## (`geom_point()`).
# scale_color_manual(values=c("black", "red")) +
The results table is quite standardized and serves as input to many other tools.
Task (3 min): Please export your results table to a csv / tsv file.
hint: write_tsv, write_csv(), write.table(), write_delim() ,…
dds_res %>% data.frame %>%
# keep rownames as explicit column
rownames_to_column("gene_id") %>%
write_csv(file = "DMSO_vs_TG_BMcells.csv")
For example, you might want to upload the file to our (MPI-IE internal) shiny apps (https://rsconnect.ie-freiburg.mpg.de/shiny/)