Slides are here.
## [1] '5.1.0'
Lets have a look at the UMAP embedding of the full dataset processed in a standard way, ignoring any possible batch effects. Compare the cell separation by sequencing technology and by cell type.
We have subset and pre-processed the full dataset for you. You can
download it from zenodo
.
fdest <- "datasets/preprocessed_rds/panc_sub_processed.RDS"
if (!file.exists(fdest)) {
# utils::download.file('https://zenodo.org/record/7891484/files/panc_sub_processed.RDS?download=1',
# destfile = fdest, method = 'curl')
file.copy("/scratch/local/rseurat/datasets/preprocessed_rds/panc_sub_processed.RDS",
fdest, overwrite = FALSE)
}
panc_sub <- UpdateSeuratObject(readRDS(fdest)) ## Seurat v5 would require this
đ§â¨ Poll:
Which sequencing technologies are retained in the subset dataset?
How many cells were sequenced with smartseq2?
Plot UMAP embedding for the subset dataset. Inspect the cell separation by cell type and by sequencing technology.
â¨đĽ Exercise: plot gene expression
In the previous course units, you have learned to call differentially expressed genes with Seurat. In this task, we ask you to: - call genes differentially expressed between cells sequenced with the smartseq2 technology and those with the celseq technology - plot a violin plot for the top gene - plot a feature plot for the top gene
đ§â¨ Poll: What is the name of the top DE gene ?
Hereâs our proposed solution:
smartseq2.markers <- FindMarkers(panc_sub, ident.1 = "smartseq2", ident.2 = "celseq",
only.pos = TRUE)
VlnPlot(panc_sub, features = "LRRC75A-AS1", group.by = "celltype", split.by = "tech",
split.plot = TRUE)
This gene appears to be mostly expressed in cells sequenced with the smartseq2 technology in multiple cell populations.
Letâs plot itâs expression on the lowD embedding:
p1 <- FeaturePlot(panc_sub, features = "LRRC75A-AS1")
p2 <- DimPlot(panc_sub, reduction = "umap", group.by = "tech", cols = scales::alpha(my_cols,
0.3))
p1 + p2
On the other hand, what does the expression of known celltype markers look like ? Letâs plot the expression of the alpha cell marker gen âGCGâ.
Letâs now explore some options of mitigating batch effects:
This method is quite straightforward. It will account for batch
effects by correcting our principal components, its functionality is
conveniently wrapped under function RunHarmony
. Just keep
in mind that it will create a new âharmonyâ reduction thatâs the one you
want to use downstreamâŚ
The next code block illustrates the process. Unfortunately, we wonât apply this methodology here. But it should be straightforward for you to use it when/ if the need arises. For further details, check the package vignette (tutorial).
# Load DATA
obj.Name1 <- CreateSeuratObject(counts = counts.Name1, project = "Name1")
obj.Name2 <- CreateSeuratObject(counts = counts.Name2, project = "Name2")
# Merge DATA
object <- merge(x = obj.Name1, y = obj.Name2, add.cell.ids = c("Name1",
"Name2"), merge.data = F)
## This will give you a 'orig.ident' meta-data column with 'Name1' or
## 'Name2', allowing you to know from which experiment came each
## cell.
# ... QC, etc.
harmony::RunHarmony(object, h.group.by.vars = "orig.ident")
FindNeighbors(object, reduction = "harmony")
FindClusters(object)
RunUMAP(object, reduction = "harmony")
# ... DE, etc.
Thereâs a few more slides, here. For further details, check the original article.
Prior to the integration, we want to normalize each dataset separately to be integrated.
# split the dataset into a list of seurat objects
panc.list <- SplitObject(panc_sub)
# normalize and identify variable features for each dataset
# independently
panc.list <- lapply(X = panc.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# select features that are repeatedly variable across datasets for
# integration
features <- SelectIntegrationFeatures(object.list = panc.list)
anchors <- FindIntegrationAnchors(object.list = panc.list, anchor.features = features,
normalization.method = "LogNormalize")
# this command creates an 'integrated' data assay
panc.combined <- IntegrateData(anchorset = anchors, normalization.method = "LogNormalize")
Letâs clean up some redundant objects from memory.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 8549044 456.6 17055038 910.9 17055038 910.9
## Vcells 193155952 1473.7 502811043 3836.2 501544092 3826.5
Letâs have a brief look at the panc.combined dataset - a new Assay has been created by the integration procedure.
đ§â¨ Poll: What is the new assay called? Hint: you can access assays of a Seurat object with
Assays()
.
After the integration, data scaling on the new assay is necessary, as well as calculation of PCA and UMAP embeddings.
stopifnot(DefaultAssay(panc.combined) == "integrated")
# Run the standard workflow for visualization and clustering
panc.combined <- ScaleData(panc.combined, verbose = FALSE)
panc.combined <- RunPCA(panc.combined, verbose = FALSE)
panc.combined <- RunUMAP(panc.combined, reduction = "pca", dims = 1:30)
Letâs have a look at the distribution of technology, and celltype, labels in lowD after integration.
We can also compare the distribution of cell tech labels before and after integration.
â¨đĽ Exercise: plot gene expression after integration
You have previously plotted the expression of a DE gene between the sequencing technologies on the dataset before integration. Repeat the violin and the feature plot for the dataset after integration. What has changed, what hasnât, and why?
Hereâs our proposed solution:
The expression of this gene hasnât changed - as it is not present in
the âintegratedâ assay, the VlnPlot
function has fetched
itâs expression value from the RNA assay, which is not modified by the
integration.
Letâs plot itâs expression on the lowD embedding:
p1 <- FeaturePlot(panc.combined, features = "LRRC75A-AS1")
p2 <- DimPlot(panc.combined, reduction = "umap", group.by = "tech", cols = scales::alpha(my_cols,
0.3))
p1 + p2
We can now appreciate that cells highly expressing this gene are distributed over all clusters. Note that the mixing with the cells negative for this marker is imperfect, but such is the result of the integration for this dataset.
Letâs ascertain ourselves that the expression of the celltype markers is behaving as expected after integration. For this purpose, letâs plot the expression of the alpha cell marker gen âGCGâ.
BĂźttner, M., Miao, Z., Wolf, F.A. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods 16, 43â49 (2019). https://doi.org/10.1038/s41592-018-0254-1
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019 Jun 13;177(7):1888-1902.e21. doi: 10.1016/j.cell.2019.05.031. Epub 2019 Jun 6. PMID: 31178118; PMCID: PMC6687398
This is how to reproduce generation of the UMAP embedding for the full dataset and further the generation of the subset dataset used in this course unit:
library(SeuratData)
panc <- LoadData("panc8", type = "default")
table(panc$dataset)
panc <- NormalizeData(panc)
panc <- FindVariableFeatures(panc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(panc)
panc <- ScaleData(panc, features = all.genes)
panc <- RunPCA(panc, features = VariableFeatures(object = panc))
panc <- RunUMAP(panc, dims = 1:15)
p1<-DimPlot(panc, reduction = "umap",label=TRUE)
p2<-DimPlot(panc, reduction = "umap",group.by="celltype")
p1+p2
panc_sub <- subset(x=panc,subset = dataset == c("celseq","celseq2","smartseq2"))
panc_sub <- NormalizeData(panc_sub)
panc_sub <- FindVariableFeatures(panc_sub, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(panc_sub)
panc_sub <- ScaleData(panc_sub, features = all.genes)
panc_sub <- RunPCA(panc_sub, features = VariableFeatures(object = panc_sub))
panc_sub <- RunUMAP(panc_sub, dims = 1:15)
saveRDS(panc_sub,"panc_sub_processed.RDS")