P01: Data Exploration; Regulatory Modules from ChIP-seq

pheatmap
upSetR
ChIP data
Fisher Exact Test
Author

Thomas Manke

Goal

  • Input: genome-wide binding data for 113 transcription factors in yeast (ChIP-chip): –> 6000 x 113 matrix
  • Analysis: identify combinations of TF
  • Output: a reproducible analysis document (Rmd) and report (html)
  • Reference: Lee et al. (2002) Transcriptional Regulatory Networks in Saccharomyces cerevisea.
  • Data Set: chip=“http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv”

Setup

Define libraries required for this project

Code
library(pheatmap)

Skipped: Pre-processing

Steps: extract only p-values from full data, define proper names for rows, remove columns

Code
# Beware of slow connections ! 
chip="http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv"
D=read.csv(chip, sep="\t", skip=1) # skip first line

### some filtering
rownames(D)=D[,1]                  # ORF-symbols in column 1, keep them as rownames 
ic=c(1:4,seq(6,230,by=2))          # define column indices to be excluded (only keep p-values)
P=D[,-ic]                          # new data-frame of p-values, exclude superfluous columns and ratios
####

# save cleaned data (only p-values) to file - for future use
# use write.table instead of write.csv - as I want to have tab-seperation
write.table(P, file="data/chip.tsv", sep="\t", quote=FALSE) 

Read Clean Data

Let’s start with a cleaned-up data set.

Now available at “https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv” (you might want to locate the raw file on https://github.com/maxplanck-ie/Rintro/).

Faster: have it local

Code
fn="https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv"  # remote filename
#fn="data/chip.tsv"                                                            # local: same but faster
P=read.csv(file=fn, sep="\t")

Tasks: Sanity Checks

Inspect data. Below are some ideas.
Try this iteratively on the console. So that you don’t clutter your notebook.

Code
str(P)             # structure
dim(P)             # dimensions
colnames(P)        # TF
anyNA(P)           # are there any undefined values
sum(P<0 | P>1)     # how many "strange" p-values --> errors?

Query: How many genes (actually: intergenic regions) are in this data frame?

Summary Plots

Code
boxplot(P)         # distribution

Code
#hist(as.matrix(P)) # hist() requires matrix as input 

Show me all the data

This might be slow.

Code
pheatmap(P, show_rownames=FALSE)

Transformations

Code
pt = 1e-3           # define P-value threshold
B = P < pt          # B is a Boolean matrix (FALSE/TRUE)

Filter

Some genes (rows) have no associated TF. Some TF target no gene.

Code
nt = rowSums(B)     # number of transcription factors targetting a gene
ng = colSums(B)     # number of genes targetted by TF
D = B[nt>1, ng>40]  # define filtered data

Query: What is the dimension of the filtered matrix B

Show me the processed data

Code
pheatmap(D+0, show_rownames=FALSE) # D + 0 is my trick to convert logical matrix to numeric matrix (FALSE=0, TRUE=1) 

Main Results: All Correlations

Code
C = cor(D)
pheatmap(C)

Explore specific factor combinations

Code
t1="FHL1"
t2="RAP1"
plot( P[,t1], P[,t2] )                     # not very meaningful

Code
smoothScatter( log(P[,t1]), log(P[,t2]) )  # scatterplot of log(p-values) for two TF 
abline(h=log(pt), v=log(pt), col="red")    # show thresholds

Correlation Coefficient: 0.7312521.

Contigency tables

Code
table(B[,t1])               # 194 genes bound by t1="FHL1"

FALSE  TRUE 
 6076   194 
Code
names(which(B[,t1]))        # their names
  [1] "YBL072C"   "YBL087C"   "YBL092W"   "YBL093C"   "YBR084C-A" "YBR085W"  
  [7] "YBR116C"   "YBR117C"   "YBR118W"   "YBR126C"   "YBR181C"   "YBR188C"  
 [13] "YBR189W"   "YBR190W"   "YBR191W"   "YDL060W"   "YDL061C"   "YDL075W"  
 [19] "YDL076C"   "YDL082W"   "YDL083C"   "YDL130W"   "YDL133C-A" "YDL133W"  
 [25] "YDL136W"   "YDL184C"   "YDL191W"   "YDR024W"   "YDR025W"   "YDR064W"  
 [31] "YDR384C"   "YDR385W"   "YDR418W"   "YDR447C"   "YDR448W"   "YDR449C"  
 [37] "YDR450W"   "YDR470C"   "YDR471W"   "YDR500C"   "YDR501W"   "YEL008W"  
 [43] "YEL009C"   "YEL054C"   "YER056C-A" "YER074W"   "YER101C"   "YER102W"  
 [49] "YER116C"   "YER117W"   "YER130C"   "YER131W"   "YFL034C-A" "YFL034W"  
 [55] "YFR031C-A" "YFR032C-A" "YGL030W"   "YGL031C"   "YGL100W"   "YGL103W"  
 [61] "YGL104C"   "YGL123W"   "YGL124C"   "YGL135W"   "YGL136C"   "YGL147C"  
 [67] "YGL189C"   "YGR027C"   "YGR033C"   "YGR034W"   "YGR085C"   "YGR117C"  
 [73] "YGR118W"   "YGR148C"   "YGR149W"   "YGR213C"   "YGR214W"   "YHL001W"  
 [79] "YHL015W"   "YHL016C"   "YHL033C"   "YHR021C"   "YHR021W-A" "YHR141C"  
 [85] "YHR142W"   "YHR203C"   "YHR204W"   "YIL018W"   "YIL069C"   "YIL133C"  
 [91] "YIL148W"   "YIL149C"   "YJL134W"   "YJL135W"   "YJL136C"   "YJL177W"  
 [97] "YJL178C"   "YJL189W"   "YJL190C"   "YJL191W"   "YJL192C"   "YJR123W"  
[103] "YJR145C"   "YJR146W"   "YJR147W"   "YKL006C-A" "YKL006W"   "YKL156W"  
[109] "YKL180W"   "YKR057W"   "YKR094C"   "YKR095W"   "YLL043W"   "YLL045C"  
[115] "YLR029C"   "YLR030W"   "YLR047C"   "YLR048W"   "YLR061W"   "YLR074C"  
[121] "YLR075W"   "YLR166C"   "YLR167W"   "YLR183C"   "YLR184W"   "YLR185W"  
[127] "YLR264W"   "YLR287C-A" "YLR325C"   "YLR326W"   "YLR333C"   "YLR337C"  
[133] "YLR340W"   "YLR344W"   "YLR367W"   "YLR387C"   "YLR388W"   "YLR406C"  
[139] "YLR407W"   "YLR441C"   "YLR447C"   "YLR448W"   "YML024W"   "YML025C"  
[145] "YML026C"   "YML063W"   "YML064C"   "YML073C"   "YMR116C"   "YMR142C"  
[151] "YMR143W"   "YMR194W"   "YMR229C"   "YMR230W"   "YMR242C"   "YNL067W"  
[157] "YNL068C"   "YNL069C"   "YNL096C"   "YNL162W"   "YNL163C"   "YNL178W"  
[163] "YNL302C"   "YOL039W"   "YOL040C"   "YOL120C"   "YOL121C"   "YOL127W"  
[169] "YOL128C"   "YOR095C"   "YOR096W"   "YOR182C"   "YOR183W"   "YOR234C"  
[175] "YOR235W"   "YOR292C"   "YOR293W"   "YOR312C"   "YOR355W"   "YOR369C"  
[181] "YPL079W"   "YPL080C"   "YPL090C"   "YPL131W"   "YPL143W"   "YPL199C"  
[187] "YPL249C-A" "YPR042C"   "YPR043W"   "YPR080W"   "YPR102C"   "YPR103W"  
[193] "YPR131C"   "YPR132W"  
Code
tb=table(B[,t1],B[,t2])     # contingency table: 119 genes bound by FHL1 and RAP1
tb
       
        FALSE TRUE
  FALSE  5989   87
  TRUE     75  119
Code
fisher.test(tb)             # a test at last: the overlap is highly unexpected

    Fisher's Exact Test for Count Data

data:  tb
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  75.19957 157.98637
sample estimates:
odds ratio 
  108.6888 

Investigating Multiple Overlaps

upsetR is better (more general) than Venn diagrams

Code
library(UpSetR)
upset( as.data.frame(B+0), sets = c("FHL1", "RAP1", "MCM1", "STE12") )

Review

  • Data munging: typically requires basic processing and filtering steps ==> use clean data
  • data exploration and visualization: heatmap(), smoothScatter()
  • quantitative data descriptions: cor(), table()
  • some tests: fisher.test()
  • R notebooks: run code chunks interactively, or generate full reports