Code
library(pheatmap)Thomas Manke
Define libraries required for this project
Steps: extract only p-values from full data, define proper names for rows, remove columns
# Beware of slow connections ! 
chip="http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv"
D=read.csv(chip, sep="\t", skip=1) # skip first line
### some filtering
rownames(D)=D[,1]                  # ORF-symbols in column 1, keep them as rownames 
ic=c(1:4,seq(6,230,by=2))          # define column indices to be excluded (only keep p-values)
P=D[,-ic]                          # new data-frame of p-values, exclude superfluous columns and ratios
####
# save cleaned data (only p-values) to file - for future use
# use write.table instead of write.csv - as I want to have tab-seperation
write.table(P, file="data/chip.tsv", sep="\t", quote=FALSE) Let’s start with a cleaned-up data set.
Now available at “https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv” (you might want to locate the raw file on https://github.com/maxplanck-ie/Rintro/).
Faster: have it local
Inspect data. Below are some ideas.
Try this iteratively on the console. So that you don’t clutter your notebook.
Query: How many genes (actually: intergenic regions) are in this data frame?
This might be slow.
Some genes (rows) have no associated TF. Some TF target no gene.
Query: What is the dimension of the filtered matrix B


Correlation Coefficient: 0.7312521.
FALSE  TRUE 
 6076   194   [1] "YBL072C"   "YBL087C"   "YBL092W"   "YBL093C"   "YBR084C-A" "YBR085W"  
  [7] "YBR116C"   "YBR117C"   "YBR118W"   "YBR126C"   "YBR181C"   "YBR188C"  
 [13] "YBR189W"   "YBR190W"   "YBR191W"   "YDL060W"   "YDL061C"   "YDL075W"  
 [19] "YDL076C"   "YDL082W"   "YDL083C"   "YDL130W"   "YDL133C-A" "YDL133W"  
 [25] "YDL136W"   "YDL184C"   "YDL191W"   "YDR024W"   "YDR025W"   "YDR064W"  
 [31] "YDR384C"   "YDR385W"   "YDR418W"   "YDR447C"   "YDR448W"   "YDR449C"  
 [37] "YDR450W"   "YDR470C"   "YDR471W"   "YDR500C"   "YDR501W"   "YEL008W"  
 [43] "YEL009C"   "YEL054C"   "YER056C-A" "YER074W"   "YER101C"   "YER102W"  
 [49] "YER116C"   "YER117W"   "YER130C"   "YER131W"   "YFL034C-A" "YFL034W"  
 [55] "YFR031C-A" "YFR032C-A" "YGL030W"   "YGL031C"   "YGL100W"   "YGL103W"  
 [61] "YGL104C"   "YGL123W"   "YGL124C"   "YGL135W"   "YGL136C"   "YGL147C"  
 [67] "YGL189C"   "YGR027C"   "YGR033C"   "YGR034W"   "YGR085C"   "YGR117C"  
 [73] "YGR118W"   "YGR148C"   "YGR149W"   "YGR213C"   "YGR214W"   "YHL001W"  
 [79] "YHL015W"   "YHL016C"   "YHL033C"   "YHR021C"   "YHR021W-A" "YHR141C"  
 [85] "YHR142W"   "YHR203C"   "YHR204W"   "YIL018W"   "YIL069C"   "YIL133C"  
 [91] "YIL148W"   "YIL149C"   "YJL134W"   "YJL135W"   "YJL136C"   "YJL177W"  
 [97] "YJL178C"   "YJL189W"   "YJL190C"   "YJL191W"   "YJL192C"   "YJR123W"  
[103] "YJR145C"   "YJR146W"   "YJR147W"   "YKL006C-A" "YKL006W"   "YKL156W"  
[109] "YKL180W"   "YKR057W"   "YKR094C"   "YKR095W"   "YLL043W"   "YLL045C"  
[115] "YLR029C"   "YLR030W"   "YLR047C"   "YLR048W"   "YLR061W"   "YLR074C"  
[121] "YLR075W"   "YLR166C"   "YLR167W"   "YLR183C"   "YLR184W"   "YLR185W"  
[127] "YLR264W"   "YLR287C-A" "YLR325C"   "YLR326W"   "YLR333C"   "YLR337C"  
[133] "YLR340W"   "YLR344W"   "YLR367W"   "YLR387C"   "YLR388W"   "YLR406C"  
[139] "YLR407W"   "YLR441C"   "YLR447C"   "YLR448W"   "YML024W"   "YML025C"  
[145] "YML026C"   "YML063W"   "YML064C"   "YML073C"   "YMR116C"   "YMR142C"  
[151] "YMR143W"   "YMR194W"   "YMR229C"   "YMR230W"   "YMR242C"   "YNL067W"  
[157] "YNL068C"   "YNL069C"   "YNL096C"   "YNL162W"   "YNL163C"   "YNL178W"  
[163] "YNL302C"   "YOL039W"   "YOL040C"   "YOL120C"   "YOL121C"   "YOL127W"  
[169] "YOL128C"   "YOR095C"   "YOR096W"   "YOR182C"   "YOR183W"   "YOR234C"  
[175] "YOR235W"   "YOR292C"   "YOR293W"   "YOR312C"   "YOR355W"   "YOR369C"  
[181] "YPL079W"   "YPL080C"   "YPL090C"   "YPL131W"   "YPL143W"   "YPL199C"  
[187] "YPL249C-A" "YPR042C"   "YPR043W"   "YPR080W"   "YPR102C"   "YPR103W"  
[193] "YPR131C"   "YPR132W"         
        FALSE TRUE
  FALSE  5989   87
  TRUE     75  119
    Fisher's Exact Test for Count Data
data:  tb
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  75.19957 157.98637
sample estimates:
odds ratio 
  108.6888 upsetR is better (more general) than Venn diagrams
---
title: "P01: Data Exploration; Regulatory Modules from ChIP-seq"
author: "Thomas Manke"
categories:
  - pheatmap
  - upSetR
  - ChIP data
  - Fisher Exact Test
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache=TRUE)
```
# Goal 
- Input:  genome-wide binding data for 113 transcription factors in yeast (ChIP-chip): --> 6000 x 113 matrix
- Analysis: identify combinations of TF
- Output: a reproducible analysis document (Rmd) and report (html)
- Reference: Lee et al. (2002) Transcriptional Regulatory Networks in Saccharomyces cerevisea.
- Data Set: chip="http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv"
# Setup
Define libraries required for this project
```{r}
library(pheatmap)
```
# Skipped: Pre-processing
Steps: extract only p-values from full data, define proper names for rows, remove columns 
```{r preprocessing, eval=FALSE}
# Beware of slow connections ! 
chip="http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv"
D=read.csv(chip, sep="\t", skip=1) # skip first line
### some filtering
rownames(D)=D[,1]                  # ORF-symbols in column 1, keep them as rownames 
ic=c(1:4,seq(6,230,by=2))          # define column indices to be excluded (only keep p-values)
P=D[,-ic]                          # new data-frame of p-values, exclude superfluous columns and ratios
####
# save cleaned data (only p-values) to file - for future use
# use write.table instead of write.csv - as I want to have tab-seperation
write.table(P, file="data/chip.tsv", sep="\t", quote=FALSE) 
```
# Read Clean Data
Let's start with a cleaned-up data set. 
Now available at "https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv"
(you might want to locate the *raw* file on https://github.com/maxplanck-ie/Rintro/).
Faster: have it local
```{r read}
fn="https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv"  # remote filename
#fn="data/chip.tsv"                                                            # local: same but faster
P=read.csv(file=fn, sep="\t")
```
# Tasks: Sanity Checks
Inspect data. Below are some ideas.   
Try this iteratively on the console. So that you don't clutter your notebook.
```{r sanity, eval=FALSE}
str(P)             # structure
dim(P)             # dimensions
colnames(P)        # TF
anyNA(P)           # are there any undefined values
sum(P<0 | P>1)     # how many "strange" p-values --> errors?
```
**Query**: How many genes (actually: intergenic regions) are in this data frame?
## Summary Plots
```{r summary_plots}
boxplot(P)         # distribution
#hist(as.matrix(P)) # hist() requires matrix as input 
```
## Show me all the data
This might be slow.
```{r pheatmap, cache=TRUE}
pheatmap(P, show_rownames=FALSE)
```
# Transformations
```{r transform}
pt = 1e-3           # define P-value threshold
B = P < pt          # B is a Boolean matrix (FALSE/TRUE)
```
# Filter
Some genes (rows) have no associated TF. Some TF target no gene.
```{r filter}
nt = rowSums(B)     # number of transcription factors targetting a gene
ng = colSums(B)     # number of genes targetted by TF
D = B[nt>1, ng>40]  # define filtered data
```
**Query**: What is the dimension of the filtered matrix B 
```{r filtered_dim, eval=FALSE, echo=FALSE}
dim(B)
dim(D)
```
# Show me the processed data
```{r plot}
pheatmap(D+0, show_rownames=FALSE) # D + 0 is my trick to convert logical matrix to numeric matrix (FALSE=0, TRUE=1) 
```
# Main Results: All Correlations
```{r correlations}
C = cor(D)
pheatmap(C)
```
#  Explore specific factor combinations
```{r smoothScatter}
t1="FHL1"
t2="RAP1"
plot( P[,t1], P[,t2] )                     # not very meaningful
smoothScatter( log(P[,t1]), log(P[,t2]) )  # scatterplot of log(p-values) for two TF 
abline(h=log(pt), v=log(pt), col="red")    # show thresholds
```
Correlation Coefficient: `r C[t1,t2]`. 
# Contigency tables
```{r contTables}
table(B[,t1])               # 194 genes bound by t1="FHL1"
names(which(B[,t1]))        # their names
tb=table(B[,t1],B[,t2])     # contingency table: 119 genes bound by FHL1 and RAP1
tb
fisher.test(tb)             # a test at last: the overlap is highly unexpected
```
# Investigating Multiple Overlaps
upsetR is better (more general) than Venn diagrams
```{r upsetR}
library(UpSetR)
upset( as.data.frame(B+0), sets = c("FHL1", "RAP1", "MCM1", "STE12") )
```
# Review
* Data munging: typically requires basic processing and filtering steps ==> use clean data
* data exploration and visualization: heatmap(), smoothScatter()
* quantitative data descriptions: cor(), table()
* some tests: fisher.test()
* R notebooks: run code chunks interactively, or generate full reports