Dataset collection and processing · 2015/06/03
SI Methods
Dataset collection and processing
ENCODE ChIP-seq data were downloaded from the UCSC Genome Browser, comprising 686
profiles for 159 DNA-binding proteins (1). TF recognition motifs were collected from
HTSELEX (2), JASPAR (3) and TRANSFAC 6.0 (4), giving 853 recognition motifs
for 505 TFs. We also collected 172 recognition motifs for 133 RNA-binding proteins (RBPs)
(5). TCGA datasets of gene expression, copy number alteration (CNA) and DNA methylation
were downloaded from the TCGA Data Portal on 07/27/2014. Standardized somatic mutation
data were downloaded from Broad Firehose on 10/23/2014. Because we observed that most CpG
islands lie within 1000 nt upstream or downstream of gene transcription start sites
(TSSs), we averaged the methylation array probes within +/-1000 nt of each TSS to obtain the
promoter methylation level (SI Appendix, Fig. S12). The GTEx data of 01/17/2014
were downloaded for the analysis of normal human tissues (6).
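The promoter methylation summary described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `promoter_methylation`, the `(position, beta_value)` probe layout, and the choice to return `None` when no probe falls in the window are all assumptions made for the example.

```python
def promoter_methylation(tss, probes, window=1000):
    """Average the methylation values of probes within +/-window nt of a TSS.

    tss    : TSS coordinate on one chromosome (hypothetical input layout)
    probes : iterable of (position, beta_value) pairs on the same chromosome
    Returns None when no probe lies in the window (an assumption; the text
    does not say how such genes were handled).
    """
    values = [beta for pos, beta in probes if abs(pos - tss) <= window]
    if not values:
        return None
    return sum(values) / len(values)
```

For instance, with a TSS at position 5000, probes at 4200 and 5900 contribute to the average while a probe at 7000 is ignored.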
Outlier removal for ChIP-seq profiles
A TF may have several ChIP-seq profiles available from different experimental
conditions, antibodies or laboratories. We found that some ChIP-seq profiles of a TF can
differ substantially from its other profiles, which makes the regulatory inference
ambiguous. For each TF, we clustered the regulatory potential scores of its profiles (computed
from ChIP-seq peaks near gene TSSs) by hierarchical clustering with a Pearson correlation-based
distance measure (SI Appendix, Fig. S13). The hierarchical tree was cut at correlation 0.2 to form
clusters. We ranked the clusters by the number of profiles they contain and kept only the ChIP-seq
profiles in the largest cluster. If no cluster could be formed at the correlation threshold of 0.2, or
if several largest clusters had the same size, we excluded the corresponding TF from further analysis.
After this step, 544 of the 686 ENCODE ChIP-seq profiles remained for analysis,
representing 150 TFs.
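The outlier-removal rule above can be sketched with SciPy. Several details are assumptions, since the text does not state them: average linkage, distance defined as 1 minus the Pearson correlation (so a cut at correlation 0.2 becomes a cut at distance 0.8), "no cluster formed" read as "no cluster with at least two profiles", and TFs with a single profile kept unchanged. The function name `largest_cluster` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def largest_cluster(scores, min_corr=0.2):
    """Return row indices of the profiles in the unique largest cluster.

    scores : array (n_profiles, n_genes) of regulatory potential scores
             for the ChIP-seq profiles of one TF.
    Returns None when the TF should be excluded (no cluster of size >= 2,
    or a tie between the largest clusters).
    """
    n = scores.shape[0]
    if n == 1:
        return np.array([0])               # assumption: nothing to compare
    dist = 1.0 - np.corrcoef(scores)       # correlation-based distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=1.0 - min_corr, criterion="distance")
    sizes = np.bincount(labels)            # labels start at 1
    top = sizes.max()
    if top < 2 or np.sum(sizes == top) > 1:
        return None
    return np.flatnonzero(labels == sizes.argmax())
```

A different linkage method would change which profiles merge below the threshold, so the linkage choice here should be treated as a tunable assumption rather than the published setting.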
Dataset normalization
All gene expression values measured by RNA sequencing platforms in TCGA and GTEx
were log2 transformed. Our analysis focuses on the expression difference between tumor
and normal tissues. For TCGA gene expression and DNA methylation profiles, very few
tumor samples have a paired normal tissue control. Thus, we grouped all
the normal tissue samples together and used their average value as background control in
each cancer. For CNA, TCGA provides very complete tumor-normal paired measurements, so
we used the paired gene CNA difference between tumor and normal samples. Because GTEx
samples, unlike TCGA tumors, have no normal tissue control, we used the average over all
tissue samples as background control.
Regulatory potential scores for TF and RBP binding
For each ChIP-seq profile, we searched for ChIP-seq peaks within 10000 nt of
gene TSSs annotated by RefSeq (7). A regulatory potential score was calculated for each
pair of ChIP-seq peak and gene TSS by multiplying the ENCODE ChIP-seq intensity score
by an exponential decay score $\exp(-A \cdot \mathrm{Distance})$ of the distance between them (8, 9). The
coefficient A was set to log(2)/1000, so that the score of a binding peak 1000 nt away from a gene
TSS decays by 50%. The ENCODE ChIP-seq intensity scores were linearly normalized into the range
(0,1]. The exponential decay score also has the range (0,1]. Thus the final regulatory potential
score lies in (0,1]. For each gene TSS, if several ChIP-seq peaks of a TF lie
nearby, we merged their regulatory potential scores by noisy-or: $1 - \prod_i (1 - \mathrm{score}_i)$.
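As a worked illustration of the score just defined, the sketch below computes the per-peak score and the noisy-or merge. It assumes the intensities passed in are already normalized into (0,1] and that log denotes the natural logarithm, which is what makes the score halve at 1000 nt; the names `peak_score` and `gene_score` are hypothetical.

```python
import math

DECAY = math.log(2) / 1000.0   # coefficient A: the score halves every 1000 nt

def peak_score(intensity, distance):
    """Regulatory potential of one peak: intensity * exp(-A * |distance|)."""
    return intensity * math.exp(-DECAY * abs(distance))

def gene_score(peaks):
    """Noisy-or merge of all peaks of one TF near one TSS.

    peaks : iterable of (normalized_intensity, distance_to_tss_in_nt)
    """
    miss = 1.0
    for intensity, distance in peaks:
        miss *= 1.0 - peak_score(intensity, distance)
    return 1.0 - miss
```

Two maximal-intensity peaks that are each 1000 nt from the TSS score 0.5 individually and merge to 1 - 0.5 * 0.5 = 0.75, so additional peaks raise the score without ever exceeding 1.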
For TF regulatory motifs, we searched for matches within the union DNaseI region in the UCSC
multiple genome alignment of 33 placental mammals and derived a conservation score for
each motif site in the human genome, using the CCAT package with default parameters (10).
Many TFs from the same TF family have very similar recognition motifs, so their mapped
sites overlap heavily. We clustered all TF motifs by their mutual motif
similarity into 220 clusters, using the CCAT package with default parameters (10). We merged
overlapping motif sites on the human genome into one binding site of a motif
cluster if more than 10% of the TFs in that cluster had motif hits included. The regulatory
potential scores for TF recognition motifs were calculated in the same way as for ChIP-seq data,
using exponential decay of distance, except that the conservation score from the multiple genome
alignment was used instead of the ChIP-seq intensity score.
Most of the RBP motifs collected have lower information content than TF
recognition motifs (5), and the CCAT package cannot find significant hits for them with its
statistical model (10). Thus, we converted all RBP motifs into consensus sequences and searched for
their matches on the same strand of gene 3’UTRs by consensus matching (11). The 172 RBP
motifs were clustered into 73 clusters, and overlapping binding sites were merged in the same
way as for TF recognition motifs. The regulatory potential scores for RBP recognition motifs are
simply defined as the conservation scores of 3’UTR regions in multiple genome alignments.
When we profiled the regulatory activity of RBP motifs, the promoter degree and CpG content
were replaced by the 3’UTR degree (total number of RBP motifs in the gene 3’UTR region) and
3’UTR AU content as gene expression background factors.
Frisch-Waugh-Lovell method of regression
When RABIT screens for TFs driving tumor gene expression patterns, multiple regressions
need to be conducted against all 686 TF ChIP-seq profiles in each of 7484 tumor
samples. Thus, RABIT uses the time-efficient Frisch-Waugh-Lovell (FWL) method for
regression (12). FWL separates out the factors whose values do not change within
a tumor, such as CNA and promoter degree, and regresses only against each variable
ChIP-seq profile to speed up the calculation. In Fig. 1B, the vector R represents the ChIP-seq
regulatory potential scores across all target genes. The matrix B is composed of four columns
of background factors (gene CNA, promoter methylation, promoter degree and CpG
content) that stay constant within the same tumor. We calculated the invariant matrix Q
$(I - B(B'B)^{-1}B')$ just one time for all regressions in the same tumor. For each ChIP-seq
profile, only the coefficient $\hat{\beta}_r$, which measures the effect of TF regulation, will be
calculated incrementally from the Q matrix.
The time complexity of linear regression is $O(p^2 N)$, where p is the number of covariates
(five in our analysis: four background factors plus one for TF regulatory potential score); and
N is the number of human genes (about 16000 in our analysis). For each tumor, if we run all
regressions one by one, the time complexity will be $O(k \cdot p^2 N)$, where k is the number of
ChIP-seq profiles. With the FWL method, the computation of matrix $(B'B)^{-1}$ takes
$O((p-1)^2 N + (p-1)^3) = O(p^2 N)$. The computation of $R'QR$ and $R'QY$ takes
$k \cdot O(N + (p-1)N + (p-1)^2) = O(k \cdot pN)$ for k ChIP-seq profiles. If we assume k > p, the
time complexity is reduced to $O(k \cdot pN)$ from $O(k \cdot p^2 N)$.
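The FWL shortcut can be sketched in NumPy as below. This is an illustrative implementation under stated assumptions, not RABIT's code: the function name `fwl_scan` is hypothetical, the caller decides whether B contains an intercept column, and the t-value uses the ordinary least-squares standard error with n - p_b - 1 degrees of freedom. It relies on the standard FWL identity that the coefficient of a profile r in the full model equals r'QY / r'Qr.

```python
import numpy as np

def fwl_scan(Y, B, R):
    """Regress Y on [B, r] for every column r of R, sharing the work on B.

    Y : (n,) response, B : (n, p_b) background factors, R : (n, k) profiles.
    Returns (beta, t): the coefficient and t-value of each profile.
    """
    n, p_b = B.shape
    BtB_inv = np.linalg.inv(B.T @ B)       # computed once per tumor

    def project_out(M):
        # Q @ M with Q = I - B (B'B)^-1 B', without forming the n x n matrix Q
        return M - B @ (BtB_inv @ (B.T @ M))

    QY = project_out(Y)
    QR = project_out(R)
    RQR = np.einsum("ij,ij->j", QR, QR)    # r'Qr for every profile
    RQY = QR.T @ QY                        # r'QY for every profile
    beta = RQY / RQR
    rss = QY @ QY - beta * RQY             # residual sum of squares per model
    sigma2 = rss / (n - p_b - 1)
    t = beta / np.sqrt(sigma2 / RQR)
    return beta, t
```

Because Q is symmetric and idempotent, r'Qr equals the squared norm of Qr, which is why the code never materializes the N-by-N matrix and stays within the O(k * pN) cost discussed above.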
Correlation between regulator gene expression, somatic mutation and target gene
expression for regulatory motif members
In step three of the RABIT framework, we tested the impact of TF gene expression and somatic
mutation variation on target gene expression with a linear regression. However, the
regression analysis is more complicated for regulatory motifs because, unlike a ChIP-seq
profile, one regulatory motif might represent several distinct TFs or RBPs (Fig. 4A). With the
TF regulatory activity score as the response variable, we applied stepwise forward
regression to select among the covariates of gene expression and somatic mutation values of
all TF (or RBP) members (SI Appendix, Fig. S2B and Table S2). Instead of using an F-test to
measure the effect of all covariates on regulatory activity scores across tumors, we used a t-test
to assess the significance of each covariate (13). Among all regulators represented by a
regulatory motif, the p-values of the covariates selected by forward regression were grouped
together and converted to FDRs by the Benjamini-Hochberg procedure (14). This procedure was
applied to each cancer type, and a regulator was reported as cancer associated if at least one
covariate’s regression coefficient was statistically significant (FDR threshold 0.05).
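For reference, the Benjamini-Hochberg conversion from p-values to FDRs can be sketched in a few lines. The function name `benjamini_hochberg` is hypothetical, and the sketch covers only the multiple-testing step, not the forward selection that produces the p-values.

```python
def benjamini_hochberg(pvalues):
    """Convert p-values to Benjamini-Hochberg adjusted values (FDRs).

    The adjusted value of the p-value with rank r (1 = smallest) among m
    tests is min over ranks >= r of p * m / rank, capped at 1.
    Results are returned in the input order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr
```

A covariate would then count as significant when its adjusted value is at most 0.05, matching the threshold stated above.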
Algorithm Comparison
To compare the performance of RABIT with other methods in finding cancer-associated
TFs, we used the receiver operating characteristic (ROC) curve and the precision-recall
(PR) curve. The gold standard positive set was defined as TFs annotated as cancer associated in
at least two of four cancer gene databases (NCI Cancer Index (15), Bushman (16, 17),
COSMIC (18) and CCGD (19)). The gold standard negative set was defined as the remaining TFs.
Among the 150 TFs with ChIP-seq profiles analyzed, 96 were classified as gold standard
positives and 54 as gold standard negatives. After running each method, we
generated a ranking of TFs reflecting their relative relevance to a cancer type. We derived an
overall TF ranking by averaging the TF ranks across all TCGA cancer types. We swept through
each ranked TF list to generate the ROC and PR curves for each method (SI Appendix, Fig. S8
B and C). The parameters used to run each method are listed below.
For LASSO, we used the glmnet R package, set the penalty weights of the four background
factors (promoter degree, CpG content, CNA and promoter methylation) to zero and left all
other parameters at their defaults (20). With zero penalty weights, these background
factors are always retained in the linear model as controls. For LAR, we used the lars R
package (21). Because lars provides no way of inputting background covariates, we first
calculated the residuals of the tumor gene expression values and the ChIP-seq regulatory potential
scores after regressing them on the four background factors. In this way, the impact of the background
factors is removed from the data. Then, we ran the lars package on these residual values
with default parameters.
Besides regression methods, we also included three methods designed for finding master
regulators of gene expression patterns. We ran the MARINA and VIPER algorithms using the
VIPER R package with default parameters (22, 23). We also ran the Expression-2-Kinase
(X2K) package with default parameters (24). Since X2K takes only a gene list as input, we
took the top 10% most up-regulated (or down-regulated) genes in each tumor to calculate the
master TFs of gene up-regulation (or down-regulation). We also included two methods for
baseline comparison. For each TF, we used a t-test to check whether its ChIP-seq target genes
were significantly differentially regulated. The t-test p-values were converted to FDRs by the
Benjamini-Hochberg procedure, and an FDR threshold of 0.05 was used to select significant TFs
in each tumor. For each TF, we also used its gene expression difference between all tumor
samples and all normal controls as a measure of cancer relevance.
For RABIT, LAR, LASSO, X2K and the t-test, a set of TFs was selected as important regulators
of gene expression patterns in each tumor. For each TF, the percentage of tumors in which the TF
was selected was calculated for each TCGA cancer type (Fig. 2). The overall cancer relevance was
defined as the average percentage of tumors with the TF selected across all TCGA cancer types. For
MARINA and VIPER, the TFs were ranked by the adjusted p-values from each algorithm. The
overall cancer relevance of TFs was defined as the average rank of TFs across all TCGA
cancer types. For TF expression, the TFs were ranked by the absolute value of the tumor versus
normal expression difference averaged over all TCGA cancers. The ranked list of each
algorithm was swept through to generate the ROC and PR curves (SI Appendix, Fig. S8 B
and C). The areas under the ROC curves of two algorithms were compared using the DeLong test
(25).
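Sweeping a ranked list to obtain ROC and PR points can be sketched as follows. The function names are hypothetical, ties in the ranking are not handled (each TF is treated as strictly ranked), and the area is a plain trapezoidal estimate; the DeLong comparison itself is not reproduced here.

```python
def roc_points(ranked_labels):
    """ROC points for a ranked list of gold-standard labels (best first).

    ranked_labels : list of booleans, True for gold standard positives.
    Requires at least one positive and one negative.
    """
    pos = sum(ranked_labels)
    neg = len(ranked_labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))    # (false positive rate, recall)
    return points

def pr_points(ranked_labels):
    """(recall, precision) after each position of the ranked list."""
    pos = sum(ranked_labels)
    tp = 0
    points = []
    for k, is_pos in enumerate(ranked_labels, start=1):
        tp += bool(is_pos)
        points.append((tp / pos, tp / k))
    return points

def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With the ranking positive, positive, negative, positive, negative, five of the six positive-negative pairs are ordered correctly, and `area_under(roc_points(...))` gives the matching value 5/6.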
Besides comparing different methods, we also tested the performance of RABIT without
controlling for the four background factors (promoter degree, CpG content, CNA and promoter
methylation). The area under the curve of RABIT is significantly larger than that obtained without
background factors, and the precision of RABIT is consistently higher than that obtained without
background factors when the recall rate is higher than 0.5 (SI Appendix, Fig. S8E).
Correlation between TF regulatory activity and genome-wide CRISPR screening
For the cell lines K562 and HL60, gene expression data profiled by ENCODE and
genome-wide CRISPR screening data are available from previous studies (26, 27). The CRISPR
screening scores were downloaded directly from each study. For each gene screened, a
positive score implies that cell growth became faster after TF CRISPR knockout, and a
negative score implies that cell growth became slower after TF CRISPR knockout.
We applied RABIT to ENCODE gene expression data from 76 cell lines. A regulatory
activity score was calculated for each TF to indicate whether the TF target genes were
up-regulated or down-regulated. For transcriptional repressors (defined below), the regulatory
activity scores were sign-reversed, since the direction of target gene regulation is opposite to
the direction of TF activation. The Spearman rank correlations between TF regulatory activity
scores and TF CRISPR screening scores were calculated, and the p-values of the correlation tests
were calculated in R.
In the analysis above, we included only transcriptional activators and repressors. The
correlations between TF gene expression values and target regulatory activity scores were
computed for all TCGA cancer data, and significant correlations were selected at a
correlation t-test FDR threshold of 0.05. Transcriptional activators were defined as TFs with
positive correlations in more than 80% of TCGA cancer types, and transcriptional repressors were
defined as TFs with negative correlations in more than 80% of cancer types.
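The activator and repressor definitions, together with the sign reversal applied to repressors, can be sketched as below. This is a minimal illustration under assumptions: each cancer type is summarized as +1 (significant positive correlation), -1 (significant negative correlation) or 0 (not significant), the 80% is taken over all cancer types supplied, and the names `classify_tf` and `signed_activity` are hypothetical.

```python
def classify_tf(signs, min_fraction=0.8):
    """Label a TF from its per-cancer-type correlation signs.

    signs : list with one entry per TCGA cancer type:
            +1 significant positive, -1 significant negative, 0 otherwise.
    """
    n = len(signs)
    if n == 0:
        return "unclassified"
    positive = sum(1 for s in signs if s > 0) / n
    negative = sum(1 for s in signs if s < 0) / n
    if positive > min_fraction:
        return "activator"
    if negative > min_fraction:
        return "repressor"
    return "unclassified"

def signed_activity(score, tf_class):
    """Reverse the regulatory activity score for repressors only."""
    return -score if tf_class == "repressor" else score
```

Under this reading, a TF with positive correlations in exactly 80% of cancer types stays unclassified, because the definition requires more than 80%.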
References
1. Consortium EP, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57-74.
2. Jolma A, et al. (2013) DNA-binding specificities of human transcription factors. Cell 152(1-2):327-339.
3. Mathelier A, et al. (2014) JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic acids research 42(Database issue):D142-147.
4. Matys V, et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic acids research 31(1):374-378.
5. Ray D, et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature 499(7457):172-177.
7. Pruitt KD, Tatusova T, & Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35(Database issue):D61-65.
8. Tang Q, et al. (2011) A comprehensive view of nuclear receptor cancer cistromes. Cancer research 71(22):6940-6947.
9. Wang S, et al. (2013) Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols 8(12):2502-2515.
10. Jiang P & Singh M (2014) CCAT: Combinatorial Code Analysis Tool for transcriptional regulation. Nucleic acids research 42(5):2833-2847.
11. Kheradpour P, Stark A, Roy S, & Kellis M (2007) Reliable prediction of regulator targets using 12 Drosophila genomes. Genome research 17(12):1919-1931.
12. Frisch R & Waugh FV (1933) Partial Time Regressions as Compared with Individual Trends. Econometrica 1(4):387-401.
13. Freedman D (2009) Statistical models : theory and practice (Cambridge University Press, Cambridge ; New York) pp xiv, 442 p.
14. Benjamini Y & Hochberg Y (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57(1):289-300.
15. NCI (2014) Cancer Gene Index Project.
16. Sadelain M, Papapetrou EP, & Bushman FD (2012) Safe harbours for the integration of new DNA in the human genome. Nature reviews. Cancer 12(1):51-58.
17. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339(6127):1546-1558.
18. Futreal PA, et al. (2004) A census of human cancer genes. Nature reviews. Cancer 4(3):177-183.
19. Abbott KL, et al. (2014) The Candidate Cancer Gene Database: a database of cancer driver genes from forward genetic screens in mice. Nucleic acids research.
20. Friedman J, Hastie T, & Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software 33(1):1-22.
21. Efron B, Hastie T, Johnstone I, & Tibshirani R (2004) Least angle regression. The Annals of statistics 32(2):407-499.
22. Lefebvre C, et al. (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Molecular systems biology 6:377.
23. Alvarez MJ (2013) viper: Master Regulator Analysis including MARINA and VIPER algorithms. R package version 0.99.0.
24. Chen EY, et al. (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1):105-111.
25. DeLong ER, DeLong DM, & Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837-845.
26. Gilbert LA, et al. (2014) Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell 159(3):647-661.
27. Wang T, Wei JJ, Sabatini DM, & Lander ES (2014) Genetic screens in human cells using the CRISPR-Cas9 system. Science 343(6166):80-84.
Supplementary Figures and Tables
[Figure S1: boxplots in two panels, A and B. Y-axis: Correlation. X-axis: TCGA cancer type and expression platform (e.g., LUSC.HiSeq, OV.Agilent, GBM.U133A).]
Fig. S1. Background confounding factors of gene expression. (A) For each tumor, we computed
the Spearman rank correlation between the promoter degree of each gene and the tumor-normal
expression difference. For each cancer type, the correlation values across all tumors are shown. The
bottom and top of the box are the 25th and 75th percentiles (i.e., they give the interquartile range).
Whiskers on the top and bottom represent the maximum and minimum data points within
1.5 times the interquartile range. (B) The correlation values between the promoter
CpG content and the gene tumor-normal expression difference are shown by boxplots.
[Figure S2: schematic of the two linear models. (A) Gene expression modeled by background factors (sub-matrix B) and regulators in selection (sub-matrix R); n: number of genes; pb: number of background factors; pr: number of regulators. (B) Regulatory activity scores (t-value) modeled by regulator gene expression (E) or somatic mutation (M); n: number of tumors; p: number of members.]
Fig. S2. Linear model structure. (A) For each tumor, the TF regulatory effect on target
gene expression is evaluated by linear regression, with each human gene as a regression unit. The
covariate matrix X is composed of two sub-matrices, B and R. Sub-matrix B contains the values of the
background factors, which include promoter degree, promoter CpG content, gene CNA and promoter DNA
methylation. Sub-matrix R contains the regulatory potential scores that measure TF binding intensity
near gene TSSs, and a set of TFs is selected to accurately model the gene expression pattern in each
tumor. The response variable Y contains the gene expression differences between the tumor sample and
normal controls. In our analysis, the number of regression units n is about 16000, which is the
number of human genes with TCGA gene expression measured and ChIP-seq binding peaks near their
TSSs. The number of background factors pb, i.e., the dimension of B, is 4. The dimension of R, pr, is 150,
which is the number of TFs with ENCODE ChIP-seq profiles. (B) RABIT investigates whether
the public ChIP-seq profiles (or regulatory motifs) used can represent the active TF targets in each
cancer type. For each cancer type, we regress the response variable of TF regulatory activity scores
linearly against the covariates of TF gene expression and somatic mutation, with each tumor sample as
a regression unit. A TF (or RBP) regulatory motif might contain several members, and all of them are
included as covariates. The response vector contains the TF regulatory activity scores, which are the
estimated coefficients normalized by their estimated standard errors (i.e., t-values) for TF regulatory
effects on target gene differential expression (example in Table 1A). The number of regression units is
the number of tumors in each TCGA cancer type. The dimension of X is 2 for ChIP-seq analysis,
representing the TF gene expression and somatic mutation. Regulatory motifs might include several
members; for example, RBP motif cluster 9 includes RBFOX1, RBFOX2, RBFOX3 and EIF2S1. To
determine which members are relevant to the expression patterns of the motif target genes, we run a
forward selection among the gene expression and mutation values (2p covariates) of all members, with
Mallows' Cp as the model selection metric (example in SI Appendix, Table S2).
[Figure S3: horizontal bar charts of ENCODE MYC ChIP-seq profiles, each labeled by cell line, treatment and laboratory (e.g., MCF-7+vehicle+UT-A, K562+IFNg30+Stanford, HeLa-S3+Yale). (A) X-axis: t-value, 0 to 12. (B) X-axis: Percentage (%), 0 to 60.]
Fig. S3. Selection of the ChIP-seq profile with the largest statistical effect. When one TF has several
ChIP-seq profiles available, RABIT uses only the ChIP-seq profile that gives the most significant
coefficient (the largest absolute t-value) in the regression analysis of TF regulation on target genes. (A)
In this example, all ENCODE MYC ChIP-seq profiles are analyzed together with TCGA data for breast
tumor TCGA-AO-A03P-01A. (B) For each TCGA breast tumor, only the single most relevant ENCODE MYC
ChIP-seq profile is selected. We show the percentage of tumors for which each ChIP-seq profile is selected.