Supplementary Information
Network inference analysis identifies SETDB1 as a key regulator for
reverting colorectal cancer cells into differentiated normal-like cells
Soobeom Lee, Chansu Lee, Chae Young Hwang, Dongsan Kim, Younghyun Han,
Sung Noh Hong, Seok-Hyung Kim, and Kwang-Hyun Cho*
*Correspondence to: [email protected]
This file includes:
Supplementary Materials and Methods
Supplementary References
Supplementary Figures and Figure Legends (S1-S8)
Supplementary Tables (S1-S5)
(Supplementary Table 2 is provided as a separate file in a spreadsheet format.)
Supplementary Materials and Methods
Gene expression data processing
To avoid errors arising from the use of heterogeneous Affymetrix platforms, we collected
17 independent gene expression datasets of normal colon mucosa profiled on the Affymetrix
GeneChip Human Genome U133 Plus 2.0 array (GEO-integrated normal dataset, n = 318;
GSE71571, GSE23194, GSE37364, GSE41328, GSE33114, GSE23878, GSE20916,
GSE15960, GSE9254, GSE4183, GSE8671, GSE11831, GSE32323, GSE9348, GSE4107,
GSE10714, GSE18105) from the Gene Expression Omnibus. The data were normalized by the
robust multi-array average (RMA) method as implemented in the ‘affy’ package (1) and
batch-corrected with the ComBat method as implemented in the ‘sva’ package (2) in R. In
addition, 11 gene expression datasets of tumor samples profiled on the Affymetrix GeneChip
Human Genome U133 Plus 2.0 array (GEO-integrated cancer dataset, n = 2,166; GSE33113,
GSE13067, GSE13294, GSE14333, GSE17536, GSE20916, GSE2109, GSE23878, GSE35896,
GSE37892, GSE39582) together with the KFSYSCC dataset (3) were downloaded and processed
by the same methods. We also used the paired data of 98 adjacent normal colon mucosa
samples and their tumor tissue samples profiled on the Affymetrix Human Genome U219 Array
(GEO-paired dataset, n = 196; GSE44076). For The Cancer Genome Atlas (TCGA) data, we used
the transcriptome data in the ‘HTSeq - FPKM-UQ’ format (Fragments Per Kilobase of
transcript per Million mapped reads, upper quartile) downloaded from the GDC Data Portal
using the ‘TCGAbiolinks’ package (4) (TCGA dataset, n = 698). A single-cell RNA-seq
dataset was downloaded from the Gene Expression Omnibus (Single-cell RNA-seq dataset,
n = 432; GSE81861), and we selectively used the epithelial samples annotated by the
previous study (14). In brief, the data were generated using the SMARTer Ultra-Low RNA Kit
for Illumina (Clontech, 634948) and sequenced on the HiSeq 2000 platform (Illumina). Of the
1,591 cells from tumors and matched normal mucosa, the authors of that study retained 626
cells after quality control. To further restrict the analysis to detectably expressed genes,
we removed genes that were detected in fewer than one tenth of the samples. The details of
the datasets used in this study are given in Table S1.
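The final gene-filtering step can be illustrated with a minimal Python sketch. It assumes that a gene counts as detected in a sample when its value is greater than zero, which the text does not state; the function name `filter_detected_genes`, its parameters, and the toy matrix are hypothetical and are not taken from the study's code.

```python
from typing import Dict, List


def filter_detected_genes(
    expression: Dict[str, List[float]],
    min_fraction: float = 0.1,
    detection_threshold: float = 0.0,
) -> Dict[str, List[float]]:
    """Keep genes detected in at least `min_fraction` of the samples.

    Assumption: a gene is "detected" in a sample when its value exceeds
    `detection_threshold` (zero here); the source does not give the rule.
    """
    kept = {}
    for gene, values in expression.items():
        n_detected = sum(1 for v in values if v > detection_threshold)
        if n_detected >= min_fraction * len(values):
            kept[gene] = values
    return kept


# Toy matrix: three hypothetical genes measured in ten cells.
toy = {
    "GENE_A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never detected
    "GENE_B": [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # detected in 1 of 10 cells
    "GENE_C": [1, 2, 0, 3, 0, 0, 4, 0, 0, 1],  # detected in 5 of 10 cells
}
filtered = filter_detected_genes(toy)
print(sorted(filtered))
```

In this sketch a gene detected in exactly one tenth of the samples is kept, because only genes below that fraction are removed.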
Gene sets
Tissue-specific gene sets were obtained from the Human Protein Atlas (6) (HPA), the
Genotype-Tissue Expression project (7) (GTEx), and the Functional Annotation of the Mammalian
Genome (8) (FANTOM) project. From HPA, we used the tissue-specific gene lists that the
study labels as “elevated genes” for each tissue. From GTEx, we selected the
tissue-specific gene sets that have at least five-fold higher gene expression in a particular
tissue as compared to all other tissues. Likewise, we defined tissue-specific gene sets by
the same rule from the gene expression data of FANTOM5. The stem/progenitor-associated
gene sets were individually collected from previous studies (5,9-13). The detailed gene lists
used in this study are given in Table S2.
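The five-fold rule described for GTEx and FANTOM5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: expression values are assumed to be on a linear scale, "all other tissues" is interpreted as the maximum over the other tissues, and the function name, gene names, and tissue values are hypothetical.

```python
from typing import Dict, List


def tissue_specific_genes(
    expression: Dict[str, Dict[str, float]],
    tissue: str,
    fold: float = 5.0,
) -> List[str]:
    """Return genes whose expression in `tissue` is at least `fold` times
    the expression in every other tissue (linear-scale values assumed)."""
    selected = []
    for gene, by_tissue in expression.items():
        others = [v for t, v in by_tissue.items() if t != tissue]
        # Comparing with the maximum of the other tissues enforces the
        # threshold against each of them at once.
        if others and by_tissue[tissue] >= fold * max(others):
            selected.append(gene)
    return selected


# Hypothetical per-tissue average expression values.
toy_expression = {
    "GENE_X": {"colon": 50.0, "liver": 8.0, "lung": 10.0},   # 5x the maximum
    "GENE_Y": {"colon": 40.0, "liver": 10.0, "lung": 2.0},   # only 4x liver
}
colon_genes = tissue_specific_genes(toy_expression, "colon")
print(colon_genes)
```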
Supplementary References
1. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004;20(3):307-15 doi 10.1093/bioinformatics/btg405.
2. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28(6):882-3 doi 10.1093/bioinformatics/bts034.
3. Guinney J, Dienstmann R, Wang X, de Reynies A, Schlicker A, Soneson C, et al. The consensus molecular subtypes of colorectal cancer. Nat Med 2015;21(11):1350-6 doi 10.1038/nm.3967.
4. Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 2016;44(8):e71 doi 10.1093/nar/gkv1507.
5. Park NI, Guilhamon P, Desai K, McAdam RF, Langille E, O'Connor M, et al. ASCL1 Reorganizes Chromatin to Direct Neuronal Fate and Suppress Tumorigenicity of Glioblastoma Stem Cells. Cell Stem Cell 2017;21(3):411 doi 10.1016/j.stem.2017.08.008.
6. Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science 2015;347(6220):1260419 doi 10.1126/science.1260419.
7. Consortium GT. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 2015;348(6235):648-60 doi 10.1126/science.1262110.
8. Consortium F, the RP, Clst, Forrest AR, Kawaji H, Rehli M, et al. A promoter-level mammalian expression atlas. Nature 2014;507(7493):462-70 doi 10.1038/nature13182.
9. Wong DJ, Liu H, Ridky TW, Cassarino D, Segal E, Chang HY. Module map of stem cell genes guides creation of epithelial cancer stem cells. Cell Stem Cell 2008;2(4):333-44 doi 10.1016/j.stem.2008.02.009.
10. Loh KM, Ang LT, Zhang J, Kumar V, Ang J, Auyeong JQ, et al. Efficient endoderm induction from human pluripotent stem cells by logically directing signals controlling lineage bifurcations. Cell Stem Cell 2014;14(2):237-52 doi 10.1016/j.stem.2013.12.007.
11. Kim J, Woo AJ, Chu J, Snow JW, Fujiwara Y, Kim CG, et al. A Myc network accounts for similarities between embryonic stem and cancer cell transcription programs. Cell 2010;143(2):313-24 doi 10.1016/j.cell.2010.09.010.
12. Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A, et al. An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nat Genet 2008;40(5):499-507 doi 10.1038/ng.127.
13. Xie R, Everett LJ, Lim HW, Patel NA, Schug J, Kroon E, et al. Dynamic chromatin remodeling mediated by polycomb proteins orchestrates pancreatic differentiation of human embryonic stem cells. Cell Stem Cell 2013;12(2):224-37 doi 10.1016/j.stem.2012.11.023.
14. Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49(5):708-18 doi 10.1038/ng.3818.
Supplementary Figure legends
Supplementary Figure 1. Loss of colon-specific gene expression in colorectal cancer cells
and its clinical relevance.
A, Violin plots showing GSVA scores (bottom axis) of adult tissue-associated gene sets across
datasets. –log10(P value) (top axis) is overlaid as red dots. B, Kaplan-Meier analysis of overall
survival (n = 623, P = 0.011) and disease-free survival (n = 545, P = 0.0011). The GSVA score of
the colon-specific gene set was dichotomized into high and low using the median as a cutoff. C,
Box plots showing the GSVA score of the colon-specific gene set according to the TNM staging
system as assessed by TCGA: the extent of the tumor (left), the extent of spread to the lymph
nodes (middle), and the presence of distant metastasis (right). The colon-specific gene set from
the Human Protein Atlas was used for the above analyses. (*P < 0.05, **P < 0.01, ***P < 0.001)
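The median dichotomization and the Kaplan-Meier estimate behind panel B can be sketched in Python as follows. This is a minimal illustration, not the analysis code of the study: samples whose score equals the median are assigned to the low group by assumption, the log-rank test that would yield the P values is not shown, and all names and toy values are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def dichotomize_by_median(scores: Sequence[float]) -> List[str]:
    """Label each score "high" or "low" relative to the median.

    Assumption: scores equal to the median are labeled "low".
    """
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


def kaplan_meier(
    times: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is True for an observed event and False for censoring.
    """
    ordered = sorted(zip(times, events))
    at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = 0
        leaving = 0
        # Group all samples that share the same follow-up time.
        while i < len(ordered) and ordered[i][0] == t:
            deaths += int(ordered[i][1])
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical GSVA scores and follow-up data.
groups = dichotomize_by_median([0.1, 0.4, -0.2, 0.3, -0.5])
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
print(groups)
print(curve)
```

In practice the estimator would be applied separately to the high and low groups and the two curves compared.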
Supplementary Figure 2. Acquisition of stem/progenitor gene expression program in
colorectal cancer cells.
A, Heatmap colored by the overlapping score among stem/progenitor-associated gene sets.
B, Violin plots showing GSVA scores (bottom axis) of stem/progenitor-associated gene sets
across datasets. –log10(P value) (top axis) is overlaid as red dots. GEO-integrated dataset
(top-right), GEO-paired dataset (bottom-left), TCGA dataset (bottom-middle), and single-cell
RNA-seq dataset (bottom-right). C, t-SNE visualization plots colored by GSVA score of the colon
tissue-specific gene set (left column), intestinal stem cell-specific gene set (middle column), and
embryonic stem cell-specific gene set (right column). Each dot indicates a sample; dashed
circles indicate samples in which the GSVA scores of the intestinal stem cell and embryonic
stem cell gene sets did not increase together. In those samples, the GSVA score of embryonic
stem cells, but not that of intestinal stem cells, was particularly increased.
Supplementary Figure 3. Differentially expressed gene sets analysis and pseudotime
analysis of single-cell RNA sequencing data.
A, Heatmap of Spearman correlation coefficients for 74 gene sets showing differential
expression between normal colon epithelial cells and colorectal cancer epithelial cells. The
differentially expressed gene sets were defined by an adjusted P value (FDR) below 1e-5. B,
t-SNE visualization plots of single-cell RNA-seq data. The first plot shows disease states (top),
and the three plots below are colored by GSVA score of the colon tissue-specific gene set
(second row), intestinal stem cell-specific gene set (third row), and embryonic stem
cell-specific gene set (fourth row), respectively. C, Scatter plot colored by cell types based on the
previous study (14). The previous study classified the normal cells into enterocytes, goblet cells,
stem/TA (stem/transit-amplifying) cells, non-stem cells, and unknown cells, and classified the
cancer cells into stem/TA cells, enterocyte-like cells, goblet-like cells, and unknown cells. We
labeled the subtypes of cancer cells according to the annotation of normal cells: ‘enterocyte-like’
as ‘enterocyte’ and ‘goblet-like’ as ‘goblet’. Shape indicates whether the cell is a normal cell
(star) or a cancer cell (circle). Color indicates differentiation state. D, Dimensional reduction
using the SLICER algorithm or E, using Monocle. Points are colored based on PC1 of each
state-specific gene set: colon (left), intestinal stem cell (middle), and embryonic stem cell (right).
The smoothed profiles of PC1 for the state-specific gene sets over pseudotime are shown on
the right side. F, Transcription entropy measured using the StemID algorithm, plotted along
pseudotime. G, Z-score-normalized imputed gene expression levels of various stem cell markers
along pseudotime inferred by SLICER (top) and Monocle (bottom).
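The Spearman correlation coefficient shown in panel A is the Pearson correlation of the ranks of two score vectors. The following Python sketch illustrates that calculation with average ranks for ties; the function names and the toy GSVA score vectors are hypothetical, and the sketch is not the code used to produce the heatmap.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical GSVA scores of three gene sets across four cells.
set_a = [0.1, 0.5, 0.3, 0.9]
set_b = [1.0, 4.0, 2.0, 10.0]   # same ordering as set_a
set_c = [0.9, 0.2, 0.6, 0.1]    # reversed ordering
print(spearman(set_a, set_b), spearman(set_a, set_c))
```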
Supplementary Figure 4. Identification of core transcription factors of normal colon
epithelial cells and their reduced gene expression level and transcriptional activity in
colorectal cancer cells.
A, Summary of inferred gene regulatory networks. B, The number of genes in colon-associated
gene sets. The union of the gene sets was used for MARINA analysis. C, Violin plot showing
GSVA scores of the colon-associated gene sets used in MARINA analysis. D, Venn diagram of
differentially expressed genes (DEGs) that are highly expressed in normal colon mucosa
compared to CRC tissues. DEGs were defined in the GEO-paired dataset, GEO-integrated dataset,
and TCGA dataset, respectively. Among the DEGs, the genes that are commonly highly
expressed in normal colon mucosa in all three datasets were used in MARINA analysis. We
called these genes normal-specific genes. E-H, Bar plots showing –log10(P value) of putative
master regulators. We identified potential TFs with a P value of less than 1e-10. For both the
GEO-paired normal GRN and the GEO-integrated normal GRN, five TFs significantly included
normal-specific genes and colon-specific genes among their target genes. These five TFs were
defined as the core TFs and are colored in red. E, Regulators of normal-specific genes in the
GEO-paired normal GRN, F, regulators of colon-specific genes in the GEO-paired normal GRN,
G, regulators of normal-specific genes in the GEO-integrated normal GRN, H, regulators of
colon-specific genes in the GEO-integrated normal GRN. I-K, Gene expression level of the core TFs,
transcriptional activity inferred by the VIPER algorithm, and average expression level of target
genes of the core TFs I, in the GEO-integrated dataset, J, in the TCGA dataset, and K, in the
GEO-paired dataset.
Supplementary Figure 5. SETDB1 as a negative modulator that hinders transcriptional
activity of the core TFs and its clinical relevance in CRC patients.
A, GO-term enrichment among the genes that showed negative correlation with
colon-associated gene expression. The number of genes belonging to each GO term is labeled.
B, Interaction score (left axis, yellow) with the core TFs based on the STRING database and
conditional mutual information (CMI) score (right axis, blue) are shown. The STRING score
was calculated based on the combined score of the STRING database. The CMI score represents the
number of target genes whose regulatory interaction with the core TFs was altered depending on
the expression level of a potential modulator. For the 15 potential modulators, the target genes
that lost their interaction with the core TFs in CRC samples with high gene expression of the
potential modulator were counted. The target genes of the core TFs were inferred from the
GEO-integrated normal GRN, and the 300 CRC samples with the highest gene expression of the
potential modulators and the 300 CRC samples with the lowest gene expression of the potential
modulators were used. C, Heatmap showing the gene expression level of the core TFs in four
stem-like CRC cell lines. The data were produced from microarray analysis of the four cell lines
(n = 1). D, qRT-PCR analysis to confirm the knockdown efficiency upon siRNA transfection
in Caco2 cells. E, Imputed gene expression level of SETDB1 compared between normal stem/TA
cells and cancer stem/TA-like cells based on the annotation of the previous study (14). F, Imputed
gene expression level of SETDB1 in single-cell RNA-seq. G-I, Correlation between imputed
SETDB1 expression and PC1 value of each state-specific gene set in single-cell RNA-seq data.
G, Colon, H, intestinal stem cell, I, embryonic stem cell. J, Kaplan-Meier analysis of overall
survival (n = 623, P = 0.48) and disease-free survival (n = 545, P = 0.00074) with SETDB1
expression. The SETDB1 expression level was dichotomized into high and low using the third
quartile as a cutoff. K, Kaplan-Meier analysis of overall survival (n = 623, P = 0.064) and
disease-free survival (n = 545, P = 7.8e-05). The GSVA score of the colon-specific gene set was
dichotomized into high and low using the median as a cutoff and further divided by SETDB1
expression level using its third quartile.
Supplementary Figure 6. Gene regulation of in vitro stem-like colorectal cancer cells and
immunofluorescence confocal imaging analysis.
A, SETDB1, KRT20, and MYC mRNA levels were measured by qRT-PCR in Caco2 cells stably
expressing three different shRNA sequences targeting SETDB1. Further analyses were
performed with shRNA#1. B, mRNA expression levels of the core TFs were compared in three
cell lines. The relative abundance of mRNA is presented in arbitrary units normalized to β-actin
mRNA expression. C, Overexpression of the core TFs was assessed by qRT-PCR analysis
(arbitrary units). All qRT-PCR analyses were conducted in triplicate and are presented as
mean ± s.e.m. D, qRT-PCR analysis of gene expression in sorted cells. Relative fold change
was calculated over the mock-treated group of each cell line (M, mock; 5F, five core TF
overexpression) (n = 3, mean ± s.e.m.; *P < 0.05, **P < 0.01, ***P < 0.001). E,
Immunofluorescence of KRT20 (red) and Ki-67 (green) with DAPI nuclear counterstain (blue).
Scale bar, 20 μm. Caco2 (top), HCT116 (middle), and SW480 (bottom).
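The normalization to β-actin, the fold change over the mock-treated group, and the mean ± s.e.m. summary mentioned in this legend can be sketched as follows. The legend does not state the quantification formula, so the 2^-ΔCt relative expression used here is an assumption; the function names and Ct values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Target abundance relative to the reference gene (e.g. beta-actin).

    Assumption: perfect amplification efficiency, so abundance = 2^-dCt.
    """
    return 2.0 ** -(ct_target - ct_reference)


def fold_change_over_mock(
    treated: Sequence[float], mock: Sequence[float]
) -> List[float]:
    """Divide each treated replicate by the mean of the mock replicates."""
    mock_mean = mean(mock)
    return [value / mock_mean for value in treated]


def mean_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical triplicate Ct values (target gene, beta-actin).
mock = [relative_expression(25.0, 18.0) for _ in range(3)]
treated = [
    relative_expression(23.0, 18.0),
    relative_expression(22.0, 18.0),
    relative_expression(24.0, 18.0),
]
fold = fold_change_over_mock(treated, mock)
avg, sem = mean_sem(fold)
print(fold, avg, sem)
```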
Supplementary Figure 7. Restoration of tissue-specific gene expression profiles and
transcriptional activity of the core TFs in differentiated normal-like cells.
A, The trajectory of single-cell RNA-seq data from Caco2 cell lines stably expressing shSETDB1
or shScramble was reconstructed using the SLICER algorithm. Dimensional reduction was performed
by locally linear embedding (LLE). B, Scatter plots colored by PC1 of each state-specific gene
set. C-D, The average expression level of target genes inferred from C, the GEO-integrated
dataset and D, the GEO-paired dataset. E-F, Z-score-normalized average gene expression of target
genes of the core TFs in single-cell RNA-seq along pseudotime. The target genes of the core TFs
were inferred from E, the GEO-integrated normal GRN and F, the GEO-paired normal GRN. G-H,
Average expression of target genes inferred from G, the GEO-integrated normal GRN and H, the
GEO-paired normal GRN, in bulk RNA sequencing data.
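The Z-score-normalized average target gene expression along pseudotime (panels E-F) can be illustrated with the following Python sketch. It assumes that the average is taken per cell over the target genes and that the population standard deviation is used for the Z-score, neither of which is specified in the legend; all names and values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def average_target_expression(
    expression: Dict[str, List[float]], targets: Sequence[str]
) -> List[float]:
    """Per-cell average expression over the given target genes."""
    n_cells = len(next(iter(expression.values())))
    return [mean(expression[g][c] for g in targets) for c in range(n_cells)]


def zscore(values: Sequence[float]) -> List[float]:
    """Z-score using the population standard deviation (an assumption)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Hypothetical expression of two target genes and one unrelated gene
# in four cells, with a hypothetical pseudotime value per cell.
expression = {
    "TARGET_1": [1.0, 2.0, 3.0, 4.0],
    "TARGET_2": [3.0, 4.0, 5.0, 6.0],
    "OTHER": [9.0, 9.0, 9.0, 9.0],
}
pseudotime = [0.9, 0.1, 0.5, 0.3]

avg = average_target_expression(expression, ["TARGET_1", "TARGET_2"])
order = sorted(range(len(pseudotime)), key=lambda c: pseudotime[c])
profile = zscore([avg[c] for c in order])  # values ordered along pseudotime
print(order, profile)
```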
Supplementary Figure 8. Bulk RNA sequencing data analysis and drug responsiveness of
SETDB1-downregulated cells.
A, Heatmap showing colon-specific genes used in GSEA. B, ESC-specific genes used in GSEA.
C, GSVA scores of intestinal differentiation-related gene signatures, intestinal stem cell-related
gene signatures, and Wnt signaling-related gene signatures in Caco2, HCT116, and SW480
cells. D, Gene expression of the epithelial cell markers E-cadherin, ZO-1, and ZO-2 in Caco2,
HCT116, and SW480 cells. P values were determined by unpaired two-tailed Student’s t-test.
E, Percent growth over time of Caco2 cells after SETDB1 downregulation, treated with the
cytotoxic drugs 5-FU and oxaliplatin. F, Percent growth inhibition of Caco2 cells after 5 days
of cytotoxic drug treatment.
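For reference, the test statistic of an unpaired Student's t-test with pooled variance can be computed as in the following Python sketch. It is only an illustration with hypothetical values and names; the two-tailed P value would then be read from the t distribution with the returned degrees of freedom (for example with `scipy.stats.t.sf(abs(t), df) * 2`, if SciPy is available).

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def student_t(
    group_a: Sequence[float], group_b: Sequence[float]
) -> Tuple[float, float, int]:
    """Unpaired Student's t-test with pooled (equal) variance.

    Returns the t statistic, the difference in means (a - b), and the
    degrees of freedom. The P value is not computed here.
    """
    na, nb = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    pooled = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    standard_error = sqrt(pooled * (1.0 / na + 1.0 / nb))
    return diff / standard_error, diff, na + nb - 2


# Hypothetical expression values of one marker in two conditions.
t_stat, diff, df = student_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
print(t_stat, diff, df)
```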
Tables and their legends
Supplementary Table 1. Public datasets of normal colon mucosa and colorectal cancer tissues used for network analysis.
Dataset | GSE ID | Platform | Number of samples | Sample type | Reference
GEO-integrated dataset | GSE71571 | [HG-U133_Plus_2] | 44 | Normal colon mucosa | Thomas SS, Makar KW, Li L, Zheng Y, Yang P, Levy L, Rudolph RY, Lampe PD, Yan M, Markowitz SD, et al. (2015) Gene Expression Omnibus GSE71571 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71571)
GEO-integrated dataset | GSE23194 | [HG-U133_Plus_2] | 11 | Normal colon mucosa | Olivero M, Galatola G, Laudi C, Crispi S, Cantarella D, Morino M, Calogero RA, Direnzo MF (2010) Gene Expression Omnibus GSE23194 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23194)
GEO-integrated dataset | GSE37364 | [HG-U133_Plus_2] | 38 | Normal colon mucosa | Valcz G, Patai AV, Kalmár A, Péterfia B et al. (2014) Gene Expression Omnibus GSE37364 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37364)
GEO-integrated dataset | GSE41328 | [HG-U133_Plus_2] | 10 | Normal colon mucosa | Lin G, He X, Ji H, Shi L et al. (2006) Gene Expression Omnibus GSE41328 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41328)
GEO-integrated dataset | GSE33114 | [HG-U133_Plus_2] | 6 | Normal colon mucosa | de Sousa E Melo F, Colak S, Buikhuisen J, Koster J et al. (2011) Gene Expression Omnibus GSE33114 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33114)
GEO-integrated dataset | GSE23878 | [HG-U133_Plus_2] | 24 | Normal colon mucosa | Uddin S, Ahmed M, Hussain A, Abubaker J et al. (2011) Gene Expression Omnibus GSE23878 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23878)
GEO-integrated dataset | GSE20916 | [HG-U133_Plus_2] | 44 | Normal colon mucosa | Skrzypczak M, Goryca K, Rubel T, Paziewska A et al. (2010) Gene Expression Omnibus GSE20916 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20916)
GEO-integrated dataset | GSE15960 | [HG-U133_Plus_2] | 6 | Normal colon mucosa | Galamb O, Spisák S, Sipos F, Tóth K et al. (2010) Gene Expression Omnibus GSE15960 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15960)
GEO-integrated dataset | GSE9254 | [HG-U133_Plus_2] | 19 | Normal colon mucosa | LaPointe LC, Dunne R, Brown GS, Worthley DL et al. (2008) Gene Expression Omnibus GSE9254 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9254)
GEO-integrated dataset | GSE4183 | [HG-U133_Plus_2] | 8 | Normal colon mucosa | Galamb O, Györffy B, Sipos F, Spisák S et al. (2008) Gene Expression Omnibus GSE4183 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4183)
GEO-integrated dataset | GSE8671 | [HG-U133_Plus_2] | 32 | Normal colon mucosa | Sabates-Bellver J, Van der Flier LG, de Palo M, Cattaneo E et al. (2007) Gene Expression Omnibus GSE8671 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8671)
GEO-integrated dataset | GSE11831 | [HG-U133_Plus_2] | 17 | Normal colon mucosa | Csillag C, Nielsen OH, Borup R, Nielsen FC et al. (2007) Gene Expression Omnibus GSE11831 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11831)
GEO-integrated dataset | GSE32323 | [HG-U133_Plus_2] | 17 | Normal colon mucosa | Khamas A, Ishikawa T, Shimokawa K, Mogushi K et al. (2012) Gene Expression Omnibus GSE32323 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32323)
GEO-integrated dataset | GSE9348 | [HG-U133_Plus_2] | 12 | Normal colon mucosa | Hong Y, Downey T, Eu KW, Koh PK et al. (2010) Gene Expression Omnibus GSE9348 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9348)
GEO-integrated dataset | GSE4107 | [HG-U133_Plus_2] | 10 | Normal colon mucosa | Hong Y, Ho KS, Eu KW, Cheah PY. (2007) Gene Expression Omnibus GSE4107 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4107)
GEO-integrated dataset | GSE10714 | [HG-U133_Plus_2] | 3 | Normal colon mucosa | Galamb O, Sipos F, Solymosi N, Spisák S et al. (2008) Gene Expression Omnibus GSE10714 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10714)
GEO-integrated dataset | GSE18105 | [HG-U133_Plus_2] | 17 | Normal colon mucosa | Matsuyama T, Ishikawa T, Mogushi K, Yoshida T et al. (2010) Gene Expression Omnibus GSE18105 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18105)
GEO-integrated dataset | GSE33113 | [HG-U133_Plus_2] | 90 | Colorectal cancer | de Sousa E Melo F, Colak S, Buikhuisen J, Koster J et al. (2010) Gene Expression Omnibus GSE33113 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33113)
GEO-integrated dataset | GSE13067 | [HG-U133_Plus_2] | 74 | Colorectal cancer | Jorissen RN, Lipton L, Gibbs P, Chapman M et al. (2008) Gene Expression Omnibus GSE13067 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13067)
GEO-integrated dataset | GSE13294 | [HG-U133_Plus_2] | 155 | Colorectal cancer | Jorissen RN, Lipton L, Gibbs P, Chapman M et al. (2008) Gene Expression Omnibus GSE13294 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13294)
GEO-integrated dataset | GSE14333 | [HG-U133_Plus_2] | 157 | Colorectal cancer | Jorissen RN, Gibbs P, Christie M, Prakash S et al. (2009) Gene Expression Omnibus GSE14333 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14333)
GEO-integrated dataset | GSE17536 | [HG-U133_Plus_2] | 177 | Colorectal cancer | Smith JJ, Deane NG, Wu F, Merchant NB et al. (2010) Gene Expression Omnibus GSE17536 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17536)
GEO-integrated dataset | GSE20916 | [HG-U133_Plus_2] | 101 | Colorectal cancer | Skrzypczak M, Goryca K, Rubel T, Paziewska A et al. (2010) Gene Expression Omnibus GSE20916 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20916)
GEO-integrated dataset | GSE2109 | [HG-U133_Plus_2] | 293 | Colorectal cancer | (2005) Gene Expression Omnibus GSE2109 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse2109)
GEO-integrated dataset | GSE35896 | [HG-U133_Plus_2] | 62 | Colorectal cancer | Schlicker A, Beran G, Chresta CM, McWalter G et al. (2012) Gene Expression Omnibus GSE35896 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35896)
GEO-integrated dataset | GSE37892 | [HG-U133_Plus_2] | 130 | Colorectal cancer | Laibe S, Lagarde A, Ferrari A, Monges G et al. (2012) Gene Expression Omnibus GSE37892 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37892)
GEO-integrated dataset | GSE39582 | [HG-U133_Plus_2] | 585 | Colorectal cancer | Marisa L, de Reyniès A, Duval A, Selves J et al. (2013) Gene Expression Omnibus GSE39582 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39582)
GEO-integrated dataset | KFSYSCC | [HG-U133_Plus_2] | 307 | Colorectal cancer | -
GEO-paired dataset | GSE44076 | [HG-U219] | 196 (98 normal, 98 cancer) | Both normal and cancer | Sanz-Pamplona R, Berenguer A, Cordero D, Molleví DG et al. (2014) Gene Expression Omnibus GSE44076 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44076)
Single cell RNA-seq dataset | GSE81861 | Illumina HiSeq 2000 | 432 (160 normal, 272 cancer) | Both normal and cancer | Li H, Courtois ET, Sengupta D, Tan Y et al. (2017) Gene Expression Omnibus GSE81861 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81861)
TCGA dataset | - | - | 698 (51 normal, 647 cancer) | Both normal and cancer | -
Supplementary Table 2. State-specific gene sets used for gene set enrichment analysis. Supplementary Table 2 is provided as a separate file in a spreadsheet format.
Supplementary Table 3. Statistics for the differences of gene expression level and transcriptional activity of the core transcription factors between normal cells and cancer cells.
GEO-integrated dataset
Gene expression level of the core transcription factors
CDX2 ELF3 HNF4G PPARG VDR
statistic 10.69473218 -0.27991229 18.9951185 15.7165582 35.70163876
dm 0.536065923 -0.01096842 0.949724098 0.844554083 1.296283306
p.value 3.96E-26 0.77956812 3.15E-75 3.90E-53 1.17E-225
Transcriptional activity of the core transcription factors (inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 41.72789397 35.73218089 40.40599845 56.91114407 39.2084793
dm 1.556328825 1.335448951 1.564461287 1.68932893 1.479790124
p.value 8.36E-289 5.66E-226 9.28E-275 0 4.08E-262
Transcriptional activity of the core transcription factors (inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 46.39576685 46.62009775 45.79826922 62.61198087 49.68043703
dm 1.5641431 1.561384625 1.58824364 1.599375646 1.615947558
p.value 0 0 0 0 0
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 38.30887338 28.94046032 34.93541657 46.41666984 35.78563691
dm 0.832416882 0.590634435 0.658375335 0.873905633 0.672654616
p.value 1.12E-252 6.20E-159 7.84E-218 0 1.60E-226
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 35.09524829 32.77176544 36.92188928 51.67591474 40.47372016
dm 1.095832079 1.093964678 0.834339496 0.747114108 1.208036378
p.value 1.84E-219 4.69E-196 3.09E-238 0 1.78E-275
GEO-paired dataset
Gene expression level of the core transcription factors
CDX2 ELF3 HNF4G PPARG VDR
statistic 2.20562556 1.02542367 6.863472471 8.208834982 10.59242257
dm 0.271255102 0.128129592 0.749781633 0.860363265 1.001171429
p.value 0.028584775 0.306440214 8.83E-11 3.06E-14 5.57E-21
Transcriptional activity of the core transcription factors (inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 14.7116911 12.88483083 14.12615278 20.13903245 13.47782902
dm 1.562898255 1.342930046 1.513515714 2.145258496 1.400957751
p.value 2.12E-33 7.52E-28 1.27E-31 2.03E-49 1.19E-29
Transcriptional activity of the core transcription factors (inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 15.72344139 14.78518976 15.57701892 15.31153259 17.67934757
dm 1.488905195 1.435685995 1.539381305 1.603325312 1.693519582
p.value 1.85E-36 1.27E-33 5.10E-36 3.23E-35 2.68E-42
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 14.69408576 13.91245304 16.4467919 23.23589747 18.91836501
dm 1.193446551 1.151831565 0.854344098 1.115462689 1.448346815
p.value 9.02E-43 5.31E-39 1.36E-51 7.52E-89 1.02E-64
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 13.55443476 10.69792629 14.81879947 22.85192316 13.30928806
dm 1.103688083 0.704405943 0.821973535 1.159182815 0.783557499
p.value 2.59E-37 7.83E-25 2.21E-43 1.15E-86 3.60E-36
TCGA dataset
Gene expression level of the core transcription factors
CDX2 ELF3 HNF4G PPARG VDR
statistic 1.850584092 1.61995657 7.51564191 4.886970402 11.07323336
dm 0.326161729 0.152009336 0.841597477 0.599766811 1.024741445
p.value 0.064652966 0.105694577 1.75E-13 1.27E-06 2.32E-26
Transcriptional activity of the core transcription factors (inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 10.42687764 8.512523836 11.47089852 14.20163248 10.92893464
dm 0.710771495 0.57664583 0.802963512 0.911107815 0.637488983
p.value 9.44E-24 1.04E-16 5.09E-28 2.20E-40 9.06E-26
Transcriptional activity of the core transcription factors (inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 13.25263534 12.20855149 16.74341816 16.14569644 15.41648611
dm 0.808598482 0.788867198 0.945253546 0.797628655 0.912970501
p.value 6.58E-36 3.37E-31 3.95E-53 4.81E-50 2.38E-46
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 16.67919501 14.13717554 14.60667781 21.99732743 13.801814
dm 1.049325423 0.776367367 0.611777134 1.041892253 0.765029478
p.value 2.50E-39 1.18E-31 4.42E-33 1.33E-54 1.23E-30
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 13.6229712 12.68273866 12.90965964 18.15256124 16.75612911
dm 1.207318893 1.189132501 0.803333128 0.955618561 1.484688643
p.value 4.31E-30 3.08E-27 6.32E-28 1.09E-43 1.47E-39
Single cell RNA-seq dataset
Gene expression level of the core transcription factors
CDX2 ELF3 HNF4G PPARG VDR
statistic 26.82099459 6.558394416 3.99202638 15.35351893 25.01643239
dm 0.624101918 0.648971563 0.027649821 0.40625563 0.471720276
p.value 7.64E-94 1.56E-10 7.70E-05 9.89E-43 6.63E-86
Transcriptional activity of the core transcription factors (inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 17.15267181 16.80751284 16.32338629 15.88131642 15.61006688
dm 1.107044628 1.1128358 1.085570903 1.01851108 1.020206844
p.value 1.27E-50 4.29E-49 5.83E-47 5.03E-45 7.64E-44
Transcriptional activity of the core transcription factors (inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 19.73716781 16.35767972 19.47534317 14.917533 16.95096755
dm 1.264901405 1.068227106 1.263164344 0.933674995 1.094555468
p.value 3.32E-62 4.12E-47 5.03E-61 7.46E-41 9.94E-50
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-integrated normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 21.403009 20.78112317 21.38801285 23.54853846 20.10508557
dm 0.444017111 0.585597828 0.185252537 0.419437979 0.381470771
p.value 1.01E-69 6.48E-67 1.18E-69 2.29E-79 7.28E-64
Average gene expression of target genes of the core transcription factors (target genes inferred based on GEO-paired normal GRN)
CDX2 ELF3 HNF4G PPARG VDR
statistic 23.85053526 21.28693763 24.15705599 20.66630722 21.37577609
dm 0.668953341 0.790239087 0.349984256 0.221123007 0.817355903
p.value 1.02E-80 3.38E-69 4.38E-82 2.14E-66 1.34E-69
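In Supplementary Table 4, which follows, the p-value column appears to be consistent with a two-sided p-value computed from the normalized enrichment score (NES) under a standard normal null distribution, and the adjacent column with its negative base-10 logarithm. The Python sketch below reproduces the first row under that assumption; whether the values were originally computed this way is not stated in the source, and the function name is hypothetical.

```python
from math import erfc, log10, sqrt


def nes_to_two_sided_p(nes: float) -> float:
    """Two-sided p-value of a z-like score under a standard normal null.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return erfc(abs(nes) / sqrt(2.0))


# NES of the first row (WEE1) of Supplementary Table 4.
wee1_nes = -3.549173254
wee1_p = nes_to_two_sided_p(wee1_nes)
wee1_neg_log10_p = -log10(wee1_p)
print(wee1_p, wee1_neg_log10_p)
```

Under this assumption the computed values agree with the tabulated p-value (0.000386443) and its negative logarithm (3.412914859) for WEE1 to within rounding.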
Supplementary Table 4. Protein activity analysis using VIPER.
Gene NES -log10(p-value) p-value
WEE1 -3.549173254 3.412914859 0.000386443
GPX1 -3.529568401 3.380658178 0.000416238
SMYD3 -3.496855425 3.32718486 0.000470777
UBA2 -3.406411922 3.181623746 0.000658228
PPP1R14B -3.384150606 3.146308922 0.000713988
AJUBA -3.360652919 3.109252278 0.000777585
COPS7B -3.346616469 3.087223833 0.000818043
SPRY4 -3.340551877 3.077731084 0.000836121
NTMT1 -3.327013997 3.056594707 0.00087782
SLC39A10 -3.324082464 3.052027614 0.0008871
CDK4 -3.319172692 3.044386435 0.000902846
MIF -3.299914018 3.014508613 0.000967145
ZC3HC1 -3.272621706 2.972426221 0.00106555
CTU2 -3.269790845 2.968078639 0.00107627
TTLL4 -3.263550441 2.958506267 0.001100256
CSNK1G2 -3.262768669 2.957308197 0.001103295
TESC -3.225970608 2.901196258 0.001255462
PSMC2 -3.213261916 2.881945186 0.001312366
NUP62 -3.178945011 2.830289926 0.001478121
UBE2I -3.176390627 2.826464082 0.0014912
CDC25B -3.171750382 2.819520905 0.001515232
CBFB -3.168591755 2.814799665 0.001531794
RYK -3.167660269 2.813408135 0.00153671
TRIM28 -3.164372195 2.808498953 0.001554179
HDAC2 -3.125509695 2.750808447 0.001774972
PTPN12 -3.12046471 2.743364186 0.001805659
CETN2 -3.09995667 2.71320917 0.00193549
CCND1 -3.096410869 2.708012698 0.001958787
B3GALT6 -3.094509079 2.705227669 0.001971389
HSP90AB1 -3.092394554 2.702132824 0.001985488
NEK2 -3.088926825 2.697061333 0.002008809
CBX4 -3.083839599 2.689630166 0.002043477
MEAF6 -3.073105391 2.673984535 0.002118437
LIG1 -3.061932706 2.657749321 0.002199129
PHF19 -3.050742301 2.64153897 0.002282764
NAA10 -3.049310578 2.639468638 0.002293672
RIPK2 -3.036815442 2.621435325 0.002390918
BARD1 -3.031084 2.613184665 0.002436774
RIOK1 -3.023942285 2.602922432 0.00249504
BAZ1B -3.019047477 2.595900775 0.002535708
SKP2 -3.012528704 2.586564547 0.002590809
MLLT1 -3.01123508 2.584713851 0.002601873
DPY30 -3.010932476 2.584281035 0.002604468
PPIL1 -2.979939312 2.540147079 0.002883055
CDK2 -2.957723001 2.508749716 0.003099205
SOX4 -2.951944718 2.500616123 0.003157795
TP53BP1 -2.951433627 2.499897352 0.003163025
MET -2.951262412 2.499656587 0.003164779
YEATS2 -2.940995485 2.485240726 0.003271593
PREB -2.939821908 2.483595604 0.003284009
MED1 -2.927324054 2.466110494 0.003418924
CAD -2.925303682 2.463289801 0.003441202
CKS2 -2.915733925 2.449951541 0.00354853
RNF220 -2.915366513 2.44944018 0.003552711
TRIM32 -2.914844582 2.448713853 0.003558657
PSMB4 -2.910008741 2.441989443 0.003614186
SETDB1 -2.908148943 2.439405832 0.003635751
ILF3 -2.906677651 2.437362917 0.003652894
KDM2B -2.90550857 2.435740246 0.003666568
PPIL3 -2.902915395 2.432142915 0.003697065
NEK6 -2.898683319 2.426277859 0.003747332
PTPN11 -2.896536526 2.423305464 0.003773067
DCAF13 -2.866747047 2.382250949 0.004147143
UBE2T -2.863994598 2.378475647 0.004183351
KDM1A -2.861129166 2.374548608 0.00422135
RPAP2 -2.859811426 2.37274377 0.00423893
FKBP14 -2.855745765 2.367179645 0.004293588
SOCS7 -2.851646271 2.361575938 0.004349347
CDK6 -2.84732014 2.355669752 0.0044089
DUSP10 -2.84439913 2.351686135 0.004449527
CLK2 -2.818722963 2.316816819 0.004821511
NMT1 -2.80985171 2.304830751 0.004956433
NAA20 -2.802728833 2.295229774 0.005067225
UBE3D -2.792322055 2.281238957 0.005233124
MRE11A -2.789695691 2.277714945 0.00527576
PRMT1 -2.775125945 2.258215673 0.005518033
GLMN -2.760101668 2.238197088 0.005778338
SMYD2 -2.754726887 2.231057582 0.005874115
IRAK1 -2.726853737 2.194218028 0.006394138
PRKRA -2.723022064 2.189178038 0.006468774
MARCH9 -2.719483733 2.184529105 0.006538391
DSTYK -2.707772829 2.16917808 0.006773637
EHMT2 -2.700965147 2.160279524 0.006913858
ADCY3 -2.687279218 2.142446147 0.007203671
ITGA2 -2.685822757 2.140552712 0.007235146
PRDX1 -2.668693518 2.118347781 0.00761469
CCNB1IP1 -2.668563486 2.118179666 0.007617638
FGFRL1 -2.658955696 2.105776624 0.007838327
WDR75 -2.655533109 2.10136717 0.007918316
EIF2A -2.638851789 2.079942759 0.008318734
C1QTNF1 -2.619141017 2.0547703 0.00881515
FUT1 -2.612330994 2.046109206 0.008992714
UGGT2 -2.60748243 2.039953966 0.009121075
ENY2 -2.606052504 2.038140464 0.009159242
IQGAP3 -2.598206665 2.02820445 0.009371207
AASDHPPT -2.590843001 2.018901321 0.009574116
UHRF1 -2.586514431 2.013442733 0.009695211
USP36 -2.582924337 2.008921053 0.009796681
KAT2A -2.582495456 2.008381225 0.009808865
ACP1 -2.57914256 2.004163488 0.00990459
Only proteins with a p-value below 0.01 in the multi-sample VIPER (msVIPER) analysis are shown. NES, normalized enrichment score.
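The p-value and log-transformed columns of Supplementary Table 4 are numerically consistent with a two-sided standard-normal tail probability of the NES. The Python sketch below reproduces the first row (WEE1) under that assumption. The helper name `nes_to_pvalue` is hypothetical and is not a function of the 'viper' package.

```python
import math


def nes_to_pvalue(nes: float) -> float:
    """Two-sided p-value for an NES, ASSUMING NES ~ N(0, 1) under the null."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal z.
    return math.erfc(abs(nes) / math.sqrt(2.0))


# First row of Supplementary Table 4 (WEE1).
nes = -3.549173254
p = nes_to_pvalue(nes)
neg_log10_p = -math.log10(p)

# The table lists only proteins with p < 0.01.
passes_cutoff = p < 0.01
print(f"p-value = {p:.6g}, -log10(p) = {neg_log10_p:.4f}, listed = {passes_cutoff}")
```

The same conversion can be applied to any other row to check that its three numeric columns agree with one another.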
Supplementary Table 5. Primer sequences for quantitative PCR and target sequences of small interfering RNA (siRNA) and short hairpin RNA (shRNA).
Primer sequences for the RT-PCR
Name      Forward primer (5’->3’)            Reverse primer (5’->3’)
AQP8      GCC ATC AAT GAG AAG ACA AAG G      CAC CTA ATG AGC AGT CCA ACA A
B-actin   AGA GCT ACG AGC TGC CTG AC         AGC ACT GTG TTG GCG TAC AG
CA1       CCA AAC ATG ACA CCT CTC TGA A      AGC TCG GCA GAA TAT TTG ACT C
CD44      CTG CCG CTT TGC AGG TGT A          CAT TGT GGG CAA GGT GCT ATT
CDX2      TTC ACT ACA GTC GCT ACA TCA CC     TCT GGG ACA CTT CTC AGA GGA C
CEACAM5   CAG ATC AGG GGA AAA TCT GAA C      TCG TGA CTG TGG TCC TAT TGA G
EHMT2     TGC GTG CTG TTA TTC CTG TC         TGA TCT TCT CTG TGC GGA TG
ELF3      CAA AGA GTA CTG GGA CTG TCT CG     CTC AGC TTC TCG TAG GTC ATG TT
FABP1     GCA GAG CCA GGA AAA CTT TG         TCT CCC CTG TCA TTG TCT CC
FABP2     AAG CTT GCA GCT CAT GAC AA         TCC ATT GTC TGT CCG TTT GA
GAPDH     TGA TGA CAT CAA GAA GGT GGT GAA G  TCC TTG GAG GCC ATG TGG GCC AT
HDAC2     GAG GTG GCT ACA CAA TCC GTA        ACA CCA GGT GCA TGA GGT AAC
HNF4G     GTT TCT TCA GAC GCA GCA TTC        AAC TTC AGC TTG TGC CAG TGT
KAT2A     GCT GAC CAC GTA TCC CAC TT         ATG CAT TTC CGC AGT AGC TT
KDM1A     TCA ACT CTC TCC CTT AAG CAC TG     CAC AGC TAT CAC TTC ACA TCC TG
KRT20     ACG CCA GAA CAA CGA ATA CC         ACG ACC TTG CCA TCC ACT AC
LGR5      CTC CCA GGT CTG GTG TGT TG         GAG GTC TAG GTA GGA GGT GAA G
MYC       GGC TCC TGG CAA AAG GTC A          CTG CGT AGT TGT GCT GAT GT
PPARG     ATC TCT CCG TAA TGG AAG ACC A      AGG CTC TTC ATG AGG CTT ATT G
PRMT1     ACC GCA ACT CCA TGT TTC A          CCA CCT TCC CCT TGA TGA T
SETDB1    CCG GCC TAC AGA AAT AAT TGA G      CAA GGT TCC TTT ATG CAG ATC C
SMYD2     GTG TAC CAC CAA GGA CAA GGA T      AGA TCT CCA GCA GCT CAC TAG G
SMYD3     AGA ACT GAA GGC ACA CTG GAA        TTC ATC ACT TGA ACC CCT CTG
VDR       AGA TGA CCC TTC TGT GAC CCT A      ATG GCA CTT GAC TTC AGC AGT A
Primer sequences for the ChIP-qPCR
Name        Forward primer (5’->3’)            Reverse primer (5’->3’)
KRT20       TAG AGG CAT TTC TTG GAG GCT TT     CGC ATG GGC TAA AAG AAT ACA CT
FABP1       CTG GGC CTC CTC TAC TCT GT         GTG ACA CTA GGC CAT CCA GG
E-cadherin  TGT AAA AGC CCT TTC TGA TCC CA     TTG CTA GGG TCT AGG TGG GTT AT
MUC2        CTC TAG TAG CTG CAT GTG TTC CC     ATC GGA GGT CAT CTC TTT ATG GC
Target sequences of siRNA
Name    Sense (5’->3’)       Antisense (5’->3’)   PMID
EHMT2   CACACACCGACCAGAGA    UCUCUGGUCGGUGUGUG    20940408
HDAC2   CGGGGCAACAACA        UGUUGUUGCCCCG        16642021
KAT2A   GGAAAUGCAUCCUGCAGAU  AUCUGCAGGAUGCAUUUCC  25770212
KDM1A   CACAAGGAAAGCAGAAGA   UCUUCUGCUUUCCUUGUG   16079795
PRMT1   GGACAUGACAUCCAAAGAU  AUCUUUGGAUGUCAUGUCC  18951090
SETDB1  CGGGGGCGCCAAAACAA    UUGUUUUGGCGCCCCCG    26471002
SMYD2   GAUUUGAUUCAGAGUGACA  UGUCACUCUGAAUCAAAUC  22787429
SMYD3   AGCCUGAUUGAAGAUUUGA  UCAAAUCUUCAAUCAGGCU  22194464
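Each antisense strand listed for the siRNAs is the RNA reverse complement of its sense strand. The Python sketch below checks this for three of the listed pairs. The function name `rna_reverse_complement` is an illustrative helper and does not come from the source.

```python
def rna_reverse_complement(seq: str) -> str:
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


# (sense, antisense) pairs copied from the siRNA table.
sirna = {
    "EHMT2": ("CACACACCGACCAGAGA", "UCUCUGGUCGGUGUGUG"),
    "KAT2A": ("GGAAAUGCAUCCUGCAGAU", "AUCUGCAGGAUGCAUUUCC"),
    "SMYD3": ("AGCCUGAUUGAAGAUUUGA", "UCAAAUCUUCAAUCAGGCU"),
}

# Map each siRNA name to whether its antisense matches the computed strand.
checks = {
    name: rna_reverse_complement(sense) == antisense
    for name, (sense, antisense) in sirna.items()
}
print(checks)
```

A check of this kind is a quick way to catch transcription errors when the sequences are copied into an ordering form.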
Target sequences of shRNA
Name        Target sequence                                              Catalog number
shSETDB1#1  CCGGGCTCAGATGATAACTTCTGTACTCGAGTACAGAAGTTATCATCTGAGCTTTTTG   TRCN0000276169
shSETDB1#2  CCGGGCTCAGATGATAACTTCTGTACTCGAGTACAGAAGTTATCATCTGAGCTTTTTTG  TRCN0000148112
shSETDB1#3  CCGGAGTTAGAGACATGGGTAATACCTCGAGGTATTACCCATGTCTCTAACTTTTTTG   TRCN0000276105
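The shRNA oligonucleotides above appear to follow a hairpin layout of a CCGG overhang, a 21-nt sense stem, a CTCGAG loop, a 21-nt antisense stem, and a run of T's followed by G. That layout is an assumption inferred from the sequences themselves and is not stated in the source. The Python sketch below splits shSETDB1#3 on that basis and checks that the two stems are reverse complements. The names `HAIRPIN`, `split_hairpin`, and `dna_reverse_complement` are hypothetical.

```python
import re

# ASSUMED layout: CCGG + 21-nt sense + CTCGAG + 21-nt antisense + T-run + G.
HAIRPIN = re.compile(r"^CCGG([ACGT]{21})CTCGAG([ACGT]{21})T{5,}G$")


def dna_reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def split_hairpin(oligo: str) -> tuple[str, str]:
    """Split an oligo into (sense, antisense) stems under the assumed layout."""
    match = HAIRPIN.match(oligo)
    if match is None:
        raise ValueError("oligo does not fit the assumed hairpin layout")
    return match.group(1), match.group(2)


# shSETDB1#3 (TRCN0000276105), copied from the shRNA table.
oligo = "CCGGAGTTAGAGACATGGGTAATACCTCGAGGTATTACCCATGTCTCTAACTTTTTTG"
sense, antisense = split_hairpin(oligo)
stems_pair = dna_reverse_complement(sense) == antisense
print(sense, antisense, stems_pair)
```

The T-run is matched with `T{5,}` so that oligos with a slightly longer terminator, such as shSETDB1#2, are also accepted.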