Supplementary data to
Cross-species common regulatory network inference
without requirement for prior gene affiliation
Amin Moghaddas Gholami 1,2 and Kurt Fellenberg 1
1 Chair of Proteomics and Bioanalytics, Center for Integrated Protein Sciences Munich (CIPSM), Technische Universität München, Emil Erlenmeyer Forum 5, 85354 Freising, Germany.
2 Functional Genome Analysis, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany.
S1. Datasets and preprocessing
The pair of normalized yeast microarray gene expression datasets described in the results section of the
main text (Sce/Spo) was downloaded from the publicly accessible ArrayExpress data repository
(Cipollina et al., 2008b; Parkinson et al., 2009) as well as from the authors' web resource. After
log2-transformation, genes for which more than 50% of the data were missing were discarded. The remaining
missing values were imputed by the k-nearest neighbor algorithm (Troyanskaya et al., 2001) and spline
interpolation (Bar-Joseph et al., 2003), both of which are commonly used for time-series data. Significantly
periodically expressed genes, at a false discovery rate (FDR) of 0.05, were extracted using an AR(1)-based
background model (Futschik and Herzel, 2008).
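The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `filter_and_impute`, the toy matrix and the choice of `k` are assumptions, and the unweighted neighbor mean is a simplification of the k-nearest neighbor scheme of Troyanskaya et al. (2001). Spline interpolation and the AR(1) background model are not shown.

```python
import numpy as np

def filter_and_impute(expr, max_missing=0.5, k=3):
    """Hypothetical helper. expr: genes x samples raw intensities, NaN = missing."""
    logged = np.log2(expr)
    # Discard genes for which more than `max_missing` of the values are missing.
    keep = np.isnan(logged).mean(axis=1) <= max_missing
    data = logged[keep].copy()
    # Genes without any missing value serve as the neighbor pool.
    complete = data[~np.isnan(data).any(axis=1)]
    for row in data:  # each row is a view, so assignments modify `data`
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Root-mean-square distance over the columns observed for this gene.
        dist = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).mean(axis=1))
        nearest = complete[np.argsort(dist)[:k]]
        # Simplification: unweighted mean of the k nearest complete genes.
        row[miss] = nearest[:, miss].mean(axis=0)
    return data, keep

raw = np.array([
    [1.0, 2.0, 4.0, 8.0],
    [2.0, 4.0, 8.0, 16.0],
    [4.0, 8.0, 16.0, 32.0],
    [1.0, 2.0, np.nan, 8.0],        # 25% missing: kept and imputed
    [np.nan, np.nan, np.nan, 4.0],  # 75% missing: discarded
])
imputed, kept = filter_and_impute(raw, k=1)
```

With `k=1` the incomplete fourth gene borrows its missing value from the first gene, which matches it exactly on the observed time points.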
The second pair of datasets (human/mouse) was downloaded from Gene Expression Omnibus (Barrett et
al., 2009). The studies investigated the effect of estradiol on human (Stossi et al., 2004) and murine
cells (Moggs et al., 2004). Stossi et al. examined U2OS human osteosarcoma cells expressing either
estrogen receptor (ER) alpha or beta after treatment with estradiol (E2) for various periods of time up to
48 hours (10 time points in total). They generated U2OS cells stably expressing ESR1 or ESR2 at levels
comparable to those in osteoblasts. The response to E2 over time was measured using
Affymetrix GeneChip microarrays. Moggs and coworkers recorded the uterine response of immature
mice subcutaneously injected with 17β-estradiol (E2) or arachis oil (AO) at various time points up to 72
hours following treatment.
Datasets were normalized by variance stabilization (Huber et al., 2002). Differentially expressed genes
were extracted using the eBayes method of the limma package (Smyth, 2004). To adjust for multiple
testing, we calculated the FDR using the algorithm of Benjamini and Hochberg (Benjamini and
Hochberg, 1995).
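For illustration, the Benjamini and Hochberg adjustment can be written in a few lines. The sketch below is a plain-Python version of the standard step-up procedure. The function name and the example p-values are made up for demonstration, and it is not the implementation used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
adj = benjamini_hochberg(pvals)
significant = [a <= 0.05 for a in adj]
```

Genes whose adjusted value does not exceed the chosen threshold (0.05 in this sketch) would be reported as differentially expressed.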
S2. Co-inertia analysis and alternative methods
Integrating datasets into simultaneous analysis is a major challenge in systems biology. It is crucial to
capture the associations between variables from different high-throughput multidimensional datasets.
Different techniques exist to investigate the associations between large-scale datasets. Canonical
Correlation Analysis (CCA; Gittins, 1985), Partial Least Squares (PLS; Wold, 1966) and Co-inertia
Analysis (CIA; Dolédec and Chessel, 1994) transform high-dimensional data into a few, usually two or
three, dimensions for visualization.
PLS is a correlation-based method. It explains relationships between two datasets by simultaneously
decomposing the data matrices into low-dimensional vectors. CCA is a special case of PLS, identifying
linear combinations of variables from each set such that they have maximum correlation. However,
CCA and PLS often suffer from the asymmetry of microarray datasets, where the number of variables
exceeds the number of samples.
Penalized CCA adapted with Elastic Net (CCA-EN; Cao et al., 2008; Waaijenborg et al., 2008) and
Sparse CCA (SCCA; Parkhomenko et al., 2009) are derivatives of classical CCA. They incorporate
variable selection to address the above limitation of the earlier approaches. However, if invoked repeatedly
from within an unsupervised iterative algorithm, they become computationally infeasible for large numbers
of variables.
In contrast, CIA can cope with both asymmetry and large numbers of variables without becoming
computationally infeasible. CIA is a multivariate coupling approach measuring the adequacy between
datasets. It was first applied to ecological data (Dolédec and Chessel, 1994) and to amino acid
properties (Thioulouse and Lobry, 1995). Culhane and co-workers demonstrated the efficiency of CIA
on cross-platform comparisons of gene expression data, applying it to both cDNA and Affymetrix
microarrays (Culhane et al., 2003). An extension of CIA that links more than two tables has been
reported (Dray et al., 2003). Fagan and co-workers combined information from multiple layers (genes,
samples and GO terms) by CIA (Fagan et al., 2007).
In our algorithm we used CIA because of its visual interpretability and its speed, and because its applicability
to asymmetric microarray data has been demonstrated in many studies (Culhane et al., 2003; Fagan et
al., 2007; Jeffery et al., 2007; Singh et al., 2007).
S2.1 Co-inertia definition
The mathematical basis of CIA, following the notation of Dolédec and coworkers (Culhane et al., 2003;
Dolédec and Chessel, 1994; Jeffery et al., 2007), is summarized below.
Let X and Y be the original data tables, with n rows and, respectively, p and q columns. The two
statistical triplets produced by the ordination methods performed on the datasets are denoted (X, Dn, Dp)
and (Y, Dn, Dq), with Dn and Dp being diagonal matrices containing the row and column weights for X, and
Dn and Dq diagonal matrices containing the row and column weights for Y. After diagonalization, let u and v
be a pair of eigenvectors for (X, Dn, Dp) and (Y, Dn, Dq), respectively. The projection of the
multidimensional space associated with X onto vector u generates n coordinates in a column matrix:
$$\alpha = X D_p u \qquad [1]$$
The projection of the multidimensional space associated with table Y onto vector v generates n
coordinates in a column matrix:
$$\psi = Y D_q v \qquad [2]$$
Co-inertia associated with the pair of vectors u and v can be written as
$$\alpha^{T} D_n \psi = u^{T} D_p X^{T} D_n Y D_q v \qquad [3]$$
If the initial data tables are centered, then the co-inertia is the covariance between the two new scores:
$$\mathrm{Cov}(\alpha, \psi) = \sqrt{\eta_1(u)} \, \sqrt{\eta_2(v)} \, \mathrm{Corr}(\alpha, \psi) \qquad [4]$$
with η1(u) denoting the projected inertia onto vector u (i.e. the variance of the new scores on u), η2(v) the
projected inertia onto vector v (i.e. the variance of the new scores on v), and Corr(α,ψ) the correlation
between the two coordinate systems. A CIA axis associated with a pair of eigenvectors u and v will
maximize Cov(α,ψ).
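The relation between co-inertia, projected inertia and correlation can be checked numerically. The sketch below makes simplifying assumptions that are not taken from the text: uniform row weights (Dn = I/n), unit column weights (Dp and Dq equal to the identity), and random toy tables. Under these assumptions, the first pair of singular vectors of the cross-covariance matrix yields the axis pair maximizing Cov(α,ψ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 6, 4, 3
X = rng.normal(size=(n, p))  # toy table X (n rows, p columns)
Y = rng.normal(size=(n, q))  # toy table Y (n rows, q columns)
X -= X.mean(axis=0)          # center the columns of both tables
Y -= Y.mean(axis=0)

# Assumed weights: uniform rows, unit columns.
Dn = np.eye(n) / n
Dp = np.eye(p)
Dq = np.eye(q)

# With identity column weights, the SVD of X^T Dn Y gives the axis pairs.
U, s, Vt = np.linalg.svd(X.T @ Dn @ Y)
u, v = U[:, 0], Vt[0]

alpha = X @ Dp @ u              # scores of the rows of X on u
psi = Y @ Dq @ v                # scores of the rows of Y on v
coinertia = alpha @ Dn @ psi    # co-inertia of the pair (u, v)

eta1 = alpha @ Dn @ alpha       # projected inertia (variance) on u
eta2 = psi @ Dn @ psi           # projected inertia (variance) on v
corr = np.corrcoef(alpha, psi)[0, 1]

# Co-inertia equals sqrt(eta1) * sqrt(eta2) * Corr(alpha, psi).
identity_holds = np.isclose(coinertia, np.sqrt(eta1) * np.sqrt(eta2) * corr)
```

In this setting the co-inertia of the first axis pair coincides with the largest singular value `s[0]`, which is one way to see that the pair maximizes the covariance.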
S3. Connecting variable affiliations in Co-inertia analysis
Measuring the associations between samples by CIA requires, as a prerequisite, the affiliation of each gene
from one dataset to one gene of the other dataset. Conversely, projecting the common variance between the
genes of both datasets requires prior matching between the samples of the two datasets. Therefore, either
the columns or the rows of the tables (the connecting variables) must be matchable and have to be weighted
similarly. As a basis for a suitable co-inertia analysis, this matching needs to be both complete and
reliable. We used the Hungarian algorithm to affiliate the connecting variables in CIA.
Distance matrix. Given two microarray datasets, let dataset R contain the samples {r1, r2, …, rj} and
dataset B the samples {b1, b2, …, bj}. The CIA sample distances can be projected onto j-1 dimensions and
plotted (Figure S1, top left). Let ω be the (i × j) distance matrix derived from CIA, recording the
distance ω(ribj) between each pair of elements of R and B:
$$\omega = \begin{pmatrix} \omega(r_1 b_1) & \cdots & \omega(r_1 b_j) \\ \vdots & \ddots & \vdots \\ \omega(r_i b_1) & \cdots & \omega(r_i b_j) \end{pmatrix} \qquad [5]$$
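As a small illustration of how such a distance matrix could be filled, the sketch below computes Euclidean distances between two sets of projected sample coordinates. The coordinates are invented for demonstration, and the choice of Euclidean distance is an assumption, since the text does not specify the metric.

```python
from math import dist

# Hypothetical CIA coordinates of three samples per dataset (two axes shown).
R_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
B_coords = [(0.0, 1.0), (1.0, 1.0), (3.0, 2.0)]

# omega[i][j] holds the distance between sample r_(i+1) and sample b_(j+1).
omega = [[dist(r, b) for b in B_coords] for r in R_coords]
```

Each row of `omega` then lists the distances from one sample of R to every sample of B.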
Weighted bipartite graph. The above distance matrix can be viewed as a bipartite graph in which each
vertex belongs to dataset R or B, and each edge corresponds to one element of the distance matrix ω, so
that the edges represent all inter-set distances of the CIA coordinates.
The affiliation problem can then be stated as follows: given ω, find a permutation π of {1, …, j}, i.e. j
independent elements of ω, such that the sum of the selected edge weights [6] is minimal:
$$\sum_{i=1}^{j} \omega(r_i b_{\pi(i)}) = \min \qquad [6]$$
In a weighted bipartite graph where edge r->b has weight ω(rb), the optimal assignment minimizes
the total weight. Figure S1 shows a graphical representation of the above.
[Figure S1 image. Panels: CIA distances in j-1 dimensions; bipartite graph between R (r1, r2, r3) and B (b1, b2, b3) in which distances become edge weights; weight matrix ω with rows r1 to r3, columns b1 to b3 and entries ω1,1 to ω3,3; matching of minimal total weight returned by the Hungarian algorithm.]
Figure S1. Affiliation of the connecting variables using the Hungarian algorithm. Samples from the R and B datasets are
represented as red and blue squares, respectively. For simplicity, only samples are shown, projected into 3-dimensional space
(top left). In the bipartite graph (top right), edges correspond to all pair-wise projected distances (weights) from every
element of R to all elements of B. Each edge corresponds to one element in the weight matrix ω recording these
distances. The Hungarian algorithm computes a matching of minimal distances (lower bipartite graph and lower 3D
plot).
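The assignment problem illustrated in Figure S1 can be made concrete with a toy example. The sketch below does not implement the Hungarian algorithm itself. It searches all permutations exhaustively, which is feasible only for very small j but reaches the same minimum that the Hungarian algorithm finds in polynomial time. The weight matrix and the function name are invented for illustration.

```python
from itertools import permutations

def min_weight_matching(omega):
    """Exhaustive search for the permutation with minimal summed edge weight."""
    j = len(omega)

    def total(pi):
        # Sum of the weights omega[i][pi[i]] selected by permutation pi.
        return sum(omega[i][pi[i]] for i in range(j))

    best = min(permutations(range(j)), key=total)
    return best, total(best)

# Hypothetical 3 x 3 weight matrix (rows: r1..r3, columns: b1..b3).
omega = [
    [1.0, 0.2, 2.0],
    [0.3, 1.5, 1.9],
    [2.1, 1.8, 0.4],
]
pi, weight = min_weight_matching(omega)
# pi[i] is the index of the B sample affiliated with the i-th R sample.
```

Here r1 is affiliated with b2, r2 with b1 and r3 with b3. Every row and every column is used exactly once, which corresponds to the j independent elements of ω required by the problem statement.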
S4. Cell-cycle data – cerevisiae versus pombe
Figure S2. a) ‘Sce’ and ‘Spo’ projected by CIA. The affiliated samples of both datasets are connected by lines, the lengths of
which indicate the divergence between the two datasets. Each end of a line marks the position of a sample (time point) in
the projection. Each blue or red dot represents a gene of ‘Sce’ or ‘Spo’, respectively, its position determined by its relative
expression across all samples. The genes that are projected in the same direction from the centroid as a sample are those which are
highly expressed in that sample. b) ‘Spo’ dataset projected by CA. c) ‘Sce’ dataset projected by CA.
Eigenvalues are shown in the bottom corner for each dataset, normalized to 100%. The first two (x and y) axes of ‘Sce’
explain 49% and 33% of the total inertia within this dataset. The first two axes of ‘Spo’ represent 64% and 30% of the total
variance within ‘Spo’. More than 80% of the variance of the CIA was thus accounted for by the first two co-inertia axes,
which therefore present a good summary of the co-structure between the two datasets.
Table S1 shows the affiliated gene-cluster memberships. Matched clusters are represented as rows.
Orthologous genes are sorted by affiliation and printed in bold. Many-to-many relations are shown in
brackets, and histones are marked in italics.
Table S1. Cluster components (genes) of the affiliated gene clusters.
Columns: node† | ‘Sce’ gene clusters | ‘Spo’ gene clusters
† The transcription factor itself is a cluster member.
References:
Akache, B. and Turcotte, B. (2002) New regulators of drug sensitivity in the family of yeast zinc cluster proteins, J Biol Chem, 277, 21254-21260.
Aoki-Kinoshita, K.F. and Kanehisa, M. (2007) Gene annotation and pathway mapping in KEGG, Methods Mol Biol, 396, 71-91.
Bader, G.D. et al. (2003) BIND: the Biomolecular Interaction Network Database, Nucleic Acids Res, 31, 248-250.
Bader, G.D. and Hogue, C.W. (2000) BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways, Bioinformatics, 16, 465-477.
Bar-Joseph, Z. et al. (2003) Continuous representations of time-series gene expression data, J Comput Biol, 10, 341-356.
Barrett, T. et al. (2009) NCBI GEO: archive for high-throughput functional genomic data, Nucleic Acids Res, 37, D885-D890.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing, J R Stat Soc Ser B, 57, 289-300.
Breitkreutz, B.J. et al. (2008) The BioGRID Interaction Database: 2008 update, Nucleic Acids Res, 36, D637-D640.
Cao, K.-A.L. et al. (2008) A sparse PLS for variable selection when integrating omics data, Stat Appl Genet Mol Biol, 7, Article 35.
Chatr-aryamontri, A. et al. (2007) MINT: the Molecular INTeraction database, Nucleic Acids Res, 35, D572-D574.
Chua, G. et al. (2006) Identifying transcription factor functions and targets by phenotypic activation, Proc Natl Acad Sci U S A, 103, 12045-12050.
Cipollina, C. et al. (2008a) Saccharomyces cerevisiae SFP1: at the crossroads of central metabolism and ribosome biogenesis, Microbiology, 154, 1686-1699.
Cipollina, C. et al. (2008b) Revisiting the role of yeast Sfp1 in ribosome biogenesis and cell size control: a chemostat study, Microbiology, 154, 337-346.
Culhane, A.C. et al. (2003) Cross-platform comparison and visualisation of gene expression data using co-inertia analysis, BMC Bioinformatics, 4, 59.
Dolédec, S. and Chessel, D. (1994) Co-inertia analysis: an alternative method for studying species-environment relationships, Freshwater Biology, 31, 277-294.
Dray, S. et al. (2003) Procrustean co-inertia analysis for the linking of multivariate data sets, Ecoscience, 10, 110-119.
Fagan, A. et al. (2007) A multivariate analysis approach to the integration of proteomic and gene expression data, Proteomics, 7, 2162-2171.
Futschik, M.E. and Herzel, H. (2008) Are we overestimating the number of cell-cycling genes? The impact of background models on time-series analysis, Bioinformatics, 24, 1063-1069.
Gittins, R. (1985) Canonical analysis, a review with applications in ecology, Biomathematics, 12.
Guldener, U. et al. (2006) MPact: the MIPS protein interaction resource on yeast, Nucleic Acids Res, 34, D436-D441.
Harbison, C.T. et al. (2004) Transcriptional regulatory code of a eukaryotic genome, Nature, 431, 99-104.
Hermjakob, H. et al. (2004) IntAct: an open source molecular interaction database, Nucleic Acids Res, 32, D452-D455.
Huber, W. et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics, 18 Suppl 1, S96-S104.
Jeffery, I.B. et al. (2007) Integrating transcription factor binding site information with gene expression datasets, Bioinformatics, 23, 298-305.
Jorgensen, P. et al. (2004) A dynamic transcriptional network communicates growth potential to ribosome synthesis and critical cell size, Genes Dev, 18, 2491-2505.
Kanehisa, M. et al. (2008) KEGG for linking genomes to life and the environment, Nucleic Acids Res, 36, D480-D484.
Kerrien, S. et al. (2007) IntAct--open source resource for molecular interaction data, Nucleic Acids Res, 35, D561-D565.
Lee, T.I. et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae, Science, 298, 799-804.
Liu, C. et al. (2003) Identification of the downstream targets of SIM1 and ARNT2, a pair of transcription factors essential for neuroendocrine cell differentiation, J Biol Chem, 278, 44857-44867.
Mewes, H.W. et al. (2006) MIPS: analysis and annotation of proteins from whole genomes in 2005, Nucleic Acids Res, 34, D169-D172.
Moggs, J.G. et al. (2004) Phenotypic anchoring of gene expression changes during estrogen-induced uterine growth, Environ Health Perspect, 112, 1589-1606.
Oliva, A. et al. (2005) The cell cycle-regulated genes of Schizosaccharomyces pombe, PLoS Biol, 3, e225.
Parkhomenko, E. et al. (2009) Sparse canonical correlation analysis with application to genomic data integration, Stat Appl Genet Mol Biol, 8, Article 1.
Parkinson, H. et al. (2009) ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression, Nucleic Acids Res, 37, D868-D872.
Rustici, G. et al. (2004) Periodic gene expression program of the fission yeast cell cycle, Nat Genet, 36, 809-817.
Singh, A.V. et al. (2007) Integrative analysis of the mouse embryonic transcriptome, Bioinformation, 1, 406-413.
Smyth, G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments, Stat Appl Genet Mol Biol, 3, Article 3.
Stark, C. et al. (2006) BioGRID: a general repository for interaction datasets, Nucleic Acids Res, 34, D535-D539.
Stossi, F. et al. (2004) Transcriptional profiling of estrogen-regulated gene expression via estrogen receptor (ER) alpha or ERbeta in human osteosarcoma cells: distinct and common target genes for these receptors, Endocrinology, 145, 3473-3486.
Thioulouse, J. and Lobry, J.R. (1995) Co-inertia analysis of amino-acid physico-chemical properties and protein composition with the ADE package, Comput Appl Biosci, 11, 321-329.
Troyanskaya, O. et al. (2001) Missing value estimation methods for DNA microarrays, Bioinformatics, 17, 520-525.
Waaijenborg, S. et al. (2008) Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis, Stat Appl Genet Mol Biol, 7, Article 3.
Willis, R.C. and Hogue, C.W. (2006) Searching, viewing, and visualizing data in the Biomolecular Interaction Network Database (BIND), Curr Protoc Bioinformatics, Chapter 8, Unit 8.9.
Wold, H. (1966) Multivariate Analysis. Academic Press, New York.
Workman, C.T. et al. (2006) A systems approach to mapping DNA damage response pathways, Science, 312, 1054-1059.
Xenarios, I. et al. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions, Nucleic Acids Res, 30, 303-305.
Zanzoni, A. et al. (2002) MINT: a Molecular INTeraction database, FEBS Lett, 513, 135-140.
Zhu, G. et al. (2000) Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth, Nature, 406, 90-94.