BigRoc, Weizmann Institute of Science, Israel, March 2010. Integrative analysis of transcriptomics and proteomics data: implications to cancer biology (ArrayMining and TopoGSA). Enrico Glaab & Natalio Krasnogor, ASAP – Interdisciplinary Optimisation Laboratory, School of Computer Science; Centre for Integrative Plant Biology; Centre for Healthcare Associated Infections; Institute of Infection, Immunity and Inflammation; University of Nottingham.
Integrative analysis of transcriptomics and proteomics data with ArrayMining and TopoGSA
These slides are part of a presentation I gave in March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis, ArrayMining.net and TopoGSA. Both tools use ensemble and consensus methods and allow different analysis techniques to be combined in a modular way, giving an integrative view of (microarray-based) gene sets that interlinks transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network topological studies. As an example of these integrative techniques, we use a microarray consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net Class Discovery Analysis module, and show how this approach can be combined in a modular fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories in a breast cancer dataset from the Nottingham Queen's Medical Centre. In the second part of the talk, I show how results from a supervised microarray feature selection analysis on ArrayMining.net can be investigated in further detail with TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped onto a comprehensive human protein-protein interaction network. I discuss results from a TopoGSA analysis of the complete set of genes currently known to be mutated in cancer.
Transcript
BigRoc, Weizmann Institute of Science, Israel, March 2010 Page 1 of 1544
Integrative analysis of transcriptomics and proteomics data: implications to cancer biology
Integrative analysis of transcriptomics and proteomics data (ArrayMining and TopoGSA)
ASAP – Interdisciplinary Optimisation Laboratory
School of Computer Science
Centre for Integrative Plant Biology
Centre for Healthcare Associated Infections
Institute of Infection, Immunity and Inflammation
University of Nottingham
Enrico Glaab & Natalio Krasnogor
Overview:
• Introduction: goals and data sets
• ArrayMining.net: tool set for microarray analysis (@ www.infobiotics.org)
• TopoGSA: network topological analysis of genes/proteins (@ www.infobiotics.org)
• (time permitting) Network-based pathway extension
Outline
Gibson G (2003) Microarray Analysis. PLoS Biol 1(1): e15. doi:10.1371/journal.pbio.0000015
Introduction
• Typical problem in biosciences: How to make effective use of multiple, large-scale data sources?
• Typical problem in computer science: How to exploit the strengths of different algorithms?
GOAL: Develop new (& existing) methods combining diverse data sources and algorithms
Legend: previously identified by Armstrong et al.; newly identified
ArrayMining.net: Gene selection
Gene selection module (2): Armstrong et al. dataset
• Automatic generation of box plots with gene and sample class annotations
• The first row shows the box plots for the two best-ranked newly identified genes in the Armstrong et al. dataset
• The second row shows two top-ranked previously identified genes
• The user can easily compare and combine the results from different selection methods
ArrayMining.net: Examples
Further examples: Gene selection and Clustering module
Automatic generation of heatmaps and PCA Cluster plots (Armstrong et al. dataset)
Heat map axes: samples, genes
ArrayMining.net: Examples
Further examples: 3D-ICA and Co-Expression analysis
3D Independent Component Analysis plot (left) and the largest connected components from a gene co-expression network (right) for the Armstrong et al. dataset
Sample space (left; classes ALL, AML, MLL) and gene space (right)
ArrayMining.net: In-house data
Apply the tools on new data: QMC breast cancer data
Heat map: 50 most significant genes
Box plot: 4 most significant genes (STK6, MYBL2, KIF2C, AURKB), expression levels across 3 tumour grades
ArrayMining.net: QMC dataset
Gene name                                         PC (gene vs. outcome)   Fold change   Q-value (rank)
ESTROGEN RECEPTOR 1                               -0.75                   0.16          1.6e-20 (1.)
RAS-LIKE, ESTROGEN-REGULATED, GROWTH INHIBITOR    -0.66                   0.46          5.3e-14 (2.)
WD REPEAT DOMAIN 19                               -0.66                   0.73          1.2e-13 (3.)
CARBONIC ANHYDRASE XII                            -0.65                   0.28          2.7e-13 (4.)
ARP3 ACTIN-RELATED PROTEIN 3 HOMOLOG (YEAST)       0.64                   1.37          9.6e-13 (5.)
TETRATRICOPEPTIDE REPEAT DOMAIN 8                 -0.63                   0.82          2.2e-12 (6.)
BREAST CANCER MEMBRANE PROTEIN 11                 -0.62                   0.24          7.1e-12 (7.)
QMC Breast cancer data set – selected genes
• all top-ranked genes are known or likely to be involved in breast cancer
• the selection is robust with regard to cross-validation cycles and algorithms
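The table columns "PC (gene vs. outcome)" and "Fold change" can be illustrated with a minimal sketch. It assumes PC is the Pearson correlation between a gene's expression and a binary outcome, and that fold change is the ratio of class means; the slides do not state how ArrayMining computes either value, and Q-values are not computed here. The gene names and numbers are hypothetical toy data.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fold_change(expr, labels):
    """Mean expression in class 1 divided by mean expression in class 0 (assumed definition)."""
    pos = [e for e, lab in zip(expr, labels) if lab == 1]
    neg = [e for e, lab in zip(expr, labels) if lab == 0]
    return mean(pos) / mean(neg)

# Hypothetical toy data: 6 samples, binary outcome, 2 genes.
outcome = [0, 0, 0, 1, 1, 1]
genes = {
    "gene_A": [8.0, 9.0, 10.0, 2.0, 1.0, 3.0],  # strongly lower in class 1
    "gene_B": [5.0, 6.0, 5.0, 6.0, 5.0, 6.0],   # weakly related to outcome
}

# Rank genes by absolute correlation with the outcome, strongest first.
ranking = sorted(genes, key=lambda g: abs(pearson(genes[g], outcome)), reverse=True)
for g in ranking:
    print(g, round(pearson(genes[g], outcome), 2), round(fold_change(genes[g], outcome), 2))
```

A negative correlation combined with a fold change below 1, as for gene_A here, matches the pattern of most rows in the table.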
Methods overview
Methods overview: ArrayMining & TopoGSA
ArrayMining.net: Example
ArrayMining - Class Discovery Analysis module:
• Motivation: Exploiting the synergies between partition-based and hierarchical clustering algorithms
• Approach: Consensus clustering based on the agreement of clustering results for pairs of objects (details on next slide).
- equivalent to the median partition problem (NP-complete)
- Simulated Annealing (SA) has been shown to provide good solutions
• Our solution:
- Compare SA (Aarts et al. cooling scheme) with thermodynamic SA (TSA) and fast SA (FSA); FSA provides the fastest convergence
- Initialization: input clustering with highest agreement to other inputs
ArrayMining.net: Consensus clustering
ArrayMining's consensus clustering approach:
Clustering agreement := number of times a pair of samples is assigned to the same cluster across all input clusterings
Idea: Reward objects in the same cluster if they have a high agreement.
Agreement matrix: Aij := #agreements across all clusterings for samples i and j (figure axes: Sample 1, Sample 2)
Fitness function: := (max(A)+min(A))/2
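The agreement matrix can be sketched in a few lines. The fitness formula is only partly preserved in this transcript, so the function below is an assumption built from the slide's stated idea: it treats (max(A)+min(A))/2 as a threshold and sums the agreement minus that threshold over all pairs placed in the same cluster. The name `beta` and the toy clusterings are hypothetical.

```python
from itertools import combinations

def agreement_matrix(clusterings):
    """A[i][j] = number of input clusterings that put samples i and j in the same cluster."""
    n = len(clusterings[0])
    A = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                A[i][j] += 1
                A[j][i] += 1
    return A

def fitness(candidate, A):
    """Assumed fitness: sum of (A[i][j] - beta) over pairs sharing a cluster.

    beta = (max(A) + min(A)) / 2, taken over the off-diagonal entries.
    """
    pairs = list(combinations(range(len(candidate)), 2))
    values = [A[i][j] for i, j in pairs]
    beta = (max(values) + min(values)) / 2
    return sum(A[i][j] - beta for i, j in pairs if candidate[i] == candidate[j])

# Hypothetical input: three clusterings of four samples.
inputs = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
A = agreement_matrix(inputs)
print(A[0][1], A[2][3], A[0][2])    # agreements for three sample pairs
print(fitness([0, 0, 1, 1], A))     # candidate matching two of the inputs
print(fitness([0, 0, 0, 0], A))     # everything in one cluster
```

Under this assumed fitness, pairs with above-threshold agreement are rewarded for sharing a cluster and pairs with below-threshold agreement are penalised, so merging everything into one cluster scores poorly.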
• FSA (Fast SA; Szu, Hartley; 1987): Uses Cauchy-distributed random numbers and a slightly modified cooling scheme.
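The cooling formula itself did not survive in this transcript. The sketch below uses the schedule commonly attributed to fast annealing, T_k = T0 / (1 + k), and applies it to a small consensus-clustering search. Using a Cauchy-distributed number to decide how many samples to relabel is an assumption made here to fit a discrete search space; the toy agreement matrix, the threshold 1.5 and all function names are hypothetical, and this is not ArrayMining's implementation.

```python
import math
import random

def fast_cooling(t0, k):
    """Fast-annealing schedule T_k = T0 / (1 + k)."""
    return t0 / (1.0 + k)

def cauchy_move_size(temp, rng, n):
    """Number of samples to relabel: |Cauchy| scaled by temperature, clipped to 1..n (assumption)."""
    step = abs(temp * math.tan(math.pi * (rng.random() - 0.5)))
    return max(1, min(n, int(round(step))))

def anneal(fitness, start, n_clusters, t0=1.0, iters=2000, seed=0):
    """Maximise fitness over cluster label vectors with simulated annealing."""
    rng = random.Random(seed)
    current, cur_fit = list(start), fitness(start)
    best, best_fit = list(current), cur_fit
    for k in range(iters):
        temp = fast_cooling(t0, k)
        cand = list(current)
        size = cauchy_move_size(temp, rng, len(cand))
        for idx in rng.sample(range(len(cand)), size):
            cand[idx] = rng.randrange(n_clusters)
        delta = fitness(cand) - cur_fit
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_fit = cand, cur_fit + delta
            if cur_fit > best_fit:
                best, best_fit = list(current), cur_fit
    return best, best_fit

# Hypothetical agreement matrix for four samples (pairs 0-1 and 2-3 agree most often).
A = [
    [0, 2, 0, 0],
    [2, 0, 1, 1],
    [0, 1, 0, 3],
    [0, 1, 3, 0],
]

def toy_fitness(labels):
    """Sum of (agreement - 1.5) over pairs in the same cluster; 1.5 = (max + min) / 2 here."""
    n = len(labels)
    return sum(A[i][j] - 1.5 for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j])

best, best_fit = anneal(toy_fitness, start=[0, 1, 0, 1], n_clusters=2)
print(best, best_fit)
```

At high temperature the Cauchy tail occasionally relabels many samples at once, while at low temperature the search reduces to single-sample changes that are accepted almost only when they do not lower the fitness.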
Methods overview
Methods overview: ArrayMining & TopoGSA
ArrayMining.net: Gene set analysis
Extension: Gene set analysis
Figure axes: samples, pathways
• Expression levels for a single gene are often unreliable
• Similar genes might contain complementary information
• We want to integrate functional annotation data
Gene Set Analysis (GSA):
1) Identify sets of functionally similar genes (GO, KEGG, etc.)
2) Summarize gene sets to "meta-genes" (PCA, MDS, etc.)
3) Apply statistical analysis
(example: Van Andel Institute cancer gene sets)
Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, October 25, 2005, vol. 102, no. 43, 15545–15550
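Step 2 of the gene set analysis, summarizing a gene set to a "meta-gene", can be sketched with PCA: each sample receives its score on the first principal component of the genes in the set. The sketch uses plain power iteration instead of a linear algebra library; the function names, the gene set and the expression values are hypothetical, and the starting vector assumes the genes in a set are not perfectly anti-correlated.

```python
from math import sqrt

def first_pc_scores(rows, iters=200):
    """rows: one list of expression values per gene. Returns one PC1 score per sample."""
    n_genes, n_samples = len(rows), len(rows[0])
    # Centre each gene across samples.
    centred = [[v - sum(r) / n_samples for v in r] for r in rows]
    # Gene-by-gene covariance matrix.
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (n_samples - 1) for rj in centred]
           for ri in centred]
    # Power iteration for the leading eigenvector (the gene loadings).
    w = [1.0 / sqrt(n_genes)] * n_genes
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, w)) for row in cov]
        norm = sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # Project every sample onto the loadings.
    return [sum(w[g] * centred[g][s] for g in range(n_genes)) for s in range(n_samples)]

def meta_genes(expression, gene_sets):
    """Map each gene set name to its per-sample meta-gene (PC1 scores)."""
    return {name: first_pc_scores([expression[g] for g in members])
            for name, members in gene_sets.items()}

# Hypothetical expression data: 2 genes, 4 samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
}
gene_sets = {"toy_pathway": ["g1", "g2"]}
scores = meta_genes(expression, gene_sets)["toy_pathway"]
print([round(s, 3) for s in scores])
```

The resulting meta-gene rows can then be fed into the same statistical analysis or clustering as single-gene expression rows (step 3).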
ArrayMining.net: Examples
Gene Set Analysis module – example analysis
Heat map for the Armstrong et al. dataset based on pathway meta-genes
• We apply the Gene Set Analysis module to the Armstrong et al. dataset
• With known cancer gene sets the class separation is better than for single genes
Consensus clustering: example (2)
Combine consensus clustering with gene set analysis
• Apply the same consensus clustering as before on GO-based "meta-genes"
~3 times higher confidence
better separation
External validation
Compared: single clustering, consensus clustering, consensus (PAM+SOTA); reference: 10000 random clusterings
Interim Summary
• Consensus clustering (CC) results tend to be similar to or slightly better than the best single clusterings in terms of adjusted Rand index and validity indices (but with longer runtime)
• The input clusterings should include diverse methods and exclude similar methods
• Using gene sets (GS) representing cellular pathways instead of single genes results in better cluster separation, adjusted Rand indices and validity indices (annotation data required)
GS & CC provide improved results, but: longer runtimes + annotation data required
ArrayMining Integrative Clustering - Summary
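The adjusted Rand index used for external validation compares a clustering with known class labels and corrects for chance agreement. A minimal sketch of the standard pair-counting formula follows; the label vectors are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_a)
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster in each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

true_classes = [0, 0, 1, 1]
print(adjusted_rand_index(true_classes, [1, 1, 0, 0]))  # same partition, labels swapped
print(adjusted_rand_index(true_classes, [0, 1, 0, 1]))  # partition that cuts across the classes
```

The index is 1 for identical partitions regardless of label names, close to 0 for random assignments, and can be negative when agreement is worse than chance.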
Methods overview
Methods overview: ArrayMining & TopoGSA
TopoGSA
TopoGSA: Network topological analysis of gene sets
What is TopoGSA? TopoGSA is a web-application mapping gene sets onto a comprehensive human protein interaction network and analysing their network topological properties.
Two types of analysis:
1. Compare genes within a gene set: e.g. up- vs. down-regulated genes
2. Compare a gene set against a database of known gene sets (e.g. KEGG, BioCarta, GO)
www.infobiotics.net/TopoGSA
TopoGSA - Methods
TopoGSA computes the following topological properties for an uploaded gene set and matched-size random gene sets:
• the degree of each node in the gene set
• the local clustering coefficient $C_i$ for each node $v_i$ in the gene set: $C_i = \frac{2\,|\{e_{jk}\}|}{k_i (k_i - 1)}$, with $v_j$, $v_k$ neighbours of $v_i$, where $k_i$ is the degree of $v_i$ and $e_{jk}$ is the edge between $v_j$ and $v_k$
• the shortest path length between pairs of nodes $v_i$ and $v_j$ in the gene set
• the node betweenness $B(v)$ for each node $v$ in the gene set: $B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$, where $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ passing through $v$ and $\sigma_{st}$ is the total number of shortest paths from $s$ to $t$
• the eigenvector centrality for each node in the gene set
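Three of these properties (degree, local clustering coefficient and shortest path length) and the comparison against matched-size random gene sets can be sketched as follows. Betweenness and eigenvector centrality are left out to keep the example short. The toy network, node names and function names are hypothetical, and this is not TopoGSA's code.

```python
import random
from collections import deque
from itertools import combinations
from statistics import mean

def local_clustering(graph, v):
    """2 * (edges among neighbours of v) / (k * (k - 1)); 0 if v has fewer than 2 neighbours."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))

def bfs_distances(graph, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def summarise(graph, gene_set):
    """Mean degree, mean clustering coefficient and mean pairwise shortest path of a gene set."""
    genes = sorted(gene_set)
    dists = []
    for i, s in enumerate(genes):
        d = bfs_distances(graph, s)
        dists += [d[t] for t in genes[i + 1:] if t in d]
    return {
        "mean_degree": mean(len(graph[g]) for g in genes),
        "mean_clustering": mean(local_clustering(graph, g) for g in genes),
        "mean_shortest_path": mean(dists) if dists else float("inf"),
    }

def random_baseline(graph, size, n_sets=100, seed=0):
    """Summaries for random gene sets of the same size as the uploaded set."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    return [summarise(graph, rng.sample(nodes, size)) for _ in range(n_sets)]

# Hypothetical toy interaction network as an adjacency dictionary.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

uploaded = summarise(graph, {"A", "B", "C"})
baseline = random_baseline(graph, size=3)
print(uploaded)
print(mean(b["mean_clustering"] for b in baseline))
```

Comparing the uploaded set's summary with the distribution over random sets indicates whether its topology is unusual for a set of that size.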
KEGG-BRITE pathway colouring
LEGEND:
• Cellular processes
• Environmental information processing
• Genetic information processing
• Human diseases
• Metabolism
• Cancer genes
General results:
• Metabolic pathways have high shortest path lengths and low betweenness
• Disease pathways and cancer gene sets tend to have high betweenness and small shortest path lengths
Plot axes: mean node betweenness, mean clustering coefficient, mean shortest path length
ArrayMining to TopoGSA
Send selected genes from ArrayMining to TopoGSA:
• Results of within-gene-set comparison:
Estrogen receptor 1 gene and apoptosis regulator Bcl2, both up-regulated in luminal samples, have outstanding network topological properties (higher betweenness, higher degree, higher centrality) in comparison to other genes.
• Results of comparison against reference databases:
- Metabolic KEGG pathways are most similar to the uploaded gene set in terms of network topological properties.
- Most similar BioCarta pathways: cytokine, differentiation and inflammatory pathways.
Real-world application of tool sets
ArrayMining identifies RERG as a tumour marker
• RERG (Ras-related and oestrogen-regulated growth-inhibitor) was identified as a new candidate marker of the ER-positive luminal-like breast cancer subtype
• Validation using immunohistochemistry on Tissue Microarrays containing 1,140 invasive breast cancers confirmed RERG's utility as a marker gene
TMAs of invasive breast cancer show strong RERG expression
RERG Protein Expression vs. BCSS & DMFI
Kaplan-Meier plots of cumulative survival, positive vs. negative RERG protein expression:
• BCSS (months), ER+ U ER- cohort: p = 0.002
• DMFI (months): p = 0.007
• BCSS (months), ER+ only: p = 0.027
Panel labels: without adjuvant treatment; without Tamoxifen treatment
Conclusions (1): Feature comparison with similar tools
ArrayMining & TopoGSA
• Pre-processing: Image analysis, single- and cross-study normalization, dimensionality reduction, gene name normalization, covariance-based filtering
• Usability/features: PDF reports, sortable ranking tables, data annotation, 2D/3D plots, e-mail notification, video tutorials
GEPAS (Tarraga et al.)
• Pre-processing: Image analysis, missing value imputation, multiple single study normalization methods, dimensionality reduction, ID converter
• Usability/features: special tree visualization (Caat, SotaTree, Newick Trees), 2D plots, data annotation (Babelomics)
Expression Profiler (Kapushesky et al.)
• Pre-processing: Image analysis, single study normalization, missing value imputation, dimensionality reduction, advanced data selection
• Usability/features: Excel export, XML queries, 2D plots, data annotation (GO, chromosome location)
Conclusions (2)
• Combining algorithms in a sequential and/or parallel fashion can provide performance improvements and new biological insights
• Microarray and gene set analysis tasks can be interlinked flexibly in an (almost) completely automated process
• New analysis types like network-based topology analysis and co-expression analysis complement existing tools
• In the case of breast cancer (BC), this allowed us to identify candidate genes to characterise ER+ luminal-like BC.
– The RERG gene is a key marker of the luminal BC class
– It can be used to separate distinct prognostic subgroups
• Accessible through www.infobiotics.org
Outlook : PPI-based pathway-enlargement
Idea:
Enlarge pathways by adding genes that are "strongly connected" to the pathway nodes or increase the pathway "compactness"
Pathway extension criteria:
• degree(v) > 1; and
• #pathway-links(v,p) / #outside-links(v,p) > T1; or
• #triangle-links(v,p) / #possible_triangles(v,p) > T2; or
• #pathway-links(v,p) / #pathway-nodes(p) > T3
black = pathway nodes; red, blue, green = nodes added based on different criteria
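The pathway extension criteria can be read as a predicate on a candidate node v and a pathway p. The sketch below assumes the grouping "degree(v) > 1 and (criterion 1 or criterion 2 or criterion 3)", and it assumes that triangle-links are pairs of v's pathway neighbours that are themselves connected. The thresholds T1, T2, T3, the toy network and all function names are hypothetical; the slides give no threshold values.

```python
def should_add(graph, v, pathway, t1, t2, t3):
    """Return True if node v meets the (assumed) pathway extension criteria."""
    nbrs = graph[v]
    if len(nbrs) <= 1:  # degree(v) > 1 is required
        return False
    in_links = len(nbrs & pathway)
    out_links = len(nbrs - pathway)
    if out_links:
        link_ratio = in_links / out_links
    else:
        link_ratio = float("inf") if in_links else 0.0
    # Triangles: pairs of pathway neighbours of v that are linked to each other.
    pn = sorted(nbrs & pathway)
    possible = len(pn) * (len(pn) - 1) // 2
    closed = sum(1 for i, a in enumerate(pn) for b in pn[i + 1:] if b in graph[a])
    triangle_ratio = closed / possible if possible else 0.0
    coverage = in_links / len(pathway)
    return link_ratio > t1 or triangle_ratio > t2 or coverage > t3

def extend_pathway(graph, pathway, t1=1.0, t2=0.5, t3=0.5):
    """Add every non-pathway node that passes the criteria (single pass, no iteration)."""
    added = {v for v in graph if v not in pathway and should_add(graph, v, pathway, t1, t2, t3)}
    return pathway | added

# Hypothetical toy network: P1-P3 form the pathway, the other nodes are candidates.
edges = [("P1", "P2"), ("P2", "P3"),
         ("X", "P1"), ("X", "P2"), ("X", "Y"),
         ("Z", "P1"),
         ("W", "P3"), ("W", "Y"), ("W", "Q"), ("W", "R")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

pathway = {"P1", "P2", "P3"}
print(sorted(extend_pathway(graph, pathway)))
```

In this toy network only X is added: it has two pathway links against one outside link, and its two pathway neighbours are themselves connected.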
Pathway enlargement – added genes
Example case: BioCarta BTG family proteins and cell cycle regulation
Black: Original pathway nodes – Green: Nodes added based on connectivity
Added cancer gene
Pathway enlargement – Example 1
Example: Alzheimer disease pathway
• More than 20 proteins annotated in our PPIN
• 5 proteins added by the extension process
• 3 known disease-associated
• 2 candidates: METTL2B, TMED10
Pathway enlargement – Example 2
Example: Interleukin signaling pathways
Pathway enlargement - conclusion
The method integrates two sources of information, extending canonical pathways using large-scale protein interaction data
Identifies new regulators and new candidates for disease pathways
Future: investigate extended pathways as input for enrichment/classification methods
This work is based on the following papers:
ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. E. Glaab, J. Garibaldi, and N. Krasnogor. BMC Bioinformatics, 10(1):358, 2009.
TopoGSA: network topological gene set analysis. E. Glaab, A. Baudot, N. Krasnogor and A. Valencia. Bioinformatics. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq131v1?ct=ct
RERG (Ras-related and oestrogen-regulated growth-inhibitor) expression in breast cancer as a marker of ER-positive luminal-like subtype. H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R. Green and I.O. Ellis (to be submitted)
Extending biological pathway definitions using molecular interaction networks. E. Glaab, A. Baudot, N. Krasnogor and A. Valencia (to be submitted)
Acknowledgements
• QMC: Hany Onsy Habashy, Desmond G Powe, Emad A Rakha, Graham Ball, Andrew R Green, Ian O Ellis.
• CS: Jon M. Garibaldi
• CNIO: A. Valencia, A. Baudot
• BBSRC for grants BB/F01855X/1, BB/D0196131
• EPSRC for grant EP/E017215/1
• The EC for a Marie-Curie Early-Stage-Training programme grant