Network-based analysis of functional genomics data Jianhua Ruan, PhD Computer Science Department University of Texas at San Antonio http://www.cs.utsa.edu/~jruan
Feb 10, 2016
UTSA
Final project
• Final project report due Sat, Dec 15
• Presentations: Mon, Dec 17, 8-10:30 pm
  - 10 teams to present
  - Each team will have up to 13 minutes (10 min presentation, 3 min questions)
• Since time is limited, you don't need to cover all the details in your presentation
  - Focus on the most important concepts
  - More details in your project report
Dog: ~20,000 genes
Human: ~22,000 genes
Rice: ~35,000 genes
Mouse: ~22,000 genes
C. elegans: ~20,000 genes
It is not just the genes, but the networks!
Why networks?
• For complex systems, the actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts
• Studying genes/proteins on the network level allows us to:
  - Assess the role of individual genes/proteins in the overall pathway
  - Evaluate redundancy of network components
  - Identify candidate genes involved in genetic diseases
  - Set up the framework for mathematical models
Graph model of biological networks
• An abstraction of the complex relationships among molecules in the cell
• Vertices: molecules
  - Gene, protein, metabolite, DNA, RNA
• Edges: relationships
  - Physical interaction
  - Functional association
• Share many common statistical properties with real-world networks
  - Small-world
  - Scale-free
  - Hierarchical
  - Modular (community structure)
(Jeong et al., 2001)
Research overview
• Network-based disease studies
• Genetic network reverse-engineering
• Network analysis algorithms
• Data mining, tree models, DNA motif finding
• Community discovery
• Data integration, classification, graph algorithms
Agenda
• Community discovery in biological networks
• Network-based analysis of microarray data
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions
Network communities
• Communities
  - Are relatively densely connected sub-networks (modules)
  - Appear in many types of networks
  - Independently studied in many fields: social science, computer science, physics, biology, etc.
• Significance
  - Biological systems are modular: metabolic pathways, protein complexes, transcriptional regulatory modules
  - Biological systems are large and complex: communities provide a high-level overview of the networks
  - Guilt-by-association: predict gene functions based on community memberships
Community discovery problem
• Divide a network into relatively densely connected sub-networks
• Similar to clustering, but the number of clusters is determined automatically
[Figure: vertex reorder]
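Modularity function (Q)
• Measures the strength of community structures (Newman, Phys Rev E, 2003)

$$Q = \sum_i \left( e_{ii} - a_i^2 \right), \qquad a_i = \sum_j e_{ij}$$

where e_ij is the fraction of edges that connect community i to community j, so e_ii is the fraction of edges inside community i and a_i is the fraction of edge ends attached to community i.
[Figure: reordered adjacency matrix with diagonal blocks e11, e22, e33, e44, e55; example partitions with Q = 0.45, Q = 0.56, Q = 0, Q = 0.40 and Q = 0.54]
• Modularity automatically determines the number of communities!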
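As a hedged illustration of the modularity function, the sketch below computes Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph given as an edge list. The function name `modularity`, the `community` mapping and the toy graph (two triangles joined by one edge) are assumptions made for this example; this is not the Qcut implementation.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    edges: list of (u, v) pairs; community: dict mapping vertex -> label.
    """
    m = len(edges)
    within = defaultdict(int)  # number of edges with both ends in community i
    ends = defaultdict(int)    # number of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            within[cu] += 1
    # e_ii = within / m, a_i = ends / (2m)
    return sum(within[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

print(modularity(edges, two_groups))  # positive: the split matches the structure
print(modularity(edges, one_group))   # zero: a single community carries no signal
```

Placing every vertex in one community gives Q = 0, which matches the Q = 0 case in the figure: Q rewards a partition only for having more within-community edges than expected from the degrees alone.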
Methods for community discovery
• Previous methods
  - Fast but inaccurate (CNM, Phys Rev E, 2004)
  - Accurate but slow (Guimera & Amaral, Nature 2005)
  - Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007)
  - Relatively accurate, fast, memory intensive (Newman, PNAS 2006)
• Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008)
  - Accurate, fast and memory friendly
  - HQcut solves the resolution limit of Q
Algorithm Qcut
1. Recursive multi-way partitioning until Q is maximized
2. Improve Q by efficient heuristic search
[Figure: multi-way partitioning step using eigenvectors (eig) followed by k-means]
[Figure: accuracy vs. inter-community edge probability, our method compared with Newman's]
Resolution limit and HQcut
• Q is known to have a resolution limit problem
  - Cannot detect small communities
  - Q decreases slightly if forced to split
• HQcut solves this problem
  - Apply Qcut to get communities with largest Q
  - Recursively search for sub-communities within each community without dramatic change to Q
  - Statistical test for termination criteria
Ruan & Zhang, Physical Review E 2008
Graphical user interface for Qcut/HQcut
Application: protein complex prediction
• Input: a yeast PPI network
  - Data from Krogan et al., Nature 2006;440:637-43
  - 2708 vertices (proteins), 7123 interactions
• 289 communities
  - Sizes range from 2 to 49
• Evaluation: compare communities with known complexes manually curated in the MIPS database
Communities in a yeast PPI network
Small ribosomal subunit (90%)
RNA poly II mediator (83%)
Proteasome core (90%)
Exosome (94%)
gamma-tubulin (77%)
respiratory chain complex IV (82%)
Communities vs. complexes
• Communities and complexes have good one-to-one correspondence
• Overall accuracy: 0.81
  - Newman: 0.58
• Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| × |K|)
[Figure: predicted complexes vs. known complexes]
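The agreement score can be sketched in a few lines. The function name `agreement` and the protein labels are placeholders chosen for this example; how the pairwise scores are aggregated into the overall accuracy is not stated on the slide, so only the pairwise score is shown.

```python
import math

def agreement(predicted, known):
    """Agreement between a predicted complex P and a known complex K:
    |P ∩ K| / sqrt(|P| * |K|). Returns 0.0 if either set is empty."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0
    return len(p & k) / math.sqrt(len(p) * len(k))

# Hypothetical protein labels, for illustration only.
predicted = {"p1", "p2", "p3", "p4"}
known = {"p3", "p4", "p5", "p6"}
print(agreement(predicted, known))      # two shared members out of four each
print(agreement(predicted, predicted))  # identical sets score 1.0
```

The score is 1 only when the two sets are identical, and the geometric mean in the denominator penalizes a predicted complex that is much larger or much smaller than the known one.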
Work in progress: random walk-based improvement
• Motivation:
  - PPI networks often contain both false positive and false negative edges
  - Hub genes prevent good partitioning
• Three goals:
  - Eliminate false positive edges
  - Predict missing links
  - Reduce the impact of hub genes
• Intuition:
  - Two proteins with high topological similarity, whether connected or not, may belong to the same complex
  - Two proteins with a direct link but very different topological properties may belong to different complexes
Method overview
[Figure: original network (adjacency matrix) -> initial probability vectors -> random walk with resistance (guided by topology) -> equilibrium probability vectors -> distance calculation -> similarity matrix -> threshold -> new network]
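The pipeline in the figure can be approximated as follows. The slides do not define "random walk with resistance", so this sketch substitutes a plain random walk with restart as a stand-in, uses cosine similarity for the distance step, and assumes a graph with no isolated vertices. The names `walk_vectors`, `cosine` and `rewire`, the restart probability and the toy network are all assumptions for illustration.

```python
import math

def walk_vectors(adj, restart=0.5, iters=100):
    """One equilibrium probability vector per start vertex (random walk with restart)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    vectors = []
    for s in range(n):
        p = [0.0] * n
        p[s] = 1.0                      # initial probability vector
        for _ in range(iters):
            nxt = [0.0] * n
            nxt[s] = restart            # jump back to the start vertex
            for i in range(n):
                if p[i] and deg[i]:
                    share = (1 - restart) * p[i] / deg[i]
                    for j in range(n):
                        if adj[i][j]:
                            nxt[j] += share
            p = nxt
        vectors.append(p)
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rewire(adj, threshold, restart=0.5):
    """New network: connect i and j when their walk vectors are similar enough."""
    vecs = walk_vectors(adj, restart=restart)
    n = len(adj)
    return [[1 if i != j and cosine(vecs[i], vecs[j]) >= threshold else 0
             for j in range(n)] for i in range(n)]

# Toy network (hypothetical): two triangles joined by the edge 2-3.
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[u][v] = adj[v][u] = 1

vecs = walk_vectors(adj)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[5]))
```

In the toy network, vertices in the same triangle end up with more similar walk vectors than vertices in different triangles, which is the property the thresholding step relies on. The threshold itself would have to be chosen for the data at hand.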
Preliminary results on yeast PPI
• Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence
  - Gene Ontology
  - Gene expression
  - etc.
• New network significantly improved accuracy of protein complex predictions
  - Using HQcut: 0.50 to 0.55
  - Using MCL: 0.48 to 0.59
Agenda
• Community structure in biological networks
  - Prediction of protein complexes
  - Network-based microarray data analysis
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions
Microarray data analysis
• Gene network structure is unknown
• Microarray measures gene expression (activity) level
• Clustering is the most common analysis tool
• Many clustering algorithms available
  - K-means
  - Hierarchical
  - Self-organizing maps
• Parameter (e.g., k) hard to guess
• Does not consider network structure
[Figure: genes × conditions expression matrix -> clustering -> common functions? common regulation? predict functions for unknown genes?]
Network-based microarray data analysis
• Genes i and j are connected if their expression patterns are "sufficiently similar"
  - Pearson correlation coefficient > arbitrary threshold
  - K nearest neighbors (KNN)
• Key: how to get the "best" network?
[Figure: gene × sample expression matrix -> construct co-expression network, with an edge between genes i and j]
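A minimal sketch of the network construction step, assuming Pearson correlation as the similarity and a mutual k-nearest-neighbor rule (genes i and j are linked only if each is among the other's k most similar genes). Reading the later "mKNN" label as mutual KNN is an assumption, and the function names and expression profiles are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def mutual_knn_edges(profiles, k):
    """Edges (i, j), i < j, where i and j are in each other's top-k neighbors."""
    n = len(profiles)
    sim = [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    nbrs = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        nbrs.append(set(ranked[:k]))
    return sorted((i, j) for i in range(n) for j in nbrs[i]
                  if i < j and i in nbrs[j])

# Made-up expression profiles: genes 0 and 1 rise together, genes 2 and 3 fall together.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 1.5],
]
print(mutual_knn_edges(profiles, k=1))
```

Increasing k yields denser networks and decreasing it yields sparser ones, which gives a series of candidate networks to choose from when looking for the "best" one.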
Motivation
• Can we use the idea of community discovery for clustering microarray data?
• Advantages:
  - Parameter free
  - Network topology considered
  - Constructed network may have other interesting applications beyond clustering
Our idea
• Intuition: the real network is naturally modular
  - Can be measured by modularity (Q)
  - If constructed right, should have the highest Q
[Figure: microarray data -> similarity matrix -> network series (Net_1, most dense, ..., Net_m, most sparse) -> Qcut applied to each network]
Ruan, ICDM 2009
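Our idea (cont'd)
[Figure: modularity vs. network density for the true network and for a random network, and the difference between the two]
• Therefore, use ∆Q (the modularity of the constructed network minus that of a random network) to determine the best network parameter and obtain the best community structure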
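The selection rule can be sketched as a one-pass comparison, assuming the modularity of each constructed network and of a matching random network has already been computed. The function name `best_network` and the numbers below are invented purely to show the mechanics; they are not results from this work.

```python
def best_network(q_real, q_random):
    """Pick the network parameter with the largest dQ = Q_real - Q_random.

    q_real, q_random: dicts mapping a network parameter (e.g. number of
    neighbors k) to the modularity of the constructed / random network.
    """
    delta = {k: q_real[k] - q_random[k] for k in q_real}
    best = max(delta, key=delta.get)
    return best, delta

# Invented modularity values for three neighbor counts (illustration only).
q_real = {5: 0.80, 20: 0.70, 80: 0.30}
q_random = {5: 0.60, 20: 0.25, 80: 0.10}

best_k, delta = best_network(q_real, q_random)
print(best_k, delta[best_k])
```

In this made-up example the sparsest network has the highest raw Q, but much of that is also reached by the random network; subtracting the random baseline selects the intermediate density instead.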
Results: synthetic data set 1
• High dimensional data generated by synDeca
• 20 clusters of high dimensional points, plus some scatter points
• Clusters are of various shapes: ellipse, rectangle, random
[Figure: Q_real, Q_random, ∆Q = Q_real - Q_random, and clustering accuracy vs. number of neighbors]
Comparison
[Figure: clustering accuracy vs. dimension for mKNN-HQcut with the optimum k, mKNN-HQcut with automatically determined k, and k-means]
Results: synthetic data set 2
• Gene expression data (Thalamuthu et al., 2006)
  - 600 data sets
  - ~600 genes, 50 conditions, 15 clusters
  - 0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and mKNN-HQcut with auto k]
Comparison with other methods
Ruan et al., BioKDD 2010
Results on yeast stress response data
• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters
Results on yeast stress response data
• Enrichment of common functions
  - Cumulative hypergeometric test (Fisher's exact test)
[Figure: genes × GO function terms, with enriched clusters labeled:]
• Protein biosynthesis (p < 10^-96)
• Nuclear transport (p < 10^-50)
• mt ribosome (p < 10^-63)
• DNA repair (p < 10^-66)
• RNA splicing (p < 10^-105)
• Nitrogen compound metabolism (p < 10^-37)
• Peroxisome (p < 10^-13)
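The enrichment p-value can be sketched as the upper tail of a hypergeometric distribution, which is the one-sided Fisher's exact test. The function and parameter names are hypothetical, and the counts in the example are made up; they are not taken from the yeast data.

```python
from math import comb

def enrichment_pvalue(total_genes, annotated, cluster_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    total_genes:  genes in the background
    annotated:    background genes carrying the GO term
    cluster_size: genes in the cluster being tested
    overlap:      cluster genes carrying the GO term
    """
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, i) * comb(total_genes - annotated, cluster_size - i)
               for i in range(overlap, upper + 1))
    return tail / comb(total_genes, cluster_size)

# Made-up counts: 10 genes, 5 annotated, a cluster of 5 that are all annotated.
print(enrichment_pvalue(10, 5, 5, 5))
```

With many GO terms and many clusters tested at once, the raw p-values would normally also need a multiple-testing correction; the slides do not say which, if any, was applied.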
Comparison with k-means
[Figure: overall enrichment score for k-means vs. mKNN-HQcut using automatically determined k = 140]
An interesting community
• A 25-gene community missed by other methods
• 4 transcription factors regulate many genes in the community
• 4 telomere maintenance genes (p < 10^-7)
• 16 unknown genes, all located in chromosome telomeric regions
• 5 other genes at rim of the sub-network
Application to Arabidopsis data
• ~22,000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]
Cis-regulatory network of Arabidopsis
[Figure labels: Motif, Module]
Ruan et al., BMC Bioinfo, to appear
Beyond gene clustering (1): gene-specific studies
• Collaborator is interested in gibberellins
  - A hormone important for the growth and development of plants
  - Commercially important
  - Biosynthesis and signaling well studied
  - Transcriptional regulation of biosynthesis and signaling not yet clear
• 3 important gene families for biosynthesis: GA20ox, GA3ox and GA2ox
• Receptor gene family: GID1A, B, C
• Analyze the co-expression network around these genes
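[Figure: co-expression network around the gibberellin genes. Labeled nodes: GID1A, GID1B, GID1C; 20ox1, 20ox2, 20ox3, 20ox4, 20ox5; 3ox1, 3ox2, 3ox3, 3ox4; 2ox1, 2ox2, 2ox3, 2ox4, 2ox6, 2ox7, 2ox8; GA3. Legend: 20ox, 3ox, 2ox]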
Beyond gene clusters (2): cancer classification
• Data: Alizadeh et al., Nature 2000
• Sample: patient or cell lines
[Figure: gene × sample expression matrix -> sample × sample matrix -> Qcut]
Network of cell samples
• Shape: cell line / cancer type
• Color: clustering results
[Figure labels: activated blood B; chronic lymphocytic leukemia (CLL); follicular lymphoma (FL); blood T; transformed cell lines; diffuse large B-cell lymphoma (DLBCL, three groups); resting blood B]
Survival rate after chemotherapy
[Figure: survival of the DLBCL-1, DLBCL-2 and DLBCL-3 groups]
• Survival rate: 73%, median survival time: 71.3 months
• Survival rate: 40%, median survival time: 22.3 months
• Survival rate: 20%, median survival time: 12.5 months
Beyond gene clustering (3): topology vs. function
[Figure: % of essential proteins vs. number of connections (Jeong et al., Nature 2001)]
Community participation vs. essentiality (PPI)
[Figure: % essential vs. community participation, for hub and non-hub proteins]
Community participation vs. essentiality (coexp)
[Figure: % essential vs. community participation, for hub and non-hub genes; % essential vs. number of connections, for participation < 0.2 and participation >= 0.2]
• Key: how to systematically search for such relationships?
  - Data mining: association rules?
Agenda
• Community structure in biological networks
  - Prediction of protein complexes
  - Network-based microarray data analysis
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions
Background
• Metastasis is the spread of cancer from one organ to another non-adjacent organ or part
• Challenge: predict metastasis
  - If metastasis is likely => aggressive adjuvant therapy
  - How to decide the likelihood?
• Traditional predictive factors are not good enough
Microarray-based marker discovery
• Examine genome-wide expression profiles
• Idea:
  - Score individual genes for how well they discriminate between different classes of disease
  - Establish gene expression signature
• Limitations:
  - # genes >> # patients
  - Downstream effects
  - Individual variations not attributed to cancer
• Consequences:
  - Low reproducibility across data sets
  - Missing biological insight
[Figure: expression matrix of size M × N]
Pathway vs. PPI subnetwork as marker
• Remedy to the problems: use pathway information
  - Fewer features => better stability
  - Biological insight
  - Limitation: majority of human genes not yet assigned to a definitive pathway
• Alternatively: protein-protein interaction (PPI) networks
  - Genes in the same pathway may have short distance in PPI
  - Subnetworks (potential pathways) as markers (Chuang et al. 2007, Mol Syst Bio)
  - Cannot differentiate causal vs. downstream genes
Our approach
• Key observation: many known disease genes are not differentially expressed (DE) between metastasis and non-metastasis, but their neighbors are
  - e.g., P53 and BRCA2
• Intuition: find a small number of intermediate genes that connect DE genes
• Known as the Steiner tree problem in computer science
Steiner tree problem
• Given a connected graph and a list of input nodes, find the smallest number of additional nodes so that there is a tree connecting the input nodes and those additional nodes
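Finding the exact minimum is NP-hard in general, so heuristics are commonly used. The sketch below is one such heuristic, assuming an unweighted graph stored as adjacency lists: it grows a tree from one input node and repeatedly attaches the nearest unconnected input node along a shortest path. It is an approximation for illustration and not necessarily the algorithm used in this work; the function name `steiner_nodes` and the toy graph are hypothetical.

```python
from collections import deque

def steiner_nodes(adj, terminals):
    """Greedy shortest-path heuristic for the Steiner tree problem.

    adj: dict mapping node -> iterable of neighbors (undirected graph).
    Returns the set of tree nodes: the terminals plus the connector nodes.
    """
    terminals = list(terminals)
    tree = {terminals[0]}
    remaining = set(terminals[1:])
    while remaining:
        # Multi-source BFS from the current tree to the nearest remaining terminal.
        parent = {v: None for v in tree}
        queue = deque(tree)
        hit = None
        while queue and hit is None:
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent:
                    parent[w] = u
                    if w in remaining:
                        hit = w
                        break
                    queue.append(w)
        if hit is None:
            raise ValueError("input nodes are not all connected")
        # Walk back to the tree, adding every node on the path.
        while hit is not None:
            tree.add(hit)
            remaining.discard(hit)
            hit = parent[hit]
    return tree

# Toy graph (hypothetical): three input genes that only meet through gene "c".
adj = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a", "b", "d"]}
tree = steiner_nodes(adj, ["a", "b", "d"])
print(sorted(tree - {"a", "b", "d"}))  # the added connector nodes
```

In the biomarker setting, the input nodes would be the differentially expressed genes and the connector nodes the candidate intermediate genes that are not themselves differentially expressed.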
Method overview
Experiment setup
• Two breast cancer microarray data sets
  - van de Vijver et al., 2002 (Agilent, 78 +ve vs. 217 -ve)
  - Wang et al., 2005 (Affymetrix, 106 +ve vs. 180 -ve)
• Two human PPI networks
  - PINA (Wu et al., 2009): 10,920 proteins and 61,746 interactions
  - Chuang et al., 2007: 11,203 proteins and 57,235 interactions
Steiner tree example for Wang data set
Cross-dataset stability of markers
Marker set                               PPI network   van de Vijver dataset   Wang dataset   Common genes (% of overlap)   p-value
Steiner tree-based markers               Chuang-PPI    1123                    1100           384 (20.88%)                  2E-127
Steiner tree-based markers               PINA-PPI      981                     1135           400 (23.31%)                  5E-158
Top scoring Steiner tree-based markers   Chuang-PPI    203                     194            63 (18.86%)                   7E-064
Top scoring Steiner tree-based markers   PINA-PPI      164                     185            80 (29.74%)                   3E-103
Differentially expressed genes           -             324                     319            34 (5.58%)                    1E-008
Approach in Chuang et al. (2007)         -             618                     906            175 (12.8%)                   6E-054
Approach in original study               -             70                      76             3 (2.09%)                     .03
Enriched cancer-related pathways
[Figure: enrichment score for Wang dataset with Chuang-PPI network]
Overlap with known breast cancer genes
Overlap of 60 known breast cancer genes with STMs, t-STMs, DE genes, Chuang et al (2007) method and corresponding original studies. (A) Wang (B) van de Vijver dataset.
Classification accuracy
Classification accuracy based on the AUC metric. The dataset label indicates the dataset from which the features were taken. For cross-dataset classification, features are taken from the labeled dataset and applied to the other dataset. (A) Bayesian logistic regression; (B) random tree classifier.
Conclusions
• Methods for community discovery
  - Fully automated, i.e., parameter-free
  - Higher accuracy than competing methods that require extensive parameter tuning
  - Improves protein complex prediction and microarray data clustering, including tumor subtype classification
  - Many other applications
• Steiner tree-based biomarker discovery
  - Improves stability of metastatic breast cancer markers, and cross-dataset classification accuracy
  - A generic method for identifying the hidden causal genes given downstream targets
Future work
• Network-based gene-specific studies
• Combine random walk with Steiner tree algorithms to improve biomarker discovery and cancer classification
• Other types of data
• Other diseases
Questions?