Network-based analysis of functional genomics data

Jianhua Ruan, PhD

Computer Science Department, University of Texas at San Antonio

http://www.cs.utsa.edu/~jruan


UTSA

Final project
  Final project report due Sat, Dec 15
  Presentations: Mon, Dec 17, 8-10:30 pm
  10 teams to present
  Each team will have up to 13 minutes (10 min presentation, 3 min questions)
  Since time is limited, you don’t need to cover all the details in your presentation
    Focus on the most important concepts
    More details in your project report

Dog: ~20,000 genes

Human: ~22,000 genes

Rice: ~35,000 genes

Mouse: ~22,000 genes

C. elegans: ~20,000 genes

It is not just the genes, but the networks!


Why networks?
For complex systems, the actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts
Studying genes/proteins on the network level allows us to:
  Assess the role of individual genes/proteins in the overall pathway
  Evaluate redundancy of network components
  Identify candidate genes involved in genetic diseases
  Set up the framework for mathematical models

Graph model of biological networks
An abstraction of the complex relationships among molecules in the cell
  Vertices: molecules
    Gene, protein, metabolite, DNA, RNA
  Edges: relationships
    Physical interaction
    Functional association
Share many common statistical properties with real-world networks
  Small-world
  Scale-free
  Hierarchical
  Modular (community structure)
(Jeong et al., 2001)

Research Overview
  Network-based disease studies
  Genetic network reverse-engineering
  Network analysis algorithms
  Data mining, tree models, DNA motif finding
  Community discovery
  Data integration, classification, graph algorithms

Agenda
  Community discovery in biological networks
  Network-based analysis of microarray data
  Network-based biomarker discovery for metastatic breast cancer
  Conclusions

Network communities
Communities
  Are relatively densely connected sub-networks (modules)
  Appear in many types of networks
  Independently studied in many fields: social science, computer science, physics, biology, etc.
Significance
  Biological systems are modular
    Metabolic pathways
    Protein complexes
    Transcriptional regulatory modules
  Biological systems are large and complex
    Communities provide a high-level overview of the networks
  Guilt-by-association
    Predict gene functions based on community memberships

Community discovery problem
  Divide a network into relatively densely connected sub-networks
  Similar to clustering, but the number of clusters is determined automatically
[Figure label: vertex reorder]

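Modularity function (Q)
Measures the strength of community structure (Newman, Phys Rev E, 2003)

Q = \sum_i (e_{ii} - a_i^2)

where e_{ii} is the fraction of edges that fall within community i and a_i is the fraction of edge ends attached to vertices in community i.

[Figure: adjacency matrix of a 50-vertex network with diagonal blocks e11, e22, e33, e44, e55; example partitions with Q = 0.45, Q = 0.56, Q = 0, Q = 0.40, Q = 0.54]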

Modularity automatically determines # of communities!
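To make the definition concrete, here is a minimal sketch that evaluates Q for a given partition of a small undirected network. It assumes the standard Newman form, Q = sum_i (e_ii - a_i^2); the function name `modularity` and the toy network are illustrative and not taken from the slides.

```python
import numpy as np

def modularity(adj, labels):
    """Newman's Q = sum_i (e_ii - a_i^2) for an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    total = adj.sum()  # equals 2m: every edge is counted from both ends
    q = 0.0
    for c in np.unique(labels):
        members = labels == c
        e_cc = adj[np.ix_(members, members)].sum() / total  # edge ends inside c
        a_c = adj[members, :].sum() / total                 # edge ends attached to c
        q += e_cc - a_c ** 2
    return q

# Toy network (illustrative): two triangles joined by one bridge edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1

q_two = modularity(A, [0, 0, 0, 1, 1, 1])  # the natural two-community split
q_one = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

On this toy network the two-triangle split scores Q = 5/14 while the single-community partition scores Q = 0, which is why maximizing Q can pick the number of communities without it being supplied.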


Methods for community discovery
Previous methods
  Fast but inaccurate (CNM, Phys Rev E, 2004)
  Accurate but slow (Guimera & Amaral, Nature 2005)
  Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007)
  Relatively accurate, fast, memory intensive (Newman, PNAS 2006)
Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008)
  Accurate, fast and memory friendly
  HQcut solves the resolution limit of Q

Algorithm Qcut
1. Recursive multi-way partitioning until Q is maximized
2. Improve Q by efficient heuristic search
[Figure: adjacency matrix partitioned with eig and kmeans; accuracy vs. inter-community edge probability for our method and Newman’s]

Resolution limit and HQcut
Q is known to have a resolution limit problem
  Cannot detect small communities
  Q slightly decreases if forced to split
HQcut solves this problem
  Apply Qcut to get communities with the largest Q
  Recursively search for sub-communities within each community without dramatic change to Q
  Statistical test for termination criteria
Ruan & Zhang, Physical Review E 2008

Graphical user interface for Qcut/HQcut


Application: protein complex prediction
Input: a yeast PPI network
  Data from Krogan et al., Nature. 2006;440:637-43
  2708 vertices (proteins), 7123 interactions
289 communities
  Sizes range from 2 to 49
Evaluation: compare communities with known complexes manually curated in MIPS database

Communities in a yeast PPI network

Small ribosomal subunit (90%)

RNA poly II mediator (83%)

Proteasome core (90%)

Exosome (94%)

gamma-tubulin (77%)

respiratory chain complex IV (82%)


Communities vs. complexes
Communities and complexes have good one-to-one correspondence
Overall accuracy: 0.81 (Newman: 0.58)
Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| x |K|)
[Figure: known complex vs. predicted complex]
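The agreement score between a predicted complex P and a known complex K, |P∩K| / sqrt(|P| x |K|), can be sketched in a few lines. The function name `agreement` and the protein identifiers are hypothetical, used only for illustration.

```python
import math

def agreement(predicted, known):
    """|P ∩ K| / sqrt(|P| * |K|) for two collections of protein IDs."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0
    return len(p & k) / math.sqrt(len(p) * len(k))

# Hypothetical protein identifiers, for illustration only.
predicted = {"P1", "P2", "P3", "P4"}
known = {"P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"}
score = agreement(predicted, known)  # 3 shared proteins / sqrt(4 * 9)
```

The score is 1 only when the two sets are identical, and the geometric mean in the denominator penalizes a small community that sits inside a much larger complex.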


Work-in-progress: Random walk-based improvement
Motivation:
  PPI networks often contain both false positive and false negative edges
  Hub genes prevent good partitioning
Three goals:
  Eliminate false positive edges
  Predict missing links
  Reduce the impact of hub genes
Intuition:
  Two proteins with high topological similarity, whether directly connected or not, may belong to the same complex
  Two proteins with a direct link but very different topological properties may belong to different complexes

Method overview
[Figure: original network -> adjacency matrix (= initial prob vectors) -> random walk with resistance (guided by topology) -> equilibrium prob vectors -> distance calculation -> similarity matrix -> threshold -> new network]
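The pipeline in this figure can be approximated with a short sketch. The slides do not specify the "random walk with resistance", so the code below substitutes a plain random walk with restart, and it assumes cosine similarity between equilibrium vectors as the comparison step. The function names, the restart probability and the threshold are all illustrative choices, not the author's settings.

```python
import numpy as np

def rwr_similarity(adj, restart=0.5):
    """Cosine similarity between per-vertex random-walk-with-restart profiles.

    Assumes an undirected graph with no isolated vertices.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Column-normalize so column j holds the transition probabilities out of j.
    trans = adj / adj.sum(axis=0, keepdims=True)
    # Column j of `profiles` is the equilibrium distribution of a walker that
    # jumps back to vertex j with probability `restart` at every step.
    profiles = restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * trans)
    unit = profiles / np.linalg.norm(profiles, axis=0, keepdims=True)
    return unit.T @ unit

def rebuild_network(adj, threshold, restart=0.5):
    """Connect every pair whose profile similarity reaches the threshold."""
    sim_matrix = rwr_similarity(adj, restart)
    rebuilt = (sim_matrix >= threshold).astype(int)
    np.fill_diagonal(rebuilt, 0)
    return rebuilt

# Toy network (illustrative): two triangles joined by one bridge edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1

sim = rwr_similarity(A)
new_adj = rebuild_network(A, threshold=0.5)
```

Vertices in the same triangle end up with more similar profiles than vertices in different triangles, so thresholding the similarity matrix can both add edges between unlinked but similar vertices and drop edges between linked but dissimilar ones.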


Preliminary results on yeast PPI
Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence
  Gene Ontology
  Gene expression
  etc.
New network significantly improved accuracy of protein complex predictions
  Using HQcut: 0.50 to 0.55
  Using MCL: 0.48 to 0.59

Agenda
  Community structure in biological networks
    Prediction of protein complexes
  Network-based microarray data analysis
  Network-based biomarker discovery for metastatic breast cancer
  Conclusions

Microarray data analysis
Gene network structure is unknown
Microarray measures gene expression (activity) level
Clustering is the most common analysis tool
Many clustering algorithms available
  K-means
  Hierarchical
  Self-organizing maps
  Parameter (e.g., k) hard to guess
  Does not consider network structure
Common functions? Common regulation? Predict functions for unknown genes?
[Figure: clustering of the expression matrix (genes x conditions)]

Network-based microarray data analysis
Genes i and j are connected if their expression patterns are “sufficiently similar”
  Pearson correlation coefficient > arbitrary threshold
  K nearest neighbors (KNN)
Key: how to get the “best” network?
[Figure: expression matrix (gene x sample) -> construct co-expression network with edge (i, j)]
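Both ways of constructing a co-expression network can be sketched with NumPy. The function names and the toy expression matrix are illustrative. The mutual-neighbor rule (an edge only when each gene is among the other's k nearest neighbors) is an assumption about how a KNN network is made symmetric; the slides do not spell it out.

```python
import numpy as np

def correlation_network(expr, threshold):
    """Edge (i, j) when the Pearson correlation of rows i and j exceeds threshold."""
    corr = np.corrcoef(expr)  # rows are genes, columns are samples
    adj = (corr > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def mutual_knn_network(expr, k):
    """Edge (i, j) when i is among j's k most correlated genes and vice versa."""
    corr = np.corrcoef(expr)
    np.fill_diagonal(corr, -np.inf)  # a gene is never its own neighbor
    n = corr.shape[0]
    nearest = np.argsort(-corr, axis=1)[:, :k]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return (is_nn & is_nn.T).astype(int)

# Made-up expression values: 4 genes (rows) measured in 5 samples (columns).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
])
net_threshold = correlation_network(expr, threshold=0.7)
net_knn = mutual_knn_network(expr, k=1)
```

On this toy matrix both constructions link gene 0 with gene 1 and gene 2 with gene 3. In real data the two differ: a threshold lets hub genes collect many edges, while KNN bounds each gene's neighborhood, and either way the parameter (threshold or k) still has to be chosen.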


Motivation
Can we use the idea of community discovery for clustering microarray data?
Advantages:
  Parameter free
  Network topology considered
  Constructed network may have other interesting applications beyond clustering

Our idea
Intuition: the real network is naturally modular
  Can be measured by modularity (Q)
  If constructed right, should have the highest Q
[Figure: microarray data -> similarity matrix -> network series (Net_1, most dense, ..., Net_m, most sparse) -> Qcut on each network]
Ruan, ICDM 2009

Our idea (cont’d)
[Figure: modularity vs. network density for the true network and a random network, and their difference]

• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure


Results: synthetic data set 1
High dimensional data generated by synDeca
  20 clusters of high dimensional points, plus some scatter points
  Clusters are of various shapes: ellipse, rectangle, random
[Figure: data matrix; Qreal, Qrandom, ∆Q (Qreal - Qrandom) and clustering accuracy vs. number of neighbors]

Comparison
[Figure: clustering accuracy vs. dimension for mKNN-HQcut with automatically determined k (this work), mKNN-HQcut with the optimum k, and K-means]

Results: synthetic data set 2
Gene expression data
  Thalamuthu et al, 2006
  600 data sets
  ~600 genes, 50 conditions, 15 clusters
  0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and mKNN-HQcut with auto k]

Comparison with other methods

Ruan et al., BioKDD 2010


Results on yeast stress response data
  3000 genes, 173 samples
  Best k = 140, resulting in 75 clusters

Results on yeast stress response data
Enrichment of common functions
  Cumulative hypergeometric test (Fisher’s exact test)
[Figure: genes vs. GO function terms, with enriched clusters labeled]
  Protein biosynthesis (p < 10^-96)
  Nuclear transport (p < 10^-50)
  mt ribosome (p < 10^-63)
  DNA repair (p < 10^-66)
  RNA splicing (p < 10^-105)
  Nitrogen compound metabolism (p < 10^-37)
  Peroxisome (p < 10^-13)
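The enrichment p-values come from a cumulative hypergeometric test: the probability of seeing at least the observed number of annotated genes in a cluster by chance. A minimal sketch using only the standard library follows; the function name and the counts are made up for illustration and are not the yeast results.

```python
from math import comb

def enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement."""
    upper = min(annotated, cluster_size)
    favorable = sum(
        comb(annotated, x) * comb(population - annotated, cluster_size - x)
        for x in range(hits, upper + 1)
    )
    return favorable / comb(population, cluster_size)

# Made-up counts: 20 genes in total, 5 carry the GO term,
# and a 4-gene cluster contains 3 of them.
p = enrichment_pvalue(population=20, annotated=5, cluster_size=4, hits=3)
```

Exact integer arithmetic keeps the tail sum accurate for small examples like this one. For genome-scale counts and p-values as small as those on the slide, a log-space implementation from a statistics library would be the more practical choice.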


Comparison with k-means
[Figure: overall enrichment score for K-means and for mkNN-HQcut using automatically determined k = 140]

An interesting community
A 25-gene community missed by other methods
  4 transcription factors regulate many genes in the community
  4 telomere maintenance genes (p < 10^-7)
  16 unknown genes, all located in chromosome telomeric regions
  5 other genes at rim of the sub-network

Application to Arabidopsis data
  ~22000 genes, 1138 samples
  1150 singletons
  800 (300) modules of size >= 10 (20)
  > 80% (90%) of modules have enriched functions
  Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]

Cis-regulatory network of Arabidopsis
[Figure labels: Motif, Module]
Ruan et al., BMC Bioinfo, to appear

Beyond gene clustering (1)
Gene specific studies
Collaborator is interested in Gibberellins
  A hormone important for the growth and development of plants
  Commercially important
  Biosynthesis and signaling well studied
  Transcriptional regulation of biosynthesis and signaling not yet clear
3 important gene families for biosynthesis: GA20ox, GA3ox and GA2ox
Receptor gene family: GID1A, B, C
Analyze the co-expression network around these genes

[Figure: co-expression network around GID1A, GID1B, GID1C, GA3 and the biosynthesis gene families 20ox (20ox1 to 20ox5), 3ox (3ox1 to 3ox4) and 2ox (2ox1 to 2ox4, 2ox6 to 2ox8)]

Beyond gene clusters (2)
Cancer classification
[Figure: expression matrix (gene x sample) -> sample x sample network -> Qcut; sample: patient or cell lines; data from Alizadeh et al., Nature, 2000]

Network of cell samples
  Shape: cell line / cancer type
  Color: clustering results
[Figure labels: Activated Blood B, Resting Blood B, Blood T, Transformed cell lines, Chronic lymphocytic leukemia (CLL), Follicular lymphoma (FL), Diffuse large B-cell lymphoma (DLBCL, labeled on three groups)]

Survival rate after chemotherapy
[Figure: DLBCL-1, DLBCL-2, DLBCL-3]
Survival rate: 73%
Median survival time: 71.3 months

Beyond gene clustering (3)
Topology vs function
[Figure: % of essential proteins vs. number of connections (Jeong et al., Nature 2001)]

Community participation vs. essentiality (PPI)
[Figure: % essential vs. community participation, for hub and non-hub]

Community participation vs. essentiality (coexp)
[Figure: % essential vs. community participation, for hub and non-hub; % essential vs. number of connections, for participation < 0.2 and participation >= 0.2]
Key: how to systematically search for such relationships?
  Data mining: association rules?

Agenda
  Community structure in biological networks
    Prediction of protein complexes
  Network-based microarray data analysis
  Network-based biomarker discovery for metastatic breast cancer
  Conclusions

Background
Metastasis is the spread of cancer from one organ to another non-adjacent organ or part
Challenge: predict metastasis
  If metastasis is likely => aggressive adjuvant therapy
  How to decide the likelihood?
  Traditional predictive factors are not good enough

Microarray-based marker discovery
Examine genome-wide expression profiles
Idea:
  Score individual genes for how well they discriminate between different classes of disease
  Establish gene expression signature
Limitations:
  # genes >> # patients
  Downstream effects
  Individual variations not attributed to cancer
Consequences:
  Low reproducibility across data sets
  Missing biological insight
[Figure: expression matrix with dimensions labeled M and N]

Pathway vs. PPI Subnetwork as Marker
Remedy to the problems: use pathway information
  Fewer features => better stability
  Biological insight
  Limitation: majority of human genes not yet assigned to a definitive pathway
Alternatively: protein-protein interaction (PPI) networks
  Genes in the same pathway may have short distance in PPI
  Subnetworks (potential pathways) as markers (Chuang et al. 2007, Mol Syst Bio.)
  Cannot differentiate causal vs. downstream genes

Our approach
Key observation: many known disease genes are not differentially expressed (DE) between metastasis and non-metastasis, but their neighbors are
  e.g., P53 and BRCA2
Intuition: find a small number of intermediate genes that connect DE genes
  Known as the Steiner tree problem in computer science

Steiner tree problem
Given a connected graph and a list of input nodes, find the smallest number of additional nodes so that there is a tree connecting the input nodes and those additional nodes
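Finding the exact minimum is NP-hard in general, so practical implementations rely on heuristics. The sketch below uses a simple greedy shortest-path heuristic on an unweighted graph: grow a tree from one input node and repeatedly attach the nearest remaining input node via a breadth-first search. It is one common approximation, not necessarily the algorithm used in this work, and the gene names in the toy graph are hypothetical.

```python
from collections import deque

def steiner_nodes(graph, terminals):
    """Greedy shortest-path heuristic; returns the set of nodes in the tree.

    `graph` maps each node to a list of neighbors; node names must be sortable.
    """
    terminals = list(terminals)
    tree = {terminals[0]}
    remaining = set(terminals[1:])
    while remaining:
        # Multi-source BFS from every node already in the tree.
        parent = {node: None for node in tree}
        queue = deque(sorted(tree))
        reached = None
        while queue:
            node = queue.popleft()
            if node in remaining:
                reached = node
                break
            for nxt in graph[node]:
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if reached is None:
            raise ValueError("graph does not connect all input nodes")
        # Walk back to the tree, adding the intermediate nodes on the path.
        node = reached
        while node is not None:
            tree.add(node)
            node = parent[node]
        remaining.discard(reached)
    return tree

# Hypothetical example: DE genes A, B, C all neighbor the non-DE gene H;
# A and B are also linked by a longer path through X and Y.
graph = {
    "A": ["H", "X"], "B": ["H", "Y"], "C": ["H"],
    "H": ["A", "B", "C"], "X": ["A", "Y"], "Y": ["X", "B"],
}
nodes = steiner_nodes(graph, ["A", "B", "C"])
extra = nodes - {"A", "B", "C"}
```

In this toy graph the only added node is H, the intermediate gene that connects all three input genes, which mirrors the intuition of recovering non-DE genes whose neighbors are differentially expressed.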


Method overview


Experiment setup
Two breast cancer microarray data sets
  van de Vijver et al, 2002 (Agilent, 78 +ve vs 217 -ve)
  Wang et al, 2005 (Affymetrix, 106 +ve vs 180 -ve)
Two human PPI networks
  PINA (Wu et al 2009), 10,920 proteins and 61,746 interactions
  Chuang et al 2007, 11,203 proteins and 57,235 interactions

Steiner tree example for Wang data set


Cross-dataset stability of markers

Marker set                              PPI network  van de Vijver dataset  Wang dataset  Common genes (% of overlap)  p-value
Steiner tree-based markers              Chuang-PPI   1123                   1100          384 (20.88%)                 2E-127
Steiner tree-based markers              PINA-PPI     981                    1135          400 (23.31%)                 5E-158
Top scoring Steiner tree-based markers  Chuang-PPI   203                    194           63 (18.86%)                  7E-064
Top scoring Steiner tree-based markers  PINA-PPI     164                    185           80 (29.74%)                  3E-103
Differentially expressed genes          -            324                    319           34 (5.58%)                   1E-008
Approach in Chuang et al (2007)         -            618                    906           175 (12.8%)                  6E-054
Approach in original study              -            70                     76            3 (2.09%)                    0.03
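The table does not state how the overlap percentage is defined. Several rows are consistent with the number of common genes divided by the size of the union of the two marker sets (a Jaccard index), so the sketch below assumes that reading. The function name is illustrative; the counts are copied from three rows of the table.

```python
def overlap_percent(size_a, size_b, common):
    """Common genes as a percentage of the union of two marker sets."""
    union = size_a + size_b - common
    return 100.0 * common / union

# (marker set size in each data set, number of common genes), from the table.
rows = {
    "Steiner tree-based, Chuang-PPI": (1123, 1100, 384),
    "Steiner tree-based, PINA-PPI": (981, 1135, 400),
    "Differentially expressed genes": (324, 319, 34),
}
percents = {name: round(overlap_percent(*row), 2) for name, row in rows.items()}
```

Under this assumption the three rows come out to 20.88%, 23.31% and 5.58%, matching the reported values.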


Enriched cancer-related pathways
Enrichment score for Wang dataset with Chuang-PPI network.

Overlap with known breast cancer genes

Overlap of 60 known breast cancer genes with STMs, t-STMs, DE genes, Chuang et al (2007) method and corresponding original studies. (A) Wang (B) van de Vijver dataset.


Classification accuracy
Classification accuracy based on the AUC metric. The dataset label indicates the dataset from which the features were taken. For cross-dataset classification, features were taken from the labeled dataset and applied to the other dataset. (A) Bayesian logistic regression. (B) Random tree classifier.

Conclusions
Methods for community discovery
  Fully automated, i.e. parameter-free
  Higher accuracy than competing methods that require extensive parameter tuning
  Improves protein complex prediction and microarray data clustering, including tumor subtype classification
  Many other applications
Steiner-tree based biomarker discovery
  Improves stability of metastatic breast cancer markers, and cross-dataset classification accuracy
  A generic method for identifying the hidden causal genes given downstream targets

Future work
  Network-based gene-specific studies
  Combine random walk with Steiner tree algorithms to improve biomarker discovery and cancer classification
  Other types of data
  Other diseases

Questions?
