Network-based analysis of functional genomics data Jianhua Ruan, PhD Computer Science Department University of Texas at San Antonio http://www.cs.utsa.edu/~jruan

Network-based analysis of functional genomics data

Feb 10, 2016


Page 1: Network-based analysis of functional genomics data

Network-based analysis of functional genomics data
Jianhua Ruan, PhD
Computer Science Department
University of Texas at San Antonio
http://www.cs.utsa.edu/~jruan

Page 2: Network-based analysis of functional genomics data

UTSA

Final project
Final Project Report due Sat, Dec 15
Presentations: Mon, Dec 17, 8-10:30 pm
10 teams to present
Each team will have up to 13 minutes (10 min presentation, 3 min questions)
Since time is limited, you don’t need to cover all the details in your presentation
  Focus on the most important concepts
  More details in your project report

Page 3: Network-based analysis of functional genomics data

Dog: ~20,000 genes

Human: ~22,000 genes

Rice: ~35,000 genes

Mouse: ~22,000 genes

C. elegans: ~20,000 genes

It is not just the genes, but the networks!

Page 4: Network-based analysis of functional genomics data

Why networks?
For complex systems, the actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts
Studying genes/proteins at the network level allows us to:
  Assess the role of individual genes/proteins in the overall pathway
  Evaluate redundancy of network components
  Identify candidate genes involved in genetic diseases
  Set up the framework for mathematical models

Page 5: Network-based analysis of functional genomics data

Graph model of biological networks
An abstraction of the complex relationships among molecules in the cell
  Vertex: molecule (gene, protein, metabolite, DNA, RNA)
  Edge: relationship (physical interaction, functional association)
Share many common statistical properties with real-world networks
  Small-world
  Scale-free
  Hierarchical
  Modular (community structure)
(Jeong et al., 2001)

Page 6: Network-based analysis of functional genomics data

Research overview
Research areas: network-based disease studies; genetic network reverse-engineering; network analysis algorithms
Methods: data mining, tree models, DNA motif finding; community discovery; data integration, classification, graph algorithms

Page 7: Network-based analysis of functional genomics data

Agenda
Community discovery in biological networks
Network-based analysis of microarray data
Network-based biomarker discovery for metastatic breast cancer
Conclusions

Page 8: Network-based analysis of functional genomics data

Network communities
Communities
  Are relatively densely connected sub-networks (modules)
  Appear in many types of networks
  Independently studied in many fields: social science, computer science, physics, biology, etc.
Significance
  Biological systems are modular: metabolic pathways, protein complexes, transcriptional regulatory modules
  Biological systems are large and complex: communities provide a high-level overview of the networks
  Guilt-by-association: predict gene functions based on community memberships

Page 9: Network-based analysis of functional genomics data

Community discovery problem
Divide a network into relatively densely connected sub-networks
Similar to clustering, but the number of clusters is determined automatically
[Figure: vertex reorder]

Page 10: Network-based analysis of functional genomics data

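Modularity function (Q)
Measures the strength of community structure (Newman, Phys Rev E, 2003)
[Figure: reordered adjacency matrix with diagonal blocks e11, e22, e33, e44, e55]
$Q = \sum_i (e_{ii} - a_i^2)$, where $e_{ii}$ is the fraction of edges within community $i$ and $a_i = \sum_j e_{ij}$ is the fraction of edge ends attached to community $i$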
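As a rough illustration of the modularity function, the sketch below computes Q for an undirected, unweighted graph using Newman's standard definition (fraction of edges inside each community minus the squared fraction of edge ends attached to it). The function name `modularity` and the toy graph are assumptions made for this example and are not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    inside = defaultdict(int)  # number of edges with both ends in community i
    ends = defaultdict(int)    # number of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}

q_two = modularity(edges, two_groups)  # split into the two triangles
q_one = modularity(edges, one_group)   # everything in one community
```

Putting every vertex in a single community gives Q = 0, while the split into the two triangles gives a positive Q, which is why maximizing Q can choose the number of communities on its own.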

Page 11: Network-based analysis of functional genomics data

[Figure: example partitions of a network with Q = 0.45, Q = 0.56, Q = 0, Q = 0.40 and Q = 0.54]

Modularity automatically determines # of communities!

Page 12: Network-based analysis of functional genomics data

Methods for community discovery
Previous methods
  Fast but inaccurate (CNM, Phys Rev E, 2004)
  Accurate but slow (Guimera & Amaral, Nature 2005)
  Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007)
  Relatively accurate, fast, memory intensive (Newman, PNAS 2006)
Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008)
  Accurate, fast and memory friendly
  HQcut solves the resolution limit of Q

Page 13: Network-based analysis of functional genomics data

Algorithm Qcut
1. Recursive multi-way partitioning until Q is max
2. Improve Q by efficient heuristic search
[Figure: partitioning step, labeled eig and kmeans]
[Figure: accuracy vs. inter-community edge probability, our method vs. Newman’s]

Page 14: Network-based analysis of functional genomics data

Resolution limit and HQcut
Q is known to have a resolution limit problem
  Cannot detect small communities
  Q slightly decreases if forced to split
HQcut solves this problem
  Apply Qcut to get communities with largest Q
  Recursively search for sub-communities within each community without dramatic change to Q
  Statistical test for termination criteria
Ruan & Zhang, Physical Review E 2008

Page 15: Network-based analysis of functional genomics data

Graphical user interface for Qcut/HQcut

Page 16: Network-based analysis of functional genomics data

Application: protein complex prediction
Input: a yeast PPI network
  Data from Krogan et al., Nature. 2006;440:637-43
  2708 vertices (proteins), 7123 interactions
289 communities
  Sizes range from 2 to 49
Evaluation: compare communities with known complexes manually curated in the MIPS database

Page 17: Network-based analysis of functional genomics data

Communities in a yeast PPI network

Small ribosomal subunit (90%)

RNA poly II mediator (83%)

Proteasome core (90%)

Exosome (94%)

gamma-tubulin (77%)

respiratory chain complex IV (82%)

Page 18: Network-based analysis of functional genomics data

Communities vs. complexes

Communities and complexes have good one to one correspondence

Overall accuracy: 0.81

Newman: 0.58

Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| x |K|).

[Figure: predicted complexes vs. known complexes]
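The agreement measure |P∩K| / sqrt(|P| x |K|) can be sketched in a few lines. The helper `best_match_scores`, which takes the best-scoring known complex for each predicted community, is only an assumption about how such scores might be aggregated; the slides do not say how the overall accuracy is computed. The protein names are placeholders.

```python
from math import sqrt

def agreement(predicted, known):
    """Agreement between a predicted complex P and a known complex K."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0
    return len(p & k) / sqrt(len(p) * len(k))

def best_match_scores(communities, complexes):
    """Hypothetical aggregation: best agreement of each community with any known complex."""
    return [max(agreement(c, k) for k in complexes) for c in communities]

# Placeholder protein sets: 3 shared members, both of size 4.
score = agreement({"A", "B", "C", "D"}, {"B", "C", "D", "E"})
identical = agreement({"A", "B"}, {"A", "B"})
disjoint = agreement({"A", "B"}, {"C", "D"})
```

The score is 1 when the two sets are identical and 0 when they share no proteins, so it behaves like a cosine similarity between membership vectors.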

Page 19: Network-based analysis of functional genomics data

Work-in-progress: Random walk-based improvement
Motivation:
  PPI networks often contain both false positive and false negative edges
  Hub genes prevent good partitioning
Three goals:
  Eliminate false positive edges
  Predict missing links
  Reduce the impact of hub genes
Intuition:
  Two proteins with high topological similarity, whether connected or not, may belong to the same complex
  Two proteins with a direct link but very different topological properties may belong to different complexes

Page 20: Network-based analysis of functional genomics data

Method overview
[Figure: original network = adjacency matrix → initial prob vectors → random walk with resistance (guided by topology) → equilibrium prob vectors → distance calculation → similarity matrix → threshold → new network]
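The slides do not spell out the "random walk with resistance", so the sketch below swaps in a plain random walk with restart as a stand-in to show the overall shape of the pipeline: one equilibrium probability vector per node, a similarity between vectors, and a threshold that yields a new edge set. The function names, the restart probability of 0.5, the cosine similarity and the toy graph are all assumptions for illustration. The code assumes every node has at least one neighbor.

```python
from math import sqrt

def rwr_vectors(adj, restart=0.5, iters=100):
    """Equilibrium visiting probabilities of a random walk with restart, one vector per start node.

    adj: dict mapping node -> list of neighbors (undirected, no isolated nodes).
    """
    nodes = sorted(adj)
    vectors = {}
    for s in nodes:
        p = {n: 1.0 if n == s else 0.0 for n in nodes}  # initial prob vector
        for _ in range(iters):
            nxt = {n: (restart if n == s else 0.0) for n in nodes}
            for u in nodes:
                share = (1 - restart) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            p = nxt
        vectors[s] = [p[n] for n in nodes]
    return vectors

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def rebuild(adj, threshold, restart=0.5):
    """Connect every pair of nodes whose equilibrium vectors are similar enough."""
    vec = rwr_vectors(adj, restart)
    nodes = sorted(adj)
    return {(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
            if cosine(vec[u], vec[v]) >= threshold}

# Toy network (hypothetical): two triangles joined by the edge 2-3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
vec = rwr_vectors(toy)
new_edges = rebuild(toy, threshold=0.5)
```

Because the new edges depend only on vector similarity, a pair of unconnected nodes can gain an edge and a connected pair can lose one, which matches the goals of predicting missing links and removing false positives.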

Page 21: Network-based analysis of functional genomics data

Preliminary results on yeast PPI
Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence
  Gene Ontology
  Gene expression
  etc.
New network significantly improved accuracy of protein complex predictions
  Using HQcut: 0.50 to 0.55
  Using MCL: 0.48 to 0.59

Page 22: Network-based analysis of functional genomics data

Agenda
Community structure in biological networks
  Prediction of protein complexes
Network-based microarray data analysis
Network-based biomarker discovery for metastatic breast cancer
Conclusions

Page 23: Network-based analysis of functional genomics data

Microarray data analysis
Gene network structure is unknown
Microarray measures gene expression (activity) level
Clustering is the most common analysis tool
Many clustering algorithms available
  K-means
  Hierarchical
  Self-organizing maps
Parameter (e.g., k) hard to guess
Does not consider network structure
[Figure: clustering of an expression matrix, genes x conditions]
Common functions? Common regulation? Predict functions for unknown genes?

Page 24: Network-based analysis of functional genomics data

Network-based microarray data analysis
Genes i and j are connected if their expression patterns are “sufficiently similar”
  Pearson correlation coefficient > arbitrary threshold
  K nearest neighbors (KNN)
Key: how to get the “best” network?
[Figure: gene x sample expression matrix → construct co-expression network]
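The two construction rules named on this slide (correlation threshold and K nearest neighbors) might be sketched as follows. The mutual-KNN option, where an edge requires each gene to be among the other's k nearest neighbors, is an assumption suggested by the "mKNN" label used later in the talk. Function names and expression values are made up for illustration, and the code assumes no gene has a constant expression profile.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def threshold_network(expr, cutoff):
    """Connect genes whose correlation exceeds the cutoff."""
    genes = sorted(expr)
    return {(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if pearson(expr[g], expr[h]) > cutoff}

def knn_network(expr, k, mutual=True):
    """Connect each gene to its k most correlated genes (mutually, if requested)."""
    genes = sorted(expr)
    nearest = {}
    for g in genes:
        ranked = sorted((h for h in genes if h != g),
                        key=lambda h: pearson(expr[g], expr[h]), reverse=True)
        nearest[g] = set(ranked[:k])
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            gh, hg = h in nearest[g], g in nearest[h]
            if (gh and hg) if mutual else (gh or hg):
                edges.add((g, h))
    return edges

# Hypothetical expression profiles over four samples.
expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
}
by_threshold = threshold_network(expr, 0.9)
by_knn = knn_network(expr, k=1)
```

Both rules still leave one parameter open (the cutoff or k), which is the question the slide raises: how to pick the value that gives the "best" network.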

Page 25: Network-based analysis of functional genomics data

Motivation
Can we use the idea of community discovery for clustering microarray data?
Advantages:
  Parameter free
  Network topology considered
  Constructed network may have other interesting applications beyond clustering

Page 26: Network-based analysis of functional genomics data

Our idea
Intuition: the real network is naturally modular
  Can be measured by modularity (Q)
  If constructed right, should have the highest Q
[Figure: microarray data → similarity matrix → network series (Net_1, most dense, ..., Net_m, most sparse) → Qcut on each network]
Ruan, ICDM 2009

Page 27: Network-based analysis of functional genomics data

Our idea (cont’d)
[Figure: modularity vs. network density for the true network and a random network, and their difference]

• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
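A minimal sketch of the selection rule: given the modularity of the real network and of a comparable random network for each candidate parameter value, pick the value with the largest difference. The function name and the numbers are invented purely to show the mechanics; in practice each Q would come from running a community discovery method on the corresponding network.

```python
def best_parameter(q_real, q_random):
    """Return (k, delta_q) for the k that maximizes Q_real(k) - Q_random(k)."""
    delta = {k: q_real[k] - q_random[k] for k in q_real}
    best_k = max(delta, key=delta.get)
    return best_k, delta[best_k]

# Made-up modularity values for four candidate neighbor counts.
q_real = {10: 0.90, 50: 0.80, 100: 0.70, 200: 0.40}
q_random = {10: 0.85, 50: 0.45, 100: 0.25, 200: 0.10}

k, dq = best_parameter(q_real, q_random)
```

In this made-up example the sparsest network has the highest raw Q, but a random network of the same density scores almost as high, so the difference picks a denser network instead.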

Page 28: Network-based analysis of functional genomics data

Results: synthetic data set 1
High dimensional data generated by synDeca
  20 clusters of high dimensional points, plus some scatter points
  Clusters are of various shapes: ellipse, rectangle, random
[Figure: data matrix]
[Figure: QReal, QRandom, Qreal - Qrandom (∆Q) and clustering accuracy vs. number of neighbors]

Page 29: Network-based analysis of functional genomics data

Comparison
[Figure: clustering accuracy vs. dimension for mKNN-HQcut with the optimum k, mKNN-HQcut with automatically determined k (this work), and K-means]

Page 30: Network-based analysis of functional genomics data

Results: synthetic data set 2
Gene expression data (Thalamuthu et al, 2006)
  600 data sets
  ~600 genes, 50 conditions, 15 clusters
  0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and mKNN-HQcut with auto k]

Page 31: Network-based analysis of functional genomics data

Comparison with other methods

Ruan et al., BioKDD 2010

Page 32: Network-based analysis of functional genomics data

Results on yeast stress response data
3000 genes, 173 samples
Best k = 140, resulting in 75 clusters

Page 33: Network-based analysis of functional genomics data

Results on yeast stress response data
Enrichment of common functions
  Cumulative hypergeometric test (Fisher’s exact test)
[Figure: genes vs. GO function terms, with enriched clusters labeled]
  Protein biosynthesis (p < 10^-96)
  Nuclear transport (p < 10^-50)
  mt ribosome (p < 10^-63)
  DNA repair (p < 10^-66)
  RNA splicing (p < 10^-105)
  Nitrogen compound metabolism (p < 10^-37)
  Peroxisome (p < 10^-13)
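The enrichment p-values come from the upper tail of a hypergeometric distribution. A small sketch, with argument names chosen for this example: if `annotated` of the `population` genes carry a GO term, the p-value is the probability that a random cluster of the same size contains at least `hits` annotated genes.

```python
from math import comb

def enrichment_pvalue(population, annotated, cluster, hits):
    """P(X >= hits) when `cluster` genes are drawn without replacement
    from `population` genes, `annotated` of which carry the term."""
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    tail = sum(comb(annotated, x) * comb(population - annotated, cluster - x)
               for x in range(hits, upper + 1))
    return tail / total

# Tiny worked case: 10 genes, 4 annotated, cluster of 3 with 2 annotated.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
p_small = enrichment_pvalue(10, 4, 3, 2)
p_trivial = enrichment_pvalue(10, 4, 3, 0)
```

Python integers are exact, so the binomial coefficients do not overflow even for genome-sized inputs; only the final division is done in floating point.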

Page 34: Network-based analysis of functional genomics data

Comparison with k-means
[Figure: overall enrichment score for K-means and for mkNN-HQcut using automatically determined k = 140]

Page 35: Network-based analysis of functional genomics data

An interesting community
A 25-gene community missed by other methods
  4 transcription factors regulate many genes in the community
  4 telomere maintenance genes (p < 10^-7)
  16 unknown genes, all located in chromosome telomeric regions
  5 other genes at rim of the sub-network

Page 36: Network-based analysis of functional genomics data

Application to Arabidopsis data
~22000 genes, 1138 samples
1150 singletons
800 (300) modules of size >= 10 (20)
> 80% (90%) of modules have enriched functions
Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]

Page 38: Network-based analysis of functional genomics data

Cis-regulatory network of Arabidopsis
[Figure: network of motifs and modules]
Ruan et al., BMC Bioinfo, to appear

Page 39: Network-based analysis of functional genomics data

Beyond gene clustering (1)
Gene specific studies
Collaborator is interested in gibberellins
  A hormone important for the growth and development of plants
  Commercially important
  Biosynthesis and signaling well studied
  Transcriptional regulation of biosynthesis and signaling not yet clear
3 important gene families for biosynthesis: GA20ox, GA3ox and GA2ox
Receptor gene family: GID1A, B, C
Analyze the co-expression network around these genes

Page 40: Network-based analysis of functional genomics data

[Figure: co-expression network around the gibberellin genes. Node labels: GID1A, GID1B, GID1C; 20ox1 to 20ox5; 3ox1 to 3ox4; 2ox1 to 2ox4 and 2ox6 to 2ox8; GA3. Group labels: 20ox, 3ox, 2ox]

Page 42: Network-based analysis of functional genomics data

Beyond gene clusters (2)
Cancer classification
[Figure: gene x sample expression matrix → sample x sample network → Qcut]
Sample: patient or cell lines
Alizadeh et al. Nature, 2000

Page 43: Network-based analysis of functional genomics data

Network of cell samples
Shape: cell line / cancer type
Color: clustering results
[Figure labels: Activated Blood B, Resting Blood B, Blood T, Transformed cell lines, Chronic lymphocytic leukemia (CLL), Follicular lymphoma (FL), Diffuse large B-cell lymphoma (DLBCL)]

Page 44: Network-based analysis of functional genomics data

Survival rate after chemotherapy
[Figure: DLBCL-1, DLBCL-2, DLBCL-3]
Survival rate: 73%, median survival time: 71.3 months
Survival rate: 40%, median survival time: 22.3 months
Survival rate: 20%, median survival time: 12.5 months

Page 45: Network-based analysis of functional genomics data

Beyond gene clustering (3)
Topology vs function
[Figure: % of essential proteins vs. number of connections]
Jeong et al. Nature 2001

Page 46: Network-based analysis of functional genomics data

Community participation vs. essentiality (PPI)
[Figure: % essential vs. community participation, for hub and non-hub proteins]

Page 47: Network-based analysis of functional genomics data

Community participation vs. essentiality (coexp)
[Figure: % essential vs. community participation, for hub and non-hub genes]
[Figure: % essential vs. number of connections, for participation < 0.2 and participation >= 0.2]
Key: how to systematically search for such relationships?
  Data mining: association rule?

Page 48: Network-based analysis of functional genomics data

Agenda
Community structure in biological networks
  Prediction of protein complexes
Network-based microarray data analysis
Network-based biomarker discovery for metastatic breast cancer
Conclusions

Page 49: Network-based analysis of functional genomics data

Background
Metastasis is the spread of cancer from one organ to another non-adjacent organ or part
Challenge: predict metastasis
  If metastasis is likely => aggressive adjuvant therapy
  How to decide the likelihood?
  Traditional predictive factors are not good enough

Page 50: Network-based analysis of functional genomics data

Microarray-based marker discovery
Examine genome-wide expression profiles
Idea:
  Score individual genes for how well they discriminate between different classes of disease
  Establish gene expression signature
Limitations:
  # genes >> # patients
  Downstream effects
  Individual variations not attributed to cancer
Consequences:
  Low reproducibility across data sets
  Missing biological insight
[Figure: expression matrix with dimensions M and N]

Page 51: Network-based analysis of functional genomics data

Pathway vs. PPI Subnetwork as Marker
Remedy to the problems: use pathway information
  Fewer features => better stability
  Biological insight
Limitation: majority of human genes not yet assigned to a definitive pathway
Alternatively: protein-protein interaction (PPI) networks
  Genes in the same pathway may have short distance in PPI
  Subnetworks (potential pathways) as markers (Chuang et al. 2007, Mol Syst Bio.)
  Cannot differentiate causal vs. downstream genes

Page 52: Network-based analysis of functional genomics data

Our approach
Key observation: many known disease genes are not differentially expressed (DE) between metastasis and non-metastasis, but their neighbors are
  e.g. P53 and BRCA2
Intuition: find a small number of intermediate genes that connect DE genes
  Known as the Steiner tree problem in computer science

Page 53: Network-based analysis of functional genomics data

Steiner tree problem
Given a connected graph and a list of input nodes, find the smallest number of additional nodes so that there is a tree connecting the input nodes and those additional nodes
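Finding an optimal Steiner tree is NP-hard, so practical methods rely on heuristics. The sketch below is one common greedy heuristic (repeatedly attach the nearest remaining input node to the growing tree through a shortest path). It is not necessarily the algorithm used in this work, and it does not guarantee the minimum number of additional nodes. The function name and the toy graph are assumptions.

```python
from collections import deque

def steiner_nodes(adj, terminals):
    """Greedy shortest-path heuristic for an unweighted, connected graph.

    adj: dict mapping node -> list of neighbors; terminals: input nodes.
    Returns the set of nodes in the tree (input nodes plus added nodes).
    """
    terminals = list(terminals)
    tree = {terminals[0]}
    remaining = set(terminals[1:]) - tree
    while remaining:
        # Multi-source BFS from the current tree to the nearest remaining terminal.
        parent = {n: None for n in tree}
        queue = deque(tree)
        found = None
        while queue and found is None:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    if v in remaining:
                        found = v
                        break
                    queue.append(v)
        if found is None:
            raise ValueError("graph is not connected")
        # Walk back along the BFS path until we reach the existing tree.
        node = found
        while node not in tree:
            tree.add(node)
            node = parent[node]
        remaining -= tree
    return tree

# Toy graph (hypothetical): A, B, C all touch hub H; A-x-y-B is a longer detour.
graph = {
    "A": ["H", "x"], "B": ["H", "y"], "C": ["H"],
    "H": ["A", "B", "C"], "x": ["A", "y"], "y": ["x", "B"],
}
tree = steiner_nodes(graph, ["A", "B", "C"])
added = tree - {"A", "B", "C"}
```

In the marker-discovery setting, the input nodes would be the DE genes and the added nodes would be the intermediate genes that connect them in the PPI network.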

Page 54: Network-based analysis of functional genomics data

Method overview

Page 55: Network-based analysis of functional genomics data

Experiment setup
Two breast cancer microarray data sets
  van de Vijver et al, 2002 (Agilent, 78 +ve vs 217 -ve)
  Wang et al, 2005 (Affy, 106 +ve vs 180 -ve)
Two human PPI networks
  PINA (Wu et al 2009): 10,920 proteins and 61,746 interactions
  Chuang et al 2007: 11,203 proteins and 57,235 interactions

Page 56: Network-based analysis of functional genomics data

Steiner tree example for Wang data set

Page 57: Network-based analysis of functional genomics data

Cross-dataset stability of markers

Markers                                  van de Vijver   Wang      Common genes     p-value
                                         dataset         dataset   (% of overlap)
Steiner tree-based markers
  Chuang-PPI                             1123            1100      384 (20.88%)     2E-127
  PINA-PPI                               981             1135      400 (23.31%)     5E-158
Top scoring Steiner tree-based markers
  Chuang-PPI                             203             194       63 (18.86%)      7E-064
  PINA-PPI                               164             185       80 (29.74%)      3E-103
Differentially expressed genes           324             319       34 (5.58%)       1E-008
Approach in Chuang et al (2007)          618             906       175 (12.8%)      6E-054
Approach in original study               70              76        3 (2.09%)        .03
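The slide does not define "% of overlap". The Steiner tree and DE rows are numerically consistent with the number of common genes divided by the size of the union of the two marker sets (a Jaccard index); this reading is inferred from the table, not stated by the author. A short check under that assumption, with a function name chosen for this example:

```python
def overlap_percent(n_a, n_b, n_common):
    """Common genes as a percentage of the union of two marker sets."""
    return 100.0 * n_common / (n_a + n_b - n_common)

# (markers in van de Vijver, markers in Wang, common genes) from the table.
rows = {
    "STM, Chuang-PPI": (1123, 1100, 384),
    "STM, PINA-PPI": (981, 1135, 400),
    "t-STM, Chuang-PPI": (203, 194, 63),
    "t-STM, PINA-PPI": (164, 185, 80),
    "DE genes": (324, 319, 34),
}
percents = {name: overlap_percent(*counts) for name, counts in rows.items()}
```

Under this definition a value of 100% would mean both datasets select exactly the same markers, so the higher percentages for the Steiner tree-based rows indicate more stable markers across datasets.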

Page 58: Network-based analysis of functional genomics data

Enriched cancer-related pathways
Enrichment score for Wang dataset with Chuang-PPI network.

Page 59: Network-based analysis of functional genomics data

Overlap with known breast cancer genes

Overlap of 60 known breast cancer genes with STMs, t-STMs, DE genes, Chuang et al (2007) method and corresponding original studies. (A) Wang (B) van de Vijver dataset.

(A) (B)

Page 60: Network-based analysis of functional genomics data

Classification accuracy
Classification accuracy based on AUC metric. The dataset label represents features taken from that dataset. For cross-data classification, features are taken from the labeled dataset and applied to the other dataset. (A) Bayesian logistic regression (B) random tree classifier.
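For readers unfamiliar with the AUC metric, it can be read as the probability that a randomly chosen positive sample (metastasis) receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. A minimal sketch with invented scores and labels:

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented classifier scores; label 1 = metastasis, 0 = no metastasis.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which makes the metric comparable across datasets with different class balances.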

Page 61: Network-based analysis of functional genomics data

Conclusions
Methods for community discovery
  Fully automated, i.e. parameter-free
  Higher accuracy than competing methods that require extensive parameter tuning
  Improves protein complex prediction and microarray data clustering, including tumor subtype classification
  Many other applications
Steiner-tree based biomarker discovery
  Improves stability of metastatic breast cancer markers, and cross-dataset classification accuracy
  A generic method for identifying the hidden causal genes given downstream targets

Page 62: Network-based analysis of functional genomics data

Future work
Network-based gene-specific studies
Combine random walk with Steiner tree algorithms to improve biomarker discovery and cancer classification
Other types of data
Other diseases

Page 63: Network-based analysis of functional genomics data

Questions?