Analysis of Gene Expression and Gene Networks Biclustering 2

Jan 21, 2016


Lewis Martin
Transcript
Page 1: Analysis of Gene Expression and Gene Networks Biclustering 2.

Analysis of Gene Expression and Gene Networks

Biclustering 2

Page 2: Analysis of Gene Expression and Gene Networks Biclustering 2.

In this lecture

• Two current biclustering methodologies

• Iterative Signature Algorithm (ISA)
– Simple
– Randomized

• SAMBA
– Combinatorial roots
– Fast

• And maybe a little more

Page 3: Analysis of Gene Expression and Gene Networks Biclustering 2.

What makes a biclustering algorithm?

• Score/define what a bicluster is
• Algorithm for finding one bicluster in the data
• Algorithm for finding all (many) biclusters in the data

• Important themes:
– Normalization
– Redundancies

Page 4: Analysis of Gene Expression and Gene Networks Biclustering 2.

Previously in GE:

• What is a bicluster:
– Cheng and Church
– CTWC

• How to search for a bicluster:
– Cheng and Church
– CTWC

• Normalization
• Redundancies

Page 5: Analysis of Gene Expression and Gene Networks Biclustering 2.

The Iterative Signature Algorithm

• Developed at Naama Barkai’s lab at WIS (J. Ihmels, S. Bergmann)

• Motivation:
– A bicluster is a “stable” set of genes and conditions
– It is possible to refine an approximate set of genes by “stabilizing” it

Page 6: Analysis of Gene Expression and Gene Networks Biclustering 2.

Normalization: ISA

• Can we normalize for both gene and condition dependent trends?

• In the ISA we are not trying to..

• Given a gene expression matrix E on conditions U and genes V, form:
– $E^C$: normalize each column (condition) to mean 0, std 1
– $E^G$: normalize each row (gene) to mean 0, std 1
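The two normalizations can be sketched in a few lines of plain Python. This is only an illustration: the function names `standardize` and `normalize_isa`, the list-of-rows layout (rows are genes, columns are conditions), and the use of the population standard deviation are assumptions, and no row or column may be constant.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Scale a list of numbers to mean 0 and (population) std 1."""
    mu, sd = fmean(values), pstdev(values)
    return [(x - mu) / sd for x in values]

def normalize_isa(E):
    """E is a list of rows (genes); columns are conditions. Returns (E_G, E_C)."""
    # E_G: each gene (row) standardized across conditions.
    E_G = [standardize(row) for row in E]
    # E_C: each condition (column) standardized across genes, then transposed back.
    cols = [standardize(list(col)) for col in zip(*E)]
    E_C = [list(row) for row in zip(*cols)]
    return E_G, E_C

# Toy 3 genes x 3 conditions matrix (made-up values).
E = [[1.0, 2.0, 6.0],
     [4.0, 0.0, 2.0],
     [7.0, 4.0, 1.0]]
E_G, E_C = normalize_isa(E)
```

Keeping both matrices side by side is what lets the later ISA steps score genes with the gene-normalized values and conditions with the condition-normalized values.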

Page 7: Analysis of Gene Expression and Gene Networks Biclustering 2.

What is a bicluster: ISA

• Observe: assume all columns are independent. What is the distribution of

$\sum_{j \in U'} e^G_{ij}$

for a random condition set U’ and gene i?

• Mean = 0, Std = $\sqrt{|U'|}$
• The same holds for $\sum_{i \in V'} e^C_{ij}$ and a random gene set V’.
• In a bicluster, we would like independence not to hold.
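A quick simulation can make the null distribution concrete. The sketch below assumes, purely for illustration, that the normalized entries are independent standard Gaussians; the slide only requires mean 0, std 1, and independence.

```python
import random
from statistics import fmean, pstdev

random.seed(0)
k = 9            # |U'|, size of the random condition set
trials = 20000   # number of simulated genes

# Each draw stands in for one normalized entry: mean 0, std 1, independent.
sums = [sum(random.gauss(0.0, 1.0) for _ in range(k)) for _ in range(trials)]

mean_of_sums = fmean(sums)
std_of_sums = pstdev(sums)
print(mean_of_sums, std_of_sums, k ** 0.5)
```

The empirical standard deviation should land close to sqrt(9) = 3, which is the scale against which the ISA thresholds are measured.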

Page 8: Analysis of Gene Expression and Gene Networks Biclustering 2.

What is a bicluster: ISA

• Given a set of conditions U’, define:
– ISA(U’) = {v in V s.t. $\sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'}$}

• Given a set of genes V’, define:
– ISA(V’) = {u in U s.t. $\sum_{j \in V'} e^C_{ju} > T_C \sigma_{V'}$}

• $T_G$, $T_C$: threshold parameters; $\sigma_{U'}$, $\sigma_{V'}$: standard deviations

• A (perfect) bicluster is a pair (U’,V’) s.t.

ISA(V’) = U’ and ISA(U’) = V’
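The two maps and the fixed-point condition can be sketched as follows. The function names are hypothetical, and taking the standard deviation to be sqrt(|U’|) (respectively sqrt(|V’|)), as under the independence assumption of the previous slide, is an assumption of this sketch.

```python
from math import sqrt

def isa_genes(E_G, conds, t_g):
    """ISA(U'): genes whose summed E_G score over the conditions U' exceeds t_g * sigma."""
    sigma = sqrt(len(conds))  # assumed: std of the sum under independence
    return {i for i, row in enumerate(E_G)
            if sum(row[j] for j in conds) > t_g * sigma}

def isa_conds(E_C, genes, t_c):
    """ISA(V'): conditions whose summed E_C score over the genes V' exceeds t_c * sigma."""
    sigma = sqrt(len(genes))  # assumed: std of the sum under independence
    return {u for u in range(len(E_C[0]))
            if sum(E_C[i][u] for i in genes) > t_c * sigma}

# Toy matrix: rows are genes, columns are conditions. Every row and column
# already has mean 0 and std 1, so it serves as both E_G and E_C.
E = [[ 1,  1, -1, -1],
     [ 1,  1, -1, -1],
     [-1, -1,  1,  1],
     [-1, -1,  1,  1]]

genes_from_conds = isa_genes(E, {0, 1}, t_g=1.0)
conds_from_genes = isa_conds(E, {0, 1}, t_c=1.0)
is_perfect = genes_from_conds == {0, 1} and conds_from_genes == {0, 1}
```

In this toy example the pair ({0, 1}, {0, 1}) maps to itself under both functions, so it satisfies the perfect-bicluster condition.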

Page 9: Analysis of Gene Expression and Gene Networks Biclustering 2.

Searching for biclusters: ISA

• ISA defines a directed graph on the subsets of conditions and genes.

• A bicluster is a cycle of two nodes, U’ and V’
• An approximate bicluster is a larger cycle, but not too large.

• The algorithm: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster:

– $U_i = ISA(V_i)$, $V_i = ISA(U_{i-1})$
– Converge at i when for all j > i-m, $|U_i - U_j| / |U_i + U_j| < 1 - \varepsilon$
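A minimal sketch of the iteration, under stated assumptions: `run_isa` and its parameters are hypothetical names, the standard deviation is taken as the square root of the set size, and the stopping rule reads $|U_i - U_j|$ as the size of the symmetric difference and $|U_i + U_j|$ as the sum of the set sizes, compared against a small tolerance `tol` over the last `m` condition sets. That reading of the convergence test is an interpretation, not something the slide spells out.

```python
from math import sqrt

def isa_conds(E_C, genes, t_c):
    """Conditions whose summed E_C score over the gene set exceeds t_c * sqrt(|genes|)."""
    sigma = sqrt(len(genes))
    return {u for u in range(len(E_C[0]))
            if sum(E_C[i][u] for i in genes) > t_c * sigma}

def isa_genes(E_G, conds, t_g):
    """Genes whose summed E_G score over the condition set exceeds t_g * sqrt(|conds|)."""
    sigma = sqrt(len(conds))
    return {i for i, row in enumerate(E_G)
            if sum(row[j] for j in conds) > t_g * sigma}

def run_isa(E_G, E_C, seed_genes, t_g, t_c, m=2, tol=0.05, max_iter=50):
    """Alternate the two ISA maps from a seed gene set until the condition set is stable."""
    genes, history = set(seed_genes), []
    for _ in range(max_iter):
        conds = isa_conds(E_C, genes, t_c)   # next condition set from the current genes
        if not conds:
            return set(), set()              # the seed dissolved: no bicluster found
        genes = isa_genes(E_G, conds, t_g)   # next gene set from those conditions
        recent = history[-m:]                # the m previous condition sets
        history.append(conds)
        if len(recent) == m and all(
                len(conds ^ u) / (len(conds) + len(u)) < tol for u in recent):
            return conds, genes
    return conds, genes                      # not converged within max_iter

# Toy matrix with mean-0, std-1 rows and columns (usable as both E_G and E_C).
E = [[ 1,  1, -1, -1],
     [ 1,  1, -1, -1],
     [-1, -1,  1,  1],
     [-1, -1,  1,  1]]
conds, genes = run_isa(E, E, seed_genes={0}, t_g=0.5, t_c=0.5)
```

Starting from the single gene 0, the iteration recruits gene 1 and settles on conditions {0, 1}; different seeds or thresholds may end at other fixed points or at none.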

Page 10: Analysis of Gene Expression and Gene Networks Biclustering 2.

Redundancies: ISA

• Starting from different seeds yields different fixed points (biclusters)

• Using different thresholds changes the graph structure and gives more fixed points.

• Need to filter similar solutions and report a short list of significant biclusters
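The slides do not say how similar solutions are filtered. One common approach, shown here only as an assumed illustration, is a greedy pass that keeps the best-scoring bicluster and drops any later one whose cells overlap a kept one too much (Jaccard index above a hypothetical cutoff `max_overlap`).

```python
def cells(genes, conds):
    """All (gene, condition) cells covered by a bicluster."""
    return {(g, c) for g in genes for c in conds}

def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_similar(biclusters, max_overlap=0.5):
    """biclusters: list of (score, genes, conds). Greedy filter, best score first."""
    kept = []
    for cand in sorted(biclusters, key=lambda b: b[0], reverse=True):
        cand_cells = cells(cand[1], cand[2])
        if all(jaccard(cand_cells, cells(k[1], k[2])) <= max_overlap for k in kept):
            kept.append(cand)
    return kept

# Made-up fixed points from three seeds; the second nearly repeats the first.
found = [(10.0, {0, 1, 2}, {0, 1}),
         (9.0, {0, 1, 2}, {0, 1, 2}),
         (5.0, {5, 6}, {3, 4})]
short_list = filter_similar(found)
kept_scores = [b[0] for b in short_list]
```

Here the second bicluster shares 6 of 9 cells with the first and is dropped, while the disjoint third one is kept.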

Page 11: Analysis of Gene Expression and Gene Networks Biclustering 2.

ISA - applications

• Starting from genes with a known functional annotation and refining them to a bicluster

• Starting from genes with known transcription factor binding sites

• Starting from a set of sequence orthologs

• See: Ihmels et al. Nat Genet 2002, Bergmann et al. Phys Rev Lett 2003, Bergmann et al. PLoS 2004.

Page 12: Analysis of Gene Expression and Gene Networks Biclustering 2.

ISA – Pros/Cons

• Pros
– Simple, quite fast
– Elegant solution to the normalization problem
– Good empirical results in several cases

• Cons
– Threshold setting
– Finding good seeds
– Redundancies
– Non-normal behaviors

• Assignment 3 will give you more insights

Page 13: Analysis of Gene Expression and Gene Networks Biclustering 2.

SAMBA

• Developed here
• Motivation:

– Harvest efficient combinatorial techniques for biclustering large datasets.

– Couple a statistical model to the biclusters

– Allow integration of heterogeneous data

Page 14: Analysis of Gene Expression and Gene Networks Biclustering 2.

The SAMBA model

[Figure: the expression matrix (genes × conditions) drawn as a bipartite graph G=(U,V,E); each matrix cell becomes an edge or a non-edge.]

Goal: find high-similarity submatrices

Goal: find dense subgraphs

Page 15: Analysis of Gene Expression and Gene Networks Biclustering 2.

The SAMBA approach

• Normalization: translate GE matrix to a weighted bipartite graph using a statistical model for the data

• Bicluster model: Heavy subgraphs

• How to find biclusters: Combined hashing and local optimization

• Redundancies: Find many biclusters at once, filter them in post process

Page 16: Analysis of Gene Expression and Gene Networks Biclustering 2.

From a statistical model to edge weights – a simple example

• Background model: Independent edges, each present with prob. p.

• H – subgraph of n genes, m conds, k edges
• P-value = tail of binomial distribution:

$$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$$

• Weight the graph
– edges: (-1 - log p)
– non-edges: (-1 - log(1-p))

Then the subgraph weight equals -log of the bound on the p-value (logs taken base 2).
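The relation between the edge weights and the p-value bound can be checked numerically. In this sketch the function names and example numbers are made up, logs are base 2 so that the identity is exact, and p < 1/2 is assumed so that edges get positive weight and non-edges negative weight.

```python
from math import comb, log2

def binomial_tail(n_cells, k, p):
    """Exact P(at least k edges among n_cells independent cells, each with prob. p)."""
    return sum(comb(n_cells, j) * p**j * (1 - p)**(n_cells - j)
               for j in range(k, n_cells + 1))

def tail_bound(n_cells, k, p):
    """Upper bound 2^(nm) * p^k * (1-p)^(nm-k) on the tail (valid for p < 1/2)."""
    return 2**n_cells * p**k * (1 - p)**(n_cells - k)

def subgraph_weight(n_cells, k, p):
    """Sum of weights: (-1 - log p) per edge, (-1 - log(1-p)) per non-edge."""
    edge_w = -1 - log2(p)
    non_edge_w = -1 - log2(1 - p)
    return k * edge_w + (n_cells - k) * non_edge_w

# Example: n = 4 genes, m = 3 conditions, k = 11 of the 12 cells are edges, p = 0.1.
n_cells, k, p = 4 * 3, 11, 0.1
tail = binomial_tail(n_cells, k, p)
bound = tail_bound(n_cells, k, p)
weight = subgraph_weight(n_cells, k, p)
```

The heavier the subgraph, the smaller the bound on its p-value, which is why searching for heavy subgraphs amounts to searching for statistically surprising ones under this simple model.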

Page 17: Analysis of Gene Expression and Gene Networks Biclustering 2.

Limitations of the uniform probability model

• Not all dense subgraphs are statistically significant.
• Different genes/conds have typical noise characteristics.
• Noisy genes/conds have a high probability of forming dense subgraphs.
• An extended likelihood ratio model:

[Figure: the likelihood ratio compares the Bicluster Random Subgraph Model with the Background Random Graph Model.]

The likelihood model translates to a sum of weights over edges and non-edges.

Page 18: Analysis of Gene Expression and Gene Networks Biclustering 2.

A Degree Based Random Graph Model

• An edge between (u,v) occurs independently with prob. p(u,v).
• p(u,v) depends on the degrees of both u and v
• p(u,v) = Pr((u,v) in E’ | all G=(U,V,E’) such that deg(w,E’)=deg(w,E) for all w in U,V)

• Approximated using a hypergeometric calculation

[Figure: bipartite graph with low-, medium-, and high-probability edges.]

Page 19: Analysis of Gene Expression and Gene Networks Biclustering 2.

Model Likelihood Ratio

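• Model assumption: bicluster edges occur independently with prob. $p_c$

• Likelihood ratio score, with products and sums over the gene-condition pairs (u,v) of the subgraph B, split into its edges (in E’) and non-edges:

$$L(B) = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \prod_{(u,v) \notin E'} \frac{1-p_c}{1-p(u,v)}$$

$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \notin E'} \log \frac{1-p_c}{1-p(u,v)}$$

Subgraph weight = log likelihood ratio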
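As a sketch of how the score turns into edge and non-edge weights, the snippet below sums one log-ratio term per gene-condition pair. All names and probabilities are invented for illustration; in particular the background probabilities `p` are simply given here rather than derived from a degree-based model.

```python
from math import log

def edge_weight(p_uv, p_c):
    """Weight of an edge: log(p_c / p(u,v))."""
    return log(p_c / p_uv)

def non_edge_weight(p_uv, p_c):
    """Weight of a non-edge: log((1 - p_c) / (1 - p(u,v)))."""
    return log((1 - p_c) / (1 - p_uv))

def log_likelihood_ratio(conds, genes, edges, p, p_c):
    """Sum the weights over every (condition, gene) pair of the subgraph."""
    total = 0.0
    for u in conds:
        for v in genes:
            if (u, v) in edges:
                total += edge_weight(p[(u, v)], p_c)
            else:
                total += non_edge_weight(p[(u, v)], p_c)
    return total

# Hypothetical 2 x 2 subgraph with three edges and one non-edge.
conds, genes = ["c1", "c2"], ["g1", "g2"]
edges = {("c1", "g1"), ("c1", "g2"), ("c2", "g1")}
p = {("c1", "g1"): 0.2, ("c1", "g2"): 0.5,
     ("c2", "g1"): 0.2, ("c2", "g2"): 0.5}
score = log_likelihood_ratio(conds, genes, edges, p, p_c=0.9)
```

Edges that are unlikely under the background model (small p(u,v)) contribute the largest positive weights, so noisy, high-degree genes and conditions earn less credit for each edge.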

Page 20: Analysis of Gene Expression and Gene Networks Biclustering 2.

Heaviest bipartite subgraph

• NP-complete (Dawande et al. 97, Hochbaum 98)
• (Recall: the node biclique problem is polynomial!)

• Assumption: degree on the V side is bounded by d

• Start by finding heavy bicliques.

• Alg: use hashing to discover heavy subsets of conds. Takes $O(n 2^d)$ time and space.
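The hashing idea can be sketched for the unweighted case: every gene inserts each subset of its (at most d) neighboring conditions into a hash table, so each table entry collects the genes adjacent to all conditions in that subset, which is a biclique. The names below are hypothetical and unit edge weights are assumed, so a biclique's weight is just its number of edges.

```python
from collections import defaultdict
from itertools import combinations

def hash_bicliques(neighbors):
    """neighbors: dict gene -> set of adjacent conditions (each of size <= d).

    Returns a table mapping each condition subset to the genes adjacent to all of it.
    Each gene contributes at most 2^d subsets, hence O(n 2^d) entries overall.
    """
    table = defaultdict(set)
    for gene, conds in neighbors.items():
        ordered = sorted(conds)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                table[frozenset(subset)].add(gene)
    return table

def heaviest_biclique(neighbors):
    """Pick the table entry with the most edges (|conds| * |genes|)."""
    table = hash_bicliques(neighbors)
    conds, genes = max(table.items(), key=lambda kv: len(kv[0]) * len(kv[1]))
    return set(conds), set(genes)

# Made-up bipartite graph: gene -> conditions in which it has an edge.
neighbors = {"g1": {"a", "b", "c"},
             "g2": {"a", "b", "c"},
             "g3": {"a", "b"},
             "g4": {"d"},
             "g5": {"a", "b", "c"}}
best_conds, best_genes = heaviest_biclique(neighbors)
```

In this example the subset {a, b, c} is shared by three genes (9 edges), beating {a, b} with four genes (8 edges).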

Page 21: Analysis of Gene Expression and Gene Networks Biclustering 2.

Finding Heaviest Biclique

[Figure: worked example of finding the heaviest biclique.]

Page 22: Analysis of Gene Expression and Gene Networks Biclustering 2.

Using bicliques to find the heaviest biclusters

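$$w((U', V')) = \sum_{u \in U'} w((\{u\}, V'))$$

Lemma: If B=(U’,V’) is maximal and $X \subseteq U'$ then $\exists v \in V'$ s.t. $|N(v) \cap X| \ge |X|/2$.

Pf: Assume edge weight = 1, non-edge weight = -1.

Note that:

$$0 \le w((X, V')) = \sum_{v \in V'} \left( |N(v) \cap X| - |X \setminus N(v)| \right) = \sum_{v \in V'} \left( 2|N(v) \cap X| - |X| \right)$$

so at least one term of the last sum is non-negative.

Corollary: If B=(U’,V’) is maximal then $|U'| \le 2d$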

Page 23: Analysis of Gene Expression and Gene Networks Biclustering 2.

Using bicliques to find the heaviest biclusters

A set of conditions in a maximal bicluster is the union of up to log(2d) subsets of gene neighborhoods.

• Exhaustive $O((n 2^d)^{\log(2d)})$ time alg:

• Hash bicliques

• Enumerate all combinations of log(2d) neighborhoods N(v).

• Can be generalized to handle arbitrary edge/non-edge weights.

[Figure: the condition set U’ covered by subsets u’’, u’’’, …]

Page 24: Analysis of Gene Expression and Gene Networks Biclustering 2.

SAMBA’s implementation

• Phase I: find heavy bicliques - hash for each gene of deg<d all subsets of neighbors of size 4-6.

• Phase II: greedy expansion of heaviest bicliques containing each gene/cond

• Phase III: filter overlapping biclusters.

Page 25: Analysis of Gene Expression and Gene Networks Biclustering 2.

Heterogeneous information sources

• Transcription level: ChIP-chip, mRNA profiling
• Protein level: two-hybrid, protein complex identification using mass spec
• Phenotype level: synthetic lethality (“1 + 1 = 0”), barcoded deletion libraries
• and so many more…

Page 26: Analysis of Gene Expression and Gene Networks Biclustering 2.

From experiments to properties

[Figure: each experiment type is discretized into properties attached to gene g. Expression: strong induction, medium induction, medium repression, strong repression (p1–p4). TF binding: strong or medium binding to TF T (p1, p2). Phenotype: high or medium sensitivity (p1, p2). Interactions: high- or medium-confidence interaction (p1, p2). Complexes: strong or medium complex binding to protein P (p1, p2).]

Page 27: Analysis of Gene Expression and Gene Networks Biclustering 2.

A Heterogeneous Collection of Yeast Genomic Information

• Gene expression: ~1000 conditions, 27 publications
• TF binding profiles: 110 profiles from growth on YPD (Lee et al.)
• Phenotype profiles: 6 (30) profiles (Giaever et al.)
• Two-hybrid interactions: ~1000 (Uetz et al.)
• Protein complex interactions: ~4000 (Ho et al.)
• MIPS interactions: ~1000

Page 28: Analysis of Gene Expression and Gene Networks Biclustering 2.

A SAMBA module

[Figure: a SAMBA module shown as genes × properties, with GO annotations; labeled genes include CPA1 and CPA2.]

Page 29: Analysis of Gene Expression and Gene Networks Biclustering 2.

Statistical Model Provides High Specificity

[Figure: scatter plot with axes log p-value and log likelihood; + lymphoma data (Alizadeh et al.), × shuffled data.]

Page 30: Analysis of Gene Expression and Gene Networks Biclustering 2.

Global View of modular organization in yeast

Page 31: Analysis of Gene Expression and Gene Networks Biclustering 2.
Page 32: Analysis of Gene Expression and Gene Networks Biclustering 2.

Inferring functional annotations

• Using SAMBA results for annotating uncharacterized yeast genes

• Performing “guilt by association”
• Same procedure for properties (which reflects poorly characterized conditions)

[Figure: a module in which over X% of the genes are mating genes; its uncharacterized genes are labeled putative mating genes.]

Page 33: Analysis of Gene Expression and Gene Networks Biclustering 2.

Predictions are highly specific

• 5 mating predictions were tested experimentally
• 4 mutants failed to mate

Page 34: Analysis of Gene Expression and Gene Networks Biclustering 2.

SAMBA as a universal language for functional genomics

[Figure: databases (gene expression, TF location, proteomics, phenotypes, …) feed SAMBA; a user query returns updated relevant modules.]

Page 35: Analysis of Gene Expression and Gene Networks Biclustering 2.

SAMBA – Pros/Cons

• Pros
– Fast
– Allows simultaneous normalization of genes and conditions
– Allows integration of heterogeneous data
– Well suited for query-based usage

• Cons
– Discretization

Page 36: Analysis of Gene Expression and Gene Networks Biclustering 2.

Two words on: Probabilistic Models for Biclustering

• Bicluster model: each subcolumn has a typical normal distribution, different from the background

• Model the entire matrix: tile the matrix by biclusters

• Model score: likelihood based
• Avoid overfitting by standard techniques

Page 37: Analysis of Gene Expression and Gene Networks Biclustering 2.

Two words on: Probabilistic Models for Biclustering

• How to find the biclusters: start by clustering and refine the clusters using an EM algorithm:
– Given a clustering, calculate the model parameters (distributions per bicluster)
– Given the distributions, reassign the biclusters

Page 38: Analysis of Gene Expression and Gene Networks Biclustering 2.

Biclustering - Summary

• A general data mining problem
• The key point: defining what a bicluster is
• Algorithms vary depending on the nature of the bicluster model
• The future problem: searching for biclusters in really huge matrices.