
Machine Learning in Bioinformatics’03 Washington D.C.

Gene Interaction Analysis Using k-way Interaction Loglinear Model: A Case Study on Yeast Data

Xintao Wu, UNC Charlotte
Daniel Barbara, George Mason Univ.
Liying Zhang, Memorial Sloan Kettering Cancer Center
Yong Ye, UNC Charlotte


Microarray data

• The raw microarray images are transformed to gene expression matrices $X = \{x_{ij} \mid i = 1, \ldots, n,\ j = 1, \ldots, m\}$, where
    The rows $G = \{g_1, g_2, \ldots, g_n\}$ denote genes
    The columns $S = \{s_1, s_2, \ldots, s_m\}$ denote various samples, conditions, or time points
    $x_{ij}$ corresponds to the expression value of sample $s_j$ on gene $g_i$

• Comparison with market basket data

              microarray           market basket data
    row       10^3-10^4 genes      10^3-10^4 items
    column    10^1-10^2 samples    10^6-10^9 transactions
    data      continuous           0-1


Background -- Clustering

• Clustering over genes
    CAST, Ben-Dor et al 1999
    MST, Xu et al 2002
    HCS, Hartuv & Shamir 2000
    CLICK, Sharan & Shamir 2000

• Drawback
    Each gene is assigned to only one cluster; however, a gene can be characterized by several pathways (e.g., p53 protein)
    Impossible to determine interactions of genes in one cluster


Background -- Interaction analysis

• Association rule, Creighton & Hanash 03
    Need to discretize data
    Associations instead of interactions
    Undirected

• Graphical gaussian model, Kishino & Waddell 00
    No need to discretize data
    Only pairwise interactions
    Undirected

• Bayesian network, Segal et al 03
    Pairwise interactions
    Directed
    High complexity


Background -- Association Rule

• An association rule X => Y must satisfy minimum support and minimum confidence
    support, s = P(X U Y), probability that a transaction contains {X U Y}
    confidence, c = P(Y|X), conditional probability that a transaction having X also contains Y

• Efficient algorithms
    Apriori by Agrawal & Srikant, VLDB94
    FP-tree by Han, Pei & Yin, SIGMOD 2000
    etc.

• Example of rules discovered in microarray data
    A, B => C: when genes A and B are overexpressed within a sample, then often gene C is overexpressed too.

• Pros
    One gene can be assigned to any number of rules (pathways).

• Cons
    Gene co-expression instead of interaction

[Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both]


Criticism to Support and Confidence

• Example 1: (Aggarwal & Yu, PODS98)
    Among 5000 students
        3000 play basketball
        3750 eat cereal
        2000 both play basketball and eat cereal
    play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
    play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

                  basketball    not basketball    sum(row)
    cereal        2000          1750              3750
    not cereal    1000          250               1250
    sum(col.)     3000          2000              5000
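The bracketed [support, confidence] pairs in Example 1 can be recomputed from the table. The following is a minimal sketch; the dictionary layout and the function names are illustrative assumptions, not part of the original slides.

```python
# Counts from the student example (Example 1).
counts = {
    ("basketball", "cereal"): 2000,
    ("basketball", "not cereal"): 1000,
    ("not basketball", "cereal"): 1750,
    ("not basketball", "not cereal"): 250,
}
total = sum(counts.values())  # 5000 students


def support(x, y):
    """P(X and Y): fraction of all students for whom both X and Y hold."""
    return counts[(x, y)] / total


def confidence(x, y):
    """P(Y | X): fraction of the students with X who also have Y."""
    n_x = sum(n for (a, _), n in counts.items() if a == x)
    return counts[(x, y)] / n_x


for y in ("cereal", "not cereal"):
    s = support("basketball", y)
    c = confidence("basketball", y)
    print(f"basketball => {y}: support={s:.1%}, confidence={c:.1%}")
```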


Criticism to Support and Confidence

• We need a measure of dependent or correlated events

    $$corr_{X,Y} = \frac{P(X \cup Y)}{P(X)\,P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$

• P(Y|X)/P(Y) is also called the lift of rule X => Y
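As a worked illustration of lift, the sketch below applies the ratio P(X and Y) / (P(X) P(Y)) to the basketball and cereal counts of Example 1. The function name and argument names are assumptions made for this example.

```python
def lift(n_xy, n_x, n_y, n):
    """Lift of X => Y from raw counts: P(X and Y) / (P(X) * P(Y))."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))


n = 5000
# basketball => cereal: 2000 students do both, 3000 play, 3750 eat cereal.
lift_cereal = lift(n_xy=2000, n_x=3000, n_y=3750, n=n)
# basketball => not cereal: 1000 students, 1250 do not eat cereal.
lift_not_cereal = lift(n_xy=1000, n_x=3000, n_y=1250, n=n)

print(f"basketball => cereal:     lift = {lift_cereal:.3f}")
print(f"basketball => not cereal: lift = {lift_not_cereal:.3f}")
```

A lift below 1 for the first rule and above 1 for the second matches the slide's point that the high-confidence rule is the misleading one.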


Criticism to lift

• Suppose a triple ABC is unusually frequent because
    Case 1: AB and/or AC and/or BC are unusually frequent
    Case 2: there is something special about the triple that makes all three occur frequently.

• Example 2: (DuMouchel & Pregibon, KDD 01)
    Suppose in a db of patient adverse drug reactions, A and B are two drugs, and C is the occurrence of kidney failure
    Case 1: A and B may act independently upon the kidney; many occurrences of ABC arise because A and B are sometimes prescribed together
    Case 2: A and B may have no effect on the kidney if taken alone, but when taken together a drug interaction occurs that often leads to kidney failure
    Case 3: A and B may have a small effect on the kidney if taken alone, but when taken together, there is a strong effect.


Criticism to lift

• EXCESS2

    $$EXCESS2 = e - e_{All2F}$$

    $e_{All2F}$: predicted count of the all-two-factor model based on two-way distributions
    $e$: shrinkage estimate (or we can use the raw count)

    EXCESS2 is an estimate of the number of transactions containing the item set over and above those that can be explained by the pairwise associations of the items


Motivation

• EXCESS2
    By analyzing residuals, we can pick up the multi-item associations that cannot be explained by all the pairwise associations included in the all-2-way model.
    It can separate cases 2 and 3 from case 1, but it does not include multi-way interactions.

• Our contribution
    Extend the all-two-factor model to a general k-way loglinear model
    Apply association rules to identify gene sets for further analysis


Saturated log-linear model

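$$\log \hat{y}_{ijkl} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABC}_{ijk} + \lambda^{ABD}_{ijl} + \lambda^{ACD}_{ikl} + \lambda^{BCD}_{jkl} + \lambda^{ABCD}_{ijkl}$$

    $\lambda$: main effect
    $\lambda^{A}_{i}$: 1-factor effect
    $\lambda^{AB}_{ij}$: 2-factor effect, which shows the dependency within the distributions of A, B.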


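Computing $\lambda$-terms

• Linear constraints of coefficients: loglinear parameters sum to 0 over all indices (a dot in a subscript of $\lambda$ denotes summation over that index)

    $$\lambda^{A}_{\cdot} = \lambda^{B}_{\cdot} = \lambda^{C}_{\cdot} = \lambda^{D}_{\cdot} = 0$$
    $$\lambda^{AB}_{i\cdot} = \lambda^{AB}_{\cdot j} = \lambda^{AC}_{i\cdot} = \ldots = \lambda^{CD}_{\cdot l} = 0$$
    $$\ldots$$
    $$\lambda^{ABCD}_{ijk\cdot} = \lambda^{ABCD}_{ij\cdot l} = \lambda^{ABCD}_{i\cdot kl} = \lambda^{ABCD}_{\cdot jkl} = 0$$

• UpDown method (Sarawagi et al, EDBT98)

    $$\lambda = l_{\cdot\cdot\cdot\cdot}$$
    $$\lambda^{A}_{i} = l_{i\cdot\cdot\cdot} - \lambda$$
    $$\lambda^{AB}_{ij} = l_{ij\cdot\cdot} - \lambda^{A}_{i} - \lambda^{B}_{j} - \lambda$$
    $$\lambda^{ABC}_{ijk} = l_{ijk\cdot} - \lambda^{AB}_{ij} - \lambda^{AC}_{ik} - \lambda^{BC}_{jk} - \lambda^{A}_{i} - \lambda^{B}_{j} - \lambda^{C}_{k} - \lambda$$
    $$\ldots$$

    where $l_{ijkl} = \log y_{ijkl}$ and a dot in a subscript of $l$ denotes the mean over that index.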
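The recursion on this slide (average the log counts over the dropped indices, then subtract all lower-order terms) can be sketched directly with NumPy. This is a minimal illustration under the assumption that every cell count is strictly positive; it is a direct computation of the saturated-model terms and is not claimed to reproduce the UpDown implementation of Sarawagi et al. The function name and the example table are made up for the sketch.

```python
from itertools import combinations

import numpy as np


def saturated_lambda_terms(counts):
    """Return {tuple_of_axes: lambda array} for the saturated loglinear model.

    Each array keeps length-1 axes for the variables it does not depend on,
    so all terms broadcast against each other.
    """
    l = np.log(np.asarray(counts, dtype=float))
    axes = tuple(range(l.ndim))
    terms = {}
    for order in range(l.ndim + 1):
        for subset in combinations(axes, order):
            others = tuple(a for a in axes if a not in subset)
            # Mean of the log counts over the indices replaced by dots.
            term = l.mean(axis=others, keepdims=True) if others else l.copy()
            # Subtract every lower-order term nested in this subset.
            for lower_order in range(order):
                for lower in combinations(subset, lower_order):
                    term = term - terms[lower]
            terms[subset] = term
    return terms


# Hypothetical 2x2x2 table of counts for three genes A, B, C.
counts = np.array([[[5.0, 9.0], [7.0, 3.0]],
                   [[4.0, 13.0], [7.0, 54.0]]])
terms = saturated_lambda_terms(counts)
print("lambda      =", terms[()].item())
print("lambda^AB   =", terms[(0, 1)].squeeze())
```

Because the model is saturated, adding up all returned terms reproduces the log counts exactly, and each term sums to zero over any of its own indices.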


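k-way loglinear model

• Comparison with lift, EXCESS2

    Independence model (lift):
    $$\log \hat{y}_{lift} = \lambda + \lambda^{A} + \lambda^{B} + \lambda^{C} + \lambda^{D}$$

    Pairwise model (EXCESS2):
    $$\log \hat{y}_{pairwise} = \lambda + \lambda^{A} + \lambda^{B} + \lambda^{C} + \lambda^{D} + \lambda^{AB} + \lambda^{AC} + \lambda^{AD} + \lambda^{BC} + \lambda^{BD} + \lambda^{CD}$$

    3-way model:
    $$\log \hat{y}_{3way} = \lambda + \lambda^{A} + \lambda^{B} + \lambda^{C} + \lambda^{D} + \lambda^{AB} + \lambda^{AC} + \lambda^{AD} + \lambda^{BC} + \lambda^{BD} + \lambda^{CD} + \lambda^{ABC} + \lambda^{ABD} + \lambda^{ACD} + \lambda^{BCD}$$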
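To make the comparison concrete, the sketch below fits the independence model and the all-two-factor (pairwise) model to a three-gene table. The pairwise fit uses iterative proportional fitting, a standard way to fit hierarchical loglinear models; the slides do not say which fitting procedure the authors used, so this choice, the function names, and the example counts are all assumptions. The table is assumed to have strictly positive margins.

```python
import numpy as np


def fit_independence(observed):
    """Fitted counts of a 3-way table when only one-factor terms are kept."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    a = observed.sum(axis=(1, 2))
    b = observed.sum(axis=(0, 2))
    c = observed.sum(axis=(0, 1))
    return a[:, None, None] * b[None, :, None] * c[None, None, :] / n**2


def fit_all_two_factor(observed, n_iter=500, tol=1e-10):
    """Fitted counts of the all-two-factor model for a 3-way table (IPF)."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.ones_like(observed)
    for _ in range(n_iter):
        previous = fitted
        for axis in range(3):
            # Rescale so the two-way margin left after summing out `axis`
            # agrees with the observed two-way margin.
            target = observed.sum(axis=axis, keepdims=True)
            current = fitted.sum(axis=axis, keepdims=True)
            fitted = fitted * target / current
        if np.abs(fitted - previous).max() < tol:
            break
    return fitted


# Hypothetical 2x2x2 counts for genes A, B, C (index 1 = overexpressed).
observed = np.array([[[30.0, 10.0], [12.0, 20.0]],
                     [[8.0, 14.0], [16.0, 40.0]]])
independent = fit_independence(observed)
pairwise = fit_all_two_factor(observed)

cell = (1, 1, 1)
print("lift-style ratio :", observed[cell] / independent[cell])
print("raw-count EXCESS2:", observed[cell] - pairwise[cell])
```

The last line uses the raw count in place of a shrinkage estimate, which the EXCESS2 slide mentions as an option.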


Our Method

• Step 1, transform raw gene expression data to build a Boolean matrix

• Step 2, apply the Apriori method to find all frequent gene sets $S^{(0)}$

• Step 3, for k = 1 to K
    For each large gene set $s \in S^{(k-1)}$
        Fit the k-way interaction model
        If its standard residual $e^{(k)}$ is significant
            Include s into $S^{(k)}$
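Step 2 can be illustrated with a small level-wise frequent-set miner in the style of Apriori. This is a simplified sketch for illustration, not the optimized algorithm of Agrawal & Srikant; the item labels such as `A_over` and the 50% support threshold are assumptions, and the six transactions follow the A, B, C example used on the preprocessing slide.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset_of_items: support} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support_of(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Keep the candidates of size k that reach the minimum support.
        level = {}
        for candidate in current:
            s = support_of(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-sets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(sub) in level
                          for sub in combinations(c, k - 1))]
    return frequent


# One transaction per sample: the over/under-expressed items set to 1.
samples = [
    {"A_over", "C_under"},
    {"A_over", "C_over"},
    {"A_over", "C_over"},
    {"B_over", "C_under"},
    {"A_over", "C_over"},
    {"A_under", "B_over", "C_over"},
]
frequent_sets = apriori(samples, min_support=0.5)
for itemset, s in sorted(frequent_sets.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), round(s, 3))
```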


Preprocessing

• The expression values need to be discretized into categories, e.g., overexpressed, normal, underexpressed.

    > 0.2           overexpressed
    (-0.2, 0.2)     normal
    < -0.2          underexpressed

    Expression values

          A       B       C
    s1    0.23    0.1     -0.24
    s2    0.6     0.1     0.5
    s3    0.3     0.13    0.28
    s4    0.15    0.3     -0.25
    s5    0.8     0.08    0.30
    s6    -0.2    0.5     0.25

    Boolean matrix

          A over   A under   B over   B under   C over   C under
    s1    1                                              1
    s2    1                                     1
    s3    1                                     1
    s4                       1                           1
    s5    1                                     1
    s6             1         1                  1
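The discretization step can be sketched as follows, using the six samples from the slide. The item names (`A_over`, `A_under`, ...) are assumptions made for this example. The slide does not say how a value exactly at the boundary is treated; since sample s6 has three entries set in the Boolean matrix, the sketch assumes that -0.2 counts as underexpressed.

```python
THRESHOLD = 0.2

expression = {
    "s1": {"A": 0.23, "B": 0.10, "C": -0.24},
    "s2": {"A": 0.60, "B": 0.10, "C": 0.50},
    "s3": {"A": 0.30, "B": 0.13, "C": 0.28},
    "s4": {"A": 0.15, "B": 0.30, "C": -0.25},
    "s5": {"A": 0.80, "B": 0.08, "C": 0.30},
    "s6": {"A": -0.20, "B": 0.50, "C": 0.25},
}


def discretize(value, threshold=THRESHOLD):
    """Map an expression value to 'over', 'normal', or 'under'."""
    if value > threshold:
        return "over"
    if value <= -threshold:  # assumption: the lower boundary is inclusive
        return "under"
    return "normal"


def to_boolean_row(sample):
    """Items (e.g. 'A_over') that are set to 1 for one sample."""
    row = set()
    for gene, value in sample.items():
        category = discretize(value)
        if category != "normal":
            row.add(f"{gene}_{category}")
    return row


boolean_matrix = {name: to_boolean_row(s) for name, s in expression.items()}
for name, row in boolean_matrix.items():
    print(name, sorted(row))
```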


Contingency table

• For each frequent itemset s discovered by Apriori, we need to build a contingency table for further k-way interaction analysis

• Note: application of loglinear modeling is constrained by the number of samples
    Loglinear modeling requires that the number of samples be larger than the number of cells in the contingency table

    Contingency table for genes A, B, C, D with three categories each (81 cells, 300 samples); each repeated label stands for one category of that gene

              B              B              B
              A   A   A      A   A   A      A   A   A
    D    C    5   5   0      4   9   0      0   0   0
         C    0   0   0      0   7   0      0   0   0
         C    0   0   0      0   0   0      0   0   0
    D    C    1   0   0      3   7   0      0   0   0
         C    0   0   0      4   130 7      0   1   0
         C    0   0   1      0   7   2      0   1   2
    D    C    0   0   0      0   0   0      0   0   0
         C    0   0   0      1   11  0      0   0   1
         C    0   0   0      0   15  3      0   19  54

    [Slide annotation: "Frequent set" marks a cell of this table]
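Building the contingency table for a gene set amounts to counting samples per combination of categories. The sketch below does this with NumPy for the six-sample A, B, C example from the preprocessing slide; the level names, their order, and the helper names are assumptions. The second helper encodes the sample-size rule stated above, with an optional multiplier.

```python
import numpy as np

LEVELS = ("under", "normal", "over")  # assumed category labels and order


def contingency_table(samples, genes):
    """Count samples in each cell; one axis per gene, one index per level."""
    table = np.zeros((len(LEVELS),) * len(genes), dtype=int)
    for sample in samples:
        cell = tuple(LEVELS.index(sample[g]) for g in genes)
        table[cell] += 1
    return table


def has_enough_samples(table, factor=1):
    """True if the sample count exceeds `factor` times the number of cells."""
    return table.sum() > factor * table.size


samples = [
    {"A": "over", "B": "normal", "C": "under"},   # s1
    {"A": "over", "B": "normal", "C": "over"},    # s2
    {"A": "over", "B": "normal", "C": "over"},    # s3
    {"A": "normal", "B": "over", "C": "under"},   # s4
    {"A": "over", "B": "normal", "C": "over"},    # s5
    {"A": "under", "B": "over", "C": "over"},     # s6
]
table = contingency_table(samples, genes=("A", "C"))
print(table)
print("enough samples for loglinear modeling:", has_enough_samples(table))
```

With only six samples and nine cells the toy table fails the check, which is the situation the note on this slide warns about.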


Examine residuals

• Analysis of residuals may reveal cell-by-cell comparisons of observed and fitted frequencies.

• The standard residual is asymptotically normal with mean 0

    $$e_i = \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i}}$$
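As a small worked example of the standard residual, the sketch below evaluates it for the "plays basketball and eats cereal" cell of the student example, using the count expected under independence as the fitted value. Pairing the residual with that earlier example is an illustration added here, not something the slides do.

```python
import math


def standard_residual(observed, fitted):
    """Standard (Pearson) residual of one cell: (y - y_hat) / sqrt(y_hat)."""
    return (observed - fitted) / math.sqrt(fitted)


n = 5000
observed = 2000                 # students who play basketball and eat cereal
fitted = 3000 * 3750 / n        # expected count under independence
residual = standard_residual(observed, fitted)
print(f"fitted = {fitted:.0f}, standard residual = {residual:.2f}")
```

A residual this far below zero indicates that the cell holds many fewer students than the independence model predicts.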


Experimental Results

• Yeast Data
    6316 genes, 300 samples
    >0.2 overexpressed, (-0.2, 0.2) normal, <-0.2 underexpressed

• Many frequent gene sets are screened out by the k-way interaction model as k increases.

    Support(%)    $|S^{(0)}|$    $|S^{(1)}|$    $|S^{(2)}|$    $|S^{(3)}|$
    14            2735           2500           2253           1931
    15            1134           1084           852            691
    18            39             39             19             8
    20            8              8              4              1

    $S^{(0)}$: the size of frequent item sets from Apriori
    $S^{(k)}$, k = 1, 2, 3: the size of item sets which cannot be interpreted by the k-way model


Experimental Results

    Gene Set                               Frequency    1-way    2-way    3-way
    YHR029C, YMR094W, YMR096W, YMR095C     56           0        15       26
    YJR109C, YGL117W, YMR096W, YMR095C     54           0        15       23
    YJR109C, YMR094W, YMR096W, YMR095C     56           0        17       32
    YGL117W, YER175C, YMR096W, YMR095C     54           0        24       28

    The frequencies and estimates from all k-way interactions

ORF naming (e.g., YHR029C)

    Y      Y          for Yeast
    H      A-P        for the chromosome upon which the ORF resides (16)
    R      L or R     for the left or right arm
    029    3-digit    the order of the open reading frame on the chromosome arm, starting from the centromere and counting out to the telomere
    C      W or C     whether the open reading frame is on the Watson or Crick strand
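The naming scheme can be decoded mechanically. The sketch below is a minimal parser for names of exactly the form described on this slide; the regular expression, the function name, and the returned field names are assumptions, and names carrying extra suffixes are deliberately rejected.

```python
import re

# Y + chromosome letter (A-P) + arm (L/R) + 3-digit order + strand (W/C)
ORF_PATTERN = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")


def parse_orf(name):
    """Split a systematic yeast ORF name into its components."""
    match = ORF_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a plain systematic ORF name: {name!r}")
    chromosome, arm, order, strand = match.groups()
    return {
        "chromosome": ord(chromosome) - ord("A") + 1,  # A-P -> 1-16
        "arm": "left" if arm == "L" else "right",
        "order": int(order),  # position counted from the centromere
        "strand": "Watson" if strand == "W" else "Crick",
    }


for orf in ("YHR029C", "YGL117W"):
    print(orf, parse_orf(orf))
```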


Experimental results

• Our results agree with some previously known biological interactions
    Refer to the paper for details

• Our results also reveal some previously unknown interactions that have solid biological explanations
    Refer to the paper for details


Lattice for $\lambda$-terms of saturated model (2-category case)

    All(4.560)
    A(0.284)  B(1.407)  C(1.493)  D(-0.144)
    AB(-0.044)  AC(0.681)  AD(-0.006)  BC(-0.765)  BD(-0.296)  CD(0.245)
    ABC(0.233)  ABD(-0.185)  ACD(-0.118)  BCD(-0.093)
    ABCD(0.038)

    $$|\lambda^{AB}_{00}| = |\lambda^{AB}_{01}| = |\lambda^{AB}_{10}| = |\lambda^{AB}_{11}| = 0.044$$

• Obs. 1, each $\lambda$-term has only one absolute value because each gene can only have two states: overexpressed or underexpressed

• Obs. 2, we can compare the interactions by the magnitude of their $\lambda$-terms derived from the saturated model


Two-category vs. multi-category

• Two-category: we can directly compare the interactions based on $\lambda$-terms derived from loglinear models
    E.g., from $\lambda^{AC} = 0.681$, $\lambda^{BC} = -0.765$, $\lambda^{ABC} = 0.223$, we can derive a positive interaction between A and C, a negative interaction between B and C, no significant interaction between A and B ($\lambda^{AB} = -0.044$ in the lattice), and a positive three-factor interaction among A, B, and C
    Not enough for analysis at a finer level, e.g., what is the effect of weak overexpression of genes A and B on gene C?

• Multi-category: we cannot directly compare, as the d.f. (and variance) is different for each interaction.
    The values $\lambda^{AC} = 0.681$, $\lambda^{CD} = 0.245$ do not necessarily imply that the interaction of AC is greater than that of CD.
    A test statistic needs to be formed.


Framework (ongoing)


Preprocessing

• Preprocessing is used to get a subset of genes for further interaction analysis.
    Hierarchical clustering
    Association rule
    Specified by domain user based on known pathways

• Preprocessing is necessary as
    Graphical gaussian modeling is bounded by the size of samples
    Loglinear modeling is bounded by the number of cells of the contingency table, i.e., the size of samples should be 5 times larger than the number of cells in the contingency table.


Interaction Modeling

• Graphical gaussian modeling is used to generate pairwise interactions for a relatively large subset of genes.

    No information loss
    Efficient
    The independence graph may also indicate the interactions among several pathways.

• The independence graph is decomposed to get components.

• Loglinear modeling is used to generate multi-way interactions among genes in each component.


Snapshot of prototype system


Thank you !


Graphical gaussian modeling

• GGM assumes a family of normal distributions for the underlying data, constrained to satisfy the pairwise conditional independence restrictions inherent in the independence graph.
    The microarray expression data, which are log-transformed from raw image data, approximately follow a multivariate normal distribution

• Partial correlation
    The correlation between two variables after the common effects of the third variables are removed

    $$pr_{xy.z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$

    With a set of genes g,

    $$pr_{xy.g} = -\frac{s^{xy}}{\sqrt{s^{xx}\,s^{yy}}}$$

    where $s^{xy}$ is the xy-th element of the inverse of the variance matrix ($S = V^{-1}$)
    No edge is included in the graph if $|pr_{xy.g}|$ is less than some threshold
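The partial correlations given all remaining genes can be read off the inverse of the variance matrix, as in the second formula above. The sketch below is a minimal NumPy version on simulated data; the data, the function name, and the 0.3 edge threshold are assumptions chosen only for illustration. It presumes the variance matrix is invertible, which requires more samples than genes.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of each gene pair given all other genes.

    `data` has one row per sample and one column per gene.
    """
    v = np.cov(data, rowvar=False)      # variance matrix V
    s = np.linalg.inv(v)                # S = V^{-1}
    d = np.sqrt(np.diag(s))
    pr = -s / np.outer(d, d)            # -s^{xy} / sqrt(s^{xx} s^{yy})
    np.fill_diagonal(pr, 1.0)
    return pr


# Simulated example: genes x and y are both driven by gene z.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
data = np.column_stack([
    z + 0.5 * rng.normal(size=200),     # gene x
    z + 0.5 * rng.normal(size=200),     # gene y
    z,                                  # gene z
])
pr = partial_correlations(data)

# Keep an edge only where the partial correlation is large enough.
edges = np.abs(pr) >= 0.3               # hypothetical threshold
np.fill_diagonal(edges, False)
print(np.round(pr, 3))
print(edges)
```

With three genes, the entry for x and y agrees with the first formula on the slide, which removes the effect of the single third variable z.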


Loglinear modeling

• The difference from market basket data is that each gene can have multiple categories (e.g., over-expressed, normal, under-expressed) which depend on discretization strategy.
