Machine Learning in Bioinformatics’03, Washington, D.C.
Gene Interaction Analysis Using k-way Interaction Loglinear Model: A Case Study on Yeast Data
Xintao Wu, UNC Charlotte
Daniel Barbara, George Mason Univ.
Liying Zhang, Memorial Sloan Kettering Cancer Center
Yong Ye, UNC Charlotte
Microarray data
• The raw microarray images are transformed to gene expression matrices $X = \{x_{ij} \mid i = 1, \dots, n,\ j = 1, \dots, m\}$, where
  - The rows denote genes, $G = \{g_1, g_2, \dots, g_n\}$
  - The columns denote various samples, conditions, or time points, $S = \{s_1, s_2, \dots, s_m\}$
  - $x_{ij}$ corresponds to the expression value of sample $s_j$ on gene $g_i$
• Comparison with market basket data

           microarray            market basket data
  row      10^3-10^4 genes       10^3-10^4 items
  column   10^1-10^2 samples     10^6-10^9 transactions
  data     continuous            0-1
Background -- Clustering
• Clustering over genes
  - CAST, Ben-Dor et al 1999
  - MST, Xu et al 2002
  - HCS, Hartuv & Shamir 2000
  - CLICK, Sharan & Shamir 2000
• Drawback
  - Each gene is assigned to only one cluster; however, a gene can be characterized by several pathways (e.g., p53 protein)
  - Impossible to determine interactions of genes in one cluster
Background -- Interaction analysis
• Association rule, Creighton & Hanash 03
  - Need to discretize data
  - Associations instead of interactions
  - Undirected
• Graphical Gaussian model, Kishino & Waddell 00
  - No need to discretize data
  - Only pairwise interactions
  - Undirected
• Bayesian network, Segal et al 03
  - Pairwise interactions
  - Directed
  - High complexity
Background -- Association Rule
• An association rule X => Y holds if it satisfies the minimum support and confidence
  - support, s = P(X U Y), probability that a transaction contains {X U Y}
  - confidence, c = P(Y|X), conditional probability that a transaction having X also contains Y
• Efficient algorithms
  - Apriori by Agrawal & Srikant, VLDB94
  - FP-tree by Han, Pei & Yin, SIGMOD 2000
  - etc.
• Example of rules discovered in microarray data
  - A, B => C: when genes A and B are overexpressed within a sample, then often gene C is overexpressed too.
• Pros
  - One gene can be assigned to any number of rules (pathways).
• Cons
  - Gene co-expression instead of interaction
[Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both]
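The two measures can be sketched in a few lines of Python. This is only an illustration: the toy samples and the helper names `support` and `confidence` are assumptions, not part of the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


# Hypothetical toy data: each set lists the genes overexpressed in one sample.
samples = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

rule_support = support(samples, {"A", "B", "C"})           # 2 of 5 samples
rule_confidence = confidence(samples, {"A", "B"}, {"C"})   # 2 of the 3 samples with A and B
print(rule_support, rule_confidence)
```

Here the rule A, B => C is checked against both thresholds; a rule is reported only when both values reach the chosen minimums.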
Criticism of Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)
  - Among 5000 students
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  - play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence
Criticism of Support and Confidence
• We need a measure of dependent or correlated events
$$\mathrm{corr}_{X,Y} = \frac{P(X \cup Y)}{P(X)\,P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$
• P(Y|X)/P(Y) is also called the lift of rule X => Y
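As a worked check of the lift measure, the basketball and cereal counts from Example 1 can be plugged in directly. The variable names below are illustrative only.

```python
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

p_x = n_basketball / n_students    # P(X), X = plays basketball
p_y = n_cereal / n_students        # P(Y), Y = eats cereal
p_xy = n_both / n_students         # support of X => Y

confidence = p_xy / p_x            # P(Y|X)
lift = p_xy / (p_x * p_y)          # same value as confidence / p_y

# Same computation for X => not Y.
p_not_y = 1 - p_y
p_x_not_y = (n_basketball - n_both) / n_students
lift_not = p_x_not_y / (p_x * p_not_y)

print(f"lift(X => Y)     = {lift:.3f}")
print(f"lift(X => not Y) = {lift_not:.3f}")
```

A lift below 1 indicates that the two events are negatively correlated and a lift above 1 that they are positively correlated, which is why the second rule is the more informative one despite its lower support and confidence.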
Criticism of lift
• Suppose a triple ABC is unusually frequent because
  - Case 1: AB and/or AC and/or BC are unusually frequent
  - Case 2: there is something special about the triple such that all three occur together frequently.
• Example 2: (DuMouchel & Pregibon, KDD 01)
  - Suppose in a database of patient adverse drug reactions, A and B are two drugs, and C is the occurrence of kidney failure
  - Case 1: A and B may act independently upon the kidney; many occurrences of ABC arise because A and B are sometimes prescribed together
  - Case 2: A and B may have no effect on the kidney if taken alone, but when taken together a drug interaction occurs that often leads to kidney failure
  - Case 3: A and B may have a small effect on the kidney if taken alone, but when taken together, there is a strong effect.
Criticism of lift
• EXCESS2
$$\mathrm{EXCESS2} = e - e_{All2F}$$
  - $e_{All2F}$: predicted count of the all-two-factor model based on two-way distributions
  - $e$: shrinkage estimate (or we can use the raw count)
  - EXCESS2 is an estimate of the number of transactions containing the item set over and above those that can be explained by the pairwise associations of the items
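One way to obtain the all-two-factor prediction is iterative proportional fitting (IPF), which rescales a table until it reproduces every two-way margin. The sketch below is a minimal illustration and not necessarily the procedure used by DuMouchel & Pregibon. It assumes a three-way table of made-up counts and uses the raw count in place of the shrinkage estimate; the function name `fit_all_two_factor` is hypothetical.

```python
import numpy as np


def fit_all_two_factor(counts, max_iter=500, tol=1e-10):
    """Fit the all-two-factor loglinear model to a 3-way table by IPF."""
    observed = np.asarray(counts, dtype=float)
    if observed.ndim != 3:
        raise ValueError("this sketch handles three-way tables only")
    fitted = np.full(observed.shape, observed.sum() / observed.size)
    for _ in range(max_iter):
        previous = fitted.copy()
        for axis in range(3):
            # Summing out one axis leaves the two-way margin of the other two.
            target = observed.sum(axis=axis, keepdims=True)
            current = fitted.sum(axis=axis, keepdims=True)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            fitted = fitted * ratio
        if np.abs(fitted - previous).max() < tol:
            break
    return fitted


# Made-up counts indexed as counts[a, b, c]; index 1 means the item is present.
counts = np.array([[[40, 10], [10, 5]],
                   [[10, 5], [5, 15]]])
fitted = fit_all_two_factor(counts)

# Raw-count version of EXCESS2 for the cell where A, B and C all occur.
excess2 = counts[1, 1, 1] - fitted[1, 1, 1]
print(f"observed={counts[1, 1, 1]}, fitted={fitted[1, 1, 1]:.2f}, EXCESS2={excess2:.2f}")
```

A positive value means the triple occurs more often than the pairwise associations alone can explain.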
Motivation
• EXCESS2
  - By analyzing residuals, we can pick up the multi-item associations that cannot be explained by all the pairwise associations included in the all-2-way model.
  - Can separate cases 2 and 3 from case 1.
  - Does not include multi-way interactions.
• Our contribution
  - Extend the all-two-factor model to a general k-way loglinear model
  - Apply association rules to identify gene sets for further analysis
Saturated log-linear model
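$$
\begin{aligned}
\log \hat{y}_{ijkl} = {} & \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_l^{D} \\
& + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{il}^{AD} + \lambda_{jk}^{BC} + \lambda_{jl}^{BD} + \lambda_{kl}^{CD} \\
& + \lambda_{ijk}^{ABC} + \lambda_{ijl}^{ABD} + \lambda_{ikl}^{ACD} + \lambda_{jkl}^{BCD} + \lambda_{ijkl}^{ABCD}
\end{aligned}
$$
  - $\lambda$: main effect
  - $\lambda_i^{A}$: 1-factor effect
  - $\lambda_{ij}^{AB}$: 2-factor effect, which shows the dependency within the distributions of A, B.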
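Computing λ-term
• Linear constraints of coefficients: loglinear parameters sum to 0 over all indices
$$\sum_i \lambda_i^{A} = \sum_j \lambda_j^{B} = \sum_k \lambda_k^{C} = \sum_l \lambda_l^{D} = 0$$
$$\sum_i \lambda_{ij}^{AB} = \sum_j \lambda_{ij}^{AB} = \sum_i \lambda_{ik}^{AC} = \dots = \sum_l \lambda_{kl}^{CD} = 0$$
$$\dots$$
$$\sum_i \lambda_{ijkl}^{ABCD} = \sum_j \lambda_{ijkl}^{ABCD} = \sum_k \lambda_{ijkl}^{ABCD} = \sum_l \lambda_{ijkl}^{ABCD} = 0$$
• UpDown method (Sarawagi et al, EDBT98), where $l_{ijkl} = \log y_{ijkl}$ and a dot denotes the mean over that index
$$\lambda = l_{....}$$
$$\lambda_i^{A} = l_{i...} - \lambda$$
$$\lambda_{ij}^{AB} = l_{ij..} - \lambda_i^{A} - \lambda_j^{B} - \lambda$$
$$\lambda_{ijk}^{ABC} = l_{ijk.} - \lambda_{ij}^{AB} - \lambda_{ik}^{AC} - \lambda_{jk}^{BC} - \lambda_i^{A} - \lambda_j^{B} - \lambda_k^{C} - \lambda$$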
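The recursion can be sketched with NumPy as follows. This is an illustrative reading of the averaging scheme, not the authors' implementation. It assumes every cell count is positive (so the logarithm is finite), and the function name `saturated_lambda_terms` and the example counts are made up.

```python
from itertools import combinations

import numpy as np


def saturated_lambda_terms(counts):
    """Return {subset of axes: lambda array} for the saturated loglinear model.

    Each term is the mean of the log counts over the axes NOT in the subset,
    minus every lower-order term already computed for a proper subset.
    """
    log_y = np.log(np.asarray(counts, dtype=float))
    axes = tuple(range(log_y.ndim))
    terms = {}
    for order in range(log_y.ndim + 1):
        for subset in combinations(axes, order):
            others = tuple(a for a in axes if a not in subset)
            # keepdims=True keeps every term broadcastable against the others.
            term = log_y.mean(axis=others, keepdims=True) if others else log_y.copy()
            for lower, value in terms.items():
                if set(lower) < set(subset):
                    term = term - value
            terms[subset] = term
    return terms


# Made-up 2 x 2 x 2 table with strictly positive counts.
counts = np.array([[[40, 10], [10, 5]],
                   [[10, 5], [5, 15]]])
terms = saturated_lambda_terms(counts)
reconstructed = sum(terms.values())   # adding all terms gives back log(counts)
print(float(terms[()].squeeze()))     # the overall term lambda
```

Summing every term reproduces the log counts exactly, which is what makes the model saturated, and each term sums to zero along each of its own indices.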
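k-way loglinear model
• Comparison with lift, EXCESS2
  - Independence model
$$\log \hat{y}^{lift}_{ijkl} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_l^{D}$$
  - Pairwise model
$$\log \hat{y}^{pairwise}_{ijkl} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_l^{D} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{il}^{AD} + \lambda_{jk}^{BC} + \lambda_{jl}^{BD} + \lambda_{kl}^{CD}$$
  - 3-way model
$$\log \hat{y}^{3way}_{ijkl} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_l^{D} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{il}^{AD} + \lambda_{jk}^{BC} + \lambda_{jl}^{BD} + \lambda_{kl}^{CD} + \lambda_{ijk}^{ABC} + \lambda_{ijl}^{ABD} + \lambda_{ikl}^{ACD} + \lambda_{jkl}^{BCD}$$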
Our Method
• Step 1, transform gene expression raw data to build a boolean matrix
• Step 2, apply Apriori method to find all frequent gene sets, $S^{(0)}$
• Step 3, for k = 1 to K
  - For each large gene set $s \in S^{(k-1)}$
    - Fit k-way interaction model
    - If its standard residual $e^{(k)}$ is significant, include s into $S^{(k)}$
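Step 2 can be illustrated with a compact level-wise Apriori search. This is a teaching sketch under simple assumptions (transactions as Python sets, a single minimum-support threshold), not the optimized algorithm of Agrawal & Srikant. The function name `apriori` and the toy samples are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    A size-k candidate is kept only if all of its size-(k-1) subsets
    were frequent (the Apriori pruning rule).
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                result[c] = s
        return result

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([item]) for item in items})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


# Hypothetical toy data: genes overexpressed in each of five samples.
samples = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]
frequent_sets = apriori(samples, min_support=0.5)
for itemset, s in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The resulting frequent sets would play the role of the starting collection that Step 3 then filters model by model.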
Preprocessing
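• The expression values need to be discretized into categories, e.g., overexpressed, normal, underexpressed.
  - >0.2 overexpressed
  - (-0.2, 0.2) normal
  - <-0.2 underexpressed

Expression values:
        A       B       C
  s1    0.23    0.1     -0.24
  s2    0.6     0.1     0.5
  s3    0.3     0.13    0.28
  s4    0.15    0.3     -0.25
  s5    0.8     0.08    0.30
  s6    -0.2    0.5     0.25

Boolean matrix (↑ overexpressed, ↓ underexpressed):
        A↑   A↓   B↑   B↓   C↑   C↓
  s1    1    0    0    0    0    1
  s2    1    0    0    0    1    0
  s3    1    0    0    0    1    0
  s4    0    0    1    0    0    1
  s5    1    0    0    0    1    0
  s6    0    1    1    0    1    0

In s6, A = -0.2 lies on the boundary and is counted as underexpressed.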
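The discretization step can be sketched with NumPy using the six example samples. The column naming is illustrative, and treating the boundary value -0.2 as underexpressed is an assumption made here so that s6 receives three entries, as in the slide.

```python
import numpy as np

genes = ["A", "B", "C"]
expression = np.array([
    [0.23, 0.10, -0.24],   # s1
    [0.60, 0.10, 0.50],    # s2
    [0.30, 0.13, 0.28],    # s3
    [0.15, 0.30, -0.25],   # s4
    [0.80, 0.08, 0.30],    # s5
    [-0.20, 0.50, 0.25],   # s6
])

THRESHOLD = 0.2
over = expression > THRESHOLD
# Assumption: a value exactly at -0.2 counts as underexpressed.
under = expression <= -THRESHOLD

# Interleave the columns as A_over, A_under, B_over, B_under, C_over, C_under.
boolean_matrix = np.empty((expression.shape[0], 2 * len(genes)), dtype=int)
boolean_matrix[:, 0::2] = over
boolean_matrix[:, 1::2] = under
columns = [f"{g}_{state}" for g in genes for state in ("over", "under")]

print(columns)
print(boolean_matrix)
```

Each row of `boolean_matrix` can then be read as a transaction whose items are the columns set to 1.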
Contingency table
• For each frequent itemset s discovered by Apriori, we need to build a contingency table for further k-way interaction analysis
• Note: the application of loglinear modeling is constrained by the number of samples, as loglinear modeling requires the number of samples to be larger than the number of cells in the contingency table

                B                 B                 B
            A    A    A       A    A    A       A    A    A
  D   C     5    5    0       4    9    0       0    0    0
      C     0    0    0       0    7    0       0    0    0
      C     0    0    0       0    0    0       0    0    0
  D   C     1    0    0       3    7    0       0    0    0
      C     0    0    0       4    130  7       0    1    0
      C     0    0    1       0    7    2       0    1    2
  D   C     0    0    0       0    0    0       0    0    0
      C     0    0    0       1    11   0       0    0    1
      C     0    0    0       0    15   3       0    19   54
Frequent set
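Building such a table from discretized data takes only a few lines. The sketch below assumes each gene's category is encoded as an integer level (for example 0, 1, 2 for three categories); the function name `contingency_table` and the sample data are hypothetical.

```python
import numpy as np


def contingency_table(categories, n_levels):
    """Count samples for each combination of category levels.

    `categories` has shape (n_samples, n_genes) with integer levels
    0 .. n_levels - 1; the result has one axis of length n_levels per gene.
    """
    categories = np.asarray(categories)
    table = np.zeros((n_levels,) * categories.shape[1], dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(table, tuple(categories.T), 1)
    return table


# Hypothetical levels for two genes across five samples.
levels = [[0, 0], [0, 0], [1, 2], [2, 2], [1, 2]]
table = contingency_table(levels, n_levels=3)
print(table)
```

With four genes and three categories each, the table has 3^4 = 81 cells, which is why the number of samples limits how large a gene set can be modeled.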
Examine residuals
• Analysis of residuals may reveal cell-by-cell comparisons of observed and fitted frequencies.
• The standard residual is asymptotically normal with mean 0
$$e_i = \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i}}$$
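A short numerical illustration of the standard residual follows. For simplicity it fits the independence model to a 2 x 2 table rather than a k-way model, and it reuses the basketball and cereal counts from Example 1. The function name `standardized_residuals` is illustrative.

```python
import numpy as np


def standardized_residuals(observed, fitted):
    """Cell-by-cell residuals e_i = (y_i - yhat_i) / sqrt(yhat_i)."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    return (observed - fitted) / np.sqrt(fitted)


# Rows: plays basketball (yes, no); columns: eats cereal (yes, no).
observed = np.array([[2000, 1000],
                     [1750, 250]])

# Independence fit: row total * column total / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
fitted = np.outer(row_totals, col_totals) / observed.sum()

residuals = standardized_residuals(observed, fitted)
print(fitted)
print(residuals.round(2))
```

Cells whose residual is far from 0 are the ones the fitted model fails to explain.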
• Many frequent gene sets can be screened out by the k-way interaction models as k is increased.

  Support(%)   $S^{(0)}$   $S^{(1)}$   $S^{(2)}$   $S^{(3)}$
  14           2735        2500        2253        1931
  15           1134        1084        852         691
  18           39          39          19          8
  20           8           8           4           1

  - $S^{(0)}$: the size of frequent item sets from Apriori
  - $S^{(k)}$: the size of item sets which cannot be interpreted by the k-way model
Experimental Results
  Gene Set                              Frequency   1-way   2-way   3-way
  YHR029C, YMR094W, YMR096W, YMR095C    56          0       15      26
  YJR109C, YGL117W, YMR096W, YMR095C    54          0       15      23
  YJR109C, YMR094W, YMR096W, YMR095C    56          0       17      32
  YGL117W, YER175C, YMR096W, YMR095C    54          0       24      28

The frequencies and estimates from all k-way interactions

ORF naming (example: YHR029C)
  Y     Y for Yeast
  H     A-P for the chromosome upon which the ORF resides (16)
  R     L or R for the left or right arm
  029   3-digit order of the open reading frame on the chromosome arm, starting from the centromere and counting out to the telomere
  C     W or C, whether the open reading frame is on the Watson or Crick strand
Experimental results
• Our results agree with some previously known biological interactions
  - Refer to the paper for details
• Our results also reveal some previously unknown interactions that have solid biological explanations
  - Refer to the paper for details
Lattice for λ-term of saturated model (2-category case)
$$|\lambda^{AB}_{00}| = |\lambda^{AB}_{01}| = |\lambda^{AB}_{10}| = |\lambda^{AB}_{11}| = 0.044$$
  All (4.560)
  A (0.284), B (1.407), C (1.493), D (-0.144)
  AB (-0.044), AC (0.681), AD (-0.006), BC (-0.765), BD (-0.296), CD (0.245)
  ABC (0.233), ABD (-0.185), ACD (-0.118), BCD (-0.093)
  ABCD (0.038)
• Obs. 1, each λ-term has only one absolute value because each gene can have only two states: overexpressed or underexpressed
• Obs. 2, we can compare the interactions by the magnitude of the λ-terms derived from the saturated models
Two-category vs. multi-category
• Two-category: we can directly compare the interactions based on the λ-terms derived from loglinear models
  - E.g., from $\lambda^{AC} = 0.681$, $\lambda^{BC} = -0.765$, $\lambda^{ABC} = 0.223$ we can derive a positive interaction between A and C, a negative interaction between B and C, and a positive three-factor interaction among A, B, and C; a term near zero, such as AD (-0.006) in the lattice, indicates no significant interaction
  - Not enough for analysis at a finer level, e.g., what is the effect of weak overexpression of genes A and B on gene C?
• Multi-category: we cannot directly compare the λ-terms, as the d.f. (and variance) is different for each interaction.
  - The values $\lambda^{AC} = 0.681$, $\lambda^{CD} = 0.245$ do not necessarily imply that the interaction of AC is greater than that of CD.
  - A test statistic needs to be formed.
Framework (ongoing)
Preprocessing
• Preprocessing is used to get a subset of genes for further interaction analysis.
  - Hierarchical clustering
  - Association rule
  - Specified by domain user based on known pathways
• Preprocessing is necessary as
  - Graphical Gaussian modeling is bounded by the size of samples
  - Loglinear modeling is bounded by the number of cells of the contingency table, i.e., the size of samples should be 5 times larger than the number of cells in the contingency table.
Interaction Modeling
• Graphical Gaussian modeling is used to generate pairwise interactions for a relatively large subset of genes.
  - No information loss
  - Efficient
  - The independence graph may also indicate the interactions among several pathways.
• The independence graph is decomposed to get components.
• Loglinear modeling is used to generate multi-way interactions among genes in each component.
Snapshot of prototype system
Thank you !
Graphical Gaussian modeling
• GGM assumes a family of normal distributions for the underlying data, constrained to satisfy the pairwise conditional independence restrictions inherent in the independence graph.
  - The microarray expression data, which are log-transformed from raw image data, approximately follow a multivariate normal distribution
• Partial correlation
  - The correlation between two variables after the common effects of the third variables are removed
$$pr_{xy.z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$
  - With a set of genes g,
$$pr_{xy.g} = -\frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}}$$
    where $s_{xy}$ is the xy-th element of the inverse of the variance matrix ($S = V^{-1}$)
  - No edge is included in the graph if $|pr_{xy.g}|$ is less than some threshold
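The partial-correlation step can be sketched with NumPy. This is a minimal illustration on simulated data, not the prototype system. The function names, the simulated variables, and the threshold of 0.3 are all assumptions chosen for the example.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of every pair given all remaining variables.

    `data` has shape (n_samples, n_genes). Uses -s_xy / sqrt(s_xx * s_yy),
    where s is the inverse of the covariance matrix.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pr, 1.0)
    return pr


def independence_graph(pr, threshold):
    """Boolean adjacency matrix: an edge wherever |pr| reaches the threshold."""
    adjacency = np.abs(pr) >= threshold
    np.fill_diagonal(adjacency, False)
    return adjacency


# Simulated example: x and y both depend on z but not directly on each other.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z])

pr = partial_correlations(data)
graph = independence_graph(pr, threshold=0.3)
print(pr.round(3))
print(graph)
```

In this simulation x and y are strongly correlated marginally, yet their partial correlation given z is close to zero, so the graph keeps the edges x-z and y-z and drops x-y.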
Loglinear modeling
• The difference from market basket data is that each gene can have multiple categories (e.g., over-expressed, normal, under-expressed) which depend on discretization strategy.