CISC 667 Intro to Bioinformatics (Fall 2005)
Support Vector Machines (II)
Bioinformatics Applications
CISC667, F05, Lec23, Liao
Combining pairwise similarity with SVMs for protein homology detection
[Figure: (1) protein homologs and protein non-homologs are converted into positive and negative pairwise score vectors (positive train, negative train); (2) a support vector machine is trained for binary classification; (3) a target protein of unknown function is the testing data.]
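The pairwise score vectors in this pipeline can be illustrated with a minimal sketch: each protein is represented by its similarity scores against every training protein. The `toy_similarity` function (shared 3-mer count) is a hypothetical stand-in for a real alignment score, and the sequences are made-up strings, not real proteins.

```python
def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def toy_similarity(a, b, k=3):
    """Hypothetical stand-in for an alignment score: number of shared k-mers."""
    return len(kmers(a, k) & kmers(b, k))

def pairwise_vector(seq, train_seqs, k=3):
    """Represent seq by its similarity score against each training sequence."""
    return [toy_similarity(seq, t, k) for t in train_seqs]

# Made-up toy sequences (assumption, for illustration only).
homologs = ["MKVLAAGIV", "MKVLSAGIV"]   # positive training examples
non_homologs = ["GGHHPPWWQ"]            # negative training examples
train = homologs + non_homologs

X_train = [pairwise_vector(s, train) for s in train]  # one vector per protein
y_train = [1, 1, -1]                                  # labels for a binary SVM
target_vector = pairwise_vector("MKVLAAGLV", train)   # protein of unknown function
```

The rows of `X_train` with labels `y_train` would be the input to any binary SVM trainer, and `target_vector` is what the trained classifier would score.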
Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999
Vectorization
A measure of sensitivity and specificity
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
[Figure: example cases with ROC = 1, ROC = 0, and ROC = 0.67]
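As a minimal sketch of this definition, the ROC score can be computed from test examples sorted by decreasing classifier output. The function name `roc_score` and the example rankings are illustrative assumptions, and ties between scores are ignored.

```python
def roc_score(ranked_labels):
    """Normalized area under the curve of true positives vs. false positives.

    ranked_labels: list of 1 (positive) / 0 (negative), ordered from the
    highest-scoring example to the lowest. Ties are not handled.
    """
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    area = 0
    true_positives = 0
    for label in ranked_labels:
        if label == 1:
            true_positives += 1      # curve steps up
        else:
            area += true_positives   # curve steps right; add column height
    return area / (positives * negatives)

perfect = roc_score([1, 1, 0, 0])    # every positive ranked above every negative
reversed_ = roc_score([0, 0, 1, 1])  # every negative ranked above every positive
mixed = roc_score([1, 1, 0, 1])      # one negative outranks one of three positives
```

Here `perfect` is 1.0, `reversed_` is 0.0, and `mixed` is 2/3.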
Performance Comparison (1)
Using Phylogenetic Profiles & SVMs
YAL001C
E-value Phylogenetic profile
0.122 1
1.064 0
3.589 0
0.008 1
0.692 1
8.49 0
14.79 0
0.584 1
1.567 0
0.324 1
0.002 1
3.456 0
2.135 0
0.142 1
0.001 1
0.112 1
1.274 0
0.234 1
4.562 0
3.934 0
0.489 1
0.002 1
2.421 0
0.112 1
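The table can be reproduced with a simple thresholding step. This is a minimal sketch: the cutoff of 1.0 is an assumption inferred from the rows above (every E-value below 1 maps to 1, every E-value above 1 maps to 0), and the function name is illustrative.

```python
E_VALUE_THRESHOLD = 1.0  # assumption: inferred from the table, not stated in the text

def phylogenetic_profile(e_values, threshold=E_VALUE_THRESHOLD):
    """Map each genome's best-hit E-value to 1 (homolog present) or 0 (absent)."""
    return [1 if e < threshold else 0 for e in e_values]

# First eight E-values from the table for YAL001C.
e_values = [0.122, 1.064, 3.589, 0.008, 0.692, 8.49, 14.79, 0.584]
profile = phylogenetic_profile(e_values)
```

The resulting `profile` matches the first eight bits in the right-hand column of the table.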
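Phylogenetic Profiles and Evolution Patterns
[Figure: a phylogenetic tree with 0/1 labels at its internal nodes (one possible evolution pattern) and the profile x = 1 1 0 1 0 0 0 1 1 at its leaves]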
Impossible to know for sure if the gene followed exactly this
evolution pattern
Tree Kernel (Vert, 2002)
For a phylogenetic profile x and an evolution pattern e:
• P(e) quantifies how “natural” the pattern is
• P(x|e) quantifies how likely the pattern e is the “true history” of the profile x
Tree Kernel:
$K_{\text{tree}}(x, y) = \sum_{e} P(e)\, P(x|e)\, P(y|e)$
Can be proved to be a kernel.
Intuition: two profiles get closer in the feature space when they share common evolution patterns with high probability.
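The sum over patterns can be made concrete with a brute-force sketch. The probability model here is a toy assumption, not the tree model of the cited work: a pattern e is just a 0/1 vector with uniform P(e), and P(x|e) lets each profile bit differ from the pattern bit with a hypothetical error rate `eps`. The kernel formula itself is the one given above.

```python
from itertools import product

def p_profile_given_pattern(x, e, eps=0.1):
    """Toy P(x|e): each bit of x matches e with prob 1 - eps, independently."""
    prob = 1.0
    for xi, ei in zip(x, e):
        prob *= (1.0 - eps) if xi == ei else eps
    return prob

def pattern_kernel(x, y, eps=0.1):
    """K(x, y) = sum over patterns e of P(e) * P(x|e) * P(y|e)."""
    n = len(x)
    p_e = 1.0 / 2 ** n  # toy assumption: uniform prior over all 0/1 patterns
    return sum(
        p_e * p_profile_given_pattern(x, e, eps) * p_profile_given_pattern(y, e, eps)
        for e in product((0, 1), repeat=n)
    )

x = (1, 1, 0)
z = (0, 0, 1)
k_same = pattern_kernel(x, x)  # profiles that share likely patterns
k_diff = pattern_kernel(x, z)  # profiles with no likely pattern in common
```

Because each term is a product of a feature of x and the same feature of y, weighted by P(e), the sum is an inner product in a feature space indexed by patterns, which is why it is a valid kernel. Enumerating all patterns is exponential in the profile length, so this sketch only suits tiny examples.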
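Tree-Encoded Profile (Narra & Liao, 2004)
[Figure: a phylogenetic tree with the profile 1 1 0 1 0 0 0 1 1 at its leaves and scores 1, 0.33, 0.67, 0.34, 0.5, 0.75, 0.55 at its internal nodes]
Post-order traversal of the internal-node scores: 1 0.33 0.67 0.34 0.5 0.75 0.55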
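A minimal sketch of the encoding follows, assuming each internal node is scored as the mean of its children's scores. That rule is consistent with values such as 0.33, 0.67, and 0.5 above but is an assumption here. The tree topology in the example is hypothetical, not the one in the figure.

```python
def _score(node, profile, internal_scores):
    """Post-order walk: score children first, then the node itself."""
    if isinstance(node, int):           # leaf: index into the profile
        return float(profile[node])
    child_scores = [_score(child, profile, internal_scores) for child in node]
    score = sum(child_scores) / len(child_scores)  # assumed rule: mean of children
    internal_scores.append(score)
    return score

def tree_encoded_profile(tree, profile):
    """Return the profile extended with internal-node scores in post-order."""
    internal_scores = []
    _score(tree, profile, internal_scores)
    return [float(bit) for bit in profile] + internal_scores

# Hypothetical 4-leaf tree: leaves are profile indices, tuples are internal nodes.
tree = ((0, 1), (2, 3))
encoded = tree_encoded_profile(tree, [1, 1, 0, 1])
```

For this toy tree the two inner nodes score 1.0 and 0.5, and the root scores 0.75, so the encoded vector has the four profile bits followed by those three values.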
Using Support Vector Machines
Kernel function:
where r = 0.10
Soft margin regularization C = 1.50
Coding scheme: BIN21
$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \left( K(x_i, x_j) + \delta_{ij}/C \right)$
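The objective above can be evaluated directly for a given set of multipliers. This is a minimal sketch, assuming a radial basis kernel exp(-r·||x - z||²); the kernel formula is not spelled out in the text, so that choice is an assumption, as are the function names. The values r = 0.10 and C = 1.50 are the ones listed above.

```python
import math

def rbf_kernel(x, z, r=0.10):
    """Assumed kernel form: exp(-r * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-r * squared_distance)

def dual_objective(alpha, X, y, C=1.50, r=0.10):
    """Sum of alpha_i minus half the quadratic term, with delta_ij / C on the diagonal."""
    n = len(alpha)
    quadratic = 0.0
    for i in range(n):
        for j in range(n):
            k = rbf_kernel(X[i], X[j], r)
            if i == j:
                k += 1.0 / C  # soft-margin regularization term
            quadratic += alpha[i] * alpha[j] * y[i] * y[j] * k
    return sum(alpha) - 0.5 * quadratic

# Tiny made-up example: one point, one multiplier.
value = dual_objective([1.0], [[0.0, 0.0]], [1])
```

Training an SVM means maximizing this quantity over the multipliers subject to the usual constraints; the sketch only evaluates it.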
Evaluation:
Q3 = (P1+P2+P3)/N
$C = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$
SOV: segment overlap accuracy
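The first two measures can be sketched as follows, assuming P1, P2, P3 are the numbers of correctly predicted residues in each of the three states, N is the total number of residues, and PP, PN, AP, AN are the predicted and actual positive and negative counts for one state. The state letters H, E, C and the example strings are illustrative assumptions.

```python
import math

def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

def correlation_coefficient(predicted, observed, state):
    """Correlation coefficient C for one state treated as the positive class."""
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    pp, pn = tp + fp, tn + fn  # predicted positives / negatives
    ap, an = tp + fn, tn + fp  # actual positives / negatives
    denominator = math.sqrt(pp * pn * ap * an)
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

predicted = "HHHEECCC"  # made-up prediction
observed = "HHHECCCC"   # made-up true structure
accuracy = q3(predicted, observed)
c_helix = correlation_coefficient(predicted, observed, "H")
c_strand = correlation_coefficient(predicted, observed, "E")
```

In this toy example 7 of 8 residues are correct, the helix state is predicted perfectly, and the strand state has one false positive.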
Design tertiary classifiers
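One possible design is sketched below, assuming three one-vs-rest binary SVMs (H/~H, E/~E, C/~C) and a rule that picks the state with the largest decision value. The rule, the function name, and the decision values are illustrative assumptions, not necessarily the design used in the lecture.

```python
def tertiary_predict(decision_values):
    """Pick the state whose one-vs-rest binary classifier is most confident.

    decision_values: dict mapping a state label to the signed decision value
    of the binary classifier for that state (hypothetical inputs here).
    """
    return max(decision_values, key=decision_values.get)

# Made-up decision values for one residue.
residue_state = tertiary_predict({"H": -0.4, "E": 0.9, "C": 0.2})
```

Other combinations are possible, such as voting over pairwise classifiers or arranging the binary classifiers in a decision tree.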
Nguyen & Rajapakse, Genome Informatics 14: 218-227 (2003)
A two-stage SVM