Machine Learning for Biological Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]
http://scai.snu.ac.kr/~btzhang/

This material is available at http://scai.snu.ac.kr/~btzhang/

Outline
• Basics in Molecular Biology
• Current Issues and Applications
• Machine Learning for Bioinformatics
• DNA Chip Data Mining
• Graphical Models for Gene Expression Analysis
• Summary
Human Genetic Variations (Single Nucleotide Polymorphisms)
• SNPs: "genetic individuality"
• About 1 in 1,000 bases varies between two humans
• Make us more or less susceptible to diseases
• May influence the effect of drug treatments
TTTGCTCCGTTTTCA
TTTGCTCYGTTTTCA
TTTGCTCTGTTTTCA
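The three aligned sequences above differ at a single position. A minimal sketch of how such a variable site could be located, assuming the sequences are already aligned and of equal length (the function name `variable_sites` is hypothetical, and `Y` is read here as the IUPAC ambiguity code for C or T):

```python
def variable_sites(seqs):
    """Return (index, set of symbols) for every column where aligned sequences differ."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    # zip(*seqs) walks the alignment column by column
    for i, column in enumerate(zip(*seqs)):
        symbols = set(column)
        if len(symbols) > 1:
            sites.append((i, symbols))
    return sites


seqs = ["TTTGCTCCGTTTTCA", "TTTGCTCYGTTTTCA", "TTTGCTCTGTTTTCA"]
sites = variable_sites(seqs)
print(sites)  # one variable column, at 0-based index 7
```

Real SNP calling works on many individuals and must handle sequencing errors and alignment gaps; this only illustrates the column-wise comparison.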
SNP (Single Nucleotide Polymorphism)
• Finding single nucleotide changes at specific regions of genes
• Diagnosis of hereditary diseases
• Personalized drugs
• Finding more effective drugs and treatments
Human Individuality
Flood of Data! (SWISS-PROT)
[Figure: growth of SWISS-PROT, 1988 to 1996. x-axis: year of release; y-axis: number of sequences (× 1,000), from 0 to 80.]
How Can We Analyze the Flood of Data?
• Data: don't just store it, analyze it!
• By comparing sequences, one can find out about things like
  - ancestors of organisms
  - phylogenetic trees
  - protein structures
  - protein function
Bioinformatics Is About:
• Elicitation of DNA sequences from genetic material
• Sequence annotation (e.g. with information from experiments)
• Understanding the control of gene expression (i.e. under what circumstances genes are transcribed from DNA and translated into proteins)
• The relationship between the amino acid sequence of proteins and their structure
Aim of Research in Bioinformatics
• Understand the functioning of living things, to "improve the quality of life"
  - Drug design
  - Identification of genetic risk factors
  - Gene therapy
  - Genetic modification of food crops and animals, etc.
Current Issues and Applications
The Central Dogma of Information Flow in Biology
The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by the DNA sequence, which is transcribed into RNA and translated into protein.
DNA → RNA → Protein → Function
3 Main Classes of Problem Areas
• Central dogma related: sequence, structure or function
• Data related: storage, retrieval and analysis (exponential growth of knowledge in molecular biology)
• Simulation of biological processes: protein folding (molecular dynamics) or metabolic pathways
Topics in Bioinformatics
• Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
• Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
• Expression analysis
  - Gene expression analysis
  - Gene clustering

• Finding evolutionary relationships
• Finding coding regions of genomic sequences
• Translating DNA to protein
• Finding regulatory regions
• Assembling genome sequences
Finding information and patterns in DNA and protein data
Structure Analysis
• The amino acid sequence of a protein determines its 3D conformation

Sequence alignment (homology search)
- Pairwise sequence alignment
- Database search for similar sequences
- Multiple sequence alignment
- Phylogenetic tree reconstruction
- Protein 3D structure alignment
Machine Learning Methods
Problems in Biological Science
Expression (DNA Chip Data) Analysis
Clustering algorithms
- Hierarchical cluster analysis
- Kohonen neural networks
Classification algorithms
- Bayesian networks
- Neural networks
- Support vector machines
- Decision trees

- Superfamily classification
- Ortholog/paralog grouping of genes
- 3D fold classification

• Classification (supervised learning)
  - Hidden Markov model
  - Neural networks
  - Bayesian networks
  - Support vector machines
  - Nearest neighbor algorithm
  - Decision trees
Support Vector Machines for Functional Classification of Genes (1)
• Classification of microarray gene expression data [M. Brown, et al., PNAS, 97(1):262-267]
  - Classifying gene functional class using gene expression data from DNA microarray hybridization experiments
  - Dataset: 2467 genes, 79 experiments (2467 × 79 matrix)
• CAMDA
  - Critical Assessment of Techniques for Microarray Data Mining
  - Purpose: evaluate the data-mining techniques available to the microarray community.
• Data Set 1
  - Identification of cell cycle-regulated genes
  - Yeast Saccharomyces cerevisiae by microarray hybridization.
  - Gene expression data with 6,278 genes.
• Data Set 2
  - Cancer class discovery and prediction by gene expression monitoring.
  - Two types of cancer: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
  - Gene expression data with 7,129 genes.
CAMDA-2000 Data Set 1: Identification of Cell Cycle-regulated Genes of the Yeast by Microarray Hybridization
• Data given: gene expression levels of 6,278 genes over time
  - α-factor-based synchronization: every 7 minutes from 0 to 119 (18)
  - Cdc15-based synchronization: every 10 minutes from 10 to 290 (24)
  - Cdc28-based synchronization: every 10 minutes from 0 to 160 (17)
  - Elutriation (size-based synchronization): every 30 minutes from 0 to 390 (14)
• Among the 6,278 genes
  - 104 genes are known to be cell-cycle regulated
    - classified into: M/G1 boundary (19), late G1 SCB
initialize p(zk), p(gi|zk), p(tj|zk) for i=1~N, j=1~M, k=1~K such that
while (max_iteration not reached) do   // EM adaptation
    // E-step
    // M-step
end while
// prototypes
p(t|zk), k=1~K, are the prototypes for each cluster
// clustering
given a gene gi, the cluster of gi is the k for which p(gi ? zk) is the largest
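The E-step and M-step bodies are not given in the pseudocode above. The following is a minimal sketch under stated assumptions, not the author's implementation: it assumes the standard aspect (latent class) model p(g, t) = Σ_k p(z_k) p(g|z_k) p(t|z_k), a non-negative gene-by-time matrix `X` used as weights, probability tables that each sum to 1, and clustering by the largest joint p(g_i, z_k). The function name `aspect_model_em` and all parameter names are hypothetical.

```python
import numpy as np


def aspect_model_em(X, K, n_iter=100, seed=0, eps=1e-12):
    """EM for p(g,t) = sum_k p(z_k) p(g|z_k) p(t|z_k) on a non-negative N x M matrix X."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    pz = np.full(K, 1.0 / K)                  # p(z_k)
    pg = rng.random((N, K)) + eps             # p(g_i | z_k), one column per cluster
    pg /= pg.sum(axis=0)
    pt = rng.random((M, K)) + eps             # p(t_j | z_k), the cluster prototypes
    pt /= pt.sum(axis=0)
    loglik = []
    for _ in range(n_iter):
        joint = (pz * pg) @ pt.T + eps        # N x M table of p(g_i, t_j)
        loglik.append(float(np.sum(X * np.log(joint))))
        # E-step: q[i, j, k] = p(z_k | g_i, t_j)
        q = (pz * pg)[:, None, :] * pt[None, :, :] / joint[:, :, None]
        # M-step: re-estimate the tables from posterior-weighted data
        w = X[:, :, None] * q
        pg = w.sum(axis=1) + eps
        pt = w.sum(axis=0) + eps
        pz = pg.sum(axis=0)                   # total weight per cluster
        pg /= pg.sum(axis=0)
        pt /= pt.sum(axis=0)
        pz /= pz.sum()
    # Clustering: assign each gene to the k with the largest p(g_i, z_k)
    labels = np.argmax(pz * pg, axis=1)
    return pz, pg, pt, labels, loglik
```

A call such as `aspect_model_em(X, K=5)` returns the prototypes in `pt` (one column per cluster) and one cluster label per gene. Expression data given as log ratios would need to be shifted or rescaled to be non-negative before use, which is a preprocessing choice outside this sketch.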
• Prediction error of this Bayesian network (given all attribute values)
• The result can be improved by more appropriate data preprocessing.

                 Weighted voting   Bayesian network
Training data    1/38              1/38
Test data        4/34              8/34
Non-negative Matrix Factorization
Non-negative Matrix Factorization (1)
• Method
  - Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
• NMF (non-negative matrix factorization)

$$G \approx WH, \qquad G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$

G : gene expression data matrix
W : basis matrix (prototypes)
H : encoding matrix (in low dimension)

$$G_{i\mu},\; W_{ia},\; H_{a\mu} \ge 0$$
• NMF as a latent variable model

[Figure: latent variables h1, h2, …, hr connected through W to observed variables g1, g2, …, gn, with $g \approx Wh$.]
NMF (2) Clustering Gene Expression Data
[Figure: G (7,129 genes × 38 samples) ≈ W(?) (7,129 genes × 2 factors) × H(?) (2 factors × 38 samples, the encoding).]
• Factors can capture the correlations between genes from their expression levels.
• Cluster the training samples into 2 groups by NMF
  - Assign each sample to the factor (class) with the higher encoding value.
  - Accuracy: 0 to 1 errors on the training data set
[Figure: two factors H1 and H2 connected through W to the genes g1, g2, g3, g4, …, g7,129.]
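NMF (3) Learning Procedure
Input: gene expression data matrix G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: number of genes, m: data size (number of samples), k: number of latent variables

Objective function:

$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$$

subject to

$$W_{ia} \ge 0, \qquad \sum_{i} W_{ia} = 1 \;\; \forall a, \qquad H_{a\mu} \ge 0$$

Procedure
1. Initialize W, H with random numbers.
2. Update W, H iteratively until max_iteration is reached or some criterion is met:

$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}$$

$$H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$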
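The learning procedure can be sketched in NumPy as follows. This is a minimal illustration, not the author's code: it assumes the multiplicative updates of Lee and Seung for the log-likelihood objective F = Σ [G log(WH) − WH], with the columns of W normalized to sum to 1, and G laid out as genes × samples as in the NMF (2) figure. The names `nmf_fit`, `n_iter`, `seed` and `eps` are hypothetical.

```python
import numpy as np


def nmf_fit(G, k, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative n x m matrix G into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    # Step 1: random non-negative initialization, columns of W normalized
    W = rng.random((n, k)) + eps
    W /= W.sum(axis=0, keepdims=True)
    H = rng.random((k, m)) + eps
    history = []
    # Step 2: multiplicative updates for a fixed number of iterations
    for _ in range(n_iter):
        ratio = G / (W @ H + eps)
        W *= ratio @ H.T                            # W_ia <- W_ia * sum_mu G/(WH) * H_amu
        W /= W.sum(axis=0, keepdims=True) + eps     # W_ia <- W_ia / sum_j W_ja
        ratio = G / (W @ H + eps)
        H *= W.T @ ratio                            # H_amu <- H_amu * sum_i W_ia * G/(WH)
        WH = W @ H + eps
        history.append(float(np.sum(G * np.log(WH) - WH)))   # objective F
    return W, H, history


# Usage on synthetic data (an assumption; the leukemia data set is not reproduced here)
rng = np.random.default_rng(1)
G = rng.random((20, 2)) @ rng.random((2, 8))
W, H, history = nmf_fit(G, k=2)
sample_class = H.argmax(axis=0)   # assign each sample to the factor with the higher encoding
```

`history` holds the objective after each iteration, which is the quantity a learning curve would plot. The small constant `eps` only guards against division by zero and is not part of the update rules.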
NMF (4) Learning Curve
[Figure: learning curve. x-axis: number of iterations (1 to 101); y-axis: log likelihood.]
NMF (5) Clustering Result
[Figure: scatter plot of the clustering result, with samples labeled ALL or AML.]
NMF (6) Diagnosis
• For each test sample g, estimate the encoding vector h that best approximates the sample.
  - W is the basis matrix computed during training (fixed).
  - As in training, assign each sample to the factor (class) which has the highest encoding value.
• Accuracy: 1 to 2 errors on the test data set
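Estimating h for a new sample with W held fixed can be sketched as below. This is an illustration under assumptions rather than the method actually used: it reuses only the multiplicative H update, assumes the columns of W sum to 1, and uses a tiny hand-made W instead of the trained 7,129 × 2 basis. The name `encode_sample` is hypothetical.

```python
import numpy as np


def encode_sample(W, g, n_iter=200, eps=1e-12):
    """Estimate a non-negative encoding h with g ≈ W h, keeping W fixed."""
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        # Same multiplicative rule as the H update in training, applied to one sample
        h *= W.T @ (g / (W @ h + eps))
    return h


# Toy basis with 4 "genes" and 2 factors; each column sums to 1 (assumption)
W = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])
g = W @ np.array([10.0, 1.0])          # a sample dominated by the first factor
h = encode_sample(W, g)
predicted_class = int(np.argmax(h))    # factor with the highest encoding value
```

The predicted class is the index of the largest entry of h, mirroring the assignment rule used for the training samples.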
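[Figure: a test sample g (7,129 genes) is approximated as W (7,129 genes × 2 factors) times an unknown encoding h(?) (2 factors); the latent variables h1 and h2 connect through W to g1, g2, g3, g4, …, g7,129.]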
Generative Topographic Mapping
Generative Topographic Mapping (1)
• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Figure: a grid of points in the latent space (axes x1, x2) is mapped by y(x;W) into the data space (axes t1, t2, t3).]
GTM (2) Learning Algorithm (EM)
• Generate the grid of latent points.
• Generate the grid of basis function centers.
• Compute the matrix of basis function activations $\Phi$.
• Initialize the weights W in $Y = \Phi W$ and the inverse noise variance $\beta$.
• Compute $\Delta_{n,k} = \|t_n - \Phi_k W\|^2 = \|t_n - y_k(x,W)\|^2$ for each n, k.
• Repeat
  - Compute the responsibility matrix R using $\Delta$ and $\beta$. [E-step]
  - Compute G from R (G is diagonal, with $G_{kk} = \sum_n R_{kn}$).
  - Update W by solving $\Phi^T G \Phi W = \Phi^T R T$. [M-step]
  - Compute $\Delta = \|t - \Phi W\|^2$.
  - Update $\beta$.
• Until convergence
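The loop above can be sketched in NumPy. This is a minimal illustration under assumptions, not a reference implementation: it assumes a 2-D latent grid on [-1, 1]², Gaussian basis functions plus a bias column, random initialization of W, and the standard GTM update of the inverse variance, 1/β = (1/ND) Σ R_kn Δ_kn. The names `gtm_fit`, `responsibilities`, `grid_side`, `rbf_side` and `sigma` are hypothetical.

```python
import numpy as np


def sq_dist(Y, T):
    """K x N matrix of squared distances between rows of Y and rows of T."""
    return ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)


def responsibilities(Delta, beta):
    """E-step: R[k, n] = posterior probability of latent point k given data point n."""
    logR = -0.5 * beta * Delta
    logR -= logR.max(axis=0, keepdims=True)     # stabilize the exponentials
    R = np.exp(logR)
    return R / R.sum(axis=0, keepdims=True)


def gtm_fit(T, grid_side=6, rbf_side=3, sigma=1.0, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    N, D = T.shape
    axis = np.linspace(-1.0, 1.0, grid_side)
    X = np.array([(a, b) for a in axis for b in axis])       # K latent grid points
    caxis = np.linspace(-1.0, 1.0, rbf_side)
    C = np.array([(a, b) for a in caxis for b in caxis])     # basis function centers
    Phi = np.hstack([np.exp(-sq_dist(X, C) / (2.0 * sigma ** 2)),
                     np.ones((len(X), 1))])                  # activations plus bias
    W = rng.normal(scale=0.1, size=(Phi.shape[1], D))
    W[-1] = T.mean(axis=0)                                   # bias row starts at the data mean
    beta = 1.0 / T.var()
    Delta = sq_dist(Phi @ W, T)
    for _ in range(n_iter):
        R = responsibilities(Delta, beta)                    # E-step
        G = np.diag(R.sum(axis=1))
        # M-step: solve Phi^T G Phi W = Phi^T R T for W
        W = np.linalg.lstsq(Phi.T @ G @ Phi, Phi.T @ R @ T, rcond=None)[0]
        Delta = sq_dist(Phi @ W, T)
        beta = N * D / np.sum(R * Delta)
    R = responsibilities(Delta, beta)
    return X, W, beta, R


# Usage on synthetic 3-D data (an assumption; the gene data are not reproduced here)
T = np.random.default_rng(2).normal(size=(40, 3))
X, W, beta, R = gtm_fit(T)
posterior_mean = R.T @ X                 # one 2-D latent position per data point
posterior_mode = X[R.argmax(axis=0)]     # most responsible grid point per data point
```

`posterior_mean` and `posterior_mode` are the two summaries usually plotted in the latent space for visualization. A fixed iteration count stands in for the convergence test.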
GTM (3) Visualization
• Posterior distribution in latent space given a data point t:
  X(t) ~
• For a whole set of data: for each t, plot in the latent space
  - Posterior mode:
  - Posterior mean:
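For reference, the posterior and its two summaries can be written as follows. These are the standard GTM expressions for a uniform prior over K latent grid points $x_k$, given here as an assumption about what the slide intended, with $\beta$ the inverse noise variance:

```latex
p(x_k \mid t_n) = R_{kn}
  = \frac{\exp\!\left(-\tfrac{\beta}{2}\,\|t_n - y(x_k; W)\|^2\right)}
         {\sum_{k'=1}^{K} \exp\!\left(-\tfrac{\beta}{2}\,\|t_n - y(x_{k'}; W)\|^2\right)}

x_n^{\text{mode}} = \arg\max_{x_k} \; p(x_k \mid t_n)

x_n^{\text{mean}} = \sum_{k=1}^{K} x_k \, p(x_k \mid t_n)
```

The mode always lies on a grid point, while the mean can fall between grid points, which tends to give smoother plots when the posterior is not sharply peaked.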
GTM (4) Clustering Experiment
• Gene Selection
  - Select about 50 genes out of 7,129 based on the three test scores of cancer diagnosis.
    - Correlation metric (similar to a t-test)
    - Wilcoxon test scores (a nonparametric t-test)
    - Median test scores (a nonparametric t-test)
• Clustering & Visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.
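One way to score genes for selection is sketched below. It assumes that the correlation metric is the signal-to-noise statistic P = (μ_ALL − μ_AML) / (σ_ALL + σ_AML) from the leukemia study of Golub et al.; the slides do not define it, so this is an assumption. The function name `p_metric` and the toy matrix are hypothetical.

```python
import numpy as np


def p_metric(expr, is_all, eps=1e-12):
    """Per-gene score (mean_ALL - mean_AML) / (std_ALL + std_AML).

    expr:   genes x samples expression matrix
    is_all: boolean mask over samples, True for ALL and False for AML
    """
    a = expr[:, is_all]
    b = expr[:, ~is_all]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1) + b.std(axis=1) + eps)


# Toy data: 3 genes, 2 ALL samples followed by 2 AML samples
expr = np.array([[5.0, 6.0, 1.0, 2.0],
                 [1.0, 2.0, 5.0, 6.0],
                 [3.0, 3.0, 3.0, 4.0]])
is_all = np.array([True, True, False, False])
scores = p_metric(expr, is_all)

order = np.argsort(scores)
top_all = order[::-1][:1]   # genes most highly expressed in ALL (large positive score)
top_aml = order[:1]         # genes most highly expressed in AML (large negative score)
```

Taking the top genes from both ends of the ranking gives a balanced set of ALL and AML markers. The Wilcoxon and median tests listed above would replace `p_metric` with a rank-based score.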
List of Genes Selected
GTM (5) Learning Curve
GTM (6) Clustering Result
Genes with high expression levels in ALL (large positive P-metric value)
Genes with high expression levels in AML (large negative P-metric value)
Summary
• Challenges of Machine Learning Applied to Biosciences
  - Huge data size
  - Noise and data sparseness
  - Unlabeled and imbalanced data
• Biosciences for Machine Learning
  - New application
  - Biosystems are existence proofs for ideal AI systems
  - Provides interesting metaphors and algorithms!
  - Synergy effects between biosciences and artificial intelligence