Jan 21, 2016
1
Neighboring Feature Clustering
Authors: Z. Wang, W. Zheng, Y. Wang,
J. Ford, F. Makedon, J. Pearlman
Presenter: Prof. Fillia Makedon
Dartmouth College
2
What is Neighboring Feature Clustering
Given an m × n matrix M, where m is the number of samples and n is the number of (ordered) features, the goal is to find an intrinsic partition of the features, based on their characteristics, such that each cluster is a contiguous run of features.
We assume there is a natural ordering of the features that is relevant to the problem being solved.
– E.g., in spectral datasets, such characteristics could be correlations.
– For example, if we decide that features 1 and 10 belong to a cluster, features 2 to 9 must also belong to that cluster.
– Zhifeng: please improve this slide and provide an intuitive diagram.
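The contiguity requirement can be made concrete with a small sketch. It assumes a clustering is encoded as one label per ordered feature; the function name `is_neighboring_clustering` and this encoding are hypothetical, chosen only for illustration.

```python
def is_neighboring_clustering(labels):
    """Return True if every cluster occupies one contiguous run of features.

    `labels[k]` is the (hypothetical) cluster id of the k-th ordered feature.
    """
    seen = set()
    previous = None
    for label in labels:
        if label != previous:
            # A new run starts; its label must not have appeared earlier.
            if label in seen:
                return False
            seen.add(label)
            previous = label
    return True


# Features 1..10 in one cluster, 11..12 in another: valid.
valid = [0] * 10 + [1] * 2
# Features 1 and 10 share a cluster but 2..9 do not: invalid.
invalid = [0] + [1] * 8 + [0] + [2] * 2
```

Under this encoding, `valid` passes the check and `invalid` fails it, which mirrors the "features 1 and 10" example.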
3
MR spectral features and DNA Copy Number???
MR spectral features are highly redundant, suggesting that the data lie in some low-dimensional space. (Zhifeng: what do you mean by low-dimensional space? Clarify.)
Neighboring spectral features of MR spectra are highly correlated.
Using NFC, we can partition the features into clusters. A cluster can be represented by a single feature, hence reducing the dimensionality.
This idea can be applied to DNA copy number analysis too.
Zhifeng: Yuhang said these two are not related! Please explain how they are related.
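As a rough illustration of how a clustering reduces dimensionality, the sketch below collapses each contiguous cluster to one column. Using the cluster mean as the representative is an assumption of this sketch (the slide only says a cluster can be represented by a single feature), and `reduce_by_clusters` and its `boundaries` argument are hypothetical names.

```python
import numpy as np


def reduce_by_clusters(M, boundaries):
    """Collapse each contiguous feature cluster to one representative column.

    M          : (m, n) array, m samples by n ordered features.
    boundaries : sorted start indices of the clusters, beginning with 0
                 (a hypothetical encoding chosen for this sketch).
    The representative is the cluster mean, which is an assumption.
    """
    M = np.asarray(M, dtype=float)
    edges = list(boundaries) + [M.shape[1]]
    columns = [M[:, a:b].mean(axis=1) for a, b in zip(edges[:-1], edges[1:])]
    return np.column_stack(columns)


M = np.array([[1.0, 1.2, 0.8, 5.0, 5.2],
              [2.0, 2.2, 1.8, 3.0, 3.2]])
reduced = reduce_by_clusters(M, boundaries=[0, 3])  # clusters: 0..2 and 3..4
```

Here five features become two, one per cluster.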
4
Use MDL method to solve NFC
Reduce NFC to a one-dimensional piece-wise linear approximation problem.
Given a sequence of n one-dimensional points <x1, ..., xn>, find the optimal step-function-like line segments that can be fitted to the points.
Fig. 1. Piecewise linear approximation [3] [4] is usually 2D. Here we use its concept for a 1D situation.
We use the minimum description length (MDL) method [2] to solve this reduced problem.
Zhifeng: define and explain MDL.
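To make the step-function fit concrete, here is a minimal sketch for a fixed set of breakpoints. It uses the ordinary least-squares choice (each segment approximated by its own mean); the function name `step_fit` and the sample values are made up for illustration.

```python
import numpy as np


def step_fit(x, breakpoints):
    """Fit a step function to 1D points for a fixed set of breakpoints.

    x           : sequence of n values <x1, ..., xn>.
    breakpoints : sorted start indices of the segments, beginning with 0.
    Each segment is approximated by its mean (the least-squares constant).
    Returns the fitted values and the total squared error.
    """
    x = np.asarray(x, dtype=float)
    edges = list(breakpoints) + [len(x)]
    fitted = np.empty_like(x)
    for a, b in zip(edges[:-1], edges[1:]):
        fitted[a:b] = x[a:b].mean()
    return fitted, float(((x - fitted) ** 2).sum())


x = [0.1, 0.0, 0.2, 0.9, 1.0, 1.1]
_, err_one = step_fit(x, [0])      # one segment
_, err_two = step_fit(x, [0, 3])   # two segments
```

Adding a segment lowers the error but makes the model more complex, which is the trade-off MDL is meant to settle.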
5
Minimum Description Length (MDL)
Zhifeng: please provide a slide to define this. Explain how the transformation is done (as in [1]) to give the 1D piece-wise linear approximation.
Represent all the points by two line segments.
There is a trade-off between approximation accuracy and the number of line segments. A compromise can be made using MDL.
Zhifeng: this is all very cryptic; pieces of the explanation are missing!
6
Outline
The problem
– Spectral data
– The abstract problem
Related work
– HMM based, partial sum based, maximum likelihood based
Our approach
– Problem reduced to 1D linear approximation
– MDL approach
7
Reducing NFC to 1D Piece-Wise Linear Approximation Problem 1
Let C denote the correlation coefficient matrix of M. Let C∗ be the strictly upper triangular matrix derived from 1−|C| (entries near 0 imply high correlation between the corresponding two features).
For features i to j (1 ≤ i ≤ j ≤ n), the submatrix C∗(i:j, i:j) describes their pairwise correlations. We use its entries above the diagonal as the points to be explained by one line segment in the 1D piece-wise linear approximation problem.
The objective is to find the optimal piece-wise line segments to fit the points created this way.
Points near 0 mean high correlation. Because every cluster must consist of highly correlated features, the points are always approximated by 0 rather than by their own mean.
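The construction of C∗ and of the points for a candidate cluster can be sketched with NumPy. This is a minimal illustration under stated assumptions: indices are 0-based here (the slides use 1-based), the helper names `dissimilarity_upper` and `segment_points` are hypothetical, and the data are simulated.

```python
import numpy as np


def dissimilarity_upper(M):
    """Return C*, the strictly upper triangular part of 1 - |C|.

    M is an (m, n) array of m samples by n ordered features, and C is the
    n x n correlation coefficient matrix of its columns.
    """
    C = np.corrcoef(np.asarray(M, dtype=float), rowvar=False)
    return np.triu(1.0 - np.abs(C), k=1)


def segment_points(C_star, i, j):
    """Points for features i..j (0-based, inclusive): the entries of the
    submatrix of C* above the diagonal."""
    sub = C_star[i:j + 1, i:j + 1]
    rows, cols = np.triu_indices(sub.shape[0], k=1)
    return sub[rows, cols]


rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
# Features 0..2 are noisy copies of one signal; feature 3 is independent.
M = np.hstack([base + 0.05 * rng.normal(size=(50, 3)),
               rng.normal(size=(50, 1))])
C_star = dissimilarity_upper(M)
pts = segment_points(C_star, 0, 2)  # 3 pairwise points, all close to 0
```

A cluster of k features yields k(k-1)/2 points, so the point set changes with the candidate boundaries.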
8
Example
For example, suppose we have a set of points that all lie around 0.3.
In ordinary piece-wise linear approximation, it is better to use 0.3 as the approximation.
In NFC, however, we must penalize points that stray from 0, so we still use 0 as the approximation.
Unlike the usual 1D piece-wise linear approximation problem, the reduced problem has dynamic points: the points for a segment are the entries of its submatrix, so they are created on the fly.
Zhifeng: provide a figure to illustrate the above example.
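The difference between the two approximations can be checked numerically. The values below are made up to lie near 0.3, and squared error is used as the penalty only for illustration.

```python
import numpy as np

points = np.array([0.28, 0.31, 0.30, 0.33, 0.29])  # made-up values near 0.3

# Ordinary piece-wise approximation: the best constant is the mean.
sse_mean = float(((points - points.mean()) ** 2).sum())

# NFC: the approximating value is fixed at 0, so the offset is penalized.
sse_zero = float((points ** 2).sum())
```

Fitting to the mean leaves almost no error, while fitting to 0 charges the whole offset, which is what discourages weakly correlated features from sharing a cluster.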
9
Spectral data
MR spectral data
– High-dimensional data points
– Spectral features are highly redundant (high correlation)
– Find neighboring features with high correlation in a spectral dataset, such as an MR spectral dataset
Fig. 1 High-dimensional data points (axes: frequency, intensity)
Fig. 2 Correlation coefficient matrix. Both axes index the features (dimensions).
10
Problem
Finding a low-dimensional space
– Zhifeng: define low-dimensional space
– Curse of dimensionality
We extract an abstract problem: Neighboring Feature Clustering (NFC)
– Features are ordered. Each cluster contains only neighboring features.
– Find an optimal clustering according to certain criteria
11
Another application (with variation)
Array Comparative Genomic Hybridization (aCGH) is used to detect copy number alterations.
aCGH data are noisy
– Smoothing
– Segmentation
Fig. 3 aCGH technology
Fig. 4 aCGH data (smoothed). The X axis is log ratio.
Fig. 5 aCGH data (segmented). The X axis is log ratio.
12
Related work
An algorithm that tries to solve a similar problem
– Baumgartner et al., “Unsupervised feature dimension reduction for classification of MR spectra”, Magnetic Resonance Imaging, 22:251-256, 2004
An extensive literature on the reduced problem
– Teh et al., “On the detection of dominant points on digital curves”, IEEE PAMI, 11(8):859-872, 1989
– Statistical methods…
Fig. 6 1D piece-wise approximation
13
Related work: statistical methods
HMM based
– Fridlyand et al., “Hidden Markov models approach to the analysis of array CGH data”, J. Multivariate Anal., 90, 132-153
Partial sum based
– Lipson et al., “Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis”, RECOMB 2005
Maximum likelihood based
– Picard et al., “A statistical approach for array CGH data analysis”, BMC Bioinformatics, 6:27, 2005
14
Framework of the method proposed
1. Correlation coefficient matrix
$$C = \begin{bmatrix} C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\ & C_{2,2} & \cdots & C_{2,n} \\ & & \ddots & \vdots \\ & & & C_{n,n} \end{bmatrix}$$
2. For each pair of features
3. MDL code length (revised)
4. Code length matrix (drawn as a graph over nodes 1, 2, 3, ..., n-1, n whose edges carry entries such as C1,2, C2,3, ..., Cn-1,n)
5. Shortest path (dynamic programming)
[Figure panels: two spectrum plots with axes frequency and intensity]
Fig. 7 our approach
15
Minimum Description Length
Information Criterion
– A model selection scheme
– Common information criteria are the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Minimum Description Length (MDL)
– MDL encodes the model and the data given the model. The balance between the two is measured in code length:
$$C(D, M) = -\log p(D, M) = -\log p(D \mid M) - \log p(M)$$
Fig. 6 1D piece-wise approximation
16
Encoding model and data given model
For each pair of features (n(n-1)/2 in total)
– Encoding the model: cluster boundary, Gaussian parameter (standard deviation)
– Encoding the data given the model
Fig. 8 Encoding the data given the model for each feature pair
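One way the two encoding steps could be turned into a number is sketched below. The slides do not give the revised code length formula, so every cost here is an assumption: log2(n) bits for the cluster boundary, 0.5·log2(N) bits for the standard deviation, and the negative log-likelihood of the points under a zero-mean Gaussian. The function name `segment_code_length` and the `min_sigma` floor are hypothetical.

```python
import math


def segment_code_length(points, n_features, min_sigma=1e-3):
    """Two-part code length (in bits) for one candidate cluster.

    points     : the 1 - |correlation| values of the cluster's feature pairs.
    n_features : total number of features n (used to price the boundary).
    Assumed costs, not taken from the slides:
      model = log2(n) bits for the boundary + 0.5 * log2(N) bits for sigma,
      data  = negative log2-likelihood under a zero-mean Gaussian.
    The data term uses a density, so it is only meaningful for comparing
    candidates and may be negative.
    """
    N = len(points)
    model_bits = math.log2(n_features)
    if N == 0:
        return model_bits  # a single-feature cluster has no pairs to encode
    sum_sq = sum(p * p for p in points)
    # Zero-mean fit: sigma^2 is the mean squared distance from 0.
    sigma2 = max(sum_sq / N, min_sigma ** 2)
    model_bits += 0.5 * math.log2(N)
    data_nats = 0.5 * N * math.log(2 * math.pi * sigma2) + sum_sq / (2 * sigma2)
    return model_bits + data_nats / math.log(2)


tight = segment_code_length([0.01, 0.02, 0.015], n_features=10)
loose = segment_code_length([0.4, 0.6, 0.5], n_features=10)
```

Because the Gaussian is centered at 0, points far from 0 inflate the fitted standard deviation and lengthen the code, so the highly correlated cluster (`tight`) is cheaper than the weakly correlated one (`loose`).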
17
Minimize the code length
Code length matrix
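Shortest path
– Recursive function
– Dynamic programming
$$C = \begin{bmatrix} C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\ & C_{2,2} & \cdots & C_{2,n} \\ & & \ddots & \vdots \\ & & & C_{n,n} \end{bmatrix}$$
[Figure: graph over nodes 1, 2, 3, ..., n-1, n with edges labeled by entries of C, such as C1,2, C2,3, C1,n-1, C2,n-1, C3,n-1, C2,n, C3,n, Cn-1,n]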
Fig. 9 alternative representation of matrix C
Fig. 10 Recursive function for the shortest path
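The shortest-path step can be sketched as a standard dynamic program over prefixes of the ordered features. This is not necessarily the recursion shown in Fig. 10; it is one common formulation, with 0-based indices, a hypothetical function name `best_partition`, and a toy cost function standing in for the code length matrix.

```python
def best_partition(code_length, n):
    """Split features 0..n-1 into contiguous clusters of minimum total cost.

    code_length(i, j) returns the cost of one cluster covering features
    i..j (0-based, inclusive). Recurrence over prefixes:
        best[j] = min over i < j of best[i] + code_length(i, j - 1)
    where best[j] is the cheapest partition of the first j features.
    """
    best = [0.0] + [float("inf")] * n
    start = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + code_length(i, j - 1)
            if cost < best[j]:
                best[j], start[j] = cost, i
    # Walk back through the stored starts to recover the clusters.
    clusters, j = [], n
    while j > 0:
        clusters.append((start[j], j - 1))
        j = start[j]
    return best[n], clusters[::-1]


# Toy cost: 1 per cluster, plus 10 if the cluster straddles features 2 and 3.
def toy_cost(i, j):
    return 1.0 + (10.0 if i <= 2 < j else 0.0)


total, clusters = best_partition(toy_cost, 6)
```

With the toy cost, the cheapest partition cuts exactly between features 2 and 3. The double loop evaluates each candidate cluster once, so the run time is quadratic in n given the code length matrix.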
18
Results
We test on simulated data.
Fig. 11 the revised correlation matrix and the computed code length matrix