1 Neighboring Feature Clustering Author: Z. Wang, W. Zheng, Y. Wang, J. Ford, F. Makedon, J. Pearlman Presenter: Prof. Fillia Makedon Dartmouth College.

1

Neighboring Feature Clustering

Authors: Z. Wang, W. Zheng, Y. Wang, J. Ford, F. Makedon, J. Pearlman

Presenter: Prof. Fillia Makedon

Dartmouth College

2

What is Neighboring Feature Clustering

Given an m × n matrix M, where m is the number of samples and n is the number of (ordered) features, the goal is to find an intrinsic partition of the features based on their characteristics such that each cluster is a contiguous run of features.

We assume there is a natural ordering of features that has relevance to the problem being solved

– E.g., in spectral datasets, such characteristics could be correlations

– For example, if we decide features 1 and 10 belong to a cluster, features 2 to 9 should also belong to that cluster.

– ZHIFENG: PLEASE IMPROVE THIS SLIDE, PROVIDE AN INTUITIVE DIAGRAM
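The neighboring constraint on this slide can be made concrete with a small sketch. It assumes a clustering is described either by the first feature index of each cluster or by one label per feature; the function names `clusters_from_starts` and `is_contiguous` are hypothetical and used only for illustration.

```python
from typing import List


def clusters_from_starts(n: int, starts: List[int]) -> List[range]:
    """Split features 1..n into contiguous clusters.

    `starts` holds the (1-based) first feature of each cluster, in
    increasing order; the first cluster must start at feature 1.
    """
    if not starts or starts[0] != 1 or sorted(set(starts)) != starts or starts[-1] > n:
        raise ValueError("starts must be increasing, begin at 1, and stay within 1..n")
    ends = starts[1:] + [n + 1]  # each cluster ends where the next one starts
    return [range(s, e) for s, e in zip(starts, ends)]


def is_contiguous(labels: List[int]) -> bool:
    """True if every cluster label occupies one unbroken run of neighboring features."""
    seen = set()
    prev = None
    for lab in labels:
        if lab != prev:
            if lab in seen:  # label reappears after a gap: not a neighboring cluster
                return False
            seen.add(lab)
            prev = lab
    return True


clusters = clusters_from_starts(10, [1, 4, 8])
print([list(c) for c in clusters])
print(is_contiguous([0, 0, 1, 1, 2]), is_contiguous([0, 1, 0]))
```

With the boundary representation, a clustering of n features is fully determined by its cluster start positions, which is what makes a search over boundaries possible.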

3

MR spectral features and DNA Copy Number???

MR spectral features are highly redundant, suggesting that the data lie in some low-dimensional space (ZHIFENG: WHAT DO YOU MEAN BY LOW DIMENSIONAL SPACE - CLARIFY)

Neighboring spectral features of MR spectra are highly correlated

Using NFC, we can partition the features into clusters. A cluster can be represented by a single feature, hence reducing the dimensionality. This idea can be applied to DNA copy number analysis too.
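As a minimal sketch of representing each cluster by a single feature, the snippet below keeps one column of M per cluster. Choosing the middle feature of each cluster is an assumption made here for illustration; the slides do not say which feature serves as the representative. The function name `reduce_by_representatives` is hypothetical.

```python
import numpy as np


def reduce_by_representatives(M, clusters):
    """Keep one representative feature per cluster.

    M        : (m, n) array, samples in rows, ordered features in columns.
    clusters : list of (first, last) feature indices, 1-based and inclusive.
    Returns the reduced matrix and the 1-based indices of the kept features.
    Assumption: the representative is the middle feature of each cluster.
    """
    M = np.asarray(M)
    reps = [(first + last) // 2 for first, last in clusters]  # 1-based
    cols = [r - 1 for r in reps]                              # 0-based columns
    return M[:, cols], reps


M = np.arange(12).reshape(2, 6)  # 2 samples, 6 ordered features
reduced, reps = reduce_by_representatives(M, [(1, 3), (4, 6)])
print(reps)
print(reduced)
```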

Zhifeng: Yuhang said these two are not related!! Please explain how these are related.

4

Use MDL method to solve NFC

Reduce NFC to a one-dimensional piece-wise linear approximation problem.

Given a sequence of n one-dimensional points <x1, ..., xn>, find the optimal step-function-like line segments that can be fitted to the points.

Fig. 1. Piecewise linear approximation [3] [4] is usually 2D. Here we use its concept for a 1D situation.

We use the minimum description length (MDL) method [2] to solve this reduced problem.

Zhifeng: define and explain MDL
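To illustrate what fitting step-function-like segments to a 1D sequence means, here is a minimal sketch. It assumes the segment boundaries are already given, that each segment is approximated by its mean, and that the fit is scored by squared error; none of these choices is stated on the slide, and `step_fit` is a hypothetical name.

```python
import numpy as np


def step_fit(x, starts):
    """Piecewise-constant fit of a 1D sequence.

    x      : sequence of n values.
    starts : 0-based start index of each segment, increasing, beginning at 0.
    Returns the fitted step function and its total squared error.
    """
    x = np.asarray(x, dtype=float)
    edges = list(starts) + [len(x)]
    fitted = np.empty_like(x)
    for s, e in zip(edges[:-1], edges[1:]):
        fitted[s:e] = x[s:e].mean()  # one horizontal segment per piece
    return fitted, float(np.sum((x - fitted) ** 2))


x = [1, 1, 1, 5, 5]
_, err_two = step_fit(x, [0, 3])  # two segments
_, err_one = step_fit(x, [0])     # a single segment
print(err_two, err_one)
```

More segments never increase the error, which is why a criterion that also charges for the number of segments is needed to pick a sensible fit.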

5

Minimum Description Length (MDL)

Zhifeng, please provide a slide to define this

EXPLAIN HOW THE TRANSFORMATION IS DONE (AS IN [1]) TO GIVE 1D piece-wise linear approximation.

Represent all the points by two line segments.

Trade-off between approximation accuracy and number of line segments.

A compromise can be made using MDL.

??? Zhifeng: it is all very cryptic, pieces of explanation are missing!

6

Outline

The problem

– Spectral data

– The abstract problem

Related work

– HMM based, partial sum based, maximum likelihood based

Our approach

– Problem reduced to 1D linear approximation

– MDL approach

7

Reducing NFC to 1D Piece-Wise Linear Approximation Problem 1

Let the correlation coefficient matrix of M be denoted as C. Let C∗ be the strictly upper triangular matrix derived from 1−|C| (entries near 0 imply high correlation between the corresponding two features).

For features from i to j (1 ≤ i ≤ j ≤ n), the submatrix C∗[i:j, i:j] depicts pairwise correlations. We use its entries (excluding the lower-triangular and diagonal entries) as the points to be explained by a line in the 1D piece-wise linear approximation problem.

The objective is to find the optimal piece-wise line segments to fit those created points.

Points near 0 mean high correlation. We need to force high correlations among a set. Thus the points are always approximated by 0.
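The construction of C∗ and of the points for a candidate cluster i..j can be sketched with NumPy as follows. This is an illustration under the definitions on this slide; the function names `revised_correlation` and `cluster_points` are hypothetical, and features are assumed to be stored as the columns of M.

```python
import numpy as np


def revised_correlation(M):
    """C* = strictly upper triangular part of 1 - |C|, with C the
    correlation coefficient matrix of the columns (features) of M."""
    C = np.corrcoef(M, rowvar=False)       # (n, n) feature correlations
    return np.triu(1.0 - np.abs(C), k=1)   # k=1 drops the diagonal as well


def cluster_points(C_star, i, j):
    """Points for the candidate cluster of features i..j (1-based, inclusive):
    the strictly upper triangular entries of C*[i:j, i:j]."""
    sub = C_star[i - 1:j, i - 1:j]
    rows, cols = np.triu_indices(j - i + 1, k=1)
    return sub[rows, cols]


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
noise = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
M = np.column_stack([x, 2 * x + 1, -x, noise])  # features 1-3 are perfectly correlated
C_star = revised_correlation(M)
print(cluster_points(C_star, 1, 3))  # all near 0: a good cluster
print(cluster_points(C_star, 1, 4))  # includes points far from 0
```

A cluster of k features contributes k(k−1)/2 points, so the set of points changes with every candidate cluster.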

8

example

For example, suppose we have a set with points all around 0.3. In piece-wise linear approximation, it is better to use 0.3 as the approximation.

However, in NFC, we should penalize the points that stray away from 0. So we still use 0 as the approximation.

Unlike the usual 1D piece-wise linear approximation problem, the reduced problem has dynamic points (because they are created on the fly).
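The example above can be checked numerically. The sketch below uses made-up points near 0.3 and assumes squared error as the measure of misfit, which the slide does not specify.

```python
import numpy as np

# Hypothetical points, all around 0.3 (values of 1 - |correlation|).
points = np.array([0.28, 0.31, 0.30, 0.33, 0.29])


def squared_error(points, level):
    """Total squared deviation of the points from a constant approximation."""
    return float(np.sum((points - level) ** 2))


err_best_fit = squared_error(points, points.mean())  # usual approximation: level near 0.3
err_zero = squared_error(points, 0.0)                # NFC: level fixed at 0

print(err_best_fit, err_zero)
```

Fixing the level at 0 makes the error grow with the distance of the points from 0, so a group of weakly correlated features is penalized even when its points agree closely with each other.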

Zhifeng: provide figure to illustrate above example

9

Spectral data

MR spectral data

– High dimensional data points

– Spectral features are highly redundant (high correlation)

– Find neighboring features with high correlation in a spectral dataset, such as an MR spectral dataset.

Fig. 1 high dimensional data points (axes: frequency, intensity)

Fig. 2 correlation coefficient matrix. Both axes index the features (dimensions).

10

Problem

Finding a low-dimensional space

– zhifeng: define low dimensional space

– Curse of dimensionality

We extract an abstract problem: Neighboring Feature Clustering (NFC)

– Features are ordered. Each cluster contains only neighboring features.

– Find an optimal clustering according to certain criteria

11

Another application (with variation)

Array Comparative Genomic Hybridization to detect copy number alterations.

aCGH data are noisy

– Smoothing

– Segmentation

Fig. 3 aCGH technology

Fig. 4 aCGH data (smoothed). The X axis is log ratio

Fig. 5 aCGH data (segmented). The X axis is log ratio

12

Related work

An algorithm trying to solve a similar problem

– Baumgartner, et al., “Unsupervised feature dimension reduction for classification of MR spectra”, Magnetic Resonance Imaging, 22:251-256, 2004

An extensive literature on the reduced problem

– Teh, et al., “On the detection of dominant points on digital curves”, IEEE PAMI, 11(8):859-872, 1989

– Statistical methods…

Fig. 6 1D piece-wise approximation

13

Related work: statistical methods

HMM based

– Fridlyand, et al., “Hidden Markov models approach to the analysis of array CGH data”, J. Multivariate Anal., 90, 132-153

Partial sum based

– Lipson, et al., “Efficient Calculation of Interval Scores for DNA copy Number Data analysis”, RECOMB 2005

Maximum likelihood based

– Picard, et al., “A statistical approach for array CGH data analysis”, BMC Bioinformatics, 6:27, 2005

14

Framework of the method proposed

1. Correlation coefficient matrix

$$
C = \begin{bmatrix}
C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\
        & C_{2,2} & \cdots & C_{2,n} \\
        &         & \ddots & \vdots  \\
        &         &        & C_{n,n}
\end{bmatrix}
$$

2. For each pair of features

3. MDL code length (revised)

4. Code length matrix

5. Shortest path (dynamic programming)

Fig. 7 our approach (panels: spectra plotted as intensity vs. frequency; the matrix C; a graph on nodes 1, 2, 3, …, n-1, n with edges labeled C1,2, C2,3, …, Cn-1,n, C1,n-1, C2,n-1, C3,n-1, C2,n, C3,n)

15

Minimum Description Length

Information Criterion

– A model selection scheme

– Common information criteria are Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Minimum Description Length (MDL)

– MDL is to encode the model and the data given the model. The balance is achieved in terms of code length

$$C(D, M) = -\log p(D, M) = -\log p(D \mid M) - \log p(M)$$

Fig. 6 1D piece-wise approximation

16

Encoding model and data given model

For each pair of features (n(n-1)/2 pairs in total)

– Encoding model: cluster boundary, Gaussian parameter (standard deviation)

– Encoding data given model

Fig. 8 encoding the data given model for each feature pair
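A hedged sketch of the two code-length terms for one candidate cluster follows. It assumes the points are encoded under a zero-mean Gaussian with standard deviation sigma (zero mean because the approximation is fixed at 0), and it charges log2(n) bits for the cluster boundary plus a fixed number of bits for sigma. These model costs, the parameter `sigma_bits`, and the function names are illustrative assumptions, not the encoding used by the authors. The data term is a density-based code length, so it omits the constant that depends on the precision of the data and can be negative for small sigma.

```python
import math

import numpy as np


def data_code_length_bits(points, sigma):
    """-log2 p(points | zero-mean Gaussian with std sigma).

    In nats: (k/2) * ln(2*pi*sigma^2) + sum(d^2) / (2*sigma^2).
    """
    points = np.asarray(points, dtype=float)
    k = points.size
    nats = 0.5 * k * math.log(2 * math.pi * sigma ** 2) + float(np.sum(points ** 2)) / (2 * sigma ** 2)
    return nats / math.log(2)


def cluster_code_length_bits(points, sigma, n_features, sigma_bits=16):
    """Model cost plus data cost for one cluster (illustrative model cost)."""
    model_bits = math.log2(n_features) + sigma_bits  # boundary position + sigma
    return model_bits + data_code_length_bits(points, sigma)


tight = [0.01, 0.02, 0.01]  # highly correlated features: points near 0
loose = [0.30, 0.35, 0.28]  # weakly correlated features: points far from 0
print(cluster_code_length_bits(tight, 0.1, n_features=100))
print(cluster_code_length_bits(loose, 0.1, n_features=100))
```

For the same sigma, points farther from 0 take more bits to encode, which is how the code length penalizes clusters of weakly correlated features.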

17

Minimize the code length

Code length matrix

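Shortest path

– Recursive function

– Dynamic programming

$$
C = \begin{bmatrix}
C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\
        & C_{2,2} & \cdots & C_{2,n} \\
        &         & \ddots & \vdots  \\
        &         &        & C_{n,n}
\end{bmatrix}
$$

[Graph: nodes 1, 2, 3, …, n-1, n; edges labeled C1,2, C2,3, …, Cn-1,n, C1,n-1, C2,n-1, C3,n-1, C2,n, C3,n]

Fig. 9 alternative representation of matrix C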

Fig. 10 Recursive function for the shortest path
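The shortest-path step can be sketched as a standard dynamic program over cluster boundaries. The recursion in Fig. 10 is not reproduced in this text, so the recurrence below is an assumption: best(k) = min over i < k of best(i) + L[i+1, k], where L[a, b] is the code length of a cluster covering features a..b. The function name `best_partition` and the example matrix are hypothetical.

```python
import numpy as np


def best_partition(code_len):
    """Minimum total code length partition into neighboring clusters.

    code_len[a, b] (0-based, a <= b) is the code length of one cluster
    covering features a+1 .. b+1. Returns (total, clusters) with clusters
    given as 1-based inclusive (first, last) pairs.
    """
    n = code_len.shape[0]
    best = [0.0] + [float("inf")] * n  # best[k]: cost of clustering features 1..k
    prev = [0] * (n + 1)               # prev[k]: where the last cluster starts (0-based)
    for k in range(1, n + 1):
        for i in range(k):
            cand = best[i] + code_len[i, k - 1]
            if cand < best[k]:
                best[k], prev[k] = cand, i
    clusters = []
    k = n
    while k > 0:                       # walk back along the shortest path
        clusters.append((prev[k] + 1, k))
        k = prev[k]
    return best[n], clusters[::-1]


# Hypothetical code length matrix for 4 features (upper triangle only).
L = np.zeros((4, 4))
L[0, 0] = L[1, 1] = L[2, 2] = L[3, 3] = 1.0
L[0, 1], L[2, 3] = 1.5, 1.2
L[0, 2], L[1, 2], L[1, 3], L[0, 3] = 5.0, 4.0, 6.0, 9.0
total, clusters = best_partition(L)
print(total, clusters)
```

Viewing features as nodes and candidate clusters as weighted edges, this is a shortest path from before feature 1 to after feature n, and it takes O(n^2) steps once the code length matrix is available.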

18

Results

We test on simulated data.

Fig. 11 the revised correlation matrix and the computed code length matrix
