
Machine Learning: An Introduction

Fu Chang

Institute of Information Science

Academia Sinica

2788-3799 ext. 1819

fchang@iis.sinica.edu.tw

Machine Learning as a Tool for Classifying Patterns

What is the difference between you and me?

Tentative answer 1:
– You are pretty, and I am ugly
– A vague answer, not very useful

Tentative answer 2:
– You have a tiny mouth, and I have a big one
– More useful, but what if we are viewed from the side?

In general, can we use a single feature difference to distinguish one pattern from another?

Old Philosophical Debates

What makes a cup a cup?

Philosophical views

– Plato: the ideal type

– Aristotle: the collection of all cups

– Wittgenstein: family resemblance

Machine Learning Viewpoint

Represent each object with a set of features:
– Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc.

Each pattern is taken as a conglomeration of sample points or feature vectors

Patterns as Conglomerations of Sample Points

[Figure: two types of sample points, labeled A and B]

Types of Separation

Left panel: positive separation between heterogeneous data points

Right panel: a margin between them

ML Viewpoint (Cnt’d)

Training phase:
– Want to learn pattern differences among conglomerations of labeled samples
– Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc.
– Have to estimate the parameters involved in the model

Testing phase:
– Have to classify test samples at acceptable accuracy rates

Models

Prototype classifiers

Neural networks

Support vector machines

Classification and regression tree

AdaBoost

Boltzmann-Gibbs models

Neural Networks

Back-Propagation Neural Networks

Layers:
– Input: number of nodes = dimension of the feature vector
– Output: number of nodes = number of class types
– Hidden: number of nodes > dimension of the feature vector

Direction of data migration:
– Training: backward propagation (of errors, to update the weights)
– Testing: forward propagation (of data)

Training problems:
– Overfitting
– Convergence

Illustration

Neural Networks

Forward propagation:

$$\text{actual output}_j = f\Big(w_{0j} + \sum_i w_{ij}\, f\big(w_{0i} + \sum_k w_{ki}\,\text{input}_k\big)\Big)$$

Error:

$$J(\mathbf{w}) = \sum_k \big(\text{desired output}_k - \text{actual output}_k\big)^2$$

Backward update of weights (gradient descent):

$$w_{ji} \leftarrow w_{ji} - \eta\,\frac{\partial J}{\partial w_{ji}}$$
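A minimal numerical sketch (an assumed setup, not the author's code) of the three steps above: the forward pass, the squared-error criterion J(w), and the gradient-descent weight update. The toy XOR data, network size, learning rate, and sigmoid activation are illustrative choices.

```python
# Minimal sketch: a one-hidden-layer back-propagation network trained by
# gradient descent on the squared error J(w).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (illustrative): 2-D inputs, one-hot targets for two class types (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

n_hidden = 4                                    # hidden layer wider than the input
W1 = rng.normal(scale=0.5, size=(3, n_hidden))  # first row plays the role of the bias w_0
W2 = rng.normal(scale=0.5, size=(n_hidden + 1, 2))
eta = 0.5                                       # learning rate

for _ in range(10000):
    # Forward propagation
    X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias input
    H = sigmoid(X1 @ W1)                        # hidden activations
    H1 = np.hstack([np.ones((len(H), 1)), H])
    Y = sigmoid(H1 @ W2)                        # actual output

    # Error: J(w) = sum_k (desired output_k - actual output_k)^2
    E = T - Y

    # Backward update of weights: w_ji <- w_ji - eta * dJ/dw_ji
    dY = -2 * E * Y * (1 - Y)                   # dJ / d(output pre-activation)
    dH = (dY @ W2[1:].T) * H * (1 - H)          # error propagated back to the hidden layer
    W2 -= eta * H1.T @ dY
    W1 -= eta * X1.T @ dH

print(np.round(Y, 2))   # should approach the one-hot targets T
```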

Support Vector Machines (SVM)

SVM

Gives rise to the maximum-margin solution to the binary classification problem

Finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types

Things to tune:
– Kernel function: defines the similarity measure between two sample vectors
– Tolerance for misclassification
– Parameters associated with the kernel function
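A minimal sketch of these three tuning knobs using scikit-learn's SVC (an assumed library choice, not mentioned in the slides); the data set and the values of C and gamma are illustrative.

```python
# Minimal sketch: an RBF-kernel SVM with its main tuning parameters exposed.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf",   # similarity measure between two sample vectors
          C=1.0,          # tolerance for misclassification (soft margin)
          gamma=0.5)      # parameter associated with the RBF kernel
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```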

Illustration

Classification and Regression Tree (CART)

Illustration

Determination of Branch Points

At the root, a feature of each input sample is examined

For a given feature f, we want to determine a branch point b
– Samples whose f values fall below b are assigned to the left branch; otherwise they are assigned to the right branch

Determination of a branch point
– The impurity of a set of samples S is defined as

$$\text{impurity}(S) = -\sum_{C} p(C \mid S)\,\log p(C \mid S)$$

where $p(C \mid S)$ is the proportion of samples in S labeled as class type C

Branch Point (Cnt’d)

At a branch point b, the impurity reduction is then defined as

$$I(b) = \text{impurity}(\text{before the split}) - \big[\text{impurity}(\text{left branch}) + \text{impurity}(\text{right branch})\big]$$

The optimal branch point for the given feature f examined at this node is then set as

$$b(f) = \operatorname*{arg\,max}_{b}\, I(b)$$

To determine which feature type should be examined at the root, we compute b(f) for all possible feature types. We then take the feature type at the root as

$$f_{\text{root}} = \operatorname*{arg\,max}_{f}\, I(b(f))$$
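A minimal sketch (assumed data and helper names) of the rule above: the entropy impurity, the impurity reduction I(b), and the argmax over candidate branch points and features.

```python
# Minimal sketch: choose the branch point and feature that maximize I(b(f)).
import numpy as np
from collections import Counter

def impurity(labels):
    """impurity(S) = -sum_C p(C|S) log p(C|S)"""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def impurity_reduction(feature_values, labels, b):
    """I(b) = impurity(before the split) - [impurity(left) + impurity(right)]"""
    left = labels[feature_values < b]
    right = labels[feature_values >= b]
    return impurity(labels) - (impurity(left) + impurity(right))

def best_split(X, y):
    """Return (feature index, branch point) maximizing I(b(f)) over all features."""
    best = None
    for f in range(X.shape[1]):
        for b in np.unique(X[:, f]):
            gain = impurity_reduction(X[:, f], y, b)
            if best is None or gain > best[0]:
                best = (gain, f, b)
    return best[1], best[2]

# Toy usage
X = np.array([[2.0, 7.0], [3.0, 6.0], [8.0, 1.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # e.g. feature 0 with branch point 8.0
```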

AdaBoost

AdaBoost

Can be thought of as a linear combination of the same classifier c(·, ·) applied with varying weights

The idea:
– Iteratively apply the same classifier C to a set of samples
– At iteration m, the samples erroneously classified at the (m−1)st iteration are duplicated at a rate γm
– The weight βm is related to γm in a certain way

$$f(x) = \sum_{m=1}^{M} \beta_m\, c(x;\,\gamma_m)$$
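A minimal sketch using scikit-learn's AdaBoostClassifier (an assumed library choice): it builds exactly this kind of weighted sum of M copies of the same weak classifier, by default a depth-1 decision tree. The data set and M are illustrative.

```python
# Minimal sketch: AdaBoost as a weighted sum f(x) = sum_m beta_m c(x; gamma_m).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)   # M = 50 weighted terms
clf.fit(X, y)

# clf.estimator_weights_ holds the weight assigned to each copy of the weak
# classifier, i.e. the role of the beta_m above.
print("training accuracy:", clf.score(X, y))
print("first three weights:", clf.estimator_weights_[:3])
```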

Boltzmann-Gibbs Model

Boltzmann-Gibbs Density Function

Given:
– States s1, s2, …, sn
– Density p(s) = ps
– Features fi, i = 1, 2, …

Maximum entropy principle:
– Without any further information, one chooses the density ps to maximize the entropy

$$-\sum_{s} p_s \log p_s$$

subject to the constraints

$$\sum_{s} p_s f_i(s) = D_i, \quad i = 1, 2, \ldots$$

Boltzmann-Gibbs (Cnt’d)

Consider the Lagrangian

$$L = -\sum_{s} p_s \log p_s + \sum_{i} \lambda_i \Big(\sum_{s} p_s f_i(s) - D_i\Big) + \mu \Big(\sum_{s} p_s - 1\Big)$$

Taking partial derivatives of L with respect to ps and setting them to zero, we obtain the Boltzmann-Gibbs density functions

$$p_s = \frac{\exp\big(\sum_i \lambda_i f_i(s)\big)}{Z}$$

where Z is the normalizing factor
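A minimal sketch (assumed states, feature values, and multipliers) that evaluates the Boltzmann-Gibbs form above: exponentiate the weighted feature sum for each state and normalize by Z.

```python
# Minimal sketch: p_s = exp(sum_i lambda_i f_i(s)) / Z over a small state set.
import numpy as np

states = ["s1", "s2", "s3", "s4"]

# Two feature functions f_i(s), evaluated on every state (one row per feature).
F = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.2, 0.5, 0.8, 0.1]])
lam = np.array([0.7, -1.3])          # the Lagrange multipliers lambda_i

scores = lam @ F                      # sum_i lambda_i f_i(s) for each state s
Z = np.exp(scores).sum()              # normalizing factor
p = np.exp(scores) / Z                # Boltzmann-Gibbs density p_s

print(dict(zip(states, np.round(p, 3))), "sum =", p.sum())
```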

Exercise I

Derive the Boltzmann-Gibbs density functions from the Lagrangian, shown on the last viewgraph

Boltzmann-Gibbs (Cnt’d)

Maximum entropy (ME)
– Use Boltzmann-Gibbs as the prior distribution
– Compute the posterior for the given observed data and features fi
– Use the optimal posterior to classify

Bayesian Approach

Given:

– Training samples X = {x1, x2, …, xn}

– Probability density p(t|Θ)

– t is an arbitrary vector (a test sample)

– Θ is the set of parameters

– Θ is taken as a set of random variables

Bayesian Approach (Cnt’d)

Posterior density:

$$p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta, \quad \text{where} \quad p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}$$

Different class types give rise to different posteriors

Use the posteriors to evaluate the class type of a given test sample t
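A minimal sketch (an assumed one-parameter example, not from the slides): Θ is the unknown mean of a Gaussian with known variance, the posterior p(Θ|X) is computed on a grid by Bayes' rule, and p(t|X) is the integral above approximated by a grid sum.

```python
# Minimal sketch: grid-based Bayesian posterior and predictive density.
import numpy as np

X = np.array([1.8, 2.2, 2.5, 1.9])             # training samples x1..xn
theta = np.linspace(-5, 5, 1001)               # grid over the parameter Theta
d_theta = theta[1] - theta[0]

def gauss(x, mean, var=1.0):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

prior = gauss(theta, 0.0, var=4.0)             # p(Theta), an illustrative choice
likelihood = np.prod([gauss(x, theta) for x in X], axis=0)   # p(X|Theta)

# p(Theta|X) = p(X|Theta) p(Theta) / p(X)
posterior = likelihood * prior
posterior /= posterior.sum() * d_theta

# p(t|X) = integral of p(t|Theta) p(Theta|X) dTheta, approximated on the grid
t = 2.0
p_t_given_X = np.sum(gauss(t, theta) * posterior) * d_theta
print("p(t|X) at t = 2.0:", round(p_t_given_X, 4))
```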

Boltzmann-Gibbs (Cnt’d)

Maximum entropy Markov model (MEMM)
– The posterior consists of transition probability densities p(s | s′, X)

Conditional random field (CRF)
– The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)

References

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
