Introduction to Machine Learning

Jinhyuk Choi

Human-Computer Interaction Lab @ Information and Communications University

Contents

Concepts of Machine Learning

Multilayer Perceptrons

Decision Trees

Bayesian Networks

What is Machine Learning?

Large storage / large amount of data

Looks random but certain patterns

Web log data

Medical record

Network optimization

Bioinformatics

Machine vision

Speech recognition…

No complete identification of the process

A good or useful approximation

What is Machine Learning? Definition

Programming computers to optimize a performance criterion using example data or past experience

Role of Statistics

Inference from a sample

Role of Computer science

Efficient algorithms to solve the optimization problem

Representing and evaluating the model for inference

Descriptive (training) / predictive (generalization)

Learning from Human-generated data??

What is Machine Learning? Concept Learning

• Inducing general functions from specific training examples (positive or negative)

• Looking for the hypothesis that best fits the training examples

• Concepts:

- describing some subset of objects or events defined over a larger set

- a boolean-valued function

Objects: eyes, nose, legs, reproductive ability, inanimate things, …

Concept "Bird": wings, beak, feathers, …

Concept = boolean function: Bird(animal) → true or false

What is Machine Learning? Concept Learning

Inferring a boolean-valued function from training examples of its input and output

Positive examples

Negative examples

Hypothesis 1

Hypothesis 2

Concept


What is Machine Learning? Learning Problem Design

Do you enjoy sports?

Learn to predict the value of “EnjoySports” for an arbitrary day, based on the value of its other attributes

What problem?

Why learning?

Attribute selection

Effective?

Enough?

What learning algorithm?

Applications

Learning associations

Classification

Regression

Unsupervised learning

Reinforcement learning

Examples (1)

TV program preference inference based on web usage data

Web page #1

Web page #2

Web page #3

Web page #4

….

Classifier

TV Program #1

TV Program #2

TV Program #3

TV Program #4

….

[Diagram: web usage data → classifier → TV program preference, with steps 1, 2, and 3 marked]
What are we supposed to do at each step?

Examples (2): from a HW of Neural Networks Class (KAIST-2002)

Function approximation (Mexican hat)

$f(x_1, x_2) = \sin\!\left(2\pi\sqrt{x_1^2 + x_2^2}\right), \quad x_1, x_2 \in [-1, 1]$
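A minimal data-generation sketch in Octave/MATLAB for this task, assuming the target function as reconstructed above; the grid resolution is an arbitrary illustrative choice:

[x1, x2] = meshgrid(linspace(-1, 1, 21));   % 21x21 input grid on [-1,1]^2
r = sqrt(x1.^2 + x2.^2);                    % radial distance
t = sin(2*pi*r);                            % target outputs of the Mexican-hat surface
X = [x1(:) x2(:)];                          % 441x2 matrix of training inputs
T = t(:);                                   % 441x1 vector of training targets
mesh(x1, x2, t);                            % visualize the surface to approximate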

Examples (3): from a HW of Machine Learning Class (ICU-2006)

Face image classification

Examples (4): from a HW of Machine Learning Class (ICU-2006)

Examples (5): from a HW of Machine Learning Class (ICU-2006)

Sensay

Examples (6)

A. Krause et al., “Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable Computing”, ISWC 2005

#1. Multilayer Perceptrons

Neural Network?

VS. Adaline

MLP

SOM

Hopfield network

RBFN

Bifurcating neuron networks

Multilayer Networks of Sigmoid Units

• Supervised learning

• 2-layer

• Fully connected

Really looks like the brain??

Sigmoid Unit
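A sigmoid unit computes a weighted sum of its inputs and squashes it with the logistic function; its convenient derivative is what the back-propagation derivation below relies on:

$o = \sigma(net) = \frac{1}{1 + e^{-net}}, \qquad net = \sum_i w_i x_i, \qquad \frac{d\sigma}{d\,net} = \sigma(net)\,\bigl(1 - \sigma(net)\bigr)$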

The back-propagation algorithm

Network model

Input layer x_i, hidden layer y_j, output layer o_k, with weights v_ji (input to hidden) and w_kj (hidden to output):

$y_j = \sigma\Bigl(\sum_i v_{ji} x_i\Bigr), \qquad o_k = \sigma\Bigl(\sum_j w_{kj} y_j\Bigr)$

Error function:

$E(v, w) = \frac{1}{2} \sum_k (t_k - o_k)^2$

Stochastic gradient descent

Gradient-Descent Function Minimization

In order to find a vector parameter x that minimizes a function f(x):

Start with a random initial value x(0).

Determine the direction of steepest descent in the parameter space:

$\nabla f = \Bigl(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\Bigr)$

Move a step in that direction:

$x(i+1) = x(i) - \eta \nabla f$

Repeat the above two steps until there is no more change in x. (A small numeric sketch follows below.)
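A minimal numeric sketch of these steps in Octave/MATLAB, minimizing a simple quadratic; the learning rate and stopping tolerance are arbitrary illustrative choices:

f     = @(x) (x(1)-1)^2 + 2*(x(2)+3)^2;   % function to be minimized
gradf = @(x) [2*(x(1)-1); 4*(x(2)+3)];    % its gradient
x   = randn(2,1);                         % random initial value x(0)
eta = 0.1;                                % step size (learning rate)
for i = 1:1000
    step = eta * gradf(x);
    x = x - step;                         % move against the gradient
    if norm(step) < 1e-6, break; end      % stop when x no longer changes
end
disp(x)                                   % approaches the minimizer [1; -3]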

For gradient-descent to work…

The function to be minimized should be continuous.

The function should not have too many local minima.

Back-propagation

Derivation of the back-propagation algorithm

Adjustment of the hidden-to-output weights w_kj:

$\frac{\partial E}{\partial w_{kj}} = \frac{\partial}{\partial w_{kj}}\, \frac{1}{2} \sum_k (t_k - o_k)^2 = -(t_k - o_k)\, o_k (1 - o_k)\, y_j$

$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} = \eta\, \delta_k\, y_j, \qquad \delta_k = o_k (1 - o_k)(t_k - o_k)$

Adjustment of the input-to-hidden weights v_ji:

$\frac{\partial E}{\partial v_{ji}} = -\Bigl(\sum_k (t_k - o_k)\, o_k (1 - o_k)\, w_{kj}\Bigr)\, y_j (1 - y_j)\, x_i$

$\Delta v_{ji} = -\eta \frac{\partial E}{\partial v_{ji}} = \eta\, \delta_j\, x_i, \qquad \delta_j = y_j (1 - y_j) \sum_k w_{kj}\, \delta_k$

Backpropagation

Batch learning vs. Incremental learning

Batch standard backprop proceeds as follows:

Initialize the weights W.

Repeat the following steps:

Process all the training data DL to compute the gradient of the average error function AQ(DL, W).

Update the weights by subtracting the gradient times the learning rate.

Incremental standard backprop can be done as follows (see the sketch below):

Initialize the weights W.

Repeat the following steps for j = 1 to NL:

Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).

Update the weights by subtracting the gradient times the learning rate.
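A minimal Octave/MATLAB sketch of incremental (stochastic) backprop for the 2-layer sigmoid network above, using the weight-update rules derived earlier; the layer sizes, learning rate, epoch count, and toy XOR data are illustrative choices:

X = [0 0; 0 1; 1 0; 1 1];                    % four training inputs (rows)
T = [0; 1; 1; 0];                            % targets (XOR)
nIn = 2; nHid = 3; nOut = 1;
v = 0.5*randn(nHid, nIn+1);                  % input-to-hidden weights (last column = bias)
w = 0.5*randn(nOut, nHid+1);                 % hidden-to-output weights (last column = bias)
eta = 0.5;                                   % learning rate
sig = @(z) 1 ./ (1 + exp(-z));
for epoch = 1:10000
    for n = 1:size(X,1)                      % process one training case at a time
        x = [X(n,:)'; 1];                    % input plus bias term
        t = T(n,:)';
        y = [sig(v * x); 1];                 % hidden activations plus bias
        o = sig(w * y);                      % output activations
        dk = o .* (1 - o) .* (t - o);        % delta_k for the output units
        dj = y(1:nHid) .* (1 - y(1:nHid)) .* (w(:,1:nHid)' * dk);   % delta_j for hidden units
        w = w + eta * dk * y';               % Delta w_kj = eta * delta_k * y_j
        v = v + eta * dj * x';               % Delta v_ji = eta * delta_j * x_i
    end
end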

Training

Overfitting

#2. Decision Trees

Introduction

Divide & conquer

Hierarchical model

Sequence of recursive splits

Decision node vs. leaf node

Advantage

Interpretability

IF-THEN rules

Divide and Conquer

Internal decision nodes

Univariate: uses a single attribute, x_i

Numeric x_i: binary split: x_i > w_m

Discrete x_i: n-way split for its n possible values

Multivariate: uses all attributes, x

Leaves

Classification: class labels, or proportions

Regression: numeric; the average of the responses r, or a local fit

Learning

Construction of the tree using training examples

Looking for the simplest tree among the trees that code the training data without error; finding it is NP-complete, so the search is based on heuristics

"Greedy": find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)

Classification Trees

The split is the main procedure in tree construction, chosen by an impurity measure.

For node m, N_m instances reach m, and N_m^i of them belong to class C_i:

$\hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$

Node m is pure if p_m^i is 0 or 1.

The measure of impurity is entropy (the goal of a split is to make the children as pure as possible):

$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$

Representation

Each node specifies a test of some attribute of the instance

Each branch corresponds to one of the possible values for this attribute

Best Split

If node m is pure, generate a leaf and stop, otherwise split and continue recursively

Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to class C_i:

$\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$

Find the variable and split that minimize impurity (among all variables, and all split positions for numeric variables):

$I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$

Q) “Which attribute should be tested at the root of the tree?”

Top-Down Induction of Decision Trees

Entropy “Measure of uncertainty”

“Expected number of bits to resolve uncertainty”

Suppose Pr{X = 0} = 1/8

If other events are equally likely, the number of events is 8. To indicate one out of so many events, one needs lg 8 = 3 bits.

Consider a binary random variable X s.t. Pr{X = 0} = 0.1.

The expected number of bits:

$0.1 \lg\frac{1}{0.1} + 0.9 \lg\frac{1}{0.9}$

In general, if a random variable X has c values with probabilities p_1, …, p_c, the expected number of bits is:

$H = \sum_{i=1}^{c} p_i \lg\frac{1}{p_i} = -\sum_{i=1}^{c} p_i \lg p_i$

Entropy Example

14 examples

Entropy 0 : all members positive or negative

Entropy 1 : equal number of positive & negative

0 < Entropy < 1 : unequal number of positive & negative

$\mathrm{Entropy}([9+, 5-]) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

Information Gain

Measures the expected reduction in entropy caused by partitioning the examples

Information Gain
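For reference, in Mitchell's notation (Machine Learning, 1997), the information gain of an attribute A relative to a collection of examples S is:

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where S_v is the subset of S for which attribute A has value v.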

ICU-Student tree

[Tree diagram: the candidates are split on Gender at the root; Height and IQ are the other candidate attributes]

• # of samples = 100

• # of positive samples = 50

• Entropy = 1

Male Female

Left side:

• # of samples = 50

• # of positive samples = 40

• Entropy = 0.72

Right side:

• # of samples = 50

• # of positive samples = 10

• Entropy = 0.72

On average

• Entropy = 0.5 * 0.72 + 0.5 * 0.72 = 0.72

• Reduction in entropy = 0.28

Information gain
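A small Octave/MATLAB sketch reproducing the numbers on this slide (the binary-entropy helper takes the fraction of positive samples):

H = @(p) -p.*log2(p) - (1-p).*log2(1-p);     % binary entropy of a positive fraction p
Hroot  = H(50/100)                           % = 1 at the root
Hleft  = H(40/50)                            % = 0.72 on the Male branch
Hright = H(10/50)                            % = 0.72 on the Female branch
Havg   = (50/100)*Hleft + (50/100)*Hright    % weighted average entropy after the split
gain   = Hroot - Havg                        % = 0.28, the information gain of Gender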

Training Examples

Selecting the Next Attribute

Partially learned tree

Hypothesis Space Search

Hypothesis space: the set of

all possible decision trees

DT is guided by information

gain measure.

Occam’s razor ??

Overfitting

• Why “over”-fitting?

– A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well

Avoiding over-fitting the data

Two classes of approaches to avoid overfitting:

Stop growing the tree earlier.

Post-prune the tree after overfitting

Ok, but how to determine the optimal size of a tree?

Use validation examples to evaluate the effect of pruning (stopping)

Use a statistical test to estimate the effect of pruning (stopping)

Use a measure of complexity for encoding decision tree.

Approaches based on the first strategy

Reduced error pruning

Rule post-pruning

Rule Extraction from Trees

C4.5Rules (Quinlan, 1993)

#3. Bayesian Networks

Bayes’ Rule (Introduction)

$P(C \mid x) = \frac{p(x \mid C)\, P(C)}{p(x)}$

posterior = likelihood × prior / evidence

$P(C = 0 \mid x) + P(C = 1 \mid x) = 1$

$p(x) = p(x \mid C = 1)\, P(C = 1) + p(x \mid C = 0)\, P(C = 0)$

Bayes’ Rule: K > 2 Classes (Introduction)

$P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{p(x)} = \frac{p(x \mid C_i)\, P(C_i)}{\sum_{k=1}^{K} p(x \mid C_k)\, P(C_k)}$

$P(C_i) \ge 0 \quad \text{and} \quad \sum_{i=1}^{K} P(C_i) = 1$

Choose $C_i$ if $P(C_i \mid x) = \max_k P(C_k \mid x)$
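A minimal Octave/MATLAB sketch of this decision rule; the priors and the likelihood values p(x|C_k) evaluated at one observed x are made-up numbers purely for illustration:

prior = [0.5 0.3 0.2];                        % P(C_k), k = 1..3
lik   = [0.02 0.10 0.05];                     % p(x|C_k) at the observed x
post  = (lik .* prior) / sum(lik .* prior);   % P(C_k|x) by Bayes' rule
[~, khat] = max(post);                        % choose the class with the maximum posterior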

Bayesian Networks (Introduction)

Graphical models, probabilistic networks: causality and influence

Nodes are hypotheses (random variables), and the probability corresponds to our belief in the truth of the hypothesis

Arcs are direct influences between hypotheses

The structure is represented as a directed acyclic graph (DAG): a representation of the dependencies among the random variables

The parameters are the conditional probabilities on the arcs

A BN needs only a small set of probabilities, each relating a node to its neighbors, instead of probabilities for all possible combinations of circumstances

Bayesian Networks (Introduction)

Learning: inducing a graph

From prior knowledge

From structure learning

Estimating the parameters

EM

Inference: beliefs from evidence

Especially among the nodes not directly connected

Structure (Introduction)

Initial configuration of a BN:

Root nodes: prior probabilities

Non-root nodes: conditional probabilities given all possible combinations of direct predecessors

[Example DAG: A and B are parents of D, D is the parent of E, and A is the parent of C]

P(a), P(b)

P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b)

P(e|d), P(e|¬d)

P(c|a), P(c|¬a)

Causes and Bayes’ Rule (Introduction)

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?

$P(R \mid W) = \frac{P(W \mid R)\, P(R)}{P(W)} = \frac{P(W \mid R)\, P(R)}{P(W \mid R)\, P(R) + P(W \mid \lnot R)\, P(\lnot R)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75$

(causal direction: rain causes wet grass; diagnostic direction: reasoning from wet grass back to rain)

Causal vs Diagnostic Inference (Introduction)

Causal inference: If the sprinkler is on, what is the probability that the grass is wet?

P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)

= P(W|R,S) P(R) + P(W|~R,S) P(~R)

= 0.95*0.4 + 0.9*0.6 = 0.92

Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on?

P(S|W) = 0.35 > 0.2 = P(S)

P(S|R,W) = 0.21

Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.

Bayesian Networks: Causes (Introduction)

Causal inference:

P(W|C) = P(W|R,S) P(R,S|C) + P(W|~R,S) P(~R,S|C) + P(W|R,~S) P(R,~S|C) + P(W|~R,~S) P(~R,~S|C)

and use the fact that P(R,S|C) = P(R|C) P(S|C)

Diagnostic: P(C|W) = ?

Bayesian Nets: Local Structure (Introduction)

P (F | C) = ?

$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \mathrm{parents}(X_i))$

Bayesian Networks: Inference (Introduction)

P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)

P(C,F) = ∑S ∑R ∑W P(C,S,R,W,F)

P(F|C) = P(C,F) / P(C): not efficient! (a brute-force sketch of this enumeration follows below)

Belief propagation (Pearl, 1988)

Junction trees (Lauritzen and Spiegelhalter, 1988)

Independence assumptions
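A brute-force sketch (Octave/MATLAB) of the enumeration above: compute P(F|C) by summing the joint P(C,S,R,W,F) over S, R, W and dividing by P(C). All CPT numbers here are made-up placeholders, purely to show the mechanics:

pC    = 0.5;                                 % P(C = 1)
pS    = @(c) 0.1*c + 0.5*(1-c);              % P(S = 1 | C = c)
pR    = @(c) 0.8*c + 0.2*(1-c);              % P(R = 1 | C = c)
pWtab = [0.1 0.9; 0.9 0.99];                 % P(W = 1 | S, R); rows S = 0/1, cols R = 0/1
pW    = @(s,r) pWtab(s+1, r+1);
pF    = @(r) 0.7*r + 0.05*(1-r);             % P(F = 1 | R = r)
bern  = @(p,v) p*v + (1-p)*(1-v);            % P(X = v) when P(X = 1) = p
c = 1; f = 1;                                % query P(F = 1 | C = 1)
pCF = 0;                                     % accumulates P(C = 1, F = 1)
for s = 0:1
  for r = 0:1
    for w = 0:1
      % joint factored exactly as on the slide: P(C)P(S|C)P(R|C)P(W|R,S)P(F|R)
      pCF = pCF + bern(pC,c) * bern(pS(c),s) * bern(pR(c),r) ...
                * bern(pW(s,r),w) * bern(pF(r),f);
    end
  end
end
P_F_given_C = pCF / bern(pC,c)               % P(F|C) = P(C,F) / P(C)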

Inference: Evidence & Belief Propagation

Evidence – values of observed nodes

V3 = T, V6 = 3

Our belief in what the value of Vi 'should' be changes.

This belief is propagated.

As if the CPTs became:

V3=T: 1.0
V3=F: 0.0

P      V2=T   V2=F
V6=1   0.0    0.0
V6=2   0.0    0.0
V6=3   1.0    1.0

[Figure: the example network over nodes V1–V6, with evidence entered at V3 and V6]

Belief Propagation

Messages

Going down an arrow: sum out the parent (the “causal” π-message)

Going up an arrow: apply Bayes’ law (the “diagnostic” λ-message)

Bayes’ law:

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

with the factor 1/P(B) acting as a normalizing constant 1/α.

* some figures from: Peter Lucas BN lecture course

The Messages

• What are the messages?

• For simplicity, let the nodes be binary

[Figure: a two-node network V1 → V2]

P(V1):
V1=T  0.8
V1=F  0.2

P(V2|V1):
P      V1=T   V1=F
V2=T   0.4    0.9
V2=F   0.6    0.1

The message passes on information. What information? Observe:

P(V2) = P(V2|V1=T) P(V1=T) + P(V2|V1=F) P(V1=F)

The information needed is the CPT of V1, i.e. π(V1).

Messages capture information passed from parent to child.

The Messages

We know what the π-messages are.

What about the λ-messages?

[Figure: the two-node network V1 → V2 again]

Assume E = { V2 } and compute by Bayes’ rule:

$P(V_1 \mid V_2) = \frac{P(V_2 \mid V_1)\, P(V_1)}{P(V_2)} = \alpha\, P(V_1)\, P(V_2 \mid V_1)$

The information not available at V1 is P(V2|V1), to be passed upwards by a λ-message. Again, this is not in general exactly the CPT, but the belief based on evidence down the tree.

Belief Propagation

[Figure: a node V with parents U1, U2 and children V1, V2; π-messages π(U1), π(U2), π(V1), π(V2) flow down the arcs and λ-messages λ(U1), λ(U2), λ(V1), λ(V2) flow up them]

Evidence & Belief

[Figure: the example network over V1–V6 again; evidence entered at two of the nodes propagates to update the belief at the others]

Works for classification ??

Naive Bayes’ Classifier

Given C, xj are independent:

p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
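A minimal Octave/MATLAB sketch of a naive Bayes classifier. The slide only states the conditional-independence assumption; the Gaussian form chosen for each p(x_j|C) and the toy data below are illustrative assumptions:

Xtr = [1.0 2.1; 0.8 1.9; 1.2 2.3; 3.0 0.9; 3.2 1.1; 2.9 0.8];   % training inputs (rows)
ytr = [1; 1; 1; 2; 2; 2];                                       % class labels
K = 2; d = size(Xtr, 2);
prior = zeros(K,1); mu = zeros(K,d); sig2 = zeros(K,d);
for k = 1:K                                    % per-class, per-feature estimates
    Xk = Xtr(ytr == k, :);
    prior(k)  = size(Xk,1) / size(Xtr,1);      % P(C_k)
    mu(k,:)   = mean(Xk, 1);                   % mean of each feature given C_k
    sig2(k,:) = var(Xk, 1) + 1e-6;             % variance of each feature (with a small floor)
end
x = [1.1 2.0];                                 % a new case to classify
logpost = zeros(K,1);
for k = 1:K
    % log P(C_k) + sum_j log p(x_j|C_k), using independence of the x_j given the class
    logpx = -0.5*log(2*pi*sig2(k,:)) - (x - mu(k,:)).^2 ./ (2*sig2(k,:));
    logpost(k) = log(prior(k)) + sum(logpx);
end
[~, khat] = max(logpost)                       % predicted class for x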

Application Procedures for Classification

MLP

Data collection & Pre-processing (Training data / Test data)

Decision node selection (output node)

Network training

Generalization

Parameter tuning & Pruning

Final network

Decision Trees

Data collection & Pre-processing (Training data / Test data)

Decision attribute selection

Tree construction

Pruning

Final tree

Bayesian Networks

Data collection & Pre-processing (Training data / Test data)

Structure configuration

Prior knowledge

Parameter learning

Decision node selection

Inference (classification)

Evidence & belief

Final network

Simulation

Simulation Packages

WEKA (JAVA)

http://www.cs.waikato.ac.nz/ml/weka/

FullBNT (MATLAB)

http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html

MSBNx

http://research.microsoft.com/msbn/

MATLAB Neural Networks Toolbox

http://www.mathworks.com/products/neuralnet/

C4.5

http://www.rulequest.com/Personal/

WEKA

FullBNT

clear all

N = 4; % number of nodes
dag = zeros(N,N); % empty adjacency matrix for the network structure
C = 1; S = 2; R = 3; W = 4; % name each node
dag(C,[R S]) = 1; % specify the network structure
dag(R,W) = 1;
dag(S,W) = 1;
%discrete_nodes = 1:N;
node_sizes = 2*ones(1,N); % number of values each node can take
%node_sizes = [4 2 3 5];
%onodes = [];
%bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);

%C = bnet.names('cloudy'); % bnet.names is an associative array

%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

%%%%%% Specified Parameters

%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

%bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);

%bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);

%bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
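Once the commented-out CPDs above are filled in, a typical next step with FullBNT (following the package's standard sprinkler demo) is to build an inference engine and query a marginal; this continuation is a sketch, not part of the original script:

engine = jtree_inf_engine(bnet);          % junction-tree inference engine
evidence = cell(1, N);
evidence{W} = 2;                          % observe W = true (state 2)
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);         % posterior over the sprinkler node
marg.T                                    % P(S | W = true)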

MSBNx

References

Textbooks

Ethem ALPAYDIN, Introduction to Machine Learning, The MIT Press, 2004

Tom Mitchell, Machine Learning, McGraw Hill, 1997

Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003

Materials

Serafín Moral, Learning Bayesian Networks, University of Granada, Spain

Zheng Rong Yang, Connectionism, Exeter University

KyuTae Cho, Jeong KiYoo, HeeJin Lee, Uncertainty in AI: Probabilistic Reasoning, Especially for Bayesian Networks

Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford University

Recommended Textbooks

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992

Simon Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999

Finn V. Jensen, Bayesian Networks and Decision Graphs, Springer, 2007
