

Computational Learning Theory
• PAC
• IID
• VC Dimension
• SVM

Kunstmatige Intelligentie / RuG

KI2 - 5

Marius Bulacu & prof. dr. Lambert Schomaker

2

Learning

Learning is essential for unknown environments – i.e., when the designer lacks omniscience

Learning is useful as a system construction method – i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent's decision mechanisms to improve performance

3

Learning Agents

4

Learning Element

Design of a learning element is affected by:
– Which components of the performance element are to be learned
– What feedback is available to learn these components
– What representation is used for the components

Type of feedback:
– Supervised learning: correct answers for each example
– Unsupervised learning: correct answers not given
– Reinforcement learning: occasional rewards

5

Inductive Learning

Simplest form: learn a function from examples

- f is the target function

- an example is a pair (x, f(x))

Problem: find a hypothesis h such that h ≈ f, given a training set of examples

This is a highly simplified model of real learning:

- ignores prior knowledge

- assumes examples are given
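As a concrete illustration of this setting, here is a minimal Python sketch (made-up example pairs and a tiny hypothetical hypothesis space, not from the slides): an h is acceptable when it agrees with f on every given pair (x, f(x)).

# Minimal sketch of inductive learning from examples (hypothetical data).
# Examples are pairs (x, f(x)); we keep the hypotheses that are consistent
# with every example.
examples = [(0, 0), (1, 1), (2, 4), (3, 9)]          # pairs (x, f(x))

hypothesis_space = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "square":   lambda x: x * x,
}

consistent = [name for name, h in hypothesis_space.items()
              if all(h(x) == fx for x, fx in examples)]
print(consistent)    # ['square'] -- the only h that agrees with f on every example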

6

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

7

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

8

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

9

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

10

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

11

Inductive Learning Method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

Occam’s razor: prefer the simplest hypothesis consistent with the data
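A minimal curve-fitting sketch in Python (made-up data points; numpy assumed): the hypotheses are polynomials of increasing degree, and among those that fit the training set, Occam’s razor prefers the lowest-degree one.

# Curve fitting with polynomial hypotheses of increasing degree (hypothetical data).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # example inputs
y = np.array([0.1, 1.1, 3.9, 9.2, 15.8])     # observed f(x), roughly quadratic

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)         # adjust h to agree with f on the training set
    h = np.poly1d(coeffs)
    max_error = np.max(np.abs(h(x) - y))      # how far h is from being consistent
    print(f"degree {degree}: max training error = {max_error:.3f}")

# The degree-4 polynomial interpolates all five points exactly (it is consistent),
# but the nearly-as-good degree-2 hypothesis is the one Occam's razor prefers.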

12

Occam’s Razor

William of Occam

(1285-1349, England)

“If two theories explain the facts equally well, then the simpler theory is to be preferred.”

Rationale:

There are fewer short hypotheses than long hypotheses.

A short hypothesis that fits the data is unlikely to be a coincidence.

A long hypothesis that fits the data may be a coincidence.

Formal treatment in computational learning theory

13

The Problem

• Why does learning work?

• How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?

Answer provided by computational learning theory.

14

The Answer

• Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong.

Therefore it must be:

Probably Approximately Correct

PAC

15

The Stationarity Assumption

• The training and test sets are drawn randomly from the same population of examples using the same probability distribution.

Therefore training and test data are

Independently and Identically Distributed

IID

“the future is like the past”

16

How many examples are needed?

ε: the probability that h and f disagree on an example
δ: the probability that a wrong hypothesis consistent with all examples exists
|H|: the size of the hypothesis space
m: the number of examples (the sample complexity)

m ≥ (1/ε) (ln(1/δ) + ln |H|)
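A worked example of this bound in Python (illustrative numbers, not from the slides): with |H| = 2^(2^5), the space of all boolean functions of 5 attributes, an error bound ε = 0.05 and a failure probability δ = 0.01 need only a few hundred examples.

# Sample-complexity bound m >= (1/eps) * (ln(1/delta) + ln|H|), illustrative numbers.
import math

def sample_complexity(eps, delta, H_size):
    """Smallest integer m satisfying the PAC bound for error eps and confidence 1 - delta."""
    return math.ceil((1.0 / eps) * (math.log(1.0 / delta) + math.log(H_size)))

print(sample_complexity(eps=0.05, delta=0.01, H_size=2 ** (2 ** 5)))   # prints 536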

17

Formal Derivation

H: the set of all possible hypotheses, containing the target function f
HBAD: the set of “wrong” hypotheses, i.e. those whose error exceeds ε

For any hBAD ∈ HBAD: P(x : hBAD(x) ≠ f(x)) > ε, hence P(x : hBAD(x) = f(x)) ≤ 1 − ε

Probability that hBAD is consistent with m independently drawn examples: ≤ (1 − ε)^m

Probability that HBAD contains a hypothesis consistent with all m examples:
P(∃ hBAD consistent) ≤ |HBAD| (1 − ε)^m ≤ |H| (1 − ε)^m

Requiring |H| (1 − ε)^m ≤ δ and using (1 − ε) ≤ e^(−ε) gives
m ≥ (1/ε) (ln(1/δ) + ln |H|)
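A quick numerical sanity check of the derivation (same illustrative numbers as above): with m taken from the bound, the survival probability of the bad hypotheses, |H|(1 − ε)^m, indeed stays below δ.

# Check that |H| * (1 - eps)**m <= delta for m from the sample-complexity bound.
import math

eps, delta, H_size = 0.05, 0.01, 2 ** (2 ** 5)
m = math.ceil((1.0 / eps) * (math.log(1.0 / delta) + math.log(H_size)))
survival_bound = H_size * (1.0 - eps) ** m
print(m, survival_bound, survival_bound <= delta)   # 536, roughly 0.005, True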

18

What if hypothesis space is infinite?

We can’t use our finite-H result when H is infinite; we need some other measure of complexity for H:
– the Vapnik-Chervonenkis (VC) dimension
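The VC dimension of a hypothesis class is the size of the largest set of points it can shatter, i.e. label in every possible way. A brute-force Python sketch with a hypothetical example (1-D threshold functions h_t(x) = 1 iff x ≥ t): any single point can be shattered, but no pair can, so the VC dimension of this class is 1.

# Brute-force shattering check for 1-D threshold hypotheses h_t(x) = 1 iff x >= t.
from itertools import product

def can_shatter(points):
    # Candidate thresholds: below all points, at each point, above all points;
    # these realize every dichotomy a threshold can produce on this point set.
    thresholds = [min(points) - 1.0] + list(points) + [max(points) + 1.0]
    for labels in product((0, 1), repeat=len(points)):          # every possible labeling
        realized = any(
            all((1 if x >= t else 0) == y for x, y in zip(points, labels))
            for t in thresholds
        )
        if not realized:
            return False                                        # some labeling cannot be produced
    return True

print(can_shatter([2.0]))        # True  -> VC dimension at least 1
print(can_shatter([2.0, 5.0]))   # False -> no pair is shattered, so VC dimension is 1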


22

Shattering two binary dimensions over a number of classes

To understand the principle of shattering sample points into classes, we will look at the simple case of two dimensions of binary value.

23

2-D feature space

(figure: the four points (0,0), (0,1), (1,0), (1,1) of the binary f1-f2 plane)

24

2-D feature space, 2 classes


25

the other class…


26

2 left vs 2 right


27

top vs bottom


28

right vs left


29

bottom vs top


30

lower-right outlier


31

lower-left outlier


32

upper-left outlier


33

upper-right outlier


34

etc.


35

2-D feature space


36

2-D feature space


37

2-D feature space


38

XOR configuration A


39

XOR configuration B


40

2-D feature space, two classes: 16 hypotheses

(table: the 16 possible labelings, numbered 0 - 15, of the four cells f1 ∈ {0,1}, f2 ∈ {0,1} over two classes)

“hypothesis” = a possible class partitioning of all data samples

41

2-D feature space, two classes, 16 hypotheses

(table: the same 16 labelings, 0 - 15, with the two XOR class configurations marked)

Two XOR class configurations: 2 of the 16 hypotheses require a non-linear separatrix, as the brute-force check below confirms.
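The brute-force check, as a Python sketch (the small set of candidate lines used here is sufficient for these four corner points): enumerate all 2^4 = 16 labelings of the four points and count those that no straight line separates.

# Count the labelings of the four binary points that are not linearly separable.
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [-1, 0, 1]
biases = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

def linearly_separable(labels):
    # A labeling is separable if some line w1*f1 + w2*f2 + b = 0 puts every
    # class-1 point on the positive side and every class-0 point on the other.
    for w1, w2, b in product(weights, weights, biases):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(y)
               for (x1, x2), y in zip(points, labels)):
            return True
    return False

non_separable = [labels for labels in product((0, 1), repeat=4)
                 if not linearly_separable(labels)]
print(len(non_separable), non_separable)   # 2 labelings: the two XOR configurations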

42

XOR, a possible non-linear separation


43

XOR, a possible non-linear separation


44

2-D feature space, three classes, # hypotheses?

(table: the labelings of the four cells over three classes, entries 0 - 8 shown, and so on …)

45

2-D feature space, three classes, # hypotheses?

(table: the labelings of the four cells over three classes, entries 0 - 8 shown, and so on …)

3^4 = 81 possible hypotheses

46

Maximum, discrete space

Four classes: 4^4 = 256 hypotheses

Assume that there are no more classes than discrete cells.

Nhyp,max = nclasses^ncells
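A quick numerical check of this counting formula in Python, for the four cells of the binary f1-f2 plane:

# Number of possible hypotheses = n_classes ** n_cells for a discrete feature space.
n_cells = 4
for n_classes in (2, 3, 4):
    print(n_classes, "classes:", n_classes ** n_cells, "hypotheses")
# 2 classes: 16, 3 classes: 81, 4 classes: 256 -- matching the earlier slides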

47

2-D feature space, three classes…

In this example, two of the classes are each linearly separable from the rest, but the third class is not linearly separable from the rest of the classes.

48

2-D feature space, four classes…

Minsky & Papert: a simple table lookup or logic will do nicely.

49

2-D feature space, four classes…

Spheres or radial-basis functions may offer a compact class encapsulation in case of limited noise and limited overlap

(but in the end the data will tell: experimentation required!)

50

SVM (1): Kernels

(figure: a complicated separation boundary in the original f1-f2 space becomes a simple separating hyperplane after mapping to an f1-f2-f3 space)

Kernels: polynomial, radial basis, sigmoid

Implicit mapping to a higher-dimensional space where linear separation is possible.
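A small sketch of the kernel idea (scikit-learn assumed; the labels are the XOR configuration from the earlier slides): a linear kernel cannot classify all four XOR points correctly, while a radial-basis kernel, which maps them implicitly to a higher-dimensional space, can.

# XOR is not linearly separable in the original space, but becomes separable
# under the implicit mapping of an RBF kernel (requires scikit-learn).
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                      # XOR labels

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # below 1.0: not linearly separable
print("rbf kernel accuracy:", rbf_svm.score(X, y))         # 1.0: separable after the mapping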

51

SVM (2): Max Margin

Support vectors

Max Margin

“Best” Separating Hyperplane

From all the possible separating hyperplanes, select the one that gives Max Margin.

Solution found by Quadratic Optimization – “Learning”.

Good generalization
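A small sketch of max-margin learning (scikit-learn assumed; made-up linearly separable data): the quadratic optimization happens inside SVC, and the support vectors and the margin width 2/||w|| can be read off the fitted model.

# Fit a linear SVM on toy separable data and inspect support vectors and margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 0.3], [0.2, 0.8],    # class 0
              [2.0, 2.0], [2.5, 1.7], [1.8, 2.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)           # large C approximates a hard margin

w = svm.coef_[0]
print("support vectors:\n", svm.support_vectors_)
print("margin width 2/||w||:", 2.0 / np.linalg.norm(w))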
