Computational Learning Theory • PAC • IID • VC Dimension • SVM
Kunstmatige Intelligentie / RuG
KI2 - 5
Marius Bulacu & prof. dr. Lambert Schomaker
Dec 19, 2015
Learning
• Learning is essential for unknown environments, i.e., when the designer lacks omniscience
• Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
• Learning modifies the agent's decision mechanisms to improve performance
Learning Agents
Learning Element
Design of a learning element is affected by:
• Which components of the performance element are to be learned
• What feedback is available to learn these components
• What representation is used for the components

Type of feedback:
• Supervised learning: correct answers for each example
• Unsupervised learning: correct answers not given
• Reinforcement learning: occasional rewards
Inductive Learning
Simplest form: learn a function from examples
- f is the target function
- an example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
This is a highly simplified model of real learning:
- ignores prior knowledge
- assumes examples are given
Inductive Learning Method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:
Occam’s razor: prefer the simplest hypothesis consistent with data
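A minimal sketch (illustrative, not from the slides; it assumes NumPy and an arbitrary toy data set) of the trade-off behind Occam's razor: a high-degree polynomial can agree with the training points exactly, yet the simpler hypothesis is preferred.

```python
import numpy as np

# Five noisy training examples (x, f(x)) from an underlying linear target.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 5)
y = 2 * x + 0.5 + rng.normal(0, 0.05, size=x.shape)

simple = np.polyfit(x, y, deg=1)    # simple hypothesis: a straight line
complex_ = np.polyfit(x, y, deg=4)  # degree-4 polynomial: passes through every point

x_test = np.linspace(-0.2, 1.2, 7)  # points slightly outside the training range
print("line  :", np.polyval(simple, x_test))
print("deg-4 :", np.polyval(complex_, x_test))
# Both hypotheses are (nearly) consistent with the training data,
# but the degree-4 fit typically behaves much worse away from the training points.
```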
Occam’s Razor
William of Occam (1285-1349, England):
“If two theories explain the facts equally well, then the simpler theory is to be preferred.”

Rationale:
• There are fewer short hypotheses than long hypotheses.
• A short hypothesis that fits the data is unlikely to be a coincidence.
• A long hypothesis that fits the data may be a coincidence.
Formal treatment in computational learning theory
The Problem
• Why does learning work?
• How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?
The answer is provided by computational learning theory.
The Answer
• Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong.
Therefore it must be:
Probably Approximately Correct (PAC)
The Stationarity Assumption
• The training and test sets are drawn randomly from the same population of examples using the same probability distribution.
Therefore training and test data are
Independent and Identically Distributed (IID)
“the future is like the past”
How many examples are needed?
Sample complexity (the number of examples needed):

    m ≥ (1/ε) (ln(1/δ) + ln |H|)

where
• m: number of examples
• ε: probability that h and f disagree on an example
• δ: probability that a wrong hypothesis consistent with all examples exists
• |H|: size of the hypothesis space
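As a quick illustration (a sketch, not part of the slides; the values ε = 0.1, δ = 0.05, |H| = 2^10 are arbitrary), the bound can be evaluated directly:

```python
import math

def sample_complexity(eps, delta, h_size):
    # m >= (1/eps) * (ln(1/delta) + ln|H|)
    return math.ceil((math.log(1 / delta) + math.log(h_size)) / eps)

# e.g. eps = 0.1, delta = 0.05, |H| = 2**10 hypotheses
print(sample_complexity(0.1, 0.05, 2 ** 10))   # about 100 examples
```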
Formal Derivation
• H: the set of all possible hypotheses (the target f is in H)
• H_bad: the set of “wrong” hypotheses, i.e., those whose error exceeds ε

For any h in H_bad:
    P(x : h(x) ≠ f(x)) > ε,  so  P(x : h(x) = f(x)) ≤ 1 − ε

Probability that such an h agrees with m independent examples:
    ≤ (1 − ε)^m

Probability that H_bad contains a hypothesis consistent with all m examples:
    P(∃ h ∈ H_bad consistent with m examples) ≤ |H_bad| (1 − ε)^m ≤ |H| (1 − ε)^m

Requiring this to be at most δ (and using 1 − ε ≤ e^(−ε)):
    |H| (1 − ε)^m ≤ δ   ⇒   m ≥ (1/ε) (ln(1/δ) + ln |H|)
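A small simulation sketch of this bound (illustrative assumptions, none from the slides: inputs are the integers 0-99 drawn uniformly, hypotheses are thresholds, and the target is the threshold at 50):

```python
import math, random

def sample_complexity(eps, delta, h_size):
    # m >= (1/eps) * (ln(1/delta) + ln|H|)
    return math.ceil((math.log(1 / delta) + math.log(h_size)) / eps)

# Toy hypothesis space: thresholds h_t(x) = [x >= t] on x in {0, ..., 99};
# the target is f = h_50, so error(h_t) = |t - 50| / 100 and |H| = 101.
EPS, DELTA = 0.1, 0.05
thresholds = range(101)
target = 50
m = sample_complexity(EPS, DELTA, len(thresholds))

trials, bad_consistent = 2000, 0
for _ in range(trials):
    xs = [random.randrange(100) for _ in range(m)]
    # Is some eps-bad hypothesis still consistent with all m examples?
    if any(abs(t - target) / 100 > EPS
           and all((x >= t) == (x >= target) for x in xs)
           for t in thresholds):
        bad_consistent += 1

print("m =", m)
print("empirical P(some bad h is consistent) =", bad_consistent / trials,
      "which should be <= delta =", DELTA)
```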
What if hypothesis space is infinite?
• We can't use our result for finite H.
• We need some other measure of complexity for H:
  – the Vapnik-Chervonenkis (VC) dimension
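As an aside (a brute-force sketch, not from the slides), the idea of shattering behind the VC dimension can be checked directly for linear separators in the plane: some set of 3 points can be given all 2^3 labelings by a line, but the 4 corner points cannot (the XOR labelings fail), and in fact no 4 points in the plane can be shattered by a line, so the VC dimension of 2-D linear separators is 3.

```python
from itertools import product

def separable(points, labels, grid=[i / 2 for i in range(-4, 5)]):
    """Brute force: is there a line w1*x + w2*y + b = 0 with every +1 point
    strictly on one side and every -1 point strictly on the other?"""
    for w1, w2, b in product(grid, repeat=3):
        scores = [w1 * x + w2 * y + b for x, y in points]
        if all(s != 0 and (s > 0) == (lab > 0) for s, lab in zip(scores, labels)):
            return True
    return False

def shattered(points):
    """A point set is shattered if every +1/-1 labeling of it is linearly separable."""
    return all(separable(points, labs) for labs in product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))           # True:  3 points can be shattered
print(shattered([(0, 0), (0, 1), (1, 0), (1, 1)]))   # False: the XOR labelings fail
```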
Shattering two binary dimensions over a number of classes
To understand the principle of shattering sample points into classes, we will look at the simple case of two dimensions with binary values.
[Figure sequence: the four points of the 2-D binary feature space (f1, f2 ∈ {0, 1}) are assigned to two classes, one configuration per slide]
• 2-D feature space
• 2-D feature space, 2 classes / the other class…
• 2 left vs 2 right / top vs bottom / right vs left / bottom vs top
• lower-right outlier / lower-left outlier / upper-left outlier / upper-right outlier
• etc. (three more “2-D feature space” slides)
• XOR configuration A / XOR configuration B
2-D feature space, two classes: 16 hypotheses
[Figure: all 16 possible two-class labelings, numbered 0-15, of the four cells f1 ∈ {0, 1}, f2 ∈ {0, 1}]
“hypothesis” = a possible class partitioning of all data samples
2-D feature space, two classes, 16 hypotheses
[Figure: the same 16 labelings, with the two XOR class configurations highlighted]
Two XOR class configurations: 2 of the 16 hypotheses require a non-linear separatrix (see the sketch below).
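A brute-force sketch (illustrative, not from the slides; it reuses the same kind of separability check as the VC-dimension sketch above) confirming that exactly the two XOR labelings of the four corner points need a non-linear boundary:

```python
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = [i / 2 for i in range(-4, 5)]          # candidate weights/biases: -2.0 ... 2.0

def separable(labels):
    # Is there a line w1*f1 + w2*f2 + b = 0 with every +1 point strictly on one
    # side and every -1 point strictly on the other?
    for w1, w2, b in product(grid, repeat=3):
        scores = [w1 * f1 + w2 * f2 + b for f1, f2 in points]
        if all(s != 0 and (s > 0) == (lab > 0) for s, lab in zip(scores, labels)):
            return True
    return False

non_linear = [labs for labs in product([-1, 1], repeat=4) if not separable(labs)]
print(len(non_linear), "of 16 hypotheses need a non-linear separatrix")
print(non_linear)   # the two XOR configurations: (-1, 1, 1, -1) and (1, -1, -1, 1)
```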
XOR, a possible non-linear separation
[Figures: two possible non-linear separatrices for the XOR configuration in the (f1, f2) plane]
2-D feature space, three classes, # hypotheses?
[Figure: the four cells f1 ∈ {0, 1}, f2 ∈ {0, 1}, now labeled with three classes; hypotheses 0, 1, 2, … shown]
3^4 = 81 possible hypotheses
Maximum, discrete space
Four classes: 4^4 = 256 hypotheses
Assuming there are no more classes than discrete cells, the maximum number of hypotheses is
N_hyp_max = n_classes ^ n_cells
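A tiny check of these counts (an illustrative sketch):

```python
def n_hypotheses(n_cells, n_classes):
    # each cell can independently receive any of the class labels
    return n_classes ** n_cells

print(n_hypotheses(4, 2), n_hypotheses(4, 3), n_hypotheses(4, 4))   # 16 81 256
```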
2-D feature space, three classes…
[Figure: the four cells (f1, f2) labeled with three classes]
In this example, one class is linearly separable from the rest, as is a second class, but the third class is not linearly separable from the rest of the classes.
2-D feature space, four classes…
[Figure: the four cells (f1, f2) labeled with four classes]
Minsky & Papert: simple table lookup or logic will do nicely.
2-D feature space, four classes…
[Figure: the four cells (f1, f2) labeled with four classes]
Spheres or radial-basis functions may offer a compact class encapsulation in case of limited noise and limited overlap
(but in the end the data will tell: experimentation required!)
SVM (1): Kernels
[Figure: a complicated separation boundary in the original feature space (f1, f2) becomes a simple separation boundary, a hyperplane, in a higher-dimensional space (f1, f2, f3)]
Kernels: polynomial, radial basis, sigmoid
Implicit mapping to a higher-dimensional space where linear separation is possible.
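A minimal sketch of the idea (illustrative; the explicit feature map f3 = f1·f2 is a hypothetical choice, not taken from the slides): the XOR configuration, which no line can separate in (f1, f2), becomes linearly separable once a third feature is added, which is what a kernel achieves implicitly.

```python
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, +1, +1, -1]                      # XOR: not linearly separable in (f1, f2)

def phi(f1, f2):
    # explicit mapping to 3-D: add the product feature f3 = f1 * f2
    return (f1, f2, f1 * f2)

# In the mapped space the plane f1 + f2 - 2*f3 - 0.5 = 0 separates the two classes.
w, b = (1, 1, -2), -0.5
for (f1, f2), y in zip(points, labels):
    score = sum(wi * xi for wi, xi in zip(w, phi(f1, f2))) + b
    print((f1, f2), "label", y, "-> score", score)
```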
SVM (2): Max Margin
[Figure: a 2-D feature space (f1, f2) with the “best” separating hyperplane, the maximum margin, and the support vectors marked]
From all the possible separating hyperplanes, select the one that gives the maximum margin.
Solution found by quadratic optimization – “learning”.
Good generalization.
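A minimal sketch (assuming scikit-learn, which the slides do not name, and an arbitrary toy data set): fit a linear max-margin SVM and inspect the hyperplane, the support vectors, and the margin width.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],      # class -1
              [3.0, 3.0], [4.0, 2.5], [2.5, 4.0]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)                       # quadratic optimization happens here

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane w, b:", w, b)
print("support vectors:", clf.support_vectors_)
print("margin width:", 2.0 / np.linalg.norm(w))   # distance between the two margins
```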