Page 1: Machine Learning (CSE 446): Learning Theory

Machine Learning (CSE 446): Learning Theory

Noah Smith, © 2017 University of Washington

[email protected]

November 27, 2017

1 / 47

Page 2: Machine Learning (CSE 446): Learning Theory

Big Questions in Learning Theory

▶ When is learning possible?

▶ How much data is required?

▶ Will a learned classifier generalize to test data?

2 / 47

Page 3: Machine Learning (CSE 446): Learning Theory

Big Questions in Learning Theory

▶ When is learning possible?

▶ How much data is required?

▶ Will a learned classifier generalize to test data?

Theory can come before or after practice.

3 / 47

Page 4: Machine Learning (CSE 446): Learning Theory

The Ultimate Learning Algorithm?

Simple D that is inherently noisy: X and Y both binary. Let p(X = Y ) = 0.8.

4 / 47

Page 5: Machine Learning (CSE 446): Learning Theory

The Ultimate Learning Algorithm?

Simple D that is inherently noisy: X and Y both binary. Let p(X = Y ) = 0.8.

There’s simply no way to get better than 80% accuracy with any classifier f .

5 / 47

Page 6: Machine Learning (CSE 446): Learning Theory

The Ultimate Learning Algorithm?

Simple D that is inherently noisy: X and Y both binary. Let p(X = Y ) = 0.8.

There’s simply no way to get better than 80% accuracy with any classifier f .

Even if your data aren’t noisy and low error is achievable by some f, you still have to worry about lousy samples from D.

6 / 47

Page 7: Machine Learning (CSE 446): Learning Theory

The Ultimate Learning Algorithm?

Simple D that is inherently noisy: X and Y both binary. Let p(X = Y ) = 0.8.

There’s simply no way to get better than 80% accuracy with any classifier f .

Even if your data aren’t noisy and low error is achievable by some f, you still have to worry about lousy samples from D.

You can’t hope for perfection every time, or even “pretty good” every time, or perfection most of the time. The best you can hope for is pretty good, most of the time.

7 / 47

Page 8: Machine Learning (CSE 446): Learning Theory

Probably Approximately Correct

▶ Probably: on most test sets (i.e., succeed on a (1 − δ) fraction of the possible test sets)

▶ Approximately Correct: low error (i.e., accuracy at least (1 − ε))

Definition: An (ε, δ)-PAC learning algorithm is one that, given samples from any data distribution D, returns a “bad function” with probability ≤ δ, where a bad function is one whose test error rate is greater than ε on D.
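Stated symbolically (a restatement of the definition above; writing A(S) for the classifier returned on sample S and err_D for its true error rate is my notation, not the slides’):

\Pr_{S \sim \mathcal{D}^N}\Big[\, \mathrm{err}_{\mathcal{D}}\big(A(S)\big) > \varepsilon \,\Big] \;\le\; \delta .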

8 / 47

Page 9: Machine Learning (CSE 446): Learning Theory

Efficiency

Definition: An (ε, δ)-PAC learning algorithm is efficient if its runtime is polynomial in 1/ε and 1/δ.

9 / 47

Page 10: Machine Learning (CSE 446): Learning Theory

Efficiency

Definition: An (ε, δ)-PAC learning algorithm is efficient if its runtime is polynomial in 1/ε and 1/δ.

E.g., if you want to reduce error rate from 5% to 4%, you shouldn’t require an exponential increase in computational resources.

10 / 47

Page 11: Machine Learning (CSE 446): Learning Theory

Efficiency

Definition: An (ε, δ)-PAC learning algorithm is efficient if its runtime is polynomial in 1/ε and 1/δ.

E.g., if you want to reduce error rate from 5% to 4%, you shouldn’t require an exponential increase in computational resources.

Note that this extends to the size of the training set: if your training dataset must increase exponentially, that will also affect runtime!

11 / 47

Page 12: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

12 / 47

Page 13: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

13 / 47

Page 14: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

Example: X_1 ∧ X_7 ∧ ¬X_9.

14 / 47

Page 15: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

How many hypotheses are there, |H|?

15 / 47

Page 16: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

How many hypotheses are there, |H|? 3^d
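A quick sanity check of the 3^d count (a sketch I’ve added; each variable independently appears positively, negated, or not at all in a conjunction):

from itertools import product

d = 4  # small dimension, just for illustration
# Each variable X_j is in exactly one of three states within a conjunction.
hypotheses = list(product(["positive", "negated", "absent"], repeat=d))
print(len(hypotheses), 3 ** d)   # both print 81 for d = 4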

16 / 47

Page 17: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

How many hypotheses are there, |H|? 3^d

Assume: Y is given by some h* ∈ H. That is, for a given x, y = f_h*(x), without noise.

17 / 47

Page 18: Machine Learning (CSE 446): Learning Theory

Example: “And-Literals” Machine
Thanks to Andrew Moore; see also https://www.autonlab.org/_media/tutorials/pac05.pdf

Let X range over binary vectors (unknown distribution), denoted 〈X_1, ..., X_d〉.

Let H, the set of hypotheses, contain all logical conjunctions of 〈X_1, ..., X_d〉 and their negations.

How many hypotheses are there, |H|? 3^d

Assume: Y is given by some h* ∈ H. That is, for a given x, y = f_h*(x), without noise.

Learning: choose h ∈ H given a training dataset drawn from distribution D.

18 / 47

Page 19: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

19 / 47

Page 20: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)

20 / 47

Page 21: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)
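The last step uses the standard inequality 1 − x ≤ e^(−x), applied N times:

(1-\varepsilon)^N \;\le\; \left(e^{-\varepsilon}\right)^{N} \;=\; e^{-\varepsilon N},
\qquad \text{since } 1 - x \le e^{-x} \text{ for all real } x.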

21 / 47

Page 22: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

In other words, this unfortunate event is bounded by the probability of avoiding one of the ε × 100% cases of h’s error, N times.

22 / 47

Page 23: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

23 / 47

Page 24: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)

24 / 47

Page 25: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)

25 / 47

Page 26: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)
  = p(∨_{h∈H} h ∈ H_0 ∧ h ∈ H_bad)

26 / 47

Page 27: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)
  = p(∨_{h∈H} h ∈ H_0 ∧ h ∈ H_bad)
  ≤ ∑_{h∈H} p(h ∈ H_0 ∧ h ∈ H_bad)   “union bound”

27 / 47

Page 28: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)
  = p(∨_{h∈H} h ∈ H_0 ∧ h ∈ H_bad)
  ≤ ∑_{h∈H} p(h ∈ H_0 ∧ h ∈ H_bad)   “union bound”

Note that p(P ∧ Q) = p(P | Q) · p(Q) ≤ p(P | Q), since p(Q) ≤ 1.

28 / 47

Page 29: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)
  = p(∨_{h∈H} h ∈ H_0 ∧ h ∈ H_bad)
  ≤ ∑_{h∈H} p(h ∈ H_0 ∧ h ∈ H_bad)   “union bound”
  ≤ ∑_{h∈H} p(h ∈ H_0 | h ∈ H_bad)

29 / 47

Page 30: Machine Learning (CSE 446): Learning Theory

The Game

▶ We choose the “machine” (e.g., the and-literal machine), or the class of functions F = {f_h : h ∈ H}.

▶ Nature chooses h* ∈ H and randomly samples N inputs from D (which is fixed and unknown), then labels them using y_n = f_h*(x_n).

▶ Let H_0 contain all h ∈ H that achieve zero training set error. We choose some h_est ∈ H_0.

▶ Let H_bad contain all h ∈ H such that the test set error of f_h is greater than ε.

First consider p(h ∈ H_0 | h ∈ H_bad)
  = p(∀n ∈ {1, ..., N}, f_h(x_n) = y_n | h ∈ H_bad)
  ≤ (1 − ε)^N
  ≤ e^(−ε·N)

Now consider p(h_est ∈ H_bad)
  ≤ p(∃h : h ∈ H_0 ∧ h ∈ H_bad)
  = p(∨_{h∈H} h ∈ H_0 ∧ h ∈ H_bad)
  ≤ ∑_{h∈H} p(h ∈ H_0 ∧ h ∈ H_bad)   “union bound”
  ≤ ∑_{h∈H} p(h ∈ H_0 | h ∈ H_bad)
  ≤ |H| · (1 − ε)^N ≤ |H| · e^(−ε·N)

30 / 47
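Before turning this into a sample-size requirement, here is a minimal simulation sketch (mine, not the course’s) that brute-forces the and-literals setting for d = 3 under a uniform input distribution and checks the final bound p(h_est ∈ H_bad) ≤ |H| · e^(−ε·N) empirically:

import itertools, math, random

d, eps, N, trials = 3, 0.2, 25, 2000
random.seed(0)

inputs = list(itertools.product([0, 1], repeat=d))     # all 2^d points; D is uniform here
H = list(itertools.product([1, -1, 0], repeat=d))      # 1 = positive literal, -1 = negated, 0 = absent

def predict(h, x):
    # The conjunction is true iff every required literal is satisfied.
    return all(s == 0 or (s == 1 and xi == 1) or (s == -1 and xi == 0)
               for s, xi in zip(h, x))

def true_error(h, h_star):
    return sum(predict(h, x) != predict(h_star, x) for x in inputs) / len(inputs)

h_star = (1, -1, 0)                                    # target: X1 AND (NOT X2)
H_bad = {h for h in H if true_error(h, h_star) > eps}

failures = 0
for _ in range(trials):
    xs = [random.choice(inputs) for _ in range(N)]
    ys = [predict(h_star, x) for x in xs]
    H0 = [h for h in H if all(predict(h, x) == y for x, y in zip(xs, ys))]
    h_est = H0[0]                                      # any hypothesis with zero training error
    failures += h_est in H_bad

print("empirical p(h_est in H_bad):", failures / trials)
print("bound |H| * exp(-eps * N): ", len(H) * math.exp(-eps * N))

With these settings the empirical failure rate comes out far below the (loose) bound of roughly 0.18.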

Page 31: Machine Learning (CSE 446): Learning Theory

Blumer Bound

We want to bound p(h_est ∈ H_bad) ≤ δ:

|H| · e^(−ε·N) ≤ δ
⇒ N ≥ (1/ε) (ln |H| + ln(1/δ)) ≈ (0.69/ε) (log₂ |H| + log₂(1/δ))

31 / 47

Page 32: Machine Learning (CSE 446): Learning Theory

Blumer Bound

We want to bound p(h_est ∈ H_bad) ≤ δ:

|H| · e^(−ε·N) ≤ δ
⇒ N ≥ (1/ε) (ln |H| + ln(1/δ)) ≈ (0.69/ε) (log₂ |H| + log₂(1/δ))

For our and-literals machine, |H| = 3^d, so we need (1/ε) (1.1d + ln(1/δ)) training examples to “PAC-learn.”
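To put numbers on this bound (the helper’s name and the d = 20, ε = 0.1, δ = 0.01 values are mine, for illustration):

import math

def blumer_sample_size(ln_H, eps, delta):
    """Sufficient training-set size for (eps, delta)-PAC learning a finite,
    realizable hypothesis class with log-cardinality ln|H| = ln_H."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

# And-literals machine: |H| = 3^d, so ln|H| = d * ln 3 (about 1.1 * d).
d, eps, delta = 20, 0.1, 0.01
print(blumer_sample_size(d * math.log(3), eps, delta))   # 266 examples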

32 / 47

Page 33: Machine Learning (CSE 446): Learning Theory

Blumer Bound

We want to bound p(h_est ∈ H_bad) ≤ δ:

|H| · e^(−ε·N) ≤ δ
⇒ N ≥ (1/ε) (ln |H| + ln(1/δ)) ≈ (0.69/ε) (log₂ |H| + log₂(1/δ))

Corollary: if h_est ∈ H_0, then you can estimate ε as (1/N) (ln |H| + ln(1/δ)).

33 / 47

Page 34: Machine Learning (CSE 446): Learning Theory

Blumer Bound

We want to bound p(h_est ∈ H_bad) ≤ δ:

|H| · e^(−ε·N) ≤ δ
⇒ N ≥ (1/ε) (ln |H| + ln(1/δ)) ≈ (0.69/ε) (log₂ |H| + log₂(1/δ))

General observation: if we can decrease |H| without losing good solutions, that’s a good thing.

34 / 47

Page 35: Machine Learning (CSE 446): Learning Theory

Simple PAC-Learnable Algorithm for And-Literals Machine

Data: D = 〈(x_n, y_n)〉, n = 1, ..., N
Result: f
initialize: f = x_1 ∧ x_2 ∧ ··· ∧ x_d ∧ ¬x_1 ∧ ¬x_2 ∧ ··· ∧ ¬x_d;
for n ∈ {1, ..., N} do
    if y_n = +1 then
        for j ∈ {1, ..., d} do
            if x_n[j] = 0 then
                remove x_j from f
            else
                remove ¬x_j from f
            end
        end
    end
end
return f

Algorithm 1: ThrowOutBadTerms

35 / 47
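A runnable Python sketch of Algorithm 1 (variable and helper names are mine; the slides give only pseudocode). Literals never contradicted by a positive example simply remain in the conjunction.

def throw_out_bad_terms(data):
    """data: list of (x, y) pairs, x a 0/1 list of length d, y in {0, 1}.
    Returns the surviving literals: ("pos", j) means x_j, ("neg", j) means NOT x_j."""
    d = len(data[0][0])
    # Start from the conjunction of all 2d literals.
    f = {("pos", j) for j in range(d)} | {("neg", j) for j in range(d)}
    for x, y in data:
        if y == 1:                      # only positive examples rule out literals
            for j in range(d):
                if x[j] == 0:
                    f.discard(("pos", j))
                else:
                    f.discard(("neg", j))
    return f

def predict(f, x):
    """Evaluate the learned conjunction on input x."""
    return all(x[j] == 1 if sign == "pos" else x[j] == 0 for sign, j in f)

# Toy usage: the target is x_1 AND (NOT x_3), i.e., indices 0 and 2 here.
data = [([1, 0, 0], 1), ([1, 1, 0], 1), ([0, 1, 0], 0), ([1, 0, 1], 0)]
f = throw_out_bad_terms(data)
print(sorted(f))                        # [('neg', 2), ('pos', 0)]
print([predict(f, x) for x, _ in data]) # [True, True, False, False]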

Page 36: Machine Learning (CSE 446): Learning Theory

Another Example: Lookup Table

Suppose H is all lookup tables, where we map every vector in {0, 1}^d to a binary value.

36 / 47

Page 37: Machine Learning (CSE 446): Learning Theory

Another Example: Lookup Table

Suppose H is all lookup tables, where we map every vector in {0, 1}^d to a binary value.

|H| = 2^(2^d)

37 / 47

Page 38: Machine Learning (CSE 446): Learning Theory

Another Example: Lookup Table

Suppose H is all lookup tables, where we map every vector in {0, 1}^d to a binary value.

|H| = 2^(2^d)

N ≥ (0.69/ε) (2^d + log₂(1/δ))
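To make the exponential dependence concrete (illustrative numbers of my choosing, d = 20, ε = 0.1, δ = 0.01):

N \;\ge\; \frac{0.69}{0.1}\left(2^{20} + \log_2 \tfrac{1}{0.01}\right)
  \;\approx\; 6.9 \times (1{,}048{,}576 + 6.6)
  \;\approx\; 7.2 \times 10^{6}.

Millions of examples, versus a few hundred for the and-literals machine with the same d, ε, and δ.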

38 / 47

Page 39: Machine Learning (CSE 446): Learning Theory

Shallow Decision Trees (Binary Features, Binary Classification)

Let H^(k) contain all decision trees of depth k.

|H^(0)| = 2
|H^(k)| = d · |H^(k−1)|²

So log₂ |H^(k)| = (2^k − 1) · (1 + log₂ d) + 1, and we need

N ≥ (0.69/ε) ((2^k − 1) · (1 + log₂ d) + 1 + log₂(1/δ))
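For comparison with the lookup table, a quick evaluation of this bound with illustrative numbers (the function name and parameter values are mine):

import math

def depth_k_tree_sample_size(k, d, eps, delta):
    # log2 |H^(k)| for depth-k trees over d binary features, per the recurrence above.
    log2_H = (2 ** k - 1) * (1 + math.log2(d)) + 1
    return math.ceil((0.69 / eps) * (log2_H + math.log2(1.0 / delta)))

print(depth_k_tree_sample_size(k=3, d=20, eps=0.1, delta=0.01))   # about 310 examples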

39 / 47

Page 40: Machine Learning (CSE 446): Learning Theory

(The rest of the slides are from the wrap-up on November 29.)

40 / 47

Page 41: Machine Learning (CSE 446): Learning Theory

Quick Review

▶ (ε, δ)-PAC learners (and efficiency)

▶ For a finite hypothesis class H that contains h*, and noise-free data:
  N ≥ (1/ε) (ln |H| + ln(1/δ))

▶ Analyses for and-literal machines, lookup-table machines, k-depth decision trees.

41 / 47

Page 42: Machine Learning (CSE 446): Learning Theory

Limitations

▶ We’ve assumed no noise.

▶ We’ve assumed that H is finite.

42 / 47

Page 43: Machine Learning (CSE 446): Learning Theory

Limitations

▶ We’ve assumed no noise.

▶ We’ve assumed that H is finite.

Theoretical results for infinite H rely on measures of complexity like the Vapnik-Chervonenkis (VC) dimension, which typically we can only bound.

43 / 47

Page 44: Machine Learning (CSE 446): Learning Theory

Limitations

▶ We’ve assumed no noise.

▶ We’ve assumed that H is finite.

Theoretical results for infinite H rely on measures of complexity like the Vapnik-Chervonenkis (VC) dimension, which typically we can only bound.

The VC dimension of a hypothesis space H over input space X is the largest K such that there exists a set of K elements of X (call it X) such that for any binary labeling of X, some h ∈ H matches the labeling.
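As a concrete illustration of the definition (my own toy example, not from the slides), one can brute-force shattering for 1-D threshold classifiers h_t(x) = 1[x ≥ t], whose VC dimension is 1:

def shattered(points, hypotheses):
    """True iff every binary labeling of `points` is produced by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# 1-D threshold classifiers h_t(x) = 1 if x >= t else 0, over a grid of thresholds.
H = [lambda x, t=t / 10: int(x >= t) for t in range(-20, 21)]

print(shattered([0.5], H))        # True: one point can be labeled either way
print(shattered([0.2, 0.8], H))   # False: no threshold labels 0.2 positive and 0.8 negative

So the largest shattered set has one point: the VC dimension of this class is 1.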

44 / 47