CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU
Sep 28, 2015

Transcript

  • CSE 575: Statistical Machine Learning

    Jingrui He, CIDSE, ASU

  • Instance-based Learning

  • 3

    1-Nearest Neighbor

    Four things make a memory-based learner:
    1. A distance metric: Euclidean (and many more)
    2. How many nearby neighbors to look at? One
    3. A weighting function (optional): unused
    4. How to fit with the local points? Just predict the same output as the nearest neighbor.
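
    A minimal sketch of this recipe in Python (Euclidean distance, one neighbor, no weighting); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def nn1_predict(X_train, y_train, x_query):
    """1-NN: predict the same output as the single nearest training point."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every stored example
    return y_train[np.argmin(dists)]                   # label of the nearest neighbor
```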

  • 4

    Consistency of 1-NN

    Consider an estimator f_n trained on n examples, e.g., 1-NN, regression, ...
    The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, consistent if: (see the formula sketched below)

    Regression is not consistent! (Representation bias)
    1-NN is consistent (under some mild fine print)
    What about variance???
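
    A standard way to write the condition referenced above ("consistent if:"), for noise-free data:

```latex
\lim_{n \to \infty} \operatorname{error}_{\text{true}}(f_n) = 0
```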

  • 5

    1-NN overfits?

  • 6

    k-Nearest Neighbor

    Four things make a memory-based learner:
    1. A distance metric: Euclidean (and many more)
    2. How many nearby neighbors to look at? k
    3. A weighting function (optional): unused
    4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
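
    The same sketch with k neighbors and an unweighted average (again, names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=9):
    """k-NN: predict the (unweighted) average output of the k nearest points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()
```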

  • 7

    k-Nearest Neighbor (here k=9)

    k-Nearest Neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?

  • 8

    Curse of dimensionality for instance-based learning

    Must store and retrieve all data! Most real work is done during testing: for every test sample, we must search through the whole dataset, which is very slow! There are fast methods for dealing with large datasets, e.g., tree-based methods and hashing methods.

    Instance-based learning is often poor with noisy or irrelevant features
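
    As one concrete example of the tree-based speedups mentioned above, SciPy's k-d tree answers nearest-neighbor queries without a full linear scan (a sketch, assuming low to moderate dimensionality, where k-d trees actually help):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))      # 10,000 stored instances in 3 dimensions
tree = cKDTree(X_train)                # build the tree once, before any queries

x_query = rng.random(3)
dists, idx = tree.query(x_query, k=5)  # 5 nearest neighbors without scanning all points
```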

  • Support Vector Machines

  • 10

    Linear classifiers: Which line is better?

    w.x = Σ_j w^(j) x^(j)

    Data:

    Example i:

  • 11

    Pick the one with the largest margin!

    w.x = Σ_j w^(j) x^(j)

    w.x + b = 0

  • 12

    Maximize the margin

    w.x + b = 0

  • 13

    But there are many such planes

    w.x + b = 0

  • 14

    Review: Normal to a plane

    w.x + b = 0

  • 15

    Normalized margin: Canonical hyperplanes

    [Figure: points x+ and x- lie on the canonical hyperplanes w.x + b = +1 and w.x + b = -1, on either side of the decision boundary w.x + b = 0; the distance between the two canonical hyperplanes is the margin]

  • 16

    Normalized margin: Canonical hyperplanes

    [Figure: the same canonical-hyperplane margin diagram as on the previous slide]

  • 17

    Margin maximization using canonical hyperplanes

    [Figure: margin diagram with w.x + b = +1, w.x + b = -1, and w.x + b = 0]
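
    With the canonical hyperplanes w.x + b = ±1, the margin and the resulting optimization problem take the standard form (a reconstruction of the formulas that did not survive extraction):

```latex
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
\qquad\Longrightarrow\qquad
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1\ \ \text{for all } i
```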

  • 18

    Support vector machines (SVMs)

    [Figure: margin diagram with w.x + b = +1, w.x + b = -1, and w.x + b = 0]

    Solve efficiently by quadratic programming (QP): well-studied solution algorithms

    Hyperplane defined by support vectors

  • 19

    What if the data is not linearly separable?

    Use features of features of features of features.

  • 20

    What if the data is still not linearly separable?

    Minimize w.w and the number of training mistakes: trade off the two criteria?

    Trade off #(mistakes) and w.w:
    0/1 loss
    Slack penalty C
    Not QP anymore
    Also doesn't distinguish near misses from really bad mistakes

  • 21

    Slack variables: Hinge loss

    If margin ≥ 1, don't care. If margin < 1, pay a linear penalty.
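
    A standard way to write the slack-variable problem this slide refers to (the formulas themselves did not survive extraction):

```latex
\min_{\mathbf{w},\,b,\,\xi}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
```

    At the optimum each slack equals the hinge penalty ξ_i = max(0, 1 - y_i(w.x_i + b)): zero when the margin is at least 1, linear when it falls below 1.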

  • 22

    Side note: What's the difference between SVMs and logistic regression?

    SVM: hinge loss. Logistic regression: log loss.
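
    In standard form, writing f(x) = w.x + b and y ∈ {-1, +1} (a reconstruction of the formulas that were on the slide):

```latex
\ell_{\text{hinge}}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\ 1 - y\,f(\mathbf{x})\bigr)
\qquad
\ell_{\text{log}}\bigl(y, f(\mathbf{x})\bigr) = \ln\bigl(1 + e^{-y\,f(\mathbf{x})}\bigr)
```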

  • 23

    Constrained optimization

  • 24

    Lagrange multipliers: Dual variables

    Moving the constraint into the objective function. Lagrangian:

    Solve:
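
    A generic sketch of the construction (the slide's specific objective is not reproduced here): a constraint g(x) ≤ 0 is moved into the objective with a dual variable α ≥ 0.

```latex
\min_{x} f(x)\ \ \text{s.t.}\ \ g(x) \le 0
\qquad\Longrightarrow\qquad
L(x,\alpha) = f(x) + \alpha\, g(x),
\qquad
\text{solve } \min_{x}\ \max_{\alpha \ge 0}\ L(x,\alpha)
```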

  • 25

    Lagrange multipliers: Dual variables

    Solving:

  • 26

    Dual SVM derivation (1): the linearly separable case

  • 27

    Dual SVM derivation (2): the linearly separable case

  • 28

    Dual SVM interpretation

    w.x + b = 0

  • 29

    Dual SVM formulation: the linearly separable case
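
    The standard form of this dual (a hedged reconstruction; the slide's own equations were lost in extraction):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0,
\qquad
\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i
```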

  • 30

    Dual SVM derivation: the non-separable case

  • 31

    Dual SVM formulation: the non-separable case
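
    In the standard non-separable dual, the only change is the box constraint coming from the slack penalty C:

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0
```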

  • 32

    Why did we learn about the dual SVM?

    There are some quadratic programming algorithms that can solve the dual faster than the primal

    But, more importantly, the kernel trick!!! Another little detour

  • 33

    Reminder from last time: What if the data is not linearly separable?

    Use features of features of features of features.

    Feature space can get really large really quickly!

  • 34

    Higher order polynomials

    [Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]

    m input features, d degree of polynomial: the number of terms grows fast! For d = 6 and m = 100, about 1.6 billion terms.
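
    This growth can be checked directly: the number of distinct monomials of degree exactly d in m variables is C(m + d - 1, d). A quick sketch (function name illustrative):

```python
from math import comb

def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))  # 1609344100, i.e., about 1.6 billion terms
```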

  • 35

    Dual formulation only depends on dot products, not on w!

  • 36

    Dot-product of polynomials
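
    Worked out for degree 2 (a standard reconstruction): if φ(u) lists all products u_i u_j, the dot product of the feature vectors collapses to a function of u.v alone.

```latex
\phi(\mathbf{u}) \cdot \phi(\mathbf{v})
= \sum_{i,j} (u_i u_j)(v_i v_j)
= \Bigl(\sum_i u_i v_i\Bigr)\Bigl(\sum_j u_j v_j\Bigr)
= (\mathbf{u} \cdot \mathbf{v})^2
```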

  • 37

    Finally: the kernel trick!

    Never represent features explicitly; compute dot products in closed form

    Constant-time high-dimensional dot products for many classes of features

    Very interesting theory: Reproducing Kernel Hilbert Spaces

  • 38

    Polynomial kernels

    All monomials of degree d in O(d) operations:

    How about all monomials of degree up to d? Solution 0:

    Better solution:
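
    The standard forms these bullets refer to (a hedged reconstruction): the degree-d kernel, the sum-over-degrees workaround, and the usual better solution.

```latex
K_d(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v})^d
\qquad
\text{Solution 0: } K(\mathbf{u},\mathbf{v}) = \sum_{k=1}^{d} (\mathbf{u}\cdot\mathbf{v})^k
\qquad
\text{Better: } K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v} + 1)^d
```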

  • 39

    Common kernels

    Polynomials of degree d

    Polynomials of degree up to d

    Gaussian kernels

    Sigmoid
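
    Their standard forms (the bandwidth σ and sigmoid parameters η, ν are the usual ones, not taken from the slide):

```latex
(\mathbf{u}\cdot\mathbf{v})^d
\qquad
(\mathbf{u}\cdot\mathbf{v}+1)^d
\qquad
\exp\!\left(-\frac{\lVert \mathbf{u}-\mathbf{v} \rVert^2}{2\sigma^2}\right)
\qquad
\tanh(\eta\,\mathbf{u}\cdot\mathbf{v} + \nu)
```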

  • 40

    Overfitting?

    Huge feature space with kernels: what about overfitting??? Maximizing the margin leads to a sparse set of support vectors

    Some interesting theory says that SVMs search for a simple hypothesis with a large margin

    Often robust to overfitting

  • 41

    What about at classification time? For a new input x, if we need to represent φ(x), we are in trouble!

    Recall the classifier: sign(w.φ(x) + b). Using kernels we are cool!

  • 42

    SVMs with kernels

    Choose a set of features and a kernel function. Solve the dual problem to obtain the support vector weights α_i.

    At classification time, compute:

    Classify as:
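
    A minimal sketch of this classification rule, assuming the dual solution (the α_i, the support vectors, and b) has already been obtained from the QP; the RBF kernel and all names here are illustrative:

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, support_X, support_y, alpha, b, kernel=rbf_kernel):
    """Kernelized SVM decision rule: sign( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alpha, support_y, support_X)) + b
    return np.sign(score)
```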

  • 43

    What's the difference between SVMs and Logistic Regression?

                                                SVMs         Logistic Regression
    Loss function                               Hinge loss   Log-loss
    High-dimensional features with kernels      Yes!         No

  • 44

    Kernels in logistic regression

    Define the weights in terms of support vectors:

    Derive a simple gradient descent rule on the α_i
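
    One common way to write this substitution (conventions vary; the symbols here are not lifted from the slide): the weights are expanded over the mapped training points so that only kernel values appear, and gradient descent is run on the α_i.

```latex
\mathbf{w} = \sum_i \alpha_i\, \phi(\mathbf{x}_i)
\quad\Longrightarrow\quad
\mathbf{w}\cdot\phi(\mathbf{x}) + b = \sum_i \alpha_i\, K(\mathbf{x}_i, \mathbf{x}) + b,
\qquad
P(y{=}1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\bigl(-\sum_i \alpha_i K(\mathbf{x}_i,\mathbf{x}) - b\bigr)}
```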

  • 45

    What's the difference between SVMs and Logistic Regression? (Revisited)

                                                SVMs         Logistic Regression
    Loss function                               Hinge loss   Log-loss
    High-dimensional features with kernels      Yes!         Yes!
    Solution sparse                             Often yes!   Almost always no!
    Semantics of output                         Margin       Real probabilities