12 Knn Perceptron

Transcript
  • Slide 1/42

    CS246: Mining Massive Datasets
    Jure Leskovec, Stanford University

    http://cs246.stanford.edu

  • Slide 2/42

    Would like to do prediction: estimate a function f(x) so that y = f(x)

    where y can be: Real number: Regression

    Categorical: Classification

    Complex object: Ranking of items, Parse tree, etc.

    Data is labeled: Have many pairs {(x, y)}

    x ... a vector of real-valued features

    y ... a class ({+1, -1}, or a real number)


    [Figure: training and test set, each a mapping from features X to labels Y]

  • Slide 3/42

    We will talk about the following methods:

    k-Nearest Neighbor (Instance based learning)

    Perceptron algorithm

    Support Vector Machines

    Decision trees

    Main question:

    How to efficiently train

    (build a model/find model parameters)?


  • Slide 4/42

    Instance based learning

    Example: Nearest neighbor

    Keep the whole training dataset: {(x, y)}

    A query example (vector) q comes

    Find closest example(s) x*

    Predict y*

    Can be used both for regression and classification

    Recommendation systems


  • Slide 5/42

    To make Nearest Neighbor work we need 4 things:

    Distance metric: Euclidean

    How many neighbors to look at? One

    Weighting function (optional): Unused

    How to fit with the local points? Just predict the same output as the nearest neighbor

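    A minimal sketch of this 1-nearest-neighbor rule in Python/NumPy; the function and variable names are illustrative, not from the lecture:

    import numpy as np

    def nn_predict(X_train, y_train, q):
        """1-NN: return the stored output of the training point closest to the query q."""
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance from q to every example
        return y_train[np.argmin(dists)]              # copy the nearest neighbor's output

    # Toy usage
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    y_train = np.array([-1, -1, +1])
    print(nn_predict(X_train, y_train, np.array([4.5, 4.0])))   # -> 1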

  • Slide 6/42

    Suppose x1, ..., xm are two-dimensional:

    x1 = (x11, x12), x2 = (x21, x22), ...

    One can draw nearest neighbor regions:


    d(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2

    d(xi, xj) = (xi1 - xj1)^2 + (3xi2 - 3xj2)^2
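    A tiny illustration of how the second formula, which rescales the second coordinate by 3, can change which stored point is nearest; the points are made up for the example:

    def dist(a, b, scale2=1.0):
        # (a1 - b1)^2 + (scale2*a2 - scale2*b2)^2
        return (a[0] - b[0]) ** 2 + (scale2 * a[1] - scale2 * b[1]) ** 2

    q, x1, x2 = (0.0, 0.0), (2.0, 0.0), (0.0, 1.5)
    print(dist(q, x1), dist(q, x2))          # 4.0  2.25  -> x2 is nearer
    print(dist(q, x1, 3), dist(q, x2, 3))    # 4.0 20.25  -> after rescaling, x1 is nearer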

  • Slide 7/42

    Distance metric: Euclidean

    How many neighbors to look at? k

    Weighting function (optional): Unused

    How to fit with the local points? Just predict the average output among the k nearest neighbors


    [Figure: k-nearest-neighbor fit with k = 9]
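    A short sketch of this k-NN rule (k = 9 as in the figure); the names and toy data are assumptions:

    import numpy as np

    def knn_predict(X_train, y_train, q, k=9):
        """k-NN regression: average the outputs of the k training points closest to q."""
        dists = np.linalg.norm(X_train - q, axis=1)
        nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
        return y_train[nearest].mean()         # predict their average output

    # Toy usage
    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 10, size=(100, 1))
    y_train = np.sin(X_train[:, 0])
    print(knn_predict(X_train, y_train, np.array([3.0]), k=9))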

  • Slide 8/42

    Distance metric: Euclidean

    How many neighbors to look at? All of them (!)

    Weighting function:

    wi = exp(-d(xi, q)^2 / Kw)

    Nearby points to the query q are weighted more strongly. Kw ... kernel width.

    How to fit with the local points? Predict the weighted average: Σi wi yi / Σi wi


    [Figure: weight wi as a function of d(xi, q), for kernel widths Kw = 10, 20, 80]
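    A sketch of this kernel-weighted variant, following the formula above; the helper name is an assumption:

    import numpy as np

    def weighted_predict(X_train, y_train, q, Kw=20.0):
        """Use all training points, weighted by exp(-d(xi, q)^2 / Kw)."""
        d2 = np.sum((X_train - q) ** 2, axis=1)    # squared distances d(xi, q)^2
        w = np.exp(-d2 / Kw)                       # points near q get weight close to 1
        return np.sum(w * y_train) / np.sum(w)     # weighted average: sum_i wi*yi / sum_i wi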

  • Slide 9/42

    Given: a set P of n points in R^d

    Goal: Given a query point q

    NN: find the nearest neighbor p of q in P

    Range search: find one/all points in P within distance r from q


    [Figure: query point q and its nearest neighbor p]
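    Both problems have an obvious linear-scan baseline, which the tree structures on the next slides try to beat; a rough sketch:

    import numpy as np

    def nearest_neighbor(P, q):
        """NN: the point of P closest to q (Euclidean)."""
        return P[np.argmin(np.linalg.norm(P - q, axis=1))]

    def range_search(P, q, r):
        """Range search: all points of P within distance r of q."""
        return P[np.linalg.norm(P - q, axis=1) <= r]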

  • Slide 10/42

  • Slide 11/42

    Quadtree: the simplest spatial structure on Earth! Split the space into 2^d equal subsquares. Repeat until done:

    only one pixel left, or only one point left,

    or only a few points left

    Variants:

    split only one dimension at a time

    Kd-trees (in a moment)

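    A toy 2-D quadtree along these lines (d = 2, so each split makes 2^2 = 4 equal subsquares), stopping when only a few points remain in a cell; the class layout is an assumption, not lecture code:

    import random

    class QuadTree:
        """Recursively split a square cell into 4 equal subsquares until few points remain."""
        def __init__(self, points, cx, cy, half, leaf_size=4):
            self.cx, self.cy, self.half = cx, cy, half      # cell center and half-width
            self.points, self.children = list(points), []
            if len(self.points) > leaf_size and half > 1e-9:
                for dx in (-0.5, 0.5):
                    for dy in (-0.5, 0.5):
                        sub = [p for p in self.points
                               if (p[0] >= cx) == (dx > 0) and (p[1] >= cy) == (dy > 0)]
                        self.children.append(
                            QuadTree(sub, cx + dx * half, cy + dy * half, half / 2, leaf_size))
                self.points = []                            # interior node: points live in the children

    # Usage: 200 random points in the unit square
    pts = [(random.random(), random.random()) for _ in range(200)]
    tree = QuadTree(pts, cx=0.5, cy=0.5, half=0.5)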

  • Slide 12/42

    Range search: Put root node on the stack

    Repeat:

    pop the next node T from the stack

    for each child C of T:

    if C is a leaf, examine point(s) in C

    if C intersects with the ball of radius r around q, add C to the stack

    Nearest neighbor:

    Start range search with r = ∞

    Whenever a point is found, update r

    Only investigate nodes with respect to the current r


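    A sketch of this stack-based search against the toy QuadTree above: a child is kept only if its cell intersects the ball of radius r around q. For nearest neighbor, start with r = ∞ and shrink r whenever a closer point is found:

    import math

    def cell_intersects_ball(node, q, r):
        """True if the node's square cell intersects the ball of radius r around q."""
        dx = max(abs(q[0] - node.cx) - node.half, 0.0)
        dy = max(abs(q[1] - node.cy) - node.half, 0.0)
        return math.hypot(dx, dy) <= r

    def quad_range_search(root, q, r):
        found, stack = [], [root]
        while stack:
            T = stack.pop()                        # pop the next node T from the stack
            if not T.children:                     # T is a leaf: examine its point(s)
                found += [p for p in T.points if math.dist(p, q) <= r]
            for C in T.children:
                if cell_intersects_ball(C, q, r):  # prune children that cannot contain an answer
                    stack.append(C)
        return found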

  • Slide 13/42

  • Slide 14/42

    Kd-trees: main ideas [Bentley '75]: Only one-dimensional splits

    Choose the split carefully: e.g., pick the dimension of largest variance and split at the median (balanced split)

    Or do SVD or CUR, project and split

    Queries: as for quadtrees

    Advantages: no (or less) empty space, only linear space

    Query time at most: min[dn, exponential(d)]

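    A sketch of this construction, splitting on the dimension of largest variance at the median; the dictionary node layout is an assumption:

    import numpy as np

    def build_kdtree(X, leaf_size=8):
        """Build a kd-tree over the rows of X with balanced (median) splits."""
        if len(X) <= leaf_size:
            return {"leaf": True, "points": X}
        dim = int(np.argmax(X.var(axis=0)))          # dimension of largest variance
        X = X[np.argsort(X[:, dim])]                 # sort along that dimension
        mid = len(X) // 2                            # median position -> balanced split
        return {"leaf": False, "dim": dim, "split": float(X[mid, dim]),
                "left": build_kdtree(X[:mid], leaf_size),
                "right": build_kdtree(X[mid:], leaf_size)}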

  • Slide 15/42

    Range search: Put root node on the stack

    Repeat:

    pop the next node T from the stack

    for each child C of T:

    if C is a leaf, examine point(s) in C

    if C intersects with the ball of radius r around q, add C to the stack

    In what order do we search the children?

    Best-Bin-First (BBF), Last-Bin-First (LBF)

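    A sketch of Best-Bin-First over the kd-tree built above: unexplored cells wait in a priority queue keyed by a lower bound on their distance to the query, and approximate search stops after a fixed number of leaves. The names and the simple bound are assumptions:

    import heapq
    import numpy as np

    def bbf_nearest(tree, q, max_leaves=50):
        """Approximate NN: visit cells in order of closeness to q; stop after max_leaves leaves."""
        best, best_d = None, float("inf")
        heap, counter, leaves = [(0.0, 0, tree)], 1, 0     # entries: (bound, tiebreak, node)
        while heap and leaves < max_leaves:
            bound, _, node = heapq.heappop(heap)
            if bound >= best_d:                            # this cell cannot beat the current best
                continue
            if node["leaf"]:
                leaves += 1
                for p in node["points"]:
                    d = float(np.linalg.norm(p - q))
                    if d < best_d:
                        best, best_d = p, d
            else:
                diff = q[node["dim"]] - node["split"]
                near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
                heapq.heappush(heap, (bound, counter, near)); counter += 1
                heapq.heappush(heap, (max(bound, abs(diff)), counter, far)); counter += 1
        return best, best_d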

  • Slide 16/42

    Performance of a single Kd-tree is low. Randomized Kd-trees: Build several trees

    Find top few dimensions of largest variance

    Randomly select one of these dimensions; split on median

    Construct many complete (i.e., one point per leaf) trees

    Drawbacks: More memory

    Additional parameter to tune: number of trees

    Search: Descend through each tree until a leaf is reached

    Maintain a single priority queue for all the trees

    For approximate search, stop after a certain number of nodes have been examined


  • Slide 17/42

    [Figure: results for d = 128, n = 100k (Muja & Lowe, 2010)]

  • Slide 18/42

    Spill trees: Overlapped partitioning reduces boundary errors

    no backtracking necessary

    Spilling

    Increases tree depth

    more memory

    slower to build

    Better when split passes through sparse regions

    Lower nodes may spill too much

    hybrid of spill and non-spill nodes

    Designing a good spill factor is hard


  • Slide 19/42

    For high-dim. data, use randomized projections (CUR) or SVD

    Use Best-Bin-First (BBF)

    Make a priority queue of all unexplored nodes. Visit them in order of their closeness to the query

    Closeness is defined by distance to a cell boundary

    Space permitting:

    Keep extra statistics on a lower and upper bound for each cell and use the triangle inequality to prune the space

    Use spilling to avoid backtracking

    Use lookup tables for fast distance computation


  • Slide 20/42

    R-trees: bottom-up approach [Guttman '84]. Start with a set of points/rectangles

    Partition the set into groups of small cardinality

    For each group, find the minimum rectangle containing the objects from this group (MBR)

    Repeat

    Advantages: Supports near(est) neighbor search

    (similar to before)

    Works for points and rectangles

    Avoids empty spaces


  • Slide 21/42

    R-trees with fan-out 4: group nearby rectangles to parent MBRs

    [Figure: rectangles A-J in the plane]

  • Slide 22/42

    R-trees with fan-out 4: every parent node completely covers its children

    [Figure: rectangles A-J enclosed by parent MBRs P1-P4]

  • Slide 23/42

    R-trees with fan-out 4: every parent node completely covers its children

    [Figure: rectangles A-J, parent MBRs P1-P4, and the resulting tree with root children P1 P2 P3 P4]

  • Slide 24/42

    Example of a range search query

    [Figure: a range query rectangle overlaid on the R-tree (MBRs P1-P4, rectangles A-J)]

  • Slide 25/42

    Example of a range search query

    [Figure: the range query descending into the intersecting MBRs of the R-tree]

  • Slide 26/42

    Insertion of a point x: Find the MBR intersecting with x and insert

    If a node is full, then split:

    Linear: choose two far-apart nodes as ends. Randomly choose the remaining nodes and assign them so that they require the smallest MBR enlargement

    Quadratic: choose the two nodes so that the dead space between them is maximized. Insert the remaining nodes so that area enlargement is minimized

    [Figure: R-tree with parent MBRs P1-P4 and rectangles A-J]

  • Slide 27/42

    Approach [Weber, Schek & Blott '98]: In high-dimensional spaces, all tree-based indexing structures examine a large fraction of leaves

    If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether

    1 seek = the transfer of a few hundred KB


  • Slide 28/42

    Natural question: How to speed up the linear scan?

    Answer: Use approximation

    Use only i bits per dimension (and speed up the scan by a factor of 32/i)

    Identify all points which could be returned as an answer

    Verify those points using the original data set
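    A rough sketch of this idea (a VA-file-style layout): quantize every dimension to i bits, scan the quantized codes to find candidates that could still lie within r of q, then verify the candidates against the original vectors. The uniform quantizer and all names here are assumptions:

    import numpy as np

    def build_va(X, i_bits=4):
        """Quantize each dimension of X to 2^i_bits uniform cells; keep the grid parameters."""
        lo = X.min(axis=0)
        width = (X.max(axis=0) - lo) / (2 ** i_bits)
        width[width == 0] = 1.0                                  # guard against constant dimensions
        codes = np.clip(((X - lo) / width).astype(int), 0, 2 ** i_bits - 1)
        return codes, lo, width

    def va_range_query(X, codes, lo, width, q, r):
        cell_lo = lo + codes * width                             # lower corner of each point's cell
        cell_hi = cell_lo + width                                # upper corner
        gap = np.maximum(np.maximum(cell_lo - q, q - cell_hi), 0.0)
        cand = np.where(np.linalg.norm(gap, axis=1) <= r)[0]     # cells that could contain answers
        keep = np.linalg.norm(X[cand] - q, axis=1) <= r          # verify on the original data
        return cand[keep]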


  • Slide 29/42

  • Slide 30/42

    Example: Spam filtering

    Instance space X:

    Binary feature vectors x of word occurrences

    d features (words + other things, d ~ 100,000)

    Class Y:

    y: Spam (+1), Ham (-1)


  • Slide 31/42

    Binary classification:

    Input: Vectors xi and labels yi

    Goal: Find a vector w = (w1, w2, ..., wn)

    Each wi is a real number


    f(x) = 1 if w1x1 + w2x2 + ... + wnxn ≥ θ, and 0 otherwise

    [Figure: the separating hyperplanes w·x = 0 and w·x = θ, with negative examples on one side]

    Note: the threshold θ can be folded into w by augmenting each example with a constant feature: x ← (x, 1), w ← (w, -θ)

  • Slide 32/42

    (Very) loose motivation: the neuron

    Inputs are feature values

    Each feature has a weight wi

    Activation is the sum: f(x) = Σi wi xi = w·x

    If f(x) is: Positive: predict +1; Negative: predict -1


    [Figure: a neuron with inputs x1-x4 (e.g., 'viagra', 'nigeria'), weights w1-w4, and a '>0?' threshold deciding Spam = +1 vs. Ham = -1; below, the decision boundary w·x = 0 in the (x1, x2) plane]
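    The training loop itself is missing from this transcript (slide 33 has no text), so here is a sketch of the standard mistake-driven perceptron, consistent with the additive update w ← w + η·yj·xj quoted on slide 41:

    import numpy as np

    def train_perceptron(X, y, epochs=20, eta=1.0):
        """On every mistake, move w toward misclassified positives and away from negatives."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                pred = 1 if w @ xi > 0 else -1
                if pred != yi:
                    w += eta * yi * xi         # additive update
                    mistakes += 1
            if mistakes == 0:                  # all points classified correctly: converged
                break
        return w

    # Toy usage: two binary features, labels Spam = +1 / Ham = -1
    X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
    y = np.array([1, 1, -1, -1])
    w = train_perceptron(X, y)
    print(w, [1 if w @ x > 0 else -1 for x in X])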

  • Slide 33/42

  • Slide 34/42

    Perceptron Convergence Theorem: If there exists a set of weights that is consistent (i.e., the data is linearly separable), the perceptron learning algorithm will converge

    How long would it take to converge?

    Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop

    How to provide robustness, more expressivity?


  • Slide 35/42

    Separability: some setting of the parameters gets the training set perfectly correct

    Convergence: if the training set is separable, the perceptron will converge (binary case)

    Mistake bound: the number of mistakes is < 1/γ^2, where γ is the margin


  • Slide 36/42

    If more than 2 classes: Keep a weight vector wc for each class c

    Calculate the activation for each class: f(x, c) = Σi wc,i xi = wc·x

    Highest activation wins: c* = arg maxc f(x, c)


    [Figure: weight vectors w1, w2, w3 partition the plane into regions where w1·x, w2·x, or w3·x is biggest]
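    A sketch of the corresponding training loop: one weight vector per class, highest activation wins, and on a mistake the correct class's vector is promoted while the predicted class's vector is demoted (the usual multiclass perceptron update, not spelled out on the slide):

    import numpy as np

    def train_multiclass_perceptron(X, y, num_classes, epochs=20, eta=1.0):
        W = np.zeros((num_classes, X.shape[1]))    # one weight vector w_c per class
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                c_hat = int(np.argmax(W @ xi))     # highest activation f(x, c) = w_c · x wins
                if c_hat != yi:
                    W[yi] += eta * xi              # raise the activation of the correct class
                    W[c_hat] -= eta * xi           # lower the activation of the predicted class
        return W

    def predict(W, x):
        return int(np.argmax(W @ x))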

  • Slide 37/42

    Overfitting:

    Regularization: if the data is not separable, the weights dance around

    Mediocre generalization: finds a barely separating solution


  • Slide 38/42

    Winnow algorithm: Similar to the perceptron, just with different updates

    Learns linear threshold functions


    Initialize: θ = n; wi = 1

    Prediction is 1 iff w·x ≥ θ

    If no mistake: do nothing

    If f(x) = 1 but w·x < θ, then wi ← 2·wi for each xi = 1 (promotion)

    If f(x) = 0 but w·x ≥ θ, then wi ← wi/2 for each xi = 1 (demotion)
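    A sketch of these Winnow updates on 0/1 feature vectors, using the common initialization wi = 1 and θ = n (assumed here):

    import numpy as np

    def train_winnow(X, y, epochs=20):
        """X is a 0/1 matrix; y holds labels in {0, 1}. Multiplicative updates on active features."""
        n = X.shape[1]
        w, theta = np.ones(n), float(n)
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = 1 if w @ xi >= theta else 0
                if yi == 1 and pred == 0:
                    w[xi == 1] *= 2.0        # promotion: double weights of active features
                elif yi == 0 and pred == 1:
                    w[xi == 1] /= 2.0        # demotion: halve weights of active features
        return w, theta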

  • Slide 39/42

    The algorithm learns monotone functions. For the general case:

    Duplicate variables:

    To negate variable xi, introduce a new variable xi' = ¬xi

    Learn monotone functions over the 2n variables

    Balanced version:

    Keep two weights for each variable;

    effective weight is the difference


    Update Rule:

    If f(x) = 1 but (w+ - w-)·x ≤ θ, then wi+ ← 2·wi+ and wi- ← wi-/2, where xi = 1 (promotion)

    If f(x) = 0 but (w+ - w-)·x ≥ θ, then wi+ ← wi+/2 and wi- ← 2·wi-, where xi = 1 (demotion)

  • Slide 40/42

    Thick Separator (aka Perceptron with Margin) (applies to both Perceptron and Winnow)

    Promote if: w·x > θ + γ

    Demote if: w·x < θ - γ


    [Figure: separating hyperplanes w·x = 0 and w·x = θ, with negative examples on one side]

    Note: γ is a functional margin. Its effect could disappear as w grows.

    Nevertheless, this has been shown to be a very effective algorithmic addition.
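    In code, the only change from the plain mistake test is the thicker band of width γ around the threshold θ; a small sketch with illustrative names:

    def margin_update(w_dot_x, label, theta, gamma):
        """Return 'promote', 'demote', or None under the thick-separator rule."""
        if label == 1 and not (w_dot_x > theta + gamma):
            return "promote"     # positive example not confidently above the separator
        if label == 0 and not (w_dot_x < theta - gamma):
            return "demote"      # negative example not confidently below the separator
        return None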

  • Slide 41/42

    Examples: x ∈ {0,1}^n; Hypothesis: w ∈ R^n; Prediction is 1 iff w·x ≥ θ

    Additive weight update algorithm [Perceptron, Rosenblatt, 1958]: w ← w + η yj xj

    If Class = 1 but w·x < θ, wi ← wi + 1 (if xi = 1) (promotion)

    If Class = 0 but w·x ≥ θ, wi ← wi - 1 (if xi = 1) (demotion)

    Multiplicative weight update algorithm [Winnow, Littlestone, 1988]: w ← w · exp{η yj xj} (elementwise)

    If Class = 1 but w·x < θ, wi ← 2·wi (if xi = 1) (promotion)

    If Class = 0 but w·x ≥ θ, wi ← wi/2 (if xi = 1) (demotion)

  • Slide 42/42

    Winnow

    Online: can adjust to a changing target over time

    Advantages:

    Simple

    Guaranteed to learn a linearly separable problem

    Suitable for problems with many irrelevant attributes

    Limitations:

    only linear separations

    only converges for linearly separable data

    not really efficient with many features

    Perceptron

    Online: can adjust to a changing target over time

    Advantages:

    Simple

    Guaranteed to learn a linearly separable problem

    Limitations:

    only linear separations

    only converges for linearly separable data

    not really efficient with many features