Lecture 12: kNN and Perceptron (7/31/2019)
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
Would like to do prediction: estimate a function f(x) so that y = f(x),
where y can be:
- A real number: Regression
- Categorical: Classification
- A complex object: a ranking of items, a parse tree, etc.
Data is labeled: we have many pairs {(x, y)}
- x: a vector of real-valued features
- y: a class ({+1, -1}) or a real number
2/14/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
[Figure: mapping from X to Y; the data is split into a training and a test set]
We will talk about the following methods:
k-Nearest Neighbor (instance-based learning)
Perceptron algorithm
Support Vector Machines
Decision trees
Main question:
How to efficiently train
(build a model/find model parameters)?
Instance-based learning
Example: nearest neighbor
- Keep the whole training dataset: {(x, y)}
- A query example (vector) q comes in
- Find the closest example(s) x*
- Predict y*
Can be used both for regression and classification
Example application: recommendation systems
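The idea above can be sketched in a few lines of Python. This is a minimal brute-force 1-nearest-neighbor predictor; the function name and the toy data are illustrative, not from the lecture.

```python
# A minimal brute-force sketch of 1-nearest-neighbor prediction: keep the
# whole training set and return the label of the closest stored example.
import math

def nearest_neighbor(train, query):
    """train: list of (x, y) pairs with x a tuple of floats; returns y of closest x."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    x_star, y_star = min(train, key=lambda pair: dist(pair[0], query))
    return y_star

data = [((0.0, 0.0), -1), ((1.0, 1.0), 1), ((0.9, 0.8), 1)]
print(nearest_neighbor(data, (0.1, 0.2)))    # closest example is (0, 0) -> -1
```

This is O(n) per query; the tree structures later in the lecture exist precisely to avoid this linear scan.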
To make nearest neighbor work we need 4 things:
- Distance metric: Euclidean
- How many neighbors to look at? One
- Weighting function (optional): unused
- How to fit with the local points? Just predict the same output as the nearest neighbor
Suppose x1, ..., xm are two-dimensional:
x1 = (x11, x12), x2 = (x21, x22), ...
One can draw the nearest-neighbor regions:
d(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2
d(xi, xj) = (xi1 - xj1)^2 + (3xi2 - 3xj2)^2
- Distance metric: Euclidean
- How many neighbors to look at? k
- Weighting function (optional): unused
- How to fit with the local points? Just predict the average output among the k nearest neighbors
[Figure: decision regions for k = 9]
- Distance metric: Euclidean
- How many neighbors to look at? All of them (!)
- Weighting function: wi = exp(-d(xi, q)^2 / Kw). Nearby points to query q are weighted more strongly; Kw is the kernel width.
- How to fit with the local points? Predict the weighted average: Σi wi yi / Σi wi
[Figure: the weight wi as a function of d(xi, q), peaking at d(xi, q) = 0, for kernel widths Kw = 10, 20, 80]
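The kernel-weighted prediction above can be sketched as follows; the function name and 1-D toy usage are ours, not from the lecture.

```python
# A minimal sketch of kernel-weighted kNN regression: every training point
# votes with weight w_i = exp(-d(x_i, q)^2 / K_w), and we predict the
# weighted average of the labels.
import math

def kernel_knn_predict(train, q, Kw=1.0):
    num = den = 0.0
    for x, y in train:
        d2 = sum((xi - qi) ** 2 for xi, qi in zip(x, q))
        w = math.exp(-d2 / Kw)               # nearby points weighted more strongly
        num += w * y
        den += w
    return num / den                         # weighted average of the labels
```

With a very small Kw this behaves like 1-NN (the closest point dominates); with a large Kw it approaches the global average.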
Given: a set P of n points in R^d
Goal: given a query point q,
- NN: find the nearest neighbor p of q in P
- Range search: find one/all points in P within distance r from q
The simplest spatial structure on Earth!
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
Variants:
- split only one dimension at a time
- kd-trees (in a moment)
Range search:
- Put the root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack
Nearest neighbor:
- Start range search with r = ∞
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
Main ideas [Bentley '75]:
- Only one-dimensional splits
- Choose the split carefully, e.g.:
  - pick the dimension of largest variance and split at the median (balanced split)
  - do SVD or CUR, project, and split
- Queries: as for quadtrees
Advantages:
- no (or less) empty space
- only linear space
- query time at most min[dn, exponential(d)]
Range search:
- Put the root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack
In what order do we search the children?
- Best-Bin-First (BBF), Last-Bin-First (LBF)
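Putting the construction and the stack-based range search together, here is a small self-contained kd-tree sketch. For simplicity it alternates split dimensions rather than picking the dimension of largest variance, and searches children in stack order rather than BBF; all names are ours. Requires Python 3.8+ for math.dist.

```python
# A self-contained kd-tree sketch: median split on alternating dimensions,
# plus the stack-based range search described in the slides.
import math
from collections import namedtuple

Node = namedtuple("Node", "dim split left right points")   # leaf iff points is set

def build_kdtree(pts, depth=0, leaf_size=2):
    if len(pts) <= leaf_size:
        return Node(None, None, None, None, pts)            # leaf node
    dim = depth % len(pts[0])
    pts = sorted(pts, key=lambda p: p[dim])
    mid = len(pts) // 2                                      # median split
    return Node(dim, pts[mid][dim],
                build_kdtree(pts[:mid], depth + 1, leaf_size),
                build_kdtree(pts[mid:], depth + 1, leaf_size), None)

def range_search(root, q, r):
    found, stack = [], [root]                # put the root node on the stack
    while stack:
        node = stack.pop()
        if node.points is not None:          # leaf: examine its point(s)
            found += [p for p in node.points if math.dist(p, q) <= r]
            continue
        # push a child only if its half-space intersects the ball around q
        if q[node.dim] - r <= node.split:
            stack.append(node.left)
        if q[node.dim] + r >= node.split:
            stack.append(node.right)
    return found
```

Nearest-neighbor search follows the pattern on the earlier slide: run this with r = ∞ and shrink r whenever a closer point is found.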
Performance of a single kd-tree is low. Randomized kd-trees: build several trees
- Find the top few dimensions of largest variance
- Randomly select one of these dimensions; split on the median
- Construct many complete (i.e., one point per leaf) trees
Drawbacks:
- more memory
- an additional parameter to tune: the number of trees
Search:
- Descend through each tree until a leaf is reached
- Maintain a single priority queue for all the trees
- For approximate search, stop after a certain number of nodes have been examined
[Figure: performance of randomized kd-trees, d = 128, n = 100k; from Muja-Lowe, 2010]
Overlapped partitioning (spilling) reduces boundary errors:
- no backtracking necessary
Spilling increases tree depth:
- more memory
- slower to build
Better when the split passes through sparse regions
Lower nodes may spill too much:
- use a hybrid of spill and non-spill nodes
Designing a good spill factor is hard
- For high-dimensional data, use randomized projections (CUR) or SVD
- Use Best-Bin-First (BBF):
  - keep a priority queue of all unexplored nodes and visit them in order of their closeness to the query
  - closeness is defined by the distance to a cell boundary
- Space permitting: keep extra statistics on the lower and upper bound for each cell and use the triangle inequality to prune the space
- Use spilling to avoid backtracking
- Use lookup tables for fast distance computation
Bottom-up approach [Guttman '84]:
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum bounding rectangle (MBR) containing the objects from this group
- Repeat
Advantages:
- Supports near(est)-neighbor search (similar to before)
- Works for points and rectangles
- Avoids empty space
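Two small helpers make the R-tree operations above concrete: computing a group's MBR, and the rectangle-vs-query-ball intersection test used to decide whether a subtree can contain answers to a range query. Names are illustrative, not Guttman's code.

```python
# Illustrative 2-D helpers for the R-tree discussion.
def mbr(points):
    """Minimum bounding rectangle of a set of 2-D points: (xmin, ymin, xmax, ymax)."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def intersects(rect, q, r):
    """Does the rectangle intersect the ball of radius r around q?"""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0, q[0] - xmax)    # per-axis distance from q to the rectangle
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return dx * dx + dy * dy <= r * r
```

A range search then descends only into children whose MBR passes the `intersects` test, exactly as in the kd-tree case.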
R-trees with fan-out 4: group nearby rectangles to parent MBRs
[Figure: rectangles A-J grouped into parent MBRs]
R-trees with fan-out 4: every parent node completely covers its children
[Figure: rectangles A-J covered by parent MBRs P1-P4; the tree's leaf level holds the rectangles grouped under their parents]
R-trees with fan-out 4: every parent node completely covers its children
[Figure: the parents P1-P4 in turn become children of the root node]
Example of a range search query
[Figure: the query descends only into the parent MBRs P1-P4 that it intersects]
Example of a range search query (continued)
[Figure: within the intersected parents, only the overlapping leaf rectangles are examined]
Insertion of a point x: find the MBR intersecting with x and insert.
If a node is full, then split:
- Linear: choose two far-apart nodes as ends. Randomly choose the remaining nodes and assign them so that they require the smallest MBR enlargement
- Quadratic: choose two nodes so the dead space between them is maximized. Insert nodes so that area enlargement is minimized
[Figure: rectangles A-J and parents P1-P4 as before]
Approach [Weber-Schek-Blott '98]:
- In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
- If we need to visit so many nodes anyway, it is better to scan the whole dataset and avoid performing seeks altogether
- (1 seek = transfer of a few hundred KB)
Natural question: how to speed up the linear scan?
Answer: use approximation
- Use only i bits per dimension (and speed up the scan by a factor of 32/i)
- Identify all points which could be returned as an answer
- Verify the points using the original dataset
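A hedged sketch of the idea: keep an i-bit-per-dimension approximation of every vector, scan the approximations sequentially to rule points out via a cheap lower bound, and verify the survivors against the original data. The quantization scheme and the (deliberately loose) bound below are our simplification, not the paper's exact VA-file layout.

```python
# Scan-and-verify sketch of a VA-file-style range search.
import math

def quantize(vec, lo, hi, bits=4):
    """Map each coordinate to a 'bits'-bit cell index."""
    levels = 1 << bits
    return tuple(min(levels - 1, int((v - l) / (h - l) * levels))
                 for v, l, h in zip(vec, lo, hi))

def va_range_search(data, q, r, lo, hi, bits=4):
    levels = 1 << bits
    cell = [(h - l) / levels for l, h in zip(lo, hi)]        # cell width per dimension
    qa = quantize(q, lo, hi, bits)
    out = []
    for p in data:                                           # sequential scan
        pa = quantize(p, lo, hi, bits)
        # lower bound on d(p, q)^2 from the cell indices alone; skip if it exceeds r
        lb2 = sum((max(0, abs(pai - qai) - 1) * c) ** 2
                  for pai, qai, c in zip(pa, qa, cell))
        if lb2 <= r * r and math.dist(p, q) <= r:            # verify on original data
            out.append(p)
    return out
```

In a real VA-file only the compact cell indices are scanned from disk; the full vectors are fetched just for the candidates that survive the bound.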
Example: spam filtering
Instance space X:
- Binary feature vectors x of word occurrences
- d features (words + other things, d ≈ 100,000)
Class Y:
- y: Spam (+1), Ham (-1)
Binary classification:
- Input: vectors xi and labels yi
- Goal: find a vector w = (w1, w2, ..., wn); each wi is a real number

f(x) = 1 if w1 x1 + w2 x2 + ... + wn xn ≥ θ, and 0 otherwise

[Figure: the hyperplane w·x = θ separating positive and negative examples. Note: the threshold θ can be folded into w by augmenting x with a constant coordinate 1.]
(Very) loose motivation: the neuron
- Inputs are feature values
- Each feature has a weight wi
- Activation is the sum: f(x) = Σi wi xi = w·x - θ
- If f(x) is positive: predict +1; negative: predict -1

[Figure: inputs x1..x4 with weights w1..w4 feeding a ">0?" unit; e.g. features "viagra", "nigeria"; output Spam = +1, Ham = -1. The decision boundary is the hyperplane w·x = 0.]
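The learning rule this picture suggests can be sketched as follows: predict the sign of w·x - θ and, on a mistake, move w by y·x (and θ by -y). Function names and the toy AND-style dataset are ours, not from the lecture.

```python
# A hedged sketch of perceptron training with an explicit threshold theta.
def train_perceptron(data, d, epochs=100):
    w, theta = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:                      # y in {+1, -1}
            activation = sum(wi * xi for wi, xi in zip(w, x)) - theta
            pred = 1 if activation > 0 else -1
            if pred != y:                      # mistake: additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                theta -= y
                mistakes += 1
        if mistakes == 0:                      # a clean pass: converged
            break
    return w, theta

# Toy separable data: label +1 iff x1 AND x2
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), 1)]
w, theta = train_perceptron(data, 2)
```

Because this data is linearly separable, the convergence theorem on the next slide guarantees the loop terminates with zero training mistakes.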
Perceptron Convergence Theorem: if there exists a set of weights that are consistent (i.e., the data is linearly separable), the perceptron learning algorithm will converge.
How long would it take to converge?
Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
How to provide robustness, more expressivity?
- Separability: some setting of the parameters gets the training set perfectly correct
- Convergence: if the training set is separable, the perceptron will converge (binary case)
- Mistake bound: the number of mistakes is < 1/γ², where γ is the margin of the separator
If more than 2 classes:
- Keep a weight vector wc for each class c
- Calculate the activation for each class: f(x, c) = Σi wc,i xi = wc·x
- Highest activation wins: c = arg maxc f(x, c)

[Figure: three weight vectors w1, w2, w3 partition the plane into regions where w1·x, w2·x, or w3·x is biggest]
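The one-weight-vector-per-class scheme can be sketched as below. The update convention on a mistake (promote the true class's vector, demote the predicted one) is the standard multiclass perceptron rule; it is not spelled out on the slide, and all names and the toy data are ours.

```python
# A hedged sketch of the multiclass perceptron: argmax activation wins.
def predict(W, x):
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in W.items()}
    return max(scores, key=scores.get)        # highest activation wins

def train_multiclass(data, classes, d, epochs=100):
    W = {c: [0.0] * d for c in classes}
    for _ in range(epochs):
        clean = True
        for x, y in data:
            c = predict(W, x)
            if c != y:
                W[y] = [wi + xi for wi, xi in zip(W[y], x)]   # promote true class
                W[c] = [wi - xi for wi, xi in zip(W[c], x)]   # demote predicted class
                clean = False
        if clean:
            break
    return W

data = [((1, 0), "a"), ((0, 1), "b"), ((-1, -1), "c")]
W = train_multiclass(data, ["a", "b", "c"], 2)
```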
- Overfitting
- Regularization: if the data is not separable, the weights dance around
- Mediocre generalization: finds a barely separating solution
Winnow algorithm: similar to the perceptron, just with different updates. Learns linear threshold functions.

Initialize: θ = n, wi = 1
Prediction is 1 iff w·x ≥ θ
If no mistake: do nothing
If f(x) = 1 but w·x < θ: wi ← 2·wi (if xi = 1) (promotion)
If f(x) = 0 but w·x ≥ θ: wi ← wi / 2 (if xi = 1) (demotion)
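These updates can be sketched in a few lines; the toy target (the monotone disjunction x1 OR x2 over n = 4 binary features) and all names are ours.

```python
# A hedged sketch of Winnow: weights start at 1, threshold theta = n;
# promotions double and demotions halve the weights of the active features.
def train_winnow(data, n, epochs=100):
    w, theta = [1.0] * n, float(n)
    for _ in range(epochs):
        clean = True
        for x, y in data:                     # x in {0,1}^n, y in {0,1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred != y:
                clean = False
                factor = 2.0 if y == 1 else 0.5   # promotion / demotion
                w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
        if clean:
            break
    return w, theta

data = [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 1), ((0, 0, 1, 1), 0),
        ((0, 0, 0, 0), 0), ((1, 0, 1, 0), 1)]
w, theta = train_winnow(data, 4)
```

Note the multiplicative updates: irrelevant features are halved away quickly, which is why Winnow handles many irrelevant attributes well.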
The algorithm learns monotone functions. For the general case:
- Duplicate variables: to negate variable xi, introduce a new variable xi' = ¬xi, and learn monotone functions over 2n variables
- Balanced version: keep two weights wi+ and wi- for each variable; the effective weight is the difference

Update rule (balanced version):
If f(x) = 1 but (w+ - w-)·x < θ: wi+ ← 2·wi+, wi- ← wi- / 2, where xi = 1 (promotion)
If f(x) = 0 but (w+ - w-)·x ≥ θ: wi+ ← wi+ / 2, wi- ← 2·wi-, where xi = 1 (demotion)
Thick separator (aka perceptron with margin); applies to both Perceptron and Winnow:
- Promote if a positive example has w·x < θ + γ
- Demote if a negative example has w·x > θ - γ

[Figure: the hyperplane w·x = θ with a thick band of width γ on each side; only examples classified correctly and outside the band leave w unchanged]

Note: γ is a functional margin; its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.
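A sketch of the thick-separator variant of the perceptron: a positive example triggers a promotion unless w·x clears θ + γ, a negative one a demotion unless w·x falls below θ - γ. Names and the toy data are ours.

```python
# A hedged sketch of the perceptron with a margin ("thick separator"):
# update whenever an example lands inside the band of width gamma.
def train_thick_perceptron(data, d, theta=0.0, gamma=1.0, epochs=200):
    w = [0.0] * d
    for _ in range(epochs):
        clean = True
        for x, y in data:                     # y in {+1, -1}
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y == 1 and s <= theta + gamma:        # promote
                w = [wi + xi for wi, xi in zip(w, x)]
                clean = False
            elif y == -1 and s >= theta - gamma:     # demote
                w = [wi - xi for wi, xi in zip(w, x)]
                clean = False
        if clean:
            break
    return w

data = [((2, 0), 1), ((0, 2), 1), ((-2, 0), -1), ((0, -2), -1)]
w = train_thick_perceptron(data, 2)
```

Compared to the plain perceptron, the returned w is forced to clear every example by at least γ, not merely classify it correctly.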
Perceptron: additive weight update algorithm [Rosenblatt, 1958]
Winnow: multiplicative weight update algorithm [Littlestone, 1988]

Examples: x ∈ {0,1}^n; Hypothesis: w ∈ R^n; prediction is 1 iff w·x ≥ θ

Perceptron (additive):
- If Class = 1 but w·x < θ: wi ← wi + 1 (if xi = 1) (promotion)
- If Class = 0 but w·x ≥ θ: wi ← wi - 1 (if xi = 1) (demotion)
- In general: w ← w + η yj xj

Winnow (multiplicative):
- If Class = 1 but w·x < θ: wi ← 2·wi (if xi = 1) (promotion)
- If Class = 0 but w·x ≥ θ: wi ← wi / 2 (if xi = 1) (demotion)
- In general: wi ← wi · exp{η yj xji}
Winnow:
- Online: can adjust to a changing target over time
- Advantages:
  - Simple
  - Guaranteed to learn a linearly separable problem
  - Suitable for problems with many irrelevant attributes
- Limitations:
  - only linear separations
  - only converges for linearly separable data
  - not really efficient with many features

Perceptron:
- Online: can adjust to a changing target over time
- Advantages:
  - Simple
  - Guaranteed to learn a linearly separable problem
- Limitations:
  - only linear separations
  - only converges for linearly separable data
  - not really efficient with many features