Lecture 12: kNN and Perceptron (7/31/2019)
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
Would like to do prediction: estimate a function f(x) so that y = f(x),
where y can be:
- A real number: Regression
- Categorical: Classification
- A complex object: a ranking of items, a parse tree, etc.
Data is labeled: we have many pairs {(x, y)}
- x: a vector of real-valued features
- y: a class ({+1, -1}) or a real number
2/14/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
[Figure: mapping from X to Y; the data is split into a training and a test set]
We will talk about the following methods:
k-Nearest Neighbor (instance-based learning)
Perceptron algorithm
Support Vector Machines
Decision trees
Main question:
How to efficiently train
(build a model/find model parameters)?
Instance-based learning
Example: nearest neighbor
- Keep the whole training dataset: {(x, y)}
- A query example (vector) q comes in
- Find the closest example(s) x*
- Predict y*
Can be used both for regression and classification
Example application: recommendation systems
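The idea above can be sketched in a few lines of Python. This is a minimal brute-force 1-nearest-neighbor predictor; the function name and the toy data are illustrative, not from the lecture.

```python
# A minimal brute-force sketch of 1-nearest-neighbor prediction: keep the
# whole training set and return the label of the closest stored example.
import math

def nearest_neighbor(train, query):
    """train: list of (x, y) pairs with x a tuple of floats; returns y of closest x."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    x_star, y_star = min(train, key=lambda pair: dist(pair[0], query))
    return y_star

data = [((0.0, 0.0), -1), ((1.0, 1.0), 1), ((0.9, 0.8), 1)]
print(nearest_neighbor(data, (0.1, 0.2)))    # closest example is (0, 0) -> -1
```

This is O(n) per query; the tree structures later in the lecture exist precisely to avoid this linear scan.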
To make nearest neighbor work we need 4 things:
- Distance metric: Euclidean
- How many neighbors to look at? One
- Weighting function (optional): unused
- How to fit with the local points? Just predict the same output as the nearest neighbor
Suppose x1, ..., xm are two-dimensional:
x1 = (x11, x12), x2 = (x21, x22), ...
One can draw the nearest-neighbor regions:
d(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2
d(xi, xj) = (xi1 - xj1)^2 + (3xi2 - 3xj2)^2
- Distance metric: Euclidean
- How many neighbors to look at? k
- Weighting function (optional): unused
- How to fit with the local points? Just predict the average output among the k nearest neighbors
[Figure: decision regions for k = 9]
- Distance metric: Euclidean
- How many neighbors to look at? All of them (!)
- Weighting function: wi = exp(-d(xi, q)^2 / Kw). Nearby points to query q are weighted more strongly; Kw is the kernel width.
- How to fit with the local points? Predict the weighted average: Σi wi yi / Σi wi
[Figure: the weight wi as a function of d(xi, q), peaking at d(xi, q) = 0, for kernel widths Kw = 10, 20, 80]
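The kernel-weighted prediction above can be sketched as follows; the function name and 1-D toy usage are ours, not from the lecture.

```python
# A minimal sketch of kernel-weighted kNN regression: every training point
# votes with weight w_i = exp(-d(x_i, q)^2 / K_w), and we predict the
# weighted average of the labels.
import math

def kernel_knn_predict(train, q, Kw=1.0):
    num = den = 0.0
    for x, y in train:
        d2 = sum((xi - qi) ** 2 for xi, qi in zip(x, q))
        w = math.exp(-d2 / Kw)               # nearby points weighted more strongly
        num += w * y
        den += w
    return num / den                         # weighted average of the labels
```

With a very small Kw this behaves like 1-NN (the closest point dominates); with a large Kw it approaches the global average.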
Given: a set P of n points in R^d
Goal: given a query point q,
- NN: find the nearest neighbor p of q in P
- Range search: find one/all points in P within distance r from q
The simplest spatial structure on Earth!
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
Variants:
- split only one dimension at a time
- kd-trees (in a moment)
Range search:
- Put the root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack
Nearest neighbor:
- Start range search with r = ∞
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
Main ideas [Bentley '75]:
- Only one-dimensional splits
- Choose the split carefully, e.g.:
  - pick the dimension of largest variance and split at the median (balanced split)
  - do SVD or CUR, project, and split
- Queries: as for quadtrees
Advantages:
- no (or less) empty space
- only linear space
- query time at most min[dn, exponential(d)]
Range search:
- Put the root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack
In what order do we search the children?
- Best-Bin-First (BBF), Last-Bin-First (LBF)
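Putting the construction and the stack-based range search together, here is a small self-contained kd-tree sketch. For simplicity it alternates split dimensions rather than picking the dimension of largest variance, and searches children in stack order rather than BBF; all names are ours. Requires Python 3.8+ for math.dist.

```python
# A self-contained kd-tree sketch: median split on alternating dimensions,
# plus the stack-based range search described in the slides.
import math
from collections import namedtuple

Node = namedtuple("Node", "dim split left right points")   # leaf iff points is set

def build_kdtree(pts, depth=0, leaf_size=2):
    if len(pts) <= leaf_size:
        return Node(None, None, None, None, pts)            # leaf node
    dim = depth % len(pts[0])
    pts = sorted(pts, key=lambda p: p[dim])
    mid = len(pts) // 2                                      # median split
    return Node(dim, pts[mid][dim],
                build_kdtree(pts[:mid], depth + 1, leaf_size),
                build_kdtree(pts[mid:], depth + 1, leaf_size), None)

def range_search(root, q, r):
    found, stack = [], [root]                # put the root node on the stack
    while stack:
        node = stack.pop()
        if node.points is not None:          # leaf: examine its point(s)
            found += [p for p in node.points if math.dist(p, q) <= r]
            continue
        # push a child only if its half-space intersects the ball around q
        if q[node.dim] - r <= node.split:
            stack.append(node.left)
        if q[node.dim] + r >= node.split:
            stack.append(node.right)
    return found
```

Nearest-neighbor search follows the pattern on the earlier slide: run this with r = ∞ and shrink r whenever a closer point is found.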
Performance of a single kd-tree is low. Randomized kd-trees: build several trees
- Find the top few dimensions of largest variance
- Randomly select one of these dimensions; split on the median
- Construct many complete (i.e., one point per leaf) trees
Drawbacks:
- more memory
- an additional parameter to tune: the number of trees
Search:
- Descend through each tree until a leaf is reached
- Maintain a single priority queue for all the trees
- For approximate search, stop after a certain number of nodes have been examined
[Figure: performance of randomized kd-trees, d = 128, n = 100k; from Muja-Lowe, 2010]
Overlapped partitioning (spilling) reduces boundary errors:
- no backtracking necessary
Spilling increases tree depth:
- more memory
- slower to build
Better when the split passes through sparse regions
Lower nodes may spill too much:
- use a hybrid of spill and non-spill nodes
Designing a good spill factor is hard
- For high-dimensional data, use randomized projections (CUR) or SVD
- Use Best-Bin-First (BBF):
  - keep a priority queue of all unexplored nodes and visit them in order of their closeness to the query
  - closeness is defined by the distance to a cell boundary
- Space permitting: keep extra statistics on the lower and upper bound for each cell and use the triangle inequality to prune the space
- Use spilling to avoid backtracking
- Use lookup tables for fast distance computation
Bottom-up approach [Guttman '84]:
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum bounding rectangle (MBR) containing the objects from this group
- Repeat
Advantages:
- Supports near(est)-neighbor search (similar to before)
- Works for points and rectangles
- Avoids empty space
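Two small helpers make the R-tree operations above concrete: computing a group's MBR, and the rectangle-vs-query-ball intersection test used to decide whether a subtree can contain answers to a range query. Names are illustrative, not Guttman's code.

```python
# Illustrative 2-D helpers for the R-tree discussion.
def mbr(points):
    """Minimum bounding rectangle of a set of 2-D points: (xmin, ymin, xmax, ymax)."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def intersects(rect, q, r):
    """Does the rectangle intersect the ball of radius r around q?"""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0, q[0] - xmax)    # per-axis distance from q to the rectangle
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return dx * dx + dy * dy <= r * r
```

A range search then descends only into children whose MBR passes the `intersects` test, exactly as in the kd-tree case.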
R-trees with fan-out 4: group nearby rectangles to parent MBRs
[Figure: rectangles A-J grouped into parent MBRs]
R-trees with fan-out 4: every parent node completely covers its children
[Figure: rectangles A-J covered by parent MBRs P1-P4; the tree's leaf level holds the rectangles grouped under their parents]
R-trees with fan-out 4: every parent node completely covers its children
[Figure: the parents P1-P4 in turn become children of the root node]
Example of a range search query
[Figure: the query descends only into the parent MBRs P1-P4 that it intersects]
Example of a range search query (continued)
[Figure: within the intersected parents, only the overlapping leaf rectangles are examined]
Insertion of a point x: find the MBR intersecting with x and insert.
If a node is full, then split:
- Linear: choose two far-apart nodes as ends. Randomly choose the remaining nodes and assign them so that they require the smallest MBR enlargement
- Quadratic: choose two nodes so the dead space between them is maximized. Insert nodes so that area enlargement is minimized
[Figure: rectangles A-J and parents P1-P4 as before]
Approach [Weber-Schek-Blott '98]:
- In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
- If we need to visit so many nodes anyway, it is better to scan the whole dataset and avoid performing seeks altogether
- (1 seek = transfer of a few hundred KB)
Natural question: how to speed up the linear scan?
Answer: use approximation
- Use only i bits per dimension (and speed up the scan by a factor of 32/i)
- Identify all points which could be returned as an answer
- Verify the points using the original dataset
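A hedged sketch of the idea: keep an i-bit-per-dimension approximation of every vector, scan the approximations sequentially to rule points out via a cheap lower bound, and verify the survivors against the original data. The quantization scheme and the (deliberately loose) bound below are our simplification, not the paper's exact VA-file layout.

```python
# Scan-and-verify sketch of a VA-file-style range search.
import math

def quantize(vec, lo, hi, bits=4):
    """Map each coordinate to a 'bits'-bit cell index."""
    levels = 1 << bits
    return tuple(min(levels - 1, int((v - l) / (h - l) * levels))
                 for v, l, h in zip(vec, lo, hi))

def va_range_search(data, q, r, lo, hi, bits=4):
    levels = 1 << bits
    cell = [(h - l) / levels for l, h in zip(lo, hi)]        # cell width per dimension
    qa = quantize(q, lo, hi, bits)
    out = []
    for p in data:                                           # sequential scan
        pa = quantize(p, lo, hi, bits)
        # lower bound on d(p, q)^2 from the cell indices alone; skip if it exceeds r
        lb2 = sum((max(0, abs(pai - qai) - 1) * c) ** 2
                  for pai, qai, c in zip(pa, qa, cell))
        if lb2 <= r * r and math.dist(p, q) <= r:            # verify on original data
            out.append(p)
    return out
```

In a real VA-file only the compact cell indices are scanned from disk; the full vectors are fetched just for the candidates that survive the bound.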
Example: spam filtering
Instance space X:
- Binary feature vectors x of word occurrences
- d features (words + other things, d ≈ 100,000)
Class Y:
- y: Spam (+1), Ham (-1)
Binary classification:
- Input: vectors xi and labels yi
- Goal: find a vector w = (w1, w2, ..., wn); each wi is a real number

f(x) = 1 if w1 x1 + w2 x2 + ... + wn xn ≥ θ, and 0 otherwise

[Figure: the hyperplane w·x = θ separating positive and negative examples. Note: the threshold θ can be folded into w by augmenting x with a constant coordinate 1.]
(Very) loose motivation: the neuron
- Inputs are feature values
- Each feature has a weight wi
- Activation is the sum: f(x) = Σi wi xi = w·x - θ
- If f(x) is positive: predict +1; negative: predict -1

[Figure: inputs x1..x4 with weights w1..w4 feeding a ">0?" unit; e.g. features "viagra", "nigeria"; output Spam = +1, Ham = -1. The decision boundary is the hyperplane w·x = 0.]
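The learning rule this picture suggests can be sketched as follows: predict the sign of w·x - θ and, on a mistake, move w by y·x (and θ by -y). Function names and the toy AND-style dataset are ours, not from the lecture.

```python
# A hedged sketch of perceptron training with an explicit threshold theta.
def train_perceptron(data, d, epochs=100):
    w, theta = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:                      # y in {+1, -1}
            activation = sum(wi * xi for wi, xi in zip(w, x)) - theta
            pred = 1 if activation > 0 else -1
            if pred != y:                      # mistake: additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                theta -= y
                mistakes += 1
        if mistakes == 0:                      # a clean pass: converged
            break
    return w, theta

# Toy separable data: label +1 iff x1 AND x2
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), 1)]
w, theta = train_perceptron(data, 2)
```

Because this data is linearly separable, the convergence theorem on the next slide guarantees the loop terminates with zero training mistakes.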
Perceptron Convergence Theorem: if there exists a set of weights that are consistent (i.e., the data is linearly separable), the perceptron learning algorithm will converge.
How long would it take to converge?
Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
How to provide robustness, more expressivity?
- Separability: some setting of the parameters gets the training set perfectly correct
- Convergence: if the training set is separable, the perceptron will converge (binary case)
- Mistake bound: the number of mistakes is < 1/γ², where γ is the margin of the separator
If more than 2 classes:
- Keep a weight vector wc for each class c
- Calculate the activation for each class: f(x, c) = Σi wc,i xi = wc·x
- Highest activation wins: c = arg maxc f(x, c)

[Figure: three weight vectors w1, w2, w3 partition the plane into regions where w1·x, w2·x, or w3·x is biggest]
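The one-weight-vector-per-class scheme can be sketched as below. The update convention on a mistake (promote the true class's vector, demote the predicted one) is the standard multiclass perceptron rule; it is not spelled out on the slide, and all names and the toy data are ours.

```python
# A hedged sketch of the multiclass perceptron: argmax activation wins.
def predict(W, x):
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in W.items()}
    return max(scores, key=scores.get)        # highest activation wins

def train_multiclass(data, classes, d, epochs=100):
    W = {c: [0.0] * d for c in classes}
    for _ in range(epochs):
        clean = True
        for x, y in data:
            c = predict(W, x)
            if c != y:
                W[y] = [wi + xi for wi, xi in zip(W[y], x)]   # promote true class
                W[c] = [wi - xi for wi, xi in zip(W[c], x)]   # demote predicted class
                clean = False
        if clean:
            break
    return W

data = [((1, 0), "a"), ((0, 1), "b"), ((-1, -1), "c")]
W = train_multiclass(data, ["a", "b", "c"], 2)
```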
- Overfitting
- Regularization: if the data is not separable, the weights dance around
- Mediocre generalization: finds a barely separating solution
Winnow algorithm: similar to the perceptron, just with different updates. Learns linear threshold functions.

Initialize: θ = n, wi = 1
Prediction is 1 iff w·x ≥ θ
If no mistake: do nothing
If f(x) = 1 but w·x < θ: wi ← 2·wi (if xi = 1) (promotion)
If f(x) = 0 but w·x ≥ θ: wi ← wi / 2 (if xi = 1) (demotion)
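These updates can be sketched in a few lines; the toy target (the monotone disjunction x1 OR x2 over n = 4 binary features) and all names are ours.

```python
# A hedged sketch of Winnow: weights start at 1, threshold theta = n;
# promotions double and demotions halve the weights of the active features.
def train_winnow(data, n, epochs=100):
    w, theta = [1.0] * n, float(n)
    for _ in range(epochs):
        clean = True
        for x, y in data:                     # x in {0,1}^n, y in {0,1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred != y:
                clean = False
                factor = 2.0 if y == 1 else 0.5   # promotion / demotion
                w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
        if clean:
            break
    return w, theta

data = [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 1), ((0, 0, 1, 1), 0),
        ((0, 0, 0, 0), 0), ((1, 0, 1, 0), 1)]
w, theta = train_winnow(data, 4)
```

Note the multiplicative updates: irrelevant features are halved away quickly, which is why Winnow handles many irrelevant attributes well.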
The algorithm learns monotone functions. For the general case:
- Duplicate variables: to negate variable xi, introduce a new variable xi' = ¬xi, and learn monotone functions over 2n variables
- Balanced version: keep two weights wi+ and wi- for each variable; the effective weight is the difference

Update rule (balanced version):
If f(x) = 1 but (w+ - w-)·x < θ: wi+ ← 2·wi+, wi- ← wi- / 2, where xi = 1 (promotion)
If f(x) = 0 but (w+ - w-)·x ≥ θ: wi+ ← wi+ / 2, wi- ← 2·wi-, where xi = 1 (demotion)
Thick separator (aka perceptron with margin); applies to both Perceptron and Winnow:
- Promote if a positive example has w·x < θ + γ
- Demote if a negative example has w·x > θ - γ

[Figure: the hyperplane w·x = θ with a thick band of width γ on each side; only examples classified correctly and outside the band leave w unchanged]

Note: γ is a functional margin; its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition.
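A sketch of the thick-separator variant of the perceptron: a positive example triggers a promotion unless w·x clears θ + γ, a negative one a demotion unless w·x falls below θ - γ. Names and the toy data are ours.

```python
# A hedged sketch of the perceptron with a margin ("thick separator"):
# update whenever an example lands inside the band of width gamma.
def train_thick_perceptron(data, d, theta=0.0, gamma=1.0, epochs=200):
    w = [0.0] * d
    for _ in range(epochs):
        clean = True
        for x, y in data:                     # y in {+1, -1}
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y == 1 and s <= theta + gamma:        # promote
                w = [wi + xi for wi, xi in zip(w, x)]
                clean = False
            elif y == -1 and s >= theta - gamma:     # demote
                w = [wi - xi for wi, xi in zip(w, x)]
                clean = False
        if clean:
            break
    return w

data = [((2, 0), 1), ((0, 2), 1), ((-2, 0), -1), ((0, -2), -1)]
w = train_thick_perceptron(data, 2)
```

Compared to the plain perceptron, the returned w is forced to clear every example by at least γ, not merely classify it correctly.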
Perceptron: additive weight update algorithm [Rosenblatt, 1958]
Winnow: multiplicative weight update algorithm [Littlestone, 1988]

Examples: x ∈ {0,1}^n; Hypothesis: w ∈ R^n; prediction is 1 iff w·x ≥ θ

Perceptron (additive):
- If Class = 1 but w·x < θ: wi ← wi + 1 (if xi = 1) (promotion)
- If Class = 0 but w·x ≥ θ: wi ← wi - 1 (if xi = 1) (demotion)
- In general: w ← w + η yj xj

Winnow (multiplicative):
- If Class = 1 but w·x < θ: wi ← 2·wi (if xi = 1) (promotion)
- If Class = 0 but w·x ≥ θ: wi ← wi / 2 (if xi = 1) (demotion)
- In general: wi ← wi · exp{η yj xji}
Winnow:
- Online: can adjust to a changing target over time
- Advantages:
  - Simple
  - Guaranteed to learn a linearly separable problem
  - Suitable for problems with many irrelevant attributes
- Limitations:
  - only linear separations
  - only converges for linearly separable data
  - not really efficient with many features

Perceptron:
- Online: can adjust to a changing target over time
- Advantages:
  - Simple
  - Guaranteed to learn a linearly separable problem
- Limitations:
  - only linear separations
  - only converges for linearly separable data
  - not really efficient with many features