
Feature extraction and Discrimination

Seoul National University, Deep Learning, September-December 2019


Two parts of Deep Neural Network

The outcome can be binary, continuous, or multivariate.

Consider $P(y = 1 \mid x, \theta) = f_k(f_{k-1}(\cdots f_2(f_1(x;\theta_1);\theta_2)\cdots;\theta_{k-1});\theta_k)$.

First part, extracting features: $g(x;\Theta_{k-1}) = f_{k-1}(\cdots f_2(f_1(x;\theta_1);\theta_2)\cdots;\theta_{k-1})$, where $\Theta_{k-1} = (\theta_1, \theta_2, \cdots, \theta_{k-1})$.

Second part, classification: $P(y = 1 \mid x, \theta) = f_k(g(x;\Theta_{k-1});\theta_k)$, where $f_k(x;\theta_k) = \exp(x\theta_k)/(1 + \exp(x\theta_k))$.
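To make the decomposition concrete, here is a minimal NumPy sketch of a network split into a feature extractor $g(x;\Theta_{k-1})$ and a logistic last layer $f_k$; the layer sizes, random weights, and ReLU nonlinearity are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # One generic layer f_i(x; theta_i); ReLU is an illustrative choice.
    return np.maximum(0.0, W @ x + b)

# Feature extractor g(x; Theta_{k-1}): composition of the first k-1 layers.
thetas = [(rng.normal(size=(16, 8)), np.zeros(16)),
          (rng.normal(size=(4, 16)), np.zeros(4))]

def g(x):
    for W, b in thetas:
        x = layer(x, W, b)
    return x

# Last layer f_k: a logistic classifier on the extracted features.
theta_k = rng.normal(size=4)

def p_y1(x):
    z = g(x) @ theta_k
    return np.exp(z) / (1.0 + np.exp(z))

x = rng.normal(size=8)
print(p_y1(x))   # P(y = 1 | x, theta)
```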

Feature extraction and classification were conducted separately before 2012.

We briefly review common approaches to feature extraction and classification separately.


Extracting invariant features

Features should be invariant with respect to small scale changes, small rotations, blur, brightness, thickness, etc.

Averaging or integration makes the images invariant.

Figure: integration over all rotations.

An image gradient is a directional change in the intensity or color in an image.
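As a small illustration of image gradients, here is a NumPy sketch using finite differences; the 8x8 ramp image is a made-up example.

```python
import numpy as np

# Image gradients via finite differences; the ramp image is a toy example.
img = np.tile(np.arange(8, dtype=float), (8, 1))

gy, gx = np.gradient(img)            # directional change in intensity (rows, cols)
magnitude = np.hypot(gx, gy)         # gradient strength at each pixel
orientation = np.arctan2(gy, gx)     # gradient direction in radians

print(magnitude[0, 3], orientation[0, 3])
```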


Some unwanted variance is due to perspective.

Histograms can remove such variance.

Histograms can also remove important features. Localized histograms can overcome this problem.

slides from Christoph Lampert
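To contrast a global histogram with localized histograms, here is a small NumPy sketch; the random image, cell size, and bin count are arbitrary illustrative choices.

```python
import numpy as np

# Global histogram vs. localized (per-cell) histograms of pixel intensities.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))

global_hist, _ = np.histogram(img, bins=16, range=(0, 256))

cell = 8
local_hists = []
for i in range(0, img.shape[0], cell):
    for j in range(0, img.shape[1], cell):
        h, _ = np.histogram(img[i:i + cell, j:j + cell], bins=16, range=(0, 256))
        local_hists.append(h)

# Concatenated per-cell histograms keep coarse spatial layout that a single
# global histogram throws away.
feature = np.concatenate(local_hists)
print(global_hist.shape, feature.shape)   # (16,) (256,)
```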


Feature extraction

The feature extraction step is performed manually using known transformations such as the Scale-Invariant Feature Transform (SIFT), rotation-invariant feature transforms, Histograms of Oriented Gradients (HoG), etc.

Feature extraction consists of multiple steps of operations.
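As an example of such a hand-crafted, multi-step descriptor, here is a sketch that computes HoG features, assuming scikit-image is installed; the image and parameter values are illustrative.

```python
from skimage import data
from skimage.feature import hog

# Histogram of Oriented Gradients on a built-in grayscale test image.
img = data.camera()

features = hog(img,
               orientations=9,           # orientation bins per histogram
               pixels_per_cell=(8, 8),   # localized histograms over cells
               cells_per_block=(2, 2))   # block-wise normalization

print(features.shape)                    # one long feature vector for the image
```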


Feature extraction for key points

Finding key points: for natural images a large part is background. Instead of a global image representation, one may form local interest points.

The image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are taken as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.
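A minimal sketch of the Difference-of-Gaussians idea, assuming SciPy is available; the blur schedule and random image are illustrative, and only the strongest response per scale is reported instead of a full local-extremum search.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
img = rng.random((64, 64))

sigmas = [1.0, 1.6, 2.56, 4.1]                       # successive blur scales
blurred = [gaussian_filter(img, s) for s in sigmas]
dogs = [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]

# Keypoint candidates live at local maxima/minima of the DoG across space
# and scale; here we just print the strongest response at each scale.
for d in dogs:
    print(np.unravel_index(np.abs(d).argmax(), d.shape))
```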


Feature extraction: SIFT

SIFT: the goal is to extract distinctive invariant features. Locate the feature key points, calculate the gradients of the image around each key point, form localized histograms of gradients over grids, and normalize the total histogram to represent the image with a high-dimensional vector.

Extracted features are invariant to image scale and rotation, and robust to affine distortion and change in 3D viewpoint; their locality makes them robust to occlusion and clutter.
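For reference, a sketch of extracting SIFT keypoints and descriptors, assuming an opencv-python build that ships SIFT (cv2.SIFT_create); the random image is only a stand-in, and a noise image may yield few or no keypoints.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each detected keypoint gets a 128-dimensional histogram-of-gradients descriptor.
print(len(keypoints), None if descriptors is None else descriptors.shape)
```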

Discrimination

Linear discriminant analysis

A classification rule is a function $f: \mathcal{X} \to \{1, \cdots, K\}$, where $\mathcal{X}$ is the domain of $X$. For a new $X$, the prediction of $Y$ is $f(X)$.

In the case of binary classification, $F(x) = w^T x + b$ such that $f(x) = I(F(x) > 0)$ is called a linear discriminant.

The misclassification rate of $f$ is defined as $R(f) = P(Y \neq f(X))$. The rule that minimizes $R(f)$ is called the Bayes rule.

The Bayes classifier $f(x)$ is
$$f(x) = \arg\max_{j=1,\cdots,K} P(Y = j \mid X = x) = \arg\max_{j=1,\cdots,K} P(X = x \mid Y = j)\,P(Y = j).$$

LDA uses the Bayes rule assuming $P(X = x \mid Y = j) = N(\mu_j, \Sigma)$, with $\mu_j \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$.


LDA classification rule: $w^T X \geq c$, where $w = \Sigma^{-1}(\mu_0 - \mu_1)$ and $c = (\mu_0 - \mu_1)^T \Sigma^{-1}(\mu_0 + \mu_1)/2$.

Consider $\Sigma = UDU^T$, $\tilde\mu_j = D^{-\frac{1}{2}}U^T\mu_j$, $\tilde x = D^{-\frac{1}{2}}U^T x$, $\tilde x \in \mathbb{R}^p$. LDA classifies $\tilde x$ to the nearest centroid, i.e. assigns it to the class $j$ for which $\frac{1}{2}\|\tilde x - \tilde\mu_j\|_2^2 - \log \pi_j$ is minimized.
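A minimal NumPy sketch of this rule on simulated two-class Gaussian data; the means, covariance, and sample sizes are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal(mu0, Sigma, size=200)
X1 = rng.multivariate_normal(mu1, Sigma, size=200)

# Plug-in estimates of the class means and the pooled covariance.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = np.cov(np.vstack([X0 - m0, X1 - m1]).T)

w = np.linalg.solve(S, m0 - m1)                       # Sigma^{-1}(mu0 - mu1)
c = (m0 - m1) @ np.linalg.solve(S, m0 + m1) / 2       # threshold

X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]
pred = np.where(X @ w >= c, 0, 1)                     # class 0 if w^T x >= c
print("training accuracy:", (pred == y).mean())
```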


Without assuming multivariate normality, $E(w^T X \mid Y = j) = w^T\mu_j$ and $\mathrm{Var}(w^T X \mid Y = j) = w^T\Sigma w$. Let
$$J(w) = \frac{\{E(w^T X \mid Y = 0) - E(w^T X \mid Y = 1)\}^2}{w^T\Sigma w} = \frac{w^T(\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T\Sigma w}.$$

$w = \Sigma^{-1}(\mu_0 - \mu_1)$ maximizes $J(w)$. The Fisher linear discriminant function is $w^T X$, and the classification rule $w^T X \geq c$, where $c = (\mu_0 - \mu_1)^T \Sigma^{-1}(\mu_0 + \mu_1)/2$, is the same as the LDA rule.
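As a brief sketch of why this $w$ maximizes $J$ (a step filled in here, not spelled out on the slide): setting the gradient of the Rayleigh-quotient form to zero gives
$$\nabla_w J(w) \propto (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w \,(w^T\Sigma w) - \Sigma w\,\{(\mu_0 - \mu_1)^T w\}^2 = 0,$$
and since $(\mu_0 - \mu_1)^T w$ and $w^T\Sigma w$ are scalars, this forces $\Sigma w \propto (\mu_0 - \mu_1)$, i.e. $w \propto \Sigma^{-1}(\mu_0 - \mu_1)$; $J$ is invariant to rescaling of $w$, so the proportionality constant is immaterial.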


Discriminative vs. Generative

Discriminative models $P(Y = j \mid X)$ vs. generative models $P(X \mid Y = j)$.

When $P(X = x \mid Y = j) = N(\mu_j, \Sigma)$, $j = 1, 2$,
$$P(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)},$$
where $\beta_0 = \log\frac{\pi_1}{\pi_2} - \frac{\mu_1^T\Sigma^{-1}\mu_1}{2} + \frac{\mu_2^T\Sigma^{-1}\mu_2}{2}$ and $\beta_1 = (\mu_1 - \mu_2)^T\Sigma^{-1}$.
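A quick numerical check of this identity, with made-up parameter values and assuming SciPy for the Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 0.5]), np.array([-0.5, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
pi1, pi2 = 0.4, 0.6
Sinv = np.linalg.inv(Sigma)

beta0 = np.log(pi1 / pi2) - mu1 @ Sinv @ mu1 / 2 + mu2 @ Sinv @ mu2 / 2
beta1 = (mu1 - mu2) @ Sinv

x = np.array([0.3, -0.7])
num = pi1 * multivariate_normal.pdf(x, mu1, Sigma)
den = num + pi2 * multivariate_normal.pdf(x, mu2, Sigma)
bayes = num / den                                      # generative posterior
logistic = 1.0 / (1.0 + np.exp(-(beta0 + beta1 @ x)))  # logistic form
print(bayes, logistic)                                 # the two agree
```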

Logistic regression is a discriminative version of LDA.

When $f(X \mid Y = j) = \prod_{i=1,\cdots,p} f_i(x_i)$ for $X \in \mathbb{R}^p$, $P(Y = 1 \mid X)$ gives generalized additive models.

Deep learning can be viewed as logistic regression given the features of the last layer.


Challenges of deep learning

Since features are functions of many parameters in compositional forms, deep learning poses a high dimensional nonconvex model fitting problem.

High dimensional problems are often handled by exploiting sparsity or smoothness.

In the statistical literature, high dimensional convex problems have been studied.

Some high dimensional nonconvex cases have been studied for global optima (e.g. Loh and Wainwright, 2017).


Handling high-dimensional problems using sparsity

One overarching strategy of solving high-dimensional problem is ‘beton sparsity’.

Intuitively, if we have sample size $n$ and $p$ unknown parameters, with $p \gg n$, the number of samples $n$ is too small to allow for accurate estimation of the parameters, unless many of the $p$ parameters are zero. If the true model is sparse, so that only $k < n$ parameters are actually nonzero, then we can estimate the parameters. (Hastie, Tibshirani and Wainwright, SLS)

For example, for the lasso, if $\|\beta^*\|_1 = o(\sqrt{n/\log(p)})$, the lasso is known to be consistent for prediction.
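A small sketch of this "bet on sparsity", assuming scikit-learn; the dimensions, signal strength, and penalty level are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 60, 500, 5                      # p >> n, but only k true nonzeros
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:k] = 2.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.2).fit(X, y)
print("nonzero coefficients found:", int(np.sum(fit.coef_ != 0)))
print("training MSE:", float(np.mean((fit.predict(X) - y) ** 2)))
```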


Handling high-dimensional problems using smoothness

(Informally) Functions can be considered as infinitely long vectors.

Estimating an unknown function requires overcoming high dimensional problems.

One popular strategy for such problems is to assume that the unknown function belongs to a restricted function class.

Typically the function class is defined by a degree of smoothness.

SVM can be justified in two ways. One gives a sparse solution and the other utilizes a restricted function space called a Reproducing Kernel Hilbert Space (RKHS). We review SVM in these respects.

Support Vector Machine

Support vector machine: Linearly separable case

Consider a hyperplane that can separate the data $x_i \in \mathbb{R}^d$, $i = 1, \cdots, n$, with $d \gg n$.

Margin = the minimum distance of the $x_i$ over $i$ from the boundary plane.

The SVM is the classifier given by the hyperplane with maximum margin.

slide from Andrew Zisserman.

Since $w^T x + b = 0$ and $c(w^T x + b) = 0$ represent the same plane, choose the normalization such that $w^T x + b = 1$ and $w^T x + b = -1$ for the support vectors, which gives a margin of $2/\|w\|$.


$w$ is obtained by maximizing $2/\|w\|$ subject to $w^T x_i + b \geq 1$ if $y_i = 1$ and $w^T x_i + b \leq -1$ if $y_i = -1$, for $i = 1, \cdots, n$.

Equivalently, $\min_w \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for $i = 1, \cdots, n$.


Support vector machine: Optimization problem

$$\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad -\{y_i(w^T x_i + b) - 1\} \leq 0 \;\text{ for } i = 1, \cdots, n.$$

Lagrangian:
$$L(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\{y_i(w^T x_i + b) - 1\}$$
with $\alpha_i \geq 0$.

$\max_{\alpha \geq 0} L(w, \alpha) = \frac{1}{2}\|w\|^2$ if the restriction is satisfied, and $\max_{\alpha \geq 0} L(w, \alpha) = \infty$ otherwise. Therefore minimizing $\max_{\alpha \geq 0} L(w, \alpha)$ solves the original problem.

Goal: $\min_{w,b} \max_{\alpha \geq 0} L(w, \alpha)$.


Support vector machine: Equivalent optimization problems

(Primal) $\displaystyle \min_{w,b} \max_{\alpha \geq 0} \; \frac{1}{2}\|w\|^2 - \sum_j \alpha_j[(w^T x_j + b)y_j - 1]$

(Dual) $\displaystyle \max_{\alpha \geq 0} \min_{w,b} \; \frac{1}{2}\|w\|^2 - \sum_j \alpha_j[(w^T x_j + b)y_j - 1]$

Solve the dual: $\frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \Rightarrow w = \sum_j \alpha_j y_j x_j$, and $\frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0 \Rightarrow \sum_j \alpha_j y_j = 0$.

Plugging back,
$$\text{(Dual)}\quad \max_{\alpha \geq 0,\; \sum_j \alpha_j y_j = 0} \; \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$$
(solving for $\alpha$).


Support vector machine: Dual solution

Plugging back,
$$\text{(Dual)}\quad \max_{\alpha \geq 0,\; \sum_j \alpha_j y_j = 0} \; \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$$
(solving for $\alpha$).

After obtaining $\alpha$, we have $w = \sum_j \alpha_j y_j x_j$.

$b = -\big(\min_{i: y_i = 1} w^T x_i + \max_{i: y_i = -1} w^T x_i\big)/2$, placing the boundary midway between the two classes.

For a new observation $x$, classify by $y \leftarrow \mathrm{sign}\big[\sum_i \alpha_i y_i x_i^T x + b\big]$.

The solution depends on $x$ only through inner products $x_i^T x$.

Using the dual form, the dimension is reduced.
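A sketch of the dual solution using scikit-learn (assumed available); the dataset and the large $C$, used to approximate a hard margin, are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
y = 2 * y - 1                                # labels in {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only (alpha_i > 0).
w = clf.dual_coef_ @ clf.support_vectors_    # w = sum_j alpha_j y_j x_j
b = clf.intercept_
print("number of support vectors:", len(clf.support_))
print("w:", w.ravel(), "b:", b)
print("matches clf.coef_:", np.allclose(w, clf.coef_))
```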


Support vector machine: Sparse solution

Convexity of the objective function and constraints allows us to solve the dual problem. Moreover, the solution satisfies the Karush-Kuhn-Tucker (KKT) complementary slackness condition.

(Sparsity) The solution satisfies the KKT complementary slackness condition,
$$\alpha_i\{y_i(w^T x_i + b) - 1\} = 0.$$
That is, $\alpha_i > 0$ for the support vectors; for the other points the constraint is inactive and $\alpha_i = 0$.


Support vector machine: Linearly nonseparable case

slide from Andrew Zisserman.

$$\min_{w \in \mathbb{R}^d,\, \xi_i \in \mathbb{R}^+} \|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \xi_i \;\text{ for } i = 1, \cdots, n.$$

We can fulfill every constraint by choosing $\xi_i$ large enough.


$\min_{w \in \mathbb{R}^d,\, \xi_i \in \mathbb{R}^+} \|w\|^2 + C\sum_{i=1}^n \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$ for $i = 1, \cdots, n$. The margin is allowed to be less than 1, namely $1 - \xi_i$, but a price is paid by increasing the objective function by $C\xi_i$.

Lagrangian:
$$L(w, \xi, \alpha, r) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(w^T x_i + b) - 1 + \xi_i\} - \sum_i r_i\xi_i$$
with $\alpha_i, r_i \geq 0$.

Going through a derivation of the dual form similar to the separable case, the dual problem is
$$\max_\alpha \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j),$$
under the constraints $0 \leq \alpha_i \leq C$ and $\sum_j \alpha_j y_j = 0$.


The KKT condition gives $\alpha_i = 0$ for points satisfying $y_i(w^T x_i + b) > 1$ without margin violation, $\alpha_i = C$ for points with margin violation, and $0 < \alpha_i < C$ for support vectors on the margin.

Link between margin maximization and risk minimization

The constraints $y_i(w^T x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ give $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$.

The learning problem becomes an unconstrained optimization over $w$:
$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C\sum_{i=1}^n \max(0,\, 1 - y_i(w^T x_i + b)).$$

$C$ is a regularization parameter: small $C$ gives a soft margin, large $C$ gives a hard margin.
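A minimal sketch of minimizing this unconstrained hinge-loss objective by subgradient descent; the dataset, $C$, step size, and iteration count are illustrative, and scikit-learn is assumed only for generating toy data.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=1)
y = 2 * y - 1
C, lr = 1.0, 0.01
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(500):
    margins = y * (X @ w + b)
    v = margins < 1                                  # points with positive hinge loss
    grad_w = 2 * w - C * (y[v, None] * X[v]).sum(axis=0)
    grad_b = -C * y[v].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```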


Classification in transformed space

Note that the optimization depends on $x$ only through inner products $x^T x$. Even if we use $z$ with higher dimension than $x$, the dual solution is of the same dimension (one $\alpha_i$ per observation). Linearly nonseparable data can become separable in a higher dimension.

$$\varphi: \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{pmatrix}$$
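A quick check that this feature map reproduces the squared inner product, i.e. $\varphi(x)^T\varphi(z) = (x^T z)^2$; the test points are arbitrary.

```python
import numpy as np

def phi(x):
    # phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), (x @ z) ** 2)   # identical up to rounding
```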


Transforming to a higher dimension


Data may become linearly separable in transformed space.

We can replace $x_i^T x_j$ with $\varphi(x_i)^T\varphi(x_j)$ in the algorithm.

The classifier for a new datum $x$ is $f(x) = w^T\varphi(x) + b = \sum_{i=1}^n \alpha_i y_i \varphi(x_i)^T\varphi(x) + b$.

Define the kernel function $k(x, z) = \varphi(x)^T\varphi(z)$. The classifier is $f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b$.


Kernel trick


Although we can compute $\varphi(x)^T\varphi(z)$ given $\varphi(\cdot)$, $k(x, z)$ may be easier to compute directly. E.g., when $d = 3$, $k(x, z) = (x^T z)^2 = \sum_{i,j=1}^d x_i x_j z_i z_j = \varphi(x)^T\varphi(z)$, where $\varphi(x) = (x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_2x_2, x_2x_3, x_3x_1, x_3x_2, x_3x_3)$. Some kernels are infinite dimensional, e.g., $k(x, z) = \exp\!\big(-\frac{1}{2\sigma^2}\|x - z\|_2^2\big)$.

If we specify $k(x, z)$, does $\varphi$ exist? Yes, if $k$ is a positive definite kernel.

Definition (Positive definite kernel): A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite kernel if and only if, for any set of points $\{x^{(1)}, \cdots, x^{(m)}\}$ in $\mathcal{X}$, the matrix $K_{i,j} = \big(k(x^{(i)}, x^{(j)})\big)_{i,j}$ is symmetric and positive semi-definite.
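An empirical illustration of this definition for the Gaussian kernel: the Gram matrix built from arbitrary points is symmetric with nonnegative eigenvalues (sample size and $\sigma$ are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))           # Gaussian-kernel Gram matrix

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding
```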


Effect of hyperparameters using the Gaussian kernel

$f(x) = \sum_{i=1}^n \alpha_i y_i \exp(-\|x - x_i\|^2/2\sigma^2) + b$, with $\sigma = 1$.

Figures: $C = 10$, $C = 100$, $C = \infty$.


$f(x) = \sum_{i=1}^n \alpha_i y_i \exp(-\|x - x_i\|^2/2\sigma^2) + b$, with $C = \infty$.

Figures: $\sigma = 1$, $\sigma = 0.25$, $\sigma = 0.1$.
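To reproduce the qualitative effect of these figures, here is a sketch with scikit-learn's RBF SVM (assumed available); note that its `gamma` corresponds to $1/(2\sigma^2)$, and the dataset and grid of values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for C in [10, 100, 1e6]:                  # larger C ~ harder margin
    for sigma in [1.0, 0.25, 0.1]:        # smaller sigma ~ more wiggly boundary
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
        print("C=%g, sigma=%.2f: %d support vectors, train accuracy %.2f"
              % (C, sigma, len(clf.support_), clf.score(X, y)))
```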


Alternative justification of SVM

SVM reduces the dimension of the problem via the dual, thanks to convexity.

Convexity also leads to the KKT complementary slackness condition and the sparsity of the solution.

The kernel trick, as presented, lacks theoretical justification (what if $\varphi(x)$ is infinite dimensional?).

An alternative approach uses the Reproducing Kernel Hilbert Space (RKHS) and the representer theorem, which reduces function estimation to a finite linear combination of kernel functions evaluated at the input values in the training dataset.

We review RKHS and the representer theorem.

Reproducing Kernel Hilbert Space

Hilbert Space

Definition (Inner product): Let $\mathcal{F}$ be a vector space over $\mathbb{R}$. A function $\langle\cdot,\cdot\rangle_{\mathcal{F}}: \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ is said to be an inner product on $\mathcal{F}$ if
1. $\langle\alpha_1 f_1 + \alpha_2 f_2, g\rangle_{\mathcal{F}} = \alpha_1\langle f_1, g\rangle_{\mathcal{F}} + \alpha_2\langle f_2, g\rangle_{\mathcal{F}}$,
2. $\langle f, g\rangle_{\mathcal{F}} = \langle g, f\rangle_{\mathcal{F}}$,
3. $\langle f, f\rangle_{\mathcal{F}} \geq 0$ and $\langle f, f\rangle_{\mathcal{F}} = 0$ if and only if $f = 0$.

Definition (Complete space): A space is complete if every Cauchy sequence in that space has a limit and this limit is in that space.

Definition (Hilbert space): A Hilbert space is a complete inner product space.


Motivation to use RKHS

The functions in a Hilbert space may not be ideal for statistical learning. Consider evaluating the function $f(x)$ at the point $x = k$. Define $g$ as $g(x) = c$ if $x = k$, and $f(x)$ otherwise. Because it differs from $f$ only at one point, $g$ is clearly still square-integrable and, moreover, $\|f - g\| = 0$.

A condition on the integrability of the function is not strong enough to use for prediction, since the resulting predictions will be bumpy.

We need a nicer function space for prediction, one where evaluating a function at each point is continuous. Such spaces are reproducing kernel Hilbert spaces.


Reproducing Kernel Hilbert Space: Kernel

Definition (Positive definite kernel): A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite kernel if and only if, for any set of points $\{x^{(1)}, \cdots, x^{(m)}\}$ in $\mathcal{X}$, the matrix $K_{i,j} = \big(k(x^{(i)}, x^{(j)})\big)_{i,j}$ is symmetric and positive semi-definite.

Theorem (Mercer): Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. Then there exists a Hilbert space $\mathcal{H}$ and a mapping $\varphi: \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$, $k(x, x') = \langle\varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.


Reproducing Kernel Hilbert Space: How to generate

We first discuss how to generate an RKHS function space and then present a formal definition.

For any positive definite kernel, $k(x, \cdot)$ is the function obtained by fixing the first coordinate at $x$. For the Gaussian kernel, $k(x, \cdot)$ is a normal density function centered at $x$. We can generate functions $f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ and $g(\cdot) = \sum_{j=1}^m \beta_j k(y_j, \cdot)$. Given two such generated functions, define $\langle f, g\rangle = \sum_{i=1}^n\sum_{j=1}^m \alpha_i\beta_j k(x_i, y_j)$. Note that $\langle k(x_i, \cdot), k(y_j, \cdot)\rangle = k(x_i, y_j)$.

This mapping turns out to satisfy the definition of an inner product.

The function space generated by the kernel $k(x, \cdot)$ is a Hilbert space.

The kernel has the reproducing property.
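A small numerical sketch of this construction with a one-dimensional Gaussian kernel and made-up points: it evaluates $\langle f, g\rangle$ and checks the reproducing property $\langle f, k(x, \cdot)\rangle = f(x)$.

```python
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs, alphas = rng.normal(size=5), rng.normal(size=5)   # defines f = sum_i alpha_i k(x_i, .)
ys, betas = rng.normal(size=3), rng.normal(size=3)    # defines g = sum_j beta_j k(y_j, .)

def f(t):
    return np.sum(alphas * k(xs, t))

# <f, g> = sum_i sum_j alpha_i beta_j k(x_i, y_j)
inner_fg = np.sum(alphas[:, None] * betas[None, :] * k(xs[:, None], ys[None, :]))

# Reproducing property: take g = k(x, .), i.e. a single point with coefficient 1.
x = 0.7
inner_f_kx = np.sum(alphas * k(xs, x))
print(inner_fg)
print(inner_f_kx, f(x))   # the last two numbers agree: <f, k(x, .)> = f(x)
```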


Reproducing Kernel Hilbert Space

Definition (Reproducing Kernel Hilbert Space): A Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ defined on a non-empty set $\mathcal{X}$ is said to be a Reproducing Kernel Hilbert Space (RKHS) if the evaluation functional $\delta_x$ is continuous $\forall x \in \mathcal{X}$.

Definition (Evaluation functional): Let $\mathcal{H}$ be a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ defined on a non-empty set $\mathcal{X}$. For a fixed point $x \in \mathcal{X}$, the map $\delta_x: \mathcal{H} \to \mathbb{R}$, $\delta_x: f \mapsto f(x)$ is called the evaluation functional at $x$.

Definition (Continuity): A function $A: \mathcal{H} \to \mathcal{G}$, where $\mathcal{H}$ and $\mathcal{G}$ are both normed linear spaces over $\mathbb{R}$, is said to be continuous at $f_0 \in \mathcal{H}$ if for every $\varepsilon > 0$ there exists a $\delta = \delta(\varepsilon, f_0) > 0$ such that $\|f - f_0\|_{\mathcal{H}} < \delta$ implies $\|Af - Af_0\|_{\mathcal{G}} < \varepsilon$.

An important property of an RKHS is that if two functions $f \in \mathcal{H}$ and $g \in \mathcal{H}$ are close in the norm of $\mathcal{H}$, then $f(x)$ and $g(x)$ are close for all $x \in \mathcal{X}$. This is due to the definition of an RKHS.


Definition (Reproducing kernel): Let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions defined on a non-empty set $\mathcal{X}$. A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if it satisfies
1. $\forall x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}$,
2. $\forall x \in \mathcal{X}$, $\forall f \in \mathcal{H}$, $\langle f, k(\cdot, x)\rangle_{\mathcal{H}} = f(x)$.


Theorem (Existence of the reproducing kernel): $\mathcal{H}$ is an RKHS (i.e., its evaluation functionals $\delta_x$ are continuous linear operators) if and only if $\mathcal{H}$ has a reproducing kernel. Denote the RKHS spanned by $k$ by $\mathcal{H}_k$.

Examples:

Linear kernel: $k(x, x') = x^T x'$

Gaussian kernel: $k(x, x') = \exp\!\big(-\frac{\|x - x'\|^2}{\sigma^2}\big)$

Polynomial kernel: $k(x, x') = (x^T x' + 1)^d$, $d \in \mathbb{N}$

(Summary) We can consider a Hilbert space where the evaluation functional is continuous. This continuity allows the existence of a reproducing kernel. The function space constructed using these reproducing kernels is an RKHS.


Representer theorem (Kimeldorf and Wahba, 1971)

(Representer Theorem) Let $l$ be a loss function on $f = \beta_0 + h$ with $h \in \mathcal{H}$, where $\mathcal{H}$ is an RKHS generated by a Mercer kernel $k$. Let $f$ minimize
$$C_n(f) = \sum_{i=1}^n l(y_i, f(x_i)) + \lambda\|h\|_{\mathcal{H}}.$$
Then
$$f(x) = b + \sum_{i=1}^n \alpha_i k(x_i, x),$$
where $b$ and $\alpha_i \in \mathbb{R}$, $i = 1, \cdots, n$.

The representer theorem reduces an infinite dimensional problem to a finite dimensional problem.


Alternate view of SVM

In classification, we would like to find $f$ to minimize the misclassification rate $E_{y,x}\, I(y \neq f(x))$. It is hard to minimize the bona fide 0-1 loss since it involves combinatorial computation. Surrogate loss functions can be used.

In logistic regression we minimize the logistic loss, $\log[1 + \exp\{-y_i f(x_i)\}]$. The SVM is related to minimizing the hinge loss, $\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i))$.
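A tiny sketch comparing the 0-1 loss with these two surrogates as functions of the margin $m = y f(x)$; the grid of margins is illustrative.

```python
import numpy as np

m = np.linspace(-2, 2, 9)                 # margins y * f(x)

zero_one = (m < 0).astype(float)          # I(y != sign(f(x)))
logistic = np.log(1 + np.exp(-m))         # logistic loss
hinge = np.maximum(0, 1 - m)              # hinge loss

for row in zip(m, zero_one, logistic, hinge):
    print("margin %+.1f  0-1 %.0f  logistic %.3f  hinge %.2f" % row)
```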


SVM: Minimizing a hinge loss in a RKHS

Consider minimizing a hinge loss, $\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i))$, where $f(x) = \beta_0 + h(x)$ and $h$ is in an RKHS $\mathcal{H}_k$ with kernel $k(x, \cdot)$. In minimizing, one can consider controlling the complexity of $h$ by penalizing the regularization term $\|h\|_{\mathcal{H}}^2$.

Due to the representer theorem, the minimizer of
$$\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i)) + \lambda\|h\|_{\mathcal{H}}^2$$
has the form $f(\cdot) = \beta_0 + \sum_{i=1}^n \beta_i k(x_i, \cdot)$ with $\|h\|_{\mathcal{H}}^2 = \sum_{i,j}^n \beta_i\beta_j k(x_i, x_j)$.


Plugging in, the problem reduces to minimizing over $\beta$
$$\sum_{i=1}^n \max\Big(0,\; 1 - y_i\Big(\sum_{j=1}^n \beta_j k(x_j, x_i) + \beta_0\Big)\Big) + \lambda\sum_{i,j}^n \beta_i\beta_j k(x_i, x_j).$$

The objective function becomes the same as the dual form from maximizing the margin.

By restricting $f$ to an RKHS, an infinite dimensional problem turns into a finite dimensional problem of finding $\beta$.
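A closing sketch of this finite-dimensional problem: kernelized hinge loss plus $\lambda\,\beta^T K\beta$, minimized over $(\beta_0, \beta)$ by subgradient descent. The Gaussian kernel, $\lambda$, step size, iteration count, and the scikit-learn toy dataset are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=150, noise=0.2, random_state=0)
y = 2 * y - 1
sigma, lam, lr = 0.5, 0.1, 0.01

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))                 # Gram matrix k(x_i, x_j)

beta, beta0 = np.zeros(len(y)), 0.0
for _ in range(500):
    f = K @ beta + beta0                           # f(x_i) = sum_j beta_j k(x_j, x_i) + beta_0
    v = y * f < 1                                  # points with positive hinge loss
    grad_beta = -(K[:, v] * y[v]).sum(axis=1) + 2 * lam * (K @ beta)
    grad_beta0 = -y[v].sum()
    beta -= lr * grad_beta
    beta0 -= lr * grad_beta0

print("training accuracy:", np.mean(np.sign(K @ beta + beta0) == y))
```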
