
Feature extraction and Discrimination

Seoul National University, Deep Learning, September-December 2019


Two parts of Deep Neural Network

The outcome can be binary, continuous, or multivariate.

Consider $P(y = 1 \mid x, \theta) = f_k(f_{k-1}(\cdots f_2(f_1(x;\theta_1);\theta_2)\cdots;\theta_{k-1});\theta_k)$.

First part, extracting features: $g(x;\Theta_{k-1}) = f_{k-1}(\cdots f_2(f_1(x;\theta_1);\theta_2)\cdots;\theta_{k-1})$, where $\Theta_{k-1} = (\theta_1, \theta_2, \cdots, \theta_{k-1})$.

Second part, classification: $P(y = 1 \mid x, \theta) = f_k(g(x;\Theta_{k-1});\theta_k)$, where $f_k(x;\theta_k) = \exp(x\theta_k)/(1 + \exp(x\theta_k))$.
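To make the decomposition concrete, here is a minimal NumPy sketch of a network split into a feature extractor $g(x;\Theta_{k-1})$ and a logistic last layer $f_k$; the layer sizes, random weights, and ReLU nonlinearity are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # One generic layer f_i(x; theta_i); ReLU is an illustrative choice.
    return np.maximum(0.0, W @ x + b)

# Feature extractor g(x; Theta_{k-1}): composition of the first k-1 layers.
thetas = [(rng.normal(size=(16, 8)), np.zeros(16)),
          (rng.normal(size=(4, 16)), np.zeros(4))]

def g(x):
    for W, b in thetas:
        x = layer(x, W, b)
    return x

# Last layer f_k: a logistic classifier on the extracted features.
theta_k = rng.normal(size=4)

def p_y1(x):
    z = g(x) @ theta_k
    return np.exp(z) / (1.0 + np.exp(z))

x = rng.normal(size=8)
print(p_y1(x))   # P(y = 1 | x, theta)
```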

Feature extraction and classification were conducted separately before 2012.

We briefly review common approaches to feature extraction and classification separately.


Extracting invariant features

Features should be invariant with respect to small scale changes, small rotations, blur, brightness, thickness, etc.

Averaging or integration makes the images invariant.

Figure: integration over all rotations.

An image gradient is a directional change in the intensity or color in an image.
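As a small illustration of image gradients, here is a NumPy sketch using finite differences; the 8x8 ramp image is a made-up example.

```python
import numpy as np

# Image gradients via finite differences; the ramp image is a toy example.
img = np.tile(np.arange(8, dtype=float), (8, 1))

gy, gx = np.gradient(img)            # directional change in intensity (rows, cols)
magnitude = np.hypot(gx, gy)         # gradient strength at each pixel
orientation = np.arctan2(gy, gx)     # gradient direction in radians

print(magnitude[0, 3], orientation[0, 3])
```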


Some unwanted variance is due to perspective.

Histograms can remove such variance.

Histograms can also remove important features. Localized histograms can overcome this problem.

slides from Christoph Lampert
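To contrast a global histogram with localized histograms, here is a small NumPy sketch; the random image, cell size, and bin count are arbitrary illustrative choices.

```python
import numpy as np

# Global histogram vs. localized (per-cell) histograms of pixel intensities.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))

global_hist, _ = np.histogram(img, bins=16, range=(0, 256))

cell = 8
local_hists = []
for i in range(0, img.shape[0], cell):
    for j in range(0, img.shape[1], cell):
        h, _ = np.histogram(img[i:i + cell, j:j + cell], bins=16, range=(0, 256))
        local_hists.append(h)

# Concatenated per-cell histograms keep coarse spatial layout that a single
# global histogram throws away.
feature = np.concatenate(local_hists)
print(global_hist.shape, feature.shape)   # (16,) (256,)
```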


Feature extraction

The feature extraction step is performed manually using known transformations such as the Scale-Invariant Feature Transform (SIFT), rotation-invariant feature transforms, Histograms of Oriented Gradients (HoG), etc.

Feature extraction consists of multiple steps of operations.
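As an example of such a hand-crafted, multi-step descriptor, here is a sketch that computes HoG features, assuming scikit-image is installed; the image and parameter values are illustrative.

```python
from skimage import data
from skimage.feature import hog

# Histogram of Oriented Gradients on a built-in grayscale test image.
img = data.camera()

features = hog(img,
               orientations=9,           # orientation bins per histogram
               pixels_per_cell=(8, 8),   # localized histograms over cells
               cells_per_block=(2, 2))   # block-wise normalization

print(features.shape)                    # one long feature vector for the image
```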


Feature extraction for key points

Finding key points: for natural images a large part is background. Instead of a global image representation, one may form local interest points.

The image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are taken as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.
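A minimal sketch of the Difference-of-Gaussians idea, assuming SciPy is available; the blur schedule and random image are illustrative, and only the strongest response per scale is reported instead of a full local-extremum search.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
img = rng.random((64, 64))

sigmas = [1.0, 1.6, 2.56, 4.1]                       # successive blur scales
blurred = [gaussian_filter(img, s) for s in sigmas]
dogs = [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]

# Keypoint candidates live at local maxima/minima of the DoG across space
# and scale; here we just print the strongest response at each scale.
for d in dogs:
    print(np.unravel_index(np.abs(d).argmax(), d.shape))
```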


Feature extraction: SIFT

SIFT: the goal is to extract distinctive invariant features. Locate the feature key points, calculate the gradients of the image around each key point, form localized histograms of gradients over grids, and normalize the total histogram to represent the image with a high-dimensional vector.

Extracted features are invariant to image scale and rotation, and robust to affine distortion and change in 3D viewpoint; their locality makes them robust to occlusion and clutter.
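For reference, a sketch of extracting SIFT keypoints and descriptors, assuming an opencv-python build that ships SIFT (cv2.SIFT_create); the random image is only a stand-in, and a noise image may yield few or no keypoints.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each detected keypoint gets a 128-dimensional histogram-of-gradients descriptor.
print(len(keypoints), None if descriptors is None else descriptors.shape)
```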

Discrimination

Linear discriminant analysis

A classification rule is a function $f: \mathcal{X} \to \{1, \cdots, K\}$, where $\mathcal{X}$ is the domain of $X$. For a new $X$, the prediction of $Y$ is $f(X)$.

In the case of binary classification, $F(x) = w^T x + b$ such that $f(x) = I(F(x) > 0)$ is called a linear discriminant.

The misclassification rate of $f$ is defined as $R(f) = P(Y \neq f(X))$. The rule that minimizes $R(f)$ is called the Bayes rule.

The Bayes classifier $f(x)$ is
$$f(x) = \arg\max_{j=1,\cdots,K} P(Y = j \mid X = x) = \arg\max_{j=1,\cdots,K} P(X = x \mid Y = j)\,P(Y = j).$$

LDA uses the Bayes rule assuming $P(X = x \mid Y = j) = N(\mu_j, \Sigma)$, with $\mu_j \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$.


LDA classification rule: $w^T X \geq c$, where $w = \Sigma^{-1}(\mu_0 - \mu_1)$ and $c = (\mu_0 - \mu_1)^T \Sigma^{-1}(\mu_0 + \mu_1)/2$.

Consider $\Sigma = UDU^T$, $\tilde\mu_j = D^{-\frac{1}{2}}U^T\mu_j$, $\tilde x = D^{-\frac{1}{2}}U^T x$, $\tilde x \in \mathbb{R}^p$. LDA classifies $\tilde x$ to the nearest centroid, i.e. assigns it to the class $j$ for which $\frac{1}{2}\|\tilde x - \tilde\mu_j\|_2^2 - \log \pi_j$ is minimized.
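A minimal NumPy sketch of this rule on simulated two-class Gaussian data; the means, covariance, and sample sizes are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal(mu0, Sigma, size=200)
X1 = rng.multivariate_normal(mu1, Sigma, size=200)

# Plug-in estimates of the class means and the pooled covariance.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = np.cov(np.vstack([X0 - m0, X1 - m1]).T)

w = np.linalg.solve(S, m0 - m1)                       # Sigma^{-1}(mu0 - mu1)
c = (m0 - m1) @ np.linalg.solve(S, m0 + m1) / 2       # threshold

X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]
pred = np.where(X @ w >= c, 0, 1)                     # class 0 if w^T x >= c
print("training accuracy:", (pred == y).mean())
```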


Without assuming multivariate normality, $E(w^T X \mid Y = j) = w^T\mu_j$ and $\mathrm{Var}(w^T X \mid Y = j) = w^T\Sigma w$. Let
$$J(w) = \frac{\{E(w^T X \mid Y = 0) - E(w^T X \mid Y = 1)\}^2}{w^T\Sigma w} = \frac{w^T(\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T\Sigma w}.$$

$w = \Sigma^{-1}(\mu_0 - \mu_1)$ maximizes $J(w)$. The Fisher linear discriminant function is $w^T X$, and the classification rule $w^T X \geq c$, where $c = (\mu_0 - \mu_1)^T \Sigma^{-1}(\mu_0 + \mu_1)/2$, is the same as the LDA rule.
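As a brief sketch of why this $w$ maximizes $J$ (a step filled in here, not spelled out on the slide): setting the gradient of the Rayleigh-quotient form to zero gives
$$\nabla_w J(w) \propto (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w \,(w^T\Sigma w) - \Sigma w\,\{(\mu_0 - \mu_1)^T w\}^2 = 0,$$
and since $(\mu_0 - \mu_1)^T w$ and $w^T\Sigma w$ are scalars, this forces $\Sigma w \propto (\mu_0 - \mu_1)$, i.e. $w \propto \Sigma^{-1}(\mu_0 - \mu_1)$; $J$ is invariant to rescaling of $w$, so the proportionality constant is immaterial.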


Discriminative vs. Generative

Discriminative models $P(Y = j \mid X)$ vs. generative models $P(X \mid Y = j)$.

When $P(X = x \mid Y = j) = N(\mu_j, \Sigma)$, $j = 1, 2$,
$$P(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)},$$
where $\beta_0 = \log\frac{\pi_1}{\pi_2} - \frac{\mu_1^T\Sigma^{-1}\mu_1}{2} + \frac{\mu_2^T\Sigma^{-1}\mu_2}{2}$ and $\beta_1 = (\mu_1 - \mu_2)^T\Sigma^{-1}$.
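A quick numerical check of this identity, with made-up parameter values and assuming SciPy for the Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 0.5]), np.array([-0.5, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
pi1, pi2 = 0.4, 0.6
Sinv = np.linalg.inv(Sigma)

beta0 = np.log(pi1 / pi2) - mu1 @ Sinv @ mu1 / 2 + mu2 @ Sinv @ mu2 / 2
beta1 = (mu1 - mu2) @ Sinv

x = np.array([0.3, -0.7])
num = pi1 * multivariate_normal.pdf(x, mu1, Sigma)
den = num + pi2 * multivariate_normal.pdf(x, mu2, Sigma)
bayes = num / den                                      # generative posterior
logistic = 1.0 / (1.0 + np.exp(-(beta0 + beta1 @ x)))  # logistic form
print(bayes, logistic)                                 # the two agree
```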

Logistic regression is a discriminative version of LDA.

When $f(X \mid Y = j) = \prod_{i=1,\cdots,p} f_i(x_i)$ for $X \in \mathbb{R}^p$, $P(Y = 1 \mid X)$ gives generalized additive models.

Deep learning can be viewed as logistic regression given the features of the last layer.


Challenges of deep learning

Since features are functions of many parameters in compositional forms, deep learning poses a high dimensional nonconvex model fitting problem.

High dimensional problems are often handled by exploiting sparsity or smoothness.

In the statistical literature, high dimensional convex problems have been studied.

Some high dimensional nonconvex cases have been studied for global optima (e.g. Loh and Wainwright, 2017).


Handling high-dimensional problems using sparsity

One overarching strategy of solving high-dimensional problem is ‘beton sparsity’.

Intuitively, if we have sample size $n$ and $p$ unknown parameters, with $p \gg n$, the number of samples $n$ is too small to allow for accurate estimation of the parameters, unless many of the $p$ parameters are zero. If the true model is sparse, so that only $k < n$ parameters are actually nonzero, then we can estimate the parameters. (Hastie, Tibshirani and Wainwright, SLS)

For example, for the lasso, if $\|\beta^*\|_1 = o(\sqrt{n/\log(p)})$, the lasso is known to be consistent for prediction.
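A small sketch of this "bet on sparsity", assuming scikit-learn; the dimensions, signal strength, and penalty level are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 60, 500, 5                      # p >> n, but only k true nonzeros
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:k] = 2.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.2).fit(X, y)
print("nonzero coefficients found:", int(np.sum(fit.coef_ != 0)))
print("training MSE:", float(np.mean((fit.predict(X) - y) ** 2)))
```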


Handling high-dimensional problems using smoothness

(Informally) Functions can be considered as infinitely long vectors.

Estimating an unknown function requires overcoming high dimensional problems.

One popular strategy for such problems is to assume that the unknown function belongs to a restricted function class.

Typically the function class is defined by a degree of smoothness.

SVM can be justified in two ways. One gives a sparse solution and the other utilizes a restricted function space called a Reproducing Kernel Hilbert Space (RKHS). We review SVM in these respects.

Support Vector Machine

Support vector machine: Linearly separable case

Consider a hyperplane that can separate the data $x_i \in \mathbb{R}^d$, $i = 1, \cdots, n$, with $d \gg n$.

Margin = the minimum distance of the $x_i$ over $i$ from the boundary plane.

The SVM is the classifier given by the hyperplane with maximum margin.

slide from Andrew Zisserman.

Since $w^T x + b = 0$ and $c(w^T x + b) = 0$ represent the same plane, choose the normalization such that $w^T x + b = 1$ and $w^T x + b = -1$ for the support vectors, which gives a margin of $2/\|w\|$.


$w$ is obtained by maximizing $2/\|w\|$ subject to $w^T x_i + b \geq 1$ if $y_i = 1$ and $w^T x_i + b \leq -1$ if $y_i = -1$, for $i = 1, \cdots, n$.

Equivalently, $\min_w \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for $i = 1, \cdots, n$.


Support vector machine: Optimization problem

$$\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad -\{y_i(w^T x_i + b) - 1\} \leq 0 \;\text{ for } i = 1, \cdots, n.$$

Lagrangian:
$$L(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\{y_i(w^T x_i + b) - 1\}$$
with $\alpha_i \geq 0$.

$\max_{\alpha \geq 0} L(w, \alpha) = \frac{1}{2}\|w\|^2$ if the restriction is satisfied, and $\max_{\alpha \geq 0} L(w, \alpha) = \infty$ otherwise. Therefore minimizing $\max_{\alpha \geq 0} L(w, \alpha)$ solves the original problem.

Goal: $\min_{w,b} \max_{\alpha \geq 0} L(w, \alpha)$.


Support vector machine: Equivalent optimization problems

(Primal) $\displaystyle \min_{w,b} \max_{\alpha \geq 0} \; \frac{1}{2}\|w\|^2 - \sum_j \alpha_j[(w^T x_j + b)y_j - 1]$

(Dual) $\displaystyle \max_{\alpha \geq 0} \min_{w,b} \; \frac{1}{2}\|w\|^2 - \sum_j \alpha_j[(w^T x_j + b)y_j - 1]$

Solve the dual: $\frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \Rightarrow w = \sum_j \alpha_j y_j x_j$, and $\frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0 \Rightarrow \sum_j \alpha_j y_j = 0$.

Plugging back,
$$\text{(Dual)}\quad \max_{\alpha \geq 0,\; \sum_j \alpha_j y_j = 0} \; \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$$
(solving for $\alpha$).


Support vector machine: Dual solution

Plugging back,
$$\text{(Dual)}\quad \max_{\alpha \geq 0,\; \sum_j \alpha_j y_j = 0} \; \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$$
(solving for $\alpha$).

After obtaining $\alpha$, we have $w = \sum_j \alpha_j y_j x_j$.

$b = -\big(\min_{i: y_i = 1} w^T x_i + \max_{i: y_i = -1} w^T x_i\big)/2$, placing the boundary midway between the two classes.

For a new observation $x$, classify by $y \leftarrow \mathrm{sign}\big[\sum_i \alpha_i y_i x_i^T x + b\big]$.

The solution depends on $x$ only through inner products $x_i^T x$.

Using the dual form, the dimension is reduced.
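A sketch of the dual solution using scikit-learn (assumed available); the dataset and the large $C$, used to approximate a hard margin, are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
y = 2 * y - 1                                # labels in {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only (alpha_i > 0).
w = clf.dual_coef_ @ clf.support_vectors_    # w = sum_j alpha_j y_j x_j
b = clf.intercept_
print("number of support vectors:", len(clf.support_))
print("w:", w.ravel(), "b:", b)
print("matches clf.coef_:", np.allclose(w, clf.coef_))
```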


Support vector machine: Sparse solution

Convexity of the objective function and constraints allows us to solve the dual problem. Moreover, the solution satisfies the Karush-Kuhn-Tucker (KKT) complementary slackness condition.

(Sparsity) The solution satisfies the KKT complementary slackness condition,
$$\alpha_i\{y_i(w^T x_i + b) - 1\} = 0.$$
That is, $\alpha_i > 0$ for the support vectors; for the other points the constraint is inactive and $\alpha_i = 0$.


Support vector machine: Linearly nonseparable case

slide from Andrew Zisserman.

$$\min_{w \in \mathbb{R}^d,\, \xi_i \in \mathbb{R}^+} \|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \xi_i \;\text{ for } i = 1, \cdots, n.$$

We can fulfill every constraint by choosing $\xi_i$ large enough.


$\min_{w \in \mathbb{R}^d,\, \xi_i \in \mathbb{R}^+} \|w\|^2 + C\sum_{i=1}^n \xi_i$ subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$ for $i = 1, \cdots, n$. The margin is allowed to be less than 1, namely $1 - \xi_i$, but a price is paid by increasing the objective function by $C\xi_i$.

Lagrangian:
$$L(w, \xi, \alpha, r) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(w^T x_i + b) - 1 + \xi_i\} - \sum_i r_i\xi_i$$
with $\alpha_i, r_i \geq 0$.

Going through a derivation of the dual form similar to the separable case, the dual problem is
$$\max_\alpha \sum_j \alpha_j - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i^T x_j),$$
under the constraints $0 \leq \alpha_i \leq C$ and $\sum_j \alpha_j y_j = 0$.


The KKT condition gives $\alpha_i = 0$ for points satisfying $y_i(w^T x_i + b) > 1$ without margin violation, $\alpha_i = C$ for points with margin violation, and $0 < \alpha_i < C$ for support vectors on the margin.

Link between margin maximization and risk minimization

The constraints $y_i(w^T x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ give $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$.

The learning problem becomes an unconstrained optimization over $w$:
$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C\sum_{i=1}^n \max(0,\, 1 - y_i(w^T x_i + b)).$$

$C$ is a regularization parameter: small $C$ gives a soft margin, large $C$ gives a hard margin.
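A minimal sketch of minimizing this unconstrained hinge-loss objective by subgradient descent; the dataset, $C$, step size, and iteration count are illustrative, and scikit-learn is assumed only for generating toy data.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=1)
y = 2 * y - 1
C, lr = 1.0, 0.01
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(500):
    margins = y * (X @ w + b)
    v = margins < 1                                  # points with positive hinge loss
    grad_w = 2 * w - C * (y[v, None] * X[v]).sum(axis=0)
    grad_b = -C * y[v].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```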


Classification in transformed space

Note that the optimization depends on $x$ only through inner products $x^T x$. Even if we use $z$ with higher dimension than $x$, the dual solution is of the same dimension (one $\alpha_i$ per observation). Linearly nonseparable data can become separable in a higher dimension.

$$\varphi: \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{pmatrix}$$
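A quick check that this feature map reproduces the squared inner product, i.e. $\varphi(x)^T\varphi(z) = (x^T z)^2$; the test points are arbitrary.

```python
import numpy as np

def phi(x):
    # phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), (x @ z) ** 2)   # identical up to rounding
```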


Transforming to a higher dimension


Data may become linearly separable in transformed space.

We can replace $x_i^T x_j$ with $\varphi(x_i)^T\varphi(x_j)$ in the algorithm.

The classifier for a new datum $x$ is $f(x) = w^T\varphi(x) + b = \sum_{i=1}^n \alpha_i y_i \varphi(x_i)^T\varphi(x) + b$.

Define the kernel function $k(x, z) = \varphi(x)^T\varphi(z)$. The classifier is $f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b$.


Kernel trick


Although we can compute $\varphi(x)^T\varphi(z)$ given $\varphi(\cdot)$, $k(x, z)$ may be easier to compute directly. E.g., when $d = 3$, $k(x, z) = (x^T z)^2 = \sum_{i,j=1}^d x_i x_j z_i z_j = \varphi(x)^T\varphi(z)$, where $\varphi(x) = (x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_2x_2, x_2x_3, x_3x_1, x_3x_2, x_3x_3)$. Some kernels are infinite dimensional, e.g., $k(x, z) = \exp\!\big(-\frac{1}{2\sigma^2}\|x - z\|_2^2\big)$.

If we specify $k(x, z)$, does $\varphi$ exist? Yes, if $k$ is a positive definite kernel.

Definition (Positive definite kernel): A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite kernel if and only if, for any set of points $\{x^{(1)}, \cdots, x^{(m)}\}$ in $\mathcal{X}$, the matrix $K_{i,j} = \big(k(x^{(i)}, x^{(j)})\big)_{i,j}$ is symmetric and positive semi-definite.
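An empirical illustration of this definition for the Gaussian kernel: the Gram matrix built from arbitrary points is symmetric with nonnegative eigenvalues (sample size and $\sigma$ are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))           # Gaussian-kernel Gram matrix

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding
```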


Effect of hyperparameters using the Gaussian kernel

$f(x) = \sum_{i=1}^n \alpha_i y_i \exp(-\|x - x_i\|^2/2\sigma^2) + b$, with $\sigma = 1$.

Figures: $C = 10$, $C = 100$, $C = \infty$.


$f(x) = \sum_{i=1}^n \alpha_i y_i \exp(-\|x - x_i\|^2/2\sigma^2) + b$, with $C = \infty$.

Figures: $\sigma = 1$, $\sigma = 0.25$, $\sigma = 0.1$.
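To reproduce the qualitative effect of these figures, here is a sketch with scikit-learn's RBF SVM (assumed available); note that its `gamma` corresponds to $1/(2\sigma^2)$, and the dataset and grid of values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for C in [10, 100, 1e6]:                  # larger C ~ harder margin
    for sigma in [1.0, 0.25, 0.1]:        # smaller sigma ~ more wiggly boundary
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
        print("C=%g, sigma=%.2f: %d support vectors, train accuracy %.2f"
              % (C, sigma, len(clf.support_), clf.score(X, y)))
```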


Alternative justification of SVM

SVM reduces the dimension of the problem via the dual, thanks to convexity.

Convexity also leads to the KKT complementary slackness condition and the sparsity of the solution.

The kernel trick, as presented, lacks theoretical justification (what if $\varphi(x)$ is infinite dimensional?).

An alternative approach uses the Reproducing Kernel Hilbert Space (RKHS) and the representer theorem, which reduces function estimation to a finite linear combination of kernel functions evaluated at the input values in the training dataset.

We review RKHS and the representer theorem.

Reproducing Kernel Hilbert Space

Hilbert Space

Definition (Inner product): Let $\mathcal{F}$ be a vector space over $\mathbb{R}$. A function $\langle\cdot,\cdot\rangle_{\mathcal{F}}: \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ is said to be an inner product on $\mathcal{F}$ if
1. $\langle\alpha_1 f_1 + \alpha_2 f_2, g\rangle_{\mathcal{F}} = \alpha_1\langle f_1, g\rangle_{\mathcal{F}} + \alpha_2\langle f_2, g\rangle_{\mathcal{F}}$,
2. $\langle f, g\rangle_{\mathcal{F}} = \langle g, f\rangle_{\mathcal{F}}$,
3. $\langle f, f\rangle_{\mathcal{F}} \geq 0$ and $\langle f, f\rangle_{\mathcal{F}} = 0$ if and only if $f = 0$.

Definition (Complete space): A space is complete if every Cauchy sequence in that space has a limit and this limit is in that space.

Definition (Hilbert space): A Hilbert space is a complete inner product space.


Motivation to use RKHS

The functions in a Hilbert space may not be ideal for statistical learning. Consider evaluating the function $f(x)$ at the point $x = k$. Define $g$ as $g(x) = c$ if $x = k$, and $f(x)$ otherwise. Because it differs from $f$ only at one point, $g$ is clearly still square-integrable and, moreover, $\|f - g\| = 0$.

A condition on the integrability of the function is not strong enough to use for prediction, since the resulting predictions will be bumpy.

We need a nicer function space for prediction, one where evaluating a function at each point is continuous. Such spaces are reproducing kernel Hilbert spaces.


Reproducing Kernel Hilbert Space: Kernel

Definition (Positive definite kernel): A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite kernel if and only if, for any set of points $\{x^{(1)}, \cdots, x^{(m)}\}$ in $\mathcal{X}$, the matrix $K_{i,j} = \big(k(x^{(i)}, x^{(j)})\big)_{i,j}$ is symmetric and positive semi-definite.

Theorem (Mercer): Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. Then there exists a Hilbert space $\mathcal{H}$ and a mapping $\varphi: \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$, $k(x, x') = \langle\varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.


Reproducing Kernel Hilbert Space: How to generate

We first discuss how to generate an RKHS function space and then present a formal definition.

For any positive definite kernel, $k(x, \cdot)$ is the function obtained by fixing the first coordinate at $x$. For the Gaussian kernel, $k(x, \cdot)$ is a normal density function centered at $x$. We can generate functions $f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ and $g(\cdot) = \sum_{j=1}^m \beta_j k(y_j, \cdot)$. Given two such generated functions, define $\langle f, g\rangle = \sum_{i=1}^n\sum_{j=1}^m \alpha_i\beta_j k(x_i, y_j)$. Note that $\langle k(x_i, \cdot), k(y_j, \cdot)\rangle = k(x_i, y_j)$.

This mapping turns out to satisfy the definition of an inner product.

The function space generated by the kernel $k(x, \cdot)$ is a Hilbert space.

The kernel has the reproducing property.
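A small numerical sketch of this construction with a one-dimensional Gaussian kernel and made-up points: it evaluates $\langle f, g\rangle$ and checks the reproducing property $\langle f, k(x, \cdot)\rangle = f(x)$.

```python
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs, alphas = rng.normal(size=5), rng.normal(size=5)   # defines f = sum_i alpha_i k(x_i, .)
ys, betas = rng.normal(size=3), rng.normal(size=3)    # defines g = sum_j beta_j k(y_j, .)

def f(t):
    return np.sum(alphas * k(xs, t))

# <f, g> = sum_i sum_j alpha_i beta_j k(x_i, y_j)
inner_fg = np.sum(alphas[:, None] * betas[None, :] * k(xs[:, None], ys[None, :]))

# Reproducing property: take g = k(x, .), i.e. a single point with coefficient 1.
x = 0.7
inner_f_kx = np.sum(alphas * k(xs, x))
print(inner_fg)
print(inner_f_kx, f(x))   # the last two numbers agree: <f, k(x, .)> = f(x)
```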


Reproducing Kernel Hilbert Space

Definition (Reproducing Kernel Hilbert Space): A Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ defined on a non-empty set $\mathcal{X}$ is said to be a Reproducing Kernel Hilbert Space (RKHS) if the evaluation functional $\delta_x$ is continuous $\forall x \in \mathcal{X}$.

Definition (Evaluation functional): Let $\mathcal{H}$ be a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ defined on a non-empty set $\mathcal{X}$. For a fixed point $x \in \mathcal{X}$, the map $\delta_x: \mathcal{H} \to \mathbb{R}$, $\delta_x: f \mapsto f(x)$ is called the evaluation functional at $x$.

Definition (Continuity): A function $A: \mathcal{H} \to \mathcal{G}$, where $\mathcal{H}$ and $\mathcal{G}$ are both normed linear spaces over $\mathbb{R}$, is said to be continuous at $f_0 \in \mathcal{H}$ if for every $\varepsilon > 0$ there exists a $\delta = \delta(\varepsilon, f_0) > 0$ such that $\|f - f_0\|_{\mathcal{H}} < \delta$ implies $\|Af - Af_0\|_{\mathcal{G}} < \varepsilon$.

An important property of an RKHS is that if two functions $f \in \mathcal{H}$ and $g \in \mathcal{H}$ are close in the norm of $\mathcal{H}$, then $f(x)$ and $g(x)$ are close for all $x \in \mathcal{X}$. This is due to the definition of an RKHS.


Definition (Reproducing kernel): Let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions defined on a non-empty set $\mathcal{X}$. A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if it satisfies
1. $\forall x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}$,
2. $\forall x \in \mathcal{X}$, $\forall f \in \mathcal{H}$, $\langle f, k(\cdot, x)\rangle_{\mathcal{H}} = f(x)$.


Theorem (Existence of the reproducing kernel): $\mathcal{H}$ is an RKHS (i.e., its evaluation functionals $\delta_x$ are continuous linear operators) if and only if $\mathcal{H}$ has a reproducing kernel. Denote the RKHS spanned by $k$ by $\mathcal{H}_k$.

Examples:

Linear kernel: $k(x, x') = x^T x'$

Gaussian kernel: $k(x, x') = \exp\!\big(-\frac{\|x - x'\|^2}{\sigma^2}\big)$

Polynomial kernel: $k(x, x') = (x^T x' + 1)^d$, $d \in \mathbb{N}$

(Summary) We can consider a Hilbert space where the evaluation functional is continuous. This continuity allows the existence of a reproducing kernel. The function space constructed using these reproducing kernels is an RKHS.


Representer theorem (Kimeldorf and Wahba, 1971)

(Representer Theorem) Let $l$ be a loss function on $f = \beta_0 + h$ with $h \in \mathcal{H}$, where $\mathcal{H}$ is an RKHS generated by a Mercer kernel $k$. Let $f$ minimize
$$C_n(f) = \sum_{i=1}^n l(y_i, f(x_i)) + \lambda\|h\|_{\mathcal{H}}.$$
Then
$$f(x) = b + \sum_{i=1}^n \alpha_i k(x_i, x),$$
where $b$ and $\alpha_i \in \mathbb{R}$, $i = 1, \cdots, n$.

The representer theorem reduces an infinite dimensional problem to a finite dimensional problem.


Alternate view of SVM

In classification, we would like to find $f$ to minimize the misclassification rate $E_{y,x}\, I(y \neq f(x))$. It is hard to minimize the bona fide 0-1 loss since it involves combinatorial computation. Surrogate loss functions can be used.

In logistic regression we minimize the logistic loss, $\log[1 + \exp\{-y_i f(x_i)\}]$. The SVM is related to minimizing the hinge loss, $\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i))$.
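A tiny sketch comparing the 0-1 loss with these two surrogates as functions of the margin $m = y f(x)$; the grid of margins is illustrative.

```python
import numpy as np

m = np.linspace(-2, 2, 9)                 # margins y * f(x)

zero_one = (m < 0).astype(float)          # I(y != sign(f(x)))
logistic = np.log(1 + np.exp(-m))         # logistic loss
hinge = np.maximum(0, 1 - m)              # hinge loss

for row in zip(m, zero_one, logistic, hinge):
    print("margin %+.1f  0-1 %.0f  logistic %.3f  hinge %.2f" % row)
```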


SVM: Minimizing a hinge loss in a RKHS

Consider minimizing a hinge loss, $\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i))$, where $f(x) = \beta_0 + h(x)$ and $h$ is in an RKHS $\mathcal{H}_k$ with kernel $k(x, \cdot)$. In minimizing, one can consider controlling the complexity of $h$ by penalizing the regularization term $\|h\|_{\mathcal{H}}^2$.

Due to the representer theorem, the minimizer of
$$\sum_{i=1}^n \max(0,\, 1 - y_i f(x_i)) + \lambda\|h\|_{\mathcal{H}}^2$$
has the form $f(\cdot) = \beta_0 + \sum_{i=1}^n \beta_i k(x_i, \cdot)$ with $\|h\|_{\mathcal{H}}^2 = \sum_{i,j}^n \beta_i\beta_j k(x_i, x_j)$.


Plugging in, the problem reduces to minimizing over $\beta$
$$\sum_{i=1}^n \max\Big(0,\; 1 - y_i\Big(\sum_{j=1}^n \beta_j k(x_j, x_i) + \beta_0\Big)\Big) + \lambda\sum_{i,j}^n \beta_i\beta_j k(x_i, x_j).$$

The objective function becomes the same as the dual form from maximizing the margin.

By restricting $f$ to an RKHS, an infinite dimensional problem turns into a finite dimensional problem of finding $\beta$.
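A closing sketch of this finite-dimensional problem: kernelized hinge loss plus $\lambda\,\beta^T K\beta$, minimized over $(\beta_0, \beta)$ by subgradient descent. The Gaussian kernel, $\lambda$, step size, iteration count, and the scikit-learn toy dataset are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=150, noise=0.2, random_state=0)
y = 2 * y - 1
sigma, lam, lr = 0.5, 0.1, 0.01

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))                 # Gram matrix k(x_i, x_j)

beta, beta0 = np.zeros(len(y)), 0.0
for _ in range(500):
    f = K @ beta + beta0                           # f(x_i) = sum_j beta_j k(x_j, x_i) + beta_0
    v = y * f < 1                                  # points with positive hinge loss
    grad_beta = -(K[:, v] * y[v]).sum(axis=1) + 2 * lam * (K @ beta)
    grad_beta0 = -y[v].sum()
    beta -= lr * grad_beta
    beta0 -= lr * grad_beta0

print("training accuracy:", np.mean(np.sign(K @ beta + beta0) == y))
```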
