Feature Extraction: Modern Questions and Challenges
Dmitry Storcheus, Google Research
Co-authors: Mehryar Mohri (NYU Courant), Afshin Rostamizadeh (Google)

DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Apr 15, 2017

Transcript
Page 1: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Feature Extraction: Modern Questions and Challenges

Dmitry Storcheus Google Research

co-authors: Mehryar Mohri (NYU Courant), Afshin Rostamizadeh (Google)

Page 2: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Feature Extraction (FE)

Def: Feature Extraction (FE) is any algorithm that transforms raw data into features that can be used as input to a learning algorithm.

Examples

Construct bag-of-words vector from an email

Remove stopwords in a sentence

Apply PCA projection to high-dimensional data
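A minimal sketch of the three examples above, assuming scikit-learn and NumPy are available (the toy emails and data are made up for illustration):

```python
# Sketch of the three FE examples above; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

emails = ["win a free prize now", "meeting notes for the project review"]

# 1. Construct bag-of-words vectors from emails.
bow = CountVectorizer().fit_transform(emails)                # sparse term counts

# 2. Remove stopwords while building the vectors.
no_stop = CountVectorizer(stop_words="english").fit_transform(emails)

# 3. Apply a PCA projection to high-dimensional data.
X = np.random.default_rng(0).normal(size=(100, 50))          # toy high-dim data
X_reduced = PCA(n_components=2).fit_transform(X)              # 2-dim features
```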

Page 3: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Why Feature Extraction?

An excessive number of raw features is available (image classification, spam detection)

Learning algorithms are already well-studied

No ML algorithm performs stably without feature engineering, but when features are extracted well, even linear methods show great results

Companies invest in feature extraction pipelines

Page 4: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Goals of this talk

Structured survey of Feature Extraction methods

Open research problems + some solutions from Google

Empirical advice

Page 5: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Benefits of FE

Increase learning accuracy by extracting the most significant information from raw data (Guyon, 2003)

Denoising

Reduce overfitting to train set

Memory optimization

Train and inference time optimization

Page 6: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Classification of FE methods

Variable subset selection [Blum, 1997; Kohavi, 1997]: choose the best k variables out of the existing ones

Feature Construction: make new variables

Normalization (Lp unit ball or custom metric)

Nonlinear transformations applied to existing variables

Feature Crossing

Clustering [Duda, 2001]

Page 7: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Classification of FE methods

Dimensionality Reduction: project data into a low-dimensional subspace

Principal Component Analysis [Pearson, 1901]

Linear Discriminant Analysis [Fisher, 1938]

Random Projection [Hegde, 2008]

Manifold Learning: project data onto a nonlinear low-dimensional manifold

Isometric Feature Mapping [Tenenbaum, 2000]

Locally Linear Embedding [Roweis, 2000]

Laplacian Eigenmap [Belkin, 2003]

Page 8: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Classification of FE methods

Distance metric learning: learn a custom distance function on the data

Local LDA [Hastie, 1996]

Relevance Component Analysis [Bar-Hillel, 2003]

Multiple Kernel Learning [Cortes, 2009]

Representation Learning [Bengio, 2013]: extracting features at each level of a neural network

Autoencoders [Bengio, 2007]

Restricted Boltzmann Machines [Hinton, 2003]

Page 9: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Questions and challenges

Should Feature Extraction be supervised?

Should Feature Extraction be coupled with a classifier?

How to make Feature Extraction methods scalable?

What is the connection between convex and non-convex methods?

Page 10: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Questions and challenges

Should Feature Extraction be supervised?

Should Feature Extraction be coupled with a classifier?

We address these two questions using the example of nonlinear dimensionality reduction

Page 11: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Dimensionality reduction

Determine a lower-dimensional space preserving various geometric properties of the input: $X \mapsto PX$ for some projection $P$.

PCA: captures variance

Isomap: preserves distances along the manifold

MVU: preserves angles

All of these are Kernel PCA with a specific kernel function [Ham et al., 2004]

Page 12: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Principal Component Analysis

Setting:

real-valued training sample of size $m$

mean-centered data matrix $X \in \mathbb{R}^{n \times m}$

sample covariance matrix $C = \frac{1}{m} X X^\top$

matrix of top $d$ eigenvectors of $C$ is $U \in \mathbb{R}^{n \times d}$

PCA of $X$ is $U^\top X$

Page 13: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Principal Component Analysis

Pseudocode:

input: centered data matrix $X$, number of principal components $d$

1. compute $C = \frac{1}{m} X X^\top$

2. $U$ = TopEigenVectors($C$, $d$)  // LAPACK: dsyevx(), runtime $O(dm^2)$

return $U^\top X$
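A minimal NumPy sketch of this pseudocode (numpy.linalg.eigh stands in for the LAPACK call; the data matrix is assumed to be already centered):

```python
import numpy as np

def pca(X, d):
    """PCA following the pseudocode above.
    X: centered data matrix, shape (n_features, n_samples); d: number of components."""
    m = X.shape[1]
    C = (X @ X.T) / m                      # sample covariance matrix C = (1/m) X X^T
    _, eigvecs = np.linalg.eigh(C)         # eigh returns eigenvalues in ascending order
    U = eigvecs[:, -d:][:, ::-1]           # top-d eigenvectors, largest eigenvalue first
    return U.T @ X                         # projected data U^T X, shape (d, n_samples)

# Usage: X must be mean-centered along the sample axis beforehand.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
X = X - X.mean(axis=1, keepdims=True)
Z = pca(X, d=5)                            # shape (5, 200)
```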

Page 14: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Principal Component Analysis

Empirical results of PCA:

up to +10% accuracy improvement for face recognition [Yang, 2004]

+3% improvement in accuracy on drug discovery data [Janecek, 2008]

+7% accuracy on email classification with 128 principal components out of 57021 total [Gomez, 2008]

[Figure: two plots of normalized eigenvalues (0 to 1) against component index (1 to 10), illustrating how quickly the eigenvalue spectrum decays.]

Page 15: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Dimensionality reduction

Determine a lower-dimensional space preserving various geometric properties of the input: $X \mapsto PX$ for some projection $P$.

PCA: captures variance

Isomap: preserves distances along the manifold

MVU: preserves angles

All of these are Kernel PCA with a specific kernel function [Ham et al., 2004]

Page 16: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Kernel PCA

Problem set up: doing PCA in a Reproducing Kernel Hilbert Space

training sample $S = x_1, \ldots, x_m \in \mathbb{R}^n$

kernel function $K(x_i, x_j) : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$

example: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

Reproducing Kernel Hilbert Space $H_K$ and a map $\Phi_K : \mathbb{R}^n \mapsto H_K$ with $K(x_i, x_j) = \langle \Phi_K(x_i), \Phi_K(x_j) \rangle_{H_K}$

Page 17: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Kernel PCA

Problem set up:

$C : H_K \mapsto H_K$ is a sample covariance operator of kernel $K$

$U \in H_K$ is the eigenspace of $C$ corresponding to the top $d$ eigenvalues

$\Pi_U : H_K \mapsto U$ is the principal component projection

Kernel PCA of a point $x$ is $\Pi_U \Phi_K(x)$

Page 18: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Kernel PCA - pseudocode

input: sample $S = x_1, \ldots, x_m \in \mathbb{R}^n$, kernel $K$, number of principal components $d$

1. compute the normalized kernel matrix $K$, s.t. $[K]_{i,j} = \frac{1}{m} K(x_i, x_j)$

2. $U$ = TopEigenVectors($K$, $d$)

3. $\lambda$ = TopEigenValues($K$, $d$)

4. for i = 1 to m, for k = 1 to d: $[\Pi]_{k,i} = \frac{1}{\sqrt{\lambda_k}} \sum_{t=1}^{m} K(x_i, x_t) \, U_{t,k}$

return the matrix of coordinates $\Pi$
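A minimal NumPy sketch of this pseudocode, using the Gaussian kernel from the earlier slide (kernel centering is omitted here, as it is on the slide; sigma is an illustrative parameter):

```python
import numpy as np

def gaussian_kernel(S, sigma=1.0):
    """Kernel matrix with entries K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(S**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * S @ S.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_pca(S, d, sigma=1.0):
    """Kernel PCA coordinates following the pseudocode above.
    S: sample, shape (m, n); returns a (d, m) matrix of coordinates."""
    m = S.shape[0]
    K_raw = gaussian_kernel(S, sigma)
    K = K_raw / m                                  # normalized kernel matrix [K]_ij = K(x_i, x_j)/m
    eigvals, eigvecs = np.linalg.eigh(K)
    lam = eigvals[-d:][::-1]                       # top-d eigenvalues
    U = eigvecs[:, -d:][:, ::-1]                   # top-d eigenvectors
    # [Pi]_{k,i} = (1/sqrt(lam_k)) * sum_t K(x_i, x_t) U_{t,k}
    return (K_raw @ U / np.sqrt(lam)).T

# Usage
S = np.random.default_rng(0).normal(size=(200, 10))
coords = kernel_pca(S, d=5)                        # shape (5, 200)
```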

Page 19: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Kernel PCA

Empirical results

[Schölkopf, 1998]: on the USPS dataset (handwritten digits), a linear SVM achieves 91% accuracy, while Kernel PCA + SVM gives 96% accuracy using only 20% of the total principal components

Page 20: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Generalized Nonlinear Dimensionality Reduction

Space: let $H_K$ be the reproducing kernel Hilbert space of some kernel $K$ and let $\Phi : x \mapsto K(x, \cdot)$ be the feature map of kernel $K$

Projection: let $P$ be some orthogonal projection in $H_K$

Dimensionality reduction applied to a point $x \in X$ is $P\Phi(x)$

Page 21: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Open problems

How to choose the kernel function K and the projection P?

A different kernel means a different nonlinear dimensionality reduction method

Can we automatically adjust the kernel function for different problems?

Can this be done in a supervised manner?

Page 22: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

How to choose P and K?

Traditionally the choice is based on optimizing geometric properties (e.g. PCA, Isomap, MVU).

However, is it a good choice when we use dimensionality reduction for feature generation, e.g. when we want to use the reduced data as input to a learning problem?

Page 23: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Learning on reduced data: a bad example

[Figure: two scatter plots over axes $u_1$ (horizontal) and $u_2$ (vertical), both ranging from -3 to 3: the original two-class data, and the same data after projection onto $u_1$.]

$C = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}, \quad u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad u_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \lambda_1 = 4, \quad \lambda_2 = 1$

The top principal direction $u_1$ carries most of the variance, but the class label is determined by the $u_2$ coordinate, so projecting onto $u_1$ yields a classification error of 50%.
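A sketch reproducing this example, assuming scikit-learn (the synthetic data and labels are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Data with covariance C = diag(4, 1); the label depends only on the u2 coordinate.
X = rng.normal(size=(1000, 2)) * np.array([2.0, 1.0])
y = (X[:, 1] > 0).astype(int)

# Project onto the top principal component u1 = (1, 0): keeps the high-variance
# direction but discards the only direction that carries label information.
X_reduced = X[:, [0]]

clf = LogisticRegression().fit(X_reduced, y)
print("accuracy on reduced data:", clf.score(X_reduced, y))  # ~0.5, i.e. ~50% error
```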

Page 24: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Coupled Nonlinear Dimensionality Reduction

Learner receives: $p$ PSD kernels $K_1, \ldots, K_p$

Learner constructs a space $H$, which is the reproducing space of the mixed kernel $K_\mu = \sum_{k=1}^{p} \mu_k K_k$

Learner constructs a projection $\Pi_U$, which is the Kernel PCA projection with kernel $K_\mu$
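As a small illustration (a sketch only, not the algorithm of [Storcheus, 2015] itself): given $p$ base kernel matrices and weights $\mu$, the mixed kernel matrix is simply their weighted sum. The base kernels and weights below are illustrative.

```python
import numpy as np

def mixed_kernel(kernel_matrices, mu):
    """K_mu = sum_k mu_k * K_k for PSD base kernel matrices K_k and weights mu >= 0."""
    return sum(m_k * K_k for m_k, K_k in zip(mu, kernel_matrices))

# Usage with two toy base kernels (linear and Gaussian) on the same sample.
rng = np.random.default_rng(0)
S = rng.normal(size=(100, 5))
K_lin = S @ S.T
sq = np.sum(S**2, axis=1)
K_rbf = np.exp(-(sq[:, None] + sq[None, :] - 2 * S @ S.T) / 2.0)
K_mu = mixed_kernel([K_lin, K_rbf], mu=[0.3, 0.7])
```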

Page 25: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Hypothesis set

Consists of linear maps in the projected subspace of $H$:

$H = \left\{ x \mapsto \langle w, \Pi_U \Phi(x) \rangle_H : \|w\|_H \le 1, \; \mu \in M \right\}$

Parametrized by $w$ and $\mu$

Feature map $\Phi : X \mapsto H$ is the feature mapping of kernel $K_\mu$

Projection $\Pi_U : H \mapsto U \subseteq H$ is based on the unlabeled set

Regularization: $M$ is a convex set from which $\mu$ is selected

Page 26: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Objective function

For any convex loss function $L(w, \mu)$, the coupled training of dimensionality reduction + classifier is

$h = \operatorname*{argmin}_{\|w\| \le 1, \; \mu \in M} L(w, \mu)$

Page 27: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Supervised Nonlinear Dimensionality Reduction - conclusion [Storcheus, 2015]

Learning error depends on log(p), so one can take many base kernels

Joint learning of the kernel and the separating hyperplane is suggested

Automatic learning of the dimensionality reduction method by learning a kernel function

Coupled algorithm:

Structural risk minimization

Fit $w$ and $\mu$ by minimizing the training loss

Page 28: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Thank you for your attention

Page 29: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

References-I

1. Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." The Journal of Machine Learning Research 3 (2003): 1157-1182.

2. Blum, Avrim L., and Pat Langley. "Selection of relevant features and examples in machine learning." Artificial intelligence 97, no. 1 (1997): 245-271.

3. Duda, Richard O., Peter E. Hart, and David G. Stork. "Unsupervised learning and clustering." Pattern classification (2001): 519-598.

4. Hegde, Chinmay, Michael Wakin, and Richard Baraniuk. "Random projections for manifold learning." In Advances in neural information processing systems, pp. 641-648. 2008.

5. Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. "A global geometric framework for nonlinear dimensionality reduction." Science 290, no. 5500 (2000): 2319-2323.

6. Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290, no. 5500 (2000): 2323-2326.

7. Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural computation 15, no. 6 (2003): 1373-1396.

8. Cortes, Corinna, Mehryar Mohri, and Afshin Rostamizadeh. "Learning non-linear combinations of kernels." In Advances in Neural Information Processing Systems, pp. 396-404. 2009.

Page 30: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

References-II

1. Yang, Jian, David Zhang, Alejandro F. Frangi, and Jing-yu Yang. "Two-dimensional PCA: a new approach to appearance-based face representation and recognition." Pattern Analysis and Machine Intelligence, IEEE Transactions on 26, no. 1 (2004): 131-137.

2. Janecek, Andreas GK, and Wilfried N. Gansterer. "A comparison of classification accuracy achieved with wrappers, filters and PCA." In Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery. 2008.

3. Gomez, Juan Carlos, and Marie-Francine Moens. "PCA document reconstruction for email classification." Computational Statistics & Data Analysis 56, no. 3 (2012): 741-751.

4. Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. "Kernel principal component analysis." In Artificial Neural Networks—ICANN'97, pp. 583-588. Springer Berlin Heidelberg, 1997.

5. Storcheus, Dmitry, Mehryar Mohri, and Afshin Rostamizadeh. "Foundations of Coupled Nonlinear Dimensionality Reduction." arXiv preprint arXiv:1509.08880 (2015).

6. Bengio, Yoshua, Aaron Courville, and Pierre Vincent. "Representation learning: A review and new perspectives." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35, no. 8 (2013): 1798-1828.

7. Ham, Jihun, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. "A kernel view of the dimensionality reduction of manifolds." In Proceedings of the twenty-first international conference on Machine learning, p. 47. ACM, 2004.

Page 31: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Back-up slides

Page 32: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

General Framework

Define a learning scenario that describes every FE method

Implement this scenario with a Kernel PCA algorithm that generalizes most feature extraction methods

Linear dimensionality reduction, manifold learning, shallow neural nets

Page 33: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

FE learning scenario

Learner receives:

(partially) labeled training sample of size $m$: $\tilde{S} = ((\tilde{x}_1, y_1), \ldots, (\tilde{x}_m, y_m))$

drawn over the input space $\tilde{X}$

Feature space $X$

Feature Extraction: family of functions $F : \tilde{X} \mapsto X$

Classification: family of functions $H : X \mapsto \{-1, 1\}$

Page 34: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

FE Learning scenario

FE stage:

for any feature extraction function $f \in F$, the features extracted from an input data point $\tilde{x}_i$ are $x_i = f(\tilde{x}_i)$

apply $f$ to the training set $\tilde{S}$ to get $S = ((x_1, y_1), \ldots, (x_m, y_m))$

Classification stage:

apply $h \in H$ to the sample $S$ with extracted features

Testing stage:

the classification of a test point $x$ is the composition $h(f(x))$
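A minimal sketch of this scenario, assuming scikit-learn: PCA plays the role of the feature extraction function $f$, a linear SVM plays the role of $h$, and a test point is classified by the composition $h(f(x))$. The data and model choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(500, 100))                 # raw inputs from the input space
y = (X_tilde[:, :3].sum(axis=1) > 0).astype(int)      # toy labels

f = PCA(n_components=10).fit(X_tilde)                 # FE stage: fit f on the training set
S = f.transform(X_tilde)                              # extracted features x_i = f(x_tilde_i)

h = LinearSVC().fit(S, y)                             # classification stage: fit h on S

x_test = rng.normal(size=(1, 100))
prediction = h.predict(f.transform(x_test))           # testing stage: h(f(x))
```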

Page 35: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

FE Learning problem

The learning problem is to find the best $f^\star \in F$ and $h^\star \in H$

How?

By minimizing a loss function on the training set

Page 36: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

FE Learning problem

Types of loss functions w.r.t. feature extraction

unsupervised

supervised decoupled

supervised coupled

Page 37: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Different loss functions - different methods

For example, if $F$ are orthogonal projections and $X \subseteq \mathbb{R}^d$:

if $L$ is the reconstruction error, then $f^\star$ is the PCA projection

if $L$ measures the distortion of pairwise distances, then $f^\star$ is multidimensional scaling

if $L$ is the variance within classes, then $f^\star$ is LDA
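For instance, the reconstruction-error loss above can be written directly (a sketch; $X$ is a centered data matrix with samples as columns and $P$ has orthonormal columns):

```python
import numpy as np

def reconstruction_error(X, P):
    """Mean squared reconstruction error of projecting X onto the span of P.
    X: centered data, shape (n, m); P: orthonormal basis of the subspace, shape (n, d)."""
    X_hat = P @ (P.T @ X)                 # project onto the subspace and map back
    return np.linalg.norm(X - X_hat)**2 / X.shape[1]

# Minimizing this loss over orthogonal projections recovers the PCA projection.
```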

Page 38: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Ongoing work: algorithm

Page 39: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Computation of hypothesis

Labeled training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$.

Compute $h(x_n)$ for $x_n \in S$.

Indexing: $k \in [1, p]$ indexes over the base kernels and $j \in [1, u]$ indexes over eigenvalues.

Eigenvalues of $C_U$ have the form $\mu_k \lambda_j(C_{U,k})$.

Binary selection variables: $\xi_{k,j} = 1$ if the eigenspace of $\mu_k \lambda_j(C_{U,k})$ is included in the projection.

Coordinates of $w \in H$ in the range of $\Pi_U$ are $z_{k,j}$.

Other constants relative to the training set are $c_{k,j}(x_n)$.

Page 40: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Computation of hypothesis

Numerical expression:

$h(x_n) = \sum_{k,j} \xi_{k,j} \, c_{k,j}(x_n) \, z_{k,j} \sqrt{\mu_k}$

Optimize for $\mu$ and $z$.

Constraints: $\|z\|_2 \le 1$, $\mu \in M$

Page 41: DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

Optimization problem

Minimize a training loss over the convex set $\|z\|_2 \le 1$, $\mu \in M$:

$\min_{\mu, z} \; \frac{1}{m} \sum_{n} L\left( \sum_{k,j} \xi_{k,j} \, c_{k,j}(x_n) \, z_{k,j} \sqrt{\mu_k}, \; y_n \right)$
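A minimal NumPy sketch of evaluating this expression for a single point $x_n$; all quantities ($\xi$, $c$, $z$, $\mu$) are assumed to be precomputed, and the shapes below are illustrative.

```python
import numpy as np

p, u = 4, 6                                        # p base kernels, u eigenvalues per kernel
rng = np.random.default_rng(0)

xi = rng.integers(0, 2, size=(p, u))               # binary selection variables xi_{k,j}
c = rng.normal(size=(p, u))                        # constants c_{k,j}(x_n) for one point x_n
z = rng.normal(size=(p, u))
z /= max(np.linalg.norm(z), 1.0)                   # enforce ||z||_2 <= 1
mu = np.full(p, 1.0 / p)                           # kernel weights mu in the convex set M

# h(x_n) = sum_{k,j} xi_{k,j} c_{k,j}(x_n) z_{k,j} sqrt(mu_k)
h_xn = np.sum(xi * c * z * np.sqrt(mu)[:, None])
```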