Scalable machine learning for massive datasets: Fast summation algorithms
Getting good enough solutions as fast as possible
Vikas Chandrakant Raykar ([email protected]), University of Maryland, College Park
Doctoral dissertation, March 8, 2007
A class of techniques that use only good space-division schemes, called dual-tree methods, has been proposed.
Outline of the proposal
1 Motivation
2 Key computational tasks
3 Thesis contributions
  Algorithm 1: Sums of Gaussians
    Kernel density estimation
    Gaussian process regression
    Implicit surface fitting
  Algorithm 2: Sums of Hermite × Gaussians
    Optimal bandwidth estimation
    Projection pursuit
  Algorithm 3: Sums of error functions
    Ranking
4 Conclusions
Algorithm 1: Sums of Gaussians
The most commonly used kernel function in machine learning is the Gaussian kernel

K(x, y) = exp(−‖x − y‖²/h²),

where h is called the bandwidth of the kernel.
[Figure: Gaussian kernels of bandwidth h centered at the source points x_i]
Discrete Gauss Transform
G(y_j) = Σ_{i=1}^{N} q_i exp(−‖y_j − x_i‖²/h²), j = 1, …, M.
{q_i ∈ R}_{i=1,…,N} are the N source weights.
{x_i ∈ R^d}_{i=1,…,N} are the N source points.
{y_j ∈ R^d}_{j=1,…,M} are the M target points.
h ∈ R⁺ is the source scale or bandwidth.
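The quadratic-cost baseline that the thesis accelerates can be written down directly; a minimal NumPy sketch of the O(NM) evaluation (illustrative only, not the thesis code):

```python
import numpy as np

def gauss_transform_direct(X, Y, q, h):
    """Direct O(NM) evaluation of G(y_j) = sum_i q_i exp(-||y_j - x_i||^2 / h^2)."""
    # Pairwise squared distances: (M, N) matrix of ||y_j - x_i||^2.
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / h**2) @ q  # one entry of G per target point

rng = np.random.default_rng(0)
X = rng.random((100, 3))   # N = 100 source points in R^3
Y = rng.random((50, 3))    # M = 50 target points
q = rng.random(100)        # source weights
G = gauss_transform_direct(X, Y, q, h=0.4)
```

Every fast algorithm in the thesis computes an ε-exact approximation of exactly this quantity.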
Fast Gauss Transform (FGT)
ε-exact approximation algorithm.
Computational complexity is O(M + N).
Proposed by Greengard and Strain and applied successfully to a few lower-dimensional applications in mathematics and physics.
However, the algorithm has not been widely used in statistics, pattern recognition, and machine learning applications, where higher dimensions occur commonly.
Constants are important
FGT ∼ O(p^d (M + N)).
We propose a method, the Improved FGT (IFGT), which scales as ∼ O(d^p (M + N)).
[Figure: growth of the constant factor with dimension d for p = 5; FGT ∼ p^d grows exponentially with d, while IFGT ∼ d^p remains polynomial]
Brief idea of IFGT
[Figure: schematic of the IFGT quantities: cluster centers c_k, radii r, r_{x_k}, and r_{y_k}, and a target point y_j]
Step 0 Determine the parameters of the algorithm based on the specified error bound, kernel bandwidth, and data distribution.
Step 1 Subdivide the d-dimensional space using a k-center clustering based geometric data structure (O(N log K)).
Step 2 Build a p-truncated representation of the kernels inside each cluster using a set of decaying basis functions (O(Ndp)).
Step 3 Collect the influence of all the data in a neighborhood using coefficients at the cluster centers and evaluate (O(Mdp)).
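Step 1's space subdivision can be illustrated with greedy farthest-point clustering, the classical 2-approximation for the k-center problem; a sketch under that assumption, not the thesis implementation:

```python
import numpy as np

def farthest_point_clustering(X, K):
    """Greedy k-center clustering: repeatedly make the point farthest from the
    current centers a new center; the max point-to-center distance is then
    within a factor of 2 of the optimal k-center radius."""
    centers = [0]                                  # start from an arbitrary point
    dist = np.linalg.norm(X - X[0], axis=1)        # distance to nearest center so far
    for _ in range(K - 1):
        nxt = int(np.argmax(dist))                 # farthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    # Assign every point to its nearest center.
    D = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=-1)
    return np.array(centers), D.argmin(axis=1)

rng = np.random.default_rng(1)
X = rng.random((500, 3))
centers, labels = farthest_point_clustering(X, K=8)
```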
Sample result
For example, in three dimensions with 1 million training and test points [h = 0.4]:
IFGT – 6 minutes.
Direct – 34 hours.
with an error of 10⁻⁸.
FIGTree
We have also combined the IFGT with a kd-tree based nearest neighbor search algorithm.
Speedup as a function of d [h = 1.0]
FGT cannot be run for d > 3.
[Figure: runtime (sec) and max. abs. error / Q versus d for Direct, FGT, and FIGTree; both fast methods stay below the desired error]
Speedup as a function of d [h = 0.5√d]
FIGTree scales well with d.
[Figure: runtime (sec) versus d, up to d = 50, for Direct, FGT, and FIGTree]
Speedup as a function of ε
Better speedup for lower precision.
[Figure: runtime (sec) and max. abs. error / Q versus desired error ε for Direct, FIGTree, and the dual-tree method]
Speedup as a function of h
Scales well with bandwidth.
[Figure: runtime (sec) versus bandwidth h for Direct, FIGTree, and the dual-tree method, for d = 2, 3, 4, 5]
Applications
Direct application
Kernel density estimation.
Prediction in Gaussian process regression, SVM, RLS.
Embed in iterative or optimization methods
Training of kernel machines and Gaussian processes.
Computing eigenvectors in unsupervised learning tasks.
Application 1: Kernel density estimation
Estimate the density p from an i.i.d. sample x_1, …, x_N drawn from p.
The most popular method is the kernel density estimator (also known as the Parzen window estimator):

p(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x − x_i)/h)

The most widely used kernel is the Gaussian:

p(x) = (1/N) Σ_{i=1}^{N} (1/(2πh²)^{d/2}) exp(−‖x − x_i‖²/2h²). (3)
The computational cost of evaluating this sum at M points due to N data points is O(NM).
The proposed FIGTree algorithm can be used to compute the sum approximately to ε precision in O(N + M) time.
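The direct O(NM) evaluation of this estimator, which FIGTree replaces, can be sketched in a few lines of NumPy (illustrative only):

```python
import numpy as np

def gaussian_kde(X, x_eval, h):
    """Direct O(NM) Gaussian KDE: p(x) = (1/N) sum_i N(x; x_i, h^2 I)."""
    N, d = X.shape
    d2 = ((x_eval[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * h**2)).sum(axis=1) / (N * (2 * np.pi * h**2) ** (d / 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1))              # N samples from a standard normal
grid = np.linspace(-5, 5, 401).reshape(-1, 1)
p = gaussian_kde(X, grid, h=0.3)
# Sanity check: the estimate integrates to ~1 over a wide enough grid.
mass = p.sum() * (grid[1, 0] - grid[0, 0])
```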
KDE experimental results
N = M = 44,484, ε = 10⁻²
SARCOS dataset
[Table: columns d, optimal h, direct time (sec.), FIGTree time (sec.), speedup; entries not recovered]
KDE experimental results
N = M = 7000, ε = 10⁻²
[Figure: runtime (sec) versus h at d = 4 for Direct, FIGTree, and the dual-tree method]
Application 2: Gaussian process regression
Regression problem
Training data D = {x_i ∈ R^d, y_i ∈ R}_{i=1}^{N}.
Predict y for a new x.
Also get uncertainty estimates.
[Figure: one-dimensional training data (x, y) for the regression problem]
Gaussian process regression
Bayesian non-linear non-parametric regression.
The regression function is represented by an ensemble of functions, on which we place a Gaussian prior.
This prior is updated in the light of the training data.
As a result we obtain predictions together with valid estimates of uncertainty.
Gaussian process model
Model
y = f(x) + ε
ε is N(0, σ²).
f(x) is a zero-mean Gaussian process with covariance function K(x, x′).
The most common covariance function is the Gaussian.
Infer the posterior
Given the training data D and a new input x∗, our task is to compute the posterior p(f∗ | x∗, D).
Solution
The posterior is a Gaussian.
The mean is used as the prediction.
The variance is the uncertainty associated with the prediction.
[Figure: GP predictive mean with σ, 2σ, and 3σ uncertainty bands around it]
Direct Training
ξ = (K + σ²I)⁻¹ y
Direct computation of the inverse of a matrix requires O(N³) operations and O(N²) storage.
Impractical even for problems of moderate size (typically a few thousand points).
For example, N = 25,600 takes around 10 hours, assuming you have enough RAM.
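This training step and the resulting prediction can be sketched directly; a small NumPy version assuming a Gaussian covariance with unit prior variance (illustrative, not the thesis code):

```python
import numpy as np

def gp_fit_predict(X, y, X_star, h, sigma2):
    """Direct GP regression: train by solving xi = (K + sigma^2 I)^{-1} y
    (O(N^3)), then predict the posterior mean and variance at X_star."""
    def kern(A, B):
        # Gaussian covariance K(x, x') = exp(-||x - x'||^2 / h^2)
        return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / h**2)
    A = kern(X, X) + sigma2 * np.eye(len(X))
    xi = np.linalg.solve(A, y)            # the O(N^3) training step
    Ks = kern(X_star, X)
    mean = Ks @ xi                        # posterior mean at the test points
    # Posterior variance of f(x*); the prior variance K(x*, x*) is 1.
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(A, Ks.T))
    return mean, var

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
mean, var = gp_fit_predict(X, y, X, h=0.3, sigma2=1e-4)
```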
Iterative methods
(K + λI)ξ = y.
The iterative method generates a sequence of approximate solutions ξ_k which converges to the true solution ξ.
We can use the conjugate-gradient method.
Computational cost of conjugate-gradient
Requires one matrix-vector multiplication and 5N flops per iteration.
Four vectors of length N are required for storage.
Hence the computational cost reduces to O(kN²).
For example, N = 25,600 takes around 17 minutes (compare to 10 hours).
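A textbook CG loop makes the cost structure explicit: each iteration touches the matrix only through one matrix-vector product, which is exactly the step a fast summation algorithm can replace. A sketch (illustrative only):

```python
import numpy as np

def conjugate_gradient(matvec, y, tol=1e-10, max_iter=1000):
    """Solve A xi = y for symmetric positive-definite A, accessing A only
    through matvec(v) = A v: one such product per iteration plus ~5N flops."""
    xi = np.zeros_like(y)
    r = y - matvec(xi)          # residual
    p = r.copy()                # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        xi = xi + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return xi

# Small demo on a regularized Gaussian kernel matrix K + lambda*I.
rng = np.random.default_rng(3)
X = rng.random((200, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 0.5**2)
A = K + 0.1 * np.eye(200)
y = rng.random(200)
xi = conjugate_gradient(lambda v: A @ v, y)
```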
CG+FIGTree
The core computational step in each conjugate-gradient iteration is the multiplication of the matrix K with a vector, say q.
Coupled with CG, the IFGT reduces the computational cost of GP regression to O(N).
For example, N = 25,600 takes around 3 seconds (compare to 10 hours [direct] or 17 minutes [CG]).
Results on the robotarm dataset
Training time
[Figure: training time (secs) versus m on the robotarm dataset; legend: SD, SR and PP; CG+FIGTree; CG+dual-tree]
Results on the robotarm dataset
Test error
[Figure: SMSE versus m on the robotarm dataset]
Results on the robotarm dataset
Test time
[Figure: testing time (secs) versus m on the robotarm dataset]
How to choose ε for inexact CG?
The matrix-vector product may be performed in an increasingly inexact manner as the iteration progresses and still allow convergence to the solution.
[Figure: the tolerance ε used at each CG iteration, ranging between 10⁻⁷ and 10⁻⁵ over about 14 iterations]
Application 3: Implicit surface fitting
Implicit surface fitting as regression
[Figure: a point cloud with on-surface points, positive and negative off-surface points, and surface normals]
Using the proposed approach we can handle point clouds containing millions of points.
Algorithm 2: Sums of Hermite × Gaussians
The FIGTree can be used in any kernel machine where we encounter sums of Gaussians.
Most kernel methods require choosing some hyperparameters (e.g. the bandwidth h of the kernel).
Optimal procedures to choose these parameters are O(N²).
Most of these procedures involve solving some optimization problem which involves taking derivatives of kernel sums.
The derivatives of Gaussian sums involve sums of products of Hermite polynomials and Gaussians:
G_r(y_j) = Σ_{i=1}^{N} q_i H_r((y_j − x_i)/h) exp(−(y_j − x_i)²/2h²), j = 1, …, M.
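These sums can be checked against a direct O(NM) evaluation; a sketch using NumPy's physicists' Hermite polynomials (the Hermite convention assumed here may differ from the one in the thesis):

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def hermite_gauss_sum_direct(x, q, y, h, r):
    """Direct O(NM) evaluation of
    G_r(y_j) = sum_i q_i H_r((y_j - x_i)/h) exp(-(y_j - x_i)^2 / (2 h^2))."""
    t = (y[:, None] - x[None, :]) / h   # (M, N) scaled differences
    c = np.zeros(r + 1)
    c[r] = 1.0                          # coefficient vector selecting H_r
    # Note exp(-(y - x)^2 / (2 h^2)) = exp(-t^2 / 2).
    return (q * hermval(t, c) * np.exp(-t**2 / 2)).sum(axis=1)

rng = np.random.default_rng(4)
x = rng.random(300)   # N = 300 sources
q = rng.random(300)
y = rng.random(200)   # M = 200 targets
G2 = hermite_gauss_sum_direct(x, q, y, h=0.2, r=2)
```

For r = 0 this reduces to a plain sum of Gaussians, which is a convenient sanity check.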
Kernel density estimation
The most popular method for density estimation is the kernel densityestimator (KDE).
p(x) = (1/N) Σ_{i=1}^{N} (1/h) K((x − x_i)/h)
FIGTree can be directly used to accelerate KDE.
Efficient use of KDE requires choosing h optimally.
The bandwidth h is a crucial parameter
As h decreases towards 0, the number of modes increases to the number of data points and the KDE is very noisy.
As h increases towards ∞, the number of modes drops to 1, so that any interesting structure has been smeared away and the KDE just displays a unimodal pattern.
[Figure: KDE with a small bandwidth h = 0.01 versus a large bandwidth h = 0.2]
Application 1: Fast optimal bandwidth selection
The state-of-the-art method for optimal bandwidth selection for kernel density estimation scales as O(N²).
We present a fast computational technique that scales as O(N).
The core part is a fast ε-exact algorithm for kernel density derivative estimation, which reduces the computational complexity from O(N²) to O(N).
For example, for N = 409,600 points:
Direct evaluation → 12.76 hours. Fast evaluation → 65 seconds, with an error of around 10⁻¹².
Marron-Wand normal mixtures
[Figure: the fifteen Marron-Wand normal mixture test densities]
Projection pursuit
The idea of projection pursuit is to search for projections from high- to low-dimensional space that are most interesting.
1 Given N data points in a d-dimensional space, project each data point onto the direction vector a ∈ R^d, i.e., z_i = a^T x_i.
2 Compute the univariate nonparametric kernel density estimate, p, of the projected points z_i.
3 Compute the projection index I(a) based on the density estimate.
4 Locally optimize over the choice of a, to get the most interesting projection of the data.
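One evaluation of steps 1-3 can be sketched as follows. The entropy index is approximated by the sample mean of log p̂(z_i), since ∫ p log p dz = E_p[log p]. A sketch with a fixed bandwidth rather than an optimally selected one (illustrative only):

```python
import numpy as np

def entropy_index(X, a, h):
    """Project onto direction a, build a univariate Gaussian KDE of the
    projections, and estimate I(a) = integral of p(z) log p(z) dz by the
    sample average of log p(z_i)."""
    a = a / np.linalg.norm(a)
    z = X @ a                                    # step 1: project the data
    d2 = (z[:, None] - z[None, :]) ** 2          # step 2: KDE evaluated at each z_i
    p = np.exp(-d2 / (2 * h**2)).sum(axis=1) / (len(z) * np.sqrt(2 * np.pi) * h)
    return float(np.mean(np.log(p)))             # step 3: the projection index

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
I = entropy_index(X, np.array([1.0, 0.0, 0.0]), h=0.3)
```

All three quadratic-cost pieces here (the KDE, its derivative for the optimizer, and the bandwidth selection) are the sums the fast algorithms accelerate.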
Projection index
The projection index is designed to reveal specific structure in the data, like clusters, outliers, or smooth manifolds.
The entropy index based on Rényi's order-1 entropy is given by

I(a) = ∫ p(z) log p(z) dz.

Among densities of zero mean and unit variance, the one which uniquely minimizes this is the standard normal density.
Thus the projection index finds the direction which is most non-normal.
Speedup
The computational burden is reduced in the following three instances.
1 Computation of the kernel density estimate.
2 Estimation of the optimal bandwidth.
3 Computation of the first derivative of the kernel density estimate, which is required in the optimization procedure.
Image segmentation via PP
[Figure: panels (a)-(d) of the image segmentation via projection pursuit example]
Image segmentation via PP with optimal KDE took 15 minutes, while that using the direct method takes around 7.5 hours.
Algorithm 3: Sums of error functions
Another sum which we have encountered in ranking algorithms is
E(y) = Σ_{i=1}^{N} q_i erfc(y − x_i).
[Figure: the complementary error function erfc(z), decreasing from 2 to 0]
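The direct O(NM) evaluation that the fast algorithm replaces can be written with the standard library's erfc; a minimal sketch (illustrative only):

```python
import math

def erfc_sum_direct(x, q, y):
    """Direct O(NM) evaluation of E(y_j) = sum_i q_i * erfc(y_j - x_i)."""
    return [sum(qi * math.erfc(yj - xi) for xi, qi in zip(x, q)) for yj in y]

x = [0.0, 1.0, 2.0]     # source points
q = [0.5, 0.25, 0.25]   # source weights
E = erfc_sum_direct(x, q, [0.0, 100.0])
```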
Example
N = M = 51,200.
Direct evaluation takes about 18 hours.
We specify ε = 10⁻⁶.
Fast evaluation takes just 5 seconds.
The actual error is around 10⁻¹⁰.
Application 1: Ranking
For some applications, ranking or ordering the elements is more important.
Information retrieval. Movie recommendation. Medical decision making.
Compare two instances and predict which one is better.
Various ranking algorithms train the models using pairwise preference relations.
Computationally expensive to train due to the quadratic scaling in the number of pairwise constraints.
Fast ranking algorithm
We propose a new ranking algorithm.
Although our algorithm also uses pairwise comparisons, the runtime is still linear.
This is made possible by fast approximate summation of erfc functions.
The proposed algorithm is as accurate as the best available methods in terms of ranking accuracy.
Several orders of magnitude faster.
For a dataset with 4,177 examples the algorithm took around 2 seconds.
Direct took 1,736 seconds and the best competitor, RankBoost, took 63 seconds.
Conclusions
Identified the key computationally intensive primitives in machine learning.
We presented linear time algorithms.
We gave high accuracy guarantees.
Unlike methods which rely on choosing a subset of the dataset, we use all the available points and still achieve O(N) complexity.
Applied them to a few machine learning tasks.
Publications
Conference papers
A fast algorithm for learning large scale preference relations. Vikas C. Raykar, Ramani Duraiswami, and Balaji Krishnapuram. In Proceedings of AISTATS 2007, Puerto Rico, March 2007. [also submitted to PAMI]
Fast optimal bandwidth selection for kernel density estimation. Vikas C. Raykar and Ramani Duraiswami. In Proceedings of the Sixth SIAM International Conference on Data Mining, Bethesda, April 2006, pp. 524-528. [in preparation for JCGS]
The Improved Fast Gauss Transform with applications to machine learning. Vikas C. Raykar and Ramani Duraiswami. To appear in Large Scale Kernel Machines, MIT Press, 2006. [also submitted to JMLR]
Technical reports
Fast weighted summation of erfc functions. Vikas C. Raykar, R. Duraiswami, and B. Krishnapuram. CS-TR-4848, Department of Computer Science, University of Maryland, College Park.
Very fast optimal bandwidth selection for univariate kernel density estimation. Vikas C. Raykar and R. Duraiswami. CS-TR-4774, Department of Computer Science, University of Maryland, College Park.
Fast computation of sums of Gaussians in high dimensions. Vikas C. Raykar, C. Yang, R. Duraiswami, and N. Gumerov. CS-TR-4767, Department of Computer Science, University of Maryland, College Park.
Software releases under LGPL
The FIGTree algorithm.
Fast optimal bandwidth estimation.
Fast erfc summation (coming soon).