Computational and Statistical Aspects of Statistical Machine Learning

John Lafferty

Department of Statistics Retreat, Gleacher Center
Outline
• “Modern” nonparametric inference for high dimensional data
  – Nonparametric reduced rank regression
• Risk-computation tradeoffs
  – Covariance-constrained linear regression
• Other research and teaching activities
Context for High Dimensional Nonparametrics
Great progress in recent years on high dimensional linear models
Many problems have important nonlinear structure.
We’ve been studying “purely functional” methods for high-dimensional, nonparametric inference
• no basis expansions
• no Mercer kernels
Additive Models
Fully nonparametric models appear hopeless
• Logarithmic scaling, p = log n (e.g., “Rodeo,” Lafferty and Wasserman (2008))
Additive models are a useful compromise
• Exponential scaling, p = exp(n^c) (e.g., “SpAM,” Ravikumar, Lafferty, Liu and Wasserman (2009))
Additive Models
[Figure 23.1: Bone mineral density data. Change in BMD plotted against age, with separate panels for females and males.]
[Figure 23.2: Diabetes data. Fitted additive model components for Age, Bmi, Map, and Tc.]
Multivariate Regression
Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X).
Linear model Y = BX + ε, where B ∈ R^{q×p}.
Reduced rank regression: r = rank(B) ≤ C.
Recent work has studied properties and high dimensional scaling of reduced rank regression, where the nuclear norm ‖B‖_* is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,
‖B̂_n − B^*‖_F = O_P( √( Var(ε) r (p + q) / n ) )
Low-Rank Matrices and Convex Relaxation
low rank matrices: rank(X) ≤ t   ⟷   convex hull: ‖X‖_* ≤ t
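Behind the picture is a standard fact worth recording (context added here, not stated on the slide): intersected with the spectral-norm unit ball, the nuclear-norm ball is exactly the convex hull of the low-rank matrices,

\[
\operatorname{conv}\bigl\{ X : \operatorname{rank}(X) \le t,\ \|X\|_{\mathrm{sp}} \le 1 \bigr\}
\;=\; \bigl\{ X : \|X\|_{*} \le t,\ \|X\|_{\mathrm{sp}} \le 1 \bigr\},
\]

so the nuclear norm constraint is the tightest convex relaxation of the rank constraint on that ball.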
Nuclear Norm Regularization
Algorithms for nuclear norm minimization closely resemble iterative soft-thresholding for lasso problems.
To project a matrix B onto the nuclear norm ball ‖X‖_* ≤ t:
• Compute the SVD: B = U diag(σ) V^T
• Soft-threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
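A minimal numpy sketch of this soft-thresholding step (the function name and the threshold argument `lam` are illustrative, not from the slides):

```python
import numpy as np

def soft_threshold_nuclear(B, lam):
    """Soft-threshold the singular values of B at level lam
    (the proximal step used in nuclear norm minimization)."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - lam, 0.0)) @ Vt
```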
Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (NIPS 2012)
Nonparametric multivariate regression m(X) = (m_1(X), …, m_q(X))^T
Each component an additive model
m_k(X) = ∑_{j=1}^p m_{kj}(X_j)
What is the nonparametric analogue of ‖B‖∗ penalty?
Low Rank Functions
What does it mean for a set of functions m_1(x), …, m_q(x) to be low rank?
Let x1, . . . , xn be a collection of points.
We require that the n × q matrix M(x_{1:n}) = [m_k(x_i)] be low rank.
Stochastic setting: M = [m_k(X_i)]. Natural penalty is

(1/√n) ‖M‖_* = (1/√n) ∑_{s=1}^q σ_s(M) = ∑_{s=1}^q √( λ_s( (1/n) M^T M ) )
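As a quick numpy illustration of this empirical penalty (a sketch; `M` is assumed to be the n × q matrix of fitted function values):

```python
import numpy as np

def empirical_penalty(M):
    """(1/sqrt(n)) * ||M||_* for the n x q matrix M = [m_k(X_i)];
    equivalently the sum of sqrt eigenvalues of (1/n) M^T M."""
    n = M.shape[0]
    return np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)
```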
Population version:
|||M|||_* := ‖ √Cov(M(X)) ‖_* = ‖ Σ(M)^{1/2} ‖_*
Constrained Rank Additive Models (CRAM)
Let Σ_j = Cov(M_j). Two natural penalties:

‖Σ_1^{1/2}‖_* + ‖Σ_2^{1/2}‖_* + ··· + ‖Σ_p^{1/2}‖_*

‖( Σ_1^{1/2} Σ_2^{1/2} ··· Σ_p^{1/2} )‖_*
Population risk (first penalty):

(1/2) E‖Y − ∑_j M_j(X_j)‖_2^2 + λ ∑_j |||M_j|||_*
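A plausible empirical counterpart of this objective, useful for monitoring backfitting (a sketch; each entry of `M_list` is assumed to hold the n × q fitted values of one component):

```python
import numpy as np

def cram_objective(Y, M_list, lam):
    """(1/2n) ||Y - sum_j M_j||_F^2 + lam * sum_j (1/sqrt(n)) ||M_j||_*."""
    n = Y.shape[0]
    loss = 0.5 * np.sum((Y - sum(M_list)) ** 2) / n
    penalty = sum(np.linalg.svd(M, compute_uv=False).sum()
                  for M in M_list) / np.sqrt(n)
    return loss + lam * penalty
```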
Linear case:

∑_{j=1}^p ‖Σ_j^{1/2}‖_* = ∑_{j=1}^p ‖B_j‖_2

‖( Σ_1^{1/2} Σ_2^{1/2} ··· Σ_p^{1/2} )‖_* = ‖B‖_*
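A small numeric sanity check of the first identity for a single column, under the additional assumption Var(X_j) = 1 so that Σ_j = B_j B_j^T (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B_j = rng.standard_normal(5)                    # j-th column of B, q = 5
# With Var(X_j) = 1, Cov(B_j X_j) = B_j B_j^T, a rank-one matrix,
# whose square root is B_j B_j^T / ||B_j||_2
Sigma_j_sqrt = np.outer(B_j, B_j) / np.linalg.norm(B_j)
nuclear = np.linalg.svd(Sigma_j_sqrt, compute_uv=False).sum()
assert np.isclose(nuclear, np.linalg.norm(B_j))  # ||Sigma_j^{1/2}||_* = ||B_j||_2
```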
CRAM Backfitting Algorithm (Penalty 1)
Input: Data (X_i, Y_i), regularization parameter λ.
Iterate until convergence:
  For each j = 1, …, p:
    Compute the residual: R_j = Y − ∑_{k≠j} M_k(X_k)
    Estimate the projection P_j = E(R_j | X_j) by smoothing: P_j = S_j R_j
    Compute the SVD: (1/n) P_j P_j^T = U diag(τ) U^T
    Soft-threshold: M_j = U diag([1 − λ/√τ]_+) U^T P_j
Output: Estimator M̂(X_i) = ∑_j M_j(X_ij).
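A minimal numpy sketch of this loop, not the authors' implementation: each smoother S_j is assumed to be supplied as a function that smooths the columns of its argument against X[:, j], and a fixed iteration budget stands in for a convergence test. The SVD step below is algebraically equivalent to the eigendecomposition update above.

```python
import numpy as np

def cram_backfit(X, Y, smoothers, lam, n_iters=20):
    """Sketch of CRAM backfitting (penalty 1).
    X: (n, p) covariates; Y: (n, q) responses;
    smoothers[j](R): column-wise smooth of R against X[:, j]."""
    n, p = X.shape
    M = [np.zeros_like(Y, dtype=float) for _ in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            # Residual with the j-th component held out
            R_j = Y - sum(M[k] for k in range(p) if k != j)
            # Smoothed projection P_j = S_j R_j
            P_j = smoothers[j](R_j)
            # Soft-threshold the singular values of P_j / sqrt(n)
            U, s, Vt = np.linalg.svd(P_j / np.sqrt(n), full_matrices=False)
            M[j] = np.sqrt(n) * (U * np.maximum(s - lam, 0.0)) @ Vt
    return M
```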
Scaling of Estimation Error
Using a “double covering” technique (1/2-parametric, 1/2-nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:

sup_V ‖Σ(V) − Σ̂_n(V)‖_sp = O_P( √( (q + log(pq)) / n ) ).

This allows us to bound the excess risk of the empirical estimator relative to an oracle.
Summary
• Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.
• We’re building a toolbox for large scale, high dimensional nonparametric inference.
Computation-Risk Tradeoffs
• In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time
• Valiant’s PAC model
• Mostly negative results: it is not possible to efficiently learn in natural settings
• Claim: distinctions within polynomial time matter most
Analogy: Numerical Optimization
In numerical optimization, it is well understood how to trade off computation against speed of convergence:
• First-order methods: linear cost, linear convergence
• Quasi-Newton methods: quadratic cost, superlinear convergence
• Newton’s method: cubic cost, quadratic convergence
Are similar tradeoffs possible in statistical learning?
Hints of a Computation-Risk Tradeoff
Graph estimation:
• Our method for estimating the graph of an Ising model: n = Ω(d^3 log p) samples and T = O(p^4) time, for graphs with p nodes and maximum degree d
• Information-theoretic lower bound: n = Ω(d log p)
Statistical vs. Computational Efficiency
Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies
Rate_{H,F}(n) = inf_{m̂_n ∈ H} sup_{m ∈ F} Risk(m̂_n, m)
• H: computationally constrained hypothesis class
• F : smoothness constraints on “true” model
Computation-Risk Tradeoffs for Linear Regression
Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression
Computation-Risk Tradeoffs for Linear Regression
Standard ridge estimator solves

( (1/n) X^T X + λ_n I ) β̂_λ = (1/n) X^T Y

Sparsify the sample covariance to get the estimator

( T_t[Σ̂] + λ_n I ) β̂_{t,λ} = (1/n) X^T Y

where T_t[Σ̂] is the hard-thresholded sample covariance:

T_t([m_ij]) = [ m_ij 1(|m_ij| > t) ]

Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time O(m log^2 p)
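A minimal scipy sketch of the sparsified estimator. It uses a generic sparse solver rather than the specialized SDD solver cited above, and the function name and interface are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csc_matrix, identity
from scipy.sparse.linalg import spsolve

def thresholded_ridge(X, y, t, lam):
    """Solve (T_t[Sigma_hat] + lam * I) beta = (1/n) X^T y, where
    T_t hard-thresholds the sample covariance entrywise at level t."""
    n, p = X.shape
    Sigma = X.T @ X / n                      # sample covariance Sigma_hat
    Sigma[np.abs(Sigma) <= t] = 0.0          # keep only entries with |m_ij| > t
    A = csc_matrix(Sigma) + lam * identity(p, format="csc")
    b = X.T @ y / n
    return spsolve(A, b)                     # sparse solve; cost scales with nnz
```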
Computation-Risk Tradeoffs for Linear Regression
Dinah has recently proved that the statistical error scales as

‖β̂_{t,λ} − β^*‖ / ‖β^*‖ = O_P( ‖T_t(Σ) − Σ‖_2 ) = O(t^{1−q})

for a class of covariance matrices with rows in sparse ℓ_q balls (as studied by Bickel and Levina).
• Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff
Simulation
[Figure: simulation results, estimated risk plotted against the regularization parameter λ.]
Some Other Projects
Minhua Chen: Convex optimization for dictionary learning
Eric Janofsky: Nonparanormal component analysis
Min Xu: High dimensional conditional density and graph estimation
Courses in the Works
• Winter 2013: Nonparametric Inference (Undergraduate and Masters)
• Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)
Charles Cary: Developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.