Robust Optimization and Data Approximation in Machine Learning
Gia Vinh Anh Pham
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2015-216
http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-216.html
December 1, 2015
Robust Optimization and Data Approximation in
Machine Learning
Gia Vinh Anh Pham
Electrical Engineering and Computer Sciences
University of California at Berkeley
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Robust Optimization and Data Approximation in Machine Learning
by
Gia Vinh Anh Pham
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Laurent El Ghaoui, Chair
Professor Bin Yu
Professor Ming Gu
The dissertation of Gia Vinh Anh Pham is approved.
University of California, Berkeley
Table 3.1. Top 10 keywords obtained for topic comp.graphics as the thresholding level increases. The top keywords do not change significantly as more thresholding is applied.
problem faster (blue line) and use less memory (red line) to store the data by sacrificing a minor amount of accuracy (black line).
3.3.2 20 Newsgroup Dataset
This dataset consists of Usenet articles collected by Lang from 20 different newsgroups (e.g.
comp.graphics, comp.windows.x, misc.forsale, sci.space, talk.religion.misc, etc.) and contains
20,000 documents in total. Except for a small fraction of the articles, each document belongs
to exactly one newsgroup. The task is to learn which newsgroup an article was posted to.
The dataset is split so that the training set contains 2/3 of the examples and the test set
contains the rest. The dimension of the problem is 61K (the number of words in the
dictionary). For each word in each document, we compute the corresponding TF-IDF score
and then threshold the data based on this score.
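As an illustration of this preprocessing step, the following is a minimal sketch of TF-IDF thresholding (our own example, using scikit-learn's TfidfVectorizer; the threshold value 0.1 is an arbitrary placeholder, not the level used in the experiments):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def threshold_tfidf(docs, level=0.1):
    """Compute TF-IDF scores and zero out entries below `level`.

    `level` is a hypothetical example value; the experiments sweep over
    a range of thresholding levels.
    """
    X = TfidfVectorizer().fit_transform(docs)  # sparse CSR, n_docs x n_words
    X.data[X.data < level] = 0.0               # drop small TF-IDF entries
    X.eliminate_zeros()                        # physically remove stored zeros
    return X

# The thresholded matrix is cheaper to store and faster to train on.
docs = ["graphics rendering with opengl", "for sale: used bike", "space shuttle launch"]
X_thr = threshold_tfidf(docs, level=0.1)
print(X_thr.nnz, "non-zeros after thresholding")
```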
In Table 3.1, we show the list of top keywords obtained for topic comp.graphics as we vary
the thresholding level. As can be seen, when we increase the thresholding level the top
features (words) do not change significantly. This suggests that the result (weight vector) of
sparse linear classification with thresholding can still be used for feature selection.
Figure 3.2. Speed-up and space-saving for SVM with data thresholding on the 20 Newsgroup
dataset, topic = comp.graphics. Number of training samples: 13K, number of features: 61K.
Figure 3.2 shows the speed-up and space-saving for SVM with data thresholding on the 20
Newsgroup dataset for topic comp.graphics. As we increase the thresholding level, we can
solve the problem faster (red line) and use less memory (blue line) to store the data by
sacrificing a minor amount of accuracy (black line).
3.3.3 UCI NYTimes Dataset
The UCI NYTimes dataset contains 300,000 news articles with a vocabulary of 102,660
words. In this experiment, we use the first 100,000 news articles (approximately 30 million
words in the collection) for the training task. It is impossible to train an SVM on the whole
dataset on a laptop/PC, so in this experiment we simply run the SVM on the thresholded
dataset (by TF-IDF scores) and then report the top features in Table 3.2. The thresholded
dataset with thresholding level 0.05 contains only 850,000 non-zeros, which is roughly 3% of
the full dataset.
Table 3.2. SVM result: top 20 keywords for topic 'stock' on the UCI NYTimes dataset with
a thresholding level of 0.05 on the TF-IDF score (the reduced dataset is only 3% of the full
dataset). Total runtime: 4317s.
Chapter 4
Low-rank Approximation
4.1 Introduction
A standard task in scientific computing is to determine, for a given matrix $A \in \mathbb{R}^{n \times d}$, an
approximate decomposition $A \approx BC$ where $B \in \mathbb{R}^{n \times k}$, $C \in \mathbb{R}^{k \times d}$; $k$ is called the numerical
rank of the matrix. When $k$ is much smaller than either $n$ or $d$, such a decomposition allows
the matrix to be stored inexpensively and to be multiplied by vectors or other matrices
quickly.
Low-rank approximation has been used for solving many large-scale problems that involve
large amounts of data. By replacing the original matrices with approximate (low-rank)
ones, the perturbed problems often require much less computational effort to solve.
Low-rank approximation methods have been shown to be successful on a variety of learning
tasks, such as spectral partitioning for image and video segmentation (Fowlkes et al. (2004))
and manifold learning (Talwalkar et al. (2008)). Recently, there has been interesting work on
using low-rank approximation in kernel learning: Fine and Scheinberg (2002) proposed an
efficient method for the kernel SVM learning problem by approximating the kernel matrix by a
low-rank positive semidefinite matrix. Bach and Jordan (2005) and Zhang et al. (2012) also
proposed similar approaches for more general kernel learning problems which also exploit
side information in the computation of low-rank decompositions for kernel matrices.
One important point worth noting here is that all of the above works just replace the
original data matrices directly with the low-rank ones, and then provide a bound on the error
made by solving the perturbed problem compared to the solution of the original problem. In
this thesis, we propose a new modeling approach that takes into account the approximation
error made when replacing the original matrices with low-rank approximations, and modifies
the original problem accordingly via a robust optimization perspective.
In this chapter, we focus on solving the LASSO problem:
\[
\min_{\|\beta\|_1 \le \lambda} \ \|y - X^T\beta\|_2 \tag{4.1.1}
\]
However, the result can also be generalized to more general supervised learning problems:
\[
\min_{v \in \mathbb{R},\, w \in \mathcal{C}} \ f(A^T w + cv) \tag{4.1.2}
\]
where the loss function $f : \mathbb{R}^n \to \mathbb{R}$ is convex, the data matrix $A \in \mathbb{R}^{d \times n}$ and the vector $c \in \mathbb{R}^n$
are given, and $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex set constraining the decision variable $w$.
In practice, a low-rank approximation may not be directly available, but has to be computed.
While the effort of finding a low-rank approximation is typically linear in the size of the
data matrix, our approach leads to the biggest savings in what we refer to as the repeated-
instances setup. In such a setup, the task is to solve multiple instances of similar programs,
where all the instances involve a common (or very similar) coefficient matrix. In that setup,
of which we provide real-world examples later, it makes sense to invest some effort in finding
the low-rank approximation once, and then exploit the same low-rank approximation for
each instance. The robust low-rank approach then offers enormous computational savings,
and produces solutions that are guaranteed to be feasible for the original problem.
4.2 Algorithms for Low-rank Approximation
In general, to find a rank-$k$ approximation of a matrix $A \in \mathbb{R}^{n \times d}$, we wish to find matrices
$B \in \mathbb{R}^{n \times k}$, $C \in \mathbb{R}^{k \times d}$ such that the spectral norm (largest singular value) error $\|A - BC\|_2$
is minimal. The truncated singular value decomposition is known to provide the best
low-rank approximation for any given fixed rank (Eckart and Young (1936)); however, it is also
very costly to compute. There have been many different approaches proposed for computing
low-rank approximations, such as rank-revealing factorizations (QR or LU) (Gu and Eisenstat
(i) set $x^+ \leftarrow x^{(k)} + d^{(k)}$, $\eta = 1$ and $\delta \leftarrow \nabla f(x^{(k)})^T d^{(k)}$
(ii) while $f(x^+) > \max_{0 \le j \le \min(k,M)} f(x^{(k-j)}) + \gamma \eta \delta$:
    - compute $\bar\eta = \dfrac{-\eta^2 \delta}{2\left(f(x^+) - f(x^{(k)}) - \eta\delta\right)}$
    - if $\bar\eta \ge \sigma_1 \eta$ and $\bar\eta \le \sigma_2 \eta$ then set $\eta \leftarrow \bar\eta$, else set $\eta \leftarrow \eta/2$
    - set $x^+ \leftarrow x^{(k)} + \eta d^{(k)}$
(iii) set steplength $\rho_k \leftarrow \eta$
3. set $x^{(k+1)} = x^{(k)} + \rho_k d^{(k)}$, $s^{(k)} = x^{(k+1)} - x^{(k)}$, and $y^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)})$
4. if $(s^{(k)})^T y^{(k)} \le 0$ then set $\alpha_{k+1} \leftarrow \alpha_{\max}$,
   else set $\alpha_{k+1} \leftarrow \min\left(\max\left(\dfrac{(s^{(k)})^T s^{(k)}}{(s^{(k)})^T y^{(k)}}, \alpha_{\min}\right), \alpha_{\max}\right)$
5. set $k \leftarrow k + 1$
($P_{\mathcal{C}}(z)$ denotes the projection of the vector $z$ onto the set $\mathcal{C}$: $P_{\mathcal{C}}(z) = \arg\min_{w \in \mathcal{C}} \|z - w\|$)
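To make the listing above concrete, here is a minimal Python sketch of a nonmonotone spectral projected gradient loop in the spirit of Birgin et al. (2000). It is our own illustration, not the code used in the thesis: the function name `spg`, the callables `f`, `grad`, `project`, and the default constants are all assumptions.

```python
import numpy as np

def spg(f, grad, project, x0, max_iter=100, M=10, gamma=1e-4,
        alpha_min=1e-10, alpha_max=1e10, sigma1=0.1, sigma2=0.9, tol=1e-8):
    """Sketch of a nonmonotone spectral projected gradient method.
    Parameter names follow the listing above; default values are illustrative."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0
    f_hist = [f(x)]
    for k in range(max_iter):
        g = grad(x)
        d = project(x - alpha * g) - x          # projected gradient direction
        if np.linalg.norm(d) < tol:
            break
        delta = g @ d
        eta = 1.0
        x_plus = x + d
        # nonmonotone Armijo line search with safeguarded quadratic interpolation
        while f(x_plus) > max(f_hist[-min(len(f_hist), M + 1):]) + gamma * eta * delta:
            eta_new = -eta**2 * delta / (2.0 * (f(x_plus) - f(x) - eta * delta))
            eta = eta_new if sigma1 * eta <= eta_new <= sigma2 * eta else eta / 2.0
            x_plus = x + eta * d
        s = x_plus - x
        y = grad(x_plus) - g
        x = x_plus
        f_hist.append(f(x))
        sty = s @ y
        alpha = alpha_max if sty <= 0 else min(max(s @ s / sty, alpha_min), alpha_max)
    return x
```

With `project` set to an $\ell_1$-ball projection (see Algorithm 3 below), this loop can be used for the constrained LASSO problems considered in this chapter.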
For sparse learning problems such as LASSO, the set $\mathcal{C}$ is the $\ell_1$-ball $B_1(\lambda) = \{w :
\|w\|_1 \le \lambda\}$. We adopt the following algorithm developed by Duchi et al. (2008) to compute
projections onto the $\ell_1$ ball efficiently in linear time:
Algorithm 3 Linear time projection onto the $\ell_1$ ball (Duchi et al. (2008))
input: $z \in \mathbb{R}^d$ and $\lambda > 0$
initialize $U = \{1, \ldots, d\}$, $s = 0$, $\eta = 0$
while $U \ne \emptyset$ do
1. pick $k \in U$ randomly
2. partition $U$ into $G = \{j \in U : z_j \ge z_k\}$, $L = U \setminus G$
3. if $(s + \sum_{j \in G} z_j) - (\eta + |G|) z_k < \lambda$ then $s \leftarrow s + \sum_{j \in G} z_j$; $\eta \leftarrow \eta + |G|$; $U \leftarrow L$
4. else $U \leftarrow G \setminus \{k\}$
set $\theta = (s - \lambda)/\eta$
output $w$ with $w_i = \max(z_i - \theta, 0)$, $i = 1, \ldots, d$
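For reference, here is a short Python sketch of this projection. It is a sort-based $O(d \log d)$ variant rather than the expected-linear-time randomized pivoting of Algorithm 3, and it adds the standard sign handling and an early exit for points already inside the ball; the function name is our own.

```python
import numpy as np

def project_l1_ball(z, lam):
    """Project z onto the l1 ball {w : ||w||_1 <= lam} (sort-based variant)."""
    z = np.asarray(z, dtype=float)
    if np.abs(z).sum() <= lam:          # already inside the ball
        return z.copy()
    u = np.sort(np.abs(z))[::-1]        # |z| sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - lam) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - lam) / (rho + 1.0)
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

# Example: the result has l1 norm exactly lam when z lies outside the ball.
w = project_l1_ball(np.array([3.0, -1.0, 0.5]), lam=2.0)
print(w, np.abs(w).sum())
```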
We now show how to use low-rank approximation to speed up first-order methods. As
we will see, the speedup from our approach can be achieved with any first-order
algorithm that involves computing the gradient. Generally, a first-order method for LASSO
requires us to compute the gradient of the objective function, which for the constrained
LASSO problem is of the form:
\[
\nabla_\beta \|y - X^T\beta\|_2 = \frac{X(X^T\beta - y)}{\|X^T\beta - y\|_2}
\]
Finding this gradient involves computing the product of a matrix $X \in \mathbb{R}^{n \times d}$ with a vector
$u \in \mathbb{R}^d$ as well as the product of $X^T$ with a vector $v \in \mathbb{R}^n$. Each operation costs $O(nd)$
flops for a dense matrix $X$, or $O(N)$ flops for a sparse matrix $X$ with $N$ non-zero entries.
Assume that the data matrix has low-rank structure, that is, $X \approx UV^T$ where $U \in
\mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{d \times k}$ with $k \ll \min(n, d)$. We can then exploit the low-rank structure in
order to improve the efficiency of the above matrix-vector multiplication by writing $Xu =
U(V^T u)$. Computing $z = V^T u$ costs $O(kd)$ flops, and computing $Uz$ costs $O(kn)$ flops,
so the total number of flops is $O(k(n + d))$. This is a significant improvement compared
to the cost of matrix-vector multiplication with the original data matrix, especially when
$X$ is dense, high dimensional and approximately low-rank. When $X$ is sparse and $k$ is
much smaller than the average number of non-zero entries per sample, the matrix-vector
multiplication that exploits the low-rank structure is also much faster than direct
multiplication. Furthermore, the space required to store the low-rank approximation is only
$O(k(n + d))$, compared to $O(nd)$ when $X$ is dense or $O(N)$ when $X$ is sparse.
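A small Python sketch of this factored matrix-vector product (the sizes below are arbitrary illustrative values):

```python
import numpy as np

n, d, k = 2000, 3000, 20                 # illustrative sizes; the point is k << min(n, d)
rng = np.random.default_rng(0)
U = rng.standard_normal((n, k))
V = rng.standard_normal((d, k))
X = U @ V.T                              # dense n x d matrix of exact rank k
u = rng.standard_normal(d)

direct = X @ u                           # O(n d) flops
factored = U @ (V.T @ u)                 # O(k (n + d)) flops, same result up to round-off
print(np.allclose(direct, factored))     # True
```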
For the more general supervised learning problem (4.1.2), we can also compute the gradient
by exploiting the low-rank structure of the data matrix: suppose that $A = VU^T$ where $U \in
\mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{d \times k}$ with $k \ll \min(n, d)$, and let $h(w, v) = f(A^T w + cv)$. The gradient can be
computed as follows:
\[
\nabla_w h(w, v) = A\,\nabla f(A^T w + cv) = V\left(U^T \nabla f(UV^T w + cv)\right)
\]
\[
\nabla_v h(w, v) = c^T \nabla f(A^T w + cv) = c^T \nabla f(UV^T w + cv)
\]
Similarly, the cost of computing the above gradient is only $O(k(n + d))$ for a dense data
matrix or $O(N)$ for a sparse one.
4.3 Robust Low-rank LASSO
Important questions worth asking when replacing the original data matrix by a low-rank
approximation are "how much information do we lose?" and "how does it affect the LASSO
problem?". In this section, we examine the effect of using a low-rank approximation, or
in other words, thresholding the singular values of the data matrix according to an absolute
threshold level $\varepsilon_k$. We replace the matrix $X$ by $\bar{X}$, the closest (in largest singular value
norm) rank-$k$ approximation to $X$, and the error $\Delta := \bar{X} - X$ satisfies $\|\Delta\| \le \varepsilon_k$, where $\|\cdot\|$
denotes the largest singular value norm.
As seen in the previous section, computing a low-rank approximation of a matrix is feasible
even in a large-scale setting. Our basic goal is to end up solving a slightly modified LASSO
problem using the rank-$k$ approximation, while controlling the error made.
We want to take into account the worst-case error in the objective function that is made
upon replacing $X$ with its rank-$k$ approximation $\bar{X}$. To this end, we consider the following
robust counterpart to (4.1.1):
\[
\psi_{\varepsilon_k,\lambda}(\bar{X}) := \min_{\|\beta\|_1 \le \lambda} \ \max_{\|X - \bar{X}\| \le \varepsilon_k} \ \left\|y - X^T\beta\right\|_2
= \min_{\|\beta\|_1 \le \lambda} \ \max_{\|\Delta\| \le \varepsilon_k} \ \left\|y - (\bar{X} - \Delta)^T\beta\right\|_2 \tag{4.3.1}
\]
Let us define $f(z) = \|z\|_2$; we have:
\begin{align*}
\max_{\|\Delta\| \le \varepsilon_k} \left\|y - (\bar{X} - \Delta)^T\beta\right\|_2
&= \max_{\|\Delta\| \le \varepsilon_k} f(y - \bar{X}^T\beta + \Delta^T\beta) \\
&= \max_{\|\Delta\| \le \varepsilon_k} \ \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta + \Delta^T\beta) - f^*(u) \\
&= \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta) - f^*(u) + \max_{\|\Delta\| \le \varepsilon_k} u^T\Delta^T\beta \\
&= \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta) - f^*(u) + \max_{\|\Delta\| \le \varepsilon_k} \langle \Delta^T, \beta u^T \rangle \\
&= \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta) - f^*(u) + \varepsilon_k \|\beta u^T\|_* \\
&= \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta) - f^*(u) + \varepsilon_k \|\beta\|_2 \|u\|_2 \\
&= \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta) - f^*(u) + \max_{\|z\|_2 \le \varepsilon_k\|\beta\|_2} u^T z \\
&= \max_{\|z\|_2 \le \varepsilon_k\|\beta\|_2} \ \max_{u \in \mathbb{R}^m} \ u^T(y - \bar{X}^T\beta + z) - f^*(u) \\
&= \max_{\|z\|_2 \le \varepsilon_k\|\beta\|_2} f(y - \bar{X}^T\beta + z) \\
&= \max_{\|z\|_2 \le \varepsilon_k\|\beta\|_2} \|y - \bar{X}^T\beta + z\|_2 \\
&= \|y - \bar{X}^T\beta\|_2 + \varepsilon_k\|\beta\|_2
\end{align*}
where $\|\cdot\|_*$ denotes the nuclear norm (the dual of the spectral norm) and $f^*$ is the conjugate
dual of $f$. Therefore, we can write the robust counterpart of the LASSO problem (4.1.1) as:
\[
\psi_{\varepsilon_k,\lambda}(\bar{X}) = \min_{\|\beta\|_1 \le \lambda} \ \|y - \bar{X}^T\beta\|_2 + \varepsilon_k\|\beta\|_2 \tag{4.3.2}
\]
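As a quick numerical sanity check of the identity above (our own illustration, not from the thesis), the inner maximum is attained at the rank-one perturbation $\Delta = \varepsilon_k\,\beta r^T / (\|\beta\|_2\|r\|_2)$ with $r = y - \bar{X}^T\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eps = 30, 20, 0.3                      # arbitrary illustrative sizes
Xbar = rng.standard_normal((d, n))           # convention chosen so Xbar^T beta lives in R^n
y, beta = rng.standard_normal(n), rng.standard_normal(d)

r = y - Xbar.T @ beta                        # residual at the nominal matrix
closed_form = np.linalg.norm(r) + eps * np.linalg.norm(beta)

# Worst-case perturbation: rank one, spectral norm eps, aligning Delta^T beta with r.
Delta = eps * np.outer(beta, r) / (np.linalg.norm(beta) * np.linalg.norm(r))
attained = np.linalg.norm(y - (Xbar - Delta).T @ beta)

print(np.isclose(closed_form, attained))     # True: the worst-case bound is tight
```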
Let $g(\beta) = \|y - \bar{X}^T\beta\|_2 + \varepsilon_k\|\beta\|_2$; its gradient is:
\[
\nabla_\beta g(\beta) = \frac{\bar{X}(\bar{X}^T\beta - y)}{\|\bar{X}^T\beta - y\|_2} + \varepsilon_k \frac{\beta}{\|\beta\|_2}
\]
Hence the cost of computing this gradient is the same as for the low-rank LASSO problem,
which is $O(k(n + d))$.
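Below is a minimal sketch of this gradient evaluated through the factors of $\bar{X}$ (our own helper; the shape convention here is $\bar{X} = UV^T$ with $U \in \mathbb{R}^{d \times k}$, $V \in \mathbb{R}^{n \times k}$, $\beta \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, chosen so that $\bar{X}^T\beta$ and $y$ live in the same space):

```python
import numpy as np

def robust_lr_lasso_grad(U, V, beta, y, eps_k):
    """Gradient of g(beta) = ||y - Xbar^T beta||_2 + eps_k ||beta||_2,
    with Xbar = U V^T never formed explicitly, so each evaluation costs
    O(k (n + d)). Assumes beta != 0 and a non-zero residual."""
    r = V @ (U.T @ beta) - y                     # Xbar^T beta - y, in R^n
    grad = U @ (V.T @ r) / np.linalg.norm(r)     # Xbar (Xbar^T beta - y) / ||.||_2
    grad += eps_k * beta / np.linalg.norm(beta)  # gradient of eps_k ||beta||_2
    return grad
```

Combined with the projected gradient loop and the $\ell_1$-ball projection sketched earlier, this is one way to assemble an RLR-LASSO solver along the lines described in the text.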
4.3.1 Theoretical Analysis
In this section, we present some theoretical analysis for the low-rank LASSO models; in
particular, we bound how far their solutions are from the true weight vector. In order to do so,
we will need the following definitions:
Restricted nullspace (Cohen et al. (2009)): for a given subset $S \subseteq \{1, \ldots, d\}$ and a
constant $\alpha \ge 1$, define:
\[
C(S, \alpha) := \{\theta \in \mathbb{R}^d : \|\theta_{S^C}\|_1 \le \alpha\|\theta_S\|_1\}
\]
Given $k \le d$, the matrix $X$ is said to satisfy the restricted nullspace condition of order $k$ if
$\mathrm{null}(X) \cap C(S, 1) = \{0\}$ for all $S \subseteq \{1, \ldots, d\}$ with $|S| = k$.
Restricted eigenvalue (Bickel et al. (2008)): the sample covariance matrix $X^TX/n$ is
said to satisfy the restricted eigenvalue condition over a set $S$ with parameters $\alpha \ge 1$, $\gamma > 0$
if $\frac{1}{n}\|X\theta\|_2^2 \ge \gamma^2\|\theta\|_2^2$ for all $\theta \in C(S, \alpha)$. We write $X \in RE(S, \alpha, \gamma)$ if the above condition
holds.
Main result: we consider the classical linear model in the noisy setting: $y = X\beta^* + w$,
where $y \in \mathbb{R}^n$ is the vector of responses, the matrix $X \in \mathbb{R}^{n \times d}$ is the feature matrix, and
$w \in \mathbb{R}^n$ is a random white noise vector: $w \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$. Define:
\[
\beta(\lambda) = \arg\min_{\|\beta\|_1 \le \lambda} \|y - X\beta\|_2
\]
\[
\beta(\lambda, \eta) = \arg\min_{\|\beta\|_1 \le \lambda} \|y - \bar{X}\beta\|_2 + \eta\|\beta\|_2
\]
Hence, $\beta(\lambda, \varepsilon_k)$ is the solution of the robust counterpart optimization problem (4.3.2) and
$\beta(\lambda, 0)$ is the solution of the LASSO problem if we simply replace the feature matrix $X$ by its
low-rank approximation $\bar{X}$.
We will now show that when the feature matrix $X$ is "close" to low rank (i.e. $\varepsilon_k$ is small),
the estimator $\beta(\lambda, \eta)$ is close to the optimal weight vector $\beta^*$ for any $0 \le \eta \le \varepsilon_k$.
To shorten the equations, we abbreviate $\beta(\lambda, \eta)$ as $\beta$.
We also use the same standard assumptions as for the nominal LASSO problem in Raskutti et al.
(2010):
(a) $\lambda = \|\beta^*\|_1$, $S$ is the support of $\beta^*$, and $|S| = p$.
(b) $X \in RE(S, 1, \gamma)$ for some $\gamma > 0$.
Since $\|\Delta x\|_2 \le \|\Delta\|_2\|x\|_2 \le \varepsilon_k\|x\|_2$ for any vector $x$, we have:
\begin{align*}
\|y - X\beta\|_2 &= \|y - \bar{X}\beta + \Delta\beta\|_2 \\
&\le \|y - \bar{X}\beta\|_2 + \|\Delta\beta\|_2 \\
&\le \|y - \bar{X}\beta\|_2 + \varepsilon_k\|\beta\|_2 \\
&= \left(\|y - \bar{X}\beta\|_2 + \eta\|\beta\|_2\right) + (\varepsilon_k - \eta)\|\beta\|_2 \\
&\le \left(\|y - \bar{X}\beta^*\|_2 + \eta\|\beta^*\|_2\right) + (\varepsilon_k - \eta)\|\beta\|_2 \\
&= \left(\|y - X\beta^* - \Delta\beta^*\|_2 + \eta\|\beta^*\|_2\right) + (\varepsilon_k - \eta)\|\beta\|_2 \\
&\le \left(\|y - X\beta^*\|_2 + \|\Delta\beta^*\|_2 + \eta\|\beta^*\|_2\right) + (\varepsilon_k - \eta)\|\beta\|_2 \\
&\le \|y - X\beta^*\|_2 + (\varepsilon_k + \eta)\|\beta^*\|_2 + (\varepsilon_k - \eta)\|\beta\|_2
\end{align*}
where the fifth line uses the optimality of $\beta$ for the robust objective in (4.3.2) together with
the feasibility of $\beta^*$ for that problem ($\|\beta^*\|_1 = \lambda$). In addition, $\|\beta\|_2 \le \|\beta\|_1 \le \lambda$ and
$\|\beta^*\|_2 \le \|\beta^*\|_1 = \lambda$, so:
\[
\|y - X\beta\|_2 \le \|y - X\beta^*\|_2 + 2\varepsilon_k\lambda \tag{4.3.3}
\]
Now let $\zeta = \beta^* - \beta$; we can write $y = X\beta^* + w$ as $y - X\beta = X\zeta + w$, therefore (4.3.3) implies
that:
\begin{align*}
\|X\zeta + w\|_2 &\le \|w\|_2 + 2\varepsilon_k\lambda \\
\Rightarrow \quad \|X\zeta + w\|_2^2 &\le (\|w\|_2 + 2\varepsilon_k\lambda)^2 \\
\Rightarrow \quad \|X\zeta\|_2^2 &\le 4\varepsilon_k\lambda\|w\|_2 + 4\varepsilon_k^2\lambda^2 - 2\zeta^T X^T w \\
\Rightarrow \quad \frac{1}{n}\|X\zeta\|_2^2 &\le \frac{4\varepsilon_k\lambda}{n}\|w\|_2 + \frac{4\varepsilon_k^2\lambda^2}{n} + 2\|\zeta\|_1\left\|\frac{X^T w}{n}\right\|_\infty
\end{align*}
Using assumption (a), we obtain:
\begin{align*}
\|\beta_{S^C}\|_1 + \|\beta_S\|_1 &= \|\beta\|_1 \le \lambda = \|\beta^*\|_1 = \|\beta^*_S\|_1 \\
\Rightarrow \quad \|\beta_{S^C}\|_1 &\le \|\beta^*_S\|_1 - \|\beta_S\|_1 \le \|\beta^*_S - \beta_S\|_1 \\
\Rightarrow \quad \|\beta^*_{S^C} - \beta_{S^C}\|_1 &\le \|\beta^*_S - \beta_S\|_1 \\
\Rightarrow \quad \|\zeta_{S^C}\|_1 &\le \|\zeta_S\|_1 \ \Rightarrow \ \zeta \in C(S, 1)
\end{align*}
and
\[
\|\zeta\|_1 \le 2\|\zeta_S\|_1 \le 2\sqrt{p}\,\|\zeta_S\|_2 \le 2\sqrt{p}\,\|\zeta\|_2
\]
Using assumption (b), $X \in RE(S, 1, \gamma)$, and the fact that $\zeta \in C(S, 1)$, we have
$\frac{1}{n}\|X\zeta\|_2^2 \ge \gamma^2\|\zeta\|_2^2$, therefore:
\[
\gamma^2\|\zeta\|_2^2 \le \frac{4\varepsilon_k\lambda}{n}\|w\|_2 + \frac{4\varepsilon_k^2\lambda^2}{n} + 4\sqrt{p}\,\|\zeta\|_2\left\|\frac{X^T w}{n}\right\|_\infty \tag{4.3.4}
\]
Lemma 4 (Wainwright (2010)) Suppose $X$ is bounded by $L$, i.e. $|X_{ij}| \le L$, and $w \sim
\mathcal{N}(0, \sigma^2 I_{n \times n})$; then with high probability we have:
\[
\left\|\frac{X^T w}{n}\right\|_\infty \le L\sqrt{\frac{3\sigma^2 \log d}{n}}
\]
Lemma 5 Suppose $w \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$ and $c > 1$; then with probability at least
$1 - e^{-\frac{3}{16}n(c-1)^2}$ we have:
\[
\|w\|_2 \le \sigma\sqrt{cn}
\]
Proof: Since $w \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$, $Z = \sum_{i=1}^n (w_i/\sigma)^2 \sim \chi^2_n$. Using the tail bound for
chi-square random variables of Johnstone (2000), for any $\rho > 0$ we have:
\[
P\left[|Z - n| \ge n\rho\right] \le \exp\left(-\tfrac{3}{16}n\rho^2\right).
\]
Thus, by setting $\rho = c - 1$, with probability at least $1 - e^{-\frac{3}{16}n(c-1)^2}$ we have $Z \le cn$, i.e.
$\|w\|_2 \le \sigma\sqrt{cn}$.
Assuming $L, \sigma, \lambda$ are constants, using (4.3.4) and the results of Lemmas 4 and 5, with
high probability we have:
\[
\|\zeta\|_2^2 \le O\left(\frac{\varepsilon_k}{\sqrt{n}}\right) + O\left(\frac{\varepsilon_k^2}{n}\right) + O\left(\sqrt{\frac{p\log d}{n}}\right)\|\zeta\|_2
\]
Solving this inequality, we obtain the following upper bound on $\|\beta^* - \beta\|_2 = \|\zeta\|_2$ (with
high probability):
\[
\|\beta^* - \beta(\lambda, \eta)\|_2 \le O\left(\sqrt{\frac{p\log d}{n}} + \frac{\varepsilon_k}{\sqrt{n}} + \frac{\varepsilon_k^2}{n}\right)
\]
The above result shows that when the error made by replacing the feature matrix with its
low-rank approximation (LR-LASSO) is small enough compared to $\sqrt{n}$ (i.e. $\varepsilon_k \ll \sqrt{n}$), the
corresponding estimator $\beta(\lambda, 0)$ (for LR-LASSO) is close to the optimal solution $\beta^*$. The
solution of the robust counterpart of LASSO (RLR-LASSO), which is $\beta(\lambda, \varepsilon_k)$, is also close
to the optimal solution $\beta^*$.
4.3.2 Discussion
One might ask what happens if we just use the low-rank approximation matrix directly.
We will show that we might end up with unexpected solutions if we do so. Indeed, consider
the rank-1 LASSO problem in which the data matrix $X$ is approximated with a rank-1 matrix
$\bar{X} = uv^T$ for some $u \in \mathbb{R}^n$, $v \in \mathbb{R}^d$: $\min_{\|\beta\|_1 \le \lambda} \|y - uv^T\beta\|_2$. We randomly generate two
vectors $u \in \mathbb{R}^{20}$, $v \in \mathbb{R}^{20}$ and solve the rank-1 LASSO problem for $\lambda = 2^{-12}, \ldots, 2^{12}$; Figure
4.1 shows the sparsity pattern of the solution of the rank-1 LASSO vs the robust rank-1 LASSO
problem.
Figure 4.1. Rank-1 LASSO (left) and Robust Rank-1 LASSO (right) with random data. The
plot shows the elements of the optimizer as a function of the $\ell_1$-norm penalty parameter $\lambda$
(the horizontal axis is $\log_2(\lambda)$). The non-robust solution has cardinality 1 or 0 for all
$0 < \lambda < C$ for some $C$. The robust version allows a much better control of the sparsity of
the solution as a function of $\lambda$.
We can also explain analytically why the solution of the non-robust rank-1 LASSO misbehaves
for small values of $\lambda$, as follows. Suppose $\bar{X} = pq^T$, $p \in \mathbb{R}^n$, $q \in \mathbb{R}^d$; the rank-1
LASSO problem is:
\[
\phi(\lambda) = \min_{\|\beta\|_1 \le \lambda} \|pq^T\beta - y\|_2
\]
We can always normalize $p, q, y$ so that $\|p\|_2 = \|q\|_2 = \|y\|_2 = 1$.
Figure 4.2. Top 5000 singular values for each dataset, scaled by the maximum singular value.
The size of each dataset (number of samples × number of features) is also shown.
4.4.2 Multi-label classification
Since we treat each label as a single classification subproblem (one-against-all method),
we evaluate the performance of all three models using the Macro-F1 measure. The Macro-F1
measure treats every label equally and calculates the global measure from the means of the
local measures over all labels:
\[
\text{Macro-F1} = \frac{2RP}{R + P}, \qquad R = \frac{1}{L}\sum_{l=1}^{L} R_l, \quad P = \frac{1}{L}\sum_{l=1}^{L} P_l,
\]
where $R$ and $P$ are the average recall and precision across all labels. We will also report the
average running time of each model.
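A tiny sketch of this computation (our own illustration; it assumes the per-label recalls and precisions have already been computed):

```python
import numpy as np

def macro_f1(recalls, precisions):
    """Macro-F1 as defined above: the harmonic mean of the label-averaged
    recall and precision (not the average of the per-label F1 scores)."""
    R, P = float(np.mean(recalls)), float(np.mean(precisions))
    return 2 * R * P / (R + P) if (R + P) > 0 else 0.0

# Example with hypothetical per-label recall/precision values.
print(macro_f1([0.8, 0.6, 0.7], [0.9, 0.5, 0.65]))
```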
Figures 4.3 and 4.4 show the average training-time speedup of LR-LASSO and RLR-LASSO
compared to LASSO on all labels (RCV1V2 has 101 labels, TMC2007 has 22 labels) over
different runs for each value of the regularization parameter $\lambda \in \{10^i : i = 3, 4, 5, 6\}$, as the
low-rank dimension $k$ varies (one run for each value of the regularization parameter $\lambda$).
LR-LASSO and RLR-LASSO run much faster than the nominal LASSO for both datasets,
Given the original data matrix $X$, for each query term we need to solve a LASSO problem
in which the response vector $y$ is a column of $X$ and the corresponding feature matrix
is obtained by removing that column from $X$. The low-rank approximation for the new data
matrix, given the low-rank approximation of the original matrix, can be derived quickly.
Suppose that $X = UV^T$ where $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{d \times k}$. Let $X^{(i)} \in \mathbb{R}^{n \times (d-1)}$ be the matrix
obtained from $X$ by removing the $i$th column, and $X_{-i} \in \mathbb{R}^{n \times d}$ the matrix obtained from $X$
by replacing its $i$th column with zeroes. Letting $c_i$ be the $i$th column of $X$ and $e_i$ the $i$th unit
vector in $\mathbb{R}^d$, we can write $X_{-i}$ as:
\[
X_{-i} = X - c_i e_i^T = UV^T - c_i e_i^T = [U, -c_i][V, e_i]^T
\]
Let $U^{(i)} = [U, -c_i]$ and let $V^{(i)}$ be the matrix obtained from $[V, e_i]$ by removing its $i$th row.
We can then write $X^{(i)} = U^{(i)}(V^{(i)})^T$, therefore the rank of $X^{(i)}$ can be bounded as follows:
\[
\mathrm{rank}(X^{(i)}) \le \min\{\mathrm{rank}(U^{(i)}), \mathrm{rank}(V^{(i)})\} \le \min\{\mathrm{rank}(U) + 1, \mathrm{rank}(V) + 1\} \le k + 1
\]
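A small Python sketch of this column-removal identity (our own illustration; the helper name is hypothetical):

```python
import numpy as np

def drop_column_factors(U, V, i):
    """Given X ~= U V^T (U: n x k, V: d x k), return rank-(k+1) factors of the
    matrix obtained by deleting column i of X, following the identity
    X - c_i e_i^T = [U, -c_i][V, e_i]^T and dropping row i of the right factor."""
    c_i = U @ V[i]                                    # i-th column of U V^T
    e_i = np.zeros(V.shape[0]); e_i[i] = 1.0
    U_i = np.hstack([U, -c_i[:, None]])               # n x (k+1)
    V_i = np.delete(np.hstack([V, e_i[:, None]]), i, axis=0)  # (d-1) x (k+1)
    return U_i, V_i

# Sanity check against forming the dense matrix directly.
rng = np.random.default_rng(0)
U, V = rng.standard_normal((6, 2)), rng.standard_normal((5, 2))
U3, V3 = drop_column_factors(U, V, 3)
print(np.allclose(U3 @ V3.T, np.delete(U @ V.T, 3, axis=1)))  # True
```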
automotive: car, vehicle, auto, sales, model, driver, ford, driving, engine, consumer
agriculture: government, farm, farmer, food, water, trade, land, crop, economic, country
technology: company, computer, system, web, information, internet, american, job, product, software
tourism: tourist, hotel, business, visitor, economy, travel, tour, local, room, plan
aerospace: boeing, aircraft, space, program, jet, plane, nasa, flight, airbus, military
defence: afghanistan, attack, forces, military, gulf, troop, aircraft, terrorist, president, war
financial: company, million, stock, market, money, business, firm, fund, investment, economy
healthcare: health, care, cost, patient, corp, al gore, doctor, drug, medical, insurance
petroleum: oil, prices, gas, fuel, company, barrel, gasoline, bush, energy, opec
gaming: game, gambling, casino, player, online, computer, tribe, money, playstation, video
Table 4.1. NYTimes news articles - top 10 predictive words for different query terms (10
industry sectors). Low-rank approximation with k = 20 is used; total training time (for all
query terms) is 42918 seconds.
We could also compute a rank-$k$ decomposition of $X^{(i)}$ with a slightly more complicated
derivation, as presented in Brand (2006).
We experiment with the NYTimes and PubMed datasets; both of these datasets are so large
that it is impossible to run the nominal LASSO on a personal workstation. Thus, our experi-
ment's goal is to find the top predictive words for each given query term. In Table 4.1 and
Table 4.2, we report the top 10 predictive words for given lists of 10 query terms, produced
by RLR-LASSO with $k = 20$.
arthritis asthma cancer depression diabetes
joint bronchial tumor effect diabetic
Table 4.2. PubMed abstracts - top 10 predictive words for different query terms (10
diseases). Low-rank approximation with k = 20 is used; total training time (for all query
terms) is 56352 seconds.
Chapter 5
Conclusion
In this dissertation, we introduced new learning models that are capable of solving large-
scale machine learning problems using data approximation under a robust optimization per-
spective. We have shown that data approximation under the robust scheme is efficient,
scalable, and more reliable than just applying the approximation directly.
In Chapter 3, we described a data thresholding technique for large-scale sparse linear
classification and provided a theoretical bound on the thresholding level needed to
obtain the desired performance. The proposed method is a promising pre-processing method
that can be applied with any learning algorithm to efficiently solve large-scale sparse linear
classification problems, both in terms of required memory and the running time of the
algorithm.
Next, in Chapter 4, we described an efficient and scalable robust low-rank model
for the LASSO problem, in which the approximation error made by replacing the original data
matrix with a low-rank approximation is taken into account in the model itself. The exper-
imental results lead us to believe that the proposed model could make statistical learning
that involves running multiple instances of LASSO feasible for extremely large datasets. Al-
though we only focus on first-order methods, in which the main computational effort is to
compute the gradient, we believe that a similar approach can be taken to study how low-rank
structure could help in speeding up second-order methods for solving the LASSO problem.
Another interesting direction we wish to pursue is to construct robust low-rank models for
other learning problems.
Bibliography
Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations.
Cambridge University Press.
Bach, F. R. and Jordan, M. I. (2005). Predictive low-rank decomposition for kernel methods.
In Proceedings of the 22nd international conference on Machine learning, ICML ’05, pages
33–40, New York, NY, USA. ACM.
Balakrishnan, S. and Madigan, D. (2008). Algorithms for sparse linear classifiers in the
massive data setting. J. Mach. Learn. Res., 9:313–337.
Becker, S., Bobin, J., and Candes, E. J. (2011). Nesta: A fast and accurate first-order
method for sparse recovery. SIAM J. Imaging Sciences, 4(1):1–39.
Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. (2009). Robust Optimization (Princeton
Series in Applied Mathematics). Princeton University Press.
Bhattacharyya, C. (2004). Robust classification of noisy data using second order cone pro-
gramming approach. In Intelligent Sensing and Information Processing, 2004. Proceedings
of International Conference on, pages 433 – 438.
Bhattacharyya, S., Grate, L., Mian, S., Ghaoui, L. E., and Jordan, M. (2003). Robust
sparse hyperplane classifiers: application to uncertain molecular profiling data. Journal
of Computational Biology, 11(6):1073–1089.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2008). Simultaneous analysis of lasso and
dantzig selector.
Birgin, E. G., Martínez, J. M., and Raydan, M. (2000). Nonmonotone spectral projected
gradient methods on convex sets. SIAM Journal on Optimization, pages 1196–1211.
Birgin, E. G., Martínez, J. M., and Raydan, M. (2001). Algorithm 813: SPG - software for
convex-constrained optimization. ACM Trans. Math. Softw., 27(3):340–349.
Birgin, E. G., Martínez, J. M., and Raydan, M. (2003). Inexact spectral projected gradient
methods on convex sets. IMA Journal of Numerical Analysis, 23:539–559.
Bordes, A. and Bottou, L. (2005). The huller: a simple and efficient online svm. In In