Doubly Greedy Primal-dual Coordinate Descent for Sparse Empirical Risk Minimization
(proceedings.mlr.press/v70/lei17b/lei17b-supp.pdf)

A. Appendix A: Convergence Analysis

A.1. Proof of Theorem 4.2

Recall the primal, dual, and Lagrangian forms:

$$P(x) \;\stackrel{\text{def}}{=}\; \frac{1}{n}\sum_{i=1}^n \phi_i(\langle A_i, x\rangle) + g(x), \qquad (20)$$

$$L(x, y) \;\stackrel{\text{def}}{=}\; g(x) + \frac{1}{n}\, y^\top A x - \frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i), \qquad (21)$$

$$D(y) \;\stackrel{\text{def}}{=}\; \min_x L(x, y) \;\equiv\; L(\bar{x}(y), y), \qquad (22)$$

where $\bar{x}(y): \mathbb{R}^n \to \mathbb{R}^d$ is the optimal primal variable with respect to some $y$, namely $\bar{x}(y) = \arg\min_x L(x, y)$. For simplicity, we write $\bar{x}^{(t)} \stackrel{\text{def}}{=} \bar{x}(y^{(t)})$ throughout this paper. Similarly, we use $\bar{y}(x): \mathbb{R}^d \to \mathbb{R}^n$ to denote the optimal dual variable with respect to some $x$.

Recall that with the choice of regularizer in our model, $g(x) = h(x) + \lambda\|x\|_1$, where $h(x) = \frac{\mu}{2}\|x\|_2^2$ is $\mu$-strongly convex, $\mu$-smooth, and separable. The conjugate of the loss function (e.g. the smooth hinge loss used in our experiments), $\phi_i^*$, is $\gamma$-strongly convex.

Recall the primal gap $\Delta_p^{(t)} \stackrel{\text{def}}{=} L(x^{(t+1)}, y^{(t)}) - D(y^{(t)})$ and the dual gap $\Delta_d^{(t)} \stackrel{\text{def}}{=} D^* - D(y^{(t)})$. In the proof, we connect the objective change in each primal/dual update to the primal/dual gap, and show that the sub-optimality $\Delta^{(t)} = \Delta_p^{(t)} + \Delta_d^{(t)}$ enjoys linear convergence.

Lemma A.1 (Primal Progress).
$$L(x^{(t)}, y^{(t)}) - L(x^{(t+1)}, y^{(t)}) \;\ge\; \frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\,\Delta_p^{(t)}.$$

Proof. This lemma is a direct consequence of the greedy update rule for the primal variables. Since $L(\cdot, y^{(t)})$ is separable across coordinates,
$$\begin{aligned}
L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)}) &= \sum_i \Big( L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}^{(t)}_i - x^{(t)}_i)e_i,\, y^{(t)}\big) \Big) \\
&= \sum_{i \in \operatorname{supp}(x^{(t)} - \bar{x}^{(t)})} \Big( L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}^{(t)}_i - x^{(t)}_i)e_i,\, y^{(t)}\big) \Big) \\
&\le \|\bar{x}^{(t)} - x^{(t)}\|_0 \cdot \max_i \Big( L(x^{(t)}, y^{(t)}) - L\big(x^{(t)} + (\bar{x}^{(t)}_i - x^{(t)}_i)e_i,\, y^{(t)}\big) \Big) \\
&= \|\bar{x}^{(t)} - x^{(t)}\|_0 \Big( L(x^{(t)}, y^{(t)}) - L(x^{(t+1)}, y^{(t)}) \Big).
\end{aligned}$$
Adding $L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)})$ to both sides completes the proof. □

Recall that $i^{(t)}$ is the coordinate selected for update in the dual variable $y^{(t)}$.

Lemma A.2 (Primal-Dual Progress).
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)} \;\le\; L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2,$$
where $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$.

Our goal is to prove that
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)} \;\le\; -\delta\,\Delta_p^{(t)} - \delta\,\Delta_d^{(t)}$$
for some $\delta > 0$, which yields linear convergence of the sub-optimality. Since $L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) \le -\frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Delta_p^{(t)}$ by Lemma A.1, this lemma is the intermediate step connecting to the primal part; the remaining terms represent the dual progress and are analyzed later.

Proof. The change in the primal and dual gaps comes from both the primal and the dual progress:
$$\underbrace{\Delta_d^{(t)} - \Delta_d^{(t-1)}}_{\text{dual progress}} + \underbrace{\Delta_p^{(t)} - \Delta_p^{(t-1)}}_{\text{primal progress}}.$$

• Dual progress: By Danskin's theorem, $-D(y)$ is $\gamma$-strongly convex. Therefore, for any $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$, we have
$$\Delta_d^{(t)} - \Delta_d^{(t-1)} = \big(-D(y^{(t)})\big) - \big(-D(y^{(t-1)})\big) \le -\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big) - \frac{\gamma}{2}\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big)^2. \qquad (23)$$

• Primal progress: Similarly, we get
$$L(x^{(t)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) \le \Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big)\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big) + \frac{\gamma}{2}\big(y^{(t-1)}_{i^{(t)}} - y^{(t)}_{i^{(t)}}\big)^2. \qquad (24)$$

Therefore,
$$\begin{aligned}
\Delta_p^{(t)} - \Delta_p^{(t-1)} &= L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) - \big(D(y^{(t)}) - D(y^{(t-1)})\big) \\
&= L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + L(x^{(t)}, y^{(t)}) - L(x^{(t)}, y^{(t-1)}) - \big(D(y^{(t)}) - D(y^{(t-1)})\big) \\
&\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big).
\end{aligned}$$
Here the last inequality comes from inequalities (24) and (23).

Meanwhile, by the update rule of the dual variable,
$$y^{(t)}_{i^{(t)}} \;\leftarrow\; \arg\max_\beta\; \frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle\,\beta - \phi^*_{i^{(t)}}(\beta) - \frac{1}{2\eta}\big(\beta - y^{(t-1)}_{i^{(t)}}\big)^2,$$
there exists $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$ such that $y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}} = \eta\big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\big)$.
Therefore, the right-hand side of (23) can be rewritten as
$$\begin{aligned}
(23) &= -\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big) - \frac{\gamma}{2}\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big)^2 \\
&= \frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)} - x^{(t)}\rangle\big(y^{(t-1)}_{i^{(t)}} - y^{(t)}_{i^{(t)}}\big) - \Big(\frac{1}{\eta} + \frac{\gamma}{2}\Big)\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big)^2. \qquad (25)
\end{aligned}$$

• Summing together, we have:
$$\begin{aligned}
\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)}
&\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \frac{2}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big) - \Big(\frac{1}{\eta} + \frac{\gamma}{2}\Big)\big(y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}}\big)^2 \\
&= L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \frac{2\eta}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big) - \eta^2\Big(\frac{1}{\eta} + \frac{\gamma}{2}\Big)\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g\Big)^2 \\
&\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2,
\end{aligned}$$
which is exactly the claim of Lemma A.2. □
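The update identity used in this proof, $y^{(t)}_{i^{(t)}} - y^{(t-1)}_{i^{(t)}} = \eta(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - g)$ with $g$ a (sub)gradient of the conjugate at the new point, can be sanity-checked numerically. The sketch below is illustrative only: it uses a hypothetical quadratic conjugate $\phi^*(\beta) = \frac{\gamma}{2}\beta^2$ so that the proximal dual step has a closed form, and made-up values for $\gamma$, $\eta$, $n$, $\langle A_i, x\rangle$, and $y$.

```python
# Check the first-order condition of the proximal dual update
#   y_new = argmax_b  (1/n)*a*b - phi_star(b) - (b - y_old)^2 / (2*eta),
# with the hypothetical conjugate phi_star(b) = (gamma/2)*b^2, which implies
#   y_new - y_old = eta * ((1/n)*a - g),   where g = phi_star'(y_new).

gamma, eta, n = 0.5, 0.2, 4
a = 1.3          # stands in for <A_i, x>
y_old = -0.7

# Closed-form maximizer: (1/n)*a - gamma*b - (b - y_old)/eta = 0.
y_new = (a / n + y_old / eta) / (gamma + 1.0 / eta)
g = gamma * y_new                     # g = phi_star'(y_new)

# The update identity from the proof holds exactly.
assert abs((y_new - y_old) - eta * (a / n - g)) < 1e-9

# And y_new is indeed the maximizer: nearby points score no higher.
obj = lambda b: (a / n) * b - 0.5 * gamma * b**2 - (b - y_old)**2 / (2 * eta)
assert obj(y_new) >= obj(y_new + 1e-3) and obj(y_new) >= obj(y_new - 1e-3)
```

The identity holds for any strongly convex conjugate, not just the quadratic one used here; the quadratic merely makes the argmax computable in one line.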
Afterwards, we upper bound the dual progress $\big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\big)^2 - \big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\big)^2$ by the dual gap $\Delta_d^{(t)}$:

Lemma A.3 (Dual Progress).
$$\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2 \;\le\; -\frac{\gamma}{2n}\Delta_d^{(t)} + \frac{5R^2}{2n^2}\|x^{(t)} - \bar{x}^{(t)}\|^2, \qquad (26)$$
where $g \in \frac{1}{n}\partial\phi^*_{i^{(t)}}(y^{(t)}_{i^{(t)}})$.

Proof. For simplicity, we denote $\phi^*(y) = \frac{1}{n}\sum_i \phi^*_i(y_i)$. To begin with,
$$\Delta_d^{(t)} = D^* - D(y^{(t)}) \;\le\; \frac{2}{\gamma}\Big\|\frac{1}{n}A\bar{x}^{(t)} - \partial\phi^*(y^{(t)})\Big\|^2 \;\le\; \frac{2n}{\gamma}\Big\|\frac{1}{n}A\bar{x}^{(t)} - \partial\phi^*(y^{(t)})\Big\|_\infty^2.$$

In our algorithm, the greedy choice of $i^{(t)}$ makes sure that $\big|\frac{1}{n}Ax^{(t)} - \partial\phi^*(y^{(t)})\big|_{i^{(t)}} = \big\|\frac{1}{n}Ax^{(t)} - \partial\phi^*(y^{(t)})\big\|_\infty$. However, here we need the relation between $\big|\frac{1}{n}A\bar{x}^{(t)} - \partial\phi^*(y^{(t)})\big|_{i^{(t)}}$ and $\big\|\frac{1}{n}A\bar{x}^{(t)} - \partial\phi^*(y^{(t)})\big\|_\infty$ (assumed to be attained at coordinate $i^*$). We bridge their gap by $\delta \stackrel{\text{def}}{=} \frac{1}{n}A(\bar{x}^{(t)} - x^{(t)})$. Indeed,
$$\begin{aligned}
-\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})\Big)^2
&= -\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}}) + \delta_{i^{(t)}}\Big)^2 \\
&\le -\frac{1}{2n^2}\Big(\langle A_{i^{(t)}}, x^{(t)}\rangle - (\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})\Big)^2 + \delta_{i^{(t)}}^2 \\
&= -\frac{1}{2}\Big\|\frac{1}{n}Ax^{(t)} - \partial\phi^*(y^{(t)})\Big\|_\infty^2 + \delta_{i^{(t)}}^2 \\
&\le -\frac{1}{2}\Big(\frac{1}{n}\langle A_{i^*}, x^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*})\Big)^2 + \|\delta\|_\infty^2 \\
&= -\frac{1}{2}\Big(\frac{1}{n}\langle A_{i^*}, \bar{x}^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*}) - \delta_{i^*}\Big)^2 + \|\delta\|_\infty^2 \\
&\le -\frac{1}{4}\Big(\frac{1}{n}\langle A_{i^*}, \bar{x}^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^*})'(y^{(t)}_{i^*})\Big)^2 + \frac{3}{2}\|\delta\|_\infty^2 \\
&= -\frac{1}{4}\Big\|\frac{1}{n}A\bar{x}^{(t)} - \partial\phi^*(y^{(t)})\Big\|_\infty^2 + \frac{3}{2}\|\delta\|_\infty^2 \\
&\le -\frac{\gamma}{2n}\Delta_d^{(t)} + \frac{3}{2}\|\delta\|_\infty^2.
\end{aligned}$$

The first inequality follows from $-(a+b)^2 = -a^2 - b^2 - 2ab \le -a^2 - b^2 + \frac{1}{2}a^2 + 2b^2 = -\frac{1}{2}a^2 + b^2$, applied with $a = \frac{1}{n}\langle A_{i^{(t)}}, x^{(t)}\rangle - \frac{1}{n}(\phi^*_{i^{(t)}})'(y^{(t)}_{i^{(t)}})$ and $b \stackrel{\text{def}}{=} \delta_{i^{(t)}}$; the third inequality follows similarly.

Meanwhile, $\|A(\bar{x}^{(t)} - x^{(t)})\|_\infty \le R\,\|\bar{x}^{(t)} - x^{(t)}\|$, so that $\|\delta\|_\infty^2 \le \frac{R^2}{n^2}\|x^{(t)} - \bar{x}^{(t)}\|^2$ and $\big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\big)^2 \le \frac{R^2}{n^2}\|x^{(t)} - \bar{x}^{(t)}\|^2$; combining the bounds, we get Lemma A.3. □
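The elementary inequality invoked twice in this proof, $-(a+b)^2 \le -\frac{1}{2}a^2 + b^2$, can be spot-checked numerically; the loop below is an independent sanity check, not part of the analysis.

```python
import random

# -(a+b)^2 = -a^2 - b^2 - 2ab <= -a^2 - b^2 + (a^2/2 + 2b^2) = -a^2/2 + b^2,
# since 2ab >= -(a^2/2 + 2b^2) follows from (a/sqrt(2) + sqrt(2)*b)^2 >= 0.
random.seed(0)
for _ in range(10000):
    a = random.uniform(-10.0, 10.0)
    b = random.uniform(-10.0, 10.0)
    assert -(a + b)**2 <= -0.5 * a**2 + b**2 + 1e-9
```

Equality holds exactly when $a = -2b$, which is why the slack constant $\frac{3}{2}$ on $\|\delta\|_\infty^2$ in the chain above cannot be improved by this particular splitting.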
Now we have established the connection between the primal and dual progress (the change in the primal/dual gap) and the primal and dual gaps themselves. The only remaining term is $\|x^{(t)} - \bar{x}^{(t)}\|^2$, but since $\frac{\mu}{2}\|x^{(t)} - \bar{x}^{(t)}\|^2 \le L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})$, it can be absorbed into the primal gap. Therefore, back to the main inequality (26):
Proof of Theorem 4.2.
$$\begin{aligned}
\Delta_d^{(t)} - \Delta_d^{(t-1)} + \Delta_p^{(t)} - \Delta_p^{(t-1)}
&\overset{\text{Lemma A.2}}{\le} L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) + \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, x^{(t)} - \bar{x}^{(t)}\rangle\Big)^2 - \eta\Big(\frac{1}{n}\langle A_{i^{(t)}}, \bar{x}^{(t)}\rangle - g\Big)^2 \\
&\overset{\text{Lemma A.3}}{\le} L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) - \frac{\eta\gamma}{2n}\Delta_d^{(t)} + \frac{5\eta R^2}{2n^2}\|x^{(t)} - \bar{x}^{(t)}\|^2 \\
&\le L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)}) - \frac{\eta\gamma}{2n}\Delta_d^{(t)} + \frac{5\eta R^2}{\mu n^2}\Big(L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big) \\
&= \Big(1 - \frac{5\eta R^2}{\mu n^2}\Big)\Big(L(x^{(t+1)}, y^{(t)}) - L(x^{(t)}, y^{(t)})\Big) - \frac{\eta\gamma}{2n}\Delta_d^{(t)} + \frac{5\eta R^2}{\mu n^2}\Big(L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big) \\
&\overset{\text{Lemma A.1}}{\le} -\Big(1 - \frac{5\eta R^2}{\mu n^2}\Big)\frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Delta_p^{(t)} - \frac{\eta\gamma}{2n}\Delta_d^{(t)} + \frac{5\eta R^2}{\mu n^2}\Big(L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})\Big) \\
&= -\bigg(\Big(1 - \frac{5\eta R^2}{\mu n^2}\Big)\frac{1}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1} - \frac{5\eta R^2}{\mu n^2}\bigg)\Delta_p^{(t)} - \frac{\eta\gamma}{2n}\Delta_d^{(t)},
\end{aligned}$$
where the third inequality uses $\frac{\mu}{2}\|x^{(t)} - \bar{x}^{(t)}\|^2 \le L(x^{(t)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)})$, and the last equality uses $L(x^{(t+1)}, y^{(t)}) - L(\bar{x}^{(t)}, y^{(t)}) = \Delta_p^{(t)}$.

Therefore, we have
$$\frac{\|x^{(t)} - \bar{x}^{(t)}\|_0}{\|x^{(t)} - \bar{x}^{(t)}\|_0 - 1}\Big(1 - \frac{5\eta R^2}{\mu n^2}\Big)\Delta_p^{(t)} + \Big(1 + \frac{\eta\gamma}{2n}\Big)\Delta_d^{(t)} \;\le\; \Delta_d^{(t-1)} + \Delta_p^{(t-1)},$$
i.e. linear convergence. Notice that when
$$\eta^{(t)} \;\le\; \frac{2n^2\mu}{(10R^2 + n\gamma\mu)\,\|x^{(t)} - \bar{x}^{(t)}\|_0}, \qquad (27)$$
we obtain
$$\Delta^{(t)} \;\le\; \frac{1}{1 + \frac{\eta^{(t)}\gamma}{2n}}\,\Delta^{(t-1)}.$$
Specifically, when inequality (27) holds and $\|x^{(t)} - \bar{x}^{(t)}\|_0 \le s$, it requires $O\big(s(\frac{\kappa}{n} + 1)\log\frac{1}{\epsilon}\big)$ iterations to achieve $\epsilon$ primal and dual sub-optimality, where $\kappa = \frac{R^2}{\mu\gamma}$. □
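The contraction $\Delta^{(t)} \le \Delta^{(t-1)}/(1 + \eta^{(t)}\gamma/(2n))$ translates into the stated iteration complexity. A minimal sketch, assuming made-up values for $R$, $\mu$, $\gamma$, $n$, $s$, and the target accuracy (none taken from the experiments):

```python
import math

# Hypothetical problem constants, for illustration only.
R, mu, gamma, n, s = 1.0, 0.1, 1.0, 1000, 50
eps, delta0 = 1e-6, 1.0

# Stepsize bound (27), with ||x - xbar||_0 <= s.
eta = 2 * n**2 * mu / ((10 * R**2 + n * gamma * mu) * s)
rho = eta * gamma / (2 * n)     # per-step contraction: Delta_t <= Delta_{t-1}/(1+rho)

# Iterations until delta0 / (1 + rho)^t <= eps.
t = math.ceil(math.log(delta0 / eps) / math.log(1 + rho))
assert delta0 / (1 + rho)**t <= eps

# The inverse rate matches the O(s*(kappa/n + 1)*log(1/eps)) scaling,
# with kappa = R^2 / (mu * gamma):
kappa = R**2 / (mu * gamma)
assert abs(1 / rho - s * (10 * kappa / n + 1)) < 1e-6
```

Substituting (27) into $\rho = \eta\gamma/(2n)$ gives $1/\rho = s(10R^2 + n\gamma\mu)/(n\mu\gamma) = s(10\kappa/n + 1)$, which is the source of the $O(s(\kappa/n + 1)\log\frac{1}{\epsilon})$ count up to constants.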
B. Appendix B: Additional Experimental Results

Finally, we show results for λ = 0.01, 0.1 and μ = 0.01, 0.1, 1. Here are some comments on the results under different parameters.

The winning margin of DGPD is larger on data sets with a dense feature matrix than on those with a sparse feature matrix. One reason for this is that, for data with a sparse feature matrix, features of higher frequency are more likely to be active than those of lower frequency; therefore, the feature sub-matrix corresponding to the active primal variables is often denser than the sub-matrix corresponding to the inactive ones. This results in a smaller overall speedup.
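The density effect can be illustrated on a synthetic sparse design in which nonzero frequency decays with the feature index; the matrix construction and the choice of "active" coordinates below are hypothetical, not taken from the experimental data sets.

```python
import random

random.seed(1)
n, d = 200, 100
# Column j is nonzero with probability ~ 1/(j+1): low-index features are
# "high-frequency", a common skew in sparse data sets.
A = [[1.0 if random.random() < 1.0 / (j + 1) else 0.0 for j in range(d)]
     for _ in range(n)]

def density(cols):
    # Fraction of nonzero entries in the column sub-matrix A[:, cols].
    return sum(A[i][j] != 0.0 for i in range(n) for j in cols) / (n * len(cols))

active = range(0, 10)      # suppose the active primal coordinates are the frequent features
inactive = range(10, d)
assert density(active) > density(inactive)
```

When coordinate updates touch mostly the active (denser) columns, the per-iteration work saved by primal sparsity shrinks, matching the smaller speedup observed on sparse data sets.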
[Figure 2: log-scale relative primal objective curves for DGPD, DualRCD, PrimalRCD, and SPDC(-dense) on Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, and Sector.]

Figure 2. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.1, μ = 1.
We also observe that in order to achieve the best performance of DGPD, both primal and dual sparsity must hold, and this sparsity is partially controlled by the L1/L2 penalties. In particular, when the L1 penalty has too much weight, the primal iterate becomes too sparse to yield reasonable prediction accuracy, which in turn results in a particularly dense dual iterate due to its non-zero loss on most of the samples. Likewise, when the L2 penalty becomes too large, the classifier tends to mis-classify many examples in order to gain a large margin, which also results in dense dual iterates.

In practice, however, such hyperparameter settings are unlikely to be chosen due to their inferior prediction performance.
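The effect of the L1 weight on primal sparsity follows directly from the coordinate-wise form of $\bar{x}(y)$ under the paper's regularizer $g(x) = \frac{\mu}{2}\|x\|_2^2 + \lambda\|x\|_1$: each coordinate is a soft-threshold of $-\frac{1}{n}(A^\top y)_j$, so increasing $\lambda$ can only zero out more coordinates. A toy sketch on synthetic data (all values illustrative):

```python
import random

random.seed(2)
n, d, mu = 50, 30, 0.1
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
c = [sum(A[i][j] * y[i] for i in range(n)) / n for j in range(d)]  # (1/n) A^T y

def xbar(lam):
    # Coordinate-wise minimizer of (mu/2)x^2 + lam*|x| + c_j*x: soft-thresholding.
    return [-max(abs(cj) - lam, 0.0) * (1.0 if cj > 0 else -1.0) / mu for cj in c]

nnz = lambda v: sum(abs(vj) > 0.0 for vj in v)
sparsities = [nnz(xbar(lam)) for lam in (0.0, 0.05, 0.1, 0.2, 0.4)]

# A heavier L1 penalty can only make the primal iterate sparser.
assert all(s1 >= s2 for s1, s2 in zip(sparsities, sparsities[1:]))
assert sparsities[-1] < sparsities[0]
```

Past the point where $\lambda$ exceeds most of the $|c_j|$, the primal iterate carries too few features to fit the data, which is the regime where the dense-dual behavior described above sets in.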
[Figure 3: log-scale relative primal objective curves for DGPD, DualRCD, PrimalRCD, and SPDC(-dense) on Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, and Sector.]

Figure 3. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.1, μ = 0.1.
[Figure 4: log-scale relative primal objective curves for DGPD, DualRCD, PrimalRCD, and SPDC(-dense) on Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, and Sector.]

Figure 4. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, μ = 1.
[Figure 5: log-scale relative primal objective curves for DGPD, DualRCD, PrimalRCD, and SPDC(-dense) on Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, and Sector.]

Figure 5. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, μ = 0.1.
[Figure 6: log-scale relative primal objective curves for DGPD, DualRCD, PrimalRCD, and SPDC(-dense) on Mnist-RF, Aloi-RF, RCV1, Mnist-RB, Aloi-RB, and Sector.]

Figure 6. Relative Objective versus Time (the upper 2 rows) and versus # iterations (the lower 2 rows) for λ = 0.01, μ = 0.01.