Regret to the Best vs.
Regret to the Average
Eyal Even-Dar, Michael Kearns, Yishay Mansour, Jennifer Wortman
UPenn + Tel Aviv University. Slides: Csaba
Motivation
Expert algorithms attempt to control regret to the return of the best expert.
Regret to the average return? Same O(√T) bound! Isn't that weak?
EW: w_{i,1} = 1, w_{i,t} = w_{i,t-1} e^{η g_{i,t}}, p_{i,t} = w_{i,t}/W_t, W_t = Σ_i w_{i,t}
E1: 1 0 1 0 1 0 1 0 1 0 …
E2: 0 1 0 1 0 1 0 1 0 1 …
On this sequence G_{A,T} = T/2 - cT^{1/2}, while G^+_T = G^-_T = G^0_T = T/2,
so R^+_T = cT^{1/2} and R^0_T = cT^{1/2}: EW really does lose Θ(√T) even to the average.
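The EW example can be checked numerically. A minimal sketch (my code, not the authors'; η = 1/√T is one standard tuning) on the alternating sequence, where both experts and the average earn T/2:

```python
import math

def exponential_weights(gains, eta):
    """Run EW (w_{i,1} = 1, w_{i,t} = w_{i,t-1} * exp(eta * g_{i,t}),
    p_{i,t} = w_{i,t} / W_t) and return the algorithm's total gain."""
    n = len(gains[0])
    w = [1.0] * n
    total = 0.0
    for g in gains:
        W = sum(w)
        total += sum(wi / W * gi for wi, gi in zip(w, g))
        w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
    return total

T = 10_000
# E1: 1 0 1 0 ...   E2: 0 1 0 1 ...  -- every expert (and the average) earns T/2
gains = [(1, 0) if t % 2 == 0 else (0, 1) for t in range(T)]
G_A = exponential_weights(gains, eta=1 / math.sqrt(T))
shortfall = T / 2 - G_A   # grows like sqrt(T)
```

The shortfall arises because EW shifts weight toward whichever expert just gained, which is exactly the wrong move on this sequence.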
Notation - gains
g_{i,t} ∈ [0,1] - gains
g = (g_{i,t}) - sequence of gains
G_{i,T}(g) = Σ_{t=1}^T g_{i,t} - cumulated gains
G^0_T(g) = (Σ_i G_{i,T}(g))/N - average gain
G^-_T(g) = min_i G_{i,T}(g) - worst gain
G^+_T(g) = max_i G_{i,T}(g) - best gain
G^D_T(g) = Σ_i D_i G_{i,T}(g) - weighted avg. gain (D a distribution over experts)
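These quantities are straightforward to compute; a small sketch (the helper name is my own) with `gains` a list of per-round gain vectors:

```python
def gain_summaries(gains, D=None):
    """Compute (G^0_T, G^-_T, G^+_T, G^D_T) from a list of per-round
    gain vectors; D defaults to the uniform distribution."""
    n = len(gains[0])
    G = [sum(g[i] for g in gains) for i in range(n)]   # G_{i,T}
    D = D if D is not None else [1.0 / n] * n
    avg = sum(G) / n
    return avg, min(G), max(G), sum(d * Gi for d, Gi in zip(D, G))

G0, G_minus, G_plus, GD = gain_summaries([(1, 0), (0, 1), (1, 0)])
```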
Notation - algorithms
w_{i,t} - unnormalized weights
p_{i,t} = w_{i,t}/W_t, W_t = Σ_i w_{i,t} - normalized weights
g_{A,t} = Σ_i p_{i,t} g_{i,t} - gain of A at time t
G_{A,T}(g) = Σ_t g_{A,t} - cumulated gain of A
Notation - regret
Regret to the ..
R^+_T(g) = (G^+_T(g) - G_{A,T}(g)) ∨ 1 - best
R^-_T(g) = (G^-_T(g) - G_{A,T}(g)) ∨ 1 - worst
R^0_T(g) = (G^0_T(g) - G_{A,T}(g)) ∨ 1 - avg.
R^D_T(g) = (G^D_T(g) - G_{A,T}(g)) ∨ 1 - dist. D
(Here x ∨ y denotes max(x, y).)
Goal
Algorithm A is "nice" if
R^+_{A,T} ≤ O(T^{1/2})
R^0_{A,T} ≤ 1
Program: examine existing algorithms ("difference algorithms") and prove a lower bound for them; exhibit "nice" algorithms; show that no substantial further improvement is possible.
"Difference" algorithms
Def: A is a difference algorithm if, for N = 2 and g_{i,t} ∈ {0,1}, p_{1,t} = f(d_t), p_{2,t} = 1 - f(d_t), where d_t = G_{1,t} - G_{2,t}.
Examples:
EW: w_{i,t} = e^{η G_{i,t}}
FPL: choose argmax_i (G_{i,t} + Z_{i,t})
Prod: w_{i,t} = Π_{s≤t} (1 + η g_{i,s}) = (1+η)^{G_{i,t}}
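For N = 2 one can verify directly that EW fits the definition: its weight on expert 1 depends on the history only through d_t = G_{1,t} - G_{2,t}. A sketch:

```python
import math

def ew_p1(G1, G2, eta):
    """EW's weight on expert 1, computed from the raw cumulative gains."""
    w1, w2 = math.exp(eta * G1), math.exp(eta * G2)
    return w1 / (w1 + w2)

def f(d, eta):
    """The same weight as a function of the difference d = G1 - G2 alone:
    f(d) = 1 / (1 + e^(-eta * d)), so EW is a difference algorithm."""
    return 1.0 / (1.0 + math.exp(-eta * d))
```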
A lower bound for difference algorithms
Theorem: If A is a difference algorithm, then there exist gain sequences g, g' (tuned to A) such that
R^+_{A,T}(g) R^0_{A,T}(g') ≥ R^+_{A,T}(g) R^-_{A,T}(g') = Ω(T).
With R^+_{A,T} = max_g R^+_{A,T}(g), R^-_{A,T} = max_g R^-_{A,T}(g), R^0_{A,T} = max_g R^0_{A,T}(g), this gives
R^+_{A,T} R^0_{A,T} ≥ R^+_{A,T} R^-_{A,T} = Ω(T).
Proof
Assume T is even and p_{1,1} ≤ 1/2. Feed A the sequence g:
E1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 …
E2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
Let τ be the first time t when p_{1,t} ≥ 2/3; until then A loses at least 1/3 per step to E1, so R^+_{A,T}(g) ≥ τ/3.
Since p_{1,·} climbs from ≤ 1/2 to ≥ 2/3 within τ steps, ∃σ ∈ {2,3,…,τ} s.t. p_{1,σ} - p_{1,σ-1} ≥ 1/(6τ).
Proof/2
Using σ with p_{1,σ} - p_{1,σ-1} ≥ 1/(6τ), build g': σ rounds in which only E1 gains, then an alternating middle block, then the mirror image ending with σ rounds for E2:
E1: 1 1 1 1 1 1 0 1 0 1 0 1 … 0 0 0 0 0 0
E2: 0 0 0 0 0 0 1 0 1 0 1 0 … 1 1 1 1 1 1
Then G^+_T = G^-_T = G^0_T = T/2. In the alternating block d_t oscillates between d_σ and d_{σ-1}, so A (a difference algorithm) plays p_{1,t} = p_{1,σ} and p_{1,t+1} = p_{1,σ-1}, and each such pair of rounds yields gain ≤ 1 - 1/(6τ); by the mirror symmetry p_{1,t} = p_{1,T-t}, the first and last σ rounds together contribute gain at most σ. Hence
G_{A,T}(g') ≤ σ + ((T - 2σ)/2)(1 - 1/(6τ))
⇒ R^-_{A,T}(g') ≥ (T - 2σ)/(12τ)
⇒ R^+_{A,T}(g) R^-_{A,T}(g') ≥ (T - 2σ)/36.
Tightness
We know that for difference algorithms
R^+_{A,T} R^0_{A,T} ≥ R^+_{A,T} R^-_{A,T} = Ω(T).
Can a (difference) algorithm achieve this tradeoff?
Theorem: EW = EW(η), with appropriately tuned η = η(α), 0 ≤ α ≤ 1/2, has
R^+_{EW,T} ≤ T^{1/2+α} (1 + ln N)
R^0_{EW,T} ≤ T^{1/2-α}
Breaking the frontier
What's wrong with the difference algorithms? They are designed to find the best expert with low regret (fast); they pay no attention to the average gain and how it compares with the best gain.
BestWorst(A)
G^+_T - G^-_T: the spread of the cumulated gains.
Idea: stay with the average until the spread becomes large; then switch to learning (using algorithm A).
When the spread is large enough, G^0_T = G_{BW(A),T} ≫ G^-_T ⇒ "nothing" to lose.
Spread threshold: NR, where R = R_{T,N} is a bound on the regret of A.
BestWorst(A)
Theorem: R^+_{BW(A),T} = O(NR) and G_{BW(A),T} ≥ G^-_T.
Proof: At the time of the switch, G_{BW(A)} ≥ (G^+ + (N-1)G^-)/N. Since G^+ ≥ G^- + NR, G_{BW(A)} ≥ G^- + R. After the switch, A gives up at most R to the best expert, so BW(A) never falls below G^-.
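A sketch of the wrapper (the callable interface for A is a toy assumption of mine, not the paper's):

```python
def best_worst(A, gains, N, R):
    """BestWorst(A): play the uniform average until the spread
    G^+_t - G^-_t reaches N*R, then hand the remaining rounds to A.
    `A` is any callable taking the remaining gain sequence and
    returning its total gain on it."""
    G = [0.0] * N
    total = 0.0
    for t, g in enumerate(gains):
        if max(G) - min(G) >= N * R:       # spread threshold reached
            return total + A(gains[t:])     # switch to learning with A
        total += sum(g) / N                 # uniform play earns the average
        G = [Gi + gi for Gi, gi in zip(G, g)]
    return total
```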
PhasedAggression(A, R, D)
for k = 1 : log2(R) do
  η = 2^{k-1}/R
  A.reset(); s := 0  // local time, new phase
  while (G^+_s - G^D_s < 2R) do
    q_s := A.getNormedWeights(g_{s-1})
    p_s := η q_s + (1-η) D
  end
end
A.reset(); run A until time T
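A runnable sketch of this loop, instantiating A as exponential weights for concreteness (the theorem allows any A; η = 0.1 and the function name are my choices):

```python
import math

def phased_aggression(R, D, gains, eta=0.1):
    """PhasedAggression sketch with A = exponential weights.
    Phase k mixes A's weights into D with coefficient 2^(k-1)/R;
    a phase ends once G^+_s - G^D_s reaches 2R; the last phase runs A alone."""
    N = len(D)
    total, t = 0.0, 0
    for k in range(1, int(math.log2(R)) + 1):
        alpha = 2 ** (k - 1) / R          # aggression level of phase k
        w = [1.0] * N                     # A.reset()
        G = [0.0] * N                     # phase-local cumulative gains
        while t < len(gains):
            GD = sum(d * Gi for d, Gi in zip(D, G))
            if max(G) - GD >= 2 * R:      # end of phase k
                break
            W = sum(w)
            p = [alpha * wi / W + (1 - alpha) * di for wi, di in zip(w, D)]
            g = gains[t]
            total += sum(pi * gi for pi, gi in zip(p, g))
            w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
            G = [Gi + gi for Gi, gi in zip(G, g)]
            t += 1
    w = [1.0] * N                         # final phase: run A alone
    while t < len(gains):
        W = sum(w)
        g = gains[t]
        total += sum(wi / W * gi for wi, gi in zip(w, g))
        w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
        t += 1
    return total

T = 1000
gains = [(1, 0) if t % 2 == 0 else (0, 1) for t in range(T)]
total = phased_aggression(R=100, D=[0.5, 0.5], gains=gains)
```

On the alternating sequence the spread never reaches 2R, so PA stays in phase 1, plays almost exactly D, and its regret to D remains far below 1.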
PA(A,R,D) - Theorem
Theorem: Let A be any algorithm with regret R = R_{T,N} to the best expert, and let D be any distribution. Then for PA = PA(A, R, D),
R^+_{PA,T} ≤ 2R(log R + 1)
R^D_{PA,T} ≤ 1
Proof: Consider local time s during phase k. D and A share the gains & the regret:
G^+_s - G_{PA,s} < (2^{k-1}/R) × R + (1 - 2^{k-1}/R) × 2R < 2R
G^D_s - G_{PA,s} ≤ (2^{k-1}/R) × R = 2^{k-1}
What happens at the end of the phase? There G^+_s - G^D_s ≥ 2R, so
G_{PA,s} - G^D_s ≥ (2^{k-1}/R) × (G^+_s - R - G^D_s) ≥ (2^{k-1}/R) × R = 2^{k-1}.
What if PA ends in phase k at time T?
G^+_T - G_{PA,T} ≤ 2Rk ≤ 2R(log R + 1)
G^D_T - G_{PA,T} ≤ 2^{k-1} - Σ_{j=1}^{k-1} 2^{j-1} = 2^{k-1} - (2^{k-1} - 1) = 1
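The final telescoping step, where the 2^{k-1} lost to D in the current phase is offset by the 2^{j-1} gained over D at the end of each earlier phase j, can be sanity-checked numerically:

```python
# 2^(k-1) - sum_{j=1}^{k-1} 2^(j-1) = 2^(k-1) - (2^(k-1) - 1) = 1 for all k
for k in range(1, 12):
    assert 2 ** (k - 1) - sum(2 ** (j - 1) for j in range(1, k)) == 1
```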
General lower bounds
Theorem:
R^+_{A,T} = O(T^{1/2}) ⇒ R^0_{A,T} = Ω(T^{1/2})
R^+_{A,T} ≤ (T log T)^{1/2}/10 ⇒ R^0_{A,T} = Ω(T^α), where α ≥ 0.02
Compare this with
R^+_{PA,T} ≤ 2R(log R + 1), R^D_{PA,T} ≤ 1, where R = (T log N)^{1/2}.
Conclusions
Achieving constant regret to the average is a reasonable goal.
"Classical" algorithms do not have this property; they satisfy R^+_{A,T} R^0_{A,T} ≥ Ω(T).
Modification: learn only when it makes sense, i.e., when the best is much better than the average.
PhasedAggression achieves the optimal tradeoff. Can we remove the dependence on T?