
SIAM J. Optimization, Vol. ???, No. ???, pp. ???-???, ??? 1992. © 1992 Society for Industrial and Applied Mathematics.

PRIMAL-DUAL PROJECTED GRADIENT ALGORITHMS FOR EXTENDED LINEAR-QUADRATIC PROGRAMMING*

    CIYOU ZHU and R. T. ROCKAFELLAR

Abstract. Many large-scale problems in dynamic and stochastic optimization can be modeled with extended linear-quadratic programming, which admits penalty terms and treats them through duality. In general the objective functions in such problems are only piecewise smooth and must be minimized or maximized relative to polyhedral sets of high dimensionality. This paper proposes a new class of numerical methods for fully quadratic problems within this framework, which exhibit second-order nonsmoothness. These methods, combining the idea of finite-envelope representation with that of modified gradient projection, work with local structure in the primal and dual problems simultaneously, feeding information back and forth to trigger advantageous restarts.

Versions resembling steepest descent methods and conjugate gradient methods are presented. When a positive threshold of $\varepsilon$-optimality is specified, both methods converge in a finite number of iterations. With threshold 0, it is shown under mild assumptions that the steepest descent version converges linearly, while the conjugate gradient version still has a finite termination property. The algorithms are designed to exploit features of primal and dual decomposability of the Lagrangian, which are typically available in a large-scale setting, and they are open to considerable parallelization.

Key words. Extended linear-quadratic programming, large-scale numerical optimization, finite-envelope representation, gradient projection, primal-dual methods, steepest descent methods, conjugate gradient methods.

    AMS(MOS) subject classifications. 65K05, 65K10, 90C20

1. Introduction. A number of recent papers have described extended linear-quadratic programming as a modeling scheme that is much more flexible for problems of optimization than conventional quadratic programming and seems especially suited to large-scale applications, in particular because of the way penalty terms can be incorporated. Rockafellar and Wets in [1], [2], first used the concept in two-stage stochastic programming, where the primal dimension is low but the dual dimension is high. It was developed further in its own right in Rockafellar [3], [4], and carried in the latter paper to the context of continuous-time optimal control. Discrete-time problems of optimal control, both deterministic and stochastic (i.e., multistage stochastic programming), were analyzed as extended linear-quadratic programming problems in Rockafellar and Wets [5] and shown to have a remarkable property of Lagrangian decomposability in the primal and dual arguments, both of which can be high dimensional. These models raise new computational challenges and possibilities.

A foundation for numerical schemes in large-scale extended linear-quadratic programming has been laid in Rockafellar [6] and elaborated for problems in multistage format in Rockafellar [7]. The emphasis in [6] is on basic finite-envelope methods, which use duality in generating envelope approximations to the primal and dual objective functions through a finite sequence of separate minimizations or maximizations of the Lagrangian. These methods generalize the one originally proposed in [1] for two-stage stochastic programming and implemented by King [8] and Wagner [9]. They center

*Received by the editors December ???, 1990; accepted for publication (in revised form) ??????, 1992. This work was supported in part by AFOSR grants 87-0281 and 89-0081 and NSF grants DMS-8701768 and DMS-8819586.

    Department of Mathematical Sciences, The Johns Hopkins University, Baltimore, MD 21218.

    Department of Applied Mathematics, FS-20, University of Washington, Seattle, WA 98195.


on the fully quadratic case, where strong convexity is present in both the primal and dual objectives, relying on exterior schemes such as the proximal point algorithm to create such strong convexity iteratively when it might otherwise be lacking.

Here we propose new algorithms which for fully quadratic problems combine the idea of finite-envelope representation with that of nonlinear gradient projection. In these methods the envelope approximations are utilized in a sort of steepest descent format or conjugate gradient format in the primal and dual problems simultaneously. A type of feedback is introduced between primal and dual that takes advantage of information jointly uncovered in computations, which in practice greatly speeds convergence. Both algorithms fit into a fundamental scheme for which global convergence is established. Under a weak geometric assumption akin to strict complementary slackness at optimality, the steepest descent version is shown to converge at a linear rate, while the conjugate gradient version has a finite termination property.

Both versions differ significantly from their traditional namesakes not only through the incorporation of a primal-dual scheme of gradient projection, but also in handling objective functions that generally could involve a complicated polyhedral cell structure not conducive to explicit description by linear equations and inequalities. They treat the underlying constraints without resorting to an active set strategy, which would not be suitable for problems having high dimensionality in both primal and dual.

An important feature is that the computations are not carried out in terms of a large, sparse matrix, such as might in principle serve in part to specify the two problems, but through subroutines for separate minimization and maximization of the Lagrangian in its primal and dual arguments. This framework appears much better adapted to the special structure available in dynamic and stochastic applications, and it supports extensive parallelization. To make this point clearer, and to introduce facts and notation that will later be needed, we discuss briefly the nature of extended linear-quadratic programming and the way it differs from ordinary quadratic programming.

From the Lagrangian point of view, extended linear-quadratic programming is directed toward finding a saddle point $(u, v)$ of a function

(1.1)  $L(u, v) = p \cdot u + \tfrac{1}{2} u \cdot P u + q \cdot v - \tfrac{1}{2} v \cdot Q v - v \cdot R u$  over $U \times V$,

where $U$ and $V$ are nonempty polyhedral (convex) sets in $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively, and the matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$ are symmetric and positive semidefinite. (One has $p \in \mathbb{R}^n$, $q \in \mathbb{R}^m$, and $R \in \mathbb{R}^{m \times n}$.) Associated with $L$, $U$ and $V$ are the primal and dual problems

(P)  minimize $f(u)$ over all $u \in U$, where $f(u) := \sup_{v \in V} L(u, v)$,

(Q)  maximize $g(v)$ over all $v \in V$, where $g(v) := \inf_{u \in U} L(u, v)$.

We speak of the fully quadratic case of (P) and (Q) when both of the matrices $P$ and $Q$ are actually positive definite.

Standard quadratic programming would correspond to $Q = 0$ and $V = \mathbb{R}^{m_1}_+ \times \mathbb{R}^{m_2}$. Then $f$ would consist of a quadratic function plus the indicator of a system of $m_1$ linear inequality constraints and $m_2$ linear equations, the indicator being the function which assigns an infinite penalty whenever these constraints are violated. Other choices of $Q$ and $V$ yield finite penalty expressions of various kinds. This is explained in [4, Secs. 2 and 3] with many examples.


For sound modeling in large-scale applications with dynamics and stochastics such as in [1], [2] and [5], it appears wise to use finite rather than infinite penalties whenever constraints are soft. Extended linear-quadratic programming makes this option conveniently available. To the extent that constraints in the primal problem are hard, they can be handled either by placing them in the definition of the polyhedron $U$ or through an augmented Lagrangian technique which corresponds to an exterior scheme of iterations of the proximal point algorithm, as already mentioned.

Theorem 1.1. [4] (Properties of the objective functions.) The objective functions $f$ in (P) and $g$ in (Q) are piecewise linear-quadratic: in each case the space can be partitioned in principle into a finite collection of polyhedral cells, relative to which the function has a linear or quadratic formula. Moreover, $f$ is convex while $g$ is concave. In the fully quadratic case of (P) and (Q), $f$ is strongly convex and $g$ is strongly concave, both functions having continuous first derivatives.

Theorem 1.2. [4], [1] (Duality and optimality.)
(a) If either of the optimal values inf(P) or sup(Q) is finite, then both are finite and equal, in which event optimal solutions $\bar u$ and $\bar v$ exist for the two problems. In the fully quadratic case in particular, the optimal values inf(P) and sup(Q) are finite and equal; then, moreover, the optimal solutions $\bar u$ and $\bar v$ are unique.
(b) A pair $(\bar u, \bar v)$ is a saddle point of $L(u, v)$ over $U \times V$ if and only if $\bar u$ solves (P) and $\bar v$ solves (Q), or equivalently, $f(\bar u) = g(\bar v)$.

Current numerical methods in standard quadratic programming, and the somewhat more general area of linear complementarity problems [10], where $U = \mathbb{R}^n_+$, $V = \mathbb{R}^m_+$, and $Q$ is not necessarily the zero matrix, are surveyed by Lin and Pang [11]. Other efforts in recent times have been made by Ye and Tse [12], Monteiro and Adler [13], Goldfarb and Liu [14].

None of these approaches is consonant with the large-scale applications that attract our interest, because the structure in such applications is not well served by the wholesale reformulations that would be required when penalty expressions are much involved. Although any problem of extended linear-quadratic programming can in principle be recast as a standard problem in quadratic programming, as established in [1, Theorem 1], there is a substantial price to be paid in dimensionality and loss of symmetry, as well as in potential ill-conditioning. If the original problem had $n$ primal and $m$ dual variables, and the expression of $U$ and $V$ involved $\bar m$ and $\bar n$ constraints beyond nonnegativity of variables, then the reformulated problem in standard format would generally have $n + \bar n + \bar m$ primal and $m + \bar m$ dual variables, and its full constraint system would tend to degeneracy (see [1, proof of Theorem 1]). The dual problem would be quite different in its theoretical properties from the primal problem, so that computational ideas developed for the one could not be applied to the other.

Any problem of extended linear-quadratic programming can alternatively be posed in terms of solving a certain linear variational inequality (generalized equation) as explained in [6, Theorem 2.3], and from that one could pass to a linear complementarity model. Symmetry and the meaningful representation of dynamic and stochastic structure could be maintained to a larger extent in this manner. But linear complementarity algorithms tend to be less robust than methods utilizing objective function values, and an increase in dimensionality would still be required in handling constraints, even if these are simply upper and lower bounds on the variables. Furthermore, such algorithms typically have to be carried to completion. They do not generate sequences of primal-feasible and dual-feasible solutions along with estimates of how far these are from being optimal, as is highly desirable when problem size borders on the difficult.


While much could be said about the special problem structure in dynamic and stochastic applications [5], [7], it can be summarized for present purposes in the assertion that such problems, when formulated with care, satisfy the double decomposability assumption [6]. This means that for any fixed $u \in U$ it is relatively easy to maximize $L(u, v)$ over $v \in V$, and likewise, for any fixed $v \in V$ it is relatively easy to minimize $L(u, v)$ over $u \in U$, usually because of separability when either of the Lagrangian arguments is considered by itself. These subproblems of maximization and minimization calculate not only the objective values $f(u)$ and $g(v)$ but also, in the fully quadratic case where $L$ is strongly convex-concave, the uniquely determined vectors

(1.2)  $F(u) = \operatorname{argmax}_{v \in V} L(u, v)$  and  $G(v) = \operatorname{argmin}_{u \in U} L(u, v)$.

The issue is how to make use of such information in the design of numerical methods. Some proposals have already been made in Rockafellar [6]. Other ideas, which involve splitting algorithms, have been explored by Tseng [15], [16]. Here we aim at adapting classical descent algorithms with help from convex analysis [17].

In this paper we make the blanket assumption of double decomposability, taking it as license also for exact line searchability [6]: the supposition that it is possible to minimize $f(u)$ over any line segment joining two points in $U$, and likewise, to maximize $g(v)$ over any line segment joining two points in $V$. We focus on the fully quadratic case, even though standard quadratic programming is thereby excluded and a direct comparison with other computational approaches, apart from the finite-envelope methods in [6], becomes difficult. Our attention to that case is justified by its own potential in mathematical modeling (cf. [2], [4]) and because strong convexity-concavity of the Lagrangian can be created, if need be, through some outer implementation of the proximal point algorithm [18], [19], as carried out in [1] and [8]. The questions concerning such an outer algorithm are best handled elsewhere, since they have a different character and relate to a host of primal-dual procedures in extended linear-quadratic programming besides the ones developed here, cf. [1], [2], [6]. In particular, such questions are taken up in Zhu [20].

The supposition that line searches can be carried out exactly is an expedient to allow us to concentrate on more important matters for now. It is also in keeping with the exploration of finite termination properties of the kind usually associated with conjugate gradient-like algorithms, which is part of our agenda. One may observe also that because of the piecewise linear-quadratic nature of the objective functions in Theorem 1.1, line searches in our context are of a special kind where exactness is not far-fetched.

A common sort of problem structure which fits with double decomposability is the box-diagonal case, where $P$ and $Q$ are diagonal matrices,

(1.3)  $P = \operatorname{diag}[\pi_1, \ldots, \pi_n]$  and  $Q = \operatorname{diag}[\theta_1, \ldots, \theta_m]$,

the entries $\pi_j$ and $\theta_i$ being positive (for fully quadratic problems), while $U$ and $V$ are boxes representing upper and lower bounds (not necessarily finite) on the components of $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_m)$:

(1.4)  $U = [u_1^-, u_1^+] \times \cdots \times [u_n^-, u_n^+]$  and  $V = [v_1^-, v_1^+] \times \cdots \times [v_m^-, v_m^+]$.

In this case, we have for each $u \in U$ that the problem of maximizing $L(u, v)$ over $v \in V$ to obtain $f(u)$ and $F(u)$ decomposes into separate one-dimensional subproblems in the individual coordinates: for $i = 1, \ldots, m$,


(1.5)  maximize $\bigl(q_i - \textstyle\sum_{j=1}^n r_{ij} u_j\bigr) v_i - \tfrac{1}{2} \theta_i v_i^2$  subject to $v_i^- \le v_i \le v_i^+$.

Likewise, the problem of minimizing $L(u, v)$ over $u \in U$ for given $v \in V$, so as to calculate $g(v)$ and $G(v)$, reduces to the separate problems

(1.6)  minimize $\bigl(p_j - \textstyle\sum_{i=1}^m v_i r_{ij}\bigr) u_j + \tfrac{1}{2} \pi_j u_j^2$  subject to $u_j^- \le u_j \le u_j^+$.

Clearly, there exist very simple closed-form solutions to these one-dimensional subproblems. No actual minimization or maximization routine needs to be invoked. Often there are ways also of obtaining the answers without explicitly introducing the $r_{ij}$'s.
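In code, a minimal sketch of this decomposition might look as follows (the function names are ours, not the paper's, and forming $Ru$ and $R^T v$ explicitly is for illustration only; as just noted, structured applications would avoid materializing $R$). Each one-dimensional subproblem in (1.5) or (1.6) is solved by clipping its unconstrained stationary point to the box.

    import numpy as np

    def F_box(u, q, R, theta, v_lo, v_hi):
        """F(u) = argmax_{v in V} L(u,v) in the box-diagonal case (1.5):
        coordinate i maximizes (q_i - (Ru)_i) v_i - 0.5*theta_i*v_i^2 over
        [v_lo_i, v_hi_i], so the stationary point is clipped to the box."""
        return np.clip((q - R @ u) / theta, v_lo, v_hi)

    def G_box(v, p, R, pi, u_lo, u_hi):
        """G(v) = argmin_{u in U} L(u,v) in the box-diagonal case (1.6):
        coordinate j minimizes (p_j - (R^T v)_j) u_j + 0.5*pi_j*u_j^2."""
        return np.clip((R.T @ v - p) / pi, u_lo, u_hi)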

In notation, we shall refer consistently to

(1.7)  $\bar u$ = the unique optimal solution to (P),  $\bar v$ = the unique optimal solution to (Q),

these properties meaning by Theorem 1.2 that

(1.8)  $(\bar u, \bar v)$ = the unique saddle point of $L$ on $U \times V$,

or equivalently in terms of the mappings $F$ and $G$ that

(1.9)  $\bar v = F(\bar u)$  and  $\bar u = G(\bar v)$.

Furthermore, we shall write

(1.10)  $\|u\|_P = [u \cdot P u]^{1/2}$ and $\|v\|_Q = [v \cdot Q v]^{1/2}$,  $\langle w, u \rangle_P = w \cdot P u$ and $\langle z, v \rangle_Q = z \cdot Q v$,

for the norms and inner products corresponding to the positive definite matrices $P$ and $Q$. It is these norms and inner products, rather than the canonical ones, that intrinsically underlie the analysis of our problems, and it is well to bear this in mind. Just as the function $f$, if it is $C^2$ around a point $u$, can be expanded as

$f(u') = f(u) + \langle \nabla f(u), u' - u \rangle + \tfrac{1}{2} \langle u' - u, \nabla^2 f(u)(u' - u) \rangle + o(\|u' - u\|^2)$,

it can also be expanded as

$f(u') = f(u) + \langle \nabla_P f(u), u' - u \rangle_P + \tfrac{1}{2} \langle u' - u, \nabla^2_P f(u)(u' - u) \rangle_P + o(\|u' - u\|_P^2)$

for a certain vector $\nabla_P f(u)$ and a certain matrix $\nabla^2_P f(u)$; similarly for $g$ in terms of $\nabla_Q g(v)$ and $\nabla^2_Q g(v)$. Clearly,

(1.11)  $\nabla_P f(u) = P^{-1} \nabla f(u)$,  $\nabla^2_P f(u) = P^{-1} \nabla^2 f(u)$,  $\nabla_Q g(v) = Q^{-1} \nabla g(v)$,  $\nabla^2_Q g(v) = Q^{-1} \nabla^2 g(v)$.
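As a small computational illustration of (1.10)-(1.11) (ours, not the paper's), the $P$-gradient is the ordinary gradient premultiplied by $P^{-1}$, and the $P$-norm comes from the $P$-inner product:

    import numpy as np

    def p_norm(u, P):
        """||u||_P = [u . P u]^{1/2} as in (1.10)."""
        return np.sqrt(u @ (P @ u))

    def p_gradient(grad_f_u, P):
        """grad_P f(u) = P^{-1} grad f(u) as in (1.11); solving the
        linear system avoids forming P^{-1} explicitly."""
        return np.linalg.solve(P, grad_f_u)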

  • 8/13/2019 10.1.1.161.7484

    6/33

    6 c. zhu and r. t. rockafellar

In appealing to this symbolism we shall better be able to bring out the basic structure and convergence properties of the proposed algorithms.

We cite now from [6] several fundamental properties on which the algorithmic developments in this paper will depend.

Proposition 1.3. [6, p. 459] (Optimality estimates.) Suppose $u$ and $v$ are elements of $U$ and $V$ satisfying $f(u) - g(v) \le \varepsilon$, where $\varepsilon \ge 0$. Then $u$ and $v$ are $\varepsilon$-optimal in the sense that $|f(u) - f(\bar u)| \le \varepsilon$ and $|g(v) - g(\bar v)| \le \varepsilon$. Moreover, $\|u - \bar u\|_P \le \sqrt{2\varepsilon}$ and $\|v - \bar v\|_Q \le \sqrt{2\varepsilon}$.

Proposition 1.4. [6, pp. 438, 469] (Regularity properties.) The functions $f$ and $g$ are continuously differentiable everywhere, and the mappings $F$ and $G$ are Lipschitz continuous:

(1.12)  $\nabla f(u) = \nabla_u L(u, F(u)) = p + P u - R^T F(u)$,
       $\nabla g(v) = \nabla_v L(G(v), v) = q - Q v - R\, G(v)$,

where in terms of the constant

(1.13)  $\gamma(P, Q, R) := \|Q^{-1/2} R P^{-1/2}\|$

one has

(1.14)  $\|F(u) - F(u')\|_Q \le \gamma(P, Q, R)\, \|u - u'\|_P$ for all $u$ and $u'$,
       $\|G(v) - G(v')\|_P \le \gamma(P, Q, R)\, \|v - v'\|_Q$ for all $v$ and $v'$.

The finite-envelope idea enters through repeated application of the mappings $F$ and $G$. The rationale is discussed at length in [6], but the main facts needed here are in the next two propositions.

Proposition 1.5. [6, p. 460] (Envelope properties.) For arbitrary $u_0 \in U$ and $v_0 \in V$, let $v_1 = F(u_0)$ and $u_1 = G(v_0)$, followed by $v_2 = F(u_1)$ and $u_2 = G(v_1)$. Then in the primal problem

(1.15)  $f(u) \ge L(u, v_1)$ for all $u$, with $L(u_0, v_1) = f(u_0)$ and $\nabla_u L(u_0, v_1) = \nabla f(u_0)$;
       $f(u) \ge L(u, v_2)$ for all $u$, with $L(u_1, v_2) = f(u_1)$ and $\nabla_u L(u_1, v_2) = \nabla f(u_1)$;

while in the dual problem

(1.16)  $g(v) \le L(u_1, v)$ for all $v$, with $L(u_1, v_0) = g(v_0)$ and $\nabla_v L(u_1, v_0) = \nabla g(v_0)$;
       $g(v) \le L(u_2, v)$ for all $v$, with $L(u_2, v_1) = g(v_1)$ and $\nabla_v L(u_2, v_1) = \nabla g(v_1)$.

Proposition 1.6. [6, p. 470] (Modified gradient projection.) For arbitrary $u_0 \in U$ and $v_0 \in V$, let $v_1 = F(u_0)$ and $u_1 = G(v_0)$, followed by $v_2 = F(u_1)$ and $u_2 = G(v_1)$. Then

(1.17)  $L(u, v_1) = f(u_0) + \nabla f(u_0) \cdot (u - u_0) + \tfrac{1}{2}(u - u_0) \cdot P (u - u_0)$
             $= f(u_0) + \langle \nabla_P f(u_0), u - u_0 \rangle_P + \tfrac{1}{2}\|u - u_0\|_P^2$
             $= \tfrac{1}{2}\|(u - u_0) + \nabla_P f(u_0)\|_P^2 + \text{const.}$,

(1.18)  $L(u_1, v) = g(v_0) + \nabla g(v_0) \cdot (v - v_0) - \tfrac{1}{2}(v - v_0) \cdot Q (v - v_0)$
             $= g(v_0) + \langle \nabla_Q g(v_0), v - v_0 \rangle_Q - \tfrac{1}{2}\|v - v_0\|_Q^2$
             $= -\tfrac{1}{2}\|(v - v_0) - \nabla_Q g(v_0)\|_Q^2 + \text{const.}$,


so from the definition of $u_2$ and $v_2$ one has that

(1.19)  $u_2 - u_0$ = $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$,
       $v_2 - v_0$ = $Q$-projection of $\nabla_Q g(v_0)$ on $V - v_0$.

Proof. The first equation in (1.17) expands $L(\cdot, v_1)$ at $u_0$ in accordance with (1.15), and the rest of (1.17) re-expresses this via (1.10) and (1.11). Since $u_2 := \operatorname{argmin}_{u \in U} L(u, v_1)$, $u_2$ is thus the $P$-nearest point of $U$ to $u_0 - \nabla_P f(u_0)$, so $u_2 - u_0$ is the $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$. The assertions in the $v$-argument are verified similarly.

The formulas in (1.19) give the precise form of (nonlinear) gradient projection that is available through our assumed ability to calculate $F(u)$ and $G(v)$ whenever we please. It is this form, therefore, that we shall incorporate in our algorithms. The reader should note this carefully, or a crucial feature of our approach, in its applicability to large-scale problems, will be missed. Although the gradients of $f$ and $g$ exist and are expressed by the formulas in Proposition 1.4, we do not have to calculate them through these formulas, much less apply a subroutine for gradient projection. In particular, it is not necessary to generate or store the potentially huge or dense matrix $R$. To execute our algorithms, one only needs to be able to generate the points $u_1$, $u_2$, $v_1$ and $v_2$ from a given pair $u_0$ and $v_0$. As explained, this can be done through subroutines which minimize or maximize the Lagrangian individually in the primal or dual argument, cf. (1.2). For multistage, possibly stochastic, optimization problems expressed in the format of [1], [2], and [6], such subroutines can easily be written in terms of the underlying data structure (without ever introducing $R$!).
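In code, the whole gradient-projection machinery therefore reduces to four calls of the two subroutines. A minimal sketch (hypothetical helper name; $F$ and $G$ as in (1.2)):

    def envelope_points(u0, v0, F, G):
        """One round of envelope generation (Propositions 1.5 and 1.6).
        By (1.19), u2 - u0 is then the P-projection of -grad_P f(u0) on
        U - u0, and v2 - v0 the Q-projection of grad_Q g(v0) on V - v0,
        obtained without ever forming R or an explicit gradient."""
        v1 = F(u0)   # maximizing L(u0, .) also yields f(u0) = L(u0, v1)
        u1 = G(v0)   # minimizing L(., v0) also yields g(v0) = L(u1, v0)
        v2 = F(u1)
        u2 = G(v1)
        return u1, v1, u2, v2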

In obtaining our results about local rates of convergence, a mild condition on the optimal solutions $\bar u$ and $\bar v$ will eventually be required. To formulate it, we introduce the sets

(1.20)  $U_0 := \operatorname{argmin}_{u \in U} \nabla_u L(\bar u, \bar v) \cdot u = \operatorname{argmin}_{u \in U} \nabla f(\bar u) \cdot u = \operatorname{argmin}_{u \in U} \langle \nabla_P f(\bar u), u \rangle_P$,

(1.21)  $V_0 := \operatorname{argmax}_{v \in V} \nabla_v L(\bar u, \bar v) \cdot v = \operatorname{argmax}_{v \in V} \nabla g(\bar v) \cdot v = \operatorname{argmax}_{v \in V} \langle \nabla_Q g(\bar v), v \rangle_Q$,

which are called the critical faces of $U$ and $V$ in (P) and (Q) [6]. They are closed faces of the polyhedral sets $U$ and $V$, and they contain the optimal solutions $\bar u$ and $\bar v$, respectively, by virtue of the elementary conditions for the minimum of a smooth convex function (or the maximum of a smooth concave function).

Definition 1.7. (Critical face condition.) The critical face condition will be said to be satisfied at the optimal solutions $\bar u$ and $\bar v$ if $\bar u \in \operatorname{ri} U_0$ and $\bar v \in \operatorname{ri} V_0$ (where ri denotes relative interior in the sense of convex analysis).

We do not add this condition as a standing assumption, but it will be invoked several times in connection with the following property of the envelope mappings $F$ and $G$, which is implicit in the proof of [6, Theorem 5.4] but is stated here explicitly.

Proposition 1.8. (Envelope behavior near the critical faces.) There exist neighborhoods of $\bar u$ and $\bar v$ with the property that if the points $u_0 \in U$ and $v_0 \in V$ belong to these neighborhoods, then the points

$v_1 = F(u_0)$,  $u_1 = G(v_0)$,  $v_2 = F(u_1)$,  $u_2 = G(v_1)$

will be such that $u_1$ and $u_2$ belong to the primal critical face $U_0$, while $v_1$ and $v_2$ belong to the dual critical face $V_0$. Under the critical face condition, the neighborhoods can be chosen so that $u_1$ and $u_2$ actually belong to $\operatorname{ri} U_0$, while $v_1$ and $v_2$ belong to $\operatorname{ri} V_0$.


Proof. We adapt the argument given for [6, Theorem 5.4]. From (1.9) and the continuity of $F$ and $G$ in Proposition 1.4, we know that by making $u_0$ and $v_0$ close to $\bar u$ and $\bar v$ we will make $u_1$ and $u_2$ close to $\bar u$, and $v_1$ and $v_2$ close to $\bar v$. For each vector $w \in \mathbb{R}^n$, let $M(w)$ be the closed face of the polyhedron $U$ on which the function $u \mapsto w \cdot u$ achieves its minimum. This could be empty for some choices of $w$, but in the case of $\bar w = \nabla_u L(\bar u, \bar v)$ it is $U_0$, which contains $\bar u$. The graph of $M$ as a set-valued mapping is closed (as can be verified directly or through the observation that $M$ is the subdifferential of the support function of $U$, cf. [17, Secs. 13, 23]), and $M$ has only finitely many values (since $U$ has only finitely many faces). It follows that $M(w) \subset M(\bar w) = U_0$ when $w$ is in some neighborhood of $\bar w$. We can apply this in particular to $w = \nabla_u L(u_1, v_0)$, noting that this vector will be close to $\bar w$ when $u_0$ and $v_0$ are sufficiently close to $\bar u$ and $\bar v$. The point $u_1$ minimizes $L(u, v_0)$ over $u \in U$ and therefore has the property that $\nabla_u L(u_1, v_0) \cdot (u - u_1) \ge 0$ for all $u \in U$, which means $u_1 \in M(w)$. Therefore $u_1 \in U_0$ when $u_0$ and $v_0$ are sufficiently close to $\bar u$ and $\bar v$.

Parallel reasoning demonstrates that $v_1 \in V_0$ under such circumstances. If the critical face condition holds, then as $u_1$ and $v_1$ approach $\bar u$ and $\bar v$ they must actually enter the relative interiors $\operatorname{ri} U_0$ and $\operatorname{ri} V_0$. The same argument can be applied now to reach these conclusions for $u_2$ and $v_2$.

2. Formulation of the Algorithms. The new methods for the fully quadratic case of problems (P) and (Q) will be formulated as conceptual algorithms involving line search. The convergence analysis will be undertaken in Sections 3, 4, and 5, and the numerical test results will be given in Section 6.

In what follows, we use $[w_1, w_2]$ to denote the line segment between two points $w_1$ and $w_2$, and we use $\nu$ as the running index for iterations.

The main characteristic of the new methods is the coupling of line search procedures in the primal and dual problems with interactive restarts. To assist the reader in understanding this, we first formulate the method analogous to steepest descent, where there are fewer parameters and the algorithmic logic is simpler.

Algorithm 1. (Primal-Dual Steepest Descent Algorithm, PDSD.) Construct primal and dual sequences $\{u_0^\nu\} \subset U$ and $\{v_0^\nu\} \subset V$ as follows.

Step 0 (initialization). Choose a real value for the parameter $\varepsilon \ge 0$ (optimality threshold). Set $\nu := 0$ (iteration counter). Specify starting points $u_0^0 \in U$ and $v_0^0 \in V$ for the sequences $\{u_0^\nu\} \subset U$ and $\{v_0^\nu\} \subset V$ that will be generated along with $\{u_1^\nu\}$, $\{u_2^\nu\}$, $\{v_1^\nu\}$ and $\{v_2^\nu\}$.

Step 1 (evaluation). Calculate

$f(u_0^\nu)$, $g(v_0^\nu)$, obtaining as by-products $v_1^\nu = F(u_0^\nu)$, $u_1^\nu = G(v_0^\nu)$;
$g(v_1^\nu)$, $f(u_1^\nu)$, obtaining as by-products $u_2^\nu = G(v_1^\nu)$, $v_2^\nu = F(u_1^\nu)$.

Step 2 (interactive restarts). Take

$u_0^\nu := u_0^\nu$, $v_1^\nu := v_1^\nu$, $u_2^\nu := u_2^\nu$ if $f(u_0^\nu) \le f(u_1^\nu)$;
$u_0^\nu := u_1^\nu$, $v_1^\nu := v_2^\nu$, $u_2^\nu := G(v_1^\nu)$ otherwise (this is an interactive primal restart).

$v_0^\nu := v_0^\nu$, $u_1^\nu := u_1^\nu$, $v_2^\nu := v_2^\nu$ if $g(v_0^\nu) \ge g(v_1^\nu)$;
$v_0^\nu := v_1^\nu$, $u_1^\nu := u_2^\nu$, $v_2^\nu := F(u_1^\nu)$ otherwise (this is an interactive dual restart).

(In an interactive primal restart, the calculation of $G(v_1^\nu)$ yields the new $g(v_1^\nu)$. Likewise, in an interactive dual restart, the calculation of $F(u_1^\nu)$ yields the new $f(u_1^\nu)$.)


Step 3 (optimality test). Let

$u := u_0^\nu$ if $f(u_0^\nu) \le f(u_1^\nu)$,  $u := u_1^\nu$ if $f(u_0^\nu) > f(u_1^\nu)$;
$v := v_0^\nu$ if $g(v_0^\nu) \ge g(v_1^\nu)$,  $v := v_1^\nu$ if $g(v_0^\nu) < g(v_1^\nu)$.

If $f(u) - g(v) \le \varepsilon$, terminate with $u$ and $v$ being $\varepsilon$-optimal solutions to (P) and (Q).

Step 4 (line segment search). Search for

$u_0^{\nu+1} := \operatorname{argmin}_{u \in [u_0^\nu, u_2^\nu]} f(u)$  and  $v_0^{\nu+1} := \operatorname{argmax}_{v \in [v_0^\nu, v_2^\nu]} g(v)$.

Return then to Step 1 with the counter $\nu$ increased by 1.
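For orientation, here is a compact sketch of the whole iteration in code (ours, not the authors'); $f$, $g$, $F$, $G$ are the problem subroutines, and seg_min/seg_max stand for the exact line-segment searches assumed in Section 1. Where the printed statement leaves the interleaving of simultaneous primal and dual restarts implicit, the sketch resolves both tests against the Step 1 values.

    def pdsd(u0, v0, f, g, F, G, seg_min, seg_max, eps):
        """Sketch of Algorithm 1 (PDSD), Steps 1-4."""
        while True:
            # Step 1 (evaluation).
            v1, u1 = F(u0), G(v0)
            u2, v2 = G(v1), F(u1)
            # Step 2 (interactive restarts), both tested on Step 1 values.
            if f(u1) < f(u0):                   # interactive primal restart
                u0_, v1_, u2_ = u1, v2, G(v2)   # new v1 is the old v2
            else:
                u0_, v1_, u2_ = u0, v1, u2
            if g(v1) > g(v0):                   # interactive dual restart
                v0_, u1_, v2_ = v1, u2, F(u2)   # new u1 is the old u2
            else:
                v0_, u1_, v2_ = v0, u1, v2
            u0, v1, u2 = u0_, v1_, u2_
            v0, u1, v2 = v0_, u1_, v2_
            # Step 3 (optimality test via the duality gap, Proposition 1.3).
            u = u0 if f(u0) <= f(u1) else u1
            v = v0 if g(v0) >= g(v1) else v1
            if f(u) - g(v) <= eps:
                return u, v
            # Step 4 (exact searches over the projected-gradient segments).
            u0 = seg_min(f, u0, u2)
            v0 = seg_max(g, v0, v2)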

Basically, the idea in this method is that if the point $u_1^\nu$ calculated as a by-product of finding the projected gradient (1.19) in the dual problem gives a better value to the objective in the primal problem than does the current primal point $u_0^\nu$, we take it instead as the current primal point (and accordingly recalculate the projected gradient in the primal problem). Likewise, if the point $v_1^\nu$ calculated as a by-product of finding the projected gradient (1.19) in the primal problem happens to give a better value to the objective in the dual problem than the current dual point $v_0^\nu$, we take it instead as the current dual point (and accordingly recalculate the projected gradient in the dual problem). Here it may be recalled that $u_1^\nu$ minimizes over $U$ the convex quadratic function $L(\cdot, v_0^\nu)$, which is a lower approximant to the objective function $f$ in (P) that would have the same minimum value as $f$ over $U$ if $v_0^\nu$ were dual optimal. By the same token, $v_1^\nu$ maximizes over $V$ the concave quadratic function $L(u_0^\nu, \cdot)$, which is an upper approximant to the objective function $g$ in (Q) that would have the same maximum value as $g$ over $V$ if $u_0^\nu$ were primal optimal.

Once the issue of triggering a primal or dual interactive restart (or both) settles down in a given iteration, we perform line searches in the directions indicated by the projected gradients in the two problems. If $U$ were the whole space $\mathbb{R}^n$, the primal search direction would be the true direction of steepest descent for $f$ (relative to the geometry induced by the norm $\|\cdot\|_P$ on $\mathbb{R}^n$). Similarly, if $V$ were the whole space $\mathbb{R}^m$, the dual search direction would be the true direction of steepest ascent for $g$ (relative to the geometry of the norm $\|\cdot\|_Q$ on $\mathbb{R}^m$). However, even in this unconstrained case there would be a difference in the way the searches are carried out, in comparison with classical steepest descent, because instead of looking along an entire half-line we only optimize along a line segment whose length is that of the gradient, i.e., we restrict the step size to be at most 1. (Also, we call for an exact optimum because the objective is piecewise strictly quadratic with only finitely many pieces. Clearly, this requirement could be loosened, but the issue is minor and we do not wish to be distracted by it here.)

The restriction to a line segment instead of a half-line is motivated in part by the fact that the line segment is known to lie entirely in the feasible set. A search over a half-line would have to cope with detecting the feasibility boundary in the search parameter, which could be a disadvantage in a high-dimensional setting, although this topic could be explored further. Heuristic motivation for the restriction comes also from evidence of second-order effects induced by the primal-dual feedback, as discussed below. It turns out that under mild assumptions the optimal step sizes along a half-line would eventually be no greater than 1 anyway.

The interactive restarts may seem like a merely opportunistic feature of Algorithm 1, but they have a marked effect, as the numerical tests in Section 6 will reveal. When interactive restarts are blocked, so that the algorithm reverts to two independent procedures in the primal and dual settings (through a sort of computational lobotomy), the performance is slowed down to what one might expect from a steepest-descent-like algorithm. On the other hand, when the interactions are permitted the performance in practice is quite comparable to that of more complicated procedures which attempt to exploit second-order properties. The feedback between primal and dual appears able to supply some such information to the calculations.

In order to develop a broader range of interactive-restart methods, analogous not only to steepest descent but to conjugate gradients, we next formulate as Algorithm 0 a bare-bones procedure which will serve in establishing convergence properties of such methods, including Algorithm 1. The chief complication in Algorithm 0 beyond what has already been seen in Algorithm 1 comes through the introduction of cycles for primal and dual restarts. With respect to these cycles an additional threshold parameter is introduced as a technical safeguard against interactive restarts being triggered too freely, without assurance of adequate progress.

Algorithm 0. (General Primal-Dual Projected Gradient Algorithm, PDPG.) Construct primal and dual sequences $\{u_0^\nu\} \subset U$ and $\{v_0^\nu\} \subset V$ as follows.

Step 0 (initialization). Choose an integer value for the parameter $k > 0$ (cycle size) and real values for the parameters $\varepsilon \ge 0$ (optimality threshold) and $\delta > 0$ (progress threshold). Set $\nu := 0$ (iteration counter), $k_p := 0$ (primal restart counter), and $k_d := 0$ (dual restart counter). Specify starting points $u_0^0 \in U$ and $v_0^0 \in V$ for the sequences $\{u_0^\nu\} \subset U$ and $\{v_0^\nu\} \subset V$ that will be generated along with $\{u_1^\nu\}$, $\{u_2^\nu\}$, $\{v_1^\nu\}$ and $\{v_2^\nu\}$.

Step 1 (evaluation). Calculate

$f(u_0^\nu)$, $g(v_0^\nu)$, obtaining as by-products $v_1^\nu = F(u_0^\nu)$, $u_1^\nu = G(v_0^\nu)$;
$g(v_1^\nu)$, $f(u_1^\nu)$, obtaining as by-products $u_2^\nu = G(v_1^\nu)$, $v_2^\nu = F(u_1^\nu)$.

Step 2 (interactive restarts). Take

$u_0^\nu := u_0^\nu$, $v_1^\nu := v_1^\nu$, $u_2^\nu := u_2^\nu$ if $f(u_0^\nu) \le f(u_1^\nu)$, or $f(u_0^\nu) < f(u_1^\nu) + \delta$ and $k_p < k$;
$u_0^\nu := u_1^\nu$, $v_1^\nu := v_2^\nu$, $u_2^\nu := G(v_1^\nu)$ otherwise (this is an interactive primal restart).

$v_0^\nu := v_0^\nu$, $u_1^\nu := u_1^\nu$, $v_2^\nu := v_2^\nu$ if $g(v_0^\nu) \ge g(v_1^\nu)$, or $g(v_0^\nu) > g(v_1^\nu) - \delta$ and $k_d < k$;
$v_0^\nu := v_1^\nu$, $u_1^\nu := u_2^\nu$, $v_2^\nu := F(u_1^\nu)$ otherwise (this is an interactive dual restart).

(In an interactive primal restart the calculation of $G(v_1^\nu)$ yields the new $g(v_1^\nu)$. Likewise, in an interactive dual restart the calculation of $F(u_1^\nu)$ yields the new $f(u_1^\nu)$.)

Set $k_p := 0$ if an interactive primal restart occurred in this step, $k_d := 0$ if an interactive dual restart occurred in this step.

Step 3 (optimality test). Let

$u := u_0^\nu$ if $f(u_0^\nu) \le f(u_1^\nu)$,  $u := u_1^\nu$ if $f(u_0^\nu) > f(u_1^\nu)$;
$v := v_0^\nu$ if $g(v_0^\nu) \ge g(v_1^\nu)$,  $v := v_1^\nu$ if $g(v_0^\nu) < g(v_1^\nu)$.

If $f(u) - g(v) \le \varepsilon$, terminate with $u$ and $v$ being $\varepsilon$-optimal solutions to (P) and (Q).

Step 4 (search endpoint generation). Take

$u_e^\nu := u_2^\nu$ if $k_p \equiv 0 \pmod{k}$;  $u_e^\nu \in U$ according to an auxiliary rule otherwise.
$v_e^\nu := v_2^\nu$ if $k_d \equiv 0 \pmod{k}$;  $v_e^\nu \in V$ according to an auxiliary rule otherwise.


Step 5 (line segment search). Search for

$u_0^{\nu+1} := \operatorname{argmin}_{u \in [u_0^\nu, u_e^\nu]} f(u)$  and  $v_0^{\nu+1} := \operatorname{argmax}_{v \in [v_0^\nu, v_e^\nu]} g(v)$.

Return then to Step 1 with the counters $\nu$, $k_p$ and $k_d$ increased by 1.
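The only new logic beyond Algorithm 1 lies in the restart conditions of Step 2 and the counters; a literal rendering of the primal test as a sketch (ours), with delta the progress threshold:

    def primal_restart(f_u0, f_u1, kp, k, delta):
        """Step 2 of Algorithm 0, primal side. No restart when u1 fails
        to improve f at all, or improves it by less than delta while the
        current cycle is still running (kp < k); otherwise an interactive
        primal restart occurs (and the caller resets kp to 0)."""
        keep = (f_u0 <= f_u1) or (f_u0 < f_u1 + delta and kp < k)
        return not keep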

By specifying the auxiliary rules in Step 4 for generating the search interval endpoints $u_e^\nu$ and $v_e^\nu$ in iterations where $k_p$ or $k_d$ is not a multiple of $k$, we obtain particular realizations of Algorithm 0. An attractive case in which these rules correspond to a conjugate gradient approach with cycle size $k$ will be developed presently as Algorithm 2. Before proceeding, however, we want to emphasize for theoretical purposes that Algorithm 1 is itself a particular realization of Algorithm 0.

Proposition 2.1. Algorithm 0 reduces to Algorithm 1 when the cycle size is $k = 1$ (except for a slight difference in iteration $\nu = 0$).

Proof. In returning from Step 5 of Algorithm 0 to Step 1, the counters $k_p$ and $k_d$ are always at least 1. It follows that if $k = 1$ the condition in Step 2 with progress threshold $\delta$ will never come into play after such a return. Thus, the only possible effect of this threshold will be in iteration $\nu = 0$, where a restart will be avoided unless it improves the objective by at least $\delta$. In Step 4, $k_p$ and $k_d$ will always be multiples of $k$, so we will always have $u_e^\nu = u_2^\nu$ and $v_e^\nu = v_2^\nu$. Thus the counters $k_p$ and $k_d$ become redundant and the auxiliary rules moot.

In Algorithm 0 in general, $k_p$ counts iterations in the primal problem from the start or the most recent interactive primal restart. An iteration that begins with $k_p$ being a positive multiple of $k$ is said to be one in which an ordinary primal restart takes place (whether or not an interactive primal restart also takes place), because it marks the completion of a cycle of $k$ iterations not cut short by an interactive primal restart. Every iteration involving an ordinary or interactive primal restart ends by searching the line segment $[u_0^\nu, u_2^\nu]$, where $u_2^\nu - u_0^\nu$ is the negative of the current projected gradient of $f$ in (1.19). The dual situation is parallel in terms of the counter $k_d$ and the notion of an ordinary dual restart.

The role of the parameter $\delta > 0$ is to control the extent to which the algorithm forgoes interactive restarts and insists on waiting for ordinary restarts. Interactive restarts are always accepted if they improve the corresponding objective value by the amount $\delta$ or more, but there can only be finitely many iterations with this size of improvement, due to the finiteness of the joint optimal value in (P) and (Q) (Theorem 1.1). When such improvement is no longer possible, interactive restarts are blocked in the primal until an ordinary restart has again intervened, unless one is already occurring in the same iteration; the same holds in the dual. This feature ensures that full cycles of $k$ iterations will continue to be performed in the primal and dual as long as the algorithm keeps running, which is important in establishing certain properties of convergence.

Recall that the point $u_2^\nu$ minimizes over $U$ the lower envelope function $L(u, v_1^\nu)$ as a representation of $f(u)$ at $u_0^\nu$ (Proposition 1.5), which has $\nabla_u L(u_0^\nu, v_1^\nu) = \nabla f(u_0^\nu)$. Even apart from the projected gradient interpretation, therefore, there is motivation in searching the line segment $[u_0^\nu, u_2^\nu]$ in order to reduce the objective value $f(u)$ in the primal. The same motivation exists for searching $[v_0^\nu, v_2^\nu]$ in the dual.

As a matter of fact, we shall prove in Proposition 5.1 that on exiting from Step 5 (line segment search) of Algorithm 0, the point $u_1^{\nu+1} = G(v_0^{\nu+1})$ will be the minimum point relative to $U$ for the envelope function

$\hat f(u) := \max_{v \in [v_0^\nu, v_2^\nu]} L(u, v) \le \max_{v \in V} L(u, v) = f(u)$.


When the algorithm reaches Step 2 in iteration $\nu + 1$, it will compare the point $u_0^{\nu+1}$ resulting from the just-completed line search in the primal with the point $u_1^{\nu+1}$ resulting from minimizing the lower envelope function $\hat f(u)$, and it will take the better of the two as the next primal iterate. In the dual procedure there are corresponding comparisons between $v_0^{\nu+1}$ and $v_1^{\nu+1}$.

We focus now on a specialization of Algorithm 0 in which, in contrast to Algorithm 1, the cycle provisions are crucial and the auxiliary rules nontrivial. The rules emulate those of the classical conjugate gradient method (Hestenes-Stiefel).

Algorithm 2. (Primal-Dual Conjugate Gradient Method, PDCG.) In the implementation of Algorithm 0, choose a cycle size $k > 1$ and use the following auxiliary rules to get the search intervals in Step 4. Unless $k_p \equiv 0 \pmod{k}$, set

(2.1)  $w_p^\nu := \nabla_P f(u_0^\nu) - \nabla_P f(u_0^{\nu-1})$,

(2.2)  $\beta_p^\nu := \begin{cases} \max\{0, \langle w_p^\nu, u_0^\nu - u_2^\nu \rangle_P\} \,/\, \langle w_p^\nu, u_e^{\nu-1} - u_0^\nu \rangle_P & \text{if } \langle w_p^\nu, u_e^{\nu-1} - u_0^\nu \rangle_P > 0, \\ 0 & \text{otherwise,} \end{cases}$

(2.3)  $u_{cg}^\nu := (u_2^\nu + \beta_p^\nu u_e^{\nu-1}) / (1 + \beta_p^\nu)$,

(2.4)  $[u_0^\nu, u_e^\nu] := \begin{cases} [u_0^\nu, u_{cg}^\nu] & \text{if } \|u_{cg}^\nu - u_0^\nu\|_P \le 1, \\ L_p^\nu \cap U & \text{otherwise,} \end{cases}$

where $L_p^\nu = \{u \in \mathbb{R}^n \mid u = u_0^\nu + \tau(u_{cg}^\nu - u_0^\nu),\ 0 \le \tau \le \|u_{cg}^\nu - u_0^\nu\|_P^{-1}\}$. Similarly, unless $k_d \equiv 0 \pmod{k}$, set

(2.5)  $w_d^\nu := -\nabla_Q g(v_0^\nu) + \nabla_Q g(v_0^{\nu-1})$,

(2.6)  $\beta_d^\nu := \begin{cases} \max\{0, \langle w_d^\nu, v_0^\nu - v_2^\nu \rangle_Q\} \,/\, \langle w_d^\nu, v_e^{\nu-1} - v_0^\nu \rangle_Q & \text{if } \langle w_d^\nu, v_e^{\nu-1} - v_0^\nu \rangle_Q > 0, \\ 0 & \text{otherwise,} \end{cases}$

(2.7)  $v_{cg}^\nu := (v_2^\nu + \beta_d^\nu v_e^{\nu-1}) / (1 + \beta_d^\nu)$,

(2.8)  $[v_0^\nu, v_e^\nu] := \begin{cases} [v_0^\nu, v_{cg}^\nu] & \text{if } \|v_{cg}^\nu - v_0^\nu\|_Q \le 1, \\ L_d^\nu \cap V & \text{otherwise,} \end{cases}$

where $L_d^\nu = \{v \in \mathbb{R}^m \mid v = v_0^\nu + \tau(v_{cg}^\nu - v_0^\nu),\ 0 \le \tau \le \|v_{cg}^\nu - v_0^\nu\|_Q^{-1}\}$.
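As a reading of (2.1)-(2.4) in code (our sketch; the dual rules (2.5)-(2.8) are symmetric), with inner products taken in the $P$-metric of (1.10) and the intersection with $U$ in the second case of (2.4) left to the caller:

    import numpy as np

    def primal_cg_endpoint(u0, u2, u0_prev, ue_prev, gradP_f, P):
        """Auxiliary rules (2.1)-(2.4) of Algorithm 2 (PDCG)."""
        ip = lambda a, b: a @ (P @ b)                  # <.,.>_P of (1.10)
        w = gradP_f(u0) - gradP_f(u0_prev)             # (2.1)
        denom = ip(w, ue_prev - u0)
        beta = max(0.0, ip(w, u0 - u2)) / denom if denom > 0 else 0.0  # (2.2)
        ucg = (u2 + beta * ue_prev) / (1.0 + beta)     # (2.3)
        d = ucg - u0
        len_P = np.sqrt(ip(d, d))
        # (2.4): keep [u0, ucg] if its P-length is at most 1; otherwise
        # return the endpoint of the unit-P-length segment L_p (to be
        # intersected with U by the caller).
        return ucg if len_P <= 1.0 else u0 + d / len_P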

Note that because the auxiliary rules are never invoked in iteration $\nu = 0$ (where $k_p = 0$ and $k_d = 0$), the points indexed with $\nu - 1$ in the statement of Algorithm 2 are all well defined. Another thing to observe is the fact that in (2.2) and (2.6) we actually have

(2.9)  $\langle w_p^\nu, u_e^{\nu-1} - u_0^\nu \rangle_P \ge 0$  and  $\langle w_d^\nu, v_e^{\nu-1} - v_0^\nu \rangle_Q \ge 0$.

These inequalities follow from (2.1) and (2.5) and the monotonicity of gradient mappings of convex functions. In Proposition 4.4 we shall prove that under the critical face condition the inequalities in (2.9) hold strictly in a vicinity of the optimal solution if the critical faces are reached by the corresponding iterates.

On the other hand, it is apparent from (2.3) and (2.7) that

(2.10)  $u_{cg}^\nu - u_0^\nu = \dfrac{u_2^\nu - u_0^\nu + \beta_p^\nu\,(u_e^{\nu-1} - u_0^\nu)}{1 + \beta_p^\nu}$  and  $v_{cg}^\nu - v_0^\nu = \dfrac{v_2^\nu - v_0^\nu + \beta_d^\nu\,(v_e^{\nu-1} - v_0^\nu)}{1 + \beta_d^\nu}$.

Hence, the search direction vector in the primal is, in fact, a convex combination of the $P$-projection of $-\nabla_P f(u_0^\nu)$ and the search direction vector in the previous primal iteration.


Similarly, the search direction vector in the dual is a convex combination of the $Q$-projection of $\nabla_Q g(v_0^\nu)$ and the search direction vector in the previous dual iteration.

We shall prove in Theorem 4.5 that under the critical face condition, the primal iterations in Algorithm 2 reduce in a vicinity of the optimal solution to (P) to the execution of the Hestenes-Stiefel conjugate gradient method if the critical face $U_0$ is eventually reached by the primal iterates, and similarly for the dual iterations. From this we will obtain a termination property for Algorithm 2, which will be invoked by an interactive restart of the algorithm.

Algorithm 2 departs a bit from the philosophy of Algorithm 1 in utilizing unprojected gradients in (2.1) and (2.5) instead of just projected gradients. These unprojected gradients are available through (1.11) and (1.12) (also (1.15) or (1.16)), and for multistage optimization problems in the pattern laid out in [7] they can still be calculated without having to invoke the gigantic $R$ matrix. An earlier version of Algorithm 2 that we worked with did use the projected gradients exclusively, and it performed similarly, but there were technical difficulties in establishing a finite termination property. Future research may shed more light on this issue. The same can be said of another small departure in Algorithm 2 from the philosophy one might hope to maintain in a conjugate gradient method: the introduction on occasion of step sizes possibly greater than 1 relative to $[u_0^\nu, u_{cg}^\nu]$ or $[v_0^\nu, v_{cg}^\nu]$ (although not, of course, relative to the designated intervals $[u_0^\nu, u_e^\nu]$ or $[v_0^\nu, v_e^\nu]$) through the second alternatives in (2.4) or (2.8).

3. Global Convergence and Local Quadratic Structure. This section establishes some basic convergence properties of Algorithms 0, 1 and 2. It also reveals the special quadratic structure in (P) and (Q) around the optimal solutions $\bar u$ and $\bar v$ in the case where the critical face condition is satisfied, which will be utilized in further convergence analysis in Section 5.

Proposition 3.1. (Feasible descent and ascent.)
(a) In Algorithm 0 (hence also in Algorithms 1 and 2) the vector $u_2^\nu - u_0^\nu$ gives a feasible descent direction for the primal objective function $f$ at $u_0^\nu$ (unless $u_2^\nu - u_0^\nu = 0$, in which case $u_0^\nu = \bar u$). Similarly, the vector $v_2^\nu - v_0^\nu$ gives a feasible ascent direction for the dual objective function $g$ at $v_0^\nu$ (unless $v_2^\nu - v_0^\nu = 0$, in which case $v_0^\nu = \bar v$).
(b) In Algorithm 2, the vector $u_{cg}^\nu - u_0^\nu$ gives a feasible descent direction for the primal objective $f$ at $u_0^\nu$ unless $u_0^\nu = \bar u$. Similarly, the vector $v_{cg}^\nu - v_0^\nu$ gives a feasible ascent direction for the dual objective $g$ at $v_0^\nu$ unless $v_0^\nu = \bar v$. Thus, Algorithm 2 is well defined in the sense that, regardless of the type of iteration, as long as it does not terminate in optimality, the vector $u_e^\nu - u_0^\nu$ gives a feasible descent direction at $u_0^\nu$ in the primal while the vector $v_e^\nu - v_0^\nu$ gives a feasible ascent direction at $v_0^\nu$ in the dual.

Proof. (a) We know that $u_2^\nu$ minimizes $L(u, v_1^\nu)$ over $u \in U$, where $L(u, v_1^\nu)$ is given by formula (1.17). We obtain from this formula that unless $u_2^\nu = u_0^\nu$, implying $u_0^\nu$ is optimal for the primal, we must have $\nabla f(u_0^\nu) \cdot (u_2^\nu - u_0^\nu) < 0$. Descent in this direction is feasible because the line segment $[u_0^\nu, u_2^\nu]$ is included in $U$ by convexity. The proof of the dual part is parallel.

(b) The argument is by induction. From the optimality test in Step 3 we see that the algorithm will terminate at $(\bar u, \bar v)$ if either $u_0^\nu = \bar u$ in the primal or $v_0^\nu = \bar v$ in the dual. (For instance, if $u_0^\nu = \bar u$, then $v_1^\nu = \bar v$, so that $f(\bar u) - g(\bar v) = 0$.) Suppose neither $u_0^\nu$ nor $v_0^\nu$ is optimal. Proposition 3.1(a) covers our claims for the initial iteration of each primal or dual cycle. Suppose that the claims are true for iteration $l - 1$ of a primal cycle, $0 < l < k$, this corresponding to iteration $\nu - 1$ of the algorithm as a whole.


We have $(u_2^\nu - u_0^\nu) \cdot \nabla f(u_0^\nu) < 0$ by part (a) and $(u_e^{\nu-1} - u_0^\nu) \cdot \nabla f(u_0^\nu) \le 0$ through the line search. (Note that we get this inequality instead of an equation because the search is over a segment rather than a half-line; the minimizing point could be at the end of the segment.) Hence

$(u_{cg}^\nu - u_0^\nu) \cdot \nabla f(u_0^\nu) = \dfrac{(u_2^\nu - u_0^\nu) \cdot \nabla f(u_0^\nu) + \beta_p^\nu\,(u_e^{\nu-1} - u_0^\nu) \cdot \nabla f(u_0^\nu)}{1 + \beta_p^\nu} < 0,$

so that $u_{cg}^\nu - u_0^\nu$ is again a descent direction for $f$ at $u_0^\nu$; it is feasible because $u_{cg}^\nu$, as a convex combination of points of $U$, lies in $U$. The claims for $u_e^\nu$ then follow from (2.4), and the dual assertions are verified in parallel.

Theorem 3.2. (Global convergence.) In Algorithm 0 (and hence in Algorithms 1 and 2), with $\varepsilon > 0$, termination must come with $\varepsilon$-optimal solutions $u$ and $v$ in just a finite number of iterations. With $\varepsilon = 0$, unless the procedure happens to terminate with the exact optimal solutions $\bar u$ and $\bar v$ in a finite number of iterations, the sequences generated will be such that $u_0^\nu \to \bar u$ and $v_0^\nu \to \bar v$ as $\nu \to \infty$. Furthermore, then $u_1^\nu \to \bar u$ and $u_2^\nu \to \bar u$, as well as $v_1^\nu \to \bar v$ and $v_2^\nu \to \bar v$.

Proof. Consider first the case where $\varepsilon = 0$. From Proposition 1.4, the point $u_2 = G(F(u_0))$ depends continuously on $u_0$. Denote by $D$ the continuous mapping $u_0 \mapsto (u_0, u_2 - u_0)$ from $U$ to $U \times \mathbb{R}^n$. Let $M : U \times \mathbb{R}^n \to U$ be the line search mapping defined by

$M(u_0, d) = \operatorname{argmin}_{u \in [u_0, u_0 + d]} f(u)$.

The mapping $M$ is closed at the point $(u_0, d)$ with $d \ne 0$, cf. [21, Theorem 8.3.1]. Now by Proposition 3.1(a), $u_2 - u_0 \ne 0$ for $u_0 \ne \bar u$. Hence the composite mapping $MD$ is closed on $U \setminus \{\bar u\}$, cf. [21, Theorem 7.3.2]. Define

$A = BMD$,

where $B : U \to U$ is the point-to-set mapping $B(u) = \{u' \in U \mid f(u') \le f(u)\}$. Note that the sequence $\{f(u_0^\nu)\}$ is nonincreasing. Now let $K_p$ be the sequence consisting of the indices of those iterations in which a line search on $[u_0^\nu, u_2^\nu]$ is performed for the primal objective function. Then $K_p$ is an infinite subsequence of $\{\nu\}$ unless the procedure happens to terminate with the exact optimal solutions $\bar u$ and $\bar v$ in a finite number of iterations. Let $\nu$ and $\nu'$ be two consecutive elements in $K_p$ with $\nu' > \nu$. Then we can write

$u_0^{\nu'} \in A(u_0^\nu)$.

By Proposition 3.1, moreover, the vector $u_2^\nu - u_0^\nu$ is a descent direction for the primal objective $f(u)$ at $u_0^\nu$ unless $u_0^\nu$ is already optimal. Since we are in the fully quadratic case, the set $\{u \in \mathbb{R}^n \mid f(u) \le f(u_0^0)\}$ is compact, and the optimal solution $\bar u$ for problem (P) is unique. It follows then that $u_0^\nu \to \bar u$ as $\nu \to \infty$, $\nu \in K_p$, cf. [21, Theorem 7.3.4]. Therefore $f(u_0^\nu) \to f(\bar u)$ as $\nu \to \infty$, which in turn implies $u_0^\nu \to \bar u$ as $\nu \to \infty$ since $f$ is strongly convex (Theorem 1.1).

For analogous reasons, $v_0^\nu \to \bar v$. Then since $u_1^\nu = G(v_0^\nu)$ and $\bar u = G(\bar v)$ with the mapping $G$ continuous (Proposition 1.4), we have $u_1^\nu \to \bar u$. Likewise, $v_1^\nu \to \bar v$.


The argument can be applied then again: we have $u_2^\nu = G(v_1^\nu)$, so $u_2^\nu \to \bar u$, and in parallel fashion $v_2^\nu \to \bar v$.

In particular, we have $f(u_0^\nu) - g(v_0^\nu) \to f(\bar u) - g(\bar v) = 0$ because $f$ and $g$ are continuous (Theorems 1.1 and 1.2(a)). In the case where $\varepsilon > 0$, this guarantees termination in finitely many iterations.

Corollary 3.3. (Points in the critical faces.) The sequences generated by Algorithm 0 have the property that eventually $u_1^\nu$ and $u_2^\nu$ belong to the primal critical face $U_0$, while $v_1^\nu$ and $v_2^\nu$ belong to the dual critical face $V_0$.

Proof. This follows via Proposition 1.8.

Corollary 3.4. (A special case of finite termination.) If $\varepsilon = 0$ and either of the critical faces $U_0$ or $V_0$ consists of just a single point, Algorithm 0 (and therefore also Algorithms 1 and 2) will terminate at the optimal solution pair $(\bar u, \bar v)$ after a finite number of iterations.

Proof. When $U_0$ consists of the single point $\bar u$, we have by Corollary 3.3 that $u_2^\nu = \bar u$ for all $\nu$ sufficiently large. Once this is the situation, the line search in the first iteration of the next primal cycle will yield $\bar u$. On returning to Step 1 for the succeeding iteration, $\bar v$ will be generated as $F(\bar u)$, and termination must then come in Step 3. The situation is analogous when $V_0$ consists of just $\bar v$.

A companion result to Corollary 3.3 is the following.

Proposition 3.5. (Convergence onto critical faces.) Let $\{u_0^\nu\}$ and $\{v_0^\nu\}$ be sequences generated by Algorithm 1 or Algorithm 2. Then for the primal critical face $U_0$, we have either $u_0^\nu \in U_0$ for all $\nu$ sufficiently large or $u_0^\nu \notin U_0$ for all $\nu$ sufficiently large. Similarly, for the dual critical face $V_0$ we have either $v_0^\nu \in V_0$ for all $\nu$ sufficiently large or $v_0^\nu \notin V_0$ for all $\nu$ sufficiently large.

Proof. We prove the primal part; the proof of the dual part is similar. Observe that $v_0^\nu \to \bar v$ as $\nu \to \infty$ in the algorithm. Hence by Proposition 1.8, we have $u_1^\nu \in U_0$ as well as $u_2^\nu \in U_0$ for $\nu$ sufficiently large. Then in Algorithm 1 we have, for $\nu$ sufficiently large,

$u_0^\nu \in U_0 \implies [u_0^\nu, u_2^\nu] \subset U_0 \implies u_0^{\nu+1} \in U_0$,

since $u_0^{\nu+1}$ is defined either as the point obtained from the line search on $[u_0^\nu, u_2^\nu]$ or as $u_1^{\nu+1}$. From this it is apparent that our assertion is valid in the case of sequences generated by Algorithm 1.

For Algorithm 2, we claim that for $\nu$ sufficiently large we have $u_e^\nu \in U_0$ when $u_0^\nu \in U_0$. For if $u_e^\nu = u_2^\nu$, we certainly have $u_e^\nu = u_2^\nu \in U_0$. If $u_e^\nu \ne u_2^\nu$, then $u_{cg}^\nu$ is a convex combination of $u_e^{\nu-1}$ and $u_2^\nu \in U_0$, and there is no interactive restart in iteration $\nu$, i.e., $u_0^\nu \in U_0$ is the point obtained from the line search on $[u_0^{\nu-1}, u_e^{\nu-1}]$. Now $u_0^\nu \ne u_0^{\nu-1}$ by Proposition 3.1(b). Hence we have either $u_0^\nu = u_e^{\nu-1}$, which implies $u_e^{\nu-1} \in U_0$, or $u_0^\nu \in \operatorname{ri}[u_0^{\nu-1}, u_e^{\nu-1}]$, which also implies $u_e^{\nu-1} \in U_0$ since $U_0$ is a face of $U$. Then $u_{cg}^\nu \in U_0$, and by the definition of $u_e^\nu$ in the algorithm we have $u_e^\nu \in U_0$. Therefore, for $\nu$ sufficiently large,

$u_0^\nu \in U_0 \implies [u_0^\nu, u_e^\nu] \subset U_0 \implies u_0^{\nu+1} \in U_0$.

Thus, our assertion is valid also in the case of sequences generated by Algorithm 2.

Remark. With the aid of the concept of an ultimate quadratic region introduced later in Definition 3.7, it will be seen that when the critical face condition is satisfied, the assertion of the proposition can be written as follows: after the sequences $\{u_0^\nu\}$ and $\{v_0^\nu\}$ have entered an ultimate quadratic region, once $u_0^\nu \in U_0$ for some $\nu$, then $u_0^{\nu'} \in U_0$ for all $\nu' \ge \nu$; and similarly once $v_0^\nu \in V_0$ for some $\nu$, then $v_0^{\nu'} \in V_0$ for all $\nu' \ge \nu$.


For Algorithm 2, broader results on finite termination than the one in Corollary 3.4 will be obtained when the critical face condition is satisfied, through reduction to a simpler quadratic structure which is identified as governing in a neighborhood of the solution. This local structure will also be the basis for developing convergence rates for Algorithms 1 and 2 in cases without finite termination. In developing it in the next theorem, we recall the notion of the affine hull $\operatorname{aff} C$ of a convex set $C$: this is the smallest affine set that includes $C$, or equivalently, the intersection of all the hyperplanes that include $C$ [17].

Theorem 3.6. (Quadratic structure near optimality.) Suppose the critical face condition is satisfied. Then $f$ is quadratic in some neighborhood of $\bar u$, while $g$ is quadratic in some neighborhood of $\bar v$. Furthermore, for points $u_0 \in U$ and $v_0 \in V$ sufficiently close to $\bar u$ and $\bar v$, the $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$ is the same as that on $\operatorname{aff} U_0 - u_0$, while the $Q$-projection of $\nabla_Q g(v_0)$ on $V - v_0$ is the same as that on $\operatorname{aff} V_0 - v_0$.

Proof. Since by Proposition 1.8 the point $v_1 = F(u_0)$ lies in the critical face $V_0$ when $u_0$ is sufficiently close to $\bar u$, we have

(3.1)  $\max_{v \in V} \{v \cdot (q - Ru) - \tfrac{1}{2} v \cdot Q v\} = \max_{v \in V_0} \{v \cdot (q - Ru) - \tfrac{1}{2} v \cdot Q v\}$.

The mapping $F$ is continuous (Proposition 1.4) and $\bar v \in \operatorname{ri} V_0$ by assumption, so we have $v_1 \in \operatorname{ri} V_0$ when $u_0$ is sufficiently close to $\bar u$. Then (3.1) can further be written instead as

$\max_{v \in V} \{v \cdot (q - Ru) - \tfrac{1}{2} v \cdot Q v\} = \max_{v \in \operatorname{aff} V_0} \{v \cdot (q - Ru) - \tfrac{1}{2} v \cdot Q v\}$.

Locally, therefore,

(3.2)  $f(u) = p \cdot u + \tfrac{1}{2} u \cdot P u + \max_{v \in \operatorname{aff} V_0} \{v \cdot (q - Ru) - \tfrac{1}{2} v \cdot Q v\}$.

Similarly, for $v$ in some neighborhood of $\bar v$ we have

(3.3)  $g(v) = q \cdot v - \tfrac{1}{2} v \cdot Q v + \min_{u \in \operatorname{aff} U_0} \{u \cdot (p - R^T v) + \tfrac{1}{2} u \cdot P u\}$.

The set $\operatorname{aff} V_0$, because it is affine and contains $\bar v$, has the form $\bar v + S$ for a certain subspace $S$ of $\mathbb{R}^m$, which in turn can be written as the set of all vectors of the form $v = Dw$ for a certain $m \times d$ matrix $D$ of rank $d$ (the dimension of $S$). In substituting $v = \bar v + Dw$ in (3.2) and taking the maximum instead over all $w \in \mathbb{R}^d$, we see through elementary calculus and linear algebra that the maximum value is a quadratic function of $u$. This establishes that $f(u)$ is quadratic in $u$ on a neighborhood of $\bar u$. The same argument can be pursued in (3.3) to verify that $g(v)$ is quadratic around $\bar v$.

Next we consider the projected gradients. According to Proposition 1.6, the $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$ is the vector $u_2 - u_0$, where $u_2 = G(F(u_0))$. When $u_0$ is close enough to $\bar u$ in $U$, $u_2$ belongs by Proposition 1.8 to $\operatorname{ri} U_0$, which is the interior of $U_0$ relative to $\operatorname{aff} U_0$. Thus, for $u_0$ in some neighborhood of $\bar u$ in $U$ the $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$ belongs to the relatively open convex subset $\operatorname{ri} U_0 - u_0$ of $U - u_0$ and must be the same as the projection on this subset or on $U_0 - u_0$ itself. When the nearest point of a convex set $C$ belongs to $\operatorname{ri} C$, it is the same as the nearest point of $\operatorname{aff} C$. The $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$ is therefore the same as the $P$-projection of $-\nabla_P f(u_0)$ on $\operatorname{aff} U_0 - u_0$.


The $Q$-projection of $\nabla_Q g(v_0)$ on $V - v_0$ is analyzed in parallel fashion.

Theorem 3.6 together with Proposition 1.8 makes it possible for us to concentrate our analysis of the terminal behavior of our algorithms, in the case of optimality threshold $\varepsilon = 0$, on regions around $(\bar u, \bar v)$ of the following special kind.

Definition 3.7. (Ultimate quadratic regions.) By an ultimate quadratic region for problems (P) and (Q) when the critical face condition is satisfied, we shall mean an open convex neighborhood $U^* \times V^*$ of $(\bar u, \bar v)$ with the properties that

(a) $U^* \cap U_0 = U^* \cap \operatorname{ri} U_0$ and $V^* \cap V_0 = V^* \cap \operatorname{ri} V_0$,
(b) $f$ is quadratic on $U^*$ and $g$ is quadratic on $V^*$,
(c) for all $u_0 \in U^* \cap U$ the $P$-projection of $-\nabla_P f(u_0)$ on $U - u_0$ is that on $(\operatorname{aff} U_0) - u_0$, while for all $v_0 \in V^* \cap V$ the $Q$-projection of $\nabla_Q g(v_0)$ on $V - v_0$ is that on $(\operatorname{aff} V_0) - v_0$,
(d) for all $u_0 \in U^* \cap U$ and $v_0 \in V^* \cap V$ the points $u_1 = G(v_0)$, $v_1 = F(u_0)$, $u_2 = G(v_1)$ and $v_2 = F(u_1)$ are such that $u_1$ and $u_2$ belong to $\operatorname{ri} U_0$, while $v_1$ and $v_2$ belong to $\operatorname{ri} V_0$.

Here we recognize that the affine sets $\operatorname{aff} U_0$ and $\operatorname{aff} V_0$ are translates of certain subspaces, which in fact are the sets $(\operatorname{aff} U_0) - \bar u$ and $(\operatorname{aff} V_0) - \bar v$. The projections in (c) of this definition can be described also in terms of these subspaces. Let

(3.4)  $S_p$ = $P$-projection mapping onto the subspace $(\operatorname{aff} U_0) - \bar u$,
      $S_d$ = $Q$-projection mapping onto the subspace $(\operatorname{aff} V_0) - \bar v$,
      $S_p^\perp = I - S_p$,  $S_d^\perp = I - S_d$.

The mapping $S_p^\perp$ projects onto the subspace of $\mathbb{R}^n$ that is orthogonally complementary to $(\operatorname{aff} U_0) - \bar u$ with respect to the $P$-inner product in (1.10), while the mapping $S_d^\perp$ projects onto the subspace of $\mathbb{R}^m$ that is orthogonally complementary to $(\operatorname{aff} V_0) - \bar v$ with respect to the $Q$-inner product. All these projections are linear transformations, of course.

Proposition 3.8. (Projection decomposition.) For $(u_0, v_0)$ in an ultimate quadratic region $U^* \times V^*$, one has for $u_2 := G(F(u_0))$ and $v_2 := F(G(v_0))$ that

$u_2 - u_0 = -S_p \nabla_P f(u_0) - S_p^\perp (u_0 - \bar u) = -S_p \nabla^2_P f(\bar u)(u_0 - \bar u) - S_p^\perp (u_0 - \bar u)$,

$v_2 - v_0 = S_d \nabla_Q g(v_0) - S_d^\perp (v_0 - \bar v) = S_d \nabla^2_Q g(\bar v)(v_0 - \bar v) - S_d^\perp (v_0 - \bar v)$.

Proof. The $P$-projection of $-\nabla_P f(u_0)$ on $(\operatorname{aff} U_0) - u_0$ can be realized by taking the $P$-projection of $-\nabla_P f(u_0) + (u_0 - \bar u)$ on the set $(\operatorname{aff} U_0) - u_0 + (u_0 - \bar u) = (\operatorname{aff} U_0) - \bar u$ and then subtracting $(u_0 - \bar u)$. Therefore, in a region with property (c) of Definition 3.7 we have by (1.17) in Proposition 1.6 that

$u_2 - u_0 = S_p\bigl[-\nabla_P f(u_0) + (u_0 - \bar u)\bigr] - (u_0 - \bar u) = -S_p \nabla_P f(u_0) - (I - S_p)(u_0 - \bar u)$,

which is the first equality asserted. The second equality comes from having $\nabla_P f(u_0) = \nabla_P f(\bar u) + \nabla^2_P f(\bar u)(u_0 - \bar u)$ (since $f$ is quadratic in the region in question), and $S_p \nabla_P f(\bar u) = 0$ by the optimality of $\bar u$. The proof of the dual equalities is along the same lines.

4. Rate of Convergence. In taking advantage of the existence of an ultimate quadratic region, we shall utilize in our technical arguments a change of variables that will make a number of basic properties clearer. This change of variables amounts to the introduction of orthonormal coordinate systems relative to the inner products naturally associated with our problems, namely $\langle \cdot, \cdot \rangle_P$ on $\mathbb{R}^n$ and $\langle \cdot, \cdot \rangle_Q$ on $\mathbb{R}^m$, as given in (1.10).


The coordinate systems are introduced in such a way that the subspaces $(\operatorname{aff} U_0) - \bar u$ and $(\operatorname{aff} V_0) - \bar v$ for the projections in (3.4) and Proposition 3.8 take a very simple form.

Let $W$ be an $n \times n$ orthogonal matrix and $Z$ an $m \times m$ orthogonal matrix. Our shift will be from $u$ and $v$ to $\tilde u = W P^{1/2} u$ and $\tilde v = Z Q^{1/2} v$. In these variables and with

$\tilde U = W P^{1/2} U$,  $\tilde V = Z Q^{1/2} V$,

our primal and dual problems take the form

($\tilde{\mathrm P}$)  minimize $\tilde f(\tilde u)$ over all $\tilde u \in \tilde U$,
($\tilde{\mathrm Q}$)  maximize $\tilde g(\tilde v)$ over all $\tilde v \in \tilde V$,

where we have

(4.1)  $\tilde f(\tilde u) = \sup_{\tilde v \in \tilde V} \tilde L(\tilde u, \tilde v)$  and  $\tilde g(\tilde v) = \inf_{\tilde u \in \tilde U} \tilde L(\tilde u, \tilde v)$,

(4.2)  $\tilde F(\tilde u) = \operatorname{argmax}_{\tilde v \in \tilde V} \tilde L(\tilde u, \tilde v)$  and  $\tilde G(\tilde v) = \operatorname{argmin}_{\tilde u \in \tilde U} \tilde L(\tilde u, \tilde v)$,

in the notation that

(4.3)  $\tilde L(\tilde u, \tilde v) = \tilde p \cdot \tilde u + \tfrac{1}{2}\|\tilde u\|^2 + \tilde q \cdot \tilde v - \tfrac{1}{2}\|\tilde v\|^2 - \tilde v \cdot \tilde R \tilde u$  on $\tilde U \times \tilde V$,

(4.4)  $\tilde p = W P^{-1/2} p$,  $\tilde q = Z Q^{-1/2} q$,  $\tilde R = Z Q^{-1/2} R P^{-1/2} W^T$.

The optimal solutions $\bar u$ and $\bar v$ to (P) and (Q) translate into optimal solutions $\tilde{\bar u}$ and $\tilde{\bar v}$ to ($\tilde{\mathrm P}$) and ($\tilde{\mathrm Q}$), namely

(4.5)  $\tilde{\bar u} = W P^{1/2} \bar u$  and  $\tilde{\bar v} = Z Q^{1/2} \bar v$.

Let $d_1$ be the dimension of the subspace $(\operatorname{aff} U_0) - \bar u$ and $d_2$ the dimension of the subspace $(\operatorname{aff} V_0) - \bar v$. We choose $W$ such that, in the new coordinates corresponding to the components of $\tilde u$, the set $W P^{1/2}(\operatorname{aff} U_0 - \bar u) = \operatorname{aff} \tilde U_0 - \tilde{\bar u}$ is the subspace spanned by the first $d_1$ columns of $I_n$. Likewise, we choose $Z$ such that in the $\tilde v$ coordinates the set $Z Q^{1/2}(\operatorname{aff} V_0 - \bar v) = \operatorname{aff} \tilde V_0 - \tilde{\bar v}$ is the subspace spanned by the first $d_2$ columns of $I_m$. We partition the vectors $\tilde u \in \mathbb{R}^n$ and $\tilde v \in \mathbb{R}^m$ into

(4.6)  $\tilde u = \begin{bmatrix} \tilde u_f \\ \tilde u_r \end{bmatrix}$  and  $\tilde v = \begin{bmatrix} \tilde v_f \\ \tilde v_r \end{bmatrix}$,

where $\tilde u_f$ consists of the first $d_1$ components of $\tilde u$ and $\tilde v_f$ consists of the first $d_2$ components of $\tilde v$. (Here $\tilde u_f$ is the "free" part of $\tilde u$, relative to $(\operatorname{aff} \tilde U_0) - \tilde{\bar u}$ being the subspace that indicates the remaining degrees of freedom in the tail of our convergence analysis when the critical face condition is satisfied, whereas $\tilde u_r$ is the "restricted" part of $\tilde u$.)


the critical face condition is satisfied, whereas $\tilde u_r$ is the 'restricted' part of $\tilde u$.) The projection mappings $S_p$, $S_p^\perp$, $S_d$, and $S_d^\perp$ reduce in this way to the simple projections
$$(4.7)\qquad S_p:\ \begin{pmatrix}\tilde u_f\\ \tilde u_r\end{pmatrix} \mapsto \begin{pmatrix}\tilde u_f\\ 0\end{pmatrix},\qquad S_p^\perp:\ \begin{pmatrix}\tilde u_f\\ \tilde u_r\end{pmatrix} \mapsto \begin{pmatrix}0\\ \tilde u_r\end{pmatrix},\qquad S_d:\ \begin{pmatrix}\tilde v_f\\ \tilde v_r\end{pmatrix} \mapsto \begin{pmatrix}\tilde v_f\\ 0\end{pmatrix},\qquad S_d^\perp:\ \begin{pmatrix}\tilde v_f\\ \tilde v_r\end{pmatrix} \mapsto \begin{pmatrix}0\\ \tilde v_r\end{pmatrix}.$$
We partition the columns of the matrix $\tilde R$ in accordance with $\tilde u$ and the rows in accordance with $\tilde v$. Thus,
$$(4.8)\qquad \tilde R = \begin{pmatrix} \tilde R_{ff} & \tilde R_{fr} \\ \tilde R_{rf} & \tilde R_{rr} \end{pmatrix}.$$

In this notation the primal objective function in the transformed problem $(\tilde{\mathrm{P}})$ takes, in an ultimate quadratic region, the simple form
$$(4.9)\qquad \tilde f(\tilde u) = \tfrac12(\tilde u - \hat u)\cdot A\,(\tilde u - \hat u) + \mathrm{const.}\ \text{ for some } \hat u,\quad\text{where } A := I + \begin{pmatrix}\tilde R_{ff} & \tilde R_{fr}\end{pmatrix}^{T}\begin{pmatrix}\tilde R_{ff} & \tilde R_{fr}\end{pmatrix},$$
while in the dual problem one similarly has
$$(4.10)\qquad \tilde g(\tilde v) = -\tfrac12(\tilde v - \hat v)\cdot B\,(\tilde v - \hat v) + \mathrm{const.}\ \text{ for some } \hat v,\quad\text{where } B := I + \begin{pmatrix}\tilde R_{ff} \\ \tilde R_{rf}\end{pmatrix}\begin{pmatrix}\tilde R_{ff} \\ \tilde R_{rf}\end{pmatrix}^{T}.$$
In fact, in the notation (4.5) and with $\tilde U_0$ and $\tilde V_0$ denoting the critical faces $WP^{1/2}U_0$ and $ZQ^{1/2}V_0$ in the transformed problems, one has the expansions
$$(4.11)\qquad \tilde f(\tilde u) = \tilde f(\tilde u^\ast) + \tfrac12(\tilde u_f - \tilde u_f^\ast)\cdot\big(I + \tilde R_{ff}^T\tilde R_{ff}\big)(\tilde u_f - \tilde u_f^\ast) \quad\text{for } \tilde u \in \operatorname{aff}\tilde U_0,$$
$$(4.12)\qquad \tilde g(\tilde v) = \tilde g(\tilde v^\ast) - \tfrac12(\tilde v_f - \tilde v_f^\ast)\cdot\big(I + \tilde R_{ff}\tilde R_{ff}^T\big)(\tilde v_f - \tilde v_f^\ast) \quad\text{for } \tilde v \in \operatorname{aff}\tilde V_0.$$

It will be helpful to write the Hessian matrices $A$ and $B$ in (4.9) and (4.10) as
$$(4.13)\qquad A = \begin{pmatrix} A_{ff} & A_{fr} \\ A_{rf} & A_{rr} \end{pmatrix} = \begin{pmatrix} I + \tilde R_{ff}^T\tilde R_{ff} & \tilde R_{ff}^T\tilde R_{fr} \\ \tilde R_{fr}^T\tilde R_{ff} & I + \tilde R_{fr}^T\tilde R_{fr} \end{pmatrix},$$
$$(4.14)\qquad B = \begin{pmatrix} B_{ff} & B_{fr} \\ B_{rf} & B_{rr} \end{pmatrix} = \begin{pmatrix} I + \tilde R_{ff}\tilde R_{ff}^T & \tilde R_{ff}\tilde R_{rf}^T \\ \tilde R_{rf}\tilde R_{ff}^T & I + \tilde R_{rf}\tilde R_{rf}^T \end{pmatrix}.$$

A crucial property of our change of variables $\tilde u = WP^{1/2}u$ and $\tilde v = ZQ^{1/2}v$ is that
$$|\tilde u| = \|u\|_P \quad\text{and}\quad |\tilde v| = \|v\|_Q,$$
and accordingly
$$|\nabla\tilde f(\tilde u)| = \|\nabla_P f(u)\|_P \quad\text{and}\quad |\nabla\tilde g(\tilde v)| = \|\nabla_Q g(v)\|_Q,$$
$$\|\nabla^2\tilde f(\tilde u)\| = \|\nabla_P^2 f(u)\|_P \quad\text{and}\quad \|\nabla^2\tilde g(\tilde v)\| = \|\nabla_Q^2 g(v)\|_Q.$$
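These identities can be verified directly; the short calculation below is ours, under the convention (carried over from the earlier sections, as in (1.10)) that $\|x\|_P = (x\cdot Px)^{1/2}$ and $\nabla_P f(u) = P^{-1}\nabla f(u)$. Since $W$ is orthogonal,
$$|\tilde u|^2 = (WP^{1/2}u)\cdot(WP^{1/2}u) = u\cdot Pu = \|u\|_P^2,$$
and from $\tilde f(\tilde u) = f(u)$ with $u = P^{-1/2}W^T\tilde u$ the chain rule gives $\nabla\tilde f(\tilde u) = WP^{-1/2}\nabla f(u)$, so that
$$|\nabla\tilde f(\tilde u)|^2 = \nabla f(u)\cdot P^{-1}\nabla f(u) = \|P^{-1}\nabla f(u)\|_P^2 = \|\nabla_P f(u)\|_P^2.$$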

The following result is a strengthening of Proposition 3.1 in the sense that it gives a quantitative estimate for the relationship between $\|u_0 - u_2\|_P$ and $\|u_0 - \bar u\|_P$ in the primal, and between $\|v_0 - v_2\|_Q$ and $\|v_0 - \bar v\|_Q$ in the dual.

Proposition 4.1 (Norm estimates). Suppose the critical face condition is satisfied. Then for $u_0$ and $v_0$ in an ultimate quadratic region for problems (P) and (Q), and with $u_2 := G(F(u_0))$ and $v_2 := F(G(v_0))$, one has
$$(4.15)\qquad \big(5 + 4\|\nabla_P^2 f(\bar u)\|_P^2\big)^{-1/2}\,\|u_0 - \bar u\|_P \ \le\ \|u_0 - u_2\|_P \ \le\ \big(1 + \|\nabla_P^2 f(\bar u)\|_P^2\big)^{1/2}\,\|u_0 - \bar u\|_P,$$
$$(4.16)\qquad \big(5 + 4\|\nabla_Q^2 g(\bar v)\|_Q^2\big)^{-1/2}\,\|v_0 - \bar v\|_Q \ \le\ \|v_0 - v_2\|_Q \ \le\ \big(1 + \|\nabla_Q^2 g(\bar v)\|_Q^2\big)^{1/2}\,\|v_0 - \bar v\|_Q.$$

Proof. In the transformed coordinates the first equation in Proposition 3.8 gives us $\tilde u_2 - \tilde u_0 = -S_p\nabla^2\tilde f(\tilde u^\ast)(\tilde u_0 - \tilde u^\ast) - S_p^\perp(\tilde u_0 - \tilde u^\ast)$. In the notation (4.13) for $\nabla^2\tilde f(\tilde u^\ast) = A$, this gives
$$|\tilde u_0 - \tilde u_2|^2 = |S_p A(\tilde u_0 - \tilde u^\ast)|^2 + |S_p^\perp(\tilde u_0 - \tilde u^\ast)|^2 \ \le\ |A(\tilde u_0 - \tilde u^\ast)|^2 + |\tilde u_0 - \tilde u^\ast|^2 \ \le\ \big(\|A\|^2 + 1\big)\,|\tilde u_0 - \tilde u^\ast|^2.$$
This gives the right half of (4.15). To get the left half, decompose $\tilde u_0 - \tilde u^\ast$ into $\alpha_1\xi + \alpha_2\eta$, where $\xi$ is a unit vector in the null space of $(A_{ff}\ \ A_{fr})$ while $\eta$ is a unit vector in the orthogonal complement of that null space, and the direction of $\eta$ is so chosen that $\alpha_2 \ge 0$. Partition $\xi$ and $\eta$ as well:
$$\xi = \begin{pmatrix}\xi_f\\ \xi_r\end{pmatrix}, \qquad \eta = \begin{pmatrix}\eta_f\\ \eta_r\end{pmatrix}.$$
It follows from $(A_{ff}\ \ A_{fr})\xi = A_{ff}\xi_f + A_{fr}\xi_r = 0$ that $\xi_f = -A_{ff}^{-1}A_{fr}\xi_r$ and
$$|\xi_f|^2 \ \le\ \|A_{ff}^{-1}\|^2\,\|A_{fr}\|^2\,|\xi_r|^2 \ \le\ \|A_{fr}\|^2\,|\xi_r|^2,$$
because the smallest eigenvalue of $A_{ff}$ is no less than 1. Therefore
$$1 = |\xi_f|^2 + |\xi_r|^2 \ \le\ \big(1 + \|A_{fr}\|^2\big)|\xi_r|^2, \qquad\text{so}\qquad |\xi_r|^2 \ \ge\ \frac{1}{1 + \|A_{fr}\|^2} \ \ge\ \frac{1}{1 + \|A\|^2}.$$
Denote $|\tilde u_0 - \tilde u^\ast|$ by $\beta$. We get
$$|\alpha_1\xi_r + \alpha_2\eta_r| \ \ge\ |\alpha_1|\,|\xi_r| - \alpha_2\,|\eta_r| \ \ge\ (\beta^2 - \alpha_2^2)^{1/2}(1 + \|A\|^2)^{-1/2} - \alpha_2.$$
Recalling that all the eigenvalues of $A_{ff}$ are no less than 1 (so that the nonzero singular values of $(A_{ff}\ \ A_{fr})$ are at least 1, while $(A_{ff}\ \ A_{fr})(\tilde u_0 - \tilde u^\ast) = \alpha_2(A_{ff}\ \ A_{fr})\eta$), we obtain
$$|\tilde u_0 - \tilde u_2|^2 = \big|(A_{ff}\ \ A_{fr})(\tilde u_0 - \tilde u^\ast)\big|^2 + |\alpha_1\xi_r + \alpha_2\eta_r|^2 \ \ge\ \alpha_2^2 + \Big[\max\big\{0,\ (\beta^2 - \alpha_2^2)^{1/2}(1 + \|A\|^2)^{-1/2} - \alpha_2\big\}\Big]^2.$$
But the term $(\beta^2 - \alpha_2^2)^{1/2}(1 + \|A\|^2)^{-1/2} - \alpha_2$ decreases monotonically as $\alpha_2$ increases from 0. This term equals $\alpha_2' := \beta(5 + 4\|A\|^2)^{-1/2}$ when $\alpha_2 = \alpha_2'$. Therefore $|\tilde u_0 - \tilde u_2|^2 \ge (\alpha_2')^2$, from which the left half of (4.15) follows. The proof of (4.16) is similar.

Theorem 4.2 (Rate of convergence of PDSD). Consider Algorithm 1 in the case of threshold $\varepsilon = 0$, and suppose the critical face condition is satisfied. In terms of
$$\gamma := \gamma(P,Q,R) := \big\|Q^{-1/2}RP^{-1/2}\big\|,$$
let
$$(4.17)\qquad c_1 := 1 - \frac{1}{(1+\gamma^2)^2 + 5(1+\gamma^2) + 4(1+\gamma^2)^3}

and
$$(4.18)\qquad c_2 := \Big(1 - \frac{1}{1 + \frac12\gamma^2}\Big)^{2}.$$
Then, unless the algorithm terminates in a finite number of iterations with $(u_0^\nu, v_0^\nu) = (\bar u, \bar v)$, one has
$$(4.19)\qquad \limsup_{\nu\to\infty}\frac{f(u_0^{\nu+1}) - f(\bar u)}{f(u_0^\nu) - f(\bar u)} \le c_1 \quad\text{and}\quad \limsup_{\nu\to\infty}\frac{g(v_0^{\nu+1}) - g(\bar v)}{g(v_0^\nu) - g(\bar v)} \le c_1.$$
If the iterates $u_0^\nu$ eventually lie on the critical face $U_0$ (as is sure to happen if any interactive primal restarts occur after an ultimate quadratic region has been entered), then actually
$$(4.20)\qquad \limsup_{\nu\to\infty}\frac{f(u_0^{\nu+1}) - f(\bar u)}{f(u_0^\nu) - f(\bar u)} \le c_2,$$
and likewise in the dual,
$$(4.21)\qquad \limsup_{\nu\to\infty}\frac{g(v_0^{\nu+1}) - g(\bar v)}{g(v_0^\nu) - g(\bar v)} \le c_2.$$

Proof. We give the argument for the primal sequence; the dual part is parallel. Suppose the iterates have entered an ultimate quadratic region, let $d^\nu := \tilde u_2^\nu - \tilde u_0^\nu$ in the transformed coordinates, write $b(\tilde u) := |S_p^\perp\nabla\tilde f(\tilde u)|$ for the magnitude of the gradient component normal to $\operatorname{aff}\tilde U_0$, and recall that the step minimizing $\tilde f$ along the ray from $\tilde u_0^\nu$ in the direction $d^\nu$ is
$$(4.22)\qquad \tau^\nu = \frac{-d^\nu\cdot\nabla\tilde f(\tilde u_0^\nu)}{d^\nu\cdot A\,d^\nu}.$$
Two cases need to be distinguished:

Case 1: there exists some $\hat\nu$ such that $u_0^\nu \in U_0$ for all $\nu \ge \hat\nu$, or
Case 2: $u_0^\nu \notin U_0$ for all $\nu$.

In Case 1 the equation $S_p^\perp(\tilde u_0^\nu - \tilde u^\ast) = 0$ holds for all $\nu \ge \hat\nu$. Then it follows from (4.22) that
$$\tau^\nu = \frac{S_p A(\tilde u_0^\nu - \tilde u^\ast)\cdot A(\tilde u_0^\nu - \tilde u^\ast)}{S_p A(\tilde u_0^\nu - \tilde u^\ast)\cdot A\,S_p A(\tilde u_0^\nu - \tilde u^\ast)} = \frac{S_p A(\tilde u_0^\nu - \tilde u^\ast)\cdot S_p A(\tilde u_0^\nu - \tilde u^\ast)}{S_p A(\tilde u_0^\nu - \tilde u^\ast)\cdot A\,S_p A(\tilde u_0^\nu - \tilde u^\ast)} \ \le\ 1,$$

because all the eigenvalues of $A$ are at least 1. Now Step 5 of the algorithm must coincide with the steepest descent method for $\tilde f$ on $\operatorname{aff}\tilde U_0$ with perfect line search, since $[\tilde u_0^\nu, \tilde u_2^\nu]$ is in an ultimate quadratic region of the problem. Note that all the eigenvalues of the Hessian matrix $A_{ff}$ are in the interval $[1,\ 1 + \|\tilde R_{ff}\|^2]$, where $\|\tilde R_{ff}\|^2 \le \|\tilde R\|^2 = \gamma^2$. Hence by using the expression of $\tilde f$ in (4.11), we have [22]
$$(4.23)\qquad \frac{\tilde f(\tilde u_0^{\nu+1}) - \tilde f(\tilde u^\ast)}{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u^\ast)} \ \le\ \left(\frac{\|\tilde R_{ff}\|^2}{\|\tilde R_{ff}\|^2 + 2}\right)^{2} \ \le\ \left(1 - \frac{1}{1 + \frac12\gamma^2}\right)^{2},$$

which yields (4.20), since $f(u_0^{\nu+1}) \le \tilde f(\tilde u_0^{\nu+1})$ in the algorithm.

In Case 2 we have $S_p^\perp(\tilde u_0^\nu - \tilde u^\ast) \ne 0$ for all $\nu$. Given any $\epsilon > 0$, we have $|b(\tilde u_0^\nu) - b(\tilde u^\ast)| \le \epsilon$ for sufficiently large $\nu$. Then
$$b(\tilde u_0^\nu)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big| \ \ge\ b(\tilde u^\ast)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big| - \big|b(\tilde u_0^\nu) - b(\tilde u^\ast)\big|\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big| \ \ge\ b(\tilde u^\ast)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big| - \epsilon\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|.$$

  • 8/13/2019 10.1.1.161.7484

    23/33

    primal-dual projected gradient algorithms for elqp 23

But $b(\tilde u^\ast) = |S_p^\perp\nabla\tilde f(\tilde u^\ast)| = |\nabla\tilde f(\tilde u^\ast)|$, since $S_p\nabla\tilde f(\tilde u^\ast) = 0$. Therefore
$$(4.24)\qquad b(\tilde u_0^\nu)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big| \ \ge\ \Big(1 - \frac{\epsilon}{|\nabla\tilde f(\tilde u^\ast)|}\Big)\,b(\tilde u^\ast)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|,$$
where $\nabla\tilde f(\tilde u^\ast) \ne 0$, for otherwise $U_0 = U$ and then $u_0^\nu \in U_0$, in contradiction to our

assumption in Case 2.

Now, if $d^\nu\cdot d^\nu \ge b(\tilde u_0^\nu)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|$, we obtain from (4.15) that
$$\frac{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u_0^{\nu+1})}{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u^\ast)} \ \ge\ \frac{(d^\nu\cdot d^\nu)^2}{(d^\nu\cdot A\,d^\nu)\big[\,d^\nu\cdot A\,d^\nu + (\tilde u_0^\nu - \tilde u^\ast)\cdot(\tilde u_0^\nu - \tilde u^\ast)\,\big]} \ \ge\ \frac{1}{\|A\|^2 + \|A\|(5 + 4\|A\|^2)}.$$
Otherwise $d^\nu\cdot d^\nu < b(\tilde u_0^\nu)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|$, and then
$$\frac{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u_0^{\nu+1})}{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u^\ast)} \ \ge\ \frac{\big[b(\tilde u_0^\nu)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|\big]^2}{\big[\|A\|^2 + \|A\|(5 + 4\|A\|^2)\big]\big[b(\tilde u^\ast)\,\big|S_p^\perp(\tilde u_0^\nu - \tilde u^\ast)\big|\big]^2} \ \ge\ \frac{1}{\|A\|^2 + \|A\|(5 + 4\|A\|^2)}\Big(1 - \frac{\epsilon}{|\nabla\tilde f(\tilde u^\ast)|}\Big)^{2}$$
by (4.24). Since $\epsilon > 0$ was arbitrary, we thus have
$$\liminf_{\nu\to\infty}\ \frac{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u_0^{\nu+1})}{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u^\ast)} \ \ge\ \frac{1}{\|A\|^2 + \|A\|(5 + 4\|A\|^2)},$$
which can be written as
$$(4.25)\qquad \limsup_{\nu\to\infty}\ \frac{\tilde f(\tilde u_0^{\nu+1}) - \tilde f(\tilde u^\ast)}{\tilde f(\tilde u_0^\nu) - \tilde f(\tilde u^\ast)} \ \le\ 1 - \frac{1}{\|A\|^2 + \|A\|(5 + 4\|A\|^2)}.$$
Noting that $\|A\| = 1 + \big\|(\tilde R_{ff}\ \ \tilde R_{fr})\big\|^2 \le 1 + \|\tilde R\|^2 = 1 + \gamma^2$, we get the first inequality in (4.19), which is also true for Case 1 in view of (4.20) since $c_2 < c_1$. The dual part has a parallel argument.
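A remark on the bound (4.23), supplied by us for completeness and not part of the original argument: it is the classical Kantorovich rate for steepest descent with perfect line search. For a quadratic with Hessian eigenvalues in $[1, 1+r]$, the per-step ratio of objective gaps is at most $\big(\frac{\kappa-1}{\kappa+1}\big)^2$ with condition number $\kappa = 1+r$ [22], and
$$\frac{\kappa-1}{\kappa+1} = \frac{r}{r+2} = 1 - \frac{1}{1 + \frac12 r} \ \le\ 1 - \frac{1}{1 + \frac12\gamma^2} \qquad\text{for } r = \|\tilde R_{ff}\|^2 \le \gamma^2,$$
which is exactly how the two expressions in (4.23) match up.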

Observe that the rates in (4.20) and (4.21) are much better than the ones in (4.19). The former will be effective if any interactive restarts occur for $\nu \ge \bar\nu$, as indicated in the theorem. This partially explains the effect of interactive restarts on the algorithm as observed in our numerical tests.

The role of the constant $\gamma = \gamma(P,Q,R)$ in the convergence rate in Theorem 4.2 has been borne out in our numerical tests, although because of the interactive restarts the method appears to work much better than one might expect from steepest descent. We have definitely observed, in small-scale problems where some idea of the size of $\gamma$ is available, that the convergence is faster with low $\gamma$ than with high $\gamma$.
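To give a feel for these constants (a numerical illustration of ours, based on the formulas for $c_1$ and $c_2$ as stated above): for $\gamma = 1$,
$$c_1 = 1 - \frac{1}{2^2 + 5\cdot 2 + 4\cdot 2^3} = 1 - \frac{1}{46} \approx 0.978, \qquad c_2 = \Big(1 - \frac{1}{1 + \frac12}\Big)^{2} = \frac{1}{9} \approx 0.111,$$
so once the iterates reach the critical faces, the guaranteed per-step contraction improves dramatically.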

Although Theorem 4.2 centers on the specialization of Algorithm 0 to Algorithm 1, the argument has content also for Algorithm 2. Recall from the discussion after the


statement of Algorithm 0 in Section 2 that in every $k$ iterations of Algorithm 0 (when implemented with cycle size $k > 1$) there is at least one primal line search on $[u_0^\nu, u_2^\nu]$ and at least one dual line search on $[v_0^\nu, v_2^\nu]$. This gives us the following result about Algorithm 2, which will be complemented by a finite termination result in Theorem 4.5.

Corollary 4.3 (Rate of convergence of PDCG). Suppose the critical face condition is satisfied. Then Algorithm 2 with $\varepsilon = 0$ converges at least $k$-step linearly in the sense that
$$(4.26)\qquad \limsup_{\nu\to\infty}\frac{f(u_0^{\nu+k}) - f(\bar u)}{f(u_0^\nu) - f(\bar u)} \le c_1 \quad\text{and}\quad \limsup_{\nu\to\infty}\frac{g(v_0^{\nu+k}) - g(\bar v)}{g(v_0^\nu) - g(\bar v)} \le c_1,$$
where $c_1$ is the value defined in (4.17), unless the algorithm terminates after a finite number of iterations with $(u_0^\nu, v_0^\nu) = (\bar u, \bar v)$.

To derive a special finite termination property of Algorithm 2, we need the following.

Proposition 4.4 (Inequalities in PDCG). Suppose the critical face condition is satisfied. Let $\bar\nu$ be an iteration number such that for $\nu \ge \bar\nu$, all the points $u_0^\nu, u_2^\nu$ and $v_0^\nu, v_2^\nu$ are in an ultimate quadratic region $U^\circ \times V^\circ$ as in Definition 3.7, where $U^\circ$ is contained in the $P$-ball around $\bar u$ of radius $\frac12$, and likewise $V^\circ$ is contained in the $Q$-ball around $\bar v$ of radius $\frac12$. If $u_0^{\hat\nu} \in U_0$ for some $\hat\nu \ge \bar\nu$, then in Algorithm 2 one has
$$(4.27)\qquad \big\langle w_p^\nu,\ u_e^{\nu-1} - u_0^\nu \big\rangle_P > 0$$
whenever (2.1)–(2.4) are used to generate $u_e^\nu$ for $\nu > \hat\nu$, and similarly if $v_0^{\hat\nu} \in V_0$ for some $\hat\nu \ge \bar\nu$, then in Algorithm 2 one has
$$(4.28)\qquad \big\langle w_d^\nu,\ v_e^{\nu-1} - v_0^\nu \big\rangle_Q > 0$$
whenever (2.5)–(2.8) are used to generate $v_e^\nu$ for $\nu > \hat\nu$.

Proof. It suffices once more to give the argument in the context of the transformed variables. Observe that the gradient mapping $\nabla\tilde f$ is strongly monotone, and that $w_p^\nu = \nabla\tilde f(\tilde u_0^\nu) - \nabla\tilde f(\tilde u_0^{\nu-1})$ with $\tilde u_0^\nu \in [\tilde u_0^{\nu-1}, \tilde u_e^{\nu-1}]$ when (2.1)–(2.4) are used to generate $\tilde u_e^\nu$ in the primal. Hence the primal part of the assertion is true if $\tilde u_0^\nu \ne \tilde u_e^{\nu-1}$ for $\nu > \hat\nu$. According to Proposition 3.5 (cf. also the remark after it), one has $\tilde u_0^\nu \in \tilde U_0$ for all $\nu > \hat\nu$. We partition all vectors in conformity with the scheme in (4.6). Then $\tilde u_{0,r}^\nu = \tilde u_r^\ast$ and $\tilde u_{2,r}^\nu = \tilde u_r^\ast$.

If the $(\nu-1)$th iteration with $\nu > \hat\nu$ is the first iteration of a primal cycle, then the line search is performed on $[\tilde u_0^{\nu-1}, \tilde u_2^{\nu-1}]$. For the direction vector $d^{\nu-1} := \tilde u_2^{\nu-1} - \tilde u_0^{\nu-1}$, the optimal step length for $\tilde u = \tilde u_0^{\nu-1} + \tau d^{\nu-1}$ to minimize the quadratic form (4.9) over all $\tau \in [0, \infty)$ can be derived from the expression in (4.11) as
$$\tau^{\nu-1} = \frac{-d^{\nu-1}\cdot\nabla\tilde f(\tilde u_0^{\nu-1})}{d^{\nu-1}\cdot A\,d^{\nu-1}} = \frac{d_f^{\nu-1}\cdot d_f^{\nu-1}}{d_f^{\nu-1}\cdot A_{ff}\,d_f^{\nu-1}},$$
where the first equation in Proposition 3.8 has been used with $\nabla\tilde f(\tilde u_0^{\nu-1})$, and $A_{ff}$ is the Hessian component in (4.13). Note that none of the eigenvalues of $A_{ff}$ is less than 1. Hence $\tau^{\nu-1} \le 1$, and the equality holds only if $d_f^{\nu-1}$ is an eigenvector corresponding to 1 as an eigenvalue of $A_{ff}$, i.e., $A_{ff}d_f^{\nu-1} = d_f^{\nu-1}$. But it follows from (4.11) and the


first equation in Proposition 3.8 that we also have $A_{ff}(\tilde u_f^\ast - \tilde u_{0,f}^{\nu-1}) = d_f^{\nu-1}$. Therefore $\tau^{\nu-1} = 1$ implies $\tilde u_{2,f}^{\nu-1} = \tilde u_f^\ast$ and $\tilde u_2^{\nu-1} = \tilde u^\ast$. And then $\tilde u_0^\nu = \tilde u^\ast$, i.e., the iteration terminates at the primal optimal solution.

If the $(\nu-1)$th iteration with $\nu > \hat\nu$ is not the first iteration of a primal cycle, then formulas (2.1)–(2.4) are used to define $\tilde u_e^{\nu-1}$. In the proof of Proposition 3.5 we have actually shown that $\tilde u_e^{\nu-1} \in \tilde U_0$ for all $\nu > \hat\nu$. Hence $[\tilde u_0^{\nu-1}, \tilde u_e^{\nu-1}] \subset \tilde U_0$ for all $\nu > \hat\nu$. Then it follows from (2.4) that $|\tilde u_e^{\nu-1} - \tilde u_0^{\nu-1}| \ge 1$ unless $\tilde u_e^{\nu-1}$ is on the relative boundary of $\tilde U_0$. In either case we have $\tilde u_0^\nu \ne \tilde u_e^{\nu-1}$ again, since $\tilde u_0^{\nu-1} \in \tilde U^\circ$ for $\nu > \hat\nu$ and $\tilde U^\circ$ is contained in the ball around $\tilde u^\ast$ of radius $\frac12$. The dual claims can be verified similarly.

Theorem 4.5 (A finite termination property of PDCG). Assume that the critical face condition is satisfied. Suppose that the cycle size $k$ chosen in Algorithm 2 is such that $k > \bar k$, where $\bar k$ denotes the rank of the linear transformation $u \mapsto S_d R S_p(u)$. (It suffices in this to have $k > \min\{m, n\}$.) Let $\bar\nu$ be an iteration number as defined in Proposition 4.4 and satisfying the conditions there. If $u_0^{\hat\nu} \in U_0$ for some $\hat\nu \ge \bar\nu$ (as is sure to happen in an interactive primal restart at that stage), then the algorithm will terminate in the next full primal cycle, if not earlier. Similarly, if $v_0^{\hat\nu} \in V_0$ for some $\hat\nu \ge \bar\nu$ (as is sure to happen in an interactive dual restart at that stage), then the algorithm will terminate in the next full dual cycle, if not earlier.

Proof. We concentrate on the primal part; the proof of the dual part is parallel. In the transformed variables, where we place the argument once more, $\bar k$ is the rank of the submatrix $\tilde R_{ff}$ of $\tilde R$ in (4.8). Note that for $\nu \ge \bar\nu$ the process is in an ultimate quadratic region of the problem as specified in Proposition 4.4. In the proof of Proposition 4.4 we have shown that for all $\nu > \hat\nu$ one has $[\tilde u_0^\nu, \tilde u_e^\nu] \subset \tilde U_0$, and that the line searches on $[\tilde u_0^\nu, \tilde u_e^\nu]$ are perfect in the sense that, on exiting Step 5 of iteration $\nu$, $\tilde u_0^{\nu+1}$ minimizes $\tilde f$ on the half-line from $\tilde u_0^\nu$ in the direction of $\tilde u_e^\nu - \tilde u_0^\nu$. Observe that there is no interactive primal restart in the first $k-1$ iterations of a full primal cycle, so the points $u_0^\nu$ are unaffected by restarts in these iterations. We claim now that the search direction vectors $\tilde u_e^\nu - \tilde u_0^\nu$ and $\tilde v_e^\nu - \tilde v_0^\nu$ are then the same as the ones that would be generated by a conjugate gradient algorithm on $\tilde f$ relative to $\operatorname{aff}\tilde U_0$. The finite termination property will be a consequence of observing that the Hessians of $\tilde f$ in an ultimate quadratic region of the problem (cf. (4.11)) have at most $\bar k + 1$ different eigenvalues.
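Two standard facts, recalled here in our own words, make this eigenvalue count decisive. First, for a strictly convex quadratic whose Hessian has $s$ distinct eigenvalues, conjugate gradients with exact line searches reach the minimizer in at most $s$ iterations. Second, the relevant Hessian here is $I + \tilde R_{ff}^T\tilde R_{ff}$ by (4.11); if $\operatorname{rank}\tilde R_{ff} = \bar k$, then $\tilde R_{ff}^T\tilde R_{ff}$ has at most $\bar k$ distinct nonzero eigenvalues together with the eigenvalue 0, so
$$\#\big\{\text{distinct eigenvalues of } I + \tilde R_{ff}^T\tilde R_{ff}\big\} \ \le\ \bar k + 1,$$
and a cycle of length $k > \bar k$ suffices for termination.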

The proof of the claim will go by induction. We know from Proposition 3.8 that the claim is true for the first iteration of the full primal cycle in question. Suppose it is true for the $(\nu-1)$th iteration, generating $\tilde u_0^\nu$ in that cycle, but $\tilde u_0^\nu \ne \tilde u^\ast$. Then by (4.27) in Proposition 4.4, the first alternative of (2.2) will be used to generate $\pi_p^\nu$. Hence it follows from (2.1)–(2.3) and Proposition 3.8 that
$$(4.29)\qquad (\tilde u_{cg}^\nu - \tilde u_0^\nu)_f = \frac{(\tilde u_2^\nu - \tilde u_0^\nu)_f + \pi_p^\nu\,(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f}{1 + \pi_p^\nu} = \frac{-\nabla\tilde f(\tilde u_0^\nu)_f + \pi_p^\nu\,(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f}{1 + \pi_p^\nu},$$
$$\pi_p^\nu\,(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f = \frac{\max\big\{0,\ \big(\nabla\tilde f(\tilde u_0^\nu)_f - \nabla\tilde f(\tilde u_0^{\nu-1})_f\big)\cdot\nabla\tilde f(\tilde u_0^\nu)_f\big\}}{\big(\nabla\tilde f(\tilde u_0^\nu)_f - \nabla\tilde f(\tilde u_0^{\nu-1})_f\big)\cdot(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f}\,(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f,$$
where all the points $\tilde u_e^{\nu-1}$, $\tilde u_0^\nu$ and $\tilde u_2^\nu$ are on the critical face $\tilde U_0$. By the induction hypothesis, the directions of line search are the same as the ones generated by the conjugate gradient algorithm in all the previous iterations of the cycle. Hence
$$\nabla\tilde f(\tilde u_0^\nu)_f\cdot\nabla\tilde f(\tilde u_0^{\nu-1})_f = 0,$$


which implies $\big(\nabla\tilde f(\tilde u_0^\nu)_f - \nabla\tilde f(\tilde u_0^{\nu-1})_f\big)\cdot\nabla\tilde f(\tilde u_0^\nu)_f \ge 0$. Therefore, by noting that $\tilde u_e^{\nu-1} - \tilde u_0^\nu$ is a positive multiple of $\tilde u_0^\nu - \tilde u_0^{\nu-1}$, we obtain
$$(4.30)\qquad \pi_p^\nu\,(\tilde u_e^{\nu-1} - \tilde u_0^\nu)_f = \frac{\big(\nabla\tilde f(\tilde u_0^\nu)_f - \nabla\tilde f(\tilde u_0^{\nu-1})_f\big)\cdot\nabla\tilde f(\tilde u_0^\nu)_f}{\big(\nabla\tilde f(\tilde u_0^\nu)_f - \nabla\tilde f(\tilde u_0^{\nu-1})_f\big)\cdot(\tilde u_0^\nu - \tilde u_0^{\nu-1})_f}\,(\tilde u_0^\nu - \tilde u_0^{\nu-1})_f.$$
Comparing (4.29) and (4.30) with the conjugate gradient formulas of Hestenes and Stiefel [22], we see that the vector $\tilde u_{cg}^\nu - \tilde u_0^\nu$ is equivalent to the search direction vector in a standard conjugate gradient algorithm for $\tilde f$ relative to the free variables, i.e., over $\operatorname{aff}\tilde U_0$.
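To make the Hestenes-Stiefel connection concrete, here is a minimal, self-contained sketch of ours (not code from the paper; the function name and test setup are hypothetical) of conjugate gradients with the Hestenes-Stiefel coefficient, including the max{0, .} safeguard appearing in (4.29)–(4.30), applied to a quadratic whose Hessian has the form $I + R^TR$:

    import numpy as np

    def cg_hestenes_stiefel(A, b, x0, tol=1e-12):
        # Minimize f(x) = (1/2) x.A x - b.x by conjugate gradients with
        # the Hestenes-Stiefel direction update and perfect line searches.
        x = np.asarray(x0, dtype=float)
        g = A @ x - b                       # gradient of f at x
        d = -g                              # first direction: steepest descent
        for _ in range(len(b)):
            if np.linalg.norm(g) <= tol:
                break
            tau = -(d @ g) / (d @ (A @ d))  # exact step for a quadratic
            x = x + tau * d
            g_new = A @ x - b
            # Hestenes-Stiefel coefficient with the max{0, .} safeguard:
            beta = max(0.0, (g_new - g) @ g_new) / ((g_new - g) @ d)
            d = -g_new + beta * d
            g = g_new
        return x

    # A Hessian I + R^T R with rank(R) = 2 has at most 3 distinct eigenvalues,
    # so exact-arithmetic CG stops within 3 iterations (the mechanism behind
    # Theorem 4.5).
    rng = np.random.default_rng(0)
    R = rng.standard_normal((2, 8))
    A = np.eye(8) + R.T @ R
    b = rng.standard_normal(8)
    assert np.allclose(A @ cg_hestenes_stiefel(A, b, np.zeros(8)), b)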

Observe that the rank $\bar k$ of the linear transformation in Theorem 4.5 is bounded above by the ranks of the projection mappings $S_p$ and $S_d$, which are $\dim U_0$ and $\dim V_0$. Hence
$$\bar k \ \le\ \min\{\dim U_0,\ \dim V_0\}.$$
Therefore, even in the case that the original problems (P) and (Q) are of high dimension, the optimal solution can still be reached in a relatively short cycle after entering an ultimate quadratic region for the problem if merely one of the critical faces $U_0$ and $V_0$ happens to be of low dimension, provided that at least one of the critical faces is eventually reached by the corresponding iterates. This condition will certainly be satisfied if any interactive restarts occur for $\nu \ge \bar\nu$, since all points $u_1^\nu$ and $v_1^\nu$ will be on the critical faces $U_0$ and $V_0$ by Proposition 1.8, and once $u_0^\nu$ or $v_0^\nu$ are on the critical faces, they will stay there (Proposition 3.5).

There are ways to force this condition to be satisfied, such as to insert at the beginning of each primal cycle a line search in the direction of the projection of $-\nabla_P f(u_0^\nu)$ on the tangent cone to $U$ at $u_0^\nu$, and similarly in the dual. (See Burke and Moré [23].) But even without such remedies, we often find in our test problems that the critical faces are identified in the tail of the iteration, and that restarts do occur in most cases, after which the iteration terminates at the optimal solution in a few steps.

5. Envelope Properties. To finish off, we establish two results on the finite-envelope property of the points $u_1^\nu$ and $v_1^\nu$ in our algorithms.

Proposition 5.1 (General saddle point property of iterates). On exiting from Step 5 of Algorithm 0 with $u_0^{\nu+1}$ and $v_0^{\nu+1}$, the elements $u_1^{\nu+1} \in G(v_0^{\nu+1})$ and $v_1^{\nu+1} \in F(u_0^{\nu+1})$ that will be calculated on return to Step 1 will be such that the pair $(u_0^{\nu+1}, v_1^{\nu+1})$ is the unique saddle point of $L(u,v)$ on $[u_0^\nu, u_2^\nu] \times V$, while the pair $(u_1^{\nu+1}, v_0^{\nu+1})$ is the unique saddle point of $L(u,v)$ on $U \times [v_0^\nu, v_2^\nu]$. In particular, $u_1^{\nu+1}$ will be the unique minimizing point relative to $U$ for the envelope function
$$f^\nu(u) := \max_{v\in[v_0^\nu, v_2^\nu]} L(u,v) \ \le\ \max_{v\in V} L(u,v) = f(u),$$
whereas $v_1^{\nu+1}$ will be the unique maximizing point relative to $V$ for the envelope function
$$g^\nu(v) := \min_{u\in[u_0^\nu, u_2^\nu]} L(u,v) \ \ge\ \min_{u\in U} L(u,v) = g(v).$$

Proof. Recall that because we are in the fully quadratic case, $L(u,v)$ and $f(u)$ are strictly convex in $u$, while $L(u,v)$ and $g(v)$ are strictly concave in $v$. In particular, $u_0^{\nu+1}$ must be the unique solution to the problem in Step 5 of minimizing $f(u)$ over $u \in [u_0^\nu, u_2^\nu]$. This is the primal problem of extended linear-quadratic programming that corresponds to $L$ on $[u_0^\nu, u_2^\nu] \times V$ instead of $U \times V$. Applying Theorem 1.1 to


it instead of to the original problem, we deduce the existence of a vector $v$ such that $(u_0^{\nu+1}, v)$ is a saddle point of $L$ relative to $[u_0^\nu, u_2^\nu] \times V$. Then $v$ is the unique point maximizing $L(u_0^{\nu+1}, v)$ with respect to $v \in V$ (by the strict concavity of $L(u,v)$ in $v$). Thus, $v$ is the unique element of $F(u_0^{\nu+1})$, so $v = v_1^{\nu+1}$. It follows from Theorem 1.1 again that $(u_0^{\nu+1}, v_1^{\nu+1})$ is the unique saddle point of $L(u,v)$ on $[u_0^\nu, u_2^\nu] \times V$, and $v_1^{\nu+1}$ is the unique solution to the corresponding dual problem, which by definition consists of maximizing the function $g^\nu$ over $V$.

The rest of the assertions are true by a parallel argument in which Theorem 1.1 is applied to the primal and dual problems that correspond to $L$ on $U \times [v_0^\nu, v_2^\nu]$.

    is applied to the primal and dual problems that correspond to L on U [v0 , v2 ].Proposition 5.2. (Ultimate saddle point property of iterates.) Suppose the crit-

    ical face condition is satisfied. Letbe an iteration number as specified in Proposition4.4 and satisfying the conditions there. If = r is the first iteration of someprimal cycle withvr0 U0, then for all r in that cycle, on exiting from Step 5 ofAlgorithm 2 withu+10 the elementv

    +11 F(u+10 )that will be calculated on return to

    Step 1 will be such that(u+10 , v+11 ) is the unique saddle point ofL(u, v) onU

    V,where

    (5.1) U := aff[ur0, u

    re] [u0 , ue ] U0

    and dim(aff

    [ur0, ure] [u0 , ue ]

    ) = r+ 1. In particular, v+11 will be the

    unique maximizing point relative to V for the envelope function

    g(v) := minuU

    L(u, v) minuU

    L(u, v) = g(u),

    and one will have g+1 g in that primal cycle. Moreover, for = r + d11withd1 := dim U0, it will be true thatg =g in an ultimate quadratic region for theproblem, and also thatv+11 = v, as long as the algorithm does not terminate earlier.

    Similarly, if = s is the first iteration of some dual cycle with vs0 V0,then for all s in that cycle, on exiting from Step 5 of Algorithm 2 with v+10 theelementu+11 G(v+10 ) that will be calculated on return to Step 1 will be such that(u+1

    1

    , v+1

    0

    ) is the unique saddle point ofL(u, v) onU

    V, where

    (5.2) V := aff

    [vs0, vse] [v0 , ve ]

    V0,with dim(aff

    [vs0, v

    se] [v0 , ve ]

    ) = s + 1. In particular, u+11 will be the

    unique minimizing point relative to U for the envelope function

    f(u) := maxvV

    L(u, v) maxvV

    L(u, v) = f(u),

    and one will have f+1 f in that dual cycle. Moreover, for = s+d2 1 withd2 := dim V0, it will be true that f = f in an ultimate quadratic region for theproblem, and also thatu+11 = u, as long as the algorithm does not terminate earlier.

Proof. We concentrate on the primal part; the proof of the dual part is parallel. The argument is similar to the one given for Proposition 5.1, but with the segment $[u_0^\nu, u_2^\nu]$ replaced by $U^\nu$. Recall from the proof of Theorem 4.5 that for $\nu \ge \nu_r$ the primal procedure is equivalent to the conjugate gradient algorithm on the restriction of $f$ to the affine hull $\operatorname{aff} U_0$ of the critical face $U_0$. Therefore the vectors $u_e^{\nu_r} - u_0^{\nu_r}, \ldots, u_e^\nu - u_0^\nu$ are linearly independent, and $u_0^{\nu+1}$ minimizes $f(u)$ over $u \in U^\nu$. The inequality $g^{\nu+1} \le g^\nu$ follows from the inclusion $U^{\nu+1} \supset U^\nu$. When $\nu = \nu_r + d_1 - 1$ we have $\dim\big(\operatorname{aff}([u_0^{\nu_r}, u_e^{\nu_r}] \cup \cdots \cup [u_0^\nu, u_e^\nu])\big) = d_1$, and then $U^\nu = U_0$. From the fact that (3.3) holds in $V^\circ$ (cf. the derivation of this relation in the proof of Theorem 3.6) we get $g^\nu = g$ in an ultimate quadratic region.

This result tells us that on entering an ultimate quadratic region, the primal iterations in Algorithm 2 produce an improving envelope for the dual objective function which approaches that function, whereas the dual iterations produce an improving envelope for the primal objective which approaches that function. To some extent this explains the phenomenon we have observed in our numerical experiments, that restarts often bring about fast termination, or at least significant progress in the next few iterations.

6. Numerical Tests. Numerical tests of Algorithm 1, the Primal-Dual Steepest Descent Algorithm (PDSD), and Algorithm 2, the Primal-Dual Conjugate Gradient Algorithm (PDCG), have been conducted in double precision on a DECstation 3100 on some medium to large-sized problems. For comparisons we have used the Basic Finite Envelope Method (BFEM) of [6] and the Stanford LSSOL code of Gill, Hammarling, Murray, Saunders and Wright [24] for quadratic programming. To enhance the performance of LSSOL in this situation, we tailored its Cholesky factorization subroutine to take advantage of the special structure of the $P$ and $Q$ matrices in our examples.

Comparisons with LSSOL are based on the fact that any extended linear-quadratic programming problem can be converted into a standard quadratic programming problem by introducing auxiliary variables and additional constraints [1, Theorem 1]. It must be kept in mind, however, that such a transformation not only increases the dimension substantially but disrupts much of the large-scale structure that might be put to use. A fundamental difficulty with any comparison with available QP methods, therefore, is that such methods are not really designed to handle the kinds of problems we wish to tackle, which stem from [1], [2] and [6]. They typically require setting up and working with the huge $R$ matrix, and trying to exploit any sparsity patterns that might be present in it, whereas we never need this matrix but work with decomposition in the calculation of $F$ and $G$, as explained in Section 1 after Proposition 1.6.
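To indicate where the extra variables come from, here is a sketch under simplifying assumptions ($V = [0,\beta]^m$ and $Q = I$), not the precise construction of [1, Theorem 1]: the primal objective involves a penalty term $\rho_{V,Q}(q - Ru)$ with $\rho_{V,Q}(s) = \sup_{v\in V}\{s\cdot v - \frac12 v\cdot Qv\}$, and in the simplified case one has, componentwise,
$$\rho(s) = \sup_{0\le v\le\beta}\big\{sv - \tfrac12 v^2\big\} = \min\big\{\tfrac12 z^2 + \beta w \ :\ z \ge 0,\ w \ge 0,\ z + w \ge s\big\},$$
so the penalty can be absorbed into an ordinary QP at the cost of additional variables and constraints for each dual dimension. With $s = q - Ru$ handled through equality constraints, a problem of size 100 swells to the order of the 400 variables mentioned below.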

The integer recorded as the size of each problem is the number of primal variables and also the number of dual variables. (The two would not have to be the same.) Thus, size = 100 means that problem (P) is an extended linear-quadratic programming problem on $\mathbb{R}^{100}$ for which the dual (Q) is likewise such a problem on $\mathbb{R}^{100}$, while the associated Lagrangian saddle point problem concerns a quadratic convex-concave function on a product of polyhedral sets in $\mathbb{R}^{100} \times \mathbb{R}^{100}$. In order to solve such a problem using LSSOL, it must be reformulated as a primal problem in 400 variables with 100 general equality constraints and 200 lower bounds on the auxiliary variables, in addition to having the original polyhedral constraints on the 100 primal variables.

In all the tests of PDCG and PDSD we have taken $10^{-2}$ as the progress threshold and $10^{-8}$ as the optimality threshold $\varepsilon$. For PDCG we have taken $k = 5$ as the cycle size parameter (whereas PDSD always has $k = 1$ by definition). We have run BFEM with mode = 1, which means that in each iteration a quadratic saddle point subproblem is solved over a product of two triangles. For the sake of expediency in solving this small subproblem, we have set it up as a standard QP problem in the manner of [1, Theorem 1] and have applied LSSOL. No doubt the CPU time could be improved by using a customized procedure within BFEM instead of this heavy-handed approach.

    The generation of test problems of large size raises serious questions about the


representative nature of such problems. It does not make sense to think of a large problem simply in terms of a large matrix, the elements of which are all random. Rather, a certain amount of structure must be respected. As an attempt to address this issue, we have taken all our problems to have the (deterministic) dynamical structure described in Rockafellar and Wets [5]. Only the parameters natural to this structure have been randomized. The dynamical structure enables us to use special routines in calculating $f(u)$ and $F(u)$, and on the other hand $g(v)$ and $G(v)$ [7]. For this purpose, and in implementing BFEM, we rely on code written by S. E. Wright [25] at the University of Washington.

The problems have been obtained as discretized versions of certain continuous-time problems of extended linear-quadratic optimal control of the kind developed in Rockafellar [4]. The first digit of the problem number corresponds to different continuous-time problems and the second digit corresponds to different discretization levels, i.e., the number of subintervals into which the fixed time interval has been partitioned, which determines the size of the discretized problem. Hence, e.g., the problems 0.1, 1.1, ..., 9.1 are the discretizations of 10 different continuous-time problems with the same discretization level (a 'transverse' family of test problems), and the problems 1.0, 1.1, ..., 1.7 are the discretizations of one continuous-time problem with 8 different discretization levels (a 'vertical' family of test problems). Only the data values in the continuous-time model have been generated randomly, and in each vertical family these are the same for all the problems. By increasing the number of subintervals, one can get larger and larger problems which remain stable with respect to the numerical scaling.

Table 1. Test results of problems 0.1–9.1.

                       CPU time (sec.)                 Iterations
    Prb.  Size   PDCG   PDSD   BFEM    LSSOL    PDCG  PDSD  BFEM  LSSOL
    0.1    100    4.6    4.8    6.6    283.1      23    34    31    500
    1.1    100    5.0    5.8    7.5    295.0      28    50    37    497
    2.1    100    5.0    4.0    8.1    299.7      28    24    41    495
    3.1    100    3.0    2.6    3.4    339.8       5     5     8    562
    4.1    100    3.8    3.5    3.8    353.2      13    17    11    619
    5.1    100    3.2    2.7    3.5    314.5       8     6     9    544
    6.1    100    3.5    3.0    3.8    339.2      11    11    11    552
    7.1    100    3.6    3.7    4.3    256.0      13    20    14    445
    8.1    100    4.5    5.2  *17.5    290.6      22    42          481
    9.1    100    3.5    3.3    4.0    347.2      12    15    12    591

Table 2. Test results of problems 0.2–9.2.

                       CPU time (sec.)          Iterations
    Prb.  Size   PDCG   PDSD   BFEM       PDCG  PDSD  BFEM
    0.2    340    9.2    8.9   15.3         24    28    31
    1.2    340   12.5   14.4   19.3         35    50    39
    2.2    340   10.1   11.9   20.5         25    38    42
    3.2    340    5.2    4.3    6.8          9     8    11
    4.2    340    7.8    6.6    8.7         18    17    15
    5.2    340    6.5    5.5    8.0         14    12    12
    6.2    340    5.7    5.1    7.3         11    11    12
    7.2    340    5.4    5.9    7.7         10    15    13
    8.2    340    9.8   11.2   20.3         25    38    42
    9.2    340    6.0    6.4    9.5         12    17    17


Table 3. Test results of problems 0.4–9.4.

                        CPU time (sec.)           Iterations
    Prb.  Size    PDCG    PDSD    BFEM      PDCG  PDSD  BFEM
    0.4   5140   122.6   138.6   270.8        23    32    38
    1.4   5140   177.9   230.9   315.7        32    52    44
    2.4   5140   218.6   191.7   399.3        40    44    56
    3.4   5140    46.7    45.0   110.1         8     9    16
    4.4   5140   111.8    94.8   126.7        20    20    18
    5.4   5140    71.4    64.5   133.2        12    13    19
    6.4   5140    80.5    78.2   141.2        14    16    20
    7.4   5140    54.9    85.1   104.7        10    19    15
    8.4   5140   161.1   235.9   362.9        29    55    50
    9.4   5140    76.5    77.8   115.5        14    17    16

The test results in Tables 1, 2 and 3 concern transverse families of size 100, 340 and 5140, respectively. The problems in the first family are small enough for the LSSOL approach to be viable as a comparison. But for the second and third families, our DECstation 3100 falls short of having enough memory for the LSSOL approach. Here we see that PDCG and PDSD are in the leading positions, with BFEM not very far behind in terms of CPU times.

The blank entry for the iterations of BFEM on problem 8.1 signifies that the method failed to terminate with optimality in 100 iterations. The corresponding figure for CPU time is preceded by * since it only indicates how long the first 100 iterations took. (The same conventions are adopted in all the other tables.)

Table 4. Test results of discretized problems 0.0–0.7.

                           CPU time (sec.)/Iterations
    Prb.    Size       PDCG        PDSD        BFEM        LSSOL      Value
    0.0       40      2.9/11      3.0/15      3.3/13     35.3/327    23.8626
    0.1      100      4.3/23      4.8/34      6.6/31    244.4/500    15.7824
    0.2      340      9.0/24      9.1/28     15.2/31                 15.2107
    0.3     1300     27.1/22     32.1/32     58.7/34                 15.2145
    0.4     5140    122.5/23    137.2/32    269.2/38                 15.2179
    0.5    20500    568.6/27    593.7/32   1396.2/46                 15.2188
    0.6    81940   2873.8/27   2722.6/32  *10637.6/                  15.2190
    0.7   100020   4209.3/28   3976.5/32   7086.3/38                 15.2191

Table 5. Test results of discretized problems 1.0–1.7.

                           CPU time (sec.)/Iterations
    Prb.    Size       PDCG        PDSD        BFEM        LSSOL       Value
    1.0       40      2.9/15      3.0/21      3.9/22     40.9/360    242.05983
    1.1      100      4.9/28      5.9/50      7.5/37    294.8/497    249.07378
    1.2      340     12.4/35     14.4/50     19.1/39                 249.77975
    1.3     1300     45.3/37     52.2/52     76.0/44                 249.79866
    1.4     5140    178.4/32    230.7/52    317.9/44                 249.79956
    1.5    20500    812.4/36   1007.5/52   1421.5/45                 249.79972
    1.6    81940   4015.8/36   4699.9/52   6119.9/45                 249.79976
    1.7   100020   5749.6/36   6538.5/52   8264.0/44                 249.79976


Table 6. Test results of discretized problems 2.0–2.7.

                           CPU time (sec.)/Iterations
    Prb.    Size       PDCG        PDSD        BFEM        LSSOL       Value
    2.0       40      3.6/28      4.4/63      *9.7/      44.7/446    -261.5042
    2.1      100      4.7/28      4.1/24      8.2/41    254.8/495    -362.2297
    2.2      340      9.5/25     11.1/38     20.0/42                 -369.7334
    2.3     1300     41.4/33     39.8/39     97.6/56                 -369.8073
    2.4     5140    220.3/40    191.9/44    402.9/56                 -369.8046
    2.5    20500    936.9/43    769.4/40   1945.4/63                 -369.8036
    2.6    81940   4370.3/40   3396.5/39   8893.3/71                 -369.8034
    2.7   100020   6247.2/40   5123.7/40  10032.1/53                 -369.8033

The test results in Tables 4, 5 and 6 refer to vertical families based on the first three continuous-time problems. They cover sizes th