
European Journal of Operational Research 42 (1989) 254-267 North-Holland

Theory and Methodology

Linearly constrained estimation by mathematical programming

Emil KLAFSZKY *, Department of Computer Science, Technical University of Miskolc, 3515 Miskolc-Egyetemváros, Hungary

János MAYER *, Department of Applied Mathematics, Computer and Automation Institute of the Hungarian Academy of Sciences, 1132 Budapest, Victor Hugo u. 18-20, Hungary

Tamás TERLAKY *, Department of Operations Research, Eötvös Loránd University, 1088 Budapest, Múzeum krt. 6-8, Hungary

Abstract: Some mathematical programming models of the mixing problem are discussed in this paper. Five models, based on different discrepancies, are considered and their fundamental properties are examined. Using the variational and Smirnov distances, linear programming models are obtained. The Pearson divergence leads to quadratic programming, the Hellinger divergence leads to $l_p$-programming, and the Kullback-Leibler information divergence gives a special geometric programming model. Finally some computational experiences are presented.

Keywords: Statistical estimation, mixing, distance, information divergence, linear programming, quadratic programming, $l_p$-programming, geometric programming.

1. Introduction

Mathematical models of mixing problems have several applications in practice. Two examples are the following:

Gasoline blending. From given hydrocarbon distillates the 'best approximation' of a given kind of petrol with prescribed quality parameters is to be mixed. (Parameters may be given according to any standards.) Because of its practical importance this problem received attention already in the early years of mathematical programming, see [3,7].

* Research partially supported by the Hungarian National Research Foundation under Grant no. OTKA-1049.

Received April 1988; revised August 1988

Concrete mixing. From different sand and gravel components with various granule distributions the 'best approximation' of the 'ideal' sand and gravel mixture for a given kind of concrete is to be mixed. This is one of the standard problems engineers are faced with when building trusses, see [26,20].

Mixing of chemicals, liquids, gases or other kinds of material frequently leads to the same mathematical problem. This indicates that the mixing problem considered in this paper has several practical applications. Five mathematical models of the mixing problem are studied here. The mixing problem is closely related to the decomposition of distributions, see Medgyessy [18].

The general mathematical model of mixing and five measures of discrepancy are presented in the second section. Four of these were summarized by Csiszár [5,6] as special cases of the f-divergence, while the fifth (the Smirnov distance) can be found in Rényi's book [21]. Section 2 also contains a brief discussion of the pertinent works in the literature. Based on the different discrepancies, five models are constructed in the third section. Basic properties and the special structure of these models, as well as their interconnections, are examined. In the fourth section computational results are presented, and the final section contains a brief summary together with conclusions.

2. Mathematical models for mixing

The general mathematical model of mixing is presented in this section, and five divergences, frequently used in mathematical statistics, are discussed.

The mathematical model of mixing. Let $a^{(i)} = (\alpha_{i1}, \ldots, \alpha_{ij}, \ldots, \alpha_{in})$, $i = 1, \ldots, m$, be probability distributions. Denote by

$$A = \begin{pmatrix} a^{(1)} \\ \vdots \\ a^{(m)} \end{pmatrix} = (a_1, \ldots, a_j, \ldots, a_n)$$

the $m \times n$ matrix having $a^{(1)}, \ldots, a^{(m)}$ as its row vectors; $a_1, \ldots, a_j, \ldots, a_n$ denote the column vectors of the matrix $A$. Using the distributions $a^{(1)}, \ldots, a^{(m)}$, our aim is to find the best approximation of the probability distribution

$$c = (\gamma_1, \ldots, \gamma_j, \ldots, \gamma_n).$$

In this case $A \geq 0$, $A \cdot 1 = 1$, $c \cdot 1 = 1$, where $1 = (1, \ldots, 1)$ denotes the all-one vector.

Problem. Find a vector $x \in \mathbb{R}^m$ for which $1 \cdot x = 1$, $x \geq 0$ (mixing conditions) hold, and the mixture $xA = z$ is a 'good' approximation of the goal distribution, i.e. for which $D(z \| c)$ is minimal.

Here the function $D(z \| c)$ measures the deviation of the distributions $z$ and $c$. Depending on the choice of the function $D(z \| c)$, different mixing models are obtained.

Divergence functions. It was proved by Csiszár [5,6] that most of the divergences that are commonly used to measure the discrepancy between distributions belong to the family of f-divergences. Some of these are presented here.

Let $f: \mathbb{R}_+ \to \mathbb{R}$ be a convex function with $f(1) = 0$; for the sake of convenience we may assume that

$$f(0) = \lim_{u \to 0+} f(u), \qquad 0 \cdot f\!\left(\frac{a}{0}\right) = a \cdot \lim_{u \to \infty} \frac{f(u)}{u}.$$

Definition 2.1. Let $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ be two arbitrary distributions; then

$$D_f(p \| q) = \sum_{j=1}^{n} q_j \cdot f\!\left(\frac{p_j}{q_j}\right)$$

is the f-divergence of $p$ and $q$.

With different functions $f$, different discrepancies are derived.

Variational distance ($l_1$-norm). Let $f(u) = |u - 1|$; then

$$D_f(p \| q) = \sum_{j=1}^{n} q_j \left| \frac{p_j}{q_j} - 1 \right| = \sum_{j=1}^{n} |p_j - q_j|.$$

Pearson's Q-divergence. Let $f(u) = (u - 1)^2$; then

$$D_f(p \| q) = \sum_{j=1}^{n} q_j \left( \frac{p_j}{q_j} - 1 \right)^2 = \sum_{j=1}^{n} \frac{(p_j - q_j)^2}{q_j}.$$

Kullback-Leibler information divergence (entropy). Let $f(u) = u \cdot \log u$, and for the sake of simplicity of presentation let

$$0 \cdot \log \frac{0}{a} = \lim_{u \to 0+} u \cdot \log \frac{u}{a} = \lim_{u \to 0+} u \cdot \log u = 0$$

and

$$a \cdot \log \frac{a}{0} = \lim_{u \to 0+} a \cdot \log \frac{a}{u} = a \cdot \log a - a \cdot \lim_{u \to 0+} \log u = +\infty.$$


So in this case

$$D_f(p \| q) = \sum_{j=1}^{n} q_j \cdot \frac{p_j}{q_j} \log \frac{p_j}{q_j} = \sum_{j=1}^{n} p_j \log \frac{p_j}{q_j}.$$

This $D_f(p \| q)$ is called information divergence.

Hellinger divergence. Let $f(u) = 1 - \sqrt{u}$; then

$$D_f(p \| q) = \sum_{j=1}^{n} q_j \left( 1 - \sqrt{\frac{p_j}{q_j}} \right) = 1 - \sum_{j=1}^{n} \sqrt{p_j q_j}.$$

Smirnov distance ($l_\infty$-norm). Let $P = (P_1, \ldots, P_n)$ and $Q = (Q_1, \ldots, Q_n)$ be cumulative distributions, i.e.

$$P_j = \sum_{i=1}^{j} p_i \quad \text{and} \quad Q_j = \sum_{i=1}^{j} q_i, \qquad j = 1, \ldots, n.$$

Then

$$D(p \| q) = \max_{1 \leq j \leq n} |P_j - Q_j|$$

is the Smirnov distance of the distributions $p$ and $q$.

Mixing models defined by the above given distances are examined in the next section. For the sake of simplicity, mixing models obtained by using the variational, Pearson, etc. discrepancy measures will be called the variational model, the Pearson model, etc.
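To make the definitions concrete, here is a minimal sketch (our illustration, not part of the original paper) that evaluates the five discrepancy measures for two discrete distributions; it assumes $q > 0$ and, for the Kullback-Leibler case, $p > 0$ (otherwise the conventions above, e.g. $0 \cdot \log 0 = 0$, must be handled explicitly):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_j q_j * f(p_j / q_j), cf. Definition 2.1; assumes q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

# The four f-divergences used in the paper:
variational = lambda p, q: f_divergence(p, q, lambda u: np.abs(u - 1.0))
pearson     = lambda p, q: f_divergence(p, q, lambda u: (u - 1.0) ** 2)
kullback    = lambda p, q: f_divergence(p, q, lambda u: u * np.log(u))  # needs p > 0
hellinger   = lambda p, q: f_divergence(p, q, lambda u: 1.0 - np.sqrt(u))

def smirnov(p, q):
    """Smirnov distance: max_j |P_j - Q_j| over the cumulative distributions."""
    return np.max(np.abs(np.cumsum(p) - np.cumsum(q)))
```

The Smirnov distance is kept separate because, unlike the other four, it is not an f-divergence.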

The variational, Pearson, Kullback-Leibler and Hellinger models as formulated above belong to the following class of linearly constrained statistical estimation problems:

minimize $D(p \| q)$
subject to $1 \cdot p = 1$, $p \geq 0$, $p \in \mathcal{P}$, (2.1)

where $\mathcal{P}$ is a given convex polyhedron and $q$ is a fixed discrete probability distribution.

Using the variational distance in (2.1) leads to a problem which can easily be reformulated as a linear programming problem. This is common knowledge, see e.g. [23].

Before discussing the other three models a short comment is to be made. The polyhedron $\mathcal{P}$ in (2.1) can either be given as an intersection of half-spaces, i.e. $\mathcal{P} = \{p \mid Hp \leq h\}$ for some given matrix $H$ and vector $h$, or it can be specified as the convex hull of some given vectors, i.e.

$$\mathcal{P} = \operatorname{conv}\{a^{(1)}, \ldots, a^{(m)}\}.$$

Our mixing problem as formulated above clearly falls in this second class of problems. It is a well-known fact that these two formulations are far from being equivalent from the point of view of the numerical solution of problems with linear constraints. The transition from one representation of $\mathcal{P}$ to the other requires the application of algorithms which are very time consuming.

Selecting the Pearson divergence as the objective function in (2.1), the resulting problem is clearly a convex quadratic programming problem. In the case when $\mathcal{P}$ is given by a system of linear inequalities, standard quadratic programming methods can be applied, while for the representation of $\mathcal{P}$ as a convex hull, special-purpose algorithms have been developed, see Wolfe [27,28].

Using the information divergence in (2.1) results in estimation models introduced by Kullback and Leibler, see Kullback [16]. These types of models with an inequality representation of $\mathcal{P}$ have been thoroughly investigated by Jaynes, Kullback and Tribus, see Jaynes [14]; a corresponding duality theory was developed and dual-type algorithms were constructed via unconstrained convex programming by Brockett, Charnes and Cooper [2], and Charnes and Cooper [4]. Quadratically constrained models of this type have been introduced and studied by Fang and Rajasekera [12]. The connection of (2.1) with geometric programming is well known, see Duffin and Peterson [9]; a generalization is given in Scott and Jefferson [24]. In this paper problem (2.1) is considered with the alternative representation of $\mathcal{P}$; in fact, $\mathcal{P}$ is assumed to be given as the convex hull of a finite set of distributions. Using geometric programming, this gives rise to a duality relationship and a dual-type algorithmic approach different from those given in [2,4] (constrained dual).

For the case when Hellinger distance is used in (2.1) the authors of this paper did not succeed in finding pertinent works in the literature.

The Smirnov model does not belong to the family specified by (2.1), but it is a well-known fact that it can be reformulated as a linear programming problem, see e.g. [23].

Concerning our notation, remark that matrices will be denoted by Latin capital letters, vectors by Latin small letters, and their components by the corresponding Greek letters, e.g.

$$A = \begin{pmatrix} a^{(1)} \\ \vdots \\ a^{(m)} \end{pmatrix} = (a_1, \ldots, a_n) = (\alpha_{ij}), \quad i = 1, \ldots, m, \; j = 1, \ldots, n.$$

The following notations for the $l_1$- and $l_\infty$-norms will also be used:

$$\|p\|_1 = \sum_{j=1}^{n} |p_j| \quad \text{and} \quad \|p\|_\infty = \max_{1 \leq j \leq n} |p_j|.$$

3. Duality relationships, algorithmic aspects

In this section the mixing models obtained by the above introduced discrepancy measures are examined. By constructing primal-dual pairs of mathematical programming problems, algorithms for the solution of the mixing models are proposed as well. For convenience, positivity of the goal distribution ($c > 0$) is assumed below. To allow for a short formulation of the duality propositions, the following notion will be introduced.

Definition (Gapless duality). Two mathematical programming problems are said to be gapless duals of each other if

(i) one of them is formulated as a minimization, the other one as a maximization problem,

(ii) there exists an optimal solution for both problems and the optimal values of the objective functions coincide,

(iii) for any feasible solution of the minimization problem the objective function value majorizes the objective function value of the maximization problem at any feasible solution of that problem.

3.1. Variational model

If the variational distance measures the discrepancy between two distributions, we have the following mixing model.

Find a vector $x \in \mathbb{R}^m$ for which

$$\sum_{j=1}^{n} |x a_j - \gamma_j| \quad \text{is minimal}, \qquad (3.1)$$

subject to $1 \cdot x = 1$, $x \geq 0$.

The basic duality relationship concerning this estimation problem is formulated in the following proposition.

Proposition 1. The following optimization problems are gapless duals.

Primal problem (P):

minimize $\|xA - c\|_1$
subject to $1 \cdot x = 1$, $x \geq 0$.

Dual problem (D):

maximize $cy + \eta$
subject to $Ay + 1 \cdot \eta \leq 0$, $\|y\|_\infty \leq 1$.

Proof. Introducing new variables $z^+ = (\zeta_1^+, \ldots, \zeta_n^+) \geq 0$ and $z^- = (\zeta_1^-, \ldots, \zeta_n^-) \geq 0$ with $\zeta_j^+ - \zeta_j^- = x a_j - \gamma_j$, $j = 1, \ldots, n$, the primal problem is equivalent to the following linear programming problem:

minimize $1 \cdot z^+ + 1 \cdot z^-$
subject to $A^T x + z^+ - z^- = c$, (3.2)
$1 \cdot x = 1$, $x \geq 0$, $z^+ \geq 0$, $z^- \geq 0$.

The dual of (3.2) is as follows:

maximize $cy + \eta$
subject to $Ay + 1 \cdot \eta \leq 0$, (3.3)
$-1 \leq y \leq 1$.

Problem (3.3) is the linear programming formulation of the dual in the proposition, so LP duality proves Proposition 1. □

Remark. The linear programming primal-dual formulation of the variational model is well known, see e.g. [23].

The matrix structure of problems (3.2) and (3.3) is illustrated in Figure 1.

[Figure 1. Structure of the LP-equivalent of the variational model]

A special feature of problem (3.2) is that it contains all positive and negative unit vectors except $\pm e_{n+1}$. Utilizing this special structure, a starting feasible solution can be found by a single pivot operation, and so the first phase of the simplex method becomes trivial. In fact, any unit element of the last row (e.g. the first one) can be chosen as the first pivot element. In this case the unit matrices $\pm E$ of the tableau remain unchanged and, depending on the sign of the new right-hand side components, the previously chosen vector together with unit vectors $+e_j$ or $-e_j$, $j = 1, \ldots, n$, can be chosen as an initial feasible basis to start the simplex method with.

To summarize the considerations above: The variational model has been transformed into a special linear programming problem where the first phase of the simplex method turns out to be trivial.

Remark. When the dual problem (3.3) is solved by the simplex method, the upper-bounding technique can be used. Fixing the components of $y$ at their lower bound ($-1$), the slack variables provide an initial feasible solution.
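As an illustration of the reformulation (3.2), the following sketch (ours, not the authors' special-purpose simplex implementation) sets up and solves the variational model with a general-purpose LP solver:

```python
import numpy as np
from scipy.optimize import linprog

def variational_mix(A, c):
    """Solve min ||xA - c||_1 s.t. 1.x = 1, x >= 0 via the LP (3.2).
    Variables are stacked as (x, z_plus, z_minus)."""
    m, n = A.shape
    obj = np.concatenate([np.zeros(m), np.ones(2 * n)])
    A_eq = np.zeros((n + 1, m + 2 * n))
    A_eq[:n, :m] = A.T                       # A^T x
    A_eq[:n, m:m + n] = np.eye(n)            # + z_plus
    A_eq[:n, m + n:] = -np.eye(n)            # - z_minus
    A_eq[n, :m] = 1.0                        # 1.x = 1
    b_eq = np.concatenate([c, [1.0]])
    res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x[:m], res.fun
```

A tailored implementation would instead exploit the $\pm E$ blocks to skip the first simplex phase, as described above.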

3.2. Pearson model

Using Pearson's Q-divergence for building the mixing model, the following problem is derived:

minimize $(xA - c) S (A^T x - c)$
subject to $1 \cdot x = 1$, (3.4)
$x \geq 0$,

where $S$ is an $n \times n$ matrix defined by $S = \operatorname{diag}(1/\gamma_j)$, i.e. $\sigma_{jj} = 1/\gamma_j$, $j = 1, \ldots, n$, and $\sigma_{ij} = 0$ if $i \neq j$.

Proposition 2. An equivalent formulation of the Pearson model is the following quadratic programming problem.

Primal problem (P):

minimize $\tfrac{1}{2} z^T S z$
subject to $xA - z = 0$, (3.5)
$1 \cdot x = 1$, $x \geq 0$.

Let us formulate the following dual problem.

Dual problem (D):

maximize $-\tfrac{1}{2} y^T S^{-1} y + \eta$ (3.6)
subject to $Ay + 1 \cdot \eta \leq 0$.

The above formulated problems (3.5) and (3.6) are gapless duals.

Proof. Denote $z = xA$; then the objective function in (3.4) can be reformulated as follows:

$$(z^T - c^T) S (z - c) = z^T S z - 2 z^T S c + c^T S c.$$

Using the relations $Sc = 1$ and $z \cdot 1 = 1$, it is clear that problems (3.4) and (3.5) are equivalent. The gapless duality follows from $l_p$-programming duality theory, see e.g. Terlaky [25]. □

It is well known, see [25], that

$$y + Sz = 0, \qquad x(Ay + 1 \cdot \eta) = 0, \qquad (3.7)$$

together with primal and dual feasibility, is an optimality criterion for problems (3.5) and (3.6).

Remark. Notice that for any optimal solution $(y, \eta)$ of (3.6), $Ay \leq 0$ holds, since $y = -Sz = -S A^T x$ for all optimal solutions $x$ of (3.5). This implies $Ay = -A S A^T x \leq 0$, since $A \geq 0$, $S \geq 0$, and $x \geq 0$. (This property will be used in the algorithm for solving the Kullback-Leibler model.)

Quadratic programming problems (3.5) and (3.6) will be solved as a linear complementarity problem. The corresponding linear complementarity problem is the following:

$$xA - z = 0, \quad 1 \cdot x = 1, \quad Ay + 1 \cdot \eta + u = 0, \quad y + Sz = 0, \quad x \geq 0, \quad u \geq 0, \quad xu = 0.$$

Using the relations $z = xA$ and $y = -Sz$, and introducing the notation $P = A S A^T$, the following linear complementarity problem is obtained:

$$1 \cdot x = 1, \quad 1 \cdot \eta - Px + u = 0, \quad x \geq 0, \quad u \geq 0, \quad xu = 0. \qquad (3.8)$$

The data structure of problem (3.8) is illustrated in Figure 2.

In this case the variables $(\eta, \mu_1, \mu_2, \ldots, \mu_m)$ give a starting basis (by two pivot operations; the pivot positions for Lemke's [17] complementary pivot algorithm are indicated in Figure 2).

[Figure 2. Structure of the linear complementarity problem corresponding to the Pearson model]

Note that $\eta$ is a free variable, and so Lemke's method can be modified, like the simplex method, so that $\eta$ is kept as a basic variable, i.e. it is never a candidate for leaving the basis.
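For readers without a Lemke implementation at hand, the Pearson model (3.4) can also be treated directly as a convex QP; a minimal sketch (our own, using scipy's SLSQP rather than complementary pivoting):

```python
import numpy as np
from scipy.optimize import minimize

def pearson_mix(A, c):
    """Pearson model (3.4): minimize (xA - c) S (xA - c)^T with S = diag(1/c_j),
    subject to 1.x = 1, x >= 0."""
    m, _ = A.shape
    s = 1.0 / c                                  # diagonal of S
    objective = lambda x: np.sum(s * (x @ A - c) ** 2)
    cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0},)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=[(0.0, None)] * m, constraints=cons, method='SLSQP')
    return res.x
```

Lemke's method remains preferable for the structured LCP (3.8); the sketch only shows that the model itself is a standard convex QP.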

3.3. Kullback-Leibler model

Using information divergence for measuring the discrepancy between the goal distribution and the mixture we have the following problem.

Find a vector $x \in \mathbb{R}^m$ which

minimizes $\sum_{j=1}^{n} \zeta_j \log \frac{\zeta_j}{\gamma_j}$
subject to $xA - z = 0$, (3.9)
$1 \cdot x = 1$, $x \geq 0$.

The duality relationship underlying this problem is formulated as follows.

Proposition 3. The following primal-dual problems are gapless duals. The primal is equivalent to (3.9).

Primal problem (P):

minimize $-\sum_{j=1}^{n} \zeta_j \log \gamma_j + \sum_{j=1}^{n} \zeta_j \log \zeta_j$
subject to $xA - z = 0$, (3.10)
$1 \cdot z = 1$, $x \geq 0$, $z \geq 0$.

Dual problem (D):

maximize $-\log \sum_{j=1}^{n} \gamma_j e^{-\eta_j}$ (3.11)
subject to $Ay \leq 0$.


Proof. Since $xA = z$ and $A \geq 0$, $z \geq 0$, and $A \cdot 1 = 1$, these relations imply that $1 \cdot x = 1$ is equivalent to $1 \cdot z = 1$ in this case. So (3.9) is equivalent to (3.10). It can easily be seen that problem (3.10) is a geometric programming problem (Duffin, Peterson and Zener [8]; Klafszky [15]) whose dual problem is as follows:

maximize $\eta$
subject to $e^{a^{(i)} y} \leq 1$, $i = 1, \ldots, m$, (3.12)
$e^{\eta} \sum_{j=1}^{n} \gamma_j e^{-\eta_j} \leq 1$.

The first group of constraints in problem (3.12) can be reformulated as the inequality $Ay \leq 0$ by taking the logarithm of both sides. The last inequality is clearly active at any optimal solution, so by eliminating the variable $\eta$ the dual problem of the proposition results. Observing the Slater regularity of problem (3.10), the gapless duality follows from geometric programming duality theory [8,15]. □

Problem (3.12) is clearly equivalent to the following problem:

minimize $\sum_{j=1}^{n} \gamma_j e^{-\eta_j}$ (3.13)
subject to $Ay \leq 0$.

Let us consider the system of equilibrium conditions of problems (3.10) and (3.12), as given below:

$$xA - z = 0, \qquad Ay \leq 0,$$
$$1 \cdot z = 1, \qquad e^{\eta} \sum_{j=1}^{n} \gamma_j e^{-\eta_j} = 1,$$
$$x \geq 0, \quad z \geq 0, \qquad xAy = 0,$$
$$\gamma_j e^{-\eta_j + \eta} = \zeta_j, \quad j = 1, \ldots, n.$$

Let us introduce the notation

$$G(y) = \sum_{j=1}^{n} \gamma_j e^{-\eta_j}.$$

In this case $G(y) = \|\nabla G(y)\|_1$; moreover, the variables $z$ and $\eta$ can be eliminated from the equilibrium system, since

$$e^{\eta} = \frac{1}{G(y)} \quad \text{and} \quad z = \frac{-1}{G(y)} \nabla G(y)$$

hold. This means that the equilibrium system can equivalently be formulated in the simple form as follows:

$$xA = \frac{-1}{G(y)} \nabla G(y), \qquad Ay \leq 0, \qquad x \geq 0, \qquad xAy = 0. \qquad (3.14)$$

Notice that system (3.14) is exactly the Kuhn-Tucker system of the dual problem (3.13).

Numerical solution of the Kullback-Leibler model. One possible way to solve the Kullback-Leibler model would be to solve directly problem (3.9) or the equivalent problem

minimize $\sum_{j=1}^{n} (a_j x) \log \frac{a_j x}{\gamma_j}$
subject to $1 \cdot x = 1$, $x \geq 0$.

Another possibility is the solution of the dual problem (3.13); in this case the optimal values of the primal variables can be obtained from the equilibrium conditions (3.14).

The first way is alluring, since the reformulated primal problem is a convex programming problem with a single linear equality constraint and nonnegativity requirements for the variables. The pitfall inherent in this formulation can be discovered when the partial derivatives of the objective are computed: it turns out that the objective function is not differentiable along the subspaces $a_j x = 0$, $j = 1, \ldots, n$. Considering the geometric programming formulation (3.10) of the primal problem, it turns out that this trouble is just a special instance of a general difficulty of the same nature, reported in connection with the numerical solution of general geometric programming problems [22]. A new approach, based on controlled dual perturbations, has recently been published by Fang, Peterson and Rajasekera [11] for handling this problem.

To avoid the difficulties mentioned above, in our approach the second way was chosen, i.e. the dual problem (3.13) was solved and the optimal proportions for mixing were computed from the Kuhn-Tucker system (3.14).

Algorithm. The algorithm developed for the solution of the linearly constrained nonlinear programming problem (3.13) is based on the reduced gradient type algorithm of Murtagh and Saunders, which was successfully implemented in their MINOS program package [19]. So the nonbasic variables are divided into two subsets, as free ('superbasic') and fixed ('nonbasic') variables. For the minimization in the subspace of free variables the Davidon-Fletcher-Powell method [13] has been implemented. The special features of the problem have been utilized in the following way:

Starting point. In connection with the Pearson model it has been shown above that for the optimal solution of that model the inequality $Ay \leq 0$ holds. This means that the optimal solution of the Pearson model can be used as a feasible starting point for our algorithm. Since the Pearson divergence can be considered as an approximation of the Kullback-Leibler divergence (using a Taylor series expansion), this provides a 'good' starting point.
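For completeness, here is the expansion behind this remark (our addition; it is a standard fact in mathematical statistics, cf. [1]): writing $p_j = q_j(1 + t_j)$ and expanding $\log(1 + t_j)$ to second order,

$$\sum_{j} p_j \log \frac{p_j}{q_j} = \sum_{j} q_j (1 + t_j) \log(1 + t_j) \approx \sum_{j} q_j \left( t_j + \frac{t_j^2}{2} \right) = \frac{1}{2} \sum_{j} \frac{(p_j - q_j)^2}{q_j},$$

since $\sum_j q_j t_j = \sum_j (p_j - q_j) = 0$. That is, near $p = q$ the Kullback-Leibler divergence is approximately half the Pearson divergence, which also explains the roughly constant ratio of the P and I columns observed in Tables 2, 4 and 6 below.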

No pivot transformations needed. If $A$ is of full row rank, then choosing any starting basis from the column vectors of $A$, no basis transformations are needed in the solution process, since the $\eta_j$ variables are all free variables.

Computation of primal variables. If $y^*$ is an optimal solution of problem (3.13), then an optimal solution $(x^*, z^*)$ of (3.9) can be determined as follows. (Here again full row rank of matrix $A$ is assumed.) It was proved above that the following relation determines the optimal $z^*$:

$$z^* = \frac{-1}{G(y^*)} \nabla G(y^*).$$

The vector $x^*$ can be determined easily as follows:

$$x^* A_B = z_B^* \;\Rightarrow\; x^* = z_B^* A_B^{-1} = \frac{-1}{G(y^*)} \nabla G(y^*)_B A_B^{-1}.$$

Here the subscripts $B$ and $N$ denote the basic and nonbasic parts, respectively. The vector

$$\nabla G(y^*)_B A_B^{-1}$$

consists of those components of the reduced gradient of $G(y^*)$ which belong to the slack variables, so it is computed at each step of the algorithm which solves (3.13).


As the proposed method can be considered as a specialization of the algorithm published in [19], its theoretical convergence behavior is at least as good as that of the method in [19].
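A compact way to reproduce this dual approach with off-the-shelf tools (a sketch under our own naming, not the authors' reduced-gradient code): solve (3.13) with a general NLP solver, recover $z$ from (3.14), and obtain the weights $x$ by nonnegative least squares:

```python
import numpy as np
from scipy.optimize import minimize, nnls

def kl_mix_dual(A, c):
    """Solve the dual (3.13): minimize G(y) = sum_j c_j * exp(-y_j), A y <= 0,
    then recover z = -grad G(y*) / G(y*) and the weights x from x A = z."""
    m, n = A.shape
    G = lambda y: np.sum(c * np.exp(-y))
    cons = ({'type': 'ineq', 'fun': lambda y: -(A @ y)},)   # A y <= 0
    res = minimize(G, np.zeros(n), constraints=cons, method='SLSQP')
    y = res.x
    z = c * np.exp(-y) / G(y)        # optimal mixture distribution, cf. (3.14)
    x, _ = nnls(A.T, z)              # nonnegative solution of A^T x = z
    return x / x.sum(), z
```

The final normalization guards against small numerical drift; with exact arithmetic $1 \cdot x = 1$ would hold automatically.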

3.4. Hellinger model

If the Hellinger divergence is used as the objective function of the mixing problem, then the following model results:

minimize $1 - \sum_{j=1}^{n} \sqrt{(a_j x)\, \gamma_j}$ (3.15)
subject to $1 \cdot x = 1$, $x \geq 0$.

Proposition 4. Problem (3.15) can equivalently be formulated as an $l_p$-programming problem.

Proof. First notice that the minimization of

$$1 - \sum_{j=1}^{n} \sqrt{(a_j x)\, \gamma_j}$$

is equivalent to the maximization of

$$\sum_{j=1}^{n} \sqrt{(a_j x)\, \gamma_j}.$$

Introducing new variables $\omega_j$, $j = 1, \ldots, n$, under the conditions $a_j x \geq \omega_j^2$, $j = 1, \ldots, n$, the maximization of the above specified function is equivalent to the maximization of

$$\sum_{j=1}^{n} \sqrt{\gamma_j}\, \omega_j,$$

since increasing $\omega_j$ implies increasing $a_j x$. Clearly $a_j x = \omega_j^2$ for all $j$'s at an optimal solution.

Since $a^{(i)} \geq 0$ and $a^{(i)} \neq 0$ hold, the constraint $1 \cdot x = 1$ can be replaced by the constraint $1 \cdot x \leq 1$, because again equality holds for all optimal solutions.

Based on the above considerations, problem (3.15) can be reformulated as the following $l_p$-programming primal problem:

maximize $\sum_{j=1}^{n} \delta_j \omega_j$
subject to $\tfrac{1}{2} \omega_j^2 - \tfrac{1}{2} a_j x \leq 0$, $j = 1, \ldots, n$, (3.16)
$1 \cdot x - 1 \leq 0$, $-Ex \leq 0$,

where $\delta_j = \sqrt{\gamma_j}$, $j = 1, \ldots, n$.

This proves the proposition. □

Proposition 5. Problem (3.16) as primal problem and the dual given below are gapless duals.

Dual problem (D):

minimize $-\frac{\eta}{2} + \sum_{j=1}^{n} \frac{\gamma_j}{2\eta_j}$
subject to $Ay + 1 \cdot \eta \leq 0$, (3.17)
$y > 0$, $\eta \leq 0$.

Proof. Utilizing theoretical results of $l_p$-programming [25], the dual problem of (3.16) can be formulated as follows:

minimize $\zeta + \sum_{j=1}^{n} \frac{\delta_j^2}{2\eta_j}$
subject to $-\tfrac{1}{2} Ay + 1 \cdot \zeta - v = 0$,
$y \geq 0$, $\zeta \geq 0$, $v \geq 0$,
$\eta_j = 0 \Rightarrow \delta_j = 0$, $j = 1, \ldots, n$.

This problem simplifies if we use the conditions $\delta_j = \sqrt{\gamma_j} > 0$: in this case $\eta_j = 0$ cannot occur, i.e. $y > 0$ holds for all feasible solutions. The dual problem becomes the following:

minimize $\zeta + \sum_{j=1}^{n} \frac{\gamma_j}{2\eta_j}$
subject to $\tfrac{1}{2} Ay \leq 1 \cdot \zeta$,
$y > 0$, $\zeta \geq 0$.

By introducing the notation $\eta = -2\zeta$, the dual problem (3.17) given in the proposition has been derived. Observing that problem (3.16) is Slater regular, the desired gapless duality relationship results from $l_p$-programming duality theory, see [25]. □

For simplicity, denote $G(y) = \sum_{j=1}^{n} \gamma_j / \eta_j$ in the sequel.

Let us consider the equilibrium system of the primal-dual pair (3.16) and (3.17). This is the following system:

$$\omega_j^2 - a_j x \leq 0, \qquad Ay + 1 \cdot \eta \leq 0,$$
$$1 \cdot x \leq 1, \qquad \eta \leq 0,$$
$$x \geq 0, \qquad y > 0,$$
$$\xi_i \left( a^{(i)} y + \eta \right) = 0, \quad i = 1, \ldots, m,$$
$$\eta \left( 1 \cdot x - 1 \right) = 0,$$
$$\eta_j \left( \omega_j^2 - a_j x \right) = 0, \quad j = 1, \ldots, n,$$
$$\eta_j \omega_j = \delta_j, \quad j = 1, \ldots, n.$$

Considering the complementarity conditions, the first one is obviously equivalent to the equality $xAy + \eta = 0$, and the third one to $\omega_j^2 - a_j x = 0$ for all $j$'s, since $y > 0$ holds. Furthermore, matrix $A$ has no zero vector among its row vectors, so $\eta < 0$ and $1 \cdot x = 1$ hold. Using these observations, the above given equilibrium system has the following equivalent formulation:

$$xA = -\nabla G(y), \qquad Ay + 1 \cdot \eta \leq 0,$$
$$1 \cdot x = 1, \qquad y > 0, \qquad x \geq 0, \qquad (3.18)$$
$$xAy = -\eta,$$

where

$$\nabla G(y) = \left( -\frac{\gamma_1}{\eta_1^2}, \ldots, -\frac{\gamma_n}{\eta_n^2} \right).$$

Since $G(y) = -\nabla G(y) \cdot y$, for any vectors satisfying (3.18) the following relations hold:

$$-\eta = xAy = -\nabla G(y) \cdot y = G(y).$$

By eliminating $\eta$, (3.18) is equivalent to the following:

$$xA = -\nabla G(y), \qquad Ay - 1 \cdot G(y) \leq 0,$$
$$1 \cdot x = 1, \qquad y > 0, \qquad x \geq 0. \qquad (3.19)$$

It can easily be derived from the considerations above that for any optimal solution of the dual problem (3.17), $\eta = -G(y)$ holds, so (3.17) is equivalent to the following convex programming problem:

minimize $G(y)$
subject to $Ay - 1 \cdot G(y) \leq 0$, $y > 0$.

Notice that for optimal solutions

$$\|\nabla G(y)\|_1 = -\nabla G(y) \cdot 1 = xA \cdot 1 = x \cdot 1 = 1$$

holds.

Numerical solution of the Hellinger model. Considering the numerical solution of the primal problem (3.15), the same arguments hold as for the Kullback-Leibler model, i.e. the objective function is not differentiable along the subspaces $a_j x = 0$, $j = 1, \ldots, n$, which is a source of several numerical difficulties. So the dual problem (3.17) was solved, and the optimal value of the mixing variable $x$ was computed utilizing the equilibrium conditions (3.19). For general $l_p$-problems a new algorithm has been published recently by Fang and Rajasekera [10]: they handle the above-mentioned difficulties directly in the primal problem by controlled dual perturbations.

For the solution of the linearly constrained problem (3.17) the same version of the reduced gradient method as for the Kullback-Leibler model has been implemented. The following special features of the problem are utilized.

Starting point selection, no pivoting needed. The remarks made in Section 3.3 concerning the basis are valid here as well. Starting from a feasible solution of the dual problem (3.17), the inequality $y > 0$ holds throughout the procedure, since the objective function acts in this case like a penalty function.

Computation of primal variables. Having computed the optimal solution of (3.17), the optimal solution of the primal problem can be computed analogously to the case of the Kullback-Leibler model.

For the theoretical convergence behavior the same remark holds as for the Kullback-Leibler model.
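Analogously to the Kullback-Leibler case, the dual route can be sketched with a general-purpose solver (our illustration; the interior lower bound eps stands in for the open constraint $y > 0$, and the recovery of $x$ uses nonnegative least squares instead of the basis formula):

```python
import numpy as np
from scipy.optimize import minimize, nnls

def hellinger_mix_dual(A, c, eps=1e-9):
    """Solve the dual (3.17): minimize -eta/2 + sum_j c_j / (2 y_j),
    subject to A y + 1*eta <= 0, y > 0, eta <= 0; recover x via (3.19)."""
    m, n = A.shape
    obj = lambda v: -0.5 * v[n] + 0.5 * np.sum(c / v[:n])
    cons = ({'type': 'ineq', 'fun': lambda v: -(A @ v[:n] + v[n])},)
    bounds = [(eps, None)] * n + [(None, 0.0)]
    v0 = np.concatenate([np.ones(n), [-1.0]])   # feasible: A.1 = 1, eta = -1
    res = minimize(obj, v0, bounds=bounds, constraints=cons, method='SLSQP')
    y = res.x[:n]
    z = c / y ** 2                   # z = x A = -grad G(y), cf. (3.19)
    x, _ = nnls(A.T, z)
    return x / x.sum()
```

The starting point exploits the fact that the rows of $A$ are distributions, so $y = 1$, $\eta = -1$ is dual feasible.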

3.5. Smirnov model

If cumulative distributions are mixed instead of probability distributions (densities), then Smirnov distance can be used.

For simplicity of notation, pass to the cumulative distributions, i.e. redefine

$$\gamma_j := \sum_{k=1}^{j} \gamma_k, \quad j = 1, \ldots, n, \qquad \alpha_{ij} := \sum_{k=1}^{j} \alpha_{ik}, \quad j = 1, \ldots, n; \; i = 1, \ldots, m.$$

So the following Smirnov model results:

minimize $\max_{1 \leq j \leq n} |x a_j - \gamma_j|$
subject to $1 \cdot x = 1$, (3.20)
$x \geq 0$.

The underlying duality relationship is as follows.

Proposition 6. The following two problems are gapless duals.

Primal problem (P):

minimize $\|xA - c\|_\infty$
subject to $1 \cdot x = 1$, $x \geq 0$.

Dual problem (D):

maximize $cy + \eta$
subject to $Ay + 1 \cdot \eta \leq 0$, $\|y\|_1 \leq 1$.

Proof. The primal problem is clearly equivalent to the following linear programming problem:

minimize $\zeta$
subject to $xA + \zeta \cdot 1 \geq c$,
$-xA + \zeta \cdot 1 \geq -c$, (3.21)
$x \cdot 1 = 1$, $x \geq 0$.

The dual problem of (3.21) is the following:

maximize $c y^1 - c y^2 + \eta$
subject to $A y^1 - A y^2 + 1 \cdot \eta \leq 0$, (3.22)
$1 \cdot y^1 + 1 \cdot y^2 = 1$, $y^1 \geq 0$, $y^2 \geq 0$,

which is just a reformulation of the dual problem given in the proposition. It is obvious that at an optimal solution $\|y\|_1 = 1$ holds. So the gapless duality follows from LP duality theory. □

Remark. The primal-dual LP-formulation of the Smirnov model is well known, see e.g. [23].

Considering the size of problems (3.21) and (3.22), it is more convenient to solve the dual problem (3.22) with the simplex method. The matrix structure of problem (3.22) is illustrated in Figure 3.

The first phase of the simplex method can be passed by performing a single pivot step. In fact, one can choose an element of the shaded part of the last row (e.g. the first one) as a pivot element.

[Figure 3. Structure of the LP-equivalent of the Smirnov model]

After pivoting, the right-hand side is nonnegative again (since $A \geq 0$ holds), and so the variable corresponding to the pivot column together with the slack variables provide a feasible basic solution.

Note that simply by two pivots an initial feasible basis can be constructed for the primal problem as well. The verification of this is left to the reader.

Remark. Comparing the duality relations in Proposition 6 with the analogous relations given in connection with the variational model, the complete symmetry of the $l_1$- and $l_\infty$-norms can be observed.
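To round off Section 3, here is a sketch of the Smirnov model via the LP (3.21) (again our own setup with a generic solver; the paper's implementation solves the dual (3.22) instead):

```python
import numpy as np
from scipy.optimize import linprog

def smirnov_mix(A, c):
    """Smirnov model via the LP (3.21); the cumulation is done here, so A and c
    are the raw (non-cumulative) distributions. Variables: (x, zeta)."""
    m, n = A.shape
    Ac, cc = np.cumsum(A, axis=1), np.cumsum(c)
    obj = np.concatenate([np.zeros(m), [1.0]])
    # x Ac + zeta >= cc  and  x Ac - zeta <= cc, written as A_ub v <= b_ub:
    A_ub = np.block([[-Ac.T, -np.ones((n, 1))],
                     [ Ac.T, -np.ones((n, 1))]])
    b_ub = np.concatenate([-cc, cc])
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.fun
```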

4. Computational experiences

A program system, based on the proposed algorithms, has been developed for the solution of the different mixing models, in order to be able to compare them computationally.

The program system has been developed in the FORTRAN-77 language and was run on an 8 MHz IBM PC/AT equipped with a 10 MHz math coprocessor.

The algorithms implemented for the various models utilize the specific features of the individual models, and they are based on the following methods:

Variational model: simplex method with the upper-bounding technique for the dual problem.

Pearson model: Lemke's complementary pivot method.

Kullback-Leibler model: reduced gradient method for the dual problem.

Hellinger model: reduced gradient method for the dual problem.

Smirnov model: simplex method for the dual problem.

In the cases of the Kullback-Leibler and Hellinger models our experience was that computing a solution of the primal problem with a given accuracy, via solving the dual and utilizing the equilibrium conditions, requires the solution of the dual with a significantly higher accuracy. So, e.g., to get a primal solution with an accuracy of $10^{-5}$, the norm of the reduced gradient for the dual problem had to be decreased below $10^{-8}$; to a primal accuracy of $10^{-3}$ there corresponds a dual accuracy of $10^{-6}$ in the same sense. For the critical operations (scalar products of vectors, divisions) double precision computations were employed.

Table 1. Granule distributions for sand and gravel mixtures

1.    0.200  0.174  0.214  0.087  0.058  0.048  0.161  0.016  0.015  0.027
2.    0.150  0.100  0.200  0.080  0.090  0.130  0.210  0.010  0.010  0.020
3.    0.000  0.000  0.010  0.140  0.210  0.120  0.070  0.440  0.010  0.000
a4.   0.265  0.140  0.180  0.130  0.100  0.070  0.055  0.040  0.005  0.015
5.    0.000  0.010  0.020  0.030  0.030  0.120  0.710  0.050  0.030  0.000
a6.   0.230  0.230  0.190  0.100  0.100  0.060  0.040  0.050  0.000  0.000
a7.   0.130  0.170  0.160  0.140  0.110  0.110  0.080  0.060  0.010  0.030
8.    0.000  0.050  0.100  0.080  0.090  0.180  0.430  0.050  0.020  0.000
9.    0.000  0.039  0.337  0.139  0.087  0.085  0.245  0.017  0.017  0.034
10.   0.640  0.189  0.070  0.006  0.005  0.013  0.042  0.005  0.010  0.020
11.   0.620  0.250  0.045  0.005  0.008  0.006  0.032  0.004  0.010  0.020
12.   0.010  0.350  0.510  0.030  0.010  0.010  0.010  0.010  0.007  0.053
13.   0.000  0.250  0.260  0.200  0.090  0.050  0.020  0.050  0.020  0.060
14.   0.000  0.000  0.000  0.000  0.000  0.000  0.020  0.330  0.300  0.350
a15.  0.200  0.100  0.170  0.140  0.120  0.090  0.090  0.020  0.020  0.050
a16.  0.110  0.080  0.150  0.120  0.120  0.130  0.130  0.090  0.020  0.050

a This distribution represents a desirable mixture for some specific kind of concrete.

Scaling the input data (the columns of matrix $A$ and the vector $c$) turned out to be a useful tool for improving the numerical stability of the algorithms.

Some of our computational results are presented here. Our models were tested on two different data sets: one of them is a practical data set and the other is randomly generated.

Practical data set. The data in this case originate in a practical mixing problem. The data represent granule distributions of different kinds of sand and gravel mixtures; the rows of Table 1 contain the granule distributions. Rows marked by an 'a' contain distributions which represent desirable mixtures for some specific kind of concrete; they can be used as goal distributions. The first column of Table 1 serves as identification for the distributions; the individual distributions will be referred to by their serial numbers.

Table 2. Divergences for Example 1

      P      I      H      V      S
P   0.129  0.060  0.030  0.303  0.113
I   0.130  0.059  0.028  0.298  0.110
H   0.131  0.059  0.028  0.297  0.110
V   0.186  0.079  0.038  0.281  0.103
S   0.263  0.123  0.062  0.406  0.054

Some typical results of our computations are presented below. The following abbreviations are used:

P: Pearson divergence.
I: I-divergence, Kullback-Leibler divergence.
H: Hellinger distance.
V: Variational distance.
S: Smirnov distance.

Example 1. Distributions 1, 2, 3, 5, 9 and 10 are mixed to approximate distribution 7. The rows of Table 2 correspond to the different mixing models; the diagonal elements are the optimal values of the divergence function, whereas the off-diagonal entries are the substitution values of the optimal weights into the other divergence functions.

Table 3. Optimal weights for Example 1

P   0.768  0.063  0.170  0.000  0.000  0.000
I   0.672  0.161  0.167  0.000  0.000  0.000
H   0.640  0.194  0.169  0.000  0.000  0.000
V   0.230  0.657  0.113  0.000  0.000  0.000
S   0.358  0.610  0.001  0.000  0.000  0.032
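The cross-tabulations of Tables 2, 4, 6 and 9 are easy to reproduce once the optimal weights are available; a small self-contained sketch (ours, with weights given as a dict mapping each model label to its weight vector):

```python
import numpy as np

def divergence_table(weights, A, c, eps=1e-12):
    """Evaluate each model's optimal mixture under all five measures,
    reproducing the layout of Tables 2, 4, 6 and 9."""
    measures = {
        'P': lambda z: np.sum((z - c) ** 2 / c),
        'I': lambda z: np.sum(z * np.log((z + eps) / c)),   # eps guards log(0)
        'H': lambda z: 1.0 - np.sum(np.sqrt(z * c)),
        'V': lambda z: np.sum(np.abs(z - c)),
        'S': lambda z: np.max(np.abs(np.cumsum(z) - np.cumsum(c))),
    }
    return {model: {name: f(x @ A) for name, f in measures.items()}
            for model, x in weights.items()}
```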


Table 4. Divergences for Example 2

      P      I      H      V      S
P   0.024  0.012  0.006  0.147  0.042
I   0.024  0.012  0.006  0.148  0.042
H   0.024  0.012  0.006  0.148  0.042
V   0.046  0.021  0.010  0.149  0.053
S   0.029  0.015  0.008  0.150  0.019

Table 6. Divergences for Example 3

      P      I      H      V      S
P   0.101  0.051  0.026  0.282  0.118
I   0.102  0.051  0.026  0.280  0.108
H   0.105  0.051  0.026  0.280  0.103
V   0.132  0.066  0.034  0.272  0.165
S   0.245  0.114  0.058  0.371  0.048

Table 5. Optimal weights for Example 2

P   0.000  0.665  0.174  0.000  0.000  0.000  0.000  0.115  0.046
I   0.000  0.664  0.172  0.000  0.000  0.000  0.000  0.116  0.048
H   0.000  0.663  0.172  0.000  0.000  0.000  0.000  0.117  0.049
V   0.000  0.733  0.144  0.000  0.000  0.000  0.000  0.027  0.096
S   0.000  0.608  0.201  0.000  0.000  0.000  0.000  0.163  0.028

The optimal weights obtained from the five models are presented in Table 3, all values being rounded to 3 decimals. The sum of the weights for the original (unrounded) values is 1.0 with an accuracy of $10^{-5}$.

These results represent the average behavior of the models, i.e. the results for the P, I and H fits are nearly the same, with I and H being very close to each other and P differing from them to a small extent. The results for the divergences V and S differ from the previous results and from each other significantly.

Example 2. In this example distribution 16 was approximated by distributions 1, 2, 3, 5, 8, 9, 12, 13 and 14. The results are presented in Tables 4 and 5 analogously to the previous example; the optimal weights are shown in Table 5.

These results show a deviation from the average behavior in the direction of considerable coincidence of the results for the P, I and H fits. The divergences V and S behave as in the previous example.

Example 3. Distribution 7 is approximated by distributions 1, 2, 3, 5, 8, 9, 10, 11 and 12. The results can be observed in Tables 6 and 7.

Table 7. Optimal weights for Example 3

P   0.000  0.490  0.182  0.000  0.000  0.000  0.000  0.124  0.205
I   0.159  0.410  0.184  0.000  0.000  0.000  0.000  0.082  0.165
H   0.236  0.375  0.184  0.000  0.000  0.000  0.000  0.060  0.144
V   0.000  0.524  0.117  0.000  0.000  0.000  0.000  0.078  0.280
S   0.000  0.808  0.016  0.000  0.000  0.018  0.000  0.000  0.159

In this case the results differ significantly for the different models, this being true also for the divergences P, I and H. Notice that there is a structural difference between the weights for P and I: the first entry in the first row is a 'true' zero, not just a rounded value.

Remark. Looking at the first three columns in Tables 2, 4 and 6, it can be observed that the ratio of the entries in the first and second columns is roughly 2, the same being true for the second and third columns. This is in accordance with mathematical statistics, see [1].

Randomly generated test problems. To explore the behavior of the different mixing models, a population consisting of 80 randomly generated distributions has been computed, and using these data 150 test problems have been solved. This computational experiment confirmed the conclusions drawn from our practical problem. The average behavior shows slight differences between the P, I and H fits. We have found, however, problems where the Pearson model gave completely different weights than the Kullback-Leibler and Hellinger models. Such a problem is given in the next example.

Example 4. The first two distributions in Table 8 are mixed to approximate the third distribution. The computational results are presented in Table 9.

Table 8. Distributions from a randomly generated data set

1.   0.049  0.044  0.044  0.044  0.058  0.067  0.091  0.107  0.126  0.128  0.128  0.114
2.   0.052  0.045  0.044  0.048  0.059  0.077  0.097  0.116  0.117  0.131  0.122  0.092
a3.  0.007  0.022  0.028  0.050  0.077  0.106  0.134  0.135  0.129  0.119  0.113  0.080

a This distribution is the goal distribution to be approximated.

Table 9. Divergences and optimal weights for Example 4

      P      I      H      V      S     Optimal weights
P   0.340  0.096  0.040  0.276  0.034   1.00  0.00
I   0.352  0.089  0.036  0.234  0.024   0.00  1.00
H   0.352  0.089  0.036  0.234  0.024   0.00  1.00
V   0.352  0.089  0.036  0.234  0.024   0.00  1.00
S   0.340  0.096  0.040  0.276  0.080   1.00  0.00

The results show that for the Pearson and Smirnov models the optimal mixture consists solely of the first component, while the optimal mixture resulting from the Kullback-Leibler, Hellinger and variational models consists exclusively of the second component. According to our experience, problems with a significant difference between the results obtained from the Pearson model and those from the Kullback-Leibler and Hellinger models may appear in situations when there are 'similar' distributions among the components.

The CPU time for the linear and quadratic models was just a few seconds. For the Kullback-Leibler and Hellinger models the CPU times reported below were obtained in the case of $10^{-5}$ primal accuracy, with an accuracy of $10^{-8}$ for the reduced gradient in the dual problem. For distributions with 10 elements, the CPU time for mixing 9 distributions was on average around 20 seconds for the Kullback-Leibler model and around 40 seconds for the Hellinger model. With distributions consisting of 12 elements and mixing 2 components, the average CPU time for the Kullback-Leibler model was 15 seconds and for the Hellinger model 35 seconds. Due to the lack of a good starting point, solving the Hellinger model required in all cases significantly more CPU time than solving the Kullback-Leibler model.

5. Conclusions

Five mathematical programming models belonging to the class of linearly constrained statistical estimation problems have been investigated in this paper. Four of them are associated with the unifying framework of f-divergences, while the fifth (Smirnov distance) has been included for the sake of completeness. The models are characterized within the field of linearly constrained estimation by the fact that the constraint polyhedron is given as the convex hull of some fixed distributions. For the Kullback-Leibler and Hellinger models new duality relations and dual-type algorithms have been developed, and the Hellinger model has been formulated as an $l_p$-programming problem.

Summarizing our computational experiences, we can say that in most cases the results from the Pearson model gave a very good approximation of the results obtained by applying the Kullback-Leibler or the Hellinger model. Table 7 shows, however, that sometimes this is not the case, and the appropriate model is to be selected carefully. If costs are associated with the different distributions (as is the case in practice), then significantly different costs can be obtained for the mixture by using different models.

The Pearson model can be solved very efficiently by Lemke's method. The reduced gradient method for the I and H mixing turned out to be much slower, but the result for the Pearson model can be utilized as a starting point for the iterations, thus reducing the number of iterations significantly.

The results obtained for the variational and Smirnov models differ from each other and from the results of the other three models, but these models can be solved effectively with the simplex method.

Acknowledgements

We would like to thank the referees of this paper for their helpful and constructive suggestions made on the first draft.


References

[1] Borovkov, A.A., Mathematical Statistics, Nauka, Moscow, 1984 (in Russian).

[2] Brockett, P.L., Charnes, A., and Cooper, W.W., "MDI estimation via unconstrained convex programming", Communications in Statistics B - Simulation and Computation 9 (1980) 223-234.

[3] Charnes, A., Cooper, W.W., and Mellon, B., "Blending aviation gasolines - a study in programming interdependent activities in an integrated oil company", Econometrica 20 (1952) 135-159.

[4] Charnes, A., and Cooper, W.W., "Constrained Kullback-Leibler estimation; generalized Cobb-Douglas balance, and unconstrained convex programming", Rendiconti di Accademia Nazionale dei Lincei, Serie VIII, LVIII/4 (1975) 568-576.

[5] Csiszár, I., "Information divergence type measures for discrepancy between probability distributions", MTA III. Oszt. Közleményei 17 (1967) 123-149, 267-291 (in Hungarian).

[6] Csiszár, I., "I-divergence geometry of probability distributions and minimization problems", Annals of Probability 3 (1975) 146-158.

[7] Dantzig, G.B., Linear Programming and Extensions, Princeton University Press, Princeton, NJ, 1963.

[8] Duffin, R.J., Peterson, E.L., and Zener, C., Geometric Programming: Theory and Applications, Wiley, New York, 1967.

[9] Duffin, R.J., and Peterson, E.L., "Optimization and insight by geometric programming", Journal of Applied Physics 60 (1986) 1860-1864.

[10] Fang, S.C., and Rajasekera, J.R., "Controlled dual perturbations for $l_p$-programming", Zeitschrift für Operations Research 30 (1986) A29-A42.

[11] Fang, S.C., Peterson, E.L., and Rajasekera, J.R., "Controlled dual perturbations for posynomial programs", European Journal of Operational Research 35 (1988) 111-117.

[12] Fang, S.C., and Rajasekera, J.R., "Quadratically constrained minimum cross-entropy analysis", Mathematical Programming, forthcoming.

[13] Gill, P.E., Murray, W., and Wright, M.H., Practical Optimization, Academic Press, London, 1981.

[14] Jaynes, E.T., "Where do we stand on maximum entropy?", in: R.D. Levine and M. Tribus (eds.), The Maximum Entropy Formalism, The MIT Press, Cambridge, MA, 1979, 15-118.

[15] Klafszky, E., "Geometric programming", Systems Analysis & Related Topics, Seminary Notes 11 (1976), Budapest.

[16] Kullback, S., Information Theory and Statistics, Wiley, 1959.

[17] Lemke, C.E., "Bimatrix equilibrium points and mathematical programming", Management Science 11 (1965) 681-689.

[18] Medgyessy, P., Decomposition of Superpositions of Density Functions and Discrete Distributions, Wiley, 1977.

[19] Murtagh, B.A., and Saunders, M.A., "Large scale linearly constrained optimization", Mathematical Programming 14 (1978) 41-72.

[20] Rényi, A., "On the mathematical theory of comminution", Építőanyag (1960) 1-8 (in Hungarian).

[21] Rényi, A., Wahrscheinlichkeitsrechnung mit einem Anhang über Informationstheorie, VEB Deutscher Verlag der Wissenschaften, Berlin, 1962 (in German).

[22] Sarma, P.V.L.N., Martens, X.M., Reklaitis, G.V., and Rijckaert, M.J., "A comparison of computational strategies for geometric programs", Journal of Optimization Theory and Applications 26 (1978).

[23] Schrage, L., Linear, Integer, and Quadratic Programming with LINDO, The Scientific Press, 1984.

[24] Scott, C.H., and Jefferson, T.R., "A generalisation of geometric programming with an application to information theory", Information Sciences 12 (1977) 263-269.

[25] Terlaky, T., "On $l_p$-programming", European Journal of Operational Research 22 (1985) 70-100.

[26] Wesche, K., Baustoffe für tragende Bauteile, Bauverlag GmbH, Wiesbaden-Berlin, 1974.

[27] Wolfe, P., "Algorithm for a least-distance programming problem", Mathematical Programming Study 1 (1974) 190-205.

[28] Wolfe, P., "Finding the nearest point in a polytope", Mathematical Programming 11 (1976) 128-149.