
JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 59, No. 3, DECEMBER 1988

A Markov Chain Game with Dynamic Information¹

G. J. OLSDER² AND G. P. PAPAVASSILOPOULOS³

Communicated by Y. C. Ho

Abstract. Two players, not knowing each other's position, move in a domain and can flash a searchlight. The game terminates when one player is caught within the area illuminated by the flash of the other. However, if this first player is not in this area, then the other player has disclosed his position to the former one, who may be able to exploit this information. The game is considered on a finite state space and in discrete time.

Key Words. Markov games, zero-sum games, incomplete information, pursuit-evasion, mixed strategies.

1. Introduction

The game to be discussed belongs to the class of two-person zero-sum games. The players move in a certain domain and are unaware of each other's positions unless a player flashes a searchlight that illuminates an area of known shape around this player. By flashing his searchlight, a player discloses his position to the other player, wherever the players are. Termination of the game can occur only if a player flashes his searchlight and the other player finds himself trapped within the area illuminated. Both players

¹ The work of the second author was supported by ZWO, The Netherlands Organization for the Advancement of Pure Research, Contract No. B62-239, by the US Air Force Office of Scientific Research, Grant No. AFOSR-85-0245, and by the National Science Foundation, Grant No. NSF-INT-8504097.

² Professor, Department of Mathematics and Informatics, Delft University of Technology, Delft, The Netherlands.

³ Associate Professor, Department of Electrical Engineering Systems, University of Southern California, Los Angeles, California; Visiting Professor at Delft University of Technology during 1986.


want to catch the other player in their searchlight before they are caught themselves. Therefore, flashing has two aspects; in order for a player to win, he must flash; if, however, during such a flash the other player is not caught, the flashing player is in a more vulnerable position, because he has betrayed his location to the other player. The idea of this game is described in Ref. 1.

The game described above can be interpreted as a duel, such as, for instance, the duels described in Ref. 2. In a duel, two combatants are each armed with a certain number of bullets (the flashes in our case) and slowly advance toward each other. If a combatant fires a bullet, the probability of success (i.e., hitting the opponent) increases as the distance between the two combatants decreases. If the bullets are noisy (i.e., a combatant hears the firing of a bullet by his opponent) and if this combatant is not hit, he can use this information (fewer bullets left with the opponent) to his advantage. Such information is called dynamic, since it depends on the actions of the players. A feature that makes the game in this paper different is that, in addition to the flashing times, the players' strategies governing their own motion must also be determined.

It is easy to understand that both players will exploit probabilistic strategies, since a pure optimal strategy of one player would certainly lead to a loss for him if the other player knew this strategy. Conceptually, it is very complicated to define mixed and/or behavioral strategies for games that proceed continuously in time (see Ref. 3, Section 5.4). Therefore, the problem has been investigated for a finite state space (i.e., the players move on a network with a finite number of nodes) and for the discrete-time case.

Thus, the realm of Markov chain games has been entered (for a survey, see Ref. 4). For such games, both a normal and an extensive form description can be given; they are equivalent, and the former is more amenable to obtaining numerical results. See also Ref. 5, where a specific Markov chain game has been solved and where the players use mixed strategies. A slightly different pursuit-evasion game is given in Ref. 6. Both papers deal with a maneuvering evader who is being fired upon by the pursuer.

The problem treated in this paper also bears some relationship to the so-called bomber-versus-battleship game, as originally described in Ref. 7. In Ref. 6, more references with respect to this game are given. In this game, a battleship is floating around on a set of integers, constrained to take one step in either direction in each time unit. The sole aim of the battleship is to avoid a bomber, equipped with only one bomb, which is hovering over it, but which suffers from the fact that the bomb takes two time units to drop. Dropping the bomb (when and where) corresponds to switching on the searchlight in the current game.


Explicit results have been obtained when the players are confined to the circumference of a circle and, during each time step, each player can either move to one of the two adjacent nodes or stay where he is. Specific results depend on the number of nodes and the shape of the area illuminated (which may be different for the two players). In essence, finding the saddle- point strategies boils down to the solution of a two-level linear programming problem.

In Section 2, the exact problem statement will be given. In the first part of Section 3, a simplified version of the game is considered; the two players know each other's initial positions and only one player can flash (once). Here, it will be shown that mixed and behavioral strategies coincide. In the second part of Section 3, there is again only one player who can flash once; he knows the initial position of the other player, whereas now this latter player does not know the initial position of the player with the flash option. Though in principle the game may last an infinitely long time, it will be shown that, for the solution, a finite-horizon version suffices. In Section 4, the solution will be given for the case that both players have one flash at their disposal and do not know each other's initial positions. Some numerical examples are provided. In Section 5, players with different shapes of their flashing area are considered, which leads to an additional matrix game to be solved. Also, a minor variation of the game described in Section 4 is considered. The variation is that one player has two flash options and the other one has none. Lastly, in Section 6, some possible extensions, conjectures, and limitations will be discussed.

2. Problem Statement

Let us consider n positions arranged on a circle, enumerated by 1, 2, ..., n, as in Fig. 1. A pursuer P is at position i ∈ S = {1, ..., n}, and an evader V is at position j ∈ S. At the initial instant of time, t = 0, they do not know each other's position, but each one knows his own. Therefore, each player assumes a uniform distribution with respect to the initial position of the other player. At each instant of time, they can move one position to the left or the right or stay where they are. The time is discrete, t = 1, 2, 3, ....

Each one of them has a flashlight that he can flash only once, at some time t = 1, 2, 3, .... The flashing illuminates the position of the player who flashes, as well as the positions to his left and right. If, at the time P (or V) flashes, V (or P) is within the illuminated area, then he is caught and the game terminates; if V is not caught, and has not yet flashed his light, then the game terminates when he also flashes.
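For concreteness, the following minimal Python sketch (our illustration, not part of the paper; it uses 0-based positions, whereas the paper numbers them 1 through n) models the circular state space, the one-step moves, and the capture condition.

```python
def neighbors(pos: int, n: int) -> list[int]:
    """Positions reachable in one time step: left, stay, right (mod n)."""
    return [(pos - 1) % n, pos, (pos + 1) % n]

def illuminated(pos: int, n: int) -> set[int]:
    """Area lit by a flash at pos: own position plus the two adjacent ones."""
    return {(pos - 1) % n, pos, (pos + 1) % n}

def captures(flasher: int, other: int, n: int) -> bool:
    """The game terminates in capture iff the other player is in the lit area."""
    return other in illuminated(flasher, n)

# On a circle with n = 7 nodes, a flash at node 0 lights {6, 0, 1}.
assert neighbors(0, 7) == [6, 0, 1]
assert captures(0, 6, 7) and not captures(0, 2, 7)
```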


Fig. 1. State space.

Let us point out that, if P flashes at some instant of time before V does, and V is not caught, then P has revealed his position to V, who saw where P was when he flashed, whereas P knows only that V is in one of the remaining n - 3 positions. At this instant of time, the information pattern of both P and V, concerning each other's position, changes; thus, we are dealing with a problem with dynamic information. Or, to put it another way, each player receives information about the position of his opponent only when he or his opponent flashes his light; thus, there is a learning aspect with respect to the flashing, but not with respect to the motion of the players.

3. Solution of Two Simplified Versions of the Problem

In this section, two simple versions of the general problem are solved. The combination of these two versions will provide the solution to the general problem, and that solution is given in Section 4.

Case 1. Both P and V know each other's initial position; V has no flash.

Here, the situation is exactly as before, but V has no flash and both P and V know each other's initial positions. If the motion of P is completely deterministic, then V can always avoid capture; correspondingly, P can capture V if V's motion is deterministic. Thus, an equilibrium solution in pure strategies does not exist, and we will consider that P and V move according to some randomized decision schemes.

P's decision of when and where to flash is also considered to be the result of a randomized decision scheme.


Let A denote an n x n Markovian matrix of the following structure:

$$
A = \begin{bmatrix}
\lambda_{11} & \lambda_{12} & 0 & 0 & \cdots & 0 & \lambda_{1n} \\
\lambda_{21} & \lambda_{22} & \lambda_{23} & 0 & \cdots & 0 & 0 \\
0 & \lambda_{32} & \lambda_{33} & \lambda_{34} & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & 0 & \lambda_{n-1,n-2} & \lambda_{n-1,n-1} & \lambda_{n-1,n} \\
\lambda_{n1} & 0 & \cdots & 0 & 0 & \lambda_{n,n-1} & \lambda_{nn}
\end{bmatrix}, \tag{1a}
$$

$$\lambda_{ij} \ge 0, \qquad \sum_j \lambda_{ij} = 1, \tag{1b}$$

$$\lambda_{ij} = 0, \quad \text{if } i = 2, 3, \ldots, n-1 \text{ and } |j - i| > 1, \tag{1c}$$

$$\lambda_{13} = \cdots = \lambda_{1,n-1} = \lambda_{n2} = \cdots = \lambda_{n,n-2} = 0. \tag{1d}$$

The motion of V is described as follows. Let V be at position i at t = 0, and let

$$e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T \in R^n,$$

with the 1 in the ith position. At times t = 0, 1, 2, ..., the evader V chooses matrices $A_1, A_2, A_3, \ldots$ of the type (1), so that the probability of being at any position in S, at time t, is given by the row vector

$$e_i^T A_1 A_2 \cdots A_t. \tag{2}$$

Actually, only row i of $A_1$ and rows i - 1, i, and i + 1 of $A_2$, and so on, need to be considered.
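As a numerical illustration (our sketch; the random row weights and helper names are assumptions, not the paper's), a matrix of type (1) can be drawn and the distribution (2) propagated as follows:

```python
import numpy as np

def random_type1_matrix(n: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Random n x n Markovian matrix of type (1): row i carries all its mass
    on positions i-1, i, i+1 (mod n), the circulant tridiagonal support."""
    A = np.zeros((n, n))
    for i in range(n):
        w = rng.random(3)
        w /= w.sum()                      # nonnegative row summing to one
        for d, p in zip((-1, 0, 1), w):
            A[i, (i + d) % n] = p
    return A

def position_distribution(i: int, mats) -> np.ndarray:
    """Row vector e_i^T A_1 A_2 ... A_t of formula (2)."""
    row = np.zeros(mats[0].shape[0])
    row[i] = 1.0                          # e_i^T: V starts at position i
    for A in mats:
        row = row @ A
    return row

dist = position_distribution(0, [random_type1_matrix(7) for _ in range(3)])
assert abs(dist.sum() - 1.0) < 1e-12      # a probability distribution over S
```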

The motion of P, as well as his decision of whether, where, and when to flash, is described as follows. At time t = 0, P is at position j. Consider matrices $P_1, P_2, \ldots$ of the type (1). The motion of P is described by the analogue of (2):

$$e_j^T P_1 P_2 \cdots P_t. \tag{3}$$

At time t, P may decide to flash at any of the positions where he may be, so that, at time t, he also has a vector $(p_{t1}, \ldots, p_{tn})$, where $p_{tk}$ is chosen as a function of the previous decisions of P; $p_{tk}$ is zero if position k is not accessible for him at time t, given that he was at position j at t = 0. It holds that

$$\sum_{t,k} p_{tk} = 1, \qquad p_{tk} \ge 0.$$

A little reflection will persuade the reader that P can dispense with deciding on both $P_1, P_2, \ldots$ and $p_{tk}$, and consider only a probability vector $q_t$:

$$q_t = (q_{1t}, \ldots, q_{nt})^T, \qquad q_{it} \ge 0, \qquad \sum_{i,t} q_{it} = 1, \tag{4a}$$


$$q_{it} = 0, \quad \text{if position } i \text{ is not accessible at } t, \text{ given that P is at } j \text{ at } t = 0; \tag{4b}$$

i.e., $q_{it}$ is the probability with which P will be at time t at position i and flash there. This is essentially due to the open-loop character of P's information, since his knowledge concerning V's position is not altered by any information about V's actual motion.

As an example of the $q_t$'s in (4a), let $e_1^T = (1, 0, \ldots, 0)$; i.e., P starts at time t = 0 at position 1. Then,

$$q_1 = (q_{11}, q_{21}, 0, \ldots, 0, q_{n1})^T, \qquad
q_2 = (q_{12}, q_{22}, q_{32}, 0, \ldots, 0, q_{n-1,2}, q_{n,2})^T,$$

$$q_3 = (q_{13}, q_{23}, q_{33}, q_{43}, 0, \ldots, 0, q_{n-2,3}, q_{n-1,3}, q_{n,3})^T, \qquad \ldots. \tag{4c}$$

Let $\Pi$ be the n x n matrix

$$
\Pi = \begin{bmatrix}
1 & 1 & 0 & \cdots & 0 & 1 \\
1 & 1 & 1 & 0 & \cdots & 0 \\
0 & 1 & 1 & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 1 & 1 \\
1 & 0 & \cdots & 0 & 1 & 1
\end{bmatrix}; \tag{5}
$$

i.e., $\Pi_{ij} = 1$ if position j is illuminated when P is at position i and flashes. Using (2), (4a), (4b), (5), we find that the probability of capture is

$$e_i^T A_1 \Pi q_1 + e_i^T A_1 A_2 \Pi q_2 + \cdots + e_i^T A_1 A_2 \cdots A_t \Pi q_t + \cdots. \tag{6}$$

The summation stops at t = T in the finite-time case, and extends to infinity in the infinite-time one.
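The finite-horizon version of (6) is then a mechanical computation; the following sketch (ours, with $\Pi$ built as in (5)) makes the bookkeeping explicit.

```python
import numpy as np

def illumination_matrix(n: int) -> np.ndarray:
    """Matrix Pi of (5): Pi[i, j] = 1 iff j is lit when P flashes at i."""
    Pi = np.zeros((n, n))
    for i in range(n):
        for d in (-1, 0, 1):
            Pi[i, (i + d) % n] = 1.0
    return Pi

def capture_probability(i: int, mats, Pi: np.ndarray, qs) -> float:
    """Finite-horizon formula (6): sum_t e_i^T A_1 ... A_t Pi q_t."""
    row = np.zeros(Pi.shape[0])
    row[i] = 1.0
    total = 0.0
    for A, q in zip(mats, qs):
        row = row @ A                     # V's position distribution at time t
        total += row @ Pi @ q             # weighted by P's flash choices q_t
    return total

# Toy check: V stands still at 0 (the identity is of type (1)); P surely
# flashes at position 0 at t = 3, so V is caught with probability 1.
n = 7
qs = [np.zeros(n) for _ in range(3)]
qs[2][0] = 1.0
assert capture_probability(0, [np.eye(n)] * 3, illumination_matrix(n), qs) == 1.0
```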

Clearly, $(q_1, q_2, \ldots)$ lies in a multidimensional simplex, and so does $(A_1, A_1 A_2, \ldots)$, as the following lemma demonstrates, enabling us to reformulate (6) as a classical matrix game.

Lemma 3.1. The set of vectors

$$\{\,(e_i^T A_1,\; e_i^T A_1 A_2,\; \ldots,\; e_i^T A_1 A_2 \cdots A_t)^T \in R^{nt} \;\mid\; A_1, \ldots, A_t \text{ of type (1)}\,\} \tag{7}$$

is a convex polyhedron in $R^{nt}$.


Proof. Let

$$e_i^T A_1 = y_1, \quad e_i^T A_1 A_2 = y_2, \quad \ldots, \quad e_i^T A_1 A_2 \cdots A_t = y_t,$$

or, equivalently,

$$e_i^T A_1 = y_1, \quad y_1 A_2 = y_2, \quad \ldots, \quad y_{t-1} A_t = y_t.$$

Let also

$$e_i^T \bar{A}_1 = \bar{y}_1, \quad \bar{y}_1 \bar{A}_2 = \bar{y}_2, \quad \ldots, \quad \bar{y}_{t-1} \bar{A}_t = \bar{y}_t.$$

Let $0 < \lambda < 1$. We will show that there exist $\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_t$, so that

$$e_i^T \hat{A}_1 = \lambda y_1 + (1-\lambda) \bar{y}_1, \tag{8a}$$

$$(\lambda y_1 + (1-\lambda) \bar{y}_1) \hat{A}_2 = \lambda y_2 + (1-\lambda) \bar{y}_2, \tag{8b}$$

$$\cdots$$

$$(\lambda y_{t-1} + (1-\lambda) \bar{y}_{t-1}) \hat{A}_t = \lambda y_t + (1-\lambda) \bar{y}_t. \tag{8c}$$

Obviously,

$$\hat{A}_1 = \lambda A_1 + (1-\lambda) \bar{A}_1$$

satisfies (8a). Assume that $\hat{A}_1, \ldots, \hat{A}_{k-1}$ have been determined so as to satisfy the first k - 1 of Eqs. (8). We have to choose $\hat{A}_k$ so as to satisfy the kth of Eqs. (8), i.e.,

$$(\lambda y_{k-1} + (1-\lambda) \bar{y}_{k-1}) \hat{A}_k = \lambda y_{k-1} A_k + (1-\lambda) \bar{y}_{k-1} \bar{A}_k. \tag{9}$$

Let

$$y_{k-1} = (a_1, \ldots, a_n), \qquad \bar{y}_{k-1} = (b_1, \ldots, b_n),$$

and take

$$\hat{A}_k = \operatorname{diag}\!\left(\frac{\lambda a_i}{\lambda a_i + (1-\lambda) b_i}\right) A_k
+ \operatorname{diag}\!\left(\frac{(1-\lambda) b_i}{\lambda a_i + (1-\lambda) b_i}\right) \bar{A}_k. \tag{10}$$

If $a_i = b_i = 0$, then use $\mu$ and $1 - \mu$, where $0 \le \mu \le 1$, in place of $\lambda a_i/[\lambda a_i + (1-\lambda) b_i]$ and $(1-\lambda) b_i/[\lambda a_i + (1-\lambda) b_i]$. Obviously, $\hat{A}_k$ is of type (1) and satisfies the kth of Eqs. (8). The process can now be continued. The elements of each $\hat{A}_i$, $i = 1, \ldots, t$, satisfy


certain linear inequality constraints, as indicated after formula (1). Since each $A_i$ appears linearly in the vectors forming the set indicated by (7), it is clear not only that this set of vectors spans a convex area, but also that this area is bounded by a finite number of linear inequality constraints. Thus, the set is a convex polyhedron. □

It is clear now that (6), at least for the finite summation case (i.e., finite game duration, t = 0, 1, ..., T), can be reduced to a classical matrix game. Namely, let $\delta_1, \ldots, \delta_k$ be the extreme points of the polyhedron (7). Then, any element of (7) can be expressed as $\Delta y$, where

$$\Delta = [\delta_1, \ldots, \delta_k], \qquad y = (y_1, \ldots, y_k)^T, \qquad y_i \ge 0, \qquad \sum y_i = 1,$$

and thus (6) assumes the form

$$y^T A_T \tilde{q}, \qquad \sum_i y_i = \sum_i \tilde{q}_i = 1, \tag{11}$$

where $A_T$ is the matrix resulting from the multiplication of the extreme points forming $\Delta$ with the matrix $\Pi$ and disregarding the $q_{it}$'s that are zero. As a matter of fact, the same game (11) would result if one were to mix directly the pure strategies of P and V, as a little reflection will convince the reader. In (11), V chooses y and minimizes; P chooses $\tilde{q}$ and maximizes. The classical minimax theorem yields existence of solutions and uniqueness in value for this game.

Let us consider the size of the matrix $A_T$ in (11). After one time step, P can reach three positions; after two time steps, five positions; etc. Since he can flash at any of these time instants, his total number of strategies, if there are T time steps to go, is $3 + 5 + \cdots + (2T+1) = T^2 + 2T$. A tacit assumption here is that T ≤ [n/2]. For T ≥ [n/2], all points on the circle can be reached. Player V can choose from three directions during each time step, and his total number of different strategies is therefore $3^T$. The size of $A_T$ is $3^T \times (T^2 + 2T)$. The elements of $A_T$ are either 0 (no capture) or 1 (capture).

Example 3.1. Take n = 7 and T = 2. Then, $A_T$ is a 9 x 8 matrix. Assume that initially P is two positions to the right of V. An optimal mixed strategy for V consists of the following maneuvers: two times left (probability 1/3); first left and then stay there (probability 1/3); and lastly two times right (probability 1/3). The directions left and right are defined for V (and P) with the nose toward the center of the circle. Only three pure strategies form part of the optimal mixed strategy for V. Another way of describing V's behavior is by means of behavioral strategies; during the first time step, V moves to the right (probability 1/3) or to the left (probability 2/3).


If, during the first time step, V moved to the right, then during the second time step he will again move to the right (probability 1). If, during the first time step, V had moved to the left, then during the second time step he stays there (probability 1/2) or moves again to the left (probability 1/2). Whatever P does, the value of the game will not be more than 1/3. If we change the initial positions somewhat, such that P is initially three instead of two positions to the right of V, then it turns out, somewhat surprisingly, that the value is 1/2 (P can now catch V by either moving along the left bound or by moving along the right bound, which apparently increases his chances). □
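The matrix $A_T$ of this example can be generated mechanically. The sketch below (our construction, not the paper's code; it assumes 0-based positions and a fixed direction convention) enumerates V's $3^T$ motion paths and P's $T^2 + 2T$ flash choices and fills in the 0/1 capture entries; for n = 7, T = 2, and P two positions from V, it produces the 9 x 8 matrix of Example 3.1, and feeding it to the LP solver sketched a few paragraphs below should reproduce the value 1/3.

```python
import itertools
import numpy as np

def capture_matrix(n: int, T: int, v0: int, p0: int) -> np.ndarray:
    """0/1 matrix of game (11): rows are V's 3^T motion paths, columns are
    P's (flash time, flash position) pairs, T^2 + 2T of them for T <= n//2."""
    v_paths = list(itertools.product((-1, 0, 1), repeat=T))
    p_flashes = [(tau, (p0 + d) % n) for tau in range(1, T + 1)
                 for d in range(-tau, tau + 1)]
    A = np.zeros((len(v_paths), len(p_flashes)))
    for row, path in enumerate(v_paths):
        pos = [v0]                       # V's trajectory under this pure path
        for step in path:
            pos.append((pos[-1] + step) % n)
        for col, (tau, p) in enumerate(p_flashes):
            lit = {(p - 1) % n, p, (p + 1) % n}
            A[row, col] = 1.0 if pos[tau] in lit else 0.0
    return A

A2 = capture_matrix(n=7, T=2, v0=0, p0=2)   # P two positions away from V
assert A2.shape == (9, 8)                   # 3^T rows, T^2 + 2T columns
```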

The solution of (11) can be obtained by solving the following linear programming (LP) problem (see Ref. 3):

$$\max\,(y^T 1), \qquad \text{subject to } A_T^T y \le 1, \quad y_i \ge 0,$$

where 1 stands for a vector of ones of appropriate dimension and the superscript T denotes transpose. The solution to this LP problem yields the optimal mixed strategy of player V, normalized with respect to the saddle-point value of the game. The saddle-point value equals the inverse of the maximum value v of the criterion of the LP problem.

The dual problem,

$$\min\,(\tilde{q}^T 1), \qquad \text{subject to } A_T \tilde{q} \ge 1, \quad \tilde{q}_i \ge 0,$$

yields the optimal mixed strategy for player P, also normalized with respect to the saddle-point value of the game.
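In modern terms, this primal-dual pair can be handed to any LP code. Below is a small sketch using scipy.optimize.linprog (our helper, not from the paper; a constant shift makes all payoffs positive, so the value is positive and the normalization is well defined):

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A: np.ndarray):
    """Value and optimal mixed strategies of the game min_y max_q (y^T A q).

    Rows belong to the minimizer, columns to the maximizer; follows the
    normalized LP pair of the text."""
    shift = 1.0 - A.min()                 # every payoff of B is now >= 1
    B = A + shift
    m, k = B.shape
    # Primal: max 1^T y  s.t.  B^T y <= 1,  y >= 0  (row player, normalized).
    res = linprog(c=-np.ones(m), A_ub=B.T, b_ub=np.ones(k), method="highs")
    v = 1.0 / res.x.sum()                 # value of the shifted game B
    y = res.x * v                         # minimizer's mixed strategy
    # Dual: min 1^T q  s.t.  B q >= 1,  q >= 0  (column player, normalized).
    dual = linprog(c=np.ones(k), A_ub=-B, b_ub=-np.ones(m), method="highs")
    q = dual.x * v                        # maximizer's mixed strategy
    return v - shift, y, q

# With capture_matrix from the Example 3.1 sketch above, this should give 1/3:
# value, y, q = solve_matrix_game(capture_matrix(n=7, T=2, v0=0, p0=2))
```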

For different T's, (11) has different solutions. It is our intention now to show that no time larger than [n/2] needs to be considered, since the solution for T = [n/2] provides the solutions for any arbitrarily large T.

Lemma 3.2. The value $J_T$ of (11) is an increasing function of T.

Proof. Intuitively, one sees that, the larger the time interval, the better are the chances for P to capture V. More rigorously, consider the summation (6) up to time T. If P imposes on himself the restriction $q_T = (0, \ldots, 0)^T$, then V can exploit only $A_1, \ldots, A_{T-1}$, and the choice of $A_T$ is irrelevant. On the other hand, V cannot terminate effectively the game at T - 1 for arbitrary T, since no choice of $A_T$ nullifies the last term. □

Lemma 3.3. $J_1 \le J_2 \le \cdots \le J_{[n/2]} = J_{[n/2]+1} = J_{[n/2]+2} = \cdots$.

Proof. Continuing the rationale of the proof of Lemma 3.2, one sees that a choice by V of $A_T = I$ will result in

$$e_i^T A_1 \Pi q_1 + \cdots + e_i^T A_1 \cdots A_{T-1} \Pi q_{T-1} + e_i^T A_1 \cdots A_{T-1} \Pi q_T
= e_i^T A_1 \Pi q_1 + \cdots + e_i^T A_1 \cdots A_{T-1} \Pi [q_{T-1} + q_T]. \tag{12}$$


Hence $q_T$ can be added to $q_{T-1}$, and (12) can be considered as the payoff of a (T - 1)-period game, if the addition of $q_T$ to $q_{T-1}$ does not alter the character of $q_{T-1}$, i.e., does not create nonzero components in $\hat{q}_{T-1} = q_T + q_{T-1}$ in places where $q_{T-1}$ has to have zeros [see (4a), (4c)]. This can happen only if T ≥ [n/2], i.e., the minimum time needed for P to go to any position on the circle, when he starts moving one step at a time at t = 0. Thus, if T ≥ [n/2], V can terminate effectively the game at time T, by choosing

$$A_{[n/2]+1} = A_{[n/2]+2} = \cdots = I,$$

which is obviously to his interest, given the increasing character of $J_T$. □

Remark 3.1. If T is larger than [n/2], it holds that $J_T \ge 3/n$, since P can flash randomly at some position after time t ≥ [n/2] has elapsed and V has spread himself uniformly on the circle. It is natural to expect that $J_{[n/2]}$ will be greater than or equal to K/n, if K positions can be flashed instead of 3. For K = 1 [the $\Pi$ matrix of (5) becomes the unit matrix], we have $J_{[n/2]} = 1/n$, as the following reasoning shows.

Consider n = 4. For t = 0 and t = 1, the probabilities with which V moves are denoted in Fig. 2 by arrows on the outside of the circle, whereas on the inside of the circle the total probability of V being at the corresponding position is given. The possible positions of P at each time are marked by a star. Obviously, for the strategies of V denoted in Fig. 2, P has no more than probability 1/4 of capturing V, and one can easily construct other strategies with the same property. That the strategy of V delineated in Fig. 2 corresponds to an equilibrium for $J_2, J_3, \ldots$ is easy to show. One can generalize this example for n > 3, by considering strategies of V that essentially assign total probability 1/n to V being in any position accessible by P at any instant of time. It should be pointed out that the case $\Pi$ = unit matrix, i.e., that P can illuminate only his own position and no adjacent ones, results in making P indifferent to knowing V's initial position, since he can get the same payoff 1/n by flashing randomly at some position after time T ≥ [n/2] has elapsed.

Fig. 2. Mixed strategies of V.


On the other hand, if T < [n/2], V can obviously escape with probability 1, by going with probability 1 to the position(s) not accessible by P within this time.

Case 2. P knows V's position; V does not know P's position; V has no flash.

Here, we examine the case where P knows V's initial position, whereas V does not know P's position, but only that P is not at the position of V or the adjacent ones at t = 0. P is the only one who can flash. This situation arises by considering the general problem of the first part of this section, in which V flashes and fails before P flashes. Thus, P can use his flash knowing V's position (betrayed by V's own flashing). The time interval T is finite (the infinite time case can be reduced to the finite one by arguments essentially the same as those employed in Lemmas 3.2 and 3.3).

Consider Fig. 3. The evader V is at position n - 1 and knows that P is at position 1, or 2, ..., or n - 3. The pursuer P knows that V is at position n - 1 at time t = 0. V solves the following problem:

$$\min_y \max_{\tilde{q}_1, \ldots, \tilde{q}_{n-3}} \frac{1}{n-3}
\left[ y^T A_{1,T}\, \tilde{q}_1 + y^T A_{2,T}\, \tilde{q}_2 + \cdots + y^T A_{n-3,T}\, \tilde{q}_{n-3} \right], \tag{13}$$

where each $y^T A_{i,T}\, \tilde{q}_i$ is exactly of the type (11), with $A_{i,T}$ corresponding to time interval T and to initial positions i for P and n - 1 for V. The rationale of V employing (13) is that, since P can be anywhere in 1, 2, ..., n - 3, with equal probability 1/(n - 3), a weighted sum of problems of type (11) needs to be solved. In (13), y is a probability vector and so is each $\tilde{q}_i$. Clearly, $(\tilde{q}_1^T, \ldots, \tilde{q}_{n-3}^T)^T$ is not a probability vector, but it obviously spans a convex polyhedron; thus, it can be expressed as $E\Lambda$, where E is the matrix whose columns are the extreme points of this polyhedron and $\Lambda$ is a probability vector of appropriate dimension.

Fig. 3. Location of P and V.


Thus, (13) is transformed easily to a classical matrix game.

The transformation of (13) to an LP problem is as follows:

$$\min \sum_{i=1}^{T^2+2T} (\tilde{q}_1)_i,$$

subject to

$$[A_{1,T}\;\; A_{2,T}\;\; \cdots\;\; A_{n-3,T}]
\begin{bmatrix} \tilde{q}_1 \\ \tilde{q}_2 \\ \vdots \\ \tilde{q}_{n-3} \end{bmatrix} \ge 1,$$

$$\sum_{i=1}^{T^2+2T} (\tilde{q}_1)_i = \sum_{i=1}^{T^2+2T} (\tilde{q}_j)_i, \qquad j = 2, \ldots, n-3,$$

$$(\tilde{q}_j)_i \ge 0,$$

where 1 denotes a vector of ones of appropriate dimension.

The solution to this LP problem, with both equality and inequality constraints, equals the inverse of the saddle-point value of the game. Because of symmetry arguments on the circle, the number of elements in the vector

$$\tilde{q} = (\tilde{q}_1^T, \ldots, \tilde{q}_{n-3}^T)^T$$

can be reduced by a factor of about 2. Another way of solving (13) is to solve first for $\tilde{q}_i$, i = 1, ..., n - 3, as functions of y (these functions are continuous and piecewise linear), then substitute these $\tilde{q}_i(y)$ functions and minimize the resulting function with respect to y. This latter function is continuous, piecewise linear, and convex; hence the (or a) minimum exists.
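A hedged sketch of this LP, under the reconstruction above (A_blocks stands for the capture matrices $A_{1,T}, \ldots, A_{n-3,T}$, one per initial position of P; the normalization uses the text's remark that the LP optimum equals the inverse of the value):

```python
import numpy as np
from scipy.optimize import linprog

def solve_case2_lp(A_blocks):
    """min sum_i (q_1)_i  s.t.  sum_j A_j q_j >= 1 (componentwise),
    equal mass in every block q_j, and q_j >= 0."""
    m, k = A_blocks[0].shape              # m pure strategies of V, k of P
    nb = len(A_blocks)
    A_ub = -np.hstack(A_blocks)           # -(sum_j A_j q_j) <= -1
    b_ub = -np.ones(m)
    A_eq = np.zeros((nb - 1, nb * k))     # mass(q_1) - mass(q_j) = 0
    for j in range(1, nb):
        A_eq[j - 1, :k] = 1.0
        A_eq[j - 1, j * k:(j + 1) * k] = -1.0
    c = np.concatenate([np.ones(k), np.zeros((nb - 1) * k)])
    res = linprog(c=c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                  b_eq=np.zeros(nb - 1), method="highs")
    value = 1.0 / res.fun                 # inverse of the optimum, as in the text
    q = [res.x[j * k:(j + 1) * k] * value for j in range(nb)]
    return value, q
```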

Let us call $\bar{J}_T$ the value of the game formulated in (13). Obviously, it does not depend on the initial position of V, and the coefficient 1/(n - 3) in the weighted sum of (13) can be disregarded, since each position of P at t = 0 is assigned equal weight. A rationale similar to those used in Lemmas 3.2 and 3.3 shows that

$$\bar{J}_1 \le \bar{J}_2 \le \cdots \le \bar{J}_{[n/2]} = \bar{J}_{[n/2]+1} = \bar{J}_{[n/2]+2} = \cdots;$$

thus, only a finite number of problems of type (13) need to be considered. Before leaving this section, let us point out that, if $y^*, \tilde{q}_1^*, \ldots, \tilde{q}_{n-3}^*$ solves (13), then the payoff of V is $\bar{J}_T$, whereas the payoff of P is $y^{*T} A_{j,T}\, \tilde{q}_j^*$, where j is the true position of P, known to him. Essentially, in this subsection we dealt with a nonzero-sum game, because P and V have different information, but one of a very special type, since the information of V is included in that of P.


4. Solution of the Problem of Section 1

We are now ready to address the main problem of this paper, which was described in Sections 1 and 2. If both P and V flash at time t = 1, then, because of the assumed uniform distribution, the probability for V to capture P is 3/n, and so is the probability for P to capture V. Thus, the payoff is 3/n - 3/n = 0. The reason why the probability of P being captured is 3/n is that V has no information about P's initial and current position; thus, from V's point of view, P can be anywhere with equal probability. Let us consider now that V flashes at time t = 1 and P does not. Then, the payoff for V is

$$(3/n) - (1 - (3/n))\,\bar{J}_{T-1}$$

= (probability that V captures P)
- (probability that V does not capture P)
x (probability that P captures V during the remaining (T - 1)-period game).

Similarly, if V flashes at time t = 2 and P does not flash at t = 1 or 2, the payoff for V is $(3/n) - (1 - (3/n))\,\bar{J}_{T-2}$.

In general, let

$$c_K = (3/n) - (1 - (3/n))\,\bar{J}_{T-K}, \qquad K = 1, 2, 3, \ldots, T-1, \tag{14}$$

and consider the matrix

$$
M = \begin{bmatrix}
0 & c_1 & c_1 & c_1 & \cdots & c_1 \\
-c_1 & 0 & c_2 & c_2 & \cdots & c_2 \\
-c_1 & -c_2 & 0 & c_3 & \cdots & c_3 \\
\vdots & & & \ddots & & \vdots \\
-c_1 & -c_2 & -c_3 & \cdots & 0 & c_{T-1} \\
-c_1 & -c_2 & -c_3 & \cdots & -c_{T-1} & 0
\end{bmatrix}. \tag{15}
$$

Let

$$s = (s_1, s_2, \ldots, s_T)^T, \qquad r = (r_1, r_2, \ldots, r_T)^T$$

represent probability vectors, according to which P and V choose to flash at times t = 1, 2, ..., T. Then, V is interested in maximizing, whereas P in minimizing, $r^T M s$; i.e.,

$$\min_s \max_r\, (r^T M s). \tag{16}$$

The solution of the matrix game (16) provides the solution to our problem. The rationale for introducing (16) is the following.


Let $r_t$ be the probability that V flashes at time t. Then, his incurred payoff, given that P flashes after him, is

$$(3/n) - (1 - (3/n))\,\bar{J}_{T-t}.$$

This is incurred with probability equal to the product of the probabilities that V flashes at time t and that P does not flash at or before time t, i.e.,

$$r_t (1 - s_1 - s_2 - \cdots - s_t) = r_t (s_{t+1} + \cdots + s_T).$$

Using formula (14) and a symmetrical rationale with P flashing before V, we arrive at (15), (16).
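To make the construction concrete, here is a small sketch (ours, with illustrative made-up numbers; Jbar is assumed to be a table of the Case 2 values $\bar{J}_t$ of Section 3, with Jbar[0] unused) that builds M from (14)-(15) and searches for the pure saddle point promised by the lemma that follows:

```python
import numpy as np

def flash_timing_matrix(n: int, Jbar, T: int) -> np.ndarray:
    """Matrix M of (15), with c_K = 3/n - (1 - 3/n) * Jbar[T - K] per (14)."""
    c = {K: 3.0 / n - (1.0 - 3.0 / n) * Jbar[T - K] for K in range(1, T)}
    M = np.zeros((T, T))
    for i in range(1, T + 1):
        for j in range(1, T + 1):
            if i < j:
                M[i - 1, j - 1] = c[i]    # above the diagonal: +c_i
            elif i > j:
                M[i - 1, j - 1] = -c[j]   # below the diagonal: -c_j
    return M                              # skew-symmetric: M == -M.T

def pure_flash_time(M: np.ndarray):
    """A pure saddle point r* = s* = e_i needs row i >= 0 and column i <= 0."""
    for i in range(M.shape[0]):
        if (M[i, :] >= 0).all() and (M[:, i] <= 0).all():
            return i + 1                  # flash at time i + 1 (1-based)
    return None

# Illustrative values rising toward their limit, for n = 12 and T = 8:
Jbar = [None, 0.22, 0.30, 0.34, 0.36, 0.36, 0.36, 0.36]
print(pure_flash_time(flash_timing_matrix(12, Jbar, 8)))   # prints 6
```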

Lemma 4.1. It holds that:

(i) $c_1 \le c_2 \le \cdots \le c_{T-1}$;

(ii) if T ≥ [n/2], then

$$c_1 = c_2 = \cdots = c_{T-[n/2]} \le c_{T-[n/2]+1} \le c_{T-[n/2]+2} \le \cdots \le c_{T-1} =
\begin{cases} 1/n, & n \ge 7, \\ 1/4, & n = 6, \\ 2/5, & n = 5, \\ 5/8, & n = 4; \end{cases}$$

(iii) if T ≥ [n/2] and n ≥ 11, then $c_1 < 0$.

Proof. (i) This is an immediate consequence of (14) and the fact that $\bar{J}_1 \le \bar{J}_2 \le \cdots$.

(ii) A rationale similar to the one used in Lemma 3.2 yields $\bar{J}_{[n/2]} = \bar{J}_{[n/2]+1} = \cdots$. Thus,

$$c_1 = c_2 = \cdots = c_{T-[n/2]}.$$

Consider now Fig. 3, and assume that V is at position n - 1, whereas P is at one of 1, 2, ..., n - 3. We will calculate $\bar{J}_1$ of Section 3. Let n ≥ 7. We have only one step to go. Let V go to position n with probability a, to position n - 2 with probability b, and stay at position n - 1 with probability 1 - a - b. If P is at position 1, he goes to position n and flashes, thus capturing V with probability a + (1 - a - b) = 1 - b. If he is at position 2, he goes to position 1 and flashes, thus capturing V with probability a. If P is at position n - 4, he goes to n - 3, flashes, and captures V with probability b.


If P is at n - 3, he goes to n - 2, flashes, and captures V with probability b + (1 - a - b). Any other position of P in 3, 4, ..., n - 5 yields zero probability of capture. Thus, P captures V with probability

$$\frac{1}{n-3}(1-a-b+a) + \frac{1}{n-3}\,a + \frac{1}{n-3}\,b + \frac{1}{n-3}(1-a-b+b) = \frac{2}{n-3}.$$

Thus,

$$c_{T-1} = (3/n) - (1 - (3/n))[2/(n-3)] = 1/n, \qquad \text{if } n \ge 7.$$
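A quick numerical confirmation of this identity (an added check, not in the paper):

```python
# c_{T-1} = 3/n - (1 - 3/n) * 2/(n - 3) simplifies to 3/n - 2/n = 1/n.
for n in range(7, 15):
    c = 3 / n - (1 - 3 / n) * (2 / (n - 3))
    assert abs(c - 1 / n) < 1e-12, (n, c)
```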

If n < 7, say n = 6, then the motion of P is not necessarily as described above. Let us consider the cases n = 4, 5, 6 individually. First, let n = 4, and consider Fig. 4.

Let the motions of P and V be described probabilistically as indicated. Then,

$$\bar{J}_1 = \min_{\alpha_i} \max_{\mu_i}\, [\mu_1(\alpha_1 + \alpha_2) + \mu_2(\alpha_2 + \alpha_3)]
= \min_{\alpha} \max_{\mu}\, [\mu_1\;\; \mu_2]
\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix},$$

$$\mu_1 + \mu_2 = 1, \qquad \alpha_1 + \alpha_2 + \alpha_3 = 1, \qquad \mu_i \ge 0, \qquad \alpha_i \ge 0.$$

Fig. 4. Mixed strategies of P and V.


It is easy to see that

$$\mu_1^* = \mu_2^* = \alpha_1^* = \alpha_3^* = 1/2, \qquad \alpha_2^* = 0, \qquad \bar{J}_1 = 1/2.$$

Thus, for n = 4,

$$c_{T-1} = (3/4) - (1 - (3/4))(1/2) = 3/4 - (1/4)(1/2) = 5/8.$$

Similarly, one can examine the cases n = 5 and n = 6. For n = 5, $c_{T-1} = 2/5$; and, for n = 6, $c_{T-1} = 1/4$.

(iii) Consider Fig. 3. We consider that, if P is at any of the positions 1, 2, n - 4, n - 3, the pursuer P acts by flashing at the next instant of time, yielding a probability of capture 2/(n - 3), whereas for the other positions 3, 4, ..., n - 5, P waits for a sufficient time and then flashes randomly, so that, for these positions, the probability of capture is 3/n. Considering that there are n - 7 such positions, each one weighted by 1/(n - 3), we have that

$$\bar{J}_T \ge 2/(n-3) + (3/n)[(n-7)/(n-3)],$$

since such a policy is obviously suboptimal for P. It is easy to see now that

$$c_1 \ge 0 \iff \bar{J}_T \le 3/(n-3).$$

But, for n ≥ 11, it holds that

$$3/(n-3) < 2/(n-3) + (3/n)[(n-7)/(n-3)].$$

Thus, if T is sufficiently large and n ≥ 11, $c_1$ has to be negative. □

Use of Lemma 4.1 enables one to see that the matrix game (16) has the pure-strategy solution

$$r^* = s^* = (0, \ldots, 0, 1, 0, \ldots, 0),$$

with the position of the 1 determined by the first $c_i$ (starting from i = T - 1 and going backward in time) which becomes negative, or, equivalently, by the first $\bar{J}_i$, going forward in time, which becomes smaller than 3/(n - 3). For example, consider M to have the sign pattern

$$
M = \begin{bmatrix}
0 & - & - & - & - \\
+ & 0 & - & - & - \\
+ & + & 0 & + & + \\
+ & + & - & 0 & + \\
+ & + & - & - & 0
\end{bmatrix}, \qquad T = 5.
$$

Then,

$$r^* = s^* = (0, 0, 1, 0, 0).$$


Notice that, for a given n and different T's, the distance of the entry $r^* = 1$ from the final time remains fixed, so that, independently of the duration of the game, the players will flash at a fixed time distance from the final time. If the game is of infinite duration, then they never flash, a fact that is in agreement with the fact that an infinite matrix M of the form

$$
M = \begin{bmatrix}
0 & -1 & -1 & -1 & \cdots \\
+1 & 0 & -1 & -1 & \cdots \\
+1 & +1 & 0 & -1 & \cdots \\
\vdots & & & \ddots &
\end{bmatrix}
$$

does not have a saddle point (see Ref. 3, page 164). Of course, for particular cases where n ≤ 10, one does not have the guarantee of Lemma 4.1(iii) that $c_1$ will be negative, and one has to calculate the $c_i$'s in order to find whether $c_i$ changes sign. Clearly, if, for some n and T, $c_1 > 0$, then both players flash in the beginning.

5. Nonidentical Players

We now make two remarks with respect to games in which the players enter the game nonsymmetrically.

Envisage the situation where the players have different flashing shapes. This is, for instance, the case if P can flash three positions and V can flash five positions. Then, the matrix M in (15) becomes

$$
M = \begin{bmatrix}
0 & c_1 & c_1 & \cdots & c_1 \\
-d_1 & 0 & c_2 & \cdots & c_2 \\
-d_1 & -d_2 & 0 & \ddots & \vdots \\
\vdots & & \ddots & 0 & c_{T-1} \\
-d_1 & -d_2 & \cdots & -d_{T-1} & 0
\end{bmatrix},
$$

and it is no longer skew-symmetric. The elements $-d_i$ are defined similarly to the $c_i$ elements, the only difference being that a different flashing area is used. Define $i_c$ to be the lowest index of c such that $c_{i_c}$ is positive, and define $i_d$ to be the lowest index of d such that $d_{i_d}$ is positive (or, equivalently, such that $-d_{i_d}$ is negative). It follows easily that the matrix game corresponding to the current M does not have pure solutions if $i_c \ne i_d$. It is easy to show that the optimal mixed strategies are a mixture of the pure strategies corresponding to row or column numbers

$$\min(i_c, i_d), \quad \min(i_c, i_d) + 1, \quad \ldots, \quad \max(i_c, i_d).$$

These strategies can again be found by solving an LP problem.


Example 5.1. Consider n = 7; P's flashing area is three positions (k = 3), one to the left, one to the right, and his own position, while V's flashing area is only his own position (k = 1). By solving problem (13) for this configuration, it was found that

$$c_1 = c_2 = c_3 = 0.1428,$$

$$d_1 = -0.1428, \qquad d_2 = -0.114, \qquad d_3 = -0.0476.$$

Therefore, the M matrix becomes

$$
M = \begin{bmatrix}
0.0000 & 0.1428 & 0.1428 & 0.1428 \\
0.1428 & 0.0000 & 0.1428 & 0.1428 \\
0.1428 & 0.114 & 0.0000 & 0.1428 \\
0.1428 & 0.114 & 0.0476 & 0.0000
\end{bmatrix},
$$

and $i_c = 1$, $i_d = 4$. The value of the game related to this M matrix (the row chooser maximizes and the column chooser minimizes) turns out to be 0.0962. The optimal strategies are totally mixed and are given by

$$s = (0.326,\; 0.326,\; 0.261,\; 0.087)^T, \qquad
r = (0.326,\; 0.239,\; 0.109,\; 0.326)^T.$$
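These numbers can be cross-checked with the matrix-game LP sketch from Section 3 (our helper solve_matrix_game treats the row player as the minimizer, so negating M converts the present row-maximizing convention):

```python
import numpy as np

# M of Example 5.1, with the entries as printed (rounded in the paper).
M = np.array([[0.0000, 0.1428, 0.1428, 0.1428],
              [0.1428, 0.0000, 0.1428, 0.1428],
              [0.1428, 0.1140, 0.0000, 0.1428],
              [0.1428, 0.1140, 0.0476, 0.0000]])

v, r, s = solve_matrix_game(-M)   # row minimizer of -M = row maximizer of M
print(-v)                         # should come out near the reported 0.0962
print(r, s)                       # should be close to the r and s given above
```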

The next game to be discussed, in which the players enter the game nonsymmetrically, is again played along the circumference of a circle. During each time step, both P and V can go one position to the left, stay where they are, or go one position to the right. In this game, P has two flash options and V has none. Suppose that initially the players do not know each other's position (each assumes a uniform distribution for the other's position). If, at the first flash of P, the other player has not been caught, then all that P knows is that V is situated at one of the remaining n - 3 points (if k = 3), according to a uniform distribution. On the other hand, V then knows P's position. In order to decide on when to flash the searchlight for the second time, a minimax problem of the kind (13) must be solved again.

6. Conclusions

We have discussed a discrete-time game with dynamic information in a finite state space. We have considered in detail the state space consisting of n elements that are positioned around the circumference of a circle. Each state has thus two adjacent states. The two players move in the state space and can terminate the game by flashing the searchlight. A surprising result


of the analysis is that, provided that both players enter the game with the same capacities (i.e., the flashing areas of the players have identical shape), the time of flashing can be determined by means of a pure strategy. This time is determined by the size of the flashing area and the number of states. If the players have different flashing areas, then the best flashing time can be determined by means of a mixed strategy.

Extensions of the state space are possible, and it seems that the techniques developed in the paper extend directly to connected networks. An obvious adjustment must then be made with respect to the weights 1/(n - 3) in (13). These weights came from a uniform distribution. For a general network, one obtains these weights by solving a static game in which the players can position themselves arbitrarily on the network (hence, no restrictive dynamics). Also, different shapes of illumination do not seem to cause any difficulty; they can be attacked by the same methods.

The techniques developed in the paper can be used directly for solving the dynamic game in which one player has two flashes at his disposal and the other has none. Extensions to more than two flashes are nontrivial, since, after having flashed all but the last time, a game of the kind described by (13) still has to be solved (to determine the time of the last flash), but now the weights are functions of the previous flash times.

In order to solve the problem numerically, one must solve a series of linear programming (LP) problems that have been given explicitly in the text. The size of these LP problems increases drastically with the number of states n. We have not studied possible reductions in the size of these LP problems, due to redundancy and/or symmetries in the problem statement.

The continuous-time, continuous-state-space version of the problem treated here seems hard to tackle. One needs conceptual extensions of mixed and behavioral strategies as functions of a continuously evolving time. It seems, however, that the ratio k/n (flashing area/total length or surface of the state space) and the maximum speeds are the crucial parameters. The ratio k/n is also a crucial parameter in the current setup. If, for instance, k/n decreases (and k is the same for both players), then the time of flashing will be closer to the end of the game (if one deals with a finite horizon).

References

1. HO, Y. C., and OLSDER, G. J., Differential Games: Concepts and Applications, Mathematics of Conflict, Edited by M. Shubik, North-Holland, Amsterdam, The Netherlands, pp. 127-186, 1983.

2. KIMELDORF, G., Duels: An Overview, Mathematics of Conflict, Edited by M. Shubik, North-Holland, Amsterdam, The Netherlands, pp. 55-72, 1983.


3. BASAR, T., and OLSDER, G. J., Dynamic Noncooperative Game Theory, Academic Press, New York, New York, 1982.

4. PARTHASARATHY, T., and STERN, M., Markov Games: A Survey, Differential Games and Control Theory II, Edited by E. O. Roxin et al., Marcel Dekker, New York, New York, pp. 1-46, 1977.

5. BERNHARD, P., COLOMB, A. L., and PAPAVASSILOPOULOS, G. P., Rabbit and Hunter Game: Two Discrete Stochastic Formulations, Computers and Mathematics with Applications, Vol. 13, pp. 205-225, 1987.

6. KUMAR, P. R., Optimal Mixed Strategies in Dynamic Games, IEEE Transactions on Automatic Control, Vol. AC-25, pp. 743-749, 1980.

7. ISAACS, R., A Game of Aiming and Evasion: General Discussion and the Marksman's Strategies, Rand Corporation, Report No. 1385, 1954.