5258 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011

Rate Distortion Theory for Causal Video Coding: Characterization, Computation Algorithm, and Comparison

En-Hui Yang, Fellow, IEEE, Lin Zheng, Da-Ke He, Member, IEEE, and Zhen Zhang, Fellow, IEEE

Abstract: Causal video coding is considered from an information theoretic point of view, where video source frames are encoded in a frame by frame manner, the encoder for each frame can use all previous frames and all previous encoded frames while the corresponding decoder can use only all previous encoded frames, and each frame itself is modeled as a source. A novel computation approach is proposed to analytically characterize, numerically compute, and compare the minimum total rate of causal video coding required to achieve a given distortion (quality) level. Among many other things, the computation approach includes an iterative algorithm with global convergence for computing this minimum total rate. The global convergence of the algorithm further enables us to demonstrate a somewhat surprising result (dubbed the more and less coding theorem): under some conditions on source frames and distortion, the more frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent. With the help of the algorithm, it is also shown by example that the minimum total rate of causal video coding is in general much smaller than the total rate offered by the traditional greedy coding method. As a by-product, an extended Markov lemma is established for correlated ergodic sources.

Index Terms: Causal video coding, extended Markov lemma, iterative algorithm, multi-user information theory, predictive video coding, rate distortion characterization and computation, rate distortion theory, stationary ergodic sources.

I. INTRODUCTION

CONSIDER the causal video coding model shown in Fig. 1, where X_1, X_2, ..., X_N represent the video frames, each with a corresponding encoded frame and reconstructed frame; all frames are encoded in a frame by frame manner, and the encoder for each frame can use all previous frames and all previous encoded frames, while the corresponding decoder can use only all previous encoded frames. The model is causal because the encoder for a frame is not allowed to access future frames in the encoding order. In the special case where the encoder for each frame is further restricted to enlist help only from all previous encoded frames, causal video coding reduces to predictive video coding.

Fig. 1. Causal video coding model.

Manuscript received March 31, 2010; revised December 23, 2010; accepted March 04, 2011. Date of current version July 29, 2011. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN203035-06 and Strategic Grant STPGP397345, and by the Canada Research Chairs Program.

E. Yang and L. Zheng are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]; [email protected]).

D.-K. He is with Research in Motion/SlipStream, Waterloo, ON N2L 5Z5, Canada (e-mail: [email protected]).

Z. Zhang is with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90095-1594 USA (e-mail: [email protected]).

Communicated by E. Ordentlich, Associate Editor for Source Coding.

Digital Object Identifier 10.1109/TIT.2011.2159043

All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into the above causal video coding model (strictly speaking, into the predictive video coding model); the differences among these video coding standards lie in how the information available to the encoder of each frame is used to generate the encoded frame. The causal coding model is the same as the sequential coding model of correlated sources proposed in [15] when N = 2, and is also called the C-C model in [10], [11], and [12]. However, when N ≥ 3, which is a typical case in MPEG-series and H-series video coding, the causal coding model considered here is quite different from sequential coding.1 In a special case where all frames are identical, which rarely happens in practical video coding, the causal video coding model reduces to the successive refinement setting considered in [8]. Notwithstanding, when frames are not identical, causal video coding is drastically different from successive refinement even though the decoding structure looks similar in both cases. Partial results of this paper were presented without proof in [23] and [22].

1The name of sequential coding was used in [15] to refer to a special video coding paradigm where the encoder for each frame can only use the previous frame as a helper, and the corresponding decoder uses only the previous encoded frame and reconstructed frame as a helper.

0018-9448/$26.00 © 2011 IEEE

It is expected that a future video coding standard will continue to fall into the causal video coding model shown in Fig. 1. To

provide some design guidance for a future video coding standard, in this paper we aim at investigating, from an information theoretic point of view, how each frame in the causal model should be encoded so that collectively the total rate is minimized subject to a given distortion (quality) level.

We model each frame itself as a source taking values in a finite alphabet. Together, the N frames then form a vector source taking values in the product alphabet. The sources are said to be (first-order) Markov if each frame is the output of a memoryless channel in response to the previous frame as input; in this case, we say the frames form a Markov chain. The reconstruction of each frame is drawn from a finite reproduction alphabet, and the distortion between a source symbol and its reproduction is measured by a single-letter distortion measure. Without loss of generality, we shall assume that for each source symbol there is a reproduction symbol at zero distortion. For convenience, we write the distortion measures without subscripts whenever no ambiguity arises. For any n-dimensional vector we use the standard prefix notation for its leading components; a similar convention will apply to reconstruction sequences and other vectors.

Formally, we define an order-n causal video code by encoder and decoder pairs as follows.2

1) For the first frame, an encoder of order n is defined by a function mapping each source block of length n into the set of all binary sequences of finite length, satisfying the property that the range of the encoder is a prefix set; a decoder of order n is defined by a function mapping each binary sequence in that range into a reproduction block. The encoded and reconstructed sequences of the first frame are given respectively by the output of the encoder and the output of the decoder applied to it.

2) For each subsequent frame, an encoder of order n is defined by a function of the current source block, all previous source blocks, and all previous encoded frames, satisfying the property that the range of the encoder, given any binary sequences produced so far, is a prefix set; a decoder of order n is defined by a function of the current and all previous encoded frames. The encoded and reconstructed sequences of the frame are given respectively by the output of the encoder and the output of the decoder applied to the encoded frames.

2It is worthwhile to point out that, as far as causal video coding alone is concerned, there is no need to explicitly list previous encoded frames as inputs to the encoder for the current frame in both the causal video coding diagram shown in Fig. 1 and the formal definition of causal video code given here, and all results and their respective derivations presented in the paper remain the same. The reason for us to explicitly list them as inputs to the encoder for the current frame is two-fold: (1) it makes the subsequent information quantities more transparent and intuitive (connecting those information quantities to the diagram with the encoded frames linked to the respective encoder is easier than to the diagram without them so linked), and (2) more importantly, it gives us a simple, unified way to describe predictive video coding in the context of causal video coding and to contrast the two coding paradigms in our forthcoming work on the information theoretic performance comparison of predictive video coding and causal video coding.

For each frame, the distortion between a source block and its reconstruction is given by the sum of the single-letter distortions over the block; the corresponding average distortion per symbol is then this sum normalized by the block length, and the average rate in bits per symbol of the ith encoder is the normalized expected length of its binary output, where the length of a binary sequence is counted in bits. The performance of the order-n causal video code is then measured by the resulting rate distortion pairs.
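In standard per-letter notation (assumed here for concreteness; the displayed formulas do not survive in this copy), with the ith frame written as X_i = (X_i(1), ..., X_i(n)) and its reconstruction as \hat{X}_i, these quantities read

    D_i^{(n)} \;=\; \frac{1}{n} \sum_{t=1}^{n} \mathbf{E}\, d_i\bigl(X_i(t), \hat{X}_i(t)\bigr),
    \qquad
    r_i^{(n)} \;=\; \frac{1}{n}\, \mathbf{E}\, \lvert s_i \rvert,

where s_i denotes the binary sequence produced by the ith encoder and |s_i| its length in bits.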

Definition 1: Let (R_1, ..., R_N) be a rate vector and (D_1, ..., D_N) a distortion vector. The rate distortion pair vector (R_1, D_1, ..., R_N, D_N) is said to be achievable by causal video coding if, for any ε > 0, there exists an order-n causal video code for all sufficiently large n such that the average rate of the ith encoder is at most R_i + ε and the average distortion of the ith frame is at most D_i + ε (1.1) for i = 1, 2, ..., N.

Let R denote the set of all rate distortion pair vectors achievable by causal video coding. From the above definition, it follows that R is a closed set in the 2N-dimensional Euclidean space. As in the usual video compression applications, we are interested in the minimum total rate R_c^*(D_1, ..., D_N) required to achieve the distortion level (D_1, ..., D_N), which is defined as follows.
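In this notation (the display is our reconstruction of the lost formula, with R the achievable region defined above):

    R_c^*(D_1, \ldots, D_N) \;=\; \min\Bigl\{ \sum_{i=1}^{N} R_i \;:\; (R_1, D_1, \ldots, R_N, D_N) \in \mathcal{R} \Bigr\}.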

One of our purposes in this paper is to numerically compute, analytically characterize, and compare R_c^*(D_1, ..., D_N) so that deep insights can be gained regarding how each frame should be encoded in order to achieve the minimum total rate.

Our approach is computation oriented. Starting with a jointly stationary and totally ergodic vector source,3 we first show in Section II that


R_c^*(D_1, ..., D_N) is equal to the infimum of the nth order total rate distortion function over all n, where the nth order function itself is given by the minimum of an information quantity over a set of auxiliary random variables. Then we develop an iterative algorithm in Section III to calculate the nth order total rate distortion function, and further show that this algorithm converges to an optimal solution that achieves the minimum. The global convergence of the algorithm enables us to establish a single-letter characterization of R_c^*(D_1, ..., D_N) in Section IV in the case where the vector source is independent and identically distributed (IID),4 by comparing the first order total rate distortion function with its nth order counterparts through a novel application of the algorithm. With the help of the algorithm, we further demonstrate in Section V a somewhat surprising result dubbed the more and less coding theorem: under some conditions on source frames and distortion, the more frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent. The algorithm also gives an optimal solution for allocating bits to different frames. It is shown in Section VI that R_c^*(D_1, ..., D_N) is in general much smaller than the total rate offered by the traditional greedy coding method, by which each frame is encoded in a local optimum manner based on all information available to the encoder of the frame.

3A vector source is said to be jointly stationary and totally ergodic if, as a single process over the product alphabet, it is stationary and totally ergodic.

4A vector source is said to be IID if, as a single process over the product alphabet, it is IID. Note that the common joint distribution of each sample can be arbitrary even when the vector source is IID.

II. ACHIEVABLE REGION AND MINIMUM TOTAL RATE: TOTALLY ERGODIC CASE

Suppose now that the vector source is jointly stationary and totally ergodic across samples (pixels). Define the region consisting of all rate distortion pair vectors for which there exist auxiliary random variables5 such that

(2.1)

holds and the following requirements are satisfied:
(R1) for some deterministic function ;
(R2) for some deterministic function ;
(R3) the distortion constraint holds for each frame;
(R4) the required Markov chain conditions are met.

5Throughout the paper, the reconstruction variables represent random variables taking values over the n-fold product of the reproduction alphabet; on the other hand, the auxiliary random variables take values over arbitrary finite alphabets.

In (2.1) and throughout the rest of the paper, the notation I stands for mutual information or conditional mutual information (as the case may be) measured in bits, and the notation H stands for entropy or conditional entropy (as the case may be) measured in bits. Although there is no restriction on the size of the alphabet of each auxiliary random variable in (2.1), one can show, by using the standard cardinality bound argument based on the Caratheodory theorem (see, for example, Appendix A of [15]), that the alphabet size of each auxiliary random variable in (2.1) can be bounded. Take the union of these regions over all block lengths, and denote its convex hull closure accordingly. Then we have the following result.

Theorem 1: For jointly stationary and totally ergodic sources, the achievable region R coincides with this convex hull closure.

The positive part of Theorem 1 (i.e., that the convex hull closure is contained in R) will be proved in Appendix B by adopting a random coding argument similar to that for IID vector sources. Here we present the proof of the converse part (i.e., the reverse inclusion).

Proof of the converse part of Theorem 1: Pick any achievable rate distortion pair vector. It follows from Definition 1 that for any ε > 0, there exists an order-n causal video code for all sufficiently large n such that (1.1) holds. Let the encoded frames and reconstructed frames be those given by this code. It is easy to see that the first group of Markov conditions is satisfied. However, since the last reconstruction depends in general on additional variables beyond the conditioning ones, the random variables in question do not necessarily form a Markov chain in the indicated order. To overcome this problem, consider the conditional probability distribution of the last reconstruction given the relevant variables as a channel, and define a new random variable as the output of this channel in response to the appropriate input. Then it is easy to see that the new pair and the original one have the same distribution, and the required variables form a Markov chain. This, together with (1.1), implies the following distortion upper bounds:

(2.2)

for the first frames, and

(2.3)

Let us now verify rate lower bounds. In view of (1.1), we have

(2.4)

and for the intermediate frames,

(2.5)


where the equality is due to the fact that each encoded frame is a function of its inputs. For the last frame, we have

(2.6)

With the auxiliary random variables defined above, it now follows from (2.2) to (2.6) and the desired Markov conditions that the corresponding rate distortion pair vector, within ε, belongs to the region defined through (2.1). Letting ε → 0 yields the desired containment, which in turn implies the converse part of the theorem. This completes the proof of the converse part.

To determine R_c^*(D_1, ..., D_N) in terms of information quantities, we define for each block length n

(2.7)

where the minimum is taken over all auxiliary random vectors satisfying the following two requirements:
(R5) for any ;
(R6) the Markov chains hold.

We further define

(2.8)

Then we have the following result.

Theorem 2: For jointly stationary and totally ergodic sources, R_c^*(D_1, ..., D_N) coincides with the quantity defined in (2.8) for any distortion level.

To prove Theorem 2, we need the following lemma, which is also interesting in its own right.

Lemma 1: The function defined by (2.7) is convex, and hence continuous, over the open region of positive distortion levels.

Proof of Lemma 1: Fix a distortion level in the open region. In view of the definition given in (2.7), it is not hard to show that the sequence of nth order quantities in (2.7) is subadditive, that is, its value at block length n + m is at most the sum of its values at block lengths n and m, for any n and m. As such, the quantity in (2.8) can also be expressed as

(2.9)

Next we derive an equivalent expression for the nth order quantity. Define the corresponding quantity without Lagrange multipliers; that is,

(2.10)

where the infimum is taken over all auxiliary random variables satisfying the requirements (R1) to (R4). By comparing (2.10) with (2.7), it is easy to see that

(2.11)

On the other hand, pick any auxiliary random variables satisfying the requirements (R1) to (R4). Let the reconstructions be defined as in the requirements (R1) and (R2). Then, in view of the Markov conditions in the requirement (R4), we have

(2.12)

where the last inequality is due to the fact that each reconstruction is a function of the corresponding auxiliary random variables. To continue, we now verify the Markov conditions involving the reconstructions. It is not hard to see that the first Markov conditions in the requirement (R4) are equivalent to the following conditions:

(R7) for each frame index, the relevant variables are conditionally independent given the indicated conditioning variables.

From this, it follows that the analogous variables involving the reconstructions are conditionally independent given the corresponding conditioning variables. Applying the equivalence again, we see that the first Markov conditions involving the reconstructions in the requirement (R6) are satisfied. Therefore, we have


(2.13)

where the equality 1) follows from the Markov conditions involving the reconstructions. Note that the last Markov condition in the requirement (R6) may not be valid for the constructed variables. To overcome this problem, we use the same technique as in the proof of the converse part of Theorem 1 to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the required Markov condition is met.

Therefore, the resulting random variables satisfy the requirements (R5) and (R6). This, together with (2.13), (2.12), and (2.7), implies

(2.14)

Note that (2.14) is valid for any auxiliary random variables satisfying the requirements (R1) to (R4). It then follows from (2.14) and (2.10) that the reverse inequality to (2.11) holds,

which, together with (2.11), implies that the two quantities are equal,

and hence (2.10) is an equivalent expression for the quantity defined in (2.7).

In comparison with (2.7), the equivalent expression (2.10) makes it easier to apply the well-known time-sharing argument. By applying the time-sharing argument to (2.10), it is now not hard to see that the nth order quantity is a convex function of the distortion level for each n. The convexity of the function in (2.8) then follows from its equivalent expression (2.9) and the convexity of each nth order term. Since a convex function is continuous over an open region [14], this completes the proof of Lemma 1.
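The subadditivity step above rests on a standard fact (Fekete's lemma), restated here in generic notation of our own since the paper's displays are unavailable in this copy: if a sequence (a_n) satisfies a_{n+m} ≤ a_n + a_m for all n, m ≥ 1, then

    \lim_{n \to \infty} \frac{a_n}{n} \;=\; \inf_{n \ge 1} \frac{a_n}{n},

which is exactly what allows the limit in (2.8) to be written as the infimum over block lengths in (2.9).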

Proof of Theorem 2: In view of the positive part of Theorem 1, it is not hard to see that the quantity defined in (2.8) is an upper bound on R_c^*(D_1, ..., D_N) for any distortion level. Therefore, in what follows, it suffices to show the corresponding lower bound

(2.15)

for any distortion level. Now fix a distortion level, and pick any rate vector such that the resulting rate distortion pair vector is achievable. From the proof of the converse part of Theorem 1, it follows that for any ε > 0 and sufficiently large n, there exist auxiliary random variables satisfying the requirements (R1) to (R4), with each distortion level enlarged by ε, such that the total rate dominates the corresponding information quantity,

which, coupled with the equivalent expression (2.10) for the nth order quantity, further implies

(2.16)

In view of Lemma 1, dividing both sides of (2.16) by n and then letting ε → 0 yield the desired bound,

from which (2.15) follows. This completes the proof of Theorem 2.

Remark 1: Theorems 1 and 2 remain valid for general stationary ergodic sources. However, the technique adopted in the proof of the classic source coding theorem for a single ergodic source [9], [2] cannot be applied here. As such, a new proof technique has to be developed; this will be addressed in our forthcoming paper [25] so as not to sidetrack the computation approach taken here.

For general stationary ergodic sources, Theorem 2 is probably the best result one could hope for in terms of analytically characterizing R_c^*(D_1, ..., D_N). However, its impact on practical video coding will be limited if the optimization problem involved cannot be solved by an effective algorithm. To a large extent, this is also true even if R_c^*(D_1, ..., D_N) admits a single-letter characterization, and true for many other multi-user information theoretic problems. In the following section, we will develop an iterative algorithm to compute the quantity defined in (2.7), and establish its convergence to the global minimum.


III. AN ITERATIVE ALGORITHM

In this section, an iterative algorithm is proposed to calculate the quantity defined in (2.7); the algorithm serves three purposes in this paper: first, it allows us to do numerical calculations; second, its global convergence provides a completely different approach to establishing a single-letter characterization of R_c^*(D_1, ..., D_N) when the sources are IID; and third, it allows us to do comparisons and gain deep insights into R_c^*(D_1, ..., D_N).

Without loss of generality, we consider the case of N = 3 and denote the three sources by X_1, X_2, and X_3, which in turn will be written as X, Y, and Z, respectively, to simplify our notation for describing the iterative algorithm.

Let the joint distributions of the relevant random vectors and the marginal distribution of the source be denoted in the usual way. If there is no ambiguity, subscripts in distributions will be omitted; for example, we may write a distribution without its subscript. In order to find the random variables that achieve the minimum in (2.7), we try to find transition probability functions and probability functions that minimize

(3.1)

where each coefficient denotes the standard Lagrange multiplier, and all logarithms are taken to the same fixed base. For brevity, we shall abbreviate the tuples of distributions, and when there is no ambiguity, the superscript or subscript will be dropped. The iterative algorithm works as follows.

Step 1: Initialize the iteration by choosing an initial joint distribution function over the alphabets involved that is positive everywhere.

Step 2: Fix the current joint distribution. Find transition probability functions such that

(3.2)

where the minimum is taken over all transition probability functions. In view of the nested structure in (3.1), we solve the problem in (3.2) in three stages. First, let us find the innermost transition probability function. From (3.1),

(3.3)

where the normalization quantity is defined accordingly. In the above, the last inequality follows from the log-sum inequality, and becomes an equality if and only if

(3.4)

holds for all arguments.

We next find the second transition probability function. In view of (3.1) and (3.3), we have

(3.5)

where the corresponding normalization quantity is again defined accordingly. In the above, the last inequality again follows from the log-sum inequality, and becomes an equality if and only if

(3.6)

holds for all arguments.

Finally, let us find the third transition probability function. Continuing from (3.1) and (3.5), we have


(3.7)

where the corresponding normalization quantity is defined analogously.

An argument similar to that leading to (3.3) and (3.5) can be used to show that (3.7) becomes an equality if and only if

(3.8)

holds for all arguments.

Step 3: Fix the transition probability functions just obtained. Find a joint distribution function such that

(3.9)

where the minimum is taken over all joint distribution functions over the relevant alphabets. In view of (3.1), we see that

(3.10)

where the auxiliary output is the output of the channel defined by the transition probability functions in response to the source input, and its distribution is given by

(3.11)

for all arguments. The inequality (3.10) becomes an equality if and only if the joint distribution function is chosen to match this output distribution.

Step 4: Repeat Steps 2 and 3 for successive iterations until the decrease of the objective (3.1) between consecutive iterations is smaller than a prescribed threshold.
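The closed-form updates (3.4), (3.6), (3.8), and (3.11) are not reproduced here, since the displays do not survive in this copy, but the overall pattern (fix one set of distributions, minimize the Lagrangian in closed form over the other set, and alternate) is exactly that of the Blahut-Arimoto algorithm, whose convergence proof the proof of Theorem 3 below parallels ([5], [3]). The following sketch, ours and with all names and parameter choices assumed, shows the pattern in the classical single-source special case:

    import numpy as np

    def blahut_arimoto(p_x, d, lam, max_iter=1000, tol=1e-12):
        """Alternately minimize I(X;Y) + lam * E[d(X,Y)].

        p_x : (m,) source distribution
        d   : (m, k) distortion matrix d(x, y)
        lam : Lagrange multiplier selecting a point on the R(D) curve
        Returns (rate in bits, expected distortion) at convergence.
        """
        m, k = d.shape
        q_y = np.full(k, 1.0 / k)                # output marginal, uniform start
        for _ in range(max_iter):
            # Analogue of Step 2: optimal test channel for a fixed marginal
            w = q_y[None, :] * np.exp(-lam * d)
            q_y_given_x = w / w.sum(axis=1, keepdims=True)
            # Analogue of Step 3: optimal marginal for a fixed test channel
            q_new = p_x @ q_y_given_x
            if np.abs(q_new - q_y).max() < tol:  # Step 4: stopping rule
                q_y = q_new
                break
            q_y = q_new
        joint = p_x[:, None] * q_y_given_x
        distortion = float((joint * d).sum())
        mask = joint > 0
        ref = (p_x[:, None] * q_y[None, :])[mask]
        rate = float((joint[mask] * np.log2(joint[mask] / ref)).sum())
        return rate, distortion

    # Fair binary source, Hamming distortion: R(D) = 1 - h(D).
    print(blahut_arimoto(np.array([0.5, 0.5]),
                         np.array([[0.0, 1.0], [1.0, 0.0]]), lam=2.0))

For the three-source problem above, the same skeleton applies with Step 2 split into the three nested stages (3.3), (3.5), and (3.7).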

For any fixed joint distribution function, let the result of Step 2 define one mapping; similarly, for any fixed transition probability functions, let the result of Step 3 define another mapping. The above iterative algorithm can then be described succinctly as the alternate application of these two mappings. The following theorem shows that the resulting sequence of distributions converges to a quadruple of distributions that achieves

(3.12)

where the infimum is taken over all admissible transition probability and joint distribution functions.

Theorem 3: For any initial joint distribution function that is positive everywhere, there exists a quadruple achieving the infimum in (3.12) such that the iterates converge to it, and the objective values converge to the infimum, as the iteration proceeds.

Proof of Theorem 3: From the description of the iterative algorithm, it follows that

(3.13)

To show the desired convergence, let us first verify that the algorithm has the so-called five-point property (as defined in [7]), that is, for any admissible distributions and the corresponding iterates,

(3.14)

To this end, let us calculate both sides of (3.14). In view of Steps 2 and 3, we have

(3.15)

where the equality follows from the following derivation:

(3.16)


and

(3.17)

Combining (3.16) and (3.17), we immediately have the equality in (3.15). On the other hand,

(3.18)

Combining (3.15) with (3.18) yields the desired five-point property in (3.14).

The rest of the proof is similar to that adopted in [5] to show the convergence of the Blahut-Arimoto algorithm [3]. Suppose

(3.19)

for some admissible distributions. From (3.14), it then follows that for any iteration index,

(3.20)

which, together with the monotonicity in (3.13), implies

(3.21)

and hence

(3.22)

Note that (3.22) is valid for any admissible distributions satisfying (3.19). From this, we have

(3.23)

To prove the convergence of the sequence of iterates, pick a convergent subsequence and denote its limit accordingly. Then the corresponding distributions converge, and

(3.24)

In view of (3.23), the limit achieves the infimum in (3.12); thus (3.20) applies to the limit and the iterates. In particular, the sequence of objective values is nonincreasing. Since convergence of the objective values forces convergence of the iterates here, the whole sequence converges as the iteration proceeds. This completes the proof of Theorem 3.

Remark 2: The above iterative algorithm can be easily extended to the case of N > 3, and Theorem 3 remains valid. Suitably specialized, the algorithm also covers the case of N = 2.

Remark 3: The iterative algorithm can be further extended to work for coupled distortion measures (as defined in [15]), where the distortion of a frame depends not only on that frame and its reconstruction but also on the other frames involved. The global convergence as expressed in Theorem 3 is still guaranteed.


Remark 4: Although the quantity defined in (2.7), as a function of the distortion level, is convex as shown in the proof of Lemma 1, both the optimization problems (2.7) and (3.12) are actually non-convex optimization problems. It is therefore somewhat surprising to see the global convergence of our proposed iterative algorithm. As shown in the proof of Theorem 3, the key to the global convergence is the five-point property (3.14).

Remark 5: There are many other ways (including, for example, the greedy alternative algorithm [24]) to derive iterative procedures. However, it is not clear whether their global convergence can be guaranteed. Having algorithms with global convergence is important not only to numerical computation itself, but also to single-letter characterization of performance. One of the purposes of this paper is indeed to demonstrate for the first time that single-letter characterization of performance can also be established in a computational way via algorithms with global convergence, as shown in the next section.

We conclude this section by presenting an alternative expression for the quantity defined in (2.7). Once again, we illustrate this by considering the case of N = 3. In view of the definitions (2.7) and (3.12), it is not hard to show (for example, by using the technique demonstrated in the proof of Property 1 in [21]) that for any Lagrange multipliers,

(3.25)

In other words, the quantity in (3.12), as a function of the Lagrange multipliers, is the conjugate of the function defined in (2.7). Since the latter is convex and lower semi-continuous over the whole region of distortion levels, it follows from [14, Theorem 12.2, pp. 104] that for any distortion level,

(3.26)

In the next section, (3.26) will be used in the process of establishing a single-letter characterization for R_c^*(D_1, D_2, D_3) when the vector source is IID.
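In generic vector notation of our own (the paper's displays for (3.25) and (3.26) are unavailable here), with R(D) the convex rate function of (2.7), λ the vector of Lagrange multipliers, and T(λ) the value of (3.12), the pair is the standard Legendre-Fenchel relation:

    T(\boldsymbol{\lambda}) \;=\; \inf_{\mathbf{D} > \mathbf{0}} \bigl[ R(\mathbf{D}) + \boldsymbol{\lambda} \cdot \mathbf{D} \bigr],
    \qquad
    R(\mathbf{D}) \;=\; \sup_{\boldsymbol{\lambda} \ge \mathbf{0}} \bigl[ T(\boldsymbol{\lambda}) - \boldsymbol{\lambda} \cdot \mathbf{D} \bigr].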

IV. SINGLE-LETTER CHARACTERIZATION: IID CAUSAL CASE

Suppose now that the vector source is IID. In this section, we will use our iterative algorithm proposed in Section III and its global convergence to establish a single-letter characterization for R_c^*(D_1, ..., D_N).

Theorem 4: If the vector source is IID, then R_c^*(D_1, ..., D_N) is given by its first order expression in (2.7) for any distortion level.

Proof: We first show that for every block length n, the nth order quantity coincides with the first order one,

(4.1)

for any distortion level. Without loss of generality, we demonstrate (4.1) in the case of N = 3 by using our iterative algorithm in Section III. Denote the three sources by X, Y, and Z as before. Since the vector source is IID, its n-fold joint distribution is the n-fold product of its single-letter joint distribution. In view of (3.26), we have

(4.2)

for any distortion level, where the conjugate quantity is defined in (3.12). Here and throughout the rest of this proof, the subscript or superscript dropped for convenience of notation in Section III is brought back to distinguish between block lengths 1 and n. Therefore, it suffices to show that

(4.3)

for any Lagrange multipliers. To this end, we will run the iterative algorithm at both block lengths 1 and n to calculate the two conjugate quantities. Pick any initial positive distribution, and run the iterative algorithm at block length 1. We then get a sequence of iterates which, according to Theorem 3, satisfies

(4.4)

Now take the n-fold product of the chosen initial distribution. Clearly, it is also positive. Use it as an initial distribution and run the iterative algorithm at block length n. Then we get a sequence of iterates which, according to Theorem 3 again, satisfies

(4.5)

Since the block length n source distribution is the n-fold product of the single-letter source distribution and the block length n initial distribution is the n-fold product of the single-letter initial distribution, careful examination of (3.4), (3.6), (3.8), and (3.11) reveals that, at every iteration, each block length n iterate is the n-fold product of the corresponding block length 1 iterate. (To see this is the case, let us look at (3.4) for example, temporarily dropping the subscripts indicating random variables in all notation. When the distributions entering (3.4) are n-fold products, it can be verified that the normalization quantity in (3.4) factors into a product of single-letter normalization quantities. Since the numerator factors in the same way, it follows from (3.4) that


the updated transition probability at block length n is the n-fold product of the single-letter update. A similar argument can be applied to (3.6), (3.8), and (3.11).) Therefore, at every iteration the block length n objective is n times the block length 1 objective, which, coupled with (4.4) and (4.5), implies (4.3) and hence (4.1).
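As a toy numerical check of this product-preservation (tensorization) step, again in the classical single-source Blahut-Arimoto setting of the sketch in Section III (our simplification, not the paper's three-source recursion): if the source, the distortion measure, and the initial output marginal are all products, then every iterate remains a product.

    import numpy as np

    def ba_step(p_x, q_y, d, lam):
        """One Blahut-Arimoto update of the output marginal."""
        w = q_y[None, :] * np.exp(-lam * d)
        q_y_given_x = w / w.sum(axis=1, keepdims=True)
        return p_x @ q_y_given_x

    p = np.array([0.3, 0.7])                   # single-letter source
    d1 = np.array([[0.0, 1.0], [1.0, 0.0]])    # Hamming distortion
    lam = 2.0

    # Two-letter product source with additive distortion d(x, y) = d1 + d1
    p2 = np.kron(p, p)
    d2 = np.add.outer(d1, d1).transpose(0, 2, 1, 3).reshape(4, 4)

    q1 = np.full(2, 0.5)
    q2 = np.kron(q1, q1)                       # product initialization
    for _ in range(200):
        q1 = ba_step(p, q1, d1, lam)
        q2 = ba_step(p2, q2, d2, lam)

    # Every two-letter iterate equals the product of single-letter iterates.
    assert np.allclose(q2, np.kron(q1, q1), atol=1e-8)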

Combining (4.1) with (2.8) yields the equality of the quantity in (2.8) with the first order expression for any positive distortion level. This, together with Theorem 2, implies

(4.6)

for any positive distortion level. Since, by their definitions, both functions in (4.6) are right continuous in the sense that each equals the limit of its values at slightly enlarged distortion levels, it follows that (4.6) remains valid for boundary points where some distortion coordinate may be 0. This completes the proof of Theorem 4.

Theorem 4 can also be proved by using the classical auxiliary random variable converse and positive proof (hereafter referred to as the classic approach). Indeed, one can establish the following single-letter characterization for the achievable region R, the proof of which is given in Appendix A.

Theorem 5: If the vector source is IID, then R coincides with the convex hull closure of the single-letter region defined through (2.1).6

Remark 6: It is instructive to compare the computational approach to single-letter characterization (as illustrated in the proofs of Theorems 2, 3, and 4) with the classic approach. In the computational approach, the converse is first established for multiple letters (blocks); its proof is often straightforward, and the required Markov chain conditions are satisfied automatically, as shown in the proof of Theorem 2. The key is then to have an algorithm with global convergence for computing all block terms and later to show that all these block terms are the same. On the other hand, in the classic approach, the converse proof is quite involved; coming up with auxiliary random variables with the right Markov chain conditions is always challenging and sometimes seems impossible. Since the single-letter characterization has to be computed anyway, the computational approach is preferred whenever it is possible.

Remark 7: When N = 2, Theorems 5 and 4 reduce to Theorems 1 and 3 in [15], respectively. However, the proofs in [15] are incomplete due to the invalid claim of the Markov condition made in the proofs therein; as such, the formulas therein cannot be extended to the case of N ≥ 3. Theorems 5 and 4, in a slightly different but equivalent form, were also reported in [10], [11], and [12] by following the classic approach. The difference lies in the extra Markov chain condition for the reconstruction of the last frame shown as Condition (R4). For example, in the specific formulas shown in [10, Theorem 1] in the case of N = 2, the corresponding Markov chain condition is not required.

6Since the alphabet size of each auxiliary random variable in (2.1) can be bounded, the single-letter region is actually convex and closed. As such, taking its convex hull closure is redundant. We leave it in the statement of Theorem 5 just for the sake of consistency with the norm in the literature [4].

V. MORE AND LESS CODING THEOREM

To gain deep insights into causal video coding, in this section we use our iterative algorithm proposed in Section III to compare minimum total rates across different sets of source frames. To be specific, whenever we need to bring out the dependence of the rate quantities on the sources, we will list the sources explicitly in the notation. In particular, we will compare the minimum total rate for coding all three sources with the minimum total rate for coding only the last two.

Without loss of generality again, we will consider the case of N = 3. All results and discussions in this section can be easily extended to the case of N > 3. We first have the following result.

Theorem 6: Suppose that the vector source is jointly stationary and totally ergodic, and X_1, X_2, and X_3 form a Markov chain in the indicated order. Then for any distortion level,

(5.1)

Proof: We distinguish between two cases: (1) all distortion coordinates are positive, and (2) some coordinate is zero. In Case (1), it follows from Theorem 2 and (2.8) that it suffices to show

(5.2)

for any block length and distortion level. To this end, pick any auxiliary random variables satisfying the requirements (R5) and (R6). It is not hard to verify that

(5.3)

where the equality 1) follows from the fact that the requirement (R6), plus the Markov condition on the sources, implies the corresponding Markov condition for the auxiliary variables. In (5.3), the last required Markov condition may not be valid. However, to


overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1 and also in the proof of Lemma 1 to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the required Markov condition is met.

Therefore, the resulting random variables satisfy the requirements (R5) and (R6) with respect to the sources in question. This, together with (5.3) and (2.7), implies

(5.4)

Since (5.4) is valid for any auxiliary random variables satisfying the requirements (R5) and (R6), (5.2) then follows from the definition (2.7). This completes the proof of (5.1) in Case (1).

To prove (5.1) in Case (2), note that both rate quantities in (5.1) are right continuous in the sense that, for any distortion level, the two equations shown at the bottom of the page hold. The validity of (5.1) in Case (2) then follows from its validity in Case (1). This completes the proof of Theorem 6.

Theorem 6 is what one would expect and is consistent with our intuition. Let us now look at the case where X_1, X_2, and X_3 do not form a Markov chain and the vector source is IID. Define for any distortion level

(5.5)

where the classical rate distortion function of a source is used for the individual frames. Assume that the distortion coordinates are positive. In view of Theorem 4 and the proof of Lemma 1, both total rate functions appearing in (5.5) are convex as functions of the distortion coordinates over the region of positive levels. As such, they are subdifferentiable at any point with positive coordinates. (See [14, Chapter 23] for discussions on the subdifferential and subgradients of a convex function.) From Section III, they can also be computed via our iterative algorithm through their respective conjugates.

    Fig. 2. One special case of two-layer causal coding.

Since the vector source is IID, in view of Theorem 4, we will drop the subscript or superscript for all notation in Section III, with the understanding that the block length is 1, throughout the rest of this section. Once again, to bring out the dependence of the quantities in Section III on the sources, we will list the sources explicitly in the notation; in particular, the pair X_1 and X_2 may be regarded as one super source (see Fig. 2). This convention will apply to other notation in Section III as well. In particular,

(5.6)

for any Lagrange multipliers.

Condition A: A point with positive distortion coordinates is said to satisfy Condition A if the two-source total rate function, as a function of the distortion coordinates, has a negative subgradient at the point such that there is a distribution satisfying the following requirements:

(R8) .

(R9) Define (as in Step 2 of the iterative algorithm)

(5.7)

(5.8)

where

(5.9)

(5.10)


Denote the two conditional distributions defined through (5.7) and (5.8) accordingly. Then either

or the optimal conditional distribution depends on its conditioning variable, i.e., there exist values of positive probability at which the two conditional distributions differ.

We are now ready to state a somewhat surprising result dubbed the more and less coding theorem.

Theorem 7 (More and Less Coding Theorem): Suppose that the vector source is IID and X_1, X_2, and X_3 do not form a Markov chain. Then for any point satisfying Condition A, there is a critical value such that for any distortion level below it,

(5.11)

and for any distortion level above it,

(5.12)

Remark 8: In Theorem 7, if the critical value is positive, then the two total rates coincide at the critical value itself.

Proof of Theorem 7: Since the difference of the two total rates, as a function of the varied distortion level, is continuous and non-increasing over the positive axis, it suffices to show that

(5.13)

for any point satisfying Condition A. To this end, we consider a new two-layer causal coding model shown in Fig. 2, where X_1 and X_2 together are regarded as one super source. Let the minimum total rate function of this model be denoted accordingly. Since the defining property of Condition A holds at the point in question, a random variable independent of the remaining variables can be constructed with the required distributional property. Therefore, it is easy to see that

(5.14)

for any distortion levels. On the other hand, in view of the definition of causal video codes, it is not hard to see that any causal code for encoding X_1, X_2, and X_3 with respective distortions can also be used for encoding the super source and X_3 in Fig. 2 with the corresponding distortions without changing the total rate. Thus the two-layer minimum total rate is bounded by the three-source one

for any distortion levels. This, coupled with (5.14), implies

(5.15)

for any distortion levels.

To continue, we are now led to show

(5.16)

for any point satisfying Condition A. First note that, from the definition of causal video codes,

(5.17)

for any distortion levels. Fix now any point satisfying Condition A. We prove (5.16) by contradiction. Suppose that

(5.18)

at this point. Consider the negative subgradient in Condition A. From (5.15), it is also a negative subgradient of the two-layer minimum total rate function at the corresponding point. This implies that for any distortion levels,

which, coupled with (5.18) and (5.17), in turn implies that the equation shown at the bottom of the page holds for any distortion levels. In other words, under the assumption (5.18), the same vector is also a negative subgradient of the three-source minimum total rate function at the point in question. In view of (3.25), (3.26), and (5.6), it then follows that

(5.19)

(5.20)

In view of the requirement (R8) in Condition A, we have

(5.21)


From Step 2 of the iterative algorithm, it follows that

(5.22)

where the inequality in (5.22) is strict when the optimal conditional distribution depends on its conditioning variable. Therefore, according to the requirement (R9) in Condition A, no matter which choice in the requirement (R9) is valid, we always have a strict inequality,

which, together with (5.19) to (5.21), yields a contradiction.

This contradicts the assumption (5.18), hence completing the proof of (5.16) and (5.13).

Define the critical value as the supremum of the distortion levels at which the strict inequality (5.13) holds. Then from (5.13), it is easy to see that this is the desired critical value. This completes the proof of Theorem 7.
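Numerically, the critical value of Theorem 7 is simply the crossing point of two continuous, monotone rate curves, so it can be located by bisection once the two total rates can be evaluated (for example, by the iterative algorithm of Section III). In the sketch below, which is ours, the callables r3 and r2 are hypothetical stand-ins for the three-source and two-source minimum total rates as functions of the varied distortion level:

    def critical_value(r3, r2, lo, hi, tol=1e-9):
        """Bisection for the crossing point of two rate curves.

        Assumes r3(d) - r2(d) is continuous, changes sign exactly once
        on [lo, hi], and is negative at lo (three sources cheaper there).
        """
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if r3(mid) - r2(mid) < 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)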

Remark 9: Theorem 7, in particular (5.11), is really counterintuitive. It says that whenever the conditions specified in Theorem 7 are met, the more source frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. To help the reader better understand this phenomenon, let us examine where the gain of the three-source total rate over the two-source total rate comes from whenever the conditions specified in Theorem 7 are met. The availability of X_1 to the encoder of X_2 does not really help the encoder of X_2 and its corresponding decoder achieve a better rate distortion tradeoff. Likewise, the availability of X_1 and X_2 to the encoder of X_3 does not really help the encoder of X_3 and its corresponding decoder achieve a better rate distortion tradeoff either. What really matters is that the availability of X_1 to the encoder of X_2 will help the encoder of X_2 choose better side information for the encoder and decoder of X_3. If the rate reduction of the encoder of X_3 arising from this better side information is more than the overhead associated with the rate of the encoder of X_1 and the selection of this better side information, then the total rate is smaller. (Here the overhead is meant to be the difference between the sum of the first two rates in the three-source case and the first rate in the two-source case. Depending on how helpful the side information is, the rate of the encoder of X_2 in the three-source case can be more or less than the corresponding rate in the two-source case.) This is further confirmed in Examples 1 and 2 at the end of this section.

Condition A is generally met at distortion points for which positive bit rates are needed at both the decoder for X_2 and the decoder for X_3 in order for them to produce the respective reproductions with the desired distortions. Such distortion points will be called points with positive rates. By using the technique demonstrated in the proof of [21, Property 1], it can be shown that the two-source total rate function has a negative subgradient at any point with positive rates. In addition, the optimal distribution in Condition A generally depends on its conditioning variable (except for some corner cases) when X_1, X_2, and X_3 do not form a Markov chain. We illustrate this in the following theorem in the binary case.

Theorem 8: Assume that all alphabets are binary and the Hamming distortion measure is used. Let the vector source be IID, and suppose that X_1, X_2, and X_3 do not form a Markov chain. Then for any distortion pair with both coordinates strictly between 0 and 1/2, if a pair of distributions achieves the two-source minimum, i.e.,

(5.23)

then the optimal distribution depends on its conditioning variable, i.e., there exists a value at which the two conditional distributions are different.

Proof of Theorem 8: Fix a distortion pair with both coordinates strictly between 0 and 1/2. We first derive some bounds. It is not hard to verify that

(5.24)

where the two threshold values are the unique points at which the derivatives of the respective classical rate distortion functions are equal to the corresponding Lagrange multipliers. In the above, the inequality 1) is due to the fact that

(5.25)

for any relevant distributions. Under the condition that X_1, X_2, and X_3 do not form a Markov chain, the inequality (5.25) is strict at the point in question. Therefore,

(5.26)

In view of (5.23), it follows from the iterative algorithm that

(5.27)

(5.28)

where the subscripts indicate that the operations defined in Section III are applied to the two-source problem. Let an auxiliary random variable be the output of


the channel in response to the corresponding input. Then the joint distribution of the resulting triple is determined, and (5.23) implies

(5.29)

Putting (5.29) and (5.26) together, we can conclude that the relevant probabilities are positive for every argument. Otherwise, from (5.29) we would have

(5.30)

which contradicts (5.26).

We now prove Theorem 8 by contradiction. Suppose that the optimal distribution does not depend on its conditioning variable. Then for any arguments,

(5.31)

which, together with (5.27), (5.7) to (5.10), and the positivity just established, implies

(5.32)

Simplifying (5.32) yields

(5.33)

where the shorthand notation is defined accordingly.

To continue, we now consider specific values of the arguments. Let us first look at one choice of the two binary arguments. It follows from (5.33) that

(5.34)

which implies

(5.35)

where the indicated shorthand quantities are used. Further simplifying (5.35) yields

(5.36)

Since the distortion coordinates lie strictly between 0 and 1/2, it can be verified that equality in (5.36) holds if and only if the correlation quantity appearing there vanishes.

Next we show that this quantity does not vanish. To this end, first note that its vanishing is equivalent to saying that the relevant joint distribution is a product distribution, i.e.,

(5.37)

By plugging (5.37) into (5.27), it follows from Step 2 of the iterative algorithm that the first factor does not depend on one of its arguments and the second factor does not depend on the other, i.e.,

(5.38)

where the normalization factors are such that the respective terms are indeed distributions. It is easy to see that (5.37) and (5.38) imply

(5.39)
(5.40)
(5.41)

Combining (5.39) to (5.41) with (5.29) yields a relation

which contradicts (5.26). Therefore, the quantity in question does not vanish.

Going back to (5.36): since that quantity is nonzero, (5.36) is equivalent to

(5.42)


Fig. 3. Comparison of the three-source and two-source minimum total rates versus the varied distortion level, with the remaining distortion levels fixed.

Fig. 4. Comparison of the three-source and two-source minimum total rates versus the varied distortion level, with the remaining distortion levels fixed, in Example 1.

Repeating the above argument for the other choice of the two binary arguments, we then have accordingly

(5.43)

Putting (5.42) and (5.43) together, we have shown that (5.31) implies that X_1, X_2, and X_3 form a Markov chain, which contradicts our assumption. This completes the proof of Theorem 8.

Remark 10: From Theorem 8, it follows that for any sources X_1, X_2, and X_3 satisfying the conditions of Theorem 8, Condition A is met at any point at which the two-source total rate function has a negative subgradient.

We conclude this section with examples illustrating Theorem 7.

Example 1: Suppose that all alphabets are binary and that the Hamming distortion measure is used. Let the joint distribution of (X_1, X_2, X_3) be fixed numerically. It is easy to see that X_1, X_2, and X_3 do not form a Markov chain. We consider the following three cases:

Case 1: the fixed distortion levels take one pair of values;
Case 2: the fixed distortion levels take a second pair of values; and
Case 3: the fixed distortion levels take a third pair of values.

For Case 1, Fig. 3 shows the rate-distortion curves of the three-source and two-source minimum total rates versus the varied distortion level. Over the interval shown in Fig. 3, it is clear that the three-source total rate is always strictly less than the two-source total rate.

For Case 2, Fig. 4 shows the two total rates versus the varied distortion level with the remaining distortion levels fixed. It is observed that the critical point at which the three-source total rate meets the two-source total rate is the intersection of the two curves. Below this critical point, the three-source total rate is indeed strictly less than the two-source total rate. Table I shows the rate allocation across different encoders in both cases for several sample values of the varied distortion level, where each entry represents the rate allocated to the encoder of the corresponding frame, and the two total rates are abbreviated to save space. It is clear from Table I that the allocated rates confirm the explanation given in Remark 9.


Fig. 5. Comparison of the three-source and two-source minimum total rates versus the varied distortion level, with the remaining distortion levels fixed, in Example 1.

TABLE I
RATE ALLOCATION OF THE THREE-SOURCE AND TWO-SOURCE CODES VERSUS THE VARIED DISTORTION LEVEL, WITH THE REMAINING DISTORTION LEVELS FIXED, IN EXAMPLE 1

TABLE II
RATE ALLOCATION OF THE THREE-SOURCE AND TWO-SOURCE CODES VERSUS THE VARIED DISTORTION LEVEL, WITH THE REMAINING DISTORTION LEVELS FIXED, IN EXAMPLE 1

When we assign different values to the fixed distortion levels, we observe the same phenomenon, as shown again in Fig. 5 and Table II for Case 3.

Let us now look at another example with a different joint distribution.

Example 2: Suppose that all alphabets are binary and that the Hamming distortion measure is used. Let the joint distribution of (X_1, X_2, X_3) be fixed numerically. Once again, X_1, X_2, and X_3 do not form a Markov chain. Fix the remaining distortion levels. Fig. 6 shows the two rate distortion curves of the three-source and two-source minimum total rates versus the varied distortion level, and Table III lists their respective rate allocations for several sample values. The same phenomenon is revealed as in Example 1.

For all cases shown in Examples 1 and 2, in comparison with coding only the last two sources, when we include X_1 in the encoding and transmission, we not only get the reconstruction of X_1 (with the corresponding distortion) free at the receiver end, but are also able to reduce the total number of bits to be transmitted. In other words, we can achieve a double gain.

VI. COMPARISON WITH GREEDY CODING

All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into predictive video coding, where at the encoder for each frame, only previous encoded frames are used as a helper. By using a technique called soft decision quantization [19], [17], [18], it has been demonstrated in a series of papers [19], [20], [16] that the greedy coding method7 offers significant gains (ranging from 10% to 30% rate reduction at the same quality) over the respective reference codecs8 of these standards. As such, it is instructive to compare the performance of causal coding, characterized by R_c^*(D_1, ..., D_N), with the performance of greedy coding, characterized by the total rate offered by the greedy coding method. In this section, we present specific examples to numerically compare the two total rates. Analytic comparison between causal coding and predictive coding will be treated separately in our forthcoming paper due to its complexity.

Example 3: Suppose that all alphabets are binary and the Hamming distortion measure is used.

7The greedy coding method is a special form of predictive video coding; based on all previous encoded frames, it encodes each current frame in a local optimum manner so as to achieve the best rate distortion tradeoff for the current frame only.

8Both the greedy coding method and reference codecs are special forms of predictive video coding. At this point, the best rate distortion performance of predictive video coding is still unknown in general.


Fig. 6. Comparison of the three-source and two-source minimum total rates versus the varied distortion level, with the remaining distortion levels fixed, in Example 2.

Fig. 7. Comparison of the causal coding and greedy coding total rates versus the varied distortion level, with the remaining distortion levels fixed, in Example 3.

TABLE III
RATE ALLOCATION OF THE THREE-SOURCE AND TWO-SOURCE CODES VERSUS THE VARIED DISTORTION LEVEL, WITH THE REMAINING DISTORTION LEVELS FIXED, IN EXAMPLE 2

In this example, we consider a Markov chain X_1 → X_2 → X_3. The transition probability from X_1 to X_2 is given by one numerical matrix, and the other transition probability, from X_2 to X_3, by another. Fig. 7 shows the rate-distortion curves of the causal coding and greedy coding total rates versus the varied distortion level when X_1 is uniformly distributed and the remaining distortion levels are fixed. As shown in Fig. 7, at the marked distortion level the causal coding total rate is more than 31 percent less than the greedy coding total rate.
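The numerical transition matrices of this example do not survive in this copy. The following sketch, with stand-in values of our own, shows how such a Markov triple is assembled into the joint distribution that serves as input both to the iterative algorithm and to a greedy frame-by-frame computation:

    import numpy as np

    # Stand-in transition matrices; the actual values of Example 3
    # are not reproduced here.
    P12 = np.array([[0.9, 0.1],
                    [0.1, 0.9]])       # assumed P(X2 = j | X1 = i)
    P23 = np.array([[0.8, 0.2],
                    [0.2, 0.8]])       # assumed P(X3 = k | X2 = j)
    p1 = np.array([0.5, 0.5])          # X1 uniform, as stated in the text

    # Joint pmf of the Markov triple X1 -> X2 -> X3:
    # p(i, j, k) = p1(i) * P12(i, j) * P23(j, k)
    joint = p1[:, None, None] * P12[:, :, None] * P23[None, :, :]
    assert abs(joint.sum() - 1.0) < 1e-12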

Let us now look at another example in which X_1, X_2, and X_3 do not form a Markov chain.

Example 4: Suppose that all alphabets are binary and the Hamming distortion measure is used. In this example, X_1, X_2, and X_3 do not form a Markov chain.


Fig. 8. Comparison of the causal coding and greedy coding total rates versus the varied distortion level, with the remaining distortion levels fixed, in Example 4.

However, a suitable reordering of the three sources does form a Markov chain in the indicated order. One transition probability is given by a numerical matrix, and the other transition probability by another. Fig. 8 shows the rate-distortion curves of the causal coding and greedy coding total rates versus the varied distortion level when X_1 is uniformly distributed and the remaining distortion levels are fixed. As shown in Fig. 8, at the marked distortion level the causal coding total rate is 34.8 percent less than the greedy coding total rate.

The above two examples are of course toy examples. However, if the performance improvement is indicative of the performance of causal video coding for real video data, it is definitely worthwhile to make the causal video coding idea materialize in video codecs.

VII. CONCLUSION

In this paper, we have investigated the causal coding of source frames from an information theoretic point of view. An iterative algorithm has been proposed to numerically compute the minimum total rate achievable asymptotically by causal video coding for jointly stationary and totally ergodic sources at given distortion levels, and to analytically characterize this minimum total rate for IID sources. The algorithm has been shown to converge globally. With the help of the algorithm, we have further established a somewhat surprising more and less coding theorem: under some conditions on source frames and distortion, the more frames need to be coded and transmitted, the less amount of data after encoding has to be sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. Numerical comparisons between causal video coding and greedy coding have shown that causal video coding offers significant performance gains over greedy coding. Along the way, we have advocated that, whenever possible, the computational approach as illustrated in this paper is a preferred approach to multi-user problems in information theory. In addition, we have also established an extended Markov lemma for correlated ergodic sources, which will be useful to other multi-user problems in information theory as well.

If the information theoretic analysis as demonstrated in this paper is indicative of the real performance of causal video coding for real video data, then the more and less coding theorem, plus the significant performance gain of causal video coding over greedy coding, really points out a bright future for causal video coding. To make the idea of causal video coding materialize in real video codecs, future research efforts should be directed towards designing effective causal video coding algorithms, in addition to addressing many information theoretic problems such as universal causal video coding.

APPENDIX A

In this Appendix, we prove Theorem 5. As usual, we divide the proof of Theorem 5 into its converse part and its positive part.

Proof of the converse part: Pick any achievable rate distortion pair vector. For any ε > 0, there exists an order-n causal video code for all sufficiently large n such that (1.1) holds. Let the encoded frames and reconstructed frames be those given by this code. It follows from the definition of causal video codes that the Markov


conditions are satisfied, and

(A.1)

for each frame.

Define auxiliary random variables, one per frame and per time index, consisting of the relevant past symbols and encoded information. Since the vector source is IID, it is not hard to verify that the required Markov chain is valid for every frame and time index. In view of (1.1), and the assumption that the vector source is IID, we have

(A.2)

and for the intermediate frames,

(A.3)

where the equality is due to the Markov chain just noted.

For the last frame, we have

(A.4)

where the equality is again due to the Markov chain.

To continue, we introduce a timesharing random variable that is uniformly distributed over the time indices and independent of the sources, and hence of all random variables appearing in (A.1) to (A.4). Define the timeshared auxiliary random variables accordingly. Then it is not hard to verify that the required Markov chain is valid for each frame, and (A.2), (A.3), (A.4), and (A.1) can be rewritten, respectively, as

(A.5)
(A.6)
(A.7)
(A.8)

Note that the timeshared reconstruction has the required distribution, and each intermediate reconstruction is a function of the corresponding auxiliary random variables. Therefore, in comparison with the requirements (R1) to (R4) in the definition (2.1), the only thing missing is that the last Markov chain condition may not be valid. To overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1 and also in the proof of Lemma 1 to construct a new random vector such that the following hold: the new vector and the original one have the same distribution, and the missing Markov condition is met.

This, together with (A.5) to (A.8) and the definition (2.1), implies that

(A.9)

Letting ε → 0 yields the limiting containment,

and hence the converse inclusion holds. This completes the proof of the converse part of Theorem 5.

The positive part of Theorem 5, namely that every point of the single-letter region is achievable, can be proved by using the standard random coding argument in multi-user information theory [4], [1]. For the sake of completeness, we present a sketch of the proof below.


Proof sketch of the positive part: For convenience, we shall use bold letters to denote vectors throughout the rest of this section; for example, . Since is convex and is closed, it suffices to show that . Pick any rate distortion pair vector. We shall show that it is achievable. Let , and be the auxiliary random variables in (2.1) (for the definition of ) satisfying the requirements (R1) to (R4) with functions . Denote the alphabets of , by , respectively. For any , define

Let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of . Similarly, for any , let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of , and let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of . Similar notation will be used for other sets of strongly typical sequences with respect to other joint distributions. (For the definition of strong typicality, please refer to, for example, [4, p. 326].) In what follows, the values of in different strongly typical sets should be understood as multiplied by different constants for different sets. We are now ready to describe the random codebooks and how the encoders and decoders work; a toy typicality test is sketched first.
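To make the notion concrete, here is a minimal Python sketch of an ε-strong joint typicality test for a pair of sequences. The function name, the dictionary representation of the joint distribution, and the plain per-pair tolerance are our own illustrative choices; textbook conventions scale ε by alphabet sizes.

    from collections import Counter

    def strongly_jointly_typical(x, y, p_xy, eps):
        """Toy eps-strong joint typicality test for sequences x and y.

        p_xy maps symbol pairs (a, b) to probabilities. Every pair's
        empirical frequency must be within eps of its probability, and
        pairs of zero probability must not occur at all.
        """
        n = len(x)
        counts = Counter(zip(x, y))
        for pair, p in p_xy.items():
            if abs(counts.get(pair, 0) / n - p) > eps:
                return False
        return all(p_xy.get(pair, 0) > 0 for pair in counts)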

Generation of codebooks:

1) Generate independently codewords (the set of which is denoted by ), where each codeword is drawn according to the -fold product distribution of .

2) For , for every combination , where for , generate independently codewords (the set of which is denoted by ), where each is drawn according to the -fold product conditional distribution of conditionally given .

3) For every combination , where for , generate independently codewords (the set of which is denoted by ), where each is drawn according to the -fold product conditional distribution of conditionally given .
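As a concrete illustration of step 1 (a minimal sketch; the function name, the toy alphabet, and the use of NumPy are our own assumptions), each codebook is just a matrix of i.i.d. draws from the relevant product distribution; steps 2 and 3 differ only in that each symbol is drawn from a conditional distribution given the previously generated codewords.

    import numpy as np

    def generate_codebook(n, rate, p, rng):
        """Draw 2^(n * rate) length-n codewords i.i.d. from p.

        p is a probability vector over the alphabet {0, ..., len(p)-1};
        each codeword is sampled from the n-fold product of p.
        """
        num_codewords = int(2 ** (n * rate))
        return rng.choice(len(p), size=(num_codewords, n), p=p)

    rng = np.random.default_rng(0)
    codebook = generate_codebook(n=12, rate=0.5, p=np.array([0.5, 0.3, 0.2]), rng=rng)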

Encoding:

1) Given a sequence , encode into the index, say , of the first codeword in such that if such a codeword exists. Otherwise, set . Denote the resulting codeword by .

2) For , with the knowledge of all historical codewords , denoted by , the encoder for finds the index, say , of the first codeword in such that if such a codeword exists, and sets otherwise. Denote the resulting codeword by .

3) With the knowledge of all historical codewords , denoted by , the encoder for finds the index, say , of the first codeword in such that if such a codeword exists, and sets otherwise. Denote the resulting codeword by .
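All three encoding steps share the same search pattern: scan the codebook for the first codeword jointly typical with the source sequence and the available history, and fall back to a fixed index on failure. A minimal sketch, where the names and the is_typical callback are illustrative placeholders for the typicality tests above:

    def encode(source_seq, codebook, history, is_typical):
        """Return the index and codeword of the first codeword jointly
        typical with the source sequence and the history; fall back to
        index 0 on failure (an event whose probability vanishes).
        """
        for index, codeword in enumerate(codebook):
            if is_typical(source_seq, codeword, history):
                return index, codeword
        return 0, codebook[0]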

Decoding:

1) The decoder for first reproduces the codeword from , and then calculates by applying the function to each component of .

2) Upon receiving , the decoder for reproduces the codeword from , and then calculates by applying the function to each component of .

3) Upon receiving , the decoder for reproduces the codeword from , and then outputs .
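The decoders are simple table lookups followed by a per-symbol map. A minimal sketch, where reconstruction_fn stands in for the per-frame function applied componentwise:

    def decode(index, codebook, reconstruction_fn):
        """Reproduce the codeword from the received index and apply the
        reconstruction function to each of its components."""
        codeword = codebook[index]
        return [reconstruction_fn(symbol) for symbol in codeword]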

Analysis of bit rates, typicality, and distortions:

1) From the construction of the encoders, the bit rate in bits per symbol for each is upper bounded by .

2) In view of the law of large numbers, standard probability bounds associated with typicality (see, for example, [4, Lemma 10.6.2, Chapter 10]), and the Markov lemma [4, Lemma 15.8.1, Chapter 15], [1], it follows that with probability approaching as , are strongly typical, and and are strongly typical.

3) In view of Requirements (R1) to (R3) in the definition (2.1) and of the above two paragraphs, it follows that the distortion per symbol between each and is upper bounded by with probability approaching as .

Existence of a deterministic causal video code with desired performance:

In the above analysis, all probabilities are with respect to both the random sources and the random codebooks. By the well-known Markov inequality, it follows that there exists a deterministic causal video code (i.e., a deterministic codebook) for which the distortion per symbol between each and is upper bounded by with probability approaching as (see footnote 9). Therefore, for this deterministic causal video code, the average distortion per symbol between each and is upper bounded by . Note that all rates are fixed. Putting all pieces together, we have shown that

    Letting yields

This completes the proof of the positive part of Theorem 5.

9 This step is necessary since we have multiple distortion inequalities to satisfy, in which case declaring the existence of a deterministic code immediately from several inequalities with average performance over the codebook ensemble would fail.
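One standard way to carry out this step is the following sketch (our notation: p_k(C) denotes the probability, over the sources, that the k-th distortion constraint is violated under codebook C, and δ_n → 0 is the ensemble-average bound from the analysis above):

    % Extracting one deterministic codebook from N ensemble-average bounds.
    % Assume E_C[p_k(C)] <= delta_n for each k = 1, ..., N.
    \begin{align*}
    \Pr_{\mathcal{C}}\{p_k(\mathcal{C}) \ge 2N\delta_n\}
      &\le \frac{E_{\mathcal{C}}[p_k(\mathcal{C})]}{2N\delta_n}
       \le \frac{1}{2N} && \text{(Markov inequality)}\\
    \Pr_{\mathcal{C}}\{\exists\, k:\ p_k(\mathcal{C}) \ge 2N\delta_n\}
      &\le N \cdot \frac{1}{2N} = \frac{1}{2} && \text{(union bound)}.
    \end{align*}

Hence some deterministic codebook satisfies p_k(C) ≤ 2Nδ_n → 0 simultaneously for all k.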


    APPENDIX B

In this Appendix, we prove the positive part (i.e., ) of Theorem 1. Since , each is convex, and is closed, it suffices to show that for each .

Proof of : Unless otherwise specified, notation below is the same as in the proof of the positive part in Appendix A. Indeed, our proof is similar to the random coding argument made for the IID case in Appendix A. However, since the vector source now is not IID, but stationary and totally ergodic, the Markov lemma in its simple form, as expressed in [4, Lemma 15.8.1, Chapter 15], is no longer valid. To overcome this difficulty, we will modify the concept of typical sequences and make it even stronger. With and , defined as in Appendix A, we define for each sequence and , where for any alphabet denotes the set of all sequences of length from

(B.1)

and similarly, for each

(B.2)

We then define our modified joint typical sets as follows:

(B.3)

and for ,

(B.4)

To get our random causal video coding scheme in this case, we simply modify the encoding procedure of the random coding scheme constructed in Appendix A by replacing and with and , respectively; the rest of the random coding scheme remains the same. Since the rate of the encoder for each is fixed, the bit rate in bits per symbol for each is upper bounded by . To get the desired upper bounds on distortions, we need to analyze the joint typicality of the source sequences and the respective transmitted codeword sequences. At this point, we invoke the following result, which will be proved at the end of this Appendix.

Lemma 2 (Extended Markov Lemma): Suppose that are jointly stationary and ergodic. Let , and be the auxiliary random variables in (2.1) (for the definition of ) satisfying the requirements (R1) to (R4). Let be the output process of the memoryless channel given by in response to the input . For any , let be the output process of the memoryless channel given by in response to the inputs and . Let be the output process of the memoryless channel given by in response to the inputs and . Then the following properties hold.

(P1) The probability , where and , goes to as .

(P2) For any and sufficiently large

(B.5)

for any .

(P3) For sufficiently large

(B.6)

for any .

Lemma 2 can be regarded as an extended Markov lemma in the ergodic case. In view of Lemma 2, it is not hard to see that with high probability, which approaches 1 as , are strongly typical, and and are strongly typical. The rest of the proof is identical to the IID case considered in Appendix A. This completes the proof of .

Proof of : We consider a block of symbols as a super symbol and regard as a vector source over . Since is totally ergodic, it is also ergodic when regarded as a vector source over . Repeating the above argument for super symbols, i.e., for alphabets , we then have for any . This completes the proof of the positive part of Theorem 1.
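The blocking device used here is easy to picture. A toy Python sketch (the function name and the use of NumPy are our own choices; trailing symbols that do not fill a block are simply discarded):

    import numpy as np

    def to_super_symbols(sequence, block_len):
        """Regard each run of block_len consecutive symbols as one super
        symbol over the block_len-fold product alphabet."""
        sequence = np.asarray(sequence)
        num_blocks = len(sequence) // block_len
        return sequence[:num_blocks * block_len].reshape(num_blocks, block_len)

    # A length-12 sequence becomes four super symbols over the cubed alphabet.
    blocks = to_super_symbols([0, 1, 1, 0, 2, 1, 0, 0, 1, 2, 2, 0], block_len=3)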

We now prove Lemma 2.

Proof of Lemma 2: By construction, it is easy to see that and are the output of a memoryless channel in response to the input . Since are jointly stationary and ergodic, it follows from [2, Theorem 7.2.1, p. 272] that the processes and are jointly stationary and ergodic as well. By the ergodic theorem, we then have

    (B.7)

    Let

    Rewrite as

    (B.8)


    Applying the Markov inequality to (B.8), we get

(B.9)

Since as , combining (B.9) with (B.7) yields Property (P1) in Lemma 2.

To prove Property (P2) in Lemma 2, note that given any , is a conditionally independent sequence. It is not hard to see that

(B.10)

as long as . Furthermore, the convergence in (B.10) is uniform. This, coupled with the definition of , implies that for sufficiently large and for any ,

(B.11)

Applying the Markov inequality to (B.11), we get

(B.12)

which in turn implies

(B.13)

whenever . Combining (B.13) with (B.11) yields (B.5).

A similar argument can be used to prove Property (P3). This completes the proof of Lemma 2.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor, Dr. Ordentlich, and the anonymous reviewers for their detailed comments. In particular, we are deeply grateful to the Associate Editor for bringing references [11] and [12] to our attention.

REFERENCES

[1] T. Berger, "Multiterminal source coding," in Information Theory Approach to Communications, G. Longo, Ed. New York: Springer-Verlag, 1977.

[2] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[3] R. E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. IT-18, pp. 460–473, 1972.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.

[5] I. Csiszár, "On the computation of rate distortion functions," IEEE Trans. Inf. Theory, vol. IT-20, pp. 122–124, 1974.

[6] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest, Hungary: Akadémiai Kiadó, 1986.

[7] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984.

[8] W. H. R. Equitz and T. Cover, "Successive refinement of information," IEEE Trans. Inf. Theory, vol. 37, no. 2, pp. 269–275, Mar. 1991.

[9] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.

[10] N. Ma and P. Ishwar, "On delayed sequential coding of correlated sources," arXiv:cs/0701197v2 [cs.IT], Sep. 30, 2008.

[11] N. Ma and P. Ishwar, "The value of frame-delays in the sequential coding of correlated sources," in Proc. 2007 IEEE Int. Symp. Inf. Theory, Nice, France, Jun. 2007, pp. 1496–1500.

[12] N. Ma, Y. Wang, and P. Ishwar, "Delayed sequential coding of correlated sources," in Proc. 2007 Information Theory and Applications Workshop, San Diego, CA, Jan. 2007, pp. 214–222.

[13] I. E. G. Richardson, H.264 and MPEG-4 Video Compression. New York: Wiley, 2003.

[14] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton University Press, 1970.

[15] H. Viswanathan and T. Berger, "Sequential coding of correlated sources," IEEE Trans. Inf. Theory, vol. 46, no. 1, pp. 236–246, Jan. 2000.

[16] E.-H. Yang and L. Wang, "Full rate distortion optimization of MPEG-2 video coding," in Proc. 2009 IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 7–11, 2009, pp. 605–608.

[17] E.-H. Yang and L. Wang, "Joint optimization of run-length coding, Huffman coding and quantization table with complete baseline JPEG decoder compatibility," IEEE Trans. Image Process., vol. 18, no. 1, pp. 63–74, Jan. 2009.

[18] E.-H. Yang and L. Wang, "Method, system, and computer program product for optimization of data compression with cost function," U.S. Patent 7 570 827, Aug. 4, 2009.

[19] E.-H. Yang and X. Yu, "Rate distortion optimization for H.264 inter-frame video coding: A general framework and algorithms," IEEE Trans. Image Process., vol. 16, no. 7, pp. 1774–1784, Jul. 2007.

[20] E.-H. Yang and X. Yu, "Soft decision quantization for H.264 with main profile compatibility," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 1, pp. 122–127, Jan. 2009.

[21] E.-H. Yang and Z. Zhang, "On the redundancy of lossy source coding with abstract alphabets," IEEE Trans. Inf. Theory, vol. 44, pp. 1092–1110, May 1999.

[22] E.-H. Yang, L. Zheng, D.-K. He, and Z. Zhang, "On the rate distortion theory for causal video coding," in Proc. 2009 Information Theory and Applications Workshop, San Diego, CA, Feb. 8–13, 2009, pp. 385–391.

[23] E.-H. Yang, L. Zheng, Z. Zhang, and D.-K. He, "A computation approach to the minimum total rate problem of causal video coding," in Proc. 2009 IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jun./Jul. 2009, pp. 2141–2145.

[24] R. W. Yeung and T. Berger, "Multi-way alternating minimization," in Proc. 1995 IEEE Int. Symp. Inf. Theory, Whistler, Canada, Sep. 17–22, 1995.

[25] L. Zheng and E.-H. Yang, "Causal video coding theorem for ergodic sources," in preparation.

En-Hui Yang (M'97–SM'00–F'08) received the B.S. degree in applied mathematics from HuaQiao University, Quanzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.

Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, ON, Canada, where he is currently a Professor and Canada Research Chair in information theory and multimedia compression. He held a Visiting Professor position at the Chinese University of Hong Kong, Hong Kong, from September 2003 to June 2004; positions of Research Associate and Visiting Scientist at the University of Minnesota, Minneapolis-St. Paul, the University of Bielefeld, Bielefeld, Germany, and the University of Southern California, Los Angeles, from January 1993 to May 1997; and a faculty position (first as an Assistant Professor and then an Associate Professor) at Nankai University, Tianjin, China, from 1991 to 1992. He is the founding Director of the Leitch-University of Waterloo multimedia communications lab, and a Co-Founder of SlipStream Data Inc. (now a subsidiary of Research In Motion). His current research interests are: multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, source and channel coding including distributed source coding, and image and video coding.

Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators; the 1992 third Science and Technology Promotion Award of the Chinese Ministry of Education; the 2000 Ontario Premier's Research Excellence Award, Canada; the 2000 Marsland Award for Research Excellence, University of Waterloo; the 2002 Ontario Distinguished Researcher Award; the prestigious Inaugural (2007) Premier's Catalyst Award for the Innovator of the Year; and the 2007 Ernest C. Manning Award of Distinction, one of Canada's most prestigious innovation prizes. Products based on his inventions and commercialized by SlipStream received the 2006 Ontario Global Traders Provincial Award. With over 170 papers and many patents/patent applications, products with his inventions inside are used daily by tens of millions of people worldwide. He is a Fellow of the Canadian Academy of Engineering and a Fellow of the Royal Society of Canada: the Academies of Arts, Humanities and Sciences of Canada. He served, among many other roles, as a General Co-Chair of the 2008 IEEE International Symposium on Information Theory, an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY, a Technical Program Vice-Chair of the 2006 IEEE International Conference on Multimedia & Expo (ICME), the Chair of the award committee for the 2004 Canadian Award in Telecommunications, a Co-Editor of the 2004 Special Issue of the IEEE TRANSACTIONS ON INFORMATION THEORY, a Co-Chair of the 2003 U.S. National Science Foundation (NSF) workshop on the interface of Information Theory and Computer Science, and a Co-Chair of the 2003 Canadian Workshop on Information Theory.

Lin Zheng received the B.Eng. degree in electronics and information engineering from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 2004, and the M.S. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2007. She is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Waterloo.

Her research interests include information theory, data compression, multi-terminal source coding theory and algorithm design, and multimedia communications.

Da-Ke He (S'01–M'06) received the B.S. and M.S. degrees, both in electrical engineering, from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 1993 and 1996, respectively, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2003.

From 1996 to 1998, he was with Apple Technology China (Zhuhai) as a software engineer. From 2003 to 2004, he worked in the Department of Electrical and Computer Engineering at the University of Waterloo as a postdoctoral research fellow in the Leitch-University of Waterloo Multimedia Communications Lab. From 2005 to 2008, he was a research staff member in the Department of Multimedia Technologies at the IBM T. J. Watson Research Center, Yorktown Heights, NY. Since 2008, he has been a technical manager at SlipStream Data, a subsidiary of Research In Motion, Waterloo, ON, Canada. His research interests are in source coding theory and algorithm design, multimedia data compression and transmission, multi-terminal source coding theory and algorithms, and digital communications.

Zhen Zhang (F'03) received the M.S. degree in mathematics from Nankai University, Tianjin, China, in 1980, the Ph.D. degree in applied mathematics from Cornell University, Ithaca, NY, in 1984, and the Habilitation in mathematics from Bielefeld University, Bielefeld, Germany, in 1988.

He served as a Lecturer in mathematics at Nankai University during 1981-1982. He was a postdoctoral research associate with the School of Electrical Engineering, Cornell University, from 1984 to 1985, and with the Information Systems Laboratory, Stanford University, in the Fall of 1985. From 1986 to 1988, he was with the Mathematics Department, Bielefeld University, Bielefeld, Germany. He joined the faculty of the University of Southern California in 1988, where he is currently a Professor in the Ming Hsieh Department of Electrical Engineering-Systems. He is a Fellow of the IEEE. His research interests include information theory, coding theory, data compression, network coding theory, combinatorics, and various mathematical problems related to communication sciences.