5258 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011

Rate Distortion Theory for Causal Video Coding: Characterization, Computation Algorithm, and Comparison

En-Hui Yang, Fellow, IEEE, Lin Zheng, Da-Ke He, Member, IEEE, and Zhen Zhang, Fellow, IEEE
Abstract—Causal video coding is considered from an information theoretic point of view, where video source frames are encoded in a frame-by-frame manner, the encoder for each frame can use all previous frames and all previous encoded frames while the corresponding decoder can use only all previous encoded frames, and each frame itself is modeled as a source. A novel computation approach is proposed to analytically characterize, numerically compute, and compare the minimum total rate of causal video coding required to achieve a given distortion (quality) level. Among many other things, the computation approach includes an iterative algorithm with global convergence for computing this minimum total rate. The global convergence of the algorithm further enables us to demonstrate a somewhat surprising result (dubbed the more and less coding theorem): under some conditions on source frames and distortion, the more frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent. With the help of the algorithm, it is also shown by example that the minimum total rate of causal video coding is in general much smaller than the total rate offered by the traditional greedy coding method. As a by-product, an extended Markov lemma is established for correlated ergodic sources.

Index Terms—Causal video coding, extended Markov lemma, iterative algorithm, multi-user information theory, predictive video coding, rate distortion characterization and computation, rate distortion theory, stationary ergodic sources.

I. INTRODUCTION
I. INTRODUCTION
CONSIDER a causal video coding model shown in Fig. 1, where each video frame, its encoded frame, and its reconstructed frame are depicted; all frames are encoded in a frame-by-frame manner, and the encoder for each frame can use all
Manuscript received March 31, 2010; revised December 23, 2010; accepted March 04, 2011. Date of current version July 29, 2011. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN203035-06 and Strategic Grant STPGP397345, and by the Canada Research Chairs Program.
E. Yang and L. Zheng are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]; [email protected]).
D.-K. He is with Research in Motion/SlipStream, Waterloo, ON N2L 5Z5, Canada (e-mail: [email protected]).
Z. Zhang is with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90095-1594 USA (e-mail: [email protected]).
Communicated by E. Ordentlich, Associate Editor for Source Coding.
Digital Object Identifier 10.1109/TIT.2011.2159043
Fig. 1. Causal video coding model.
previous frames and all previous encoded frames, while the corresponding decoder can use only all previous encoded frames. The model is causal because the encoder for a frame is not allowed to access future frames in the encoding order. In the special case where the encoder for each frame is further restricted to enlist help only from all previous encoded frames, causal video coding reduces to predictive video coding.
All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into the above causal video coding model (strictly speaking, into the predictive video coding model); the differences among these video coding standards lie in how the information available to the encoder of each frame is used to generate its encoded frame. The causal coding model is the same as the sequential coding model of correlated sources proposed in [15] in the two-frame case, and is also called the C-C model in [10], [11], and [12]. However, with three or more frames, which is the typical case in MPEG-series and H-series video coding, the causal coding model considered here is quite different from sequential coding1. In the special case where all frames are identical, which rarely happens in practical video coding, the causal video coding model reduces to the successive refinement setting considered in [8]. Notwithstanding, when frames are not identical, causal video coding is drastically different from successive refinement even though the decoding structure looks similar in both cases. Partial results of this paper were presented without proof in [23] and [22].
It is expected that a future video coding standard will continue to fall into the causal video coding model shown in Fig. 1. To

1The name of sequential coding was used in [15] to refer to a special video coding paradigm where the encoder for each frame can only use the previous frame as a helper, and the corresponding decoder uses only the previous encoded frame and reconstructed frame as a helper.
0018-9448/$26.00 © 2011 IEEE
provide some design guidance for a future video coding standard, in this paper we aim at investigating, from an information theoretic point of view, how each frame in the causal model should be encoded so that collectively the total rate is minimized subject to a given distortion (quality) level.
We model each frame itself as a source taking values in a finite alphabet. Together, the frames then form a vector source taking values in the product alphabet. The sources are said to be (first-order) Markov if each frame is the output of a memoryless channel in response to the previous frame as its input; in this case, we say the frames form a Markov chain. The reconstruction of each frame is drawn from a finite reproduction alphabet, and the distortion between a source symbol and its reconstruction is measured by a single-letter distortion measure, which, without loss of generality, we shall assume takes finite values. For blocks of consecutive samples we use the usual vector notation, and a similar convention will apply to reconstruction sequences and other vectors.
Formally, we define an order-n causal video code by using encoder and decoder pairs as follows2:
1) For the first frame, an encoder of order n is defined by a function from source blocks of length n to the set of all binary sequences of finite length, satisfying the property that the range of the encoder is a prefix set, and a decoder of order n is defined by a function from binary sequences to reproduction blocks. The encoded and reconstructed sequences of the first frame are given respectively by applying the encoder and then the decoder.
2) For each subsequent frame, an encoder of order n is defined by a function

2It is worthwhile to point out that, as far as causal video coding alone is concerned, there is no need to explicitly list previous encoded frames as inputs to the encoder for the current frame, in either the causal video coding diagram shown in Fig. 1 or the formal definition of causal video code given here, and all results and their respective derivations presented in the paper remain the same. The reason for us to explicitly list previous encoded frames as inputs to the encoder for the current frame is two-fold: (1) it makes the subsequent information quantities more transparent and intuitive (connecting those information quantities to a diagram in which the encoded frames are linked to the respective encoder is easier than to one without such links), and (2) more importantly, it gives us a simple, unified way to describe predictive video coding in the context of causal video coding and to contrast the two coding paradigms in our forthcoming work on the information theoretic performance comparison of predictive video coding and causal video coding.

satisfying the property that the range of the encoder, given any binary sequences produced by the previous encoders, is a prefix set, and a decoder of order n is defined by a function of the previous encoded frames and the current encoded frame. The encoded and reconstructed sequences of each such frame are given respectively by applying the encoder and then the decoder.
For each frame, the distortion between a source block and its reconstruction is given by the sum of the single-letter distortions over the block; the corresponding average distortion per symbol is then the expectation of this sum divided by the block length; and the average rate in bits per symbol of each encoder is the expected length of the binary sequence it produces divided by the block length. The performance of the order-n causal video code is then measured by its rate distortion pairs.
Definition 1: Let a rate vector and a distortion vector be given. The rate distortion pair vector is said to be achievable by causal video coding if, for every positive tolerance, there exists an order-n causal video code for all sufficiently large n such that

(1.1)

holds for every frame. Let the set of all rate distortion pair vectors achievable by causal video coding be so denoted. From the above definition, it follows that this set is closed in the corresponding finite-dimensional Euclidean space. As in the usual video compression applications, we are interested in the minimum total rate required to achieve the distortion level, which is defined by minimizing the sum of the rate components over all rate vectors achievable at that distortion level.
One of our purposes in this paper is to numerically compute, analytically characterize, and compare this minimum total rate so that deep insights can be gained regarding how each frame should be encoded in order to achieve the minimum total rate.
Our approach is computation oriented. Starting with a jointly stationary and totally ergodic vector source3, we first show in Section II that

3A vector source is said to be jointly stationary and totally ergodic if, as a single process over the product alphabet, it is stationary and totally ergodic.
the minimum total rate is equal to the infimum of the n-th order total rate distortion function over all n, where the n-th order function is itself given by the minimum of an information quantity over a set of auxiliary random variables. Then we develop an iterative algorithm in Section III to calculate the n-th order function, and further show that this algorithm converges to an optimal solution that achieves it. The global convergence of the algorithm enables us to establish a single-letter characterization of the minimum total rate in Section IV in the case where the vector source is independent and identically distributed (IID)4, by comparing the first-order function with the n-th order functions through a novel application of the algorithm. With the help of the algorithm, we further demonstrate in Section V a somewhat surprising result dubbed the more and less coding theorem: under some conditions on source frames and distortion, the more frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent. The algorithm also gives an optimal solution for allocating bits to different frames. It is shown in Section VI that the minimum total rate is in general much smaller than the total rate offered by the traditional greedy coding method, by which each frame is encoded in a locally optimal manner based on all information available to the encoder of that frame.
II. ACHIEVABLE REGION AND MINIMUM TOTAL RATE: TOTALLY ERGODIC CASE
Suppose now that the vector source is jointly stationary and totally ergodic across samples (pixels). Define the candidate region to consist of all rate distortion pair vectors for which there exist auxiliary random variables such that

(2.1)

holds and the following requirements5 are satisfied:
(R1) the first reconstruction is given by some deterministic function;
(R2) each subsequent reconstruction is given by some deterministic function;
(R3) the average distortion constraint holds for every frame;
(R4) the required Markov chain conditions are met.

4A vector source is said to be IID if, as a single process over the product alphabet, it is IID. Note that the common joint distribution of each sample can be arbitrary even when the vector source is IID.

5Throughout the paper, each reconstruction variable takes values over the n-fold product of the reproduction alphabet; on the other hand, each auxiliary random variable takes values over an arbitrary finite alphabet.
In (2.1) and throughout the rest of the paper, the notation I stands for mutual information or conditional mutual information (as the case may be) measured in bits, and the notation H stands for entropy or conditional entropy (as the case may be) measured in bits. Although there is no restriction on the size of the alphabet of each auxiliary random variable in (2.1), one can show, by using the standard cardinality bound argument based on the Carathéodory theorem (see, for example, Appendix A of [15]), that the alphabet size of each auxiliary random variable in (2.1) can be bounded. Denote the convex hull closure of the resulting region accordingly. Then we have the following result.
Theorem 1: For jointly stationary and totally ergodic sources, the achievable region coincides with this convex hull closure.

The positive part of Theorem 1 (i.e., achievability) will be proved in Appendix B by adopting a random coding argument similar to that for IID vector sources. Here we present the proof of the converse part.
Proof of the converse part of Theorem 1: Pick any achievable rate distortion pair vector. It follows from Definition 1 that, for any tolerance, there exists an order-n causal video code, for all sufficiently large n, such that (1.1) holds. Consider the encoded frames and reconstructed frames produced by this code, and define the auxiliary random variables accordingly. It is easy to see that the first set of required Markov conditions is satisfied. However, since the last reconstruction depends in general on the last source frame in addition to the auxiliary random variables, the random variables involved do not necessarily form a Markov chain in the indicated order. To overcome this problem, consider the conditional probability distribution of the last reconstruction given the relevant variables, and define a new random variable as the output of this channel in response to the corresponding input. Then it is easy to see that the new pair has the same distribution as the original one, and the required Markov chain now holds. This, together with (1.1), implies the following distortion upper bounds:

(2.2)

for every frame but the last, and

(2.3)

Let us now verify the rate lower bounds. In view of (1.1), we have

(2.4)

and, for each intermediate frame,

(2.5)
where the equality is due to the fact that each encoded frame is a function of the inputs available to its encoder. For the last frame, we have

(2.6)

With the auxiliary random variables defined above, it now follows from (2.2) to (2.6) and the desired Markov conditions that the rate distortion pair vector lies, within the tolerance, in the candidate region. Letting the tolerance tend to zero yields the converse inclusion. This completes the proof of the converse part.

To determine the minimum total rate in terms of information quantities, we define, for each n,

(2.7)

where the minimum is taken over all auxiliary random vectors satisfying the following two requirements:
(R5) the average distortion constraint holds for every frame;
(R6) the required Markov chains hold.
We further define

(2.8)
Then we have the following result.

Theorem 2: For jointly stationary and totally ergodic sources, the minimum total rate equals the limit defined in (2.8) for any distortion level.

To prove Theorem 2, we need the following lemma, which is also interesting in its own right.

Lemma 1: The function defined in (2.8) is convex and hence continuous over the open distortion region.
Proof of Lemma 1: Fix the distortion level. In view of the definition given in (2.7), it is not hard to show that the sequence of n-th order quantities is subadditive; that is, the (m+n)-th order quantity is at most the sum of the m-th and n-th order quantities for any m and n. As such, the limit in (2.8) can also be expressed as

(2.9)

Next we derive an equivalent expression. Define the corresponding infimum quantity; that is,

(2.10)

where the infimum is taken over all auxiliary random variables satisfying the requirements (R1) to (R4). By comparing (2.10) with (2.7), it is easy to see that

(2.11)

On the other hand, pick any auxiliary random variables satisfying the requirements (R1) to (R4), and let the reconstructions be defined as in the requirements (R1) and (R2). Then, in view of the Markov conditions in the requirement (R4), we have

(2.12)

where the last inequality is due to the fact that each reconstruction is a function of the corresponding auxiliary random variables. To continue, we now verify the Markov conditions involving the reconstructions. It is not hard to see that the first Markov conditions in the requirement (R4) are equivalent to the following conditions:
(R7) for every frame, the relevant variables are conditionally independent given the conditioning variables.
From this, it follows that the corresponding variables involving the reconstructions are conditionally independent given the same conditioning variables. Applying the equivalence again, we see that the first Markov conditions in the requirement (R6) are satisfied. Therefore, we have
(2.13)

where the equality 1) follows from the Markov conditions involving the reconstructions. Note that the last Markov condition in the requirement (R6) may not be valid here. To overcome this problem, we use the same technique as in the proof of the converse part of Theorem 1 to construct a new random vector such that the following hold: the new vector has the same distribution as the original one, and the last Markov condition is met. Therefore, the resulting random variables satisfy the requirements (R5) and (R6). This, together with (2.13), (2.12), and (2.7), implies

(2.14)

Note that (2.14) is valid for any auxiliary random variables satisfying the requirements (R1) to (R4). It then follows from (2.14) and (2.10) that the reverse of inequality (2.11) holds, which, together with (2.11), implies equality, so that (2.10) is an equivalent expression for the quantity in (2.7). In comparison with (2.7), the equivalent expression (2.10) makes it easier to apply the well-known time-sharing argument. By applying the time-sharing argument to (2.10), it is now not hard to see that the n-th order quantity is a convex function of the distortion level for each n. The convexity of the limit as a function of the distortion level then follows from its equivalent expression (2.9) and the convexity of each n-th order quantity. Since a convex function is continuous over an open region [14], this completes the proof of Lemma 1.
Proof of Theorem 2: In view of the positive part of Theorem 1, it is not hard to see that the minimum total rate is at most the limit in (2.8) for any distortion level. Therefore, in what follows, it suffices to show

(2.15)

for any distortion level. Now fix the distortion level, and pick any rate vector achieving the minimum total rate. From the proof of the converse part of Theorem 1, it follows that for any tolerance and sufficiently large n, there exist auxiliary random variables satisfying the requirements (R1) to (R4), with each distortion component relaxed by the tolerance, such that the total rate lower bounds the corresponding information quantity, which, coupled with the equivalent expression (2.10), further implies

(2.16)

In view of Lemma 1, dividing both sides of (2.16) by n and then letting the tolerance tend to zero yield the desired bound, from which (2.15) follows. This completes the proof of Theorem 2.
Remark 1: Theorems 1 and 2 remain valid for general stationary ergodic sources. However, the technique adopted in the proof of the classic source coding theorem for a single ergodic source [9], [2] cannot be applied here. As such, a new proof technique has to be developed; so as not to divert from our computation approach, this will be addressed in our forthcoming paper [25].
For general stationary ergodic sources, Theorem 2 is probably the best result one could hope for in terms of analytically characterizing the minimum total rate. However, its impact on practical video coding will be limited if the optimization problem involved cannot be solved by an effective algorithm. To a large extent, this is also true even if the minimum total rate admits a single-letter characterization, and true for many other multi-user information theoretic problems. In the following section, we will develop an iterative algorithm to compute the quantity defined in (2.7), and establish its convergence to the global minimum.
III. AN ITERATIVE ALGORITHM

In this section, an iterative algorithm is proposed to calculate the quantity defined in (2.7). The algorithm serves three purposes in this paper: first, it allows us to perform numerical calculations; second, its global convergence provides a completely different approach to establishing a single-letter characterization of the minimum total rate when the sources are IID; and third, it allows us to make comparisons and gain deep insights into the minimum total rate.

Without loss of generality, we consider the case of three sources, written compactly to simplify our notation for describing the iterative algorithm.

Consider the joint distributions of the relevant random vectors and the marginal distribution of the source. If there is no ambiguity, subscripts in distributions will be omitted. In order to find the auxiliary random variables and reconstructions that achieve the minimum in (2.7), we try to find transition probability functions and probability functions that minimize

(3.1)

where the coefficients denote the standard Lagrange multipliers. When there is no ambiguity, the superscript or subscript indicating block length will be dropped. The iterative algorithm works as follows.
Step 1: Initialize the iteration counter and choose an initial joint distribution function that is strictly positive everywhere.

Step 2: Fix the current joint distribution. Find transition probability functions such that

(3.2)

where the minimum is taken over all transition probability functions. In view of the nested structure in (3.1), we solve the problem in (3.2) in three stages. First, from (3.1),

(3.3)

In the above, the last inequality follows from the log-sum inequality, and becomes an equality if and only if

(3.4)

We next consider the second transition probability function. In view of (3.1) and (3.3), we have

(3.5)

where the last inequality again follows from the log-sum inequality, and becomes an equality if and only if

(3.6)

Finally, for the third transition probability function, continuing from (3.1) and (3.5), we have
(3.7)

An argument similar to that leading to (3.3) and (3.5) can be used to show that (3.7) becomes an equality if and only if

(3.8)

Step 3: Fix the transition probability functions just found. Find a joint distribution function such that

(3.9)

where the minimum is taken over all joint distribution functions. In view of (3.1), we see that

(3.10)

where the candidate output is obtained through the channel in response to the input, with distribution given by

(3.11)

The inequality (3.10) becomes an equality if and only if the joint distribution function equals the one induced by (3.11).

Step 4: Repeat Steps 2 and 3 until the decrease in the objective (3.1) between consecutive iterations is smaller than a prescribed threshold.
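Steps 2 and 3 form an alternating minimization: each stage replaces one family of distributions by the closed-form minimizer given by the equality condition of the log-sum inequality, in the same spirit as the Blahut-Arimoto algorithm. The sketch below is a simplified single-source analogue of this two-step structure, not the paper's three-source algorithm; the source, distortion matrix, and multiplier value are hypothetical:

```python
import math

def blahut_arimoto(px, d, s, iters=200):
    """Single-source analogue of Steps 2-3: alternately minimize the Lagrangian
    I(X; Xhat) + s * E[d(X, Xhat)] over the transition probabilities q(xhat|x)
    (Step 2) and over the output distribution r(xhat) (Step 3); s >= 0 is the
    Lagrange multiplier."""
    nx, ny = len(px), len(d[0])
    r = [1.0 / ny] * ny  # initial output distribution
    for _ in range(iters):
        # Step-2 analogue: optimal q(xhat|x) for fixed r, from the
        # log-sum-inequality equality condition
        q = [[r[y] * math.exp(-s * d[x][y]) for y in range(ny)] for x in range(nx)]
        for x in range(nx):
            z = sum(q[x])
            q[x] = [v / z for v in q[x]]
        # Step-3 analogue: the optimal r(xhat) is the induced output distribution
        r = [sum(px[x] * q[x][y] for x in range(nx)) for y in range(ny)]
    rate = sum(px[x] * q[x][y] * math.log2(q[x][y] / r[y])
               for x in range(nx) for y in range(ny) if q[x][y] > 0)
    dist = sum(px[x] * q[x][y] * d[x][y] for x in range(nx) for y in range(ny))
    return rate, dist

# uniform binary source with Hamming distortion, where R(D) = 1 - h(D) for D <= 1/2
rate, dist = blahut_arimoto(px=[0.5, 0.5], d=[[0, 1], [1, 0]], s=2.0)
```

Each iteration can only decrease the Lagrangian, which is what the descent property (3.13) expresses in the three-source setting; the full algorithm replaces the single transition probability by the nested family of Step 2 and the single output distribution by the joint distribution of Step 3.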
Define the per-iteration minimizations of Steps 2 and 3 as two mappings, so that the above iterative algorithm can be described succinctly by alternately applying them. The following theorem shows that the sequence of iterates converges to a quadruple of distributions that achieves

(3.12)

where the infimum is taken over all admissible distributions.

Theorem 3: For any initial joint distribution that is strictly positive everywhere, the sequence of iterates converges, and the corresponding objective values converge to the infimum in (3.12).

Proof of Theorem 3: From the description of the iterative algorithm, it follows that

(3.13)

To show the desired convergence, let us first verify that the algorithm has the so-called five-point property (as defined in [7]); that is, for any admissible distributions and the corresponding iterates,

(3.14)

To this end, let us calculate both sides of (3.14). In view of Steps 2 and 3, we have

(3.15)

where the equality follows from the following derivation:

(3.16)
and

(3.17)

Combining (3.16) and (3.17), we immediately have the equality in (3.15). On the other hand,

(3.18)

Combining (3.15) with (3.18) yields the desired five-point property in (3.14).

The rest of the proof is similar to that adopted in [5] to show the convergence of the Blahut-Arimoto algorithm [3]. Suppose

(3.19)

for some admissible distributions. From (3.14), it then follows that for any iteration index,

(3.20)

which, together with the monotonicity in (3.13), implies

(3.21)

and hence

(3.22)

Note that (3.22) is valid for any admissible distributions satisfying (3.19). From this, we have

(3.23)

To prove the convergence of the iterates, pick a convergent subsequence. Then its limit is admissible and

(3.24)

In view of (3.23), the limit achieves the infimum; thus (3.20) applies to the limit and the iterates. In particular, the corresponding sequence of objective gaps is nonincreasing, and since the subsequential gap vanishes, the whole sequence of gaps vanishes. Hence the iterates and their objective values converge as claimed. This completes the proof of Theorem 3.
Remark 2: The above iterative algorithm can be easily extended to the case of more than three sources, and Theorem 3 remains valid. In the corresponding degenerate setting, it also reduces to the case of two sources.

Remark 3: The iterative algorithm can be further extended to work for coupled distortion measures (as defined in [15]), where the distortion for a frame depends not only on that frame and its reconstruction but also on other frames. The global convergence as expressed in Theorem 3 is still guaranteed.
Remark 4: Although the quantity in (2.7) is convex as a function of the distortion level, as shown in the proof of Lemma 1, both of the optimization problems (2.7) and (3.12) are actually non-convex. It is therefore somewhat surprising to see the global convergence of our proposed iterative algorithm. As shown in the proof of Theorem 3, the key to the global convergence is the five-point property (3.14).

Remark 5: There are many other ways (including, for example, the greedy alternative algorithm [24]) to derive iterative procedures. However, it is not clear whether their global convergence can be guaranteed. Having algorithms with global convergence is important not only for numerical computation itself, but also for single-letter characterization of performance. One of the purposes of this paper is indeed to demonstrate, for the first time, that a single-letter characterization of performance can also be established in a computational way via algorithms with global convergence, as shown in the next section.
We conclude this section by presenting an alternative expression for the quantity in (2.7). Once again, we illustrate this by considering the case of three sources. In view of the definitions (2.7) and (3.12), it is not hard to show (for example, by using the technique demonstrated in the proof of Property 1 in [21]) that for any Lagrange multipliers,

(3.25)

In other words, the Lagrangian minimum in (3.12), as a function of the multipliers, is the conjugate of the quantity in (2.7). Since the latter is convex and lower semi-continuous over the whole region, it follows from [14, Theorem 12.2, p. 104] that for any distortion level,

(3.26)

In the next section, (3.26) will be used in the process of establishing a single-letter characterization for the minimum total rate when the vector source is IID.
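The conjugate relation behind (3.25) and (3.26) is ordinary convex (Legendre-Fenchel) duality: minimizing the rate plus multiplier-weighted distortion gives the conjugate, and maximizing back over nonnegative multipliers recovers the original convex function. A numerical sketch, where the particular convex function is a hypothetical stand-in for the paper's rate function:

```python
# Numerical check of the conjugate relation: for a convex nonincreasing f,
#   g(s) = inf_D [ f(D) + s * D ]     (the Lagrangian minimum, cf. (3.12))
# and f is recovered via
#   f(D) = sup_{s >= 0} [ g(s) - s * D ],   cf. (3.26).

Ds = [i / 1000 for i in range(1, 1000)]  # grid of distortion levels in (0, 1)
f = lambda D: (1 - D) ** 2               # hypothetical convex, decreasing function

def conjugate(s):
    return min(f(D) + s * D for D in Ds)

ss = [i / 10 for i in range(0, 41)]      # grid of Lagrange multipliers s >= 0

def recovered(D):
    return max(conjugate(s) - s * D for s in ss)

D0 = 0.3
print(round(f(D0), 3), round(recovered(D0), 3))  # both ≈ 0.49
```

The supremum is attained at the multiplier matching the negative of the slope of f at the chosen point (here s = 1.4), which is the same slope-matching idea used later when subgradients enter Condition A.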
IV. SINGLE-LETTER CHARACTERIZATION: IID CAUSAL CASE

Suppose now that the vector source is IID. In this section, we will use the iterative algorithm proposed in Section III and its global convergence to establish a single-letter characterization for the minimum total rate.

Theorem 4: If the vector source is IID, then the minimum total rate equals its first-order characterization for any distortion level.

Proof: We first show that for any n,

(4.1)

holds for any distortion level. Without loss of generality, we demonstrate (4.1) in the case of three sources by using the iterative algorithm of Section III. Since the vector source is IID, the block distribution is the n-fold product of the single-letter distribution. In view of (3.26), we have

(4.2)

for any distortion level, where the Lagrangian minimum is defined in (3.12). Here and throughout the rest of this proof, the subscript or superscript dropped for notational convenience in Section III is brought back to distinguish between the single-letter case and the block case. Therefore, it suffices to show
show that
(4.3)
for any . To this end, we willrun the iterative algorithm in
both cases of and tocalculate and . Pick any initial positive
distribution
, and run the iterative algorithm in the case of . Wethen get a
sequence which, according toTheorem 3, satisfies
(4.4)
Now let be the -fold product distribution of . Clearly,is also
positive. Use as an initial distribution and run
the iterative algorithm in the case of . Then we get asequence
which, according to Theorem3 again, satisfies
(4.5)
Since is the -fold product of andis the -fold product of ,
careful examination on
(3.4), (3.6), (3.8), and (3.11) reveals that for anyis the -fold
product of , and is the -fold product of
. (To see this is the case, let us look at (3.4) for example.Let
us temporarily drop the subscripts indicating random vari-ables in
all notation. When and
, it can be verified that in(3.4)
Since
it follows from (3.4) that
A similar argument can be applied to (3.6), (3.8), and (3.11).) Therefore, at every iteration the block objective is n times the single-letter objective, which, coupled with (4.4) and (4.5), implies (4.3) and hence (4.1).

Combining (4.1) with (2.8) yields the first-order characterization for any interior distortion level. This, together with Theorem 2, implies

(4.6)

for any such distortion level. Since, by their definitions, both functions involved are right continuous, it follows that (4.6) remains valid for boundary points where some distortion components take their boundary values. This completes the proof of Theorem 4.
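The product-structure observation in the proof above can be checked numerically for a single-letter update rule of the form appearing in (3.4): when the output distribution is an n-fold product and the block distortion is additive, the updated block transition probability factors into the n-fold product of the single-letter updates. A small self-contained check, where the distributions, multiplier, and Hamming distortion are hypothetical:

```python
import itertools, math

def update(r, d, s, x):
    # single-letter update of the form in (3.4): q(y|x) proportional to r(y)*exp(-s*d(x,y))
    w = [r[y] * math.exp(-s * d[x][y]) for y in range(len(r))]
    z = sum(w)
    return [v / z for v in w]

d = [[0, 1], [1, 0]]   # Hamming distortion
r1 = [0.3, 0.7]        # single-letter output distribution
s = 1.5
n = 2

# n-fold product setting: block alphabet {0,1}^n, additive distortion,
# product output distribution r1 x r1
blocks = list(itertools.product((0, 1), repeat=n))
rn = {y: r1[y[0]] * r1[y[1]] for y in blocks}
dn = {(x, y): d[x[0]][y[0]] + d[x[1]][y[1]] for x in blocks for y in blocks}

for x in blocks:
    w = {y: rn[y] * math.exp(-s * dn[(x, y)]) for y in blocks}
    z = sum(w.values())
    qn = {y: v / z for y, v in w.items()}
    # compare with the product of the single-letter updates
    q1 = [update(r1, d, s, xi) for xi in x]
    for y in blocks:
        assert abs(qn[y] - q1[0][y[0]] * q1[1][y[1]]) < 1e-12
print("product structure preserved")  # prints once all checks pass
```

The factorization holds because both the exponential weight and the normalizing constant split into per-sample factors, which is exactly why the block iterates stay n-fold products throughout the run.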
Theorem 4 can also be proved by using the classical auxiliary random variable converse and achievability arguments (hereafter referred to as the classic approach). Indeed, one can establish the following single-letter characterization for the achievable region, the proof of which is given in Appendix A.

Theorem 5: If the vector source is IID, then the achievable region admits the corresponding single-letter characterization6.
Remark 6: It is instructive to compare the computational approach to single-letter characterization (as illustrated in the proofs of Theorems 2, 3, and 4) with the classic approach. In the computational approach, the converse is first established for multiple letters (blocks); its proof is often straightforward, and the required Markov chain conditions are satisfied automatically, as shown in the proof of Theorem 2. The key is then to have an algorithm with global convergence for computing all block terms, and to later show that all these block terms are the same. In the classic approach, on the other hand, the converse proof is quite involved; coming up with auxiliary random variables with the right Markov chain conditions is always challenging and sometimes seems impossible. Since a single-letter characterization has to be computed anyway, the computational approach is preferred whenever it is possible.
Remark 7: In the two-frame case, Theorems 5 and 4 reduce to Theorems 1 and 3 in [15], respectively. However, the proofs in [15] are incomplete due to the invalid claim of a Markov condition made in the proofs therein; as such, the formulas therein cannot be extended to three or more frames. Theorems 5 and 4, in a slightly

6Since the alphabet size of each auxiliary random variable in (2.1) can be bounded, the region is actually convex and closed. As such, it equals its convex hull closure. We keep the closure operation in the statement of Theorem 5 just for the sake of consistency with the norm in the literature [4].

different but equivalent form, were also reported in [10], [11], and [12] by following the classic approach. The difference lies in the extra Markov chain condition for the reconstruction shown as Condition (R4). For example, in the specific formulas shown in [10, Theorem 1], the corresponding Markov chain condition is not required.
V. MORE AND LESS CODING THEOREM

To gain deep insights into causal video coding, in this section we use the iterative algorithm proposed in Section III to compare the minimum total rates for different numbers of frames. To be specific, whenever we need to bring out the dependence of the minimum total rate and its first-order characterization on the underlying sources, we will make the sources explicit in the notation. In particular, we will compare the minimum total rate when all frames are encoded with that when some frame is omitted. Without loss of generality again, we will consider the case of three frames; all results and discussions in this section can be easily extended to more frames. We first have the following result.
Theorem 6: Suppose that the vector source is jointly stationary and totally ergodic, and that the three frames form a Markov chain in the indicated order. Then, for any distortion level,

(5.1)

Proof: We distinguish between two cases according to whether the distortion level lies in the interior of the distortion region. In Case (1), it follows from Theorem 2 and (2.8) that it suffices to show

(5.2)

for any n and any such distortion level. To this end, pick any auxiliary random variables satisfying the requirements (R5) and (R6) for the larger problem. It is not hard to verify that

(5.3)

where the equality 1) follows from the fact that the requirement (R6), plus the Markov condition assumed on the sources, implies that the needed Markov condition is satisfied. In (5.3), one remaining Markov condition may not be valid. However, to
overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1, and also in the proof of Lemma 1, to construct a new random vector such that the following hold: the new vector has the same distribution as the original one, and the needed Markov condition is met. Therefore, the resulting random variables satisfy the requirements (R5) and (R6) with respect to the smaller problem. This, together with (5.3) and (2.7), implies

(5.4)

Since (5.4) is valid for any auxiliary random variables satisfying the requirements (R5) and (R6), (5.2) then follows from the definition (2.7). This completes the proof of (5.1) in Case (1).

To prove (5.1) in Case (2), note that both quantities are right continuous in the sense that, for any boundary distortion level, the two equations shown at the bottom of the page hold. The validity of (5.1) in Case (2) then follows from its validity in Case (1). This completes the proof of Theorem 6.
Theorem 6 is what one would expect and is consistent with our intuition. Let us now look at the case where the frames do not form a Markov chain and the source is an IID vector source. Define, for any distortion level,

(5.5)

where, for any source, the classical rate distortion function of that source is understood. Assume the distortion level lies in the interior. In view of Theorem 4 and the proof of Lemma 1, both total rate functions are convex as functions of the distortion components over the interior region. As such, they are subdifferentiable at any interior point. (See [14, Chapter 23] for discussions on the subdifferential and subgradients of a convex function.) From Section III, they can also be computed via our iterative algorithm through their respective conjugates. Since the source is an IID vector

Fig. 2. One special case of two-layer causal coding.

source, in view of Theorem 4, we will drop the subscript or superscript for all notation in Section III throughout the rest of this section, with the understanding that the single-letter case is meant. Once again, to bring out the dependence on the source, we will make the source explicit in the notation; in particular, two frames taken together may be regarded as one super source (see Fig. 2). This convention will apply to other notation in Section III as well. In particular,

(5.6)

Condition A: An interior point is said to satisfy Condition A if the total rate function, as a function of the distortion components, has a negative subgradient at that point such that there is a distribution satisfying the following requirements:
(R8)
(R9) Define (as in Step 2 of the iterative algorithm)

(5.7)

(5.8)

where

(5.9)

(5.10)
Denote the two conditional distributions and by . Then either or depends on , i.e., there exist , and with such that
We are now ready to state a somewhat surprising result dubbed the more and less coding theorem.
Theorem 7 (More and Less Coding Theorem): Suppose that is an IID vector source with , and , and do not form a Markov chain. Then for any point , satisfying Condition A, there is a critical value such that for any ,
(5.11)
and for any ,
(5.12)
Remark 8: In Theorem 7, if , then at .
Proof of Theorem 7: Since , as a function of , is continuous over and non-increasing, it suffices to show that
(5.13)
for any point , satisfying Condition A. To this end, we consider a new two-layer causal coding model shown in Fig. 2, where and together are regarded as one super source. Let denote its minimum total rate function. Since at , a random variable independent of , and can be constructed in such a way that . Therefore, it is easy to see that
(5.14)
for any and . On the other hand, in view of the definition of causal video codes, it is not hard to see that any causal code for encoding , and with respective distortions can also be used for encoding and in Fig. 2 with distortions without changing the total rate. Thus
for any . This, coupled with (5.14), implies
(5.15)
for any and .
To continue, we are now led to show
(5.16)
for any point , satisfying Condition A. First note that from the definition of causal video codes,
(5.17)
for any and . Fix now any point , satisfying Condition A. We prove (5.16) by contradiction. Suppose that
(5.18)
at the point . Let be the negative subgradient of at the point in Condition A. From (5.15), is also a negative subgradient of at the point . This implies that for any and ,
which, coupled with (5.18) and (5.17), in turn implies that the equation shown at the bottom of the page holds for any and . In other words, under the assumption (5.18), is also a negative subgradient of at the point . In view of (3.25), (3.26), and (5.6), it then follows that
(5.19)
(5.20)
In view of the requirement (R8) in Condition A, we have
(5.21)
From Step 2 of the iterative algorithm, it follows that
(5.22)
where the inequality in (5.22) is strict when depends on . Therefore, according to the requirement (R9) in Condition A, no matter which choice in the requirement (R9) is valid, we always have
which, together with (5.19) to (5.21), implies that
This contradicts the assumption (5.18), hence completing the proof of (5.16) and (5.13).
Define
Then from (5.13), it is easy to see that is the desired critical value. This completes the proof of Theorem 7.
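Numerically, the critical value defined above is simply the crossing point of two non-increasing rate curves. A generic bisection sketch of our own follows; the function handles stand in for the two total rate functions, which in practice would be evaluated by the iterative algorithm.

```python
def crossing_point(f, g, lo, hi, tol=1e-9):
    """Locate z in [lo, hi] where f - g changes sign, assuming f - g
    is continuous with exactly one sign change on the interval (as for
    two rate curves meeting at a single critical distortion)."""
    assert (f(lo) - g(lo)) * (f(hi) - g(hi)) <= 0, "no sign change on [lo, hi]"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(lo) - g(lo)) * (f(mid) - g(mid)) <= 0:
            hi = mid        # the sign change lies in [lo, mid]
        else:
            lo = mid        # the sign change lies in [mid, hi]
    return 0.5 * (lo + hi)
```

Bisection is preferred here over derivative-based root finding because rate curves are only guaranteed to be convex and non-increasing, not smooth.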
Remark 9: Theorem 7, in particular (5.11), is really counterintuitive. It says that whenever the conditions specified in Theorem 7 are met, the more source frames need to be encoded and transmitted, the less amount of data after encoding has to be actually sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. To help the reader better understand this phenomenon, let us examine where the gain of over comes from whenever the conditions specified in Theorem 7 are met. The availability of to the encoder of does not really help the encoder of and its corresponding decoder achieve a better rate distortion tradeoff . Likewise, the availability of and to the encoder of does not really help the encoder of and its corresponding decoder achieve a better rate distortion tradeoff either. What really matters is that the availability of to the encoder of will help the encoder of choose better side information for the encoder and decoder of . If the rate reduction of the encoder of arising from this better along with is more than the overhead associated with the rate and the selection of this better , then the total rate is smaller. (Here the overhead associated with the rate and the selection of this better is meant to be the difference between the sum of and in and the rate in . Depending on how helpful is, the rate in can be more or less than the rate in .) This is further confirmed in Examples 1 and 2 at the end of this section.
Condition A is generally met at points , for which positive bit rates are needed at both the decoder for and the decoder for in order for them to produce the respective reproductions with the desired distortions and . Such distortion points will be called points with positive rates. By using the technique demonstrated in the proof of [21, Property 1], it can be shown that has a negative subgradient at any point , with positive rates. In addition, the distribution , if optimal, generally depends on (except for some corner cases) when , and do not form a Markov chain. We illustrate this in the following theorem in the binary case.
Theorem 8: Assume that , and the Hamming distortion measure is used. Let be an IID vector source with . Suppose that , and do not form a Markov chain. Then for with and , if ( and ) achieves , i.e.,
(5.23)
then depends on , i.e., there exists such that the conditional distributions and are different.
Proof of Theorem 8: Fix with and .
We first derive some bounds on . It is not hard to verify that
(5.24)
where is the unique value of at which the derivative of is equal to , and is the unique value of at which the derivative of is equal to . In the above, the inequality 1) is due to the fact that
(5.25)
for any . Under the condition that , the inequality (5.25) is strict at . Therefore,
(5.26)
In view of (5.23), it follows from the iterative algorithm that
(5.27)
(5.28)
where appears as subscripts to indicate that the operations and defined in Section III are for the sources and . Let be the output of
the channel in response to the input . Then the joint distribution of is , and (5.23) implies
(5.29)
Putting (5.29) and (5.26) together, we can conclude that
and hence for any . Otherwise, from (5.29) we would have that
(5.30)
which contradicts (5.26).
We now prove Theorem 8 by contradiction. Suppose that does not depend on . Then for any and ,
(5.31)
which, together with (5.27), (5.7) to (5.10), and the fact that , implies
(5.32)
Simplifying (5.32) yields
(5.33)
where . To continue, we now consider specific values of and . Let us first look at the case of and . It follows from (5.33) that
(5.34)
which implies
(5.35)
where , and . Further simplifying (5.35) yields
(5.36)
Since , it can be verified that is equal to if and only if .
Next we show that . To this end, first note that is equivalent to saying that is a product distribution, i.e.,
(5.37)
By plugging (5.37) into (5.27), it follows from Step 2 of the iterative algorithm that in does not depend on and does not depend on , i.e.,
(5.38)
where and are the normalization factors so that the respective terms are indeed distributions. It is easy to see that (5.37) and (5.38) imply
(5.39)
(5.40)
(5.41)
Combining (5.39) to (5.41) with (5.29) yields
which contradicts (5.26). Therefore, . Go back to (5.36). Since , (5.36) is equivalent to
(5.42)
Fig. 3. Comparison of and versus for fixed and .
Fig. 4. Comparison of and versus for fixed and in Example 1.
Repeat the above argument for the case of and . We then have accordingly
(5.43)
Putting (5.42) and (5.43) together, we have shown that (5.31) implies that , and form a Markov chain, which contradicts our assumption. This completes the proof of Theorem 8.
Remark 10: From Theorem 8, it follows that for any sources , and satisfying the conditions of Theorem 8, Condition A is met at any point , at which has a negative subgradient.
We conclude this section with examples illustrating Theorem 7.
Example 1: Suppose that , and that the Hamming distortion measure is used. Let , and
It is easy to see that and do not form a Markov chain. We consider the following three cases:
Case 1: , and ;
Case 2: , and ; and
Case 3: , and .
For Case 1, Fig. 3 shows the rate-distortion curves of and versus . Over the interval of shown in Fig. 3, it is clear that is always strictly less than .
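For orientation in reading such binary rate-distortion curves, the single-source building block behind them is the classical rate distortion function of a Bernoulli(p) source under Hamming distortion, R(D) = h(p) − h(D) for 0 ≤ D < min(p, 1 − p), and 0 beyond that point. A small self-check of our own (not the joint computation of Section III):

```python
import math

def h2(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def binary_rd(p, D):
    """Classical R(D) of a Bernoulli(p) source under Hamming distortion:
    h(p) - h(D) for 0 <= D < min(p, 1-p), and 0 once D reaches that point."""
    if D >= min(p, 1.0 - p):
        return 0.0
    return h2(p) - h2(D)
```

Summing such per-frame curves (with conditioning as appropriate) is what produces the total-rate curves compared in Figs. 3 to 5.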
For Case 2, Fig. 4 shows and versus with fixed and . It is observed that the critical point at which meets is the intersection of the two curves. Denote this critical point by . Then it is clear that when , is indeed strictly less than . Table I shows the rate allocation across different encoders in both cases of and for several sample values of , where , represents the rate allocated to the encoder of in both cases, and and are denoted as and , respectively, to save space. It is clear from Table I that the allocated rates confirm the explanation mentioned in Remark 9.
Fig. 5. Comparison of and versus for fixed and in Example 1.
TABLE I: RATE ALLOCATION OF AND VERSUS FOR FIXED AND IN EXAMPLE 1
TABLE II: RATE ALLOCATION OF AND VERSUS FOR FIXED AND IN EXAMPLE 1
When we assign different values to and , we observe the same phenomenon, as shown again in Fig. 5 and Table II for Case 3.
Let us now look at another example with a different joint distribution.
Example 2: Suppose that , and that the Hamming distortion measure is used. Let , and
Once again, and do not form a Markov chain. Fix and . Fig. 6 shows the two rate distortion curves and versus , and Table III lists their respective rate allocations for several sample values of . The same phenomenon is revealed as in Example 1.
For all cases shown in Examples 1 and 2, in comparison with , when we include in the encoding and transmission, we not only get the reconstruction of (with distortion ) free at the receiver end, but are also able to reduce the total number of bits to be transmitted. In other words, we can achieve a double gain.
VI. COMPARISON WITH GREEDY CODING
All MPEG-series and H-series video coding standards [13], [19] proposed so far fall into predictive video coding, where at the encoder for each frame , only previous encoded frames are used as a helper. By using a technique called soft decision quantization [17], [18], [19], it has been demonstrated in a series of papers [16], [19], [20] that the greedy coding method7 offers significant gains (ranging from 10% to 30% rate reduction at the same quality) over the respective reference codecs8 of these standards. As such, it is instructive to compare the performance of causal coding characterized by with the performance of greedy coding characterized by the total rate offered by the greedy coding method. In this section, we present specific examples to numerically compare with . Analytic comparison between causal coding and predictive coding will be treated separately in our forthcoming paper due to its complexity.
Example 3: Suppose that , and the Hamming distortion measure is used. In this example,
7The greedy coding method is a special form of predictive video coding; based on all previous encoded frames, it encodes each current frame in a locally optimal manner so as to achieve the best rate distortion tradeoff for the current frame only.
8Both the greedy coding method and reference codecs are special forms of predictive video coding. At this point, the best rate distortion performance of predictive video coding is still unknown in general.
Fig. 6. Comparison of and versus for fixed and in Example 2.
Fig. 7. Comparison of and versus for fixed and in Example 3.
TABLE III: RATE ALLOCATION OF AND VERSUS FOR FIXED AND IN EXAMPLE 2
we consider a Markov chain: . The transition probability is given by
and the other transition probability is given by
Fig. 7 shows the rate-distortion curves of and versus when is uniformly distributed, , and . As shown in Fig. 7, when , which is more than 31 percent less than . Let us now look at another example in which , and do not form a Markov chain.
Example 4: Suppose that , and the Hamming distortion measure is used. In this example, , and do not form a Markov chain, but does form a Markov chain in the indicated order.

Fig. 8. Comparison of and versus for fixed and in Example 4.

The transition probability is given by
and the other transition probability is given by
Fig. 8 shows the rate-distortion curves of and versus when is uniformly distributed, , and . As shown in Fig. 8, when , which is 34.8 percent less than .
The above two examples are of course toy examples. However, if the performance improvement is indicative of the performance of causal video coding for real video data, it is definitely worthwhile to make the causal video coding idea materialize in video codecs.
VII. CONCLUSION
In this paper, we have investigated the causal coding of source frames from an information theoretic point of view. An iterative algorithm has been proposed to numerically compute the minimum total rate achievable asymptotically by causal video coding for jointly stationary and totally ergodic sources at distortion levels , and to analytically characterize for IID sources . The algorithm has been shown to converge globally. With the help of the algorithm, we have further established a somewhat surprising more and less coding theorem: under some conditions on source frames and distortion, the more frames need to be coded and transmitted, the less amount of data after encoding has to be sent! If the cost of data transmission is proportional to the transmitted data volume, this translates literally into a scenario where the more frames you download, the less you would pay. Numerical comparisons between causal video coding and greedy coding have shown that causal video coding offers significant performance gains over greedy coding. Along the way, we have advocated that whenever possible, the computational approach as illustrated in the paper is a preferred approach to multi-user problems in information theory. In addition, we have also established an extended Markov lemma for correlated ergodic sources, which will be useful for other multi-user problems in information theory as well.
If the information theoretic analysis as demonstrated in this paper is indicative of the real performance of causal video coding for real video data, then the more and less coding theorem, together with the significant performance gain of causal video coding over greedy coding, points to a bright future for causal video coding. To make the idea of causal video coding materialize in real video codecs, future research efforts should be directed towards designing effective causal video coding algorithms, in addition to addressing many information theoretic problems such as universal causal video coding.
APPENDIX A
In this Appendix, we prove Theorem 5. As usual, we divide the proof of Theorem 5 into its converse part and its positive part.
Proof of the converse part: Pick any achievable rate distortion pair vector
For any , there exists an order- causal video code for all sufficiently large such that (1.1) holds. Let and be the respective encoded frame of and reconstructed frame for given by . It follows from the definition of causal video codes that the Markov
conditions , are satisfied, and
(A.1)
for .
Define auxiliary random variables
for any and , where . Since is an IID vector source, it is not hard to verify that the Markov chain is valid for any and . In view of (1.1), and the assumption that is an IID vector source, we have
(A.2)
and for ,
(A.3)
where the equality is due to the Markov chain .
For the last frame, we have
(A.4)
where the equality is due to the Markov chain .
To continue, we introduce a timesharing random variable that is uniformly distributed over , and independent of , and hence of all random variables appearing in (A.1) to (A.4). Define , for . Then it is not hard to verify that the Markov chain is valid for , and (A.2), (A.3), (A.4), and (A.1) can be rewritten, respectively, as
(A.5)
(A.6)
(A.7)
(A.8)
Note that and have the same distribution, and , is a function of . Therefore, in comparison with the requirements (R1) to (R4) in the definition (2.1), the only thing missing is that the Markov chain may not be valid. To overcome this problem, we can use the same technique as in the proof of the converse part of Theorem 1 and also in the proof of Lemma 1 to construct a new random vector such that the following hold:
and have the same distribution;
the Markov condition is met.
This, together with (A.5) to (A.8) and the definition (2.1), implies that
(A.9)
Letting yields
and hence . This completes the proof of the converse part of Theorem 5.
The positive part of Theorem 5, , can be proved by using the standard random coding argument in multi-user information theory [1], [4]. For the sake of completeness, we present a proof sketch below.
Proof sketch of the positive part: For convenience, we shall use bold letters to denote vectors throughout the rest of this section. For example, . Since is convex and is closed, it suffices to show that . Pick any rate distortion pair vector
We shall show that it is achievable. Let , and be the auxiliary random variables in (2.1) (for the definition of ) satisfying the requirements (R1) to (R4) with functions . Denote the alphabets of , by , respectively. For any , define
Let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of . Similarly, for any , let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of , and let be the set of -strongly jointly typical sequences of length with respect to the joint distribution of . Similar notation will be used for other sets of strongly typical sequences with respect to other joint distributions. (For the definition of strong typicality, please refer to, for example, [4, p. 326].) In what follows, the values of in different strongly typical sets should be understood as multiplied by different constants for different . We are now ready to describe the random codebooks and how the encoders/decoders work.
Generation of codebooks:
1) Generate independently codewords (the set of which is denoted by ), where each codeword is drawn according to the -fold product distribution of .
2) For , for every combination , where for , generate independently codewords (the set of which is denoted by ), where each is drawn according to the -fold product conditional distribution of conditionally given .
3) For every combination , where for , generate independently codewords (the set of which is denoted by ), where each is drawn according to the -fold product conditional distribution of conditionally given .
Encoding:
1) Given a sequence , encode into the index, say , of the first codeword in such that , if such a codeword exists. Otherwise, set . Denote the resulting codeword by .
2) For , with the knowledge of all historical codewords , denoted by , the encoder for finds the index, say , of the first codeword in such that , if such a codeword exists, and sets otherwise. Denote the resulting codeword by .
3) With the knowledge of all historical codewords , denoted by , the encoder for finds the index, say , of the first codeword in such that , if such a codeword exists, and sets otherwise. Denote the resulting codeword by .
Decoding:
1) The decoder for first reproduces the codeword from , and then calculates by applying the function to each component of .
2) Upon receiving , the decoder for reproduces the codeword from , and then calculates by applying the function to each component of .
3) Upon receiving , the decoder for reproduces the codeword from , and then outputs .
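The encoding rule above hinges on membership tests for strongly typical sets. A minimal empirical check of first-order strong typicality follows; it is our own simplified single-letter illustration of the δ-convention in [4, p. 326], not the joint typicality machinery itself:

```python
from collections import Counter

def is_strongly_typical(seq, p, delta):
    """Return True iff every symbol's empirical frequency in seq is
    within delta of its probability under p (a dict symbol -> prob),
    and no zero-probability symbol occurs in seq."""
    n = len(seq)
    counts = Counter(seq)
    if any(p.get(a, 0.0) == 0.0 for a in counts):
        return False            # a forbidden symbol occurred
    return all(abs(counts.get(a, 0) / n - pa) <= delta
               for a, pa in p.items())
```

In the random coding argument, the analogous joint test is applied to tuples of source and codeword sequences rather than to a single sequence.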
Analysis of bit rates, typicality, and distortions:
1) From the construction of encoders, the bit rate in bits per symbol for each is upper bounded by .
2) In view of the law of large numbers, standard probability bounds associated with typicality (see, for example, [4, Lemma 10.6.2, Chapter 10]), and the Markov lemma [4, Lemma 15.8.1, Chapter 15], [1], it follows that with probability approaching as , are strongly typical, and and are strongly typical.
3) In view of Requirements (R1) to (R3) in the definition (2.1) and of the above two paragraphs, it follows that the distortion per symbol between each and is upper bounded by with probability approaching as .
Existence of a deterministic causal video code with desired performance:
In the above analysis, all probabilities are with respect to both the random sources and the random codebooks. By the well-known Markov inequality, it follows that there exists a deterministic causal video code (i.e., a deterministic codebook) for which the distortion per symbol between each and , is upper bounded by with probability approaching as .9 Therefore, for this deterministic causal video code, the average distortion per symbol between each and , is upper bounded by . Note that all rates are fixed. Putting all pieces together, we have shown that
Letting yields
This completes the proof of the positive part of Theorem 5.
9This step is necessary since we have multiple distortion inequalities to satisfy, in which case declaring the existence of a deterministic code immediately from several inequalities with average performance over the codebook ensemble would fail.
APPENDIX B
In this Appendix, we prove the positive part (i.e., ) of Theorem 1. Since , each is convex, and is closed, it suffices to show that for each .
Proof of : Unless otherwise specified, notation below is the same as in the proof of the positive part in Appendix A. Indeed, our proof is similar to the random coding argument made for the IID case in Appendix A. However, since the vector source now is not IID, but stationary and totally ergodic, the Markov lemma in its simple form as expressed in [4, Lemma 15.8.1, Chapter 15] is not valid any more. To overcome this difficulty, we will modify the concept of typical sequences and make it even stronger. With and , defined as in Appendix A, we define for each sequence and , where for any alphabet , denotes the set of all sequences of length from ,
(B.1)
and similarly, for each ,
(B.2)
We then define our modified joint typical sets as follows:
(B.3)
and for ,
(B.4)
To get our random causal video coding scheme in this case, we simply modify the encoding procedure of the random coding scheme constructed in Appendix A by replacing and with and , respectively; the rest of the random coding scheme remains the same. Since the rate of the encoder for each is fixed, the bit rate in bits per symbol for each is upper bounded by . To get the desired upper bounds on distortions, we need to analyze the joint typicality of the source sequences and the respective transmitted codeword sequences. At this point, we invoke the following result, which will be proved at the end of this Appendix.
Lemma 2 (Extended Markov Lemma): Suppose that are jointly stationary and ergodic. Let , and be the auxiliary random variables in (2.1) (for the definition of ) satisfying the requirements (R1) to (R4). Let be the output process of the memoryless channel given by in response to the input . For any , let be the output process of the memoryless channel given by in response to the inputs and . Let be the output process of the memoryless channel given by in response to the inputs and . Then the following properties hold.
(P1) The probability , where and , goes to as .
(P2) For any and sufficiently large ,
(B.5)
for any .
(P3) For sufficiently large ,
(B.6)
for any .
Lemma 2 can be regarded as an extended Markov lemma in the ergodic case. In view of Lemma 2, it is not hard to see that with high probability, which approaches 1 as , are strongly typical, and and are strongly typical. The rest of the proof is identical to the case considered in Appendix A. This completes the proof of .
Proof of : We consider a block of symbols as a super symbol and regard as a vector source over . Since is totally ergodic, it is also ergodic when regarded as a vector source over . Repeating the above argument for super symbols, i.e., for alphabets , we then have
for any . This completes the proof of the positive part of Theorem 1.
We now prove Lemma 2.
Proof of Lemma 2: By construction, it is easy to see that and are the output of a memoryless channel in response to the input . Since are jointly stationary and ergodic, it follows from [2, Theorem 7.2.1, Page 272] that the processes and are jointly stationary and ergodic as well. By the ergodic theorem, we then have
(B.7)
Let
Rewrite as
(B.8)
Applying the Markov inequality to (B.8), we get
(B.9)
Since as , combining (B.9) with (B.7) yields Property (P1) in Lemma 2.
To prove Property (P2) in Lemma 2, note that given any , is a conditionally independent sequence. It is not hard to see that
(B.10)
as long as . Furthermore, the convergence in (B.10) is uniform. This, coupled with the definition of , implies that for sufficiently large and for any ,
(B.11)
Applying the Markov inequality to (B.11), we get
(B.12)
which in turn implies
(B.13)
whenever . Combining (B.13) with (B.11) yields (B.5).
A similar argument can be used to prove Property (P3). This completes the proof of Lemma 2.
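The ergodic theorem invoked in (B.7) can be seen at work on a small stationary Markov chain, whose empirical symbol frequency converges to the stationary probability; a toy simulation of our own:

```python
import random

random.seed(1)
# Two-state Markov chain with flip probability 0.3; by symmetry its
# stationary distribution is (1/2, 1/2), and the chain is ergodic.
n, state, ones = 200000, 0, 0
for _ in range(n):
    if random.random() < 0.3:
        state ^= 1              # flip the state with probability 0.3
    ones += state
freq_of_one = ones / n          # empirical frequency of state 1
```

The time average freq_of_one approaches the stationary probability 1/2, which is exactly the sample-path convergence that (B.7) asserts for the more elaborate processes in Lemma 2.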
ACKNOWLEDGMENT
The authors would like to acknowledge the associate editor, Dr. Ordentlich, and the anonymous reviewers for their detailed comments. In particular, we are deeply grateful to the associate editor for bringing the references [11] and [12] to our attention.
REFERENCES
[1] T. Berger, "Multiterminal source coding," in Information Theory Approach to Communications, G. Longo, Ed. New York: Springer-Verlag, 1977.
[2] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[3] R. E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. IT-18, pp. 460–473, 1972.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.
[5] I. Csiszar, "On the computation of rate distortion functions," IEEE Trans. Inf. Theory, vol. IT-20, pp. 122–124, 1974.
[6] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest, Hungary: Akademiai Kiado, 1986.
[7] I. Csiszar and G. Tusnady, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984.
[8] W. H. R. Equitz and T. Cover, "Successive refinement of information," IEEE Trans. Inf. Theory, vol. 37, no. 2, pp. 269–275, Mar. 1991.
[9] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[10] N. Ma and P. Ishwar, "On delayed sequential coding of correlated sources," Sep. 30, 2008, arXiv: cs/0701197v2 [cs.IT].
[11] N. Ma and P. Ishwar, "The value of frame-delays in the sequential coding of correlated sources," in Proc. 2007 IEEE Int. Symp. Inf. Theory, Nice, France, Jun. 2007, pp. 1496–1500.
[12] N. Ma, Y. Wang, and P. Ishwar, "Delayed sequential coding of correlated sources," in Proc. 2007 Information Theory and Applications Workshop, San Diego, CA, U.S.A., Jan. 2007, pp. 214–222.
[13] I. E. G. Richardson, H.264 and MPEG-4 Video Compression. New York: Wiley, 2003.
[14] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton University Press, 1970.
[15] H. Viswanathan and T. Berger, "Sequential coding of correlated sources," IEEE Trans. Inf. Theory, vol. 46, no. 1, pp. 236–246, Jan. 2000.
[16] E.-H. Yang and L. Wang, "Full rate distortion optimization of MPEG-2 video coding," in Proc. 2009 IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 7–11, 2009, pp. 605–608.
[17] E.-H. Yang and L. Wang, "Joint optimization of run-length coding, Huffman coding and quantization table with complete baseline JPEG decoder compatibility," IEEE Trans. Image Process., vol. 18, no. 1, pp. 63–74, Jan. 2009.
[18] E.-H. Yang and L. Wang, "Method, system, and computer program product for optimization of data compression with cost function," U.S. Patent 7 570 827, Aug. 4, 2009.
[19] E.-H. Yang and X. Yu, "Rate distortion optimization for H.264 inter-frame video coding: A general framework and algorithms," IEEE Trans. Image Process., vol. 16, no. 7, pp. 1774–1784, Jul. 2007.
[20] E.-H. Yang and X. Yu, "Soft decision quantization for H.264 with main profile compatibility," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 1, pp. 122–127, Jan. 2009.
[21] E.-H. Yang and Z. Zhang, "On the redundancy of lossy source coding with abstract alphabets," IEEE Trans. Inf. Theory, vol. 44, pp. 1092–1110, May 1999.
[22] E.-H. Yang, L. Zheng, D.-K. He, and Z. Zhang, "On the rate distortion theory for causal video coding," in Proc. 2009 Information Theory and Applications Workshop, San Diego, CA, Feb. 8–13, 2009, pp. 385–391.
[23] E.-H. Yang, L. Zheng, Z. Zhang, and D.-K. He, "A computation approach to the minimum total rate problem of causal video coding," in Proc. 2009 IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jun./Jul. 2009, pp. 2141–2145.
[24] R. W. Yeung and T. Berger, "Multi-way alternating minimization," in Proc. 1995 IEEE Int. Symp. Inf. Theory, Whistler, Canada, Sep. 17–22, 1995.
[25] L. Zheng and E.-H. Yang, "Causal video coding theorem for ergodic sources," in preparation.
En-Hui Yang (M'97–SM'00–F'08) received the B.S. degree in applied mathematics from HuaQiao University, Qianzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.
Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, ON, Canada, where he is currently a Professor and Canada Research Chair in information theory and multimedia compression. He held a Visiting Professor position at the Chinese University of Hong Kong, Hong Kong, from September 2003 to June 2004; positions of Research Associate and Visiting Scientist at the University of Minnesota, Minneapolis-St. Paul, the University of Bielefeld, Bielefeld, Germany, and the University of Southern California, Los Angeles, from January 1993 to May 1997; and a faculty position (first as an Assistant Professor and then an Associate Professor) at Nankai University, Tianjin, China, from 1991 to 1992. He is the founding Director of the Leitch-University of Waterloo multimedia communications lab, and a Co-Founder of SlipStream Data Inc. (now a subsidiary of Research In Motion). His current research interests are: multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, source and channel coding including distributed source coding, and image and video coding.
Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators; the 1992 third Science and Technology Promotion Award of Chinese Ministry of Education; the 2000 Ontario Premier's Research Excellence Award, Canada; the 2000 Marsland Award for Research Excellence, University of Waterloo; the 2002 Ontario Distinguished Researcher Award; the prestigious Inaugural (2007) Premier's Catalyst Award for the Innovator of the Year; and the 2007 Ernest C. Manning Award of Distinction, one of Canada's most prestigious innovation prizes. Products based on his inventions and commercialized by SlipStream received the 2006 Ontario Global Traders Provincial Award. With over 170 papers and many patents/patent applications, products with his inventions inside are used daily by tens of millions of people worldwide. He is a Fellow of the Canadian Academy of Engineering and a Fellow of the Royal Society of Canada: the Academies of Arts, Humanities and Sciences of Canada. He served, among many other roles, as a General Co-Chair of the 2008 IEEE International Symposium on Information Theory, an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY, a Technical Program Vice-Chair of the 2006 IEEE International Conference on Multimedia & Expo (ICME), the Chair of the award committee for the 2004 Canadian Award in Telecommunications, a Co-Editor of the 2004 Special Issue of the IEEE TRANSACTIONS ON INFORMATION THEORY, a Co-Chair of the 2003 U.S. National Science Foundation (NSF) workshop on the interface of Information Theory and Computer Science, and a Co-Chair of the 2003 Canadian Workshop on Information Theory.
Lin Zheng received the B.Eng. degree in electronics and information engineering from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 2004, and the M.S. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2007. She is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Waterloo.
Her research interests include information theory, data compression, multi-terminal source coding theory and algorithm design, and multimedia communications.
Da-Ke He (S'01-M'06) received the B.S. and M.S. degrees, both in electrical engineering, from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 1993 and 1996, respectively, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2003.
From 1996 to 1998, he was with Apple Technology China (Zhuhai) as a software engineer. From 2003 to 2004, he worked in the Department of Electrical and Computer Engineering at the University of Waterloo as a postdoctoral research fellow in the Leitch-University of Waterloo Multimedia Communications Lab. From 2005 to 2008, he was a research staff member in the Department of Multimedia Technologies at the IBM T. J. Watson Research Center in Yorktown Heights, New York, U.S.A. Since 2008, he has been a technical manager at SlipStream Data, a subsidiary of Research In Motion, in Waterloo, Ontario, Canada. His research interests are in source coding theory and algorithm design, multimedia data compression and transmission, multi-terminal source coding theory and algorithms, and digital communications.
Zhen Zhang (F'03) received the M.S. degree in mathematics from Nankai University, Tianjin, China, in 1980, the Ph.D. degree in applied mathematics from Cornell University, Ithaca, NY, in 1984, and the Habilitation in mathematics from Bielefeld University, Bielefeld, Germany, in 1988.
He served as a lecturer in mathematics at Nankai University during 1981-1982. He was a postdoctoral research associate with the School of Electrical Engineering, Cornell University, from 1984 to 1985 and with the Information Systems Laboratory, Stanford University, in the Fall of 1985. From 1986 to 1988, he was with the Mathematics Department, Bielefeld University, Bielefeld, Germany. He joined the faculty of the University of Southern California in 1988, where he is currently a Professor in Electrical Engineering, the Ming Hsieh Department of Electrical Engineering-Systems. He is a Fellow of the IEEE. His research interests include information theory, coding theory, data compression, network coding theory, combinatorics, and various mathematical problems related to communication sciences.