LOW-COMPLEXITY POWER-SCALABLE MULTI-VIEW DISTRIBUTED VIDEO CODING FOR WIRELESS VIDEO SENSOR NETWORKS+

Li-Wei Kang (康立威) and Chun-Shien Lu* (呂俊賢)

Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC (中央研究院資訊科學研究所)

E-mail: {lwkang, lcs}@iis.sinica.edu.tw

ABSTRACT

To meet the requirements of resource-limited video sensors, low-complexity video encoding techniques are highly desirable. In this paper, we propose a low-complexity power-scalable multi-view distributed video encoding scheme that exploits the correlations among video frames from adjacent video sensor nodes via robust media hashing extracted at the encoder, together with global motion parameters estimated at the decoder and fed back. In addition, the proposed video encoding scheme is power-scalable, adapting to the available power supply of the video sensor. The power-rate-distortion behavior of the proposed video encoding scheme is also analyzed in order to maximize the video quality under limited video sensor resources. The theoretical achievable minimum distortion (AMD) of the reconstructed video under a total power supply constraint for a video sensor is also derived. Based on the AMD estimation, a guideline is provided for deciding the power supply of each video sensor, based on the desired video quality or acceptable distortion, before deploying the wireless video sensor network.

Index Terms— Low-complexity video coding, multi-view distributed video coding, power-scalable video coding, wireless video sensor networks, power-rate-distortion analysis.

1. INTRODUCTION

With the availability of low-cost hardware such as CMOS cameras, wireless video sensor networks (WVSNs) have the potential to enable several emerging applications, such as security monitoring, emergency response, and environmental tracking [1]-[2]. In the WVSN shown in Fig. 1, video sensor nodes (VSNs) are scattered in a sensor field. Each VSN, equipped with a camera, can capture and encode visual information and deliver the compressed video data to an aggregation and forwarding node (AFN). The AFNs aggregate and forward the video data to the remote control unit (RCU), which can usually support a powerful decoder, for decoding and further processing [2]. However, each VSN operates under several resource constraints, including low computational capability, limited power supply, and narrow transmission bandwidth. Hence, in a WVSN, video compression and transmission are the two major concerns for a VSN. To achieve high coding efficiency, however, current video compression approaches usually perform complex encoding operations (e.g., motion estimation to exploit the temporal correlation of successive frames in the same view, or disparity estimation to exploit inter-view correlation) [3]-[6], which are not applicable to a VSN.

[Figure: a sensor field of video sensor nodes (VSNs) connected over wireless links to an aggregation and forwarding node (AFN), which reaches the remote control unit (RCU) via the Internet or satellite.]
Fig. 1. A wireless video sensor network (WVSN) architecture.

To meet the requirements of resource-limited VSNs, low-complexity video coding techniques have recently attracted considerable attention. A well-known approach is distributed video coding (DVC), in which individual frames are encoded independently but decoded jointly [7]-[21]. The computational burden at the encoder can thus be shifted to the decoder while preserving a certain coding efficiency.

_______________________________________________
+ This research was supported in part by the National Digital Archives Program sponsored by the National Science Council of Taiwan, ROC, under NSC Grants NSC 95-2422-H-001-031 and NSC 95-2221-E-001-022. *Corresponding author ([email protected])


DVC can consider either a single VSN (single view) [7]-[15] or adjacent VSNs together (multi-view) [16]-[21]. The major characteristic of multi-view DVC is that inter-VSN communication can be avoided during encoding to save energy.

Another recently popular approach is collaborative video coding and transmission [22]-[23]. First, each frame from a VSN is intra-encoded using a still-image encoder or an intra-frame video encoder (e.g., JPEG-2000 or the H.264/AVC intra-frame encoder [3]-[4]). While the encoded frames from adjacent VSNs are transmitted through the same intermediate node, that node performs an image-matching procedure to detect similar/overlapping regions between the frames. The overlapping regions are then encoded only once to compress the frames further. The major challenge here is how to identify the overlapping regions between two frames efficiently and accurately at an intermediate VSN under resource-limited constraints.

On the other hand, for a resource-limited VSN, optimal resource allocation for maximizing decoded video quality is highly desirable [2], [24]-[25]. First, one needs to design a power-scalable video encoder for a VSN that can adjust its computational complexity and power consumption under a power constraint. In [2], [24], a parametric complexity-scalable video encoder is developed by adjusting three encoding parameters according to an encoding complexity constraint. In addition, under a popular CMOS circuit design technique called dynamic voltage scaling (DVS), power scalability is equivalent to complexity scalability. An analytic power-rate-distortion (PRD) model is then developed to characterize the inherent relationship between the power consumption of the power-scalable video encoder and its rate-distortion performance. Based on the PRD analysis, the optimal power allocation between video encoding and wireless data transmission can be achieved.

In this paper, a low-complexity power-scalable multi-view distributed video encoding scheme, extended from our previous work [21], is proposed by exploiting the characteristics of the aforementioned two kinds of video encoding approaches with additional but limited inter-VSN communications. The PRD behavior of the proposed encoder is also analyzed in order to maximize the video quality under limited VSN resource allocation.

The remainder of this paper is organized as follows. Our robust media hashing technique [26] for inter-VSN communication during the encoding process is described in Sec. 2. The proposed low-complexity power-scalable multi-view video encoding scheme is described in Sec. 3. The PRD analysis for the proposed video encoding scheme is addressed in Sec. 4. Simulation results are presented in Sec. 5, followed by conclusions and future work in Sec. 6.

2. ROBUST MEDIA HASHING FOR INTER-VSN (VIDEO SENSOR NODE) COMMUNICATION

In the proposed video encoding scheme, to further reduce the encoding bit-rate and the transmission power, a limited number of media hash bits are exchanged among adjacent video sensor nodes. Our robust media hashing scheme, called the structural digital signature (SDS) [26], which efficiently extracts the most significant components of a frame (or an image block) and provides a compact representation of it, meets this requirement.

To exploit the SDS for image block encoding and reconstruction, the problem can be formulated as follows. For an image block B, its most significant components, extracted by comparing the SDS of B with that of its reference block B′, should be properly selected such that

PSNR(B, B̂) ≥ desired PSNR value, and  (1)

PSNR(B, B̂) >> PSNR(B, B′),  (2)

where PSNR denotes the peak signal-to-noise ratio, and B̂ is an estimate of B obtained by modifying B′ using the SDS of B.

To extract the SDS of an n×n image block, a J-scale discrete wavelet transform (DWT) is performed. Let w_{s,o}(x, y) denote the wavelet coefficient at scale s, orientation o, and position (x, y), where 0 ≤ s < J, 1 ≤ x ≤ n, and 1 ≤ y ≤ n. For each pair consisting of a parent node, w_{s+1,o}(x, y), and its four child nodes, w_{s,o}(2x + i, 2y + j), the maximum magnitude difference (max_mag_diff) is calculated as

max_mag_diff(x, y) = max_{0 ≤ i, j ≤ 1} | w_{s+1,o}(x, y) − w_{s,o}(2x + i, 2y + j) |.  (3)

Then, all parent-4 children pairs are sorted in decreasing order of their max_mag_diff values. The first L pairs in this order (L is called the hash length) are selected to construct the SDS of the block.

Once the significant parent-4 children pairs are selected, each pair is assigned a symbol representing the kind of relationship the pair carries. According to the interscale relationship among wavelet coefficients, there are four possible relationship types. Consider a parent node p and its child node c. When |p| ≥ |c|, the four possible sign relationships of the pair are (a) p ≥ 0, c ≥ 0; (b) p ≥ 0, c < 0; (c) p < 0, c ≥ 0; and (d) p < 0, c < 0. To make these relationships compact, relations (a) and (b) are merged into the signature symbol “+1” when p ≥ 0, with c ignored. Likewise, relations (c) and (d) are merged into the signature symbol “−1” when p < 0, with c ignored. That is, one keeps the sign of the larger node unchanged while ignoring the smaller one, under the constraint that their original interscale relationship is still preserved. Similarly, the signature symbols “+2”


and “−2” are defined under the constraint |p| < |c|. In summary, for each selected pair of a parent node p and its child node c with max_mag_diff in an image block B, the signature symbol Sym(B, p, c) is defined as

Sym(B, p, c) =
  +1, if |p| ≥ |c| and p ≥ 0;
  −1, if |p| ≥ |c| and p < 0;
  +2, if |p| < |c| and c ≥ 0;
  −2, if |p| < |c| and c < 0.  (4)

That is, an image block can be translated into a symbol sequence. Pairs not included in the SDS (outside the first L pairs in decreasing order) are labeled “0.” For an n×n image block translated into a symbol sequence (+1, −1, +2, −2, or 0), 2 bits and log2(n×n) bits are required to indicate each symbol and its parent node position, respectively. Symbols not explicitly indicated are “0” symbols. In total, L×(2 + log2(n×n)) bits are required to represent the SDS of length L for an image block.
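
To make the construction concrete, the following is a minimal Python sketch of SDS extraction per Eqs. (3)-(4), assuming PyWavelets is available; the function name, wavelet choice, and parameter defaults are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of SDS extraction for one image block (Eqs. (3)-(4)),
# assuming PyWavelets (pywt); names and defaults are illustrative only.
import numpy as np
import pywt

def extract_sds(block, J=3, L=16):
    """Return the L most significant (position, symbol) entries of a block's SDS."""
    # J-scale 2-D DWT; coeffs[0] is the approximation, coeffs[1..J] are
    # (horizontal, vertical, diagonal) detail tuples from coarse to fine.
    coeffs = pywt.wavedec2(block.astype(float), 'haar', level=J)
    pairs = []
    for k in range(1, J):            # coeffs[k] is one scale coarser than coeffs[k+1]
        for o in range(3):           # orientation: H, V, D
            parent, child = coeffs[k][o], coeffs[k + 1][o]
            for x in range(parent.shape[0]):
                for y in range(parent.shape[1]):
                    p = parent[x, y]
                    kids = child[2 * x:2 * x + 2, 2 * y:2 * y + 2].ravel()
                    # Eq. (3): maximum magnitude difference over the 4 children.
                    c = kids[np.argmax(np.abs(p - kids))]
                    mmd = np.max(np.abs(p - kids))
                    # Eq. (4): signature symbol from the sign of the larger node.
                    if abs(p) >= abs(c):
                        sym = +1 if p >= 0 else -1
                    else:
                        sym = +2 if c >= 0 else -2
                    pairs.append((mmd, (k, o, x, y), sym))
    pairs.sort(key=lambda t: -t[0])  # decreasing max_mag_diff, as in the text
    return [(pos, sym) for _, pos, sym in pairs[:L]]
```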

3. PROPOSED LOW-COMPLEXITY POWER-SCALABLE MULTI-VIEW VIDEO ENCODING SCHEME

Assume that there are NVSN (≥ 3) adjacent video sensor nodes (VSNs) observing the same target scene in a wireless video sensor network (WVSN). For each VSN, Vs, s = 0, 1, 2, …, NVSN − 1, the captured video sequence is divided into several groups of pictures (GOPs) with GOP size GOPSs, in which a GOP consists of a key frame, Ks,t, where t mod GOPSs = 0, followed by some non-key frames, Ws,t, where t mod GOPSs ≠ 0. An example of the GOP structure with NVSN = 3 is shown in Table 1. In the proposed video encoding scheme, shown in Fig. 2 (for two adjacent VSNs, Vi and Vj, observing the same target scene), each key frame is encoded using the H.264/AVC intraframe encoder [4] or the proposed key frame re-encoding scheme, while each non-key frame is encoded using the proposed non-key frame encoding scheme. In Fig. 2, the key frame Kj,t from Vj is re-encoded by treating the warped Ki,t at the same time instant t from Vi as its reference frame, while the non-key frame Wj,T from Vj is encoded via inter-VSN communication by treating the warped Ki,T at the same time instant T from Vi as its second reference frame. The details of the proposed video encoding scheme are addressed in the following subsections.

3.1. Proposed Multi-View Key Frame Encoding Scheme

For each video sequence captured in a VSN, the first key frame is encoded using the H.264/AVC intraframe encoder [4] and transmitted to the decoder (RCU). For a pair of key frames captured at the same time instant from adjacent VSNs, the difference between them (arising from their different viewing angles) can be

estimated via a global motion model [16], [18]-[19], [21]. However, the global motion estimation process is too complex to be performed in a VSN. Hence, global motion estimation is performed at the decoder for each pair of intra-decoded key frames from adjacent VSNs. The estimated motion parameters are transmitted back to the corresponding VSNs via a feedback channel for warping and encoding subsequent frames, as shown in the example of Fig. 3.

Table 1. A simple example of the GOP structure for a WVSN with NVSN = 3, where GOPS0 = 1, GOPS1 = 4, and GOPS2 = 2.

VSN / Time instant | t    | t+1    | t+2    | t+3    | t+4    | ...
V0                 | K0,t | K0,t+1 | K0,t+2 | K0,t+3 | K0,t+4 | ...
V1                 | K1,t | W1,t+1 | W1,t+2 | W1,t+3 | K1,t+4 | ...
V2                 | K2,t | W2,t+1 | K2,t+2 | W2,t+3 | K2,t+4 | ...

[Figure: on the encoder side in the WVSN, VSNs Vi and Vj capture key frames Ki,t, Kj,t, Ki,T and the non-key frame Wj,T; H.264/AVC intraframe encoding and key frame re-encoding are performed by an intermediate node, non-key frames are encoded via inter-VSN communication, and compressed video data flow through the AFN; on the decoder side in the RCU, global motion estimation is performed and the estimated motion parameters are sent back via a feedback channel.]

Fig. 2. A diagram for the proposed low-complexity multi-view distributed video encoding scheme.

[Figure: an example WVSN in which VSNs Vi and Vj observe the target scene; their key frames Ki,t and Kj,t reach the RCU (decoder) through an AFN and the Internet; the RCU performs global motion estimation between decoded Ki,t and Kj,t and sends back the motion parameters via a feedback channel; the intermediate node Vk performs hash-based key frame re-encoding for each non-first key frame pair.]

Fig. 3. An example for global motion estimation performed at the decoder.

For each VSN, each non-first key frame is also first encoded using the H.264/AVC intraframe encoder and then transmitted toward the RCU. When a pair of intra-encoded key frames, Ki,t and Kj,t (of size M×N), from the adjacent VSNs Vi and Vj at time instant t reaches the intermediate node Vk, Vk performs the proposed key frame re-encoding scheme (Fig. 4) to compress the key frames further. First, Vk performs intra-decoding to obtain K’i,t and K’j,t, respectively. To re-encode K’j,t by treating K’i,t as its reference frame, K’i,t is warped to the viewing angle of Vj, using the global motion parameters estimated from the previous key frame pair and fed back from the RCU, to obtain Ќi,t. Here, it is assumed that after a WVSN is completely deployed, each VSN is not allowed to change its location or viewing angle, and that the GOP size is not too large. Hence, the latest estimated global motion parameters preserve a certain accuracy.
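
The warping step can be sketched as follows, assuming the feedback channel delivers a 3×3 projective (homography) matrix; the use of OpenCV and of a homography model is an illustrative assumption, since the paper only specifies a global motion model [16], [18]-[19], [21].

```python
# A hedged sketch of warping a decoded key frame to the adjacent view,
# assuming the global motion model is a 3x3 homography fed back by the RCU.
import cv2
import numpy as np

def warp_key_frame(k_i, h_ij):
    """Warp decoded key frame k_i (from V_i) to the viewing angle of V_j.

    h_ij: 3x3 global motion matrix estimated at the RCU from the previous
    key frame pair and returned over the feedback channel.
    """
    height, width = k_i.shape[:2]
    return cv2.warpPerspective(k_i, h_ij, (width, height))
```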


Then, both K’j,t and Ќi,t are partitioned into non-overlapping blocks of size n×n. For each block B’j,t,b in K’j,t, the mean square error (MSE) between B’j,t,b and the co-located block B’i,t,b in Ќi,t is calculated, where b is the block index (step (a) in Fig. 4). If MSE(B’j,t,b, B’i,t,b) is smaller than a threshold TK1, B’j,t,b is skipped; if it is larger than a threshold TK2, B’j,t,b is intra-encoded. Here, TK1 and TK2 are two predefined positive thresholds with TK1 < TK2. Otherwise, the respective hashes defined in Eq. (4) for B’j,t,b and B’i,t,b, Sym(B’j,t,b, px,y, cx,y) and Sym(B’i,t,b, px,y, cx,y), are extracted and compared, where px,y and cx,y denote the positions of a parent node and of its child node with max_mag_diff, respectively (step (b) in Fig. 4). For each pair of symbols with the same px,y, if Sym(B’j,t,b, px,y, cx,y) differs from Sym(B’i,t,b, px,y, cx,y), the corresponding five wavelet coefficients for Sym(B’j,t,b, px,y, cx,y) are determined to be significant (step (c) in Fig. 4); otherwise, they are skipped. Finally, all the significant coefficients are quantized and entropy-encoded to form the bitstream for B’j,t,b. The corresponding decoding process can be easily performed at the decoder. After decoding Ki,t and Kj,t, the global motion parameters between them are estimated and fed back to the encoder for warping and encoding subsequent frames.
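
The three-way block classification above can be summarized by a minimal sketch; the function name and threshold handling are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the skip / intra / hash-compare decision used in key
# frame re-encoding; names and return labels are illustrative only.
import numpy as np

def classify_block(b_j, b_i, t_k1, t_k2):
    """Classify co-located n-by-n blocks b_j (current) and b_i (warped reference)."""
    assert t_k1 < t_k2, "TK1 and TK2 are positive thresholds with TK1 < TK2"
    mse = np.mean((b_j.astype(float) - b_i.astype(float)) ** 2)
    if mse < t_k1:
        return 'skip'    # reference block is close enough; transmit nothing
    if mse > t_k2:
        return 'intra'   # too different; fall back to intra coding
    return 'hash'        # compare SDS symbols and send only the wavelet
                         # coefficients whose symbols differ
```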

[Figure: V0 and V1 observe the target scene; at the intermediate node Vk, the intra-decoded K’0,48 is warped to Ќ0,48 and compared with K’1,48 through (a) co-located block MSE calculation and comparison, (b) block-based SDS extraction and comparison, and (c) significant wavelet coefficient extraction; the significant wavelet coefficients for K1,48 are quantized and entropy-encoded into the compressed bitstream for K1,48.]

Fig. 4. A diagram for key frame re-encoding.

3.2. Proposed Multi-View Non-Key Frame Encoding Scheme

For non-key frame encoding (Fig. 5), a hash-based multi-reference encoding scheme is proposed. To encode a non-key frame Wj,t, its nearest key frame Rj,t (e.g., Rj,t = Kj,t−1 or Kj,t+1) from the same VSN Vj is chosen as its “first” reference frame. Similar to key frame re-encoding, each block in Wj,t is compared with the co-located block in Rj,t by calculating their MSE (step (a) in Fig. 5). If the MSE is smaller than the threshold TN, the block is skipped. For each non-skipped block Bj,t,b in Wj,t, its SDS of length Lj,t,b is compared with that of the co-located block in Rj,t (step (b) in Fig. 5) to extract the “initial” significant symbols (step (c) in Fig. 5). Let the number of initial significant symbols for Bj,t,b be Lj,t,b,Init (Lj,t,b,Init < Lj,t,b). Then, the initial significant symbols are compared with the co-located symbols in the co-located block of the “second” reference frame, determined as follows.

While encoding Wj,t, Vj sends a message containing the parent node position of each initial significant symbol in all non-skipped blocks of Wj,t to its adjacent VSN Vi (which is encoding the key frame Ki,t at the same time instant t), announcing that it needs the second reference frame. Vi then warps Ki,t to the viewing angle of Vj, using the global motion parameters estimated from their previous key frame pair, to obtain Ќi,t. The SDS symbols of each corresponding block B’i,t,b in Ќi,t, which serves as the second reference for Wj,t, are transmitted to Vj. Each SDS symbol of B’i,t,b is taken at the parent node position of the corresponding initial significant symbol of Bj,t,b. Hence, the hash length for B’i,t,b is Lj,t,b,Init, and the transmitted SDS data size is 3×Lj,t,b,Init bits (3 bits suffice for the five possible symbols +1, −1, +2, −2, and 0). Note that the parent node position of each symbol in B’i,t,b is determined by the corresponding initial significant symbol in Bj,t,b and need not be transmitted. Usually, the hash length Lj,t,b,Init for block B’i,t,b in the second reference frame is kept relatively short to limit the inter-VSN communication overhead. However, the amount of hash data that can be transmitted from Vi to Vj depends heavily on the available data transmission power at Vi and the data reception power at Vj, which is partly addressed in Sec. 4.
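
As a back-of-the-envelope worked example of this overhead (the hash lengths below are illustrative assumptions; the block size matches the simulations of Sec. 5):

```python
# Compare the cost of signaling a block's own SDS (Sec. 2) with the cost of
# the second-reference hash exchange (Sec. 3.2); L and L_init are illustrative.
import math

n = 128         # block size used in the simulations of Sec. 5
L = 16          # hash length of the current block (illustrative)
L_init = 6      # number of initial significant symbols (illustrative)

# Own SDS: 2 bits per symbol plus log2(n*n) bits per parent node position.
own_sds_bits = L * (2 + int(math.log2(n * n)))

# Second-reference hash: positions are already known to the receiver,
# so only 3 bits per symbol are exchanged.
second_ref_bits = 3 * L_init

print(own_sds_bits, second_ref_bits)   # 256 vs. 18 bits per block
```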

[Figure: for V1's non-key frame W1,45, the nearest key frame R1,45 = K1,44 serves as the first reference via (a) co-located block MSE calculation and comparison, (b) block-based SDS extraction and comparison, and (c) initial significant symbol extraction; V0 warps K0,45 to Ќ0,45, whose SDS serves as the second reference via (d) block-based SDS extraction and comparison and (e) true significant symbol extraction; the significant wavelet coefficients for W1,45 are quantized and entropy-encoded into the compressed bitstream for W1,45.]

Fig. 5. A diagram for non-key frame encoding.

After receiving the SDS for Ќi,t from Vi, for each non-skipped block in Wj,t, the initial significant symbols are compared with the co-located symbols in Ќi,t (step (d) in Fig. 5) to extract the “true” significant symbols for Wj,t (step (e) in Fig. 5). That is, some initial significant symbols can be filtered out by comparison with the corresponding symbols in the second reference frame, at the cost of a little auxiliary information indicating that these symbols are predicted from the second reference frame. Usually, most symbols corresponding to the background region of Wj,t are filtered out by comparison with the first reference frame (Rj,t from the same VSN), while part of the symbols corresponding to the foreground (moving objects) are filtered out by comparison with the second reference frame (Ќi,t from the adjacent VSN). Finally, all coefficients corresponding to the true significant symbols are quantized and entropy-encoded to form the bitstream for Wj,t. The corresponding decoding process can be easily performed at the decoder.
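
A minimal sketch of this two-stage filtering follows; representing each SDS as a dict from parent position to symbol is an illustrative assumption about the data layout.

```python
# A minimal sketch of steps (b)-(e): two-stage significance filtering for
# one non-skipped block; the dict-based SDS layout is illustrative only.
def filter_symbols(sds_current, sds_first_ref, sds_second_ref):
    """Each argument maps a parent node position to its symbol (+1, -1, +2, -2)."""
    # Stage 1 (steps (b)-(c)): keep symbols that differ from the first
    # (same-VSN) reference; this mostly removes static background.
    initial = {pos: s for pos, s in sds_current.items()
               if sds_first_ref.get(pos) != s}
    # Stage 2 (steps (d)-(e)): of those, keep symbols that also differ from
    # the second (adjacent-VSN, warped) reference; this removes foreground
    # that the other view already predicts.
    true_sig = {pos: s for pos, s in initial.items()
                if sds_second_ref.get(pos) != s}
    return initial, true_sig
```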


3.3. Power-Scalability of the Proposed Video Encoding Scheme

For a battery-powered VSN, it is essential to adjust the encoding operations based on the available power supply so as to maximize power efficiency and video quality. Similar to [2], [24], to analyze and control the power consumption of a VSN, a CMOS circuit design technique for mobile devices, called dynamic voltage scaling (DVS), is assumed in the design of the VSNs in this study. It is claimed in [2], [24] that the power consumption of a video encoder can be controlled by adjusting its computational complexity; that is, the computational complexity of a video encoder can be translated into its power consumption. Hence, under DVS, power scalability is equivalent to complexity scalability.

Similar to the parametric power-scalable video encoder developed in [2], [24], the proposed video encoder is scalable in computational complexity and power consumption, achieved by adjusting the three parameters TK1, TK2, and TN, which control the numbers of skipped blocks in the key frame re-encoding process and the non-key frame encoding process, respectively. The larger these three parameters are, the more blocks are skipped, i.e., the lower the computational complexity and power consumption. To maximize the rate-distortion performance under the power constraint, the encoder should perform an optimal computational power allocation based on the power-rate-distortion model analyzed in the following section.
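
One simple way to realize this control loop is a monotone mapping from the available encoding-power fraction to the three thresholds; the linear mapping and threshold ranges below are purely illustrative assumptions, since the paper states only the direction of the relationship.

```python
# A hedged sketch of power-scalable threshold control: larger (TK1, TK2, TN)
# skip more blocks and lower complexity; the linear map is illustrative.
def thresholds_for_power(p_fraction, t_min=(50, 400, 50), t_max=(400, 2000, 400)):
    """Map an encoding-power fraction in [0, 1] to (TK1, TK2, TN).

    Less available power -> larger thresholds -> more skipped blocks.
    """
    assert 0.0 <= p_fraction <= 1.0
    return tuple(hi - p_fraction * (hi - lo) for lo, hi in zip(t_min, t_max))
```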

4. POWER-RATE-DISTORTION ANALYSIS

The performance of the proposed video encoder can be evaluated by the quality of the video transmitted from a video sensor node (VSN) to the AFN. The video quality, evaluated by the end-to-end distortion, can be defined as

D = Ds + Dt,  (5)

where Ds and Dt are the distortions caused by lossy video compression and by transmission errors, respectively. Ds and Dt are given by

Ds = MSE(I, Î) and Dt = MSE(Î, Ĩ),  (6)

where I, Î, and Ĩ are the original frame, the error-free decoded version of I, and the reconstructed version of I at the AFN, respectively.

For a VSN, let Ps, Pt, Pr, and P0 denote the power used for video encoding, data transmission, and data reception, and the total power budget, respectively, where Ps + Pt + Pr ≤ P0. Integrating the characteristics of the proposed video encoder with [2], Ds for a non-key frame is a function of Ps, Pr, and the coding bitrate Rs, while Dt is a function of Pt and the transmission distance d. Hence, the objective of maximizing the video quality of a non-key frame under the power constraint can be formulated as

min_{Ps, Pr, Pt, Rs}  D = Ds(Rs, Ps, Pr) + Dt(Pt, d),  (7)

s.t. Ps + Pr + Pt ≤ P0. In addition, Ds for a non-key frame can be derived as

Ds(Rs, Ps, Pr) = σ² · e^{−γ·Rs·[g(Ps) + Pr(RH)]},  (8)

where Ps ∈ [0, 1] and Rs ≥ 0. In Eq. (8), σ² denotes the distortion when no resource is available, i.e., Ps = 0 and Rs = 0, and γ is a model parameter derived from previous actual rate-distortion (RD) measurements via Eq. (8). The function g(Ps) is derived from the power consumption model of the DVS technology [2], and

g(Ps) = Ps^{2/3}.  (9)

Pr(RH) is the power used for receiving the hash (SDS) data at bit rate RH from the adjacent VSN when encoding a non-key frame, where

Pr(RH) = α·RH,  (10)

and α = 135 nJ/bit [27]. The RD curves for different encoding power levels and hash-reception power levels, derived from Eq. (8) together with actual measurements for the Ballroom sequence [28], are shown in Fig. 6. Note that Fig. 6 represents the power-rate-distortion (PRD) behavior only for non-key frames, exclusive of key frames, because the encoding behavior of key frames differs from that of non-key frames. It can be seen from Fig. 6 that as the encoding power and hash-reception power decrease, the Ds curve becomes flatter, i.e., the video coding efficiency is reduced. That is, when the encoder lacks sufficient encoding power and sufficient hash data from the adjacent VSN, additional bitrate cannot be exploited efficiently. Based on the actual measurements in Fig. 6, the Ds function in Eq. (8) is fairly accurate. Note that in Fig. 6, Ps and Pr for each curve are adjusted to approximately the same level, rather than to the same value; cases where Ps and Pr are simultaneously at very different levels are not considered. It is reasonable to assume that if a VSN can receive a large amount of hash data from its adjacent VSN, it should spend correspondingly high encoding power to exploit these hash data fully and perform the encoding operations.
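
A minimal numerical sketch of the PRD model in Eqs. (8)-(10) follows; the σ² and γ values are illustrative assumptions, not fitted to the paper's measurements.

```python
# A hedged sketch of the non-key frame distortion model, Eqs. (8)-(10);
# sigma2 and gamma below are illustrative, not the paper's fitted values.
import math

ALPHA = 135e-9   # J/bit, hash reception cost per bit [27]

def ds_model(r_s, p_s, r_h, sigma2=110.0, gamma=8.0):
    """Eq. (8): compression distortion for a non-key frame.

    r_s : coding bitrate (bpp); p_s : normalized encoding power in [0, 1];
    r_h : received hash bitrate (bits).
    """
    g = p_s ** (2.0 / 3.0)      # Eq. (9): DVS power-consumption model
    p_r = ALPHA * r_h           # Eq. (10): hash reception power
    return sigma2 * math.exp(-gamma * r_s * (g + p_r))
```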

It should be noted that the current PRD analysis considers only the proposed non-key frame encoding scheme. That is, only three tasks are considered: (i) video encoding, (ii) reception of the hash data from the adjacent VSN, and (iii) transmission of the compressed video bitstream out of the VSN. While a VSN encodes a key frame, it may also transmit hash data for the current key frame to its adjacent VSN to assist non-key frame encoding there. The PRD behavior of this situation is more complex and will be analyzed in future work.

In [2], it is shown that there exists an achievable minimum distortion, which follows from two observations: (1) if (Ps + Pr) decreases, Ds increases; (2) if (Ps + Pr) increases, Pt must decrease because Ps + Pr + Pt ≤ P0 (P0 is fixed), so fewer bits can be transmitted to the AFN and Ds again increases. That is, (Ps + Pr) should be neither too low nor too high, and there exists an optimal power (Ps + Pr) minimizing Ds. For simplicity, as in [2], transmission errors are ignored, i.e., Dt = 0 and D = Ds. Considering P0 = Ps + Pr + Pt and

Pt = η(d)·Rs, with η(d) = β + δ·d^υ,  (11)

where β = 45 nJ/bit, δ = 10 pJ/bit/m², and υ = 2 [2], [27], one obtains Rs = (P0 − Ps − α·RH)/η(d). Based on Eq. (8), one can derive

D = Ds(Ps, (P0 − Ps − α·RH)/η(d), Pr).  (12)

Based on Eqs. (8) and (12), we can derive

D = σ² · e^{−γ·[(P0 − Ps − α·RH)/η(d)]·(Ps^{2/3} + α·RH)}.  (13)

It can be found that D is a function of Ps. The minimum D occurs where the derivative with respect to Ps is zero, i.e.,

dD/dPs = 0 when 0 < Ps < P0.  (14)

By making use of Eqs. (13) and (14), we can derive

(γ/η(d)) · d/dPs [Ps^{2/3}·(P0 − Ps − α·RH)] = 0.  (15)

Then,

(γ/η(d)) · [(2/3)·Ps^{−1/3}·(P0 − Ps − α·RH) − Ps^{2/3}] = 0  (16)

and

(P0 − α·RH)/Ps = 5/2.  (17)

Finally, we can obtain

Ps = (2/5)·(P0 − α·RH).  (18)

[Figure: distortion (MSE) versus bitrate (bpp) curves at (Ps, Pr) levels of 5%, 20%, 70%, and 100%, together with actual measurements.]
Fig. 6. RD curves for different encoding power and hash-reception power levels.

According to the above derivations, we can find the minimized D, which is called the achievable minimum distortion (AMD) [2]. The AMD gives the theoretically achievable minimum distortion, or optimal video quality, for a given power supply of a VSN. The distortion function D(Ps) of Eq. (13) is plotted in Fig. 7 for an assumed P0 = 0.3 W [2] for a VSN and a fixed hash bitrate RH. It can be observed from Fig. 7 that D(Ps) attains the AMD at Ps ≈ 0.12 W; this optimal Ps can also be obtained via Eq. (18). When the hash bitrate RH changes, the optimal Ps and the AMD change accordingly. The AMD as a function of P0 is plotted in Fig. 8: as the power supply P0 of a VSN increases, the AMD decreases. The AMD estimation thus provides a guideline for deciding the power supply P0 of a VSN, based on the desired video quality or acceptable distortion, before deploying the WVSN.
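
A short sketch that evaluates Eq. (13) and the closed-form optimum of Eq. (18) follows; the σ², γ, distance d, and RH values are illustrative assumptions chosen only to reproduce the qualitative behavior of Fig. 7.

```python
# A hedged sketch of the AMD computation: Eq. (18) for the optimal encoding
# power and Eq. (13) for the resulting distortion; constants from [2], [27].
import math

ALPHA, BETA, DELTA, UPSILON = 135e-9, 45e-9, 10e-12, 2

def optimal_ps(p0, r_h):
    """Eq. (18): encoding power minimizing D for power budget p0 (W)."""
    return 0.4 * (p0 - ALPHA * r_h)

def distortion(p_s, p0, r_h, d=100.0, sigma2=150.0, gamma=2e-6):
    """Eq. (13); sigma2, gamma, and distance d are illustrative."""
    eta = BETA + DELTA * d ** UPSILON          # Eq. (11), J/bit
    r_s = (p0 - p_s - ALPHA * r_h) / eta       # bits the budget affords
    return sigma2 * math.exp(-gamma * r_s * (p_s ** (2 / 3) + ALPHA * r_h))

p0, r_h = 0.3, 1e4
p_star = optimal_ps(p0, r_h)
print(p_star)                                  # ~0.12 W, matching Fig. 7
print(distortion(p_star, p0, r_h))             # the AMD under these assumptions
```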

Fig. 7. The curve of D(Ps) in Eq. (13) based on given fixed power budget P0 = 0.3 W.

5. SIMULATION RESULTS

Multi-view video sequences [28] with eight views (frame size 640×480, 250 frames per view, block size 128×128, YUV 4:2:0, frame rate 10 frames per second (fps)) were used at different bitrates to evaluate the proposed video encoding scheme. The first three views (video sensor nodes (VSNs)), V0, V1, and V2, were structured based on Table 1. Based on [28], the distance between adjacent VSNs is 19.5 cm. The H.264/AVC interframe coding with default settings (full-search motion estimation with 5 reference frames, and only one intraframe (I frame) followed by several interframes (P frames)) [4], the H.264/AVC intraframe coding [4], our previous single-view low-complexity video codec (denoted Single) [15], and our previous single-reference-frame multi-view low-complexity video codec without key frame re-encoding (denoted Multi) [21] were employed for comparison with the proposed video coding scheme. The latter three comparison approaches and the proposed scheme are all of low complexity. The rate-distortion (RD) performances for V1 (the second view) are shown in Figs. 9 and 10 for the Ballroom and Exit video sequences, respectively. Note that Figs. 9 and 10 show the RD performance of the proposed video encoding scheme without considering the power consumption issue.

Fig. 8. The curve of the AMD as a function of P0.

It can be observed from Figs. 9 and 10 that the proposed scheme outperforms the three low-complexity approaches in PSNR, especially for sequences with larger motion (e.g., Ballroom). For the Ballroom sequence with large motion, the performance gaps among these approaches are more significant: in the proposed video encoding scheme, some moving regions in a non-key frame can be encoded efficiently with the assistance of the second reference frame from an adjacent VSN, whereas the single-view [15] and single-reference-frame [21] comparison approaches cannot exploit the advantages of multiple views and multiple reference frames. On the other hand, for the Exit sequence with relatively small motion, the performance gaps among these approaches are small. The performance gap between the proposed scheme and H.264/AVC interframe coding is about 1 dB, while the computational complexity of the proposed scheme is much lower than that of H.264/AVC interframe coding.

6. CONCLUSIONS AND FUTURE WORK

In this paper, a low-complexity power-scalable multi-view distributed video encoding scheme for wireless video sensor networks is proposed. Based on the global motion parameters estimated at the decoder and fed back to the encoder, the key frames from different video sensor nodes (VSNs) can be further re-encoded. Based on the fed-back global motion parameters and low-complexity hash-based inter-VSN communication, a non-key frame can be encoded efficiently from two reference frames. In addition, the power-rate-distortion (PRD) behavior of the proposed video encoder is analyzed to maximize video quality under the power constraint. The achievable minimum distortion (AMD), i.e., the theoretically minimum distortion or optimal video quality for a given power supply of a VSN, is also derived. Based on the AMD estimation, a guideline is provided for deciding the power supply of each VSN, based on the desired video quality or acceptable distortion, before deploying the wireless video sensor network.

For future research, transmission errors should be considered so that more accurate, though more complex, PRD behavior can be analyzed. In addition, error-resilient encoding, error concealment, and secure transmission techniques for the proposed video encoding scheme should also be developed.

[Figure: PSNR (dB) versus bitrate (kbps) for H.264 Inter (GOP = ∞), Proposed (GOP = 4), Multi (GOP = 4), Single (GOP = 4), and H.264 Intra (GOP = 1).]
Fig. 9. RD comparison for the Ballroom sequence.

Fig. 10. RD comparison for the Exit sequence.

REFERENCES

[1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, “A survey on wireless multimedia sensor networks,” Computer Networks, vol. 51, pp. 921-960, 2007.


[2] Z. He and D. Wu, “Resource allocation and performance analysis of wireless video sensors,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 5, pp. 590-599, May 2006.

[3] T. Sikora, “Trends and perspectives in image and video coding,” Proceedings of the IEEE, vol. 93, no. 1, pp. 6-17, Jan. 2005.

[4] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[5] M. Drose, C. Clemens, and T. Sikora, “Extending single-view scalable video coding to multi-view based on H.264/AVC,” in Proc. of IEEE Int. Conf. on Image Processing, Oct. 2006, Atlanta, GA, USA, pp. 2977-2980.

[6] M. Flierl, A. Mavlankar, and B. Girod, “Motion and disparity compensated coding for multi-view video,” accepted and to appear in IEEE Trans. on Circuits and Systems for Video Technology, special issue on Multiview Video Coding, 2007 (invited paper).

[7] B. Girod, A. M. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,” Proceedings of the IEEE, vol. 93, no. 1, pp. 71-83, Jan. 2005.

[8] R. Puri, A. Majumdar, P. Ishwar, and K. Ramchandran, “Distributed video coding in wireless sensor networks,” IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 94-106, July 2006.

[9] Z. Li, L. Liu, and E. J. Delp, “Wyner-Ziv video coding with universal prediction,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 11, pp. 1430-1436, Nov. 2006.

[10] Z. Li, L. Liu, and E. J. Delp, “Rate distortion analysis of motion side estimation in Wyner-Ziv video coding,” IEEE Trans. on Image Processing, vol. 16, no. 1, pp. 98-113, Jan. 2007.

[11] M. Maitre, C. Guillemot, and L. Morin, “3-D model-based frame interpolation for distributed video coding of static scenes,” IEEE Trans. on Image Processing, vol. 16, no. 5, pp. 1246-1257, May 2007.

[12] M. Tagliasacchi, A. Majumdar, K. Ramchandran, and S. Tubaro, “Robust wireless video multicast based on a distributed source coding approach,” Signal Processing, vol. 86, pp. 3196-3211, 2006.

[13] Y. Tonomura, T. Nakachi, and T. Fujii, “Distributed video coding using JPEG 2000 coding scheme,” IEICE Trans. on Fundamentals, vol. E90-A, no. 3, pp. 581-589, March 2007.

[14] L. W. Kang and C. S. Lu, “Wyner-Ziv video coding with coding mode-aided motion compensation,” in Proc. of IEEE Int. Conf. on Image Processing, Atlanta, GA, USA, Oct. 2006, pp. 237-240.

[15] L. W. Kang and C. S. Lu, “Low-complexity Wyner-Ziv video coding based on robust media hashing,” in Proc. of IEEE Int. Workshop on Multimedia Signal Processing, Victoria, BC, Canada, Oct. 2006, pp. 267-272.

[16] X. Guo, Y. Lu, F. Wu, W. Gao, and S. Li, “Distributed multi-view video coding,” in Proc. of SPIE Visual Communications and Image Processing, vol. 6077, San Jose, CA, USA, Jan. 2006.

[17] X. Artigas, E. Angeli, and L. Torres, “Side information generation for multiview distributed video coding using a fusion approach,” in Proc. of Nordic Signal Processing Symposium, Reykjavik, Iceland, June 2006.

[18] M. Ouaret, F. Dufaux, and T. Ebrahimi, “Fusion-based multiview distributed video coding,” in Proc. of ACM Int. Workshop on Video Surveillance and Sensor Networks, Santa Barbara, CA, USA, Oct. 2006, pp. 139-144.

[19] F. Dufaux, M. Ouaret, and T. Ebrahimi, “Recent advances in multi-view distributed video coding,” in Proc. of SPIE Mobile Multimedia/Image Processing for Military and Security applications, vol. 6579, pp. 657902, 2007.

[20] C. Yeo and K. Ramchandran, “Robust distributed multi-view video compression for wireless camera networks,” in Proc. of SPIE Visual Communications and Image Processing, vol. 6508, Jan. 2007.

[21] L. W. Kang and C. S. Lu, “Multi-view distributed video coding with low-complexity inter-sensor communication over wireless video sensor networks,” accepted and to appear in Proc. of IEEE Int. Conf. on Image Processing, special session on Distributed Source Coding II: Distributed Image and Video Coding and Their Applications, San Antonio, TX, USA, 2007 (invited paper).

[22] M. Wu and C. W. Chen, “Collaborative image coding and transmission over wireless sensor networks,” EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 70481, 9 pages, 2007 (special issue on Visual Sensor Networks).

[23] K. Y. Chow, K. S. Lui, and E. Y. Lam, “Efficient on-demand image transmission in visual sensor networks,” EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 95076, 11 pages, 2007 (special issue on Visual Sensor Networks).

[24] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, “Power-rate-distortion analysis for wireless video communication under energy constraints,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 645-658, May 2005.

[25] D. S. Turaga, M. van der Schaar, and B. Pesquet-Popescu, “Complexity scalable motion compensated wavelet video encoding,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 8, pp. 982-993, Aug. 2005.

[26] C. S. Lu and H. Y. M. Liao, “Structural digital signature for image authentication: an incidental distortion resistant scheme,” IEEE Trans. on Multimedia, vol. 5, no. 2, pp. 161-173, June 2003.

[27] M. Bhardwaj and A. P. Chandrakasan, “Bounding the lifetime of sensor networks via optimal role assignments,” in Proc. of IEEE INFOCOM, vol. 3, 2002, pp. 1587-1596.

[28] Mitsubishi Electric Research Laboratories, “MERL multi-view video sequences,” ftp://ftp.merl.com/pub/avetro/mvc-testseq.
