This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Reconfigurable CORDIC-Based Low-Power DCT Architecture Based on Data Priority

Min-Woo Lee, Student Member, IEEE, Ji-Hwan Yoon, Student Member, IEEE, and Jongsun Park, Senior Member, IEEE

Abstract— This paper presents a low-power coordinate rotation digital computer (CORDIC)-based reconfigurable discrete cosine transform (DCT) architecture. The main idea of this paper is based on the fact that not all the computations in the DCT are equally important in generating the frequency-domain outputs. Considering the differences in importance among the DCT coefficients, the number of CORDIC iterations can be dynamically changed to efficiently trade off image quality for power consumption. Thus, the computational energy can be significantly reduced without seriously compromising the image quality. The proposed CORDIC-based 2-D DCT architecture is implemented using a 0.13 µm CMOS process, and the experimental results show that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the CORDIC-based Loeffler DCT at the cost of minor image quality degradation.

Index Terms— Coordinate rotation digital computer (CORDIC), data priority, discrete cosine transform (DCT), low-power, reconfigurable architecture.

I. INTRODUCTION

WITH THE explosive growth of multimedia services running on portable applications, the demand for low-power implementations of complex signal processing algorithms is increasing tremendously. The most significant part of multimedia systems is the applications involving image and video processing, which are very computationally intensive and thus should be implemented at low cost because of the limited battery lifetime of portable devices. Many previous research efforts have focused on reducing the power dissipation of image and video applications [1]–[3]. The low-power design of the discrete cosine transform (DCT) [4] has been of particular interest, since the DCT is one of the most computationally intensive operations in video and image compression, and it is widely adopted in many standards such as JPEG [5], MPEG [6], and H.264 [7].

Manuscript received May 26, 2012; revised November 1, 2012 and February 8, 2013; accepted April 21, 2013. This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology under Grant 2010-0004484, and also supported by the National Research Foundation of Korea under Grant funded by the Korea Government (MEST) under Grant 2011-0020128.

M.-W. Lee was with the School of Electrical Engineering, Korea University, Seoul 110-810, Korea. He is now with DTV SoC Development Team, SIC R&D Lab., LG Electronics Co., Seoul 157-030, Korea (e-mail: [email protected]).

J.-H. Yoon and J. Park are with the School of Electrical Engineering, Korea University, Seoul 136-701, Korea (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2013.2263232

Since it was first proposed in 1959 [8], the coordinate rotation digital computer (CORDIC) has been widely used to calculate trigonometric functions in signal processing applications such as QR decomposition [9], fast Fourier transform [10], and singular value decomposition [11], [12]. Because CORDIC can be implemented simply with iterative additions and shifts, it has been widely used for multiplierless low-power DCT architectures [13]–[18]. Many previous research works focused on reducing the hardware complexity of the DCT, such as the distributed arithmetic (DA)-based DCT [19] and the multiple constant multiplication (MCM)-based approach [20]. Although the bit-serial DA-based approach offers a regular and simple DCT architecture, a large hardware area is needed for bit-parallel operation because of additional ROMs and control logic. The MCM-based DCT [20] can be implemented with a smaller number of shift-and-add operations; however, to make a tradeoff between image quality and computation energy, the computation sharing in different datapaths must be completely reconsidered. In the low-power CORDIC-based DCT architecture presented in [14], data correlations between neighboring pixels are used to skip internal CORDIC iterations. Approximation techniques, or incorporating compensation steps into the quantization, are also exploited to reduce the power consumption of the CORDIC-based DCT architecture [16]. Most of the previous research works mainly focus on reducing the number of arithmetic units; the inherent data priorities in DCT coefficients, however, have not been exploited in the CORDIC-based DCT.

In the DCT, not all the computations are equally important in generating the frequency-domain outputs (DCT coefficients). In other words, some of the computations in the DCT are critical for determining the output image quality, while others play relatively less important roles. This property can be used to provide the right tradeoff between the output image quality and power dissipation [21]–[24]. In this paper, we present a low-power CORDIC-based DCT architecture, where the differences in importance among the DCT coefficients are exploited to achieve power savings with minimum image quality degradation. To apply the priority-based data processing, lookahead CORDIC architectures [25]–[27] are adopted to overcome the inherent data dependencies in the conventional CORDIC architecture. Thus, the number of CORDIC iterations is dynamically controlled according to the importance of the DCT coefficients, by which considerable power savings are achieved.

The rest of this paper is organized as follows. The basics of the CORDIC algorithm and the conventional CORDIC-based DCT are presented in Section II. The proposed low-power CORDIC-based DCT architecture and its hardware implementation are presented in Section III. Based on the proposed DCT architecture, a reconfigurable CORDIC-based DCT is presented in Section IV. Finally, conclusions are drawn in Section V.

1063-8210/$31.00 © 2013 IEEE

II. CONVENTIONAL CORDIC-BASED DCT ARCHITECTURE

A. CORDIC Architecture

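The basic principle of CORDIC is to iteratively rotate a vector using a rotation matrix [8], which is represented as follows:

$$
\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix}
=
\begin{bmatrix}
x_{i-1} - \sigma_i 2^{1-i} y_{i-1} \\
y_{i-1} + \sigma_i 2^{1-i} x_{i-1} \\
z_{i-1} - \sigma_i \alpha_i
\end{bmatrix}
\tag{1}
$$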

where x and y are the vector coordinate components along the x and y axes, respectively, i is the iteration step, σ is the sign-bit, which can be +1 or −1 and indicates the direction of the vector rotation, z is the accumulated rotation angle, and α is the predefined angle value of each microrotation step, $\alpha_i = \arctan(2^{1-i})$. In the CORDIC architecture, the amplitude and argument of a given vector can be calculated using the vectoring mode, while the sine and cosine values of a given angle are obtained with the rotation mode [28]. The hardware architecture of the CORDIC iteration is shown in Fig. 1, which is referred to as the crossing-architecture in the following.
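The iteration in (1) can be sketched in software as follows. This is a minimal floating-point illustration of the rotation mode, not the hardware datapath of Fig. 1. The function name `cordic_rotate`, the choice of 16 iterations, and the rule of taking the sign-bit from the sign of the residual angle are assumptions made for this example.

```python
import math

def cordic_rotate(x0, y0, angle, n=16):
    """Rotate (x0, y0) by `angle` radians using n micro-rotations of (1)."""
    x, y, z = x0, y0, angle
    gain = 1.0
    for i in range(1, n + 1):
        shift = 2.0 ** (1 - i)            # 2^(1-i); a right shift in hardware
        alpha = math.atan(shift)          # predefined micro-rotation angle
        sigma = 1.0 if z >= 0 else -1.0   # assumed rule: rotate toward the residual angle
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * alpha
        gain /= math.sqrt(1.0 + shift * shift)  # per-step magnitude compensation
    return x * gain, y * gain

xr, yr = cordic_rotate(1.0, 0.0, math.radians(30.0))
print(xr, yr)  # close to cos(30 deg) and sin(30 deg)
```

Each loop pass uses only a shift, two additions, and a table lookup for the angle, which is why the iteration maps naturally onto multiplierless hardware.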

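1) Lookahead CORDIC Approach: In the CORDIC equation shown in (1), the results of the previous iteration must be computed before the output of the current stage can be calculated. These data dependencies are the main performance bottleneck in conventional CORDIC hardware. To overcome the data dependencies, the lookahead CORDIC [25]–[27] was developed, where lookahead means that a number of CORDIC iterations are computed ahead so that the iterations finish at one time. An example of a four-iteration-step lookahead CORDIC [25]–[27] is shown in (2). It is noteworthy that if the sign-bits $\sigma_k$ $(k = 1, \ldots, 4)$ are known ahead, the following stage iterations can be computed directly from the input vectors of the present stage iteration without computing the intermediate results:

$$
\begin{bmatrix} x_4 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1} \\
-\sigma_1\sigma_4 2^{-2} \\
+\sigma_1\sigma_2\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1 2^{0} \\
+\sigma_1\sigma_2\sigma_3 2^{-1} \\
+\sigma_1\sigma_2\sigma_4 2^{-2} \\
+\sigma_1\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
\\[2ex]
\begin{bmatrix}
+\sigma_1 2^{0} \\
-\sigma_1\sigma_2\sigma_3 2^{-1} \\
-\sigma_1\sigma_2\sigma_4 2^{-2} \\
-\sigma_1\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1} \\
-\sigma_1\sigma_4 2^{-2} \\
+\sigma_1\sigma_2\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}.
\tag{2}
$$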

2) Scale-Factor in CORDIC Operations: In the CORDIC operation, the magnitude of the rotated vector is scaled and accumulated after every iteration according to the following equation:

$$
K_i = \frac{1}{\sqrt{1 + 2^{2(1-i)}}}.
\tag{3}
$$

Fig. 1. Hardware architecture of CORDIC iteration.

After a series of iterations, the accumulated $K_i$ value in (3) converges to a constant as follows:

$$
K(n) = \prod_{i=1}^{n} K_i = \prod_{i=1}^{n} \frac{1}{\sqrt{1 + 2^{2(1-i)}}}
\;\Rightarrow\;
\lim_{n \to \infty} K(n) \approx 0.60725\ldots
\tag{4}
$$

where n is the number of iterations. The constant above is the scale-factor used to restore the scaled magnitude of the rotated vector. The scale-factor is determined by the number of iterations. In the following sections, we use a low-power CORDIC architecture obtained by modifying the number of iterations, where the vector rotates to the target angle in only one direction. The corresponding scale-factor must be modified as well according to the iterations. The scale-factor is discussed further in Section III-A.
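As a quick numerical check of (3) and (4), the product can be evaluated for increasing n. The helper name `cordic_gain` is introduced here only for illustration.

```python
import math

def cordic_gain(n):
    """K(n) from (4): product of K_i = 1/sqrt(1 + 2^(2(1-i))) for i = 1..n."""
    k = 1.0
    for i in range(1, n + 1):
        k /= math.sqrt(1.0 + 2.0 ** (2 * (1 - i)))
    return k

for n in (1, 2, 4, 8, 16):
    print(n, round(cordic_gain(n), 6))
```

The value settles after only a few iterations, because the later factors differ from 1 by roughly $2^{-2i}$. This is also why skipping iterations, as done in Section III, changes the required scale-factor.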

B. CORDIC-Based DCT Architecture

The 2-D DCT process is decomposed into a 1-D DCT (row DCT) followed by another 1-D DCT (column DCT), which is expressed as the following equation:

$$
Y = T x T^{T} = T \left( T x^{T} \right)^{T}
\tag{5}
$$

where x and Y are the 8 × 8 image data matrix and the 2-D DCT transformed output matrix, respectively, and T is the 8 × 8 1-D DCT basis matrix. The 2-D DCT process with separable 1-D DCT is shown in Fig. 2.
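The row-column decomposition in (5) can be illustrated with a small pure-Python sketch. The basis entries follow the 8-point transform definition $T[k][i] = \frac{c(k)}{2}\cos\frac{(2i+1)k\pi}{16}$ used in this paper. The helper names (`dct_basis`, `matmul`, `transpose`, `dct2`) are assumptions of this example, and floating-point arithmetic stands in for the fixed-point hardware.

```python
import math

N = 8

def dct_basis():
    """Build T with T[k][i] = c(k)/2 * cos((2i+1)k*pi/16)."""
    T = []
    for k in range(N):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        T.append([ck / 2.0 * math.cos((2 * i + 1) * k * math.pi / 16.0)
                  for i in range(N)])
    return T

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dct2(block):
    """Y = T (T x^T)^T: 1-D DCT on every row, then on every column."""
    T = dct_basis()
    row_pass = transpose(matmul(T, transpose(block)))  # (T x^T)^T = x T^T
    return matmul(T, row_pass)

flat = [[1.0] * N for _ in range(N)]
print(round(dct2(flat)[0][0], 6))  # all energy of a constant block lands in the DC term
```

Because the same 1-D transform is applied twice, a hardware implementation only needs one 1-D DCT design plus a transpose stage.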

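The 8 × 8 1-D DCT transform is expressed as

$$
X(k) = \frac{c(k)}{2} \sum_{i=0}^{7} x(i) \cos\left( \frac{(2i+1)k\pi}{16} \right),
\quad k = 0, 1, 2, \ldots, 7,
\qquad
c(k) =
\begin{cases}
1/\sqrt{2}, & k = 0 \\
1, & \text{otherwise}
\end{cases}
\tag{6}
$$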

where x(i) is the input data and X(k) is the 1-D DCT transformed output data. In vector-matrix form, the 1-D DCT is represented as $X = T x^{T}$, where T is the 8 × 8 DCT basis matrix, and X and x are the output and input vectors, respectively. Since the 8 × 8 DCT basis matrix T has a symmetric property, the 1-D DCT transform is represented as follows:

$$
\begin{aligned}
\begin{bmatrix} X(0) \\ X(4) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_4 & c_4 \\ c_4 & -c_4 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) + x(3) + x(4) \\ x(1) + x(6) + x(2) + x(5) \end{bmatrix}
\\[1ex]
\begin{bmatrix} X(2) \\ X(6) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_2 & c_6 \\ c_6 & -c_2 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) - x(3) - x(4) \\ x(1) + x(6) - x(2) - x(5) \end{bmatrix}
\\[1ex]
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix}
c_1 & c_3 & c_5 & c_7 \\
c_3 & -c_7 & -c_1 & -c_5 \\
c_5 & -c_1 & c_7 & c_3 \\
c_7 & -c_5 & c_3 & -c_1
\end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix}
\end{aligned}
\tag{7}
$$

Fig. 2. 8 × 8 2-D DCT processor with separable 1-D DCT.

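where $c_k = \cos(k\pi/16)$. The cosine elements in (7) can be changed into sine elements through the trigonometric symmetry property, and (7) can be rearranged as the following equations:

$$
\begin{aligned}
\begin{bmatrix} X(4) \\ X(0) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_4 & -s_4 \\ s_4 & c_4 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) + x(3) + x(4) \\ x(1) + x(6) + x(2) + x(5) \end{bmatrix}
\\[1ex]
\begin{bmatrix} X(6) \\ X(2) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_6 & -s_6 \\ s_6 & c_6 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) - x(3) - x(4) \\ x(1) + x(6) - x(2) - x(5) \end{bmatrix}
\\[1ex]
\begin{bmatrix} X(1) \\ X(7) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_7 & s_7 \\ -s_7 & c_7 \end{bmatrix}
\begin{bmatrix} x(3) - x(4) \\ x(0) - x(7) \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} c_3 & s_3 \\ -s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(1) - x(6) \\ x(2) - x(5) \end{bmatrix}
\\[1ex]
\begin{bmatrix} X(3) \\ X(5) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_3 & -s_3 \\ s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(3) - x(4) \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}
\begin{bmatrix} x(2) - x(5) \\ x(1) - x(6) \end{bmatrix}
\end{aligned}
\tag{8}
$$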

where $s_m = \sin(m\pi/16) = c_k$ and $m = 8 - k$. The rearranged 1-D DCT equation is now expressed with vector rotation matrices, each of which is realized by consecutive CORDIC iterations as shown in Fig. 3. The DCT can thus be implemented using only shifters and adders, without multipliers [13]. Note that the sign-bits and the scale-factor are known ahead, since the input angles of the CORDIC module are given by the DCT bases.
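The rotation form in (8) can be checked numerically against the direct definition of the 8-point DCT. The sketch below uses exact floating-point sines and cosines in place of CORDIC rotators, so it only verifies the algebra of the rearrangement. All function names are assumptions of this example.

```python
import math

def c(k):
    return math.cos(k * math.pi / 16.0)

def s(m):
    return math.sin(m * math.pi / 16.0)

def dct8_direct(x):
    """Direct 8-point DCT: X(k) = c(k)/2 * sum x(i) cos((2i+1)k*pi/16)."""
    out = []
    for k in range(8):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        out.append(ck / 2.0 * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16.0)
                                  for i in range(8)))
    return out

def rot(cv, sv, a, b):
    """Apply the rotation matrix [[cv, -sv], [sv, cv]] to (a, b)."""
    return cv * a - sv * b, sv * a + cv * b

def dct8_rotations(x):
    """8-point DCT written as the plane rotations of (8)."""
    a0 = x[0] + x[7] + x[3] + x[4]
    a1 = x[1] + x[6] + x[2] + x[5]
    b0 = x[0] + x[7] - x[3] - x[4]
    b1 = x[1] + x[6] - x[2] - x[5]
    d0, d1, d2, d3 = x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]
    X = [0.0] * 8
    X[4], X[0] = rot(c(4), s(4), a0, a1)
    X[6], X[2] = rot(c(6), s(6), b0, b1)
    # [[c, s], [-s, c]] is the same rotation with the sine negated.
    p1, p7 = rot(c(7), -s(7), d3, d0)
    q1, q7 = rot(c(3), -s(3), d1, d2)
    X[1], X[7] = p1 + q1, p7 + q7
    p3, p5 = rot(c(3), s(3), d0, d3)
    q3, q5 = rot(c(1), -s(1), d2, d1)
    X[3], X[5] = p3 - q3, p5 - q5
    return [v / 2.0 for v in X]

sample = [10, 20, 35, 50, 48, 30, 15, 5]
print(max(abs(u - v) for u, v in zip(dct8_direct(sample), dct8_rotations(sample))))
```

Six plane rotations cover all eight outputs, and every rotation angle is a fixed multiple of π/16. That is what allows the sign-bits to be fixed at design time.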

After the 2-D DCT operation, the input data in the spatial domain is transformed to the frequency domain, which is the 8 × 8 block of 64 DCT coefficients shown in Fig. 4. Because the DCT has the signal compaction property, the signal energy of the output data (DCT coefficients) is mostly concentrated in a few low-frequency components, while the higher frequency components are associated with small signal energy. The high-frequency DCT coefficients become even smaller after the quantization step [5], which means that human eyes are more sensitive to the low-frequency components (DC) than to the high-frequency components.

Fig. 3. Hardware architecture of CORDIC-based 1-D DCT.

Fig. 4. Sensitivity difference of 8 × 8 2-D DCT coefficients.

The main idea in this paper is based on the fact that low-frequency DCT coefficients are relatively more important than high-frequency coefficients. Our CORDIC-based DCT architecture is designed considering the differences in importance between the low- and high-frequency DCT coefficients. Generally, the more iterations are performed in CORDIC, the more accurate the results are. Therefore, in the proposed DCT architecture, a larger number of CORDIC iterations is assigned to generate the low-frequency DCT coefficients, whereas a relatively smaller number of iterations is used for the high-frequency components. The number of CORDIC iterations is judiciously selected such that the image quality degradation caused by the smaller number of iterations is minimized. Detailed explanations of the DCT hardware are presented in the following sections.

Fig. 5. Differences between (a) crossing-architecture and (b) lookahead approach-based architecture of CORDIC module.

III. PRIORITY-BASED LOW-POWER DCT ARCHITECTURE USING LOOKAHEAD CORDIC APPROACH

A. Data Priority Considered Lookahead CORDIC Architecture

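In the conventional CORDIC structure shown in Fig. 1, changing the number of iterations for the two CORDIC datapaths separately is not feasible because of the crossing datapath. To assign different numbers of iterations to the two CORDIC datapaths, we adopt the lookahead CORDIC approach [25]–[27] in the proposed DCT architecture. Following (2), the three-step lookahead CORDIC can be expressed as follows:

$$
\begin{bmatrix} x_3 \\ y_3 \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1 2^{0} \\
+\sigma_1\sigma_2\sigma_3 2^{-1}
\end{bmatrix}
\\[2ex]
\begin{bmatrix}
+\sigma_1 2^{0} \\
-\sigma_1\sigma_2\sigma_3 2^{-1}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1}
\end{bmatrix}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}.
\tag{9}
$$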

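Assume that the CORDIC results require four iterations for x, whereas three iterations are needed for y. From (2) and (9), the lookahead CORDIC equation for both results can be expressed as follows, which means that the two CORDIC outputs can be calculated separately:

$$
\begin{bmatrix} x_4 \\ y_3 \end{bmatrix}
=
\begin{bmatrix}
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1} \\
-\sigma_1\sigma_4 2^{-2} \\
+\sigma_1\sigma_2\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1 2^{0} \\
+\sigma_1\sigma_2\sigma_3 2^{-1} \\
+\sigma_1\sigma_2\sigma_4 2^{-2} \\
+\sigma_1\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
\\[2ex]
\begin{bmatrix}
+\sigma_1 2^{0} \\
-\sigma_1\sigma_2\sigma_3 2^{-1}
\end{bmatrix}
&
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} \\
-\sigma_1\sigma_3 2^{-1}
\end{bmatrix}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}.
\tag{10}
$$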

Fig. 5 presents the difference between the conventional crossing CORDIC architecture and the lookahead-based approach. When the lookahead approach is applied to the CORDIC architecture, the number of iterations can be easily controlled, as all the internal datapaths become independent.

TABLE I
REQUIRED ITERATIONS AND DIRECTIONS FOR VECTOR ROTATION (+: CLOCKWISE DIRECTION, ∗: COUNTER-CLOCKWISE DIRECTION)

| Angle | Required Iterations | Directions (Sign-Bits) |
|---|---|---|
| π/16 (+) | i = 0, 1, 3, 10 | σ = −1, +1, +1, +1 |
| 3π/16 (+) | i = 1, 3, 10 | σ = −1, −1, −1 |
| 3π/16 (∗) | i = 1, 3, 10 | σ = +1, +1, +1 |
| 4π/16 (∗) | i = 0 | σ = +1 |
| 6π/16 (∗) | i = (90°), 2, 3, 5, 7 | σ = −1, +1, +1, +1, −1 |
| 7π/16 (+) | i = 0, 1, 3, 10 | σ = −1, −1, −1, −1 |

TABLE II
CORDIC SCALE-FACTORS AND THE APPROXIMATION VALUES FOR MULTIPLIERLESS IMPLEMENTATION

| Angle | Desired Scale-Factor | Approximation Value |
|---|---|---|
| π/16 | 0.3137856... | 2^−2 + 2^−4 + 2^−10 |
| 3π/16 | 0.4437599... | 2^−1 − 2^−4 + 2^−7 − 2^−9 |
| 4π/16 | 0.3535533... | 2^−2 + 2^−4 + 2^−5 + 2^−7 + 2^−9 |
| 6π/16 | 0.4810759... | 2^−1 − 2^−6 − 2^−8 |
| 7π/16 | 0.3137856... | 2^−2 + 2^−4 + 2^−10 |

In the proposed CORDIC-based DCT architecture, where different numbers of iterations are assigned for generating the DCT coefficients, the number of iterations should be carefully decided to minimize the error between the desired input angle and the corresponding accumulated angle. Table I shows the iterations executed at the ith stages and the corresponding rotation directions σ (sign-bits). For example, to rotate the vector by π/16, only the iterations i = 0, 1, 3, 10 are executed, and the rest of the iterations can be skipped for power savings. The lookahead algorithm for the π/16 CORDIC rotator can be written as follows:

$$
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} 1 & -\sigma_{10} \cdot 2^{-10} \\ \sigma_{10} \cdot 2^{-10} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_{3} \cdot 2^{-3} \\ \sigma_{3} \cdot 2^{-3} & 1 \end{bmatrix}
\cdot
\begin{bmatrix} 1 & -\sigma_{1} \cdot 2^{-1} \\ \sigma_{1} \cdot 2^{-1} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_{0} \cdot 2^{0} \\ \sigma_{0} \cdot 2^{0} & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\tag{11}
$$

where $\sigma_0 = -1$, $\sigma_1 = +1$, $\sigma_3 = +1$, and $\sigma_{10} = +1$. In Table I, i = (90°) represents the optional first iteration of the CORDIC [8]. In our DCT, the iterations to be skipped are carefully selected such that the error between the desired angle and the corresponding accumulated angle does not exceed 0.004° for all the given angles. For example, for the π/16 CORDIC rotator, the error between the desired angle and the angle rotated using the combination of CORDIC iterations presented in Table I is 0.00397958°. The combination of CORDIC iterations used to derive the lookahead CORDIC algorithm can be decided using the software modeling process presented in Section III-C.
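The angle errors quoted above can be reproduced from the entries of Table I. The sketch below assumes, following the shifts that appear in (11), that iteration i contributes the elementary angle arctan(2^−i), and it treats the optional first iteration as a 90° step. Only the magnitude of the accumulated angle is compared with the target. The names `TABLE_I`, `micro_angle_deg`, and `angle_error_deg` are introduced for this example.

```python
import math

# name: (target in units of pi/16, iterations, sign-bits), copied from Table I.
TABLE_I = {
    "pi/16 (+)":  (1, [0, 1, 3, 10], [-1, +1, +1, +1]),
    "3pi/16 (+)": (3, [1, 3, 10], [-1, -1, -1]),
    "3pi/16 (*)": (3, [1, 3, 10], [+1, +1, +1]),
    "4pi/16 (*)": (4, [0], [+1]),
    "6pi/16 (*)": (6, ["90", 2, 3, 5, 7], [-1, +1, +1, +1, -1]),
    "7pi/16 (+)": (7, [0, 1, 3, 10], [-1, -1, -1, -1]),
}

def micro_angle_deg(i):
    """Elementary angle of iteration i in degrees; "90" is the optional first step."""
    return 90.0 if i == "90" else math.degrees(math.atan(2.0 ** -i))

def angle_error_deg(target16, iters, signs):
    """Absolute difference between |accumulated angle| and the target angle."""
    acc = sum(sg * micro_angle_deg(i) for i, sg in zip(iters, signs))
    return abs(abs(acc) - target16 * 11.25)

for name, (target16, iters, signs) in TABLE_I.items():
    print(name, round(angle_error_deg(target16, iters, signs), 8))
```

Every row stays under the 0.004° bound, and the 4π/16 row is exact because a single i = 0 step already rotates by 45°.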

As mentioned in Section II-A2, the scale-factor is determined by the number of executed CORDIC iterations. As the number of iterations is known ahead, the scale-factors are predetermined, as shown in Table II. In the table, the scale-factors are represented in signed power-of-two format, and the quantization error of the scaling factor is below 10E−4.
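The desired scale-factors in Table II can be reproduced numerically under one assumption that the text does not state explicitly: each value appears to be the CORDIC gain of the executed iterations multiplied by the factor 1/2 that precedes every rotation in (8). With that assumption, the sketch below recomputes the desired values and the signed power-of-two approximations. The 90° step of the 6π/16 rotator is taken to have unit gain and is therefore omitted.

```python
import math

def scale_factor(iters):
    """Assumed form: (1/2) * product of 1/sqrt(1 + 2^(-2i)) over executed iterations."""
    k = 0.5
    for i in iters:
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

# angle: (executed iterations from Table I, approximation from Table II)
TABLE_II = {
    "pi/16":  ([0, 1, 3, 10], 2**-2 + 2**-4 + 2**-10),
    "3pi/16": ([1, 3, 10],    2**-1 - 2**-4 + 2**-7 - 2**-9),
    "4pi/16": ([0],           2**-2 + 2**-4 + 2**-5 + 2**-7 + 2**-9),
    "6pi/16": ([2, 3, 5, 7],  2**-1 - 2**-6 - 2**-8),
    "7pi/16": ([0, 1, 3, 10], 2**-2 + 2**-4 + 2**-10),
}

for angle, (iters, approx) in TABLE_II.items():
    exact = scale_factor(iters)
    print(angle, round(exact, 7), approx, round(abs(exact - approx), 6))
```

Each approximation needs at most five shift-and-add terms, which keeps the scaling step multiplierless.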


One interesting observation when the lookahead approach is applied to CORDIC is that removing high shift-terms has an effect similar to that of a lookahead CORDIC using fewer iterations. For example, if the CORDIC rotation by π/16 is executed using three iterations (i = 0, 1, 3), the lookahead CORDIC algorithm and its corresponding scale-factor are as follows:

$$
\begin{aligned}
x &= (1 + 2^{-1} + 2^{-4})\, x_0 + (2^{-2} + 2^{-4})\, y_0 \\
y &= (-2^{-2} - 2^{-4})\, x_0 + (1 + 2^{-1} + 2^{-4})\, y_0
\end{aligned}
\tag{12}
$$

$$
K_{\pi/16^{+}} = 0.3137858\ldots
\tag{13}
$$

In (11), when the higher shift-terms (terms smaller than $2^{-9}$) are eliminated, the equation reduces to (12) and (13). Note that (11) represents four iterations (i = 0, 1, 3, 10), whereas (12) represents three iterations (i = 0, 1, 3). The number of CORDIC iterations can therefore be controlled simply by removing the high shift-terms.
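The step from (11) to (12) can be verified by multiplying out the micro-rotation matrices with exact rational arithmetic. This is an illustrative sketch only. The helper names are assumptions, and the sign-bits are those listed in Table I for the π/16 rotator.

```python
from fractions import Fraction

def micro_rotation(sigma, i):
    """Matrix [[1, -sigma*2^-i], [sigma*2^-i, 1]] of one micro-rotation in (11)."""
    t = Fraction(sigma, 2 ** i)
    return [[Fraction(1), -t], [t, Fraction(1)]]

def matmul2(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def lookahead(steps):
    """Compose micro-rotations given as (sigma, i), applied in the listed order."""
    M = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for sigma, i in steps:
        M = matmul2(micro_rotation(sigma, i), M)
    return M

three = lookahead([(-1, 0), (+1, 1), (+1, 3)])            # iterations of (12)
four = lookahead([(-1, 0), (+1, 1), (+1, 3), (+1, 10)])   # iterations of (11)

print(three)  # coefficients of x0 and y0 in (12)
print([[four[r][c] - three[r][c] for c in range(2)] for r in range(2)])
```

The three-iteration product gives exactly the coefficients 1 + 2^−1 + 2^−4 and 2^−2 + 2^−4 of (12). The fourth iteration only adds terms of magnitude below 2^−9, which are the high shift-terms that can be dropped.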

B. Proposed Low-Power CORDIC-Based DCT Architecture

As mentioned in the last part of Section III-A, considering the data priorities in the DCT coefficients, the high shift-terms of the lookahead CORDIC can be carefully removed, which has the same effect as using fewer CORDIC iterations. Because fewer CORDIC iterations mean lower computational complexity, a low-power CORDIC-based DCT architecture can be derived; its detailed implementation is as follows.

Fig. 6(a) shows the hardware architecture of the proposed CORDIC-based DCT. Inside the CORDIC module, the lookahead CORDIC is derived using the parameters in Table I. The scale-factors are specified in Table II. An example of the lookahead CORDIC algorithm for the 7π/16 rotation and the corresponding scale-factors are presented in the equations shown in Fig. 6(b). To reduce the number of iterations, the high shift-terms are removed as presented in Section III-A, the implementation of which is indicated by the solid lines of Fig. 6(b). We further reduce the less important components considering the data priorities in the DCT coefficients. In Fig. 6(b), a CORDIC output, Kx, is more important than Ky, as it is used later for X(1), whereas Ky is needed for the higher frequency component, X(7). Thus, the high shift-terms for y and Ky are further removed, which is indicated by the dotted lines in Fig. 6(b).

In the proposed hardware architecture, all the shift components for each lookahead CORDIC algorithm and the scale-factors are precomputed using the lookahead CORDIC equations. In Fig. 6(c), the numbers in the circles represent the shift operations, and a black circle denotes the 2's complement of the shifted component, which is used for subtract operations. The dotted lines in Fig. 6(c) represent the omitted computations; thus, the two results in the lookahead CORDIC modules have different numbers of terms, which leads to power savings owing to the smaller number of iterations.

TABLE III
HARDWARE IMPLEMENTATION AND COMPARISON RESULTS FOR VARIOUS DCT ARCHITECTURES

| Architecture | [19] | [20] | [13] | [16] | [17] | Proposed |
|---|---|---|---|---|---|---|
| PSNR (dB) | 31.63 | 31.49 | 31.72 | 30.61 | 31.57 | 31.45 |
| Gate count | 36.2k | 24.6k | 41.6k | 27.3k | 31.5k | 22.4k |
| Power (mW) | 6.76 | 5.42 | 7.72 | 6.54 | 5.62 | 5.11 |

C. Experimental Results of the Proposed Low-Power CORDIC-Based DCT Architecture

In this section, the experimental results of the proposed CORDIC-based DCT architecture are presented. First, the number of CORDIC iterations is decided according to the target PSNR of 31.5 dB, which is the average PSNR obtained using the nine benchmark images listed in Table IV. The PSNRs of the benchmark images are obtained using the following equations:

$$
\mathrm{PSNR} = 20 \cdot \log_{10}\left( \frac{255}{\sqrt{\mathrm{MSE}}} \right)
\tag{14}
$$

$$
\mathrm{MSE} = \frac{1}{mn} \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} \left[ I(x, y) - K(x, y) \right]^{2}
\tag{15}
$$

where I is the original image of size m × n, and K is the reconstructed image. The data bit-widths inside the proposed DCT architecture are specified in Fig. 2.
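For reference, (14) and (15) translate directly into a few lines of code. The sketch assumes 8-bit images stored as nested lists of pixel values. The function names and the handling of identical images (returning infinity) are choices made for this example.

```python
import math

def mse(original, reconstructed):
    """Mean squared error of (15) for two m x n images given as nested lists."""
    m, n = len(original), len(original[0])
    total = sum((original[x][y] - reconstructed[x][y]) ** 2
                for x in range(m) for y in range(n))
    return total / (m * n)

def psnr(original, reconstructed):
    """PSNR of (14) in dB, with 255 as the peak value of 8-bit pixels."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 20.0 * math.log10(255.0 / math.sqrt(err))

orig = [[100, 110], [120, 130]]
recon = [[101, 109], [121, 129]]
print(round(psnr(orig, recon), 4))  # every pixel is off by one, so MSE = 1
```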

For comparison, various DCT architectures such as the DA-based DCT [19], MCM [20], the CORDIC-based DCT [13], and the CORDIC-based Loeffler DCT [16], [17] are implemented using a 0.13 μm CMOS standard cell library. The implemented 2-D DCT is marked with a dotted line in Fig. 2, and Table III shows the implementation results. In the table, the power consumption of the different DCT architectures is measured using NanoSim [29] with a 100 MHz clock and a 1.2 V supply voltage. More than 500 input vectors are used to obtain the average power. Compared with the DA-based architecture [19], the proposed DCT shows 38.1% area savings and 24% power savings. Compared with the MCM-based DCT [20], the proposed DCT shows comparable power consumption and a 10% smaller area with a minor image quality degradation of 0.04 dB. Because some of the higher order shift-terms in the CORDIC iterations can be removed considering the differences in importance of the DCT coefficients, our proposed DCT architecture shows the lowest gate count and power consumption compared with the other CORDIC-based architectures [13], [16], and [17]. In particular, the proposed DCT architecture shows 21.87% power savings compared to the CORDIC-based Loeffler DCT [16] with even better PSNR results.
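The percentages quoted above follow from the entries of Table III. As a small arithmetic check (the dictionary layout and helper name are assumptions of this example):

```python
# Gate count (in thousands) and power (mW) copied from Table III.
TABLE_III = {
    "[19]":     {"psnr_db": 31.63, "gates_k": 36.2, "power_mw": 6.76},
    "[20]":     {"psnr_db": 31.49, "gates_k": 24.6, "power_mw": 5.42},
    "[13]":     {"psnr_db": 31.72, "gates_k": 41.6, "power_mw": 7.72},
    "[16]":     {"psnr_db": 30.61, "gates_k": 27.3, "power_mw": 6.54},
    "[17]":     {"psnr_db": 31.57, "gates_k": 31.5, "power_mw": 5.62},
    "Proposed": {"psnr_db": 31.45, "gates_k": 22.4, "power_mw": 5.11},
}

def saving_pct(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

ours = TABLE_III["Proposed"]
for name in ("[19]", "[20]", "[16]"):
    ref = TABLE_III[name]
    print(name,
          round(saving_pct(ref["gates_k"], ours["gates_k"]), 2),
          round(saving_pct(ref["power_mw"], ours["power_mw"]), 2))
```

The area saving against [20] computed this way is closer to 9% than to the rounded 10% quoted in the text.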

IV. RECONFIGURABLE CORDIC-BASED DCT ARCHITECTURE

A. Proposed Reconfigurable Low-Power CORDIC-Based DCT Architecture

Fig. 6. (a) Hardware architecture of the proposed low-power CORDIC-based 1-D DCT. (b) An example of lookahead CORDIC algorithm (7π/16) and (c) its hardware architecture.

Building on the low-power DCT architecture presented in the previous section, we propose a reconfigurable CORDIC-based DCT architecture in this section to further reduce the power consumption at the expense of a minor image quality degradation. Several tradeoff modes are presented, and the proposed reconfigurable architecture can dynamically change the CORDIC iterations to adaptively trade off the computation energy for the image quality in the same hardware.

Generally, in the lookahead CORDIC, the shift-terms for calculating low-frequency DCT coefficients (the terms for calculating X(0) and X(1) in (8)) are more important than the shift-terms for calculating high-frequency coefficients. Additionally, among the shift-terms in one lookahead CORDIC equation, the most important terms are the low shift-terms, while the high shift-terms are relatively less important. To save computation power at the expense of minimum image quality degradation, the least important shift-term in X(7) is removed first, based on the greedy algorithm [30]. We then search for the next least important shift-term and cancel its computation. As the process is repeated, more shift-terms are removed, which means that the computation power is reduced with minimum image quality degradation.

Fig. 7 shows the pseudocode for the shift-term reduction process in the proposed CORDIC-based DCT. In step 1, the high shift-terms of the CORDIC rotation part (EQ_Terms) and the scale-factor part (SC_Terms) in the lookahead CORDIC equations are initialized to those of the normal mode shown in Section III-B. Once the target PSNR constraint is decided in step 2, the loop in steps 3–21 is performed until the minimum number of CORDIC terms that satisfies the target PSNR is found. In the inner loop, we repeatedly search for the least sensitive shift-terms inside EQ_Terms and SC_Terms. Then, the least sensitive shift-term, i.e., the one that shows the lowest ΔPSNR, is selected between EQ_Terms and SC_Terms. Because the best choice (the least sensitive shift-term) is taken based on the lookahead equation, which is updated in every iteration of the loop, the approach described in Fig. 7 is a greedy algorithm [30]. The selected shift-term is removed and the CORDIC equations of the current iteration are updated. The iteration continues until no further shift-term reduction is possible owing to the imposed PSNR constraint. For the PSNR calculation, we use the average PSNR of nine benchmark images [22]–[24].

Fig. 7. Pseudocode for shift-term reduction process in the proposed CORDIC-based DCT.
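The greedy loop described above can be sketched in a few lines. This is a minimal illustration rather than the pseudocode of Fig. 7: the function name `greedy_term_reduction`, the toy coefficient, and the error-based `toy_quality` stand-in are assumptions made for the example, whereas the paper evaluates the average PSNR of nine benchmark images and searches both EQ_Terms and SC_Terms.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_term_reduction(
    terms: Iterable[int],
    quality: Callable[[FrozenSet[int]], float],
    target: float,
) -> Set[int]:
    """Drop terms one at a time while quality stays at or above target."""
    active = set(terms)
    while active:
        # Try removing each remaining term and keep the least harmful choice.
        best_term, best_quality = None, float("-inf")
        for term in sorted(active):
            q = quality(frozenset(active - {term}))
            if q > best_quality:
                best_term, best_quality = term, q
        if best_quality < target:
            break  # any further removal would violate the constraint
        active.remove(best_term)
    return active


# Toy stand-in for the PSNR check (hypothetical): the terms are shift amounts k
# of a coefficient sum(2**-k), and quality is the negated approximation error.
FULL_TERMS = [1, 3, 4, 6]
FULL_VALUE = sum(2.0 ** -k for k in FULL_TERMS)


def toy_quality(subset: FrozenSet[int]) -> float:
    return -abs(FULL_VALUE - sum(2.0 ** -k for k in subset))


kept = greedy_term_reduction(FULL_TERMS, toy_quality, target=-0.1)
print(sorted(kept))  # [1, 3]
```

In this toy run the two highest shift-terms are removed first, and the loop stops when removing the next candidate would exceed the allowed error, mirroring the stop condition imposed by the PSNR constraint.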

With the approach shown in Fig. 7, we propose three tradeoff levels: normal mode, mode 1, and mode 2. As we go to higher tradeoff levels (sacrificing image quality in favor of lower power), the number of shift-terms composing the lookahead CORDIC equations is reduced. Table IV shows the PSNR results of the benchmark images for the three tradeoff levels. The image quality constraints for normal mode, mode 1, and mode 2 target average PSNRs of 31.5, 30, and 27 dB, respectively, over the nine benchmark images. The number of tradeoff modes and the minimum allowable PSNRs can be changed according to the user’s choice.
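As a quick consistency check, the per-mode averages can be recomputed from the entries of Table IV. The snippet below is only an illustrative sketch: the variable and function names are chosen here, and the PSNR values are copied from Table IV.

```python
# PSNR values in dB copied from Table IV: (normal, mode 1, mode 2) per image.
TABLE_IV = {
    "baboon": (27.41, 26.59, 23.89),
    "clegg": (28.33, 26.40, 21.95),
    "frymire": (26.03, 23.17, 19.30),
    "lena": (34.30, 33.54, 31.75),
    "monarch": (34.98, 33.69, 30.55),
    "peppers": (35.93, 34.61, 30.71),
    "sail": (31.41, 30.73, 28.35),
    "serrano": (29.57, 28.17, 25.28),
    "tulips": (35.08, 33.91, 30.93),
}
TARGETS_DB = (31.5, 30.0, 27.0)  # nominal targets stated in the text


def mode_averages(table):
    """Average PSNR per mode over all images in the table."""
    n = len(table)
    return tuple(sum(row[i] for row in table.values()) / n for i in range(3))


averages = mode_averages(TABLE_IV)
for name, avg, target in zip(("normal", "mode 1", "mode 2"), averages, TARGETS_DB):
    print(f"{name}: average {avg:.2f} dB (target about {target} dB)")
```

The averages come out at roughly 31.45, 30.09, and 26.97 dB, which agree with the PSNR row of Table V and lie within about 0.1 dB of the nominal targets.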

In Fig. 8, we present the number of shift-terms in the lookahead CORDIC equations and the scaling factors for the three modes of operation. As an example, to calculate


TABLE IV
PSNR DIFFERENCES IN EACH MODE OF PROPOSED RECONFIGURABLE DCT ARCHITECTURE WITH VARIOUS IMAGE DATA

| PSNR (dB) | Normal | Mode 1 | Mode 2 |
|---|---|---|---|
| baboon | 27.41 | 26.59 | 23.89 |
| clegg | 28.33 | 26.40 | 21.95 |
| frymire | 26.03 | 23.17 | 19.30 |
| lena | 34.30 | 33.54 | 31.75 |
| monarch | 34.98 | 33.69 | 30.55 |
| peppers | 35.93 | 34.61 | 30.71 |
| sail | 31.41 | 30.73 | 28.35 |
| serrano | 29.57 | 28.17 | 25.28 |
| tulips | 35.08 | 33.91 | 30.93 |

Fig. 8. Number of shift-terms inside the lookahead CORDIC rotators and scale-factors of our proposed reconfigurable DCT architecture (+: clockwise direction, *: counterclockwise direction).

the X(3) component, both the 3π/16 and π/16 CORDIC rotators are needed, and they are expressed as the following lookahead CORDIC equations in the normal mode:

$$X_{3\pi/16*} = (1 - 2^{-4})\,x_0 + (-2^{-1} - 2^{-3})\,y_0 \tag{16}$$

$$X_{\pi/16+} = (1 + 2^{-1})\,x_0 + (2^{-2})\,y_0. \tag{17}$$

The scale-factors in normal mode are as follows:

$$K_{3\pi/16*} = 2^{-1} - 2^{-4} \tag{18}$$

$$K_{\pi/16+} = 2^{-2} + 2^{-4}. \tag{19}$$

According to the equations above, four shift-terms are used for the X_{3π/16*} CORDIC rotator, while three terms are used for the X_{π/16+} rotator. Thus, the normal mode of the X(3) CORDIC in Fig. 8 is denoted as 4 | 3. At tradeoff level 1, the 3π/16 CORDIC rotator is reduced as follows:

$$X'_{3\pi/16*} = 1 \cdot x_0 + (-2^{-1})\,y_0. \tag{20}$$

At higher tradeoff levels, the number of shift-terms is further reduced, as specified in Fig. 8.
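To make the term counting concrete, the rotator equations (16), (17), and (20) can be written as lists of signed shift-terms. The following is a small sketch under assumptions: the `(sign, shift)` encoding, the dictionary layout, and the helper names are illustrative choices, and exact fractions are used in place of fixed-point hardware arithmetic.

```python
from fractions import Fraction

# Each rotator maps an input name to signed shift-terms (sign, k) = sign * 2**-k.
ROT_3PI16_NORMAL = {"x0": [(+1, 0), (-1, 4)], "y0": [(-1, 1), (-1, 3)]}  # (16)
ROT_PI16_NORMAL = {"x0": [(+1, 0), (+1, 1)], "y0": [(+1, 2)]}            # (17)
ROT_3PI16_MODE1 = {"x0": [(+1, 0)], "y0": [(-1, 1)]}                     # (20)


def count_terms(rotator):
    """Number of shift-terms in one lookahead equation."""
    return sum(len(terms) for terms in rotator.values())


def evaluate(rotator, x0, y0):
    """Evaluate the equation exactly for integer inputs x0 and y0."""
    inputs = {"x0": x0, "y0": y0}
    return sum(
        Fraction(sign, 2 ** k) * inputs[name]
        for name, terms in rotator.items()
        for sign, k in terms
    )


print(count_terms(ROT_3PI16_NORMAL), count_terms(ROT_PI16_NORMAL))  # 4 3
print(count_terms(ROT_3PI16_MODE1))                                  # 2
print(evaluate(ROT_3PI16_NORMAL, 16, 16))                            # 5
print(evaluate(ROT_3PI16_MODE1, 16, 16))                             # 8
```

The counts 4 and 3 correspond to the "4 | 3" entry quoted for the X(3) CORDIC in the normal mode, and the two evaluations show how the reduced equation departs from the normal-mode result for the same inputs.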

Fig. 9. (a) Turnoff gate schematic [24]. (b) Dynamic bit-width control using turnoff gate.

Fig. 10. Overall hardware architecture of the proposed reconfigurable CORDIC-based DCT.

B. Hardware Implementation of the Reconfigurable DCT and Experimental Results

The image quality and computational energy tradeoff approach proposed in the previous section can be realized as reconfigurable hardware using the DCT architecture shown in Fig. 6. In the normal mode of operation, the low-power DCT architecture of Section III-B is used. At tradeoff level 1, some of the shift-terms (2^{−α}) are removed as shown in Fig. 6(b). In the DCT hardware architecture, removing the higher shift-terms means that the number of addition operations is reduced by turning off the corresponding datapaths to save computation energy. A simple turnoff gate [24], shown in Fig. 9(a), is used to turn off the datapaths of the high shift-terms. An example of the proposed approach is illustrated in Fig. 9(b), where the bit-width of the datapath is dynamically controlled using a dynamic bit-width control (DBC) circuit.

The overall hardware architecture of the proposed reconfigurable CORDIC-based DCT is shown in Fig. 10. For different tradeoff modes, the proposed DCT architecture can be dynamically reconfigured by simply changing the control signal, trading a minor loss in image quality for lower computation energy. The left side of Fig. 10 shows the proposed dynamically reconfigurable CORDIC module. Once a tradeoff mode is determined, the control signal controls the turnoff gate arrays for both the CORDIC equation terms and the scaling terms. It is noteworthy that the proposed architecture and the design parameters can be changed according to the required amount of power savings.
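The effect of the mode control on one rotator can be mimicked with a small behavioral model. This is a software sketch only, not the circuit of Fig. 9 or Fig. 10: the turnoff gate is modeled as forcing a gated term to zero, the enable masks for the 3π/16 rotator are inferred from (16) and (20), and the names `ENABLES`, `turnoff`, and `rotate_3pi16` are hypothetical. Mode 2 is omitted because its equation is not given in the text.

```python
# Shift-terms of the 3*pi/16 rotator in (16): (input, sign, shift).
TERMS = [("x0", +1, 0), ("x0", -1, 4), ("y0", -1, 1), ("y0", -1, 3)]

# One enable bit per term; mode 1 keeps only the terms that remain in (20).
ENABLES = {
    "normal": [1, 1, 1, 1],
    "mode1": [1, 0, 1, 0],
}


def turnoff(value, enable):
    """Model of a turnoff gate: pass the value when enabled, else force zero."""
    return value & -enable  # -1 acts as an all-ones mask, 0 as an all-zeros mask


def rotate_3pi16(x0, y0, mode):
    """Return (result, active_terms) for integer inputs in the given mode."""
    inputs = {"x0": x0, "y0": y0}
    result, active = 0, 0
    for (name, sign, shift), enable in zip(TERMS, ENABLES[mode]):
        term = sign * (inputs[name] >> shift)  # shift instead of a multiplier
        result += turnoff(term, enable)
        active += enable
    return result, active


print(rotate_3pi16(64, 32, "normal"))  # (40, 4)
print(rotate_3pi16(64, 32, "mode1"))   # (48, 2)
```

In this model the number of active terms drops from four to two in mode 1, so fewer additions contribute to the result; in the hardware described above, the corresponding datapaths are turned off to save computation energy.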


TABLE V
POWER CONSUMPTION AT DIFFERENT TRADEOFF MODES

| | Normal | Mode 1 | Mode 2 |
|---|---|---|---|
| PSNR (dB) | 31.45 | 30.09 | 26.97 |
| Power (mW) | 5.11 | 3.58 | 3.13 |
| Percentage (%) | 100 | 70.15 | 61.27 |

Fig. 11. Lena images obtained using the proposed reconfigurable CORDIC-based DCT. (a) Normal mode. (b) Mode 1. (c) Mode 2.

The power consumption of our DCT architecture in the different modes is shown in Table V. The power consumption is measured with NanoSim [29] at a 100-MHz clock frequency and a 1.2-V supply voltage. The PSNR in Table V is the average PSNR of the nine benchmark images. As shown in the table, the proposed architecture offers significant power savings as image quality decreases. Compared with the normal mode, mode 2 provides 38.73% power savings at the cost of image quality degradation. Compared with the CORDIC-based Loeffler DCT [16] shown in Table III, the proposed architecture achieves 45.3% power savings in mode 1 at the expense of a 0.52-dB image quality degradation. At tradeoff level 2, the proposed DCT architecture achieves up to 59.5% power savings compared with the conventional CORDIC-based DCT [13], with considerable image quality degradation. It is noteworthy that the area increase for the reconfigurable architecture is only ∼7% when the turnoff gates [24] are used. Examples of Lena images under the various tradeoff modes are presented in Fig. 11.
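The relative figures in Table V follow from simple ratios of the measured power values. The sketch below recomputes them; the function and variable names are illustrative, and because it starts from the rounded milliwatt values printed in the table, its results can differ from the published percentages in the last digits.

```python
# Power in mW copied from Table V.
POWER_MW = {"normal": 5.11, "mode1": 3.58, "mode2": 3.13}


def relative_power(power, baseline="normal"):
    """Power of each mode as a percentage of the baseline mode."""
    base = power[baseline]
    return {mode: 100.0 * value / base for mode, value in power.items()}


percent = relative_power(POWER_MW)
for mode, pct in percent.items():
    print(f"{mode}: {pct:.1f}% of normal, {100.0 - pct:.1f}% saved")
```

For mode 2 this gives roughly 61% of the normal-mode power, that is, savings close to the 38.73% quoted above.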

V. CONCLUSION

In the conventional DCT architecture, not all computations are equally important in generating the frequency-domain outputs. This paper presented a low-power CORDIC-based DCT architecture in which the importance differences among the DCT coefficients are efficiently exploited to allocate the numbers of CORDIC iterations and the internal data bit-widths. Lookahead CORDIC architectures are used to overcome the inherent data dependencies in the conventional crossing architecture of CORDIC. The proposed reconfigurable CORDIC-based DCT architecture can dynamically change the tradeoff mode, with power savings ranging from 22.9% to 52.2% compared with the CORDIC-based Loeffler DCT architecture [16]. The idea presented in this paper can assist the low-power design of image and video compression applications.

ACKNOWLEDGMENT

The authors would like to thank the IC Design Education Center (IDEC) for its software assistance.

REFERENCES

[1] T. Liu, T. Lin, S. Wang, and C. Lee, “A low-power dual-mode video decoder for mobile applications,” IEEE Commun. Mag., vol. 44, no. 8, pp. 119–126, Aug. 2006.

[2] M. Parlak and I. Hamzaoglu, “Low power H.264 deblocking filter hardware implementations,” IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 808–816, May 2008.

[3] A. Bahari, T. Arslan, and A. T. Erdogan, “Low-power H.264 video compression architectures for mobile communication,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 9, pp. 1251–1261, Sep. 2009.

[4] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. 23, no. 1, pp. 90–93, Jan. 1974.

[5] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Trans. Consum. Electron., vol. 38, no. 1, pp. 18–34, Feb. 1992.

[6] D. L. Gall, “MPEG: A video compression standard for multimedia applications,” Commun. ACM, vol. 34, no. 4, pp. 46–58, Apr. 1991.

[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[8] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330–334, Sep. 1959.

[9] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, “Triangular systolic array with reduced latency for QR-decomposition of complex matrices,” in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 385–388.

[10] A. M. Despain, “Fourier transform computers using CORDIC iterations,” IEEE Trans. Comput., vol. 23, no. 10, pp. 993–1001, Oct. 1974.

[11] S. Hsiao and J. Delosme, “Parallel singular value decomposition of complex matrices using multidimensional CORDIC algorithms,” IEEE Trans. Signal Process., vol. 44, no. 3, pp. 685–697, Mar. 1996.

[12] J. R. Cavallaro and F. T. Luk, “CORDIC arithmetic for an SVD processor,” J. Parallel Distrib. Comput., vol. 5, no. 3, pp. 271–290, Jun. 1988.

[13] E. P. Mariatos, D. E. Metafas, J. A. Hallas, and C. E. Goutis, “A fast DCT processor, based on special purpose CORDIC rotators,” in Proc. IEEE Int. Symp. Circuits Syst., Jun. 1994, pp. 271–274.

[14] H. Jeong, J. Kim, and W. Cho, “Low-power multiplierless DCT architecture using image data correlation,” IEEE Trans. Consum. Electron., vol. 50, no. 1, pp. 262–267, Feb. 2004.

[15] T. Sung, Y. Shieh, C. Yu, and H. Hsin, “High-efficiency and low-power architectures for 2-D DCT and IDCT based on CORDIC rotation,” in Proc. Int. Parallel Distrib. Comput. Appl. Technol., Dec. 2006, pp. 191–196.

[16] C. C. Sun, S. J. Ruan, B. Heyne, and J. Goetze, “Low-power and high-quality CORDIC-based Loeffler DCT for signal processing,” IET Circuits, Devices, Syst., vol. 1, no. 6, pp. 453–461, Dec. 2007.

[17] Z. Wu, J. Sha, Z. Wang, and L. Li, “An improved scaled DCT architecture,” IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 685–689, May 2009.

[18] S. Hsiao, Y. Hu, T. Juang, and C. Lee, “Efficient VLSI implementations of fast multiplierless approximated DCT using parameterized hardware modules for silicon intellectual property design,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1568–1579, Aug. 2005.

[19] S. Yu and E. E. Swartzlander, “DCT implementation with distributed arithmetic,” IEEE Trans. Comput., vol. 50, no. 9, pp. 985–991, Sep. 2001.

[20] B. Kim and S. G. Ziavras, “Low-power multiplierless DCT for image/video coders,” in Proc. IEEE Int. Symp. Consum. Electron., May 2009, pp. 133–136.

[21] J. Bracamonte, M. Ansorge, and F. Pellandini, “VLSI systems for image compression: A power-consumption/image-resolution trade-off approach,” in Proc. Digit. Compress. Technol. Syst. Video Commun. Conf., 1994, pp. 271–274.

[22] G. Karakonstantis, N. Banerjee, and K. Roy, “Process-variation resilient and voltage-scalable DCT architecture for robust low-power computing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1461–1470, Oct. 2010.

[23] J. Park and K. Roy, “A low power reconfigurable DCT architecture to trade off image quality for computational complexity,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 17–20.

[24] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT: An approach to trade off image quality and computation energy,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 787–793, May 2010.


[25] J. Li, “Sign lookahead CORDIC,” M.S. thesis, Dept. Electr. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan, 2008.

[26] S. Wang and E. E. Swartzlander, “Merged CORDIC algorithm,” in Proc. IEEE Int. Symp. Circuits Syst., May 1995, pp. 1988–1991.

[27] B. Gisuthan and T. Srikanthan, “Pipelining flat CORDIC based trigonometric function generators,” Microelectron. J., vol. 33, nos. 1–2, pp. 77–89, Jan. 2002.

[28] P. K. Meher, J. Valls, T. Juang, K. Sridharan, and K. Maharatna, “50 years of CORDIC,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.

[29] NanoSim User Guide, Version A-2008.03, Synopsys Inc., Mountain View, CA, USA, 2008.

[30] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 1998.

Min-Woo Lee (S’12) received the B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Korea, in 2009 and 2012, respectively.

Since February 2012, he has been with the Department of DTV SoC Development, SIC R&D Lab, LG Electronics Corporation, Seoul, as a Research Engineer. His current research interests include CORDIC-based DSP systems and low-power and high-performance VLSI architectures.

Ji-Hwan Yoon (S’13) received the B.S. degree in electrical engineering from Korea University, Seoul, Korea, in 2009, where he is currently pursuing the M.S. and Ph.D. degrees with the Department of Electrical and Computer Engineering.

His current interests include low-power high-throughput LDPC decoder architectures, CORDIC-based DSP systems, and ultra-low-power system design.

Jongsun Park (M’05–SM’13) received the B.S. degree in electronics engineering from Korea University, Seoul, Korea, in 1998, and the M.S. and Ph.D. degrees in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2000 and 2005, respectively.

He joined the Electrical Engineering Faculty, Korea University, in 2008. From 2005 to 2008, he was with the Signal Processing Technology Group, Marvell Semiconductor, Inc., Santa Clara, CA, USA. He was with the Digital Radio Processor System Design Group, Texas Instruments, Dallas, TX, USA, in 2002. His current research interests include variation-tolerant, low-power and high-performance VLSI architectures, and circuit designs for digital signal processing and digital communications.