A Flexible VLSI Architecture for Extracting Diversity and
Spatial Multiplexing Gains in MIMO Channels
Chia-Hsiang Yang
University of California, Los Angeles
Challenges:
1. A unified solution to span the entire diversity-multiplexing tradeoff curve
2. Tradeoff between two search methods
Depth-first: ML performance with variable throughput and long latency
K-best: near ML performance with constant throughput and short latency
3. Antenna array size beyond 4×4
Area increases quadratically with the number of transmit antennas
Critical path increases linearly with the number of transmit antennas
4. Modulations beyond 16-QAM
Hardware increases quickly with the constellation size
Longer latency introduced by the minimum search circuit
5. Multiple sub-carriers
Research Contributions:
1. A unified sphere decoder architecture for extracting diversity and spatial
multiplexing gains in MIMO channels
2. Signal processing techniques to support antenna sizes up to 16×16
Folding: hardware area increases linearly with antenna array size
Loop retiming: reduces the critical path
Data interleaving: supports multiple independent sub-carriers
A region partition enumeration method for constellations up to 64-QAM
3. A flexible architecture
Antenna array: 2×2 to 16×16
Modulations: BPSK to 64-QAM
Number of sub-carriers: 16 to 128
Search method: K-best or depth-first search
4. A simplified multiplier
Numerical strength reduction
Gray coding to reduce number of operations
5. A multi-core architecture for enhanced performance
Abstract—Sphere decoding algorithm is widely used in MIMO communications, because of
its ability to approach maximum likelihood detection with significantly reduced
computational complexity. This makes it attractive for hardware implementation; however,
prior work focused only on solutions with a fixed number of antennas or fixed modulations.
This work presents a unified sphere decoder architecture that exploits the
diversity-multiplexing tradeoff in MIMO channels by taking advantage of flexibility in the
number of antennas and modulation schemes. Several signal processing and circuit techniques
are combined to reduce hardware complexity: a 20 times area reduction is achieved
compared to a direct-mapped architecture, even without interleaving of sub-carriers. The
proposed flexible architecture supports antenna arrays from 2×2 to 16×16, modulations from
BPSK to 64-QAM, and 16 to 128 sub-carriers. The estimated peak data rate exceeds 1.5 Gbps
(ideal throughput) using a 16 MHz bandwidth, in just 0.55 mm² in a standard 90 nm CMOS
process.
I. INTRODUCTION
Multi-input multi-output (MIMO) communication has recently received
significant attention due to its potential to increase link robustness and channel
capacity. Hardware realization of MIMO signal processing algorithms is quite
challenging, because it requires multi-dimensional, matrix-based, computations.
However, with the growing demand for higher data transmission rates over wireless
links, the need for devices equipped with multiple antennas is increasing.
Among various MIMO algorithms, sphere decoding is one of the most promising
solutions. It approximates the information theoretic bound, set by the maximum
likelihood (ML) detection, with several orders of magnitude lower computational
complexity [1] [2]. This means that, for a given hardware cost, the reduced
complexity could be utilized to increase the size of antenna array and effectively
improve the performance beyond the ML performance of a system with smaller array
size. The complexity reduction is achieved by transforming an exhaustive search of
the ML decoders into a tree search procedure of sphere decoding. Tree search is quite
popular in other communications areas such as multi-user detection (MUD) for
CDMA systems, block-based demodulation, and linear block error control code
decoding [3]. Other potential applications include speech recognition, data
compression, protein sequence exploration, and neural signal detection.
The sphere decoding algorithm is a multi-dimensional signal processing task dealing
with vector and matrix arithmetic. The required computation involves hundreds of add
and multiply operations, and may also need divide and trigonometric functions. Such
high complexity limits system specifications such as antenna array size and
modulations. In addition, prior work focused only on solutions with a fixed number of
antennas or fixed modulations [16][17][19][21][22][24]. In this work, we evaluate the
architectures proposed in prior work and advance the state of the art in the area of
multi-dimensional matrix-based signal processing hardware. A number of signal
processing techniques [23] are considered jointly with the technology parameters to
greatly reduce hardware area (cost) and power while maximizing performance.
This work develops an architecture that further simplifies sphere decoding
implementation by jointly considering tradeoffs at the algorithm, architecture, and
circuit layers of abstraction, with the goal of minimizing chip power and area. At the
same time, additional degrees of freedom are considered in the design in order to take
full advantage of the diversity and spatial multiplexing gains available in MIMO
wireless channels [5]. Tuning over a range of diversity-multiplexing points is possible
by varying antenna array size and modulation scheme, for example. Flexibility and
scalability are, thus, key additional requirements in the design of multi-mode,
multi-standard systems. Also, our work uses the Matlab/Simulink framework to
improve design productivity in mapping DSP algorithms onto silicon. The BEE2
platform [38] is used to verify system functionality before entering physical ASIC
design.
This proposal is organized as follows. Section II reviews the fundamental
diversity-multiplexing tradeoff in MIMO communications and describes sphere
decoding algorithm. Several signal processing techniques, evaluated in
power-area-performance space, and architecture details are presented in Section III.
Section IV describes the Simulink design environment and BEE2 emulation platform.
Conclusions are summed up in Section V. Finally, Section VI proposes future work
and the timeline.
II. ALGORITHM SPACE EXPLORATION
A MIMO system can improve the reliability of a wireless link through increased
diversity or improve the channel capacity through spatial multiplexing. Diversity gain
and spatial multiplexing gain are related to system coverage range and data rate,
respectively. Both gains can be improved using a larger antenna array. However,
given a MIMO system, there is a fundamental trade-off between these two gains [4]
[5]. In the diversity-multiplexing space, repetition code, Alamouti code, and
space-time code use data redundancy to increase diversity at the price of losing spatial
multiplexing gain. In contrast, the Bell Labs Layered Space-Time (BLAST) algorithm,
Singular Value Decomposition (SVD), and QR decomposition allocate data streams
to different eigenmodes to maximize spatial multiplexing gain while sacrificing
diversity gain, as shown in Fig. 1.
Sphere decoding is a decoding scheme that can extract both diversity and
multiplexing gains. With flexibility in coding and modulation, a sphere decoder can
effectively explore the entire tradeoff curve, as shown in Fig. 1. The original data type
for sphere decoding is uncoded data. By manipulating the input data, sphere decoding
can also decode space-time block codes (STBC), which improves the error
probability and increases diversity gain. The data rate can be maximized by
transmitting different modulations over different MIMO substreams to increase
spatial multiplexing gain. Also, with proper preprocessing, the decoding process starts
with the symbols that have the highest SNR, and then cancels the effect of
the decoded symbols on the remaining symbols until the final symbol is decoded. This
decoding sequence is equivalent to that in BLAST [41]. A unified sphere decoder
model is illustrated in the following section.
[Figure 1: diversity gain (range) versus spatial multiplexing gain (rate). Labeled schemes: Repetition, Alamouti, Space-time, BLAST, SVD, QR, and sphere decoding; arrows labeled "array size" indicate how the curve shifts.]
Fig. 1. Diversity-Multiplexing tradeoff in MIMO communications.
A. Sphere Decoding Algorithm
Consider a multiple-antenna system with M transmit antennas and N receive
antennas. The received vector y can be represented by
$$ y = Hs + n \tag{1} $$
where y is an N×1 vector of received symbols, and H denotes an N×M channel matrix
whose elements are i.i.d. complex Gaussian with zero mean and unit variance. Vectors
s and n (M×1 and N×1, respectively) represent the transmitted symbols and zero-mean,
circularly symmetric white Gaussian noise, respectively. The transmitted vector
$s \in \mathcal{Q}^M$ (with $\mathcal{Q}$ the symbol constellation) with the smallest Euclidean
distance is selected as the ML estimate in (2). The
channel matrix can be decomposed further using QR factorization; the equivalent ML
estimate can thus be written as
$$ \hat{s} = \arg\min \|y - Hs\|^2 = \arg\min \|\hat{y} - Rs\|^2 \tag{2} $$
with
$$ \hat{y} = Q^H y = R s_{ZF} $$
where Q is a unitary matrix, R is an upper triangular matrix, and
$s_{ZF} = (H^H H)^{-1} H^H y$ is the zero-forcing (unconstrained ML) estimate. The signal
model is presented in Fig. 2.
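[Figure 2: block diagram. TX: s passes through the channel H and noise n is added to give y. RX: with H = QR, y is multiplied by Q^H to give ŷ, and the sphere decoder outputs ŝ = arg min ||ŷ − Rs||².]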
Fig. 2. Signal model of sphere decoding algorithm.
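The equivalence of the two forms in (2) follows from the unitarity of Q. A short derivation is sketched below under a simplifying assumption that is not stated in the text: H is square (N = M) and full rank, so that R is invertible.

```latex
\begin{aligned}
\|y - Hs\|^2 &= \|y - QRs\|^2 && (H = QR) \\
             &= \|Q^H (y - QRs)\|^2 && (\text{a unitary } Q^H \text{ preserves the norm}) \\
             &= \|Q^H y - Rs\|^2 = \|\hat{y} - Rs\|^2 && (Q^H Q = I), \\
R\, s_{ZF}   &= R (H^H H)^{-1} H^H y = R (R^H R)^{-1} R^H Q^H y = Q^H y = \hat{y}.
\end{aligned}
```

Because R is upper triangular, row i of $\hat{y} - Rs$ depends only on symbols i through M, which is what allows the minimization to proceed row by row from the last row.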
The most commonly used methods for QR decomposition are Gram-Schmidt
orthogonalization, Householder transformations, and Givens rotations [7]. Several
modifications, such as division-free or square-root- and division-reduced methods, have
been proposed to simplify the operations in the original algorithms [45] [46]. For hardware
realization, [8] proposed an algorithm suitable for fixed-point implementation and [9]
proposed a CORDIC-based triangular systolic array architecture to reduce latency.
Under the assumption of a block fading channel, QR decomposition is computed at the
packet rate.
Using the upper triangular structure of R, symbol decoding begins from the last
row and proceeds in several steps. The decoded symbols are used in successive
decoding steps until all symbols are decoded. This decoding algorithm can be mapped
to finding a shortest path (with minimum Euclidean distance) in a tree topology: each
possible constellation point corresponds to one node, and each row of the R matrix maps to
one level of the tree, whose edges are weighted by channel coefficients. The whole
solution space of this tree is equivalent to an exhaustive search in the trellis diagram of
the original problem; the total number of combinations of transmitted symbols is
$|\mathcal{Q}|^M$, where $|\mathcal{Q}|$ is the constellation size.
By properly choosing a search radius and a search method, the ML solution can be
approached by visiting only nodes within a hyper-sphere, rather than performing an
exhaustive search. This complexity reduction is feasible because the Euclidean
distance is a cumulative sum of squared terms. This means that if the partial
Euclidean distance of a node is larger than the search radius, all branches below it are
outside the search radius as well. The conceptual view of the sphere decoding algorithm
is illustrated in Fig. 3. Tree pruning lets sphere decoding achieve ML
performance with polynomial complexity (highlighted nodes in Fig. 3) rather than
exponential complexity (all nodes in Fig. 3) [1].
[Figure 3: search tree with one level per transmit antenna (ant-1, ant-2, ..., ant-M), a branching factor equal to the constellation size, and a search radius bounding the visited nodes.]
Fig. 3. Concept of sphere decoding. Unlikely nodes and branches are indicated with gray shade.
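The pruning idea can be illustrated with a small depth-first search. This is a minimal sketch under assumptions that are not in the text: symbols are real-valued (for example one dimension of a QAM constellation), R is given as a list of rows, and the function name `sphere_decode` is hypothetical. It is a behavioral model only, not the hardware architecture proposed here.

```python
import math


def sphere_decode(R, y_hat, constellation, radius_sq=math.inf):
    """Depth-first search for the s minimizing ||y_hat - R s||^2.

    R is an upper-triangular M x M matrix (list of rows) and y_hat = Q^H y.
    Returns (best symbols or None, best squared distance, visited node count).
    """
    M = len(R)
    best = {"dist": radius_sq, "s": None, "visited": 0}
    s = [0] * M

    def search(level, partial_dist):
        # Levels run from the last row (M - 1) up to row 0.
        for cand in constellation:
            s[level] = cand
            # Row `level` only involves symbols level..M-1 (R is upper triangular).
            acc = sum(R[level][j] * s[j] for j in range(level, M))
            dist = partial_dist + (y_hat[level] - acc) ** 2
            best["visited"] += 1
            if dist >= best["dist"]:
                continue  # outside the sphere: prune the whole subtree
            if level == 0:
                # Leaf reached: shrink the radius to this distance.
                best["dist"], best["s"] = dist, list(s)
            else:
                search(level - 1, dist)

    search(M - 1, 0.0)
    return best["s"], best["dist"], best["visited"]
```

Because the partial distance can only grow as the search descends, discarding a node whose partial distance already exceeds the current radius never discards the minimum-distance leaf.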
B. Performance Improvements
Several simple yet effective methods, such as detection ordering, candidate
enumeration, and search radius setting, are applied to improve error performance
and/or reduce the complexity of basic sphere decoding [3]. For instance, the sphere
decoding algorithm for a 4×4 64-QAM system reduces computational complexity by
over $10^5$ times compared to exhaustive search [10].
1) Detection Ordering: The idea behind detection ordering is to detect symbols with
the largest SNR first: to avoid discarding the ML solution, the first decoded symbols
should be the most reliable. Various ordering algorithms have been proposed for the
preprocessing stage: V-BLAST-ZF ordering, V-BLAST-MMSE, and Norm ordering [3]
[25]. Assuming a packet-based wireless communication system, the ordering only
needs to be performed once at the beginning of each received frame.
2) Candidate Enumeration: Detection ordering is applied across levels in the tree
topology. For each level, the order of constellation point enumeration is another
important factor to improve search speed. Schnorr-Euchner (SE) enumeration
suggests traversing the constellation candidates according to the cumulative distance
increment in ascending order [2]. Therefore, the first candidate $s_i$ for each row is
the one with minimum distance between $b_i$ and the scaled constellation points
$R_{ii}\mathcal{Q}_i$ in (3). Finding a good admissible
solution early means that we can shrink our initial radius early.
$$ \hat{y}_i - \sum_{j=i}^{M} R_{ij} s_j = b_i - R_{ii} s_i \quad \text{with} \quad b_i = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij} s_j . \tag{3} $$
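As an illustration of (3), the sketch below computes $b_i$ for one row and orders the candidates by their distance to it. The helper names `residual` and `se_order` are hypothetical, indices are 0-based, and an explicit sort is used only for clarity; it is not meant to represent how the enumeration is done in hardware.

```python
def residual(y_hat, R, s, i):
    """b_i = y_hat[i] - sum over j > i of R[i][j] * s[j] (0-based indices)."""
    return y_hat[i] - sum(R[i][j] * s[j] for j in range(i + 1, len(s)))


def se_order(b_i, r_ii, constellation):
    """Candidates sorted by |b_i - r_ii * s|, closest first (Schnorr-Euchner order)."""
    return sorted(constellation, key=lambda sym: abs(b_i - r_ii * sym))
```

The first element of the returned list is the candidate a depth-first search would try first; the remaining elements give the order in which siblings are visited on back-trace.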
3) Search Radius Setting: One major feature of sphere decoding is radius
shrinking. Once a solution with a smaller Euclidean distance is found, the search
radius is updated to this value so that more unlikely branches can be pruned. However,
choosing the initial search radius is not easy, because it directly influences the
complexity of the algorithm. When the search radius is too
large, a very high number of nodes in the solution space is visited, which causes high
detection complexity. Conversely, a search radius that is too small may result
in an empty sphere with no available solution.
Under the AWGN model, the sum of squared noise terms follows a central chi-square
distribution with 2M degrees of freedom [11] [47]. Given the channel SNR, the search radius can
be decided by solving for the radius that meets a chosen confidence level under this
probability density function (pdf).
If the channel SNR is unknown, the Euclidean distance of the zero-forcing solution can be
used as an initial guess. An algorithm with an increasing search radius has also been
proposed, which starts with a tight search radius and expands it if no
solution is available within the radius [12] [48].
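One way to turn the chi-square statement into a number is sketched below. It relies on the closed-form CDF of a chi-square variable with an even number of degrees of freedom and solves for the radius by bisection. The function names, the 0.99 default confidence, and the normalization by the noise variance per real dimension are assumptions made for illustration, not values from the text.

```python
import math


def chi2_cdf_even(x, dof):
    """CDF of a central chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for n in range(1, dof // 2):
        term *= half / n  # (x/2)^n / n!
        total += term
    return 1.0 - math.exp(-half) * total


def initial_radius_sq(num_tx, noise_var_per_real_dim, confidence=0.99):
    """Squared radius r^2 such that P(||n||^2 <= r^2) equals the confidence level."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    dof = 2 * num_tx  # 2M degrees of freedom, as stated in the text
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < confidence:
        hi *= 2.0  # grow the bracket until it contains the target quantile
    for _ in range(100):  # bisection on the monotonic CDF
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < confidence:
            lo = mid
        else:
            hi = mid
    return noise_var_per_real_dim * hi
```

A higher confidence level gives a larger sphere and more visited nodes; a lower one risks an empty sphere, which is the case the increasing-radius scheme addresses.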
C. Tradeoff in Diversity-Multiplexing Space
A unified sphere decoder architecture is illustrated here for extracting diversity
and spatial multiplexing gains along the tradeoff curve. We demonstrate that
flexibility in antenna array size and in modulation are the key features for this
purpose. Antenna array size provides added flexibility to shift the tradeoff curve in
the diversity-multiplexing space.
In order to maximize diversity gain, we have to supply to the receiver multiple
independently faded replicas of the same symbol, so that the error probability is
reduced [13] [14]. The data replicas can be sent in space and/or time directions. Since
a unified signal model can be developed for these space-time (ST) coding schemes,
the same sphere decoder architecture can be used with some data rearrangement.
Sphere decoding supporting algebraic ST codes [48], linear dispersion codes [49], and
space-time block codes (STBC) [15] was reported in prior work. The ML estimate can
be written as
$$ \hat{s} = \arg\min \|y - Bs\|^2 \tag{4} $$
where matrix B depends on the code generators and the channel matrix. By interpreting B as
H in the original signal model, the sphere decoding algorithm can be applied. Since the
matrix dimension changes due to the data rearrangement in the
preprocessing stage, the equivalent antenna array size changes accordingly.
For example, repetition coding by 2 in the space domain for an 8×8 system is
transformed into data processing in a 4×4 system (only one half of the symbols need to be
decoded). This further motivates the need for flexibility in antenna array size.
Spatial multiplexing gain is characterized by data rate. To maximize spatial
multiplexing gain, we should allow the data rate to scale with the SNR or assign different
data rates to different substreams for a fixed SNR [5][15]. To this end, the modulation
scheme should adapt to channel conditions: a larger constellation is
applied to substreams with higher SNR, and a smaller constellation is applied to
substreams with lower SNR. In principle, this transmission strategy amounts to
water-filling in the space domain. The system performance perspective, therefore, further
motivates the need for flexibility in modulation schemes.
III. ARCHITECTURE SPACE EXPLORATION
The optimal architecture is decided by jointly considering tradeoffs at the
algorithm, architecture, and circuit layers of abstraction, with the goal of minimizing
chip power and area. As shown in Fig. 4, a layered design approach is adopted to
merge algorithm and circuit decisions. An efficient multiplier is proposed to reduce
area and delay at the same time. Saving in area directly translates to power reduction
since power spent in charging/discharging parasitic capacitances is also reduced. At
the processing element (PE) architecture level, we evaluate the existing architectures
[16][17][19][21][22][24] and propose a solution with improved area and throughput.
Unlike prior work, flexibility is also considered in the design stage. Antenna size,
modulation scheme, number of subcarriers, and search method are designed with
flexibility and scalability to cover multiple communication scenarios. A multi-core
architecture which consists of many PEs ("small cores") is developed to support the
tradeoff between range and data rate at the system architecture level. We finally
summarize the flexibility, scalability, and system specification.
[Figure 4: layered design stack with levels labeled system architecture, PE architecture, metric calculation, and multiplier.]
Fig. 4. Illustration of layered design approach.
A. Numerical Strength Reduction
From an algorithm perspective, the complexity of sphere decoding is evaluated by
the number of nodes visited in the tree search process. When considered for hardware
implementation, decoding algorithms are generally compared in terms of the number
of multiplications. Down to the circuit level, the size of multipliers is the key factor to
estimating the area, speed, and power of the sphere decoder.
We start with simplifying the cost of the multiply operation to reduce hardware
complexity. The multiplication is required to calculate Euclidean distance, which is
mathematically represented by two equivalent forms, Eqs. (5), (6).
$$ \hat{s}_{ML} = \arg\min \|R(s - s_{ZF})\|^2 \tag{5} $$
$$ \hat{s}_{ML} = \arg\min \|Q^H y - Rs\|^2 \tag{6} $$
At first glance, Eq. (5) requires fewer multiplications than Eq. (6): one
multiplication for Eq. (5) and two for Eq. (6). Hence, Eq. (5) was most
commonly used in prior work [16]-[21] as a baseline for implementation. However, a
careful investigation shows that Eq. (6) is a better choice from a hardware perspective
for at least two reasons. First, we observe that $s_{ZF}$ and $Q^H y$ can be pre-computed and,
hence, have negligible impact on the total number of operations; moreover, the computation
effort for $s_{ZF}$ is no less than that for $Q^H y$. Second, the wordlength of s is usually much
shorter than that of $s_{ZF}$. Separating the terms as in Eq. (6) results in multipliers with
reduced wordlength.
To first order, the normalized size of a multiplier can be estimated by
the product of the wordlengths of the multiplier and multiplicand. The normalized delay of
a multiplier can be estimated by the sum of the wordlengths of the multiplier and
multiplicand if an array multiplier is used [39]. The array multiplier approximation
works well for first-order comparison purposes. Table I summarizes the relative area
and delay reduction of a multiplier due to numerical strength reduction in a 64-QAM
system, where the wordlength (WL) of s is 3 for a real multiplier. We see that the area
reduction is at least 50%, and that the delay reduction approaches 50% for large
wordlength inputs. The absolute area difference between these two types of
multipliers is amplified by the total number of multiplications in the entire decoding
process, which is approximately $O(M^3)$.
TABLE I
AREA AND DELAY REDUCTION DUE TO NUMERICAL STRENGTH REDUCTION
WL of sZF                  |  6         |  8          |  10        |  12
WL of R = 12, Area/delay   |  0.5/0.83  |  0.38/0.75  |  0.3/0.68  |  0.25/0.63
WL of R = 16, Area/delay   |  0.5/0.68  |  0.38/0.63  |  0.3/0.58  |  0.25/0.54
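The first-order model described above (area as a product of wordlengths, delay as a sum) can be written out as a small sketch. The function names are hypothetical, and the model is only the array-multiplier approximation from the text; with WL of R = 12 and WL of s = 3 it gives ratios in line with the first row of Table I.

```python
def multiplier_area(wl_a, wl_b):
    """First-order area estimate: product of the operand wordlengths."""
    return wl_a * wl_b


def multiplier_delay(wl_a, wl_b):
    """First-order delay estimate for an array multiplier: sum of the wordlengths."""
    return wl_a + wl_b


def reduction_ratios(wl_r, wl_s, wl_szf):
    """Area and delay of an R-by-s multiplier relative to an R-by-s_ZF multiplier."""
    area = multiplier_area(wl_r, wl_s) / multiplier_area(wl_r, wl_szf)
    delay = multiplier_delay(wl_r, wl_s) / multiplier_delay(wl_r, wl_szf)
    return area, delay
```

For example, `reduction_ratios(12, 3, 8)` compares a 12×3 multiplier with a 12×8 one.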
The multiplier can be simplified further by taking advantage of some
characteristics of communication signal processing: Gray coding and quantization
effects. Gray code is a more compact representation in the constellation plane since
only odd numbers are used. Conventionally, the number is transformed to 2’s
complement representation for the purpose of arithmetic operations. Carefully
examining the Gray code representation, the corresponding multiplication can be
implemented by simple shift, add, and invert operations. The code mapping, the
associated operations, and the simplified multiplier are shown in Fig. 5. The neg
operator stands for bit inversion. The 1-bit carry-in needed for 2’s complement negation
can be absorbed as a carry-in (shaded in gray) in the following adders or simply be
discarded as a quantization error on the LSB, which can be recovered by wordlength
optimization.
The shifter has no direct area cost apart from routing, while the cost of inverters
and multiplexers is relatively low because they are simple operations. Overall, it is
possible to implement one complex multiplier with 6 adders + inverters and
multiplexers, resulting in a total 40% area reduction compared to traditional approach
(area is estimated by Synopsys Design Compiler). This implementation does not
imply that we have to force the use of Gray coding in the constellation plane; the Gray
coding is only used inside the sphere decoder to simplify metric calculation and
candidate enumeration. The decoded symbols can be converted into any arithmetic
representation at the sphere decoder outputs.
Gray code | 000 | 001 | 011 | 010 | 110 | 111 | 101 | 100
value     | -7  | -5  | -3  | -1  |  1  |  3  |  5  |  7
operation | -7  | -5  | -3  | -1  |  1  | 4-1 | 4+1 | 8-1
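[Figure 5: simplified multiplier for input R built from shifters (<<1, <<2, giving x4 and x8 terms), bit inversion (neg), and multiplexers controlled by the Gray code bits S2, S1, and S0.]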
Fig. 5. Gray code representation and the simplified multiplier.
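The shift-and-add decomposition in the table can be checked with a small behavioral model. This is a sketch only: it uses the code-to-value mapping from the table, the name `shift_add_multiply` is hypothetical, and negation is done exactly with Python integers, whereas the circuit described above uses bit inversion with the carry-in absorbed or discarded. It does not model the gate-level structure of Fig. 5.

```python
# Mapping copied from the Gray code table: 3-bit code -> odd constellation level.
GRAY_TO_VALUE = {
    0b000: -7, 0b001: -5, 0b011: -3, 0b010: -1,
    0b110: 1, 0b111: 3, 0b101: 5, 0b100: 7,
}


def shift_add_multiply(r, gray):
    """Multiply integer r by the level encoded in a 3-bit Gray code using shifts and adds."""
    s2, s1, s0 = (gray >> 2) & 1, (gray >> 1) & 1, gray & 1
    # In this mapping the magnitude depends only on (S1, S0); S2 carries the sign.
    if (s1, s0) == (1, 0):
        mag = r                 # |level| = 1
    elif (s1, s0) == (1, 1):
        mag = (r << 2) - r      # |level| = 3 = 4 - 1
    elif (s1, s0) == (0, 1):
        mag = (r << 2) + r      # |level| = 5 = 4 + 1
    else:
        mag = (r << 3) - r      # |level| = 7 = 8 - 1
    return mag if s2 else -mag
```

Each product needs at most one shift, one add or subtract, and a conditional negation, which is the reason no general-purpose multiplier is required for 64-QAM symbol levels.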
B. Architecture Tradeoff
In prior work, two major types of tree search methods are reported: depth-first
(DF) [23] [24] and K-best [16]-[22]. The depth-first algorithm starts the search from
the root of the tree and explores as far as possible along each branch until a leaf
node is found, and then back-traces. The K-best algorithm approximates a
breadth-first search by keeping only the K branches with the smallest partial Euclidean
distance (PED) at each level [26], which is similar to the M-algorithm in trellis
decoding [27]. The major advantages of DF are that ML performance can be
achieved and that radius shrinking can be used for tree pruning. On the other hand,
the advantages of K-best are its uniform data path and constant throughput.
In more detail, depth-first ensures ML performance if the complete
solution space is explored. This might not be feasible in practice, however, because of
limited buffer size and processing cycles. Some termination scheme
must then be used, and ML performance is no longer guaranteed. Since the default
input is uncoded data, achieving sub-optimal performance while keeping constant
throughput is more important. Space-time codes or error correction codes can then be
used to improve the performance. An iterative decoding scheme that combines a
MIMO decoder and an error correction code decoder has been shown to achieve
near-capacity performance [2].
In hardware implementation, depth-first is realized in a folding-like architecture
because only one node is visited at a time during the tree search process. In this case,
extra memory is required to record the visited nodes for the trace-back operation.
K-best is realized in a multi-stage pipelined way, because no trace-back is needed. To
process K data paths at the same time, a parallel architecture is applied. Fig. 6
illustrates the basic architectures of these two search schemes, and Table II
summarizes their comparison in terms of circuit metrics and algorithmic performance.
[Figure 6: (a) depth-first (folding): one PE; (b) K-best (parallel and multi-stage): PE 1, PE 2, ..., PE M.]
Fig. 6. Basic architecture of (a) depth-first and (b) K-Best algorithm.
TABLE II
COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM
            | Area  | Throughput | Latency | Radius Shrinking/Tree Pruning | Performance
Depth-first | Small | Variable   | Long    | Yes                           | ML
K-best      | Large | Constant   | Short   | No                            | Near-ML
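To make the K-best row of the comparison concrete, the sketch below keeps the K partial paths with the smallest PED at every level. It assumes real-valued symbols and an upper-triangular R given as a list of rows; the function name `k_best_decode` and the use of a full sort per level are illustrative choices, not the architecture proposed here.

```python
def k_best_decode(R, y_hat, constellation, K):
    """Breadth-first search that keeps the K smallest-PED survivors per level.

    Returns (symbols, PED) of the best surviving path.
    """
    M = len(R)
    # Each survivor is (PED, tail), where tail holds the symbols for rows level+1..M-1.
    survivors = [(0.0, [])]
    for level in range(M - 1, -1, -1):
        expanded = []
        for ped, tail in survivors:
            interference = sum(R[level][level + 1 + k] * sym
                               for k, sym in enumerate(tail))
            b = y_hat[level] - interference
            for cand in constellation:
                new_ped = ped + (b - R[level][level] * cand) ** 2
                expanded.append((new_ped, [cand] + tail))
        expanded.sort(key=lambda item: item[0])
        survivors = expanded[:K]  # fixed work per level: constant throughput
    best_ped, best_s = survivors[0]
    return best_s, best_ped
```

The loop visits every level exactly once and never back-traces, which is why throughput is constant; the price is that the ML path can be dropped if its PED is not among the K smallest at some level.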
For a sphere decoder operating with a large antenna array, the biggest challenge
in the implementation is reducing the area of the design. Using the number of (complex)
multipliers as a first-order area estimate, the numbers of multipliers needed in the
folding and multi-stage architectures are M and M(M+1)/2, respectively, where M is
the number of transmit antennas. Expanding a 4×4 system to a 16×16 system, relative
area increases from 4 to 16 for the folding architecture and from 10 to 136 for the
multi-stage architecture. The folding architecture is 2.5× to 8.5× more area efficient
than the multi-stage architecture, as shown in Fig. 7 (a). To keep the area
within a reasonable value, the folding technique is adopted. The second design
challenge is the operating frequency of the folded architecture.
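The multiplier counts quoted above follow directly from the two formulas. A minimal sketch (the function name is hypothetical):

```python
def multiplier_counts(num_tx):
    """First-order multiplier counts: (folding, multi-stage) for M transmit antennas."""
    folding = num_tx
    multi_stage = num_tx * (num_tx + 1) // 2
    return folding, multi_stage


for m in (4, 16):
    folding, multi_stage = multiplier_counts(m)
    print(m, folding, multi_stage, multi_stage / folding)
```

The ratio (M+1)/2 grows with the array size, so the area advantage of folding increases for larger arrays.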
As the array size increases, the number of operands in the Multiply-Accumulate
(MAC) operation in the metric calculation unit increases proportionally to the number
of antennas. Assuming a tree adder design, the critical path delay increases roughly
linearly with the number of transmit antennas. However, the time available to finish
the MAC operation should be scaled down by the number of antennas in order to
increase the throughput proportionally to the number of antennas. This timing
requirement for a fixed bandwidth is shown in Fig. 7 (b). The situation is actually
worse when metric enumeration is included in the loop. Since pipelining within the loop is
difficult, this architecture cannot operate at a high frequency even
for a 4×4 system [24].
To facilitate pipeline insertion, inputs are up-sampled by a factor m, and then one
register can be replaced with m pipeline registers in the loop using the noble identity [42].
In this case, only one out of m samples is useful data, and the rest could be repeated
values of an original sample or padding zeros. By applying data-stream interleaving,
samples of other independent data streams can be introduced in the loop in place of
the repeated values or padding zeros. Interleaving is therefore used to
improve area efficiency through logic sharing and to provide the flexibility needed to
support a varying number of data sub-carriers. In a multi-carrier communication system,
data streams are transmitted over narrow-band sub-carriers [28].
[Figure 7: (a) area reduction using the folding technique: area versus antenna array size (4×4, 8×8, 16×16) for the multi-stage and folding architectures; (b) growing timing gap in the folding architecture: delay versus antenna array size, showing the timing requirement and the critical path in the loop.]
Fig. 7. Design challenge and tradeoff for large antenna size. Impact of antenna array size on (a) area
and (b) critical path delay.
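The effect of interleaving on a pipelined feedback loop can be seen in a toy model. The sketch below uses a plain accumulator as the loop (a stand-in for the actual PE loop, chosen only for illustration) and replaces its single register with m pipeline registers; all function names are hypothetical.

```python
from collections import deque


def interleave(streams):
    """Round-robin merge of equal-length streams: a0, b0, c0, a1, b1, c1, ..."""
    return [x for group in zip(*streams) for x in group]


def pipelined_accumulator(samples, m):
    """Accumulator loop whose single feedback register is replaced by m registers."""
    pipeline = deque([0] * m, maxlen=m)  # leftmost entry was written m cycles ago
    out = []
    for x in samples:
        acc = pipeline[0] + x   # feedback value produced m cycles earlier
        pipeline.append(acc)    # the oldest entry drops out automatically
        out.append(acc)
    return out


def deinterleave(samples, m):
    """Split an interleaved sequence back into m streams."""
    return [samples[k::m] for k in range(m)]
```

With m streams interleaved, the value fed back after m cycles always belongs to the same stream, so each stream sees an ordinary accumulator while the loop hardware does useful work on every cycle.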
C. PE Structure
The function of the PE is to find the $s_i$ with minimum $T_i$ ($T_i = |b_i - R_{ii} s_i|$) for each
level in the tree search, and to provide a candidate list with $T_i$ in ascending order,
since a path with smaller $T_i$ has a higher probability of being the ML estimate. A
straightforward algorithm mapping is to enumerate all possible constellation points and sort
the $T_i$ to find the $s_i$ and the candidate list [24]. The hardware cost and computational
latency of this architecture are very high for a large constellation size due to the circuit
parallelism and the inevitable latency of the sorting circuit. To resolve this problem, we
propose another strategy. First, the closest point is found through the geometric
relationship, since the $s_i$ with minimum $T_i$ is the closest point between $b_i$ and
$R_{ii}\mathcal{Q}_i$. Second, the selected $s_i$ is used to calculate $T_i$. Finally, the candidate list
is generated from the constellation arrangement, as described in Section III-C-2, Fig. 12.
We decompose the PE into two parts: Metric Calculation Unit (MCU) and Metric
Enumeration Unit (MEU). Each submodule can be mapped to Area-Energy-Delay
space to explore optimal design parameters for the top-level integration.
Decomposing a design problem along these three axes provides insight into design
techniques and their impact on power, area, and throughput. Concurrency versus
latency is one of the basic tradeoffs that need to be considered. Maximizing data
throughput calls for a parallel architecture, which results in a large area. Conversely,
time-multiplexing improves area efficiency, but increases latency. For example, the
decoding algorithm operating on complex numbers can be transformed into a
real-valued problem, which results in a tree that is twice as deep as the original tree
with a smaller number of children per node [16]. Since the multipliers are reused, the
number of multipliers is reduced to one half at the cost of equal throughput reduction.
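The complex-to-real transformation mentioned above can be sketched as follows. The stacking order (real parts first, then imaginary parts) and the function names are assumptions made for illustration; other orderings are equally valid.

```python
def real_decomposition(H, y):
    """Map the complex model y = H s + n to an equivalent real model of twice the size."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real


def stack_real(s):
    """Stack a complex symbol vector as [Re(s); Im(s)]."""
    return [v.real for v in s] + [v.imag for v in s]
```

For an M-antenna system the real-valued tree has 2M levels instead of M, but each node of a 64-QAM system has 8 children (one 8-level real dimension) instead of 64.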
Flexibility is another issue in circuit design. Ideally, the circuit should be flexible
enough to support different search schemes (Depth-first or K-best). In general,
flexibility comes with an overhead that reduces both energy efficiency and area
efficiency. This overhead should be minimized while maintaining system performance.
Fig. 8 shows the circuit diagram of one PE. m pipeline registers are inserted in the
loop, so the critical path can be shortened to meet the timing constraint by choosing a
larger m. Since m data streams are interleaved into the PE, the hardware is always
active and achieves the same maximum throughput as if the m pipeline registers had been
inserted in a datapath without the loop. The area overhead of the up-samplers for R can
be removed if R is invariant for each sub-carrier during one packet transmission. The
flexibility of the search scheme is provided by the shift-register chain, which can be
configured for forward trace or backward trace. By placing K PEs in one sphere decoder,
K search paths are explored at the same time to implement the K-best algorithm, while
each PE retains the flexibility to trace back as in Depth-first search. The flexibility
to support varying antenna array sizes is provided by the folding architecture, which
reuses the same hardware at a higher clock frequency as the antenna array size
increases. The details of the sub-modules are described below.
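The effect of interleaving m data streams into a loop with m pipeline registers can be
illustrated with a small behavioral model. This is only a sketch under simplifying
assumptions: the loop body is reduced to a plain accumulation, and the name
interleaved_accumulate is hypothetical. Each clock cycle, the state written m cycles
earlier leaves the register chain, is updated with the next input of the same stream,
and re-enters the chain, so the loop does useful work in every cycle.

```python
from collections import deque


def interleaved_accumulate(streams):
    """Model one accumulator loop with m pipeline registers shared by m streams.

    streams: list of m equal-length input lists (for example, one per sub-carrier).
    Returns (per-stream sums, number of clock cycles used).
    """
    m = len(streams)
    n = len(streams[0])
    pipeline = deque([0] * m)  # the m registers inside the feedback loop
    cycles = 0
    for k in range(n):  # k-th sample of every stream
        for c in range(m):  # one stream is served per clock cycle
            state = pipeline.popleft()  # state of stream c, written m cycles ago
            pipeline.append(state + streams[c][k])
            cycles += 1
    return list(pipeline), cycles


sums, cycles = interleaved_accumulate(
    [[1, 2, 3], [10, 20, 30], [100, 200, 300], [5, 5, 5]]
)
```

In this model, m streams of n samples finish in m*n cycles. A single stream sent through
the same m-register loop could start a new accumulation step only every m cycles, so it
would need the same m*n cycles for just n samples.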
Fig. 8. Circuit diagram of one PE.
1) Metric Calculation:
The Metric Calculation Unit (MCU) computes $\sum_{j=i+1}^{M} R_{ij} s_j$. Basically, it executes a
Multiply-Accumulate (MAC) operation. To accumulate up to 16 operands at the highest
throughput, 15 simplified complex multipliers and 1 simplified real multiplier are
followed by an adder tree that merges the partial products. The number of multipliers
can be reduced by time-multiplexing at the price of lower throughput [30]. For example,
4 complex multipliers can be time-multiplexed by a factor of 4 to emulate 16
multipliers, with throughput reduced by a factor of 4. Such tuning at the architecture
level is used to position the design along the throughput and power axes, together with
optimal tuning of variables such as supply voltage.
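The tradeoff between a fully parallel MAC and a time-multiplexed one can be sketched in
a few lines. This is a behavioral illustration only: the function names, the cycle
counting, and the example operands are assumptions, and the adder tree is modelled by a
plain sum.

```python
def mac_parallel(r_row, s):
    """One multiplier per operand: all products in one cycle, merged by an adder tree."""
    return sum(r * x for r, x in zip(r_row, s)), 1


def mac_time_multiplexed(r_row, s, num_multipliers):
    """Reuse num_multipliers multipliers over several cycles, accumulating the result."""
    acc = 0
    cycles = 0
    for start in range(0, len(r_row), num_multipliers):
        stop = start + num_multipliers
        acc += sum(r * x for r, x in zip(r_row[start:stop], s[start:stop]))
        cycles += 1
    return acc, cycles


# Hypothetical row of 16 operands; the first coefficient is real, like a diagonal term of R.
r_row = [complex(3, 0)] + [complex(j + 1, j - 8) for j in range(1, 16)]
# 16-QAM symbols with coordinates in {-3, -1, 1, 3}.
s = [complex(2 * (j % 4) - 3, 3 - 2 * (j // 4)) for j in range(16)]

full, full_cycles = mac_parallel(r_row, s)
shared, shared_cycles = mac_time_multiplexed(r_row, s, 4)
```

Both versions return the same sum. With 4 shared multipliers the model needs 4 cycles
instead of 1, which mirrors the fourfold throughput reduction described above.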
Since the search process advances one stage per clock cycle, we propose an
FIR-like architecture to facilitate metric calculation, as shown in Fig. 8. If only
forward trace is allowed, as in the K-best algorithm, the BER performance is limited by
the number of parallel processors: even if more processing cycles are provided, the BER
performance of the K-best algorithm cannot improve. Observing that a trace-back moves
up by only one or two layers rather than jumping to a random layer, we embed a
bidirectional shift-register chain to adjust the search depth. Since the search state
is recorded in the shift registers, no extra memory, such as a stack memory, is needed
to keep all the states [26] [40]. Because of the trace-back requirement, the
transposed-form FIR architecture cannot be used to reduce the critical path; instead,
the critical path is reduced by data interleaving.
Fig. 9. Circuit diagram of MCU.
The coefficients of the R matrix are stored in memory in an area-efficient way. The
diagonal terms of R are real, while the remaining nonzero terms are complex. Exploiting
the upper-triangular structure, the real parts (including the diagonal) and the
imaginary parts (above the diagonal) are organized into one square memory, which saves
around 50% of the area.
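One possible layout that achieves this saving is sketched below. The exact memory
organization is not specified here, so the placement used in the sketch (real parts on
and above the diagonal, imaginary parts mirrored below the diagonal) and the names
pack_r and unpack_r are assumptions. For an M×M matrix, the M(M+1)/2 real parts and the
M(M-1)/2 imaginary parts fill exactly M×M real words, compared with 2×M×M words when
every entry is stored as a full complex number.

```python
def pack_r(R):
    """Pack an upper-triangular R with real diagonal into an M x M array of real words.

    mem[i][j] with i <= j holds Re(R[i][j]); mem[j][i] with i < j holds Im(R[i][j]).
    """
    m = len(R)
    mem = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            mem[i][j] = R[i][j].real
            if j > i:
                mem[j][i] = R[i][j].imag
    return mem


def unpack_r(mem):
    """Rebuild the complex upper-triangular matrix from the packed square memory."""
    m = len(mem)
    R = [[0j] * m for _ in range(m)]
    for i in range(m):
        R[i][i] = complex(mem[i][i], 0.0)
        for j in range(i + 1, m):
            R[i][j] = complex(mem[i][j], mem[j][i])
    return R


# Hypothetical 3x3 example.
R = [
    [2 + 0j, 1 - 1j, 0.5 + 2j],
    [0j, 3 + 0j, -1 + 0.25j],
    [0j, 0j, 1.5 + 0j],
]
mem = pack_r(R)
packed_words = sum(len(row) for row in mem)  # M*M real words
naive_words = 2 * len(R) * len(R)            # real and imaginary word per entry
```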
2) Metric Enumeration:
The Metric Enumeration Unit (MEU) enumerates the possible constellation points
according to their Partial Euclidean Distance (PED) ($\sum_{j=i}^{M} |T_j|^2$) in ascending order.
Exhaustive search is a straightforward implementation: it calculates the PEDs of all
constellation points and uses a sorting circuit to find the minimum one, as shown in
Fig. 10 (a). The number of distance calculation units is proportional to the
constellation size (64 units are required for 64-QAM, for example). This requirement
by itself makes it hard to support a large constellation size, in addition to the extra
latency introduced by the minimum search circuit.
In the constellation plane, metric enumeration is equivalent to ordering the scaled
constellation points RiiQi from the closest to bi to the farthest [2]. This is
the underlying principle of the Schnorr-Euchner (SE) algorithm. SE enumeration was
originally applied to one-dimensional signals, such as real-valued PAM or PSK
constellations; it was therefore modified to arrange QAM constellations in PQ
concentric groups to fit the original algorithm. For example, the 16-QAM constellation
can be expressed as an arrangement of points on 3 concentric circles. The problem is
then reformulated as finding the closest point in each subgroup and then the closest
point across subgroups, as shown in Fig. 10 (b) [24].
Fig. 10. Closest point selection scheme: (a) exhaustive search architecture, (b) SE enumeration for
QAM, (c) region partition based search approach. Real values are represented by gray lines.
The original algorithm [2] uses the phase relationship to find the closest point in a
concentric circle. This approach is not suitable for hardware implementation, so [24]
proposed a decision boundary based method to simplify the SE enumeration. One
type of decision boundary is set by straight lines passing through the origin and the
midpoint between two adjacent constellation points on a concentric circle, to
specify the starting point. Another type of decision boundary is set by straight lines
passing through the origin and the midpoint between the two constellation points
around the starting point on a concentric circle, to determine the initial search
direction. However, this simplification is only applicable to small constellations such
as 16-QAM. Larger constellations are hard to support for several reasons. First, the
decision boundary algorithm is quite complex: many multiplications are needed to
generate the decision boundaries. Second, the number of subgroups grows quickly,
which increases the latency of the min-search circuit. For example, 64-QAM is
decomposed into 9 subgroups. Third, the concentric group partition changes with the
QAM constellation size, which makes it infeasible for the architecture to support
different modulations.
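The subgroup counts quoted above can be checked by grouping the points of a square QAM
constellation by their distance from the origin. The sketch below assumes the usual
odd-integer grid (coordinates ±1, ±3, ...); the function name concentric_groups is
hypothetical.

```python
def concentric_groups(points_per_axis):
    """Group square-QAM points by squared radius (one group per concentric circle)."""
    levels = [2 * k + 1 - points_per_axis for k in range(points_per_axis)]
    groups = {}
    for re in levels:
        for im in levels:
            groups.setdefault(re * re + im * im, []).append(complex(re, im))
    return groups


groups_16qam = concentric_groups(4)  # 3 circles, squared radii 2, 10, 18
groups_64qam = concentric_groups(8)  # 9 circles
```

Besides the larger number of circles, the circles of 64-QAM hold different numbers of
points (4, 8, or 12 on this grid), which illustrates why the partition does not carry
over unchanged from one modulation to another.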
We propose a simple partition method based on Cartesian coordinates. The
constellation plane is partitioned into 64 regions for 64-QAM (8 regions along the real
axis and 8 regions along the imaginary axis). The closest point (with minimum distance)
can be decided by the location of bi/Rii, since the real part and the imaginary part can
be decoded separately, as shown in Fig. 10 (c). In fact, this idea is also applied to
symbol decision. For instance, to make a decision in a QPSK system, we do not need to
calculate the distances from the received signal to the 4 constellation points. Instead,
we just need to examine the signs of the real and imaginary parts.
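The region partition idea can be sketched and compared against exhaustive search as
follows. This is an illustrative model, not the circuit of Fig. 10 (c): it assumes an
odd-integer constellation grid and a real, positive Rii, and all function names are
hypothetical. Each axis is decided independently by comparing against the boundaries
that lie halfway between adjacent levels.

```python
def qam_levels(points_per_axis):
    """Odd-integer levels of one axis, e.g. [-7, -5, ..., 7] for 64-QAM."""
    return [2 * k + 1 - points_per_axis for k in range(points_per_axis)]


def slice_axis(x, points_per_axis):
    """Pick the closest level using only comparisons with the region boundaries."""
    idx = 0
    for boundary in range(2 - points_per_axis, points_per_axis, 2):
        if x > boundary:
            idx += 1
    return qam_levels(points_per_axis)[idx]


def closest_by_region(b, r_ii, points_per_axis):
    """Region partition: decide real and imaginary parts of b / r_ii separately."""
    z = b / r_ii  # assumes r_ii is real and positive
    return complex(slice_axis(z.real, points_per_axis),
                   slice_axis(z.imag, points_per_axis))


def closest_exhaustive(b, r_ii, points_per_axis):
    """Reference: evaluate the distance to every scaled constellation point."""
    levels = qam_levels(points_per_axis)
    points = [complex(re, im) for re in levels for im in levels]
    return min(points, key=lambda q: abs(b - r_ii * q) ** 2)


# Hypothetical operating point for 64-QAM.
b_example = 1.3 - 2.2j
picked = closest_by_region(b_example, 0.5, 8)
reference = closest_exhaustive(b_example, 0.5, 8)
```

For 64-QAM the region decision needs 7 comparisons per axis in this model, whereas the
exhaustive reference evaluates 64 distances. The division by Rii could equally be
avoided by comparing bi against boundaries scaled by Rii. For the QPSK case
(points_per_axis = 2) the only boundary is 0, so the decision reduces to the sign test
described above.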
The area complexity of the three architectures in Fig. 10 is evaluated using the
number of add-equivalent operators (add, subtract, compare) as the area estimate. For