DSP lab
בס"ד (With God's help)
11/15/2014
Introduction to DSP processors
Presented by:
Contents:
Classification of modern processors;
Digital signal processing methods & algorithms;
Implementation of D(igital) S(ignal) P(rocessing) algorithms;
The SHARC processor: architecture, data types & formats, C & Assembler;
Getting started.
The modern processor’s classification.
Today chips are divided into three groups:
FPGA & EPLD
ASICs (Application Specific Integrated Circuits)
Chips with hardware realization of data processing algorithms (microprocessors & microcontrollers)
The modern processor’s classification.
Microprocessors & microcontrollers
DSP microprocessors.
The processors are intended for Real-Time
Digital Signal Processing systems.
General-purpose microprocessors.
This kind of processor is intended for
computer systems: PC, workstation &
parallel supercomputer.
Microcontrollers
Highly specialized processors intended for embedded systems and various household devices.
Review: Processor Classes
General Purpose - high performance – Pentiums, Alpha's, SPARC
– 64-128 bit word size
– Used for general purpose software
– Heavy weight OS - UNIX, NT
– Multiple layers of cache memory
– Workstations, PC's
Embedded processors and processor cores – ARM9, ARC, 486SX, Hitachi SH7000, NEC V800
– 32 bit word size
– Single program
– Lightweight, often real-time OS
– Code and Data memory cache, DSP support
– Cellular phones, consumer electronics (e.g. CD players)
DSP processors – SHARC, BlackFin, TMS320C55x, TMS320C67x, TMS320C64x
– 16-32 bit word size
– Single program
– Lightweight, often real-time OS
– Super Harvard Architecture Support, MAC, Circular buffer, Dual-Port RAM
– Audio, Image and Video processing, Coding and Decoding, Cellular Base Station, Adaptive Filtering, Real Time operations
inefficient because it does not exploit the symmetry and periodicity properties of the phase factor $W_N$. In particular, these two properties are:
Symmetry property: $W_N^{k+N/2} = -W_N^{k}$
Periodicity property: $W_N^{k+N} = W_N^{k}$
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{14} + x(3)W_8^{21} + x(4)W_8^{28} + x(5)W_8^{35} + x(6)W_8^{42} + x(7)W_8^{49}$
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$
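The two properties can be checked numerically. The sketch below is only an illustration: the helper name `W` and the test samples are assumptions, not part of the slides. It evaluates X(7) once with the raw exponents 7n and once with every exponent reduced modulo 8.

```python
import cmath

N = 8

def W(k, n=N):
    """Phase (twiddle) factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n)

# Arbitrary test samples (assumed values, chosen only for illustration)
x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]

# Direct definition: X(7) = sum over n of x(n) * W_8^(7n)
X7_direct = sum(x[n] * W(7 * n) for n in range(N))

# Same sum with every exponent reduced modulo 8 (periodicity property)
X7_reduced = sum(x[n] * W((7 * n) % N) for n in range(N))

# Symmetry: W_N^(k+N/2) = -W_N^k ; periodicity: W_N^(k+N) = W_N^k
symmetry_ok = all(abs(W(k + N // 2) + W(k)) < 1e-12 for k in range(N))
periodicity_ok = all(abs(W(k + N) - W(k)) < 1e-12 for k in range(N))
```

Both forms of X(7) agree to rounding error, which is why only the eight distinct factors $W_8^0 \dots W_8^7$ ever need to be stored.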
The D(igital) S(ignal) P(rocessing)
algorithms implementation
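[Figure: 8-point FFT butterfly flow graph. Inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) to X(7). Twiddle factors: $W_N^0$ and $W_N^4$ in the first stage; $W_N^0$, $W_N^2$, $W_N^4$, $W_N^6$ in the second stage; $W_N^0$ to $W_N^7$ in the third stage.]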
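$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{13} + x(4)W_8^{4} + x(5)W_8^{11} + x(6)W_8^{10} + x(7)W_8^{17}$
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$
($W_8^{8} = W_8^{0}$, $W_8^{13} = W_8^{5}$, $W_8^{11} = W_8^{3}$, $W_8^{10} = W_8^{2}$, $W_8^{17} = W_8^{1}$)
Intermediate node values on the path to X(7):
First stage: $x(0) + x(4)W_N^{4}$; $x(2) + x(6)W_N^{4}$; $x(1) + x(5)W_N^{4}$; $x(3) + x(7)W_N^{4}$
Second stage: $x(0) + x(4)W_N^{4} + x(2)W_N^{6} + x(6)W_N^{10}$; $x(1) + x(5)W_N^{4} + x(3)W_N^{6} + x(7)W_N^{10}$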
The D(igital) S(ignal) P(rocessing)
algorithms implementation
[Figure: the same 8-point FFT butterfly flow graph, with the paths to X(2) traced.]
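$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{8} + x(5)W_8^{10} + x(6)W_8^{12} + x(7)W_8^{14}$
$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{0} + x(5)W_8^{2} + x(6)W_8^{4} + x(7)W_8^{6}$
($W_8^{8} = W_8^{0}$, $W_8^{10} = W_8^{2}$, $W_8^{12} = W_8^{4}$, $W_8^{14} = W_8^{6}$)
Intermediate node values on the path to X(2):
First stage: $x(0) + x(4)W_N^{0}$; $x(2) + x(6)W_N^{0}$; $x(1) + x(5)W_N^{0}$; $x(3) + x(7)W_N^{0}$
Second stage: $x(0) + x(4)W_N^{0} + x(2)W_N^{4} + x(6)W_N^{4}$; $x(1) + x(5)W_N^{0} + x(3)W_N^{4} + x(7)W_N^{4}$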
Bit reverse operations
000 – 000
100 – 001
010 – 010
110 – 011
001 – 100
101 – 101
011 – 110
111 – 111
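The index mapping in the table above can be sketched in a few lines. The function name `bit_reverse` is an assumption made for this illustration; on a DSP the same reordering is normally done by a bit-reversed addressing mode rather than in software.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result

# Input ordering for an 8-point FFT (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
as_binary = [format(i, "03b") for i in order]
```

Here `order` is the sequence x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) that appears at the input side of the flow graph.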
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Cross correlation is a measure of similarity of two waveforms as a function of a time lag applied to one of them.
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Discrete correlation
$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\,g(n+m), \quad 0 \le m \le M$
Discrete autocorrelation
$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\,f(n+m), \quad 0 \le m \le M$
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Convolution of two square pulses: the resulting waveform is a triangular pulse. One of the functions (in this case g) is first reflected about τ = 0 and then offset by t, making it g(t − τ). The area under the resulting product gives the convolution at t. The horizontal axis is τ for f and g, and t for f*g.
Convolution is a mathematical operation on two functions f and g,
producing a third function that is typically viewed as a modified
version of one of the original functions.
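The square-pulse case is easy to reproduce in discrete time. The sketch below is a direct, unoptimized sum; the function name `convolve` and the pulse length of four samples are assumptions chosen for illustration.

```python
def convolve(f, g):
    """Linear convolution y(n) = sum over m of f(m) * g(n - m) for finite sequences."""
    y = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(y)):
        for m in range(len(f)):
            if 0 <= n - m < len(g):       # g is zero outside its support
                y[n] += f[m] * g[n - m]
    return y

pulse = [1.0] * 4                  # a square pulse, four samples long
triangle = convolve(pulse, pulse)  # rises linearly to 4, then falls back
```

The output ramps up while the two pulses slide into overlap and ramps down as they separate, which is the triangular shape described above.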
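The D(igital) S(ignal) P(rocessing)
algorithms implementation
Discrete convolution
$y(n) = f(n) * g(n) = \sum_{m=-M}^{M} f(m)\,g(n-m) = \sum_{m=-M}^{M} f(n-m)\,g(m)$
Circular discrete convolution
$y(n) = \sum_{m=0}^{M-1}\left(\sum_{k=-N}^{N} f(m+kM)\right) g(n-m)$
STD – standard deviation
$S_N = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\left(x_i - \bar{x}\right)^2}$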
The DSP processor's architecture
Requirements for DSP processors:
1. High-speed data input and a variety of interface devices;
2. Wide dynamic range of the input data;
3. Hardware implementation of ADD, MULT & SHIFT; parallel processing;
4. Flexible processing (the ability to “jump” from one process to another);
5. Three mathematical units: ALU, barrel shifter and multiplier with a fast MAC operation (MRB = MRB + Rx * Ry);
6. Fast handling of loops, branches & interrupts; special addressing modes;
7. Circular buffer.
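Requirements 5 and 7 come together in the FIR filter inner loop. The following is a software sketch only: the class name `CircularFIR` and the coefficient values are assumptions, and on a DSP the modulo pointer update is done by the address generator hardware rather than by explicit `%` operations.

```python
class CircularFIR:
    """FIR filter whose delay line is a circular buffer (no samples are moved)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)
        self.head = 0  # index of the most recent sample

    def step(self, sample):
        # Overwrite the oldest sample; the pointer wraps around modulo the length.
        self.head = (self.head - 1) % len(self.delay)
        self.delay[self.head] = sample
        acc = 0.0
        for i, c in enumerate(self.coeffs):
            # One multiply-accumulate (MAC) per tap
            acc += c * self.delay[(self.head + i) % len(self.delay)]
        return acc

fir = CircularFIR([0.25, 0.25, 0.25, 0.25])  # 4-tap moving average
out = [fir.step(s) for s in [4.0, 4.0, 4.0, 4.0, 0.0]]
```

Each new sample costs one buffer write and one MAC per tap, independent of how long the filter has been running.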
Data types & formats.
Data types in DSP processors algorithms:
Integer (cycles, coefficients and arrays numbers);
Real (input & output data);
Complex (applications in frequency domain);
Logic (bitwise operation).
Data format in DSP processors :
Byte – 8 bit;
Short word – 16 bit;
Normal word – 32 bit;
Instruction word – 48 bit;
Extended normal word – 40 bit;
Long word – 64 bit.
Fixed-Point Design
Digital signal processing algorithms
– Often developed in floating point
– Later mapped into fixed point for digital hardware realization
Fixed-point digital hardware
– Lower area
– Lower power
– Lower per unit production cost
[Diagram: design flow Idea -> Floating-Point Algorithm -> Quantization (with Range Estimation) -> Fixed-Point Algorithm -> Code Generation -> Target System. The stages are grouped into an Algorithm Level and an Implementation Level.]
Fixed-Point Representation
Fixed point type
– Wordlength
– Integer wordlength
Quantization modes
– Round
– Truncation
Overflow modes
– Saturation
– Saturation to zero
– Wrap-around
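[Figure: word layout. A sign bit S followed by data bits X; the wordlength spans the whole word, and the integer wordlength spans the bits to the left of the binary point.]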
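The quantization and overflow modes listed above can be tried out with a small model. This is a sketch under stated assumptions: the function name `to_fixed` and its parameter names are hypothetical, and the sign bit is assumed to be counted inside the integer wordlength (conventions differ between tools).

```python
import math

def to_fixed(value, wordlength, int_wordlength, quantization="round", overflow="saturate"):
    """Quantize a real value to a signed fixed-point integer code.

    Assumption: int_wordlength includes the sign bit, so the number of
    fractional bits is wordlength - int_wordlength.
    """
    frac_bits = wordlength - int_wordlength
    scaled = value * (1 << frac_bits)
    if quantization == "round":
        q = math.floor(scaled + 0.5)   # round to nearest
    else:
        q = math.floor(scaled)         # truncation (towards minus infinity)
    lo = -(1 << (wordlength - 1))
    hi = (1 << (wordlength - 1)) - 1
    if lo <= q <= hi:
        return q
    if overflow == "saturate":
        return hi if q > hi else lo    # clip to the nearest representable value
    if overflow == "zero":
        return 0                       # saturation to zero
    return (q - lo) % (1 << wordlength) + lo   # wrap-around

rounded = to_fixed(0.7, 8, 1)                              # 0.7 * 128 = 89.6
truncated = to_fixed(0.7, 8, 1, quantization="truncate")
saturated = to_fixed(1.5, 8, 1)                            # out of range for this format
zeroed = to_fixed(1.5, 8, 1, overflow="zero")
wrapped = to_fixed(1.5, 8, 1, overflow="wrap")
```

With an 8-bit word and one integer bit, 1.5 is out of range, so the three overflow modes give three different stored codes for the same input.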
Overflow Handling in Fixed Point Computations
Overflow handling is an important consideration when implementing signal processing algorithms. If overflow is not controlled appropriately it can lead to problems such as detection
errors, or poor quality audio output. Typical digital signal processing CPUs include hardware support for handling overflow. Some RISC processors may include these modes as well.
(In fact I helped define and implement such modes for the 32 bit MIPS processor core used in many Broadcom products). These processors often have a “saturating” mode that sets an
instruction result to a minimum or maximum value on an overflow condition. (The term “saturating” comes from analog electronics, in which an amplifier output will be limited, or
clipped, between fixed values when a large input is applied.) Commonly the CPU will limit the result to a 32 bit twos complement integer (0x7FFFFFFF or 0x80000000). For
unsigned operations, the result would be limited to 0xFFFFFFFF. There are a number of situations in which overflow can occur, and I will discuss some of them below.
Addition and Subtraction
Overflow with twos complement integers occurs when the result of an addition or subtraction is larger than the largest integer that can be represented, or smaller than the smallest integer.
In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU
with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.
Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a
positive number, or when a positive number is subtracted from a negative number.
Negation
There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive
value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set,
then the CPU will set the result to 0x7FFFFFFF (just less than +1).
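The addition and negation cases can be modelled with ordinary integers. This is a sketch, not any particular CPU's behaviour: the helper names are assumptions, and Q31 values are represented by their 32-bit integer codes.

```python
Q31_MAX = 0x7FFFFFFF    # just below +1 in Q31
Q31_MIN = -0x80000000   # exactly -1 in Q31

def saturate32(value):
    """Clip to the 32-bit two's complement range (saturating mode)."""
    return max(Q31_MIN, min(Q31_MAX, value))

def wrap32(value):
    """Keep only the low 32 bits, interpreted as two's complement (no saturation)."""
    return (value + 0x80000000) % 0x100000000 - 0x80000000

add_wrapped = wrap32(Q31_MAX + 1)        # overflow flips the sign
add_saturated = saturate32(Q31_MAX + 1)  # overflow is clipped
neg_wrapped = wrap32(-Q31_MIN)           # negating -1 gives -1 again
neg_saturated = saturate32(-Q31_MIN)     # saturating negate gives just below +1
```

The wrapped results show the sign reversal that saturation hardware is designed to prevent.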
Arithmetic Shift
Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the
format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the
original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer
avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.
Overflow will not occur when right shifting a number.
Multiplication
Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When
multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional
bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting
the result left by 1) then an overflow would occur during the left shift. The overall effect would be that -1 times -1 equals -1.
I have used a CPU that handles this special case with saturation hardware. Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The
reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if
the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication
mode is enabled.
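The -1 times -1 special case can be reproduced with a small model of the “shift left by one” multiply. The function name `q15_mul` is an assumption for this sketch; Q15 operands are passed as 16-bit integer codes.

```python
def q15_mul(a, b, saturate=True):
    """Multiply two Q15 codes; return (Q31 result, Q15 result)."""
    product = a * b          # Q1.30 interpretation: always fits in 32 bits
    q31 = product << 1       # "shift left by one" mode gives Q31
    if saturate:
        q31 = max(-0x80000000, min(0x7FFFFFFF, q31))
    else:
        q31 = (q31 + 0x80000000) % 0x100000000 - 0x80000000  # wrap to 32 bits
    q15 = q31 >> 16          # store the upper 16 bits for a Q15 result
    return q31, q15

half = 0x4000                # 0.5 in Q15
minus_one = -0x8000          # -1 in Q15 (bit pattern 0x8000)

normal_q31, normal_q15 = q15_mul(half, half)                    # 0.25
wrapped_q31, _ = q15_mul(minus_one, minus_one, saturate=False)  # -1 * -1 = -1
sat_q31, sat_q15 = q15_mul(minus_one, minus_one)                # just below +1
```

Only the single operand pair (-1, -1) overflows; every other Q15 product fits in Q31 after the shift.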
A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. I would recommend not
using the “left shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle
overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can
be greatly reduced in some cases.
Division
Overflow in division can occur when the result would have more bits than was calculated. For example, if the magnitude of the numerator is several times larger than that of the
denominator, then the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on,
and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.
Optimum Wordlength
Longer wordlength
– May improve application performance
– Increases hardware cost
Shorter wordlength
– May increase quantization errors
– Minimizes hardware cost
Minimize hardware cost or minimize quantization error
[Figure: cost c(w) and distortion d(w) [1/performance] plotted against wordlength (w), with the optimum wordlength marked.]
Filter Implementation
Finite word-length effects (fixed point implementation)
• Coefficient quantization
• Overflow & quantization in arithmetic operations
  • scaling to prevent overflow
  • quantization noise statistical modeling
  • limit cycle oscillations
Coefficient Quantization
The coefficient quantization problem :
Filter design in Matlab (e.g.) provides filter coefficients to 15
decimal digits (such that filter meets specifications)
For implementation, need to quantize coefficients to the word
length used for the implementation.
As a result, the implemented filter may fail to meet the specifications.
PS: In present-day signal processors, this has become less of a problem
(e.g. with 16 bits (=4 decimal digits) or 24 bits (=7 decimal digits)
precision). In hardware design, with tight speed requirements, this is still
a relevant problem.
Coefficient Quantization
Coefficient quantization effect on pole locations:
1. tightly spaced poles (e.g. for narrow band filters) imply high sensitivity of pole locations to coefficient quantization
2. hence preference for low-order systems (parallel/cascade)
Example: implementation of a 12th-order band-pass IIR filter
[Figure: two panels, cascade structure with 16-bit coefficients and direct form with 16-bit coefficients.]
Coefficient Quantization
Coefficient quantization effect on pole locations :
example: 2nd-order system (e.g. for cascade realization)
$H_i(z) = \frac{1 + \alpha_i z^{-1} + \beta_i z^{-2}}{1 + \gamma_i z^{-1} + \delta_i z^{-2}}$
Coefficient Quantization
example (continued) :
with 5 bits per coefficient, all possible pole positions are:
for γ_i = -2:0.1250:2
for δ_i = -1:0.0625:1
plot(roots([1 γ_i δ_i]))
end
end
[Figure: permissible pole positions in the z-plane; both axes run from -1.5 to 1.5.]
Low density of permissible pole locations at z=1, z=-1, hence a problem for narrow-band LP and HP filters
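The grid of permissible poles can also be enumerated without plotting. This is a sketch under stated assumptions: the denominator is taken as $z^2 + \gamma z + \delta$ with the two coefficient grids from the loop above, only complex-conjugate pole pairs are counted, and the search radius of 0.2 is an arbitrary choice for illustration.

```python
import cmath

def complex_poles(gamma_step=0.125, delta_step=0.0625):
    """Upper-half-plane poles of z^2 + gamma*z + delta on the quantized grid."""
    poles = []
    n_gamma = int(round(4 / gamma_step)) + 1   # gamma runs from -2 to 2
    n_delta = int(round(2 / delta_step)) + 1   # delta runs from -1 to 1
    for i in range(n_gamma):
        gamma = -2 + i * gamma_step
        for k in range(n_delta):
            delta = -1 + k * delta_step
            disc = gamma * gamma - 4 * delta
            if disc < 0:                        # complex-conjugate pole pair
                poles.append((-gamma + cmath.sqrt(disc)) / 2)
    return poles

def count_near(poles, centre, radius):
    """Number of permissible poles closer than radius to centre."""
    return sum(1 for p in poles if abs(p - centre) < radius)

poles = complex_poles()
near_one = count_near(poles, 1, 0.2)   # region needed by narrow-band low-pass filters
near_j = count_near(poles, 1j, 0.2)    # region around a quarter of the sampling rate
```

On this grid no complex pole pair lands within 0.2 of z = 1, while several lie that close to z = j, which matches the low density near z = 1 noted on the slide.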
Coefficient Quantization
example (continued) :
possible remedy: `coupled realization’
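poles are at $a \pm j\,b$, where $a$, $b$ are realized/quantized
hence permissible pole locations are (5 bits)
[Figure: permissible pole positions for the coupled realization in the z-plane, both axes from -1.5 to 1.5, together with the coupled-form signal-flow graph from u[k] to y[k].]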
Quantization of an FIR filter
Transfer function ΔH(z)
The effect of coefficient quantization on linear phase
FIR filter example
Passband attenuation 0.01, Radial frequency (0,0.4)
Stopband attenuation 0.001, Radial frequency (0.4, )
FIR filter example – 16 bits
FIR filter example – 8 bits
Arithmetic Operations
Finite word-length effects in arithmetic operations:
In linear filters, have to consider additions & multiplications
Addition:
if two B-bit numbers are added, the result has (B+1) bits.
Multiplication:
if a B1-bit number is multiplied by a B2-bit number, the
result has (B1+B2-1) bits.
For instance, two B-bit numbers yield a (2B-1)-bit product
Typically (especially so in an IIR (feedback) filter), the result of an addition/multiplication has to be represented again as a B’-bit number (e.g. B’=B). Hence have to get rid of either most significant bits or least significant bits…
Arithmetic Operations
Option-1: Most significant bits
If the result is known to be upper bounded so that the most significant
bit(s) is(are) always redundant, it(they) can be dropped, without loss of
accuracy.
This implies we have to monitor potential overflow, and introduce scaling
strategy to avoid overflow.
Option-2 : Least significant bits
Rounding/truncation/… to B’ bits introduces quantization noise.
The effect of quantization noise is usually analyzed in a statistical manner.
Quantization, however, is a deterministic non-linear effect, which may give
rise to limit cycle oscillations.
Scaling
The scaling problem:
Finite word-length implementation implies maximum
representable number. Whenever a signal (output or
internal) exceeds this value, overflow occurs.
Digital overflow may lead (e.g. in 2’s-complement
arithmetic) to polarity reversal (instead of saturation
such as in analog circuits), hence may be very harmful.
Avoid overflow through proper signal scaling
Scaled transfer function may be c*H(z) instead of H(z)
(hence need proper tracing of scaling factors)
Scaling
Time domain scaling:
Assume input signal is bounded in magnitude
$|u[k]| \le u_{\max}$
(i.e. u-max is the largest number that can be represented in the `words' reserved for the input signal)
Then output signal is bounded by
$|y[k]| = \left|\sum_{i=0}^{\infty} h[i]\,u[k-i]\right| \le \sum_{i=0}^{\infty} |h[i]|\,|u[k-i]| \le u_{\max}\sum_{i=0}^{\infty} |h[i]| = u_{\max}\,\|h\|_1$
To satisfy
$|y[k]| \le y_{\max}$
(i.e. y-max is the largest number that can be represented in the `words' reserved for the output signal)
we have to scale H(z) to c.H(z), with
$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$
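Scaling
Example:
$H(z) = \frac{1}{1 - 0.99\,z^{-1}}$
$\|h\|_1 = \dots = \frac{1}{1 - 0.99} = 100$
assume u[k] comes from 12-bit A/D-converter
assume we use 16-bit arithmetic for y[k] & multiplier
$c = \frac{2^{16}}{2^{12}\,\|h\|_1} = 0.16 \;\rightarrow\; c = 2^{-3}$
hence inputs u[k] have to be shifted by 3 bits to the right before entering the filter (=loss of accuracy!)
[Figure: first-order recursive filter from u[k] to y[k] with an adder and a feedback multiplier 0.99, shown without and with the input shift.]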
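The numbers in this example can be reproduced directly. The sketch below makes one assumption beyond the slide: the infinite impulse response h[i] = 0.99^i is truncated after 5000 terms, where the remaining tail is negligible.

```python
import math

# Impulse response of H(z) = 1 / (1 - 0.99 z^-1), truncated (assumed length)
h = [0.99 ** i for i in range(5000)]
h_l1 = sum(abs(v) for v in h)        # L1 norm, close to 1 / (1 - 0.99) = 100

u_max = 2 ** 12                      # 12-bit A/D converter
y_max = 2 ** 16                      # 16-bit arithmetic
c = y_max / (u_max * h_l1)           # time-domain scaling factor

# Smallest right shift s with 2**-s <= c, so that overflow can never occur
shift = math.ceil(math.log2(1 / c))
```

Rounding c = 0.16 down to a power of two is what turns the scaling into a plain 3-bit right shift.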
Scaling
L2-scaling: (`scaling in L2 sense')
Time-domain scaling is simple & guarantees that overflow will never occur, but is often over-conservative (=too small c):
$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$
If an `energy upper bound' for the input signal is known
$\sum_{k=0}^{\infty} u[k]^2 \le E_{U\max}$
then L2-scaling uses
$c = \frac{y_{\max}}{\sqrt{E_{U\max}}\,\|h\|_2}$
where
$\|h\|_2 = \sqrt{\sum_{i=0}^{\infty} h[i]^2}$
…is an L2-norm (this leads to larger c)
Scaling
So far considered scaling of H(z), i.e. transfer function
from u[k] to y[k]. In fact we also need to consider
overflow and scaling of each internal signal, i.e. scaling of
transfer function from u[k] to each and every internal
signal !
This requires quite some thinking….
(but doable)
[Figure: 4th-order direct form filter with feed-forward coefficients b0 to b4, feedback coefficients -a1 to -a4, internal signals x1[k] to x4[k] and output y[k].]
Scaling
Something that may help: If 2’s-complement arithmetic is used, and
if the sum of K numbers (K>2) is guaranteed not to overflow, then
overflows in partial sums cancel out and do not affect the final
result (similar to `modulo arithmetic’).
Example:
if x1+x2+x3+x4 is guaranteed not to
overflow, then if in (((x1+x2)+x3)+x4)
the sum (x1+x2) overflows, this overflow
can be ignored, without affecting the
final result.
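The cancellation of intermediate overflows is easy to confirm with a 16-bit model. The helper name `wrap16` and the four term values are assumptions picked so that the first partial sum overflows while the total does not.

```python
def wrap16(value):
    """Keep the low 16 bits, interpreted as a two's complement number."""
    return (value + 0x8000) % 0x10000 - 0x8000

terms = [30000, 20000, -25000, -15000]   # true sum is 10000, which fits in 16 bits

acc = 0
overflowed = False
for t in terms:
    exact = acc + t
    if exact != wrap16(exact):
        overflowed = True                # a partial sum left the 16-bit range
    acc = wrap16(exact)                  # wrap-around, no saturation
```

Note that this only works with wrap-around arithmetic: a saturating accumulator would clip the first partial sum and the error would stay in the final result.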
As a result (1), in a direct form realization,
eventually only 2 signals have to be
considered in view of scaling :
[Figure: the same 4th-order direct form realization with coefficients b0 to b4 and -a1 to -a4 and internal signals x1[k] to x4[k].]
Scaling
As a result (2), in a transposed direct form realization, eventually
only 1 signal has to be considered in view of scaling……….:
hence preference for transposed direct form over direct form.
[Figure: 4th-order transposed direct form realization from u[k] to y[k] with coefficients b0 to b4 and -a1 to -a4 and internal signals x1[k] to x4[k].]
Quantization Noise
The quantization noise problem :
If two B-bit numbers are added (or multiplied), the result is a B+1 (or 2B-1) bit number. Rounding/truncation/… to (again) B bits, to get rid of the least significant bit(s) introduces quantization noise.
The effect of quantization noise is usually analyzed in a statistical manner.
Quantization, however, is a deterministic non-linear effect, which may give rise to limit cycle oscillations.
PS: Will focus on multiplications only. Assume additions are implemented with sufficient number of output bits, or are properly scaled, or…
Conclusion: weird, weird, weird,… ! Copyright Marc Moonen [1]
Limit Cycles
Limit cycle oscillations are clearly unwanted (e.g. may be
audible in speech/audio applications)
Limit cycle oscillations can only appear if the filter has
feedback. Hence FIR filters cannot have limit cycle
oscillations.
Mathematical analysis is very difficult.
Truncation often helps to avoid limit cycles (e.g. magnitude
truncation, where absolute value of quantizer output is
never larger than absolute value of quantizer input
(`passive quantizer’)).
Some filter structures can be made limit cycle free, e.g.
coupled realization, orthogonal filters (see below).
The DSP processor's architecture.
DSP processors with fixed and floating point.
Fixed versus Floating:
Fixed-point arithmetic operations are simpler to realize in hardware;
A floating-point DSP processor has more data types and instructions;
Floating point advantages:
Higher accuracy;
Wide dynamic range;
Data overflow is rarely a problem;
Friendly for the C compiler.
Fixed point advantages:
Cheaper;
Compact.
Data types & formats.
Dynamic range:
$DynR = \frac{value_{\max}}{value_{\min \ne 0}}$
or in [dB]:
$DynR_{dB} = 20\log_{10}\frac{value_{\max}}{value_{\min \ne 0}}$
maximum linearity error:
$2^{-b}$
max. precision [bits]:
$\log_2\frac{\text{max value}}{\text{max quantization error}}$
(b – data width)
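As a worked instance of these formulas, the sketch below evaluates them for a 16-bit signed integer format. The function names and the choice of format are assumptions made for illustration.

```python
import math

def dynamic_range_db(max_value, min_nonzero_value):
    """Dynamic range in dB: 20 * log10(max / smallest non-zero value)."""
    return 20 * math.log10(max_value / min_nonzero_value)

def precision_bits(max_value, max_quantization_error):
    """Maximum precision in bits: log2(max value / max quantization error)."""
    return math.log2(max_value / max_quantization_error)

# 16-bit signed integers: largest value 2**15 - 1, smallest non-zero magnitude 1
dr_16bit = dynamic_range_db(2 ** 15 - 1, 1)

# A fraction with full scale 1.0 and a quantization error of 2**-16
bits = precision_bits(1.0, 2 ** -16)
```

The result of about 90 dB reflects the familiar rule of thumb of roughly 6 dB of dynamic range per bit.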
C vs. Assembly
DSP programs are different from traditional software tasks in two important respects.
– First, the programs are usually much shorter, say, one hundred lines versus ten thousand lines.
– Second, the execution speed is often a critical part of the application.
If assembly is used at all, it is restricted to short subroutines that must run with the utmost speed.
C vs. Assembly
Programs in C are more flexible and quick to develop.
Programs in assembly often have better performance: they run faster and use less memory, resulting in lower cost.
C vs. Assembly
Which language is best for your application?
– If you need flexibility and fast development, choose C.
– Use assembly if you need the best possible performance.
How complicated is the program?
– If it is large and intricate, use C.
– If it is small and simple, assembly may be a good choice.
Are you pushing the maximum speed of the DSP?
– If so, assembly will give you the last drop of performance from the device.
– For less demanding applications, assembly has little advantage, and you should consider using C.
How many programmers will be working together?
– If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time critical segments.
Which is more important, product cost or development cost?
– If it is product cost, choose assembly;
– If it is development cost, choose C.
What is your background?
– If you are experienced in assembly (on other microprocessors), choose assembly for your DSP.
– If your previous work is in C, choose C for your DSP.
The DSP processor ‘s architecture.
“Traditional” von Neumann architecture
[Figure: a CPU connected to a single memory (data & instructions) through one address bus and one data bus.]
Harvard architecture
[Figure: a CPU connected to program memory (instructions only) through the PM address bus and PM data bus, and to data memory (data only) through the DM address bus and DM data bus.]
The DSP processor ‘s architecture.
Super Harvard architecture
[Figure: a CPU with an instruction cache, connected to program memory through the PM address bus and PM data bus and to data memory through the DM address bus and DM data bus; an I/O controller moves data to and from memory.]
This is the structure of the SHARC DSP processor.
The DSP processor ‘s architecture.
SHARC DSP processor structure
The DSP processor ‘s architecture.
The ADSP-21160 hardware structure.
[Figure: ADSP-21160 block diagram. Core processor: program sequencer, instruction cache (32 x 48-bit), DAG1 and DAG2 (8 x 4 x 32 each), timer, data register file (16 x 40-bit), multiplier, barrel shifter and ALU. Dual-ported SRAM: two independent dual-ported memory blocks (block 0 and block 1). Buses: PM address bus (32), DM address bus (32), PM data bus, DM data bus, bus connect (PX). External port: address bus and data bus multiplexers, multiprocessor interface, host port. I/O processor: IOP registers, DMA controller, serial ports (2), link ports (6). JTAG test & emulation.]
The DSP processor's architecture.
ADSP-21160 Features
100 MHz - 600 MFLOPS - SIMD core
1024-point complex FFT benchmark: 90 us
4 Mbits on-chip SRAM
14 zero-overhead DMA channels
Sustained 700 Mbyte/sec over IOP bus
Two 50 Mbit/sec synchronous serial ports
Six 100 Mbyte/sec link ports
64-bit synchronous external port
Cluster multiprocessing support
The digital signal processing methods & algorithms.
The methods for computer performance measurement
Peak (technical) performance of a microprocessor: the maximum theoretical speed of the microprocessor under ideal conditions. It is defined as the number of arithmetic operations completed per unit of time.
Real (sustained) performance of a microprocessor: the speed of the microprocessor under real conditions. It is measured by executing common benchmark programs (such as FIR, IIR or FFT).
The DSP processor ‘s architecture.
Pipelined instruction execution:
Instruction fetching (a);
Decoding (b);
Execution (c).
n-1 operation    a  b  c
n operation         a  b  c
n+1 operation          a  b  c
SHARC instruction set
SHARC programming model.
SHARC assembly language.
SHARC data operations.
SHARC flow of control.
SHARC programming model
Register files:
R0-R15 (aliased as F0-F15 for floating point)
Status registers.
Loop registers.
Data address generator registers.
SHARC assembly language
Algebraic notation terminated by semicolon:
R1=DM(M0,I0), R2=PM(M8,I8); // comment
label: R3=R1+R2;
In the first line, DM(M0,I0) is a data memory access and PM(M8,I8) is a program memory access.
Simple ALU Instructions
Rn = Rx + Ry                    Fn = Fx + Fy
Rn = Rx - Ry                    Fn = Fx - Fy
Rn = Rx + Ry + CI (Carry In)    Fn = ABS(Fx + Fy)
Rn = Rx - Ry + CI - 1           Fn = ABS(Fx - Fy)
Rn = (Rx + Ry)/2                Fn = (Fx + Fy)/2
COMP(Rx, Ry)                    COMP(Fx, Fy)
Rn = Rx + CI - 1                Fn = -Fx
Rn = Rx + 1                     Fn = ABS Fx
Rn = Rx - 1                     Fn = PASS Fx
Rn = -Rx                        Fn = RND Fx
Rn = ABS Rx                     Fn = SCALB Fx BY Ry
Rn = PASS Rx                    Rn = MANT Fx
Rn = Rx AND Ry                  Rn = LOGB Fx
Rn = Rx OR Ry                   Rn = FIX Fx BY Ry
Rn = NOT Rx                     Fn = FLOAT Rx BY Ry
Rn = MIN(Rx, Ry)                Rn = TRUNC Fx
Rn = MAX(Rx, Ry)                Fn = RECIPS Fx
MAC instructions - mainly INTEGER
Multiply and Accumulate
Rn = Rx * Ry                    MRF = Rx * Ry
MRB = Rx * Ry                   Rn = MRF + Rx * Ry
Rn = MRB + Rx * Ry              MRF = MRF + Rx * Ry
MRB = MRB + Rx * Ry             Rn = MRF - Rx * Ry
Rn = MRB - Rx * Ry              MRF = MRF - Rx * Ry
MRB = MRB - Rx * Ry             Rn = SAT MRF
Rn = SAT MRB                    MRF = SAT MRF
MRB = SAT MRB                   Rn = RND MRF
Rn = RND MRB                    MRF = RND MRF
MRB = RND MRB                   MR = Rn
Rn = MR                         Floating point: Fn = Fx * Fy
Shifter Instructions - mainly integer
Rn = LSHIFT Rx BY Ry/<data8>
Rn = Rn OR LSHIFT Rx BY Ry/<data8>
Rn = ASHIFT Rx BY Ry/<data8>
Rn = ROT Rx BY Ry/<data8>
Rn = BCLR Rx BY Ry/<data8>
Rn = BSET Rx BY Ry/<data8>
Rn = BTGL Rx BY Ry/<data8>
BTST Rx BY Ry/<data8>
Rn = Rn OR FDEP Rx BY Ry/<bit6>:<len6> (SE)
Rn = FEXT Rx BY Ry/<bit6>:<len6> (SE)
Rn = EXP Rx (EX)
Rn = LEFTZ Rx
Rn = LEFTO Rx
Rn = FPACK Fx
Fn = FUNPACK Rx
FPACK is a cast and means (32-bit -> 16-bit) Fx
FUNPACK is a cast and means (16-bit -> 32-bit) Rx
BUT WITH A LOT OF HIDDEN STUFF TOO!
Flag operations
ALU operations set: AZ (zero), AN (negative), AV (overflow), AC (fixed-point carry), AI (floating-point invalid), AF (last ALU operation was floating-point).
Multiplier operations set: MN (negative), MV (overflow), MU (floating-point underflow), MI (floating-point invalid).
Shifter operations set: SV (overflow), SZ (zero), SS (sign).
Fixed-point: -1 + 1 = 0:
AZ = 1, AN = 0, AV = 0, AC = 1, AI = 0, AF = 0.
Fixed-point: -2*3=-6:
MN = 1, MV = 0, MU = 1, MI = 0.
LSHIFT 0x7fffffff BY 3: SV=1, SZ=0, SS=0.
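The fixed-point ALU example (-1 + 1 = 0) can be traced with a small 32-bit model. This is a sketch of generic two's complement flag logic, not a description of the SHARC hardware; the function name `add32_flags` is an assumption.

```python
def add32_flags(a, b):
    """Add two 32-bit two's complement values; return (result, flags)."""
    ua = a & 0xFFFFFFFF              # raw 32-bit patterns of the operands
    ub = b & 0xFFFFFFFF
    total = ua + ub
    result = total & 0xFFFFFFFF
    sign_a, sign_b, sign_r = ua >> 31, ub >> 31, result >> 31
    flags = {
        "AZ": int(result == 0),                            # zero
        "AN": sign_r,                                      # negative
        "AV": int(sign_a == sign_b and sign_r != sign_a),  # signed overflow
        "AC": total >> 32,                                 # carry out of bit 31
    }
    return result, flags

result, flags = add32_flags(-1, 1)   # 0xFFFFFFFF + 0x00000001
```

The carry flag is set even though no signed overflow occurred, because the addition of the raw bit patterns 0xFFFFFFFF and 1 carries out of the top bit.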
Multifunction computations
The modified Harvard architecture allows
– multiple data fetches in a single instruction.
The most common instructions allow
– a memory reference and a computation to be performed at the same time.
Memory references
– can be done two at a time in many instructions, with each reference using a DAG.
Can issue some computations in parallel:
– dual add-subtract;
– fixed-point multiply/accumulate and add, subtract;
– floating-point multiply and ALU operation.
SHARC DSP Architecture
The machine supports both memory parallelism and operation parallelism.
Reduce the number of instructions required for common operations.
For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.
Example Multi-Function Instruction
In a single cycle the SHARC performs:
1(2) Multiply
1 (2) Addition
1 (2) Subtraction
1 (2) Memory Read
1 (2) Memory Write
2 Address Pointer Updates
Plus the I/O Processor Performs:
Active Serial Port Channels (2 Transmit, 2 Receive)
Active Link Ports (6)
Memory DMA
2 DMA Pointer Updates
f11=f1*f7, f3=f9+f14, f9=f9-f14, dm(i2,m0)=f13, f7=pm(i8,m8);
Parallelism
Restrictions on the sources of the operands when operations are combined:
The operands going to the multiplier must come from R0 through R7 (or, in the case of floating-point operands, F0 to F7), with one input coming from R0-R3/F0-F3 and the other from R4-R7/F4-F7.
The ALU operands must come from R8-R15/F8-F15, with one operand coming from R8-R11/F8-F11 and the other from R12-R15/F12-F15.
An interrupt does not occur until 2 instructions after a delayed branch (needs 2 NOPs);
Some DAG register transfers are disallowed in assembler routines;
It is preferable not to use the following pairs in any combination: (M7,I6), (M14,I12), (M6,I5), (M5,I6).
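The operand-source restrictions for the multiplier and the ALU can be expressed as a simple check. This is a hypothetical helper written for illustration, not part of any SHARC tool; it takes register numbers only and, following the wording above, accepts the two groups in either operand order.

```python
def multiplier_operands_ok(x, y):
    """One operand from R0-R3 (F0-F3), the other from R4-R7 (F4-F7)."""
    low, high = range(0, 4), range(4, 8)
    return (x in low and y in high) or (x in high and y in low)

def alu_operands_ok(x, y):
    """One operand from R8-R11 (F8-F11), the other from R12-R15 (F12-F15)."""
    low, high = range(8, 12), range(12, 16)
    return (x in low and y in high) or (x in high and y in low)

# Register numbers from the example f11=f1*f7, f3=f9+f14, f9=f9-f14
mul_ok = multiplier_operands_ok(1, 7)
alu_ok = alu_operands_ok(9, 14)
mul_bad = multiplier_operands_ok(1, 2)   # both operands from F0-F3
```

The example multi-function instruction satisfies both rules, which is what allows its multiply and its dual add/subtract to issue in the same cycle.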
SOS filters creation
% Normalize the cut-off frequencies by fs/2 (fs: sampling frequency, assumed to be defined already).
Wn = [200 1000]/(fs/2);
% Calculate the coefficients of the band-pass filter
N = 5;
[b, a] = butter(N, Wn);
% Convert the zero-pole-gain filter parameters to second-order sections form
[z, p, k] = butter(N, Wn);
[sos, g] = zp2sos(z, p, k);