Number Representation for DSP processor. Numerical issues and data formats. Fixed point. Fractional number. Floating point. Comparison of formats and.

Number Representation for

DSP processor

• Numerical issues and data formats.

• Fixed point.

• Fractional number.

• Floating point.

• Comparison of formats and dynamic ranges.

Numerical Issues and Data Formats

C6000 Numerical Representation

Fixed point arithmetic:• 16-bit (integer or fractional). • Signed or unsigned.

Floating point arithmetic:• 32-bit single precision.

• 64-bit Double precision.

• How are numbers represented and processed in DSP processors for implementing DSP algorithms?

How numbers are represented?

• A collection of N binary digits (bits) has 2N possible states. – This can be seen from elementary counting theory, which tells

us that there are two possibilities for the first bit, two possibilities for the next bit, and so on until the last bit, resulting in 2×2×2… = 2N possibilities or states.

• In the most general sense, we can allow these states to represent anything conceivable. – The point is that there is no meaning inherent in a binary word,

although most people are tempted to think of them as positive integers.

• However, the meaning of an N-bit binary word depends entirely on its interpretation.

Fixed Point number and arithmeticBinary representation of 4-bit signed number

Decimal Value

Sign Magnitude

One’s Complement

Two’s Complement

7 0111 0111 0111

1 0001 0001 0001

+0 0000 0000 0000

-0 1000 1111 -

-1 1001 1110 1111

-7 1111 1000 1001

-8 - - 1000

3 BIT NUMBER IN SIGNED 2’S COMPLEMENT REPRESENTATION

Dynamic range of integer and fraction 4-bit numbersUnsigned integer Signed Integer

Smallest Value:0000(0)

Largest value :1111(15)

Most Positive Value:0111= (+7)

Least negative value:1000 =(-8)

Unsigned Fraction Unsigned Fraction

Smallest Value:.0000(0)

Largest value :.1111(0.9375)

Most Positive Value:0.111= (+0.875)

Least negative value:1.000 =(-1)

Dynamic Range

• Dynamic range in dB= 20 log10(Max/min)

• In fixed point Unsigned integer representation using N-bit range of Max to min number is 2N to 1.

• In fixed point signed integer representation using N-bit range of Max to min number is 2N-1 to 1.

• In fixed point Unsigned fraction representation using N-bit range of Max to min number is 1-2-N to 2-N.

• In fixed point signed fraction representation using N-bit range of Max to min number is 1-2-N+1 to 2-N+1.

Number format

Dynamic Range

Dynamic Range in dB

Precision

Unsigned Integer

0 to 65536 20log10(2^16)= 96 dB

1

signed Integer

-32768 to 32767

20log10(2^15)= 90 dB

1

Unsigned Fraction

0 to 0.99998474

96 dB 2^-16

signed Fraction

-1 to 0.99996948

90 dB 2^-15

Fixed Point Arithmetic - Problems

• The following equation is the basis of many DSP algorithms (See Chapter 1):

• Two problems arise when using signed and unsigned integers:– Multiplication overflow.– Addition overflow.

1

0

N

k

knxkany

• 16-bit x 16-bit = 32-bit

• Example: using 4-bit representation

• 24 cannot be represented with 4-bits.

Multiplication Overflow

33

88

2424

xx

00 00 11 11

11 00 00 00xx

11 00 00 0000 00 00 11

• 32-bit + 32-bit = 33-bit

• Example: using 4-bit representation

• 16 cannot be represented with 4-bits.

Addition Overflow

11 00 00 00

11 00 00 00++

88

88

1616

++

00 00 00 0011

Fixed Point Arithmetic - Solution

• The solutions for reducing the overflow problem are:– Saturate the result.– Use double precision result.– Use fractional arithmetic.– Use floating point arithmetic.

Solution - Saturate the result• Unsigned numbers:

– If A x B 15 result = A x B– If A x B > 15 result = 15

00 00 11 11

11 00 00 00xx

11 00 00 00

11 11 11 11

00 00 00 11

33

88

2424

1515SaturatedSaturated

Solution - Saturate the result• Signed numbers:

– If -8 A x B 7 result = A x B– If A x B > 7 result = 7– If A x B < -8 result = -8

00 00 11 11

11 00 00 00xx

11 00 00 00

11 00 00 00

11 11 11 00

33

-8-8

-24-24

-8-8SaturatedSaturated

Solution - Double precision result

• For a 4-bit x 4-bit multiplication hold the result in an 8-bit location.

• Problems:– Uses more memory for storing data.– If the result is used in another multiplication

the data needs to be represented into single precision format (e.g. prod = prod x sum).

– Results need to be scaled down if it is to be sent to an A to D converter.

Solution - Fractional arithmetic• If A and B are fractional then:

– A x B < min(A, B)– i.e. The result is less than the operands

hence it will never overflow.

• Examples: – 0.6 x 0.2 = 0.12 (0.12 < 0.6 and 0.12 < 0.2)– 0.9 x 0.9 = 0.81 (0.81 < 0.9)– 0.1 x 0.1 = 0.01 (0.01 < 0.1)

Fractional numbers - Sign Extension

• To keep the same resolution as the operands we need to select these 4-bits:

00 11 11 00a=a= = 0.5 + 0.25 = 0.75= 0.5 + 0.25 = 0.75

11 11 11 00b=b= = -1 + 0.5 + 0.25 = -0.25= -1 + 0.5 + 0.25 = -0.25

00 00 00 0000 11 11 00 ..

00 11 11 00 .. ..11 00 11 00 .. .. ..

00 11 00 0011 11 11 11

Sign extensionSign extension

11 11 11 00

xx

Fractional numbers - Sign Extension

• The way to do it is to shift left by one bit and store upper 4-bits or right shift by three and store the lower 4-bits:

00 11 11 00a=a= = 0.5 + 0.25 = 0.75= 0.5 + 0.25 = 0.75

11 11 11 00b=b= = -1 + 0.5 + 0.25 = -0.25= -1 + 0.5 + 0.25 = -0.25

00 00 0000 11 00

00 11 11 0011 00 11 00

.... ..

.. .. ..

00 11 00 0011 11 11 11

Sign extensionSign extension

11 11 11 00

xx

0000

1100

1100 000000000000

Sign extension bitsSign extension bits

• Although using 2's complement integers we can implement both addition and subtraction by usual binary addition (with special care for the sign bit), the integers are not convenient to handle to implement DSP algorithms. – For example, if we multiply two 8-bit words

together, we need 16 bits to store the result.

• The number of required word length increases without bound as we multiply numbers together more. – Although not impossible, it is complicated to handle

this increase in word-length using integer arithmetic.

• The problem can be easily handled by using numbers between -1 and 1, instead of integers, because the product of two numbers in [-1,1] are always in the same range.

• In the 2's complement fractional representation, an N bit binary word can represent 2N equally space numbers from

• For example, we interpret an 8-bit binary word

b7 b6 b5 b4 b3 b2 b1 b0

• as a fractional number• This representation is called Q-format.

Integer Vs Fractional Number

• Different notation are used to represent different binary formats.

• The Qm.n represent are most widely used

• In Qm.n representation m bit representation integer portion and n bit represent fraction number

• Eg Q15.0 and Q0.15

• If N is total number of bits then N=m+n+1.

• Number range is -2^m to 2^m – 2^-n• Its resolution is 2^-n• For eg Q14.1 representation• Its range is • [-2^14, 2^14 – 2^-1] = [-16384.0, +16383.5] =

So in hex [0x8000, 0x8001 … 0xFFFF, 0x0000, 0x0001 … 0x7FFE, 0x7FFF]

resolution is 2^-1=0.5

Conversion

• To convert a number from floating point to Qm.n format:

• 1.Multiply the floating point number by 2n

• 2.Round to the nearest integer

http://en.wikipedia.org/wiki/IEEE_754

Conversion

• To convert a number from Qm.n format to floating point

• 1.Convert the number directly to floating point

• 2.Divide by 2^n

http://en.wikipedia.org/wiki/IEEE_754

• In C6211, it is easiest to handle Q-15 numbers represented by each 16 bit binary word, because the multiplication of two Q-15 numbers results in a Q-30 number that can still be stored in a 32-bit wide register of C6211. – The programmer needs to keep track of the

implied binary point when manipulating Q-format numbers.

CPUCPUMPY A3,A4,A6MPY A3,A4,A6NOP NOP

Q15Q15 s. x x x x x x x x x x x x x x x

s. y y y y y y y y y y y y y y yxx Q15 Q15

s.s z z z z z z z z z z z z z z z z z z z z z z z z z z z z z zQ30Q30

15-bit * 15-bit Multiplication

Store toStore toData MemoryData Memory SHR SHR A6, A6,1515,A6,A6

STH STH A6,*A7 A6,*A7

s. y y y y y y y y y y y y y y yQ15Q15

Assembly language implementation

• When A0 and A1 contain two 16-bit numbers in the Q-15 format, we can perform the

• multiplications using MPY followed by a right shift.

• 1 MPY .M1 A0,A1,A2• 2 NOP• 3 SHR .S1 A2,15,A2 ;lower 16 bit contains result

• in Q-15 format

C language implementation

• Let's suppose we have two 16-bit numbers in Q-15 format, stored in variable x and y as follows:

• short x = 0x0011; /* 0.000518799 in decimal */• short y = 0xfe12; /* -0.015075684 in decimal */• short z; /* variable to store x*y */

• The product of x and y can be computed and stored in Q-15 format as follows:

• z = (x * y) > > 15;

• The result of x*y is a 32-bit word with 2 sign bits. Right shifting it by 15 bits ignores the last 15 bits, and storing the shifted result in z that is a short variable (16 bit) removes the extended sign bit by taking only lower 16 bits.

• However, care must be taken when adding binary numbers. – Because each Q-15 number can represent

numbers in the range [-1,1−215] , if the result of summing two Q-15 numbers is not in this range, we cannot represent the result in the Q-15 format.

• When this happens, we say an overflow has occurred.

• Unless carefully handled, the overflow makes the result incorrect. – Therefore, it is really important to prevent overflows from

occurring when implementing DSP algorithms.

• One way of avoiding overflow is to scale all the numbers down by a constant factor, effectively making all the numbers very small, so that any summation would give results in the [-1,1) range. – This scaling is necessary and it is important to figure out how

much scaling is necessary to avoid overflow. • Because scaling results in loss of effective number of digits,

increasing quantization errors, we usually need to find the minimum amount of scaling to prevent overflow.

• Most Fixed-Point DSP processor use two’s complement fractional numbers in different Q format. However assembler only recognize integer value.

• So we have to keep track of binary point when considering fractional number in assembly program.

Steps of converting fractional number into Q-format into an integer that can be recognized by

the assembler

• 1. Normalize the fractional number to the range determined by the desired Q-format.

• 2. Multiply the Normalized fractional by 2^n, where n is the total number of fractional bits.

• 3. Round the product to the nearest integer.

‘C6000 C Data TypesTypeType SizeSize RepresentationRepresentation

char, signed charchar, signed char 8 bits8 bits ASCIIASCIIunsigned charunsigned char 8 bits8 bits ASCIIASCIIshortshort 16 bits16 bits 2’s complement2’s complementunsigned shortunsigned short 16 bits16 bits binarybinaryint, signed intint, signed int 32 bits32 bits 2s complement 2s complement unsigned intunsigned int 32 bits32 bits binarybinarylong, signed longlong, signed long 40 bits 40 bits 2’s complement2’s complementunsigned longunsigned long 40 bits 40 bits binarybinaryenumenum 32 bits 32 bits 2’s complement2’s complementfloatfloat 32 bits 32 bits IEEE 32-bitIEEE 32-bitdoubledouble 64 bits 64 bits IEEE 64-bitIEEE 64-bitlong doublelong double 64 bits 64 bits IEEE 64-bitIEEE 64-bitpointerspointers 32 bits 32 bits binarybinary

Exponential Notation

The representations differ in that the decimal place – the “point” -- “floats” to the left or right (with the appropriate adjustment in the exponent).

p. 122

• The following are equivalent representations of 1,234123,400.0 x 10-2

12,340.0 x 10-1

1,234.0 x 100

123.4 x 101

12.34 x 102

1.234 x 103

0.1234 x 104

Parts of a Floating Point Number

-0.9876 x 10-3

p. 123

Sign ofmantissa

Location ofdecimal point Mantissa

Exponent

Sign ofexponent

Base

IEEE 754 Standard

• Most common standard for representing floating point numbers

• Single precision: 32 bits, consisting of...– Sign bit (1 bit)– Exponent (8 bits)– Mantissa (23 bits)

• Double precision: 64 bits, consisting of…– Sign bit (1 bit)– Exponent (11 bits)– Mantissa (52 bits)

p. 133

Single Precision Format

32 bits

Mantissa (23 bits)

Exponent (8 bits)

Sign of mantissa (1 bit)

Normalization

• The mantissa is normalized

• Has an implied decimal place on left

• Has an implied “1” on left of the decimal place

• E.g.,– Mantissa – Represents…

10100000000000000000000

1.1012 = 1.62510

Excess Notation

• To include +ve and –ve exponents, “excess” notation is used

• Single precision: excess 127• Double precision: excess 1023• The value of the exponent stored is larger than

the actual exponent• E.g., excess 127,

– Exponent – Represents…

10000111

135 – 127 = 8

Example

• Single precision0 10000010 11000000000000000000000

1.112

130 – 127 = 3

0 = positive mantissa

+1.112 x 23 = 1110.02 = 14.010

Hexadecimal

• It is convenient and common to represent the original floating point number in hexadecimal

• The preceding example…

0 10000010 11000000000000000000000

4 1 6 0 0 0 0 0

Converting from Floating Point

• E.g., What decimal value is represented by the following 32-bit floating point number?

C17B000016

• Step 1– Express in binary and find S, E, and M

C17B000016 =

1 10000010 111101100000000000000002

S E M

1 = negative0 = positive

• Step 2– Find “real” exponent, n– n = E – 127

= 100000102 – 127

= 130 – 127

= 3

• Step 3– Put S, M, and n together to form binary result– (Don’t forget the implied “1.” on the left of the

mantissa.)-1.11110112 x 2n =

-1.11110112 x 23 =

-1111.10112

• Step 4– Express result in decimal

-1111.10112

-15 2-1 = 0.52-3 = 0.1252-4 = 0.0625

0.6875

Answer: -15.6875

Converting to Floating Point

• E.g., Express 36.562510 as a 32-bit floating point number (in hexadecimal)

• Step 1– Express original value in binary

36.562510 =

100100.10012

• Step 2– Normalize

100100.10012 =

1.0010010012 x 25

• Step 3– Determine S, E, and M

+1.0010010012 x 25

S = 0 (because the value is positive)

MS n E = n + 127= 5 + 127= 132= 100001002

• Step 4– Put S, E, and M together to form 32-bit binary

result0 10000100 001001001000000000000002

S E M

• Step 5– Express in hexadecimal

0 10000100 001001001000000000000002 =

0100 0010 0001 0010 0100 0000 0000 00002 =

4 2 1 2 4 0 0 016

Answer: 4212400016

Comparison of fixed point and floating point

Fixed Point Floating point

Limited Dynamic range Large Dynamic Range

Overview flow and quantization errors must be resolved

Easier to program since no scaling is required

Long product development time

Quick time to market

Cheaper More expensive

Lower power consumption

High Power consumption

Fixed Point Floating point

Consumer Application such as MP3 player ,multimedia gaming, and digital camera

High-eng audio application such as ambient acoustic simulation ,audio encoding/decoding and audio mixing

Thank you

Number Representation for DSP processor. Numerical issues and data formats. Fixed point. Fractional number. Floating point. Comparison of formats and.

Documents

fixed point number

n possibilities

number representation

floating point arithmetic

fractional number

x b mina

n possible states

range of max