Computer Number Systems, Approximation in Numerical Computation

8/6/2019 Computer Number Systems, Approximation in Numerical Computation

http://slidepdf.com/reader/full/computer-number-systems-approximation-in-numerical-computation 1/21

Computer Representation of Numbers and Computer Arithmetic

In a computer, numbers are represented by the binary digits 0 and 1, and computers employ binary arithmetic to operate on them. Because large numbers are cumbersome to display in binary form, computers usually display them in the hexadecimal, octal or decimal system. All of these number systems are positional systems. In a positional system a number is represented by a sequence of symbols, each of which denotes a particular value depending on its position. The number of symbols used in a positional system depends on its 'base'. Let us now discuss the various positional number systems:

Decimal System :

The decimal system uses 10 as its base and employs the ten symbols 0 to 9 to represent numbers. Consider the decimal number 7402, consisting of the four symbols 7, 4, 0, 2. In terms of base 10 it can be expressed as

$$7402 = 7 \times 10^{3} + 4 \times 10^{2} + 0 \times 10^{1} + 2 \times 10^{0}.$$

So each symbol of a number is multiplied by a power of the base (10) that depends on its position counted from the right. The count always begins with 0.

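In general, a decimal number consisting of $n+1$ symbols $a_n a_{n-1} \ldots a_1 a_0$ can be expressed as:

$$a_n \times 10^{n} + a_{n-1} \times 10^{n-1} + \cdots + a_1 \times 10^{1} + a_0 \times 10^{0} = \sum_{i=0}^{n} a_i \times 10^{i},$$

where $a_i \in \{0, 1, \ldots, 9\}$ for $i = 0, 1, \ldots, n$.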

mywbut.com



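Similarly, the fractional part $0.b_1 b_2 \ldots b_m$ of a decimal number can be expressed as

$$b_1 \times 10^{-1} + b_2 \times 10^{-2} + \cdots + b_m \times 10^{-m} = \sum_{j=1}^{m} b_j \times 10^{-j}.$$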

Binary system:

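The binary system is the positional system with base 2 and the two symbols 0 and 1. Any binary number $a_n a_{n-1} \ldots a_1 a_0$ represents the decimal value

$$\sum_{i=0}^{n} a_i \times 2^{i},$$

where $a_i \in \{0, 1\}$.

Consider the binary number 10101. The decimal equivalent of 10101 is given by

$$1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} = 16 + 4 + 1 = 21.$$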
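The positional expansions above can be checked mechanically. The following is a minimal Python sketch; the function name `to_decimal` and the symbol table are illustrative choices, not part of the notes. It reads the digits from left to right and multiplies the running value by the base before adding each digit, which is equivalent to summing digit × base^position.

```python
def to_decimal(digits: str, base: int) -> int:
    """Evaluate a digit string written in the given base (2 to 16)."""
    symbols = "0123456789ABCDEF"
    value = 0
    for ch in digits.upper():
        d = symbols.index(ch)          # numeric value of this symbol
        if d >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        value = value * base + d       # shift one position left, add digit
    return value


print(to_decimal("7402", 10))   # 7402
print(to_decimal("10101", 2))   # 21
```

The same function handles hexadecimal and octal strings, since only the base and the set of admissible symbols change.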

Hexadecimal System :

The hexadecimal system is the positional system with base 16 and the sixteen symbols 0, 1, 2, ..., 9, A, B, C, D, E, F. Here the symbol A denotes 10, B denotes 11, and so on up to F, which denotes 15. The decimal equivalent of a hexadecimal number $a_n a_{n-1} \ldots a_1 a_0$ is given by

$$\sum_{i=0}^{n} a_i \times 16^{i}.$$

We can convert a binary number directly to a hexadecimal number by grouping the binary digits, starting from the right, into sets of four and converting each group to its equivalent hexadecimal digit. If the leftmost group falls short of four binary digits, prefix it with as many 0s as needed.

The converse also holds: a hexadecimal number is converted to binary by replacing each hexadecimal digit with its four-digit binary equivalent.

Octal System: The octal system is the positional system that uses 8 as its base and {0, 1, ..., 7} as its symbol set of size 8. The decimal equivalent of an octal number $a_n a_{n-1} \ldots a_1 a_0$ is given by

$$\sum_{i=0}^{n} a_i \times 8^{i}.$$

We can get the octal equivalent of a binary number by grouping the binary digits, starting from the right, into sets of three and converting each set to its octal equivalent. If the leftmost set has fewer than three digits, it may be prefixed with the required number of 0s.
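The grouping rule for hexadecimal (sets of four) and octal (sets of three) can be sketched with one helper. This is only an illustration: the names `group_binary` and `expand_digits` are hypothetical, and the input 110110 is an arbitrary value chosen for the demonstration.

```python
SYMBOLS = "0123456789ABCDEF"


def group_binary(bits: str, group: int) -> str:
    """Convert a binary string to base 2**group (3 -> octal, 4 -> hex)."""
    pad = (-len(bits)) % group            # zeros needed on the left
    bits = "0" * pad + bits
    chunks = [bits[i:i + group] for i in range(0, len(bits), group)]
    return "".join(SYMBOLS[int(chunk, 2)] for chunk in chunks)


def expand_digits(number: str, group: int) -> str:
    """Converse step: replace each digit by its fixed-width binary group."""
    return "".join(format(SYMBOLS.index(ch), f"0{group}b") for ch in number.upper())


print(group_binary("110110", 4))   # 36  (groups 0011 0110)
print(group_binary("110110", 3))   # 66  (groups 110 110)
print(expand_digits("36", 4))      # 00110110
```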

Conversion of decimal system to non-decimal system:

To convert a decimal number to any other number system, consider the integer and fractional parts separately and apply the following procedure:

Conversion of integer part:

(a) Divide the integer part of the given decimal number by the base b of the new number system. The remainder constitutes the rightmost digit of the integer part of the new number.

(b) Next, divide the quotient again by the base b. The remainder constitutes the second digit from the right in the new system.

Continue this process until the quotient is zero. The last remainder is the leftmost digit of the new number.

Conversion of fractional part :

(a) Multiply the fractional part of the given decimal number by the base b of the new system. The integral part of the product constitutes the leftmost digit of the fractional part in the new system.

(b) Now multiply the fractional part of the product from step (a) again by the base b. The integral part of the resulting product is the second digit from the left in the new system.

Repeat this step until the fractional part becomes zero or repeats an earlier fractional part. The integral part of the last product is the rightmost digit of the fractional part of the new number; if a fractional part repeats, the digits generated since its first occurrence recur indefinitely.
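The two procedures can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the fractional part is passed as an exact `fractions.Fraction` so that a repeated fractional part can be detected reliably (binary floats would blur the comparison).

```python
from fractions import Fraction

SYMBOLS = "0123456789ABCDEF"


def convert_integer_part(n: int, base: int) -> str:
    """Repeated division: remainders give the digits from right to left."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(SYMBOLS[remainder])
    return "".join(reversed(digits))


def convert_fraction_part(frac: Fraction, base: int) -> tuple[str, str]:
    """Repeated multiplication: returns (non-repeating, repeating) digits."""
    digits = []
    seen = {}                      # fractional part -> index where it first occurred
    while frac != 0 and frac not in seen:
        seen[frac] = len(digits)
        frac *= base
        whole = int(frac)          # integral part of the product
        digits.append(SYMBOLS[whole])
        frac -= whole              # keep only the fractional part
    if frac == 0:
        return "".join(digits), ""
    start = seen[frac]             # the cycle begins where this fraction first appeared
    return "".join(digits[:start]), "".join(digits[start:])


print(convert_integer_part(54, 2))                   # 110110
print(convert_fraction_part(Fraction(45, 100), 2))   # ('01', '1100')
```

The second result reads as 0.01 followed by the block 1100 repeated indefinitely.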

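Eg: Convert 54.45 into its binary equivalent.

(a) Consider the integer part, i.e. 54, and apply the steps listed under conversion of integer part:

Division     Quotient   Remainder
54 / 2       27         0
27 / 2       13         1
13 / 2       6          1
6 / 2        3          0
3 / 2        1          1
1 / 2        0          1

Reading the remainders from the last to the first gives $(54)_{10} = (110110)_2$.

(b) Conversion of fractional part:

Product             Integral part (binary digit)
0.45 × 2 = 0.90     0
0.9 × 2 = 1.80      1
0.8 × 2 = 1.6       1
0.6 × 2 = 1.2       1
0.2 × 2 = 0.4       0
0.4 × 2 = 0.8       0
0.8 × 2 = 1.6       1
0.6 × 2 = 1.2       1
0.2 × 2 = 0.4       0
0.4 × 2 = 0.8       0
0.8 × 2 = 1.6       1

The fractional part 0.8 recurs, so the digits 1100 repeat from that point on. Hence $(0.45)_{10} = (0.01\overline{1100})_2$ and

$$(54.45)_{10} = (110110.01\overline{1100})_2.$$

Here the overbar denotes the repetition of the binary digits.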

Note: Using the binary system as an intermediate stage, we can easily convert octal numbers to hexadecimal numbers and vice versa. To convert octal to hexadecimal, expand each octal digit into a triplet of binary digits and regroup the resulting bits into quadruplets; to convert hexadecimal to octal, expand each hexadecimal digit into a quadruplet and regroup the bits into triplets.
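A short sketch of this two-stage conversion, assuming valid input digits; the function name `rebase_via_binary` and the sample value 745 are illustrative only.

```python
def rebase_via_binary(number: str, from_bits: int, to_bits: int) -> str:
    """Convert between bases 2**from_bits and 2**to_bits through binary."""
    symbols = "0123456789ABCDEF"
    # Stage 1: expand every digit into a fixed-width binary group.
    bits = "".join(format(symbols.index(ch), f"0{from_bits}b") for ch in number.upper())
    # Stage 2: left-pad so the bit count is a multiple of the new group size.
    bits = "0" * ((-len(bits)) % to_bits) + bits
    # Stage 3: read off one output digit per group.
    digits = "".join(symbols[int(bits[i:i + to_bits], 2)]
                     for i in range(0, len(bits), to_bits))
    return digits.lstrip("0") or "0"


print(rebase_via_binary("745", 3, 4))   # octal 745 -> hex 1E5
print(rebase_via_binary("1E5", 4, 3))   # hex 1E5 -> octal 745
```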


Computer Representation of Numbers

Computers are designed to use binary digits to represent numbers and other information. The computer memory is organized into strings of bits called words, all of the same length. Decimal numbers are first converted into their binary equivalents and are then represented in either integer or floating point form.

Integer Representation

The largest decimal number that can be represented, in binary form, in a computer depends on its word length. An n-bit word computer can handle a number as large as $2^{n} - 1$. For instance, a 16-bit word machine can represent numbers as large as $2^{16} - 1 = 65535$. How do we represent negative numbers? Negative numbers are stored using the 2's complement. This is obtained by taking the 1's complement of the binary representation of the positive number and then adding 1 to it.

In the representation of a positive number, an extra zero is appended to the left of the binary number to indicate that it is positive. If this extra leftmost binary digit is set to 1, it indicates that the binary number is negative. So the general convention for storing signed numbers is to append a binary digit 0 or 1 to the left of the binary number depending on the positive or negative sign of the number. In an n-bit word computer, as one bit is reserved for the sign, at most $n - 1$ bits are available to store the magnitude of a signed number. So the largest signed number a 16-bit word can represent is $2^{15} - 1 = 32767$. On this machine, since zero is defined as 0000000000000000, it is redundant to use the number 1000000000000000 to define a "minus zero". It is usually employed to represent an additional negative number, i.e. $-2^{15} = -32768$, and hence the range of signed numbers that can be represented on a 16-bit word machine is from $-32768$ to $32767$.
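The 2's complement rule (take the 1's complement, then add 1) can be traced step by step. The sketch below assumes a 16-bit word; the function name `twos_complement` and the sample value 22 are arbitrary illustrations.

```python
def twos_complement(value: int, bits: int = 16) -> str:
    """Return the bit pattern of a signed integer in a word of `bits` bits."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    if value >= 0:
        return format(value, f"0{bits}b")       # leading 0 marks a positive number
    magnitude = format(-value, f"0{bits}b")     # binary form of the positive number
    ones = "".join("1" if b == "0" else "0" for b in magnitude)   # 1's complement
    return format(int(ones, 2) + 1, f"0{bits}b")                  # then add 1


print(twos_complement(22))       # 0000000000010110
print(twos_complement(-22))      # 1111111111101010
print(twos_complement(-32768))   # 1000000000000000
```

The last line shows the pattern that would otherwise be a "minus zero" serving as the extra negative number.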

Floating Point Representation

Fractional numbers and large numbers that fall outside the range of a d-bit word machine, say a 16-bit word machine, are stored and processed in exponential form. In exponential form these numbers have an embedded decimal point and are called floating point numbers or real numbers. The floating point representation of a real number $x$ is $x = M \times 10^{E}$, where $M$ is called the mantissa and $E$ is the exponent.

Typically computers use a 32-bit representation for a floating point number. The leftmost bit is reserved for the sign. In the layout considered here, the next seven bits are reserved for the exponent and the last twenty-four bits are used for the mantissa. (The IEEE 754 single-precision format, by comparison, uses 8 exponent bits and 23 explicitly stored mantissa bits.)

The shifting of the decimal point to the left of the most significant digit is called normalization, and numbers represented in the normalized form are known as normalized floating point numbers.

For example, the normalized floating point forms of the numbers 0.00695, 56.2547 and -684.6 are:

0.00695 = $.695 \times 10^{-2}$ = .695E-2

56.2547 = $.562547 \times 10^{2}$ = .562547E2

-684.6 = $-.6846 \times 10^{3}$ = -.6846E3
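The normalization step can be reproduced with Python's `decimal` module, which keeps decimal digits exactly. The helper name `normalize` is an illustrative choice; it returns a mantissa of magnitude between 0.1 and 1 together with the matching power of ten.

```python
from decimal import Decimal


def normalize(text: str) -> tuple[Decimal, int]:
    """Return (mantissa, exponent) with value = mantissa * 10**exponent."""
    x = Decimal(text)
    if x == 0:
        return Decimal(0), 0
    exponent = x.adjusted() + 1        # adjusted() is the exponent of the leading digit
    mantissa = x.scaleb(-exponent)     # move the decimal point left of that digit
    return mantissa, exponent


for number in ("0.00695", "56.2547", "-684.6"):
    print(number, "->", normalize(number))
# 0.00695 -> (Decimal('0.695'), -2)
# 56.2547 -> (Decimal('0.562547'), 2)
# -684.6 -> (Decimal('-0.6846'), 3)
```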

Inherent Errors

Inherent errors arise either from errors in the data or from conversion errors.

Data Errors

If the data supplied for a problem are obtained from an experiment or a measurement, they are prone to errors due to limitations in instrumentation or reading. Such errors are also referred to as empirical errors. So when the data supplied are correct to, say, two decimals, there is no use in performing arithmetic accurate to four decimals!

Conversion Errors

Conversion errors arise due to the limitation on the number of bits used for representing numbers, under both integer and floating point representation. Such an error is therefore also called a representation error. The digits that are not retained constitute the round-off error.

For example, consider the case of representing the decimal number 0.1 in a computer. The binary equivalent of 0.1 has the non-terminating form 0.000110011001100..., but the computer has a limited number of bits. If we add ten such numbers in a computer the result will not be exactly 1.0, due to the round-off error incurred in converting 0.1 to binary form.
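This effect is easy to observe. The sketch below assumes Python floats are IEEE 754 double-precision numbers, which holds on common platforms; `fractions.Fraction` is used only to expose the exact value that is actually stored.

```python
from fractions import Fraction

total = 0.0
for _ in range(10):
    total += 0.1                 # each 0.1 is already a rounded binary value

print(total == 1.0)              # False
print(total)                     # 0.9999999999999999

# The stored value of 0.1 differs slightly from the true one tenth.
conversion_error = Fraction(0.1) - Fraction(1, 10)
print(float(conversion_error))   # a small non-zero number
```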


Computer Arithmetic

The most common forms of computer arithmetic are integer arithmetic and floating point arithmetic. These are briefly discussed below.

Integer Arithmetic :

The result of any integer arithmetic operation is always an integer. The range of integers that can be represented on a given computer is finite. The result of an integer division is usually given as a quotient; the remainder is truncated, since fractional quantities cannot be represented under the integer representation.

Eg:

Remark:

(1) Simple rules like $\frac{a+b}{c} = \frac{a}{c} + \frac{b}{c}$, where $a, b, c$ are integers, may not hold under computer integer arithmetic due to the truncation of the remainder.

(2) An integer operation may result in a very small or a very large number that is beyond the range the computer can handle. When the result is larger than the maximum limit it is referred to as an overflow, and when it is less than the lower limit it is referred to as an underflow.
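Both remarks can be illustrated in a few lines. Python integers are unbounded, so the 16-bit limits below are simulated; the operands 5, 7, 4 and the helper name `range_status` are arbitrary illustrative choices.

```python
INT16_MAX = 2**15 - 1    # 32767
INT16_MIN = -2**15       # -32768

# Remark (1): truncating division breaks (a + b)/c = a/c + b/c.
a, b, c = 5, 7, 4
combined = (a + b) // c          # 12 // 4 = 3
separate = a // c + b // c       # 1 + 1 = 2 (both remainders are dropped)
print(combined, separate)        # 3 2


# Remark (2): classify a result against the simulated 16-bit range.
def range_status(value: int) -> str:
    if value > INT16_MAX:
        return "overflow"
    if value < INT16_MIN:
        return "underflow"
    return "ok"


print(range_status(30000 + 5000))     # overflow
print(range_status(-30000 - 5000))    # underflow
```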


Floating Point Arithmetic:

In floating point arithmetic all numbers are stored and processed in normalized exponential form. First, the process of addition under floating point arithmetic is discussed.

Addition under Floating Point Arithmetic:

Let $x$ and $y$ be the two numbers to be added and $z$ be the result. The normalized floating point representations of $x$, $y$ and $z$ are $x = M_x \times 10^{E_x}$, $y = M_y \times 10^{E_y}$ and $z = M_z \times 10^{E_z}$ respectively. The rules for carrying out the addition are as follows:

(a) Set $E_z = \max(E_x, E_y)$. Say $E_x > E_y$; then $E_z = E_x$.

(b) Right shift $M_y$ by $E_x - E_y$ places, so that the exponents of $x$ and $y$ are the same, and call the shifted mantissa $M'_y$.

(c) Set $M_z = M_x + M'_y$.

(d) Normalize $M_z$, adjusting $E_z$ accordingly, and let $M'_z \times 10^{E'_z}$ be its normalized representation.

(e) Set $z = M'_z \times 10^{E'_z}$.

E.g : Add the numbers and

a)

b) on right shifting by 3 we get


c)

d) which is already in normalized form

i.e ,

e)

Remark: Subtraction is nothing but the addition of numbers with different signs.
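The addition rules can be sketched for a decimal machine as follows. This is an illustration under assumptions, not a definitive implementation: the mantissa length of six digits, chopping as the rounding rule, the function names, and the operands 0.964572E2 and 0.586351E5 (chosen so that a right shift of three places is needed) are all illustrative.

```python
from decimal import Decimal, ROUND_DOWN


def chop(m: Decimal, digits: int) -> Decimal:
    """Keep `digits` places after the decimal point, dropping the rest."""
    return m.quantize(Decimal(1).scaleb(-digits), rounding=ROUND_DOWN)


def fp_add(mx: Decimal, ex: int, my: Decimal, ey: int, digits: int = 6):
    """Add mx*10**ex and my*10**ey; return a normalized (mantissa, exponent)."""
    if ex < ey:                                   # make x the operand with the larger exponent
        mx, ex, my, ey = my, ey, mx, ex
    ez = ex                                       # (a) E_z = max(E_x, E_y)
    my_shifted = chop(my.scaleb(-(ex - ey)), digits)   # (b) right shift M_y
    mz = mx + my_shifted                          # (c) add the mantissas
    if mz == 0:
        return Decimal(0), 0
    shift = mz.adjusted() + 1                     # (d) normalize to 0.1 <= |M| < 1
    return chop(mz.scaleb(-shift), digits), ez + shift   # (e) assemble z


print(fp_add(Decimal("0.964572"), 2, Decimal("0.586351"), 5))
# (Decimal('0.587315'), 5)
```

Passing a negative mantissa for one operand gives subtraction, in line with the remark above.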

Multiplication Under Floating Point Arithmetic:

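If $x = M_x \times 10^{E_x}$ and $y = M_y \times 10^{E_y}$ are two real numbers in normalized form, then their product is

$$z = x \times y = (M_x \times M_y) \times 10^{E_x + E_y}.$$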

E.g : Say , then

Since is already in normalized form ,


.
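A minimal sketch of the multiplication rule on a decimal machine, assuming a six-digit mantissa and chopping; the function name `fp_mul` and the operand values are illustrative only.

```python
from decimal import Decimal, ROUND_DOWN


def fp_mul(mx: Decimal, ex: int, my: Decimal, ey: int, digits: int = 6):
    """Multiply mx*10**ex by my*10**ey; return a normalized (mantissa, exponent)."""
    mz = mx * my                 # multiply the mantissas
    ez = ex + ey                 # add the exponents
    if mz == 0:
        return Decimal(0), 0
    shift = mz.adjusted() + 1    # the product of two normalized mantissas may need a left shift
    mz = mz.scaleb(-shift).quantize(Decimal(1).scaleb(-digits), rounding=ROUND_DOWN)
    return mz, ez + shift


print(fp_mul(Decimal("0.5"), 2, Decimal("0.4"), 1))   # 50 * 4 = 0.2E3
print(fp_mul(Decimal("0.2"), 1, Decimal("0.3"), 1))   # 2 * 3 = 0.6E1
print(fp_mul(Decimal("0.123456"), 0, Decimal("0.654321"), 0))
# the exact mantissa 0.80779853376 is chopped to 0.807798
```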

Remark:

(1)

(after normalization)

During the floating point arithmetic mantissa 'M' may be truncated due to the limitation

on the number of bits available for its representation on a computer.

(2) Floating point arithmetic is prone to the following errors:

a) Errors due to inexact representation of a decimal number in binary form. For example

. Since binary equivalent of has a repeating

fraction, it has to be terminated at some point.

b) Error due to round-off-effect

c) Subtractive cancellation: some mantissa positions may be left unspecified, and these positions may be filled arbitrarily by the computer. This can lead to a serious loss of significance when two nearly equal numbers are subtracted. For example, the difference of two nearly equal numbers may have only one significant digit, yet the mantissa has room for more significant digits, and the remaining positions may get filled arbitrarily since they are not specified. Further, if the operands themselves are approximate representations because of this non-specification problem, the overall loss of significance becomes serious.

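d) Basic laws of arithmetic such as the associative and distributive laws may not be satisfied, i.e. in general $(x + y) + z \neq x + (y + z)$, $(x \times y) \times z \neq x \times (y \times z)$ and $x \times (y + z) \neq (x \times y) + (x \times z)$.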
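Two of these effects can be observed directly. The sketch assumes IEEE 754 double-precision floats; the operand values are arbitrary choices that make the effects visible.

```python
# Associativity: the grouping decides whether the small term survives.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c      # a and b cancel exactly first, then 1.0 is added
right = a + (b + c)     # 1.0 is lost when added to -1e16
print(left, right)      # 1.0 0.0

# Subtractive cancellation: subtracting nearly equal numbers.
tiny = 1e-15
difference = (1.0 + tiny) - 1.0
relative_error = abs(difference - tiny) / tiny
print(difference)       # close to, but not equal to, 1e-15
print(relative_error)   # roughly 0.11, i.e. about one significant digit is left
```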

(3) Numerical computation involves a series of computations consisting of basic

arithmetic operation. There may be round-off or truncation error at every step of the

computation. These errors accumulate with the increasing number of computations in a

process. There can be situations where even a single operation may magnify the


roundoff errors to a level that completely ruins the result.

A computation process in which the cumulative effect of all input errors is grossly

magnified is said to be numerically unstable. It is important to understand the

conditions under which the process is likely to be 'sensitive' to input errors and become

unstable. Investigations to see how small changes in input parameters influence the

output are termed as sensitivity analysis.

(4) The effect of round-off and truncation errors on the final numerical result may be reduced by

a) Increasing the number of significant figures carried by the computer, either through hardware or through software manipulations. For instance, one may use double precision for floating point arithmetic operations.

b) Minimizing the number of arithmetic operations. Here one may try to rearrange a formula to reduce the number of arithmetic operations. For example, in the evaluation of a polynomial $a_n x^{n} + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$, it may be rearranged as $(\cdots((a_n x + a_{n-1})x + a_{n-2})x + \cdots + a_1)x + a_0$, which requires fewer arithmetic operations.

c) Replacing a formula that subtracts nearly equal quantities by an algebraically equivalent form that avoids subtractive cancellation.

d) While finding the sum of a set of numbers, arranging them in ascending order of absolute value, i.e. when $|a| > |b| > |c|$ then $(c + b) + a$ is better than $(a + b) + c$.
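Items b) and d) lend themselves to a short sketch. The function names `horner` and `sum_ascending` and the sample data are illustrative; the summation example assumes IEEE 754 double-precision floats.

```python
def horner(coefficients, x):
    """Evaluate a_n x^n + ... + a_1 x + a_0 with n multiplications.

    `coefficients` lists a_n first and a_0 last.
    """
    result = 0.0
    for a in coefficients:
        result = result * x + a      # the nested form ((a_n x + a_{n-1}) x + ...)
    return result


def sum_ascending(values):
    """Add the numbers in ascending order of absolute value."""
    total = 0.0
    for v in sorted(values, key=abs):
        total += v
    return total


print(horner([2.0, -3.0, 1.0], 3.0))   # 2*9 - 3*3 + 1 = 10.0

data = [1e16, 1.0, 1.0]
naive = (data[0] + data[1]) + data[2]   # each 1.0 is lost against 1e16
print(naive)                            # 1e+16
print(sum_ascending(data))              # 1.0000000000000002e+16
```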

(5) It may not be possible to reduce the effects of both truncation and round-off errors on the final result of a numerical computation at the same time. For instance, in an iterative procedure, when one tries to reduce the round-off error by increasing the step size, this may lead to a higher truncation error, and vice versa. Hence the step size has to be chosen with care so as to balance the two errors.


Numerical Errors:

Numerical errors arise during computations due to round-off errors and truncation

errors.

Round-off Errors:

Round-off error occurs because computers use a fixed number of bits, and hence a fixed number of binary digits, to represent numbers. In a numerical computation round-off errors are introduced at every stage of computation. Hence, though an individual round-off error at a given numerical step may be small, the cumulative effect can be significant.

When the number of bits available is less than the number required to represent a number, the number is rounded to fit the available bits. This is done either by chopping or by symmetric rounding.

Chopping : Rounding a number by chopping amounts to dropping the extra digits, i.e. the given number is truncated. On a computer with a fixed word length of four digits, for instance, only the first four significant digits of a number are retained and the remaining digits are dropped. To evaluate the error due to chopping, consider the normalized representation of the number.

In general, if $x$ is the true value of a given number, $M_x \times 10^{E}$ is the normalized form of the rounded (chopped) number with a $d$-digit mantissa $M_x$, and $g_x \times 10^{E-d}$ is the normalized form of the chopping error, then

$$x = M_x \times 10^{E} + g_x \times 10^{E-d}.$$

Since $0 \le g_x < 1$, the chopping error satisfies $g_x \times 10^{E-d} < 10^{E-d}$.

Symmetric Round-off Error :

In the symmetric round-off method the last retained significant digit is rounded up by 1 if the first discarded digit is greater than or equal to 5. In other words, if $g_x$ in $x = M_x \times 10^{E} + g_x \times 10^{E-d}$ is such that $g_x \ge 0.5$, then the last digit in $M_x$ is raised by 1 before chopping $g_x \times 10^{E-d}$. In either case, the magnitude of the error is at most $\frac{1}{2} \times 10^{E-d}$.
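Chopping and symmetric rounding can be compared on a decimal machine with the `decimal` module. In this sketch the function name `to_d_digits`, the mantissa length d = 4 and the value 42.7893 are illustrative assumptions, and the input is taken to be non-zero.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def to_d_digits(text: str, d: int, rounding: str) -> tuple[Decimal, int]:
    """Keep d significant digits of a non-zero number; return (value, exponent E)."""
    x = Decimal(text)
    e = x.adjusted() + 1                    # exponent E of the normalized form
    quantum = Decimal(1).scaleb(e - d)      # one unit in the last retained digit: 10**(E-d)
    return x.quantize(quantum, rounding=rounding), e


x = Decimal("42.7893")
chopped, e = to_d_digits("42.7893", 4, ROUND_DOWN)
rounded, _ = to_d_digits("42.7893", 4, ROUND_HALF_UP)

print(chopped, x - chopped)         # 42.78 0.0093   (below 10**(E-d) = 0.01)
print(rounded, abs(x - rounded))    # 42.79 0.0007   (at most 0.5 * 10**(E-d) = 0.005)
```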

Truncation Errors:

Often an approximation is used in place of an exact mathematical procedure. For instance, consider the Taylor series expansion of $\sin x$, i.e.

$$\sin x = x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \frac{x^{7}}{7!} + \cdots$$

In practice we cannot use all of the infinitely many terms in the series to compute the sine of the angle x. We usually terminate the process after a certain number of terms. The error that results from such a termination or truncation is called the 'truncation error'.

Usually, in evaluating logarithms, exponentials, trigonometric functions, hyperbolic functions etc., an infinite series of the form $S = \sum_{i=0}^{\infty} a_i x^{i}$ is replaced by a finite series $\sum_{i=0}^{n} a_i x^{i}$. Thus a truncation error of $\sum_{i=n+1}^{\infty} a_i x^{i}$ is introduced in the computation.

For example, when the exponential function is evaluated using the first three terms, $e^{x} \approx 1 + x + \frac{x^{2}}{2!}$, the truncation error is

$$\frac{x^{3}}{3!} + \frac{x^{4}}{4!} + \cdots$$
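The truncation error of a partial sum is simple to measure numerically. The sketch below uses the exponential series; the function name `exp_partial` and the evaluation point x = 0.5 are illustrative assumptions.

```python
import math


def exp_partial(x: float, terms: int) -> float:
    """Sum the first `terms` terms of the series for e**x."""
    return sum(x**k / math.factorial(k) for k in range(terms))


x = 0.5
approx = exp_partial(x, 3)             # 1 + x + x**2/2!
truncation_error = math.exp(x) - approx
leading_term = x**3 / math.factorial(3)

print(approx)                # 1.625
print(truncation_error)      # about 0.0237
print(leading_term)          # about 0.0208, the first neglected term
```

Keeping more terms shrinks the error; for instance `math.exp(x) - exp_partial(x, 6)` is already far smaller.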

Some Fundamental definitions of Error Analysis:

Absolute and Relative Errors:

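Absolute Error: Suppose that $x_t$ and $x_a$ denote the true and approximate values of a datum. Then the error incurred on approximating $x_t$ by $x_a$ is given by

$$e = x_t - x_a$$

and the absolute error, i.e. the magnitude of the error, is given by

$$|e| = |x_t - x_a|.$$

Relative Error: The relative error or normalized error in representing a true datum $x_t$ by an approximate value $x_a$ is defined by

$$e_r = \frac{e}{x_t} = \frac{x_t - x_a}{x_t}$$

and its magnitude is $|e_r| = \frac{|x_t - x_a|}{|x_t|}$.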

Sometimes is defined by

If and then
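The two definitions translate directly into code. In this sketch the function names and the sample datum (true value 1/3 approximated by 0.333) are illustrative; exact fractions are used so that the printed errors are not themselves blurred by round-off.

```python
from fractions import Fraction


def absolute_error(true_value, approx_value):
    """Magnitude of the difference between the true and approximate values."""
    return abs(true_value - approx_value)


def relative_error(true_value, approx_value):
    """Absolute error normalized by the magnitude of the true value."""
    if true_value == 0:
        raise ZeroDivisionError("relative error is undefined for a true value of 0")
    return abs(true_value - approx_value) / abs(true_value)


true_value = Fraction(1, 3)
approx_value = Fraction(333, 1000)

print(absolute_error(true_value, approx_value))   # 1/3000
print(relative_error(true_value, approx_value))   # 1/1000
```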

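Machine Epsilon: Let us assume that we have a decimal computer system. We know that we encounter round-off error when a number is represented in floating-point form. Writing a number as $x = M_x \times 10^{E} + g_x \times 10^{E-d}$, where $M_x$ is the retained normalized mantissa of length $d$ and $g_x \times 10^{E-d}$ is the discarded part, the relative round-off error due to chopping is defined by

$$\epsilon_r = \frac{g_x \times 10^{E-d}}{M_x \times 10^{E}}.$$

Here we know that $g_x < 1$ and $M_x \ge 0.1$, so that

$$\epsilon_r < \frac{1 \times 10^{E-d}}{0.1 \times 10^{E}} = 10^{-d+1},$$

i.e. the maximum relative round-off error due to chopping is given by $10^{-d+1}$. We know that the value of 'd', i.e. the length of the mantissa, is machine dependent. Hence the maximum relative round-off error due to chopping is also known as the machine epsilon, $\epsilon = 10^{-d+1}$.

Similarly, the maximum relative round-off error due to symmetric rounding is given by $\frac{1}{2} \times 10^{-d+1}$. The machine epsilon for symmetric rounding is therefore given by

$$\epsilon = \frac{1}{2} \times 10^{-d+1}.$$

It is important to note that the machine epsilon represents an upper bound for the round-off error due to floating point representation.

For a computer system with binary representation, the machine epsilons due to chopping and symmetric rounding are given by $\epsilon = 2^{-d+1}$ and $\epsilon = 2^{-d}$ respectively.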

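Eg: Assume that our binary machine has a 24-bit mantissa and uses symmetric rounding. Then $\epsilon = 2^{-24} \approx 5.96 \times 10^{-8}$. Say that our system can represent a q decimal digit mantissa. Then

$$\frac{1}{2} \times 10^{-q+1} = 2^{-24},$$

i.e. $q = 1 + 23 \log_{10} 2 \approx 7.9$, so that our machine can store numbers with seven significant decimal digits.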
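The decimal-digit estimate and the machine epsilon of the running hardware can both be computed. This is a sketch under assumptions: the function name `decimal_digits` is illustrative, and the loop assumes Python floats are IEEE 754 doubles with a 53-bit mantissa, for which it finds the spacing-based value $2^{-d+1} = 2^{-52}$.

```python
import math
import sys


def decimal_digits(mantissa_bits: int) -> float:
    """Solve (1/2) * 10**(1 - q) = 2**(-mantissa_bits) for q."""
    return 1 + (mantissa_bits - 1) * math.log10(2)


print(decimal_digits(24))     # about 7.92, i.e. seven full decimal digits

# Halve eps until adding half of it to 1.0 no longer changes the result.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                              # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)    # True
```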


Approximations and Round-off Errors

Approximations and errors are an integral part of numerical methods. Prior to using numerical methods it is essential to know how errors arise, how they grow during numerical computations and how they affect the accuracy of a solution. Errors can come in a variety of forms and sizes. To get a quick feel let us look at the following taxonomy of errors:

Further discussion will focus on errors due to the computing machine and those due to the numerical method. First, the notion of significant digits will be introduced.

Significant Digits

Usually, the numerical solution to a given problem is sought to a desired level of accuracy and precision, wherein the error is below a set tolerance level. The idea of significant digits is essential to understanding the concepts of accuracy and precision in a solution and also to designating the reliability of a numerical value.

The Significant Digits of a number are those that can be used with confidence. Suppose we seek

a numerical solution to an accuracy of and obtain as solution . Here the

solution is reliable only up to the first three decimal places i.e or the solution has

five significant digits . Some numbers like , , etc. have infinite number

of significant digits. For example consider ,

=

Such numbers can never be represented exactly on a computer, which operates with a fixed number of significant digits due to hardware limitations. The omission of certain digits from such numbers results in what is called round-off error. Some rules of thumb on significant digits, within the desired level of accuracy, are:

(a) All non-zero digits are significant.

(b) All zeros occurring between non-zero digits are significant.

(c) Trailing zeros following a decimal point are significant.

(e.g , , have three significant digits),

(d) Zeros between the decimal point and preceding a non-zero digit are not significant. For

example , , , have

four significant digits.

(e) Trailing zeros in large numbers without the decimal point are not significant. For instance

may be written in scientific notation as and contains only two significant

digits.

The concept of accuracy and precision are closely related to significant digits as follows:

Accuracy refers to the number of significant digits in a value. For example the number is

accurate to five significant digits: Precision refers to the number of decimal positions i.e the order

of magnitude of the last digit in a value. The number has a precision of or .
