Numerical Methods, Fall 2010
Lecturer: conf. dr. Viorel Bostan
Office: 6-417
Telephone: 50-99-38
E-mail address: viorel [email protected]
Course web page: moodle.fcim.utm.md
Office hours: TBA. I will also be available at other times. Just drop by my office, talk to me after the class or send me an e-mail to make an appointment.
Prerequisites: A basic course on mathematical analysis (single and multivariable calculus), ordinary differential equations and some knowledge of computer programming.
Course outline: This is a fast-paced course. This course gives an in-depth introduction to the basic areas of numerical analysis. The main objective will be to have a clear understanding of the ideas and techniques underlying the numerical methods, results, and algorithms that will be presented, where error analysis plays an important role. You will then be able to use this knowledge to analyze the numerical methods and algorithms that you will encounter, and also to program them effectively on a computer. This knowledge will be useful in your future not only to solve problems with a numerical component, but also to develop numerical algorithms of your own.
Topics to be covered:
1. Computer representation of numbers. Errors: types, sources, propagation.
2. Solution of nonlinear equations. Rootfinding.
3. Interpolation by polynomials and spline functions.
6. Matrix computations and systems of linear equations.
7. Numerical methods for ODE.
This course plan may be modified during the semester. Such modifications will be announced in advance during class period. The student is responsible for keeping abreast of such changes.
Class procedure: The majority of each class period will be lecture oriented. Some material will be handed out during lectures, some material will be sent by e-mail. I strongly advise you to attend lectures, do your homework, work consistently, and ask questions. Lecture time is at a premium; you cannot be taught everything in class. It is your responsibility to learn the material; the instructor's job is to guide you in your learning.
During the semester, 10 homeworks and 4 programming projects will be assigned. As a general rule, you will find it necessary to spend approximately 2-3 hours of study for each lecture/lab meeting, and additional time will be needed for exam preparation. It is strongly advised that you start working on this course from the very beginning. The importance of doing the assigned homeworks and projects cannot be overemphasized.
Programming projects: The predominant programming languages used in numerical analysis are Fortran and MATLAB. We will focus on MATLAB. Programs in other languages are also sometimes acceptable, but no programming assistance will be given in the use of such languages (i.e. C, C++, Java, Pascal). For students unacquainted with MATLAB, the following e-readings are suggested:
1. Ian Cavers, An Introductory Guide to MATLAB, 2nd Edition, Dept. of Computer Science, University of British Columbia, December 1998,
2. The results for your test cases in forms of tables, graphs etc.;
3. Answers to all questions contained in the assignment;
4. Comments.
You should report your results in a way that is easy to read, communicates the problem and the results effectively, and can be reproduced by someone else who has not seen the problem before, but is technically knowledgeable. You should also give any justification or other reasons to believe the correctness of your results and code. Also, give conclusions on how effective your methods and routines appear to be, and report and comment on any "unusual behavior" of your results. Team working is allowed, but you should specify this in your report, as well as the tasks executed by each member of your team.
Grading policy: The final grade will be based on tests and hw/projects, as follows:
1. There will be one 3-hour written exam given after 8 weeks of classes at a time arranged later (presumably at the end of October). This midterm exam will count 25% of the course grade.
2. The final comprehensive exam will be given during the scheduled examination time at the end of the semester, it will cover all material, and it will count 35% of your final grade.
3. HW and lab projects will count 20% of the grade each. Late homeworks and projects are not allowed!
4. You will need a scientific calculator during exams. Sharing of calculators will not be allowed. Make sure you have one.
The exams will be open notes, i.e. you will be allowed to use your class notes and class slides (no other material will be allowed).
Grading for homeworks and lab projects
The HW will be graded on a scale from 0 to 4 with a possibility of getting an extra bonus point at each homework. Grades will be given according to the following guidelines:
0: no homework turned in;
1: poor job;
2: incomplete job;
3: good job;
4: very good job;
+1 for optional problems and/or an excellent/outstanding solution to one of the problems.
It is very important that you take the examinations at the scheduled times. Alternate exams will be scheduled only for those who have compelling and convincing enough reasons.
Academic misconduct: Any kind of academic misconduct will not be tolerated. If a situation arises where you and your instructor disagree on some matter and cannot resolve the issue, you should see the Dean. However, any problems concerning the course should first be discussed with your instructor.
Readings:
1. Kendall Atkinson, An Introduction to Numerical Analysis, 2nd edition.
2. Cleve Moler, Numerical Computing with MATLAB,
http://www.mathworks.com/moler/
3. Björck A., Dahlquist G., Numerical Mathematics and Scientific Computation.
4. Steven E. Pav, Numerical Methods Course Notes, University of California at San Diego, 2005.
Definition of Numerical Analysis by Kendall Atkinson, Prof. University of Iowa
Numerical analysis is the area of mathematics and computer science that creates, analyzes, and implements algorithms for solving numerically the problems of continuous mathematics.
Such problems originate generally from real-world applications of algebra, geometry and calculus, and they involve variables which vary continuously; these problems occur throughout the natural sciences, social sciences, engineering, medicine, and business.
During the past half-century, the growth in power and availability of digital computers has led to an increasing use of realistic mathematical models in science and engineering, and numerical analysis of increasing sophistication has been needed to solve these more detailed mathematical models of the world.
With the growth in importance of using computers to carry out numerical procedures in solving mathematical models of the world, an area known as scientific computing or computational science has taken shape during the 1980s and 1990s. This area looks at the use of numerical analysis from a computer science perspective. It is concerned with using the most powerful tools of numerical analysis, computer graphics, symbolic mathematical computations, and graphical user interfaces to make it easier for a user to set up, solve, and interpret complicated mathematical models of the real world.
Definition of Numerical Analysis by Lloyd N. Trefethen, Prof. Cornell University
Here is the wrong answer: Numerical analysis is the study of rounding errors.
Some other wrong or incomplete answers:
Webster's New Collegiate Dictionary: The study of quantitative approximations to the solutions of mathematical problems including consideration of the errors and bounds to the errors involved.
Chambers 20th Century Dictionary: The study of methods of approximation and their accuracy, etc.
The American Heritage Dictionary: The study of approximate solutions to mathematical problems taking into account the extent of possible errors.
Correct answer is: Numerical analysis is the study of algorithms for the problems of continuous mathematics.
NUMERICAL ANALYSIS: This refers to the analysis
of mathematical problems by numerical means, es-
pecially mathematical problems arising from models
based on calculus.
Effective numerical analysis requires several things:
• An understanding of the computational tool being used, be it a calculator or a computer.
• An understanding of the problem to be solved.
• Construction of an algorithm which will solve the given mathematical problem to a given desired accuracy and within the limits of the resources (time, memory, etc.) that are available.
This is a complex undertaking. Numerous people
make this their life’s work, usually working on only
a limited variety of mathematical problems.
Within this course, we attempt to show the spirit of
the subject. Most of our time will be taken up with
looking at algorithms for solving basic problems such
as rootfinding and numerical integration; but we will
also look at the structure of computers and the impli-
cations of using them in numerical calculations.
We begin by looking at the relationship of numerical
analysis to the larger world of science and engineering.
SCIENCE
Traditionally, engineering and science had a two-sided
approach to understanding a subject: the theoretical
and the experimental. More recently, a third approach
has become equally important: the computational.
Traditionally we would build an understanding by build-
ing theoretical mathematical models, and we would
solve these for special cases. For example, we would
study the flow of an incompressible irrotational fluid
past a sphere, obtaining some idea of the nature of
fluid flow. But more practical situations could seldom
be handled by direct means, because the needed equa-
tions were too difficult to solve. Thus we also used
the experimental approach to obtain better informa-
tion about the flow of practical fluids. The theory
would suggest ideas to be tried in the laboratory, and
the experimental results would often suggest direc-
tions for a further development of theory.
[Diagram: Theoretical Science, Experimental Science, and Computational Science as three complementary approaches]
With the rapid advance in powerful computers, we
now can augment the study of fluid flow by directly
solving the theoretical models of fluid flow as applied
to more practical situations; and this area is often re-
ferred to as “computational fluid dynamics”. At the
heart of computational science is numerical analysis;
and to effectively carry out a computational science
approach to studying a physical problem, we must un-
derstand the numerical analysis being used, especially
if improvements are to be made to the computational
techniques being used.
MATHEMATICAL MODELS
A mathematical model is a mathematical description of a physical situation. By means of studying the model, we hope to understand more about the physical situation. Such a model might be very simple. For example,

A = 4πR_e²,  R_e ≈ 6371 km
is a formula for the surface area of the earth. How
accurate is it? First, it assumes the earth is a sphere,
which is only an approximation. At the equator, the
radius is approximately 6,378 km; and at the poles,
the radius is approximately 6,357 km. Next, there is
experimental error in determining the radius; and in
addition, the earth is not perfectly smooth. Therefore,
there are limits on the accuracy of this model for the
surface area of the earth.
AN INFECTIOUS DISEASE MODEL
For rubella measles, we have the following model for
the spread of the infection in a population (subject to
certain assumptions).
ds/dt = −a·s·i
di/dt = a·s·i − b·i
dr/dt = b·i
In this, s, i, and r refer, respectively, to the propor-
tions of a total population that are susceptible, infec-
tious, and removed (from the susceptible and infec-
tious pool of people). All variables are functions of
time t. The constants can be taken as
a =6.8
11, b =
1
11The same model works for some other diseases (e.g.
flu), with a suitable change of the constants a and b.
Again, this is an approximation of reality (and a useful
one).
But it has its limits. Solving a bad model will not give
good results, no matter how accurately it is solved;
and the person solving this model and using the results
must know enough about the formation of the model
to be able to correctly interpret the numerical results.
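The three differential equations above are easy to integrate numerically. The course itself uses MATLAB; the following is a minimal forward-Euler sketch in Python, where the initial proportions and the time horizon are illustrative assumptions, not values from the notes:

```python
# Forward-Euler integration of the rubella model, with the constants
# a = 6.8/11 and b = 1/11 given above.
a = 6.8 / 11
b = 1.0 / 11

s, i, r = 0.99, 0.01, 0.0   # assumed initial proportions (illustrative)
dt = 0.1                    # step size
for _ in range(int(200 / dt)):
    ds = -a * s * i
    di = a * s * i - b * i
    dr = b * i
    s, i, r = s + dt * ds, i + dt * di, r + dt * dr

# Since ds/dt + di/dt + dr/dt = 0, the proportions keep summing to 1.
print(s + i + r)
print(s, i, r)
```

Note that the three right-hand sides sum to zero, so the total population proportion is conserved by the scheme (up to rounding).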
THE LOGISTIC EQUATION
This is the simplest model for population growth. Let
N(t) denote the number of individuals in a population
(rabbits, people, bacteria, etc). Then we model its
growth by
N′(t) = c·N(t),  t ≥ 0,  N(t_0) = N_0
The constant c is the growth constant, and it usually
must be determined empirically. Over short periods of
time, this is often an accurate model for population
growth. For example, it accurately models the growth
of US population over the period of 1790 to 1860, with
c = 0.2975.
THE PREDATOR-PREY MODEL
Let F (t) denote the number of foxes at time t; and
let R(t) denote the number of rabbits at time t. A
simple model for these populations is called the Lotka-
Volterra predator-prey model:
dR/dt = a·[1 − b·F(t)]·R(t)
dF/dt = c·[−1 + d·R(t)]·F(t)
with a, b, c, d positive constants. If one looks carefully
at this, then one can see how it is built from the logis-
tic equation. In some cases, this is a very useful model
and agrees with physical experiments. Of course, we
can substitute other interpretations, replacing foxes
and rabbits with other predator and prey. The model
will fail, however, when there are other populations
that affect the first two populations in a significant
way.
NEWTON’S SECOND LAW
Newton’s second law states that the force acting on
an object is directly proportional to the product of its
mass and acceleration,
F ∝ ma
With a suitable choice of physical units, we usually
write this in its scalar form as
F = ma
Newton’s law of gravitation for a two-body situation,
say the earth and an object moving about the earth is
then
m·d²r(t)/dt² = −(G·m·m_e / |r(t)|²) · (r(t)/|r(t)|)

with r(t) the vector from the center of the earth to the center of the object moving about the earth. The constant G is the gravitational constant, not dependent on the earth; and m and m_e are the masses, respectively, of the object and the earth.
This is an accurate model for many purposes. But
what are some physical situations under which it will
fail?
When the object is very close to the surface of the earth and does not move far from one spot, we take |r(t)| to be the radius of the earth. We obtain the new model

m·d²r(t)/dt² = −m·g·k

with k the unit vector directly upward from the earth's surface at the location of the object. The gravitational constant g ≈ 9.8 meters/second².
Again this is a model; it is not physical reality.
The Patriot Missile Failure
On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers.
A report of the General Accounting Office, GAO/IMTEC-92-26, Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, reported on the cause of the failure.
It turns out that the cause was an inaccurate calcula-
tion of the time since boot due to computer arithmetic
errors.
Specifically, the time in tenths of a second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24-bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds.
The number 1/10 equals

1/10 = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + . . .
     = (0.0001100110011001100110011001100 . . .)_2
Now the 24-bit register in the Patriot stored instead

(0.00011001100110011001100)_2

introducing an error of

(0.0000000000000000000000011001100 . . .)_2

which, converted to decimal, is

(0.000000095)_10
Multiplying by the number of tenths of a second in 100 hours gives:

0.000000095 × 100 × 60 × 60 × 10 = 0.34
A Scud travels at about 1676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
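The chopping arithmetic above can be checked directly. Here is a Python sketch; it assumes, as the digit strings above show, that the register held 23 bits after the radix point:

```python
from fractions import Fraction
from math import floor

tenth = Fraction(1, 10)
# Chop 1/10 after 23 binary digits (the stored digit string above
# keeps 23 bits after the radix point).
stored = Fraction(floor(tenth * 2**23), 2**23)
error = tenth - stored
print(float(error))          # about 9.5e-8, the (0.000000095)_10 above

# Number of tenths of a second in 100 hours, times the per-tick error:
drift = float(error) * 100 * 60 * 60 * 10
print(drift)                 # about 0.34 seconds
```

Using exact rational arithmetic (`fractions.Fraction`) avoids introducing any new rounding while reproducing the register's chopping error.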
The following paragraph is excerpted from the GAO report.
The range gate's prediction of where the Scud will next appear is a function of the Scud's known velocity and the time of the last radar detection. Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563... miles per hour). Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the system has been running, the larger the number representing time. To predict where the Scud will next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a less accurate time calculation. The effect of this inaccuracy on the range gate's calculation is directly proportional to the target's velocity and the length of time the system has been running. Consequently, performing the conversion after the Patriot has been running continuously for extended periods causes the range gate to shift away from the center of the target, making it less likely that the target, in this case a Scud, will be successfully intercepted.
CALCULATION OF FUNCTIONS
Using hand calculations, a hand calculator, or a computer, what are the basic operations of which we are capable? In essence, they are addition, subtraction, multiplication, and division (and even this will usually require a truncation of the quotient at some point). In addition, we can make logical decisions, such as deciding which of the following are true for two real numbers a and b:
a > b, a = b, a < b
Furthermore, we can carry out only a finite number of such operations. If we limit ourselves to just addition, subtraction, and multiplication, then in evaluating functions f(x) we are limited to the evaluation of polynomials:

p(x) = a_0 + a_1·x + · · · + a_n·x^n

In this, n is the degree (provided a_n ≠ 0) and {a_0, ..., a_n} are the coefficients of the polynomial. Later we will discuss the efficient evaluation of polynomials; but for now, we ask how we are to evaluate other functions such as e^x, cos x, log x, and others.
TAYLOR POLYNOMIAL APPROXIMATIONS
We begin with an example, that of f(x) = ex from
the text. Consider evaluating it for x near to 0. We
look for a polynomial p(x) whose values will be the
same as those of ex to within acceptable accuracy.
Begin with a linear polynomial p(x) = a_0 + a_1·x. Then
to make its graph look like that of ex, we ask that the
graph of y = p(x) be tangent to that of y = ex at
x = 0. Doing so leads to the formula
p(x) = 1 + x
Continue in this manner looking next for a quadratic
polynomial
p(x) = a_0 + a_1·x + a_2·x²
We again make it tangent; and to determine a2, we
also ask that p(x) and ex have the same “curvature”
at the origin. Combining these requirements, we have
for f(x) = ex that
p(0) = f(0),  p′(0) = f′(0),  p′′(0) = f′′(0)
This yields the approximation
p(x) = 1 + x + (1/2)x²
We continue this pattern, looking for a polynomial
p(x) = a_0 + a_1·x + a_2·x² + · · · + a_n·x^n
We now require that
p(0) = f(0),  p′(0) = f′(0),  · · · ,  p^(n)(0) = f^(n)(0)
This leads to the formula
p(x) = 1 + x + (1/2!)x² + · · · + (1/n!)x^n
What are the problems when evaluating points x that
are far from 0?
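One way to see the problem is to fix the degree and watch the error grow with |x|. A quick Python check (degree 5 is an arbitrary choice for illustration):

```python
import math

def p(n, x):
    """Degree-n Taylor polynomial of e^x about 0: sum of x^k / k!."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

# The error is tiny near 0 but grows rapidly as x moves away from 0.
for x in (0.1, 1.0, 4.0):
    print(x, abs(math.exp(x) - p(5, x)))
```

The printed errors grow by many orders of magnitude from x = 0.1 to x = 4, which is exactly the difficulty with evaluating far from the expansion point.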
TAYLOR’S APPROXIMATION FORMULA
Let f(x) be a given function, and assume it has derivatives around some point x = a (with as many derivatives as we find necessary). We seek a polynomial
p(x) of degree at most n, for some non-negative inte-
ger n, which will approximate f(x) by satisfying the
following conditions:
p(a) = f(a)
p′(a) = f′(a)
p′′(a) = f′′(a)
...
p^(n)(a) = f^(n)(a)
The general formula for this polynomial is
p_n(x) = f(a) + (x − a)·f′(a) + (1/2!)(x − a)²·f′′(a) + · · · + (1/n!)(x − a)^n·f^(n)(a)
Then f(x) ≈ pn(x) for x close to a.
TAYLOR POLYNOMIALS FOR f(x) = log x
In this case, we expand about the point x = 1, making
the polynomial tangent to the graph of f(x) = log x
at the point x = 1. For a general degree n ≥ 1, this
results in the polynomial
p_n(x) = (x − 1) − (1/2)(x − 1)² + (1/3)(x − 1)³ − · · · + (−1)^(n−1)·(1/n)(x − 1)^n
Note the graphs of these polynomials for varying n.
THE TAYLOR POLYNOMIAL ERROR FORMULA
Let f(x) be a given function, and assume it has derivatives around some point x = a (with as many derivatives as we find necessary). For the error in the Taylor polynomial p_n(x), we have the formulas
f(x) − p_n(x) = (1/(n+1)!)·(x − a)^(n+1)·f^(n+1)(c_x)
             = (1/n!)·∫_a^x (x − t)^n·f^(n+1)(t) dt
The point c_x is restricted to the interval bounded by x and a, and otherwise c_x is unknown. We will use the first form of this error formula, although the second is more precise in that you do not need to deal with the unknown point c_x.
Consider the special case of n = 0. Then the Taylor
polynomial is the constant function:
f(x) ≈ p_0(x) = f(a)
The first form of the error formula becomes
f(x) − p_0(x) = f(x) − f(a) = (x − a)·f′(c_x)
with c_x between a and x. You have seen this in
your beginning calculus course, and it is called the
mean-value theorem. The error formula
f(x) − p_n(x) = (1/(n+1)!)·(x − a)^(n+1)·f^(n+1)(c_x)
can be considered a generalization of the mean-value
theorem.
EXAMPLE: f(x) = ex
For general n ≥ 0, and expanding e^x about x = 0, we have that the degree n Taylor polynomial approximation is given by
p_n(x) = 1 + x + (1/2!)x² + (1/3!)x³ + · · · + (1/n!)x^n
For the derivatives of f(x) = e^x, we have

f^(k)(x) = e^x,  f^(k)(0) = 1,  k = 0, 1, 2, ...
For the error,
e^x − p_n(x) = (1/(n+1)!)·x^(n+1)·e^(c_x)
with c_x located between 0 and x. Note that for x ≈ 0, we must have c_x ≈ 0 and
e^x − p_n(x) ≈ (1/(n+1)!)·x^(n+1)
This last term is also the final term in p_{n+1}(x), and thus

e^x − p_n(x) ≈ p_{n+1}(x) − p_n(x)
Consider calculating an approximation to e. Then let x = 1 in the earlier formulas to get

p_n(1) = 1 + 1 + 1/2! + 1/3! + · · · + 1/n!

For the error,

e − p_n(1) = (1/(n+1)!)·e^(c_x),  0 ≤ c_x ≤ 1
To bound the error, we have

e^0 ≤ e^(c_x) ≤ e^1

1/(n+1)! ≤ e − p_n(1) ≤ e/(n+1)!
To have an approximation accurate to within 10^(−5), we choose n large enough to have

e/(n+1)! ≤ 10^(−5)

which is true if n ≥ 8. In fact,

e − p_8(1) ≤ e/9! ≈ 7.5 × 10^(−6)
Then calculate p_8(1) ≈ 2.71827877, and e − p_8(1) ≈ 3.06 × 10^(−6).
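These numbers are easy to reproduce; a short Python check of p_8(1) and its error:

```python
import math

# p_8(1) = sum of 1/k! for k = 0..8
p8 = sum(1 / math.factorial(k) for k in range(9))
print(p8)               # 2.71827877...
print(math.e - p8)      # about 3.06e-6, inside the bound e/9! ≈ 7.5e-6
```

The computed error indeed sits between the theoretical bounds 1/9! and e/9! derived above.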
FORMULAS OF STANDARD FUNCTIONS
1/(1 − x) = 1 + x + x² + · · · + x^n + x^(n+1)/(1 − x)

cos x = 1 − x²/2! + x⁴/4! − · · · + (−1)^m·x^(2m)/(2m)! + (−1)^(m+1)·x^(2m+2)/(2m+2)!·cos c_x

sin x = x − x³/3! + x⁵/5! − · · · + (−1)^(m−1)·x^(2m−1)/(2m−1)! + (−1)^m·x^(2m+1)/(2m+1)!·cos c_x

with c_x between 0 and x.
OBTAINING TAYLOR FORMULAS
Most Taylor polynomials have been obtained by means other than using the formula

p_n(x) = f(a) + (x − a)·f′(a) + (1/2!)(x − a)²·f′′(a) + · · · + (1/n!)(x − a)^n·f^(n)(a)
because of the difficulty of obtaining the derivatives
f (k)(x) for larger values of k. Actually, this is now
much easier, as we can use Maple or Mathematica.
Nonetheless, most formulas have been obtained by
manipulating standard formulas; and examples of this
are given in the text.
For example, use
e^t = 1 + t + (1/2!)t² + (1/3!)t³ + · · · + (1/n!)t^n + (1/(n+1)!)·t^(n+1)·e^(c_t)
in which c_t is between 0 and t. Let t = −x² to obtain
e^(−x²) = 1 − x² + (1/2!)x⁴ − (1/3!)x⁶ + · · · + ((−1)^n/n!)·x^(2n) + ((−1)^(n+1)/(n+1)!)·x^(2n+2)·e^(−ξ_x)
Because c_t must be between 0 and −x², it must be negative. Thus we let c_t = −ξ_x in the error term, with 0 ≤ ξ_x ≤ x².
EVALUATING A POLYNOMIAL
Consider having a polynomial
p(x) = a_0 + a_1·x + a_2·x² + · · · + a_n·x^n
which you need to evaluate for many values of x. How
do you evaluate it? This may seem a strange question,
but the answer is not as obvious as you might think.
The standard way, written in a loose algorithmic format:

poly = a_0
for j = 1:n
    poly = poly + a_j·x^j
end
To compare the costs of different numerical meth-
ods, we do an operations count, and then we compare
these for the competing methods. Above, the counts
are as follows:
additions: n
multiplications: 1 + 2 + 3 + · · · + n = n(n+1)/2
This assumes each term a_j·x^j is computed independently of the remaining terms in the polynomial.
Next, compute the powers x^j recursively:

x^j = x · x^(j−1)

Then computing x², x³, ..., x^n will cost n − 1 multiplications. Our algorithm becomes

poly = a_0 + a_1·x
power = x
for j = 2:n
    power = x·power
    poly = poly + a_j·power
end
The total operations cost is

additions: n
multiplications: n + (n − 1) = 2n − 1

When n is even moderately large, this is much less than for the first method of evaluating p(x). For example, with n = 20, the first method has 210 multiplications, whereas the second has 39 multiplications.
We now consider nested multiplication. As examples of particular degrees, write

n = 2: p(x) = a_0 + x(a_1 + a_2·x)
n = 3: p(x) = a_0 + x(a_1 + x(a_2 + a_3·x))
n = 4: p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4·x)))

These contain, respectively, 2, 3, and 4 multiplications. This is less than the preceding method, which would have needed 3, 5, and 7 multiplications, respectively.

For the general case, write

p(x) = a_0 + x(a_1 + x(a_2 + · · · + x(a_{n−1} + a_n·x) · · · ))

This requires n multiplications, which is only about half that for the preceding method. For an algorithm, write

poly = a_n
for j = n−1 : −1 : 0
    poly = a_j + x·poly
end
With all three methods, the number of additions is n; but the number of multiplications can be dramatically different for large values of n.
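The three schemes can be compared directly by counting multiplications. A Python sketch, where the coefficients and the point x are arbitrary test values:

```python
def eval_naive(a, x):
    """Each term a_j * x^j computed from scratch: n(n+1)/2 multiplications."""
    poly, mults = a[0], 0
    for j in range(1, len(a)):
        term = a[j]
        for _ in range(j):      # a_j * x^j costs j multiplications
            term *= x
            mults += 1
        poly += term
    return poly, mults

def eval_powers(a, x):
    """Running power x^j = x * x^(j-1): 2n - 1 multiplications."""
    poly, mults = a[0] + a[1] * x, 1
    power = x
    for j in range(2, len(a)):
        power *= x
        poly += a[j] * power
        mults += 2
    return poly, mults

def eval_horner(a, x):
    """Nested multiplication: n multiplications."""
    poly, mults = a[-1], 0
    for j in range(len(a) - 2, -1, -1):
        poly = a[j] + x * poly
        mults += 1
    return poly, mults

a = [2.0, -1.0, 3.0, 0.5, 4.0]   # degree n = 4
x = 1.25
for f in (eval_naive, eval_powers, eval_horner):
    print(f(a, x))   # same value; 10, 7, and 4 multiplications
```

For n = 4 the counts are n(n+1)/2 = 10, 2n − 1 = 7, and n = 4, matching the operation counts derived above.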
NESTED MULTIPLICATION
Imagine we are evaluating the polynomial
p(x) = a_0 + a_1·x + a_2·x² + · · · + a_n·x^n
at a point x = z. Thus with nested multiplication

p(z) = a_0 + z(a_1 + z(a_2 + · · · + z(a_{n−1} + a_n·z) · · · ))

We can write this as the following sequence of operations:

b_n = a_n
b_{n−1} = a_{n−1} + z·b_n
b_{n−2} = a_{n−2} + z·b_{n−1}
...
b_0 = a_0 + z·b_1
The quantities b_{n−1}, ..., b_0 are simply the quantities in parentheses, starting from the innermost and working outward.
Introduce
q(x) = b_1 + b_2·x + b_3·x² + · · · + b_n·x^(n−1)
Claim:
p(x) = b_0 + (x − z)·q(x)   (*)

Proof: Simply expand

b_0 + (x − z)·(b_1 + b_2·x + b_3·x² + · · · + b_n·x^(n−1))

and use the fact that

z·b_j = b_{j−1} − a_{j−1},  j = 1, ..., n
With this result (*), we have

p(x)/(x − z) = b_0/(x − z) + q(x)

Thus q(x) is the quotient when dividing p(x) by x − z, and b_0 is the remainder.
If z is a zero of p(x), then b_0 = 0; and then

p(x) = (x − z)·q(x)
For the remaining roots of p(x), we can concentrate
on finding those of q(x). In rootfinding for polynomi-
als, this process of reducing the size of the problem is
called deflation.
Another consequence of (*) is the following. Form the derivative of (*) with respect to x, obtaining

p′(x) = (x − z)·q′(x) + q(x)

p′(z) = q(z)
Thus to evaluate p(x) and p′(x) simultaneously at x = z, we can use nested multiplication for p(z), and we can use the intermediate steps of this to also evaluate p′(z). This is useful when doing rootfinding problems for polynomials by means of Newton's method.
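A sketch in Python of this b_j sequence, returning b_0 = p(z), the deflated polynomial q, and q(z) = p′(z); storing the coefficients in the order a_0, ..., a_n is an implementation choice, not something fixed by the notes:

```python
def horner_full(a, z):
    """Nested multiplication at x = z for p(x) = a[0] + a[1] x + ... .
    Returns b_0 = p(z), the coefficients of q(x), and q(z) = p'(z)."""
    n = len(a) - 1
    b = [0.0] * (n + 1)
    b[n] = a[n]
    for j in range(n - 1, -1, -1):
        b[j] = a[j] + z * b[j + 1]
    q = b[1:]                   # q(x) = b_1 + b_2 x + ... + b_n x^(n-1)
    qz = q[-1]
    for c in reversed(q[:-1]):  # evaluate q(z) by the same scheme
        qz = c + z * qz
    return b[0], q, qz

# p(x) = x^2 - 3x + 2 = (x - 1)(x - 2), so z = 1 is a root.
p0, q, dp = horner_full([2.0, -3.0, 1.0], 1.0)
print(p0)   # 0.0: the remainder b_0 vanishes at a root
print(q)    # [-2.0, 1.0]: the deflated factor q(x) = x - 2
print(dp)   # -1.0: equals p'(1) = 2*1 - 3
```

This is exactly the p(z), p′(z) pair that Newton's method needs at each iteration, plus the deflated polynomial for finding the remaining roots.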
APPROXIMATING SF(x)
Define

SF(x) = (1/x)·∫_0^x (sin t / t) dt,  x ≠ 0
We use Taylor polynomials to approximate this func-
tion, to obtain a way to compute it with accuracy and
simplicity.
[Figure: graph of y = SF(x) for −8 ≤ x ≤ 8]
As an example, begin with the degree 3 Taylor approximation to sin t, expanded about t = 0:

sin t = t − (1/6)t³ + (1/120)t⁵·cos c_t
with c_t between 0 and t. Then

sin t / t = 1 − (1/6)t² + (1/120)t⁴·cos c_t

∫_0^x (sin t / t) dt = ∫_0^x [1 − (1/6)t² + (1/120)t⁴·cos c_t] dt
                    = x − (1/18)x³ + (1/120)·∫_0^x t⁴·cos c_t dt

(1/x)·∫_0^x (sin t / t) dt = 1 − (1/18)x² + R_2(x)

R_2(x) = (1/120)·(1/x)·∫_0^x t⁴·cos c_t dt
How large is the error in the approximation

SF(x) ≈ 1 − (1/18)x²

on the interval [−1, 1]? Since |cos c_t| ≤ 1, we have for x > 0 that

0 ≤ R_2(x) ≤ (1/120)·(1/x)·∫_0^x t⁴ dt = (1/600)x⁴

and the same result can be shown for x < 0. Then for |x| ≤ 1, we have

0 ≤ R_2(x) ≤ 1/600
To obtain a more accurate approximation, we can pro-
ceed exactly as above, but simply use a higher degree
approximation to sin t.
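The bound can be confirmed numerically. A Python sketch that computes reference values of SF(x) by the trapezoidal rule (the grid size m is an arbitrary choice, large enough that quadrature error is negligible next to 1/600):

```python
import math

def SF(x, m=20000):
    """SF(x) = (1/x) * integral_0^x sin(t)/t dt via the trapezoidal rule.
    sin(t)/t -> 1 as t -> 0, so the left endpoint contributes 1."""
    h = x / m
    total = 0.5 * (1.0 + math.sin(x) / x)   # endpoint terms, halved
    for k in range(1, m):
        t = k * h
        total += math.sin(t) / t
    return total * h / x

for x in (0.25, 0.5, 1.0):
    r2 = SF(x) - (1 - x * x / 18)
    print(x, r2)          # each remainder lies in [0, 1/600]
```

At x = 1 the remainder is close to its bound 1/600 ≈ 0.00167, while for smaller x it shrinks like x⁴, as the derivation predicts.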
BINARY INTEGERS
A binary integer x is a finite sequence of the digits 0 and 1, which we write symbolically as

x = (a_m a_{m−1} · · · a_2 a_1 a_0)_2

where I insert the parentheses with subscript ( )_2 in order to make clear that the number is binary. The above has the decimal equivalent

x = a_m·2^m + a_{m−1}·2^(m−1) + · · · + a_1·2^1 + a_0
For example, the binary integer x = (110101)_2 has the decimal value

x = 2^5 + 2^4 + 2^2 + 2^0 = 53
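The expansion can be evaluated digit by digit; a small Python check:

```python
bits = "110101"
value = 0
for d in bits:               # leading digit a_m first
    value = 2 * value + int(d)
print(value)                 # 53
print(int("110101", 2))      # Python's built-in conversion agrees
```

Processing the digits left to right and multiplying by 2 at each step is itself a small instance of nested multiplication, with x = 2.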
The binary integer x = (111 · · · 1)_2 with m ones has the value 2^m − 1; for example, (11111111111)_2 = 2^11 − 1 = 2047.
What is the connection of the 24 bits in the significand of x to the number of decimal digits in the storage of a number x into floating point form? One way of answering this is to find the integer M for which

1. 0 < x ≤ M and x an integer implies fl(x) = x; and
2. fl(M + 1) ≠ M + 1.
This integer M is at least as big as

(11 . . . 1)_2  (24 ones) = (1.11 . . . 1)_2 · 2^23 = 2^23 + 2^22 + · · · + 2^0 = 2^24 − 1
Also, 2^24 = (1.00 . . . 0)_2 · 2^24 will be stored exactly. The next integer, 2^24 + 1, cannot be stored exactly, since its significand would contain 24 + 1 binary digits:

2^24 + 1 = (1.00 . . . 0 1)_2 · 2^24  (with 23 zeros between the two ones)
Therefore for single precision M = 2^24. Any integer less than or equal to M will be stored exactly. So

M = 2^24 = 16777216.

For the IEEE double precision standard we have

M = 2^53 ≈ 9.0 × 10^15.
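This can be observed from Python, which has no built-in single-precision type; round-tripping a value through a 4-byte IEEE float with the struct module simulates one (a sketch):

```python
import struct

def to_single(x):
    """Round x to IEEE single precision and back."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

M = 2**24
print(to_single(M) == M)          # True: 2^24 is stored exactly
print(to_single(M - 1) == M - 1)  # True: integers up to M are exact
print(to_single(M + 1) == M)      # True: 2^24 + 1 collapses back to 2^24

# Python's own float is IEEE double precision, where M = 2^53:
print(float(2**53 + 1) == 2**53)  # True: 2^53 + 1 is not representable
```

In both precisions, M + 1 falls exactly halfway between two representable numbers, and the default rounding sends it back down to M.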
THE MACHINE EPSILON
Let y be the smallest number representable in the machine arithmetic that is greater than 1. The machine epsilon is η = y − 1. It is a widely used measure of the accuracy possible in representing numbers in the machine.
The number 1 has the simple floating point represen-
tation
1 = (1.00 · · · 0)_2 · 2^0
What is the smallest number that is greater than 1?
It is
1 + 2^(−23) = (1.0 · · · 01)_2 · 2^0 > 1
and the machine epsilon in IEEE single precision floating point format is η = 2^(−23) ≈ 1.19 × 10^(−7).
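The same struct-based single-precision round-trip lets us watch this from Python (a sketch; 2^(−25) is chosen as an example of a quantity well below the last significand bit):

```python
import struct

def to_single(x):
    """Round x to IEEE single precision and back (via a 4-byte pack)."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

eta = 2.0**-23
print(to_single(1.0 + eta))            # the next single after 1
print(to_single(1.0 + eta) > 1.0)      # True
print(to_single(1.0 + 2.0**-25) == 1.0)  # True: 2^-25 is lost entirely
print(eta)                             # about 1.19e-7
```

So 1 + 2^(−23) survives the conversion to single precision, while anything below half a unit in the last place of the significand does not.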
THE UNIT ROUND
Consider the smallest number δ > 0 that is repre-
sentable in the machine and for which
1 + δ > 1
in the arithmetic of the machine.
For any number 0 < α < δ, the result of 1 + α is exactly 1 in the machine's arithmetic. Thus α 'drops off the end' of the floating point representation in the machine. The size of δ is another way of describing the accuracy attainable in the floating point representation of the machine. The machine epsilon has been replacing it in recent years.
It is not too difficult to derive δ. The number 1 has
the simple floating point representation
1 = (1.00 · · · 0)_2 · 2^0
What is the smallest number which can be added to
this without disappearing? Certainly we can write
1 + 2^(−23) = (1.0 · · · 01)_2 · 2^0 > 1
Past this point, we need to know whether we are using chopped arithmetic or rounded arithmetic. We will shortly look at both of these. With chopped arithmetic, δ = 2^(−23); and with rounded arithmetic, δ = 2^(−24).
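Python's float is an IEEE double with rounded arithmetic (n = 53), so the same experiment can be run directly; 2^(−52) and 2^(−54) are chosen to sit clearly on either side of the threshold:

```python
import sys

print(1.0 + 2.0**-52 > 1.0)    # True: 2^-52 is the gap from 1 to the next double
print(1.0 + 2.0**-54 == 1.0)   # True: 2^-54 'drops off the end'
print(sys.float_info.epsilon)  # 2.220446049250313e-16 = 2^-52
```

With rounding, δ is about 2^(−53), halfway between the two exponents tested above (the exact tie case depends on the rounding rule, so it is avoided here).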
ROUNDING AND CHOPPING
Let us first consider these concepts with decimal arithmetic. We write a computer floating point number z as

z = σ · ζ · 10^e ≡ σ · (a_1.a_2 · · · a_n)_10 · 10^e

with a_1 ≠ 0, so that there are n decimal digits in the significand (a_1.a_2 · · · a_n)_10.
Given a general number

x = σ · (a_1.a_2 · · · a_n · · · )_10 · 10^e,  a_1 ≠ 0

we must shorten it to fit within the computer. This is done by either chopping or rounding. The floating point chopped version of x is given by

fl(x) = σ · (a_1.a_2 · · · a_n)_10 · 10^e

where we assume that e fits within the bounds required by the computer or calculator.
For the rounded version, we must decide whether to round up or round down. A simplified formula is

fl(x) = σ · (a1.a2···an)_10 · 10^e                          if an+1 < 5,
fl(x) = σ · [(a1.a2···an)_10 + (0.0···1)_10] · 10^e         if an+1 ≥ 5.

The term (0.0···1)_10 denotes 10^(−n+1), giving the ordinary sense of rounding with which you are familiar. In the tie case

(0.0···0 an+1 an+2 ···)_10 = (0.0···0500···)_10

a more elaborate procedure is used so as to assure an unbiased rounding.
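Both definitions are easy to simulate. In the sketch below (our own illustration; the helper name fl is ours) a number is chopped or rounded half-up to n = 4 significant decimal digits:

```python
import math

# Simulate fl(x) for a base-10 machine with an n-digit significand.
# mode="chop" drops the digits past a_n; mode="round" rounds half up.
def fl(x, n=4, mode="round"):
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x)))   # exponent so the significand is in [1, 10)
    scaled = abs(x) / 10.0**e
    factor = 10.0**(n - 1)
    if mode == "chop":
        digits = math.floor(scaled * factor)
    else:
        digits = math.floor(scaled * factor + 0.5)
    return math.copysign(digits / factor * 10.0**e, x)

print(fl(math.pi, 4, "chop"))    # 3.141
print(fl(math.pi, 4, "round"))   # 3.142
```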
CHOPPING/ROUNDING IN BINARY
Let
x = σ · (1.a2···an an+1···)_2 · 2^e

with all ai equal to 0 or 1. Then for a chopped floating point representation, we have

fl(x) = σ · (1.a2···an)_2 · 2^e

For a rounded floating point representation, we have (paralleling the decimal case)

fl(x) = σ · (1.a2···an)_2 · 2^e                         if an+1 = 0,
fl(x) = σ · [(1.a2···an)_2 + (0.0···1)_2] · 2^e         if an+1 = 1.

The error x − fl(x) = 0 when x needs no change to be put into the computer or calculator. Of more interest
is the case when the error is nonzero. Consider first
the case x > 0 (meaning σ = +1). The case with
x < 0 is the same, except for the sign being opposite.
With x ≠ fl(x), and using chopping, we have

fl(x) < x

and the error x − fl(x) is always positive. This fact later has major consequences in extended numerical computations. With x ≠ fl(x) and rounding, the error x − fl(x) is negative for half the values of x, and it is positive for the other half of possible values of x.
We often write the relative error as

(x − fl(x)) / x = −ε

This can be expanded to obtain

fl(x) = (1 + ε) x

Thus fl(x) can be considered as a perturbed value of x. This is used in many analyses of the effects of chopping and rounding errors in numerical computations. For bounds on ε, we have

−2^(−n) ≤ ε ≤ 2^(−n)      (rounding)
−2^(−n+1) ≤ ε ≤ 0         (chopping)
IEEE ARITHMETIC
We are only giving the minimal characteristics of IEEE arithmetic. There are many options available for the types of arithmetic and the chopping/rounding. The default arithmetic uses rounding.

Single precision arithmetic:

n = 24,   −126 ≤ e ≤ 127

This results in

M = 2^24 = 16777216
η = 2^(−23) ≈ 1.19 × 10^(−7)

Double precision arithmetic:

n = 53,   −1022 ≤ e ≤ 1023

What are M and η? There is also an extended representation, having n = 69 digits in its significand.
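For the double precision question, the same pattern as in single precision (M = 2^n, η = 2^(1−n)) gives the answers; the Python check below is our own sketch:

```python
import sys

# Double precision: n = 53 significand digits.
n = 53
M = 2**n               # largest magnitude of consecutively representable integers
eta = 2.0**(1 - n)     # machine epsilon, 2^-52

print(M)                               # 9007199254740992
print(eta == sys.float_info.epsilon)   # True
```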
MATLAB can be used to generate the binary floating point representation of a number. Execute in MATLAB the command:

format hex

This will cause all subsequent numerical output to the screen to be given in hexadecimal format (base 16). For example, listing the number 7.125 results in an output of

401c800000000000

The 16 hexadecimal digits are

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}

To obtain the binary representation, convert each hexadecimal digit to a four digit binary number according to the table below:

hex  binary    hex  binary
0    0000      8    1000
1    0001      9    1001
2    0010      a    1010
3    0011      b    1011
4    0100      c    1100
5    0101      d    1101
6    0110      e    1110
7    0111      f    1111
For the above number, we obtain the binary expansion

0100 0000 0001 1100 1000 0000 0000 ··· 0000
 4    0    1    c    8    0    0   ···  0

The leading bit is the sign σ = 0; the next 11 bits form the exponent field E = (10000000001)_2 = 1025, so e = 1025 − 1023 = 2; and the remaining bits 1100100000···0 give the significand (1.1100100···0)_2. This provides us with the IEEE double precision representation of 7.125 = (1.1100100···0)_2 · 2^2.
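The same hex string can be reproduced outside MATLAB; this sketch (ours, using Python's standard struct module) prints the raw bits of the IEEE double for 7.125:

```python
import struct

# Pack 7.125 as a big-endian IEEE double, reinterpret the 8 bytes
# as a 64-bit unsigned integer, and print it in hexadecimal.
bits = struct.unpack(">Q", struct.pack(">d", 7.125))[0]
print(f"{bits:016x}")   # 401c800000000000
```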
SOME DEFINITIONS
Let xT denote the true value of some number, usually unknown in practice; and let xA denote an approximation of xT.

The error in xA is

error(xA) = xT − xA

The relative error in xA is

rel(xA) = error(xA) / xT = (xT − xA) / xT

Example: xT = e, xA = 19/7. Then

error(xA) = e − 19/7 = 0.003996
rel(xA) = 0.003996 / e = 0.00147
Relative error is a more meaningful measure of the difference between the true value and the approximate one.

Example: Suppose the distance between two cities is DT = 100 km and let this distance be approximated by DA = 99 km. In this case,

Err(DA) = DT − DA = 1 km,
Rel(DA) = Err(DA) / DT = 0.01 = 1%.

Now, suppose that a distance is dT = 2 km and we estimate it with dA = 1 km. Then

Err(dA) = dT − dA = 1 km,
Rel(dA) = Err(dA) / dT = 0.5 = 50%.

In both cases the error is the same. But, obviously, DA is a better approximation of DT than dA is of dT.
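The two distance examples can be recomputed directly; the helpers err and rel below are our own illustration of the definitions:

```python
# Absolute and relative error of an approximation x_a to a true value x_t.
def err(x_t, x_a):
    return x_t - x_a

def rel(x_t, x_a):
    return (x_t - x_a) / x_t

print(err(100.0, 99.0), rel(100.0, 99.0))   # 1.0 0.01  -> 1%
print(err(2.0, 1.0), rel(2.0, 1.0))         # 1.0 0.5   -> 50%
```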
Numerical Analysis
conf.dr. Bostan Viorel
Fall 2010 Lecture 3
1 / 83
Sources of Error
The sources of error in the computation of the solution of a mathematical model for some physical situation can be roughly characterised as follows:

1. Modelling Error.
Consider the example of a projectile of mass m that is travelling through the earth's atmosphere. A simple and often-used description of projectile motion is given by

m d²r/dt² = −mg k − b dr/dt

with b ≥ 0, where r(t) is the position vector of the projectile, k is the vertical unit vector, and the final term represents the friction force of the air. If there is an error in this model of the physical situation, then the numerical solution of the equation is not going to improve the results.
Sources of Error
2. Physical / Observational / Measurement Error.
The radius of an electron is given by

(2.81777 + ε) × 10^(−13) cm,   |ε| ≤ 0.00011

This error cannot be removed, and it must affect the accuracy of any computation in which it is used. We need to be aware of these effects and arrange the computation so as to minimize them.
Sources of Error
3. Approximation Error.
This is also called "discretization error" and "truncation error"; it is the main source of error with which we deal in this course. Such errors generally occur when we replace a computationally unsolvable problem with a nearby problem that is more tractable computationally. For example, the Taylor polynomial approximation

e^x ≈ 1 + x + (1/2) x²

contains an approximation error. The numerical integration

∫₀¹ f(x) dx ≈ (1/N) Σ_{j=1}^{N} f(j/N)

contains an approximation error.
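Both approximations are easy to try; the sketch below (our own illustration) compares the truncated Taylor polynomial and the Riemann sum against accurate values:

```python
import math

# Truncated Taylor polynomial for e^x.
def taylor_exp(x):
    return 1.0 + x + 0.5 * x**2

# Right-endpoint Riemann sum for the integral of f over [0, 1].
def riemann(f, N):
    return sum(f(j / N) for j in range(1, N + 1)) / N

print(abs(taylor_exp(0.1) - math.exp(0.1)))              # small truncation error
print(abs(riemann(math.sin, 1000) - (1 - math.cos(1))))  # small approximation error
```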
Sources of Error
4. Finiteness of Algorithm Error.
This is an error due to stopping an algorithm after a finite number of iterations. Even if, theoretically, an algorithm could run indefinitely, it will be stopped after a finite (usually specified) number of iterations.
Sources of Error
5. Blunders.
In the pre-computer era, blunders were mostly arithmetic errors. In the earlier years of the computer era, the typical blunder was a programming bug. Present day "blunders" are still often programming errors. But now they are often much more difficult to find, as they are often embedded in very large codes which may mask their effect. Some simple rules to decrease the risk of having a bug in the code:

Break programs into small testable subprograms;
Run test cases for which you know the outcome;
When running the full code, maintain a skeptical eye on the output, checking whether the output is reasonable or not.
Sources of Error
6. Rounding/Chopping Error.
This is the main source of many problems, especially problems in solving systems of linear equations. We will later look at the effects of such errors.
Sources of Error
7. Finiteness of Precision Errors.
All numbers stored in computer memory are subject to the finiteness of the space allocated for their storage.
Pendulum Example
Original problem in engineering or in science to be solved:

[Figure: pendulum of length l, with angular displacement θ, string tension T, and weight mg]

Model this physical problem mathematically. Newton's second law provides us with

θ'' = −(g/l) sin θ

or, written as a first-order system,

θ' = ω
ω' = −(g/l) sin θ
Pendulum Example
Problem of continuous mathematics:

θ' = ω
ω' = −(g/l) sin θ

Modeling Errors
Physical Errors
Pendulum Example
Mathematical Algorithms:

θ_{n+1} = θ_n + h ω_{n+1}
ω_{n+1} = ω_n − h (g/l) sin(θ_n)

Discretisation Errors
Finiteness of Algorithm Errors
Pendulum Example
Computer Implementation:

for i = 1:Nmax
    Omega = Omega - H*g/L*sin(Theta);
    Theta = Theta + H*Omega;
end

Rounding / Chopping Errors
Bugs in the Code
Finite Precision Errors
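A Python transcription of the MATLAB loop above, for readers who want to run it; the values of g, l, h, Nmax and the initial state are our own assumptions:

```python
import math

g, L, H = 9.81, 1.0, 0.001     # gravity, pendulum length, step size (assumed)
theta, omega = 0.1, 0.0        # assumed start: small displacement, at rest
for _ in range(10000):         # Nmax = 10000 steps, i.e. 10 seconds
    omega = omega - H * g / L * math.sin(theta)
    theta = theta + H * omega  # uses the freshly updated omega
print(theta, omega)
```

Note that Theta is updated with the new Omega, matching the scheme θ_{n+1} = θ_n + h ω_{n+1}.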
Loss of significance errors

This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.

Example. Define

f(x) = x (√(x+1) − √x)

and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic. At x = 100, for instance, the calculator gives √101 = 10.0499 and √100 = 10.0000, so that f(100) = 100 × 0.0499000 = 4.99000, whereas the true value is f(100) = 4.98756. There are only 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?
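The same cancellation appears in double precision for large x; rewriting f as x/(√(x+1) + √x), which is algebraically identical but avoids the subtraction, is a standard fix. This comparison is our own sketch:

```python
import math

def f_naive(x):
    # x*(sqrt(x+1) - sqrt(x)): subtracts two nearly equal numbers
    return x * (math.sqrt(x + 1) - math.sqrt(x))

def f_stable(x):
    # algebraically identical, but with no cancellation
    return x / (math.sqrt(x + 1) + math.sqrt(x))

x = 1.0e12
print(f_naive(x))    # loses digits in the subtraction
print(f_stable(x))   # accurate: about 499999.999999875
```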
Loss of significance errors

Another example: consider f(x) = (1 − cos x)/x² in one case, that of x = 0.001. Then on the calculator:

cos(0.001) = 0.9999994999
1 − cos(0.001) = 5.001 × 10^(−7)
(1 − cos(0.001))/(0.001)² = 0.5001000000

The true answer is

f(0.001) = 0.4999999583

The relative error in our answer is

(0.4999999583 − 0.5001)/0.4999999583 = −0.0001000417/0.4999999583 ≈ −0.0002

There are 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?
Loss of significance errors

When two numbers are nearly equal and we subtract them, then we suffer a "loss of significance error" in the calculation. In some cases, these can be quite subtle and difficult to detect. And even after they are detected, they may be difficult to fix. The last example, fortunately, can be fixed in a number of ways. Easiest is to use a trigonometric identity:

cos(2θ) = 2 cos²(θ) − 1 = 1 − 2 sin²(θ)

Let x = 2θ. Then

f(x) = (1 − cos x)/x² = 2 sin²(x/2)/x² = (1/2) (sin(x/2)/(x/2))²

This latter formula, with x = 0.001, yields a computed value of 0.4999999584, nearly the true answer. We could also have used a Taylor polynomial for cos(x) around x = 0 to obtain a better approximation to f(x) for small values of x.
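The fix can be verified in double precision as well; this sketch (our own) evaluates both formulas near x = 0, where the true value approaches 1/2:

```python
import math

def f_naive(x):
    return (1 - math.cos(x)) / x**2

def f_stable(x):
    s = math.sin(x / 2) / (x / 2)
    return 0.5 * s * s            # (1/2)(sin(x/2)/(x/2))^2

x = 1.0e-7
print(f_naive(x))    # contaminated by cancellation in 1 - cos x
print(f_stable(x))   # essentially the true limit 0.5
```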
Another example
Evaluate e^(−5) using a Taylor polynomial approximation:

e^(−5) ≈ 1 + (−5)/1! + (−5)²/2! + (−5)³/3! + (−5)⁴/4! + (−5)⁵/5! + (−5)⁶/6! + ···

With n = 25, the error is

|(−5)^26 / 26! · e^c| ≤ 10^(−8)

Imagine calculating this polynomial using a computer with 4-digit decimal arithmetic and rounding. To make the point about cancellation more strongly, imagine that each of the terms in the above polynomial is calculated exactly and then rounded to the arithmetic of the computer. We add the terms exactly and then we round to four digits.
To understand more fully the source of the error, look at the numbers being added and their accuracy. For example,

(−5)³/3! = −125/6 = −20.83

in the 4-digit decimal calculation, with an error of magnitude 0.00333. Note that this error in an intermediate step is of the same magnitude as the true answer 0.006738 being sought. Other similar errors are present in calculating the other terms, and thus they cause a major error in the final answer being calculated.

General principle
Whenever a sum is being formed in which the final answer is much smaller than some of the terms being combined, then a loss of significance error is occurring.
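The thought experiment above can be simulated: round each Taylor term to 4 significant digits, add exactly, and compare with the cancellation-free route e^(−5) = 1/e^5. The helper round4 is our own (Python's round approximates decimal rounding, which is good enough for the demonstration):

```python
import math

def round4(t):
    # Round t to 4 significant decimal digits.
    if t == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(t)))
    return round(t, 3 - e)

# Sum the 4-digit-rounded Taylor terms for e^-5.
s = sum(round4((-5.0)**k / math.factorial(k)) for k in range(26))
print(s)                              # far from e^-5 = 0.006738: cancellation
print(1.0 / round4(math.exp(5.0)))    # close to e^-5: no subtraction of near-equals
```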
Noise in function evaluation
Whenever a function f(x) is evaluated, there are arithmetic operations carried out which involve rounding or chopping errors. This means that what the computer eventually returns as an answer contains noise. This noise is generally "random" and small. But it can affect the accuracy of other calculations which depend on f(x).
Underflow errors

Consider evaluating f(x) = x^10 for x near 0. When using IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point format is

m = 2^(−126) ≈ 1.18 × 10^(−38)

Thus f(x) will be set to zero if

x^10 < m
|x| < m^(1/10)
|x| < 1.61 × 10^(−4)
−0.000161 < x < 0.000161
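These numbers can be checked in double precision arithmetic, which is wide enough to hold x^10 here; the sketch is our own:

```python
# The underflow window for f(x) = x^10 in single precision.
m = 2.0**-126                 # smallest normalized positive single
threshold = m ** 0.1          # |x| below this makes x^10 underflow
print(threshold)              # about 1.61e-4
print((1.0e-4)**10 < m)       # True: f(0.0001) would be set to 0
print((2.0e-4)**10 < m)       # False: still representable
```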
Overflow errors

Attempts to use numbers that are too large for the floating-point format lead to overflow errors. These are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can be carried along as having the value $\pm\infty$ or NaN, depending on the context. Usually an overflow error is an indication of a more significant problem or error in the program, and the user needs to be aware of such errors.

When using IEEE single precision arithmetic, the largest positive number expressible in normalized floating-point format is
$$M = 2^{128}\left(1 - 2^{-24}\right) \approx 3.40 \times 10^{38}.$$
Thus $f(x)$ will overflow if
$$x^{10} > M, \qquad \text{i.e.} \quad |x| > M^{1/10} \approx 7131.6.$$
Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 5
Loss of significance errors

This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.

Example. Define
$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right)$$
and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic. To localize the error, consider the case $x = 100$. The calculator with 6 decimal digits gives
$$\sqrt{100} = 10.0000, \qquad \sqrt{101} = 10.0499.$$
Then
$$\sqrt{x+1} - \sqrt{x} = \sqrt{101} - \sqrt{100} = 0.0499000,$$
while the exact value is $0.0498756$. Three significant digits in $\sqrt{x+1} = \sqrt{101}$ have been lost in the subtraction of $\sqrt{x} = \sqrt{100}$. The loss of precision is due to the form of the function $f(x)$ and the finite precision of the 6-digit calculator.
Loss of significance errors

In this particular case we can avoid the loss of precision by rewriting the function:
$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right) \cdot \frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} = \frac{x}{\sqrt{x+1} + \sqrt{x}},$$
which avoids the subtraction of nearly equal quantities. Doing so gives
$$f(100) = 4.98756,$$
a value with 6 significant digits.
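The 6-digit calculator is easy to imitate by rounding every intermediate result to 6 significant digits. A sketch in Python; the round6 helper is my own stand-in for the calculator's arithmetic:

```python
def round6(v):
    # Round v to 6 significant digits, imitating a 6-digit calculator
    return float(f"{v:.6g}")

x = 100.0
# Original form: subtraction of nearly equal square roots
naive = round6(x * round6(round6((x + 1) ** 0.5) - round6(x ** 0.5)))
# Rewritten form: the subtraction is gone
stable = round6(x / round6(round6((x + 1) ** 0.5) + round6(x ** 0.5)))

print(naive)    # 4.99     (three significant digits lost)
print(stable)   # 4.98756  (all six digits correct)
```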
Propagation of errors: arithmetic operations

Let $\omega$ denote an arithmetic operation such as $+$, $-$, $\times$, or $/$. Let $\omega^*$ denote the same arithmetic operation as it is actually carried out in the computer, including rounding or chopping error. Let $x_A \approx x_T$ and $y_A \approx y_T$. We want to obtain $x_T \,\omega\, y_T$, but we actually compute $x_A \,\omega^*\, y_A$. The error in $x_A \,\omega^*\, y_A$ is given by
$$(x_T \,\omega\, y_T) - (x_A \,\omega^*\, y_A).$$
Propagation of errors: arithmetic operations

The error in $x_A \,\omega^*\, y_A$ can be rewritten as
$$(x_T \,\omega\, y_T) - (x_A \,\omega^*\, y_A) = (x_T \,\omega\, y_T - x_A \,\omega\, y_A) + (x_A \,\omega\, y_A - x_A \,\omega^*\, y_A).$$
The final term is the error introduced by the inexactness of the machine arithmetic. For it, we usually assume
$$x_A \,\omega^*\, y_A = \mathrm{fl}(x_A \,\omega\, y_A).$$
This means that the quantity $x_A \,\omega\, y_A$ is computed exactly and is then rounded or chopped to fit the answer into the floating-point representation of the machine.
Propagation of errors: arithmetic operations

The formula $x_A \,\omega^*\, y_A = \mathrm{fl}(x_A \,\omega\, y_A)$ implies
$$x_A \,\omega^*\, y_A = (x_A \,\omega\, y_A)(1 + \varepsilon),$$
since $\mathrm{fl}(x) = x(1 + \varepsilon)$, where limits for $\varepsilon$ were given earlier. Then
$$\mathrm{Rel}(x_A \,\omega^*\, y_A) = \frac{x_A \,\omega\, y_A - x_A \,\omega^*\, y_A}{x_A \,\omega\, y_A} = \frac{x_A \,\omega\, y_A - (x_A \,\omega\, y_A)(1 + \varepsilon)}{x_A \,\omega\, y_A} = -\varepsilon.$$
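The model $\mathrm{fl}(x) = x(1+\varepsilon)$ can be observed directly: compute one product exactly with rational arithmetic and compare it with the machine product. A sketch using float32 and Python's fractions module (the operand values are arbitrary choices):

```python
import numpy as np
from fractions import Fraction

u = 2.0 ** -24                       # |eps| bound for IEEE single precision

x = np.float32(1.0) / 3
y = np.float32(1.0) / 10

exact = Fraction(float(x)) * Fraction(float(y))  # exact product of the stored values
machine = Fraction(float(x * y))                 # the machine product x "omega-star" y

# eps in:  machine product = (exact product)(1 + eps)
eps = float((machine - exact) / exact)
print(abs(eps) <= u)                 # True
```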
Propagation of errors: arithmetic operations

With rounded binary arithmetic having $n$ digits in the mantissa,
$$-2^{-n} \le \varepsilon \le 2^{-n}.$$
Returning to the error formula
$$(x_T \,\omega\, y_T) - (x_A \,\omega^*\, y_A) = (x_T \,\omega\, y_T - x_A \,\omega\, y_A) + \underbrace{(x_A \,\omega\, y_A - x_A \,\omega^*\, y_A)}_{\text{relative error is } -\varepsilon},$$
the first term,
$$x_T \,\omega\, y_T - x_A \,\omega\, y_A,$$
is the propagated error. In what follows we examine it for particular cases.
Propagation of errors: multiplication

Let $\omega = \times$. Write
$$x_T = x_A + \xi, \qquad y_T = y_A + \eta.$$
Then for the relative error in $x_A y_A$,
$$\mathrm{Rel}(x_A y_A) = \frac{x_T y_T - x_A y_A}{x_T y_T} = \frac{x_T y_T - (x_T - \xi)(y_T - \eta)}{x_T y_T} = \frac{x_T \eta + y_T \xi - \xi\eta}{x_T y_T}$$
$$= \frac{\xi}{x_T} + \frac{\eta}{y_T} - \frac{\xi}{x_T}\cdot\frac{\eta}{y_T} = \mathrm{Rel}(x_A) + \mathrm{Rel}(y_A) - \mathrm{Rel}(x_A)\,\mathrm{Rel}(y_A).$$
Propagation of errors: multiplication

Usually we have
$$|\mathrm{Rel}(x_A)| \ll 1, \qquad |\mathrm{Rel}(y_A)| \ll 1,$$
so we can neglect the last term $\mathrm{Rel}(x_A)\,\mathrm{Rel}(y_A)$, since it is much smaller than the previous two. Thus small relative errors in the arguments $x_A$ and $y_A$ lead to a small relative error in the product $x_A y_A$. Also, note that there is some cancellation if these relative errors are of opposite sign.
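The identity just derived can be checked numerically; a small sketch with arbitrary sample values:

```python
def rel(true, approx):
    # Relative error as used in the text: Rel(a) = (a_T - a_A) / a_T
    return (true - approx) / true

xT, yT = 3.14159265, 2.71828183     # "true" values (arbitrary sample)
xA, yA = 3.1416, 2.7183             # approximations carrying small errors

lhs = rel(xT * yT, xA * yA)
rhs = rel(xT, xA) + rel(yT, yA) - rel(xT, xA) * rel(yT, yA)

print(abs(lhs - rhs))                    # ~ 0: the identity is exact
print(lhs, rel(xT, xA) + rel(yT, yA))    # and the sum alone is a close estimate
```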
Propagation of errors: division

There is a similar result for division:
$$\mathrm{Rel}\!\left(\frac{x_A}{y_A}\right) \approx \mathrm{Rel}(x_A) - \mathrm{Rel}(y_A),$$
provided $|\mathrm{Rel}(y_A)| \ll 1$.
Propagation of errors: addition and subtraction

For $\omega$ equal to $+$ or $-$, we have
$$(x_T \pm y_T) - (x_A \pm y_A) = (x_T - x_A) \pm (y_T - y_A).$$
Thus the error in a sum is the sum of the errors in the original arguments, and similarly for subtraction. However, there is a more subtle error occurring here.
Propagation of errors: example

Suppose you are solving
$$x^2 - 26x + 1 = 0.$$
Using the quadratic formula, we have the true roots
$$r_{1,T} = 13 + \sqrt{168}, \qquad r_{2,T} = 13 - \sqrt{168}.$$
From a table of square roots we take $\sqrt{168} \approx 12.961$. Since this is correctly rounded to 5 digits, we have
$$\left|\sqrt{168} - 12.961\right| \le 0.0005.$$
Then define
$$r_{1,A} = 13 + 12.961 = 25.961, \qquad r_{2,A} = 13 - 12.961 = 0.039.$$
Propagation of errors: example

Then for both roots,
$$|r_T - r_A| \le 0.0005.$$
For the relative errors, however,
$$\mathrm{Rel}(r_{1,A}) = \frac{r_{1,T} - r_{1,A}}{r_{1,T}} \le \frac{0.0005}{25.9615} \approx 1.93 \times 10^{-5},$$
$$\mathrm{Rel}(r_{2,A}) = \frac{r_{2,T} - r_{2,A}}{r_{2,T}} \le \frac{0.0005}{0.0385} \approx 0.0130.$$
Why does $r_{2,A}$ have such poor accuracy in comparison with $r_{1,A}$?
Propagation of errors: example

The answer is the loss-of-significance error in the formula used to calculate $r_{2,A}$. Instead, use the mathematically equivalent formula (the product of the roots equals the constant term, $169 - 168 = 1$, so $r_2 = 1/r_1$):
$$r_{2,A} = \frac{1}{13 + \sqrt{168}} \approx \frac{1}{25.961}.$$
This results in a much more accurate answer, at the expense of an additional division.
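A sketch comparing the two formulas with the same 5-digit value of $\sqrt{168}$:

```python
sqrt168 = 12.961                    # correctly rounded to 5 digits

r2_bad = 13 - sqrt168               # cancellation: 0.039
r2_good = 1 / (13 + sqrt168)        # no cancellation: ~ 0.0385193

true_r2 = 13 - 168 ** 0.5           # ~ 0.0385186
print(abs(true_r2 - r2_bad) / true_r2)    # ~ 1.2e-2
print(abs(true_r2 - r2_good) / true_r2)   # ~ 1.9e-5
```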
Propagation of errors: errors in function evaluation

Suppose we are evaluating a function $f(x)$ in the machine. Then the result is generally not $f(x)$, but rather an approximation to it, which we denote by $\tilde{f}(x)$. Now suppose that we have a number $x_A \approx x_T$. We want to calculate $f(x_T)$, but instead we evaluate $\tilde{f}(x_A)$. What can we say about the error in this computed quantity,
$$f(x_T) - \tilde{f}(x_A)?$$
Propagation of errors: errors in function evaluation

$$f(x_T) - \tilde{f}(x_A) = \left[f(x_T) - f(x_A)\right] + \left[f(x_A) - \tilde{f}(x_A)\right]$$
The quantity $f(x_A) - \tilde{f}(x_A)$ is the "noise" in the evaluation of $f(x_A)$ in the computer, and we will return later to some discussion of it. The quantity $f(x_T) - f(x_A)$ is called the propagated error. It is the error that results from using perfect arithmetic in the evaluation of the function. If the function $f(x)$ is differentiable, then we can use the mean value theorem to write
$$f(x_T) - f(x_A) = f'(\xi)(x_T - x_A)$$
for some $\xi$ between $x_T$ and $x_A$.
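A quick numerical illustration of the mean value theorem estimate, with $f = \sin$ as an arbitrary differentiable example:

```python
import math

f, fprime = math.sin, math.cos      # sample differentiable function

xT, xA = 1.0, 1.0001                # nearby argument values

actual = f(xT) - f(xA)              # exact propagated error
estimate = fprime(xA) * (xT - xA)   # mean value theorem estimate

# The two differ only by O((xT - xA)**2)
print(actual, estimate)
```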
Propagation of errors: errors in function evaluation

Since $x_T$ and $x_A$ are usually close together, $\xi$ is close to either of them, and
$$f(x_T) - f(x_A) = f'(\xi)(x_T - x_A) \approx f'(x_T)(x_T - x_A) \approx f'(x_A)(x_T - x_A).$$
Propagation of errors: example

Define $f(x) = b^x$, where $b$ is a positive real number. Then the last formula yields
$$b^{x_T} - b^{x_A} \approx (\ln b)\, b^{x_T} (x_T - x_A),$$
so
$$\mathrm{Rel}(b^{x_A}) \approx \frac{(\ln b)\, b^{x_T} (x_T - x_A)}{b^{x_T}} = (x_T \ln b)\,\frac{x_T - x_A}{x_T} = K \cdot \mathrm{Rel}(x_A), \qquad K = x_T \ln b.$$
Note that if $K = 10^4$ and $\mathrm{Rel}(x_A) = 10^{-7}$, then $\mathrm{Rel}(b^{x_A}) \approx 10^{-3}$. This is a large decrease in accuracy, and it is independent of how we actually calculate $b^x$. The number $K$ is called a condition number for the computation.
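The amplification by $K = x_T \ln b$ can be observed directly. A sketch with $b = 10$ and $x_T = 100$ (values chosen so that $b^x$ stays within double-precision range), giving $K \approx 230$:

```python
import math

b, xT = 10.0, 100.0
xA = xT * (1 - 1e-7)                # Rel(x_A) = 1e-7

K = xT * math.log(b)                # condition number, ~ 230.26

rel_x = (xT - xA) / xT              # 1e-7
rel_f = (b ** xT - b ** xA) / b ** xT   # observed relative error in b**x

print(K * rel_x)                    # ~ 2.30e-5: predicted
print(rel_f)                        # ~ 2.30e-5: observed
```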
Summation

Let S be a sum with a relatively large number of terms

    S = a1 + a2 + ... + an    (1)

where aj, j = 1, ..., n, are floating point numbers. The summation process consists of n − 1 consecutive additions

    S = (···((a1 + a2) + a3) + ··· + a_{n−1}) + a_n

Define

    S2 = fl(a1 + a2)
    S3 = fl(S2 + a3)
    S4 = fl(S3 + a4)
    ...
    Sn = fl(S_{n−1} + an)

Recall the formula fl(x) = x(1 + ε).
From the last relation we can establish the strategy for summation in order to minimize the error S − Sn: initially rearrange the terms in increasing order

    |a1| ≤ |a2| ≤ |a3| ≤ ... ≤ |an|

In this case the smaller numbers a1 and a2 are multiplied by the larger factor ε2 + ... + εn, while the largest number an is multiplied only by the smallest factor εn.
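The effect of ordering can be seen in an extreme case: one large term followed by many terms that are individually too small to register against it. A sketch with artificial data:

```python
import math

# Adding many tiny terms to an already-large partial sum loses them entirely;
# summing in increasing order accumulates the tiny terms first.
terms = [1.0] + [1e-16] * 10000       # exact sum is 1 + 1e-12

exact = math.fsum(terms)              # correctly rounded reference sum

forward = 0.0                         # decreasing order: the 1.0 comes first
for t in terms:
    forward += t                      # every 1e-16 is rounded away against 1.0

increasing = 0.0                      # the recommended increasing order
for t in sorted(terms):
    increasing += t                   # tiny terms accumulate before 1.0 is added

print(forward, increasing, exact)
```

The forward sum returns exactly 1.0, losing all 10000 small terms; the increasing-order sum recovers them.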
Rootfinding

We want to find the numbers x for which

    f(x) = 0

with f : [a, b] → R a given real-valued function. Here, we denote such roots or zeroes by the Greek letter α. So

    f(α) = 0

Rootfinding problems occur in many contexts. Sometimes they are a direct formulation of some physical situation, but more often they are an intermediate step in solving a much larger problem.
Bisection method

Most methods for solving f(x) = 0 are iterative methods. This means that, given an initial guess x0, such a method will provide us with a sequence of consecutively computed solutions x1, x2, x3, ..., xn, ... such that xn → α.

We begin with the simplest of such methods, one which most people use at some time.

Suppose we are given a function f(x) and we assume we have an interval [a, b] containing the root, on which the function is continuous.

We also assume we are given an error tolerance ε > 0, and we want an approximate root α̃ ∈ [a, b] for which

    |α − α̃| < ε
Bisection method

The bisection method is based on the following theorem:

Theorem. If f : [a, b] → R is a continuous function on the closed and bounded interval [a, b] and

    f(a) · f(b) < 0,

then there exists α ∈ [a, b] such that f(α) = 0.

Therefore, we further assume that the function f(x) changes sign on [a, b].
Bisection method

Bisection Algorithm: Bisect(f, a, b, ε)

Step 1: Define

    c = (a + b)/2

Step 2: If b − c ≤ ε, accept c as our root, and then stop.

Step 3: If b − c > ε, then compare the sign of f(c) to that of f(a) and f(b). If

    sign(f(b)) · sign(f(c)) ≤ 0

then replace a with c; otherwise, replace b with c. Return to Step 1.

Note that we prefer the condition sign(f(b)) · sign(f(c)) ≤ 0 to the mathematically equivalent f(b) · f(c) ≤ 0: the product f(b) · f(c) can underflow or overflow in floating point even when the signs are perfectly well defined.
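The three steps above can be sketched directly in Python. This is a minimal sketch of the Bisect(f, a, b, ε) procedure, assuming f is continuous on [a, b] with f(a) · f(b) < 0:

```python
import math

def bisect(f, a, b, eps):
    """Bisection sketch of Bisect(f, a, b, eps).
    Assumes f continuous on [a, b] and f(a) * f(b) < 0."""
    while True:
        c = (a + b) / 2.0                # Step 1: midpoint
        if b - c <= eps:                 # Step 2: |alpha - c| <= b - c <= eps
            return c
        # Step 3: compare signs of the factors, not the sign of the
        # product f(b)*f(c), which could underflow or overflow
        if math.copysign(1.0, f(b)) * math.copysign(1.0, f(c)) <= 0:
            a = c                        # sign change in [c, b]
        else:
            b = c                        # sign change in [a, c]

# The slides' example: largest root of f(x) = x^6 - x - 1 in [1, 2]
root = bisect(lambda x: x**6 - x - 1, 1.0, 2.0, 0.001)
```

With ε = 0.001 this returns c ≈ 1.1348, within the guaranteed tolerance of the true root α ≈ 1.134724.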
Bisection method

[Figure: one step of bisection. The root α of y = f(x) lies in the bracket [a1, b1] with b1 = b; the midpoint c1 = a2 becomes an endpoint of the halved bracket, which is bisected again at c2.]
Bisection method: Example

Consider the function

    f(x) = x^6 − x − 1

We want to find the largest root with accuracy ε = 0.001. It can be seen from the graph of the function that the root is located in [1, 2]. Also, note that the function is continuous. Let a = 1 and b = 2; then f(a) = −1 and f(b) = 61, so the function changes sign and thus all conditions are satisfied.
Error analysis for bisection method

Let a_n, b_n and c_n be the values provided by the bisection method at iteration n. Evidently,

    b_{n+1} − a_{n+1} = (1/2)(b_n − a_n)

so

    b_n − a_n = (1/2)(b_{n−1} − a_{n−1}) = (1/2^2)(b_{n−2} − a_{n−2}) = ... = (1/2^{n−1})(b − a)

Since either α ∈ [a_n, c_n] or α ∈ [c_n, b_n], we have

    |α − c_n| ≤ c_n − a_n = b_n − c_n = (1/2)(b_n − a_n) = (1/2^n)(b − a)
Error analysis for bisection method

    |α − c_n| ≤ (1/2^n)(b − a)

This relation provides us with a stopping criterion for the bisection method. Moreover, it follows that c_n → α as n → ∞.

Suppose we want to estimate the number of iterations in the bisection method necessary to find the root with an error tolerance ε, that is, |α − c_n| ≤ ε. This is guaranteed when

    (1/2^n)(b − a) ≤ ε,    i.e.    n ≥ ln((b − a)/ε) / ln 2

For the previous example we get

    n ≥ ln(1/0.001) / ln 2 ≈ 9.97
Advantages and Disadvantages of Bisection method

Advantages:

1. It always converges.
2. You have a guaranteed error bound, and it decreases with each successive iteration.
3. You have a guaranteed rate of convergence. The error bound decreases by 1/2 with each iteration.

Disadvantages:

1. It is relatively slow when compared with other rootfinding methods we will study, especially when the function f(x) has several continuous derivatives about the root α.
2. The algorithm has no check to see whether ε is too small for the computer arithmetic being used.

We also assume the function f(x) is continuous on the given interval [a, b]; but there is no way for the computer to confirm this.
Rootfinding

We want to find the root α of a given function f(x). Thus we want to find the point x at which the graph of y = f(x) intersects the x-axis. One of the principles of numerical analysis is the following.

Numerical Analysis Principle

If you cannot solve the given problem, then solve a "nearby problem".

How do we obtain a nearby problem for f(x) = 0? Begin first by asking for types of problems which we can solve easily. At the top of the list should be finding where a straight line intersects the x-axis. Thus we seek to replace f(x) = 0 by p(x) = 0 for some linear polynomial p(x) that approximates f(x) in the vicinity of the root α.
Rootfinding

[Figure: the tangent line to y = f(x) at (x0, f(x0)) intersects the x-axis at x1, which lies closer to the root α than x0.]
Newton's method

Let x0 be an initial guess, sufficiently close to the root α. Consider the tangent line to the graph of f(x) at (x0, f(x0)). The tangent intersects the x-axis at x1, a point closer to α. The tangent has equation

    p1(x) = f(x0) + f′(x0)(x − x0)

Since p1(x1) = 0 we get

    f(x0) + f′(x0)(x1 − x0) = 0
    x1 = x0 − f(x0)/f′(x0)

Similarly, we get x2:

    x2 = x1 − f(x1)/f′(x1)
Newton's method

Repeat this process to obtain the sequence x1, x2, x3, ... that hopefully will converge to α.

The general scheme for Newton's method: starting with an initial guess x0, compute iteratively

    x_{n+1} = x_n − f(x_n)/f′(x_n),    n = 0, 1, 2, ...
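The general scheme translates into a few lines of Python. This is a sketch; the stopping rule (step size below a tolerance) and the iteration cap are illustrative choices, not part of the slides:

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n).
    tol and max_iter are illustrative stopping choices."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)       # assumes df(x) != 0 at the iterates
        x -= step
        if abs(step) <= tol:
            break
    return x

# The earlier example f(x) = x^6 - x - 1, started inside the bracket [1, 2]
root = newton(lambda x: x**6 - x - 1, lambda x: 6*x**5 - 1, 1.5)
```

From x0 = 1.5 the iterates reach α ≈ 1.134724 in a handful of steps, far fewer than the 10 iterations bisection needed for three digits.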
Newton's method. Division example

Here we consider a division algorithm (based on Newton's method) implemented in some computers in the past. Say we are interested in computing a/b = a · (1/b), where 1/b is computed using Newton's method. Consider

    f(x) ≡ b − 1/x = 0,

with b positive. The root of this equation is α = 1/b. Since

    f′(x) = 1/x^2,

Newton's method for this problem becomes

    x_{n+1} = x_n − (b − 1/x_n) / (1/x_n^2)

Simplifying,

    x_{n+1} = x_n(2 − b·x_n),    n ≥ 0
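The simplified iteration uses only multiplications and subtractions, which is exactly why it suited hardware without a divide unit. A sketch; the starting guess and iteration count are illustrative assumptions:

```python
def reciprocal(b, x0, n_iter=8):
    """Division-free Newton iteration for 1/b:  x_{n+1} = x_n (2 - b x_n).
    x0 is assumed to be a rough positive estimate of 1/b; at quadratic
    convergence, n_iter = 8 is more than enough for double precision."""
    x = x0
    for _ in range(n_iter):
        x = x * (2.0 - b * x)   # only multiplies and subtracts
    return x

# a / b computed without a division instruction: a * (1/b)
a, b = 10.0, 3.0
quotient = a * reciprocal(b, 0.3)
```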
Newton's method. Division example
The initial guess x_0 must be close enough to the true solution, and of course x_0 > 0. Consider the error

    α − x_{n+1} = 1/b − x_{n+1}
                = (1 − b x_{n+1}) / b
                = (1 − b x_n (2 − b x_n)) / b
                = (1 − b x_n)² / b

On the other hand,

    Rel(x_{n+1}) = (α − x_{n+1}) / α = 1 − b x_{n+1}
Newton's method. Division example
It can be shown (try it!) that

    Rel(x_{n+1}) = (Rel(x_n))²

In order to guarantee convergence x_n → α, we need

    |Rel(x_0)| < 1,   i.e.   0 < x_0 < 2/b

For example, suppose that |Rel(x_0)| = 0.1. Then

    Rel(x_1) = 10⁻²,  Rel(x_2) = 10⁻⁴,  Rel(x_3) = 10⁻⁸,  Rel(x_4) = 10⁻¹⁶
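The doubling of correct digits is easy to observe numerically (a small check; the helper name is illustrative):

```python
def rel(x, b):
    """Relative error of x as an approximation to 1/b: Rel(x) = 1 - b*x."""
    return 1.0 - b * x

b, x = 3.0, 0.3            # Rel(x0) = 0.1
errors = []
for _ in range(4):
    x = x * (2.0 - b * x)  # one Newton step toward 1/b
    errors.append(rel(x, b))
print(errors)  # approximately [1e-2, 1e-4, 1e-8, 1e-16]
```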
Newton's method. Division example
[Figure: Newton iteration on y = b − 1/x. The root is 1/b; the tangent at (x_0, f(x_0)) yields x_1, and convergence requires 0 < x_0 < 2/b.]
Error analysis for Newton's method
Let f ∈ C²[a, b] and α ∈ [a, b]. Also let f'(α) ≠ 0. Consider the Taylor formula for f(x) about x_n:

    f(x) = f(x_n) + (x − x_n) f'(x_n) + ((x − x_n)² / 2) f''(ξ_n),

where ξ_n is between x and x_n. Take x = α to get

    f(α) = f(x_n) + (α − x_n) f'(x_n) + ((α − x_n)² / 2) f''(ξ_n),

with ξ_n between α and x_n. Since f(α) = 0, dividing by f'(x_n) gives

    0 = f(x_n)/f'(x_n) + (α − x_n) + (α − x_n)² f''(ξ_n) / (2 f'(x_n))

and, recalling x_{n+1} = x_n − f(x_n)/f'(x_n), this rearranges to

    α − x_{n+1} = (α − x_n)² · ( −f''(ξ_n) / (2 f'(x_n)) )
Error analysis for Newton's method
For the previous example, f(x) = x⁶ − x − 1, so f''(x) = 30x⁴. We have

    −f''(ξ_n) / (2 f'(x_n)) ≈ −f''(α) / (2 f'(α)) = −30α⁴ / (2(6α⁵ − 1)) ≈ −2.42

Therefore

    α − x_{n+1} ≈ −2.42 (α − x_n)²

For example, if n = 3 we get α − x_3 ≈ −4.73e−03 and

    α − x_4 ≈ −2.42 (α − x_3)² ≈ −5.42e−05,

a result in accordance with the result presented in the table: α − x_4 ≈ −5.35e−05.
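A quick numerical check of this asymptotic constant (the accurate root is computed here rather than quoted from the table):

```python
f = lambda x: x**6 - x - 1
df = lambda x: 6 * x**5 - 1
d2f = lambda x: 30 * x**4

# Get a highly accurate root by iterating Newton to convergence.
alpha = 1.5
for _ in range(50):
    alpha -= f(alpha) / df(alpha)

M = -d2f(alpha) / (2 * df(alpha))
print(M)  # approximately -2.42
```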
Error analysis for Newton's method
If the iterate x_n is close to α, we have

    −f''(ξ_n) / (2 f'(x_n)) ≈ −f''(α) / (2 f'(α)) ≡ M

    α − x_{n+1} ≈ M (α − x_n)²

    M (α − x_{n+1}) ≈ (M (α − x_n))²

Inductively,

    M (α − x_n) ≈ (M (α − x_0))^(2ⁿ),   n ≥ 0

In other words, in order to guarantee the convergence of Newton's method we should have

    |M (α − x_0)| < 1,   i.e.   |α − x_0| < 1/|M| = |2 f'(α) / f''(α)|
For x_n close to α, and therefore ξ_n also close to α, we have

    α − x_{n+1} ≈ (−f''(α) / (2 f'(α))) (α − x_n)²

Thus Newton's method is quadratically convergent, provided f'(α) ≠ 0 and f(x) is twice differentiable in the vicinity of the root α.
We can also use this to explore the 'interval of convergence' of Newton's method. Write the above as

    α − x_{n+1} ≈ M (α − x_n)²,   M = −f''(α) / (2 f'(α))

Multiply both sides by M to get

    M (α − x_{n+1}) ≈ [M (α − x_n)]²

Then we want these quantities to decrease, and this suggests choosing x_0 so that

    |M (α − x_0)| < 1,   i.e.   |α − x_0| < 1/|M| = |2 f'(α) / f''(α)|

If |M| is very large, then we may need a very good initial guess in order to have the iterates x_n converge to α.
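A basic Newton iteration can be sketched as follows (a minimal version; the stopping test and names are illustrative, not from the lecture):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's method: x_{n+1} = x_n - f(x_n)/df(x_n).

    Stops when successive iterates agree to within tol, or fails after
    max_iter steps (e.g. a poor x0 outside the interval of convergence).
    """
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")

# The running example f(x) = x^6 - x - 1:
root = newton(lambda x: x**6 - x - 1, lambda x: 6 * x**5 - 1, 1.5)
print(root)  # approximately 1.134724
```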
ADVANTAGES & DISADVANTAGES

Advantages:
1. It is rapidly convergent in most cases.
2. It is simple in its formulation, and therefore relatively easy to apply and program.
3. It is intuitive in its construction, which makes it easier to understand when it is likely to behave well and when it may behave poorly.

Disadvantages:
1. It may not converge.
2. It is likely to have difficulty if f'(α) = 0. This condition means the x-axis is tangent to the graph of y = f(x) at x = α.
3. It needs both f(x) and f'(x). Contrast this with the bisection method, which requires only f(x).
THE SECANT METHOD

Newton's method was based on using the line tangent to the curve of y = f(x), with the point of tangency (x0, f(x0)). When x0 ≈ α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We then used the root of the tangent line to approximate α.

Consider instead using an approximating line based on 'interpolation'. We assume we have two estimates of the root α, say x0 and x1. Then we produce a linear function

    q(x) = a0 + a1 x

with

    q(x0) = f(x0),   q(x1) = f(x1)     (*)

This line is sometimes called a secant line. Its equation is given by

    q(x) = [(x1 − x) f(x0) + (x − x0) f(x1)] / (x1 − x0)
[Figures: two configurations of the secant line through (x0, f(x0)) and (x1, f(x1)) on y = f(x); its root x2 approximates α.]
This q(x) is linear in x, and by direct evaluation it satisfies the interpolation conditions (*). We now solve the equation q(x) = 0, denoting the root by x2. This yields

    x2 = x1 − f(x1) · (x1 − x0) / (f(x1) − f(x0))

We can now repeat the process: use x1 and x2 to produce another secant line, and then use its root to approximate α. This yields the general iteration formula

    x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})),   n = 1, 2, 3, ...
This is called the secant method for solving f(x) = 0.
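A sketch of the secant iteration (function name and stopping test are illustrative):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: Newton's formula with f'(x_n) replaced by the
    difference quotient (f(x_n) - f(x_{n-1})) / (x_n - x_{n-1})."""
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    raise RuntimeError("secant method did not converge")

# The same running example, f(x) = x^6 - x - 1:
root = secant(lambda x: x**6 - x - 1, 1.0, 2.0)
print(root)  # approximately 1.134724
```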
Example. We solve the equation

    f(x) ≡ x⁶ − x − 1 = 0,

which was used previously as an example for both the bisection and Newton methods. The quantity x_n − x_{n−1} is used as an estimate of α − x_{n−1}. The iterate x_8 equals α rounded to nine significant digits. As with Newton's method for this equation, the initial iterates do not converge rapidly. But as the iterates become close to α, the speed of convergence improves markedly.
The above iterations can be written symbolically as

    E1:  x_{n+1} = 1 + 0.5 sin x_n
    E2:  x_{n+1} = 3 + 2 sin x_n

for n = 0, 1, 2, ... Why does one of these iterations converge, but not the other? The graphs show similar behaviour, so why the difference? Consider one more example.
Suppose we are solving the equation

    x² − 5 = 0,

with exact root α = √5 ≈ 2.2361, using iterates of the form

    x_{n+1} = g(x_n)

Consider four different iterations:

    I1:  x_{n+1} = 5 + x_n − x_n²
    I2:  x_{n+1} = 5 / x_n
    I3:  x_{n+1} = 1 + x_n − x_n²/5
    I4:  x_{n+1} = (x_n + 5/x_n) / 2

All of them, in case they are convergent, will converge to α = √5.
In general, we are interested in solving equations

    x = g(x)

by means of fixed point iteration:

    x_{n+1} = g(x_n),   n = 0, 1, 2, ...

It is called 'fixed point iteration' because the root α is a fixed point of the function g(x), meaning that α is a number for which

    g(α) = α
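A generic fixed point iteration is one loop; running it on E1 above shows the convergent case (a sketch with illustrative names):

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=200):
    """Iterate x_{n+1} = g(x_n) until successive iterates agree to tol."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed point iteration did not converge")

# E1 converges: |g'(x)| = |0.5 cos x| <= 0.5 < 1 everywhere.
alpha = fixed_point(lambda x: 1 + 0.5 * math.sin(x), 0.0)
print(alpha)  # the fixed point of x = 1 + 0.5 sin x, about 1.4987
```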
EXISTENCE THEOREM

We begin by asking whether the equation

    x = g(x)

has a solution. For this to occur, the graphs of y = x and y = g(x) must intersect, as seen on the earlier graphs. There are several lemmas and theorems that give conditions under which we are guaranteed there is a fixed point α.

Lemma 1. Let g(x) be a continuous function on the interval [a, b], and suppose it satisfies the property

    a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b

Then the equation x = g(x) has at least one solution α in the interval [a, b].

The proof of this is fairly intuitive. Look at the function f(x) = x − g(x), a ≤ x ≤ b. Evaluating at the endpoints, f(a) ≤ 0 and f(b) ≥ 0. The function f(x) is continuous on [a, b], and therefore by the intermediate value theorem it has a zero in the interval.
Theorem. Assume g(x) and g'(x) exist and are continuous on the interval [a, b]; and further, assume

    a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b     (#)

    λ ≡ max_{a ≤ x ≤ b} |g'(x)| < 1

Then:

S1. The equation x = g(x) has a unique solution α in [a, b].

S2. For any initial guess x0 in [a, b], the iteration

    x_{n+1} = g(x_n),   n = 0, 1, 2, ...

will converge to α.

S3.

    |α − x_n| ≤ (λⁿ / (1 − λ)) |x1 − x0|,   n ≥ 0

S4.

    lim_{n→∞} (α − x_{n+1}) / (α − x_n) = g'(α)

Thus for x_n close to α,

    α − x_{n+1} ≈ g'(α) (α − x_n)
The proof is given in the text, and I go over only a portion of it here. For S2, note that from (#), if x0 is in [a, b], then

    x1 = g(x0)

is also in [a, b]. Repeat the argument to show that

    x2 = g(x1)

belongs to [a, b]. This can be continued by induction to show that every x_n belongs to [a, b].

We need the following general result. By the mean value theorem, for any two points w and z in [a, b],

    g(w) − g(z) = g'(c) (w − z)

for some unknown point c between w and z. Therefore,

    |g(w) − g(z)| ≤ λ |w − z|

for any a ≤ w, z ≤ b.
For S3, subtract x_{n+1} = g(x_n) from α = g(α) to get

    α − x_{n+1} = g(α) − g(x_n) = g'(c_n) (α − x_n)     ($)

    |α − x_{n+1}| ≤ λ |α − x_n|     (*)

with c_n between α and x_n. From (*), the error is guaranteed to decrease by a factor of λ with each iteration. This leads to

    |α − x_n| ≤ λⁿ |α − x_0|,   n ≥ 0

With some extra manipulation, we can obtain the error bound in S3.
For S4, use ($) to write

    (α − x_{n+1}) / (α − x_n) = g'(c_n)

Since x_n → α and c_n is between α and x_n, we have g'(c_n) → g'(α).

The statement

    α − x_{n+1} ≈ g'(α) (α − x_n)

tells us that when near to the root α, the errors decrease by a constant factor of g'(α). If this is negative, then the errors will oscillate between positive and negative, and the iterates will approach from both sides. When g'(α) is positive, the iterates will approach α from only one side.

The statements

    α − x_{n+1} = g'(c_n) (α − x_n)
    α − x_{n+1} ≈ g'(α) (α − x_n)

also tell us a bit more of what happens when

    |g'(α)| > 1

Then the errors will increase as we approach the root, and the iteration will fail to converge to α.
fore, x_n = g(x_{n−1}) can be either convergent or divergent; the numerical results show it divergent.

(I3) g(x) = 1 + x − x²/5;  g'(x) = 1 − 2x/5;  g'(α) = 1 − 2√5/5 ≈ 0.106. Thus the iterates x_n = g(x_{n−1}) converge to √5. Moreover, we have

    |α − x_{n+1}| ≈ 0.106 |α − x_n|,

if x_n is sufficiently close to α. The errors are decreasing with a linear rate of 0.106.

(I4) g(x) = (x + 5/x)/2;  g'(x) = (1 − 5/x²)/2;  g'(α) = 0. The sequence x_n = g(x_{n−1}) will converge to √5 with an order of convergence bigger than 1.
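The contrast among I1-I4 is easy to reproduce (a sketch; the helper name and overflow guard are illustrative):

```python
import math

def run(g, x0=2.5, steps=20):
    """Apply x_{n+1} = g(x_n) a fixed number of times, guarding overflow."""
    x = x0
    for _ in range(steps):
        x = g(x)
        if abs(x) > 1e12:      # I1 blows up within a few steps
            return float("inf")
    return x

alpha = math.sqrt(5)
print(run(lambda x: 5 + x - x * x))                    # I1: diverges
print(run(lambda x: 5 / x))                            # I2: oscillates, never settles
print(abs(run(lambda x: 1 + x - x * x / 5) - alpha))   # I3: small (linear rate 0.106)
print(abs(run(lambda x: (x + 5 / x) / 2) - alpha))     # I4: tiny (order > 1)
```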
Sometimes it is difficult to express the equation f(x) = 0 in the form x = g(x) such that the resulting iterates will converge. Such a process is presented in the following examples.

Example 1. Let x⁴ − x − 1 = 0, rewritten as

    x = (1 + x)^(1/4),

which provides us with the iterations

    x0 = 1,   x_{n+1} = (1 + x_n)^(1/4),   n ≥ 1

This sequence will converge to α ≈ 1.2207.

Example 2. Let x³ + x − 1 = 0, rewritten as

    x = 1 / (1 + x²),

and its fixed point iterations

    x0 = 1,   x_{n+1} = 1 / (1 + x_n²),   n ≥ 1

that will converge to α ≈ 0.6823. The iterations are represented graphically in the following figure.
[Figure: cobweb diagram of the iterates x0, x1, x2, x3 for x_{n+1} = 1/(1 + x_n²), converging to α = 0.6823 at the intersection of y = x and y = g(x).]
[Figures: four cobweb diagrams for x_{n+1} = g(x_n) near a fixed point α. For 0 < g'(α) < 1 the iterates converge monotonically; for −1 < g'(α) < 0 they converge while oscillating around α; for g'(α) > 1 and for g'(α) < −1 they diverge.]
Besides convergence, we would like to know how fast the sequence x_n = g(x_{n−1}) is converging to the solution; in other words, how fast the error α − x_n is decreasing. We say that the sequence {x_n}, n ≥ 0, converges to α with order of convergence p ≥ 1 if

    |α − x_{n+1}| ≤ c |α − x_n|^p,   n ≥ 0,

where c ≥ 0 is a constant. The cases p = 1, p = 2 and p = 3 are called linear, quadratic and cubic convergence. In the case of linear convergence, the constant c is called the rate of linear convergence, and we require additionally that c < 1; otherwise the sequence of errors α − x_n can fail to converge to zero. Also, for linear convergence we can use the relation

    |α − x_n| ≤ cⁿ |α − x_0|,   n ≥ 0.

Thus the bisection method is linearly convergent with rate 1/2, Newton's method is quadratically convergent, and the secant method has order of convergence p = (1 + √5)/2.

If |g'(α)| < 1, from the last theorem we have that the iterates x_n are at least linearly convergent. If, in addition, g'(α) ≠ 0, then we have exactly linear convergence with rate |g'(α)|. In practice, the last theorem is rarely used, since it is quite difficult to find an interval [a, b] such that g([a, b]) ⊂ [a, b]. To simplify the usage of the theorem, we consider the following corollary.
Corollary. Assume x = g(x) has a solution α, and further assume that both g(x) and g'(x) are continuous for all x in some interval about α. In addition, assume

    |g'(α)| < 1     (**)

Then for any sufficiently small number ε > 0, the interval [a, b] = [α − ε, α + ε] will satisfy the hypotheses of the preceding theorem.

This means that if (**) is true, and if we choose x0 sufficiently close to α, then the fixed point iteration x_{n+1} = g(x_n) will converge and the earlier results S1-S4 will all hold. The corollary does not tell us how close we need to be to α in order to have convergence.
NEWTON'S METHOD

Newton's method

    x_{n+1} = x_n − f(x_n) / f'(x_n)

is a fixed point iteration with

    g(x) = x − f(x) / f'(x)

Check its convergence by checking the condition (**):

    g'(x) = 1 − f'(x)/f'(x) + f(x) f''(x) / [f'(x)]² = f(x) f''(x) / [f'(x)]²

    g'(α) = 0

Therefore Newton's method will converge if x0 is chosen sufficiently close to α.
HIGHER ORDER METHODS

What happens when g'(α) = 0? We use Taylor's theorem to answer this question. Begin by writing

    g(x) = g(α) + g'(α)(x − α) + (1/2) g''(c) (x − α)²

with c between x and α. Substitute x = x_n and recall that g(x_n) = x_{n+1} and g(α) = α. Also assume g'(α) = 0. Then

    x_{n+1} = α + (1/2) g''(c_n) (x_n − α)²

    α − x_{n+1} = −(1/2) g''(c_n) (x_n − α)²

with c_n between α and x_n. Thus if g'(α) = 0, the fixed point iteration is quadratically convergent or better. In fact, if g''(α) ≠ 0, then the iteration is exactly quadratically convergent.
ANOTHER RAPID ITERATION

Newton's method is rapid, but requires use of the derivative f'(x). Can we get by without it? The answer is yes! Consider the method

    D_n = [f(x_n + f(x_n)) − f(x_n)] / f(x_n)

    x_{n+1} = x_n − f(x_n) / D_n

This is an approximation to Newton's method, with f'(x_n) ≈ D_n. To analyze its convergence, regard it as a fixed point iteration with

    D(x) = [f(x + f(x)) − f(x)] / f(x)

    g(x) = x − f(x) / D(x)

Then we can, with some difficulty, show g'(α) = 0 and g''(α) ≠ 0. This proves this new iteration is quadratically convergent.
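This derivative-free scheme (known as Steffensen's method) can be sketched as follows; names and stopping test are illustrative:

```python
def steffensen(f, x0, tol=1e-12, max_iter=50):
    """Newton-like iteration with f'(x_n) replaced by
    D_n = (f(x_n + f(x_n)) - f(x_n)) / f(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if fx == 0.0:
            return x
        D = (f(x + fx) - fx) / fx
        x_new = x - fx / D
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")

root = steffensen(lambda x: x**6 - x - 1, 1.2)
print(root)  # approximately 1.134724
```

Note that a starting point close to the root matters more here than for Newton's method, since the difference quotient samples f far from x when f(x) is large.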
FIXED POINT ITERATION: ERROR

Recall the result

    lim_{n→∞} (α − x_n) / (α − x_{n−1}) = g'(α)

for the iteration

    x_n = g(x_{n−1}),   n = 1, 2, ...

Thus

    α − x_n ≈ λ (α − x_{n−1})     (***)

with λ = g'(α) and |λ| < 1.

If we were to know λ, then we could solve (***) for α:

    α ≈ (x_n − λ x_{n−1}) / (1 − λ)

Usually, we write this as a modification of the currently computed iterate x_n:

    α ≈ (x_n − λ x_{n−1}) / (1 − λ)
      = (x_n − λ x_n) / (1 − λ) + (λ x_n − λ x_{n−1}) / (1 − λ)
      = x_n + (λ / (1 − λ)) [x_n − x_{n−1}]

The formula

    x_n + (λ / (1 − λ)) [x_n − x_{n−1}]

is said to be an extrapolation of the numbers x_{n−1} and x_n. But what is λ?
From

    lim_{n→∞} (α − x_n) / (α − x_{n−1}) = g'(α)

we have

    λ ≈ (α − x_n) / (α − x_{n−1})

Unfortunately this also involves the unknown root α which we seek, and we must find some other way of estimating λ.

To calculate λ, consider the ratio

    λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2})

To see that this is approximately λ as x_n approaches α, write

    (x_n − x_{n−1}) / (x_{n−1} − x_{n−2}) = [g(x_{n−1}) − g(x_{n−2})] / (x_{n−1} − x_{n−2}) = g'(c_n)

with c_n between x_{n−1} and x_{n−2}. As the iterates approach α, the number c_n must also approach α. Thus λ_n approaches λ as x_n → α.
We combine these results to obtain the estimate

    x̂_n = x_n + (λ_n / (1 − λ_n)) [x_n − x_{n−1}],   λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2})

We call x̂_n the Aitken extrapolate of {x_{n−2}, x_{n−1}, x_n}, and α ≈ x̂_n. We can also rewrite this as

    α − x_n ≈ x̂_n − x_n = (λ_n / (1 − λ_n)) [x_n − x_{n−1}]

This is called Aitken's error estimation formula. The accuracy of these procedures is tied directly to the accuracy of the formulas

    α − x_n ≈ λ (α − x_{n−1}),   α − x_{n−1} ≈ λ (α − x_{n−2})

If these are accurate, then so are the above extrapolation and error estimation formulas.
EXAMPLE

Consider the iteration

    x_{n+1} = 6.28 + sin(x_n),   n = 0, 1, 2, ...

for solving

    x = 6.28 + sin x

Iterates are shown on the accompanying sheet, including calculations of λ_n and the error estimate

    α − x_n ≈ x̂_n − x_n = (λ_n / (1 − λ_n)) [x_n − x_{n−1}]     (Estimate)

The latter is called "Estimate" in the table. In this instance,

    g'(α) ≈ 0.9644,

and therefore the convergence is very slow. This is apparent in the table.
AITKEN'S ALGORITHM

Step 1: Select x0.
Step 2: Calculate

    x1 = g(x0),   x2 = g(x1)

Step 3: Calculate

    x3 = x2 + (λ2 / (1 − λ2)) [x2 − x1],   λ2 = (x2 − x1) / (x1 − x0)

Step 4: Calculate

    x4 = g(x3),   x5 = g(x4)

and calculate x6 as the extrapolate of {x3, x4, x5}. Continue this procedure, ad infinitum.

Of course, in practice we will have some kind of error test to stop this procedure when we believe we have sufficient accuracy.
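The steps above can be sketched as follows (illustrative names, stopping on the size of the extrapolation correction):

```python
import math

def aitken(g, x0, tol=1e-10, max_cycles=50):
    """Aitken-accelerated fixed point iteration: two plain steps,
    then one extrapolation using lambda_n, the ratio of differences."""
    for _ in range(max_cycles):
        x1 = g(x0)
        if x1 == x0:           # already at a fixed point (to rounding)
            return x1
        x2 = g(x1)
        lam = (x2 - x1) / (x1 - x0)
        correction = lam / (1.0 - lam) * (x2 - x1)
        x3 = x2 + correction
        if abs(correction) < tol:
            return x3
        x0 = x3
    raise RuntimeError("Aitken iteration did not converge")

# The slow example from the text: x = 6.28 + sin(x), g'(alpha) = 0.9644.
alpha = aitken(lambda x: 6.28 + math.sin(x), 6.0)
print(alpha)  # the root of x = 6.28 + sin x
```

Where the plain iteration gains less than one digit per step, a few Aitken cycles reach full accuracy.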
EXAMPLE

Consider again the iteration

    x_{n+1} = 6.28 + sin(x_n),   n = 0, 1, 2, ...

for solving

    x = 6.28 + sin x

Now we use the Aitken method, and the results are shown in the accompanying table. With this we have

    α − x3 = 7.98 × 10⁻⁴,   α − x6 = 2.27 × 10⁻⁶

In comparison, the original iteration had

    α − x6 = 1.23 × 10⁻²
GENERAL COMMENTS

Aitken extrapolation can greatly accelerate the convergence of a linearly convergent iteration

    x_{n+1} = g(x_n)

This shows the power of understanding the behaviour of the error in a numerical process. From that understanding, we can often improve the accuracy, through extrapolation or some other procedure.

This is a justification for using mathematical analysis to understand numerical methods. We will see this repeated at later points in the course, and it holds for many different types of problems and numerical methods for their solution.
MULTIPLE ROOTS

We study two classes of functions for which there is additional difficulty in calculating their roots. The first of these are functions in which the desired root has a multiplicity greater than 1. What does this mean?

Let α be a root of the function f(x), and imagine writing it in the factored form

    f(x) = (x − α)^m h(x)

with some integer m ≥ 1 and some continuous function h(x) for which h(α) ≠ 0. Then we say that α is a root of f(x) of multiplicity m. For example, the function

    f(x) = e^(x²) − 1

has x = 0 as a root of multiplicity m = 2. In particular, define

    h(x) = (e^(x²) − 1) / x²

for x ≠ 0. Using Taylor polynomial approximations, we can show for x ≠ 0 that

    h(x) ≈ 1 + x²/2 + x⁴/6,   lim_{x→0} h(x) = 1

This leads us to extend the definition of h(x) to

    h(x) = (e^(x²) − 1) / x²,  x ≠ 0;   h(0) = 1

Thus

    f(x) = x² h(x)

as asserted, and x = 0 is a root of f(x) of multiplicity m = 2.

Roots for which m = 1 are called simple roots, and the methods studied to this point were intended for such roots. We now consider the case of m > 1.

If the function f(x) is m-times differentiable around α, then we can differentiate

    f(x) = (x − α)^m h(x)

m times to obtain an equivalent formulation of what it means for α to have multiplicity m. For example, with m = 3, differentiating once gives f'(x) = (x − α)² h₂(x) for a suitably defined h₂(x) with h₂(α) ≠ 0.
This shows α is a root of f'(x) of multiplicity 2. Differentiating a second time, we can show

    f''(x) = (x − α) h₃(x)

for a suitably defined h₃(x) with h₃(α) ≠ 0, and α is a simple root of f''(x). Differentiating a third time, we have

    f'''(α) = h₃(α) ≠ 0

We can use this as part of a proof of the following: α is a root of f(x) of multiplicity m = 3 if and only if

    f(α) = f'(α) = f''(α) = 0,   f'''(α) ≠ 0

In general, α is a root of f(x) of multiplicity m if and only if

    f(α) = ··· = f^(m−1)(α) = 0,   f^(m)(α) ≠ 0
DIFFICULTIES OF MULTIPLE ROOTS

There are two main difficulties with the numerical calculation of multiple roots (by which we mean m > 1 in the definition).

1. Methods such as Newton's method and the secant method converge more slowly than for the case of a simple root.

2. There is a large interval of uncertainty in the precise location of a multiple root on a computer or calculator.

The second of these is the more difficult to deal with, but we begin with the first for the case of Newton's method.
Recall that we can regard Newton's method as a fixed point method:

    x_{n+1} = g(x_n),   g(x) = x − f(x) / f'(x)

Then we substitute

    f(x) = (x − α)^m h(x)

to obtain

    g(x) = x − (x − α)^m h(x) / [m (x − α)^(m−1) h(x) + (x − α)^m h'(x)]
         = x − (x − α) h(x) / [m h(x) + (x − α) h'(x)]

Then we can use this to show

    g'(α) = 1 − 1/m = (m − 1)/m

For m > 1, this is nonzero, and therefore Newton's method is only linearly convergent:

    α − x_{n+1} ≈ λ (α − x_n),   λ = (m − 1)/m

Similar results hold for the secant method.

There are ways of improving the speed of convergence of Newton's method, creating a modified method that is again quadratically convergent. In particular, consider the fixed point iteration formula

    x_{n+1} = g(x_n),   g(x) = x − m f(x) / f'(x)

in which we assume the multiplicity m of the root α being sought is known. Then, modifying the above argument on the convergence of Newton's method, we obtain

    g'(α) = 1 − m · (1/m) = 0

and the iteration method will be quadratically convergent.
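A small comparison of the two iterations on a root of multiplicity 2 (the example function and names here are chosen for illustration, not taken from the lecture):

```python
# f(x) = (x - 1)^2 * (x + 2) has a double root at x = 1 (m = 2).
f = lambda x: (x - 1)**2 * (x + 2)
df = lambda x: 2 * (x - 1) * (x + 2) + (x - 1)**2

def iterate(m, x0=2.0, steps=10):
    """Run x_{n+1} = x_n - m*f(x_n)/f'(x_n); m = 1 is plain Newton."""
    x = x0
    for _ in range(steps):
        if f(x) == 0.0:
            return x
        x = x - m * f(x) / df(x)
    return x

print(abs(iterate(1) - 1.0))  # plain Newton: error shrinks only like (1/2)^n
print(abs(iterate(2) - 1.0))  # modified Newton: essentially machine precision
```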
But this is not the fundamental problem posed by
multiple roots.
NOISE IN FUNCTION EVALUATION

Recall the discussion of noise in evaluating a function f(x), and in our case consider the evaluation for values of x near to α. In the following figures, the noise as measured by vertical distance is the same in both graphs.

[Figures: noisy evaluations of f(x) near a simple root and near a double root; the same vertical noise produces a much wider band of sign changes around the double root.]
Noise was discussed earlier, and as an example we used the function

    f(x) = x³ − 3x² + 3x − 1 ≡ (x − 1)³

Because of the noise in evaluating f(x), it appears from the graph that f(x) has many zeros around x = 1, whereas the exact function outside of the computer has only the root α = 1, of multiplicity 3. Any rootfinding method to find a multiple root α that uses evaluations of f(x) is doomed to having a large interval of uncertainty as to the location of the root. If high accuracy is desired, then the only satisfactory solution is to reformulate the problem as a new problem F(x) = 0 in which α is a simple root of F. Then use a standard rootfinding method to calculate α. It is important that the evaluation of F(x) not involve f(x) directly, as that is the source of the noise and the uncertainty.

From an examination of the rate of linear convergence of Newton's method applied to such a function, one can guess with high probability that the multiplicity is m = 3. Then form exactly the second derivative

    f''(x) = 21.12 − 32.4x + 12x²

Applying Newton's method to this with a guess of x = 1 will lead to rapid convergence to α = 1.1.

In general, if we know the root α has multiplicity m > 1, then replace the problem by that of solving

    f^(m−1)(x) = 0,

since α is a simple root of this equation.
STABILITY

Generally we expect the world to be stable. By this, we mean that if we make a small change in something, then we expect this to lead to other correspondingly small changes. In fact, if we think about this carefully, then we know this need not be true. We now illustrate this for the case of rootfinding.

Why have some of the roots departed so radically from the original values? This phenomenon goes under a variety of names. We sometimes say this is an example of an unstable or ill-conditioned rootfinding problem. These words are often used in a casual manner, but they also have a very precise meaning in many areas of numerical analysis (and, more generally, in all of mathematics).
A PERTURBATION ANALYSIS

We want to study what happens to the root of a function f(x) when it is perturbed by a small amount. For some function g(x) and for all small ε, define a perturbed function

    F_ε(x) = f(x) + ε g(x)

The polynomial example would fit this if we use

    g(x) = x⁶,   ε = −.002

Let α0 be a simple root of f(x). It can be shown (using the implicit function theorem from calculus) that if f(x) and g(x) are differentiable for x ≈ α0, and if f'(α0) ≠ 0, then F_ε(x) has a unique simple root α(ε) near to α0 = α(0) for all small values of ε. Moreover, α(ε) will be a differentiable function of ε. We use this to estimate α(ε).

The linear Taylor polynomial approximation of α(ε) is given by

    α(ε) ≈ α(0) + ε α'(0)

We need to find a formula for α'(0). Recall that

    F_ε(α(ε)) = 0

for all small values of ε. Differentiate this as a function of ε using the chain rule. Then we obtain

    f'(α(ε)) α'(ε) + g(α(ε)) + ε g'(α(ε)) α'(ε) = 0

for all small ε. Substitute ε = 0, recall α(0) = α0, and solve for α'(0) to obtain

    f'(α0) α'(0) + g(α0) = 0
    α'(0) = −g(α0) / f'(α0)

This then leads to

    α(ε) ≈ α(0) + ε α'(0) = α0 − ε g(α0) / f'(α0)     (*)
Example. In our earlier polynomial example, consider the simple root α0 = 3. Then

    α(ε) ≈ 3 − ε · 3⁶/48 ≈ 3 − 15.2ε

With ε = −.002, we obtain

    α(−.002) ≈ 3 − 15.2 · (−.002) ≈ 3.0304

This is close to the actual root, 3.0331253.
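Assuming the earlier polynomial example is f(x) = (x − 1)(x − 2)···(x − 7) (it is not restated in this excerpt, so treat that as an assumption; it is consistent with f'(3) = 48 used above), the linear estimate can be checked against the actual perturbed root:

```python
def f(x):
    """Assumed unperturbed polynomial, with roots 1, 2, ..., 7."""
    p = 1.0
    for k in range(1, 8):
        p *= (x - k)
    return p

eps = -0.002
F = lambda x: f(x) + eps * x**6                          # perturbed polynomial
dF = lambda x, h=1e-6: (F(x + h) - F(x - h)) / (2 * h)   # numeric derivative

# Linear perturbation estimate at alpha0 = 3: g(3) = 3^6 = 729, f'(3) = 48.
estimate = 3 - eps * 3**6 / 48
print(estimate)  # 3.030375

# Newton on the perturbed polynomial, started at 3:
x = 3.0
for _ in range(50):
    x = x - F(x) / dF(x)
print(x)  # near the quoted actual root 3.0331253
```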
However, the approximation (*) is not good at estimating the change in the roots 5 and 6. By observation, the perturbation in those roots is a complex number, whereas the formula (*) predicts only a perturbation that is real. The value of ε is too large for (*) to be accurate for the roots 5 and 6.

DISCUSSION

Looking again at the formula

    α(ε) ≈ α0 − ε g(α0) / f'(α0)

we have that the size of

    ε g(α0) / f'(α0)

is an indication of the stability of the solution α0. If this quantity is large, then potentially we will have difficulty. Of course, not all functions g(x) are equally possible, and we need to look only at functions g(x) that will possibly occur in practice.

One quantity of interest is the size of f'(α0). If it is very small relative to ε g(α0), then we are likely to have difficulty in finding α0 accurately.
INTERPOLATION

Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).

As an example, consider defining

    x0 = 0,   x1 = π/4,   x2 = π/2

and

    y_i = cos x_i,   i = 0, 1, 2

This gives us the three points

    (0, 1),   (π/4, 1/√2),   (π/2, 0)

Now find a quadratic polynomial

    p(x) = a0 + a1 x + a2 x²

for which

    p(x_i) = y_i,   i = 0, 1, 2

The graph of this polynomial is shown in the accompanying figure. We later give an explicit formula.
[Figure: quadratic interpolation of cos(x) — the graphs of y = cos(x) and y = p2(x) on [0, π/2], agreeing at the three interpolation nodes.]
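A sketch of this construction in Lagrange form, which is equivalent to solving the 3 × 3 system for a0, a1, a2 (helper names chosen here):

```python
import math

def quadratic_through(pts):
    """Quadratic p(x) through three points, written in Lagrange form."""
    (x0, y0), (x1, y1), (x2, y2) = pts
    def p(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
              + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
              + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return p

p = quadratic_through([(0.0, 1.0),
                       (math.pi / 4, 1 / math.sqrt(2)),
                       (math.pi / 2, 0.0)])
print(p(math.pi / 4))          # reproduces 1/sqrt(2)
print(p(1.0), math.cos(1.0))   # interpolant vs cos(1): close, but not equal
```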
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(xi, yi)} with a func-tion given analytically.
2. Approximate functions with simpler ones, usually
polynomials or ‘piecewise polynomials’.
Purpose #1 has several aspects.
• The data may be from a known class of functions.Interpolation is then used to find the member of
this class of functions that agrees with the given
data. For example, data may be generated from
functions of the form
p(x) = a0 + a1ex + a2e
2x + · · ·+ anenx
Then we need to find the coefficientsnajobased
on the given data values.
• We may want to take function values f(x) givenin a table for selected values of x, often equally
spaced, and extend the function to values of x
not in the table.
For example, given numbers from a table of loga-
rithms, estimate the logarithm of a number x not
in the table.
• Given a set of data points {(xi, yi)}, find a curve passing thru these points that is “pleasing to the eye”. In fact, this is what is done continually with
computer graphics. How do we connect a set of
points to make a smooth curve? Connecting them
with straight line segments will often give a curve
with many corners, whereas what was intended
was a smooth curve.
Purpose #2 for interpolation is to approximate functions f(x) by simpler functions p(x), perhaps to make
it easier to integrate or differentiate f(x). That will
be the primary reason for studying interpolation in this
course.
As an example of why this is important, consider the
problem of evaluating
    I = ∫₀¹ dx / (1 + x^10)
This is very difficult to do analytically. But we will
look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.
We begin by using polynomials as our means of doing
interpolation. Later in the chapter, we consider more
complex ‘piecewise polynomial’ functions, often called
‘spline functions’.
LINEAR INTERPOLATION
The simplest form of interpolation is probably the straight line, connecting two points by a straight line.
Let two data points (x0, y0) and (x1, y1) be given.
There is a unique straight line passing through these
points. We can write the formula for a straight line as
P1(x) = a0 + a1x
In fact, there are other more convenient ways to write
it, and we give several of them below.
    P1(x) = ((x − x1)/(x0 − x1)) y0 + ((x − x0)/(x1 − x0)) y1

          = [(x1 − x) y0 + (x − x0) y1] / (x1 − x0)

          = y0 + ((x − x0)/(x1 − x0)) [y1 − y0]

          = y0 + ((y1 − y0)/(x1 − x0)) (x − x0)
Check each of these by evaluating them at x = x0 and x = x1 to see if the respective values are y0 and y1.
Example. Following is a table of values for f(x) = tan x for a few values of x.

    x      1        1.1      1.2      1.3
    tan x  1.5574   1.9648   2.5722   3.6021
Use linear interpolation to estimate tan(1.15), taking

    x0 = 1.1,   x1 = 1.2

with corresponding values for y0 and y1. Then
    tan x ≈ y0 + ((x − x0)/(x1 − x0)) [y1 − y0]

    tan(1.15) ≈ 1.9648 + ((1.15 − 1.1)/(1.2 − 1.1)) [2.5722 − 1.9648] = 2.2685
The true value is tan 1.15 = 2.2345. We will want
to examine formulas for the error in interpolation, to
know when we have sufficient accuracy in our interpolant.
[Figure: y = tan(x) on [1, 1.3]]

[Figure: y = tan(x) and the linear interpolant y = p1(x) on [1.1, 1.2]]
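The hand computation above can be reproduced in a few lines. A sketch in Python rather than Matlab, using the table entries from the example:

```python
# Linear interpolation of tan(x) using the table entries
# (x0, y0) = (1.1, 1.9648) and (x1, y1) = (1.2, 2.5722).
x0, y0 = 1.1, 1.9648
x1, y1 = 1.2, 2.5722

def p1(x):
    # P1(x) = y0 + ((x - x0)/(x1 - x0)) [y1 - y0]
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

print(p1(1.15))  # about 2.2685; the true value is tan(1.15) = 2.2345...
```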
QUADRATIC INTERPOLATION
We want to find a polynomial
    P2(x) = a0 + a1 x + a2 x²
which satisfies
P2(xi) = yi, i = 0, 1, 2
for given data points (x0, y0) , (x1, y1) , (x2, y2). One
formula for such a polynomial follows:
    P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)    (∗∗)

with

    L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
    L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
    L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
The formula (∗∗) is called Lagrange’s form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
    L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
    L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
    L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]

are called ‘Lagrange basis functions’ for quadratic interpolation. They have the properties

    Li(xj) = 1 if i = j,   Li(xj) = 0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their
graphs are on an accompanying page.
As a consequence of each Li(x) being of degree 2, we
have that the interpolant
P2(x) = y0L0(x) + y1L1(x) + y2L2(x)
must have degree ≤ 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for which

    deg(Q) ≤ 2,   Q(xi) = yi,   i = 0, 1, 2 ?
Thus, is the Lagrange formula P2(x) unique?
Introduce
R(x) = P2(x)−Q(x)
From the properties of P2 and Q, we have deg(R) ≤ 2. Moreover,

    R(xi) = P2(xi) − Q(xi) = yi − yi = 0

for all three node points x0, x1, and x2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore

    R(x) = 0 for all x
    Q(x) = P2(x) for all x
SPECIAL CASES
Consider the data points
(x0, 1), (x1, 1), (x2, 1)
What is the polynomial P2(x) in this case?
Answer: The interpolant must be

    P2(x) ≡ 1

meaning that P2(x) is the constant function 1. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore, by the uniqueness of quadratic interpolation, P2(x) must be the constant function 1.
Consider now the data points
(x0,mx0), (x1,mx1), (x2,mx2)
for some constant m. What is P2(x) in this case? By an argument similar to that above,
P2(x) = mx for all x
Thus the degree of P2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial Pn(x) for which

    deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, · · · , n    (∗∗)

with given data points

    (x0, y0) , (x1, y1) , · · · , (xn, yn)

The solution is given by Lagrange’s formula
Pn(x) = y0L0(x) + y1L1(x) + · · ·+ ynLn(x)
The Lagrange basis functions are given by
    Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)]
          / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]

for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.
In a manner analogous to the quadratic case, we can show that the above Pn(x) is the only solution to the problem (∗∗).
In the formula
    Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)]
          / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]

we can see that each such function is a polynomial of degree n. In addition,

    Lk(xi) = 1 if k = i,   Lk(xi) = 0 if k ≠ i
Using these properties, it follows that the formula
Pn(x) = y0L0(x) + y1L1(x) + · · ·+ ynLn(x)
satisfies the interpolation problem of finding a solution
to
    deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, · · · , n
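Lagrange’s formula translates directly into code. A sketch in Python; the function name `lagrange` and the cubic test function are my own choices, not from the text:

```python
def lagrange(xs, ys):
    """Return the Lagrange interpolant Pn(x) = sum_k y_k L_k(x) for the given nodes."""
    def Pn(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            Lk = 1.0
            for j, xj in enumerate(xs):
                if j != k:
                    Lk *= (x - xj) / (xk - xj)
            total += yk * Lk
        return total
    return Pn

# Interpolating data sampled from a cubic at 4 nodes reproduces the cubic
# (up to rounding), since the degree-3 interpolant is unique.
f = lambda x: 2 * x**3 - x + 5
xs = [0.0, 1.0, 2.0, 3.0]
P3 = lagrange(xs, [f(x) for x in xs])
assert abs(P3(1.7) - f(1.7)) < 1e-9
```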
EXAMPLE
Recall the table
    x      1        1.1      1.2      1.3
    tan x  1.5574   1.9648   2.5722   3.6021
We now interpolate this table with the nodes
x0 = 1, x1 = 1.1, x2 = 1.2, x3 = 1.3
Without giving the details of the evaluation process,
we have the following results for interpolation with
degrees n = 1, 2, 3.
    n          1        2        3
    Pn(1.15)   2.2685   2.2435   2.2296
    Error     −0.0340  −0.0090   0.0049

The accuracy improves with increasing degree n, but not at a very
rapid rate. In fact, the error becomes worse when n is
increased further. Later we will see that interpolation
of a much higher degree, say n ≥ 10, is often poorly
behaved when the node points {xi} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x0 and x1, define

    f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0)
This is called a first order divided difference of f(x). By the Mean Value Theorem,

    f(x1) − f(x0) = f′(c)(x1 − x0)

for some c between x0 and x1. Thus

    f[x0, x1] = f′(c)

and the divided difference is very much like the derivative, especially if x0 and x1 are quite close together. In fact,

    f′((x0 + x1)/2) ≈ f[x0, x1]

is quite an accurate approximation of the derivative.
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x0, x1, and x2, define

    f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)

This is called the second order divided difference of f(x). By a fairly complicated argument, we can show

    f[x0, x1, x2] = (1/2) f″(c)

for some c intermediate to x0, x1, and x2.
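Both divided differences are easy to check numerically. A sketch in Python, with f(x) = sin x as my own choice of test function:

```python
import math

def dd1(f, x0, x1):
    """First order divided difference f[x0, x1]."""
    return (f(x1) - f(x0)) / (x1 - x0)

def dd2(f, x0, x1, x2):
    """Second order divided difference f[x0, x1, x2]."""
    return (dd1(f, x1, x2) - dd1(f, x0, x1)) / (x2 - x0)

f = math.sin
x0, x1, x2 = 1.0, 1.1, 1.2
# f[x0, x1] is close to f'((x0 + x1)/2) = cos(1.05) ...
print(dd1(f, x0, x1), math.cos(1.05))
# ... and f[x0, x1, x2] is close to f''(c)/2 = -sin(c)/2 for some c in [1.0, 1.2].
print(dd2(f, x0, x1, x2), -math.sin(1.1) / 2)
```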
From the graphs, there is enormous variation in the
size of Ψn(x) as x varies over [0, 1]; and thus there
is also enormous variation in the error as x so varies.
For example, in the n = 9 case,
    max_{x0 ≤ x ≤ x1} |Ψn(x)|/(n + 1)! = 3.39 × 10⁻¹¹

    max_{x4 ≤ x ≤ x5} |Ψn(x)|/(n + 1)! = 6.89 × 10⁻¹³
and the ratio of these two errors is approximately 49.
Thus the interpolation error is likely to be around 49
times larger when x0 ≤ x ≤ x1 as compared to the
case when x4 ≤ x ≤ x5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x0, ..., xn} being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation
polynomial to approximate a given function f(x) on
a given interval [a, b]. In particular, take interpolation
nodes
a ≤ x0 < x1 < · · · < xn−1 < xn ≤ b
and produce the interpolation polynomial Pn(x) that
interpolates f(x) at the given node points. We would
like to have
    max_{a ≤ x ≤ b} |f(x) − Pn(x)| → 0 as n → ∞
Does it happen?
Recall the error bound
    max_{a ≤ x ≤ b} |f(x) − Pn(x)|
        ≤ max_{a ≤ x ≤ b} |Ψn(x)|/(n + 1)! · max_{a ≤ x ≤ b} |f^(n+1)(x)|

We begin with an example using evenly spaced node points.
RUNGE'S EXAMPLE
Use evenly spaced node points:
    h = (b − a)/n,   xi = a + i·h,   i = 0, ..., n
For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in

    max_{a ≤ x ≤ b} |f(x) − Pn(x)|
        ≤ max_{a ≤ x ≤ b} |Ψn(x)|/(n + 1)! · max_{a ≤ x ≤ b} |f^(n+1)(x)|

can badly hurt or destroy the convergence of other cases. In particular, we show the graph of

    f(x) = 1/(1 + x²)

and Pn(x) on [−5, 5] for the case n = 10. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.
[Figure: Runge’s example with n = 10, showing y = 1/(1 + x²) and y = P10(x) on [−5, 5]]
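The divergence is easy to observe numerically. A sketch in Python; the sampling grid and the degrees tried are my own choices:

```python
def runge_max_error(n, num_samples=1001):
    """Max |f(x) - Pn(x)| on [-5, 5] for f(x) = 1/(1 + x^2),
    using n + 1 evenly spaced interpolation nodes."""
    f = lambda x: 1.0 / (1.0 + x * x)
    xs = [-5.0 + 10.0 * i / n for i in range(n + 1)]
    ys = [f(x) for x in xs]

    def Pn(x):
        total = 0.0
        for k in range(n + 1):
            Lk = 1.0
            for j in range(n + 1):
                if j != k:
                    Lk *= (x - xs[j]) / (xs[k] - xs[j])
            total += ys[k] * Lk
        return total

    pts = [-5.0 + 10.0 * i / (num_samples - 1) for i in range(num_samples)]
    return max(abs(f(x) - Pn(x)) for x in pts)

# The maximum error grows as n increases: Runge's phenomenon.
print(runge_max_error(4), runge_max_error(10), runge_max_error(14))
```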
OTHER CHOICES OF NODES
Recall the general error bound
    max_{a ≤ x ≤ b} |f(x) − Pn(x)|
        ≤ max_{a ≤ x ≤ b} |Ψn(x)|/(n + 1)! · max_{a ≤ x ≤ b} |f^(n+1)(x)|

There is nothing we really do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible.
This problem has quite an elegant solution, and it will be considered in the next lecture. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial T_{n+1}(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them

    max_{a ≤ x ≤ b} |Ψn(x)| = ((b − a)/2)^(n+1) · 2^(−n)

This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
    f(x) − Pn(x) = Ψn(x)/(n + 1)! · f^(n+1)(c)

    Ψn(x) = (x − x0)(x − x1) · · · (x − xn)

with c between the minimum and maximum of {x0, ..., xn, x}. A second formula is given by
f(x)− Pn(x) = Ψn(x) f [x0, ..., xn, x]
To show this is a simple, but somewhat subtle argument.
Let Pn+1(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then

    Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1] (x − x0) · · · (x − xn)
Substituting x = xn+1, and using the fact that Pn+1(x)
To simplify the presentation somewhat, I assume in
the following that our node points are evenly spaced:
x2 = x1 + h, x3 = x1 + 2h, x4 = x1 + 3h
Then our earlier formulas simplify to
    s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
         + [(x2 − x) y1 + (x − x1) y2] / h
         − (h/6) [(x2 − x) M1 + (x − x1) M2]
for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and
[x3, x4].
Without going thru all of the algebra, the conditions (∗∗) lead to the following pair of equations.
    (h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h

    (h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
This gives us two equations in four unknowns. The earlier boundary conditions on s″(x) give us immediately

    M1 = M4 = 0
Then we can solve the linear system for M2 and M3.
EXAMPLE
Consider the interpolation data points
    x   1   2     3     4
    y   1   1/2   1/3   1/4
In this case, h = 1, and the linear system becomes

    (2/3) M2 + (1/6) M3 = y3 − 2y2 + y1 = 1/3
    (1/6) M2 + (2/3) M3 = y4 − 2y3 + y2 = 1/12
This has the solution

    M2 = 1/2,   M3 = 0
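The 2×2 system can be solved exactly in rational arithmetic; a sketch in Python using Cramer’s rule:

```python
from fractions import Fraction as F

# The 2x2 system from the example (h = 1, natural boundary conditions M1 = M4 = 0):
#   (2/3) M2 + (1/6) M3 = 1/3
#   (1/6) M2 + (2/3) M3 = 1/12
a, b = F(2, 3), F(1, 6)
r1, r2 = F(1, 3), F(1, 12)

det = a * a - b * b           # determinant of [[a, b], [b, a]]
M2 = (a * r1 - b * r2) / det  # Cramer's rule
M3 = (a * r2 - b * r1) / det

print(M2, M3)  # 1/2 0
```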
This leads to the spline function formula on each
subinterval.
On [1, 2],

    s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
         + [(x2 − x) y1 + (x − x1) y2] / h
         − (h/6) [(x2 − x) M1 + (x − x1) M2]

         = [(2 − x)³ · 0 + (x − 1)³ (1/2)] / 6
         + [(2 − x) · 1 + (x − 1)(1/2)] / 1
         − (1/6) [(2 − x) · 0 + (x − 1)(1/2)]

         = (1/12)(x − 1)³ − (7/12)(x − 1) + 1
Similarly, for 2 ≤ x ≤ 3,

    s(x) = −(1/12)(x − 2)³ + (1/4)(x − 2)² − (1/3)(x − 2) + 1/2

and for 3 ≤ x ≤ 4,

    s(x) = −(1/12)(x − 4) + 1/4
[Figure: natural cubic spline interpolation of the table above, showing y = 1/x and y = s(x) on [1, 4]]
    x   0     1     2     2.5   3     3.5     4
    y   2.5   0.5   0.5   1.5   1.5   1.125   0

[Figure: interpolating natural cubic spline function for this data]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
    (h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h

    (h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f″(x) = 2/x³, and thus we could use

    M1 = 2,   M4 = 1/32

With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
s(xi) = yi, i = 1, 2, 3, 4
with the boundary conditions
    s′(x1) = y′1,   s′(x4) = y′4    (#)
The conditions (#) lead to another pair of equations,
augmenting the earlier ones. Combined, these equations are
    (h/3) M1 + (h/6) M2 = (y2 − y1)/h − y′1

    (h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h

    (h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h

    (h/6) M3 + (h/3) M4 = y′4 − (y4 − y3)/h
For our numerical example, it is natural to obtain these derivative values from f′(x) = −1/x²:

    y′1 = −1,   y′4 = −1/16
When combined with the earlier equations, we have the system

    (1/3) M1 + (1/6) M2 = 1/2
    (1/6) M1 + (2/3) M2 + (1/6) M3 = 1/3
    (1/6) M2 + (2/3) M3 + (1/6) M4 = 1/12
    (1/6) M3 + (1/3) M4 = 1/48
This has the solution

    [M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
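The 4×4 clamped system can be checked in exact rational arithmetic. A sketch in Python; the elimination routine is a generic one I wrote for this check, not from the text:

```python
from fractions import Fraction as F

# The clamped-spline system from the example, in exact rational arithmetic.
A = [[F(1, 3), F(1, 6), F(0),    F(0)],
     [F(1, 6), F(2, 3), F(1, 6), F(0)],
     [F(0),    F(1, 6), F(2, 3), F(1, 6)],
     [F(0),    F(0),    F(1, 6), F(1, 3)]]
b = [F(1, 2), F(1, 3), F(1, 12), F(1, 48)]

def solve(A, b):
    """Gaussian elimination with back substitution (no pivoting needed here)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [F(0)] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

M1, M2, M3, M4 = solve(A, b)
print(M1, M2, M3, M4)  # 173/120 7/60 11/120 1/60
```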
We can now write the functions s(x) for each of the
subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for
x1 ≤ x ≤ x2,
    s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
         + [(x2 − x) y1 + (x − x1) y2] / h
         − (h/6) [(x2 − x) M1 + (x − x1) M2]
We can substitute in from the data
    x   1   2     3     4
    y   1   1/2   1/3   1/4

and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,

    f(x) = 1/x,   f(3/2) = 2/3,   s(3/2) = 0.65260
This is quite a decent approximation.
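The value s(3/2) quoted above can be reproduced exactly from the computed Mi. A sketch in Python with rational arithmetic (h = 1, y1 = 1, y2 = 1/2 on the subinterval [1, 2]):

```python
from fractions import Fraction as F

# Clamped spline on [1, 2] with M1 = 173/120, M2 = 7/60 from the example.
M1, M2 = F(173, 120), F(7, 60)
y1, y2 = F(1), F(1, 2)
x1, x2, h = F(1), F(2), F(1)

def s(x):
    return ((x2 - x)**3 * M1 + (x - x1)**3 * M2) / (6 * h) \
         + ((x2 - x) * y1 + (x - x1) * y2) / h \
         - h / 6 * ((x2 - x) * M1 + (x - x1) * M2)

val = s(F(3, 2))
print(val, float(val))  # 1253/1920, approximately 0.65260
```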
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
(x1, y1) , (x2, y2) , ..., (xn, yn)
and assume the node points {xi} are evenly spaced,

    xj = x1 + (j − 1)h,   j = 1, ..., n
We have that the interpolating spline s(x) on
xj ≤ x ≤ xj+1 is given by
    s(x) = [(xj+1 − x)³ Mj + (x − xj)³ Mj+1] / (6h)
         + [(xj+1 − x) yj + (x − xj) yj+1] / h
         − (h/6) [(xj+1 − x) Mj + (x − xj) Mj+1]

for j = 1, ..., n − 1.
To enforce continuity of s′(x) at the interior node points x2, ..., xn−1, the second derivatives {Mj} must satisfy the linear equations

    (h/6) Mj−1 + (2h/3) Mj + (h/6) Mj+1 = (yj−1 − 2yj + yj+1)/h

for j = 2, ..., n − 1. Writing them out,
    (h/6) M1 + (2h/3) M2 + (h/6) M3 = (y1 − 2y2 + y3)/h

    (h/6) M2 + (2h/3) M3 + (h/6) M4 = (y2 − 2y3 + y4)/h

    ...

    (h/6) Mn−2 + (2h/3) Mn−1 + (h/6) Mn = (yn−2 − 2yn−1 + yn)/h
This is a system of n − 2 equations in the n unknowns {M1, ..., Mn}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added
boundary conditions, this form of linear system can be
solved very efficiently.
BOUNDARY CONDITIONS
“Natural” boundary conditions
    s″(x1) = s″(xn) = 0

Spline functions satisfying these conditions are called “natural cubic splines”. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.
“Clamped” boundary conditions. We add the conditions

    s′(x1) = y′1,   s′(xn) = y′n

with y′1, y′n given slopes for the endpoints of s(x) on [x1, xn]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.
“Not a knot” boundary conditions. This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in Matlab.
THE “NOT A KNOT” CONDITIONS
As before, let the interpolation nodes be
(x1, y1) , (x2, y2) , ..., (xn, yn)
We separate these points into two categories. For
constructing the interpolating cubic spline function,
we use the points
    (x1, y1) , (x3, y3) , ..., (xn−2, yn−2) , (xn, yn)

thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals

    [x1, x3] , [x3, x4] , ..., [xn−3, xn−2] , [xn−2, xn]

This leads to n − 4 equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn. The two additional boundary conditions are

    s(x2) = y2,   s(xn−1) = yn−1

These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn.
    x   0     1     2     2.5   3     3.5     4
    y   2.5   0.5   0.5   1.5   1.5   1.125   0

[Figure: interpolating cubic spline function with “not-a-knot” boundary conditions]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
(x1, y1) , (x2, y2) , ..., (xn, yn)
type arrays containing the x and y coordinates:

    x = [x1 x2 ... xn]
    y = [y1 y2 ... yn]
    plot(x, y, 'o')
The last statement will draw a plot of the data points,
marking them with the letter ‘oh’. To find the interpolating cubic spline function and evaluate it at the
points of another array xx, say
h = (xn − x1) / (10 ∗ n) ; xx = x1 : h : xn;
use
    yy = spline(x, y, xx)
    plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a
continuous curve.
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval [a, b] be given, and then define

    h = (b − a)/(n − 1),   xj = a + (j − 1)h,   j = 1, ..., n
Suppose we want to approximate a given function
f(x) on the interval [a, b] using cubic spline interpolation. Define
    yj = f(xj),   j = 1, ..., n
Let sn(x) denote the cubic spline interpolating this
data and satisfying the “not a knot” boundary conditions. Then it can be shown that for a suitable
constant c,
    En ≡ max_{a ≤ x ≤ b} |f(x) − sn(x)| ≤ c·h⁴
The corresponding bound for natural cubic spline interpolation contains only a term of h² rather than h⁴; it does not converge to zero as rapidly.
EXAMPLE
Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error En for various values of n. The values of h are being successively
Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce
    E(p) = max_{a ≤ x ≤ b} |f(x) − p(x)|

This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].
With an eye towards efficiency, we want to find the ‘best’ possible approximation of a given degree n. With this in mind, introduce the following:

    ρn(f) = min_{deg(p) ≤ n} E(p) = min_{deg(p) ≤ n} [ max_{a ≤ x ≤ b} |f(x) − p(x)| ]

The number ρn(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by mn(x); thus E(mn) = ρn(f).
Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(tn), tn(x) the Taylor polynomial of degree n for e^x about x = 0, and
Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function
    Tn(x) = cos(n·cos⁻¹(x)),   −1 ≤ x ≤ 1    (1)

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

    θ = cos⁻¹(x)  or  x = cos(θ),   0 ≤ θ ≤ π    (2)

Then

    Tn(x) = cos(nθ)    (3)
Example.

    n = 0:  T0(x) = cos(0·θ) = 1
    n = 1:  T1(x) = cos(θ) = x
    n = 2:  T2(x) = cos(2θ) = 2cos²(θ) − 1 = 2x² − 1
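Definition (1) can be checked against the worked examples numerically. A sketch in Python:

```python
import math

def T(n, x):
    """Chebyshev polynomial via the defining formula (1), valid on [-1, 1]."""
    return math.cos(n * math.acos(x))

# Check the worked examples T0, T1, T2 at a few sample points.
for x in [-0.9, -0.3, 0.0, 0.5, 1.0]:
    assert abs(T(0, x) - 1.0) < 1e-12
    assert abs(T(1, x) - x) < 1e-12
    assert abs(T(2, x) - (2 * x * x - 1)) < 1e-12
```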
[Figure: graphs of T0(x), T1(x), T2(x) on [−1, 1]]

[Figure: graphs of T3(x), T4(x) on [−1, 1]]
The triple recursion relation. Recall the trigonometric addition formulas,

    cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)

Let n ≥ 1, and apply these identities to get
value of h, call it h∗, below which the error bound will begin to increase. To find it, set E′(h) = 0, with its root being h∗. This leads to h∗ ≈ 0.0726, which is consistent with the behavior of the errors in the table.
LINEAR SYSTEMS
Consider the following example of a linear system:
    x1 + 2x2 + 3x3 = −5
    −x1 + x3 = −3
    3x1 + x2 + 3x3 = −3

Its unique solution is

    x1 = 1,   x2 = 0,   x3 = −2

In general we want to solve n equations in n unknowns. For this, we need some simplifying notation. In particular we introduce arrays. We can think
of these as means for storing information about the
linear system in a computer. In the above case, we
introduce
    A = |  1   2   3 |      b = | −5 |      x = |  1 |
        | −1   0   1 |          | −3 |          |  0 |
        |  3   1   3 |          | −3 |          | −2 |
These arrays completely specify the linear system and
its solution. We also know that we can give meaning to multiplication and addition of these quantities,
calling them matrices and vectors. The linear system
is then written as
Ax = b
with Ax denoting a matrix-vector multiplication.
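The statement Ax = b can be verified for the example by forming the matrix-vector product directly. A sketch in Python:

```python
# Store the example system as arrays and check that the stated solution
# satisfies Ax = b, with Ax computed as a matrix-vector product.
A = [[1, 2, 3],
     [-1, 0, 1],
     [3, 1, 3]]
b = [-5, -3, -3]
x = [1, 0, -2]

def matvec(A, x):
    """Matrix-vector multiplication: (Ax)_i = sum_j a_ij * x_j."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

assert matvec(A, x) == b
```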
The general system is written as
    a1,1 x1 + · · · + a1,n xn = b1
    ...
    an,1 x1 + · · · + an,n xn = bn
This is a system of n linear equations in the n unknowns x1, ..., xn. This can be written in matrix-vector notation as
Ax = b
    A = | a1,1  · · ·  a1,n |      b = | b1 |      x = | x1 |
        |  ...   ...    ... |          | ...|          | ...|
        | an,1  · · ·  an,n |          | bn |          | xn |
A TRIDIAGONAL SYSTEM
Consider the tridiagonal linear system
    3x1 − x2 = 2
    −x1 + 3x2 − x3 = 1
    ...
    −xn−2 + 3xn−1 − xn = 1
    −xn−1 + 3xn = 2
The solution is
x1 = · · · = xn = 1
This has the associated arrays
    A = |  3  −1   0  · · ·   0 |      b = | 2 |      x = | 1 |
        | −1   3  −1          0 |          | 1 |          | 1 |
        |      ...  ...  ...    |          | ...|         | ...|
        |  0  · · · −1   3  −1  |          | 1 |          | 1 |
        |  0  · · ·  0  −1   3  |          | 2 |          | 1 |
SOLVING LINEAR SYSTEMS
Linear systems Ax = b occur widely in applied mathematics. They occur as direct formulations of “real world” problems; but more often, they occur as a part of the numerical analysis of some other problem. As examples of the latter, we have the construction of spline functions, the numerical solution of systems of nonlinear equations, ordinary and partial differential equations, integral equations, and the solution of optimization problems.
There are many ways of classifying linear systems.
Size: Small, moderate, and large. This of course varies with the machine you are using.
For a matrix A of order n× n, it will take 8n2 bytes
to store it in double precision. Thus a matrix of order
8000 will need around 512 MB of storage. The latter
would be too large for most present day PCs, if the
matrix was to be stored in the computer’s memory,
although one can easily expand a PC to contain much
more memory than this.
Sparse vs. Dense. Many linear systems have a matrix A in which almost all the elements are zero. These matrices are said to be sparse. For example, it is quite
matrices, it does not make sense to store the zero ele-
ments; and the sparsity should be taken into account
when solving the linear system Ax = b. Also, the
sparsity need not be as regular as in this example.
BASIC DEFINITIONS AND THEORY
A homogeneous linear system Ax = b is one for which the right-hand constants are all zero. Using vector notation, we say b is the zero vector for a homogeneous system. Otherwise the linear system is called non-homogeneous.
Theorem. The following are equivalent statements.
(1) For each b, there is exactly one solution x.
(2) For each b, there is a solution x.
(3) The homogeneous system Ax = 0 has only the solution x = 0.

(4) det(A) ≠ 0.

(5) The inverse matrix A⁻¹ exists.
EXAMPLE. Consider again the tridiagonal system
    3x1 − x2 = 2
    −x1 + 3x2 − x3 = 1
    ...
    −xn−2 + 3xn−1 − xn = 1
    −xn−1 + 3xn = 2

The homogeneous version is simply

    3x1 − x2 = 0
    −x1 + 3x2 − x3 = 0
    ...
    −xn−2 + 3xn−1 − xn = 0
    −xn−1 + 3xn = 0
Assume x ≠ 0, and therefore that x has nonzero components. Let xk denote a component of maximum size:

    |xk| = max_{1 ≤ j ≤ n} |xj|

Consider now equation k, and assume 1 < k < n. Then

    −xk−1 + 3xk − xk+1 = 0
    xk = (1/3)(xk−1 + xk+1)
    |xk| ≤ (1/3)(|xk−1| + |xk+1|) ≤ (1/3)(|xk| + |xk|) = (2/3)|xk|

This implies xk = 0, and therefore x = 0. A similar proof is valid if k = 1 or k = n, using the first or the last equation, respectively.
Thus the original tridiagonal linear system Ax = b has
a unique solution x for each right side b.
METHODS OF SOLUTION
There are two general categories of numerical methods
for solving Ax = b.
Direct Methods: These are methods with a finite
number of steps; and they end with the exact solution
x, provided that all arithmetic operations are exact.
The most used of these methods is Gaussian elimination, which we begin with. There are other direct
methods, but we do not study them here.
Iteration Methods: These are used in solving all types
of linear systems, but they are most commonly used
with large sparse systems, especially those produced
by discretizing partial differential equations. This is
for k = n − 1, ..., 1. What we have done here is simply a more carefully defined and methodical version of what you have done in high school algebra.
How do we carry out the conversion of

    | a(1)1,1  · · ·  a(1)1,n | b(1)1 |
    |   ...     ...     ...   |  ...  |
    | a(1)n,1  · · ·  a(1)n,n | b(1)n |

to

    | a(1)1,1  · · ·  a(1)1,n | b(1)1 |
    |    0      ...     ...   |  ...  |
    |    0   · · · 0  a(n)n,n | b(n)n |

To help us keep track of the steps of this process, we will denote the initial system by

    [A(1) | b(1)] = | a(1)1,1  · · ·  a(1)1,n | b(1)1 |
                    |   ...     ...     ...   |  ...  |
                    | a(1)n,1  · · ·  a(1)n,n | b(1)n |
Initially we will make the assumption that every pivot
element will be nonzero; and later we remove this
assumption.
Step 1. We will eliminate x1 from equations 2 thru
n. Begin by defining the multipliers

    mi,1 = a(1)i,1 / a(1)1,1,   i = 2, ..., n
Here we are assuming the pivot element a(1)1,1 ≠ 0.
Then in succession, multiply mi,1 times row 1 (called
the pivot row) and subtract the result from row i.
This yields new matrix elements
    a(2)i,j = a(1)i,j − mi,1 a(1)1,j,   j = 2, ..., n

    b(2)i = b(1)i − mi,1 b(1)1
for i = 2, ..., n.
Note that the index j does not include j = 1. The reason is that with the definition of the multiplier mi,1, it is automatic that

    a(2)i,1 = a(1)i,1 − mi,1 a(1)1,1 = 0,   i = 2, ..., n
The augmented matrix now is

    [A(2) | b(2)] = | a(1)1,1  a(1)1,2  · · ·  a(1)1,n | b(1)1 |
                    |    0     a(2)2,2  · · ·  a(2)2,n | b(2)2 |
                    |   ...      ...     ...     ...   |  ...  |
                    |    0     a(2)n,2  · · ·  a(2)n,n | b(2)n |
Step k: Assume that for i = 1, ..., k − 1 the unknown xi has been eliminated from equations i + 1 thru n. We have the augmented matrix

    [A(k) | b(k)] = | a(1)1,1  a(1)1,2  · · ·           a(1)1,n | b(1)1 |
                    |    0     a(2)2,2  · · ·           a(2)2,n | b(2)2 |
                    |            ...      ...             ...   |  ...  |
                    |   ...      0    a(k)k,k  · · ·   a(k)k,n  | b(k)k |
                    |   ...     ...     ...               ...   |  ...  |
                    |    0   · · · 0  a(k)n,k  · · ·   a(k)n,n  | b(k)n |
We want to eliminate unknown xk from equations k + 1 thru n. Begin by defining the multipliers

    mi,k = a(k)i,k / a(k)k,k,   i = k + 1, ..., n
The pivot element is a(k)k,k, and we assume it is nonzero.
Using these multipliers, we eliminate xk from equa-
tions k + 1 thru n. Multiply mi,k times row k (the
pivot row) and subtract from row i, for i = k+1 thru
n.
    a(k+1)i,j = a(k)i,j − mi,k a(k)k,j,   j = k + 1, ..., n

    b(k+1)i = b(k)i − mi,k b(k)k
for i = k+1, ..., n. This yields the augmented matrix
[A(k+1) | b(k+1)]:

    | a(1)1,1  · · ·                               a(1)1,n | b(1)1     |
    |    0       ...                                 ...   |  ...      |
    |          a(k)k,k  a(k)k,k+1  · · ·           a(k)k,n | b(k)k     |
    |   ...       0     a(k+1)k+1,k+1  · · ·   a(k+1)k+1,n | b(k+1)k+1 |
    |   ...      ...         ...                     ...   |  ...      |
    |    0    · · · 0   a(k+1)n,k+1  · · ·       a(k+1)n,n | b(k+1)n   |
Doing this for k = 1, 2, ..., n − 1 leads to the upper triangular system with the augmented matrix

    | a(1)1,1  · · ·  a(1)1,n | b(1)1 |
    |    0      ...     ...   |  ...  |
    |    0   · · · 0  a(n)n,n | b(n)n |
We later remove the assumption

    a(k)k,k ≠ 0,   k = 1, 2, ..., n
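The elimination and back-substitution steps above can be sketched as follows, in Python rather than Matlab. No pivoting yet, so every pivot is assumed nonzero, as in the text:

```python
def gauss_elim(A, b):
    """Reduce [A | b] to upper triangular form [U | g], then back-substitute.
    Assumes every pivot a(k)_{k,k} is nonzero (no pivoting)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n - 1):                # step k: eliminate x_k below row k
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]         # multiplier m_{i,k}
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n                         # back substitution
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

# The example system from earlier, with solution (1, 0, -2).
x = gauss_elim([[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0], [3.0, 1.0, 3.0]],
               [-5.0, -3.0, -3.0])
print(x)
```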
QUESTIONS
• How do we remove the assumption on the pivot elements?

• How many operations are involved in this procedure?

• How much error is there in the computed solution due to rounding errors in the calculations?

• How does the machine architecture affect the implementation of this algorithm?
PARTIAL PIVOTING
Recall the reduction of
    [A(1) | b(1)] = | a(1)1,1  · · ·  a(1)1,n | b(1)1 |
                    |   ...     ...     ...   |  ...  |
                    | a(1)n,1  · · ·  a(1)n,n | b(1)n |
to
    [A(2) | b(2)] = | a(1)1,1  a(1)1,2  · · ·  a(1)1,n | b(1)1 |
                    |    0     a(2)2,2  · · ·  a(2)2,n | b(2)2 |
                    |   ...      ...     ...     ...   |  ...  |
                    |    0     a(2)n,2  · · ·  a(2)n,n | b(2)n |
What if a(1)1,1 = 0? In that case we look for an equation in which x1 is present. To do this in such a way as to avoid zero pivots to the maximum extent possible, we do the following.
Look at all the elements in the first column,

    a(1)1,1, a(1)2,1, ..., a(1)n,1

and pick the largest in size. Say it is

    |a(1)k,1| = max_{j = 1, ..., n} |a(1)j,1|

Then interchange equations 1 and k, which means interchanging rows 1 and k in the augmented matrix [A(1) | b(1)]. Then proceed with the elimination of x1 from equations 2 thru n as before.
Having obtained

    [A(2) | b(2)] = | a(1)1,1  a(1)1,2  · · ·  a(1)1,n | b(1)1 |
                    |    0     a(2)2,2  · · ·  a(2)2,n | b(2)2 |
                    |   ...      ...     ...     ...   |  ...  |
                    |    0     a(2)n,2  · · ·  a(2)n,n | b(2)n |

what if a(2)2,2 = 0? Then we proceed as before.
Among the elements

    a(2)2,2, a(2)3,2, ..., a(2)n,2

pick the one of largest size:

    |a(2)k,2| = max_{j = 2, ..., n} |a(2)j,2|

Interchange rows 2 and k. Then proceed as before to eliminate x2 from equations 3 thru n, thus obtaining
    [A(3) | b(3)] = | a(1)1,1  a(1)1,2  a(1)1,3  · · ·  a(1)1,n | b(1)1 |
                    |    0     a(2)2,2  a(2)2,3  · · ·  a(2)2,n | b(2)2 |
                    |    0        0     a(3)3,3  · · ·  a(3)3,n | b(3)3 |
                    |   ...      ...      ...      ...    ...   |  ...  |
                    |    0        0     a(3)n,3  · · ·  a(3)n,n | b(3)n |
This is done at every stage of the elimination process.
This technique is called partial pivoting, and it is a
part of most Gaussian elimination programs (including
the one in the text).
Consequences of partial pivoting. Recall the definition of the elements obtained in the process of eliminating x1 from equations 2 thru n.

    mi,1 = a(1)i,1 / a(1)1,1,   i = 2, ..., n

    a(2)i,j = a(1)i,j − mi,1 a(1)1,j,   j = 2, ..., n

    b(2)i = b(1)i − mi,1 b(1)1
for i = 2, ..., n. By our definition of the pivot element a(1)1,1, we have

    |mi,1| ≤ 1,   i = 2, ..., n

Thus in the calculation of a(2)i,j and b(2)i, we have that the elements do not grow rapidly in size. This is in comparison to what might happen otherwise, in which the multipliers mi,1 might have been very large. This property is true of the multipliers at every step of the elimination process:

    |mi,k| ≤ 1,   i = k + 1, ..., n,   k = 1, ..., n − 1
The property

    |mi,k| ≤ 1,   i = k + 1, ..., n

leads to good error propagation properties in Gaussian elimination with partial pivoting. The only error in Gaussian elimination is that derived from the rounding errors in the arithmetic operations. For example, at the first elimination step (eliminating x1 from equations 2 thru n),

    a(2)i,j = a(1)i,j − mi,1 a(1)1,j,   j = 2, ..., n

    b(2)i = b(1)i − mi,1 b(1)1
The above property on the size of the multipliers prevents these numbers and the errors in their calculation from growing as rapidly as they might if no partial pivoting was used.
As an example of the improvement in accuracy obtained with partial pivoting, see the example on pages 262-263.
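The improvement is easy to see on a 2×2 system with a tiny pivot; the numbers below are my own illustration, not the textbook's pages 262-263 example:

```python
# Effect of partial pivoting on a system with a tiny pivot:
#   1e-17*x1 + x2 = 1
#        x1 + x2 = 2
# The exact solution is very close to x1 = x2 = 1.
def solve2(a11, a12, b1, a21, a22, b2, pivot):
    if pivot and abs(a21) > abs(a11):      # interchange the two rows
        a11, a12, b1, a21, a22, b2 = a21, a22, b2, a11, a12, b1
    m = a21 / a11                          # multiplier m_{2,1}
    a22, b2 = a22 - m * a12, b2 - m * b1   # eliminate x1 from row 2
    x2 = b2 / a22
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

print(solve2(1e-17, 1, 1, 1, 1, 2, pivot=False))  # x1 is badly wrong
print(solve2(1e-17, 1, 1, 1, 1, 2, pivot=True))   # close to (1, 1)
```

Without pivoting the multiplier is 1e17, and the subtraction a22 − m·a12 wipes out the information in row 2; with pivoting all multipliers have size at most 1.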
OPERATION COUNTS
One of the major ways in which we compare the efficiency of different numerical methods is to count the number of needed arithmetic operations. For solving
the linear system
    a1,1 x1 + · · · + a1,n xn = b1
    ...
    an,1 x1 + · · · + an,n xn = bn
using Gaussian elimination, we have the following op-
eration counts.
1. A → U, where we are converting Ax = b to Ux = g:

    Divisions:        n(n − 1)/2
    Additions:        n(n − 1)(2n − 1)/6
    Multiplications:  n(n − 1)(2n − 1)/6

2. b → g:

    Additions:        n(n − 1)/2
    Multiplications:  n(n − 1)/2

3. Solving Ux = g:

    Divisions:        n
    Additions:        n(n − 1)/2
    Multiplications:  n(n − 1)/2
On some machines, the cost of a division is much
more than that of a multiplication; whereas on others
there is not any important difference. We assume the
latter; and then the operation costs are as follows.
    MD(A → U) = n(n² − 1)/3
    MD(b → g) = n(n − 1)/2
    MD(Find x) = n(n + 1)/2

    AS(A → U) = n(n − 1)(2n − 1)/6
    AS(b → g) = n(n − 1)/2
    AS(Find x) = n(n − 1)/2
Thus the total number of operations is

    Additions:                      (2n³ + 3n² − 5n)/6
    Multiplications and divisions:  (n³ + 3n² − n)/3

Both are around n³/3, and thus the total operation count is approximately

    (2/3) n³
What happens to the cost when n is doubled?
Solving Ax = b and Ax = c. What is the cost? Only the modification of the right side is different in these two cases. Thus the additional cost is

    MD(b → g) + MD(Find x) = n²
    AS(b → g) + AS(Find x) = n(n − 1)

The total is around 2n² operations, which is quite a bit smaller than (2/3)n³ when n is even moderately large, say n = 100.
Thus one can solve the linear system Ax = c at little
additional cost to that for solving Ax = b. This has
important consequences when it comes to estimation
of the error in computed solutions.
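This reuse is what an LU factorization captures: do the O(n³) elimination on A once, keep the multipliers, and then each new right-hand side costs only about 2n² operations. A sketch in Python (no pivoting; assumes nonzero pivots):

```python
def lu_factor(A):
    """Gaussian elimination on A alone, storing the multipliers m_{i,k}
    in the lower triangle (no pivoting; assumes nonzero pivots)."""
    n = len(A)
    M = [row[:] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]            # store m_{i,k}
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    return M

def lu_solve(M, b):
    """About 2n^2 operations: forward elimination on b, then back substitution."""
    n = len(b)
    g = b[:]
    for k in range(n - 1):                # b -> g, using the stored multipliers
        for i in range(k + 1, n):
            g[i] -= M[i][k] * g[k]
    x = [0.0] * n                         # solve Ux = g
    for k in range(n - 1, -1, -1):
        x[k] = (g[k] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

A = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0], [3.0, 1.0, 3.0]]
M = lu_factor(A)                          # O(n^3) work, done once
print(lu_solve(M, [-5.0, -3.0, -3.0]))    # solves Ax = b
print(lu_solve(M, [6.0, 0.0, 7.0]))       # solves Ax = c at O(n^2) extra cost
```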
CALCULATING THE MATRIX INVERSE
Consider finding the inverse of a 3 × 3 matrix

    A = | a1,1  a1,2  a1,3 | = [A∗,1, A∗,2, A∗,3]
        | a2,1  a2,2  a2,3 |
        | a3,1  a3,2  a3,3 |

We want to find a matrix

    X = [X∗,1, X∗,2, X∗,3]

for which

    AX = I
    A [X∗,1, X∗,2, X∗,3] = [e1, e2, e3]
    [AX∗,1, AX∗,2, AX∗,3] = [e1, e2, e3]

This means we want to solve

    AX∗,1 = e1,   AX∗,2 = e2,   AX∗,3 = e3

We want to solve three linear systems, all with the same matrix of coefficients A.
MATRIX INVERSE EXAMPLE
A = [1 1 −2; 1 1 1; 1 −1 0]
Augment A with the identity matrix:
[1 1 −2 | 1 0 0; 1 1 1 | 0 1 0; 1 −1 0 | 0 0 1]
Eliminating with multipliers m2,1 = 1 and m3,1 = 1:
[1 1 −2 | 1 0 0; 0 0 3 | −1 1 0; 0 −2 2 | −1 0 1]
Interchanging rows 2 and 3:
[1 1 −2 | 1 0 0; 0 −2 2 | −1 0 1; 0 0 3 | −1 1 0]
Then by using back substitution to solve for each column of the inverse, we obtain
A⁻¹ = [1/6 1/3 1/2; 1/6 1/3 −1/2; −1/3 1/3 0]
COST OF MATRIX INVERSION
In calculating A⁻¹, we are solving for the matrix X = [X∗,1, X∗,2, . . . , X∗,n] where
A [X∗,1, X∗,2, . . . , X∗,n] = [e1, e2, . . . , en]
and ej is column j of the identity matrix. Thus we are solving n linear systems
AX∗,1 = e1, AX∗,2 = e2, . . . , AX∗,n = en    (1)
all with the same coefficient matrix. Returning to the earlier operation counts for solving a single linear system, we have the following.
Cost of triangulating A: approx. (2/3)n³ operations
Cost of solving Ax = b: 2n² operations
Thus solving the n linear systems in (1) costs approximately
(2/3)n³ + n(2n²) = (8/3)n³ operations
It costs approximately four times as many operations to invert A as to solve a single system. With attention to the form of the right-hand sides in (1), this can be reduced to 2n³ operations.
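The column-by-column idea can be checked directly on the 3 × 3 worked example. The following Python sketch (my own helper names; the lecture's computations are by hand) solves A X∗,j = ej for each j in exact rational arithmetic and recovers the inverse computed above.

```python
# Check of the worked example: invert A by solving A X_j = e_j,
# one column at a time, with exact rational arithmetic.
from fractions import Fraction

def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(A)
    A = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot row
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

A = [[Fraction(v) for v in row] for row in [[1, 1, -2], [1, 1, 1], [1, -1, 0]]]
e = lambda j: [Fraction(int(i == j)) for i in range(3)]
cols = [solve(A, e(j)) for j in range(3)]          # columns of A^(-1)
Ainv = [[cols[j][i] for j in range(3)] for i in range(3)]
print(Ainv[0])   # [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]
```

Each extra column costs only the 2n² "new right-hand side" work, which is what makes the (8/3)n³ total plausible.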
MATLAB MATRIX OPERATIONS
To solve the linear system Ax = b in Matlab, use
x = A \ b
In Matlab, the command
inv(A)
will calculate the inverse of A.
There are many matrix operations built into Matlab, both for general matrices and for special classes of matrices. We do not discuss those here, but recommend the student to investigate these through the Matlab
Embedded in this formula we have a dot product. This is in fact typical of this process, with the length of the inner products varying from one position to another.
Recalling the discussion of dot products, we can evaluate this last formula using higher precision arithmetic and thus avoid many rounding errors. This leads to a variant of Gaussian elimination in which there are far fewer rounding errors.
With ordinary Gaussian elimination, the number of rounding errors is proportional to n³. This variant reduces the number of rounding errors to being proportional to only n², which can lead to major increases in accuracy, especially for matrices that are very sensitive to small changes.
TRIDIAGONAL MATRICES
A =
[ b1  c1   0    0   · · ·  0  ]
[ a2  b2   c2   0             ]
[ 0   a3   b3   c3            ]
[          · · ·              ]
[      an−1  bn−1  cn−1       ]
[ 0   · · ·      an    bn     ]
These occur very commonly in the numerical solution of partial differential equations, as well as in other applications.
Additions: 2n − 2
Multiplications: 2n − 2
Divisions: n
Thus the total number of arithmetic operations is approximately 3n to factor A; and it takes about 5n to solve the linear system using the factorization of A.
If we had A⁻¹ at no cost, what would it cost to compute x = A⁻¹f?
xi = Σ_{j=1}^{n} (A⁻¹)i,j fj, i = 1, ..., n
This matrix-vector multiplication costs about 2n² operations, far more than the roughly 5n needed with the factorization.
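The O(n) counts above come from exploiting the zero pattern. Here is a Python sketch of the tridiagonal solver those counts describe (often called the Thomas algorithm); the function name and test data are mine, with a, b, c holding the sub-, main and super-diagonals as in the matrix A displayed above.

```python
# Sketch of the O(n) tridiagonal solver implied by the counts above.
def solve_tridiag(a, b, c, f):
    n = len(b)
    b, f = b[:], f[:]
    for i in range(1, n):              # forward elimination
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        f[i] -= m * f[i - 1]
    x = [0.0] * n
    x[-1] = f[-1] / b[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = (f[i] - c[i] * x[i + 1]) / b[i]
    return x

# diagonally dominant test system with known solution x = 1, 2, ..., 6
n = 6
a = [0.0] + [-1.0] * (n - 1)           # a[0] unused
b = [2.5] * n
c = [-1.0] * (n - 1) + [0.0]           # c[n-1] unused
xtrue = [float(i + 1) for i in range(n)]
f = [a[i] * (xtrue[i - 1] if i else 0) + b[i] * xtrue[i]
     + (c[i] * xtrue[i + 1] if i < n - 1 else 0) for i in range(n)]
print(solve_tridiag(a, b, c, f))   # recovers [1.0, 2.0, ..., 6.0] up to roundoff
```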
MATLAB MATRIX OPERATIONS
To obtain the LU-factorization of a matrix, including
the use of partial pivoting, use the Matlab command
lu. In particular,
[L, U, P ] = lu(X)
returns the lower triangular matrix L, upper triangular
matrix U , and permutation matrix P so that
PX = LU
NUMERICAL INTEGRATION
How do you evaluate
I = ∫_a^b f(x) dx ?
From calculus, if F(x) is an antiderivative of f(x), then
I = ∫_a^b f(x) dx = F(b) − F(a)
However, in practice most integrals cannot be evaluated by this means. And even when this can work, an approximate numerical method may be much simpler and easier to use. For example, the integrand in
∫_0^1 dx/(1 + x⁵)
has an extremely complicated antiderivative, and it is easier to evaluate the integral by approximate means. Try evaluating this integral with Maple or Mathematica.
NUMERICAL INTEGRATION: A GENERAL FRAMEWORK
Returning to a lesson used earlier with rootfinding: if you cannot solve a problem, then replace it with a "near-by" problem that you can solve. In our case, we want to evaluate
I = ∫_a^b f(x) dx
To do so, many of the numerical schemes are based on choosing approximates of f(x). Calling one such approximation f̃(x), use
I ≈ ∫_a^b f̃(x) dx ≡ Ĩ
What is the error?
E = I − Ĩ = ∫_a^b [f(x) − f̃(x)] dx
|E| ≤ ∫_a^b |f(x) − f̃(x)| dx ≤ (b − a) ‖f − f̃‖∞
where
‖f − f̃‖∞ ≡ max_{a≤x≤b} |f(x) − f̃(x)|
We also want to choose the approximates f̃(x) of a form we can integrate directly and easily. Examples are polynomials, trig functions, piecewise polynomials, and others.
If we use polynomial approximations, then how do we choose them? At this point, we have two choices:
1. Taylor polynomials approximating f(x)
2. Interpolatory polynomials approximating f(x)
EXAMPLE
Consider evaluating
I = ∫_0^1 e^(x²) dx
Use
e^t = 1 + t + t²/2! + · · · + tⁿ/n! + t^(n+1)/(n+1)! · e^(c_t)
e^(x²) = 1 + x² + x⁴/2! + · · · + x^(2n)/n! + x^(2n+2)/(n+1)! · e^(c_x)
with 0 ≤ c_x ≤ x². Then
I = ∫_0^1 [1 + x² + x⁴/2! + · · · + x^(2n)/n!] dx + 1/(n+1)! ∫_0^1 x^(2n+2) e^(c_x) dx
Taking n = 3, we have
I = 1 + 1/3 + 1/10 + 1/42 + E = 1.4571 + E
0 < E ≤ (e/24) ∫_0^1 x⁸ dx = e/216 = .0126
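The n = 3 computation can be replayed numerically. This Python sketch (helper names are mine) integrates the Taylor polynomial term by term, which gives the series 1 + 1/3 + 1/10 + 1/42 above, and compares it with a fine composite trapezoidal value of the true integral.

```python
# Replaying the worked example: integrate the degree-2n Taylor
# polynomial of e^(x^2) term by term over [0, 1].
import math

def taylor_integral(n):
    # integral of sum_{k=0..n} x^(2k)/k!  =  sum_{k=0..n} 1/(k! (2k+1))
    return sum(1.0 / (math.factorial(k) * (2 * k + 1)) for k in range(n + 1))

approx = taylor_integral(3)                      # 1 + 1/3 + 1/10 + 1/42
# reference value from a very fine composite trapezoidal rule
m = 20000
h = 1.0 / m
true = h * (0.5 * math.exp(0.0) + 0.5 * math.exp(1.0)
            + sum(math.exp((i * h) ** 2) for i in range(1, m)))
print(approx, true - approx)   # error is positive and below e/216 = .0126
```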
USING INTERPOLATORY POLYNOMIALS
In spite of the simplicity of the above example, it is
generally more difficult to do numerical integration by
constructing Taylor polynomial approximations than
by constructing polynomial interpolates. We therefore
construct the function f̃ in
∫_a^b f(x) dx ≈ ∫_a^b f̃(x) dx
by means of interpolation.
Initially, we consider only the case in which the in-
terpolation is based on interpolation at evenly spaced
node points.
LINEAR INTERPOLATION
The linear interpolant to f(x), interpolating at a and b, is given by
P1(x) = [(b − x) f(a) + (x − a) f(b)] / (b − a)
Using this linear interpolant, we obtain the approximation
∫_a^b f(x) dx ≈ ∫_a^b P1(x) dx = (b − a)/2 · [f(a) + f(b)] ≡ T1(f)
The rule
∫_a^b f(x) dx ≈ T1(f)
is called the trapezoidal rule.
[Figure: illustrating I ≈ T1(f), with y = f(x) and y = p1(x) on [a, b]]
Example.
∫_0^{π/2} sin x dx ≈ (π/4) [sin 0 + sin(π/2)] = π/4 ≐ .785398
Error = .215
HOW TO OBTAIN GREATER ACCURACY?
How do we improve our estimate of the integral
I = ∫_a^b f(x) dx ?
One direction is to increase the degree of the approximation, moving next to a quadratic interpolating polynomial for f(x). We first look at an alternative. Instead of using the trapezoidal rule on the original interval [a, b], apply it to integrals of f(x) over smaller subintervals. For example:
I = ∫_a^c f(x) dx + ∫_c^b f(x) dx, c = (a + b)/2
≈ (c − a)/2 · [f(a) + f(c)] + (b − c)/2 · [f(c) + f(b)]
= (h/2) [f(a) + 2f(c) + f(b)] ≡ T2(f), h = (b − a)/2
Example.
∫_0^{π/2} sin x dx ≈ (π/8) [sin 0 + 2 sin(π/4) + sin(π/2)] ≐ .948059
Error = .0519
[Figure: illustrating I ≈ T3(f), with y = f(x) and nodes a = x0, x1, x2, x3 = b]
THE TRAPEZOIDAL RULE
We can continue as above by dividing [a, b] into even smaller subintervals and applying
∫_α^β f(x) dx ≈ (β − α)/2 · [f(α) + f(β)]    (∗)
on each of the smaller subintervals. Begin by introducing a positive integer n ≥ 1,
h = (b − a)/n, xj = a + j h, j = 0, 1, ..., n
Then
I = ∫_{x0}^{xn} f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + · · · + ∫_{xn−1}^{xn} f(x) dx
Use [α, β] = [x0, x1], [x1, x2], ..., [xn−1, xn], for eachof which the subinterval has length h.
Then applying (∗) on each subinterval, we have
I ≈ (h/2) [f(x0) + f(x1)] + (h/2) [f(x1) + f(x2)] + · · · + (h/2) [f(xn−2) + f(xn−1)] + (h/2) [f(xn−1) + f(xn)]
Simplifying,
I ≈ h [ (1/2) f(a) + f(x1) + · · · + f(xn−1) + (1/2) f(b) ] ≡ Tn(f)
This is called the "composite trapezoidal rule", or simply the trapezoidal rule. This formula can be further simplified, and we will do so in two ways.
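The composite rule Tn(f) above is a few lines of code. Here is a direct Python sketch (the function name is mine) that also reproduces the two sin-integral values computed in the examples.

```python
# The composite trapezoidal rule T_n(f) above, as a direct sketch.
import math

def trapezoid(f, a, b, n):
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + j * h) for j in range(1, n))
    return h * s

print(trapezoid(math.sin, 0.0, math.pi / 2, 1))   # 0.785398...
print(trapezoid(math.sin, 0.0, math.pi / 2, 2))   # 0.948059...
```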
Rewrite this error as
E_n^T(f) = −(h³ n/12) [ (f″(γ1) + · · · + f″(γn)) / n ]
Denote the quantity inside the brackets by ζn. This number satisfies
min_{a≤x≤b} f″(x) ≤ ζn ≤ max_{a≤x≤b} f″(x)
Since f″(x) is a continuous function (by original assumption), there must be some number cn in [a, b] for which
f″(cn) = ζn
Recall also that h n = b − a. Then
E_n^T(f) = −(h³ n/12) [ (f″(γ1) + · · · + f″(γn)) / n ] = −(h² (b − a)/12) f″(cn)
This is the error formula given on the first slide.
AN ERROR ESTIMATE
We now obtain a way to estimate the error E_n^T(f). Return to the formula
E_n^T(f) = −(h³/12) f″(γ1) − · · · − (h³/12) f″(γn)
and rewrite it as
E_n^T(f) = −(h²/12) [ f″(γ1) h + · · · + f″(γn) h ]
The quantity
f″(γ1) h + · · · + f″(γn) h
is a Riemann sum for the integral
∫_a^b f″(x) dx = f′(b) − f′(a)
By this we mean
lim_{n→∞} [ f″(γ1) h + · · · + f″(γn) h ] = ∫_a^b f″(x) dx
Thus
f″(γ1) h + · · · + f″(γn) h ≈ f′(b) − f′(a)
for larger values of n. Combining this with the earlier error formula, we have
E_n^T(f) ≈ −(h²/12) [ f′(b) − f′(a) ] ≡ Ẽ_n^T(f)
This is a computable estimate of the error in the numerical integration. It is called an asymptotic error estimate.
Example. Consider evaluating
I(f) = ∫_0^π e^x cos x dx = −(e^π + 1)/2 ≐ −12.070346
In this case,
f′(x) = e^x [cos x − sin x]
f″(x) = −2 e^x sin x
max_{0≤x≤π} |f″(x)| = |f″(.75π)| = 14.921
Then
E_n^T(f) = −(h² (b − a)/12) f″(cn)
|E_n^T(f)| ≤ (h² π/12) · 14.921 = 3.906 h²
Also
Ẽ_n^T(f) = −(h²/12) [f′(π) − f′(0)] = (h²/12) [e^π + 1] ≐ 2.012 h²
From
I(f) − Tn(f) ≈ −(h²/12) [f′(b) − f′(a)]
we obtain
I(f) ≈ Tn(f) − (h²/12) [f′(b) − f′(a)] ≡ CTn(f)
This is the corrected trapezoidal rule. It is easy to obtain from the trapezoidal rule, and in most cases it converges more rapidly than the trapezoidal rule.
Table 3. Asymptotic and corrected trapezoidal rule applied to integral I(1) from Example 1.
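A quick numerical check of the corrected rule on the example I = ∫_0^π e^x cos x dx, as a Python sketch (the helper trapezoid is redefined here so the snippet is self-contained; the exact value and f′ are from the example above).

```python
# Sketch of CT_n(f) = T_n(f) - (h^2/12)[f'(b) - f'(a)] on the example
# integral of e^x cos x over [0, pi].
import math

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + j * h) for j in range(1, n)))

f = lambda x: math.exp(x) * math.cos(x)
fp = lambda x: math.exp(x) * (math.cos(x) - math.sin(x))   # f'
a, b, n = 0.0, math.pi, 16
h = (b - a) / n
I = -(math.exp(math.pi) + 1) / 2                 # exact value, about -12.070346
Tn = trapezoid(f, a, b, n)
CTn = Tn - h**2 / 12 * (fp(b) - fp(a))
print(I - Tn, I - CTn)   # the corrected rule is markedly more accurate
```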
which are to be exact for polynomials of as large a degree as possible. There are no restrictions placed on the nodes {xj} nor the weights {wj} in working towards that goal. The motivation is that if it is exact for high degree polynomials, then perhaps it will be very accurate when integrating functions that are well approximated by polynomials.
There is no guarantee that such an approach will work. In fact, it turns out to be a bad idea when the node points {xj} are required to be evenly spaced over the interval of integration. But without this restriction on {xj}, we are able to develop a very accurate set of quadrature formulas.
The case n = 1. We want a formula
w1 f(x1) ≈ ∫_{−1}^{1} f(x) dx
The weight w1 and the node x1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. To do this we substitute f(x) = 1 and f(x) = x. The first choice leads to
w1 · 1 = ∫_{−1}^{1} 1 dx = 2, so w1 = 2
The choice f(x) = x leads to
w1 x1 = ∫_{−1}^{1} x dx = 0, so x1 = 0
The desired formula is
∫_{−1}^{1} f(x) dx ≈ 2 f(0)
It is called the midpoint rule.
The case n = 2. We want a formula
w1 f(x1) + w2 f(x2) ≈ ∫_{−1}^{1} f(x) dx
The weights w1, w2 and the nodes x1, x2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We substitute and force equality for
f(x) = 1, x, x², x³
This leads to the system
w1 + w2 = ∫_{−1}^{1} 1 dx = 2
w1 x1 + w2 x2 = ∫_{−1}^{1} x dx = 0
w1 x1² + w2 x2² = ∫_{−1}^{1} x² dx = 2/3
w1 x1³ + w2 x2³ = ∫_{−1}^{1} x³ dx = 0
The solution is given by
w1 = w2 = 1, x1 = −1/√3, x2 = 1/√3
This yields the formula
∫_{−1}^{1} f(x) dx ≈ f(−1/√3) + f(1/√3)    (1)
We say it has degree of precision equal to 3 since it integrates exactly all polynomials of degree ≤ 3. We can verify directly that it does not integrate exactly f(x) = x⁴:
∫_{−1}^{1} x⁴ dx = 2/5
f(−1/√3) + f(1/√3) = 2/9
Thus (1) has degree of precision exactly 3.
EXAMPLE Integrate
∫_{−1}^{1} dx/(3 + x) = log 2 ≐ 0.69314718
The formula (1) yields
1/(3 + x1) + 1/(3 + x2) = 0.69230769
Error = .000839
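The two-point rule (1) is trivial to program. A Python sketch of the example (function name mine):

```python
# The two-point Gauss rule (1) applied to the example integral.
import math

def gauss2(f):
    """Two-point Gaussian quadrature on [-1, 1]: nodes +-1/sqrt(3), weights 1."""
    t = 1.0 / math.sqrt(3.0)
    return f(-t) + f(t)

val = gauss2(lambda x: 1.0 / (3.0 + x))
print(val, math.log(2) - val)   # 0.6923..., error about 8.4e-4
```

Note the value works out exactly to 6/(9 − 1/3) = 9/13.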
THE GENERAL CASE
We want to find the weights {wi} and nodes {xi} so as to have
∫_{−1}^{1} f(x) dx ≈ Σ_{j=1}^{n} wj f(xj)
be exact for polynomials f(x) of as large a degree as possible. As unknowns, there are n weights wi and n nodes xi. Thus it makes sense to initially impose 2n conditions so as to obtain 2n equations for the 2n unknowns. We require the quadrature formula to be exact for the cases
f(x) = x^i, i = 0, 1, 2, ..., 2n − 1
Then we obtain the system of equations
w1 x1^i + w2 x2^i + · · · + wn xn^i = ∫_{−1}^{1} x^i dx
for i = 0, 1, 2, ..., 2n − 1. For the right sides,
∫_{−1}^{1} x^i dx = 2/(i + 1) for i = 0, 2, ..., 2n − 2, and 0 for i = 1, 3, ..., 2n − 1
The system of equations
w1 x1^i + · · · + wn xn^i = ∫_{−1}^{1} x^i dx, i = 0, ..., 2n − 1
has a solution, and the solution is unique except for re-ordering the unknowns. The resulting numerical integration rule is called Gaussian quadrature.
In fact, the nodes and weights are not found by solv-
ing this system. Rather, the nodes and weights have
other properties which enable them to be found more
easily by other methods. There are programs to pro-
duce them; and most subroutine libraries have either
a program to produce them or tables of them for com-
monly used cases.
CHANGE OF INTERVAL
OF INTEGRATION
Integrals on other finite intervals [a, b] can be converted to integrals over [−1, 1], as follows:
∫_a^b F(x) dx = (b − a)/2 · ∫_{−1}^{1} F( (b + a + t(b − a))/2 ) dt
based on the change of integration variables
x = (b + a + t(b − a))/2, −1 ≤ t ≤ 1
EXAMPLE Over the interval [0, π], use
x = (1 + t) π/2
Then
∫_0^π F(x) dx = (π/2) ∫_{−1}^{1} F((1 + t) π/2) dt
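Combining the two-point rule with this change of variable gives a Gauss rule on a general [a, b]. A Python sketch (function name mine), tried on ∫_0^π sin x dx = 2:

```python
# Two-point Gauss rule on [a, b] via the change of variable
# x = (b + a + t(b - a))/2 described above.
import math

def gauss2_ab(f, a, b):
    t = 1.0 / math.sqrt(3.0)
    g = lambda s: f((b + a + s * (b - a)) / 2.0)
    return (b - a) / 2.0 * (g(-t) + g(t))

res = gauss2_ab(math.sin, 0.0, math.pi)
print(res)   # about 1.9358; the exact integral is 2
```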
AN ERROR FORMULA
The usual error formula for the Gaussian quadrature formula,
En(f) = ∫_{−1}^{1} f(x) dx − Σ_{j=1}^{n} wj f(xj)
is not particularly intuitive. It is given by
En(f) = en · f^(2n)(cn) / (2n)!
en = 2^(2n+1) (n!)⁴ / [ (2n + 1) ((2n)!)² ]
for some −1 ≤ cn ≤ 1.
To help in understanding the implications of this error formula, introduce
Mk = max_{−1≤x≤1} |f^(k)(x)| / k!
With many integrands f(x), this sequence {Mk} is bounded or even decreases to zero. For example,
f(x) = cos x ⇒ Mk ≤ 1/k!
f(x) = 1/(2 + x) ⇒ Mk ≤ 1
Then for our error formula,
|En(f)| ≤ en M2n    (2)
By other methods, we can show
en ≈ π/4ⁿ
When combined with (2) and an assumption of uniform boundedness for {Mk}, we have that the error decreases by a factor of at least 4 with each increase of n to n + 1. Compare this to the convergence of the trapezoidal and Simpson rules for such functions, to help explain the very rapid convergence of Gaussian quadrature.
A SECOND ERROR FORMULA
Let f(x) be continuous for a ≤ x ≤ b; let n ≥ 1. Then, for the Gaussian numerical integration formula
I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^{n} wj f(xj) ≡ In
on [a, b], the error in In satisfies
|I(f) − In(f)| ≤ 2 (b − a) ρ_{2n−1}(f)    (3)
Here ρ_{2n−1}(f) is the minimax error of degree 2n − 1 for f(x) on [a, b]:
ρm(f) = min_{deg(p)≤m} [ max_{a≤x≤b} |f(x) − p(x)| ], m ≥ 0
EXAMPLE Let f(x) = e^(−x²). Then the minimax errors ρm(f) are given in the following table.
EXAMPLE Consider evaluating the integral
∫_0^1 x^(1/3) cos x dx    (5)
In applying (4), we take f(x) = cos x. Then
w1 f(x1) + w2 f(x2) = 0.6074977951
The true answer is
∫_0^1 x^(1/3) cos x dx ≐ 0.6076257393
and our numerical answer is in error by E2 ≐ .000128. This is quite a good answer involving very little computational effort (once the formula has been determined). In contrast, the trapezoidal and Simpson rules applied to (5) would converge very slowly because the first derivative of the integrand is singular at the origin.
CHANGE OF VARIABLES
As a side note to the preceding example, we observe that the change of variables x = t³ transforms the integral (5) to
3 ∫_0^1 t³ cos(t³) dt
and both the trapezoidal and Simpson rules will perform better with this formula, although still not as well as our weighted Gaussian quadrature.
A change of the integration variable can often improve the performance of a standard method, usually by increasing the differentiability of the integrand.
EXAMPLE Using x = t^r for some r > 1, we have
∫_0^1 g(x) log x dx = r² ∫_0^1 t^(r−1) g(t^r) log t dt
The new integrand is generally smoother than the original one.
INTERPOLATION
Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).
As an example, consider defining
x0 = 0, x1 = π/4, x2 = π/2
and
yi = cos xi, i = 0, 1, 2
This gives us the three points
(0, 1), (π/4, 1/√2), (π/2, 0)
Now find a quadratic polynomial
p(x) = a0 + a1 x + a2 x²
for which
p(xi) = yi, i = 0, 1, 2
The graph of this polynomial is shown on the accompanying graph. We later give an explicit formula.
[Figure: quadratic interpolation of cos(x), showing y = cos(x) and y = p2(x) on [0, π/2]]
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(xi, yi)} with a func-tion given analytically.
2. Approximate functions with simpler ones, usually
polynomials or ‘piecewise polynomials’.
Purpose #1 has several aspects.
• The data may be from a known class of functions.Interpolation is then used to find the member of
this class of functions that agrees with the given
data. For example, data may be generated from
functions of the form
p(x) = a0 + a1 e^x + a2 e^(2x) + · · · + an e^(nx)
Then we need to find the coefficients {aj} based on the given data values.
• We may want to take function values f(x) givenin a table for selected values of x, often equally
spaced, and extend the function to values of x
not in the table.
For example, given numbers from a table of loga-
rithms, estimate the logarithm of a number x not
in the table.
• Given a set of data points {(xi, yi)}, find a curve passing through these points that is “pleasing to the
eye”. In fact, this is what is done continually with
computer graphics. How do we connect a set of
points to make a smooth curve? Connecting them
with straight line segments will often give a curve
with many corners, whereas what was intended
was a smooth curve.
Purpose #2 for interpolation is to approximate func-
tions f(x) by simpler functions p(x), perhaps to make
it easier to integrate or differentiate f(x). That will
be the primary reason for studying interpolation in this
course.
As an example of why this is important, consider the problem of evaluating
I = ∫_0^1 dx/(1 + x^10)
This is very difficult to do analytically. But we will
look at producing polynomial interpolants of the inte-
grand; and polynomials are easily integrated exactly.
We begin by using polynomials as our means of doing
interpolation. Later in the chapter, we consider more
complex ‘piecewise polynomial’ functions, often called
‘spline functions’.
LINEAR INTERPOLATION
The simplest form of interpolation is the straight line, connecting two points. Let two data points (x0, y0) and (x1, y1) be given. There is a unique straight line passing through these points. We can write the formula for a straight line as
P1(x) = a0 + a1 x
In fact, there are other more convenient ways to write it, and we give several of them below.
P1(x) = (x − x1)/(x0 − x1) · y0 + (x − x0)/(x1 − x0) · y1
      = [(x1 − x) y0 + (x − x0) y1] / (x1 − x0)
      = y0 + (x − x0)/(x1 − x0) · [y1 − y0]
      = y0 + [(y1 − y0)/(x1 − x0)] (x − x0)
Check each of these by evaluating them at x = x0 and x1 to see if the respective values are y0 and y1.
Example. Following is a table of values for f(x) = tan x for a few values of x.
x:     1      1.1    1.2    1.3
tan x: 1.5574 1.9648 2.5722 3.6021
Use linear interpolation to estimate tan(1.15). Then use
x0 = 1.1, x1 = 1.2
with corresponding values for y0 and y1. Then
tan(1.15) ≈ y0 + (x − x0)/(x1 − x0) · [y1 − y0]
          = 1.9648 + (1.15 − 1.1)/(1.2 − 1.1) · [2.5722 − 1.9648] = 2.2685
The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
[Figures: y = tan(x) on [1, 1.3], and y = tan(x) with the linear interpolant y = p1(x) on [1.1, 1.2]]
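The hand computation above is one line of code. A Python sketch (function name mine):

```python
# Linear interpolation estimate of tan(1.15) from the table values.
def lin_interp(x0, y0, x1, y1, x):
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

est = lin_interp(1.1, 1.9648, 1.2, 2.5722, 1.15)
print(est)   # about 2.2685, matching the hand computation
```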
QUADRATIC INTERPOLATION
We want to find a polynomial
P2(x) = a0 + a1 x + a2 x²
which satisfies
P2(xi) = yi, i = 0, 1, 2
for given data points (x0, y0), (x1, y1), (x2, y2). One formula for such a polynomial follows:
P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)    (∗∗)
with
L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
The formula (∗∗) is called Lagrange's form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]
are called 'Lagrange basis functions' for quadratic interpolation. They have the properties
Li(xj) = 1 if i = j, and 0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.
As a consequence of each Li(x) being of degree 2, we have that the interpolant
P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)
must have degree ≤ 2.
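The Lagrange form can be evaluated directly. A Python sketch (names mine) for the earlier cos(x) data (0, 1), (π/4, 1/√2), (π/2, 0):

```python
# Lagrange form of P2 for the cos(x) data used earlier.
import math

xs = [0.0, math.pi / 4, math.pi / 2]
ys = [math.cos(x) for x in xs]

def p2(x):
    total = 0.0
    for i in range(3):
        L = 1.0
        for j in range(3):
            if j != i:
                L *= (x - xs[j]) / (xs[i] - xs[j])   # Lagrange basis L_i(x)
        total += ys[i] * L
    return total

print(p2(math.pi / 4), math.cos(math.pi / 4))   # interpolation is exact at the nodes
```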
UNIQUENESS
Can there be another polynomial, call it Q(x), for which
deg(Q) ≤ 2
Q(xi) = yi, i = 0, 1, 2
Thus, is the Lagrange formula P2(x) unique?
Introduce
R(x) = P2(x) − Q(x)
From the properties of P2 and Q, we have deg(R) ≤ 2. Moreover,
R(xi) = P2(xi) − Q(xi) = yi − yi = 0
for all three node points x0, x1, and x2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
R(x) = 0 for all x
Q(x) = P2(x) for all x
SPECIAL CASES
Consider the data points
(x0, 1), (x1, 1), (x2, 1)
What is the polynomial P2(x) in this case?
Answer: We must have
P2(x) ≡ 1
meaning that P2(x) is the constant function. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore by the uniqueness of quadratic interpolation, P2(x) must be the constant function 1.
Consider now the data points
(x0, m x0), (x1, m x1), (x2, m x2)
for some constant m. What is P2(x) in this case? By an argument similar to that above,
P2(x) = m x for all x
Thus the degree of P2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial Pn(x) for which
deg(Pn) ≤ n
Pn(xi) = yi, i = 0, 1, · · · , n    (∗∗)
with given data points
(x0, y0), (x1, y1), · · · , (xn, yn)
The solution is given by Lagrange's formula
Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)
The Lagrange basis functions are given by
Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)] / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier. In a manner analogous to the quadratic case, we can show that the above Pn(x) is the only solution to the problem (∗∗).
From this formula we can see that each such function Lk(x) is a polynomial of degree n. In addition,
Lk(xi) = 1 if k = i, and 0 if k ≠ i
Using these properties, it follows that the formula
Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)
satisfies the interpolation problem of finding a solution to
deg(Pn) ≤ n
Pn(xi) = yi, i = 0, 1, · · · , n
EXAMPLE
Recall the table
x:     1      1.1    1.2    1.3
tan x: 1.5574 1.9648 2.5722 3.6021
We now interpolate this table with the nodes
x0 = 1, x1 = 1.1, x2 = 1.2, x3 = 1.3
Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3.
n:        1      2      3
Pn(1.15): 2.2685 2.2435 2.2296
Error:    −.0340 −.0090 .0049
It improves with increasing degree n, but not at a very rapid rate. In fact, the error becomes worse when n is increased further. Later we will see that interpolation of a much higher degree, say n ≥ 10, is often poorly behaved when the node points {xi} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x0 and x1, define
f[x0, x1] = [f(x1) − f(x0)] / (x1 − x0)
This is called a first order divided difference of f(x). By the Mean-value theorem,
f(x1) − f(x0) = f′(c) (x1 − x0)
for some c between x0 and x1. Thus
f[x0, x1] = f′(c)
and the divided difference is very much like the derivative, especially if x0 and x1 are quite close together. In fact,
f′((x1 + x0)/2) ≈ f[x0, x1]
is quite an accurate approximation of the derivative (see §5.4).
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x0, x1, and x2, define
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
This is called the second order divided difference of f(x). By a fairly complicated argument, we can show
f[x0, x1, x2] = (1/2) f″(c)
for some c intermediate to x0, x1, and x2. In fact, as we investigate in §5.4,
f″(x1) ≈ 2 f[x0, x1, x2]
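Both divided differences are easy to try numerically. A Python sketch (function names mine), using f(x) = tan x near the table values used earlier; the tolerances below are loose because these are O(h²) and O(h) approximations, not exact identities.

```python
# First and second order divided differences as derivative estimates,
# sketched for f(x) = tan(x).
import math

def dd1(f, x0, x1):
    return (f(x1) - f(x0)) / (x1 - x0)

def dd2(f, x0, x1, x2):
    return (dd1(f, x1, x2) - dd1(f, x0, x1)) / (x2 - x0)

f = math.tan
d1 = dd1(f, 1.1, 1.2)                 # approximates f'(1.15) = sec^2(1.15)
d2 = 2 * dd2(f, 1.1, 1.2, 1.3)        # approximates f''(1.2)
print(d1, d2)
```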
From the graphs, there is enormous variation in the size of Ψn(x) as x varies over [0, 1]; and thus there is also enormous variation in the error as x so varies. For example, in the n = 9 case,
max_{x0≤x≤x1} |Ψn(x)| / (n + 1)! = 3.39 × 10⁻¹¹
max_{x4≤x≤x5} |Ψn(x)| / (n + 1)! = 6.89 × 10⁻¹³
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when x0 ≤ x ≤ x1 as compared to the case when x4 ≤ x ≤ x5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x0, ..., xn} being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation
polynomial to approximate a given function f(x) on
a given interval [a, b]. In particular, take interpolation
nodes
a ≤ x0 < x1 < · · · < xn−1 < xn ≤ b
and produce the interpolation polynomial Pn(x) that interpolates f(x) at the given node points. We would like to have
max_{a≤x≤b} |f(x) − Pn(x)| → 0 as n → ∞
Does it happen?
Recall the error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [ |Ψn(x)| / (n + 1)! ] · max_{a≤x≤b} |f^(n+1)(x)|
We begin with an example using evenly spaced node points.
RUNGE’S EXAMPLE
Use evenly spaced node points:
h = (b − a)/n, xi = a + i h for i = 0, ..., n
For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [ |Ψn(x)| / (n + 1)! ] · max_{a≤x≤b} |f^(n+1)(x)|
can badly hurt or destroy the convergence of other cases.
In particular, we show the graph of f(x) = 1/(1 + x²) and Pn(x) on [−5, 5] for the cases n = 8 and n = 12. The case n = 10 is in the text on page 127. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.
[Figure: Runge’s example with n = 10, showing y = 1/(1 + x²) and y = P10(x)]
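The divergence is easy to reproduce. This Python sketch (helper names mine) interpolates 1/(1 + x²) at evenly spaced nodes on [−5, 5] and measures the maximum error on a fine sample grid; the error grows as the degree increases, as claimed above.

```python
# Sketch of Runge's example: evenly spaced interpolation of 1/(1 + x^2)
# on [-5, 5]; the maximum error grows with the degree n.
def lagrange(xs, ys, x):
    total = 0.0
    for i in range(len(xs)):
        L = 1.0
        for j in range(len(xs)):
            if j != i:
                L *= (x - xs[j]) / (xs[i] - xs[j])
        total += ys[i] * L
    return total

def max_error(n, samples=501):
    f = lambda x: 1.0 / (1.0 + x * x)
    xs = [-5.0 + 10.0 * i / n for i in range(n + 1)]
    ys = [f(x) for x in xs]
    return max(abs(f(t) - lagrange(xs, ys, t))
               for t in (-5.0 + 10.0 * k / (samples - 1) for k in range(samples)))

print([round(max_error(n), 3) for n in (4, 8, 12)])   # errors grow with n
```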
OTHER CHOICES OF NODES
Recall the general error bound
max_{a≤x≤b} |f(x) − Pn(x)| ≤ max_{a≤x≤b} [ |Ψn(x)| / (n + 1)! ] · max_{a≤x≤b} |f^(n+1)(x)|
There is nothing we can really do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible.
This problem has quite an elegant solution, and it is taken up in §4.6. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial Tn+1(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them
max_{a≤x≤b} |Ψn(x)| = ((b − a)/2)^(n+1) · 2^(−n)
This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(c)
Ψn(x) = (x − x0)(x − x1) · · · (x − xn)
with c between the minimum and maximum of {x0, ..., xn, x}. A second formula is given by
f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
To show this is a simple, but somewhat subtle argument. Let Pn+1(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then
Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1] (x − x0) · · · (x − xn)
Substituting x = xn+1, and using the fact that Pn+1(x)
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
x2 = x1 + h, x3 = x1 + 2h, x4 = x1 + 3h
Then our earlier formulas simplify to
s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and [x3, x4].
Without going through all of the algebra, the conditions (**) lead to the following pair of equations:
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
This gives us two equations in four unknowns. The earlier boundary conditions on s″(x) give us immediately
M1 = M4 = 0
Then we can solve the linear system for M2 and M3.
EXAMPLE
Consider the interpolation data points
x: 1  2    3    4
y: 1  1/2  1/3  1/4
In this case, h = 1, and the linear system becomes
(2/3) M2 + (1/6) M3 = y3 − 2y2 + y1 = 1/3
(1/6) M2 + (2/3) M3 = y4 − 2y3 + y2 = 1/12
This has the solution
M2 = 1/2, M3 = 0
This leads to the spline function formula on each subinterval. On [1, 2],
s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
     = [(2 − x)³ · 0 + (x − 1)³ (1/2)] / 6
     + [(2 − x) · 1 + (x − 1)(1/2)] / 1
     − (1/6) [(2 − x) · 0 + (x − 1)(1/2)]
     = (1/12)(x − 1)³ − (7/12)(x − 1) + 1
Similarly, for 2 ≤ x ≤ 3,
s(x) = −(1/12)(x − 2)³ + (1/4)(x − 2)² − (1/3)(x − 2) + 1/2
and for 3 ≤ x ≤ 4,
s(x) = −(1/12)(x − 4) + 1/4
[Figure: graph of this example of natural cubic spline interpolation, showing y = 1/x and y = s(x) on [1, 4]]
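The three cubic pieces of the example can be checked directly. A Python sketch (with the middle piece written in powers of (x − 2)) confirming that s interpolates (1, 1), (2, 1/2), (3, 1/3), (4, 1/4):

```python
# Check of the worked example: the three cubic pieces interpolate the data.
def s(x):
    if x <= 2:
        return (x - 1) ** 3 / 12 - 7 * (x - 1) / 12 + 1
    if x <= 3:
        return -(x - 2) ** 3 / 12 + (x - 2) ** 2 / 4 - (x - 2) / 3 + 0.5
    return -(x - 4) / 12 + 0.25

for xi, yi in [(1, 1.0), (2, 0.5), (3, 1 / 3), (4, 0.25)]:
    print(xi, s(xi), yi)   # s(xi) matches yi at every knot
```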
A second data set:
x: 0   1   2   2.5 3   3.5   4
y: 2.5 0.5 0.5 1.5 1.5 1.125 0
[Figure: interpolating natural cubic spline function for this data]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f″(x) = 2/x³, and thus we could use
M1 = 2, M4 = 1/32
With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
s(xi) = yi, i = 1, 2, 3, 4
with the boundary conditions
s′(x1) = y′1, s′(x4) = y′4    (#)
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
(h/3) M1 + (h/6) M2 = (y2 − y1)/h − y′1
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
(h/6) M3 + (h/3) M4 = y′4 − (y4 − y3)/h
For our numerical example, it is natural to obtain these derivative values from f′(x) = −1/x²:
y′1 = −1, y′4 = −1/16
When combined with our earlier equations, we have the system
(1/3) M1 + (1/6) M2 = 1/2
(1/6) M1 + (2/3) M2 + (1/6) M3 = 1/3
(1/6) M2 + (2/3) M3 + (1/6) M4 = 1/12
(1/6) M3 + (1/3) M4 = 1/48
This has the solution
[M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
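The 4 × 4 clamped-spline system can be solved exactly to confirm the stated solution. A Python sketch (the solver is my own small helper) using rational arithmetic:

```python
# Solving the 4x4 clamped-spline system above exactly with Fractions,
# to confirm [M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60].
from fractions import Fraction as F

A = [[F(1, 3), F(1, 6), F(0),    F(0)   ],
     [F(1, 6), F(2, 3), F(1, 6), F(0)   ],
     [F(0),    F(1, 6), F(2, 3), F(1, 6)],
     [F(0),    F(0),    F(1, 6), F(1, 3)]]
rhs = [F(1, 2), F(1, 3), F(1, 12), F(1, 48)]

def solve(A, b):
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [F(0)] * n
    for i in range(n - 1, -1, -1):
        x[i] = (A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

M = solve(A, rhs)
print(M)   # [Fraction(173, 120), Fraction(7, 60), Fraction(11, 120), Fraction(1, 60)]
```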
We can now write the functions s(x) for each of the subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for x1 ≤ x ≤ x2,
s(x) = [(x2 − x)³ M1 + (x − x1)³ M2] / (6h)
     + [(x2 − x) y1 + (x − x1) y2] / h
     − (h/6) [(x2 − x) M1 + (x − x1) M2]
We can substitute in from the data
x: 1  2    3    4
y: 1  1/2  1/3  1/4
and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,
f(x) = 1/x, f(3/2) = 2/3, s(3/2) = .65260
This is quite a decent approximation.
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
(x1, y1), (x2, y2), ..., (xn, yn)
and assume the node points {xi} are evenly spaced,
xj = x1 + (j − 1) h, j = 1, ..., n
We have that the interpolating spline s(x) on xj ≤ x ≤ xj+1 is given by
s(x) = [(xj+1 − x)³ Mj + (x − xj)³ Mj+1] / (6h)
     + [(xj+1 − x) yj + (x − xj) yj+1] / h
     − (h/6) [(xj+1 − x) Mj + (x − xj) Mj+1]
for j = 1, ..., n − 1.
To enforce continuity of s′(x) at the interior node points x2, ..., xn−1, the second derivatives {Mj} must satisfy the linear equations
(h/6) Mj−1 + (2h/3) Mj + (h/6) Mj+1 = (yj−1 − 2yj + yj+1)/h
for j = 2, ..., n − 1. Writing them out,
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y1 − 2y2 + y3)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y2 − 2y3 + y4)/h
...
(h/6) Mn−2 + (2h/3) Mn−1 + (h/6) Mn = (yn−2 − 2yn−1 + yn)/h
This is a system of n − 2 equations in the n unknowns {M1, ..., Mn}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
BOUNDARY CONDITIONS
“Natural” boundary conditions:
s″(x1) = s″(xn) = 0
Spline functions satisfying these conditions are called “natural cubic splines”. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.
“Clamped” boundary conditions: We add the conditions
s′(x1) = y′1, s′(xn) = y′n
with y′1, y′n given slopes for the endpoints of s(x) on [x1, xn]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.
“Not a knot” boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in Matlab.
THE “NOT A KNOT” CONDITIONS
As before, let the interpolation nodes be
(x1, y1) , (x2, y2) , ..., (xn, yn)
We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points

$$(x_1, y_1), (x_3, y_3), ..., (x_{n-2}, y_{n-2}), (x_n, y_n)$$

thus deleting two of the points, (x_2, y_2) and (x_{n−1}, y_{n−1}). We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals

$$[x_1, x_3], [x_3, x_4], ..., [x_{n-3}, x_{n-2}], [x_{n-2}, x_n]$$

This leads to n − 4 equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n. The two additional boundary
conditions are
$$s(x_2) = y_2, \qquad s(x_{n-1}) = y_{n-1}$$

These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n.
    x : 0     1     2     2.5   3     3.5    4
    y : 2.5   0.5   0.5   1.5   1.5   1.125  0

[Figure: interpolating cubic spline function with “not-a-knot” boundary conditions for the data above.]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
(x1, y1) , (x2, y2) , ..., (xn, yn)
type in arrays containing the x and y coordinates:

    x = [x1 x2 ... xn]
    y = [y1 y2 ... yn]
    plot(x, y, 'o')
The last statement will draw a plot of the data points, marking them with the letter ‘oh’. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say
    h = (xn - x1)/(10*n); xx = x1:h:xn;
use
    yy = spline(x, y, xx)
    plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
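An equivalent sketch in Python, assuming scipy is available; scipy.interpolate.CubicSpline uses the not-a-knot condition by default, matching Matlab's spline. The data are taken from the table above.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# data points from the not-a-knot example above
x = np.array([0, 1, 2, 2.5, 3, 3.5, 4], dtype=float)
y = np.array([2.5, 0.5, 0.5, 1.5, 1.5, 1.125, 0])

s = CubicSpline(x, y)                  # bc_type='not-a-knot' is the default
xx = np.linspace(x[0], x[-1], 10 * len(x))
yy = s(xx)                             # spline values on the fine grid
# plotting (uncomment if matplotlib is available):
# import matplotlib.pyplot as plt
# plt.plot(x, y, 'o', xx, yy); plt.show()
```

The spline reproduces the data exactly at the nodes; passing `bc_type='natural'` instead gives the natural spline, and `bc_type=((1, slope_a), (1, slope_b))` prescribes endpoint slopes, i.e. the clamped conditions discussed above.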
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval [a, b] be given, and then define

$$h = \frac{b - a}{n - 1}, \qquad x_j = a + (j - 1)h, \quad j = 1, ..., n$$
Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define

$$y_j = f(x_j), \qquad j = 1, ..., n$$
Let s_n(x) denote the cubic spline interpolating this data and satisfying the “not a knot” boundary conditions. Then it can be shown that for a suitable constant c,

$$E_n \equiv \max_{a \le x \le b} |f(x) - s_n(x)| \le c\,h^4$$
The corresponding bound for natural cubic spline interpolation contains $h^2$ rather than $h^4$; its error does not converge to zero as rapidly.
EXAMPLE
Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error E_n for various values of n. The values of h are successively halved.
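The experiment can be reproduced with a short Python sketch; `max_error` is an illustrative helper of my own, scipy's CubicSpline defaults to the not-a-knot condition, and the maximum error is only estimated on a finite grid.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def max_error(n, f=np.arctan, a=0.0, b=5.0):
    """E_n = max |f(x) - s_n(x)| for the not-a-knot cubic spline s_n
    interpolating f at n evenly spaced nodes on [a, b], estimated
    on a fine evaluation grid."""
    nodes = np.linspace(a, b, n)
    s = CubicSpline(nodes, f(nodes))   # 'not-a-knot' is the default bc_type
    t = np.linspace(a, b, 2001)
    return float(np.max(np.abs(f(t) - s(t))))

# each step roughly halves h, so E_n should shrink by about 2^4 = 16
errors = [max_error(n) for n in (7, 13, 25, 49)]
```

The successive ratios errors[i]/errors[i+1] approach roughly 16 as h decreases, consistent with the O(h^4) bound above.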
BEST APPROXIMATION

Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce

$$E(p) = \max_{a \le x \le b} |f(x) - p(x)|$$

This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].
With an eye towards efficiency, we want to find the ‘best’ possible approximation of a given degree n. With this in mind, introduce the following:

$$\rho_n(f) = \min_{\deg(p) \le n} E(p) = \min_{\deg(p) \le n} \left[ \max_{a \le x \le b} |f(x) - p(x)| \right]$$

The number ρ_n(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by m_n(x); thus E(m_n) = ρ_n(f).
Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(t_n), with t_n(x) the Taylor polynomial of degree n for e^x about x = 0, and the corresponding minimax errors ρ_n(f).
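The entries E(t_n) can be estimated numerically. A minimal Python sketch (`taylor_error` is an illustrative name, and the maximum is taken over a finite grid rather than exactly):

```python
import math
import numpy as np

def taylor_error(n, grid=np.linspace(-1.0, 1.0, 2001)):
    """Estimate E(t_n) = max over [-1, 1] of |e^x - t_n(x)|, where
    t_n is the degree-n Taylor polynomial of e^x about x = 0."""
    tn = sum(grid**k / math.factorial(k) for k in range(n + 1))
    return float(np.max(np.abs(np.exp(grid) - tn)))
```

For instance, E(t_1) is attained at x = 1 and equals e − 2 ≈ 0.71828, and the errors decrease steadily with n.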
CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

$$T_n(x) = \cos\left(n \cos^{-1} x\right), \quad -1 \le x \le 1 \qquad (1)$$

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

$$\theta = \cos^{-1}(x) \quad \text{or} \quad x = \cos(\theta), \quad 0 \le \theta \le \pi \qquad (2)$$

Then

$$T_n(x) = \cos(n\theta) \qquad (3)$$
Example.

n = 0:  $T_0(x) = \cos(0 \cdot \theta) = 1$

n = 1:  $T_1(x) = \cos(\theta) = x$

n = 2:  $T_2(x) = \cos(2\theta) = 2\cos^2(\theta) - 1 = 2x^2 - 1$
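These low-degree cases can be checked numerically; a short sketch (the helper name `cheb_T` is my own), evaluating the defining formula (1) and comparing against the polynomial forms:

```python
import numpy as np

def cheb_T(n, x):
    """Chebyshev polynomial via the defining formula (1):
    T_n(x) = cos(n * arccos(x)), valid for -1 <= x <= 1."""
    return np.cos(n * np.arccos(x))

xg = np.linspace(-1.0, 1.0, 201)
# cheb_T(0, xg), cheb_T(1, xg), cheb_T(2, xg) agree (to rounding)
# with 1, x, and 2x^2 - 1 respectively
```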
[Figure: graphs of T_0(x), T_1(x), T_2(x) on [−1, 1].]

[Figure: graphs of T_3(x) and T_4(x) on [−1, 1].]
The triple recursion relation. Recall the trigonometric addition formulas,

$$\cos(\alpha \pm \beta) = \cos(\alpha)\cos(\beta) \mp \sin(\alpha)\sin(\beta)$$

Let n ≥ 1, and apply these identities to get