3. Review ofMatrix Algebra
Vectors and matrices are essential for modern analysis of systems of equations —
algebrai, differential, functional, etc. In this part, we will review the most basic facts of
matrix arithmetic. See [42] for full details.
3.1. Matrices and Vectors.
A matrix is a rectangular array of numbers. Thus,
1 0 3
−2 4 1
,
0
e 1
2
− 1 .83
√5 −4
7
, ( .2 −1.6 .32 ),
0
0
,
1 3
−2 5
,
are all examples of matrices. We use the notation
A =
a11 a12 . . . a1n
a21 a22 . . . a2n
...
...
. . .
...
am1 am2 . . . amn
(3.1)
for a general matrix of size m×n (read “m by n”), where m denotes the number of rows in
A and n denotes the number of columns. Thus, the preceding examples of matrices have
respective sizes 2 × 3, 4 × 2, 1 × 3, 2 × 1, and 2 × 2. A matrix is square if m = n, i.e., it
has the same number of rows as columns. A column vector is a m×1 matrix, while a row
vector is a 1×n matrix. As we shall see, column vectors are by far the more important of
the two, and the term “vector” without qualification will always mean “column vector”.
A 1 × 1 matrix, which has but a single entry, is both a row and a column vector.
The number that lies in the ith row and the jth column of A is called the (i, j) entry
of A, and is denoted by aij . The row index always appears first and the column index
second. Two matrices are equal, A = B, if and only if they have the same size, and all
their entries are the same: aij = bij for i = 1, . . . ,m and j = 1, . . . , n.
5/18/08 35 c
2008 Peter J. Olver
A general linear system of m equations in n unknowns will take the form
a11 x1 + a12 x2 + · · · + a1n xn = b1,
a21 x1 + a22 x2 + · · · + a2n xn = b2,
...
...
...
am1 x1 + am2 x2 + · · · + amn xn = bm.
(3.2)
As such, it is composed of three basic ingredients: the m × n coefficient matrix A, with
entries aij as in (3.1), the column vector x =
x1
x2
...
xn
containing the unknowns, and the
column vector b =
b1
b2
...
bm
containing right hand sides. As an example, consider the linear
system
x + 2y + z = 2,
2y + z = 7,
x + y + 4z = 3,
The coefficient matrix A =
1 2 1
0 2 1
1 1 4
can be filled in, entry by entry, from the coef-
ficients of the variables appearing in the equations. (Don’t forget to put a zero when a
avariable doesn’t appear in an equation!) The vector x =
x
y
z
lists the variables, while
the entries of b =
2
7
3
are the right hand sides of the equations.
Remark: We will consistently use bold face lower case letters to denote vectors, and
ordinary capital letters to denote general matrices.
Matrix Arithmetic
Matrix arithmetic involves three basic operations: matrix addition, scalar multiplica-
tion, and matrix multiplication. First we define addition of matrices. You are only allowed
to add two matrices of the same size, and matrix addition is performed entry by entry.
For example,
1 2
−1 0
+
3 −5
2 1
=
4 −3
1 1
.
Therefore, if A and B are m×n matrices, their sum C = A+B is the m×n matrix whose
entries are given by cij = aij +bij for i = 1, . . . ,m and j = 1, . . . , n. When defined, matrix
5/18/08 36 c
2008 Peter J. Olver
addition is commutative, A + B = B + A, and associative, A + (B + C) = (A + B) + C,
just like ordinary addition.
A scalar is a fancy name for an ordinary number — the term merely distinguishes it
from a vector or a matrix. For the time being, we will restrict our attention to real scalars
and matrices with real entries, but eventually complex scalars and complex matrices must
be dealt with. We will consistently identify a scalar c ∈ R with the 1 × 1 matrix ( c ) in
which it is the sole entry, and so will omit the redundant parentheses in the latter case.
Scalar multiplication takes a scalar c and an m × n matrix A and computes the m × n
matrix B = cA by multiplying each entry of A by c. For example,
3
1 2
−1 0
=
3 6
−3 0
.
In general, bij = c aij for i = 1, . . . ,m and j = 1, . . . , n. Basic properties of scalar
multiplication are summarized at the end of this section.
Finally, we define matrix multiplication. First, the product between a row vector a
and a column vector x having the same number of entries is the scalar or 1 × 1 matrix
defined by the following rule:
a x = ( a1 a2 . . . an )
x1
x2
...
xn
= a1 x1 + a2 x2 + · · · + an xn =
Xn
k=1
ak xk. (3.3)
More generally, if A is an m × n matrix and B is an n × p matrix, so that the number of
columns in A equals the number of rows in B, then the matrix product C = AB is defined
as the m × p matrix whose (i, j) entry equals the vector product of the ith row of A and
the jth column of B. Therefore,
cij =
Xn
k=1
aik bkj . (3.4)
Note that our restriction on the sizes of A and B guarantees that the relevant row and
column vectors will have the same number of entries, and so their product is defined.
For example, the product of the coefficient matrix A and vector of unknowns x for
our original system (4.1) is given by
Ax =
1 2 1
2 6 1
1 1 4
x
y
z
=
x + 2y + z
2x + 6y + z
x + y + 4z
.
The result is a column vector whose entries reproduce the left hand sides of the original
linear system! As a result, we can rewrite the system
Ax = b (3.5)
as an equality between two column vectors. This result is general; a linear system (3.2)
consisting of m equations in n unknowns can be written in the matrix form (3.5) where A
5/18/08 37 c
2008 Peter J. Olver
is the m×n coefficient matrix (3.1), x is the n×1 column vector of unknowns, and b is the
m × 1 column vector containing the right hand sides. This is one of the principal reasons
for the non-evident definition of matrix multiplication. Component-wise multiplication of
matrix entries turns out to be almost completely useless in applications.
Now, the bad news. Matrix multiplication is not commutative — that is, BA is not
necessarily equal to AB. For example, BA may not be defined even when AB is. Even if
both are defined, they may be different sized matrices. For example the product s = r c
of a row vector r, a 1 × n matrix, and a column vector c, an n × 1 matrix with the same
number of entries, is a 1 × 1 matrix or scalar, whereas the reversed product C = c r is an
n × n matrix. For instance,
( 1 2 )
3
0
= 3, whereas
3
0
( 1 2 ) =
3 6
0 0
.
In computing the latter product, don’t forget that we multiply the rows of the first matrix
by the columns of the second. Moreover, even if the matrix products AB and BA have
the same size, which requires both A and B to be square matrices, we may still have
AB 6= BA. For example,
1 2
3 4
0 1
−1 2
=
−2 5
−4 11
6=
3 4
5 6
=
0 1
−1 2
1 2
3 4
.
On the other hand, matrix multiplication is associative, so A(BC) = (AB)C when-
ever A has size m × n, B has size n × p, and C has size p × q; the result is a matrix of
size m × q. The proof of associativity is a tedious computation based on the definition of
matrix multiplication that, for brevity, we omit. Consequently, the one difference between
matrix algebra and ordinary algebra is that you need to be careful not to change the order
of multiplicative factors without proper justification.
Since matrix multiplication acts by multiplying rows by columns, one can compute the
columns in a matrix product AB by multiplying the matrix A and the individual columns
of B. For example, the two columns of the matrix product
1 −1 2
2 0 −2
3 4
0 2
−1 1
=
1 4
8 6
are obtained by multiplying the first matrix with the individual columns of the second:
1 −1 2
2 0 −2
3
0
−1
=
1
8
,
1 −1 2
2 0 −2
4
2
1
=
4
6
.
In general, if we use bk to denote the kth column of B, then
AB = A
ō
b1 b2 . . . bp
=
ō
Ab1 Ab2 . . . Abp
, (3.6)
indicating that the kth column of their matrix product is Abk.
There are two special matrices. The first is the zero matrix , all of whose entries are 0.
We use Om×n to denote the m× n zero matrix, often written as just O if the size is clear
5/18/08 38 c
2008 Peter J. Olver
from the context. The zero matrix is the additive unit, so A + O = A = O + A when O
has the same size as A. In particular, we will use a bold face 0 to denote a column vector
with all zero entries.
The role of the multiplicative unit is played by the square identity matrix
I = In =
1 0 0 · · · 0 0
0 1 0 · · · 0 0
0 0 1 · · · 0 0
...
...
...
. . .
...
...
0 0 0 · · · 1 0
0 0 0 · · · 0 1
of size n × n. The entries along the main diagonal (which runs from top left to bottom
right) are equal to 1, while the off-diagonal entries are all 0. As you can check, if A is any
m × n matrix, then Im A = A = A In . We will sometimes write the preceding equation
as just IA = A = A I , since each matrix product is well-defined for exactly one size of
identity matrix.
The identity matrix is a particular example of a diagonal matrix . In general, a square
matrix A is diagonal if all its off-diagonal entries are zero: aij = 0 for all i 6= j. We will
sometimes write D = diag (c1, . . . , cn) for the n × n diagonal matrix with diagonal entries
dii = ci. Thus, diag (1, 3, 0) refers to the diagonal matrix
1 0 0
0 3 0
0 0 0
, while the n × n
identity matrix can be written as In = diag (1, 1, . . . , 1).
Let us conclude this section by summarizing the basic properties of matrix arithmetic.
In the accompanying table, A,B,C are matrices; c, d are scalars; O is a zero matrix; and
I is an identity matrix. All matrices are assumed to have the correct sizes so that the
indicated operations are defined.
Matrix Inverses
The inverse of a matrix is analogous to the reciprocal a−1 = 1/a of a scalar. We
already encountered the inverses of matrices corresponding to elementary row operations.
In this section, we will study inverses of general square matrices. We begin with the formal
definition.
Definition 3.1. Let A be a square matrix of size n×n. An n×n matrix X is called
the inverse of A if it satisfies
XA = I = AX, (3.7)
where I = I n is the n×n identity matrix. The inverse is commonly denoted by X = A−1.
Remark: Noncommutativity of matrix multiplication requires that we impose both
conditions in (3.7) in order to properly define an inverse to the matrix A. The first
condition, XA = I , says that X is a left inverse, while the second, AX = I , requires that
X also be a right inverse. Rectangular matrices might have either a left inverse or a right
inverse, but, as we shall see, only square matrices have both, and so only square matrices
5/18/08 39 c
2008 Peter J. Olver
Basic Matrix Arithmetic
Matrix Addition: Commutativity A + B = B + A
Associativity (A + B) + C = A + (B + C)
Zero Matrix A + O = A = O + A
Inverse A + (−A) = O, −A = (−1)A
Scalar Multiplication: Associativity c (dA) = (cd)A
Distributivity
c (A + B) = (cA) + (cB)
(c + d)A = (cA) + (dA)
Unit 1A = A
Zero 0A = O
Matrix Multiplication: Associativity (AB)C = A(BC)
Distributivity
A(B + C) = AB + AC,
(A + B)C = AC + BC,
Identity Matrix A I = A = IA
Zero Matrix AO = O, OA = O
can have full-fledged inverses. However, not every square matrix has an inverse. Indeed,
not every scalar has an inverse: 0−1 = 1/0 is not defined since the equation 0x = 1 has no
solution.
Example 3.2. Since
1 2 −1
−3 1 2
−2 2 1
3 4 −5
1 1 −1
4 6 −7
=
1 0 0
0 1 0
0 0 1
=
3 4 −5
1 1 −1
4 6 −7
1 2 −1
−3 1 2
−2 2 1
,
we conclude that when A =
1 2 −1
−3 1 2
−2 2 1
, then A−1 =
3 4 −5
1 1 −1
4 6 −7
. Observe that
there is no obvious way to anticipate the entries of A−1 from the entries of A.
Example 3.3. Let us compute the inverse X =
x y
z w
, when it exists, of a general
2 × 2 matrix A =
a b
c d
. The right inverse condition
AX =
a x + b z a y + bw
c x + d z c y + dw
=
1 0
0 1
= I
5/18/08 40 c
2008 Peter J. Olver
holds if and only if x, y, z,w satisfy the linear system
a x + b z = 1,
c x + d z = 0,
a y + bw = 0,
c y + dw = 1.
Solving by Gaussian Elimination (or directly), we find
x =
d
ad − b c
, y = −
b
ad − b c
, z = −
c
ad − b c
, w =
a
ad − b c
,
provided the common denominator ad − b c 6= 0 does not vanish. Therefore, the matrix
X =
1
ad − b c
d −b
−c a
forms a right inverse to A. However, a short computation shows that it also defines a left
inverse:
XA =
x a + y c x b + y d
z a + w c z b + w d
=
1 0
0 1
= I ,
and hence X = A−1 is the inverse to A.
The denominator appearing in the preceding formulae has a special name; it is called
the determinant of the 2 × 2 matrix A, and denoted
det
a b
c d
= ad − b c. (3.8)
Thus, the determinant of a 2 × 2 matrix is the product of the diagonal entries minus the
product of the off-diagonal entries. Thus, the 2 × 2 matrix A is invertible, with
A−1 =
1
ad − b c
d −b
−c a
, (3.9)
if and only if detA 6= 0. For example, if A =
1 3
−2 −4
, then detA = 2 6= 0. We
conclude that A has an inverse, which, by (3.9), is A−1 =
1
2
−4 −3
2 1
=
− 2 −3
2
1 1
2
!
.
Lemma 3.4. The inverse of a square matrix, if it exists, is unique.
Proof : Suppose both X and Y satisfy (3.7), so XA = I = AX and Y A = I = AY .
Then, by associativity, X = X I = X(AY ) = (XA)Y = I Y = Y , and hence X =
Y . Q.E.D.
Inverting a matrix twice brings us back to where we started.
Lemma 3.5. If A is invertible, then A−1 is also invertible and (A−1)−1 = A.
Proof : The matrix inverse equations A−1 A = I = AA−1 are sufficient to prove that
A is the inverse of A−1. Q.E.D.
5/18/08 41 c
2008 Peter J. Olver
Lemma 3.6. If A and B are invertible matrices of the same size, then their product,
AB, is invertible, and
(AB)−1 = B−1A−1. (3.10)
Note that the order of the factors is reversed under inversion.
Proof : Let X = B−1A−1. Then, by associativity,
X(AB) = B−1A−1AB = B−1B = I , (AB)X = ABB−1A−1 = AA−1 = I .
Thus X is both a left and a right inverse for the product matrix AB and the result
follows. Q.E.D.
Example 3.7. One verifies, directly, that the inverse of A =
1 2
0 1
is A−1 =
1 −2
0 1
, while the inverse of B =
0 1
−1 0
is B−1 =
0 −1
1 0
. Therefore, the
inverse of their product C = AB =
1 2
0 1
0 1
−1 0
=
−2 1
−1 0
is given by C−1 =
B−1A−1 =
0 −1
1 0
1 −2
0 1
=
0 −1
1 −2
.
We can straightforwardly generalize the preceding result. The inverse of a k-fold
product of invertible matrices is the product of their inverses, in the reverse order :
(A1A2 · · ·Ak−1Ak)−1 = A−1
k A−1
k−1 · · ·A−1
2 A−1
1 . (3.11)
Warning: In general, (A + B)−1 6= A−1 + B−1. This equation is not even true for
scalars (1 × 1 matrices)!
Transposes and Symmetric Matrices
Another basic operation on matrices is to interchange their rows and columns. If A
is an m×n matrix, then its transpose, denoted AT , is the n×m matrix whose (i, j) entry
equals the (j, i) entry of A; thus
B = AT means that bij = aji.
For example, if
A =
1 2 3
4 5 6
, then AT =
1 4
2 5
3 6
.
Observe that the rows of A become the columns of AT and vice versa. In particular, the
transpose of a row vector is a column vector, while the transpose of a column vector is a
row vector; if v =
1
2
3
, then vT = ( 1 2 3 ). The transpose of a scalar, considered as a
1 × 1 matrix, is itself: cT = c.
5/18/08 42 c
2008 Peter J. Olver
Remark: Most vectors appearing in applied mathematics are column vectors. To
conserve vertical space in this text, we will often use the transpose notation, e.g., v =
( v1, v2, v3 )T , as a compact way of writing column vectors.
In the square case, transposition can be viewed as “reflecting” the matrix entries
across the main diagonal. For example,
1 2 −1
3 0 5
−2 −4 8
T
=
1 3 −2
2 0 −4
−1 5 8
.
In particular, the transpose of a lower triangular matrix is upper triangular and vice-versa.
Transposing twice returns you to where you started:
(AT )T = A. (3.12)
Unlike inversion, transposition is compatible with matrix addition and scalar multiplica-
tion:
(A + B)T = AT + BT , (cA)T = cAT . (3.13)
Transposition is also compatible with matrix multiplication, but with a twist. Like the
inverse, the transpose reverses the order of multiplication:
(AB)T = BTAT . (3.14)
Indeed, if A has size m × n and B has size n × p, so they can be multiplied, then AT has
size n × m and BT has size p × n, and so, in general, one has no choice but to multiply
BTAT in that order. Formula (3.14) is a straightforward consequence of the basic laws of
matrix multiplication. An important special case is the product between a row vector vT
and a column vector w with the same number of entries. In this case,
vTw = (vTw)T = wTv, (3.15)
because their product is a scalar and so, as noted above, equals its own transpose.
Lemma 3.8. If A is a nonsingular matrix, so is AT , and its inverse is denoted
A−T = (AT )−1 = (A−1)T . (3.16)
Thus, transposing a matrix and then inverting yields the same result as first inverting and
then transposing.
Proof : Let X = (A−1)T . Then, according to (3.14),
XAT = (A−1)TAT = (AA−1)T = I T = I .
The proof that AT X = I is similar, and so we conclude that X = (AT )−1. Q.E.D.
A particularly important class of square matrices is those that are unchanged by the
transpose operation.
5/18/08 43 c
2008 Peter J. Olver
Definition 3.9. A square matrix is called symmetric if it equals its own transpose:
A = AT .
Thus, A is symmetric if and only if its entries satisfy aji = aij for all i, j. In other
words, entries lying in “mirror image” positions relative to the main diagonal must be
equal. For example, the most general symmetric 3 × 3 matrix has the form
A =
a b c
b d e
c e f
.
Note that any diagonal matrix, including the identity, is symmetric. A lower or upper
triangular matrix is symmetric if and only if it is, in fact, a diagonal matrix.
5/18/08 44 c
2008 Peter J. Olver
readna.blogspot.com
Sunday, February 2, 2014
Computer Arithmetic
The purpose of computing is insight, not numbers.
— R.W. Hamming, [24]
The main goal of numerical analysis is to develop efficient algorithms for computing
precise numerical values of mathematical quantities, including functions, integrals, solu-
tions of algebraic equations, solutions of differential equations (both ordinary and partial),
solutions of minimization problems, and so on. The objects of interest typically (but not
exclusively) arise in applications, which seek not only their qualitative properties, but also
quantitative numerical data. The goal of this course of lectures is to introduce some of the
most important and basic numerical algorithms that are used in practical computations.
Beyond merely learning the basic techniques, it is crucial that an informed practitioner
develop a thorough understanding of how the algorithms are constructed, why they work,
and what their limitations are.
In any applied numerical computation, there are four key sources of error:
(i ) Inexactness of the mathematical model for the underlying physical phenomenon.
(ii ) Errors in measurements of parameters entering the model.
(iii ) Round-off errors in computer arithmetic.
(iv) Approximations used to solve the full mathematical system.
Of these, (i ) is the domain of mathematical modeling, and will not concern us here. Neither
will (ii ), which is the domain of the experimentalists. (iii ) arises due to the finite numerical
precision imposed by the computer. (iv) is the true domain of numerical analysis, and
refers to the fact that most systems of equations are too complicated to solve explicitly, or,
even in cases when an analytic solution formula is known, directly obtaining the precise
numerical values may be difficult.
There are two principal ways of quantifying computational errors.
Definition 1.1. Let x be a real number and let x⋆ be an approximation. The
absolute error in the approximation x⋆ ≈ x is defined as | x⋆ − x |. The relative error is
defined as the ratio of the absolute error to the size of x, i.e.,
| x⋆ − x |
| x |
, which assumes
x 6= 0; otherwise relative error is not defined.
5/18/08 1
c 2008 Peter J. Olver
For example, 1000001 is an approximation to 1000000 with an absolute error of 1 and
a relative error of 10−6, while 2 is an approximation to 1 with an absolute error of 1 and a
relative error of 1. Typically, relative error is more intuitive and the preferred determiner
of the size of the error. The present convention is that errors are always ≥ 0, and are = 0
if and only if the approximation is exact. We will say that an approximation x⋆ has k
significant decimal digits if its relative error is < 5 × 10−k−1. This means that the first k
digits of x⋆ following its first nonzero digit are the same as those of x.
Ultimately, numerical algorithms must be performed on a computer. In the old days,
“computer” referred to a person or an analog machine, but nowadays inevitably means a
digital, electronic machine. A most unfortunate fact of life is that all digital computers, no
matter how “super”, can only store finitely many quantities. Thus, there is no way that
a computer can represent the (discrete) infinity of all integers or all rational numbers, let
alone the (continuous) infinity of all real or all complex numbers. So the decision of how
to approximate more general numbers using only the finitely many that the computer can
handle becomes of critical importance.
Each number in a computer is assigned a location or word, consisting of a specified
number of binary digits or bits. A k bit word can store a total of N = 2k different
numbers. For example, the standard single precision computer uses 32 bit arithmetic, for
a total of N = 232 ≈ 4.3×109 different numbers, while double precision uses 64 bits, with
N = 264 ≈ 1.84×1019 different numbers. The question is how to distribute the N exactly
representable numbers over the real line for maximum efficiency and accuracy in practical
computations.
One evident choice is to distribute them evenly, leading to fixed point arithmetic. For
example, one can use the first bit in a word to represent a sign, and treat the remaining
bits as integers, thereby representing the integers from 1− 1
2 N = 1− 2k−1 to 1
2 N = 2k−1
exactly. Of course, this does a poor job approximating most non-integral numbers. Another
option is to space the numbers closely together, say with uniform gap of 2−n, and so
distribute our N numbers uniformly over the interval −2−n−1N < x ≤ 2−n−1N. Real
numbers lying between the gaps are represented by either rounding, meaning the closest
exact representative, or chopping, meaning the exact representative immediately below (or
above if negative) the number. Numbers lying beyond the range must be represented by
the largest (or largest negative) representable number, which thus becomes a symbol for
overflow. When processing speed is a significant bottleneck, the use of such fixed point
representations is an attractive and faster alternative to the more cumbersome floating
point arithmetic most commonly used in practice.
The most common non-uniform distribution of our N numbers is the floating point
system, which is based on scientific notation. If x is any real number it can be written in
the form
x = ±.d1d2d3d4 . . . × 2n,
where dĪ½ = 0 or 1, and n ∈ Z is an integer. We call d1d2d3d4 . . . the mantissa and n the
exponent. If x 6= 0, then we can uniquely choose n so that d1 = 1. In our computer, we
approximate x by a finite number of mantissa digits
x⋆ = ±.d1d2d3d4 . . . dk−1
b dk × 2n,
5/18/08 2
c 2008 Peter J. Olver
by either chopping or rounding the final digit. The exponent is also restricted to a finite
range of integers n⋆ ≤ N ≤ N⋆. Very small numbers, lying in the gap between 0 < | x | <
2n⋆ , are said to cause underflow.
• In single precision floating point arithmetic, the sign is 1 bit, the exponent is 7 bits,
and the mantissa is 24 bits. The resulting nonzero numbers lie in the range
2−127 ≈ 10−38 ≤ | x | ≤ 2127 ≈ 1038,
and allow one to accurately represent numbers with approximately 7 significant
decimal digits of real numbers lying in this range.
• In double precision floating point arithmetic, the sign is 1 bit, the exponent is 10 bits,
and the mantissa is 53 bits, leading to floating point representations for a total of
1.84×1019 different numbers which, apart from 0. The resulting nonzero numbers
lie in the range
2−1023 ≈ 10−308 ≤ | x | ≤ 21023 ≈ 10308.
In double precision, one can accurately represent approximately 15 decimal digits.
Keep in mind floating point numbers are not uniformly spaced! Moreover, when passing
from .111111 . . . × 2n to .100000 . . . × 2n+1, the inter-number spacing suddenly jumps by
a factor of 2. The non-smoothly varying spacing inherent in the floating point system can
cause subtle, undesirable numerical artifacts in high precision computations.
Remark: Although they are overwhelmingly the most prevalent, fixed and floating
point are not the only number systems that have been proposed. See [9] for another
intriguing possibility.
In the course of a calculation, intermediate errors interact in a complex fashion, and
result in a final total error that is not just the sum of the individual errors. If x⋆ is an
approximation to x, and y⋆ is an approximation to y, then, instead of arithmetic operations
+,−,×, / and so on, the computer uses a “pseudo-operations” to combine them. For
instance, to approximate the difference x − y of two real numbers, the computer begins
by replacing each by its floating point approximation x⋆ and y⋆. It the subtracts these,
and replaces the exact result x⋆ − y⋆ by its nearest floating point representative, namely
(x⋆ − y⋆)⋆. As the following example makes clear, this is not necessarily the same as the
floating point approximation to x − y, i.e., in general (x⋆ − y⋆)⋆ 6= (x − y)⋆.
Example 1.2. . Lets see what happens if we subtract the rational numbers
x =
301
2000
≈ .15050000 . . . , y =
301
2001
≈ .150424787 . . . .
The exact answer is
x − y =
301
4002000
≈ .00007521239 . . . .
However, if we use rounding to approximate with 4 significant digits†, then
x =
301
2000
≈ .1505, y =
301
2001
≈ .1504 and so x − y ≈ .0001,
† To aid comprehension, we are using base 10 instead of base 2 arithmetic in this example.
5/18/08 3
c 2008 Peter J. Olver
2 4 6 8 10
-3
-2
-1
1
2
3
Figure 1.1. Roots of Polynomials.
which has no significant digits in common with the actual answer. On the other hand, if
we evaluate the difference using the alternative formula
x − y =
301 × 2001 − 301 × 2000
4002000
=
602301 − 602000
4002000
≈
6.023 × 105 − 6.020 × 105
4.002 × 106 =
.003 × 105
4.002 × 106 ≈ .00007496,
we at least reproduce the first two significant digits. Thus, the result of a floating point
computation depends on the order of the arithmetic operations! In particular, the associa-
tive and distributive laws are not valid in floating point arithmetic! In the development
of numerical analysis, one tends to not pay attention to this “minor detail’, although its
effects must always be kept in the back of one’s mind when evaluating the results of any
serious numerical computation.
Just in case you are getting a little complacent, thinking “gee, a few tiny round off
errors can’t really make that big a difference”, let us close with two cautionary examples.
Example 1.3. Consider the pair of degree 10 polynomials
p(x) = (x − 1)(x − 2)(x − 3)(x − 4)(x − 5)(x − 6)(x − 7)(x − 8)(x − 9)(x − 10)
and
q(x) = p(x) + x5.
They only differ in the value of the coefficient of their middle term, which, by a direct
expansion, is −902055x5 in p(x), and −902054x5 in q(x); all other terms are the same.
The relative error between the two coefficients is roughly one-thousandth of one percent.
Our task is to compute the roots of these two polynomials, i.e., the solutions to
p(x) = 0 and q(x) = 0. Those of p(x) are obvious. One might expect the roots of q(x) to
5/18/08 4
c 2008 Peter J. Olver
0.5 1 1.5 2 2.5
0.2
0.4
0.6
0.8
1
Figure 1.2. Sensitive Dependence on Initial Data.
be rather close to the integers 1, 2, . . . , 10. However, their approximate values are
1.0000027558, 1.99921, 3.02591, 3.82275,
5.24676 ± 0.751485 i , 7.57271 ± 1.11728 i , 9.75659 ± 0.368389 i .
Surprisingly, only the smallest two roots are relatively unchanged; the third and fourth
roots differ by roughly 1% and 27%, while the final six roots have mysteriously meta-
morphosed into three complex conjugate pairs of roots of q(x). The two sets of roots are
plotted in Figure 1.1; those of p(x) are indicated by solid disks, while those of q(x) are
indicated by open circles.
We have thus learned that an almost negligible change in a single coefficient of a real
polynomial can have dramatic and unanticipated effects on its roots. Needless to say, this
indicates that finding accurate numerical values for the roots of high degree polynomials
is a very challenging problem.
Example 1.4. Consider the initial value problem
du
dt
− 2u = −e−t, u(0) = 1
3 .
The solution is easily found:
u(t) = 1
3 e−t,
and is exponentially decaying as t → ∞.
However, in a floating point computer, we are not able to represent the initial con-
dition 1
3 exactly, and make some small round-off error (depending on the precision of the
computer). Let " 6= 0 represent the error in the initial condition. The solution to the
perturbed initial value problem
dv
dt
− 2v = −e−t, v(0) = 1
3 + ",
is
v(t) = 1
3 e−t + " e2t,
which is exponentially growing as t increases. As sketched in Figure 1.2, the two solutions
remain close only for a very short time interval, the duration of which depends on the
5/18/08 5
c 2008 Peter J. Olver
size in the initial error, but then they eventually diverge far away from each other. As a
consequence, a tiny error in the initial data can have a dramatic effect on the solution.
This phenomenon is referred to as sensitive dependence on initial conditions.
The numerical computation of the exponentially decaying exact solution in the face
of round-off errors is an extreme challenge. Even the tiniest of error will immediately
introduce an exponentially growing mode which will eventually swamp the true solution.
Furthermore, in many important applications, the physically relevant solution is the one
that decays to zero at large distances, and is usually distinguished from the vast majority
of solutions that grow at large distances. So the computational problem is both important
and very difficult.
Examples 1.3 and 1.4 are known as ill-conditioned problems meaning that tiny changes
in the data have dramatic effects on the solutions. Simple numerical methods work as
advertised on well-conditioned problems, but all have their limitations and a sufficiently
ill-conditioned problem will test the limits of the algorithm and/or computer, and, in
many instances, require revamping a standard numerical solution scheme to adapt to the
ill-conditioning. Some problems are so ill-conditioned as to defy even the most powerful
computers and sophisticated algorithms. For this reason, numerical analysis will forever
remain a vital and vibrant area of mathematical research.
So, numerical analysis cannot be viewed in isolation as a black box, and left in the
hands of the mathematical experts. Every practitioner will, sooner or later, confront a
problem that tests the limitations of standard algorithms and software. Without a proper
understanding of the mathematical principles involved in constructing basic numerical
algorithms, one is ill-equipped, if not hopelessly handicapped, when dealing with such
problems. The purpose of this series of lectures is to give you the proper mathematical
grounding in modern numerical analysis.
5/18/08 6
c 2008 Peter J. Olver
The purpose of computing is insight, not numbers.
— R.W. Hamming, [24]
The main goal of numerical analysis is to develop efficient algorithms for computing
precise numerical values of mathematical quantities, including functions, integrals, solu-
tions of algebraic equations, solutions of differential equations (both ordinary and partial),
solutions of minimization problems, and so on. The objects of interest typically (but not
exclusively) arise in applications, which seek not only their qualitative properties, but also
quantitative numerical data. The goal of this course of lectures is to introduce some of the
most important and basic numerical algorithms that are used in practical computations.
Beyond merely learning the basic techniques, it is crucial that an informed practitioner
develop a thorough understanding of how the algorithms are constructed, why they work,
and what their limitations are.
In any applied numerical computation, there are four key sources of error:
(i ) Inexactness of the mathematical model for the underlying physical phenomenon.
(ii ) Errors in measurements of parameters entering the model.
(iii ) Round-off errors in computer arithmetic.
(iv) Approximations used to solve the full mathematical system.
Of these, (i ) is the domain of mathematical modeling, and will not concern us here. Neither
will (ii ), which is the domain of the experimentalists. (iii ) arises due to the finite numerical
precision imposed by the computer. (iv) is the true domain of numerical analysis, and
refers to the fact that most systems of equations are too complicated to solve explicitly, or,
even in cases when an analytic solution formula is known, directly obtaining the precise
numerical values may be difficult.
There are two principal ways of quantifying computational errors.
Definition 1.1. Let x be a real number and let x⋆ be an approximation. The
absolute error in the approximation x⋆ ≈ x is defined as | x⋆ − x |. The relative error is
defined as the ratio of the absolute error to the size of x, i.e.,
| x⋆ − x |
| x |
, which assumes
x 6= 0; otherwise relative error is not defined.
5/18/08 1
c 2008 Peter J. Olver
For example, 1000001 is an approximation to 1000000 with an absolute error of 1 and
a relative error of 10−6, while 2 is an approximation to 1 with an absolute error of 1 and a
relative error of 1. Typically, relative error is more intuitive and the preferred determiner
of the size of the error. The present convention is that errors are always ≥ 0, and are = 0
if and only if the approximation is exact. We will say that an approximation x⋆ has k
significant decimal digits if its relative error is < 5 × 10−k−1. This means that the first k
digits of x⋆ following its first nonzero digit are the same as those of x.
Ultimately, numerical algorithms must be performed on a computer. In the old days,
“computer” referred to a person or an analog machine, but nowadays inevitably means a
digital, electronic machine. A most unfortunate fact of life is that all digital computers, no
matter how “super”, can only store finitely many quantities. Thus, there is no way that
a computer can represent the (discrete) infinity of all integers or all rational numbers, let
alone the (continuous) infinity of all real or all complex numbers. So the decision of how
to approximate more general numbers using only the finitely many that the computer can
handle becomes of critical importance.
Each number in a computer is assigned a location or word, consisting of a specified
number of binary digits or bits. A k bit word can store a total of N = 2k different
numbers. For example, the standard single precision computer uses 32 bit arithmetic, for
a total of N = 232 ≈ 4.3×109 different numbers, while double precision uses 64 bits, with
N = 264 ≈ 1.84×1019 different numbers. The question is how to distribute the N exactly
representable numbers over the real line for maximum efficiency and accuracy in practical
computations.
One evident choice is to distribute them evenly, leading to fixed point arithmetic. For
example, one can use the first bit in a word to represent a sign, and treat the remaining
bits as integers, thereby representing the integers from 1− 1
2 N = 1− 2k−1 to 1
2 N = 2k−1
exactly. Of course, this does a poor job approximating most non-integral numbers. Another
option is to space the numbers closely together, say with uniform gap of 2−n, and so
distribute our N numbers uniformly over the interval −2−n−1N < x ≤ 2−n−1N. Real
numbers lying between the gaps are represented by either rounding, meaning the closest
exact representative, or chopping, meaning the exact representative immediately below (or
above if negative) the number. Numbers lying beyond the range must be represented by
the largest (or largest negative) representable number, which thus becomes a symbol for
overflow. When processing speed is a significant bottleneck, the use of such fixed point
representations is an attractive and faster alternative to the more cumbersome floating
point arithmetic most commonly used in practice.
The most common non-uniform distribution of our N numbers is the floating point
system, which is based on scientific notation. If x is any real number it can be written in
the form
x = ±.d1d2d3d4 . . . × 2n,
where dĪ½ = 0 or 1, and n ∈ Z is an integer. We call d1d2d3d4 . . . the mantissa and n the
exponent. If x 6= 0, then we can uniquely choose n so that d1 = 1. In our computer, we
approximate x by a finite number of mantissa digits
x⋆ = ±.d1d2d3d4 . . . dk−1
b dk × 2n,
5/18/08 2
c 2008 Peter J. Olver
by either chopping or rounding the final digit. The exponent is also restricted to a finite
range of integers n⋆ ≤ N ≤ N⋆. Very small numbers, lying in the gap between 0 < | x | <
2n⋆ , are said to cause underflow.
• In single precision floating point arithmetic, the sign is 1 bit, the exponent is 7 bits,
and the mantissa is 24 bits. The resulting nonzero numbers lie in the range
2−127 ≈ 10−38 ≤ | x | ≤ 2127 ≈ 1038,
and allow one to accurately represent numbers with approximately 7 significant
decimal digits of real numbers lying in this range.
• In double precision floating point arithmetic, the sign is 1 bit, the exponent is 10 bits,
and the mantissa is 53 bits, leading to floating point representations for a total of
1.84×1019 different numbers which, apart from 0. The resulting nonzero numbers
lie in the range
2−1023 ≈ 10−308 ≤ | x | ≤ 21023 ≈ 10308.
In double precision, one can accurately represent approximately 15 decimal digits.
Keep in mind floating point numbers are not uniformly spaced! Moreover, when passing
from .111111 . . . × 2n to .100000 . . . × 2n+1, the inter-number spacing suddenly jumps by
a factor of 2. The non-smoothly varying spacing inherent in the floating point system can
cause subtle, undesirable numerical artifacts in high precision computations.
Remark: Although they are overwhelmingly the most prevalent, fixed and floating
point are not the only number systems that have been proposed. See [9] for another
intriguing possibility.
In the course of a calculation, intermediate errors interact in a complex fashion, and
result in a final total error that is not just the sum of the individual errors. If x⋆ is an
approximation to x, and y⋆ is an approximation to y, then, instead of arithmetic operations
+,−,×, / and so on, the computer uses a “pseudo-operations” to combine them. For
instance, to approximate the difference x − y of two real numbers, the computer begins
by replacing each by its floating point approximation x⋆ and y⋆. It the subtracts these,
and replaces the exact result x⋆ − y⋆ by its nearest floating point representative, namely
(x⋆ − y⋆)⋆. As the following example makes clear, this is not necessarily the same as the
floating point approximation to x − y, i.e., in general (x⋆ − y⋆)⋆ 6= (x − y)⋆.
Example 1.2. . Lets see what happens if we subtract the rational numbers
x =
301
2000
≈ .15050000 . . . , y =
301
2001
≈ .150424787 . . . .
The exact answer is
x − y =
301
4002000
≈ .00007521239 . . . .
However, if we use rounding to approximate with 4 significant digits†, then
x =
301
2000
≈ .1505, y =
301
2001
≈ .1504 and so x − y ≈ .0001,
† To aid comprehension, we are using base 10 instead of base 2 arithmetic in this example.
5/18/08 3
c 2008 Peter J. Olver
2 4 6 8 10
-3
-2
-1
1
2
3
Figure 1.1. Roots of Polynomials.
which has no significant digits in common with the actual answer. On the other hand, if
we evaluate the difference using the alternative formula
x − y =
301 × 2001 − 301 × 2000
4002000
=
602301 − 602000
4002000
≈
6.023 × 105 − 6.020 × 105
4.002 × 106 =
.003 × 105
4.002 × 106 ≈ .00007496,
we at least reproduce the first two significant digits. Thus, the result of a floating point
computation depends on the order of the arithmetic operations! In particular, the associa-
tive and distributive laws are not valid in floating point arithmetic! In the development
of numerical analysis, one tends to not pay attention to this “minor detail’, although its
effects must always be kept in the back of one’s mind when evaluating the results of any
serious numerical computation.
Just in case you are getting a little complacent, thinking “gee, a few tiny round off
errors can’t really make that big a difference”, let us close with two cautionary examples.
Example 1.3. Consider the pair of degree 10 polynomials
p(x) = (x − 1)(x − 2)(x − 3)(x − 4)(x − 5)(x − 6)(x − 7)(x − 8)(x − 9)(x − 10)
and
q(x) = p(x) + x5.
They only differ in the value of the coefficient of their middle term, which, by a direct
expansion, is −902055x5 in p(x), and −902054x5 in q(x); all other terms are the same.
The relative error between the two coefficients is roughly one-thousandth of one percent.
Our task is to compute the roots of these two polynomials, i.e., the solutions to
p(x) = 0 and q(x) = 0. Those of p(x) are obvious. One might expect the roots of q(x) to
5/18/08 4
c 2008 Peter J. Olver
0.5 1 1.5 2 2.5
0.2
0.4
0.6
0.8
1
Figure 1.2. Sensitive Dependence on Initial Data.
be rather close to the integers 1, 2, . . . , 10. However, their approximate values are
1.0000027558, 1.99921, 3.02591, 3.82275,
5.24676 ± 0.751485 i , 7.57271 ± 1.11728 i , 9.75659 ± 0.368389 i .
Surprisingly, only the smallest two roots are relatively unchanged; the third and fourth
roots differ by roughly 1% and 27%, while the final six roots have mysteriously meta-
morphosed into three complex conjugate pairs of roots of q(x). The two sets of roots are
plotted in Figure 1.1; those of p(x) are indicated by solid disks, while those of q(x) are
indicated by open circles.
We have thus learned that an almost negligible change in a single coefficient of a real
polynomial can have dramatic and unanticipated effects on its roots. Needless to say, this
indicates that finding accurate numerical values for the roots of high degree polynomials
is a very challenging problem.
Example 1.4. Consider the initial value problem
du
dt
− 2u = −e−t, u(0) = 1
3 .
The solution is easily found:
u(t) = 1
3 e−t,
and is exponentially decaying as t → ∞.
However, in a floating point computer, we are not able to represent the initial con-
dition 1
3 exactly, and make some small round-off error (depending on the precision of the
computer). Let " 6= 0 represent the error in the initial condition. The solution to the
perturbed initial value problem
dv
dt
− 2v = −e−t, v(0) = 1
3 + ",
is
v(t) = 1
3 e−t + " e2t,
which is exponentially growing as t increases. As sketched in Figure 1.2, the two solutions
remain close only for a very short time interval, the duration of which depends on the
5/18/08 5
c 2008 Peter J. Olver
size in the initial error, but then they eventually diverge far away from each other. As a
consequence, a tiny error in the initial data can have a dramatic effect on the solution.
This phenomenon is referred to as sensitive dependence on initial conditions.
The numerical computation of the exponentially decaying exact solution in the face
of round-off errors is an extreme challenge. Even the tiniest of error will immediately
introduce an exponentially growing mode which will eventually swamp the true solution.
Furthermore, in many important applications, the physically relevant solution is the one
that decays to zero at large distances, and is usually distinguished from the vast majority
of solutions that grow at large distances. So the computational problem is both important
and very difficult.
Examples 1.3 and 1.4 are known as ill-conditioned problems meaning that tiny changes
in the data have dramatic effects on the solutions. Simple numerical methods work as
advertised on well-conditioned problems, but all have their limitations and a sufficiently
ill-conditioned problem will test the limits of the algorithm and/or computer, and, in
many instances, require revamping a standard numerical solution scheme to adapt to the
ill-conditioning. Some problems are so ill-conditioned as to defy even the most powerful
computers and sophisticated algorithms. For this reason, numerical analysis will forever
remain a vital and vibrant area of mathematical research.
So, numerical analysis cannot be viewed in isolation as a black box, and left in the
hands of the mathematical experts. Every practitioner will, sooner or later, confront a
problem that tests the limitations of standard algorithms and software. Without a proper
understanding of the mathematical principles involved in constructing basic numerical
algorithms, one is ill-equipped, if not hopelessly handicapped, when dealing with such
problems. The purpose of this series of lectures is to give you the proper mathematical
grounding in modern numerical analysis.
5/18/08 6
c 2008 Peter J. Olver
NumericalAnalysis LectureNotes
Peter J. Olver
2. Numerical Solution of Scalar Equations
Most numerical solution methods are based on some form of iteration. The basic idea
is that repeated application of the algorithm will produce closer and closer approximations
to the desired solution. To analyze such algorithms, our first task is to understand general
iterative processes.
2.1. Iteration of Functions.
Iteration, meaning repeated application of a function, can be viewed as a discrete
dynamical system in which the continuous time variable has been “quantized” to assume
integer values. Even iterating a very simple quadratic scalar function can lead to an amazing
variety of dynamical phenomena, including multiply-periodic solutions and genuine
chaos. Nonlinear iterative systems arise not just in mathematics, but also underlie the
growth and decay of biological populations, predator-prey interactions, spread of communicable
diseases such as Aids, and host of other natural phenomena. Moreover, many
numerical solution methods — for systems of algebraic equations, ordinary differential
equations, partial differential equations, and so on — rely on iteration, and so the theory
underlies the analysis of convergence and efficiency of such numerical approximation
schemes.
In general, an iterative system has the form
u(k+1) = g(u(k)), (2.1)
where g:Rn → Rn is a real vector-valued function. (One can similarly treat iteration of
complex-valued functions g:Cn → Cn, but, for simplicity, we only deal with real systems
here.) A solution is a discrete collection of points† u(k) ∈ Rn, in which the index k =
0, 1, 2, 3, . . . takes on non-negative integer values.
Once we specify the initial iterate,
u(0) = c, (2.2)
then the resulting solution to the discrete dynamical system (2.1) is easily computed:
u(1) = g(u(0)) = g(c), u(2) = g(u(1)) = g(g(c)), u(3) = g(u(2)) = g(g(g(c))), . . .
† The superscripts on u
(k) refer to the iteration number, and do not denote derivatives.
5/18/08 7
c 2008 Peter J. Olver
and so on. Thus, unlike continuous dynamical systems, the existence and uniqueness of
solutions is not an issue. As long as each successive iterate u(k) lies in the domain of
definition of g one merely repeats the process to produce the solution,
u(k) =
k times z }| {
g ◦ · · · ◦g (c), k = 0, 1, 2, . . . ,
(2.3)
which is obtained by composing the function g with itself a total of k times. In other
words, the solution to a discrete dynamical system corresponds to repeatedly pushing the
g key on your calculator. For example, entering 0 and then repeatedly hitting the cos key
corresponds to solving the iterative system
u(k+1) = cos u(k), u(0) = 0. (2.4)
The first 10 iterates are displayed in the following table:
k 0 1 2 3 4 5 6 7 8 9
u(k) 0 1 .540302 .857553 .65429 .79348 .701369 .76396 .722102 .750418
For simplicity, we shall always assume that the vector-valued function g:Rn → Rn is
defined on all of Rn; otherwise, we must always be careful that the successive iterates u(k)
never leave its domain of definition, thereby causing the iteration to break down. To avoid
technical complications, we will also assume that g is at least continuous; later results rely
on additional smoothness requirements, e.g., continuity of its first and second order partial
derivatives.
While the solution to a discrete dynamical system is essentially trivial, understanding
its behavior is definitely not. Sometimes the solution converges to a particular value —
the key requirement for numerical solution methods. Sometimes it goes off to ∞, or, more
precisely, the norms† of the iterates are unbounded: k u(k) k → ∞ as k → ∞. Sometimes
the solution repeats itself after a while. And sometimes the iterates behave in a seemingly
random, chaotic manner — all depending on the function g and, at times, the initial
condition c. Although all of these cases may arise in real-world applications, we shall
mostly concentrate upon understanding convergence.
Definition 2.1. A fixed point or equilibrium of a discrete dynamical system (2.1) is
a vector u⋆ ∈ Rn such that
g(u⋆) = u⋆. (2.5)
We easily see that every fixed point provides a constant solution to the discrete dynamical
system, namely u(k) = u⋆ for all k. Moreover, it is not hard to prove that any
convergent solution necessarily converges to a fixed point.
† In view of the equivalence of norms on finite-dimensional vector spaces, cf. Theorem 5.9, any
norm will do here.
5/18/08 8
c 2008 Peter J. Olver
Proposition 2.2. If a solution to a discrete dynamical system converges,
lim
k→∞
u(k) = u⋆,
then the limit u⋆ is a fixed point.
Proof : This is a simple consequence of the continuity of g. We have
u⋆ = lim
k→∞
u(k+1) = lim
k→∞
g(u(k)) = g
lim
k→∞
u(k)
= g(u⋆),
the last two equalities following from the continuity of g. Q.E.D.
For example, continuing the cosine iteration (2.4), we find that the iterates gradually
converge to the value u⋆ ≈ .739085, which is the unique solution to the fixed point equation
cos u = u.
Later we will see how to rigorously prove this observed behavior.
Of course, not every solution to a discrete dynamical system will necessarily converge,
but Proposition 2.2 says that if it does, then it must converge to a fixed point. Thus, a
key goal is to understand when a solution converges, and, if so, to which fixed point —
if there is more than one. (In the linear case, only the actual convergence is a significant
issues since most linear systems admit exactly one fixed point, namely u⋆ = 0.)
Fixed points are roughly divided into three classes:
• asymptotically stable, with the property that all nearby solutions converge to it,
• stable, with the property that all nearby solutions stay nearby, and
• unstable, almost all of whose nearby solutions diverge away from the fixed point.
Thus, from a practical standpoint, convergence of the iterates of a discrete dynamical
system requires asymptotic stability of the fixed point. Examples will appear in abundance
in the following sections.
Scalar Functions
As always, the first step is to thoroughly understand the scalar case, and so we begin
with a discrete dynamical system
u(k+1) = g(u(k)), u(0) = c, (2.6)
in which g:R → R is a continuous, scalar-valued function. As noted above, we will assume,
for simplicity, that g is defined everywhere, and so we do not need to worry about whether
the iterates u(0), u(1), u(2), . . . are all well-defined.
We begin with the case of a linear function g(u) = a u. Consider the corresponding
iterative equation
u(k+1) = a u(k), u(0) = c. (2.7)
The general solution to (2.7) is easily found:
u(1) = a u(0) = a c, u(2) = a u(1) = a2 c, u(3) = a u(2) = a3 c,
5/18/08 9
c 2008 Peter J. Olver
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
0 < a < 1
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
−1 < a < 0
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
a = 1
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
a = −1
5 10 15 20 25 30
-10
-7.5
-5
-2.5
2.5
5
7.5
10
1 < a
5 10 15 20 25 30
-10
-7.5
-5
-2.5
2.5
5
7.5
10
a < −1
Figure 2.1. One Dimensional Real Linear Iterative Systems.
and, in general,
u(k) = ak c. (2.8)
If the initial condition is a = 0, then the solution u(k) ≡ 0 is constant. Therefore, 0 is a
fixed point or equilibrium solution for the iterative system.
Example 2.3. Banks add interest to a savings account at discrete time intervals.
For example, if the bank offers 5% interest compounded yearly, this means that the account
balance will increase by 5% each year. Thus, assuming no deposits or withdrawals, the
balance u(k) after k years will satisfy the iterative equation (2.7) with a = 1 + r where
r = .05 is the interest rate, and the 1 indicates that all the money remains in the account.
Thus, after k years, your account balance is
u(k) = (1 + r)kc, where c = u(0) (2.9)
is your initial deposit. For example, if c = $1, 000, after 1 year your account has u(1) =
$1, 050, after 10 years u(10) = $1, 628.89, after 50 years u(50) = $11, 467.40, and after 200
years u(200) = $17, 292, 580.82.
When the interest is compounded monthly, the rate is still quoted on a yearly basis,
and so you receive 1
12 of the interest each month. If Ėu(k) denotes the balance after k
months, then, after n years, the account balance is Ėu(12n) =
ō
1 + 1
12 r
12n c. Thus,
when the interest rate of 5% is compounded monthly, your account balance is Ėu(12) =
$1, 051.16 after 1 year, Ėu(120) = $1, 647.01 after 10 years, Ėu(600) = $12, 119.38 after 50
years, and Ėu(2400) = $21, 573, 572.66 dollars after 200 years. So, if you wait sufficiently
long, compounding will have a dramatic effect. Similarly, daily compounding replaces 12
by 365.25, the number of days in a year. After 200 years, the balance is $22, 011, 396.03.
Let us analyze the solutions of scalar iterative equations, starting with the case when
a ∈ R is a real constant. Aside from the equilibrium solution u(k) ≡ 0, the iterates exhibit
five qualitatively different behaviors, depending on the size of the coefficient a.
5/18/08 10
c 2008 Peter J. Olver
(a) If a = 0, the solution immediately becomes zero, and stays there, so u(k) = 0 for all
k ≥ 1.
(b) If 0 < a < 1, then the solution is of one sign, and tends monotonically to zero, so
u(k) → 0 as k → ∞.
(c) If −1 < a < 0, then the solution tends to zero: u(k) → 0 as k → ∞. Successive
iterates have alternating signs.
(d) If a = 1, the solution is constant: u(k) = a, for all k ≥ 0.
(e) If a = −1, the solution switches back and forth between two values; u(k) = (−1)k c.
(f ) If 1 < a < ∞, then the iterates u(k) become unbounded. If c > 0, they go monotonically
to +∞; if c < 0, to −∞.
(g) If −∞ < a < −1, then the iterates u(k) also become unbounded, with alternating
signs.
In Figure 2.1 we exhibit representative scatter plots for the nontrivial cases (b – g). The
horizontal axis indicates the index k and the vertical axis the solution value u. Each dot
in the scatter plot represents an iterate u(k).
To describe the different scenarios, we adopt a terminology that already appeared in
the continuous realm. In the first three cases, the fixed point u = 0 is said to be globally
asymptotically stable since all solutions tend to 0 as k → ∞. In cases (d) and (e), the
zero solution is stable, since solutions with nearby initial data, | c | ≪ 1, remain nearby.
In the final two cases, the zero solution is unstable; any nonzero initial data a 6= 0 — no
matter how small — will give rise to a solution that eventually goes arbitrarily far away
from equilibrium.
Let us also analyze complex scalar iterative systems. The coefficient a and the initial
datum c in (2.7) are allowed to be complex numbers. The solution is the same, (2.8), but
now we need to know what happens when we raise a complex number a to a high power.
The secret is to write a = r e i Īø in polar form, where r = | a | is its modulus and Īø = ph a
its angle or phase. Then ak = rk e i k Īø. Since | e i k Īø | = 1, we have | ak | = | a |k, and so
the solutions (2.8) have modulus | u(k) | = | ak a | = | a |k | a |. As a result, u(k) will remain
bounded if and only if | a | ≤ 1, and will tend to zero as k → ∞ if and only if | a | < 1.
We have thus established the basic stability criteria for scalar, linear systems.
Theorem 2.4. The zero solution to a (real or complex) scalar iterative system
u(k+1) = a u(k) is
(a) asymptotically stable if and only if | a | < 1,
(b) stable if and only if | a | ≤ 1,
(c) unstable if and only if | a | > 1.
Nonlinear Scalar Iteration
The simplest “nonlinear” case is that of an affine function
g(u) = au + b, (2.10)
leading to an affine discrete dynamical system
u(k+1) = au(k) + b. (2.11)
5/18/08 11
c 2008 Peter J. Olver
u⋆1
u⋆2
u⋆3
Figure 2.2. Fixed Points.
The only fixed point is the solution to
u⋆ = g(u⋆) = au⋆ + b, namely, u⋆ =
b
1 − a
. (2.12)
The formula for u⋆ requires that a 6= 1, and, indeed, the case a = 1 has no fixed point, as
the reader can easily confirm; see Exercise .
Since we already know the value of u⋆, we can readily analyze the differences
e(k) = u(k) − u⋆, (2.13)
between successive iterates and the fixed point. Observe that, the smaller e(k) is, the closer
u(k) is to the desired fixed point. In many applications, the iterate u(k) is viewed as an
approximation to the fixed point u⋆, and so e(k) is interpreted as the error in the kth
iterate. Subtracting the fixed point equation (2.12) from the iteration equation (2.11), we
find
u(k+1) − u⋆ = a(u(k) − u⋆).
Therefore the errors e(k) are related by a linear iteration
e(k+1) = ae(k), and hence e(k) = ake(0). (2.14)
Therefore, as we already demonstrated in Section 7.1, the solutions to this scalar linear
iteration converge:
e(k) −→ 0 and hence u(k) −→ u⋆, if and only if | a | < 1.
This is the criterion for asymptotic stability of the fixed point, or, equivalently, convergence
of the affine iterative system (2.11). The magnitude of a determines the rate of convergence,
and the closer it is to 0, the faster the iterates approach the fixed point.
5/18/08 12
c 2008 Peter J. Olver
Figure 2.3. Tangent Line Approximation.
Example 2.5. The affine function
g(u) = 1
4 u + 2
leads to the iterative scheme
u(k+1) = 1
4 u(k) + 2.
Starting with the initial condition u(0) = 0, the ensuing values are
k 1 2 3 4 5 6 7 8
u(k) 2.0 2.5 2.625 2.6562 2.6641 2.6660 2.6665 2.6666
Thus, after 8 iterations, the iterates have produced the fixed point u⋆ = 8
3 to 4 decimal
places. The rate of convergence is 1
4 , and indeed
| e(k) | = | u(k) − u⋆ | =
ō1
4
k
| u(0) − u⋆ | = 8
3
ō 1
4
k
−→ 0 as k −→ ∞.
Let us now turn to the fully nonlinear case. First note that the fixed points of g(u)
correspond to the intersections of its graph with the graph of the function i(u) = u. For
instance Figure 2.2 shows the graph of a function that has 3 fixed points, labeled u⋆1, u⋆2 , u⋆3 .
In general, near any point in its domain, a (smooth) nonlinear function can be well
approximated by its tangent line, which repre4sents the graph of an affine function; see
Figure 2.3. Therefore, if we are close to a fixed point u⋆, then we might expect the iterative
system based on the nonlinear function g(u) to behave very much like that of its affine
tangent line approximation. And, indeed, this intuition turns out to be essentially correct.
This result forms our first concrete example of linearization, in which the analysis of a
nonlinear system is based on its linear (or, more precisely, affine) approximation.
The explicit formula for the tangent line to g(u) near the fixed point u = u⋆ = g(u⋆)
is
g(u) ≈ g(u⋆) + g′(u⋆)(u − u⋆) ≡ au + b, (2.15)
5/18/08 13
c 2008 Peter J. Olver
where
a = g′(u⋆), b = g(u⋆) − g′(u⋆) u⋆ =
ō
1 − g′(u⋆)
u⋆.
Note that u⋆ = b /(1 − a) remains a fixed point for the affine approximation: au⋆ + b =
u⋆. According to the preceding discussion, the convergence of the iterates for the affine
approximation is governed by the size of the coefficient a = g′(u⋆). This observation
inspires the basic stability criterion for fixed points of scalar iterative systems.
Theorem 2.6. Let g(u) be a continuously differentiable scalar function. Suppose
u⋆ = g(u⋆) is a fixed point. If | g′(u⋆) | < 1, then u⋆ is an asymptotically stable fixed
point, and hence any sequence of iterates u(k) which starts out sufficiently close to u⋆ will
converge to u⋆. On the other hand, if | g′(u⋆) | > 1, then u⋆ is an unstable fixed point, and
the only iterates which converge to it are those that land exactly on it, i.e., u(k) = u⋆ for
some k ≥ 0.
Proof : The goal is to prove that the errors e(k) = u(k) − u⋆ between the iterates and
the fixed point tend to 0 as k → ∞. To this end, we try to estimate e(k+1) in terms of
e(k). According to (2.6) and the Mean Value from calculus,
e(k+1) = u(k+1) − u⋆ = g(u(k)) − g(u⋆) = g′(v) (u(k) − u⋆) = g′(v) e(k), (2.16)
for some v lying between u(k) and u⋆. By continuity, if | g′(u⋆) | < 1 at the fixed point,
then we can choose Ī“ > 0 and | g′(u⋆) | < Ļ < 1 such that the estimate
| g′(v) | ≤ Ļ < 1 whenever | v − u⋆ | < Ī“ (2.17)
holds in a (perhaps small) interval surrounding the fixed point. Suppose
| e(k) | = | u(k) − u⋆ | < Ī“.
Then the point v in (2.16), which is closer to u⋆ than u(k), satisfies (2.17). Therefore,
| u(k+1) − u⋆ | ≤ Ļ | u(k) − u⋆ |, and hence | e(k+1) | ≤ Ļ | e(k) |. (2.18)
In particular, since Ļ < 1, we have | u(k+1) − u⋆ | < Ī“, and hence the subsequent iterate
u(k+1) also lies in the interval where (2.17) holds. Repeating the argument, we conclude
that, provided the initial iterate satisfies
| e(0) | = | u(0) − u⋆ | < Ī“,
the subsequent errors are bounded by
e(k) ≤ Ļk e(0), and hence e(k) = | u(k) − u⋆ | −→ 0 as k → ∞,
which completes the proof of the theorem in the stable case.
The proof in unstable case is left as Exercise for the reader. Q.E.D.
Remark: The constant Ļ governs the rate of convergence of the iterates to the fixed
point. The closer the iterates are to the fixed point, the smaller we can choose Ī“ in (2.17),
and hence the closer we can choose Ļ to | g′(u⋆) |. Thus, roughly speaking, | g′(u⋆) | governs
the speed of convergence, once the iterates get close to the fixed point. This observation
will be developed more fully in the following subsection.
5/18/08 14
c 2008 Peter J. Olver
m
u
Figure 2.4. Planetary Orbit.
Remark: The cases when g′(u⋆) = ±1 are not covered by the theorem. For a linear
system, such fixed points are stable, but not asymptotically stable. For nonlinear systems,
more detailed knowledge of the nonlinear terms is required in order to resolve the status —
stable or unstable — of the fixed point. Despite their importance in certain applications,
we will not try to analyze such borderline cases any further here.
Example 2.7. Given constants Ē«,m, the trigonometric equation
u = m+ Ē« sin u (2.19)
is known as Kepler’s equation. It arises in the study of planetary motion, in which 0 < Ē« < 1
represents the eccentricity of an elliptical planetary orbit, u is the eccentric anomaly,
defined as the angle formed at the center of the ellipse by the planet and the major axis,
and m = 2Ļ t /T is its mean anomaly, which is the time, measured in units of T/(2Ļ)
where T is the period of the orbit, i.e., the length of the planet’s year, since perihelion or
point of closest approach to the sun; see Figure 2.4.
The solutions to Kepler’s equation are the fixed points of the discrete dynamical
system based on the function
g(u) = m + Ē« sin u.
Note that
| g′(u) | = | Ē« cos u | = | Ē« | < 1, (2.20)
which automatically implies that the as yet unknown fixed point is stable. Indeed, Exercise
implies that condition (2.20) is enough to prove the existence of a unique stable fixed
point. In the particular case m = Ē« = 1
2 , the result of iterating u(k+1) = 1
2 + 1
2 sin u(k)
starting with u(0) = 0 is
5/18/08 15
c 2008 Peter J. Olver
u
g(u)
u⋆
L+(u)
L
−
(u)
Figure 2.5. Graph of a Contraction.
k 1 2 3 4 5 6 7 8 9
u(k) .5 .7397 .8370 .8713 .8826 .8862 .8873 .8877 .8878
After 13 iterations, we have converged sufficiently close to the solution (fixed point) u⋆ =
.887862 to have computed its value to 6 decimal places.
Inspection of the proof of Theorem 2.6 reveals that we never really used the differentiability
of g, except to verify the inequality
| g(u) − g(u⋆) | ≤ Ļ | u − u⋆ | for some fixed Ļ < 1. (2.21)
A function that satisfies (2.21) for all nearby u is called a contraction at the point u⋆. Any
function g(u) whose graph lies between the two lines
L
±
(u) = g(u⋆) ± Ļ (u − u⋆) for some Ļ < 1,
for all u sufficiently close to u⋆, i.e., such that | u − u⋆ | < Ī“ for some Ī“ > 0, defines a
contraction, and hence fixed point iteration starting with | u(0) − u⋆ | < Ī“ will converge to
u⋆; see Figure 2.5. In particular, Exercise asks you to prove that any function that is
differentiable at u⋆ with | g′(u⋆) | < 1 defines a contraction at u⋆.
Example 2.8. The simplest truly nonlinear example is a quadratic polynomial. The
most important case is the so-called logistic map
g(u) = Ī» u(1 − u), (2.22)
where Ī» 6= 0 is a fixed non-zero parameter. (The case Ī» = 0 is completely trivial. Why?)
In fact, an elementary change of variables can make any quadratic iterative system into
one involving a logistic map; see Exercise .
5/18/08 16
c 2008 Peter J. Olver
The fixed points of the logistic map are the solutions to the quadratic equation
u = Ī» u(1 − u), or Ī» u2 − Ī» u + 1 = 0.
Using the quadratic formula, we conclude that g(u) has two fixed points:
u⋆1 = 0, u⋆2 = 1 −
1
Ī»
.
Let us apply Theorem 2.6 to determine their stability. The derivative is
g′(u) = Ī» − 2 Ī» u, and so g′(u⋆1 ) = Ī», g′(u⋆2 ) = 2 − Ī».
Therefore, if | Ī» | < 1, the first fixed point is stable, while if 1 < Ī» < 3, the second fixed
point is stable. For Ī» < −1 or Ī» > 3 neither fixed point is stable, and we expect the
iterates to not converge at all.
Numerical experiments with this example show that it is the source of an amazingly
diverse range of behavior, depending upon the value of the parameter Ī». In the accompanying
Figure 2.6, we display the results of iteration starting with initial point u(0) = .5 for
several different values of Ī»; in each plot, the horizontal axis indicates the iterate number
k and the vertical axis the iterate valoue u(k) for k = 0, . . . , 100. As expected from Theorem
2.6, the iterates converge to one of the fixed points in the range −1 < Ī» < 3, except
when Ī» = 1. For Ī» a little bit larger than Ī»1 = 3, the iterates do not converge to a fixed
point. But it does not take long for them to settle down, switching back and forth between
two particular values. This behavior indicates the exitence of a (stable) period 2 orbit for
the discrete dynamical system, in accordance with the following definition.
Definition 2.9. A period k orbit of a discrete dynamical system is a solution that
satisfies u(n+k) = u(n) for all n = 0, 1, 2, . . . . The (minimal) period is the smallest positive
value of k for which this condition holds.
Thus, a fixed point
u(0) = u(1) = u(2) = · · ·
is a period 1 orbit. A period 2 orbit satisfies
u(0) = u(2) = u(4) = · · · and u(1) = u(3) = u(5) = · · · ,
but u(0) 6= u(1), as otherwise the minimal period would be 1. Similarly, a period 3 orbit
has
u(0) = u(3) = u(6) = · · · , u(1) = u(4) = u(7) = · · · , u(2) = u(5) = u(8) = · · · ,
with u(0), u(1), u(2) distinct. Stability of a period k orbit implies that nearby iterates
converge to this periodic solution.
For the logistic map, the period 2 orbit persists until Ī» = Ī»2 ≈ 3.4495, after which
the iterates alternate between four values — a period 4 orbit. This again changes at
Ī» = Ī»3 ≈ 3.5441, after which the iterates end up alternating between eight values. In fact,
there is an increasing sequence of values
3 = Ī»1 < Ī»2 < Ī»3 < Ī»4 < · · · ,
5/18/08 17
c 2008 Peter J. Olver
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 1.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 2.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.4
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.5
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.55
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.6
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.7
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.8
Figure 2.6. Logistic Iterates.
where, for any Ī»n < Ī» ≤ Ī»n+1, the iterates eventually follow a period 2n orbit. Thus, as Ī»
passes through each value Ī»n the period of the orbit goes from 2n to 2 · 2n = 2n+1, and the
discrete dynamical system experiences a bifurcation. The bifurcation values Ī»n are packed
closer and closer together as n increases, piling up on an eventual limiting value
Ī»⋆ = lim
n→∞
Ī»n ≈ 3.5699,
at which point the orbit’s period has, so to speak, become infinitely large. The entire
phenomena is known as a period doubling cascade.
Interestingly, the ratios of the distances between successive bifurcation points approaches
a well-defined limit,
Ī»n+2 − Ī»n+1
Ī»n+1 − Ī»n −→ 4.6692 . . . , (2.23)
known as Feigenbaum’s constant . In the 1970’s, the American physicist Mitchell Feigenbaum,
[16], discovered that similar period doubling cascades appear in a broad range of
discrete dynamical systems. Even more remarkably, in almost all cases, the corresponding
5/18/08 18
c 2008 Peter J. Olver
2.5 3 3.5 4
0.2
0.4
0.6
0.8
1
Figure 2.7. The Logistic Map.
ratios of distances between bifurcation points has the same limiting value. Feigenbaum’s
experimental observations were rigorously proved by Oscar Lanford in 1982, [33].
After Ī» passes the limiting value Ī»⋆, all hell breaks loose. The iterates become completely
chaotic†, moving at random over the interval [0, 1]. But this is not the end of the
story. Embedded within this chaotic regime are certain small ranges of Ī» where the system
settles down to a stable orbit, whose period is no longer necessarily a power of 2. In fact,
there exist values of Ī» for which the iterates settle down to a stable orbit of period k for any
positive integer k. For instance, as Ī» increases past Ī»3,⋆ ≈ 3.83, a period 3 orbit appears
over a small range of values, after which, as Ī» increses slightly further, there is a period
doubling cascade where period 6, 12, 24, . . . orbits successively appear, each persisting on
a shorter and shorter range of parameter values, until Ī» passes yet another critical value
where chaos breaks out yet again. There is a well-prescribed order in which the periodic
orbits make their successive appearance, and each odd period k orbit is followed by a very
closely spaced sequence of period doubling bifurcations, of periods 2n k for n = 1, 2, 3, . . . ,
after which the iterates revert to completely chaotic behavior until the next periodic case
emerges. The ratios of distances between bifurcation points always have the same Feigenbaum
limit (2.23). Finally, these periodic and chaotic windows all pile up on the ultimate
parameter value Ī»⋆⋆
= 4. And then, when Ī» > 4, all the iterates go off to ∞, and the
system ceases to be interesting.
The reader is encouraged to write a simple computer program and perform some
numerical experiments. In particular, Figure 2.7 shows the asymptotic behavior of the
iterates for values of the parameter in the interesting range 2 < Ī» < 4. The horizontal
axis is Ī», and the marked points show the ultimate fate of the iteration for the given
† The term “chaotic” does have a precise mathematical definition, [14], but the reader can
take it more figuratively for the purposes of this elementary exposition.
5/18/08 19
c 2008 Peter J. Olver
value of Ī». For instance, each point the single curve lying above the smaller values of
Ī» represents a stable fixed point; this bifurcates into a pair of curves representing stable
period 2 orbits, which then bifurcates into 4 curves representing period 4 orbits, and so
on. Chaotic behavior is indicated by a somewhat random pattern of points lying above the
value of Ī». To plot this figure, we ran the logistic iteration u(n) for 0 ≤ n ≤ 100, discarded
the first 50 points, and then plotted the next 50 iterates u(51), . . . , u(100). Investigation of
the fine detailed structure of the logistic map requires yet more iterations with increased
numerical accuracy. In addition one should discard more of the initial iterates so as to give
the system enough time to settle down to a stable periodic orbit or, alternatively, continue
in a chaotic manner.
Remark: So far, we have only looked at real scalar iterative systems. Complex discrete
dynamical systems display yet more remarkable and fascinating behavior. The complex
version of the logistic iteration equation leads to the justly famous Julia and Mandelbrot
sets, [35], with their stunning, psychedelic fractal structure, [46].
The rich range of phenomena in evidence, even in such extremely simple nonlinear
iterative systems, is astounding. While intimations first appeared in the late nineteenth
century research of the influential French mathematician Henri Poincar´e, serious investigations
were delayed until the advent of the computer era, which precipitated an explosion of
research activity in the area of dynamical systems. Similar period doubling cascades and
chaos are found in a broad range of nonlinear systems, [1], and are often encountered in
physical applications, [38]. A modern explanation of fluid turbulence is that it is a (very
complicated) form of chaos, [1].
Quadratic Convergence
Let us now return to the more mundane case when the iterates converge to a stable
fixed point of the discrete dynamical system. In applications, we use the iterates to compute
a precise† numerical value for the fixed point, and hence the efficiency of the algorithm
depends on the speed of convergence of the iterates.
According to the remark following the proof Theorem 2.6, the convergence rate of an
iterative system is essentially governed by the magnitude of the derivative | g′(u⋆) | at the
fixed point. The basic inequality (2.18) for the errors e(k) = u(k) − u⋆, namely
| e(k+1) | ≤ Ļ | e(k) |,
is known as a linear convergence estimate. It means that, once the iterates are close to
the fixed point, the error decreases by a factor of (at least) Ļ ≈ | g′(u⋆) | at each step. If
the kth iterate u(k) approximates the fixed point u⋆ correctly to m decimal places, so its
error is bounded by
| e(k) | < .5 × 10−m,
then the (k + 1)st iterate satisfies the error bound
| e(k+1) | ≤ Ļ | e(k) | < .5 × 10−mĻ = .5 × 10−m+log10 Ļ.
† The degree of precision is to be specified by the user and the application.
5/18/
Peter J. Olver
2. Numerical Solution of Scalar Equations
Most numerical solution methods are based on some form of iteration. The basic idea
is that repeated application of the algorithm will produce closer and closer approximations
to the desired solution. To analyze such algorithms, our first task is to understand general
iterative processes.
2.1. Iteration of Functions.
Iteration, meaning repeated application of a function, can be viewed as a discrete
dynamical system in which the continuous time variable has been “quantized” to assume
integer values. Even iterating a very simple quadratic scalar function can lead to an amazing
variety of dynamical phenomena, including multiply-periodic solutions and genuine
chaos. Nonlinear iterative systems arise not just in mathematics, but also underlie the
growth and decay of biological populations, predator-prey interactions, spread of communicable
diseases such as Aids, and host of other natural phenomena. Moreover, many
numerical solution methods — for systems of algebraic equations, ordinary differential
equations, partial differential equations, and so on — rely on iteration, and so the theory
underlies the analysis of convergence and efficiency of such numerical approximation
schemes.
In general, an iterative system has the form
u(k+1) = g(u(k)), (2.1)
where g:Rn → Rn is a real vector-valued function. (One can similarly treat iteration of
complex-valued functions g:Cn → Cn, but, for simplicity, we only deal with real systems
here.) A solution is a discrete collection of points† u(k) ∈ Rn, in which the index k =
0, 1, 2, 3, . . . takes on non-negative integer values.
Once we specify the initial iterate,
u(0) = c, (2.2)
then the resulting solution to the discrete dynamical system (2.1) is easily computed:
u(1) = g(u(0)) = g(c), u(2) = g(u(1)) = g(g(c)), u(3) = g(u(2)) = g(g(g(c))), . . .
† The superscripts on u
(k) refer to the iteration number, and do not denote derivatives.
5/18/08 7
c 2008 Peter J. Olver
and so on. Thus, unlike continuous dynamical systems, the existence and uniqueness of
solutions is not an issue. As long as each successive iterate u(k) lies in the domain of
definition of g one merely repeats the process to produce the solution,
u(k) =
k times z }| {
g ◦ · · · ◦g (c), k = 0, 1, 2, . . . ,
(2.3)
which is obtained by composing the function g with itself a total of k times. In other
words, the solution to a discrete dynamical system corresponds to repeatedly pushing the
g key on your calculator. For example, entering 0 and then repeatedly hitting the cos key
corresponds to solving the iterative system
u(k+1) = cos u(k), u(0) = 0. (2.4)
The first 10 iterates are displayed in the following table:
k 0 1 2 3 4 5 6 7 8 9
u(k) 0 1 .540302 .857553 .65429 .79348 .701369 .76396 .722102 .750418
For simplicity, we shall always assume that the vector-valued function g:Rn → Rn is
defined on all of Rn; otherwise, we must always be careful that the successive iterates u(k)
never leave its domain of definition, thereby causing the iteration to break down. To avoid
technical complications, we will also assume that g is at least continuous; later results rely
on additional smoothness requirements, e.g., continuity of its first and second order partial
derivatives.
While the solution to a discrete dynamical system is essentially trivial, understanding
its behavior is definitely not. Sometimes the solution converges to a particular value —
the key requirement for numerical solution methods. Sometimes it goes off to ∞, or, more
precisely, the norms† of the iterates are unbounded: k u(k) k → ∞ as k → ∞. Sometimes
the solution repeats itself after a while. And sometimes the iterates behave in a seemingly
random, chaotic manner — all depending on the function g and, at times, the initial
condition c. Although all of these cases may arise in real-world applications, we shall
mostly concentrate upon understanding convergence.
Definition 2.1. A fixed point or equilibrium of a discrete dynamical system (2.1) is
a vector u⋆ ∈ Rn such that
g(u⋆) = u⋆. (2.5)
We easily see that every fixed point provides a constant solution to the discrete dynamical
system, namely u(k) = u⋆ for all k. Moreover, it is not hard to prove that any
convergent solution necessarily converges to a fixed point.
† In view of the equivalence of norms on finite-dimensional vector spaces, cf. Theorem 5.9, any
norm will do here.
5/18/08 8
c 2008 Peter J. Olver
Proposition 2.2. If a solution to a discrete dynamical system converges,
lim
k→∞
u(k) = u⋆,
then the limit u⋆ is a fixed point.
Proof : This is a simple consequence of the continuity of g. We have
u⋆ = lim
k→∞
u(k+1) = lim
k→∞
g(u(k)) = g
lim
k→∞
u(k)
= g(u⋆),
the last two equalities following from the continuity of g. Q.E.D.
For example, continuing the cosine iteration (2.4), we find that the iterates gradually
converge to the value u⋆ ≈ .739085, which is the unique solution to the fixed point equation
cos u = u.
Later we will see how to rigorously prove this observed behavior.
Of course, not every solution to a discrete dynamical system will necessarily converge,
but Proposition 2.2 says that if it does, then it must converge to a fixed point. Thus, a
key goal is to understand when a solution converges, and, if so, to which fixed point —
if there is more than one. (In the linear case, only the actual convergence is a significant
issues since most linear systems admit exactly one fixed point, namely u⋆ = 0.)
Fixed points are roughly divided into three classes:
• asymptotically stable, with the property that all nearby solutions converge to it,
• stable, with the property that all nearby solutions stay nearby, and
• unstable, almost all of whose nearby solutions diverge away from the fixed point.
Thus, from a practical standpoint, convergence of the iterates of a discrete dynamical
system requires asymptotic stability of the fixed point. Examples will appear in abundance
in the following sections.
Scalar Functions
As always, the first step is to thoroughly understand the scalar case, and so we begin
with a discrete dynamical system
u(k+1) = g(u(k)), u(0) = c, (2.6)
in which g:R → R is a continuous, scalar-valued function. As noted above, we will assume,
for simplicity, that g is defined everywhere, and so we do not need to worry about whether
the iterates u(0), u(1), u(2), . . . are all well-defined.
We begin with the case of a linear function g(u) = a u. Consider the corresponding
iterative equation
u(k+1) = a u(k), u(0) = c. (2.7)
The general solution to (2.7) is easily found:
u(1) = a u(0) = a c, u(2) = a u(1) = a2 c, u(3) = a u(2) = a3 c,
5/18/08 9
c 2008 Peter J. Olver
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
0 < a < 1
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
−1 < a < 0
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
a = 1
5 10 15 20 25 30
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
a = −1
5 10 15 20 25 30
-10
-7.5
-5
-2.5
2.5
5
7.5
10
1 < a
5 10 15 20 25 30
-10
-7.5
-5
-2.5
2.5
5
7.5
10
a < −1
Figure 2.1. One Dimensional Real Linear Iterative Systems.
and, in general,
u(k) = ak c. (2.8)
If the initial condition is a = 0, then the solution u(k) ≡ 0 is constant. Therefore, 0 is a
fixed point or equilibrium solution for the iterative system.
Example 2.3. Banks add interest to a savings account at discrete time intervals.
For example, if the bank offers 5% interest compounded yearly, this means that the account
balance will increase by 5% each year. Thus, assuming no deposits or withdrawals, the
balance u(k) after k years will satisfy the iterative equation (2.7) with a = 1 + r where
r = .05 is the interest rate, and the 1 indicates that all the money remains in the account.
Thus, after k years, your account balance is
u(k) = (1 + r)kc, where c = u(0) (2.9)
is your initial deposit. For example, if c = $1, 000, after 1 year your account has u(1) =
$1, 050, after 10 years u(10) = $1, 628.89, after 50 years u(50) = $11, 467.40, and after 200
years u(200) = $17, 292, 580.82.
When the interest is compounded monthly, the rate is still quoted on a yearly basis,
and so you receive 1
12 of the interest each month. If Ėu(k) denotes the balance after k
months, then, after n years, the account balance is Ėu(12n) =
ō
1 + 1
12 r
12n c. Thus,
when the interest rate of 5% is compounded monthly, your account balance is Ėu(12) =
$1, 051.16 after 1 year, Ėu(120) = $1, 647.01 after 10 years, Ėu(600) = $12, 119.38 after 50
years, and Ėu(2400) = $21, 573, 572.66 dollars after 200 years. So, if you wait sufficiently
long, compounding will have a dramatic effect. Similarly, daily compounding replaces 12
by 365.25, the number of days in a year. After 200 years, the balance is $22, 011, 396.03.
Let us analyze the solutions of scalar iterative equations, starting with the case when
a ∈ R is a real constant. Aside from the equilibrium solution u(k) ≡ 0, the iterates exhibit
five qualitatively different behaviors, depending on the size of the coefficient a.
5/18/08 10
c 2008 Peter J. Olver
(a) If a = 0, the solution immediately becomes zero, and stays there, so u(k) = 0 for all
k ≥ 1.
(b) If 0 < a < 1, then the solution is of one sign, and tends monotonically to zero, so
u(k) → 0 as k → ∞.
(c) If −1 < a < 0, then the solution tends to zero: u(k) → 0 as k → ∞. Successive
iterates have alternating signs.
(d) If a = 1, the solution is constant: u(k) = a, for all k ≥ 0.
(e) If a = −1, the solution switches back and forth between two values; u(k) = (−1)k c.
(f ) If 1 < a < ∞, then the iterates u(k) become unbounded. If c > 0, they go monotonically
to +∞; if c < 0, to −∞.
(g) If −∞ < a < −1, then the iterates u(k) also become unbounded, with alternating
signs.
In Figure 2.1 we exhibit representative scatter plots for the nontrivial cases (b – g). The
horizontal axis indicates the index k and the vertical axis the solution value u. Each dot
in the scatter plot represents an iterate u(k).
To describe the different scenarios, we adopt a terminology that already appeared in
the continuous realm. In the first three cases, the fixed point u = 0 is said to be globally
asymptotically stable since all solutions tend to 0 as k → ∞. In cases (d) and (e), the
zero solution is stable, since solutions with nearby initial data, | c | ≪ 1, remain nearby.
In the final two cases, the zero solution is unstable; any nonzero initial data a 6= 0 — no
matter how small — will give rise to a solution that eventually goes arbitrarily far away
from equilibrium.
Let us also analyze complex scalar iterative systems. The coefficient a and the initial
datum c in (2.7) are allowed to be complex numbers. The solution is the same, (2.8), but
now we need to know what happens when we raise a complex number a to a high power.
The secret is to write a = r e i Īø in polar form, where r = | a | is its modulus and Īø = ph a
its angle or phase. Then ak = rk e i k Īø. Since | e i k Īø | = 1, we have | ak | = | a |k, and so
the solutions (2.8) have modulus | u(k) | = | ak a | = | a |k | a |. As a result, u(k) will remain
bounded if and only if | a | ≤ 1, and will tend to zero as k → ∞ if and only if | a | < 1.
We have thus established the basic stability criteria for scalar, linear systems.
Theorem 2.4. The zero solution to a (real or complex) scalar iterative system
u(k+1) = a u(k) is
(a) asymptotically stable if and only if | a | < 1,
(b) stable if and only if | a | ≤ 1,
(c) unstable if and only if | a | > 1.
Nonlinear Scalar Iteration
The simplest “nonlinear” case is that of an affine function
g(u) = au + b, (2.10)
leading to an affine discrete dynamical system
u(k+1) = au(k) + b. (2.11)
5/18/08 11
c 2008 Peter J. Olver
u⋆1
u⋆2
u⋆3
Figure 2.2. Fixed Points.
The only fixed point is the solution to
u⋆ = g(u⋆) = au⋆ + b, namely, u⋆ =
b
1 − a
. (2.12)
The formula for u⋆ requires that a 6= 1, and, indeed, the case a = 1 has no fixed point, as
the reader can easily confirm; see Exercise .
Since we already know the value of u⋆, we can readily analyze the differences
e(k) = u(k) − u⋆, (2.13)
between successive iterates and the fixed point. Observe that, the smaller e(k) is, the closer
u(k) is to the desired fixed point. In many applications, the iterate u(k) is viewed as an
approximation to the fixed point u⋆, and so e(k) is interpreted as the error in the kth
iterate. Subtracting the fixed point equation (2.12) from the iteration equation (2.11), we
find
u(k+1) − u⋆ = a(u(k) − u⋆).
Therefore the errors e(k) are related by a linear iteration
e(k+1) = ae(k), and hence e(k) = ake(0). (2.14)
Therefore, as we already demonstrated in Section 7.1, the solutions to this scalar linear
iteration converge:
e(k) −→ 0 and hence u(k) −→ u⋆, if and only if | a | < 1.
This is the criterion for asymptotic stability of the fixed point, or, equivalently, convergence
of the affine iterative system (2.11). The magnitude of a determines the rate of convergence,
and the closer it is to 0, the faster the iterates approach the fixed point.
5/18/08 12
c 2008 Peter J. Olver
Figure 2.3. Tangent Line Approximation.
Example 2.5. The affine function
g(u) = 1
4 u + 2
leads to the iterative scheme
u(k+1) = 1
4 u(k) + 2.
Starting with the initial condition u(0) = 0, the ensuing values are
k 1 2 3 4 5 6 7 8
u(k) 2.0 2.5 2.625 2.6562 2.6641 2.6660 2.6665 2.6666
Thus, after 8 iterations, the iterates have produced the fixed point u⋆ = 8
3 to 4 decimal
places. The rate of convergence is 1
4 , and indeed
| e(k) | = | u(k) − u⋆ | =
ō1
4
k
| u(0) − u⋆ | = 8
3
ō 1
4
k
−→ 0 as k −→ ∞.
Let us now turn to the fully nonlinear case. First note that the fixed points of g(u)
correspond to the intersections of its graph with the graph of the function i(u) = u. For
instance Figure 2.2 shows the graph of a function that has 3 fixed points, labeled u⋆1, u⋆2 , u⋆3 .
In general, near any point in its domain, a (smooth) nonlinear function can be well
approximated by its tangent line, which repre4sents the graph of an affine function; see
Figure 2.3. Therefore, if we are close to a fixed point u⋆, then we might expect the iterative
system based on the nonlinear function g(u) to behave very much like that of its affine
tangent line approximation. And, indeed, this intuition turns out to be essentially correct.
This result forms our first concrete example of linearization, in which the analysis of a
nonlinear system is based on its linear (or, more precisely, affine) approximation.
The explicit formula for the tangent line to g(u) near the fixed point u = u⋆ = g(u⋆)
is
g(u) ≈ g(u⋆) + g′(u⋆)(u − u⋆) ≡ au + b, (2.15)
5/18/08 13
c 2008 Peter J. Olver
where
a = g′(u⋆), b = g(u⋆) − g′(u⋆) u⋆ =
ō
1 − g′(u⋆)
u⋆.
Note that u⋆ = b /(1 − a) remains a fixed point for the affine approximation: au⋆ + b =
u⋆. According to the preceding discussion, the convergence of the iterates for the affine
approximation is governed by the size of the coefficient a = g′(u⋆). This observation
inspires the basic stability criterion for fixed points of scalar iterative systems.
Theorem 2.6. Let g(u) be a continuously differentiable scalar function. Suppose
u⋆ = g(u⋆) is a fixed point. If | g′(u⋆) | < 1, then u⋆ is an asymptotically stable fixed
point, and hence any sequence of iterates u(k) which starts out sufficiently close to u⋆ will
converge to u⋆. On the other hand, if | g′(u⋆) | > 1, then u⋆ is an unstable fixed point, and
the only iterates which converge to it are those that land exactly on it, i.e., u(k) = u⋆ for
some k ≥ 0.
Proof : The goal is to prove that the errors e(k) = u(k) − u⋆ between the iterates and
the fixed point tend to 0 as k → ∞. To this end, we try to estimate e(k+1) in terms of
e(k). According to (2.6) and the Mean Value from calculus,
e(k+1) = u(k+1) − u⋆ = g(u(k)) − g(u⋆) = g′(v) (u(k) − u⋆) = g′(v) e(k), (2.16)
for some v lying between u(k) and u⋆. By continuity, if | g′(u⋆) | < 1 at the fixed point,
then we can choose Ī“ > 0 and | g′(u⋆) | < Ļ < 1 such that the estimate
| g′(v) | ≤ Ļ < 1 whenever | v − u⋆ | < Ī“ (2.17)
holds in a (perhaps small) interval surrounding the fixed point. Suppose
| e(k) | = | u(k) − u⋆ | < Ī“.
Then the point v in (2.16), which is closer to u⋆ than u(k), satisfies (2.17). Therefore,
| u(k+1) − u⋆ | ≤ Ļ | u(k) − u⋆ |, and hence | e(k+1) | ≤ Ļ | e(k) |. (2.18)
In particular, since Ļ < 1, we have | u(k+1) − u⋆ | < Ī“, and hence the subsequent iterate
u(k+1) also lies in the interval where (2.17) holds. Repeating the argument, we conclude
that, provided the initial iterate satisfies
| e(0) | = | u(0) − u⋆ | < Ī“,
the subsequent errors are bounded by
e(k) ≤ Ļk e(0), and hence e(k) = | u(k) − u⋆ | −→ 0 as k → ∞,
which completes the proof of the theorem in the stable case.
The proof in unstable case is left as Exercise for the reader. Q.E.D.
Remark: The constant Ļ governs the rate of convergence of the iterates to the fixed
point. The closer the iterates are to the fixed point, the smaller we can choose Ī“ in (2.17),
and hence the closer we can choose Ļ to | g′(u⋆) |. Thus, roughly speaking, | g′(u⋆) | governs
the speed of convergence, once the iterates get close to the fixed point. This observation
will be developed more fully in the following subsection.
5/18/08 14
c 2008 Peter J. Olver
m
u
Figure 2.4. Planetary Orbit.
Remark: The cases when g′(u⋆) = ±1 are not covered by the theorem. For a linear
system, such fixed points are stable, but not asymptotically stable. For nonlinear systems,
more detailed knowledge of the nonlinear terms is required in order to resolve the status —
stable or unstable — of the fixed point. Despite their importance in certain applications,
we will not try to analyze such borderline cases any further here.
Example 2.7. Given constants Ē«,m, the trigonometric equation
u = m+ Ē« sin u (2.19)
is known as Kepler’s equation. It arises in the study of planetary motion, in which 0 < Ē« < 1
represents the eccentricity of an elliptical planetary orbit, u is the eccentric anomaly,
defined as the angle formed at the center of the ellipse by the planet and the major axis,
and m = 2Ļ t /T is its mean anomaly, which is the time, measured in units of T/(2Ļ)
where T is the period of the orbit, i.e., the length of the planet’s year, since perihelion or
point of closest approach to the sun; see Figure 2.4.
The solutions to Kepler’s equation are the fixed points of the discrete dynamical
system based on the function
g(u) = m + Ē« sin u.
Note that
| g′(u) | = | Ē« cos u | = | Ē« | < 1, (2.20)
which automatically implies that the as yet unknown fixed point is stable. Indeed, Exercise
implies that condition (2.20) is enough to prove the existence of a unique stable fixed
point. In the particular case m = Ē« = 1
2 , the result of iterating u(k+1) = 1
2 + 1
2 sin u(k)
starting with u(0) = 0 is
5/18/08 15
c 2008 Peter J. Olver
u
g(u)
u⋆
L+(u)
L
−
(u)
Figure 2.5. Graph of a Contraction.
k 1 2 3 4 5 6 7 8 9
u(k) .5 .7397 .8370 .8713 .8826 .8862 .8873 .8877 .8878
After 13 iterations, we have converged sufficiently close to the solution (fixed point) u⋆ =
.887862 to have computed its value to 6 decimal places.
Inspection of the proof of Theorem 2.6 reveals that we never really used the differentiability
of g, except to verify the inequality
| g(u) − g(u⋆) | ≤ Ļ | u − u⋆ | for some fixed Ļ < 1. (2.21)
A function that satisfies (2.21) for all nearby u is called a contraction at the point u⋆. Any
function g(u) whose graph lies between the two lines
L
±
(u) = g(u⋆) ± Ļ (u − u⋆) for some Ļ < 1,
for all u sufficiently close to u⋆, i.e., such that | u − u⋆ | < Ī“ for some Ī“ > 0, defines a
contraction, and hence fixed point iteration starting with | u(0) − u⋆ | < Ī“ will converge to
u⋆; see Figure 2.5. In particular, Exercise asks you to prove that any function that is
differentiable at u⋆ with | g′(u⋆) | < 1 defines a contraction at u⋆.
Example 2.8. The simplest truly nonlinear example is a quadratic polynomial. The
most important case is the so-called logistic map
g(u) = Ī» u(1 − u), (2.22)
where Ī» 6= 0 is a fixed non-zero parameter. (The case Ī» = 0 is completely trivial. Why?)
In fact, an elementary change of variables can make any quadratic iterative system into
one involving a logistic map; see Exercise .
5/18/08 16
c 2008 Peter J. Olver
The fixed points of the logistic map are the solutions to the quadratic equation
u = Ī» u(1 − u), or Ī» u2 − Ī» u + 1 = 0.
Using the quadratic formula, we conclude that g(u) has two fixed points:
u⋆1 = 0, u⋆2 = 1 −
1
Ī»
.
Let us apply Theorem 2.6 to determine their stability. The derivative is
g′(u) = Ī» − 2 Ī» u, and so g′(u⋆1 ) = Ī», g′(u⋆2 ) = 2 − Ī».
Therefore, if | Ī» | < 1, the first fixed point is stable, while if 1 < Ī» < 3, the second fixed
point is stable. For Ī» < −1 or Ī» > 3 neither fixed point is stable, and we expect the
iterates to not converge at all.
Numerical experiments with this example show that it is the source of an amazingly
diverse range of behavior, depending upon the value of the parameter Ī». In the accompanying
Figure 2.6, we display the results of iteration starting with initial point u(0) = .5 for
several different values of Ī»; in each plot, the horizontal axis indicates the iterate number
k and the vertical axis the iterate valoue u(k) for k = 0, . . . , 100. As expected from Theorem
2.6, the iterates converge to one of the fixed points in the range −1 < Ī» < 3, except
when Ī» = 1. For Ī» a little bit larger than Ī»1 = 3, the iterates do not converge to a fixed
point. But it does not take long for them to settle down, switching back and forth between
two particular values. This behavior indicates the exitence of a (stable) period 2 orbit for
the discrete dynamical system, in accordance with the following definition.
Definition 2.9. A period k orbit of a discrete dynamical system is a solution that
satisfies u(n+k) = u(n) for all n = 0, 1, 2, . . . . The (minimal) period is the smallest positive
value of k for which this condition holds.
Thus, a fixed point
u(0) = u(1) = u(2) = · · ·
is a period 1 orbit. A period 2 orbit satisfies
u(0) = u(2) = u(4) = · · · and u(1) = u(3) = u(5) = · · · ,
but u(0) 6= u(1), as otherwise the minimal period would be 1. Similarly, a period 3 orbit
has
u(0) = u(3) = u(6) = · · · , u(1) = u(4) = u(7) = · · · , u(2) = u(5) = u(8) = · · · ,
with u(0), u(1), u(2) distinct. Stability of a period k orbit implies that nearby iterates
converge to this periodic solution.
For the logistic map, the period 2 orbit persists until Ī» = Ī»2 ≈ 3.4495, after which
the iterates alternate between four values — a period 4 orbit. This again changes at
Ī» = Ī»3 ≈ 3.5441, after which the iterates end up alternating between eight values. In fact,
there is an increasing sequence of values
3 = Ī»1 < Ī»2 < Ī»3 < Ī»4 < · · · ,
5/18/08 17
c 2008 Peter J. Olver
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 1.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 2.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.0
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.4
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.5
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.55
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.6
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.7
20 40 60 80 100
0.2
0.4
0.6
0.8
1
Ī» = 3.8
Figure 2.6. Logistic Iterates.
where, for any Ī»n < Ī» ≤ Ī»n+1, the iterates eventually follow a period 2n orbit. Thus, as Ī»
passes through each value Ī»n the period of the orbit goes from 2n to 2 · 2n = 2n+1, and the
discrete dynamical system experiences a bifurcation. The bifurcation values Ī»n are packed
closer and closer together as n increases, piling up on an eventual limiting value
Ī»⋆ = lim
n→∞
Ī»n ≈ 3.5699,
at which point the orbit’s period has, so to speak, become infinitely large. The entire
phenomena is known as a period doubling cascade.
Interestingly, the ratios of the distances between successive bifurcation points approaches
a well-defined limit,
Ī»n+2 − Ī»n+1
Ī»n+1 − Ī»n −→ 4.6692 . . . , (2.23)
known as Feigenbaum’s constant . In the 1970’s, the American physicist Mitchell Feigenbaum,
[16], discovered that similar period doubling cascades appear in a broad range of
discrete dynamical systems. Even more remarkably, in almost all cases, the corresponding
5/18/08 18
c 2008 Peter J. Olver
2.5 3 3.5 4
0.2
0.4
0.6
0.8
1
Figure 2.7. The Logistic Map.
ratios of distances between bifurcation points has the same limiting value. Feigenbaum’s
experimental observations were rigorously proved by Oscar Lanford in 1982, [33].
After Ī» passes the limiting value Ī»⋆, all hell breaks loose. The iterates become completely
chaotic†, moving at random over the interval [0, 1]. But this is not the end of the
story. Embedded within this chaotic regime are certain small ranges of Ī» where the system
settles down to a stable orbit, whose period is no longer necessarily a power of 2. In fact,
there exist values of Ī» for which the iterates settle down to a stable orbit of period k for any
positive integer k. For instance, as Ī» increases past Ī»3,⋆ ≈ 3.83, a period 3 orbit appears
over a small range of values, after which, as Ī» increses slightly further, there is a period
doubling cascade where period 6, 12, 24, . . . orbits successively appear, each persisting on
a shorter and shorter range of parameter values, until Ī» passes yet another critical value
where chaos breaks out yet again. There is a well-prescribed order in which the periodic
orbits make their successive appearance, and each odd period k orbit is followed by a very
closely spaced sequence of period doubling bifurcations, of periods 2n k for n = 1, 2, 3, . . . ,
after which the iterates revert to completely chaotic behavior until the next periodic case
emerges. The ratios of distances between bifurcation points always have the same Feigenbaum
limit (2.23). Finally, these periodic and chaotic windows all pile up on the ultimate
parameter value Ī»⋆⋆
= 4. And then, when Ī» > 4, all the iterates go off to ∞, and the
system ceases to be interesting.
The reader is encouraged to write a simple computer program and perform some
numerical experiments. In particular, Figure 2.7 shows the asymptotic behavior of the
iterates for values of the parameter in the interesting range 2 < Ī» < 4. The horizontal
axis is Ī», and the marked points show the ultimate fate of the iteration for the given
† The term “chaotic” does have a precise mathematical definition, [14], but the reader can
take it more figuratively for the purposes of this elementary exposition.
5/18/08 19
c 2008 Peter J. Olver
value of Ī». For instance, each point the single curve lying above the smaller values of
Ī» represents a stable fixed point; this bifurcates into a pair of curves representing stable
period 2 orbits, which then bifurcates into 4 curves representing period 4 orbits, and so
on. Chaotic behavior is indicated by a somewhat random pattern of points lying above the
value of Ī». To plot this figure, we ran the logistic iteration u(n) for 0 ≤ n ≤ 100, discarded
the first 50 points, and then plotted the next 50 iterates u(51), . . . , u(100). Investigation of
the fine detailed structure of the logistic map requires yet more iterations with increased
numerical accuracy. In addition one should discard more of the initial iterates so as to give
the system enough time to settle down to a stable periodic orbit or, alternatively, continue
in a chaotic manner.
Remark: So far, we have only looked at real scalar iterative systems. Complex discrete
dynamical systems display yet more remarkable and fascinating behavior. The complex
version of the logistic iteration equation leads to the justly famous Julia and Mandelbrot
sets, [35], with their stunning, psychedelic fractal structure, [46].
The rich range of phenomena in evidence, even in such extremely simple nonlinear
iterative systems, is astounding. While intimations first appeared in the late nineteenth
century research of the influential French mathematician Henri Poincar´e, serious investigations
were delayed until the advent of the computer era, which precipitated an explosion of
research activity in the area of dynamical systems. Similar period doubling cascades and
chaos are found in a broad range of nonlinear systems, [1], and are often encountered in
physical applications, [38]. A modern explanation of fluid turbulence is that it is a (very
complicated) form of chaos, [1].
Quadratic Convergence
Let us now return to the more mundane case when the iterates converge to a stable
fixed point of the discrete dynamical system. In applications, we use the iterates to compute
a precise† numerical value for the fixed point, and hence the efficiency of the algorithm
depends on the speed of convergence of the iterates.
According to the remark following the proof Theorem 2.6, the convergence rate of an
iterative system is essentially governed by the magnitude of the derivative | g′(u⋆) | at the
fixed point. The basic inequality (2.18) for the errors e(k) = u(k) − u⋆, namely
| e(k+1) | ≤ Ļ | e(k) |,
is known as a linear convergence estimate. It means that, once the iterates are close to
the fixed point, the error decreases by a factor of (at least) Ļ ≈ | g′(u⋆) | at each step. If
the kth iterate u(k) approximates the fixed point u⋆ correctly to m decimal places, so its
error is bounded by
| e(k) | < .5 × 10−m,
then the (k + 1)st iterate satisfies the error bound
| e(k+1) | ≤ Ļ | e(k) | < .5 × 10−mĻ = .5 × 10−m+log10 Ļ.
† The degree of precision is to be specified by the user and the application.
5/18/
Numerical analysis
From Wikipedia, the free encyclopedia
|
This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. Please help to improve this article by introducing more precise citations. (November 2013) |
One of the earliest mathematical writings is a Babylonian tablet from the Yale Babylonian Collection (YBC 7289), which gives a sexagesimal numerical approximation of , the length of the diagonal in a unit square. Being able to compute the sides of a triangle (and hence, being able to compute square roots) is extremely important, for instance, in carpentry and construction.[2]
Numerical analysis continues this long tradition of practical mathematical calculations. Much like the Babylonian approximation of , modern numerical analysis does not seek exact answers, because exact answers are often impossible to obtain in practice. Instead, much of numerical analysis is concerned with obtaining approximate solutions while maintaining reasonable bounds on errors.
Numerical analysis naturally finds applications in all fields of engineering and the physical sciences, but in the 21st century, the life sciences and even the arts have adopted elements of scientific computations. Ordinary differential equations appear in celestial mechanics (planets, stars and galaxies); numerical linear algebra is important for data analysis; stochastic differential equations and Markov chains are essential in simulating living cells for medicine and biology.
Before the advent of modern computers numerical methods often depended on hand interpolation in large printed tables. Since the mid 20th century, computers calculate the required functions instead. These same interpolation formulas nevertheless continue to be used as part of the software algorithms for solving differential equations.
Contents
General introduction
The overall goal of the field of numerical analysis is the design and analysis of techniques to give approximate but accurate solutions to hard problems, the variety of which is suggested by the following:- Advanced numerical methods are essential in making numerical weather prediction feasible.
- Computing the trajectory of a spacecraft requires the accurate numerical solution of a system of ordinary differential equations.
- Car companies can improve the crash safety of their vehicles by using computer simulations of car crashes. Such simulations essentially consist of solving partial differential equations numerically.
- Hedge funds (private investment funds) use tools from all fields of numerical analysis to attempt to calculate the value of stocks and derivatives more precisely than other market participants.
- Airlines use sophisticated optimization algorithms to decide ticket prices, airplane and crew assignments and fuel needs. Historically, such algorithms were developed within the overlapping field of operations research.
- Insurance companies use numerical programs for actuarial analysis.
History
The field of numerical analysis predates the invention of modern computers by many centuries. Linear interpolation was already in use more than 2000 years ago. Many great mathematicians of the past were preoccupied by numerical analysis, as is obvious from the names of important algorithms like Newton's method, Lagrange interpolation polynomial, Gaussian elimination, or Euler's method.To facilitate computations by hand, large books were produced with formulas and tables of data such as interpolation points and function coefficients. Using these tables, often calculated out to 16 decimal places or more for some functions, one could look up values to plug into the formulas given and achieve very good numerical estimates of some functions. The canonical work in the field is the NIST publication edited by Abramowitz and Stegun, a 1000-plus page book of a very large number of commonly used formulas and functions and their values at many points. The function values are no longer very useful when a computer is available, but the large listing of formulas can still be very handy.
The mechanical calculator was also developed as a tool for hand computation. These calculators evolved into electronic computers in the 1940s, and it was then found that these computers were also useful for administrative purposes. But the invention of the computer also influenced the field of numerical analysis, since now longer and more complicated calculations could be done.
Direct and iterative methods
Direct vs iterative methods Consider the problem of solving
Discretization and numerical integrationIn a two hour race, we have measured the speed of the car at three instants and recorded them in the following table.
Ill posed problem: Take the function f(x) = 1/(x − 1). Note that f(1.1) = 10 and f(1.001) = 1000: a change in x of less than 0.1 turns into a change in f(x) of nearly 1000. Evaluating f(x) near x = 1 is an ill-conditioned problem. Well-posed problem: By contrast, the function is continuous and so evaluating it is well-posed, at least for x being not close to zero. |
In contrast to direct methods, iterative methods are not expected to terminate in a number of steps. Starting from an initial guess, iterative methods form successive approximations that converge to the exact solution only in the limit. A convergence test, often involving the residual, is specified in order to decide when a sufficiently accurate solution has (hopefully) been found. Even using infinite precision arithmetic these methods would not reach the solution within a finite number of steps (in general). Examples include Newton's method, the bisection method, and Jacobi iteration. In computational matrix algebra, iterative methods are generally needed for large problems.
Iterative methods are more common than direct methods in numerical analysis. Some methods are direct in principle but are usually used as though they were not, e.g. GMRES and the conjugate gradient method. For these methods the number of steps needed to obtain the exact solution is so large that an approximation is accepted in the same manner as for an iterative method.
Discretization
Furthermore, continuous problems must sometimes be replaced by a discrete problem whose solution is known to approximate that of the continuous problem; this process is called discretization. For example, the solution of a differential equation is a function. This function must be represented by a finite amount of data, for instance by its value at a finite number of points at its domain, even though this domain is a continuum.Generation and propagation of errors
The study of errors forms an important part of numerical analysis. There are several ways in which error can be introduced in the solution of the problem.Round-off
Round-off errors arise because it is impossible to represent all real numbers exactly on a machine with finite memory (which is what all practical digital computers are).Truncation and discretization error
Truncation errors are committed when an iterative method is terminated or a mathematical procedure is approximated, and the approximate solution differs from the exact solution. Similarly, discretization induces a discretization error because the solution of the discrete problem does not coincide with the solution of the continuous problem. For instance, in the iteration in the sidebar to compute the solution of , after 10 or so iterations, we conclude that the root is roughly 1.99 (for example). We therefore have a truncation error of 0.01.Once an error is generated, it will generally propagate through the calculation. For instance, we have already noted that the operation + on a calculator (or a computer) is inexact. It follows that a calculation of the type a+b+c+d+e is even more inexact.
What does it mean when we say that the truncation error is created when we approximate a mathematical procedure? We know that to integrate a function exactly requires one to find the sum of infinite trapezoids. But numerically one can find the sum of only finite trapezoids, and hence the approximation of the mathematical procedure. Similarly, to differentiate a function, the differential element approaches to zero but numerically we can only choose a finite value of the differential element.
Numerical stability and well-posed problems
Numerical stability is an important notion in numerical analysis. An algorithm is called numerically stable if an error, whatever its cause, does not grow to be much larger during the calculation. This happens if the problem is well-conditioned, meaning that the solution changes by only a small amount if the problem data are changed by a small amount. To the contrary, if a problem is ill-conditioned, then any small error in the data will grow to be a large error.Both the original problem and the algorithm used to solve that problem can be well-conditioned and/or ill-conditioned, and any combination is possible.
So an algorithm that solves a well-conditioned problem may be either numerically stable or numerically unstable. An art of numerical analysis is to find a stable algorithm for solving a well-posed mathematical problem. For instance, computing the square root of 2 (which is roughly 1.41421) is a well-posed problem. Many algorithms solve this problem by starting with an initial approximation x1 to , for instance x1=1.4, and then computing improved guesses x2, x3, etc.. One such method is the famous Babylonian method, which is given by xk+1 = xk/2 + 1/xk. Another iteration, which we will call Method X, is given by xk + 1 = (xk2−2)2 + xk.[3] We have calculated a few iterations of each scheme in table form below, with initial guesses x1 = 1.4 and x1 = 1.42.
Babylonian | Babylonian | Method X | Method X |
---|---|---|---|
x1 = 1.4 | x1 = 1.42 | x1 = 1.4 | x1 = 1.42 |
x2 = 1.4142857... | x2 = 1.41422535... | x2 = 1.4016 | x2 = 1.42026896 |
x3 = 1.414213564... | x3 = 1.41421356242... | x3 = 1.4028614... | x3 = 1.42056... |
... | ... | ||
x1000000 = 1.41421... | x28 = 7280.2284... |
- Numerical stability is affected by the number of the significant digits the machine keeps on, if we use a machine that keeps on the first four floating-point digits, a good example on loss of significance is given by these two equivalent functions
- If we compare the results of
- and
- by looking to the two above results, we realize that loss of significance which is also called Subtractive Cancelation
has a huge effect on the results, even though both functions are
equivalent; to show that they are equivalent simply we need to start by
f(x) and end with g(x), and so
- The true value for the result is 11.174755..., which is exactly g(500) = 11.1748 after rounding the result to 4 decimal digits.
- Now imagine that lots of terms like these functions are used in the program; the error will increase as one proceeds in the program, unless one uses the suitable formula of the two functions each time one evaluates either f(x), or g(x); the choice is dependent on the parity of x.
- The example is taken from Mathew; Numerical methods using matlab, 3rd ed.
Areas of study
The field of numerical analysis includes many sub-disciplines. Some of the major ones are:Computing values of functions
Interpolation: We have observed the temperature to vary from
20 degrees Celsius at 1:00 to 14 degrees at 3:00. A linear interpolation
of this data would conclude that it was 17 degrees at 2:00 and 18.5
degrees at 1:30pm. Extrapolation: If the gross domestic product of a country has been growing an average of 5% per year and was 100 billion dollars last year, we might extrapolate that it will be 105 billion dollars this year. Regression: In linear regression, given n points, we compute a line that passes as close as possible to those n points. Optimization: Say you sell lemonade at a lemonade stand, and notice that at $1, you can sell 197 glasses of lemonade per day, and that for each increase of $0.01, you will sell one glass of lemonade less per day. If you could charge $1.485, you would maximize your profit, but due to the constraint of having to charge a whole cent amount, charging $1.49 per glass will yield the maximum income of $220.52 per day. Differential equation: If you set up 100 fans to blow air from one end of the room to the other and then you drop a feather into the wind, what happens? The feather will follow the air currents, which may be very complex. One approximation is to measure the speed at which the air is blowing near the feather every second, and advance the simulated feather as if it were moving in a straight line at that same speed for one second, before measuring the wind speed again. This is called the Euler method for solving an ordinary differential equation. |
Interpolation, extrapolation, and regression
Interpolation solves the following problem: given the value of some unknown function at a number of points, what value does that function have at some other point between the given points?Extrapolation is very similar to interpolation, except that now we want to find the value of the unknown function at a point which is outside the given points.
Regression is also similar, but it takes into account that the data is imprecise. Given some points, and a measurement of the value of some function at these points (with an error), we want to determine the unknown function. The least squares-method is one popular way to achieve this.
Solving equations and systems of equations
Another fundamental problem is computing the solution of some given equation. Two cases are commonly distinguished, depending on whether the equation is linear or not. For instance, the equation is linear while is not.Much effort has been put in the development of methods for solving systems of linear equations. Standard direct methods, i.e., methods that use some matrix decomposition are Gaussian elimination, LU decomposition, Cholesky decomposition for symmetric (or hermitian) and positive-definite matrix, and QR decomposition for non-square matrices. Iterative methods such as the Jacobi method, Gauss–Seidel method, successive over-relaxation and conjugate gradient method are usually preferred for large systems. General iterative methods can be developed using a matrix splitting.
Root-finding algorithms are used to solve nonlinear equations (they are so named since a root of a function is an argument for which the function yields zero). If the function is differentiable and the derivative is known, then Newton's method is a popular choice. Linearization is another technique for solving nonlinear equations.
Solving eigenvalue or singular value problems
Several important problems can be phrased in terms of eigenvalue decompositions or singular value decompositions. For instance, the spectral image compression algorithm[4] is based on the singular value decomposition. The corresponding tool in statistics is called principal component analysis.Optimization
Main article: Mathematical optimization
Optimization problems ask for the point at which a given function is
maximized (or minimized). Often, the point also has to satisfy some constraints.The field of optimization is further split in several subfields, depending on the form of the objective function and the constraint. For instance, linear programming deals with the case that both the objective function and the constraints are linear. A famous method in linear programming is the simplex method.
The method of Lagrange multipliers can be used to reduce optimization problems with constraints to unconstrained optimization problems.
Evaluating integrals
Main article: Numerical integration
Numerical integration, in some instances also known as numerical quadrature, asks for the value of a definite integral. Popular methods use one of the Newton–Cotes formulas (like the midpoint rule or Simpson's rule) or Gaussian quadrature.
These methods rely on a "divide and conquer" strategy, whereby an
integral on a relatively large set is broken down into integrals on
smaller sets. In higher dimensions, where these methods become
prohibitively expensive in terms of computational effort, one may use Monte Carlo or quasi-Monte Carlo methods (see Monte Carlo integration), or, in modestly large dimensions, the method of sparse grids.Differential equations
Main articles: Numerical ordinary differential equations and Numerical partial differential equations
Numerical analysis is also concerned with computing (in an approximate way) the solution of differential equations, both ordinary differential equations and partial differential equations.Partial differential equations are solved by first discretizing the equation, bringing it into a finite-dimensional subspace. This can be done by a finite element method, a finite difference method, or (particularly in engineering) a finite volume method. The theoretical justification of these methods often involves theorems from functional analysis. This reduces the problem to the solution of an algebraic equation.
Software
Main articles: List of numerical analysis software and Comparison of numerical analysis software
Since the late twentieth century, most algorithms are implemented in a variety of programming languages. The Netlib repository contains various collections of software routines for numerical problems, mostly in Fortran and C. Commercial products implementing many different numerical algorithms include the IMSL and NAG libraries; a free alternative is the GNU Scientific Library.There are several popular numerical computing applications such as MATLAB, S-PLUS, LabVIEW, and IDL as well as free and open source alternatives such as FreeMat, Scilab, GNU Octave (similar to Matlab), IT++ (a C++ library), R (similar to S-PLUS) and certain variants of Python. Performance varies widely: while vector and matrix operations are usually fast, scalar loops may vary in speed by more than an order of magnitude.[5][6]
Many computer algebra systems such as Mathematica also benefit from the availability of arbitrary precision arithmetic which can provide more accurate results.
Also, any spreadsheet software can be used to solve simple problems relating to numerical analysis.
See also
- Scientific computing
- List of numerical analysis topics
- Numerical differentiation
- Symbolic-numeric computation
- Analysis of algorithms
- Numerical Recipes
Notes
- Jump up ^ Photograph, illustration, and description of the root(2) tablet from the Yale Babylonian Collection
- Jump up ^ The New Zealand Qualification authority specifically mentions this skill in document 13004 version 2, dated 17 October 2003 titled CARPENTRY THEORY: Demonstrate knowledge of setting out a building
- Jump up ^ This is a fixed point iteration for the equation , whose solutions include . The iterates always move to the right since . Hence converges and diverges.
- Jump up ^ The Singular Value Decomposition and Its Applications in Image Compression
- Jump up ^ Speed comparison of various number crunching packages
- Jump up
Subscribe to:
Posts (Atom)