Lesson notes for Math 21b - Spring 2020

Jake Marcinek

March 10, 2020

Contents

1 Matrix operations and subspaces

Lesson 1: Linear systems
Lesson 2: Gauss-Jordan elimination
Lesson 3: Linear transformations
Lesson 4: Determining a linear transformation
Lesson 5: Geometric linear transformations in R2
Lesson 6: Matrix products
Lesson 7: Matrix inverses
Lesson 8: Coordinates
Lesson 9: Image and Kernel
Lesson 10: Subspaces
Lesson 11: Dimension and the Rank-Nullity Theorem

2 Orthogonality and spectral theory

Lesson 12: Orthogonality
Lesson 13: Gram-Schmidt and transposes
Lesson 14: Determinants
Lesson 15: Method of least squares
Lesson 16: Discrete dynamical systems
Lesson 17: Eigenvalues and eigenvectors
Lesson 18: Diagonalization


1 Matrix operations and subspaces


Lesson 1: Linear systems

Definition 1. There are a few related vocabulary words:

• A linear equation is an equation involving sums of numbers and/or variables multiplied by numbers. When a variable is multiplied by a number, the number is called a coefficient.

• A linear system is a collection of linear equations – organized in a rectangular box so that each row contains one equation and each column contains one variable.

• A linear system is consistent if there are numerical values for each variable which make all equations true. In other words, “if you can solve the system.”

• A linear system is inconsistent if it is not consistent.

Algorithm 2 (Method of elimination). The method of elimination is an algorithm that is guaranteed to get you “as close as possible” to solving a system of linear equations. Here is some pseudo-code for reference.

input : A system of equations
output: As close to a solution as possible

1  write down all equations stacked on top of each other so that the same variables line up vertically;
2  start at the leftmost column;
3  while there are columns that you have not checked yet do
4      if possible then
5          use the three valid moves to obtain a coefficient 1 for the variable associated to this column in the topmost row which does not already have a circled variable;
6          circle this variable;
7          use the circled variable to “zero out” the rest of the column;
8      end
9      move to the next column to the right
10 end

Algorithm 2: Method of elimination

The algorithm refers to the three valid moves for the method of elimination. These are

1. Reorder the rows

2. Multiply a row by a nonzero number

3. Add a multiple of one row to another row

Definition 3. A variable is leading if it was circled during the process of elimination. All other variables are free.

Remark 4. To solve the system after completing the method of elimination, assign any arbitrary values to the free variables. Each nonzero row contains exactly one leading variable, so we can solve for the leading variables in terms of the free variables. The name free variable comes from the fact that this variable is free to take on any value.
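To see the algorithm in action, here is a small sketch in Python using the sympy library (an outside tool, not part of these notes); the system and the rref call are just one convenient way to carry out the elimination.

# The system  x + 2y + z = 3,  2x + 4y = 2, written as an augmented matrix.
# After elimination, x and z are leading and y is free.
from sympy import Matrix

aug = Matrix([[1, 2, 1, 3],
              [2, 4, 0, 2]])          # augmented matrix [A b]
rref_aug, pivot_cols = aug.rref()     # sympy's row reduction
print(rref_aug)      # Matrix([[1, 2, 0, 1], [0, 0, 1, 2]])
print(pivot_cols)    # (0, 2): the columns of the leading (circled) variables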


Lesson 2: Gauss-Jordan elimination

Definition 5. An n×m matrix (pronounced “n by m”) is a rectangular grid of numbers organized in n rows and m columns.

Definition 6. Every system of linear equations can be written in matrix form by A~x = ~b where A is called the coefficient matrix, ~x is called the variable vector, and ~b is called the constant vector.

The matrix obtained by augmenting the vector of constants to the right hand side of the coefficient matrix (like this [A ~b]) is called the augmented matrix.

Definition 7. A matrix A is said to be in row reduced echelon form (rref) if the following three conditions are satisfied:

1. If a row has nonzero entries, then the leftmost nonzero entry is a 1, called a leading 1.

2. If a column has a leading 1, then all of its other entries are 0.

3. If a row has a leading 1, then every row above it also has a leading 1 which is further to the left.

Proposition 8. For every n×m matrix A, there is a unique matrix in row reduced echelon form, called rref(A), such that you can obtain rref(A) from A (and vice versa) using only the following three valid moves:

1. Reorder rows

2. Scale a row by a nonzero scalar

3. Add a scalar multiple of a row to another

Remark 9. The method to compute rref(A) is identical to the method of elimination from last class. Since it’s so important, I will again write out all the details for you to have a record of.

Algorithm 10 (Gauss-Jordan Elimination). Gauss-Jordan elimination is an algorithm for computing rref(A) or rref([A ~b]).

input : A matrix A or augmented matrix [A ~b]
output: rref(A) or rref([A ~b])

1 start at the leftmost column;
2 while there are columns that you have not checked yet do
3     if possible then
4         use the three valid moves to obtain a leading 1 in this column in the topmost row which does not already have a leading 1;
5         circle the leading 1;
6         use the circled leading 1 to “zero out” the rest of the column;
7     end
8     move to the next column to the right
9 end

Algorithm 10: Gauss-Jordan Elimination

Remark 11. You won’t be penalized for not circling leading 1’s, but many people find it helpful and nobody regrets it! I encourage everyone to do it. Elimination can sometimes be long and/or tedious so any marks that help us remember our place in the algorithm can keep us from skipping ahead and making mistakes. It will also make the solutions more readable (helping your future self, CAs, and me).

Proposition 12. If A~x = ~b and rref([A ~b]) = [rref(A) ~c], then rref(A)~x = ~c.
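A quick way to experiment with Proposition 12 is to row reduce A and [A ~b] side by side. The sketch below uses sympy (an outside tool, not part of the notes) on an arbitrarily chosen singular matrix.

# rref([A b]) has rref(A) as its left block.
from sympy import Matrix

A = Matrix([[1, 2, 3],
            [2, 5, 7],
            [1, 3, 4]])
b = Matrix([1, 2, 1])

rref_A, _ = A.rref()
rref_Ab, _ = A.row_join(b).rref()     # row reduce the augmented matrix [A b]
print(rref_A)
print(rref_Ab)
print(rref_Ab[:, :3] == rref_A)       # True: the left block is rref(A)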


Lesson 3: Linear transformations

Notation 13. The set of real numbers is denoted by R. Real numbers are numbers like −1, 0, 1/2, π, etc., but not i = √−1. Write r ∈ R to mean that r is in the set of real numbers. In other words, r is a real number.

An integer is a number like . . . , −2, −1, 0, 1, 2, . . ., but not 1/2. Let m be a nonnegative integer.

The set of real vectors of length m is denoted by Rm. Real vectors of length m are of the form

~x = [x1; . . . ; xm]   (1)

(here and below, semicolons inside square brackets separate the rows of a column vector or matrix)

where x1, . . . , xm ∈ R. There are two operations for vectors:

1. Addition: If ~x, ~y ∈ Rm, then ~x + ~y ∈ Rm is defined by

~x + ~y = [x1 + y1; . . . ; xm + ym]   (2)

and is called the sum of ~x and ~y.

2. Scalar multiplication: If ~x ∈ Rm and r ∈ R, then r~x ∈ Rm is defined by

r~x = [rx1; . . . ; rxm]   (3)

and is called the scalar multiple of ~x by r. In this case r is called a scalar.

Definition 14. Suppose A and B are two sets. A function from A to B, written like f : A → B, is a rule that takes in an input a ∈ A and spits out an output f(a) ∈ B.

Definition 15. Suppose m and n are nonnegative integers. Then a linear transformation is a function T : Rm → Rn that preserves addition and scalar multiplication:

T (~x+ ~y) = T (~x) + T (~y) and T (r~x) = rT (~x) (4)

for all length m vectors ~x, ~y ∈ Rm and all scalars r ∈ R.

The domain of a linear transformation T : Rm → Rn is Rm (the input vectors have length m). The codomain is Rn (the output vectors have length n).

Remark 16. In other words, a linear transformation is a rule (usually described through words, formulas, or matrices) for transforming a vector of length m into a vector of length n.

You can think of the domain as describing the length of the input vectors and the codomain as describing the length of the output vectors.

Proposition 17. If T : Rm → Rn is a linear transformation, then T (~0) = ~0.

Warning! 18. Always keep track of the dimensions of your vectors. For example, when I write T(~0) = ~0 in Proposition 17, it is implied from T : Rm → Rn that the ~0 on the left hand side is the vector of m zeros while the ~0 on the right hand side is the vector of n zeros.

Proof. Let ~x ∈ Rm be any vector of length m. Then 0~x = ~0 is the vector of m zeros. Also, 0T(~x) = ~0 is the vector of n zeros. By preservation of scalar multiplication

T(~0) = T(0~x) = 0T(~x) = ~0.   (5)


Lesson 4: Determining a linear transformation

Definition 19. There are three related new vocabulary terms:

• A vector with m components ~x ∈ Rm is a linear combination of other vectors with m components ~v1, . . . , ~vk ∈ Rm if there are scalars c1, . . . , ck ∈ R such that

~x = c1~v1 + . . . + ck~vk.   (6)

• If S ⊂ Rm is a set of vectors with m components, then the span of S is the set of all linear combinations of vectors in S

span(S) = {~x ∈ Rm | ~x = c1~v1 + . . . + ck~vk for some vectors ~v1, . . . , ~vk ∈ S and scalars c1, . . . , ck ∈ R}.   (7)

• A set S of vectors in Rm is a basis of Rm if

1. there are m vectors in S

2. span(S) = Rm

Notation 20. The set notation in (7) should be read as follows: “the span of S is the set of all vectors with m components ~x ∈ Rm such that ~x is a linear combination of ~v1, . . . , ~vk for some other vectors ~v1, . . . , ~vk already belonging to the set S.”

Example 21. The most basic and most important example of a basis of Rm is the standard basis {~e1, . . . , ~em} where

~e1 = [1; 0; 0; . . . ; 0], ~e2 = [0; 1; 0; . . . ; 0], . . . , ~em = [0; 0; 0; . . . ; 1]   (8)

That is, ~ek is the vector of all zeros except for the kth entry which is a 1. There are m such vectors. These vectors span all of Rm because for any vector ~x ∈ Rm,

~x = x1~e1 + x2~e2 + . . .+ xm~em (9)

so ~x is a linear combination of ~e1, . . . , ~em and hence ~x ∈ span({~e1, . . . , ~em}).

Theorem 22. Every n×m matrix A induces a linear transformation TA : Rm → Rn defined by TA(~x) = A~x.

Moreover, for every linear transformation T : Rm → Rn, there is a unique n×m matrix A such that T = TA. This matrix A is given by

A = [T(~e1) T(~e2) . . . T(~em)].   (10)

Proof. If A is an n×m matrix, then for m-component vectors ~x, ~y ∈ Rm and scalars r ∈ R,

TA(~x + ~y) = A(~x + ~y) = A~x + A~y = TA(~x) + TA(~y) and TA(r~x) = A(r~x) = rA~x = rTA(~x)   (11)

so TA preserves addition and scalar multiplication. Therefore, TA as defined in the theorem is a linear transformation.

For the second part of the theorem, suppose T : Rm → Rn is a linear transformation and A is from Equation (10). Then for any ~x ∈ Rm,

T(~x) = T(x1~e1 + . . . + xm~em) = x1T(~e1) + . . . + xmT(~em) = A~x = TA(~x)   (12)

where the first equality is by the vector representation in Equation (9), the second is that T preserves addition and scalar multiplication, the third is the definition of matrix-vector multiplication, and the fourth is the definition of TA(~x).
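As a sanity check of Theorem 22, the following numpy sketch (numpy is an outside tool, and the map T below is a hypothetical example) builds A column by column from T(~e1), T(~e2) and compares A~x with T(~x).

import numpy as np

def T(x):
    # T(x1, x2) = (x1 + x2, 2*x1, 3*x2), a linear map R^2 -> R^3
    return np.array([x[0] + x[1], 2 * x[0], 3 * x[1]])

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
A = np.column_stack([T(e1), T(e2)])   # A = [T(e1) T(e2)], as in Equation (10)

x = np.array([5.0, -2.0])
print(np.allclose(A @ x, T(x)))       # True: A x reproduces T(x)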


Lesson 5: Geometric linear transformations in R2

Recall 23. The two important ideas from last time are:

1. Every linear transformation T : Rm → Rn corresponds to an (n×m)-matrix A and every (n×m)-matrix A corresponds to a linear transformation T where the correspondence is given by

T (~x) = A~x (13)

for every ~x ∈ Rm.

2. If we know T (~v) for all ~v in a basis of Rm, then we can figure out T (~x) for all ~x ∈ Rm.

Remark 24. There are many ways that we can describe linear transformations. By the above facts, all of these descriptions will have a matrix associated to them. Today we focus on geometric linear transformations T : R2 → R2.

Warning! 25. You should be able to derive all of these matrix forms yourself from

A = [T(~e1) T(~e2)].   (14)

Example 26. There are five geometric linear transformations R2 → R2 which you should be familiar with:

• Scaling: All vectors are “stretched/squeezed” horizontally by a factor a > 0 and vertically by a factor b > 0. The algebraic description and corresponding matrix are

T([x1; x2]) = [a x1; b x2] = [a 0; 0 b] [x1; x2].   (15)

• Rotation: All vectors are rotated around the origin counterclockwise by an angle θ ∈ R. The algebraic description and corresponding matrix are

T([x1; x2]) = [cos(θ) x1 − sin(θ) x2; sin(θ) x1 + cos(θ) x2] = [cos(θ) −sin(θ); sin(θ) cos(θ)] [x1; x2].   (16)

• Reflection: All vectors are flipped over the “x-axis.” The algebraic description and corresponding matrix are

T([x1; x2]) = [x1; −x2] = [1 0; 0 −1] [x1; x2].   (17)

• Orthogonal projection: All vectors are projected vertically onto the “x-axis.” The algebraic description and corresponding matrix are

T([x1; x2]) = [x1; 0] = [1 0; 0 0] [x1; x2].   (18)

• Horizontal shear: Every vector is pushed a ∈ R units to the right (respectively, left) for every unit it lies above (respectively, below) the “x-axis”. The algebraic description and corresponding matrix are

T([x1; x2]) = [x1 + a x2; x2] = [1 a; 0 1] [x1; x2].   (19)

Remark 27. We will repeatedly use all of these examples as easy-to-visualize cases of complicated phenomena we encounter throughout the rest of the semester.

Remark 28. We can also rotate the axes in the scaling, reflection, projection, and shear examples to obtain a larger family of examples.

Remark 29. The vertical shear has matrix [1 0; a 1].
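These matrices are easy to play with numerically. The following numpy sketch (an outside tool; the angle and shear parameter are arbitrary choices) checks two of them on specific vectors.

import numpy as np

theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # Equation (16)
shear = np.array([[1.0, 2.0],
                  [0.0, 1.0]])                           # Equation (19) with a = 2

print(np.round(rotation @ np.array([1.0, 0.0])))   # [0. 1.]: e1 rotates 90 degrees to e2
print(shear @ np.array([0.0, 1.0]))                # [2. 1.]: one unit above the x-axis moves 2 right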


Lesson 6: Matrix products

Theorem 30. Let ~v1, . . . , ~vm ∈ Rn be m vectors with n components each. The following are equivalent:

1. {~v1, . . . , ~vm} is a basis of Rn.

2. rref([~v1 . . . ~vm]) = In is the identity matrix.

3. every vector ~x ∈ Rn can be written as a linear combination of ~v1, . . . , ~vm in exactly one way.

Proof. The argument will have four steps: #1 implies #2, #2 implies #1, #2 implies #3, and #3 implies #2. For notational convenience, let V = [~v1 . . . ~vm] be the matrix from #2.

First, let’s show #1 implies #2. By the definition of basis, we know span({~v1, . . . , ~vm}) = Rn which means that every vector ~y ∈ Rn has a solution ~x ∈ Rm to the system V~x = ~y. This means rref(V) has no 0-rows. Also by the definition of basis, we know m = n so V is an (n×n)-matrix. The only (n×n)-matrix without a 0-row in row reduced echelon form is In, the identity matrix.

Second, let’s show #2 implies #1. Since the (n×m)-matrix V has the same size as rref(V) = In which is (n×n), we know that m = n. Moreover, rref(V) = In has no zero rows so V~c = ~y has a solution ~c ∈ Rm for every ~y ∈ Rn. In other words, span({~v1, . . . , ~vm}) = Rn.

Third, let’s show #2 implies #3. For any vector ~x ∈ Rn, there is exactly one solution ~c ∈ Rm to V~c = ~x since rref(V) has no 0-rows and no free variables.

Fourth, let’s show #3 implies #2. The matrix rref(V) has size (n×m), has no 0-rows (since V~c = ~x is consistent for every ~x ∈ Rn), and has no free variables (since V~c = ~x has exactly 1 solution for every ~x ∈ Rn). The only such matrix is the identity matrix, In (and m = n).

Definition 31 (Matrix multiplication). If A is an (n×p)-matrix and B is a (p×m)-matrix, then we can multiply the matrices to obtain an (n×m)-matrix as follows

AB = [A(B~e1) . . . A(B~em)]   (20)

Definition 32 (Composition of linear transformations). Suppose T : Rm → Rp and L : Rp → Rn are linear transformations. Then the composition of L and T is defined by

(L ◦ T )(~x) = L(T (~x)) (21)

for all ~x ∈ Rm.

Theorem 33. Suppose A is an (n×p)-matrix and B is a (p×m)-matrix. If T : Rm → Rp is the linear transformation associated to B and L : Rp → Rn is the linear transformation associated to A, then AB is the matrix associated to L ◦ T.

Proof. The matrix associated with L ◦ T is

[L(T(~e1)) . . . L(T(~em))] = [A(B~e1) . . . A(B~em)] = AB.   (22)

Definition 34 (Alternate definition). Suppose A is an (n×p)-matrix whose entry in row i and column j is Aij. Suppose B is a (p×m)-matrix whose entry in row i and column j is Bij. Then AB is the (n×m)-matrix whose entry in row i and column j is

(AB)ij = Ai1B1j + . . . + AipBpj = Σ_{k=1}^{p} AikBkj   (23)

for every i ∈ {1, 2, . . . , n} and j ∈ {1, 2, . . . ,m}.
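The next sketch checks the composition and entry formulas numerically with numpy (an outside tool; the random integer matrices are just placeholders).

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(2, 3))   # matrix of L : R^3 -> R^2
B = rng.integers(-3, 4, size=(3, 4))   # matrix of T : R^4 -> R^3

x = rng.integers(-3, 4, size=4)
print(np.array_equal((A @ B) @ x, A @ (B @ x)))   # True: (AB)x = A(Bx), i.e. AB represents L ∘ T

# Entry formula (23): (AB)_ij = sum over k of A_ik B_kj
i, j = 1, 2
print((A @ B)[i, j] == sum(A[i, k] * B[k, j] for k in range(3)))   # True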


Theorem 35. Matrix multiplication is associative. That is, if A is an (n×p)-matrix, B is a (p×q)-matrix, and C is a (q×m)-matrix, then

A(BC) = (AB)C. (24)

Proof. For each i ∈ {1, 2, . . . , n} and j ∈ {1, 2, . . . ,m} let’s compute the entry in row i and column j for the two matrices A(BC) and (AB)C:

(A(BC))ij = Σ_{k=1}^{p} Aik(BC)kj = Σ_{k=1}^{p} Aik Σ_{ℓ=1}^{q} BkℓCℓj = Σ_{ℓ=1}^{q} (Σ_{k=1}^{p} AikBkℓ) Cℓj = Σ_{ℓ=1}^{q} (AB)iℓCℓj = ((AB)C)ij   (25)

by reordering the sums.


Lesson 7: Matrix inverses

Definition 36. A linear transformation T : Rm → Rn is called invertible if there is a linear transformation S : Rn → Rm such that the composition S ◦ T is the identity on Rm and the composition T ◦ S is the identity on Rn. That is,

S(T (~x)) = ~x and T (S(~y)) = ~y (26)

for every ~x ∈ Rm and ~y ∈ Rn. In this case, S is called the inverse of T and is denoted S = T−1.

Similarly, an (n×m)-matrix A is called invertible if there is an (m×n)-matrix B such that the matrix product AB is the identity of size n and the matrix product BA is the identity of size m. That is

AB = In and BA = Im (27)

In this case, B is called the inverse of A and is denoted B = A−1.

Theorem 37. Suppose A is an (n×m)-matrix. Then the following are equivalent:

1. A is invertible.

2. rref(A) = In.

Proof. The argument will have two steps: #1 implies #2, and #2 implies #1.

First, let’s show that #1 implies #2. By the invertibility of A, we know there is a unique solution ~x ∈ Rm to the system A~x = ~y for every ~y ∈ Rn. This means rref(A) has no 0-rows (always consistent) and no free variables (exactly one solution). Since A has n rows, the only possibility is that rref(A) = In.

Second, let’s show that #2 implies #1. Use Gauss-Jordan elimination to solve the system AB = In. The row reduced echelon form of this process will look like

rref([A In]) = [In B]   (28)

If we put B on the left and row reduce, then following the same steps in reverse the Gauss-Jordan algorithm will lead us to

rref([B Im]) = [Im A]   (29)

Equation (28) is equivalent to AB = In and equation (29) is equivalent to BA = Im so B = A−1 and A is invertible.

Proposition 38. If an (n×m)-matrix A is invertible, then m = n and the columns of A are a basis of Rn.

Proof. By the equivalence in Theorem 37, the (n×m)-matrix A must be the same size as the (n×n)-identity matrix, In, so m = n. From the last handout, if rref(A) = In, then the columns of A form a basis of Rn.

Remark 39. The proof of Theorem 37 gives us a convenient way of computing inverses. If A is invertible, then to compute the inverse, simply row reduce:

rref([A In]) = [In A−1]   (30)

Example 40. For (2×2)-matrices, there is an easy inversion formula for you to remember:

[a b; c d]−1 = (1/(ad − bc)) [d −b; −c a]   (31)

for every a, b, c, d ∈ R such that ad ≠ bc.
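Here is a small numpy check (an outside tool; the entries a, b, c, d are arbitrary) that formula (31) agrees with a general-purpose inverse.

import numpy as np

a, b, c, d = 2.0, 1.0, 5.0, 3.0          # ad - bc = 1, so the matrix is invertible
A = np.array([[a, b], [c, d]])
A_inv_formula = (1.0 / (a * d - b * c)) * np.array([[d, -b], [-c, a]])   # formula (31)

print(np.allclose(A_inv_formula, np.linalg.inv(A)))   # True
print(np.allclose(A @ A_inv_formula, np.eye(2)))      # True: A A^{-1} = I_2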

Remark 41. Of the five geometric linear transformations covered in Lesson 5, only the projection is not invertible. The other four transformations have inverses with intuitive geometric interpretations:


1. Scaling with horizontal parameter a > 0 and vertical parameter b > 0 is inverted by scaling with horizontal parameter 1/a > 0 and vertical parameter 1/b > 0

[a 0; 0 b]−1 = [1/a 0; 0 1/b]   (32)

2. Rotating with angle θ is inverted by rotating with angle −θ

[cos(θ) −sin(θ); sin(θ) cos(θ)]−1 = [cos(−θ) −sin(−θ); sin(−θ) cos(−θ)] = [cos(θ) sin(θ); −sin(θ) cos(θ)]   (33)

3. Reflecting over a line L is its own inverse (in math this is typically called an involution). For example, when reflecting over the x-axis,

[1 0; 0 −1]−1 = [1 0; 0 −1]   (34)

4. Shearing with parameter a is inverted by shearing with parameter −a. This works for both vertical and horizontal shears. The horizontal shear example is

[1 a; 0 1]−1 = [1 −a; 0 1]   (35)


Lesson 8: Coordinates

Suppose B = {~v1, . . . , ~vn} is a basis in Rn. We can use this basis to create a new coordinate system.

Definition 42. Every vector ~x ∈ Rn can be expressed as a linear combination of B in exactly one way, ~x = c1~v1 + . . . + cn~vn, for some scalars c1, . . . , cn ∈ R. In this case we call the vector

[~x]B = [c1; . . . ; cn]   (36)

the B-coordinates of ~x.

Remark 43. The matrix S = [~v1 . . . ~vn] is very important so we give it a name. It is called the change of basis matrix. Note that S is invertible by the theorem from last class because B is a basis.

Theorem 44. You can move between standard coordinates and B-coordinates with the matrix inversion formula

[~x]B = S−1~x where S = [~v1 . . . ~vn]   (37)

Proof. By Definition 42, the B-coordinates satisfy

S[~x]B = [~v1 . . . ~vn][~x]B = ~x.   (38)

By the theorem from last lesson, the matrix S := [~v1 . . . ~vn] is invertible because B is a basis. Multiply both sides of equation (38) by S−1 on the left to get

[~x]B = In[~x]B = S−1S[~x]B = S−1~x   (39)

Definition 45. The B-matrix of a linear transformation T : Rn → Rn is the matrix B that satisfies

[T(~x)]B = B[~x]B   (40)

for all ~x ∈ Rn.

Remark 46. Compare this definition to the standard matrix of T, that is the matrix A satisfying

T(~x) = A~x   (41)

for all ~x ∈ Rn. Just as the standard matrix of T tells us how vectors are transformed in terms of linear combinations of the standard basis {~e1, . . . , ~en}, the B-matrix tells us how vectors are transformed in terms of linear combinations of B.

More precisely, the entry in row i and column j of A corresponds to the ~ei coefficient of T(~ej) as a linear combination of the standard basis {~e1, . . . , ~en}. The entry in row i and column j of B corresponds to the ~vi coefficient of T(~vj) as a linear combination of B.

Theorem 47. Suppose T : Rn → Rn is a linear transformation with matrix A and B-matrix B. Then

A = SBS−1. (42)

Proof. For every ~x ∈ Rn,

A~x = T(~x) = InT(~x) = SS−1T(~x) = S[T(~x)]B = SB[~x]B = SBS−1~x   (43)

so the two matrices must coincide: A = SBS−1.


Example 48. Consider the linear transformation T : R2 → R2 given by reflection across the line L passing through the origin making an angle θ with the x-axis. If you are asked to compute the matrix associated to T, one strategy is the following. Consider the rotation matrix

S = [cos(θ) −sin(θ); sin(θ) cos(θ)]   (44)

which is the change-of-basis matrix for the basis B = {~v1, ~v2} where ~v1 = S~e1 and ~v2 = S~e2. The B-matrix for T is easy to find:

B = [1 0; 0 −1]   (45)

Therefore, the matrix for T is

A = SBS−1 = [cos(θ) −sin(θ); sin(θ) cos(θ)] [1 0; 0 −1] [cos(θ) −sin(θ); sin(θ) cos(θ)]−1 = [cos(2θ) sin(2θ); sin(2θ) −cos(2θ)]   (46)
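The computation in Example 48 can be verified numerically; the numpy sketch below (an outside tool, with an arbitrary angle θ) checks that SBS−1 matches the claimed matrix.

import numpy as np

theta = 0.7                                   # an arbitrary angle
S = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # change-of-basis (rotation) matrix
B = np.array([[1.0, 0.0],
              [0.0, -1.0]])                       # B-matrix of the reflection

A = S @ B @ np.linalg.inv(S)
A_expected = np.array([[np.cos(2 * theta),  np.sin(2 * theta)],
                       [np.sin(2 * theta), -np.cos(2 * theta)]])
print(np.allclose(A, A_expected))             # True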


Lesson 9: Image and Kernel

Definition 49. If A is an (n×m)-matrix, then the kernel of A is the set of solution vectors ~x ∈ Rm to the equation A~x = ~0. In set notation, the kernel is defined as

ker(A) = {~x ∈ Rm|A~x = ~0} ⊂ Rm. (47)

Definition 50. If A is an (n×m)-matrix, then the image of A is the set of all possible output vectors A~x ∈ Rn as ~x ranges over Rm. In set notation,

im(A) = {A~x | ~x ∈ Rm} ⊂ Rn.   (48)

An alternate perspective is: the image im(A) is the set of all vectors in the codomain ~y ∈ Rn such that the system A~x = ~y is consistent.

Warning! 51. Always remember that the kernel is in the domain while the image is in the codomain. These are different unless A is a square matrix.

Remark 52. I like to think of the image and kernel as

• kernel — all inputs which output ~0

• image — all outputs

Proposition 53. The image of A is the span of the columns of A.

Proof. The columns of A are precisely A~e1, . . . , A~em which all belong to the image of A by Definition 50. On the other hand, if ~y ∈ im(A), then there exists a vector ~x ∈ Rm such that ~y = A~x. This ~x has a unique representation as a linear combination of ~e1, . . . , ~em given by ~x = x1~e1 + . . . + xm~em since ~e1, . . . , ~em is a basis of Rm. Then

~y = A~x = A(x1~e1 + . . .+ xm~em) = x1A~e1 + . . .+ xmA~em (49)

is a linear combination of the columns of A.

Remark 54. To compute the image, simply take the span of the columns. To compute the kernel, solve A~x = ~0 using elimination.
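For concrete matrices, a computer algebra system can produce these bases directly; the sympy sketch below (an outside tool, on an arbitrarily chosen rank-one matrix) follows Remark 54.

from sympy import Matrix

A = Matrix([[1, 2, 3],
            [2, 4, 6]])           # the second row is twice the first

print(A.columnspace())   # a basis of im(A): one vector, Matrix([[1], [2]])
print(A.nullspace())     # a basis of ker(A): two vectors, (-2, 1, 0) and (-3, 0, 1)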

Proposition 55. Suppose A is an invertible (n× n)-matrix. Then ker(A) = {~0}.

Proof. Since A is invertible, there is exactly one solution to A~x = ~0. The trivial solution is ~x = ~0 so this must be the only solution.


Lesson 10: Subspaces

Definition 56. A subset V ⊂ Rn is called a subspace if it satisfies the following two conditions

1. if ~v, ~w ∈ V are two vectors in the subspace, then their sum ~v + ~w ∈ V is also a vector in the subspace

2. if ~v ∈ V is a vector in the subspace and r ∈ R is a scalar, then the scalar multiple r~v ∈ V is also a vector in the subspace.

Remark 57. We say subspaces are subsets of Rn which are closed under the operations of linear algebra: addition and scalar multiplication. The first condition is called closure under addition and the second condition is called closure under scalar multiplication.

Example 58. Here are three common ways of generating subspaces

1. If S ⊂ Rn is any set of vectors, then span(S) is the smallest subspace containing each vector in S.

2. If A is an (n×m)-matrix, then im(A) is a subspace of Rn.

3. If A is an (n×m)-matrix, then ker(A) is a subspace of Rm.

Proof. Here are proofs that the above three examples are actually subspaces.

1. If ~x, ~y ∈ span(S) and r ∈ R, then there are coefficients c1, . . . , ck, d1, . . . , dk ∈ R and vectors ~v1, . . . , ~vk ∈ S such that ~x = c1~v1 + . . . + ck~vk and ~y = d1~v1 + . . . + dk~vk. Then

~x + r~y = (c1 + rd1)~v1 + . . . + (ck + rdk)~vk ∈ span(S)   (50)

so span(S) is closed under both addition and scalar multiplication.

2. If A is an (n×m)-matrix, then the image im(A) = span({A~e1, . . . , A~em}) is a span and is therefore a subspace by the first example.

3. If A is an (n×m)-matrix, ~x, ~y ∈ ker(A) are vectors in the kernel, and r ∈ R is a scalar, then

A(~x + r~y) = (A~x) + r(A~y) = ~0 + r~0 = ~0   (51)

so ~x + r~y ∈ ker(A) as well. Therefore, ker(A) is closed under addition and scalar multiplication and is hence a subspace.

This shows the above three examples are subspaces.

Definition 59. A linear relation among vectors ~v1, . . . , ~vk is a collection of coefficients c1, . . . , ck ∈ R so that the corresponding linear combination is the zero vector

c1~v1 + . . .+ ck~vk = ~0. (52)

That is, a linear relation is an expression of ~0 as a linear combination of ~v1, . . . , ~vk.

Remark 60. There is ALWAYS at least one linear relation for any set of vectors {~v1, . . . , ~vk}, namely

0~v1 + . . .+ 0~vk = ~0. (53)

This linear relation with all zero coefficients is called the trivial relation. All other relations are nontrivial (these are actually technical terms).

Definition 61. A set of vectors ~v1, . . . , ~vk are called linearly independent if the only linear relation among them is the trivial relation.

Example 62. The standard basis {~e1, . . . , ~en} ⊂ Rn is a set of linearly independent vectors.

Definition 63. A set of vectors {~v1, . . . , ~vk} ⊂ Rn is a basis for the subspace V ⊂ Rn if

15Jake Marcinek - Harvard Math 21b

Page 16: Lesson notes for Math 21b - Spring 2020people.math.harvard.edu/~marcinek/21b/notes.pdf · Lesson 1: Linear systems De nition 1. There are a few related vocabulary words: A linear

1. the set of vectors spans the subspace: span{~v1, . . . , ~vk} = V , and

2. the set of vectors {~v1, . . . , ~vk} is linearly independent.

Proposition 64. Any two of the following conditions together imply the third:

1. The vectors span all of Rn: span{~v1, . . . , ~vm} = Rn.

2. The number of vectors matches the number of components in each vector: m = n.

3. The vectors {~v1, . . . , ~vn} are linearly independent.

As long as two of the above conditions hold, the set of vectors {~v1, . . . , ~vm} is called a basis of Rn.

Proposition 65. The vectors {~v1, . . . , ~vm} ⊂ Rn are linearly independent if and only if the matrix A = [~v1 . . . ~vm] with columns from the set of vectors has trivial kernel ker(A) = {~0}.

Proof. Suppose the columns are linearly independent and ~c ∈ ker(A). Then

~0 = A~c = c1~v1 + . . .+ cm~vm (54)

is a linear relation. Since the trivial relation is the only relation between ~v1, . . . , ~vm, it must be the case that c1 = . . . = cm = 0 so ~0 is the only vector in ker(A).

In the other direction, assume ker(A) = {~0} and that c1~v1 + . . . + cm~vm = ~0. Then let

~c = [c1; . . . ; cm]   (55)

and note that A~c = ~0 so ~c ∈ ker(A). Then ~c = ~0 since that is the only vector in ker(A) so the linear relation must be the trivial relation. There are no other relations, so ~v1, . . . , ~vm are linearly independent.


Lesson 11: Dimension and the Rank-Nullity Theorem

Definition 66. The dimension of a subspace V ⊂ Rn is the number of vectors in a basis of V .

Remark 67. It is a fact that although there are infinitely many choices for bases of a subspace, every choice has the same number of vectors. There are new terms coming from the three examples of subspaces we saw last time. Firstly, for the subspace V, we should think of the dimension dim(V) as the minimum number of vectors required to span V. If we can span V with k vectors, like V = span{~v1, . . . , ~vk}, then we know dim(V) ≤ k. If the k vectors are linearly independent, then we know dim(V) = k.

Definition 68. If A is an (n×m)-matrix, then the rank of A is the dimension of the image of A

rank(A) = dim(im(A)). (56)

Remark 69. Note that if A is an (n×m)-matrix, then 0 ≤ rank(A) ≤ n is an integer since im(A) ⊂ Rn.

Definition 70. If A is an (n×m)-matrix, then the nullity of A is the dimension of the kernel of A

nullity(A) = dim(ker(A)). (57)

Remark 71. Note that if A is an (n×m)-matrix, then 0 ≤ nullity(A) ≤ m is an integer since ker(A) ⊂ Rm.

Theorem 72. Let A be an (n×m)-matrix. Then

m = rank(A) + nullity(A). (58)

That is, the sum of the rank and the nullity is always the number of columns for every matrix.

Remark 73. This is the first named theorem of our class. That’s how you know that it’s important!!!

The proof is also very elegant because it ties together multiple concepts that we have covered from the first class all the way up to this class. The argument goes by applying Gauss-Jordan elimination and thinking of the result in terms of image and kernel. We find a basis for im(A) and a basis for ker(A) and relate them to the columns of A and rref(A).

Proof. Consider the row-reduced echelon form of A. The columns of A whose corresponding columns in rref(A) contain a leading one form a basis for im(A). This means the rank of A is the number of leading variables in rref(A).

On the other hand, Gauss-Jordan solves A~x = ~0 by parameterizing ker(A). There is one parameter for every free variable. A basis for ker(A) is obtained by setting one free variable equal to 1 and the rest equal to 0. Therefore, the nullity of A is the number of free variables in rref(A).

The total number of variables is the number of columns which is m. Every variable is either free or leading so rank(A) + nullity(A) = m.
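Here is a quick numerical check of the theorem, using sympy (an outside tool) on an arbitrarily chosen 3×4 matrix.

from sympy import Matrix

A = Matrix([[1, 2, 0, 1],
            [0, 1, 1, 1],
            [1, 3, 1, 2]])        # a 3 x 4 matrix, so m = 4 (third row = first + second)

rank = A.rank()
nullity = len(A.nullspace())      # dimension of ker(A)
print(rank, nullity, rank + nullity == A.cols)   # 2 2 True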


2 Orthogonality and spectral theory


Lesson 12: Orthogonality

Definition 74. The dot product between two vectors ~v, ~w ∈ Rn is the scalar defined by

~v · ~w = v1w1 + . . . + vnwn ∈ R.   (59)

Definition 75 (Norm). The norm of a vector ~v ∈ Rn is the scalar ‖~v‖ = √(~v · ~v) ∈ R. A vector ~v ∈ Rn is normalized if ‖~v‖ = 1.

Definition 76 (Orthogonality). A set of vectors {~v1, . . . , ~vk} is orthogonal if every pair satisfies ~vi · ~vj = 0 when i ≠ j.

Definition 77 (Orthonormal). A set of vectors {~v1, . . . , ~vk} ⊂ Rn is orthonormal if they are orthogonal and normalized:

~vi · ~vj = 1 if i = j, and ~vi · ~vj = 0 if i ≠ j.   (60)

An orthonormal basis is a basis which is orthonormal.

Definition 78. Suppose V ⊂ Rn is a subspace. The set of vectors which are orthogonal to all vectors in V is another subspace of Rn called the orthogonal complement of V

V ⊥ = {~w ∈ Rn|~w · ~v = 0 for all ~v ∈ V }. (61)

Proof. The orthogonal complement is a subspace because if ~x, ~y ∈ V ⊥ and r ∈ R, then for all ~v ∈ V ,

(~x+ r~y) · ~v = (~x · ~v) + r(~y · ~v) = 0 + r0 = 0 (62)

so ~x + r~y ∈ V⊥. Therefore, V⊥ is closed under addition and scalar multiplication and is hence a subspace of Rn.

Proposition 79. Let V ⊂ Rn be any subspace. Then every vector ~x ∈ Rn has a unique decomposition as a sum of a vector in V and a vector in V⊥. That is, for every ~x ∈ Rn, there is a unique ~v ∈ V and ~w ∈ V⊥ such that ~x = ~v + ~w.

Proof. In two classes, we will discover a way to find an orthonormal basis for any subspace. In the meantime, take for granted that every subspace has an orthonormal basis.

Suppose dim(V) = k. Then dim(V⊥) = n − k. Let ~u1, . . . , ~uk ∈ V be an orthonormal basis for V and ~uk+1, . . . , ~un ∈ V⊥ be an orthonormal basis for V⊥. Then {~u1, . . . , ~un} is an orthonormal basis for Rn.

Every vector ~x ∈ Rn has a unique representation as a linear combination

~x = c1~u1 + c2~u2 + . . . + cn~un   (63)

In fact the coefficients are very easy to find: c1 = ~x · ~u1, . . . , cn = ~x · ~un. Let ~v = c1~u1 + . . . + ck~uk ∈ V be the part lying in V and ~w = ck+1~uk+1 + . . . + cn~un ∈ V⊥ be the part lying in V⊥. Then ~x = ~v + ~w is the desired decomposition.

Definition 80. The vectors obtained in the above orthogonal decomposition are called orthogonal projections. In this setting, we write

projV(~x) = c1~u1 + . . . + ck~uk = (~x · ~u1)~u1 + . . . + (~x · ~uk)~uk   (64)

This is called the projection formula. Note that the sum of the orthogonal projection transformations for orthogonal complements is the identity transformation

projV(~x) + projV⊥(~x) = ~x   (65)

for every ~x ∈ Rn.
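As a concrete instance of the projection formula, here is a short numpy sketch (an outside tool; the subspace V is the xy-plane in R3, chosen for simplicity).

import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])        # {u1, u2} is an orthonormal basis of the xy-plane V
x = np.array([3.0, -2.0, 5.0])

proj_V = np.dot(x, u1) * u1 + np.dot(x, u2) * u2   # formula (64): (x . u1) u1 + (x . u2) u2
proj_V_perp = x - proj_V                            # the remaining part of x, lying in V perp

print(proj_V)                                              # [ 3. -2.  0.]
print(np.dot(proj_V_perp, u1), np.dot(proj_V_perp, u2))    # 0.0 0.0
print(np.allclose(proj_V + proj_V_perp, x))                # True, matching (65)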


Lesson 13: Gram-Schmidt and transposes

Algorithm 81 (Gram-Schmidt). Let V ⊂ Rn be an m-dimensional subspace. The Gram-Schmidt procedure is an algorithm which converts a basis for V into an orthonormal basis. That is, the Gram-Schmidt procedure takes as input a basis B = {~v1, . . . , ~vm} of V and outputs an orthonormal basis B′ = {~u1, . . . , ~um}. Here is some pseudo-code for reference.

input : A basis B = {~v1, . . . , ~vm} for an m-dimensional subspace V ⊂ Rn
output: An orthonormal basis B′ = {~u1, . . . , ~um} for the same m-dimensional subspace V ⊂ Rn

1 Let V0 = {~0} be the zero-dimensional subspace in Rn;
2 for k = 1, . . . ,m do
3     Let ~wk = projVk−1⊥(~vk);
4     Let ~uk = ~wk/‖~wk‖;
5     Let Vk = span{~u1, . . . , ~uk};
6 end
7 Now output B′ = {~u1, . . . , ~um};

Algorithm 81: The Gram-Schmidt procedure

Note that at each step ~wk ∈ Vk−1⊥ is orthogonal to each of ~u1, . . . , ~uk−1, but is still in the span ~wk ∈ span{~v1, . . . , ~vk}. Then ~uk is the normalization of ~wk, guaranteeing that ~uk is a unit vector. Therefore, for every k = 1, . . . ,m, the k-dimensional subspace

Vk = span{~v1, . . . , ~vk} = span{~u1, . . . , ~uk}   (66)

has both {~v1, . . . , ~vk} and {~u1, . . . , ~uk} as bases (plural form of basis). Moreover, {~u1, . . . , ~uk} is an orthonormal basis.

Remark 82. By the projection formula and the observation that {~u1, . . . , ~uk−1} is an orthonormal basis for Vk−1, the projection in line 3 can be rewritten more explicitly by the formula

~wk = projVk−1⊥(~vk)   (67)
    = ~vk − projVk−1(~vk)   (68)
    = ~vk − (~vk · ~u1)~u1 − . . . − (~vk · ~uk−1)~uk−1   (69)

at every step k = 1, . . . ,m.
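For reference, here is a minimal Python/numpy sketch of the procedure built directly from formula (69); numpy is an outside tool and the input basis is an arbitrary example.

import numpy as np

def gram_schmidt(vectors):
    """Turn a list of linearly independent vectors into an orthonormal list."""
    ortho = []
    for v in vectors:
        w = v.astype(float)
        for u in ortho:                      # w_k = v_k - sum of (v_k . u_i) u_i, as in (69)
            w = w - np.dot(v, u) * u
        ortho.append(w / np.linalg.norm(w))  # u_k = w_k / ||w_k||
    return ortho

basis = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
u1, u2 = gram_schmidt(basis)
print(np.dot(u1, u2), np.linalg.norm(u1), np.linalg.norm(u2))  # ~0, 1.0, 1.0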

Theorem 83. Every subspace V ⊂ Rn has an orthonormal basis.

Proof. Every subspace has a basis, so let B be a basis for V. Apply Gram-Schmidt to B and obtain an orthonormal basis B′ for V.

Definition 84 (Transpose). The transpose of an (n×m)-matrix A is the (m×n)-matrix denoted A> whose entry in row i and column j is the entry in row j and column i of A.

Example 85. The transpose of the following (3×2)-matrix is the (2×3)-matrix

[1 2; 3 4; 5 6]> = [1 3 5; 2 4 6]   (70)

Remark 86. A convenient observation is that a vector ~x ∈ Rn can be treated as an (n×1)-matrix. Then the dot product can be written easily in terms of the transpose. For instance, if ~y ∈ Rn is another vector, then

~x>~y = [~x · ~y]   (71)


For this reason, the transpose often comes up when we are dealing with the properties associated to dot products, like orthogonality and norms. Take the following result for example.

Proposition 87. For any (n×m)-matrix A, the kernel of the transpose is the orthogonal complement of the image

ker(A>) = im(A)⊥.   (72)

Proof. For any vector ~v ∈ Rn, A>~v = ~0 if and only if A~e1 · ~v = . . . = A~em · ~v = 0. This is because equation i in the system A>~v = ~0 (corresponding to row i in A>) is exactly the equation given by column i of A dotted with ~v, A~ei · ~v = 0. This proves the result since the image of a matrix is the span of its columns, im(A) = span{A~e1, . . . , A~em}.

Theorem 88 (Properties of transposes). Let A be an (n×m)-matrix and B be an (m×p)-matrix. Then the following properties hold

1. ker(A>) = im(A)⊥

2. (AB)> = B>A>

3. rank(A>) = rank(A)

4. (A>)−1 = (A−1)>

Remark 89. Note that the last item requires A to be square and invertible. Hidden in it is the statement: if A is invertible, then so is A>.

Proof. For #1, denote the columns of A by A = [~v1 . . . ~vm] where ~v1, . . . , ~vm ∈ Rn. Then since the image of A is the span of the columns of A,

im(A) = span{~v1, . . . , ~vm}   (73)

the orthogonal complement im(A)⊥ is the set of solution vectors to ~v1 · ~x = 0, . . . , and ~vm · ~x = 0. Written as an augmented matrix, these are the solutions to the system

[~v1> 0; . . . ; ~vm> 0] = [A> ~0].   (74)

The set of solutions to this equation is precisely ker(A>) by the definition of kernel.

For #2, the entry in row i and column j of (AB)> is the entry in row j and column i of AB. This is the dot product between row j of A and column i of B. Similarly, the entry in row i and column j of B>A> is the dot product between row i of B> (which is column i of B) and column j of A> (which is row j of A). These are the same.

For #3, use #1 and the rank-nullity theorem:

rank(A>) = n − nullity(A>) = n − dim(ker(A>)) = n − dim(im(A)⊥) = dim(im(A)) = rank(A).   (75)

For #4, use property #2:

(A−1)>A> = (AA−1)> = In> = In   (76)

and

A>(A−1)> = (A−1A)> = In> = In   (77)


Lesson 14: Determinants

Theorem 90. Linear transformations scale signed volumes uniformly. That is, for every (n×n)-matrix A, there is a constant C ∈ R such that

Vol(AS) = CVol(S) (78)

for every subset S ⊂ Rn. Here AS ⊂ Rn is the set of vectors obtained by multiplying the vectors ~x ∈ S by A

AS = {A~x|~x ∈ S}. (79)

Definition 91. Let A be an (n×n)-matrix. The determinant of A is the constant in the previous theorem. That is, the determinant of A is the unique number (up to ± depending on orientation) det(A) ∈ R which satisfies

Vol(AS) = det(A)Vol(S) (80)

for every subset S ⊂ Rn.

Remark 92. The primary reason determinants are important to us is qualitative rather than quantitative. Determinants can be used to test invertibility of a matrix as follows.

Theorem 93. An (n×n)-matrix A is invertible if and only if det(A) ≠ 0.

Proof. Let S = {~x ∈ Rn | 0 ≤ xi ≤ 1 for all components i = 1, . . . , n} be the parallelepiped generated by ~e1, . . . , ~en. The volume of S is Vol(S) = 1, so by the definition of the determinant, det(A) = 0 if and only if Vol(AS) = 0. On the other hand, Vol(AS) = 0 if and only if AS is contained in a lower dimensional subspace (intuitively, if it is flat). This is the case if and only if rank(A) = dim(span{A~e1, . . . , A~en}) < n, which happens if and only if A is not invertible.

Theorem 94. The determinant of A can be computed recursively by:

• for every i = 1, . . . , n,

det(A) = Σ_{j=1}^{n} (−1)^(i+j) Aij det(A(ij))   (81)

• and similarly, for every j = 1, . . . , n,

det(A) = Σ_{i=1}^{n} (−1)^(i+j) Aij det(A(ij)).   (82)

Here Aij is the entry in row i and column j of A. Also, A(ij) is the (n−1)×(n−1)-matrix obtained by removing row i and column j from A. The submatrix A(ij) is called the (ij)-minor of A.
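The recursion in (81) translates directly into code. The following plain-Python sketch expands along the first row; it is only practical for small matrices because of the n! growth discussed in Example 95 below.

def det(A):
    # A is a list of rows; expand along the first row (i = 1 in formula (81))
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # the minor A(1,j+1): delete the first row and column j (0-indexed)
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))                    # -2, matching ad - bc
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))   # 24, the product of the diagonal entries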

Example 95. The recursive formula for determinants can be very tedious to compute in general cases; however, it reduces to easy simplifications in some basic cases. Here are some useful cases to remember:

• For a (2×2)-matrix, the determinant is

det([a b; c d]) = ad − bc   (83)

I personally think this formula is simple enough to be worth remembering.

• For a (3×3)-matrix, the determinant is

det([a b c; d e f; g h i]) = aei + bfg + cdh − afh − bdi − ceg   (84)

The way I remember this formula is to add up all the down-right diagonals and subtract all the down-left diagonals.


• The (2×2) determinant has 2 terms, each being a product of 2 entries. The (3×3) determinant has 6 terms, each being a product of 3 entries. The number of terms grows really, really fast, even faster than exponentially. Already, the (4×4) determinant has 24 terms, each being the product of 4 entries. In general, the (n×n) determinant has n! = 1 · 2 · 3 · . . . · n terms, each term being the product of n entries. For this reason, it becomes practically impossible to memorize larger general determinant formulas. Nevertheless, there are some simple structures for larger matrices making their determinants still easy to compute, like in the next example.

• A diagonal matrix is a square matrix whose only nonzero entries appear in row i and column i, where i is 1, 2, 3, . . . , or n. In this case, the determinant is the product of all diagonal entries

det([a1 0 0 . . . 0; 0 a2 0 . . . 0; 0 0 a3 . . . 0; . . . ; 0 0 0 . . . an]) = a1a2a3 . . . an.   (85)

Proposition 96. For any (n× n)-matrices A and B, the following properties of the determinant hold:

1. det(A>) = det(A), the determinant of the transpose is the determinant.

2. det(AB) = det(A) det(B), the determinant of a product is the product of the determinants.

3. det(A−1) = 1/det(A), the determinant of an inverse is the (multiplicative) inverse of the determinant.

Proof. First, let’s prove #1. This result is obviously true if n = 1 because A> = A in this case. Now assume the result is true for matrices of size n − 1. Then expanding by minors along the first column of A yields the same result as expanding by minors along the first row of A>. Written out in a formula, this is saying

det(A>) = Σ_{j=1}^{n} (−1)^(1+j) Aj1 det((A>)(1j)) = Σ_{j=1}^{n} (−1)^(1+j) Aj1 det(A(j1)) = det(A)   (86)

The first equality in (86) is the definition of determinant where we also used the fact that the entry in row 1 and column j of A> is Aj1. The second equality in (86) is because the (ij)-minor of A> is the transpose of the (ji)-minor of A, (A>)(ij) = (A(ji))>, and this matrix has size (n−1)×(n−1) so the result holds by assumption:

det((A>)(1j)) = det((A(j1))>) = det(A(j1)).   (87)

The third equality in (86) is the definition of det(A).

In conclusion, we have proved that if the result is true for all (n−1)×(n−1)-matrices, then it must also be true for all (n×n)-matrices. Moreover, it is trivially true for (1×1)-matrices. Therefore, it is true for all matrices by induction.

Second, let’s prove #2. Let S be the parallelepiped generated by ~e1, . . . , ~en, so that Vol(S) = 1. Then

det(AB) = det(AB)Vol(S) = Vol(ABS) = det(A)Vol(BS) = det(A) det(B)Vol(S) = det(A) det(B).   (88)

Third, let’s prove #3. Since In is a diagonal matrix with all diagonal entries equal to 1, det(In) = 1^n = 1. Therefore, by item #2,

1 = det(In) = det(AA−1) = det(A) det(A−1) (89)

so det(A−1) = 1/ det(A).


Lesson 15: Method of least squares

Story 97. Let A be an (n×m)-matrix and ~b ∈ Rn a vector. Oftentimes, people find themselves in the situation where they strongly desire to solve

A~x = ~b (90)

for ~x ∈ Rm, but unfortunately ~b ∉ im(A) so the system is inconsistent. By the definition of inconsistent/image, this is impossible. However, at such times rather than give up, people persevere and try to find something that is “pretty close” to a solution. One approach is to find an approximate solution to (90), ~x∗ ∈ Rm, which is actually a solution to

A~x ∗ + ~ε = ~b (91)

where ~ε ∈ Rn satisfies

• ~b− ~ε ∈ im(A), and

• ‖~ε‖ is as small as possible.

The first condition is required for A~x∗ = ~b − ~ε to have a solution. The second condition is equivalent to requiring that the approximate solution makes A~x∗ as close to ~b as possible. The vector ~ε = ~b − A~x∗ is often called the error and the approximate solution aims to minimize error. The vector which minimizes error is ~ε = projim(A)⊥(~b), so that A~x∗ + ~ε = ~b becomes A~x∗ = projim(A)(~b).

Definition 98. An approximate solution to (90) is a vector ~x ∗ ∈ Rm which satisfies

A~x ∗ = projim(A)(~b) (92)

and the error is

~ε = ~b − A~x∗ = ~b − projim(A)(~b) = projim(A)⊥(~b)   (93)

where the second equality comes from (92) and the third equality comes from the identity ~y = projV(~y) + projV⊥(~y) for every vector ~y ∈ Rn and every subspace V ⊂ Rn.

Remark 99. Approximate solutions always exist even when (90) is inconsistent. If (90) is consistent, then the approximate solution is the actual solution and ~ε = ~0.

Proposition 100. Approximate solutions to (90) always satisfy

A>A~x ∗ = A>~b. (94)

Proof. From Lesson 13, we know the identity of matrix spaces ker(A>) = im(A)⊥. Therefore, A>~ε = ~0 since ~ε = projim(A)⊥(~b) ∈ im(A)⊥ = ker(A>). Multiplying both sides of A~x∗ + ~ε = ~b by A> on the left gives

A>A~x∗ = A>A~x∗ + A>~ε = A>(A~x∗ + ~ε) = A>~b.   (95)

Note that (94) is usually easier to work with than (92) since projections are typically hard to compute.

Proposition 101. Let A be an (n×m)-matrix. Then A>A is invertible if and only if ker(A) = {~0}.

Proof. Since least squares solutions always exist, Equation (94) is always consistent, so im(A>) ⊂ im(A>A). On the other hand, since A> appears as the left factor in the product A>A, we know im(A>A) ⊂ im(A>). Therefore, im(A>A) = im(A>). By the properties of transposes,

rank(A>A) = rank(A>) = rank(A)   (96)

A is (n×m) and A>A is (m×m) so the Rank-Nullity Theorem says

nullity(A>A) = m− rank(A>A) = m− rank(A) = nullity(A). (97)

Since A>A is square, A>A is invertible if and only if nullity(A>A) = 0 if and only if nullity(A) = 0.


Story 102 (Linear regression). Oftentimes, least squares is used in a statistical setting where we have m known variables x1, . . . , xm (called features) and want to predict an unknown variable y (called the target) using a method called linear regression. The linear regression model is that there are coefficients β1, . . . , βm ∈ R such that

β1x1 + β2x2 + . . . + βmxm + ε = y   (98)

for some small (possibly random) error ε and our goal is to find β1, . . . , βm. Note that this equation can conveniently be written with vectors ~x and ~β as

~x>~β + ε = y   (99)

The statistical approach is to use data to approximately solve for the parameters ~β. In particular, suppose we already have data consisting of both

• n samples of the features

~x^(1) = [x1^(1); . . . ; xm^(1)], . . . , ~x^(n) = [x1^(n); . . . ; xm^(n)]   (100)

• and their corresponding target values y1, . . . , yn.

For each parameter vector ~β, there are corresponding sample errors for each sample

ε1 = y1 − (~x^(1))>~β = y1 − (β1x1^(1) + β2x2^(1) + . . . + βmxm^(1)),
. . .
εn = yn − (~x^(n))>~β = yn − (β1x1^(n) + β2x2^(n) + . . . + βmxm^(n)),   (101)

Putting the sample features as rows of an (n×m)-matrix X, the sample targets as entries of a vector ~y ∈ Rn, and the sample errors as entries of a vector ~ε ∈ Rn as follows

X = [(~x^(1))>; . . . ; (~x^(n))>] = [x1^(1) . . . xm^(1); . . . ; x1^(n) . . . xm^(n)], ~y = [y1; . . . ; yn], and ~ε = [ε1; . . . ; εn],   (102)

we can conveniently write the entire system of equations in (101) using matrices by

X~β + ~ε = ~y. (103)

The ~ε = ~0 solution is unlikely to be consistent, but the approximate least squares solution is guaranteed to exist! (In statistical terms, the least squares solution is the maximum likelihood estimator assuming that the error ~ε is an isotropic Gaussian, a specific/universal/easy-to-analyze type of randomness.) By Proposition 100, the approximate solution ~β∗ satisfies

(X>X)~β∗ = X>~y.   (104)

The kernel of X is trivial as long as the features are not redundant (no feature is exactly a linear combination of the others). If this is not the case, simply remove any redundant features since they will not help in the predictive power of a linear model regardless. Therefore, we can assume that ker(X) = {~0} which implies that X>X is invertible by Proposition 101. Now we can solve for the estimator ~β∗ exactly as

~β∗ = (X>X)−1X>~y.   (105)
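As a concrete illustration of (104) and (105), the numpy sketch below (an outside tool; the data are synthetic and the true coefficients are made up) fits a two-feature linear regression by solving the normal equations.

import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([2.0, -1.0])
X = rng.normal(size=(50, 2))                    # 50 samples, m = 2 features
y = X @ beta_true + 0.1 * rng.normal(size=50)   # targets with a small random error

beta_star = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations (X^T X) beta* = X^T y
print(beta_star)                                # close to [2, -1]
print(np.allclose(beta_star, np.linalg.lstsq(X, y, rcond=None)[0]))  # True: same as a library least-squares solver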


Lesson 16: Discrete dynamical systems

Definition 103. A discrete dynamical system is a sequence of vectors ~x(t) ∈ Rn parameterized by a discrete time variable t = 0, 1, 2, 3, . . . (another way to write this is 0 ≤ t ∈ Z) where each step is determined completely by the previous step through a linear evolution rule

~x(t) = A~x(t− 1) (106)

for all 1 ≤ t ∈ Z for some (n× n)-matrix A.

Remark 104. The name does a good job of describing what this object is. We have seen the word system before and it generically means a collection of quantities or rules that hold simultaneously. The word dynamical means the system is changing over time. The word discrete refers to the fact that the time steps are discrete jumps 0 ≤ t ∈ Z as opposed to a continuous flow 0 ≤ t ∈ R (we’ll talk about continuous flows later). The vectors ~x(t) model several quantities (different quantities as different components) which change over time. A typical example is ~x(t) modeling the populations of multiple species in some ecosystem, each of which evolve over time.

Proposition 105. If ~x(t) ∈ Rn, 0 ≤ t ∈ Z is a discrete dynamical system with evolution rule

~x(t) = A~x(t− 1) (107)

then ~x(t) is completely determined by the initial value ~x(0). Specifically, ~x(t) = A^t~x(0).

Proof. The equation is true for time 0. That is, A^0~x(0) = In~x(0) = ~x(0) where we used the convention that A^0 = In (in the same sense that r^0 = 1 for real numbers r ∈ R). If we assume that the equation holds at time step t − 1, ~x(t − 1) = A^{t−1}~x(0), then it must also be true at time t, because ~x(t) is determined by the previous step according to our evolution rule

~x(t) = A~x(t − 1) = A(A^{t−1}~x(0)) = A^t~x(0) (108)

so the equation holds for all times by induction.
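A small numerical check of Proposition 105 (a sketch with a made-up 2 × 2 evolution rule; numpy assumed): iterating ~x(t) = A~x(t − 1) ten times agrees with applying A^10 to ~x(0) directly.

    import numpy as np

    A = np.array([[0.9, 0.2],
                  [0.1, 0.8]])      # made-up evolution rule
    x0 = np.array([10.0, 5.0])      # made-up initial value

    x = x0.copy()
    for _ in range(10):             # step the system forward: x(t) = A x(t-1)
        x = A @ x

    print(x)
    print(np.linalg.matrix_power(A, 10) @ x0)   # same vector, as Proposition 105 predicts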

Definition 106. Suppose A is an (n × n)-matrix. If a nonzero vector ~0 ≠ ~v ∈ Rn and a scalar λ ∈ R satisfy the equation

A~v = λ~v (109)

then ~v is called an eigenvector of A and λ is called an eigenvalue of A.

Remark 107. Despite looking so simple, Equation 109 can actually be very philosophically deep. To some extent almost every area of math can be reduced to understanding eigenvalues and eigenvectors of certain types of matrices. My current area of research is called random matrix theory and the goal is to understand eigenvalues and eigenvectors when the matrix entries are random.

Proposition 108. Let ~x(t) ∈ Rn, 0 ≤ t ∈ Z, be a discrete dynamical system with evolution rule

~x(t) = A~x(t− 1) (110)

for all 1 ≤ t ∈ Z. Suppose ~x(0) = c1~v1 + . . . + ck~vk is a linear combination of eigenvectors ~v1, . . . , ~vk for the matrix A. That is, there are scalars (eigenvalues) λ1, . . . , λk ∈ R such that

A~v1 = λ1~v1, . . . , and A~vk = λk~vk. (111)

Then the discrete dynamical system can be computed easily at all future times by

~x(t) = c1 λ1^t ~v1 + . . . + ck λk^t ~vk. (112)

Proof. The matrix power is easy to compute in this case

~x(t) = A^t~x(0) = A^t(c1~v1 + . . . + ck~vk) = A^{t−1}(c1λ1~v1 + . . . + ckλk~vk) = . . . = c1 λ1^t ~v1 + . . . + ck λk^t ~vk (113)

where we used Proposition 105 for the first equality, expanded ~x(0) in terms of eigenvectors, and then repeatedly distributed factors of A onto the linear combination.
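Here is a sketch of Proposition 108 in action (made-up matrix and initial value; numpy assumed): expand ~x(0) in an eigenbasis, then scale each coefficient by λ^t instead of multiplying by A repeatedly.

    import numpy as np

    A = np.array([[0.5, 0.25],
                  [0.5, 0.75]])     # made-up matrix with eigenvalues 1 and 0.25
    x0 = np.array([1.0, 3.0])

    lam, V = np.linalg.eig(A)       # columns of V are eigenvectors v1, v2
    c = np.linalg.solve(V, x0)      # coefficients with x0 = c1*v1 + c2*v2

    t = 20
    x_eigen = V @ (c * lam**t)                      # c1*lam1^t*v1 + c2*lam2^t*v2
    x_power = np.linalg.matrix_power(A, t) @ x0     # direct computation

    print(np.allclose(x_eigen, x_power))            # True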


Lesson 17: Eigenvalues and eigenvectors

Proposition 109. Let A be an (n× n)-matrix. Then λ ∈ R is an eigenvalue of A if and only if

ker(A − λIn) ≠ {~0}. (114)

In this case, ker(A − λIn) is called the λ-eigenspace for A and is denoted Eλ(A). It consists of the zero vector together with all eigenvectors ~v ∈ Rn for A with corresponding eigenvalue λ (meaning A~v = λ~v). In mathematical notation,

Eλ(A) = ker(A− λIn) = {~v ∈ Rn|A~v = λ~v}. (115)

Proof. Suppose λ is an eigenvalue. Then there must be a nonzero vector ~0 ≠ ~v ∈ Rn such that A~v = λ~v (an eigenvector). Since ~v ∈ ker(A − λIn), we know that ker(A − λIn) ≠ {~0}. On the other hand, if there is a nonzero ~v ∈ ker(A − λIn), then A~v = λ~v so λ is an eigenvalue.

Corollary 110. All eigenspaces for an (n× n)-matrix A are subspaces of Rn.

Proof. If λ is an eigenvalue, then Eλ(A) = ker(A− λIn) and all kernels are subspaces.

Definition 111. If λ is an eigenvalue for A, then the geometric multiplicity of λ is dim(Eλ(A)).

Definition 112. Let A be an (n× n)-matrix. The characteristic polynomial of A is defined by

charA(λ) = det(A− λIn). (116)

Proposition 113. Let A be an (n × n)-matrix. Then charA(λ) is a degree n polynomial in the variable λ with zeros at the eigenvalues of A. That is, charA(λ) = 0 if and only if λ is an eigenvalue of A.

Proof. Check that the polynomial has degree n by expanding by minors. The characteristic polynomial satisfies charA(λ) = det(A − λIn) = 0 if and only if A − λIn is not invertible (a square matrix is invertible exactly when its determinant is nonzero), which happens if and only if ker(A − λIn) ≠ {~0}, that is, if and only if λ is an eigenvalue of A by Proposition 109.
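For a quick worked illustration (not in the original notes), take

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.

Then charA(λ) = det(A − λI2) = (2 − λ)(2 − λ) − 1 · 1 = λ^2 − 4λ + 3 = (λ − 1)(λ − 3), so charA has degree 2 and its zeros, λ = 1 and λ = 3, are exactly the eigenvalues of A.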

Definition 114. The trace of a matrix is the sum of its diagonal entries and is denoted tr(A).

Definition 115. By the Fundamental Theorem of Algebra, every polynomial p(x) = ad x^d + ad−1 x^{d−1} + . . . + a1 x + a0 with zeros r1, . . . , rk can be factored as

p(x) = ad(x − r1)^{m1}(x − r2)^{m2} . . . (x − rk)^{mk} (117)

for some positive integers m1, . . . , mk such that m1 + . . . + mk = d. In the case of the characteristic polynomial of A (where the degree is d = n), if λ1, . . . , λk are all of the eigenvalues of A, then

charA(x) = (λ1 − x)^{m1} . . . (λk − x)^{mk} = (−1)^n x^n + (−1)^{n−1} tr(A) x^{n−1} + . . . + det(A). (118)

The algebraic multiplicity of λi is the corresponding exponent mi for each eigenvalue i = 1, 2, . . . , k.

Strategy 116. To find eigenvalues and eigenvectors of a square (n × n)-matrix A, use the following strategy. Compute the characteristic polynomial of A, charA(λ), and solve charA(λ) = 0 for λ. The solutions λ1, . . . , λm will be the eigenvalues. Then compute the eigenspaces as kernels with Gauss-Jordan elimination

Eλ1(A) = ker(A − λ1In), . . . , and Eλm(A) = ker(A − λmIn). (119)
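Numerically, this is essentially what library routines carry out; a sketch assuming numpy, using the same 2 × 2 matrix as the worked example after Proposition 113:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    # Step 1: the characteristic polynomial and its roots (the eigenvalues).
    coeffs = np.poly(A)        # coefficients of the (monic) characteristic polynomial
    print(np.roots(coeffs))    # roots 3 and 1 (order may vary)

    # Step 2: eigenvalues and eigenvectors in one call; column j of V spans the eigenspace of lam[j].
    lam, V = np.linalg.eig(A)
    print(lam)                 # eigenvalues 3 and 1 (order may vary)
    print(V)                   # columns proportional to (1, 1) and (1, -1), up to order, sign, and normalization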

Theorem 117. For every eigenvalue, the geometric multiplicity is at least 1 and cannot exceed the algebraic multiplicity. The algebraic multiplicities must add up to n (counting complex eigenvalues).

Definition 118. A set of n vectors B = {~v1, . . . , ~vn} ⊂ Rn is called an eigenbasis for an (n × n)-matrix A if

1. B is a basis for Rn

2. Each basis vector is an eigenvector of A. That is, there are scalars λ1, . . . , λn ∈ R such that

A~v1 = λ1~v1, . . . , and A~vn = λn~vn. (120)


Lesson 18: Diagonalization

For this page, all matrices are (n× n).

Definition 119. Two matrices A and B are similar if there is an invertible matrix S such that A = SBS−1.

Theorem 120. Similar matrices share the same eigenvalues with the same algebraic multiplicities and the same geometric multiplicities. In other words, if A and B are similar and λ is an eigenvalue of A with algebraic multiplicity m and geometric multiplicity d, then λ is also an eigenvalue of B with algebraic multiplicity m and geometric multiplicity d.

Proof. Suppose λ is an eigenvalue for A with algebraic multiplicity m and geometric multiplicity d. Since A and B are similar, there must exist an invertible matrix S such that A = SBS−1. Then

charA(x) = det(A − xIn) = det(SBS−1 − xIn) = det(S(B − xIn)S−1) (121)
= det(S) det(B − xIn) det(S−1) = det(S) charB(x) (1/ det(S)) (122)
= charB(x) (123)

Since A and B have the same characteristic polynomial, λ must also be an eigenvalue of B with algebraic multiplicity m. If ~v ∈ Eλ(A), then

B(S−1~v) = S−1AS(S−1~v) = S−1(A~v) = λ(S−1~v) (124)

so S−1~v ∈ Eλ(B). This means if {~v1, . . . , ~vk} is a basis for Eλ(A), then {S−1~v1, . . . , S−1~vk} ⊂ Eλ(B) is linearly independent since S−1 is invertible. Thus dim(Eλ(A)) ≤ dim(Eλ(B)). Similarly, if ~w ∈ Eλ(B), then

A(S ~w) = SB(S−1S)~w = S(B ~w) = λ(S ~w) (125)

so S ~w ∈ Eλ(A). This means if {~w1, . . . , ~wk} is a basis for Eλ(B), then {S ~w1, . . . , S ~wk} ⊂ Eλ(A) is linearly independent since S is invertible. Therefore dim(Eλ(B)) ≤ dim(Eλ(A)). In conclusion, dim(Eλ(A)) = dim(Eλ(B)), so the B-geometric multiplicity of λ equals the A-geometric multiplicity of λ.
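A numerical sanity check of Theorem 120 (a sketch with made-up matrices; numpy assumed): conjugating B by an invertible S leaves the characteristic polynomial, and hence the eigenvalues, unchanged.

    import numpy as np

    B = np.array([[2.0, 1.0],
                  [0.0, 3.0]])
    S = np.array([[1.0, 2.0],
                  [3.0, 5.0]])                 # det(S) = -1, so S is invertible
    A = S @ B @ np.linalg.inv(S)               # A is similar to B

    print(np.poly(A))                          # same characteristic polynomial ...
    print(np.poly(B))                          # ... as B, up to rounding
    print(np.trace(A), np.trace(B))            # equal traces (this is Corollary 121 below)
    print(np.linalg.det(A), np.linalg.det(B))  # equal determinants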

Corollary 121. If A and B are similar, then tr(A) = tr(B) and det(A) = det(B).

Proof. Up to a sign, the trace is the x^{n−1}-coefficient in the characteristic polynomial and the determinant is the constant coefficient in the characteristic polynomial. Since charA(x) = charB(x), their trace and determinant must also be the same.

Definition 122. A matrix A is diagonalizable if it is similar to a diagonal matrix D.

Proposition 123. The following are equivalent:

1. A is diagonalizable.

2. A is similar to a diagonal matrix D.

3. There is an invertible matrix S and a diagonal matrix D such that A = SDS−1.

4. There is an eigenbasis for A.

5. The geometric multiplicities of the eigenvalues for A sum to n.

Proof. Items #1, #2, and #3 are all equivalent by the definitions of diagonalizable and similar.

Item #2 implies item #5 by Theorem 120. Assuming item #2, the geometric multiplicities of A match the geometric multiplicities of D which match the algebraic multiplicities of D which sum to n.


Item #4 implies #3. Let {~v1, . . . , ~vn} be an eigenbasis for A with eigenvalues λ1, . . . , λn so that

A~v1 = λ1~v1, . . . , and A~vn = λn~vn. (126)

Then A = SDS−1 where S is invertible and D is diagonal given by

S = \begin{bmatrix} \vec{v}_1 & \ldots & \vec{v}_n \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_n \end{bmatrix}. (127)

Item #5 implies #4. There are bases of size d1, . . . , dk for each eigenspace Eλ1(A), . . . , Eλk(A), where λ1, . . . , λk are the eigenvalues of A and d1, . . . , dk are the geometric multiplicities, respectively. Since d1 + . . . + dk = n, combining all such bases gives a set of n vectors. These vectors are linearly independent (and hence an eigenbasis) because there can be no linear relations between different eigenspaces. To see this, suppose ~v ∈ Eµ(A) ∩ Eλ(A) where µ ≠ λ are eigenvalues for A. Then µ~v = A~v = λ~v so (λ − µ)~v = ~0. Since the scalar λ − µ ≠ 0, it must be the case that ~v = ~0, so Eλ(A) ∩ Eµ(A) = {~0}.
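A sketch of the construction in the proof of item #4 implies #3 above (made-up matrix; numpy assumed): put an eigenbasis into the columns of S, the matching eigenvalues on the diagonal of D, and check that A = SDS−1.

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])     # made-up matrix with eigenvalues 5 and 2

    lam, S = np.linalg.eig(A)      # columns of S form an eigenbasis for this A
    D = np.diag(lam)               # eigenvalues on the diagonal, in matching order

    print(np.allclose(A, S @ D @ np.linalg.inv(S)))   # True: A = S D S^{-1}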

Definition 124 (Two levels/strengths of diagonalizability). Something that we have glossed over so far is that the Fundamental Theorem of Algebra only states that a polynomial of degree n has n possibly complex zeros, counted with algebraic multiplicity. Therefore, when we solve charA(x) = 0, it is possible that there are complex solutions and hence A has complex eigenvalues. If A has complex eigenvalues, the corresponding eigenvectors must also be complex.

• A is diagonalizable over C if it is diagonalizable with possibly complex eigenvalues.

• A is diagonalizable over R if it is diagonalizable with only real eigenvalues.

If A is diagonalizable over R, then it is also (automatically) diagonalizable over C because R ⊂ C. All non-real eigenvalues must come in conjugate pairs and the corresponding eigenvectors are complex conjugates as well. That is, if λ ∈ C\R is an eigenvalue of A and A~v = λ~v, then A~̄v = λ̄~̄v.

Example 125. Consider the following identity matrix, scaling matrix, rotation matrix, and shear matrix

I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \quad \text{and} \quad C = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} (128)

Their eigenvalues, multiplicities, and eigenspaces are summarized in the following table.

matrix | eigenvalues  | algebraic multiplicity | geometric multiplicity | eigenspaces                                       | degree of diagonalizability
I2     | 1            | 2                      | 2                      | E1(I2) = R^2                                      | diagonalizable over R
A      | 2, 3         | 1, 1                   | 1, 1                   | E2(A) = span(~e1), E3(A) = span(~e2)              | diagonalizable over R
B      | i = √−1, −i  | 1, 1                   | 1, 1                   | Ei(B) = span((i, 1)^T), E−i(B) = span((−i, 1)^T)  | diagonalizable over C
C      | 1            | 2                      | 1                      | E1(C) = span(~e1)                                 | not diagonalizable
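These table entries are easy to check numerically (a sketch; numpy assumed): np.linalg.eig reports the complex eigenvalues ±i for the rotation B, while for the shear C it reports the repeated eigenvalue 1 with two (numerically) parallel eigenvector columns, so there is no eigenbasis.

    import numpy as np

    B = np.array([[0.0, -1.0],
                  [1.0,  0.0]])    # rotation by 90 degrees
    C = np.array([[1.0, 1.0],
                  [0.0, 1.0]])     # shear

    lamB, VB = np.linalg.eig(B)
    print(lamB)                    # approximately [i, -i]: diagonalizable over C only

    lamC, VC = np.linalg.eig(C)
    print(lamC)                    # [1., 1.]: algebraic multiplicity 2
    print(VC)                      # both columns are (numerically) parallel to e1: geometric multiplicity 1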

Proposition 126. If A is a diagonalizable (n × n)-matrix with decomposition A = SDS−1 where D is diagonal and S is invertible, then

A^t = SD^tS−1 (129)

for every integer t ≥ 0.

Proof. A^t = AA . . . A (t copies) = (SDS−1)(SDS−1) . . . (SDS−1) (t copies) = SD(S−1SD)(S−1SD) . . . (S−1SD)S−1, with t − 1 copies of S−1SD. Each S−1S = In cancels, leaving A^t = SD^tS−1.
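Proposition 126 is one reason diagonalization is useful for computing high matrix powers; here is a sketch (same made-up matrix as in the diagonalization snippet above; numpy assumed), where only the diagonal entries are powered.

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])     # made-up diagonalizable matrix (eigenvalues 5 and 2)

    lam, S = np.linalg.eig(A)
    At_fast = S @ np.diag(lam**15) @ np.linalg.inv(S)   # S D^15 S^{-1}
    At_slow = np.linalg.matrix_power(A, 15)             # repeated matrix multiplication

    print(np.allclose(At_fast, At_slow))                # True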
