Linear Algebra for Intelligent Systems
Vector Spaces

Andres Mendez-Vazquez

March 25, 2016

Abstract

Now, we start our journey with the important concepts coming from Linear Algebra theory. In this first section, we will look at the basic concepts of vector spaces and their dimensions. This will give us the first tools for handling many of the concepts that we use in intelligent systems. Finally, we will explore the application of these concepts in Machine Learning.

Contents

1 Introduction
2 Fields and Vector Spaces
   2.1 Subspaces
   2.2 Linear Combinations
3 Basis and Dimensions
   3.1 Coordinates
   3.2 Properties of a Basis
4 Applications in Machine Learning


1 Introduction

About 4000 years ago, the Babylonians knew how to solve the following kind of systems [3]:

ax+ by = c

dx+ ey = f

where x, y are unknowns. As always, the first steps in any field of knowledge tend to be slow, in this case something like 1800 years slow. It was only after the death of Plato and Aristotle that the Chinese (Nine Chapters of the Mathematical Art, 200 B.C.) were able to solve 3 × 3 systems by working an “elimination method” similar to the one devised by Gauss, “The Prince of Mathematics,” 2000 years later for general systems. But it was only when Leibniz tried to solve systems of linear equations that one of the first concepts of linear algebra came to be: the determinant. Finally, Gauss implicitly defined the concept of a matrix as a linear transformation in his book “Disquisitions.” Furthermore, the term matrix was introduced by Cayley in two papers, in 1850 and 1858 respectively, which allowed him to prove the important Cayley-Hamilton Theorem. Much more exists in the history of linear algebra [3], and although this is barely a glimpse of the rich history of one of the most important tools for intelligent systems, it stresses its importance.

2 Fields and Vector Spaces

Given that fields (sets of numbers) are so important to us in our calculations, as in systems like

a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1,
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2,
\vdots
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m.

It is clear that we would like to collect them in a compact structure that allows for simpler manipulation. Thus, we can do this if we use n-tuple structures like the following ones:

x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} \quad \text{and} \quad A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}   (1)

Thus, we can write:


Ax = b (2)

These structures can be lumped together by the use of specific operations like the sum and multiplication [2, 4], making clear the need to define what a space of such structures is.

Definition 1. A vector space V over the field K is a set of objects which can be added and multiplied by elements of K, where the results are again elements of V, with the following properties:

1. Given elements u,v,w of V , we have (u+ v) +w = u+ (v +w).

2. There is an element of V , denoted by O, such that O + u = u + O = ufor all elements u of V .

3. Given an element u of V , there exists an element −u in V such thatu+ (−u) = O.

4. For all elements u,v of V , we have u+ v = v + u.

5. For all elements u of V , we have 1 · u = u.

6. If c is a number, then c (u+ v) = cu+ cv.

7. If a, b are two numbers, then (ab)v = a (bv).

8. If a, b are two numbers, then (a+ b)v = av + bv.

The elements of a vector space will be called vectors.

Example 2. Let V = Kn be the set of n-tuples of elements of K. Let

A = (a1, a2, ..., an) and B = (b1, b2, ..., bn),

and define A + B = (a1 + b1, ..., an + bn) and cA = (ca1, ..., can). It is easy to prove that this is a vector space.
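As a quick numerical illustration (a sketch added here, not part of the original notes), the following Python snippet checks a few of the axioms of Definition 1 for V = R^n with the componentwise operations of Example 2; numpy is assumed to be available.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    u, v, w = rng.standard_normal((3, n))   # three random vectors in R^n
    a, b = 2.0, -3.5                        # two scalars from the field R

    # Associativity and commutativity of addition (properties 1 and 4).
    assert np.allclose((u + v) + w, u + (v + w))
    assert np.allclose(u + v, v + u)

    # Zero vector and additive inverse (properties 2 and 3).
    zero = np.zeros(n)
    assert np.allclose(u + zero, u) and np.allclose(u + (-u), zero)

    # Scalar multiplication axioms (properties 5 to 8).
    assert np.allclose(1.0 * u, u)
    assert np.allclose(a * (u + v), a * u + a * v)
    assert np.allclose((a * b) * v, a * (b * v))
    assert np.allclose((a + b) * v, a * v + b * v)
    print("All checked axioms hold for these samples.")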

2.1 Subspaces

Now, we have the idea of subsets that can themselves be vector spaces [2, 4], which can be highly useful for representing certain structures in data spaces.

Definition 3. Let V be a vector space and W ⊆ V. Then W is a subspace if:

1. If v, w ∈W , then v +w ∈W .

2. If v ∈W and c ∈ K, then cv ∈W .

3. The element 0 ∈ V is also an element of W .


Example 4. Given the vector space V of m × n matrices over R, a subspace is the one defined as follows:

W = {A ∈ V | the last row of A has only zeros}   (3)

It is possible to obtain a compact characterization of subspaces:

Theorem 5. A non-empty subset W of V is a subspace of V if and only if for each pair of vectors v, w ∈ W and each scalar c ∈ K, the vector cv + w ∈ W.

Proof. Case ⇐: Suppose that W ≠ ∅ and cv + w ∈ W for all vectors v, w ∈ W and c ∈ K. Since W is non-empty, there exists ρ ∈ W, and then (−1)ρ + ρ = 0 ∈ W. Then, cv = cv + 0 ∈ W. Finally, if v, w ∈ W, then v + w = (1)v + w ∈ W. Thus, W is a subspace of V.
Case ⇒: Too simple, left to you to prove it.
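To make the criterion of Theorem 5 concrete, here is a small Python sketch (an illustration added here, not from the original notes; numpy assumed) that samples matrices from the subspace W of Example 4 and checks that cv + w stays in W.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 3, 4

    def random_in_W():
        # A random m x n matrix whose last row is all zeros (an element of W).
        A = rng.standard_normal((m, n))
        A[-1, :] = 0.0
        return A

    def in_W(A, tol=1e-12):
        # Membership test for W: the last row must vanish.
        return bool(np.all(np.abs(A[-1, :]) < tol))

    v, w = random_in_W(), random_in_W()
    c = rng.standard_normal()
    # Theorem 5: for v, w in W and a scalar c, cv + w must again lie in W.
    assert in_W(c * v + w)
    print("cv + w stayed in W, as the subspace criterion requires.")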

Example 6. If V is any vector space:

1. V is a subspace of V.

2. The subset consisting of the zero vector alone is a subspace of V, called the zero subspace of V.

3. The space of polynomial functions over the field R is a subspace of the space of all functions from R into R.

2.2 Linear Combinations

Now, a fundamental idea is going to be presented; it represents one of the fundamental ideas in linear algebra because of its applications in:

1. Convex Representations in Optimization.

2. Endmember Representation in Hyperspectral Images (Fig. 1) [1].

3. Geometric Representation of the addition of forces in Physics (Fig. 2).

Figure 1: Mixing Model with e = \sum_{i=1}^{n} \alpha_i e_i and \sum_{i=1}^{n} \alpha_i = 1


Figure 2: Forces as Vectors

Definition 7. Let V be an arbitrary vector space, and let v1, v2, ..., vn ∈ V and x1, x2, ..., xn ∈ K. Then, an expression of the form

x1v1 + x2v2 + ... + xnvn   (4)

is called a linear combination of v1, v2, ..., vn.

We are ready to prove that the space W of all linear combinations of a given collection v1, v2, ..., vn is a subspace of V.

Proof. Let x1, ..., xn, y1, ..., yn ∈ K. Then, we can take two elements of W:

t1 = x1v1 + x2v2 + ... + xnvn,
t2 = y1v1 + y2v2 + ... + ynvn.

We then have that

t1 + t2 = x1v1 + x2v2 + ... + xnvn + y1v1 + y2v2 + ... + ynvn
        = (x1 + y1)v1 + (x2 + y2)v2 + ... + (xn + yn)vn.

Thus, the sum of two elements of W is an element of W. In a similar fashion, given c ∈ K,

c (x1v1 + x2v2 + ... + xnvn) = cx1v1 + cx2v2 + ... + cxnvn,

a linear combination of the elements v1, v2, ..., vn, and therefore an element of W. Finally,

0 = 0v1 + 0v2 + ... + 0vn

is an element of W.¹

¹ The subspace W is called the subspace generated or spanned by v1, v2, ..., vn.


Theorem 8. Let V be a vector space over the field K. The intersection of any collection of subspaces of V is a subspace of V.

Proof. Let {W_α}_{α∈I} be a collection of subspaces of V, and let W = ∩_{α∈I} W_α. If v ∈ W, then v ∈ W_α for all α ∈ I. In addition, O ∈ W_α for all α ∈ I, so O is in the intersection W and W is non-empty. Now, given two vectors v, w ∈ W, both of them are in every W_α by definition. Then, given c ∈ K, cv + w ∈ W_α for all α ∈ I, and hence cv + w ∈ W. Therefore, by Theorem 5, W is a subspace of V.

This allows us to discover that, for any collection S of elements of V, there is a smallest subspace of V that contains S and is contained in any subspace containing S.

Definition 9. Let S be a set of vectors in a vector space V. The subspace spanned by S is defined as the intersection W of all subspaces of V which contain S. When S is a finite set of vectors, S = {v1, v2, . . . , vn}, we shall simply call W the subspace spanned by the vectors v1, v2, . . . , vn.

From this, we only need to prove a simple fact about the spanned space to have an efficient tool for representing subspaces.

Theorem 10. The subspace spanned by a non-empty subset S of a vector space V is the set of all linear combinations of vectors in S.

Proof. Let W be the subspace spanned by S. Then, each linear combination

v = \sum_{i=1}^{n} x_i v_i ∈ W

of vectors v1, v2, . . . , vn ∈ S is clearly in W. Thus, W contains the set L of all linear combinations of vectors in S. Now, the set L contains S, and it is different from ∅. If α, β belong to L, then

α = \sum_{i=1}^{m} x_i v_i,   β = \sum_{i=1}^{n} y_i w_i,

with {v_i}_{i=1}^{m} ⊂ S and {w_i}_{i=1}^{n} ⊂ S. Thus, for any scalar c,

cα + β = c \sum_{i=1}^{m} x_i v_i + \sum_{i=1}^{n} y_i w_i = \sum_{i=1}^{m} (c x_i) v_i + \sum_{i=1}^{n} y_i w_i.

Therefore, cα + β belongs to L, and L is a subspace of V by Theorem 5. Since every subspace containing S contains L, we conclude that W = L.


Example 11. Let F be a subfield of the field C of complex numbers. Suppose

α1 = (1, 2, 0, 3, 0)^T,
α2 = (0, 0, 1, 4, 0)^T,
α3 = (0, 0, 0, 0, 1)^T.

By Theorem 10, a vector α is in the subspace W of F^5 spanned by α1, α2, α3 if and only if there exist scalars c1, c2, c3 in F such that

α = c1α1 + c2α2 + c3α3.

Therefore, the vectors in W have the following structure:

α = c1 (1, 2, 0, 3, 0)^T + c2 (0, 0, 1, 4, 0)^T + c3 (0, 0, 0, 0, 1)^T
  = (c1, 2c1, c2, 3c1 + 4c2, c3)^T.

Furthermore, W can be described as the set

{(x1, x2, x3, x4, x5) | x2 = 2x1, x4 = 3x1 + 4x3}.
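The membership test of Example 11 can also be carried out numerically: stack α1, α2, α3 as columns and solve for c1, c2, c3 by least squares, accepting α only when the residual is (numerically) zero. This is a sketch added for illustration, assuming numpy.

    import numpy as np

    # Columns are the spanning vectors alpha_1, alpha_2, alpha_3 of Example 11.
    A = np.array([[1, 0, 0],
                  [2, 0, 0],
                  [0, 1, 0],
                  [3, 4, 0],
                  [0, 0, 1]], dtype=float)

    def in_span(alpha, tol=1e-10):
        # Least-squares coefficients c minimizing ||A c - alpha||.
        c, *_ = np.linalg.lstsq(A, alpha, rcond=None)
        return np.linalg.norm(A @ c - alpha) < tol, c

    print(in_span(np.array([1.0, 2.0, 5.0, 23.0, -7.0])))  # satisfies x2 = 2x1, x4 = 3x1 + 4x3
    print(in_span(np.array([1.0, 1.0, 0.0, 0.0, 0.0])))    # violates x2 = 2x1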

3 Basis and Dimensions

Before we define the concept of a basis, it is interesting to have an intuitive idea about why we need one. It is clear that we would like to have a compact representation of specific spaces, to avoid listing all the possible elements in them. For example, given Fig. 3, we have that all the points on the straight line can be defined as

{c (α, β, γ) |c ∈ R} (5)

Thus, it is necessary to define the concepts of linearly dependent and linearly independent sets of vectors.

Definition 12. Let V be a vector space over a field K, and let v1, v2, ..., vn ∈ V. We say that v1, v2, ..., vn are linearly dependent over K if there are scalars a1, a2, ..., an ∈ K, not all equal to 0, such that

a1v1 + a2v2 + ... + anvn = O.

If there are no such numbers, then we say that v1, v2, ..., vn are linearly independent.

Example 13. Let V = Kn and consider the following vectors:

e1 = (1, 0, ..., 0)^T,
...
en = (0, 0, ..., 1)^T.


Figure 3: A basis with one element

The operator (·)^T denotes the transpose, which exchanges rows and columns. Then

a1e1 + a2e2 + ... + anen = (a1, a2, ..., an)^T.   (6)

We have that (a1, a2, ..., an)^T = O ⇔ ai = 0 for all 1 ≤ i ≤ n. Thus, e1, e2, ..., en are linearly independent.
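A convenient numerical test for linear independence of vectors in K^n is to stack them as columns and compare the matrix rank with the number of vectors. The sketch below (added for illustration, assuming numpy) applies this to the standard basis of R^4 and to an obviously dependent set.

    import numpy as np

    def linearly_independent(vectors):
        # The vectors are independent iff the column matrix has full column rank.
        M = np.column_stack(vectors)
        return np.linalg.matrix_rank(M) == len(vectors)

    e = [np.eye(4)[:, i] for i in range(4)]             # standard basis e_1, ..., e_4
    print(linearly_independent(e))                      # True

    v1 = np.array([1.0, 2.0, 0.0, 3.0])
    v2 = np.array([0.0, 1.0, 1.0, 4.0])
    print(linearly_independent([v1, v2, v1 + 2 * v2]))  # False: the third vector is a combination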

Example 14. Let V be the vector space of all functions f : R → R of a variable t. We want to find a sequence of functions that are linearly independent, and for this we take the following functions:

e^t, e^{2t}, e^{3t}.   (7)

To prove that they are linearly independent, suppose that for all t

a1 e^t + a2 e^{2t} + a3 e^{3t} = 0.

Differentiating, we get

a1 e^t + 2a2 e^{2t} + 3a3 e^{3t} = 0.

Subtracting the first equation from the second,

a1 e^t + 2a2 e^{2t} + 3a3 e^{3t} − (a1 e^t + a2 e^{2t} + a3 e^{3t}) = a2 e^{2t} + 2a3 e^{3t} = 0.

We now subtract this last equation from the first one. Thus,

a1 e^t + a2 e^{2t} + a3 e^{3t} − (a2 e^{2t} + 2a3 e^{3t}) = a1 e^t − a3 e^{3t} = 0.

In this way, we have a1 e^t = a3 e^{3t} for all t, that is, a3 e^{2t} = a1. Assume that a1 ≠ 0 and a3 ≠ 0. Then

e^{2t} = a1 / a3.


But t can take any value, therefore this is not possible. Thus, the only option is a1 = a3 = 0, and from this we also get a2 = 0.

Following these examples, we have the following consequences of Definition 12 of linear dependence and independence.

1. Any set which contains a linearly dependent set is linearly dependent.

2. Any subset of a linearly independent set is linearly independent.

3. Any set which contains O is linearly dependent, because 1 · O = O.

4. A set S of vectors is linearly independent if and only if each finite subset of S is linearly independent.

Now, from these basic ideas comes the concept of a basis.

Definition 15. If elements v1, v2, ..., vn generate or span V and in addition are linearly independent, then {v1, v2, ..., vn} is called a basis of V. In other words, the elements v1, v2, ..., vn form a basis of V.

Thus, we can say that the vectors in Example 13 and the ones in Example 14 are bases for Kn and for the space generated by {e^t, e^{2t}, e^{3t}}, respectively.

3.1 Coordinates

Now, we need to define the concept of coordinates of an element v ∈ V with respect to a basis. For this, we will use the following theorem.

Theorem 16. Let V be a vector space. Let v1, v2, ..., vn be linearly independent elements of V. Let x1, . . . , xn and y1, . . . , yn be numbers. Suppose that we have

x1v1 + x2v2 + · · ·+ xnvn = y1v1 + y2v2 + · · ·+ ynvn (8)

Then, xi = yi for all i = 1, . . . , n.

Proof. It is quite simple to see from (Eq. 8) that (x1 − y1)v1 + (x2 − y2)v2 + · · · + (xn − yn)vn = O. Thus, given that v1, v2, ..., vn are linearly independent, xi − yi = 0 for all i = 1, . . . , n.

Let V be a vector space, and let {v1, v2, ..., vn} be a basis of V. It is possible to represent any v ∈ V by an n-tuple of numbers relative to this basis as follows:

v = x1v1 + x2v2 + · · · + xnvn   (9)

By Theorem 16, this n-tuple is uniquely determined by v. We call (x1, x2, . . . , xn) the coordinates of v with respect to the basis, and we call xi the i-th coordinate. Moreover, the n-tuple X = (x1, x2, . . . , xn), viewed as an element of Kn with respect to the usual basis {e1, e2, ..., en}, is called the coordinate vector of v with respect to the basis {v1, v2, ..., vn}.


Example 17. Let V be the vector space of functions generated by the three functions e^t, e^{2t}, e^{3t}. Then, the coordinates of the function

2e^t + e^{2t} + 10e^{3t}   (10)

are the 3-tuple (2, 1, 10) with respect to the basis {e^t, e^{2t}, e^{3t}}.
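For numerical vectors, the coordinates of v with respect to a basis {v1, ..., vn} of K^n are obtained by solving the linear system behind Eq. (9). A small Python sketch follows (added here for illustration, with a hypothetical basis; numpy assumed).

    import numpy as np

    # The columns of B form a (hypothetical) basis of R^3.
    B = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
    v = np.array([3.0, 5.0, 4.0])

    # The coordinates x satisfy B x = v, i.e. v = x_1 b_1 + x_2 b_2 + x_3 b_3.
    x = np.linalg.solve(B, v)
    print(x)                         # coordinate vector of v with respect to the columns of B
    print(np.allclose(B @ x, v))     # sanity check: True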

3.2 Properties of a Basis

Now, we describe a series of important properties coming from the concept of a basis.

Theorem 18. (Limit on the size of a basis) Let V be a vector space over a field K with a basis {v1, v2, ..., vm}. Let w1, w2, ..., wn be elements of V, and assume that n > m. Then w1, w2, ..., wn are linearly dependent.

Proof. Assume that w1, w2, ..., wn are linearly independent. Since {v1, v2, ..., vm} is a basis of V, there exist scalars a1, a2, ..., am ∈ K such that

w1 = \sum_{i=1}^{m} a_i v_i.

By assumption, we know that w1 ≠ O, and hence some ai ≠ 0. We may assume without loss of generality that a1 ≠ 0. We can then solve for v1, and get

a1 v1 = w1 − a2 v2 − · · · − am vm,
v1 = a1^{-1} w1 − a1^{-1} a2 v2 − · · · − a1^{-1} am vm.

The subspace of V generated by w1, v2, ..., vm contains v1, and hence must be all of V, since v1, v2, ..., vm generate V. Now, we continue this procedure stepwise, replacing successively v2, v3, ... by w2, w3, ..., until all the elements v1, v2, ..., vm are exhausted and w1, w2, ..., wm generate V. Assume by induction that there is an integer r with 1 ≤ r < m such that w1, ..., wr, v_{r+1}, ..., vm generate V. Then, there are elements b1, ..., br, c_{r+1}, ..., cm in K such that

w_{r+1} = b1 w1 + · · · + br wr + c_{r+1} v_{r+1} + · · · + cm vm.

We cannot have cj = 0 for all j = r + 1, ..., m, otherwise we would get a relation of linear dependence between w1, w2, ..., wr, w_{r+1}. Then, without loss of generality we can say c_{r+1} ≠ 0. Thus, we get

c_{r+1} v_{r+1} = w_{r+1} − b1 w1 − · · · − br wr − c_{r+2} v_{r+2} − · · · − cm vm.

Dividing by c_{r+1}, we conclude that v_{r+1} is in the subspace generated by w1, ..., w_{r+1}, v_{r+2}, ..., vm.

By the induction assumption, w1, ..., w_{r+1}, v_{r+2}, ..., vm therefore generate V. Thus, by induction, we have proved that w1, ..., wm generate V. Since n > m, there exist elements d1, d2, ..., dm ∈ K such that


wn = d1w1 + · · ·+ dmwm

Thus, w1, ...,wn are linearly dependent.

Example 19. The vector space Cn has dimension n over C. More generally, for any field K, the vector space Kn has dimension n over K, with basis

e1 = (1, 0, ..., 0)^T, e2 = (0, 1, ..., 0)^T, ..., en = (0, 0, ..., 1)^T.   (11)

Example 20. Let V be a vector space. A subspace of dimension 1 is called a line in V. A subspace of dimension 2 is called a plane in V.

Now, we will define the dimension of a vector space V over K, which will be denoted by dim_K V, or simply dim V. A vector space with a basis consisting of a finite number of elements, or the zero vector space, is called finite dimensional. Then, dim V is equal to the number of elements in a finite basis of V. Additionally, vector spaces that do not have a finite basis are called infinite dimensional, and although it is possible to define the concept of an infinite basis, we will leave that for another time.

Thus, the classic question is: when is a set of vectors a basis? For this, we have the concept of a maximal set of linearly independent elements: v1, v2, ..., vn is such a set if, given any element w ∈ V, the elements w, v1, v2, ..., vn are linearly dependent.

Theorem 21. Let V be a vector space, and {v1, v2, ..., vn} a maximal set of linearly independent elements of V. Then, {v1, v2, ..., vn} is a basis of V.

Proof. We must show that v1, v2, ..., vn generate V. Let w be an element of V. The elements w, v1, v2, ..., vn must be linearly dependent by hypothesis. Therefore there exist scalars x0, x1, ..., xn, not all 0, such that

x0 w + x1 v1 + · · · + xn vn = O.

We must have x0 ≠ 0, because otherwise v1, v2, ..., vn would have a relation of linear dependence among them. Therefore, we have

w = − (x1 / x0) v1 − · · · − (xn / x0) vn.

This holds for every w ∈ V, i.e., every w is a linear combination of v1, v2, ..., vn, and hence {v1, v2, ..., vn} is a basis.

Now, we can easily recognize a basis because of the following theorem.


Theorem 22. Let V be a vector space of dimension n, and let v1, v2, ..., vn be linearly independent elements of V. Then, v1, v2, ..., vn constitute a basis of V.

Proof. According to the proof of Theorem 18, {v1, v2, ..., vn} is a maximal set of linearly independent elements of V. Therefore, it is a basis by Theorem 21.

Corollary 23. Let V be a vector space and let W be a subspace. If dim W = dim V, then V = W.

Proof. A basis for W must also be a basis for V by Theorem 22.

Corollary 24. Let V be a vector space of dimension n. Let r be a positive integer with r < n, and let v1, v2, ..., vr be linearly independent elements of V. Then one can find elements v_{r+1}, v_{r+2}, ..., vn such that {v1, v2, ..., vn} is a basis of V.

Proof. Since r < n, we know that {v1, v2, ..., vr} cannot form a basis of V, so it is not a maximal set of linearly independent elements of V. In particular, we can find v_{r+1} ∈ V such that v1, v2, ..., vr, v_{r+1} are linearly independent. If r + 1 < n, we can repeat the argument. We can thus proceed stepwise by induction until we obtain n linearly independent elements {v1, v2, ..., vn}. Then, by Theorem 22, they are a basis. The greedy sketch after this proof illustrates the idea numerically.
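Corollary 24 suggests a simple procedure for numerical vectors: keep appending candidate vectors that increase the rank until n independent vectors have been collected. The following Python sketch is one way to do it (an illustration added here, assuming numpy; the candidate vectors are simply the standard basis).

    import numpy as np

    def extend_to_basis(independent, candidates):
        # Greedily add candidates that enlarge the span, until dim V vectors are reached.
        basis = list(independent)
        n = len(basis[0])
        for c in candidates:
            if len(basis) == n:
                break
            trial = np.column_stack(basis + [c])
            if np.linalg.matrix_rank(trial) == len(basis) + 1:
                basis.append(c)
        return basis

    v1 = np.array([1.0, 2.0, 0.0, 3.0])
    v2 = np.array([0.0, 0.0, 1.0, 4.0])
    std = [np.eye(4)[:, i] for i in range(4)]          # candidates e_1, ..., e_4
    basis = extend_to_basis([v1, v2], std)
    print(len(basis), np.linalg.matrix_rank(np.column_stack(basis)))  # 4 4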

Theorem 25. Let V be a vector space having a basis consisting of n elements. Let W be a subspace which does not consist of O alone. Then W has a basis, and the dimension of W is ≤ n.

Proof. Let w1 be a non-zero element of W. If {w1} is not a maximal set of linearly independent elements of W, we can find an element w2 of W such that w1, w2 are linearly independent. We can proceed in this way, one element at a time, and by Theorem 18 there must be an integer m ≤ n such that w1, w2, ..., wm are linearly independent and form a maximal set. Now, using Theorem 21, we conclude that {w1, w2, ..., wm} is a basis for W.

4 Applications in Machine Learning

A classic application of the previous concepts is the vector of features for each sample in a dataset.

Definition 26. A feature vector is an n-dimensional vector of numerical features that represents an object.

This allows us to use linear algebra to represent basic classification algorithms, because the tuples {(x, y) | x ∈ Kn, y ∈ K} can be easily used to design specific algorithms. For example, in least squared error fitting [5, 6], we need to fit a series of points (Fig. 4) against a specific function. Then, the general problem


Figure 4: The Points for the Fitting.


Figure 5: A Meridian Quadrant from the equator (E) to the north pole (N), showing an arc segment of d degrees and length S modules, centered at latitude L.

is: given a set of functions f1, f2, ..., fK, find values of the coefficients a1, a2, ..., aK such that the linear combination

y = a1 f1(x) + · · · + aK fK(x)   (12)

best fits the observed data.
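In matrix terms, fitting the model of Eq. (12) amounts to building a design matrix whose columns are the functions f_k evaluated at the data points and solving the resulting least-squares problem. Here is a sketch (added for illustration, with the hypothetical basis functions 1, x, x^2; numpy assumed).

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 1.0, 50)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)  # noisy samples

    # Design matrix: column k holds f_k evaluated at every data point x_i.
    F = np.column_stack([np.ones_like(x), x, x**2])

    # Coefficients a minimizing ||F a - y||^2.
    a, *_ = np.linalg.lstsq(F, y, rcond=None)
    print(a)   # approximately [1, 2, -3]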

For example, Gauss seems to have used an earlier version of the Least Squared Error [6] and the following approximation for short arcs,

a = z + y sin^2 L,   (13)

in order to calculate the meter, equal to one 10,000,000th part of the meridian quadrant (Fig. 5). For more on this, please take a look at the article [6].

Now, going back to the Least Squared Error, we have that, given the dataset {(x1, y1), ..., (xN, yN)}, it is possible to define the sample mean as

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.   (14)


Thus, consider the following two datasets of points x, where the sample space V has dim V = 1:

{10, 20, 30, 40, 50} and {30, 30, 30, 30, 30}.

We notice that both have the same sample mean, but very different variances, where the variance is defined as

\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2.

Therefore, we can use this as a tool to measure how much a dataset fluctuates around the mean. Furthermore, by using this idea, we can quantify the meaning of “best fit.” For this we assume that our data is coming from a linear equation y = ax + b, so that y − (ax + b) ≈ 0. Thus, given the observations {(x1, y1), ..., (xN, yN)}, we get the following errors:

{y1 − (ax1 + b) , ..., yN − (axN + b)} .

Then, the mean of these errors should be really small (if it is a good fit), and their variance measures how good a fit we have:

\sigma^2_{y-(ax+b)} = \frac{1}{N} \sum_{i=1}^{N} (y_i - (a x_i + b))^2.   (15)

Here, large errors are given a higher weight than smaller errors (due to the squaring). Thus, this procedure favors many medium-sized errors over a few large errors. It would be possible to do something similar using the absolute value |y − (ax + b)|, but this is not differentiable.

Finally, we can define the error E_i(a, b) = (y_i − (a x_i + b))^2, making it possible to obtain the quadratic accumulated error

E(a, b) = \sum_{i=1}^{N} E_i(a, b) = \sum_{i=1}^{N} (y_i - (a x_i + b))^2.   (16)

Therefore, the goal is to minimize the previous equation by finding the values of a and b. We can do that by setting the partial derivatives to zero:

\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0.

Note that we do not have to worry about boundary points, because as |a| and |b| become large the fit gets worse and worse. Therefore, we do not need to check the boundary.

Finally, we get


\frac{\partial E}{\partial a} = \sum_{i=1}^{N} 2 (y_i - (a x_i + b)) \cdot (-x_i),

\frac{\partial E}{\partial b} = \sum_{i=1}^{N} 2 (y_i - (a x_i + b)) \cdot (-1).

Then, we obtain the following

\sum_{i=1}^{N} (y_i - (a x_i + b)) \cdot x_i = 0,

\sum_{i=1}^{N} (y_i - (a x_i + b)) = 0.

Finally, by rewriting the previous equations

a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} x_i y_i,

a \sum_{i=1}^{N} x_i + b N = \sum_{i=1}^{N} y_i.

Then, it is possible to rewrite this in terms of matrices and vectors:

\begin{pmatrix} \sum_{i=1}^{N} x_i^2 & \sum_{i=1}^{N} x_i \\ \sum_{i=1}^{N} x_i & N \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{N} x_i y_i \\ \sum_{i=1}^{N} y_i \end{pmatrix}   (17)

Do you remember this? Yes, it is the linear representation Ax = b. We will continue with this in the next class to obtain the optimal solution.
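As a closing illustration (a sketch added here, not part of the original notes; numpy assumed), the 2 × 2 system of Eq. (17) can be assembled and solved directly for the line y = ax + b.

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 10.0, 30)
    y = 2.5 * x + 1.0 + 0.3 * rng.standard_normal(x.size)   # noisy samples of a line

    N = x.size
    # Normal equations of Eq. (17): M (a, b)^T = rhs.
    M = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    N        ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])

    a, b = np.linalg.solve(M, rhs)
    print(a, b)   # close to the true slope 2.5 and intercept 1.0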


References

[1] Michael Theodore Eismann. Hyperspectral Remote Sensing. Bellingham, Wash.: SPIE Press, 2012.

[2] K. Hoffman and R. A. Kunze. Linear Algebra. Prentice-Hall Mathematics Series. Prentice-Hall, 1971.

[3] I. Kleiner. A History of Abstract Algebra. Birkhäuser Boston, 2007.

[4] S. Lang. Linear Algebra. Springer Undergraduate Texts in Mathematics and Technology. Springer, 1987.

[5] S. J. Miller. The Method of Least Squares. Mathematics Department, Brown University, 2006.

[6] Stephen M. Stigler. Gauss and the invention of least squares. The Annals of Statistics, 9(3):465–474, 1981.
