Page 1: Machine Learning Preliminaries and Math Refresher

Machine Learning Preliminaries and Math Refresher

M. Lüthi, T. Vetter

February 18, 2008

Page 2: Machine Learning Preliminaries and Math Refresher

Outline

1 General remarks about learning

2 Probability Theory and Statistics

3 Linear spaces

Page 4: Machine Learning Preliminaries and Math Refresher

The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial.

T. Poggio and C.R. Shelton

Page 5: Machine Learning Preliminaries and Math Refresher

Model building in natural sciences

Model building

Given a phenomenon, construct a model for it.

Example (Heat Conduction)

Phenomenon: The spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature.
Model:

∂Q/∂t = −k ∮_S ∇T · dS

Page 7: Machine Learning Preliminaries and Math Refresher

Learning as Model Building

Example (Learning)

Phenomenon: Learning (Inferring general rules from examples).
Model:

f* = arg max_{f ∈ H} P(f) P(D | f) / P(D)

Neural networks, Decision Trees, Naive Bayes, Support Vector Machines, etc.

Models for learning

The models for learning are the learning algorithms
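
To make the MAP rule above concrete, here is a minimal, hypothetical Python sketch (not part of the slides): a toy coin-tossing setup in which the hypothesis space H contains three candidate head-probabilities, the prior P(f) is chosen arbitrarily for illustration, and D is a short sequence of observed tosses. Since P(D) does not depend on f, it can be dropped from the arg max.

```python
# Hypothetical toy example of f* = arg max_{f in H} P(f) P(D | f) / P(D).
hypotheses = [0.3, 0.5, 0.7]                  # H: candidate values of P(heads)
prior = {0.3: 0.25, 0.5: 0.5, 0.7: 0.25}      # P(f), chosen arbitrarily here
data = [1, 1, 0, 1, 1]                        # D: observed tosses (1 = heads)

def likelihood(f, data):
    """P(D | f) for i.i.d. Bernoulli tosses with P(heads) = f."""
    p = 1.0
    for x in data:
        p *= f if x == 1 else 1.0 - f
    return p

# P(D) is the same for every f, so it can be ignored in the arg max.
scores = {f: prior[f] * likelihood(f, data) for f in hypotheses}
f_star = max(scores, key=scores.get)
print(f_star)  # 0.7 is the MAP hypothesis for this data
```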

Page 8: Machine Learning Preliminaries and Math Refresher

Goals of the first block

Life is short . . .

We want to cover the essentials of learning.

General Setting
Mathematically precise setting of the learning problem
Valid for any kind of learning algorithm

Statistical Learning Theory
When does learning work
Conditions any algorithm has to satisfy
Performance bounds

Kernel Methods
Theory of Kernels
Make linear algorithms non-linear.
Learning from non-vectorial data.

Page 9: Machine Learning Preliminaries and Math Refresher

Mathematics needed in the first block

The need for mathematics

As we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms.

General Setting
Probability theory
Statistics
Basic optimization theory

Statistical Learning Theory
More probability theory
More statistics

Kernel Methods
Linear spaces
Linear algebra
Basic optimization theory

A bit of mathematical maturity and an open mind are required. The rest will be explained.

Page 11: Machine Learning Preliminaries and Math Refresher

Nothing is more practical than a good theory.

Vladimir N. Vapnik

Nothing (in computer science) is more beautiful than learning theory?

M. Lüthi

Page 13: Machine Learning Preliminaries and Math Refresher

Outline

1 General remarks about learning

2 Probability Theory and Statistics

3 Linear spaces

Page 15: Machine Learning Preliminaries and Math Refresher

Probability theory vs Statistics

Definition (Probability Theory)

A branch of mathematics concerned with the analysis of random phenomena.

General ⇒ Specific

Definition (Statistics)

The science of collecting, analyzing, presenting, and interpreting data.

Specific ⇒ General

Statistical machine learning is closely related to (inferential) statistics.

Many state-of-the-art learning algorithms are based on concepts from probability theory.

Page 16: Machine Learning Preliminaries and Math Refresher

Probabilities

Definition (Probability Space)

A probability space is the triple

(Ω, F, P)

where

Ω is a set of events ω

F is a collection of events (e.g. the power-set P(Ω))

P is a measure that satisfies the probability axioms.

Page 17: Machine Learning Preliminaries and Math Refresher

Axioms of Probability

1 For any A ∈ F, there exists a number P(A), the probability of A, satisfying P(A) ≥ 0.

2 P(Ω) = 1.

3 Let A_n, n ≥ 1, be a collection of pairwise disjoint events, and let A be their union. Then

P(A) = ∑_{n=1}^{∞} P(A_n).
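
As an illustration only (not from the slides), the following Python sketch checks the three axioms on a small finite probability space: a fair six-sided die, with F taken to be the power set of Ω and P the uniform measure. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})          # outcomes of a fair die

def P(event):
    """Uniform probability measure: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

# F: the power set of omega.
F = [frozenset(c) for r in range(len(omega) + 1)
     for c in combinations(omega, r)]

assert all(P(A) >= 0 for A in F)               # axiom 1: P(A) >= 0
assert P(omega) == 1                           # axiom 2: P(Omega) = 1
A1, A2 = frozenset({1, 2}), frozenset({5})     # pairwise disjoint events
assert P(A1 | A2) == P(A1) + P(A2)             # axiom 3 (finite additivity)
```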

Page 18: Machine Learning Preliminaries and Math Refresher

Independence

Definition (Independence)

Two events, A and B, are independent iff the probability of their intersection equals the product of the individual probabilities, i.e.

P(A ∩ B) = P(A) · P(B).

Definition (Conditional probability)

Given two events A and B, with P(B) > 0, we define the conditional probability for A given B, P(A|B), by the relation

P(A|B) = P(A ∩ B) / P(B).
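
A small hypothetical sketch (not from the slides) may help: on the space of two fair dice, the events "first die shows 6" and "second die is even" are independent, and the conditional probability follows directly from the definition above.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), range(1, 7)))   # all 36 outcomes (d1, d2)

def P(event):
    """Uniform measure on the two-dice space."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 6}        # first die shows 6
B = {w for w in omega if w[1] % 2 == 0}    # second die is even

assert P(A & B) == P(A) * P(B)             # independence: P(A ∩ B) = P(A)·P(B)

P_A_given_B = P(A & B) / P(B)              # P(A | B) = P(A ∩ B) / P(B)
print(P_A_given_B)                         # 1/6: knowing B says nothing about A
```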

Page 19: Machine Learning Preliminaries and Math Refresher

Random Variables

A single event is not that interesting.

Definition (Random Variable)

A random variable X is a function from the probability space to a vector of real numbers

X : Ω → R^n.

Random variables are characterized by their distribution function F :

Definition (Probability Distribution Function)

Let X : Ω → R be a random variable. We define

F_X(x) = P(X ≤ x), −∞ < x < ∞.
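
For a discrete example (a hypothetical sketch, not from the slides), the distribution function of a fair die can be evaluated directly from the definition F_X(x) = P(X ≤ x):

```python
from fractions import Fraction

outcomes = range(1, 7)          # values taken by X, the roll of a fair die

def F(x):
    """F_X(x) = P(X <= x) for the uniform die."""
    favourable = sum(1 for k in outcomes if k <= x)
    return Fraction(favourable, 6)

print(F(0))    # 0   : no outcome is <= 0
print(F(3.5))  # 1/2 : outcomes 1, 2, 3
print(F(6))    # 1   : all outcomes
```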

Page 20: Machine Learning Preliminaries and Math Refresher

Probability density function

Definition (Probability density function)

The density function is the function f_X with the property

F_X(x) = ∫_{−∞}^{x} f_X(y) dy, −∞ < x < ∞.
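
As a hedged illustration (not from the slides), the sketch below recovers F_X from f_X by numerical integration for a standard normal density and compares the result with the closed-form CDF expressed via the error function; the integration bounds and step count are arbitrary choices.

```python
import math

def f(y):
    """Standard normal density f_X(y)."""
    return math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi)

def F_numeric(x, lower=-10.0, steps=100_000):
    """F_X(x) ~= integral of f_X from `lower` to x (trapezoidal rule)."""
    h = (x - lower) / steps
    total = 0.5 * (f(lower) + f(x))
    for i in range(1, steps):
        total += f(lower + i * h)
    return total * h

def F_exact(x):
    """Closed-form standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 1.5):
    assert abs(F_numeric(x) - F_exact(x)) < 1e-6
```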

Page 21: Machine Learning Preliminaries and Math Refresher

Convergence

Definition (Convergence in Probability)

Let X_1, X_2, . . . be random variables. We say that X_n converges in probability to the random variable X as n → ∞ iff, for all ε > 0,

P(|X_n − X| > ε) → 0 as n → ∞.

We write X_n →ᵖ X as n → ∞.

Page 22: Machine Learning Preliminaries and Math Refresher

Weak law of large numbers

Theorem (Bernoulli's Theorem (Weak law of large numbers))

Let X_1, . . . , X_n be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then

P[|(X_1 + . . . + X_n)/n − µ| > ε] → 0 as n → ∞.

Thus, given enough observations x_i ∼ F_X, the sample mean x̄ = (1/n) ∑_{i=1}^{n} x_i will approach the true mean µ.
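
The following hypothetical simulation (not part of the slides) illustrates the theorem for i.i.d. uniform(0, 1) variables, whose true mean is µ = 0.5: the fraction of runs in which the sample mean deviates from µ by more than ε shrinks as n grows.

```python
import random

random.seed(0)
mu, eps, runs = 0.5, 0.05, 1000

for n in (10, 100, 1000):
    bad = 0
    for _ in range(runs):
        sample_mean = sum(random.random() for _ in range(n)) / n
        if abs(sample_mean - mu) > eps:
            bad += 1
    # empirical estimate of P(|x_bar - mu| > eps); it decreases with n
    print(n, bad / runs)
```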

Page 23: Machine Learning Preliminaries and Math Refresher

Expectation

Definition (Expectation)

Let X be a random variable with probability density function f_X, and g : R → R a function. We define the expectation

E[g(X)] := ∫_{−∞}^{∞} g(x) f_X(x) dx.

Definition (Sample mean)

Let a sample x = x_1, x_2, . . . , x_n be given. We define the (sample) mean to be

x̄ = (1/n) ∑_{i=1}^{n} x_i.

Page 24: Machine Learning Preliminaries and Math Refresher

Variance

Definition (Variance)

Let X be a random variable with density function f_X. The variance is given by

Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².

The square root √Var[X] of the variance is referred to as the standard deviation.

Definition (Sample Variance)

Let the sample x = x_1, x_2, . . . , x_n with sample mean x̄ be given. We define the sample variance to be

s² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)².
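
A tiny worked example (hypothetical, not from the slides) of both definitions, using the 1/(n − 1) normalisation for the sample variance:

```python
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]     # an arbitrary small sample
n = len(xs)

x_bar = sum(xs) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)  # sample variance

print(x_bar)  # 5.0
print(s2)     # 32/7 ~ 4.571..., the unbiased sample variance
```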

Page 25: Machine Learning Preliminaries and Math Refresher

Notation

Assume F has a probability density function:

f(x) = dF(x)/dx

Formally, we write: f(x) dx = dF(x)

Example: Expectation

E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx = ∫_{−∞}^{∞} g(x) dF(x)

Page 26: Machine Learning Preliminaries and Math Refresher

Outline

1 General remarks about learning

2 Probability Theory and Statistics

3 Linear spaces

Page 27: Machine Learning Preliminaries and Math Refresher

Vector Space

A set V together with two binary operations

1 vector addition + : V × V → V and

2 scalar multiplication · : R × V → V

is called a vector space over R, if it satisfies the following axioms:

1 ∀x, y ∈ V : x + y = y + x (commutativity)

2 ∀x, y, z ∈ V : x + (y + z) = (x + y) + z (associativity)

3 ∃0 ∈ V, ∀x ∈ V : 0 + x = x (identity of vector addition)

4 ∀x ∈ V : 1 · x = x, where 1 is the multiplicative identity of R (identity of scalar multiplication)

5 ∀x ∈ V, ∃(−x) ∈ V : x + (−x) = 0 (additive inverse element)

6 ∀α ∈ R, ∀x, y ∈ V : α · (x + y) = α · x + α · y (distributivity)

7 ∀α, β ∈ R, ∀x ∈ V : (α + β) · x = α · x + β · x (distributivity)

8 ∀α, β ∈ R, ∀x ∈ V : α · (β · x) = (αβ) · x

Page 28: Machine Learning Preliminaries and Math Refresher

Vector Space

More importantly for us, the definition implies:

x + y ∈ V, ∀x, y ∈ V

αx ∈ V, ∀α ∈ R, ∀x ∈ V

Subspace criterion

Let V be a vector space over R, and let W be a subset of V. Then W is a subspace if and only if it satisfies the following 3 conditions:

1 0 ∈ W

2 If x, y ∈ W then x + y ∈ W

3 If x ∈ W and α ∈ R then αx ∈ W
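
As a hypothetical illustration (not from the slides), the sketch below spot-checks the three conditions for W = {(x, y, z) ∈ R³ : x + y + z = 0} on randomly chosen vectors and scalars; this illustrates, but of course does not prove, that W is a subspace.

```python
import random

random.seed(1)

def in_W(v):
    """Membership test for W = {v in R^3 : v1 + v2 + v3 = 0}, with float tolerance."""
    return abs(sum(v)) < 1e-9

def random_W_vector():
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    return (x, y, -x - y)                     # constructed so that x + y + z = 0

assert in_W((0.0, 0.0, 0.0))                  # condition 1: 0 is in W
for _ in range(100):
    u, v = random_W_vector(), random_W_vector()
    alpha = random.uniform(-5, 5)
    assert in_W(tuple(a + b for a, b in zip(u, v)))   # condition 2: closed under +
    assert in_W(tuple(alpha * a for a in u))          # condition 3: closed under scaling
```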

Page 29: Machine Learning Preliminaries and Math Refresher

Normed spaces

Definition (Normed vector space)

A normed vector space is a pair (V, ‖·‖) where V is a vector space and ‖·‖ is the associated norm, satisfying the following properties for all u, v ∈ V:

1 ‖v‖ ≥ 0 (positivity)

2 ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality)

3 ‖αv‖ = |α|‖v‖ (positive scalability)

4 ‖v‖ = 0 ⇔ v = 0 (positive definiteness)

Page 30: Machine Learning Preliminaries and Math Refresher

Definition (Inner product space)

A real inner product space is a pair (V, 〈·, ·〉), where V is a real vector space and 〈·, ·〉 the associated inner product, satisfying the following properties for all u, v, w ∈ V:

1 〈u, v〉 = 〈v , u〉 (symmetry)

2 〈αu, v〉 = α〈u, v〉, 〈u, αv〉 = α〈u, v〉, and 〈u + v, w〉 = 〈u, w〉 + 〈v, w〉, 〈u, v + w〉 = 〈u, v〉 + 〈u, w〉 (bilinearity)

3 〈u, u〉 ≥ 0 (positive definiteness)

Definition (Strict inner product space)

An inner product space is called strict if

〈u, u〉 = 0 ⇔ u = 0

Page 31: Machine Learning Preliminaries and Math Refresher

Inner product space

The strict inner product

induces a norm, ‖f‖² = 〈f, f〉, and is used to define distances and angles between elements.

Theorem (Cauchy Schwarz inequality)

For all vectors u and v of a real inner product space (V, 〈·, ·〉), the following inequality holds:

|〈u, v〉| ≤ ‖u‖‖v‖.
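
As a closing illustration (hypothetical, not from the slides), the usual dot product on R⁵ is a strict inner product; the sketch below computes its induced norm and spot-checks the Cauchy-Schwarz inequality on random vectors, with a tiny slack for floating-point rounding.

```python
import math
import random

random.seed(2)

def inner(u, v):
    """Euclidean dot product <u, v> on R^n."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Induced norm: ||u|| = sqrt(<u, u>)."""
    return math.sqrt(inner(u, u))

for _ in range(1000):
    u = [random.uniform(-10, 10) for _ in range(5)]
    v = [random.uniform(-10, 10) for _ in range(5)]
    assert abs(inner(u, v)) <= norm(u) * norm(v) + 1e-9   # |<u,v>| <= ||u|| ||v||
```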

Page 32: Machine Learning Preliminaries and Math Refresher

If you're not comfortable with any of the presented material, you should take your favourite textbook and read up on it within the next two weeks.
