Learning Deep Learning
M1 Sonse Shimaoka
Jan 06, 2017
Transcript
Page 1: Learning Deep Learning

Learning Deep Learning

M1 Sonse Shimaoka

Page 2: Learning Deep Learning

Neural Network

(From the lecture slides of Nando de Freitas)

Page 3: Learning Deep Learning

Machine Learning

Page 4: Learning Deep Learning

Supervised Learning

Training data:

Input                      Output
[0,0,1,0,1,1,0,0,0,1,1]    1
[1,1,1,0,1,1,1,0,0,1,1]    0
[1,1,1,0,1,1,0,0,0,1,1]    0
[0,0,0,0,1,1,1,0,0,0,0]    1
[1,0,1,0,1,1,0,0,0,0,0]    1
[1,0,1,0,0,0,0,0,0,1,1]    0
[0,0,0,0,1,1,0,1,0,1,1]    1

Test data:

Input                      Output
[1,0,1,0,1,1,0,0,0,1,0]    ?
[1,1,1,1,1,1,1,0,0,1,1]    ?
[1,0,1,0,1,1,0,1,0,1,1]    ?

Generalization

Page 5: Learning Deep Learning

Perceptron

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed a weighted sum ∑ followed by sign, producing y]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

Page 6: Learning Deep Learning

Perceptron

[Diagram: inputs 1, 3, −2 with weights 2, 1, 1.5 and bias 0.5 feed ∑ followed by sign]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

Page 7: Learning Deep Learning

Perceptron

(x1, x2, x3) = (1, 3, −2), (w1, w2, w3) = (2, 1, 1.5), b = 0.5

y = sign( ∑_{i=1}^{3} w_i x_i + b ) = sign( w1·x1 + w2·x2 + w3·x3 + b )
  = sign( 1·2 + 3·1 − 2·1.5 + 0.5 ) = sign(2.5) = 1
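
A minimal NumPy sketch of this perceptron, using the numbers from the worked example (the function and variable names are mine, not the slides'):

    import numpy as np

    def perceptron(x, w, b):
        # Weighted sum followed by the sign function.
        return np.sign(w @ x + b)

    x = np.array([1.0, 3.0, -2.0])   # (x1, x2, x3)
    w = np.array([2.0, 1.0, 1.5])    # (w1, w2, w3)
    b = 0.5
    print(perceptron(x, w, b))       # sign(2.5) = 1.0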

Page 8: Learning Deep Learning

Perceptron

[Plot: in the (x1, x2) plane, the perceptron's decision boundary is the line w1·x1 + w2·x2 + b = 0]

Page 9: Learning Deep Learning

Problem with Perceptron

[Plot: the line w1·x1 + w2·x2 + b = 0 separating the (x1, x2) plane, with a query point near the boundary]

What is the probability that this point belongs to the positive class?

The perceptron can't answer this!

Page 10: Learning Deep Learning

Problem with Perceptron

[Plot: two classes of points in the (x1, x2) plane arranged so that no single line separates them]

Impossible to separate linearly!!

Page 11: Learning Deep Learning

Logistic Regression

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed ∑ followed by sigmoid, producing y]

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

Page 12: Learning Deep Learning

Logistic Regression

sigmoid(x) = 1 / (1 + exp(−x))

Page 13: Learning Deep Learning

Logistic Regression

[Diagram: inputs 1, 3, −2 with weights 2, 1, 1.5 and bias 0.5 feed ∑ followed by sigmoid]

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

sigmoid(2.5) = 0.924

A probability!!
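
The same computation with sigmoid in place of sign; a minimal sketch with illustrative names:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression(x, w, b):
        # Weighted sum followed by the sigmoid, yielding a probability.
        return sigmoid(w @ x + b)

    x = np.array([1.0, 3.0, -2.0])
    w = np.array([2.0, 1.0, 1.5])
    b = 0.5
    print(logistic_regression(x, w, b))  # sigmoid(2.5) ≈ 0.924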

Page 14: Learning Deep Learning

Feature Transformation

[Diagram: points in the original (x1, x2) space mapped by a non-linear transformation Φ to a new space with coordinates φ1(x1, x2), φ2(x1, x2), where they become linearly separable]

But we must still design the transformation...

Page 15: Learning Deep Learning

Feed Forward Neural Network

[Diagram: a neuron, a weighted sum ∑ followed by an activation function f]

Page 16: Learning Deep Learning

Feed Forward Neural Network

[Diagram: inputs x1, x2, x3 feed three neurons (∑, f) producing h1, h2, h3, which feed one neuron (∑, g) producing y1]

Page 17: Learning Deep Learning

Feed Forward Neural Network

[Same diagram: x1, x2, x3 → h1, h2, h3 → y1]

input layer, hidden layer, output layer

Page 18: Learning Deep Learning

Abstraction by Layer

x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y

Page 19: Learning Deep Learning

FFN can learn representations!!

Page 20: Learning Deep Learning

FFN can learn representations!!

Page 21: Learning Deep Learning

Activation Functions

sigmoid(x) = 1 / (1 + exp(−x)) = exp(x) / (exp(x) + 1)

Page 22: Learning Deep Learning

Activation Functions

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Page 23: Learning Deep Learning

Activation Functions

rectifier(x) = max(0, x)

Page 24: Learning Deep Learning

Activation Functions

softmax(x1, ..., xm)_c = exp(x_c) / ∑_{k=1}^{m} exp(x_k)
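
The four activations above, rendered as a short NumPy sketch (the max-shift in softmax is a standard numerical-stability trick, not something from the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def rectifier(x):
        return np.maximum(0.0, x)

    def softmax(x):
        e = np.exp(x - np.max(x))  # subtract the max for numerical stability
        return e / e.sum()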

Page 25: Learning Deep Learning

Loss Functions

• When you want a model to learn to do something, you give it feedback on how well it is doing.

• The function that computes an objective measure of the model's performance is called a loss function.

• A typical loss function takes the model's output and the ground truth and computes a value that quantifies the model's performance.

• The model then corrects itself to reduce the loss.

Page 26: Learning Deep Learning

L2 norm

Task: Regression
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} ‖t_i − y_i‖²₂

Page 27: Learning Deep Learning

Cross Entropy

Task: Binary Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

Page 28: Learning Deep Learning

Class Negative Log Likelihood

Task: Multi-Class Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}

Page 29: Learning Deep Learning

Output activation functions and Loss functions

Task                        Output activation   Loss function
Regression                  Linear              L2 norm
Binary Classification       Sigmoid             Cross Entropy
Multi-Class Classification  Softmax             Class NLL
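
The three losses from the table, sketched in NumPy for batched inputs (rows are examples; the names are mine):

    import numpy as np

    def l2_loss(t, y):
        # L = (1/n) * sum_i ||t_i - y_i||^2
        return np.mean(np.sum((t - y) ** 2, axis=-1))

    def cross_entropy(t, y):
        # L = (1/n) * sum_i [ -t_i log y_i - (1 - t_i) log(1 - y_i) ]
        return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

    def class_nll(t, y):
        # L = -(1/n) * sum_i sum_k t_{i,k} log y_{i,k}
        return -np.mean(np.sum(t * np.log(y), axis=-1))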

Page 30: Learning Deep Learning

Probabilistic Perspective

• We can assume NNs are computing conditional probabilities.

[Diagram: the network x1, x2, x3 → h1, h2, h3 → output, whose value is interpreted as p(t1 | x1, x2, x3)]

Page 31: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = (1 / (√(2π)·σ)) exp( −(t − y)² / (2σ²) ):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −∑_{i=1}^{n} log [ (1 / (√(2π)·σ)) exp( −(t_i − y_i)² / (2σ²) ) ]
    = (1 / (2σ²)) ∑_{i=1}^{n} (t_i − y_i)² + n·log(√(2π)·σ)

→ L2 norm (up to a constant)

Page 32: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = yᵗ (1 − y)^(1−t):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
    = ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

→ Cross Entropy

Page 33: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = ∏_{k=1}^{m} y_k^{t_k}:

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} ∏_{k=1}^{m} y_{i,k}^{t_{i,k}}
    = −∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}

→ Class Negative Log Likelihood

Page 34: Learning Deep Learning

Gradient Descent

• Gradient
• Gradient Descent

Page 35: Learning Deep Learning

Gradient Descent

Function to be minimized: L(w)
Initial point: w_init
Learning rate: α
Update rule: w_new ← w_old − α · ∂L/∂w |_{w = w_old}
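
The update rule as a few lines of Python; the quadratic test function is mine, chosen so the minimum is easy to check:

    def gradient_descent(grad_L, w_init, alpha=0.1, steps=100):
        # Iterate w <- w - alpha * dL/dw from the initial point.
        w = w_init
        for _ in range(steps):
            w = w - alpha * grad_L(w)
        return w

    # Minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
    print(gradient_descent(lambda w: 2 * (w - 3.0), w_init=0.0))  # ≈ 3.0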

Page 36: Learning Deep Learning

Gradient Descent

[Plots: gradient descent with a big learning rate vs. a small learning rate]

Page 37: Learning Deep Learning

Loss function for Logistic regression

L(w,b;D) = log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
         = ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

y_i = 1 / (1 + exp(−wᵀx_i − b))

(This L is the log-likelihood of the data, so it is maximized; minimizing −L is the same as minimizing the cross entropy.)

Page 38: Learning Deep Learning

Gradient with respect to w

∂L(w,b;D)/∂w
 = ∂/∂w ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
 = ∑_{i=1}^{n} ∂/∂w ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ∂/∂y_i ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ( t_i/y_i − (1 − t_i)/(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} x_i y_i (1 − y_i) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} x_i (t_i − y_i)

∵ ∂y_i/∂w = ∂/∂w [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = − [ ∂/∂w (1 + exp(−wᵀx_i − b)) ] / (1 + exp(−wᵀx_i − b))²
          = x_i exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = x_i y_i (1 − y_i)

Page 39: Learning Deep Learning

Gradient with respect to b

∂L(w,b;D)/∂b
 = ∂/∂b ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
 = ∑_{i=1}^{n} (∂y_i/∂b) · ∂/∂y_i ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂b) · ( t_i/y_i − (1 − t_i)/(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂b) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} y_i (1 − y_i) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} (t_i − y_i)

∵ ∂y_i/∂b = ∂/∂b [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = − [ ∂/∂b (1 + exp(−wᵀx_i − b)) ] / (1 + exp(−wᵀx_i − b))²
          = exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = y_i (1 − y_i)

Page 40: Learning Deep Learning

Gradient Descent for Logistic Regression

Function to be maximized (the log-likelihood):

L(w,b;D) = ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

Update rules (gradient ascent on L, i.e. descent on −L):

w_new ← w_old + α ∑_{i=1}^{n} x_i (t_i − y_i)

b_new ← b_old + α ∑_{i=1}^{n} (t_i − y_i)
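
Putting the derived gradients into a training loop; a sketch on toy data (all names and the data are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, t, alpha=0.1, steps=1000):
        # Gradient ascent on the log-likelihood, using
        # dL/dw = sum_i x_i (t_i - y_i) and dL/db = sum_i (t_i - y_i).
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(steps):
            y = sigmoid(X @ w + b)         # predictions for all n examples
            w = w + alpha * X.T @ (t - y)
            b = b + alpha * np.sum(t - y)
        return w, b

    # Toy data: the label is 1 when the first feature is positive.
    X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, 1.0]])
    t = np.array([1.0, 1.0, 0.0, 0.0])
    w, b = train_logistic(X, t)
    print(np.round(sigmoid(X @ w + b)))    # ≈ [1, 1, 0, 0]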

Page 41: Learning Deep Learning

Exercise: Gradient Descent for Linear Regression

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²

y_i = wᵀx_i + b

Page 42: Learning Deep Learning

Answer

Function to be minimized:

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²

Gradients: ∂L/∂w = −2 ∑_{i=1}^{n} x_i (t_i − y_i),  ∂L/∂b = −2 ∑_{i=1}^{n} (t_i − y_i)

Update rules:

w_new ← w_old + 2α ∑_{i=1}^{n} x_i (t_i − y_i)

b_new ← b_old + 2α ∑_{i=1}^{n} (t_i − y_i)
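
One way to check the answer is to compare the analytic gradient against a finite-difference estimate; a quick sketch on random data:

    import numpy as np

    def loss(w, b, X, t):
        return np.sum((t - (X @ w + b)) ** 2)

    def grad_w(w, b, X, t):
        # dL/dw = -2 * sum_i x_i (t_i - y_i)
        return -2 * X.T @ (t - (X @ w + b))

    rng = np.random.default_rng(0)
    X, t = rng.normal(size=(5, 3)), rng.normal(size=5)
    w, b = rng.normal(size=3), 0.2

    eps = 1e-6
    numeric = np.array([(loss(w + eps * np.eye(3)[j], b, X, t)
                         - loss(w - eps * np.eye(3)[j], b, X, t)) / (2 * eps)
                        for j in range(3)])
    print(np.allclose(numeric, grad_w(w, b, X, t), atol=1e-4))  # True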

Page 43: Learning Deep Learning

Backpropagation

How do we compute ∂L/∂W and ∂L/∂V?

[Diagram: inputs x1, x2, x3 → weights W (e.g. W2,3) → pre-activations u1, u2, u3 → f → h1, h2, h3 → weights V (e.g. V3,1) → pre-activations l1, l2 → g → outputs y1, y2 → loss L]

Page 44: Learning Deep Learning

Backpropagation

Use the Chain Rule!!!

∂/∂x [ q(s(x)) ] = (∂s(x)/∂x) · (∂q(s(x))/∂s(x))

[Diagram: x → s → s(x) → q → q(s(x)); the gradient ∂q(s(x))/∂s(x) flows backward through s, producing (∂s(x)/∂x) · (∂q(s(x))/∂s(x))]
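
The chain rule, checked numerically on a tiny example (the functions are mine, chosen for illustration):

    import math

    # d/dx q(s(x)) = s'(x) * q'(s(x))
    s, ds = math.sin, math.cos
    q, dq = (lambda z: z ** 2), (lambda z: 2 * z)

    x = 0.7
    analytic = ds(x) * dq(s(x))
    numeric = (q(s(x + 1e-6)) - q(s(x - 1e-6))) / 2e-6
    print(abs(analytic - numeric) < 1e-6)  # True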

Page 45: Learning Deep Learning

Backpropagation

Start from the output layer: ∂L/∂y1

[Network diagram as before, with ∂L/∂y1 marked at the output]

Page 46: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂l1 = (∂y1/∂l1) · ∂L/∂y1 = g′(l1) · ∂L/∂y1

[Network diagram as before, with ∂L/∂y1 and ∂L/∂l1 marked]

Page 47: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂V3,1 = (∂l1/∂V3,1) · ∂L/∂l1 = h3 · ∂L/∂l1

[Network diagram as before, with ∂L/∂V3,1 now marked as well]

Page 48: Learning Deep Learning

Backpropagation

Apply Chain Rule (h3 feeds both outputs, so the two paths add):

∂L/∂h3 = (∂l1/∂h3) · ∂L/∂l1 + (∂l2/∂h3) · ∂L/∂l2 = V3,1 · ∂L/∂l1 + V3,2 · ∂L/∂l2

[Network diagram as before, with ∂L/∂h3 now marked as well]

Page 49: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂u3 = (∂h3/∂u3) · ∂L/∂h3 = f′(u3) · ∂L/∂h3

[Network diagram as before, with ∂L/∂u3 now marked as well]

Page 50: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂W2,3 = (∂u3/∂W2,3) · ∂L/∂u3 = x2 · ∂L/∂u3

[Network diagram as before, with ∂L/∂W2,3 now marked as well]

Page 51: Learning Deep Learning

Abstraction by Layer

Forward: x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t)

Backward: ∂L/∂x ← ∂L/∂(Wx) ← ∂L/∂h ← ∂L/∂(Vh) ← ∂L/∂y, with parameter gradients ∂L/∂W and ∂L/∂V at the linear layers

Page 52: Learning Deep Learning

Abstraction by Layer

[Diagram: a Layer box; input flows in and output flows out; ∂loss/∂output flows back in and ∂loss/∂input flows back out]

Page 53: Learning Deep Learning

Abstraction by Layer

Forward Computation:

output = Layer.forward(input)

Page 54: Learning Deep Learning

Abstraction by Layer

Backward Computation:

∂loss/∂input = Layer.backward(input, ∂loss/∂output)
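
This forward/backward interface maps directly onto a small class per layer. A Python sketch of two such layers in the slides' style (Torch7's actual module API uses different method names):

    import numpy as np

    class Linear:
        # forward(input) -> output; backward(input, dL/doutput) -> dL/dinput
        def __init__(self, n_in, n_out):
            self.W = np.random.randn(n_out, n_in) * 0.01
            self.b = np.zeros(n_out)

        def forward(self, x):
            return self.W @ x + self.b

        def backward(self, x, grad_output):
            # Store parameter gradients, return the gradient w.r.t. the input.
            self.grad_W = np.outer(grad_output, x)  # dL/dW = (dL/du) x^T
            self.grad_b = grad_output               # dL/db = dL/du
            return self.W.T @ grad_output           # dL/dx = W^T dL/du

    class Tanh:
        def forward(self, x):
            return np.tanh(x)

        def backward(self, x, grad_output):
            h = np.tanh(x)
            return (1 - h ** 2) * grad_output       # tanh'(u) = 1 - h^2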

Page 55: Learning Deep Learning

Backpropagation

① Execute the forward computation

x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t)

Page 56: Learning Deep Learning

Backpropagation

② Compute the derivative of the loss function with respect to the output: ∂L/∂y

Page 57: Learning Deep Learning

Backpropagation

③ Starting from the final layer, backpropagate derivatives through the layers: ∂L/∂y → ∂L/∂(Vh) → ...

Page 58: Learning Deep Learning

Classifying Digits

32×32 = 1024 pixels
Class: 10 digits (0–9)
Training: 60000 examples
Testing: 60000 examples

Page 59: Learning Deep Learning

Classifying Digits

x ∈ R^1024

t = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)ᵀ   (a one-hot vector)

Page 60: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)

Page 61: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → Linear (V, c) → Softmax → Class NLL (target t)

u = Wx + b

Page 62: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → Softmax → Class NLL (target t)

h = Tanh(u)

Page 63: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → Class NLL (target t)

l = Vh + c

Page 64: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → y → Class NLL (target t)

y = softmax(l)

Page 65: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → y → Class NLL (target t) → L

L = −∑_{k=1}^{10} t_k log y_k = −tᵀ log y

Page 66: Learning Deep Learning

Classifying Digits

∂L/∂y = ∂/∂y ( −tᵀ log y ) = −( t1/y1, ..., t10/y10 )ᵀ

Page 67: Learning Deep Learning

Classifying Digits

Combining ∂L/∂y with the softmax Jacobian ∂y_k/∂l_j = y_k (δ_kj − y_j):

∂L/∂l_j = ∑_{k=1}^{10} (∂y_k/∂l_j) · ∂L/∂y_k = ∑_{k=1}^{10} −t_k (δ_kj − y_j) = y_j − t_j

∂L/∂l = y − t   (using ∑_k t_k = 1)

Page 68: Learning Deep Learning

Classifying Digits

∂L/∂h = Vᵀ · ∂L/∂l

∂L/∂V = (∂L/∂l) · hᵀ,   ∂L/∂c = ∂L/∂l

Page 69: Learning Deep Learning

Classifying Digits

∂L/∂u = (1 + h) ⊙ (1 − h) ⊙ ∂L/∂h   (elementwise; Tanh′(u) = 1 − h²)

Page 70: Learning Deep Learning

Classifying Digits

∂L/∂W = (∂L/∂u) · xᵀ,   ∂L/∂b = ∂L/∂u

∂L/∂x = Wᵀ · ∂L/∂u

Page 71: Learning Deep Learning

Classifying Digits

W_new ← W − α · ∂L/∂W,   b_new ← b − α · ∂L/∂b

V_new ← V − α · ∂L/∂V,   c_new ← c − α · ∂L/∂c
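
All of the above assembled into one gradient-descent step for this two-layer digit classifier; the random input stands in for a real 32×32 image, and every line mirrors a formula from the preceding slides:

    import numpy as np

    rng = np.random.default_rng(0)
    W, b = rng.normal(0, 0.01, (100, 1024)), np.zeros(100)  # hidden layer
    V, c = rng.normal(0, 0.01, (10, 100)), np.zeros(10)     # output layer
    alpha = 0.1

    x = rng.normal(size=1024)   # stand-in for a 32x32 image
    t = np.eye(10)[4]           # one-hot target

    # Forward
    u = W @ x + b
    h = np.tanh(u)
    l = V @ h + c
    y = np.exp(l - l.max()); y /= y.sum()   # softmax
    L = -t @ np.log(y)                      # class NLL

    # Backward
    dl = y - t                              # dL/dl for softmax + NLL
    dV, dc = np.outer(dl, h), dl
    dh = V.T @ dl
    du = (1 - h ** 2) * dh                  # through tanh
    dW, db = np.outer(du, x), du

    # Gradient descent update
    W -= alpha * dW; b -= alpha * db
    V -= alpha * dV; c -= alpha * dc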

Page 72: Learning Deep Learning

Torch7  

Page 73: Learning Deep Learning

Torch7  

Page 74: Learning Deep Learning

Torch7