Learning Deep Learning
M1 Sonse Shimaoka
Jan 06, 2017
Transcript
Page 1: Learning Deep Learning

Learning Deep Learning

M1 Sonse Shimaoka

Page 2: Learning Deep Learning

Neural Network

(From the lecture slides of Nando de Freitas)

Page 3: Learning Deep Learning

Machine Learning

Page 4: Learning Deep Learning

Supervised Learning

Training data:

Input                      Output
[0,0,1,0,1,1,0,0,0,1,1]    1
[1,1,1,0,1,1,1,0,0,1,1]    0
[1,1,1,0,1,1,0,0,0,1,1]    0
[0,0,0,0,1,1,1,0,0,0,0]    1
[1,0,1,0,1,1,0,0,0,0,0]    1
[1,0,1,0,0,0,0,0,0,1,1]    0
[0,0,0,0,1,1,0,1,0,1,1]    1

Test data:

Input                      Output
[1,0,1,0,1,1,0,0,0,1,0]    ?
[1,1,1,1,1,1,1,0,0,1,1]    ?
[1,0,1,0,1,1,0,1,0,1,1]    ?

Generalization

Page 5: Learning Deep Learning

Perceptron

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed a weighted sum ∑ followed by sign, producing y]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

Page 6: Learning Deep Learning

Perceptron

[Diagram: inputs 1, 3, −2 with weights 2, 1, 1.5 and bias 0.5 feed ∑ followed by sign]

y = sign( ∑_{j=1}^{3} w_j x_j + b )

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

Page 7: Learning Deep Learning

Perceptron

(x1, x2, x3) = (1, 3, −2), (w1, w2, w3) = (2, 1, 1.5), b = 0.5

y = sign( ∑_{i=1}^{3} w_i x_i + b ) = sign( w1·x1 + w2·x2 + w3·x3 + b )
  = sign( 1·2 + 3·1 − 2·1.5 + 0.5 ) = sign(2.5) = 1
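
A minimal NumPy sketch of this perceptron, using the numbers from the worked example (the function and variable names are mine, not the slides'):

    import numpy as np

    def perceptron(x, w, b):
        # Weighted sum followed by the sign function.
        return np.sign(w @ x + b)

    x = np.array([1.0, 3.0, -2.0])   # (x1, x2, x3)
    w = np.array([2.0, 1.0, 1.5])    # (w1, w2, w3)
    b = 0.5
    print(perceptron(x, w, b))       # sign(2.5) = 1.0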

Page 8: Learning Deep Learning

Perceptron

[Plot: in the (x1, x2) plane, the perceptron's decision boundary is the line w1·x1 + w2·x2 + b = 0]

Page 9: Learning Deep Learning

Problem with Perceptron

[Plot: the line w1·x1 + w2·x2 + b = 0 separating the (x1, x2) plane, with a query point near the boundary]

What is the probability that this point belongs to the positive class?

The perceptron can't answer this!

Page 10: Learning Deep Learning

Problem with Perceptron

[Plot: two classes of points in the (x1, x2) plane arranged so that no single line separates them]

Impossible to separate linearly!!

Page 11: Learning Deep Learning

Logistic Regression

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and bias b feed ∑ followed by sigmoid, producing y]

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

Page 12: Learning Deep Learning

Logistic Regression

sigmoid(x) = 1 / (1 + exp(−x))

Page 13: Learning Deep Learning

Logistic Regression

[Diagram: inputs 1, 3, −2 with weights 2, 1, 1.5 and bias 0.5 feed ∑ followed by sigmoid]

y = sigmoid( ∑_{j=1}^{3} w_j x_j + b )

1·2 + 3·1 − 2·1.5 + 0.5 = 2.5

sigmoid(2.5) = 0.924

A probability!!
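
The same computation with sigmoid in place of sign; a minimal sketch with illustrative names:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression(x, w, b):
        # Weighted sum followed by the sigmoid, yielding a probability.
        return sigmoid(w @ x + b)

    x = np.array([1.0, 3.0, -2.0])
    w = np.array([2.0, 1.0, 1.5])
    b = 0.5
    print(logistic_regression(x, w, b))  # sigmoid(2.5) ≈ 0.924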

Page 14: Learning Deep Learning

Feature Transformation

[Diagram: points in the original (x1, x2) space mapped by a non-linear transformation Φ to a new space with coordinates φ1(x1, x2), φ2(x1, x2), where they become linearly separable]

But we must still design the transformation...

Page 15: Learning Deep Learning

Feed Forward Neural Network

[Diagram: a neuron, a weighted sum ∑ followed by an activation function f]

Page 16: Learning Deep Learning

Feed Forward Neural Network

[Diagram: inputs x1, x2, x3 feed three neurons (∑, f) producing h1, h2, h3, which feed one neuron (∑, g) producing y1]

Page 17: Learning Deep Learning

Feed Forward Neural Network

[Same diagram: x1, x2, x3 → h1, h2, h3 → y1]

input layer, hidden layer, output layer

Page 18: Learning Deep Learning

Abstraction by Layer

x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y

Page 19: Learning Deep Learning

FFN can learn representations!!

Page 20: Learning Deep Learning

FFN can learn representations!!

Page 21: Learning Deep Learning

Activation Functions

sigmoid(x) = 1 / (1 + exp(−x)) = exp(x) / (exp(x) + 1)

Page 22: Learning Deep Learning

Activation Functions

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Page 23: Learning Deep Learning

Activation Functions

rectifier(x) = max(0, x)

Page 24: Learning Deep Learning

Activation Functions

softmax(x1, ..., xm)_c = exp(x_c) / ∑_{k=1}^{m} exp(x_k)
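
The four activations above, rendered as a short NumPy sketch (the max-shift in softmax is a standard numerical-stability trick, not something from the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def rectifier(x):
        return np.maximum(0.0, x)

    def softmax(x):
        e = np.exp(x - np.max(x))  # subtract the max for numerical stability
        return e / e.sum()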

Page 25: Learning Deep Learning

Loss Functions

• When you want a model to learn to do something, you give it feedback on how well it is doing.

• The function that computes an objective measure of the model's performance is called a loss function.

• A typical loss function takes the model's output and the ground truth and computes a value that quantifies the model's performance.

• The model then corrects itself to reduce the loss.

Page 26: Learning Deep Learning

L2 norm

Task: Regression
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} ‖t_i − y_i‖²₂

Page 27: Learning Deep Learning

Cross Entropy

Task: Binary Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = (1/n) ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

Page 28: Learning Deep Learning

Class Negative Log Likelihood

Task: Multi-Class Classification
Output: (y1, ..., yn)
Target: (t1, ..., tn)
Loss: L = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}

Page 29: Learning Deep Learning

Output activation functions and Loss functions

Task                        Output activation   Loss function
Regression                  Linear              L2 norm
Binary Classification       Sigmoid             Cross Entropy
Multi-Class Classification  Softmax             Class NLL
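
The three losses from the table, sketched in NumPy for batched inputs (rows are examples; the names are mine):

    import numpy as np

    def l2_loss(t, y):
        # L = (1/n) * sum_i ||t_i - y_i||^2
        return np.mean(np.sum((t - y) ** 2, axis=-1))

    def cross_entropy(t, y):
        # L = (1/n) * sum_i [ -t_i log y_i - (1 - t_i) log(1 - y_i) ]
        return np.mean(-t * np.log(y) - (1 - t) * np.log(1 - y))

    def class_nll(t, y):
        # L = -(1/n) * sum_i sum_k t_{i,k} log y_{i,k}
        return -np.mean(np.sum(t * np.log(y), axis=-1))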

Page 30: Learning Deep Learning

Probabilistic Perspective

• We can assume NNs are computing conditional probabilities.

[Diagram: the network x1, x2, x3 → h1, h2, h3 → output, whose value is interpreted as p(t1 | x1, x2, x3)]

Page 31: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = (1 / (√(2π)·σ)) exp( −(t − y)² / (2σ²) ):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −∑_{i=1}^{n} log [ (1 / (√(2π)·σ)) exp( −(t_i − y_i)² / (2σ²) ) ]
    = (1 / (2σ²)) ∑_{i=1}^{n} (t_i − y_i)² + n·log(√(2π)·σ)

→ L2 norm (up to a constant)

Page 32: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = yᵗ (1 − y)^(1−t):

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
    = ∑_{i=1}^{n} [ −t_i log y_i − (1 − t_i) log(1 − y_i) ]

→ Cross Entropy

Page 33: Learning Deep Learning

Probabilistic Perspective

• When p(t | x) = ∏_{k=1}^{m} y_k^{t_k}:

NLL = −log ∏_{i=1}^{n} p(t_i | x_i)
    = −log ∏_{i=1}^{n} ∏_{k=1}^{m} y_{i,k}^{t_{i,k}}
    = −∑_{i=1}^{n} ∑_{k=1}^{m} t_{i,k} log y_{i,k}

→ Class Negative Log Likelihood

Page 34: Learning Deep Learning

Gradient Descent

• Gradient
• Gradient Descent

Page 35: Learning Deep Learning

Gradient Descent

Function to be minimized: L(w)
Initial point: w_init
Learning rate: α
Update rule: w_new ← w_old − α · ∂L/∂w |_{w = w_old}
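
The update rule as a few lines of Python; the quadratic test function is mine, chosen so the minimum is easy to check:

    def gradient_descent(grad_L, w_init, alpha=0.1, steps=100):
        # Iterate w <- w - alpha * dL/dw from the initial point.
        w = w_init
        for _ in range(steps):
            w = w - alpha * grad_L(w)
        return w

    # Minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
    print(gradient_descent(lambda w: 2 * (w - 3.0), w_init=0.0))  # ≈ 3.0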

Page 36: Learning Deep Learning

Gradient Descent

[Plots: gradient descent with a big learning rate vs. a small learning rate]

Page 37: Learning Deep Learning

Loss function for Logistic regression

L(w,b;D) = log ∏_{i=1}^{n} y_i^{t_i} (1 − y_i)^{1−t_i}
         = ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

y_i = 1 / (1 + exp(−wᵀx_i − b))

(This L is the log-likelihood of the data, so it is maximized; minimizing −L is the same as minimizing the cross entropy.)

Page 38: Learning Deep Learning

Gradient with respect to w

∂L(w,b;D)/∂w
 = ∂/∂w ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
 = ∑_{i=1}^{n} ∂/∂w ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ∂/∂y_i ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ( t_i/y_i − (1 − t_i)/(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂w) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} x_i y_i (1 − y_i) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} x_i (t_i − y_i)

∵ ∂y_i/∂w = ∂/∂w [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = − [ ∂/∂w (1 + exp(−wᵀx_i − b)) ] / (1 + exp(−wᵀx_i − b))²
          = x_i exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = x_i y_i (1 − y_i)

Page 39: Learning Deep Learning

Gradient with respect to b

∂L(w,b;D)/∂b
 = ∂/∂b ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]
 = ∑_{i=1}^{n} (∂y_i/∂b) · ∂/∂y_i ( t_i log y_i + (1 − t_i) log(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂b) · ( t_i/y_i − (1 − t_i)/(1 − y_i) )
 = ∑_{i=1}^{n} (∂y_i/∂b) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} y_i (1 − y_i) · ( (t_i − y_i) / (y_i (1 − y_i)) )
 = ∑_{i=1}^{n} (t_i − y_i)

∵ ∂y_i/∂b = ∂/∂b [ 1 / (1 + exp(−wᵀx_i − b)) ]
          = − [ ∂/∂b (1 + exp(−wᵀx_i − b)) ] / (1 + exp(−wᵀx_i − b))²
          = exp(−wᵀx_i − b) / (1 + exp(−wᵀx_i − b))²
          = y_i (1 − y_i)

Page 40: Learning Deep Learning

Gradient Descent for Logistic Regression

Function to be maximized (the log-likelihood):

L(w,b;D) = ∑_{i=1}^{n} [ t_i log y_i + (1 − t_i) log(1 − y_i) ]

Update rules (gradient ascent on L, i.e. descent on −L):

w_new ← w_old + α ∑_{i=1}^{n} x_i (t_i − y_i)

b_new ← b_old + α ∑_{i=1}^{n} (t_i − y_i)
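
Putting the derived gradients into a training loop; a sketch on toy data (all names and the data are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, t, alpha=0.1, steps=1000):
        # Gradient ascent on the log-likelihood, using
        # dL/dw = sum_i x_i (t_i - y_i) and dL/db = sum_i (t_i - y_i).
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(steps):
            y = sigmoid(X @ w + b)         # predictions for all n examples
            w = w + alpha * X.T @ (t - y)
            b = b + alpha * np.sum(t - y)
        return w, b

    # Toy data: the label is 1 when the first feature is positive.
    X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.0, 1.0]])
    t = np.array([1.0, 1.0, 0.0, 0.0])
    w, b = train_logistic(X, t)
    print(np.round(sigmoid(X @ w + b)))    # ≈ [1, 1, 0, 0]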

Page 41: Learning Deep Learning

Exercise: Gradient Descent for Linear Regression

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²

y_i = wᵀx_i + b

Page 42: Learning Deep Learning

Answer

Function to be minimized:

L(w,b;D) = ∑_{i=1}^{n} (t_i − y_i)²

Gradients: ∂L/∂w = −2 ∑_{i=1}^{n} x_i (t_i − y_i),  ∂L/∂b = −2 ∑_{i=1}^{n} (t_i − y_i)

Update rules:

w_new ← w_old + 2α ∑_{i=1}^{n} x_i (t_i − y_i)

b_new ← b_old + 2α ∑_{i=1}^{n} (t_i − y_i)
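
One way to check the answer is to compare the analytic gradient against a finite-difference estimate; a quick sketch on random data:

    import numpy as np

    def loss(w, b, X, t):
        return np.sum((t - (X @ w + b)) ** 2)

    def grad_w(w, b, X, t):
        # dL/dw = -2 * sum_i x_i (t_i - y_i)
        return -2 * X.T @ (t - (X @ w + b))

    rng = np.random.default_rng(0)
    X, t = rng.normal(size=(5, 3)), rng.normal(size=5)
    w, b = rng.normal(size=3), 0.2

    eps = 1e-6
    numeric = np.array([(loss(w + eps * np.eye(3)[j], b, X, t)
                         - loss(w - eps * np.eye(3)[j], b, X, t)) / (2 * eps)
                        for j in range(3)])
    print(np.allclose(numeric, grad_w(w, b, X, t), atol=1e-4))  # True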

Page 43: Learning Deep Learning

Backpropagation

How do we compute ∂L/∂W and ∂L/∂V?

[Diagram: inputs x1, x2, x3 → weights W (e.g. W2,3) → pre-activations u1, u2, u3 → f → h1, h2, h3 → weights V (e.g. V3,1) → pre-activations l1, l2 → g → outputs y1, y2 → loss L]

Page 44: Learning Deep Learning

Backpropagation

Use the Chain Rule!!!

∂/∂x [ q(s(x)) ] = (∂s(x)/∂x) · (∂q(s(x))/∂s(x))

[Diagram: x → s → s(x) → q → q(s(x)); the gradient ∂q(s(x))/∂s(x) flows backward through s, producing (∂s(x)/∂x) · (∂q(s(x))/∂s(x))]
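
The chain rule, checked numerically on a tiny example (the functions are mine, chosen for illustration):

    import math

    # d/dx q(s(x)) = s'(x) * q'(s(x))
    s, ds = math.sin, math.cos
    q, dq = (lambda z: z ** 2), (lambda z: 2 * z)

    x = 0.7
    analytic = ds(x) * dq(s(x))
    numeric = (q(s(x + 1e-6)) - q(s(x - 1e-6))) / 2e-6
    print(abs(analytic - numeric) < 1e-6)  # True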

Page 45: Learning Deep Learning

Backpropagation

Start from the output layer: ∂L/∂y1

[Network diagram as before, with ∂L/∂y1 marked at the output]

Page 46: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂l1 = (∂y1/∂l1) · ∂L/∂y1 = g′(l1) · ∂L/∂y1

[Network diagram as before, with ∂L/∂y1 and ∂L/∂l1 marked]

Page 47: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂V3,1 = (∂l1/∂V3,1) · ∂L/∂l1 = h3 · ∂L/∂l1

[Network diagram as before, with ∂L/∂V3,1 now marked as well]

Page 48: Learning Deep Learning

Backpropagation

Apply Chain Rule (h3 feeds both outputs, so the two paths add):

∂L/∂h3 = (∂l1/∂h3) · ∂L/∂l1 + (∂l2/∂h3) · ∂L/∂l2 = V3,1 · ∂L/∂l1 + V3,2 · ∂L/∂l2

[Network diagram as before, with ∂L/∂h3 now marked as well]

Page 49: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂u3 = (∂h3/∂u3) · ∂L/∂h3 = f′(u3) · ∂L/∂h3

[Network diagram as before, with ∂L/∂u3 now marked as well]

Page 50: Learning Deep Learning

Backpropagation

Apply Chain Rule:

∂L/∂W2,3 = (∂u3/∂W2,3) · ∂L/∂u3 = x2 · ∂L/∂u3

[Network diagram as before, with ∂L/∂W2,3 now marked as well]

Page 51: Learning Deep Learning

Abstraction by Layer

Forward: x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t)

Backward: ∂L/∂x ← ∂L/∂(Wx) ← ∂L/∂h ← ∂L/∂(Vh) ← ∂L/∂y, with parameter gradients ∂L/∂W and ∂L/∂V at the linear layers

Page 52: Learning Deep Learning

Abstraction by Layer

[Diagram: a Layer box; input flows in and output flows out; ∂loss/∂output flows back in and ∂loss/∂input flows back out]

Page 53: Learning Deep Learning

Abstraction by Layer

Forward Computation:

output = Layer.forward(input)

Page 54: Learning Deep Learning

Abstraction by Layer

Backward Computation:

∂loss/∂input = Layer.backward(input, ∂loss/∂output)
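
This forward/backward interface maps directly onto a small class per layer. A Python sketch of two such layers in the slides' style (Torch7's actual module API uses different method names):

    import numpy as np

    class Linear:
        # forward(input) -> output; backward(input, dL/doutput) -> dL/dinput
        def __init__(self, n_in, n_out):
            self.W = np.random.randn(n_out, n_in) * 0.01
            self.b = np.zeros(n_out)

        def forward(self, x):
            return self.W @ x + self.b

        def backward(self, x, grad_output):
            # Store parameter gradients, return the gradient w.r.t. the input.
            self.grad_W = np.outer(grad_output, x)  # dL/dW = (dL/du) x^T
            self.grad_b = grad_output               # dL/db = dL/du
            return self.W.T @ grad_output           # dL/dx = W^T dL/du

    class Tanh:
        def forward(self, x):
            return np.tanh(x)

        def backward(self, x, grad_output):
            h = np.tanh(x)
            return (1 - h ** 2) * grad_output       # tanh'(u) = 1 - h^2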

Page 55: Learning Deep Learning

Backpropagation

① Execute the forward computation

x → Linear (W) → Wx → f → h → Linear (V) → Vh → g → y → L (with target t)

Page 56: Learning Deep Learning

Backpropagation

② Compute the derivative of the loss function with respect to the output: ∂L/∂y

Page 57: Learning Deep Learning

Backpropagation

③ Starting from the final layer, backpropagate derivatives through the layers: ∂L/∂y → ∂L/∂(Vh) → ...

Page 58: Learning Deep Learning

Classifying Digits

32×32 = 1024 pixels
Class: 10 digits (0–9)
Training: 60000 examples
Testing: 60000 examples

Page 59: Learning Deep Learning

Classifying Digits

x ∈ R^1024

t = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)ᵀ   (a one-hot vector)

Page 60: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → Tanh → Linear (V, c) → Softmax → Class NLL (target t)

Page 61: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → Linear (V, c) → Softmax → Class NLL (target t)

u = Wx + b

Page 62: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → Softmax → Class NLL (target t)

h = Tanh(u)

Page 63: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → Class NLL (target t)

l = Vh + c

Page 64: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → y → Class NLL (target t)

y = softmax(l)

Page 65: Learning Deep Learning

Classifying Digits

x → Linear (W, b) → u → Tanh → h → Linear (V, c) → l → Softmax → y → Class NLL (target t) → L

L = −∑_{k=1}^{10} t_k log y_k = −tᵀ log y

Page 66: Learning Deep Learning

Classifying Digits

∂L/∂y = ∂/∂y ( −tᵀ log y ) = −( t1/y1, ..., t10/y10 )ᵀ

Page 67: Learning Deep Learning

Classifying Digits

Combining ∂L/∂y with the softmax Jacobian ∂y_k/∂l_j = y_k (δ_kj − y_j):

∂L/∂l_j = ∑_{k=1}^{10} (∂y_k/∂l_j) · ∂L/∂y_k = ∑_{k=1}^{10} −t_k (δ_kj − y_j) = y_j − t_j

∂L/∂l = y − t   (using ∑_k t_k = 1)

Page 68: Learning Deep Learning

Classifying Digits

∂L/∂h = Vᵀ · ∂L/∂l

∂L/∂V = (∂L/∂l) · hᵀ,   ∂L/∂c = ∂L/∂l

Page 69: Learning Deep Learning

Classifying Digits

∂L/∂u = (1 + h) ⊙ (1 − h) ⊙ ∂L/∂h   (elementwise; Tanh′(u) = 1 − h²)

Page 70: Learning Deep Learning

Classifying Digits

∂L/∂W = (∂L/∂u) · xᵀ,   ∂L/∂b = ∂L/∂u

∂L/∂x = Wᵀ · ∂L/∂u

Page 71: Learning Deep Learning

Classifying Digits

W_new ← W − α · ∂L/∂W,   b_new ← b − α · ∂L/∂b

V_new ← V − α · ∂L/∂V,   c_new ← c − α · ∂L/∂c
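
All of the above assembled into one gradient-descent step for this two-layer digit classifier; the random input stands in for a real 32×32 image, and every line mirrors a formula from the preceding slides:

    import numpy as np

    rng = np.random.default_rng(0)
    W, b = rng.normal(0, 0.01, (100, 1024)), np.zeros(100)  # hidden layer
    V, c = rng.normal(0, 0.01, (10, 100)), np.zeros(10)     # output layer
    alpha = 0.1

    x = rng.normal(size=1024)   # stand-in for a 32x32 image
    t = np.eye(10)[4]           # one-hot target

    # Forward
    u = W @ x + b
    h = np.tanh(u)
    l = V @ h + c
    y = np.exp(l - l.max()); y /= y.sum()   # softmax
    L = -t @ np.log(y)                      # class NLL

    # Backward
    dl = y - t                              # dL/dl for softmax + NLL
    dV, dc = np.outer(dl, h), dl
    dh = V.T @ dl
    du = (1 - h ** 2) * dh                  # through tanh
    dW, db = np.outer(du, x), du

    # Gradient descent update
    W -= alpha * dW; b -= alpha * db
    V -= alpha * dV; c -= alpha * dc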

Page 72: Learning Deep Learning

Torch7  

Page 73: Learning Deep Learning

Torch7  

Page 74: Learning Deep Learning

Torch7