Lecture 4: Backpropagation and Automatic Differentiation
CSE599W: Spring 2018
Transcript
Page 1:

Lecture 4: Backpropagation and Automatic Differentiation

CSE599W: Spring 2018

Page 2:

Announcement

• Assignment 1 is out today, due in 2 weeks (Apr 19th, 5pm)

Page 3:

Model Training Overview

[Figure: training pipeline: layer1 extractor → layer2 extractor → predictor → objective → training]

Page 4:

Symbolic Differentiation

• Input formula is a symbolic expression tree (computation graph).
• Implement differentiation rules, e.g., sum rule, product rule, chain rule:

$\frac{\partial (f + g)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}$

$\frac{\partial (fg)}{\partial x} = \frac{\partial f}{\partial x} g + f \frac{\partial g}{\partial x}$

$\frac{\partial h(x)}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$, where $h(x) = f(g(x))$

✘ For complicated functions, the resultant expression can be exponentially large.
✘ Wasteful to keep around intermediate symbolic expressions if we only need a numeric value of the gradient in the end.
✘ Prone to error.
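For a concrete feel (an illustration added here, not part of the slides), the sympy library behaves exactly this way, walking the expression tree and applying these rules; the only assumption is that sympy is available.

    # Sketch of symbolic differentiation with sympy (assumed available).
    import sympy as sp

    x = sp.symbols('x')
    f = sp.exp(sp.sin(x**2)) * (x + 1)

    df = sp.diff(f, x)            # the derivative is itself a (larger) symbolic expression
    print(df)
    print(df.subs(x, 0.5).evalf())  # often we only need this one number, not the whole expression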

Page 5:

Numerical Differentiation

• We can approximate the gradient using

$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta)}{h}$

where $e_i$ is the unit vector along the $i$-th coordinate and $h$ is a small step.

[Figure: worked example on $f(w, x) = w \cdot x$ with small numeric inputs]

Page 6:

Numerical Differentiation

[Figure: the same example $f(w, x) = w \cdot x$, with one input component perturbed by $\varepsilon$; the perturbed and unperturbed outputs are compared to estimate one partial derivative]

Page 7:

Numerical Differentiation

• We can approximate the gradient using

$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta)}{h}$

• Reduce the truncation error by using center difference:

$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta - h e_i)}{2h}$

✘ Bad: rounding error, and slow to compute
✓ A powerful tool to check the correctness of an implementation; usually use h = 1e-6.
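A minimal gradient-check helper (added here as an illustration, assuming numpy; numeric_grad and the quadratic test function are made-up names) makes the center-difference recipe concrete.

    # Minimal gradient-checking sketch with numpy (an illustration, not the course code).
    import numpy as np

    def numeric_grad(f, theta, h=1e-6):
        """Center-difference estimate of df/dtheta_i for every component i."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = 1.0
            grad[i] = (f(theta + h * e) - f(theta - h * e)) / (2 * h)
        return grad

    # Example: f(theta) = sum(theta**2), whose analytic gradient is 2*theta.
    f = lambda t: np.sum(t ** 2)
    theta = np.array([0.3, -0.8, 0.5])
    print(numeric_grad(f, theta))   # close to [0.6, -1.6, 1.0]
    print(2 * theta)                # analytic gradient for comparison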

Page 8:

Backpropagation

[Figure: an operator $f$ takes inputs $x$ and $y$ and produces $z = f(x, y)$]

Page 9:

Backpropagation

[Figure: operator $f$ with inputs $x, y$ and output $z = f(x, y)$; the adjoint $\frac{\partial J}{\partial z}$ flows in from downstream]

$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial x}, \qquad \frac{\partial J}{\partial y} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial y}$

Gradient computation becomes a local computation.
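As a tiny illustration (added here; the function names are made up), the local rule for a multiply operator can be written as two small functions that only see the node's own inputs and the incoming adjoint.

    # Sketch of the local rule for z = x * y: given the upstream adjoint dJ/dz,
    # each input's gradient needs only values stored at this node.
    def mul_forward(x, y):
        return x * y

    def mul_backward(x, y, dJ_dz):
        dJ_dx = dJ_dz * y     # dz/dx = y
        dJ_dy = dJ_dz * x     # dz/dy = x
        return dJ_dx, dJ_dy

    z = mul_forward(3.0, -2.0)            # forward pass stores x and y
    print(mul_backward(3.0, -2.0, 1.0))   # (-2.0, 3.0)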

Page 10:

Backpropagation simple example

$y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_3)}}$

Page 11:

Backpropagation simple example

$y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_3)}}$

[Figure: computation graph built from the nodes $*\,(w_1, x_1)$, $*\,(w_2, x_2)$, $+$, $+\,(w_3)$, $*(-1)$, $\exp$, $+1$, and $1/x$]

Page 12:

Backpropagation simple example

[Figure: forward pass with inputs $w_1 = 1.0$, $x_1 = 3.0$, $w_2 = -2.0$, $x_2 = 2.0$, $w_3 = 2.0$; intermediate values $w_1 x_1 = 3.0$, $w_2 x_2 = -4.0$, their sum $-1.0$, plus $w_3$ gives $1.0$, negated to $-1.0$, $e^{-1} = 0.37$, $1 + 0.37 = 1.37$, and output $y = 1/1.37 = 0.73$]

Page 13:

Backpropagation simple example

[Figure: same forward pass; the backward pass starts by seeding the output adjoint $\frac{\partial y}{\partial y} = 1.0$]

Page 14:

Backpropagation simple example

[Figure: backward through the $1/x$ node, giving adjoint $-0.53$ at its input]

$f(x) = 1/x \;\Rightarrow\; \frac{\partial f}{\partial x} = -\frac{1}{x^2}, \qquad \frac{\partial J}{\partial x} = \frac{\partial J}{\partial f}\frac{\partial f}{\partial x} = -\frac{1}{x^2}$

Page 15:

Backpropagation simple example

[Figure: backward through the $+1$ node; the adjoint $-0.53$ passes through unchanged]

$f(x) = x + 1 \;\Rightarrow\; \frac{\partial f}{\partial x} = 1$

Page 16:

Backpropagation simple example

[Figure: backward through the $\exp$ node, giving adjoint $-0.53 \cdot 0.37 \approx -0.20$ at its input]

$f(x) = e^{x} \;\Rightarrow\; \frac{\partial f}{\partial x} = e^{x}, \qquad \frac{\partial J}{\partial x} = \frac{\partial J}{\partial f}\frac{\partial f}{\partial x} = \frac{\partial J}{\partial f} \cdot e^{x}$

Page 17:

Backpropagation simple example

[Figure: backward through the $*(-1)$ and $+$ nodes; the adjoint $0.20$ reaches $w_3$ and both multiplication nodes, giving $\frac{\partial y}{\partial w_1} = 0.60$ and $\frac{\partial y}{\partial x_1} = 0.20$]

$f(x, y) = xy \;\Rightarrow\; \frac{\partial f}{\partial x} = y, \qquad \frac{\partial f}{\partial y} = x$

Page 18:

Backpropagation simple example

[Figure: complete backward pass; the gradients at the inputs are $\frac{\partial y}{\partial w_1} = 0.60$, $\frac{\partial y}{\partial x_1} = 0.20$, $\frac{\partial y}{\partial w_2} = 0.40$, $\frac{\partial y}{\partial x_2} = -0.40$, $\frac{\partial y}{\partial w_3} = 0.20$]
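A small numpy sketch (added here, not part of the slides) can replay the whole example; the assignment of the figure's values to inputs (w1 = 1.0, x1 = 3.0, w2 = -2.0, x2 = 2.0, w3 = 2.0) is read off the diagram, and the printed gradients match 0.60, 0.20, 0.40, -0.40, 0.20 up to rounding.

    # Numeric replay of the worked example (a sketch with numpy).
    import numpy as np

    w1, x1, w2, x2, w3 = 1.0, 3.0, -2.0, 2.0, 2.0

    # Forward pass, one primitive op per step (mirrors the graph).
    a = w1 * x1          #  3.0
    b = w2 * x2          # -4.0
    c = a + b            # -1.0
    s = c + w3           #  1.0
    t = -s               # -1.0
    e = np.exp(t)        #  0.37
    u = 1.0 + e          #  1.37
    y = 1.0 / u          #  0.73

    # Backward pass: multiply each local derivative by the upstream adjoint.
    dy = 1.0
    du = dy * (-1.0 / u**2)        # -0.53
    de = du * 1.0                  # -0.53
    dt = de * e                    # -0.20
    ds = dt * (-1.0)               #  0.20
    dc = ds; dw3 = ds              #  0.20 each
    da = dc; db = dc               #  0.20 each
    dw1, dx1 = da * x1, da * w1    #  0.60, 0.20
    dw2, dx2 = db * x2, db * w2    #  0.40, -0.40

    print(dw1, dx1, dw2, dx2, dw3)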

Page 19:

Any problem? Can we do better?

Page 20:

Problems of backpropagation

• You always need to keep the intermediate data in memory during the forward pass, in case it is needed during backpropagation.

• Lack of flexibility, e.g., computing the gradient of a gradient.
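For example (an added sketch using sympy, assumed available, as a stand-in for a gradient graph): once the gradient is represented as an expression of its own, differentiating it again requires no extra machinery, which is exactly what the autodiff view on the next slides provides.

    # Gradient-of-gradient is easy once the gradient is itself an expression/graph.
    import sympy as sp

    x = sp.symbols('x')
    y = 1 / (1 + sp.exp(-x))     # sigmoid, as in the earlier example
    dy = sp.diff(y, x)           # the gradient is again an expression we can manipulate
    d2y = sp.diff(dy, x)         # ...so the second derivative needs no special handling
    print(sp.simplify(d2y))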

Page 21:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

Page 22:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: the forward computation graph for $y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_3)}}$]

Page 23:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: a new node computing $-\frac{1}{x^2}$ is added to the graph for the $1/x$ op]

$f(x) = 1/x \;\Rightarrow\; \frac{\partial f}{\partial x} = -\frac{1}{x^2}$

Page 24:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: a node multiplying by $1$ is added for the $+1$ op]

$f(x) = x + 1 \;\Rightarrow\; \frac{\partial f}{\partial x} = 1$

Page 25:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: a node multiplying by the $\exp$ output is added for the $\exp$ op]

$f(x) = e^{x} \;\Rightarrow\; \frac{\partial f}{\partial x} = e^{x}$

Page 26:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: gradient nodes for the remaining multiplication ops are added to the graph]

$f(x, y) = xy \;\Rightarrow\; \frac{\partial f}{\partial x} = y, \qquad \frac{\partial f}{\partial y} = x$

Page 27:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

[Figure: the completed graph now contains an explicit node for every gradient, alongside the original forward nodes]

Page 28:

AutoDiff Algorithm

[Figure: forward graph: W, x → matmult → softmax → log → mul (with y_) → mean → cross_entropy; the gradient graph contains log-grad, softmax-grad, mul 1/batch_size, and matmult-transpose nodes producing W_grad]
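As an illustration (added here, not the course code), the same graph can be evaluated with plain numpy; the batch/feature/class sizes and the exact loss convention (mean over the batch of one-hot cross-entropy) are assumptions.

    # numpy sketch of the softmax cross-entropy graph and its W gradient.
    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    batch, d, k = 4, 5, 3
    rng = np.random.default_rng(0)
    x = rng.normal(size=(batch, d))
    W = rng.normal(size=(d, k))
    y_ = np.eye(k)[rng.integers(k, size=batch)]     # one-hot labels

    # Forward: matmult -> softmax -> log -> mul(y_) -> mean -> cross_entropy
    p = softmax(x @ W)
    loss = -np.mean(np.sum(y_ * np.log(p), axis=1))

    # Backward, following the gradient nodes in the figure:
    # mul by 1/batch_size, softmax/log grads, then matmult-transpose -> W_grad
    W_grad = x.T @ (p - y_) / batch
    print(loss, W_grad.shape)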

Page 29:

AutoDiff Algorithm

def gradient(out):
    node_to_grad[out] = 1
    nodes = get_node_list(out)
    for node in reverse_topo_order(nodes):
        grad ← sum partial adjoints from output edges
        input_grads ← node.op.gradient(input, grad) for input in node.inputs
        add input_grads to node_to_grad
    return node_to_grad

[Figure: example graph with input $x_1$ and nodes $x_2 = \exp(x_1)$, $x_3 = x_2 + 1$, $x_4 = x_2 \times x_3$, built from the ops exp, $+$, $\times$]
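The following is a toy, numeric sketch of this procedure (the Node/Op classes and function names are made up, not the assignment's actual API); in the lecture's autodiff the adjoints would themselves be new graph nodes rather than numbers, so this only illustrates the traversal and bookkeeping.

    # Toy reverse-mode autodiff following the pseudocode above.
    import math

    class Node:
        """A graph node; leaves have op=None and hold a value directly."""
        def __init__(self, op=None, inputs=(), value=None):
            self.op, self.inputs, self.value = op, list(inputs), value

    class AddOp:
        def compute(self, a, b): return a + b
        def gradient(self, node, grad):          # d(a+b)/da = d(a+b)/db = 1
            return [grad, grad]

    class MulOp:
        def compute(self, a, b): return a * b
        def gradient(self, node, grad):          # d(ab)/da = b, d(ab)/db = a
            a, b = node.inputs
            return [grad * b.value, grad * a.value]

    class ExpOp:
        def compute(self, a): return math.exp(a)
        def gradient(self, node, grad):          # d(exp(a))/da = exp(a) = node.value
            return [grad * node.value]

    def apply_op(op, *inputs):
        """Build a node and run its forward computation."""
        n = Node(op, inputs)
        n.value = op.compute(*(i.value for i in inputs))
        return n

    def reverse_topo_order(out):
        order, seen = [], set()
        def visit(n):
            if id(n) in seen:
                return
            seen.add(id(n))
            for i in n.inputs:
                visit(i)
            order.append(n)
        visit(out)
        return reversed(order)

    def gradient(out):
        partials = {id(out): [1.0]}   # node_to_grad: partial adjoints per node
        grads = {}                    # final (summed) adjoint per node
        for node in reverse_topo_order(out):
            grad = sum(partials.get(id(node), []))   # sum partial adjoints from output edges
            grads[id(node)] = grad
            if node.op is not None:
                for inp, g in zip(node.inputs, node.op.gradient(node, grad)):
                    partials.setdefault(id(inp), []).append(g)
        return grads

    # The example graph from the slide: x2 = exp(x1), x3 = x2 + 1, x4 = x2 * x3.
    x1 = Node(value=2.0)
    x2 = apply_op(ExpOp(), x1)
    x3 = apply_op(AddOp(), x2, Node(value=1.0))
    x4 = apply_op(MulOp(), x2, x3)
    print(gradient(x4)[id(x1)])       # dx4/dx1 = x2 * (2*x2 + 1) evaluated at x1 = 2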

Page 30:

AutoDiff Algorithm

[Figure: same graph and pseudocode; the output adjoint is seeded, node_to_grad: $\{x_4 : \bar{x}_4 = 1\}$]

Page 31:

AutoDiff Algorithm

[Figure: nodes are visited in reverse topological order $x_4, x_3, x_2, x_1$; node_to_grad: $\{x_4 : \bar{x}_4\}$]

Page 32:

AutoDiff Algorithm

[Figure: processing $x_4 = x_2 \times x_3$ adds the gradient nodes $\bar{x}_2' = \bar{x}_4 \times x_3$ and $\bar{x}_3 = \bar{x}_4 \times x_2$]

Page 33:

AutoDiff Algorithm

[Figure: node_to_grad: $\{x_4 : \bar{x}_4,\; x_3 : \bar{x}_3,\; x_2 : \bar{x}_2'\}$]

Page 34:

AutoDiff Algorithm

[Figure: next, the node $x_3 = x_2 + 1$ is processed with its adjoint $\bar{x}_3$]

Page 35:

AutoDiff Algorithm

[Figure: processing $x_3$ adds $\bar{x}_2'' = \mathrm{id}(\bar{x}_3)$, since $\partial x_3 / \partial x_2 = 1$]

Page 36:

AutoDiff Algorithm

[Figure: node_to_grad: $\{x_4 : \bar{x}_4,\; x_3 : \bar{x}_3,\; x_2 : \bar{x}_2', \bar{x}_2''\}$]

Page 37:

AutoDiff Algorithm

[Figure: processing $x_2$ first sums its partial adjoints with a $+$ node: $\bar{x}_2 = \bar{x}_2' + \bar{x}_2''$]

Page 38:

AutoDiff Algorithm

[Figure: processing $x_2 = \exp(x_1)$ adds $\bar{x}_1 = \bar{x}_2 \times x_2$, since $\partial x_2 / \partial x_1 = \exp(x_1) = x_2$]

Page 39:

AutoDiff Algorithm

[Figure: node_to_grad: $\{x_4 : \bar{x}_4,\; x_3 : \bar{x}_3,\; x_2 : \bar{x}_2', \bar{x}_2'',\; x_1 : \bar{x}_1\}$]

Page 40:

AutoDiff Algorithm

[Figure: the completed gradient graph; evaluating the $\bar{x}_1$ node yields $\partial x_4 / \partial x_1$]

Page 41:

Backpropagation vs AutoDiff

[Figure: left (backpropagation), adjoint values are propagated backward through the original graph $x_1 \to \exp \to x_2$, $x_3 = x_2 + 1$, $x_4 = x_2 \times x_3$; right (autodiff), the graph is extended with explicit gradient nodes ($\times$, id, $+$, $\times$) producing $\bar{x}_4, \bar{x}_3, \bar{x}_2', \bar{x}_2'', \bar{x}_2, \bar{x}_1$]

Page 42:

Recap

• Numerical differentiation
  • Tool to check the correctness of an implementation
• Backpropagation
  • Easy to understand and implement
  • Bad for memory use and schedule optimization
• Automatic differentiation
  • Generates the gradient computation for the entire computation graph
  • Better for system optimization

Page 43:

References

• Automatic differentiation in machine learning: a survey. https://arxiv.org/abs/1502.05767
• CS231n backpropagation: http://cs231n.github.io/optimization-2/