Lecture 9: Memory Optimization - University of Washington
dlsys.cs.washington.edu/pdf/lecture9.pdf

Transcript

Page 1

Lecture 9: Memory Optimization

CSE599W: Spring 2018

Page 2

Where are we

[Figure: the deep learning system stack, from High-level Packages and the Programming API (User API), through Gradient Calculation (Differentiation API), Computational Graph Optimization and Execution, and Runtime Parallel Scheduling (System Components), down to GPU Kernels, Optimizing Device Code, and Accelerators and Hardware (Architecture)]


Page 4

Recap: Computation Graph

[Figure: computation graph for softmax-classifier training. Forward: W, x flow through matmult, softmax, log; combined with y_ via mul and mean into cross_entropy. Backward: log-grad, softmax-grad, mul by 1/batch_size, matmult-transpose produce W_grad. Update: W_grad and learning_rate flow through mul and sub into an assign to W. Outputs: y and cross_entropy]

Page 5

Recap: Automatic Differentiation

[Figure: two approaches shown side by side. Backprop in Graph: gradients are computed by traversing the forward graph (W, x through matmult, softmax, log; mul with y_, then mean into cross_entropy) backwards. Autodiff by Extending the Graph (assignment 1): gradient ops (log-grad, softmax-grad, mul by 1/batch_size, matmult-transpose) are added as explicit nodes of the same graph]

Page 6

Questions for this Lecture

[Figure: the graph extended with gradient nodes, as on the previous slide]

Why do we need automatic differentiation that extends the graph, instead of doing backprop in the graph?

Page 7

Memory Problem of Deep Nets

[Figure: LeNet next to the much deeper Inception network]

Deep nets are becoming deeper

Page 8

State-of-the-Art Models Can Be Resource-Bound

● Examples of recent state-of-the-art neural nets:
  ○ Convnets: ResNet-1k on CIFAR-10, ResNet-200 on ImageNet
  ○ Recurrent models: LSTMs on long sequences, e.g. speech

● The maximum size of the model we can try is bounded by the total RAM available on a Titan X card (12 GB)

We need to be frugal

Page 9

Q: How to build an Executor for a Given Graph

[Figure: graph with inputs a, b flowing through mul, then add-const (constant 3), then exp]

Computational Graph for exp(a * b + 3)
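To keep the following slides concrete, here is one hypothetical way this graph could be represented in Python. Node and its op/inputs fields are assumptions made for these sketches, not the course assignment's actual API.

```python
import math
from collections import namedtuple

# Hypothetical minimal graph structure for the sketches in this lecture.
# `op` is a callable over input values; `inputs` is a tuple of upstream
# nodes (a tuple so nodes stay hashable and usable as dict keys).
Node = namedtuple("Node", ["name", "op", "inputs"])

a = Node("a", None, ())
b = Node("b", None, ())
mul = Node("mul", lambda x, y: x * y, (a, b))
add3 = Node("add-const", lambda x: x + 3, (mul,))
out = Node("exp", math.exp, (add3,))   # exp(a * b + 3)
```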

Page 10

Build an Executor for a Given Graph

1. Allocate temp memory for intermediate computation

Same color represents the same piece of memory

[Figure: the exp(a * b + 3) graph with a = 4, b = 8; each intermediate result is assigned its own buffer (color)]

Page 11

Build an Executor for a Given Graph

[Figure: executing exp(a * b + 3) with a = 4, b = 8; mul produces 32, add-const produces 35, exp produces exp(35)]

1. Allocate temp memory for intermediate computation

2. Traverse and execute the graph in topological order.

Page 12

Build an Executor for a Given Graph

[Figure: same execution as on the previous slide]

Temporary space is linear in the number of ops
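A minimal sketch of such an executor, reusing the hypothetical Node representation from the earlier slide: topo-sort the graph, then compute one fresh result per op, which is exactly why temp space grows linearly with the number of ops.

```python
def topo_sort(root):
    """Post-order DFS: every node appears after all of its inputs."""
    order, visited = [], set()
    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for i in node.inputs:
            visit(i)
        order.append(node)
    visit(root)
    return order

def naive_execute(root, feed):
    """Steps 1 + 2: allocate a fresh result per op, run in topo order."""
    values = dict(feed)                    # Node -> computed value
    for node in topo_sort(root):
        if node not in values:             # skip fed-in placeholders
            args = [values[i] for i in node.inputs]
            values[node] = node.op(*args)  # a new buffer for every result
    return values[root]

print(naive_execute(out, {a: 4, b: 8}))    # exp(4 * 8 + 3) = exp(35)
```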

Page 13

Dynamic Memory Allocation

[Figure: executing exp(a * b + 3) against a memory pool, with a = 4, b = 8; mul's output (32) takes a buffer from the pool]

1. Allocate when needed

2. Recycle a buffer when it is no longer needed.

3. Useful for both declarative and imperative executions

Page 14

Dynamic Memory Allocation (continued)

[Figure: add-const produces 35; the buffer that held 32 is recycled back into the pool]

Page 15

Dynamic Memory Allocation (continued)

[Figure: exp produces exp(35); the remaining intermediate buffers are returned to the memory pool; a code sketch of such a pool follows]
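A minimal sketch of how a memory pool with ref-count-based recycling might look, building on the earlier topo_sort sketch. The size-keyed free list, the node.size field, and the out= convention for ops are assumptions made here, not MXNet's or the assignment's API.

```python
from collections import defaultdict
import numpy as np

class MemoryPool:
    """Free lists keyed by buffer size: allocate on demand, recycle on release."""
    def __init__(self):
        self.free = defaultdict(list)          # size -> [ndarray, ...]

    def allocate(self, size):
        return self.free[size].pop() if self.free[size] else np.empty(size)

    def release(self, buf):
        self.free[buf.size].append(buf)        # back into the pool for reuse

def execute_with_pool(root, feed, pool):
    """Topo-order execution that returns each buffer to the pool as soon
    as its last consumer has run. Assumes ndarray-valued nodes with a
    known .size and ops that write into a preallocated output buffer."""
    order = topo_sort(root)
    refcount = {n: 0 for n in order}           # consumers still pending
    for n in order:
        for i in n.inputs:
            refcount[i] += 1
    values = dict(feed)
    for node in order:
        if node not in values:
            args = [values[i] for i in node.inputs]
            values[node] = pool.allocate(node.size)
            node.op(*args, out=values[node])   # compute into the buffer
        for i in node.inputs:
            refcount[i] -= 1
            if refcount[i] == 0 and i not in feed:
                pool.release(values.pop(i))    # dead value: recycle its buffer
    return values[root]
```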

Page 16

Static Memory Planning

1. Plan for reuse ahead of time

2. Analogy: register allocation algorithms in compilers

[Figure: the exp(a * b + 3) graph with a = 4, b = 8 and buffers (colors) assigned statically]

Same color represents the same piece of memory

Page 17

Common Patterns of Memory Planning

● Inplace: store the result in the input

● Normal sharing: reuse memory that is no longer needed

Page 18

Inplace Optimization

[Figure: computational graph for exp(a * b + 3), with each op writing its result into its input's buffer]

● Store the result in the input

● Works if we only care about the final result

● Question: what operation cannot be done inplace ?

Page 19

Inplace Pitfalls

[Figure: the same exp(a * b + 3) graph, plus a log node that also consumes an intermediate value, so that value can no longer be overwritten inplace]

We can only do inplace if the result op is the only consumer of the current value
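As a predicate, the rule might look like this; num_consumers is an assumed precomputed map from each node to its consumer count, not part of any real API here.

```python
def can_inplace(node, input_node, num_consumers):
    """An op may overwrite input_node's buffer only if it is that value's
    sole consumer, e.g. exp may not overwrite an intermediate that a log
    node still needs to read."""
    return input_node in node.inputs and num_consumers[input_node] == 1
```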

Page 20

Normal Memory Sharing

[Figure: exp's result reuses the buffer of an earlier intermediate value that is no longer needed]

Recycle memory that is no longer needed.

Page 21

Memory Planning Algorithm

A

B = sigmoid(A)

C = sigmoid(B)

E = Pooling(C)

F = Pooling(B)

G = E * F

[Figure: the allocation algorithm animated on this graph, from the initial state to the final memory plan. Each node carries a ref counter: the count of dependent operations not yet fulfilled. A tag indicates memory sharing; a box holds the free tags. Arrows show data dependencies (operation completed vs. not completed); same color indicates shared memory among the internal arrays]

step 1: Allocate a tag for B

step 2: Allocate a tag for C; cannot do inplace because B is still alive

step 3: Allocate a tag for F; release the space of B

step 4: Reuse the tag in the box for E

step 5: Reuse the tag of E; this is an inplace optimization

A code sketch of this allocation loop follows.
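A minimal sketch of the tag-based planner described above, under simplifying assumptions: nodes arrive topo-ordered with an inputs tuple, an optional inplace_ok attribute marks ops that support inplace updates, and buffer shapes/sizes are ignored.

```python
def plan_memory(order, keep):
    """Static planner: simulate execution in topo order, handing out
    shared 'tags' (abstract buffers, one per color) from a box of free
    tags. `keep` holds nodes whose buffers must never be recycled
    (graph inputs, parameters, requested outputs)."""
    refcount = {n: 0 for n in order}        # consumers not yet executed
    for n in order:
        for i in n.inputs:
            refcount[i] += 1
    tag_of, free_tags, next_tag = {}, [], 0
    for node in order:
        # inplace: steal an input's tag if the op supports it and this
        # op is that input's last remaining consumer
        donor = None
        if getattr(node, "inplace_ok", False):
            donor = next((i for i in node.inputs
                          if refcount[i] == 1 and i not in keep), None)
        if donor is not None:
            tag_of[node] = tag_of[donor]
        elif free_tags:
            tag_of[node] = free_tags.pop()  # normal sharing from the box
        else:
            tag_of[node] = next_tag         # fresh allocation
            next_tag += 1
        for i in node.inputs:
            refcount[i] -= 1
            if (refcount[i] == 0 and i not in keep
                    and tag_of[i] != tag_of[node]):
                free_tags.append(tag_of[i])  # release the dead buffer's tag
    return tag_of
```

On the slide's example this reproduces the five steps: C cannot take B's tag while F still depends on B; B's tag lands in the box after F runs and is reused for E; and G takes E's tag inplace.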

Page 22

Concurrency vs Memory Optimization

A[1]
A[2] = conv(A[1])
A[3] = pool(A[2])
A[4] = conv(A[3])
A[5] = pool(A[1])
A[6] = conv(A[5])
A[7] = pool(A[6])
A[8] = concat(A[4], A[7])

[Figure: two memory plans for this graph, shown side by side. Left ("Cannot Run in Parallel"): aggressive sharing reuses the same buffers across the two branches, and the implicit dependencies introduced by allocation serialize them. Right ("Enables Two Parallel Paths"): keeping the branches in separate buffers lets both paths run concurrently. Legend: internal arrays; data dependency; memory allocation for result, same color indicates shared memory; implicit dependency introduced due to allocation]

Page 23

Concurrency-Aware Heuristics

[Figure: the path-coloring heuristic animated; nodes carry a reward of 1, reset to 0 once visited]

First, find the longest path

Reset the reward of the visited nodes to 0, then find the next longest path

The final node coloring

Restrict memory reuse to nodes on the same colored path (sketched below)
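A sketch of this heuristic under the same hypothetical Node representation. The reward/longest-path loop follows the slide; details such as tie-breaking and the number of colors are assumptions.

```python
def longest_path(order, reward):
    """DP over a topo-ordered DAG: highest-reward path ending at each node."""
    best, parent = {}, {}
    for node in order:
        best[node], parent[node] = reward[node], None
        for i in node.inputs:
            if best[i] + reward[node] > best[node]:
                best[node] = best[i] + reward[node]
                parent[node] = i
    node = max(best, key=best.get)      # endpoint of the best path
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def color_nodes(order, num_colors):
    """Repeat: take the longest remaining path, color it, zero its reward.
    Memory reuse is then restricted to nodes of the same color."""
    reward = {n: 1 for n in order}
    color = {}
    for c in range(num_colors):
        for n in longest_path(order, reward):
            color.setdefault(n, c)
            reward[n] = 0               # visited: no longer rewarded
    for n in order:                     # leftovers join the last color
        color.setdefault(n, num_colors - 1)
    return color
```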

Page 24

Memory Allocation and Scheduling

Memory reuse introduces implicit control flow dependencies between ops

Solutions:

● Explicitly add the control flow dependencies (see the sketch after the listing below)
  ○ Needed in TensorFlow

● Enable mutation in the scheduler, so no extra work is needed
  ○ Both operations "mutate" the same memory
  ○ Supported in MXNet

A[1]

A[2] = conv(A[1])

A[3] = pool(A[2])

A[4] = conv(A[3])

A[5] = pool(A[1])

A[6] = conv(A[5])

A[7] = pool(A[6])

A[8] = concat(A[4], A[7])
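A sketch of the first solution: given the planner's tag assignment from the earlier sketch, emit explicit extra edges so an op that overwrites a buffer waits for all earlier readers of that buffer. This mirrors the idea; it is not TensorFlow's actual mechanism.

```python
from collections import defaultdict

def control_dependencies(order, tag_of):
    """For each buffer tag, a write must wait for every read of that tag
    since the previous write; return the extra (reader -> writer) edges."""
    readers = defaultdict(list)         # tag -> reads since last write
    extra_edges = []
    for node in order:
        t = tag_of[node]                # node's output overwrites tag t
        extra_edges += [(r, node) for r in readers[t] if r is not node]
        readers[t] = []                 # those reads are now ordered
        for i in node.inputs:           # this node reads its inputs' tags
            readers[tag_of[i]].append(node)
    return extra_edges
```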

Page 25

Memory Plan with Gradient Calculation

[Figure: the exp(a * b + 3) graph extended with gradient nodes; out_grad is multiplied back through the graph (mul with out_grad, then mul producing a_grad), so forward and backward form one graph]

Back to the question: why do we need automatic differentiation that extends the graph, instead of backprop in the graph? Because the gradients are ordinary nodes of the same graph, the static memory planner (and every other graph optimization) applies to the backward pass as well.


Page 27

Memory Optimization on a Two-Layer MLP

Page 28

Impact of Memory Optimization in MXNet

Page 29

We are still Starved

● For training, memory cost is still linear in the number of layers

● We need to book-keep intermediate results for the gradient calculation

Page 30

Trade Computation with Memory

[Figure: a forward pass followed by two backward segments (Backward 1, Backward 2); checkpointed data is kept for backprop, the rest is dropped and recomputed]

● Only store a few of the intermediate results

● Recompute the values needed during the gradient calculation (see the sketch below)
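A minimal sketch of this checkpoint-every-k scheme, assuming a hypothetical layer API with forward(x) and backward(x, g) and, for simplicity, a layer count divisible by k.

```python
def checkpointed_backprop(layers, x, out_grad, k):
    """Keep activations only every k layers; recompute the rest segment
    by segment during the backward pass."""
    ckpts, h = {0: x}, x
    for i, layer in enumerate(layers):         # forward: O(N/k) stored
        h = layer.forward(h)
        if (i + 1) % k == 0:
            ckpts[i + 1] = h
    g = out_grad
    for seg_end in range(len(layers), 0, -k):  # backward, last segment first
        seg_start = seg_end - k
        acts = [ckpts[seg_start]]              # replay the segment: O(k) live
        for layer in layers[seg_start:seg_end - 1]:
            acts.append(layer.forward(acts[-1]))
        for i in range(seg_end - 1, seg_start - 1, -1):
            g = layers[i].backward(acts[i - seg_start], g)
    return g
```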

Page 31

Computation Graph View of the Algorithm

Page 32

Sublinear Memory Complexity

● If we checkpoint every K steps on an N-layer network

● The memory cost = O(K) + O(N/K): O(K) is the cost per segment (live activations while recomputing it), O(N/K) is the cost to store the checkpointed results

● Minimizing over K gives K = sqrt(N), i.e. an O(sqrt(N)) memory plan (derivation below)

● With one additional forward pass (~25% overhead)
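Spelled out, the sqrt(N) choice is just the minimizer of the two competing terms (constants dropped):

```latex
\[
  M(K) = \underbrace{O(K)}_{\text{cost per segment}}
       + \underbrace{O(N/K)}_{\text{cost to store results}},
  \qquad
  \frac{d}{dK}\left(K + \frac{N}{K}\right) = 1 - \frac{N}{K^{2}} = 0
  \;\Rightarrow\; K = \sqrt{N}, \quad M = O(\sqrt{N}).
\]
```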

Page 33

Alternative View: Recursion

More memory can be saved with multiple levels of recursion (see the sketch below)
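A sketch of the recursive variant under the same hypothetical layer API: checkpoint only the midpoint and recurse on each half, giving O(log N) live activations at the cost of extra recomputation.

```python
def recursive_backprop(layers, x, out_grad):
    """Recompute up to the midpoint, recurse on the second half, then on
    the first; recursion depth (and live checkpoints) is O(log N)."""
    if len(layers) == 1:
        return layers[0].backward(x, out_grad)
    mid = len(layers) // 2
    h = x
    for layer in layers[:mid]:      # recompute the midpoint activation
        h = layer.forward(h)
    g = recursive_backprop(layers[mid:], h, out_grad)  # second half first
    return recursive_backprop(layers[:mid], x, g)      # then first half
```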

Page 34

Comparison of Allocation Algorithm on ResNet

Chen et al., 2016

Page 35

Comparison of Allocation Algorithm on LSTM

Chen et al., 2016

Page 36

Execution Overhead

Page 37

Take-aways

● Computation graph is a useful tool for tracking dependencies

● Memory allocation affects concurrency

● We can trade computation for memory to get a sublinear memory plan

Page 38

Assignment 2

● Assignment 1 implements the computation graph and autodiff

● Assignment 2 implements the rest of the DL system stack (the Graph Executor) to run on hardware:
  ○ Shape inference
  ○ Memory management
  ○ TVM-based operator implementation

● Deadline in two weeks: 5/8/2018

● Post questions to the #dlsys slack channel so course staff can help