Page 1: Deep Learning Inference as a Service

Deep Learning Inference as a Service

Mohammad Babaeizadeh
Hadi Hashemi

Chris Cai

Advisor: Prof. Roy H. Campbell

Page 11: Deep Learning Inference as a Service

Use case 1: Model Developer

Page 12: Deep Learning Inference as a Service

Use case 1: Model Developer

[Diagram: the model developer and the Inference Service]

Page 13: Deep Learning Inference as a Service

Use case 2: Application Developer

Page 14: Deep Learning Inference as a Service

Use case 2: Application Developer

[Diagram: the application developer querying the Inference Service]


Page 19: Deep Learning Inference as a Service

Problem Formulation

[Diagram: a queue of incoming queries that must be assigned to models, which are in turn placed on compute nodes]

Page 20: Deep Learning Inference as a Service

Characteristics of DNNs

Page 22: Deep Learning Inference as a Service

[Figure: example DNN architectures, VGG-16 and Inception]

Page 23: Deep Learning Inference as a Service

Constant Runtime

Page 24: Deep Learning Inference as a Service

Batch Size (Vectorized Computation)

Page 25: Deep Learning Inference as a Service

Stateless

• Embarrassingly parallel workload
  • No centralized data storage

• Load/unload models as necessary (sketched below)
  • No data synchronization is needed for load/unload
  • Load/unload is as expensive as running a process

• Lightweight fault tolerance
  • Bookkeeping only for queries

[Diagram: models mapped onto compute nodes]

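The load/unload behavior listed above can be illustrated with a minimal sketch of a stateless worker. Everything here is an illustrative assumption rather than the system's actual code: the MODEL_STORE mount point, the Worker class, and the stub load_model that stands in for a framework-specific loader.

```python
import os

MODEL_STORE = "/mnt/dfs/models"   # assumed distributed-file-system mount point


def load_model(path):
    """Stub loader; a real worker would call a framework-specific loader here."""
    class _Stub:
        def __init__(self, p):
            self.path = p

        def predict(self, batch):
            return [f"{os.path.basename(self.path)}({x})" for x in batch]

    return _Stub(path)


class Worker:
    def __init__(self, capacity=2):
        self.capacity = capacity      # how many models fit on this node at once
        self.loaded = {}              # model name -> loaded model object

    def ensure_loaded(self, name):
        """Load a model on demand, evicting another one if the node is full."""
        if name not in self.loaded:
            if len(self.loaded) >= self.capacity:
                # Evicting needs no coordination with other nodes: the service
                # keeps no state outside per-query bookkeeping.
                self.loaded.pop(next(iter(self.loaded)))
            self.loaded[name] = load_model(os.path.join(MODEL_STORE, name))
        return self.loaded[name]

    def run(self, name, batch):
        return self.ensure_loaded(name).predict(batch)


# Usage: the same worker can serve different models back to back.
w = Worker()
print(w.run("vgg16", ["img1", "img2"]))
print(w.run("inception", ["img3"]))
```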

Page 28: Deep Learning Inference as a Service

Deterministic

• Output is always the same for the same input

• Enables an effective caching mechanism (sketched below)
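Because outputs are deterministic, responses can be cached keyed on the model name and a hash of the input. Below is a minimal sketch of such a cache; the class name, hashing scheme, and usage are assumptions for illustration, not the project's actual implementation.

```python
import hashlib
import json


class InferenceCache:
    """Cache keyed on (model, canonical input); valid because inference is deterministic."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model_name, query):
        # Hash a canonical serialization of the query together with the model
        # name (ideally also the model version) so identical inputs collide.
        payload = json.dumps(query, sort_keys=True).encode()
        return model_name + ":" + hashlib.sha256(payload).hexdigest()

    def get(self, model_name, query):
        return self._store.get(self._key(model_name, query))

    def put(self, model_name, query, result):
        self._store[self._key(model_name, query)] = result


# Usage: the master server can consult the cache before scheduling a query.
cache = InferenceCache()
query = {"image_id": 42}
if cache.get("vgg16", query) is None:
    result = "cat"                       # stand-in for actually running the model
    cache.put("vgg16", query, result)
print(cache.get("vgg16", query))         # identical queries now hit the cache
```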

Page 29: Deep Learning Inference as a Service

Query Characteristics

Page 30: Deep Learning Inference as a Service

Query Patterns

• Offline queries
  • Batch queries with a high-latency (hours) SLA

• Online stateless queries
  • Single queries with a low-latency (hundreds of milliseconds) SLA

• Online stateful queries
  • A session with a sequence of queries, each with a low-latency (hundreds of milliseconds) SLA
  • Session lifetime is on the order of minutes
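The three query patterns above can be summarized as simple data types carrying their SLA deadlines, as in the following sketch; the class names and the concrete deadline values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OfflineQuery:
    model: str
    batch: List[object]
    sla_seconds: float = 3600.0          # hours-scale deadline


@dataclass
class OnlineQuery:
    model: str
    data: object
    sla_seconds: float = 0.3             # hundreds of milliseconds


@dataclass
class Session:
    """An online stateful session: a sequence of low-latency queries."""
    model: str
    queries: List[OnlineQuery] = field(default_factory=list)
    lifetime_seconds: float = 300.0      # sessions live for minutes

    def submit(self, data):
        q = OnlineQuery(self.model, data)
        self.queries.append(q)
        return q
```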

Page 32: Deep Learning Inference as a Service

Inference as a Service

Page 33: Deep Learning Inference as a Service

Problem Statement

• Serve an arbitrary number of models with different service-level objectives (SLOs) using minimal resources.

• Number of models >> number of nodes (in contrast with previous work).

• Isolate the overall performance of the system from any individual model's performance (vectorization is optional).

• Keep the overhead on request response time low.

Page 34: Deep Learning Inference as a Service

Use case: Registering a new model

1. The model must use our lightweight API
2. Host the model in a container (Docker)
3. Benchmark the model (load time and runtime) for different batch sizes
4. Reject the model if it is impossible to serve within its SLO
5. Store the model on the distributed file system
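A sketch of this registration flow under the assumption of a benchmark-then-store pipeline; the helper names (benchmark, register_model), the MODEL_STORE path, and the batch sizes are illustrative, not the system's real API.

```python
import shutil
import time

MODEL_STORE = "/mnt/dfs/models"          # assumed DFS mount point
BATCH_SIZES = (1, 2, 4, 8, 16)


def benchmark(model, batch_size):
    """Measure per-batch runtime in seconds for one warm inference call."""
    start = time.perf_counter()
    model.predict([0] * batch_size)      # stand-in for a real inference call
    return time.perf_counter() - start


def register_model(model, model_dir, slo_seconds):
    """Benchmark, reject if the SLO cannot be met, otherwise store on the DFS."""
    profile = {b: benchmark(model, b) for b in BATCH_SIZES}
    feasible = {b: t for b, t in profile.items() if t <= slo_seconds}
    if not feasible:
        raise ValueError("rejected: no batch size can meet the requested SLO")
    shutil.copytree(model_dir, f"{MODEL_STORE}/{model.name}")   # step 5
    return feasible    # the scheduler can later pick among these batch sizes
```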

Page 35: Deep Learning Inference as a Service

Use case: Submitting a query

1. The client asynchronously sends request(s) to a Master Server and receives request ID(s) as the response
2. The Master may respond from the cache
3. Otherwise it passes the query to the scheduler
4. The scheduler assigns each query to a worker and sends the commands to the pub/sub:
   • May unload a loaded model
   • May load a new model
   • May duplicate a running model
   • May wait for more requests to arrive (for batching)
5. Compute nodes follow the commands to load/unload models from the distributed file system
6. Workers fetch all the requests, serve them in a batch, and put the results back into the pub/sub
7. The Master fetches the results and waits for the client to request the responses
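The master-side portion of this flow might look like the following sketch, where an in-memory queue stands in for the real pub/sub layer (the deck mentions Redis) and all names are illustrative.

```python
import queue
import uuid

pubsub = queue.Queue()   # stand-in for the real pub/sub channel
cache = {}               # (model, input) -> result, filled in by workers
results = {}             # request_id -> result, filled in by workers


def send_request(model, data):
    """Steps 1-4: return a request ID immediately; serve from cache or publish."""
    request_id = str(uuid.uuid4())
    if (model, data) in cache:
        results[request_id] = cache[(model, data)]            # cache hit
    else:
        pubsub.put({"id": request_id, "model": model, "data": data})
    return request_id


def get_response(request_id):
    """Step 7: the client polls for the result using its request ID."""
    return results.get(request_id)      # None means "not ready yet"


# Usage: the client holds on to the ID and polls later.
rid = send_request("vgg16", "img1.jpg")
print(get_response(rid))                # None until a worker publishes a result
```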

Page 36: Deep Learning Inference as a Service

[Architecture diagram: Clients send requests through a REST API to the Master Servers; a Scheduler dispatches commands over a Pub/Sub layer to Compute Nodes, each running Workers behind a Model API; models are stored on a Distributed File System]

Page 37: Deep Learning Inference as a Service

Client API

• Send_Requests(data[], model)
• Get_Responses(request_id[])

• Start_Session(model)
• Send_Requests(data[], session_id)
• End_Session(session_id)
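A hypothetical usage example of these calls through an imagined Python wrapper; only the method names mirror the slide (in snake_case), while the Client class, its constructor, and the return shapes are assumptions. Only the stateless path (Send_Requests / Get_Responses) is shown.

```python
class Client:
    """Toy stand-in for the real client library (stateless calls only)."""

    def __init__(self, endpoint):
        self.endpoint = endpoint     # assumed master-server address
        self._counter = 0

    def send_requests(self, data, model):
        """Mirror of Send_Requests(data[], model): returns one ID per item."""
        ids = list(range(self._counter, self._counter + len(data)))
        self._counter += len(data)
        return ids

    def get_responses(self, request_ids):
        """Mirror of Get_Responses(request_id[]): placeholder results."""
        return [f"result-{i}" for i in request_ids]


client = Client("http://master:8080")
ids = client.send_requests(["img1.jpg", "img2.jpg"], model="vgg16")
print(client.get_responses(ids))
```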

Page 38: Deep Learning Inference as a Service

Demo

Page 39: Deep Learning Inference as a Service

In progress / Open Problems

• Scheduler
  • Elasticity
  • Analytical model

• API Expansion
  • More languages: currently Python only
  • Pipelines
  • Ensembles

• Model Efficiency
  • Load/Unload
  • Stateful models

• Model Isolation
  • Currently limited to computation, outsourced to Docker
  • Memory bandwidth, PCIe

• Fault Tolerance
  • Currently outsourced to Redis
  • Approximate responses

Page 40: Deep Learning Inference as a Service

Related Problems

• Model Compression

• DNN-specific hardware

• Real-time inference on mobile devices