TechFest 2014

Moore’s Law

• Analog Specialization (2000 BC – 1940 AD): Antikythera Mechanism, Babbage Difference Engine
• Von Neumann Invention (1940 – 1975): Instruction sets, virtual memory, caches
• Integration (1975 – 1990): RISC, single-chip CPUs, integrated FPUs, caches
• Clock Frequency (+ ILP) (1990 – 2005): Deep pipelines, speculation, large caches
• Multicore (2005 – 2016): 1 to 24 cores, on-chip networks
• Hardware Specialization (2016 – ?): Programmable logic, rapid ASICs, CGRAs

[Figure: generality vs. efficiency spectrum, annotated "1000x". Labeled design points: CPUs, CMPs, manycore, GPGPUs, ALU arrays, FPGAs, ASICs. Source: Bob Brodersen, Berkeley Wireless group]

• Cloud: Two main challenges for specialization
  • Want homogeneous (to the extent possible) server infrastructure
  • Need five years of stability for ASICs (2 to design, 3 for use), but software changes monthly
• Client:
  • Area is precious, must be both general and efficient
• “Uncanny valley” between CPUs and ASICs (where accelerators go to die)

200+ Cloud Services

1+ billion customers · 20+ million businesses · 90+ markets worldwide

• 2.4+ million emails per day
• 5.8+ billion worldwide queries each month
• 1 in 4 enterprise customers
• 50+ billion minutes of connections handled each month
• 48+ million users in 41 markets
• 50+ million active users
• 400+ million active accounts
• 250+ million active users
• 8.6+ trillion objects in Microsoft Azure storage

Huge infrastructure: Scale is the enabler

Microsoft has datacenter capacity around the world…and we’re growing

• 1M+ servers
• Mega, Regional, Edge datacenters
• Dark fiber network

Locations shown: Quincy, Cheyenne, Des Moines, Chicago, San Antonio, Boydton, Brazil, Dublin, Amsterdam, Finland, Hong Kong, Shanghai, Singapore, Japan, Australia

Azure scaling: Exponential growth

[Figure: growth from 2010 to 2014 in Compute (VMs), Storage, and DC Network]

Capacity

Efficiency (ASICs)

Ubiquity

[Figure: candidate server configurations: (1) Xeon CPU, NIC, Search Acc. (FPGA); (2) Search Acc. (ASIC); (3) Xeon CPU, NIC, Search Acc. v2 (FPGA); (4) NIC, Xeon CPU, Math Accelerator. Annotations: "Wasted Power, Holds back SW" and "Wasted Power, One more thing that can break"]

• 1/2U rack-mounted
• 1 x 10GbE port
• 1 x16 PCIe slot
• 12 Intel Westmere cores (2 sockets)

[Figure: groups of FPGAs connected through ToR and CS switches and exposed as services: Web Search Pipeline, Math Acceleration Service, Comp. Vision Service, Physics Engine]

• Two 8-core Xeon 2.1 GHz CPUs

• 64 GB DRAM

• 4 HDDs, 2 SSDs

• 10 Gb Ethernet

• No cable attachments to server

• Altera Stratix V D5

• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs

• PCIe Gen 3 x8

• 8GB DDR3-1333

• Powered by PCIe slot

• Torus Network

[Figure: FPGA board and shell block diagram. Board: Stratix V in a data center server (1U, ½ width), PCIe Gen3 x8 to the host CPU, 8 GB DDR3 as two 4 GB DDR3-1333 ECC SO-DIMMs, 256 Mb QSPI config flash. Shell: x8 PCIe core, DMA engine, DDR3 core 0 and DDR3 core 1, North/South/East/West SLIII links, inter-FPGA router, config flash (RSU), JTAG, LEDs, temp sensors, I2C, xcvr reconfig, SEU. Remaining labels: Application, Role]

[Figure: Bing pod. Front end, TLA, MLA 0 ... MLA N, and IFM 0 ... IFM 47, with ranking stages L0, L1, and L2 (RaaS)]

L0: Candidate finder. Find all docs on this machine that contain the query terms.

L1: Fast filter. Generate quick scores for each doc from the index, choose the top 4.

L2: Expensive ranker. Retrieve 4 docs from disk. Compute a numerical score for each. Milliseconds/doc.

• IFM sends 4 scores to MLA
• MLA sends top 100/220 to TLA
• TLA sorts, generates captions, returns top 10

The front end sends each query that misses in the cache to the TLA, where the query is processed.
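The L0/L1/L2 funnel and the IFM-to-MLA-to-TLA merge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the in-memory index, and the score tables are hypothetical stand-ins, not the Bing implementation. It shows the shape of the computation: a cheap filter keeps 4 documents per machine, an expensive scorer ranks them, and an aggregator keeps the best results across machines.

```python
import heapq
from itertools import chain

def l0_candidates(index, query_terms):
    """L0: every local doc whose term set contains all query terms."""
    wanted = set(query_terms)
    return [doc for doc, terms in index.items() if wanted <= terms]

def l1_fast_filter(candidates, quick_score, keep=4):
    """L1: cheap score per candidate, keep the best `keep` (4 on the slide)."""
    return heapq.nlargest(keep, candidates, key=quick_score)

def l2_rank(docs, expensive_score):
    """L2: expensive score for each surviving doc, as (score, doc) pairs."""
    return [(expensive_score(doc), doc) for doc in docs]

def ifm_answer(index, query_terms, quick_score, expensive_score):
    """One IFM: L0 -> L1 -> L2, returning at most 4 scored docs."""
    candidates = l0_candidates(index, query_terms)
    return l2_rank(l1_fast_filter(candidates, quick_score), expensive_score)

def merge_top(score_lists, keep):
    """Aggregator step (MLA or TLA): merge results, keep the best `keep`."""
    return heapq.nlargest(keep, chain.from_iterable(score_lists))

# Hypothetical toy data: two machines, doc id -> set of terms in the doc.
machine_a = {"a1": {"fpga", "configuration"},
             "a2": {"fpga"},
             "a3": {"fpga", "configuration", "flash"}}
machine_b = {"b1": {"configuration", "fpga", "shell"},
             "b2": {"torus"}}
scores = {"a1": 0.4, "a2": 0.8, "a3": 0.9, "b1": 0.7, "b2": 0.1}

# For brevity the same made-up table serves as both quick and expensive score.
per_machine = [ifm_answer(m, ["fpga", "configuration"],
                          scores.__getitem__, scores.__getitem__)
               for m in (machine_a, machine_b)]
top = merge_top(per_machine, keep=10)
print(top)
```

The split matters because the expensive L2 scorer only ever sees the 4 documents that survive L1 on each machine, which is what keeps the per-document milliseconds affordable.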

[Figure: Selection-as-a-Service (SaaS) pool, SaaS 1 ... SaaS 48, with groups of IFMs (IFM 1 ... IFM 44)]

Ranking-as-a-Service (RaaS)
- Compute scores for how relevant each selected document is for the search query
- Sort the scores and return the results

Selection-as-a-Service (SaaS)
- Find all docs that contain query terms
- Filter and select candidate documents for ranking

[Figure: Query → Selection as a Service (SaaS), over IFM 1 ... IFM 44 → Selected Documents → Ranking as a Service (RaaS 1 ... RaaS 48) → 10 blue links]

• From L1: query + 4 document IDs
• Query compilation
• Read document from disk
• FE: Feature Extraction (docs in, dynamic features out)
• FFE: Free-Form Expressions (synthetic features out)
• MLS: Machine learning scoring
• Send ranked scores for 4 documents back to MLA

Hit vector per stream and static features

[Figure: example free-form expression tree over features SF1, SF13, NF91, and DF88 through DF95 and constants (1, 1e-006, 5), built from the operators +, *, /, >, ln, max, and if]

Decompress and extract HV (hit vector), streams S0 and S1:

Position  Term
5         3
12        4
99        2
107       3
109       3
7         1
42        3
43        7

NumOccurrences_1_3 = 1

Query: “FPGA Configuration”

• NumberOfOccurrences_0 = 7
• NumberOfOccurrences_1 = 4
• NumberOfTuples_0_1 = 1

{Query, Document} → ~4K Dynamic Features → ~2K Synthetic Features → L2 Score (Document Score)

FFE #1 = (2 * NumberOfOccurrences_0 + NumberOfOccurrences_1) / (2 * NumberOfTuples_0_1)

With NumberOfOccurrences_0 = 7, NumberOfOccurrences_1 = 4, and NumberOfTuples_0_1 = 1:

FFE #1 = (2 * 7 + 4) / (2 * 1) = 9
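The FFE #1 example can be checked with a short Python sketch. It assumes the expression is a fraction, with the second parenthesized term as the denominator, which is the reading that reproduces the slide's result of 9. The function name and the dictionary layout are hypothetical. Only the three feature names and their values come from the slide.

```python
def ffe_1(features):
    """Synthetic feature FFE #1 computed from three dynamic features.

    Assumed form: (2 * occurrences_0 + occurrences_1) / (2 * tuples_0_1).
    """
    numerator = (2 * features["NumberOfOccurrences_0"]
                 + features["NumberOfOccurrences_1"])
    denominator = 2 * features["NumberOfTuples_0_1"]
    return numerator / denominator

# Dynamic feature values from the "FPGA Configuration" example.
dynamic_features = {
    "NumberOfOccurrences_0": 7,
    "NumberOfOccurrences_1": 4,
    "NumberOfTuples_0_1": 1,
}

result = ffe_1(dynamic_features)
print(result)  # 9.0
```

Each free-form expression is a small arithmetic program like this one over the dynamic features. The FFE stage evaluates many such expressions per document.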

[Figure: FFE and FE block diagrams. Labels: basic tiles with local ALU and DSP, complex ALU (ln, ÷, div), registers, constants, FFE 1 ... FFE n instruction stores, compression thresholds, scheduling logic, distribution latches, control/data tokens, feature transmission network, stream preprocessing FSM]

FE, FFE, MLS

• >100 feature families
• ~90 state machines

Machines

[Figure: chip floorplan with a 4 x 12 grid of MLT tiles (MLT [0][0] to MLT [3][11]) and a 2 x 4 grid of FFE tiles (FFE [0][0] to FFE [1][3])]

• FFE: 64 cores/chip, 256-512 threads
• MLS: 48 MLT tiles/chip, 240 ML processors, 2880 ML units/chip

[Figure: FE block diagram. Labels: PCIe, compressed document, stream preprocessing FSM, distribution latches, control/data tokens, feature gathering network, Free Form Expression (FFE)]

• 196 feature families
• 54 state machines
• 2.6K dynamic features extracted in less than 4 µs (~600 µs in SW)

[Figure: one cluster: Core 0 through Core 5, a block labeled “Complex FST”, and Output]

• Specialized processing engines
  • Each core has a simple ALU (integer, logical, load/store, control flow operations)
  • 4 HW threads, 16 registers per thread
  • 4 kB shared memory
• Every six cores share a complex ALU
  • Complex ALU performs ln, divide, exp, and float-to-int conversions
  • Six cores + complex ALU = cluster
• 8+ clusters (192+ threads) per FPGA
• 551 synthetic features computed in less than 5 µs (~50 µs in SW)
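The thread count and the implied speedups follow from simple arithmetic, sketched below in Python. The constant and function names are hypothetical. The inputs are the slide's figures: 6 cores per cluster, 4 hardware threads per core, at least 8 clusters, and the quoted hardware and software latencies for FE and FFE. Because the hardware times are stated as upper bounds ("less than") and the software times are approximate, the ratios should be read as rough lower bounds, not measurements.

```python
CORES_PER_CLUSTER = 6   # six cores share one complex ALU
THREADS_PER_CORE = 4    # 4 HW threads per core
MIN_CLUSTERS = 8        # "8+ clusters" per FPGA

# Minimum number of hardware threads per FPGA.
min_threads = MIN_CLUSTERS * CORES_PER_CLUSTER * THREADS_PER_CORE

def speedup(software_us, hardware_us):
    """Ratio of software latency to hardware latency (both in microseconds)."""
    return software_us / hardware_us

fe_speedup = speedup(600, 4)   # FE: ~600 us in SW vs. less than 4 us on the FPGA
ffe_speedup = speedup(50, 5)   # FFE: ~50 us in SW vs. less than 5 us on the FPGA

print(min_threads)   # 192
print(fe_speedup)    # 150.0
print(ffe_speedup)   # 10.0
```

The 192 figure matches the "192+ threads" on the slide, which confirms that the count is clusters times cores times threads per core.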

Cluster 0

FFE: Free-Form Expressions

FE: Feature Extraction

[Figure: 8-stage pipeline across FPGA 0 through FPGA 7, one FPGA per server. RaaS servers issue a document scoring request, which is routed to the head of the pipeline (Route to Head), scored (Compute Score), and returned as a document score (Return Score)]

1,632 Servers with FPGAs Running Bing Page Ranking Service (~30,000 lines of C++)