Moore’s Law
Analog Specialization, 2000 BC – 1940 AD: Antikythera Mechanism, Babbage Difference Engine
Von Neumann Invention, 1940 – 1975: Instruction sets, virtual memory, caches
Integration, 1975 – 1990: RISC, single-chip CPUs, integrated FPUs, caches
Clock Frequency (+ ILP), 1990 – 2005: Deep pipelines, speculation, large caches
Multicore, 2005 – 2016: 1 to 24 cores, on-chip networks
Hardware Specialization, 2016 – ?: Programmable logic, rapid ASICs, CGRAs
[Figure: generality vs. efficiency spectrum spanning 1000x. Labels: CPUs, CMPs, Manycore, GPGPUs, ALU arrays, FPGAs, ASICs.]
Source: Bob Broderson, Berkeley Wireless group
• Cloud: Two main challenges for specialization
  • Want homogeneous (to the extent possible) server infrastructure
  • Need five years of stability for ASICs (2 to design, 3 for use), software changes monthly
• Client:
  • Area is precious, must be both general and efficient
• “Uncanny valley” between CPUs and ASICs (where accelerators go to die)
2.4+ million emails per day
200+ Cloud Services
1+ billion customers · 20+ million businesses · 90+ markets worldwide
5.8+ billion worldwide queries each month
1 in 4 enterprise customers
50+ billion minutes of connections handled each month
48+ million users in 41 markets
50+ million active users
400+ million active accounts
250+ million active users
8.6+ trillion objects in Microsoft Azure storage
Huge infrastructure: Scale is the enabler
Microsoft has datacenter capacity around the world…and we’re growing
[Map locations: Chicago, Cheyenne, Dublin, Amsterdam, Hong Kong, Singapore, Japan, San Antonio, Boydton, Shanghai, Quincy, Des Moines, Brazil, Australia, Finland]
• 1M+ servers
• Mega, Regional, Edge datacenters
• Dark fiber network
Azure scaling: Exponential growth
[Chart: growth from 2010 to 2014 in Compute (VMs), Storage, and DC Network]
Capacity
Efficiency
(ASICS)
Ubiquity
Xeon CPU, NIC, Search Acc. (FPGA)
Search Acc. (ASIC)
Wasted Power, Holds back SW
Xeon CPU, NIC, Search Acc. v2 (FPGA)
NIC, Xeon CPU, Math Accelerator
Wasted Power, One more thing that can break
• 1/2U rack-mounted
• 1 x 10GbE port
• 1 x16 PCIe slot
• 12 Intel Westmere cores (2 sockets)
[Figure: FPGAs grouped into services (Web Search Pipeline, Math Acceleration Service, Comp. Vision Service, Physics Engine), with servers attached to ToR and CS switches]
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs, 2 SSDs
• 10 Gb Ethernet
• No cable attachments to server
• Altera Stratix V D5
• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen 3 x8
• 8GB DDR3-1333
• Powered by PCIe slot
• Torus Network
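The torus wiring between FPGAs can be illustrated with a short sketch. It computes the four direct neighbors of a node and the hop count between two nodes in a 2D torus. The 6 x 8 grid size and the function names are assumptions made for illustration; the slide states only that the FPGAs form a torus.

```python
from typing import Dict, Tuple

Node = Tuple[int, int]  # (row, col) position of an FPGA in the grid


def torus_neighbors(row: int, col: int, rows: int, cols: int) -> Dict[str, Node]:
    """Return the four neighbors of a node; edges wrap around in both dimensions."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


def torus_hops(a: Node, b: Node, rows: int, cols: int) -> int:
    """Minimum hop count between two nodes, taking the shorter way around each ring."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


# Assumed grid size for illustration only (not stated on the slide).
ROWS, COLS = 6, 8

print(torus_neighbors(0, 0, ROWS, COLS))
print(torus_hops((0, 0), (5, 7), ROWS, COLS))  # 2: one wrap-around hop per dimension
```

The wrap-around links are what keep opposite corners of the grid close: without them, the same two nodes would be 12 hops apart.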
[Figure: Data Center Server (1U, ½ width) with FPGA board: Stratix V, 8GB DDR3, PCIe Gen3 x8]
[Figure: FPGA shell and role block diagram]
Shell: x8 PCIe Core, DMA Engine, Config Flash (RSU), DDR3 Core 0, DDR3 Core 1, JTAG, LEDs, Temp Sensors, I2C, xcvr reconfig, SEU, Inter-FPGA Router, North/South/East/West SLIII links
Role: Application
External: Host CPU, 256 Mb QSPI Config Flash, two 4 GB DDR3-1333 ECC SO-DIMMs
Microsoft Confidential
[Figure: Bing Pod. Labels: Front end, TLA, MLA 0 … MLA N, IFM 0 … IFM 47; each IFM runs L0, L1, L2 (RaaS)]
L0: Candidate finder
Find all docs on this machine that contain the query terms
L1: Fast filter
Generate quick scores for each doc from index, choose top 4
L2: Expensive ranker
Retrieve 4 docs from disk. Compute numerical score for each. Milliseconds/doc
• IFM sends 4 scores to MLA
• MLA sends top 100/220 to TLA
• TLA sorts, generates captions, returns top 10
Front end sends query that misses in the cache to the TLA, query is processed
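The L0/L1/L2 funnel and the IFM, MLA, TLA aggregation described above can be sketched roughly as follows. This is a toy model under stated assumptions: the two scoring functions, the document format, and every name in the code are hypothetical stand-ins, not the production rankers.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Scored = Tuple[float, str]  # (score, document id)
Scorer = Callable[[List[str], str], float]


def fast_score(terms: List[str], text: str) -> float:
    """Hypothetical L1 score: total occurrences of the query terms."""
    words = text.split()
    return float(sum(words.count(t) for t in terms))


def expensive_score(terms: List[str], text: str) -> float:
    """Hypothetical L2 score: term occurrences normalized by document length."""
    return fast_score(terms, text) / len(text.split())


def ifm_rank(terms: List[str], docs: Dict[str, str], l1: Scorer, l2: Scorer,
             keep: int = 4) -> List[Scored]:
    # L0: candidate finder, keep docs that contain every query term.
    candidates = [d for d, text in docs.items() if set(terms) <= set(text.split())]
    # L1: fast filter, keep the top `keep` candidates by the cheap score.
    shortlist = heapq.nlargest(keep, candidates, key=lambda d: l1(terms, docs[d]))
    # L2: expensive ranker, score only the shortlisted docs.
    return sorted(((l2(terms, docs[d]), d) for d in shortlist), reverse=True)


def merge_top(results: List[List[Scored]], k: int) -> List[Scored]:
    """MLA/TLA-style aggregation: merge per-machine results and keep the top k."""
    return heapq.nlargest(k, (item for r in results for item in r))


query = ["fpga", "configuration"]
ifm_a = {"a1": "fpga configuration guide", "a2": "fpga fpga configuration", "a3": "cpu cache"}
ifm_b = {"b1": "configuration of an fpga shell"}

per_ifm = [ifm_rank(query, docs, fast_score, expensive_score) for docs in (ifm_a, ifm_b)]
top10 = merge_top(per_ifm, k=10)
print([doc for _, doc in top10])  # ['a2', 'a1', 'b1']
```

The point of the funnel is that the expensive L2 scorer only ever sees the handful of documents that survive L1 on each machine.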
[Figure: SaaS 1 … SaaS 48 servers, each shown with IFM 1 … IFM 44; stage labels FE, FFE, MLS]
Ranking-as-a-Service (RaaS)
- Compute scores for how relevant each selected document is for the search query
- Sort the scores and return the results
Selection-as-a-Service (SaaS)
- Find all docs that contain query terms
- Filter and select candidate documents for ranking
[Figure: Selection as a Service (SaaS) with IFM 1 … IFM 44, and Ranking as a Service (RaaS) with RaaS 1 … RaaS 48. Flow labels: Query, Selected Documents, 10 blue links]
[Figure: L2 ranking pipeline]
From L1: query + 4 document IDs
Query compilation
Read document from disk
FE: Feature Extraction (Docs to Dynamic Features)
FFE: Free-Form Expressions (Synthetic Features)
MLS: Machine learning scoring
Send ranked scores for 4 documents back to MLA
Hit vector per stream and static features
[Figure: example FFE expression tree. Operators: >, /, +, *, ln, max, if. Constants: 1, 1e-006, 5. Features: SF1, SF13, NF91, DF88 through DF95]
S0
Position Term
5 3
12 4
99 2
107 3
109 3
7 1
42 3
43 7
S1
NumOccurrences_1_3 = 1
Decompress and extract HV
Query: “FPGA Configuration”
NumberOfOccurrences_0 = 7
NumberOfOccurrences_1 = 4
NumberOfTuples_0_1 = 1
[Figure: {Query, Document}, ~4K Dynamic Features, ~2K Synthetic Features, L2 Score, Document Score]
FFE #1 = (2 * NumberOfOccurrences_0 + NumberOfOccurrences_1) / (2 * NumberOfTuples_0_1)
With NumberOfOccurrences_0 = 7, NumberOfOccurrences_1 = 4, NumberOfTuples_0_1 = 1:
FFE #1 = (2 * 7 + 4) / (2 * 1) = 9
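As a quick check of the arithmetic, the FFE #1 expression can be evaluated directly. The function name and argument names below are illustrative only; the formula and the input values are the ones shown on the slide.

```python
def ffe_1(occurrences_0: int, occurrences_1: int, tuples_0_1: int) -> float:
    """Evaluate FFE #1 = (2 * occ_0 + occ_1) / (2 * tuples_0_1)."""
    return (2 * occurrences_0 + occurrences_1) / (2 * tuples_0_1)


# Dynamic feature values from the "FPGA Configuration" example.
print(ffe_1(occurrences_0=7, occurrences_1=4, tuples_0_1=1))  # 9.0
```

The sketch does not guard against NumberOfTuples_0_1 being zero; how the hardware handles that case is not stated on the slide.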
[Figure: FE, FFE, and MLS hardware blocks. Labels: Complex ALU (Ln, ÷, div), Basic Tiles, Registers, Constants, FFE 1 … FFE n Inst., Compression Thresholds, Local ALU, DSP, Scheduling Logic, Distribution latches, Control/Data Tokens, Feature Transmission Network, Stream Preprocessing FSM]
>100 feature families
~90 State Machines
[Figure: chip floorplan with a 4 x 12 grid of MLT tiles, MLT [0][0] through MLT [3][11], and a 2 x 4 grid of FFE tiles, FFE [0][0] through FFE [1][3]]
FFE: 64 cores / chip
256-512 threads
MLS: 48 MLT tiles/chip
240 ML processors
2880 ML units/chip
[Figure: Feature Extraction datapath. Labels: PCIe, Compressed Document, Stream Preprocessing FSM, Distribution latches, Control/Data Tokens, Feature Gathering Network, Free Form Expression (FFE)]
• 196 feature families
• 54 state machines
• 2.6K dynamic features extracted in less than 4us (~600us in SW)
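To make the idea of state machines extracting dynamic features from a hit vector concrete, here is a minimal software sketch. It makes one pass over (position, term) hits, counting occurrences per query term and counting tuples where term 0 is immediately followed by term 1. The hit data, the tuple definition, and the function name are assumptions for illustration; the real hit-vector encoding and feature definitions are not given in this deck.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Hit = Tuple[int, int]  # (position in document, query term index)


def extract_features(hits: Iterable[Hit]) -> Dict[str, int]:
    """Single pass over the hits, keeping only the previous hit as state."""
    features: Counter = Counter({"NumberOfTuples_0_1": 0})
    prev: Optional[Hit] = None
    for position, term in sorted(hits):
        features[f"NumberOfOccurrences_{term}"] += 1
        # Assumed tuple definition: term 0 at position p, term 1 at position p + 1.
        if prev is not None and prev[1] == 0 and term == 1 and position == prev[0] + 1:
            features["NumberOfTuples_0_1"] += 1
        prev = (position, term)
    return dict(features)


# Hypothetical hit vector, chosen so the counts match the FFE example values.
hits = [(1, 0), (5, 0), (6, 1), (9, 0), (15, 1), (20, 0),
        (25, 1), (30, 0), (35, 1), (40, 0), (50, 0)]
print(extract_features(hits))
```

In hardware, many such state machines can consume the same hit stream in parallel, which is one plausible reason the FPGA version is so much faster than a sequential software loop.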
[Figure: FFE cluster with Core 0 through Core 5, Complex FST, and Output]
• Specialized processing engines
  • Each core has a simple ALU (integer, logical, load/store, control flow operations)
  • 4 HW threads, 16 registers per thread
  • 4kB shared memory
• Every six cores share a complex ALU
  • Complex ALU performs ln, divide, exp and float to int conversions
  • Six cores + complex ALU = cluster
• 8+ clusters (192+ threads) per FPGA
• 551 synthetic features computed in less than 5us (~50us in SW)
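The thread count and the implied speedups follow from the numbers quoted in this deck. The sketch below works them out; the constant and function names are illustrative. Because the hardware times are stated as upper bounds ("less than") and the software times are approximate, the ratios are rough lower bounds rather than measured results.

```python
CORES_PER_CLUSTER = 6   # six cores share one complex ALU
THREADS_PER_CORE = 4    # 4 HW threads per core
MIN_CLUSTERS = 8        # "8+ clusters" per FPGA

min_threads = MIN_CLUSTERS * CORES_PER_CLUSTER * THREADS_PER_CORE
print(min_threads)  # 192, matching "192+ threads"


def min_speedup(software_us: float, hardware_us: float) -> float:
    """Lower bound on speedup when the hardware time is an upper bound."""
    return software_us / hardware_us


ffe_speedup = min_speedup(software_us=50, hardware_us=5)   # synthetic features (FFE)
fe_speedup = min_speedup(software_us=600, hardware_us=4)   # dynamic features (FE)
print(ffe_speedup, fe_speedup)  # 10.0 150.0
```

On these figures, feature extraction gains far more from the FPGA than free-form expression evaluation does, roughly 150x versus 10x.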
Cluster 0
FFE: Free-Form Expressions
FE: Feature Extraction
[Figure: 8-Stage Pipeline spanning FPGA 0 through FPGA 7, one FPGA per Server. Labels: RaaS Servers, Document Scoring Request, Route to Head, Compute Score, Return Score, Document Score]
1,632 Servers with FPGAs Running Bing Page Ranking Service (~30,000 lines of C++)