IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.

Jan 05, 2016


Bruno Kelly
Page 1:

IBM Research

© 2008

Feeding the Multicore Beast: It’s All About the Data!

Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.

Page 2:

Outline

History: Data challenge

Motivation for multicore

Implications for programmers

How Cell addresses these implications

Examples

• 2D/3D FFT

– Medical Imaging, Petroleum, general HPC…

• Green’s Functions

– Seismic Imaging (Petroleum)

• String Matching

– Network Processing: DPI & Intrusion Detection

• Neural Networks

– Finance

Page 3:

Chapter 1:

The Beast is Hungry!

Page 4:

The Hungry Beast

Processor (“beast”)

Data (“food”)

Data Pipe

Pipe too small = starved beast

Pipe big enough = well-fed beast

Pipe too big = wasted resources

Page 5:

The Hungry Beast

If flops grow faster than pipe capacity…

… the beast gets hungrier!
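The starved-beast picture can be put in numbers with a simple bandwidth bound: sustained throughput is at most the smaller of peak compute and pipe bandwidth times the flops performed per byte moved. The sketch below is illustrative only; the function names and every number used with them are assumptions, not figures from the talk.

```python
def attainable_gflops(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """Upper bound on sustained GFLOPS for a kernel fed through a data pipe."""
    # The pipe can supply at most pipe_gb_per_s * flops_per_byte GFLOPS of work.
    return min(peak_gflops, pipe_gb_per_s * flops_per_byte)


def starved(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """True when the pipe, not the processor, limits performance."""
    return pipe_gb_per_s * flops_per_byte < peak_gflops
```

With a hypothetical 10 GB/s pipe and a kernel doing 0.5 flops per byte, doubling peak flops from 100 to 200 GFLOPS leaves the bound at 5 GFLOPS: the beast only gets hungrier.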

Page 6:

Move the food closer

Example: Intel Tulsa

– Xeon MP 7100 series

– 65 nm, 349 mm², 2 cores

– 3.4 GHz @ 150W

– ~54.4 SP GFlops

– http://www.intel.com/products/processor/xeon/index.htm

Large cache on chip

– ~50% of area

– Keeps data close for efficient access

If the data is local, the beast is happy!

– True for many algorithms

Page 7:

What happens if the beast is still hungry?

Data

Cache

If the data set doesn’t fit in cache

– Cache misses

– Memory latency exposed

– Performance degraded

Several important application classes don’t fit

– Graph searching algorithms

– Network security

– Natural language processing

– Bioinformatics

– Many HPC workloads
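A toy cache model shows how sharply performance can fall once the working set outgrows the cache. This is a sketch under stated assumptions: a fully associative LRU cache and a repeated in-order sweep of the data, neither of which comes from the talk; the function names are hypothetical.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_lines):
    """Fraction of accesses that miss in a fully associative LRU cache."""
    cache = OrderedDict()  # cached addresses, least recently used first
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        if addr in cache:
            cache.move_to_end(addr)  # mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses / total if total else 0.0


def repeated_sweep(working_set, passes):
    """Address stream that walks the working set in order, several times."""
    return [a for _ in range(passes) for a in range(working_set)]
```

With a 16-line cache, ten sweeps over 8 lines miss only on the first pass, while ten sweeps over 17 lines miss on every access: one line too many is enough to expose memory latency everywhere.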

Page 8:

Make the food bowl larger

Data

Cache

Cache size steadily increasing

Implications

– Chip real estate reserved for cache

– Less space on chip for computes

– More power required for fewer FLOPS

Page 9:

Make the food bowl larger

But…

– Important application working sets are growing faster

– Multicore even more demanding on cache than uni-core

Page 10:

Chapter 2:

The Beast Has Babies

Page 11:

Power Density – The fundamental problem

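[Figure: power density in W/cm² (log scale, 1 to 1000) versus process generation (1.5 down to 0.07), with points for i386, i486, Pentium®, Pentium Pro®, Pentium II® and Pentium III®, and reference levels marked “Hot Plate” and “Nuclear Reactor”.]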

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32

Page 12:

What’s causing the problem?

[Figure: gate stack, 10S, Tox=11A]

Gate dielectric approaching a fundamental limit (a few atomic layers)

[Figure: active power and passive power density in W/cm² (log scale, 0.001 to 1000) versus gate length in microns (1 down to 0.01), 1994 to 2004, with 65 nm marked.]

Power, signal jitter, etc...

Page 13:

Diminishing Returns on Frequency

In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.

[Figure: clock speed in MHz (log scale, 10² to 10⁴) versus year, 1990 to 2010, with the frequency-driven design points marked.]

Page 14:

Power vs Performance Trade Offs

[Figure: relative power (0 to 5) versus relative performance.]

We need to adapt our algorithms to get performance out of multicore

Page 15:

Implications of Multicore

There are more mouths to feed

– Data movement will take center stage

Complexity of cores will stop increasing

… and has started to decrease in some cases

Complexity increases will center around communication

Assumption

– Achieving a significant % of peak performance is important

Page 16:

Chapter 3:

The Proper Care and Feeding of Hungry Beasts

Page 17:

Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W

Page 18:

Feeding the Cell Processor

8 SPEs each with

– LS

– MFC

– SXU

PPE

– OS functions

– Disk IO

– Network IO

[Figure: Cell/B.E. block diagram. One PPE (64-bit Power Architecture with VMX; PPU with PXU, L1 and L2) and eight SPEs (each an SPU with SXU, LS and MFC) sit on the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and the BIC (FlexIO™). Link widths shown: 16B/cycle, 16B/cycle (2x) and, at the L2, 32B/cycle.]

Page 19:

Cell Approach: Feed the beast more efficiently

Explicitly “orchestrate” the data flow between main memory and each SPE’s local store

– Use SPE’s DMA engine to gather & scatter data between main memory and local store

– Enables detailed programmer control of data flow

• Get/Put data when & where you want it

• Hides latency: Simultaneous reads, writes & computes

– Avoids restrictive HW cache management

• Unlikely to determine optimal data flow

• Potentially very inefficient

– Allows more efficient use of the existing bandwidth
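The “simultaneous reads, writes & computes” idea is usually realized as double buffering: start fetching the next chunk, then compute on the one that has already arrived. The sketch below only mimics that control flow in Python. `dma_get` is a hypothetical stand-in for an asynchronous DMA transfer (a background thread, not a real MFC command), and the names and chunking scheme are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(main_memory, start, size):
    """Stand-in for an asynchronous DMA transfer into local store."""
    return main_memory[start:start + size]


def process_double_buffered(main_memory, chunk, compute):
    """Apply compute() to each chunk while the next chunk is being fetched."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None
        for start in range(0, len(main_memory), chunk):
            # Kick off the next transfer before touching the current buffer.
            fetch = dma.submit(dma_get, main_memory, start, chunk)
            if pending is not None:
                results.append(compute(pending.result()))
            pending = fetch
        if pending is not None:
            results.append(compute(pending.result()))  # drain the last buffer
    return results
```

For example, `process_double_buffered(list(range(10)), 4, sum)` processes the chunks `[0..3]`, `[4..7]` and `[8, 9]` in order. On real hardware the benefit comes from the transfer engine running independently of the compute unit, which a Python thread does not reproduce.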

Page 20:

Cell Approach: Feed the beast more efficiently

BOTTOM LINE:

It’s all about the data!

Page 21:

Cell Comparison: ~4x the FLOPS @ ~½ the power

Both 65nm technology

(to scale)
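The headline ratios can be checked against the figures quoted earlier in the deck (Cell/B.E. at ~200 SP GFLOPS and ~70 W, Intel Tulsa at ~54.4 SP GFLOPS and 150 W). This is a small arithmetic sketch; the function name is hypothetical and the inputs are the approximate slide values.

```python
def compare(flops_a, watts_a, flops_b, watts_b):
    """Return (FLOPS ratio, power ratio, FLOPS-per-watt ratio) of chip A over chip B."""
    flops_ratio = flops_a / flops_b
    power_ratio = watts_a / watts_b
    efficiency_ratio = (flops_a / watts_a) / (flops_b / watts_b)
    return flops_ratio, power_ratio, efficiency_ratio


# Cell/B.E.: ~200 SP GFLOPS at ~70 W; Intel Tulsa: ~54.4 SP GFLOPS at 150 W.
cell_vs_tulsa = compare(200.0, 70.0, 54.4, 150.0)
```

The result is roughly 3.7x the FLOPS at about 0.47x the power, which works out to close to 8x the FLOPS per watt.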

Page 22:

Memory Managing Processor vs. Traditional General Purpose Processor

[Figure labels: IBM, AMD, Intel, Cell BE]

Page 23:

Examples of Feeding Cell

2D and 3D FFTs

Seismic Imaging

String Matching

Neural Networks (function approximation)

Page 24:

Feeding FFTs to Cell

[Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]

SIMDized data

DMAs double buffered

Pass 1: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer

Pass 2: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer
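The two passes are identical because a 2D FFT factors into 1D FFTs along rows, a transpose, 1D FFTs along the new rows, and a second transpose. The pure-Python sketch below shows only that structure. It does not model SIMD, tiling or DMA, the radix-2 routine and all function names are assumptions, and row lengths must be powers of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def fft_pass(image):
    """One pass: 1D FFT along every row, then transpose the result."""
    return transpose([fft(row) for row in image])


def fft2d(image):
    """Two identical passes give the 2D FFT in the original orientation."""
    return fft_pass(fft_pass(image))
```

Because each pass ends with a transpose, both passes read and write contiguous rows, which is what lets every buffer be moved with simple block transfers.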

Page 25:

3D FFTs

Long stride thrashes cache

Cell DMA allows prefetch

[Figure labels: Single Element, Data envelope, Stride 1, Stride N², N]
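Why long strides hurt can be seen by counting the cache lines one 1D FFT “pencil” touches along each axis of an N×N×N volume. The sketch assumes a row-major layout with x varying fastest (so the strides are 1, N and N²) and a hypothetical cache line holding 16 elements; the function names are illustrative.

```python
def flat_index(x, y, z, n):
    """Offset of element (x, y, z) in an n*n*n array with x varying fastest."""
    return x + n * y + n * n * z


def lines_touched(axis, n, elems_per_line):
    """Count distinct cache lines touched by one n-element pencil along an axis."""
    lines = set()
    for i in range(n):
        coords = [0, 0, 0]
        coords[axis] = i  # walk along the chosen axis only
        lines.add(flat_index(coords[0], coords[1], coords[2], n) // elems_per_line)
    return len(lines)
```

For N = 64 the stride-1 pencil touches 4 lines, while the stride-N and stride-N² pencils touch 64 lines each and use one element per line. An explicit DMA list can fetch exactly those elements ahead of time instead of dragging whole lines through a cache.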

Page 26:

Feeding Seismic Imaging to Cell

(X,Y)

New G at each (x,y)

Radial symmetry of G reduces BW requirements

Data

Green’s Function

$\sum_{i,j} D(x+i,\, y+j)\, G(x, y, i, j)$

Page 27:

Feeding Seismic Imaging to Cell

Data

SPE 0 SPE 1 SPE 2 SPE 3 SPE 4 SPE 5 SPE 6 SPE 7

Page 28:

Feeding Seismic Imaging to Cell

Page 29:

Feeding Seismic Imaging to Cell

For each X

– Load next column of data

– Load next column of indices

– For each Y

• Load Green’s functions

• SIMDize Green’s functions

• Compute convolution at (X,Y)

– Cycle buffers

[Figure labels: Data buffer, Green’s Index buffer, (X,Y), H, 2R+1, R, 1, 2]
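The loop above can be sketched as a spatially varying convolution in which an index grid selects the Green’s function used at each (X,Y). Several details are assumptions made for illustration: radial symmetry is modeled by storing each Green’s function as a table keyed by i² + j², points outside the data contribute zero, and the sliding column buffers and SIMD steps are not modeled. All names are hypothetical.

```python
def convolve_at(data, x, y, profile, radius):
    """Sum D(x+i, y+j) * G(i, j), where G depends only on i*i + j*j."""
    total = 0.0
    for i in range(-radius, radius + 1):
        for j in range(-radius, radius + 1):
            xi, yj = x + i, y + j
            # Assumption: samples outside the data contribute nothing.
            if 0 <= xi < len(data) and 0 <= yj < len(data[0]):
                total += data[xi][yj] * profile[i * i + j * j]
    return total


def image(data, index, profiles, radius):
    """Apply a possibly different Green's function at every (x, y)."""
    out = []
    for x in range(len(data)):           # for each X: next column of data and indices
        column = []
        for y in range(len(data[0])):    # for each Y: look up G, then convolve
            profile = profiles[index[x][y]]
            column.append(convolve_at(data, x, y, profile, radius))
        out.append(column)
    return out
```

Storing one value per distinct squared radius rather than a full (2R+1)×(2R+1) kernel is one way the radial symmetry of G can cut the bytes that must be loaded per output point.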

Page 30:

Feeding String Matching to Cell

Find (lots of) substrings in (long) string

Build graph of words & represent as DFA

Problem: Graph doesn’t fit in LS

Sample Word List:

“the”

“that”

“math”
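One common way to turn a word list into a matching automaton is the Aho-Corasick construction, sketched below for the sample words. The slides do not say which construction was used, so this choice is an assumption, and the function names are hypothetical. This version keeps failure links; a pure DFA would precompute every transition into a table, which is what makes the graph large.

```python
from collections import deque


def build_automaton(words):
    """Build goto edges, failure links and outputs for a set of words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def find_all(text, automaton):
    """Return (end_index, word) for every occurrence of every word in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((pos, word) for word in out[state])
    return hits
```

Every input character causes a data-dependent lookup into the graph, so once the graph no longer fits in local store each step can become a main-memory access.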

Page 31:

Feeding String Matching to Cell

Page 32:

Hiding Main Memory Latency

Page 33:

Software Multithreading

Page 34:

Feeding Neural Networks to Cell

Neural net function F(X)

– RBF, MLP, KNN, etc.

If too big for LS, BW Bound

N Basis functions: dot product + nonlinearity

D Input dimensions

DxN Matrix of parameters

Output F

X

Page 35:

Convert BW Bound to Compute Bound

Split function over multiple SPEs

Avoids unnecessary memory traffic

Reduce compute time per SPE

Minimal merge overhead

Merge
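Splitting the function means each SPE keeps only its own slice of the D×N parameter matrix resident and returns a partial result, so parameters are not re-streamed from main memory for every input. The sketch below assumes F(X) is a plain sum of tanh(dot product) basis functions and runs the “workers” sequentially. Both are simplifications for illustration, and the names are hypothetical.

```python
import math


def basis_sum(params, x, columns):
    """Sum tanh(dot(params[:, n], x)) over the given basis-function columns."""
    total = 0.0
    for n in columns:
        dot = sum(params[d][n] * x[d] for d in range(len(x)))
        total += math.tanh(dot)
    return total


def split_evaluate(params, x, workers):
    """Give each worker a contiguous slice of basis functions, then merge."""
    n_basis = len(params[0])
    per_worker = (n_basis + workers - 1) // workers  # ceiling division
    partials = []
    for w in range(workers):
        columns = range(w * per_worker, min((w + 1) * per_worker, n_basis))
        partials.append(basis_sum(params, x, columns))  # one slice per SPE
    return sum(partials)  # merge step: one small reduction over the partials
```

The merge touches only one number per worker, which is why its overhead stays small compared with the per-slice work.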

Page 36:

Moral of the Story: It’s All About the Data!

The data problem is growing: multicore

Intelligent software prefetching

– Use DMA engines

– Don’t rely on HW prefetching

Efficient data management

– Multibuffering: Hide the latency!

– BW utilization: Make every byte count!

– SIMDization: Make every vector count!

– Problem/data partitioning: Make every core work!

– Software multithreading: Keep every core busy!

Page 37:

Backup

Page 38:

Abstract

Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore processor path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers and how the Cell/B.E. processor’s design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.