Parallel Computers Today

LANL / IBM Roadrunner: > 1 PFLOPS

Two Nvidia 8800 GPUs: > 1 TFLOPS

Intel 80-core chip: > 1 TFLOPS

TFLOPS = 10^12 floating point ops/sec

PFLOPS = 10^15 floating point ops/sec (1,000,000,000,000,000 / sec)
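To make the two units concrete, the short sketch below converts a fixed amount of work into ideal run times at each rate. The workload size and the helper name `seconds_to_run` are illustrative assumptions, and the result is a best case that assumes the machine sustains its peak rate.

```python
# Unit definitions from the slide: floating point operations per second.
TFLOPS = 10**12
PFLOPS = 10**15

def seconds_to_run(total_ops, rate_flops):
    """Ideal run time if the machine sustained its peak rate (an optimistic bound)."""
    return total_ops / rate_flops

# Hypothetical workload: 10**15 floating point operations.
workload = 10**15
t_tera = seconds_to_run(workload, 1 * TFLOPS)  # time on a 1 TFLOPS machine
t_peta = seconds_to_run(workload, 1 * PFLOPS)  # time on a 1 PFLOPS machine
print(t_tera, t_peta)
```

A petaflop machine is a factor of 1000 faster than a teraflop machine on paper; real codes rarely reach either peak.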

Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)

Beowulf (18-processor cluster, lab machine)

http://beowulf.lcs.mit.edu/18.337/beowulf.html

AMD Opteron quad-core die

The nVidia G80 GPU

• 128 streaming floating point processors @ 1.5 GHz

• 1.5 GB shared RAM with 86 GB/s bandwidth

• 500 GFLOPS on one chip (single precision)
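The ratio of peak arithmetic rate to memory bandwidth says how much work a kernel must do per byte to keep the chip busy. The sketch below computes that ratio from the slide's figures, reading the quoted bandwidth as 86 gigabytes per second (an assumption). The roofline-style bound and the vector-add kernel are illustrative additions, not part of the slide.

```python
# Figures quoted on the slide for the G80 (single precision).
peak_flops = 500e9     # 500 GFLOPS
mem_bandwidth = 86e9   # assumed to mean 86 gigabytes per second
bytes_per_float = 4    # single precision

# Flops the chip can perform in the time it takes to move one byte.
flops_per_byte = peak_flops / mem_bandwidth
# Flops needed per float loaded from memory to keep the processors busy.
flops_per_float = flops_per_byte * bytes_per_float

def attainable_flops(intensity):
    """Roofline-style bound: min(peak, bandwidth * flops per byte moved)."""
    return min(peak_flops, mem_bandwidth * intensity)

# Hypothetical kernel: c[i] = a[i] + b[i] does 1 flop per 12 bytes moved.
vector_add_bound = attainable_flops(1 / 12)
print(round(flops_per_byte, 1), round(flops_per_float, 1), vector_add_bound / 1e9)
```

Under these assumptions a kernel needs roughly six flops per byte of memory traffic to be compute bound; a streaming vector add falls far short of that and is limited by bandwidth.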

The Computer Architecture Challenge

Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.

Originally, because linear algebra is the middleware of scientific computing.

Nowadays, mostly for bragging rights.

P A = L U (row permutation P, lower triangular L, upper triangular U)

• Top 500 List

• http://www.top500.org/list/2008/11/100
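Gaussian elimination with row pivoting, the P A = L U factorization on this slide, can be sketched in a few lines. This is a minimal unblocked, pure-Python version for illustration only; the function names `lu_partial_pivot` and `matmul` and the 3x3 test matrix are assumptions, and production benchmarks use blocked, parallel implementations instead.

```python
def lu_partial_pivot(a):
    """Return (perm, L, U) with P A = L U; row i of P A is row perm[i] of A."""
    n = len(a)
    u = [row[:] for row in a]              # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda r: abs(u[r][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]        # swap multipliers computed so far
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]          # multiplier, stored in L
            l[i][k] = m
            u[i][k] = 0.0                  # eliminated entry
            for j in range(k + 1, n):
                u[i][j] -= m * u[k][j]
    return perm, l, u

def matmul(x, y):
    """Plain triple-loop matrix product, used only to check the result."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

a = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
perm, l, u = lu_partial_pivot(a)
pa = [a[i] for i in perm]                  # apply the row permutation P to A
lu = matmul(l, u)
max_error = max(abs(pa[i][j] - lu[i][j]) for i in range(3) for j in range(3))
print(perm, max_error)
```

Dense LU costs about (2/3) n^3 floating point operations on an n x n matrix, which is why a benchmark built on it mostly rewards raw floating point rate.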

Generic Parallel Machine Architecture

• Key architecture question: Where is the interconnect, and how fast?

• Key algorithm question: Where is the data?

[Figure: three processors, each with its own storage hierarchy (Proc, Cache, L2 Cache, L3 Cache, Memory), with potential interconnects between the hierarchies.]

Multicore SMP Systems

Intel Clovertown: eight Core2 cores; each pair of cores shares a 4 MB L2; two front-side buses (FSB) at 10.6 GB/s each; chipset (4x64b controllers); fully buffered DRAM at 21.3 GB/s (read), 10.6 GB/s (write).

Sun Niagara2: MT UltraSparc cores, each with an FPU and 8K D$; crossbar switch at 179 GB/s (fill), 90 GB/s (write-through); 4 MB shared L2 (16-way); 4x128b FBDIMM memory controllers; fully buffered DRAM at 42.7 GB/s (read), 21.3 GB/s (write).

AMD Opteron: two sockets, each with two Opteron cores (1 MB victim cache per core) and a memory controller / HT interface; DDR2 DRAM at 10.6 GB/s per socket; 4 GB/s (each direction) between sockets.

IBM Cell Blade: two Cell chips, each with one PPE (512K L2) and eight SPEs (256K each, with an MFC) on the EIB (ring network); XDR DRAM at 25.6 GB/s per chip; BIF link between chips at <<20 GB/s each direction.
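The DRAM bandwidths quoted for these four systems set a floor on how fast any memory-bound code can run. The sketch below turns them into a lower bound on the time to read a working set once. The 1 GB working set is hypothetical, and summing the per-socket figures for the Opteron and Cell systems assumes data is spread evenly across both sockets.

```python
# Peak DRAM bandwidths quoted in the diagrams, in GB/s (read figures where given).
# Assumption: per-socket figures (Opteron, Cell) are summed over the two sockets.
read_bandwidth_gbs = {
    "Intel Clovertown": 21.3,
    "Sun Niagara2": 42.7,
    "AMD Opteron": 2 * 10.6,
    "IBM Cell Blade": 2 * 25.6,
}

def stream_seconds(gigabytes, bandwidth_gbs):
    """Lower bound on the time to read `gigabytes` once from DRAM."""
    return gigabytes / bandwidth_gbs

working_set_gb = 1.0  # hypothetical array size
times = {name: stream_seconds(working_set_gb, bw)
         for name, bw in read_bandwidth_gbs.items()}
fastest = min(times, key=times.get)

for name, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{name}: {t * 1e3:.1f} ms")
```

These are peak figures; sustained bandwidth on real code is typically lower and depends on where the data sits relative to the core that uses it.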

More Detail on GPU Architecture

• Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts

• http://www.csm.ornl.gov/workshops/HPA/documents/1-arch/feeding_the_beast_perrone.pdf

Cray XMT (highly multithreaded shared memory)

[Figure: Cray XMT execution model. Programs running in parallel (serial code; Subproblem A with iterations i = 1 ... n; Subproblem B with iterations i = 0 ... n) are broken into concurrent threads of computation. Threads are mapped onto hardware streams (128), some of which may be unused. Ready instructions enter the instruction ready pool and then the pipeline of executing instructions.]
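The point of having many hardware streams is to keep the pipeline full while individual threads wait on memory. The toy model below illustrates that idea; it is not a description of the actual Cray XMT pipeline. It assumes each stream issues one instruction and is then unavailable for a fixed number of cycles, and the latency value and function name `utilization` are hypothetical.

```python
def utilization(active_streams, latency_cycles):
    """Fraction of cycles in which some stream has an instruction ready.

    Toy model: each stream issues one instruction, then waits
    `latency_cycles` cycles before it is ready again.
    """
    return min(1.0, active_streams / (1 + latency_cycles))

HARDWARE_STREAMS = 128   # from the slide
ASSUMED_LATENCY = 100    # hypothetical wait per instruction, in cycles

results = {s: utilization(s, ASSUMED_LATENCY) for s in (1, 16, 64, HARDWARE_STREAMS)}
for streams, frac in results.items():
    print(streams, round(frac, 2))
```

In this model a single thread leaves the processor idle about 99% of the time, while enough concurrent threads hide the latency entirely, which is why unused streams represent lost throughput.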
