
Parallel Computers Today

Dec 30, 2015


tristessa-leon

Transcript
Page 1: Parallel Computers Today

Parallel Computers Today

LANL / IBM Roadrunner: > 1 PFLOPS

Two Nvidia 8800 GPUs: > 1 TFLOPS

Intel 80-core chip: > 1 TFLOPS

TFLOPS = 10^12 floating point ops/sec

PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15)
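To make the two units concrete, the short sketch below converts between them and estimates run time at each rate. The workload of 10^15 operations and the helper name `seconds_for` are hypothetical, chosen only for illustration, and the rates are treated as sustained, which real machines rarely achieve.

```python
# Unit definitions from the slide.
TFLOPS = 10**12  # floating point operations per second
PFLOPS = 10**15  # floating point operations per second


def seconds_for(flop_count, rate):
    """Time to execute flop_count operations at a sustained rate (ops/sec)."""
    return flop_count / rate


# Hypothetical workload: 10^15 floating point operations.
work = 10**15
print("PFLOPS / TFLOPS =", PFLOPS // TFLOPS)
print("seconds at 1 TFLOPS:", seconds_for(work, TFLOPS))
print("seconds at 1 PFLOPS:", seconds_for(work, PFLOPS))
```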

Page 2: Parallel Computers Today

Columbia (10,240-processor SGI Altix, 50 teraflops, NASA Ames Research Center)

Page 3: Parallel Computers Today

Beowulf (18-processor cluster, lab machine)

Page 4: Parallel Computers Today

AMD Opteron quad-core die

Page 5: Parallel Computers Today

The nVidia G80 GPU

• 128 streaming floating point processors @ 1.5 GHz
• 1.5 GB shared RAM with 86 GB/s bandwidth
• 500 GFLOPS on one chip (single precision)
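The slide's figures can be related with a little arithmetic. The sketch below is a rough estimate, not a hardware specification: it reads the bandwidth as 86 gigabytes per second, and the number of flops each processor retires per cycle (2 or 3) is an assumption for illustration that does not appear on the slide. The helper names are hypothetical.

```python
# Figures quoted on the slide for the G80.
processors = 128
clock_hz = 1.5e9
quoted_peak_flops = 500e9       # single precision
memory_bandwidth_bytes = 86e9   # bytes per second (assumed reading of the slide)


def peak_flops(flops_per_cycle):
    """Peak rate if every processor retires flops_per_cycle each clock (assumption)."""
    return processors * clock_hz * flops_per_cycle


def flops_per_byte(peak, bandwidth):
    """Operations needed per byte of memory traffic to keep the chip busy."""
    return peak / bandwidth


print("peak at 2 flops/cycle (GFLOPS):", peak_flops(2) / 1e9)
print("peak at 3 flops/cycle (GFLOPS):", peak_flops(3) / 1e9)
print("flops per byte at the quoted peak:",
      round(flops_per_byte(quoted_peak_flops, memory_bandwidth_bytes), 1))
```

Under these assumptions the quoted 500 GFLOPS falls between the two estimates, and a kernel needs several floating point operations per byte fetched from memory before arithmetic rather than bandwidth becomes the limit.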

Page 6: Parallel Computers Today

The Computer Architecture Challenge

Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.

Originally, because linear algebra is the middleware of scientific computing.

Nowadays, mostly for bragging rights.

P × A = L × U
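The factorization P A = L U is what Gaussian elimination with partial pivoting computes. The following is a minimal, unoptimized sketch in plain Python, not the blocked algorithm that benchmark codes use; the function names `lu_partial_pivot` and `matmul` and the 3x3 test matrix are hypothetical and chosen for illustration.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P A = L U.

    Returns (perm, L, U): perm[i] is the row of A that ends up in row i,
    L is unit lower triangular, and U is upper triangular.
    """
    n = len(a)
    u = [list(map(float, row)) for row in a]   # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))

    for k in range(n):
        # Partial pivoting: largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        # Eliminate column k from the rows below the pivot.
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return perm, l, u


def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
perm, L, U = lu_partial_pivot(A)
PA = [A[i] for i in perm]   # apply the row permutation to A
LU = matmul(L, U)
max_error = max(abs(PA[i][j] - LU[i][j]) for i in range(3) for j in range(3))
print("row order:", perm)
print("max |PA - LU|:", max_error)
```

The triple loop performs on the order of n^3 floating point operations on n^2 data, which is one reason dense elimination suits machines built for high arithmetic rates.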

Page 7: Parallel Computers Today

• Top 500 List

• http://www.top500.org/list/2008/11/100

Page 8: Parallel Computers Today

Generic Parallel Machine Architecture

• Key architecture question: Where is the interconnect, and how fast?

• Key algorithm question: Where is the data?

[Figure: three processor nodes, each with its own storage hierarchy (Proc, Cache, L2 Cache, L3 Cache, Memory), linked by potential interconnects.]

Page 9: Parallel Computers Today

Multicore SMP Systems

[Figure: block diagrams of four multicore SMP systems]

• Intel Clovertown: two sockets, each with four Core2 cores and two 4MB shared L2 caches (one per pair of cores); each socket connects over a 10.6 GB/s FSB to the chipset (4x64b controllers); Fully Buffered DRAM at 21.3 GB/s (read), 10.6 GB/s (write)
• AMD Opteron: two sockets, each with two Opteron cores (1MB victim cache each) and a memory controller / HT; DDR2 DRAM at 10.6 GB/s per socket; 4 GB/s (each direction) between sockets
• Sun Niagara2: eight MT UltraSparc cores, each with an FPU and 8K D$; crossbar switch at 179 GB/s (fill), 90 GB/s (writethru); 4MB shared L2 (16 way); 4x128b FBDIMM memory controllers; Fully Buffered DRAM at 42.7 GB/s (read), 21.3 GB/s (write)
• IBM Cell Blade: two chips, each with one PPE (512K L2) and eight SPEs (256K each, with MFC) on the EIB (ring network); XDR DRAM at 25.6 GB/s per chip; BIF between chips at <<20 GB/s each direction
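One way to compare the four systems in the figure is DRAM bandwidth per core. The sketch below is a rough estimate under stated assumptions: core counts are read off the diagram (8 Core2, 4 Opteron, 8 UltraSparc, 16 SPEs), per-socket or per-chip bandwidths are summed across both sockets, and read bandwidth is used where the figure separates read from write. The helper name `per_core` is hypothetical.

```python
# (system, cores counted in the figure, aggregate DRAM bandwidth in GB/s)
# Summing per-socket figures over both sockets is an assumption.
systems = [
    ("Intel Clovertown", 8, 21.3),       # read bandwidth through the chipset
    ("AMD Opteron", 4, 2 * 10.6),        # 10.6 GB/s DDR2 per socket
    ("Sun Niagara2", 8, 42.7),           # read bandwidth
    ("IBM Cell Blade", 16, 2 * 25.6),    # counting the 16 SPEs, not the 2 PPEs
]


def per_core(bandwidth_gbs, cores):
    """Bandwidth available to each core if it is shared evenly."""
    return bandwidth_gbs / cores


table = {name: round(per_core(bw, cores), 2) for name, cores, bw in systems}
for name, value in table.items():
    print(f"{name}: {value} GB/s per core")
```

Even sharing is a simplification; real contention depends on the interconnect between cores and memory, which is the architecture question the earlier slide raises.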

Page 10: Parallel Computers Today

More Detail on GPU Architecture

Page 11: Parallel Computers Today

• Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts

• http://www.csm.ornl.gov/workshops/HPA/documents/1-arch/feeding_the_beast_perrone.pdf

Page 12: Parallel Computers Today

Cray XMT (highly multithreaded shared memory)

[Figure: programs running in parallel (serial code; subproblem A with iterations i = 1, 2, 3, ..., n; subproblem B with iterations i = 0, 1, ..., n) are divided into concurrent threads of computation, which are mapped onto hardware streams (128), some of them unused. The streams feed an instruction ready pool and a pipeline of executing instructions.]