Slide 1 IRAM and ISTORE Projects Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson http://iram.cs.berkeley.edu/[istore] Winter 2000 IRAM/ISTORE Retreat
• All numbers in cycles/pixel
• MMX and VIS results assume all data in L1 cache
Slide 13
Scaling to 10K Processors
• IRAM + micro-disk offer huge scaling opportunities
• Still many hard system problems: SAM, AME (talk)
– Availability
» 24x7 databases without human intervention
» Discrete vs. continuous model of machine being up
– Maintainability
» 42% of system failures are due to administrative errors
» Self-monitoring, tuning, and repair
– Evolution
» Dynamic scaling with plug-and-play components
» Scalable performance, gracefully down as well as up
» Machines become heterogeneous in performance at scale
Slide 14
Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
– intelligence used to collect and filter monitoring data
– diagnostics and fault injection enhance robustness
– networked to create a scalable shared-nothing cluster
Intelligent Chassis: 80 nodes, 8 per tray; 2 levels of switches
• 20 100 Mb/s
• 2 1 Gb/s
Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...
Intelligent Disk “Brick”: portable PC processor (Pentium II) + DRAM
• Sensors for heat and vibration
• Control over power to individual nodes
Slide 16
ISTORE Software Approach
• Two-pronged approach to providing reliability:
1) Reactive self-maintenance: dynamic reaction to exceptional system events
» self-diagnosing, self-monitoring hardware
» software monitoring and problem detection
» automatic reaction to detected problems
2) Proactive self-maintenance: continuous online self-testing and self-analysis
» automatic characterization of system components
» in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used
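The reactive loop above (monitor, detect, react) can be sketched in a few lines. This is an illustrative model only: the `HealthSample` record, thresholds, and action names are hypothetical, not ISTORE's actual interfaces.

```python
# Hypothetical sketch of reactive self-maintenance: collect monitoring
# data, detect problems in software, react automatically.
from dataclasses import dataclass

@dataclass
class HealthSample:
    node: int
    disk_errors: int      # recoverable disk errors since last sample
    temperature_c: float  # enclosure temperature

def detect_problems(sample, max_errors=5, max_temp_c=55.0):
    """Software problem detection over self-monitoring hardware data."""
    problems = []
    if sample.disk_errors > max_errors:
        problems.append("flaky-disk")
    if sample.temperature_c > max_temp_c:
        problems.append("overheating")
    return problems

def react(node, problems):
    """Automatic reaction: map each detected problem to an action."""
    actions = {"flaky-disk": "migrate-data-and-retire-disk",
               "overheating": "throttle-or-power-down-node"}
    return [(node, actions[p]) for p in problems]

# One pass of the reactive loop over incoming samples:
samples = [HealthSample(0, 1, 40.0), HealthSample(1, 9, 58.0)]
reactions = []
for s in samples:
    reactions.extend(react(s.node, detect_problems(s)))
```

The proactive prong would sit beside this loop, injecting faults and scrubbing even when no sample crosses a threshold.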
Slide 17
ISTORE Applications
• Storage-intensive, reliable services for ISTORE-1
– infrastructure for “thin clients,” e.g., PDAs
– web services, such as mail and storage
– large-scale databases (talk)
– information retrieval (search and on-the-fly indexing)
• Scalable memory-intensive computations for ISTORE in 2006
– performance estimates through IRAM simulation + model (not a major emphasis)
– large-scale defense and scientific applications enabled by high memory bandwidth and arithmetic performance
Slide 18
Performance Availability
• System performance limited by the weakest link
• NOW-Sort experience: performance heterogeneity is the norm
– disks: inner vs. outer track (50%), fragmentation
– processors: load (1.5-5x) and heat
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones
[Chart: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of a single slow disk (100%, 67%, 39%, 29%); series: Ideal, Virtual Streams, Static]
Slide 19
ISTORE Update
• High-level hardware design by UCB complete (talk)
– Design of ISTORE boards handed off to Anigma
» First run complete; SCSI problem to be fixed
» Testing of UCB design (DP) to start asap
» 10 nodes by end of 1Q 2000, 80 by 2Q 2000
» Design of BIOS handed off to AMI
– Most parts donated or discounted
» Adaptec, Andataco, IBM, Intel, Micron, Motorola, Packet Engines
• Proposal for quantifying AME (talk)
• Beginning work on short-term applications that will be used to drive principled system design:
» Mail server
» Web server
» Large database
» Decision support primitives
Slide 20
Conclusions
• IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
– Mobile consumer electronic devices
– Scalable infrastructure
• IRAM benchmarking result: faster than DSPs
• ISTORE: hardware/software architecture for large-scale network services
• Scaling systems requires
– new continuous models of availability
– performance not limited by the weakest link
– self-* systems to reduce human interaction
Slide 21
Backup Slides
Slide 22
Introduction and Ground Rules
• Who is here? Mixed IRAM/ISTORE “experience”
• Questions are welcome during talks
• Schedule: lecture from Brewster Kahle during Thursday’s Open Mic Session
• Feedback is required (Fri am)
– Be careful, we have been known to listen to you
• Mixed experience: please ask
• Time for skiing and talking tomorrow afternoon
Slide 23
2006 ISTORE
• ISTORE node
– Add 20% pad to MicroDrive size for packaging, connectors
– Then double thickness to add IRAM
– 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm)
• Crossbar switches growing by Moore’s Law
– 2x transistors/1.5 yrs = 4X transistors/3 yrs
– Crossbars grow by N^2, so 2X switch size/3 yrs
– 16 x 16 in 1999, 64 x 64 in 2005
• ISTORE rack (19” x 33” x 84”), 1 tray (3” high): 16 x 32 = 512 ISTORE nodes / tray
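The scaling arithmetic above can be checked directly; a quick sketch using only the slide's own numbers:

```python
# Quick check of the slide's packaging and crossbar scaling arithmetic
# (a sketch using the slide's numbers, not part of any ISTORE tool).
nodes_per_tray = 16 * 32            # node grid in one 3"-high tray
assert nodes_per_tray == 512        # "512 ISTORE nodes / tray"

# Crossbar ports: 4X transistors / 3 yrs buys only 2X switch size / 3 yrs,
# because an N x N crossbar costs ~N^2 in transistors.
ports = 16                          # 16 x 16 in 1999
for _ in range(2):                  # two 3-year steps: 1999 -> 2005
    ports *= 2
print(ports)                        # 64, i.e. 64 x 64 in 2005
```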
• IDEA decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF
• Note scalability of both # lanes and data width
• Some hand-optimizations (unrolling) will be automated by the Cray compiler
[Chart: performance vs. # lanes and virtual processor width]
Slide 25
1D FFT on IRAM
• FFT study on IRAM
– bit-reversal time included; cost hidden using indexed store
– Faster than DSPs on floating-point (32-bit) FFTs
– CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)
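The bit-reversal step whose cost is hidden by the indexed store is just a permutation; a minimal sketch in plain Python (illustrative, not VIRAM code):

```python
def bit_reverse_permute(x):
    """Reorder x (length 2^k) into bit-reversed index order.

    On VIRAM the slide hides this cost behind an indexed (scatter)
    store; here it is an explicit index permutation.
    """
    n = len(x)
    k = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        # reverse the k-bit binary representation of i
        r = int(format(i, f"0{k}b")[::-1], 2)
        out[r] = x[i]          # "indexed store": x[i] lands at position r
    return out

print(bit_reverse_permute([0, 1, 2, 3, 4, 5, 6, 7]))
# indices 000..111 reversed -> [0, 4, 2, 6, 1, 5, 3, 7]
```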
Slide 26
3D FFT on ISTORE 2006
• Performance of large 3D FFTs depends on 2 factors
– speed of 1D FFT on a single node (next slide)
– network bandwidth for “transposing” data
• 1.3 Tflop FFT possible w/ 1K IRAM nodes, if network bisection bandwidth scales (!)
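The role of the "transpose" is visible in the standard decomposition of a 3D FFT into 1D transforms along each axis. The sketch below uses a naive O(n^2) DFT as a stand-in for the fast per-node 1D FFT; the axis rotation is the all-to-all step whose cost a cluster's bisection bandwidth bounds.

```python
import cmath

def dft1(x):
    """Naive 1D DFT (O(n^2)); stands in for the fast per-node 1D FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

def rotate_axes(a):
    """(x, y, z) -> (z, x, y): on a cluster this is the all-to-all
    'transpose' limited by network bisection bandwidth."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    return [[[a[x][y][z] for y in range(ny)] for x in range(nx)]
            for z in range(nz)]

def fft3d(a):
    """3D DFT = (1D DFT along the local axis, then transpose) x 3."""
    for _ in range(3):
        a = [[dft1(row) for row in plane] for plane in a]
        a = rotate_axes(a)
    return a

# Sanity check: a unit impulse at the origin transforms to all ones.
n = 4
imp = [[[1.0 if (x, y, z) == (0, 0, 0) else 0.0
         for z in range(n)] for y in range(n)] for x in range(n)]
out = fft3d(imp)
```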
Slide 27
ISTORE-1 System Layout
[Diagram: racks built from stacked brick shelves]
Slide 28
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Block diagram: a 2-way superscalar processor with 16K I-cache and 16K D-cache feeds an instruction queue, vector registers, and vector units (+, x, ÷, load/store), each configurable as 4 x 64b, 8 x 32b, or 16 x 16b; a memory crossbar switch with 4 x 64b paths connects to the on-chip DRAM banks and to I/O ports of 100 MB each]
Slide 29
Fixed-point multiply-add model
• Same basic model, different set of instructions
– fixed-point: multiply & shift & round, shift right & round, shift left & saturate
– integer saturated arithmetic: add or sub & saturate
– added multiply-add instruction for improved performance and energy consumption
[Diagram: multiply half-words x and y (n/2 bits each), shift & round the product to n bits, then add w and saturate (satRound) to produce the n-bit result z]
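The datapath above can be sketched in scalar form. This is a minimal model: the Q15 format, round-to-nearest choice, and function names are illustrative, not VIRAM's exact instruction definitions.

```python
def sat(v, n=16):
    """Saturate v to the signed n-bit range (the '& saturate' step)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, v))

def fx_mul(a, b, frac=15):
    """Multiply & shift & round: take the full-precision product, then
    round-to-nearest on the right shift back to the fixed-point format."""
    return (a * b + (1 << (frac - 1))) >> frac

def fx_madd(x, y, w, frac=15):
    """The added multiply-add: z = sat(round((x * y) >> frac) + w)."""
    return sat(fx_mul(x, y, frac) + w)

half = 1 << 14                      # 0.5 in Q15
print(fx_mul(half, half))           # 0.25 in Q15 -> 8192
print(fx_madd(half, half, 30000))   # 8192 + 30000 saturates to 32767
```

Fusing the multiply, round, add, and saturate into one instruction saves an instruction issue and a register write per element, which is where the performance and energy benefit comes from.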
Slide 30
Other ISA modifications
• Auto-increment loads/stores
– a vector load/store can post-increment its base address
– added base (16), stride (8), and increment (8) registers
– necessary for applications with short vectors or scaled-up implementations
• Butterfly permutation instructions
– perform a step of a butterfly permutation within a vector register
– used for FFT and reduction operations
• Miscellaneous instructions added
– min and max instructions (integer and FP)
– FP reciprocal and reciprocal square root
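One plausible reading of a butterfly permutation step is an exchange of elements whose indices differ in a single bit; the sketch below illustrates that idea and is not the exact VIRAM instruction semantics.

```python
def butterfly_step(v, stage):
    """One butterfly-permutation step within a 'vector register':
    exchange elements whose indices differ in bit `stage`, i.e. at
    distance 2**stage. FFT butterflies and tree reductions pair
    elements exactly this way."""
    d = 1 << stage
    return [v[i ^ d] for i in range(len(v))]

print(butterfly_step([0, 1, 2, 3, 4, 5, 6, 7], 1))
# partners at distance 2 -> [2, 3, 0, 1, 6, 7, 4, 5]
```

Doing this inside a vector register avoids a round trip through memory between FFT stages, which is the point of adding the instruction.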
Slide 31
Major architecture updates
• Integer arithmetic units support multiply-add instructions
• 1 load-store unit
– complexity vs. benefit
• Optimize for strides 2, 3, and 4
– useful for complex arithmetic and image processing functions
• Decoupled strided and indexed stores
– memory stalls due to bank conflicts do not stall the arithmetic pipelines
– allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls
– implemented with address, not data, buffering
– currently examining a similar optimization for loads
Slide 32
Micro-kernel results: simulated systems

Parameter                                   1-Lane  2-Lane  4-Lane  8-Lane
# of 64-bit lanes                              1       2       4       8
Addresses/cycle for strided-indexed access     1       2       4       8
Crossbar width                                64b    128b    256b    512b
Width of DRAM bank interface                  64b    128b    256b    512b
DRAM banks                                     8       8       8       8

• Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4
Slide 33
Micro-kernels

Benchmark                       Operations   Data   Memory                Other
                                Type         Width  Accesses              Comments
Image Composition (Blending)    Integer      16b    Unit-stride
2D iDCT (8x8 image blocks)      Integer      16b    Unit-stride, Strided
Color Conversion (RGB to YUV)   Integer      32b    Unit-stride
Image Convolution               Integer      32b    Unit-stride
Matrix-vector Multiply (MV)     Integer, FP  32b    Unit-stride           Uses reductions
Vector-matrix Multiply (VM)     Integer, FP  32b    Unit-stride

• Vectorization and scheduling performed manually
Slide 34
Scaled system results
• Near-linear speedup for all applications apart from iDCT
• iDCT bottlenecks:
– large number of bank conflicts
– 4 addresses/cycle for strided accesses
[Chart: speedup (0-8) on 1, 2, 4, and 8 lanes for Compositing, iDCT, Color Conversion, Convolution, MxV INT (32), VxM INT (32), MxV FP (32), VxM FP (32)]
Slide 35
iDCT scaling with sub-banks
• Sub-banks reduce bank conflicts and increase performance
• Alternative (but not as effective) ways to reduce conflicts:
– different memory layout
– different address interleaving schemes
[Chart: speedup (0-8) on 1, 2, 4, and 8 lanes with 1, 2, 4, and 8 sub-banks]
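A toy model shows why extra sub-banks help the iDCT's power-of-two strides: under low-order interleaving, a stride equal to the bank count hits a single bank, and each doubling of (sub-)banks doubles the banks the stream can spread across. This is an illustrative model, not the simulator's actual memory system.

```python
def banks_touched(stride, n_banks, n_accesses=64):
    """Distinct (sub-)banks hit by a strided stream when the bank is
    chosen by word-address mod n_banks (low-order interleaving)."""
    return len({(i * stride) % n_banks for i in range(n_accesses)})

# Stride 8 (natural for 8x8 iDCT blocks) piles onto one bank of eight,
# but spreads across more banks as sub-banking increases:
for nb in (8, 16, 32, 64):
    print(nb, banks_touched(8, nb))   # 1, 2, 4, 8 banks touched
```

Different memory layouts or interleaving schemes attack the same problem by changing the address-to-bank mapping instead of adding banks.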
Slide 36
Compiling for VIRAM
• Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
• Needs to handle a significant class of applications
– IRAM: multimedia, graphics, speech and image processing
– ISTORE: databases, signal processing, other DIS benchmarks
• Needs to utilize hardware features for performance
– IRAM: vectorization
– ISTORE: scalability of shared-nothing
Runtime System Software
• Demonstrate simple policy-driven adaptation
– within the context of a single OS and application
– software monitoring information collected and processed in real time
» e.g., health & performance parameters of OS, application
– problem detection and coordination of reaction
» controlled by a stock set of configurable policies
– application-level adaptation mechanisms
» invoked to implement reaction
• Use experience to inform ISTORE API design
• Investigate reinforcement learning as a technique to infer appropriate reactions from goals
Slide 44
Record-breaking performance is not the common case
• NOW-Sort records demonstrate peak performance
• But perturb just 1 of 8 nodes and...
[Chart: slowdown (0-5x) for best case vs. bad disk layout, busy disk, light CPU load, heavy CPU load, and paging]
Slide 45
Virtual Streams: dynamic load balancing for I/O
• Replicas of data serve as second sources
• Maintain a notion of each process’s progress
• Arbitrate use of disks to ensure equal progress
• The right behavior, but what mechanism?
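The arbitration idea can be sketched as a greedy scheduler: each free disk serves the least-advanced process that has a replica on it. The names and the greedy rule are illustrative; the slide deliberately leaves the mechanism open.

```python
def schedule_reads(progress, replicas, free_disks):
    """One arbitration round of a Virtual-Streams-style scheduler (sketch).

    progress:   {process: blocks read so far}
    replicas:   {process: set of disks holding its data (second sources)}
    free_disks: disks with a free service slot this round

    Grants each free disk to the least-advanced process it can serve,
    so faster disks off-load work from slower ones and progress evens out.
    """
    grants = {}
    for disk in free_disks:
        eligible = [p for p in progress
                    if disk in replicas[p] and p not in grants.values()]
        if eligible:
            # equalize progress: serve the process furthest behind
            grants[disk] = min(eligible, key=lambda p: progress[p])
    return grants

progress = {"A": 10, "B": 4}            # blocks each process has read
replicas = {"A": {0, 1}, "B": {1}}      # B's data replicated only on disk 1
grants = schedule_reads(progress, replicas, free_disks=[1, 0])
print(grants)                            # disk 1 serves lagging B; disk 0 serves A
```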
• ISTORE: a hardware/software architecture for building scalable, self-maintaining storage
– An introspective system: it monitors itself and acts on its observations
• Self-maintenance: does not rely on administrators to configure, monitor, or tune the system
Slide 50
Self-maintenance
• Failure management
– devices must fail fast without interrupting service
– predict failures and initiate replacement
– failures require no immediate human intervention
• System upgrades and scaling
– new hardware automatically incorporated without interruption
– new devices immediately improve performance or repair failures
• Performance management
– system must adapt to changes in workload or access patterns
Slide 51
ISTORE-I: 2H99
• Intelligent disk
– Portable PC hardware: Pentium II, DRAM
– Low-profile SCSI disk (9 to 18 GB)
– 4 100-Mbit/s Ethernet links per node
– Placed inside half-height canister
– Monitor processor/path to power off components?
• Intelligent Chassis
– 64 nodes: 8 enclosures, 8 nodes/enclosure
» 64 x 4 = 256 Ethernet ports
– 2 levels of Ethernet switches: 14 small, 2 large
» Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit
» Just for prototype; crossbar chips for real system
– Enclosure sensing, UPS, redundant PS, fans, ...
Slide 52
Disk Limit
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk:

Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
1999   35 minutes     1 week (!)

• Does the 3.5” form factor make sense in 5-7 years?
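The table's trend follows from capacity/bandwidth (sequential) vs. per-sector positioning time (random). A back-of-envelope check with illustrative round-number drive parameters, not the slide's exact assumptions:

```python
def read_times_s(capacity_gb, bw_mb_s, ms_per_sector, sector_b=512):
    """Seconds to read a whole disk sequentially vs. one sector per seek."""
    seq = capacity_gb * 1e9 / (bw_mb_s * 1e6)             # capacity / bandwidth
    rand = (capacity_gb * 1e9 / sector_b) * ms_per_sector / 1e3
    return seq, rand

# ~1990 drive: 0.3 GB, 1.25 MB/s, ~37 ms seek+rotate per random sector
seq90, rand90 = read_times_s(0.3, 1.25, 37)
# ~1999 drive: 18 GB, 8.6 MB/s, ~17 ms per random sector
seq99, rand99 = read_times_s(18, 8.6, 17)
print(seq90 / 60, rand90 / 3600)     # ~4 minutes, ~6 hours
print(seq99 / 60, rand99 / 86400)    # ~35 minutes, ~1 week
```

Because capacity grows much faster than seek and rotation improve, the random-read time blows up even as sequential bandwidth keeps pace.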
Slide 53
Related Work
• ISTORE adds to several recent research efforts
• Active Disks, NASD (UCSB, CMU)
• Network service appliances (NetApp, Snap!, Qube, ...)
• High-availability systems (Compaq/Tandem, ...)
• Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium)
• Plug-and-play system construction (Jini, PC