IRAM and ISTORE Projects

Transcript
Page 1: IRAM and ISTORE Projects

Slide 1

IRAM and ISTORE Projects

Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson

http://iram.cs.berkeley.edu/[istore]

Fall 2000 DIS DARPA Meeting

Page 2: IRAM and ISTORE Projects

Slide 2

IRAM and ISTORE Vision
• Integrated processor in memory provides efficient access to high memory bandwidth
• Two “Post-PC” applications:
 – IRAM: Single chip system for embedded and portable applications
  » Target media processing (speech, images, video, audio)
 – ISTORE: Building block when combined with disk for storage and retrieval servers
  » Up to 10K nodes in one rack
  » Non-IRAM prototype addresses key scaling issues: availability, manageability, evolution

Photo from Itsy, Inc.

Page 3: IRAM and ISTORE Projects

Slide 3

IRAM Overview
• A processor architecture for embedded/portable systems running media applications
 – Based on media processing and embedded DRAM
 – Simple, scalable, energy and area efficient
 – Good compiler target

[Block diagram: MIPS64™ 5Kc core with 8KB instruction and data caches, coprocessor IF, FPU, and TLB; vector unit with an 8KB vector register file, a 512B flag register file (Flag 0/1), two arithmetic units (Arith 0/1), and a memory unit; a 256b memory crossbar connecting eight 2MB DRAM macros; 64b SysAD IF, DMA, and JTAG IF]

Page 4: IRAM and ISTORE Projects

Slide 4

Architecture Details
• MIPS64™ 5Kc core (200 MHz)
 – Single-issue scalar core with 8 Kbyte I & D caches
• Vector unit (200 MHz)
 – 8 KByte register file (32 64b elements per register)
 – 256b datapaths, can be subdivided into 16b, 32b, 64b:
  » 2 arithmetic (1 FP, single-precision), 2 flag processing
 – Memory unit
  » 4 address generators for strided/indexed accesses
• Main memory system
 – 8 2-MByte DRAM macros
  » 25ns random access time, 7.5ns page access time
 – Crossbar interconnect
  » 12.8 GBytes/s peak bandwidth per direction (load/store)
• Off-chip interface
 – 2-channel DMA engine and 64b SysAD bus

Page 5: IRAM and ISTORE Projects

Slide 5

Floorplan
• Technology: IBM SA-27E
 – 0.18 µm CMOS, 6 metal layers
• 290 mm² die area (14.5 mm x 20.0 mm)
 – 225 mm² for memory/logic
• Transistor count: ~150M
• Power supply
 – 1.2V for logic, 1.8V for DRAM
• Typical power consumption: 2.0 W
 – 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc)
• Peak vector performance
 – 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)
 – 3.2/6.4/12.8 Gops with madd
 – 1.6 Gflops (single-precision)
• Tape-out planned for March ‘01

Page 6: IRAM and ISTORE Projects

Slide 6

Alternative Floorplans

“VIRAM-8MB”: 4 lanes, 8 Mbytes, 190 mm², 3.2 Gops at 200 MHz (32-bit ops)

“VIRAM-2Lanes”: 2 lanes, 4 Mbytes, 120 mm², 1.6 Gops at 200 MHz

“VIRAM-Lite”: 1 lane, 2 Mbytes, 60 mm², 0.8 Gops at 200 MHz

Page 7: IRAM and ISTORE Projects

Slide 7

VIRAM Compiler

• Based on Cray’s production compiler
• Challenges:
 – narrow data types and scalar/vector memory consistency
• Advantages relative to media extensions:
 – powerful addressing modes and ISA independent of datapath width

[Diagram: C, C++, and Fortran95 frontends feed Cray’s PDGCS optimizer, with code generators targeting C90/T90/SV1, T3D/T3E, and SV2/VIRAM]

Page 8: IRAM and ISTORE Projects

Slide 8

Exploiting On-Chip Bandwidth
• Vector ISA uses high bandwidth to mask latency
• Compiled matrix-vector multiplication: 2 Flops/element (a C sketch of the kernel appears below)
 – Easy compilation problem; stresses memory bandwidth
 – Compare to 304 Mflops (64-bit) for Power3 (hand-coded)

[Bar chart: MFLOPS (0–900) for compiled mvm, 32-bit and 64-bit, with 8 and 16 memory banks, on 1-, 2-, 4-, and 8-lane configurations]

–Performance scales with number of lanes up to 4

–Need more memory banks than default DRAM macro for 8 lanes
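As a rough illustration (not taken from the benchmark sources, which the slides do not include), the compiled kernel is essentially the textbook dense matrix-vector multiply below; the unit-stride inner loop is what the VIRAM compiler vectorizes:

/* y = A*x for an n-by-m matrix: 2 flops (multiply + add) per element.
 * The inner loop is unit-stride, so it maps directly onto vector
 * loads and multiply-adds; performance is bound by memory bandwidth. */
void mvm(int n, int m, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < m; j++)      /* vectorizable inner loop */
            sum += A[i * m + j] * x[j];
        y[i] = sum;
    }
}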

Page 9: IRAM and ISTORE Projects

Slide 9

Compiling Media Kernels on IRAM
• The compiler generates code for narrow data widths, e.g., 16-bit integer (see the FIR sketch below)
• Compilation model is simple, more scalable (across generations) than MMX, VIS, etc.

[Bar chart: MFLOPS (0–3500) for the colorspace, composite, and FIR filter kernels on 1-, 2-, 4-, and 8-lane configurations]

– Strided and indexed loads/stores simpler than pack/unpack

– Maximum vector length is longer than datapath width (256 bits); all lane scalings done with single executable
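For instance (a hypothetical kernel, since the slides only name the benchmarks), a fixed-point FIR filter written in plain C lets the compiler map the 16-bit multiply-accumulate loop onto the subdivided 256b datapaths:

#include <stdint.h>

/* 16-bit fixed-point FIR filter: out[i] = sum_k coef[k] * in[i+k].
 * Accumulation is done in 32 bits and shifted back down, mirroring
 * the usual fixed-point media-kernel idiom. */
void fir16(const int16_t *in, const int16_t *coef,
           int16_t *out, int n, int taps, int shift)
{
    for (int i = 0; i < n; i++) {
        int32_t acc = 0;
        for (int k = 0; k < taps; k++)   /* vectorizable MAC loop */
            acc += (int32_t)in[i + k] * coef[k];
        out[i] = (int16_t)(acc >> shift);
    }
}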

Page 10: IRAM and ISTORE Projects

Slide 10

Vector Vs. SIMD: Example• Simple image processing example:

– conversion from RGB to YUV

Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G +  4221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
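For reference, a plain scalar C version of the same conversion (one pixel per iteration, names hypothetical) is sketched below; the VIRAM and MMX listings on the following slides are hand-vectorized forms of this loop:

#include <stdint.h>

/* Scalar RGB -> YUV using the fixed-point coefficients above.
 * Integer division by 32768 stands in for the [.../32768] brackets
 * in the slide's formulas; planes are 8 bits per component. */
void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, uint8_t *u, uint8_t *v, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        int R = r[i], G = g[i], B = b[i];
        y[i] = (uint8_t)(( 9798 * R + 19235 * G +  3736 * B) / 32768);
        u[i] = (uint8_t)((-4784 * R -  9437 * G +  4221 * B) / 32768 + 128);
        v[i] = (uint8_t)((20218 * R - 16941 * G -  3277 * B) / 32768 + 128);
    }
}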

Page 11: IRAM and ISTORE Projects

Slide 11

VIRAM Code (22 instructions)

RGBtoYUV:
 vlds.u.b    r_v,  r_addr, stride3, addr_inc  # load R
 vlds.u.b    g_v,  g_addr, stride3, addr_inc  # load G
 vlds.u.b    b_v,  b_addr, stride3, addr_inc  # load B
 xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
 xlmadd.u.sv o1_v, t1_s, g_v
 xlmadd.u.sv o1_v, t2_s, b_v
 vsra.vs     o1_v, o1_v, s_s
 xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
 xlmadd.u.sv o2_v, t4_s, g_v
 xlmadd.u.sv o2_v, t5_s, b_v
 vsra.vs     o2_v, o2_v, s_s
 vadd.sv     o2_v, a_s, o2_v
 xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
 xlmadd.u.sv o3_v, t7_s, g_v
 xlmadd.u.sv o3_v, t8_s, b_v
 vsra.vs     o3_v, o3_v, s_s
 vadd.sv     o3_v, a_s, o3_v
 vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y
 vsts.b      o2_v, u_addr, stride3, addr_inc  # store U
 vsts.b      o3_v, v_addr, stride3, addr_inc  # store V
 subu        pix_s, pix_s, len_s
 bnez        pix_s, RGBtoYUV

Page 12: IRAM and ISTORE Projects

Slide 12

MMX Code (part 1)RGBtoYUV: movq mm1, [eax] pxor mm6, mm6 movq mm0, mm1 psrlq mm1, 16 punpcklbw mm0, ZEROS movq mm7, mm1 punpcklbw mm1, ZEROS movq mm2, mm0 pmaddwd mm0, YR0GR movq mm3, mm1 pmaddwd mm1, YBG0B movq mm4, mm2 pmaddwd mm2, UR0GR movq mm5, mm3 pmaddwd mm3, UBG0B punpckhbw mm7, mm6; pmaddwd mm4, VR0GR paddd mm0, mm1 pmaddwd mm5, VBG0B movq mm1, 8[eax] paddd mm2, mm3 movq mm6, mm1

paddd mm4, mm5 movq mm5, mm1 psllq mm1, 32 paddd mm1, mm7 punpckhbw mm6, ZEROS movq mm3, mm1 pmaddwd mm1, YR0GR movq mm7, mm5 pmaddwd mm5, YBG0B psrad mm0, 15 movq TEMP0, mm6 movq mm6, mm3 pmaddwd mm6, UR0GR psrad mm2, 15 paddd mm1, mm5 movq mm5, mm7 pmaddwd mm7, UBG0B psrad mm1, 15 pmaddwd mm3, VR0GR packssdw mm0, mm1 pmaddwd mm5, VBG0B psrad mm4, 15 movq mm1, 16[eax]

Page 13: IRAM and ISTORE Projects

Slide 13

MMX Code (part 2) paddd mm6, mm7 movq mm7, mm1 psrad mm6, 15 paddd mm3, mm5 psllq mm7, 16 movq mm5, mm7 psrad mm3, 15 movq TEMPY, mm0 packssdw mm2, mm6 movq mm0, TEMP0 punpcklbw mm7, ZEROS movq mm6, mm0 movq TEMPU, mm2 psrlq mm0, 32 paddw mm7, mm0 movq mm2, mm6 pmaddwd mm2, YR0GR movq mm0, mm7 pmaddwd mm7, YBG0B packssdw mm4, mm3 add eax, 24 add edx, 8 movq TEMPV, mm4

movq mm4, mm6 pmaddwd mm6, UR0GR movq mm3, mm0 pmaddwd mm0, UBG0B paddd mm2, mm7 pmaddwd mm4, pxor mm7, mm7 pmaddwd mm3, VBG0B punpckhbw mm1, paddd mm0, mm6 movq mm6, mm1 pmaddwd mm6, YBG0B punpckhbw mm5, movq mm7, mm5 paddd mm3, mm4 pmaddwd mm5, YR0GR movq mm4, mm1 pmaddwd mm4, UBG0B psrad mm0, 15 paddd mm0, OFFSETW psrad mm2, 15 paddd mm6, mm5 movq mm5, mm7

Page 14: IRAM and ISTORE Projects

Slide 14

MMX Code (pt. 3: 121 instructions) pmaddwd mm7, UR0GR psrad mm3, 15 pmaddwd mm1, VBG0B psrad mm6, 15 paddd mm4, OFFSETD packssdw mm2, mm6 pmaddwd mm5, VR0GR paddd mm7, mm4 psrad mm7, 15 movq mm6, TEMPY packssdw mm0, mm7 movq mm4, TEMPU packuswb mm6, mm2 movq mm7, OFFSETB paddd mm1, mm5 paddw mm4, mm7 psrad mm1, 15 movq [ebx], mm6 packuswb mm4, movq mm5, TEMPV packssdw mm3, mm4 paddw mm5, mm7 paddw mm3, mm7

movq [ecx], mm4 packuswb mm5, mm3 add ebx, 8 add ecx, 8 movq [edx], mm5 dec edi jnz RGBtoYUV

Page 15: IRAM and ISTORE Projects

Slide 15

IRAM Status• Chip

– ISA has not changed significantly in over a year– Verilog complete, except SRAM for scalar cache– Testing framework in place

• Compiler– Backend code generation complete– Continued performance improvements, especially for

narrow data widths• Application & Benchmarks

– Handcoded kernels better than MMX,VIS, gp DSPs » DCT, FFT, MVM, convolution, image composition,…

– Compiled kernels demonstrate ISA advantages» MVM, sparse MVM, decrypt, image composition,…

– Full applications: H263 encoding (done), speech (underway)

Page 16: IRAM and ISTORE Projects

Slide 16

Scaling to 10K Processors• IRAM + micro-disk offer huge scaling

opportunities• Still many hard system problems (AME)

– Availability» systems should continue to meet quality of service

goals despite hardware and software failures– Maintainability

» systems should require only minimal ongoing human administration, regardless of scale or complexity

– Evolutionary Growth» systems should evolve gracefully in terms of

performance, maintainability, and availability as they are grown/upgraded/expanded

• These are problems at today’s scales, and will only get worse as systems grow

Page 17: IRAM and ISTORE Projects

Slide 17

Cause of System Crashes

[Stacked bar chart: causes of system crashes in 1985, 1993, and 2001 (est.), broken down into hardware failure, operating system failure, system management (actions + N/problem), and other (app, power, network failure)]

• VAX crashes ‘85, ‘93 [Murp95]; extrapolated to ‘01
• HW/OS share: 70% in ‘85 down to 28% in ‘93; in ‘01, 10%?
• Rule of Thumb: Maintenance costs 10X HW
 – so over a 5-year product life, ~95% of cost is maintenance

Is Maintenance the Key?

Page 18: IRAM and ISTORE Projects

Slide 18

Hardware Techniques for AME
• Cluster of Storage Oriented Nodes (SON)
 – Scalable, tolerates partial failures, automatic redundancy
• Heavily instrumented hardware
 – Sensors for temp, vibration, humidity, power, intrusion
• Independent diagnostic processor on each node
 – Remote control of power; collects environmental data
 – Diagnostic processors connected via independent network
• On-demand network partitioning/isolation
 – Allows testing, repair of online system
 – Managed by diagnostic processor
• Built-in fault injection capabilities
 – Used for hardware introspection
 – Important for AME benchmarking

Page 19: IRAM and ISTORE Projects

Slide 19

ISTORE-1 system

ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches (20 x 100 Mb/s, 2 x 1 Gb/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...

Storage-Oriented Node “Brick”: portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 x 100 Mb/s links), diagnostic processor, disk, half-height canister

Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
 – intelligence used to collect and filter monitoring data
 – diagnostics and fault injection enhance robustness
 – networked to create a scalable shared-nothing cluster

Scheduled for 4Q 00

Page 20: IRAM and ISTORE Projects

Slide 20

ISTORE-1 System Layout

[Diagram: rack layout with ten brick shelves, four patch panels, six UPS units, two PE5200 switches, and the PE1000 switches]

PE1000s: PowerEngines 100 Mb switches
PE5200s: PowerEngines 1 Gb switches
UPSs: “used”

Page 21: IRAM and ISTORE Projects

Slide 21

ISTORE Brick Node Block Diagram

[Block diagram: Mobile Pentium II module (CPU + North Bridge) with 256 MB DRAM; PCI bus connecting a SCSI controller (18 GB disk), South Bridge, SuperI/O, BIOS, dual UART, and 4 x 100 Mb/s Ethernets; diagnostic processor with Flash, RTC, RAM, and monitor & control logic, attached to the diagnostic net]

• Sensors for heat and vibration
• Control over power to individual nodes

Page 22: IRAM and ISTORE Projects

Slide 22

ISTORE Brick Node
• Pentium-II/266 MHz
• 256 MB DRAM
• 18 GB SCSI (or IDE) disk
• 4 x 100 Mb Ethernet
• m68k diagnostic processor & CAN diagnostic network
• Packaged in standard half-height RAID array canister

Page 23: IRAM and ISTORE Projects

Slide 23

Software Techniques• Reactive introspection

– “Mining” available system data• Proactive introspection

– Isolation + fault insertion => test recovery code• Semantic redundancy

– Use of coding and application-specific checkpoints• Self-Scrubbing data structures

– Check (and repair?) complex distributed structures

• Load adaptation for performance faults– Dynamic load balancing for “regular”

computations• Benchmarking

– Define quantitative evaluations for AME

Page 24: IRAM and ISTORE Projects

Slide 24


Network Redundancy
• Each brick node has 4 100 Mb Ethernets
 – TCP striping used for performance
 – Demonstration on 2-node prototype using 3 links
 – When a link fails, packets on that link are dropped
 – Nodes detect failures using independent pings (see the striping sketch below)
  » More scalable approach being developed

[Graph: ideal vs. actual aggregate throughput in Mb/s (0–350) when striping over 1, 2, and 3 links]
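The slides do not spell out the striping policy; one minimal way to sketch it (all names hypothetical) is round-robin transmission over whichever NICs the ping monitor currently considers alive:

#include <stdbool.h>

#define NLINKS 4                 /* each brick has 4 100 Mb/s NICs */

struct link {
    bool alive;                  /* updated by an independent ping monitor */
    int  (*send)(const void *buf, int len);   /* per-NIC transmit hook */
};

/* Stripe packets round-robin over the links currently marked alive.
 * Packets destined for a link that later fails are simply dropped,
 * as in the 2-node demonstration; recovery is left to the transport. */
int stripe_send(struct link links[NLINKS], const void *buf, int len)
{
    static int next = 0;
    for (int tried = 0; tried < NLINKS; tried++) {
        struct link *l = &links[next];
        next = (next + 1) % NLINKS;
        if (l->alive)
            return l->send(buf, len);
    }
    return -1;                   /* no live links */
}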

Page 25: IRAM and ISTORE Projects

Slide 25

Load Balancing for Performance Faults

• Failure is not always a discrete property
 – Some fraction of components may fail
 – Some components may perform poorly
 – Graph shows effect of “Graduated Declustering” on cluster I/O with disk performance faults (a sketch of the idea follows the graph)

[Graph: bandwidth of the slowest node in MB/sec (0–6) versus the number of slow disks (out of 8, each with a 50% slowdown), comparing Ideal, GD, Affinity GD, and Static]
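The algorithm itself is not given in the slides; as a loose sketch of the idea behind graduated declustering, each block is mirrored on two disks and every read is steered to whichever replica is expected to finish first, so a slow disk sheds load to its mirror instead of throttling one client (names and the cost model below are illustrative only):

/* Pick which mirror replica should serve the next read of a block.
 * Each disk tracks its recently measured bandwidth; the request goes
 * to the replica with the lower (queued work / bandwidth) estimate. */
struct disk {
    double bw_mbs;        /* measured bandwidth, MB/s */
    double queued_mb;     /* work already queued, in MB */
};

int choose_replica(struct disk *primary, struct disk *mirror, double req_mb)
{
    double t0 = (primary->queued_mb + req_mb) / primary->bw_mbs;
    double t1 = (mirror->queued_mb  + req_mb) / mirror->bw_mbs;
    if (t0 <= t1) { primary->queued_mb += req_mb; return 0; }
    else          { mirror->queued_mb  += req_mb; return 1; }
}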

Page 26: IRAM and ISTORE Projects

Slide 26

[Diagram: availability benchmark methodology: performance versus time, with a 99%-confidence band of normal behavior, an injected disk failure, and the subsequent reconstruction period]

Availability benchmarks
• Goal: quantify variation in QoS as fault events occur
• Leverage existing performance benchmarks
 – to generate fair workloads
 – to measure & trace quality of service metrics
• Use fault injection to compromise system
• Results are most accessible graphically

Page 27: IRAM and ISTORE Projects

Slide 27

[Graphs (Linux and Solaris): hits per second and number of failures tolerated (0–2) versus time in minutes (0–110), with the reconstruction period marked in each trace]

Example: Faults in Software RAID

• Compares Linux and Solaris reconstruction
 – Linux: minimal performance impact but longer window of vulnerability to second fault
 – Solaris: large perf. impact but restores redundancy fast

Page 28: IRAM and ISTORE Projects

Slide 28

Towards Manageability Benchmarks

• Goal is to gain experience with a small piece of the problem– can we measure the time and learning-curve

costs for one task?

• Task: handling disk failure in RAID system– includes detection and repair

• Same test systems as availability case study– Windows 2000/IIS, Linux/Apache,

Solaris/Apache• Five test subjects and fixed training session

– (Too small to draw statistical conclusions)

Page 29: IRAM and ISTORE Projects

Slide 29

Sample results: time
• Graphs plot human time, excluding wait time

[Graphs: three panels (Solaris, Windows, Linux) plotting seconds of human time (0–400) against trial number (1–9), one curve per test subject]

Page 30: IRAM and ISTORE Projects

Slide 30

Analysis of time results• Rapid convergence across all OSs/subjects

– despite high initial variability– final plateau defines “minimum” time for

task– plateau invariant over

individuals/approaches

• Clear differences in plateaus between OSs– Solaris < Windows < Linux

» note: statistically dubious conclusion given sample size!

Page 31: IRAM and ISTORE Projects

Slide 31

ISTORE Status• ISTORE Hardware

– All 80 Nodes (boards) manufactured– PCB backplane: in layout– Finish 80 node system: December 2000

• Software– 2-node system running -- boots OS– Diagnostic Processor SW and device driver done – Network striping done; fault adaptation ongoing– Load balancing for performance heterogeneity

done• Benchmarking

– Availability benchmark example complete– Initial maintainability benchmark complete,

revised strategy underway

Page 32: IRAM and ISTORE Projects

Slide 32

BACKUP SLIDES

IRAM

Page 33: IRAM and ISTORE Projects

Slide 33

Modular Vector Unit Design

• Single 64b “lane” design replicated 4 times– Reduces design and testing time– Provides a simple scaling model (up or down)

without major control or datapath redesign– Lane scaling independent of DRAM scaling

• Most instructions require only intra-lane interconnect– Tolerance to interconnect delay scaling

[Diagram: four identical 64b lanes under common control, each with a 64b crossbar interface, integer datapaths 0 and 1, an FP datapath, vector register elements, and flag register elements & datapaths; 256b aggregate crossbar width]

Page 34: IRAM and ISTORE Projects

Slide 34

Performance: FFT (1)

FFT (Floating-point, 1024 points)

[Bar chart: execution time in µsec (0–160) for VIRAM, Pathfinder-2, Wildstar, TigerSHARC, ADSP-21160, and TMS320C6701; values shown: 36, 16.8, 25, 69, 92, 124.3]

Page 35: IRAM and ISTORE Projects

Slide 35

Performance: FFT (2)

FFT (Fixed-point, 256 points)

[Bar chart: execution time in µsec (0–160) for VIRAM, Pathfinder-1, Carmel, TigerSHARC, PPC 604E, and Pentium; values shown: 7.2, 8.1, 9, 7.3, 87, 151]

Page 36: IRAM and ISTORE Projects

Slide 36

Media Kernel Performance

Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS          30.7%
Color Conversion       3.2 GOPS      3.07 GOPS          96.0%
Image Convolution      3.2 GOPS      3.16 GOPS          98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS          86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS          93.7%
FP MV Multiply         3.2 GFLOPS    2.80 GFLOPS        87.5%
FP VM Multiply         3.2 GFLOPS    3.19 GFLOPS        99.6%
AVERAGE                                                 86.6%

Page 37: IRAM and ISTORE Projects

Slide 37

Base-line system comparison

Kernel               VIRAM   MMX            VIS            TMS320C82
Image Composition    0.13    -              2.22 (17.0x)   -
iDCT                 1.18    3.75 (3.2x)    -              -
Color Conversion     0.78    8.00 (10.2x)   -              5.70 (7.6x)
Image Convolution    5.49    5.49 (4.5x)    6.19 (5.1x)    6.50 (5.3x)

• All numbers in cycles/pixel
• MMX and VIS results assume all data in L1 cache

Page 38: IRAM and ISTORE Projects

Slide 38

Vector Architecture State

[Diagram: vector architecture state: 32 general-purpose vector registers (vr0…vr31), each an array of virtual processors VP0…VP$vlr-1 with element width $vpw; 32 flag registers (vf0…vf31) with 1b elements; and 16 64b scalar registers (vs0…vs15)]
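Read as a data structure, the state above amounts to something like the following sketch (sizes follow slides 4 and 38; holding each element at its widest 64b and each flag in a byte is a simplification for illustration, and the maximum vector length is an implementation choice, not fixed by the ISA):

#include <stdint.h>

#define MVL 32   /* max elements per register at 64b width (slide 4); implementation-dependent */

struct viram_vector_state {
    uint64_t vr[32][MVL];   /* vr0..vr31: general-purpose vector registers */
    uint8_t  vf[32][MVL];   /* vf0..vf31: flag registers, 1 bit per element */
    uint64_t vs[16];        /* vs0..vs15: 64b scalar registers */
    uint32_t vlr;           /* current vector length (# of virtual processors) */
    uint32_t vpw;           /* virtual processor (element) width in bits */
};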

Page 39: IRAM and ISTORE Projects

Slide 39

Vector Instruction Set• Complete load-store vector instruction set

– Uses the MIPS64™ ISA coprocessor 2 opcode space

» Ideas work with any core CPU: Arm, PowerPC, ...– Architecture state

» 32 general-purpose vector registers» 32 vector flag registers

– Data types supported in vectors:» 64b, 32b, 16b (and 8b)

– 91 arithmetic and memory instructions• Not specified by the ISA

– Maximum vector register length– Functional unit datapath width

Page 40: IRAM and ISTORE Projects

Slide 40

Compiler/OS Enhancements
• Compiler support
 – Conditional execution of vector instructions
  » Using the vector flag registers
 – Support for software speculation of load operations
• Operating system support
 – MMU-based virtual memory
 – Restartable arithmetic exceptions
 – Valid and dirty bits for vector registers
 – Tracking of maximum vector length used
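For example, a loop like the hypothetical one below is what flag-based conditional execution targets: the compiler evaluates the condition into a vector flag register and performs the update under that mask instead of branching element by element:

/* The comparison becomes a vector flag; the subtraction and store are
 * executed under that mask, so no per-element branch is needed. */
void clip_above(int n, const float *a, float *b, float threshold)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold)
            b[i] = a[i] - threshold;
    }
}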

Page 41: IRAM and ISTORE Projects

Slide 41

BACKUP SLIDES

ISTORE

Page 42: IRAM and ISTORE Projects

Slide 42

ISTORE: A server for the PostPC Era

Aaron Brown, Dave Martin, David Oppenheimer, Noah Treuhaft, Dave Patterson, Katherine Yelick

University of California at Berkeley
[email protected]

UC Berkeley ISTORE [email protected]

August 2000

Page 43: IRAM and ISTORE Projects

Slide 43

ISTORE as Storage System of the Future

• Availability, Maintainability, and Evolutionary growth are the key challenges for storage systems
 – Maintenance Cost ~ >10X Purchase Cost per year
 – Even 2X purchase cost for 1/2 maintenance cost wins
 – AME improvement enables even larger systems
• ISTORE also has cost-performance advantages
 – Better space, power/cooling costs ($ @ colocation site)
 – More MIPS, cheaper MIPS, no bus bottlenecks
 – Compression reduces network $, encryption protects
 – Single interconnect, supports evolution of technology; single network technology to maintain/understand
• Match to future software storage services
 – Future storage service software targets clusters

Page 44: IRAM and ISTORE Projects

Slide 44

Lampson: Systems Challenges• Systems that work

– Meeting their specs– Always available– Adapting to changing environment– Evolving while they run– Made from unreliable components– Growing without practical limit

• Credible simulations or analysis• Writing good specs• Testing• Performance

– Understanding when it doesn’t matter

“Computer Systems Research-Past and Future” Keynote address,

17th SOSP, Dec. 1999

Butler LampsonMicrosoft

Page 45: IRAM and ISTORE Projects

Slide 45

Jim Gray: Trouble-Free Systems • Manager

– Sets goals– Sets policy– Sets budget– System does the rest.

• Everyone is a CIO (Chief Information Officer)

• Build a system – used by millions of people each day– Administered and managed by a ½ time

person.» On hardware fault, order replacement part» On overload, order additional equipment» Upgrade hardware and software automatically.

“What Next? A dozen remaining IT problems”

Turing Award Lecture, FCRC,

May 1999Jim GrayMicrosoft

Page 46: IRAM and ISTORE Projects

Slide 46

Jim Gray: Trustworthy Systems

• Build a system used by millions of people that
 – Only services authorized users
  » Service cannot be denied (can’t destroy data or power)
  » Information cannot be stolen
 – Is always available: out less than 1 second per 100 years = 8 9’s of availability
  » 1950’s: 90% availability; today: 99% uptime for web sites, 99.99% for well-managed sites (50 minutes/year); 3 extra 9s in 45 years
  » Goal: 5 more 9s: 1 second per century
 – And prove it

Page 47: IRAM and ISTORE Projects

Slide 47

Hennessy: What Should the “New World” Focus Be?• Availability

– Both appliance & service• Maintainability

– Two functions:» Enhancing availability by preventing failure» Ease of SW and HW upgrades

• Scalability– Especially of service

• Cost– per device and per service transaction

• Performance– Remains important, but its not SPECint

“Back to the Future: Time to Return to Longstanding

Problems in Computer Systems?” Keynote address,

FCRC, May 1999

John HennessyStanford

Page 48: IRAM and ISTORE Projects

Slide 48

The real scalability problems: AME

• Availability– systems should continue to meet quality of

service goals despite hardware and software failures

• Maintainability
 – systems should require only minimal ongoing human administration, regardless of scale or complexity; today, cost of maintenance is 10-100X the cost of purchase

• Evolutionary Growth– systems should evolve gracefully in terms of

performance, maintainability, and availability as they are grown/upgraded/expanded

• These are problems at today’s scales, and will only get worse as systems grow

Page 49: IRAM and ISTORE Projects

Slide 49

Principles for achieving AME• No single points of failure, lots of redundancy• Performance robustness is more important than

peak performance• Performance can be sacrificed for improvements

in AME– resources should be dedicated to AME

» biological systems > 50% of resources on maintenance– can make up performance by scaling system

• Introspection– reactive techniques to detect and adapt to

failures, workload variations, and system evolution

– proactive techniques to anticipate and avert problems before they happen

Page 50: IRAM and ISTORE Projects

Slide 50

Hardware Techniques (1): SON

• SON: Storage Oriented Nodes• Distribute processing with storage

– If AME really important, provide resources!– Most storage servers limited by speed of CPUs!! – Amortize sheet metal, power, cooling, network for

disk to add processor, memory, and a real network?– Embedded processors 2/3 perf, 1/10 cost, power?– Serial lines, switches also growing with Moore’s Law;

less need today to centralize vs. bus oriented systems

• Advantages of cluster organization– Truly scalable architecture– Architecture that tolerates partial failure– Automatic hardware redundancy

Page 51: IRAM and ISTORE Projects

Slide 51

Hardware techniques (2)• Heavily instrumented hardware

– sensors for temp, vibration, humidity, power, intrusion

– helps detect environmental problems before they can affect system integrity

• Independent diagnostic processor on each node– provides remote control of power, remote

console access to the node, selection of node boot code

– collects, stores, processes environmental data for abnormalities

– non-volatile “flight recorder” functionality– all diagnostic processors connected via

independent diagnostic network

Page 52: IRAM and ISTORE Projects

Slide 52

Hardware techniques (3)• On-demand network partitioning/isolation

– Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance

– Allows testing, repair of online system– Managed by diagnostic processor and network

switches via diagnostic network• Built-in fault injection capabilities

– Power control to individual node components– Injectable glitches into I/O and memory busses– Managed by diagnostic processor – Used for proactive hardware introspection

» automated detection of flaky components» controlled testing of error-recovery mechanisms

Page 53: IRAM and ISTORE Projects

Slide 53

“Hardware” culture (4)• Benchmarking

– One reason for 1000X processor performance was ability to measure (vs. debate) which is better

» e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed?

– Need AME benchmarks
 “what gets measured gets done”
 “benchmarks shape a field”
 “quantification brings rigor”

Page 54: IRAM and ISTORE Projects

Slide 54

[Graphs (Linux and Solaris): hits per second and number of failures tolerated (0–2) versus time in minutes (0–110), with the reconstruction period marked in each trace; same data as slide 27]

Example single-fault result

• Compares Linux and Solaris reconstruction
 – Linux: minimal performance impact but longer window of vulnerability to second fault
 – Solaris: large perf. impact but restores redundancy fast

Page 55: IRAM and ISTORE Projects

Slide 55

Deriving ISTORE• What is the interconnect?

– FC-AL? (Interoperability? Cost of switches?)
 – Infiniband? (When? Cost of switches? Cost of NIC?)
 – Gbit Ethernet?

• Pick Gbit Ethernet as commodity switch, link– As main stream, fastest improving in cost

performance– We assume Gbit Ethernet switches will get

cheap over time (Network Processors, volume, …)

Page 56: IRAM and ISTORE Projects

Slide 56

Deriving ISTORE
• Number of Disks / Gbit port?
• Bandwidth of a 2000 disk
 – Raw bit rate: 427 Mbit/sec
 – Data transfer rate: 40.2 MByte/sec
 – Capacity: 73.4 GB
• Disk trends
 – BW: 40%/year
 – Capacity, areal density, $/MB: 100%/year
• 2003 disks
 – ~500 GB capacity (<8X)
 – ~110 MB/sec or 0.9 Gbit/sec (2.75X)

• Number of Disks / Gbit port = 1
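As a quick check of the projection (not from the slides, just compounding the stated trends over three years):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Year-2000 disk: 40.2 MB/s, 73.4 GB; BW grows 40%/year, capacity 100%/year. */
    double bw2003  = 40.2 * pow(1.40, 3);   /* ~110 MB/s, i.e. ~0.9 Gbit/s (2.75x) */
    double cap2003 = 73.4 * pow(2.00, 3);   /* ~587 GB; the slide quotes ~500 GB (<8X) */
    printf("2003 disk: %.0f MB/s, %.0f GB\n", bw2003, cap2003);
    return 0;
}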

Page 57: IRAM and ISTORE Projects

Slide 57

ISTORE-1 Brick• Webster’s Dictionary:

“brick: a handy-sized unit of building or paving material typically being rectangular and about 2 1/4 x 3 3/4 x 8 inches”

• ISTORE-1 Brick: 2 x 4 x 11 inches (1.3x)– Single physical form factor, fixed cooling

required, compatible network interface to simplify physical maintenance, scaling over time

– Contents should evolve over time: contains most cost effective MPU, DRAM, disk, compatible NI

– If useful, could have special bricks (e.g., DRAM rich)

– Suggests network that will last, evolve: Ethernet

Page 58: IRAM and ISTORE Projects

Slide 58

ISTORE-1 hardware platform• 80-node x86-based cluster, 1.4TB storage

– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”

» a single field-replaceable unit to simplify maintenance– each node is a full x86 PC w/256MB DRAM, 18GB disk– more CPU than NAS; fewer disks/node than cluster

Intelligent Disk “Brick”: portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 x 100 Mb/s links), diagnostic processor, disk, half-height canister

ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches (20 x 100 Mbit/s, 2 x 1 Gbit/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...

Page 59: IRAM and ISTORE Projects

Slide 59

Common Question: RAID?• Switched Network sufficient for all types of

communication, including redundancy– Hierarchy of buses is generally not

superior to switched network• Veritas, others offer software RAID 5 and

software Mirroring (RAID 1)• Another use of processor per disk

Page 60: IRAM and ISTORE Projects

Slide 60

A Case for Intelligent Storage

Advantages:• Cost of Bandwidth• Cost of Space• Cost of Storage System v. Cost of Disks• Physical Repair, Number of Spare Parts• Cost of Processor Complexity • Cluster advantages: dependability,

scalability• 1 v. 2 Networks

Page 61: IRAM and ISTORE Projects

Slide 61

Cost of Space, Power, Bandwidth

• Co-location sites (e.g., Exodus) offer space, expandable bandwidth, stable power

• Charge ~$1000/month per rack (~ 10 sq. ft.) – Includes 1 20-amp circuit/rack; charges

~$100/month per extra 20-amp circuit/rack• Bandwidth cost: ~$500 per Mbit/sec/Month

Page 62: IRAM and ISTORE Projects

Slide 62

Cost of Bandwidth, Safety
• Network bandwidth cost is significant
 – 1000 Mbit/sec/month => $6,000,000/year
• Security will increase in importance for storage service providers
• XML => server format conversion for gadgets
=> Storage systems of the future need greater computing ability
 – Compress to reduce cost of network bandwidth 3X; save $4M/year?
 – Encrypt to protect information in transit for B2B
=> Increasing processing/disk for future storage apps
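The dollar figures follow from the ~$500 per Mbit/sec per month colocation price on the previous slide; a quick check of the arithmetic:

#include <stdio.h>

int main(void)
{
    double rate_mbit = 1000.0;            /* sustained Mbit/sec */
    double per_mbit_month = 500.0;        /* colocation bandwidth price */
    double yearly = rate_mbit * per_mbit_month * 12.0;   /* $6.0M/year */
    double with_3x_compression = yearly / 3.0;           /* $2.0M/year */
    printf("uncompressed: $%.1fM/yr  compressed: $%.1fM/yr  saved: $%.1fM/yr\n",
           yearly / 1e6, with_3x_compression / 1e6,
           (yearly - with_3x_compression) / 1e6);
    return 0;
}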

Page 63: IRAM and ISTORE Projects

Slide 63

Cost of Space, Power• Sun Enterprise server/array (64CPUs/60disks)

– 10K Server (64 CPUs): 70 x 50 x 39 in.– A3500 Array (60 disks): 74 x 24 x 36 in.– 2 Symmetra UPS (11KW): 2 * 52 x 24 x 27 in.

• ISTORE-1: 2X savings in space– ISTORE-1: 1 rack (big) switches, 1 rack (old)

UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack unit/brick)

• ISTORE-2: 8X-16X space?• Space, power cost/year for 1000 disks:

Sun $924k, ISTORE-1 $484k, ISTORE2 $50k

Page 64: IRAM and ISTORE Projects

Slide 64

Disk Limit: Bus Hierarchy

[Diagram: CPU and memory on a memory bus; internal I/O bus (PCI) and external I/O bus (SCSI) connect the server to a disk array with its own memory and RAID bus over a storage area network (FC-AL)]

• Data rate vs. Disk rate
 – SCSI: Ultra3 (80 MHz), Wide (16 bit): 160 MByte/s
 – FC-AL: 1 Gbit/s = 125 MByte/s
• Use only 50% of a bus
 – Command overhead (~20%)
 – Queuing theory (<70%)
 => 15 disks/bus

Page 65: IRAM and ISTORE Projects

Slide 65

Physical Repair, Spare Parts
• ISTORE: Compatible modules based on hot-pluggable interconnect (LAN) with few Field Replaceable Units (FRUs): node, power supplies, switches, network cables
 – Replace node (disk, CPU, memory, NI) if any fail

• Conventional: Heterogeneous system with many server modules (CPU, backplane, memory cards, …) and disk array modules (controllers, disks, array controllers, power supplies, … ) – Store all components available somewhere as

FRUs– Sun Enterprise 10k has ~ 100 types of spare

parts– Sun 3500 Array has ~ 12 types of spare parts

Page 66: IRAM and ISTORE Projects

Slide 66

ISTORE: Complexity v. Perf • Complexity increase:

– HP PA-8500: issue 4 instructions per clock cycle, 56 instructions out-of-order execution, 4Kbit branch predictor, 9 stage pipeline, 512 KB I cache, 1024 KB D cache (> 80M transistors just in caches)

– Intel Xscale: 16 KB I$, 16 KB D$, 1 instruction, in order execution, no branch prediction, 6 stage pipeline

• Complexity costs in development time, development power, die size, cost

– 550 MHz HP PA-8500 477 mm2, 0.25 micron/4M $330, 60 Watts

– 1000 MHz Intel StrongARM2 (“Xscale”) @ 1.5 Watts, 800 MHz at 0.9 W, … 50 Mhz @ 0.01W, 0.18 micron (old chip 50 mm2, 0.35 micron, $18)

• => Count $ for system, not processors/disk

Page 67: IRAM and ISTORE Projects

Slide 67

ISTORE: Cluster Advantages• Architecture that tolerates partial failure• Automatic hardware redundancy

– Transparent to application programs• Truly scalable architecture

– Given maintenance is 10X-100X capital costs, cluster-size limits today are maintenance and floor space cost, generally NOT capital costs

• As a result, it is THE target architecture for new software apps for Internet

Page 68: IRAM and ISTORE Projects

Slide 68

ISTORE: 1 vs. 2 networks• Current systems all have LAN + Disk interconnect

(SCSI, FCAL)– LAN is improving fastest, most investment, most

features– SCSI, FC-AL poor network features, improving

slowly, relatively expensive for switches, bandwidth

– FC-AL switches don’t interoperate– Two sets of cables, wiring?– SysAdmin trained in 2 networks, SW interface,

…???• Why not single network based on best HW/SW

technology?– Note: there can be still 2 instances of the network

(e.g. external, internal), but only one technology

Page 69: IRAM and ISTORE Projects

Slide 69

Initial Applications• ISTORE-1 is not one super-system that

demonstrates all these techniques!– Initially provide middleware, library to

support AME• Initial application targets

– information retrieval for multimedia data (XML storage?)

» self-scrubbing data structures, structuring performance-robust distributed computation

» Example: home video server using XML interfaces– email service

» self-scrubbing data structures, online self-testing» statistical identification of normal behavior

Page 70: IRAM and ISTORE Projects

Slide 70

A glimpse into the future?• System-on-a-chip enables computer, memory,

redundant network interfaces without significantly increasing size of disk

• ISTORE HW in 5-7 years:

– 2006 brick: System On a Chip integrated with MicroDrive

» 9GB disk, 50 MB/sec from disk» connected via crossbar switch» From brick to “domino”

– If low power, 10,000 nodes fit into one rack!

• O(10,000) scale is our ultimate design point

Page 71: IRAM and ISTORE Projects

Slide 71

Conclusion: ISTORE as Storage System of the Future

• Availability, Maintainability, and Evolutionary growth key challenges for storage systems

– Maintenance Cost ~ 10X Purchase Cost per year, so over 5 year product life, ~ 95% of cost of ownership

– Even 2X purchase cost for 1/2 maintenance cost wins– AME improvement enables even larger systems

• ISTORE has cost-performance advantages– Better space, power/cooling costs ($@colocation site)– More MIPS, cheaper MIPS, no bus bottlenecks– Compression reduces network $, encryption protects– Single interconnect, supports evolution of

technology, single network technology to maintain/understand

• Match to future software storage services– Future storage service software target clusters

Page 72: IRAM and ISTORE Projects

Slide 72

Questions?

Contact us if you’re interested:email: [email protected]

http://iram.cs.berkeley.edu/

“If it’s important, how can you say if it’s impossible if you don’t try?”

Jean Morreau, a founder of European Union

Page 73: IRAM and ISTORE Projects

Slide 73

Clusters and TPC Software 8/’00

• TPC-C: 6 of Top 10 performance are clusters, including all of Top 5; 4 SMPs

• TPC-H: SMPs and NUMAs– 100 GB All SMPs (4-8 CPUs)– 300 GB All NUMAs (IBM/Compaq/HP 32-64

CPUs)• TPC-R: All are clusters

– 1000 GB :NCR World Mark 5200• TPC-W: All web servers are clusters (IBM)

Page 74: IRAM and ISTORE Projects

Slide 74

Clusters and TPC-C Benchmark

Top 10 TPC-C Performance (Aug. 2000)              Ktpm
 1. Netfinity 8500R c/s             Cluster       441
 2. ProLiant X700-96P               Cluster       262
 3. ProLiant X550-96P               Cluster       230
 4. ProLiant X700-64P               Cluster       180
 5. ProLiant X550-64P               Cluster       162
 6. AS/400e 840-2420                SMP           152
 7. Fujitsu GP7000F Model 2000      SMP           139
 8. RISC S/6000 Ent. S80            SMP           139
 9. Bull Escala EPC 2400 c/s        SMP           136
10. Enterprise 6500 Cluster         Cluster       135

Page 75: IRAM and ISTORE Projects

Slide 75

Cost of Storage System v. Disks
• Examples show cost of the way we build current systems (2 networks, many buses, CPU, …)

System        Date    Cost    Maint.   Disks   Disks/CPU   Disks/IObus
NCR WM        10/97   $8.3M   --       1312    10.2        5.0
Sun 10k       3/98    $5.2M   --        668    10.4        7.0
Sun 10k       9/99    $6.2M   $2.1M    1732    27.0        12.0
IBM Netinf    7/00    $7.8M   $1.8M    7040    55.0        9.0

=> Too complicated, too heterogeneous

• And Data Bases are often CPU or bus bound! – ISTORE disks per CPU: 1.0– ISTORE disks per I/O bus: 1.0

Page 76: IRAM and ISTORE Projects

Slide 76

Common Question: Why Not Vary Number of Processors

and Disks?• Argument: if can vary numbers of each to match

application, more cost-effective solution?• Alternative Model 1: Dual Nodes + E-switches

– P-node: Processor, Memory, 2 Ethernet NICs– D-node: Disk, 2 Ethernet NICs

• Response– As D-nodes running network protocol, still need

processor and memory, just smaller; how much save?

– Saves processors/disks, costs more NICs/switches: N ISTORE nodes vs. N/2 P-nodes + N D-nodes

– Isn't ISTORE-2 a good HW prototype for this model? Only run the communication protocol on N nodes, run the full app and OS on N/2

Page 77: IRAM and ISTORE Projects

Slide 77

Common Question: Why Not Vary Number of Processors

and Disks?• Alternative Model 2: N Disks/node

– Processor, Memory, N disks, 2 Ethernet NICs• Response

– Potential I/O bus bottleneck as disk BW grows– 2.5" ATA drives are limited to 2/4 disks per ATA bus– How does a research project pick N? What’s natural? – Is there sufficient processing power and memory to run

the AME monitoring and testing tasks as well as the application requirements?

– Isn't ISTORE-2 a good HW prototype for this model? Software can act as simple disk interface over network and run a standard disk protocol, and then run that on N nodes per apps/OS node. Plenty of Network BW available in redundant switches

Page 78: IRAM and ISTORE Projects

Slide 78

SCSI v. IDE $/GB

• Prices from PC Magazine, 1995-2000

[Graph: price per gigabyte for SCSI and IDE disks ($0–$450) and the SCSI/IDE price ratio per gigabyte (0–3.00), 1995–2000]