Slide 1
Computers for the Post-PC Era
David Patterson, University of California at Berkeley
UC Berkeley IRAM Group, UC Berkeley ISTORE Group
[email protected]
February 2000
Slide 2
Perspective on Post-PC Era
• The Post-PC era will be driven by 2 technologies:
1) “Gadgets”: tiny embedded or mobile devices
– ubiquitous: in everything
– e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to support such devices
– e.g., successor to big fat web servers, database servers
Slide 3
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 4
New Architecture Directions
• “…media processing will become the dominant force in computer arch. and microprocessor design.”
• “...new media-rich applications ... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, 32-bit integer and Fl. Pt.”
• Needs include real-time response, continuous media data types (no temporal locality), fine-grain parallelism, coarse-grain parallelism, memory bandwidth
– “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)
Slide 5
Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
– 10X capacity vs. SRAM
– on-chip memory latency 5-10X, bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– serial I/O 5-10X vs. buses
– smaller board area/volume
IRAM advantages extend to:
– a single chip system
– a building block for larger systems
[Figure: block diagrams contrasting a conventional system (processor with L1 caches and L2$ on a logic fab, connected over a bus to separate DRAM chips and I/O) with an IRAM system (processor, DRAM, and I/O integrated on a single DRAM-fab chip)]
Slide 6
Revive Vector Architecture
Traditional vector concerns, and the IRAM answer to each:
• Cost: $1M each? → Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? → IRAM
• Code density? → Much smaller than VLIW
• Compilers? → For sale, mature (>20 years); we retarget Cray compilers
• Performance? → Easy to scale speed with technology
• Power/Energy? → Parallel to save energy, keep performance
• Limited to scientific applications? → Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
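To make “multimedia apps vectorizable too” concrete, here is a minimal sketch (my illustration, not from the talk) of the kind of loop these slides have in mind: a saturating 16-bit pixel add with no loop-carried dependences, which a vectorizing compiler can pack four-to-a-lane into 64-bit datapaths (the 4N*16b case above).

```c
#include <stddef.h>
#include <stdint.h>

/* Blend two 16-bit image planes with saturation. Each iteration is
 * independent, so a vectorizing compiler can pack four 16-bit adds
 * into each 64-bit vector lane. */
void add_sat16(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t s = (int32_t)a[i] + b[i]; /* widen to avoid overflow */
        if (s > INT16_MAX) s = INT16_MAX; /* clamp high */
        if (s < INT16_MIN) s = INT16_MIN; /* clamp low  */
        out[i] = (int16_t)s;
    }
}
```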
Slide 7
V-IRAM1: Low Power v. High Perf.
[Figure: V-IRAM1 block diagram: a 2-way superscalar MIPS core with 16K I-cache and 16K D-cache, a vector processor with vector registers and add (+), multiply (x), divide (÷), and load/store units, a vector instruction queue, a memory crossbar switch fanning out to many DRAM banks (M), and serial I/O; datapaths are 4 x 64 bits, configurable as 4 x 64, 8 x 32, or 16 x 16]
Slide 8
VIRAM-1: System on a Chip
Prototype scheduled for tape-out mid-2000
• 0.18 um EDL process
• 16 MB DRAM, 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100 MB parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
• 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
[Figure: chip floorplan: CPU+$ and I/O with 4 vector pipes/lanes in the center, flanked by two 8-MByte (64-Mbit) DRAM memory halves connected through the crossbar (Xbar)]
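As a sanity check (my arithmetic, not from the slide), the peak figures are consistent with 4 lanes at 200 MHz if each 64-bit pipeline sustains one multiply-add (counted as 2 operations) per cycle, and each lane subdivides into four 16-bit elements:

```latex
\begin{aligned}
\text{64-bit peak} &= 4\ \text{lanes} \times 2\ \tfrac{\text{ops}}{\text{cycle}} \times 200\ \text{MHz} = 1.6\ \text{GFLOPS}\\
\text{16-bit peak} &= 4\ \text{lanes} \times 4\ \tfrac{\text{elements}}{\text{lane}} \times 2\ \tfrac{\text{ops}}{\text{cycle}} \times 200\ \text{MHz} = 6.4\ \text{GOPS}
\end{aligned}
```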
Slide 9
Media Kernel Performance

Kernel               Peak Perf.   Sustained Perf.   % of Peak
Image Composition    6.4 GOPS     6.40 GOPS         100.0%
iDCT                 6.4 GOPS     1.97 GOPS          30.7%
Color Conversion     3.2 GOPS     3.07 GOPS          96.0%
Image Convolution    3.2 GOPS     3.16 GOPS          98.7%
Integer MV Multiply  3.2 GOPS     2.77 GOPS          86.5%
Integer VM Multiply  3.2 GOPS     3.00 GOPS          93.7%
FP MV Multiply       3.2 GFLOPS   2.80 GFLOPS        87.5%
FP VM Multiply       3.2 GFLOPS   3.19 GFLOPS        99.6%
AVERAGE                                              86.6%
Slide 10
Base-line system comparison

Kernel             VIRAM   MMX            VIS            TMS320C82
Image Composition  0.13    -              2.22 (17.0x)   -
iDCT               1.18    3.75 (3.2x)    -              -
Color Conversion   0.78    8.00 (10.2x)   -              5.70 (7.6x)
Image Convolution  1.22    5.49 (4.5x)    6.19 (5.1x)    6.50 (5.3x)

• All numbers in cycles/pixel (ratios in parentheses are relative to VIRAM)
• MMX and VIS results assume all data in L1 cache
Slide 11
IRAM Chip Challenges
• Merged logic-DRAM process cost: cost of wafer, impact on yield, testing cost of logic and DRAM
• Price: on-chip DRAM vs. separate DRAM chips?
• Delay in transistor speeds, memory cell sizes in merged process vs. logic-only or DRAM-only
• DRAM block: flexibility via DRAM “compiler” (vary size, width, no. subbanks) vs. fixed block
• Apps: advantages in memory bandwidth, energy, system size to offset challenges?
Slide 12
Other examples: IBM “Blue Gene”
• 1 PetaFLOPS in 2005 for $100M?
• Application: protein folding
• Blue Gene chip
– 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip
– 1 GFLOPS / processor
• 2’ x 2’ board = 64 chips (2K CPUs)
• Rack = 8 boards (512 chips, 16K CPUs)
• System = 64 racks (512 boards, 32K chips, 1M CPUs)
• Total 1 million processors in just 2000 sq. ft.
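The totals multiply out as stated (my check of the slide’s arithmetic):

```latex
\begin{aligned}
64\ \text{chips/board} \times 32\ \text{CPUs/chip} &= 2048\ \text{CPUs/board}\\
8\ \text{boards/rack} \times 64 &= 512\ \text{chips/rack},\qquad 8 \times 2048 = 16{,}384\ \text{CPUs/rack}\\
64\ \text{racks} \times 512 &= 32{,}768\ \text{chips},\qquad 64 \times 16{,}384 \approx 1\,\text{M CPUs}\\
32{,}768\ \text{chips} \times 32\ \text{GFLOPS/chip} &\approx 1\ \text{PetaFLOPS}
\end{aligned}
```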
Slide 13
Other examples: Sony Playstation 2
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
– Superscalar MIPS core + vector coprocessor + graphics/DRAM
– Claim: “Toy Story” realism brought to games
Slide 14
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 15
The problem space: big data
• Big demand for enormous amounts of data
– today: high-end enterprise and Internet applications
» enterprise decision-support, data mining databases
» online applications: e-commerce, mail, web, archives
– future: infrastructure services, richer data
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Today’s SMP server designs can’t easily scale
• Bigger scaling problems than performance!
Slide 16
Lampson: Systems Challenges
• Systems that work
– meeting their specs
– always available
– adapting to changing environment
– evolving while they run
– made from unreliable components
– growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– understanding when it doesn’t matter
Butler Lampson, Microsoft
“Computer Systems Research-Past and Future,” keynote address, 17th SOSP, Dec. 1999
Slide 17
Hennessy: What Should the “New World” Focus Be?
• Availability
– both appliance & service
• Maintainability
– two functions:
» enhancing availability by preventing failure
» ease of SW and HW upgrades
• Scalability
– especially of service
• Cost
– per device and per service transaction
• Performance
– remains important, but it’s not SPECint
John Hennessy, Stanford
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” keynote address, FCRC, May 1999
Slide 18
The real scalability problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 19
The ISTORE project vision
• Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that:
– are highly available
– require minimal maintenance
– robustly handle evolutionary growth
– are scalable to O(10000) nodes
Slide 20
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
– “performance robustness” implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on maintenance
– can make up performance by scaling system
Slide 21
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
Slide 22
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 23
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture
– architecture that tolerates partial failure
– automatic hardware redundancy
Slide 24
Hardware techniques (2)
• No central processor unit: distribute processing with storage
– serial lines and switches are also growing with Moore’s Law; less need today to centralize vs. bus-oriented systems
– most storage servers are limited by the speed of their CPUs; why does this make sense?
– why not amortize the sheet metal, power, and cooling infrastructure for the disk to add processor, memory, and network?
– if AME is important, must provide resources to be used to help AME: local processors responsible for the health and maintenance of their storage
Slide 25
Hardware techniques (3)
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power, intrusion
– helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
– provides remote control of power, remote console access to the node, selection of node boot code
– collects, stores, processes environmental data for abnormalities
– non-volatile “flight recorder” functionality
– all diagnostic processors connected via independent diagnostic network
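A minimal sketch of the “flight recorder” idea (my illustration; the actual diagnostic-processor firmware is not described in the slides): a fixed-size ring buffer of timestamped sensor samples, kept in storage that survives crashes, so the most recent environmental history is always available after a failure.

```c
#include <stdint.h>

#define FR_CAPACITY 1024  /* samples kept; the oldest is overwritten first */

struct fr_sample {
    uint32_t timestamp;   /* seconds since boot */
    uint8_t  sensor_id;   /* e.g., temperature, vibration, power */
    int16_t  value;       /* raw reading */
};

/* In a real design this struct would live in NVRAM/flash so it
 * persists across crashes; here it is ordinary memory. */
struct flight_recorder {
    struct fr_sample ring[FR_CAPACITY];
    uint32_t next;        /* index of the slot to overwrite next */
};

void fr_record(struct flight_recorder *fr, uint32_t ts,
               uint8_t sensor, int16_t value)
{
    struct fr_sample *s = &fr->ring[fr->next % FR_CAPACITY];
    s->timestamp = ts;
    s->sensor_id = sensor;
    s->value     = value;
    fr->next++;           /* the modulo above bounds the index */
}
```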
Slide 26
Hardware techniques (4)
• On-demand network partitioning/isolation
– Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance
– allows testing, repair of online system
– managed by diagnostic processor and network switches via diagnostic network
Slide 27
Hardware techniques (5)
• Built-in fault injection capabilities
– power control to individual node components
– injectable glitches into I/O and memory busses
– managed by diagnostic processor
– used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– important for AME benchmarking (see next slide)
Slide 28
“Hardware” techniques (6)
• Benchmarking
– One reason for 1000X processor performance was the ability to measure (vs. debate) which is better
» e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
– Need AME benchmarks: “what gets measured gets done”, “benchmarks shape a field”, “quantification brings rigor”
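The classic relation behind that question (the standard architecture formula, not spelled out on the slide) is what made the debate measurable: once each factor can be measured, it is clear which one dominates execution time.

```latex
\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}
```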
Slide 29
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4 TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
– more CPU than NAS; fewer disks/node than cluster
• ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (20 100 Mbit/s, 2 1 Gbit/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
• Intelligent disk “brick”: portable PC CPU (Pentium II/266) + DRAM, redundant NICs (4 100 Mb/s links), diagnostic processor, disk; half-height canister
Slide 30
A glimpse into the future?
• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing the size of the disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive integrated with IRAM
» 9 GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 31
Software techniques
• Fully-distributed, shared-nothing code
– centralization breaks as systems scale up to O(10000)
– avoids single-point-of-failure front ends
• Redundant data storage
– required for high availability, simplifies self-testing
– replication at the level of application objects
» application can control consistency policy
» more opportunity for data placement optimization
Slide 32
Software techniques (2)
• “River” storage interfaces
– NOW Sort experience: performance heterogeneity is the norm
» e.g., disks: outer vs. inner track (1.5X), fragmentation
» e.g., processors: load (1.5-5x)
– So, demand-driven delivery of data to apps (see the sketch below)
» via distributed queues and graduated declustering
» for apps that can handle unordered data delivery
– Automatically adapts to variations in performance of producers and consumers
– Also helps with evolutionary growth of cluster
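A minimal single-machine sketch of demand-driven delivery (my illustration, using POSIX threads; the real River runs distributed queues across cluster nodes): consumers pull records whenever they are ready, so a fast consumer automatically takes more work and a slow one takes less, which is the load-balancing effect described above.

```c
#include <pthread.h>

#define QCAP 64

/* Bounded, thread-safe queue: producers push records, consumers pull
 * on demand. Initialize lock/conds with PTHREAD_*_INITIALIZER. */
struct queue {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

void q_push(struct queue *q, int rec)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = rec;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

int q_pull(struct queue *q)  /* called by any consumer when it is ready */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int rec = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return rec;
}
```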
Slide 33
Software techniques (3)
• Reactive introspection
– use statistical techniques to identify normal behavior and detect deviations from it (see the sketch below)
– policy-driven automatic adaptation to abnormal behavior once detected
» initially, rely on human administrator to specify policy
» eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes
• one candidate: reinforcement learning
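One simple form the “identify normal behavior” step could take (my illustration; the slides do not specify an algorithm): track the running mean and variance of a metric with Welford’s online method and flag samples more than three standard deviations out.

```c
#include <math.h>
#include <stdbool.h>

/* Running mean/variance via Welford's online algorithm. */
struct stat_tracker {
    long   n;
    double mean;
    double m2;    /* sum of squared deviations from the mean */
};

/* Returns true if x deviates more than 3 sigma from behavior so far. */
bool observe(struct stat_tracker *t, double x)
{
    bool anomalous = false;
    if (t->n > 30) {  /* need enough samples to trust the estimate */
        double sd = sqrt(t->m2 / (t->n - 1));
        anomalous = fabs(x - t->mean) > 3.0 * sd;
    }
    t->n++;
    double delta = x - t->mean;
    t->mean += delta / t->n;
    t->m2   += delta * (x - t->mean);
    return anomalous;
}
```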
Slide 34
Software techniques (4)
• Proactive introspection
– continuous online self-testing of HW and SW
» in deployed systems!
» goal is to shake out “Heisenbugs” before they’re encountered in normal operation
» needs data redundancy, node isolation, fault injection
– techniques (see the scrubbing sketch below):
» fault injection: triggering hardware and software error handling paths to verify their integrity/existence
» stress testing: push HW/SW to their limits
» scrubbing: periodic restoration of potentially “decaying” hardware or software state
• self-scrubbing data structures (like MVS)
• ECC scrubbing for disks and memory
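A minimal sketch of data scrubbing (my illustration; the block layout and helper functions are assumptions, not ISTORE’s API): periodically re-read each block, verify its checksum, and repair from a replica, so a latent error is fixed while redundancy still exists rather than discovered during a disk failure.

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 1024

/* Hypothetical helpers, not a real API: read a block, compute its
 * checksum, fetch the stored checksum, rewrite from a replica. */
bool     read_block(int disk, int blk, uint8_t *buf);
uint32_t checksum(const uint8_t *buf);
uint32_t stored_checksum(int disk, int blk);
void     repair_from_replica(int disk, int blk);

/* One scrubbing pass over a disk; run periodically in the background. */
void scrub_pass(int disk)
{
    uint8_t buf[4096];
    for (int blk = 0; blk < NBLOCKS; blk++) {
        if (!read_block(disk, blk, buf) ||
            checksum(buf) != stored_checksum(disk, blk)) {
            /* Latent error found while redundancy still exists. */
            repair_from_replica(disk, blk);
        }
    }
}
```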
Slide 35
Applications
• ISTORE is not one super-system that demonstrates all these techniques!
– initially provide library to support AME goals
• Initial application targets
– cluster web/email servers
» self-scrubbing data structures, online self-testing
» statistical identification of normal behavior
– decision-support database query execution system
» River-based storage, replica management
– information retrieval for multimedia data
» self-scrubbing data structures, structuring performance-robust distributed computation
Slide 36
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 37
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability (see the harness sketch below)
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macro-benchmarks
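A schematic of the measurement loop (my illustration; the hook functions are hypothetical): run the workload generator continuously, record the QoS metric per interval, and inject a fault partway through, so the resulting trace shows the system’s reaction against its normal behavior.

```c
#include <stdio.h>

/* Hypothetical hooks into the workload generator and fault injector. */
double measure_hits_per_sec(int interval);  /* avg over one 2-min interval */
void   inject_fault(const char *type);

int main(void)
{
    const int total_intervals = 60;
    const int fault_interval  = 20;   /* inject the fault 40 minutes in */

    for (int i = 0; i < total_intervals; i++) {
        if (i == fault_interval)
            inject_fault("disk_uncorrectable_read");
        /* Log one point of the QoS-vs-time curve plotted on the
         * following slides. */
        printf("%d\t%.1f\n", i, measure_hits_per_sec(i));
    }
    return 0;
}
```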
Slide 38
Methodology: reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior?
» 99% confidence intervals calculated from no-fault runs
• Graphs can be distilled into numbers?
[Figure: example availability graph: performance (roughly 160-210) vs. time in 2-minute intervals (0-60), showing the normal-behavior band (99% conf), an injected disk failure, and the subsequent reconstruction period]
Slide 39
Example results: software RAID-5
• Test systems: Linux/Apache and Win2000/IIS
– SpecWeb99 to measure hits/second as QoS metric
– fault injection at disks based on empirical fault data
» transient, correctable, uncorrectable, & timeout faults
• 15 single-fault workloads injected per system
– only 4 distinct behaviors observed:
(A) no effect
(B) system hangs
(C) RAID enters degraded mode
(D) RAID enters degraded mode & starts reconstruction
– both systems hung (B) on simulated disk hangs
– Linux exhibited (D) on all other errors
– Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors
Slide 40
Example results: multiple-faults
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
[Figure: two hits-per-second traces vs. time in 2-minute intervals, each with a normal-behavior band (99% conf). Windows 2000/IIS: data disk faulted, reconstruction (manual), spare faulted, disks replaced. Linux/Apache: data disk faulted, reconstruction (automatic), spare faulted, reconstruction (automatic), disks replaced]
Slide 41
Conclusions (1): Benchmarks
• Linux and Windows take opposite approaches to managing benign and transient faults
– Linux is paranoid and stops using a disk on any error
– Windows ignores most benign/transient faults
– Windows is more robust except when a disk is truly failing
• Linux and Windows have different reconstruction philosophies
– Linux uses idle bandwidth for reconstruction
– Windows steals app. bandwidth for reconstruction
– Windows rebuilds fault-tolerance more quickly
• Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance
Slide 42
Conclusions (2): ISTORE
• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
– via clusters of network-attached, computationally-enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Slide 43
Conclusions (3)
• IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth
– Gadgets: embedded/mobile devices
– Infrastructure: intelligent storage and networks
• Post-PC infrastructure requires
– new goals: Availability, Maintainability, Evolution
– new principles: Introspection, Performance Robustness
– new techniques: isolation/fault insertion, software scrubbing
– new benchmarks: measure, compare AME metrics
Slide 44
Berkeley Future work
• IRAM: fab and test chip
• ISTORE
– implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications
– select the best techniques and integrate into a generic runtime system with “AME API”
– add maintainability benchmarks
» can we quantify administrative work needed to maintain a certain level of availability?
– perhaps look at data security via encryption?
– even consider denial of service?
Slide 45
For more information:
The UC Berkeley IRAM/ISTORE Projects: Computers for the Post-PC Era
http://iram.cs.berkeley.edu/
[email protected]
Slide 46
Backup Slides
(mostly in the area of benchmarking)
Slide 47
Case study
• Software RAID-5 plus web server
– Linux/Apache vs. Windows 2000/IIS
• Why software RAID?
– well-defined availability guarantees
» RAID-5 volume should tolerate a single disk failure
» reduced performance (degraded mode) after failure
» may automatically rebuild redundancy onto spare disk
– simple system
– easy to inject storage faults
• Why web server?
– an application with measurable QoS metrics that depend on RAID availability and performance
Slide 48
Benchmark environment: metrics
• QoS metrics measured
– hits per second
» roughly tracks response time in our experiments
– degree of fault tolerance in storage system
• Workload generator and data collector
– SpecWeb99 web benchmark
» simulates realistic high-volume user load
» mostly static read-only workload; some dynamic content
» modified to run continuously and to measure average hits per second over each 2-minute interval
Slide 49
Benchmark environment: faults
• Focus on faults in the storage system (disks)
• How do disks fail?
– according to the Tertiary Disk project, failures include:
» recovered media errors
» uncorrectable write failures
» hardware errors (e.g., diagnostic failures)
» SCSI timeouts
» SCSI parity errors
– note: no head crashes, no fail-stop failures
Slide 50
Disk fault injection technique
• To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
– a PC that appears as a disk on the SCSI bus
– I/O requests processed in software, reflected to local disk
– fault injection performed by altering SCSI command processing in the emulation software (see the sketch below)
• Types of emulated faults:
– media errors (transient, correctable, uncorrectable)
– hardware errors (firmware, mechanical)
– parity errors
– power failures
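A schematic of how such injection might look inside the emulator’s command loop (my illustration; the types and function names are hypothetical, not the ASC VirtualSCSI API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical emulator types and helpers, not the real library API. */
enum fault { FAULT_NONE, FAULT_MEDIA_UNCORRECTABLE, FAULT_TIMEOUT };

struct scsi_cmd { uint8_t opcode; uint32_t lba; void *buf; };

static enum fault pending_fault;   /* set by the benchmark harness */
static uint32_t   fault_lba;       /* which block the fault targets */

bool service_local_disk(struct scsi_cmd *c);  /* normal pass-through */
void send_check_condition(struct scsi_cmd *c, int sense_key);

void handle_command(struct scsi_cmd *c)
{
    /* Intercept: if this command touches the faulted block, report
     * the injected error instead of doing the real I/O. */
    if (pending_fault == FAULT_MEDIA_UNCORRECTABLE && c->lba == fault_lba) {
        send_check_condition(c, 0x3);   /* SCSI sense key: MEDIUM ERROR */
        return;
    }
    if (pending_fault == FAULT_TIMEOUT)
        return;                         /* drop command: host sees a hang */

    service_local_disk(c);              /* otherwise behave like a disk */
}
```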
Slide 51
System configuration
• RAID-5 volume: 3 GB capacity, 1 GB used per disk
– 3 physical disks, 1 emulated disk, 1 emulated spare disk
• 2 web clients connected via 100 Mb switched Ethernet
[Figure: system diagram: server (AMD K6-2-333, 64 MB DRAM, Linux or Win2000, IDE system disk) with Adaptec 2940 Fast/Wide SCSI buses (20 MB/sec) to 3 IBM 18 GB 10k RPM RAID data disks and an IBM 18 GB 10k RPM SCSI system disk; disk emulator (AMD K6-2-350, Windows NT 4.0, ASC VirtualSCSI lib, AdvStor ASC-U2W UltraSCSI, emulator backing disk (NTFS)) presenting the emulated disk and emulated spare disk]
Slide 52
Results: single-fault experiments
• One experiment for each type of fault (15 total)
– only one fault injected per experiment
– no human intervention
– system allowed to continue until stabilized or crashed
• Four distinct system behaviors observed:
(A) no effect: system ignores fault
(B) RAID system enters degraded mode
(C) RAID system begins reconstruction onto spare disk
(D) system failure (hang or crash)
Slide 53
State of the Art: Ultrastar 72ZX
– 73.4 GB, 3.5-inch disk
– 2¢/MB
– 16 MB track buffer
– 11 platters, 22 surfaces
– 20,944 cylinders
– 7 Gbit/sq. in. areal density
– 17 watts (idle)
– 0.1 ms controller time
– 5.3 ms avg. seek (seek 1 track => 0.6 ms)
– 3 ms = 1/2 rotation
– 37 to 22 MB/s to media
source: www.ibm.com; www.pricewatch.com; 2/14/00
Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + (Size / Bandwidth)
(queuing + controller + seek + rotation are paid per access; size / bandwidth is paid per byte)
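Plugging in the Ultrastar 72ZX numbers above for a sample 16 KB read with an empty queue (my arithmetic; the ~30 MB/s transfer rate is a midpoint of the 37-22 MB/s media range):

```latex
\text{Latency} \approx 0\ \text{(queue)} + 0.1\ \text{ms} + 5.3\ \text{ms} + 3\ \text{ms} + \frac{16\ \text{KB}}{30\ \text{MB/s}} \approx 8.9\ \text{ms}
```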
[Figure: disk anatomy: sector, track, cylinder, head, platter, arm, embedded processor, track buffer]