Slide 1

IRAM and ISTORE Projects
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/istore
Fall 2000 DIS DARPA Meeting
Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
– intelligence used to collect and filter monitoring data
– diagnostics and fault injection enhance robustness
– networked to create a scalable shared-nothing cluster
Vector Instruction Set
• Complete load-store vector instruction set
  – Uses the MIPS64™ ISA coprocessor 2 opcode space
    » Ideas work with any core CPU: Arm, PowerPC, ...
  – Architecture state
    » 32 general-purpose vector registers
    » 32 vector flag registers
  – Data types supported in vectors: 64b, 32b, 16b (and 8b)
  – 91 arithmetic and memory instructions
• Not specified by the ISA (see the strip-mining sketch below)
  – Maximum vector register length
  – Functional unit datapath width
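Because the ISA fixes neither the maximum vector register length nor the datapath width, vectorized code is strip-mined: each pass asks the hardware how many elements it can handle. A minimal sketch of the idea in C, with a hypothetical vector_length() stand-in for reading the hardware's maximum vector length (the real compiler emits vector instructions directly; this is only an illustration):

    #include <stddef.h>

    /* Hypothetical stand-in for reading the hardware's maximum vector
     * length (MVL); on real vector hardware this is a single instruction. */
    static size_t vector_length(void) { return 64; }

    /* Strip-mined SAXPY: y[i] += a * x[i].  Each pass handles at most MVL
     * elements, so the same binary runs correctly on implementations with
     * different vector register lengths. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        size_t mvl = vector_length();
        for (size_t i = 0; i < n; i += mvl) {
            size_t vl = (n - i < mvl) ? (n - i) : mvl;  /* elements this pass */
            /* The compiler would turn this inner loop into a few vector
             * instructions (set VL, vector load, multiply-add, vector store). */
            for (size_t j = 0; j < vl; j++)
                y[i + j] += a * x[i + j];
        }
    }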
Slide 40
Compiler/OS Enhancements
• Compiler support
  – Conditional execution of vector instructions (see the masking sketch below)
    » Using the vector flag registers
  – Support for software speculation of load operations
• Operating system support
  – MMU-based virtual memory
  – Restartable arithmetic exceptions
  – Valid and dirty bits for vector registers
  – Tracking of maximum vector length used
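To illustrate the conditional-execution point above: a loop containing an if can still be vectorized by computing a flag (mask) per element and applying the operation only where the flag is set. A scalar C sketch of the transformation (the mask is an explicit array here purely for illustration; on the hardware it lives in a vector flag register):

    #include <stddef.h>

    /* Loop with a data-dependent condition: without conditional (masked)
     * execution, the branch would prevent straightforward vectorization. */
    void clamp_negatives(size_t n, float *x) {
        unsigned char flag[64];          /* stands in for a vector flag register */
        for (size_t i = 0; i < n; i += 64) {
            size_t vl = (n - i < 64) ? (n - i) : 64;
            /* 1. Vector compare: set one flag per element. */
            for (size_t j = 0; j < vl; j++)
                flag[j] = (x[i + j] < 0.0f);
            /* 2. Masked vector operation: only flagged elements are written. */
            for (size_t j = 0; j < vl; j++)
                if (flag[j])
                    x[i + j] = 0.0f;
        }
    }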
Slide 41
BACKUP SLIDES
ISTORE
Slide 42
ISTORE: A server for the PostPC Era
Aaron Brown, Dave Martin, David Oppenheimer, Noah Treuhaft, Dave Patterson, Katherine Yelick

• Availability, Maintainability, and Evolutionary growth key challenges for storage systems
  – Maintenance Cost ~ >10X Purchase Cost per year
  – Even 2X purchase cost for 1/2 maintenance cost wins
  – AME improvement enables even larger systems
• ISTORE also has cost-performance advantages
  – Better space, power/cooling costs ($ @ colocation site)
  – More MIPS, cheaper MIPS, no bus bottlenecks
  – Compression reduces network $, encryption protects
  – Single interconnect supports evolution of technology; single network technology to maintain/understand
• Match to future software storage services
  – Future storage service software targets clusters
Slide 44
Lampson: Systems Challenges
• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn't matter
“Computer Systems Research: Past and Future”
Keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft
Slide 45
Jim Gray: Trouble-Free Systems
• Manager
  – Sets goals
  – Sets policy
  – Sets budget
  – System does the rest.
• Everyone is a CIO (Chief Information Officer)
• Build a system
  – Used by millions of people each day
  – Administered and managed by a ½-time person.
    » On hardware fault, order replacement part
    » On overload, order additional equipment
    » Upgrade hardware and software automatically.
“What Next? A dozen remaining IT problems”
Turing Award Lecture, FCRC, May 1999
Jim Gray, Microsoft
Slide 46
Jim Gray: Trustworthy Systems
• Build a system used by millions of people that
  – Only services authorized users
    » Service cannot be denied (can't destroy data or power).
    » Information cannot be stolen.
  – Is always available (out less than 1 second per 100 years = 8 9's of availability)
    » 1950's: 90% availability; today: 99% uptime for web sites, 99.99% for well-managed sites (50 minutes/year); 3 extra 9s in 45 years (see the downtime arithmetic below)
    » Goal: 5 more 9s: 1 second per century.
  – And prove it.
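A quick check of the figures above (my arithmetic, not part of the slide): yearly downtime is (1 - availability) × 525,600 minutes.

    99%    available  ->  0.01   × 525,600 min  ≈  5,256 min  ≈  3.7 days/year
    99.99% available  ->  0.0001 × 525,600 min  ≈  52.6 min/year  (the "50 minutes/year" cited above)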
Slide 47
Hennessy: What Should the “New World” Focus Be?
• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – Per device and per service transaction
• Performance
  – Remains important, but it's not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”
Keynote address, FCRC, May 1999
John Hennessy, Stanford
Slide 48
The real scalability problems: AME
• Availability
  – Systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  – Systems should require only minimal ongoing human administration, regardless of scale or complexity; today, cost of maintenance = 10-100X cost of purchase
• Evolutionary Growth
  – Systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow
Slide 49
Principles for achieving AME
• No single points of failure, lots of redundancy
• Performance robustness is more important than peak performance
• Performance can be sacrificed for improvements in AME
  – Resources should be dedicated to AME
    » Biological systems: > 50% of resources on maintenance
  – Can make up performance by scaling system
• Introspection
  – Reactive techniques to detect and adapt to failures, workload variations, and system evolution
  – Proactive techniques to anticipate and avert problems before they happen
Slide 50
Hardware Techniques (1): SON
• SON: Storage Oriented Nodes
• Distribute processing with storage
  – If AME really important, provide resources!
  – Most storage servers limited by speed of CPUs!!
  – Amortize sheet metal, power, cooling, network for disk to add processor, memory, and a real network?
  – Embedded processors: 2/3 perf, 1/10 cost, power?
  – Serial lines, switches also growing with Moore's Law; less need today to centralize vs. bus-oriented systems
• Advantages of cluster organization
  – Truly scalable architecture
  – Architecture that tolerates partial failure
  – Automatic hardware redundancy
  – Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance
  – Allows testing, repair of online system
  – Managed by diagnostic processor and network switches via diagnostic network
• Built-in fault injection capabilities
  – Power control to individual node components
  – Injectable glitches into I/O and memory busses
  – Managed by diagnostic processor
  – Used for proactive hardware introspection
    » Automated detection of flaky components
    » Controlled testing of error-recovery mechanisms
Slide 53
“Hardware” culture (4)
• Benchmarking
  – One reason for 1000X processor performance was the ability to measure (vs. debate) which is better
    » e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
  – Need AME benchmarks (one possible approach is sketched below)
    “what gets measured gets done”
    “benchmarks shape a field”
    “quantification brings rigor”
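A hedged sketch of what an AME (here, availability) benchmark could look like, assuming the methodology of running a steady workload, injecting a fault partway through, and reporting quality of service over time rather than a single peak number; the workload driver and fault injector are stubs, since they are entirely system-specific:

    #include <stdio.h>

    /* Stubs standing in for the system under test and its fault-injection
     * hooks (hypothetical; a real harness would drive the diagnostic network). */
    static double measure_throughput(void) { return 100.0; }   /* requests/sec */
    static void   inject_fault(void)       { /* e.g., power off one disk */ }

    int main(void) {
        const int duration_s = 600;   /* length of the measurement run */
        const int fault_at_s = 200;   /* inject the fault one third of the way in */

        /* The result is the whole quality-of-service curve: depth of the dip
         * after the fault and time to recover, not just steady-state speed. */
        for (int t = 0; t < duration_s; t += 10) {
            if (t == fault_at_s)
                inject_fault();
            printf("%4d s  %8.1f req/s\n", t, measure_throughput());
        }
        return 0;
    }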
– Cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
  » A single field-replaceable unit to simplify maintenance
– Each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
– More CPU than NAS; fewer disks/node than cluster

Intelligent Disk “Brick”: portable PC CPU (Pentium II/266) + DRAM

ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches
  • 20 × 100 Mbit/s
  • 2 × 1 Gbit/s
Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Slide 59
Common Question: RAID?
• Switched network sufficient for all types of communication, including redundancy
  – Hierarchy of buses is generally not superior to switched network
• Veritas, others offer software RAID 5 and software mirroring (RAID 1) (see the parity sketch below)
• Another use of processor per disk
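For context on the software RAID mentioned above: RAID 5 redundancy is a parity block computed as the XOR of the data blocks in a stripe, and a single lost block is rebuilt by XOR-ing the survivors. A minimal, generic sketch in C (not taken from any particular product):

    #include <stddef.h>
    #include <string.h>

    /* Compute the RAID 5 parity block for one stripe: parity = XOR of all
     * data blocks.  The same routine rebuilds a missing data block if it is
     * fed the surviving data blocks plus the parity block. */
    void raid5_parity(unsigned char *parity, size_t block_size,
                      unsigned char *const *blocks, size_t nblocks) {
        memset(parity, 0, block_size);
        for (size_t b = 0; b < nblocks; b++)
            for (size_t i = 0; i < block_size; i++)
                parity[i] ^= blocks[b][i];
    }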
Slide 60
A Case for Intelligent Storage
Advantages:
• Cost of Bandwidth
• Cost of Space
• Cost of Storage System v. Cost of Disks
• Physical Repair, Number of Spare Parts
• Cost of Processor Complexity
• Cluster advantages: dependability, ...
[Figure: conventional storage architecture, with a server (CPU + memory) connected over a Storage Area Network (FC-AL) to a disk array on a RAID bus, 15 disks/bus; bus utilization limited to ~50% by command overhead (~20%) and queueing theory (< 70%). See the queueing note below.]
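The "< 70%" rule of thumb in the figure above reflects how queueing delay grows with utilization. Assuming a simple M/M/1 model (my assumption; the slide names no model), mean response time at utilization ρ for service time S is R = S / (1 - ρ):

    ρ = 0.5  ->  R = 2.0 × S
    ρ = 0.7  ->  R ≈ 3.3 × S
    ρ = 0.9  ->  R = 10 × S

so keeping a shared bus below roughly 50-70% utilization keeps queueing delay to a small multiple of the service time.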
Slide 65
Physical Repair, Spare Parts
• ISTORE: compatible modules based on hot-pluggable interconnect (LAN) with few Field Replaceable Units (FRUs): node, power supplies, switches, network cables
  – Replace node (disk, CPU, memory, NI) if any fail
• Conventional: heterogeneous system with many server modules (CPU, backplane, memory cards, ...) and disk array modules (controllers, disks, array controllers, power supplies, ...)
  – Store all components available somewhere as FRUs
  – Sun Enterprise 10k has ~100 types of spare parts
  – Sun 3500 Array has ~12 types of spare parts
Slide 66
ISTORE: Complexity v. Perf
• Complexity increase:
  – HP PA-8500: issues 4 instructions per clock cycle, 56-instruction out-of-order execution window, 4 Kbit branch predictor, 9-stage pipeline, 512 KB I cache, 1024 KB D cache (> 80M transistors just in caches)
  – Intel XScale: 16 KB I$, 16 KB D$, 1 instruction per clock, in-order execution, no branch prediction, 6-stage pipeline
• Complexity costs in development time, development power, die size, cost
[...] slowly, relatively expensive for switches, bandwidth
  – FC-AL switches don't interoperate
  – Two sets of cables, wiring?
  – SysAdmin trained in 2 networks, SW interfaces, ...???
• Why not a single network based on the best HW/SW technology?
  – Note: there can still be 2 instances of the network (e.g. external, internal), but only one technology
Slide 69
Initial Applications
• ISTORE-1 is not one super-system that demonstrates all these techniques!
  – Initially provide middleware, library to support AME
• Initial application targets
  – Information retrieval for multimedia data (XML storage?)
    » Self-scrubbing data structures, structuring performance-robust distributed computation
    » Example: home video server using XML interfaces
  – Email service
    » Self-scrubbing data structures, online self-testing
    » Statistical identification of normal behavior (see the sketch below)
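As one concrete reading of "statistical identification of normal behavior" (the specific method is my illustrative assumption, not specified by the slide): maintain running statistics of a monitored metric and flag samples far outside the learned range. A short sketch in C using Welford's online mean/variance update:

    #include <math.h>
    #include <stdio.h>

    /* Online mean/variance (Welford's algorithm) for one monitored metric,
     * e.g. per-request latency reported by a node. */
    struct stats { double n, mean, m2; };

    static void update(struct stats *s, double x) {
        s->n += 1.0;
        double d = x - s->mean;
        s->mean += d / s->n;
        s->m2 += d * (x - s->mean);
    }

    /* Flag a sample as anomalous if it lies more than k standard deviations
     * from the mean learned so far (k = 3 is an arbitrary illustrative choice). */
    static int is_anomalous(const struct stats *s, double x, double k) {
        if (s->n < 2.0) return 0;                 /* not enough history yet */
        double sd = sqrt(s->m2 / (s->n - 1.0));
        return fabs(x - s->mean) > k * sd;
    }

    int main(void) {
        struct stats lat = {0, 0, 0};
        double samples[] = {10.1, 9.8, 10.3, 10.0, 9.9, 42.0};  /* last one is odd */
        for (int i = 0; i < 6; i++) {
            if (is_anomalous(&lat, samples[i], 3.0))
                printf("sample %d (%.1f ms) outside normal behavior\n", i, samples[i]);
            update(&lat, samples[i]);
        }
        return 0;
    }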
Slide 70
A glimpse into the future?
• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• ISTORE HW in 5-7 years:
  – 2006 brick: System-On-a-Chip integrated with MicroDrive
    » 9 GB disk, 50 MB/sec from disk
    » Connected via crossbar switch
    » From brick to “domino”
  – If low power, 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 71
Conclusion: ISTORE as Storage System of the Future
• Availability, Maintainability, and Evolutionary growth key challenges for storage systems
  – Maintenance Cost ~ 10X Purchase Cost per year, so over 5-year product life, ~95% of cost of ownership
  – Even 2X purchase cost for 1/2 maintenance cost wins
  – AME improvement enables even larger systems
• ISTORE has cost-performance advantages
  – Better space, power/cooling costs ($ @ colocation site)
  – More MIPS, cheaper MIPS, no bus bottlenecks
  – Compression reduces network $, encryption protects
  – Single interconnect supports evolution of technology; single network technology to maintain/understand
• Match to future software storage services
  – Future storage service software targets clusters
Cost of Storage System v. Disks
• Examples show the cost of the way we build current systems (2 networks, many buses, CPU, ...):

  System      Date   Cost    Maint.  Disks  Disks/CPU  Disks/I/O bus
  NCR WM      10/97  $8.3M   --      1,312  10.2        5.0
  Sun 10k      3/98  $5.2M   --        668  10.4        7.0
  Sun 10k      9/99  $6.2M   $2.1M   1,732  27.0       12.0
  IBM Netinf   7/00  $7.8M   $1.8M   7,040  55.0        9.0

  => Too complicated, too heterogeneous
• And databases are often CPU or bus bound!
  – ISTORE disks per CPU: 1.0
  – ISTORE disks per I/O bus: 1.0
Slide 76
Common Question: Why Not Vary Number of Processors and Disks?
• Argument: if you can vary the number of each to match the application, isn't that a more cost-effective solution?
• Alternative Model 1: Dual Nodes + E-switches
• Response
  – Since D-nodes run the network protocol, they still need a processor and memory, just smaller; how much does that save?
  – Saves processors/disks, costs more NICs/switches: N ISTORE nodes vs. N/2 P-nodes + N D-nodes
  – Isn't ISTORE-2 a good HW prototype for this model? Only run the communication protocol on N nodes, run the full app and OS on N/2
Slide 77
Common Question: Why Not Vary Number of Processors and Disks?
• Alternative Model 2: N disks/node
  – Processor, memory, N disks, 2 Ethernet NICs
• Response
  – Potential I/O bus bottleneck as disk BW grows
  – 2.5" ATA drives are limited to 2/4 disks per ATA bus
  – How does a research project pick N? What's natural?
  – Is there sufficient processing power and memory to run the AME monitoring and testing tasks as well as the application requirements?
  – Isn't ISTORE-2 a good HW prototype for this model? Software can act as a simple disk interface over the network and run a standard disk protocol, and then run that on N nodes per apps/OS node. Plenty of network BW available in redundant switches.