Slide 1 IRAM and ISTORE Projects Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson http://iram.cs.berkeley.edu/[istore] Winter 2000 IRAM/ISTORE Retreat
• All numbers in cycles/pixel
• MMX and VIS results assume all data in L1 cache
Slide 13
Scaling to 10K Processors
• IRAM + micro-disk offer huge scaling opportunities
• Still many hard system problems: SAM, AME (talk)
– Availability
» 24x7 databases without human intervention
» Discrete vs. continuous model of machine being up
– Maintainability
» 42% of system failures are due to administrative errors
» Self-monitoring, tuning, and repair
– Evolution
» Dynamic scaling with plug-and-play components
» Scalable performance, gracefully down as well as up
» Machines become heterogeneous in performance at scale
Slide 14
Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
– intelligence used to collect and filter monitoring data
– diagnostics and fault injection enhance robustness
– networked to create a scalable shared-nothing cluster
Intelligent Chassis: 80 nodes, 8 per tray; 2 levels of switches
• 20 100 Mb/s
• 2 1 Gb/s
Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...
Intelligent Disk “Brick”: portable PC processor (Pentium II) + DRAM
• Sensors for heat and vibration
• Control over power to individual nodes
Slide 16
ISTORE Software Approach
• Two-pronged approach to providing reliability:
1) Reactive self-maintenance: dynamic reaction to exceptional system events
» self-diagnosing, self-monitoring hardware
» software monitoring and problem detection
» automatic reaction to detected problems
2) Proactive self-maintenance: continuous online self-testing and self-analysis
» automatic characterization of system components
» in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used
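The reactive loop above (monitor, detect, react) can be sketched in a few lines. This is an illustrative model only: the `HealthSample` record, thresholds, and action names are hypothetical, not ISTORE's actual interfaces.

```python
# Hypothetical sketch of reactive self-maintenance: collect monitoring
# data, detect problems in software, react automatically.
from dataclasses import dataclass

@dataclass
class HealthSample:
    node: int
    disk_errors: int      # recoverable disk errors since last sample
    temperature_c: float  # enclosure temperature

def detect_problems(sample, max_errors=5, max_temp_c=55.0):
    """Software problem detection over self-monitoring hardware data."""
    problems = []
    if sample.disk_errors > max_errors:
        problems.append("flaky-disk")
    if sample.temperature_c > max_temp_c:
        problems.append("overheating")
    return problems

def react(node, problems):
    """Automatic reaction: map each detected problem to an action."""
    actions = {"flaky-disk": "migrate-data-and-retire-disk",
               "overheating": "throttle-or-power-down-node"}
    return [(node, actions[p]) for p in problems]

# One pass of the reactive loop over incoming samples:
samples = [HealthSample(0, 1, 40.0), HealthSample(1, 9, 58.0)]
reactions = []
for s in samples:
    reactions.extend(react(s.node, detect_problems(s)))
```

The proactive prong would sit beside this loop, injecting faults and scrubbing even when no sample crosses a threshold.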
Slide 17
ISTORE Applications
• Storage-intensive, reliable services for ISTORE-1
– infrastructure for “thin clients,” e.g., PDAs
– web services, such as mail and storage
– large-scale databases (talk)
– information retrieval (search and on-the-fly indexing)
• Scalable memory-intensive computations for ISTORE in 2006
– performance estimates through IRAM simulation + model (not a major emphasis)
– large-scale defense and scientific applications enabled by high memory bandwidth and arithmetic performance
Slide 18
Performance Availability
• System performance limited by the weakest link
• NOW-Sort experience: performance heterogeneity is the norm
– disks: inner vs. outer track (50%), fragmentation
– processors: load (1.5-5x) and heat
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones
[Chart: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of a single slow disk (100%, 67%, 39%, 29%); series: Ideal, Virtual Streams, Static]
Slide 19
ISTORE Update
• High-level hardware design by UCB complete (talk)
– Design of ISTORE boards handed off to Anigma
» First run complete; SCSI problem to be fixed
» Testing of UCB design (DP) to start asap
» 10 nodes by end of 1Q 2000, 80 by 2Q 2000
» Design of BIOS handed off to AMI
– Most parts donated or discounted
» Adaptec, Andataco, IBM, Intel, Micron, Motorola, Packet Engines
• Proposal for quantifying AME (talk)
• Beginning work on short-term applications that will be used to drive principled system design:
» Mail server
» Web server
» Large database
» Decision support primitives
Slide 20
Conclusions
• IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
– Mobile consumer electronic devices
– Scalable infrastructure
• IRAM benchmarking result: faster than DSPs
• ISTORE: hardware/software architecture for large-scale network services
• Scaling systems requires
– new continuous models of availability
– performance not limited by the weakest link
– self-* systems to reduce human interaction
Slide 21
Backup Slides
Slide 22
Introduction and Ground Rules
• Who is here? Mixed IRAM/ISTORE “experience”
• Questions are welcome during talks
• Schedule: lecture from Brewster Kahle during Thursday’s Open Mic Session
• Feedback is required (Fri am)
– Be careful, we have been known to listen to you
• Mixed experience: please ask
• Time for skiing and talking tomorrow afternoon
Slide 23
2006 ISTORE
• ISTORE node
– Add 20% pad to MicroDrive size for packaging, connectors
– Then double thickness to add IRAM
– 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm)
• Crossbar switches growing by Moore’s Law
– 2x transistors/1.5 yrs = 4X transistors/3 yrs
– Crossbars grow by N^2, so 2X switch size/3 yrs
– 16 x 16 in 1999, 64 x 64 in 2005
• ISTORE rack (19” x 33” x 84”), 1 tray (3” high): 16 x 32 = 512 ISTORE nodes / tray
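The scaling arithmetic above can be checked directly; a quick sketch using only the slide's own numbers:

```python
# Quick check of the slide's packaging and crossbar scaling arithmetic
# (a sketch using the slide's numbers, not part of any ISTORE tool).
nodes_per_tray = 16 * 32            # node grid in one 3"-high tray
assert nodes_per_tray == 512        # "512 ISTORE nodes / tray"

# Crossbar ports: 4X transistors / 3 yrs buys only 2X switch size / 3 yrs,
# because an N x N crossbar costs ~N^2 in transistors.
ports = 16                          # 16 x 16 in 1999
for _ in range(2):                  # two 3-year steps: 1999 -> 2005
    ports *= 2
print(ports)                        # 64, i.e. 64 x 64 in 2005
```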
• IDEA decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF
• Note scalability of both # lanes and data width
• Some hand-optimizations (unrolling) will be automated by the Cray compiler
[Chart: performance vs. # lanes and virtual processor width]
Slide 25
1D FFT on IRAM
• FFT study on IRAM
– bit-reversal time included; cost hidden using indexed store
– Faster than DSPs on floating-point (32-bit) FFTs
– CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)
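The bit-reversal step whose cost is hidden by the indexed store is just a permutation; a minimal sketch in plain Python (illustrative, not VIRAM code):

```python
def bit_reverse_permute(x):
    """Reorder x (length 2^k) into bit-reversed index order.

    On VIRAM the slide hides this cost behind an indexed (scatter)
    store; here it is an explicit index permutation.
    """
    n = len(x)
    k = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        # reverse the k-bit binary representation of i
        r = int(format(i, f"0{k}b")[::-1], 2)
        out[r] = x[i]          # "indexed store": x[i] lands at position r
    return out

print(bit_reverse_permute([0, 1, 2, 3, 4, 5, 6, 7]))
# indices 000..111 reversed -> [0, 4, 2, 6, 1, 5, 3, 7]
```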
Slide 26
3D FFT on ISTORE 2006
• Performance of large 3D FFTs depends on 2 factors
– speed of 1D FFT on a single node (next slide)
– network bandwidth for “transposing” data
• 1.3 Tflop FFT possible w/ 1K IRAM nodes, if network bisection bandwidth scales (!)
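The role of the "transpose" is visible in the standard decomposition of a 3D FFT into 1D transforms along each axis. The sketch below uses a naive O(n^2) DFT as a stand-in for the fast per-node 1D FFT; the axis rotation is the all-to-all step whose cost a cluster's bisection bandwidth bounds.

```python
import cmath

def dft1(x):
    """Naive 1D DFT (O(n^2)); stands in for the fast per-node 1D FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

def rotate_axes(a):
    """(x, y, z) -> (z, x, y): on a cluster this is the all-to-all
    'transpose' limited by network bisection bandwidth."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    return [[[a[x][y][z] for y in range(ny)] for x in range(nx)]
            for z in range(nz)]

def fft3d(a):
    """3D DFT = (1D DFT along the local axis, then transpose) x 3."""
    for _ in range(3):
        a = [[dft1(row) for row in plane] for plane in a]
        a = rotate_axes(a)
    return a

# Sanity check: a unit impulse at the origin transforms to all ones.
n = 4
imp = [[[1.0 if (x, y, z) == (0, 0, 0) else 0.0
         for z in range(n)] for y in range(n)] for x in range(n)]
out = fft3d(imp)
```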
Slide 27
ISTORE-1 System Layout
[Diagram: racks built from stacked brick shelves]
Slide 28
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Block diagram: a 2-way superscalar processor with 16K I-cache and 16K D-cache feeds an instruction queue, vector registers, and vector units (+, x, ÷, load/store), each configurable as 4 x 64b, 8 x 32b, or 16 x 16b; a memory crossbar switch with 4 x 64b paths connects to the on-chip DRAM banks and to I/O ports of 100 MB each]
Slide 29
Fixed-point multiply-add model
• Same basic model, different set of instructions
– fixed-point: multiply & shift & round, shift right & round, shift left & saturate
– integer saturated arithmetic: add or sub & saturate
– added multiply-add instruction for improved performance and energy consumption
[Diagram: multiply half-words x and y (n/2 bits each), shift & round the product to n bits, then add w and saturate (satRound) to produce the n-bit result z]
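The datapath above can be sketched in scalar form. This is a minimal model: the Q15 format, round-to-nearest choice, and function names are illustrative, not VIRAM's exact instruction definitions.

```python
def sat(v, n=16):
    """Saturate v to the signed n-bit range (the '& saturate' step)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, v))

def fx_mul(a, b, frac=15):
    """Multiply & shift & round: take the full-precision product, then
    round-to-nearest on the right shift back to the fixed-point format."""
    return (a * b + (1 << (frac - 1))) >> frac

def fx_madd(x, y, w, frac=15):
    """The added multiply-add: z = sat(round((x * y) >> frac) + w)."""
    return sat(fx_mul(x, y, frac) + w)

half = 1 << 14                      # 0.5 in Q15
print(fx_mul(half, half))           # 0.25 in Q15 -> 8192
print(fx_madd(half, half, 30000))   # 8192 + 30000 saturates to 32767
```

Fusing the multiply, round, add, and saturate into one instruction saves an instruction issue and a register write per element, which is where the performance and energy benefit comes from.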
Slide 30
Other ISA modifications
• Auto-increment loads/stores
– a vector load/store can post-increment its base address
– added base (16), stride (8), and increment (8) registers
– necessary for applications with short vectors or scaled-up implementations
• Butterfly permutation instructions
– perform a step of a butterfly permutation within a vector register
– used for FFT and reduction operations
• Miscellaneous instructions added
– min and max instructions (integer and FP)
– FP reciprocal and reciprocal square root
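One plausible reading of a butterfly permutation step is an exchange of elements whose indices differ in a single bit; the sketch below illustrates that idea and is not the exact VIRAM instruction semantics.

```python
def butterfly_step(v, stage):
    """One butterfly-permutation step within a 'vector register':
    exchange elements whose indices differ in bit `stage`, i.e. at
    distance 2**stage. FFT butterflies and tree reductions pair
    elements exactly this way."""
    d = 1 << stage
    return [v[i ^ d] for i in range(len(v))]

print(butterfly_step([0, 1, 2, 3, 4, 5, 6, 7], 1))
# partners at distance 2 -> [2, 3, 0, 1, 6, 7, 4, 5]
```

Doing this inside a vector register avoids a round trip through memory between FFT stages, which is the point of adding the instruction.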
Slide 31
Major architecture updates
• Integer arithmetic units support multiply-add instructions
• 1 load-store unit
– complexity vs. benefit
• Optimize for strides 2, 3, and 4
– useful for complex arithmetic and image processing functions
• Decoupled strided and indexed stores
– memory stalls due to bank conflicts do not stall the arithmetic pipelines
– allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls
– implemented with address, not data, buffering
– currently examining a similar optimization for loads
Slide 32
Micro-kernel results: simulated systems

Parameter                                   1-Lane  2-Lane  4-Lane  8-Lane
# of 64-bit lanes                              1       2       4       8
Addresses/cycle for strided-indexed access     1       2       4       8
Crossbar width                                64b    128b    256b    512b
Width of DRAM bank interface                  64b    128b    256b    512b
DRAM banks                                     8       8       8       8

• Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4
Slide 33
Micro-kernels

Benchmark                       Operations   Data   Memory                Other
                                Type         Width  Accesses              Comments
Image Composition (Blending)    Integer      16b    Unit-stride
2D iDCT (8x8 image blocks)      Integer      16b    Unit-stride, Strided
Color Conversion (RGB to YUV)   Integer      32b    Unit-stride
Image Convolution               Integer      32b    Unit-stride
Matrix-vector Multiply (MV)     Integer, FP  32b    Unit-stride           Uses reductions
Vector-matrix Multiply (VM)     Integer, FP  32b    Unit-stride

• Vectorization and scheduling performed manually
Slide 34
Scaled system results
• Near-linear speedup for all applications apart from iDCT
• iDCT bottlenecks:
– large number of bank conflicts
– 4 addresses/cycle for strided accesses
[Chart: speedup (0-8) on 1, 2, 4, and 8 lanes for Compositing, iDCT, Color Conversion, Convolution, MxV INT (32), VxM INT (32), MxV FP (32), VxM FP (32)]
Slide 35
iDCT scaling with sub-banks
• Sub-banks reduce bank conflicts and increase performance
• Alternative (but not as effective) ways to reduce conflicts:
– different memory layout
– different address interleaving schemes
[Chart: speedup (0-8) on 1, 2, 4, and 8 lanes with 1, 2, 4, and 8 sub-banks]
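A toy model shows why extra sub-banks help the iDCT's power-of-two strides: under low-order interleaving, a stride equal to the bank count hits a single bank, and each doubling of (sub-)banks doubles the banks the stream can spread across. This is an illustrative model, not the simulator's actual memory system.

```python
def banks_touched(stride, n_banks, n_accesses=64):
    """Distinct (sub-)banks hit by a strided stream when the bank is
    chosen by word-address mod n_banks (low-order interleaving)."""
    return len({(i * stride) % n_banks for i in range(n_accesses)})

# Stride 8 (natural for 8x8 iDCT blocks) piles onto one bank of eight,
# but spreads across more banks as sub-banking increases:
for nb in (8, 16, 32, 64):
    print(nb, banks_touched(8, nb))   # 1, 2, 4, 8 banks touched
```

Different memory layouts or interleaving schemes attack the same problem by changing the address-to-bank mapping instead of adding banks.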
Slide 36
Compiling for VIRAM
• Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
• Needs to handle a significant class of applications
– IRAM: multimedia, graphics, speech and image processing
– ISTORE: databases, signal processing, other DIS benchmarks
• Needs to utilize hardware features for performance
– IRAM: vectorization
– ISTORE: scalability of shared-nothing
Runtime System Software
• Demonstrate simple policy-driven adaptation
– within the context of a single OS and application
– software monitoring information collected and processed in real time
» e.g., health & performance parameters of OS, application
– problem detection and coordination of reaction
» controlled by a stock set of configurable policies
– application-level adaptation mechanisms
» invoked to implement reaction
• Use experience to inform ISTORE API design
• Investigate reinforcement learning as a technique to infer appropriate reactions from goals
Slide 44
Record-breaking performance is not the common case
• NOW-Sort records demonstrate peak performance
• But perturb just 1 of 8 nodes and...
[Chart: slowdown (0-5x) for best case vs. bad disk layout, busy disk, light CPU load, heavy CPU load, and paging]
Slide 45
Virtual Streams: dynamic load balancing for I/O
• Replicas of data serve as second sources
• Maintain a notion of each process’s progress
• Arbitrate use of disks to ensure equal progress
• The right behavior, but what mechanism?
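The arbitration idea can be sketched as a greedy scheduler: each free disk serves the least-advanced process that has a replica on it. The names and the greedy rule are illustrative; the slide deliberately leaves the mechanism open.

```python
def schedule_reads(progress, replicas, free_disks):
    """One arbitration round of a Virtual-Streams-style scheduler (sketch).

    progress:   {process: blocks read so far}
    replicas:   {process: set of disks holding its data (second sources)}
    free_disks: disks with a free service slot this round

    Grants each free disk to the least-advanced process it can serve,
    so faster disks off-load work from slower ones and progress evens out.
    """
    grants = {}
    for disk in free_disks:
        eligible = [p for p in progress
                    if disk in replicas[p] and p not in grants.values()]
        if eligible:
            # equalize progress: serve the process furthest behind
            grants[disk] = min(eligible, key=lambda p: progress[p])
    return grants

progress = {"A": 10, "B": 4}            # blocks each process has read
replicas = {"A": {0, 1}, "B": {1}}      # B's data replicated only on disk 1
grants = schedule_reads(progress, replicas, free_disks=[1, 0])
print(grants)                            # disk 1 serves lagging B; disk 0 serves A
```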
• ISTORE: a hardware/software architecture for building scalable, self-maintaining storage
– An introspective system: it monitors itself and acts on its observations
• Self-maintenance: does not rely on administrators to configure, monitor, or tune the system
Slide 50
Self-maintenance
• Failure management
– devices must fail fast without interrupting service
– predict failures and initiate replacement
– failures require no immediate human intervention
• System upgrades and scaling
– new hardware automatically incorporated without interruption
– new devices immediately improve performance or repair failures
• Performance management
– system must adapt to changes in workload or access patterns
Slide 51
ISTORE-I: 2H99
• Intelligent disk
– Portable PC hardware: Pentium II, DRAM
– Low-profile SCSI disk (9 to 18 GB)
– 4 100-Mbit/s Ethernet links per node
– Placed inside half-height canister
– Monitor processor/path to power off components?
• Intelligent Chassis
– 64 nodes: 8 enclosures, 8 nodes/enclosure
» 64 x 4 = 256 Ethernet ports
– 2 levels of Ethernet switches: 14 small, 2 large
» Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit
» Just for prototype; crossbar chips for real system
– Enclosure sensing, UPS, redundant PS, fans, ...
Slide 52
Disk Limit
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk:

Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
1999   35 minutes     1 week (!)

• Does the 3.5” form factor make sense in 5-7 years?
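The table's trend follows from capacity/bandwidth (sequential) vs. per-sector positioning time (random). A back-of-envelope check with illustrative round-number drive parameters, not the slide's exact assumptions:

```python
def read_times_s(capacity_gb, bw_mb_s, ms_per_sector, sector_b=512):
    """Seconds to read a whole disk sequentially vs. one sector per seek."""
    seq = capacity_gb * 1e9 / (bw_mb_s * 1e6)             # capacity / bandwidth
    rand = (capacity_gb * 1e9 / sector_b) * ms_per_sector / 1e3
    return seq, rand

# ~1990 drive: 0.3 GB, 1.25 MB/s, ~37 ms seek+rotate per random sector
seq90, rand90 = read_times_s(0.3, 1.25, 37)
# ~1999 drive: 18 GB, 8.6 MB/s, ~17 ms per random sector
seq99, rand99 = read_times_s(18, 8.6, 17)
print(seq90 / 60, rand90 / 3600)     # ~4 minutes, ~6 hours
print(seq99 / 60, rand99 / 86400)    # ~35 minutes, ~1 week
```

Because capacity grows much faster than seek and rotation improve, the random-read time blows up even as sequential bandwidth keeps pace.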
Slide 53
Related Work
• ISTORE adds to several recent research efforts
• Active Disks, NASD (UCSB, CMU)
• Network service appliances (NetApp, Snap!, Qube, ...)
• High-availability systems (Compaq/Tandem, ...)
• Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium)
• Plug-and-play system construction (Jini, PC