Slide 1 Research in Internet Scale Systems Katherine Yelick U.C. Berkeley, EECS http://iram.cs.berkeley.edu/istore With Jim Beck, Aaron Brown, Daniel Hettena, David Oppenheimer, Randi Thomas, Noah Treuhaft, David Patterson, John Kubiatowicz http://www.cs.berkeley.edu/project/titanium With Greg Balls, Dan Bonachea, David Gay, Ben Liblit, Chang-Sun Lin, Peter McQuorquodale, Carleton Miyamoto, Geoff Pike, Alex Aiken, Phil Colella, Susan Graham, Paul Hilfinger
Slide 1
Research in Internet Scale Systems
Katherine Yelick U.C. Berkeley, EECS
http://iram.cs.berkeley.edu/istore
With Jim Beck, Aaron Brown, Daniel Hettena, David Oppenheimer, Randi Thomas, Noah
Treuhaft, David Patterson, John Kubiatowicz
http://www.cs.berkeley.edu/project/titanium
With Greg Balls, Dan Bonachea, David Gay, Ben Liblit, Chang-Sun Lin, Peter McQuorquodale, Carleton Miyamoto, Geoff Pike, Alex Aiken, Phil Colella, Susan Graham, Paul Hilfinger
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
• Rise of the thin-client metaphor:
– Services provided by interior of network
– Incredibly thin clients on the leaves
» MEMS devices: sensors + CPU + wireless net in 1 mm³
Slide 3
The problem space: big data
• Big demand for enormous amounts of data
– today: enterprise and internet applications
– future: richer data and more of it
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Two key application domains:
– storage: public, private, and institutional data
– search: building static indexes, dynamic discovery
• Today’s SMP server designs can’t easily scale
– Bigger scaling problems than performance!
Slide 4
The Real Scalability Problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures and extreme load
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 5
Research Principles
• Redundancy everywhere, no single point of failure
• Performance secondary to AME
– Performance robustness over peak performance
– Dedicate resources to AME
» biological systems use > 50% of resources on maintenance
– Optimizations viewed as AME-enablers
» e.g., use of (slower) safe languages like Java with static and dynamic optimizations
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
» in deployed systems!
» goal is to shake out bugs in failure response code on isolated subset
» use of fault-injection and stress testing
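The fault-injection bullet above can be illustrated with a toy example. This is only a sketch under stated assumptions: `Replica`, `MirroredStore`, and `inject_single_faults` are hypothetical names invented here, not part of ISTORE. The harness fails each replica of a mirrored key-value store in turn and checks that the failure-response path (falling back to a surviving replica) still serves reads.

```python
class Replica:
    """Hypothetical storage node whose failure can be switched on and off."""
    def __init__(self):
        self.data = {}
        self.failed = False

    def write(self, key, value):
        if self.failed:
            raise OSError("replica unavailable")
        self.data[key] = value

    def read(self, key):
        if self.failed:
            raise OSError("replica unavailable")
        return self.data[key]


class MirroredStore:
    """Hypothetical store that mirrors every write to all replicas."""
    def __init__(self, n=2):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, key, value):
        succeeded = 0
        for r in self.replicas:
            try:
                r.write(key, value)
                succeeded += 1
            except OSError:
                pass  # tolerate a failed replica as long as one write lands
        if succeeded == 0:
            raise OSError("all replicas failed")

    def read(self, key):
        for r in self.replicas:
            try:
                return r.read(key)
            except OSError:
                continue  # failure-response path: try the next replica
        raise OSError("all replicas failed")


def inject_single_faults(store, key):
    """Fail each replica in turn; record what a read returns (None if it fails)."""
    results = []
    for r in store.replicas:
        r.failed = True
        try:
            results.append(store.read(key))
        except OSError:
            results.append(None)
        finally:
            r.failed = False  # restore the replica before the next injection
    return results


store = MirroredStore(n=2)
store.write("doc", "contents")
survived = inject_single_faults(store, "doc")
print(survived)
```

Running the injection against an isolated copy of the store, rather than the one serving live traffic, matches the "isolated subset" idea in the bullets.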
Slide 15
Techniques for Safe Languages
Titanium: A high performance dialect of Java
• Scalable parallelism
– A global address space, but not shared memory
– For tightly-coupled applications, e.g., mining
– Safe, region-based memory management
• Scalar performance enhancements, some specific to application domain
– immutable classes (avoids indirection)
– multidimensional arrays with subarrays
• Application domains
– scientific computing on grids
» typically within +/-20% of C++/Fortran in this domain
– data mining in progress
Slide 16
Use of Static Information
• Titanium compiler performs parallel optimizations
– communication overlap (40%) and aggregation
• Uses two new analyses
– synchronization analysis: the parallel analog to control flow analysis
» identifies code segments that may execute in parallel
– shared variable analysis: the parallel analog to dependence analysis
» recognize when reordering can be observed by another processor
» necessary for any code motion or use of relaxed memory models in hardware => missed or illegal optimizations
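To make "reordering can be observed by another processor" concrete, here is a small sketch. It is not the Titanium analysis itself; it is a brute-force enumeration over a two-processor toy program, and the variable names `data` and `flag` are assumptions chosen for illustration. It lists every interleaving of a writer and a reader, then shows that swapping the writer's two stores lets the reader see an outcome that was impossible before.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ai, bi = iter(a), iter(b)
        yield [next(ai) if i in slots else next(bi) for i in range(n)]


def outcomes(writer, reader):
    """Return the set of (flag_seen, data_seen) pairs the reader can observe."""
    results = set()
    for schedule in interleavings(writer, reader):
        mem = {"data": 0, "flag": 0}
        seen = {}
        for op, var, *val in schedule:
            if op == "write":
                mem[var] = val[0]
            else:  # "read"
                seen[var] = mem[var]
        results.add((seen["flag"], seen["data"]))
    return results


# Processor 1 publishes data, then sets a flag; processor 2 reads flag, then data.
writer = [("write", "data", 1), ("write", "flag", 1)]
reader = [("read", "flag"), ("read", "data")]

original = outcomes(writer, reader)
reordered = outcomes(writer[::-1], reader)  # compiler swaps the two stores
newly_visible = reordered - original
print(sorted(original), sorted(newly_visible))
```

In program order the reader can never see the flag set while the data is still 0. After the swap that outcome, (1, 0), becomes possible, so the swap is an illegal optimization unless analysis proves no other processor can observe it.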
Slide 17
Use of Dynamic Information
• Several data mining or web search algorithms use sparse matrix-vector multiplication
– use for documents, images, video, etc.
– irregular, indirect memory patterns perform poorly on memory hierarchies
• Performance improvements possible, but depend on:
– sparsity structure, e.g., keywords within documents
– machine parameters without analytical models
• Good news:
– operation repeated many times on similar matrix
– Sparsity: automatic code generator based on runtime information
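For reference, a baseline sparse matrix-vector multiply might look like the sketch below. It uses the common compressed sparse row (CSR) layout, which is an assumption here: the slides do not say which storage format is used, and this is not the code that the Sparsity generator emits. The helper names `to_csr` and `spmv` are hypothetical. The inner loop shows the indirect access `x[col_idx[k]]` that makes the kernel hard on memory hierarchies.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, entry in enumerate(row):
            if entry != 0:
                values.append(entry)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr


def spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CSR form."""
    y = [0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect, data-dependent access into x: the irregular pattern
            # the slide refers to.
            y[i] += values[k] * x[col_idx[k]]
    return y


A = [
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 4, 0, 5],
]
x = [1, 2, 3, 4]
y = spmv(*to_csr(A), x)
print(y)  # [6, 9, 28]
```

Because the same matrix structure is reused across many multiplications, the cost of inspecting it once at run time can be amortized over those repeated calls, which is the opening the "good news" bullets point to.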
Slide 18
Using Dynamic Information: Sparsity Performance
Slide 19
Use of Dynamic Information: Virtual Stream
• System performance limited by the weakest link
• NOW Sort experience: performance heterogeneity is the
• 15 single-fault workloads injected per system
– only 4 distinct behaviors observed
(A) no effect
(B) system hangs
(C) RAID enters degraded mode
(D) RAID enters degraded mode & starts reconstruction
– both systems hung (B) on simulated disk hangs
– Linux exhibited (D) on all other errors
– Windows exhibited (A) on transient errors and