Slide 1
ISTORE: Introspective Storage for Data-Intensive Network Services
Aaron Brown, David Oppenheimer, Jim Beck, Kimberly Keeton, Rich Martin, Randi Thomas, John Kubiatowicz, David Patterson, and Kathy Yelick
Computer Science Division, University of California, Berkeley
http://iram.cs.berkeley.edu/istore/
1999 Summer IRAM Retreat
Slide 6
Motivation: Service Demands (1)
– today: online backup, search engines, and web servers
– tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
• Infrastructure users expect "always-on" service and constant quality of service
– infrastructure must provide fault-tolerance and performance-tolerance
– failures and slowdowns have major business impact
» e.g., recent eBay, E*Trade, Schwab outages
Slide 7
Motivation: Service Demands (2)
• Delivering 24x7 fault- and performance-tolerance requires:
– a robust hardware platform
– fast adaptation to failures, load spikes, changing access patterns
– easy incremental scalability when existing resources stop providing the desired quality of service
– self-maintenance: the system handles problems as they arise, automatically
» can't rely on human intervention to fix problems or to tune performance
» humans are too expensive, too slow, and prone to mistakes
• Introspective systems are self-maintaining
Slide 8
Motivation: System Scaling
• Infrastructure services are growing rapidly
– more users, more online data, higher access rates, more historical data
– bigger and bigger back-end systems are needed
» O(300)-node clusters deployed now; thousands of nodes not far off
– techniques for maintenance and administration must scale with the system to 1000s of nodes
• Today's administrative approaches don't scale
– systems will be too big to reason about, monitor, or fix
– failures and load variance will be too frequent for static solutions to work
• Introspective, reactive techniques are required
Slide 9
ISTORE Research Agenda• ISTORE goal = create a hardware/software
framework for building introspective servers
– Hardware: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault injection hardware
» intelligence used to collect and filter monitoring data» diagnostics and fault injection enhance robustness» networked to create a scalable shared-nothing cluster
– Software: toolkit that allows programmers to easily define the system’s adaptive behavior
» provides abstractions for manipulating and reacting to monitoring data
Slide 10
Hardware Requirements for Self-Maintaining Servers
• Redundant components that fail fast
– no single point of failure anywhere

Diagnostic processor (per brick)
» monitoring: connects to motherboard SMbus, CAN bus
• environmental monitor, CPU watchdog
» control
• reboot/power-cycle main CPU
• inject simulated faults: power, bus transients, memory errors, network interface failure, ...
• Not-so-small embedded Motorola 68k processor
– provides the flexibility needed for a research prototype
– can still run just a small, simple monitoring and control program if desired (no OS, networking, etc.)
Slide 15
Diagnostic Network
• Separate "diagnostic network" connects the diagnostic processors of each brick
– provides an independent network path to the diagnostic CPU
» works when the brick CPU is powered off or has failed
» separate failure modes from the Ethernet interfaces
• CAN (Controller Area Network) diagnostic interconnect
– CAN connects directly to environmental monitoring sensors (temperature, fan RPM, ...)
– one brick per "shelf" of 8 acts as a gateway from CAN to the redundant switched Ethernet fabric
Slide 16
ISTORE-1 Hardware Prototype
• Meets requirements for a robust HW platform
– fast embedded CPU performs local monitoring tasks
– diagnostic hardware enables low-level diagnostic monitoring, fail-fast behavior, and fault injection
– highly-redundant system design
» redundant data network and interfaces at all levels
» separate diagnostic network
» redundant backup power
– powerful preventive maintenance
» each brick can be periodically taken offline and stress-tested/scrubbed using the diagnostic processor's fault-injection capabilities
Slide 17
ISTORE Research Agenda
• ISTORE goal = create a hardware/software framework for building introspective servers
– Hardware
– Software: toolkit that allows programmers to easily define the system's adaptive behavior
» provides abstractions for manipulating and reacting to monitoring data
Slide 18
A Software Framework for Introspection
• ISTORE hardware provides device monitoring
– application programmers could write ad-hoc code to collect, process, and react to monitoring data
• The ISTORE software framework should simplify writing introspective applications
– a rule-based adaptation engine encapsulates the mechanisms of collecting and processing monitoring data
– a policy compiler and mechanism libraries help turn application adaptation goals into rules and reaction code
– together these provide a high-level, abstract interface to the system's monitoring and adaptation mechanisms
Slide 19
Rule-based Adaptation
• ISTORE's adaptation framework is built on database-style views and triggers
– applications define views and triggers over the DB
» views select and aggregate the data of interest to the app.
» triggers are rules that invoke application-specific reaction code when their predicates are satisfied
– an SQL-like declarative language is used to specify views and trigger rules
Slide 20
Benefits of Views and Triggers
• Allow applications to focus on adaptation, not monitoring
– hide the mechanics of gathering and processing monitoring data
– can be dynamically redefined as the situation changes, without altering adaptation code
• Can be implemented without a real database (see the sketch below)
– views and triggers implemented as device-local and distributed filters and reaction rules
– the defined views and triggers control the frequency, granularity, and types of data gathered by the HW monitoring
– no materialized database necessary
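To make the filter-based implementation concrete: below is a minimal sketch, in Python, of a trigger evaluated as a device-local filter over a stream of monitoring samples, with no materialized database. This is not ISTORE code; the Trigger class, the sample field names, and the reaction callback are all assumptions made for illustration.

    # Sketch: a trigger as a device-local filter over monitoring samples.
    # No database is materialized; each sample is tested and discarded.
    from typing import Callable, Dict, Iterable, List

    Sample = Dict[str, float]  # e.g. {"disk_id": 3, "queue_length": 12}

    class Trigger:
        def __init__(self,
                     predicate: Callable[[Sample], bool],
                     reaction: Callable[[Sample], None]) -> None:
            self.predicate = predicate
            self.reaction = reaction

        def feed(self, sample: Sample) -> None:
            # Only samples satisfying the predicate reach the
            # application-specific reaction code.
            if self.predicate(sample):
                self.reaction(sample)

    def run_filter(samples: Iterable[Sample],
                   triggers: List[Trigger]) -> None:
        for sample in samples:
            for trigger in triggers:
                trigger.feed(sample)

    # Hypothetical usage: fire when a disk's queue grows past a threshold.
    hot_disk = Trigger(lambda s: s["queue_length"] > 10,
                       lambda s: print("hot disk:", s["disk_id"]))
    run_filter([{"disk_id": 3, "queue_length": 12}], [hot_disk])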
Slide 21
Raising the Level of Abstraction: Policy Compiler and Mechanism Libs
• Rule-based adaptation doesn't go far enough
– the application designer must still write views, triggers, and adaptation code by hand
» but the designer thinks in terms of system policies
• Solution: the designer specifies policies to the system; the system implements them
– the policy compiler automatically generates views, triggers, and adaptation code
– it uses preexisting mechanism libraries to implement adaptation algorithms
– claim: feasible for the common adaptation mechanisms needed by data-intensive network service apps.
Slide 22
Adaptation Policies
• Policies specify system states and how to react to them
– high-level specification: independent of the system's "schema" and of object/node identity
» that knowledge is encapsulated in the policy compiler
• Examples (one is sketched in code below)
– self-maintenance and availability
» if overall free disk space is below 10%, compress all but one replica/version of the least-frequently-accessed data
» if any disk reports more than 5 errors per hour, migrate all data off that disk and shut it down
» invoke the load-balancer when a new disk is added to the system
– performance tuning
» place large, sequentially-accessed objects on the outer tracks of fast disks as space becomes available
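As a sketch of how one of these policies might bottom out in reaction code, the disk-error rule above ("more than 5 errors per hour") could look like the following. The migrate_all_data_off and shut_down callbacks are hypothetical stand-ins for mechanism-library calls, not part of ISTORE.

    # Sketch of the availability policy: >5 errors/hour -> evacuate and retire.
    ERROR_RATE_LIMIT = 5  # errors per hour, from the policy text above

    def check_disk_error_policy(errors_last_hour,
                                migrate_all_data_off, shut_down):
        # errors_last_hour: disk_id -> error count observed in the past hour
        for disk_id, errors in errors_last_hour.items():
            if errors > ERROR_RATE_LIMIT:
                migrate_all_data_off(disk_id)  # restore redundancy elsewhere
                shut_down(disk_id)             # fail fast: retire suspect disk

    # Hypothetical usage with stub mechanisms:
    check_disk_error_policy(
        {0: 2, 1: 7},
        lambda d: print("migrating all data off disk", d),
        lambda d: print("shutting down disk", d))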
Slide 23
Software Structure
[Diagram: the policy is used as input to the policy compiler, which produces views, triggers, and adaptation code; the adaptation code calls the mechanism libraries]
Slide 24
Detailed Adaptation Example
• Policy: quench hot spots by migrating objects
[Diagram: software structure, as on Slide 23]

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D
            to disk with shortest average queue length
Slide 25
Example: View Definition
[Diagram: software structure, as on Slide 23]

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D
            to disk with shortest average queue length

    DEFINE VIEW (
        average_queue_length = SELECT AVG(queue_length) FROM disk_stats,
        queue_length[]       = SELECT queue_length FROM disk_stats,
        disk_id[]            = SELECT disk_id FROM disk_stats,
        short_disk           = SELECT disk_id FROM disk_stats
                               WHERE queue_length =
                                   SELECT MIN(queue_length) FROM disk_stats )
Slide 26
Example: Trigger
[Diagram: software structure, as on Slide 23]

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D
            to disk with shortest average queue length

    foreach disk_id from_disk {
        if (queue_length[from_disk] > 1.2 * average_queue_length)
            user_migrate(from_disk, short_disk)
    }
Slide 27
Example: Adaptation Code
[Diagram: software structure, as on Slide 23]

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D
            to disk with shortest average queue length

    foreach disk_id from_disk {
        if (queue_length[from_disk] > 1.2 * average_queue_length)
            user_migrate(from_disk, short_disk)
    }

    user_migrate(from_disk, to_disk) {
        diskObject x;
        x = find_hottest_obj(from_disk);
        migrate(x, to_disk);
    }
Slide 28
Example: Mechanism Lib. Calls
[Diagram: software structure, as on Slide 23]

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D
            to disk with shortest average queue length

    foreach disk_id from_disk {
        if (queue_length[from_disk] > 1.2 * average_queue_length)
            user_migrate(from_disk, short_disk)
    }

    user_migrate(from_disk, to_disk) {
        diskObject x;
        x = find_hottest_obj(from_disk);
        migrate(x, to_disk);
    }
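Putting the four example slides together, here is a minimal executable sketch, in Python, of the whole loop: the view's aggregates, the trigger predicate, and user_migrate(). The disk_stats and hottest_obj tables and the migrate() stub are assumed stand-ins for the monitoring data and the mechanism-library call; they are not the code the policy compiler would actually emit.

    # Executable sketch of the hot-spot-quenching example above.
    disk_stats = {0: 18.0, 1: 6.0, 2: 9.0}           # disk_id -> avg queue len
    hottest_obj = {0: "objA", 1: "objB", 2: "objC"}  # disk_id -> hottest object

    def migrate(obj, to_disk):                       # mechanism-library stub
        print("migrate", obj, "-> disk", to_disk)

    # "View": aggregate and per-disk statistics (cf. DEFINE VIEW, Slide 25)
    average_queue_length = sum(disk_stats.values()) / len(disk_stats)
    short_disk = min(disk_stats, key=disk_stats.get)

    # "Trigger" plus adaptation code (cf. Slides 26-27): quench hot spots
    for from_disk, queue_length in disk_stats.items():
        if queue_length > 1.2 * average_queue_length:
            # user_migrate(): move hottest object to least-loaded disk
            migrate(hottest_obj[from_disk], short_disk)

With the sample data, disk 0's queue (18.0) exceeds 120% of the system average (13.2), so objA is migrated to disk 1, the disk with the shortest queue.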
Slide 29
Mechanism Libraries
• Unify existing techniques/services found in single-node OSs, DBMSs, and distributed systems
– distributed directory services
– replication and migration
– data layout and placement
– distributed transactions
– checkpointing
– caching
– administrative (human UI) tasks
• Provide a place for higher-level monitoring
• Simplify creation of adaptation code (see the interface sketch below)
– for humans using the rule system directly
– for the policy compiler auto-generating code
• Select key mechanisms for data-intensive network services
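The slides do not define the libraries' interfaces; purely as a sketch, a replication-and-migration library might expose operations like these (all names and signatures are hypothetical):

    # Hypothetical interface for one mechanism library; not ISTORE's API.
    from typing import Protocol

    class ReplicationMechanisms(Protocol):
        def replicate(self, obj_id: str, to_node: int) -> None:
            """Create an additional replica of obj_id on to_node."""
            ...

        def migrate(self, obj_id: str, from_node: int, to_node: int) -> None:
            """Move a copy of obj_id from one node to another."""
            ...

        def drop_replica(self, obj_id: str, node: int) -> None:
            """Delete one replica of obj_id, e.g. to reclaim space."""
            ...

Hiding synchronization, communication, and placement decisions behind calls of this shape is what lets both hand-written rules and compiler-generated code stay short.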
Slide 30
Open Research Issues
• Defining appropriate software abstractions
– how should views and triggers be declared?
– what should the policy language look like?
– what functions should mechanism libraries provide?
– what is the system's "schema"?
» how should heterogeneous hardware be integrated?
» can it be extended by the user to include new types and statistics?
– what level of policies can be expressed?
» how much of the implementation can the system figure out automatically?
» to what extent can the system reason about policies and their interactions?
Slide 31
More Open Research Issues
• Implementing an introspective system
– what default policies should the system supply?
– what are the internal and external interfaces?
– debugging
» visualization of states, triggers, ...
» simulation/coverage analysis of policies and adaptation code
– appropriate administrative interfaces
• Measuring an introspective system
– what are the right benchmarks for scalability, availability, and maintainability (SAM)?
• O(>=1000)-node scalability
– how to write applications that scale and run well despite a continual state of partial failure?
Slide 32
Related Work
• Hardware:
– CMU and UCSB Active Disks
• Software:
– adaptive databases: MS AutoAdmin, Informix NoKnobs
– adaptive OSs: MS Millennium, adaptive VINO
– adaptive storage: HP AutoRAID, attribute-managed storage
– active databases: UFL Gator, TriggerMan
• ISTORE unifies many of these techniques in a single system
Slide 33
Related Work: Ninja
• Ninja: composable Internet-scale services
– some ISTORE runtime software services provided using the Ninja programming platform?
– Ninja provides:
» some fault tolerance
» a framework for automatic service discovery
» incremental s/w upgrades
Slide 34
Related Work: Telegraph
• Universal system for information
• Four layers
• Relationship to ISTORE
– continuous online reoptimization
– adaptive data placement
– indexing and other operations on the disk CPU
Slide 35
Related Work: OceanStore
• Global-scale persistent storage
• Nomadic, highly-available data
• Federation of data storage providers
• Investigates global-scale SAM
– also naming, indexability, consistency
• Relationship to ISTORE
– investigating similar concepts, but on a global scale
– converse: ISTORE as "Internet in a box"
Slide 36
Related Work: Endeavour
• Endeavour: new research project at UCB
– goal: "enhancing human understanding through information technology"
• ISTORE's potential contributions:
– ISTORE is building adaptive, scalable, self-maintaining back-end servers for storage-intensive network services
» can be part of Endeavour's back-end infrastructure
– software research
» using policies to guide a system's adaptive behavior
» providing QoS under degraded conditions
– application platform
» process and store streams of sensor data
Slide 37
Status and Conclusions
• ISTORE's focus is on introspective systems
– a new perspective on systems research priorities
• Proposed framework for building introspection
– intelligent, self-monitoring plug-and-play hardware
– software that provides a higher level of abstraction for the construction of introspective systems
» flexible, powerful rule system for monitoring
» policy specification automates generation of adaptation code
• Status
– ISTORE-1 hardware prototype being constructed now
– software prototyping just starting
• Introspection is equally useful for performance and failure management (the first example is sketched in code below)
– Performance tuning example: DB index management
» View: access patterns to tables, query predicates used
» Trigger: access rate to a table above/below average
» Adaptation: add/drop indices based on the query stream
– Failure management example: impending disk failure
» View: disk error logs, environmental conditions
» Trigger: frequency of errors, unsafe environment
» Adaptation: redirect requests to other replicas, shut down the disk, generate new replicas, signal the operator
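A minimal sketch of the index-management example, assuming per-table access rates as the view and hypothetical add_index/drop_index stand-ins for the DBMS mechanism calls:

    # Sketch: add/drop indices as a table's access rate crosses the average.
    def tune_indices(access_rate, indexed, add_index, drop_index):
        # access_rate: table -> queries/sec (the "view")
        # indexed: set of tables that currently have an index
        average = sum(access_rate.values()) / len(access_rate)
        for table, rate in access_rate.items():
            if rate > average and table not in indexed:
                add_index(table)    # hot table: an index pays for itself
            elif rate < average and table in indexed:
                drop_index(table)   # cold table: drop index, save update cost

    # Hypothetical usage with stub mechanisms:
    tune_indices({"orders": 120.0, "audit": 3.0}, {"audit"},
                 lambda t: print("add index on", t),
                 lambda t: print("drop index on", t))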
Slide 48
More Adaptation Policy Examples
• Self-maintenance and availability
– maintain two copies of all dirty data stored only in volatile memory
– if a disk fails, restore the original redundancy level for objects stored on that disk
• Performance tuning
– if accesses to a read-mostly object take more than 10ms on average, replicate the object on another disk
• Both (like HP AutoRAID; see the sketch below)
– if an object is in the top 10% of frequently-accessed objects and there is only one copy, create a new replica; if an object is in the bottom 90%, delete all replicas and stripe the object across N disks using RAID-5
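A sketch of the AutoRAID-like policy, assuming per-object access counts and hypothetical create_replica/stripe_raid5 stand-ins for the mechanism calls (replica deletion is folded into the striping stub):

    # Sketch: mirror the top 10% of objects by access frequency; RAID-5 the rest.
    def apply_autoraid_policy(freq, copies, create_replica, stripe_raid5):
        # freq: object -> access count; copies: object -> current replica count
        ranked = sorted(freq, key=freq.get, reverse=True)
        cutoff = max(1, len(ranked) // 10)        # boundary of the top 10%
        for obj in ranked[:cutoff]:               # hot objects: ensure a mirror
            if copies.get(obj, 1) == 1:
                create_replica(obj)
        for obj in ranked[cutoff:]:               # cold objects: RAID-5 stripe
            stripe_raid5(obj)

    # Hypothetical usage with stub mechanisms:
    apply_autoraid_policy(
        {"a": 100, "b": 5, "c": 4, "d": 3, "e": 2,
         "f": 2, "g": 1, "h": 1, "i": 1, "j": 1},
        {"a": 1},
        lambda o: print("create replica of", o),
        lambda o: print("stripe", o, "across N disks with RAID-5"))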
Slide 49
Mechanism Library Benefits
• Programmability
– libraries provide high-level abstractions of services
– code using the libraries is easier to reason about, maintain, and customize
• Performance
– libraries can be highly optimized
– optimization complexity is hidden by the abstraction
• Reliability
– libraries include code that's easy to forget or get wrong
» synchronization, communication, memory allocation
– debugging effort can be spent getting the libraries right