Scalable, Fault-tolerant Management of Grid Services: Application to Messaging Middleware
Harshawardhan Gadgil ([email protected])
Ph.D. Defense Exam (Practice Talk)
Advisor: Prof. Geoffrey Fox
Jan 15, 2016
Talk Outline
- Motivation
- Architecture
- Service-oriented Management
- Performance Evaluation
- Application: Managing Grid Messaging Middleware
- Related Work
- Conclusion
Motivation: Characteristics of Today's (Grid) Applications
- Increasing application complexity: applications are distributed and composed of multiple resources; in the future, much larger systems will be built.
- Components are widely dispersed and disparate in nature and access:
  - They span different administrative domains, under differing network / security policies
  - Access to resources is limited by the presence of firewalls, NATs, etc.
- Components are in flux: components (nodes, network, processes) may fail.
- Services must meet:
  - General QoS and life-cycle features (user defined)
  - Application-specific criteria
- We need to "manage" services to provide these capabilities.
Motivation: Key Challenges in Management of Resources
Scalability
- As application complexity grows, the number of resources that require management increases.
  - E.g., the LHC Grid consists of a large number of CPUs, disks, and mass storage servers.
  - Web Service based applications such as Amazon's EC2 dynamically resize compute capacity, so the number of resources is NOT ONLY large BUT ALSO in a constant state of flux.
- The management framework MUST cope with a large number of resources in terms of the additional components required.
Motivation: Key Challenges in Management of Resources
Scalability, Performance
- Performance is important in terms of initialization cost, recovery from failure, and responding to run-time events.
- Performance should not suffer with increasing resources and additional system components.
Motivation: Key Challenges in Management of Resources
Scalability, Performance, Fault-tolerance
- Failures are normal: resources may fail, and so may components of the management framework.
- The framework MUST recover from failure, and the recovery period must not increase drastically with an increasing number of resources.
Motivation: Key Challenges in Management of Resources
Scalability, Performance, Fault-tolerance, Interoperability
- Resources exist on different platforms, are written in different languages, and are typically managed using system-specific protocols, and hence are NOT INTEROPERABLE.
- We investigate the use of a service-oriented architecture for management.
Motivation: Key Challenges in Management of Resources
Scalability, Performance, Fault-tolerance, Interoperability, Generality
- The management framework must be generic: it should manage any type of resource (hardware / software).
Motivation: Key Challenges in Management of Resources
Scalability, Performance, Fault-tolerance, Interoperability, Generality, Usability
- Simple to deploy; built in terms of simple components (services).
- Autonomous operation (as much as possible).
Summary: Research Issues
- Building a fault-tolerant management architecture
- Making the architecture SCALABLE; investigating the overhead in terms of:
  - Additional components required
  - Typical response time
  - Recovery time
- Interoperable and extensible management framework
- General and usable system
Architecture: Assumptions
- For our discussion: RESOURCE = SERVICE.
- The external state required by resources is small and can be captured using a message-based interaction interface.
- A resource may maintain internal state and can bootstrap itself.
  - E.g., a shopping cart: internal state = contents of the cart; external state = location of the DB and access credentials where the contents persist.
- We assume a scalable, fault-tolerant database store that acts as a registry to maintain resource-specific external state.
Definition: The Process of Management
- E.g., consider a printer as a resource that will be managed.
- The printer generates events (e.g., Low Ink Level, Out Of Paper).
- The Resource Manager handles these events as defined by resource-specific policies (see the sketch after the figure below).
- Job queue management is NOT the responsibility of our management architecture: we assume a separate Job Management Process exists, which itself can be managed by our framework (e.g., making sure it is always up and running).
[Figure: A resource-specific manager manages the printer's ink level and feeder tray, handling LowInkLevelEvent and OutOfPaperEvent; job queue management is a separate process, itself managed by the framework.]
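To make the event-handling idea concrete, here is a minimal Java sketch of a resource-specific manager reacting to the printer events above. The class and method names (PrinterManager, handle, notifyOperator) are illustrative assumptions, not part of the actual framework.

```java
// Hypothetical sketch: a resource-specific manager dispatching printer events
// according to a resource-specific policy. All names are illustrative only.
public class PrinterManager {

    // Events the managed printer can emit (from the slide: low ink, out of paper).
    enum PrinterEvent { LOW_INK_LEVEL, OUT_OF_PAPER }

    /** Handle an event as defined by a resource-specific policy. */
    public void handle(PrinterEvent event) {
        switch (event) {
            case LOW_INK_LEVEL:
                notifyOperator("Ink level low; schedule a cartridge replacement");
                break;
            case OUT_OF_PAPER:
                notifyOperator("Printer out of paper; refill the feeder tray");
                break;
        }
    }

    private void notifyOperator(String message) {
        System.out.println("[printer-manager] " + message);
    }
}
```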
The Management Architecture is built in terms of:
- Hierarchical Bootstrap System (made robust itself by replication): resources in different domains can be managed with separate policies for each domain; it periodically spawns a System Health Check that ensures components are up and running.
- Registry for metadata (a distributed database), made robust by standard database techniques and by our system itself for its service interfaces: it stores resource-specific information (user-defined configuration / policies, and the external state required to properly manage a resource).
- Messaging Nodes form a scalable messaging substrate: they carry messages between managers and managees and provide secure delivery of messages.
- Managers: active stateless agents that manage resources; a resource-specific management thread performs the actual management, and managers are multi-threaded to improve scalability. A minimal Java sketch of the manager's periodic polling loop follows.
- Managees: what you are managing (the resource / service to manage), which our system makes robust. There is NO assumption that the managed system uses Messaging Nodes; each managee is wrapped by a Service Adapter which provides a Web Service interface.
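The following is a minimal sketch, in Java, of how a multi-threaded manager process might poll the registry for resources and spawn resource-specific management threads. All names (Registry, ResourceInfo, ManagerProcess) are hypothetical stand-ins, not the framework's actual API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a manager process: periodically checks the registry
// for resources that need management and spawns one thread per resource.
public class ManagerProcess {

    interface Registry {                        // stand-in for the registry service
        List<ResourceInfo> unmanagedResources();
    }

    record ResourceInfo(String resourceId, String externalState) {}

    private final Registry registry;
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);

    ManagerProcess(Registry registry) { this.registry = registry; }

    void start() {
        // Poll the registry periodically; each discovered resource gets its own
        // resource-specific management thread (multi-threading aids scalability).
        pool.scheduleAtFixedRate(() -> {
            for (ResourceInfo info : registry.unmanagedResources()) {
                pool.execute(() -> manage(info));
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    private void manage(ResourceInfo info) {
        // Read/write resource-specific external state from/to the registry,
        // deliver configuration, and handle events per policy (omitted here).
        System.out.println("Managing resource " + info.resourceId());
    }
}
```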
Architecture: Scalability via Hierarchical Distribution
[Figure: Bootstrap hierarchy with ROOT at the top, domains such as US and EUROPE below it, and leaf domains such as FSU, CARDIFF, and CGL; e.g. /ROOT/EUROPE/CARDIFF.]
- Active bootstrap nodes (always the leaf nodes in the hierarchy) are responsible for maintaining a working set of management components in their domain.
- Passive bootstrap nodes only ensure that all child bootstrap nodes are always up and running, spawning them if not present.
Architecture: Conceptual Idea (Internals)
[Figure: Resources to manage (managees), each wrapped by a Service Adapter, connect to a Messaging Node for sending and receiving messages. Managers also connect to the Messaging Node and read/write resource-specific external state from/to the Registry. The user writes system configuration to the Registry. Manager processes periodically check for available resources to manage. The Bootstrap Service periodically spawns a System Health Check Manager, which ensures the other components are always up and running.]
Architecture: User Component
- Characteristics are determined by the user.
- Events generated by the managees are handled by the manager; event processing is determined via WS-Policy constructs.
  - E.g., wait for the user's decision on handling specific conditions, or, if an event handler has been specified, execute the default policy, etc.
- Note: managers will set up services if the registry indicates that is appropriate, so writing information to the registry can be used to start up a set of services.
- Generic and application-specific policies are written to the registry, where they are picked up by a manager process.
Issues in the Distributed System
- Consistency: the management architecture must keep the management process consistent. Examples of inconsistent behavior:
  - Two or more managers managing the same resource
  - Old messages / requests arriving after new requests
  - Multiple copies of a resource existing at the same time, leading to inconsistent system state
  We use a registry-generated, monotonically increasing unique Instance ID to distinguish between new and old instances.
- Security: provide secure communication between communicating parties (e.g., Manager <-> Resource). We leverage NaradaBrokering's topic creation and discovery and its security framework to provide provenance, secure discovery, and authorization & authentication: unauthorized users are prevented from accessing the resource, and malicious users are prevented from modifying messages (thus message interactions remain secure even when passing through insecure intermediaries).
Consistency
- All new entities get a unique Instance ID (IID), generated through the registry; all interactions from that entity use this ID as a prefix. We assume it to be a monotonically increasing number (e.g., an NTP timestamp).
- Thus, when a manager thread starts, it is assigned a unique ID by the registry. A newer instance has a higher ID, so OLD manager threads can be distinguished: the resource ALWAYS treats the manager thread with the highest known ID as current, and requests from manager thread A are considered obsolete if IID(A) < IID(B).
- The IID may also be used as a prefix for interactions: Message ID = [X:N], where X is the registry-assigned IID and N is a monotonically increasing number generated by the instance with IID X. The Service Adapter stores the last known Message ID, allowing it to differentiate between duplicate AND obsolete messages.
- The same principle applies to auto-instantiated resources. If a resource is considered DEAD (unreachable) and a new resource is spawned, the new resource has the same Resource ID (which identifies the type of resource) but a higher Instance ID. If the old resource later rejoins, it can be recognized by checking its Instance ID and acted on appropriately:
  - E.g., if IID(ResourceInstance_1) < IID(ResourceInstance_2), then ResourceInstance_1 was previously deemed OBSOLETE and ResourceInstance_2 exists, so instruct ResourceInstance_1 to silently shut down.
A minimal Java sketch of this staleness check appears below.
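Below is a minimal sketch, assuming the IID scheme above, of how a service adapter might reject obsolete or duplicate messages. The types and field names are hypothetical, not the framework's actual classes.

```java
// Hypothetical sketch of the consistency check described above: a service
// adapter accepts a message only if its sender's IID is current and its
// per-instance sequence number is fresh.
public class ServiceAdapterState {

    /** Message ID = [iid : seq], where iid is registry-assigned and
     *  monotonically increasing (e.g., an NTP timestamp). */
    record MessageId(long iid, long seq) {}

    private long highestKnownIid = Long.MIN_VALUE; // current manager instance
    private long lastSeq = Long.MIN_VALUE;         // last sequence seen from it

    /** Returns true if the message should be processed, false if obsolete or duplicate. */
    public synchronized boolean accept(MessageId id) {
        if (id.iid() < highestKnownIid) {
            return false;               // from an OLD manager instance: obsolete
        }
        if (id.iid() > highestKnownIid) {
            highestKnownIid = id.iid(); // a newer instance takes over
            lastSeq = Long.MIN_VALUE;
        }
        if (id.seq() <= lastSeq) {
            return false;               // duplicate or out-of-order within the instance
        }
        lastSeq = id.seq();
        return true;
    }
}
```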
Interoperability: Service-Oriented Management
- Existing systems (SNMP, JMX, WMI) are tied to particular platforms and languages: quite successful, but not interoperable.
- We move to a Web Service based, service-oriented architecture that uses XML-based interactions, facilitating implementations in different languages, running on different platforms, and over multiple transports.
Interoperability: WS-Distributed Management vs. WS-Management
- Both provide a Web service model for building application management solutions.
- WSDM consists of MOWS (Management Of Web Services) and MUWS (Management Using Web Services):
  - MUWS: a unifying layer on top of existing management specifications such as SNMP and OMI (Object Management Interface)
  - MOWS: support for management framework concerns such as deployment, auditing, metering, SLA management, life-cycle management, etc.
- WS-Management identifies a core set of specifications and usage requirements, e.g. CREATE, DELETE, GET / PUT, ENUMERATE, plus any number of resource-specific management methods (if applicable); a sketch of dispatching these operations appears below.
- We selected WS-Management primarily for its simplicity, and to leverage the WS-Eventing implementation recently added for Web Service support in NaradaBrokering.
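As an illustration of the core WS-Management verbs, here is a hedged Java sketch of how a service adapter might dispatch incoming requests on their WS-Addressing action URIs. The dispatcher class and handler methods are hypothetical; the action URIs themselves are those defined by the September 2004 WS-Transfer and WS-Enumeration specifications.

```java
// Hypothetical dispatcher mapping wsa:Action URIs to the core WS-Management
// operations (WS-Transfer Get/Put/Create/Delete, WS-Enumeration Enumerate).
public class WsManagementDispatcher {

    private static final String TRANSFER = "http://schemas.xmlsoap.org/ws/2004/09/transfer/";
    private static final String ENUMERATION = "http://schemas.xmlsoap.org/ws/2004/09/enumeration/";

    /** Dispatch a request by its wsa:Action header; body handling is omitted. */
    public String dispatch(String wsaAction, String requestBody) {
        switch (wsaAction) {
            case TRANSFER + "Get":          return handleGet();
            case TRANSFER + "Put":          return handlePut(requestBody);
            case TRANSFER + "Create":       return handleCreate(requestBody);
            case TRANSFER + "Delete":       return handleDelete();
            case ENUMERATION + "Enumerate": return handleEnumerate();
            default: throw new IllegalArgumentException("Unsupported action: " + wsaAction);
        }
    }

    // Placeholder handlers; a real adapter would read/modify resource state.
    private String handleGet()               { return "<resource-state/>"; }
    private String handlePut(String body)    { return "<updated/>"; }
    private String handleCreate(String body) { return "<created/>"; }
    private String handleDelete()            { return "<deleted/>"; }
    private String handleEnumerate()         { return "<items/>"; }
}
```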
Implemented WS-Specifications
- WS-Management (June 2005) in parts: WS-Transfer (Sep 2004), WS-Enumeration (Sep 2004), and WS-Eventing (WSDM could be used instead)
- WS-Eventing (leveraged from the WS-Eventing capability implemented in OMII)
- WS-Addressing (Aug 2004) and SOAP v1.2, needed for WS-Management
- XmlBeans 2.0.0 for manipulating XML in a custom container
- The NaradaBrokering security framework provides secure end-to-end delivery of messages
- The broker discovery mechanism may be leveraged to discover Messaging Nodes
- Currently implemented using JDK 1.4.2 (we expect better performance moving to JDK 1.5 or later)
Performance Evaluation: Measurement Model and Test Setup
- Multi-threaded manager process: spawns a resource-specific management thread (a single manager can manage multiple different types of resources).
- There is a limit on the maximum number of resources that can be managed, bounded by the response time obtained and by the maximum number of threads possible per JVM (memory constraints).
[Figure: Test setup. A set of managees on GF1-GF4, the Messaging Node and Registry on GF5, the manager process(es) on GF6, and a Benchmark Accumulator on GF7, connected via TCP.]
Performance Evaluation: Results
- Response time increases with an increasing number of resources.
- Response time is RESOURCE-DEPENDENT; the times shown are typical.
- An operation MAY involve one or more registry accesses, which increase the overall response time.
- Response time increases rapidly once the number of resources exceeds roughly 150 - 200.
Performance Evaluation Results: Increasing Managers on the Same Machine
Performance Evaluation Results: Increasing Managers on Different Machines
How to Scale Locally
[Figure: Within the hierarchy (e.g. ROOT, US, CGL), a leaf domain is served by a cluster of messaging nodes, Node-1 through Node-N, rather than a single node.]
Performance Evaluation Research Question: How much infrastructure is required to manage N resources?
- N = number of resources to manage
- Z = max. number of entities connected to a single messaging node
- D = max. number of resources managed by a single manager process
- K = min. number of registry database instances required to provide fault-tolerance

Assume every leaf domain has 1 messaging node; hence we have N/Z leaf domains. Further, the number of managers required per leaf domain is Z/D. Thus the total number of components at the lowest level is:

  components per domain * number of domains
  = (K + 1 messaging node + 1 bootstrap node + Z/D managers) * N/Z
  = (2 + K + Z/D) * N/Z

Note: the other, passive bootstrap nodes are not counted here, since (number of passive nodes) << N. E.g., if the registry is shared, then K = 1 for each domain, representing the service interface.
Performance Evaluation Research Question: How much infrastructure is required to manage N resources?
Thus for N resources we require an additional (2 + K + Z/D) * N/Z components, so the percentage of additional infrastructure is

  [(2 + K + Z/D) * N/Z] / [N + (2 + K + Z/D) * N/Z] * 100%
  = [1 - 1/(1 + 2/Z + K/Z + 1/D)] * 100%

A few cases (checked in the small program below):
- Typical values of D and Z are 200 and 800; assuming K = 4: additional infrastructure = [1 - 1/(1 + 2/800 + 4/800 + 1/200)] * 100% ≈ 1.23%
- When the registry is shared and there is one registry interface per domain, K = 1: additional infrastructure = [1 - 1/(1 + 2/800 + 1/800 + 1/200)] * 100% ≈ 0.87%
- If a resource manager can only manage 1 resource at any given instance, D = 1: additional infrastructure = [1 - 1/(1 + 2/800 + 4/800 + 1/1)] * 100% ≈ 50%
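The following small Java program reproduces the three cases above from the formula; it is just a worked check of the arithmetic, not part of the framework.

```java
// Worked check of the additional-infrastructure formula:
// overhead = 1 - 1 / (1 + 2/Z + K/Z + 1/D)
public class InfrastructureOverhead {

    static double overheadPercent(double z, double k, double d) {
        return (1.0 - 1.0 / (1.0 + 2.0 / z + k / z + 1.0 / d)) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("Z=800, K=4, D=200 -> %.2f%%%n", overheadPercent(800, 4, 200)); // ~1.23%
        System.out.printf("Z=800, K=1, D=200 -> %.2f%%%n", overheadPercent(800, 1, 200)); // ~0.87%
        System.out.printf("Z=800, K=4, D=1   -> %.2f%%%n", overheadPercent(800, 4, 1));   // ~50%
    }
}
```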
Performance Evaluation: XML Processing Overhead
- XML processing overhead is measured as the total marshalling and un-marshalling time required.
- For broker management interactions, the typical processing time (including validation against the schema) is ≈ 5 ms.
- Broker management operations are invoked only during initialization and recovery from failure.
- Reading broker state using a GET operation incurs the 5 ms overhead and is invoked periodically (e.g., every 1 minute, depending on policy).
- Further, for most operations that change broker state, the actual operation processing time is >> 5 ms, so the 5 ms XML overhead is acceptable.
Prototype: Managing Grid Messaging Middleware
- We illustrate the architecture by managing the distributed messaging middleware NaradaBrokering.
- This example is motivated by the presence of a large number of dynamic peers (brokers) that need configuration and deployment in specific topologies.
- Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system, possibly using a different configuration and topology.
- Brokers can use (dynamically) optimized protocols (UDP vs. TCP vs. Parallel TCP) and go through firewalls, but there was previously no good way to make these choices dynamically.
- Broker Service Adapter: NaradaBrokering illustrates an electronic entity that didn't start off with an administrative service interface, so we add a wrapper over the basic NB BrokerNode object that provides a WS-Management front-end, allowing CREATION, CONFIGURATION, and MODIFICATION of broker topologies.
Prototype: Use Cases
- Use case I: audio-video conferencing in the GlobalMMCS project, which uses NaradaBrokering as an event delivery substrate.
  - Consider a scenario with one teacher and 10,000 students. One would typically form a TREE-shaped hierarchy of brokers.
  - One broker can support up to 400 simultaneous video clients and 1,500 simultaneous audio clients with acceptable quality*, so one would need 10000 / 400 ≈ 25 broker nodes.
  - Additional links between brokers may also be required for fault-tolerance purposes.
- Use case II: sensor networks. Both use cases need high-QoS streams of messages.
- Use case III: the management system itself.

* "Scalable Service Oriented Architecture for Audio/Video Conferencing", Ahmet Uyar, Ph.D. Thesis, May 2005
[Figure: A tree of brokers. A single participant sends audio / video, which is distributed through the broker hierarchy to groups of 400 participants per leaf broker.]
Failure Handling: WS-Policy
Policy defines how resource failure is handled. We implemented 2 policies (based on WS-Policy), illustrated in the sketch below:
- Require User Input: no action is taken on failure; user interaction is required to handle it.
- Auto Instantiate: tries to auto-instantiate the failed broker; the location of a fork process is required.
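Here is a minimal, hypothetical Java sketch of the two failure-handling policies described above; the enum and handler are illustrative assumptions about how such policies could be wired in, not the framework's actual WS-Policy machinery.

```java
// Hypothetical sketch of the two failure-handling policies: on managee
// failure, either wait for user input or attempt auto-instantiation.
public class FailureHandler {

    enum FailurePolicy { REQUIRE_USER_INPUT, AUTO_INSTANTIATE }

    private final FailurePolicy policy;
    private final String forkProcessLocation; // needed only for AUTO_INSTANTIATE

    FailureHandler(FailurePolicy policy, String forkProcessLocation) {
        this.policy = policy;
        this.forkProcessLocation = forkProcessLocation;
    }

    void onFailure(String resourceId) {
        switch (policy) {
            case REQUIRE_USER_INPUT:
                // No automatic action: surface the failure and wait for the user.
                System.out.println("Resource " + resourceId + " failed; awaiting user decision");
                break;
            case AUTO_INSTANTIATE:
                // Ask the fork process at the configured location to respawn the broker.
                System.out.println("Respawning " + resourceId + " via " + forkProcessLocation);
                break;
        }
    }
}
```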
Prototype: Costs (Individual Resources – Brokers)

  Operation          Un-initialized (first time)   Initialized (later modifications)
  Set Configuration  777 ms                        46 ms
  Create Broker      459 ms                        132 ms
  Create Link        175 ms                        43 ms
  Delete Link        109 ms                        35 ms
  Delete Broker      110 ms                        187 ms

  (average values)
Recovery Time

Recovery time = T(read state from registry) + T(bring resource up to speed)
              = T(read state) + T(SetConfig + CreateBroker + CreateLink(s))

  Topology   Resource-specific configuration entries       Recovery time
  Ring       N nodes, N links (1 outgoing link per node);  10 + (777 + 459 + 175) ≈ 1.4 sec
             2 resource objects per node
  Cluster    N nodes, links per broker vary from 0 - 3;    Min: 5 + (777 + 459) ≈ 1.2 sec
             1 - 4 resource objects per node                Max: 20 + {777 + 459 + (175*1 + 43*2)} ≈ 1.5 sec

Assuming a 5 ms registry read time per resource object.
Prototype: Observed Recovery Cost per Resource

  Operation                     Average (msec)
  *Spawn Process                2362 ± 18
  Read State                    8 ± 1
  Restore (1 Broker + 1 Link)   1421 ± 9
  Restore (1 Broker + 3 Links)  1616 ± 82

- The time for Create Broker depends on the number and type of transports opened by the broker; e.g., an SSL transport requires negotiation of keys and hence more time than simply opening a TCP connection.
- If brokers connect to other brokers, the destination broker MUST be ready to accept connections, or topology recovery takes more time.
Management Console: Creating Nodes and Setting Properties
Management Console: Creating Links
Management Console: Policies
Management Console: Creating Topologies
Related Work
Fault-Tolerance Strategies: Replication
- Provides transfer of control to a new or existing backup service instance on failure.
- Passive (primary / backup) OR active.
- E.g., distributed databases, agent-based systems.
Related Work
Fault-Tolerance Strategies: Check-pointing
- Allows computation to continue from the point of failure, OR supports process migration.
- E.g., MPI-based systems (Open MPI).
- Can be done independently (easier to do, but complicates recovery) OR coordinated (a performance issue, but recovery is easy).
Related Work
Fault-Tolerance Strategies: Request-Retry, Logging, Checksum
Related Work
- Failure detection via periodic heartbeats (e.g., the Globus Heartbeat Monitor)
- Scalability via hierarchical organization (e.g., DNS)
- Resource management (monitoring / scheduling), e.g., MonALISA, Globus GRAM
Conclusion
We have presented a scalable, fault-tolerant management framework that:
- Adds acceptable cost in terms of the extra resources required (about 1%)
- Provides a general framework for management of distributed resources
- Is compatible with existing Web Service standards
We have applied our framework to manage resources that have modest external state; this assumption is important to improve the scalability of the management process.
Summary of Contributions
- Designed and implemented a resource management framework that is:
  - Tolerant to failures in the management framework as well as resource failures, by implementing resource-specific policies
  - Scalable, in terms of the number of additional resources required to provide fault-tolerance, and in performance
  - An implementation of WS-Management for managing resources
- Implemented global management by leveraging a scalable messaging substrate to traverse firewalls
- Provided a detailed evaluation of the system components, showing that the proposed architecture has acceptable costs: it adds approximately 1% extra resources
- Implemented a prototype to illustrate management of a distributed messaging middleware system: NaradaBrokering
Future Work
- Current work assumes the runtime state that needs to be maintained is SMALL. Apply the management framework and evaluate the system when this assumption does not hold (more messages / larger messages), where XML processing overhead becomes significant.
- Apply the framework to broader domains.
Publications

On the proposed work:
- "Scalable, Fault-tolerant Management in a Service Oriented Architecture", Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. Submitted to IPDPS 2007.
- "Managing Grid Messaging Middleware", Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. In Proceedings of "Challenges of Large Applications in Distributed Environments" (CLADE), pp. 83-91, June 19, 2006, Paris, France.

Relevant to the proposed work:
- "A Scripting based Architecture for Management of Streams and Services in Real-time Grid Applications", Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Robert Granat. In Proceedings of the IEEE/ACM Cluster Computing and Grid 2005 Conference (CCGrid 2005), Vol. 2, pp. 710-717, Cardiff, UK.
- "On the Discovery of Brokers in Distributed Messaging Infrastructure", Shrideep Pallickara, Harshawardhan Gadgil, Geoffrey Fox. In Proceedings of the IEEE Cluster 2005 Conference, Boston, MA.
- "On the Discovery of Topics in Distributed Publish/Subscribe systems", Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), pp. 25-32, Seattle, WA. (Selected as one of six Best Papers.)
- "A Framework for Secure End-to-End Delivery of Messages in Publish/Subscribe Systems", Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil, Geoffrey Fox, Yan Yan, Yi Huang. (To appear) In Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (Grid 2006), Barcelona, September 28-29, 2006.
Publications: Others
- "On the Secure Creation, Organization and Discovery of Topics in Distributed Publish/Subscribe systems", Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil. (To appear) International Journal of High Performance Computing and Networking (IJHPCN), 2006. Special issue of extended versions of the 6 best papers at the ACM/IEEE Grid 2005 Workshop.
- "Building Messaging Substrates for Web and Grid Applications", Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil. In the special issue on Scientific Applications of Grid Computing, Philosophical Transactions of the Royal Society, London, Volume 363, Number 1833, pp. 1757-1773, August 2005.
- "Management of Real-Time Streaming Data Grid Services", Geoffrey Fox, Galip Aydin, Harshawardhan Gadgil, Shrideep Pallickara, Marlon Pierce, and Wenjun Wu. Invited talk at the Fourth International Conference on Grid and Cooperative Computing (GCC 2005), Beijing, China, Nov 30 - Dec 3, 2005; Lecture Notes in Computer Science, Volume 3795, Nov 2005, pp. 3-12.
- "SERVOGrid Complexity Computational Environments (CCE) Integrated Performance Analysis", Galip Aydin, Mehmet S. Aktas, Geoffrey C. Fox, Harshawardhan Gadgil, Marlon Pierce, Ahmet Sayar. As poster and in Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), pp. 256-261, Seattle, WA, Nov 13-14, 2005.