Grid Computing, SOA, and Autonomic Computing

Colorado Software Summit: October 24 – 29, 2004 © Copyright 2004, IBM Corporation

Grid Computing, SOA, and Autonomic Computing
Paul Giangarra, Sr. Technical Staff Member
e-mail: [email protected]



Agenda
• Grid Computing, a Brief Introduction
• Grid Computing Core Concepts
• Grid Computing Standards and Architecture
• Information and Grid Computing
• Autonomic Computing and Grid Computing
• Service Oriented Architecture and Grid Computing
• (Now What Do I Do With All This?)
• The Realm of the Possible
• Summary and Questions



What’s the Problem?

Grid Problem:
• Provide for flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (a.k.a. virtual organizations)
• This includes unique authentication, authorization, resource access, and resource discovery

Grid Challenge:
• Create an architecture and solution set based on open standards and, where they exist, exploit existing technologies to solve this

See: The Anatomy of the Grid by Foster, Kesselman, Tuecke



What Is NOT a Grid?
• The 8:00 AM rush hour (that’s gridlock)
• A bunch of PCs on a network (it’s a lot more than that)
• A cluster, a network attached storage device, a scientific instrument, a network, etc. (each is an important component of a Grid, but by itself each does not constitute a Grid)

KEY: Grid Computing is NOT a silver bullet!



So, What Is a Grid?
More correctly, what is Grid Computing?
• Based on a service-oriented architecture
• Based on standard, open, general-purpose protocols and interfaces

Grid Computing, Services, and Technologies:
• Help coordinate and manage disparate and possibly heterogeneous resources that are not subject to centralized control
• Can be used to deliver non-trivial quantities of service
• Can be used to aggregate disparate IT elements such as compute resources, data storage, and filing systems to create a single, unified virtual system



What Is Grid Computing?

[Figure: Microcosm, the pre-Internet “system”. Storage, data, applications, processing, I/O, and operating system sit together in one system.]

[Figure: Macrocosm, distributed resources and applications. The same elements (storage, data, applications, processing, I/O, operating system) are distributed and presented as a single unified image.]



Grid Computing Enables
Distributed computing across networks, using open standards and supporting heterogeneous resources, by providing facilities for:
• Virtualized Sharing of Resources: Virtual Organizations & Collaboration
• Autonomic Management of Resources: Quality of Service & Optimization
• Secure Reliable Access to Resources: On Demand Computing and Utility Models




Grid Computing Core Concepts



Grid Computing Value Proposition

3 Models and Unique Value Propositions
• On Demand: “Access data & processing capabilities in a utility-like fashion... Make vs. Buy”
• Processing: “Aggregate processing power from a distributed collection of heterogeneous systems”
• Data: “Secure access and sharing of distributed data & information in a collaborative fashion”
• Resiliency: “Improve the quality of service of distributed systems, despite unplanned events”

Results:
• Increased: resource use, flexibility, productivity, reliability/availability
• Decreased: complexity, total cost of ownership



Grid Computing Resources & Types

Grid Resources
• Computation
• Storage
• Data
• Applications
• Communication (I/O)
• Software & Licenses
• Special equipment, capacities, architectures, & policies

Grid Types
• Collaboration Grid
• Compute Grids
    • Desktop Scavenging
    • Server
• Data/Information Grids
    • Content
    • Data
    • File
    • Storage

Grid Resources Virtualized Across the Grid Types



Grid Deployment Options
A Function of Business Need, Technology and Organizational Flexibility

1. Intra-Grids
2. Extra-Grids
3. Inter-Grids

[Figure: the three deployment options, built up over successive slides. Each grid is drawn with its NAS/SAN storage; a VPN appears once Extra-Grids are introduced.]





Motivations for Grid Computing
• Increase Capacity
• Improve Efficiency / Reduce Costs
• Provide Reliability & Availability
• Reduce Time to Results
• Support Heterogeneous Systems
• Enable Collaboration




Increase Capacity
Exploit distributed resources to provide capacity for high-demand applications
• Existing applications that cannot be run effectively on a single processor
• New large-scale applications that provide strategic business advantages

Improve Efficiency / Reduce Costs
• Reduce infrastructure cost associated with over-provisioned resources
• Reduce the cost of manpower to manage and configure resources



Provide Reliability / Availability
• Use distributed resources
• Monitor work progress
• Restart failed jobs

[Figure: a Job Scheduler dispatches JOB 1, JOB 2, and JOB 3 to IBM servers. JOB 1 hits a TIMEOUT and goes through Recovery / Restart.]
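The monitor-and-restart behaviour on this slide can be sketched in a few lines of Python. This is a minimal illustration only: `Node`, `run_with_restart`, and the node names are hypothetical, and a real grid scheduler would track deadlines asynchronously rather than trying nodes one after another.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """A hypothetical grid node; healthy=False simulates a node that never answers."""
    name: str
    healthy: bool = True

    def run(self, job: Callable[[], int]) -> int:
        if not self.healthy:
            # Stand-in for a job that exceeds its deadline on this node.
            raise TimeoutError(f"{self.name} timed out")
        return job()


def run_with_restart(job: Callable[[], int], nodes: List[Node], log: List[str]) -> int:
    """Try each node in turn; restart the job elsewhere when a node times out."""
    for node in nodes:
        try:
            result = node.run(job)
            log.append(f"completed on {node.name}")
            return result
        except TimeoutError:
            log.append(f"timeout on {node.name}, restarting")
    raise RuntimeError("job failed on every node")


nodes = [Node("node-a", healthy=False), Node("node-b")]
log: List[str] = []
result = run_with_restart(lambda: 6 * 7, nodes, log)
```

The submitter only sees the final result; the log records that the first attempt timed out and the job was restarted on another node.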




Reduce “Time to Results”
• Exploit opportunities for parallel computing to allow business-critical computation to be completed in a timely fashion
• Gain competitive advantage by allowing computation to be executed more frequently and on customer demand
• Deliver real-time results to internal and external customers

[Figure: Serial Execution compared with Parallel Execution, drawn against clock faces and calendar pages for March 27, 28, and 29.]



Support Heterogeneous Systems
• Different hardware, system platforms, and available middleware
• Specialized equipment

[Figure: IBM systems running Linux / z/OS and AIX / Linux alongside racks of IBM pSeries servers.]




Enable Collaborations
• Enable collaboration across applications to integrate results
• Support large multi-disciplinary collaborations
• Both within a single organization and between partners

[Figure: Air Force, Army, Navy; C2C; Mission Planning.]




Grid Computing Standards and Architecture



The Value of Open Standards
• Networking: The Internet (TCP/IP)
• Communications: e-mail (POP3, SMTP, MIME)
• Information: World-Wide Web (HTML, HTTP, J2EE, XML)
• Applications: Web Services (SOAP, WSDL, UDDI)
• Distributed Computing: Grid (Globus / OGSA)
• Operating System: Linux



Cooperation on Standards




Core Web Services Technologies
• WSDL: Describes what the service is and how to use it (an XML document)
• UDDI (optional): Yellow pages for Web services (Universal Description, Discovery and Integration directory)
• SOAP: Connects to the service (“the envelope”)
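To make “the envelope” concrete, here is a minimal sketch that wraps one call in a SOAP 1.1 Envelope and Body using only the Python standard library. The service namespace `urn:example:grid-jobs`, the operation `submitJob`, and its parameters are invented for illustration; a real service would publish these in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:grid-jobs"  # hypothetical service namespace


def build_envelope(operation: str, params: dict) -> bytes:
    """Wrap one operation call in a SOAP Envelope/Body."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


message = build_envelope("submitJob", {"executable": "render", "count": 4})
parsed = ET.fromstring(message)  # round-trip to confirm the message is well-formed XML
```

The resulting bytes would normally be sent as the body of an HTTP POST to the service endpoint; transport is left out of this sketch.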



Making Web Services Work

Value Proposition
• Increase business flexibility through standardized services
• Enabling the ecosystem
• Extend IT infrastructure to suppliers and business partners
• Radical reduction in complexity of integration
• Leverage existing investments and skills

IBM provides the industry's broadest support for Web services
• Development Lifecycle
• Transaction Services
• Information Integration
• Collaboration Services
• Management Services

IBM Software Activities
• Drive definition, adoption and interoperability of Web services

Open standards-based technology for flexible integration

Basic Profile 1.1: Final Specification published August 24, 2004



Open Grid Services Architecture (OGSA)

Objectives:
• Manage resources across distributed heterogeneous platforms
• Deliver seamless QoS
• Provide a common base for autonomic management solutions
• Define open, published interfaces
• Exploit industry-standard integration technologies
    • Web Services: SOAP, XML, WSDL, WS-Security, UDDI…
• Integrate with existing IT resources



Web Services “Stack”

[Figure: the Web services stack, listed here from bottom to top.]
• Transport: HTTP(S), SMTP, FTP, BEEP, TCP/IP, …
• Messaging: SOAP; RMI/IIOP, JMS, …
• Description: WSDL, WS-Policy; UDDI, WS-Addressing, WS-Inspection
• Quality of Service: WS-Security, WS-ReliableMessaging, WS-Transactions, WS-Coordination
• Components: Atomic; Composite (BPEL4WS, WS-Coordination)



Grid Protocol vs. Internet Protocol

[Figure: the two layered architectures side by side, listed here from bottom to top.]
• Grid Protocol Architecture: Fabric, Connectivity, Resource, Collective, Applications
• Internet Protocol Architecture: Link, Internet, Transport, Applications



Grid Computing Protocol Architecture
• Resource and Connectivity protocols facilitate the sharing of resources
• Each layer builds on capabilities provided by lower layers
• Design goals:
    • Place few constraints on implementation
    • Focus on a small set of core abstractions
    • Emphasize identification and definition of protocols and services
    • Identify and define APIs and SDKs
    • Provide for a secure environment

Layers (bottom to top): Fabric, Connectivity, Resource, Collective, Applications

The layered Grid Computing protocol architecture is based on Open Standards



Grid Protocol – Fabric
• Provides the resources to which shared access is mediated by Grid protocols
• Examples include computational resources, storage systems, catalogs, or network resources
• Includes logical resources such as distributed file systems and clusters
• Resources implement inquiry mechanisms that permit discovery of their structure, state, and capabilities
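As a rough illustration of an inquiry mechanism, the sketch below lets each resource describe its own kind, state, and capabilities, and lets a caller discover matching resources by asking them. `FabricResource`, `discover`, and the capability names are hypothetical and do not correspond to any Globus or OGSA interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FabricResource:
    """Hypothetical fabric-layer resource that can describe itself on request."""
    name: str
    kind: str                      # e.g. "compute" or "storage"
    state: str = "idle"
    capabilities: Dict[str, int] = field(default_factory=dict)

    def inquire(self) -> Dict[str, object]:
        """Inquiry mechanism: report structure, state, and capabilities."""
        return {"name": self.name, "kind": self.kind,
                "state": self.state, "capabilities": dict(self.capabilities)}


def discover(resources: List[FabricResource], kind: str, **minimums: int) -> List[str]:
    """Return names of idle resources of `kind` meeting every minimum capability."""
    matches = []
    for resource in resources:
        info = resource.inquire()
        caps = info["capabilities"]
        if (info["kind"] == kind and info["state"] == "idle"
                and all(caps.get(key, 0) >= value for key, value in minimums.items())):
            matches.append(info["name"])
    return matches


fabric = [
    FabricResource("cluster-1", "compute", capabilities={"cpus": 64}),
    FabricResource("desktop-7", "compute", capabilities={"cpus": 2}),
    FabricResource("san-1", "storage", capabilities={"terabytes": 20}),
    FabricResource("cluster-2", "compute", state="busy", capabilities={"cpus": 128}),
]
big_compute = discover(fabric, "compute", cpus=16)
```

Higher layers never reach into a resource directly; they rely only on what the inquiry returns, which is what allows heterogeneous resources to sit behind one discovery path.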



Grid Protocol – Connectivity
• Defines core communication and authentication protocols required for Grid-specific network transactions
• Communication protocols enable the exchange of data between Fabric layer resources
    • Transport, Routing, Naming
• Authentication protocols build on communication services
    • Provide cryptographically secure mechanisms for verifying the identity of users and resources
    • Asymmetric cryptography
    • Single Sign On, Delegation, Security Integration, Trust Relationships



Grid Protocol – Resource
• Builds on Connectivity layer communication and authentication protocols
• Defines protocols (and APIs and SDKs) for the secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources
• Concerned entirely with individual resources
• Ignores issues of global state and atomic actions across distributed collections

API/SDK functions: Negotiation, Initiation, Monitor, Control, Accounting, Payment



Grid Protocol – Collective
Protocols and services (and APIs and SDKs) that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources
• Directory services
• Co-allocation, scheduling, and brokering services
• Monitoring and diagnostic services
• Data replication services
• Grid-enabled programming systems
• Workload management
• Community authorization and accounting
• Software discovery services
• Collaborative services



Grid Protocol – Application Layer
Putting it all together

Example applications: Interagency Collaborative Data Grid; Compute Intensive Simulation; Weather Simulation and Modeling; Utility compute providers; HA Operational Support Systems; B2B Hubs and trading networks

• Application layer: Grid-enabled, secure and scalable Virtual Organizations
• Collective layer: Global interactions and services
• Resource layer: Resource management services
• Connectivity layer: Security, transport, routing
• Fabric layer: Physical resources



OGSA – Open Grid Services Architecture

Open Standards Based Architecture: 2003

[Figure: the Open Grid Services Architecture (OGSA) as layers, listed here from top to bottom with the annotation shown beside each.]
• Applications: applications & systems built on standards
• OGSA Architected Services (Grid Core Services, Grid Data Services, Grid Program Execution Services, Domain Specific Services): open and value-added vendor implementations
• OGSI, the Open Grid Services Infrastructure: open architecture for interoperability
• Web Services: support for Web services on a variety of platforms, languages and protocols
• OGSA-enabled “general purpose” middleware: Messaging, Directory, File Systems, Database, Workflow, Security
• OGSA-enabled hardware and operating system platforms: Network, Storage, Servers



• OGSA services can be defined and implemented as Web services
• OGSA can take advantage of other Web services standards
• OGSA can be implemented using standard Web services development tools
• Grid applications will NOT require special Web services infrastructure

WS-Resource Framework & WS-Notification are an evolution of OGSI

[Figure: the OGSA layers: Applications; OGSA Architected Services; OGSI (Open Grid Services Infrastructure); Web Services; OGSA-enabled Messaging, Directory, File Systems, Database, Workflow, and Security; OGSA-enabled Network, Storage, and Servers.]

WS-RF & WS-Notification and OGSA

[Figure: “Modeling Stateful Resources with Web Services”, surrounded by WS-ServiceGroup, WS-RenewableReferences, WS-Notification, WS-BaseFaults, WS-ResourceProperties, and WS-ResourceLifetime.]



OGSA Structure

Web Services: dynamic, addressable, stateful, manageable
• WS-Addressing, WS-Policy, WS-Coordination, WS-Security, WS-Trust

OGSA Architected Services
• Grid Core Services
    • Service Management: Registries and Discovery Services (SG); Attribute Propagation and Query; Service Domain; Service Orchestration; Metering & Accounting; Installation & Deployment
    • Service Communication: Messaging and Queuing Services; Event Services; Distributed Secure Logging Service
    • Security: Authentication; Authorization & Access Control; Credential Validation & Transformation; Trust Broker
    • Policy Management: Policy Service Manager; Policy Agent; Policy Transformation Service; Policy Resolution Service; Policy Validation Service; Policy Administration Services and Negotiation Framework
• Grid Program Execution Services: Job Scheduler & Queuing Services; Resource Reservation Services; Workload Managers and Micro-Scheduling Services
• Grid Data Services: Data Access Services; Data Transformation & Federation Services; Data Replication Service; Data Caching Service; MetaData Catalog Services
• Domain Specific Services



Meta OS Grid Services
Service Collections, Job Scheduling, File Transfer, Data Replication, Provisioning, Logging, Problem Determination, Resource Management, Cluster Management, Policy

Security APIs
• globus_gss_assist - simplifies the use of the GSSAPI in the Globus environment [1.1.x, 2.0]
• GSS API - the Generic Security Service API C bindings (IETF draft) [version 2]

Information Service APIs
• OpenLDAP - an API for the LDAP protocol used by MDS (developed by the OpenLDAP Project) [version 1.2]

Communication APIs
• globus_io - provides high-performance I/O with integrated security and a socket-like interface [1.1.x, 2.0]
• globus_nexus - provides multithreaded, asynchronous, thread-safe multiprotocol communication facilities [1.1.x, 2.0]
• globus_nexus_fd - provides NEXUS-based support for file descriptors and timed events (This API is obsolete as of release 1.1.2. We recommend use of globus_io instead.) [1.1.1]

Data Access APIs
• globus_ftp_control - provides low-level services for implementing FTP clients and servers [2.0]
• globus_ftp_client - provides a convenient way of accessing files on remote FTP servers [2.0]
• globus_gass_copy - provides a uniform interface for accessing files using a variety of protocols [2.0]
• globus_gass - provides clients with access to remote files [1.1.x]
• globus_gass_transfer - provides an API for clients and servers involved in GASS data transfer
• globus_gass_cache - manages the local GASS cache on a client system [1.1.x, 2.0]
• globus_gass_server_ez - provides a simple set of GASS server capabilities [1.1.x, 2.0]
• globus_gass_server - provides GASS server functionality (This API is obsolete as of release 1.1.2. We recommend use of globus_gass_transfer instead.) [1.1.1]
• globus_gass_client - allows clients to get and put remote files via several protocols (This API is obsolete as of release 1.1.2. We recommend use of globus_gass_transfer instead.) [1.1.1]

Data Management APIs
• globus_replica_catalog - provides an interface to a catalog of data collections, logical files, and physical locations [2.0]
• globus_replica_management - allows clients to manage files within a file replication system [2.0]

Resource Management APIs
• globus_gram_client - provides remote job submission and management capabilities [1.1.x, 2.0]
• globus_gram_myjob - provides a basic communication mechanism for processes within a GRAM job [1.1.x, 2.0]
• globus_gram_jobmanager - provides a simple, consistent way to interact locally with a variety of schedulers such as LSF, LoadLeveler, PBS, Condor, etc. [1.1.x, 2.0]
• globus_duroc - provides resource coallocation services for starting distributed jobs [1.1.x, 2.0]

Fault Detection APIs
• globus_hbm_client - allows a client process to be monitored by a Heartbeat Monitor system [1.1.x]
• globus_hbm_datacollector - allows clients to monitor multiple processes and enables the notification of exceptions [1.1.x]

Portability APIs
• globus_module - provides a mechanism for activating and deactivating software modules [1.1.x, 2.0]
• globus_libc - provides a portable implementation of libc [1.1.x, 2.0]
• globus_thread - implements threads and synchronization mechanisms [1.1.x, 2.0]
• globus_dc - provides cross-platform data conversion services
• globus_utp - supports the use of timers for monitoring applications and other programs [1.1.x, 2.0]
• globus_list - support for linked lists [1.1.x, 2.0]
• globus_fifo - supports first-in-first-out queues [1.1.x, 2.0]
• globus_hashtable - supports hash tables [1.1.x, 2.0]
• globus_url - supports URL strings [2.0]
• globus_error - provides an abstract error type for function return codes
• globus_poll - supports polling on I/O channels

see: http://www.globus.org/developer/api-reference.html



Recent Developments (Jan 20, 2004)
• WS-Resource Framework & WS-Notification announced January 20th, 2004 at Globus World in San Francisco
• Proposals to extend Web services: Modeling Stateful Resources with Web Services
• Driven by requirements from:
    • Grid computing
    • Systems Management
    • Business computing

[Figure: “Modeling Stateful Resources with Web Services”, surrounded by WS-ServiceGroup, WS-Notification, WS-BaseFaults, WS-ResourceProperties, and WS-ResourceLifetime.]


What Was Announced
• A family of Web services specification proposals
• Introduces a design pattern to specify how to use Web services to access “stateful” components
• Introduces message-based publish-subscribe to Web services

[Figure: “Modeling Stateful Resources with Web Services”, surrounded by WS-ServiceGroup, WS-Notification, WS-BaseFaults, WS-ResourceProperties, and WS-ResourceLifetime. A legend distinguishes specifications introduced in January from those still to be developed.]



What Was Announced

WS-Notification
• Provides a publish-subscribe messaging capability for Web services

WS-Resource Framework
• There are many possible ways Web services might model, access and manage state
• WS-RF is a family of Web services specifications that clarify how “state” and Web services combine

Both:
• Build upon existing Web services specifications and technology
• Help align Grid computing, Systems Management and Web services

Contributed to by:
• WS-Resource Framework: IBM, Globus, HP
• WS-Notification: IBM, Globus, Akamai, HP, SAP, Tibco, Sonic



The WS-Resource Framework Model

What is a WS-Resource?
Examples of WS-Resources:
• Physical entities (e.g. processor, communication link, disk drive) or logical constructs (e.g. agreement, running task, subscription)
• Real or virtual
• Static (long-lived, pre-existing) or dynamic (created and destroyed as needed)
• Simple (one) or compound (collection)

Unique: has a distinguishable identity and lifetime



The WS-Resource Framework Model

Architecture rationale
• WS-Resource Framework exploits WS-Addressing
    • Web services and WS-Resources are referenced using an “Endpoint Reference”
    • Services that create or locate WS-Resources return Endpoint References
• Web service and WS-Resource are separate:
    • A Web service is stateless
    • A WS-Resource provides a context / mechanism for stateful execution
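The separation between stateless service logic and stateful resources can be sketched as follows. This is an in-process analogy only: `EndpointReference`, `CounterService`, and the example address are hypothetical, and the dictionary stands in for whatever store actually holds the resource state. No WS-RF message formats are modelled.

```python
import itertools
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EndpointReference:
    """Hypothetical stand-in for a WS-Addressing endpoint reference."""
    service_address: str   # where the service lives
    resource_id: str       # which stateful resource the message targets


class CounterService:
    """Service logic keeps no per-caller state; state lives in the resources."""
    address = "http://example.org/counter"  # illustrative address only

    def __init__(self) -> None:
        self._resources: Dict[str, Dict[str, int]] = {}  # stand-in resource store
        self._ids = itertools.count(1)

    def create(self) -> EndpointReference:
        """Create a resource and return a reference to it, as a factory would."""
        resource_id = f"counter-{next(self._ids)}"
        self._resources[resource_id] = {"value": 0}
        return EndpointReference(self.address, resource_id)

    def increment(self, epr: EndpointReference, amount: int = 1) -> int:
        """Same operation for every caller; the reference selects the state."""
        state = self._resources[epr.resource_id]
        state["value"] += amount
        return state["value"]

    def destroy(self, epr: EndpointReference) -> None:
        """End the resource's lifetime."""
        del self._resources[epr.resource_id]


service = CounterService()
first, second = service.create(), service.create()
service.increment(first)
first_value = service.increment(first, 4)
second_value = service.increment(second)
```

Two references to the same service address yield independent state, which is the point of addressing the resource rather than the service alone.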



WS-Notification
Brings enterprise-quality publish and subscribe messaging to Web services
• Loosely coupled, asynchronous messaging in a Web services context
• Composes with other Web services technologies
• Facilitates integration between different messaging middleware environments

• Exploits the WS-Resource Framework and Web services technologies
• Standardizes the role of Brokers, Publishers, Subscribers and Consumers
• Provides two forms of publish/subscribe: direct publishing and brokered publishing
• Standardizes Web service message exchanges for publishing, subscribing and notification delivery
• Defines an XML model of Topics and TopicSpaces to categorize and organize notification messages
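Brokered publishing, one of the two forms named above, can be illustrated with a small in-memory broker. The `Broker` class, the topic strings, and the callback signature are assumptions made for this sketch; WS-Notification itself defines XML message exchanges, which are not reproduced here.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Consumer = Callable[[str, str], None]  # receives (topic, message)


class Broker:
    """Hypothetical in-memory broker sitting between publishers and consumers."""

    def __init__(self) -> None:
        self._subscriptions: DefaultDict[str, List[Consumer]] = defaultdict(list)

    def subscribe(self, topic: str, consumer: Consumer) -> None:
        """Register a consumer for one topic."""
        self._subscriptions[topic].append(consumer)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to every consumer of the topic; return the delivery count."""
        consumers = self._subscriptions.get(topic, [])
        for consumer in consumers:
            consumer(topic, message)
        return len(consumers)


received: List[str] = []
broker = Broker()
broker.subscribe("jobs/failed", lambda topic, msg: received.append(f"{topic}: {msg}"))
delivered = broker.publish("jobs/failed", "JOB 1 timed out")
ignored = broker.publish("jobs/done", "JOB 2 finished")
```

The publisher never learns who the consumers are, which is the loose coupling the slide refers to. In direct publishing the publisher would hold the subscription list itself.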



Open Grid Services Infrastructure (OGSI)

Grid Service Implementation Independence

[Figure: a Grid service implementation layered on a hosting environment, other middleware, the operating system, and hardware. The abstract service interface remains the same.]



Open Grid Services Infrastructure (OGSI)

Grid Service Implementation – Examples

[Figure: a File Transfer Service whose implementation may rest on a J2EE hosting environment, other middleware, a file system, a database (DB2), a storage system (NAS/SAN), the operating system, or hardware. The abstract service interface remains the same.]




Information… and Grid Computing



Managing Information at Different Levels

Data
• Global Naming
• Meta-data and catalog
• Federation and Transformation

File
• Distributed File Systems / Remote Access
• File Transfer / Data Replication
• Caching

Storage
• NAS / SAN “Storage Cluster”
• Automatic or Dynamic provisioning of storage
• Support for hierarchy management



IBM Products for an Information Grid

Storage
• Tivoli Storage Manager
    Features: Data backup/restore, data archive and retrieve
    Benefits: Centralized protection leading to faster backups and restores with less resources needed.
• Tivoli Storage Resource Manager
    Features: Enterprise wide reporting, file level analysis, subsystem reporting, automated capacity provisioning
    Benefits: Storage on demand for file systems. Reclaim wasted space consumed by non-essential files. Ensure storage used efficiently for future capacity.
• SAN Volume Controller
    Features: Creates pools of managed disks spanning multiple storage subsystems. Includes dynamic data-migration function.
    Benefits: Centralized point of control for volume mgmt. Allows administrators to migrate storage from one device to another w/o taking it offline.

File
• SAN File System
    Features: Provides a common file system specifically designed for storage networks. Manages the metadata on the storage network instead of within individual network servers.
    Benefits: Provides high performance access to data and enables sharing across heterogeneous application servers. Allows applications on any server within the SAN to access any file in the network without making changes to the application.
• NFS v4
    Features: Provides scalable access to GPFS from outside cluster. GPFS + NFSv4 provides the performance of a SAN File System scalable to a WAN.
    Benefits: Security and access control in a grid environment.
• GPFS (General Parallel File System)
    Features: Cluster based, shared disk, parallel file system. Data and metadata can flow to all nodes and all disks in parallel. Featured in HPC environments. Available on pSeries and Linux clusters.
    Benefits: Not a client-server file system like NFS, DFS, or AFS: no single server bottleneck, no protocol overhead for data transfer.

Data
• Avaki Data Grid 5.0*
    Features: Data catalog, data provisioning, reusable data integrations, caching capabilities.
    Benefits: Provisioning, access, and integration of data from multiple, heterogeneous, distributed sources.
• DB2 UDB
    Features: Relational database that runs on Linux, Unix, Windows, z/OS, and OS/390
    Benefits: Manageability features, Integrated Information capabilities via Web Services, Integrated business intelligence, and more
• DB2 Information Integrator
    Features: Federated data server, replication server
    Benefits: Query and access distributed data without requiring central repository. Supports movement of data from mixed relational data sources.

* Avaki is an IBM business partner




Autonomic Computing… and Grid Computing



Autonomic Computing Is
A continuously evolving and dynamic state that establishes the correct balance between what is managed by a person and what is managed by the system

Focus on business, not infrastructure



Why Autonomic Computing?

The interconnected characteristics of a complex system…
• Heterogeneity
• Large state space
• Unpredictable human element
• Unpredictable scalability
• Continuous change
• Open-endedness
• Connectedness

…need systems-level understanding with certain component and system characteristics
• Real-time
• Self-adaptive
• Self-organizing
• Self-healing
• Self-forming
• Self-testing
• Resilient



Self-managing Systems Deliver:
• Increased Responsiveness: Adapt to dynamically changing environments
• Business Resiliency: Discover, diagnose, and act to prevent disruptions
• Operational Efficiency: Tune resources and balance workloads to maximize use of IT resources
• Secure Information and Resources: Anticipate, detect, identify, and protect against attacks

“Autonomic computing allows companies to operate more efficiently and achieve more from their existing IT environments, enabling increased responsiveness, business continuance and availability.”
- Rick Sturm



The Autonomic Element: Sense & Respond
• An autonomic element contains a continuous control loop that monitors activities and takes action
• Autonomic elements learn from past experience to build action plans
• Managed elements are consistently monitored

[Figure: the autonomic computing control loop. Monitor, Analyze, Plan, and Execute stages share Knowledge and act on the managed Element through Sensors and Effectors.]

“IBM’s autonomic approach to automation goes well beyond integration to the truly intelligent, responsive and proactive capabilities needed to deliver e-business on demand.”
- Mark Hydar
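One pass of the monitor, analyze, plan, execute loop can be written down compactly. The sketch below is an assumption-laden toy: `ManagedElement`, `AutonomicManager`, the 0.8 threshold, and the "add one worker" plan are all invented for illustration and are not taken from any IBM toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagedElement:
    """Hypothetical managed resource exposing a sensor and an effector."""
    load: float          # total demand, in units of one worker's capacity
    workers: int = 1

    def sense(self) -> float:               # sensor: utilization per worker
        return self.load / self.workers

    def add_workers(self, n: int) -> None:  # effector: change the element
        self.workers += n


@dataclass
class AutonomicManager:
    """One pass of monitor, analyze, plan, execute over shared knowledge."""
    threshold: float = 0.8                       # illustrative policy value
    knowledge: List[float] = field(default_factory=list)

    def step(self, element: ManagedElement) -> Optional[int]:
        reading = element.sense()                # monitor
        self.knowledge.append(reading)           # record for later analysis
        overloaded = reading > self.threshold    # analyze
        if not overloaded:
            return None
        extra = 1                                # plan: add one worker
        element.add_workers(extra)               # execute
        return extra


element = ManagedElement(load=2.0)
manager = AutonomicManager()
actions = [manager.step(element) for _ in range(4)]
```

The loop adds workers until per-worker utilization drops under the threshold and then stops acting, while the knowledge list keeps every reading. A fuller version would use that history when planning, which this sketch does not attempt.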



Levels of Automation
• Level 1, Basic: Manual analysis and problem solving
• Level 2, Managed: Centralized tools, manual actions
• Level 3, Predictive: Cross-resource correlation and guidance
• Level 4, Adaptive: System monitors, correlates and takes action
• Level 5, Autonomic: Dynamic business policy based management

Evolution not revolution

“Autonomic computing is a vision that will take several years to realize, but with the model that IBM has outlined, there are benefits attainable at every step, which pay you back... fairly quickly for the investments you make.”
- Mike Gilpin



Autonomic Computing: Self Managing Systems
• Self-configuring: Adapt automatically to dynamically changing environments
• Self-healing: Discover, diagnose, and react to disruptions
• Self-optimizing: Monitor and tune resources automatically
• Self-protecting: Anticipate, detect, identify, and protect against attacks from anywhere

Autonomic Capabilities
OGSA Structure + Autonomic Backplane
Adaptive Grid



Grid Computing and the oDOE
• Open: Linux, XML, WSDL, SOAP, OGSA
• Autonomic: Self-protecting, Self-healing, Self-optimizing, Self-configuring
• Virtualized
• Integrated



Service-Oriented Architecture Evolution

Web Services

Complex Event Processing

Enterprise Infrastructure

Component Orchestration

Semantic Web

Standards-based info management framework

Warfighter events pattern recognition

Distributed collaborative processing with discovery

Orchestration of C4ISR components

Intelligent M2M collaboration



Service Oriented Architecture
Change of Paradigm at the core of Grid Computing
• Services “encapsulate” heterogeneous resources
• Services provide a composable, orchestrable, extensible base
• Common Resource Model (CRM) for abstractions key to manageability of resources

Simple Rules:
• Any function is implemented once and once only as a Service
• Services can be re-used at runtime or at deployment time
• Service providers and requesters are loosely bound:
    • Each service is defined by an implementation-independent interface
    • Services are defined in terms of common business function and data models
    • Communication protocols that emphasize interoperability and location transparency are used to mediate service interactions
• Service “contract” can come with a QoS “clause” (SLA)



Anatomy of a Service Interface

Interface by contract
• An explicit interface definition or contract is used to bind a service requestor and a service provider
• Specifies explicitly only the mutual behaviour; specifies nothing about the implementation of the requestor or the provider
• Allows either to change implementation or identity freely

Interface granularity
• Based on Service Type. Examples:
    • Business Process Services
    • Business Transaction Services
    • Business Function Services
    • Technical Function Services

[Figure: SYSTEM 1 and SYSTEM 2, each with its own internal code and processes and its own interface code, bound by a CONTRACT of shared process and interface definitions.]
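Interface by contract maps naturally onto a structural interface in code. In the sketch below, `QuoteService` plays the role of the contract and the two provider classes are interchangeable implementations; all names and prices are hypothetical and chosen only to show that the requestor depends on the contract alone.

```python
from typing import Protocol


class QuoteService(Protocol):
    """The contract: only the mutual behaviour, nothing about implementation."""

    def quote(self, symbol: str) -> float: ...


class TableQuoteProvider:
    """One hypothetical provider, backed by an in-memory table."""

    def __init__(self) -> None:
        self._prices = {"IBM": 90.0}  # made-up price for illustration

    def quote(self, symbol: str) -> float:
        return self._prices[symbol]


class FixedQuoteProvider:
    """A second provider with a different implementation, same contract."""

    def quote(self, symbol: str) -> float:
        return 1.0


def position_value(service: QuoteService, symbol: str, shares: int) -> float:
    """Requestor code: depends on the contract, not on any provider class."""
    return service.quote(symbol) * shares


table_value = position_value(TableQuoteProvider(), "IBM", 10)
fixed_value = position_value(FixedQuoteProvider(), "IBM", 10)
```

Swapping one provider for the other requires no change to `position_value`, which mirrors the slide's point that either side may change implementation or identity freely.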



Refactoring: Things to Deal With

Many existing applications are monolithic or tightly coupled
• Need to re-factor applications
• Some things to worry about are:
    • Distributed threads
    • Data locking
    • Latency

Re-Hosting Applications
• Exploit Meta-OS services
• Achieve platform independence
• Re-factor for distributed parallel execution

Need for Re-Hosted Middleware
• Ability to exploit Grid computing services, e.g. Distributed Provisioning
• Manage (and exploit) Quality of Service across the Grid

Challenge: Move to and Exploit Services Oriented Architecture



Can Your Application Benefit from Grid Computing?

How do you know if your application can benefit from Grid computing? Ask these questions:

Q. Is the application computationally intensive?
Q. Does it serve a distributed or collaborative community?
Q. Can the tasks or jobs the application performs run in parallel?
Q. Does the application do pattern matching?
Q. Does it have a reasonable network bandwidth profile?

A. If the answer to any or all of these is yes, then Grid-enablement is feasible.

Q. What is the application processing type (e.g., serial or batch)?

A. Batch is currently more amenable to Grid enablement.

Q. Do the operations within the task have time and/or sequencing dependencies?

A. The fewer dependencies, the better.

Q. What are the bottlenecks in the existing use of the application (e.g., single processor performance, scalability, memory, data output volume, pre/post processing)?

A. Grid can potentially address these bottlenecks.



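CPU – Make Execution Parallel

Rearranging computations to execute in parallel on Grid

[Figure: processors plotted against time. The serial application adds one long series (223+837+383+662+121+554+123+816+228+772+452+827+972+274+...+23) on a single processor. The parallel version splits the series into partial sums (223+...+772, 452+...+845, 183+...+559, 884+...+121, 314+...+265, 271+...+173, 491+...+23) on separate processors and then adds the partial results (2443+...+9772), so the parallel application is done before the serial one.]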
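The rearrangement in the figure, partial sums computed separately and then combined, can be sketched as below. The series is the first fourteen terms shown on the slide. A local thread pool stands in for grid nodes purely for illustration: in standard CPython, threads do not speed up CPU-bound work, so this shows the structure of the computation rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(values: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Cut the series into up to `parts` contiguous chunks of near-equal length."""
    if not values:
        return []
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values: Sequence[int], workers: int = 4) -> int:
    """Sum each chunk on its own worker, then add the partial sums."""
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


series = [223, 837, 383, 662, 121, 554, 123, 816, 228, 772, 452, 827, 972, 274]
total = parallel_sum(series)
```

Addition is associative, so regrouping the terms does not change the result. Computations with dependencies between steps cannot be split this freely.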



CPU – Programming Code Control Graph

[Figure: a program control graph made up of Sequence, if, and Loop nodes.]

Rearranging computations
• Separate subgraphs to run in parallel
• Consider data dependencies
• Change algorithms



Compute & Data Intensive Application

Video conversion problem
• Capture video tape onto computer hard drive
    • About 200 Megabytes per minute
    • 25 Gigabytes for a 2 hour tape
• Compress video and audio
    • Can take days at higher quality level
• Write VCD, SVCD, or DVD disc (650 MB to 4.7 GB)



[Figure: single stream. VCR capture to hard drive (2 hours, 24 GB), then compression (30 hours), then write to disc (10 minutes, 4.7 GB).]

[Figure: using a Grid. VCR capture to hard drive (2 hours, 24 GB), data transfer to the Grid (45 minutes at 100 Mb/s), compression in parallel on many nodes (much less than 30 hours), data transfer back (9 minutes at 100 Mb/s), then write to disc (10 minutes, 4.7 GB).]
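The sizes and transfer times in this example can be checked with a little arithmetic. The sketch assumes decimal units (1 GB = 1000 MB) and reads "100mb/s" as 100 megabits per second; both are assumptions, as is ignoring protocol overhead.

```python
MB_PER_MINUTE = 200          # capture rate quoted on the slide
TAPE_MINUTES = 120           # a 2 hour tape
LINK_MEGABITS_PER_SEC = 100  # link speed quoted in the figure


def transfer_minutes(size_gb: float, link_mbps: float = LINK_MEGABITS_PER_SEC) -> float:
    """Ideal transfer time, using decimal units and ignoring protocol overhead."""
    megabits = size_gb * 1000 * 8
    return megabits / link_mbps / 60


captured_gb = MB_PER_MINUTE * TAPE_MINUTES / 1000   # size of the captured tape
raw_transfer = transfer_minutes(captured_gb)        # sending the capture to the grid
dvd_transfer = transfer_minutes(4.7)                # bringing the compressed result back
```

Under these assumptions the capture comes to 24 GB, matching the figure, and the ideal transfer times are about 32 minutes and a little over 6 minutes. The figure quotes 45 and 9 minutes, which would be consistent with a link delivering roughly 70% of its nominal rate, though the slides do not state how those numbers were derived.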




Overlapping data transfer with capture and computing:

[Figure: the same Grid pipeline with the video handled in segments. While the VCR to HD capture (2 hours, 24 GB) is still running, captured segments are already being transferred to The Grid and compressed on separate nodes, each with its own HD. Compressed segments are transferred back to an HD and written as the 4.7 GB result (10 minutes).]
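The benefit of overlapping can be estimated with a simple pipeline model: when the work is cut into equal segments and each stage handles one segment at a time, the pipeline fills once and then advances at the pace of its slowest stage. The sketch below is such a model, not the presenter's calculation, and the segment count and per-segment stage times in the example are hypothetical.

```python
def serial_minutes(segments, stage_minutes):
    """No overlap: each segment passes through every stage before the next starts."""
    return segments * sum(stage_minutes)


def pipelined_minutes(segments, stage_minutes):
    """Stages overlap: after the pipeline fills, the slowest stage sets the pace."""
    return sum(stage_minutes) + (segments - 1) * max(stage_minutes)


# Hypothetical per-segment times in minutes: capture, transfer, compress.
stages = (15, 4, 10)
print(serial_minutes(8, stages))
print(pipelined_minutes(8, stages))
```

In this model, capture is the slowest stage, so once transfer and compression keep up, the total time approaches the capture time plus a short tail.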




Six Strategies for Grid Application Enablement




Strategy 1: Batch Anywhere
• Only the grid (not the application, the client, the user, or anything else) decides which node to use for the job
• The machine submitting the job might not be a node in the grid
• Example application: a query to determine whether a given number, x, is a prime number. More than one node in the grid can submit the same query. The grid returns the correct results to the submitter.

Strategy 2: Independent Concurrent Batch
• Multiple independent instances of the same application run concurrently and independently without interference.
• Independent jobs are common. For example, Job X for Account A can run concurrently with Job X for Account B. Databases and other resources don't have hot spots or deadlocks.

Strategy 3: Parallel Batch
• Take each user's batch work, subdivide it, disperse it out to multiple nodes, collect it, and then aggregate the results.
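The defining point of Strategy 1, that the grid rather than the submitter chooses the node, can be sketched with a toy scheduler built around the slide's prime-number query. Everything here is illustrative: the `Grid` class, the node names, and the fewest-jobs placement rule are assumptions, not a real Grid middleware API.

```python
def is_prime(x):
    """Trial division; adequate for a small illustration."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


class Grid:
    """Toy scheduler: the grid, not the submitter, picks the node."""

    def __init__(self, node_names):
        self.jobs_run = {name: 0 for name in node_names}

    def submit(self, job, argument):
        # Choose the node that has run the fewest jobs so far.
        node = min(self.jobs_run, key=self.jobs_run.get)
        self.jobs_run[node] += 1
        return node, job(argument)


grid = Grid(["node-a", "node-b", "node-c"])
for x in (97, 98, 101):
    node, answer = grid.submit(is_prime, x)
    print(x, answer, "ran on", node)
```

The submitter only calls `submit` and never names a node, which is what lets the same query be issued from a machine that is not itself part of the grid.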




Strategies 4, 5, & 6 use services on the grid in order to get jobs done.

Strategy 4: Service
• Focus on the transition from a batch to a service-oriented architecture
• A follow-on to Independent Concurrent Batch
• It is not assumed that each client subdivides its work and spreads it over multiple service instances

Strategy 5: Parallel Services
• Service with the subdivided work model of Parallel Batch
• Provides multiple service instances
• Permits these instances to be invoked in parallel on the client's behalf

Strategy 6: Tightly Coupled Parallel Programs
• The domain of specialized applications in engineering, physics, and biological modeling, such as finite state analysis
• Provides intense communications and synchronization between client and services and among services



From Enablement to Exploitation



Three Stages for Implementation

Run
• Strategies 1 and 2, and the simplest form of Strategy 3, focus on the ability of an application to run in a grid.

Adapt
• The more complex form of Strategy 3 as well as Strategies 4 and 5 significantly adapt the function and value of the business application by enabling it to use a grid without requiring many changes that are specific to grid middleware. The same application could be structured to run in a non-grid environment.

Exploit
• Applications at Strategy 6 exploit the grid or cluster infrastructure for their operation because they were written from the start with a grid in mind. Strategy 6 applications cannot finish in a timely and successful manner without running in a grid.

See: http://www-106.ibm.com/developerworks/grid/library/gr-enable/




It’s Not Just Limited to Applications

Middleware
• Application Servers
• GPFS, Database, Transaction Managers
• Systems Management Software
• Collaborative Software
• …

Resources
• Processors
• Storage
• Network
• …

And more…



Example: GPFS Parallel Access

Parallel Cluster File System

Cluster – fabric-interconnected nodes (IP, SAN, …)

Shared disk – all data and metadata on fabric-attached disk

Parallel – data and metadata flows from all of the nodes to all of the disks in parallel under control of distributed lock manager.

Fine grain locks – efficient sharing of individual files

[Figure: GPFS file system nodes connected through a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device).]



GPFS: Information Management For The Grid

Goal: sharing GPFS file systems over the WAN

WAN adds 10-60 ms latency… but under load, storage latency is much higher than this anyway!

New GPFS feature
• GPFS NSD now allows both SAN and IP access to storage
• SAN-attached nodes go direct
• Non-SAN nodes use NSD over IP

Award winning demo at SC03

Work in progress

[Figure: SC03 demo topology. Three sites (SDSC, NCSA, and Sc2003) each have compute nodes, NSD servers, a SAN, and a GPFS file system (/SDSC, /NCSA, /Sc2003), and are linked over Scinet. Labels show /SDSC and /NCSA accessed over SAN and over WAN; a Visualization system is also shown.]
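The two points on this slide, how a node reaches storage and why added WAN latency can matter less under load, can be put in a few lines. This is a sketch only: the function names are made up for illustration and are not GPFS commands or APIs, and the latency figures in the example are hypothetical apart from the slide's 10-60 ms WAN range.

```python
def storage_path(san_attached):
    """SAN-attached nodes go direct; all other nodes use NSD over IP."""
    return "direct SAN" if san_attached else "NSD over IP"


def wan_share_of_latency(storage_ms, wan_ms):
    """Fraction of total request latency contributed by the WAN hop."""
    return wan_ms / (storage_ms + wan_ms)


# Hypothetical: 200 ms storage latency under load, 60 ms WAN latency.
print(storage_path(san_attached=False))
print(round(wan_share_of_latency(storage_ms=200, wan_ms=60), 2))
```

The higher the storage latency under load, the smaller the share contributed by the WAN hop, which is the slide's argument for sharing file systems over the WAN.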



Some Important Infrastructure Considerations

Security
• Authentication/authorization
• Client and server concerns

Information services
• What resources exist, what is their state and how do I access them?

Data management
• How do I access, move, replicate data to where I need it?

Resource management
• How do I run a job and monitor its state?
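The information-services question (what resources exist, what is their state, and how do I access them?) amounts to querying a registry. A minimal sketch follows, assuming a hypothetical in-memory registry. The `Resource` fields, the state names, and the endpoints are invented for illustration and do not correspond to any particular Grid information service.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str      # e.g. "compute" or "storage"
    state: str     # e.g. "idle", "busy", "down"
    endpoint: str  # how to reach it


def find(resources, kind, state="idle"):
    """Return the resources of one kind that are in the requested state."""
    return [r for r in resources if r.kind == kind and r.state == state]


registry = [
    Resource("node-a", "compute", "idle", "node-a.example.org"),
    Resource("node-b", "compute", "busy", "node-b.example.org"),
    Resource("disk-1", "storage", "idle", "disk-1.example.org"),
]
for r in find(registry, "compute"):
    print(r.name, r.endpoint)
```

A resource manager would consult a lookup of this kind before placing a job and then track the job's state as it runs.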




The Realm of the Possible



Why Are Customers Implementing Grid Computing Solutions?

Accelerate Business Processes
• Grids provide the ability to shorten application run-times without upgrading existing servers (e.g., Charles Schwab, MassMutual, RBC Insurance, Nippon Life Insurance, Royal Dutch Shell, EADS)

Ability to run new High Performance Computing (HPC) applications
• Grid computing provides the opportunity to run new applications due to the cost-effective grid virtual computing environment (e.g., AIST, UMass, FNMOC, TeraGrid)

Data Sharing & Collaboration
• Grid architecture provides the ability to store, share and analyze large volumes of data (e.g., eDiamond, NDMA, WestGrid, CERN, European DataGrid, Kansai Electric)

Accelerate Research & Development
• Grids provide Life Science companies the ability to speed up drug research & development (e.g., Smallpox Grid, Aventis, Novartis)

I/T Optimization & Resiliency – Virtualization of Servers & Storage



Grid Computing – Industry Applications

Unique by Industry with Common Characteristics

Primary Focus (industries): Energy, Financial Services, Manufacturing, Life Sciences, Telco & Media, Government & Higher Education

Applications shown in the figure, all resting on a common Grid Infrastructure:
• Derivatives Analysis
• Statistical Analysis
• Portfolio Risk Analysis
• Batch Throughput
• Product Design
• Process Simulation
• Finite Element Analysis
• Failure Analysis
• Cancer Research
• Drug Discovery
• Protein Folding
• Protein Sequencing
• Collaborative Research
• Weather Analysis
• High Energy Physics
• Seismic Analysis
• Reservoir Analysis
• Bandwidth Consumption
• Digital Rendering
• Multiplayer Gaming



IBM Grid Focus Areas and Information Grid

Virtualization of Compute and Information Resources

Sectors: Financial Services, Public, Industrial

Research and Development Grid
• Description: Accelerate and enhance the R&D process by enabling the sharing of data and computing power seamlessly for research intensive applications
• Information Grid: Sharing of public data sources. Also supports use of shared compute resources.

Engineering and Design Grid
• Description: Share data and computing power, for computing intensive engineering and scientific applications, to accelerate product design
• Information Grid: Sharing design data across large multi-party projects.

Business Analytics Grid
• Description: Enable faster and more comprehensive business planning and analysis through the sharing of data and computing power
• Information Grid: Facilitating access to large scale data marts and broadly distributed client data.

Enterprise Optimization Grid
• Description: Optimize computing and data assets to improve utilization, efficiency and business continuity
• Information Grid: Virtualized distributed storage and data resources

Government Development Grid
• Description: Create large-scale IT infrastructures to drive economic development and/or enable new government services
• Information Grid: Provide large scale data sharing infrastructure for industry and scientific collaboration



National Digital Mammographic Archive

Sponsored by: University of Pennsylvania

[Figure: network map of the project sites (U Toronto, U Penn, U NC, Oak Ridge, U CHI) and the networks connecting them: MREN, STARTAP, MAGPI, NCNI, SOX, ESNet, CANet*2, the NGIX Chicago Peering Point, and the Indianapolis, Atlanta, and New York GigaPops. Legend: Abilene Peer Network, Abilene Connector, Project Site.]



NDMA: National Digital Mammographic Archive

Research & Development

Combines Grid Computing with Radiology to make breast cancer diagnosis faster and treatment more effective

IBM assisted with implementing a Grid infrastructure across the hospitals to manage and retrieve digital mammograms

Secure transmission of all patient records

Grid solution architecture includes: IBM pSeries, xSeries, Linux, DB2, GPFS, Globus

Web site: http://nscp.upenn.edu/NDMA



The TeraGrid – Extensible Terascale Facility

National Science Foundation Grid Computing project ($90M):

Connects nine (9) major supercomputing sites: NCSA, SDSC, Argonne NL, CalTech, PSC, UTexas, IndianaU, PurdueU, Oak Ridge NL
• 40 gigabit network backbone connecting the sites
• 20 Teraflops of computing power
• 1 Petabyte of disk accessible data storage

Accessible to thousands of scientists working on advanced research

Applications include:
• Real Time Brain Mapping
• Earthquake Modeling
• Molecular Dynamics simulation
• Mcell – Monte Carlo simulation of cellular micro physiology
• Encyclopedia of Life – Protein catalog

IBM project team and solution includes:
• IBM High Performance Computing (HPC) expertise
• IBM GPFS expertise
• IBM Linux Clusters – Itanium2 processors
• IBM Power4 processors – p690 Regattas
• IBM Grid Computing & Linux consulting services



UK eScience Grid

[Figure: map of UK eScience Grid sites (Cambridge, Newcastle, Edinburgh, Glasgow, Cardiff, Southampton, London, Belfast, Dublin, Oxford, Manchester) with links to CERN, the EU, and US sites.]

Multiple Grid Applications including:
• High Energy Physics
• Aircraft Engine Maintenance
• Combinatorial Chemistry
• Oceanographic study
• Particle Physics & Astronomy
• Biomolecular analysis
• Environmental simulation

Heterogeneous Grid:
• IBM, Sun, HP servers
• Linux, Globus, Condor, SRB



The SMALLPOX Research Grid

A massive distributed computing grid running a computational chemistry application to help fight the smallpox virus:
• Screened 35 million potential drug molecules
• Two (2) million computer processors in 200 countries were connected to this grid

The Grid architecture will reduce the time required to develop a commercial drug by several years:
• “In-Silico” Research

IBM collaborated with:
• United Devices
• Accelrys
• Evotec OAI
• US Department of Defense
• Oxford University

IBM provided the hardware and software for storing and analyzing the molecule screening results:
• p690, AIX, DB2, Linux



Butterfly.net

The Butterfly Grid:
• an end-to-end solution designed to support up to one million simultaneous users
• based on IBM WebSphere Application Server, DB2 and the Globus Toolkit
• running on IBM eServer xSeries clusters at an IBM e-business Hosting Center

Modeling and Simulation platform



The Butterfly Grid: Service Provider Program

Package:
• Butterfly server software suite
• Butterfly game admin apps
• Globus, provisioning, policy mgt, billing

Shared Grid or Dedicated
• Gamers
• Industrial
• Military

Notes
• Inter-node resource sharing
• Value-added broadband package
• SLAs, QoS guarantees, ratings/certification



Japan AIST (National Institute of Advanced Industrial Science & Technology)

Challenge
• AIST, Japan's largest national research organization, needed to provide an on-demand computing infrastructure which dynamically adapts to support various research requirements of its collaborators focusing on grid computing, life sciences, and nanotechnology.

Solution
• Linux Cluster: 2116 CPU AMD Opteron Cluster and 520 CPU Intel Madison Cluster
• Globus Toolkit 3.0 (OGSA)

One of the world's most powerful Linux-based supercomputers
• More than 11 trillion calculations per second
• More powerful than the current third most powerful supercomputer in the world

[Figure: the AIST Advanced Computing Center, with Grid Technology, Life Science, and Nanotechnology research, linked over LAN and Internet to collaborations with Government, Academia, Corporations, and Other Research Institutes.]



Grid Computing @ IBM

[Figure: world map of IBM internal Grid locations. Each label carries two counts, reproduced here as shown on the slide.]

Charlotte (1) 3
RTP (2) 7
Cambridge (2) 4
Hawthorne (2) 4
Poughkeepsie (4) 4
Somers (1) 1
Southbury (4) 13
Yorktown Heights (7) 28
Markham (2) 14
San Jose (7) 13
San Mateo (2) 7
Hursley (1) 3
London (1) 2
Montpellier (2) 10
Uithorn (1) 2
Boeblingen (2) 2
Zurich (1) 6
Haifa (1) 2
Austin (9) 58
Roanoke (2) 4
Bangalore (1) 1
Chiba (1) 2
Tokyo (2) 2
Taipei (2) 2
Rochester (3) 15
Chicago (1) 1
Sapporo (1) 4
Beijing (2) 7

27 different geographic locations
137 end user teams
66 Grid applications

Heterogeneous platforms:
- Linux on x, z, p series
- AIX on pSeries

Globus 2.2 & Globus 3.0



IBM Grid Middleware – Product Roadmap

[Figure: product roadmap relating Grid Capabilities to IBM offerings.]

Grid Capabilities:
• Grid Services (OGSA) & Web Services
• Scheduling
• Information Virtualization
• Provisioning
• Workload Management
• Billing and Metering
• Transaction Management

Offerings named in the figure: TotalStorage, GridXpert, IBM Grid Toolbox



1. Intra-Grids

2. Extra-Grids

3. Inter-Grids

[Figure: the three deployment types, each drawn as one or more Grids with NAS/SAN storage; a VPN link appears in the diagram.]





…. Look Familiar?



… How About This?



Summary
• Grid Computing is still evolving
• It is built on existing and new open computing standards
• It exploits existing components and technologies
• It can be, and is being, used today
• There are many ways and places to exploit Grid Computing
• Make decisions based on “business” needs
• IBM is leading with both products and services for Grid Computing



Thank You

Questions?



References (Articles and Publications)

M. Mitchell Waldrop, Grid Computing, MIT Technology Review, May 2002, pp. 30-37

I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid, http://www.globus.org/research/papers/anatomy.pdf

I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, http://www.globus.org/research/papers/ogsa.pdf

I. Foster, C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, Calif. (1999)

IBM Redbook: Introduction to Grid Computing with Globus, http://www.ibm.com/redbooks/




References (URLs)

IBM Grid Web Site: http://www.ibm.com/grid/
Globus: http://www.globus.org/
OGSA (Open Grid Services Architecture): http://www.globus.org/ogsa/
Global Grid Forum: http://www.gridforum.org
Grid Computing Planet: http://www.gridcomputingplanet.com
Grid Today Newsletter: http://www.gridtoday.com
NASA's Information Power Grid: http://www.ipg.nasa.gov
DOE Science Grid: http://www.doesciencegrid.org
Particle Physics Data Grid, PPDG: http://www.ppdg.net/
National Digital Mammographic Archive: http://www.isi.edu/us-uk.gridworkshop/presentations/hollebeek.pdf
NSF TeraGrid: http://www.teragrid.org/
NASA Information Power Grid: http://www.nas.nasa.gov/About/IPG/ipg.html
UK eScience Program: http://www.research-councils.ac.uk/escience/
UK e-Science Grid Program: http://www.escience-grid.org.uk/
e-Diamond: http://www.gridoutreach.org.uk/docs/pilots/ediamond.htm
European Union DataGrid Project: http://www.eu-datagrid.org/




Additional IBM Grid Information: Red Paper & Red Book

Download from http://www.redbooks.ibm.com




IBM RedBook: Grid Enabling Applications

Download from www.redbooks.ibm.com




References:

http://www.ibm.com/developerworks/grid/library/gr-visual

developerWorks Journal, November 2003 Issue

Good reference for IBM and customer technical people

Covers some of the same material as this presentation




References: (Continued)

http://www.varbusiness.com/sections/news/breakingnews.asp?articleid=45311

varBusiness, October 27, 2003 Issue
