Paul Giangarra — Grid Computing, SAO, and Autonomic Computing Page 2
Agenda
• Grid Computing, a Brief Introduction
• Grid Computing Core Concepts
• Grid Computing Standards and Architecture
• Information and Grid Computing
• Autonomic Computing and Grid Computing
• Service Oriented Architecture and Grid Computing
• (Now What do I do With All This?)
What’s the Problem?
Grid Problem:
• Provide for flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions & resources (a.k.a. virtual organizations)
• This includes unique authentication, authorization, resource access, and resource discovery
Grid Challenge:
• Create an architecture and solution set based on open standards and, where they exist, exploit existing technologies to solve this
See: The Anatomy of the Grid by Foster, Kesselman, Tuecke
What Is NOT a Grid?
• The 8:00 AM rush hour (that’s gridlock)
• A bunch of PCs on a network (it’s a lot more than that)
• A cluster, a network attached storage device, a scientific instrument, a network, etc. (each is an important component of a Grid, but by itself each does not constitute a Grid)
So, What Is a Grid?
More correctly, what is Grid Computing?
• Based on services-oriented architecture
• Based on standard, open, general-purpose protocols and interfaces
Grid Computing, Services, and Technologies:
• Help coordinate and manage disparate and possibly heterogeneous resources that are not subject to centralized control
• Can be used to deliver non-trivial quantities of service
• Can be used to aggregate disparate IT elements such as compute resources, data storage and filing systems to create a single, unified virtual system
Motivations for Grid Computing
Reduce “Time to Results”
• Exploit opportunities for parallel computing to allow business critical computation to be completed in a timely fashion
• Gain competitive advantage by allowing computation to be executed more frequently and on customer demand
• Deliver real-time results to internal and external customers
Motivations for Grid Computing
Enable Collaborations
• Enable collaboration across applications to integrate results
• Support large multi-disciplinary collaborations
  • Both within a single organization and between partners
Open Grid Services Architecture (OGSA)
Objectives:
• Manage resources across distributed heterogeneous platforms
• Deliver seamless QoS
• Provide a common base for autonomic management solutions
• Define open, published interfaces
Grid Computing Protocol Architecture
• Resource and Connectivity protocols, which facilitate the sharing of resources
• Build on capabilities provided by lower layers
Design goals:
• Place few constraints on implementation
• Focus on small set of core abstractions
• Emphasize identification and definition of protocols and services
• Identify and define APIs and SDKs
• Provide for a Secure Environment
[Figure: protocol layers, bottom to top: Fabric, Connectivity, Resource, Collective, Applications. The layered Grid Computing protocol architecture is based on Open Standards.]
Grid Protocol – Fabric
• Provides the resources to which shared access is mediated by Grid protocols
• Examples include computational resources, storage systems, catalogs, or network resources
• Includes logical resources such as distributed file systems and clusters
• Resources implement inquiry mechanisms that permit discovery of their structure, state, and capabilities
Grid Protocol – Connectivity
• Defines core communication and authentication protocols required for Grid-specific network transactions
• Communication protocols enable the exchange of data between Fabric layer resources
  • Transport, Routing, Naming
• Authentication protocols build on communication services
  • Provide cryptographically secure mechanisms for verifying the identity of users and resources
  • Asymmetric cryptography
  • Single Sign On, Delegation, Security Integration, Trust Relationships
Grid Protocol – Resource
• Builds on Connectivity layer communication and authentication protocols
• Defines protocols (and APIs and SDKs) for the secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources
• Concerned entirely with individual resources
• Ignores issues of global state and atomic actions across distributed collections
Grid Protocol – Collective
Protocols and services (and APIs and SDKs) that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources:
• Directory services
• Co-allocation, scheduling, and brokering services
• Monitoring and diagnostic services
• Data replication services
• Grid-enabled programming systems
• Workload management
• Community authorization and accounting
• Software discovery services
• Collaborative services
Security APIs
• globus_gss_assist - simplifies the use of the GSSAPI in the globus environment [1.1.x, 2.0]
• GSS API - the Generic Security Service API C bindings (IETF draft) [version 2]
Information Service APIs
• OpenLDAP - an API for the LDAP protocol used by MDS (developed by the OpenLDAP Project) [version 1.2]
Communication APIs
• globus_io - provides high-performance I/O with integrated security and a socket-like interface [1.1.x, 2.0]
• globus_nexus - provides multithreaded, asynchronous, thread-safe multiprotocol communication facilities [1.1.x, 2.0]
• globus_nexus_fd - provides NEXUS-based support for file descriptors and timed events (This API is obsolete as of release 1.1.2. We recommend use of globus_io instead.) [1.1.1]
Data Access APIs
• globus_ftp_control - provides low-level services for implementing FTP client and servers [2.0]
• globus_ftp_client - provides a convenient way of accessing files on remote FTP servers [2.0]
• globus_gass_copy - provides a uniform interface for accessing files using a variety of protocols [2.0]
• globus_gass - provides clients with access to remote files [1.1.x]
• globus_gass_transfer - provides an API for clients and servers involved in GASS data transfer
• globus_gass_cache - manages the local GASS cache on a client system [1.1.x, 2.0]
• globus_gass_server_ez - provides a simple set of GASS server capabilities [1.1.x, 2.0]
• globus_gass_server - provides GASS server functionality (This API is obsolete as of release 1.1.2. We recommend use of globus_gass_transfer instead.) [1.1.1]
• globus_gass_client - allows clients to get and put remote files via several protocols (This API is obsolete as of release 1.1.2. We recommend use of globus_gass_transfer instead.) [1.1.1]
Data Management APIs
• globus_replica_catalog - provides an interface to a catalog of data collections, logical files, and physical locations [2.0]
• globus_replica_management - allows clients to manage files within a file replication system [2.0]
Resource Management APIs
• globus_gram_client - provides remote job submission and management capabilities [1.1.x, 2.0]
• globus_gram_myjob - provides a basic communication mechanism for processes within a GRAM job [1.1.x, 2.0]
• globus_gram_jobmanager - provides a simple, consistent way to interact locally with a variety of schedulers such as LSF, LoadLeveler, PBS, Condor, etc. [1.1.x, 2.0]
• globus_duroc - provides resource coallocation services for starting distributed jobs [1.1.x, 2.0]
Fault Detection APIs
• globus_hbm_client - allows a client process to be monitored by a Heartbeat Monitor system [1.1.x]
• globus_hbm_datacollector - allows clients to monitor multiple processes and enables the notification of exceptions [1.1.x]
Portability APIs
• globus_module - provides a mechanism for activating and deactivating software modules [1.1.x, 2.0]
• globus_libc - provides a portable implementation of libc [1.1.x, 2.0]
• globus_thread - implements threads and synchronization mechanisms [1.1.x, 2.0]
• globus_dc - provides cross-platform data conversion services
• globus_utp - supports the use of timers for monitoring applications and other programs [1.1.x, 2.0]
• globus_list - support for linked lists [1.1.x, 2.0]
• globus_fifo - supports first-in-first-out queues [1.1.x, 2.0]
• globus_hashtable - supports hash tables [1.1.x, 2.0]
• globus_url - supports URL strings [2.0]
• globus_error - provides an abstract error type for function return codes
• globus_poll - supports polling on I/O channels
• A family of Web services specification proposals
• Introduces a design pattern to specify how to use Web services to access “stateful” components
• Introduces message based publish-subscribe to Web services
WS-Notification
• Provides a publish-subscribe messaging capability for Web Services
WS-Resource Framework
• There are many possible ways Web services might model, access and manage state
• WS-RF is a family of Web services specifications that clarify how “state” and Web services combine
Both:
• Build upon existing Web services specifications and technology
• Help align Grid computing, Systems Management and Web services
Contributed to by:
• WS-Resource Framework: IBM, Globus, HP
• WS-Notification: IBM, Globus, Akamai, HP, SAP, Tibco, Sonic
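The idea of a Web service giving access to “stateful” components can be illustrated with a small sketch. This is plain Python, not the WS-RF message format, and every name in it (JobResource, JobService, the dictionary used as a resource store) is hypothetical. It only shows the pattern of a service that keeps no state of its own and is told, on every call, which resource to act on.

```python
import uuid
from dataclasses import dataclass


@dataclass
class JobResource:
    """Hypothetical stateful component: the state lives here, not in the service."""
    status: str = "pending"


class JobService:
    """Service front end that holds no per-client state of its own.

    Every operation names the resource it acts on, and the resources
    live in a store that is passed in from outside (an assumption made
    to keep the sketch small).
    """

    def __init__(self, store: dict) -> None:
        self._store = store

    def create(self) -> str:
        # The returned identifier plays the role of a reference the client keeps.
        resource_id = str(uuid.uuid4())
        self._store[resource_id] = JobResource()
        return resource_id

    def get_status(self, resource_id: str) -> str:
        return self._store[resource_id].status

    def set_status(self, resource_id: str, status: str) -> None:
        self._store[resource_id].status = status

    def destroy(self, resource_id: str) -> None:
        del self._store[resource_id]


if __name__ == "__main__":
    store: dict = {}
    service = JobService(store)
    job = service.create()
    service.set_status(job, "running")
    print(job, service.get_status(job))
```

Because the state sits behind an identifier, two clients holding different identifiers see independent state through the same service.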
WS-Notification
Brings enterprise quality publish and subscribe messaging to Web services
  • Loosely coupled, asynchronous messaging in a Web services context
  • Composes with other Web services technologies
  • Facilitates integration between different messaging middleware environments
• Exploits WS Resource framework and Web services technologies
• Standardizes the role of Brokers, Publishers, Subscribers and Consumers
• Provides two forms of publish/subscribe: direct publishing and brokered publishing
• Standardizes Web service message exchanges for publishing, subscribing and notification delivery
• Defines XML model of Topics and TopicSpaces to categorize and organize notification messages
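The two forms of publish/subscribe named above can be sketched in a few lines. This is an in-process Python illustration only: it does not use the WS-Notification XML messages, topics are flat strings rather than a TopicSpace, and the class names are chosen for the example.

```python
from collections import defaultdict
from typing import Callable

# A consumer is any callable that accepts (topic, message).
Consumer = Callable[[str, str], None]


class NotificationProducer:
    """Direct publishing: consumers subscribe with the producer itself."""

    def __init__(self) -> None:
        self._subscriptions: dict[str, list[Consumer]] = defaultdict(list)

    def subscribe(self, topic: str, consumer: Consumer) -> None:
        self._subscriptions[topic].append(consumer)

    def notify(self, topic: str, message: str) -> None:
        # Deliver only to consumers subscribed to this exact topic.
        for consumer in self._subscriptions.get(topic, []):
            consumer(topic, message)


class Broker(NotificationProducer):
    """Brokered publishing: publishers hand messages to an intermediary,
    so publishers and consumers never need to know about each other."""

    def publish(self, topic: str, message: str) -> None:
        self.notify(topic, message)


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("jobs/done", lambda t, m: print(f"[{t}] {m}"))
    broker.publish("jobs/done", "job 42 finished")
    broker.publish("jobs/failed", "job 7 failed")  # no subscriber, nothing printed
```

In the direct form the subscriber must locate each producer; in the brokered form it only needs to know the broker and the topic.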
Products by category (Storage, File, Data), with Features and Benefits:
Storage
  Tivoli Storage Manager
    Benefits: Centralized protection leading to faster backups and restores with less resources needed.
  Tivoli Storage Resource Manager
    Benefits: Storage on demand for file systems. Reclaim wasted space consumed by non-essential files. Ensure storage used efficiently for future capacity.
  SAN Volume Controller
    Features: Creates pools of managed disks spanning multiple storage subsystems. Includes dynamic data-migration function.
    Benefits: Centralized point of control for volume mgmt. Allows administrators to migrate storage from one device to another w/o taking it offline.
File
  NFS v4
    Features: Provides scalable access to GPFS from outside cluster. GPFS + NFSv4 provides the performance of a SAN File System scalable to a WAN.
    Benefits: Security and access control in a grid environment.
  GPFS (General Parallel File System)
    Features: Cluster based, shared disk, parallel file system. Data and metadata can flow to all nodes and all disks in parallel. Featured in HPC environments. Available on pSeries and Linux clusters.
    Benefits: Not a client-server file system like NFS, DFS, or AFS: no single server bottleneck, no protocol overhead for data transfer.
  SAN File System
    Features: Provides a common file system specifically designed for storage networks. Manages the metadata on the storage network instead of within individual network servers.
    Benefits: Provides high performance access to data and enables sharing across heterogeneous application servers. Allows applications on any server within the SAN to access any file in the network without making changes to the application.
Data
  DB2 UDB
    Features: Relational database that runs on Linux, Unix, Windows, z/OS, and OS/390
    Benefits: Manageability features, Integrated Information capabilities via Web Services, Integrated business intelligence, and more
  Avaki Data Grid 5.0*
    Features: Data catalog, data provisioning, reusable data integrations, caching capabilities.
    Benefits: Provisioning, access, and integration of data from multiple, heterogeneous, distributed sources.
  [product name missing]
    Features: Federated data server, replication server
    Benefits: Query and access distributed data without requiring central repository. Supports movement of data from mixed relational data sources.
Self-managing Systems Deliver:
• Increased Responsiveness: Adapt to dynamically changing environments
• Business Resiliency: Discover, diagnose, and act to prevent disruptions
• Operational Efficiency: Tune resources and balance workloads to maximize use of IT resources
• Secure Information and Resources: Anticipate, detect, identify, and protect against attacks
“Autonomic computing allows companies to operate more efficiently and achieve more from their existing IT environments, enabling increased responsiveness, business continuance and availability.” — Rick Sturm
The Autonomic Element: Sense & Respond
• An autonomic element contains a continuous control loop that monitors activities and takes action
• Autonomic elements learn from past experience to build action plans
• Managed elements are consistently monitored
[Figure: The autonomic computing control loop. Monitor, Analyze, Plan, and Execute share Knowledge; the managed Element is reached through Sensors and Effectors.]
“IBM’s autonomic approach to automation goes well beyond integration to the truly intelligent, responsive and proactive capabilities needed to deliver e-business on demand.”
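The control loop on this slide (Monitor, Analyze, Plan, Execute, sharing Knowledge, acting on an Element through Sensors and Effectors) can be sketched as follows. This is a toy model under stated assumptions: the managed element, its load figure, the 0.8 threshold, and the single "add_worker" action are all invented for the illustration and do not come from any IBM toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedElement:
    """Hypothetical managed element exposing one sensor and one effector."""
    load: float = 0.95   # fraction of capacity in use (assumed metric)
    workers: int = 2

    def sense(self) -> dict:
        return {"load": self.load, "workers": self.workers}

    def effect(self, action: str) -> None:
        if action == "add_worker":
            # Spreading the same work over one more worker lowers the load.
            self.load = self.load * self.workers / (self.workers + 1)
            self.workers += 1


@dataclass
class AutonomicManager:
    element: ManagedElement
    # Knowledge: a policy threshold plus the history of past readings.
    knowledge: dict = field(
        default_factory=lambda: {"max_load": 0.8, "history": []}
    )

    def monitor(self) -> dict:
        reading = self.element.sense()
        self.knowledge["history"].append(reading)
        return reading

    def analyze(self, reading: dict) -> bool:
        return reading["load"] > self.knowledge["max_load"]

    def plan(self, overloaded: bool) -> list[str]:
        return ["add_worker"] if overloaded else []

    def execute(self, actions: list[str]) -> None:
        for action in actions:
            self.element.effect(action)

    def run_once(self) -> list[str]:
        """One pass around the loop: monitor, analyze, plan, execute."""
        actions = self.plan(self.analyze(self.monitor()))
        self.execute(actions)
        return actions


if __name__ == "__main__":
    manager = AutonomicManager(ManagedElement())
    for _ in range(3):
        print(manager.run_once(), manager.element.sense())
```

A real autonomic manager would run this loop continuously and use the recorded history to refine its plans; the sketch records the history but does not learn from it.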
Levels of Automation
• Level 1, Basic: Manual analysis and problem solving
• Level 2, Managed: Centralized tools, manual actions
• Level 3, Predictive: Cross-resource correlation and guidance
• Level 4, Adaptive: System monitors, correlates and takes action
• Level 5, Autonomic: Dynamic business policy based management
Evolution not revolution
“Autonomic computing is a vision that will take several years to realize, but with the model that IBM has outlined, there are benefits attainable at every step, which pay you back... fairly quickly for the investments you make.”
Service Oriented Architecture
Change of Paradigm at the core of Grid Computing
• Services “encapsulate” heterogeneous resources
• Services provide a compose-able, orchestrable, extensible base
• Common Resource Model (CRM) for abstractions key to manageability of resources
Simple Rules:
• Any function is implemented once and once only as a Service
• Services can be runtime or deployment-time re-used
• Service providers and requesters are loosely bound:
  • Each service is defined by an implementation independent interface.
  • Services are defined in terms of common business function and data models.
  • Communication protocols that emphasize interoperability and location transparency are used to mediate service interactions
• Service “contract” can come with a QoS “clause” (SLA)
Anatomy of a Service Interface
Interface by contract
• An explicit interface definition or contract is used to bind a service requestor and a service provider
• Specifies explicitly only the mutual behaviour; specifies nothing about the implementation of the requestor or the provider
• Allows either to change implementation or identity freely
Interface granularity
• Based on Service Type
• Examples:
  • Business Process Services
  • Business Transaction Services
  • Business Function Services
  • Technical Function Services
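“Interface by contract” can be illustrated with a short Python sketch. The contract, the two providers, and the requestor below are all hypothetical names invented for the example; the point is only that the requestor depends on the contract and cannot tell which provider it was given.

```python
from typing import Protocol


class QuoteService(Protocol):
    """The contract: only the mutual behaviour is specified."""

    def quote(self, symbol: str) -> float: ...


class InMemoryQuotes:
    """One provider: looks prices up in a dictionary."""

    def __init__(self, prices: dict[str, float]) -> None:
        self._prices = prices

    def quote(self, symbol: str) -> float:
        return self._prices[symbol]


class MarkupQuotes:
    """A different provider: wraps another service and adds a markup."""

    def __init__(self, inner: QuoteService, markup: float) -> None:
        self._inner = inner
        self._markup = markup

    def quote(self, symbol: str) -> float:
        return round(self._inner.quote(symbol) * (1 + self._markup), 2)


def portfolio_value(service: QuoteService, holdings: dict[str, int]) -> float:
    """Requestor: written against the contract, not against a provider."""
    return sum(service.quote(symbol) * count for symbol, count in holdings.items())


if __name__ == "__main__":
    base = InMemoryQuotes({"ABC": 10.0, "XYZ": 2.5})
    holdings = {"ABC": 2, "XYZ": 4}
    print(portfolio_value(base, holdings))
    print(portfolio_value(MarkupQuotes(base, 0.10), holdings))
```

Swapping one provider for the other requires no change to the requestor, which is the freedom the contract is meant to give both sides.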
Refactoring: Things to Deal With
• Many Existing Applications are Monolithic or Tightly Coupled
• Need to Re-Factor Applications
Some things to worry about are:
  • Distributed threads
  • Data locking
  • Latency
Re-Hosting Applications
• Exploit Meta-OS services
• Achieve platform independence
• Re-Factor for distributed parallel execution
Need for Re-Hosted Middleware
• Ability to Exploit Grid computing services, e.g. Distributed Provisioning
• Manage (and exploit) Quality of Service across the Grid
Challenge: Move to and Exploit Services Oriented Architecture
Can Your Application Benefit from Grid Computing?
How do you know if your application can benefit from Grid computing? Ask these questions:
Q. Is the application computationally intensive?
Q. Does it serve a distributed or collaborative community?
Q. Can the tasks or jobs the application performs run in parallel?
Q. Does the application do pattern matching?
Q. Does it have a reasonable network bandwidth profile?
A. If the answer to any or all of these is yes, then Grid-enablement is feasible.
Q. What is the application processing type (e.g., serial or batch)?
A. Batch is currently more amenable to Grid enablement.
Q. Do the operations within the task have time and/or sequencing dependencies?
A. The fewer dependencies, the better.
Q. What are the bottlenecks in the existing use of the application (e.g., single processor performance, scalability, memory, data output volume, pre/post processing)?
A. Grid can potentially address these bottlenecks.
Six Strategies for Grid Application Enablement
Strategy 1: Batch Anywhere
• Only the grid (not the application, the client, the user, or anything else) decides which node to use for the job
• The machine submitting the job might not be a node in the grid
• Example application: a query to determine whether a given number, x, is a prime number. More than one node in the grid can submit the same query. The grid returns the correct results to the submitter.
Strategy 2: Independent Concurrent Batch
• Multiple independent instances of the same application run concurrently and independently without interference.
• Independent jobs are common. For example, Job X for Account A can run concurrently with Job X for Account B. Databases and other resources don't have hot spots or deadlocks.
Strategy 3: Parallel Batch
• Take each user's batch work, subdivide it, disperse it out to multiple nodes, collect it, and then aggregate the results.
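The Parallel Batch steps (subdivide, disperse, collect, aggregate) can be sketched on a single machine, reusing the prime-number query from Strategy 1 as the workload. The assumptions are significant: a local thread pool stands in for grid nodes, there is no grid middleware or scheduler involved, and the function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(x: int) -> bool:
    """Trial division; adequate for a small illustration."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def count_primes(chunk: range) -> int:
    """One unit of subdivided work, independent of every other chunk."""
    return sum(1 for x in chunk if is_prime(x))


def subdivide(start: int, stop: int, parts: int) -> list[range]:
    """Split [start, stop) into at most `parts` contiguous chunks."""
    step = max(1, (stop - start + parts - 1) // parts)
    return [range(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


def parallel_batch(start: int, stop: int, nodes: int = 4) -> int:
    chunks = subdivide(start, stop, nodes)                 # subdivide
    with ThreadPoolExecutor(max_workers=nodes) as pool:    # disperse
        partials = list(pool.map(count_primes, chunks))    # collect
    return sum(partials)                                   # aggregate


if __name__ == "__main__":
    print(parallel_batch(0, 100))
```

The approach works here because each chunk has no dependency on any other, so the partial counts can be computed in any order and simply summed.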
Six Strategies for Grid Application Enablement
Strategies 4, 5, & 6 use services on the grid in order to get jobs done.
Strategy 4: Service
• Focus on the transition from a batch to a service-oriented architecture
• A follow-on to Independent Concurrent Batch
• It is not assumed that each client subdivides its work and spreads it over multiple service instances
Strategy 5: Parallel Services
• Service with the subdivided work model of Parallel Batch
• Provides multiple service instances
• Permits these instances to be invoked in parallel on the client's behalf
Strategy 6: Tightly Coupled Parallel Programs
• The domain of specialized applications in engineering, physics, and biological modeling, such as finite state analysis
• Provides intense communications and synchronization between client and services and among services
Three Stages for Implementation
Run
• Strategies 1 and 2, and the simplest form of Strategy 3, focus on the ability of an application to run in a grid.
Adapt
• The more complex form of Strategy 3 as well as Strategies 4 and 5 significantly adapt the function and value of the business application by enabling it to use a grid without requiring many changes that are specific to grid middleware. The same application could be structured to run in a non-grid environment.
Exploit
• Applications at Strategy 6 exploit the grid or cluster infrastructure for their operation because they were written from the start with a grid in mind. Strategy 6 applications cannot finish in a timely and successful manner without running in a grid.
Why Are Customers Implementing Grid Computing Solutions?
Accelerate Business Processes
• Grids provide the ability to shorten application run-times without upgrading existing servers (e.g., Charles Schwab, MassMutual, RBC Insurance, Nippon Life Insurance, Royal Dutch Shell, EADS)
Ability to run new High Performance Computing (HPC) applications
• Grid computing provides the opportunity to run new applications due to the cost effective grid virtual computing environment (e.g., AIST, UMass, FNMOC, TeraGrid)
Data Sharing & Collaboration
• Grid architecture provides the ability to store, share and analyze large volumes of data (e.g., eDiamond, NDMA, WestGrid, CERN, European DataGrid, Kansai Electric)
Accelerate Research & Development
• Grids provide Life Science companies the ability to speed up drug research & development (e.g., Smallpox Grid, Aventis, Novartis)
I/T Optimization & Resiliency
• Virtualization of Servers & Storage
The TeraGrid – Extensible Terascale Facility
National Science Foundation Grid Computing project ($90M):
• Connects nine (9) major supercomputing sites: NCSA, SDSC, Argonne NL, CalTech, PSC, UTexas, IndianaU, PurdueU, Oak Ridge NL
• 40 gigabit network backbone connecting the sites
• 20 Teraflops of computing power
• 1 Petabyte of disk accessible data storage
• Accessible to thousands of scientists working on advanced research
Applications include:
• Real Time Brain Mapping
• Earthquake Modeling
• Molecular Dynamics simulation
• Mcell – Monte Carlo simulation of cellular micro physiology
• Encyclopedia of Life – Protein catalog
IBM project team and solution includes:
• IBM High Performance Computing (HPC) expertise
• IBM GPFS expertise
• IBM Linux Clusters – Itanium2 processors
• IBM Power4 processors – p690 Regattas
• IBM Grid Computing & Linux consulting services
Butterfly.net
The Butterfly Grid:
• an end-to-end solution designed to support up to one million simultaneous users
• based on IBM WebSphere Application Server, DB2 and the Globus Toolkit
• running on IBM eServer xSeries clusters at an IBM e-business Hosting Center
Japan AIST (National Institute of Advanced Industrial Science & Technology)
Challenge
• AIST, Japan’s largest national research organization, needed to provide an on-demand computing infrastructure which dynamically adapts to support various research requirements of its collaborators focusing on grid computing, life sciences, and nanotechnology.
Solution
• Linux Cluster
  • 2116 CPU AMD Opteron Cluster
  • 520 CPU Intel Madison Cluster
• Globus Toolkit 3.0 (OGSA)
• One of the world’s most powerful Linux-based supercomputers
• More than 11 trillion calculations per second
• More powerful than the current third most powerful supercomputer in the world
[Figure labels: Advanced Computing Center; Grid Technology; LAN; Internet; Collaborations; Government; Academia; Corporations; Other Research Institutes; Life Science; Nanotechnology]
Summary
• Grid Computing is still evolving
• It is built on existing and new open computing standards
• It exploits existing components and technologies
• It can be, and is being, used today
• There are many ways and places to exploit Grid Computing
• Make decisions based on “business” needs
• IBM is leading with both products and services for Grid Computing
References (Articles and Publications)
M. Mitchell Waldrop, Grid Computing, MIT Technology Review, May 2002, pp. 30-37
I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid, http://www.globus.org/research/papers/anatomy.pdf
I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, http://www.globus.org/research/papers/ogsa.pdf
I. Foster, C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, Calif. (1999)
IBM Redbook: Introduction to Grid Computing with Globus, http://www.ibm.com/redbooks/