The Anatomy of the Grid: Enabling Scalable Virtual Organizations
Ian Foster
Mathematics and Computer Science Division
Argonne National Laboratory
and
Department of Computer Science
The University of Chicago
http://www.mcs.anl.gov/~foster
[email protected] ARGONNE CHICAGO
Grids are “hot” … but what are they really about?
[Slide graphic: a cloud of Grid themes and projects: computational, data, information, access, and knowledge grids; DISCOM, SinRG, APGrid, TeraGrid.]
[email protected] ARGONNE CHICAGO
Issues I Propose to Address
– Problem statement
– Architecture
– Globus Toolkit
– Futures
[email protected] ARGONNE CHICAGO
The Grid Problem
Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
[email protected] ARGONNE CHICAGO
Elements of the Problem
Resource sharing
– Computers, storage, sensors, networks, …
– Sharing always conditional: issues of trust, policy, negotiation, payment, …
Coordinated problem solving
– Beyond client-server: distributed data analysis, computation, collaboration, …
Dynamic, multi-institutional virtual organizations
– Community overlays on classic organizational structures
– Large or small, static or dynamic
[email protected] ARGONNE CHICAGO
Grid Communities & Applications: Data Grids for High Energy Physics
[Slide diagram, image courtesy Harvey Newman, Caltech: a tiered data distribution hierarchy for LHC physics. The online system streams ~PBytes/sec from the detector; ~100 MBytes/sec links feed the CERN Computer Centre (Tier 0) and its offline processor farm (~20 TIPS). Tier 1 regional centres (FermiLab ~4 TIPS; France, Italy, and Germany regional centres) connect at ~622 Mbits/sec (or air freight, deprecated); Tier 2 centres (~1 TIPS each, e.g., Caltech) also connect at ~622 Mbits/sec; institutes (~0.25 TIPS) hold physics data caches; physicist workstations (Tier 4) connect at ~1 MBytes/sec. 1 TIPS is approximately 25,000 SpecInt95 equivalents.]
There is a “bunch crossing” every 25 nsec, and there are 100 “triggers” per second; each triggered event is ~1 MByte in size. Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.
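As a quick sanity check, the stated trigger rate and event size reproduce the ~100 MBytes/sec link rate in the diagram; a minimal sketch using only the figures above:

```python
# Back-of-envelope check of the slide's stated rates.
triggers_per_sec = 100   # "100 triggers per second"
event_size_mbytes = 1.0  # "each triggered event is ~1 MByte in size"

archival_rate = triggers_per_sec * event_size_mbytes
print(f"~{archival_rate:.0f} MBytes/sec")  # consistent with the diagram's links
```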
[email protected] ARGONNE CHICAGO
Grid Communities and Applications: Network for Earthquake Engineering Simulation
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
On-demand access to experiments, data streams, computing, archives, collaboration
NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
[email protected] ARGONNE CHICAGO
Grid Communities and Applications: Mathematicians Solve NUG30
Community = an informal collaboration of mathematicians and computer scientists
Condor-G delivers 3.46E8 CPU-seconds in 7 days (peak 1009 processors) at 8 sites in the U.S. and Italy
Solves the NUG30 quadratic assignment problem; the assignment found (objective sketched below):
14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
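For readers unfamiliar with the problem class, a quadratic assignment instance pairs a flow matrix with a distance matrix under a facility-to-location permutation; a minimal sketch of the objective on a toy instance (NUG30 itself is 30×30, and its matrices are not reproduced here):

```python
import numpy as np

def qap_cost(perm, flow, dist):
    """Quadratic assignment cost:
    sum over i, j of flow[i][j] * dist[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

# Toy 4-facility instance; NUG30 uses 30x30 flow/distance matrices.
rng = np.random.default_rng(0)
flow = rng.integers(0, 10, size=(4, 4))
dist = rng.integers(0, 10, size=(4, 4))
print(qap_cost([2, 0, 3, 1], flow, dist))
```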
[email protected] ARGONNE CHICAGO
Grid Communities and Applications: Home Computers Evaluate AIDS Drugs
Community =
– 1000s of home computer users
– Philanthropic computing vendor (Entropia)
– Research group (Scripps)
Common goal = advance AIDS research
Grid Architecture
[email protected] ARGONNE CHICAGO
Why Discuss Architecture?
Descriptive
– Provide a common vocabulary for use when describing Grid systems
Guidance
– Identify key areas in which services are required
Prescriptive
– Define standard “Intergrid” protocols and APIs to facilitate creation of interoperable Grid systems and portable applications
[email protected] ARGONNE CHICAGO
What Sorts of Standards?
Need for interoperability when different groups want to share resources
– E.g., IP lets me talk to your computer, but how do we establish & maintain sharing?
– How do I discover, authenticate, authorize, and describe what I want to do?
Need for shared infrastructure services to avoid repeated development and installation, e.g.
– One port/service for remote access to computing, not one per tool/application
– X.509 enables sharing of Certificate Authorities
[email protected] ARGONNE CHICAGO
So, in Defining Grid Architecture, We Must Address …
Development of Grid protocols & services
– Protocol-mediated access to remote resources
– New services: e.g., resource brokering
– “On the Grid” = speak Intergrid protocols
– Mostly (extensions to) existing protocols
Development of Grid APIs & SDKs
– Facilitate application development by supplying higher-level abstractions
The (hugely successful) model is the Internet
The Grid is not a distributed OS!
[email protected] ARGONNE CHICAGO
The Role of Grid Services (aka Middleware) and Tools
[Slide diagram: Grid services such as remote access, remote monitoring, information services, fault detection, and resource management sit above the network (“net”) and beneath tools such as collaboration tools, data-management tools, and distributed simulation.]
[email protected] ARGONNE CHICAGO
Layered Grid Architecture (By Analogy to Internet Architecture)
Fabric: “Controlling things locally” – access to, and control of, resources
Connectivity: “Talking to things” – communication (Internet protocols) and security
Resource: “Sharing single resources” – negotiating access, controlling use
Collective: “Coordinating multiple resources” – ubiquitous infrastructure services, application-specific distributed services
Application
[Slide diagram: the Grid layers drawn alongside the Internet Protocol Architecture: Application, Transport, Internet, Link.]
[email protected] ARGONNE CHICAGO
Protocols, Services, and Interfaces Occur at Each Level
[Slide diagram: Applications and Languages/Frameworks at the top; Collective Service APIs and SDKs over Collective Services and Collective Service Protocols; Resource APIs and SDKs over Resource Services and Resource Service Protocols; Connectivity APIs over Connectivity Protocols; Local Access APIs and Protocols at the Fabric Layer.]
[email protected] ARGONNE CHICAGO
Where Are We With Architecture?
No “official” standards exist
– Nor is it clear what this would mean
But:
– The Globus Toolkit has emerged as the de facto standard for several important Connectivity, Resource, and Collective protocols
– GGF has an architecture working group
– Technical specifications are being developed for architecture elements: e.g., security, data, resource management, information
The Globus Toolkit
[email protected] ARGONNE CHICAGO
Grid Services Architecture (1): Fabric Layer
Just what you would expect: the diverse mix of resources that may be shared
– Individual computers, Condor pools, file systems, archives, metadata catalogs, networks, sensors, etc.
Few constraints on low-level technology: connectivity and resource-level protocols form the “neck in the hourglass”
The Globus Toolkit provides a few selected components (e.g., bandwidth broker)
[email protected] ARGONNE CHICAGO
Grid Services Architecture (2): Connectivity Layer Protocols & Services
Communication
– Internet protocols: IP, DNS, routing, etc.
Security: Grid Security Infrastructure (GSI)
– Uniform authentication & authorization mechanisms in a multi-institutional setting
– Single sign-on, delegation, identity mapping
– Public-key technology, SSL, X.509, GSS-API
– Supporting infrastructure: Certificate Authorities, key management, etc.
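As a concrete illustration of single sign-on, a minimal sketch using the Globus Toolkit’s GSI command-line clients (invoked from Python here only for self-containment; the 12-hour lifetime is an arbitrary example):

```python
import subprocess

# grid-proxy-init creates a short-lived proxy credential from the
# user's long-term X.509 certificate (one passphrase prompt); later
# Grid operations authenticate with the proxy, giving single sign-on
# and delegation without further prompts.
subprocess.run(["grid-proxy-init", "-hours", "12"], check=True)

# grid-proxy-info inspects the resulting proxy (subject, lifetime).
subprocess.run(["grid-proxy-info"], check=True)
```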
[email protected] ARGONNE CHICAGO
[Slide diagram: GSI in action – single sign-on via a “grid-id”, with assignment of credentials to “user proxies”. The user’s Globus credential creates a user proxy; GRAM servers at each site perform mutual user-resource authentication through GSI and map the grid identity to local IDs and mechanisms: Kerberos tickets at Site 1, public-key certificates at Site 2. Processes started at the two sites communicate over authenticated interprocess channels, and authorization is enforced at each resource.]
[email protected] ARGONNE CHICAGO
GSI Futures
Scalability in numbers of users & resources
– Credential management
– Online credential repositories (“MyProxy”)
– Account management
Authorization
– Policy languages
– Community authorization
Protection against compromised resources
– Restricted delegation, smartcards
[email protected] ARGONNE CHICAGO
GSI Futures: Community Authorization
[Slide diagram: the Community Authorization Service (CAS) maintains user/group membership, resource/collective membership, and collective policy information.]
1. The user sends a CAS request naming resources and operations.
2. CAS checks whether the collective policy authorizes this request for this user, then replies with a capability and resource CA info.
3. The user sends the resource request, authenticated with the capability.
4. The resource checks, against local policy information, whether the request is authorized for the CAS and by the capability, then returns the resource reply.
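A toy sketch of the decision in step 2, with entirely hypothetical policy entries, to make the capability-issuing check concrete:

```python
# Hypothetical collective policy: (user, community) -> allowed
# (resource, operation) pairs. A real CAS stores memberships and
# policy; this only illustrates the step-2 check.
policy = {
    ("alice", "hep-vo"): {("gsiftp://data.example.edu/run42", "read")},
}

def cas_authorizes(user, community, resource, operation):
    """Does the collective policy authorize this request for this user?"""
    return (resource, operation) in policy.get((user, community), set())

print(cas_authorizes("alice", "hep-vo",
                     "gsiftp://data.example.edu/run42", "read"))  # True
```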
[email protected] ARGONNE CHICAGO
Grid Services Architecture (3): Resource Layer Protocols & Services
Resource management: GRAM
– Remote allocation, reservation, monitoring, and control of [compute] resources
Data access: GridFTP
– High-performance data access & transport
Information: MDS (GRRP, GRIP)
– Access to structure & state information
Others emerging: catalog access, code repository access, accounting, …
All integrated with GSI
[email protected] ARGONNE CHICAGO
GRAM Resource Management Protocol
Grid Resource Allocation & Management
– Allocation, monitoring, and control of computations
Simple HTTP-based RPC
– Job request returns a “job contact”: an opaque string that can be passed between clients for access to the job
– Job cancel, job status, job signal
– Event notification (callbacks) for state changes: pending, active, done, failed, suspended
Servers for most schedulers; C and Java APIs
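For illustration, a minimal sketch of a job request in the toolkit’s Resource Specification Language (RSL), submitted via the globusrun client; the gatekeeper contact and paths are hypothetical examples:

```python
# An RSL job request (GT2 syntax); executable, count, and output
# file are example values.
rsl = "&(executable=/bin/hostname)(count=4)(stdout=hostname.out)"

# Hypothetical gatekeeper contact for a PBS-managed cluster.
contact = "gatekeeper.example.edu:2119/jobmanager-pbs"

# The globusrun client submits the request over GRAM's HTTP-based
# RPC; in batch mode it prints the opaque "job contact" used for
# later status, cancel, and signal operations.
print(f"globusrun -b -r {contact} '{rsl}'")
```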
[email protected] ARGONNE CHICAGO
Resource Management Futures
GRAM-2 protocol (ETA late 2001)
– Advance reservations & multiple resource types
– Recoverable requests, timeouts, etc.
– Use of SOAP (RPC using HTTP + XML)
– Policy evaluation points for restricted proxies
[email protected] ARGONNE CHICAGO
Data Access & Transfer
GridFTP: extended version of the popular FTP protocol for Grid data access and transfer
Secure, efficient, reliable, flexible, extensible, parallel, concurrent, e.g.:
– Third-party data transfers, partial file transfers
– Parallelism, striping (e.g., on PVFS)
– Reliable, recoverable data transfers
Reference implementations
– Existing clients and servers extended: wuftpd, ncftp
– Flexible, extensible libraries
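As a client-side example, a minimal sketch of a parallel, third-party transfer using the toolkit’s globus-url-copy client (hosts and paths hypothetical; -p sets the number of parallel TCP streams):

```python
import subprocess

# Third-party transfer: both endpoints are GridFTP (gsiftp://)
# servers, so data flows directly between them rather than through
# the client. "-p 4" requests four parallel streams.
subprocess.run([
    "globus-url-copy", "-p", "4",
    "gsiftp://source.example.edu/data/run42.dat",
    "gsiftp://dest.example.edu/cache/run42.dat",
], check=True)
```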
[email protected] ARGONNE CHICAGO
Grid Services Architecture (4): Collective Layer Protocols & Services
Index servers, aka metadirectory services
– Custom views on dynamic resource collections assembled by a community
Resource brokers (e.g., Condor Matchmaker)
– Resource discovery and allocation
Replica management and replica selection
– Optimize aggregate data access performance
Co-reservation and co-allocation services
– End-to-end performance
Etc.
[email protected] ARGONNE CHICAGO
The Grid Information Problem
Large numbers of distributed “sensors” with different properties
Need for different “views” of this information, depending on community membership, security constraints, intended purpose, sensor type
[email protected] ARGONNE CHICAGO
The Globus Toolkit Solution: MDS-2
Registration & enquiry protocols, information models, query languages
– Provides standard interfaces to sensors
– Supports different “directory” structures enabling various discovery/access strategies (a query is sketched below)
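Since MDS-2’s enquiry protocol (GRIP) is LDAP-based, a minimal query sketch using the python-ldap library; the server host is hypothetical, while port 2135 and the “mds-vo-name=local, o=grid” base are MDS conventions:

```python
import ldap  # python-ldap

# Connect to a (hypothetical) MDS-2 information server.
conn = ldap.initialize("ldap://giis.example.edu:2135")

# Enquire about all entries under the standard MDS suffix; a real
# client would use a narrower filter, e.g., for compute resources.
entries = conn.search_s(
    "mds-vo-name=local, o=grid",
    ldap.SCOPE_SUBTREE,
    "(objectclass=*)",
)
for dn, attrs in entries:
    print(dn)
```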
[email protected] ARGONNE CHICAGO
Resource Management Architecture
[Slide diagram: an application expresses its requirements in RSL. A broker performs RSL specialization, exchanging queries and info with the Information Service, to produce ground RSL. A co-allocator decomposes the ground RSL into simple ground RSL requests, one per GRAM server; each GRAM fronts a local resource manager such as LSF, Condor, or NQE. Broker examples: ASCI DISCOM, Condor-G, Nimrod-G, Poznan*, U. Lecce; co-allocation examples: DUROC, MPICH-G2.]
* See talk by Jarek Nabrzyski et al.
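To make the broker’s RSL-specialization step concrete, a toy sketch (the resource contacts and the particular split are entirely hypothetical):

```python
# Abstract RSL from the application: "what", not "where".
abstract_rsl = "&(executable=/bin/render)(count=64)"

# After querying the Information Service, a broker grounds the
# request to concrete resources; the co-allocator then issues one
# simple ground RSL request per GRAM server.
ground_requests = [
    ("gatekeeper.siteA.example.edu:2119/jobmanager-lsf",
     "&(executable=/bin/render)(count=32)"),
    ("gatekeeper.siteB.example.edu:2119/jobmanager-condor",
     "&(executable=/bin/render)(count=32)"),
]
for contact, rsl in ground_requests:
    print(contact, "<-", rsl)
```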
[email protected] ARGONNE CHICAGO
Data Grid Architecture (see talk by Sudharshan Vazhkudai)
[Slide diagram: the application presents an attribute specification to a Metadata Catalog, which resolves it to a logical collection and logical file name. The Replica Catalog maps these to multiple locations (replica locations 1, 2, and 3, backed by tape libraries, disk caches, and disk arrays). A Replica Selection service, using performance information and predictions from NWS and MDS, identifies the selected replica; GridFTP commands then transfer the data.]
+ “Virtual data”: transparency with respect to location and materialization (www.griphyn.org)
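A toy sketch of the replica-selection step, assuming NWS-style bandwidth predictions (locations and numbers are hypothetical):

```python
# Predicted transfer bandwidth (MB/s) to each replica location, as
# a replica selection service might obtain from NWS via MDS.
predictions = {
    "gsiftp://replica1.example.edu/data/run42.dat": 3.1,
    "gsiftp://replica2.example.edu/data/run42.dat": 11.8,
    "gsiftp://replica3.example.edu/data/run42.dat": 7.4,
}

# Pick the replica with the best predicted performance; GridFTP
# commands would then fetch from the selected location.
selected = max(predictions, key=predictions.get)
print("selected replica:", selected)
```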
Grid Futures
[email protected] ARGONNE CHICAGO
Large Grid Projects Are in Place
– DOE ASCI DISCOM
– DOE Particle Physics Data Grid
– DOE Earth Systems Grid
– DOE Science Grid
– DOE Fusion Collaboratory
– European Data Grid
– Egrid (see talk by G. Allen et al.)
– NASA Information Power Grid
– NSF National Technology Grid
– NSF Network for Earthquake Eng. Simulation
– NSF Grid Application Development Software
– NSF Grid Physics Network
[email protected] ARGONNE CHICAGO
Problem Evolution
Past-present: O(10^2) high-end systems; Mb/s networks; centralized (or entirely local) control
– I-WAY (1995): 17 sites, week-long; 155 Mb/s
– GUSTO (1998): 80 sites, long-term experiment
– NASA IPG, NSF NTG: O(10) sites, production
Present: O(10^4)-O(10^6) data systems and computers; Gb/s networks; scaling, decentralized control
– Scalable resource discovery; restricted delegation; community policy
– GriPhyN Data Grid: 100s of sites, O(10^4) computers; complex policies
Future: O(10^6)-O(10^9) data sources, sensors, computers; Tb/s networks; highly flexible policy and control
[email protected] ARGONNE CHICAGO
The Future: All Software is Network-Centric
We don’t build or buy “computers” anymore; we borrow or lease required resources
– When I walk into a room, need to solve a problem, need to communicate
A “computer” is a dynamically, often collaboratively constructed collection of processors, data sources, sensors, and networks
– Similar observations apply for software
[email protected] ARGONNE CHICAGO
And Thus …
Reduced barriers to access mean that we do much more computing, and more interesting computing, than today => Many more components (& services); massive parallelism
All resources are owned by others => Sharing (for fun or profit) is fundamental; trust, policy, negotiation, payment
All computing is performed on unfamiliar systems => Dynamic behaviors, discovery, adaptivity, failure
[email protected] ARGONNE CHICAGO
Summary
The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
Grid architecture: Emphasize protocol and service definition to enable interoperability and resource sharing
Globus Toolkit as a source of protocol and API definitions, reference implementations
For more info: www.globus.org, www.griphyn.org, www.gridforum.org