Grid Computing: like herding cats?
Stephen Jarvis
High Performance Systems Group, University of Warwick, UK
Mar 28, 2015
Sessions on Grid
• What are we going to cover today?
  – A brief history
  – Why we are doing it
  – Applications
  – Users
  – Challenges
  – Middleware
• What are you going to cover next week?
  – A technical talk on the specifics of our work
  – Including application to e-Business and e-Science
An Overused Analogy
• Electrical Power Grid
• Computing power might somehow be like electrical power:
  – plug in
  – switch on
  – have access to unlimited power
• We don't know who supplies the power, or where it comes from
  – just pick up the bill at the end of the month
• Is this the future of computing?
Sounds great - but how long?
• Is the computing infrastructure available?
• Computing power:
  – 1986: Cray X-MP ($8M)
  – 2000: Nintendo-64 ($149)
  – 2003: Earth Simulator (NEC), ASCI Q (LANL)
  – 2005: Blue Gene/L (IBM), 360 Teraflops
  – Look at www.top500.org for current supercomputers!
Storage & Network
• Storage capabilities:
  – 1986: Local data stores (MB)
  – 2002: Goddard Earth Observation System, 29 TB
• Network capabilities:
  – 1986: NSFNET 56 Kb/s backbone
  – 1990s: Upgraded to 45 Mb/s (gave us the Internet)
  – 2000s: 40 Gb/s
Many Potential Resources
[Diagram: many resource types feeding into the GRID]
• Terabyte databases
• Space telescopes
• Millions of PCs at 30% utilisation
• Supercomputing centres
• 10k PS/2s per week
• 50M mobile phones
Some History: NASA's Information Power Grid
• The vision (mid '90s):
  – to promote a revolution in how NASA addresses large-scale science and engineering
  – by providing a persistent HPC infrastructure
• Computing and data management services:
  – on-demand
  – locate and co-schedule multi-Center resources
  – address large-scale and/or widely distributed problems
• Ancillary services:
  – workflow management and coordination
  – security, charging …
[Diagram: coupled aircraft sub-system models]
• Airframe models: lift capabilities, drag capabilities, responsiveness
• Engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption
• Landing gear models: braking performance, steering capabilities, traction, dampening capabilities
• Human models: crew capabilities (accuracy, perception, stamina, reaction times, SOPs)
• Stabilizer models
Whole-system simulations are produced by coupling all of the sub-system simulations.
[Map: the NASA IPG testbed – sites include SDSC, LaRC, GSFC, MSFC, KSC, JSC, NCSA, Boeing, JPL, NGIX, EDC, NREN, CMU and GRC; resources include a 300-node Condor pool, the NTON-II/SuperNet network, MCAT/SRB, DMF, MDS information services, and O2000 machines and clusters]
National Air Space Simulation Environment
[Diagram: the Virtual National Air Space (VNAS) couples simulation drivers and sub-system models across centres]
• Simulation drivers: FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data
• 22,000 commercial US flights a day drive:
  – 50,000 engine runs (GRC engine models)
  – 22,000 airframe impact runs (LaRC airframe models)
  – 132,000 landing/take-off gear runs (landing gear models)
  – 48,000 human crew runs (human models)
  – 66,000 stabilizer runs (stabilizer models)
  – 44,000 wing runs (ARC wing models)
• Being pulled together under the NASA AvSP Aviation ExtraNet (AEN)
What is a Computational Grid?
• A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities.
• The capabilities need not be high end.
• The infrastructure needs to be relatively transparent.
Selected Grid Projects
• US Based
  – NASA Information Power Grid
  – DARPA CoABS Grid
  – DOE Science Grid
  – NSF National Virtual Observatory
  – NSF GriPhyN
  – DOE Particle Physics Data Grid
  – NSF DTF TeraGrid
  – DOE ASCI DISCOM Grid
  – DOE Earth Systems Grid, etc.
• EU Based
  – DataGrid (CERN, …)
  – EuroGrid (Unicore)
  – Damien (Metacomputing)
  – DataTag (TransAtlantic Testbed, …)
  – Astrophysical Virtual Observatory
  – GRIP (Globus/Unicore)
  – GRIA (Industrial applications)
  – GridLab (Cactus Toolkit, …)
  – CrossGrid (Infrastructure Components)
  – EGSO (Solar Physics)
• Other National Projects
  – UK – e-Science Grid
  – Netherlands – VLAM-G, DutchGrid
  – Germany – UNICORE Grid, D-Grid
  – France – Etoile Grid
  – Italy – INFN Grid
  – Eire – Grid-Ireland
  – Scandinavia – NorduGrid
  – Poland – PIONIER Grid
  – Hungary – DemoGrid
  – Japan – JpGrid, ITBL
  – South Korea – N*Grid
  – Australia – Nimrod-G, …
  – Thailand
  – Singapore – AsiaPacific Grid
The Big Spend: two examples
• US TeraGrid
  – $100 Million US Dollars (so far…)
  – 5 supercomputer centres
  – New ultra-fast optical network ≤ 40 Gb/s
  – Grid software and parallel middleware
  – Coordinated virtual organisations
  – Scientific applications and users
• UK e-Science Grid
  – £250 Million (so far…)
  – Regional e-Science centres
  – New infrastructure
  – Middleware development
  – Big science projects
e-Science Grid
[Map: UK regional e-Science centres connected by SuperJANET4 – Edinburgh, Glasgow, Newcastle, Belfast, Lancaster, White Rose, Manchester, DL, Birmingham/Warwick, Cambridge, Oxford, RL, Hinxton, Cardiff, Bristol, London, UCL, Soton]
Who wants Grids and why?
• NASA
  – Aerospace simulations, air traffic control
  – NWS, in-aircraft computing
  – Virtual airspace
  – Free fly, accident prevention
• IBM
  – On-demand computing infrastructure
  – Protect software
  – Support business computing
• Governments
  – Simulation experiments
  – Biodiversity, genomics, military, space science…
Classes of Grid applications

Category | Examples | Characteristics
Distributed supercomputing | DIS, stellar dynamics, chemistry | Very large problems; lots of CPU and memory
High throughput | Chip design, cryptography | Harnessing idle resources
On demand | Medical, weather prediction | Remote resources, time bounded
Data intensive | Physics, sky surveys | Synthesis of new information
Collaborative | Data exploration, virtual environments | Connection between many parties
Classes of Grid

Category | Examples | Characteristics
Data Grid | EU DataGrid | Lots of data sources from one site, processing off site
Compute Grid | Chip design, cryptography | Harnessing and connecting rare resources
Scavenging Grid | SETI | CPU cycle stealing, commodity resources
Enterprise Grid | Banking | Multi-site, but one organisation
Discovery Net Project
[Diagram: scientific information to scientific discovery in real time]
• Real-time integration and dynamic application integration
• Workflow construction and interactive visual analysis
• Uses distributed resources: literature, databases, operational data, images, instrument data
Nucleotide Annotation Workflows
• Download sequence from a reference server
• Save to a Distributed Annotation Server
• Execute distributed annotation workflow
• Data sources include NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISSPROT, GO and KEGG
• Previously 1800 clicks, 500 web accesses, 200 copy/pastes and 3 weeks of work; now 1 workflow and a few seconds of execution
(A hypothetical sketch of such a workflow follows below.)
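To make "workflow construction" concrete, here is a minimal, hypothetical Python sketch of such an annotation workflow. The helper functions, URLs and the accession parameter are illustrative assumptions, not the real Discovery Net interfaces.

    # Hypothetical sketch of the annotation workflow described above.
    # All URLs and helper names are illustrative, not the real Discovery Net API.
    import urllib.parse
    import urllib.request

    def http_get(url):
        # simple HTTP GET helper
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode()

    def http_post(url, payload):
        # simple HTTP POST helper (form-encoded)
        data = urllib.parse.urlencode(payload).encode()
        with urllib.request.urlopen(url, data=data) as resp:
            return resp.read().decode()

    def run_workflow(accession, annotation_services, reference_url, das_url):
        # 1. download the sequence from a reference server
        sequence = http_get(f"{reference_url}/{accession}")
        # 2. execute the distributed annotation step against each service
        annotations = {name: http_get(f"{url}?seq={urllib.parse.quote(sequence)}")
                       for name, url in annotation_services.items()}
        # 3. save the combined result to a Distributed Annotation Server
        http_post(f"{das_url}/{accession}", annotations)
        return annotations

The point is not the HTTP plumbing but that the 1800 manual clicks collapse into one call that a workflow engine can schedule across distributed resources.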
Grand Challenge: Integrating Different Levels of Simulation
• An e-Science challenge – non-trivial; NASA IPG as a possible paradigm
• Need to integrate rigorously if it is to deliver accurate and hence biomedically useful results
• Levels of simulation: molecular, cellular, organism
• Noble (2002) Nature Rev. Mol. Cell. Biol. 3:460; Sansom et al. (2000) Trends Biochem. Sci. 25:368
Classes of Grid users

Class | Purpose | Makes use of | Concerns
End users | Solve problems | Applications | Transparency, performance
Application developers | Develop applications | Programming models, tools | Ease of use, performance
Tool developers | Develop tools & programming models | Grid services | Adaptivity, security
Grid developers | Provide grid services | Existing grid services | Connectivity, security
System administrators | Management of resources | Management tools | Balancing concerns
Grid architecture
• Composed of a hierarchy of sub-systems
• Scalability is vital
• Key elements:
  – End systems: single compute nodes, storage systems, IO devices etc.
  – Clusters: homogeneous networks of workstations; parallel & distributed management
  – Intranet: heterogeneous collections of clusters; geographically distributed
  – Internet: interconnected intranets; no centralised control
End Systems
• State of the art
  – Privileged OS; complete control of resources and services
  – Integrated nature allows high performance
  – Plenty of high-level languages and tools
• Future directions
  – Currently lack features for integration into larger systems
  – OS support for distributed computation
  – Mobile code (sandboxing)
  – Reduction in network overheads
Clusters
• State of the art
  – High-speed LAN, 100s or 1000s of nodes
  – Single administrative domain
  – Programming libraries like MPI (see the sketch below)
  – Inter-process communication, co-scheduling
• Future directions
  – Performance improvements
  – OS support
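For a flavour of cluster programming with an MPI-style library, here is a minimal sketch using the mpi4py Python bindings; the workload (summing a range of integers) is purely illustrative.

    # Each process sums a disjoint slice of 0..999, then the partial sums
    # are combined on rank 0 with an MPI reduction.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id within the communicator
    size = comm.Get_size()   # total number of processes in the job

    local = sum(range(rank, 1000, size))             # this rank's share of the work
    total = comm.reduce(local, op=MPI.SUM, root=0)   # combine partial sums on rank 0

    if rank == 0:
        print(f"total = {total} computed by {size} processes")

Run with, for example, mpiexec -n 4 python partial_sum.py; co-scheduling those four processes on cluster nodes is exactly what a single administrative domain makes straightforward.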
Intranets
• State of the art
  – Grids of many resources, but one administrative domain
  – Management of heterogeneous resources
  – Data sharing (e.g. databases, web services)
  – Supporting software environments, inc. CORBA
  – Load sharing systems such as LSF and Condor
  – Resource discovery
• Future directions
  – Increasing complexity (physical scale etc.)
  – Performance
  – Lack of global knowledge
Internets
• State of the art
  – Geographical distribution, no central control
  – Data sharing is very successful
  – Management is difficult
• Future directions
  – Sharing other computing services (e.g. computation)
  – Identification of resources
  – Transparency
  – Internet services
Basic Grid services
• Authentication
  – Can the users use the system; what jobs can they run?
• Acquiring resources
  – What resources are available?
  – Resource allocation policy; scheduling
• Security
  – Is the data safe? Is the user process safe?
• Accounting
  – Is the service free, or should the user pay?
(A small sketch of how these services fit together follows below.)
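A hedged sketch of how these services might fit together around a single job submission; the classes, policy checks and charging rule are hypothetical placeholders rather than any particular grid toolkit.

    # Hypothetical sketch: authentication, resource acquisition and accounting
    # around one job submission. Everything here is illustrative.
    from dataclasses import dataclass

    @dataclass
    class User:
        name: str
        allowed_job_types: set      # authentication/authorisation policy
        balance: float              # accounting: remaining credit

    @dataclass
    class Resource:
        name: str
        free_cpus: int
        price_per_cpu_hour: float

    def submit(user, job_type, cpus, hours, resources):
        # Authentication: may this user run this kind of job at all?
        if job_type not in user.allowed_job_types:
            raise PermissionError(f"{user.name} may not run {job_type} jobs")
        # Acquiring resources: find something with enough free capacity.
        for r in resources:
            if r.free_cpus >= cpus:
                cost = cpus * hours * r.price_per_cpu_hour
                # Accounting: is the service free, or should the user pay?
                if cost > user.balance:
                    raise RuntimeError("insufficient credit for this job")
                r.free_cpus -= cpus
                user.balance -= cost
                return r.name
        raise RuntimeError("no resource currently available")

Security of the data and of the user's process sits underneath all of this and is not shown here.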
Research Challenges (#1)
• Grid computing is a relatively new area
  – There are many challenges
• Nature of applications
  – New methods of scientific and business computing
• Programming models and tools
  – Rethinking programming, algorithms, abstraction etc.
  – Use of software components/services
• System architecture
  – Minimal demands should be placed on contributing sites
  – Scalability
  – Evolution of future systems and services
Research Challenges (#2)
• Problem solving methods
  – Latency- and fault-tolerant strategies
  – Highly concurrent and speculative execution
• Resource management
  – How are the resources shared?
  – How do we achieve end-to-end performance?
  – Need to specify QoS requirements
  – Then need to translate these to the resource level
  – Contention?
Research Challenges (#3)
• Security
  – How do we safely share data, resources, tasks?
  – How is code transferred?
  – How does licensing work?
• Instrumentation and performance
  – How do we maintain good performance?
  – How can load-balancing be controlled?
  – How do we measure grid performance?
• Networking and infrastructure
  – Significant impact on networking
  – Need to combine high and low bandwidth
Development of middleware
• Many people see middleware as the vital ingredient
• Globus toolkit
  – Component services for security, resource location, resource management, information services
• OGSA
  – Open Grid Services Architecture
  – Drawing on web services technology
• GGF (Global Grid Forum)
  – International organisation driving Grid development
  – Contains partners such as Microsoft, IBM, NASA etc.
Middleware Conceptual Layers
[Diagram: three conceptual layers]
• Applications: workload generation, visualization…
• Middleware: discovery, mapping, scheduling, security, accounting…
• Resources: computing, storage, instrumentation…
Requirements include:
• Offers up useful resources
• Accessible and useable resources
• Stable and adequately supported
• Single-user 'laptop feel'
Middleware has much of this responsibility.
Key Interface between Applications & Machines
Demanding management issues:
• Users are (currently) likely to be sophisticated
  – but probably not computer 'techies'
• Need to hide detail & 'obscene' complexity
• Provide the vision of access to full resources
• Provide a contract for level(s) of support (SLAs)
Gate Keeper / Manager
• Acts as resource manager
• Responsible for mapping applications to resources
• Scheduling tasks
• Ensuring service level agreements (SLAs)
• Distributed / dynamic
Middleware Projects
• Globus, Argonne National Labs, USA
• AppLeS, UC San Diego, USA
• Open Grid Services Architecture (OGSA)
• ICENI, Imperial, UK
• Nimrod, Melbourne, Australia
• Many others... including us!!
HPSG's approach:
• Determine what resources are required (advertise)
• Determine what resources are available (discovery)
• Map requirements to available resources (scheduling)
• Maintain a contract of performance (service level agreement)
• Performance drives the middleware decisions – PACE (see the sketch below)
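As a rough illustration of performance-driven scheduling, the sketch below uses a toy analytical model to predict runtime on each discovered resource and picks one whose prediction meets the agreed deadline. The model, its coefficients and the resource descriptions are illustrative assumptions; the real PACE toolkit builds far more detailed layered performance models.

    # Toy performance-driven scheduler: predict runtime on each discovered
    # resource and choose one that can meet the SLA deadline.

    def predicted_runtime_s(work_gflop, resource):
        # crude model: perfectly parallel compute plus a per-CPU overhead
        compute = work_gflop / (resource["gflops_per_cpu"] * resource["cpus"])
        overhead = resource["comm_overhead_s"] * resource["cpus"]
        return compute + overhead

    def schedule(work_gflop, deadline_s, discovered):
        # 'discovered' is the output of the advertise/discovery steps above
        ranked = sorted(discovered, key=lambda r: predicted_runtime_s(work_gflop, r))
        best = ranked[0]
        if predicted_runtime_s(work_gflop, best) > deadline_s:
            raise RuntimeError("no discovered resource can meet the SLA")
        return best["name"]

    discovered = [
        {"name": "cluster-a", "cpus": 64,  "gflops_per_cpu": 2.0, "comm_overhead_s": 0.5},
        {"name": "cluster-b", "cpus": 256, "gflops_per_cpu": 1.0, "comm_overhead_s": 0.4},
    ]
    print(schedule(work_gflop=5000, deadline_s=120, discovered=discovered))   # -> cluster-a

Note that the smaller machine wins here because the per-CPU overhead grows with scale; this is the kind of trade-off a predictive performance model lets the middleware reason about before committing to an SLA.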
• '[The Grid] intends to make access to computing power, scientific data repositories and experimental facilities as easy as the Web makes access to information.'
  – Tony Blair, 2002
• High Performance Systems Group, Warwick
  – www.dcs.warwick.ac.uk/research/hpsg
• And herding cats…
  – 100,000s of computers
  – Satellite links, miles of networking
  – Space telescopes, atomic colliders, medical scanners
  – Terabytes of data
  – A software stack a mile high…