Cactus in GrADS
Dave Angulo, Ian Foster
Matei Ripeanu, Michael Russell
Distributed Systems Laboratory
The University of Chicago
With: Gabrielle Allen, Thomas Dramlitsch, Ed Seidel, John Shalf, Thomas Radke
Distributed Systems Lab ARGONNE CHICAGO
Presentation Outline
Cactus Overview
– Architecture
– Applications
Cactus and Grid computing
– Metacomputing, Worms, …
Proposed Cactus-GrADS project
– The “Cactus-G worm”
– Tequila thorn and architecture
– Issues
What is Cactus?
Cactus is a freely available, modular, portable and manageable environment for collaboratively developing parallel, high-performance, multidimensional simulations
– Originally developed for astrophysics, but nothing about it is astrophysics-specific
Cactus Applications
Example output from Numerical Relativity Simulations
Cactus Architecture
Codes are constructed by linking a small core (flesh) with selected modules (thorns)
– Custom linking/configuration tools
Core provides basic management services
A wide variety of thorns are supported
– Numerical methods
– Grids and domain decompositions
– Visualization and steering
– Etc.
Cactus Architecture
[Architecture diagram: Configure/CST/Make assemble Thorns (the Computational Toolkit and other toolkits) around the Cactus Flesh, which runs atop many operating systems: AIX, NT, Linux, Unicos, Solaris, HP-UX, SuperUX, Irix, OSF]
Cactus Applications
A Cactus “application” is just another thorn, “linked” with other tool thorns
Numerous astrophysics applications
– E.g., calculate Schwarzschild event horizons for colliding black holes
Potential candidates for GrADS work
– Elliptical Solver, BenchADM
– Both use a 3-D grid abstract topology
Cactus Model (cont.)
Building an executable
[Diagram: Cactus source = Flesh + thorns (IOBasic, IOASCII, WaveToy, LDAP, Worm, …), built against a Configuration that specifies:]
• Compiler options
• Tool options
• MPI options
• HDF5 options
Running Cactus
Parameter File
• Specify which thorns to activate
• Specify global parameters
• Specify restricted parameters
• Specify private parameters
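A parameter file for the standard WaveToy example might look roughly like this; the exact thorn and parameter names are illustrative here, not a verbatim copy of a shipped example:

```
# Which thorns to activate
ActiveThorns = "PUGH WaveToyC IOBasic IOASCII"

cactus::cctk_itlast  = 100    # global parameter: last iteration
driver::global_nsize = 30     # restricted parameter, set via the driver
wavetoy::amplitude   = 1.0    # private parameter of the WaveToy thorn
```

Parameters are namespaced by the thorn (or implementation) that declares them, which is what makes the global/restricted/private distinction enforceable.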
Parallelism in Cactus
Distributed memory model: each thorn is passed a section of the global grid
The parallel driver (implemented in a thorn) can use whatever method it likes to decompose the grid across processors and exchange ghost zone information; each thorn is presented with a standard interface, independent of the driver
Standard driver distributed with Cactus (PUGH) is for a parallel unigrid and uses MPI for the communication layer
PUGH can do custom processor decomposition and static load balancing
AMR driver also provided
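To make the ghost-zone idea concrete, here is a minimal single-process sketch in Python; it is not the real PUGH/MPI code, and all function names are invented. It decomposes a 1-D grid and refreshes one ghost point per interior boundary:

```python
# Sketch of a PUGH-like 1-D decomposition with one ghost zone per side.

def decompose(global_grid, nprocs, nghost=1):
    """Split the global grid into per-process sections plus ghost zones."""
    n = len(global_grid)
    size = n // nprocs
    sections = []
    for p in range(nprocs):
        lo, hi = p * size, (p + 1) * size
        sections.append({
            "own": list(global_grid[lo:hi]),
            # Ghost zones are empty at the physical boundaries.
            "left_ghost": list(global_grid[max(lo - nghost, 0):lo]),
            "right_ghost": list(global_grid[hi:hi + nghost]),
        })
    return sections

def exchange_ghosts(sections):
    """Refresh ghost zones from neighbours (stand-in for MPI send/recv)."""
    for p, sec in enumerate(sections):
        if p > 0:
            sec["left_ghost"] = sections[p - 1]["own"][-1:]
        if p < len(sections) - 1:
            sec["right_ghost"] = sections[p + 1]["own"][:1]
    return sections

grid = list(range(8))                         # a toy 8-point global grid
parts = exchange_ghosts(decompose(grid, nprocs=4))
print(parts[1]["left_ghost"], parts[1]["own"], parts[1]["right_ghost"])
# → [1] [2, 3] [4]
```

The point of the standard interface is that a thorn only ever sees its own section plus filled ghost zones; whether they were filled by PUGH, an AMR driver, or something else is invisible to it.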
Cactus and Grid Computing: General Observations
Reasons to work with Cactus
– Rich structure, computationally intensive, numerous opportunities for Grid computing
– Talented and motivated developer/user community
Issues
– At core, relatively simple structure
– Cactus system is relatively complex
– User community is relatively small
Cactus-G: Possible Opportunities
“Metacomputing”: use heterogeneous systems as source of low-cost cycles
– Departmental pool or multi-site system
Dynamic resource selection, e.g.
– “Cheapest” resources to achieve interactivity
– “Fastest” resource for best turnaround
– “Best” resolution to meet turnaround goal
– Spawn independent tasks: e.g., analysis
– Migration to “better” resource for all above
Cactus-G: Common Building Blocks
Resource selection based on resource and application characterizations
Implementation and management of distributed output
(De)centralized logging, accounting for resource usage, parameter selection, etc.
Fault discovery, recovery, tolerance
Code/executable management and creation
Next-generation Cactus that increases flexibility with respect to parameter selection
Proposed Cactus-G Challenge Problem: Cactus-G Worm
Migrate to “faster/cheaper/bigger” system
– When system identified by resource discovery
– When resource requirements change
Why?
– Tests much of the machinery required for Cactus-G (source code mgmt, discovery, …)
– Places substantial demands on GrADS
– Good potential to show real benefit
– Migration approach simplifies infrastructure demands (MPI-2 support not required)
Cactus-G Worm: Basic Architecture and Operation
[Architecture diagram: the Cactus “flesh”, carrying the “Tequila” thorn and the application & other thorns, interacts with compute resources, code repositories, storage resources, the Grid Information Service, the GrADS Resource Selector, and the Application Manager; the Resource Selector queries the Information Service, which stores models, etc.]
(0) Possible user input
(1) Adaptation request / (1’) Resource notification
(2) Resource request
(3) Write checkpoint
(4) Migration request
(5) Cactus startup
(6) Load code
(7) Read checkpoint
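The numbered steps can be read as a simple control flow. The sketch below is hypothetical: the class and method names are invented for illustration and do not reflect the real GrADS or Cactus interfaces:

```python
# Illustrative sketch of worm steps (2)-(7); not the real GrADS API.

class ResourceSelector:
    def select(self, requirements):
        # Step (2): the real selector would query the Grid Information
        # Service; here we return a canned "bag" of resources.
        return {"host": "new-cluster.example.org", "cpus": requirements["cpus"]}

class ApplicationManager:
    def migrate(self, checkpoint, resources):
        # Steps (4)-(7): start Cactus on the new resources, load the
        # code from a repository, and restart from the checkpoint.
        return {"running_on": resources["host"], "restored": checkpoint}

def tequila_adapt(selector, manager, state):
    """Driven by the Tequila thorn after an adaptation request (1)."""
    resources = selector.select({"cpus": 16})        # (2) resource request
    checkpoint = {"iteration": state["iteration"]}   # (3) write checkpoint
    return manager.migrate(checkpoint, resources)    # (4)-(7) migrate

run = tequila_adapt(ResourceSelector(), ApplicationManager(),
                    {"iteration": 420})
print(run["running_on"])
# → new-cluster.example.org
```

Routing the restart through the Application Manager, rather than having Tequila restart Cactus directly, is what gives the security and robustness advantages mentioned on the next slide.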
Tequila Thorn Functions
Initiates adaptation on application request or on notification of new resources
– Can include user input (e.g., HTTP thorn)
Requests resources from external entity
– GIS or ResourceSelector
Checkpoints application
Contacts Application Manager to request restart on new resources
– AppManager has security, robustness advantages vs. direct restart
Cactus-G Worm: Approach
1) Uniproc Tequila thorn that speaks to GIS, adapts periodically [done: Cactus group]
2) Tequila thorn that speaks to UCSD Resource Selector [current focus]
3) Integrate accurate performance models
4) Support multiprocessor execution
5) Detailed evaluation
6) Add adaptation triggers: e.g., contract violation, new regime, user input
Tequila Thorn + ResourceSelector
ResourceSelector must be set up as a service
Tequila thorn sends request for new bag of resources
ResourceSelector responds with the new bag
Current Status
Tequila thorn prototype developed that speaks to ResourceSelector
Dummy ResourceSelector that returns a static bag of resources
Demonstrated Cactus+Tequila operating
Performance model developed
Expected by May: multiprocessor support, ResourceSelector interface, real performance model
Open Issues
Should we move more management logic into Application Manager?
How does Contract Monitor fit into architecture?
How does PPS fit into the architecture?
How do COP and the Application Launcher fit into the architecture (Cactus has its own launcher and compiles its own code)?
How does Pablo fit into the architecture (which thorns are monitored, is the flesh monitored)?
The End
Request and Response
The Request to the ResourceSelector will be stored in the InformationService
Only the pointer to the data in the IS will be passed to the ResourceSelector
The Response from the ResourceSelector will also be stored in the IS
Only the pointer to the data in the IS will be passed back.
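A minimal sketch of this pointer-passing convention, with a Python dictionary standing in for the Information Service; all names and the key format are invented for illustration:

```python
# Sketch: pass only pointers (keys) through the Information Service,
# never the AART data itself. Not the real GrADS/IS interface.

information_service = {}   # stand-in for the Grid Information Service

def store(data):
    """Store a record in the IS and return a pointer (key) to it."""
    key = f"aart-{len(information_service)}"
    information_service[key] = data
    return key

def resource_selector(request_ptr):
    """Read the request from the IS, write the response back, return a pointer."""
    request = information_service[request_ptr]          # retrieve AART
    bag = {"resources": ["hostA", "hostB"], "for": request["app"]}
    return store(bag)                                   # response stays in the IS

request_ptr = store({"app": "BenchADM", "cpus": 32})    # Tequila stores the request
response_ptr = resource_selector(request_ptr)           # only pointers cross the wire
print(information_service[response_ptr]["resources"])
# → ['hostA', 'hostB']
```

The payloads persist in the IS after the exchange, which is what supports the persistence requirement discussed on the Requirements slide.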
Tequila communication overview
[Diagram: the Tequila thorn inside Cactus and the ResourceSelector communicate via the Information Service]
Cactus Architecture in GrADS
[Architecture diagram: as on the earlier “Cactus Architecture” slide, Configure/CST/Make assemble Thorns (the Computational Toolkit and other toolkits) around the Flesh, atop AIX, NT, Linux, Unicos, Solaris, HP-UX, SuperUX, Irix, and OSF; a GrADS communication library is added as a further toolkit]
Communication Details step 1
Event sent to Tequila thorn requesting restart
Communication Details step 2
Tequila stores AART in the IS
Communication Details step 3
Tequila sends request to ResourceSelector passing pointer to data in IS
Communication Details step 4
ResourceSelector retrieves AART from IS
Communication Details step 5
ResourceSelector stores bag of resources (in AART) in IS
Communication Details step 6
ResourceSelector responds to Tequila passing pointer to data in IS
Communication Details step 7
Tequila retrieves AART with new bag of resources from IS
Requirements
Using the IS for communication adds overhead.
Why do this?
GrADS requirement 1: do some things (e.g., compile) at one time and have the results stored in a persistent storage area; pick these stored results up later and complete other phases.
Sample Tequila Scenario
User asks to run an ADM simulation 400x400x400 for 1000 timesteps in 10s.
Resource selector contacted to obtain virtual machines
Best virtual machine selected based on performance model
AM starts Cactus on that virtual machine (and monitors execution; Contracts?)
User (or application manager) decides that the computation advances too slowly and decides to search for a better virtual machine
AM finds a better machine, commands the Cactus run to checkpoint, transfers files, and restarts Cactus