National Grid Cyberinfrastructure Open Science Grid (OSG) and TeraGrid (TG)
Jan 01, 2016
• What we’ve already learned so far
  – What grids are, why we want them, and who is using them (Intro)
  – Grid Authentication and Authorization
  – Harnessing CPU cycles with Condor
  – Data Management and the Grid
• In this lecture
  – Fabric-level infrastructure: grid building blocks
  – National grid efforts in the US
    • Open Science Grid
    • TeraGrid
Introduction
Grid Resources in the US

Open Science Grid (OSG)
• Research participation
  – Majority from physics: Tevatron, LHC, STAR, LIGO.
  – Used by 10 other (smaller) research groups.
  – 90 members, 30 VOs.
• Contributors
  – 5 DOE labs: BNL, Fermilab, NERSC, ORNL, SLAC.
  – 65 universities.
  – 5 partner campus/regional grids.
• Accessible resources
  – 43,000+ cores
  – 6 petabytes of disk cache
  – 10 petabytes of tape store
  – 14 internetwork partnerships
• Usage
  – 15,000 CPU wall-clock days/day
  – 1 petabyte of data distributed/month
  – 100,000 application jobs/day
  – 20% of cycles delivered through resource sharing and opportunistic use
TeraGrid (TG)
• Research participation
  – Support for Science Gateways
  – Over 100 scientific data collections (discipline-specific databases)
• Contributors
  – 11 supercomputing centers: Indiana, LONI, NCAR, NCSA, NICS, ORNL, PSC, Purdue, SDSC, TACC and UC/ANL
• Computational resources
  – > 1 petaflop of computing capability
  – 30 petabytes of storage (disk and tape)
  – Dedicated high-performance network connections (10 Gb/s)
  – 750 TFLOPS (161K cores) in parallel computing systems, and growing
OSG vs. TG

Computational resources
  OSG: 43K cores across 80 institutions
  TG: 161K cores across 11 institutions and 22 systems

Storage
  OSG: a shared file system is not mandatory, so applications need to be aware of this
  TG: shared file system (NFS, PVFS, GPFS, Lustre) on each system, and even a WAN GPFS mounted across most systems

Accessibility (a sample GT2 remote submission is sketched after this comparison)
  OSG:
    • private IP space for compute nodes
    • no interactive sessions
    • supports Condor throughout
    • supports GT2 (and a few GT4)
    • the firewall is locked down
  TG:
    • more compute nodes on public IP space
    • support for interactive sessions on login and compute nodes
    • supports GT2 and GT4 for remote access, and mostly PBS/SGE plus some Condor for local access
    • 10K ports open in the TG firewall on login and compute nodes
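Both infrastructures accept remote submission through GT2 GRAM, so a Condor-G grid-universe job is one common way to reach either of them. A minimal sketch of a submit file follows; the gatekeeper contact string is a made-up placeholder, not a real endpoint:

    # Minimal Condor-G submit file: forward a job to a remote GT2 gatekeeper.
    # gatekeeper.example.edu/jobmanager-pbs is an illustrative placeholder.
    universe       = grid
    grid_resource  = gt2 gatekeeper.example.edu/jobmanager-pbs
    executable     = /bin/hostname
    transfer_executable = false
    output         = gridtest.out
    error          = gridtest.err
    log            = gridtest.log
    queue

Submit it with condor_submit and watch it with condor_q; a valid proxy (from grid-proxy-init) must exist before submission.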
Layout of Typical Grid Site

[Diagram: computing fabric + grid middleware (Globus, Condor, ...) + grid-level services => a grid site]

A typical site exposes a Compute Element, a Storage Element, a user interface, an authorization server and a monitoring element (each fronted by Globus services), and connects to grid-level monitoring clients, data management services and grid operations, which together form "the Grid".
Grid Monitoring & Information Services
To use a grid efficiently, you must be able to locate and monitor its resources:
• check the availability of different grid sites
• discover different grid services
• check the status of “jobs”
• make better scheduling decisions with information maintained on the “health” of sites
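On OSG, much of this information is published in the GLUE schema (by the CEMon/GIP services described later) and can be queried over LDAP wherever a BDII-style index or a site GIP endpoint is exposed. A minimal sketch, assuming a reachable index at the placeholder host is.example.edu:

    # Query a GLUE-schema information index over LDAP.
    # The host is a placeholder; port 2170 and base "o=grid" are the usual BDII settings.
    ldapsearch -x -LLL -h is.example.edu -p 2170 -b o=grid \
        '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateFreeCPUs GlueCEStateWaitingJobs

Each matching entry describes one Compute Element, including its free CPUs and waiting jobs, which is exactly the kind of "health" information a scheduler or user can act on.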
Open Science Grid Overview
The Open Science Grid Consortium brings:
• grid service providers:
  – middleware developers
  – cluster, network and storage administrators
  – local-grid communities
• grid consumers:
  – global collaborations
  – single researchers
  – campus communities
  – under-served science domains
into a cooperative infrastructure to share and sustain a common heterogeneous distributed facility in the US and beyond.
• 96 resources across production & integration infrastructures
• 20 Virtual Organizations (+ 6 operations VOs); includes 25% non-physics
• ~20,000 CPUs (from 30 to 4,000 per site)
• ~6 PB of tape
• ~4 PB of shared disk
Snapshot of Jobs on OSG
• Sustained through OSG submissions: 3,000-4,000 simultaneous jobs
• ~10K jobs/day
• ~50K CPU-hours/day
• Peak test loads of 15K jobs a day
• Uses both production & research networks
OSG Snapshot

OSG sits in the middle of a Grid-of-Grids environment, from local to global infrastructures: inter-operating and co-operating campus, regional, community, national and international grids, with Virtual Organizations doing research & education. This is overlaid by virtual computational environments serving everything from single researchers to large groups, local to worldwide.
OSG Grid Monitoring
Open Science Grid Virtual Organization Resource Selector (VORS): http://vors.grid.iu.edu/
• Custom web interface to a grid scanner that checks services and resources on:
  – each Compute Element
  – each Storage Element
• Very handy for checking:
  – paths of installed tools on worker nodes
  – location & amount of disk space for planning a workflow
  – troubleshooting when an error occurs
Example VORS entry: OSG_LIGO_PSU (gatekeeper: grid3.aset.psu.edu)
Gratia, the OSG job accounting system: http://gratia-osg.fnal.gov:8880/gratia-reporting/
How do you join the OSG? A software perspective
(based on Alain Roy’s presentation)
Joining OSG
• Assumption: you have a campus grid.
• Question: what changes do you need to make to join OSG?
Your Campus Grid
• We assume you have a cluster with a batch system:
  – Condor
  – Sun Grid Engine
  – PBS/Torque
  – LSF
Administrative Work
• You need a security contact
  – someone who will respond to security concerns
• You need to register your site
• You should have a web page about your site
  – it will be published, so people can learn about your site
Big Picture
• Compute Element (CE)
  – OSG jobs are submitted to the CE, which hands them to the batch system
  – also hosts information services and lots of support software
• Shared file system
  – OSG requires a couple of directories to be mounted on all worker nodes
• Storage Element (SE)
  – how you manage the storage at your site
Installing Software
• The OSG software stack is based on the VDT (Virtual Data Toolkit)
  – the majority of the software you’ll install
  – it is grid-independent
• OSG software stack = VDT + OSG-specific configuration
• Installed via Pacman (a sample install command is sketched below)
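As a rough sketch of what a Pacman-driven install looks like (the cache URL and package name below are illustrative placeholders; the OSG release documentation gives the exact ones for each release):

    # Fetch the Compute Element package from an OSG Pacman software cache.
    # The cache URL and package name are placeholders, not a specific release.
    mkdir -p /opt/osg-ce && cd /opt/osg-ce
    pacman -get http://software.grid.iu.edu/osg:ce

Pacman downloads the named package and its dependencies from the cache and unpacks them into the current directory.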
What is installed?
• GRAM: allows job submission
• GridFTP: allows file transfers
• CEMon/GIP: publishes site information
• Some authorization mechanism:
  – grid-mapfile: a file that lists authorized users, or
  – GUMS (grid identity mapping service)
• And a few other things…
(sample smoke-test commands and a grid-mapfile entry are sketched below)
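Once the CE software is installed, the classic GT2 command-line tools give a quick smoke test. A minimal sketch, with ce.example.edu standing in for your gatekeeper host:

    # Obtain a short-lived proxy from your grid certificate.
    grid-proxy-init

    # GRAM: run a trivial job through the gatekeeper's fork jobmanager.
    globus-job-run ce.example.edu/jobmanager-fork /bin/hostname

    # GridFTP: copy a local file to the site's GridFTP server.
    globus-url-copy file:///tmp/test.dat gsiftp://ce.example.edu/tmp/test.dat

If you use a grid-mapfile rather than GUMS, it is simply a list of certificate DNs mapped to local accounts, one per line (the DN and account here are made up):

    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" osgdemo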
OSG Middleware

[Layered stack, from infrastructure up to applications]
• Existing operating systems, batch systems and utilities
• Core grid technology distributions: Condor, Globus, MyProxy (shared with TeraGrid and others)
• Virtual Data Toolkit (VDT): core technologies + software needed by stakeholders (many components shared with EGEE)
• OSG Release Cache: OSG-specific configurations, utilities, etc.
• VO middleware
  – HEP: data and workflow management, etc.
  – Biology: portals, databases, etc.
  – Astrophysics: data replication, etc.
• User science codes and interfaces
Picture of a basic site
Shared file system
• OSG_APP
  – for users to store applications
• OSG_DATA
  – a place to store data
  – highly recommended, not required
• OSG_GRID
  – software needed on worker nodes
  – not required
  – may not exist on non-Linux clusters
• Home directories for users
  – not required, but often very convenient
(a sample worker-node job script using these directories follows)
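On the worker nodes these locations are exposed as environment variables, so a typical job runs its application out of $OSG_APP and writes shared output to $OSG_DATA. A minimal wrapper sketch (the VO directory and application names are made up for illustration):

    #!/bin/sh
    # Run a pre-installed application from the shared application area.
    # "myvo" and "myblast" are illustrative names, not real OSG paths.
    APPDIR=$OSG_APP/myvo/bin
    SCRATCH=${_CONDOR_SCRATCH_DIR:-/tmp}

    cd "$SCRATCH"
    "$APPDIR/myblast" -i input.fasta -o result.out

    # Copy results to the shared data area so other jobs can see them.
    cp result.out "$OSG_DATA/myvo/results/"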
Storage Element
• Some sites require more sophisticated storage management
  – how do worker nodes access data?
  – how do you handle terabytes (or petabytes) of data?
• Storage Elements are more complicated
  – more planning is needed
  – some are complex to install and configure
• Two OSG-supported SRM implementations (a sample transfer is sketched below):
  – dCache
  – BeStMan
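Clients reach either implementation through the SRM protocol; the dCache client tools, for example, provide srmcp for copies. A rough sketch, with a placeholder SE endpoint (your site's published SRM URL will differ):

    # Copy a local file into a Storage Element via SRM (endpoint is a placeholder).
    grid-proxy-init
    srmcp file:////tmp/result.out srm://se.example.edu:8443/mydata/result.out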
More information
• Site planning
  – https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/SitePlanning
• Installing the OSG software stack
  – https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/
  – Tutorial: http://www.mcs.anl.gov/~bacon/osgedu/sa_intro.html
Genome Analysis and Database Update system
• Runs across TeraGrid and OSG. Uses the Virtual Data System (VDS) for workflow & provenance.
• Passes through public DNA and protein databases for new and newly updated genomes of different organisms and runs BLAST, Blocks and Chisel. 1,200 users of the resulting database.
• Request: 1,000 CPUs for 1-2 weeks, once a month, every month. Currently on OSG: >600 CPUs and 17,000 jobs a week.
Summary of OSG
• Provides core services, software and a distributed facility for an increasing set of research communities.
• Helps VOs access resources on many different infrastructures.
• Interested in collaborating and contributing our experience and efforts.
Computational Resources (sizes approximate, not to scale)

[Chart: TeraGrid compute resources at SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC and NCAR totalling 504 TF in 2007, growing to ~1 PF in 2008 with the addition of Tennessee and LONI/LSU. Source: Tommy Minyard, TACC]
The TeraGrid Facility
• Grid Infrastructure Group (GIG)
  – University of Chicago
  – TeraGrid integration, planning, management, coordination
  – organized into areas (not VOs, as in OSG):
    • User Services
    • Operations
    • Gateways
    • Data/Visualization/Scheduling
    • Education, Outreach & Training
    • Software Integration
• Resource Providers (RP)
  – e.g. NCSA, SDSC, PSC, Indiana, Purdue, ORNL, TACC, UC/ANL
  – systems (resources, services) support, user support
  – provide access to resources via policies, software, and mechanisms coordinated by and provided through the GIG
11 Resource Providers, One Facility

[Map: resource providers SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, LONI and NICS; software integration partners Caltech, USC/ISI, UNC/RENCI and UW; Grid Infrastructure Group at UChicago]
TeraGrid Hardware Components
• High-end compute hardware
  – Intel/Linux clusters
  – Alpha SMP clusters
  – IBM POWER3 and POWER4 clusters
  – SGI Altix SMPs
  – Sun visualization systems
  – Cray XT3
  – IBM Blue Gene/L
• Large-scale storage systems
  – hundreds of terabytes of secondary storage
• Visualization hardware
• Very high-speed network backbone (40 Gb/s)
  – bandwidth for rich interaction and tight coupling
TeraGrid Objectives
• DEEP science: enabling petascale science
  – make science more productive through an integrated set of very-high-capability resources
  – address key challenges prioritized by users
• WIDE impact: empowering communities
  – bring TeraGrid capabilities to the broad science community
  – partner with science community leaders: "Science Gateways"
• OPEN infrastructure, OPEN partnership
  – provide a coordinated, general-purpose, reliable set of services and resources
  – partner with campuses and facilities
TeraGrid Resources and Services
• Computing
  – nearly a petaflop of computing power today
  – 500 Tflop Ranger system at TACC
• Remote visualization servers and software
• Data
  – allocation of data storage facilities
  – over 100 scientific data collections
• Central allocations process
• Technical support
  – central point of contact for support of all systems
  – Advanced Support for TeraGrid Applications (ASTA)
  – education and training events and resources
  – over 20 Science Gateways
Requesting Allocations of Time
• TeraGrid resources are provided free of charge to academic researchers and educators (a worked sizing example follows this list)
  – Development Allocations Committee (DAC): start-up accounts and courses, up to 30,000 hours; requests processed in two weeks
  – Medium Resource Allocations Committee (MRAC): requests of up to 500,000 hours, reviewed four times a year
  – Large Resource Allocations Committee (LRAC): requests of over 500,000 hours, reviewed twice a year
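As a rough sizing example, take the GADU-style request mentioned earlier (1,000 CPUs for up to two weeks): 1,000 CPUs × 24 hours × 14 days = 336,000 CPU-hours, which is far too large for a DAC start-up grant but still fits within a single MRAC request; running it every month for a year (roughly 4 million CPU-hours) would call for an LRAC allocation. This arithmetic is only illustrative; actual awards are expressed in each resource's own allocation units.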
TeraGrid User Community

[Chart: breakdown of PIs (879), active users (3,197), charging users (1,141), allocations (1.8B NUs) and usage (618M NUs) by discipline: Molecular Biosciences, Chemistry, Physics, Astronomical Sciences, Materials Research, Chemical/Thermal Systems, Atmospheric Sciences, and 20 others each under 2% of usage]
TeraGrid Web Resources
TeraGrid provides a rich array of web-based resources:
• TeraGrid User Portal for managing user allocations and job flow
• Knowledge Base for quick answers to technical questions
• user information, including documentation and information about hardware and software resources
• science highlights
• news and press releases
• education, outreach and training events and resources
In general, seminars and workshops are accessible via video on the web, and extensive documentation is also web-based.
Science Gateways: Broadening Participation in TeraGrid
• Increasing investment by communities in their own cyberinfrastructure, but heterogeneous:
  – resources
  – users, from experts to K-12
  – software stacks, policies
• Science Gateways
  – provide "TeraGrid Inside" capabilities
  – leverage community investment
• Three common forms:
  – web-based portals
  – application programs running on users' machines but accessing services in TeraGrid
  – coordinated access points enabling users to move seamlessly between TeraGrid and other grids
Technical Approach
Biomedical and Biology: Building Biomedical Communities

• Build standard portals to meet the domain requirements of the biology communities
• Develop federated databases to be replicated and shared across TeraGrid
• Provide a workflow composer

[Architecture diagram: OGCE portlets running in an Apache Jetspeed container with internal services; a service API and grid service stubs speak grid protocols to grid services and grid resources, while remote content services pull from remote content servers over HTTP alongside local portal services. Built from open source tools. Source: Dennis Gannon]
HPC Education and Training
TeraGrid partners offer training and education events and resources to educators and researchers:
• workshops, institutes and seminars on high-performance scientific computing
• hands-on tutorials on porting and optimizing code for the TeraGrid systems
• on-line self-paced tutorials
• high-impact educational and visual materials suitable for K-12, undergraduate and graduate classes
“HPC University”
• Advance researchers’ HPC skills
  – catalog of live and self-paced training
  – scheduled series of training courses
  – gap analysis of materials to drive development
• Work with educators to enhance the curriculum
  – searchable catalog of HPC resources
  – workshops for curricular development
  – leverage the good work of others
• Offer student research experiences
  – HPC internship opportunities
  – student competitions
• Publish science and education impact
  – promote via TeraGrid Science Highlights and iSGTW
  – publish education resources to NSDL-CSERD
Sampling of Training Topics Offered
• HPC computing
  – Introduction to Parallel Computing
  – Toward Multicore Petascale Applications
  – Scaling Workshop: Scaling to Petaflops
  – Effective Use of Multi-core Technology
  – TeraGrid-Wide BlueGene Applications
  – Introduction to Using SDSC Systems
  – Introduction to the Cray XT3 at PSC
  – Introduction to & Optimization for SDSC Systems
  – Parallel Computing on Ranger & Lonestar
• Domain-specific sessions
  – Petascale Computing in the Biosciences
  – Workshop on Infectious Disease Informatics at NCSA
• Visualization
  – Introduction to Scientific Visualization
  – Intermediate Visualization at TACC
  – Remote/Collaborative TeraScale Visualization on the TeraGrid
• Other topics
  – NCSA-hosted workshop on data center design
  – Rocks Linux Cluster Workshop
  – LCI International Conference on HPC Clustered Computing
• Over 30 on-line asynchronous tutorials
TeraGrid Resources by Site
(sites in column order: ANL/UC, IU, NCSA, ORNL, PSC, Purdue, SDSC, TACC)

Computational resources
• ANL/UC: Itanium2 (0.5 TF), IA-32 (0.5 TF)
• IU: Itanium2 (0.2 TF), IA-32 (2.0 TF)
• NCSA: Itanium2 (10.7 TF), SGI SMP (7.0 TF), Dell Xeon (17.2 TF), IBM p690 (2 TF), Condor flock (1.1 TF)
• ORNL: IA-32 (0.3 TF)
• PSC: XT3 (10 TF), TCS (6 TF), Marvel SMP (0.3 TF)
• Purdue: heterogeneous (1.7 TF), IA-32 (11 TF, opportunistic)
• SDSC: Itanium2 (4.4 TF), Power4+ (15.6 TF), Blue Gene (5.7 TF)
• TACC: IA-32 (6.3 TF)

Online storage (per site, in the order above): 20 TB, 32 TB, 1,140 TB, 1 TB, 300 TB, 26 TB, 1,400 TB, 50 TB

Mass storage (at the sites that offer it): 1.2 PB, 5 PB, 2.4 PB, 1.3 PB, 6 PB, 2 PB

Network bandwidth and hub: 30 Gb/s CHI, 10 Gb/s CHI, 30 Gb/s CHI, 10 Gb/s ATL, 30 Gb/s CHI, 10 Gb/s CHI, 10 Gb/s LA, 10 Gb/s CHI

Data collections (count, approximate size, access methods):
• 5 collections, >3.7 TB, URL/DB/GridFTP
• >30 collections, URL/SRB/DB/GridFTP
• 4 collections, 7 TB, SRB/Portal/OPeNDAP
• >70 collections, >1 PB, GFS/SRB/DB/GridFTP
• 4 collections, 2.35 TB, SRB/Web Services/URL

Instruments: proteomics and X-ray crystallography; SNS and HFIR facilities

Visualization resources (RI: remote interactive, RB: remote batch, RC: RI/collaborative):
• RI, RC, RB: IA-32 with 96 GeForce 6600GT cards
• RB: SGI Prism, 32 graphics pipes; IA-32
• RI, RB: IA-32 + Quadro4 980 XGL
• RB: IA-32, 48 nodes
• RI, RC, RB: UltraSPARC IV, 512 GB SMP, 16 graphics cards
TeraGrid Resources (summary)
• 100+ TF across 8 distinct architectures
• 3 PB of online disk
• >100 data collections
Applications can cross infrastructures, e.g. OSG and TeraGrid.
More Info:
• Open Science Grid: http://www.opensciencegrid.org
• TeraGrid: http://www.teragrid.org