Site Report: The RHIC Computing Facility
HEPIX – Amsterdam
May 19-23, 2003
A. Chan
RHIC Computing Facility
Brookhaven National Laboratory
Outline
Background
Mass Storage
Central Disk Storage
Linux Farms
Software Development
Monitoring
Security
Other services
Summary
Background
Brookhaven National Lab (BNL) is a U.S. gov’t funded multi-disciplinary research laboratory
RCF formed in the mid-90’s to address computing needs of RHIC experiments
Became U.S. Tier 1 Center for ATLAS in late 90’s
RCF is a multi-purpose facility (NHEP and HEP)
Background (continued)
Currently 25 staff members (need more)
RHIC first collisions in 2000, now in year 3 of operations
5 RHIC experiments (BRAHMS, PHENIX, PHOBOS, PP2PP and STAR)
Mass Storage
4 StorageTek tape silos managed via HPSS (9940A and 9940B drives)
Peak raw data rate to silos 350 MB/s (can do better)
Peak data rate to/from Linux Farm 180 MB/s (can do better)
Experiments have accumulated 618 TB of raw data (capacity for 5x more)
5 staff members oversee Mass Storage operations
The Mass Storage System (1)
The Mass Storage System (2)
Central Disk Storage
24 Sun E450 servers running Solaris 8
140 TB of disks managed by Sun servers via Veritas
Fast access to processed (DST) data via NFS (back-up in HPSS)
Aggregate data rate to/from Sun servers averages 600 MB/s
5 staff members oversee Central Disk Storage operations
Central Disk Storage (1)
Central Disk Storage (2)
Linux Farms
Provide the majority of CPU power in the RCF
Used for mass processing of RHIC data
Listed as 3rd largest cluster according to http://www.clusters500.org
5 staff members oversee all Linux Farm operations
Linux Farm Hardware
Built with commercially available Intel-based servers
1097 rack-mounted, dual CPU servers
917,728 SpecInt2000
Reliable (0.0052 hardware failures per machine per month, about 6 failures/month at current farm size)
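As a quick sanity check, the quoted per-machine failure rate does imply roughly six failures per month across the farm (a sketch using only the figures from this slide):

```python
# Expected hardware failures per month across the whole farm,
# from the per-machine monthly failure rate quoted above.
failure_rate = 0.0052   # hardware failures per machine per month
machines = 1097         # dual-CPU servers in the farm

expected_failures = failure_rate * machines
print(f"{expected_failures:.1f} failures/month")  # prints "5.7 failures/month"
```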
The growth of the Linux Farm
[Chart: Linux Farm capacity growth by year, 1999-2003, in KSpecInt2000]
The Linux Farm in the RCF (1)
The Linux Farm in the RCF (2)
Linux Farm Software
Red Hat 7.2 (RHIC) and 7.3 (ATLAS)
Image installed with Kickstart
Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel)
Support for network file systems (AFS, NFS)
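The Kickstart installs mentioned above are driven by a configuration file; a minimal ks.cfg along these lines (the NFS server, password hash, partition sizes and package list here are hypothetical, not the RCF's actual configuration) gives the flavor:

```
# Hypothetical Kickstart file for a Red Hat 7.x farm node (illustrative only)
install
nfs --server installhost.example.org --dir /export/redhat72
lang en_US
keyboard us
network --bootproto dhcp
rootpw --iscrypted $1$examplehash
auth --useshadow --enablemd5
timezone America/New_York
clearpart --all --initlabel
part / --size 4096 --grow
part swap --size 1024
%packages
@ Base
gcc
gdb
%post
# site-specific setup (batch client, AFS, monitoring agents) would go here
```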
Linux Farm Software (continued)
Support for LSF and RCF-designed batch software
System administration software to monitor & control hardware, software and infrastructure
GRID-like software (Ganglia, Condor, Globus, etc.)
Scalability an important operational requirement
Batch jobs in the Linux Farm (1)
[Chart: CRS batch job statistics, total jobs submitted per month (0 to 35,000), 1999-2003, by experiment (BRAHMS, PHENIX, PHOBOS, STAR)]
Batch Jobs in the Linux Farm (2)
[Chart: CRS batch job statistics, job efficiency (0-100%), 1999-2003, by experiment (BRAHMS, PHENIX, PHOBOS, STAR)]
Software Development
GRID-like services for RHIC and ATLAS
GRID monitoring tools
GRID user management issues
4 staff members involved
The USATLAS GRID Testbed
[Diagram (from CHEP 03, La Jolla, 4/15/03): BNL US ATLAS Grid configuration. Grid job requests arrive over the Internet through a Globus client to a gatekeeper/job manager, which dispatches jobs to LSF Server1 and Server2. A GridFTP server (2 TB of disk, 30 MB/s) moves data between HPSS and the Globus replica catalog; a GIIS server publishes grid status. Hosts shown: atlas00, afs04, afs05, amds04.]
GRID Monitoring
[Diagram: monitoring framework. Sensors on servers, HPSS, the network and computing nodes feed data collectors and DB information providers; samples are stored in a monitoring database (ODBC+MySQL) or RRD. GRIS information providers publish to an aggregate service index (GIIS), queried via grid-info-search or the Grid-View web server.]
GRID User Management (1)
[Diagram: GUMS, a scalable Grid User Management System, exchanging user info with the Virtual Organization and UNM.]
GRID User Management (2)
[Schematic diagram: VO user registry databases (VO #2, VO #3, ...) push or pull user info, via regional and local registration authorities, into a site user info database; combined with local policy, this drives local account management and generation of the site's grid-mapfile.]
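The final mapping step in the schematic above, from certificate subjects to local accounts, can be sketched as follows (the DNs and account names are invented, and GUMS's actual interfaces are not shown):

```python
# Sketch: generating a grid-mapfile from VO user records.
# Each entry maps an X.509 certificate subject (DN) to a local UNIX account.

def make_grid_mapfile(vo_users):
    """vo_users: list of (certificate_dn, local_account) pairs,
    e.g. as pushed/pulled from a VO user registry database."""
    lines = ['"%s" %s' % (dn, account) for dn, account in vo_users]
    return "\n".join(lines) + "\n"

users = [
    ("/DC=org/DC=examplegrid/OU=People/CN=Jane Doe", "janedoe"),
    ("/DC=org/DC=examplegrid/OU=People/CN=John Smith", "jsmith"),
]
print(make_grid_mapfile(users), end="")
```

Each output line is the quoted DN followed by the local account name, which is the grid-mapfile format consumed by the Globus gatekeeper for authorization.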
Monitoring
Mix of open-source, RCF-designed and vendor-provided monitoring software
Persistency and fault-tolerant features
Near real-time information
Scalability requirements
Mass Storage Monitoring
Central Data Storage Monitoring
Linux Farm Monitoring
Batch Job Control & Monitoring
Infrastructure Monitoring
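A minimal sketch of the collector pattern behind the monitoring listed above (sensors feeding near real-time samples into a database), using SQLite in place of the MySQL/RRD back ends actually deployed; the node names and metric values are invented:

```python
import sqlite3
import time

# Sensors report (node, metric, value) samples; a collector stores them
# with timestamps so monitoring views can stay near real-time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (ts REAL, node TEXT, metric TEXT, value REAL)")

def collect(node, metric, value):
    db.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
               (time.time(), node, metric, value))

# Example: two farm nodes reporting load averages.
collect("node0001", "loadavg", 1.7)
collect("node0002", "loadavg", 0.4)

latest = db.execute(
    "SELECT node, value FROM samples WHERE metric = 'loadavg' ORDER BY node"
).fetchall()
print(latest)  # prints [('node0001', 1.7), ('node0002', 0.4)]
```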
Security
Firewall to minimize unauthorized access
Most servers closed to direct, external access
User access through security-enhanced gateway systems
Security in the GRID-environment a big challenge
Security at the RCF
Other Services
Limited printer support
Off-site data transfer services (bbftp, rftp, etc.)
Nightly backups of critical file systems
Summary
Implementation of GRID-like services increasing
Hardware & software scalability more important as RCF grows
Security in the GRID era an increasingly important challenge