The RHIC-ATLAS Computing Facility at BNL

HEPIX – Edinburgh

May 24-28, 2004

Tony Chan

RHIC Computing Facility

Brookhaven National Laboratory

Outline

Background
Mass Storage
Central Disk Storage
Linux Farm
Monitoring
Security & Authentication
Future Developments
Summary

Background

Brookhaven National Laboratory (BNL) is a U.S. government-funded, multi-disciplinary research laboratory.

The RACF was formed in the mid-1990s to address the computing needs of the RHIC experiments. It became the U.S. Tier 1 Center for ATLAS in the late 1990s.

RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).

Background (continued)

Currently 29 staff members (4 new hires in 2004).

RHIC Year 4 just concluded. Performance surpassed all expectations.

Staff Growth at the RACF

[Chart: Staff Levels (FTE), 0-30, by Year, 1997-2004.]

RACF Structure

Mass Storage

4 StorageTek tape silos managed via HPSS (v4.5).

Using 37 9940B drives (200 GB/tape).

Aggregate bandwidth up to 700 MB/s.

10 data movers with 10 TB of disk.

Over 1.5 PB of raw data collected in 4 years of running (total capacity: 4.5 PB).
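As a back-of-the-envelope check on these figures (an illustrative sketch, not from the slides): 4.5 PB at 200 GB per cartridge is roughly 22,500 tape slots across the four silos, and 37 drives sharing 700 MB/s average about 19 MB/s per drive, comfortably below the roughly 30 MB/s native rate quoted in vendor specs for the 9940B.

```python
# Back-of-the-envelope sanity check of the mass-storage numbers quoted above.
# Capacities and bandwidth come from the slides; the ~30 MB/s native 9940B
# rate is an assumption based on vendor specifications of the era.

CAPACITY_PB = 4.5   # total silo capacity quoted
TAPE_GB = 200       # 9940B cartridge capacity
N_DRIVES = 37
AGG_MB_S = 700      # aggregate bandwidth quoted

tapes = CAPACITY_PB * 1_000_000 / TAPE_GB  # PB -> GB -> cartridges
per_drive = AGG_MB_S / N_DRIVES            # average MB/s per drive

print(f"~{tapes:,.0f} cartridges across 4 silos")             # ~22,500
print(f"~{per_drive:.0f} MB/s per drive (vs ~30 MB/s native)")
```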

The Mass Storage System

Central Disk Storage

Large SAN served via NFS: DSTs, user home directories, and scratch areas.

41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually.

24 Brocade switches & 250 TB of FB RAID5 managed by Veritas.

Aggregate data rate of 600 MB/s to/from the Sun servers on average.

Central Disk Storage (cont.)

RHIC and ATLAS AFS cells: software repositories + user home directories.

Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS.

Transarc on server side, OpenAFS on client side.

Considering OpenAFS for server side.
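For reference, a minimal sketch of how a user-side check of the AFS cell and a volume quota might look on one of these clients. The `fs` utility ships with OpenAFS clients; the cell and path below are hypothetical placeholders, not actual RACF names.

```python
# Minimal sketch of querying the local AFS cell and a volume quota from a
# client node, using the standard OpenAFS `fs` utility. The /afs path below
# is hypothetical; substitute a real RHIC or ATLAS cell path.
import subprocess

def afs_cell() -> str:
    """Return the `fs wscell` report describing the workstation's cell."""
    out = subprocess.run(["fs", "wscell"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def afs_quota(path: str) -> str:
    """Return the quota listing for an AFS directory."""
    out = subprocess.run(["fs", "listquota", path],
                         capture_output=True, text=True, check=True)
    return out.stdout

if __name__ == "__main__":
    print(afs_cell())
    print(afs_quota("/afs/rhic.bnl.gov/user/someuser"))  # hypothetical path
```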

The Central Disk Storage System

Linux Farm

Used for mass processing of data.

1359 rack-mounted, dual-CPU (Intel) servers.

Total of 1362 kSpecInt2000.

Reliable (about 6 hardware failures per month at current farm size).

Combination of SCSI & IDE disks with an aggregate of 234+ TB of local storage.
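Two quick derived figures (illustrative arithmetic, not from the slides): 1362 kSpecInt2000 over 1359 dual-CPU nodes is almost exactly 1 kSI2k per node, i.e. about 500 SI2k per CPU, and 6 failures per month across 1359 nodes corresponds to a per-node hardware failure rate of roughly 5% per year.

```python
# Illustrative arithmetic on the Linux Farm figures quoted above.
NODES = 1359
TOTAL_KSI2K = 1362
FAILURES_PER_MONTH = 6

per_node = TOTAL_KSI2K / NODES    # ~1.0 kSI2k per dual-CPU node
per_cpu = per_node / 2 * 1000     # ~500 SpecInt2000 per CPU
annual_failure_rate = FAILURES_PER_MONTH * 12 / NODES

print(f"{per_node:.2f} kSI2k/node, ~{per_cpu:.0f} SI2k/CPU")
print(f"~{annual_failure_rate:.1%} chance a given node fails in a year")
```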

Linux Farm (cont.)

Experiments are making significant use of local storage through custom job schedulers, data repository managers, and rootd (see the sketch after this slide).

Requires significant infrastructure resources (network, power, cooling, etc).

Significant scalability challenges. Advance planning and careful design a must!
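To illustrate the rootd access pattern mentioned above, a minimal sketch of reading a file parked on a farm node's local disk through ROOT's root:// protocol. The hostname and file path are hypothetical, and the snippet assumes a ROOT installation with PyROOT available.

```python
# Minimal sketch of remote file access via rootd, the mechanism experiments
# use to read data on farm-node local disks. Hostname and path are
# hypothetical; requires a ROOT installation with PyROOT.
import ROOT

# TFile.Open dispatches root:// URLs to the rootd daemon on the remote node.
f = ROOT.TFile.Open("root://farmnode042.rcf.bnl.gov//data01/dst/run1234.root")
if f and not f.IsZombie():
    f.ls()    # list the file's contents
    f.Close()
```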

The Growth of the Linux Farm

[Chart: KSpecInt2000, 0-1400, by Year, 1999-2004.]

The Linux Farm in the RACF

Linux Farm Software

Custom RH 8 (RHIC) and 7.3 (ATLAS) images.

Installed via a Kickstart server (see the sketch after this slide).

Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel).

Support for network file systems (AFS, NFS) and local data storage.
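As a flavor of what such an automated install looks like, a minimal sketch of generating a Red Hat Kickstart file from Python. The server name, partition layout, and package list are hypothetical, not the actual RACF configuration.

```python
# Minimal sketch of generating a Kickstart file for automated farm installs.
# Server name, paths, partition sizes, and the package list are hypothetical.
KS = """\
install
nfs --server kickstart.rcf.bnl.gov --dir /export/rh8
lang en_US
keyboard us
rootpw --iscrypted $1$XXXXXXXX$YYYYYYYYYYYYYYYYYYYYYY
authconfig --enablemd5 --enableshadow
clearpart --all
part / --size 4096
part swap --size 2048
part /data01 --size 1 --grow

%packages
gcc
gdb
openafs-client
nfs-utils

%post
/sbin/chkconfig afs on
"""

with open("ks.cfg", "w") as f:
    f.write(KS)
```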

Linux Farm Batch Management

New Condor-based batch system with a custom Python front-end to replace the old batch system. Fully deployed in the Linux Farm.

Use of Condor DAGMan functionality to handle job dependencies (see the sketch after this slide).

New system solves scalability problems of old system.

Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
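A minimal sketch of the DAGMan usage pattern referred to above, driven from Python in the spirit of the custom front-end. Job names, executables, and file names are hypothetical; the submit-description and DAG syntax are standard Condor.

```python
# Minimal sketch of expressing a two-step job dependency with Condor DAGMan.
# Executable names and log paths are hypothetical.
import subprocess

SUBMIT = """\
universe   = vanilla
executable = {exe}
output     = {name}.out
error      = {name}.err
log        = {name}.log
queue
"""

# Write one submit description per step.
for name, exe in [("reco", "run_reco.sh"), ("analysis", "run_analysis.sh")]:
    with open(f"{name}.sub", "w") as f:
        f.write(SUBMIT.format(exe=exe, name=name))

# DAG: run the analysis job only after reconstruction succeeds.
with open("chain.dag", "w") as f:
    f.write("JOB  reco     reco.sub\n")
    f.write("JOB  analysis analysis.sub\n")
    f.write("PARENT reco CHILD analysis\n")

subprocess.run(["condor_submit_dag", "chain.dag"], check=True)
```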

Linux Farm Batch Management (cont.)

LSF v5.1 widely used in the Linux Farm, especially for data analysis jobs. Peak rate of 350 K jobs/week.

LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rate is 7 K jobs/week.

Condor and LSF both accept jobs through Globus (see the sketch after this slide).

Condor scalability to be tested in ATLAS DC2.
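For reference, a minimal sketch of routing a test job through a Globus gatekeeper into the underlying batch system. The gatekeeper hostname is hypothetical; globus-job-run is a standard Globus Toolkit 2 client and assumes a valid grid proxy has already been created with grid-proxy-init.

```python
# Minimal sketch of submitting a test job through a Globus gatekeeper to the
# Condor jobmanager behind it. The contact string is hypothetical; requires
# Globus Toolkit 2 clients and an existing grid proxy (grid-proxy-init).
import subprocess

GATEKEEPER = "atlasgk.rcf.bnl.gov/jobmanager-condor"  # hypothetical

result = subprocess.run(
    ["globus-job-run", GATEKEEPER, "/bin/hostname"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # prints the farm node that ran the job
```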

Condor Usage at the RACF

Monitoring

Mix of open-source, RCF-designed and vendor-provided monitoring software.

Persistency and fault-tolerance features.

Near real-time information.

Scalability requirements.
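As an illustration of those requirements, a minimal sketch (entirely hypothetical, not the RACF's actual tooling) of a poller that keeps near-real-time samples, persists them so a restart does not lose history, and tolerates failed probes.

```python
# Illustrative sketch of a tiny monitoring poller with the properties listed
# above: near-real-time sampling, persistence across restarts, and tolerance
# of failed probes. Entirely hypothetical; not the RACF's actual software.
import json
import os
import time

STATE = "monitor_state.json"  # persisted samples survive restarts

def probe_load() -> float:
    """Sample the local 1-minute load average."""
    return os.getloadavg()[0]

def run(interval_s: float = 30.0, max_samples: int = 1000) -> None:
    history = []
    if os.path.exists(STATE):  # recover persisted history on restart
        with open(STATE) as f:
            history = json.load(f)
    while True:
        try:
            history.append({"t": time.time(), "load": probe_load()})
        except OSError:
            pass               # tolerate a failed probe; keep running
        history = history[-max_samples:]
        with open(STATE, "w") as f:  # persist after every sample
            json.dump(history, f)
        time.sleep(interval_s)

if __name__ == "__main__":
    run()
```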

Mass Storage Monitoring

Central Disk Storage Monitoring

Linux Farm Monitoring

Temperature Monitoring

Security & Authentication

Two layers of firewall with limited network services and limited interactive access through secure gateways.

Migration to Kerberos 5 single sign-on and consolidation of password DBs. NIS passwords to be phased out.

Integration of K5/AFS with LSF to solve credential-forwarding issues. A similar implementation will be needed for Condor (see the sketch after this slide).

Implemented Kerberos certificate authority.
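To illustrate the credential-forwarding problem, a minimal sketch of a job wrapper that re-acquires a Kerberos ticket and AFS token on the execution host. The principal and keytab path are hypothetical; kinit (MIT Kerberos) and aklog (OpenAFS) are standard tools.

```python
# Minimal sketch of a batch-job wrapper that re-acquires Kerberos 5 and AFS
# credentials on the execution host, the kind of glue the K5/AFS-LSF
# integration above provides. Principal and keytab path are hypothetical.
# Usage: wrapper.py <command> [args...]
import subprocess
import sys

PRINCIPAL = "someuser@RHIC.BNL.GOV"        # hypothetical principal
KEYTAB = "/batch/keytabs/someuser.keytab"  # hypothetical keytab path

def acquire_credentials() -> None:
    # Obtain a fresh ticket-granting ticket from the keytab ...
    subprocess.run(["kinit", "-k", "-t", KEYTAB, PRINCIPAL], check=True)
    # ... then derive an AFS token so the job can reach AFS home directories.
    subprocess.run(["aklog"], check=True)

if __name__ == "__main__":
    acquire_credentials()
    # Run the real payload with valid credentials in place.
    sys.exit(subprocess.run(sys.argv[1:]).returncode)
```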

Future Developments

HSI/HTAR deployment for UNIX-like access to HPSS.

Moving beyond the NFS-served SAN to more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc.).

dCache/SRM being evaluated as a distributed storage management solution to exploit the high-capacity, low-cost local storage in the 1300+ node Linux Farm (see the sketch after this slide).

Linux Farm OS upgrade plans (RHEL?).
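As a flavor of the dCache access model under evaluation, a minimal sketch of staging a file out of a dCache pool with dccp, dCache's standard copy client. The door hostname, port, and path are hypothetical; this illustrates the access model, not a deployed setup.

```python
# Minimal sketch of staging a file out of a dCache pool with dccp, the copy
# client shipped with dCache. Door hostname, port, and /pnfs path are
# hypothetical placeholders.
import subprocess

SOURCE = "dcap://dcache.rcf.bnl.gov:22125/pnfs/rcf.bnl.gov/data/dst/run1234.root"
DEST = "/tmp/run1234.root"

subprocess.run(["dccp", SOURCE, DEST], check=True)
print(f"staged {SOURCE} -> {DEST}")
```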

US ATLAS Grid Testbed

[Diagram: US ATLAS Grid Testbed. Grid job requests arrive over the Internet from Globus clients at a gatekeeper/job manager (atlas02) feeding a Condor pool; supporting services include HPSS with a mover node (amds), 17 TB of disk at 70 MB/s, a GridFTP server (aftpexp00), an information server (giis01), an AFS server (aafs), and a Globus RLS server.]

Local Grid development is currently focused on monitoring, user management, and support for DC2 production activities.

Summary

RHIC run very successful.

Staff levels are increasing to support the growing level of computing activities.

On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment.

Increased activity to support upcoming ATLAS DC2 production.