The Terabyte Analysis Machine Project
Data-intensive computing in astronomy
James Annis, Gabriele Garzoglio, Peretz Partensky, Chris Stoughton
The Experimental Astrophysics Group, Fermilab
TAM Design
The TAM is a compact analysis cluster designed to:
• Bring compute power to bear on large data sets
• Make high-I/O-rate scans through large data sets
Compute power brought to large data sets (hardware):
• 10-processor Linux cluster
• High memory (1 GB RAM/node) and large local disk (140 GB/node)
• SAN to a terabyte of global disk (Fibre Channel network and hardware RAID)
• Global File System (9x faster reads than NFS; hardware limited)
High-I/O-rate scans (software):
• The SDSS science database, SX
• The Distance Machine framework
TAM Hardware
[Diagram (April 2001): the Terabyte Analysis Machine. 5 compute nodes with 0.5 TB of local disk sit on a Fast Ethernet switch; a Fibre Channel switch connects them to 1 TB of global disk; a gigabit uplink reaches slow-access data on IDE disk farms and Enstore.]
• System integrator: Linux NetworX, with the ACE cluster control box
• Compute nodes: Linux NetworX; dual 600 MHz Pentium III on an ASUS motherboard, 1 GB RAM, 2x36 GB EIDE disks, QLogic 2100 HBA
• Ethernet: Cisco Catalyst 2948G
• Fibre Channel: Gadzoox Capellix 3000
• Global disk: DotHill SanNet 4200 with dual Fibre Channel controllers and 10x73 GB Seagate Cheetah SCSI disks
• Software: Linux 2.2.19, QLogic drivers, GFS v4.0, Condor
GFS: The Global File System
• Sistina Software (formerly of the University of Minnesota)
• Open source (GPL; now the Sistina Public License)
• Runs on Linux and FreeBSD
• 64-bit files and file system
• Distributed, server-less metadata
• Data synchronization via global, disk-based locks
• Journaling and node cast-out
Three major pieces:
• The network storage pool driver (on the nodes)
• The file system (on disk)
• The locking modules (on disk or on an IP server)
GFS Performance
Test setup:
• 5 nodes
• 1 five-disk RAID
Results:
• RAID limited at 95 MB/s
• At more than 15 threads, limited by disk-head movement
• Linear rate increase up to the hardware limits
• Circa 9x faster than NFS
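The kind of multi-threaded scan measured above can be probed coarsely with parallel readers. The sketch below is a hypothetical probe, not the benchmark behind the numbers: the file path, file size, and thread count are placeholder defaults, and FILE would point at a GFS-mounted file in practice.

```shell
#!/bin/sh
# Hypothetical aggregate-read probe: spawn THREADS parallel dd readers
# against one file and report the combined rate. All defaults below are
# placeholders, not the parameters of the actual TAM benchmark.
FILE=${FILE:-/tmp/gfs-probe.dat}
THREADS=${THREADS:-4}   # number of parallel readers
MB=${MB:-64}            # size of the test file in megabytes

# Create the test file if it does not already exist.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1048576 count="$MB" 2>/dev/null

start=$(date +%s)
i=0
while [ "$i" -lt "$THREADS" ]; do
    dd if="$FILE" of=/dev/null bs=1048576 2>/dev/null &   # one reader
    i=$((i + 1))
done
wait                    # block until every reader finishes
end=$(date +%s)

elapsed=$((end - start))
[ "$elapsed" -gt 0 ] || elapsed=1   # avoid divide-by-zero on fast runs
rate=$((THREADS * MB / elapsed))
echo "aggregate read rate: ${rate} MB/s"
```

Note that a file much smaller than node RAM will be served from the buffer cache, so meaningful numbers require a test file well beyond 1 GB per node.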
Fibre Channel
The Fibre Channel hardware has performed flawlessly, with no maintenance.
• QLogic host bus adaptors (single channel), ~$800/node
• Gadzoox Capellix Fibre Channel switch, ~$1000/port
  • One port per node
  • Two ports per RAID box
• Dot Hill hardware RAID system with dual FC controllers
The HBA shows up as a SCSI device (/dev/sda) on Linux. The HBA carries both driver code and firmware code:
• Must download the driver code and compile it into the kernel.
We haven't explored Fibre Channel's ability to connect machines over kilometers.
Global File System I
• We have never lost data to a GFS problem.
• Untuned GFS clearly outperformed untuned NFS.
• Linux kernel buffering is an issue ("but I edited that file…").
• The Sistina mailing list is very responsive.
• Must patch the Linux kernel (2.2.19 or 2.4.6).
• Objectivity doesn't behave on GFS: one must duplicate the federation files for each machine that wants access.
• GFS itself is on the complicated side, and is unfamiliar to sysadmins.
• We haven't explored using GFS machines as file servers.
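The Objectivity workaround above amounts to keeping a per-machine copy of the federation file. A minimal sketch, with placeholder paths and node names (a production version would scp to each node rather than copy locally):

```shell
#!/bin/sh
# Hypothetical per-node duplication of an Objectivity federation file.
# SRC, DEST, and NODES are placeholders for illustration only.
SRC=${SRC:-/tmp/federation.fdb}
DEST=${DEST:-/tmp/fed-copies}
NODES="node1 node2 node3"

[ -f "$SRC" ] || echo "placeholder federation" > "$SRC"
mkdir -p "$DEST"
for n in $NODES; do
    cp "$SRC" "$DEST/$n.fdb"   # in production: scp "$SRC" "$n:$SRC"
done
ls "$DEST"
```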
Global File System II
The biggest issues are node death, power-on, and IP locking.
Node death:
• GFS is a journaling file system; who replays the journal? STOMITH: shoot the other machine in the head.
• Surviving nodes must be able to power-cycle a dead node so that they can safely replay its journal.
Power-on:
• Scriptable via /etc/rc.d/init.d.
• We have never survived a power loss without human intervention at power-up.
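The power-on step can be scripted in the usual /etc/rc.d/init.d shape, as in the skeleton below. The pool-assembly and mount commands are left commented out because their exact names depend on the GFS release; they are assumptions, not a verified recipe, and the echo lines merely trace the sequence.

```shell
#!/bin/sh
# Sketch of an /etc/rc.d/init.d script for bringing GFS up at power-on.
# The commented commands are assumed names from GFS 4.x-era tooling;
# the device and mount point are examples.
gfs_start() {
    echo "assembling storage pool"
    # passemble                          # scan FC disks, build /dev/pool devices
    echo "mounting gfs"
    # mount -t gfs /dev/pool/tam0 /gfs
}
gfs_stop() {
    echo "unmounting gfs"
    # umount /gfs
}
case "$1" in
    start) gfs_start ;;
    stop)  gfs_stop ;;
    *)     echo "Usage: $0 {start|stop}" ;;
esac
```

Even with such a script, nodes that boot before the lock service is available still need attention, which is where the human intervention noted above has come in.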
IP locking:
• DMEP is the device-based locking protocol; it is very new, with support from two vendors.
• memexpd is an IP lock server:
  – Allows any hardware for the disks.
  – Is a single point of failure and a potential throughput bottleneck.