The Terabyte Analysis Machine Project
Data-intensive computing in astronomy
James Annis, Gabriele Garzoglio, Peretz Partensky, Chris Stoughton
The Experimental Astrophysics Group, Fermilab
TAM Design
The TAM is a compact analysis cluster designed to:
• Bring compute power to bear on large data sets
• Make high-I/O-rate scans through large data sets
Compute power brought to large data sets (hardware):
• 10-processor Linux cluster
• High memory (1 GB RAM/node) and large local disk (140 GB/node)
• SAN to a terabyte of global disk (Fibre Channel network and hardware RAID)
• Global File System (9x faster reads than NFS; hardware limited)
High-I/O-rate scans (software):
• The SDSS science database, SX
• The Distance Machine framework
TAM Hardware
[Diagram (April 2001): the Terabyte Analysis Machine. 5 compute nodes with 0.5 TB of local disk sit on a Fast Ethernet switch; a Fibre Channel switch connects them to 1 TB of global disk; a gigabit uplink reaches slow-access data on IDE disk farms and Enstore.]
• System integrator: Linux NetworX, with the ACE cluster control box
• Compute nodes: Linux NetworX; dual 600 MHz Pentium III on an ASUS motherboard, 1 GB RAM, 2x36 GB EIDE disks, QLogic 2100 HBA
• Ethernet: Cisco Catalyst 2948G
• Fibre Channel: Gadzoox Capellix 3000
• Global disk: DotHill SanNet 4200 with dual Fibre Channel controllers and 10x73 GB Seagate Cheetah SCSI disks
• Software: Linux 2.2.19, QLogic drivers, GFS v4.0, Condor
GFS: The Global File System
• Sistina Software (formerly of the University of Minnesota)
• Open source (GPL; now the Sistina Public License)
• Runs on Linux and FreeBSD
• 64-bit files and file system
• Distributed, server-less metadata
• Data synchronization via global, disk-based locks
• Journaling and node cast-out
Three major pieces:
• The network storage pool driver (on the nodes)
• The file system (on disk)
• The locking modules (on disk or on an IP server)
GFS Performance
Test setup:
• 5 nodes
• 1 five-disk RAID
Results:
• RAID limited at 95 MB/s
• At more than 15 threads, limited by disk-head movement
• Linear rate increase up to the hardware limits
• Circa 9x faster than NFS
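The kind of multi-threaded scan measured above can be probed coarsely with parallel readers. The sketch below is a hypothetical probe, not the benchmark behind the numbers: the file path, file size, and thread count are placeholder defaults, and FILE would point at a GFS-mounted file in practice.

```shell
#!/bin/sh
# Hypothetical aggregate-read probe: spawn THREADS parallel dd readers
# against one file and report the combined rate. All defaults below are
# placeholders, not the parameters of the actual TAM benchmark.
FILE=${FILE:-/tmp/gfs-probe.dat}
THREADS=${THREADS:-4}   # number of parallel readers
MB=${MB:-64}            # size of the test file in megabytes

# Create the test file if it does not already exist.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1048576 count="$MB" 2>/dev/null

start=$(date +%s)
i=0
while [ "$i" -lt "$THREADS" ]; do
    dd if="$FILE" of=/dev/null bs=1048576 2>/dev/null &   # one reader
    i=$((i + 1))
done
wait                    # block until every reader finishes
end=$(date +%s)

elapsed=$((end - start))
[ "$elapsed" -gt 0 ] || elapsed=1   # avoid divide-by-zero on fast runs
rate=$((THREADS * MB / elapsed))
echo "aggregate read rate: ${rate} MB/s"
```

Note that a file much smaller than node RAM will be served from the buffer cache, so meaningful numbers require a test file well beyond 1 GB per node.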
Fibre Channel
The Fibre Channel hardware has performed flawlessly, with no maintenance.
• QLogic host bus adaptors (single channel), ~$800/node
• Gadzoox Capellix Fibre Channel switch, ~$1000/port
  • One port per node
  • Two ports per RAID box
• Dot Hill hardware RAID system with dual FC controllers
The HBA shows up as a SCSI device (/dev/sda) on Linux. The HBA carries both driver code and firmware code:
• Must download the driver code and compile it into the kernel.
We haven't explored Fibre Channel's ability to connect machines over kilometers.
Global File System I
• We have never lost data to a GFS problem.
• Untuned GFS clearly outperformed untuned NFS.
• Linux kernel buffering is an issue ("but I edited that file…").
• The Sistina mailing list is very responsive.
• Must patch the Linux kernel (2.2.19 or 2.4.6).
• Objectivity doesn't behave on GFS: one must duplicate the federation files for each machine that wants access.
• GFS itself is on the complicated side, and is unfamiliar to sysadmins.
• We haven't explored using GFS machines as file servers.
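The Objectivity workaround above amounts to keeping a per-machine copy of the federation file. A minimal sketch, with placeholder paths and node names (a production version would scp to each node rather than copy locally):

```shell
#!/bin/sh
# Hypothetical per-node duplication of an Objectivity federation file.
# SRC, DEST, and NODES are placeholders for illustration only.
SRC=${SRC:-/tmp/federation.fdb}
DEST=${DEST:-/tmp/fed-copies}
NODES="node1 node2 node3"

[ -f "$SRC" ] || echo "placeholder federation" > "$SRC"
mkdir -p "$DEST"
for n in $NODES; do
    cp "$SRC" "$DEST/$n.fdb"   # in production: scp "$SRC" "$n:$SRC"
done
ls "$DEST"
```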
Global File System II
The biggest issues are node death, power-on, and IP locking.
Node death:
• GFS is a journaling file system; who replays the journal? STOMITH: shoot the other machine in the head.
• Surviving nodes must be able to power-cycle a dead node so that they can safely replay its journal.
Power-on:
• Scriptable via /etc/rc.d/init.d.
• We have never survived a power loss without human intervention at power-up.
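The power-on step can be scripted in the usual /etc/rc.d/init.d shape, as in the skeleton below. The pool-assembly and mount commands are left commented out because their exact names depend on the GFS release; they are assumptions, not a verified recipe, and the echo lines merely trace the sequence.

```shell
#!/bin/sh
# Sketch of an /etc/rc.d/init.d script for bringing GFS up at power-on.
# The commented commands are assumed names from GFS 4.x-era tooling;
# the device and mount point are examples.
gfs_start() {
    echo "assembling storage pool"
    # passemble                          # scan FC disks, build /dev/pool devices
    echo "mounting gfs"
    # mount -t gfs /dev/pool/tam0 /gfs
}
gfs_stop() {
    echo "unmounting gfs"
    # umount /gfs
}
case "$1" in
    start) gfs_start ;;
    stop)  gfs_stop ;;
    *)     echo "Usage: $0 {start|stop}" ;;
esac
```

Even with such a script, nodes that boot before the lock service is available still need attention, which is where the human intervention noted above has come in.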
IP locking:
• DMEP is the device-based locking protocol; it is very new, with support from two vendors.
• memexpd is an IP lock server:
  – Allows any hardware for the disks.
  – Is a single point of failure and a potential throughput bottleneck.