Linux Clusters in ITD
Posted on 01-Feb-2016
Brookhaven Science Associates / U.S. Department of Energy
Linux Clusters in ITD
Efstratios Efstathiadis, Information Technology Division
- Linux in Scientific Computing
- Large Scale Linux Installation & Configuration
- File Sharing: NAS/SAN, NFS, PVFS
- Cluster Interconnects
- Load Management Systems
- Parallel Computing
- System Monitoring Tools
- Linux Clusters in ITD
- Thoughts, Conclusions
Outline
Features of Scientific Computing:
- Floating-point performance is very important
- Users write their own code; Fortran is common
- GUIs and user-friendly interfaces are not required
- The goal is Science, not Computer Science
Linux in Scientific Computing
Scientific computing is one of the first areas where Linux has had a major impact on production and mission-critical computing.
- Access to cheap hardware
- License issues
- Vendor response/support is slow
- Access to source code is needed to implement desired features
- Availability of manpower
- Availability of scientific tools/resources
Linux in Scientific Computing
SUN (www.sun.com/linux):
- Porting its software products to Linux (Java 2, Forte for Java, OpenOffice, Grid Engine)
- Porting Linux to the UltraSPARC architecture
- Provides common utilities for Solaris and Linux so that users can move between the two
- Improves compatibility between the two so that applications can run on both
- Sun StorEdge T3 arrays are compatible with Linux

IBM (www-1.ibm.com/linux):
- AFS, storage devices
- Linux pre-installed (20% of its Intel-based servers are Linux)
- openclustergroup
- Spends over $1.3B supporting Linux
Linux Endorsement
Most popular: x86, Alpha, SPARC, PowerPC, MIPS
Which processor? It depends on: cost, performance, availability of software
Processor Support in Linux
SPEC: Standard Performance Evaluation Corporation (http://www.spec.org)
SPECint95: 8 integer-intensive C codes
SPECfp95: 10 floating-point scientific FORTRAN codes
Performance
Processor      MHz   SPECfp95   SPECint95
Alpha 21264    500       48.4        23.6
UltraSPARC     450       27.0        19.7
Athlon         650       22.4        29.4
PIII/500       500       15.1        21.6
Compilers:
gcc, g++, g77: available on all platforms, but:
- Generated code is not very fast
- No parallelization for SMPs
- g77 is Fortran 77 only
- g++ has its limitations

x86 compilers:
- Portland Group (www.pgroup.com): Fortran 90/95, OpenMP parallelization, HPF, better performance (~15%)
- Kuck and Associates (www.kai.com): C++/OpenMP parallelization
- NAG, Absoft, Fujitsu, etc.
Software
International Data Corp. (IDC) Cluster requirements:
Software must provide an environment that looks to all users as much like a single system as possible.
Environment must provide higher data and app availability than is possible on single systems.
Developers must not have to use special APIs to have the app work in clustering environment.
Administrators must be able to treat the configuration as a single management domain.
There must be facilities for components of a single app, or the entire app, to be run in parallel on many different processors to improve single-app performance or the overall scalability of the environment.
What is a Cluster?
Cluster
A cluster is a collection of interconnected computers that can be viewed and used as a single, unified computing resource.
What is a Cluster?
Choice of:
Diskless install:
- One copy of Linux to maintain
- Requires special tools
- Doesn't scale to a large number of nodes
Local install:
- Kickstart
- System Imager
- LUI (Linux Utilities: Installation)
- g4u (Ghost for Unix: http://www.feyrer.de/g4u/)
Large Scale Linux Installation
Kickstart
Pulls and installs a list of RPM packages from a Red Hat mirror site (such as linux.bnl.gov) specified in a configuration file (ks.cfg).
- Cluster nodes must be on a public network
- Have to maintain several ks.cfg configuration files
- Red Hat only
- No easy way to propagate configuration changes
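As a sketch, a ks.cfg for this setup might look like the following; the partition sizes, package group, and %post contents are illustrative assumptions (only the mirror linux.bnl.gov comes from the text), with syntax following Red Hat 6.x Kickstart:

```text
install
lang en_US
keyboard us
network --bootproto dhcp
nfs --server linux.bnl.gov --dir /export/redhat
rootpw changeme
zerombr yes
clearpart --all
part / --size 1800 --grow
part swap --size 256
lilo --location mbr
%packages
@ Networked Workstation
%post
echo "node installed via kickstart" >> /etc/motd
```

Each node boots the installer, fetches this file, and installs unattended; the pain point noted above is that per-node variations mean maintaining several such files.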
Large Scale Linux Installation
System Imager (http://systemimager.sourceforge.net)
- The Image Server pulls the system image of a master client; cluster nodes can then pull the image of their choice from the Image Server.
- Cluster nodes use rsync and tftp to pull images from the Image Server.
- Can be done on a private network.
- Supports several Linux distributions.
- Configuration changes can be easily propagated to clients through rsync.
- rsync (http://rsync.samba.org) can pull just the new and modified files off the server rather than the whole system image.
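The delta behavior is easy to demonstrate with two local directories standing in for the Image Server and a node (the paths are arbitrary; a real System Imager pull targets an rsync server instead of a local path):

```shell
# rsync transfers only new or modified files; shown here with local dirs.
mkdir -p /tmp/image /tmp/node
echo "v1" > /tmp/image/motd
rsync -a /tmp/image/ /tmp/node/    # first sync copies everything
echo "v2" > /tmp/image/motd
rsync -a /tmp/image/ /tmp/node/    # second sync sends only the changed file
cat /tmp/node/motd
```

This is why propagating a configuration change to a whole cluster is cheap: each node pulls only the files that differ from the stored image.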
Large Scale Linux Installation
File Sharing: DAS
DAS: Direct Attached Storage
File Sharing: NAS
NAS: Network Attached Storage
File Sharing: NAS
Network Attached Storage (NAS): shared storage on a network; a dedicated, high-performance, single-purpose machine.
- Separates data servers from application servers
- Provides centralized data management
- Scalability
- Dynamic growth of filesystems (LVM)
- Journaling filesystems
- RAID controllers
- Support for multiple protocols (NFS, CIFS, HTTP, FTP, etc.)
- Multiple network interfaces
- Uses existing network infrastructure
- Web admin/monitor GUI (netattach)
- Redundant power supplies/fans/cables
- Linux support
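"Dynamic growth of filesystems (LVM)" means a volume can be grown in place; a hedged sketch of the admin-side commands (the volume group, logical volume, and sizes are hypothetical, and the resize tool depends on the filesystem in use):

```shell
# Grow a logical volume and its filesystem (hypothetical names/sizes;
# requires a real LVM volume group with free extents).
lvextend -L +10G /dev/vg0/home    # add 10 GB to the logical volume
resize2fs /dev/vg0/home           # grow the ext2/ext3 filesystem to match
df -h /home                       # verify the new size
```

The point for a NAS box is that /home grows without re-partitioning or downtime for the clients mounting it.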
File Sharing: SAN
SAN: Storage Area Network
File Sharing: SAN
Storage Area Network (SAN): shared storage on a network; a dedicated, high-performance network connecting storage elements to the back end of the servers.
- Provides the benefits of NAS and also isolates storage traffic onto a dedicated high-performance network.
- Disk drives are attached directly to a Fibre Channel network (not acceptable on a TCP/IP network).

SAN disadvantages:
- Expensive: must build a dedicated, high-performance network
- Lack of strong standards
- Proprietary solutions only
File Sharing: NFS
- What is NFS?
The Network File System (NFS) protocol provides transparent remote access to shared file systems across networks. The NFS protocol is designed to be machine, operating system, network architecture and transport protocol independent. This independence is achieved through the use of Remote Procedure Calls (RPC) primitives built on top of eXternal Data Representation (XDR).
- How is NFS3 different from NFS2?

NFS version 3 adds:
- 64-bit file system support (version 2 is limited to 32 bits)
- Reliable asynchronous writes (version 2 supports only synchronous writes)
- Better cache consistency, by providing attribute information before and after an operation
- Better performance on directory lookups via READDIRPLUS calls, which reduce the number of messages passed between client and server; READDIRPLUS returns file handles and attributes in addition to directory entries
- The maximum data transfer size, fixed at 8 KB in NFS2, is now set by values in the FSINFO return structure
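The version differences surface directly in the client's mount options. A sketch of mounting the same export with each version (the server name and mount point are hypothetical; the option names follow the Linux NFS client):

```shell
# NFSv2: transfer size capped at 8 KB
mount -o nfsvers=2,rsize=8192,wsize=8192 server:/scratch /mnt/scratch
umount /mnt/scratch
# NFSv3: larger transfers; the server advertises its maximum via FSINFO
mount -o nfsvers=3,rsize=32768,wsize=32768 server:/scratch /mnt/scratch
```

These commands need a live NFS server, so they are shown only as a command sketch.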
File Sharing: NFS
NFS Benchmarking: Bonnie (http://www.textuality.com/bonnie)
The same filesystem (/scratch) is mounted on the Linux client using different NFS versions. The NFS server runs Solaris 7 (sun3.bnl.gov); the client is a dual 800 MHz Red Hat 6.2 host (BLC).

All rates in K/sec with %CPU in parentheses; random seeks in seeks/sec.

         MB   SeqOut/Char    SeqOut/Block   Rewrite        SeqIn/Char      SeqIn/Block     Seeks
NFSv2  1000     698 ( 2.2)     619 (0.5)      655 ( 1.1)     5615 (17.3)     9994 (12.9)   176.2 ( 2.8)
NFSv3  1000    4630 (15.6)    4631 (4.4)     2329 ( 4.6)    11233 (39.3)    11185 (16.6)   695.4 (11.5)
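The table can be summarized as v3-over-v2 speedup ratios; a quick check with awk, using the per-char write, block write, and seek numbers from the table:

```shell
# Compute the NFSv3/NFSv2 speedup from the Bonnie results above.
awk 'BEGIN {
    printf "per-char write: %.1fx\n", 4630/698
    printf "block write:    %.1fx\n", 4631/619
    printf "random seeks:   %.1fx\n", 695.4/176.2
}'
```

Sequential writes improve by roughly a factor of seven, which matches NFSv3's asynchronous-write support; reads, already cached aggressively under v2, gain much less.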
File Sharing: NFS
NFS Benchmarking: Bonnie. Linux server, Linux client (Red Hat 6.2, 2.2.18). Rates in K/sec with %CPU in parentheses; seeks in seeks/sec.

            MB   SeqOut/Char     SeqOut/Block    Rewrite         SeqIn/Char       SeqIn/Block      Seeks
NFSv3-v2  1000    9787 (33.5)     9886 ( 8.7)    3222 ( 5.6)      8630 (29.3)      9110 (13.6)    115.9 (0.9)
NFSv3-v3  1000    9848 (34.1)     9911 ( 8.9)    3227 ( 5.6)      8740 (30.0)      9087 (12.7)    116.5 (1.0)
Local     1000   19643 (61.5)    25040 (11.7)    8297 (12.7)     15651 (40.9)     18885 (11.1)   1150.4 (6.9)
File Sharing: NFS
Network Attached Storage (NAS) benchmark:
- Linux server (VA 9450NAS): dual PIII Xeon 700 MHz, 2.0 GB RAM, Red Hat 6.2, NFSv3, Mylex extremeRAID 2000, ext3
- Linux client (VA 2200): dual PIII 800 MHz, 0.5 GB RAM, Red Hat 6.2, 2.2.18 kernel with NFSv3
- Server and client are on the same network switch (Cisco 4006)

Rates in K/sec with %CPU in parentheses; seeks in seeks/sec.

            MB   SeqOut/Char    SeqOut/Block   Rewrite        SeqIn/Char       SeqIn/Block     Seeks
NFSv3-v3  1000    9848 (34.1)    9911 (8.9)    3227 ( 5.6)     8740 (30.0)      9087 (12.7)    116.5 ( 1.0)
NASv3-v3  1024    7551 (25.8)    7544 (6.8)    5048 (10.7)    11504 (36.9)     11488 (15.9)   1640.2 (22.1)
NASv3-v3  2047    7428 (25.5)    7427 (7.0)    4021 ( 7.8)    10297 (33.7)      9940 (13.9)    740.5 (13.5)
NFSv3-v2  1000    9787 (33.5)    9886 (8.7)    3222 ( 5.6)     8630 (29.3)      9110 (13.6)    115.9 ( 0.9)
NASv3-v2   512    7430 (25.1)    7280 (7.0)    6500 (10.4)    26453 (63.8)     19378 (16.2)   6881.1 (87.7)
NASv3-v2  1024    7360 (25.3)    7431 (7.0)    4811 ( 9.1)    11357 (36.5)     10985 (16.3)   1742.8 (24.0)
NASv3-v2  2047    7348 (25.1)    7398 (6.9)    4154 ( 8.2)    10813 (34.9)     10474 (15.1)    981.3 (13.5)
File Sharing: NFS
The setup of having a SUN workstation as a "main node" serving home directories is pretty common: quark.phy.bnl.gov, sun1.sns.bnl.gov, sun2.bnl.gov (sun65.bnl.gov), etc.
http://linux.itd.bnl.gov/NFS
http://nfs.sourceforge.net
File Sharing: PVFS
PVFS: Parallel Virtual File System (http://www.parl.clemson.edu/pvfs/desc.html)
Stripes file data across multiple disks on different nodes (I/O nodes) in a cluster. This way large files can be created and bandwidth is increased.
Four major components of the PVFS system:
- Metadata server (mgr)
- I/O server (iod)
- PVFS native API (libpvfs)
- PVFS Linux kernel support
Cluster Network
- Fast Ethernet: transmission speed 0.1 Gbps, latency ~100 μs, cost < $1,000/connection
- Gigabit Ethernet: maximum bandwidth 1.0 Gbps; cost ~$1,650/connection (based on 64 ports, copper)
- Myrinet: low-latency, short-distance network (System Area Network); maximum bandwidth 1.2 Gbps, latency ~9 μs, cost > $2,500/connection; single-vendor hardware
- CDIC cluster interconnect: Cisco 4006, 48x3-port full-duplex Fast Ethernet switch
Network Graph
http://linux.itd.bnl.gov/netpipe
Network Signature
Cluster Network: Private vs Public
Private network: cluster security/setup/administration is much easier; applications cannot interact with the outside world.
Public network: security/setup/administration is difficult; IP addresses are needed; interaction is possible.
Load Management Systems (LMS)
Transparent load sharing: users submit jobs without being concerned with which cluster resource is used to process the job.
Control over resource sharing: rather than leaving it up to individuals to search the network for available resources and capacity to run their jobs, the LMS controls the resources in the cluster. The LMS takes into account the specifications or requirements of the job when assigning resources, matching the requirements with the resources available.
Implement policies: in an LMS, rules can be established that automatically set priorities for jobs among groups or teams. This enables the LMS to implement resource sharing between groups.
Load Management Systems (LMS)
- Batch queuing
- Load balancing
- Failover capability
- Job accounting/statistics
- User-specifiable resources
- Relinking/recompiling of application programs
- Fault tolerance
- Suspend/resume jobs
- Job status
- Host status
- Meta-job capability
- Cluster-wide resources
- Job migration
- Central control
Load Management Systems (LMS)
Numerous LMSs (or CMSs) are available on Linux:
- Portable Batch System (PBS) (http://pbs.mrj.com): developed by NASA; freely distributed by a commercial company which can also provide service and support
- Load Share Facility (LSF) (http://www.platform.com)
- Sun Grid Engine (CODINE) (http://www.sun.com/software/gridware/linux/)
- Distributed Queuing System (DQS) (http://www.scri.fsu.edu/~pasko/dqs.html)
- Generic Network Queuing System (GNQS) (http://www.gnqs.org/)
- LoadLeveler (http://www.austin.ibm.com/software/sp_products/loadlev.html): developed by IBM; a modified version of the Condor batch queuing system (http://www.cs.wisc.edu/condor/)
Load Management Systems (LMS)
Portable Batch System (PBS)
PBS was designed and developed by NASA to provide control over the initiation, scheduling, and execution of batch jobs.
- User interfaces: GUI (xpbs) and command line interface (CLI)
- Heterogeneous clusters
- Interactive jobs (debugging sessions or jobs that require user command-line input) and batch jobs
- Parallel code support for MPI, PVM, HPF
- File staging
- Automatic load-leveling: the PBS scheduler offers numerous ways to distribute workload across the cluster, based on hardware configuration, resource availability, and keyboard activity
- Job accounting
- Cross-system scheduling
- Web site: http://pbs.mrj.com; short introduction at BNL: http://www.itd.bnl.gov/bcf/cluster/pbs/
- The PBS server is an SPF (Single Point of Failure)
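As a sketch, a minimal PBS batch script for an MPI job might look like the following; the resource requests, process count, and application name are illustrative assumptions, and the script would be submitted with qsub:

```shell
#!/bin/sh
#PBS -N mpi_job              # job name (illustrative)
#PBS -l nodes=4:ppn=2        # 4 dual-CPU nodes (site-specific assumption)
#PBS -l walltime=01:00:00    # wall-clock limit
#PBS -j oe                   # merge stdout and stderr
cd $PBS_O_WORKDIR            # run from the submission directory
mpirun -np 8 -machinefile $PBS_NODEFILE ./my_app
```

PBS writes the allocated hosts into $PBS_NODEFILE, which mpirun consumes, so users never pick nodes by hand; that is the "transparent load sharing" described earlier.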
Parallel Processing
The use of multiple processors to execute different parts of the program simultaneously.
The main goal is to reduce wall-clock time (also cost, memory constraints, etc.).
Things to consider:
- Is the problem parallelizable? (e.g., the recurrence F(k+2) = F(k+1) + F(k) is inherently sequential)
- Parallel overhead (the amount of time required to coordinate parallel tasks)
- Synchronization of parallel tasks (waiting for two or more tasks to reach a specified point)
- Granularity of the problem
- SMP vs DMP (is the network a factor?)
Parallel Processing
- Threads: used on SMP hosts only; not widely used in scientific computing.
- Compiler-generated parallel programs: the compiler detects concurrency in loops and distributes the work in a loop to different threads; the compiler is usually assisted by compiler directives.
- MPI, PVM
- Embarrassing parallelism: independent processes can be executed in parallel with little or no coupling between them.
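Embarrassing parallelism needs no message-passing library at all; a minimal shell sketch launching independent tasks as background processes (the file names and the per-task work are arbitrary):

```shell
# Each task is fully independent: no communication, no shared state.
for i in 1 2 3 4; do
    ( echo "task $i: $((i * i))" > /tmp/task$i.out ) &   # run in background
done
wait    # the only synchronization point: all tasks finished
cat /tmp/task1.out /tmp/task2.out /tmp/task3.out /tmp/task4.out
```

On a cluster the same pattern is realized by submitting many independent batch jobs; the LMS plays the role of the shell's `&` and `wait`.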
Message Passing Interface (MPI)
MPI is a message passing library, a collection of subroutines to facilitate the communication (exchange of data and synchronization) among processes in a distributed memory program.
MPI offers portability and performance; it is not a true standard.
Messages are the actual data that you send/receive plus an envelope of information that helps route the data. In MPI message-passing calls there are three parameters that describe the data and another three that specify the routing (envelope).
Data: startbuf, count, datatype. Envelope: dest, tag, communicator.
Messages are sent over TCP sockets.
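A minimal send/receive pair makes the data/envelope split concrete. The sketch below assumes an MPICH installation providing mpicc and mpirun (as on the clusters described later); the program itself uses only standard MPI calls:

```shell
cat > ring.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* data: &value, 1, MPI_INT; envelope: dest=1, tag=0, MPI_COMM_WORLD */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
EOF
mpicc ring.c -o ring
mpirun -np 2 ./ring
```

The three data arguments (buffer, count, datatype) and three envelope arguments (dest/source, tag, communicator) appear in exactly that order in both calls.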
Message Passing Interface (MPI)
- Number of processes: are we overdoing it with too many processes, creating too much message passing over the network and increasing synchronization time?
- Message size: is it optimal for our network technology? What bandwidth do we get when we pass messages of different sizes?
- Design: is our problem very fine-grained? Do we take advantage of loop unrolling?
- Use the right compiler flags.
- Take advantage of the resources: avoid nodes that are busy. Sometimes quite slow nodes can do the job as well as a fast but busy node.
- Benchmark. Get the numbers. What is important to your code? Memory, CPU, etc.
Benchmarking MPI
- Tools included in the MPICH distribution
- LLC Bench (http://icl.cs.utk.edu/projects/llcbench/index.htm)
- Vampir MPI performance analysis tool
- NAS Parallel Benchmarks (NPB): http://www.nas.nasa.gov/software/NPB (NAS: Numerical Aerospace Simulation)
- NPB is installed on BGC under /opt/pgi/bench/NPB2.3
- NPB benchmark results: only one program seems to benefit from the Gigabit interconnect upgrade, due to large message passing
- NAS serial version (helps understand the architecture)
[Chart: NPB-2.3-Serial (W class) results in MFlops, 0 to 250, for the BT, LU, SP, CG, and MG kernels on the BLC-W, BGC-W, SUN2-W, and SUN65-W hosts.]
Cluster Monitors
Administrators: cluster usage, log file scans, intrusion detection, hardware and software inventories, etc.
Users: How many nodes are in the cluster? What type of nodes (architecture): PC, SPARC, SGI? Available resources (CPU speed, disk space, memory, etc.)? What is being used? Which nodes are "empty"?
Cluster Monitors
Load Management Systems provide some sort of monitoring. HP OpenView.
Open source products:
- Pong3 (Perl, www.megacity.org/pong3)
- System-Info (Perl, http://blaine.res.wpi.net/)
- spong (Perl, spong.sourceforge.net)
- bWatch (Tcl/Tk and ssh, user customizable)
- Vacum (VA product)
- rps (Perl, rps.sourceforge.net)
HP OpenView
IT Operations (ITO):
- ITO agent running on each client
- ITO central console
- ITO agents monitor client log files and report events back to the central console
- Actions can be taken either manually or automatically at the console in response to events

RADIA (Novadigm product):
- Hardware and software inventories
- Software distributions

NNM (Network Node Manager):
- Net Metrics: report network statistics
- Net Rangers: intrusion detection
Spong (spong.sourceforge.net)
- Provides CPU, memory, and disk utilization
- Checks availability of services (ssh, http, PBS, etc.) (*)
- Lists running jobs, sorted by CPU usage (*)
- Keeps a history of events per host
- Warns admins (by email) of status changes
- Usage graphs (per hour/day/month/year) (**)
- Scans log files
- Per-host configuration
- Open source
(*) modified/enhanced; (**) unstable
Remote ps (http://rps.sourceforge.net)
The CDIC Cluster (BGC)
- 49 nodes on a local network
- bgc000: master node, 2 NICs
- 001-029: dual PIII 700 MHz, 1 GB memory, 8 GB disk
- 030-047: dual PIII 500 MHz, 0.5 GB memory, 2 GB disk
- bgc-f1: fileserver hosting "local" user home directories (50 GB); "public" home directories are mounted on the master node only, under /itd
- Red Hat 6.2
- PBS (with MPI support), MPICH-1.2.1
- Initial installation with Kickstart; rsync is used to propagate updates
- Portland Group compilers (3.2): 16 CPUs, 2 users
- Monitors: bWatch, spong, pong3
- Interim backup solution
The Brookhaven Cluster (BLC)
General-purpose cluster
- 60 nodes on a public network
- blc000.bnl.gov: master node
- 001-040: dual PIII 800 MHz, 0.5 GB memory, 2x9 GB disk
- 041-059: dual PIII 500 MHz, 0.5 GB memory, 18 GB disk
- Red Hat 6.2
- PBS (with MPI support), MPICH-1.2.1
- Home directories are hosted on a Solaris file server (userdata.bnl.gov)
- Initial installation, configuration, and updates with System Imager (host images are kept on the master node)
- Portland Group compilers (3.2): 64 CPUs, 4 users
- Monitors: bWatch, spong, pong3, HPOV
The SNS Cluster (SNSC)
- 6 nodes on a public network
- snsc00.sns.bnl.gov: master node
- 01-05: dual PIII 700 MHz, 0.5 GB memory, 18 GB disk
- Red Hat 7.0
- PBS (with MPI support) (?), MPICH-1.2.1
- Home directories are hosted on a Solaris file server (sun1.sns.bnl.gov)
- Initial installation, configuration, and updates with System Imager (host images are kept on the master node)
Thoughts ...
Need for centralized cluster management and homogeneity:
- Easier monitoring, administration, maintenance, and recovery
- Users will have the option to share resources (idle CPUs, filesystems (NAS), network switches, printers, etc.) and costs (licenses, software)
- Increased user interaction
- Faster integration of new groups
[Diagram: proposed consolidated layout. The BLC, BGC, SNSC, and VIS nodes (BLC 1..N, BGC 1..N, SNSC 1..N, VIS 1..N) sit on a private network (192.168.50.0) behind a gateway to the public network, alongside shared Gateway, Fileserver1, and Software1 hosts.]
Linux in ITD
- Mail gateway, DNS, DHCP, proxies, etc.
- Linux clusters are growing in size and number.
- Linux is becoming a player (backups, HPOV, SAN, security, manpower, mirror sites, etc.)
- Linux is becoming a common platform for Scientific Computing (RHIC, CDIC, SNS, g-2, Physics Theory, ...)