KAIST CS610
Brief presentation of Earth Simulation Center
Jang, Jae-Wan
Jan 15, 2016
Hardware configuration
Highly parallel vector supercomputer of the distributed-memory type
640 Processor nodes (PNs)
Each PN: 8 vector-type arithmetic processors (APs)
16 GB main memory
Remote control and I/O parts
Arithmetic processor
Processor node
Interconnection network
Earth Simulator Research and Development Center (building footprint: 65 m × 50 m)
Software
OS: NEC's UNIX-based SUPER-UX
Supported languages: Fortran90, C, C++ (modified for ES)
Programming model (hybrid / flat), sketched in code below:
  Inter-PN : HPF / MPI (both models)
  Intra-PN : Microtasking / OpenMP (hybrid model)
  AP       : Automatic vectorization
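To make the three levels concrete, here is a minimal hybrid-model sketch in C: MPI across PNs, OpenMP within a PN, and a simple inner loop the compiler can vectorize on each AP. This is an illustrative example only, not Earth Simulator production code (the ES environment would typically use Fortran90 or the ES-modified C).

```c
/* Hybrid-model sketch: MPI across nodes, OpenMP within a node,
 * and a vectorizable inner loop for each arithmetic processor.
 * Illustrative only -- not Earth Simulator production code.     */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double a[N], b[N], c[N];
    int rank, nprocs, i;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);                        /* inter-PN level: MPI     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    #pragma omp parallel for reduction(+:local)   /* intra-PN level: OpenMP  */
    for (i = 0; i < N; i++) {
        b[i] = (double)(rank + i);
        c[i] = 2.0;
        a[i] = b[i] * c[i];                        /* simple loop the AP can  */
        local += a[i];                             /* vectorize automatically */
    }

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %e\n", global);

    MPI_Finalize();
    return 0;
}
```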
Earth Simulator Center: First results from the Earth Simulator
  Resolution 300 km
  Resolution 120 km
  Resolution 20 km
  Resolution 10 km
First results from the Earth Simulator
  Ocean Circulation Model (MOM3, developed by GFDL)
  Resolution: 0.1° × 0.1° (~10 km)
  Initial condition: Levitus data (1982)
  Computer resources: 175 nodes, elapsed time 8,100 hours
First results from the Earth Simulator
  Ocean Circulation Model (MOM3, developed by GFDL)
  Resolution comparison: 0.1° × 0.1° (~10 km) vs. 1° × 1° (~100 km)
Terascale Cluster: System X
Virginia Tech, Apple, Mellanox, Cisco, and Liebert
2003. 3. 16
Daewoo Lee
Terascale Cluster: System X
  A groundbreaking supercomputer cluster built with industrial assistance from Apple, Mellanox, Cisco, and Liebert
  $5.2 million for hardware
  10,280 GFlops sustained / 17,600 GFlops peak with 1,100 nodes (ranked 3rd on the TOP500 list)
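For scale, the 17,600 GFlops peak corresponds to 1,100 nodes × 2 CPUs × 8 GFlops per CPU, assuming the PowerPC 970s run at 2.0 GHz and complete 4 floating-point operations per cycle (the clock rate is not stated on these slides); the 10,280 GFlops figure is the sustained Linpack result.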
Goals
  Computational Science and Engineering Research
    Nanoscale Electronics
    Quantum Chemistry
    Molecular Statistics
    Fluid Dynamics
    Large-Scale Network Emulation
    Optimal Design
    …
  Experimental System
    Fault Tolerance and Migration
    Queuing System and Scheduler
    Distributed Operating System
    Parallel Filesystem
    Middleware for Grids
    Authentication/Security System
    …
  Dual Usage Mode (90% of computational cycles devoted to production use)
Hardware Architecture
  Node: Apple G5 platform
    Dual IBM PowerPC 970 (64-bit CPUs)
  Primary communication
    InfiniBand by Mellanox (20 Gbps full duplex, fat-tree topology)
  Secondary communication
    Gigabit Ethernet by Cisco
  Cooling system by Liebert
Software
  Mac OS X (FreeBSD based)
  MPI-2 (MPICH-2), illustrated below
  Supports C/C++/Fortran compilation
  Déjà vu: a transparent fault-tolerance system that keeps the cluster stable by migrating a failed application to another node transparently, so the application keeps running intact
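As a rough illustration of the MPI-2 environment listed above, here is a minimal one-sided communication sketch in C (one-sided windows were introduced in MPI-2 and are provided by MPICH-2). This is illustrative only, not System X-specific code.

```c
/* MPI-2 one-sided communication sketch (a feature added in MPI-2,
 * which MPICH-2 provides); illustrative, not System X-specific code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every rank exposes one int as a window for remote access. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        int value = 42;
        /* Rank 0 writes directly into rank 1's window. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1: buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```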
Reference
  Terascale Cluster web site: http://computing.vt.edu/research_computing/terascale
4th fastest supercomputer: Tungsten
Pak, Eunji
4th: NCSA Tungsten
Top500.org
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign
Tungsten Architecture [1/3]
Tungsten
Xeon 3.0 GHz Dell cluster
2,560 processors
3 GB memory/node
Peak performance: 15.36 TF
Top 500 list debut: #4 (9.819 TF, November 2003)
Currently 4th fastest supercomputer in the world
Tungsten Architecture [2/3]
Components
Cluster diagram: a Myrinet interconnect links 1,280 compute nodes (2,560 processors, dual-CPU) and a 104-node I/O sub-cluster with 122 TB of shared storage.
Compute node: Dell PowerEdge 1750 with 3 GB DDR SDRAM, dual Intel Xeon 3.06 GHz
Linux 2.4.20 (Red Hat 9.0)
Cluster File System
Compilers
Intel Fortran 77/90/95, C, C++; GNU Fortran 77, C, C++
LSF + Maui Scheduler
User Applications
Tungsten Architecture [3/3]
  1,450 nodes: Dell PowerEdge 1750 servers
    Intel Xeon 3.06 GHz: peak performance 6.12 GFLOPS per processor
    1,280 compute nodes, 104 I/O nodes
  Parallel I/O: 11.1 gigabytes per second (GB/s) of I/O throughput
    Complements the cluster's 9.8 TFLOPS of computational capability
  104-node I/O sub-cluster with more than 120 TB
    Node local: 73 GB, Shared: 122 TB
Applications on Tungsten [1/3]
  PAPI and PerfSuite
    PAPI: portable interface to hardware performance counters
    PerfSuite: set of tools for performance analysis on Linux platforms
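A rough sketch of how PAPI's low-level API is commonly used to read a hardware counter around a code region. This is illustrative; the chosen event and the abbreviated error handling are assumptions, not taken from the slides.

```c
/* Sketch of PAPI low-level usage: count floating-point operations
 * around a loop. Illustrative; error handling is abbreviated.     */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long count = 0;
    double sum = 0.0;
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_FP_OPS);   /* floating-point ops    */

    PAPI_start(event_set);
    for (i = 1; i <= 1000000; i++)
        sum += 1.0 / (double)i;               /* region being measured */
    PAPI_stop(event_set, &count);

    printf("sum = %f, PAPI_FP_OPS = %lld\n", sum, count);
    return 0;
}
```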
Applications on Tungsten [2/3]
  PAPI and PerfSuite (continued)
Applications on Tungsten [3/3]
  CHARMM (Harvard version): Chemistry at Harvard Macromolecular Mechanics
    General-purpose molecular mechanics, molecular dynamics, and vibrational analysis package
  Amber 7.0
    A set of molecular mechanical force fields for the simulation of biomolecules
    Package of molecular simulation programs
MPP2 Supercomputer: The world's largest Itanium2 cluster
Molecular Science Computing Facility, Pacific Northwest National Laboratory
2004. 3. 16
Presentation: Kim SangWon
Contents
MPP2 Supercomputer Overview
Configuration
HP rx2600(Longs Peak) Node
QsNet ELAN Interconnect Network
System/Application Software
File System
Future Plan
MPP2 Overview
  MPP2: the High Performance Computing System-2
  At the Molecular Science Computing Facility in the William R. Wiley Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory
  The fifth-fastest supercomputer in the world on the November 2003 TOP500 list
MPP2 Overview
  System name: MPP2, a Linux supercomputer cluster
  11.8 (8.633) Teraflops, 6.8 Terabytes of memory
  Purpose: Production
  Platform: HP Integrity rx2600, dual Itanium2 1.5 GHz per node
  Nodes: 980 (Processors: 1,960)
  ¾ Megawatt of power, 220 tons of air conditioning, 4,000 sq. ft.
  Cost: $24.5 million (estimated)
Generator
UPS
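For reference, the 11.8 Tflops peak is consistent with the processor count above: 1,960 Itanium2 processors × 1.5 GHz × 4 floating-point operations per cycle ≈ 11.8 Tflops (4 FP ops/cycle is the standard Itanium2 figure, not stated on the slide); 8.633 Tflops is the measured Linpack result.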
Configuration (Phase 2b)
  Diagram summary: 928 compute nodes (1,856 Madison batch CPUs; 1,900 next-generation Itanium processors in total), 4 login nodes with 4Gb-Enet, 2 system management nodes
  11.4 TF, 6.8 TB memory
  Quadrics Elan3 and Elan4 interconnects (Elan4 not yet operational), Lustre over a 53 TB SAN
  Operational: September 2003
HP rx2600 "Longs Peak" Node Architecture
  Each node has:
    2 Intel Itanium 2 processors (1.5 GHz)
    6.4 GB/s system bus
    8.5 GB/s memory bus
    12 GB of RAM
    1 1000T connection
    1 100T connection
    1 serial connection
    2 Elan3 connections
  Node diagram: two PCI-X buses (1 GB/s) hosting the Elan3 adapters, plus dual SCSI-160 channels
QsNet ELAN Interconnect Network
  High bandwidth, ultra-low latency, and scalability
    900 Mbytes/s user-space to user-space bandwidth
    1,024 nodes in a standard QsNet configuration, rising to 4,096 in QsNetII systems
  Optimized libraries for common distributed-memory programming models exploit the full capabilities of the base hardware (see the ping-pong sketch below)
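Bandwidth figures like the 900 Mbytes/s quoted above are typically measured with a ping-pong microbenchmark between two ranks. A rough MPI sketch follows; it is illustrative only (the message size and repetition count are arbitrary choices, and this is not a Quadrics-supplied benchmark).

```c
/* Ping-pong sketch for measuring point-to-point bandwidth between
 * two ranks; illustrative only, not a Quadrics-supplied benchmark. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MB messages (arbitrary)   */
#define REPS      100

int main(int argc, char **argv)
{
    int rank, i;
    char *buf = malloc(MSG_BYTES);
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* round trip moves 2 * MSG_BYTES per repetition */
        printf("bandwidth ~ %.1f MB/s\n",
               2.0 * REPS * MSG_BYTES / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```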
Software on MPP2 (1/2)
  System Software
    Operating System: Red Hat Linux 7.2 Advanced Server
      NWLinux: tailored to IA64 clusters (2.4.18 kernel with various patches)
    Cluster Management: Resource Management System (RMS) by Quadrics
      A single point of interface to the system for resource management: monitoring, fault diagnosis, data collection, allocating CPUs, parallel job execution, …
  Job Management Software
    LSF (Load Sharing Facility) batch scheduler
    QBank: controls and manages CPU resources allocated to projects or users
  Compiler Software
    C (ecc), F77/F90/F95 (efc), g++
  Code Development
    Etnus TotalView: a parallel and multithreaded application debugger
    Vampir: a GUI-driven front end used to visualize profile data from a program run
    gdb
Software on MPP2 (2/2)
  Application Software
    Quantum Chemistry Codes
      GAMESS (The General Atomic and Molecular Electronic Structure System): performs a variety of ab initio molecular orbital (MO) calculations
      MOLPRO: an advanced ab initio quantum chemistry software package
      NWChem: computational chemistry software developed by EMSL
      ADF (Amsterdam Density Functional) 2000: software for first-principles electronic structure calculations via Density Functional Theory (DFT)
    General Molecular Modeling Software: Amber
    Unstructured Mesh Modeling Codes
      NWGrid (grid generator): hybrid mesh generation, mesh optimization, and dynamic mesh maintenance
      NWPhys (unstructured mesh solvers): a 3D, full-physics, first-principles, time-domain, free-Lagrange code for parallel processing using hybrid grids
File System on MPP2
  Four file systems are available on the cluster:
    Local filesystem (/scratch)
      On each of the compute nodes
      Non-persistent storage area provided to a parallel job running on that node
    NFS filesystem (/home)
      Where user home directories and files are located
      Uses RAID-5 for reliability
    Lustre global filesystem (/dtemp)
      Designed for the world's largest high-performance compute clusters
      Aggregate write rate of 3.2 Gbyte/s
      Restart files and files needed for post-analysis; long-term global scratch space
    AFS filesystem (/msrc)
      On the front-end (non-compute) nodes
Future Plan
  MPP2 will be upgraded with the faster Quadrics QsNetII (Elan4) interconnect in early 2004
  Diagram summary: 928 compute nodes (1,856 Madison batch CPUs), 4 login nodes with 4Gb-Enet, 2 system management nodes, Lustre over a 53 TB SAN
Bluesky Supercomputer
Top 500 Supercomputers
CS610 Parallel Processing
Donghyouk Lim
(Dept. of Computer Science, KAIST)
Contents
Introduction
National Center for Atmospheric Research
Scientific Computing Division
Hardware
Software
Recommendations for usage
Related Link
Introduction
  Bluesky: 13th fastest supercomputer in the world
  Clustered Symmetric Multi-Processing (SMP) system
  1,600 IBM POWER4 processors
  Peak of 8.7 TFLOPS
National Center for Atmospheric Research
Established in 1960
Located in Boulder, Colorado
Research areas
  Earth system
Climate change
Changes in atmospheric composition
Scientific Computing Division
Research on high-performance supercomputing
Computing resources
  Bluesky (IBM Cluster 1600 running AIX): 13th place
blackforest (IBM SP RS/6000 running AIX) : 80th place
Chinook complex: Chinook (SGI Origin3800 running IRIX) and Chinook (SGI Origin2100 running IRIX)
Hardware
  Processors: 1,600 POWER4 processors at 1.3 GHz
    Each can perform up to 4 FP operations per cycle
    Peak of 8.7 TFLOPS
  Memory: 2 GB per processor
    Memory on a node is shared between the processors on that node
  Memory caches
    L1 cache: 64 KB I-cache, 32 KB D-cache, direct mapped
    L2 cache: 1.44 MB shared per pair of processors, 8-way set associative
    L3 cache: 32 MB, 512-byte cache line, 8-way set associative
Hardware
  Computing nodes
    8-way processor nodes: 76
    32-way processor nodes: 25
    32-processor nodes for running interactive jobs: 4
    Separate nodes for user logins
  System support nodes
    12 nodes dedicated to the General Parallel File System (GPFS)
    Four nodes dedicated to HiPPI communications to the Mass Storage System
    Two master nodes dedicated to controlling LoadLeveler operations
    One dedicated system monitoring node
    One dedicated test node for system administration, upgrades, and testing
Hardware
  Storage
    RAID disk storage capacity: 31.0 TB total
    Each user application can access 120 GB of temporary space
  Interconnect fabric
    SP Switch2 ("Colony" switch)
      Two full-duplex network paths to increase throughput
      Bandwidth: 1.0 GB per second bidirectional
      Worst-case latency: 2.5 microseconds
    HiPPI (High-Performance Parallel Interface) to the Mass Storage System
    Gigabit Ethernet network
Software
Operating System: AIX (IBM-proprietary UNIX)
Compilers: Fortran (95/90/77), C, C++
Batch subsystem: LoadLeveler, which manages serial and parallel jobs over a cluster of servers
File System: General Parallel File System (GPFS)
System information commands: spinfo for general information, lslpp for information about libraries
Related Links
NCAR : http://www.ncar.ucar.edu/ncar/
SCD : http://www.scd.ucar.edu/
Bluesky : http://www.scd.ucar.edu/computers/bluesky/
IBM p690 : http://www-903.ibm.com/kr/eserver/pseries/highend/p690.html
About Cray X1
Kim, SooYoung ([email protected])
(Dept. of Computer Science, KAIST)
Features (1/2)
  Contributing areas: weather and climate prediction, aerospace engineering, automotive design, and a wide variety of other applications important in government and academic research
    Army High Performance Computing Research Center (AHPCRC), Boeing, Ford, Warsaw Univ., U.S. Government, Department of Energy's Oak Ridge National Laboratory (ORNL)
  Operating system: UNICOS/mp, derived from UNICOS and UNICOS/mk
    True single system image (SSI)
    Scheduling algorithms for parallel applications
    Accelerated application mode and migration
  Variable processor utilization: each CPU has four internal processors, usable
    together as a closely coupled multistreaming processor (MSP), or
    individually as four single-streaming processors (SSPs)
  Flexible system partitioning
Features (2/2)
  Scalable system architecture
    Distributed shared memory (DSM)
    Scalable cache coherence protocol
    Scalable address translation
  Parallel programming models
    Shared-memory parallel models
    Traditional distributed-memory parallel models: MPI and SHMEM (see the sketch below)
    Up-and-coming global distributed-memory parallel models: Unified Parallel C (UPC)
  Programming environments
    Fortran compiler, C and C++ compilers
    High-performance scientific library (LibSci), language support libraries, system libraries
    Etnus TotalView debugger, CrayPat (Cray Performance Analysis Tool)
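A minimal sketch of the SHMEM one-sided put style mentioned above, written in C against the classic Cray SHMEM interface. The exact header, calls, and compile flags on the X1 may differ; this is an illustration, not vendor code.

```c
/* SHMEM-style one-sided put sketch. Written against the classic
 * Cray SHMEM interface; exact headers/calls on the X1 may differ. */
#include <mpp/shmem.h>
#include <stdio.h>

long target;                     /* symmetric variable, visible on all PEs */

int main(void)
{
    long source;
    int me, npes;

    start_pes(0);                /* initialize the SHMEM runtime           */
    me   = shmem_my_pe();
    npes = shmem_n_pes();

    source = (long)me;
    /* Each PE writes its id into 'target' on the next PE (ring pattern). */
    shmem_long_put(&target, &source, 1, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d of %d: target = %ld\n", me, npes, target);
    return 0;
}
```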
Node Architecture
Figure 1. Node, containing four MSPs
System Conf. Examples
  Cabinets | CPUs  | Memory             | Peak Performance
  1 (AC)   | 16    | 64 – 256 GB        | 204.8 Gflops
  1        | 64    | 256 – 1024 GB      | 819.0 Gflops
  4        | 256   | 1024 – 4096 GB     | 3.3 Tflops
  8        | 512   | 2048 – 8192 GB     | 6.6 Tflops
  16       | 1024  | 4096 – 16384 GB    | 13.1 Tflops
  32       | 2048  | 8192 – 32768 GB    | 26.2 Tflops
  64       | 4096  | 16384 – 65536 GB   | 52.4 Tflops
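Each peak-performance entry follows from the 12.8 Gflops per-CPU peak listed under Technical Data (16 vector FP operations/cycle × 800 MHz): for example, 16 CPUs × 12.8 Gflops = 204.8 Gflops, 64 × 12.8 ≈ 819 Gflops, and 4,096 × 12.8 ≈ 52.4 Tflops.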
Technical Data (1/2)
  Technical specifications
    Peak performance: 52.4 Tflops in a 64-cabinet configuration
    Architecture: scalable vector MPP with SMP nodes
  Processing element
    Processor: Cray custom-design vector CPU
    16 vector floating-point operations/clock cycle; 32- and 64-bit IEEE arithmetic
    Memory size: 16 to 64 GB per node
    Data error protection: SECDED
    Vector clock speed: 800 MHz
    Peak performance: 12.8 Gflops per CPU
    Peak memory bandwidth: 34.1 GB/sec per CPU
    Peak cache bandwidth: 76.8 GB/sec per CPU
  Packaging
    4 CPUs per node
    Up to 4 nodes per AC cabinet, up to 4 interconnected cabinets
    Up to 16 nodes per LC cabinet, up to 64 interconnected cabinets
Technical Data (2/2)
  Memory
    Technology: RDRAM with 204 GB/sec peak bandwidth per node
    Architecture: cache coherent, physically distributed, globally addressable
    Total system memory size: 32 GB to 64 TB
  Interconnect network
    Topology: modified 2D torus
    Peak global bandwidth: 400 GB/sec for a 64-CPU Liquid Cooled (LC) system
  I/O
    I/O system port channels: 4 per node
    Peak I/O bandwidth: 1.2 GB/sec per channel