Linux Clusters Institute: Introduction to High Performance Computing
University of Wyoming
May 22 – 26, 2017
Irfan Elahi, National Center for Atmospheric Research
What is Supercomputing or High Performance Computing?
• The definition of supercomputing is constantly changing. Today's supercomputers can perform quadrillions of floating-point operations per second (PFLOPS).
• High-performance computing (HPC) uses parallel processing to run large, advanced application programs efficiently. The term applies especially to systems that perform above a hundred teraflops; the Top500 list has several multi-petaflop systems in its top 50.
• HPC aggregates computing power to deliver much higher performance than a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.
• Supercomputers were introduced in the 1960s, initially by Seymour Cray at Control Data Corporation, who led the HPC industry for decades.
• To me, personally, HPC is an ecosystem that provides users with a high-performance computational, networking, storage, and analysis platform, plus the software stack needed to stitch these resources together.
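To make the FLOPS units concrete, a theoretical peak rating can be estimated as nodes × cores × clock × FLOPs per cycle. The sketch below is illustrative only; all of the system figures are hypothetical assumptions, not the spec of any machine named in these slides.

```python
# Back-of-the-envelope peak FLOPS for a hypothetical cluster.
# Every figure below is an illustrative assumption.
nodes = 4000          # compute nodes
cores_per_node = 36   # CPU cores per node
clock_ghz = 2.3       # clock frequency in GHz
flops_per_cycle = 16  # e.g., a wide fused multiply-add unit

peak_flops = nodes * cores_per_node * clock_ghz * 1e9 * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e15:.2f} PFLOPS")  # → 5.30 PFLOPS
```

Real applications sustain only a fraction of this theoretical peak, which is why the Top500 list ranks systems by a measured benchmark rather than by this product.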
Fastest Supercomputer vs. Moore

Peak performance of the fastest supercomputer versus the Moore's-law projection (GFLOPS: billions of calculations per second; the 1993 system had 1024 CPU cores):

Year   Fastest (GFLOPS)   Moore (GFLOPS)
1993   59.7               60
1994   143.4
1995   170.4
1996   220.4              240
1997   1,068
1998   1,338
1999   2,121.3            960
2000   2,379
2001   7,226
2002   35,860             3,840
2003   35,860
2004   35,860
2005   136,800            15,360
2006   280,600
2007   280,600
2008   1,375,780          61,440
2009   1,456,700
2010   1,759,000
2011   8,162,000          245,760
2012   16,324,750

Sources: www.top500.org, http://www.mooreslaw.org/
What is Supercomputing About?

Size and Speed
What is Supercomputing About?
• Size: Many problems that are interesting to scientists and engineers can’t fit on a single personal computer system – usually because they need more RAM and/or more disk storage.
• Speed: Many problems that are interesting to scientists and engineers would take a very, very long time to run on a single personal computer system - months or even years. But a problem that would take a month on a PC might take only an hour on a supercomputer.
What is HPC Used For?

• Simulation of physical phenomena: developing a model that represents the key characteristics of a selected physical or abstract system or process. Areas where simulation is heavily used:
  • Weather forecasting
  • Galaxy formation
  • Oil reservoir management
• Data mining: finding needle(s) in a haystack. It is the process of analyzing data from different perspectives and summarizing it into useful, and sometimes new, information:
  • Gene sequencing
  • Signal processing
  • Detecting storms that might produce tornadoes
• Visualization: turning a vast sea of data into pictures that a scientist can understand and analyze. Any technique for creating images, diagrams, or animations to communicate a message is visualization.
Moore, OK tornadic storm, May 3, 1999 [1][2][3]
Supercomputing Issues
• Storage hierarchy or storage tiers
• Parallelism: doing multiple things at the same time
• Scaling issues
• High-speed interconnect
• Software stack
• Facility
What is a “Cluster”?
• A cluster is a collection of small computers, called nodes, hooked together by an interconnection network (or "interconnect" for short).
• It also needs software that allows the nodes to communicate over the interconnect.
• A cluster is all of these components working together as if they’re one big computer ... a super computer.
What Does a Cluster Look Like?
Network View
What Does a Cluster Look Like?
Cluster Components: All Components Working Together
• Computational resources
• Storage and file system
• Management infrastructure
• High-speed interconnect
• HPC software stack
• HPC applications and workflow
Cluster Components: All Components Working Together

• Management/service nodes, compute nodes, network (IHPI), and storage
• File system
• Network: load balancing, discovery, failover
• Cluster: provisioning, cluster commands, etc.
• Server: discovery, remote power, diskless support
• HPC software stack: OS, compilers, libraries, MPI, programming tools, debuggers, scheduler, etc.
• User applications
Computational Resources
• Compute nodes and software
• Compute node has:
• Processor (CPU)
• Memory
• Networking
• Software
• Access to a file system for long- or short-term retention
Processor Types: Examples

• x86 architecture:
  • This instruction set architecture (ISA) underlies Intel's most successful line of processors.
  • Examples: Xeon and Xeon Phi (many-core, Intel) and Opteron (AMD)
• GPGPU or GPU:
  • General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
  • NVIDIA, AMD, ASUS, etc., manufacture GPGPUs/GPUs.
• POWER (Performance Optimization With Enhanced RISC):
  • IBM's series of high-performance microprocessors.
  • IBM launched the OpenPOWER Foundation in 2013 for collaboration on the Power Architecture; Google, Tyan, NVIDIA, and Mellanox are founding members.
• ARM (Advanced RISC Machines):
  • CPUs based on the RISC (reduced instruction set computer) architecture developed by Advanced RISC Machines (ARM).
  • Companies using ARM cores on their chips include Qualcomm, Samsung Electronics, Texas Instruments, and Cavium, among others.
Storage and File System
• Used to control how data is stored and retrieved
• Short and long term retention
• HPC clusters require a shared file system:
• Storage (Disk)
• Software (Lustre/GPFS)
• Transport/Networking
Management Infrastructure
• Cluster Management
• Cluster Network Management
• Service Nodes
• Head/Login Nodes
• Facility
Cluster Management
• Node provisioning, hardware/power control, discovery, and OS disk-based/disk-less deployment
• Monitoring and log management
• Fabric management
• Cluster startup and shutdown
• Parallel shell: "One ring to rule them all, one ring to find them, one ring to bring them all, and in the darkness bind them."
Facility
• Power
• 120 V, 208 V, 440 V, etc.
• 3-Phase, DC
• N + N or N+1 Redundancy
• UPS, Generators
• Cooling
• Water cooled or air cooled
• Racks
HPC Software Stack
• Cluster Management Software
• Operating System
• Compilers, Programming Tool, and Libraries
• File System
• Scheduler/Resource Manager
High-Speed Interconnect
• Low latency, high bandwidth
• Library support: FCA, MPI offload, RDMA, etc.
• Fabric management
• Examples:
  • Ethernet: performance is expected to hit 400 Gbps soon
  • InfiniBand: EDR = ~100 Gbps, HDR = ~200 Gbps, and NDR = ~400 Gbps
  • OPA: Gen1 = ~100 Gbps
HPC Applications and Workflow
• Parallelism:
• Speedup is not linear
• Dependencies
• Tuning:
• Race conditions
• Mutual exclusion
• Synchronization
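The tuning hazards listed above can be shown in a minimal sketch (an illustration, not from the slides): two threads increment a shared counter, and a lock provides the mutual exclusion needed to prevent a race condition on the read-modify-write.

```python
import threading

# Two threads increment one shared counter. The increment is a
# read-modify-write, so without mutual exclusion the interleaving
# of the two threads can silently lose updates.
counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:          # remove this lock and updates can be lost
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:           # synchronization: wait for both to finish
    t.join()
print(counter)              # 200000 with the lock; often less without it
```

The `join()` calls are the synchronization point: the final value is only meaningful once every worker has finished.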
Parallelism
What is Parallelism? A Simple Analogy
Parallelism
Less fish …
More fish!
Parallelism means doing multiple things at the same time: you can get more work done in the same amount of time.
The Jigsaw Puzzle Analogy
Serial Computing
Suppose you want to do a jigsaw puzzle that has, say, a thousand pieces.

We can imagine that it'll take you a certain amount of time. Let's say that you can put the puzzle together in an hour.
Shared Memory Parallelism
If Scott sits across the table from you, he can work on his half of the puzzle, and you can work on yours. Once in a while, you'll both reach into the pile of pieces at the same time (you'll contend for the same resource), which will cause a little bit of slowdown. And from time to time, you'll have to work together (communicate) at the interface between his half and yours. The speedup will be nearly 2-to-1: you might take 35 minutes instead of the ideal 30.
The More the Merrier?
Now, let’s put Paul and Charlie on the other two sides of the table. Each of you can work on a part of the puzzle, but there’ll be a lot more contention for the shared resource (the pile of puzzle pieces) and a lot more communication at the interfaces. So, you will get noticeably less than a 4-to-1 speedup, but you’ll still have an improvement - maybe something like 3-to-1. The four of you can get it done in 20 minutes instead of an hour.
Diminishing Returns
If we now put Dave, Tom, Horst, and Brandon on the corners of the table, there’s going to be a whole lot of contention for the shared resource and a lot of communication at the many interfaces. So, the speedup you’ll get will be much less than we’d like; you’ll be lucky to get 5-to-1.
So, we can see that adding more and more workers onto a shared resource is eventually going to have a diminishing return.
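The puzzle timings can be turned into speedup and efficiency numbers, which is how this diminishing return is usually quantified. A short sketch (the 8-worker time of 12 minutes is inferred from the "lucky to get 5-to-1" figure above; speedup = serial time / parallel time, efficiency = speedup / workers):

```python
# Speedup and efficiency for the jigsaw-puzzle timings in this analogy.
serial_minutes = 60
timings = {2: 35, 4: 20, 8: 12}  # workers -> minutes to finish

for workers, minutes in timings.items():
    speedup = serial_minutes / minutes
    efficiency = speedup / workers
    print(f"{workers} workers: speedup {speedup:.1f}x, "
          f"efficiency {efficiency:.0%}")
```

Efficiency drops from about 86% with two workers to about 63% with eight: each added worker contributes less, which is exactly the diminishing return described above.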
Distributed Parallelism
Now let’s try something a little different. Let’s set up two tables, and let’s put you at one of them and Scott at the other. Let’s put half of the puzzle pieces on your table and the other half of the pieces on Scott’s table. Now you can work completely independently, without any contention for a shared resource. BUT, the cost per communication is MUCH higher (you have to scootch your tables together), and you need the ability to split up (decompose) the puzzle pieces reasonably evenly, which may be tricky to do for some puzzles.
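The two-table idea can be sketched in code: decompose the data, let each worker process its part independently, then combine the partial results at the "interface." This is an illustrative Python sketch (not from the slides) that uses a thread pool to mimic the pattern; real distributed parallelism would use something like MPI across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def assemble(pieces):
    # Stand-in for "solving" one portion of the puzzle.
    return sum(pieces)

pieces = list(range(1000))
half = len(pieces) // 2
parts = [pieces[:half], pieces[half:]]   # domain decomposition: two tables

# Each worker operates on its own part with no shared pile of pieces.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(assemble, parts))

result = sum(partial_results)            # combine at the boundary
print(result)                            # same answer as the serial version
```

The decomposition step is the tricky part the slide mentions: here the split is trivially even, but for many real problems dividing the work evenly is hard.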
More Distributed Processors
It's a lot easier to add more processors in distributed parallelism. But, you always have to be aware of the need to decompose the problem and to communicate among the processors. Also, as you add more processors, it may be harder to load balance the amount of work that each processor gets.
Load Balancing
Load balancing means ensuring that everyone completes their workload at roughly the same time.
For example, if the jigsaw puzzle is half grass and half sky, then you can do the grass and Scott can do the sky. Then you’ll only have to communicate at the horizon – and the amount of work that each of you does on your own is roughly equal. So you’ll get pretty good speedup.
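When the work does split into uniform items, the balancing itself is mechanical. A small sketch (`balanced_chunks` is a hypothetical helper, not from the slides) that divides n items across p workers so no worker gets more than one extra item:

```python
# Split work items across workers as evenly as possible:
# chunk sizes differ by at most one item.
def balanced_chunks(items, workers):
    base, extra = divmod(len(items), workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks

sizes = [len(c) for c in balanced_chunks(list(range(10)), 4)]
print(sizes)  # → [3, 3, 2, 2]
```

The hard cases are the ones the slide alludes to: when items cost unequal amounts of work, equal-sized chunks no longer mean equal finishing times.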
Load Balancing
Load balancing can be easy if the problem splits up into chunks of roughly equal size with one chunk per processor. Or, load balancing can be very hard.
Why Bother?
Why Bother with HPC at All?
• It’s clear that making effective use of HPC takes quite a bit of effort, both learning how and developing software.
• That seems like a lot of trouble to just get your code to run faster.
• It’s nice to have a code that used to take a day and now runs in an hour. But, if you can afford to wait a day, what’s the point of HPC?
• Why go to all that trouble just to get your code to run faster?
Why HPC is Worth the Bother
• What HPC gives you that you won’t get elsewhere is the ability to do bigger, better, and more exciting science. If your code can run faster, that means that you can tackle much bigger problems in the same amount of time that you used to need for smaller problems.
• HPC is important not only for its own sake, but also because what happens in HPC today will be on your desktop in about 10 to 15 years and on your cell phone in 25 years; it puts you ahead of the curve.
The Future is Now
• Historically, this has always been true: whatever happens in supercomputing today will be on your desktop down the road.
• So, if you have experience with supercomputing, you’ll be ahead of the curve when things get to the desktop.
• Exascale
Exascale Challenges

• Processor architecture
• Facility power is the primary constraint for the exascale system
• A Xeon-based 5.34-petaflop system (Cheyenne) consumes 1.72 MW, so a linearly scaled exaflop computer would require roughly 320 MW, which is untenable. GPUs and KNL have lower power footprints.
• The target is 20-40 MW in 2020 for 1 exaflop.
• Memory bandwidth and capacity are not keeping pace with the increase in flops
• Clock frequencies are expected to decrease to conserve power
• Cost of data movement
• A new programming model will be necessary
• The I/O system will be much harder to manage
• Reliability and resiliency will be critical at this scale (component mean time to failure)
• Cost
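The power constraint above can be checked by linearly scaling the Cheyenne figures quoted in this slide (5.34 PFLOPS at 1.72 MW) up to 1 EFLOP, i.e., 1000 PFLOPS:

```python
# Naive linear scaling of power with FLOPS, using the Cheyenne
# figures from this slide: 5.34 PFLOPS at 1.72 MW.
cheyenne_pflops = 5.34
cheyenne_mw = 1.72

exaflop_mw = cheyenne_mw * (1000 / cheyenne_pflops)
print(f"{exaflop_mw:.0f} MW")  # → 322 MW, far above the 20-40 MW target
```

That gap of roughly an order of magnitude is why exascale requires more power-efficient architectures rather than simply scaling up existing Xeon-based designs.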
References
[1] Image by Greg Bryan, Columbia University.
[2] "Update on the Collaborative Radar Acquisition Field Test (CRAFT): Planning for the Next Steps." Presented to NWS Headquarters, August 30, 2001.
[3] See http://hneeman.oscer.ou.edu/hamr.html for details.
[4] http://www.dell.com/
[5] http://www.vw.com/newbeetle/
[6] Richard Gerber, The Software Optimization Cookbook: High-Performance Recipes for the Intel Architecture. Intel Press, 2002, pp. 161-168.
[7] RightMark Memory Analyzer. http://cpu.rightmark.org/
[8] ftp://download.intel.com/design/Pentium4/papers/24943801.pdf
[9] http://www.samsungssd.com/meetssd/techspecs
[10] http://www.samsung.com/Products/OpticalDiscDrive/SlimDrive/OpticalDiscDrive_SlimDrive_SN_S082D.asp?page=Specifications
[11] ftp://download.intel.com/design/Pentium4/manuals/24896606.pdf
[12] http://www.pricewatch.com/
Special thanks to Henry Neeman, University of Oklahoma, for the use of his slides from the LCI Workshop, Monday, May 18, 2015.
Acknowledgements
• Henry Neeman, University of Oklahoma, for his 2015 LCI slides.
• Erik Scott (Harris), Jared David Baker (University of Wyoming), Pamela Hill (NCAR), Shilo Hall (NCAR), Nathan Rini (NCAR), Ben Matthews (NCAR), Jon Roberts (NCAR), Thomas Engel (NCAR), Jeffrey R. Lang (University of Wyoming), Jonathan Anderson (University of Colorado, Boulder), Brian Dale Haymore (University of Utah), Leslie Ann Froeschl (University of Illinois), Tim Brewer (University of Wyoming), Stormy Knight (NCAR), and Robert McLay (TACC) for reviewing and providing feedback on the slides.
• LCI for the opportunity.