Linux Clusters Institute: Introduction to High Performance Computing
University of Wyoming
May 22 – 26, 2017
Irfan Elahi, National Center for Atmospheric Research
What is Supercomputing or High Performance Computing?
• The definition of supercomputing is constantly changing. Today's supercomputers can perform quadrillions of floating-point operations per second (PFLOPS).
• High-performance computing (HPC) uses parallel processing to run large, advanced application programs efficiently. The term applies especially to systems that perform above a hundred teraflops; the Top500 list has several multi-petaflop systems in its top 50.
• HPC aggregates computing power to deliver much higher performance than a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.
• Supercomputers were introduced in the 1960s, initially by Seymour Cray at Control Data Corporation, who led the HPC industry for decades.
• To me, personally, HPC is an ecosystem that provides users with a high-performance computational, networking, storage, and analysis platform, plus the software stack needed to stitch these resources together.
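To make the FLOPS units concrete, a theoretical peak rating can be estimated as nodes × cores × clock × FLOPs per cycle. The sketch below is illustrative only; all of the system figures are hypothetical assumptions, not the spec of any machine named in these slides.

```python
# Back-of-the-envelope peak FLOPS for a hypothetical cluster.
# Every figure below is an illustrative assumption.
nodes = 4000          # compute nodes
cores_per_node = 36   # CPU cores per node
clock_ghz = 2.3       # clock frequency in GHz
flops_per_cycle = 16  # e.g., a wide fused multiply-add unit

peak_flops = nodes * cores_per_node * clock_ghz * 1e9 * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e15:.2f} PFLOPS")  # → 5.30 PFLOPS
```

Real applications sustain only a fraction of this theoretical peak, which is why the Top500 list ranks systems by a measured benchmark rather than by this product.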
Fastest Supercomputer vs. Moore

Peak performance of the fastest supercomputer versus the Moore's-law projection (GFLOPS: billions of calculations per second; the 1993 system had 1024 CPU cores):

Year   Fastest (GFLOPS)   Moore (GFLOPS)
1993   59.7               60
1994   143.4
1995   170.4
1996   220.4              240
1997   1,068
1998   1,338
1999   2,121.3            960
2000   2,379
2001   7,226
2002   35,860             3,840
2003   35,860
2004   35,860
2005   136,800            15,360
2006   280,600
2007   280,600
2008   1,375,780          61,440
2009   1,456,700
2010   1,759,000
2011   8,162,000          245,760
2012   16,324,750

Sources: www.top500.org, http://www.mooreslaw.org/
What is Supercomputing About?

Size and Speed
What is Supercomputing About?
• Size: Many problems that are interesting to scientists and engineers can’t fit on a single personal computer system – usually because they need more RAM and/or more disk storage.
• Speed: Many problems that are interesting to scientists and engineers would take a very, very long time to run on a single personal computer system - months or even years. But a problem that would take a month on a PC might take only an hour on a supercomputer.
What is HPC Used For?

• Simulation of physical phenomena: developing a model that represents the key characteristics of a selected physical or abstract system or process. Areas where simulation is heavily used:
  • Weather forecasting
  • Galaxy formation
  • Oil reservoir management
• Data mining: finding needle(s) in a haystack. It is the process of analyzing data from different perspectives and summarizing it into useful, and sometimes new, information:
  • Gene sequencing
  • Signal processing
  • Detecting storms that might produce tornadoes
• Visualization: turning a vast sea of data into pictures that a scientist can understand and analyze. Any technique for creating images, diagrams, or animations to communicate a message is visualization.
Moore, OK tornadic storm, May 3, 1999 [1][2][3]
Supercomputing Issues
• Storage hierarchy or storage tiers
• Parallelism: doing multiple things at the same time
• Scaling issues
• High-speed interconnect
• Software stack
• Facility
What is a “Cluster”?
• A cluster is a collection of small computers, called nodes, hooked together by an interconnection network (or "interconnect" for short).
• It also needs software that allows the nodes to communicate over the interconnect.
• A cluster is all of these components working together as if they’re one big computer ... a super computer.
What Does a Cluster Look Like?
Network View
What Does a Cluster Look Like?
Cluster Components: All Components Working Together
• Computational resources
• Storage and file system
• Management infrastructure
• High-speed interconnect
• HPC software stack
• HPC applications and workflow
Cluster Components: All Components Working Together

• Management/service nodes, compute nodes, network (IHPI), and storage
• File system
• Network: load balancing, discovery, failover
• Cluster: provisioning, cluster commands, etc.
• Server: discovery, remote power, diskless support
• HPC software stack: OS, compilers, libraries, MPI, programming tools, debuggers, scheduler, etc.
• User applications
Computational Resources
• Compute nodes and software
• Compute node has:
• Processor (CPU)
• Memory
• Networking
• Software
• Access to a file system for long- or short-term retention
Processor Types: Examples

• x86 architecture:
  • This instruction set architecture (ISA) underlies Intel's most successful line of processors.
  • Examples: Xeon and Xeon Phi (many-core, Intel) and Opteron (AMD)
• GPGPU or GPU:
  • General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
  • NVIDIA, AMD, ASUS, etc., manufacture GPGPUs/GPUs.
• POWER (Performance Optimization With Enhanced RISC):
  • IBM's series of high-performance microprocessors.
  • IBM launched the OpenPOWER Foundation in 2013 for collaboration on the Power Architecture; Google, Tyan, NVIDIA, and Mellanox are founding members.
• ARM (Advanced RISC Machines):
  • CPUs based on the RISC (reduced instruction set computer) architecture developed by Advanced RISC Machines (ARM).
  • Companies using ARM cores on their chips include Qualcomm, Samsung Electronics, Texas Instruments, and Cavium, among others.
Storage and File System
• Used to control how data is stored and retrieved
• Short and long term retention
• HPC clusters require a shared file system:
• Storage (Disk)
• Software (Lustre/GPFS)
• Transport/Networking
Management Infrastructure
• Cluster Management
• Cluster Network Management
• Service Nodes
• Head/Login Nodes
• Facility
Cluster Management
• Node provisioning, hardware/power control, discovery, and OS disk-based/disk-less deployment
• Monitoring and log management
• Fabric management
• Cluster startup and shutdown
• Parallel shell: "One ring to rule them all, one ring to find them, one ring to bring them all, and in the darkness bind them."
Facility
• Power
• 120 V, 208 V, 440 V, etc.
• 3-Phase, DC
• N + N or N+1 Redundancy
• UPS, Generators
• Cooling
• Water cooled or air cooled
• Racks
HPC Software Stack
• Cluster Management Software
• Operating System
• Compilers, Programming Tool, and Libraries
• File System
• Scheduler/Resource Manager
High-Speed Interconnect
• Low latency, high bandwidth
• Library support: FCA, MPI offload, RDMA, etc.
• Fabric management
• Examples:
  • Ethernet: performance is expected to hit 400 Gbps soon
  • InfiniBand: EDR = ~100 Gbps, HDR = ~200 Gbps, and NDR = ~400 Gbps
  • OPA: Gen1 = ~100 Gbps
HPC Applications and Workflow
• Parallelism:
• Speedup is not linear
• Dependencies
• Tuning:
• Race conditions
• Mutual exclusion
• Synchronization
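The tuning hazards listed above can be shown in a minimal sketch (an illustration, not from the slides): two threads increment a shared counter, and a lock provides the mutual exclusion needed to prevent a race condition on the read-modify-write.

```python
import threading

# Two threads increment one shared counter. The increment is a
# read-modify-write, so without mutual exclusion the interleaving
# of the two threads can silently lose updates.
counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:          # remove this lock and updates can be lost
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:           # synchronization: wait for both to finish
    t.join()
print(counter)              # 200000 with the lock; often less without it
```

The `join()` calls are the synchronization point: the final value is only meaningful once every worker has finished.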
Parallelism
What is Parallelism? A Simple Analogy
Parallelism
Less fish …
More fish!
Parallelism means doing multiple things at the same time: you can get more work done in the same amount of time.
The Jigsaw Puzzle Analogy
Serial Computing
Suppose you want to do a jigsaw puzzle that has, say, a thousand pieces.

We can imagine that it'll take you a certain amount of time. Let's say that you can put the puzzle together in an hour.
Shared Memory Parallelism
If Scott sits across the table from you, he can work on his half of the puzzle, and you can work on yours. Once in a while, you'll both reach into the pile of pieces at the same time (you'll contend for the same resource), which will cause a little bit of slowdown. And from time to time, you'll have to work together (communicate) at the interface between his half and yours. The speedup will be nearly 2-to-1: you might take 35 minutes instead of the ideal 30.
The More the Merrier?
Now, let’s put Paul and Charlie on the other two sides of the table. Each of you can work on a part of the puzzle, but there’ll be a lot more contention for the shared resource (the pile of puzzle pieces) and a lot more communication at the interfaces. So, you will get noticeably less than a 4-to-1 speedup, but you’ll still have an improvement - maybe something like 3-to-1. The four of you can get it done in 20 minutes instead of an hour.
Diminishing Returns
If we now put Dave, Tom, Horst, and Brandon on the corners of the table, there’s going to be a whole lot of contention for the shared resource and a lot of communication at the many interfaces. So, the speedup you’ll get will be much less than we’d like; you’ll be lucky to get 5-to-1.
So, we can see that adding more and more workers onto a shared resource is eventually going to have a diminishing return.
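The puzzle timings can be turned into speedup and efficiency numbers, which is how this diminishing return is usually quantified. A short sketch (the 8-worker time of 12 minutes is inferred from the "lucky to get 5-to-1" figure above; speedup = serial time / parallel time, efficiency = speedup / workers):

```python
# Speedup and efficiency for the jigsaw-puzzle timings in this analogy.
serial_minutes = 60
timings = {2: 35, 4: 20, 8: 12}  # workers -> minutes to finish

for workers, minutes in timings.items():
    speedup = serial_minutes / minutes
    efficiency = speedup / workers
    print(f"{workers} workers: speedup {speedup:.1f}x, "
          f"efficiency {efficiency:.0%}")
```

Efficiency drops from about 86% with two workers to about 63% with eight: each added worker contributes less, which is exactly the diminishing return described above.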
Distributed Parallelism
Now let’s try something a little different. Let’s set up two tables, and let’s put you at one of them and Scott at the other. Let’s put half of the puzzle pieces on your table and the other half of the pieces on Scott’s table. Now you can work completely independently, without any contention for a shared resource. BUT, the cost per communication is MUCH higher (you have to scootch your tables together), and you need the ability to split up (decompose) the puzzle pieces reasonably evenly, which may be tricky to do for some puzzles.
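The two-table idea can be sketched in code: decompose the data, let each worker process its part independently, then combine the partial results at the "interface." This is an illustrative Python sketch (not from the slides) that uses a thread pool to mimic the pattern; real distributed parallelism would use something like MPI across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def assemble(pieces):
    # Stand-in for "solving" one portion of the puzzle.
    return sum(pieces)

pieces = list(range(1000))
half = len(pieces) // 2
parts = [pieces[:half], pieces[half:]]   # domain decomposition: two tables

# Each worker operates on its own part with no shared pile of pieces.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(assemble, parts))

result = sum(partial_results)            # combine at the boundary
print(result)                            # same answer as the serial version
```

The decomposition step is the tricky part the slide mentions: here the split is trivially even, but for many real problems dividing the work evenly is hard.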
More Distributed Processors
It's a lot easier to add more processors in distributed parallelism. But, you always have to be aware of the need to decompose the problem and to communicate among the processors. Also, as you add more processors, it may be harder to load balance the amount of work that each processor gets.
Load Balancing
Load balancing means ensuring that everyone completes their workload at roughly the same time.
For example, if the jigsaw puzzle is half grass and half sky, then you can do the grass and Scott can do the sky. Then you’ll only have to communicate at the horizon – and the amount of work that each of you does on your own is roughly equal. So you’ll get pretty good speedup.
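When the work does split into uniform items, the balancing itself is mechanical. A small sketch (`balanced_chunks` is a hypothetical helper, not from the slides) that divides n items across p workers so no worker gets more than one extra item:

```python
# Split work items across workers as evenly as possible:
# chunk sizes differ by at most one item.
def balanced_chunks(items, workers):
    base, extra = divmod(len(items), workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks

sizes = [len(c) for c in balanced_chunks(list(range(10)), 4)]
print(sizes)  # → [3, 3, 2, 2]
```

The hard cases are the ones the slide alludes to: when items cost unequal amounts of work, equal-sized chunks no longer mean equal finishing times.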
Load Balancing
Load balancing can be easy if the problem splits up into chunks of roughly equal size with one chunk per processor. Or, load balancing can be very hard.
Why Bother?
Why Bother with HPC at All?
• It’s clear that making effective use of HPC takes quite a bit of effort, both learning how and developing software.
• That seems like a lot of trouble to just get your code to run faster.
• It’s nice to have a code that used to take a day and now runs in an hour. But, if you can afford to wait a day, what’s the point of HPC?
• Why go to all that trouble just to get your code to run faster?
Why HPC is Worth the Bother
• What HPC gives you that you won’t get elsewhere is the ability to do bigger, better, and more exciting science. If your code can run faster, that means that you can tackle much bigger problems in the same amount of time that you used to need for smaller problems.
• HPC is important not only for its own sake, but also because what happens in HPC today will be on your desktop in about 10 to 15 years and on your cell phone in 25 years; it puts you ahead of the curve.
The Future is Now
• Historically, this has always been true: whatever happens in supercomputing today will be on your desktop down the road.
• So, if you have experience with supercomputing, you’ll be ahead of the curve when things get to the desktop.
• Exascale
Exascale Challenges

• Processor architecture
• Facility power is the primary constraint for the exascale system
• A Xeon-based 5.34-petaflop system (Cheyenne) consumes 1.72 MW, so a linearly scaled exaflop computer would require roughly 320 MW, which is untenable. GPUs and KNL have lower power footprints.
• The target is 20-40 MW in 2020 for 1 exaflop.
• Memory bandwidth and capacity are not keeping pace with the increase in flops
• Clock frequencies are expected to decrease to conserve power
• Cost of data movement
• A new programming model will be necessary
• The I/O system will be much harder to manage
• Reliability and resiliency will be critical at this scale (component mean time to failure)
• Cost
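The power constraint above can be checked by linearly scaling the Cheyenne figures quoted in this slide (5.34 PFLOPS at 1.72 MW) up to 1 EFLOP, i.e., 1000 PFLOPS:

```python
# Naive linear scaling of power with FLOPS, using the Cheyenne
# figures from this slide: 5.34 PFLOPS at 1.72 MW.
cheyenne_pflops = 5.34
cheyenne_mw = 1.72

exaflop_mw = cheyenne_mw * (1000 / cheyenne_pflops)
print(f"{exaflop_mw:.0f} MW")  # → 322 MW, far above the 20-40 MW target
```

That gap of roughly an order of magnitude is why exascale requires more power-efficient architectures rather than simply scaling up existing Xeon-based designs.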
References
[1] Image by Greg Bryan, Columbia University.
[2] "Update on the Collaborative Radar Acquisition Field Test (CRAFT): Planning for the Next Steps." Presented to NWS Headquarters, August 30, 2001.
[3] See http://hneeman.oscer.ou.edu/hamr.html for details.
[4] http://www.dell.com/
[5] http://www.vw.com/newbeetle/
[6] Richard Gerber, The Software Optimization Cookbook: High-Performance Recipes for the Intel Architecture. Intel Press, 2002, pp. 161-168.
[7] RightMark Memory Analyzer. http://cpu.rightmark.org/
[8] ftp://download.intel.com/design/Pentium4/papers/24943801.pdf
[9] http://www.samsungssd.com/meetssd/techspecs
[10] http://www.samsung.com/Products/OpticalDiscDrive/SlimDrive/OpticalDiscDrive_SlimDrive_SN_S082D.asp?page=Specifications
[11] ftp://download.intel.com/design/Pentium4/manuals/24896606.pdf
[12] http://www.pricewatch.com/
Special thanks to Henry Neeman, University of Oklahoma, for the use of his slides from the LCI Workshop, Monday, May 18, 2015.
Acknowledgements
• Henry Neeman, University of Oklahoma, for his 2015 LCI slides.
• Erik Scott (Harris), Jared David Baker (University of Wyoming), Pamela Hill (NCAR), Shilo Hall (NCAR), Nathan Rini (NCAR), Ben Matthews (NCAR), Jon Roberts (NCAR), Thomas Engel (NCAR), Jeffrey R. Lang (University of Wyoming), Jonathan Anderson (University of Colorado, Boulder), Brian Dale Haymore (University of Utah), Leslie Ann Froeschl (University of Illinois), Tim Brewer (University of Wyoming), Stormy Knight (NCAR), and Robert McLay (TACC) for reviewing and providing feedback on the slides.
• LCI for the opportunity.