Lecture 1: An Introduction to Parallel Computing
CSCE 569, Spring 2018
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh

May 12, 2018


Course Information

• Meeting Time: 9:40AM – 10:55AM, Monday and Wednesday
• Classroom: 2A15, Swearingen Engineering Center, 301 Main St, Columbia, SC 29208
• Grade: 60% for four homeworks + 40% for two exams
• Instructor: Yonghong Yan
  – http://cse.sc.edu/~yanyh, [email protected]
  – Office: Room 2211, Storey Innovation Center (Horizon II), 550 Assembly St, Columbia, SC 29201
  – Tel: 803-777-7361
  – Office Hours: 11:00AM – 12:30PM (after class) or by appointment
• Public course website: http://passlab.github.io/CSCE569
• Homework submission: https://dropbox.cse.sc.edu
• See the syllabus or website for more details

Objectives

• Learn fundamentals of concurrent and parallel computing
  – Describe benefits and applications of parallel computing.
  – Explain architectures of multicore CPUs, GPUs and HPC clusters
    • Including the key concepts in parallel computer architectures, e.g. shared memory systems, distributed systems, NUMA and cache coherence, interconnection
  – Understand principles for parallel and concurrent program design, e.g. decomposition of work, task and data parallelism, processor mapping, mutual exclusion, locks.
• Develop skills in writing and analyzing parallel programs
  – Write parallel programs using the OpenMP, CUDA, and MPI programming models.
  – Perform analysis of parallel program problems.

Textbooks

• Required: Introduction to Parallel Computing (2nd Edition), by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003 (PDF, Amazon). Covers theory and an introduction to MPI and OpenMP.
• Recommended: John Cheng, Max Grossman, and Ty McKercher, Professional CUDA C Programming, 1st Edition, 2014 (PDF, Amazon).
• Reference book for OpenMP: Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, 2007 (PDF, Amazon).
• Reference book for MPI: choose from Recommended Books for MPI
• Lots of materials on the Internet.
  – On the website, there is a “Resources” section that provides web page links, documents, and other materials for this course

Homeworks and Exams

• Four homeworks: practice programming skills
  – Require both good and correct programming
    • Write organized programs that are easy to read
  – Report and discuss your findings in a report
    • Write good documentation
  – 60% total (10% + 10% + 20% + 20%)
• Exams: test fundamentals
  – Closed/open book (?)
  – 40% total
    • Midterm: 15%, Wednesday, March 7th, during class
      – The week before spring break.
    • Final Exam: 25%, Wednesday, May 2nd, 9:00AM - 11:30AM

Machines for Development with OpenMP and MPI

• Linux machines in Swearingen 1D39 and 3D22
  – All CSCE students by default have access to these machines using their standard login credentials
    • Let me know if you, CSCE or not, cannot access them
  – Remote access is also available via SSH over port 222. The naming scheme is as follows:
    • l-1d39-01.cse.sc.edu through l-1d39-26.cse.sc.edu
    • l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Home folders (~/) are restricted to 2GB of data.
  – For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.

PuTTY SSH Connection on Windows

[Screenshot: PuTTY configuration with host name l-1d39-08.cse.sc.edu and port 222]

SSH Connection from Linux/Mac OS X Terminal

The -X option enables X-window forwarding so you can use the graphics display on your computer. On Mac OS X, you need to have X server software installed, e.g. XQuartz (https://www.xquartz.org/) is the one I use.

Try in the Lab and From Remote

• Bring your laptop

Topics

• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
  – PThreads, mutual exclusion, locks, synchronization
  – Cilk/Cilk Plus (?)
• Principles of parallel algorithm design (Chapter 3)
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

Topics

• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel (?)
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to the offloading model in OpenMP (?)
• Parallel algorithms (Chapters 8, 9 & 10)
  – Dense linear algebra, stencil and image processing

Prerequisites

• Good reasoning and analytical skills
• Familiarity with and skills in C/C++ programming
  – macros, pointers, arrays, structs, unions, function pointers, etc.
• Familiarity with the Linux environment
  – SSH, Linux commands, vim/Emacs editor
• Basic knowledge of computer architecture and data structures
  – Memory hierarchy, cache, virtual address
  – Arrays and linked lists
• Talk with me if you have concerns
• Turn in the survey

Introduction: What Is Parallel Computing, and Why?

An Example: Grading

15 questions, 300 exams

From An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann Publishers Inc, Copyright © 2010, Elsevier Inc. All rights reserved.

Three Teaching Assistants

• To grade 300 copies of an exam, each with 15 questions

TA#1, TA#2, TA#3

Division of Work – Data Parallelism

• Each does the same type of work (task), but works on different sheets (data)
  – TA#1: 100 exams
  – TA#2: 100 exams
  – TA#3: 100 exams

Division of Work – Task Parallelism

• Each does a different type of work (task), but works on the same sheets (data)
  – TA#1: Questions 1 - 5
  – TA#2: Questions 6 - 10
  – TA#3: Questions 11 - 15

Summary

• Data: 300 copies of the exam
• Task: grade a total of 300*15 questions
• Data parallelism
  – Distribute 300 copies to three TAs
  – They work independently
• Task parallelism
  – Distribute 300 copies to three TAs
  – Each grades 5 questions of 100 copies
  – Exchange copies
  – Grade 5 questions again
  – Exchange copies
  – Grade 5 questions
• The three TAs can work in parallel, so we can theoretically achieve a 3x speedup

Which approach could be faster?
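The two ways of dividing the grading work can be sketched in a few lines of Python. This is only an illustrative model, not course code: the function names are made up for this example, every question is assumed to take the same time to grade, and the cost of exchanging copies is ignored.

```python
# Toy model of the grading example: 300 exams, 15 questions, 3 TAs.
NUM_EXAMS, NUM_QUESTIONS, NUM_TAS = 300, 15, 3

def data_parallel_plan():
    """Each TA grades all 15 questions on its own block of 100 exams."""
    per_ta = NUM_EXAMS // NUM_TAS
    return {ta: [(exam, q)
                 for exam in range(ta * per_ta, (ta + 1) * per_ta)
                 for q in range(NUM_QUESTIONS)]
            for ta in range(NUM_TAS)}

def task_parallel_plan():
    """Each TA grades its own block of 5 questions on all 300 exams."""
    per_ta = NUM_QUESTIONS // NUM_TAS
    return {ta: [(exam, q)
                 for exam in range(NUM_EXAMS)
                 for q in range(ta * per_ta, (ta + 1) * per_ta)]
            for ta in range(NUM_TAS)}

def ideal_speedup(plan):
    """Total work divided by the largest per-TA share (uniform cost assumed)."""
    total = sum(len(items) for items in plan.values())
    return total / max(len(items) for items in plan.values())

print(ideal_speedup(data_parallel_plan()))   # 3.0
print(ideal_speedup(task_parallel_plan()))   # 3.0
```

Both plans give each TA 1,500 of the 4,500 (exam, question) pairs, so the ideal speedup is the same. In this model the difference between them comes only from costs that are left out here, such as passing copies between TAs.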

Challenges

• Do the three TAs grade at the same speed?
  – One CPU may be slower than the others
  – They may not work on grading at the same time
• How do the TAs communicate?
  – Do they sit at the same table? Or does each take copies and grade from home? How do they share intermediate results (task parallelism)?
• Where are the solutions stored so the TAs can refer to them when grading?
  – Remembering answers to 5 questions vs. to 15 questions
    • Cache and memory issues
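The first challenge, TAs working at different speeds, can be made concrete with a small calculation. The grading rates below are hypothetical numbers chosen for illustration; the point is only that the finish time is set by the slowest worker, so an even split stops being the best split.

```python
def finish_time(exams_per_ta, exams_per_hour):
    """Wall-clock hours until the slowest TA finishes its share."""
    return max(n / r for n, r in zip(exams_per_ta, exams_per_hour))

def speedup(exams_per_ta, exams_per_hour):
    """Speedup relative to the fastest TA grading everything alone."""
    serial = sum(exams_per_ta) / max(exams_per_hour)
    return serial / finish_time(exams_per_ta, exams_per_hour)

equal_rates = [10, 10, 10]    # hypothetical: exams graded per hour
uneven_rates = [10, 10, 5]    # hypothetical: the third TA is half as fast

print(speedup([100, 100, 100], equal_rates))    # 3.0: even split, equal speed
print(speedup([100, 100, 100], uneven_rates))   # 1.5: everyone waits for the slow TA
print(speedup([120, 120, 60], uneven_rates))    # 2.5: split proportional to speed
```

Giving the slower TA fewer exams recovers most of the lost speedup, which is the idea behind load balancing.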

What is Parallel Computing?

• A form of computation*:
  – Large problems are divided into smaller ones
  – The smaller ones are carried out and solved simultaneously
• Uses more than one CPU or core concurrently for one program
  – Not conventional time-sharing: multiple programs switching between each other on one CPU
  – Nor multiple programs each on a CPU and not interacting
• Serial processing
  – Some programs, or parts of a program, are inherently serial
  – Most of our programs and desktop applications

* http://en.wikipedia.org/wiki/Parallel_computing
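The "divide a large problem into smaller ones and solve them simultaneously" pattern can be sketched with Python's standard `concurrent.futures` module. The helper names are invented for this example. Note one assumption that matters: in standard CPython, threads do not speed up CPU-bound work like this, so the sketch shows the structure of the decomposition rather than a real performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_ranges(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) pieces."""
    step = max(1, (n + parts - 1) // parts)
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

def partial_sum(bounds):
    """Solve one small sub-problem: the sum of squares over [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=4):
    """Hand each chunk to a worker, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunk_ranges(n, workers)))

print(parallel_sum_of_squares(1000) == sum(i * i for i in range(1000)))  # True
```

The final `sum` over partial results is the part that stays serial, a small instance of the inherently serial work mentioned above.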

Why Parallel Computing?

• Save time (execution time) and money!
  – A parallel program can run faster if running concurrently instead of sequentially.
• Solve larger and more complex problems!
  – Utilize more computational resources

From “21st Century Grand Challenges | The White House”, http://www.whitehouse.gov/administration/eop/ostp/grand-challenges
Grand challenges: http://en.wikipedia.org/wiki/Grand_Challenges
Picture from: Intro to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp

High Performance Computing (HPC) and Parallel Computing

• HPC is what is really needed*
  – Parallel computing is so far the only way to get there!
• Parallel computing makes sense!
• Applications that require HPC
  – Many problem domains are naturally parallelizable
  – Data cannot fit in the memory of one machine
• Computer systems
  – Physical limitations: systems have to be built parallel
  – Parallel systems are widely accessible
    • Smartphones have 2 to 4 cores + GPU now

We will discuss each of these two aspects today!

* What is HPC: http://insidehpc.com/hpc-basic-training/what-is-hpc/
  Supercomputer: http://en.wikipedia.org/wiki/Supercomputer
  TOP500 (500 most powerful computer systems in the world): http://en.wikipedia.org/wiki/TOP500, http://top500.org/
  HPC matters: http://sc14.supercomputing.org/media/social-media

Simulation: The Third Pillar of Science

• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations of experiments:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
    • Based on known physical laws and efficient numerical methods.

From slides of Kathy Yelick’s 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/

Applications: Science and Engineering

• Model many difficult problems by parallel computing
  – Atmosphere, Earth, Environment
  – Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
  – Bioscience, Biotechnology, Genetics
  – Chemistry, Molecular Sciences
  – Geology, Seismology
  – Mechanical Engineering - from prosthetics to spacecraft
  – Electrical Engineering, Circuit Design, Microelectronics
  – Computer Science, Mathematics
  – Defense, Weapons

Applications: Industrial and Commercial

• Processing large amounts of data in sophisticated ways
  – Databases, data mining
  – Oil exploration
  – Medical imaging and diagnosis
  – Pharmaceutical design
  – Financial and economic modeling
  – Management of national and multi-national corporations
  – Advanced graphics and virtual reality, particularly in the entertainment industry
  – Networked video and multi-media technologies
  – Collaborative work environments
  – Web search engines, web based business services

Economic Impact of HPC

• Airlines:
  – System-wide logistics optimization systems on parallel systems.
  – Savings: approx. $100 million per airline per year.
• Automotive design:
  – Major automotive companies use large systems (500+ CPUs) for:
    • CAD-CAM, crash testing, structural integrity and aerodynamics.
    • One company has a 500+ CPU parallel system.
  – Savings: approx. $1 billion per company per year.
• Semiconductor industry:
  – Semiconductor firms use large systems (500+ CPUs) for
    • device electronics simulation and logic validation
  – Savings: approx. $1 billion per company per year.
• Securities industry:
  – Savings: approx. $15 billion per year for U.S. home mortgages.

From slides of Kathy Yelick’s 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/

Inherent Parallelism of Applications

• Example: weather prediction and global climate modeling

Global Climate Modeling Problem

• Problem is to compute:
  – f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
  – Discretize the domain, e.g., a measurement point every 10 km
  – Devise an algorithm to predict weather at time t+dt given t
• Uses:
  – Predict major events, e.g., El Nino
  – Air quality forecasting
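The "discretize, then step from t to t+dt" approach can be illustrated with a deliberately tiny model. Everything here is an assumption for illustration: a one-dimensional ring of grid points, a single temperature value per point, and a simple explicit diffusion update. Real climate codes use far richer physics and three-dimensional grids.

```python
def step(temps, alpha=0.1):
    """Advance a 1-D ring of grid-point temperatures by one time step dt.

    Toy explicit diffusion update (an assumption, not a climate model):
    each point moves toward the average of its two neighbours.
    """
    n = len(temps)
    return [temps[i] + alpha * (temps[i - 1] - 2 * temps[i] + temps[(i + 1) % n])
            for i in range(n)]

def simulate(temps, steps):
    """Apply `steps` time steps in sequence; time itself stays serial."""
    for _ in range(steps):
        temps = step(temps)
    return temps

initial = [0.0, 0.0, 10.0, 0.0, 0.0]   # one warm grid point
print(step(initial))
print(simulate(initial, 50))
```

Within one time step, each new value depends only on values from the previous step, so all grid points can be updated independently. That is the inherent data parallelism in this kind of problem, while the loop over time steps remains sequential.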

The Rise of Multicore Processors

Recent Multicore Processors

Recent Manycore GPU Processors

[Excerpts from NVIDIA's Kepler GK110 architecture whitepaper]

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation.
• Hardware support throughout the design to enable new programming model capabilities

[Figure: Kepler GK110 full chip block diagram]

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Kepler Memory Subsystem / L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32KB / 32KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

48KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

• ~3k cores (15 SMX units × 192 CUDA cores = 2,880)

Units of Measure in HPC

• Flop: floating point operation (*, /, +, -, etc.)
• Flop/s: floating point operations per second, also written as FLOPS
• Bytes: size of data
  – A double precision floating point number is 8 bytes
• Typical sizes are millions, billions, trillions…
  – Mega:  Mflop/s = 10^6 flop/sec;  Mbyte = 2^20 = 1,048,576 ≈ 10^6 bytes
  – Giga:  Gflop/s = 10^9 flop/sec;  Gbyte = 2^30 ≈ 10^9 bytes
  – Tera:  Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ≈ 10^12 bytes
  – Peta:  Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ≈ 10^15 bytes
  – Exa:   Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ≈ 10^18 bytes
  – Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ≈ 10^21 bytes
• See www.top500.org for the performance of the fastest machines, measured using the High Performance LINPACK (HPL) benchmark
  – The fastest: Sunway TaihuLight, ~93 petaflop/s
  – The third (fastest in US): DoE ORNL Titan, 17.59 petaflop/s
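The "≈" between the binary byte sizes and the powers of ten hides a gap that grows with each prefix. The following sketch tabulates it; the function and variable names are made up for this example.

```python
PREFIXES = ["Mega", "Giga", "Tera", "Peta", "Exa", "Zetta"]

def prefix_table():
    """Return (name, decimal exponent, binary exponent, relative gap) per prefix."""
    rows = []
    for k, name in enumerate(PREFIXES, start=2):
        dec_exp = 3 * k                  # e.g. Mega = 10^6
        bin_exp = 10 * k                 # e.g. Mbyte = 2^20
        gap = 2 ** bin_exp / 10 ** dec_exp - 1
        rows.append((name, dec_exp, bin_exp, gap))
    return rows

for name, dec_exp, bin_exp, gap in prefix_table():
    print(f"{name:5s} 10^{dec_exp:<2d} vs 2^{bin_exp:<2d}  binary is {gap:.1%} larger")

# A double is 8 bytes, so 1 Gbyte (2^30 bytes) holds this many doubles:
print(2 ** 30 // 8)   # 134217728
```

The gap is under 5% at Mega but about 18% at Zetta, which is worth remembering when comparing memory sizes (usually binary) with flop rates (always decimal).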

How to Measure and Calculate Performance (FLOPS)

https://passlab.github.io/CSCE569/resources/sum.c

• Calculate # FLOPs (2*N or 3*N)
  – Check the loop count (N) and FLOPs per loop iteration (2 or 3).
• Measure the time to compute using a timer
  – elapsed and elapsed_2 are in seconds
• FLOPS = # FLOPs / Time
  – MFLOPS in the example
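The same three steps (count the FLOPs, time the loop, divide) can be sketched in Python. This is not the course's sum.c; it is a separate minimal illustration with invented names, and because the Python interpreter adds a large overhead per iteration, the number it reports will be far below what compiled C achieves on the same machine.

```python
import time

def measure_mflops(n, flops_per_iter=2):
    """Time a multiply-add loop; return (result, elapsed seconds, MFLOPS)."""
    a = 0.5
    total = 0.0
    start = time.perf_counter()
    for i in range(n):
        total += a * i            # one multiply + one add = 2 FLOPs per iteration
    elapsed = time.perf_counter() - start
    total_flops = flops_per_iter * n          # step 1: # FLOPs = 2 * N
    mflops = total_flops / elapsed / 1e6      # step 3: FLOPS = # FLOPs / time
    return total, elapsed, mflops

result, elapsed, mflops = measure_mflops(1_000_000)
print(f"sum = {result}, time = {elapsed:.4f} s, rate = {mflops:.1f} MFLOPS")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short intervals.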

High Performance LINPACK (HPL) Benchmark Performance (Rmax) in Top500

• Measured using the High Performance LINPACK (HPL) benchmark, which solves a dense system of linear equations → ranking the machines
  – Ax = b
  – https://www.top500.org/project/linpack/
  – https://en.wikipedia.org/wiki/LINPACK_benchmarks

Top500 (www.top500.org), Nov 2017

HPC Peak Performance (Rpeak) Calculation

• Node performance in Gflop/s = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node).
  – CPU instructions per cycle (IPC) = # Flops per cycle
    • Because a pipelined CPU can complete one instruction per cycle
    • 4 or 8 for most CPUs (Intel or AMD)
  – http://www.calcverter.com/calculation/CPU-peak-theoretical-performance.php
• HPC Peak (Rpeak) = # nodes * Node Performance in GFlops

CPU Peak Performance Example

• Intel X5600 series CPUs and AMD 6100/6200/6300 series CPUs have 4 instructions per cycle; Intel E5-2600 series CPUs have 8 instructions per cycle
• Example 1: Dual-CPU server based on Intel X5675 (3.06GHz 6-cores) CPUs:
  – 3.06 x 6 x 4 x 2 = 146.88 GFLOPS
• Example 2: Dual-CPU server based on Intel E5-2670 (2.6GHz 8-cores) CPUs:
  – 2.6 x 8 x 8 x 2 = 332.8 GFLOPS
  – With 8 nodes: 332.8 GFLOPS x 8 = 2,662.4 GFLOPS = 2.66 TFLOPS
• Example 3: Dual-CPU server based on AMD 6176 (2.3GHz 12-cores) CPUs:
  – 2.3 x 12 x 4 x 2 = 220.8 GFLOPS
• Example 4: Dual-CPU server based on AMD 6274 (2.2GHz 16-cores) CPUs:
  – 2.2 x 16 x 4 x 2 = 281.6 GFLOPS

https://saiclearning.wordpress.com/2014/04/08/how-to-calculate-peak-theoretical-performance-of-a-cpu-based-hpc-system/
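The Rpeak formula is easy to check by machine. The sketch below encodes it directly and evaluates the four server configurations listed in the examples; the function and dictionary names are invented for this illustration.

```python
def node_peak_gflops(ghz, cores_per_cpu, flops_per_cycle, cpus_per_node):
    """Theoretical peak of one node: clock x cores x flops/cycle x sockets."""
    return ghz * cores_per_cpu * flops_per_cycle * cpus_per_node

def cluster_peak_gflops(nodes, node_gflops):
    """Rpeak of a cluster of identical nodes."""
    return nodes * node_gflops

# (clock in GHz, cores per CPU, flops per cycle, CPUs per node)
servers = {
    "Intel X5675":   (3.06, 6, 4, 2),
    "Intel E5-2670": (2.6, 8, 8, 2),
    "AMD 6176":      (2.3, 12, 4, 2),
    "AMD 6274":      (2.2, 16, 4, 2),
}

for name, spec in servers.items():
    print(f"{name:14s} {node_peak_gflops(*spec):8.2f} GFLOPS per node")

e5_node = node_peak_gflops(*servers["Intel E5-2670"])
print(f"8 x E5-2670 nodes: {cluster_peak_gflops(8, e5_node) / 1000:.2f} TFLOPS")
```

These are upper bounds only: they assume every core retires its maximum number of floating point operations on every cycle, which real applications rarely sustain.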

Performance (HPL) Development Over Years of Top500 Machines

4 Kinds of Ranking of HPC/Supercomputers

1. Top500: ranking according to the measured High Performance LINPACK (HPL) benchmark performance
   – Not peak performance, not other applications
2. Ranking according to HPCG benchmark performance
3. Graph500: ranking according to graph processing capability
   – Shortest Path and Breadth First Search
   – https://graph500.org
4. Green500: ranking according to power efficiency (GFLOPS/Watt)
   – https://www.top500.org/green500/
   – The sublists in the following slides are generated from https://www.top500.org/statistics/sublist/

HPCG Ranking

• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark (http://www.hpcg-benchmark.org/)

Graph500 (https://graph500.org)

• Ranking according to the capability of processing large-scale graphs (Shortest Path and Breadth First Search)

Green500: Power Efficiency (GFLOPS/Watt)

• Power Efficiency = HPL Performance / Power
  – E.g. TaihuLight, #1 of Top500: 93,014.6 / 15,371 = 6.051 Gflops/watt
• https://www.top500.org/green500/
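The TaihuLight figure can be reproduced in one line once the units are made explicit. The sketch assumes the two numbers are in the units the Top500 list publishes them in, Rmax in TFlop/s and power in kW; dividing one by the other then gives GFlop/s per watt directly, since both carry the same factor of 1000. The function name is invented for this example.

```python
def gflops_per_watt(rmax_tflops, power_kw):
    """Power efficiency; TFlop/s divided by kW equals GFlop/s per watt."""
    return rmax_tflops / power_kw

# TaihuLight values as quoted on the slide (units assumed: TFlop/s and kW).
print(round(gflops_per_watt(93_014.6, 15_371), 3))   # 6.051
```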

Green500: Power Efficiency (GFLOPS/Watt)

• https://www.top500.org/green500/

Performance Efficiency

• HPC Performance Efficiency = Actual Measured Performance (GFLOPS) / Theoretical Peak Performance (GFLOPS)
  – E.g. #1 in Top500
    • 93,014.6 / 125,435.9 = 74.2%

https://www.penguincomputing.com/company/blog/calculate-hpc-efficiency/
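Performance efficiency is simply measured performance (Rmax) over theoretical peak (Rpeak), as long as both use the same unit. A minimal check of the figure quoted above, with an invented function name:

```python
def hpl_efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved; rmax and rpeak share one unit."""
    return rmax / rpeak

# Top500 #1 values as quoted on the slide.
print(f"{hpl_efficiency(93_014.6, 125_435.9):.1%}")   # 74.2%
```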

HPL Performance Efficiency of Top500 (2015 list)

• Mostly 40% - 90% (ok)

HPCG Efficiency of Top 70 of Top500 (2015 list)

• Mostly below 5% and only some around 10%

Ranking Summary

• High Performance LINPACK (HPL) for Top500
  – Dense linear algebra (Ax = b), highly computation intensive
  – Ranks the Top500 by absolute computation capability
• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark, an HPL alternative
  – Sparse matrix-vector multiplication, balanced memory and computation intensity
  – Ranks machines with regard to the combination of computation and memory performance
• Graph500: Shortest Path and Breadth First Search
  – Ranks according to the capability of processing large-scale graphs
  – Stresses network and memory systems
• Green500 of Top500 (HPL GFlops/watt)
  – Power efficiency

Why is parallel computing, namely multicore, manycore and clusters, the only way, so far, for high performance?

Semiconductor Trend: “Moore’s Law”

Gordon Moore, co-founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every two years

Image credit: Intel
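The 1975 form of the law, doubling every two years, corresponds to simple exponential growth. The sketch below projects a transistor count forward from a starting point. The starting point is an assumption brought in from outside these slides: the Intel 4004 of 1971, with about 2,300 transistors. The function name is also invented.

```python
def projected_transistors(year, base_year=1971, base_count=2300, doubling_years=2.0):
    """Projected count if the number doubles every `doubling_years` years.

    The default base (Intel 4004, 1971, ~2,300 transistors) is an assumed
    starting point used only for illustration.
    """
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1981, 1991, 2001, 2011):
    print(year, f"{projected_transistors(year):,.0f}")
```

Forty years at one doubling every two years is 20 doublings, a factor of about one million, which takes the projection from thousands to billions of transistors.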

Microprocessor Transistor Counts 1971-2011 & Moore's Law

https://en.wikipedia.org/wiki/Transistor_count

Moore’s Law Trends

• More transistors = ↑ opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipelining, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – An easy way of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades, however…

Problems of Traditional ILP Scaling

• Fundamental circuit limitations [1]
  – delays ⇑ as issue queues ⇑ and multi-port register files ⇑
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies

[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.

ILP Impacts

Simulations of 8-issue Superscalar

Power/Heat Density Limits Frequency

• Some fundamental physical limits are being reached

We Will Have This…

Revolution Happened Already

• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond); slide from CS267 Lecture 1, 01/17/2007

The Trends

[Figure: peak performance of leading machines, 1950 to 2010, on a log scale from 1 KFlop/s (10^3) through 1 MFlop/s (10^6), 1 GFlop/s (10^9) and 1 TFlop/s (10^12) to 1 PFlop/s (10^15). Architecture eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Vector/Parallel. Machines plotted: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L. Annotation: 2X transistors/chip every 1.5 years.]

Year    Floating point operations / second (Flop/s)
1941    1
1945    100
1949    1,000 (1 KiloFlop/s, KFlop/s)
1951    10,000
1961    100,000
1964    1,000,000 (1 MegaFlop/s, MFlop/s)
1968    10,000,000
1975    100,000,000
1987    1,000,000,000 (1 GigaFlop/s, GFlop/s)
1992    10,000,000,000
1993    100,000,000,000
1997    1,000,000,000,000 (1 TeraFlop/s, TFlop/s)
2000    10,000,000,000,000
2005    131,000,000,000,000 (131 Tflop/s)

Page 59: Lecture 1: An Introduction Parallel Computing CSCE … •Learn fundamentals of concurrent and parallel computing –Describe benefits and applications of parallel computing. –Explain

Now It’s Up To Programmers

• Adding more processors doesn’t help much if programmers aren’t aware of them…
– …or don’t know how to use them.
• Serial programs don’t benefit from this approach (in most cases).


Concluding Remarks

• The laws of physics have brought us to the doorstep of multicore technology
– The worst or the best time to major in computer science
• IEEE Rebooting Computing (http://rebootingcomputing.ieee.org/)
• Serial programs typically don’t benefit from multiple cores.
• Automatic parallelization of serial programs isn’t the most efficient approach to using multicore computers.
– Proved not a viable approach
• Learning to write parallel programs involves
– learning how to coordinate the cores.
• Parallel programs are usually very complex and therefore require sound programming techniques and development.


References

• Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory
– https://computing.llnl.gov/tutorials/parallel_comp
• Some slides are adapted from notes of Rice University John Mellor-Crummey’s class and Berkeley Kathy Yelick’s class.
• Examples are from chapter 01 slides of book “An Introduction to Parallel Programming” by Peter Pacheco
– Note the copyright notice
• Latest HPC news
– http://www.hpcwire.com
• World-wide premier conference for supercomputing
– http://www.supercomputing.org/, the week before Thanksgiving week


Vision and Wisdom by Experts

• “I think there is a world market for maybe five computers.”
– Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home”
– Ken Olsen, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
– Bill Gates, chairman of Microsoft, 1981.
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
– Ken Kennedy, CRPC Director, 1994

http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html


A simple example

• Compute n values and add them together.
• Serial solution:
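The serial solution can be sketched as a single loop that computes each value and accumulates it. This is a minimal Python sketch, not the slide's original code: `compute_next_value` is a hypothetical stand-in for the lecture's Compute_next_value, and the squaring it performs is an arbitrary assumption chosen only to make the example runnable.

```python
def compute_next_value(i):
    """Hypothetical stand-in for Compute_next_value: returns the i-th value."""
    return i * i  # assumption: any per-element computation would do here


def serial_sum(n):
    """Compute n values one after another and accumulate them on a single core."""
    total = 0
    for i in range(n):
        x = compute_next_value(i)
        total += x
    return total


print(serial_sum(5))  # 0 + 1 + 4 + 9 + 16 = 30
```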


Example (cont.)

• We have p cores, p much smaller than n.
• Each core performs a partial sum of approximately n/p values.

Each core uses its own private variables and executes this block of code independently of the other cores.
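One way to give each core "approximately n/p values" is a block partition in which the first n mod p cores take one extra value. The sketch below assumes that scheme; the names `block_range`, `partial_sum` and `compute_next_value` are illustrative, and only the `my_` prefix for private variables follows the lecture's naming.

```python
def block_range(my_rank, p, n):
    """Return (my_first_i, my_last_i) for core my_rank.

    Assumption: block partition where the first n % p cores get one extra value.
    """
    base, extra = divmod(n, p)
    my_first_i = my_rank * base + min(my_rank, extra)
    my_last_i = my_first_i + base + (1 if my_rank < extra else 0)
    return my_first_i, my_last_i


def compute_next_value(i):
    """Hypothetical stand-in for Compute_next_value."""
    return i + 1


def partial_sum(my_rank, p, n):
    """The block each core would run independently, touching only private variables."""
    my_sum = 0
    my_first_i, my_last_i = block_range(my_rank, p, n)
    for my_i in range(my_first_i, my_last_i):
        my_x = compute_next_value(my_i)
        my_sum += my_x
    return my_sum


# With n = 10 and p = 4 the blocks have sizes 3, 3, 2, 2.
print([block_range(rank, 4, 10) for rank in range(4)])
```

Because the blocks do not overlap and together cover 0..n-1, adding the p partial sums gives the same result as the serial loop.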


Example (cont.)

• After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
• Ex., 8 cores, n = 24, then the calls to Compute_next_value return:

1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9
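The partial sums for this example can be checked by slicing the 24 values into 8 consecutive blocks of 3. This is only a sketch of the arithmetic, run on one core. Note that the fourth block is taken as 6, 2, 7 here, since those are the values consistent with the partial sum of 15 for core 3 and the global sum of 95 given later in the lecture.

```python
# ASSUMPTION: fourth block taken as 6, 2, 7 so that core 3's partial sum is 15,
# matching the my_sum table and the global sum of 95 in the lecture.
values = [1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7,
          2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9]
p = 8
block = len(values) // p  # 3 values per core

# my_sums[rank] is what core `rank` would hold in its private my_sum.
my_sums = [sum(values[rank * block:(rank + 1) * block]) for rank in range(p)]

print(my_sums)       # [8, 19, 7, 15, 7, 13, 12, 14]
print(sum(my_sums))  # 95
```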


Example (cont.)

• Once all the cores are done computing their private my_sum, they form a global sum by sending results to a designated “master” core which adds the final result.


Example (cont.)

SPMD: All run the same program, but perform differently depending on who they are.
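The master/worker pattern can be sketched with Python threads standing in for cores and a `queue.Queue` standing in for the send/receive channel. Both choices are assumptions made for illustration: the lecture does not prescribe a mechanism, and Python threads show only the control structure, not real parallel speedup. Every "core" runs the same function and branches on its rank, which is the SPMD idea.

```python
import queue
import threading


def global_sum_master(my_sums):
    """Each thread plays one core; core 0 is the master that collects the results."""
    p = len(my_sums)
    mailbox = queue.Queue()  # hypothetical channel: workers "send" by putting into it
    result = {}

    def core(my_rank):
        my_sum = my_sums[my_rank]           # private partial sum of this core
        if my_rank == 0:                    # I am the master core
            total = my_sum
            for _ in range(p - 1):          # p - 1 receives and p - 1 additions
                total += mailbox.get()      # blocks until some worker has sent
            result["global_sum"] = total
        else:                               # I am a worker core
            mailbox.put(my_sum)             # send my_sum to the master

    threads = [threading.Thread(target=core, args=(rank,)) for rank in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["global_sum"]


print(global_sum_master([8, 19, 7, 15, 7, 13, 12, 14]))  # 95
```

The order in which the master receives the partial sums may vary between runs, but integer addition makes the final result the same.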


Example (cont.)

Core     0   1   2   3   4   5   6   7
my_sum   8  19   7  15   7  13  12  14

Global sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

Core     0   1   2   3   4   5   6   7
my_sum  95  19   7  15   7  13  12  14


But wait! There’s a much better way to compute the global sum.


Better parallel algorithm

• Don’t make the master core do all the work.
• Share it among the other cores.
• Pair the cores so that core 0 adds its result with core 1’s result.
• Core 2 adds its result with core 3’s result, etc.
• Work with odd and even numbered pairs of cores.


Better parallel algorithm (cont.)

• Repeat the process now with only the evenly ranked cores.
• Core 0 adds result from core 2.
• Core 4 adds the result from core 6, etc.
• Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result.


Multiple cores forming a global sum
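The pairwise scheme can be simulated on a single core to check the result and to count how many receives core 0 performs. This is a sketch under assumptions: `tree_global_sum` is an illustrative name, and the additions within one round, which would happen simultaneously on real cores, are executed one after another here.

```python
def tree_global_sum(my_sums):
    """Tree-structured global sum; returns (total, number of receives done by core 0)."""
    sums = list(my_sums)      # sums[r] is the running value held by core r
    p = len(sums)
    core0_receives = 0
    stride = 1
    while stride < p:
        # In this round, cores whose rank is a multiple of 2 * stride
        # receive from the core `stride` positions to their right.
        for receiver in range(0, p, 2 * stride):
            sender = receiver + stride
            if sender < p:    # the sender may not exist when p is not a power of 2
                sums[receiver] += sums[sender]
                if receiver == 0:
                    core0_receives += 1
        stride *= 2
    return sums[0], core0_receives


print(tree_global_sum([8, 19, 7, 15, 7, 13, 12, 14]))  # (95, 3)
```

With stride 1 the pairs are (0,1), (2,3), (4,5), (6,7). With stride 2 they are (0,2) and (4,6). With stride 4 only (0,4) remains, so core 0 ends up with the total after three rounds.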


Analysis

• In the first example, the master core performs 7 receives and 7 additions.
• In the second example, the master core performs 3 receives and 3 additions.
• The improvement is more than a factor of 2!


Analysis (cont.)

• The difference is more dramatic with a larger number of cores.
• If we have 1000 cores:
– The first example would require the master to perform 999 receives and 999 additions.
– The second example would only require 10 receives and 10 additions.
• That’s an improvement of almost a factor of 100!
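The counts above follow a simple pattern: with p cores, the master performs p - 1 receives in the first scheme and ceil(log2 p) in the pairwise scheme. The sketch below reproduces the lecture's numbers for p = 8 and p = 1000; the function names are illustrative only.

```python
def master_receives_flat(p):
    """Master receives one partial sum from every other core."""
    return p - 1


def master_receives_tree(p):
    """One receive per round of pairing: ceil(log2(p)) for p >= 1, in integer arithmetic."""
    return (p - 1).bit_length()


for p in (8, 1000):
    flat = master_receives_flat(p)
    tree = master_receives_tree(p)
    print(p, flat, tree, round(flat / tree, 1))
# 8 7 3 2.3
# 1000 999 10 99.9
```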