Parallel Computing - Kursused · 2016. 9. 21.
Typical chip today has multiple cores
Data may need to be obtained from a hard disk, RAM or cache before being processed
For many applications, getting the data can be more of a constraint than computing on the data
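The data-movement point can be made concrete with a back-of-the-envelope model. The bandwidth and flop-rate numbers below are illustrative order-of-magnitude assumptions, not measurements of any particular machine:

```python
# Rough model: time to move N doubles vs. time to add them once.
# All rates below are illustrative assumptions.
N = 10**8                      # 100 million doubles = 800 MB
bytes_moved = N * 8

bandwidth = {                  # bytes/second (assumed)
    "cache": 500e9,
    "RAM":    50e9,
    "disk":  0.5e9,
}
flops = N                      # one addition per element
peak_flops = 100e9             # assumed 100 Gflop/s

t_compute = flops / peak_flops
print(f"compute: {t_compute*1e3:.1f} ms")
for level, bw in bandwidth.items():
    t_move = bytes_moved / bw
    print(f"move from {level:5s}: {t_move*1e3:8.1f} ms "
          f"({t_move/t_compute:.1f}x compute time)")
```

Under these assumptions, summing the data takes about 1 ms, while fetching it from RAM takes about 16 ms and from disk about 1.6 s: unless the data is already in cache, getting it dominates computing on it.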
External specialized device for floating point operations
Typically good at doing many simplified instructions in parallel
High latency is compensated by high bandwidth
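"Latency compensated by bandwidth" follows from Little's law: to sustain a bandwidth B in the face of latency L, roughly B x L bytes must be in flight at once, which GPUs achieve by running many threads concurrently. A sketch with assumed (not vendor-specified) numbers:

```python
# Little's law sketch: concurrency needed to hide memory latency.
# Latency and bandwidth values are illustrative assumptions.
latency = 400e-9          # 400 ns memory latency (assumed)
bandwidth = 500e9         # 500 GB/s peak bandwidth (assumed)

bytes_in_flight = bandwidth * latency   # bytes that must be in transit
print(f"{bytes_in_flight/1024:.0f} KiB must be in flight")

# With 128-byte memory transactions, that many bytes in flight means
# this many concurrent outstanding requests:
requests = bytes_in_flight / 128
print(f"about {requests:.0f} outstanding requests")
```

A single CPU thread cannot keep ~1500 requests outstanding; thousands of lightweight GPU threads can, which is how high bandwidth hides the high latency.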
Graphics Cards and General Purpose Computing on Graphics Cards
Nvidia – many simple cores; CUDA, CUDA Fortran, OpenACC, OpenCL and OpenGL application programming interfaces; strong support of the academic community
AMD – many simple cores; OpenCL and OpenGL. Have launched the APU (Accelerated Processing Unit), which combines CPU and GPU
Embedded graphics cards in AMD APUs and in cell phone chips, such as the Qualcomm Snapdragon
1 Tflop of performance
Mini-supercomputer in a compute card
Simplified x86 cores
Typically easy to get code to run, more difficult to get code to run efficiently
Bus – simple, cheap, poor communication performance
Ring – simple, cheap, poor communication performance
Mesh – simple, more expensive than a ring, better communication performance than a ring
Hypercube – good communication performance, expensive at a large scale
Torus 2D, 3D, 4D, 6D – good communication performance
Fat tree – commonly used; not quite as good performance as a torus, but cheaper
Which topology is cost effective for a Monte Carlo simulation?
What is the topology of Rocket?
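The cost/performance trade-off can be quantified by counting links (cost) and the diameter, i.e. worst-case hop count (performance). A sketch using standard textbook formulas:

```python
import math

# Links and diameter for p nodes under common topologies
# (standard textbook formulas).
def ring(p):
    return {"links": p, "diameter": p // 2}

def mesh2d(p):               # assumes p is a perfect square, no wraparound
    s = math.isqrt(p)
    return {"links": 2 * s * (s - 1), "diameter": 2 * (s - 1)}

def torus2d(p):              # 2D mesh with wraparound links
    s = math.isqrt(p)
    return {"links": 2 * s * s, "diameter": 2 * (s // 2)}

def hypercube(p):            # assumes p is a power of two
    d = p.bit_length() - 1   # dimension = log2(p)
    return {"links": d * p // 2, "diameter": d}

p = 1024
for topo in (ring, mesh2d, torus2d, hypercube):
    print(f"{topo.__name__:9s}: {topo(p)}")
```

For 1024 nodes the hypercube has the smallest diameter (10) but the most links (5120), illustrating why it gets expensive at scale. For an embarrassingly parallel Monte Carlo simulation, processes rarely communicate, so even a cheap bus or ring is usually adequate.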
NAS Thinking Machines CM-5, photographer: Tom Trower, 1993 (This is probably a 256-processor machine.)
131 Gflops on 1024 processors
Fastest computer on the Top 500 list in June 1993
Fat tree topology network
Thinking Machines grew out of Danny Hillis's doctoral research, but is no longer producing supercomputers
35.86 Tflops on 5120 processors
Fastest computer on the Top 500 list between June 2002 and November 2004
Vector processors
Five times faster than the previous number one computer on the Top 500
Adam Bertsch next to a Blue Gene/L system at Lawrence Livermore National Laboratory
596 Tflops on 106,496 dual-core processors
Fastest computer on the Top 500 list between November 2004 and November 2007
3D torus and many not-so-fast cores
More at https://asc.llnl.gov/computing_resources/bluegenel/configuration.html
Currently 10.5 Pflops on 88,128 SPARC64 VIIIfx processors with 8 cores per processor
Fastest computer on the Top 500 list between June 2011 and June 2012
Fastest computer on the Graph 500 list from June 2011 to the present
6D mesh/torus network and many fast and smart cores
More at http://www.aics.riken.jp/en/k-computer/system
Titan Supercomputer at Oak Ridge National Laboratory
27 Pflops on 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs
Fastest computer on the Top 500 list between November 2012 and June 2013
More at https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/
33.86 Pflops on 32,000 Intel Xeon E5-2692 chips with 48,000 Xeon Phi 31S1P coprocessors
Fat tree topology; American chips, but the fat tree interconnect is made in China
Fastest computer on the Top 500 list between November 2013 and June 2016
More at www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
93 Pflops on 40,960 Sunway SW26010 260-core chips
Fastest computer on the Top 500 list between June 2016 and now
More at http://engine.scichina.com/downloadPdf/rdbMLbAMd6ZiKwDco
Supercomputer architectures are still evolving
Depending on the problem you are solving, the best choice of computer architecture and algorithm should be made if possible
In many cases you have no choice in the computer architecture of a supercomputer, but do have some choice in the algorithm
Sometimes you are lucky and can choose both, but may need to write a lot of code
Need to consider both peak floating point performance and memory bandwidth to determine the serial speed of your application
Consider re-organizing your work to make the code efficient and well suited to the architecture you are running on
Parallel Computer Architecture; RR 2.1-2.4, 2.7, 3.4, 3.5
Hofmann, J., Treibig, J., Hager, G., Wellein, G., "Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator", http://arxiv.org/abs/1401.3615
T. Hoefler “Networking and Computer Architecture”http://htor.inf.ethz.ch/teaching/CS498/
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison Wesley (2003)
Patterson, D.A., Hennessy, J.L., Computer Organization and Design: The Hardware and Software Interface, 5th Ed., Morgan Kaufmann (2014)
Rahman, R., Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress Open (2013), $0.35 on Amazon
Solnushkin, K., http://clusterdesign.org/fat-trees/