GPU Supercomputing N.D. Hari Dass Indian Institute of Science, Bangalore Poornaprajna Institute, Bangalore Saturday, August 22, 2009
Supercomputing in Old Stone Age
• Long, long ago, supercomputers had to be specially built.
• They required large memory blocks - expensive!!
• The interconnects were proprietary - also expensive, though with great performance!
• Additional features like large-scale vector processing.
Supercomputing in New Stone Age
• The idea was to use off the shelf desktops without monitors, connect them with networks with as high bandwidth and as low latency as possible.
• Distribute the memory
• Era of Clusters
KABRU – The Massive Cluster at IMSc
Supermicro Twin - 2 Nodes in 1U
[Figure labels: Node 1, Node 2]
1U Twin™ is Supermicro's 1U rack-mount system, designed to increase computing density, save cost, and reduce energy and space requirements. It supports dual Xeon dual/quad-core CPUs (up to 16 cores in 1U, up to 672 cores in a 42U rack).
A 1U Twin system contains two independent symmetric motherboards!!!
Twin Motherboards
Supermicro Twin - Specifications
• Supports up to two Intel® Xeon® 51xx, 52xx, 53xx & 54xx processors per node; 1600/1333/1066 MHz system bus
• Supports up to 64GB memory per node; DDR2-667/800 (1.8V/1.5V) FBDIMMs (1.5V FBDIMMs consume less power and generate less heat)
• Available with GbE/DDR IB/10Gb Ethernet
• PCI-Express x16 expansion slot
• High-efficiency shared power supply (93% efficiency)
Supermicro Blade
• 90% cable reduction results in better airflow & better cooling
• Easier and faster to deploy & troubleshoot
• Common, shared, redundant and high-efficiency power supply (90%-93% efficiency)
• 7U blade chassis
• Can accommodate 10 dual-processor or quad-processor blades
• Up to 160 cores per 7U or 960 cores per 42U rack (using quad-processor blades)
• Up to 32GB/64GB memory per dual/quad-processor blade
• DDR Infiniband available as option
Clusters: Then & Now
                2003     NOW
                         1U       Twin      Blade
No. of CPUs     164      20       20        20
Rack space      82U      10U      5U        7U
Watts           25 kW    4 kW     3.85 kW   3.85 kW
Twin-U Vs Blade
Twin 1U                            Blades
More compact/less space (0.5U)     0.7U
Cheaper                            Expensive
Std. PCI-Express expansion         Mezzanine expansion
Power supply not redundant         Redundant power
Cabling is a mess                  Lesser/neater cabling
Some of the problems..
• Slow PCI slot performance
• Memory access bottlenecks
Core Incompetence?
Intel 2xQuad Core @ 2.8 GHz
Single         493 MB    81.2 s    1.936 µs    --
2 Cores        246.5     43.1 s    2.06 µs     788 MB/s
4 Cores        129       33.3 s    3.18 µs     492
8 Cores 1-D    70.4      32.2 s    6.15 µs     173
8 Cores 3-D    61.7      31.6 s    6.03 µs     414
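The Intel timings above can be turned into speedup and parallel efficiency figures. This is a minimal sketch, assuming the column given in seconds is the wall-clock time for the same total problem in every row; the name `scaling_table` is hypothetical and not from the talk.

```python
# Wall-clock times (s) copied from the Intel 2xQuad Core table.
# Assumption: every row solves the same total problem, so ratios of
# times can be read as speedups.
INTEL_TIMES = {
    "Single": (1, 81.2),
    "2 Cores": (2, 43.1),
    "4 Cores": (4, 33.3),
    "8 Cores 1-D": (8, 32.2),
    "8 Cores 3-D": (8, 31.6),
}


def scaling_table(times, baseline="Single"):
    """Return {label: (speedup, efficiency)} relative to the baseline row."""
    t1 = times[baseline][1]
    result = {}
    for label, (cores, seconds) in times.items():
        speedup = t1 / seconds
        # Efficiency is the speedup divided by the number of cores used.
        result[label] = (speedup, speedup / cores)
    return result


if __name__ == "__main__":
    for label, (s, e) in scaling_table(INTEL_TIMES).items():
        print(f"{label:12s} speedup {s:5.2f}  efficiency {e:6.1%}")
```

Under these assumptions the run time barely changes between 4 and 8 cores, so the efficiency roughly halves.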
Core Incompetence?
AMD 2xQuad 2111 GHz
1 Core     492    147 s      3.5 µs
2 Cores    246    72.32 s    3.448 µs
4 Cores    129    47.8       4.56 µs
8 Cores    70     29.3       5.6
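One way to read the AMD timings is through the Karp-Flatt metric, which estimates the effective serial fraction from a measured speedup. The metric is brought in here only as an illustration; the talk does not use it. The sketch assumes the 4-core and 8-core times are in seconds like the first two rows, and the function name `karp_flatt` is hypothetical.

```python
# Times copied from the AMD 2xQuad table, keyed by core count.
# Assumption: the last two rows are in seconds like the first two.
AMD_TIMES = {1: 147.0, 2: 72.32, 4: 47.8, 8: 29.3}


def karp_flatt(t1, tp, p):
    """Experimentally determined serial fraction for p > 1 cores.

    e = (1/S - 1/p) / (1 - 1/p), with speedup S = t1 / tp.
    """
    inv_speedup = tp / t1
    return (inv_speedup - 1.0 / p) / (1.0 - 1.0 / p)


if __name__ == "__main__":
    t1 = AMD_TIMES[1]
    for p in (2, 4, 8):
        e = karp_flatt(t1, AMD_TIMES[p], p)
        print(f"{p} cores: speedup {t1 / AMD_TIMES[p]:.2f}, "
              f"serial fraction {e:+.3f}")
```

A slightly negative value at 2 cores means the measured speedup is a little above 2; the values at 4 and 8 cores come out near 0.1 and 0.085.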
Intel Nehalem
• This architecture has significantly overcome the FSB bottlenecks.
• The scaling from 1 to 2, 2 to 4 cores is excellent.
• The scaling from 4 to 8 is good, though not as good as in the case of AMD.
• But the overall performance of Nehalem is better than that of AMD.
Speed - Memory Issue
• As the number of cores goes up the CPU performance (theoretical peak) increases.
• KABRU: 4.8 GFlops/CPU
• Intel Quad Core: 50 GFlops/CPU
• It becomes harder to maintain the ratio of ‘Memory to Performance’.
• Issues with increasing memory: different chipset, power consumption, ...
GPU Based Supercomputing
• A single Tesla C1060 card has a claimed peak performance of 1 Teraflops in single precision!
• Four such cards can sit in a single 1U box.
• Cost of such GPGPU supercomputers is about 5 lakh rupees.
• Nearly 4 times as fast as Kabru but costing 50 times less!
• Power consumption about 800 W - 40 times less; no air conditioning/infrastructure needed.
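The ratios quoted above can be inverted to see what they imply for Kabru. This is only back-of-the-envelope arithmetic on the claimed figures, not measured data, and the name `implied_kabru_figures` is hypothetical.

```python
# Figures quoted on the slide (claims, not measurements).
CARD_PEAK_TFLOPS = 1.0   # Tesla C1060, single precision, claimed peak
CARDS_PER_BOX = 4        # four cards in one 1U box
BOX_COST_LAKH = 5.0      # rupees, in lakhs
BOX_POWER_W = 800.0
SPEED_RATIO = 4.0        # "nearly 4 times as fast as Kabru"
COST_RATIO = 50.0        # "costing 50 times less"
POWER_RATIO = 40.0       # "40 times less"


def implied_kabru_figures():
    """Back out the Kabru figures implied by the quoted ratios."""
    box_peak = CARD_PEAK_TFLOPS * CARDS_PER_BOX
    return {
        "box_peak_tflops": box_peak,
        "kabru_tflops": box_peak / SPEED_RATIO,
        "kabru_cost_lakh": BOX_COST_LAKH * COST_RATIO,
        "kabru_power_kw": BOX_POWER_W * POWER_RATIO / 1000.0,
    }


if __name__ == "__main__":
    for name, value in implied_kabru_figures().items():
        print(f"{name}: {value}")
```

Taken at face value, the claims imply a cluster of about 1 Teraflops costing about 250 lakh rupees and drawing about 32 kW.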
A Tesla C1060 Card
4 Tesla In 1U
Issues with GPUs
• Codes should have a high degree of data parallelism.
• Available dedicated memory is rather low - even for Tesla C1060 cards it is 4 GB per card.
• Double precision performance is much poorer than single precision performance - a factor of 12 lower!!
• This is due to the register structure - an improvement by a factor of 3 is being talked about.
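The cost of the double precision penalty for a mixed-precision code can be estimated with a simple weighted sum. This is a rough sketch, assuming run time is proportional to the amount of arithmetic and that double precision work is slower by a fixed factor (12 as quoted above); the function name `mixed_precision_slowdown` is hypothetical.

```python
def mixed_precision_slowdown(double_fraction, dp_penalty=12.0):
    """Run time relative to an all-single-precision code.

    double_fraction: share of the arithmetic done in double precision (0..1).
    dp_penalty: how many times slower double precision is (12 on the slide).
    """
    if not 0.0 <= double_fraction <= 1.0:
        raise ValueError("double_fraction must be between 0 and 1")
    return (1.0 - double_fraction) + double_fraction * dp_penalty


if __name__ == "__main__":
    # A 10% double precision share with the quoted factor of 12.
    print(mixed_precision_slowdown(0.10))
    # The same share if the penalty improved by a factor of 3 (12 / 3 = 4).
    print(mixed_precision_slowdown(0.10, dp_penalty=4.0))
```

In this model a 10% double precision share costs about a factor of 2.1 over pure single precision, and about 1.3 if the penalty drops from 12 to 4.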
Issues with GPUs
• If the code is a mixture of single and double precision, with the volume of the latter around 10%, it is still OK.
• Exploiting the host CPUs is an option.
• Transfers between CPU and GPU go through PCI Express x16 Gen 2.0.
• Transfer speed is nowhere near that between, say, CPU and cache.
• Often better to perform a fresh calculation instead of fetching processed data.
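The trade-off between recomputing and fetching can be sketched with idealized peak numbers. The assumptions are the theoretical PCI Express Gen 2 x16 bandwidth of 8 GB/s in one direction and the claimed 1 Teraflops single precision peak of the Tesla C1060. Sustained rates are lower on both sides, so this is an order-of-magnitude illustration only, and all function names are hypothetical.

```python
PCIE_GEN2_X16_BYTES_PER_S = 8e9  # theoretical peak, one direction
GPU_PEAK_FLOPS = 1e12            # claimed single precision peak, Tesla C1060


def transfer_time(n_bytes, bandwidth=PCIE_GEN2_X16_BYTES_PER_S):
    """Seconds to move n_bytes across the bus at the given bandwidth."""
    return n_bytes / bandwidth


def compute_time(n_flops, peak=GPU_PEAK_FLOPS):
    """Seconds to perform n_flops at the given peak rate."""
    return n_flops / peak


def recompute_is_cheaper(n_bytes, flops_per_byte):
    """True if regenerating the data on the GPU beats transferring it."""
    return compute_time(n_bytes * flops_per_byte) < transfer_time(n_bytes)


def break_even_flops_per_byte(peak=GPU_PEAK_FLOPS,
                              bandwidth=PCIE_GEN2_X16_BYTES_PER_S):
    """Operations per byte at which recomputing and fetching take equal time."""
    return peak / bandwidth


if __name__ == "__main__":
    print(break_even_flops_per_byte())
    print(recompute_is_cheaper(1 << 20, flops_per_byte=50))
    print(recompute_is_cheaper(1 << 20, flops_per_byte=200))
```

With these peak figures the break-even point is 125 floating point operations per byte transferred, so data that can be regenerated more cheaply than that is better recomputed on the card.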
Issues with GPUs
• Have to code using a new ‘language’ - CUDA in the case of NVIDIA cards.
• Not really a problem for moderate-sized codes but can be an issue for large codes.
• Requires dexterous management of CPU and GPU resources.
• But considering the phenomenal performance improvements that are being talked about, worth the trouble!!
• Intel Larrabee ??