Top Banner
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett
26

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Dec 27, 2015

Download

Documents

Pearl Wood
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simultaneous Multithreading: Maximizing

On-Chip Parallelism

Presented By:Daron ShrodeShey Liggett

Page 2: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Introduction Simultaneous Multithreading – a

technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle.

The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.

Page 3: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Page 4: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Overview Introduce several SM models Evaluate the performance of those

models relative to superscalar and fine-grain multithreading

Show how to tune the cache hierarchy for SM processors

Demonstrate the potential for performance and real-estate advantages of SM architectures over small-scale, on chip multiprocessors

Page 5: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simulation Environment Developed a simulation environment that

defines an implementation of SM architecture Uses emulation-based instructions-level

simulation, similar to Tango and g88 Models the execution pipelines, the memory

hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor

Based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution

Page 6: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simulation Environment (cont.)

Typical Simulated configuration contains 10 functional units of four types (four integer, two floating point, three load/store and 1 branch) and a maximum issue rate of 8 instructions per cycle.

Page 7: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simulation Environment (cont.)

Page 8: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Page 9: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Superscalar Bottlenecks No dominant source of wasted issue bandwidth,

therefore, no dominant solution No single latency-tolerating technique will produce a

dramatic increase in the performance of these programs if it only attacks specific types of latencies

Page 10: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models

Page 11: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.)

Page 12: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.)

Page 13: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.)

Page 14: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.)

Page 15: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.)

Page 16: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

SM Machine Models (cont.) In summary, the results show that

simultaneous multithreading surpasses limits on the performance attainable through either single-thread execution or fine-grain multithreading, when run on a wide superscalar.

Simplified implementations of SM with limited per-thread capabilities can still attain high instruction throughput.

These improvements come without any significant tuning of the architecture for multithreaded execution.

Page 17: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Cache Design Cache sharing caused performance

degradation in SM processors. Different cache configurations

were simulated to determine optimum configurations

Page 18: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Page 19: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Cache Design Two configurations appear to be

good choices: 64s.64s 64p.64s

Important Note: cache sizes today are larger than those at the time of the paper (1995).

Page 20: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simultaneous Multithreading vs Single-Chip Multiprocessing On organizational level, the two

are similar: Multiple register sets Multiple functional units High issue bandwidth on a single chip

Page 21: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)

Key difference is the way resources are partitioned and scheduled: MP statically partitions resources SM allows partitions to change every

cycle MP and SM tested in similar

configurations to compare performance:

Page 22: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)

Page 23: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Conclusion

Page 24: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Pentium 4 Product Features Available at 1.50, 1.60, 1.70, 1.80, 1.90 and 2 GHz Binary compatible with applications running on previous members of the Intel microprocessor

line Intel ® NetBurst™ micro-architecture System bus frequency at 400 MHz Rapid Execution Engine: Arithmetic Logic Units (ALUs) run at twice the processor core frequency Hyper Pipelined Technology Advance Dynamic Execution —Very deep out-of-order execution —Enhanced branch prediction Level 1 Execution Trace Cache stores 12K micro-ops and removes decoder latency from main

execution loops 8 KB Level 1 data cache 256 KB Advanced Transfer Cache (on- die, full speed Level 2 (L2) cache) with 8-way associativity

and Error Correcting Code (ECC) 144 new Streaming SIMD Extensions 2 (SSE2) instructions Enhanced floating point and multimedia unit for enhanced video, audio, encryption, and 3D

performance Power Management capabilities —System Management mode —Multiple low-power states Optimized for 32-bit applications running on advanced 32-bit operating systems 8-way cache associativity provides improved cache hit rate on load/store operations.

Page 25: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

AMD AthlonThe AMD Athlon XP processor features a seventh-generationmicroarchitecture with an integrated, exclusive L2 cache, whichsupports the growing processor and system bandwidthrequirements of emerging software, graphics, I/O, and memorytechnologies. The high-speed execution core of theAMD Athlon XP processor includes multiple x86 instructiondecoders, a dual-ported 128-Kbyte split level-one (L1) cache, anexclusive 256-Kbyte L2 cache, three independent integerpipelines, three address calculation pipelines, and a superscalar,fully pipelined, out-of-order, three-way floating-point engine.The floating-point engine is capable of delivering outstandingperformance on numerically complex applications.

Page 26: Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

AMD Athlon (cont.)The following features summarize the AMD Athlon XPprocessor QuantiSpeed architecture: An advanced nine-issue, superpipelined, superscalar x86

processor microarchitecture designed for increasedInstructions Per Cycle (IPC) and high clock frequencies

Fully pipelined floating-point unit that executes all x87(floating-point), MMX, SSE and 3DNow! instructions

Hardware data pre-fetch that increases and optimizesperformance on high-end software applications utilizinghigh-bandwidth system capability

Advanced two-level Translation Look-aside Buffer (TLB)structures for both enhanced data and instruction addresstranslation. The AMD Athlon XP processor withQuantiSpeed architecture incorporates three TLBoptimizations: the L1 DTLB increases from 32 to 40 entries,the L2 ITLB and L2 DTLB both use exclusive architecture,and the TLB entries can be speculatively loaded.