  • 5/24/2018 SMT and CMP Architectures


    SMT and CMP Architectures

    DINESH

    INTRODUCTION


    Contemporary forms of parallelism

    Instruction-level parallelism (ILP)

    Wide-issue superscalar processors (SS)

    Issue 4 or more instructions per cycle

    Execute a single program or thread

    Attempt to find multiple instructions to issue each cycle

    Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order
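
The difference between in-order and out-of-order issue can be illustrated with a small sketch. This is a toy model under stated assumptions, not a description of any real pipeline: the function `schedule`, the instruction names, the latencies, and the issue width of 2 are all hypothetical and chosen only for illustration.

```python
def schedule(instrs, width=2, in_order=False):
    """Toy issue model (hypothetical). instrs is a list of
    (name, dependency_names, latency) tuples in program order.
    Returns, for each cycle, the names of the instructions issued."""
    done_at = {}            # name -> cycle at which its result becomes available
    pending = list(instrs)  # not-yet-issued instructions, in program order
    cycle = 0
    trace = []
    while pending:
        issued = []
        for ins in list(pending):
            name, deps, latency = ins
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready and len(issued) < width:
                issued.append(ins)
            elif in_order:
                break  # an in-order machine cannot skip past a blocked instruction
        for ins in issued:
            name, deps, latency = ins
            done_at[name] = cycle + latency
            pending.remove(ins)
        trace.append([ins[0] for ins in issued])
        cycle += 1
    return trace


# Assumed example program: a slow load, a dependent add, and an independent mul/sub pair.
program = [
    ("load_a", (), 3),
    ("add_b", ("load_a",), 1),
    ("mul_c", (), 1),
    ("sub_d", ("mul_c",), 1),
]

in_order_trace = schedule(program, in_order=True)
out_of_order_trace = schedule(program, in_order=False)
```

In this model the in-order machine waits behind `add_b` while the load completes, whereas the out-of-order machine issues `mul_c` and `sub_d` during the wait, so it needs fewer cycles in total.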

    Thread-level parallelism (TLP)

    Fine-grained multithreaded superscalars (FGMS)

    Contain hardware state for several threads

    Execute multiple threads

    On any given cycle, the processor executes instructions from one of the threads

    Multiprocessor (MP)

    Performance is improved by adding more CPUs


    Simultaneous Multithreading

    Key idea

    Issue multiple instructions from multiple threads each cycle

    Features

    Fully exploits thread-level parallelism and instruction-level parallelism.

    Multiple functional units: modern processors have more functional units available than a single thread can utilize.

    Register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute.


    Summary: Multithreaded Categories

    [Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, and Simultaneous Multithreading. Legend: Thread 1 to Thread 5, idle slot.]


    The horizontal dimension represents the instruction issue capability in each clock cycle.

    The vertical dimension represents a sequence of clock cycles.

    Empty slots indicate that the corresponding issue slots are unused in that clock cycle.


    Superscalar processor with no multithreading: only one thread is processed in a given clock cycle

    Use of issue slots is limited by a lack of ILP.

    A stall such as an instruction cache miss leaves the entire processor idle.

    Fine-grained multithreading: switches threads on every clock cycle

    Pro: hides the latency of both short and long stalls.

    Con: slows down execution of individual threads that are ready to go.

    Only one thread issues instructions in a given clock cycle.

    Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 cache misses)

    Pros: no switching on each clock cycle, so ready-to-go threads are not slowed down; reduces the number of completely idle clock cycles.

    Con: limited ability to hide shorter stalls.
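
The three policies above can be compared with a rough sketch of issue-slot usage. Everything here is an assumption made for illustration: the issue width of 4, the `ready` traces (how many instructions each thread could issue in each cycle, with 0 meaning stalled), and the one-cycle switch penalty for coarse-grained multithreading. Readiness is a fixed trace, so the model does not capture how scheduling changes a thread's later progress.

```python
WIDTH = 4  # issue slots per cycle (assumed)


def superscalar(ready):
    """No multithreading: only thread 0 ever issues."""
    return [min(WIDTH, n) for n in ready[0]]


def fine_grained(ready):
    """Round-robin between threads every cycle, skipping stalled threads."""
    used, nxt = [], 0
    threads = len(ready)
    for c in range(len(ready[0])):
        issued = 0
        for k in range(threads):
            t = (nxt + k) % threads
            if ready[t][c] > 0:
                issued = min(WIDTH, ready[t][c])
                nxt = (t + 1) % threads
                break
        used.append(issued)
    return used


def coarse_grained(ready):
    """Stay on one thread until it stalls; the switch costs one idle cycle (assumed)."""
    used, cur = [], 0
    threads = len(ready)
    for c in range(len(ready[0])):
        if ready[cur][c] == 0:
            cur = (cur + 1) % threads
            used.append(0)
        else:
            used.append(min(WIDTH, ready[cur][c]))
    return used


# Hypothetical traces: thread 0 suffers a long stall, thread 1 has low ILP.
ready = [
    [2, 0, 0, 0, 0, 2],
    [1, 1, 2, 2, 1, 1],
]

slots_superscalar = sum(superscalar(ready))
slots_coarse = sum(coarse_grained(ready))
slots_fine = sum(fine_grained(ready))
```

With these traces, out of 24 available slots the superscalar uses 4, coarse-grained multithreading uses 8, and fine-grained multithreading uses 10. In every case only one thread issues per cycle, so slots left empty by low ILP stay empty.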


    Simultaneous multithreading: exploits TLP at the same time as it exploits ILP, with multiple threads using the issue slots in a single clock cycle.

    Use of issue slots is limited by the following factors:

    Imbalances in the resource needs of the threads.

    Resource availability across multiple threads.

    Number of active threads considered.

    Finite limits on buffers.

    Ability to fetch enough instructions from multiple threads.

    Practical limitations on which instruction combinations can issue from one thread and from multiple threads.
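
A minimal sketch of the SMT idea, assuming a toy model in which each thread's per-cycle readiness is given by a fixed trace: in every cycle, issue slots are filled from all threads until the issue width is exhausted. The function `smt_issue`, the `ready` traces, and the fixed thread priority (thread 0 first) are hypothetical choices for illustration; none of the limiting factors listed above except the issue width is modelled.

```python
def smt_issue(ready, width=4):
    """Return the number of issue slots used in each cycle.

    ready[t][c] is how many instructions thread t could issue in cycle c
    (0 means the thread is stalled). Threads are served in index order.
    """
    used = []
    for c in range(len(ready[0])):
        free = width
        for t in range(len(ready)):
            take = min(free, ready[t][c])  # a thread can only use slots that are still free
            free -= take
        used.append(width - free)
    return used


# Hypothetical traces: thread 0 suffers a long stall, thread 1 has low ILP.
ready = [
    [2, 0, 0, 0, 0, 2],
    [1, 1, 2, 2, 1, 1],
]

per_cycle = smt_issue(ready, width=4)
narrow = smt_issue(ready, width=2)
```

With a width of 4, cycles in which both threads are ready use slots from both at once, which no single-thread-per-cycle policy can do. With a width of 2, the second thread is squeezed out whenever the first fills the machine, a simple instance of threads competing for shared issue resources.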


    Performance Implications of SMT

    Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread.

    While fetching instructions, thread priority can dramatically influence total throughput. A widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources.

    With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x.

    Alpha 21464 and Intel Pentium 4 are examples of SMT processors.
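
The ICOUNT heuristic mentioned above is usually described as giving fetch priority to the thread with the fewest instructions waiting in the front end of the pipeline. The sketch below is a simplified illustration of that idea, not a model of any real fetch unit: the names `icount_pick` and `simulate`, the per-thread issue rates, the fetch width of 4, and breaking ties by thread name are all assumptions.

```python
def icount_pick(in_flight, n=1):
    """Return the n threads with the fewest fetched-but-not-yet-issued instructions."""
    return sorted(in_flight, key=lambda t: (in_flight[t], t))[:n]


def simulate(issue_rate, cycles, fetch_width=4):
    """Each cycle: fetch for the ICOUNT-selected thread, then let every thread
    issue up to issue_rate[t] of its waiting instructions (hypothetical rates)."""
    in_flight = {t: 0 for t in issue_rate}
    fetched = {t: 0 for t in issue_rate}
    for _ in range(cycles):
        t = icount_pick(in_flight)[0]
        in_flight[t] += fetch_width
        fetched[t] += fetch_width
        for u, rate in issue_rate.items():
            in_flight[u] = max(0, in_flight[u] - rate)
    return fetched


# T0 is stalled (issues nothing); T1 issues 4 instructions per cycle.
fetched = simulate({"T0": 0, "T1": 4}, cycles=6)

# Selecting two threads per cycle from an assumed snapshot of queue occupancy.
picked = icount_pick({"T0": 12, "T1": 3, "T2": 7, "T3": 3}, n=2)
```

In this run the stalled thread receives one fetch and is then left alone because its instructions stay in the queue, while the thread that keeps issuing receives all later fetch slots. This is how the heuristic keeps a stalled thread from clogging shared queues.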


    Effectively Using Parallelism on an SMT Processor

    Parallel workload

    Threads   SS    MP2   MP4   FGMT  SMT
    1         3.3   2.4   1.5   3.3   3.3
    2         --    4.3   2.6   4.1   4.7
    4         --    --    4.2   4.2   5.6
    8         --    --    --    3.5   6.1

    Instruction throughput when executing a parallel workload
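
The table can be read more easily by normalizing each entry to the single-threaded superscalar result. The short sketch below does that using only the numbers from the table; the variable and function names are hypothetical, and the unit of the throughput figures is assumed to be instructions per cycle.

```python
# Throughput values copied from the table above; missing entries ("--") are omitted.
throughput = {
    1: {"SS": 3.3, "MP2": 2.4, "MP4": 1.5, "FGMT": 3.3, "SMT": 3.3},
    2: {"MP2": 4.3, "MP4": 2.6, "FGMT": 4.1, "SMT": 4.7},
    4: {"MP4": 4.2, "FGMT": 4.2, "SMT": 5.6},
    8: {"FGMT": 3.5, "SMT": 6.1},
}

baseline = throughput[1]["SS"]  # single-threaded superscalar


def gain_over_superscalar(arch, threads):
    """Throughput of arch at the given thread count, relative to the SS baseline."""
    return throughput[threads][arch] / baseline


# Best-performing organization for each thread count (first listed wins a tie).
best = {n: max(row, key=row.get) for n, row in throughput.items()}

smt_gain_8 = round(gain_over_superscalar("SMT", 8), 2)
fgmt_gain_8 = round(gain_over_superscalar("FGMT", 8), 2)
```

For this workload, SMT with eight threads reaches about 1.85 times the superscalar baseline, while fine-grained multithreading with eight threads reaches about 1.06 times, below its own result at two or four threads.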


    Comparison of SMT vs Superscalar

    SMT processors are compared to base superscalar processors in several key measures:

    Utilization of functional units.

    Utilization of fetch units.

    Accuracy of branch predictor.

    Hit rates of primary caches.

    Hit rates of secondary caches.

    Performance improvement:

    Issue slots.

    Functional units.

    Renaming registers.


    CMP Architecture

    Chip-level multiprocessing (CMP or multicore): integrates two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of multiple dies packaged together, with each core executing threads independently.

    Every functional unit of a processor is duplicated.

    Multiple processors, each with a full set of architectural resources, reside on the same die.

    Processors may share an on-chip cache, or each can have its own cache.

    Examples: HP Mako, IBM Power4

    Challenges: power, die area (cost)


    Single core computer


    [Figure: single-core CPU chip]


    [Figure: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4]


    Chip Multithreading

    Chip Multithreading (CMT) = Chip Multiprocessing + Hardware Multithreading.

    Chip multithreading is the capability of a processor to process multiple s/w threads simultaneously using multiple h/w threads of execution.

    CMT is achieved by multiple cores on a single chip or multiple threads on a single core.

    CMT processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP).


    CMP Performance

    CMPs are now the only way to build high-performance microprocessors, for a variety of reasons:

    o Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream.

    o The clock speed of today's processors cannot simply be ratcheted up, or the power dissipation will become prohibitive.

    o CMT processors support many h/w strands through efficient sharing of on-chip resources such as pipelines, caches, and predictors.

    o CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.


    SMT and CMP

    The performance race between SMT and CMP is not yet decided.

    CMP is easier to implement, but only SMT has the ability to hide latencies.

    A functional partitioning is not exactly reached within an SMT processor due to the centralized instruction issue.

    o A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.

    o A combination of simultaneous multithreading with the CMP may be superior.

    Research: combine an SMT or CMP organization with the ability to create threads, with compiler support or fully dynamically, out of a single thread.

    o Thread-level speculation

    o Close to multiscalar


    Multiprocessor vs. SMT

    [Figure: issue-slot usage over time (processor cycles) for a multiprocessor (MP2) and SMT. Legend: Thread 1, Thread 2, unutilized slot.]


    THANK YOU