
Ch2 Embedded Processors-I

Apr 04, 2018

  • 7/29/2019 Ch2 Embedded Processors-I

    1/24

    8/13/2012 Embedded Systems 1

    Embedded Processors-I

    Dr. Aparna P., Assistant Professor

    EC Dept

    NITK, Surathkal


    Embedded Processor Categories

    General Purpose Processor

    Microcontrollers

    Digital Signal Processor

    Customized processors and FPGAs can be included for specific functionality.


    Microprocessor


    General Purpose Processors

    Processor designed for a variety of computation tasks
    - Off-the-shelf: pre-designed for a common task

    Low unit cost, in part because the manufacturer spreads NRE over large numbers of units

    Carefully designed, since higher NRE is acceptable
    - Can yield good performance, size and power

    Low NRE cost, short time-to-market/prototype, high flexibility
    - User just writes software; no processor design


    Basic Architecture


    Evolution

    Intel Processors


    -contd

    1950s: IBM instituted a research program.

    1964: Release of System/360.

    Mid-1970s: Improved measurement tools demonstrated on CISC.

    1971: Intel released its first processor, the Intel 4004, for use in calculators.

    1975: MC 6800 released; first processor with index registers.

    1975: 801 project initiated at IBM's Watson Research Center.

    1979: 32-bit RISC microprocessor (801) developed, led by Joel Birnbaum.

    1979: MC 68000, a 32-bit processor with 16-bit buses and a protected mode of operation.

    1981: MIPS-I developed at Stanford, RISC-I at Berkeley.

    1988: RISC processors had taken over the high end of the workstation market.

    Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6000.
    - AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC.


    Architectural Variants

    Von Neumann vs Harvard Architecture:

    Harvard allows two simultaneous memory fetches.

    Most DSPs and embedded controllers use Harvard architecture for streaming data:
    - greater memory bandwidth
    - more predictable bandwidth

    Most computers are von Neumann architecture.

    In certain embedded applications where the program is more or less hard-wired, the Harvard architecture is advantageous.
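
    The bandwidth argument can be sketched with a toy model. It assumes every instruction needs one instruction fetch, some instructions also need one data access, and each memory serves one access per cycle. The function names and this cycle model are illustrative assumptions, not taken from the slides.

```python
def von_neumann_cycles(needs_data):
    """Memory cycles with one shared memory (toy model, assumed).

    The instruction fetch and the data access use the same memory,
    so they are serialized: 1 cycle per fetch plus 1 per data access.
    """
    return sum(1 + (1 if d else 0) for d in needs_data)


def harvard_cycles(needs_data):
    """Memory cycles with separate instruction and data memories.

    A data access can overlap an instruction fetch, so only the
    instruction fetches determine the cycle count in this model.
    """
    return len(needs_data)


if __name__ == "__main__":
    # Hypothetical program: True marks an instruction that touches data memory.
    program = [True, False, True, True]
    print("von Neumann:", von_neumann_cycles(program))
    print("Harvard:    ", harvard_cycles(program))
```

    In this model the Harvard count depends only on the number of instructions, which is one way to read "more predictable bandwidth".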


    -contd

    RISC vs CISC

    Complex instruction set computer (CISC):
    - many addressing modes
    - many operations
    - simple programming and less program space
    - complex processor
    - control-store (microprogrammed) control unit

    Reduced instruction set computer (RISC):
    - load/store architecture
    - simple processor and pipelinable instructions
    - hardwired control unit
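
    The contrast between a memory-operand CISC instruction and a RISC load/store sequence can be sketched as below. The mnemonics (ADDM, LOAD, ADD, STORE) and the tiny interpreters are hypothetical, chosen only to show the same addition taking one instruction in one style and four in the other.

```python
def run_cisc(mem):
    """Execute a one-instruction, memory-to-memory add (hypothetical ISA)."""
    program = [("ADDM", "a", "b")]  # mem[a] = mem[a] + mem[b]
    for op, dst, src in program:
        if op == "ADDM":
            mem[dst] = mem[dst] + mem[src]
    return len(program)  # instructions needed


def run_risc(mem):
    """Execute the same add on a load/store machine (hypothetical ISA).

    Only LOAD and STORE touch memory; ADD works on registers.
    """
    regs = {}
    program = [
        ("LOAD", "r1", "a"),
        ("LOAD", "r2", "b"),
        ("ADD", "r1", "r2"),    # r1 = r1 + r2
        ("STORE", "r1", "a"),
    ]
    for op, x, y in program:
        if op == "LOAD":
            regs[x] = mem[y]
        elif op == "ADD":
            regs[x] = regs[x] + regs[y]
        elif op == "STORE":
            mem[y] = regs[x]
    return len(program)  # instructions needed
```

    The CISC form occupies less program space, while each RISC instruction does one simple thing, which is what makes the instructions easy to pipeline.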


    Pipelining: Increasing Instruction Throughput

    [Figure: non-pipelined vs. pipelined laundry analogy and pipelined instruction execution, plotted against time slots 1-8. Instruction pipeline stages: Fetch-instr., Decode, Fetch ops., Execute, Store res.]
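
    The throughput gain from pipelining can be put into numbers with the usual idealized model: one cycle per stage, no stalls, no hazards. These are simplifying assumptions, and the function names are illustrative.

```python
def non_pipelined_cycles(n_instructions, n_stages):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline (n_stages cycles);
    every later instruction completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 8, 5  # eight instructions, five stages as on the slide
    print(non_pipelined_cycles(n, k), "cycles without pipelining")
    print(pipelined_cycles(n, k), "cycles with pipelining")
```

    For eight instructions and five stages the model gives 40 cycles without pipelining and 12 with it. Each instruction still takes five cycles, so the gain is in throughput, not in the latency of a single instruction.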


    -contd: Superscalar vs VLIW

    Superscalar

    - Fetches instructions in batches, executes as many as possible

    - May require extensive hardware to detect independent instructions

    VLIW

    - Each word in memory has multiple independent instructions

    - Relies on the compiler to detect and schedule instructions

    - Currently growing in popularity

    [Figure: two pipelines (Fetch-instr., Decode, Execute) operating side by side, showing pipelined instruction execution over time slots 1-8.]

    Multiple ALUs to support more than one instruction stream
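
    The independence check that a superscalar processor performs in hardware, and that a VLIW compiler performs ahead of time, can be sketched as a greedy in-order grouping. This is a simplified model: it only checks registers written earlier in the same group, and the instruction format `(dest, sources)` is a made-up representation.

```python
def issue_schedule(instructions, width=2):
    """Group instructions into issue cycles, at most `width` per cycle.

    An instruction joins the current group only if it neither reads nor
    writes a register written by an instruction already in that group.
    Issue is in order: the first dependent instruction ends the group.
    """
    cycles = []
    i = 0
    while i < len(instructions):
        group = []
        written = set()
        while i < len(instructions) and len(group) < width:
            dest, srcs = instructions[i]
            if dest in written or written & set(srcs):
                break  # depends on this cycle's results; wait for next cycle
            group.append(instructions[i])
            written.add(dest)
            i += 1
        cycles.append(group)
    return cycles


if __name__ == "__main__":
    # Hypothetical program: (destination register, source registers)
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),   # independent of the first: issues with it
        ("r7", ("r1", "r4")),
        ("r8", ("r7",)),        # needs r7: cannot pair with the previous one
    ]
    for cycle, group in enumerate(issue_schedule(program), start=1):
        print(cycle, [dest for dest, _ in group])
```

    A VLIW compiler would emit each group as one long instruction word, which is why the hardware can stay simple.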


    Typical Processors-VIA C3


    Architecture-VIA C3

    The VIA C3 is a processor from VIA Technologies based on the x86 ISA.

    Compared to the Pentium, it is power efficient and hence more suitable for the embedded market.

    Low power consumption and effective heat dissipation.

    Suitable for personal electronics and mobile phones.

    Good performance for Internet and digital media applications, video conferencing and web browsing.

    Multiple pipeline stages: 12 stages.

    More than one level of cache memory.

    Available in an EBGA package.


    Architectural Details

    Instruction Fetch Unit

    Fetches instructions from the I-cache or the external bus.

    Three pipeline stages in the Instruction Fetch Unit deliver aligned instructions into the instruction decode buffers.

    The instruction is predecoded as it comes out of the cache.

    Predecode is overlapped with other required operations and thus effectively takes no time.

    The fetched instruction data is placed sequentially into multiple buffers.

    The TLB (Translation Look-aside Buffer) holds the addresses of recently accessed memory pages.

    The TLB enables faster computing because it allows address processing to take place independently of the normal address-translation pipeline.
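
    What a TLB does can be sketched as a small cache of page translations in front of the page table. The 4 KB page size, the four-entry capacity and the LRU replacement policy below are assumptions for illustration and do not describe the C3's actual TLB.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Tiny translation cache: virtual page number -> physical frame number."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # full mapping (the slow path)
        self.capacity = capacity       # assumed number of TLB entries
        self.entries = OrderedDict()   # recently used translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TLB({0: 5, 1: 9})             # hypothetical page table
    for addr in (0x10, 0x20, 0x1001):
        print(hex(addr), "->", hex(tlb.translate(addr)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

    Repeated accesses to the same page hit in the TLB and skip the page-table walk, which is the speed-up the slide refers to.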


    -contd

    Instruction Decode Unit

    Converts instruction bytes into the internal execution format in two pipeline stages.

    Branching operations are identified here, and the processor starts fetching new instructions from a different location.

    The F stage decodes and formats an instruction into an intermediate format. The internal-format instructions are placed into a five-deep FIFO queue: the FIQ.

    The X stage translates an intermediate-form instruction from the FIQ into the internal microinstruction format.

    Instruction fetch, decode, and translation are made asynchronous from execution via a five-entry FIFO queue.
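
    How a five-entry FIFO decouples the front end from execution can be sketched with a cycle-by-cycle toy simulation. The model (one instruction produced and one consumed per cycle, with the back end stalled in given cycles) is an assumption for illustration only.

```python
from collections import deque


def simulate(cycles, backend_stalled, depth=5):
    """Return (front-end stall cycles, instructions executed).

    Each cycle the back end removes one instruction from the queue unless
    it is stalled, and the front end adds one unless the queue is full.
    """
    queue = deque()
    frontend_stalls = 0
    executed = 0
    for cycle in range(cycles):
        if cycle not in backend_stalled and queue:
            queue.popleft()
            executed += 1
        if len(queue) < depth:
            queue.append(cycle)    # front end delivers an instruction
        else:
            frontend_stalls += 1   # no room: fetch/decode must wait
    return frontend_stalls, executed


if __name__ == "__main__":
    stalled = {2, 3, 4}  # hypothetical three-cycle execution stall
    print("depth 5:", simulate(10, stalled, depth=5))
    print("depth 1:", simulate(10, stalled, depth=1))
```

    With a depth of five the front end keeps working through the three-cycle stall; with a depth of one it stalls for all three cycles.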


    -Contd

    Branch Prediction (BP): Branch History Table (BHT) and Branch Target Buffer (BTB)

    The IFU prefetches instructions into the I-cache at different stages and sends them for decoding. When a branch is taken, all instructions fetched after it are abandoned and a new set needs to be loaded.

    Predicting the branch earlier in the pipeline saves the time otherwise spent flushing the current instructions and fetching new ones.

    BP is a technique that attempts to infer the proper next instruction address, knowing only the current one.

    Typically it uses a BTB, a small associative memory that watches the instruction cache index and tries to predict which index should be accessed next, based on branch history, which is stored in another set of buffers known as the BHT. This is carried out in the F stage.
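
    The interplay of a BHT and a BTB can be sketched with the textbook two-bit saturating counter scheme. This scheme, the dictionary-based tables and the addresses used are assumptions for illustration; the slides do not say which prediction algorithm the C3 actually uses.

```python
class BranchPredictor:
    """Toy predictor: BHT of 2-bit counters plus a BTB of last targets."""

    def __init__(self):
        self.bht = {}  # branch address -> counter 0..3 (2 or more predicts taken)
        self.btb = {}  # branch address -> most recent taken target

    def predict(self, pc, fallthrough):
        """Return the predicted next instruction address."""
        if self.bht.get(pc, 1) >= 2 and pc in self.btb:
            return self.btb[pc]
        return fallthrough

    def update(self, pc, taken, target):
        """Record the real outcome once the branch has executed."""
        counter = self.bht.get(pc, 1)  # unseen branches start weakly not-taken
        self.bht[pc] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        if taken:
            self.btb[pc] = target


def count_mispredictions(outcomes, pc=0x40, target=0x10, fallthrough=0x44):
    """Run one branch through the predictor and count wrong guesses."""
    predictor = BranchPredictor()
    wrong = 0
    for taken in outcomes:
        actual = target if taken else fallthrough
        if predictor.predict(pc, fallthrough) != actual:
            wrong += 1
        predictor.update(pc, taken, target)
    return wrong


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken nine times, then loop exit
    print(count_mispredictions(loop_branch), "mispredictions out of", len(loop_branch))
```

    For a loop branch that is taken nine times and then falls through, this model mispredicts only the first and the last iteration, so the pipeline is flushed twice instead of ten times.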


    -contd

    Integer Unit

    Decode stage (R): Micro-instructions are decoded, integer register files are accessed and resource dependencies are evaluated.

    Addressing stage (A): Memory addresses are calculated and sent to the D-cache (Data Cache).

    Cache Access stages (D, G): The D-cache and D-TLB (Data Translation Look-aside Buffer) are accessed, and aligned load data is returned at the end of the G stage.

    Execute stage (E): Integer ALU operations are performed. All basic ALU functions take one clock except multiply and divide.

    Store stage (S): Integer store data is grabbed in this stage and placed in a store buffer.

    Write-back stage (W): The results of operations are committed to the register file.
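
    The stage counts given so far add up to the 12-stage pipeline: three fetch stages, two decode stages (F, X) and the seven integer stages (R, A, D, G, E, S, W). The sketch below lists them and shows where an instruction would be in a given cycle, assuming one cycle per stage and no stalls. The fetch-stage names are placeholders, since the slides do not name them.

```python
FETCH_STAGES = ["fetch-1", "fetch-2", "fetch-3"]   # placeholder names
DECODE_STAGES = ["F", "X"]
INTEGER_STAGES = ["R", "A", "D", "G", "E", "S", "W"]
PIPELINE = FETCH_STAGES + DECODE_STAGES + INTEGER_STAGES


def stage_at(entry_cycle, cycle):
    """Stage occupied in `cycle` by an instruction that entered the
    pipeline in `entry_cycle`, or None if it is not in the pipeline."""
    index = cycle - entry_cycle
    return PIPELINE[index] if 0 <= index < len(PIPELINE) else None


if __name__ == "__main__":
    print(len(PIPELINE), "stages")
    print("instruction entering in cycle 0 is in stage", stage_at(0, 9), "in cycle 9")
```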


    -Contd

    Floating Point Unit (FPU)

    Separate 80-bit floating-point execution unit that can execute floating-point instructions (FPI) in parallel with integer instructions.

    FPI are passed from the integer pipeline to the FPU through a separate FIFO queue.

    This queue, which runs at the processor clock speed, decouples the slower-running FP unit from the integer pipeline so that the integer pipeline can continue to process instructions overlapped with FP instructions.

    Basic arithmetic floating-point instructions (add, multiply, divide, square root, compare, etc.) are represented by a single internal floating-point instruction.

    Certain little-used and complex floating-point instructions (sin, tan, etc.) are implemented in microcode and are represented by a long stream of instructions coming from the ROM. These instructions tie up the integer instruction pipeline such that integer execution cannot proceed until they complete.


    -Contd

    MMX & 3D Unit

    Separate execution unit for the MMX-compatible instructions.
    - One MMX instruction can issue into the MMX unit every clock.
    - The MMX multiplier is fully pipelined and can start one non-dependent MMX multiply[-add] instruction (which consists of up to four separate multiplies) every clock.
    - Other MMX instructions execute in one clock.
    - Multiplies followed by a dependent MMX instruction require two clocks.

    Separate execution unit for some specific 3D instructions.
    - These instructions assist graphics transformations via SIMD (Single Instruction, Multiple Data) single-precision floating-point capabilities.
    - One 3D instruction can issue into the 3D unit every clock.
    - The 3D unit has two single-precision floating-point multipliers and two single-precision floating-point adders. Other functions such as conversions, reciprocal, and reciprocal square root are provided.
    - The multiplier and adder are fully pipelined and can start any non-dependent 3D instruction every clock.
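
    The "up to four separate multiplies" in one MMX multiply-add can be illustrated with a plain-Python model of a packed multiply-add in the style of the MMX PMADDWD instruction: four 16-bit pairs are multiplied and adjacent products are summed into two results. The function name is illustrative, and the model ignores the 32-bit wrap-around of real hardware.

```python
def packed_multiply_add(a, b):
    """Multiply four pairs of signed 16-bit values and add adjacent products.

    Returns two sums: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four 16-bit elements per operand")
    products = [x * y for x, y in zip(a, b)]  # the four separate multiplies
    return [products[0] + products[1], products[2] + products[3]]


if __name__ == "__main__":
    print(packed_multiply_add([1, 2, 3, 4], [5, 6, 7, 8]))
```

    One such instruction replaces four scalar multiplies and two adds, which is why a fully pipelined multiplier that starts one of them every clock matters for media workloads.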


    The VIA C3 processor uses the same x86 instruction set as Intel processors.

    It is a pipelined architecture.

    Because of the uncertainties associated with branching, the overall instruction execution time is not fixed; it is therefore not suitable for some real-time applications that need predictable execution times.

    It handles a very complex instruction set.

    The overall power consumption is higher because of the complexity of the processor.


    Typical Processors-PowerPC- MPC601

    POWER (Performance Optimization With Enhanced RISC) is a RISC instruction set architecture designed by IBM.

    PowerPC was created by the 1991 Apple-IBM-Motorola alliance, known as AIM.

    PowerPC is largely based on IBM's POWER architecture.

    The PowerPC architecture allows optimizing compilers to schedule instructions to maximize performance through efficient use of the PowerPC instruction set and register model.

    The multiple, independent execution units allow compilers to maximize parallelism and instruction throughput.

    32-bit and 64-bit PowerPC processors have been a favorite of embedded computer designers.

    The MPC601 was the first PowerPC processor, with a speed of 66 MHz and 132 MIPS.


    High-performance superscalar microprocessor
    - As many as three instructions in execution per clock
    - Single clock cycle execution for most instructions
    - Pipelined FPU for all single-precision and most double-precision operations

    Three independent execution units and two register files
    - BPU featuring static branch prediction
    - A 32-bit IU
    - Fully IEEE 754-compliant FPU for both single- and double-precision operations
    - 32 GPRs for integer operands
    - 32 FPRs for single- or double-precision operands


    High instruction and data throughput
    - Zero-cycle branch capability
    - Instruction unit capable of fetching eight instructions per clock from the cache
    - An eight-entry instruction queue that provides look-ahead capability
    - Interlocked pipelines with feed-forwarding that control data dependencies in hardware
    - Unified 32-Kbyte cache: eight-way set-associative, physically addressed; LRU replacement
    - Memory unit with a two-element read queue and a three-element write queue
    - Run-time reordering of loads and stores
    - BPU that performs condition register (CR) look-ahead operations
    - Address translation facilities for both data and instructions through the UTLB-BTB and ITLB, respectively
    - 52-bit virtual address; 32-bit physical address
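
    The cache organization listed above (32 Kbytes, eight-way set-associative, physically addressed, LRU replacement) can be sketched as follows. The 64-byte line size is an assumption not stated on the slide, and the class is a behavioral toy, not a model of the MPC601's actual cache controller.

```python
from collections import OrderedDict

CACHE_BYTES = 32 * 1024   # from the slide
WAYS = 8                  # from the slide
LINE_BYTES = 64           # assumption: line size is not given on the slide
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


class SetAssociativeCache:
    """Eight-way set-associative cache with LRU replacement (tags only)."""

    def __init__(self):
        # One ordered dict per set: tags ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, paddr):
        """Access a physical address; return True on a hit, False on a miss."""
        line = paddr // LINE_BYTES
        index = line % NUM_SETS
        tag = line // NUM_SETS
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)      # now the most recently used line
            return True
        if len(ways) >= WAYS:
            ways.popitem(last=False)   # evict the least recently used line
        ways[tag] = True
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache()
    print("sets:", NUM_SETS)
    print("first access hit?", cache.access(0))
    print("same line again hit?", cache.access(4))
```

    With these numbers the cache has 64 sets, so nine distinct lines that map to the same set are enough to evict the least recently used one.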


    Summary