A Highly Configurable Cache Architecture for Embedded

May 30, 2018

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    A Highly Configurable Cache Architecture for Embedded Systems

    Chuanjun Zhang*, Frank Vahid**, and Walid Najjar
    *Dept. of Electrical Engineering
    Dept. of Computer Science and Engineering
    University of California, Riverside
    **Also with the Center for Embedded Computer Systems at UC Irvine

    This work was supported by the National Science Foundation and the Semiconductor Research Corporation

    Chuanjun Zhang, UC Riverside

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Outline

    Why a Configurable Cache?
    What Parameters?
      Configurable Associativity by Way Concatenation
      Configurable Size by Way Shutdown
      Configurable Line Size
    How to Configure Cache
      Cache Parameter Explorer
      A Heuristic Algorithm Searches the Pareto Set of Cache Parameters: Tradeoff Between Energy Dissipation and Performance
      The Explorer Is Synthesized Using Synopsys
    Conclusions and Future Work

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Why Choose Cache: Impacts Performance and Power

    Performance impacts are well known
    Power
      ARM920T: Caches consume 50% of total processor system power (Segars 01)
      M*CORE: Unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
    We'll show that a configurable cache can reduce that power nearly in half on average

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Why a Configurable Cache?

    An embedded system may execute one application forever
      Tuning the cache configuration (size, associativity, line size) can save a lot of energy
    Associativity example
      40% difference in memory access energy

    [Figure: two bar charts for epic and mpeg2 (from MediaBench) versus associativity (1, 2, 4). One panel shows miss rate (axis 0.0% to 2.0%); the other shows normalized energy (axis 0% to 100%).]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Benefits of Configurable Cache

    Mass production
      Unique chips getting more expensive as technology scales down (ITRS)
      Huge benefits to mass producing a single chip
      Harder to produce chips distinguished by cache when we have 50-100 processors per chip
    Adapt to program phases
      Recent research shows programs have different cache requirements over time
      Much research assumes a configurable cache

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Caches Vary Greatly in Embedded Processors

    (As. = associativity, DM = direct-mapped, Line = line size in bytes, U = unified)

                             Instruction cache     Data cache
    Processor                Size    As.  Line     Size    As.  Line
    AMD-K6-IIIE              32K     2    32       32K     2    32
    Alchemy AU1000           16K     4    32       16K     4    32
    ARM 7                    8K/U    4    16       8K/U    4    16
    ColdFire                 0-32K   DM   16       0-32K   N/A  N/A
    Hitachi SH7750S (SH4)    8K      DM   32       16K     DM   32
    Hitachi SH7727           16K/U   4    16       16K/U   4    16
    IBM PPC 750CX            32K     8    32       32K     8    32
    IBM PPC 7603             16K     4    32       16K     4    32
    IBM750FX                 32K     8    32       32K     8    32
    IBM403GCX                16K     2    16       8K      2    16
    IBM Power PC 405CR       16K     2    32       8K      2    32
    Intel 960JA              2K      2    N/A      1K      2    N/A
    Intel 960JD              4K      2    N/A      2K      2    N/A
    Intel 960IT              16K     2    N/A      4K      2    N/A
    Motorola MPC8240         16K     4    32       16K     4    32
    Motorola MPC8540         32K     4    32/64    32K     4    32/64
    Motorola MPC7455         32K     8    32       32K     8    32
    NEC VR5500               32K     2    32       32K     2    32
    NEC VR4131               16K     2    16/32    16K     2    16/32
    NEC VR4181               4K      DM   16       4K      DM   16
    NEC VR4181A              8K      DM   32       8K      DM   32
    NEC VR4121               16      DM   16       8K      DM   16
    PMC Sierra RM9000X2      16K     4    N/A      16K     4    N/A
    PMC Sierra RM7000A       16K     4    32       16K     4    32
    SandCraft sr71000        32K     4    32       32K     4    32
    Sun Ultra SPARC Iie      16K     2    N/A      16K     DM   N/A
    SuperH                   32K     4    32       32K     4    32
    TI TMS320C6414           16K     DM   N/A      16K     2    N/A
    TriMedia TM32A           32K     8    64       16K     8    64
    Xilinx Virtex IIPro      16K     2    32       8K      2    32

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Configurable Associativity by Way Concatenation

    Four-way set-associative base cache
    Ways can be concatenated to form two-way
    Can be further concatenated to direct-mapped
    Concatenation is logical; only 1 array accessed

    [Figure: Way 1 to Way 4 operating as four-way; Way 1 and Way 2 as two-way; Way 1 as direct-mapped.]

    C. Zhang (ISCA 03)

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Way-Concatenate Cache Architecture

    Trivial area overhead
    No performance overhead
      NAND transistors enlarged to match inverter speed
      Configuration circuit operates concurrent to decoders

    Address bits: a31 to a13 tag address; a12, a11; a10 to a5 index; a4 to a0 line offset

    reg0  reg1  ways
    0     0     DM
    0     1     2
    1     0     2
    1     1     4

    [Figure: way-concatenate cache datapath. Labels: configuration circuit (reg0, reg1, signals c0 to c3), 6x64 decoders, tag part, data array, mux driver, index, line offset, data output, critical path.]
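The address fields and the reg0/reg1 table on this slide suggest how one access is steered in each mode. The following is a minimal Python sketch, not the authors' circuit. It assumes a 5-bit line offset (a4..a0), a 6-bit index (a10..a5), and that a11 and a12 decide which ways are enabled once ways are concatenated. The function name `active_ways` and the exact pairing of ways in two-way mode are hypothetical.

```python
def active_ways(addr, ways):
    """Return (index, line_offset, enabled_ways) for one cache access.

    Ways are numbered 0..3. The bit-to-way mapping is an assumption
    made for illustration; the slide does not spell it out.
    """
    line_offset = addr & 0x1F        # a4..a0
    index = (addr >> 5) & 0x3F       # a10..a5: one of 64 rows per way
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1

    if ways == 4:
        # Four-way: every way is probed; a11 and a12 act as tag bits.
        enabled = [0, 1, 2, 3]
    elif ways == 2:
        # Two-way: a11 selects one pair of ways (assumed pairing).
        enabled = [0, 1] if a11 == 0 else [2, 3]
    elif ways == 1:
        # Direct-mapped: a12 and a11 together select a single way.
        enabled = [(a12 << 1) | a11]
    else:
        raise ValueError("ways must be 1, 2, or 4")
    return index, line_offset, enabled
```

In this sketch the same index bits address every array, so lowering the associativity only reduces how many arrays are enabled per access.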

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Previous Method: Way Shutdown

    Albonesi proposed a cache where ways could be shut down
      To save dynamic power
    Motorola M*CORE has same way-shutdown feature
      Unified cache even allows setting each way as I, D, both, or off

    [Figure: Way 1, Way 2, Way 3, Way 4]

    Reduces dynamic power by accessing fewer ways
    But decreases total size, so may increase miss rate

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Way Shutdown Can Be Good for Static Power

    Static power (leakage) increasingly important in nanoscale technologies
    We combine way shutdown with way concatenate
      Use sleep transistor method of Powell (ISLPED 2000)

    [Figure: memory cell between Vdd and Gnd with two bitlines and a Gated-Vdd control transistor. When off, it prevents leakage, but with 4% performance overhead.]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Cache Line Size

    [Figure: a 64B cache line holding 64B of consecutive code, compared with a 64B cache line holding non-consecutive code (labels A, B) where only 16B are used and 48B are wasted.]

    C. Zhang (ISVLSI 03)

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Configurable Cache Line Size With Line Concatenation

    The physical line size is 16 bytes
    A programmable counter is used to designate the line size
    4 physical lines are filled when line size is 64 bytes
    An interleaved off-chip memory organization

    [Figure: counter, 16-byte bus, one way, off-chip memory]
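As a rough illustration of line concatenation, the sketch below lists which 16-byte physical lines would be filled on a miss for a given configured line size. It is a minimal Python sketch under stated assumptions: the function name `lines_to_fill` is hypothetical, the supported sizes (16, 32, 64 bytes) are taken from the configurations named elsewhere in the deck, and the fill is assumed to start at the aligned base of the logical line (the slide does not state the fill order).

```python
PHYSICAL_LINE_BYTES = 16  # physical line size stated on the slide


def lines_to_fill(miss_addr, logical_line_bytes):
    """Return the start addresses of the physical lines filled on a miss."""
    if logical_line_bytes not in (16, 32, 64):
        raise ValueError("logical line size must be 16, 32, or 64 bytes")
    # Align the miss address down to the start of the logical line.
    base = miss_addr - (miss_addr % logical_line_bytes)
    # The programmable counter decides how many physical lines to fetch.
    count = logical_line_bytes // PHYSICAL_LINE_BYTES
    return [base + i * PHYSICAL_LINE_BYTES for i in range(count)]
```

For example, `lines_to_fill(0x1234, 64)` yields four physical line addresses, matching the slide's statement that four physical lines are filled when the line size is 64 bytes.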

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Computing Total Memory-Related Energy

    Considers CPU stall energy and off-chip memory energy
    Excludes CPU active energy
    Thus, represents all memory-related energy

    energy_mem = energy_dynamic + energy_static
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
    energy_static = cycles * energy_static_per_cycle

    energy_miss = k_miss_energy * energy_hit
    energy_static_per_cycle = k_static * energy_total_per_cycle
    (We varied the k's to account for different system implementations)

    Measured quantities (underlined on the original slide)
      SimpleScalar (cache_hits, cache_misses, cycles)
      Our layout or data sheets (others)
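The energy model on this slide can be written as one small function. This is a minimal Python sketch that follows the slide's equations term by term, using the k-factor forms for energy_miss and energy_static_per_cycle. The function name `memory_energy` and the sample numbers in the usage note are illustrative assumptions, not values from the talk.

```python
def memory_energy(cache_hits, cache_misses, cycles, energy_hit,
                  k_miss_energy, k_static, energy_total_per_cycle):
    """Total memory-related energy, in the same unit as energy_hit."""
    # energy_miss = k_miss_energy * energy_hit
    energy_miss = k_miss_energy * energy_hit
    # energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    # energy_static_per_cycle = k_static * energy_total_per_cycle
    energy_static_per_cycle = k_static * energy_total_per_cycle
    # energy_static = cycles * energy_static_per_cycle
    energy_static = cycles * energy_static_per_cycle
    # energy_mem = energy_dynamic + energy_static
    return energy_dynamic + energy_static
```

With made-up inputs of 1000 hits, 10 misses, 2000 cycles, energy_hit = 1.0, k_miss_energy = 50, k_static = 0.5, and energy_total_per_cycle = 0.5, the dynamic part is 1500 and the static part is 500.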

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Energy Savings

    Energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. (C. Zhang, ACM TECS, to appear)

    [Figure: normalized energy (axis 0% to 120%; off-scale bar labels 127%, 620%, 12) per benchmark: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr. Series: cnv8K4W32B, cnv8K1W32B, cfg8Kwc32B, cfg8Kwcws32B, cfg8Kwcwslc.]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Cache Parameters that Consume the Lowest Energy Vary Across Applications

    Best configuration per benchmark

    Ben.      I$        D$
    padpcm    8K1W32B   8K1W32B
    crc       2K1W32B   4K1W64B
    auto      8K2W16B   4K1W32B
    bcnt      2K1W32B   2K1W64B
    bilv      4K1W32B   2K1W32B
    binary    2K1W32B   2K1W32B
    blit      2K1W16B   8K2W32B
    brev      4K1W32B   2K1W32B
    g3fax     4K1W32B   4K1W16B
    fir       4K1W32B   2K1W32B
    jpeg      8K4W32B   4K2W32B
    vpr       8K4W32B   2K1W16B
    pjepg     4K1W32B   4K2W64B
    ucbqsort  4K1W16B   4K1W64B
    v42       8K1W16B   8K2W16B
    adpcm     2K1W16B   4K1W16B
    epic      2K1W64B   8K1W16B
    g721      8K4W16B   2K1W16B
    pegwit    4K1W16B   4K1W16B
    mpeg2     4K1W32B   8K2W16B
    art       2K1W32B   2K1W16B
    parser    8K4W16B   8K2W64B
    mcf       8K4W16B   8K1W16B
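The configuration labels in this table pack three parameters into one string: total size in kilobytes, number of ways, and line size in bytes (for example 8K4W32B). A minimal Python sketch of decoding such a label follows. The names `CacheConfig` and `parse_config` are hypothetical helpers introduced for illustration, and the derived set count assumes the usual relation sets = size / (ways x line size).

```python
import re
from typing import NamedTuple


class CacheConfig(NamedTuple):
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self):
        # Number of sets implied by the three parameters.
        return self.size_bytes // (self.ways * self.line_bytes)


_LABEL = re.compile(r"^(\d+)K(\d+)W(\d+)B$")


def parse_config(label):
    """Decode a label such as '8K4W32B' into a CacheConfig."""
    match = _LABEL.match(label.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized cache configuration: {label!r}")
    size_k, ways, line_bytes = (int(group) for group in match.groups())
    return CacheConfig(size_k * 1024, ways, line_bytes)
```

For instance, `parse_config("8K4W32B")` describes an 8 KB, four-way cache with 32-byte lines, which works out to 64 sets.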

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    How to Configure Cache

    Simulation-based methods
      Drawback: slowness. Seconds of real-time work may take tens of hours to simulate
      Setting up the simulation tools increases the time
    Self-exploring method
      Cache parameter explorer incorporated on a prototype platform
      Pareto parameters: a set of parameters that show the performance and energy tradeoff

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Cache self-exploring hardware

    An explorer is used to detect the Pareto set of cache parameters
    The explorer stands aside to collect information used to calculate the energy

    [Figure: block diagram with Processor, I$, D$, Mem, and Explorer]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Pareto parameter sets

    [Figure: pegwit. Time (million cycles, 56 to 72) versus Energy (mJ, 0.04 to 0.16). Point A: lowest energy. Point B: best performance. Region C: tradeoff between energy and performance. Point D: not a Pareto point.]
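A point is in the Pareto set when no other configuration is at least as good in both energy and time. The sketch below is a minimal Python illustration of that filter, not the explorer's hardware logic. The function name `pareto_points` is hypothetical, and the sample numbers in the usage note are made up rather than read from the plot.

```python
def pareto_points(points):
    """Keep the (energy, time) pairs that no other pair dominates.

    Lower is better for both energy and time.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sorting by energy puts the lowest-energy point first
    # and the fastest point last.
    return sorted(front)
```

Given the made-up measurements (0.05, 70), (0.15, 57), (0.08, 62), and (0.12, 66), the last one is dropped because (0.08, 62) is better in both dimensions, which is the situation the slide labels as not a Pareto point.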

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Heuristic algorithm

    Search all possible cache configurations
      Time consuming. Considering other configurable parameters (voltage levels, bus width, etc.), the search space will increase very quickly to millions.
    A heuristic is proposed
      First, search for point A
        The order in which parameters are searched matters
        Does not need a cache flush
      Then search for point B
      Last, search for points in region C

    [Figure: Time versus Energy (mJ) plot marking point A (lowest energy), point B (best performance), and region C (tradeoff).]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Impact of Cache Parameters on Miss Rate and Energy

    Average Instruction Cache Miss Rate and Normalized Energy of the Benchmarks.

    [Figure: two bar charts for cache sizes 8k, 4k, and 2k. One shows average I-cache miss rate (0% to 12%); the other shows average normalized I-cache energy (0.0 to 1.0). In each chart, line size varies over 16B, 32B, 64B with one way, and associativity varies over 1W, 2W, 4W with line size 32B.]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Energy Dissipation on On-Chip Cache and Off-Chip Memory

    [Figure: Energy (J, 0 to 5) versus cache size (1KB, 2KB, 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1MB) for benchmark parser. Series: Cache, Memory, Total.]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Searching for Point A

    Point A: the least-energy cache configuration

    Search order: cache size, then line size, then associativity, then way prediction

    [Figure: ways W1 to W4; Time versus Energy (mJ) plot with point A marked as lowest energy.]
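The search for point A tunes one parameter at a time in a fixed order instead of trying every combination. The following is a minimal Python sketch of such a one-parameter-at-a-time search. It assumes the search starts from the smallest value of each parameter and stops growing a parameter as soon as energy no longer improves; those details, the function name `search_point_a`, and the callback `measure` are illustrative assumptions rather than the authors' exact procedure.

```python
def search_point_a(measure, space, order):
    """Greedy search for a low-energy configuration.

    measure: callable taking a config dict and returning energy.
    space:   dict mapping parameter name to its values in ascending order.
    order:   parameter names in the order they are tuned.
    Returns (config, energy, number_of_configurations_measured).
    """
    # Assumed starting point: the smallest value of every parameter.
    config = {name: values[0] for name, values in space.items()}
    best = measure(config)
    evaluations = 1
    for name in order:
        for value in space[name][1:]:
            trial = dict(config, **{name: value})
            energy = measure(trial)
            evaluations += 1
            if energy < best:
                config, best = trial, energy
            else:
                break  # no improvement: keep the previous value
    return config, best, evaluations
```

Because each parameter is settled before the next one is touched, the number of measurements grows with the sum of the parameter ranges rather than their product.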

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Searching for Point B

    Point B: the best-performance cache configuration
      High associativity doesn't mean high performance
      Large line size may not be good for data cache

    Search order: fix cache size, then search line size, then search associativity; no way prediction

    [Figure: ways W1 to W4; Time versus Energy (mJ) plot with point A and point B (best performance) marked.]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Searching for Point C

    Cache parameters in region C represent the tradeoff between energy and performance
    Choose cache parameters between points A and B
      If the cache sizes at points A and B are 8K and 4K respectively, then the cache size of points in region C will be tested at 8K and 4K
      Combinations of point A's and B's parameters are tested

    Point          A     B     C
    Line size      64    64    64
    Cache size     2K    8K    4K, 8K
    Associativity  1W    4W    1W, 1W, 2W

    [Figure: Time versus Energy plot marking point A, point B, and region C (tradeoff between energy and performance).]
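One way to read the region C step is as an enumeration of configurations whose parameters lie between those of points A and B. The sketch below is a minimal Python illustration under that reading. The parameter ranges in `ORDERED_VALUES`, the function name `region_c_candidates`, and the choice to include every value between A's and B's settings are assumptions, since the slide gives only one example.

```python
from itertools import product

# Assumed parameter ranges, in ascending order.
ORDERED_VALUES = {
    "size": [2048, 4096, 8192],
    "line": [16, 32, 64],
    "ways": [1, 2, 4],
}


def region_c_candidates(point_a, point_b):
    """List configurations between points A and B, excluding A and B."""
    names = list(ORDERED_VALUES)
    ranges = []
    for name in names:
        values = ORDERED_VALUES[name]
        low, high = sorted((values.index(point_a[name]),
                            values.index(point_b[name])))
        ranges.append(values[low:high + 1])
    candidates = [dict(zip(names, combo)) for combo in product(*ranges)]
    return [c for c in candidates if c != point_a and c != point_b]
```

With A at 2K, one way, 64-byte lines and B at 8K, four ways, 64-byte lines (the values in the table above), every candidate keeps the 64-byte line size and varies only cache size and associativity.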


  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Implementing the Heuristic in Hardware

    Total size of the explorer
      About 4,200 gates, or 0.041 mm^2 in 0.18 micron CMOS technology
    Area overhead
      Compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead
    Power consumption
      2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%
      Furthermore, the exploring hardware is used only during the exploring stage, and can be shut down after the best configuration is determined

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    How Good Is the Heuristic?

    Time complexity
      Search all space: O(m x n x l x p)
      Heuristic: O(m + n + l + p)
      m: number of associativities, n: number of cache sizes, l: number of cache line sizes, p: way prediction on/off
    Efficiency
      On average, 5 searches instead of 27 total searches can find point A
      2 out of 19 benchmarks miss the lowest-power cache configuration
      Using a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Results of Some Other Benchmarks

    [Figure: four Time (cycles) versus Energy plots.
      bliv: time 11233000 to 11239000 cycles, energy 0 to 0.01 mJ
      pegwit: time 55000000 to 75000000 cycles, energy 0 to 0.2 mJ
      padpcm: time 132000 to 146000 cycles, energy 0.174 to 0.18 nJ
      crc: time 3090000 to 3104000 cycles, energy 0 to 0.004 mJ]

  • 8/14/2019 A Highly Configurable Cache Architecture for Embedded

    Conclusion and Future Work

    A configurable cache architecture is proposed
      Associativity, size, line size
    A cache parameter explorer is implemented to find the cache parameters
    A heuristic algorithm is proposed to search the Pareto cache parameter sets
      The complexity of the heuristic is O(m+n+l) instead of O(m*n*l)
      Only 95% of the Pareto points can be found by the heuristic
    Overhead
      Little area and power overhead, and no performance overhead
    Future Work
      Dynamically detect the cache parameters