Trace Caches
Michele Co, CS 451
Feb 02, 2016
Transcript
Page 1: Trace Caches

Michele Co
CS 451

Page 2: Motivation

High-performance superscalar processors need high instruction throughput to exploit ILP
–Wider dispatch and issue paths
–Execution units designed for high parallelism
  • Many functional units
  • Large issue buffers
  • Many physical registers
Fetch bandwidth becomes the performance bottleneck

Page 3: Fetch Performance Limiters

Cache hit rate
Branch prediction accuracy
Branch throughput
–Need to predict more than one branch per cycle
Non-contiguous instruction alignment
Fetch unit latency

Page 4: Problems with the Traditional Instruction Cache

Contains instructions in compiled (static) order
–Works well for sequential code with little branching, or code with large basic blocks
–Breaks down when taken branches are frequent: each taken branch ends the contiguous fetch
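The fetch-bandwidth limit of a conventional instruction cache can be sketched in a few lines. This is a minimal, hypothetical model (the line size, fetch width, and function names are illustrative assumptions, not from the slides): fetch proceeds contiguously and must stop at a cache-line boundary or the first predicted-taken branch.

```python
# Sketch of one fetch cycle from a traditional i-cache: contiguous
# addresses only, ending at a line boundary or a taken branch.
# LINE_SIZE and the 4-wide fetch are assumed for illustration.

LINE_SIZE = 4  # instructions per cache line (assumed)

def sequential_fetch(pc, taken_branches, width=4):
    """Return the instruction addresses fetchable in one cycle."""
    fetched = []
    line_end = (pc // LINE_SIZE + 1) * LINE_SIZE
    while pc < line_end and len(fetched) < width:
        fetched.append(pc)
        if pc in taken_branches:  # taken branch: fetch must stop here
            break
        pc += 1
    return fetched

# A taken branch at address 1 limits the cycle to 2 instructions,
# even though the machine could consume 4.
print(sequential_fetch(0, taken_branches={1}))   # [0, 1]
print(sequential_fetch(0, taken_branches=set())) # [0, 1, 2, 3]
```

With frequent taken branches, sustained fetch bandwidth falls well below the machine's issue width, which is the bottleneck the trace cache targets.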

Page 5: Suggested Solutions

Multiple branch target address prediction: branch address cache (Yeh, Marr, Patt, 1993)
–Provides quick access to multiple target addresses
–Disadvantages
  • Complex alignment network, additional latency

Page 6: Suggested Solutions (cont’d)

Collapsing buffer: multiple accesses to the BTB (Conte, Mills, Menezes, Patel, 1995)
–Allows fetching non-adjacent cache lines
–Disadvantages
  • Bank conflicts
  • Poor scalability for interblock branches
  • Significant logic added before and after the instruction cache

Fill unit: caches RISC-like instructions derived from a CISC instruction stream (Melvin, Shebanow, Patt, 1988)

Page 7: Problems with Prior Approaches

Need to generate pointers for all non-contiguous instruction blocks BEFORE fetching can begin
–Extra stages, additional latency
–Complex alignment network necessary
Multiple simultaneous accesses to the instruction cache
–Multiporting is expensive
Sequencing
–Additional stages, additional latency

Page 8: Potential Solution – Trace Cache

Rotenberg, Bennett, Smith (1996)
Advantages
–Caches dynamic instruction sequences, fetching past multiple branches
–No additional fetch unit latency
Disadvantages
–Redundant instruction storage
  • Between trace cache and instruction cache
  • Within the trace cache

Page 9: Trace Cache Details

Trace
–Sequence of instructions potentially containing branches and their targets
–Terminates on branches with an indeterminate number of targets (returns, indirect jumps, traps)
Trace identifier
–Start address + branch outcomes
Trace cache line
–Valid bit, tag, branch flags, branch mask, trace fall-through address, trace target address
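The trace cache line fields and the hit check implied by the trace identifier (start address + branch outcomes) can be sketched as follows. The field names follow the slide; the matching logic, field widths, and example values are illustrative assumptions.

```python
# Sketch of a trace cache line and its hit check: a trace hits only if
# the fetch address matches the tag AND the multiple-branch predictor's
# outcomes agree with those embedded in the line (under the branch mask).
from dataclasses import dataclass

@dataclass
class TraceCacheLine:
    valid: bool        # valid bit
    tag: int           # start address of the trace
    branch_mask: int   # number of branches contained in the trace (assumed encoding)
    branch_flags: int  # taken/not-taken outcome of each contained branch
    fall_through: int  # next fetch address if the last branch is not taken
    target: int        # next fetch address if the last branch is taken
    insts: list        # the cached dynamic instruction sequence

def trace_hit(line, fetch_addr, predicted_flags):
    """Check tag and branch outcomes against the predictor's outcomes."""
    mask = (1 << line.branch_mask) - 1
    return (line.valid
            and line.tag == fetch_addr
            and (predicted_flags & mask) == (line.branch_flags & mask))

line = TraceCacheLine(True, 0x400, 2, 0b01, 0x420, 0x800,
                      ["i0", "i1", "br", "i2", "br"])
print(trace_hit(line, 0x400, 0b01))  # same start PC and predicted outcomes: hit
print(trace_hit(line, 0x400, 0b11))  # predicted outcomes differ: miss
```

On a hit, the whole dynamic sequence (spanning multiple branches) is supplied in one cycle; on a miss, fetch falls back to the conventional instruction cache.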

Page 10: (figure)

Page 11: Next Trace Prediction (NTP)

History register + correlating table
–Complex history indexing
Secondary table
–Indexed by the most recently committed trace ID
Index generating function
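The index generating function can be sketched as a hash that folds the history register's recent trace IDs into a correlating-table index. This is an assumption-laden sketch: the shift/XOR folding, the multiplicative mix, and the table size are placeholders, not the exact function from the next-trace-prediction papers.

```python
# Hypothetical NTP index generation: fold recent trace IDs (most recent
# last) into one correlating-table index, giving newer traces more index
# bits than older ones. Hash constants and TABLE_BITS are assumptions.

TABLE_BITS = 14  # correlating table with 2^14 entries (assumed)

def ntp_index(history):
    """Hash a list of recent trace IDs into a table index."""
    idx = 0
    for age, trace_id in enumerate(reversed(history)):
        idx ^= trace_id >> age                   # older IDs contribute fewer bits
        idx = (idx * 2654435761) & 0xFFFFFFFF    # multiplicative mixing step
    return idx & ((1 << TABLE_BITS) - 1)

# Two histories that differ only in the oldest trace still map to
# different entries, capturing correlation along the path.
print(ntp_index([0x10, 0x24, 0x3c]) != ntp_index([0x14, 0x24, 0x3c]))
```

The secondary table from the slide serves as a fallback: when the hashed history mispredicts, a simpler prediction indexed by the last committed trace ID alone can be used instead.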

Page 12: NTP Index Generation (figure)

Page 13: Return History Stack (figure)

Page 14: Trace Cache vs. Existing Techniques (figure)

Page 15: Trace Cache Optimizations

Performance
–Partial matching [Friendly, Patel, Patt (1997)]
–Inactive issue [Friendly, Patel, Patt (1997)]
–Trace preconstruction [Jacobson, Smith (2000)]
Power
–Sequential access trace cache [Hu et al. (2002)]
–Dynamic direction prediction based trace cache [Hu et al. (2003)]
–Micro-operation cache [Solomon et al. (2003)]

Page 16: Trace Processors

Trace processor architecture
–Processing elements (PEs), each with:
  • Trace-sized instruction buffer
  • Multiple dedicated functional units
  • Local register file
  • Copy of the global register file
–Uses hierarchy to distribute execution resources
Addresses superscalar processor issues
–Complexity
  • Simplified multiple branch prediction (next trace prediction)
  • Elimination of local dependence checking (local register file)
  • Decentralized instruction issue and result bypass logic
–Architectural limitations
  • Reduced bandwidth pressure on the global register file (local register files)

Page 17: Trace Processor (figure)

Page 18: Trace Cache Variations

Block-based trace cache (BBTC)
–Black, Rychlik, Shen (1999)
–Needs less storage capacity

Page 19: Trace Table: BBTC Trace Prediction (figure)

Page 20: Block Cache (figure)

Page 21: Rename Table (figure)

Page 22: BBTC Optimization

Completion time multiple branch prediction (Rakvic et al., 2000)
–An improvement over trace table predictions

Page 23: Tree-based Multiple Branch Prediction (figure)

Page 24: Tree-PHT (figure)

Page 25: Tree-PHT Update (figure)

Page 26: Trace Cache Variations (cont’d)

Software trace cache
–Ramirez, Larriba-Pey, Navarro, Torrellas (1999)
–Profile-directed code reordering to maximize sequentiality
  • Convert taken branches to not-taken
  • Move unused basic blocks out of the execution path
  • Inline frequent basic blocks
  • Map the most popular traces to a reserved area of the i-cache
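The core reordering step can be sketched as a greedy chain layout: place each basic block so that its hottest successor falls through, turning frequently taken branches into not-taken ones. This is a simplified sketch in the spirit of the software trace cache; the CFG, the profile, and the greedy heuristic are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of profile-directed basic-block reordering: chain each block
# with its hottest unplaced successor so the hot path is contiguous.
# The example CFG and edge profile below are made up for illustration.

def layout(cfg, profile, entry):
    """cfg: block -> successor list; profile: (src, dst) -> edge count.
    Returns a block order chaining each block with its hottest successor."""
    order, placed = [], set()
    work = [entry]
    while work:
        b = work.pop()
        while b is not None and b not in placed:
            order.append(b)
            placed.add(b)
            succs = [s for s in cfg.get(b, []) if s not in placed]
            succs.sort(key=lambda s: profile.get((b, s), 0))
            work.extend(succs[:-1])          # colder successors start later chains
            b = succs[-1] if succs else None  # hottest successor falls through
    return order

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
profile = {("A", "B"): 10, ("A", "C"): 90, ("C", "D"): 90, ("B", "D"): 10}
print(layout(cfg, profile, "A"))  # hot path A, C, D laid out contiguously
```

After this layout the hot path occupies consecutive i-cache lines, so an ordinary instruction cache recovers much of the sequential-fetch benefit a hardware trace cache provides.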