Revolver: Processor Architecture for Power Efficient Loop Execution, Mitchell Hayenga, Vignayan Reddy and Mikko H. Lipasti - Padmini Gaur (13IS15F), Sanchi (13IS20F)


Dec 17, 2015


Dina Chase

Revolver: Processor Architecture for Power Efficient Loop Execution
Mitchell Hayenga, Vignayan Reddy and Mikko H. Lipasti

- Padmini Gaur (13IS15F), Sanchi (13IS20F)


Contents
• The Need
• Approaches and Issues
• Revolver: Some basics
• Loop Handling
  ▫ Loop Detection
  ▫ Detection and Training
  ▫ Finite State Machine
• Loop Execution
• Scheduler
  ▫ Units
  ▫ Tag Propagation Unit
• Loop pre-execution
• Conclusion
• References


The Need

• Per-transistor energy benefit improvement
• Increasing computational efficiency
  ▫ Power-efficient mobile, server
  ▫ Increasing energy constraints
• Elimination of unnecessary pipeline activity
• Managing energy utilization
  ▫ Small energy requirements of instruction execution but large control overheads


So far: Approaches and Issues

• Pipeline-centric instruction caching
  ▫ Emphasizing temporal instruction locality
• Capturing loop instructions in a buffer
  ▫ Inexpensive retrieval for future iterations
• Out-of-order processors: Issues?
  ▫ Resource allocation
  ▫ Program ordering
  ▫ Operand dependency


Instructions serviced by Loop Buffer


Energy Consumption

[Power Efficient Loop Execution Techniques: Mitchell Bryan Hayenga]


Revolver: An enhanced approach
• Out-of-order back-end
• Overall design similar to a conventional processor
• Non-loop instructions
  ▫ Follow the conventional pipeline
• No Register Alias Table (RAT) at the front-end; instead, a Tag Propagation Unit at the back-end
• Loop mode:
  ▫ Detection and dispatching of the loop to the back-end


The promises

• No additional resource allocation
• Energy consumption at front-end managed
• Pre-execution of future iterations
• Operand dependence linking moved to back-end


Loop handling

• Loop detection
• Training feedback
• Loop execution
  ▫ Wakeup logic
  ▫ Tag Propagation Unit
• Load pre-execution


Loop Detection

• Detection (at) stages:
  ▫ Post-execution
  ▫ At decode stage
• Enabling loop mode at decode
• Calculation of:
  ▫ Start address
  ▫ Required resources


Detection and Training

• Key mechanisms:
  ▫ Detection logic at front-end -> dispatched
  ▫ Feedback by back-end: profitability of loops
• Profitability
  ▫ Disabling future loop mode
• Detection control
  ▫ Loop Detection Finite State Machine


FSM


FSM states

• Idle: instructions pass through decode until a valid/profitable loop or a PC-relative backward branch/jump is detected
• Profitability logged in the Loop Address Table (LAT)
• LAT records:
  ▫ Composition and profitability
• Profitable loop -> dispatched
• Backward jump/branch and no known loop
  ▫ Train state


• Train state:
  ▫ Records start address
  ▫ End address
  ▫ Allowable unroll factor
• Resources required added to LAT
• Loop ends -> Idle state
• In the Dispatch state, the decode logic guides the dispatch of loop instructions into the out-of-order back-end.
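The Idle / Train / Dispatch transitions described on these slides can be sketched roughly as follows. This is a minimal Python illustration, not the authors' hardware: the `step` function, the event names and the dictionary standing in for the LAT are assumptions made for the example, as is the choice to mark a freshly trained loop as profitable.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    TRAIN = auto()
    DISPATCH = auto()


def step(state, event, target, lat):
    """Return the next FSM state.

    lat is a hypothetical stand-in for the Loop Address Table:
    a dict mapping a loop start address to a 'profitable' flag.
    """
    if state is State.IDLE and event == "backward_branch":
        if target not in lat:
            return State.TRAIN       # unknown loop: learn its shape first
        if lat[target]:
            return State.DISPATCH    # known, profitable loop
        return State.IDLE            # known but unprofitable: normal pipeline
    if event == "loop_end" and state is not State.IDLE:
        if state is State.TRAIN:
            # Assumption: a newly trained loop starts out as profitable.
            lat.setdefault(target, True)
        return State.IDLE
    return state
```

In this sketch a loop is seen once in Train, recorded, and only dispatched in loop mode the next time its backward branch is decoded.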


• Disabling loop mode on:
  ▫ System calls
  ▫ Memory barriers
  ▫ Load-linked/store-conditional pairs


Training Feedback
• Profitability
  ▫ 4-bit counter
  ▫ Default value = 8
  ▫ Loop mode enabled if value >= 8
  ▫ Dispatched loop unrolled more than twice: +2
  ▫ Else: -2
  ▫ Misprediction other than fall-through: profitability set to 0
• Disabled loops:
  ▫ Front-end increments the counter by 1 for every 2 sequential successful dispatches
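The profitability rules above amount to a small saturating counter. The sketch below is one possible reading in Python. The class and method names are hypothetical, saturation at 0 and 15 is assumed from the counter being 4 bits wide, and "unrolled more than twice" is interpreted as an unroll count greater than 2.

```python
class ProfitabilityCounter:
    MAX = 15        # assumption: 4-bit counter saturates at 15
    THRESHOLD = 8   # loop mode is enabled at or above this value

    def __init__(self):
        self.value = 8           # default value from the slide
        self._dispatch_run = 0   # sequential dispatches seen while disabled

    def loop_mode_enabled(self):
        return self.value >= self.THRESHOLD

    def on_loop_exit(self, unroll_count):
        """Back-end feedback after a dispatched loop finishes."""
        delta = 2 if unroll_count > 2 else -2
        self.value = max(0, min(self.MAX, self.value + delta))

    def on_misprediction(self):
        """Misprediction other than the loop fall-through."""
        self.value = 0

    def on_disabled_dispatch(self):
        """Front-end sees a successful dispatch of a disabled loop."""
        self._dispatch_run += 1
        if self._dispatch_run == 2:
            self._dispatch_run = 0
            self.value = min(self.MAX, self.value + 1)
```

Under these assumptions a loop that was set to 0 by a misprediction needs 16 sequential successful dispatches before loop mode is tried again.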


Loop: Basic idea
• Unrolling the loop:
  ▫ Depending on back-end resources
  ▫ As much as possible
  ▫ Eliminating additional resource use after dispatch
• Loop instructions stay in the issue queue and keep executing until the loop iterations complete
• Maintaining provided resources across multiple executions
• Load-store queues modified to maintain program order
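"As much as possible, depending on back-end resources" can be read as taking the tightest resource limit. The following Python sketch shows that arithmetic only. The resource names and capacities are hypothetical and are not taken from the paper.

```python
def unroll_factor(available, required):
    """Largest number of iterations whose resources fit in the back-end at once.

    available: free entries per back-end resource (hypothetical names and sizes).
    required:  entries one loop iteration needs from each resource.
    """
    factors = [available[name] // need
               for name, need in required.items() if need > 0]
    return min(factors) if factors else 0


# Hypothetical capacities: issue queue, load queue, store queue.
available = {"issue_queue": 32, "load_queue": 16, "store_queue": 16}
per_iteration = {"issue_queue": 10, "load_queue": 2, "store_queue": 1}
print(unroll_factor(available, per_iteration))  # prints 3
```

Here the issue queue is the limiting resource, so three copies of the loop body would be dispatched and then reused for every later iteration.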


Contd..

• Proper access of destination and source registers
• Loop exit:
  ▫ Removing instructions from the back-end
  ▫ Dispatch of the loop fall-through path


Loop execution: Let’s follow
• Green: Odd numbered
• Blue: Even numbered
• Pointers:
  ▫ Program order maintenance: loop_start, loop_end
  ▫ Oldest uncommitted entry: commit


Loop execution, contd..

• Committing:
  ▫ Start to end
  ▫ Wrapping to start: next loop iteration
  ▫ Resetting issue queue entries for next loop iteration
  ▫ Load queue entries invalidated
  ▫ Store queue entries:
    ▫ Passed to write buffer
    ▫ Immediate reuse in next iteration
    ▫ Cannot write to buffer -> stall (very rare)
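The commit pointer behaviour (start to end, then wrap to start for the next iteration) can be illustrated with a few lines of Python. The function name and the use of integer queue indices are assumptions for the example. The per-entry resets are only noted in a comment, not modelled.

```python
def commit_order(loop_start, loop_end, iterations):
    """Queue indices visited by the commit pointer over several iterations."""
    commit = loop_start
    order = []
    body_length = loop_end - loop_start + 1
    for _ in range(iterations * body_length):
        order.append(commit)
        # On wrap-around the hardware would also reset the issue queue
        # entries and invalidate load queue entries for the next iteration.
        commit = loop_start if commit == loop_end else commit + 1
    return order
```

For a loop body held in entries 4 to 6, two iterations commit entries 4, 5, 6, 4, 5, 6: the same entries are reused and nothing new is allocated.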


Scheduler: Units

• Wake-up array
  ▫ Identifying ready instructions
• Select logic
  ▫ Arbitration between ready instructions
• Instruction silo
  ▫ Producing the opcode and physical register identifiers of the selected instruction


Scheduler: The design


Scheduler: The concept

• Managed as a queue
• Maintains program order among entries
• Wakeup array
  ▫ Utilizes logical register identifiers
  ▫ Position dependence
• Tag Propagation Unit (TPU)
  ▫ Physical register mapping


Wakeup Logic: Overview

• Observes generated results:
  ▫ Identifying new instructions capable of being executed
• Program-based ordering
• Broadcast of logical register identifier
  ▫ No need for renaming
  ▫ No physical register identifier in use


Wakeup: The design


Wake up array

• Rows: Instructions
• Columns: Logical registers
• Signals:
  ▫ Request
  ▫ Granted
  ▫ Ready


Wakeup operation

• Allocation into wake-up array
  ▫ Marking logical source and destination registers
• Unscheduled instruction
  ▫ Deasserts the downstream ready signal on its destination register column
  ▫ Preventing younger, dependent instructions from waking up
• Request sent when:
  ▫ All necessary source register broadcasts received
  ▫ Source registers ready


• Select grants the request:
  ▫ Asserting downstream ready
  ▫ Waking up younger dependent instructions
• Wakeup logic cell:
  ▫ 2 state bits: sourcing/producing logical register
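The position-based wakeup described on these slides can be modelled as ready signals flowing down the register columns in program order. The Python sketch below is a simplified software model, not the circuit. It assumes each instruction is a `(source_registers, destination_register)` tuple, that every register is ready above the first row, and that a granted instruction asserts its column immediately (result latency is ignored).

```python
def wakeup_requests(instrs, granted, num_regs):
    """Return, per row, whether the instruction raises its request signal.

    instrs:  list of (sources, destination) in program order.
    granted: list of booleans, True once select has granted that row.
    """
    ready = [True] * num_regs   # column state entering the first row
    requests = []
    for (sources, dest), done in zip(instrs, granted):
        # Request only if not yet granted and all source columns are ready.
        requests.append(not done and all(ready[s] for s in sources))
        if dest is not None:
            # An unscheduled producer deasserts its column for younger rows;
            # a granted producer asserts it.
            ready[dest] = done
    return requests


instrs = [((1,), 2), ((2,), 3), ((1,), 4)]   # r2 = f(r1); r3 = f(r2); r4 = f(r1)
print(wakeup_requests(instrs, [False, False, False], 8))
print(wakeup_requests(instrs, [True, False, False], 8))
```

Before anything is granted, the first and third instructions request and the second is held back by the deasserted r2 column. Once the first is granted, the second wakes up. Only logical register numbers are involved, which is why no renaming is needed for wakeup.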


The simple logic


An example with dependence


Tag Propagation Unit (TPU)

• No renaming!
• Maps physical register identifiers to logical registers
• Enables reuse of physical registers
  ▫ As no additional resources are allocated
  ▫ Physical register management
• Possible speculative execution of next loop iterations


Next loop iteration??

• Impossible if: instruction only has access to a single physical destination register
• Speculative execution:
  ▫ Alternative physical register identifier needed
• Solution: 2 physical destination registers
  ▫ Alternating writes between the 2


With 2 destination registers

• Double buffering
  ▫ Maintaining previous state during speculative computation
  ▫ Once iteration N+1 commits, the destination register of iteration N is reused on iteration N+2
  ▫ No instruction dependence between N and N+2
  ▫ Speculative writing into the output register allowed
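The alternation between the two destination registers can be written down in a few lines. This is an illustrative Python sketch. The function names and the register numbers are hypothetical, and the second helper assumes the iteration index is at least 1, since the value consumed in iteration 0 comes from before the loop.

```python
def dest_register(phys_pair, iteration):
    """Physical register an instruction writes in a given iteration."""
    return phys_pair[iteration % 2]


def previous_value_register(phys_pair, iteration):
    """Register still holding the value produced one iteration earlier."""
    if iteration < 1:
        raise ValueError("iteration 0 has no earlier loop iteration")
    return phys_pair[(iteration - 1) % 2]


pair = (12, 13)   # hypothetical physical register identifiers
print([dest_register(pair, n) for n in range(4)])   # prints [12, 13, 12, 13]
```

Iterations N and N+2 share a register while N and N+1 never do, so iteration N+1 can write speculatively without destroying the committed value of iteration N.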


With Double buffering

• Dynamic linkage between dependent instructions and source registers
• Changing logical register mapping
  ▫ Overwriting output register column
• Instructions stored in program order:
  ▫ Downstream instructions obtain proper source mapping


Source, destination and iteration


Register reclamation

• Any instruction misprediction:
  ▫ Flushing downstream instructions
  ▫ Propagation of mappings to all newly scheduled instructions
• Better than RAT:
  ▫ Complexities reduced


Queue entries: Lifetime

• Received prior to dispatch
• Retained till instruction exit from back-end
• Reused to execute multiple loop iterations
  ▫ Immediate freeing of LSQ entries upon commit
  ▫ Position-based age logic in LSQ
• Load queue entries:
  ▫ Simply reset for future iterations


Store Queue entries: An extra effort

• Need to write back
• Drained immediately into a write buffer between the queue and the L1 cache
• If the write is not possible: stall
  ▫ Very rare
• Wrapping around of commit pointer


Loop pre-execution

• Pre-execution of future loads:
  ▫ Parallelization
  ▫ Enabling zero-latency loads
    ▫ No L1 cache access latency
• Repeated execution of a load till completion of all iterations
• Exploiting recurrent nature of loops:
  ▫ Highly predictable address patterns


Learning from example: String copy

• Copying source array to destination array
• Predictable load addresses
• Accessing consecutive bytes from memory
• Primary addressing access patterns:
  ▫ Stride
  ▫ Constant
  ▫ Pointer-based
• Placing simple pattern identification hardware alongside pre-executed load buffers


Stride based addressing

• Most common
• Iterating over data array
  ▫ Computing address Δ between 2 consecutive loads
  ▫ Third load matches predicted stride: stride verification
  ▫ Pre-execution of next load
• Constant: a special case of zero-sized stride
  ▫ Reading from same address
  ▫ Stack-allocated variables / pointer aliasing
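The three-step stride scheme above (compute Δ from two loads, verify it on the third, then pre-execute) can be sketched as a small predictor. This Python class is an illustration only. Its name and interface are assumptions, and it tracks a single static load.

```python
class StridePredictor:
    def __init__(self):
        self.last = None        # address of the previous load
        self.stride = None      # most recent address delta
        self.verified = False   # True once two consecutive deltas agree

    def observe(self, addr):
        """Record a load address; return the predicted next address or None."""
        if self.last is not None:
            delta = addr - self.last
            # The stride is verified when this delta repeats the previous one.
            self.verified = self.stride is not None and delta == self.stride
            self.stride = delta
        self.last = addr
        return addr + self.stride if self.verified else None
```

With addresses 100, 104, 108 the first two calls return `None` and the third returns 112, the address that would be pre-executed. A run of identical addresses is handled by the same logic with a stride of zero, which matches the constant case on the slide.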


Pointer based addressing

•Value returned by current load -> next address

•E.g. Linked list traversals
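A pointer-based pattern can be recognised by checking that each loaded value leads to the next load address. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and it allows a constant offset between the loaded value and the next address (for example a `next` field inside a list node), which the slide does not mention.

```python
def is_pointer_chasing(loads):
    """loads: list of (address, loaded_value) for one static load, in order.

    Returns True when every next address equals the previous loaded value
    plus the same constant offset. At least three loads are required so
    that the offset is seen twice.
    """
    if len(loads) < 3:
        return False
    offsets = {next_addr - value
               for (_, value), (next_addr, _) in zip(loads, loads[1:])}
    return len(offsets) == 1
```

For a linked list whose nodes sit at 0x200 and 0x300 with the `next` field at offset 8, the sequence `[(0x100, 0x200), (0x208, 0x300), (0x308, 0x0)]` is classified as pointer-based, and the next address would be predicted from the value just loaded.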


Pre-execution: more..

• Pre-executed load buffer placed between load queue and L1 cache interface
• Store clashes with pre-executed load
  ▫ Invalidating entry
  ▫ Coherency maintenance
• Pre-executed loads:
  ▫ Speculatively waking up dependent operations on next cycle
• Incorrect address prediction:
  ▫ Scheduler cancels and re-issues operation
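The invalidate-on-store rule can be shown with a toy model of the pre-executed load buffer. This Python sketch uses a dictionary keyed by address. The class and method names are hypothetical, and entry granularity, capacity and replacement are all ignored.

```python
class PreExecLoadBuffer:
    def __init__(self):
        self.entries = {}   # address -> value fetched ahead of time

    def fill(self, addr, value):
        """Record the result of a pre-executed load."""
        self.entries[addr] = value

    def on_store(self, addr):
        """A store to the same address makes the buffered value stale."""
        self.entries.pop(addr, None)

    def lookup(self, addr):
        """Return the buffered value, or None if the load must go to the L1 cache."""
        return self.entries.get(addr)
```

A hit in `lookup` corresponds to the zero-latency load case. After `on_store` removes an entry, or when the predicted address was wrong, the lookup misses and the load takes the normal path.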


Conclusion

• Minimizing energy during loop execution
• Elimination of front-end overheads originating from pipeline activity and resource allocation
• Benefits greater than those of loop buffers and μop caches
• Pre-execution increases performance during loop execution by hiding L1 cache latencies
• Reported energy-delay benefit: 5.3-18.3%


References
• Scheduling Reusable Instructions for Power Reduction (J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, M. Irwin), 2004
• Matrix Scheduler Reloaded (P. G. Sassone, J. Rupley, E. Breckelbaum, G. H. Loh, B. Black)
• Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops (L. H. Lee, B. Moyer, J. Arends)
• Power Efficient Loop Execution Techniques (Mitchell Bryan Hayenga)