CS 152 Computer Architecture and Engineering
Lecture 19 -- Dynamic Scheduling II
2014-4-3, John Lazzaro (not a prof - “John” is always OK). TA: Eric Love.
www-inst.eecs.berkeley.edu/~cs152/

Transcript
Page 1: 2014-4-3 John Lazzaro (not a prof - “John” is always OK)

UC Regents Spring 2014 © UCB CS 152 L18: Dynamic Scheduling I

2014-4-3

John Lazzaro(not a prof - “John” is always OK)

CS 152Computer Architecture and Engineering

www-inst.eecs.berkeley.edu/~cs152/

TA: Eric Love

Lecture 19 -- Dynamic Scheduling II

Play:

Page 2:

Case studies of dynamic execution

DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.

IBM Power: Evolving dynamic designs over many generations.

Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.

Short Break

Page 3:

DEC Alpha

21164: 4-issue in-order design.

21264: 4-issue out-of-order design.

The 21264 was 50% to 200% faster in real-world applications.

Page 4:

500 MHz 0.5µ parts for the in-order 21164 and the out-of-order 21264.

Similarly-sized on-chip caches (116K vs 128K); the in-order 21164 has a larger off-chip cache.

The 21264 has 55% more transistors than the 21164, and its die is 44% larger.

The 21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code.

The 21264 consumes 46% more power than the 21164.

Page 5:

The Real Difference: Speculation

If the ability to recover from mis-speculation is built into an implementation, it offers the option to add speculative features to all parts of the design.

Page 6:

[21264 die photo: labeled regions include the FP pipe, two integer pipes, two OoO control blocks, the I-cache, the data cache, and the fetch-and-predict logic.]

Separate OoO control for integer and floating point. RISC decode happens in the OoO blocks. Unlabeled areas are devoted to memory-system control.

Page 7:

21264 pipeline diagram: the Rename and Issue stages are the primary locations of dynamic-scheduling logic. Load/store disambiguation support resides in the Memory stage.

Slot: absorbs the delay of the long path on the previous slide.

Page 8:

Fetch stage close-up: Each cache line stores predictions of the next line, and of the cache way to be fetched. If the predictions are correct, the fetcher maintains the required 4-instructions-per-cycle pace. Both predictions are speculative.
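The fetch trick above can be sketched as a toy model (the names `ICacheLine` and `fetch_stream` are invented for illustration; real hardware keeps the prediction fields in the cache arrays, not in Python objects):

```python
class ICacheLine:
    def __init__(self, instructions, next_line, next_way):
        self.instructions = instructions  # the 4 instructions in this line
        self.next_line = next_line        # predicted index of the next line
        self.next_way = next_way          # predicted way of the next line

def fetch_stream(cache, start_line, start_way, n_cycles):
    """Follow per-line predictions; one line (4 instructions) per cycle."""
    line, way = start_line, start_way
    fetched = []
    for _ in range(n_cycles):
        entry = cache[(line, way)]
        fetched.extend(entry.instructions)
        line, way = entry.next_line, entry.next_way  # speculative next fetch
    return fetched

# Two lines that predict each other, forming a tight loop:
cache = {
    (0, 0): ICacheLine(["i0", "i1", "i2", "i3"], next_line=1, next_way=0),
    (1, 0): ICacheLine(["i4", "i5", "i6", "i7"], next_line=0, next_way=0),
}
print(fetch_stream(cache, 0, 0, 2))  # 8 instructions in 2 cycles
```

On a misprediction, the real machine squashes the over-fetched instructions and restarts at the correct line, which is why both fields are speculative.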

Page 9:

Rename stage close-up: (1) Allocates new physical registers for destinations, (2) looks up physical register numbers for sources, (3) handles rename dependences among the 4 issuing instructions. All in one clock cycle!

Input: 4 instructions specifying architected registers.

Output: 12 physical register numbers: 1 destination and 2 sources for each of the 4 instructions to be issued.

Time-stamped, for mis-speculation recovery.

Page 10:

Recall: malloc() -- free() in hardware

The record-keeping shown in this diagram occurs in the rename stage.
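The malloc()/free() analogy can be made concrete with a small sketch (the function and table names are invented; the real rename stage does all four allocations and lookups in parallel, in one cycle, not in a loop):

```python
def rename_group(group, map_table, free_list):
    """group: list of (dest, src1, src2) architected register names."""
    renamed = []
    for dest, src1, src2 in group:
        # Sources read the map table, which already reflects earlier
        # instructions in this same group (intra-group dependences).
        p1, p2 = map_table[src1], map_table[src2]
        pd = free_list.pop(0)   # malloc(): grab a free physical register
        map_table[dest] = pd    # later readers of `dest` now see `pd`
        renamed.append((pd, p1, p2))
    return renamed

map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = ["p10", "p11", "p12", "p13"]
# r3 = r1 + r2 ; r1 = r3 + r2  (second instruction depends on the first)
out = rename_group([("r3", "r1", "r2"), ("r1", "r3", "r2")],
                   map_table, free_list)
print(out)  # [('p10', 'p1', 'p2'), ('p11', 'p10', 'p2')]
```

The free() side happens at retirement: once the instruction that overwrote a destination retires, the *previous* physical register mapped to that destination can safely return to the free list.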

Page 11:

Issue stage close-up: (1) Newly issued instructions are placed at the top of the queue. (2) Instructions check the scoreboard: are both sources ready? (3) An arbiter selects the 4 oldest “ready” instructions. (4) Update removes these 4 from the queue.

Input: 4 just-issued instructions, renamed to use physical registers.

Output: the 4 oldest instructions whose 2 source registers are ready for use.

Scoreboard: tracks writes to physical registers.
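Steps (1) through (4) amount to pick-the-oldest-ready. A minimal sketch (names are invented, and a sequential loop stands in for the arbiter's parallel selection logic):

```python
def select_ready(queue, scoreboard, width=4):
    """queue: oldest-first list of (name, src1, src2). Returns issued names."""
    issued = []
    for entry in list(queue):          # scan oldest to youngest
        if len(issued) == width:
            break
        name, s1, s2 = entry
        if scoreboard[s1] and scoreboard[s2]:  # both sources written?
            issued.append(name)
            queue.remove(entry)        # update step: drop from the queue
    return issued

scoreboard = {"p1": True, "p2": True, "p3": False}
queue = [("A", "p1", "p2"), ("B", "p1", "p3"), ("C", "p2", "p2")]
print(select_ready(queue, scoreboard))  # ['A', 'C']  (B waits on p3)
```

Note that B is skipped, not blocking: younger ready instructions may issue around an older stalled one, which is the whole point of out-of-order issue.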

Page 12:

Execution close-up: (1) Two copies of the register file, to reduce port pressure. (2) Forwarding buses are low-latency paths through the CPU. Forwarding relies on speculation.

Page 13:

Latencies, from issue to retirement. 8 retirements per cycle can be sustained over short time periods; the peak rate is 11 retirements in a single cycle.

Retirement is managed here. Short latencies keep buffers to a reasonable size.

Page 14:

Execution unit close-up: (1) Two arbiters: one for the top pipes, one for the bottom pipes. (2) Instructions are statically assigned to top or bottom. (3) Each arbiter dynamically selects left or right.

Thus, 2 dual-issue dynamic machines, not a 4-issue machine. Why? It simplifies the arbiters. Performance penalty? A few %.

Page 15:

Memory stages close-up:

Loads and stores from the execution unit appear as the “Cluster 0/1 memory unit” in the diagram below.

1st stop: the TLB, to convert virtual memory addresses.

2nd stop: the Load Queue (LDQ) and Store Queue (STQ), which each hold 32 instructions until retirement ... so we can roll back!

3rd stop: flush the STQ to the data cache; on a miss, place the access in the Miss Address File (MAF == MSHR).

The data cache is “double-pumped” at 1 GHz.

Page 16:

LDQ/STQ close-up:

Hazards we are trying to prevent: memory-ordering hazards, such as a load executing before an older store to the same address.

To detect them, the LDQ and STQ keep lists of up to 32 loads and stores, in issue order. When a new load or store arrives, addresses are compared to detect and fix hazards.
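The address-compare step can be sketched for the load side (a toy model: `check_load` and its youngest-older-store rule are illustrative simplifications of the real CAM-based queues):

```python
def check_load(load_age, load_addr, store_queue):
    """store_queue: list of (age, addr, data) in issue order. Returns the
    data of the youngest *older* matching store, or None if the load may
    safely read the cache."""
    match = None
    for age, addr, data in store_queue:
        if age < load_age and addr == load_addr:
            match = data   # keep scanning: the youngest older match wins
    return match

stores = [(1, 0x100, "old"), (3, 0x100, "new"), (5, 0x200, "x")]
print(check_load(4, 0x100, stores))  # 'new' (forwarded from store age 3)
print(check_load(4, 0x300, stores))  # None  (safe to read the cache)
```

A match here means the load must take (or wait for) the store's data rather than reading stale memory; the symmetric check, a new store finding a younger load that already executed, triggers the roll-back described on the previous page.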

Page 17:

LDQ/STQ speculation

When the hardware detects that a load speculated incorrectly past a conflicting older store, it recovers. It also marks the load instruction in a predictor, so that future invocations are not speculatively executed: the first execution speculates, subsequent executions wait.
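The predictor described above can be modeled in a few lines (the class name and the unbounded one-bit-per-PC table are assumptions for illustration; the real structure is a small, finite hardware table):

```python
class LoadWaitPredictor:
    def __init__(self):
        self.must_wait = set()      # PCs of loads that once mis-speculated

    def may_speculate(self, load_pc):
        return load_pc not in self.must_wait

    def record_violation(self, load_pc):
        self.must_wait.add(load_pc)  # a squash happened: train the predictor

pred = LoadWaitPredictor()
print(pred.may_speculate(0x40))  # True:  first execution speculates
pred.record_violation(0x40)      # hardware detects the ordering violation
print(pred.may_speculate(0x40))  # False: subsequent executions wait
```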

Page 18:

Designing a microprocessor is a team sport. Below are the author and acknowledgement lists, spanning architects, micro-architects, and circuit designers, for the papers whose figures I use.

There is no “i” in T-E-A-M ...

Page 19:

Break

Play:

Page 20:

Multi-Threading

(Dynamic Scheduling)

Page 21:

Power 4 (predates the Power 5 shown earlier)

Single-threaded predecessor to the Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

Page 22:

For most apps, most execution units lie idle

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995. For an 8-way superscalar.

Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?

Page 23:

Simultaneous Multi-threading ...

[Diagram: issue-slot utilization over 9 cycles for 8 units (M M FX FX FP FP BR CC). With one thread, many slots go empty each cycle; with two threads, the second thread's instructions fill slots the first leaves idle.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
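The point of the two diagrams, that a second thread's instructions fill issue slots the first thread leaves idle, can be shown with a toy slot-counting model (the per-cycle demand numbers are made up):

```python
def fill_slots(threads, width=8):
    """threads: per-thread lists of issue demand per cycle.
    Returns how many of the `width` slots get used each cycle."""
    used = []
    for cycle in zip(*threads):
        used.append(min(width, sum(cycle)))  # each cycle offers `width` slots
    return used

t0 = [3, 1, 4, 0, 2]     # one thread rarely needs all 8 units
t1 = [4, 5, 2, 6, 3]     # a second, independent thread
print(fill_slots([t0]))       # [3, 1, 4, 0, 2] -- many slots idle
print(fill_slots([t0, t1]))   # [7, 6, 6, 6, 5] -- SMT fills the gaps
```

The model ignores structural conflicts (two threads wanting the same FX unit), which is exactly what real SMT issue logic has to arbitrate.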

Page 24:

Power 4

Power 5

2 fetch (PC), 2 initial decodes

2 commits (architected register sets)

Page 25:

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.

Page 26:

Power 5 thread performance ...

Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.

Page 27:

Multi-Core

Page 28:

Recall: Superscalar utilization by a thread. For an 8-way superscalar.

Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.

Page 29:

Most of the Power 5 die is shared hardware

Core #1, Core #2. Shared components: L2 cache, L3 cache control, DRAM controller.

Page 30:

Core-to-core interactions stay on chip

(1) Threads on two cores that use shared libraries conserve L2 memory. (2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.

Page 31:

Sun Niagara

Page 32:

The case for Sun’s Niagara ... For an 8-way superscalar.

Observation: Some apps struggle to reach a CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

Page 33:

Niagara (original): 32 threads on one chip

8 cores: single-issue, 1.2 GHz, 6-stage pipeline, 4-way multi-threaded, fast crypto support.

Shared resources: 3MB on-chip cache, 4 DDR2 interfaces (32G DRAM, 20 Gb/s), 1 shared FP unit, GB Ethernet ports.

Die size: 340 mm² in 90 nm. Power: 50-60 W.

Sources: Hot Chips, via EE Times, Infoworld; J Schwartz weblog (Sun COO).

Page 34:

The board that booted Niagara first-silicon

Source: J Schwartz weblog (then Sun COO, now CEO)

Page 35:

Used in Sun Fire T2000: “Coolthreads”

Web server benchmarks used to position the T2000 in the market.

Claim: server uses 1/3 the power of competing servers.

Page 36:


IBM RISC chips, since Power 4 (2001) ...

Page 37:

Page 38:

Recap: Dynamic Scheduling

Three big ideas: register renaming, data-driven detection of RAW hazard resolution, and a bus-based architecture.

Dynamic scheduling has saved architectures that have a small number of registers: the IBM 360 floating-point ISA, the Intel x86 ISA.

Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.

Page 39:

On Tuesday

Epilogue ...

Have a good weekend!