ECE 4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of ECE, Georgia Institute of Technology
February 11, 2003
Why study P6 from last millennium?

- A paradigm shift from Pentium: a RISC core disguised as a CISC
- Huge market success: the microarchitecture, and the stock price
- Architected by former VLIW and RISC folks:
  - Multiflow (pioneer in VLIW architecture for super-minicomputers)
  - Intel i960 (Intel's RISC for graphics and embedded controllers)
- NetBurst (P4's microarchitecture) is based on P6
P6 Basics

- One implementation of the IA32 architecture
- Super-pipelined processor
- 3-way superscalar
- In-order front-end and back-end
- Dynamic execution engine (restricted dataflow)
- Speculative execution
- P6 microarchitecture family processors include:
  - Pentium Pro
  - Pentium II (PPro + MMX + 2x caches: 16KB I / 16KB D)
  - Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
  - Celeron (without MP support)
  - Later P-II/P-III/Celeron all have on-die L2 cache
x86 Platform Architecture

[Block diagram: the host processor (P6 core with L1 cache, plus a back-side bus to an SRAM L2 cache, on-die or on-package) connects over the front-side bus to the MCH of the chipset. The MCH links to system memory (DRAM), to a graphics processor (GPU) with a local frame buffer over AGP, and to the ICH, which provides PCI and USB I/O.]
Pentium III Die Map

- EBL/BBL: External/Back-side Bus Logic
- MOB: Memory Order Buffer
- Packed FPU: Floating-Point Unit for SSE
- IEU: Integer Execution Unit
- FAU: Floating-Point Arithmetic Unit
- MIU: Memory Interface Unit
- DCU: Data Cache Unit (L1)
- PMH: Page Miss Handler
- DTLB: Data TLB
- BAC: Branch Address Calculator
- RAT: Register Alias Table
- SIMD: Packed Floating-Point Unit
- RS: Reservation Station
- BTB: Branch Target Buffer
- TAP: Test Access Port
- IFU: Instruction Fetch Unit and L1 I-Cache
- ID: Instruction Decode
- ROB: Reorder Buffer
- MS: Micro-instruction Sequencer
ISA Enhancement (on top of Pentium)

- CMOVcc / FCMOVcc r, r/m
  - Conditional (predicated) move instructions
  - Based on condition codes (cc)
- FCOMI/P: compare FP stack top and set integer flags
- RDPMC/RDTSC instructions
- Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
- MMX in Pentium II
Instruction Fetch Unit

- IFU1: initiate fetch, requesting 16 bytes at a time
- IFU2: instruction length decoder (ILD) marks instruction boundaries; the BTB makes predictions
- IFU3: align instructions to the 3 decoders in 4-1-1 format

[Block diagram: a Next-PC mux, fed by the BTB and other fetch requests, supplies a linear address to the instruction TLB, instruction cache, streaming buffer, and victim cache. A select mux feeds the ILD, which adds length and prediction marks; an instruction rotator fills the instruction buffer, consuming bytes as the ID stage takes them.]
Dynamic Branch Prediction

- Similar to a two-level PAs design
- A branch history register (BHR) associated with each entry of the 512-entry BTB indexes per-pattern 2-bit counters in the pattern history tables (PHT)
- With a 16-entry Return Stack Buffer
- 4 branch predictions per cycle (due to the 16-byte fetch per cycle)
- Static prediction provided by the Branch Address Calculator when the BTB misses (see prior slide)
Instruction Decoders

- 4-1-1 decoders: decode rate depends on instruction alignment
- DEC1: translate x86 instructions into micro-operations (µops)
- DEC2: move decoded µops to the ID queue
- The MS performs translations either by:
  - generating the entire µop sequence from microcode ROM, or
  - delivering 4 µops from the complex decoder, with the rest from microcode ROM
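The 4-1-1 alignment rule can be illustrated with a toy bundler: the first decoder accepts any instruction of up to 4 µops, the other two accept only single-µop instructions, and a complex instruction that is not first must wait for the next cycle. Everything beyond that rule (including the microcode-ROM path for 4+ µop instructions) is left out of this sketch.

```python
def decode_cycles(uop_counts):
    """Group instructions into per-cycle 4-1-1 decode bundles (a sketch).

    uop_counts: µops each x86 instruction decodes into, in program order.
    Decoder 0 takes instructions of 1-4 µops; decoders 1 and 2 take only
    single-µop instructions, so a complex instruction that is not at the
    head of a bundle is deferred to the next cycle.
    """
    cycles = []
    i = 0
    while i < len(uop_counts):
        bundle = [uop_counts[i]]          # decoder 0: any 1-4 µop instruction
        i += 1
        for _ in range(2):                # decoders 1 and 2: simple µops only
            if i < len(uop_counts) and uop_counts[i] == 1:
                bundle.append(uop_counts[i])
                i += 1
            else:
                break                     # complex inst must lead a new bundle
        cycles.append(bundle)
    return cycles

# Complex-first code decodes in one cycle; the same µops badly aligned
# take two cycles, which is why 4-1-1 scheduling matters to compilers.
assert decode_cycles([3, 1, 1]) == [[3, 1, 1]]
assert decode_cycles([1, 3, 1]) == [[1], [3, 1]]
```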
Allocator (ALLOC)

- The interface between the in-order and out-of-order pipelines
- Allocates "3-or-none" µops per cycle into the RS and ROB; "all-or-none" in the MOB (load buffer and store buffer)
- Generates the physical destination (Pdst) from the ROB and passes it to the Register Alias Table (RAT)
- Stalls upon shortage of resources
Register Alias Table (RAT)

- Register renaming for the 8 integer registers, the 8 floating-point (stack) registers, and the flags: 3 µops per cycle
- 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
- RAT looks up physical ROB locations for renamed sources based on the RRF bit

Idiom fixes (1) and (2):

    SUB EAX, EAX    ; zeroing idiom recognized
    MOVB AL, m8
    ADD EAX, m32    ; no stall

Partial flag stalls (1):

    CMP EAX, EBX
    INC ECX
    JBE XX          ; stall

  JBE reads both ZF and CF, while INC affects (ZF, OF, SF, AF, PF) but not CF.

Partial flag stalls (2):

    TEST EBX, EBX
    LAHF            ; stall

  LAHF loads the low byte of EFLAGS.
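The renaming step described above (a fresh ROB entry as each µop's Pdst, RAT lookups for sources, fallback to the retired register file) can be sketched as follows. This is a simplification: ROB capacity, the RRF bit mechanics, and flag renaming are all omitted.

```python
def rename(insts):
    """Sketch of RAT-style register renaming onto ROB entries.

    insts: list of (dest, src1, src2) architectural register names, in
    program order. Each µop is allocated a new ROB entry as its physical
    destination (Pdst); sources are looked up in the RAT, falling back to
    the architectural (retired) register file, here tagged "RRF:", when
    no in-flight producer exists.
    """
    rat = {}          # architectural register -> in-flight ROB entry
    renamed = []
    for rob_idx, (dst, s1, s2) in enumerate(insts):
        ps1 = rat.get(s1, f"RRF:{s1}")   # in-flight value or retired copy
        ps2 = rat.get(s2, f"RRF:{s2}")
        rat[dst] = f"ROB:{rob_idx}"      # later readers of dst see this entry
        renamed.append((f"ROB:{rob_idx}", ps1, ps2))
    return renamed

# Back-to-back writes to EAX get distinct ROB slots, so the WAW hazard
# disappears, and the second µop correctly reads the first's result.
out = rename([("EAX", "EBX", "ECX"),
              ("EAX", "EAX", "EDX")])
assert out == [("ROB:0", "RRF:EBX", "RRF:ECX"),
               ("ROB:1", "ROB:0", "RRF:EDX")]
```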
Reservation Stations

- Gateway to execution: binds a maximum of 5 µops, one per port, each cycle
- 20-entry µop buffer bridging the in-order and out-of-order engines
- RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
- Oldest-first FIFO scheduling when multiple µops are ready in the same cycle

Dispatch ports:

- Port 0: IEU0, Fadd, Fmul, Imul, Div
- Port 1: IEU1, JEU
- Port 2: AGU0 (load address, LDA, to the MOB)
- Port 3: AGU1 (store address, STA, to the MOB)
- Port 4: store data (STD)

[Diagram also shows the packed FP units (Pfadd, Pfmul, Pfshuf) on the execution ports, results returning over writeback buses 0 and 1, loaded data arriving from the MOB/DCU, and retired data supplied from the ROB/RRF.]
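The oldest-first policy above can be sketched as a simple scan of the RS in allocation order. This toy model dispatches at most one µop per port per cycle; the real RS binds up to 5 µops across its five ports with more elaborate arbitration.

```python
def dispatch(rs_entries, free_ports):
    """Sketch of oldest-first dispatch from a reservation station.

    rs_entries: list of dicts in allocation (age) order, each with
    'ready' (all source data valid) and 'port' (its dispatch port).
    When several ready µops contend for the same port, the oldest wins.
    Returns the indices of the µops issued this cycle.
    """
    issued = []
    available = set(free_ports)
    for idx, uop in enumerate(rs_entries):   # allocation order = age order
        if uop["ready"] and uop["port"] in available:
            issued.append(idx)
            available.discard(uop["port"])   # port is busy for this cycle
    return issued

rs = [{"ready": False, "port": 0},   # oldest, but sources not ready
      {"ready": True,  "port": 0},   # oldest ready µop for port 0: wins
      {"ready": True,  "port": 0},   # loses the port this cycle
      {"ready": True,  "port": 2}]
assert dispatch(rs, {0, 1, 2}) == [1, 3]
```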
ReOrder Buffer (ROB)

- A 40-entry circular buffer, similar to that described in [SmithPleszkun85]
- 157 bits wide; provides 40 alias physical registers
- Out-of-order completion; exceptions are deposited in each entry
- Retirement (or de-allocation):
  - after resolving prior speculation
  - handles exceptions through the MS ((exception) code assist)
  - clears OOO state when a mispredicted branch or exception is detected
  - 3 µops per cycle, in program order
  - for multi-µop x86 instructions: none or all (atomic)

[Diagram: ALLOC and the RAT feed the RS and ROB; retired results move from the ROB to the RRF.]
Memory Execution Cluster

- Manages data memory accesses
- Address translation (DTLB)
- Detects violations of access ordering
- Fill buffers (FB) in the DCU, similar to MSHRs [Kroft'81], handle cache misses (non-blocking)

[Diagram: load (LD) and store-address (STA) µops from the RS/ROB pass through the DTLB to the DCU, tracked in the load buffer and store buffer, with the EBL handling external bus traffic.]
Memory Order Buffer (MOB)

- Allocated by ALLOC; a second-order RS for memory operations
- 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
- MOB contains:
  - 16-entry load buffer (LB)
  - 12-entry store address buffer (SAB)
- The SAB works in unison with:
  - the Store Data Buffer (SDB) in the MIU
  - the Physical Address Buffer (PAB) in the DCU
- Store Buffer (SB) = SAB + SDB + PAB
- Senior stores:
  - upon STD/STA retiring from the ROB, the SB marks the store "senior"
  - senior stores are committed back to memory in program order, when the bus is idle or the SB is full
- Prefetch instructions in P-III exhibit "senior load" behavior, since they have no explicit architectural destination
Store Coloring

- ALLOC assigns a Store Buffer ID (SBID) to each store in program order
- ALLOC tags each load with the most recent SBID
- Loads are checked against stores with equal or older SBIDs for potential address conflicts
- The SDB forwards data if a conflict is detected

Example (x86 instruction -> µops, store color):

    mov (0x1220), ebx  ->  std (ebx)    store color 2
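The coloring check above can be sketched as a filter over the store buffer: only stores whose SBID is equal to or older than the load's color can conflict with it, and among matching stores the youngest one forwards its data. Operand sizes and partial overlaps, which the real hardware must also check, are ignored in this sketch.

```python
def forward_from_store_buffer(load_sbid, load_addr, store_buffer):
    """Sketch of store coloring for memory disambiguation.

    load_sbid: the load's store color (SBID of the most recent prior
    store). Only stores with sbid <= load_sbid are older than the load;
    among those, the youngest store to the same address forwards its
    data. Returns the forwarded value, or None if no older store matches.
    """
    older_matches = [s for s in store_buffer
                     if s["sbid"] <= load_sbid and s["addr"] == load_addr]
    if not older_matches:
        return None                      # load reads from the cache instead
    return max(older_matches, key=lambda s: s["sbid"])["data"]

sb = [{"sbid": 1, "addr": 0x1220, "data": 7},
      {"sbid": 2, "addr": 0x1220, "data": 9},   # younger store, same address
      {"sbid": 3, "addr": 0x1400, "data": 5}]

# A load colored 2 at 0x1220 gets the youngest older store's data (9);
# a load colored 1 at 0x1400 ignores store 3, which is younger than it.
assert forward_from_store_buffer(2, 0x1220, sb) == 9
assert forward_from_store_buffer(1, 0x1400, sb) is None
```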
Memory Type Range Registers (MTRR)

- Control registers written by the system (OS)
- Supported memory types:
  - UnCacheable (UC)
  - Uncacheable Speculative Write-Combining (USWC or WC): uses a fill buffer entry as a WC buffer
  - WriteBack (WB)
  - Write-Through (WT)
  - Write-Protected (WP): e.g. supports copy-on-write in UNIX, saving memory by letting child processes share pages with their parents; new pages are created only when a child process attempts to write
- Page Miss Handler (PMH):
  - looks up the MTRRs while supplying physical addresses
  - returns the memory type and physical address to the DTLB
Intel NetBurst Microarchitecture

- Pentium 4's microarchitecture, a post-P6 new generation
- Original target market: graphics workstations, but the major competitor screwed themselves up...
- Design goals:
  - reduced clock period
  - a new pipeline designed for scalability
Innovations Beyond P6

- Hyper-pipelined technology
- Streaming SIMD Extension 2
- Enhanced branch predictor
- Execution trace cache
- Rapid execution engine
- Advanced Transfer Cache
- Hyper-Threading Technology (in Xeon and Xeon MP)
Pentium 4 Fact Sheet

- IA-32, fully backward compatible
- Available at speeds ranging from 1.3 to ~3 GHz
- Hyper-pipelined (20+ stages)
- 42+ million transistors
- 0.18µ for 1.7 to 1.9 GHz; 0.13µ for 1.8 to 2.8 GHz; die size of 217 mm2
- Consumes 55 watts of power at 1.5 GHz
- 400MHz (850) and 533MHz (850E) system bus
- 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s to L1 at 2.8 GHz)
- 1MB or 512KB L3 cache (in Xeon MP)
- 144 new 128-bit SIMD instructions (SSE2)
- Hyper-Threading Technology (only enabled in Xeon and Xeon MP)
Execution Trace Cache

- The primary first-level I-cache, replacing a conventional L1:
  - decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  - the branch misprediction penalty is horrible: 20 pipeline stages lost vs. 10 stages in P6
- Advantages:
  - caches post-decode µops
  - high-bandwidth instruction fetching
  - eliminates x86 decoding overheads
  - reduces branch recovery time on a TC hit
- Holds up to 12,000 µops
  - 6 µops per trace line
  - many (?) trace lines in a single trace
Execution Trace Cache (cont.)

- Delivers 3 µops per cycle to the OOO engine
- x86 instructions are read from L2 when the TC misses (7+ cycle latency)
- TC hit rate is comparable to an 8KB to 16KB conventional I-cache
- Simplified x86 decoder:
  - only one complex instruction decoded per cycle
  - instructions of more than 4 µops are executed from the microcode ROM (P6's MS)
- Branch prediction in the TC: 512-entry BTB + 16-entry RAS
- Together with the BP in the x86 IFU, reduces mispredictions by 1/3 compared to P6
- Intel did not disclose the details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)
Out-Of-Order Engine

- Similar design philosophy to P6; uses:
  - Allocator
  - Register Alias Table
- 128 physical registers
- 126-entry ReOrder Buffer
- 48-entry load buffer
- 24-entry store buffer

[Diagram: the front-end RAT maps onto the 128-entry register file and the 126-entry ROB.]
Micro-op Scheduling

- µop FIFO queues:
  - a memory queue for loads and stores
  - a non-memory queue
- µop schedulers:
  - several schedulers fire instructions to execution (the analogue of P6's RS)
  - 4 distinct dispatch ports: exec port 0, exec port 1, load port, store port
  - maximum dispatch: 6 µops per cycle (2 fast-ALU µops each from ports 0 and 1; 1 each from the load/store ports)
Data Memory Accesses

- 8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher)
- Load-to-use speculation:
  - dependent instructions are dispatched before the load finishes, due to the high frequency and deep pipeline
  - the scheduler assumes loads always hit L1
  - on an L1 miss, dependent instructions that have left the scheduler temporarily receive incorrect data: a mis-speculation
  - replay logic re-executes the mis-speculated operations; independent instructions are allowed to proceed
- Up to 4 outstanding load misses (= the 4 fill buffers in the original P6)
- Store-to-load forwarding buffer:
  - 24 entries
  - forwarding requires the same starting physical address and load data size <= store data size
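The load-hit speculation and replay behavior described above can be sketched as follows. The model is deliberately minimal: it only records which µops execute and how many times, standing in for the speculative first issue and the replay pass after an L1 miss.

```python
def schedule_with_load_speculation(load_hits_l1, dependents):
    """Sketch of load-hit speculation with replay.

    The scheduler dispatches a load's dependents assuming an L1 hit,
    because waiting for the hit/miss signal would cost cycles at high
    frequency. If the load actually missed, those dependents consumed
    stale data and must be replayed once the data arrives from L2.
    Returns a list of (µop, attempt) execution events.
    """
    events = [(uop, 1) for uop in dependents]       # speculative first issue
    if not load_hits_l1:
        # Mis-speculation: re-execute the dependents with the correct data.
        events += [(uop, 2) for uop in dependents]
    return events

# On a hit, each dependent executes once; on a miss, each executes twice.
assert schedule_with_load_speculation(True, ["add", "sub"]) == [
    ("add", 1), ("sub", 1)]
assert schedule_with_load_speculation(False, ["add"]) == [
    ("add", 1), ("add", 2)]
```

The trade-off this sketch illustrates: speculation wins whenever the L1 hit rate is high, since the rare miss costs a replay while the common hit saves the full hit/miss-resolution latency on every dependent µop.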
Streaming SIMD Extension 2

- Extends P-III SSE (Katmai New Instructions: KNI)

Hyper-Threading Technology

- Exploits TLP (Thread-Level Parallelism): issues and executes multiple threads at the same snapshot
- A single P4 Xeon appears to be 2 logical processors
- The logical processors share the same execution resources
- Architectural state is duplicated in hardware: 2 replicated contexts, each with its own PC (or IP) and architectural registers
- New directions of usage:
  - helper (or assisted) threads, e.g. speculative precomputation
  - speculative multithreading
- Clearwater (once called Xtream Logic): an 8-context SMT "network processor" designed by the DISC architect (the company no longer exists)
- Sun: a 4-SMT-processor CMP?
Speculative Multithreading

- SMT can justify a wider-than-ILP datapath, but the datapath is only fully utilized by multiple threads
- How to speed up a single-threaded program by utilizing multiple threads? What to do with the spare resources?
  - Execute both sides of hard-to-predict branches: eager (or polypath) execution, dynamic predication
  - Send another thread to scout ahead to warm up caches and the BTB: speculative precomputation, early branch resolution
  - Speculatively execute future work: multiscalar or dynamic multithreading, e.g. start several loop iterations concurrently as different threads, redoing the work if a data dependence is detected
  - Run a dynamic compiler/optimizer on the side: dynamic verification