Top Banner
Chapter 7 <1> Digital Design and Computer Architecture,2 nd Edi(on Chapter 7 David Money Harris and Sarah L. Harris
22

Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Jul 23, 2018

Download

Documents

vanthuy
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <1>    

Digital  Design  and  Computer  Architecture,  2nd  Edi(on  

Chapter  7  

David  Money  Harris  and  Sarah  L.  Harris  

Page 2: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <2>    

Chapter  7  ::  Topics  

•  Introduc(on  (done)  •  Performance  Analysis  (done)  •  Single-­‐Cycle  Processor  (done)  •  Mul(cycle  Processor  (done)  •  Pipelined  Processor  (done)  •  Excep(ons  (done)  •  Advanced  Microarchitecture  (now)  

Page 3: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <3>    

•  Deep  Pipelining  •  Branch  PredicCon  •  Superscalar  Processors  •  Out  of  Order  Processors  •  Register  Renaming  •  SIMD  •  MulCthreading  •  MulCprocessors  

Advanced  Microarchitecture  

Page 4: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <4>    

•  10-­‐20  stages  typical  •  Number  of  stages  limited  by:  – Pipeline  hazards  – Sequencing  overhead  – Power  – Cost  

Deep  Pipelining  

Page 5: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <5>    

•  Ideal  pipelined  processor:  CPI  =  1  •  Branch  mispredicCon  increases  CPI  •  Sta(c  branch  predic(on:  –  Check  direcCon  of  branch  (forward  or  backward)  –  If  backward,  predict  taken  –  Else,  predict  not  taken  

•  Dynamic  branch  predic(on:  –  Keep  history  of  last  (several  hundred)  branches  in  branch  target  buffer,  record:  •  Branch  desCnaCon  •  Whether  branch  was  taken  

Branch  PredicCon  

Page 6: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <6>    

add $s1, $0, $0 # sum = 0

add $s0, $0, $0 # i = 0 addi $t0, $0, 10 # $t0 = 10

for:

beq $s0, $t0, done # if i == 10, branch

add $s1, $s1, $s0 # sum = sum + i

addi $s0, $s0, 1 # increment i j for

done:

Branch  PredicCon  Example  

Page 7: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <7>    

•  Remembers  whether  branch  was  taken  the  last  Cme  and  does  the  same  thing  

•  Mispredicts  first  and  last  branch  of  loop  

1-­‐Bit  Branch  Predictor  

Page 8: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <8>    

Only  mispredicts  last  branch  of  loop  

stronglytaken

predicttaken

weaklytaken

predicttaken

weaklynot taken

predictnot taken

stronglynot taken

predictnot taken

taken taken taken

takentakentaken

taken

taken

2-­‐Bit  Branch  Predictor  

Page 9: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <9>    

•  MulCple  copies  of  datapath  execute  mulCple  instrucCons  at  once  

•  Dependencies  make  it  tricky  to  issue  mulCple  instrucCons  at  once  

CLK CLK CLK CLK

ARD A1

A2RD1A3

WD3WD6

A4A5A6

RD4

RD2RD5

InstructionMemory

RegisterFile Data

Memory

ALUs

PC

CLK

A1A2

WD1WD2

RD1RD2

Superscalar  

Page 10: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <10>    

    Ideal  IPC:    2 Actual  IPC:  2

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lw

add

lw $t0, 40($s0)

add $t1, $s1, $s2

sub $t2, $s1, $s3

and $t3, $s3, $s4

or $t4, $s1, $s5

sw $s5, 80($s0)

$t1$s2

$s1

+

RF$s3

$s1

RF

$t2-

DMIM

sub

and $t3$s4

$s3

&

RF$s5

$s1

RF

$t4|

DMIM

or

sw80

$s0

+ $s5

Superscalar  Example  

Page 11: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <11>    

lw $t0, 40($s0) add $t1, $t0, $s1  sub $t0, $s2, $s3   Ideal  IPC:    2 and $t2, $s4, $t0 Actual  IPC:  6/5  =  1.17 or $t3, $s5, $s6

sw $s7, 80($t3)

Stall

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

RF$s1

$t0add

RF$s1

$t0

RF

$t1+

DM

RF$t0

$s4

RF

$t2&

DMIM

and

IMor

and

sub

|$s6

$s5$t3

RF80

$t3

RF+

DMsw

IM

$s7

9

$s3

$s2

$s3

$s2-

$t0

oror $t3, $s5, $s6

IM

Superscalar  with  Dependencies  

Page 12: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <12>    

•  Looks  ahead  across  mulCple  instrucCons  •  Issues  as  many  instrucCons  as  possible  at  once  •  Issues  instrucCons  out  of  order  (as  long  as  no  dependencies)  

•  Dependencies:  –  RAW  (read  aaer  write):  one  instrucCon  writes,  later  instrucCon  reads  a  register  

– WAR  (write  aaer  read):  one  instrucCon  reads,  later  instrucCon  writes  a  register  

– WAW  (write  aaer  write):  one  instrucCon  writes,  later  instrucCon  writes  a  register  

Out  of  Order  Processor  

Page 13: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <13>    

•  Instruc(on  level  parallelism  (ILP):  number  of    instrucCon  that  can  be  issued  simultaneously  (average  <  3)  

•  Scoreboard:  table  that  keeps  track  of:  – InstrucCons  waiCng  to  issue  – Available  funcConal  units  – Dependencies  

Out  of  Order  Processor  

Page 14: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <14>    

lw $t0, 40($s0) add $t1, $t0, $s1  sub $t0, $s2, $s3   Ideal  IPC:    2 and $t2, $s4, $t0 Actual  IPC:  6/4  =  1.5 or $t3, $s5, $s6

sw $s7, 80($t3) Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

or|$s6

$s5$t3

RF80

$t3

RF+

DMsw $s7

or $t3, $s5, $s6

IM

RF$s1

$t0

RF

$t1+

DMIM

add

sub-$s3

$s2$t0

two cycle latencybetween load anduse of $t0

RAW

WAR

RAW

RF$t0

$s4

RF&

DMand

IM

$t2

RAW

Out  of  Order  Processor  Example  

Page 15: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <15>    

Time (cycles)

1 2 3 4 5 6 7

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $r0, $s2, $s3

and $t2, $s4, $r0

sw $s7, 80($t3)

sub-$s3

$s2$r0

RF$r0

$s4

RF&

DMand

$s7

or $t3, $s5, $s6IM

RF$s1

$t0

RF

$t1+

DMIM

add

sw+80

$t3

RAW

$s6

$s5|

or

2-cycle RAW

RAW

$t2

$t3

lw $t0, 40($s0)      add $t1, $t0, $s1      sub $t0, $s2, $s3      Ideal  IPC:    2  and $t2, $s4, $t0      Actual  IPC:  6/3  =  2  or $t3, $s5, $s6

sw $s7, 80($t3)

Register  Renaming  

Page 16: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <16>    

•  Single  InstrucCon  MulCple  Data  (SIMD)  –  Single  instrucCon  acts  on  mulCple  pieces  of  data  at  once  –  Common  applicaCon:  graphics  –  Perform  short  arithmeCc  operaCons  (also  called  packed  arithme2c)  

•  For  example,  add  four  8-­‐bit  elements  padd8 $s2, $s0, $s1

a0

0781516232432 Bit position

$s0a1a2a3

b0 $s1b1b2b3

a0 + b0 $s2a1 + b1a2 + b2a3 + b3

+

SIMD  

Page 17: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <17>    

•  Mul(threading  – Wordprocessor:  thread  for  typing,  spell  checking,  prinCng  

•  Mul(processors  – MulCple  processors  (cores)  on  a  single  chip  

Advanced  Architecture  Techniques  

Page 18: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <18>    

•  Process:  program  running  on  a  computer  – MulCple  processes  can  run  at  once:  e.g.,  surfing  Web,  playing  music,  wriCng  a  paper  

•  Thread:  part  of  a  program  – Each  process  has  mulCple  threads:  e.g.,  a  word  processor  may  have  threads  for  typing,  spell  checking,  prinCng  

Threading:  DefiniCons  

Page 19: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <19>    

•  One  thread  runs  at  once  •  When  one  thread  stalls  (for  example,  waiCng  for  memory):  – Architectural  state  of  that  thread  stored  – Architectural  state  of  waiCng  thread  loaded  into  processor  and  it  runs  

–  Called  context  switching  •  Appears  to  user  like  all  threads  running  simultaneously  

Threads  in  ConvenConal  Processor  

Page 20: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <20>    

•  MulCple  copies  of  architectural  state  •  MulCple  threads  ac(ve  at  once:  – When  one  thread  stalls,  another  runs  immediately  –  If  one  thread  can’t  keep  all  execuCon  units  busy,  another  thread  can  use  them  

•  Does  not  increase  instrucCon-­‐level  parallelism  (ILP)  of  single  thread,  but  increases  throughput    

 Intel  calls  this  “hyperthreading”  

MulCthreading  

Page 21: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <21>    

•  MulCple  processors  (cores)  with  a  method  of  communicaCon  between  them  

•  Types:  – Homogeneous:  mulCple  cores  with  shared  memory  – Heterogeneous:  separate  cores  for  different  tasks  (for  example,  DSP  and  CPU  in  cell  phone)  

–  Clusters:  each  core  has  own  memory  system  

MulCprocessors  

Page 22: Chapter(7(users.tricity.wsu.edu/~bobl/cpts260/u08_multiprocessors/notes_hh.pdf · Digital’Design’and’Computer’Architecture,2 ndEdion" Chapter(7(David(Money(Harris(and(Sarah(L.(Harris

Chapter  7  <22>    

•  Paferson  &  Hennessy’s:  Computer  Architecture:  A  Quan2ta2ve  Approach  

•  Conferences:  – www.cs.wisc.edu/~arch/www/  –  ISCA  (InternaConal  Symposium  on  Computer  Architecture)  

– HPCA  (InternaConal  Symposium  on  High  Performance  Computer  Architecture)  

 

Other  Resources