Low Power Floating Point Units
Dr. Alberto A. Del Barrio García, Complutense University of Madrid

Lp fp us with multispeculative techniques_19_12_12

Jun 19, 2015

Page 1: Lp fp us with multispeculative techniques_19_12_12

Low Power Floating Point Units

Dr. Alberto A. Del Barrio García, Complutense University of Madrid

Page 2: Lp fp us with multispeculative techniques_19_12_12

Outline
• Introduction
• FP-MACs
   – Why not the multiplier?
• Multispeculative Adders (MSADD)
• Multispeculative FP-MACs
   – Integrating MSADD and FP-MAC
   – The CCTA
   – The Sign Predictor
• Experiments
• Conclusions

Page 3: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Introduction)

Page 4: Lp fp us with multispeculative techniques_19_12_12

Introduction: FPUs

[Figure: Blue Gene/Q FPU (1st in the Top 500, June 2012), showing four FP-MAC units (FP-MAC0 to FP-MAC3), four register files (RF), and LOAD, A2 and Permute blocks, with paths labeled 256 and 64.]

[Figure: Power distribution within a node of a hypothetical Exascale system [Kogge et al., 2008].]

Page 5: Lp fp us with multispeculative techniques_19_12_12

Introduction: FP-MAC, R = A×B + C
• Multiplication stage
   – Effective addition/subtraction decision: sub = (s(A) xor s(B)) xor s(C)
   – Exponent difference: d = exp(C) - (exp(A) + exp(B))
   – C alignment and inversion (Cinv)
   – A×B in CSA format (X, Y) such that Z = 2X + Y
• Addition stage
   – Sign calculation
   – Addition: 2X + Y + Cinv
   – LZA: normalization shift calculation
• Round and Normalize
   – 2's complement
   – Normalization and exponent adjustment
   – Rounding and postnormalization
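As a rough illustration of two items in the multiplication stage, the sketch below computes the effective-subtraction flag and performs one 3:2 carry-save step whose outputs satisfy the Z = 2X + Y convention. The function names are hypothetical, and plain non-negative Python integers stand in for the hardware bit vectors.

```python
def effective_subtraction(sign_a: int, sign_b: int, sign_c: int) -> int:
    """sub = (s(A) xor s(B)) xor s(C); 1 means A*B and C have opposite signs."""
    return (sign_a ^ sign_b) ^ sign_c


def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """One 3:2 carry-save step: returns (X, Y) such that a + b + c == 2*X + Y."""
    y = a ^ b ^ c                      # per-bit sum, carries ignored
    x = (a & b) | (a & c) | (b & c)    # per-bit carries, worth twice their position
    return x, y


x, y = carry_save_add(13, 11, 7)
print(x, y, 2 * x + y)                 # the last value equals 13 + 11 + 7
```

No carry ever travels along the word in `carry_save_add`, which is why the slow carry-propagate addition can be postponed to the addition stage.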

Page 6: Lp fp us with multispeculative techniques_19_12_12

Introduction: FP-MAC

[Figure: block diagram of the baseline FP-MAC. Inputs A[wt-1:0], B[wt-1:0], C[wt-1:0]; output R[wt-1:0]. Blocks: Sign Logic, Exponent Difference, Alignment Shifter, Inversion, Booth Multiplier Array, 3:2 CSA Compressor, Carry Propagate Adder, Leading Zero Anticipator, Incrementer, MUX, Sign Adjust, Complement, Normalization Shifter, Sticky Logic, Exponent Adjust, Rounder & Post-norm. Internal signals include Compl, cout, sticky, rnd and lsb.]

Page 7: Lp fp us with multispeculative techniques_19_12_12

Introduction: FP-MAC
• Baseline implementation [Huang et al., Trans. on Comp. 2012]
• An FPU consists of one or several floating-point multiply-accumulate functional units (FP-MACs)
• Three significant components: the multiplier, the adder, and the rounding and normalizing module

Here is some data [Huang et al. 2012]

Page 8: Lp fp us with multispeculative techniques_19_12_12

Introduction
• Multispeculation, Exascale and FPUs
   – The FPU plays an important role in Exascale systems
      • Exaflop energy efficiency must be improved
   – FPUs are critical from both the performance and the power/energy points of view
   – Multispeculative adders are low-power
   – This is an excellent scenario for applying multispeculation
      • Multiplier → Extremely difficult
      • Addition → Possible
      • Rounder → Let me think :)

Page 9: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: FP-MACs, why not the multiplier?)

Page 10: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier
• Why not the multiplier?
   – Partial Product Matrix (PPM) + Last Stage Adder (LSA)
      • In an FP-MAC the LSA corresponds to the accumulation
   – All PPM implementations consist of a Booth recoder and a Wallace/Dadda tree
      • Booth (1951), Wallace (1964), Dadda (1965)
      • Important improvements since then
         – Using 4:2 compressors instead of (3,2) counters [Weinberger, 1981]
         – Balancing delays: Three Dimensional optimization Method (TDM) [Oklobdzija, Villeger and Liu, 1996]
   – LSA designs continue to evolve with the appearance of VLFUs

Page 11: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier

Page 12: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier

Radix-4 Booth. Higher radices have been tried, but the additional complexity makes them not worthwhile.

Radix-4 is what is used in practice.

Page 13: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Booth Encoding
• Add a 0 to the right of the LSB, since the first group has no group with which to overlap
• Examine 3 bits at a time
• Encode 2 bits at a time
• Overlap one bit between partial products
• Example: multiplier = 1001 = -7 (C2, i.e. two's complement, format)
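The encoding rules above can be sketched as a small recoder. This is a minimal illustration, assuming an even bit width and the standard radix-4 digit set {-2, -1, 0, 1, 2}; the function name is hypothetical.

```python
def booth_radix4_digits(multiplier: int, n_bits: int) -> list[int]:
    """Recode an n_bits-wide two's complement multiplier into radix-4 Booth digits.

    Digits are returned least significant first; n_bits is assumed to be even.
    """
    bits = [(multiplier >> i) & 1 for i in range(n_bits)]
    padded = [0] + bits                   # padded[0] is the extra 0 right of the LSB
    digits = []
    for i in range(0, n_bits, 2):         # encode 2 bits per step
        low, mid, high = padded[i], padded[i + 1], padded[i + 2]  # examine 3 bits
        digits.append(low + mid - 2 * high)
    return digits


digits = booth_radix4_digits(0b1001, 4)
value = sum(d * 4 ** i for i, d in enumerate(digits))
print(digits, value)                      # value is -7, matching the slide's example
```

Each digit selects one partial product (0, ±multiplicand or ±2×multiplicand), so an n-bit multiplier yields n/2 partial products instead of n.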

Page 14: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier

Number of stages in a Dadda Tree

Dadda Tree for an 8x8 multiplier with (3,2) counters
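The stage count of a Dadda tree follows from the usual maximum-height sequence 2, 3, 4, 6, 9, 13, ... (each term is the floor of 1.5 times the previous one). The helper below is a small sketch of that rule; its name is hypothetical and it counts only reduction stages built from (3,2) counters.

```python
def dadda_stages(rows: int) -> int:
    """Reduction stages needed to bring `rows` partial products down to 2 rows."""
    heights = [2]
    while heights[-1] < rows:
        heights.append(heights[-1] * 3 // 2)   # 2, 3, 4, 6, 9, 13, 19, 28, ...
    return len(heights) - 1


for rows in (8, 24, 53):
    print(rows, dadda_stages(rows))
```

For the 8x8 case this gives the reduction 8 → 6 → 4 → 3 → 2, i.e. four stages before the last stage adder.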

Page 15: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier

Conceptually: a 4:2 compressor built with two (3,2) counters (Full Adders)

(7,3) and (15,4) counters, and several other compressors, have also been tried, but 4:2 compressors seem to be the most efficient. Most implementations today use 4:2 compressors.
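The conceptual construction can be checked bit by bit. The sketch below chains two full adders into a 4:2 compressor; the port names are illustrative, not taken from the slides.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter: returns (sum, carry) with a + b + c == sum + 2*carry."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """4:2 compressor from two full adders: returns (sum, carry, cout)."""
    s1, cout = full_adder(x1, x2, x3)     # cout does not depend on cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


# Exhaustive check of the defining identity over all 32 input combinations.
for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
```

Because `cout` is independent of `cin`, a row of these cells has no horizontal carry ripple, which is what makes the cell attractive inside the partial product tree.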

Page 16: Lp fp us with multispeculative techniques_19_12_12

FP-MAC: Multiplier

An example of the Oklobdzija et al. technique: a balanced 4:2 compressor built with two (3,2) counters. Signals with large delays are connected to "fast inputs" and signals with short delays to "slow inputs".

Page 17: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Multispeculative Adders (MSADD))

Page 18: Lp fp us with multispeculative techniques_19_12_12

Multispeculative Adder

Page 19: Lp fp us with multispeculative techniques_19_12_12

Kogge-Stone Adder

• White dots:
   • g_i = x_i · y_i
   • p_i = x_i xor y_i
• Purple dots:
   • (G', P') * (G'', P'') = (G' + P'·G'', P'·P''), with X' being more significant than X''
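The two kinds of dots map directly onto code. The following is a bit-level software sketch of a Kogge-Stone adder, not a hardware description; the function name and the list-based representation are choices made for illustration.

```python
def kogge_stone_add(x: int, y: int, n: int) -> tuple[int, int]:
    """Add two n-bit values with a Kogge-Stone prefix tree; returns (sum, carry_out)."""
    # White dots: per-bit generate and propagate.
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    dist = 1
    while dist < n:                       # log2(n) levels of purple dots
        new_G, new_P = G[:], P[:]
        for i in range(dist, n):
            # (G', P') * (G'', P'') = (G' + P'·G'', P'·P'')
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0   # group generate of bits i-1..0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1]


print(kogge_stone_add(0b1011, 0b0110, 4))
```

The `while` loop runs about log2(n) times and touches close to n nodes per level, which is where the O(log(n)) delay and O(n*log(n)) area figures come from.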

Page 20: Lp fp us with multispeculative techniques_19_12_12

Multispeculative Adder

n-bit Kogge-Stone Adder
o Complex carry propagation tree
o The fastest non-speculative design
   o O(log(n)) delay
o Huge area
   o O(n*log(n)), which matters for large n

n-bit Multispeculative KS
o n/k simpler carry propagation trees
o Extremely fast
   o O(log(k)) delay
   o Depends on the predictors' accuracy
o Reduced area
   o Small KS adders have approximately linear area
   o Area: n/k * O(k) ≅ O(n)

[Figure: carry trees of both adders, with predictors (P) between the fragments of the multispeculative version. Image taken from http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html]
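A behavioral sketch of the fragmented addition may help here. It assumes the simplest predictor, one that always predicts a carry-in of 0 for every fragment, and it uses Python's `+` for each k-bit fragment instead of a small KS tree. All names are hypothetical.

```python
def msadd(a: int, b: int, n: int, k: int) -> tuple[list[int], list[int], bool]:
    """Speculative n-bit addition split into n/k independent k-bit fragments.

    Every fragment assumes carry-in 0. Returns (sums, carries, hit), with
    fragment 0 (least significant) first; hit is True when no prediction failed.
    """
    mask = (1 << k) - 1
    sums, carries = [], []
    for f in range(n // k):
        t = ((a >> (f * k)) & mask) + ((b >> (f * k)) & mask)
        sums.append(t & mask)
        carries.append(t >> k)
    # A prediction fails exactly when some fragment other than the last carries out.
    hit = not any(carries[:-1])
    return sums, carries, hit


print(msadd(0x1234, 0x0101, 16, 4))   # no inter-fragment carry: a hit
print(msadd(0xFFFF, 0x0001, 16, 4))   # fragment 0 carries out: a misprediction
```

The fragments never wait for each other, so the delay is that of one k-bit adder; the price is that a misprediction must be detected and corrected afterwards.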

Page 21: Lp fp us with multispeculative techniques_19_12_12

Multispeculative Adder

[Figure: two 16-bit carry trees, (a) and (b), drawn over bit positions 15 down to 0.]

The same bit flip affects far fewer nodes, especially if there is a hit.

Moreover, the MSADD contains fewer nodes.

Page 22: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Multispeculative FP-MACs, integrating MSADD and FP-MAC)

Page 23: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation

Conceptually, the baseline design consists of two adders:
• 1 adder to perform the addition
• 1 constant adder to complement the result when necessary

The idea is to replace these adders with an MSADD with Static Zero Prediction. To avoid additional delays due to corrections, mispredictions (+1) are corrected on the fly in the C2 complementer.

[Figure: the baseline datapath (an n-bit Adder followed by an n-bit C2 Complementer controlled by Compl) next to the proposed one: an MSADD that splits operands A and B into n/k fragments of k bits (Frag. 0 to Frag. n/k-1), with predictors P1 to Pn/k-1, per-fragment pred and err signals, hit and Cout, producing Z in k-bit slices, followed by a New C2 Complementer built from k-bit adders controlled by Compl and Mi.]

Page 24: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation
• In every k-bit fragment of the C2 complementer, 4 cases can happen:
   – Compl = '0', Mi = '0'. The k-bit result is correct and must not be complemented.
   – Compl = '0', Mi = '1'. The k-bit result is incorrect and must not be complemented. Hence, we must perform the operation X+1.
   – Compl = '1', Mi = '0'. The k-bit result is correct and must be complemented. Hence, we must perform the operation C1(X), the one's complement of X.
   – Compl = '1', Mi = '1'. The k-bit result is incorrect and must be complemented. Hence, we must perform the operation C1(X+1), except in fragment 0, which must perform C1(X)+1 instead.
• Lemma. C1(X+1) = C1(X) + "11…11" (modulo 2^k)
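The four cases and the lemma can be exercised on a single k-bit fragment. The sketch below is one reading of the slide: it treats fragment 0 with Compl = '1' as always adding the +1 of the two's complement, which is an assumption, and the function and parameter names are hypothetical.

```python
def correct_fragment(x: int, compl: int, m: int, k: int, fragment: int) -> int:
    """Apply the complement/correction cases to one k-bit fragment value x."""
    mask = (1 << k) - 1
    if not compl:
        return (x + m) & mask                   # X, or X+1 when Mi = '1'
    t = ~x & mask                               # C1(X)
    if fragment == 0:
        return (t + 1) & mask                   # fragment 0: C1(X) + 1 (assumed reading)
    return (t + (mask if m else 0)) & mask      # C1(X), or C1(X) + "11...11"


# Lemma check for k = 4: C1(X+1) == C1(X) + "1111" (mod 2^4) for every X.
K, MASK = 4, 0b1111
for x in range(16):
    assert ~(x + 1) & MASK == ((~x & MASK) + MASK) & MASK

print(format(correct_fragment(0b1110, 1, 1, K, fragment=2), "04b"))
```

The lemma is what lets each fragment use a single k-bit addition: the correction becomes a constant operand (all zeros, all ones, or one) chosen from Compl and Mi.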

Page 25: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation
• I1 = 1111 1111 1111 1111
• I2 = 0000 0000 0000 0001
• Without multispeculation, I1+I2 = 0000 0000 0000 0000
• Applying multispeculation, the result of I1+I2 after the first step is:
   – S = 1111 1111 1111 0000
   – C = 0000 0000 0001 0000
• As the carry-out of fragment 0, i.e. M1, is equal to '1', the new two's complementer must add it to the addition result of fragment 1. The new vectors would be:
   – S = 1111 1111 0000 0000
   – C = 0000 0001 0000 0000
• We need more steps to complete the correction → NOT ALLOWED
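To make the cost of this chain concrete, the sketch below applies the naive fix, adding each pending carry into the next fragment one round at a time, and counts the rounds. It starts from the S and C vectors of the example above (fragment 0 first); the function name is hypothetical.

```python
def ripple_correction(sums: list[int], carries: list[int], k: int) -> tuple[list[int], int, int]:
    """Naively push pending carries fragment by fragment.

    Returns (corrected sums, final carry-out, number of correction rounds).
    """
    mask = (1 << k) - 1
    sums, pending = list(sums), list(carries)
    carry_out, rounds = pending[-1], 0
    while any(pending[:-1]):
        nxt = [0] * len(sums)
        for f in range(1, len(sums)):
            if pending[f - 1]:                 # carry left behind by fragment f-1
                t = sums[f] + 1
                sums[f], nxt[f] = t & mask, t >> k
        carry_out |= nxt[-1]
        pending = nxt
        rounds += 1
    return sums, carry_out, rounds


# S = 1111 1111 1111 0000 and C = 0000 0000 0001 0000 from the slide.
print(ripple_correction([0b0000, 0b1111, 0b1111, 0b1111], [1, 0, 0, 0], 4))
```

In this worst case the number of rounds grows with the number of fragments, which is the sequential behavior the design has to avoid.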

Page 26: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Multispeculative FP-MACs, the CCTA)

Page 27: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation

With the proposed scheme it is possible to correct individual mispredictions, but propagating mispredictions can still occur. To solve this problem, the Correction Carry Tree Anticipator (CCTA) feeds the k-bit modules with the proper Cin, so that the output of the C2 complementer is fully correct.

[Figure: the MSADD takes two n-bit operands and produces the n-bit sum S and the carries C (n/k-1 bits); the G and P signals (n/k-1 bits) enter the Corrected Carry Tree Anticipator, which produces the Mi signals (n/k-1 bits); the New C2 Complementer combines S, Mi and Compl into the n-bit result ZSM.]
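A behavioral sketch of the idea, restricted to the Compl = '0' path, is given below. It assumes that a fragment generates (G) when its speculative addition carries out and propagates (P) when its speculative sum is all ones, and it evaluates the carry recurrence with a plain loop where hardware would use a prefix tree of depth log(n/k). The names are hypothetical.

```python
def ccta_carries(sums: list[int], carries: list[int], k: int) -> list[int]:
    """Return M, where M[i] is the real carry into fragment i and M[-1] the carry-out."""
    mask = (1 << k) - 1
    G = list(carries)                        # fragment produces a carry on its own
    P = [int(s == mask) for s in sums]       # fragment forwards an incoming carry
    M = [0]
    for i in range(len(sums)):
        M.append(G[i] | (P[i] & M[i]))       # same (G, P) operator as in the adder tree
    return M


def msadd_with_ccta(a: int, b: int, n: int, k: int) -> tuple[int, int]:
    """Speculative fragment sums, then one parallel +Mi correction per fragment."""
    mask = (1 << k) - 1
    frags = [((a >> (f * k)) & mask) + ((b >> (f * k)) & mask) for f in range(n // k)]
    sums, carries = [t & mask for t in frags], [t >> k for t in frags]
    M = ccta_carries(sums, carries, k)
    z = 0
    for f, s in enumerate(sums):             # independent k-bit additions
        z |= ((s + M[f]) & mask) << (f * k)
    return z, M[-1]


print(msadd_with_ccta(0xFFFF, 0x0003, 16, 4))
```

Once every Mi is known, each fragment needs only one more k-bit addition, so no carry has to travel from one fragment to the next during the correction.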

Page 28: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: An example

I1 = 1111 1111 1111 1111
I2 = 0000 0000 0000 0011

The result will be positive, hence Compl = '0'. With a conventional flow:
Z = ZSM = 0000 0000 0000 0010

Applying multispeculation in the addition, the result of I1+I2 is:
S = 1111 1111 1111 0010
C = 0000 0000 0001 0000

With the proper transformations and the Mi's coming from the CCTA:
T   = 1111 1111 1111 0010
M   = 0001 0001 0001 0000
ZSM = 0000 0000 0000 0010

These 4-bit additions are performed in parallel, without propagating carries from one fragment to the next.

CCTA inputs: G0 = '1', P0 = '0'; G1 = '0', P1 = '1'; G2 = '0', P2 = '1'; G3 = '0', P3 = '1'
CCTA outputs: M1 = '1', M2 = '1', M3 = '1', M4 = '1'
[Figure: CCTA prefix tree with intermediate (G,P) nodes ('1','0'), ('0','1'), ('0','1'), ('1','0') and ('1','0').]

Page 29: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: An example (2)

I1 = 1111 1110 1111 1001
I2 = 0000 0000 0001 0011

The result (-263)+19 is negative, hence Compl = '1'. With a conventional flow:
Z   = 1111 1111 0000 1100
ZSM = 0000 0000 1111 0100

Applying multispeculation in the addition, the result of I1+I2 is:
S = 1111 1110 0000 1100
C = 0000 0001 0000 0000

With the proper transformations and the Mi's coming from the CCTA:
T   = 0000 0001 1111 0011
M   = 0000 1111 0000 0001
ZSM = 0000 0000 1111 0100

These 4-bit additions are performed in parallel, without propagating carries from one fragment to the next.

CCTA inputs: G0 = '0', P0 = '0'; G1 = '1', P1 = '0'; G2 = '0', P2 = '0'; G3 = '0', P3 = '1'
CCTA outputs: M1 = '0', M2 = '1', M3 = '0', M4 = '0'
[Figure: CCTA prefix tree with intermediate (G,P) nodes ('1','0'), ('0','0'), ('0','0'), ('0','0') and ('0','0').]

Page 30: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: a fast evaluation
• Execution time: worst case log(k) + log(n/k) + log(k) = log(n) + log(k), versus 2log(n) for the baseline
• Area and energy: MSADDs occupy less area and consume less energy than conventional adders
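As a quick numeric reading of the execution-time bullet, the sketch below counts logic levels for both datapaths, taking base-2 logarithms and ignoring constant factors. The chosen widths are only examples and the function name is hypothetical.

```python
from math import log2


def logic_levels(n: int, k: int) -> tuple[float, float]:
    """Return (multispeculative levels, baseline levels) for n-bit operands, k-bit fragments."""
    ms = log2(k) + log2(n / k) + log2(k)   # MSADD + CCTA + new C2 complementer
    baseline = 2 * log2(n)                 # adder + C2 complementer
    return ms, baseline


for n, k in ((64, 4), (64, 8), (128, 8)):
    print(n, k, logic_levels(n, k))
```

For n = 64 and k = 4 the count is 8 levels against 12; the gap narrows as k grows, since the saving equals log(n) - log(k).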

[Figure: delay comparison. Proposed datapath: MSADD O(log(k)), Corrected Carry Tree Anticipator O(log(n/k)), New C2 Complementer O(log(k)). Baseline datapath: Adder O(log(n)), C2 Complementer O(log(n)).]

Page 31: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Multispeculative FP-MACs, the Sign Predictor)

Page 32: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: the Compl signal
• In conventional implementations, Compl = addition.msbit
• With the multispeculative adder stage the msbit can be incorrect
• Solutions
   – [Lang & Bruguera 2004, 2005]: Sign Detector + full mantissa comparison in the worst case
   – Ours → Predict the sign

Page 33: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: the Sign Predictor
• Only a few cases can mispredict, given R = A×B + C
   – d = Exp(C) - (Exp(A) + Exp(B))
      • d < 0 → R >= 0
      • d > 1 → R < 0
   – sub = sign(C) xor sign(A) xor sign(B)
      • This signal is '1' if there is an effective subtraction, i.e. iff sign(A×B) differs from sign(C)
   – Full mantissa comparison (a misprediction in our case) iff sub = '1' and (d = 0 or d = 1) [Lang and Bruguera]

Page 34: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: the Sign Predictor

est is a combinational estimation of the Compl signal. Initially, est = '1' iff sub = '1' and d > 1.

hit <= est xor msbit;
Q(i+1) <= not(hit);
Compl <= Q(i) xor est;

If there is a misprediction, only one stall cycle is required to recomplement the result.

[Figure: two-state machine with states S0 and S1; hit = '0' leads from S0 to S1, hit = '1' leads to S0.]

hit  Q(i)  Compl   Q(i+1)  Comments
0    0     est     1       S0 and failure, transition to S1
0    1     -       -       This case never happens; in S1 there is always a hit
1    0     est     0       S0 and hit, remain in S0
1    1     ¬(est)  0       S1 and hit, transition to S0
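The table can be read as a two-state machine that costs one extra cycle per misprediction. The simulation below is a sketch under one assumption about the slide's intent: hit is taken to mean that the Compl value used in that cycle agrees with the real msbit. The function name and the input format are hypothetical.

```python
def run_sign_predictor(ops: list[tuple[int, int]]) -> tuple[list[int], int]:
    """Simulate the predictor over (est, msbit) pairs, one pair per operation.

    Returns (Compl value finally used for each operation, total cycles).
    """
    q = 0                                   # state bit: 0 is S0, 1 is S1
    cycles = 0
    decisions = []
    for est, msbit in ops:
        while True:
            compl = q ^ est                 # Compl <= Q(i) xor est
            hit = int(compl == msbit)       # assumed meaning of a hit
            cycles += 1
            q = 0 if hit else 1             # Q(i+1) <= not(hit)
            if hit:
                decisions.append(compl)
                break                       # from S1 the retry always hits
    return decisions, cycles


print(run_sign_predictor([(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Each wrong estimate adds exactly one stall cycle, so the average cost per operation is 1 plus the misprediction rate.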

Page 35: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: the Sign Predictor
• Hit rate is critical. It may be necessary to increase it.
• N-bit synchronous predictor. Use the most significant bits of the addition operands [Ashmila et al. 2005] (· is AND, + is OR)
   – 1 bit  -> est_i = x_i·y_i
   – 2 bits -> est_i = x_i·y_i + (x_i + y_i)·(x_{i-1}·y_{i-1})
   – 3 bits -> est_i = x_i·y_i + (x_i + y_i)·(x_{i-1}·y_{i-1} + (x_{i-1} + y_{i-1})·(x_{i-2}·y_{i-2}))
• Hit probability: 1 - 2^-(N+1)
• The estimation logic is out of the critical path
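The nested formulas are a carry recurrence over the N most significant bits with the incoming carry assumed to be 0. The sketch below checks the stated hit probability by enumeration, under the assumption that operand bits are uniform and that the real carry entering the N-bit window is 0 or 1 with equal probability. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def carry_out(xs: tuple[int, ...], ys: tuple[int, ...], carry_in: int) -> int:
    """Carry out of a bit window; xs[0], ys[0] are the most significant bits."""
    c = carry_in
    for x, y in zip(reversed(xs), reversed(ys)):
        c = (x & y) | ((x | y) & c)        # est_i = x_i·y_i + (x_i + y_i)·est_{i-1}
    return c


def hit_rate(n_bits: int) -> Fraction:
    """Fraction of cases in which the zero-carry-in estimate equals the real carry."""
    hits = total = 0
    for bits in product((0, 1), repeat=2 * n_bits):
        xs, ys = bits[:n_bits], bits[n_bits:]
        est = carry_out(xs, ys, 0)
        for real_cin in (0, 1):
            hits += est == carry_out(xs, ys, real_cin)
            total += 1
    return Fraction(hits, total)


for n in (1, 2, 3):
    print(n, hit_rate(n), 1 - Fraction(1, 2 ** (n + 1)))
```

The estimate fails only when every one of the N bit pairs propagates and a carry really arrives from below, which is why each extra bit halves the miss probability.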

Page 36: Lp fp us with multispeculative techniques_19_12_12

FP-MAC and Multispeculation: the Sign Predictor

[Figure: alignment of the mul and Cinv operands for d = 1 and d = 0, marking the sign bit (s), the estimation bits, a yellow cell and a red cell.]

• Estimate the carry-out of the yellow cell
• Predict the msbit, i.e. the red cell, with z_i = x_i xor y_i xor c_i, where c_i is replaced by the estimate c_est
• The bits more significant than the red cell are a sign extension

Page 37: Lp fp us with multispeculative techniques_19_12_12

Multispeculative FP-MAC

[Figure: block diagram of the multispeculative FP-MAC. Inputs A[wt-1:0], B[wt-1:0], C[wt-1:0]; output R[wt-1:0]. Blocks: Sign Logic, Exponent Difference, Alignment Shifter, Inversion, Booth Multiplier Array, 3:2 CSA Compressor, MSADD, Leading Zero Anticipator, Sign Predictor, CCTA, NC2C, Normalization Shifter, Sticky Logic, Exponent Adjust, Rounder & Post-norm. Internal signals include G, P, Mi, Complpred, msNC2C, sticky, rnd and lsb.]

Page 38: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Experiments)

Page 39: Lp fp us with multispeculative techniques_19_12_12

Experiments: synthesis results

Design  Precision  Area (um2)  Delay (ns)  Power (mW)  % P Gain  Energy (pJ)  % E Gain
Conv    Single     23676.1     4.98        15.10       -         75.22        -
Conv    Double     80851.3     5.64        52.85       -         298.09       -
Conv    Quad       301294.4    6.23        191.56      -         1193.43      -
MS      Single     22478       4.24        14.67       -2.85     62.22        -17.28
MS      Double     77074.56    4.44        51.77       -2.06     229.84       -22.90
MS      Quad       289275.5    5.34        186.72      -2.53     997.06       -16.45

Non-pipelined implementation. MS Single and Double use k = 4 bits; MS Quad uses k = 8 bits.
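The energy column appears to be the product of power and delay (mW × ns = pJ), and the gain columns the relative change from Conv to MS. The short check below recomputes both from the delay and power values in the table; small differences from the published figures are expected because the table entries are already rounded.

```python
# (conv_delay_ns, conv_power_mW, ms_delay_ns, ms_power_mW), copied from the table
results = {
    "Single": (4.98, 15.10, 4.24, 14.67),
    "Double": (5.64, 52.85, 4.44, 51.77),
    "Quad": (6.23, 191.56, 5.34, 186.72),
}


def energy_and_gains(conv_delay, conv_power, ms_delay, ms_power):
    """Return (conv energy pJ, MS energy pJ, power change %, energy change %)."""
    e_conv = conv_power * conv_delay
    e_ms = ms_power * ms_delay
    p_change = 100.0 * (ms_power - conv_power) / conv_power
    e_change = 100.0 * (e_ms - e_conv) / e_conv
    return e_conv, e_ms, p_change, e_change


for precision, row in results.items():
    e_conv, e_ms, p_change, e_change = energy_and_gains(*row)
    print(f"{precision}: {e_conv:.2f} pJ -> {e_ms:.2f} pJ, "
          f"power {p_change:.2f} %, energy {e_change:.2f} %")
```

Read this way, most of the energy reduction comes from the shorter delay rather than from the modest power reduction.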

        3:2 Compr  Adder  CCTA      C2C    Sign
Conv    0.481      1.778  0         1.694  0
K64     0.499      1.614  1.17E-06  2.113  3.33E-02
K32     0.501      1.285  5.01E-03  1.849  3.23E-02
K16     0.503      1.212  1.45E-02  1.924  3.24E-02
K8      0.496      1.013  3.75E-02  1.566  3.24E-02
K4      0.501      0.979  1.36E-01  1.589  3.24E-02

Power breakdown, DP: 3:2 compressor + adder + C2C

Page 40: Lp fp us with multispeculative techniques_19_12_12

Experiments: Sign Predictor Accuracy

[Figure: sign predictor hit rate per benchmark with 0-, 1-, 2- and 3-bit estimation, in two panels (Single and Double). One panel (hit-rate axis from 0.95 to 1) shows ferret, blackscholes, bodytrack, x264 and streamcluster; the other (axis from 0.9 to 1) shows ferret, blackscholes, bodytrack, x264, swaptions, freqmine, streamcluster, canneal, splash2x.barnes, splash2x.fmm and splash2x.ocean_cp.]

Modified libsoft-fp for x86_64. PARSEC and SPLASH2x are compiled with this library. Single and Double precision traces are processed afterwards.

Page 41: Lp fp us with multispeculative techniques_19_12_12

Outline (current section: Conclusions)

Page 42: Lp fp us with multispeculative techniques_19_12_12

Conclusions
• Multispeculative ideas contribute to reducing energy
   – First, a possibly inexact addition is performed
   – Second, the C2C is used for both complementing and correcting
• The sign is predicted in order to avoid huge comparisons
• In the future, new optimizations must be performed
   – Multiplier and Normalizer

Page 43: Lp fp us with multispeculative techniques_19_12_12

Thank you very much for your attention!

You can email me at: [email protected]