Supercomputing: The Next 10 Years. Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign

Sep 03, 2014


Marc Snir

presentation Cray award SC13
Page 1: Keynote snir sc

Supercomputing: The Next 10 Years

Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign

Page 2: Keynote snir sc

Past

Those who cannot remember the past are condemned to repeat it (Santayana)

November 13

MCS -- Marc Snir

Page 3: Keynote snir sc

The Last Great Extinction

Figure: Core count of leading Top500 system (log scale, 1 to 10,000,000). Annotation: the attack of the killer micros, the shift from bipolar vector processors to clusters of MOS microprocessors.

Page 4: Keynote snir sc

1990: The Attack of the Killer Micros (Eugene Brooks, 1990)

• Bipolar technology had hit a power wall (nitrogen cooling)
• Alternative materials were too expensive / not ready (gallium arsenide)
• An alternative “good enough” technology was ready
  – MOS microprocessors had been around 20 years and were a fast-growing market
  – MOS had a clear evolution path (“Moore’s Law”)
• MOS was no better than bipolar (in 1991)


Cray C90
  • 244 MHz
  • Vector
  • Vector registers
  • 16 shared-memory nodes

CM5
  • 32 MHz
  • Scalar
  • Cache
  • 1024 message-passing nodes

• New paradigm took a while to establish itself (CM1, CM2, KSR…)
• Change in technology led to change in vendors and business model
• Technology shift required a long and painful process of code rewrite

Page 5: Keynote snir sc

Present

The past no longer is and the future is not yet (St. Augustine)


Page 6: Keynote snir sc

20 Years of (Near) Stability

• One dominant programming model: Message-Passing (MPI)
• One major shift: from single core to multicore
  – Easy since one can treat each core as a node

Figure: log-scale chart (1 to 10,000,000) with the label multicore.

Page 7: Keynote snir sc

Increasing Instability

• Heterogeneous memory: NUMA, noncoherent shared memory, scratchpads…
• Heterogeneous processing: GPUs, accelerators, big-small cores (NVIDIA, Xeon Phi, ARM big.LITTLE)
• Hybrid Memory Cube & near-memory processing
• No standard programming model

Figure: log-scale chart (1 to 10,000,000) with the labels multicore and accelerators.

Page 8: Keynote snir sc

On Our Way to the Next Extinction?

• History repeats itself:
  – CMOS technology has hit a power wall
    • Clock speed is not rising
  – Alternative materials are (too) expensive / not ready (gallium arsenide and other III-V materials; nanowires, nanotubes)

“While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago” (ITRS 2011)

• History does not repeat itself:
  – There is a much larger industrial base
  – An alternative “good enough” technology IS NOT ready
  – There is much more code that needs to be rewritten if a new model is needed (>200 MLOCs)


Page 9: Keynote snir sc

Future

It is difficult to make predictions, especially about the future (Yogi Berra)


Page 10: Keynote snir sc

The End of Moore’s Law is Coming

• Moore’s Law: The number of transistors per chip doubles every two to three years
• Stein’s Law: If something cannot go on forever, it will stop
• The question is not whether but when Moore’s Law will stop
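To put the two doubling periods quoted for Moore's Law in perspective, here is a minimal sketch of how much a quantity would grow over a decade under each rate. The function name and the ten-year horizon are illustrative assumptions, not figures from the talk, and the model is an idealized exponential.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_period_years` (idealized exponential model)."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    horizon_years = 10  # assumed horizon, matching the talk's "next 10 years" framing
    for period in (2, 3):
        factor = growth_factor(horizon_years, period)
        print(f"doubling every {period} years: x{factor:.1f} over {horizon_years} years")
```

Under these assumptions, the gap between a two-year and a three-year doubling period is roughly a factor of three over a decade, so even a modest slowdown compounds quickly.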


Page 11: Keynote snir sc

The 7nm Wall

19 November 2013

ANL-LBNL-ORNL-PNNL

(courtesy J. Aldun)

Page 12: Keynote snir sc

The End of the Road (?)

• Quantum tunneling becomes a major obstacle as devices shrink
  – A 7-5 nm feature size has long been predicted to be the lower limit for CMOS devices
    • ITRS predicts 7.5 nm will be reached in 2024
• 7.5 nm ~ 30 atoms of silicon
  – Not much room for further miniaturization, independent of technology!
  – Room for clock increase (new materials, quantum effect gates, cryogenic devices…)
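The "7.5 nm is about 30 atoms of silicon" estimate can be checked with back-of-the-envelope arithmetic. The sketch below assumes a silicon lattice constant of about 0.543 nm and, separately, a rounded spacing of 0.25 nm per atom; both values and all names are assumptions introduced here, not numbers from the slide.

```python
import math

SI_LATTICE_CONSTANT_NM = 0.543  # approximate cubic cell edge of crystalline silicon (assumed value)


def nearest_neighbor_distance_nm(lattice_constant_nm: float = SI_LATTICE_CONSTANT_NM) -> float:
    """Nearest-neighbor distance in a diamond-cubic lattice: a * sqrt(3) / 4."""
    return lattice_constant_nm * math.sqrt(3) / 4


def atoms_across(feature_nm: float, spacing_nm: float) -> float:
    """Rough count of atoms spanning a feature, given a per-atom spacing."""
    if spacing_nm <= 0:
        raise ValueError("spacing must be positive")
    return feature_nm / spacing_nm


if __name__ == "__main__":
    feature_nm = 7.5
    # Rounded spacing of 0.25 nm per atom (assumption).
    print(atoms_across(feature_nm, 0.25))
    # Spacing derived from the lattice constant (about 0.235 nm).
    print(atoms_across(feature_nm, nearest_neighbor_distance_nm()))
```

Either spacing gives a count in the low thirties, which is consistent with the slide's order-of-magnitude claim.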

 

   


Page 13: Keynote snir sc

The Last Mile is the Most Expensive Mile

• New technologies are needed
  – New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
  – New structures (e.g., 3D transistor structures)
  – New packages (e.g., HMC, photonics)
  – New lithography
  – Control or tolerance of large variances (safety margins, resilience, aging)
• New technologies are expensive
  – NRE increases faster than profits, which forces consolidation
  – Only two companies can sustain the investments needed to go below 22 nm (Intel and Samsung) [Heck, Kaza, Pinner]
• Less competition & larger investments = slower progress


Page 14: Keynote snir sc

The Future Is Not What It Was


(courtesy J. Aldun)

Page 15: Keynote snir sc

The Path of Least Resistance – Other than Moore

• Industry goal is not increased performance; it is increased ROI. Industry will increasingly invest in alternatives as increasing performance becomes more expensive
  – Low power, low cost
  – New markets: MEMS, sensors
  – System on a chip (smartphone, tablet)
  ✗ Fewer good commodity building blocks for HPC
    – No low-power/high-flops/high-resilience CPU
  ✔ More opportunities for semi-custom and integration of multiple vendor IP on a chip
• New business model for supercomputing?
  – Semi-custom & system on a chip integrator

 

Page 16: Keynote snir sc

Exascale


Page 17: Keynote snir sc

Identified Issues

• Scale (billion threads)
• Power (10s of MWatts)
  – Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
  – Reduced memory size (communication in time)
• Resilience: Something fails every hour; the machine is never “whole”
  – Trade-off between power and resilience
• Asynchrony: Equal work ≠ equal time
  – Power management
  – Error recovery
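The resilience point (something fails every hour) follows from simple scaling. The sketch below assumes independent node failures at a constant rate, so the system's mean time between failures (MTBF) is the node MTBF divided by the node count. The node count of 100,000 and all names are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF when node failures are independent with a constant rate."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


def required_node_mtbf_years(target_system_mtbf_hours: float, node_count: int) -> float:
    """Node MTBF (in years) needed to reach a target system MTBF."""
    return target_system_mtbf_hours * node_count / HOURS_PER_YEAR


if __name__ == "__main__":
    nodes = 100_000  # hypothetical machine size, not a figure from the talk
    print(f"node MTBF needed for a 1-hour system MTBF: "
          f"{required_node_mtbf_years(1.0, nodes):.1f} years")
```

Under these assumptions, a machine of that size sees a failure about once an hour even if each node fails only about once in eleven years.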


Page 18: Keynote snir sc

My Main Concerns

• Uncertainty about underlying HW architecture
  – Slower progress of IC will necessitate faster progress of architecture
  – May not converge to a new, stable model
  – It is not about porting applications to a new programming model; it is about designing applications for portability
• Increased software complexity
  – Simulations of complex systems + uncertainty quantification + optimization…
  – Support of complex workflows (e.g., in situ analysis)
  – Software management of power and failures
  – Heterogeneity
  – Scale and tight coupling (tail of distribution matters!)
  – Hypothesis: software will continue to be dominant cause of failures
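The remark that the tail of the distribution matters under scale and tight coupling can be illustrated with a small probability sketch. It assumes a bulk-synchronous step that waits for every task, and tasks that are independently "slow" with a small probability. The probabilities, task counts, and function name are hypothetical.

```python
def prob_step_delayed(p_slow: float, n_tasks: int) -> float:
    """Probability that at least one of n_tasks independent tasks is slow,
    which delays a tightly coupled step that waits for all of them."""
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return 1.0 - (1.0 - p_slow) ** n_tasks


if __name__ == "__main__":
    p = 1e-4  # assumed chance that any single task lands in the slow tail
    for n in (100, 10_000, 100_000):
        print(f"{n:>7} tasks: step delayed with probability {prob_step_delayed(p, n):.4f}")
```

With these assumed numbers, a one-in-ten-thousand event is negligible at 100 tasks but nearly certain at 100,000, so rare slow tasks come to dominate step time at scale.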


Page 19: Keynote snir sc

Conclusion

• Moore’s Law is slowing down; the slow-down has many fundamental consequences, only a few of them explored in this talk
• HPC is the “canary in the mine”:
  – Issues appear earlier because of size and tight coupling
• Optimistic view of the next decades: no stasis.
  – A frenzy of innovation to continue pushing the current ecosystem, followed by a frenzy of innovation to use totally different compute technologies
• Pessimistic view: The end is coming


Page 20: Keynote snir sc


Page 21: Keynote snir sc

Backup


Page 22: Keynote snir sc

Do We Care?

• “It’s all about Big Data now; simulations are passé.”
• B***t
• All science is either physics or stamp collecting. (Ernest Rutherford)
  – In physical sciences, experiments and observations exist to validate/refute/motivate theory. “Data mining” not driven by a scientific hypothesis is “stamp collection”.
• Simulation is needed to go from a mathematical model to predictions on observations.
  – If the system is complex (e.g., climate) then simulation is expensive
  – Often, models are stochastic and predictions are statistical, complicating both simulation and data analysis

 


Page 23: Keynote snir sc

Observation Meets Data: Cosmology Computation Meets Data: The Argonne View

Figure (diagram labels):
• Mapping the Sky with Survey Instruments
• LSST
• Observations: Statistical error bars will ‘disappear’ soon!
• LSST Weak Lensing
• w = -1, w = -0.9
• Supercomputer Simulation Campaign
• Emulator based on Gaussian Process Interpolation in High-Dimensional Spaces
• Markov chain Monte Carlo
• ‘Precision Oracle’
• ‘Cosmic Calibration’
• HACC+CCF (Domain science+CS+Math+Stats+Machine learning)
• HACC = Hardware/Hybrid Accelerated Cosmology Code(s)
• CCF = Cosmic Calibration Framework

Wednesday, September 19, 12

(courtesy Salman Habib) Record-breaking application: 3.6 trillion particles, 14 Pflop/s