Lecture 6: Parallel computing, cloud computing and working on Amazon Web Services
Greg Caporaso [email protected]
Page 1: Lecture6.pptx

Lecture 6: Parallel computing, cloud computing and working on Amazon Web Services

Greg Caporaso [email protected]

Page 2: Lecture6.pptx

Some last thoughts on regular expressions

Page 3: Lecture6.pptx

Robust searches

• Sometimes your queries will fail
  – Won't produce output (good)
  – Will produce incorrect output (bad)

• Fail loudly! Produce a (useful) error message on failure.
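One way to "fail loudly" can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `find_first_match` and the wording of the error message are assumptions made for the example.

```python
import re


def find_first_match(pattern, text):
    """Return the first match of pattern in text, or raise if there is none."""
    match = re.search(pattern, text)
    if match is None:
        # Fail loudly: report what was searched for and where,
        # instead of silently returning nothing.
        raise ValueError(
            f"Pattern {pattern!r} not found in input beginning {text[:30]!r}"
        )
    return match.group(0)
```

Raising an exception stops the analysis at the point of failure, so a query that found nothing cannot quietly feed empty results into later steps.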

Page 4: Lecture6.pptx

Designing robust searches

• Make assumptions explicit
  – If you're assuming that your records start with '>', search for '^>' to avoid matching '>' characters that show up in other places

• Match full lines by including ^ and $ in your search query

• Check the number of matches that were made: is it reasonable?
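The three points above can be sketched together in Python. This is an illustrative sketch only: the names `count_records` and `FULL_HEADER`, and the assumed header layout (an identifier with no spaces, optionally followed by a description), are not from the lecture.

```python
import re

# Full-line pattern: '^' and '$' force the whole line to look like a header.
# The header layout assumed here is '>identifier' plus an optional description.
FULL_HEADER = re.compile(r"^>\S+( .*)?$")


def count_records(fasta_text, expected=None):
    """Count FASTA records, optionally checking the count against an expectation."""
    # With re.MULTILINE, '^>' matches '>' only at the start of a line,
    # so a '>' inside a description is not counted as a new record.
    n_records = len(re.findall(r"^>", fasta_text, flags=re.MULTILINE))
    if expected is not None and n_records != expected:
        raise ValueError(f"Expected {expected} records but found {n_records}")
    return n_records
```

For the text `">seq1 length>100\nACGT\n>seq2\nGGCC\n"`, a plain search for `>` finds three characters, while the anchored search counts the two records that are actually present.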

Page 5: Lecture6.pptx

Testing of software

• Start thinking about what positive and negative controls for these terms might look like. Software testing is something we'll be discussing regularly throughout the semester.
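As one possible reading of positive and negative controls for a search term, the sketch below checks a pattern against strings that should match and strings that should not. The pattern, the control strings, and the helper name `run_controls` are all assumptions chosen for illustration.

```python
import re

# Search term under test: a FASTA header is '>' followed directly by an identifier.
HEADER = re.compile(r"^>\S+")

# Positive controls: inputs the search is expected to match.
positive_controls = [">PC.634_1 FLP3FBN01ELBSX", ">seq1"]
# Negative controls: inputs the search is expected to reject.
negative_controls = ["ACGT", "length>100", "", "> "]


def run_controls(pattern, positives, negatives):
    """Return the controls that behaved unexpectedly (an empty list means all passed)."""
    failures = [s for s in positives if not pattern.search(s)]
    failures += [s for s in negatives if pattern.search(s)]
    return failures
```

An empty result from `run_controls(HEADER, positive_controls, negative_controls)` means every control behaved as expected.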

Page 6: Lecture6.pptx

Why is parallel computing important in bioinformatics?

Page 7: Lecture6.pptx

Cluster computing

• Many computers connected to one another to serve as a larger compute resource.

• Compute-intensive jobs can be split over many systems and run in parallel.

• Similar to desktop compute hardware, but different casing, and no (or only a few) displays/keyboards directly connected.

• Owned and maintained "in-house".
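The idea of splitting a job into pieces that run in parallel can be sketched on a single machine with Python's standard library. This is a toy illustration, not cluster code: counting G and C bases stands in for a real compute-intensive job, and all function names are assumptions. Threads are used by default so the example runs anywhere; for CPU-bound work one would typically pass `ProcessPoolExecutor` as `executor_cls`, and on a real cluster a job scheduler would distribute the chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(seqs):
    """Count G and C bases in one chunk of sequences (the per-worker job)."""
    return sum(s.count("G") + s.count("C") for s in seqs)


def split_into_chunks(items, n_chunks):
    """Deal items into n_chunks lists of nearly equal size."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def parallel_gc_count(seqs, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Split seqs into chunks, count each chunk in a worker, and combine the results."""
    chunks = split_into_chunks(seqs, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_counts = pool.map(gc_count, chunks)  # one job per chunk
        return sum(partial_counts)
```

The pattern is the same at any scale: split the input, run independent jobs, then merge the partial results.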

Page 8: Lecture6.pptx

Why is parallel computing important in bioinformatics?

Platform | Read length (bases) | Number of reads | Maximum number of samples per run | Sequences per $1 (sequencing costs only)
Sanger | ~1000 | 96 or 384 | n/a | 0.44
454 (Titanium) | ~400 | ~1,000,000 | 1000 | 100
Illumina Genome Analyzer II | 150 (single end) | ~100,000,000 | 12,000 (barcode-limited) | 5000
Illumina HiSeq2000 | 100 (single end) | ~1,600,000,000 | 24,000 (barcode-limited) | 200,000
Illumina MiSeq | 150 (single end); 250 soon | ~10,000,000 | 2500 (barcode-limited) | 12,500

Page 9: Lecture6.pptx

The "benchtop" sequencer

Page 10: Lecture6.pptx

OTU picking: example of a compute-intensive process

What taxa are represented in each sample?

>PC.634_1 FLP3FBN01ELBSX
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATCCCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGT
>PC.354_3 FLP3FBN01EEWKD
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGGCTATGCATCATTGCCTTGGTAAGCCGTTACCTTACCAACTAGCTAATGCACCGCAGGTCCATCCAAGAGTGATAGCAGAACCATCTTTCAAACTCTAGACATGCGTCTAGTGTTGTTATCCGGTATTAGCATCTGTTTCCAGGTGTTATCCCAGTCTCTTGGG
...

Reference tree of non-redundant full-length sequences

BLAST against reference tree
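Before any taxonomy is assigned, each read has to be tied back to its sample. The sketch below counts reads per sample from FASTA header lines. It assumes that the sequence identifier has the form `SampleID_ReadNumber` (as in `PC.634_1` on this slide); the function name `reads_per_sample` is hypothetical, and this is only a bookkeeping step, not OTU picking itself.

```python
from collections import Counter


def reads_per_sample(fasta_lines):
    """Count reads per sample, assuming headers look like '>SampleID_N description'."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields or "_" not in fields[0]:
            # Fail loudly on headers that break the assumed format.
            raise ValueError(f"Unexpected header line: {line!r}")
        sample_id = fields[0].rsplit("_", 1)[0]  # 'PC.634_1' -> 'PC.634'
        counts[sample_id] += 1
    return counts
```

Checking that the per-sample counts look reasonable is a quick sanity test before committing many processor-hours to the downstream search.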

Page 11: Lecture6.pptx

OTU picking: example of a compute-intensive process

What taxa are represented in each sample?

Clusters of "Operational Taxonomic Units" (OTUs); per-sample hits on reference tree; taxonomic assignments

BLAST against reference tree

Page 12: Lecture6.pptx

OTU Picking

• For 1 billion sequence reads, the initial step ran for ~116 hours on 110 processors, requiring 4 GB of RAM per worker job and 64 GB of RAM for the master job.

Page 13: Lecture6.pptx

OTU Picking

• So, on a single-processor desktop with 64 GB of RAM... 12,760 hours, or 532 days!

Page 14: Lecture6.pptx

OTU Picking

• One HiSeq2000 generates this much data in a week!
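The single-processor estimate on these slides can be checked with a few lines of arithmetic. The sketch assumes perfect linear scaling, meaning the same total work with no parallel overhead, which is what the estimate implicitly assumes; the variable names are chosen for illustration.

```python
wall_clock_hours = 116  # run time of the initial step on the cluster
processors = 110        # processors used in parallel

# Under perfect linear scaling, one processor needs the sum of all processor-hours.
serial_hours = wall_clock_hours * processors
serial_days = serial_hours / 24

# The sequencer produces the data in about a week.
sequencer_days = 7
analysis_to_sequencing_ratio = serial_days / sequencer_days

print(f"{serial_hours} hours is about {serial_days:.0f} days")
print(f"analysis takes about {analysis_to_sequencing_ratio:.0f}x longer than sequencing")
```

On one processor the analysis would fall further behind with every sequencing run, which is the practical argument for parallel computing here.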

Page 15: Lecture6.pptx

Cloud computing

• Implemented on a cluster (or grid), but compute power is rented as a service to support arbitrary applications.

Page 16: Lecture6.pptx

Maintaining hardware is expensive

• Temperature (redundant cooling systems)

• Redundant network connections

• Hardware maintenance (e.g., replacing hard drives)

• Fire suppression

• Back-up power

• System administrator ($$)

Page 17: Lecture6.pptx

Pay-as-you-go compute power

• Public clouds (e.g., Amazon) rent compute resources

• Log in, boot virtual machine image, run analyses, and terminate instance.

• Cheaper for many tasks than buying, maintaining, and supporting a compute cluster.
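The pay-as-you-go model reduces to simple arithmetic: you pay for instance-hours used and nothing while idle. The sketch below applies it to the OTU picking run from earlier in the lecture. The hourly rate is a made-up placeholder, not a real Amazon price, and the example assumes one processor per rented instance; both are assumptions for illustration only.

```python
def rental_cost(hours, n_instances, hourly_rate):
    """Total pay-as-you-go cost: instance-hours used times the hourly rate."""
    return hours * n_instances * hourly_rate


# HYPOTHETICAL rate in dollars per instance-hour; look up current prices before relying on it.
HYPOTHETICAL_RATE = 0.10

# 116 hours on 110 processors, assuming one processor per instance.
cost = rental_cost(hours=116, n_instances=110, hourly_rate=HYPOTHETICAL_RATE)
print(f"Estimated rental cost at the hypothetical rate: ${cost:,.2f}")
```

Comparing a figure like this against the purchase and upkeep costs of a cluster is how one would judge whether renting is cheaper for a given workload.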

Page 18: Lecture6.pptx

Types of cloud offerings

• Applications/SaaS (e.g., Google Docs, Gmail, Dropbox, iCloud)

• Computing platform/PaaS (e.g., Google App Engine)

• Raw compute resources/IaaS (e.g., Amazon Elastic Compute Cloud (EC2))

Page 19: Lecture6.pptx

Cloud computing options

• Amazon Elastic Compute Cloud (EC2)

• Magellan – Argonne's DOE Cloud Computing

• Data Intensive Academic Grid (DIAG) – Institute for Genome Sciences (IGS), University of Maryland School of Medicine (UMSOM)

Page 20: Lecture6.pptx

Interacting with the Amazon Cloud

• Boot virtual machine image via web interface (or a third-party tool like StarCluster).

• Log in and work via terminal (or via web interface with IPython Notebook)

• Move data back and forth via sftp/scp or a graphical sftp client (e.g., Cyberduck [free/cross-platform])
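The log-in and data-transfer steps above can be sketched as command lines built in Python. The host name, key file, paths, and the default user name `ubuntu` are hypothetical placeholders (the real user name depends on the machine image); only the standard `ssh -i` and `scp -i` options are used. The functions build the commands without running them.

```python
import shlex


def ssh_command(host, key_file, user="ubuntu"):
    """Command to log in to a running instance (user name is an assumption)."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


def scp_upload_command(local_path, remote_path, host, key_file, user="ubuntu"):
    """Command to copy a local file to the instance."""
    return ["scp", "-i", key_file, local_path, f"{user}@{host}:{remote_path}"]


def scp_download_command(remote_path, local_path, host, key_file, user="ubuntu"):
    """Command to copy a result file from the instance back to the local machine."""
    return ["scp", "-i", key_file, f"{user}@{host}:{remote_path}", local_path]


# Hypothetical host and key file, for illustration only.
upload = scp_upload_command("seqs.fna", "data/seqs.fna", "ec2-host.example.com", "my-key.pem")
print(shlex.join(upload))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once a real host and key file are available, so a failed transfer raises an error instead of passing silently.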

Page 21: Lecture6.pptx

Virtual machines

• A "guest" operating system running within a "host" operating system

• A software implementation of a computer that operates like a physical computer.

• A developer can create a virtual machine image which contains their tools pre-installed. Users can then instantiate that image to work with those tools.

Browse this page: http://en.wikipedia.org/wiki/Virtual_machine

Page 22: Lecture6.pptx

Benefits that virtual machines offer bioinformatics

• Reproducibility: can publish protocols with a virtual machine instance id.

• Updates are the burden of the developer, not the user.

• Coupled with cloud computing, it's the perfect model for users with sporadic compute needs.

Page 23: Lecture6.pptx

EC2 costs: www.ec2instances.info

Page 24: Lecture6.pptx

QIIME virtual machine

• The QIIME package distributes an EC2 virtual machine with QIIME and its (many) dependencies pre-installed.

• Dependencies include commonly used tools like BLAST, muscle, FastTree, uclust, IPython, and a lot more. A partial list is available here: http://qiime.org/install/install.html

• Latest machine identifier can always be found at: http://qiime.org/home_static/dataFiles.html

Page 25: Lecture6.pptx

I think there is a world market for maybe five computers.
- Thomas Watson, IBM Founder, 1943

Page 26: Lecture6.pptx

[Figure: Units sold by year. All figures are in units of 1000. Source: http://jeremyreimer.com/postman/node/329]

Page 27: Lecture6.pptx

The democratization of DNA sequencing

Affordable sequencing + Cloud computing + Open-source software

Page 28: Lecture6.pptx

This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.