
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Aug 19, 2014

Introduction to Hadoop presentation at Carnegie Mellon University, Silicon Valley Campus.
Transcript
Page 1: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

© 2010 – 2015 Cloudera, Inc. All Rights Reserved

Introduction to Apache Hadoop and its Ecosystem
Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
github.com/markgrover/hadoop-intro-fast

Page 2: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

About Me

• Committer on Apache Bigtop; committer and PPMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume
• Software developer at Cloudera
• @mark_grover
• www.linkedin.com/in/grovermark

Page 3: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Co-author of an O'Reilly book

• @hadooparchbook
• hadooparchitecturebook.com
• To be released early 2015

Page 4: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

About the Presentation…

What's ahead:
• Fundamental Concepts
• HDFS: The Hadoop Distributed File System
• Data Processing with MapReduce
• Demo
• Conclusion + Q&A

Page 5: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Fundamental Concepts
Why the World Needs Hadoop

Page 6: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

What's the craze about Hadoop?

• Volume
  • More and more data being generated
  • Machine-generated data increasing
• Velocity
  • Data coming in at higher speed
• Variety
  • Audio, video, images, log files, web pages, social network connections, etc.

Page 7: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

We Need a System that Scales

• Too much data for traditional tools
• Two key problems:
  • How to reliably store this data at a reasonable cost
  • How to process all the data we've stored

Page 8: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

What is Apache Hadoop?

• Scalable data storage and processing
  • Distributed and fault-tolerant
  • Runs on standard hardware
• Two main components
  • Storage: Hadoop Distributed File System (HDFS)
  • Processing: MapReduce
• Hadoop clusters are composed of computers called nodes
  • Clusters range from a single node up to several thousand nodes

Page 9: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

How Did Apache Hadoop Originate?

• Heavily influenced by Google's architecture
  • Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook, and others

Timeline:
• 2002 – Nutch spun off from Lucene
• 2003 – Google publishes GFS paper
• 2004 – Google publishes MapReduce paper
• 2005 – Nutch rewritten for MapReduce
• 2006 – Hadoop becomes a Lucene subproject

Page 10: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Comparing Hadoop to Other Systems

• Monolithic systems don't scale
• Modern high-performance computing (HPC) systems are distributed
  • They spread computations across many machines in parallel
  • Widely used for scientific applications
• Let's examine how a typical HPC system works

Page 11: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Architecture of a Typical HPC System

[Diagram: compute nodes connected to a separate storage system over a fast network]

Page 12: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Architecture of a Typical HPC System

[Diagram: Step 1 – copy input data from the storage system to the compute nodes over the fast network]

Page 13: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Architecture of a Typical HPC System

[Diagram: Step 2 – process the data on the compute nodes]

Page 14: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Architecture of a Typical HPC System

[Diagram: Step 3 – copy output data back to the storage system over the fast network]

Page 15: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

You Don't Just Need Speed…

• The problem is that we have way more data than code

$ du -ks code/
1,087
$ du -ks data/
854,632,947,314

Page 16: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

You Need Speed at Scale

[Diagram: the network link between the compute nodes and the storage system becomes the bottleneck]

Page 17: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Hadoop Design Fundamental: Data Locality

• This is a hallmark of Hadoop's design
  • Don't bring the data to the computation
  • Bring the computation to the data
• Hadoop uses the same machines for storage and processing
  • Significantly reduces the need to transfer data across the network

Page 18: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Other Hadoop Design Fundamentals

• Machine failure is unavoidable – embrace it
  • Build reliability into the system
• "More" is usually better than "faster"
  • Throughput matters more than latency

Page 19: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

The Hadoop Distributed Filesystem

HDFS

Page 20: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

HDFS: Hadoop Distributed File System

• Inspired by the Google File System
• Reliable, low-cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
  • Hierarchical
  • UNIX-style paths (e.g., /sales/alice.txt)
  • UNIX-style file ownership and permissions

Page 21: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

HDFS: Hadoop Distributed File System

• There are also some major deviations from UNIX filesystems
• Highly optimized for processing data with MapReduce
  • Designed for sequential access to large files
  • Cannot modify file content once written
• It's actually a user-space Java process
  • Accessed using special commands or APIs
• No concept of a current working directory

Page 22: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Copying Local Data To and From HDFS

• Remember that HDFS is distinct from your local filesystem
  • hadoop fs -put copies local files to HDFS
  • hadoop fs -get fetches a local copy of a file from HDFS

$ hadoop fs -put sales.txt /reports     (client machine → Hadoop cluster)
$ hadoop fs -get /reports/sales.txt     (Hadoop cluster → client machine)

Page 23: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

HDFS Demo

• I will now demonstrate the following:
  1. How to list the contents of a directory
  2. How to create a directory in HDFS
  3. How to copy a local file to HDFS
  4. How to display the contents of a file in HDFS
  5. How to remove a file from HDFS
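For readers following along without a cluster, the five operations above have direct local-filesystem analogies. The Python 3 sketch below mimics them with pathlib on a temporary directory; the real demo would use the corresponding hadoop fs subcommands (-ls, -mkdir, -put, -cat, -rm), and all paths here are illustrative.

```python
import pathlib
import shutil
import tempfile

# Work inside a throwaway directory standing in for the HDFS root.
root = pathlib.Path(tempfile.mkdtemp())

# 1. List the contents of a directory         (hadoop fs -ls /)
print(sorted(p.name for p in root.iterdir()))

# 2. Create a directory                       (hadoop fs -mkdir /reports)
(root / "reports").mkdir()

# 3. Copy a local file in                     (hadoop fs -put sales.txt /reports)
local = root / "sales.txt"
local.write_text("alice\t100\n")
shutil.copy(local, root / "reports" / "sales.txt")

# 4. Display the contents of a file           (hadoop fs -cat /reports/sales.txt)
print((root / "reports" / "sales.txt").read_text())

# 5. Remove a file                            (hadoop fs -rm /reports/sales.txt)
(root / "reports" / "sales.txt").unlink()
```

Unlike these local calls, the HDFS commands go through the NameNode and cannot, for example, modify a file in place once written.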

Page 24: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

A Scalable Data Processing Framework

Data Processing with MapReduce

Page 25: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

What is MapReduce?

• MapReduce is a programming model
  • It's a way of processing data
  • You can implement MapReduce in any language

Page 26: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Understanding Map and Reduce

• You supply two functions to process data: Map and Reduce
  • Map: typically used to transform, parse, or filter data
  • Reduce: typically used to summarize results
• The Map function always runs first
  • The Reduce function runs afterwards, but is optional
• Each piece is simple, but can be powerful when combined
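To make the division of labor concrete, here is a minimal, framework-free Python sketch of the pattern (a word-count toy, not Hadoop code; the record format and function names are illustrative):

```python
# map_fn parses each record into (key, value) pairs;
# reduce_fn summarizes all values collected for one key.

def map_fn(record):
    # Transform: emit (word, 1) for every word in the record.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Summarize: total the counts for one key.
    return (key, sum(values))

def run(records):
    grouped = {}
    for record in records:
        for key, value in map_fn(record):
            grouped.setdefault(key, []).append(value)
    # Keys reach the reducer in sorted order, as in Hadoop.
    return [reduce_fn(k, grouped[k]) for k in sorted(grouped)]

print(run(["hello world", "hello hadoop"]))
# → [('hadoop', 1), ('hello', 2), ('world', 1)]
```

Hadoop runs the same two functions, but spreads the map calls and reduce calls across many machines.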

Page 27: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

MapReduce Benefits

• Scalability
  • Hadoop divides the processing job into individual tasks
  • Tasks execute in parallel (independently) across the cluster
• Simplicity
  • Processes one record at a time
• Ease of use
  • Hadoop provides job scheduling and other infrastructure
  • Far simpler for developers than typical distributed computing

Page 28: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

MapReduce in Hadoop

• MapReduce processing in Hadoop is batch-oriented
• A MapReduce job is broken down into smaller tasks
  • Tasks run concurrently
  • Each processes a small amount of the overall input
• MapReduce code for Hadoop is usually written in Java
  • This uses Hadoop's API directly
• You can do basic MapReduce in other languages
  • Using the Hadoop Streaming wrapper program
  • Some advanced features require Java code

Page 29: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

MapReduce Example in Python

• The following example uses Python
  • Via Hadoop Streaming
• It processes log files and summarizes events by type
  • I'll explain both the data flow and the code

Page 30: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Job Input

• Here's the job input
• Each map task gets a chunk of this data to process
  • Typically corresponds to a single block in HDFS

2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Page 31: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

#!/usr/bin/env python

import sys

levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

for line in sys.stdin:
    fields = line.split()
    level = fields[3].upper()
    if level in levels:
        print "%s\t1" % level

Python Code for Map Function

• Define the list of known log levels
• Read records from standard input; use whitespace to split into fields
• Extract the "level" field and convert it to uppercase for consistency
• If it matches a known level, print it, a tab separator, and the literal value 1 (since the level can only occur once per line)
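The same per-line logic can be exercised locally before submitting anything to a cluster. This Python 3 sketch wraps it in a function and feeds it a list instead of sys.stdin, which the real Streaming script reads from:

```python
# Local check of the mapper's per-line logic.
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

def map_line(line):
    fields = line.split()        # split on whitespace
    level = fields[3].upper()    # the fourth field is the log level
    if level in levels:
        return "%s\t1" % level   # key, tab, literal count of 1
    return None                  # unknown levels produce no output

sample = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]

for line in sample:
    print(map_line(line))
# → INFO<tab>1, WARN<tab>1, ERROR<tab>1 (one pair per input line)
```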

Page 32: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Output of Map Function

• The map function produces key/value pairs as output

INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Page 33: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

The "Shuffle and Sort"

• Hadoop automatically merges, sorts, and groups map output
• The result is passed as input to the reduce function
• More on this later…

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce input (after shuffle and sort):
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
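Hadoop performs this step internally; a rough local equivalent in Python 3 (using the pairs above as a list of tuples) is simply sorting the map output so equal keys become adjacent, then grouping consecutive pairs:

```python
from itertools import groupby

# Local stand-in for Hadoop's shuffle and sort.
map_output = [("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
              ("WARN", 1), ("INFO", 1), ("ERROR", 1)]

reduce_input = sorted(map_output)   # keys now arrive in sorted order

# groupby only merges adjacent items, which is why the sort comes first.
groups = {key: [v for _, v in pairs]
          for key, pairs in groupby(reduce_input, key=lambda kv: kv[0])}
print(groups)
# → {'ERROR': [1], 'INFO': [1, 1, 1, 1], 'WARN': [1, 1]}
```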

Page 34: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Input to Reduce Function

• The reduce function receives a key and all values for that key
• Keys are always passed to reducers in sorted order
• Although not obvious here, values are unordered

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Page 35: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Python Code for Reduce Function

#!/usr/bin/env python

import sys

previous_key = None
sum = 0

for line in sys.stdin:
    key, value = line.split()
    if key == previous_key:
        sum = sum + int(value)
# continued on next slide

• Initialize loop variables
• Extract the key and value passed via standard input
• If the key is unchanged, increment the count

Page 36: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Python Code for Reduce Function

# continued from previous slide
    else:
        if previous_key:
            print '%s\t%i' % (previous_key, sum)
        previous_key = key
        sum = 1

print '%s\t%i' % (previous_key, sum)

• If the key changed, print data for the old level
• Start tracking data for the new record
• After the loop, print data for the final key

Page 37: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Output of Reduce Function

• Its output is a sum for each level

ERROR 1
INFO 4
WARN 2

Page 38: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Recap of Data Flow

Map input:
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce input (after shuffle and sort):
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output:
ERROR 1
INFO 4
WARN 2
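The entire flow above can be replayed in a few lines of local Python 3 (a sketch of the data flow, not Hadoop itself), reproducing the final counts:

```python
from itertools import groupby

log_lines = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]

# Map: emit one (level, 1) pair per line.
pairs = [(line.split()[3], 1) for line in log_lines]

# Shuffle and sort: equal keys become adjacent, keys in sorted order.
pairs.sort()

# Reduce: sum the values for each key.
totals = [(level, sum(v for _, v in grp))
          for level, grp in groupby(pairs, key=lambda kv: kv[0])]

print(totals)
# → [('ERROR', 1), ('INFO', 4), ('WARN', 2)]
```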

Page 39: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

How to Run a Hadoop Streaming Job

• I'll demonstrate this now…

Page 40: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Open Source Tools that Complement Hadoop

The Hadoop Ecosystem

Page 41: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

The Hadoop Ecosystem

• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data
• These are not considered "core Hadoop"
  • Rather, they're part of the Hadoop ecosystem
  • Many are also open source Apache projects

Page 42: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Visual Overview of a Complete Workflow

[Diagram: a Hadoop cluster with Impala at the center, surrounded by the following activities]
• Import transaction data from an RDBMS
• Sessionize web log data with Pig
• Perform sentiment analysis on social media with Hive
• Analyst uses Impala for business intelligence
• Generate nightly reports using Pig, Hive, or Impala
• Build product recommendations for the Web site

Page 43: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Key Points

• We're generating massive volumes of data
  • This data can be extremely valuable
  • Companies can now analyze what they previously discarded
• Hadoop supports large-scale data storage and processing
  • Heavily influenced by Google's architecture
  • Already in production by thousands of organizations
  • HDFS is Hadoop's storage layer
  • MapReduce is Hadoop's processing framework
• Many ecosystem projects complement Hadoop
  • Some help you to integrate Hadoop with existing systems
  • Others help you analyze the data you've stored

Page 44: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Highly Recommended Books

• Hadoop: The Definitive Guide – Author: Tom White, ISBN: 1-449-31152-0
• Hadoop Operations – Author: Eric Sammer, ISBN: 1-449-32705-2

Page 45: Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Questions?

• Thank you for attending!
• I'll be happy to answer any additional questions now…
• Demo and slides at github.com/markgrover/hadoop-intro-fast
• Twitter: mark_grover
• Survey page: tiny.cloudera.com/mark