HDInsight Essentials ISBN : 1849695369 / ISBN 13 : 9781849695367 Rajesh Nadipalli 05/01/2014
Transcript
Page 1: Hd insight essentials quick view

HDInsight Essentials ISBN: 1849695369 / ISBN 13: 9781849695367

Rajesh Nadipalli 05/01/2014

Page 2: Hd insight essentials quick view

Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as Quick Reference
• Provide an Overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiator
• Provide Practical & Real world examples

Page 3: Hd insight essentials quick view

Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations

Page 4: Hd insight essentials quick view

CHAPTER 1 HIGHLIGHTS: HDINSIGHT IN A HEARTBEAT

Page 5: Hd insight essentials quick view

Big Data Problem Characteristics

Page 6: Hd insight essentials quick view

Hadoop Overview

Self Healing Distributed Storage

Fault Tolerant Distributed Computing

+ Abstraction for Parallel Processing

CORE HADOOP COMPONENTS
• HDFS: Distributed Storage – replicated, self-healing and scalable
• MapReduce: Parallel Processing, process local data for efficiency
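To make the "replicated, self-healing" property of HDFS concrete, here is a toy in-memory sketch. It is not HDFS code: the `ToyCluster` class, the node and block names, and the placement rule (least-loaded node first) are assumptions made for illustration. A replication factor of 3 is used because that is the usual HDFS default. Each block is stored on several nodes, and when a node fails the lost replicas are copied to surviving nodes.

```python
class ToyCluster:
    """Toy model of replicated block storage (illustrative, not the HDFS API)."""

    def __init__(self, nodes, replication=3):
        self.replication = replication
        # node name -> set of block ids stored on that node
        self.nodes = {name: set() for name in nodes}

    def holders(self, block_id):
        """Return the sorted names of live nodes that hold a replica of the block."""
        return sorted(n for n, blocks in self.nodes.items() if block_id in blocks)

    def _place(self, block_id, copies_needed):
        if copies_needed <= 0:
            return
        # Assumed placement rule: least-loaded nodes that lack the block go first.
        candidates = sorted(
            (n for n, blocks in self.nodes.items() if block_id not in blocks),
            key=lambda n: (len(self.nodes[n]), n),
        )
        for name in candidates[:copies_needed]:
            self.nodes[name].add(block_id)

    def write(self, block_id):
        """Store a block on `replication` distinct nodes."""
        self._place(block_id, self.replication - len(self.holders(block_id)))

    def fail(self, name):
        """Remove a node, then re-replicate every block that lost a copy."""
        lost = self.nodes.pop(name)
        for block_id in sorted(lost):
            self._place(block_id, self.replication - len(self.holders(block_id)))


cluster = ToyCluster(["node1", "node2", "node3", "node4"])
for block in ["blk_1", "blk_2"]:
    cluster.write(block)

cluster.fail("node1")  # simulate losing one machine
for block in ["blk_1", "blk_2"]:
    print(block, cluster.holders(block))
```

After the simulated failure, both blocks are back at three replicas on the surviving nodes, which is the behaviour the slide summarizes as self-healing.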

 

Page 7: Hd insight essentials quick view

Hadoop Nodes Layout
• Master Node: NameNode and Secondary NameNode (Distributed File System Layer), JobTracker (MapReduce Layer)
• Slave Nodes: DataNode (Distributed File System Layer), TaskTracker (MapReduce Layer)

Page 8: Hd insight essentials quick view

Hadoop Eco System
• Data Sources: RDBMS Databases; Audio, Images; Log Files; Sensors, RFID; Social Media, Feeds
• Flume, Sqoop
• Hadoop Data Store: HDFS, HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive, Pig, Mahout (Machine Learning)
• Excel
• Business Data Feeds
• Zookeeper (Distributed Process Management)
• HCatalog (Metadata on Pig, Hive, MapReduce)
• Oozie (Workflow, Scheduler)
• Infrastructure, Operations (Monitoring, Configuration)

Page 9: Hd insight essentials quick view

End to End Solution on Hadoop
• Collect & Import to HDFS
• Process (MapReduce)
• Analyze (BI Tools)
• Report & Publish

Page 10: Hd insight essentials quick view

Popular Hadoop Distributions
• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)

Page 11: Hd insight essentials quick view

HDInsight Differentiator
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using cloud offering: the Azure HDInsight service lets customers scale quickly and provides a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console

Page 12: Hd insight essentials quick view

WordCount in HDInsight
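WordCount is the customary first MapReduce example. As a rough illustration of what such a job computes, the following pure-Python sketch mimics the three phases (map, shuffle, reduce) on a small in-memory input. The function names and the sample lines are assumptions made for illustration; an actual HDInsight WordCount runs as a Hadoop job on the cluster, not as this script.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical sample input, standing in for files stored in HDFS.
sample_lines = ["big data on Hadoop", "Hadoop on Azure", "big clusters"]
result = word_count(sample_lines)
print(sorted(result.items()))
```

On a real cluster the map calls run on the nodes that hold the data, and the shuffle moves the grouped pairs across the network to the reducers.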

Page 13: Hd insight essentials quick view

CHAPTER 2 HIGHLIGHTS: HDINSIGHT INSTALL ON PREMISE

Page 14: Hd insight essentials quick view

HDInsight Distribution

Apache Hadoop
• Open Source Software
• Community Development

Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop

Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Powerview, SQLServer, Active Directory

Page 15: Hd insight essentials quick view

Physical Install Options

NN (NameNode), SNN (Secondary NameNode), JT (JobTracker)

DN (DataNode) / TT (TaskTracker)

• Single node for development/test
• Multi node for production

Page 16: Hd insight essentials quick view

Multi Node Install Steps
• Pre-requisites
• Networking Setup
• Remote Scripting
• Firewall Setup
• Software Install (each node)
• Hadoop Configuration
• Verification

Page 17: Hd insight essentials quick view

CHAPTER 3 HIGHLIGHTS: HDINSIGHT AZURE SERVICE

Page 18: Hd insight essentials quick view

Azure Cloud Service

Create Storage

Create HDInsight cluster

Page 19: Hd insight essentials quick view

CHAPTER 4 HIGHLIGHTS: ADMINISTER YOUR CLUSTER

Page 20: Hd insight essentials quick view

HDInsight Cluster Management

Page 21: Hd insight essentials quick view

HDInsight Dashboard

Page 22: Hd insight essentials quick view

HDInsight Dashboard

Page 23: Hd insight essentials quick view

NameNode Status

Page 24: Hd insight essentials quick view

JobTracker Status

Page 25: Hd insight essentials quick view

CHAPTER 5 HIGHLIGHTS: INGEST DATA INTO YOUR CLUSTER

Page 26: Hd insight essentials quick view

Loading Data into your Cluster
You have the following options…
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your Cluster
• Loading data from RDBMS via Sqoop
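Two of these options, Hadoop commands and Sqoop, are driven from the command line. The sketch below only builds and prints the command lines, so it can run without a cluster. The helper functions, file paths, server name, database, table, and user are hypothetical values chosen for illustration; the `hadoop fs -put` command and the `sqoop import` options shown (`--connect`, `--username`, `--table`, `--target-dir`) are standard ones.

```python
import shlex


def hdfs_put_command(local_path, hdfs_path):
    """Build the `hadoop fs -put` command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def sqoop_import_command(jdbc_url, table, target_dir, username):
    """Build a `sqoop import` command that copies one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
    ]


# Hypothetical paths, server, and table names, for illustration only.
put_cmd = hdfs_put_command("sales.csv", "/data/raw/sales.csv")
sqoop_cmd = sqoop_import_command(
    "jdbc:sqlserver://dbserver:1433;databaseName=retail",
    table="orders",
    target_dir="/data/raw/orders",
    username="loader",
)

# Print shell-quoted commands; on a cluster node the lists could instead be
# passed to subprocess.run() to execute them.
print(shlex.join(put_cmd))
print(shlex.join(sqoop_cmd))
```

Keeping each command as a list of arguments avoids shell-quoting mistakes if the lists are later handed to `subprocess.run()`.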

Page 27: Hd insight essentials quick view

Loading via Azure Storage Explorer

Page 28: Hd insight essentials quick view

CHAPTER 6 HIGHLIGHTS: TRANSFORM YOUR DATA

Page 29: Hd insight essentials quick view

Transforming Data
You have the following options…
• MapReduce
• Hive
• Pig
• Others

Page 30: Hd insight essentials quick view

Processing Data in Cluster

Map for Jan2012

Map for Feb2012

Map for Apr2013

…  

One Reducer
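The layout on this slide, one map task per monthly partition feeding a single reducer, can be sketched in plain Python. The records, the keys, and the use of a thread pool to stand in for parallel map tasks are all assumptions made for illustration. Each map task here also totals its own month locally, as a combiner would, before the reducer merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: one list of (key, value) records per monthly partition.
partitions = {
    "Jan2012": [("east", 10), ("west", 5)],
    "Feb2012": [("east", 7)],
    "Apr2013": [("west", 3), ("east", 1)],
}


def map_task(records):
    """Map: summarize one month's records into partial totals per key."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return partial


def reduce_task(partials):
    """Reduce: a single reducer merges the partial totals from every map task."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Map tasks are independent of each other, so they can run in parallel,
# one per monthly partition.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, partitions.values()))

totals = reduce_task(partials)
print(totals)
```

A single reducer gives one combined output, at the cost of funnelling every map result through one task.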

Page 31: Hd insight essentials quick view

Hive
• Command Line, Web GUI
• JDBC/ODBC
• Thrift Server
• Driver (Parser, Planner, Executor)
• Metastore
• MapReduce
• HDFS

Page 32: Hd insight essentials quick view

Hive or Pig?

Raw Data in HDFS
• Distributed Storage
• Reliable

Data Processing via Pig
• Pipelines
• Iterative Processing
• Research

Data Warehouse via Hive
• BI Tools
• Analysis

Page 33: Hd insight essentials quick view

CHAPTER 7 HIGHLIGHTS: ANALYZE & REPORT

Page 34: Hd insight essentials quick view

Analyze using Excel

Page 35: Hd insight essentials quick view

Analyze using Excel

Page 36: Hd insight essentials quick view

CHAPTER 8: PROJECT PLANNING & ARCHITECTURAL CONSIDERATIONS

Page 37: Hd insight essentials quick view

• Executive & Stakeholder Buy-in
• Discovery & Analysis
• Design
• Implementation
• User Acceptance
• Production Operations
• Feedback, New Requirements