A Crash Course in Apache Hadoop - Blanco

Aug 22, 2020

  • A Crash Course in Apache Hadoop

  • Event Outline
    1. What is Hadoop
    2. Current data challenges
    3. Hadoop Solutions
    4. Architecture
    5. Workshop

  • Who & When
    ● Originated from Google papers
    ● Originally developed at Yahoo!
      ○ Doug Cutting, Michael Cafarella
    ● Project officially began around 2005
    ● Named after a toy elephant

  • Why
    ● More ways to collect data
    ● Too much data
      ○ CERN Laboratory
      ○ Google/Yahoo/Facebook
    ● Forget about processing when you can’t even store it

  • Analogy
    ● Imagine you needed to transport 2,000,000 kg of raw material
    ● How would you do it? (Let’s assume that horsepower is proportional to the mass that each vehicle can carry)
      ○ Ferrari 458 - $243,000, ~560 horsepower
      ○ Bugatti Veyron - $2,310,688, ~1,200 horsepower
      ○ Brand new Ford F-150 - $30,000, ~325 horsepower
      ○ Dodge Caravan - $20,000, ~280 horsepower
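
The comparison in this analogy can be worked through numerically. The sketch below uses only the prices and horsepower figures listed on the slide and ranks the vehicles by dollars per horsepower. The `cost_per_horsepower` helper and the ranking step are illustrative additions, and the reading that cheaper, more modest vehicles carry more per dollar is an interpretation of the analogy rather than something the slide states outright.

```python
# Prices (USD) and approximate horsepower, taken from the slide.
vehicles = {
    "Ferrari 458": (243_000, 560),
    "Bugatti Veyron": (2_310_688, 1200),
    "Ford F-150": (30_000, 325),
    "Dodge Caravan": (20_000, 280),
}


def cost_per_horsepower(price, horsepower):
    """Dollars paid for each unit of horsepower (hypothetical helper)."""
    return price / horsepower


# Rank from cheapest to most expensive per horsepower.
ranked = sorted(vehicles, key=lambda name: cost_per_horsepower(*vehicles[name]))

for name in ranked:
    price, horsepower = vehicles[name]
    print(f"{name}: ${cost_per_horsepower(price, horsepower):,.2f} per horsepower")
```

Under the slide's assumption that horsepower is proportional to carrying capacity, the lowest dollars-per-horsepower figure is the cheapest way to move the same mass.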

  • What

    ● Provide a way to reliably access and process large volumes of data

    ● Designed to scale across many, many machines.

  • The ASF
    ● Apache Open Source
      ○ OpenOffice
      ○ HTTP Server
      ○ Subversion
      ○ Tomcat Webserver
      ○ Commons
      ○ Maven
      ○ Hadoop
    ● Anyone can view the source code!
      ○ Build/edit/modify on your own machine

  • Why data?

    Opportunities and analytic insights for businesses

  • Why Hadoop?
    ● Benefits of the Hadoop Architecture
      ○ Consolidates Data
      ○ Integrates with many existing platforms
      ○ Scalable and Affordable
      ○ Real-Time Insights

  • The Hadoop Ecosystem

  • Hadoop Architecture

  • Notable Hadoop Projects
    ● Apache Kafka → Data Streaming
    ● Apache HBase → Big Data Management
    ● Apache Hive → Read and Query from HDFS
    ● Apache ZooKeeper → HA management
    ● Apache Spark → Processing Engine
    ● Apache Ambari → Cluster management

  • At the Core of Hadoop
    ● Hadoop Distributed File System (HDFS)
    ● Hadoop MapReduce (Processing Engine)
    ● Hadoop Common (Core Hadoop Libraries)
    ● Hadoop YARN (Yet Another Resource Negotiator)
      ○ CPU/Storage/Memory management (parallel jobs)

  • Cluster Architecture
    [Diagram labels: NameNode (HDFS); Resource Manager (YARN); Ambari, Hive, Zeppelin, Knox, etc.; Worker Node running NodeManager and DataNode]

  • HDFS Architecture
    ● Fault-tolerant distributed storage
      ○ Split file into logical blocks
      ○ Store multiple copies of each block

    [Diagram: a file is split into file blocks 1-4, and three copies of each block are stored across the cluster]
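
The two steps on this slide (split a file into blocks, store multiple copies of each block) can be sketched as follows. This is a toy model and not HDFS code: `split_into_blocks` and `place_replicas` are hypothetical names, the round-robin placement is an assumption made for simplicity (real HDFS uses its own placement policy), and the block size here is a few bytes so the result is easy to inspect.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"0123456789abcdef", block_size=4)
nodes = ["DataNode 1", "DataNode 2", "DataNode 3", "DataNode 4"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id + 1} -> {', '.join(holders)}")
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block, which is the fault-tolerance property the slide describes.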

  • HDFS - NameNode and Heartbeats
    ● NameNode communicates through heartbeats
      ○ Keeps track of all the DataNodes
      ○ Knows which data is stored and where

    [Diagram: the NameNode with DataNode 1 through DataNode 4. DataNodes report "Hey NameNode, I’m Here!"; block 123 is shown on two nodes; the NameNode asks: "DataNode 2, can you replicate that 123 block to DataNode 3?"]
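
The heartbeat exchange above can be modeled in a few lines. The class below is a hypothetical sketch and not the HDFS implementation: `NameNodeSketch`, its method names, and the timeout value are all assumptions, and for brevity each heartbeat also carries the list of blocks a node holds, which simplifies how a real cluster reports its blocks.

```python
class NameNodeSketch:
    """Toy model: tracks which DataNodes are alive and where blocks live."""

    def __init__(self, timeout_seconds, replication):
        self.timeout = timeout_seconds
        self.replication = replication
        self.last_seen = {}        # node name -> time of last heartbeat
        self.block_locations = {}  # block id -> set of node names

    def heartbeat(self, node, blocks, now):
        """Record 'Hey NameNode, I'm here!' plus the blocks the node holds."""
        self.last_seen[node] = now
        for block_id in blocks:
            self.block_locations.setdefault(block_id, set()).add(node)

    def live_nodes(self, now):
        """Nodes whose last heartbeat is within the timeout window."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def under_replicated(self, now):
        """Blocks with fewer live copies than the replication target."""
        live = self.live_nodes(now)
        result = {}
        for block_id, holders in self.block_locations.items():
            alive = holders & live
            if len(alive) < self.replication:
                result[block_id] = sorted(alive)
        return result


namenode = NameNodeSketch(timeout_seconds=10, replication=2)
namenode.heartbeat("DataNode 1", ["123"], now=0)
namenode.heartbeat("DataNode 2", ["123"], now=0)
namenode.heartbeat("DataNode 3", [], now=0)

# Later, only DataNode 2 and DataNode 3 are still sending heartbeats.
namenode.heartbeat("DataNode 2", ["123"], now=20)
namenode.heartbeat("DataNode 3", [], now=20)
print(namenode.under_replicated(now=25))
```

When a block shows up as under-replicated, the NameNode in this model would ask a surviving holder to copy it to another live node, mirroring the "DataNode 2, can you replicate that 123 block to DataNode 3?" request in the diagram.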

  • MapReduce in Hadoop
    ● Shuffle and Sort
      ○ Break a problem into sub-problems
    ● Batch Processing
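
To make the "break a problem into sub-problems" idea concrete, here is a minimal in-memory word count that follows the map, shuffle-and-sort, reduce pattern. It runs in a single Python process and does not use Hadoop's API; the function names are illustrative assumptions, and word counting is chosen only because it is a common teaching example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is an independent sub-problem for the map phase.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


counts = word_count(["big data", "Big elephant"])
print(counts)
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework spread those calls across many machines and run the job as a batch.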

  • What do iOS 4 and Windows 3.1 have in common?

  • Multi-Use vs. Batch

  • Workshop
