Introduction to Hadoop, Hive, and Apache Spark
Concepts and Tools
September 2018
1
Outline
• Overview
• MapReduce Framework
• HDFS Framework
• Hadoop Cluster Mechanisms
• Relevant Technologies
– Hive, Sqoop, Pig and NoSQL
• Apache Spark

What and Why?
How?
2
Overview of Hadoop
3
Why Hadoop?
• Hadoop is a platform for storing and processing huge datasets distributed on clusters of commodity machines.
• Two core components of Hadoop:
– Processing engine (traditionally MapReduce, more recently Spark)
– HDFS (Hadoop Distributed File System)
4
Why Hadoop (cont.)?
• Hadoop addresses “big data” challenges.
• “Big data” creates large business value today.
• Various industries face “big data” challenges.
– Without an efficient data processing approach, the data cannot create business value.
– Many industries end up creating large amounts of data that they are unable to gain any insight from.
* http://wikibon.org/
5
Big Data!!
• What is “big data”?
• One SKA survey will generate a data product of 4 EB.
• The DINGO uv grid dataset is ~4 PB.
• General requirements for SKA Phase 1:
– Initial bare minimum of 600 PB
– Annual increase of at least 1 EB per year
6
Core Components of Hadoop
7
Core Components of Hadoop
• MapReduce/Spark
– An efficient programming framework for processing parallelizable problems across huge datasets using a large number of commodity machines.
• HDFS
– A distributed file system designed to efficiently allocate data across multiple commodity machines, and provide self-healing functions when some of them go down.
              Commodity machine    Supercomputer
Performance   Low                  High
Cost          Low                  High
Availability  Readily available    Hard to obtain
8
MapReduce Framework
• Map:
– Extract something of interest from each chunk of records.
• Reduce:
– Aggregate the intermediate outputs from the Map process.
• Map and Reduce have different instantiations in different problems.
General framework
9
MapReduce Framework
10
MapReduce Framework
• Inputs and outputs of Mappers and Reducers are key-value pairs <k,v>.
• Programmers must code according to the MapReduce model:
– Specify the Map method
– Specify the Reduce method
– Define the intermediate outputs in <k,v> format
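To make the model concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the Map and Reduce methods are plain Python scripts that read stdin and emit tab-separated <k,v> lines (the file names mapper.py and reducer.py are illustrative, not from the slides):

#!/usr/bin/env python
# mapper.py - Map step: emit a <word, 1> pair for every word seen.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - Reduce step: Hadoop delivers the mapper output sorted by
# key, so the counts for each word can be aggregated in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))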
11
HDFS Framework
• Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system for Hadoop.
– Data storage infrastructure of a Hadoop cluster
– Hadoop ≈ processing engine (MapReduce or Spark) + HDFS
• Specifically designed to work with MapReduce/Spark.
• Major assumptions:
– Large data sets
– Hardware failure
– Write once, read many
12
HDFS Framework
• Key features of HDFS:
– Fault tolerance – automatically and seamlessly recover from failures
– Data replication – provides redundancy
– Load balancing – place data intelligently for maximum efficiency and utilization
– Scalability – add servers to increase capacity
– “Moving computation is cheaper than moving data.”
13
HDFS Framework
• Components of HDFS:
– DataNodes
• Store the data with optimized redundancy.
– JournalNodes
• Maintain the shared edit log that keeps the NameNodes in sync.
– NameNode(s)
• Manage the DataNodes.
• Standby NameNode – failover capability
14
Hadoop vs RDBMS
• Many businesses are turning from RDBMS to Hadoop-based systems for data management.
• In short: if a business needs to process and analyze large-scale, real-time data, choose Hadoop. Otherwise, staying with an RDBMS is still a wise choice.
              Hadoop-based                            RDBMS
Data format   Structured & unstructured               Mostly structured
Scalability   Very high                               Limited
Speed         Fast for large-scale data               Very fast for small-to-medium-size data
Analytics     Powerful analytical tools for big data  Some limited built-in analytics
15
Hadoop vs Other Distributed Systems
• Common challenges in distributed systems:
– Component failure
• Individual computer nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space.
– Network congestion
• Data may not arrive at a particular point in time.
– Communication failure
• Multiple implementations or versions of client software may speak slightly different protocols from one another.
– Security
• Data may be corrupted, or maliciously or improperly transmitted.
– Failover modes
• Hadoop automates failover handling.
16
Hadoop vs Other Distributed Systems
• Hadoop
– Uses an efficient programming model.
– Efficient, automatic distribution of data and work across machines.
– Handles component failure and network congestion problems well.
– Weak on security issues. (Although…)
17
Hadoop Cluster Architecture
18
Cloudera
• A platform that integrates many Hadoop-based products and services.
19
Hadoop Architecture (1)
• Hadoop has a master/slave architecture.
• Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
– These are the masters.
• The rest of the machines in the cluster act as both DataNode and TaskTracker.
– These are the slaves.
20
Hadoop Architecture (2)
• NameNode (master)
– Manages the file system namespace.
– Executes file system namespace operations like opening, closing, and renaming files and directories.
– Determines the mapping of data chunks to DataNodes.
– Monitors DataNodes by receiving heartbeats.
• DataNodes (slaves)
– Manage storage attached to the nodes that they run on.
– Serve read and write requests from the file system’s clients.
– Perform block creation, deletion, and replication upon instruction from the NameNode.
21
Hadoop Architecture (3)
• JobTracker (master)
– Receives jobs from clients.
– Talks to the NameNode to determine the location of the data.
– Manages and schedules the entire job.
– Splits and assigns tasks to slaves (TaskTrackers).
– Monitors the slave nodes by receiving heartbeats.
• TaskTrackers (slaves)
– Manage individual tasks assigned by the JobTracker, including Map operations and Reduce operations.
– Every TaskTracker is configured with a set of slots that indicate the number of tasks it can accept.
– Send heartbeat messages to the JobTracker to show that they are still alive.
– Notify the JobTracker when a task succeeds or fails.
22
Hadoop Architecture
• Example 1
[cluster diagram: NameNode and JobTracker as the masters]
23
Cluster service layout
24
25
Hadoop Resource Management
• Apache YARN (Yet Another Resource Negotiator)
• Decouples resource management from data processing requirements
• Provides resources for any processing framework compatible with Hadoop
• Resource Manager – a dedicated scheduler that resides on one of the service nodes
• Node Manager – daemons on each of the worker nodes in the cluster
27
YARN
Yet Another Resource Negotiator
• Scheduler
• Resource Manager
28
Zookeeper
• Zookeeper: a cluster management tool that supports coordination between nodes in a distributed system.
– When designing a Hadoop-based application, a lot of coordination work needs to be considered. Writing these functionalities is difficult.
• Zookeeper provides services that can be used to develop distributed applications.
29
• Zookeeper provides services such as:
– Configuration management
– Synchronization
– Group services
– Leader election
– Etc.
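As an illustration, here is a minimal sketch using kazoo, a third-party Python client for Zookeeper (the ensemble address, znode paths, and identifiers are hypothetical):

from kazoo.client import KazooClient

# Connect to a Zookeeper ensemble (address is illustrative).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store a small config value at a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=4")

# Leader election: run() blocks until this client wins the election,
# then invokes the supplied function as the leader.
def lead():
    print("I am the leader now")

election = zk.Election("/app/election", "worker-1")
election.run(lead)

zk.stop()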
Hadoop evolution
30
Relevant Technologies
31
Technologies relevant to Hadoop
[diagram of Hadoop-related technologies, including Zookeeper and Pig]
32
Hadoop Ecosystem
33
Hive
• Hive: a data warehousing application built on Hadoop.
– Its query language is HiveQL, which looks similar to SQL.
– Translates HiveQL into MapReduce or Spark jobs.
– Stores and manages data on HDFS.
– Can be used as an interface for HBase, MongoDB, Cassandra, etc.
34
Hive – a data warehouse for HDFS
• Simply put, Hive is a metadata layer on HDFS datasets.
• Different table types
– Internal or external tables (see the sketch below)
• Different table formats
– Sequence, text, Parquet, ORC, RCFile
• Compression
– Compression codecs – snappy, gzip, zlib
• What Hive gives us
– SQL, partitioning, indexes
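As a brief sketch of internal vs external tables, using the same Spark 1.x HiveContext style as the Spark SQL examples later in these slides (the table names, columns, and HDFS path are hypothetical):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-demo")
hc = HiveContext(sc)

# External table: Hive records only metadata; the files stay where they
# are on HDFS and survive a DROP TABLE.
hc.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS housing_ext
        (suburb STRING, price DOUBLE, year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/housing'
""")

# Internal (managed) table: Hive owns the data; DROP TABLE deletes the
# underlying files as well.
hc.sql("CREATE TABLE housing STORED AS PARQUET AS SELECT * FROM housing_ext")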
35
36
Hive Partitioning
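As a hedged illustration of the idea, a partitioned Hive table stores each partition value in its own HDFS subdirectory, so queries that filter on the partition column scan only the matching directories (this reuses the hypothetical hc HiveContext and housing table from the sketch above):

# Assumes hc (HiveContext) and the housing table from the previous sketch.
hc.sql("""
    CREATE TABLE housing_by_year (suburb STRING, price DOUBLE)
    PARTITIONED BY (year INT)
    STORED AS PARQUET
""")

# Static partition insert: this batch of rows lands in .../year=2018/.
hc.sql("""
    INSERT INTO TABLE housing_by_year PARTITION (year = 2018)
    SELECT suburb, price FROM housing WHERE year = 2018
""")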
37
Sqoop
• Provides a simple interface for importing data straight from a relational DB to Hadoop.
38
NoSQL
• HDFS – an append-only file system
– A file once created, written, and closed need not be changed.
– To modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.
– Not efficient for random read/write.
– Use a relational database? Not scalable.
• Solution: NoSQL
– Stands for Not Only SQL.
– A class of non-relational data storage systems.
– Usually do not require a pre-defined table schema in advance.
– Scale horizontally (vs. vertically).
39
NoSQL
• NoSQL data store models:
– Document store
– Wide-column store
– Key-value store
– Graph store
• NoSQL examples:
– HBase
– Cassandra
– MongoDB
– CouchDB
– Redis
– Riak
– Neo4J
– …
40
HBase
• HBase
– The Hadoop Database.
• Good integration with Hadoop.
– A datastore on HDFS that supports random read and write.
– A distributed database modeled after Google BigTable.
– Best fit for very large Hadoop projects.
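As a hedged sketch, random reads and writes against HBase from Python commonly go through the third-party happybase client and the HBase Thrift server (the host, table name, and column family below are hypothetical, and the table is assumed to already exist):

import happybase

# Connect to the HBase Thrift server (address is illustrative).
conn = happybase.Connection("localhost")
table = conn.table("housing")

# Random write: put a cell into row 'perth' (keys and values are bytes).
table.put(b"perth", {b"cf:price": b"550000"})

# Random read: fetch a single row by key.
row = table.row(b"perth")
print(row[b"cf:price"])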
41
Pig
• A high-level platform for creating MapReduce programs used in Hadoop.
• Translates Pig Latin scripts into efficient sequences of one or more MapReduce jobs.
• Executes the MapReduce jobs.
42
Need for High-Level Languages
• Hadoop is great for large-scale data processing!
– But writing Mappers and Reducers in Java for everything is verbose and slow.
• Solution: develop higher-level data processing languages, and later, new processing engines.
– Hive: HiveQL is like SQL.
– Pig: Pig Latin is similar to Perl.
– Or use Python!
43
Apache Spark
45
Apache Spark Background
• Many of the aforementioned Big Data technologies (HBase, Hive, Pig, Mahout, etc.) are not integrated with each other.
• This can lead to reduced performance and integration difficulties.
• However, Apache Spark is a state-of-the-art Big Data technology that integrates many of the core functions of each of these technologies under one framework.
• The biggest advantage Spark offers over MapReduce is in-memory processing.
46
Apache Spark Background
• Apache Spark is a fast and general engine for large-scale data processing built upon distributed file systems.
– The most common is the Hadoop Distributed File System (HDFS).
• Claims to be 100 times faster than MapReduce and supports Java, Python, and Scala APIs.
• Spark is good for distributed computing tasks, and can handle batch, interactive, and real-time data within a single framework.
• Spark can also run independently of Hadoop.
47
Resilient Distributed Datasets (RDDs)
• The core abstraction for working with data
• Spark automatically distributes the data across the cluster and parallelizes the operations
• An RDD is simply a distributed collection of objects
• RDDs are split into partitions, which can be computed on different cluster nodes
• RDDs can contain any type of Python, Java or Scala object, including user-defined classes
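A minimal PySpark sketch of these points (the HDFS path in the comment is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Distribute a local collection across the cluster as an RDD
# split into 4 partitions.
nums = sc.parallelize(range(1, 101), 4)

# Transformations are lazy; an action such as take() triggers the
# parallel computation on the cluster.
evens = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens.take(5))

# RDDs can also be created from files on HDFS, e.g.:
# lines = sc.textFile("hdfs:///data/housing.csv")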
48
Aside - Spark and Machine Learning
• Why it’s important!
• Libraries available (Python on Spark):
– MLlib
– astroML
– astroPy
– Theano
– TensorFlow
– The usual suspects (numpy, scipy)
– More…
49
Spark Deployment Options
• Standalone – Spark sits directly on top of HDFS. Spark and MapReduce run side by side for all jobs.
• Hadoop YARN – Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) – used to launch Spark jobs in addition to standalone deployment. With SIMR, users can start Spark and use its shell without any administrative access.
51
Spark on YARN
• Resource management, scheduling and security controlled by YARN
• Each Spark executor runs as a YARN container
• Spark vs MapReduce
– MapReduce schedules a container and starts a JVM for each task
– Spark hosts multiple tasks within the same container
52
Spark on YARN (continued)
• Each application has an ApplicationMaster process
• The ApplicationMaster requests resources from the Resource Manager
• When resources are allocated, it instructs the NodeManagers to start containers on its behalf
• Deployment modes
– Cluster – the driver runs inside the ApplicationMaster on a cluster host chosen by YARN
– Client – the driver runs on the host where the job is submitted
53
Spark Components
• Regardless of deployment, Spark provides four standard libraries:
– Spark SQL – allows SQL-like queries of data
– Spark Streaming – allows real-time processing of data
– GraphX – allows graph analytics
– MLlib – provides machine learning tools
54
Spark Components – Spark SQL
– Spark SQL introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Consider the examples below.
– From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
– From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
55
• Hadoop is powerful. But where do we find so many commodity machines?
56
Amazon Elastic MapReduce
• Setting up Hadoop clusters on the cloud.• Amazon Elastic MapReduce (AEM).
– Powered by Hadoop.– Uses EC2 instances as virtual servers for the master and sla
ve nodes.• Key Features:
– No need to do server maintenance.– Resizable clusters.– Hadoop application support including HBase, Pig, Hive etc.– Easy to use, monitor, and manage.
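As a hedged sketch, a cluster can also be provisioned programmatically through the boto3 AWS SDK (the region, release label, instance types, and IAM role names below are assumptions, not values from the slides):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small cluster with Hadoop, Hive, and Spark installed.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])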
57
Spark Components – MLlib
• MLlib (Machine Learning Library) is a distributed machine learning framework on top of Spark.
• Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
• Spark MLlib provides a variety of classic machine learning algorithms.
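For instance, a minimal k-means sketch with pyspark.mllib (the sample points are made up; real input would come from HDFS):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-demo")

# Toy 2-D points forming two obvious clusters.
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model and inspect the learned centres.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)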
58
Spark Components – MLlib Algorithms
• Classification – logistic regression, linear SVM, Naïve Bayes, classification tree
• Regression – Generalized Linear Models (GLMs), Regression tree
• Collaborative filtering – Alternating Least Squares (ALS), Non-negative Matrix Factorization (NMF)
• Clustering – k-means
• Decomposition – SVD, PCA
• Optimization – stochastic gradient descent, L-BFGS
59
Interfaces!
• How do we access Hadoop and Spark?
• CLIs exist for:
– Hive
– Pig
– Spark
– Pyspark
– Spark-submit
60
Better interfaces!
• Web interfaces – we will be using:
– Hue for Hive and Pig
– Jupyter for Pyspark
• Others exist as well; Apache Zeppelin is also worth a look.
• And the Cloudera Management Server:
– Visual representation of the entire cluster
– Status of every service within the cluster
– Extensive monitoring and configuration options
61
Lab – set up our environment
• Start the Docker image
• Start the Jupyter notebook server
• Explore Hue and Jupyter
• Start the Cloudera Management Server
62
Lab – Set up the housing data set
• Import the housing data set into HDFS
• Create an external table definition
• Create an internal table from the external table
• Compare the two tables
63
Lab – Use Pyspark to create an RDD
• We’ll use a Jupyter notebook to demonstrate this.
• Open the nb1-rdd-create notebook in Jupyter.
• Run the notebook.
• As an optional exercise, save the file as a Hive table and use SparkSQL to create the RDD.
65
References
• These articles are good for learning Hadoop.
– http://developer.yahoo.com/hadoop/tutorial/
– https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
– http://www.michael-noll.com/tutorials/
– http://www.slideshare.net/cloudera/tokyo-nosqlslidesonly
– http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html
66
Spark Components – Spark Streaming
• Spark Streaming leverages Spark’s fast scheduling ability to perform streaming analytics.
– Chops up the live stream into batches of X seconds
– Spark treats each data batch as a Resilient Distributed Dataset (RDD) and processes them using RDD operations
– The processed results of the RDD operations are returned in batches
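A minimal PySpark Streaming sketch of this batching model (the socket host and port are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 5)  # chop the live stream into 5-second batches

# Each batch of lines from the socket arrives as an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's results as they are produced

ssc.start()
ssc.awaitTermination()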
67
Lab – Flume to capture streaming data
69
Spark Components - GraphX
• GraphX is a distributed graph-processing framework on top of Spark.
• Users can build graphs using RDDs of nodes and edges.
• Provides a large library of graph algorithms with decomposable steps.
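GraphX itself exposes Scala/Java APIs; from Python, a common substitute is the separate GraphFrames package. A hedged sketch assuming graphframes is installed (the toy graph below is made up):

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not bundled with Spark

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
v = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
e = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

# Build the graph and run PageRank, one of the bundled algorithms.
g = GraphFrame(v, e)
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()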
70
Spark Components - GraphX
71
Spark Components – GraphX Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
72
Resources for Apache Spark
• Spark has a variety of free resources you can learn from.
– Big Data University – http://bigdatauniversity.com/courses/spark-fundamentals/
– Founders of Spark, Databricks – https://databricks.com/
– Apache Spark download – http://spark.apache.org/
– Apache Spark set-up tutorial – http://www.tutorialspoint.com/apache_spark/
73