Introduction to Hadoop, Hive, and Apache Spark
Concepts and Tools
September 2018
1
Outline
• Overview
• MapReduce Framework
• HDFS Framework
• Hadoop Cluster Mechanisms
• Relevant Technologies
– Hive, Sqoop, Pig and NoSQL
• Apache Spark

What and Why?
How?
2
Overview of Hadoop
3
Why Hadoop?
• Hadoop is a platform for storing and processing huge datasets distributed on clusters of commodity machines.
• Two core components of Hadoop:
– Processing engine (traditionally MapReduce, more recently Spark)
– HDFS (Hadoop Distributed File System)
4
Why Hadoop (cont.)?
• Hadoop addresses “big data” challenges.
• “Big data” creates large business value today.
• Various industries face “big data” challenges.
– Without an efficient data processing approach, the data cannot create business value.
– Many industries end up creating large amounts of data that they are unable to gain any insight from.
* http://wikibon.org/
5
Big Data!!
• What is “big data”?
• One SKA survey will generate a data product of 4 EB.
• The DINGO uv grid dataset is ~4 PB.
• General requirements for SKA Phase 1:
– Initial bare minimum of 600 PB
– Annual increase of at least 1 EB per year
6
Core Components of Hadoop
7
Core Components of Hadoop
• MapReduce/Spark
– An efficient programming framework for processing parallelizable problems across huge datasets using a large number of commodity machines.
• HDFS
– A distributed file system designed to efficiently allocate data across multiple commodity machines, and provide self-healing functions when some of them go down.
              Commodity machine    Supercomputer
Performance   Low                  High
Cost          Low                  High
Availability  Readily available    Hard to obtain
8
MapReduce Framework
• Map:
– Extract something of interest from each chunk of records.
• Reduce:
– Aggregate the intermediate outputs from the Map process.
• Map and Reduce have different instantiations in different problems.
General framework
9
MapReduce Framework
10
MapReduce Framework
• Inputs and outputs of Mappers and Reducers are key-value pairs <k,v>.
• Programmers must code according to the MapReduce model:
– Specify the Map method
– Specify the Reduce method
– Define the intermediate outputs in <k,v> format
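To make the model concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the Map and Reduce methods are plain Python scripts that read stdin and emit tab-separated <k,v> lines (the file names mapper.py and reducer.py are illustrative, not from the slides):

#!/usr/bin/env python
# mapper.py - Map step: emit a <word, 1> pair for every word seen.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - Reduce step: Hadoop delivers the mapper output sorted by
# key, so the counts for each word can be aggregated in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))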
11
HDFS Framework
• Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system for Hadoop.
– Data storage infrastructure of a Hadoop cluster
– Hadoop ≈ processing engine (MapReduce or Spark) + HDFS
• Specifically designed to work with MapReduce/Spark.
• Major assumptions:
– Large data sets
– Hardware failure
– Write once, read many
12
HDFS Framework
• Key features of HDFS:
– Fault tolerance – automatically and seamlessly recover from failures
– Data replication – provides redundancy
– Load balancing – place data intelligently for maximum efficiency and utilization
– Scalability – add servers to increase capacity
– “Moving computation is cheaper than moving data.”
13
HDFS Framework
• Components of HDFS:
– DataNodes
• Store the data with optimized redundancy.
– JournalNodes
• Maintain the shared edit log that keeps the NameNodes in sync.
– NameNode(s)
• Manage the DataNodes.
• Standby NameNode – failover capability
14
Hadoop vs RDBMS
• Many businesses are turning from RDBMS to Hadoop-based systems for data management.
• In short: if a business needs to process and analyze large-scale, real-time data, choose Hadoop. Otherwise, staying with an RDBMS is still a wise choice.
              Hadoop-based                            RDBMS
Data format   Structured & unstructured               Mostly structured
Scalability   Very high                               Limited
Speed         Fast for large-scale data               Very fast for small-to-medium-size data
Analytics     Powerful analytical tools for big data  Some limited built-in analytics
15
Hadoop vs Other Distributed Systems
• Common challenges in distributed systems:
– Component failure
• Individual computer nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space.
– Network congestion
• Data may not arrive at a particular point in time.
– Communication failure
• Multiple implementations or versions of client software may speak slightly different protocols from one another.
– Security
• Data may be corrupted, or maliciously or improperly transmitted.
– Failover modes
• Hadoop automates failover handling.
16
Hadoop vs Other Distributed Systems
• Hadoop
– Uses an efficient programming model.
– Efficient, automatic distribution of data and work across machines.
– Handles component failure and network congestion problems well.
– Weak on security issues. (Although…)
17
Hadoop Cluster Architecture
18
Cloudera
• A platform that integrates many Hadoop-based products and services.
19
Hadoop Architecture (1)
• Hadoop has a master/slave architecture.
• Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
– These are the masters.
• The rest of the machines in the cluster act as both DataNode and TaskTracker.
– These are the slaves.
20
Hadoop Architecture (2)
• NameNode (master)
– Manages the file system namespace.
– Executes file system namespace operations like opening, closing, and renaming files and directories.
– Determines the mapping of data chunks to DataNodes.
– Monitors DataNodes by receiving heartbeats.
• DataNodes (slaves)
– Manage storage attached to the nodes that they run on.
– Serve read and write requests from the file system’s clients.
– Perform block creation, deletion, and replication upon instruction from the NameNode.
21
Hadoop Architecture (3)
• JobTracker (master)
– Receives jobs from clients.
– Talks to the NameNode to determine the location of the data.
– Manages and schedules the entire job.
– Splits and assigns tasks to slaves (TaskTrackers).
– Monitors the slave nodes by receiving heartbeats.
• TaskTrackers (slaves)
– Manage individual tasks assigned by the JobTracker, including Map operations and Reduce operations.
– Every TaskTracker is configured with a set of slots that indicate the number of tasks it can accept.
– Send heartbeat messages to the JobTracker to show that they are still alive.
– Notify the JobTracker when a task succeeds or fails.
22
Hadoop Architecture
• Example 1
[cluster diagram: NameNode and JobTracker as the masters]
23
Cluster service layout
24
25
Hadoop Resource Management
• Apache YARN (Yet Another Resource Negotiator)
• Decouples resource management from data processing requirements
• Provides resources for any processing framework compatible with Hadoop
• Resource Manager – a dedicated scheduler that resides on one of the service nodes
• Node Manager – daemons on each of the worker nodes in the cluster
27
YARN
Yet Another Resource Negotiator
• Scheduler
• Resource Manager
28
Zookeeper
• Zookeeper: a cluster management tool that supports coordination between nodes in a distributed system.
– When designing a Hadoop-based application, a lot of coordination work needs to be considered. Writing these functionalities is difficult.
• Zookeeper provides services that can be used to develop distributed applications.
29
• Zookeeper provides services such as:
– Configuration management
– Synchronization
– Group services
– Leader election
– Etc.
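As an illustration, here is a minimal sketch using kazoo, a third-party Python client for Zookeeper (the ensemble address, znode paths, and identifiers are hypothetical):

from kazoo.client import KazooClient

# Connect to a Zookeeper ensemble (address is illustrative).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store a small config value at a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=4")

# Leader election: run() blocks until this client wins the election,
# then invokes the supplied function as the leader.
def lead():
    print("I am the leader now")

election = zk.Election("/app/election", "worker-1")
election.run(lead)

zk.stop()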
Hadoop evolution
30
Relevant Technologies
31
Technologies relevant to Hadoop
[diagram of Hadoop-related technologies, including Zookeeper and Pig]
32
Hadoop Ecosystem
33
Hive
• Hive: a data warehousing application built on Hadoop.
– Its query language is HiveQL, which looks similar to SQL.
– Translates HiveQL into MapReduce or Spark jobs.
– Stores and manages data on HDFS.
– Can be used as an interface for HBase, MongoDB, Cassandra, etc.
34
Hive – a data warehouse for HDFS
• Simply put, Hive is a metadata layer on HDFS datasets.
• Different table types
– Internal or external tables (see the sketch below)
• Different table formats
– Sequence, text, Parquet, ORC, RCFile
• Compression
– Compression codecs – snappy, gzip, zlib
• What Hive gives us
– SQL, partitioning, indexes
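As a brief sketch of internal vs external tables, using the same Spark 1.x HiveContext style as the Spark SQL examples later in these slides (the table names, columns, and HDFS path are hypothetical):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-demo")
hc = HiveContext(sc)

# External table: Hive records only metadata; the files stay where they
# are on HDFS and survive a DROP TABLE.
hc.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS housing_ext
        (suburb STRING, price DOUBLE, year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/housing'
""")

# Internal (managed) table: Hive owns the data; DROP TABLE deletes the
# underlying files as well.
hc.sql("CREATE TABLE housing STORED AS PARQUET AS SELECT * FROM housing_ext")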
35
36
Hive Partitioning
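As a hedged illustration of the idea, a partitioned Hive table stores each partition value in its own HDFS subdirectory, so queries that filter on the partition column scan only the matching directories (this reuses the hypothetical hc HiveContext and housing table from the sketch above):

# Assumes hc (HiveContext) and the housing table from the previous sketch.
hc.sql("""
    CREATE TABLE housing_by_year (suburb STRING, price DOUBLE)
    PARTITIONED BY (year INT)
    STORED AS PARQUET
""")

# Static partition insert: this batch of rows lands in .../year=2018/.
hc.sql("""
    INSERT INTO TABLE housing_by_year PARTITION (year = 2018)
    SELECT suburb, price FROM housing WHERE year = 2018
""")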
37
Sqoop
• Provides a simple interface for importing data straight from a relational DB to Hadoop.
38
NoSQL
• HDFS – an append-only file system
– A file once created, written, and closed need not be changed.
– To modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.
– Not efficient for random read/write.
– Use a relational database? Not scalable.
• Solution: NoSQL
– Stands for Not Only SQL.
– A class of non-relational data storage systems.
– Usually do not require a pre-defined table schema in advance.
– Scale horizontally (vs. vertically).
39
NoSQL
• NoSQL data store models:
– Document store
– Wide-column store
– Key-value store
– Graph store
• NoSQL examples:
– HBase
– Cassandra
– MongoDB
– CouchDB
– Redis
– Riak
– Neo4J
– …
40
HBase
• HBase
– The Hadoop Database.
• Good integration with Hadoop.
– A datastore on HDFS that supports random read and write.
– A distributed database modeled after Google BigTable.
– Best fit for very large Hadoop projects.
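As a hedged sketch, random reads and writes against HBase from Python commonly go through the third-party happybase client and the HBase Thrift server (the host, table name, and column family below are hypothetical, and the table is assumed to already exist):

import happybase

# Connect to the HBase Thrift server (address is illustrative).
conn = happybase.Connection("localhost")
table = conn.table("housing")

# Random write: put a cell into row 'perth' (keys and values are bytes).
table.put(b"perth", {b"cf:price": b"550000"})

# Random read: fetch a single row by key.
row = table.row(b"perth")
print(row[b"cf:price"])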
41
Pig
• A high-level platform for creating MapReduce programs used in Hadoop.
• Translates Pig Latin scripts into efficient sequences of one or more MapReduce jobs.
• Executes the MapReduce jobs.
42
Need for High-Level Languages
• Hadoop is great for large-scale data processing!
– But writing Mappers and Reducers in Java for everything is verbose and slow.
• Solution: develop higher-level data processing languages, and later, new processing engines.
– Hive: HiveQL is like SQL.
– Pig: Pig Latin is similar to Perl.
– Or use Python!
43
Apache Spark
45
Apache Spark Background
• Many of the aforementioned Big Data technologies (HBase, Hive, Pig, Mahout, etc.) are not integrated with each other.
• This can lead to reduced performance and integration difficulties.
• However, Apache Spark is a state-of-the-art Big Data technology that integrates many of the core functions of each of these technologies under one framework.
• The biggest advantage Spark offers over MapReduce is in-memory processing.
46
Apache Spark Background
• Apache Spark is a fast and general engine for large-scale data processing built upon distributed file systems.
– The most common is the Hadoop Distributed File System (HDFS).
• Claims to be 100 times faster than MapReduce and supports Java, Python, and Scala APIs.
• Spark is good for distributed computing tasks, and can handle batch, interactive, and real-time data within a single framework.
• Spark can also run independently of Hadoop.
47
Resilient Distributed Datasets (RDDs)
• The core abstraction for working with data
• Spark automatically distributes the data across the cluster and parallelizes the operations
• An RDD is simply a distributed collection of objects
• RDDs are split into partitions, which can be computed on different cluster nodes
• RDDs can contain any type of Python, Java or Scala object, including user-defined classes
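A minimal PySpark sketch of these points (the HDFS path in the comment is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Distribute a local collection across the cluster as an RDD
# split into 4 partitions.
nums = sc.parallelize(range(1, 101), 4)

# Transformations are lazy; an action such as take() triggers the
# parallel computation on the cluster.
evens = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens.take(5))

# RDDs can also be created from files on HDFS, e.g.:
# lines = sc.textFile("hdfs:///data/housing.csv")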
48
Aside - Spark and Machine Learning
• Why it’s important!
• Libraries available (Python on Spark):
– MLlib
– astroML
– astroPy
– Theano
– TensorFlow
– The usual suspects (numpy, scipy)
– More…
49
Spark Deployment Options
• Standalone – Spark sits directly on top of HDFS. Spark and MapReduce run side by side for all jobs.
• Hadoop YARN – Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) – used to launch Spark jobs in addition to standalone deployment. With SIMR, users can start Spark and use its shell without any administrative access.
51
Spark on YARN
• Resource management, scheduling and security controlled by YARN
• Each Spark executor runs as a YARN container
• Spark vs MapReduce
– MapReduce schedules a container and starts a JVM for each task
– Spark hosts multiple tasks within the same container
52
Spark on YARN (continued)
• Each application has an ApplicationMaster process
• The ApplicationMaster requests resources from the Resource Manager
• When resources are allocated, it instructs the NodeManagers to start containers on its behalf
• Deployment modes
– Cluster – the driver runs inside the ApplicationMaster on a cluster host chosen by YARN
– Client – the driver runs on the host where the job is submitted
53
Spark Components
• Regardless of deployment, Spark provides four standard libraries:
– Spark SQL – allows SQL-like queries of data
– Spark Streaming – allows real-time processing of data
– GraphX – allows graph analytics
– MLlib – provides machine learning tools
54
Spark Components – Spark SQL
– Spark SQL introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Consider the examples below.
– From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
– From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
55
• Hadoop is powerful. But where do we find so many commodity machines?
56
Amazon Elastic MapReduce
• Setting up Hadoop clusters on the cloud.• Amazon Elastic MapReduce (AEM).
– Powered by Hadoop.– Uses EC2 instances as virtual servers for the master and sla
ve nodes.• Key Features:
– No need to do server maintenance.– Resizable clusters.– Hadoop application support including HBase, Pig, Hive etc.– Easy to use, monitor, and manage.
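As a hedged sketch, a cluster can also be provisioned programmatically through the boto3 AWS SDK (the region, release label, instance types, and IAM role names below are assumptions, not values from the slides):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small cluster with Hadoop, Hive, and Spark installed.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])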
57
Spark Components – MLlib
• MLlib (Machine Learning Library) is a distributed machine learning framework on top of Spark.
• Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
• Spark MLlib provides a variety of classic machine learning algorithms.
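For instance, a minimal k-means sketch with pyspark.mllib (the sample points are made up; real input would come from HDFS):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-demo")

# Toy 2-D points forming two obvious clusters.
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model and inspect the learned centres.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)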
58
Spark Components – MLlib Algorithms
• Classification – logistic regression, linear SVM, Naïve Bayes, classification tree
• Regression – Generalized Linear Models (GLMs), Regression tree
• Collaborative filtering – Alternating Least Squares (ALS), Non-negative Matrix Factorization (NMF)
• Clustering – k-means
• Decomposition – SVD, PCA
• Optimization – stochastic gradient descent, L-BFGS
59
Interfaces!
• How do we access Hadoop and Spark?
• CLIs exist for:
– Hive
– Pig
– Spark
– Pyspark
– Spark-submit
60
Better interfaces!
• Web interfaces – we will be using:
– Hue for Hive and Pig
– Jupyter for Pyspark
• Others exist as well; Apache Zeppelin is also worth a look.
• And the Cloudera Management Server:
– Visual representation of the entire cluster
– Status of every service within the cluster
– Extensive monitoring and configuration options
61
Lab – set up our environment
• Start the Docker image
• Start the Jupyter notebook server
• Explore Hue and Jupyter
• Start the Cloudera Management Server
62
Lab – Set up the housing data set
• Import the housing data set into HDFS
• Create an external table definition
• Create an internal table from the external table
• Compare the two tables
63
Lab – Use Pyspark to create an RDD
• We’ll use a Jupyter notebook to demonstrate this.
• Open the nb1-rdd-create notebook in Jupyter.
• Run the notebook.
• As an optional exercise, save the file as a Hive table and use SparkSQL to create the RDD.
65
References
• These articles are good for learning Hadoop.
– http://developer.yahoo.com/hadoop/tutorial/
– https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
– http://www.michael-noll.com/tutorials/
– http://www.slideshare.net/cloudera/tokyo-nosqlslidesonly
– http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html
66
Spark Components – Spark Streaming
• Spark Streaming leverages Spark’s fast scheduling ability to perform streaming analytics.
– Chops up the live stream into batches of X seconds
– Spark treats each data batch as a Resilient Distributed Dataset (RDD) and processes them using RDD operations
– The processed results of the RDD operations are returned in batches
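A minimal PySpark Streaming sketch of this batching model (the socket host and port are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 5)  # chop the live stream into 5-second batches

# Each batch of lines from the socket arrives as an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's results as they are produced

ssc.start()
ssc.awaitTermination()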
67
Lab – Flume to capture streaming data
69
Spark Components - GraphX
• GraphX is a distributed graph-processing framework on top of Spark.
• Users can build graphs using RDDs of nodes and edges.
• Provides a large library of graph algorithms with decomposable steps.
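GraphX itself exposes Scala/Java APIs; from Python, a common substitute is the separate GraphFrames package. A hedged sketch assuming graphframes is installed (the toy graph below is made up):

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not bundled with Spark

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
v = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
e = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

# Build the graph and run PageRank, one of the bundled algorithms.
g = GraphFrame(v, e)
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()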
70
Spark Components - GraphX
71
Spark Components – GraphX Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
72
Resources for Apache Spark
• Spark has a variety of free resources you can learn from.
– Big Data University – http://bigdatauniversity.com/courses/spark-fundamentals/
– Founders of Spark, Databricks – https://databricks.com/
– Apache Spark download – http://spark.apache.org/
– Apache Spark set-up tutorial – http://www.tutorialspoint.com/apache_spark/
73