Hadoop MapReduce Fundamentals
Post on 26-Jan-2015
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 1 of 5
Course Outline
What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
  Apache Hadoop
  Cloudera – CDH4 w/ Impala
  Hortonworks
  MapR
  AWS
  Windows Azure HDInsight
Why Use Hadoop?
Cheaper: scales to Petabytes or more
Faster: parallel data processing
Better: suited for particular types of BigData problems
What types of business problems for Hadoop?
Source: Cloudera “Ten Common Hadoopable Problems”
Companies Using Hadoop
Yahoo
Amazon
eBay
American Airlines
The New York Times
Federal Reserve Board
IBM
Orbitz
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more…
Data storage (HDFS)
  Runs on commodity hardware (usually Linux)
  Horizontally scalable
Processing (MapReduce)
  Parallelized (scalable) processing
  Fault Tolerant
Other Tools / Frameworks
  Data Access: HBase, Hive, Pig, Mahout
  Tools: Hue, Sqoop
  Monitoring: Greenplum, Cloudera
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
What are the core parts of a Hadoop distribution?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Job – Logical View
Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem
Common Hadoop Distributions
Open Source: Apache
Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
Setting up Hadoop Development
Demo – Setting up Cloudera Hadoop
Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 2 of 5
So, what’s the problem?
“I can just use some ‘SQL-like’ language to query Hadoop, right?”
“Yeah, SQL-on-Hadoop… that’s what I want.”
“I don’t want to learn a new query language and…”
“I want massive scale for my shiny, new BigData.”
Ways to MapReduce
Libraries Languages
Note: Java is most common, but other languages can be used
Demo – Using Hive QL on CDH4
What is Hive?
a data warehouse system for Hadoop, created by Facebook
  facilitates easy data summarization
  supports ad-hoc queries (still batch though…)
a mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL
  Interactive console –or– execute scripts
  Kicks off one or more MapReduce jobs in the background
an ability to use indexes, built-in and user-defined functions
Is HQL == ANSI SQL? – NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce, more on that coming up…
Preparing for MapReduce
Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
  ‘sudo’ means ‘run as administrator’ (super user)
  some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to hadoop differ for the former, see the link included for more detail
Demo – Working with Files and HDFS
Thinking in MapReduce
Hint: “It’s Functional”
Understanding MapReduce – P1/3
Map>>
  (K1, V1): info in Input Split
  list (K2, V2): Key / Value out (intermediate values)
  One list per local node
  Can implement local Reducer (or Combiner)
Understanding MapReduce – P2/3
Shuffle/Sort>>
Understanding MapReduce – P3/3
Shuffle/Sort>>
  Shuffle / Sort phase precedes Reduce phase
  Combines Map output into a list
Reduce>>
  (K2, list(V2)): one key with the list of its intermediate values
  list (K3, V3): usually aggregates intermediate values
(input) <k1, v1> map <k2, v2> combine <k2, v2> reduce <k3, v3> (output)
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
MapReduce Example - WordCount
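The Map, Shuffle/Sort and Reduce steps can be sketched in plain Python, with no Hadoop involved. This is a single-process illustration only: the function names (map_phase, combine, shuffle_sort, reduce_phase, word_count) and the sample input are assumptions made for this example and are not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(offset, line):
    """Map: (K1, V1) = (line offset, line text) -> list of (K2, V2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: a local reducer that pre-sums the counts produced on one node."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_sort(all_pairs):
    """Shuffle/Sort: collect every V2 under its K2, with keys in sorted order."""
    ordered = sorted(all_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3) = (word, total count)."""
    return (word, sum(counts))


def word_count(splits):
    """Run map + combine per input split, then shuffle/sort and reduce."""
    intermediate = []
    for split in splits:  # each split stands in for one node's input
        pairs = []
        for offset, line in enumerate(split):
            pairs.extend(map_phase(offset, line))
        intermediate.extend(combine(pairs))  # one list per local node
    return [reduce_phase(key, values) for key, values in shuffle_sort(intermediate)]


if __name__ == "__main__":
    sample_splits = [["the quick brown fox", "the lazy dog"],
                     ["the fox jumps"]]
    for word, total in word_count(sample_splits):
        print(word, total)
```

The combiner is optional here: removing the combine call gives the same totals, it only changes how many intermediate pairs reach the shuffle step.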
MapReduce Objects
Each daemon spawns a new JVM
Demo – Running MapReduce WordCount
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 3 of 5
Ways to run MapReduce Jobs
Configure JobConf options
From Development Environment (IDE)
From a GUI utility
  Cloudera – Hue
  Microsoft Azure – HDInsight console
From the command line
  hadoop jar <filename.jar> input output
Setting up Hadoop On Windows Azure
About HDInsight
Demo – MapReduce in the Cloud
WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Where is your Data coming from?
On premises
  Local file system
  Local HDFS instance
Private Cloud
  Cloud storage
Public Cloud
  Input Storage buckets
  Script / Code buckets
  Output buckets
Common Data Jobs for MapReduce
Demo – Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
Methods to write MapReduce Jobs
Typical – usually written in Java
  MapReduce 2.0 API
  MapReduce 1.0 API
Streaming
  Uses stdin and stdout
  Can use any language to write Map and Reduce Functions: C#, Python, JavaScript, etc…
Pipes
  Often used with C++
Abstraction libraries
  Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
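To make the Streaming option concrete, here is a minimal sketch of a word-count mapper and reducer that exchange tab-separated "key, value" text lines, which is the convention Hadoop Streaming uses over stdin and stdout. The function names and the sample data are assumptions for illustration; in a real streaming job each function would be fed sys.stdin and its output printed to stdout, and Hadoop (not sorted()) would perform the sort between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. The input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    sample = ["to be or not", "to be"]
    mapped = sorted(mapper(sample))  # stands in for Hadoop's shuffle/sort
    for line in reducer(mapped):
        print(line)
```

Because the reducer only sees a sorted stream of lines rather than a (key, list) pair, it has to detect key boundaries itself; groupby does that here.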
Demo – MapReduce via C# & PowerShell
Using AWS MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
What is Pig?
ETL Library for HDFS developed at Yahoo
  Pig Runtime
  Pig Language
  Generates MapReduce Jobs
ETL steps
  LOAD <file>
  FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  DUMP {to screen for testing}
  STORE <newFile>
MapReduce Python Sample
Remember that white space matters in Python!
Demo – Using AWS MapReduce with Pig
AWS Data Pipeline with HIVE
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 4 of 5
Better MapReduce - Optimizations
Optimization BEFORE running a MapReduce Job
More about Input File Compression
From Cloudera… Their version of LZO ‘splittable’
Type   File      Size (GB)   Compress   Decompress
None   Log       8.0         -          -
Gzip   Log.gz    1.3         241        72
LZO    Log.lzo   2.0         55         35
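The trade-off in the table can be worked through with a little arithmetic. The sketch below copies the figures from the table; the table does not state units for the Compress and Decompress columns, so the code assumes they are elapsed times where lower is better. The helper names are hypothetical.

```python
# Figures copied from the table above; sizes are in GB.
# Assumption: the compress / decompress figures are elapsed times
# (units are not stated in the table), so lower means faster.
ORIGINAL_GB = 8.0
CODECS = {
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}


def compression_ratio(size_gb, original_gb=ORIGINAL_GB):
    """How many times smaller the compressed file is than the original."""
    return original_gb / size_gb


def summarize(table):
    """Return the compression ratio per codec, rounded to two decimals."""
    return {name: round(compression_ratio(row["size_gb"]), 2)
            for name, row in table.items()}


if __name__ == "__main__":
    for name, ratio in summarize(CODECS).items():
        print(f"{name}: {ratio}x smaller than the uncompressed log")
    relative = CODECS["gzip"]["compress"] / CODECS["lzo"]["compress"]
    print(f"gzip compress figure is {relative:.1f}x the LZO figure")
```

Under that assumption, Gzip produces the smaller file while LZO has the lower compress and decompress figures.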
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
Data Types
  Writable: Text (String), IntWritable, LongWritable, FloatWritable, BooleanWritable
  WritableComparable for keys
  Custom Types supported – write RawComparator
Reducer Task Optimization
MapReduce Job Optimization
Demo – Unit Testing MapReduce
Using MRUnit + Asserts
Optionally using ApprovalTests
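MRUnit is a Java library, so the sketch below is not MRUnit. It only illustrates the same idea in Python: feed a mapper or reducer a known input in isolation and assert on the output, without starting a cluster. The functions map_words and reduce_counts and the test class are hypothetical names invented for this example.

```python
import unittest


def map_words(line):
    """Mapper under test: one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer under test: total of all counts seen for one word."""
    return (word, sum(counts))


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(map_words("cat cat dog"),
                         [("cat", 1), ("cat", 1), ("dog", 1)])

    def test_mapper_handles_empty_line(self):
        self.assertEqual(map_words(""), [])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("cat", [1, 1]), ("cat", 2))
```

Saved in a file, these tests can be run with Python's built-in runner, for example python -m unittest followed by the module name.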
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
A note about MapReduce 2.0
Splits the existing JobTracker’s roles
  resource management
  job lifecycle management
MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
  better scalability through distributed job lifecycle management
  support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
Library with common machine learning algorithms
Over 20 algorithms
  Recommendation (likelihood – Pandora)
  Classification (known data and new data – spam id)
  Clustering (new groups of similar data – Google news)
Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows
For local development
  Install binaries from Web Platform Installer
  Install .NET Azure SDK (for Azure BLOB storage)
  Install other tools: Neudesic Azure Storage Viewer
Demo – Mahout
Using HDInsight
What about the output?
Clients (Visualizations) for HDFS
Many clients use Hive
  Often included in GUI console tools for Hadoop distributions as well
Microsoft includes clients in Office (Excel 2013)
  Direct Hive client: connect using ODBC
  PowerPivot – data mashups and presentation
  Data Explorer – connect, transform, mashup and filter
  Hadoop SDK on Codeplex
Other popular clients
  Qlikview
  Tableau
  Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013
To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 5 of 5
Limitations of MapReduce
Comparing: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
Access                Interactive and Batch     Batch – NOT Interactive
Updates               Read / Write many times   Write once, Read many times
Structure             Static Schema             Dynamic Schema
Integrity             High (ACID)               Low
Scaling               Nonlinear                 Linear
Query Response Time   Can be near immediate     Has latency (due to batch processing)
Microsoft alternatives to MapReduce
Use existing relational system
  Scale via cloud or edition (i.e. Enterprise or PDW)
Use in memory OLAP
  SQL Server Analysis Services Tabular Models
Use “productized” Dremel
  Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera Impala
Google BigQuery
Demo – Google’s BigQuery: Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
Based on the distribution – on premises
  Apache
    MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
  Cloudera
    Cloudera University - http://university.cloudera.com/
    Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
  Hortonworks
  MapR
Based on the distribution – cloud
  AWS MapReduce
    Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
  Windows Azure HDInsight
    Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
    More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape