Hadoop MapReduce Fundamentals

@LynnLangit

A five-part series – Part 1 of 5

Course Outline

What is Hadoop?

Open-source data storage and processing API
  Massively scalable, automatically parallelizable
  Based on work from Google: GFS + MapReduce + BigTable

Current distributions based on open source and vendor work
  Apache Hadoop
  Cloudera – CDH4 w/ Impala
  Hortonworks
  MapR
  AWS
  Windows Azure HDInsight

Why Use Hadoop?

Cheaper: scales to petabytes or more

Faster: parallel data processing

Better: suited for particular types of BigData problems

What types of business problems suit Hadoop?

Source: Cloudera “Ten Common Hadoopable Problems”

Companies Using Hadoop

Facebook

Amazon

American Airlines

The New York Times

Federal Reserve Board

Orbitz

Forecast growth of Hadoop Job Market

Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html

Hadoop is a set of Apache Frameworks and more…

Data storage (HDFS)
  Runs on commodity hardware (usually Linux)
  Horizontally scalable

Processing (MapReduce)
  Parallelized (scalable) processing
  Fault tolerant

Other Tools / Frameworks
  Data Access: HBase, Hive, Pig, Mahout
  Tools: Hue, Sqoop
  Monitoring: Greenplum, Cloudera

Hadoop Core - HDFS

MapReduce API

Data Access

Tools & Libraries

Monitoring & Alerting

What are the core parts of a Hadoop distribution?

Hadoop Cluster HDFS (Physical) Storage

MapReduce Job – Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png

Hadoop Ecosystem

Common Hadoop Distributions

Open Source
  Apache

Commercial
  Cloudera
  Hortonworks
  MapR
  AWS MapReduce
  Microsoft HDInsight (Beta)

A View of Hadoop (from Hortonworks)

Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI

Setting up Hadoop Development

Demo – Setting up Cloudera Hadoop

Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs

So, what’s the problem?

“I can just use some ‘SQL-like’ language to query Hadoop, right?”

“Yeah, SQL-on-Hadoop…that’s what I want.”
“I don’t want to learn a new query language and…”
“I want massive scale for my shiny, new BigData.”

Ways to MapReduce

Libraries Languages

Note: Java is most common, but other languages can be used

Demo – Using Hive QL on CDH4

What is Hive?

A data warehouse system for Hadoop that
  facilitates easy data summarization
  supports ad-hoc queries (still batch though…)
  was created by Facebook

A mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL
  Interactive console –or– execute scripts
  Kicks off one or more MapReduce jobs in the background

An ability to use indexes, built-in and user-defined functions

Is HQL == ANSI SQL? – NO!

-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.* FROM a JOIN b ON (a.id <> b.id);

Note: Joins are quite different in MapReduce, more on that coming up…

Preparing for MapReduce

Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>

Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail

Demo – Working with Files and HDFS

Thinking in MapReduce

Hint: “It’s Functional”

Understanding MapReduce – P1/3

Map>> (K1, V1)
  Info in Input Split
  Emits list (K2, V2): key / value out (intermediate values)
  One list per local node
  Can implement local Reducer (or Combiner)

Shuffle/Sort>>
  Shuffle / Sort phase precedes Reduce phase
  Combines Map output into a list

Reduce>> (K2, list(V2))
  Emits list (K3, V3)
  Usually aggregates intermediate values

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

MapReduce Example - WordCount
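
The WordCount flow (map, optional combine, shuffle/sort, reduce) can be sketched in plain Python. This is an in-memory illustration only, not Hadoop code: the function names (map_phase, combine_phase, shuffle_sort, reduce_phase, word_count) and the sample input are assumptions made for this example, and each inner list stands in for one input split.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(offset, line):
    """Map: (K1, V1) -> list of (K2, V2). K1 (the line offset) is unused here."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combiner: a local reduce over one mapper's output, to shrink the shuffle."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_sort(all_pairs):
    """Shuffle/Sort: sort by key, then group values -> (K2, list(V2))."""
    ordered = sorted(all_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3)."""
    return (word, sum(counts))


def word_count(splits):
    intermediate = []
    for split in splits:  # each split stands in for one mapper's input
        mapped = []
        for offset, line in enumerate(split):
            mapped.extend(map_phase(offset, line))
        intermediate.extend(combine_phase(mapped))
    return dict(reduce_phase(key, values)
                for key, values in shuffle_sort(intermediate))


if __name__ == "__main__":
    sample_splits = [["the quick brown fox", "the lazy dog"],
                     ["the fox jumps"]]
    print(word_count(sample_splits))
```

The combiner runs once per split, so the reducer receives partial sums rather than a long list of ones; the final counts are the same either way because addition is associative.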

MapReduce Objects

Each daemon spawns a new JVM

Demo – Running MapReduce WordCount

Ways to run MapReduce Jobs

Configure JobConf options
From Development Environment (IDE)
From a GUI utility
  Cloudera – Hue
  Microsoft Azure – HDInsight console
From the command line
  hadoop jar <filename.jar> input output

Setting up Hadoop On Windows Azure

About HDInsight

Demo – MapReduce in the Cloud

WordCount MapReduce using HDInsight

MapReduce (WordCount) with JavaScript

Note: JavaScript is part of the Azure Hadoop distribution

Common Data Sources for MapReduce Jobs

Where is your Data coming from?

On premises
  Local file system
  Local HDFS instance

Private Cloud
  Cloud storage

Public Cloud
  Input Storage buckets
  Script / Code buckets
  Output buckets

Common Data Jobs for MapReduce

Demo – Other Types of MapReduce

Tip: Review the Java MapReduce code in these samples as well.

Methods to write MapReduce Jobs

Typical – usually written in Java
  MapReduce 2.0 API
  MapReduce 1.0 API

Streaming
  Uses stdin and stdout
  Can use any language to write Map and Reduce functions: C#, Python, JavaScript, etc…

Pipes
  Often used with C++

Abstraction libraries
  Hive, Pig, etc… write in a higher-level language, generate one or more MapReduce jobs

Demo – MapReduce via C# & PowerShell

Using AWS MapReduce

Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud

What is Pig?

ETL library for HDFS, developed at Yahoo
  Pig Runtime
  Pig Language
  Generates MapReduce jobs

ETL steps
  LOAD <file>
  FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  DUMP {to screen for testing}
  STORE <newFile>

MapReduce Python Sample

Remember that white space matters in Python!
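
A minimal sketch of a Python WordCount in the Hadoop Streaming style, where the mapper and reducer exchange tab-separated records over stdin and stdout. The function names, the single-file layout and the "map"/"reduce" command-line argument are assumptions made for this example, not part of Hadoop; the reducer relies on its input already being sorted by key, which the shuffle/sort phase provides.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: the first argument selects the role ("map" or "reduce").
# Nothing is read from stdin unless a role is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, wc.py (a hypothetical file name), the pair can be tried locally without a cluster: cat input.txt | python wc.py map | sort | python wc.py reduce. On a cluster the same script would be passed to the Hadoop Streaming jar through its -mapper and -reducer options; the jar's location depends on the distribution.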

Demo – Using AWS MapReduce with Pig

AWS Data Pipeline with HIVE

Better MapReduce - Optimizations

Optimization BEFORE running a MapReduce Job

More about Input File Compression

From Cloudera… their version of LZO is ‘splittable’

Type   File      Size (GB)   Compress   Decompress
None   Log       8.0         -          -
Gzip   Log.gz    1.3         241        72
LZO    Log.lzo   2.0         55         35
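
The trade-off in these figures can be made explicit with a little arithmetic. The sketch below only reuses the numbers from the table; the units of the Compress and Decompress columns are not stated in the source, so they are compared as relative costs rather than as absolute times. The names used (ORIGINAL_GB, rows, compression_ratio) are illustrative.

```python
# Figures copied from the table; sizes are in GB.
# Compress/Decompress units are not given in the source, so only ratios are used.
ORIGINAL_GB = 8.0

rows = {
    "Gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "LZO": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}


def compression_ratio(size_gb, original_gb=ORIGINAL_GB):
    """How many times smaller the compressed file is than the original."""
    return original_gb / size_gb


for name, row in rows.items():
    print(f"{name}: {compression_ratio(row['size_gb']):.1f}x smaller")

# Relative cost of Gzip compared with LZO (unitless ratios).
compress_cost = rows["Gzip"]["compress"] / rows["LZO"]["compress"]
decompress_cost = rows["Gzip"]["decompress"] / rows["LZO"]["decompress"]
print(f"Gzip costs {compress_cost:.1f}x LZO to compress, "
      f"{decompress_cost:.1f}x to decompress")
```

On these numbers Gzip produces the smaller file, while LZO is considerably cheaper to compress and decompress.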

Optimization WITHIN a MapReduce Job

Mapper Task Optimization

Data Types
  Writable
    Text (String)
    IntWritable
    LongWritable
    FloatWritable
    BooleanWritable
  WritableComparable for keys
  Custom types supported – write RawComparator

Reducer Task Optimization

MapReduce Job Optimization

Demo – Unit Testing MapReduce

Using MRUnit + Asserts
Optionally using ApprovalTests

Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png

A note about MapReduce 2.0

Splits the existing JobTracker’s roles
  resource management
  job lifecycle management

MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
  better scalability through distributed job lifecycle management
  support for multiple Hadoop MapReduce API versions in a single cluster

What is Mahout?

Library with common machine learning algorithms
Over 20 algorithms
  Recommendation (likelihood – Pandora)
  Classification (known data and new data – spam id)
  Clustering (new groups of similar data – Google news)

Can non-statisticians find value using this library?

Mahout Algorithms

Setting up Hadoop on Windows

For local development
  Install binaries from Web Platform Installer
  Install .NET Azure SDK (for Azure BLOB storage)
  Install other tools
    Neudesic Azure Storage Viewer

Demo – Mahout

Using HDInsight

What about the output?

Clients (Visualizations) for HDFS

Many clients use Hive
  Often included in GUI console tools for Hadoop distributions as well

Microsoft includes clients in Office (Excel 2013)
  Direct Hive client
    Connect using ODBC
    PowerPivot – data mashups and presentation
    Data Explorer – connect, transform, mashup and filter
  Hadoop SDK on Codeplex

Other popular clients
  Qlikview
  Tableau
  Karmasphere

Demo – Executing Hive Queries

Demo – Using HDFS output in Excel 2013

To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803

Demo – New Visualizations – D3

Limitations of MapReduce

Comparing: RDBMS vs. Hadoop

                      Traditional RDBMS          Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                Interactive and Batch      Batch – NOT Interactive
Updates               Read / Write many times    Write once, Read many times
Structure             Static Schema              Dynamic Schema
Integrity             High (ACID)                Low
Scaling               Nonlinear                  Linear
Query Response Time   Can be near immediate      Has latency (due to batch processing)

Microsoft alternatives to MapReduce

Use existing relational system
  Scale via cloud or edition (i.e. Enterprise or PDW)

Use in-memory OLAP
  SQL Server Analysis Services Tabular Models

Use “productized” Dremel
  Microsoft Polybase – status = beta?

Looking Forward - Dremel or Apache Drill

Based on original research from Google

Apache Drill Architecture

In-market MapReduce Alternatives

Cloudera: Impala

Google: BigQuery

Demo – Google’s BigQuery: Dremel for the rest of us

Hadoop MapReduce Call to Action

More MapReduce Developer Resources

Based on the distribution – on premises
  Apache
    MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
  Cloudera
    Cloudera University - http://university.cloudera.com/
    Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
  Hortonworks
  MapR

Based on the distribution – cloud
  AWS MapReduce
    Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
  Windows Azure HDInsight
    Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
    More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/

The Changing Data Landscape
