
Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Jan 15, 2015


David Zuelke

Presentation given at PHP Day 2011 in Verona, Italy.
Transcript
Page 1: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP

Page 2: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

David Zülke

Page 3: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

David Zuelke

Page 4: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)
Page 5: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Founder

Page 8: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Lead Developer

Page 11: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

FROM 30,000 FEET
Distributed And Parallel Computing

Page 12: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

we want to process data

Page 13: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

how much data exactly?

Page 14: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

SOME NUMBERS

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

• 12 TB (March 2010)

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

Page 15: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if you have that much data?

Page 16: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if you have just 1% of that amount?

Page 17: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

“no problemo”, you say? (or, if you're Italian, “nessun problema a tutti”)

Page 18: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

reading 180 GB sequentially off a disk will take ~45 minutes
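That estimate is easy to sanity-check. Assuming a sequential read speed of about 70 MB/s, which is a plausible figure for a 2011-era spinning disk (the speed is an assumption, not from the slides):

```php
<?php
// Back-of-the-envelope check: how long does 180 GB take to read sequentially?
$gigabytes = 180;
$mb_per_second = 70; // assumed sequential read throughput of one disk

$seconds = ($gigabytes * 1024) / $mb_per_second;
echo round($seconds / 60), " minutes\n"; // roughly 44 minutes
```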

Page 19: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but you only have 16 GB or so of RAM per computer

Page 20: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

data can be processed much faster than it can be read

Page 21: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

solution: parallelize your I/O

Page 22: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but now you need to coordinate what you’re doing

Page 23: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

and that’s hard

Page 24: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if a node dies?

Page 25: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

is data lost? will other nodes in the grid have to re-start?

how do you coordinate this?

Page 26: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

ENTER: OUR HERO
Introducing MapReduce

Page 27: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

in the olden days, the workload was distributed across a grid

Page 28: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but the data was shipped around between nodes

Page 29: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

or even stored centrally on something like a SAN

Page 30: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

I/O bottleneck

Page 31: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

along came a Google publication in 2004

Page 32: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 33: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

now the data is distributed

Page 34: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

computing happens on the nodes where the data already is

Page 35: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

processes are isolated and don’t communicate (share-nothing)

Page 36: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 37: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to the Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 38: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 39: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

REDUCER WILL RECEIVE THIS

IP              Bytes

212.122.174.13  18271
212.122.174.13  191726
212.122.174.13  198
212.122.174.13  43
74.119.8.111    91272
74.119.8.111    8371

Page 40: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 41: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

PSEUDOCODE

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}

Input:

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

Output:

212.122.174.13  210238
74.119.8.111    99643
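The pseudocode can be turned into a small, runnable local sketch. parse_apache_log() and emit() are not real functions, so a minimal regex-based log parser is assumed here, and the MapReduce machinery is simulated with plain arrays:

```php
<?php
// Stand-in for parse_apache_log(); assumes Common Log Format with the
// client IP as the first field and the byte count as the last field.
function parse_apache_log($line_text) {
    preg_match('/^(\S+) .* (\d+)$/', $line_text, $m);
    return ['ip' => $m[1], 'bytes' => (int)$m[2]];
}

// Map phase: one <ip, bytes> pair per log line.
function map_lines(array $lines) {
    $pairs = [];
    foreach ($lines as $line) {
        $parts = parse_apache_log($line);
        $pairs[] = [$parts['ip'], $parts['bytes']];
    }
    return $pairs;
}

// Shuffle + reduce phase: group values by key, then sum each group.
function reduce_pairs(array $pairs) {
    $grouped = [];
    foreach ($pairs as [$ip, $bytes]) {
        $grouped[$ip][] = $bytes;
    }
    $totals = [];
    foreach ($grouped as $ip => $values) {
        $totals[$ip] = array_sum($values);
    }
    return $totals;
}

$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43',
];
print_r(reduce_pairs(map_lines($log)));
```

On a real cluster, the grouping step in the middle is what Hadoop's shuffle does for you; the only parts you would write are the map and reduce functions.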

Page 42: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

A YELLOW ELEPHANT
Introducing Apache Hadoop

Page 44: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Hadoop is a MapReduce framework

Page 45: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

it allows us to focus on writing Mappers, Reducers etc.

Page 46: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

and it works extremely well

Page 47: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

how well exactly?

Page 48: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT FACEBOOK (I)

• Predominantly used in combination with Hive (~95%)

• 8400 cores with ~12.5 PB of total storage

• 8 cores, 12 TB storage and 32 GB RAM per node

• 1x Gigabit Ethernet for each server in a rack

• 4x Gigabit Ethernet from rack switch to core

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hadoop is aware of racks and locality of nodes

Page 49: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT FACEBOOK (II)

• Daily stats:

• 25 TB logged by Scribe

• 135 TB of compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

• New data per day:

• I/08: 200 GB

• II/09: 2 TB (compressed)

• III/09: 4 TB (compressed)

• I/10: 12 TB (compressed)

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Page 50: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest cluster:

  • 4,000 nodes

  • 2x4 CPU cores each

  • 16 GB RAM each

• Over 40% of jobs run using Pig

http://wiki.apache.org/hadoop/PoweredBy

Page 51: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

OTHER NOTABLE USERS

• Twitter (storage, logging, analysis; heavy users of Pig)

• Rackspace (log analysis; data pumped into Lucene/Solr)

• LinkedIn (friend suggestions)

• Last.fm (charts, log analysis, A/B testing)

• The New York Times (converted 4 TB of scans using EC2)

Page 52: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

there’s just one little problem

Page 53: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

you need to write Java code

Page 54: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

however, there is hope...

Page 55: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

STREAMING
Hadoop Won’t Force Us To Use Java

Page 56: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Hadoop Streaming can use any script as Mapper or Reducer

Page 57: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

many configuration options (parsers, formats, combining, …)

Page 58: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

it works using STDIN and STDOUT

Page 59: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Mappers are streamed the records (usually by line: <line>\n)

and emit key/value pairs: <key>\t<value>\n
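A Streaming Mapper in PHP can therefore be an ordinary CLI script. As a sketch (the log-parsing regex and field layout are assumptions about the input, not part of the Streaming protocol):

```php
#!/usr/bin/env php
<?php
// Turn one access.log line into a "<key>\t<value>" record, or null if
// the line doesn't parse. Assumed layout: client IP first, bytes last.
function map_line($line) {
    if (preg_match('/^(\S+) .* (\d+)$/', rtrim($line, "\n"), $m)) {
        return $m[1] . "\t" . $m[2];
    }
    return null;
}

// Streaming protocol: records arrive on STDIN, pairs leave on STDOUT.
while (($line = fgets(STDIN)) !== false) {
    $record = map_line($line);
    if ($record !== null) {
        echo $record, "\n";
    }
}
```

Hadoop launches the script and pipes an input split into it; locally you can approximate the whole pipeline with `cat access.log | php mapper.php | sort | php reducer.php`.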

Page 60: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 61: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Caution: no separate Reducer processes per key (but keys are sorted)
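Because one Reducer process sees many keys in sequence, the script has to detect key changes itself; the sort guarantees each key's values arrive contiguously. A sketch, again summing bytes per IP:

```php
#!/usr/bin/env php
<?php
// Sum values per key from sorted "<key>\t<value>" lines on a stream.
// Returns [key => total]; relies on the sort making each key contiguous.
function reduce_stream($handle) {
    $totals = [];
    $currentKey = null;
    $sum = 0;
    while (($line = fgets($handle)) !== false) {
        [$key, $value] = explode("\t", rtrim($line, "\n"), 2);
        if ($key !== $currentKey) {
            if ($currentKey !== null) {
                $totals[$currentKey] = $sum; // key changed: flush previous
            }
            $currentKey = $key;
            $sum = 0;
        }
        $sum += (int)$value;
    }
    if ($currentKey !== null) {
        $totals[$currentKey] = $sum; // flush the final key
    }
    return $totals;
}

// Streaming protocol: sorted pairs on STDIN, "<key>\t<total>" on STDOUT.
foreach (reduce_stream(STDIN) as $key => $total) {
    echo $key, "\t", $total, "\n";
}
```

Forgetting the final flush after the loop is the classic bug in hand-written streaming Reducers: the last key silently produces no output.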

Page 62: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS
Hadoop Distributed File System

Page 63: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS

• Stores data in blocks (default block size: 64 MB)

• Designed for very large data sets

• Designed for streaming rather than random reads

• Write-once, read-many (although appending is possible)

• Capable of compression and other cool things

Page 64: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS CONCEPTS

• Large blocks minimize the number of seeks and maximize throughput

• Blocks are stored redundantly (3 replicas by default)

• Aware of infrastructure characteristics (nodes, racks, ...)

• Datanodes hold blocks

• Namenode holds the metadata and is the critical component of an HDFS cluster (a single point of failure; plan for high availability)

Page 65: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

JOB PROCESSING
How Hadoop Works

Page 66: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Just like I already described! It’s MapReduce! \o/

Page 67: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC RULES

• Uses Input Formats to split up your data into single records

• You can optimize using combiners to reduce locally on a node

• Only possible in some cases, e.g. for max(), but not avg()

• You can control partitioning of map output yourself

• Rarely useful; the default partitioner (key hash) is enough

• And a million other things that really don’t matter right now ;)

Page 68: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

oh and, if you’re wondering how Hadoop got its name

Page 69: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Doug Cutting

Page 70: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

STREAMING WITH PHP
Introducing HadooPHP

Page 71: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOPHP

• A little framework to help with writing mapred jobs in PHP

• Takes care of input splitting, can do basic decoding et cetera

• Automatically detects and handles Hadoop settings such as key length or field separators

• Packages jobs as one .phar archive to ease deployment

• Also creates a ready-to-rock shell script to invoke the job

Page 72: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

written by

Page 73: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)
Page 75: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HANDS-ON
Hadoop Streaming & PHP in action

Page 76: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

THE HADOOP ECOSYSTEM
A Little Tour

Page 77: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

APACHE AVRO
Efficient Data Serialization System With Schemas

(compare: Facebook’s Thrift)

Page 78: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

CLOUDERA FLUME
Distributed Data Collection System

(compare: Facebook’s Scribe)

Page 79: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

APACHE HBASE
Like Google’s BigTable, Only That You Can Have It, Too!

Page 80: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS
Your Friendly Distributed File System

Page 81: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HIVE
Data Warehousing Made Simple With An SQL Interface

Page 82: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

PIG
A High-Level Language For Modeling Data Processing Tasks

(fulfills the same purpose as Hive)

Page 83: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

ZOOKEEPER
Your Distributed Applications, Coordinated

Page 84: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

The End

Page 85: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

RESOURCES

• http://www.cloudera.com/developers/learn-hadoop/

• Tom White: Hadoop: The Definitive Guide. O’Reilly, 2009

• http://www.cloudera.com/hadoop/

• Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …

Page 86: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Questions?