Large-Scale Data Processing with Hadoop and PHP (IPCSE11 2011-05-31)

David Zuelke

Presentation given at the International PHP Conference Spring Edition 2011
Transcript
Page 1

LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP

Page 2

David Zülke

Page 3

David Zuelke

Page 4
Page 5

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6

Founder

Page 8

Lead Developer

Page 11

FROM 30,000 FEET
Distributed and Parallel Computing

Page 12

we want to process data

Page 13

how much data exactly?

Page 14

SOME NUMBERS

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

• 12 TB (March 2010)

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

Page 15

what if you have that much data?

Page 16

what if you have just 1% of that amount?

Page 17

“no problemo”, you say?

Page 18

reading 180 GB sequentially off a disk will take ~45 minutes
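(A quick sanity check on that figure, assuming a sustained sequential read rate of roughly 65 MB/s, plausible for a 2011-era disk: 180 GB ≈ 180,000 MB, and 180,000 MB ÷ 65 MB/s ≈ 2,770 s, or about 45 minutes.)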

Page 19

and you only have 16 to 64 GB of RAM per computer

Page 20

so you can't process everything at once

Page 21

general rule of modern computers:

Page 22

data can be processed much faster than it can be read

Page 23

solution: parallelize your I/O

Page 24

but now you need to coordinate what you’re doing

Page 25

and that’s hard

Page 26

what if a node dies?

Page 27

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 28

ENTER: OUR HERO
Introducing MapReduce

Page 29

in the olden days, the workload was distributed across a grid

Page 30

but the data was shipped around between nodes

Page 31

or even stored centrally on something like a SAN

Page 32

I/O bottleneck

Page 33

along came a Google publication in 2004

Page 34

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 35

now the data is distributed

Page 36

computing happens on the nodes where the data already is

Page 37

processes are isolated and don’t communicate (share-nothing)

Page 38

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 39

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 40

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 41

REDUCER WILL RECEIVE THIS

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

212.122.174.13 43

74.119.8.111 91272

74.119.8.111 8371

Page 42

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 43

PSEUDOCODE

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}

Input (access.log):

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

Output:

212.122.174.13 210238
74.119.8.111 99643
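The pseudocode leaves parse_apache_log() and emit() undefined. A minimal, hypothetical sketch of the parsing helper for the Common Log Format lines above (illustrative; a real implementation would need more robust parsing):

function parse_apache_log($line_text) {
    // Common Log Format: host ident authuser [date] "request" status bytes
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) \S+" \d{3} (\d+|-)$/';
    if (!preg_match($pattern, $line_text, $m)) {
        return null; // skip lines that don't match
    }
    return array(
        'ip'    => $m[1],                           // map key
        'path'  => $m[2],                           // unused here
        'bytes' => $m[3] === '-' ? 0 : (int) $m[3], // map value
    );
}

emit() stands in for whatever output channel the framework provides; with Hadoop Streaming it is simply a line written to STDOUT, as shown later.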

Page 44

A YELLOW ELEPHANT
Introducing Apache Hadoop

Page 46

Hadoop is a MapReduce framework

Page 47

it allows us to focus on writing Mappers, Reducers etc.

Page 48

and it works extremely well

Page 49

how well exactly?

Page 50

HADOOP AT FACEBOOK (I)

• Predominantly used in combination with Hive (~95%)

• 8400 cores with ~12.5 PB of total storage

• 8 cores, 12 TB storage and 32 GB RAM per node

• 1x Gigabit Ethernet for each server in a rack

• 4x Gigabit Ethernet from rack switch to core

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hadoop is aware of racks and locality of nodes

Page 51

HADOOP AT FACEBOOK (II)

• Daily stats:

• 25 TB logged by Scribe

• 135 TB of compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

• New data per day:

• I/08: 200 GB

• II/09: 2 TB (compressed)

• III/09: 4 TB (compressed)

• I/10: 12 TB (compressed)

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Page 52

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig
http://wiki.apache.org/hadoop/PoweredBy

Page 53

OTHER NOTABLE USERS

• Twitter (storage, logging, analysis; heavy users of Pig)

• Rackspace (log analysis; data pumped into Lucene/Solr)

• LinkedIn (friend suggestions)

• Last.fm (charts, log analysis, A/B testing)

• The New York Times (converted 4 TB of scans using EC2)

Page 54

there’s just one little problem

Page 55

you need to write Java code

Page 56

however, there is hope...

Page 57

STREAMING
Hadoop Won’t Force Us To Use Java

Page 58

Hadoop Streaming can use any script as Mapper or Reducer

Page 59

many configuration options (parsers, formats, combining, …)

Page 60

it works using STDIN and STDOUT

Page 61

Mappers are streamed the records (usually one per line: <line>\n)

and emit key/value pairs: <key>\t<value>\n
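To make that concrete: a minimal PHP mapper for the access.log example could look like the sketch below (the streaming jar path and options in the comment are indicative and vary by installation):

#!/usr/bin/env php
<?php
// mapper.php - run via Hadoop Streaming, e.g. (paths vary by install):
//   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
//     -input access_logs -output traffic \
//     -mapper mapper.php -reducer reducer.php \
//     -file mapper.php -file reducer.php
while (($line = fgets(STDIN)) !== false) {
    // crude parse: first field is the client IP, last field the byte count
    $fields = explode(' ', trim($line));
    $ip     = $fields[0];
    $bytes  = end($fields);
    if (ctype_digit($bytes)) {
        echo $ip, "\t", $bytes, "\n"; // emit <key>\t<value>\n
    }
}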

Page 62

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 63

Caution: there is no separate Reducer process per key (but keys are sorted)
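This is the part that trips people up: a streaming Reducer sees one sorted stream and must detect key boundaries itself. A minimal PHP sketch for the traffic-per-IP example:

#!/usr/bin/env php
<?php
// reducer.php - receives lines sorted by key: <key>\t<value>\n
$currentKey = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    list($key, $value) = explode("\t", trim($line), 2);
    if ($currentKey !== null && $key !== $currentKey) {
        echo $currentKey, "\t", $sum, "\n"; // key changed: flush the total
        $sum = 0;
    }
    $currentKey = $key;
    $sum += (int) $value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n"; // flush the final key
}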

Page 64

HDFS
Hadoop Distributed File System

Page 65

HDFS

• Stores data in blocks (default block size: 64 MB)

• Designed for very large data sets

• Designed for streaming rather than random reads

• Write-once, read-many (although appending is possible)

• Capable of compression and other cool things

Page 66

HDFS CONCEPTS

• Large blocks minimize the number of seeks and maximize throughput

• Blocks are stored redundantly (3 replicas by default)

• Aware of infrastructure characteristics (nodes, racks, ...)

• Datanodes hold blocks

• Namenode holds the metadata

The Namenode is the critical component of an HDFS cluster (a single point of failure; HA requires extra care)

Page 67

JOB PROCESSING
How Hadoop Works

Page 68

Just like I already described! It’s MapReduce! \o/

Page 69

BASIC RULES

• Uses Input Formats to split up your data into single records

• You can optimize using combiners to reduce locally on a node

• Only possible in some cases, e.g. for max(), but not avg() (see the note after this list)

• You can control partitioning of map output yourself

• Rarely useful; the default partitioner (key hash) is enough

• And a million other things that really don’t matter right now ;)
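A quick check of why max() combines but avg() does not: max(max(1, 2), max(3, 4, 5)) = max(2, 5) = 5, the same as max(1, 2, 3, 4, 5); but avg(avg(1, 2), avg(3, 4, 5)) = avg(1.5, 4) = 2.75, whereas avg(1, 2, 3, 4, 5) = 3. Averages can still be combined if the combiner emits partial (sum, count) pairs instead of averages.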

Page 70

oh and, if you’re wondering how Hadoop got its name

Page 71

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Doug Cutting

Page 72

STREAMING WITH PHP
Introducing HadooPHP

Page 73

HADOOPHP

• A little framework to help with writing mapred jobs in PHP

• Takes care of input splitting, can do basic decoding et cetera

• Automatically detects and handles Hadoop settings such as key length or field separators

• Packages jobs as one .phar archive to ease deployment

• Also creates a ready-to-rock shell script to invoke the job

Page 74

written by

Page 75
Page 77

HANDS-ON
Hadoop Streaming & PHP in Action

Page 78

The End

Page 79

RESOURCES

• http://www.cloudera.com/developers/learn-hadoop/

• Tom White: Hadoop: The Definitive Guide. O’Reilly, 2009

• http://www.cloudera.com/hadoop/

• Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …

Page 80

Questions?