Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

David Zuelke

Presentation given at the International PHP Conference 2012 Spring Edition in Berlin, Germany.
Transcript
Page 1: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP

Page 2: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

David Zuelke

Page 3: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

David Zülke

Page 4: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)
Page 5: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Founder

Page 8: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Lead Developer

Page 11: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

THE BIG DATA CHALLENGE
Distributed And Parallel Computing

Page 12: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

we want to process data

Page 13: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

how much data exactly?

Page 14: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

SOME NUMBERS

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

• 12 TB (March 2010)

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

Page 15: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

what if you have that much data?

Page 16: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

what if you have just 1% of that amount?

Page 17: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

“No Problemo”, you say?

Page 18: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

reading 180 GB sequentially off a disk will take ~45 minutes
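
(For scale: that works out to a sustained read rate of roughly 65–70 MB/s, typical for a single commodity disk of the era: 180,000 MB ÷ ~67 MB/s ≈ 2,700 s ≈ 45 minutes.)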

Page 19: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

and you only have 16 to 64 GB of RAM per computer

Page 20: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

so you can't process everything at once

Page 21: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

general rule of modern computers:

Page 22: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

data can be processed much faster than it can be read

Page 23: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

solution: parallelize your I/O

Page 24: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

but now you need to coordinate what you’re doing

Page 25: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

and that’s hard

Page 26: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

what if a node dies?

Page 27: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 28: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

ENTER: OUR HERO
Introducing MapReduce

Page 29: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

in the olden days, the workload was distributed across a grid

Page 30: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

and the data was shipped around between nodes

Page 31: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

or even stored centrally on something like a SAN

Page 32: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

which was fine for small amounts of information

Page 33: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

but today, on the web, we have big data

Page 34: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

I/O bottleneck

Page 35: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

along came a Google publication in 2004

Page 36: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 37: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

now the data is distributed

Page 38: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

computing happens on the nodes where the data already is

Page 39: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

processes are isolated and don’t communicate (share-nothing)

Page 40: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 41: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 42: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 43: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

REDUCER WILL RECEIVE THIS

IP Bytes

212.122.174.13    18271
212.122.174.13    191726
212.122.174.13    198
212.122.174.13    43
74.119.8.111      91272
74.119.8.111      8371

Page 44: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 45: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

PSEUDOCODE

Input (Apache access.log):

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}

Output:

212.122.174.13    210238
74.119.8.111      99643

Page 46: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

A YELLOW ELEPHANT
Introducing Apache Hadoop

Page 48: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Doug Cutting

Page 49: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Hadoop is a MapReduce framework

Page 50: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

it allows us to focus on writing Mappers, Reducers etc.

Page 51: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

and it works extremely well

Page 52: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

how well exactly?

Page 53: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HADOOP AT FACEBOOK (I)

• Predominantly used in combination with Hive (~95%)

• 8400 cores with ~12.5 PB of total storage

• 8 cores, 12 TB storage and 32 GB RAM per node

• 1x Gigabit Ethernet for each server in a rack

• 4x Gigabit Ethernet from rack switch to core

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hadoop is aware of racks and locality of nodes

Page 54: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HADOOP AT FACEBOOK (II)

• Daily stats:

• 25 TB logged by Scribe

• 135 TB of compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

• New data per day:

• I/08: 200 GB

• II/09: 2 TB (compressed)

• III/09: 4 TB (compressed)

• I/10: 12 TB (compressed)

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Page 55: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest Cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig
http://wiki.apache.org/hadoop/PoweredBy

Page 56: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

OTHER NOTABLE USERS

• Twitter (storage, logging, analysis. Heavy users of Pig)

• Rackspace (log analysis; data pumped into Lucene/Solr)

• LinkedIn (contact suggestions)

• Last.fm (charts, log analysis, A/B testing)

• The New York Times (converted 4 TB of scans using EC2)

Page 57: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

JOB PROCESSING
How Hadoop Works

Page 58: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Just like I already described! It’s MapReduce! \o/

Page 59: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

BASIC RULES

• Uses Input Formats to split up your data into single records

• You can optimize using combiners to reduce locally on a node

• Only possible in some cases, e.g. for max(), but not for avg() (see the sketch after this list)

• You can control partitioning of map output yourself

• Rarely useful; the default partitioner (key hash) is enough

• And a million other things that really don’t matter right now ;)
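
A quick sketch of why combining locally is safe for max() but not for avg(), in plain PHP (the two arrays are hypothetical per-node values, not from the talk):

<?php
// values seen by two different nodes (hypothetical data)
$node1 = array(3, 9, 4);
$node2 = array(7, 2);

// max() combines safely: the max of the local maxima is the global max
$globalMax = max(max($node1), max($node2)); // 9, same as max over all five values

// avg() does not: the average of the local averages ignores how many
// values each node contributed
$avgOfAvgs = ((array_sum($node1) / count($node1))
            + (array_sum($node2) / count($node2))) / 2;   // ~4.92
$trueAvg   = array_sum(array_merge($node1, $node2)) / 5;  // 5.0

(A distributed average can still be computed by combining <sum, count> pairs instead of plain averages.)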

Page 60: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HDFS
Hadoop Distributed File System

Page 61: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HDFS

• Stores data in blocks (default block size: 64 MB; see the example after this list)

• Designed for very large data sets

• Designed for streaming rather than random reads

• Write-once, read-many (although appending is possible)

• Capable of compression and other cool things
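
For example: a 200 MB file ends up as three full 64 MB blocks plus one 8 MB block (a partial last block only occupies the space it actually needs), and with the default 3x replication it consumes roughly 600 MB of raw cluster storage.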

Page 62: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HDFS CONCEPTS

• Large blocks minimize the number of seeks and maximize throughput

• Blocks are stored redundantly (3 replicas by default)

• Aware of infrastructure characteristics (nodes, racks, ...)

• Datanodes hold blocks

• Namenode holds the metadata

The Namenode is a critical component of an HDFS cluster: a single point of failure, so it needs extra care (HA)

Page 63: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

there’s just one little problem

Page 64: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

you need to write Java code

Page 65: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

however, there is hope...

Page 66: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

STREAMING
Hadoop Won’t Force Us To Use Java

Page 67: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Hadoop Streaming can use any script as Mapper or Reducer

Page 68: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

many configuration options (parsers, formats, combining, …)

Page 69: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

it works using STDIN and STDOUT

Page 70: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Mappers are streamed the records (usually by line: <line>\n)

and emit key/value pairs: <key>\t<value>\n
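
As a concrete (if simplified) sketch, a streaming Mapper for the access.log example could look like this in PHP; the regular expression is deliberately crude and assumes the byte count is the last field of each line:

#!/usr/bin/env php
<?php
// mapper.php: read log lines from STDIN, emit "<ip>\t<bytes>\n" on STDOUT
while (($line = fgets(STDIN)) !== false) {
    // first field: client IP; last field: bytes transferred
    if (preg_match('/^(\S+) .* (\d+)$/', trim($line), $matches)) {
        echo $matches[1], "\t", $matches[2], "\n";
    }
}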

Page 71: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 72: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Caution: no separate Reducer processes per key (but keys are sorted)
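
Because of that, a streaming Reducer must detect key changes itself. A minimal PHP sketch that sums bytes per IP:

#!/usr/bin/env php
<?php
// reducer.php: read sorted "<key>\t<value>\n" pairs from STDIN,
// sum the values per key, and emit a total whenever the key changes
$currentKey = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    $parts = explode("\t", rtrim($line, "\r\n"), 2);
    if (count($parts) !== 2) {
        continue; // skip malformed lines
    }
    list($key, $value) = $parts;
    if ($key !== $currentKey) {
        if ($currentKey !== null) {
            echo $currentKey, "\t", $sum, "\n";
        }
        $currentKey = $key;
        $sum = 0;
    }
    $sum += (int)$value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n"; // flush the last key
}

Both scripts can then be handed to Hadoop Streaming roughly like this (the streaming jar’s path and name vary between Hadoop versions and distributions):

hadoop jar /path/to/hadoop-streaming.jar \
    -input /logs/access.log \
    -output /results/traffic-per-ip \
    -mapper mapper.php -reducer reducer.php \
    -file mapper.php -file reducer.php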

Page 73: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

STREAMING WITH PHP
Introducing HadooPHP

Page 74: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

HADOOPHP

• A little framework to help with writing mapred jobs in PHP

• Takes care of input splitting, can do basic decoding et cetera

• Automatically detects and handles Hadoop settings such as key length or field separators

• Packages jobs as one .phar archive to ease deployment

• Also creates a ready-to-rock shell script to invoke the job

Page 75: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

written by

Page 76: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)
Page 77: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

DEMO
Hadoop Streaming & PHP in Action

Page 78: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

The End

Page 79: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

RESOURCES

• Book: Tom White, Hadoop: The Definitive Guide, O’Reilly, 2009

• Cloudera Distribution: http://www.cloudera.com/hadoop/

• Also: http://www.cloudera.com/developers/learn-hadoop/

• From this talk:

• Logs: http://infochimps.com/datasets/star-wars-kid-data-dump

• HadooPHP: http://github.com/dzuelke/hadoophp

Page 80: Large-Scale Data Processing with Hadoop and PHP (IPC2012SE 2012-06-05)

Questions?