Alan F. Gates Yahoo! Pig, Making Hadoop Easy

Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gates

May 27, 2015

Alan F. Gates

Yahoo!

Pig, Making Hadoop Easy

- 2 -

Who Am I?

• Pig committer and PMC member
• An architect on the Yahoo! grid team

Photo credit: Steven Guarnaccia, The Three Little Pigs

- 3 -

Motivation By Example

You have web server logs of purchases on your site. You want to find the 10 users who bought the most and the cities they live in. You also want to know what percentage of purchases they account for in those cities.

Load Logs

Find top 10 users

Store top 10 users

Join by city

Sum purchases by city

Calculate percentage

Store results

- 4 -

In Pig Latin

raw = load 'logs' as (name, city, purchase);

-- Find top 10 users
usrgrp = group raw by (name, city);
byusr = foreach usrgrp generate group as k1, SUM(raw.purchase) as utotal;
srtusr = order byusr by utotal desc;
topusrs = limit srtusr 10;
store topusrs into 'top_users';

-- Sum purchases per city
citygrp = group raw by city;
bycity = foreach citygrp generate group as k2, SUM(raw.purchase) as ctotal;

-- Join top users back to city
jnd = join topusrs by k1.city, bycity by k2;
pct = foreach jnd generate k1.name, k1.city, utotal/ctotal;
store pct into 'top_users_pct_of_city';
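To make the data flow of this script concrete, here is a minimal plain-Python model of the same steps on a tiny in-memory data set. It is only a sketch: the sample records, the function name `top_users_pct`, and the `limit` parameter are invented for illustration, and nothing here reflects how Pig executes the script.

```python
from collections import defaultdict

# Hypothetical stand-in for the 'logs' input: (name, city, purchase) records.
raw = [
    ("alice", "Pune", 30.0),
    ("alice", "Pune", 20.0),
    ("bob", "Pune", 25.0),
    ("carol", "Delhi", 40.0),
    ("dave", "Delhi", 10.0),
]


def top_users_pct(records, limit=10):
    """Return (name, city, share of that city's purchases) for the top spenders."""
    user_total = defaultdict(float)  # group by (name, city), then SUM(purchase)
    city_total = defaultdict(float)  # group by city, then SUM(purchase)
    for name, city, purchase in records:
        user_total[(name, city)] += purchase
        city_total[city] += purchase

    # order by user total, descending, then keep the first `limit` rows
    top = sorted(user_total.items(), key=lambda kv: kv[1], reverse=True)[:limit]

    # join each top user back to the city total and compute the ratio
    return [(name, city, utotal / city_total[city]) for (name, city), utotal in top]


result = top_users_pct(raw, limit=2)
```

With `limit=2`, the two biggest spenders in the sample are returned together with the fraction of their city's purchases that they account for.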

- 5 -

Translates to Four MapReduce Jobs

Job 1
• Load
• Group by user
• Sum user purchases
• Store user purchases
• Group by city
• Sum city purchases

Job 2
• Sample output of user sum to decide how to partition for order by

Job 3
• Order user sums
• Limit sums to 10

Job 4
• Join top users’ purchases with city purchases
• Store results
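The sampling job exists so that the order by can spread keys evenly over reducers while still producing a total order. The sketch below models that idea in plain Python under simplifying assumptions: the helper names `choose_boundaries` and `partition`, the synthetic skewed data, the sample size, and the choice of four reducers are all invented here, and Pig's real partitioner is more involved than this.

```python
import bisect
import random


def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points that split the sorted sample evenly."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(value, boundaries):
    """Return the index of the reducer whose key range contains value."""
    return bisect.bisect_right(boundaries, value)


random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical, deliberately skewed per-user totals.
totals = [random.expovariate(1.0) for _ in range(10000)]

# "Job 2": sample the totals and derive range boundaries from the sample.
boundaries = choose_boundaries(random.sample(totals, 500), num_reducers=4)

# "Job 3": route every total to a reducer by range, then sort within each reducer.
buckets = [[] for _ in range(4)]
for t in totals:
    buckets[partition(t, boundaries)].append(t)

# Reducer ranges do not overlap, so concatenating the sorted buckets is a total order.
merged = [t for bucket in buckets for t in sorted(bucket)]
```

Because the boundaries come from a sample of the data rather than from a fixed split of the key space, each reducer receives a roughly similar share of the rows even when the values are skewed.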

- 6 -

Performance

- 7 -

Where Do Pigs Live?

Data Collection
Data Factory: Pig (Pipelines, Iterative Processing, Research)
Data Warehouse: BI Tools, Analysis

- 8 -

Pig Highlights

• Language designed to enable efficient description of data flow
• Standard relational operators built in
• User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
• UDFs can be written to take advantage of the combiner
• Four join implementations built in: hash, fragment-replicate, merge, skewed
• Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• Order by provides total ordering across reducers in a balanced way
• Writing load and store functions is easy once an InputFormat and OutputFormat exist
• Piggybank, a collection of user contributed UDFs
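The point about UDFs and the combiner is easier to see with an aggregate that cannot simply be re-applied to its own output, such as an average. The following plain-Python sketch splits an average into three stages that carry a partial (sum, count) pair between them. The function names `initial`, `intermediate`, and `final`, and the sample values, are chosen here for illustration only; this is not Pig's UDF API.

```python
def initial(value):
    """Map side: turn one input value into a partial (sum, count) pair."""
    return (value, 1)


def intermediate(partials):
    """Combiner: merge partial pairs; safe to run any number of times."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def final(partials):
    """Reduce side: merge whatever partials remain and produce the average."""
    total, count = intermediate(partials)
    return total / count


# Hypothetical purchases for one key, spread over two map tasks.
map_task_1 = [10.0, 20.0, 30.0]
map_task_2 = [40.0]

combined_1 = intermediate([initial(v) for v in map_task_1])
combined_2 = intermediate([initial(v) for v in map_task_2])
average = final([combined_1, combined_2])
```

Shipping (sum, count) pairs instead of raw values shrinks the data sent to the reducer, and it stays correct where averaging the two per-task averages (20 and 40) would not.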

- 15 -

Who uses Pig for What?

• 70% of production grid jobs at Yahoo! (tens of thousands per day)
• Also used by Twitter, LinkedIn, eBay, AOL, …
• Used to:
  – Process web logs
  – Build user behavior models
  – Process images
  – Build maps of the web
  – Do research on raw data sets

- 16 -

Components

User machine

Hadoop Cluster

Pig resides on user machine

Job executes on cluster

No need to install anything extra on your Hadoop cluster.

Accessing Pig:
• Submit a script directly
• Grunt, the Pig shell
• PigServer Java class, a JDBC-like interface

- 17 -

How It Works

Pig Latin

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

Execution Plan
Map: Filter, Count
Combine/Reduce: Sum

pig.jar:
• parses
• checks
• optimizes
• plans execution
• submits jar to Hadoop
• monitors job progress
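The execution plan on this slide (filter and count in the map, sum in the combine/reduce) can be modeled in a few lines of plain Python. This is a sketch only: the function names `map_phase` and `reduce_phase`, the two input splits, and their contents are invented for illustration and do not come from Pig.

```python
from collections import Counter


def map_phase(split):
    """Keep rows with x > 0 and emit a partial count per x for this split."""
    counts = Counter()
    for x, y, z in split:
        if x > 0:
            counts[x] += 1
    return counts


def reduce_phase(partial_counts):
    """Sum the partial counts for each x; a combiner would do the same per map task."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)


# Hypothetical contents of 'myfile', divided into two input splits.
split_1 = [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f")]
split_2 = [(1, "g", "h"), (0, "i", "j"), (2, "k", "l"), (2, "m", "n")]

output = reduce_phase([map_phase(split_1), map_phase(split_2)])
```

Counting in the map and summing afterwards gives the same result as counting all rows in one place, which is why the count can be pushed ahead of the shuffle.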

- 18 -

New in 0.8

• UDFs can be written in Jython
• Improved and expanded statistics
• Performance improvements
  – Automatic merging of small files
  – Compression of intermediate results
• PigUnit for unit testing your Pig Latin scripts
• Access to static Java functions as UDFs
• Improved HBase integration
• Custom partitioners
  B = group A by $0 partition by YourPartitioner parallel 2;
• Greatly expanded string and math built-in UDFs
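As an illustration of the first item, a Jython UDF might look roughly like the sketch below. The file name `myudfs.py` and the function `email_domain` are hypothetical, and the sketch assumes that Pig's Jython support supplies an `outputSchema` decorator when the file is registered; the fallback definition is there only so the file can also be imported and tested outside Pig.

```python
# myudfs.py (hypothetical file name)

try:
    outputSchema  # assumption: provided by Pig when the file is registered
except NameError:
    # Stand-in so the module also works outside Pig, for example in unit tests.
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an e-mail address, or None if malformed."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()
```

Under those assumptions, a script would register the file with a statement along the lines of `register 'myudfs.py' using jython as myudfs;` and then call `myudfs.email_domain(...)` inside a `foreach ... generate`.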

- 19 -

What’s Next?

• Preview of Pig 0.9
  – Integrate Pig with scripting languages for control flow
  – Add macros to Pig Latin
  – Revive ILLUSTRATE
  – Fix runtime type errors
  – Rewrite parser to give more useful error messages
• Programming Pig from O’Reilly Press

- 20 -

Learn More

• Online documentation: http://pig.apache.org/
• Hadoop: The Definitive Guide, 2nd edition, has an up-to-date chapter on Pig; search at your favorite bookstore
• Join the mailing lists:
  – [email protected] for user questions
  – [email protected] for developer issues
• Follow me on Twitter, @alanfgates