Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
SIGMOD’08
Presented By Sandeep Patidar
Modified from original Pig Latin talk
Dec 29, 2015
2
Outline
• Map-Reduce and the Need for Pig Latin
• Pig Latin Example
• Features and Motivation
• Pig Latin
• Implementation
• Debugging Environment
• Usage Scenarios
• Related Work
• Future Work
3
Data Processing Renaissance
Internet companies swimming in data
• E.g. TBs/day at Yahoo!
Data analysis is the “inner loop” of product innovation
Data analysts are skilled programmers
4
Data Warehousing …?
Scale: often not scalable enough
$$$$: prohibitively expensive at web scale
• Up to $200K/TB
SQL:
• Little control over execution method
• Query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs
5
New Systems For Data Analysis
Map-Reduce
Apache Hadoop
Dryad
6
Map-Reduce
Map: performs the group by
Reduce: performs the aggregation
These are two high-level declarative primitives that enable parallel processing
7
Execution overview of Map-Reduce [2]
1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a task.
3) A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
8
Execution overview of Map-Reduce [2]
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns to the user code.
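The map, partition, sort and reduce steps of the execution overview can be sketched on a single machine in Python. This is only an illustration of the data flow, not the Map-Reduce library itself: the function name `run_map_reduce`, the use of Python's built-in `hash` as the partitioning function, and the word-count example are all assumptions made for the sketch, and the master, workers, disks and remote procedure calls are left out.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn, num_reducers=2):
    """Single-process sketch of the map, partition, sort and reduce steps."""
    # Steps 3-4: call the user's map function on every input pair and
    # partition the intermediate pairs into R regions (one per reduce task).
    regions = [[] for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            regions[hash(out_key) % num_reducers].append((out_key, out_value))

    output = []
    for region in regions:
        # Step 5: sort by intermediate key, because many different keys
        # can land in the same region.
        region.sort(key=lambda pair: pair[0])
        grouped = defaultdict(list)
        for out_key, out_value in region:
            grouped[out_key].append(out_value)
        # Step 6: one reduce call per unique key.
        for out_key in sorted(grouped):
            output.append(reduce_fn(out_key, grouped[out_key]))
    return output


def word_count_map(_doc_id, text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in text.split()]


def word_count_reduce(word, counts):
    return (word, sum(counts))


documents = [("doc1", "a b a"), ("doc2", "b c")]
word_counts = sorted(run_map_reduce(documents, word_count_map, word_count_reduce))
```

The result is sorted at the end only because the order in which regions are processed depends on the hash values.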
9
[Figure: Map-Reduce data flow]
Input records: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
map: records are grouped by key, giving k1: (v1, v3, v5) and k2: (v2, v4)
reduce: produces the output records
10
Map-Reduce Appeal
Scale: scalable due to simpler design
• Only parallelizable operations
• No transactions
$: runs on cheap commodity hardware
Procedural control: a processing “pipe” (in contrast to SQL)
11
Limitations of Map-Reduce
1. Extremely rigid data flow
• Other flows constantly hacked in: Join, Union, Split, Chains
[Figure: a single M → R flow, next to other flows built from chained M and R stages]
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
12
Pros And Cons
Need a high-level, general data flow language
[Figure: high-level declarative language versus low-level procedural language]
13
Enter Pig Latin
Need a high-level, general data flow language: Pig Latin
15
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
16
Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
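To make the step-by-step reading of the program concrete, here is a rough Python rendering of the same four steps over a tiny in-memory table. The sample rows and the constant `MIN_GROUP_SIZE` (standing in for the 10^6 threshold so that the toy data produces a result) are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sample of the urls table: (url, category, pagerank).
urls = [
    {"url": "a.com", "category": "News", "pagerank": 0.9},
    {"url": "b.com", "category": "News", "pagerank": 0.5},
    {"url": "c.com", "category": "News", "pagerank": 0.1},
    {"url": "d.com", "category": "Photos", "pagerank": 0.7},
]
MIN_GROUP_SIZE = 1  # stand-in for the 10^6 threshold in the slide

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u["pagerank"] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u["category"]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > threshold;
big_groups = {c: g for c, g in groups.items() if len(g) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {
    category: sum(u["pagerank"] for u in group) / len(group)
    for category, group in big_groups.items()
}
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) can be inspected on its own, which is the property the dataflow style is meant to offer.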
17
Data Flow
Filter good_urls by pagerank > 0.2
Group by category
Filter category by count > 10^6
Foreach category generate avg. pagerank
18
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
19
Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
20
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
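As a cross-check of what this program computes, the same flow can be sketched in Python over the two sample tables from the task slide. The function name `top_urls_per_category` is hypothetical, and the join is written as an inner join, which is an assumption about the intended semantics (a page with no visits, such as espn.com, drops out).

```python
from collections import Counter, defaultdict

visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join)
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            # group by category
            by_category[category_of[url]].append((url, count))
    # foreach category generate the n most visited urls
    return {
        category: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for category, pairs in by_category.items()
    }


top_urls = top_urls_per_category(visits, url_info)
```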
22
Dataflow Language
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!
User specifies a sequence of steps where each step specifies only a single high-level data transformation
23
Step by step execution
A Pig Latin program supplies an explicit sequence of operations, but the operations do not have to be executed in that order
e.g., the set of urls of pages that are classified as spam but have a high pagerank score:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
isSpam might be an expensive UDF. Then it will be much better to filter the urls by pagerank first.
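The benefit of reordering the two filters can be seen by counting how often the expensive function runs. The sketch below is an assumption-laden toy: `is_spam` is a hypothetical stand-in for the isSpam UDF (a simple substring test), and the four sample urls are made up.

```python
calls = {"is_spam": 0}


def is_spam(url):
    # Hypothetical stand-in for an expensive UDF; counts its invocations.
    calls["is_spam"] += 1
    return "spam" in url


# (url, pagerank) pairs, invented for the example.
urls = [("spam-a.com", 0.9), ("spam-b.com", 0.1), ("news.com", 0.95), ("blog.com", 0.2)]


def as_written(urls):
    # Order in which the program states the steps: UDF first.
    spam_urls = [u for u in urls if is_spam(u[0])]
    return [u for u in spam_urls if u[1] > 0.8]


def reordered(urls):
    # Cheap pagerank filter first, so the UDF sees fewer tuples.
    high_rank = [u for u in urls if u[1] > 0.8]
    return [u for u in high_rank if is_spam(u[0])]


calls["is_spam"] = 0
result_as_written = as_written(urls)
calls_as_written = calls["is_spam"]

calls["is_spam"] = 0
result_reordered = reordered(urls)
calls_reordered = calls["is_spam"]
```

Both orders return the same tuples because the two filter conditions are independent of each other; only the number of UDF calls differs.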
24
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
Operates directly over files
Schemas optional; can be assigned dynamically
gVisits = group visits by $1;
Where $1 uses positional notation to refer to the second field
25
Nested Data Model
Pig Latin has a flexible, fully nested data model (described later)
• Allows complex, non-atomic data types such as sets, maps, and tuples
Nested model is closer to how programmers think than normalization (1NF)
Avoids expensive joins for web-scale data
Allows programmers to easily write UDFs
26
UDFs as First-Class Citizens
User-Defined Functions (UDFs) can be used in every construct
• Load, Store, Group, Filter, Foreach
Example 2
Suppose we want to find, for each category, the top 10 urls according to pagerank
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
28
Data Model
Atom: contains a simple atomic value
Tuple: sequence of fields
Bag: collection of tuples with possible duplicates
[Figure: example values ‘alice’, ‘lanker’, ‘ipod’ labeled Atom and Tuple]
29
Map: collection of data items, where each item has an associated key through which it can be looked up
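One way to picture the four types is to map them onto ordinary Python values. This correspondence is only an analogy chosen for illustration (string for atom, `tuple` for tuple, `list` for bag because duplicates are allowed, `dict` for map), and the sample values are invented.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; fields may themselves be nested values.
pair = ("lakers", 1)

# Bag: a collection of tuples, duplicates allowed (so a list, not a set).
bag = [("lakers", 1), ("iPod", 2), ("lakers", 1)]

# Map: items looked up through an associated key.
mapping = {"age": 20}

# A fully nested tuple: an atom, a bag and a map as its three fields.
t = (atom, bag, mapping)

first_field = t[0]                        # positional access, like $0
second_field_size = len(t[1])             # the nested bag keeps its duplicate
projected = [(row[0],) for row in t[1]]   # keep only the first field of each nested tuple
age = t[2]["age"]                         # map lookup by key
```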
30
Pig Latin Commands
Specifying Input Data: LOAD
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Per-tuple Processing: FOREACH
expand_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
31
Pig Latin Commands (Cont.)
Discarding Unwanted Data: FILTER
real_queries = FILTER queries BY userId neq 'bot';
or
real_queries = FILTER queries BY NOT isBot(userId);
Filtering conditions involve combinations of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT
32
Expressions in Pig Latin
33
Example of flattening in FOREACH
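Since the figure for this slide did not survive, the following Python sketch illustrates the idea of flattening in a FOREACH, reusing the expandQuery example from the FOREACH slide. The helper `foreach_generate`, the behaviour of `expand_query` (appending " news" and " rumors") and the sample queries are assumptions for illustration.

```python
def expand_query(query_string):
    # Hypothetical UDF: returns a bag (list) of one-field tuples.
    return [(query_string + " news",), (query_string + " rumors",)]


def foreach_generate(queries, udf, flatten=False):
    """Apply udf to each query tuple, optionally flattening the bag it returns."""
    out = []
    for user_id, query_string, _timestamp in queries:
        expansions = udf(query_string)
        if flatten:
            # One output tuple per element of the nested bag.
            out.extend((user_id,) + expansion for expansion in expansions)
        else:
            # One output tuple per input tuple, with the bag kept nested.
            out.append((user_id, expansions))
    return out


queries = [("alice", "lakers", 1), ("bob", "iPod", 3)]
nested = foreach_generate(queries, expand_query)
flat = foreach_generate(queries, expand_query, flatten=True)
```

Without flattening there is one output tuple per input tuple; with flattening the nested bag is unrolled into one output tuple per expansion.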
34
Pig Latin Commands (Cont.)
Getting Related Data Together: COGROUP
Suppose we have two data sets
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;
35
COGROUP versus JOIN
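The difference between the two operators can be sketched in Python: COGROUP keeps one nested bag per input for every key, while JOIN corresponds to taking the cross product of those bags and flattening it. The sample rows and the helper names `cogroup` and `join` are assumptions for illustration.

```python
# (queryString, url, position)
result = [
    ("lakers", "nba.com", 1),
    ("lakers", "espn.com", 2),
    ("kings", "nhl.com", 1),
]
# (queryString, adSlot, amount)
revenue = [
    ("lakers", "top", 50),
    ("lakers", "side", 20),
    ("kings", "top", 30),
]


def cogroup(left, right):
    """One output tuple per key: (key, bag of left tuples, bag of right tuples)."""
    keys = sorted({row[0] for row in left} | {row[0] for row in right})
    return [
        (key, [r for r in left if r[0] == key], [r for r in right if r[0] == key])
        for key in keys
    ]


def join(left, right):
    """Equi-join expressed as a cogroup followed by flattening both bags."""
    return [
        l_row + r_row
        for _key, l_bag, r_bag in cogroup(left, right)
        for l_row in l_bag
        for r_row in r_bag
    ]


grouped_data = cogroup(result, revenue)
join_result = join(result, revenue)
```

Keeping the bags nested lets a later step, such as a UDF, see all tuples for a key at once instead of their cross product.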
36
Pig Latin Example 3
Suppose we were trying to attribute search revenue to search-result urls to figure out the monetary worth of each url.
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));
Where distributeRevenue is a UDF that accepts the search results and revenue info for one query string at a time, and outputs a bag of urls and the revenue attributed to them.
37
Pig Latin Commands (Cont.)
Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
JOIN in Pig Latin
join_result = JOIN result BY queryString, revenue BY queryString;
38
Pig Latin Commands (Cont.)
Map-Reduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
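The three statements above can be mirrored one for one in Python, which shows why a map-reduce job is just a special case of the dataflow. The function name `map_reduce_as_dataflow` and the word-count map and reduce functions are assumptions for the sketch.

```python
from itertools import groupby


def map_reduce_as_dataflow(input_bag, map_fn, reduce_fn):
    # map_result = FOREACH input GENERATE FLATTEN(map(*));
    map_result = [pair for record in input_bag for pair in map_fn(record)]

    # key_group = GROUP map_result BY $0;
    map_result.sort(key=lambda pair: pair[0])
    key_group = [
        (key, list(pairs))
        for key, pairs in groupby(map_result, key=lambda pair: pair[0])
    ]

    # output = FOREACH key_group GENERATE reduce(*);
    return [reduce_fn(key, pairs) for key, pairs in key_group]


def tokenize(line):
    # Map function: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]


def total(word, pairs):
    # Reduce function: sum the counts for one word.
    return (word, sum(count for _word, count in pairs))


lines = ["a b", "b c b"]
counts = map_reduce_as_dataflow(lines, tokenize, total)
```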
39
Pig Latin Commands (Cont.)
Other Commands
• UNION: Returns the union of two or more bags
• CROSS: Returns the cross product
• ORDER: Orders a bag by the specified field(s)
• DISTINCT: Eliminates duplicate tuples in a bag
Nested Operations
• Pig Latin allows some commands to be nested within a FOREACH command
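For readers who think in terms of ordinary collections, the four commands listed above correspond roughly to the following Python operations on lists of tuples. The helper names and sample bags are hypothetical, and details such as the output order of DISTINCT are choices made for this sketch rather than statements about Pig.

```python
from itertools import product
from operator import itemgetter


def union(*bags):
    # Bag union keeps duplicates.
    return [row for bag in bags for row in bag]


def cross(left, right):
    # Every left tuple paired with every right tuple.
    return [l_row + r_row for l_row, r_row in product(left, right)]


def order(bag, *field_positions):
    # Sort by the given field position(s).
    return sorted(bag, key=itemgetter(*field_positions))


def distinct(bag):
    # Drop duplicate tuples, keeping first occurrences.
    return list(dict.fromkeys(bag))


a = [("x", 2), ("y", 1)]
b = [("y", 1), ("z", 3)]

all_rows = union(a, b)
unique_rows = distinct(all_rows)
by_second_field = order(all_rows, 1)
pairs = cross(a, b)
```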
40
Pig Latin Commands (Cont.)
Asking for Output: STORE
The user can ask for the result of a Pig Latin expression sequence to be materialized to a file
STORE query_revenue INTO 'myoutput' USING myStore();
myStore is a custom serializer. For a plain text file, it can be omitted
42
Implementation
[Figure: stack with labels USER, SQL, “automatic rewrite + optimize”, Pig, Hadoop Map-Reduce, cluster]
Pig is open-source.
http://incubator.apache.org/pig
43
Building a Logical Plan
The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid
It builds a logical plan for every bag that the user defines
Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and is executed)
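The lazy behaviour described above (plans are built as commands are entered, and nothing runs until STORE) can be imitated with a tiny Python class. This is a toy under stated assumptions: the class name `Bag`, its methods and the in-memory "execution" are all hypothetical and are not Pig's internal design.

```python
class Bag:
    """A node in a toy logical plan; nothing is computed until store()."""

    def __init__(self, op, parent=None, arg=None):
        self.op, self.parent, self.arg = op, parent, arg

    @staticmethod
    def load(rows):
        return Bag("LOAD", arg=list(rows))

    def filter(self, predicate):
        return Bag("FILTER", self, predicate)

    def foreach(self, fn):
        return Bag("FOREACH", self, fn)

    def plan(self):
        # The plan for a bag is the chain of operators leading to it.
        steps = [] if self.parent is None else self.parent.plan()
        return steps + [self.op]

    def store(self):
        # Only here is the plan actually executed.
        if self.op == "LOAD":
            return list(self.arg)
        rows = self.parent.store()
        if self.op == "FILTER":
            return [row for row in rows if self.arg(row)]
        return [self.arg(row) for row in rows]


seen = []


def keep_even(row):
    seen.append(row)  # records when the predicate really runs
    return row % 2 == 0


evens = Bag.load([1, 2, 3, 4]).filter(keep_even).foreach(lambda row: row * 10)
planned = evens.plan()
evaluated_before_store = list(seen)
stored = evens.store()
```

Defining `evens` only records the operators; the predicate is first called inside `store()`.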
44
Map-Reduce Plan Compilation
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
45
Compilation into Map-Reduce
Filter good_urls by pagerank > 0.2
Group by category
Filter category by count > 10^6
Foreach category generate avg. pagerank
Map-reduce phases in the figure: Map1, Reduce1
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
46
Compilation into Map-Reduce
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10(urls)
Map-reduce phases in the figure: Map1, Reduce1, Map2, Reduce2, Map3, Reduce3
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
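The boundary rule can be sketched as a small Python function that cuts a list of operators into map-reduce jobs. It is a simplification under explicit assumptions: the plan is treated as a single linear chain (so the LOAD steps of the two-input example are left out), operators following a boundary are placed in that job's reduce phase, and the function name `compile_to_jobs` is hypothetical.

```python
BOUNDARY_OPS = {"GROUP", "COGROUP", "JOIN"}


def compile_to_jobs(operations):
    """Split a linear chain of operators into map-reduce jobs."""
    jobs = []
    pending_map = []  # operators seen before the first boundary
    for op in operations:
        name = op.split()[0].upper()
        if name in BOUNDARY_OPS:
            # Every group or join starts a new map-reduce job.
            jobs.append({"map": pending_map, "boundary": op, "reduce": []})
            pending_map = []
        elif jobs:
            # Later operators are pipelined into the reduce phase.
            jobs[-1]["reduce"].append(op)
        else:
            pending_map.append(op)
    if pending_map:
        # No group or join at all: a single map-only job.
        jobs.append({"map": pending_map, "boundary": None, "reduce": []})
    return jobs


example_1 = [
    "FILTER good_urls by pagerank > 0.2",
    "GROUP by category",
    "FILTER category by count",
    "FOREACH category generate avg. pagerank",
]
example_2 = [
    "GROUP by url",
    "FOREACH url generate count",
    "JOIN on url",
    "GROUP by category",
    "FOREACH category generate top10(urls)",
]
jobs_1 = compile_to_jobs(example_1)
jobs_2 = compile_to_jobs(example_2)
```

On these inputs the first example yields one job (Map1, Reduce1) and the second yields three, matching the phase labels on the two slides.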
47
Efficiency With Nested Bags
The (CO)GROUP command places tuples belonging to the same group into one or more nested bags
The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine’s main memory
One common case is where the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
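An aggregation function is algebraic when it can be computed from partial results over pieces of the bag, so the whole bag never has to sit in memory at once. A minimal Python sketch for AVG follows; the three function names and the way the bag is split into chunks are assumptions for illustration.

```python
def initial(values):
    # Partial result for one chunk of the bag: (sum, count).
    return (sum(values), len(values))


def combine(partials):
    # Partial results merge into another partial result.
    return (
        sum(part_sum for part_sum, _count in partials),
        sum(count for _part_sum, count in partials),
    )


def final(partial):
    part_sum, count = partial
    return part_sum / count


# The bag for one group, seen as three chunks that never need to be
# held in memory together.
chunks = [[1, 2], [3], [4, 5, 6]]
partials = [initial(chunk) for chunk in chunks]
average = final(combine(partials))
```

COUNT and SUM follow the same pattern with simpler partial results; a function such as a median does not decompose this way.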
48
Debugging Environment
The process of constructing a Pig Latin program is iterative
• User makes an initial stab at writing a program
• Submits it to the system for execution
• Inspects the output
Each such run over a large data set can take a long time. To avoid this inefficiency, users often create a side data set
• Unfortunately this method does not always work well
Pig comes with a debugging environment called Pig Pen, which creates a side data set automatically
49
Pig Pen screen shot
50
Generating a Sandbox Data Set
There are three primary objectives in selecting a sandbox data set
• Realism: the sandbox data set should be a subset of the actual data set
• Conciseness: example bags should be as small as possible
• Completeness: example bags should collectively illustrate the key semantics of each command
51
Usage Scenarios
Session Analysis:
Web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate
• How long is the average user session?
• How many links does a user click on before leaving a website?
• How do click patterns vary in the course of a day/week/month?
Analysis tasks mainly consist of grouping the activity log by user and/or website
First production release about a year ago
At Yahoo! 30% of all Hadoop jobs are run with Pig
52
Related Work
Sawzall
• Scripting language used at Google on top of map-reduce
• Rigid structure consisting of a filtering phase followed by an aggregation phase
DryadLINQ
• SQL-like language on top of Dryad, used at Microsoft
Nested Data Models
• Explored before in the context of object-oriented databases
• Also explored in data-parallel languages over nested data, e.g., NESL
53
Future Work
Safe Optimizer
• Performs only high-confidence rewrites
User Interface
• “Boxes and arrows” GUI
• Promote collaboration, sharing code fragments and UDFs
External Functions
• Tight integration with a scripting language such as Perl or Python
Unified Environment
54
Summary
Big demand for parallel data processing
• Emerging tools that do not look like SQL DBMS
• Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But, Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL
55
References
[1] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proc. SIGMOD, 2008.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt
56
Thank you