MapReduce Application Scripting

8: MapReduce Application Scripting

Zubair Nabi

May 25, 2013

Outline

1 Pig Latin

2 Cascading


Introduction

MapReduce is too low-level and rigid, and leads to a lot of custom user code

Pig Latin is a high-level dataflow language atop MapReduce, designed at Yahoo!

  - Finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce

The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop


SQL query to find the average pagerank for each large category of URLs

    SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 1000000


Equivalent Pig query

    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
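
The step-by-step shape of the Pig query can be mirrored in plain Java, which may help readers who know neither SQL nor Pig. This is only a sketch: the Url class, the sample data, and the minGroupSize parameter (standing in for the one-million threshold) are assumptions made for illustration, and nothing here runs on Hadoop.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical record type and data; only the four steps mirror the Pig query.
final class UrlDataflow {
    static final class Url {
        final String category;
        final double pagerank;
        Url(String category, double pagerank) {
            this.category = category;
            this.pagerank = pagerank;
        }
    }

    static Map<String, Double> averagePagerank(List<Url> urls, long minGroupSize) {
        // good_urls = FILTER urls BY pagerank > 0.2
        List<Url> goodUrls = urls.stream()
                .filter(u -> u.pagerank > 0.2)
                .collect(Collectors.toList());
        // groups = GROUP good_urls BY category
        Map<String, List<Url>> groups = goodUrls.stream()
                .collect(Collectors.groupingBy(u -> u.category));
        // big_groups = FILTER groups BY COUNT(good_urls) > minGroupSize
        // output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
        Map<String, Double> output = new TreeMap<>();
        groups.forEach((category, members) -> {
            if (members.size() > minGroupSize) {
                double avg = members.stream().mapToDouble(u -> u.pagerank).average().orElse(0.0);
                output.put(category, avg);
            }
        });
        return output;
    }

    public static void main(String[] args) {
        List<Url> urls = List.of(
                new Url("news", 0.9), new Url("news", 0.5),
                new Url("news", 0.1), new Url("blogs", 0.8));
        // With minGroupSize = 1 only "news" survives: it has two URLs above 0.2.
        System.out.println(averagePagerank(urls, 1));
    }
}
```

Each intermediate variable corresponds to one named Pig relation, which is the property the dataflow style is meant to expose.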


Pig Interface

A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages

  - In contrast, SQL consists of declarative constraints that collectively define the result

Each step carries out a single data transformation

A Pig Latin program is similar to specifying a query execution plan or a dataflow graph

Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed


Features

Support for a fully nested data model with complex data types

Extensive support for user-defined functions

Ability to operate over plain, schema-less input files

Open-source Apache project


Interoperability

Queries can be performed atop raw data dumps directly

The user needs to provide a function to parse the content of the file into tuples

Similarly, the user also needs to provide a function to convert tuples into a byte sequence

Datasets can be laid across diverse data storage sources and applications
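
The pair of user-supplied functions described above can be pictured as a parser and a serializer for one record. The following is a minimal sketch assuming tab-delimited UTF-8 text; the class and method names are hypothetical and do not correspond to Pig's actual load and store interfaces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper names; Pig's real load/store interfaces differ.
final class TabDelimitedStorage {
    // "Load" side: parse one raw line of the input file into a tuple of fields.
    static List<String> parse(byte[] rawLine) {
        String line = new String(rawLine, StandardCharsets.UTF_8);
        return Arrays.asList(line.split("\t", -1)); // -1 keeps trailing empty fields
    }

    // "Store" side: convert a tuple back into a byte sequence.
    static byte[] serialize(List<String> tuple) {
        return String.join("\t", tuple).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "www.example.com\tnews\t0.9".getBytes(StandardCharsets.UTF_8);
        List<String> tuple = parse(raw);
        System.out.println(tuple);
        // Parsing and then serializing should reproduce the original bytes.
        System.out.println(Arrays.equals(raw, serialize(tuple)));
    }
}
```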


UDFs as first-class citizens

A significant part of large-scale data analysis relies on custom processing

For instance, the user may be interested in figuring out whether a particular website is spam

All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs

UDFs can take non-atomic parameters as input and produce non-atomic values as output

UDFs are defined in Java

    groups = GROUP urls BY category;
    output = FOREACH groups GENERATE category, top10(urls);
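
To make the non-atomic input and output concrete, here is a sketch of the logic a UDF such as top10 might contain: it receives a whole bag of tuples and returns a smaller bag. The Url class and the topK method are assumptions for illustration; the sketch deliberately avoids Pig's real UDF API, which is not shown in these slides.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a UDF like top10(urls); it does not use Pig's UDF API.
final class TopKSketch {
    static final class Url {
        final String address;
        final double pagerank;
        Url(String address, double pagerank) {
            this.address = address;
            this.pagerank = pagerank;
        }
    }

    // Input: a bag (here a List) of tuples. Output: a bag with the k highest-pagerank tuples.
    static List<Url> topK(List<Url> bag, int k) {
        return bag.stream()
                .sorted(Comparator.comparingDouble((Url u) -> u.pagerank).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Url> bag = List.of(new Url("a", 0.1), new Url("b", 0.9), new Url("c", 0.5));
        for (Url u : topK(bag, 2)) {
            System.out.println(u.address + " " + u.pagerank);
        }
    }
}
```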


Data Model

Pig has four data types:

1 Atom: A single atomic value such as a string or an integer

2 Tuple: A sequence of values, each with possibly a different data type

3 Bag: A collection of tuples

4 Map: A collection of data items, each with an associated key through which it can be looked up
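
The four types nest freely, which is easiest to see in a single value. Below is a rough Java analogy, assuming that a tuple is modelled as a List, a bag as a List of tuples, and a map as a java.util.Map; the sample values are illustrative only.

```java
import java.util.List;
import java.util.Map;

// A loose analogy for the nested data model, not Pig's own classes.
final class PigDataModel {
    static List<Object> sampleTuple() {
        Object atom = "alice";                      // Atom: a single value
        List<List<Object>> bag = List.of(           // Bag: a collection of tuples
                List.<Object>of("lakers", 1),
                List.<Object>of("iPod", 2));
        Map<String, Object> map = Map.of("age", 20); // Map: items looked up by key
        return List.of(atom, bag, map);              // Tuple: fields of differing types
    }

    public static void main(String[] args) {
        List<Object> tuple = sampleTuple();
        System.out.println(tuple.get(0));
        System.out.println(((List<?>) tuple.get(1)).size());
        System.out.println(((Map<?, ?>) tuple.get(2)).get("age"));
    }
}
```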


Commands

LOAD: Load and deserialize an input file

FOREACH: Process each tuple of a dataset

FILTER: Filter a dataset based on some condition or UDF

COGROUP: Group together tuples which are related in some way from one or more datasets

GROUP: Group together tuples which are related in some way from one dataset

STORE: Materialize the output of a Pig Latin expression to a file
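
COGROUP is the least familiar of these commands, so a small sketch may help. It assumes two datasets whose tuples are String arrays with the grouping key in field 0; for every key it produces one bag per input, and GROUP is the special case with a single input. All names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CoGroupSketch {
    // Result: key -> [bag of matching tuples from left, bag of matching tuples from right].
    static Map<String, List<List<String[]>>> cogroup(List<String[]> left, List<String[]> right) {
        Map<String, List<List<String[]>>> out = new TreeMap<>();
        for (String[] t : left) {
            out.computeIfAbsent(t[0], k -> emptyBags()).get(0).add(t);
        }
        for (String[] t : right) {
            out.computeIfAbsent(t[0], k -> emptyBags()).get(1).add(t);
        }
        return out;
    }

    private static List<List<String[]>> emptyBags() {
        List<List<String[]>> bags = new ArrayList<>();
        bags.add(new ArrayList<>()); // bag for the left dataset
        bags.add(new ArrayList<>()); // bag for the right dataset
        return bags;
    }

    public static void main(String[] args) {
        List<String[]> results = List.of(
                new String[] {"lakers", "nba.com"},
                new String[] {"lakers", "espn.com"},
                new String[] {"kings", "nhl.com"});
        List<String[]> revenue = Collections.singletonList(new String[] {"lakers", "top", "50"});
        cogroup(results, revenue).forEach((key, bags) ->
                System.out.println(key + ": " + bags.get(0).size() + " left, " + bags.get(1).size() + " right"));
    }
}
```

Unlike a join, the two bags are kept separate rather than being cross-multiplied, so later steps can process them however they like.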


Other Commands

UNION: Return the union of two or more bags

CROSS: Return the cross product of two or more bags

ORDER: Order a bag by a specified field

DISTINCT: Eliminate duplicate tuples in a bag
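
These four commands behave like ordinary collection operations on bags. A minimal Java sketch follows, assuming bags are modelled as Lists and that UNION keeps duplicates (so DISTINCT is a separate step); the method names simply echo the command names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Bags are modelled as Lists; this is an analogy, not Pig's implementation.
final class BagOps {
    static <T> List<T> union(List<T> a, List<T> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList()); // duplicates kept
    }

    static <A, B> List<List<Object>> cross(List<A> a, List<B> b) {
        List<List<Object>> out = new ArrayList<>();
        for (A x : a) {
            for (B y : b) {
                out.add(List.<Object>of(x, y)); // every pairing of one element from each bag
            }
        }
        return out;
    }

    static <T> List<T> order(List<T> bag, Comparator<? super T> byField) {
        return bag.stream().sorted(byField).collect(Collectors.toList());
    }

    static <T> List<T> distinct(List<T> bag) {
        return bag.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(union(List.of(1, 2), List.of(2, 3)));
        System.out.println(cross(List.of("a", "b"), List.of(1, 2)).size());
        System.out.println(order(List.of(3, 1, 2), Comparator.<Integer>naturalOrder()));
        System.out.println(distinct(List.of(1, 1, 2)));
    }
}
```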


MapReduce in Pig Latin

    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_groups = GROUP map_result BY $0;
    output = FOREACH key_groups GENERATE reduce(*);
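
The same three steps can be traced in plain Java. This sketch assumes word count as the job, so the map and reduce bodies are illustrative choices rather than anything the Pig program prescribes; it runs in a single process and only mirrors the structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceAsDataflow {
    // map(*): one input line -> zero or more (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // reduce(*): one key and its grouped values -> one output value.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> run(List<String> input) {
        // map_result = FOREACH input GENERATE FLATTEN(map(*));
        List<Map.Entry<String, Integer>> mapResult = new ArrayList<>();
        for (String line : input) {
            mapResult.addAll(map(line)); // addAll plays the role of FLATTEN
        }
        // key_groups = GROUP map_result BY $0;
        Map<String, List<Integer>> keyGroups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapResult) {
            keyGroups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // output = FOREACH key_groups GENERATE reduce(*);
        Map<String, Integer> output = new TreeMap<>();
        keyGroups.forEach((key, values) -> output.put(key, reduce(key, values)));
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));
    }
}
```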


Introduction

Many applications require a chain of MapReduce jobs

Cascading allows the creation of processing pipelines using languages that run atop the JVM

Source-pipe-sink paradigm

  - Data comes from sources
  - Pipes perform data analysis
  - Results are written to sinks


Terminology

Pipe: data stream

Tuple: data record

Branch: chain of pipes

Pipe Assembly: set of pipe branches

Tap: data source or sink

Flow: pipe assembly bound to source and sink taps

Cascade: a collection of flows, in which one flow depends on the output of another


Pipes

Base class: Pipe

Each: Analyze, transform, or filter individual tuples

Merge: Combine streams with same fields into one

GroupBy: Group tuples based on common values in a specified field

CoGroup: Join streams (similar to SQL join)

Every: Aggregate tuples

HashJoin: Similar to CoGroup but more efficient if one stream can be held in memory
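
The reason HashJoin can be cheaper is the classic hash-join idea: load the small stream into an in-memory table once and stream the large side past it. The sketch below illustrates that idea only; it assumes tuples are String arrays keyed on field 0 and is not Cascading's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {
    // Inner join on field 0. Only the small side is held in memory;
    // the large side is read once and never needs grouping or sorting.
    static List<String[]> hashJoin(Iterable<String[]> large, List<String[]> small) {
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] s : small) {
            table.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
        }
        List<String[]> joined = new ArrayList<>();
        for (String[] l : large) {
            for (String[] s : table.getOrDefault(l[0], Collections.emptyList())) {
                joined.add(new String[] {l[0], l[1], s[1]});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> large = List.of(
                new String[] {"1", "a"}, new String[] {"2", "b"}, new String[] {"1", "c"});
        List<String[]> small = Collections.singletonList(new String[] {"1", "x"});
        for (String[] row : hashJoin(large, small)) {
            System.out.println(String.join(",", row));
        }
    }
}
```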


Pipe Assemblies

Define the processing of tuple streams

  - Tuples are read from and written to taps

Processing includes filtering, transforming, organizing, and calculating

Can use multiple taps

May also define splits, merges, and joins to manipulate tuple streams


Example: Pipe Assembly

    // left-hand side: apply a function, then a filter, to each tuple
    Pipe lhs = new Pipe( "lhs" );
    lhs = new Each( lhs, new SomeFunction() );
    lhs = new Each( lhs, new SomeFilter() );

    // right-hand side
    Pipe rhs = new Pipe( "rhs" );
    rhs = new Each( rhs, new SomeFunction() );

    // join both sides, aggregate, and regroup
    Pipe join = new CoGroup( lhs, rhs );
    join = new Every( join, new SomeAggregator() );
    join = new GroupBy( join );

    // tail of the assembly
    join = new Each( join, new SomeFunction() );


Data Processing

Operation: Accept an input tuple, process it, and output zero or more tuples

Tuple: Array of fields

Field: Defines a data type, such as string, integer, etc.


Taps

Data flows in and out of taps

Represent data sources and sinks, such as local files, distributed FS files, etc.

Each tap is associated with a scheme that describes the data, such as TextLine, TextDelimited, etc.

Sinks have modes such as SinkMode.KEEP, SinkMode.REPLACE, and SinkMode.UPDATE
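
The difference between the three sink modes is easiest to see against a resource that already exists. The sketch below models one reading of the modes with a toy in-memory store: KEEP refuses to overwrite, REPLACE discards the old content, and UPDATE writes into what is already there. The class, its methods, and the exact behaviour are assumptions for illustration, not Cascading code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of sink modes; the semantics shown are an assumed simplification.
final class SinkModeSketch {
    enum Mode { KEEP, REPLACE, UPDATE }

    // A stand-in "file system": path -> lines already stored there.
    private final Map<String, List<String>> store = new HashMap<>();

    void write(String path, List<String> lines, Mode mode) {
        switch (mode) {
            case KEEP: // refuse to overwrite an existing resource
                if (store.containsKey(path)) {
                    throw new IllegalStateException("resource exists: " + path);
                }
                store.put(path, new ArrayList<>(lines));
                break;
            case REPLACE: // discard whatever was there, then write
                store.put(path, new ArrayList<>(lines));
                break;
            case UPDATE: // write into the existing resource
                store.computeIfAbsent(path, p -> new ArrayList<>()).addAll(lines);
                break;
        }
    }

    List<String> read(String path) {
        return store.get(path);
    }

    public static void main(String[] args) {
        SinkModeSketch fs = new SinkModeSketch();
        fs.write("output", List.of("a"), Mode.KEEP);
        fs.write("output", List.of("b"), Mode.REPLACE);
        fs.write("output", List.of("c"), Mode.UPDATE);
        System.out.println(fs.read("output"));
    }
}
```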


Flows

Represent entire pipelines

A pipeline reads data from a source, processes it, and then writes it to a sink


Example: Flow

    // pipe assembly: two heads joined into one tail
    Pipe lhs = new Pipe( "lhs" );
    lhs = new Each( lhs, new SomeFunction() );
    lhs = new Each( lhs, new SomeFilter() );
    Pipe rhs = new Pipe( "rhs" );
    rhs = new Each( rhs, new SomeFunction() );
    Pipe join = new CoGroup( lhs, rhs );

    // taps: two sources and one sink on the Hadoop file system
    Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
    Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
    Tap sink = new Hfs( new TextLine(), "output" );

    // bind each head to its source and the tail to the sink
    FlowDef flowDef = new FlowDef()
        .setName( "flow-name" )
        .addSource( rhs, rhsSource )
        .addSource( lhs, lhsSource )
        .addTailSink( join, sink );

    // plan the flow for execution on Hadoop
    Flow flow = new HadoopFlowConnector().connect( flowDef );


Operations

Operations manipulate data

Four kinds:

1 Function

2 Filter

3 Aggregator

4 Buffer

Take an input tuple and emit zero or more tuples

  - Filter returns a Boolean

Must be wrapped in either an Each pipe (Function, Filter) or an Every pipe (Aggregator, Buffer)
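
The four kinds differ mainly in what they see at once: a single tuple (Function, Filter) or a whole group (Aggregator, Buffer). The sketch below uses deliberately simplified, hypothetical signatures over String tuples; Cascading's real interfaces take call objects instead. The isRemove name follows Cascading's convention that a filter returning true drops the tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Simplified, hypothetical signatures; not the Cascading API.
final class OperationKinds {
    interface Function   { List<String> operate(String tuple); }   // one tuple in, zero or more out
    interface Filter     { boolean isRemove(String tuple); }       // true means: drop this tuple
    interface Aggregator { int aggregate(int soFar, String tuple); } // folds a group tuple by tuple
    interface Buffer     { List<String> operate(String key, List<String> group); } // sees a whole group

    static List<String> eachFunction(List<String> stream, Function f) {
        List<String> out = new ArrayList<>();
        for (String t : stream) out.addAll(f.operate(t));
        return out;
    }

    static List<String> eachFilter(List<String> stream, Filter f) {
        List<String> out = new ArrayList<>();
        for (String t : stream) if (!f.isRemove(t)) out.add(t);
        return out;
    }

    static Map<String, Integer> everyAggregator(Map<String, List<String>> groups, Aggregator a) {
        Map<String, Integer> out = new TreeMap<>();
        groups.forEach((key, group) -> {
            int acc = 0;
            for (String t : group) acc = a.aggregate(acc, t);
            out.put(key, acc);
        });
        return out;
    }

    static List<String> everyBuffer(Map<String, List<String>> groups, Buffer b) {
        List<String> out = new ArrayList<>();
        new TreeMap<>(groups).forEach((key, group) -> out.addAll(b.operate(key, group)));
        return out;
    }

    public static void main(String[] args) {
        List<String> words = eachFunction(List.of("a b", "", "b"), line -> Arrays.asList(line.split(" ")));
        List<String> kept = eachFilter(words, String::isEmpty);
        Map<String, List<String>> groups = kept.stream().collect(Collectors.groupingBy(w -> w));
        System.out.println(everyAggregator(groups, (n, t) -> n + 1));
        System.out.println(everyBuffer(groups, (key, group) -> List.of(key + ":" + group.size())));
    }
}
```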


Example: Wordcount

    // source tap: each input line becomes a tuple with a single "line" field
    Scheme sourceScheme = new TextLine( new Fields( "line" ) );
    Tap source = new Hfs( sourceScheme, inputPath );

    // sink tap: replace any existing output
    Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

    Pipe assembly = new Pipe( "wordcount" );

    // emit one "word" tuple per regex match, i.e. per run of non-whitespace characters
    String regex = "\\S+";
    Function function = new RegexGenerator( new Fields( "word" ), regex );
    assembly = new Each( assembly, new Fields( "line" ), function );

    // group by word and count the tuples in each group
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    Aggregator count = new Count( new Fields( "count" ) );
    assembly = new Every( assembly, count );

    // plan the flow and run it to completion
    FlowConnector flowConnector = new HadoopFlowConnector();
    Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
    flow.complete();


References

1 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.

2 Cascading 2.1 User Guide: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf

