MapReduce Application Scripting

May 06, 2015

Zubair Nabi

Workshop conducted for Teradata, Islamabad
Transcript
Page 1: MapReduce Application Scripting

8: MapReduce Application Scripting

Zubair Nabi

[email protected]

May 25, 2013

Zubair Nabi 8: MapReduce Application Scripting May 25, 2013 1 / 28

Page 2: MapReduce Application Scripting

Outline

1 Pig Latin

2 Cascading

Page 4: MapReduce Application Scripting

Introduction

MapReduce is too low-level and rigid and leads to lots of custom user code

Pig Latin is a high-level dataflow language atop MapReduce designed by Yahoo!

  - Finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce

The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop

Page 8: MapReduce Application Scripting

SQL query to find the average pagerank for each large category of URLs

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000

Page 9: MapReduce Application Scripting

Equivalent Pig query

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
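The four steps of the Pig query map onto ordinary collection operations. The following is a minimal Python sketch of that dataflow, not Pig itself; the sample records, the function name, and the `min_count` parameter (standing in for the one-million threshold) are assumptions made for illustration.

```python
from collections import defaultdict

def avg_pagerank_of_large_categories(urls, min_pagerank=0.2, min_count=2):
    """Mimic the four Pig Latin steps on an in-memory list of dicts."""
    # Step 1: FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]
    # Step 2: GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)
    # Step 3: FILTER groups BY COUNT(good_urls) > min_count
    big_groups = {c: g for c, g in groups.items() if len(g) > min_count}
    # Step 4: FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(u["pagerank"] for u in g) / len(g)
            for c, g in big_groups.items()}

# Hypothetical sample data
urls = [
    {"url": "a.com", "category": "news", "pagerank": 0.5},
    {"url": "b.com", "category": "news", "pagerank": 0.7},
    {"url": "c.com", "category": "news", "pagerank": 0.9},
    {"url": "d.com", "category": "news", "pagerank": 0.1},
    {"url": "e.com", "category": "sports", "pagerank": 0.4},
]
result = avg_pagerank_of_large_categories(urls)
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) corresponds to one Pig Latin statement, which is what makes the step-by-step style easy to inspect and debug.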

Page 10: MapReduce Application Scripting

Pig Interface

A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages

  - In contrast, SQL consists of declarative constraints that collectively define the result

Each step carries out a single data transformation

A Pig Latin program is similar to specifying a query execution plan or a dataflow graph

Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed

Page 15: MapReduce Application Scripting

Features

Support for a fully nested data model with complex data types

Extensive support for user-defined functions

Ability to operate over plain, schema-less input files

Open-source Apache project

Page 19: MapReduce Application Scripting

Interoperability

Queries can be performed atop raw data dumps directly

The user needs to provide a function to parse the content of the file into tuples

Similarly, the user also needs to provide a function to convert tuples into a byte sequence

Datasets can be laid across diverse data storage sources and applications
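The two user-supplied functions described above can be pictured as a parser and a serializer. This is a minimal Python sketch, assuming a tab-delimited (url, category, pagerank) record layout; the function names and the layout are hypothetical and are not Pig's load/store API.

```python
def parse_line(raw: bytes) -> tuple:
    """Deserialize one tab-delimited record into a (url, category, pagerank) tuple."""
    url, category, pagerank = raw.decode("utf-8").rstrip("\n").split("\t")
    return (url, category, float(pagerank))

def serialize(record: tuple) -> bytes:
    """Serialize a tuple back into a tab-delimited byte sequence."""
    return ("\t".join(str(field) for field in record) + "\n").encode("utf-8")

raw = b"a.com\tnews\t0.5\n"
record = parse_line(raw)        # bytes -> tuple, applied when loading
round_trip = serialize(record)  # tuple -> bytes, applied when storing
```

Because the query only ever sees tuples, swapping in a different pair of functions is enough to run the same query over a different file format.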

Page 23: MapReduce Application Scripting

UDFs as first-class citizens

A significant part of large-scale data analysis relies on custom processing

For instance, the user may be interested in figuring out whether a particular website is spam

All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs

UDFs can take non-atomic parameters as input and produce non-atomic values as output

UDFs are defined in Java

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
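A UDF such as `top10` takes a whole bag as input and returns a bag as output. Pig UDFs are written in Java; the sketch below only illustrates the logic in Python, and the function name `top_n`, the tuple layout, and the ranking key (the third field) are assumptions.

```python
import heapq

def top_n(bag, n=10, key=lambda t: t[2]):
    """Take a bag (list of tuples) and return a bag of the n highest-ranked tuples."""
    return heapq.nlargest(n, bag, key=key)

# Hypothetical bag of (url, category, pagerank) tuples for one group
urls = [("u%d.com" % i, "news", i / 100.0) for i in range(25)]
best = top_n(urls)
```

The input and the output are both non-atomic values, which is the property the slide highlights.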

Page 28: MapReduce Application Scripting

Data Model

Pig has four data types:

1 Atom: A single atomic value such as a string or an integer

2 Tuple: A sequence of values, each with possibly a different data type

3 Bag: A collection of tuples

4 Map: A collection of data items, each with an associated key
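The four types nest freely inside one another. As a rough analogy only, they can be pictured with Python built-ins; the values below are made up for illustration and this is not Pig's internal representation.

```python
# Atom: a single value
atom = "alice"
# Tuple: a sequence of fields, possibly of different types
record = ("alice", 30)
# Bag: a collection of tuples (duplicates are allowed)
bag = [("alice", 30), ("bob", 25), ("alice", 30)]
# Map: data items looked up by key
profile = {"name": "alice", "friends": bag}
# Fully nested: a tuple whose fields are an atom, a bag, and a map
nested = ("alice", bag, profile)
```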

Page 32: MapReduce Application Scripting

Commands

LOAD: Load and deserialize an input file

FOREACH: Process each tuple of a dataset

FILTER: Filter a dataset based on some condition or UDF

COGROUP: Group together tuples which are related in some way from one or more datasets

GROUP: Group together tuples which are related in some way from one dataset

STORE: Materialize the output of a Pig Latin expression to a file
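The difference between GROUP and COGROUP is easiest to see side by side. The following Python sketch models both on in-memory lists; the helper names and the sample datasets are hypothetical.

```python
from collections import defaultdict

def group(dataset, key):
    """GROUP: map each key value to the bag of tuples from one dataset sharing it."""
    groups = defaultdict(list)
    for t in dataset:
        groups[key(t)].append(t)
    return dict(groups)

def cogroup(datasets, keys):
    """COGROUP: for each key value, keep one bag per input dataset."""
    out = defaultdict(lambda: [[] for _ in datasets])
    for i, (dataset, key) in enumerate(zip(datasets, keys)):
        for t in dataset:
            out[key(t)][i].append(t)
    return {k: tuple(bags) for k, bags in out.items()}

# Hypothetical inputs, both keyed on their first field
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", "top", 50), ("kings", "side", 20), ("bulls", "top", 10)]

grouped = group(results, key=lambda t: t[0])
cogrouped = cogroup([results, revenue], [lambda t: t[0], lambda t: t[0]])
```

COGROUP keeps the bags from each input separate, so a key that appears in only one dataset still shows up, paired with an empty bag for the other.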

Page 38: MapReduce Application Scripting

Other Commands

UNION: Return the union of two or more bags

CROSS: Return the cross product of two or more bags

ORDER: Order a bag by a specified field

DISTINCT: Eliminate duplicate tuples in a bag
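These four commands have close analogues on in-memory lists. A small Python sketch with made-up bags `a` and `b` follows; it models the semantics only, and the output order of DISTINCT here is an artifact of the sketch rather than something Pig is claimed to guarantee.

```python
from itertools import product

a = [("x", 2), ("y", 1)]
b = [("y", 1), ("z", 3)]

union = a + b                                 # UNION: bag union, duplicates kept
cross = [l + r for l, r in product(a, b)]     # CROSS: every pairing, fields concatenated
ordered = sorted(union, key=lambda t: t[1])   # ORDER: sort by the second field
distinct = list(dict.fromkeys(union))         # DISTINCT: drop repeated tuples
```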

Page 42: MapReduce Application Scripting

MapReduce in Pig Latin

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
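The three statements express a full MapReduce job: apply a map UDF and flatten its output, group by the first field, then apply a reduce UDF to each group. A minimal Python sketch of the same pattern, using word count as an assumed example (`map_fn` and `reduce_fn` are hypothetical UDFs):

```python
from collections import defaultdict

def map_fn(line):
    """Hypothetical map UDF: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, pairs):
    """Hypothetical reduce UDF: sum the counts collected for one key."""
    return (key, sum(count for _, count in pairs))

def run(lines):
    # FOREACH input GENERATE FLATTEN(map(*))
    map_result = [pair for line in lines for pair in map_fn(line)]
    # GROUP map_result BY $0
    key_groups = defaultdict(list)
    for pair in map_result:
        key_groups[pair[0]].append(pair)
    # FOREACH key_groups GENERATE reduce(*)
    return [reduce_fn(k, g) for k, g in key_groups.items()]

counts = dict(run(["pig latin", "pig on hadoop"]))
```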

Page 43: MapReduce Application Scripting

Outline

1 Pig Latin

2 Cascading

Page 44: MapReduce Application Scripting

Introduction

Many applications require a chain of MapReduce jobs

Cascading allows the creation of processing pipelines using languages that run atop the JVM

Source-pipe-sink paradigm

  - Data comes from sources

  - Pipes perform data analysis

  - Results are written to sinks

Page 50: MapReduce Application Scripting

Terminology

Pipe: data stream

Tuple: data record

Branch: chain of pipes

Pipe Assembly: set of pipe branches

Tap: data source or sink

Flow: pipe assembly bound to a tap

Cascade: a collection of flows, in which one flow depends on the output of another

Page 57: MapReduce Application Scripting

Pipes

Base class: Pipe

Each: Analyze, transform, or filter individual tuples

Merge: Combine streams with same fields into one

GroupBy: Group tuples based on common values in a specified field

CoGroup: Join streams (similar to SQL join)

Every: Aggregate tuples

HashJoin: Similar to CoGroup but more efficient if one stream can be held in memory
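The reason a join gets cheaper when one stream fits in memory can be sketched as a classic build-and-probe hash join. The Python below illustrates that general technique and is not Cascading's implementation; the function name and the sample streams are assumptions.

```python
from collections import defaultdict

def hash_join(large, small, key_large, key_small):
    """Inner join that holds only `small` in memory and streams `large`."""
    table = defaultdict(list)
    for s in small:                      # build phase: index the small stream
        table[key_small(s)].append(s)
    for l in large:                      # probe phase: stream the large side
        for s in table.get(key_large(l), []):
            yield l + s

# Hypothetical streams joined on their first field
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "alice"), (2, "bob")]
joined = list(hash_join(orders, customers, lambda o: o[0], lambda c: c[0]))
```

The large stream is never grouped or sorted; each of its tuples is looked up once in the in-memory table, which is where the saving over a grouping-based join comes from.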

Page 64: MapReduce Application Scripting

Pipe Assemblies

Define the processing of tuple streams

  - Tuples are read from and written to taps

Processing includes filtering, transforming, organizing, and calculating

Can use multiple taps

May also define splits, merges, and joins to manipulate tuple streams

Page 69: MapReduce Application Scripting

Example: Pipe Assembly

Page 70: MapReduce Application Scripting

Example: Pipe Assembly (2)

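import cascading.pipe.CoGroup;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;

// SomeFunction, SomeFilter, and SomeAggregator are placeholder operations
// left-hand branch: a function followed by a filter
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );

// right-hand branch
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );

// join the branches, aggregate, regroup, and aggregate again
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );

// post-process each resulting tuple
join = new Each( join, new SomeFunction() );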

Page 71: MapReduce Application Scripting

Data Processing

Operation: Accept an input tuple, process it, and output zero or more tuples

Tuple: Array of fields

Field: Defines a data type, such as string, integer, etc.

Page 74: MapReduce Application Scripting

Taps

Data flows in and out of taps

Represent data sources and sinks, such as local files, distributed FS files, etc.

Each tap is associated with a scheme that describes the data, such as TextLine, TextDelimited, etc.

Sinks have modes such as SinkMode.KEEP, SinkMode.REPLACE, and SinkMode.UPDATE
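The three sink modes differ in what happens when the target already exists. The Python sketch below models one plausible reading of them on a local file: KEEP refuses to overwrite, REPLACE overwrites, and UPDATE reuses the existing resource (modeled here as an append). The `SinkMode` enum and `write_sink` helper are hypothetical stand-ins, not the Cascading classes, and the exact behavior of each mode should be checked against the Cascading documentation.

```python
import os
import tempfile
from enum import Enum

class SinkMode(Enum):
    KEEP = "keep"        # refuse to write if the resource already exists
    REPLACE = "replace"  # discard the existing resource, then write
    UPDATE = "update"    # reuse the existing resource (modeled here as append)

def write_sink(path, lines, mode=SinkMode.KEEP):
    """Write lines to path, honoring the assumed sink-mode semantics above."""
    if os.path.exists(path) and mode is SinkMode.KEEP:
        raise FileExistsError(path)
    open_mode = "a" if mode is SinkMode.UPDATE else "w"
    with open(path, open_mode) as f:
        f.writelines(line + "\n" for line in lines)

def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()

path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_sink(path, ["a"])                     # first write succeeds under KEEP
write_sink(path, ["b"], SinkMode.UPDATE)    # appends to the existing file
after_update = read_lines(path)
write_sink(path, ["c"], SinkMode.REPLACE)   # overwrites the file
after_replace = read_lines(path)
```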

Page 78: MapReduce Application Scripting

Flows

Represent entire pipelines

A pipeline reads data from a source, processes it, and then writes it to a sink

Page 80: MapReduce Application Scripting

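Example: Flow

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.*;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    // SomeFunction, SomeFilter, and SomeAggregator are placeholders for user-defined operations
    Pipe lhs = new Pipe( "lhs" );
    lhs = new Each( lhs, new SomeFunction() );
    lhs = new Each( lhs, new SomeFilter() );
    Pipe rhs = new Pipe( "rhs" );
    rhs = new Each( rhs, new SomeFunction() );
    Pipe join = new CoGroup( lhs, rhs );
    join = new Every( join, new SomeAggregator() );

    // Bind the pipe heads to sources and the tail to a sink
    Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
    Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
    Tap sink = new Hfs( new TextLine(), "output" );
    FlowDef flowDef = new FlowDef()
        .setName( "flow-name" )
        .addSource( rhs, rhsSource )
        .addSource( lhs, lhsSource )
        .addTailSink( join, sink );
    Flow flow = new HadoopFlowConnector().connect( flowDef );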
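
To give a feel for what a CoGroup followed by an Every computes, the plain-Java sketch below joins two small in-memory datasets and aggregates per key. It rests on several assumptions that the slide does not state: both sides are (key, value) pairs joined on the key, the join is an inner join, and the aggregator simply counts the joined rows in each group. `CoGroupSketch`, `groupByKey`, and `joinAndCount` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoGroupSketch {
    // Groups (key, value) rows by key; TreeMap keeps keys sorted for stable output.
    static Map<String, List<String>> groupByKey(List<String[]> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return groups;
    }

    // Inner-joins both sides on the key, then counts the joined rows per key.
    public static Map<String, Integer> joinAndCount(List<String[]> lhs, List<String[]> rhs) {
        Map<String, List<String>> left = groupByKey(lhs);
        Map<String, List<String>> right = groupByKey(rhs);
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : left.entrySet()) {
            List<String> matches = right.get(entry.getKey());
            if (matches != null) {
                // Every left row pairs with every right row that shares the key.
                counts.put(entry.getKey(), entry.getValue().size() * matches.size());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> lhs = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "3"});
        List<String[]> rhs = Arrays.asList(
                new String[] {"a", "x"}, new String[] {"c", "y"});
        System.out.println(joinAndCount(lhs, rhs));
    }
}
```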

Operations

Operations manipulate data

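Four kinds:

1. Function
2. Filter
3. Aggregator
4. Buffer

Take an input tuple and emit zero or more tuples

  - Filter instead returns a Boolean that decides whether the input tuple is discarded

Must be wrapped in either an Each pipe (Function, Filter) or an Every pipe (Aggregator, Buffer)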
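
The differences between the four kinds can be sketched with simplified, hypothetical interfaces. These are not Cascading's real signatures (the library's operations receive call-context objects, and its Aggregator has separate start, aggregate, and complete steps); tuples are reduced to plain strings, and all names in `OperationKindsSketch` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OperationKindsSketch {
    // Simplified stand-ins for the four operation kinds.
    public interface Function   { List<String> operate(String tuple); }       // emits 0..n tuples
    public interface Filter     { boolean isRemove(String tuple); }           // true = discard tuple
    public interface Aggregator { int aggregate(int soFar, String tuple); }   // folds a group one tuple at a time
    public interface Buffer     { List<String> operate(List<String> group); } // sees a whole group at once

    // Each-style application: the operation runs once per tuple.
    public static List<String> eachFunction(List<String> in, Function f) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.addAll(f.operate(t));
        return out;
    }

    public static List<String> eachFilter(List<String> in, Filter f) {
        List<String> out = new ArrayList<>();
        for (String t : in) if (!f.isRemove(t)) out.add(t);
        return out;
    }

    // Every-style application: the operation runs over one group of tuples.
    public static int everyAggregator(List<String> group, Aggregator a) {
        int value = 0;
        for (String t : group) value = a.aggregate(value, t);
        return value;
    }

    public static List<String> everyBuffer(List<String> group, Buffer b) {
        return b.operate(group);
    }

    public static void main(String[] args) {
        List<String> words = eachFunction(Arrays.asList("a b", "c"), t -> Arrays.asList(t.split(" ")));
        System.out.println(words);
        System.out.println(eachFilter(words, t -> t.equals("b")));
        System.out.println(everyAggregator(words, (n, t) -> n + 1));
        System.out.println(everyBuffer(words, g -> g.subList(0, 1)));
    }
}
```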

Example: Wordcount

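    import cascading.flow.*;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.*;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.*;
    import cascading.scheme.Scheme;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.*;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // Imports assume Cascading 2.x; inputPath and outputPath are Strings supplied by the caller
    Scheme sourceScheme = new TextLine( new Fields( "line" ) );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    Pipe assembly = new Pipe( "wordcount" );
    String regex = "\\s+"; // delimiter: split each line on whitespace
    Function function = new RegexSplitGenerator( new Fields( "word" ), regex );
    assembly = new Each( assembly, new Fields( "line" ), function );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    Aggregator count = new Count( new Fields( "count" ) );
    assembly = new Every( assembly, count );
    FlowConnector flowConnector = new HadoopFlowConnector();
    Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
    flow.complete();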
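
For comparison, the result this pipeline is meant to produce can be sketched in plain Java without Cascading or Hadoop. This is only a single-machine illustration of the same logic (split each line into words, group by word, count each group); `WordCountSketch` and `count` are hypothetical names, and splitting on whitespace is an assumption about the intended tokenization.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Returns word -> number of occurrences, with words in sorted order.
    public static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Corresponds to the Each step: one "word" tuple per token in "line".
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Corresponds to GroupBy on "word" followed by a count per group.
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a b a", "b c")));
    }
}
```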

References

1. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.

2. Cascading 2.1 User Guide: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf
