8: MapReduce Application Scripting
Zubair Nabi
May 25, 2013
Outline
1 Pig Latin
2 Cascading
1 Pig Latin
Introduction
MapReduce is too low-level and rigid, which leads to a lot of custom user code
Pig Latin is a high-level dataflow language atop MapReduce, designed by Yahoo!
    It finds the sweet spot between the declarative style of SQL and the low-level, procedural interface of MapReduce
The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop
SQL query to find the average pagerank for each large category of URLs
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Equivalent Pig query
-- keep only URLs with a sufficiently high pagerank
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
-- keep only categories with more than a million good URLs
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
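To make the step-by-step nature of the Pig query concrete, the following is a minimal sketch of the same four steps in plain Java. It is only an analogy that runs in memory on a single machine, not how Pig executes the query. The class `PigQuerySketch`, the `Url` type, and the `minGroupSize` parameter are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PigQuerySketch {
    /** Hypothetical stand-in for one tuple of the urls dataset. */
    static final class Url {
        final String category;
        final double pagerank;

        Url(String category, double pagerank) {
            this.category = category;
            this.pagerank = pagerank;
        }
    }

    static Map<String, Double> averagePagerank(List<Url> urls, int minGroupSize) {
        // Step 1: good_urls = FILTER urls BY pagerank > 0.2
        List<Url> goodUrls = new ArrayList<>();
        for (Url u : urls) {
            if (u.pagerank > 0.2) {
                goodUrls.add(u);
            }
        }
        // Step 2: groups = GROUP good_urls BY category
        Map<String, List<Url>> groups = new LinkedHashMap<>();
        for (Url u : goodUrls) {
            groups.computeIfAbsent(u.category, k -> new ArrayList<>()).add(u);
        }
        // Steps 3 and 4: keep groups above the size threshold, then average each one
        Map<String, Double> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<Url>> g : groups.entrySet()) {
            if (g.getValue().size() > minGroupSize) {
                double sum = 0.0;
                for (Url u : g.getValue()) {
                    sum += u.pagerank;
                }
                output.put(g.getKey(), sum / g.getValue().size());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Url> urls = Arrays.asList(
                new Url("news", 0.75), new Url("news", 0.5),
                new Url("news", 0.1), new Url("blogs", 0.3));
        // A threshold of 1 replaces the million used in the query, to keep the demo small.
        System.out.println(averagePagerank(urls, 1));
    }
}
```

Each intermediate variable corresponds to one named relation in the Pig query, which is the property that separates the dataflow style from a single declarative SQL statement.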
Pig Interface
A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages
    In contrast, SQL consists of declarative constraints that collectively define the result
Each step carries out a single data transformation
Writing a Pig Latin program is similar to specifying a query execution plan or a dataflow graph
Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed
Features
Support for a fully nested data model with complex data types
Extensive support for user-defined functions
Ability to operate over plain, schema-less input files
Open-source Apache project
Interoperability
Queries can be performed atop raw data dumps directly
The user needs to provide a function to parse the content of the file into tuples
Similarly, the user also needs to provide a function to convert tuples into a byte sequence
Datasets can be laid across diverse data storage sources and applications
UDFs as first-class citizens
A significant part of large-scale data analysis relies on custom processing
For instance, the user may be interested in figuring out whether a particular website is spam
All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs
UDFs can take non-atomic parameters as input and produce non-atomic values as output
UDFs are defined in Java
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
    category, top10(urls);
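The point that a UDF can consume and produce non-atomic values can be sketched as follows. This is plain Java showing only the logic a `top10`-style function might contain, assuming the ranking is by pagerank. It is not written against Pig's UDF API (a real Pig UDF would extend Pig's `EvalFunc` class), and `Top10Sketch`, `UrlTuple`, and `topN` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Top10Sketch {
    /** Hypothetical tuple type: one URL together with its pagerank. */
    static final class UrlTuple {
        final String url;
        final double pagerank;

        UrlTuple(String url, double pagerank) {
            this.url = url;
            this.pagerank = pagerank;
        }
    }

    // Input is a whole bag of tuples (non-atomic); output is again a bag (non-atomic).
    static List<UrlTuple> topN(List<UrlTuple> bag, int n) {
        List<UrlTuple> sorted = new ArrayList<>(bag);
        // Sort by pagerank, highest first
        sorted.sort(Comparator.comparingDouble((UrlTuple t) -> t.pagerank).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<UrlTuple> bag = Arrays.asList(
                new UrlTuple("a.example", 0.1),
                new UrlTuple("b.example", 0.9),
                new UrlTuple("c.example", 0.5));
        for (UrlTuple t : topN(bag, 2)) {
            System.out.println(t.url + " " + t.pagerank);
        }
    }
}
```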
Data Model
Pig has four data types:
1 Atom: A single atomic value such as a string or an integer
2 Tuple: A sequence of values, each with possibly a different data type
3 Bag: A collection of tuples
4 Map: A collection of data items, each with an associated key
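The nesting that these four types allow can be illustrated with ordinary Java collections. This is a rough sketch only: Pig has its own classes for these types, and the mapping used here (`List` for tuples and bags, `Map` for maps) as well as the sample values are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PigDataModelSketch {
    static List<Object> exampleTuple() {
        // Atom: a single value
        Object name = "alice";

        // Bag: a collection of tuples, each tuple being a sequence of values
        List<List<Object>> interests = Arrays.asList(
                Arrays.<Object>asList("lakers", 1),
                Arrays.<Object>asList("iPod", 2));

        // Map: data items looked up by key
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("age", 20);

        // Tuple: fields of different types, including nested bags and maps
        return Arrays.<Object>asList(name, interests, attributes);
    }

    public static void main(String[] args) {
        System.out.println(exampleTuple());
    }
}
```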
Commands
LOAD: Load and deserialize an input file
FOREACH: Process each tuple of a dataset
FILTER: Filter a dataset based on some condition or UDF
COGROUP: Group together tuples which are related in some way from one or more datasets
GROUP: Group together tuples which are related in some way from one dataset
STORE: Materialize the output of a Pig Latin expression to a file
Other Commands
UNION: Return the union of two or more bags
CROSS: Return the cross product of two or more bags
ORDER: Order a bag by a specified field
DISTINCT: Eliminate duplicate tuples in a bag
MapReduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
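The three statements above can be mirrored in plain Java to show what each one contributes. This is a single-machine sketch under simplifying assumptions: each group holds only the values rather than whole tuples, and `MapReduceSketch`, `mapReduce`, and `wordCount` are hypothetical names, not part of Pig or Hadoop.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

class MapReduceSketch {
    static <I, K, V, O> Map<K, O> mapReduce(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> map,
            BiFunction<K, List<V>, O> reduce) {
        // map_result = FOREACH input GENERATE FLATTEN(map(*))
        List<Map.Entry<K, V>> mapResult = new ArrayList<>();
        for (I record : input) {
            mapResult.addAll(map.apply(record)); // flatten: zero or more pairs per record
        }
        // key_groups = GROUP map_result BY $0
        Map<K, List<V>> keyGroups = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : mapResult) {
            keyGroups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // output = FOREACH key_groups GENERATE reduce(*)
        Map<K, O> output = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : keyGroups.entrySet()) {
            output.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count expressed as one map function and one reduce function.
    static Map<String, Integer> wordCount(List<String> lines) {
        Function<String, List<Map.Entry<String, Integer>>> map = line -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.split(" ")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            return pairs;
        };
        BiFunction<String, List<Integer>, Integer> reduce = (word, ones) -> ones.size();
        return mapReduce(lines, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b", "b")));
    }
}
```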
2 Cascading
Introduction
Many applications require a chain of MapReduce jobs
Cascading allows the creation of processing pipelines using languages that run atop the JVM
Source-pipe-sink paradigm
    Data comes from sources
    Pipes perform data analysis
    Results are written to sinks
Terminology
Pipe: data stream
Tuple: data record
Branch: chain of pipes
Pipe Assembly: set of pipe branches
Tap: data source or sink
Flow: pipe assembly bound to source and sink taps
Cascade: a collection of flows, in which one flow depends on the output of another
Pipes
Base class: Pipe
Each: Analyze, transform, or filter individual tuples
Merge: Combine streams with same fields into one
GroupBy: Group tuples based on common values in a specified field
CoGroup: Join streams (similar to SQL join)
Every: Aggregate tuples
HashJoin: Similar to CoGroup but more efficient if one stream can be held in memory
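Why holding one stream in memory helps can be seen from the general build-and-probe hash join technique, sketched below in plain Java. This illustrates the idea only and is not Cascading's implementation. `HashJoinSketch` and `hashJoin` are hypothetical names, and rows are assumed to be two-element string arrays joined on their first field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {
    // Inner join on field 0; each output row is {key, value from large, value from small}.
    static List<String[]> hashJoin(List<String[]> large, List<String[]> small) {
        // Build phase: index the small stream in memory by its join key
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: stream the large side once, looking up matches as it goes,
        // so the large stream never needs to be grouped or sorted
        List<String[]> joined = new ArrayList<>();
        for (String[] left : large) {
            List<String[]> matches = table.get(left[0]);
            if (matches == null) {
                continue; // no partner on the small side
            }
            for (String[] right : matches) {
                joined.add(new String[] { left[0], left[1], right[1] });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> large = Arrays.<String[]>asList(
                new String[] { "1", "a" }, new String[] { "2", "b" }, new String[] { "3", "c" });
        List<String[]> small = Arrays.<String[]>asList(
                new String[] { "1", "x" }, new String[] { "3", "z" });
        for (String[] row : hashJoin(large, small)) {
            System.out.println(String.join(",", row));
        }
    }
}
```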
Pipe Assemblies
Define the processing of tuple streams
    Tuples are read from and written to taps
Processing includes filtering, transforming, organizing, and calculating
Can use multiple taps
May also define splits, merges, and joins to manipulate tuple streams
Example: Pipe Assembly
[Figure: example pipe assembly]
Example: Pipe Assembly (2)
// the "left hand side" assembly head
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );

// the "right hand side" assembly head
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );

// join the lhs and rhs branches, then aggregate
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );

// the tail of the assembly
join = new Each( join, new SomeFunction() );
Data Processing
Operation: Accept an input tuple, process it, and output zero or more tuples
Tuple: Array of fields
Field: Defines a data type, such as string, integer, etc.
Taps
Data flows in and out of taps
Represent data sources and sinks, such as local files, distributed FS files, etc.
Each tap is associated with a scheme that describes the data, such as TextLine, TextDelimited, etc.
Sinks have modes such as SinkMode.KEEP, SinkMode.REPLACE, and SinkMode.UPDATE
Flows
Represent entire pipelines
A pipeline reads data from a source, processes it, and then writes it to a sink
Example: Flow
// assemble the pipes
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );

// create the source and sink taps
Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
Tap sink = new Hfs( new TextLine(), "output" );

// bind each pipe head to its source and the tail to the sink
FlowDef flowDef = new FlowDef()
    .setName( "flow-name" )
    .addSource( rhs, rhsSource )
    .addSource( lhs, lhsSource )
    .addTailSink( join, sink );
Flow flow = new HadoopFlowConnector().connect( flowDef );
Operations
Operations manipulate data
Four kinds:
1 Function
2 Filter
3 Aggregator
4 Buffer
Take an input tuple and emit zero or more tuples
    Filter returns a Boolean
Must be wrapped in a pipe: Function and Filter in Each, Aggregator and Buffer in Every
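The difference between a Function and a Filter can be sketched with two small interfaces in plain Java. These are hypothetical, simplified stand-ins (tuples are reduced to strings) and not Cascading's actual `Function` and `Filter` interfaces. The sketch assumes the convention that a filter returning true removes the tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OperationsSketch {
    /** Hypothetical function contract: one input tuple, zero or more output tuples. */
    interface TupleFunction {
        List<String> operate(String tuple);
    }

    /** Hypothetical filter contract: returns true if the tuple should be removed. */
    interface TupleFilter {
        boolean isRemove(String tuple);
    }

    // Plays the role of an Each pipe wrapping a function
    static List<String> applyFunction(List<String> stream, TupleFunction function) {
        List<String> out = new ArrayList<>();
        for (String tuple : stream) {
            out.addAll(function.operate(tuple));
        }
        return out;
    }

    // Plays the role of an Each pipe wrapping a filter
    static List<String> applyFilter(List<String> stream, TupleFilter filter) {
        List<String> out = new ArrayList<>();
        for (String tuple : stream) {
            if (!filter.isRemove(tuple)) {
                out.add(tuple);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not");
        // Function: split each line into words (one tuple in, several out)
        List<String> words = applyFunction(lines, line -> Arrays.asList(line.split(" ")));
        // Filter: drop the word "or"
        System.out.println(applyFilter(words, word -> word.equals("or")));
    }
}
```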
Example: Wordcount
// source tap: each input line becomes a tuple with a single "line" field
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
// sink tap: (word, count) pairs, replacing any existing output
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "wordcount" );
// RegexGenerator emits one tuple per regex match, so the pattern must match
// the words themselves (runs of non-whitespace), not the separators
String regex = "\\S+";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group by word and count the tuples in each group
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// plan the flow for Hadoop and run it
FlowConnector flowConnector = new HadoopFlowConnector();
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();
References
1 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.
2 Cascading 2.1 User Guide: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf