Cascading
www.cascading.org
Wednesday, May 14, 2008
Design Goals
Make large processing jobs more transparent
Reusable processing components independent of resources
Incremental “data” builds
Simplify testing of processes
Scriptable from higher level languages (Groovy, JRuby, Jython, etc)
Cascading Introduction
Tuple Streams
Tuple
A set of ordered data, e.g. [“John”, “Doe”, 39]
Value Stream
Just tuples
Group Stream
Tuples grouped by a key
[Figure: a Value Stream drawn as a sequence of [V1,V2,...,Vn] tuples, and a Group Stream drawn as [V1,V2,...,Vn] tuples under [K1,K2,...,Kn] keys]
Tuple Streams
Scalar functions and filters
Apply to value and group streams
Aggregate functions
Apply to group stream
Functions can be chained
[Figure: Source -> func -> Group -> aggr -> Sink. func takes [values] and emits [values]; Group takes [values] and emits [groups/values]; aggr takes [groups] and emits [values].]
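The chain of scalar function, grouping, and aggregate function can be sketched in plain Python. This is only an illustration of the stream model, not Cascading code: the names `source`, `tokenize`, `group`, and `count` and the sample input are assumptions made for the example.

```python
from itertools import groupby


def source():
    # Hypothetical input: a value stream of one-field tuples (lines of text).
    yield ("the quick fox",)
    yield ("the lazy dog",)


def tokenize(stream):
    # Scalar function: emits zero or more tuples for each input tuple.
    for (line,) in stream:
        for word in line.split():
            yield (word,)


def group(stream, key_index=0):
    # Grouping: turns a value stream into a group stream of (key, values) pairs.
    ordered = sorted(stream, key=lambda t: t[key_index])
    for key, values in groupby(ordered, key=lambda t: t[key_index]):
        yield key, list(values)


def count(groups):
    # Aggregate function: consumes a group stream and emits a value stream again.
    for key, values in groups:
        yield (key, len(values))


# Source -> func -> Group -> aggr; the list stands in for the sink.
result = list(count(group(tokenize(source()))))
```

Because every stage takes a stream and returns a stream, stages can be chained in any order that respects the rule that aggregate functions only follow a grouping.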
Stream Processing
Pipe Assemblies
A chain of scalar functions, groupings, and aggregate functions
Reusable, independent of data source/sink
Flows
Assemblies plus sources and sinks
Cascades
A collection of Flows
[Figure: a Cascade containing several Flows; a Flow containing a Pipe Assembly between a source and a sink]
Processing Patterns
Chain
Splits
Joins
Cross
Group
[Figure: one diagram per pattern, each built from Source, Group, and Sink elements]
MapReduce Planner
Flows are logical ‘units of work’
Flows are ‘compiled’ into MapReduce (MR) Jobs
Intermediate files are created (and destroyed) to join Jobs
[Figure: Flows compiled into Jobs, each Job made of Map and Reduce stages]
Topological Scheduler
Flows walk MapReduce Jobs in dependency order
Cascades walk Flows in dependency order
Independent Jobs and Flows are scheduled to run concurrently
Listeners can react to element events (notify completion or failures)
Only stale data-sets are rebuilt (configurable)
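The scheduling idea can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not Cascading's scheduler: the flow names and the dependency graph are hypothetical, and the staleness rule (a sink is stale when it is missing or older than any of its sources) is an assumption, since the slide does not say how staleness is decided.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each flow maps to the flows it depends on.
dependencies = {
    "import": set(),
    "clean": {"import"},
    "stats": {"clean"},
    "report": {"clean"},
    "publish": {"stats", "report"},
}


def schedule(deps):
    """Return batches of flows; flows within one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all of these could run concurrently
        batches.append(ready)
        sorter.done(*ready)
    return batches


def is_stale(sink_mtime, source_mtimes):
    # Assumed rule: rebuild if the sink is missing (None) or older than any source.
    return sink_mtime is None or any(m > sink_mtime for m in source_mtimes)


batches = schedule(dependencies)
```

Here `stats` and `report` land in the same batch because neither depends on the other, which is the case the slide describes as running concurrently.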
Scripting - Groovy
Flow flow = builder.flow("wordcount") {
  source(input, scheme: text())    // input is filename of raw text document
  tokenize(/[.,]*\s+/)             // output new tuple for each split, result replaces stream by default
  group()                          // group on stream
  count()                          // count values in group, creates 'count' field by default
  group(["count"], reverse: true)  // group/sort on 'count', reverse the sort order
  sink(output)
}
flow.complete()                    // execute, block till completed
System Integration
FileSystems (unique to Cascading)
Raw file S3 reading/writing (MD5)
Raw file HTTP reading (MD5)
Zip files
Can bypass native Hadoop ‘collectors’
Event notification via listeners (XMPP/SQS/Zookeeper notifications)
Groovy scripting for easier local shell/file operations (wget, scp, etc)
Cascading API & Internals
Core Concepts
Taps and Schemes
Tuples and Fields
Pipes and PipeAssemblies
Each and Every Operators
Groups
Flows, FlowSteps, and FlowConnectors
Cascades and CascadeConnectors (optional)
Taps and Schemes
Taps abstract out where and how a data resource is accessed
hdfs, http, local, S3, etc.
Taps are used as Tuple (data) stream sinks, sources, or both
Schemes define what a resource is made of
text lines, SequenceFile, CSV, etc.
Tuples and Fields
Tuples are the ‘records’, read from Tap sources, written to Tap sinks
Fields are the ‘column names’, sourced from Schemes
Tuple class, an ordered collection of Comparable values
(“a string”, 1.0, new SomeComparableWritable())
Fields class, a list of field names, absolute or relative positions
(“total”, 3, -1) // fields ‘total’, 4th position, last position
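How a field selector mixing names and positions might resolve against a tuple can be sketched as follows. The `select` helper and the sample field names and values are hypothetical, and the sketch assumes 0-based positions with negative values counting from the end, as the slide's comment suggests.

```python
def select(selector, names, values):
    """Pick values from a tuple by field name or by position."""
    picked = []
    for field in selector:
        if isinstance(field, int):
            # Absolute position (0-based) or relative position (negative, from the end).
            picked.append(values[field])
        else:
            # Field name, looked up in the declared field names.
            picked.append(values[names.index(field)])
    return tuple(picked)


# Hypothetical field names (as a Scheme might declare them) and one tuple.
names = ("id", "total", "city", "zip", "country")
values = (7, 99.5, "Austin", "78701", "US")

# Field 'total', 4th position, last position.
selected = select(("total", 3, -1), names, values)
```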
Pipes and PipeAssemblies
Tuple streams pass through Pipes to be processed
Pipes apply functions, filters, and aggregators to the Tuple stream
Pipe instances are chained together into assemblies
Reusable assemblies are subclasses of class PipeAssembly
Group Class and Subclasses
Group, subclass of Pipe, groups the Tuple stream on given fields
GroupBy and CoGroup subclass Group
GroupBy groups and sorts
CoGroup performs joins
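The difference between grouping and joining can be sketched in plain Python. The helpers `group_by` and `co_group` are illustrative only and do not mirror the Cascading classes; the inner-join behaviour is an assumption, since the slides indicate the joiner is configurable.

```python
from collections import defaultdict


def group_by(stream, key_index):
    # Groups tuples on one field and returns the groups sorted by key.
    groups = defaultdict(list)
    for t in stream:
        groups[t[key_index]].append(t)
    return sorted(groups.items())


def co_group(left, right, left_key, right_key):
    # Assumed inner join: pairs every left tuple with every right tuple sharing its key.
    right_groups = dict(group_by(right, right_key))
    return [
        l + r
        for key, lefts in group_by(left, left_key)
        for l in lefts
        for r in right_groups.get(key, [])
    ]


# Hypothetical input streams.
users = [(2, "bob"), (1, "amy")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]

grouped = group_by(users, 0)
joined = co_group(users, orders, 0, 0)
```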
Each and Every Classes
Each, subclass of Pipe, applies Functions and Filters to each Tuple instance
(a,b,c) -> Each( func() ) -> (a,b,c,d)
Every, subclass of Pipe, applies Aggregators to every Tuple group
(a: b,c) -> Every( agg()) -> (a,d: b,c)
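The two arrow notations above can be sketched in plain Python. The helpers `each` and `every` are hypothetical stand-ins, not the Cascading classes, and the sketch assumes the result is simply appended, as the notation on the slide shows.

```python
def each(stream, func):
    # Applies func to every tuple and appends its result: (a, b, c) -> (a, b, c, d).
    for t in stream:
        yield t + (func(*t),)


def every(groups, agg):
    # Applies agg to each group's values and appends the result to the grouping key:
    # (a: b, c) -> (a, d: b, c). The group's values pass through unchanged.
    for key, values in groups:
        yield key + (agg(values),), values


# Hypothetical value stream of (a, b, c) tuples.
stream = [("x", 1, 2), ("y", 3, 4)]
with_sum = list(each(stream, lambda a, b, c: b + c))

# Hypothetical group stream: a key tuple followed by that group's value tuples.
groups = [(("x",), [(1, 2), (3, 4)]), (("y",), [(5, 6)])]
counted = list(every(groups, len))
```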
Flows and FlowConnectors
Flows encapsulate assemblies and sink and source Taps
FlowConnectors connect assemblies and Taps into Flows
[Figure: Flows divided into FlowSteps]
FlowSteps and FlowConnectors
Internally, FlowConnectors ‘compile’ assemblies into FlowSteps
FlowSteps are MapReduce jobs, which are executed in topological order
Temporary files are created to link FlowSteps
[Figure: a Flow made of two FlowSteps, each with a Map stack and a Reduce stack]
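One way to picture the 'compile' step is to split a linear assembly at its groupings. The slides do not give the planner's actual rules, so this sketch rests on an assumption: each MapReduce job holds at most one grouping, elements before it run in the map stack, and elements after it run in the reduce stack. The element strings and the `plan` function are hypothetical.

```python
def plan(assembly):
    """Split a linear chain of each/group/every elements into map/reduce steps."""
    steps = [{"map": [], "reduce": []}]
    for element in assembly:
        current = steps[-1]
        if element.startswith("group"):
            if current["reduce"]:
                # A second grouping needs a new step; a temporary file would link the two.
                current = {"map": [], "reduce": []}
                steps.append(current)
            current["reduce"].append(element)
        elif current["reduce"]:
            current["reduce"].append(element)  # after the grouping: reduce stack
        else:
            current["map"].append(element)  # before any grouping: map stack
    return steps


# The word-count assembly: tokenize, group, count, then group/sort on the count.
steps = plan(["each:tokenize", "group:word", "every:count", "group:count"])
```

Under this assumption the word-count assembly becomes two steps, and the second step has an empty map stack because the second grouping has nothing in front of it.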
Cascades and CascadeConnectors
Are optional
Cascades bind Flows together via shared Taps
CascadeConnectors connect Flows
Flows are executed in topological order
[Figure: a Cascade of Flows (F) connected by Taps (T)]
Syntax
Each( previous, argSelector, function/filter, resultSelector )
Every( previous, argSelector, aggregator, resultSelector )
GroupBy( previous, groupSelector, sortSelector )
CoGroup( joinN, joiner, declaredFields )
Function( numArgs, declaredFields, .... )
Filter( numArgs, ... )
Aggregator( numArgs, declaredFields, ... )