Cascading www.cascading.org [email protected] Wednesday, May 14, 2008

Map Reduce using Cascading

Apr 10, 2015


parthpatil

Cascading is a feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.
Page 2: Map Reduce using Cascading

Design Goals

Make large processing jobs more transparent

Reusable processing components independent of resources

Incremental “data” builds

Simplify testing of processes

Scriptable from higher level languages (Groovy, JRuby, Jython, etc)


Page 3: Map Reduce using Cascading

Cascading Introduction


Page 4: Map Reduce using Cascading

Tuple Streams

Tuple

A set of ordered data, e.g. [“John”, “Doe”, 39]

Value Stream

Just tuples

Group Stream

Tuples grouped by a key

[Diagram: a Value Stream drawn as a sequence of value tuples [V1,V2,...,Vn]; a Group Stream drawn as value tuples [V1,V2,...,Vn] collected under key tuples [K1,K2,...,Kn]]
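The difference between the two stream types can be sketched in plain Java. This is only an illustration of the idea and does not use the Cascading API: the class name `TupleStreams`, the method `groupBy`, and every sample tuple other than [“John”, “Doe”, 39] are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TupleStreams {
    // Turns a value stream (just tuples) into a group stream:
    // tuples collected under the value found at keyIndex, in arrival order.
    static Map<Object, List<List<Object>>> groupBy(List<List<Object>> valueStream, int keyIndex) {
        Map<Object, List<List<Object>>> groups = new LinkedHashMap<>();
        for (List<Object> tuple : valueStream) {
            groups.computeIfAbsent(tuple.get(keyIndex), k -> new ArrayList<>()).add(tuple);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical value stream; only the first tuple comes from the slide.
        List<List<Object>> values = List.of(
            List.<Object>of("John", "Doe", 39),
            List.<Object>of("Jane", "Doe", 41),
            List.<Object>of("Ann", "Lee", 27));
        // Group on the second position (the last name).
        groupBy(values, 1).forEach((key, tuples) -> System.out.println(key + " -> " + tuples));
    }
}
```

Here the key is a single position; the slide's key tuples [K1,K2,...,Kn] would correspond to grouping on several positions at once.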


Page 5: Map Reduce using Cascading

Tuple Streams

Scalar functions and filters

Apply to value and group streams

Aggregate functions

Apply to group stream

Functions can be chained

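[Diagram: a chain from Source through func, Group, and aggr to Sink, annotated with stream types: Source emits [values]; func takes [values] and emits [values]; Group takes [values] and emits [groups/values]; aggr takes [groups] and emits [values]]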
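A chain of a scalar function, a filter, a grouping, and an aggregate function can be approximated with Java streams. This is a rough analogy rather than Cascading code; the class `ChainedStream`, the method `totalsByUser`, and the "user,amount" input format are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ChainedStream {
    // Source lines look like "user,amount" (an assumed format for this sketch).
    static Map<String, Integer> totalsByUser(List<String> source) {
        return source.stream()
            // scalar function: one value tuple in, one value tuple out
            .map(line -> line.split(","))
            // filter: drop tuples with a non-positive amount
            .filter(t -> Integer.parseInt(t[1]) > 0)
            // group on the user, then aggregate each group into a single sum
            .collect(Collectors.groupingBy(
                (String[] t) -> t[0],
                TreeMap::new,
                Collectors.summingInt((String[] t) -> Integer.parseInt(t[1]))));
    }

    public static void main(String[] args) {
        List<String> source = List.of("ann,3", "bob,5", "ann,4", "bob,-1");
        System.out.println(totalsByUser(source));
    }
}
```

The scalar steps run on the value stream before grouping, while the sum only makes sense once tuples have been grouped, which mirrors the split between scalar and aggregate functions on the slide.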


Page 6: Map Reduce using Cascading

Stream Processing

Pipe Assemblies

A chain of scalar functions, groupings, and aggregate functions

Reusable, independent of data source/sink

Flows

Assemblies plus sources and sinks

Cascades

A collection of Flows

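[Diagram: a Cascade containing several Flows (F) linked through sources and sinks (S); a Flow shown as a Pipe Assembly of functions (F), groups (G), and aggregators (A) between a source and a sink (S)]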


Page 7: Map Reduce using Cascading

Processing Patterns

Chain

Splits

Joins

Cross

Group

[Diagrams: each pattern drawn as a small graph of Source, Group, and Sink nodes]


Page 8: Map Reduce using Cascading

MapReduce Planner

Flows are logical ‘units of work’

Flows ‘compiled’ into MR Jobs

Intermediate files are created (and destroyed) to join Jobs

[Diagram: Flows made of functions (F), groups (G), and aggregators (A) between sources and sinks (S), partitioned into Jobs, each with a Map and a Reduce phase]


Page 9: Map Reduce using Cascading

Topological Scheduler

Flows walk MapReduce Jobs in dependency order

Cascades walk Flows in dependency order

Independent Jobs and Flows are scheduled to run concurrently

Listeners can react to element events (notify completion or failures)

Only stale data-sets are rebuilt (configurable)
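Dependency-ordered scheduling with concurrent execution of independent elements can be sketched as a topological sort that emits "waves". This is a generic illustration, not Cascading's scheduler; the class `TopoScheduler`, the method `waves`, and the flow names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TopoScheduler {
    // deps maps each flow name to the names of the flows it depends on.
    // Returns waves: every flow in a wave depends only on flows in earlier waves,
    // so the flows inside one wave could run concurrently.
    static List<List<String>> waves(Map<String, Set<String>> deps) {
        Map<String, Set<String>> remaining = new TreeMap<>(deps);
        Set<String> done = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : remaining.entrySet()) {
                if (done.containsAll(e.getValue())) {
                    ready.add(e.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("cycle or unknown dependency among " + remaining.keySet());
            }
            remaining.keySet().removeAll(ready);
            done.addAll(ready);
            waves.add(ready);
        }
        return waves;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = Map.of(
            "import", Set.<String>of(),
            "clean", Set.of("import"),
            "stats", Set.of("clean"),
            "index", Set.of("clean"),
            "report", Set.of("stats", "index"));
        System.out.println(waves(deps));
    }
}
```

A staleness check of the kind the slide mentions would fit inside the loop, skipping a ready flow whose outputs are already newer than its inputs; that part is left out here.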


Page 10: Map Reduce using Cascading

Scripting - Groovy

Flow flow = builder.flow("wordcount") {
  source(input, scheme: text()) // input is filename of raw text document

  tokenize(/[.,]*\s+/) // output new tuple for each split, result replaces stream by default
  group() // group on stream
  count() // count values in group, creates 'count' field by default
  group(["count"], reverse: true) // group/sort on 'count', reverse the sort order

  sink(output)
}

flow.complete() // execute, block till completed
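For readers who want to check what the script computes, the same steps (tokenize, group, count, sort on the count in reverse) can be mimicked in memory with plain Java. This sketch does not use Cascading or Hadoop; the class `WordCountSketch` and the method `wordCount` are hypothetical, and only the regular expression is taken from the script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {
    static List<Map.Entry<String, Long>> wordCount(String text) {
        // tokenize: split on optional '.' or ',' followed by whitespace
        Map<String, Long> counts = Arrays.stream(text.split("[.,]*\\s+"))
            .filter(word -> !word.isEmpty())
            // group on the word and count the values in each group
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
        // group/sort on the count, highest first (ties stay in alphabetical order)
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) {
        wordCount("a b, a c. a b").forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

Because the pattern requires trailing whitespace, punctuation at the very end of the input stays attached to the last word, which is also how the expression in the script would behave.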


Page 11: Map Reduce using Cascading

System Integration

FileSystems (unique to Cascading)

Raw file S3 reading/writing (MD5)

Raw file HTTP reading (MD5)

Zip files

Can bypass native Hadoop ‘collectors’

Event notification via listeners (XMPP/SQS/Zookeeper notifications)

Groovy scripting for easier local shell/file operations (wget, scp, etc)


Page 12: Map Reduce using Cascading

Cascading API & Internals


Page 13: Map Reduce using Cascading

Core Concepts

Taps and Schemes

Tuples and Fields

Pipes and PipeAssemblies

Each and Every Operators

Groups

Flows, FlowSteps, and FlowConnectors

Cascades and CascadeConnectors (optional)


Page 14: Map Reduce using Cascading

Taps and Schemes

Taps abstract out where and how a data resource is accessed

hdfs, http, local, S3, etc.

Taps are used as Tuple (data) stream sinks, sources, or both

Schemes define what a resource is made of

text lines, SequenceFile, CSV, etc


Page 15: Map Reduce using Cascading

Tuples and Fields

Tuples are the ‘records’, read from Tap sources, written to Tap sinks

Fields are the ‘column names’, sourced from Schemes

Tuple class, an ordered collection of Comparable values

(“a string”, 1.0, new SomeComparableWritable())

Fields class, a list of field names, absolute or relative positions

(“total”, 3, -1) // fields ‘total’, 4th position, last position
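The selector comment above implies zero-based positions, with negative values counting back from the end. A minimal sketch of that resolution rule follows; it is not the actual Fields class, and the class `FieldSelectors`, the method `resolve`, and the declared field names other than "total" are hypothetical.

```java
import java.util.List;

public class FieldSelectors {
    // Resolves one selector against the declared field names of a tuple stream.
    // A String selects by name; an Integer selects by position, where 3 is the
    // 4th position and -1 is the last position.
    static int resolve(List<String> declared, Object selector) {
        if (selector instanceof String) {
            int index = declared.indexOf(selector);
            if (index < 0) {
                throw new IllegalArgumentException("unknown field: " + selector);
            }
            return index;
        }
        int position = (Integer) selector;
        int index = position < 0 ? declared.size() + position : position;
        if (index < 0 || index >= declared.size()) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> declared = List.of("id", "name", "total", "count", "date");
        for (Object selector : new Object[] {"total", 3, -1}) {
            System.out.println(selector + " -> " + declared.get(resolve(declared, selector)));
        }
    }
}
```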


Page 16: Map Reduce using Cascading

Pipes and PipeAssemblies

Tuple streams pass through Pipes to be processed

Pipes apply functions, filters, and aggregators to the Tuple stream

Pipe instances are chained together into assemblies

Reusable assemblies are subclasses of class PipeAssembly

[Diagram: Pipe instances chained into assemblies, with sub-assemblies reused inside a larger assembly]


Page 17: Map Reduce using Cascading

Group Class and Subclasses

Group, a subclass of Pipe, groups the Tuple stream on given fields

GroupBy and CoGroup subclass Group

GroupBy groups and sorts

CoGroup performs joins

[Diagram: Taps (T) feeding Each (E) pipes into a Group (G) followed by an aggregating Every (A)]


Page 18: Map Reduce using Cascading

Each and Every Classes

Each, subclass of Pipe, applies Functions and Filters to each Tuple instance

(a,b,c) -> Each( func() ) -> (a,b,c,d)

Every, subclass of Pipe, applies Aggregators to every Tuple group

(a: b,c) -> Every( agg()) -> (a,d: b,c)

[Diagram: an Each (E) and an Every (A), each holding its operation]
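The two arrows above can be read as "Each appends a function result to every tuple" and "Every appends an aggregate result to every group key". A small in-memory model of that reading, under the assumption that results are simply appended: the class `EachEvery` and its methods are hypothetical and are not the Cascading classes of the same names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class EachEvery {
    // (a,b,c) -> Each(func) -> (a,b,c,d): func sees one tuple, its result d is appended.
    static List<List<Integer>> each(List<List<Integer>> stream, Function<List<Integer>, Integer> func) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> tuple : stream) {
            List<Integer> result = new ArrayList<>(tuple);
            result.add(func.apply(tuple));
            out.add(result);
        }
        return out;
    }

    // (a: b,c) -> Every(agg) -> (a,d: b,c): agg sees all values of one group,
    // its result d is appended to the group key a; the values pass through.
    static Map<List<Integer>, List<List<Integer>>> every(
            Map<Integer, List<List<Integer>>> groups, Function<List<List<Integer>>, Integer> agg) {
        Map<List<Integer>, List<List<Integer>>> out = new LinkedHashMap<>();
        groups.forEach((key, values) -> out.put(List.of(key, agg.apply(values)), values));
        return out;
    }

    public static void main(String[] args) {
        // Each: append the sum of the three values as a fourth value.
        System.out.println(each(List.of(List.of(1, 2, 3)), t -> t.get(0) + t.get(1) + t.get(2)));
        // Every: append the number of tuples in the group to the key.
        Map<Integer, List<List<Integer>>> groups = Map.of(1, List.of(List.of(2, 3), List.of(4, 5)));
        System.out.println(every(groups, values -> values.size()));
    }
}
```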


Page 19: Map Reduce using Cascading

Flows and FlowConnectors

Flows encapsulate assemblies and sink and source Taps

FlowConnectors connect assemblies and Taps into Flows

[Diagram: two Flows, each built from Taps (T), Each (E), Group (G), and Every (A) elements and divided into FlowSteps]


Page 20: Map Reduce using Cascading

FlowSteps and FlowConnectors

Internally, FlowConnectors ‘compile’ assemblies into FlowSteps

FlowSteps are MapReduce jobs, which are executed in topological order

Temporary files are created to link FlowSteps

[Diagram: a Flow split into two FlowSteps, each with a Map Stack and a Reduce Stack, linked by a temporary Tap (T)]
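One way to picture the 'compile' step is that each Group marks the boundary between a map stack and a reduce stack, and a further Group forces a new FlowStep linked by a temporary file. The sketch below is a deliberately simplified model for a linear assembly, not Cascading's actual planner; the class `StepPlanner`, the method `plan`, and the string element names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class StepPlanner {
    // Splits a linear assembly into steps. Each step is a pair: [map stack, reduce stack].
    static List<List<List<String>>> plan(List<String> elements) {
        List<List<List<String>>> steps = new ArrayList<>();
        List<String> map = new ArrayList<>();
        List<String> reduce = new ArrayList<>();
        for (String element : elements) {
            if (element.equals("Group")) {
                if (!reduce.isEmpty()) {
                    // A second Group needs its own shuffle: close the current step.
                    // In this model a temporary file would link it to the next one.
                    steps.add(List.of(map, reduce));
                    map = new ArrayList<>();
                    reduce = new ArrayList<>();
                }
                reduce.add(element);
            } else if (reduce.isEmpty()) {
                map.add(element);      // before any Group: runs in the map stack
            } else {
                reduce.add(element);   // after a Group: runs in the reduce stack
            }
        }
        steps.add(List.of(map, reduce));
        return steps;
    }

    public static void main(String[] args) {
        List<String> assembly = List.of("Each", "Group", "Every", "Each", "Group", "Every");
        for (List<List<String>> step : plan(assembly)) {
            System.out.println("map=" + step.get(0) + " reduce=" + step.get(1));
        }
    }
}
```

In this model the second step has an empty map stack, since its map phase would only read the temporary file and pass tuples on to the grouping.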


Page 21: Map Reduce using Cascading

Cascades and CascadeConnectors

Are optional

Cascades bind Flows together via shared Taps

CascadeConnectors connect Flows

Flows are executed in topological order

[Diagram: a Cascade of Flows (F) connected through shared Taps (T)]


Page 22: Map Reduce using Cascading

Syntax

Each( previous, argSelector, function/filter, resultSelector )

Every( previous, argSelector, aggregator, resultSelector )

GroupBy( previous, groupSelector, sortSelector )

CoGroup( joinN, joiner, declaredFields )

Function( numArgs, declaredFields, .... )

Filter( numArgs, ... )

Aggregator( numArgs, declaredFields, ... )
