Scalding - Hadoop Word Count in LESS than 70 lines of code

Post on 27-Jan-2015


DESCRIPTION

Twitter Scalding is built on top of Cascading, which in turn is built on top of Hadoop. It is essentially a DSL for writing MapReduce jobs that is pleasant to read and easy to extend.

Transcript

Scalding: Hadoop Word Count in < 70 lines of code

Konrad 'ktoso' Malawski, JARCamp #3, 12.04.2013

Scalding: Hadoop Word Count in 4 lines of code

Agenda

Why Scalding? (10%) +
Hadoop Basics (20%) +
Enter Cascading (40%) +
Hello Scalding (30%) =
100%

Why Scalding? Word Count in Types

type Word = String
type Count = Int

String => Map[Word, Count]

Why Scalding? Word Count in Scala

val text = "a a a b b"

def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }

wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
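The "should equal" matcher above comes from ScalaTest. A minimal self-contained sketch of the same check, using only the standard library and a plain assertion:

object WordCountSpec extends App {
  type Word = String
  type Count = Int

  def wordCount(text: String): Map[Word, Count] =
    text
      .split(" ")
      .map(a => (a, 1))
      .groupBy(_._1)
      .map { a => a._1 -> a._2.map(_._2).sum }

  // plain assert instead of ScalaTest's "should equal"
  assert(wordCount("a a a b b") == Map("a" -> 3, "b" -> 2))
}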

Stuff > Memory: Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."

text
  .split(" ")
  .map(a => (a, 1))
  .groupBy(_._1)
  .map(a => (a._1, a._2.map(_._2).sum))

Every step above materializes a new collection, entirely in memory.
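Switching to a lazy iterator only avoids holding the intermediate collections; the final counts map must still fit on a single JVM heap, and that is exactly the limit Hadoop removes. A minimal sketch (the file name is illustrative):

import scala.io.Source

object StreamingWordCount extends App {
  // lines are streamed, so the raw text never sits in memory all at once...
  val counts = Source.fromFile("huge.txt").getLines()
    .flatMap(_.split(" "))
    .foldLeft(Map.empty[String, Int]) { (acc, w) =>
      acc.updated(w, acc.getOrElse(w, 0) + 1)
    }
  // ...but the counts map itself still lives on one JVM heap
  println(counts.take(10))
}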

Apache Hadoop (HDFS + MR): http://hadoop.apache.org/

Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Trivia: How old is Hadoop?

Cascading: www.cascading.org/

Cascading is: Taps & Pipes & Sinks

1: Distributed Copy

// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");

// build the Flow
FlowDef flowDef = FlowDef.flowDef()
  .addSource(copyPipe, inTap)
  .addTailSink(copyPipe, outTap);

// run!
flowConnector.connect(flowDef).complete();

1. DCP - Full Code

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
      .addSource(copyPipe, inTap)
      .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}

2: Word Count

String docPath = args[0];
String wcPath = args[1];

Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

// create source and sink taps
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName("wc")
  .addSource(docPipe, docTap)
  .addTailSink(wcPipe, wcTap);

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();

Cascading - how?

// pseudo code...
val flow = FlowDef
val flowConnector: FlowDef => List[MRJob] = ...

val jobs: List[MRJob] = flowConnector(flow)

HadoopCluster.execute(jobs)

Cascading tips

Pipe assembly = new Pipe("assembly");
assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
// ...

// head and tail have the same name
FlowDef flowDef = new FlowDef()
  .setName("debug")
  .addSource("assembly", source)
  .addSink("assembly", sink)
  .addTail(assembly);

flowDef.setDebugLevel(DebugLevel.NONE);

With DebugLevel.NONE set, the flowConnector will NOT even create the Debug pipe!
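Scalding exposes the same idea on its pipes as a debug operation (it appears in the API overview below). A minimal sketch, with illustrative paths:

import com.twitter.scalding._

class DebugJob(args: Args) extends Job(args) {
  Tsv(args("input")).read
    .debug // prints each tuple as it flows by, like Cascading's Debug filter
    .write(Tsv(args("output")))
}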

Scalding = Scala + Cascading

Twitter Scalding: github.com/twitter/scalding

Scalding API

map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 } // Int => Int

Scalding:

IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 } // Int => Int

'number stays in the Pipe, and the new 'doubled field is also available in the Pipe. Note that you must choose the type of n explicitly!
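To actually run a snippet like this it has to live inside a Job; a minimal sketch (the class name and Tsv output are illustrative, not from the slides):

import com.twitter.scalding._

class DoubleJob(args: Args) extends Job(args) {
  // 'number is the field name given to the source's single column
  IterableSource(List(1, 2, 3), 'number)
    .map('number -> 'doubled) { n: Int => n * 2 }
    .write(Tsv(args("output")))
}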

mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 } // Int => Int
data = null // release the reference

Scalding:

IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 } // Int => Int

Only 'doubled stays in the Pipe; 'number is removed, much like releasing the reference above.
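The practical difference is which fields survive downstream. A hypothetical job contrasting the two on one stream (names and paths are illustrative):

import com.twitter.scalding._

class MapVsMapToJob(args: Args) extends Job(args) {
  IterableSource(List(1, 2, 3), 'number)
    .map('number -> 'doubled) { n: Int => n * 2 }       // fields now: ('number, 'doubled)
    .mapTo('doubled -> 'quadrupled) { d: Int => d * 2 } // fields now: only ('quadrupled)
    .write(Tsv(args("output")))
}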

flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",") // Array[String]
} map { _.toInt } // List[Int]

numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { _.split(",") } // like List[String]
  .map('word -> 'number) { _.toInt } // like List[Int]

Here the map is a separate pipe operation, outside the flatMap.

flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt) // Array[Int]
}

numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int]

Here the map happens inside the flatMap's function, in plain Scala.
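Both placements produce the same stream of numbers; the only question is whether the conversion runs as a separate pipe operation or inside the flatMap closure. A runnable sketch of the second form as a complete job (class name and paths are illustrative):

import com.twitter.scalding._

class SplitJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    // split each line on commas and convert in one closure, in plain Scala
    .flatMap('line -> 'number) { line: String => line.split(",").map(_.toInt) }
    .write(Tsv(args("output")))
}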

groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]

val groups = data groupBy { _ < 10 }

groups // Map[Boolean, List[Int]]

groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }

groupBy collects all entries with an equal 'lessThanTen value into one group; _.size then counts each group.
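For the data above both groups happen to have two members. A plain-Scala mirror of the grouping, as a quick sanity check:

object GroupBySizeCheck extends App {
  // mirrors the Scalding groupBy('lessThanTen) { _.size } above
  val counts = List(1, 2, 30, 42).groupBy(_ < 10).map { case (k, vs) => (k, vs.size) }
  assert(counts == Map(true -> 2, false -> 2)) // like tuples ('lessThanTen, 'size)
}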

groupBy

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('num -> 'total) }

'total = [3, 72]

Scalding API

Friday, April 12, 13

Scalding APIproject / discard

Friday, April 12, 13

Scalding APIproject / discard

map / mapTo

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

rename

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

unique

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

uniquegroupBy / groupAll / groupRandom / shuffle

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

uniquegroupBy / groupAll / groupRandom / shuffle

limit

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

uniquegroupBy / groupAll / groupRandom / shuffle

limitdebug

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

uniquegroupBy / groupAll / groupRandom / shuffle

limitdebug

Group operations

Friday, April 12, 13

Scalding APIproject / discard

map / mapToflatMap / flatMapTo

renamefilter

uniquegroupBy / groupAll / groupRandom / shuffle

limitdebug

Group operations

joinsFriday, April 12, 13
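A hypothetical pipeline touching several of the listed operations (field names and paths are illustrative, not from the slides):

import com.twitter.scalding._

class AdultsJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('name, 'age, 'city)).read
    .project('name, 'age)               // keep only these fields
    .rename('name -> 'userName)
    .filter('age) { a: Int => a >= 18 } // keep rows matching the predicate
    .unique('userName, 'age)            // distinct rows
    .limit(100)
    .write(Tsv(args("output")))
}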

Distributed Copy in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val input = Tsv(args("input"))
  val output = Tsv(args("output"))

  input.read.write(output)
}

The End.


Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  // args comes from the App trait
  ToolRunner.run(new Configuration, new scalding.Tool, args)
}
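scalding.Tool takes the job class name as its first argument, followed by a mode flag and the job's own options. A hypothetical local invocation, written programmatically (the job may need its fully qualified class name):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object RunWordCountLocally extends App {
  // roughly: WordCountJob --local --input in.txt --output out.tsv
  ToolRunner.run(new Configuration, new scalding.Tool,
    Array("WordCountJob", "--local", "--input", "in.txt", "--output", "out.tsv"))
}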

Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}

The brace on the slide marks the 4 lines that do all the work: TextLine, flatMap, groupBy, write.
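The slides leave tokenize as "implemented". A minimal sketch of one plausible implementation; lowercasing and punctuation stripping are assumptions, not part of the original:

// a hypothetical tokenize: lowercase, strip punctuation, split on whitespace
def tokenize(text: String): Array[String] =
  text.toLowerCase
    .replaceAll("[^a-z0-9\\s]", "")
    .split("\\s+")
    .filter(_.nonEmpty)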

Dzięki! Thanks! ありがとう!

Konrad Malawski @ java.pl
t: ktosopl / g: ktoso / b: blog.project13.pl
