Scalding - Hadoop Word Count in less than 70 lines of code
Posted on 27-Jan-2015
Scalding: Hadoop Word Count in < 70 lines of code
Konrad 'ktoso' Malawski, JARCamp #3, 12.04.2013

Scalding: Hadoop Word Count in 4 lines of code
Konrad 'ktoso' Malawski, JARCamp #3, 12.04.2013
Agenda

Why Scalding? (10%) +
Hadoop Basics (20%) +
Enter Cascading (40%) +
Hello Scalding (30%) =
100%
Why Scalding? Word Count in Types

type Word = String
type Count = Int

String => Map[Word, Count]
Why Scalding? Word Count in Scala

val text = "a a a b b"

def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }

wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
Stuff > Memory

Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..." // in memory

text
  .split(" ")                            // in memory
  .map(a => (a, 1))                      // in memory
  .groupBy(_._1)                         // in memory
  .map(a => (a._1, a._2.map(_._2).sum))  // in memory

Every step materializes a whole new collection in memory, so the dataset and every intermediate result must fit on a single machine.
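On a single machine you can at least avoid materializing each intermediate step by streaming the input - a minimal sketch, not from the talk (wordCountStreaming and the file-based input are illustrative). Note the final counts map still has to fit in memory, which is exactly the limit Hadoop removes:

import scala.io.Source

// Stream words and accumulate counts in one pass; the intermediate
// split/map/groupBy collections disappear, but the result map itself
// is still bounded by local memory.
def wordCountStreaming(path: String): Map[String, Int] =
  Source.fromFile(path).getLines()
    .flatMap(_.split(" "))
    .foldLeft(Map.empty[String, Int]) { (counts, word) =>
      counts.updated(word, counts.getOrElse(word, 0) + 1)
    }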
Apache Hadoop (HDFS + MR)
http://hadoop.apache.org/
Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
Trivia: How old is Hadoop?
Cascading is
Taps & Pipes & Sinks
1: Distributed Copy

// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");

// build the Flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run!
flowConnector.connect(flowDef).complete();
1. DCP - Full Code

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}
2: Word Count

public class Main {
  public static void main(String[] args) {
    String docPath = args[0];
    String wcPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    // create source and sink taps
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
    // only returns "token"
    Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

    // determine the word counts
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect(flowDef);
    wcFlow.writeDOT("dot/wc.dot");
    wcFlow.complete();
  }
}
Cascading - how?

// pseudo code...
val flow = FlowDef
val flowConnector: FlowDef => List[MRJob] = ...

val jobs: List[MRJob] = flowConnector(flow)

HadoopCluster.execute(jobs)
Cascading tips

Pipe assembly = new Pipe("assembly");
assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
// ...

// head and tail have same name
FlowDef flowDef = new FlowDef()
    .setName("debug")
    .addSource("assembly", source)
    .addSink("assembly", sink)
    .addTail(assembly);

flowDef.setDebugLevel(DebugLevel.NONE);

With DebugLevel.NONE set, the flowConnector will NOT even create the Debug pipe!
Scalding = Cascading + Scala

Twitter Scalding
github.com/twitter/scalding
Scalding API
map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }
// Int => Int

Scalding:

IterableSource(data, 'number)
  .map('number -> 'doubled) { n: Int => n * 2 }
  // Int => Int

'number stays in the Pipe, 'doubled becomes available in the Pipe - and note that you must choose the function's input type explicitly!
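For context, here is how that map could sit inside a complete job - a minimal sketch; the DoubleJob name and the TSV output are illustrative, not from the talk:

import com.twitter.scalding._

// Hypothetical job around the map example above:
// doubles each number and writes both fields as TSV.
class DoubleJob(args: Args) extends Job(args) {
  IterableSource(List(1, 2, 3), 'number)
    .map('number -> 'doubled) { n: Int => n * 2 }
    .write(Tsv(args("output")))
}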
mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }
data = null // release reference
// Int => Int

Scalding:

IterableSource(data, 'number)
  .mapTo('number -> 'doubled) { n: Int => n * 2 }
  // Int => Int

'doubled stays in the Pipe, while 'number is removed.
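In other words, mapTo behaves like a map followed by keeping only the new field - a hedged sketch of the equivalence (the MapProjectJob wrapper is invented for illustration):

import com.twitter.scalding._

// Hypothetical job: map + project keeps only 'doubled,
// which is what mapTo('number -> 'doubled) does in one step.
class MapProjectJob(args: Args) extends Job(args) {
  IterableSource(List(1, 2, 3), 'number)
    .map('number -> 'doubled) { n: Int => n * 2 }
    .project('doubled) // same output fields as the mapTo version
    .write(Tsv(args("output")))
}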
flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",") // Array[String]
} map { _.toInt } // List[Int]

numbers // List[Int]
numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",") } // like List[String]
  .map('word -> 'number) { word: String => word.toInt } // like List[Int]

Here the toInt conversion is a separate map operation, outside the flatMap.
flatMap, take two

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt) // Array[Int]
}

numbers // List[Int]
numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'number) { line: String => line.split(",").map(_.toInt) } // like List[Int]

Here the map happens inside the flatMap, in plain Scala.
groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]

val groups = data groupBy { _ < 10 }

groups // Map[Boolean, List[Int]]

groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) }

groupBy groups all entries with an equal (==) value of 'lessThanTen; the group sizes land in the new 'lessThanTenCounts field.
groupBy, continued

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('num -> 'total) }

// 'total = [3, 72]
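A plain-Scala check of those totals (a quick sanity check, not from the slides):

val data = List(1, 2, 30, 42)
val totals = data.groupBy(_ < 10).map { case (k, vs) => k -> vs.sum }
// totals == Map(true -> 3, false -> 72): 1 + 2 = 3 and 30 + 42 = 72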
Scalding API
Friday, April 12, 13
Scalding APIproject / discard
Friday, April 12, 13
Scalding APIproject / discard
map / mapTo
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
rename
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
unique
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
uniquegroupBy / groupAll / groupRandom / shuffle
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
uniquegroupBy / groupAll / groupRandom / shuffle
limit
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
uniquegroupBy / groupAll / groupRandom / shuffle
limitdebug
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
uniquegroupBy / groupAll / groupRandom / shuffle
limitdebug
Group operations
Friday, April 12, 13
Scalding APIproject / discard
map / mapToflatMap / flatMapTo
renamefilter
uniquegroupBy / groupAll / groupRandom / shuffle
limitdebug
Group operations
joinsFriday, April 12, 13
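A hedged sketch chaining a few of the operations listed above into one job (the CitiesJob name, field layout, and input format are invented for illustration):

import com.twitter.scalding._

// Hypothetical pipeline: filter rows, drop unused fields,
// and deduplicate - three of the operations from the list.
class CitiesJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('user, 'age, 'city))
    .filter('age) { a: Int => a >= 18 } // keep adults only
    .project('user, 'city)              // discard 'age
    .unique('city)                      // one row per distinct city
    .write(Tsv(args("output")))
}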
Distributed Copy in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val input = Tsv(args("input"))
  val output = Tsv(args("output"))

  input.read.write(output)
}

The End.
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  ToolRunner.run(new Configuration, new scalding.Tool, args)

}

The args array comes from the App trait; scalding.Tool takes the Job class name as its first argument, followed by --local or --hdfs and the job's own arguments (e.g. --input, --output).
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  // one way to implement it: lowercase, strip punctuation, split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}

And that's the promised word count in 4 lines: read, flatMap, groupBy, write.
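A sketch of how such a job can be exercised locally with Scalding's JobTest (the tuple types and expected counts are assumed for illustration; this would live inside a test suite):

import com.twitter.scalding._

// Hypothetical local test: feed two lines in, expect counts out.
JobTest(new WordCountJob(_))
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(TextLine("inputFile"), List((0, "hello world"), (1, "hello scalding")))
  .sink[(String, Long)](Tsv("outputFile")) { buffer =>
    assert(buffer.toSet == Set(("hello", 2L), ("scalding", 1L), ("world", 1L)))
  }
  .run
  .finish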
Dzięki! Thanks! ありがとう!

Konrad Malawski @ java.pl
t: ktosopl / g: ktoso
b: blog.project13.pl