Wordcount in MapReduce
Jul 15, 2015
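The word-count code itself did not survive extraction; as a stand-in, here is the map → shuffle → reduce flow sketched with plain Scala collections (a real Hadoop job would implement the Mapper and Reducer Java classes instead):

```scala
// Sketch of the MapReduce word-count model using plain Scala collections.
val lines = Seq("hello world", "hello mapreduce")

// Map phase: every line emits (word, 1) pairs.
val mapped: Seq[(String, Int)] =
  lines.flatMap(_.split("\\s+")).map(word => (word, 1))

// Shuffle phase: the framework groups the intermediate pairs by key.
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce phase: sum the counts for each word.
val counts: Map[String, Int] =
  shuffled.map { case (word, ones) => (word, ones.sum) }
```

The real job is far more verbose: the same three steps are spread across a Mapper class, a Reducer class, and driver boilerplate, which is exactly what Cascading and Scalding abstract away.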
Cascading
Tap / Pipe / Sink abstraction over Map / Reduce in Java
Wordcount in Cascading
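The Cascading snippet is likewise missing; to convey the Tap / Pipe / Sink idea, here is a toy model in Scala (the names mirror Cascading's concepts, but the real Cascading Java API uses classes such as Tap, Pipe, Each, GroupBy and Flow):

```scala
// Toy model of Cascading's Tap / Pipe / Sink abstraction (illustrative only).
type Tap  = () => Seq[String]                  // a source of records
type Pipe = Seq[String] => Seq[(String, Int)]  // a transformation
type Sink = Seq[(String, Int)] => Unit         // a destination

val source: Tap = () => Seq("hello world", "hello cascading")

// The word-count "pipe assembly": tokenize, group by word, count.
val wordCount: Pipe = lines =>
  lines.flatMap(_.split("\\s+"))
       .groupBy(identity)
       .map { case (word, occurrences) => (word, occurrences.size) }
       .toSeq

var results = Map.empty[String, Int]
val sink: Sink = pairs => { results = pairs.toMap }

// "Running the flow" is just wiring source -> pipe -> sink.
sink(wordCount(source()))
```

The point of the abstraction is that the pipe assembly knows nothing about where data comes from or goes to; swapping a local file tap for an HDFS tap leaves the logic untouched.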
Scalding
• Scala wrapper for Cascading
• Just like working with in-memory collections (map/filter/sort…)
• Built-in parsers for TSV/CSV, date handling, etc.
• Helper algorithms, e.g.:
  • approximation algorithms (the Algebird library)
  • a matrix API
Wordcount in Scalding
Run the WordCountJob in local mode with the given input and output
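The slide's code is missing from the extracted text; it was very likely close to the canonical word count from the Scalding README, roughly as below (this needs the scalding-core dependency on the classpath, so it will not compile standalone):

```scala
import com.twitter.scalding._

// Canonical Scalding word count (typed API): read lines, split into words,
// count occurrences per word, write (word, count) pairs as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("output")))
}
```

In local mode it is run through com.twitter.scalding.Tool with the --local flag plus --input and --output arguments, reading and writing plain local files.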
Building and Deploying
• Get sbt
• sbt assembly produces a fat jar in target/scala-2.10
• sbt s3-upload builds the jar and uploads it to S3
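A minimal project/plugins.sbt for this setup might look as follows; the exact plugin coordinates and version numbers are assumptions, so check the plugins' own documentation:

```scala
// project/plugins.sbt
// sbt-assembly provides the `assembly` task that builds the fat jar;
// an S3 plugin provides the `s3-upload` task.
// Coordinates/versions below are examples, not authoritative.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
addSbtPlugin("com.typesafe.sbt" % "sbt-s3" % "0.5")
```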
Running on EMR
• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar
• hadoop jar job.jar \
    com.twitter.scalding.Tool \            (entry class)
    com.adform.dspr.MadeupJob \            (Scalding job class)
    --hdfs \                               (run in HDFS mode)
    --logs s3://dev-adform-test/logs \     (parameter)
    --meta s3://dev-adform-test/metadata \ (parameter)
    --output s3://dev-adform-test/output   (parameter)
For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a custom runner app; check out https://gitz.adform.com/dco/dco-amazon-runner
Development
• Two APIs:
• Fields – everything is a string
• Typed – working with classes, e.g. Request/Transaction
Development
• Fields:
  • No need to parse columns
  • Redundancy
  • No IDE support (e.g., auto-completion)
• Typed:
  • All the benefits of types, especially compile-time checking
  • More manual work for parsing
  • The API can sometimes be confusing (TypedPipe/Grouped/CoGrouped…)
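To make the contrast concrete, here is the same small aggregation sketched in both APIs. This needs scalding-core to compile; the column names are made up, and the exact builder methods may differ slightly between Scalding versions:

```scala
import com.twitter.scalding._

// Fields API: columns are addressed by Symbol, values are effectively untyped,
// so a wrong field name or type surfaces only at runtime.
class FieldsSum(args: Args) extends Job(args) {
  Tsv(args("input"), ('user, 'amount))
    .filter('amount) { amount: Double => amount > 0 }
    .groupBy('user) { _.sum[Double]('amount) }
    .write(Tsv(args("output")))
}

// Typed API: rows are real tuples (or case classes), so the same mistakes
// are caught at compile time, at the cost of declaring the types up front.
class TypedSum(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, Double)](args("input")))
    .filter { case (_, amount) => amount > 0 }
    .sumByKey
    .write(TypedTsv[(String, Double)](args("output")))
}
```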
Downsides
• A lot of configuration and googling of random issues
• Scarce documentation; you end up reading the source code and Stack Overflow
• IntelliJ is slow
• Boilerplate code for parsing data
Some tips
• In local mode, inputs and outputs are files; in HDFS mode, they are folders
• You can use Hadoop API to read files from HDFS directly, but only on submitting node, not in the pipeline
• As a workaround for the previous problem, you can use the distributed cache mechanism, but that only works on Hadoop 1, AFAIK
• The default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m"
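The override from the last tip looks roughly like this inside a job class; the heap size and job name are example values, and it needs scalding-core to compile:

```scala
import com.twitter.scalding._

// Raising the per-task child JVM heap by overriding Job.config.
// The mapred.* key is the Hadoop 1-era property name used at the time.
class MadeupJob(args: Args) extends Job(args) {
  override def config: Map[AnyRef, AnyRef] =
    super.config + ("mapred.child.java.opts" -> "-Xmx1024m")

  // ... pipeline definition as usual ...
}
```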
Resources
• https://github.com/twitter/scalding/wiki (wiki)
• https://github.com/twitter/scalding/tree/develop/tutorial (basic stuff)
• https://github.com/twitter/scalding/tree/develop/scalding-core/src/main/scala/com/twitter/scalding/examples (advanced examples, e.g., iterative jobs)
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014