
Scalding by Adform Research, Alex Gryzlov

Jul 15, 2015

Transcript
Page 1: Scalding by Adform Research, Alex Gryzlov

Wordcount in MapReduce

Page 2

Cascading

Tap / Pipe / Sink abstraction over Map / Reduce in Java

Page 3

Cascading

Page 4

Wordcount in Cascading

Page 5

Scalding

• Scala wrapper for Cascading

• Just like working with in-memory collections (map/filter/sort…)

• Built-in parsers for TSV/CSV, date annotations, etc.

• Helper algorithms, e.g.:

  • approximations (Algebird library)

  • matrix API
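The "in-memory collections" feel can be sketched with the Fields API. Everything below (the field names, paths, and the CleanupJob class) is made up for illustration:

```scala
import com.twitter.scalding._

// Hypothetical job: read a TSV with named fields, then filter and aggregate
// using collection-style operations (filter/map/groupBy).
class CleanupJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('userId, 'ts, 'url))          // built-in TSV parser
    .filter('url) { url: String => url.nonEmpty }   // drop empty URLs
    .map('url -> 'domain) { url: String =>          // derive a new field
      url.split('/').drop(2).headOption.getOrElse("")
    }
    .groupBy('domain) { _.size('hits) }             // count rows per domain
    .write(Tsv(args("output")))
}
```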

Page 6: Scalding by Adform Research, Alex Gryzlov

Wordcount in Scalding
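The slide's code did not survive the transcript; the canonical Fields-API word count from the Scalding tutorial looks roughly like this (job name and argument keys follow the tutorial's conventions):

```scala
import com.twitter.scalding._

// Classic Scalding word count: read lines, split into words,
// count occurrences per word, write (word, count) pairs as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```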

Page 7

Run the WordCountJob in local mode with the given input and output
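One way to invoke it; the jar path, package name, and file names below are assumptions, not from the slides:

```shell
# Run the assembled job in Cascading's local (in-memory) mode.
java -cp target/scala_2.10/madeup-job-assembly.jar \
  com.twitter.scalding.Tool \
  com.adform.dspr.WordCountJob \
  --local \
  --input data/input.txt \
  --output data/output.txt
```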

Page 8

Building and Deploying

• Get sbt

• sbt assembly produces a jar file in target/scala_2.10

• sbt s3-upload builds the jar and uploads it to S3
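A minimal build sketch, assuming the sbt-assembly plugin (and an S3 plugin for the upload step); the version numbers are period-appropriate guesses, not from the slides:

```scala
// build.sbt (sketch; library and plugin versions are assumptions)
name := "madeup-job"

scalaVersion := "2.10.4"

libraryDependencies += "com.twitter" %% "scalding-core" % "0.13.1"

// project/plugins.sbt would add the assembly plugin:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```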

Page 9

Running on EMR

• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar

• hadoop jar job.jar \
    com.twitter.scalding.Tool \              (entry class)
    com.adform.dspr.MadeupJob \              (Scalding job class)
    --hdfs \                                 (run in HDFS mode)
    --logs s3://dev-adform-test/logs \       (parameter)
    --meta s3://dev-adform-test/metadata \   (parameter)
    --output s3://dev-adform-test/output     (parameter)

For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a custom runner app; check out https://gitz.adform.com/dco/dco-amazon-runner

Page 10

Development

• Two APIs:

• Fields – everything is a string

• Typed – working with classes, e.g. Request/Transaction
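For contrast, a Typed-API word count might look like this (a sketch; sumByKey and TypedTsv come from Scalding's typed API):

```scala
import com.twitter.scalding._

// Typed API: the pipeline carries concrete types (String, (String, Long)),
// so mistakes are caught at compile time rather than at run time.
class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("""\s+"""))
    .filter(_.nonEmpty)
    .map(word => (word, 1L))
    .sumByKey                               // group by word, sum the counts
    .write(TypedTsv[(String, Long)](args("output")))
}
```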

Page 11

Development

• Fields:

  • No need to parse columns

  • Redundancy

  • No IDE support (e.g., auto-completion)

• Typed:

  • All the benefits of types, especially compile-time checking

  • More manual work for parsing

  • The API can sometimes be confusing (TypedPipe/Grouped/CoGrouped…)

Page 12

Downsides

• A lot of configuration tweaking and googling of random issues

• Scarce documentation – you often have to read the source code or Stack Overflow

• IntelliJ is slow

• Boilerplate code for parsing data

Page 13

Some tips

• In local mode you specify files as input/output; in HDFS mode, folders

• You can use the Hadoop API to read files from HDFS directly, but only on the submitting node, not inside the pipeline

• As a workaround for the previous problem you can use the distributed cache mechanism, but that only works on Hadoop 1, AFAIK

• The default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m"
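The last tip can be sketched like this, assuming a Scalding version where Job.config has this signature; the 2048 MB value is just an example:

```scala
import com.twitter.scalding._

// Raise the per-mapper/reducer JVM heap by extending the job configuration.
class BigMemoryJob(args: Args) extends Job(args) {
  override def config: Map[AnyRef, AnyRef] =
    super.config + ("mapred.child.java.opts" -> "-Xmx2048m")

  // ... pipeline definition goes here ...
}
```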

Page 14

Resources

• https://github.com/twitter/scalding/wiki Wiki

• https://github.com/twitter/scalding/tree/develop/tutorial Basic stuff

• https://github.com/twitter/scalding/tree/develop/scalding-core/src/main/scala/com/twitter/scalding/examples Advanced examples, e.g., iterative jobs

• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation

• http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf

• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014