Moving Faster: Why Intent Media Chose Cascalog for Data Processing and Machine Learning Kurt Schrader May 20, 2014

Moving Faster: Why Intent Media Chose Cascalog for Data ...gotocon.com/dl/goto-chicago-2014/slides/KurtSchrader_MovingFasterWhyWe... · Moving Faster: Why Intent Media Chose Cascalog

May 23, 2020


Moving Faster: Why Intent Media Chose Cascalog for Data Processing and Machine Learning

Kurt Schrader, May 20, 2014


Overview

• History of data processing at Intent Media

• “Hello World” in various data processing languages

• Cascalog overview

• The future


Who am I?


Data Science at IM

• Builds models from terabytes of data:

• Propensity to buy

• Propensity to click on an offer

• Been learning about “how to do” data science for five years


History of Data Processing at Intent Media


Originally: Hadoop Java API

Hadoop DFS

Hadoop Map Reduce

Java API


Hadoop Java API Example


Example: Anagram Mapper in Java

Sort characters

beta -> abet, beat -> abet


Example: Anagram Reducer in Java
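
The mapper and reducer code from these slides did not survive extraction. As a rough illustration of the idea only (plain Python, not the Hadoop Java API, and not the speaker's code), the mapper can emit each word under a key made of its sorted characters, and the reducer can collect the words that share a key. The function names and the input list are hypothetical.

```python
from collections import defaultdict


def anagram_map(word):
    """Map step: emit (key, word), where key is the word's characters sorted."""
    key = "".join(sorted(word))
    return key, word


def anagram_reduce(pairs):
    """Reduce step: group every word that was emitted under the same key."""
    groups = defaultdict(list)
    for key, word in pairs:
        groups[key].append(word)
    return dict(groups)


# Hypothetical input; "beta" and "beat" come from the slide.
words = ["beta", "beat", "stop", "pots"]
groups = anagram_reduce(anagram_map(w) for w in words)
print(groups)
```

Both "beta" and "beat" map to the key "abet", so they end up in the same group.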


Downsides of Java API

• Hard to write

• Need to think in “map -> reduce”

• Hard to test

• Hard to read and understand when you go back to it in the future

• Too low level


“There's no problem in Computer Science that can't be solved by adding another layer of abstraction”


Another Layer of Abstraction

Hadoop DFS

Hadoop Map Reduce

Java

?


Apache Pig

Hadoop DFS

Hadoop Map Reduce

Java

Pig


Apache Pig

• Pig is a higher-level declarative language that compiles down to MapReduce jobs

• Simpler to reason about

• Allows for User Defined Functions


User Defined Function


Pig Downsides

• Defines its own language

• Custom operations in Python, Java, etc.

• Hard to unit test


Cascading

Hadoop DFS

Hadoop Map Reduce

Java

Cascading


Cascading overview

• Java data processing library

• Thinks of your data as a stream

• Uses a taps, pipes, and sinks abstraction

• Built-in functions for common tasks
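
To make the stream picture concrete, here is a minimal sketch in plain Python generators. It is only an analogy for the taps, pipes, and sinks abstraction and does not use the Cascading API; every function name and the sample input are hypothetical.

```python
def source_tap(lines):
    """Source tap: yields input records one at a time."""
    for line in lines:
        yield line


def split_words(stream):
    """Pipe stage: turns each line into a stream of words."""
    for line in stream:
        for word in line.split():
            yield word


def lowercase(stream):
    """Pipe stage: normalizes each word to lower case."""
    for word in stream:
        yield word.lower()


def sink_tap(stream, output):
    """Sink: drains the stream into an output collection."""
    for record in stream:
        output.append(record)
    return output


# Wire the stages together: source -> pipes -> sink.
result = sink_tap(lowercase(split_words(source_tap(["Hello World", "Bye World"]))), [])
print(result)
```

Nothing runs until the sink pulls records through, which loosely mirrors how a flow is assembled first and executed afterwards.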


Built-in Operations


Taps

Pipes


“There's no problem in Computer Science that can't be solved by adding another layer of abstraction”


Cascalog

Hadoop DFS

Hadoop Map Reduce

Java

Cascading

Cascalog


Clojure


Lisp that runs on the JVM

(and the CLR, and on JavaScript)


(((((((((())))))))))


“Our hypothesis was that if we wrote our software in Lisp, we'd be able to get features done faster than our competitors, and also to do things in our software that they couldn't do.”

–Paul Graham


Cascalog example


Datalog


“Datalog is a truly declarative logic programming language that syntactically is a subset of Prolog.”


Hello World


Word Count


Word Count SQL


Word Count Java Hadoop

< Hello, 1> < World, 1> < Bye, 1> < World, 1>

< Hello, 1> < World, 2> < Bye, 1>
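
The Java source for this slide is not in the transcript. As a hedged sketch of what the tuples show, the following plain-Python code (not the Hadoop API) emits a (word, 1) pair per word and then sums the pairs per word. The input line is an assumption chosen to match the tuples on the slide, and the function names are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


line = "Hello World Bye World"  # assumed input, inferred from the slide's tuples
mapped = map_words(line)
counts = reduce_counts(mapped)
print(mapped)
print(counts)
```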


Word Count Pig


Word Count Cascading


Word Count Cascalog

• Output tap

• Operation

• Aggregation

• Generator

• Query creation

• Query execution


Tokenize


Cascalog overview

(Credit to Jon Sondag, Head of Data Science, Intent Media) https://github.com/johnnywalleye/nyc-clj-meetup-apr-14


Real example

Pre-aggregation (Generators)

[["impression-1" "buy this product"]
 ["impression-2" "great deal"]
 ["impression-3" "cheap sale"]
 ["impression-4" "cheap sale"]]

[["click-1" "impression-3" 0]
 ["click-2" "impression-2" 0]
 ["click-3" "impression-2" 100]]


Real example

Pre-aggregation (Join)

[["impression-1" "buy this product" nil nil]
 ["impression-2" "great deal" "click-2" 0]
 ["impression-2" "great deal" "click-3" 100]
 ["impression-3" "cheap sale" "click-1" 0]
 ["impression-4" "cheap sale" nil nil]]


Real example

Pre-aggregation (Operations)

[["impression-1" ["buy" "this" "product"] nil nil]
 ["impression-2" ["great" "deal"] "click-2" 0]
 ["impression-2" ["great" "deal"] "click-3" 100]
 ["impression-3" ["cheap" "sale"] "click-1" 0]
 ["impression-4" ["cheap" "sale"] nil nil]]


Real example

Aggregation

[["impression-1" ["buy" "this" "product"] 0]
 ["impression-2" ["great" "deal"] 100]
 ["impression-3" ["cheap" "sale"] 0]
 ["impression-4" ["cheap" "sale"] 0]]


Real example

Post-aggregation (Operations)

[["impression-1" ["buy" "this" "product"] -1]
 ["impression-2" ["great" "deal"] 1]
 ["impression-3" ["cheap" "sale"] -1]
 ["impression-4" ["cheap" "sale"] -1]]
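
The stages of this example can be traced end to end in a short plain-Python sketch. It is not Cascalog code and not the query from the talk. It assumes the join is a left join on the impression id, the aggregation is a sum of click values with nil counted as 0, and the final operation labels an impression 1 when the sum is positive and -1 otherwise. Those assumptions are consistent with the data on the slides but are not stated there, and all function names are hypothetical.

```python
impressions = [
    ("impression-1", "buy this product"),
    ("impression-2", "great deal"),
    ("impression-3", "cheap sale"),
    ("impression-4", "cheap sale"),
]
clicks = [
    ("click-1", "impression-3", 0),
    ("click-2", "impression-2", 0),
    ("click-3", "impression-2", 100),
]


def left_join(impressions, clicks):
    """Join: keep every impression, with None where no click matches."""
    rows = []
    for imp_id, text in impressions:
        matches = [(c_id, value) for c_id, c_imp, value in clicks if c_imp == imp_id]
        for c_id, value in matches or [(None, None)]:
            rows.append((imp_id, text, c_id, value))
    return rows


def tokenize(rows):
    """Pre-aggregation operation: split the ad text into words."""
    return [(imp_id, text.split(), c_id, value) for imp_id, text, c_id, value in rows]


def aggregate(rows):
    """Aggregation (assumed to be a sum): total click value per impression."""
    totals = {}
    for imp_id, words, _c_id, value in rows:
        _, running = totals.get(imp_id, (words, 0))
        totals[imp_id] = (words, running + (value or 0))  # None counts as 0
    return [(imp_id, words, total) for imp_id, (words, total) in totals.items()]


def label(rows):
    """Post-aggregation operation (assumed rule): 1 if total > 0, else -1."""
    return [(imp_id, words, 1 if total > 0 else -1) for imp_id, words, total in rows]


joined = left_join(impressions, clicks)
aggregated = aggregate(tokenize(joined))
result = label(aggregated)
print(result)
```

Each intermediate value (`joined`, `aggregated`, `result`) corresponds to one of the "Real example" slides.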


Demo


Built-in Filter Operations

• first-n

• limit

• limit-rank

• fixed-sample

• fixed-sample-agg

• re-parse


Built-in Agg Operations

• avg

• count/!count

• distinct-count

• max

• min

• sum


Built-in Higher Order Functions

• all

• any

• comp

• each

• negate

• partial


Workflow

• On a sampled dataset:

• Unit test the individual functions

• End-to-end test the workflow

• Then, test on the cluster
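
A rough sketch of the first two steps of this workflow, in plain Python rather than Clojure or midje-cascalog: take a small sample of the data, unit test one function in isolation, then test the whole workflow on a tiny input. The functions, the sampling scheme, and the test data are all hypothetical.

```python
import random


def tokenize(text):
    """Individual function under test: lower-case and split on whitespace."""
    return text.lower().split()


def word_count(lines):
    """The whole workflow: tokenize every line and count the words."""
    counts = {}
    for line in lines:
        for word in tokenize(line):
            counts[word] = counts.get(word, 0) + 1
    return counts


def sample(records, fraction, seed=0):
    """Keep roughly `fraction` of the records, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def test_tokenize():
    # Unit test of a single function.
    assert tokenize("Great Deal") == ["great", "deal"]


def test_workflow_end_to_end():
    # End-to-end test of the workflow on a tiny, hand-checked input.
    assert word_count(["cheap sale", "Cheap deal"]) == {"cheap": 2, "sale": 1, "deal": 1}


test_tokenize()
test_workflow_end_to_end()
```

Only after both kinds of test pass locally would the same workflow be run against the full dataset on the cluster.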


midje-cascalog


Checkpoint


The Future


How can we move even faster?


“Hadoop is the EJB of data processing”

–Dave Thomas


What’s next?

Hadoop DFS

Hadoop Map Reduce

Java

Cascading

Cascalog

?


Cascalog 2.0


Cascalog 2.0 Backends

• Spark

• Storm

• Cascading


Cascalog 2.0 Backends


Cascading 3.0


Cascading 3.0 backends

• Spark

• Storm

• Tez

• MapReduce


The Future

• Data processing time will continue to decrease, hopefully by orders of magnitude

• We’ll be able to write our data processing code at a high level of abstraction and let the system handle the complexity underneath


Questions?