An introduction to Cascalog

Powerful and easy-to-use data analysis tool for Hadoop

Nathan Marz, BackType

About Me

Tech Lead at BackType

Have been working on many-terabyte scale systems for two years

ETL workflows

Data warehouses

What is Hadoop?

Distributed Filesystem

MapReduce Framework

Scales to thousands of machines and petabytes of data

What is Cascalog?

Clojure-based query language for Hadoop with Datalog-inspired syntax

Queries compile to one or more MapReduce jobs
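To make "compile to MapReduce jobs" concrete, here is a minimal plain-Python sketch of what one MapReduce job does for a count-per-key query. It is not Cascalog or Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are hypothetical and only illustrate the three stages.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, 1) pair per input record."""
    for person, _action in records:
        yield person, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def run_job(records):
    """Chain the three stages of a single job."""
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical sample data: (person, action) records.
actions = [("alice", "post"), ("bob", "like"), ("alice", "comment")]
print(run_job(actions))
```

A query that needs a second grouping or a join would correspond to chaining more than one such job.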

The tool I wish I had two years ago

Features

Inner and outer joins

Aggregators

Functions

Subqueries

Sorting

High performance
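As a rough illustration of the difference between the inner and outer joins listed above, the plain-Python sketch below joins two small keyed datasets. It shows the semantics only and is not Cascalog syntax; `inner_join`, `left_outer_join`, and the `age` and `gender` data are hypothetical names chosen for the example.

```python
def inner_join(left, right):
    """Keep only keys present in both inputs."""
    return {k: (left[k], right[k]) for k in left if k in right}

def left_outer_join(left, right):
    """Keep every key from `left`; missing right-hand values become None."""
    return {k: (left[k], right.get(k)) for k in left}

# Hypothetical datasets keyed by person.
age = {"alice": 28, "bob": 33, "chris": 40}
gender = {"alice": "f", "bob": "m"}

inner = inner_join(age, gender)       # "chris" is dropped
outer = left_outer_join(age, gender)  # "chris" is kept with a None value
```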

What sets Cascalog apart?

Super simple

Full power of Clojure always available

Easy to extend with custom operations

Dynamic queries

Arbitrary inputs and outputs

Experiment with Cascalog

Ships with test dataset that can be queried locally (the “playground”)

5 minutes to set up Hadoop, Clojure, and Cascalog locally (see the README)

News feed generator

Ranks events in social network for each person based on “importance” and recency

38 lines of code

Demo time!

News Feed

“Follows” and “Action” data sources

Text files on HDFS

Follows Action

News Feed

Custom Aggregator to produce a news feed in JSON-like form
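The slide does not show the aggregator itself, so the following is only a plain-Python sketch of the idea: group scored items by person and emit each person's top items as a JSON-like structure. The function `build_feeds`, its `limit` parameter, the field names, and the sample rows are all assumptions made for illustration.

```python
import json
from collections import defaultdict

def build_feeds(scored_items, limit=2):
    """Group (person, event, score) rows by person and keep the top `limit`."""
    by_person = defaultdict(list)
    for person, event, score in scored_items:
        by_person[person].append({"event": event, "score": score})
    return {
        person: sorted(items, key=lambda i: i["score"], reverse=True)[:limit]
        for person, items in by_person.items()
    }

# Hypothetical scored rows: (feed owner, event description, score).
rows = [
    ("alice", "bob posted", 5.0),
    ("alice", "dan joined", 1.0),
    ("alice", "chris liked", 4.0),
]
feeds = build_feeds(rows)
print(json.dumps(feeds))
```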

News Feed

Custom Function to score each item in the feed
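The deck says items are ranked by "importance" and recency but does not give the formula. The plain-Python sketch below uses an invented one (follower count divided by one plus the event's age in hours) purely to show the shape of such a scoring function; `score_item`, the formula, and the sample data are hypothetical.

```python
def score_item(follower_count, event_time, now):
    """Hypothetical score: more followers raise it, older events lower it."""
    age_hours = max(now - event_time, 0) / 3600.0
    return follower_count / (1.0 + age_hours)

# Hypothetical events: (actor, action, unix timestamp in seconds).
events = [("chris", "like", 3600), ("bob", "post", 0)]
followers = {"bob": 10, "chris": 4}
now = 3600

# Rank events from highest to lowest score.
ranked = sorted(
    events,
    key=lambda e: score_item(followers[e[0]], e[2], now),
    reverse=True,
)
```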

News Feed

Data sources

News Feed

Subquery to compute follower count for each person
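What the follower-count subquery computes can be sketched in plain Python as a grouped count. This is not the Cascalog query from the demo; `follower_counts` is a hypothetical name, and the sketch assumes each "Follows" record is a (follower, followed) pair.

```python
from collections import Counter

def follower_counts(follows):
    """Count followers per followed person from (follower, followed) pairs."""
    return Counter(followed for _follower, followed in follows)

# Hypothetical "Follows" records.
follows = [("alice", "bob"), ("chris", "bob"), ("bob", "alice")]
counts = follower_counts(follows)
```

The resulting counts can then be fed into a later step, which is the role a subquery plays in the larger query.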

News Feed

Tie everything together in a single Cascalog query

Questions?

Project page: http://www.github.com/nathanmarz/cascalog

Tutorial: http://nathanmarz.com/blog/introducing-cascalog

Follow me on Twitter: @nathanmarz
