Cascalog: Powerful and easy-to-use data analysis tool for Hadoop. Nathan Marz, BackType
Transcript
Page 1: An introduction to Cascalog

Powerful and easy-to-use data analysis tool

for Hadoop

Cascalog

Nathan Marz, BackType

Page 2: An introduction to Cascalog

About Me

Tech Lead at BackType

Have been working on many-terabyte scale systems for two years

ETL workflows

Data warehouses

Page 3: An introduction to Cascalog

What is Hadoop?

Distributed Filesystem

MapReduce Framework

Scales to thousands of machines and petabytes of data
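To make the MapReduce model on this slide concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the word-count task is chosen only as a familiar illustration of the map, group-by-key, and reduce steps.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect its (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    return [(word, 1) for word in line.split()]


def sum_reducer(_key, values):
    # Add up the counts emitted for one word.
    return sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

On a real cluster the map and reduce calls run in parallel across machines and the grouping step moves data over the network; the sketch only shows the data flow.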

Page 4: An introduction to Cascalog

What is Cascalog?

Clojure-based query language for Hadoop with Datalog-inspired syntax

Queries compile to one or more MapReduce jobs

The tool I wish I had two years ago

Page 5: An introduction to Cascalog

Features

Inner and outer joins

Aggregators

Functions

Subqueries

Sorting

High performance

Page 6: An introduction to Cascalog

What sets Cascalog apart?

Super simple

Full power of Clojure always available

Easy to extend with custom operations

Dynamic queries

Arbitrary inputs and outputs

Page 8: An introduction to Cascalog

Experiment with Cascalog

Ships with test dataset that can be queried locally (the “playground”)

5 minutes to set up Hadoop, Clojure, and Cascalog locally (see the README)

Page 9: An introduction to Cascalog

News feed generator

Ranks events in social network for each person based on “importance” and recency

38 lines of code

Page 10: An introduction to Cascalog

Demo time!

Page 11: An introduction to Cascalog

News Feed

“Follows” and “Action” data sources

Text files on HDFS

[Figure panels: Follows, Action]

Page 12: An introduction to Cascalog

News Feed

Custom aggregator to produce a news feed in JSON-like form
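The slide's code is not in the transcript, so the following is only a plain-Python sketch of what such an aggregator might compute, not Cascalog syntax. It assumes each input row is a `(reader, actor, action, score)` tuple, and the function name `build_feeds`, the `limit` parameter, and the output field names are all hypothetical.

```python
import json
from collections import defaultdict


def build_feeds(scored_items, limit=5):
    """Group scored items by reader and emit each feed as a JSON string.

    scored_items: iterable of (reader, actor, action, score) tuples
    (an assumed layout, not taken from the talk).
    """
    by_reader = defaultdict(list)
    for reader, actor, action, item_score in scored_items:
        by_reader[reader].append(
            {"person": actor, "action": action, "score": item_score}
        )

    feeds = {}
    for reader, items in by_reader.items():
        # Highest score first, then keep only the top `limit` entries.
        items.sort(key=lambda item: item["score"], reverse=True)
        feeds[reader] = json.dumps(items[:limit])
    return feeds
```

The aggregator sees all rows for one grouping key (here, the reader) and collapses them into a single output value, which is the role an aggregator plays in a query.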

Page 13: An introduction to Cascalog

News Feed

Custom function to score each item in the feed
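The transcript does not include the scoring formula, so the sketch below is an assumption made for illustration only. It is plain Python, not Cascalog, and it combines "importance" (taken here to be the actor's follower count) with recency through a hypothetical exponential decay; the name `score` and the `half_life` parameter are invented.

```python
def score(follower_count, event_time, now, half_life=3600.0):
    """Score one feed item from importance and recency.

    Assumed formula: importance halves every `half_life` seconds.
    Times are in seconds; events in the future are treated as age zero.
    """
    age = max(0.0, now - event_time)
    return follower_count * 0.5 ** (age / half_life)
```

A function like this is applied once per row and adds one output field, in contrast to an aggregator, which works on a whole group of rows.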

Page 14: An introduction to Cascalog

News Feed

Data sources

Page 15: An introduction to Cascalog

News Feed

Subquery to compute follower count for each person
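As a rough illustration of what this subquery computes, here is a plain-Python equivalent (not Cascalog syntax). It assumes the "Follows" source yields `(follower, followed)` pairs, which is a guess at the layout; `follower_counts` is a hypothetical name.

```python
from collections import Counter


def follower_counts(follows):
    """Count how many followers each person has.

    follows: iterable of (follower, followed) pairs (assumed layout).
    Returns a plain dict mapping each followed person to a count.
    """
    return dict(Counter(followed for _follower, followed in follows))
```

In query terms this is a group-by on the followed person with a count aggregator; people nobody follows do not appear in the result.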

Page 16: An introduction to Cascalog

News Feed

Tie everything together in a single Cascalog query
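The final query is not reproduced in the transcript. The sketch below shows, in plain Python rather than Cascalog, one way the pieces could fit together: join actions to the readers who follow the actor, score each item, and aggregate per reader. The input layouts (`(follower, followed)` and `(actor, action, timestamp)`), the decay-based score, and every name here are assumptions.

```python
import json
from collections import Counter, defaultdict


def news_feeds(follows, actions, now, half_life=3600.0, limit=5):
    """Build a JSON feed per reader from follows and actions.

    follows: iterable of (follower, followed) pairs (assumed layout)
    actions: iterable of (actor, action, timestamp) tuples (assumed layout)
    """
    follows = list(follows)

    # "Subquery": follower count per person, used as importance.
    follower_count = Counter(followed for _follower, followed in follows)

    # Index for the join: who reads each actor's events.
    readers_of = defaultdict(list)
    for follower, followed in follows:
        readers_of[followed].append(follower)

    # Join each action to its readers and score it (hypothetical formula).
    feeds = defaultdict(list)
    for actor, action, timestamp in actions:
        age = max(0.0, now - timestamp)
        item_score = follower_count[actor] * 0.5 ** (age / half_life)
        for reader in readers_of.get(actor, []):
            feeds[reader].append((item_score, actor, action))

    # "Aggregator": rank each reader's items and emit JSON.
    result = {}
    for reader, items in feeds.items():
        items.sort(key=lambda item: item[0], reverse=True)
        result[reader] = json.dumps(
            [{"person": a, "action": act, "score": s} for s, a, act in items[:limit]]
        )
    return result
```

In Cascalog the join, the function call, and the aggregation would be expressed declaratively in one query and compiled to MapReduce jobs; the loop structure here only mirrors the data flow.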

Page 17: An introduction to Cascalog

Questions?

Project page: http://www.github.com/nathanmarz/cascalog

Tutorial: http://nathanmarz.com/blog/introducing-cascalog

Follow me on Twitter: @nathanmarz