Cascalog
Powerful and easy-to-use data analysis tool for Hadoop
Nathan Marz, BackType
Nov 18, 2014
About Me
Tech Lead at BackType
Have been working on multi-terabyte-scale systems for two years
ETL workflows
Data warehouses
What is Hadoop?
Distributed Filesystem
MapReduce Framework
Scales to thousands of machines and petabytes of data
What is Cascalog?
Clojure-based query language for Hadoop with Datalog-inspired syntax
Queries compile to one or more MapReduce jobs
The tool I wish I had two years ago
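To make "queries compile to one or more MapReduce jobs" a little more concrete, here is a rough plain-Python sketch of the map, shuffle, and reduce phases that a simple "count actions per person" query could boil down to. This is not Cascalog syntax and it does not reflect how Cascalog's planner actually works; the record layout, sample names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical action records: (person, action text).
ACTIONS = [
    ("alice", "posted a photo"),
    ("bob", "liked a link"),
    ("alice", "wrote a comment"),
]


def map_phase(records):
    """Emit one (person, 1) pair per input record."""
    for person, _action in records:
        yield person, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_actions(records):
    """Run the three phases in order, all in memory."""
    return reduce_phase(shuffle_phase(map_phase(records)))


print(count_actions(ACTIONS))
```

On a real cluster each phase runs in parallel across machines; a declarative query language spares you from writing these phases by hand.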
Features
Inner and outer joins
Aggregators
Functions
Subqueries
Sorting
High performance
What sets Cascalog apart?
Super simple
Full power of Clojure always available
Easy to extend with custom operations
Dynamic queries
Arbitrary inputs and outputs
Experiment with Cascalog
Ships with test dataset that can be queried locally (the “playground”)
5 minutes to set up Hadoop, Clojure, and Cascalog locally (see README)
News feed generator
Ranks events in social network for each person based on “importance” and recency
38 lines of code
Demo time!
News Feed
“Follows” and “Action” data sources
Text files on HDFS
News Feed
Custom aggregator to produce a news feed in JSON-like form
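The code from this slide is not part of the transcript. As a stand-in, here is a minimal plain-Python sketch of what such an aggregator does: it receives all scored items for one person and folds them into a single JSON string. The function name `build_feed`, the `(score, action)` tuple layout, and the `limit` parameter are assumptions for illustration, not the talk's actual code.

```python
import json


def build_feed(scored_items, limit=3):
    """Aggregate one person's (score, action) pairs into a JSON feed string.

    Items are ordered by score, highest first, and cut off after `limit`.
    """
    ranked = sorted(scored_items, key=lambda item: item[0], reverse=True)
    feed = [{"action": action, "score": score} for score, action in ranked[:limit]]
    return json.dumps(feed)


# Hypothetical scored items for a single person.
items = [
    (0.2, "bob liked a link"),
    (0.9, "carol posted a photo"),
    (0.5, "dave wrote a comment"),
]
print(build_feed(items, limit=2))
```

The key property of an aggregator is that it sees the whole group of tuples for one key at once, which is what makes sorting and truncating possible.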
News Feed
Custom function to score each item in the feed
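The talk says items are ranked by "importance" and recency but the transcript does not give the formula. The sketch below is one plausible shape for such a scoring function in plain Python, assuming importance is derived from the actor's follower count and recency decays exponentially with age. The formula, the `half_life_seconds` parameter, and the function name are all hypothetical.

```python
import math


def score(follower_count, age_seconds, half_life_seconds=86400.0):
    """Hypothetical item score: log-scaled importance times recency decay.

    follower_count: followers of the person who performed the action.
    age_seconds: how long ago the action happened.
    half_life_seconds: age at which the recency factor drops to 0.5.
    """
    importance = math.log(1 + follower_count)
    recency = 0.5 ** (age_seconds / half_life_seconds)
    return importance * recency


# A fresh action by a popular person outranks an old one by the same person.
print(score(100, 0) > score(100, 3 * 86400))
```

Unlike an aggregator, a function like this works on one tuple at a time, so it needs no knowledge of the rest of the feed.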
News Feed
Data sources
News Feed
Subquery to compute follower count for each person
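As a rough plain-Python analogue of a subquery, the sketch below computes follower counts as one step and then feeds that result into an outer step. The `(follower, followed)` record layout, the sample names, and the `popular` helper are assumptions for illustration; the actual Cascalog subquery from the slide is not in the transcript.

```python
from collections import Counter

# Hypothetical "Follows" records: (follower, followed).
FOLLOWS = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
]


def follower_counts(follows):
    """Inner step: number of followers for each followed person."""
    return Counter(followed for _follower, followed in follows)


def popular(follows, minimum):
    """Outer step: consume the inner result and keep people with enough followers."""
    counts = follower_counts(follows)
    return sorted(person for person, n in counts.items() if n >= minimum)


print(dict(follower_counts(FOLLOWS)))
print(popular(FOLLOWS, 2))
```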
News Feed
Tie everything together in a single Cascalog query
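The final query itself is not in the transcript. To show how the pieces relate, here is a compact, self-contained plain-Python sketch of the overall computation: count followers, join each action to the followers of the person who performed it, score it, and aggregate a ranked feed per person. The record layouts, the scoring formula, and every name here are assumptions, and this in-memory version says nothing about how Cascalog would plan the equivalent query.

```python
import json
import math
from collections import Counter, defaultdict

# Hypothetical inputs: (follower, followed) and (person, action, unix time).
FOLLOWS = [("alice", "bob"), ("alice", "carol"), ("dave", "carol")]
ACTIONS = [("bob", "posted a photo", 900), ("carol", "liked a link", 1000)]


def news_feeds(follows, actions, now, half_life=3600.0):
    """Return {person: JSON list of feed items, best first}."""
    # Follower count per person, used as the importance signal.
    counts = Counter(followed for _follower, followed in follows)

    # Join: each action reaches every follower of the person who performed it.
    scored = defaultdict(list)
    for follower, followed in follows:
        for person, action, timestamp in actions:
            if person == followed:
                importance = math.log(1 + counts[person])
                recency = 0.5 ** ((now - timestamp) / half_life)
                scored[follower].append((importance * recency, f"{person} {action}"))

    # Aggregate: sort each person's items by score and emit JSON.
    return {
        person: json.dumps([text for _score, text in sorted(items, reverse=True)])
        for person, items in scored.items()
    }


for person, feed in sorted(news_feeds(FOLLOWS, ACTIONS, now=1000).items()):
    print(person, feed)
```

The nested loop is a naive join that is fine for a toy dataset; on Hadoop the join would be distributed rather than done in a single process.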
Questions?
Project page: http://www.github.com/nathanmarz/cascalog
Tutorial: http://nathanmarz.com/blog/introducing-cascalog
Follow me on Twitter: @nathanmarz