Page 1: How we use Hive at SnowPlow, and how the role of Hive is changing

SnowPlow

How Apache Hive and other big data technologies are transforming web analytics
How Hive is used at SnowPlow
Strengths and weaknesses of Hive vs alternatives

http://snowplowanalytics.com
@snowplowdata
@yalisassoon

Page 2

Some history

Web analytics:
• 1990: Web is born
• 1993: Log file based web analytics
• 1996 / 1997: Javascript tagging

Big data:
• 2004: Google publishes MapReduce paper
• 2006: Hadoop project split out of Nutch
• 2008: Facebook develops Hive
• 2010: Google publishes Dremel paper

Page 3

Implications

• Web analytics solutions were developed on the assumption that granular, event-level and customer-level data was too expensive to store and query

• Data is aggregated from the start. Data collection and analysis are tightly coupled

• Web analytics is limited:
– Hits
– Clicks, unique visitors, conversions
– Traffic sources

• Web analytics is silo’ed (separate tools to use vs other data sets):
– Hard to link to customer data (e.g. CRM)
– Hard to link to marketing data (e.g. DoubleClick)
– Hard to link to financial data (e.g. unit profit)

Page 4

Let’s reinvent web analytics

Web analytics is one (very rich) data set that is at the heart of:

Customer analytics:
• How do my users segment by behaviour?
• What is the customer lifetime value of my users? How can I forecast it based on their behaviour?
• What are the ‘sliding doors’ moments in a customer’s journey that impact their lifetime value?
• On which channels should I spend marketing budget to acquire high value customers?

Platform / application analytics:
• How do improvements to my application drive improved user engagement and lifetime value?
• Which parts of my application should I focus development on to drive return?

Catalogue analytics:
• How are my different products (items in a shop / articles in a newspaper / videos on a media site) performing? What is driving the most engagement? Revenue? Profit?
• How should I organise my catalogue online to drive the best user experience? How can I personalise it to different users?

SnowPlow: an open source platform that delivers the granular web analytics data, so you can perform the analyses above

Page 5

SnowPlow leverages big data and cloud technology across its architecture. Hive on EMR is used A LOT:

1. Javascript tag fires
2. Pixel served from Amazon Cloudfront
3. Request to pixel (incl. query string) logged
4. Hive reads the logs using a custom SerDe, and writes a single table of clean, partitioned event data back to S3 for ease of querying (sketched below)
5. S3 holds the single, “fat” Hive table
6. Query in Hive
7. Output to other analytics programmes, e.g. Excel, Tableau, R…
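To make step 4 concrete, here is a minimal sketch of what the Hive side of this flow can look like. The SerDe class name, jar path, S3 locations and column list are all illustrative assumptions, not SnowPlow’s actual names:

-- Illustrative sketch only: SerDe class, paths and columns are
-- hypothetical stand-ins, not SnowPlow's actual names.
ADD JAR snowplow-log-deserializer.jar;  -- register the custom SerDe (path illustrative)

-- External table over the raw Cloudfront access logs; the custom SerDe
-- parses each log line (incl. the pixel's query string) into columns.
CREATE EXTERNAL TABLE raw_logs
ROW FORMAT SERDE 'com.example.hive.SnowPlowLogDeserializer'
LOCATION 's3://my-cloudfront-logs/';

-- Clean, partitioned events table written back to S3 for ease of querying
CREATE EXTERNAL TABLE events (
  user_id      STRING,
  visit_id     INT,
  page_url     STRING,
  ev_action    STRING,
  mkt_campaign STRING
)
PARTITIONED BY (dt STRING)
LOCATION 's3://my-snowplow-events/';

-- One ETL run: clean one day of logs into one partition
INSERT OVERWRITE TABLE events PARTITION (dt = '2012-05-01')
SELECT user_id, visit_id, page_url, ev_action, mkt_campaign
FROM raw_logs;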

Page 6

How SnowPlow data looks in Hive: one line per event (e.g. page view, add-to-basket), with columns grouped as follows (an example query follows the list):

• User: user_id, visit_id, ip_address
• Page: url, title
• Marketing: source, medium, term, content, campaign, referrer
• Event: category, action, label, property, value
• Browser: name, family, version, type, lang, …
• OS: name, family, manufacturer
• Device: type, is_mobile?, width, height
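For example, a typical query against this fat table (reusing the hypothetical table and column names from the earlier ETL sketch, including the dt partition) might compute daily uniques and event volumes:

-- Daily unique visitors and event counts over the assumed events table
SELECT dt,
       COUNT(DISTINCT user_id) AS unique_visitors,
       COUNT(*)                AS events
FROM events
WHERE dt >= '2012-05-01'
GROUP BY dt;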

Page 7

We love Hive… but…

What we love:
• Easy to use and query (especially compared with NoSQL competitors, e.g. MongoDB). For example (a sketch of one such recipe follows below):
– http://snowplowanalytics.com/analytics/basic-recipes.html
– http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html
• Rapidly develop ETL and analytics queries
• Easy to run on Amazon EMR
• Tight integration with Amazon S3

But:
• Hard to debug
• Slow
• Limited power
• Batch based (Hadoop’s fault…)
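To give a flavour of the recipes linked above, a cohort analysis reduces to a self-join in HiveQL. This is a minimal sketch against the hypothetical events table from earlier, bucketing users by the month they were first seen:

-- Monthly cohort retention: users grouped by first-seen month, counted
-- in each month they were active. Table and column names are the
-- assumed ones from the earlier sketches, not SnowPlow's actual schema.
SELECT c.cohort_month,
       SUBSTR(e.dt, 1, 7)        AS active_month,
       COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN (
  SELECT user_id, MIN(SUBSTR(dt, 1, 7)) AS cohort_month
  FROM events
  GROUP BY user_id
) c ON (c.user_id = e.user_id)
GROUP BY c.cohort_month, SUBSTR(e.dt, 1, 7);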

Page 8

For storage and analytics, columnar databases provide an attractive alternative

Columnar databases (e.g. Infobright):
• Scale to terabytes (not petabytes)
• Fixed cost (dedicated analytics server with LOTs of RAM)
• Significantly faster – seconds not minutes
• Plug in to many analytics front ends, e.g. Tableau, Qlikview, R

Hive on EMR:
• Scales horizontally – to petabytes at least
• Pay-as-you-go – each query costs $
• An increasing number of front-ends can be ‘plugged in’, e.g. Toad for Cloud Databases

Page 9

For segmentation, personalisation and recommendation on web analytics data, you can’t beat Mahout

You can do computations that do not fit the SQL processing model, incl. machine learning, in Hive via transform scripts…

… but why would you?

-- Word count expressed as Hive MAP/REDUCE transform scripts.
-- 'tokenizer_script' and 'count_script' are external executables that
-- Hive streams rows to on stdin and reads results back from on stdout.
CREATE TABLE docs (contents STRING);

FROM (
  MAP docs.contents USING 'tokenizer_script' AS word, cnt
  FROM docs
  CLUSTER BY word
) map_output
REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;

Mahout:
• Large number of recommendation, clustering and categorisation algorithms
• Plays well with Hadoop
• Large, active developer community

Page 10

For ETL in production, you really need something more robust than Hive

• ETL: need to define sophisticated data pipelines so that:
– There is a clear audit path: which lines of data have been processed, which have not
– Where they have not, error handling flows deal with those lines (including potential reprocessing)
– Jobs fail gracefully (not shut down the whole job)
– It is easy to debug when things go wrong, diagnose the problem, and start again where you left off…

• An alternative to Hive we are exploring: Cascading
– Java framework for developing Hadoop-powered data processing applications
– Scala (Scalding) and Clojure (Cascalog) wrappers available

Page 11

Where we’re going with Hive @ SnowPlow

The new core flow:

1. Javascript tag
2. Pixel served from Amazon Cloudfront
3. Request to pixel (incl. query string) logged
4. ETL in Scalding (Cascading), with a Ruby wrapper
5. Infobright
6. Outputs consumed by:
– BI tools, e.g. Tableau, Qlikview, Pentaho
– Data exploration tools, e.g. R, Excel
– MI tools, e.g. Mahout

Hive remains:
• For ad hoc analytics on the atomic data (example below)
• For SnowPlow users with Petabytes of data

But… for most users… Hive is NOT part of the core flow
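A minimal sketch of the kind of ad hoc question that stays in Hive, again reusing the hypothetical table and column names from the earlier sketches:

-- Ad hoc analytics on the atomic data: add-to-basket events by
-- marketing campaign for one month (all names are assumptions)
SELECT mkt_campaign,
       COUNT(*) AS add_to_baskets
FROM events
WHERE ev_action = 'add-to-basket'
  AND dt BETWEEN '2012-05-01' AND '2012-05-31'
GROUP BY mkt_campaign;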