09/29/09 1 Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch

Prototyping Data Intensive Apps: TrendingTopicsdatawrangling.s3.amazonaws.com/trendingtopics_talk.pdf · Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research

May 28, 2020


dariahiddleston

09/29/09

Prototyping Data Intensive Apps: TrendingTopics.org

Pete Skomoroch
Research Scientist at LinkedIn
Consultant at Data Wrangling

@peteskomoroch


Talk Outline

• TrendingTopics Overview
• Wikipedia Page View Dataset
• Hadoop on Amazon EC2
• Loading Data on EC2: Amazon EBS & S3
• Daily Timelines with Hadoop Streaming
• Hive Data Warehouse Layer
• Trend Computation with Hive
• Hooking It All Together
• Front End & Visualizations


Data Intensive Web Apps

• Batch data mining or prediction with Hadoop
• Iterate quickly with high level languages & tools
  – Pig, Hive, Clojure, Cascading, Python, Ruby
• EC2: Get running with limited initial capital
• Use external data and APIs in novel ways
• Recent real world example: FlightCaster


TrendingTopics.org


Daily Pageview Timeline Charts


Detects Rising Trends with Hadoop


TrendingTopics is Open Source

• Built as a side project at Data Wrangling
• Core code completed over 2 weeks in June
• Code on Github
• Data released on Amazon Public Datasets


Technology Stack


Application Data Flow


Hadoop on Amazon EC2

• EC2: Launch servers on demand
• S3: Simple Storage Service
• EBS: Persistent disks for EC2
• Used Cloudera Hadoop Distribution
  – Includes Pig, Hive, HBase, Sqoop, EBS integration
  – Allows customization of Hadoop EC2 instances
  – See EC2 Docs and Tutorials on the Cloudera site


Wikipedia Page View Log Data

• 1 TB of hourly log files from Wikipedia Squid proxy
• Filename contains timestamp
• Columns: project, pagename, pageviews, & bytes
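
A minimal sketch of reading this format in Python. The filename pattern (`pagecounts-YYYYMMDD-HHMMSS.gz`), the space delimiter, and the helper names are assumptions for illustration; the slide only states that the filename carries the timestamp and that each row has four columns.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class PageCount(NamedTuple):
    project: str
    pagename: str
    pageviews: int
    bytes_sent: int


def parse_filename_timestamp(filename: str) -> datetime:
    """Read the hour from a name like 'pagecounts-20090601-130000.gz' (assumed pattern)."""
    stem = filename.split("/")[-1].split(".")[0]
    _prefix, date_part, time_part = stem.split("-")
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def parse_line(line: str) -> Optional[PageCount]:
    """Parse one space-delimited record; return None for malformed rows."""
    fields = line.strip().split(" ")
    if len(fields) != 4:
        return None
    project, pagename, views, size = fields
    try:
        return PageCount(project, pagename, int(views), int(size))
    except ValueError:
        # Non-numeric counts are treated as bad rows rather than fatal errors.
        return None
```

Returning None for bad rows lets a caller skip corrupt log lines instead of failing an hourly file partway through.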


Wikipedia Redirect Data

• Many articles in the logs redirect to canonical titles
• Amazon Public Dataset contains a lookup table:


Loading Data into Hadoop on EC2

• Hadoop can read directly from Amazon S3

• Pulling in data from EBS snapshots


Python Streaming for Daily Timelines


Streaming: Filter & Aggregate Logs
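
The code from this slide did not survive in the transcript. As a rough sketch of what a filter-and-aggregate streaming step could look like, the functions below keep English Wikipedia article rows and sum views per page per day. The `en`-only filter, the namespace check, the tab-separated output layout, and passing the date in as an argument are all assumptions, not the talk's actual code.

```python
from itertools import groupby


def map_records(lines, date):
    """Emit 'pagename<TAB>date<TAB>views' for rows that pass the filters."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed rows
        project, pagename, views, _size = fields
        if project != "en" or not views.isdigit():
            continue  # assumed filter: English Wikipedia only
        if ":" in pagename:
            continue  # crude filter for namespaces such as Special: or Talk:
        yield f"{pagename}\t{date}\t{views}"


def reduce_records(lines):
    """Sum views per page and date.

    Input must be grouped by pagename, which the streaming sort on the
    first tab-separated field provides. Dates within a page may arrive in
    any order, so they are accumulated in a dict before being emitted.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for pagename, group in groupby(parsed, key=lambda r: r[0]):
        daily = {}
        for _name, date, views in group:
            daily[date] = daily.get(date, 0) + int(views)
        for date in sorted(daily):
            yield f"{pagename}\t{date}\t{daily[date]}"
```

In a streaming job each function would be wrapped in its own script that reads `sys.stdin` and prints the yielded rows.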


Hive Data Warehouse Layer on EC2

• Easy for analysts familiar with SQL
• Familiar syntax (SELECT, GROUP BY, SORT...)
• Hide the MR muck, for JOIN, UNION, etc.
• Supports streaming UDFs
• Sampling: generate datasets for R&D
• Used on Timeline and Redirect data


Fixing Wiki Redirects with Hive JOIN
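
The Hive query from this slide is not in the transcript. To illustrate the logic such a join performs, here is an in-memory Python sketch, not Hive: each title is looked up in the redirect table, replaced by its canonical title when one exists, and the views are then summed. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def resolve_redirects(daily_views, redirects):
    """Fold redirect titles into their canonical article.

    daily_views: iterable of (title, date, views) tuples.
    redirects:   dict mapping a redirect title to its canonical title.
    Titles missing from the lookup are kept as-is, like a LEFT OUTER JOIN.
    """
    totals = defaultdict(int)
    for title, date, views in daily_views:
        canonical = redirects.get(title, title)
        totals[(canonical, date)] += views
    return dict(totals)
```

For example, views for a redirect such as "Obama" would be added to the row for "Barack_Obama" on the same date.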


Trend Detection With Hadoop


Hourly Data: “Java” vs. “Hangover”


Robust Regression with R & Hadoop

• Fit R model to hourly data for 2.3M Wiki articles
• “Trending” if article views >> model prediction
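
The R model itself is not shown in the transcript. As a stand-in for the "views >> prediction" test, the sketch below swaps in a simpler robust baseline: the median of past hourly views, with the median absolute deviation (MAD) as the scale. This is not the robust regression from the talk, and the threshold of 5 is an arbitrary assumption.

```python
from statistics import median


def is_spike(hourly_views, latest, threshold=5.0):
    """Flag `latest` when it sits far above a median/MAD baseline.

    hourly_views must be a non-empty list of past hourly counts.
    """
    baseline = median(hourly_views)
    mad = median([abs(v - baseline) for v in hourly_views])
    # 1.4826 rescales MAD to match a standard deviation under normality.
    scale = 1.4826 * mad
    if scale == 0:
        scale = 1.0  # avoid dividing by zero on perfectly flat series
    return (latest - baseline) / scale > threshold
```

Median and MAD are used because a single earlier spike barely moves them, so one past burst does not hide the next one.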

Page View “Spikes”


Call Python Trend Mapper from Hive


Simple Trend Calculation in Python
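
The slide's code is missing from the transcript, so the following is only one plausible shape for a simple trend score, not the author's formula. It compares the most recent window of daily views with the window before it; the 7-day window and the log-volume weighting are assumptions.

```python
import math


def trend_score(daily_views, window=7):
    """Score growth of the last `window` days against the previous `window`.

    Returns 0.0 when there is not enough history to compare two windows.
    """
    if len(daily_views) < 2 * window:
        return 0.0
    recent = sum(daily_views[-window:])
    previous = sum(daily_views[-2 * window:-window])
    # The log ratio measures relative growth; the +1 guards against zeros.
    growth = math.log((recent + 1) / (previous + 1))
    # Weighting by log volume damps tiny articles that double from 1 to 2 views.
    return growth * math.log(recent + 1)
```

A flat series scores zero, a rising series scores positive, and a falling one scores negative, which gives a single number per article that can be sorted.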


Use Trend Scores to Rank Articles


Exporting Data from our EC2 Cluster

• Output files are delimited text for bulk DB load
• Distcp outputs to S3:
  – hadoop distcp /user/output s3n://trendingtopics/exports
• Hive can create tables directly over S3 filesystem:
  – create external table foo ... location 's3n://trendingtopics/hiveoutput'


Hooking It All Together


Ruby on Rails Front End

• Run a development version of the app locally:


Automate & Deploy with EC2onRails

• EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache
• Daily cron job on App server launches Hadoop
• Hadoop EC2 boot script checks out latest trend detection code from Github
• Capistrano tasks deploy new application code, maintain DB, cache, archives


Cron Triggers Hadoop Jobs on EC2

Modify Cloudera hadoop-ec2-init-remote.sh boot script


Bash Scripts Automate Daily Run

• run_daily_merge.sh kicks off Hadoop/Hive jobs
• Queries MySQL for last run date
• Generates timelines with Hadoop streaming
• Hive operates on timeline data, computes Trends with Python
• Hive output copied to S3
• Calls remote script to load MySQL staging tables
• Hadoop cluster self terminates
• Remote script swaps staging and prod after load
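
The staging-to-production swap in the last step can be done in MySQL with a single multi-table `RENAME TABLE`, which applies all renames atomically. The sketch below only builds that statement; the table names and the temporary-name suffix are hypothetical, and the talk does not say this is how its remote script performs the swap.

```python
def swap_tables_sql(prod, staging):
    """Build one RENAME TABLE statement that exchanges two MySQL tables.

    After it runs, `prod` holds the freshly loaded data and `staging`
    holds the previous production data, ready for the next load.
    """
    tmp = f"{prod}_swap_tmp"  # hypothetical scratch name, must not already exist
    return (
        f"RENAME TABLE `{prod}` TO `{tmp}`, "
        f"`{staging}` TO `{prod}`, "
        f"`{tmp}` TO `{staging}`;"
    )
```

Because the three renames run as one statement, readers of the production table never see it missing or half-loaded.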


Capistrano Tasks

• Capistrano: Ruby tool for automating tasks on one or more remote servers (www.capify.org)
• Deploys Rails code and cron jobs to App server
• Cleanup tasks are triggered at end of load:
  – Flags featured Wikipedia articles
  – Indexes MySQL tables, Flushes cache


Visualization APIs

• Google Viz API: Annotated Timeline for Rails

• Google Charts sparklines: gc4r

• Yahoo BOSS image search in Rails


Ideas & Next Steps

• Try out the code on Github
• Show related articles using link graph data
• Extract trending phrases from article text
• Boost Twitter trending topics with Wikipedia logs
• Update top trends hourly
• Use date partitioned tables, incremental updates
• Add a real search engine (uses MySQL now)


LinkedIn Analytics Team

We’re Hiring


Questions?

• Contact
  – Blog: datawrangling.com
  – Twitter: @peteskomoroch
• Code
  – http://github.com/datawrangling/trendingtopics
  – Hadoop blog posts & tutorials on the Cloudera site
  – Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity
