Hadoop Usage At Yahoo! Milind Bhandarkar ([email protected] )
Page 1: Hadoop Usage at Yahoo!

Hadoop Usage At Yahoo!

Milind Bhandarkar([email protected])

Page 2: Hadoop Usage at Yahoo!

About Me

• Parallel Programming since 1989

• High-Performance Scientific Computing 1989 - 2005, Data-Intensive Computing 2005 - ...

• Hadoop Solutions Architect @ Yahoo!

• Contributor to Hadoop since 2006

• Training, Consulting, Capacity Planning

Page 3: Hadoop Usage at Yahoo!

History

• 2004-2005: Hadoop prototyped in Apache Lucene

• January 2006: Hadoop becomes subproject of Lucene

• January 2008: Hadoop becomes top-level Apache Project

• Latest Release: 0.21

• Stable Release: 0.20.x (+ Strong Authentication)

Page 4: Hadoop Usage at Yahoo!

Hadoop Ecosystem

• HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...

Page 5: Hadoop Usage at Yahoo!

Hadoop at Yahoo!

• Behind Every Click!

• 38,000+ Servers

• Largest cluster is 4000+ servers

• 1+ Million Jobs per month

• 170+ PB of Storage

• 10+ TB Compressed Data Added Per Day

• 1000+ Users

Page 6: Hadoop Usage at Yahoo!

Hadoop Growth

[Chart: Hadoop Servers and Hadoop Storage (PB) by year, 2006 to 2010. Axes: Thousands of Servers (0 to 90) and Petabytes (0 to 250). Phases marked: Research, Science Impact, Daily Production. Callout "Today": 38K Servers, 170 PB Storage, 1M+ Monthly Jobs.]

Page 7: Hadoop Usage at Yahoo!

Hadoop Clusters

• Hadoop Dev, QA, Benchmarking (10%)

• Sandbox, Release Validation (10%)

• Science, Ad-Hoc Usage (50%)

• Production (30%)

Page 8: Hadoop Usage at Yahoo!

Sample Applications

• Science + Big Data + Insight = Personal Relevance = Value

• Log processing: Analytics, Reporting, Buzz

• User Modeling

• Content Optimization, Spam filters

• Computational Advertising

Page 9: Hadoop Usage at Yahoo!

Major Data Sources

• Content (Web Pages, Blogs, News Articles, Media)

• Search Queries

• Advertisements (Display, Search)

• User

Page 10: Hadoop Usage at Yahoo!

Web Graph Analysis

• 100+ Billion Web Pages

• 1+ PB Content

• 2 Trillion links

• 300+ TB of compressed output

• Before Hadoop: 1 Month

• With Hadoop: 1 Week

Page 11: Hadoop Usage at Yahoo!

Search Assist

Page 12: Hadoop Usage at Yahoo!

Search Assist

• Related concepts occur closer together

• 3 years of query logs, sessionized per user

• 10+ Terabytes of Natural Language Text Corpus

• Build Dictionaries with Hadoop & Push to Serving

• Before Hadoop: 4 Weeks

• With Hadoop: 30 minutes
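
The sessionization and "related concepts occur closer together" bullets can be illustrated with a minimal sketch. It groups a query log into per-user sessions and counts query pairs that share a session. The 30-minute inactivity timeout, the record layout, and the function names are assumptions made for illustration, not details from the slides, and in-session co-occurrence is only a simple proxy for "closeness".

```python
from collections import Counter
from itertools import combinations

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def sessionize(log):
    """Group (user, timestamp, query) records into per-user sessions."""
    sessions = []
    last_seen = {}  # user -> (timestamp of last query, index of open session)
    for user, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > SESSION_GAP_SECONDS:
            sessions.append([])          # start a new session for this user
            index = len(sessions) - 1
        else:
            index = prev[1]              # continue the user's open session
        sessions[index].append(query)
        last_seen[user] = (ts, index)
    return sessions


def cooccurrence_counts(sessions):
    """Count how many sessions contain each unordered pair of distinct queries."""
    counts = Counter()
    for queries in sessions:
        for a, b in combinations(sorted(set(queries)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical log records: (user, seconds since start, query text).
log = [
    ("u1", 0, "car insurance"),
    ("u1", 60, "auto loans"),
    ("u1", 10000, "stock quotes"),
    ("u2", 5, "auto loans"),
    ("u2", 100, "car insurance"),
]
sessions = sessionize(log)
pairs = cooccurrence_counts(sessions)
```

On a cluster, the per-user grouping would typically be the shuffle key of a Map-Reduce job; the single-process version here only shows the logic.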

Page 13: Hadoop Usage at Yahoo!

Mail Spam Filtering

• Challenge: Scale

• 450+ Million mailboxes

• 5+ Billion deliveries per day

• 25+ Billion connections

• Challenge: User feedback is often late, noisy, inconsistent

Page 14: Hadoop Usage at Yahoo!

Data Factory

• 40+ Billion Events Per Day

• Parse & Transform Event Streams

• Join Clicks & Views

• Filter out Robots

• Aggregate, Sort, Partition

• Data Quality Checks
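
The steps above (join clicks and views, filter out robots, aggregate) can be sketched in a few lines. The record fields, the robot heuristic, and the function names are hypothetical; the slides do not describe the actual schemas or detection rules.

```python
from collections import Counter

# Hypothetical event records; the field names are assumptions for illustration.
views = [
    {"view_id": "v1", "user": "u1", "page": "autos", "agent": "Mozilla"},
    {"view_id": "v2", "user": "u2", "page": "finance", "agent": "crawler-bot"},
    {"view_id": "v3", "user": "u3", "page": "autos", "agent": "Mozilla"},
]
clicks = [{"view_id": "v1"}, {"view_id": "v2"}]


def is_robot(event):
    # Placeholder heuristic only; real robot detection is far more involved.
    return "bot" in event["agent"].lower()


def join_clicks_and_views(views, clicks):
    # Left join on view_id: every view is kept and flagged if it was clicked.
    clicked = {c["view_id"] for c in clicks}
    return [dict(v, clicked=v["view_id"] in clicked) for v in views]


def aggregate_by_page(events):
    # Count views and clicks per page, returned sorted by page name.
    view_counts, click_counts = Counter(), Counter()
    for e in events:
        view_counts[e["page"]] += 1
        click_counts[e["page"]] += int(e["clicked"])
    return sorted((p, view_counts[p], click_counts[p]) for p in view_counts)


joined = join_clicks_and_views(views, clicks)
humans = [e for e in joined if not is_robot(e)]
report = aggregate_by_page(humans)  # (page, views, clicks) tuples
```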

Page 15: Hadoop Usage at Yahoo!

User Modeling

• Objective: Determine User-Interests by mining user-activities

• Large dimensionality of possible user activities

• Typical user has sparse activity vector

• Event attributes change over time
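
Because the space of possible activities is huge while each user touches only a few of them, a sparse representation is a natural fit. The sketch below stores only non-zero entries in a dictionary; the key layout and function names are assumptions for illustration.

```python
def add_activity(vector, attribute_type, value, count=1):
    """Record an activity in a sparse vector keyed by (attribute_type, value)."""
    key = (attribute_type, value)
    vector[key] = vector.get(key, 0) + count
    return vector


def sparse_dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[key] for key, weight in a.items() if key in b)


# Hypothetical user activity: two visits to one page and one query.
user = {}
add_activity(user, "page", "autos/home")
add_activity(user, "query", "car insurance")
add_activity(user, "page", "autos/home")

# Hypothetical model weights over the same key space.
weights = {("page", "autos/home"): 0.5, ("ad", "brokerage"): 2.0}
interest = sparse_dot(user, weights)
```

Memory use grows with the number of activities a user actually has (tens to hundreds), not with the millions of possible values.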

Page 16: Hadoop Usage at Yahoo!

User Activities

Attribute | Possible Values | Typical Values Per User
Pages | 1+ Million | 10-100
Queries | 100+ Million | 10s
Ads | 100+ Thousand | 10s

Page 17: Hadoop Usage at Yahoo!

User-Modeling Pipeline

• Sessionization

• Feature and Target Generation

• Model Training

• Offline Scoring & Evaluation

• Batch Scoring & Upload to serving

Page 18: Hadoop Usage at Yahoo!

Data Acquisition

User | Time | Event | Source
U0 | T0 | Visited Yahoo! Autos | Web Server Logs
U0 | T1 | Searched for “Car Insurance” | Search Logs
U0 | T2 | Browsed stock quotes | Web Server Logs
U0 | T3 | Saw ad for “discount brokerage”, did not click | Ad Logs
U0 | T4 | Checked Yahoo! Mail | Web Server Logs
U0 | T5 | Clicked Ad for “Auto Insurance” | Ad Logs, Click Logs

Page 19: Hadoop Usage at Yahoo!

Normalization

User | Time | Event | Tag
U0 | T0 | View | Category: Autos, Tag: Mercedes Benz
U0 | T1 | Query | Category: Insurance, Tag: Auto
U0 | T2 | View | Category: Finance, Tag: YHOO
U0 | T3 | View-Click | Category: Finance, Tag: Brokerage
U0 | T4 | Browse | Irrelevant Event, Dropped
U0 | T5 | View+Click | Category: Insurance, Tag: Auto
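
Normalization of this kind can be sketched as a lookup that maps each raw event to an event type, category, and tag, and drops events with no mapping. The rule table and function name below are hypothetical and only mirror a few rows of the example; the slides do not say how the real mapping is built.

```python
# Hypothetical rules: raw event description -> (event type, category, tag).
RULES = {
    "Visited Yahoo! Autos": ("View", "Autos", "Mercedes Benz"),
    "Searched for Car Insurance": ("Query", "Insurance", "Auto"),
    "Browsed stock quotes": ("View", "Finance", "YHOO"),
}


def normalize(raw_events):
    """Map (user, time, description) records to tagged events; drop the rest."""
    normalized = []
    for user, time, description in raw_events:
        rule = RULES.get(description)
        if rule is None:
            continue  # irrelevant event, dropped
        event_type, category, tag = rule
        normalized.append((user, time, event_type, category, tag))
    return normalized


raw = [
    ("U0", "T0", "Visited Yahoo! Autos"),
    ("U0", "T1", "Searched for Car Insurance"),
    ("U0", "T4", "Checked Yahoo! Mail"),
]
events = normalize(raw)
```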

Page 20: Hadoop Usage at Yahoo!

Features & Targets

[Figure: a timeline (Time axis) of Query, View, and Click events, divided at time T into a Feature Window and a Target Window.]

Page 21: Hadoop Usage at Yahoo!

Targets

• User-Actions of Interest

• Clicks on Ads & Content

• Site & Page visits

• Conversion Events

• Purchases, Quote requests

• Sign-Up for membership, etc.

Page 22: Hadoop Usage at Yahoo!

Features

• Summary of user activities over a time-window

• Aggregates, moving averages, rates over various time-windows

• Incrementally updated
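
"Incrementally updated" means a feature can absorb each new observation without rescanning the full history. A minimal sketch follows, using a running count and an exponentially weighted moving average; the class name and the decay value are assumptions, and the slides do not say which averaging scheme is used.

```python
class IncrementalFeature:
    """Running count and exponentially weighted moving average of a signal."""

    def __init__(self, decay=0.5):
        self.decay = decay          # weight kept by the previous average
        self.count = 0
        self.moving_average = 0.0

    def update(self, value):
        # Fold one new observation into the feature in constant time.
        self.count += 1
        if self.count == 1:
            self.moving_average = float(value)
        else:
            self.moving_average = (
                self.decay * self.moving_average + (1 - self.decay) * value
            )
        return self.moving_average


# Hypothetical daily page-view counts for one user and one category.
feature = IncrementalFeature(decay=0.5)
history = [feature.update(v) for v in [4, 2, 0]]
```

Only the small per-feature state has to be stored between runs, so each day's job reads just that day's events.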

Page 23: Hadoop Usage at Yahoo!

Joining Targets & Features

• Target rates very low: 0.01% ~ 1%

• First, construct targets

• Filter user activity without targets

• Join feature vector with targets
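
One reading of these steps: since targets are rare, build the small target set first, discard activity for users who have no target, and then join what remains with the feature vectors. The sketch below follows that reading; the record layout, the "click" target, and the function names are assumptions.

```python
def construct_targets(events, target_event="click"):
    """Return {user: 1} for every user with at least one target event."""
    return {user: 1 for user, event in events if event == target_event}


def join_features_with_targets(features, targets):
    """Keep only users that have a target and attach it to their features."""
    return [
        (user, features[user], targets[user])
        for user in sorted(targets)
        if user in features
    ]


# Hypothetical (user, event) records from the target window.
events = [("u1", "view"), ("u2", "click"), ("u3", "view"), ("u2", "view")]
# Hypothetical feature vectors computed from the feature window.
features = {"u1": {"views": 1}, "u2": {"views": 1}, "u3": {"views": 1}}

targets = construct_targets(events)
training_rows = join_features_with_targets(features, targets)
```

Building the targets first keeps the join small, because the target set is a tiny fraction of all user activity.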

Page 24: Hadoop Usage at Yahoo!

Model Training

• Regressions

• Boosted Decision Trees

• Naive Bayes

• Support Vector Machines

• Maximum Entropy modeling

• Conditional Random Fields

Page 25: Hadoop Usage at Yahoo!

Model Training

• Some algorithms are difficult/inefficient to implement in Map-Reduce

• Require fine-grain iterations

• Different models in parallel

• Model for each target response in parallel
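
Training one model per target response in parallel can be sketched as follows. The "model" here is a deliberate stand-in (the positive rate of the labels), and a local thread pool stands in for cluster tasks; both are assumptions for illustration, not the approach described in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def train_base_rate_model(labels):
    # Stand-in "model": the fraction of positive labels. Not a real learner.
    return sum(labels) / len(labels)


def train_all(labels_by_target, max_workers=4):
    """Train one independent model per target response, in parallel."""
    targets = sorted(labels_by_target)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        models = pool.map(
            train_base_rate_model, (labels_by_target[t] for t in targets)
        )
        return dict(zip(targets, models))


# Hypothetical labels for two target responses.
labels_by_target = {"ad_click": [0, 0, 1, 0], "signup": [0, 1]}
models = train_all(labels_by_target)
```

Each training task is independent of the others, which is what makes this level of parallelism easy even when the algorithm itself iterates internally.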

Page 26: Hadoop Usage at Yahoo!

Offline Scoring & Evaluation

• Apply model weights to features

• Pleasantly parallel

• Janino Embedded Compiler

• Sort by scores and compute metrics

• Evaluate metrics
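
The scoring and evaluation steps can be sketched with a linear model: apply weights to features, sort by score, and compute a ranking metric. The choice of a linear score and of precision at k, along with all names and values, is an assumption for illustration; the slides do not name the metrics used.

```python
def score(weights, features):
    """Linear score: sum of weight * value over the example's features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def precision_at_k(scored_examples, k):
    """Fraction of positives among the k highest-scoring (score, label) pairs."""
    ranked = sorted(scored_examples, key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)


# Hypothetical model weights and labeled examples (features, label).
weights = {"autos_views": 1.0, "insurance_queries": 2.0}
examples = [
    ({"autos_views": 3}, 1),
    ({"insurance_queries": 1}, 0),
    ({"autos_views": 1, "insurance_queries": 2}, 1),
    ({}, 0),
]

# Each example is scored independently, which is why the step parallelizes well.
scored = [(score(weights, feats), label) for feats, label in examples]
p_at_2 = precision_at_k(scored, 2)
```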

Page 27: Hadoop Usage at Yahoo!

Batch Scoring

• Apply models to features from all user activity

• Upload scores to serving systems

Page 28: Hadoop Usage at Yahoo!

User Modeling Pipeline

Component | Data Volume | Time
Data Acquisition | 1 TB | 2-3 Hours
Feature & Target Generation | 1 TB * Feature Window Size | 4-6 Hours
Model Training | 50 - 100 GB | 1-2 Hours for 100s of Models
Scoring | 500 GB | 1 Hour

Page 29: Hadoop Usage at Yahoo!

Acknowledgements

• Vijay K Narayanan

• Vishwanath Ramarao

• Nitin Motgi

• And Numerous Hadoop Application Developers at Yahoo!

Page 30: Hadoop Usage at Yahoo!

Questions?