Elastic Web Mining

Jan 27, 2015

Ken Krugler

My talk at the ACM Data Mining Unconference on 01 Nov 2009. It covers how to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost-effective, scalable, and reliable web mining.
Transcript
Page 1: Elastic Web Mining
Page 2: Elastic Web Mining

Web Mining in the Cloud

Ken Krugler, Bixo Labs, Inc.

ACM Silicon Valley Data Mining Camp

01 November 2009

Hadoop/Cascading/Bixo in EC2

Page 3: Elastic Web Mining

About me

Background in vertical web crawl
  – Krugle search engine for open source code
  – Bixo open source web mining toolkit

Consultant for companies using EC2
  – Web mining
  – Data processing

Founder of Bixo Labs
  – Elastic web mining platform
  – http://bixolabs.com

Page 4: Elastic Web Mining

Typical Data Mining

Page 5: Elastic Web Mining

Data Mining Victory!

Page 6: Elastic Web Mining

Meanwhile, Over at McAfee…

Page 7: Elastic Web Mining

Web Mining 101

Extracting & Analyzing Web Data

More Than Just Search

Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…

Page 8: Elastic Web Mining

4 Steps in Web Mining

Collect - fetch content from web

Parse - extract data from formats

Analyze - tokenize, rate, classify, cluster

Produce - “useful data”
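
The four steps above can be sketched as a tiny pipeline. This is only an illustration, not code from the talk: the "web" is an in-memory dict standing in for real fetching, tag stripping uses a crude regex rather than a real HTML parser, and the function names are made up for this example.

```python
import re
from collections import Counter

def collect(fake_web):
    """Collect: stand-in for fetching; returns the HTML of every 'page'."""
    return list(fake_web.values())

def parse(html):
    """Parse: strip tags with a crude regex to get plain text."""
    return re.sub(r"<[^>]+>", " ", html)

def analyze(text):
    """Analyze: tokenize into lowercase words and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def produce(counts, top):
    """Produce: the 'useful data' here is simply the most frequent words."""
    return counts.most_common(top)

def mine(fake_web, top=2):
    """Run all four steps over every page and merge the counts."""
    totals = Counter()
    for html in collect(fake_web):
        totals += analyze(parse(html))
    return produce(totals, top)
```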

Page 9: Elastic Web Mining

Web Mining versus Data Mining

Scale - 10 million isn’t a big number

Access - public but restricted
  – Special implicit rules apply

Structure - not much

Page 10: Elastic Web Mining

How to Mine Large Scale Web Data?

Start with scalable map-reduce platform

Add a workflow API layer

Mix in a web crawling toolkit

Write your custom data processing code

Run in an elastic cloud environment

Page 11: Elastic Web Mining

One Solution - the HECB Stack

Bixo

Cascading

Hadoop

EC2

Page 12: Elastic Web Mining

EC2 - Amazon Elastic Compute Cloud

True cost of non-cloud environment
  – Cost of servers & networking (2 year life)
  – Cost of colo (6 servers/rack)
  – Cost of OPS salary (15% of FTE/cluster)
  – Managing servers is no fun

Web mining is perfect for the cloud
  – “bursty” => savings are even greater
  – Data is distilled, so no transfer $$$ pain
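
One way to make the cost comparison concrete is to write the slide's three components as a formula. The sketch below assumes a simple additive model; the 2 year life, 6 servers per rack, and 15% of an FTE come from the slide, while every price is a parameter the reader must supply, and the function names are hypothetical.

```python
def colo_monthly_cost(servers, server_price, rack_price_per_month,
                      fte_monthly_salary, server_life_months=24,
                      servers_per_rack=6, ops_fraction=0.15):
    """Monthly cost of owning a cluster (assumed additive model)."""
    hardware = servers * server_price / server_life_months  # 2 year life
    racks = -(-servers // servers_per_rack)                  # ceiling division
    colo = racks * rack_price_per_month                      # 6 servers/rack
    ops = ops_fraction * fte_monthly_salary                  # 15% of FTE/cluster
    return hardware + colo + ops

def cloud_monthly_cost(servers, price_per_server_hour, hours_used_per_month):
    """Monthly cost when servers are rented only for the hours actually used."""
    return servers * price_per_server_hour * hours_used_per_month
```

The "bursty" point shows up in the last parameter: owned hardware costs the same whether it runs one hour or every hour of the month, while the rented cluster's cost scales with `hours_used_per_month`.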

Page 13: Elastic Web Mining

Why Hadoop?

Perfect for processing lots of data
  – Map-reduce
  – Distributed file system

Open source, large community, etc.

Runs well in EC2 clusters

Elastic Map Reduce as option
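
For readers new to map-reduce, here is a single-process sketch of the model using word counting. It is not Hadoop code and uses no Hadoop API; it only shows the three phases (map, shuffle, reduce) that Hadoop distributes across a cluster. All function names are illustrative.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(docs):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    pairs = [kv for doc_id, text in docs.items()
             for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```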

Page 14: Elastic Web Mining

Why Cascading?

API on top of Hadoop

Supports efficient, reliable workflows

Reduces painful low-level MR details

Build workflow using “pipe” model
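
The "pipe" idea can be illustrated without Cascading itself: a workflow is a chain of stages, and each stage takes a stream of records and returns a new stream. The helpers below (`each`, `keep`, `pipe`) are hypothetical names for this sketch and are not the Cascading API.

```python
from functools import reduce

def each(fn):
    """Stage that applies fn to every record."""
    return lambda records: (fn(r) for r in records)

def keep(pred):
    """Stage that passes through only the records where pred is true."""
    return lambda records: (r for r in records if pred(r))

def pipe(*stages):
    """Chain stages so the output of one becomes the input of the next."""
    return lambda records: reduce(lambda data, stage: stage(data),
                                  stages, records)

# Example flow: trim whitespace, drop empty lines, lowercase the rest.
clean = pipe(each(str.strip), keep(bool), each(str.lower))
```

The flow is described once and then applied to any input, for example `list(clean(lines))`.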

Page 15: Elastic Web Mining

Why Bixo?

Plugs into Cascading-based workflow
  – Scales with Hadoop cluster
  – Runs well in EC2

Handles grungy web crawling details
  – Polite yet efficient fetching
  – Errors, web servers that lie
  – Parsing lots of formats, broken HTML

Open source toolkit for web mining apps
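
"Polite yet efficient" fetching usually means never hitting one host too often while still keeping many hosts busy. The sketch below shows one common approach, a fixed minimum delay between requests to the same host. It is an assumption for illustration, not Bixo's actual policy or API, and the delay value is left to the caller.

```python
from collections import defaultdict
from urllib.parse import urlparse

def schedule_fetches(urls, delay_seconds):
    """Assign each URL the earliest time (in seconds from now) it may be
    fetched, spacing requests to the same host by delay_seconds."""
    next_free = defaultdict(float)   # host -> next allowed fetch time
    plan = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        start = next_free[host]
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(plan)              # earliest fetches first
```

Different hosts can be fetched at the same instant, which is where the efficiency comes from; only repeat visits to one host are delayed.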

Page 16: Elastic Web Mining

SEO Keyword Data Mining

Example of typical web mining task

Find common keywords (1, 2, 3 word terms)
  – Do domain-centric web crawl
  – Parse pages to extract title, meta, h1, links
  – Output keywords sorted by frequency

Compare to competitor site(s)
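
The keyword step can be sketched as n-gram counting over the extracted text. This is a minimal version, assuming plain alphanumeric tokenization; the talk's own "special tokenization" is not shown in the slides, and the function names here are illustrative.

```python
import re
from collections import Counter

def ngrams(text, max_n=3):
    """Yield every 1, 2 and 3 word term in the text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def keyword_frequencies(texts, max_n=3):
    """Count terms across all texts; most frequent first, ties alphabetical."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, max_n))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Running the same function over a competitor's pages gives a second frequency list to compare against.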

Page 17: Elastic Web Mining

Workflow

Page 18: Elastic Web Mining

Custom Code for Example

Filtering URLs inside domain
  – Non-English content
  – User-generated content (forums, etc)

Generating keywords from text
  – Special tokenization
  – One, two, three word phrases

But 95% of code was generic
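
As an example of the small custom piece, here is a sketch of a URL filter for a domain-centric crawl. It keeps URLs inside the target domain and drops paths that look like user-generated content. The `/forum` prefix is a hypothetical rule for illustration, and language detection for non-English content is left out.

```python
from urllib.parse import urlparse

def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)

def keep_url(url, domain, excluded_path_prefixes=("/forum",)):
    """Keep in-domain URLs whose path does not start with an excluded prefix."""
    if not in_domain(url, domain):
        return False
    path = urlparse(url).path.lower()
    return not any(path.startswith(p) for p in excluded_path_prefixes)
```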

Page 19: Elastic Web Mining

End Result in Data Mining Tool

Page 20: Elastic Web Mining

What Next?

Another example - mining mailing lists

Go straight to Summary/Q&A

Talk about Public Terabyte Dataset

Write tweets, posts & emails

Find people to meet in the lobby

Page 21: Elastic Web Mining

Another Example - HUGMEE

Hadoop Users who Generate the Most Effective Emails

Page 22: Elastic Web Mining

Helpful Hadoopers

Use mailing list archives for data (collect)

Parse mbox files and emails (parse)

Score based on key phrases (analyze)

End result is score/name pair (produce)

Page 23: Elastic Web Mining

Scoring Algorithm

Very sophisticated point system

“thanks” == 5

“owe you a beer” == 50

“worship the ground you walk on” == 100
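
The point system maps directly to a lookup table. In this sketch the three phrases and their points come from the slide; matching is assumed to be case-insensitive substring search, with every occurrence counted.

```python
# Phrase -> points, as listed on the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

def score_text(text):
    """Sum the points for every occurrence of every key phrase."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())
```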

Page 24: Elastic Web Mining

High Level Steps

Collect emails
  – Fetch mod_mbox generated page
  – Parse it to extract links to mbox files
  – Fetch mbox files
  – Split into separate emails

Parse emails
  – Extract key headers (messageId, email, etc)
  – Parse body to identify quoted text
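
The split and parse steps can be sketched with Python's standard `email` package. The assumptions here: messages are separated by lines starting with "From " (the usual mbox convention), bodies are single-part plain text, and quoted text is any line starting with ">". The function names and the returned dict keys are made up for this example.

```python
from email import message_from_string
from email.utils import parseaddr

def split_mbox(mbox_text):
    """Split mbox text into raw messages at lines starting with 'From '."""
    messages, current = [], None
    for line in mbox_text.splitlines(keepends=True):
        if line.startswith("From "):
            if current is not None:
                messages.append("".join(current))
            current = []                 # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages

def parse_email(raw):
    """Extract key headers and the non-quoted part of the body."""
    msg = message_from_string(raw)
    body = msg.get_payload()             # a str for single-part messages
    reply_lines = [line for line in body.splitlines()
                   if not line.lstrip().startswith(">")]
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "from": parseaddr(msg["From"] or "")[1],
        "reply_text": "\n".join(reply_lines),
    }
```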

Page 25: Elastic Web Mining

High Level Steps

Analyze emails
  – Find key phrases in replies (ignore signoff)
  – Score emails by phrases
  – Group & sum by message ID
  – Group & sum by email address

Produce ranked list
  – Toss email addresses with no love
  – Sort by summed score
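
The two group-and-sum steps and the final ranking can be sketched as follows. This assumes one reading of the slide: a reply's score is credited to the message it replies to, and then to that message's author. The input shapes and the function name are hypothetical.

```python
from collections import defaultdict

def rank_helpers(replies, author_of):
    """replies: iterable of (replied_to_message_id, score) pairs.
    author_of: dict mapping message ID -> author's email address."""
    by_message = defaultdict(int)
    for message_id, score in replies:          # group & sum by message ID
        by_message[message_id] += score
    by_author = defaultdict(int)
    for message_id, total in by_message.items():
        if message_id in author_of:            # skip messages we never saw
            by_author[author_of[message_id]] += total   # sum by email address
    ranked = [(email, total) for email, total in by_author.items()
              if total > 0]                    # toss addresses with no love
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```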

Page 26: Elastic Web Mining

Workflow

Page 27: Elastic Web Mining

Building the Flow

Page 28: Elastic Web Mining

mod_mbox Page

Page 29: Elastic Web Mining

Custom Operation

Page 30: Elastic Web Mining

Validate

Page 31: Elastic Web Mining

This Hug’s for Ted!

Page 32: Elastic Web Mining

Produce

Page 33: Elastic Web Mining

Public Terabyte Dataset

Sponsored by Concurrent/Bixolabs

High quality crawl of top domains
  – HECB Stack using Elastic Map Reduce

Hosted by Amazon in S3, free to EC2 users

Crawl & processing code available

Questions, input? http://bixolabs.com/PTD/

Page 34: Elastic Web Mining

Summary

HECB stack works well for web mining
  – Cheaper than typical colo option
  – Scales to hundreds of millions of pages
  – Reliable and efficient workflow

Web mining has high & increasing value
  – Search engine optimization, advertising
  – Social networks, reputation
  – Competitive pricing
  – Etc, etc, etc.

Page 35: Elastic Web Mining

Any Questions?

My email:

[email protected]

Bixo mailing list:

http://tech.groups.yahoo.com/group/bixo-dev/

Page 36: Elastic Web Mining