Elastic Web Mining 01 November 2009 1
Jan 27, 2015
Web Mining in the Cloud
Ken Krugler, Bixo Labs, Inc.
ACM Silicon Valley Data Mining Camp
01 November 2009
Hadoop/Cascading/Bixo in EC2
About me
Background in vertical web crawl
– Krugle search engine for open source code
– Bixo open source web mining toolkit
Consultant for companies using EC2
– Web mining
– Data processing
Founder of Bixo Labs
– Elastic web mining platform
– http://bixolabs.com
Over the prior 4 years I had a startup called Krugle, which provided code search for open source projects and inside large companies.
We did a large, 100M page crawl of the “programmer’s web” to find out information about open source projects.
Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later.
Several companies paid me to integrate Bixo into an existing data processing environment.
And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
Typical Data Mining
This is the world that many of you live in. Analyzing data to find important patterns.
Here’s an example of output from the QlikView business intelligence tool. It was used to help analyze the relative prevalence of keywords in two competing web sites. Here you see two-word terms that often occur on McAfee’s site, but not on Symantec’s, which is very useful data for anybody who worries about search engine optimization.
Data Mining Victory!
You all know about analyzing data to find important patterns that get managers all worked up…
Meanwhile, Over at McAfee…
But how do you get to this point? How do you use the web as the source for the data that you’re analyzing? That’s what I’m going to be talking about here.
Web Mining 101
Extracting & Analyzing Web Data
More Than Just Search
Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
Quick intro to web mining, so we’re on the same page.
Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
4 Steps in Web Mining
Collect - fetch content from web
Parse - extract data from formats
Analyze - tokenize, rate, classify, cluster
Produce - “useful data”
It’s common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a “white list” of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages.
Content comes in a variety of formats - typically HTML, but also PDF, Word, zip archives, etc. You need to parse these formats to extract key data - typically text, but it could be image data.
Often the analyze step will include aspects of machine learning - classification, clustering.
“Useful data” covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
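The four steps above can be sketched as a tiny in-memory pipeline. This is only an illustration, not how Bixo does it: the function names are made up for this example, `fetch` is whatever callable returns HTML for a URL (here it can be a plain dictionary lookup, so nothing touches the network), and the "analysis" is just word counting.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(urls, fetch):
    """Collect: fetch raw content for each URL on a white list."""
    return {url: fetch(url) for url in urls}


def parse(html):
    """Parse: extract plain text from HTML."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def analyze(text):
    """Analyze: tokenize and count terms (a stand-in for real analysis)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def produce(counts, top_n=3):
    """Produce: highly reduced output, here the top terms by frequency."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def mine(urls, fetch, top_n=3):
    total = Counter()
    for html in collect(urls, fetch).values():
        total += analyze(parse(html))
    return produce(total, top_n)
```

Passing `pages.get` as `fetch`, where `pages` is a dictionary of URL to HTML, is enough to try it out.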
Web Mining versus Data Mining
Scale - 10 million isn’t a big number
Access - public but restricted
– Special implicit rules apply
Structure - not much
What are the key differences between web mining and traditional data mining? I’m saying “traditional” because the face of data mining is clearly changing. But if you look at most vendor tools, the focus is on what I’d call “traditional data mining”.
Scale - 10M is big for data mining, but not for web mining
Access - with DM, once you defeated Mongor, keeper of database access keys, you were golden. Web pages are typically public, but it’s a shared resource so implicit rules apply. Like “don’t bring my web site to its knees”. Data mining breaks the traditional implicit contract, so extra cautions apply. The implicit contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with DM, there often isn’t an index as the end result.
With mining DBs, there’s explicit structure, which is mostly lacking from web pages.
How to Mine Large Scale Web Data?
Start with scalable map-reduce platform
Add a workflow API layer
Mix in a web crawling toolkit
Write your custom data processing code
Run in an elastic cloud environment
If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately want to process from the web.
If you can’t create real workflows, it will never be reliable or efficient.
If you don’t use specialized web crawling code, you’ll get blacklisted.
Because you’re trying to distill down large data, there’s often some custom processing.
If you don’t run it in a cloud environment, you’ll be wasting money - and I’ll explain why in a few slides.
One Solution - the HECB Stack
Bixo
Cascading
Hadoop
EC2
I’m focusing on one particular solution to the challenges of web mining that I just described.
It’s the “HECB” stack.
I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop… but the acronym didn’t work as well.
EC2 - Amazon Elastic Compute Cloud
True cost of non-cloud environment
– Cost of servers & networking (2 year life)
– Cost of colo (6 servers/rack)
– Cost of OPS salary (15% of FTE/cluster)
– Managing servers is no fun
Web mining is perfect for the cloud
– “bursty” => savings are even greater
– Data is distilled, so no transfer $$$ pain
At Krugle we ran two clusters, one of 11 servers, and a smaller 4 server cluster. In the end, our actual utilization ratio was probably < 20%. Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break even would be 250 to 1000 servers.
Mining for search doesn’t work so well in this model - the cluster should be always crawling (ABC), so it’s not as bursty. And transferring raw content, parse, and index will generate lots of transfer charges. But for web mining that’s focused on data mining, data is distilled so this isn’t an issue.
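The break-even numbers in these notes follow a simple rule of thumb, sketched here under one assumption: the break-even cluster size scales inversely with utilization. That assumption is inferred from the figures quoted (50 to 200 servers at 100% becoming 250 to 1000 at 20%); the function name is made up for the example.

```python
def break_even_servers(break_even_at_full_utilization, utilization):
    """Scale a 100%-utilization break-even point to a lower utilization.

    Assumes (as the quoted figures imply) that the break-even cluster
    size is inversely proportional to the utilization ratio.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return break_even_at_full_utilization / utilization


# The range quoted in the notes: 50 to 200 servers at ~100% utilization.
low = break_even_servers(50, 0.20)
high = break_even_servers(200, 0.20)
print(f"Break-even at 20% utilization: {low:.0f} to {high:.0f} servers")
```

So a 15-server setup running at under 20% utilization sits far below the point where owning hardware pays off.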
Why Hadoop?
Perfect for processing lots of data
– Map-reduce
– Distributed file system
Open source, large community, etc.
Runs well in EC2 clusters
Elastic Map Reduce as option
Map-reduce - how do you parallelize the processing of lots of data so that you can do the work on many servers? The answer is map-reduce.
HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner? How do you make sure the data (the big stuff) moves as little as possible during processing? The answer is the Hadoop distributed file system.
It’s open source, so lots of support, consultants, rapid bug fixes, etc.
Large companies are using it, especially Yahoo.
Elastic Map Reduce is a special service built on top of EC2, where it’s easier to run Hadoop jobs because you have access to pre-configured Hadoop clusters, special tools, etc.
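For anyone who hasn’t seen the map-reduce model, here is a single-process sketch of its three phases (map, shuffle, reduce) using word count. It shows only the shape of the computation. It is not Hadoop’s API, and nothing here is distributed; on a real cluster the map and reduce calls run on many servers and the shuffle moves data between them.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Because each mapper call sees one record and each reducer call sees one key, the work splits cleanly across machines.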
Why Cascading?
API on top of Hadoop
Supports efficient, reliable workflows
Reduces painful low-level MR details
Build workflow using “pipe” model
If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex.
Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit.
Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :)
A key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds). It finds cases where fields aren’t available for operations. That solves a key problem we ran into when customizing Nutch at Krugle.
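The idea of checking a workflow before running it can be illustrated with a toy validator. This is not Cascading’s API or planner: the `Operation` class and `check_flow` function are invented for this sketch, and it handles only a linear chain of operations, not a general DAG. Each operation declares the fields it needs and the fields it adds, and the check fails fast if a needed field can’t exist at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One step in a linear pipe: consumes some fields, adds others."""
    name: str
    needs: tuple
    adds: tuple = ()


def check_flow(source_fields, operations):
    """Verify every operation's input fields exist before anything runs.

    Returns the set of fields available at the end of the pipe, or raises
    ValueError naming the first operation with a missing input.
    """
    available = set(source_fields)
    for op in operations:
        missing = set(op.needs) - available
        if missing:
            raise ValueError(f"{op.name} needs missing fields: {sorted(missing)}")
        available |= set(op.adds)
    return available


flow = [
    Operation("fetch", needs=("url",), adds=("content",)),
    Operation("parse", needs=("content",), adds=("text", "outlinks")),
    Operation("keywords", needs=("text",), adds=("keyword", "count")),
]
print(sorted(check_flow(["url"], flow)))
```

Catching a missing field at planning time beats finding it hours into a multi-job run.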
Why Bixo?
Plugs into Cascading-based workflow
– Scales with Hadoop cluster
– Runs well in EC2
Handles grungy web crawling details
– Polite yet efficient fetching
– Errors, web servers that lie
– Parsing lots of formats, broken HTML
Open source toolkit for web mining apps
Does the world really need yet another web crawler? No, but it does need a web mining toolkit.
Two companies agreed to sponsor work on Bixo as an open source project.
Polite yet efficient - tension between those two goals that’s hard to resolve.
If you do a crawl of any reasonable size, you’ll run into lots of errors.
Even if a web server says “I swear to you, I’m sending you a 20K HTML file in English”, it’s a 50K text file in Russian using the Cyrillic character set.
And because it’s open source, you get the benefit of a community of users. They contribute re-usable toolkit components.
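The tension between polite and efficient fetching can be made concrete with a scheduling sketch. This is not Bixo’s fetcher: the function name and the fixed per-host delay are assumptions for illustration. It spaces out requests to the same host while letting requests to different hosts proceed in parallel, and it returns planned start times instead of actually sleeping or fetching.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, crawl_delay=30.0):
    """Plan fetch start times (in seconds) with a per-host politeness delay.

    Each host is contacted at most once every `crawl_delay` seconds, but
    different hosts are independent, so a crawl spread over many hosts
    stays efficient. The delay value is an assumed default, not a standard.
    """
    next_allowed = {}
    schedule = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_allowed.get(host, 0.0)
        schedule.append((start, url))
        next_allowed[host] = start + crawl_delay
    return sorted(schedule)


plan = schedule_fetches(
    ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
    crawl_delay=10.0,
)
for start, url in plan:
    print(f"t={start:>5.1f}s  {url}")
```

A crawl focused on a single domain is bounded by the delay, no matter how many servers you add.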
SEO Keyword Data Mining
Example of typical web mining task
Find common keywords (1,2,3 word terms)
– Do domain-centric web crawl
– Parse pages to extract title, meta, h1, links
– Output keywords sorted by frequency
Compare to competitor site(s)
Workflow
Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious.
Which, obviously, it’s not.
And in fact the full workflow is a bit bigger, as I left out the second stage that describes more of the keyword analysis.
But the key point is that the blue color items are provided by Cascading. And the green color items are provided by Bixo. So what’s left are two yellow items, which represent the two points of customization.
Custom Code for Example
Filtering URLs inside domain
– Non-English content
– User-generated content (forums, etc)
Generating keywords from text
– Special tokenization
– One, two, three word phrases
But 95% of code was generic
There were two main pieces of custom code that needed to be written.
One was some URL filtering to focus on the right content inside the web sites. Avoiding non-English pages by specific URL patterns. Same kind of thing for forums and such, since these pages weren’t part of what could easily be optimized.
And if enough people need this type of support, since Bixo is open source it will likely become part of the toolkit.
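The two custom pieces might look roughly like this. It’s a sketch under stated assumptions, not the code from the project: the URL patterns are hypothetical examples of “non-English” and “forum” paths, and the tokenizer is a plain alphanumeric split standing in for the special tokenization mentioned on the slide.

```python
import re
from collections import Counter

# Hypothetical patterns; a real crawl would tune these per site.
EXCLUDE_PATTERNS = [
    re.compile(r"/forums?/"),       # user-generated content
    re.compile(r"/(de|fr|ja)/"),    # non-English sections, by URL convention
]


def keep_url(url):
    """Return True if the URL should be crawled."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


def keyword_counts(text, max_words=3):
    """Count every one-, two- and three-word phrase in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


for phrase, count in keyword_counts("virus scan virus scan").most_common(3):
    print(count, phrase)
```

Everything else in the workflow (fetching, parsing, grouping, sorting) is generic, which is where the 95% figure comes from.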
End Result in Data Mining Tool
Finally we can actually use a traditional data mining tool to help make sense of the digested data.
Many things we could do in addition:
Clustering of results, to improve keyword analysis
Larger sites have “areas of interest”
Identifying broken links, typos
Identifying personal data - email addresses, phone numbers
What Next?
Another example - mining mailing lists
Go straight to Summary/Q&A
Talk about Public Terabyte Dataset
Write tweets, posts & emails
Find people to meet in the lobby
I try to limit presentations to 20 slides - so I’ve hit that limit
In the spirit of the unconference - let me know what you’d like to do next.
Another Example - HUGMEE
Hadoop
Users who
Generate the
Most
Effective
Emails
Let’s use a real example now of using Bixo to do web mining.
Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community.
In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up.
But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
Helpful Hadoopers
Use mailing list archives for data (collect)
Parse mbox files and emails (parse)
Score based on key phrases (analyze)
End result is score/name pair (produce)
How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem.
Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.
How do we score based on key phrases (next slide)?
Scoring Algorithm
Very sophisticated point system
“thanks” == 5
“owe you a beer” == 50
“worship the ground you walk on” == 100
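The point system above translates almost directly into code. A minimal sketch: the phrase table is from the slide, but counting every occurrence (rather than just presence) and using simple case-insensitive substring matching are assumptions made here.

```python
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase.

    Matching is a naive case-insensitive substring search, so "thanks"
    inside a longer word would also count.
    """
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())


print(score_text("Thanks, I owe you a beer!"))
```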
High Level Steps
Collect emails
– Fetch mod_mbox generated page
– Parse it to extract links to mbox files
– Fetch mbox files
– Split into separate emails
Parse emails
– Extract key headers (messageId, email, etc)
– Parse body to identify quoted text
Parsing the mod_mbox page is simple with Tika’s HtmlParser
Cheated a bit when parsing emails - some users like Owen have many aliases, so I hand-generated an alias resolution table.
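The split and parse steps can be sketched with Python’s standard `email` package. This isn’t the code from the talk (which ran as Cascading operations): the function names, the record layout, and the alias entry are hypothetical, and quoted text is identified by the common convention of lines starting with “>”.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical hand-built alias table: alias address -> canonical address.
ALIASES = {"owen@alias.example": "owen@example.org"}


def split_mbox(mbox_text):
    """Split mbox text into raw emails at the 'From ' separator lines."""
    parts = re.split(r"(?m)^From .*\n", mbox_text)
    return [part for part in parts if part.strip()]


def split_body(body):
    """Separate newly written text from quoted text (lines starting with '>')."""
    new, quoted = [], []
    for line in body.splitlines():
        (quoted if line.lstrip().startswith(">") else new).append(line)
    return "\n".join(new), "\n".join(quoted)


def parse_email(raw):
    """Extract the key headers and the non-quoted body text."""
    msg = message_from_string(raw)
    _, address = parseaddr(msg.get("From", ""))
    address = ALIASES.get(address.lower(), address.lower())
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # skip multipart here
    new_text, _ = split_body(body)
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "address": address,
        "new_text": new_text,
    }
```

Dropping the quoted lines matters for scoring, since otherwise a “thanks” in the quoted original would be credited again on every reply.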
High Level Steps
Analyze emails
– Find key phrases in replies (ignore signoff)
– Score emails by phrases
– Group & sum by message ID
– Group & sum by email address
Produce ranked list
– Toss email addresses with no love
– Sort by summed score
Need to ignore “thanks” in “thanks in advance for doing my job for me” signoff.
Generate two tuples for each email:
- one with messageId/name/address
- one with reply-to messageId/score
Group/sum aspect is classic reduce operation.
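The analyze and produce steps can be sketched in plain Python, as a single-process stand-in for the group-and-sum reduces. The record layout (dictionaries with `message_id`, `in_reply_to`, `address`, `new_text`) and the function names are assumptions for this example, and the scoring function is passed in so the sketch stays independent of any particular point table.

```python
import re
from collections import defaultdict


def drop_signoff(text):
    """Remove 'thanks in advance...' signoffs so they earn no points."""
    return re.sub(r"thanks in advance[^\n]*", "", text, flags=re.IGNORECASE)


def rank_helpers(emails, score_text):
    """Credit each reply's score to the author of the message it answers."""
    # Tuple 1: message id -> author address.
    author_of = {e["message_id"]: e["address"] for e in emails}

    # Tuple 2: reply-to message id -> score; group & sum by message id.
    score_by_message = defaultdict(int)
    for e in emails:
        if e["in_reply_to"]:
            score_by_message[e["in_reply_to"]] += score_text(drop_signoff(e["new_text"]))

    # Join on message id, then group & sum by email address.
    score_by_address = defaultdict(int)
    for message_id, score in score_by_message.items():
        address = author_of.get(message_id)
        if address is not None:
            score_by_address[address] += score

    # Toss addresses with no love, then sort by summed score.
    ranked = [(a, s) for a, s in score_by_address.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

Replies to messages that aren’t in the archive are simply dropped by the join.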
Workflow
I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs.
OK, actually not so clear, but… The key point is that only the purple is stuff that I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page. But only two custom operations were actually needed - parsing mbox_page and calculating score.
Running took about 30 minutes - mostly politely waiting until it was OK to politely do another fetch.
Downloaded 150MB of mbox files.
409 unique email addresses with at least one positive reply.
Building the Flow
Most of the code needed to create the workflow for this data mining app.
Lots of oatmeal code - which is good. Don’t want to be writing tricky code here.
Could optimize, but that would be a mistake… most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
mod_mbox Page
Example of the top-level pages that were fetched in first phase.
Then needed to be parsed to extract links to mbox files.
Custom Operation
Example of one of the two custom operations:
Parsing mod_mbox page
Uses Tika to extract IDs
Emits tuple with URL for each mbox ID
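The custom operation on the slide uses Tika inside a Cascading operation. As a rough stand-in, here is the same idea with Python’s built-in HTML parser. The page layout is an assumption: this sketch supposes the index page links to each month with relative hrefs that start with an ID shaped like `200910.mbox`, which may not match the real markup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed shape of an mbox ID at the start of an href, e.g. "200910.mbox/thread".
MBOX_ID = re.compile(r"^(\d{6}\.mbox)\b")


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def mbox_urls(page_url, html):
    """Return one absolute URL per distinct mbox ID linked from the page."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    ids = []
    for href in collector.hrefs:
        match = MBOX_ID.match(href)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return [urljoin(page_url, mbox_id) for mbox_id in ids]
```

Each returned URL corresponds to one emitted tuple, which then feeds the second fetch cycle.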
Validate
Curve looks right - exponential decay.
409 unique email addresses that got some love from somebody.
This Hug’s for Ted!
And the winner is…Ted Dunning
I know - I should have colored the elephant yellow.
Produce
A list of the usual suspects
Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.
Public Terabyte Dataset
Sponsored by Concurrent/Bixolabs
High quality crawl of top domains
– HECB Stack using Elastic Map Reduce
Hosted by Amazon in S3, free to EC2 users
Crawl & processing code available
Questions, input? http://bixolabs.com/PTD/
Summary
HECB stack works well for web mining
– Cheaper than typical colo option
– Scales to hundreds of millions of pages
– Reliable and efficient workflow
Web mining has high & increasing value
– Search engine optimization, advertising
– Social networks, reputation
– Competitive pricing
– Etc, etc, etc.
Any Questions?
My email:
Bixo mailing list:
http://tech.groups.yahoo.com/group/bixo-dev/