Web Mining Research: A survey Revised and Presented by Narine Manukyan Authors: Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Fan Min, 4/22/2009 Nima, 12/06/2011 Course: Data Mining[CS332] Pr. Xindong Wu Computer Science Department University Of Vermont Complex Systems Center Achieving insight, innovative design, and informed decision making through systems thinking


Dec 25, 2015

Page 1

Web Mining Research: A survey

Revised and Presented by Narine Manukyan

Authors: Raymond Kosala and Hendrik Blockeel

ACM SIGKDD, July 2000

Presented by Shan Huang, 4/24/2007

Fan Min, 4/22/2009

Nima, 12/06/2011

Course: Data Mining [CS332], Prof. Xindong Wu

Computer Science Department, University of Vermont

Complex Systems Center: Achieving insight, innovative design, and informed decision making through systems thinking

Page 2

From minding our own business

To mining others' business on the web

Page 3

Page 4

Page 5

[Figure: social network mining over time, 2001 to 2010. Users (Human 1, Human 2, Human 3, …) generate data on Facebook and Twitter: data on human interactions (who they meet), personal data, what they do, and what they read.]

Page 6

Dynamic Datasets

• Include a time-series component: people do things over time

[Figure: timelines running from t1 = 2001 to tn = 2010]

Page 7

The Web is huge, diverse, and dynamic.

[Figure: timelines running from t1 = 2001 to tn = 2010]

1. Scalability issues
2. Multimedia data issues
3. Temporal data issues

Page 8

[Figure: timelines running from t1 = 2001 to tn = 2010]

Finding patterns

Yay!

Page 9

Fishing for Data

Page 10

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Page 11

Four Problems

• Finding relevant information
  – Low precision and unindexed information
  – A query-triggered process

• Creating new knowledge out of the information available on the Web
  – Intelligent tools are necessary
  – A data-triggered process

• Personalizing the information
  – Personal preferences in content and presentation of the information

• Learning about the consumers
  – What does the customer want to do?

Page 12

Direct vs. Indirect Web Mining

• Web mining techniques can be used to solve the information overload problems:
  – Directly: the problem is addressed with Web mining techniques. E.g. a newsgroup agent classifies whether a news item is relevant.
  – Indirectly: Web mining is used as part of a bigger application that addresses the problem. E.g. it is used to create index terms for a Web search service.

Page 13

Web Mining

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”

Goyal’s Definition: “Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web”

Page 14

Web Mining by Etzioni

• Resource finding: retrieving the intended Web documents (query-triggered)
• Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources
• Generalization: automatically discovering general patterns (data-triggered)
• Analysis: validation and/or interpretation of the mined patterns

Page 15

Web Mining and Information Retrieval

• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

• Goal: Indexing text and searching for useful documents in a collection.

• Research in IR: modeling, document classification and categorization, user interfaces, data visualization, filtering, etc.

• Web document classification, which is a Web mining task, could be part of an IR system (e.g. indexing for a search engine)
  – Viewed in this respect, Web mining is part of the (Web) IR process.
  – In my opinion, IR is missing the Analysis (interpretation) step of Web mining
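The IR goal stated on this slide (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured with recall and precision. The sketch below is only an illustration: the function name and the document ids are made up for the example and do not come from the survey.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.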

Page 16

Web Mining and Information Extraction

• Information Extraction (IE): Transforming a collection of documents into information that is more easily understood and analyzed.

• IE is a kind of pre-processing stage, a step after IR and before the data mining techniques are applied.

• Building IE systems manually for the general Web is not feasible

• According to the authors, Web mining is a part of IE.

Page 17

Information Extraction (IE)

• Classical: Relies on linguistic pre-processing like syntactic analysis, semantic analysis and discourse analysis.

• Structural IE: Utilizes the meta information (e.g. HTML tags, delimiters)
  – These systems can use data mining techniques to learn extraction rules from the annotated corpora.
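A minimal sketch of structural IE, assuming the facts of interest sit in HTML table cells. The extraction rule here is written by hand, whereas the systems mentioned above would learn such rules from annotated pages. The class name, field names and sample page are hypothetical.

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; layout whitespace is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

# Hypothetical page fragment.
page = """
<table>
  <tr><td>Versa</td><td>2012</td></tr>
  <tr><td>Sentra</td><td>2011</td></tr>
</table>
"""
extractor = CellExtractor()
extractor.feed(page)
records = [dict(zip(("model", "year"), row)) for row in extractor.rows]
print(records)
```

The tags act as delimiters: no linguistic analysis of the cell text is needed to turn the page into records.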

Page 18

Compare IR and IE

• IR aims to select relevant documents from the Web

• IR views the text in a document just as a bag of unordered words

VS

• IE aims to extract the relevant facts from given documents

• IE is interested in the structure or representation of a document (finer level)

Page 19

Web Mining and Machine Learning

• Web mining is not the same as learning from the Web or machine learning techniques applied on the Web

• There are applications of machine learning applied on the Web that are not instances of Web mining

• There are methods used for web mining that are not Machine Learning


Page 20

Web Mining and The Agent Paradigm

Web mining is often viewed from or implemented within an agent paradigm.

• User Interface Agents
  – Information retrieval agents, information filtering agents, and personal assistant agents.

• Distributed Agents
  – Concerned with problem solving by a group of agents.
  – Distributed agents for knowledge discovery or data mining

• Mobile Agents
  – Not relevant for data mining tasks

Page 21

Web Mining and The Agent Paradigm (contd.)

Two frequently used approaches for developing intelligent agents:

• Content-based approach (User Interface Agents)
  – The system searches for items that match the user preferences, based on an analysis of the item content.

• Collaborative approach (Distributed Agents)
  – The system tries to find users with similar interests to give recommendations to.
  – Analyzes the user profiles and sessions or transactions.
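The two approaches can be contrasted with a small sketch. It assumes cosine similarity as the matching measure, which the slides do not specify, and all page names, users and ratings are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def content_based(user_profile, items):
    """Rank items by how well their terms match the user's preferred terms."""
    return sorted(items, key=lambda name: cosine(user_profile, items[name]), reverse=True)

def collaborative(target, ratings):
    """Recommend pages the most similar other user rated but `target` has not seen."""
    others = [u for u in ratings if u != target]
    nearest = max(others, key=lambda u: cosine(ratings[target], ratings[u]))
    return sorted(p for p in ratings[nearest] if p not in ratings[target])

# Content-based: item term counts compared with a user profile.
items = {
    "page_a": {"mining": 2, "web": 1},
    "page_b": {"football": 3},
}
profile = {"web": 1, "mining": 1}
ranked = content_based(profile, items)

# Collaborative: users compared with each other through their ratings.
ratings = {
    "alice": {"page_a": 5, "page_c": 4},
    "bob": {"page_a": 4, "page_c": 5, "page_d": 5},
    "carol": {"page_b": 5},
}
suggested = collaborative("alice", ratings)
print(ranked, suggested)
```

The content-based function never looks at other users, while the collaborative one never looks at page content; that is the essential difference between the two.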

Page 22

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Page 23

Web Mining Categories

• Web Content Mining
  – Discovering useful information from Web page contents/data/documents.

• Web Structure Mining
  – Discovering the model underlying the link structures (topology) on the Web, e.g. discovering authorities and hubs.

• Web Usage Mining
  – Extraction of interesting knowledge from logging information produced by Web servers.
  – Usage data comes from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

[Diagram: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining]

Page 24

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Page 25

Web Content Data Structure

Web content consists of several types of data
  – Text, image, audio, video, hyperlinks, metadata.

• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages

Note: much of the Web content data is unstructured text data.

Page 26

Web Content Mining: IR View

• Unstructured Documents
  – A bag of words is used to represent unstructured documents: it takes a single word as a feature and ignores the sequence in which words occur
  – Features could be Boolean (a word either occurs or does not occur in a document) or frequency based (the frequency of the word in a document)
  – Variations of the feature selection include removing the case, punctuation, infrequent words and stop words
  – Features can be reduced using different feature selection techniques: information gain, mutual information, cross entropy
  – Stemming reduces words to their morphological roots
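The bag-of-words representation described on this slide can be sketched in a few lines. The stop-word list, the `min_count` threshold and the two sample documents are assumptions made for the example; stemming and the information-theoretic feature selection methods are left out.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "on"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs, min_count=1):
    """Return (vocabulary, frequency vectors, Boolean vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Words seen fewer than min_count times overall are dropped as infrequent.
    vocab = sorted(w for w, n in totals.items() if n >= min_count)
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, freq, boolean

docs = [
    "The Web is huge, diverse, and dynamic.",
    "Mining the Web: web usage mining and web content mining.",
]
vocab, freq, boolean = bag_of_words(docs)
print(vocab)
print(freq)
print(boolean)
```

Word order is discarded on purpose: each document becomes one row of counts (or 0/1 flags) over a shared vocabulary.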

Page 27

Web Content Mining: IR View

• Semi-Structured Documents
  – Uses richer representations for features, due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  – Uses common data mining methods (whereas unstructured might use more text mining methods)

Application: Hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.

Page 28

Web Content Mining: DB View

• The database techniques on the Web are related to the problems of managing and querying the information on the Web.

• The DB view tries to infer the structure of a Web site or transform a Web site to become a database
  – Better information management
  – Better querying on the Web

• Can be achieved by:
  – Finding the schema of Web documents
  – Building a Web warehouse
  – Building a Web knowledge base
  – Building a virtual database

Page 29

Web Content Mining: DB View

• The DB view mainly uses the Object Exchange Model (OEM)
  – Represents semi-structured data by a labeled graph
  – The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  – Each object is identified by an object identifier [oid], and its value is either atomic or complex

• The process typically starts with manual selection of Web sites for doing Web content mining

• Main applications:
  – The task of finding frequent substructures in semi-structured data
  – The task of creating a multi-layered database
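One way to picture an OEM-style labeled graph is the sketch below. It is an assumption-laden illustration, not the model's formal definition: the class and function names are hypothetical, the oids are arbitrary strings, and the traversal assumes the data is acyclic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

@dataclass
class OEMObject:
    oid: str                                          # object identifier
    value: Optional[Union[str, int, float]] = None    # set only for atomic objects
    # Complex objects hold labeled edges to other objects.
    children: List[Tuple[str, "OEMObject"]] = field(default_factory=list)

    @property
    def is_atomic(self):
        return self.value is not None

    def add(self, label, child):
        """Attach `child` under an edge named `label` and return it."""
        self.children.append((label, child))
        return child

def paths(obj, prefix=()):
    """Yield (label path, atomic value) for every atomic object below obj."""
    if obj.is_atomic:
        yield ".".join(prefix), obj.value
    for label, child in obj.children:
        yield from paths(child, prefix + (label,))

page = OEMObject("&1")
page.add("title", OEMObject("&2", "Web Mining Research: A Survey"))
author = page.add("author", OEMObject("&3"))
author.add("name", OEMObject("&4", "Raymond Kosala"))
flat = dict(paths(page))
print(flat)
```

Enumerating label paths like this is a simple starting point for tasks such as looking for substructures that recur across many objects.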

Page 30

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Page 31

Web Structure Mining

• Interested in the structure of the hyperlinks within the Web

• Inspired by the study of social networks and citation analysis
  – Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.

• Applications:
  – Discovering micro-communities in the Web
  – Measuring the “completeness” of a Web site
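Hubs and authorities can be scored from link structure alone. The sketch below follows the general idea of Kleinberg's HITS algorithm, which these slides do not name; the iteration count, the normalization and the toy link graph are assumptions for illustration.

```python
import math

def hits(links, iterations=50):
    """Iteratively score pages as hubs and authorities.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A good hub points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph.
links = {
    "portal": ["paper", "tutorial"],
    "blog": ["paper"],
    "paper": [],
    "tutorial": [],
}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

In this toy graph the page with the most outgoing links to well-cited pages comes out as the strongest hub, and the page cited by both linking pages as the strongest authority.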

Page 32

Topic oriented community detection through social objects and link analysis in social networks

Page 33

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Page 34

Website Usage Analysis

Page 35

Web Usage Mining

• Tries to predict user behavior from interaction with the Web

• Wide range of data (logs)
  – Web client data
  – Proxy server data
  – Web server data

• Two common approaches
  – Map the usage data of the Web server into relational tables before an adapted data mining technique is applied
  – Use the log data directly by utilizing special pre-processing techniques
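The first approach, mapping server usage data into relational tables, might look like the sketch below. It assumes the server writes Common Log Format lines and uses an in-memory SQLite table; the table name, column names and sample log lines are hypothetical.

```python
import re
import sqlite3

# Assumed input: Common Log Format (host ident user [time] "request" status bytes).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def load_log(lines):
    """Parse log lines and load them into an in-memory relational table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_LINE.match(line)
        if m:  # lines that do not parse are skipped
            db.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["path"], int(m["status"])),
            )
    return db

# Hypothetical log lines.
sample = [
    '10.0.0.1 - - [24/Apr/2007:10:00:00 -0400] "GET /company HTTP/1.0" 200 1043',
    '10.0.0.1 - - [24/Apr/2007:10:00:30 -0400] "GET /company/products HTTP/1.0" 200 2310',
    '10.0.0.2 - - [24/Apr/2007:10:01:00 -0400] "GET /company/products HTTP/1.0" 200 2310',
]
db = load_log(sample)
top_pages = db.execute(
    "SELECT path, COUNT(*) AS n FROM hits GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top_pages)
```

Once the hits are in a table, ordinary queries or standard data mining tools can be applied without any log-specific handling.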

Page 36

Web Usage Mining

• Typical problems:
  – Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
  – Usage mining often uses some background or domain knowledge, e.g. site topology, Web content, etc.
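A common heuristic for recovering server sessions is to group hits by host and start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout and treats each host as one user, which is exactly the simplification that caching and proxy servers break; the function name and sample hits are hypothetical.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (host, timestamp, path) hits into per-host sessions.

    A new session starts when the same host is idle longer than `timeout`.
    """
    sessions = []
    last_seen = {}  # host -> (timestamp of last hit, index of its open session)
    for host, ts, path in sorted(hits, key=lambda h: h[1]):
        if host in last_seen and ts - last_seen[host][0] <= timeout:
            index = last_seen[host][1]
        else:
            sessions.append((host, []))
            index = len(sessions) - 1
        sessions[index][1].append(path)
        last_seen[host] = (ts, index)
    return sessions

t0 = datetime(2007, 4, 24, 10, 0)
hits = [
    ("10.0.0.1", t0, "/company"),
    ("10.0.0.1", t0 + timedelta(minutes=2), "/company/products"),
    ("10.0.0.2", t0 + timedelta(minutes=3), "/company"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/company/new"),
]
sessions = sessionize(hits)
for host, pages in sessions:
    print(host, pages)
```

Background knowledge such as the site topology could refine this, for example by splitting a session when a requested page is not reachable from the previous one.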

Page 37

Web Usage Mining

• Applications: two main categories
  – Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically
  – Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web site

Page 38

Web-Usage Mining Example

• Data Mining Techniques – Navigation Patterns

Analysis:

Example: 70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1

80% of users who accessed the site started from /company/products

65% of users left the site after four or fewer page references
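Statistics of this kind can be computed directly from a list of sessions. The sketch below uses five invented sessions, so its percentages differ from the figures quoted above; the helper names are hypothetical.

```python
def share(sessions, predicate):
    """Fraction of sessions for which predicate(session) is true."""
    return sum(1 for s in sessions if predicate(s)) / len(sessions) if sessions else 0.0

def followed_path(session, path):
    """True if the session starts with the given page sequence."""
    return session[:len(path)] == path

# Hypothetical sessions, each an ordered list of requested pages.
sessions = [
    ["/company", "/company/new", "/company/products", "/company/product1", "/company/product2"],
    ["/company/products", "/company/product2"],
    ["/company/products", "/company/product1"],
    ["/company/products"],
    ["/company", "/company/new"],
]

# Share of sessions whose entry page is /company/products.
started_at_products = share(sessions, lambda s: s[0] == "/company/products")
# Share of sessions that ended after four or fewer page references.
short_visits = share(sessions, lambda s: len(s) <= 4)
# Among sessions that reached product2, share that took the full path.
reached_product2 = [s for s in sessions if "/company/product2" in s]
via_full_path = share(
    reached_product2,
    lambda s: followed_path(
        s, ["/company", "/company/new", "/company/products", "/company/product1"]
    ),
)
print(started_at_products, short_visits, via_full_path)
```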

Page 39

Visual Representation of Web usage

This is a visualization of the web graph of the Computer Science department of Rensselaer Polytechnic Institute (http://www.cs.rpi.edu). Strahler numbers are used for assigning colors to edges.

One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc.
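For readers unfamiliar with Strahler numbers, the sketch below computes them for a rooted tree: a leaf gets 1, and a parent gets the highest child value, plus one if that value is shared by at least two children. How the original visualization derived a tree from the web graph and mapped the numbers to edge colors is not stated here, so the tree and page paths are purely hypothetical.

```python
def strahler(tree, node):
    """Strahler number of `node` in a rooted tree given as {node: [children]}."""
    children = tree.get(node, [])
    if not children:
        return 1
    ranks = sorted((strahler(tree, c) for c in children), reverse=True)
    # Two or more children sharing the top rank push the parent one level up.
    if len(ranks) > 1 and ranks[0] == ranks[1]:
        return ranks[0] + 1
    return ranks[0]

# Hypothetical site hierarchy.
site = {
    "/": ["/faculty", "/courses"],
    "/faculty": ["/faculty/a", "/faculty/b"],
    "/courses": ["/courses/cs1"],
}
for page in ("/", "/faculty", "/courses", "/faculty/a"):
    print(page, strahler(site, page))
```

Higher numbers mark the more heavily branching parts of the hierarchy, which is what makes the measure useful for choosing colors.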

Page 40

Visual Representation of Web usage

This is a 3D visualization of web usage for the above site. The cylinder-like part of this figure is a visualization of the web usage of surfers as they browse a long HTML document.

Adding a third dimension enables visualization of more information and clarifies user behavior in and between clusters. The center node of the circular base is the first page of the web site, from which users scatter to different clusters of web pages. The color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users.

Page 41

Visual Representation of Web usage

A user’s browsing access pattern is amplified by a different coloring. Depending on the link structure of the underlying pages, we can see vertical access patterns of a user drilling down the cluster, making a cylinder shape (bottom-left corner of the figure). Also, users following links going down a hierarchy of web pages make a cone shape, and users going up hierarchies, e.g. back to the main page of the website, make a funnel shape (top-right corner of the figure).

Page 42

Visual Representation of Web usage

Right: One can observe long user sessions as strings falling off clusters. Those are a special type of long session in which the user navigates a sequence of web pages that come one after the other under a cluster, e.g. sections of a long document. In many cases we found web pages with many nodes connected by Next/Up/Previous hyperlinks.

Left: A zoomed view of the same visualization

Page 43

Visual Representation of Web usage

Frequent access patterns extracted by the web mining process are visualized as a white graph on top of the embedded, colorful graph of web usage.

Page 44

Conclusions

• Surveyed the research in the area of Web mining.
• Clarified the differences between Web mining, Information Retrieval (IR) and Information Extraction (IE)
• Suggested three Web mining categories
  – Content, Structure, and Usage Mining
• Explored the connection between the Web mining categories and the related agent paradigm

Page 45

Exam Question #1

• Question: What is multimedia data mining?

• Answer: Research on mining multiple types of data (e.g., textual, image, audio, video, metadata, etc.)

Page 46

Exam Question #2

• Question: Which one of the following is a data-triggered process:

1. Information Retrieval (IR)

2. Information Extraction (IE)

• Answer: 2. Information Extraction


Page 47

Exam Question #3

• Question: What are the three Web mining categories?

• Answer:
  (1) Web Content
  (2) Web Structure
  (3) Web Usage

Page 48

Something to think about

Can web mining techniques be useful if the Web disappears?

Page 49

Questions?

Page 50

Example of Web Mining from my research

http://www.cars.com/nissan/versa/2012/