Web Mining Research: A Survey

Raymond Kosala and Hendrik Blockeel

ACM SIGKDD, July 2000

Presented by Shan Huang, 4/24/2007

Revised and presented by Fan Min, 4/22/2009

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Four Problems

- Finding relevant information
  - Low precision and unindexed information
- Creating new knowledge out of available information on the web
- Personalizing the information
  - Catering to personal preference in content and presentation
- Learning about the consumers
  - What does the customer want to do?
  - Using web data to effectively market products and/or services

Other Approaches

Web mining is NOT the only approach:

- Database approach (DB)
- Information retrieval (IR)
- Natural language processing (NLP)
  - In-depth syntactic and semantic analysis
- Web document community
  - Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining

Web mining techniques can be used to solve the information overload problems:

- Directly
  - Attack the problem with web mining techniques
  - E.g. newsgroup agent classifies news as relevant
- Indirectly
  - Used as part of a bigger application that addresses problems
  - E.g. used to create index terms for a web search service

The Research

- Converging research from:
  - Database, information retrieval, and artificial intelligence (specifically NLP and machine learning)
- Focusing on research from the machine learning point of view

Web Mining: Definition

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”

- Can be viewed as four subtasks
- Not the same as Information Retrieval
- Not the same as Information Extraction

Web Mining: Subtasks

- Resource finding
  - Retrieving intended documents
- Information selection/pre-processing
  - Select and pre-process specific information from selected documents
- Generalization
  - Discover general patterns within and across web sites
- Analysis
  - Validation and/or interpretation of mined patterns

Web Mining: Not IR

Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
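
The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured with recall and precision. A minimal sketch, assuming documents are identified by plain ids; the function name and the sample ids are made up for illustration and do not come from the paper:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids for a single query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were found, so the two measures pull in different directions as the result list grows.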

Web Mining: Not IE

- Information extraction (IE) aims to extract the relevant facts from given documents
- IE systems for the general Web are not feasible
- Most focus on specific Web sites or content

Web Mining and Machine Learning

- Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
- Web mining is NOT the same as learning from the Web.
  - Some applications of machine learning on the web are NOT Web Mining
  - Methods used for Web Mining are NOT limited to machine learning
- Still, there is a close relationship between web mining and machine learning

Web Mining: The Agent Paradigm

- User Interface Agents
  - Information retrieval agents, information filtering agents, and personal assistant agents
- Distributed Agents
  - Distributed agents for knowledge discovery or data mining
  - Problem solving by a group of agents
- Mobile Agents

Web Mining: The Agent Paradigm

- Content-based approach
  - The system searches for items that match based on an analysis of the content using the user preferences.
- Collaborative approach
  - The system tries to find users with similar interests
  - Recommendations given based on what similar users did
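
The collaborative approach can be sketched as follows. This is only one possible reading, assuming explicit ratings and cosine similarity between users; the data, the names `cosine` and `recommend`, and the scoring rule are illustrative assumptions, not something the survey prescribes:

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, invented for this example.
ratings = {
    "ann": {"page_a": 5, "page_b": 4, "page_c": 1},
    "bob": {"page_a": 4, "page_b": 5, "page_d": 5},
    "cat": {"page_c": 5, "page_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, ratings, top_n=1):
    """Rank items the user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)  # find users with similar interests
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


suggestions = recommend("ann", ratings)
```

In this toy data "ann" rates pages much like "bob" does, so the page only "bob" has rated highly is suggested first. A content-based system would instead compare item content against a stored profile of the user's preferences.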

Web Mining Categories

- Web Content Mining
  - Discovering useful information from web contents/data/documents.
- Web Structure Mining
  - Discovering the model underlying link structures (topology) on the Web.
  - E.g. discovering authorities and hubs
- Web Usage Mining
  - Make sense of data generated by surfers
  - Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
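
As an illustration of "discovering authorities and hubs", here is a rough sketch of the usual mutual-reinforcement iteration (in the style of the HITS algorithm): a good hub links to good authorities, and a good authority is linked to by good hubs. The slides do not specify an algorithm, so the function name, the toy graph, and the fixed iteration count are assumptions:

```python
from math import sqrt


def hubs_and_authorities(links, iterations=50):
    """Iteratively score pages as hubs and authorities.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two link pages pointing at two papers.
graph = {
    "portal": ["paper1", "paper2"],
    "list": ["paper1"],
    "paper1": [],
    "paper2": [],
}
hub_scores, auth_scores = hubs_and_authorities(graph)
```

In this graph "paper1" ends up as the strongest authority and "portal" as the strongest hub.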

Web Content Data Structure

- Unstructured – free text
- Semi-structured – HTML
- More structured – Table or Database generated HTML pages
- Multimedia data – receive less attention than text or hypertext

Web Content Mining: IR View

- Unstructured Documents
  - Bag of words, or phrase-based feature representation
  - Features can be boolean or frequency based
  - Features can be reduced using different feature selection techniques
  - Word stemming, combining morphological variations into one feature
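
A small sketch of the bag-of-words representation described above, with both boolean and frequency-based features. The `crude_stem` helper is a deliberately naive stand-in for a real stemmer (such as Porter's) and is an assumption made only to keep the example self-contained:

```python
import re
from collections import Counter


def crude_stem(word):
    """Strip a few common suffixes; far rougher than a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, boolean=False):
    """Map a document to {term: feature value}, ignoring word order."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    if boolean:
        # Boolean features: 1 if the term occurs at all.
        return {term: 1 for term in counts}
    # Frequency features: number of occurrences of each term.
    return dict(counts)


doc = "Mining the web: web miners mined links and linked pages."
freq_features = bag_of_words(doc)
bool_features = bag_of_words(doc, boolean=True)
```

Stemming merges "mining" and "mined" into one feature and "links" and "linked" into another, which is the reduction of morphological variations mentioned above.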

Web Content Mining: IR View

- Semi-Structured Documents
  - Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  - Uses common data mining methods (whereas unstructured might use more text mining methods)
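
To make "richer representations based on document structure" concrete, the sketch below pulls a few structural features (title, link targets, anchor text) out of an HTML page with Python's standard `html.parser` module. Which features to extract is an assumption for illustration; the class name and the sample page are invented:

```python
from html.parser import HTMLParser


class StructureFeatures(HTMLParser):
    """Collect title text, link targets, and anchor text from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # href values of <a> tags
        self.anchor_text = []  # visible text inside <a> tags
        self._stack = []       # currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack or not data.strip():
            return
        if self._stack[-1] == "title":
            self.title += data.strip()
        elif self._stack[-1] == "a":
            self.anchor_text.append(data.strip())


page = """<html><head><title>Web Mining</title></head>
<body><p>See <a href="/content">content mining</a> and
<a href="/usage">usage mining</a>.</p></body></html>"""

parser = StructureFeatures()
parser.feed(page)
parser.close()
```

After parsing, `parser.title`, `parser.links`, and `parser.anchor_text` can be used as features alongside the plain bag of words.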

Web Content Mining: DB View

- Tries to infer the structure of a Web site or transform a Web site to become a database
  - Better information management
  - Better querying on the Web
- Can be achieved by:
  - Finding the schema of Web documents
  - Building a Web warehouse
  - Building a Web knowledge base
  - Building a virtual database

Web Content Mining: DB View

- Mainly uses the Object Exchange Model (OEM)
  - Represents semi-structured data (some structure, no rigid schema) by a labeled graph
- Process typically starts with manual selection of Web sites for content mining
- Main application: building a structural summary of semi-structured data (schema extraction or discovery)
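
One simple way to picture a structural summary is the set of distinct label paths that occur in the data. The sketch below is a simplification: it assumes tree-shaped data held in nested dicts and lists (a real OEM graph may contain shared nodes and cycles), and the sample records are invented:

```python
def label_paths(node, prefix=()):
    """Yield every label path reachable from node in tree-shaped data."""
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(node, list):
        # A list holds several children reached over the same edge label.
        for child in node:
            yield from label_paths(child, prefix)


# Hypothetical semi-structured data: the two records do not share a schema.
site = {
    "paper": [
        {"title": "Paper A", "author": ["First Author", "Second Author"]},
        {"title": "Paper B", "year": 2000},
    ]
}

summary = sorted(set(label_paths(site)))
```

The summary lists `paper`, `paper/author`, `paper/title`, and `paper/year` once each, even though no single record contains all three fields. That is the sense in which a schema is discovered from the data rather than declared up front.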

Web Structure Mining

- Interested in the structure between Web documents (not within a document)
- Inspired by the study of social networks and citation analysis
- Example: PageRank – Google
- Application:
  - Discovering micro-communities in the Web
  - Measuring the “completeness” of a Web site
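
Since PageRank is given as the example, here is a textbook-style power-iteration sketch of it. The damping factor 0.85, the handling of pages without outgoing links, the iteration count, and the toy graph are all assumptions for illustration, not details taken from the survey or from Google's production system:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict mapping page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page "c" is linked to by three of the four pages and ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share.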

Web Usage Mining

- Tries to predict user behavior from interaction with the Web
- Wide range of data (logs)
  - Web client data
  - Proxy server data
  - Web server data
- Two common approaches
  - Map usage data into relational tables before using adapted data mining techniques
  - Use log data directly by utilizing special pre-processing techniques

Web Usage Mining

- Typical problems:
  - Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
- Often Usage Mining uses some background or domain knowledge
  - E.g. site topology, Web content, etc.
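
A common pre-processing heuristic for the user and session problem is sketched below. It assumes that a user can be approximated by the pair (IP address, user agent) and that a gap of more than 30 minutes starts a new session; both are widely used conventions rather than anything stated in the slides, and neither fully solves the caching and proxy issues. The log records are invented:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a conventional value, assumed here


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) records into per-user sessions."""
    by_user = defaultdict(list)
    for user_key, timestamp, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user_key].append((timestamp, url))

    sessions = []
    for user_key, visits in by_user.items():
        current = [visits[0][1]]
        last_seen = visits[0][0]
        for timestamp, url in visits[1:]:
            if timestamp - last_seen > timeout:
                # Long pause: close the current session and start a new one.
                sessions.append((user_key, current))
                current = []
            current.append(url)
            last_seen = timestamp
        sessions.append((user_key, current))
    return sessions


# Hypothetical log: user key is (IP address, user agent), time in seconds.
log = [
    (("10.0.0.1", "Firefox"), 0, "/index"),
    (("10.0.0.1", "Firefox"), 120, "/papers"),
    (("10.0.0.1", "Firefox"), 4000, "/index"),
    (("10.0.0.1", "Safari"), 60, "/index"),
]
sessions = sessionize(log)
```

The same IP address yields three sessions here: two for the Firefox client (split by the long pause) and one for the Safari client, which would be merged into a single "user" if only the IP address were used.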

Web Usage Mining

- Two main categories:
  - Learning a user profile (personalized)
    - Web users would be interested in techniques that learn their needs and preferences automatically
  - Learning user navigation patterns (impersonalized)
    - Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site

Conclusions

- The paper tried to resolve confusion with regard to the term Web Mining
  - Differentiated from IR and IE
- Suggested three Web mining categories
  - Content, Structure, and Usage Mining
- Briefly described approaches for the three categories
- Explored connection with agent paradigm

Exam Question #1

Question: Outline the main characteristics of Web information.

Answer: Web information is huge, diverse, and dynamic.

Exam Question #2

Question: How can data mining techniques be used in Web information analysis? Give at least two examples.

Answer:

- Classification: classification on server logs, using a decision tree or Naïve Bayes classifier, to discover the profiles of users belonging to a particular class
- Clustering: clustering can be used to group users exhibiting similar browsing patterns.
- Association analysis: association analysis can be used to relate pages that are most often referenced together in a single server session.
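
The association analysis example can be sketched as counting how often two pages appear in the same server session. This is a minimal support-counting pass over page pairs, not a full association-rule miner; the support threshold and the session data are made up for illustration:

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support=0.5):
    """Return {page pair: support} for pairs seen in enough sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical sessions, each a list of pages requested in one visit.
sessions_pages = [
    ["/index", "/papers", "/contact"],
    ["/index", "/papers"],
    ["/index", "/contact"],
    ["/papers", "/index"],
]
pairs = frequent_page_pairs(sessions_pages)
```

With a 50% support threshold, "/index" and "/papers" (together in three of four sessions) and "/contact" and "/index" (two of four) are kept, while "/contact" and "/papers" are dropped.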

Exam Question #3

Question: What are the three main areas of interest for Web mining?

Answer: (1) Web Content

(2) Web Structure

(3) Web Usage

Thank you!

And Raymond Kosala, Hendrik Blockeel

And Shan Huang!
