Web Mining Research: A Survey

By Raymond Kosala & Hendrik Blockeel, Katholieke Universiteit Leuven, July 2000

Presented 4/18/2002 by Caitlin C Coughlin


4/18/2002 Caitlin C Coughlin, University of Vermont


Overview

Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusions


Introduction

The Web is huge, diverse & dynamic, and thus raises scalability, multimedia data and temporal issues, respectively.

Thus we are drowning in information and facing information overload. Information users can encounter problems when interacting with the Web.


More Introduction

PROBLEMS:
Finding relevant information
Creating new knowledge out of the information available on the web
Personalization of the information
Learning about consumers or individual users


More Introduction

Web mining techniques could be directly or indirectly used to solve the information overload problems described before.

directly: application of web mining techniques directly addresses the problem

indirectly: web mining techniques are used as part of a bigger application that addresses the aforementioned problems.

Web mining is NOT the only useful tool; other useful techniques come from:

DB: databases

IR: Information Retrieval

NLP: Natural Language Processing

the Web document community


Web Mining: Outline

Overview of Web Mining
Describe some confusion in use of the term “Web Mining”
Provide a Classification
Relate Classification to the agent paradigm


Web Mining: Overview

Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.

We suggest decomposing Web mining into these subtasks:

1 Resource finding: the task of retrieving intended web documents

2 Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources

3 Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites

4 Analysis: validation and/or interpretation of the mined patterns.

We’ll call this pattern 1-2-3-4; as we’ll see later, sometimes 1-3-4 is also used.


Web Mining: Confusion

Web mining is often associated with Information Retrieval (IR) or Information Extraction (IE), but it is different from both.

IR is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant ones as possible. [views documents as bag-of-words]

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. [interested in the structure or representation of a document]

We argue that Web mining intersects with the application of machine learning on the web.
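The IR goal stated above (retrieve all relevant documents while retrieving as few non-relevant ones as possible) is usually measured with recall and precision. The following is only a minimal sketch in Python; the function name and the document ids are made up for illustration and are not from the survey.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents retrieved, three relevant overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)
```

High recall corresponds to "all relevant documents"; high precision corresponds to "as few non-relevant ones as possible".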


Web Mining: Classification

Web content mining: describes the discovery of useful information from Web contents/data/documents. [IR and DB views]

Web structure mining: tries to discover the model underlying the link structures of the Web.

Web usage mining: tries to make sense of the data generated by the Web surfer’s sessions or behaviors.


Web Mining & the Agent Paradigm

Web mining is often viewed from or implemented within an agent paradigm. Thus, web mining has a close relationship with software agents or intelligent agents.

Two relevant types of software agents:

User interface agents: information retrieval agents, information filtering agents, & personal assistant agents

Distributed agents: distributed agents for knowledge discovery or data mining [content-based or collaborative]


Web Content Mining: IR view

Information retrieval view for unstructured documents:

Most of the research uses “bag of words” to represent unstructured documents.

Takes single words as features. Features could be boolean or frequency based.

See the table that follows.

[Table of surveyed IR-view work on unstructured documents; only its year column, with entries from 1995 to 2000, survived in this transcript.]
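The “bag of words” representation described above, with single words as features and either boolean or frequency-based values, can be sketched as follows. This is an illustrative Python sketch only: the tokenizer, the function names, and the two sample documents are assumptions and not taken from the survey.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, boolean vectors, frequency vectors) for docs."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))  # one feature per distinct word
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[1 if c[w] else 0 for w in vocab] for c in counts]
    return vocab, boolean, freq


# Two made-up documents; word order is discarded, only word counts remain.
vocab, boolean, freq = bag_of_words(["Web mining mines the Web", "Data mining"])
print(vocab)    # ['data', 'mines', 'mining', 'the', 'web']
print(boolean)  # [[0, 1, 1, 1, 1], [1, 0, 1, 0, 0]]
print(freq)     # [[0, 1, 1, 1, 2], [1, 0, 1, 0, 0]]
```

The boolean vectors record only whether a word occurs; the frequency vectors also record how often.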


Web Content Mining: DB view

The database techniques on the web are related to the problem of managing and querying the information on the web.

Three classes of tasks: modeling and querying the web, information extraction and integration, and web site construction and restructuring.

Tries to model the data on the web and to integrate it so that queries more sophisticated than keyword-based search can be performed.

Research in this area mainly deals with semi-structured data.


Web Structure Mining

In Web structure mining we are interested in the structure of the hyperlinks within the Web itself. (inter-document structure)

This line of research is inspired by the study of social networks and citation analysis.

A few different algorithms have been proposed to do this, such as HITS, PageRank, and improved HITS using content info & outlier filtering [example coming up]
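As a rough illustration of one of the algorithms named above, here is a minimal Python sketch of the basic HITS iteration, in which every page gets a hub score and an authority score that reinforce each other. The tiny link graph, the function name, and the fixed iteration count are assumptions for illustration; the improved variants with content information and outlier filtering are not shown.

```python
def hits(links, iterations=50):
    """Basic HITS. links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: a and b both link to c, and b also links to d.
hub, auth = hits({"a": ["c"], "b": ["c", "d"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph c collects the most inbound links from good hubs, and b points to the most authorities.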


Successful example of Web Structure Mining

The heart of Google's software is PageRank™, a system for ranking web pages developed by Google's founders, Larry Page and Sergey Brin, at Stanford University.

PageRank uses the web’s link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. Google also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. [they don’t specify]

Google does not sell placement within the results themselves (i.e., no one can buy a higher PageRank).
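The voting idea described above can be sketched with a simple power iteration. This is not Google's production system, which the slide notes is unspecified; it is a minimal Python sketch of the commonly published PageRank formulation. The damping factor of 0.85, the handling of pages without outgoing links, and the three-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to (its votes)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so votes from highly ranked pages weigh more.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical graph: A and C both vote for B, and B votes for A.
rank = pagerank({"A": ["B"], "C": ["B"], "B": ["A"]})
print(sorted(rank, key=rank.get, reverse=True))
```

Page B ends up ranked highest because it receives two votes, and A outranks C because its single vote comes from the important page B.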


Web Usage Mining

Web usage mining focuses on techniques that could predict user behavior while the user interacts with the web.

Two commonly used approaches: 1) mapping the usage data of the web server into relational tables before an adapted data mining technique is performed, 2) using the log data directly with special preprocessing techniques.

Applications of web usage mining fall into two main categories: learning a user profile/user modeling in adaptive interfaces [personalized] and learning user navigation patterns [impersonalized].
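The first approach above, mapping server usage data into relational tables, might look like the following Python sketch. It assumes the server writes entries in the Common Log Format; the table name, the column names, and the three sample log lines are made up for illustration, and a simple count query stands in for the adapted data mining technique.

```python
import re
import sqlite3

# Assumed input: Common Log Format, e.g.
# host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def load_log(lines):
    """Parse log lines into an in-memory relational table named hits."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, "
        "path TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # silently skip lines that do not fit the assumed format
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["path"], int(m["status"])),
            )
    return conn


# Hypothetical log entries.
sample = [
    '10.0.0.1 - - [18/Apr/2002:10:00:00 -0400] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [18/Apr/2002:10:00:05 -0400] "GET /papers.html HTTP/1.0" 200 2310',
    '10.0.0.2 - - [18/Apr/2002:10:01:00 -0400] "GET /index.html HTTP/1.0" 200 1043',
]
conn = load_log(sample)
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)  # [('/index.html', 2), ('/papers.html', 1)]
```

Once the log is in a table, ordinary relational queries or mining tools that expect tabular input can be applied to it.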


Conclusions

We surveyed research in Web mining, clarified some confusion in the use of the term Web mining, explored the connection between Web mining categories and the agent paradigm, & suggested three Web mining categories and situated some current research with respect to these categories.

The Web presents new challenges to the traditional data mining algorithms that work on flat data. We have seen that some of the traditional data mining algorithms have been extended or new algorithms have been used to work on the Web data.


Questions?