Web Mining By:- Vineeta 8pgc18 M.Tech (II Semester)
Transcript
Page 1: Web Mining

Web Mining

By: Vineeta

8pgc18

M.Tech (II Semester)

Page 2: Web Mining

Introduction

Why do we need it?
What is it?
How is it different from classical data mining?
What are the problems?
Role of web mining
Web mining taxonomy
Applications

Page 3: Web Mining

Why do we need Web Mining?

Explosive growth in the amount of content on the Internet

Web search engines return thousands of results, which makes them difficult to browse

Online repositories are growing rapidly

Using web mining, web documents can easily be browsed, organised and cataloged with minimal human intervention

Page 4: Web Mining

What is it?

Web mining: the use of data mining techniques to automatically discover and extract information from web documents/services

[Figure labels: WWW, Knowledge]

Page 5: Web Mining

How does it differ from "classical" Data Mining?

The web is not a relation
  Textual information and linkage structure
Usage data is huge and growing rapidly
  Google's usage logs are bigger than their web crawl
  Data generated per day is comparable to the largest conventional data warehouses
Ability to react in real time to usage patterns
  No human in the loop

Page 6: Web Mining

Web Mining: Problems

The "abundance" problem
Limited coverage of the Web
Limited query interface based on keyword-oriented search
Limited customization to individual users
Dynamic and semi-structured content

Page 7: Web Mining

Role of web mining

Finding relevant information

Creating knowledge from Information available

Personalization of the information

Learning about customers / individual users

Page 8: Web Mining

Web Mining Taxonomy

Web Mining
  Web Content Mining
    Identify information within given web pages
    Distinguish personal home pages from other web pages
  Web Structure Mining
    Uses interconnections between web pages to give weight to the pages
  Web Usage Mining
    Understand access patterns and trends to improve structure

Page 9: Web Mining

Web Content Mining

Web Content Mining is the process of extracting useful information from the contents of Web documents.

Content data corresponds to the collection of facts a Web page was designed to convey to the users. It may consist of text, images, audio, video, or structured records such as lists and tables.

Research activities in this field also involve using techniques from other disciplines such as Information Retrieval (IR) and natural language processing (NLP).
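As a rough illustration of what extracting content data can look like, the sketch below pulls free text, hyperlinks and table cells out of one HTML page using Python's standard `html.parser` module. The sample page, the class name `ContentExtractor` and the function `extract_content` are made up for this example; real content mining systems add IR and NLP steps on top of this kind of extraction.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text, hyperlinks and table cells from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # free text fragments
        self.links = []        # href values of <a> tags
        self.cells = []        # text of <td>/<th> cells (structured records)
        self._in_cell = False
        self._skip = 0         # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        data = data.strip()
        if not data or self._skip:
            return
        self.text_parts.append(data)
        if self._in_cell:
            self.cells.append(data)


def extract_content(html):
    """Return the text, links and table cells found in an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "cells": parser.cells,
    }


# Hypothetical sample page, invented for this sketch.
SAMPLE_PAGE = """
<html><head><title>Book list</title><style>p {color: red}</style></head>
<body><p>Two books on <a href="/topics/mining">web mining</a>.</p>
<table><tr><th>Title</th><th>Price</th></tr>
<tr><td>Example Book</td><td>40</td></tr></table></body></html>
"""

content = extract_content(SAMPLE_PAGE)
print(content["links"])
print(content["cells"])
```

The table cells are kept separately from the free text because structured records such as lists and tables usually need different handling from running prose.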

Page 10: Web Mining

Web Content Mining

Web Content Mining
  Agent Based Approach
    Intelligent Search Agent
    Information Filtering & Categorization
    Personalized Web Agent
  Database Approach
    Multilevel Databases
    Web Query Systems

Page 11: Web Mining

Intelligent Search Agents

Concentrate on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information.

They can be further classified into two types:
  Interpretation based on pre-specified information. Examples: Harvest, FAQFinder, Information Manifold, OCCAM
  Interpretation based on unfamiliar sources. Example: ShopBot

Page 12: Web Mining

ShopBot

A ShopBot is an autonomous software agent that combs the Internet to provide users with low-priced products or product recommendations.

A ShopBot basically looks for product information from a variety of vendor sites using general information about the product domain.

The following example displays a ShopBot at www.allbookstores.com.

Page 13: Web Mining
Page 14: Web Mining
Page 15: Web Mining

Information Filtering & Categorization

This makes use of various information retrieval techniques and characteristics of hypertext web documents to interpret and categorize data.

Examples: HyPursuit, BO (Bookmark Organizer).

Page 16: Web Mining

Bookmark Organizer (BO)

Makes use of hierarchical clustering techniques and involves user interaction to organize a collection of web documents.

It operates in two modes:
  Automatic
  Manual

Frozen Nodes: In a hierarchical structure, if we freeze a node N, then the subtree rooted at N represents a coherent group of documents.
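To make the idea of hierarchical clustering concrete, here is a minimal sketch of agglomerative clustering over a handful of bookmarks. It is not BO's actual algorithm: the Jaccard word-overlap similarity, the choice to represent a merged cluster by the union of its words, and the sample bookmarks are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_bookmarks(docs):
    """Agglomerative clustering; returns a nested-tuple tree of titles.

    docs maps a bookmark title to its text. Each working cluster is a
    pair (tree, word_set); a merged cluster is represented by the union
    of its members' words (a simplification chosen for this sketch).
    """
    clusters = [(title, set(text.lower().split())) for title, text in docs.items()]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        # Remove j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical bookmarks, invented for this sketch.
BOOKMARKS = {
    "py-tutorial": "python programming tutorial",
    "py-docs": "python programming reference",
    "pasta": "pasta recipe cooking",
    "soup": "soup recipe cooking",
}

tree = cluster_bookmarks(BOOKMARKS)
print(tree)
```

In this toy tree, each inner tuple plays the role of a node N: freezing the node that groups the two cooking bookmarks would keep that subtree together as one coherent group.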

Page 17: Web Mining

Personalized Web Agents

This category of Web agents learns user preferences and discovers Web information sources based on these preferences and those of other individuals with similar interests.

Examples: WebWatcher, PAINT, Syskill & Webert, GroupLens, Firefly

Page 18: Web Mining

Multilevel Databases

Layer 0:
  Unstructured, massive and global information base.
Layer 1:
  Derived from lower layers.
  Relatively structured.
  Obtained by data analysis, transformation and generalization.
Higher Layers (Layer n):
  Further generalization to form smaller, better structured databases for more efficient retrieval.

Page 19: Web Mining

Web Query Systems

These systems attempt to make use of:
  Standard database query languages such as SQL
  Structural information about web documents
  Natural language processing for queries made in WWW searches

Examples:
  WebLog: Restructuring extracted information from Web sources.
  W3QL: Combines structure queries (organization of hypertext) and content queries (information retrieval techniques).
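The idea of combining a structure query with a content query can be sketched in plain SQL. The example below uses Python's standard `sqlite3` module with two made-up tables, `page` and `link`; it does not reproduce the syntax of WebLog or W3QL, and the table layout and sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical crawl.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE link (src TEXT, dst TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    ("a.html", "Home", "welcome to the data mining group"),
    ("b.html", "Papers", "papers on web mining and data mining"),
    ("c.html", "Contact", "email and address"),
])
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("a.html", "b.html"), ("a.html", "c.html"), ("c.html", "a.html"),
])

# Content condition (body mentions 'mining') combined with a
# structure condition (page is linked from a.html).
rows = conn.execute("""
    SELECT p.url, p.title
    FROM page AS p JOIN link AS l ON l.dst = p.url
    WHERE l.src = 'a.html' AND p.body LIKE '%mining%'
    ORDER BY p.url
""").fetchall()
print(rows)
```

Only `b.html` satisfies both conditions here: `c.html` is linked from `a.html` but does not mention mining, and `a.html` mentions mining but is not linked from itself.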

Page 20: Web Mining

Web Structure Mining

Web Structure Mining is the process of discovering structure information from the Web. This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level. Research at the hyperlink level is also called hyperlink analysis.

Page 21: Web Mining

Web Structure Mining

Different Algorithms for Web Structures:

Page-Rank Method
  Sergey Brin and Lawrence Page: The anatomy of a large-scale hypertextual web search engine. In Proc. of WWW, pages 107-117, Brisbane, Australia, 1998.

CLEVER Method
  http://www.almaden.ibm.com/projects/clever.shtml

Page 22: Web Mining

Page-Rank Method

Introduced by Brin and Page (1998)
Used in the Google search engine
Mines the hyperlink structure of the web to produce a 'global' importance ranking of every web page
Web search results are returned in rank order
Treats a link like an academic citation
Assumption: highly linked pages are more 'important' than pages with few links
A page has a high rank if the sum of the ranks of its back-links is high
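The last point, that a page's rank depends on the ranks of its back-links, can be sketched as a simple iterative computation. This is a minimal illustration rather than Google's implementation: the damping factor of 0.85, the fixed number of iterations, the handling of pages without out-links and the four-page graph are all assumptions that do not appear in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page.

    links maps each page to the list of pages it links to. Each page
    passes an equal share of its rank to every page it links to, so a
    page's rank is built from the ranks of its back-links.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C has three back-links, D has none.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up with the highest rank because three pages link to it, and A ranks second even though it has only one back-link, because that back-link comes from the highly ranked C.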

Page 23: Web Mining

[Figure: Link Structure of the Web, with a backlink labelled]

Page 24: Web Mining

CLEVER Method

CLient-side EigenVector-Enhanced Retrieval
Developed by a team of researchers at the IBM Almaden Research Center
Ranks pages primarily by measuring links between them
Continued refinement of HITS (Hypertext Induced Topic Selection)
Basic principles: authorities and hubs
  Good hubs point to good authorities
  Good authorities are referenced by good hubs
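The mutual reinforcement between hubs and authorities can be sketched with the basic HITS iteration. This shows only the underlying HITS idea, not the CLEVER refinements; the number of iterations, the normalisation and the small link graph are assumptions for illustration.

```python
import math


def hits(links, iterations=20):
    """Basic HITS iteration; returns (hub, authority) score dictionaries.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1, h2, h3 act as hubs; a1, a2 act as authorities.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}

hub_scores, auth_scores = hits(LINKS)
print(max(auth_scores, key=auth_scores.get))
print(sorted(hub_scores, key=hub_scores.get, reverse=True)[:2])
```

Here a1 becomes the strongest authority because all three hubs point to it, while h1 and h2 score higher as hubs than h3 because they point to both authorities.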

Page 25: Web Mining

Web Usage Mining

Web usage mining
  Also known as Web log mining
  Applies mining techniques to discover interesting usage patterns from the data derived from users' interactions while surfing the web
  Mines Web log records to discover user access patterns of Web pages
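As a small illustration of mining Web log records, the sketch below parses lines in the Common Log Format and counts successful requests per page. The regular expression, the sample log lines and the function name `page_counts` are made up for this example; real server logs vary in format and need more careful cleaning.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log excerpt, invented for this sketch.
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 512
10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /products.html HTTP/1.0" 200 512
10.0.0.3 - - [10/Oct/2000:13:58:40 -0700] "GET /missing.html HTTP/1.0" 404 -
malformed line
"""


def page_counts(log_text):
    """Count successful (status 200) requests per page, skipping bad lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("status") == "200":
            counts[match.group("path")] += 1
    return counts


counts = page_counts(SAMPLE_LOG)
print(counts.most_common())
```

Even this simple count is a basic access pattern: it shows which pages are requested most often, while failed requests and unparseable lines are left out.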

Page 26: Web Mining

Web Usage Mining – Three Phases

Page 27: Web Mining

Web Usage Mining

Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition.

The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done.
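The three phases can be sketched end to end on a toy data set. This is one possible reading, not a prescribed method: preprocessing is reduced to splitting requests into sessions with an assumed 30-minute timeout, pattern discovery to counting pairs of pages visited in the same session, and pattern analysis to filtering by an assumed minimum support of 0.5. The records and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (user, timestamp in seconds, page): hypothetical, already-parsed log records.
RECORDS = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 120, "/cart"),
    ("u2", 30, "/home"), ("u2", 90, "/products"),
    ("u1", 5000, "/home"), ("u1", 5060, "/help"),
    ("u3", 10, "/home"), ("u3", 70, "/products"), ("u3", 100, "/cart"),
]


def preprocess(records, timeout=1800):
    """Phase 1: split each user's requests into sessions at gaps > timeout."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current, last_ts = [], None
        for ts, page in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        sessions.append(current)
    return sessions


def discover(sessions):
    """Phase 2: count how many sessions contain each pair of pages."""
    pair_counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    return pair_counts


def analyze(pair_counts, n_sessions, min_support=0.5):
    """Phase 3: keep only pairs whose support reaches min_support."""
    return {pair: count / n_sessions
            for pair, count in pair_counts.items()
            if count / n_sessions >= min_support}


sessions = preprocess(RECORDS)
patterns = analyze(discover(sessions), len(sessions))
print(len(sessions))
print(patterns)
```

The pair of /help and /home appears in only one of the four sessions, so the analysis phase filters it out as uninteresting; what counts as interesting would in practice depend on the application.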

Page 28: Web Mining

Applications

Personalized experience in B2C e-commerce: Amazon.com
Web search: Google
Web-wide user tracking: DoubleClick
Understanding user communities: AOL
Understanding auction behavior: eBay
Personalized web portal: MyYahoo

Page 29: Web Mining

Conclusion

Web mining: data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).

Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000), such as:
  Database (DB)
  Information retrieval (IR)
  The sub-areas of machine learning (ML)
  Natural language processing (NLP)

Page 30: Web Mining

References

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30

David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia. ACM, 1998.

www.iprcom.com/papers/pagerank/

http://maya.cs.depaul.edu/~mobasher/webminer/survey/node23.html

Page 31: Web Mining

References

http://en.wikipedia.org/wiki/Web_mining

http://en.wikipedia.org/wiki/Shop_bot

Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.

R. Cooley, B. Mobasher, et al. Web Mining: Information and Pattern Discovery on the World Wide Web. Proc. IEEE Intl. Conf. Tools with AI, Newport Beach, CA, pp. 558-567, 1997.

Page 32: Web Mining

References

R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.

R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.

S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.

Page 33: Web Mining

THANK YOU!!