Chapter 7 Web Mining Outline

Feb 11, 2022

dariahiddleston
Page 1: Chapter 7 Web Mining Outline

Part III - Web Mining © Prentice Hall 1

Chapter 7 Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Page 2: Chapter 7 Web Mining Outline

Web Mining Issues

Size
  >350 million pages (1999)
  Grows at about 1 million pages a day
  Google indexes 3 billion documents
Diverse types of data

Page 3: Chapter 7 Web Mining Outline

Web Data

Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
  Profiles
  Registration information
  Cookies

Page 4: Chapter 7 Web Mining Outline

Web Mining Taxonomy

Modified from [zai01]

Page 5: Chapter 7 Web Mining Outline

Web Content Mining

Extends work of basic search engines
Search Engines
  IR application
  Keyword based
  Similarity between query and document
  Crawlers
  Indexing
  Profiles
  Link analysis
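
The "similarity between query and document" item can be illustrated with a small sketch. It assumes a bag-of-words model and cosine similarity, which is one common choice and not necessarily the measure the slides have in mind; the function name and the two toy documents are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(query: str, document: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Only terms present in the query can contribute to the dot product.
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents, ranked against a keyword query.
docs = {
    "d1": "web mining applies data mining to the web",
    "d2": "cooking pasta at home",
}
ranked = sorted(docs,
                key=lambda name: cosine_similarity("web mining", docs[name]),
                reverse=True)
```

Here `ranked` lists the documents from most to least similar to the query.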

Page 6: Chapter 7 Web Mining Outline

Crawlers

Robot (spider) traverses the hypertext structure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits entire Web (?) and replaces index
Periodic Crawler: visits portions of the Web and updates subset of index
Incremental Crawler: selectively searches the Web and incrementally modifies index
Focused Crawler: visits pages related to a particular subject
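
A minimal sketch of the basic crawl loop, assuming a tiny in-memory "Web" instead of real HTTP requests. The page names, the `TOY_WEB` dictionary, and the breadth-first visiting order are illustrative assumptions; the slides do not prescribe a traversal order.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("web mining overview", ["b.html", "c.html"]),
    "b.html": ("crawlers and indexing", ["c.html"]),
    "c.html": ("link analysis", ["a.html"]),
    "d.html": ("unreachable page", []),
}

def crawl(seed, web):
    """Traverse links breadth-first from seed and build an inverted index."""
    index = {}            # word -> set of URLs containing it
    visited = set()
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        # Collect information from the visited page.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(link for link in links if link not in visited)
    return index, visited

index, visited = crawl("a.html", TOY_WEB)
```

Pages that no visited page links to (here `d.html`) never enter the index, which is one reason a crawl of the "entire Web" carries a question mark.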

Page 7: Chapter 7 Web Mining Outline

Focused Crawler

Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
  Classifier which assigns relevance score to each page based on crawl topic.
  Distiller to identify hub pages.
  Crawler visits pages based on classifier and distiller scores.

Page 8: Chapter 7 Web Mining Outline

Focused Crawler

Classifier to relate documents to topics
Classifier also determines how useful outgoing links are
Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score.
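
The focused-crawl idea can be sketched as a priority-driven crawl. This is a simplification under stated assumptions: the "classifier" is a stand-in keyword ratio rather than a trained model, the distiller (hub detection) is left out, and `TOY_WEB`, `TOPIC`, and the threshold value are hypothetical.

```python
import heapq

# Hypothetical toy Web: URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("data mining on the web", ["mining1", "sports1"]),
    "mining1": ("web usage mining and data mining", ["mining2"]),
    "mining2": ("mining web logs", []),
    "sports1": ("football scores", ["sports2"]),
    "sports2": ("tennis results", []),
}
TOPIC = {"mining", "web", "data"}  # assumed crawl-topic vocabulary

def relevance(text):
    """Stand-in classifier: fraction of words in the topic vocabulary."""
    words = text.split()
    return sum(w in TOPIC for w in words) / len(words) if words else 0.0

def focused_crawl(seed, web, threshold=0.3):
    """Visit best-scoring pages first; follow links only from relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    visited, relevant = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = web[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant page: its outgoing links are not followed
        relevant.append(url)
        for link in links:
            if link not in visited:
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant, visited

relevant, visited = focused_crawl("seed", TOY_WEB)
```

In this toy run `sports1` is fetched and rejected, so `sports2` is never visited. A distiller would add an exception to that rule for hub pages.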

Page 9: Chapter 7 Web Mining Outline

Focused Crawler

Page 10: Chapter 7 Web Mining Outline

Context Focused Crawler

Context Graph:
  Context graph created for each seed document.
  Root is the seed document.
  Nodes at each level show documents with links to documents at next higher level.
  Updated during crawl itself.
Approach:
  1. Construct context graph and classifiers using seed documents as training data.
  2. Perform crawling using classifiers and context graph created.
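
One way to picture the context graph is as layers of backlinks around the seed. The sketch below assumes the link data is already available as a dictionary (in practice backlinks would have to come from a search engine or an earlier crawl); `LINKS` and `build_context_graph` are illustrative names, and the per-layer classifiers are not shown.

```python
# Hypothetical link data: page -> pages it links to.
LINKS = {
    "p1": ["seed"],
    "p2": ["seed"],
    "p3": ["p1"],
    "p4": ["p1", "p2"],
    "p5": ["p9"],
}

def build_context_graph(seed, links, depth):
    """Layer 0 is the seed; layer i holds pages linking to a page in layer i-1."""
    # Invert the link map to obtain backlinks.
    backlinks = {}
    for src, targets in links.items():
        for dst in targets:
            backlinks.setdefault(dst, set()).add(src)
    layers = [{seed}]
    seen = {seed}
    for _ in range(depth):
        nxt = set()
        for page in layers[-1]:
            nxt |= backlinks.get(page, set()) - seen
        if not nxt:
            break  # no further backlinks known
        seen |= nxt
        layers.append(nxt)
    return layers

layers = build_context_graph("seed", LINKS, depth=2)
```

Each layer then serves as training data for a classifier that estimates how many links away from a relevant document a newly crawled page is likely to be.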

Page 11: Chapter 7 Web Mining Outline

Context Graph

Page 12: Chapter 7 Web Mining Outline

Virtual Web View

Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Page 13: Chapter 7 Web Mining Outline

Personalization

Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
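
Collaborative filtering can be sketched in a few lines. The example assumes a nearest-neighbor approach with cosine similarity over co-rated pages, which is only one of several possible designs; the users, pages, and ratings in `RATINGS` are invented.

```python
import math

# Hypothetical ratings: user -> {page: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"news": 5, "sports": 1, "tech": 4},
    "bob": {"news": 4, "sports": 1, "tech": 5, "science": 5},
    "cat": {"news": 1, "sports": 5, "cooking": 4},
}

def similarity(a, b):
    """Cosine similarity computed over the pages both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[p] * b[p] for p in common)
    na = math.sqrt(sum(a[p] ** 2 for p in common))
    nb = math.sqrt(sum(b[p] ** 2 for p in common))
    return dot / (na * nb)

def recommend(user, ratings):
    """Suggest pages the most similar other user rated, best rated first."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    unseen = {p: r for p, r in ratings[nearest].items()
              if p not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

suggestions = recommend("ann", RATINGS)
```

For `ann`, the closest user is `bob`, so the only suggestion is the page `bob` rated that `ann` has not seen.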

Page 14: Chapter 7 Web Mining Outline

Web Structure Mining

Mine structure (links, graph) of the Web
Techniques
  PageRank
  CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

Page 15: Chapter 7 Web Mining Outline

PageRank

Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it (backlinks).
Weighting is used to provide more importance to backlinks coming from important pages.

Page 16: Chapter 7 Web Mining Outline

PageRank (cont’d)

PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
  PR(i): PageRank for a page i which points to target page p.
  N_i: number of links coming out of page i
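
The formula is recursive, so in practice it is evaluated iteratively. The sketch below assumes every page has at least one outgoing link and picks c in each round so that the ranks sum to 1; it omits the damping factor used in the published PageRank algorithm. The three-page graph in `LINKS` is hypothetical.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over the pages i that link to p."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, targets in links.items():
            for p in targets:
                # Page i splits its rank evenly over its N_i outgoing links.
                new[p] += pr[i] / len(targets)
        total = sum(new.values())
        # c is chosen so that the ranks sum to 1 after every round.
        pr = {p: v / total for p, v in new.items()}
    return pr

# Hypothetical graph: page -> pages it links to (no page without out-links).
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

On this graph the ranks settle at about A = 0.4, B = 0.2, C = 0.4: C collects rank from both A and B and passes all of it back to A.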

Page 17: Chapter 7 Web Mining Outline

CLEVER

Identify authoritative and hub pages.
Authoritative Pages:
  Highly important pages.
  Best source for requested information.
Hub Pages:
  Contain links to highly important pages.

Page 18: Chapter 7 Web Mining Outline

HITS

Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant pages: R.
Identify hub and authority pages for these.
  Expand R to a base set, B, of pages linked to or from R.
  Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
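
The weight calculation step can be sketched as a mutual-reinforcement loop. The example assumes the base set B has already been collected and is given as a link dictionary; the normalization by Euclidean length and the fixed iteration count are choices made for this sketch, and the page names are invented.

```python
import math

def hits(links, iterations=50):
    """links: page -> list of pages it points to, restricted to the base set B."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so that the weights stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: two hub-like pages pointing at two content pages.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(LINKS)
```

`a1` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.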

Page 19: Chapter 7 Web Mining Outline

HITS Algorithm

Page 20: Chapter 7 Web Mining Outline

Web Usage Mining

Applies data mining to Web usage data, such as Web logs.
Discovers patterns in the sequences of pages that users reference.

Page 21: Chapter 7 Web Mining Outline

Web Usage Mining Applications

Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Page 22: Chapter 7 Web Mining Outline

Web Usage Mining Activities

Preprocessing Web log
  Cleanse
    Remove extraneous information
  Sessionize
    Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
  Count patterns that occur in sessions
  Pattern is sequence of pages referenced in session.
  Similar to association rules
    Transaction: session
    Itemset: pattern (or subset)
    Order is important
Pattern Analysis
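
The pattern discovery step can be sketched as support counting over sessions. The example assumes patterns are contiguous runs of a fixed length and that support means the number of sessions containing the pattern; both are simplifying assumptions, and the sessions in `SESSIONS` are invented.

```python
from collections import Counter

def frequent_patterns(sessions, length, min_support):
    """Return contiguous page sequences of the given length with enough support."""
    support = Counter()
    for session in sessions:
        # A set counts each pattern at most once per session (transaction).
        seen = {tuple(session[i:i + length])
                for i in range(len(session) - length + 1)}
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

# Hypothetical sessions: each is the ordered list of pages one user referenced.
SESSIONS = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C", "A", "B"],
]
patterns = frequent_patterns(SESSIONS, length=2, min_support=2)
```

Because order is important, the pattern (A, B) is counted separately from (B, A), unlike an itemset in ordinary association rule mining.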

Page 23: Chapter 7 Web Mining Outline

ARs in Web Mining

Web Mining:
  Content
  Structure
  Usage
Frequent patterns of sequential page references in Web searching.
Uses:
  Caching
  Clustering users
  Develop user profiles
  Identify important pages

Page 24: Chapter 7 Web Mining Outline

Web Usage Mining Issues

Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues

Page 25: Chapter 7 Web Mining Outline

Web Log Cleansing

Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
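
A minimal sketch of these cleansing steps, assuming the log has already been parsed into (source IP, URL, HTTP status) tuples. The record layout, the suffix list used to detect non-page data, and the `U1`/`P1` ID scheme are assumptions made for the example.

```python
# Hypothetical simplified log records: (source IP, URL, HTTP status).
LOG = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.1", "/products.html", 200),
]
# Assumed list of suffixes that indicate figures or code rather than pages.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(log):
    """Drop error and non-page records; replace IPs and URLs with opaque IDs."""
    user_ids, page_ids, cleaned = {}, {}, []
    for ip, url, status in log:
        if status >= 400 or url.lower().endswith(NON_PAGE_SUFFIXES):
            continue
        # The same IP (or URL) always maps to the same ID.
        user = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        page = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((user, page))
    return cleaned

cleaned = cleanse(LOG)
```

The IDs remain non-identifying only if the mapping tables are discarded or protected after cleansing.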

Page 26: Chapter 7 Web Mining Outline

Sessionizing

Divide Web log into sessions.
Two common techniques:
  Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
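
Both techniques can be sketched side by side. The example assumes clicks are (user, timestamp in seconds, page) tuples sorted by time, where the user is the source IP or its replacement ID; the function names, the 15-minute gap, and the click data are illustrative.

```python
def sessionize_by_window(clicks, window):
    """A session holds a user's clicks within `window` seconds of its first click."""
    sessions, start, current = [], {}, {}
    for user, ts, page in clicks:
        if user not in current or ts - start[user] > window:
            current[user] = []
            start[user] = ts
            sessions.append((user, current[user]))
        current[user].append(page)
    return sessions

def sessionize_by_gap(clicks, max_gap):
    """A session continues while the interclick time stays below max_gap seconds."""
    sessions, last_seen, current = [], {}, {}
    for user, ts, page in clicks:
        if user not in current or ts - last_seen[user] >= max_gap:
            current[user] = []
            sessions.append((user, current[user]))
        current[user].append(page)
        last_seen[user] = ts
    return sessions

# Hypothetical cleansed log, sorted by timestamp.
CLICKS = [
    ("U1", 0, "A"), ("U2", 10, "A"), ("U1", 600, "B"),
    ("U1", 1400, "C"), ("U1", 2000, "D"), ("U1", 4000, "E"),
]
by_window = sessionize_by_window(CLICKS, window=25 * 60)
by_gap = sessionize_by_gap(CLICKS, max_gap=15 * 60)
```

The two techniques split the same clicks differently: the window rule cuts U1's visit after C, while the gap rule keeps A through D together because no pause reaches 15 minutes.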

Page 27: Chapter 7 Web Mining Outline

Data Structures

Keep track of patterns identified during Web usage mining process
Common techniques:
  Trie
  Suffix Tree
  Generalized Suffix Tree
  WAP Tree

Page 28: Chapter 7 Web Mining Outline

Trie vs. Suffix Tree

Trie:
  Rooted tree
  Edges labeled with character (page) from pattern
  Path from root to leaf represents pattern.
Suffix Tree:
  Single child collapsed with parent. Edge contains labels of both prior edges.
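
The difference described here, collapsing a single child into its parent, can be shown on a small trie. The sketch assumes each page is one character and uses nested dictionaries with a `$` end marker, which are implementation choices for this example only. It shows the collapsing step alone; a full suffix tree would also insert every suffix of each sequence.

```python
def build_trie(patterns):
    """Nested dicts: each edge is labeled with one page; '$' ends a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
        node["$"] = {}
    return root

def collapse(node):
    """Merge every single-child node into its parent by joining edge labels."""
    result = {}
    for label, child in node.items():
        child = collapse(child)
        # Keep absorbing the child while it has exactly one non-terminal edge.
        while len(child) == 1 and "$" not in child:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        result[label] = child
    return result

# Hypothetical patterns, one character per page.
trie = build_trie(["ABC", "ABD", "XYZ"])
compressed = collapse(trie)
```

After collapsing, the shared prefix of `ABC` and `ABD` sits on a single edge labeled `AB`, and `XYZ` becomes one edge.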

Page 29: Chapter 7 Web Mining Outline

Trie and Suffix Tree

Page 30: Chapter 7 Web Mining Outline

Generalized Suffix Tree

Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
  Compressed version of generalized suffix tree

Page 31: Chapter 7 Web Mining Outline

Types of Patterns

Algorithms have been developed to discover different types of patterns.
Properties:
  Ordered: Characters (pages) must occur in the exact order in the original session.
  Duplicates: Duplicate characters are allowed in the pattern.
  Consecutive: All characters in pattern must occur consecutively in given session.
  Maximal: Not subsequence of another pattern.
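
The ordered, consecutive, and maximal properties can be expressed as small predicates. The helper names and the sample session are hypothetical, and "maximal" is checked here against a given list of patterns using the ordered-subsequence test.

```python
def occurs_ordered(pattern, session):
    """True if the pattern's pages appear in the session in order (gaps allowed)."""
    it = iter(session)
    # Each membership test consumes the iterator up to the matching page.
    return all(page in it for page in pattern)

def occurs_consecutive(pattern, session):
    """True if the pattern appears in the session as one unbroken run."""
    n = len(pattern)
    return any(list(session[i:i + n]) == list(pattern)
               for i in range(len(session) - n + 1))

def maximal(patterns):
    """Keep only patterns that are not a subsequence of another pattern."""
    return [p for p in patterns
            if not any(p != q and occurs_ordered(p, q) for q in patterns)]

# Hypothetical session with a duplicate reference to page A.
session = ["A", "B", "C", "A", "D"]
```

For this session, (A, C, D) is ordered but not consecutive, (C, A, D) is both, and (A, A) is a valid pattern only when duplicates are allowed.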

Page 32: Chapter 7 Web Mining Outline

Pattern Types

Association Rules
  None of the properties hold
Episodes
  Only ordering holds
Sequential Patterns
  Ordered and maximal
Forward Sequences
  Ordered, consecutive, and maximal
Maximal Frequent Sequences
  All properties hold

Page 33: Chapter 7 Web Mining Outline

Episodes

Partially ordered set of pages
Serial episode: totally ordered with time constraint
Parallel episode: partially ordered with time constraint
General episode: partially ordered with no time constraint

Page 34: Chapter 7 Web Mining Outline

DAG for Episode