© Prentice Hall

ADVANCED TOPICS IN DATA MINING

CSE 8331
Spring 2008

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Data Mining Outline

Temporal Mining
Spatial Mining
Web Mining

Temporal Mining Outline

Goal: Examine some temporal data mining issues and approaches.

Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules

Temporal Database

Snapshot – Traditional database
Temporal – Multiple time points
Ex: [figure]

Temporal Queries

[Figure: query interval [t_s^q, t_e^q] and database tuple interval [t_s^d, t_e^d] drawn on a timeline for each query type]

Intersection Query
Inclusion Query
Containment Query
Point Query – Tuple retrieved is valid at a particular point in time.
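The four query types can be sketched as interval predicates. The slide only names them, so the definitions below are the usual readings and should be taken as assumptions: intersection means the two intervals overlap, inclusion means the tuple interval lies inside the query interval, and containment means the query interval lies inside the tuple interval. The `Interval` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # t_s
    end: int    # t_e (inclusive)


def intersection_query(q: Interval, d: Interval) -> bool:
    """Tuple interval d overlaps the query interval q."""
    return d.start <= q.end and q.start <= d.end


def inclusion_query(q: Interval, d: Interval) -> bool:
    """Tuple interval d lies entirely inside the query interval q."""
    return q.start <= d.start and d.end <= q.end


def containment_query(q: Interval, d: Interval) -> bool:
    """Query interval q lies entirely inside the tuple interval d."""
    return d.start <= q.start and q.end <= d.end


def point_query(t: int, d: Interval) -> bool:
    """Tuple interval d is valid at the single time point t."""
    return d.start <= t <= d.end
```

For example, a tuple valid over [8, 12] satisfies an intersection query for [5, 10] but not an inclusion query for the same range.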

Types of Databases

Snapshot – No temporal support
Transaction Time – Supports time when transaction inserted data
– Timestamp
– Range
Valid Time – Supports time range when data values are valid
Bitemporal – Supports both transaction and valid time.

Modeling Temporal Events

Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character
– Temporal ordering indicated by arcs
– May recognize a sequence
– Require precisely defined transitions between states
Approaches
– Markov Model
– Hidden Markov Model
– Recurrent Neural Network

[Figure: FSR]

Markov Model (MM)

Directed graph
– Vertices represent states
– Arcs show transitions between states
– Arc has probability of transition
– At any time one state is designated as current state.
Markov Property – Given a current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing

[Figure: Markov Model]
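As a minimal sketch of the Markov property, the probability of a path through the graph is the product of the arc probabilities along it, since each step depends only on the current state. The two-state chain below, including its state names and probabilities, is hypothetical and not taken from the slides.

```python
from typing import Dict, Sequence

# Hypothetical two-state chain; each row of outgoing probabilities sums to 1.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def path_probability(
    states: Sequence[str],
    transitions: Dict[str, Dict[str, float]] = TRANSITIONS,
) -> float:
    """Probability of following `states`, given that we start in states[0]."""
    prob = 1.0
    for current, nxt in zip(states, states[1:]):
        # Markov property: only the current state matters for the next step.
        prob *= transitions[current].get(nxt, 0.0)
    return prob
```

Here `path_probability(["sunny", "sunny", "rainy"])` multiplies 0.8 by 0.2.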

Hidden Markov Model (HMM)

Like MM, but states need not correspond to observable states.
An HMM models a process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.

[Figure: Hidden Markov Model, modified from [RJ86]]

[Figure: HMM Algorithm]

HMM Applications

Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
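These two questions are commonly answered with the forward algorithm and the Viterbi algorithm respectively; the slides do not name either, so treat that pairing as an assumption. The sketch below uses a small made-up model: the states, symbols, and every probability are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical model: all states, symbols, and probabilities are made up.
STATES = ("hot", "cold")
START = {"hot": 0.6, "cold": 0.4}
TRANS = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
EMIT = {"hot": {"1": 0.2, "2": 0.8}, "cold": {"1": 0.9, "2": 0.1}}


def forward(obs: Sequence[str]) -> float:
    """P(obs | model), summing over every hidden state sequence."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs: Sequence[str]) -> Tuple[List[str], float]:
    """Most likely hidden state sequence for obs, with its probability."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for symbol in obs[1:]:
        step = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob
```

`forward` replaces the sum over all state sequences with a running table, and `viterbi` keeps only the best path into each state instead of the sum.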

Recurrent Neural Network (RNN)

Extension to basic NN
Neuron can obtain input from any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.

[Figure: RNN]

Time Series

Set of attribute values over time
Time Series Analysis – finding patterns in the values.
– Trends
– Cycles
– Seasonal
– Outliers

Analysis Techniques

Smoothing – Moving average of attribute values.
Autocorrelation – relationships between different subseries
– Yearly, seasonal
– Lag – Time difference between related items.
– Correlation Coefficient r

[Figure: Smoothing]

[Figure: Correlation with Lag of 3]
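A minimal sketch of both techniques, assuming a simple unweighted moving average and the Pearson correlation coefficient for r. The function names are hypothetical, and the series passed to `correlation` must not be constant, since that would make the denominator zero.

```python
from statistics import mean
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Mean of each run of `window` consecutive values."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient r of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lag_correlation(values: Sequence[float], lag: int) -> float:
    """Correlate the series with a copy of itself shifted by `lag` steps."""
    if not 1 <= lag <= len(values) - 2:
        raise ValueError("lag must leave at least two overlapping values")
    return correlation(values[:-lag], values[lag:])
```

A series that repeats every three steps, such as 1, 2, 3, 1, 2, 3, 1, 2, 3, correlates perfectly with itself at a lag of 3.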

Similarity

Determine similarity between a target pattern, X, and sequence, Y: sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling corrector applications.
Issues:
– Length
– Scale
– Gaps
– Outliers
– Baseline

Longest Common Subseries

Find longest subseries they have in common.
Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
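The example can be reproduced with a short dynamic-programming sketch. It assumes the subseries must be contiguous and that n is the length of the longer series; both assumptions are inferred from the example values rather than stated on the slide. The function names are hypothetical.

```python
from typing import List, Sequence


def longest_common_subseries(x: Sequence[int], y: Sequence[int]) -> List[int]:
    """Longest run of consecutive values that appears in both x and y."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)  # prev[j]: common run ending at x[i-2], y[j-1]
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the run that ended one position earlier in both series.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return list(x[best_end - best_len:best_end])


def similarity(x: Sequence[int], y: Sequence[int]) -> float:
    """l / n, taking n as the length of the longer series (an assumption)."""
    return len(longest_common_subseries(x, y)) / max(len(x), len(y))
```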

Similarity based on Linear Transformation

Linear transformation function f
– Convert a value from one series to a value in the second
[symbol lost] – tolerated difference in results
[symbol lost] – time value difference allowed

Prediction

Predict future value for time series
Regression may not be sufficient
Statistical Techniques
– ARMA
– ARIMA
NN

Pattern Detection

Identify patterns of behavior in time series
Speech recognition, signal processing
FSR, MM, HMM

String Matching

Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM

Distance between Strings

Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into input string

[Figure: Distance between Strings]
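With only match, delete, and insert available, the distance can be computed by dynamic programming. The costs below (0 for a match, 1 for each delete or insert) are an assumption, since the slides do not state them, and `edit_distance` is a hypothetical name.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of deletes and inserts that turn source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every remaining source character
    for j in range(cols):
        dist[0][j] = j  # insert every remaining target character
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(dist[i - 1][j] + 1,   # delete source[i-1]
                       dist[i][j - 1] + 1)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])  # match costs nothing
            dist[i][j] = best
    return dist[-1][-1]
```

Because there is no substitution step, changing one character costs 2: one delete followed by one insert.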

[Figure: Frequent Sequence]

Frequent Sequence Example

Purchases made by customers [table]
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
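The support of a sequence is the fraction of customers whose purchase history contains its itemsets in order. The original purchase table is not available here, so the three customer histories below are hypothetical, invented only so that the computed supports agree with the three values quoted above.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]

# Hypothetical purchase histories (one list of transactions per customer),
# invented so that the supports match the three values quoted above.
CUSTOMERS: List[List[Itemset]] = [
    [frozenset("AB"), frozenset("BC"), frozenset("D")],
    [frozenset("BC"), frozenset("A"), frozenset("D")],
    [frozenset("AC")],
]


def contains(history: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """True if the pattern's itemsets occur, in order, in distinct transactions."""
    position = 0
    for itemset in pattern:
        # Skip transactions that do not include every item of this itemset.
        while position < len(history) and not itemset <= history[position]:
            position += 1
        if position == len(history):
            return False
        position += 1  # the next itemset must come from a later transaction
    return True


def support(pattern: Sequence[Itemset],
            customers: Sequence[Sequence[Itemset]] = CUSTOMERS) -> float:
    """Fraction of customers whose history contains the pattern."""
    return sum(contains(h, pattern) for h in customers) / len(customers)
```

For instance, `support([frozenset("A"), frozenset("C")])` counts only the first hypothetical customer, who buys A and later C.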

[Figure: Frequent Sequence Lattice]

SPADE

Sequential Pattern Discovery using Equivalence classes
Identifies patterns by traversing lattice in a top down manner.
Divides lattice into equivalence classes and searches each separately.
ID-List: Associates customers and transactions with each item.

SPADE Example

ID-List for Sequences of length 1: [table]
Count for <{A}> is 3
Count for <{A},{D}> is 2

[Figure: Equivalence Classes]

[Figure: SPADE Algorithm]

Temporal Association Rules

Transaction has time: <TID, CID, I_1, I_2, …, I_m, t_s, t_e>
[t_s, t_e] is range of time the transaction is active.
Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules

Inter-transaction Rules

Intra-transaction association rules
– Traditional association rules
Inter-transaction association rules
– Rules across transactions
– Sliding window – How far apart (time or number of transactions) to look for related itemsets.

Episode Rules

Association rules applied to sequences of events.
Episode – set of event predicates and partial ordering on them

Trend Dependencies

Association rules across two database states based on time.
Ex: (SSN, =) ⇒ (Salary, [symbol lost])
Confidence = 4/5
Support = 4/36

Sequence Association Rules

Association rules involving sequences
Ex: <{A},{C}> ⇒ <{A},{D}>
Support = 1/3
Confidence = 1

Calendric Association Rules

Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.

Spatial Mining Outline

Goal: Provide an introduction to some spatial mining techniques.

Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering

Spatial Object

Contains both spatial and nonspatial attributes.
Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve object using either (or both) spatial or nonspatial attributes.

Spatial Data Mining Applications

Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects

Spatial Queries

Spatial selection may involve specialized selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find object close to identified object.
Distance Scan – find object within a certain distance of an identified object where distance is made increasingly larger.

Spatial Data Structures

Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree

MBR

Minimum Bounding Rectangle
Smallest rectangle that completely contains the object

[Figure: MBR Examples]
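A minimal sketch of computing an MBR, assuming the object is given as a list of (x, y) vertices and the rectangle is axis-aligned, which is how MBRs are normally used in spatial indexes. The function names and the `(min_x, min_y, max_x, max_y)` tuple layout are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def minimum_bounding_rectangle(points: Sequence[Point]) -> Rect:
    """Smallest axis-aligned rectangle containing every vertex."""
    if not points:
        raise ValueError("an object needs at least one vertex")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a: Rect, b: Rect) -> bool:
    """Cheap filter: objects can only intersect if their MBRs do."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

Comparing MBRs first lets a query discard most objects before any exact geometry test is run.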

Quad Tree

Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.

[Figure: Quad Tree Example]

R-Tree

As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to those for B-trees.

[Figure: R-Tree Example]

k-D Tree

Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.

[Figure: k-D Tree Example]
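The level-by-level cycling through dimensions can be sketched as follows. This is one common variant, assumed here for brevity: it stores a point at every node and splits at the median, rather than keeping objects only in the lowest-level cells. The `Node`, `build`, and `contains` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int  # dimension this level splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Split on one dimension per level, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def contains(node: Optional[Node], target: Point) -> bool:
    """Exact-match search; ties on the split value check both subtrees."""
    if node is None:
        return False
    if node.point == target:
        return True
    key, split = target[node.axis], node.point[node.axis]
    if key < split:
        return contains(node.left, target)
    if key > split:
        return contains(node.right, target)
    return contains(node.left, target) or contains(node.right, target)
```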

Topological Relationships

Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains

Distance Between Objects

Euclidean
Manhattan
Extensions: [formulas not recovered]
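For reference, the two named measures can be written as below for points of any dimension. The function names are hypothetical, and the extensions mentioned on the slide are not covered because their definitions are not available here.

```python
import math
from typing import Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Sequence[float], q: Sequence[float]) -> float:
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```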

Progressive Refinement

Make approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on spatial relationships
Coarse predicate recursively refined

[Figure: Progressive Refinement]

[Figure: Spatial Data Dominant Algorithm]

STING

STatistical Information Grid-based
Hierarchical technique to divide area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to quad tree

[Figure: STING]

[Figure: STING Build Algorithm]

[Figure: STING Algorithm]

Spatial Rules

Characteristic Rule
– The average family income in Dallas is $50,000.
Discriminant Rule
– The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule
– The average family income in Dallas for families living near White Rock Lake is $100,000.

Spatial Association Rules

Either antecedent or consequent must contain spatial predicates.
View underlying database as set of spatial objects.
May create using a type of progressive refinement

[Figure: Spatial Association Rule Algorithm]

Spatial Classification

Partition spatial objects
May use nonspatial attributes and/or spatial attributes
Generalization and progressive refinement may be used.

ID3 Extension

Neighborhood Graph
– Nodes – objects
– Edges – connect neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.

Spatial Decision Tree

Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them – Buffer.
Description of class based on aggregation of nearby objects.

[Figure: Spatial Decision Tree Algorithm]

Spatial Clustering

Detect clusters of irregular shapes
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of order of input.

[Figure: Spatial Clustering]

CLARANS Extensions

Remove main memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram

[Figure: Voronoi]

SD(CLARANS)

Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive description of cluster.

[Figure: SD(CLARANS) Algorithm]

DBCLASD

Extension of DBSCAN
Distribution Based Clustering of LArge Spatial Databases
Assumes items in cluster are uniformly distributed.
Identifies distribution satisfied by distances between nearest neighbors.
Objects added if distribution is uniform.

[Figure: DBCLASD Algorithm]

Aggregate Proximity

Aggregate Proximity – measure of how close a cluster is to a feature.
Aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull

[Figure: CRH]

Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web

Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining

Web Mining Issues

Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
Diverse types of data

Web Data

Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies

[Figure: Web Mining Taxonomy, modified from [zai01]]

Web Content Mining

Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

Crawlers

Robot (spider) traverses the hypertext structure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject

Focused Crawler

Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns relevance score to each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.

Classifier relates documents to topics
Classifier also determines how useful outgoing links are
Hub Pages contain links to many relevant pages. Must be visited even if relevance score is not high.


Context Focused Crawler

Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at next higher level.
– Updated during crawl itself.
Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context graph created.

[Figure: Context Graph]

Virtual Web View

Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Personalization

Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.

Web Structure Mining

Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

PageRank

Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.

PageRank (cont’d)

PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
– PR(i): PageRank for a page i which points to target page p.
– N_i: number of links coming out of page i
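The formula can be iterated until the ranks settle. The sketch below follows the slide's simplified form and omits the damping factor used in the published PageRank. It assumes that every page has at least one outgoing link, that every link target also appears as a key, and that c is the constant that rescales the ranks to sum to 1. The `pagerank` name and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], iterations: int = 100) -> Dict[str, float]:
    """Iterate PR(p) = c * sum(PR(i) / N_i) over the pages i that link to p."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its rank evenly over its N_i outgoing links.
                new_rank[target] += rank[page] / len(targets)
        total = sum(new_rank.values())
        # c is taken to be the constant that rescales the ranks to sum to 1.
        rank = {p: value / total for p, value in new_rank.items()}
    return rank
```

With the hypothetical graph `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, pages A and C end up with twice the rank of B, because B's only backlink carries half of A's rank.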

CLEVER

Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.

© Prentice Hall 98

HITS

Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant pages – R.
Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.


HITS Algorithm
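The weight calculation step of HITS can be sketched as below. This is an illustrative sketch, not the listing from the slide. It assumes the base set B has already been built and is given as a link dictionary containing at least one link, and it uses the common formulation in which a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to. The function name `hits` and the example pages are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for the pages in a base set.

    links maps each page to the list of pages it links to.
    Assumption: the base set contains at least one link.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors to unit length.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical base set: h1 and h2 are link pages, a1 and a2 are targets.
base_set = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(base_set)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Here a1 is pointed to by both link pages and receives the largest authority weight, while h1 points to both targets and receives the largest hub weight.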


Web Usage Mining

Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis


Web Usage Mining Applications

Personalization
Improve structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)


Web Usage Mining Activities

Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is sequence of page references in session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
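The pattern discovery step can be illustrated by counting, for each candidate pattern, the number of sessions that contain it. The sketch below makes two assumptions that the slide does not state: a pattern is a run of consecutive page references of fixed length, and each session contributes at most once to a pattern's count. The name `count_patterns` and the sample sessions are hypothetical.

```python
from collections import Counter


def count_patterns(sessions, length):
    """Count the sessions in which each consecutive page sequence occurs."""
    counts = Counter()
    for session in sessions:
        # A set ensures a session is counted once per distinct pattern.
        seen = {
            tuple(session[i:i + length])
            for i in range(len(session) - length + 1)
        }
        counts.update(seen)
    return counts


# Hypothetical sessions; each letter stands for a page.
sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "A", "B"]]
pair_counts = count_patterns(sessions, 2)
```

Because order matters, the pattern (A, B) is counted separately from (B, A), unlike an itemset in ordinary association rule mining.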


ARs in Web Mining

Web Mining:
– Content
– Structure
– Usage
Frequent patterns of sequential page references in Web searching.
Uses:
– Caching
– Clustering users
– Develop user profiles
– Identify important pages


Web Usage Mining Issues

Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues


Web Log Cleansing

Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
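The three cleansing steps above can be sketched as a single pass over the log. This is an illustration under stated assumptions: the log is a list of (IP address, URL, HTTP status) tuples, a status of 400 or above marks an error record, and non-page data is recognized by file suffix. The record layout, the suffix list, and the ID formats `U1`, `P1` are all hypothetical.

```python
# Assumed suffixes for figures and code; a real log may need a longer list.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def cleanse(records):
    """Return (user_id, page_id) pairs for the usable page requests."""
    ip_ids, url_ids = {}, {}
    cleaned = []
    for ip, url, status in records:
        if status >= 400:                        # delete error records
            continue
        if url.lower().endswith(NON_PAGE_SUFFIXES):
            continue                             # delete non-page data
        # Replace IP address and URL with unique, non-identifying IDs.
        user = ip_ids.setdefault(ip, "U%d" % (len(ip_ids) + 1))
        page = url_ids.setdefault(url, "P%d" % (len(url_ids) + 1))
        cleaned.append((user, page))
    return cleaned


# Hypothetical log records.
log = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.1", "/missing.html", 404),
    ("10.0.0.2", "/news.html", 200),
]
cleaned = cleanse(log)
```

The same IP address or URL always maps to the same ID, so page sequences can still be reconstructed after the identifying values are gone.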


Sessionizing

Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
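The second technique (interclick threshold) might be sketched as follows. The sketch assumes the log has already been cleansed into (user, timestamp in seconds, page) tuples sorted by timestamp. The tuple layout, the name `sessionize`, and the 1800-second threshold in the example are hypothetical choices, not values from the slides.

```python
def sessionize(clicks, max_gap):
    """Split clicks into per-user sessions using an interclick threshold.

    clicks: list of (user, timestamp_seconds, page), sorted by timestamp.
    A click extends the user's current session when it follows that user's
    previous click by less than max_gap seconds; otherwise it starts a new one.
    """
    sessions = []
    current = {}      # user -> index in sessions of that user's open session
    last_time = {}    # user -> timestamp of that user's previous click
    for user, t, page in clicks:
        if user in last_time and t - last_time[user] < max_gap:
            sessions[current[user]].append(page)
        else:
            current[user] = len(sessions)
            sessions.append([page])
        last_time[user] = t
    return sessions


# Hypothetical cleansed log.
clicks = [
    ("U1", 0, "A"),
    ("U2", 10, "A"),
    ("U1", 60, "B"),
    ("U2", 100, "D"),
    ("U1", 4000, "C"),
]
user_sessions = sessionize(clicks, max_gap=1800)
```

The last click by U1 arrives more than 1800 seconds after that user's previous click, so it opens a new session.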


Data Structures

Keep track of patterns identified during Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree


Trie vs. Suffix Tree

Trie:
– Rooted tree
– Edges labeled with character (page) from pattern
– Path from root to leaf represents pattern.
Suffix Tree:
– Single child collapsed with parent. Edge contains labels of both prior edges.
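A trie of page patterns can be sketched with nested nodes, one edge per page. This is a minimal illustration and does not include the suffix tree's collapsing of single-child nodes. The class and function names are hypothetical, and the `is_end` flag is an added assumption so that a pattern ending at an internal node can still be recognized.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (page) -> child node
        self.is_end = False   # True when a stored pattern ends at this node


def insert(root, pattern):
    """Add a pattern, creating one edge per page along the path."""
    node = root
    for page in pattern:
        node = node.children.setdefault(page, TrieNode())
    node.is_end = True


def contains(root, pattern):
    """True if the exact pattern was inserted earlier."""
    node = root
    for page in pattern:
        node = node.children.get(page)
        if node is None:
            return False
    return node.is_end


# Hypothetical patterns; each letter stands for a page.
root = TrieNode()
insert(root, "ABC")
insert(root, "ABD")
```

The two patterns share the edges for A and B and branch only at the last page.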


Trie and Suffix Tree


Generalized Suffix Tree

Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
Compressed version of generalized suffix tree
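The idea of keeping a frequency count in each node can be shown with an uncompressed suffix trie built over several sessions. This is a simplification: a true generalized suffix tree collapses single-child chains, and the WAP tree is not shown. The node layout (a dictionary with `count` and `children`) and the function names are hypothetical.

```python
def build_suffix_counts(sessions):
    """Insert every suffix of every session into one shared trie.

    Each node counts how often the pattern spelled by its path from the
    root occurs as a run of consecutive pages across all sessions.
    """
    root = {"count": 0, "children": {}}
    for session in sessions:
        for start in range(len(session)):
            node = root
            for page in session[start:]:
                node = node["children"].setdefault(
                    page, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


def frequency(root, pattern):
    """Look up the stored count for a pattern (0 if it never occurs)."""
    node = root
    for page in pattern:
        node = node["children"].get(page)
        if node is None:
            return 0
    return node["count"]


# Hypothetical sessions; each letter stands for a page.
tree = build_suffix_counts(["ABAB", "BAB"])
```

For these two sessions the pattern AB occurs three times in total, which is the count stored in the node reached by following A and then B.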


Types of Patterns

Algorithms have been developed to discover different types of patterns.
Properties:
– Ordered – Characters (pages) must occur in the exact order in the original session.
– Duplicates – Duplicate characters are allowed in the pattern.
– Consecutive – All characters in pattern must occur consecutively in given session.
– Maximal – Not subsequence of another pattern.
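The ordered, consecutive, and maximal properties can each be expressed as a small check. The sketch below is one possible reading of the definitions above. The function names are hypothetical, and "maximal" is tested only against an explicitly supplied set of patterns.

```python
def occurs_in_order(pattern, session):
    """Ordered: pages appear in the session in the same order, gaps allowed."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the matched page.
    return all(page in remaining for page in pattern)


def occurs_consecutively(pattern, session):
    """Consecutive: pages appear in the session with no other page between."""
    n = len(pattern)
    return any(
        list(session[i:i + n]) == list(pattern)
        for i in range(len(session) - n + 1)
    )


def is_maximal(pattern, patterns):
    """Maximal: pattern is not a subsequence of any other pattern in the set."""
    return not any(
        other != pattern and occurs_in_order(pattern, other)
        for other in patterns
    )


# Hypothetical session; each letter stands for a page.
session = "ABCD"
found = ["AC", "ABC"]
```

In this example AC occurs in order in the session but not consecutively, and it is not maximal because it is a subsequence of ABC.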


Pattern Types

Association Rules: None of the properties hold
Episodes: Only ordering holds
Sequential Patterns: Ordered and maximal
Forward Sequences: Ordered, consecutive, and maximal
Maximal Frequent Sequences: All properties hold


Episodes

Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partially ordered with time constraint
General episode – partially ordered with no time constraint
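A serial episode check might look like the sketch below. The slides do not define the time constraint, so the sketch assumes it is a window: all pages of the episode must occur, in order, no more than `window` time units after the episode's first page. The event layout of (time, page) tuples and the name `serial_episode_occurs` are hypothetical.

```python
def serial_episode_occurs(events, episode, window):
    """True if the pages of episode occur in order within the time window.

    events: list of (time, page) tuples sorted by time.
    episode: non-empty sequence of pages in their required order.
    """
    for start, (t0, page) in enumerate(events):
        if page != episode[0]:
            continue
        matched = 1                      # pages of the episode seen so far
        for t, p in events[start + 1:]:
            if t - t0 > window:          # outside the assumed time window
                break
            if matched < len(episode) and p == episode[matched]:
                matched += 1
        if matched == len(episode):
            return True
    return False


# Hypothetical event sequence.
events = [(0, "A"), (5, "B"), (7, "C"), (30, "A"), (50, "C")]
```

With a window of 10, the episode A then C occurs (times 0 and 7); with a window of 5 it does not, because no C follows an A closely enough.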


DAG for Episode
