ADVANCED TOPICS IN DATA MINING
CSE 8331, Spring 2008
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M.H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules
Modeling Temporal Events
Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character
– Temporal ordering indicated by arcs
– May recognize a sequence
– Require precisely defined transitions between states
Markov Model (MM)
Directed graph
– Vertices represent states
– Arcs show transitions between states
– Arc has probability of transition
– At any time one state is designated as current state.
Markov Property: Given a current state, the transition probability is independent of any previous states.
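As a minimal sketch of the Markov property, the probability of a path through the graph is just the product of the arc probabilities along it. The two weather states and their probabilities below are invented for illustration and do not come from the slides.

```python
# Hypothetical two-state chain; state names and probabilities are assumptions.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def path_probability(transitions, path):
    """Probability of following `path`, given that we start in path[0]."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # Markov property: the next step depends only on the current state.
        prob *= transitions[current].get(nxt, 0.0)
    return prob

p = path_probability(transitions, ["sunny", "sunny", "rainy", "rainy"])
print(p)  # product of 0.8, 0.2 and 0.6
```

A missing arc is treated as probability 0, so any path that uses it is impossible.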
Hidden Markov Model (HMM)
Like MM, but states need not correspond to observable states.
HMM models a process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.
Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
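The two questions above are commonly answered with the forward algorithm and the Viterbi algorithm, respectively. The sketch below assumes a small HMM stored as plain dictionaries; the states, symbols, and probabilities are illustrative assumptions, not values from the slides.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Probability that the HMM produced the observation sequence `obs`."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        # Sum over every state we could have come from.
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for `obs`, together with its probability."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new_best = {}
        for s in states:
            # Keep only the single best predecessor instead of summing.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * emit_p[s][symbol], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob

# Hypothetical model: hidden health state, observed symptom.
states = ["healthy", "fever"]
start_p = {"healthy": 0.6, "fever": 0.4}
trans_p = {
    "healthy": {"healthy": 0.7, "fever": 0.3},
    "fever": {"healthy": 0.4, "fever": 0.6},
}
emit_p = {
    "healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
obs = ["normal", "cold", "dizzy"]
print(forward(obs, states, start_p, trans_p, emit_p))
print(viterbi(obs, states, start_p, trans_p, emit_p))
```

The two functions differ only in replacing the sum over predecessor states with a maximum, which is why they are usually presented together.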
Extension to basic NN
Neuron can obtain input from any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.
Analysis Techniques
Smoothing: Moving average of attribute values.
Autocorrelation: relationships between different subseries
– Yearly, seasonal
– Lag: Time difference between related items.
– Correlation Coefficient r
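A minimal sketch of both techniques, assuming the time series is a plain list of numbers: smoothing with a fixed-size moving average, and autocorrelation measured as the Pearson correlation coefficient r between the series and a copy of itself shifted by a given lag. The function names are illustrative.

```python
def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def lag_correlation(values, lag):
    """Correlation coefficient r between the series and itself `lag` steps later.

    Assumes 1 <= lag < len(values) and that neither subseries is constant.
    """
    x, y = values[:-lag], values[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

series = [1, 2, 3, 1, 2, 3, 1, 2, 3]    # repeats every 3 steps
print(moving_average([1, 2, 3, 4, 5], 3))
print(lag_correlation(series, 3))       # lag equal to the period
print(lag_correlation(series, 1))       # lag that does not match the period
```

A lag that matches the period of a seasonal series gives r close to 1, which is how yearly or seasonal relationships show up.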
Similarity
Determine similarity between a target pattern, X, and sequence, Y: sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling
Similarity based on Linear Transformation
Linear transformation function f
– Convert a value from one series to a value in the second
Tolerated difference in results
Time value difference allowed
Predict future value for time series
Regression may not be sufficient
Statistical Techniques
Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM
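A minimal sketch of Knuth-Morris-Pratt matching. The failure table plays the role of the finite state machine: on a mismatch it says which state (how many pattern characters already matched) to fall back to, so no character of the sequence is read twice. The function names are illustrative.

```python
def failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, pattern):
    """Return the start index of every occurrence of `pattern` in `text`."""
    if not pattern:
        return []
    table = failure_table(pattern)
    matches = []
    k = 0  # current state: number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches

print(failure_table("abab"))
print(kmp_search("abababca", "abab"))
```

The same functions work on any sequence of comparable items, for example a list of page identifiers, not only on strings.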
Temporal Association Rules
Transaction has time: <TID, CID, I1, I2, …, Im, ts, te>
[ts, te] is range of time the transaction is active.
Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules
Intra-transaction association rules
– Traditional association rules
Inter-transaction association rules
– Rules across transactions
– Sliding window: How far apart (time or number of transactions) to look for related itemsets.
Calendric Association Rules
Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.
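The steps above can be sketched as follows, assuming each transaction is a (timestamp, items) pair and that support is measured only against the transactions inside the chosen interval. The brute-force counting, the hour-of-day timestamps, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def large_itemsets_in_interval(transactions, start, end, min_support, max_size=2):
    """Itemsets whose support within [start, end) is at least `min_support`."""
    in_window = [items for time, items in transactions if start <= time < end]
    counts = Counter()
    for items in in_window:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(set(items)), size):
                counts[combo] += 1
    # Support is relative to the transactions in the interval only.
    threshold = min_support * len(in_window)
    return {combo: n for combo, n in counts.items() if n >= threshold}

# Hypothetical data: timestamp is the hour of day.
transactions = [
    (8, {"coffee", "bagel"}),
    (9, {"coffee", "bagel"}),
    (9, {"coffee"}),
    (13, {"soup"}),
    (14, {"soup", "bread"}),
]
print(large_itemsets_in_interval(transactions, 6, 12, 0.6))   # morning only
print(large_itemsets_in_interval(transactions, 0, 24, 0.6))   # whole day
```

In this data {bagel, coffee} is large in the morning interval but not over the whole day, which is the kind of rule a calendric analysis is meant to surface.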
Contains both spatial and nonspatial attributes.
Must have a location type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve object using either (or both) spatial or nonspatial attributes.
Spatial Data Mining Applications
Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects
Region (Range) Query: find objects that intersect a given region.
Nearest Neighbor Query: find object close to identified object.
Distance Scan: find object within a certain distance of an identified object where distance is made increasingly larger.
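The three query types can be sketched with brute-force scans over point objects. Real systems answer them through a spatial index; the dictionary of named points and the function names here are assumptions for illustration.

```python
import math

def region_query(points, x_min, y_min, x_max, y_max):
    """Names of the points that fall inside the given rectangle."""
    return [
        name for name, (x, y) in points.items()
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]

def nearest_neighbor(points, target):
    """Name of the point closest to the point named `target`."""
    others = (name for name in points if name != target)
    return min(others, key=lambda name: math.dist(points[name], points[target]))

def distance_scan(points, target, radii):
    """For each radius, in the order given, list the points within it."""
    result = []
    for radius in radii:
        found = sorted(
            name for name in points
            if name != target and math.dist(points[name], points[target]) <= radius
        )
        result.append((radius, found))
    return result

# Hypothetical objects with (x, y) locations.
points = {"a": (0, 0), "b": (1, 0), "c": (5, 5), "d": (2, 2)}
print(region_query(points, 0, 0, 2, 2))
print(nearest_neighbor(points, "a"))
print(distance_scan(points, "a", [1, 3, 8]))
```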
Spatial Data Structures
Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree
Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.
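A minimal sketch of this decomposition, assuming the object and the space are axis-aligned rectangles given as (x_min, y_min, x_max, y_max). At each level only the quadrants that contain some portion of the object are kept and subdivided again. The function names and the sample rectangle are illustrative.

```python
def intersects(a, b):
    """True if rectangles a and b share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def quadrants(cell):
    """Split a rectangle into its four equal quadrants."""
    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]

def covering_cells(obj, space, levels):
    """For each level, the quadrants that contain any portion of `obj`."""
    result = []
    current = [space]
    for _ in range(levels):
        current = [
            q for cell in current for q in quadrants(cell) if intersects(q, obj)
        ]
        result.append(current)
    return result

def area(cell):
    return (cell[2] - cell[0]) * (cell[3] - cell[1])

levels = covering_cells(obj=(1, 1, 3, 3), space=(0, 0, 8, 8), levels=3)
for depth, cells in enumerate(levels, start=1):
    print(depth, len(cells), sum(area(c) for c in cells))
```

The total area of the kept quadrants never grows from one level to the next and approaches the object's own area, which is what "more exact representation" means here.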
As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to
Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.
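A minimal sketch of a k-D tree over tuples of equal length. It uses one common variant in which every node stores a point and splits on the median of the dimension assigned to its level; a variant that keeps objects only in the lowest-level cells, as the slide describes, differs in where the points live but cycles through the dimensions the same way. The dictionary-based node layout is an assumption.

```python
def build_kd_tree(points, depth=0):
    """Build a tree that splits on dimension (depth mod k) at each level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kd_tree(ordered[:mid], depth + 1),
        "right": build_kd_tree(ordered[mid + 1:], depth + 1),
    }

def contains(node, target):
    """Exact-match search that compares only one dimension per level."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis = node["axis"]
    split = node["point"][axis]
    if target[axis] < split:
        return contains(node["left"], target)
    if target[axis] > split:
        return contains(node["right"], target)
    # Equal on the split dimension: the point could be on either side.
    return contains(node["left"], target) or contains(node["right"], target)

tree = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"], tree["left"]["point"], tree["right"]["point"])
print(contains(tree, (8, 1)), contains(tree, (1, 1)))
```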
Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
The average family income in Dallas is $50,000.
Discriminant Rule
The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule
The average family income in Dallas for families living near White Rock Lake is $100,000.
Definition of neighborhood varies
ID3 considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.
Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive description
Extension of DBSCAN
Distribution Based Clustering of LArge Spatial Databases
Assumes items in cluster are uniformly distributed.
Identifies distribution satisfied by distances between nearest neighbors.
Objects added if distribution is uniform.
Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits entire Web (?) and replaces index
Periodic Crawler: visits portions of the Web and updates subset of index
Incremental Crawler: selectively searches the Web and incrementally modifies index
Focused Crawler: visits pages related to a
Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns relevance score to each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on crawler
Classifier to relate documents to topics
Classifier also determines how useful outgoing links are
Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score.
Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at next higher level.
– Updated during crawl itself.
Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it: Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.
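The weighting idea can be sketched as an iterative computation in which each page repeatedly passes a share of its current rank to the pages it links to. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are assumptions of this sketch, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict mapping each page to the pages it links to.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A backlink is worth more when it comes from a highly
                # ranked page that has few outgoing links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical Web graph: a, b and d all point to c; c points back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ranks highest because it has the most backlinks, and a outranks b and d because its single backlink comes from the important page c.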
Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is a sequence of page references in a session.
– Similar to association rules
  » Transaction: session
  » Itemset: pattern (or subset)
  » Order is important
ARs in Web Mining
Web Mining:
– Content
– Structure
– Usage
Frequent patterns of sequential page references in Web searching.
Uses:
– Caching
– Clustering users
– Develop user profiles
– Identify important pages
Identification of exact user not possible.
Exact sequence of pages referenced by a user cannot be determined due to caching.
Session not well defined
Security, privacy, and legal issues
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
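Both techniques can be sketched with one helper that walks each source's references in time order and asks a rule whether the next reference starts a new session. The log layout of (source ID, time in minutes, page) tuples, the function names, and the sample values are assumptions for illustration.

```python
from collections import defaultdict

def split_sessions(log, starts_new_session):
    """Group page references per source, cutting where the rule says so."""
    by_source = defaultdict(list)
    for source, time, page in sorted(log, key=lambda record: record[1]):
        by_source[source].append((time, page))
    sessions = []
    for source, refs in by_source.items():
        current = []
        for time, page in refs:
            if current and starts_new_session(current, time):
                sessions.append((source, [p for _, p in current]))
                current = []
            current.append((time, page))
        sessions.append((source, [p for _, p in current]))
    return sessions

def sessions_by_window(log, window):
    """Technique 1: references within `window` of the session's first one."""
    return split_sessions(log, lambda current, time: time - current[0][0] > window)

def sessions_by_gap(log, threshold):
    """Technique 2: references whose interclick time is below `threshold`."""
    return split_sessions(log, lambda current, time: time - current[-1][0] >= threshold)

# Hypothetical cleaned log: (source ID, minutes since start, page ID).
log = [
    ("u1", 0, "A"), ("u2", 5, "X"), ("u1", 10, "B"),
    ("u1", 20, "C"), ("u1", 30, "D"), ("u1", 70, "E"),
]
print(sessions_by_window(log, 25))
print(sessions_by_gap(log, 15))
```

The two rules can disagree on the same log: here page D falls outside the 25-minute window of its session but is still within 15 minutes of the previous click.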
Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree: Compressed version of generalized suffix tree
Algorithms have been developed to discover different types of patterns.
Properties:
– Ordered: Characters (pages) must occur in the exact order in the original session.
– Duplicates: Duplicate characters are allowed in the pattern.
– Consecutive: All characters in pattern must occur consecutively in given session.
– Maximal: Not a subsequence of another pattern.
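As a minimal sketch of how these properties combine, the code below discovers patterns that are ordered, consecutive, and allow duplicates, counts the sessions in which each occurs, and then keeps only the maximal ones. Here "maximal" is taken to mean not a consecutive subsequence of another frequent pattern; that reading, the brute-force enumeration, the length cap, and the sample sessions are assumptions rather than any specific published algorithm.

```python
from collections import Counter

def consecutive_patterns(session, max_len):
    """Every run of 1..max_len consecutive pages in the session, in order."""
    return {
        tuple(session[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(session) - n + 1)
    }

def frequent_patterns(sessions, min_count, max_len=3):
    """Patterns that occur in at least `min_count` sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(consecutive_patterns(session, max_len))
    return {pattern: n for pattern, n in counts.items() if n >= min_count}

def is_consecutive_subsequence(short, long):
    return any(
        long[i:i + len(short)] == short
        for i in range(len(long) - len(short) + 1)
    )

def maximal(patterns):
    """Drop every pattern that is contained in another pattern of the set."""
    return {
        p for p in patterns
        if not any(p != q and is_consecutive_subsequence(p, q) for q in patterns)
    }

# Hypothetical sessions; each letter stands for a page.
sessions = [list("ABCD"), list("ABD"), list("CABC")]
frequent = frequent_patterns(sessions, min_count=2)
print(sorted(frequent.items()))
print(sorted(maximal(frequent)))
```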