Spatial Data Mining: Data mining is a process of knowledge discovery related to finding patterns of interest within a large data set. Examples of databases where spatial data mining is useful are: Earth observation satellites (a terabyte per day), census, weather systems, marketing, and so on.

Dec 21, 2015

Transcript
Page 1:

Spatial Data Mining

• Data mining is a process of knowledge discovery related to finding patterns of interest within a large data set.

Examples of databases where spatial data mining is useful are:

– Earth observation satellites (a terabyte per day)
– Census
– Weather systems
– Marketing
– And so on

Page 2:

Examples of Spatial Patterns

• Historic examples
– 1855 Asiatic cholera in London: a water pump identified as the source
• Modern examples
– Cancer clusters, to investigate environmental health hazards
– Crime hotspots, for planning police patrol routes
– Bald eagles nest on tall trees near open water
– Unusual warming of the Pacific Ocean (El Niño)

Page 3:

Spatial Data Mining

• Data mining is a combination of processes:
– Data extraction
– Data cleaning
– Selection of characteristics
– Algorithms
– Analysis of results

• An important characteristic to exploit in spatial data mining: similar objects tend to be spatially close.

Page 4:

Data Mining: Process

[Figure: data mining process diagram. Labels: Problem; DB (OGIS SQL); Data Mining Algorithms (Association, Clustering, Classification); Hypothesis; Expert Analyst (Verification, Refinement, Visualization); interpretation and action; feedback and technique adjustment.]

Page 5:

Statistics versus Data Mining

• Data mining is strongly related to statistical analysis.

• Data mining can be seen as a filter (exploratory data analysis) before applying a rigorous statistical tool.

• Data mining generates hypotheses that are then verified.

• The filtering process does not guarantee completeness (patterns may be wrongly eliminated or missed).

Page 6:

A Classification of Data Mining Processes

• The three most common processes of data mining are:

– Association rules: determination of interactions between attributes. For example:
• X → Y

– Classification: estimation of an attribute of an entity in terms of the attribute values of another entity. Some applications are:
• Predicting locations (shopping centers, habitat, crime zones)
• Thematic classification (satellite images)

– Clustering: a form of unsupervised learning, where the classes and the number of classes are unknown. Applications:

Page 7:

Association Rules

• A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates.

Example: “Most big cities in Canada are close to the Canada-U.S. border.”

A strong rule indicates that the patterns in the rule occur relatively frequently in the database and have strong implication relationships.

Page 8:

Association Rules

• Spatial association rules are defined in terms of spatial predicates: P1 ∧ P2 ∧ … ∧ Pn → Q1 ∧ Q2 ∧ … ∧ Qm

For example:

is_a(x, country) ∧ close(x, Mediterranean) → is(x, wine-exporter) (s%, c%)

where, for a rule i1 → i2:

s% (support): i1 and i2 occur together in at least s% of cases

c% (confidence): among all cases in which i1 occurs, i2 also occurs in at least c% of them.
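The two measures can be made concrete with a small computation. The following is a minimal sketch in Python, assuming each case is represented as the set of predicates that hold for it; the five cases and the helper names `support` and `confidence` are hypothetical and only illustrate the definitions.

```python
# Hypothetical cases: each case is the set of predicates that hold for one country.
# The data is invented purely to illustrate the two measures.
cases = [
    {"close_to_mediterranean", "wine_exporter"},
    {"close_to_mediterranean", "wine_exporter"},
    {"close_to_mediterranean"},
    {"wine_exporter"},
    set(),
]


def support(cases, items):
    """Fraction of all cases in which every predicate in `items` holds."""
    items = set(items)
    return sum(items <= case for case in cases) / len(cases)


def confidence(cases, antecedent, consequent):
    """Among cases where the antecedent holds, fraction where the consequent also holds."""
    antecedent, consequent = set(antecedent), set(consequent)
    covered = [case for case in cases if antecedent <= case]
    if not covered:
        return 0.0
    return sum(consequent <= case for case in covered) / len(covered)


s = support(cases, {"close_to_mediterranean", "wine_exporter"})
c = confidence(cases, {"close_to_mediterranean"}, {"wine_exporter"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

With this toy data, two of the five cases satisfy both predicates, and two of the three cases that are close to the Mediterranean are wine exporters.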

Page 9:

Association Rules: Apriori

• Principle: if an itemset has high support, then so do all its subsets.

• The steps of the algorithm are as follows:
– First, discover all frequent 1-itemsets
– Combine them to form 2-itemsets and check which are frequent
– Continue until no more itemsets exceed the threshold
– Search for rules

Page 10:

Association Rules: Apriori

Item codes: CD = D, Alarm = A, TV = T, VCR = V, Computer = C

Case | Items
1 | D A V C
2 | A T C
3 | D A V C
4 | D A T C
5 | D A T V C
6 | A T V

Page 11:

Association Rules: Apriori

Frequency of itemsets

Support | Itemsets
100% (6) | A
83% (5) | C, AC
67% (4) | D, T, V, DA, DC, AT, AV, DAC
50% (3) | DV, TC, VC, DAV, DVC, ATC, AVC, DAVC
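The itemset frequencies can be reproduced with a short level-wise pass in the spirit of Apriori. This is a minimal sketch rather than a tuned implementation: the case list is transcribed from the six example cases (D = CD, A = Alarm, T = TV, V = VCR, C = Computer), and the function name `apriori` and the `min_count` parameter are illustrative choices.

```python
from itertools import combinations

# The six example cases (D = CD, A = Alarm, T = TV, V = VCR, C = Computer).
CASES = [set("DAVC"), set("ATC"), set("DAVC"), set("DATC"), set("DATVC"), set("ATV")]


def apriori(cases, min_count):
    """Return {itemset: count} for every itemset occurring in at least min_count cases."""

    def count(itemset):
        return sum(itemset <= case for case in cases)

    items = sorted(set().union(*cases))
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: count(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) >= min_count}
    return frequent


frequent = apriori(CASES, min_count=3)  # 3 of 6 cases = 50% support
ordered = sorted(frequent.items(), key=lambda kv: (-kv[1], len(kv[0]), sorted(kv[0])))
for itemset, n in ordered:
    print(f"{n}/6  {''.join(sorted(itemset))}")
```

The prune step is where the principle is used: a candidate is counted only if all of its smaller subsets were already found frequent.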

Page 12:

Association Rules: Apriori

Association rules with confidence = 100%:

D → A (4/4), D → C (4/4), D → AC (4/4), T → A (4/4), V → A (4/4), C → A (5/5)

DA → C (4/4), DC → A (4/4), DV → A (3/3), DV → C (3/3), DV → AC (3/3), TC → A (3/3)

VC → A (3/3), VC → D (3/3), VC → DA (3/3), DAV → C (3/3), DVC → A (3/3), AVC → D (3/3)

Association rules with confidence >= 80%:

C → D (4/5), A → C (5/6), C → DA (4/5)
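As a cross-check on rule confidences, the sketch below enumerates rules over the six example cases by brute force rather than with Apriori's pruning, which is affordable for five items. The helper names `count` and `rules` are hypothetical, and a rule is reported as (antecedent, consequent, cases matching both, cases matching the antecedent).

```python
from itertools import combinations

# The six example cases (D = CD, A = Alarm, T = TV, V = VCR, C = Computer).
CASES = [set("DAVC"), set("ATC"), set("DAVC"), set("DATC"), set("DATVC"), set("ATV")]


def count(itemset):
    """Number of cases containing every item of `itemset`."""
    return sum(set(itemset) <= case for case in CASES)


def rules(min_count, min_conf):
    """Yield (antecedent, consequent, hits, total) for rules meeting both thresholds."""
    items = sorted(set().union(*CASES))
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            hits = count(itemset)
            if hits < min_count:
                continue
            # Try every non-empty proper subset of the itemset as antecedent.
            for r in range(1, size):
                for lhs in combinations(itemset, r):
                    total = count(lhs)
                    if hits / total >= min_conf:
                        rhs = tuple(i for i in itemset if i not in lhs)
                        yield "".join(lhs), "".join(rhs), hits, total


exact = list(rules(min_count=3, min_conf=1.0))
for lhs, rhs, hits, total in exact:
    print(f"{lhs} -> {rhs} ({hits}/{total})")
```

Items inside a rule are printed in alphabetical order, so a consequent written DA on the slide appears here as AD.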

Page 13:

Association Rules

• Differences with respect to the spatial domain:

– The notion of a transaction or case does not exist, since the data are embedded in a continuous space. Partitioning the space may introduce errors by overestimating or underestimating confidences.

– Itemsets are smaller in the spatial domain. Thus, the cost of generating candidates is not the dominant factor; the enumeration of neighbors dominates the final computational cost.

– In most cases, the spatial items are discretized versions of continuous variables.

– The notion of a transaction is replaced by that of a neighborhood.
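One way to picture the last point is to build a "transaction" for each reference object from whatever lies in its neighborhood. The sketch below is only an illustration under stated assumptions: features are reduced to points in a plane, the neighborhood is a fixed radius, and the data and the function name `neighborhood_transactions` are hypothetical.

```python
from math import hypot

# Hypothetical point features: (feature type, x, y) in arbitrary planar units.
FEATURES = [
    ("town", 0.0, 0.0), ("road", 0.5, 0.2), ("water", 0.3, -0.4),
    ("town", 5.0, 5.0), ("mine", 5.4, 5.1),
    ("town", 9.0, 0.0),
]


def neighborhood_transactions(features, reference_type, radius):
    """For each reference feature, collect the types of the other features within `radius`."""
    transactions = []
    for i, (kind, x, y) in enumerate(features):
        if kind != reference_type:
            continue
        # Enumerating neighbors is the costly part: every other feature is inspected.
        nearby = {
            other_kind
            for j, (other_kind, ox, oy) in enumerate(features)
            if j != i and hypot(ox - x, oy - y) <= radius
        }
        transactions.append(nearby)
    return transactions


transactions = neighborhood_transactions(FEATURES, "town", radius=1.0)
print(transactions)
```

Each resulting set plays the role of one case, so support and confidence can then be counted over towns instead of over purchase transactions. A spatial index would normally replace the inner scan over all features.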

Page 14:

Discovery of Spatial Association Rules (cont.)

Example GeoMiner query:

discover spatial association rules
inside British Columbia
from road R, water W, mines M, boundary B
in relevance to town T
where g_close_to(T.geo, X.geo) and X in {R, W, M, B}
and T.type = “large”
and R.type in {divided_highway}
and W.type in {sea, ocean, large_lake, large_river}
and B.admin_region_1 in “B.C.”
and B.admin_region_2 in “U.S.A.”

Page 15:

Discovery of Spatial Association Rules (cont.)

Note: “close_to” is a condition-dependent predicate and is defined by a set of knowledge rules.

For example, the following rule states that if X is a town and Y is a country, then X is close_to Y if their distance is within 80 km.

Rule:

close_to(X, Y) ← is_a(X, town) ∧ is_a(Y, country) ∧ dist(X, Y, d) ∧ d ≤ 80 km
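The knowledge rule for close_to can be read as a small predicate function. The sketch below assumes planar coordinates already expressed in kilometres and reduces each object to a single reference point, which is a simplification; the dictionary layout and the helper `dist_km` are hypothetical.

```python
from math import hypot

CLOSE_TO_KM = 80.0  # threshold taken from the rule in the text


def dist_km(a, b):
    """Euclidean distance between two (x, y) positions given in kilometres (an assumption)."""
    return hypot(a[0] - b[0], a[1] - b[1])


def close_to(x, y):
    """close_to(X, Y) holds if X is a town, Y is a country, and they are within 80 km."""
    return (
        x["is_a"] == "town"
        and y["is_a"] == "country"
        and dist_km(x["pos"], y["pos"]) <= CLOSE_TO_KM
    )


# Hypothetical objects; the country is reduced to a single reference point for brevity.
town = {"is_a": "town", "pos": (0.0, 0.0)}
near = {"is_a": "country", "pos": (30.0, 40.0)}  # 50 km away
far = {"is_a": "country", "pos": (60.0, 80.0)}   # 100 km away
print(close_to(town, near), close_to(town, far))
```

Because the predicate is condition dependent, a different pair of object types would get its own rule and possibly its own distance threshold.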

Page 16:

Discovery of Spatial Association Rules (cont.)

[Figure: hierarchy for data relations. Level 1: Water; level 2: River, Sea, Lake; level 3: Large river, Small river; level 4: Fraser river.]

[Figure: hierarchy of topological relations. Labels: g_close_to, Not_disjoint, Close_to, Intersects, Adjacent_to, Inside, Contains, Covers, Covered_by, Equal.]

Page 17:

Discovery of Spatial Association Rules (cont.)

Step 1: Task_relevant_DB := extract_task_relevant_objects(SDB, RDB);

The set of relevant data is retrieved by executing the data retrieval methods of the data mining query, which extract the following data sets whose spatial portion is inside B.C.:

(1) Towns: only large towns;
(2) Roads: only divided highways;
(3) Water: only seas, oceans, large lakes and large rivers;
(4) Mines: any mines;
(5) Boundary: only the boundary of B.C. and U.S.A.

... g_close_to(T.geo, X.geo) and X in {R, W, M, B} and T.type = “large” and R.type in {divided_highway} and W.type in {sea, ocean, large_lake, large_river} and B.admin_region_1 in “B.C.” and B.admin_region_2 in “U.S.A.” ...

Page 18:

Discovery of Spatial Association Rules (cont.)

Step 2: Coarse_predicate_DB := coarse_spatial_computation(Task_relevant_DB);

The “generalized_close_to” (g_close_to) relationship between (large) towns and the other four classes of entities is computed at a relatively coarse resolution level. [MBR data structure or R* trees and other approximations] – Later…

Town | Water | Road | Boundary | Mine
Victoria | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US |
Saanich | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US |
Prince | | Highway_97 | |
Penticton | Okanagan_Lake | Highway_97 | US | Allala
… | … | … | … | …

At this level we can already mine:

is_a(X, large_town) → g_close_to(X, water): (80%)

is_a(X, large_town) ∧ g_close_to(X, sea) → g_close_to(X, us_boundary): (92%)
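A coarse g_close_to test can be approximated with minimum bounding rectangles (MBRs). The following is a minimal sketch of that idea only, not the GeoMiner implementation: geometries are plain lists of points, the threshold is arbitrary, and the function names `mbr`, `mbr_distance` and `g_close_to` are hypothetical.

```python
from math import hypot


def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Smallest distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)


def g_close_to(a_points, b_points, threshold):
    """Coarse test: the two MBRs are within `threshold` of each other."""
    return mbr_distance(mbr(a_points), mbr(b_points)) <= threshold


# Hypothetical geometries in arbitrary planar units.
town = [(0, 0), (2, 0), (2, 2), (0, 2)]
lake = [(3, 1), (6, 1), (6, 4), (3, 4)]
mine = [(20, 20), (21, 21)]
print(g_close_to(town, lake, threshold=1.5), g_close_to(town, mine, threshold=1.5))
```

Since each object lies inside its MBR, the rectangle distance never exceeds the true distance. The coarse test can therefore admit pairs that a later exact computation rejects, but it does not discard pairs that are truly within the threshold.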

Page 19:

Discovery of Spatial Association Rules (cont.)

Step 3: Large_Coarse_predicate_DB := filtering_with_min_support(Coarse_predicate_DB);

Refined computation is performed on the large predicate sets, i.e., those retained in the g_close_to table. Each g_close_to predicate is replaced by one or a set of concrete predicate(s) such as intersects, adjacent_to, close_to, inside, etc.

Town | Water | Road | Boundary
Victoria | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <intersects, Highway_17> | <close_to, US>
Saanich | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <close_to, Highway_17> | <close_to, US>
Prince | | <intersects, Highway_97> |
Penticton | <adjacent_to, Okanagan_Lake> | <intersects, Highway_97> | <close_to, US>
… | … | … | …

Page 20:

Discovery of Spatial Association Rules (cont.)

Step 4: Large_Coarse_predicate_DB := filtering_with_min_support(Coarse_predicate_DB);

The level-by-level detailed computation of large predicates and the corresponding association rules is presented as follows. The computation starts at the topmost concept level and computes large predicates at this level.

Level 1: Min support = 20 for level 1

k | Large k-predicate set | count
1 | <adjacent_to, water> | 32
1 | <intersects, highway> | 29
1 | <close_to, highway> | 29
1 | <close_to, us_boundary> | 28
2 | <adjacent_to, water>, <intersects, highway> | 25
2 | <adjacent_to, water>, <close_to, us_boundary> | 23
2 | <intersects, highway>, <close_to, us_boundary> | 26
3 | <adjacent_to, water>, <intersects, highway>, <close_to, us_boundary> | 22
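Rule confidences at level 1 follow directly from the counts of the large predicate sets. The sketch below assumes each row of the level-1 table is read as (count, predicate set, k) and that every count refers to large towns; the string labels and the helper `confidence` are hypothetical conveniences.

```python
# Counts copied from the level-1 table (large towns satisfying each predicate set).
W = "adjacent_to water"
H = "intersects highway"
HC = "close_to highway"
B = "close_to us_boundary"

COUNTS = {
    frozenset({W}): 32, frozenset({H}): 29, frozenset({HC}): 29, frozenset({B}): 28,
    frozenset({W, H}): 25, frozenset({W, B}): 23, frozenset({H, B}): 26,
    frozenset({W, H, B}): 22,
}
MIN_SUPPORT = 20  # minimum support stated for level 1


def confidence(antecedent, consequent):
    """count(antecedent and consequent) / count(antecedent), using the table counts."""
    a = frozenset(antecedent)
    return COUNTS[a | frozenset(consequent)] / COUNTS[a]


# Every listed set already meets the level-1 threshold.
large = {s: n for s, n in COUNTS.items() if n >= MIN_SUPPORT}
print(len(large))
print(f"{confidence({W}, {B}):.0%}")     # 23 / 32
print(f"{confidence({W, H}, {B}):.0%}")  # 22 / 25
```

The implicit condition is_a(X, large_town) is shared by all rows, so it does not appear in the dictionary keys.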

Page 21:

Discovery of Spatial Association Rules (cont.)

Step 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);

After mining rules at the highest level of the concept hierarchy, large k-predicates can be computed in the same way at the lower concept levels, which results in the following tables.

Level 2: Min support = 10 for level 2

k | Large k-predicate set | count
1 | <adjacent_to, sea> | 21
1 | <adjacent_to, large_river> | 11
1 | <close_to, us_boundary> | 28
1 | <close_to, provincial_highway> | 24
1 | <intersects, provincial_highway> | 21
2 | <adjacent_to, sea>, <close_to, us_boundary> | 15
2 | <adjacent_to, sea>, <close_to, provincial_highway> | 11
2 | <close_to, us_boundary>, <close_to, provincial_highway> | 22
2 | <close_to, us_boundary>, <intersects, provincial_highway> | 19
3 | <adjacent_to, sea>, <close_to, provincial_highway>, <close_to, us_boundary> | 10

Page 22:

Discovery of Spatial Association Rules (cont.)

Level 3: Min support = 7 for level 3

k | Large k-predicate set | count
1 | <adjacent_to, georgia_strait> | 9
1 | <adjacent_to, fraser_river> | 10
1 | <close_to, us_boundary> | 28
2 | <adjacent_to, georgia_strait>, <close_to, us_boundary> | 7

The mining process stops at the lowest level of the hierarchies or when an empty large 1-predicate set is derived.

A rule example: is_a(X, large_town) ∧ adjacent_to(X, sea) → close_to(X, us_boundary): (100%)

Page 23:

Classification and Regression

• Classification:
– Constructs a model (classifier) based on the training set and uses it to classify new data
– Example: climate classification, …

• Regression:
– Models continuous-valued functions, i.e., predicts unknown or missing values
– Example: stock trend prediction, …

Page 24:

Classification

• Definition: $f : D \to L$

where $D$ is the domain of $f$, i.e., the domain of attribute values, and $L$ is the set of labels or classes. For example, in a bird habitat problem, $D$ is a three-dimensional space: longevity of the vegetation, depth of water, and distance to water. $L$ has two possible values: nest and no nest.

• The goal is to find a good $f$.

Page 25:

Classification (1): Model Construction

Training Data

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Training Data → Classification Algorithms → Classifier (Model)

Classifier (Model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Page 26:

Classification (2): Prediction Using the Model

[Figure: the classifier is applied to testing data and to unseen data.]

Unseen data: (Jeff, Professor, 4)

Tenured?
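The tenure rule and the question about Jeff can be checked in a few lines. This is a minimal sketch: the training rows are the six tuples from the model construction slide, and returning 'no' when the rule does not fire is an assumption, since the slide states only the 'yes' branch.

```python
# Training tuples from the model construction slide: (name, rank, years, tenured).
TRAINING = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]


def tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN 'yes'; otherwise 'no' (assumed default)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Compare the rule's output with the known label of every training tuple.
correct = sum(tenured(rank, years) == label for _, rank, years, label in TRAINING)
print(f"training accuracy: {correct}/{len(TRAINING)}")

# Prediction for the unseen tuple (Jeff, Professor, 4).
jeff = tenured("Professor", 4)
print("Jeff:", jeff)
```

The rule agrees with all six training labels, and Jeff is predicted as tenured because his rank is Professor even though his years do not exceed 6.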

Page 27:

Classification Techniques

• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic

Page 28:

Regression

• Regression is similar to classification
– First, construct a model
– Second, use the model to predict unknown values

• Methods
– Linear and multiple regression
– Non-linear regression

• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
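The two-step pattern (construct a model, then predict) can be illustrated with simple linear regression fitted by ordinary least squares. The sketch below uses hypothetical observations and the illustrative names `fit_line` and `predict`; it covers only the single-variable linear case.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Step 1: construct the model from hypothetical observations lying on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)


def predict(x):
    """Step 2: use the fitted model to predict a continuous value."""
    return a + b * x


print(a, b, predict(4.0))
```

Unlike the tenure rule, the output here is a number on a continuous scale rather than a class label.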

Page 29:

Predicting Locations Using Similarity Map

• Given:
– A set of locations $S = \{s_1, \dots, s_n\}$ in a geographic space $G$.
– A collection of explanatory functions $x_k : S \to R_k$, $k = 1, \dots, K$, where $R_k$ is the range of possible values of $x_k$.
– A dependent class variable $c : S \to C = \{c_1, \dots, c_M\}$.
– A value for the parameter $\alpha$, the relative importance of spatial accuracy.

• Find: a classification model $\hat{c} : R_1 \times \dots \times R_K \to C$

• Objective: maximize

$$\text{similarity}\big(\text{map}_{s_i \in S}(\hat{c}(x_1, \dots, x_K)),\ \text{map}(c)\big) = (1 - \alpha) \cdot \text{classification accuracy}(\hat{c}, c) + \alpha \cdot \text{spatial accuracy}(\hat{c}, c)$$

Page 30:

Predicting Locations Using Similarity Map

• Constraints:
– The geographic space $S$ is a Euclidean space.
– The values of the explanatory functions $x_1, \dots, x_K$ and of the dependent class variable $c$ can depend on the neighbors’ values (spatial autocorrelation).
– The domain $R_k$ of each explanatory function is the real numbers.
– The domain of the dependent variable is $C = \{0, 1\}$.

• Two characteristics:
– Spatial autocorrelation
– The objective function combines spatial and classification accuracy.

Page 31:

Clustering

• It is the process of finding groups without knowing in advance the number or the labels of the groups.

• Example: the counties in Chile can be clustered based on 4 attributes:
– Unemployment rate
– Population
– Per capita income
– Life expectancy

• Two types of clustering with different objectives are:
– Identify the central cities and their regions of influence by means of the variance of the attribute values within the space.
– Identify areas in the space where an attribute is homogeneous.

Page 32:

Clustering

• Definition 1:

– Given a set $S = \{s_1, \dots, s_n\}$ of spatial objects (e.g., points) and a real-valued non-spatial attribute $f$ evaluated over $S$ ($f : S \to \mathbb{R}$).

– Find two disjoint subsets of $S$, $C$ and $NC = S - C$, where $C = \{c_1, \dots, c_k\}$, $NC = \{nc_1, \dots, nc_l\}$ and $k < n$.

– Goal:

$$\min_{C \subset S} \sum_{j=1}^{l} \left| f(nc_j) - \sum_{i=1}^{k} \frac{f(c_i)}{\mathrm{dist}(nc_j, c_i)} \right|^2$$

– where $\mathrm{dist}(a, b)$ is the Euclidean distance or some other distance measure.

• Constraints:

– The influence of a center decreases with the square of the distance.

– There is at most one non-spatial attribute.
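The objective of Definition 1 can be evaluated directly for a candidate set of centers. The sketch below follows the goal expression with a plain Euclidean distance in the denominator, assumes all points are distinct (so no distance is zero), and searches all k-subsets exhaustively, which is only feasible for tiny inputs. The data and the names `objective` and `best_centers` are hypothetical.

```python
from itertools import combinations
from math import hypot


def objective(points, values, centers):
    """Sum over non-centers of |f(nc) - sum_i f(c_i) / dist(nc, c_i)|^2."""
    total = 0.0
    for j, p in enumerate(points):
        if j in centers:
            continue
        influence = sum(
            values[i] / hypot(p[0] - points[i][0], p[1] - points[i][1])
            for i in centers
        )
        total += (values[j] - influence) ** 2
    return total


def best_centers(points, values, k):
    """Exhaustive search over all k-subsets of point indices."""
    return min(
        combinations(range(len(points)), k),
        key=lambda centers: objective(points, values, set(centers)),
    )


# Hypothetical data: one central point with four neighbours at distance 1.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
values = [2.0, 2.0, 2.0, 2.0, 2.0]
print(best_centers(points, values, 1), objective(points, values, {0}))
```

With the middle point as the single center, every other point is at distance 1 and its value is matched exactly, so the objective is zero; any other choice of center leaves a positive residual.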

Page 33:

Clustering

• Definition 2:

– Given a set $S = \{s_1, \dots, s_n\}$ of spatial objects, a set of real-valued non-spatial attributes $f_i$, $i = 1, \dots, I$, defined over $S$ ($f_i : S \to \mathbb{R}$), and a neighborhood structure $E$ on $S$.

– Find $K$ subsets $C_k \subset S$, with $k = 1, \dots, K$, such that

– Goal:

$$\min_{C_k \subset S} \sum_{C_k} \sum_{s_i \in C_k,\ s_j \in C_k} \mathrm{dist}(F(s_i), F(s_j)) + \sum_{i, j} \mathrm{nbddist}(C_i, C_j)$$

– where $F$ is the cross product of the $f_i$, $i = 1, \dots, I$; $\mathrm{dist}(a, b)$ is the distance measure; and $\mathrm{nbddist}(C, D)$ is the number of pairs of points in $C$ and $D$ that belong to $E$, i.e., pairs of neighbors mapped to different clusters.

– Constraints: $|C_k| > 1$ for all $k = 1, \dots, K$

Page 34:

Clustering: categories

• Hierarchical methods: starting with a single cluster, successive partitions are made until a criterion is satisfied. These algorithms produce a tree of clusters called a dendrogram.

• Partitional methods: each pattern is initially treated as a cluster, and the data are then reallocated among the clusters until a criterion is satisfied. These methods tend to find clusters of spherical shape.

• Density-based methods: clusters are found based on the density of points in a region.

• Grid-based methods: the space is partitioned into cells, and the required operations are then performed on the quantized space. Cells that contain many points are considered dense and are connected to create clusters.
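The grid-based idea can be sketched in a few lines: quantize the points into cells, keep the dense cells, and join dense cells that touch. This is an illustration under stated assumptions, not a specific published algorithm: cells are squares of side `cell_size`, a cell is dense when it holds at least `min_points` points, and two cells are connected when they share an edge. The function name `grid_clusters` and the data are hypothetical.

```python
from collections import defaultdict


def grid_clusters(points, cell_size, min_points):
    """Group points whose dense grid cells are connected (sharing an edge)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {cell for cell, members in cells.items() if len(members) >= min_points}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Flood fill over edge-adjacent dense cells.
        stack, component = [start], []
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            component.extend(cells[(cx, cy)])
            for neighbor in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if neighbor in dense and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical points: two dense groups and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.2), (5.3, 5.4), (9.5, 0.5)]
clusters = grid_clusters(points, cell_size=1.0, min_points=2)
print(clusters)
```

All work after the first loop is done on cells rather than on individual points, which is what makes the approach attractive for large data sets. The isolated point falls in a sparse cell and is left out of every cluster.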

Page 35:

Clustering in SDB

• The idea is to make use of the spatial index. If the SDB is large, not all the points will fit in main memory.

• For example, for an algorithm that requires n initial points to represent n clusters, a natural idea is to incorporate the notion of containment in the index definition to find the closest objects.

• A method that finds centroids of subdivisions of the space is the Voronoi triangulation.