SSCP: Mining Statistically Significant Co-location Patterns Sajib Barua and Jörg Sander Dept. of Computing Science University of Alberta, Canada

Dec 20, 2015

SSCP: Mining Statistically Significant Co-location Patterns

Sajib Barua and Jörg Sander

Dept. of Computing Science

University of Alberta, Canada

Outline

• Introduction
  • Related work
  • Motivation
• Proposed method
• Experimental evaluation
  • Synthetic data
  • Real data

Conclusions

Definition

Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.

Examples:

• {Nile crocodile, Egyptian plover}
• {Shopping mall, parking}

Event Centric Model

Co-location is defined based on a spatial relationship R

A co-location type C is a set of n different spatial features f1, f2, …, and fn.

[Figure: spatial instances A2, B1, B2, C1, C2, C3, D1]

{A2, B1, C1, D1} form a clique under a relation R.

• {A2, B1, C1} is an instance of co-location {A, B, C}
• {A2, B1, D1} is an instance of co-location {A, B, D}
• {A2, C1, D1} is an instance of co-location {A, C, D}
• {B1, C1, D1} is an instance of co-location {B, C, D}
• {A2, B1, C1, D1} is an instance of co-location {A, B, C, D}
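The event centric model above can be sketched in a few lines of Python. This is only an illustration: the coordinates, the distance-based choice of R, and the names `POINTS`, `RADIUS`, `is_clique`, and `colocation_instances` are assumptions made for the example and do not come from the slides.

```python
from itertools import combinations, product
from math import dist

# Hypothetical coordinates, chosen so that A2, B1, C1, D1 are mutually close.
POINTS = {
    "A2": (0.0, 0.0), "B1": (1.0, 0.0), "C1": (0.0, 1.0), "D1": (1.0, 1.0),
    "B2": (5.0, 5.0), "C2": (5.5, 5.0), "C3": (9.0, 0.0),
}
RADIUS = 1.5  # assumed relation R: two instances are neighbors if within this distance


def is_clique(names):
    """True if every pair of the named instances satisfies R."""
    return all(dist(POINTS[a], POINTS[b]) <= RADIUS for a, b in combinations(names, 2))


def colocation_instances(features):
    """All cliques that contain exactly one instance of each feature in `features`."""
    # The feature of an instance is taken from the first letter of its name.
    pools = [[n for n in sorted(POINTS) if n[0] == f] for f in features]
    return [combo for combo in product(*pools) if is_clique(combo)]


print(colocation_instances("ABC"))
print(colocation_instances("ABCD"))
```

The brute-force product over instance pools is only meant to make the definition concrete; it is not an efficient way to materialize instances.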

Prevalence Measure

The participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.

The participation index (PI) of C is the minimum participation ratio among the features of C.

[Figure: spatial instances A1, A2, B1, B2, C1, C2, C3]

PI({A, B}) = min{1/2, 1/2} = 0.5

PI({B, C}) = min{1, 2/3} = 0.67

PI({A, C}) = min{1/2, 1/3} = 0.33

PI({A, B, C}) = min{1/2, 1/2, 1/3} = 0.33

PI({A, B, C}) <= min{PI({A, B}), PI({B, C}), PI({A, C})}

PR and PI are anti-monotonic.

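The PR and PI values of this slide can be reproduced with a short Python sketch. The neighbor pairs in `NEIGHBORS` are an assumption: the figure itself is not recoverable from the transcript, so the pairs were picked to be consistent with the numbers on the slide. The function names are hypothetical as well.

```python
from itertools import combinations, product

INSTANCES = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2", "C3"]}
# Assumed neighbor pairs under R (chosen to match the PI values on the slide).
NEIGHBORS = {
    frozenset(p) for p in [("A2", "B1"), ("A2", "C1"), ("B1", "C1"), ("B2", "C2")]
}


def participation_ratios(colocation):
    """PR of each feature: fraction of its instances found in some instance of C."""
    feats = sorted(colocation)
    cliques = [
        combo
        for combo in product(*(INSTANCES[f] for f in feats))
        if all(frozenset(pair) in NEIGHBORS for pair in combinations(combo, 2))
    ]
    return {
        f: len({combo[i] for combo in cliques}) / len(INSTANCES[f])
        for i, f in enumerate(feats)
    }


def participation_index(colocation):
    """PI: the minimum participation ratio among the features of C."""
    return min(participation_ratios(colocation).values())


for c in ("AB", "BC", "AC", "ABC"):
    print(c, round(participation_index(c), 2))
```

Because every instance of {A, B, C} contains an instance of each 2-size subset, the PI of the larger set can never exceed the PI of any subset, which is the anti-monotonic property stated above.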

Related Work

Spatial statistics

• Ripley's K function, distance-based measures, co-variogram function.

Spatial data mining

• Koperski et al. [4] mine spatial association rules.
• Morimoto [5] also looks for frequently occurring patterns.
• Shekhar et al. [2] introduce three models to materialize transactions.
• Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].

Limitations of the Existing Methods

Spatial statistics

• Defined only for pairs.

Co-location mining

• Only one global threshold for PI is used.
• No guideline for setting the PI-threshold.
• Do not address the effects of spatial auto-correlation and feature abundance.

A simple threshold can report meaningless patterns or miss meaningful ones.

Motivation

Existing co-location mining algorithms will not report {A,B}.

A has fewer instances

B is abundant

A & B have true spatial dependency.

Assume PI-threshold = 0.4

Motivation

Existing co-location mining algorithms will report {A,B}.

A & B are abundant.

Both randomly distributed.

Do not have any true spatial dependency.

Assume PI-threshold = 0.4

Motivation

A & B are auto-correlated.

Do not have any true spatial dependency.

Existing co-location mining algorithms will report {A,B}.

Assume PI-threshold = 0.4

Our Idea

Our approach uses a statistical test. Spatial dependency is measured using PI.

#○ = 12

#∆ = 12

If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?

Generate Artificial Data Sets

[Figure panels: observed data; artificial data sets generated under the null model]

p-value computation

α = 0.05

PIobs = 0.41

p-value = 0.163

If p <= α, PIobs is statistically significant at level α.

Auto-correlated Feature

A & B are auto-correlated.

Do not have any true spatial dependency.

Modeling Auto-correlation

Auto-correlation is modeled as a cluster process.

Poisson Cluster Process [9]

Autocorrelation is measured in terms of intensity and type of distribution of a parent process and offspring process around each parent.
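As an illustration of a Poisson cluster process, the following sketch simulates a Matérn cluster process on the unit square using only the Python standard library. It is a simplified version under stated assumptions: parents are drawn only inside the window (edge effects are ignored), offspring falling outside the window are discarded, and κ is read as the parent intensity, µ as the mean number of offspring per parent, and r as the cluster radius. The function names `poisson` and `matern_cluster` are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def matern_cluster(kappa, mu, r, rng):
    """Simulate parents and offspring of a Matérn cluster process on the unit square."""
    # Parent process: homogeneous Poisson with intensity kappa (window area is 1).
    parents = [(rng.random(), rng.random()) for _ in range(poisson(rng, kappa))]
    offspring = []
    for px, py in parents:
        # Offspring process: Poisson(mu) points, uniform in a disc of radius r.
        for _ in range(poisson(rng, mu)):
            radius = r * math.sqrt(rng.random())
            angle = 2.0 * math.pi * rng.random()
            x, y = px + radius * math.cos(angle), py + radius * math.sin(angle)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                offspring.append((x, y))
    return parents, offspring


parents, points = matern_cluster(kappa=40, mu=5, r=0.05, rng=random.Random(0))
print(len(parents), len(points))
```

Only the offspring form the observed pattern of an auto-correlated feature; the parents are returned here so that the cluster structure can be inspected.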

Estimating Summary Statistics

Estimate the summary statistics.

• Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).
• Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).

Null Model Design

The artificial data sets maintain the following properties of the observed data:

• the same number of instances for each feature, and
• a similar spatial distribution for each individual feature.
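A minimal sketch of how one artificial data set could be drawn under such a null model is given below. It assumes that each feature comes with its own point sampler (here simply uniform on the unit square, as for a randomly distributed feature) and that features are generated independently of each other. The names `uniform_on_unit_square` and `generate_null_dataset`, and the feature labels, are hypothetical; the counts are the #○ = 12 and #∆ = 12 used earlier in the talk.

```python
import random


def uniform_on_unit_square(rng):
    """Sampler for a randomly distributed feature (assumed homogeneous Poisson)."""
    return (rng.random(), rng.random())


def generate_null_dataset(counts, samplers, rng):
    """Draw each feature independently, keeping its observed number of instances."""
    return {
        feature: [samplers[feature](rng) for _ in range(n)]
        for feature, n in counts.items()
    }


counts = {"circle": 12, "triangle": 12}
samplers = {"circle": uniform_on_unit_square, "triangle": uniform_on_unit_square}
artificial = generate_null_dataset(counts, samplers, random.Random(42))
print({f: len(pts) for f, pts in artificial.items()})
```

For an auto-correlated feature, the sampler would instead draw from the fitted cluster process, which this sketch does not attempt.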

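p-value computation

Estimate $p = \Pr(PI_0(C) \ge PI_{obs}(C))$.

Use randomization tests, where a large number of data sets conforming to the null hypothesis is generated.

$p = \frac{R^{\ge PI_{obs}} + 1}{R + 1}$

where R is the number of simulations and $R^{\ge PI_{obs}}$ is the number of simulations in which $PI_0(C) \ge PI_{obs}(C)$.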

How many simulations do we need? Diggle suggested 500 simulations for α = 0.01 [10].
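The randomization estimate, p = (number of simulations whose PI-value reaches the observed one + 1) / (R + 1), is simple to express in code. The sketch below assumes the simulated PI-values have already been computed; the function name and the example numbers are made up for illustration.

```python
def randomization_p_value(pi_obs, simulated_pis):
    """Estimate p = (R_ge + 1) / (R + 1) from R simulated PI-values."""
    r = len(simulated_pis)
    r_ge = sum(1 for pi in simulated_pis if pi >= pi_obs)
    return (r_ge + 1) / (r + 1)


# Made-up PI-values from four simulations; two of them reach the observed 0.41.
p = randomization_p_value(0.41, [0.50, 0.30, 0.41, 0.20])
print(p, p <= 0.05)
```

Adding 1 to both numerator and denominator counts the observed data set as one more realization of the null model, so the estimate can never be exactly zero.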

Improving Runtime: Data Generation

In a simulation, we only generate the feature instances of those clusters that are close enough to instances of other features (either auto-correlated or not auto-correlated).

This saves time of the artificial data generation step of a simulation.

Improving Runtime: PI-value Computation

In a simulation $R_i$, for a co-location C:

$PI_0^{R_i}(C') < PI_{obs}(C) \;\&\; C' \subset C \;\Rightarrow\; PI_0^{R_i}(C) < PI_{obs}(C)$

Procedure:

• In each simulation, compute the $PI_0^{R_i}$-values of all possible 2-size subsets.

• For a co-location C of size k (> 2), look up the PI-values of its 2-size subsets. If a subset C' is found for which $PI_0^{R_i}(C') < PI_{obs}(C)$, $PI_0^{R_i}(C)$ is not required to be computed.

• Otherwise $PI_0^{R_i}(C)$ is computed for simulation $R_i$.

$p = \frac{R^{\ge PI_{obs}} + 1}{R + 1}$, where $R^{\ge PI_{obs}}$ counts the simulations with $PI_0^{R_i}(C) \ge PI_{obs}(C)$.

If $PI_0^{R_i}(C) < PI_{obs}(C)$ is already known, there is no need to compute $PI_0^{R_i}(C)$.
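The pruning step can be sketched as follows. The function decides, for one simulation, whether the simulated PI-value of C reaches the observed one, and it only calls the expensive PI computation when no 2-size subset already rules this out. The names `reaches_observed`, `pair_pi`, and `compute_pi` are hypothetical, and the PI computation itself is passed in as a callback rather than implemented here.

```python
from itertools import combinations


def reaches_observed(colocation, pi_obs, pair_pi, compute_pi):
    """True if the simulated PI-value of `colocation` is >= pi_obs.

    pair_pi maps each 2-size subset (a frozenset) to its simulated PI-value.
    compute_pi is only called when no 2-size subset falls below pi_obs.
    """
    for pair in combinations(sorted(colocation), 2):
        if pair_pi[frozenset(pair)] < pi_obs:
            # Anti-monotonicity: PI(C) <= PI(pair) < pi_obs, so C cannot reach it.
            return False
    return compute_pi(colocation) >= pi_obs


calls = []


def expensive_pi(colocation):
    """Stand-in for the full PI computation; records each call."""
    calls.append(colocation)
    return 0.35


pair_pi = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.5}
print(reaches_observed("ABC", 0.3, pair_pi, expensive_pi), calls)
```

In the example, the pair {A, B} has a simulated PI-value of 0.2, below the observed 0.3, so the callback is never invoked for {A, B, C}.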

An Example

Four features A, B, C, D.

• {A,B,C}: If $PI_0^{R_i}\{A,B\} < PI_{obs}\{A,B,C\}$, then $PI_0^{R_i}\{A,B,C\} < PI_{obs}\{A,B,C\}$. No need to compute $PI_0^{R_i}\{A,B,C\}$.
• $PI_0^{R_i}\{A,B,C\} < PI_{obs}\{A,B,C\}$ does not imply $PI_0^{R_i}\{A,B,C,D\} < PI_{obs}\{A,B,C,D\}$.
• {A,B,C,D}: by checking 2-size subsets.

The worst-case complexity is $O(2^n)$.

• The size of the largest co-location is much smaller.
• The largest co-location size is predictable.
• If $PI_{obs}(C) = 0$, we do not compute the $PI_0^{R_i}$-value of C.
• Our pruning strategies.

All these keep the actual cost in practice below the worst-case cost.

Experimental Results (1)

Negative association: features ○ and ∆ with 40 instances each. This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between these two features.

Result: PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} will not be reported.

Experimental Results (2)

Autocorrelation: #○ = 100 and #∆ = 120.

• ∆: independently and uniformly distributed over the space.
• ○: spatially auto-correlated.
• In our generated data, ∆ is found in most clusters of ○.
• The summary statistics of ○ are estimated by fitting the model of a Matérn cluster process [9] (κ = 40, µ = 5, r = 0.05).

Results:

• PIobs{○, ∆} = 0.49; an existing algorithm will report the pattern if a threshold <= 0.49 is chosen.
• p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.

Experimental Results (3)

Multiple features: #○ = 40, #∆ = 40, #+ = 118, #x = 40, and # = 30.

• Study area = unit square; co-location neighborhood radius = 0.1.
• Features ○ and ∆ are negatively associated.
• Feature + is spatially auto-correlated.
• Features +, ○, and x are positively associated.
• Feature  is randomly distributed.

Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, }, {+, x, }, and {○, +, x, }.

Runtime Comparison (1)

• Features ○, ∆, +: auto-correlated and strongly associated. Each has 400 instances.
• Feature x: randomly distributed, with 20 instances.
• Our algorithm finds all co-locations of features ○, ∆, and x.
• The number of instances of each auto-correlated feature is increased: the number of clusters is kept the same, and the number of instances per cluster is increased by a factor k.

[Figure panels: Runtime comparison; Speedup]

Runtime Comparison (2)

• The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.
• The total number of instances of x is increased by the same factor k.

[Figure panels: Runtime comparison; Speedup]

Ants Data

• ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).
• PIobs{Cataglyphis, Messor} = min{24/29, 30/68} = 0.44.
• p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.
• R. D. Harkness also did not find any clear association between these two species.
• An existing algorithm will report {○, ∆} if PI-threshold <= 0.44.

Toronto Address Repository Data

Found Co-locations

Conclusions

• A new definition for co-location patterns.
• Does not depend on a global threshold.
• Statistically meaningful.
• Runtime cost of randomization tests is reduced.
• Investigate other prevalence measures to check if they allow additional pruning techniques.
• Removing redundant patterns.

References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)

2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results, In Proc. SSTD, pp. 236-256 (2001)

3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)

4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In SSD, pp. 47-66 (1995)

5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In SIGKDD, pp. 353-358 (2001)

6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In Proc. GIS, pp. 241-249 (2004)

7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)

8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In Proc. GIS, pp. 250-259 (2008).

9. Illian, J. et al.: Statistical Analysis and Modeling of Spatial Point Patterns.

10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)

Questions?