Exploring Spatial Datasets Using Discriminative Pattern Mining and Pattern Similarity Measure

Tomasz F. Stepinski, Lunar and Planetary Institute ([email protected])
Wei Ding, Dept. of Computer Science, Univ. of Massachusetts Boston ([email protected])
Josue Salazar, Lunar and Planetary Institute ([email protected])

Motivation

Complex multi-attribute spatial datasets hide knowledge that needs to be discovered by exploring their structure. We propose an association analysis-based strategy for exploring spatial datasets that possess a prior binary classification.

Innovation

[Figure: processing pipeline. A multi-attribute spatial dataset with a prior binary classification (class 1, class 2), in which each spatial element is a transaction containing values of exploratory attributes, is mined for discriminative patterns. Agglomerative clustering of the patterns (cluster 1, cluster 2, cluster 3) then yields a segmentation of class 1 into clusters of similar patterns of exploratory attributes.]

Algorithm

Pattern similarity

[Figure: a grid of spatial objects labeled by class (1 or 2), showing the footprint of pattern X (4 objects) and the footprint of pattern Y (2 objects), with one panel per attribute (A, B, C, D). Each pattern assigns values to some of the attributes A, B, C, D; an attribute without a value is marked "_".]

The similarity between patterns X and Y is a weighted sum of per-attribute similarities:

$$S(X, Y) = \sum_{i=1}^{4} w_i \, S_i(X_i, Y_i)$$

The form of each term depends on which of the two patterns assigns a value to the attribute. An attribute that a pattern leaves without a value is represented by the distribution ($P_X$ or $P_Y$) of that attribute's values over the pattern's footprint.

Attribute A (value in both X and Y):

$$S_A(X_A, Y_A) = s(x_A, y_A)$$

Attribute B (value in X only):

$$S_B(X_B, -) = \sum_{k=1}^{2} P_Y(y_k) \, s(x_B, y_k)$$

Attribute C (value in Y only):

$$S_C(-, Y_C) = \sum_{k=1}^{2} P_X(x_k) \, s(x_k, y_C)$$

Attribute D (value in neither X nor Y):

$$S_D(-, -) = \sum_{l=1}^{2} \sum_{k=1}^{2} P_X(x_l) \, P_Y(y_k) \, s(x_l, y_k)$$

The similarity between two ordinal values of the same attribute is

$$s(x_i, y_i) = \frac{2 \times \log P(x_i \vee z_1 \vee z_2 \vee \dots \vee z_k \vee y_i)}{\log P(x_i) + \log P(y_i)}$$

where $z_1, z_2, \dots, z_k$ are ordinal values such that $z_1 = x_i + 1$ and $z_k = y_i - 1$, that is, the values lying between $x_i$ and $y_i$.

Example: Analysis of 2008 presidential election

Input data :> 2008 election results + 13 socio-economic indicators from the US Census Bureau for 3108 counties.

Socio-economic indicators: pop. dens., urban pop. %, female pop. %, foreign born %, per capita income, household income, HS edu., bachelor edu., white pop. %, poverty %, owned house %, soc. sec. recipient %, soc. sec. income. Each indicator takes one of five ordinal values: lowest (1), low (2), average (3), high (4), highest (5). An indicator absent from a pattern has no value ( _ ).

Example 1 :> McCain voting block (red) and Obama voting block (blue) that are dissimilar in socio-economic sense and geographically apart.

Example 2 :> McCain voting block (red) and Obama voting block (green) that are dissimilar in socio-economic sense but geographically collocated.

Visual analytics :> Discriminative patterns are calculated for four groups (A, B, E, and F) of counties. In each group, patterns are ordered using agglomerative clustering. The clustering heat map is a distance matrix with rows ordered according to the clustering.

[Figure: clustering heat maps for pattern sets A and B (Obama) and pattern sets E and F (McCain), with legends for pattern size, pattern length, pattern overlap, and pattern dissimilarity.]

[Figure: values of the 13 socio-economic indicators, with rows grouped as follows: Obama block 1 (1-872); Obama block 2 (928-3364); voted for Obama but not in the support of discriminative patterns (3365-3610); McCain block (3611-6680); voted for McCain but not in the support of discriminative patterns (6681-6970).]

Sets A to D are counties won by Obama; sets E to H are counties won by McCain.

| set | description | # of counties | population | # voted | winning % |
|---|---|---|---|---|---|
| A | won by Obama, IN footprint of Obama and NOT in footprint of McCain | 495 | 153,611,411 | 67,040,847 | 62.14 |
| B | won by Obama, NOT in footprint of Obama and NOT in footprint of McCain | 361 | 16,696,346 | 9,568,427 | 56.24 |
| C | won by Obama, NOT in footprint of Obama but IN footprint of McCain | 9 | 199,478 | 88,945 | 51.07 |
| D | won by Obama, IN footprint of Obama and IN footprint of McCain | 1 | 210,554 | 61,494 | 52.90 |
| E | won by McCain, IN footprint of McCain and NOT in footprint of Obama | 1688 | 51,289,510 | 23,224,203 | 62.11 |
| F | won by McCain, NOT in footprint of McCain and NOT in footprint of Obama | 472 | 31,269,880 | 15,772,301 | 59.01 |
| G | won by McCain, NOT in footprint of McCain but IN footprint of Obama | 62 | 23,518,016 | 8,941,422 | 55.91 |
| H | won by McCain, IN footprint of McCain and IN footprint of Obama | 20 | 2,255,368 | 1,024,861 | 60.83 |
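The pattern similarity S(X, Y) defined on the poster can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that P(x) is the relative frequency of an ordinal value in the whole dataset, that the probability inside the logarithm is the total probability of the values from x through y, and that an attribute without a value in a pattern is represented by the empirical distribution of that attribute over the pattern's footprint. All function names, the uniform priors, the equal weights, and the footprint values in the demo are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Mapping, Optional, Sequence

Dist = Mapping[int, float]  # ordinal value -> probability


def empirical_distribution(values: Sequence[int]) -> Dict[int, float]:
    """Relative frequency of each ordinal value in a sequence."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def value_similarity(x: int, y: int, prior: Dist) -> float:
    """s(x, y) = 2 log P(x or ... or y) / (log P(x) + log P(y)).

    Assumes prior[x] > 0 and prior[y] > 0.
    """
    if x == y:
        return 1.0  # the ratio equals 1; also avoids 0/0 when P(x) = 1
    lo, hi = min(x, y), max(x, y)
    # Probability of the whole ordinal range between x and y, inclusive.
    p_range = sum(prior.get(v, 0.0) for v in range(lo, hi + 1))
    return 2.0 * math.log(p_range) / (math.log(prior[x]) + math.log(prior[y]))


def attribute_similarity(x: Optional[int], y: Optional[int],
                         dist_x: Dist, dist_y: Dist, prior: Dist) -> float:
    """Per-attribute term S_i; None means the pattern has no value.

    A specified value is treated as a point distribution, so the four
    cases (both, X only, Y only, neither) reduce to one double sum.
    """
    xs = {x: 1.0} if x is not None else dist_x
    ys = {y: 1.0} if y is not None else dist_y
    return sum(px * py * value_similarity(a, b, prior)
               for a, px in xs.items() for b, py in ys.items())


def pattern_similarity(X: Mapping[str, int], Y: Mapping[str, int],
                       dists_X: Mapping[str, Dist], dists_Y: Mapping[str, Dist],
                       priors: Mapping[str, Dist],
                       weights: Mapping[str, float]) -> float:
    """Weighted sum of per-attribute similarities, S(X, Y)."""
    total = 0.0
    for attr, w in weights.items():
        total += w * attribute_similarity(
            X.get(attr), Y.get(attr),
            dists_X.get(attr, {}), dists_Y.get(attr, {}), priors[attr])
    return total


if __name__ == "__main__":
    attrs = ["A", "B", "C", "D"]
    uniform = {v: 0.2 for v in range(1, 6)}   # hypothetical priors
    priors = {a: uniform for a in attrs}
    weights = {a: 0.25 for a in attrs}        # hypothetical equal weights
    X = {"A": 1, "B": 2}                      # X has no value for C and D
    Y = {"A": 1, "C": 2}                      # Y has no value for B and D
    # Hypothetical attribute values observed in each footprint.
    dists_X = {"C": empirical_distribution([1, 1, 2, 2]),
               "D": empirical_distribution([1, 1, 1, 2])}
    dists_Y = {"B": empirical_distribution([2, 2]),
               "D": empirical_distribution([1, 2])}
    print(pattern_similarity(X, Y, dists_X, dists_Y, priors, weights))
```

Treating a specified value as a distribution with all its mass on one value keeps the code to a single formula. With weights that sum to 1, two identical, fully specified patterns have similarity 1.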
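The clustering heat map described under "Visual analytics" is a distance matrix whose rows follow the order produced by agglomerative clustering. A minimal sketch of that reordering step follows. The poster does not state the linkage criterion, so average linkage is an assumption here, as is taking the distance between two patterns to be a dissimilarity such as 1 - S(X, Y). The function names and the small example matrix are hypothetical.

```python
from typing import List


def agglomerative_order(dist: List[List[float]]) -> List[int]:
    """Leaf order from naive average-linkage agglomerative clustering.

    dist is a symmetric matrix of pairwise pattern distances.
    """
    if not dist:
        return []
    # Each cluster is stored as the ordered list of its member indices.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]  # keep merged members adjacent
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (p, q)] + [merged]
    return clusters[0]


def reorder(dist: List[List[float]], order: List[int]) -> List[List[float]]:
    """Permute rows and columns of the distance matrix for the heat map."""
    return [[dist[i][j] for j in order] for i in order]


if __name__ == "__main__":
    # Hypothetical distances: patterns 0 and 2 are close, as are 1 and 3.
    dist = [
        [0.0, 0.9, 0.1, 0.8],
        [0.9, 0.0, 0.85, 0.2],
        [0.1, 0.85, 0.0, 0.9],
        [0.8, 0.2, 0.9, 0.0],
    ]
    order = agglomerative_order(dist)
    print(order)
    for row in reorder(dist, order):
        print(row)
```

After reordering, similar patterns occupy neighboring rows, so clusters show up as low-distance blocks along the diagonal of the heat map. The naive search over cluster pairs is only suitable for small pattern sets; a library routine would be the practical choice for thousands of patterns.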