Delineating Metropolitan Housing Submarkets with Fuzzy Clustering Methods Julie Sungsoon Hwang Department of Geography, University of Washington Jean-Claude.

Delineating Metropolitan Housing Submarkets with Fuzzy Clustering Methods

Julie Sungsoon Hwang, Department of Geography, University of Washington

Jean-Claude Thill, Department of Geography, State University of New York at Buffalo

November 10, 2005, North American Meetings of Regional Science Association International

Outline

• Research objectives

• Methodology: specification

• Methodology: illustration

• Evaluating the performance of fuzzy clustering

• Conclusions

Research objectives

• Demonstrate the use of the fuzzy c-means (FCM) algorithm for delineating housing submarkets
– Comparison to K-means

• Discuss empirical characteristics of FCM applied to given applications, in particular the choice of parameters
– Cluster validity index

Challenges

• Are the boundaries of clusters crisp?

[Figure: two scatter plots on axes X1 and X2, each grouping tracts into Cluster A, Cluster B, and Cluster C; one panel shows the housing market in metropolitan area p, the other the housing market in metropolitan area q]

Methodology: specification

• Our task is to group census tracts into homogeneous housing submarkets within a metropolitan area

• Using the fuzzy c-means algorithm

• In order to examine whether fuzzy set-based clustering can do a better job

• Implemented in 85 metropolitan areas

• Most of the data sets are public (e.g. 2000 Census)

• The whole procedure is automated in GIS

Methodology: flow chart

[Flow chart, applied for each metropolitan area (scale labels on the chart: National, Regional, Local, Metro): Census Tract Layer → candidate variables → stepwise regression (k ≤ m) → significant variables → cluster analysis (c ≤ n) → Hard Cluster Layer (K-means) and Fuzzy Cluster Layer (fuzzy c-means), with one membership layer per cluster 1, 2, …, c]

Candidate variables (n census tracts, m variables):

#   x1   x2   x3   …   xm

Significant variables (n census tracts, k variables):

#   y1   y2   …   yk

K-means membership matrix:

#   U1     U2     …   Uc
1   1      0      …   0
2   0      1      …   0
…   0      1      …   0
n   0      0      …   1

Fuzzy c-means membership matrix:

#   U1     U2     …   Uc
1   0.85   0.05   …   0.10
2   0.12   0.80   …   0.05
…   0.02   0.74   …   0.12
n   0.40   0.03   …   0.50

k: # selected variables

c: # submarkets

Uj: membership to cluster j

Explanatory variables for house price

Var_Name   Variable Definition               Data      Year   Spatial Unit

Socioeconomic/demographic Characteristics of Residents

pcincome   per capita income                 Census    2000   Census Tract
college    % college degree                  Census    2000   Census Tract
managep    % management workers              Census    2000   Census Tract
prodp      % production workers              Census    2000   Census Tract
famcpchl   % family with children            Census    2000   Census Tract
nfmalone   % nonfamily living alone          Census    2000   Census Tract
black_p    % black                           Census    2000   Census Tract
nhwht_p    % non-hispanic white              Census    2000   Census Tract
nativebr   % native born                     Census    2000   Census Tract

Structural Characteristics of Housing Units

medroom    median number of rooms            Census    2000   Census Tract
hudetp     % detached housing unit           Census    2000   Census Tract
yrhublt    median year structure built       Census    2000   Census Tract

Locational Characteristics (Amenities) of Neighborhoods

ptratio    pupil to teacher ratio            NCES*     2002   School District
schexp     school expenditure per student    NCES      2002   School District
vrlcrime   violent crime rate                FBI**     2003   Designated Place
prpcrime   property crime rate               FBI       2003   Designated Place
jobacm     job accessibility (Hansen 1959)   CTPP***   2000   Census Tract

*NCES: National Center for Education Statistics; **FBI annual report “Crime in the U.S. 2003”; ***CTPP: Census Transportation Planning Package

Dependent variable: median home value of owner-occupied housing units

[Maps: U.S. metropolitan areas (CMSA and MSA) with state boundaries, shown without and with the study set highlighted. Source: TIGER/Line 1999]

Study set: 85 metropolitan areas

What is fuzzy c-means (FCM)?

• Clustering method that minimizes the following objective function:

$\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \| x_k - v_i \|_A^2$

$x_k$: Vector of data point k, 1 ≤ k ≤ n

$v_i$: Center of cluster i, 1 ≤ i ≤ c

$u_{ik}$: Membership degree of data point k with cluster i; [0,1]

$m$: Fuzziness amount associated with assigning data point k to cluster i, 1 ≤ m ≤ ∞

• Updates cluster means vi and membership degree uik until the algorithm converges

(III-3a) $v_i = \dfrac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}$

(III-3b) $u_{ik} = \left[ \sum_{j=1}^{c} \left( \dfrac{\| x_k - v_i \|_A}{\| x_k - v_j \|_A} \right)^{2/(m-1)} \right]^{-1}$

Source: Bezdek 1981

[Figure: scatter plot of data points on axes x1 and x2]
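The two update rules (III-3a) and (III-3b) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' GIS implementation: it assumes the Euclidean norm (A = identity), a random initial partition, m > 1, and synthetic two-dimensional points. The function names `update_centers`, `update_memberships`, and `fcm` are hypothetical.

```python
import math
import random

def update_centers(X, U, m):
    """(III-3a): v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m."""
    dim = len(X[0])
    V = []
    for row in U:                                  # one row of U per cluster i
        w = [u ** m for u in row]
        total = sum(w)
        V.append([sum(wk * x[d] for wk, x in zip(w, X)) / total
                  for d in range(dim)])
    return V

def update_memberships(X, V, m):
    """(III-3b), using the Euclidean norm (A = identity is an assumption)."""
    power = 2.0 / (m - 1.0)                        # requires m > 1
    U = [[0.0] * len(X) for _ in V]
    for k, x in enumerate(X):
        # guard against a zero distance when a point sits on a center
        d = [max(math.dist(x, v), 1e-12) for v in V]
        for i, d_i in enumerate(d):
            U[i][k] = 1.0 / sum((d_i / d_j) ** power for d_j in d)
    return U

def fcm(X, c, m=1.3, tol=1e-6, max_iter=300, seed=0):
    """Alternate (III-3a) and (III-3b) until U changes by at most tol."""
    rng = random.Random(seed)
    U = [[rng.random() for _ in X] for _ in range(c)]
    for k in range(len(X)):                        # make each column sum to 1
        col_sum = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col_sum
    for _ in range(max_iter):
        V = update_centers(X, U, m)
        U_new = update_memberships(X, V, m)
        change = max(abs(a - b) for new, old in zip(U_new, U)
                     for a, b in zip(new, old))
        U = U_new
        if change <= tol:
            break
    return U, update_centers(X, U, m)

# Synthetic data (not from the study): two well-separated groups of 20 points.
rng = random.Random(1)
X = ([[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)] +
     [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(20)])
U, V = fcm(X, c=2, m=1.3)
# Defuzzify: assign each point to the cluster with the largest membership.
labels = [max(range(2), key=lambda i: U[i][k]) for k in range(len(X))]
```

Each column of `U` sums to 1, so a point's memberships can be read as shares across clusters. Taking the largest share per point gives a defuzzified (hard) assignment.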

FCM: missing elements

• Optimal number of clusters c*

• Optimal fuzziness amount m*

[Diagram: m and c feeding into FCM]

Extended fuzzy c-means algorithm

• Step 1: Initialize the parameters related to fuzzy partitioning: c = 2 (2 ≤ c ≤ cmax), m = 1 (1 ≤ m ≤ mmax), where c is an integer and m is a real number. Fix minc, the incremental value of m (0 < minc ≤ 0.1). Fix the cut-off threshold L. Choose a validity index v.

• Step 2: Given c and m, initialize U(0) so that it is a fuzzy matrix. Then, at step l, l = 0, 1, 2, …:

• Step 3: Calculate the c fuzzy cluster centers {vi(l)} with (III-3a) and U(l).

• Step 4: Update U(l+1) using (III-3b) and {vi(l)}.

• Step 5: Compare U(l) to U(l+1) in a convenient matrix norm; if || U(l+1) – U(l) || ≤ L, go to Step 6; otherwise return to Step 3.

• Step 6: Compute the validity index for the given c and m.

• Step 7: If c < cmax, then increase c ← c + 1 and go to Step 3; otherwise go to Step 8.

• Step 8: If m < mmax, then increase m ← m + minc and go to Step 3; otherwise go to Step 9.

• Step 9: Obtain the optimal validity index, the optimal number of clusters c*, and the optimal fuzziness exponent m*. The optimal fuzzy partition U is obtained given c* and m*.

Cluster validity indices

Partition coefficient:

$PC(U) = \dfrac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^2$

Partition entropy:

$PE(U) = -\dfrac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik} \log_2 (u_{ik}) \right]$

Xie-Beni index:

$U_{XB} = \dfrac{\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^2 \, \| x_k - v_i \|_A^2}{n \, \min_{i \neq j} \| v_i - v_j \|^2}$

SVI index:

[Equation: SVI index; not recoverable from the extracted slide text]

where w is set to 2 in this study
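As a rough illustration of how the first three validity indices behave, the sketch below evaluates the partition coefficient, partition entropy, and Xie-Beni index on a tiny hand-made partition. The data, the function names, the Euclidean norm, and the base-2 logarithm in the entropy are all assumptions made for the example; the SVI index is left out.

```python
import math

def partition_coefficient(U):
    """PC(U) = (1/n) * sum of squared memberships; 1 for a crisp partition."""
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n

def partition_entropy(U):
    """PE(U) = -(1/n) * sum of u * log2(u); 0 for a crisp partition."""
    n = len(U[0])
    return -sum(u * math.log2(u) for row in U for u in row if u > 0) / n

def xie_beni(X, U, V):
    """Compactness over separation; smaller values indicate a better partition."""
    n, c = len(X), len(V)
    compactness = sum(U[i][k] ** 2 * math.dist(X[k], V[i]) ** 2
                      for i in range(c) for k in range(n))
    separation = min(math.dist(V[i], V[j]) ** 2
                     for i in range(c) for j in range(i + 1, c))
    return compactness / (n * separation)

# Toy example (not from the study): four points, two clusters.
X = [[0, 0], [0, 1], [4, 0], [4, 1]]
V = [[0, 0.5], [4, 0.5]]
U_fuzzy = [[0.9, 0.9, 0.1, 0.1],
           [0.1, 0.1, 0.9, 0.9]]
U_crisp = [[1, 1, 0, 0],
           [0, 0, 1, 1]]

pc_fuzzy = partition_coefficient(U_fuzzy)
pe_fuzzy = partition_entropy(U_fuzzy)
xb_fuzzy = xie_beni(X, U_fuzzy, V)
pc_crisp = partition_coefficient(U_crisp)
pe_crisp = partition_entropy(U_crisp)
```

On the crisp partition the coefficient reaches its maximum of 1 and the entropy its minimum of 0. Both move away from those bounds as memberships become fuzzier, which is why they tend to favor small m when used alone.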

Determining c* and m*

• Selected validity indices are calibrated over the study set

Xie-Beni index is recommended as a validity index

Average m* is 1.38

[Figure: index value versus number of clusters c (2 to 15) for UXB, PC, PE, and SVI/100]

[Figure: index value versus fuzziness amount m (1.0 to 1.9) for UXB and SVI/100]

[Figure: histogram of m* for FCM]

Methodology: illustration

Median home value of Buffalo, NY

Dimensionality of Buffalo housing market

Hedonic regression equation of median home value in Buffalo, NY

Predictor                        Coefficient   Standard Error   t-statistic   p-value
Constant                         -1455768      164417           -8.85         0.000
Per capita income                2.3667        0.2791           8.48          0.000
% college degree                 88221         11346            7.78          0.000
% family: couple with children   65735         18775            3.50          0.001
% detached housing unit          -31260        5527             -5.66         0.000
Housing age (year)               692.88        80.26            8.63          0.000
% non-hispanic white             11186         3914             2.86          0.005
% native born status             130039        31111            4.18          0.000
Job accessibility                -0.05266      0.02227          -2.36         0.019

Adjusted R sq = 84.3%
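Each t-statistic in the regression table is the coefficient divided by its standard error. The short check below recomputes that ratio from the reported values; the dictionary layout and variable names are illustrative only.

```python
# (coefficient, standard error, reported t-statistic) as listed in the table
rows = {
    "Constant": (-1455768, 164417, -8.85),
    "Per capita income": (2.3667, 0.2791, 8.48),
    "% college degree": (88221, 11346, 7.78),
    "% family: couple with children": (65735, 18775, 3.50),
    "% detached housing unit": (-31260, 5527, -5.66),
    "Housing age (year)": (692.88, 80.26, 8.63),
    "% non-hispanic white": (11186, 3914, 2.86),
    "% native born status": (130039, 31111, 4.18),
    "Job accessibility": (-0.05266, 0.02227, -2.36),
}

# t = coefficient / standard error
ratios = {name: coef / se for name, (coef, se, _) in rows.items()}

# Largest gap between the recomputed ratio and the reported, rounded value
max_gap = max(abs(ratios[name] - t) for name, (_, _, t) in rows.items())

for name, (_, _, t) in rows.items():
    print(f"{name:32s} recomputed {ratios[name]:8.3f}   reported {t:6.2f}")
```

The recomputed ratios agree with the reported t-statistics to within rounding.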

Optimal number of housing submarkets c*, Optimal fuzziness amount m*, Buffalo, NY

c \ m   1.0      1.1      1.2      1.3      1.4       1.5       1.6       1.7       1.8       1.9
2       0.4735   0.4570   0.4380   8.0983   10.4115   12.5478   14.4334   16.0634   17.4645   18.6721
3       0.4136   0.3889   0.3460   0.3385   10.7864   12.9137   14.7939   16.4217   17.8290   19.0553
4       0.7802   0.7116   0.6080   0.5241   1.3154    6.8837    7.4807    8.0441    8.5632    9.0391
5       0.5560   0.5622   0.5940   0.6121   0.4683    0.3404    0.6489    0.6850    0.7206    0.7555
6       0.6223   0.7578   1.0187   0.8173   0.6907    1.3393    1.4074    1.4819    1.5595    1.6382
7       0.8836   0.6903   0.6881   0.6016   0.6148    0.9515    2.4397    2.6306    2.8317    3.0383
8       0.5981   0.5888   0.5703   0.5232   0.3992    0.7381    0.8910    1.2388    1.2926    1.3538
9       0.9645   0.6160   0.4836   0.4866   0.8449    1.4020    1.4198    1.8317    1.8639    1.9161
10      0.7053   0.6004   0.6619   0.5873   0.5868    1.3465    1.5081    1.6875    1.8215    1.8591
c*      3        3        3        3        8         5         5         5         5         5

Values in the cells represent the Xie-Beni index given c and m
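The selection of c* and m* from this table amounts to finding the smallest Xie-Beni value, first within each column and then over the whole grid. The sketch below repeats that lookup on the tabulated values; the variable names are illustrative.

```python
m_values = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]

# Xie-Beni index by number of clusters c (rows) and fuzziness m (columns)
xie_beni_table = {
    2:  [0.4735, 0.4570, 0.4380, 8.0983, 10.4115, 12.5478, 14.4334, 16.0634, 17.4645, 18.6721],
    3:  [0.4136, 0.3889, 0.3460, 0.3385, 10.7864, 12.9137, 14.7939, 16.4217, 17.8290, 19.0553],
    4:  [0.7802, 0.7116, 0.6080, 0.5241, 1.3154, 6.8837, 7.4807, 8.0441, 8.5632, 9.0391],
    5:  [0.5560, 0.5622, 0.5940, 0.6121, 0.4683, 0.3404, 0.6489, 0.6850, 0.7206, 0.7555],
    6:  [0.6223, 0.7578, 1.0187, 0.8173, 0.6907, 1.3393, 1.4074, 1.4819, 1.5595, 1.6382],
    7:  [0.8836, 0.6903, 0.6881, 0.6016, 0.6148, 0.9515, 2.4397, 2.6306, 2.8317, 3.0383],
    8:  [0.5981, 0.5888, 0.5703, 0.5232, 0.3992, 0.7381, 0.8910, 1.2388, 1.2926, 1.3538],
    9:  [0.9645, 0.6160, 0.4836, 0.4866, 0.8449, 1.4020, 1.4198, 1.8317, 1.8639, 1.9161],
    10: [0.7053, 0.6004, 0.6619, 0.5873, 0.5868, 1.3465, 1.5081, 1.6875, 1.8215, 1.8591],
}

# Best c for each m: the row with the smallest index value in that column
best_c_per_m = [min(xie_beni_table, key=lambda c: xie_beni_table[c][j])
                for j in range(len(m_values))]

# Overall optimum: the smallest index value over the whole (c, m) grid
best_value, c_star, m_star = min(
    (xie_beni_table[c][j], c, m_values[j])
    for c in xie_beni_table for j in range(len(m_values))
)

print(best_c_per_m)
print(c_star, m_star, best_value)
```

The per-column minima reproduce the c* row of the table, and the overall minimum falls at c = 3 and m = 1.3.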

[Figure: cluster means (vertical axis, -1.5 to 1.5) of Cluster 1, Cluster 2, and Cluster 3 across the attribute vector ZPCINCOME, ZCOLLEGE, ZFAMCPCHL, ZHUDETP, ZYRHUBLT, ZNHWHT_P, ZNATIVEBR, ZJOBACM]

c* = 3; m* = 1.3

Buffalo housing submarkets

[Maps: (A) membership degree to Cluster 1, (B) membership degree to Cluster 2, (C) membership degree to Cluster 3, each shaded in ten classes from 0 to 1; (D) defuzzified Clusters 1, 2, and 3. Interstate highways are overlaid and tracts with no data are marked.]

Evaluating the performance of fuzzy clustering

Compare FCM with K-means (KM)

• Compare the sum of squared error derived from KM (m=1) and FCM (m=m*) given c*

$\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^2 \, \| x_k - v_i \|_A^2$

Paired Samples Statistics

Pair 1   Mean       N    Std. Deviation   Std. Error Mean
j2_hcm   1026.546   85   3848.268377      417.4033
j2_fcm   745.7332   85   3022.266891      327.8109

Paired Samples Test (Pair 1: j2_hcm - j2_fcm)

Paired Differences
Mean                                        280.8133
Std. Deviation                              915.57126275
Std. Error Mean                             99.30765
95% Confidence Interval of the Difference   Lower 83.32912, Upper 478.2974
t                                           2.828
df                                          84
Sig. (2-tailed)                             .006

Fuzzy clustering outperforms crisp clustering
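The paired-samples figures can be checked with simple arithmetic: the standard error of the mean difference is the standard deviation of the differences divided by the square root of N, and t is the mean difference divided by that standard error. The sketch below uses only the numbers reported in the output; the variable names are illustrative.

```python
import math

n = 85                       # metropolitan areas in the study set
mean_hcm = 1026.546          # mean sum of squared error, K-means (j2_hcm)
mean_fcm = 745.7332          # mean sum of squared error, FCM (j2_fcm)
mean_diff = 280.8133         # reported mean of the paired differences
sd_diff = 915.57126275       # reported std. deviation of the paired differences

se_diff = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_diff       # paired t-statistic
df = n - 1                         # degrees of freedom

print(f"SE = {se_diff:.5f}, t = {t_stat:.3f}, df = {df}")
```

The recomputed standard error and t-statistic match the reported 99.30765 and 2.828 with 84 degrees of freedom, and the mean difference matches the gap between the two reported means up to rounding.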

Conclusions

• Fuzzy set theory provides a mechanism for handling the uncertainty involved in classification tasks

• The fuzzy c-means algorithm is of practical use in delineating housing submarkets

• Fuzzy set theory needs further attention in social science fields

• More work on the choice of parameters is needed
