THE HACK ON JERSEY CITY CONDO PRICES
explore trends in public data

Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015

Outline of the Project

• Data
  • Tax assessment data
  • Third-party data sources to join
  • Data janitor and collection work
• Relationship Exploration
  • Building attributes exploration
  • Individual unit price exploration
• Model for Prices
  • 5 competing models
  • Cross validation

Tax Assessment Data

NJ MOD IV System

Covers all individual properties

Downloadable in batch in text files

Key columns:

- Property address

- Property class

- Property size

- Year built

- Owner address

- Owner name

- Last sold price

- Last sold date

- Qualifier (can parse out condo floor # and unit #)

Tax Assessment Data

• 62,270 total property records, as of Feb 2015

-- filter down to --

• 3,867 individual condo units (of 19 selected mid/high-rise buildings)

# step 1.1 tax data load from NJ MOD IV system
url <- "http://tax1.co.monmouth.nj.us/download/0906monm204610.zip"
download.file(url, "0906monm204610.zip", quiet = FALSE)
closeAllConnections()
unzip("0906monm204610.zip")
taxdata <- read.csv(file = "0906monm204610.csv")

Tax Assessment Data is Dirty!

# address cleanup: abbreviate street types, then strip punctuation and whitespace
taxdata$Property.Location <- gsub("STREET", "ST", taxdata$Property.Location)
taxdata$Property.Location <- gsub("BOULEVARD", "BLVD", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:punct:]]", "", taxdata$Property.Location)
taxdata$Property.Location <- gsub("[[:space:]]", "", taxdata$Property.Location)

# drop bad prices, bad units, bad sf, bad records, retail condos
# (is.na() tests each row; is.null() on an existing column is always FALSE)
taxdata <- taxdata[taxdata$Sale.Price>10000,]
taxdata <- taxdata[taxdata$Sale.Price<=10000000,]
taxdata <- taxdata[!(taxdata$Qual=="" | is.na(taxdata$Qual)),]
taxdata <- taxdata[!(taxdata$Sq..Ft.=="" | is.na(taxdata$Sq..Ft.) |
                     taxdata$Sq..Ft.<=400 | taxdata$Sq..Ft.>=3000),]
taxdata <- taxdata[!(is.na(taxdata$Map.Page)),]
taxdata <- taxdata[!(taxdata$Building.Class=="C"),]
taxdata <- taxdata[!(substr(taxdata$Qual,4,4) %in% c("R","L","U")),]
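
The address cleanup can be tried on a few sample strings. This is a minimal sketch, assuming addresses arrive in upper case as in the tax file; the helper name normalize_address and the sample addresses are illustrative and not part of the original script.

```r
# Hypothetical helper wrapping the four gsub() steps shown on the slide.
normalize_address <- function(x) {
  x <- gsub("STREET", "ST", x)        # abbreviate street type
  x <- gsub("BOULEVARD", "BLVD", x)   # abbreviate boulevard
  x <- gsub("[[:punct:]]", "", x)     # drop punctuation
  x <- gsub("[[:space:]]", "", x)     # drop all whitespace
  x
}

# Illustrative inputs: two spellings of a street address plus a boulevard.
samples <- c("389 WASHINGTON STREET", "174 WASHINGTON ST.", "3000 KENNEDY BOULEVARD")
cleaned <- normalize_address(samples)
print(cleaned)
```

Collapsing every address to one compact key is what makes it possible to match the same building across data sources that spell it differently.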

Tax Assessment Data has Hidden Treasures

• From “Qual” we can parse out floor number and unit number

# two buildings use a different character layout in the qualifier
taxdata$Floor <-
  ifelse(taxdata$AddressClean=="389WASHINGTONST"|taxdata$AddressClean=="174WASHINGTONST",
         substr(taxdata$Qual,3,4),substr(taxdata$Qual,2,3))
# penthouse ("PH") units take the building's number of stories as their floor
Floor <- ifelse(Floor=="PH",BuildingNumberOfStories,Floor)
taxdata$Unit <-
  ifelse(taxdata$AddressClean=="389WASHINGTONST"|taxdata$AddressClean=="174WASHINGTONST",
         substr(taxdata$Qual,5,5),substr(taxdata$Qual,4,5))
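
The qualifier parsing can be illustrated on a toy table. This is only a sketch: the qualifier strings, the story counts, and the "77HUDSONST" key are made up for illustration, and real "Qual" values may be laid out differently.

```r
# Hypothetical sample rows; none of these values come from the tax file.
units <- data.frame(
  AddressClean = c("77HUDSONST", "77HUDSONST", "389WASHINGTONST"),
  Qual = c("C0712", "CPH03", "C025A"),
  BuildingNumberOfStories = c(48, 48, 30),
  stringsAsFactors = FALSE
)

# The two Washington St buildings use the alternative character positions.
special <- units$AddressClean %in% c("389WASHINGTONST", "174WASHINGTONST")
units$Floor <- ifelse(special, substr(units$Qual, 3, 4), substr(units$Qual, 2, 3))
units$Unit  <- ifelse(special, substr(units$Qual, 5, 5), substr(units$Qual, 4, 5))

# Penthouse rows get the top floor of their building, then floors become integers.
units$Floor <- ifelse(units$Floor == "PH", units$BuildingNumberOfStories, units$Floor)
units$Floor <- as.integer(units$Floor)
print(units[, c("Qual", "Floor", "Unit")])
```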

Third Party Data Sources To Join

• Condo building attributes:
  • http://livingonthehudson.com
  • http://www.jcboe.org
  • http://www.zillow.com
  • http://www.streeteasy.com
  • http://buyersadvisors.com
• Location primeness:
  • http://walkscore.com
• Building geocode / transit location:
  • http://maps.google.com/maps/api/geocode/
• Census tract level demographics:
  • http://geomap.ffiec.gov
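
Joining building-level attributes onto unit-level tax records could look like the sketch below. It assumes the cleaned address is the join key; the column names other than AddressClean and all values are illustrative stand-ins, not data from the project.

```r
# Toy unit-level tax records (illustrative values).
taxdata <- data.frame(
  AddressClean = c("389WASHINGTONST", "389WASHINGTONST", "174WASHINGTONST"),
  Sale.Price = c(500000, 650000, 420000),
  stringsAsFactors = FALSE
)

# Toy building-level attributes collected from third-party sources (illustrative).
blddata <- data.frame(
  AddressClean = c("389WASHINGTONST", "174WASHINGTONST"),
  WalkScore = c(95, 93),
  stringsAsFactors = FALSE
)

# Left join: every unit keeps its row and picks up its building's attributes.
joined <- merge(taxdata, blddata, by = "AddressClean", all.x = TRUE)
print(joined)
```

Using all.x = TRUE keeps units whose building has no match, so missing attributes show up as NA instead of silently dropping rows.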

Map Out All the Buildings

# step 1.4.3 map the buildings out
library(ggmap)    # get_googlemap, ggmap
library(ggplot2)  # geom_point, aes, ggsave
bldmaap <- ggmap(get_googlemap(center='Grove Street, Jersey City, NJ', zoom=14,
                               maptype='roadmap'), extent='device') +
  geom_point(data=bldgeoc, aes(x=lon, y=lat), colour='darkblue',
             alpha=0.7, na.rm=TRUE, size=5)
bldmaap
ggsave(filename="map.png", plot=last_plot(), width=3, height=3)

Building Attributes

For each condo building:

• Address / Lat / Lon

• Number of Units

• Number of Stories

• Year Built

• Walk Score

• Census Tract Median Household Income

• Distance to Water

• Distance to PATH (subway) Station
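
The slides do not show how the distance attributes were computed. One common way to get a distance from geocoded latitude and longitude is the haversine great-circle formula, sketched here; the function name, the Earth radius of 6371 km, and both coordinate pairs are assumptions for illustration only.

```r
# Great-circle distance in kilometres between two lat/lon points (degrees).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(pmin(1, a)))  # pmin guards against rounding above 1
}

# Approximate, illustrative coordinates: a station and a hypothetical building.
station  <- c(lat = 40.7196, lon = -74.0431)
building <- c(lat = 40.7200, lon = -74.0350)
d_km <- haversine_km(building["lat"], building["lon"], station["lat"], station["lon"])
print(unname(d_km))
```

The function is vectorized, so it can be applied to whole lat/lon columns of a building table at once.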

Building Data Table

Building Scoring System – PLSR is superior

# step 1.6 come up with a building primeness score using PCA/PLSR
library(pls)  # pcr, plsr
bld.pcr <- pcr(BuildingPPSF ~
                 OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
               1, data = blddata, validation = "CV")
bld.pls <- plsr(BuildingPPSF ~
                  OrderYearBuilt+OrderWalkScore+OrderMedianHouseholdIncome+OrderDPTH+OrderDWATER,
                1, data = blddata, validation = "CV")
# the one-component PLSR prediction of BuildingPPSF serves as the building score
blddata$BuildingScore <- as.numeric(predict(bld.pls,newdata=blddata))

PCA TRAINING: % variance explained
              1 comps
X               35.29
BuildingPPSF    22.10

PLSR TRAINING: % variance explained
              1 comps
X               29.58
BuildingPPSF    62.48
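
Why a single PLSR component explains more of the response than a single principal component can be seen in a base-R sketch. It uses synthetic data, not the building table: the first PCA direction maximizes variance in X alone, while the first PLS direction (proportional to X'y for centered data) maximizes covariance with y, so its squared correlation with y is never lower.

```r
set.seed(1)
n <- 19  # same order of magnitude as the number of buildings; data are synthetic
# Column 1 has large variance but no link to y; columns 2 and 3 drive y.
X <- cbind(3 * rnorm(n), rnorm(n), rnorm(n), rnorm(n), rnorm(n))
y <- 2 * X[, 2] - X[, 3] + rnorm(n, sd = 0.5)

Xc <- scale(X, center = TRUE, scale = FALSE)  # center predictors
yc <- y - mean(y)                             # center response

# First principal component score: direction of maximum variance in X.
v <- prcomp(Xc)$rotation[, 1]
t_pca <- drop(Xc %*% v)

# First PLS component score: weight vector proportional to X'y.
w <- drop(crossprod(Xc, yc))
w <- w / sqrt(sum(w^2))
t_pls <- drop(Xc %*% w)

# Share of y variance explained by each single component.
r2_pca <- cor(t_pca, yc)^2
r2_pls <- cor(t_pls, yc)^2
cat(sprintf("PCA component: %.1f%%  PLS component: %.1f%%\n", 100 * r2_pca, 100 * r2_pls))
```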

Some Cross Checking on Buildings

BuildingName              calc.unit.count   stated.unit.count
700 Grove                             226                 237
77 Hudson                             407                 420
Clermont Cove                          97                  NA
Crystal Point                         257                 269
Fulton's Landing                      106                 105
Gulls Cove                            301                 432
Liberty Terrace                       116                 118
Mandalay on the Hudson                250                 269
Montgomery Greene                     102                 113
Pier House                             99                 180
Portofino                             264                  NA
Shore Club North                      211                 220
Shore Club South                      214                 220
Sugar House                            48                  65
The A Condominiums                    238                 250
The James Monroe                      364                  NA
Trump Plaza                           391                 445
Waldo Lofts                            80                  82
Zephyr Lofts                           96                 102

Among the 16 buildings with known unit counts:
  3,527 total units
  3,142 units covered
  89% coverage
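
The coverage figures follow directly from the table. A short sketch that re-enters the table's two count columns and reproduces the totals (the variable names are illustrative):

```r
# Unit counts copied from the cross-check table, in table order.
calc   <- c(226, 407, 97, 257, 106, 301, 116, 250, 102, 99,
            264, 211, 214, 48, 238, 364, 391, 80, 96)
stated <- c(237, 420, NA, 269, 105, 432, 118, 269, 113, 180,
            NA, 220, 220, 65, 250, NA, 445, 82, 102)

known <- !is.na(stated)              # buildings with a stated unit count
total_units <- sum(stated[known])    # stated units in those buildings
covered     <- sum(calc[known])      # units found in the tax data for them
coverage    <- covered / total_units

cat(sum(known), "buildings,", total_units, "total units,",
    covered, "covered,", round(100 * coverage), "% coverage\n")
```

Summing calc over all 19 buildings gives the 3,867 condo units quoted earlier in the deck.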

Condo Unit Attributes

For each condo unit:

• Square Footage

• Sale Price

• Sale Date

• Floor

• Unit Number

• Building Score

Model the date dimension – price index

# step 2.3 checking condo price per sf over time (price index)
library(plyr)  # ddply
library(zoo)   # rollmean
aggu <- ddply(.data=findata[findata$Sale.YrQtr>="1999 Q1" &
                            findata$Sale.YrQtr<="2014 Q4" & !is.na(findata$Sale.YrQtr), ],
              .variables='Sale.YrQtr',
              summarize,
              calc.avg.ppsf=mean(SalePrice/SqFt,na.rm=TRUE)
)
# trailing 2-, 4- and 8-quarter rolling means, left-padded with NA to keep one row per quarter
aggu$calc.avg.ppsf.r2q <- append(rollmean(aggu$calc.avg.ppsf, 2), rep(NA,1), after=0)
aggu$calc.avg.ppsf.r4q <- append(rollmean(aggu$calc.avg.ppsf, 4), rep(NA,3), after=0)
aggu$calc.avg.ppsf.r8q <- append(rollmean(aggu$calc.avg.ppsf, 8), rep(NA,7), after=0)
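
The NA padding turns each rolling mean into a trailing average: the value for a quarter uses that quarter and the ones before it. A base-R sketch of the same idea, using stats::filter instead of zoo and a made-up price series (the helper name trailing_mean is illustrative):

```r
# Trailing k-period mean; the first k-1 positions are NA.
trailing_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Illustrative average price-per-square-foot values for five quarters.
ppsf <- c(100, 110, 120, 130, 140)
r2q <- trailing_mean(ppsf, 2)
r4q <- trailing_mean(ppsf, 4)
print(rbind(ppsf, r2q, r4q))
```

Longer windows smooth out noisy quarters with few sales, at the cost of lagging behind turning points in the market.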

Last Missing Variable: The View from Units

Manually entered, using public-domain floor plan data, listing data, and advice from broker friends

Three categories:

2 – Great View

1 – Some View

0 – Nothing Special

In the future, could look for text descriptions in listings:

- “Manhattan View”

- “Bay View”

- “Corner”

- etc.
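
A keyword search over listing text could be sketched as follows. The function name, the sample descriptions, and the mapping of keywords to the three view categories are all assumptions made for illustration; the slides do not specify such a mapping.

```r
# Hypothetical scoring: 2 = Great View, 1 = Some View, 0 = Nothing Special.
score_view <- function(desc) {
  desc <- toupper(desc)
  great <- grepl("MANHATTAN VIEW|BAY VIEW", desc)  # assumed "great view" keywords
  some  <- grepl("CORNER", desc)                   # assumed "some view" keyword
  ifelse(great, 2L, ifelse(some, 1L, 0L))
}

# Made-up listing descriptions.
listings <- c("Stunning Manhattan view from the living room",
              "Bright corner unit with new floors",
              "Renovated kitchen and in-unit laundry")
view_score <- score_view(listings)
print(data.frame(listings, view_score))
```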

Box Plot: Does View Matter?

Model the Price!

• Simple linear regression – one variable at a time
• Multiple linear regression
• Model Tree (Weka)
• Generalized Boosted Regression Models (gbm)
• Random Forest
• Cross validation of all the models

Simple linear regression

# step 3.1 bi-variate linear model
findata <- read.csv("findata.csv")
modelLMSqFt <- lm(PPSF~SqFt, data=findata)
summary(modelLMSqFt) #adjR2=0.0010, p<.05 SIGNIFICANT BUT EXPLAINS ALMOST NOTHING
modelLMFloor <- lm(PPSF~Floor, data=findata)
summary(modelLMFloor) #adjR2=0.1709, p<.05 GOOD
modelLMBuildingScore <- lm(PPSF~BuildingScore, data=findata)
summary(modelLMBuildingScore) #adjR2=0.3584, p<.05 GOOD
modelLMView <- lm(PPSF~View, data=findata)
summary(modelLMView) #adjR2=0.0959, p<.05 GOOD
modelLMIndex <- lm(PPSF~Index, data=findata)
summary(modelLMIndex) #adjR2=0.1327, p<.05 GOOD

Multi-linear regression

# step 3.2 multi-variate linear model
modelLM <- lm(PPSF~SqFt+Floor+View+Index+BuildingScore, data=findata)
summary(modelLM) #adjR2=0.4602
PPSFHatLM <- predict(modelLM,findata)
# RMSE() comes from the caret package
RMSE(PPSFHatLM, findata$PPSF, na.rm=TRUE) #99.17106
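
RMSE is the square root of the mean squared difference between predicted and observed values, so it is in the same units as PPSF (dollars per square foot). A hand-rolled sketch with made-up numbers (the function name rmse_manual is illustrative):

```r
# Root mean squared error between predictions and observations.
rmse_manual <- function(pred, obs, na.rm = FALSE) {
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

# Illustrative predicted and observed prices per square foot.
pred <- c(500, 620, 710)
obs  <- c(480, 650, 700)
err  <- rmse_manual(pred, obs)  # errors are 20, -30, 10
print(err)                      # about 21.6
```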

Model Tree

# step 3.3 model tree
library(RWeka)  # M5P
modelMT <- M5P(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
summary(modelMT)
findata$PPSFHatMT <- predict(modelMT,findata)
RMSE(findata$PPSFHatMT, findata$PPSF, na.rm=TRUE) #86.63803

gbm

# step 3.4 gbm
library(gbm)
findata <- read.csv("findata.csv")
modelGBM <- gbm(PPSF~SqFt+Floor+View+Index+BuildingScore,
                data=findata,distribution="gaussian",n.trees=10000)
summary(modelGBM)
findata$PPSFHatGBM <- predict(modelGBM,newdata=findata,n.trees=10000)
RMSE(findata$PPSFHatGBM, findata$PPSF) #90.60208

Random Forest

# step 3.5 random forest
library(randomForest)
findata <- read.csv("findata.csv")
findata <- findata[!is.na(findata$Index),]   # randomForest does not accept NA predictors
modelRF <- randomForest(PPSF~SqFt+Floor+View+Index+BuildingScore,data=findata)
summary(modelRF)
findata$PPSFHatRF <- predict(modelRF,newdata=findata)
RMSE(findata$PPSFHatRF, findata$PPSF) #64.49773

Cross Validation of All Models

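# step 4.1 data partition
library(caret)  # createDataPartition, RMSE
in_train <- createDataPartition(findata$PPSF, p=0.75, list=FALSE)
findata_train <- findata[in_train,]
findata_test <- findata[-in_train,]

# one run of 10-fold cross validation; k is only the repetition index passed by sapply
# <MODEL> is a placeholder for lm, M5P, gbm or randomForest
rmse_cv <- function(k,train){
  m <- nrow(train)
  num <- sample(1:10,m,replace=TRUE)   # random fold label for every row
  rmse <- numeric(10)
  for (i in 1:10) {
    data.t <- train[num!=i, ]          # nine folds to fit on
    data.v <- train[num==i, ]          # one fold to validate on
    model <- <MODEL>(PPSF~SqFt+Floor+View+Index+BuildingScore,data=data.t)
    pred <- predict(model,newdata=data.v)
    rmse[i] <- RMSE(pred,data.v$PPSF)
  }
  return(mean(rmse))
}

# repeat the 10-fold run 100 times
rmse <- sapply(1:100,rmse_cv,findata_train)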
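
A runnable version of the repeated 10-fold scheme, with lm filled in for the model placeholder, might look like the sketch below. It uses synthetic data in place of findata, a reduced predictor set, balanced folds rather than independently sampled fold labels, and only 5 repetitions; all of these are simplifying assumptions, and the function name rmse_cv_lm is illustrative.

```r
set.seed(42)
n <- 400
# Synthetic stand-in for the training data; the coefficients are arbitrary.
toy <- data.frame(
  SqFt  = runif(n, 400, 3000),
  Floor = sample(1:40, n, replace = TRUE),
  View  = sample(0:2, n, replace = TRUE)
)
toy$PPSF <- 400 + 5 * toy$Floor + 40 * toy$View + rnorm(n, sd = 50)

# One run of k-fold cross validation for a linear model; k is the repetition index.
rmse_cv_lm <- function(k, train, folds = 10) {
  # Shuffle balanced fold labels so every fold has about the same size.
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(train)))
  fold_rmse <- numeric(folds)
  for (i in seq_len(folds)) {
    data.t <- train[fold_id != i, ]   # fit on the other folds
    data.v <- train[fold_id == i, ]   # validate on the held-out fold
    model <- lm(PPSF ~ SqFt + Floor + View, data = data.t)
    pred <- predict(model, newdata = data.v)
    fold_rmse[i] <- sqrt(mean((pred - data.v$PPSF)^2))
  }
  mean(fold_rmse)
}

# Repeat the whole procedure a few times with different random fold splits.
cv_rmse <- sapply(1:5, rmse_cv_lm, train = toy)
print(summary(cv_rmse))
```

Repeating the split several times and averaging reduces the dependence of the estimate on any one random partition.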

Cross Validation Result

RMSE                      Total Universe   Cross Validation
Multi-linear                       99.17              99.55
Model Tree (Weka M5P)              86.63              78.46
GBM                                90.60              91.67
RandomForest                       64.49              82.19
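
One way to read the table is to compare each model's RMSE on the full data set with its cross-validated RMSE. The sketch below re-enters the four rows and computes the gap; a large positive gap is a common sign that the in-sample fit is optimistic. The variable names are illustrative.

```r
# RMSE values copied from the result table.
res <- data.frame(
  model = c("Multi-linear", "Model Tree (Weka M5P)", "GBM", "RandomForest"),
  total = c(99.17, 86.63, 90.60, 64.49),
  cv    = c(99.55, 78.46, 91.67, 82.19),
  stringsAsFactors = FALSE
)

# Gap between cross-validated and full-data RMSE, and the lowest CV RMSE.
res$gap <- round(res$cv - res$total, 2)
best_cv <- res$model[which.min(res$cv)]
print(res)
cat("Lowest cross-validated RMSE:", best_cv, "\n")
```

On these numbers RandomForest has the best full-data RMSE but the largest gap (17.70), while the model tree has the lowest cross-validated RMSE.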

Wish List Items…

• More rigorous regression diagnostics

• Tuning models better

• Model blending

• Compare with Zestimate

THANK YOU!
