MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION
Wuri Wedyawati, Meiliu Lu
Department of Computer Science
California State University
Sacramento, CA 95819-6021
Outline
Introduction
Data Warehousing
– Building a data warehouse
– MasterDW: the data warehouse
Predictive Regression
– Real Estate Price Prediction
Conclusion
Future work
Introduction
The objective is to develop a knowledge discovery system that lets prospective real estate sellers and buyers estimate a property's price from local sold listings.
The selling price of a property is predicted with predictive regression.
Building a data warehouse is a prerequisite for efficient mining of large operational data such as the Multiple Listing Service (MLS), the data source for this system.
Data Warehouse
A decision support database that is maintained separately from the organization’s operational database.
Supports decision-making by providing a platform of consolidated, historical data for analysis.
Our data warehouse is based on a multidimensional data model called a star schema, with one large fact table surrounded by a set of dimension tables.
Data Warehousing
Process of building a data warehouse:
1. Extraction
2. Transformation and cleansing
3. Modeling
4. Transport
1. Extraction
Document the sources of data
Identify the databases and files containing the data of interest
Analyze and document the business meaning of the data, data relationships and business rules
Determine the data that need to be extracted
Extract all or a subset of the data from the source
– Use an unload utility
– Use a data manipulation language statement
Extract the changes made to the source data
– Use a recovery log
– Use a database trigger
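The trigger-based option for capturing changes can be sketched as follows. This is a minimal illustration, not the authors' setup: it uses Python's built-in sqlite3 module as a stand-in for the operational database, and the table names (listings, listings_changes), the trigger name, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory stand-in for an operational source database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (mls_no TEXT PRIMARY KEY, list_price INTEGER);

    -- Change log populated by the trigger below.
    CREATE TABLE listings_changes (
        mls_no TEXT, old_price INTEGER, new_price INTEGER
    );

    CREATE TRIGGER log_price_change AFTER UPDATE OF list_price ON listings
    BEGIN
        INSERT INTO listings_changes
        VALUES (OLD.mls_no, OLD.list_price, NEW.list_price);
    END;
""")

conn.execute("INSERT INTO listings VALUES ('A1', 250000)")
conn.execute("INSERT INTO listings VALUES ('A2', 310000)")
conn.execute("UPDATE listings SET list_price = 245000 WHERE mls_no = 'A1'")

# The extraction job reads only the captured changes, not the whole table.
changes = conn.execute("SELECT * FROM listings_changes").fetchall()
print(changes)
```

Only the updated row appears in the change table, so an incremental extract can skip rows that did not change since the last run.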
2. Transformation and Cleansing
Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in the extraction step.
Check the accuracy of the source data.
Identify the tasks required for data cleansing.
Transform and integrate the cleaned data into the format required by the target system, the data warehouse.
3. Modeling
A star schema shows data as a collection of two types: facts and dimensions.
A fact table is the primary table in a dimensional model. It contains the names of the facts, or numerical measures, as well as keys to each of the related dimension tables. Examples of facts: sales, credit card accounts, residential records.
A dimension table describes a specific dimension with a set of attributes. Examples of dimensions: time, students, areas.
An Example Star Schema
MasterDW Modeling
[Figure: star schema with the RESIDENTIAL fact table and the OFFICES, AGENTS, and AREAS dimension tables.]
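To make the fact/dimension split concrete, here is a small sketch of a star schema with the same four table names. It runs on Python's built-in sqlite3 module rather than Oracle, and every column name, key, and data row is an assumption made for illustration; the slides do not list the actual MasterDW columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per area, office, or agent (assumed columns).
    CREATE TABLE areas   (area_number TEXT PRIMARY KEY, area_name TEXT);
    CREATE TABLE offices (office_id   TEXT PRIMARY KEY, office_name TEXT);
    CREATE TABLE agents  (agent_id    TEXT PRIMARY KEY, agent_name TEXT);

    -- Fact table: numerical measures plus a key into each dimension.
    CREATE TABLE residential (
        mls_no         TEXT PRIMARY KEY,
        area_number    TEXT REFERENCES areas(area_number),
        office_id      TEXT REFERENCES offices(office_id),
        agent_id       TEXT REFERENCES agents(agent_id),
        square_footage INTEGER,
        sold_price     INTEGER
    );

    INSERT INTO areas   VALUES ('A1', 'Area One'), ('A2', 'Area Two');
    INSERT INTO offices VALUES ('OFC1', 'Office One');
    INSERT INTO agents  VALUES ('AGT1', 'Agent One');
    INSERT INTO residential VALUES
        ('M1', 'A1', 'OFC1', 'AGT1', 1500, 300000),
        ('M2', 'A1', 'OFC1', 'AGT1', 2000, 400000),
        ('M3', 'A2', 'OFC1', 'AGT1', 1800, 270000);
""")

# A typical star-schema query: aggregate the facts, grouped by a dimension.
rows = conn.execute("""
    SELECT a.area_name, COUNT(*), AVG(r.sold_price)
    FROM residential AS r
    JOIN areas AS a ON a.area_number = r.area_number
    GROUP BY a.area_name
    ORDER BY a.area_name
""").fetchall()
print(rows)
```

The fact table holds one row per listing and joins outward to each dimension by key, which is what gives the schema its star shape.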
4. Transport
Identify the tools and techniques to be used for loading the data into the target system
[...] database)
Evaluate the need for data compression and encryption if captured or transformed data is to be transported across a network
MasterDW Data Warehousing
[Figure: data flow diagram with the following labels: RESI.TXT (data source); RESSOLDLOG.TXT (log file); RES.TXT; Transformation and Cleansing; Update; OFCSRC.TXT, AGTSRC.TXT; RESIDENTIAL.TXT, OFFICE.TXT, AGENT.TXT, AREA.TXT; Transformation and Cleansing 2; Duplicate Detection; OFFICES, AGENTS, RESIDENTIAL, and AREA tables; Load.]
MasterDW Extraction
The operational data source is extracted from the Multiple Listing Service (MLS) database for Sacramento, El Dorado, Placer, and Yolo Counties.
It captures all the residential data in the source system from January 1, 1998 to January 9, 2004.
The source data is a “|”-delimited flat file (“RESI.TXT”) containing 191 fields and 295787 rows.
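Reading a “|”-delimited flat file of this kind can be sketched with Python's csv module. The three field names and the two sample rows below are hypothetical stand-ins; the real file has 191 fields whose names are not given in the slides.

```python
import csv
import io

# Hypothetical three-field sample; the real RESI.TXT has 191 fields per row.
sample = io.StringIO(
    "M1|250000|1500\n"
    "M2|310000|2000\n"
)
FIELDS = ["mls_no", "list_price", "square_footage"]  # assumed field names


def read_listings(handle):
    """Yield one dict per row of a '|'-delimited file, checking the field count."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
        yield dict(zip(FIELDS, row))


listings = list(read_listings(sample))
print(listings[0])
```

Checking the field count on every row catches truncated or shifted records early, before they reach the cleansing steps.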
MasterDW Transformation and Cleansing
There are four steps:
1. Transformation and cleansing 1
2. Update process for the result of transformation and cleansing 1
3. Transformation and cleansing 2
4. Duplicate detection for office and agent records
1. Transformation and Cleansing 1
Listing Price Check
    If intLP <= 0 Or intLP > 99999999 Then
Sold Price Check
    If (LDCheck(strMLSNo, strLD)) = 0 And Len(PDCheck(strMLSNo, strPD, strLD)) = 0 _
            And DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD)) > 730 Then
        DOMCheck = strMLSNo & " : DOM TOO LARGE = " & _
            DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD))
    End If
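The two checks above can be restated as a small, self-contained sketch. This is an illustration in Python, not the authors' Visual Basic .NET code: the function names are hypothetical, the price bound and the 730-day threshold are taken from the fragments above, and days on market is assumed here to mean pending date minus listing date.

```python
from datetime import date

MAX_LIST_PRICE = 99_999_999  # upper bound used in the listing price check
MAX_DAYS_ON_MARKET = 730     # threshold used in the days-on-market (DOM) check


def check_listing_price(mls_no, list_price):
    """Return a log message for an out-of-range listing price, else ''."""
    if list_price <= 0 or list_price > MAX_LIST_PRICE:
        return f"{mls_no} : LP EXCEEDS LIMIT = {list_price}"
    return ""


def check_days_on_market(mls_no, listing_date, pending_date):
    """Return a log message when DOM exceeds the threshold, else ''.

    Assumption: DOM is the number of days from listing date to pending date.
    """
    dom = (pending_date - listing_date).days
    if dom > MAX_DAYS_ON_MARKET:
        return f"{mls_no} : DOM TOO LARGE = {dom}"
    return ""


print(check_listing_price("M1", 0))
print(check_days_on_market("M2", date(2000, 1, 1), date(2002, 1, 1)))
```

Returning a message string instead of raising an error mirrors the log-file approach: every suspicious record is reported, and the corrections are applied in a separate update step.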
2. Update Process for the Result of Transformation and Cleansing 1
132110169 : LP EXCEEDS LIMIT = 132 (132000)
30015346 : SQFT EXCEEDS LIMIT = 12700 (1270)
30015611 : LD EXCEEDS LIMIT = 1920-05-07 (2000-05-07)
30015755 : NO FULL BATHROOM = 0 AND 3 (3 AND 0)
102100090 : INVALID YEAR BUILT = 96 (1996)
30028591 : INVALID YEAR BUILT = 1056 (1956)
102000035 : PD IS LESS THAN LD => PD = 2000-03-30 &
4. Duplicate Detection for Agent and Office Records
“AGTSRC.TXT” contains duplicate records. An agent can be a selling agent, a buyer agent, or both in a listing. An agent can have more than one listing in “RES.TXT”.
Example:
“SAKBARIR|Rouhi N. Akbari|916-484-5456|916-223-7647||1|C||LYON01”
“OFCSRC.TXT” contains duplicate records. An office can be a selling office, a buyer office, or both in a listing. An office can have more than one listing in “RES.TXT”.
Example:
“LYON01|Lyon Real Estate|916-481-3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|95825|Sacramento”
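One simple way to remove such duplicates is to keep the first occurrence of each record. The sketch below assumes that duplicates are exact, character-for-character repeats of a line, which the slides do not confirm; the function name and the sample records are hypothetical.

```python
def drop_duplicates(lines):
    """Keep the first occurrence of each record, preserving input order."""
    seen = set()
    unique = []
    for line in lines:
        record = line.strip()
        if record and record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical agent records: the same agent appears once per listing.
agent_source = [
    "AGT1|Agent One|OFC1",
    "AGT2|Agent Two|OFC1",
    "AGT1|Agent One|OFC1",
]
agents = drop_duplicates(agent_source)
print(agents)
```

If the same agent or office could appear with small differences, such as a changed phone number, the comparison would have to use the identifier field alone rather than the whole line.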
MasterDW Modeling: Ready to load the clean data into the 4 tables
[Figure: RESIDENTIAL fact table with the OFFICES, AGENTS, and AREAS dimension tables.]
MasterDW Transport
Load “AREA.TXT” to AREAS dimension table
Load “OFFICE.TXT” to OFFICES dimension table
    c:\>sqlldr masterdw/masterdw control=office.ctl log=office.log
Load “AGENT.TXT” to AGENTS dimension table
    c:\>sqlldr masterdw/masterdw control=agent.ctl log=agent.log
Load “RESIDENTIAL.TXT” to RESIDENTIAL fact table
    c:\>sqlldr masterdw/masterdw control=residential.ctl log=residential.log
Predictive Regression
Predictive regression uses continuous values in the data set to predict unknown or future values of other variables of interest.
The objective of regression analysis is to determine the best model that relates the output variable to the various input variables.
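As an illustration of fitting such a model, the sketch below assumes the simplest case: one input variable and a straight line Y = α + βX fitted by ordinary least squares. The slides do not state the exact model form or the inputs used, so the choice of square footage as the input and the data points themselves are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Made-up sold listings: square footage versus sold price.
square_feet = [1500, 2000, 2500, 3000]
sold_prices = [200000, 250000, 300000, 350000]

alpha, beta = fit_line(square_feet, sold_prices)
predicted = alpha + beta * 2200  # predicted price for a 2200 sq ft house
print(alpha, beta, predicted)
```

With real listings the points would not fall exactly on a line, and several inputs (bedrooms, bathrooms, year built) would call for multiple regression rather than this one-variable form.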
Regression: input and output
Input data: X, β, α, to be determined by the query result selected from MasterDW based on the user's request parameters
Example:
    Select * from Residential where (Status = 'Sold') and (Area_Number = '10835') and (Square_Footage between '2000' and '3000') and (Bedrooms = '4') and (Bathrooms_Full = '2') and (Bathrooms_Half = '0') and (Year_Built = '2001')
Assumption: Bull housing market
Output result: Y, the predicted house price
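The selection step can be sketched with a parameterized query, so that the user's request parameters are bound as values rather than pasted into the SQL text. The sketch uses Python's built-in sqlite3 module as a stand-in for the Oracle connection; the function name, the Sold_Price column, and the sample rows are assumptions, while the other column names come from the example query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Residential (
        Status TEXT, Area_Number TEXT, Square_Footage INTEGER,
        Bedrooms INTEGER, Bathrooms_Full INTEGER, Bathrooms_Half INTEGER,
        Year_Built INTEGER, Sold_Price INTEGER  -- Sold_Price is an assumed column
    )
""")
conn.executemany(
    "INSERT INTO Residential VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Sold", "10835", 2400, 4, 2, 0, 2001, 410000),    # matches the request
        ("Sold", "10835", 3500, 4, 2, 0, 2001, 520000),    # too large
        ("Active", "10835", 2500, 4, 2, 0, 2001, None),    # not sold
    ],
)


def comparable_sales(conn, area, min_sqft, max_sqft, beds, full_baths, half_baths, year):
    """Return (square footage, sold price) pairs for sold listings matching the request."""
    sql = """
        SELECT Square_Footage, Sold_Price
        FROM Residential
        WHERE Status = 'Sold'
          AND Area_Number = ?
          AND Square_Footage BETWEEN ? AND ?
          AND Bedrooms = ?
          AND Bathrooms_Full = ?
          AND Bathrooms_Half = ?
          AND Year_Built = ?
    """
    params = (area, min_sqft, max_sqft, beds, full_baths, half_baths, year)
    return conn.execute(sql, params).fetchall()


rows = comparable_sales(conn, "10835", 2000, 3000, 4, 2, 0, 2001)
print(rows)
```

The rows returned by such a query are the (X, Y) observations from which the regression coefficients would then be estimated.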
Visual Basic .NET is used to create the user interface.
The communication between Oracle and the .NET Framework is established by adding the Oracle Provider for OLE DB (OraOLEDB) component as a reference.
Conclusion
Understand the knowledge domain: real estate terms and transaction process
Technology used:
– Building a data warehouse using Oracle data warehousing tools
– Statistical data analysis (predictive regression method)
– Visual Basic .NET programming
– Oracle Provider for OLE DB (OraOLEDB)
Future Work
Move towards a tightly coupled data mining architecture.
Enhance this project by making it an online service for the public.
Integrate a current market trend factor.
Determine what kinds of home improvement a real estate seller can make to increase property value on the market.