Data Warehousing and Data Mining

1University of Sunderland CIFM06 DB Systems Development Session 17

Data Warehousing and Data Data Warehousing and Data MiningMining

May 2006


ContentsContents

• Data Warehousing

• OLAP

• Data Mining

• Further Reading


Data WarehousingData Warehousing

• OLTP (online transaction processing) systems – range in size from megabytes to terabytes– high transaction throughput

• Decision makers require access to all data– Historical and current– 'A data warehouse is a subject-oriented,

integrated, time-variant and non-volatile collection of data in support of management’s decision-making process' (Inmon 1993)


BenefitsBenefits

• Potential high returns on investment– 90% of companies in 1996 reported return of

investment (over 3 years) of > 40%

• Competitive advantage– Data can reveal previously unknown, unavailable

and untapped information

• Increased productivity of corporate decision-makers– Integration allows more substantive, accurate

and consistent analysis


ComparisonComparisonOLTP Data Warehouse

Holds current data Holds historic data

Stores detailed data Detailed, lightly/highlysummarised data

Data is dynamic Data largely static

Repetitive processing Ad hoc querying, unstructuredand heuristic processing

High transaction throughput Medium-low level transactionthroughput

Predictable usage patterns Unpredictable usage patterns

Transaction driven Analysis driven

Application oriented Subject oriented

Supports day-to-day decisions Strategic decisions

Large number ofclerical/operational users

Lower number of managerialusers

Source: Connolly and Begg p1153


Typical ArchitectureTypical Architecture

Warehouse mgr

Loadmgr

Warehouse mgr

Querymanager

DBMS

Meta-data Highlysummarizeddata

Lightly summarizeddata

Detailed data

Mainframe operationaln/w,h/w data

DepartmentalRDBMS data

Private data

External dataArchive/backup

Reporting query, appdevelopment,EIS tools

OLAP tools

Data-mining tools

Source: Connolly and Begg p1157


Data WarehousesData Warehouses

• Types of Data– Detailed– Summarised– Meta-data– Archive/Back-up


Information FlowsInformation Flows

Warehouse Mgr

Loadmgr

Warehouse mgr

Querymanager

DBMS

Meta-data

Highlysumm.data

Lightlysumm.

Detailed data

Operational datasource 1

Operational datasource n

Archive/backup

Reporting query, appdevelopment,EIS tools

OLAP tools

Data-mining tools

Meta-flow

Inflow

Downflow

Upflow

Outflow

Source Connolly and Begg p1162


Information Flow Information Flow ProcessesProcesses

• Five primary information flows– Inflow - extraction, cleansing and loading of data

from source systems into warehouse– Upflow - adding value to data in warehouse

through summarizing, packaging and distributing data

– Downflow - archiving and backing up data in warehouse

– Outflow - making data available to end users– Metaflow - managing the metadata


Problems of Data Problems of Data WarehousingWarehousing

1. Underestimation of resources for data loading2. Hidden problems with source systems3. Required data not captured4. Increased end-user demands5. Data homogenization6. High demand for resources7. Data ownership8. High maintenance9. Long duration projects10.Complexity of integration


Data Warehouse Data Warehouse DesignDesign

• Data must be designed to allow ad-hoc queries to be answered with acceptable performance constraints

• Queries usually require access to factual data generated by business transactions– e.g. find the average number of properties rented

out with a monthly rent greater than £700 at each branch office over the last six months

• Uses Dimensionality Modelling


Dimensionality Dimensionality ModellingModelling

• Similar to E-R modelling but with constraints– composed of one fact table with a composite primary key– dimension tables have a simple primary key which

corresponds exactly to one foreign key in the fact table– uses surrogate keys based on integer values

– Can efficiently and easily support ad-hoc end-user queries


Star SchemasStar Schemas

• The most common dimensional model• A fact table surrounded by dimension tables• Fact tables

– contains FK for each dimension table– large relative to dimension tables– read-only

• Dimension tables– reference data– query performance speeded up by denormalising

into a single dimension table


E-R Model ExampleE-R Model Example

Source: Connolly and Begg


Star Schema ExampleStar Schema Example



Other SchemasOther Schemas

• Snowflake schemas– variant of star schema– each dimension can have its own dimensions

• Starflake schemas– hybrid structure– contains mixture of (denormalised) star and

(normalised) snowflake schemas


OLAPOLAP

• Online Analytical Processing– dynamic synthesis, analysis and consolidation of

large volumes of multi-dimensional data

– normally implemented using specialized multi-dimensional DBMS

• a method of visualising and manipulating data with many inter-relationships

– Support common analytical operations such as• consolidation

• drill-down

• slicing and dicing


Codd’s OLAP RulesCodd’s OLAP Rules1. Multi-dimensional conceptual view2. Transparency3. Accessibility4. Consistent reporting performance5. Client-server architecture6. Generic dimensionality7. Dynamic sparse matrix handling8. Multi-user support9. Unrestricted cross-dimensional operations10. Intuitive data manipulation11. Flexible reporting12. Unlimited dimensions and aggregation levels


OLAP ToolsOLAP Tools• Categorised according to architecture of

underlying database– Multi-dimensional OLAP

• data typically aggregated and stored according to predicted usage

• use array technology

– Relational OLAP• use of relational meta-data layer with enhanced SQL

– Managed Query Environment• deliver data direct from DBMS or MOLAP server to desktop

in form of a datacube


MOLAPMOLAP

RDBServer

Load

MOLAPserver

Request

Result

PresentationLayer

Database/ApplicationLogic Layer


ROLAPROLAP

RDBServer

ROLAPserver

Request

Result

PresentationLayer

ApplicationLogic Layer

SQL

Result

DatabaseLayer


MQEMQE

RDBServer

Load

MOLAPserver

Request

Result

SQL

Result

End-usertools


Data MiningData Mining

• ‘The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions’ (Simoudis, 1996)

– focus is to reveal information which is hidden or unexpected

– patterns and relationships are identified by examining the underlying rules and features of the data

– work from data up– require large volumes of data


Example Data Mining Example Data Mining ApplicationsApplications

• Retail/Marketing– Identifying buying patterns of customers– Finding associations among customer

demographic characteristics– Predicting response to mailing campaigns– Market basket analysis


Example Data Mining Example Data Mining ApplicationsApplications

• Banking– Detecting patterns of fraudulent credit card use– Identifying loyal customers– Predicting customers likely to change their credit

card affiliation– Determining credit card spending by customer

groups


Data Mining Data Mining TechniquesTechniques

• Four main techniques– Predictive Modelling– Database Segmentation– Link Analysis– Deviation Direction



• Predictive Modelling– using observations to form a model of the

important characteristics of some phenomenon

• Techniques:– Classification– Value Prediction


Classification Example- Classification Example- Tree InductionTree Induction

Customer renting property> 2 years

Rent property

Rent property Buy property

Customer age> 25 years?

No Yes

No Yes




• Database Segmentation:– to partition a database into an unknown number

of segments (or clusters) of records which share a number of properties

• Techniques:– Demographic clustering– Neural clustering


Segmentation: Segmentation: Scatterplot ExampleScatterplot Example




• Link Analysis– establish associations between individual

records (or sets of records) in a database• e.g. ‘when a customer rents property for more than

two years and is more than 25 years old, then in 40% of cases, the customer will buy the property’

– Techniques• Association discovery• Sequential pattern discovery• Similar time sequence discovery



• Deviation Detection– identify ‘outliers’, something which deviates from

some known expectation or norm

– Statistics– Visualisation


Deviation Detection: Deviation Detection: Visualisation ExampleVisualisation Example



Mining and Mining and WarehousingWarehousing

• Data mining needs single, separate, clean, integrated, self-consistent data source

• Data warehouse well equipped:– populated with clean, consistent data– contains multiple sources– utilises query capabilities– capability to go back to data source


Further ReadingFurther Reading

• Connolly and Begg, chapters 31 to 34.

• W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.

• Benyon-Davies P, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.

Data Warehousing and Data Mining

Documents

data warehouse design

data downflow

factual data

data historical

data warehouses types

data available

competitive advantage

intuitive data manipulation