Top Banner
1 University of Sunderland CIFM06 DB Systems Development Session 17 Data Warehousing and Data Warehousing and Data Mining Data Mining May 2006
35
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Data Warehousing and Data Mining

1University of Sunderland CIFM06 DB Systems Development Session 17

Data Warehousing and Data Data Warehousing and Data MiningMining

May 2006

Page 2: Data Warehousing and Data Mining

2University of Sunderland CIFM06 DB Systems Development Session 17

ContentsContents

• Data Warehousing

• OLAP

• Data Mining

• Further Reading

Page 3: Data Warehousing and Data Mining

3University of Sunderland CIFM06 DB Systems Development Session 17

Data WarehousingData Warehousing

• OLTP (online transaction processing) systems – range in size from megabytes to terabytes– high transaction throughput

• Decision makers require access to all data– Historical and current– 'A data warehouse is a subject-oriented,

integrated, time-variant and non-volatile collection of data in support of management’s decision-making process' (Inmon 1993)

Page 4: Data Warehousing and Data Mining

4University of Sunderland CIFM06 DB Systems Development Session 17

BenefitsBenefits

• Potential high returns on investment– 90% of companies in 1996 reported return of

investment (over 3 years) of > 40%

• Competitive advantage– Data can reveal previously unknown, unavailable

and untapped information

• Increased productivity of corporate decision-makers– Integration allows more substantive, accurate

and consistent analysis

Page 5: Data Warehousing and Data Mining

5University of Sunderland CIFM06 DB Systems Development Session 17

ComparisonComparisonOLTP Data Warehouse

Holds current data Holds historic data

Stores detailed data Detailed, lightly/highlysummarised data

Data is dynamic Data largely static

Repetitive processing Ad hoc querying, unstructuredand heuristic processing

High transaction throughput Medium-low level transactionthroughput

Predictable usage patterns Unpredictable usage patterns

Transaction driven Analysis driven

Application oriented Subject oriented

Supports day-to-day decisions Strategic decisions

Large number ofclerical/operational users

Lower number of managerialusers

Source: Connolly and Begg p1153

Page 6: Data Warehousing and Data Mining

6University of Sunderland CIFM06 DB Systems Development Session 17

Typical ArchitectureTypical Architecture

Warehouse mgr

Loadmgr

Warehouse mgr

Querymanager

DBMS

Meta-data Highlysummarizeddata

Lightly summarizeddata

Detailed data

Mainframe operationaln/w,h/w data

DepartmentalRDBMS data

Private data

External dataArchive/backup

Reporting query, appdevelopment,EIS tools

OLAP tools

Data-mining tools

Source: Connolly and Begg p1157

Page 7: Data Warehousing and Data Mining

7University of Sunderland CIFM06 DB Systems Development Session 17

Data WarehousesData Warehouses

• Types of Data– Detailed– Summarised– Meta-data– Archive/Back-up

Page 8: Data Warehousing and Data Mining

8University of Sunderland CIFM06 DB Systems Development Session 17

Information FlowsInformation Flows

Warehouse Mgr

Loadmgr

Warehouse mgr

Querymanager

DBMS

Meta-data

Highlysumm.data

Lightlysumm.

Detailed data

Operational datasource 1

Operational datasource n

Archive/backup

Reporting query, appdevelopment,EIS tools

OLAP tools

Data-mining tools

Meta-flow

Inflow

Downflow

Upflow

Outflow

Source Connolly and Begg p1162

Page 9: Data Warehousing and Data Mining

9University of Sunderland CIFM06 DB Systems Development Session 17

Information Flow Information Flow ProcessesProcesses

• Five primary information flows– Inflow - extraction, cleansing and loading of data

from source systems into warehouse– Upflow - adding value to data in warehouse

through summarizing, packaging and distributing data

– Downflow - archiving and backing up data in warehouse

– Outflow - making data available to end users– Metaflow - managing the metadata

Page 10: Data Warehousing and Data Mining

10University of Sunderland CIFM06 DB Systems Development Session 17

Problems of Data Problems of Data WarehousingWarehousing

1. Underestimation of resources for data loading2. Hidden problems with source systems3. Required data not captured4. Increased end-user demands5. Data homogenization6. High demand for resources7. Data ownership8. High maintenance9. Long duration projects10.Complexity of integration

Page 11: Data Warehousing and Data Mining

11University of Sunderland CIFM06 DB Systems Development Session 17

Data Warehouse Data Warehouse DesignDesign

• Data must be designed to allow ad-hoc queries to be answered with acceptable performance constraints

• Queries usually require access to factual data generated by business transactions– e.g. find the average number of properties rented

out with a monthly rent greater than £700 at each branch office over the last six months

• Uses Dimensionality Modelling

Page 12: Data Warehousing and Data Mining

12University of Sunderland CIFM06 DB Systems Development Session 17

Dimensionality Dimensionality ModellingModelling

• Similar to E-R modelling but with constraints– composed of one fact table with a composite primary key– dimension tables have a simple primary key which

corresponds exactly to one foreign key in the fact table– uses surrogate keys based on integer values

– Can efficiently and easily support ad-hoc end-user queries

Page 13: Data Warehousing and Data Mining

13University of Sunderland CIFM06 DB Systems Development Session 17

Star SchemasStar Schemas

• The most common dimensional model• A fact table surrounded by dimension tables• Fact tables

– contains FK for each dimension table– large relative to dimension tables– read-only

• Dimension tables– reference data– query performance speeded up by denormalising

into a single dimension table

Page 14: Data Warehousing and Data Mining

14University of Sunderland CIFM06 DB Systems Development Session 17

E-R Model ExampleE-R Model Example

Source: Connolly and Begg

Page 15: Data Warehousing and Data Mining

15University of Sunderland CIFM06 DB Systems Development Session 17

Star Schema ExampleStar Schema Example

Source: Connolly and Begg

Page 16: Data Warehousing and Data Mining

16University of Sunderland CIFM06 DB Systems Development Session 17

Other SchemasOther Schemas

• Snowflake schemas– variant of star schema– each dimension can have its own dimensions

• Starflake schemas– hybrid structure– contains mixture of (denormalised) star and

(normalised) snowflake schemas

Page 17: Data Warehousing and Data Mining

17University of Sunderland CIFM06 DB Systems Development Session 17

OLAPOLAP

• Online Analytical Processing– dynamic synthesis, analysis and consolidation of

large volumes of multi-dimensional data

– normally implemented using specialized multi-dimensional DBMS

• a method of visualising and manipulating data with many inter-relationships

– Support common analytical operations such as• consolidation

• drill-down

• slicing and dicing

Page 18: Data Warehousing and Data Mining

18University of Sunderland CIFM06 DB Systems Development Session 17

Codd’s OLAP RulesCodd’s OLAP Rules1. Multi-dimensional conceptual view2. Transparency3. Accessibility4. Consistent reporting performance5. Client-server architecture6. Generic dimensionality7. Dynamic sparse matrix handling8. Multi-user support9. Unrestricted cross-dimensional operations10. Intuitive data manipulation11. Flexible reporting12. Unlimited dimensions and aggregation levels

Page 19: Data Warehousing and Data Mining

19University of Sunderland CIFM06 DB Systems Development Session 17

OLAP ToolsOLAP Tools• Categorised according to architecture of

underlying database– Multi-dimensional OLAP

• data typically aggregated and stored according to predicted usage

• use array technology

– Relational OLAP• use of relational meta-data layer with enhanced SQL

– Managed Query Environment• deliver data direct from DBMS or MOLAP server to desktop

in form of a datacube

Page 20: Data Warehousing and Data Mining

20University of Sunderland CIFM06 DB Systems Development Session 17

MOLAPMOLAP

RDBServer

Load

MOLAPserver

Request

Result

PresentationLayer

Database/ApplicationLogic Layer

Page 21: Data Warehousing and Data Mining

21University of Sunderland CIFM06 DB Systems Development Session 17

ROLAPROLAP

RDBServer

ROLAPserver

Request

Result

PresentationLayer

ApplicationLogic Layer

SQL

Result

DatabaseLayer

Page 22: Data Warehousing and Data Mining

22University of Sunderland CIFM06 DB Systems Development Session 17

MQEMQE

RDBServer

Load

MOLAPserver

Request

Result

SQL

Result

End-usertools

Page 23: Data Warehousing and Data Mining

23University of Sunderland CIFM06 DB Systems Development Session 17

Data MiningData Mining

• ‘The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions’ (Simoudis, 1996)

– focus is to reveal information which is hidden or unexpected

– patterns and relationships are identified by examining the underlying rules and features of the data

– work from data up– require large volumes of data

Page 24: Data Warehousing and Data Mining

24University of Sunderland CIFM06 DB Systems Development Session 17

Example Data Mining Example Data Mining ApplicationsApplications

• Retail/Marketing– Identifying buying patterns of customers– Finding associations among customer

demographic characteristics– Predicting response to mailing campaigns– Market basket analysis

Page 25: Data Warehousing and Data Mining

25University of Sunderland CIFM06 DB Systems Development Session 17

Example Data Mining Example Data Mining ApplicationsApplications

• Banking– Detecting patterns of fraudulent credit card use– Identifying loyal customers– Predicting customers likely to change their credit

card affiliation– Determining credit card spending by customer

groups

Page 26: Data Warehousing and Data Mining

26University of Sunderland CIFM06 DB Systems Development Session 17

Data Mining Data Mining TechniquesTechniques

• Four main techniques– Predictive Modelling– Database Segmentation– Link Analysis– Deviation Direction

Page 27: Data Warehousing and Data Mining

27University of Sunderland CIFM06 DB Systems Development Session 17

Data Mining Data Mining TechniquesTechniques

• Predictive Modelling– using observations to form a model of the

important characteristics of some phenomenon

• Techniques:– Classification– Value Prediction

Page 28: Data Warehousing and Data Mining

28University of Sunderland CIFM06 DB Systems Development Session 17

Classification Example- Classification Example- Tree InductionTree Induction

Customer renting property> 2 years

Rent property

Rent property Buy property

Customer age> 25 years?

No Yes

No Yes

Source: Connolly and Begg

Page 29: Data Warehousing and Data Mining

29University of Sunderland CIFM06 DB Systems Development Session 17

Data Mining Data Mining TechniquesTechniques

• Database Segmentation:– to partition a database into an unknown number

of segments (or clusters) of records which share a number of properties

• Techniques:– Demographic clustering– Neural clustering

Page 30: Data Warehousing and Data Mining

30University of Sunderland CIFM06 DB Systems Development Session 17

Segmentation: Segmentation: Scatterplot ExampleScatterplot Example

Source: Connolly and Begg

Page 31: Data Warehousing and Data Mining

31University of Sunderland CIFM06 DB Systems Development Session 17

Data Mining Data Mining TechniquesTechniques

• Link Analysis– establish associations between individual

records (or sets of records) in a database• e.g. ‘when a customer rents property for more than

two years and is more than 25 years old, then in 40% of cases, the customer will buy the property’

– Techniques• Association discovery• Sequential pattern discovery• Similar time sequence discovery

Page 32: Data Warehousing and Data Mining

32University of Sunderland CIFM06 DB Systems Development Session 17

Data Mining Data Mining TechniquesTechniques

• Deviation Detection– identify ‘outliers’, something which deviates from

some known expectation or norm

– Statistics– Visualisation

Page 33: Data Warehousing and Data Mining

33University of Sunderland CIFM06 DB Systems Development Session 17

Deviation Detection: Deviation Detection: Visualisation ExampleVisualisation Example

Source: Connolly and Begg

Page 34: Data Warehousing and Data Mining

34University of Sunderland CIFM06 DB Systems Development Session 17

Mining and Mining and WarehousingWarehousing

• Data mining needs single, separate, clean, integrated, self-consistent data source

• Data warehouse well equipped:– populated with clean, consistent data– contains multiple sources– utilises query capabilities– capability to go back to data source

Page 35: Data Warehousing and Data Mining

35University of Sunderland CIFM06 DB Systems Development Session 17

Further ReadingFurther Reading

• Connolly and Begg, chapters 31 to 34.

• W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.

• Benyon-Davies P, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.