07/03/22 Data Warehousing: lectures based on material from Phil Trinder (HW). Monica Farrow, email: [email protected]
Page 1: Data warehousing

04/12/23 Data Warehousing 1

Data Warehousing

Lectures based on material from Phil Trinder (HW)

Monica Farrow
email: [email protected]

Page 2: Data warehousing

Data Warehouse

Two definitions:
  “A data warehouse is a copy of transaction data specifically structured for querying and reporting.”
    Data Warehousing Information Center, http://www.dwinfocenter.org/defined.html
  A data warehouse is a specialised database to support strategic decision making

Decision making involves:
  Analysing the problem, e.g.
    Why are my sales not meeting my targets?
    What products are not meeting their targets?
    What are the trends for the failing products?
  Generating alternative solutions, evaluating them, and choosing the best

Page 3: Data warehousing

Decision Support Systems

These are used by management to make strategic or policy decisions
They have existed for a long time
Characteristics
  Aimed at loosely specified problems
  Combine models and analytical approaches with data retrieval
  Good usability for non-specialist use
  Flexible: to support multiple decision-making approaches

Page 4: Data warehousing

A wine club example

100,000 members, 2000 wines, 150 suppliers, 750,000 orders per year

System: storage technology
  Member administration: indexed sequential files
  Stock control: relational database
  Order processing: relational database
  Despatch: proprietary database

Page 5: Data warehousing

Wine Club Operational Schema

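[Figure: entity-relationship diagram of the operational schema. Entities: Member, MemberOrder, OrderItem, Stock, Wine, Supplier. Relationship labels: places, on, supplies, in, is for.]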

Page 6: Data warehousing

Wine Club Questions

Competitors have moved in. Is our market share falling?
What products are increasing/decreasing in popularity?
Which products are seasonal?
Which members place regular orders?
Are some products more popular in certain parts of the country?
Which members concentrate on particular products?

Page 7: Data warehousing

Strategic vs Operational Issues

Strategic*: planning and policy making, long term and broad brush, higher levels of management, e.g.
  When to launch a new product?
  What would be the effect of closing the Edinburgh branch?
Operational: day-to-day running of the business. Detailed and immediate, lower levels of management, e.g.
  Which items are out of stock?
  What is the status of order 34522?

*Here, ‘strategic’ is in the management context, not executive

Page 8: Data warehousing

Motivation for data warehousing

Operational data is not suitable to guide strategic decisions
  Some of the data is not relevant
  Data may be archived regularly once it is no longer regularly required
Need to examine trends
  What is happening over time?
  Queries over time may significantly slow down operational processing
Solution: record sales on a regular basis, separate from the operational system, and analyse them

This is the start of a warehouse

Page 9: Data warehousing

Data warehouse characteristics

Subject-oriented, e.g. sales
Non-volatile: no alteration to records once they are added
  Whereas in operational processing, records will frequently be updated (e.g. alteration to prices, quantity etc)
Integrated: data from multiple (operational) sources are accumulated in an integrated format
  E.g. wine club has >1 operational db

Time variant: data is recorded against time to allow trend analysis

Page 10: Data warehousing

Data warehouse characteristics continued

Records are extracted to make future querying easy. Therefore:
  There is likely to be some data duplication, including storage of derived data (data obtained from calculations and aggregations)
  There will be fewer joins and more indexes than in a well-designed operational database.

The data warehouse will be larger than the corresponding operational database

Data in operational databases will be archived periodically, whereas a data warehouse keeps data for years to allow trend analysis.

Page 11: Data warehousing

Warehouse construction

[Figure: warehouse construction diagram. Labels, in the order shown: Source 1 ... Source n, Extraction, Integration, DBMS, Aggregate Navigators, Presentation.]

We now look at each stage in warehouse construction:

Page 12: Data warehousing

Extraction

Retrieve data from all data sources: files, databases etc

The process to extract data will be an add-on to the existing operational system. For example,
  Day-end extraction run
  When a sale is recorded, this triggers extraction of the sale data

Page 13: Data warehousing

Integration

When data is extracted from different sources, integration may be required:
Format integration, similar to type mismatch
  Examples: gender recorded as ‘male’, ‘female’; as ‘M’, ‘F’; or as 0 and 1
Semantic integration: does a word have the same meaning in all the data being integrated?
  Example: a ‘sale’ means:
    order processing: order received
    stock control: extracted items from physical warehouse
    despatch: goods shipped
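
Format integration can be sketched in a few lines of Python. This is only an illustration, assuming three hypothetical sources that encode gender in the three ways listed on the slide. The target format ('M' / 'F'), the function name normalise_gender, and the reading of 0 as male and 1 as female are assumptions, not something the slides specify.

```python
# Hypothetical mapping: every source-specific gender code is translated to
# one warehouse format ('M' / 'F'). Keys are compared in lower case.
GENDER_CODES = {
    "male": "M", "female": "F",   # source that spells the word out
    "m": "M", "f": "F",           # source that uses single letters
    "0": "M", "1": "F",           # source that uses 0/1 (ASSUMED: 0 = male, 1 = female)
}


def normalise_gender(raw):
    """Return 'M' or 'F' for a raw source value, or None if it is unrecognised."""
    key = str(raw).strip().lower()
    return GENDER_CODES.get(key)


# Values as they might arrive from the different operational systems.
records = ["male", "F", 0, 1, "Female", "unknown"]
print([normalise_gender(r) for r in records])
```

Unrecognised values come back as None rather than being guessed, so they can be reported and fixed at the source. Semantic integration (what counts as a 'sale') cannot be solved by a lookup table like this; it needs an agreed business definition first.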

Page 14: Data warehousing

Data Warehouse design: dimensional analysis

Dimensional analysis is used to identify the requirements of the warehouse

What are the aspects of the data that are strategically important? e.g.
  Member
  Product (wine)
  Time (always)

We don’t know in advance exactly what the queries will be!

Page 15: Data warehousing

3 dimensions example

[Figure: data cube with three dimensions.
  PRODUCT: Macon, Chablis, Merlot, Chardonnay
  TIME: Q3 2007, Q4 2007, Q1 2008
  MEMBER: Smith, Jones, Bloggs]

Page 16: Data warehousing

Star Schema

A star schema is one of the simplest designs for a data warehouse.
  A central fact table, containing all the main information, is the centre of the star
  Smaller dimension tables, containing look-up information for attributes in the fact table, are at the points.

[Figure: star with the central fact table SALES at the centre and the dimension tables Wine, Member and Time at the points.]

Page 17: Data warehousing

Star Schema Design for DB

SALES (central fact table): winecode, membercode, timecode, quantity, cost
Wine: winecode, winename, vintage, description, price
Member: membercode, membername, memberAddress
Time: timecode, date, periodno, quarterno, year

Page 18: Data warehousing

Warehouse Database

Centre of star schema becomes a relation: the fact table (numeric facts and foreign keys)
  Sales(membercode, winecode, timecode, qty, itemcost)
Each dimension becomes a relation: a dimension table
  Member(membercode, membername, memberaddress)
  Wine(winecode, winename, vintage, description, price)
There is ALWAYS a time dimension table
  This includes period and quarter details, since they are frequently used in queries
  Time(timecode, date, periodno, quarterno, year)
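
The relations above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the column types, the choice of SQLite, and the sample rows are invented for illustration; only the table and column names come from the slide (with the wine name column written as winename).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to

# One dimension table per dimension, plus the central fact table.
conn.executescript("""
CREATE TABLE member (
    membercode    INTEGER PRIMARY KEY,
    membername    TEXT,
    memberaddress TEXT
);
CREATE TABLE wine (
    winecode    INTEGER PRIMARY KEY,
    winename    TEXT,
    vintage     INTEGER,
    description TEXT,
    price       REAL
);
CREATE TABLE time (
    timecode  INTEGER PRIMARY KEY,
    date      TEXT,
    periodno  INTEGER,
    quarterno INTEGER,
    year      INTEGER
);
-- Fact table: foreign keys to each dimension plus the numeric facts.
CREATE TABLE sales (
    membercode INTEGER REFERENCES member(membercode),
    winecode   INTEGER REFERENCES wine(winecode),
    timecode   INTEGER REFERENCES time(timecode),
    qty        INTEGER,
    itemcost   REAL
);
""")

# Made-up sample rows, one per table.
conn.execute("INSERT INTO member VALUES (1, 'Smith', 'Edinburgh')")
conn.execute("INSERT INTO wine VALUES (10, 'Chablis', 2006, 'Dry white', 12.50)")
conn.execute("INSERT INTO time VALUES (100, '2007-10-01', 10, 4, 2007)")
conn.execute("INSERT INTO sales VALUES (1, 10, 100, 6, 75.00)")

# A fact row that points at a wine that does not exist should be rejected.
try:
    conn.execute("INSERT INTO sales VALUES (1, 999, 100, 1, 9.99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

Declaring the foreign keys is a design choice: it keeps every fact row attached to real dimension rows, at the cost of some checking during loading.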

Page 19: Data warehousing

Using the Warehouse

The strategic questions can now be investigated using data extracted by SQL queries

For example, to discover which wines have increasing and decreasing sales, we can retrieve a table giving the total sales for each wine against time:

  SELECT w.winename, t.periodno, SUM(s.qty)
  FROM sales s, wine w, time t
  WHERE s.winecode = w.winecode AND s.timecode = t.timecode
  GROUP BY w.winename, t.periodno
  ORDER BY w.winename, t.periodno
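
To see the query in action, the sketch below runs it against a tiny in-memory SQLite database and then labels each wine as increasing or decreasing. The cut-down tables, the sample quantities, and the rule of comparing the first period with the last are all assumptions made for illustration; a real analysis would more likely plot the series or fit a trend line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down versions of the warehouse tables: only the columns the query uses.
CREATE TABLE wine  (winecode INTEGER PRIMARY KEY, winename TEXT);
CREATE TABLE time  (timecode INTEGER PRIMARY KEY, periodno INTEGER);
CREATE TABLE sales (winecode INTEGER, timecode INTEGER, qty INTEGER);
INSERT INTO wine VALUES (1, 'Chablis'), (2, 'Merlot');
INSERT INTO time VALUES (1, 1), (2, 2), (3, 3);
-- Made-up quantities: Chablis rises, Merlot falls.
INSERT INTO sales VALUES (1, 1, 5), (1, 2, 8), (1, 3, 12),
                         (2, 1, 9), (2, 2, 6), (2, 3, 2), (2, 3, 1);
""")

rows = conn.execute("""
    SELECT w.winename, t.periodno, SUM(s.qty)
    FROM sales s, wine w, time t
    WHERE s.winecode = w.winecode AND s.timecode = t.timecode
    GROUP BY w.winename, t.periodno
    ORDER BY w.winename, t.periodno
""").fetchall()

# Collect each wine's totals in period order (the query already sorts them).
totals = {}
for winename, periodno, qty in rows:
    totals.setdefault(winename, []).append(qty)


def direction(series):
    """Crude trend label: compare the last period's total with the first."""
    if series[-1] > series[0]:
        return "increasing"
    if series[-1] < series[0]:
        return "decreasing"
    return "flat"


trend = {name: direction(series) for name, series in totals.items()}
print(trend)
```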

Page 20: Data warehousing

Indexes

Usually a lot of indexes will be created, to make queries more efficient
  An index helps speed up retrieval.
  A column that is frequently referred to in the WHERE clause is a potential candidate for indexing.
Diagrams of the 2 most commonly used indexes in data warehousing are shown on the next slides:
  Indexes may be based on the B-Tree
  Also bitmap indexes are widely used
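
As a small illustration of indexing a WHERE-clause column, the sketch below creates an index on sales.winecode in SQLite and asks the database how it plans to run a query. The index name idx_sales_winecode and the choice of winecode as the frequently filtered column are assumptions. SQLite's ordinary indexes are B-tree based and it has no bitmap index type, so this only demonstrates the first of the two kinds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (membercode INTEGER, winecode INTEGER, "
    "timecode INTEGER, qty INTEGER, itemcost REAL)"
)

# winecode is assumed to appear often in WHERE clauses, so it gets an index.
conn.execute("CREATE INDEX idx_sales_winecode ON sales (winecode)")

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query;
# the last column of each row is a human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(qty) FROM sales WHERE winecode = 10"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The exact wording of the plan differs between SQLite versions, but it should name the index rather than describe a full scan of the table.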

Page 21: Data warehousing

B-tree index

Page 22: Data warehousing

Bitmap indexes

An example is on the next slide
For each value of a domain, there is a bitmap identifying the row IDs of satisfying tuples
  1 if a match, 0 otherwise
Usually applied to attributes with a sparse domain (few distinct values)
  In Oracle, <100 distinct values
  E.g. bitmaps for all tuples with sex = male and for sex = female
Updating a bitmap takes a lot of time, so use for tables with hardly any updates, inserts, deletes
  Ideal for data warehousing

Page 23: Data warehousing

Bitmap indexes example

The first table is a table about Sailors
The second table shows a bitmap index for the rating attribute, assuming values are only from 1-3
  There is a row in the bitmap index for each row in the Sailors table
  Column headings in the index are the values in the rating column

SAILORS
  Id   Rating   etc
  22   1        Other data
  23   2        Other data
  31   3        Other data
  35   1        Other data

Bitmap index
  1   2   3
  1   0   0
  0   1   0
  0   0   1
  1   0   0
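
The bitmap index in the example can be rebuilt in a few lines of Python. This is a teaching sketch, not how a DBMS stores bitmaps internally (real systems pack and compress the bits); the function names build_bitmap_index and ids_matching are invented here.

```python
# Rows of the Sailors table from the slide: (id, rating).
sailors = [(22, 1), (23, 2), (31, 3), (35, 1)]


def build_bitmap_index(rows, values):
    """Map each value to a list of bits, one per row: 1 if the row has that value."""
    return {v: [1 if rating == v else 0 for _, rating in rows] for v in values}


def ids_matching(rows, bitmap):
    """Return the ids of the rows whose bit is set in the given bitmap."""
    return [row_id for (row_id, _), bit in zip(rows, bitmap) if bit]


index = build_bitmap_index(sailors, values=(1, 2, 3))
# index[1] is the column headed '1' in the slide's bitmap index, read top to bottom.
print(index[1])  # [1, 0, 0, 1]

# "rating = 1 OR rating = 3" is answered by OR-ing two bitmaps bit by bit,
# without looking at the Sailors rows until the final id lookup.
either = [a | b for a, b in zip(index[1], index[3])]
print(ids_matching(sailors, either))  # [22, 31, 35]
```

Combining conditions with bitwise AND/OR is what makes bitmaps attractive for query-heavy, rarely updated tables: inserting or changing a row would mean touching the bitmap of every value.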

Page 24: Data warehousing

Materialised views and Aggregation

Data warehouses grow continuously, and may become very large indeed

Problems: the time to compute a query and the size of the result can be very large indeed

Solution: materialised views and aggregation

A materialised view is a stored pre-computed table, used to prevent frequent use of time-consuming joins and calculations

Page 25: Data warehousing

Aggregates

Basic idea: sacrifice detail to reduce the size of the data

Store precomputed tables at a useful level of detail, consisting of commonly used sums, counts etc.

Must be carefully selected, e.g.
  Sales to each member of each wine summed for each quarter
  Sales of each wine summed for each quarter, for each month
Levels of aggregation
  None (i.e. detail)
  Light (e.g. monthly)
  Highly (e.g. quarterly)
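
A pre-computed aggregate can be sketched with SQLite as below. SQLite has no materialised view feature, so the sketch stores the result of the aggregating query in an ordinary table using CREATE TABLE ... AS SELECT; that table would have to be rebuilt or topped up after each load. The table name sales_by_wine_quarter and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down detail tables with made-up rows.
CREATE TABLE time  (timecode INTEGER PRIMARY KEY, quarterno INTEGER, year INTEGER);
CREATE TABLE sales (winecode INTEGER, membercode INTEGER, timecode INTEGER, qty INTEGER);
INSERT INTO time  VALUES (1, 3, 2007), (2, 3, 2007), (3, 4, 2007);
INSERT INTO sales VALUES (10, 1, 1, 6), (10, 2, 2, 4), (10, 1, 3, 3), (20, 2, 3, 12);

-- The aggregate: one row per wine per quarter instead of one row per sale.
CREATE TABLE sales_by_wine_quarter AS
    SELECT s.winecode, t.year, t.quarterno,
           SUM(s.qty) AS total_qty, COUNT(*) AS num_sales
    FROM sales s JOIN time t ON s.timecode = t.timecode
    GROUP BY s.winecode, t.year, t.quarterno;
""")

# Later queries read the small aggregate table and need no join at all.
summary = conn.execute(
    "SELECT winecode, year, quarterno, total_qty, num_sales "
    "FROM sales_by_wine_quarter ORDER BY winecode, year, quarterno"
).fetchall()
print(summary)
```

The member detail is gone from the aggregate, which is exactly the trade described above: a question about individual members still has to go back to the detail table.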

Page 26: Data warehousing

Aggregate navigator

An aggregate navigator uses information about available aggregates to automatically rewrite queries to use them

It also records aggregate usage, so that unused aggregates can be removed

It can suggest useful new aggregates
  E.g. a frequent query is based on the number of wines sold per month in a range of price bands. This is suggested as a new aggregate
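
The core decision an aggregate navigator makes can be sketched as follows. This is a toy model under stated assumptions: the catalogue, its table names, column sets and row counts are all invented, and the sketch only picks a table and counts its use; it does not rewrite SQL or suggest new aggregates as a real navigator would.

```python
# Hypothetical catalogue: table name -> (columns a query may group or filter on, row count).
# For the detail table, month and quarter are assumed reachable via the time dimension.
CATALOGUE = {
    "sales":                 ({"winecode", "membercode", "month", "quarter"}, 750_000),
    "sales_by_wine_month":   ({"winecode", "month", "quarter"}, 24_000),
    "sales_by_wine_quarter": ({"winecode", "quarter"}, 8_000),
}
DETAIL_TABLE = "sales"
usage = {name: 0 for name in CATALOGUE}


def choose_table(needed_columns):
    """Return the smallest table that has every needed column, recording the use."""
    needed = set(needed_columns)
    candidates = [(rows, name) for name, (cols, rows) in CATALOGUE.items() if needed <= cols]
    # Fall back to the detail table if nothing in the catalogue covers the query.
    name = min(candidates)[1] if candidates else DETAIL_TABLE
    usage[name] += 1
    return name


def unused_aggregates():
    """Aggregates that no query has used so far: candidates for removal."""
    return [name for name, count in usage.items() if count == 0 and name != DETAIL_TABLE]


print(choose_table({"winecode", "quarter"}))  # sales_by_wine_quarter
print(unused_aggregates())                    # ['sales_by_wine_month']
```

A query that needs membercode falls through to the detail table, because neither aggregate kept that column.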

Page 27: Data warehousing

Presentation requirements

Must be easy to use
Visualise the results of queries in many ways, e.g. charts, graphs, scatter diagrams etc
Make good use of colour and dimensions: 2D, 2.5D, 3D, animation

[Figure: example of 2.5D graph]

Have analysis tools: statistical and curve fitting

For example the product sales trend table would be plotted as a graph

Page 28: Data warehousing

OLAP

Online Analytical Processing (OLAP) uses multidimensional analysis of the data

Allows users to get summaries and find answers to known questions
  What is the average profit month by month?
  If we increased sales by 10%, what would the effect be?

Page 29: Data warehousing

Data mining

Data mining is the extraction of hidden predictive information from large databases
  E.g. what’s likely to happen to sales next March and why?
The actual techniques for data mining are not covered in this course.
Data mining is usually based on the data in a data warehouse, and ideally data mining tools are integrated with the data warehouse.

Data Mining provides the Enterprise with intelligence and Data Warehousing provides the Enterprise with a memory.

Page 30: Data warehousing

Summary

A data warehouse is a specialised database to enable efficient and straightforward production of reports to support strategic decision making.

It contains a copy of the operational data, often integrated from >1 source. Records, once added, are not altered. The central fact table in a star schema design will be very large.

Page 31: Data warehousing

Discussion/Exercise

A company sells garden trees from several stores located around the country. People visit the store, and buy trees. The names of the customers are always recorded, and many customers place repeat orders.

The company would like to set up a data warehouse so that they can analyse details such as
  Frequency of sales per customer
  Which store has the best sales, ranked by month
  Top selling tree by month
  Etc.

Create a suitable star schema, inventing appropriate attributes